[ RMLL 2013, Bruxelles – Thursday 11th July 2013 ]
Presenter: Dr Ir Robert Viseur

Presentation of OpenNLP




What is OpenNLP?

• Toolkit for processing natural language text.
• Project of the Apache Software Foundation.
• Developed in Java.
• Released under the Apache License, Version 2.0.
• Download and documentation: http://opennlp.apache.org/.


What are the features?

• Support for common NLP tasks:
  – tokenization,
  – sentence segmentation,
  – part-of-speech tagging,
  – named entity extraction,
  – chunking.
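To make the first two tasks concrete, here is a minimal Python sketch based on regular expressions only. It is not the OpenNLP API and not how OpenNLP works internally (OpenNLP relies on trained models); the function names `split_sentences` and `tokenize` are illustrative assumptions.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "OpenNLP is a toolkit. It is written in Java!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Such rules break quickly (an abbreviation like "Dr." would be cut as a sentence end), which is one reason to prefer trained models.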


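What is part-of-speech tagging?

• Assigns a grammatical category (noun, verb, adjective, ...) to each token.
• Example: [Figure: part-of-speech tagging example]

• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.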
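As an illustration, the OpenNLP command-line tagger prints each token followed by an underscore and its tag. The Python sketch below parses a line in that `token_TAG` style; the sample line (Penn Treebank tags, modeled on the manual) and the helper name `parse_tagged` are assumptions made for this example.

```python
def parse_tagged(line):
    """Split 'token_TAG token_TAG ...' into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last underscore, so tokens that
        # themselves contain underscores stay intact
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs

tagged = "Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ._."
pairs = parse_tagged(tagged)
# NNP is the Penn Treebank tag for a singular proper noun
proper_nouns = [token for token, tag in pairs if tag == "NNP"]
```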


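What is named entity extraction?

• Detects names of people, places, organizations, etc. in the text.
• Example: [Figure: named entity extraction example]

• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.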
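As an illustration, the OpenNLP name finder marks entities inline with `<START:type> ... <END>` markers. The Python sketch below pulls the marked spans out of such a line; the sample line (modeled on the manual) and the helper name `extract_entities` are assumptions made for this example.

```python
import re

# One marked span: <START:type> entity text <END>
NE_PATTERN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")

def extract_entities(marked_text):
    """Return (type, text) pairs for every <START:type> ... <END> span."""
    return [(m.group(1), m.group(2)) for m in NE_PATTERN.finditer(marked_text)]

marked = "<START:person> Pierre Vinken <END> , 61 years old , will join the board ."
entities = extract_entities(marked)
persons = [text for kind, text in entities if kind == "person"]
```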


How does it work? (1/2)

• Each feature relies on a pre-trained model.
• Each pre-trained model is built for one language and one type of use.
• Supported languages: da, de, en, es, nl, pt, se.
• Warnings:
  – Functional coverage varies between languages.
  – French is not supported!
• See http://opennlp.sourceforge.net/models-1.5/.
• Use from the command line or as a Java library.
• Warning: with the CLI, loading the models takes time.


How does it work? (2/2)

• Example (English vs. Spanish): [Figure: English vs. Spanish example]


What are the selection criteria?

• Product support.
• License.
• Available languages.
• Precision / recall.
• Text processing speed.
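For the precision / recall criterion, a minimal Python sketch of how the two measures can be computed for an extraction task: precision is the share of extracted items that are correct, recall the share of expected items that were found. The sample entity sets are made up for the illustration and are not measurements of any tool.

```python
def precision_recall(predicted, gold):
    """Compare extracted items with a reference ("gold") set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical data: 4 expected persons, 3 extracted, 2 of them correct
gold = {"Pierre Vinken", "Ada Lovelace", "Alan Turing", "Grace Hopper"}
predicted = {"Pierre Vinken", "Ada Lovelace", "Brussels"}
p, r = precision_recall(predicted, gold)
```

Here precision is 2/3 and recall is 2/4: the tool is right two times out of three, but misses half of the expected names.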


Are there free (as in freedom) alternative tools?

• Other lightweight tools:
  – Stanford Log-linear Part-Of-Speech Tagger (POST),
  – Stanford Named Entity Recognizer (NER),
  – TagEN,
  – Java Automatic Term Extraction toolkit.
• Frameworks:
  – In Java: UIMA, GATE.
  – In other languages: NLTK (Python).


Example: tag cloud creation (1/6)

• Starting point: a website.
• Example: www.adacore.com.
• What we want (from the website content):
  – common tag cloud,
  – circular tag cloud.
• Main steps: crawl, cleaning of the HTML documents, extraction of named entities (persons) and of terminology (then merge), and display (tag cloud).


Example: tag cloud creation (2/6)

• Cleaning:
  – Remove the HTML tags and keep only the useful content.
  – Warnings:
    – NLP tools are sensitive to noise in raw data.
    – Pay attention to the language of the document.
  – Use an HTML boilerplate removal tool (HTML -> TXT).
  – Tool: Boilerpipe.
  – See http://code.google.com/p/boilerpipe/.
• Next: normalization of the text.
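The cleaning and normalization steps can be sketched in Python with the standard library alone. This is not Boilerpipe, which uses more elaborate heuristics to detect boilerplate: the sketch simply skips a fixed list of tags. The skipped tag list, the class name `TextExtractor` and the normalization choices (Unicode NFC, whitespace collapsing) are assumptions, since the slides do not detail them.

```python
import re
import unicodedata
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, skipping the content of tags that rarely hold useful text."""
    SKIPPED = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

html = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<nav>Home | About</nav><p>OpenNLP is a   toolkit.</p>"
    "<script>var x = 1;</script></body></html>"
)
clean = normalize(html_to_text(html))
```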


Example: tag cloud creation (3/6)

• Named entity extraction.
  – Standard behavior in OpenNLP: it adds tags in the text.
  – Here: extraction of Person named entities.
• Terminology extraction.
  – First: part-of-speech tagging (POST).
  – Next: identification and filtering (threshold) of:
    – collocations (e.g. Noun_Noun, Adjective_Noun, ...),
    – proper names (often brands or people).
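The terminology extraction step can be sketched in Python as follows: count adjacent noun-noun and adjective-noun pairs in the POS-tagged text and keep those that reach a frequency threshold. The Penn Treebank tagset, the threshold value and the function name `extract_collocations` are assumptions; the slides do not give the exact patterns or threshold used.

```python
from collections import Counter

# Coarse tag pairs to keep: adjective + noun, noun + noun.
# Assumes Penn Treebank tags, where NN* are nouns and JJ* are adjectives.
PATTERNS = {("JJ", "NN"), ("NN", "NN")}

def coarse(tag):
    """Reduce a tag to its two-letter family (NNS -> NN, JJR -> JJ)."""
    return tag[:2]

def extract_collocations(tagged_tokens, threshold=2):
    """Count matching adjacent pairs and keep those seen at least `threshold` times."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (coarse(t1), coarse(t2)) in PATTERNS:
            counts[f"{w1.lower()} {w2.lower()}"] += 1
    return {term: n for term, n in counts.items() if n >= threshold}

# Hypothetical tagged text (two short sentences)
tagged = [
    ("Free", "JJ"), ("software", "NN"), ("is", "VBZ"), ("useful", "JJ"), (".", "."),
    ("Tag", "NN"), ("clouds", "NNS"), ("summarize", "VBP"), ("free", "JJ"),
    ("software", "NN"), ("projects", "NNS"), (".", "."),
]
terms = extract_collocations(tagged, threshold=2)
```

With a threshold of 2, only "free software" survives; "tag clouds" and "software projects" each occur once and are filtered out.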


Example: tag cloud creation (4/6)

• Process: [Figure: processing pipeline]
  – Website (Internet) -> Crawl -> Website (local).
  – Raw HTML document -> Conversion to text -> Normalization.
  – Normalized text -> NE extraction -> Tags.
  – Normalized text -> POS tagging -> Terminology extraction -> Tags.
  – Tags -> Merge -> Tag cloud (for a website).
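The final merge and display steps might look like the Python sketch below: the tags from both extractions are merged by summing their counts, then each count is mapped to a font size. The linear scaling, the size range and the sample counts are assumptions; the slides do not say how the tag weights were computed.

```python
from collections import Counter

def merge_tags(person_counts, term_counts):
    """Merge the two tag sources, summing the counts of identical tags."""
    return Counter(person_counts) + Counter(term_counts)

def font_sizes(tag_counts, min_size=10, max_size=32):
    """Map counts linearly onto a font-size range (expects at least one tag)."""
    low, high = min(tag_counts.values()), max(tag_counts.values())
    span = high - low or 1  # avoid division by zero when all counts are equal
    return {
        tag: round(min_size + (count - low) * (max_size - min_size) / span)
        for tag, count in tag_counts.items()
    }

# Hypothetical counts from the two extraction steps
persons = {"Ada Lovelace": 3}
terms = {"free software": 5, "tag cloud": 1}
tags = merge_tags(persons, terms)
sizes = font_sizes(tags)
```

The most frequent tag gets the largest size and the rarest the smallest; a real tag cloud would then lay the tags out, in rows or in a circle.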


Example: tag cloud creation (5/6)

• Result: common tag cloud.


Example: tag cloud creation (6/6)

• Result: circular tag cloud.


Thanks for your attention.

Any questions?


Contact

Dr Ir Robert Viseur

Email (@CETIC): [email protected]
Email (@UMONS): [email protected]

Phone: 0032 (0) 479 66 08 76
Website: www.robertviseur.be

This presentation is covered by the « CC-BY-ND » license.