[ RMLL 2013, Bruxelles – Thursday 11th July 2013 ]
Presentation of OpenNLP
Presenter: Dr Ir Robert Viseur
What is OpenNLP?
• Toolkit for processing natural language text.
• Project of the Apache Software Foundation.
• Developed in Java.
• Released under the Apache License, Version 2.0.
• Download and documentation: http://opennlp.apache.org/.
What are the features?
• Support for common NLP tasks:
  • tokenization,
  • sentence segmentation,
  • part-of-speech tagging,
  • named entity extraction,
  • chunking.
What is part-of-speech tagging?
• Example: [figure: part-of-speech tagging example]
• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
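To make the idea concrete, here is a minimal sketch of how tagger output could be read back. It assumes the `token_TAG` text format with Penn Treebank tags, as shown in the OpenNLP manual for the English model; the sample sentence and the helper name `parse_tagged` are illustrative, not part of the original slides.

```python
def parse_tagged(line):
    """Split whitespace-separated 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST underscore, so ',_,' gives (',', ',')
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


# Assumed sample of tagger output (Penn Treebank tag set).
tagged = "Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ"
pairs = parse_tagged(tagged)

# Keep every token whose tag starts with NN (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in pairs if tag.startswith("NN")]
print(pairs)
print(nouns)
```

Each token keeps its grammatical category, which is what later steps (such as terminology extraction) filter on.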
What is named entity extraction?
• Example: [figure: named entity extraction example]
• See more: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html.
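As an illustration, the sketch below pulls entities out of text annotated in the `<START:type> ... <END>` style documented in the OpenNLP manual. The sample text, the regular expression and the helper name `extract_entities` are assumptions made for this example.

```python
import re

# Matches '<START:person> Pierre Vinken <END>' and captures type and name.
NE_PATTERN = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def extract_entities(text, entity_type):
    """Return the names annotated with the given entity type, in order."""
    return [m.group(2) for m in NE_PATTERN.finditer(text)
            if m.group(1) == entity_type]


# Assumed sample of annotated text.
sample = ("<START:person> Pierre Vinken <END> , 61 years old , will join "
          "the board . Mr . <START:person> Vinken <END> is chairman .")
persons = extract_entities(sample, "person")
print(persons)
```

The same name can appear in several forms ("Pierre Vinken", "Vinken"); grouping such variants is left out of this sketch.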
How does it work? (1/2)
• Each feature relies on pre-trained models.
• Each pre-trained model is built for one language and one type of use.
• Supported languages: da, de, en, es, nl, pt, se.
• Warnings:
  – The functional coverage varies from one language to another.
  – French is not supported!
• See http://opennlp.sourceforge.net/models-1.5/.
• Usable from the command line or as a Java library.
• Warning: loading the models takes time when using the CLI.
How does it work? (2/2)
• Example (English vs. Spanish): [figure]
What are the selection criteria?
• Product support.
• License.
• Available languages.
• Precision / recall.
• Text processing speed.
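Precision and recall can be checked on a small hand-annotated sample before choosing a tool. The sketch below uses the usual definitions (precision = correct results / returned results, recall = correct results / expected results); the names in the two sets are made up for the example.

```python
def precision_recall(predicted, expected):
    """Compute precision and recall of a set of extracted items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical gold standard and tool output for person names.
expected = {"Pierre Vinken", "Ada Lovelace", "Alan Turing", "Grace Hopper"}
predicted = {"Pierre Vinken", "Ada Lovelace", "Brussels"}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three returned names are correct, and two of the four expected names are found.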
Are there free (as in freedom) alternative tools?
• Other lightweight tools:
  • Stanford Log-linear Part-Of-Speech Tagger (POST),
  • Stanford Named Entity Recognizer (NER),
  • TagEN,
  • Java Automatic Term Extraction toolkit.
• Frameworks:
  • In Java: UIMA, GATE.
  • In other languages: NLTK (Python).
Example: tag cloud creation (1/6)
• Starting point: a website.
• Example: www.adacore.com.
• What we want (from the website content):
  • common tag cloud,
  • circular tag cloud.
• Main steps: crawl, cleaning of HTML documents, named entity (person) and terminology extraction (+ merge), and display (tag cloud).
Example: tag cloud creation (2/6)
• Cleaning:
  • Remove the HTML tags and keep only the useful content.
• Warnings:
  • NLP tools are sensitive to noise in raw data.
  • Pay attention to the language of the document.
• Use of an HTML boilerplate removal tool (HTML -> TXT).
  • Tool: Boilerpipe.
  • See http://code.google.com/p/boilerpipe/.
• Next: normalization of the text.
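The HTML -> TXT step can be approximated with the Python standard library. This sketch is not Boilerpipe and does no boilerplate detection (menus and footers are kept); it only strips tags, skips `script` and `style` content, and normalizes whitespace. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML document with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = ("<html><head><style>p {color: red}</style></head><body>"
        "<h1>OpenNLP</h1><p>A toolkit for <b>natural language</b> text.</p>"
        "<script>var x = 1;</script></body></html>")
text = html_to_text(page)
print(text)
```

A dedicated tool remains preferable on real websites, since leftover navigation text is exactly the kind of noise the NLP steps are sensitive to.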
Example: tag cloud creation (3/6)
• Named entity extraction.
  • Standard in OpenNLP: OpenNLP adds tags in the text.
  • Here: extraction of Person named entities.
• Terminology extraction.
  • First: part-of-speech tagging (POST).
  • Next: identification and filtering (threshold) of:
    • collocations (e.g. Noun_Noun, Adjective_Noun, ...),
    • proper nouns (often brands or people).
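The collocation filter can be sketched as follows, assuming the text has already been turned into (token, tag) pairs with Penn Treebank tags. The pattern set, the `threshold` default and the sample tokens are assumptions for illustration; the slides do not give the exact rules used.

```python
from collections import Counter

# Tag patterns kept as candidate terms: Noun_Noun and Adjective_Noun.
PATTERNS = {("NN", "NN"), ("JJ", "NN")}


def coarse(tag):
    """Reduce a Penn Treebank tag to its two-letter family (NNS -> NN)."""
    return tag[:2]


def extract_collocations(tagged_tokens, threshold=2):
    """Count adjacent pairs matching PATTERNS; keep those seen >= threshold."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (coarse(t1), coarse(t2)) in PATTERNS:
            counts[f"{w1.lower()} {w2.lower()}"] += 1
    return {term: n for term, n in counts.items() if n >= threshold}


# Hypothetical tagged tokens for two short sentences.
tagged_tokens = [
    ("The", "DT"), ("tag", "NN"), ("cloud", "NN"), ("shows", "VBZ"),
    ("named", "JJ"), ("entities", "NNS"), (".", "."),
    ("A", "DT"), ("tag", "NN"), ("cloud", "NN"), ("needs", "VBZ"),
    ("clean", "JJ"), ("text", "NN"), (".", "."),
]
terms = extract_collocations(tagged_tokens, threshold=2)
print(terms)
```

With the threshold at 2, only the pair seen twice survives; lowering it to 1 keeps every matching pair. Proper nouns could be filtered the same way on the NNP tag.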
Example: tag cloud creation (4/6)
• Process:
[Diagram: Website (Internet) -> Crawl -> Website (local) -> Raw HTML document -> Conversion to text -> Normalization -> NE extraction / POS tagging then Terminology extraction -> Merge -> Tags -> Tag cloud (for a website)]
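The final merge and display steps can be sketched as below. How the original tool weights and renders tags is not stated in the slides, so the linear scaling of font sizes, the size bounds and the sample counts are all assumptions.

```python
from collections import Counter


def merge_tags(*sources):
    """Merge several {tag: count} mappings, summing counts of shared tags."""
    merged = Counter()
    for source in sources:
        merged.update(source)
    return merged


def font_sizes(tags, min_size=10, max_size=40):
    """Map each tag count linearly onto a font size range (assumed scheme)."""
    if not tags:
        return {}
    low, high = min(tags.values()), max(tags.values())
    span = high - low
    sizes = {}
    for tag, count in tags.items():
        ratio = (count - low) / span if span else 1.0
        sizes[tag] = round(min_size + ratio * (max_size - min_size))
    return sizes


# Hypothetical outputs of the NE extraction and terminology extraction steps.
persons = {"pierre vinken": 3}
terms = {"tag cloud": 5, "named entity": 1}

tags = merge_tags(persons, terms)
sizes = font_sizes(tags)
print(sizes)
```

The resulting sizes can then feed any rendering layer, whether a common or a circular tag cloud.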
Example: tag cloud creation (5/6)
• Result: common tag cloud.
Example: tag cloud creation (6/6)
• Result: circular tag cloud.
Thanks for your attention.
Any questions?
Contact
Dr Ir Robert Viseur
Email (@CETIC): [email protected]
Email (@UMONS): [email protected]
Phone: 0032 (0) 479 66 08 76
Website: www.robertviseur.be
This presentation is released under the « CC-BY-ND » license.