FNERC OVERVIEW 05/12/2002


Lingway, 05-06 of December 2002

FNERC : introduction

Lingway entered the project while CDC had already worked on FNERC

Decision to use own tools : XTIRP Extraction Tool

System Available at : http://hugo.lingway.com/CmBin/cmCgi.exe?_rule=CrossmarcD1&_url=<XHTML_FILE>


FNERC Version 2.0

• FNERC System Description
  Architecture
  NE / TIMEX, NUMEX / TERM Annotation
  Name Matching and Normalisation
• Evaluation
  Ellogon
  Measures
• FNERC Future Developments
  Ontology Matters


FNERC System Description : architecture (1)

[Architecture diagram. Labels : Documents to process (TXT, HTML, XML) ; XTIRP To XML ; XTIRP Structure ; XTIRP Semantic Content ; Documents in XML format ; XTIRP Tokenizer ; Tok. Rules ; XHTML Pages ; FNERC Module ; XTIRP NE Annotator ; RE Rules ; Ontology ; Name Matching ; Normalisation ; XSLT ; Annotated XHTML Pages]


FNERC System Description : architecture (2)

XTIRP To_XML module : ensures that the input is XML-conformant
• If it is, the module processes the input into a tree structure with all tags kept
• If it is not, the module applies a tidy-like step to create an XML-conformant structure and processes it into a tree structure
• CROSSMARC : this module is normally not used, as the inputs of FNERC are XHTML files

XTIRP Tokenizer : splits the input text into sequences corresponding either to
• Logical structure (such as a sentence, a paragraph, a section etc.)
• Strong tags (such as td, p, br etc.)
• CROSSMARC : for the first domain, we decided to keep the tag splitting

XTIRP NE / NUMEX / TIMEX Annotator :
• Set of Regular Expression rules that identify patterns and add annotation (tags and attributes) to recognized sequences
• Use of Ontology and Lexicon

Name Matching / Normalisation :
• Name matching : match co-referential NE, NUMEX and TIMEX
• Normalisation : normalise the NE, NUMEX and TIMEX when filling the slots
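The first two stages described on this slide can be sketched in a few lines of Python. This is only an illustration, not XTIRP code: the function names are hypothetical, the standard-library XML parser stands in for the conformance check, the tidy-like repair step is left out, and the tag list is limited to the three strong tags named on the slide.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Strong tags named on the slide; the real list may be longer.
STRONG_TAGS = ("td", "p", "br")
STRONG_TAG_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(STRONG_TAGS), re.IGNORECASE
)
ANY_TAG_RE = re.compile(r"<[^>]+>")


def is_xml_conformant(markup: str) -> bool:
    """Return True when the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def split_on_strong_tags(markup: str) -> List[str]:
    """Split markup at strong tags, keeping inline tags inside each segment."""
    segments = []
    for piece in STRONG_TAG_RE.split(markup):
        # Drop pieces that contain only markup and no text.
        if ANY_TAG_RE.sub("", piece).strip():
            segments.append(piece.strip())
    return segments


page = "<table><tr><td>Intel <b>Pentium III</b></td><td>256 Mo</td></tr></table>"
if is_xml_conformant(page):
    print(split_on_strong_tags(page))
```

Each table cell becomes one sequence, so the annotator rules are applied cell by cell rather than across cell boundaries.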


FNERC System Description : Annotator (1)

Rule Format

    [AVAILABILITY]
    RegularExpression = '[0-9\/]+ *(heures|h\.|jours|j\.|mois)'
    Tag_1 = "timex4(OA-d0e2145,OF-d0e2143, OV-d0e2141, DURATION)"

Where :
• First Line : Rule Title
• Second Line : Perl-like Regular Expression for what is to be annotated
• Third Line : Action(s) to be taken. Refers to a general action sequence, here timex4, defined under Actions.

Actions

    [Timex4]
    Name = "TIMEX"
    Attributes = "Feature=@1 Attribute=@2 Value=@3 Type=@4"
    POSITION = MATCH

Where :
• Second Line : the tag name (Name="TIMEX")
• Third Line : the attributes of this tag (the @1, @2, @3, @4 variables correspond to the values given in the rule)
• Fourth Line : the position of the tag
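To make the rule format concrete, here is a minimal Python sketch of how the AVAILABILITY rule and the Timex4 action might combine. It is not the XTIRP engine: the dictionary layout and function names are hypothetical, and it assumes that POSITION = MATCH means the tag wraps the matched text.

```python
import re

# The rule and action from the slide, transcribed into hypothetical dicts.
RULE = {
    "title": "AVAILABILITY",
    "regex": r"[0-9\/]+ *(heures|h\.|jours|j\.|mois)",
    "action": "timex4",
    "args": ["OA-d0e2145", "OF-d0e2143", "OV-d0e2141", "DURATION"],
}
ACTIONS = {
    "timex4": {
        "name": "TIMEX",
        "attributes": "Feature=@1 Attribute=@2 Value=@3 Type=@4",
    },
}


def render_attributes(template, args):
    """Replace @1..@n in the attribute template with the rule's values."""
    parts = []
    for pair in template.split():
        key, placeholder = pair.split("=")
        index = int(placeholder.lstrip("@")) - 1
        parts.append(f'{key}="{args[index]}"')
    return " ".join(parts)


def annotate(text, rule=RULE, actions=ACTIONS):
    """Wrap every match of the rule's regex in the action's tag."""
    action = actions[rule["action"]]
    tag = action["name"]
    attrs = render_attributes(action["attributes"], rule["args"])

    def wrap(match):
        return f"<{tag} {attrs}>{match.group(0)}</{tag}>"

    return re.sub(rule["regex"], wrap, text)


print(annotate("Livraison sous 48 heures"))
```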


FNERC System Description : Annotator (2)

Automatic Generation of Rules from Ontology and Lexicon Nodes information

• Ontology : Identifiers

• Lexicon : Regular Expressions

• XSLT Stylesheet to generate the Rule File

• Manual checking of the Rules in the generated Rule File (corrections, addition of generic rules)

• Currently 194 Rules
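The generation step can be pictured as a join between ontology nodes (identifiers) and lexicon entries (regular expressions). The slide says this is done with an XSLT stylesheet; the Python below is only a stand-in for illustration, and the dictionary layout, the function name and the choice of the timex4 action are assumptions.

```python
# Hypothetical in-memory views of the two resources.
ONTOLOGY = {
    "AVAILABILITY": {
        "action": "timex4",
        "ids": ["OA-d0e2145", "OF-d0e2143", "OV-d0e2141"],
        "type": "DURATION",
    },
}
LEXICON = {
    "AVAILABILITY": r"[0-9\/]+ *(heures|h\.|jours|j\.|mois)",
}


def generate_rule_file(ontology, lexicon):
    """Emit one rule block per ontology node that has a lexicon regex."""
    blocks = []
    for node, info in ontology.items():
        regex = lexicon.get(node)
        if regex is None:
            continue  # no surface pattern known for this node
        action = info["action"]
        args = ", ".join(info["ids"] + [info["type"]])
        blocks.append("\n".join([
            f"[{node}]",
            f"RegularExpression = '{regex}'",
            f'Tag_1 = "{action}({args})"',
        ]))
    return "\n\n".join(blocks)


print(generate_rule_file(ONTOLOGY, LEXICON))
```

The generated text would still go through the manual checking step listed above.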

Ambiguity Handling
• In some cases several rules can apply (e.g. NUMEX-CAPACITY, which applies to both Hard Disk Capacity and Memory Capacity)
• Generation of an embedding AMBIG tag in FNERC : <AMBIG> <NUMEX Value=""/> <NUMEX Value=""/>24 MO </AMBIG>
• Resolution in the FE module, using contextual information (for example, the TERM "Mémoire vive" (RAM) on the left)
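The two halves of this mechanism, wrapping candidates in an AMBIG tag and later choosing one from the left context, might look as follows. This is a sketch under assumptions: the value names HD-CAPACITY and RAM-CAPACITY, the hint table and both function names are hypothetical, and the actual FE module may use richer context.

```python
def wrap_ambiguous(surface, candidate_values):
    """Embed one empty NUMEX per candidate inside an AMBIG tag."""
    candidates = " ".join(f'<NUMEX Value="{v}"/>' for v in candidate_values)
    return f"<AMBIG> {candidates}{surface} </AMBIG>"


# Hypothetical mapping from left-context terms to candidate values.
CONTEXT_HINTS = {
    "mémoire vive": "RAM-CAPACITY",
    "disque dur": "HD-CAPACITY",
}


def resolve(left_context, candidate_values, hints=CONTEXT_HINTS):
    """Pick the candidate whose hint term occurs closest on the left."""
    context = left_context.lower()
    best_value, best_pos = None, -1
    for term, value in hints.items():
        pos = context.rfind(term)
        if value in candidate_values and pos > best_pos:
            best_value, best_pos = value, pos
    return best_value  # None when no hint is found


values = ["HD-CAPACITY", "RAM-CAPACITY"]
print(wrap_ambiguous("24 MO", values))
print(resolve("Mémoire vive : ", values))
```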

Terms : many terms are recognised, to be used in FE


FNERC : Name Matching / Normalisation

Name Matching
• Matches co-referential NE, NUMEX and TIMEX inside the same product description.
• Needs the demarcator process to have run before it is applied.
• Lingway : uses the attribute "value" that we add during the FNERC module
• Example : if, in the same product description, we annotate a PROCESSOR twice (say Intel PIII and Intel Pentium III)
  => they will have the same value Id,
  => when filling the NE – PROCESSOR slot, the module will add just one entry to the slot.
• Run with an XSLT style-sheet against the XHTML input file.
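The PROCESSOR example amounts to de-duplicating annotations by their value Id within one product description. The slide states that this is done with an XSLT style-sheet; the Python below only illustrates the logic, with a hypothetical tuple layout, function name and value Ids.

```python
def fill_slots(annotations):
    """Fill one slot list per entity type, adding each value Id only once.

    `annotations` holds (entity_type, surface_form, value_id) tuples taken
    from a single product description.
    """
    slots = {}
    seen = set()
    for entity_type, _surface, value_id in annotations:
        key = (entity_type, value_id)
        if key in seen:
            continue  # co-referential mention, already in the slot
        seen.add(key)
        slots.setdefault(entity_type, []).append(value_id)
    return slots


product = [
    ("PROCESSOR", "Intel PIII", "PIII"),
    ("PROCESSOR", "Intel Pentium III", "PIII"),
    ("MANUF", "Intel", "INTEL"),
]
print(fill_slots(product))
```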

Normalisation
• Enables extracted information to be displayed in the various CROSSMARC languages
• Lingway :
  uses the attribute "value"
  by processing the Ontology and Lexicons, displays the Synonym in one or the other language
• Run with an XSLT style-sheet against the XHTML input file.
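Normalisation as described here is essentially a lookup from the "value" attribute to a label in the requested language. A minimal sketch, assuming a hypothetical lexicon layout and hypothetical value Ids (the real step is an XSLT style-sheet over the Ontology and Lexicons):

```python
# Hypothetical per-language lexicon, keyed by the "value" attribute.
LEXICONS = {
    "OV-RAM": {"fr": "mémoire vive", "en": "RAM"},
    "OV-HD": {"fr": "disque dur", "en": "hard disk"},
}


def normalise(value_id, language, lexicons=LEXICONS):
    """Return the label for value_id in the given language, or None."""
    return lexicons.get(value_id, {}).get(language)


print(normalise("OV-RAM", "en"))
print(normalise("OV-RAM", "fr"))
```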


FNERC : Evaluation

Ellogon Evaluation
• Still to be done due to format problems and delays
• Discussion about :
  XHTML Format
  Specific Output vs Human Annotators Output

Lingway Evaluation : Developments still to be done
• Compare test files one to one : Human Annotated File / FNERC Annotated File
• Precision
• Recall
• Miscellaneous
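The planned file-by-file comparison reduces to standard precision and recall over two sets of annotations. The sketch below assumes each annotation is represented as a (start, end, tag) tuple and that only exact matches count as correct; both choices are assumptions, not decisions recorded on the slide.

```python
def precision_recall(reference, system):
    """Compare system annotations with human reference annotations.

    precision = correct / number of system annotations
    recall    = correct / number of reference annotations
    """
    reference, system = set(reference), set(system)
    correct = len(reference & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


human = {(0, 5, "NE"), (10, 15, "NUMEX")}
fnerc = {(0, 5, "NE"), (20, 25, "TERM")}
print(precision_recall(human, fnerc))
```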

Ontology Matters
• Missing ontology items
• Missing ontology attributes
• Processing of Specific Information :
  Additional textual information
  Fuzzy numerical Values
  Binary Values


FNERC : Lingway Evaluation (1)

See XSLT File

NE – MODEL :
• Current system quite silent
• Due to : no general Regular Expression (or noise with other rules) ?

NUMEX – MONEY :
• Some problems with format (€ in UTF-8, ISO-Latin + &nbsp;)
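One way to reduce these format problems is to bring every page to a single text form before the money rules run. The sketch below is an assumption about how that could be done, not the FNERC fix: it takes "ISO-Latin" to mean ISO-8859-15 (where € is byte 0xA4), and the function names are hypothetical.

```python
import html


def decode_page(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-15 (assumed ISO-Latin)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-15")


def normalise_money_text(text: str) -> str:
    """Resolve entities such as &nbsp; and &euro;, then use plain spaces."""
    return html.unescape(text).replace("\u00a0", " ")


print(normalise_money_text("1&nbsp;299&nbsp;&euro;"))
print(decode_page(b"1299 \xa4"))
```

After this step a single regular expression over digits, plain spaces and the € character can cover all three input variants.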

NUMEX – DATE / TIME :
• Date / Time NUMEX have not yet been implemented

Agreement with Human Annotators' Decisions :
• 1024x768 (NUMEX – RESOLUTION), 10/100 (NUMEX – SPEED) vs 1024x768 pixels and 10/100 (TERM)
• Misspellings : Compact (NE – MANUF) / should an automatic system take them into account ?

Redundant extraction :
• Fujitsu-Siemens in the Title, in a secondary frame etc., whereas human annotators just tagged the occurrence in the main table
• Demarcator / Name Matching will handle these cases


FNERC : Ontology Matters (1)

Missing ontology items :
* Cards (F) => Graphics Card, Sound Card, Motherboard, Network Card, Controller Card
* Memory (F) => Cache memory, Flash memory, Video memory etc.

Missing ontology attributes :

Hard disk
Example : Disque dur Maxtor 40 Gb 7200 tours/s UDMA
Proposal : DD (hard disk) => Type (SCSI, IDE, External), Brand (list of brands), Capacity, Speed, UDMA (yes/no), Internal/External

Screen
Example : écran 14.1" TFT XGA/SVGA dp (pitch) 0.25
Proposal : Screen => (Screen Size, Screen Resolution, Screen Pitch Type)

Removables
Example : DVD-ROM - 17 Go - 8x - module enfichable (plug-in module)
Proposal : Removables => Type (DVD-ROM Reader / CD-ROM Reader / CD Writer), Capacity, Speed, External/Internal
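The proposed attribute sets can be written down as simple records to see which slots the examples would fill. This is only one possible reading of the proposals: the class and field names are hypothetical, and every field is optional because a product description rarely states all of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HardDisk:
    type: Optional[str] = None       # SCSI, IDE, External
    brand: Optional[str] = None
    capacity: Optional[str] = None
    speed: Optional[str] = None
    udma: Optional[bool] = None
    internal: Optional[bool] = None  # Internal/External


@dataclass
class Screen:
    size: Optional[str] = None
    resolution: Optional[str] = None
    pitch_type: Optional[str] = None


@dataclass
class Removable:
    type: Optional[str] = None       # DVD-ROM Reader, CD-ROM Reader, CD Writer
    capacity: Optional[str] = None
    speed: Optional[str] = None
    internal: Optional[bool] = None  # External/Internal


# Slots filled from the hard disk and removables examples on the slide.
maxtor = HardDisk(brand="Maxtor", capacity="40 Gb", speed="7200 tours/s", udma=True)
dvd = Removable(type="DVD-ROM Reader", capacity="17 Go", speed="8x")
print(maxtor)
print(dvd)
```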


FNERC : Ontology Matters (2)

Processing of additional information :

Additional textual information :
moniteur NON INCLUS (screen not included)
garantie deux ans SUR SITE / RETOUR ATELIER (warranty details : two years, on site / return to workshop)

Fuzzy numerical Values :
garantie ILLIMITEE (unlimited warranty)

Binary Values :
Mémoire flash installé(e) ( max ) Aucun(e) (installed flash memory (max) : none)
Network card No

Ontology Evolution :
pentium 4 / Pentium III-M : these values are not present in the ontology, which raises the problem of keeping the ontology up to date. Perhaps we should devise some rules to cover these cases.
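As an illustration of the kind of rule suggested for values missing from the ontology, a generic pattern could recognise processor names by their shape rather than by enumeration. The regular expression below is a hypothetical example covering only the two cases quoted on the slide; it is not an existing FNERC rule.

```python
import re

# "pentium" followed by an Arabic or Roman numeral and an optional
# one-letter suffix such as "-M".
GENERIC_PENTIUM = re.compile(
    r"\bpentium\s*(?:\d+|[IVX]+)(?:-[A-Z])?\b", re.IGNORECASE
)


def find_processors(text):
    """Return every substring matching the generic processor pattern."""
    return GENERIC_PENTIUM.findall(text)


print(find_processors("pentium 4 / Pentium III-M"))
```

A match produced this way has no ontology value yet, so it would still need to be flagged for addition to the ontology.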