FNERC OVERVIEW 05/12/2002
Lingway, 05-06 of December 2002
FNERC : introduction
Lingway entered the project after CDC had already worked on FNERC
Decision to use our own tools: XTIRP Extraction Tool
System available at: http://hugo.lingway.com/CmBin/cmCgi.exe?_rule=CrossmarcD1&_url=<XHTML_FILE>
FNERC Version 2.0
FNERC System Description
• Architecture
• NE / TIMEX, NUMEX / TERM Annotation
• Name Matching and Normalisation
Evaluation
• Ellogon
• Measures
FNERC Future Developments
Ontology Matters
FNERC System Description : architecture (1)
[Figure: architecture of the FNERC module. Labels recoverable from the diagram: XHTML Pages (input; documents to process: TXT, HTML, XML); XTIRP To XML; XTIRP Tokenizer; Tok. Rules; XTIRP NE Annotator; RE Rules; Ontology; Name Matching; Normalisation; XSLT; XTIRP Structure; XTIRP Semantic Content; Annotated XHTML Pages (output; documents in XML format).]
FNERC System Description : architecture (2)
XTIRP To_XML module: ensures that the input is XML-conformant.
• If it is, the module processes the input into a tree structure with all tags kept.
• If it is not, the module applies a tidy-like step to create an XML-conformant structure and then processes it into a tree structure.
• CROSSMARC: this module is normally not used, as the input to FNERC is XHTML files.
XTIRP Tokenizer: splits the input text into sequences corresponding to either
• logical structure (such as a sentence, a paragraph, a section, etc.), or
• strong tags (such as td, p, br, etc.).
• CROSSMARC: for the first domain, we decided to keep the tag splitting.
XTIRP NE / NUMEX / TIMEX Annotator:
• Set of regular expression rules that identify patterns and add annotations (tags and attributes) to the recognized sequences.
• Uses the Ontology and the Lexicon.
Name Matching / Normalisation:
• Name matching: matches coreferential NE, NUMEX and TIMEX.
• Normalisation: normalises the NE, NUMEX and TIMEX used to fill the slots.
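The tag-based splitting kept for the first CROSSMARC domain can be illustrated with a minimal Python sketch. It is not the XTIRP Tokenizer itself: the function name `split_on_strong_tags` and the `STRONG_TAGS` list are assumptions made for illustration, using only the three tags named on the slide.

```python
import re

# Hypothetical list of "strong" tags; the slide names td, p and br as examples.
STRONG_TAGS = ("td", "p", "br")


def split_on_strong_tags(xhtml: str) -> list[str]:
    """Split XHTML text into sequences at every opening or closing strong tag."""
    names = "|".join(STRONG_TAGS)
    # Matches <td>, </td>, <td class="x">, <br/>, <br /> but not <pre> or <b>.
    boundary = re.compile(rf"</?(?:{names})\b[^>]*>", re.IGNORECASE)
    pieces = boundary.split(xhtml)
    # Inline tags (for example <b>) are kept; only empty sequences are dropped.
    return [piece.strip() for piece in pieces if piece.strip()]


row = "<td>Intel <b>PIII</b></td><td>24 Mo<br/>7200 tours</td>"
print(split_on_strong_tags(row))
```

Each returned sequence would then be handed to the annotator separately, so a rule cannot match across two table cells.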
FNERC System Description : Annotator (1)
Rule Format
[AVAILABILITY]
RegularExpression = '[0-9\/]+ *(heures|h\.|jours|j\.|mois)'
Tag_1 = "timex4(OA-d0e2145,OF-d0e2143, OV-d0e2141, DURATION)"
Where:
First line: rule title
Second line: Perl-like regular expression for what is to be annotated
Third line: action(s) to be taken. Refers to a general action sequence, here timex4.
Actions
[Timex4]
Name = "TIMEX"
Attributes = "Feature=@1 Attribute=@2 Value=@3 Type=@4"
POSITION = MATCH
Where:
Second line: the tag name (Name="TIMEX")
Third line: the attributes of this tag (the variables @1, @2, @3, @4 correspond to the values given in the rule)
Fourth line: the position of the tag
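To make the interplay between a rule and its action concrete, here is a small Python sketch that applies the AVAILABILITY rule to a sentence. The rule and action values are copied from the slide; everything else is an assumption: the dictionary layout, the function names, the quoting of attribute values, and the reading of POSITION = MATCH as "wrap the matched sequence in the tag".

```python
import re

# Rule and action values copied from the slide; the dict layout is an assumption.
RULE = {
    "title": "AVAILABILITY",
    "regex": r"[0-9\/]+ *(heures|h\.|jours|j\.|mois)",
    "tag": "timex4(OA-d0e2145,OF-d0e2143, OV-d0e2141, DURATION)",
}
ACTIONS = {
    "timex4": {
        "name": "TIMEX",
        "attributes": "Feature=@1 Attribute=@2 Value=@3 Type=@4",
    }
}


def build_open_tag(tag_call: str) -> tuple[str, str]:
    """Expand 'action(arg1, arg2, ...)' into (opening tag, tag name)."""
    action_name, raw_args = re.fullmatch(r"(\w+)\((.*)\)", tag_call).groups()
    args = [arg.strip() for arg in raw_args.split(",")]
    # The slide writes "timex4" in the rule and "[Timex4]" in the action,
    # so the lookup is made case-insensitive here.
    action = ACTIONS[action_name.lower()]
    # Replace @N by the N-th argument; quoting the value is an assumption.
    attrs = re.sub(
        r"@(\d+)",
        lambda m: f'"{args[int(m.group(1)) - 1]}"',
        action["attributes"],
    )
    return f'<{action["name"]} {attrs}>', action["name"]


def annotate(text: str, rule: dict) -> str:
    """Wrap every match of the rule's regular expression in the action's tag."""
    open_tag, name = build_open_tag(rule["tag"])
    # POSITION = MATCH is read here as "wrap the matched sequence" (assumption).
    return re.sub(rule["regex"], lambda m: f"{open_tag}{m.group(0)}</{name}>", text)


print(annotate("Livraison sous 48 heures", RULE))
```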
FNERC System Description : Annotator (2)
Automatic Generation of Rules from Ontology and Lexicon Nodes information
• Ontology : Identifiers
• Lexicon : Regular Expressions
• XSLT Stylesheet to generate the Rule File
• Manual checking of the rules in the generated Rule File (corrections, addition of generic rules)
• Currently 194 Rules
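The generation step is done with an XSLT stylesheet in the actual system. As a rough illustration of the same idea in Python, the sketch below combines an ontology node (identifiers) with a lexicon entry (regular expression) to emit one rule in the rule-file format. The `node` dictionary layout and the function name `generate_rule` are hypothetical; the identifiers and the regular expression are those of the AVAILABILITY rule, with the mapping of identifiers to feature, attribute and value taken from the Attributes template.

```python
def generate_rule(node: dict, regex: str) -> str:
    """Build one rule-file entry from an ontology node and a lexicon regex."""
    arguments = ", ".join(
        [node["feature"], node["attribute"], node["value"], node["type"]]
    )
    return "\n".join(
        [
            f'[{node["title"]}]',
            f"RegularExpression = '{regex}'",
            f'Tag_1 = "{node["action"]}({arguments})"',
        ]
    )


# Hypothetical in-memory view of one ontology node and its lexicon entry.
node = {
    "title": "AVAILABILITY",
    "action": "timex4",
    "feature": "OA-d0e2145",
    "attribute": "OF-d0e2143",
    "value": "OV-d0e2141",
    "type": "DURATION",
}
lexicon_regex = r"[0-9\/]+ *(heures|h\.|jours|j\.|mois)"

print(generate_rule(node, lexicon_regex))
```

Generated entries would still go through the manual checking step, since generic rules cannot be derived from individual nodes.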
Ambiguity Handling
• In some cases several rules can apply (e.g. NUMEX-CAPACITY, which applies to both Hard Disk Capacity and Memory Capacity)
• Generation of an embedding AMBIG tag in FNERC: <AMBIG> <NUMEX Value=""/> <NUMEX Value=""/>24 MO </AMBIG>
• Resolution in the FE module using contextual information (for example, the TERM "Mémoire vive" (RAM) on the left)
Terms: many terms are recognized, to be used in FE
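The AMBIG embedding and its later resolution from left context can be sketched as follows. This is only an illustration of the idea: the Value identifiers, the cue terms in `CANDIDATES` and both function names are hypothetical, and the actual FE module may use different contextual information.

```python
import re

# Hypothetical candidate readings of a NUMEX-CAPACITY match, each with the
# TERM cues that would select it. Identifiers and cues are illustrative only.
CANDIDATES = {
    "memory-capacity": ["mémoire vive"],
    "hard-disk-capacity": ["disque dur"],
}


def embed_ambiguous(surface: str) -> str:
    """Wrap a match in an AMBIG tag listing every candidate NUMEX reading."""
    readings = " ".join(f'<NUMEX Value="{value}"/>' for value in CANDIDATES)
    return f"<AMBIG> {readings}{surface} </AMBIG>"


def resolve(left_context: str, ambig: str) -> str:
    """Keep the reading whose cue term occurs closest to the left of the match."""
    surface = re.sub(r"<[^>]+>", "", ambig).strip()
    lowered = left_context.lower()
    best_value, best_pos = None, -1
    for value, cues in CANDIDATES.items():
        for cue in cues:
            pos = lowered.rfind(cue)
            if pos > best_pos:
                best_value, best_pos = value, pos
    if best_value is None:
        return ambig  # no cue found: leave the ambiguity in place
    return f'<NUMEX Value="{best_value}">{surface}</NUMEX>'


ambig = embed_ambiguous("24 MO")
print(resolve("Mémoire vive : ", ambig))
```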
FNERC : Name Matching / Normalisation
Name Matching
• Matches co-referential NE, NUMEX and TIMEX within the same product description.
• Needs the demarcator process to run before it is applied.
• Lingway: uses the attribute "value" that we add during the FNERC module.
• Example: if, in the same product description, a PROCESSOR is annotated twice (say Intel PIII and Intel Pentium III)
=> both annotations will have the same value Id,
=> when filling the NE – PROCESSOR slot, the module will add only one of them to the slot.
• Run as an XSLT style-sheet against the XHTML input file.
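The system performs this step with an XSLT style-sheet. The Python sketch below only illustrates the logic of the example: annotations that share a value Id within one product description fill the slot once. The annotation layout, the function name `fill_slot` and the identifier `OV-pentium3` are hypothetical.

```python
def fill_slot(annotations: list[dict], ne_type: str) -> list[dict]:
    """Return one annotation per distinct value Id for the given NE type."""
    seen = set()
    slot = []
    for annotation in annotations:
        if annotation["type"] != ne_type or annotation["value"] in seen:
            continue
        seen.add(annotation["value"])
        slot.append(annotation)
    return slot


# One product description, as delimited by the demarcator (hypothetical data).
product = [
    {"type": "PROCESSOR", "surface": "Intel PIII", "value": "OV-pentium3"},
    {"type": "MANUF", "surface": "Fujitsu-Siemens", "value": "OV-fujitsu"},
    {"type": "PROCESSOR", "surface": "Intel Pentium III", "value": "OV-pentium3"},
]

print(fill_slot(product, "PROCESSOR"))
```

Keeping the first occurrence is an arbitrary choice in this sketch; the slides do not say which mention is retained.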
Normalisation
• Enables the extracted information to be displayed in the various CROSSMARC languages.
• Lingway: uses the attribute "value"; by processing the Ontology and Lexicons, displays the synonym in one or the other language.
• Run as an XSLT style-sheet against the XHTML input file.
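As with name matching, the real step is an XSLT style-sheet. A minimal Python sketch of the lookup it implies is given below, assuming a per-language lexicon keyed by value Id. The `LEXICONS` content, the identifier and the fallback to the original surface string are all assumptions.

```python
# Hypothetical per-language lexicons keyed by ontology value Id.
LEXICONS = {
    "fr": {"OV-harddisk": "disque dur"},
    "en": {"OV-harddisk": "hard disk"},
}


def normalise(annotation: dict, language: str) -> str:
    """Return the lexicon synonym for the annotation's value Id in a language.

    Falls back to the surface string when the Id is missing from the lexicon
    (the fallback is an assumption, not documented behaviour).
    """
    return LEXICONS.get(language, {}).get(annotation["value"], annotation["surface"])


annotation = {"surface": "Disque dur", "value": "OV-harddisk"}
print(normalise(annotation, "en"))
```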
FNERC : Evaluation
Ellogon Evaluation
• Still to be done, due to format problems and delays
• Discussion about:
  XHTML format
  Specific output vs human annotators' output
Lingway Evaluation: developments still to be done
• One-to-one comparison of test files: human-annotated file / FNERC-annotated file
• Precision
• Recall
• Miscellaneous
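The precision and recall measures planned above can be computed from the one-to-one file comparison roughly as follows. The sketch assumes each annotation is represented as a (start, end, tag) tuple and that only exact matches count; both choices, and the sample data, are assumptions rather than the evaluation protocol of the project.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Compute precision and recall of predicted annotations against gold ones."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations as (start offset, end offset, tag).
human = {
    (0, 9, "NE-MANUF"),
    (10, 18, "NUMEX-RESOLUTION"),
    (20, 26, "NUMEX-SPEED"),
    (30, 35, "NE-MODEL"),
}
fnerc = {
    (0, 9, "NE-MANUF"),
    (10, 18, "TERM"),
    (20, 26, "NUMEX-SPEED"),
}

precision, recall = precision_recall(human, fnerc)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample, two of the three FNERC annotations agree with the human file and two of the four human annotations are found.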
Ontology Matters
• Missing ontology items
• Missing ontology attributes
• Processing of specific information:
  Additional textual information
  Fuzzy numerical values
  Binary values
FNERC : Lingway Evaluation (1)
See XSLT File
NE – MODEL:
• The current system is quite silent (low recall)
• Possibly due to the lack of a general regular expression (or to noise from other rules)?
NUMEX – MONEY:
• Some problems with format (€ in UTF-8, ISO-Latin + )
NUMEX – DATE / TIME:
• Date / Time NUMEX have not yet been implemented
Accordance with human annotators' decisions:
• 1024x768 (NUMEX – RESOLUTION), 10/100 (NUMEX – SPEED) vs 1024x768 pixels and 10/100 (TERM)
• Misspellings: Compact (NE – MANUF). Should an automatic system take them into account?
Redundant extraction:
• Fujitsu-Siemens in the title, in a secondary frame, etc., whereas human annotators tagged only the occurrence in the main table
• Demarcator / Name Matching will handle these cases
FNERC : Ontology Matters (1)
Missing ontology items:
* Cards (F) => Graphics Card, Sound Card, Motherboard, Network Card, Controller Card
* Memory (F) => Cache memory, Flash memory, Video memory, etc.
Missing ontology attributes:
Hard disk
Example: Disque dur Maxtor 40 Gb 7200 tours/s UDMA
Proposition: DD => Type (SCSI, IDE, External), Brand (list of brands), Capacity, Speed, UDMA (yes/no), Internal/External
Screen
Example: écran 14.1" TFT XGA/SVGA dp (pitch) 0.25
Proposition: Screen => (Screen Size, Screen Resolution, Screen Pitch Type)
Removables
Example: DVD-ROM - 17 Go - 8x - module enfichable
Proposition: Removables => Type (DVD-ROM Reader / CD-ROM Reader / CD Writer), Capacity, Speed, External/Internal
FNERC : Ontology Matters (2)
Processing of additional information:
Additional textual information:
moniteur NON INCLUS (screen not included)
garantie deux ans SUR SITE / RETOUR ATELIER (warranty details: two-year warranty, on site / return to workshop)
Fuzzy numerical values:
garantie ILLIMITEE (unlimited warranty)
Binary values:
Mémoire flash installé(e) ( max ) Aucun(e) (flash memory installed (max): none)
Network card No
Ontology evolutivity:
pentium 4 / Pentium III-M: these values are not present in the ontology, which raises the problem of how the ontology evolves. Perhaps we should devise some rules to cover these cases.
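One possible shape for a rule covering values that are not yet in the ontology, such as pentium 4 or Pentium III-M, is a generic pattern that recognizes the family and flags unknown members for later addition. The Python sketch below is only a suggestion in that direction: the pattern, the `KNOWN` table, its identifier and the function name are hypothetical and are not taken from the FNERC rule file.

```python
import re

# Hypothetical generic pattern: "Pentium", a Roman or Arabic generation number,
# and an optional one-letter suffix such as "-M".
GENERIC_PROCESSOR = re.compile(
    r"\bpentium\s*(?P<gen>[IVX]+|\d+)(?P<suffix>-[A-Z])?\b", re.IGNORECASE
)

# Hypothetical extract of the ontology: surface form -> value Id.
KNOWN = {"pentium iii": "OV-pentium3"}


def annotate_processor(text: str):
    """Recognize a Pentium-family mention and report whether the ontology knows it."""
    match = GENERIC_PROCESSOR.search(text)
    if match is None:
        return None
    surface = match.group(0)
    value = KNOWN.get(surface.lower())
    return {"surface": surface, "value": value, "in_ontology": value is not None}


for sample in ("Intel Pentium III", "Intel Pentium III-M 1 GHz", "pentium 4"):
    print(annotate_processor(sample))
```

Mentions flagged with `in_ontology` set to False could be collected as candidates for new ontology items instead of being silently dropped.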