University Hospitals of Geneva, Medical Informatics

Query and Document Translation by Text Categorization

Julien Gobeill, Henning Müller, Patrick Ruch
julien.gobeill@sim.hcuge.ch

CLEF 2006, Alicante, September 22-23



CLIR vs. MMTQ

• Remark: C.L.I.R vs. Manual Multilingual Translation of Queries
• French + English + German queries ≠ CLIR

• Translation: What?
  • Queries: usual way (e.g. Rosemblat et al., CLEF 2003)
  • Documents: rare/expensive (LOGOS - Oard and Hackett 1998)
  Performance depends on the translation strategy

Context-dependent or not?
• Both: Medical Subject Headings as interlingua!

• Machine Translation: CLIR Ratio = 60%
• Thesaurus-driven Lexicon: 65-75% (Eichmann et al. 1998)
• Text categorization: 80% (Ruch 2004)
• Bilingual Parallel Corpora: > 90% (Dumais et al. 1997)
• Text categorization + Machine Translation: > 90% (Ruch 2004)
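The CLIR ratios listed above are usually read as cross-lingual retrieval effectiveness relative to a monolingual baseline. A minimal sketch, assuming the ratio is computed on mean average precision; the function name and the sample MAP values are hypothetical and are not figures from the cited studies.

```python
def clir_ratio(cross_lingual_map: float, monolingual_map: float) -> float:
    """Cross-lingual MAP expressed as a percentage of the monolingual MAP."""
    if monolingual_map <= 0:
        raise ValueError("monolingual MAP must be positive")
    return 100.0 * cross_lingual_map / monolingual_map


# Hypothetical values, for illustration only.
ratio = clir_ratio(cross_lingual_map=0.24, monolingual_map=0.30)
print(f"CLIR ratio = {ratio:.0f}%")
```

Expressing results this way makes translation strategies comparable across collections whose absolute MAP values differ.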


Query Translation by Machine Translation

ORIGINAL QUERY (CF): What are the effects of calcium on the physical properties of mucus from cystic fibrosis patients?

FRENCH (Human): Quels sont les effets du calcium sur les propriétés physiques du mucus chez les patients atteints de mucoviscidose ?

SYSTRAN: Which are the effects of calcium on the properties physiques of mucus among patients reached of mucoviscidose ?

~Grammatical Translation but...


Some Issues of MT for IR...

• Babel Fish (Systran’s system in AltaVista)

alcoholic fatty liver → foie gras alcoolique (expected: stéatose hépatique alcoolique)

ExGallia - La Boutique des Francais du Monde
... vos abonnements partout dans le monde chez vous ! Foie-gras truffé du Lot, confit de canard à l ... Apéritif 100% naturel obtenu par la fermentation alcoolique de 6 à 700 fleurs par bouteilles. A ...
www.exgallia.com/produits-francais02.htm

CLIR Ratio = 60% ... but available off the shelf!


Bilingual parallel corpora (source UN)

81. Avec la mise en oeuvre du système intégré de gestion (SIG), grâce à l'analyse des traces électroniques, les possibilités de contrôle et de vérification seront plus étendues que jamais. Le SIG marque une étape importante dans l'uniformisation et la rationalisation de la pratique de la gestion dans tous les lieux d'affectation de l'Organisation. Pour la première fois, l'ONU va pouvoir disposer en temps voulu d'une information complète et récente sur ses ressources et leur emploi. Utilisé par d'autres programmes et organismes des Nations Unies, le SIG pourrait également être un facteur de transparence et de plus grande compatibilité de l'information d'un organisme à l'autre, ce qui conduirait à une harmonisation sur le plan administratif.

81. With the implementation of the Integrated Management Information System (IMIS), greater monitoring and audit capabilities will be available through electronic audit trails than ever before. IMIS is a major step in standardizing and rationalizing the management process in the Organization across duty stations. The Organization will be able, for the first time, to have access to timely, up-to-date and comprehensive information on its resources and their utilization. The use of IMIS by other programmes and organizations in the United Nations system could also promote greater transparency and compatibility of information across organizations, leading to standardization in administrative matters.

Linear projection methods to build transfer matrices, CLIR Ratio = 90%
Problem: overkill to develop these resources if not available!


Text categorization…


Functional Example

FRENCH (Human): Quels sont les effets du calcium sur les propriétés physiques du mucus chez les patients atteints de mucoviscidose ?

Top 1: mucoviscidose
Top 2: calcium + mucoviscidose
Top 3: calcium + physique + mucoviscidose
Top 4: calcium + physique + mucoviscidose + humanités
Top 5: calcium + physique + mucoviscidose + humanités + mucus
...
Top N: {...}

BoW: {calcium, humanities, physics, cystic fibrosis, mucus}

Specific needs: French Stemmer (Savoy), but cognates are frequent!

What is the best threshold value for N?
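The functional example above can be sketched as follows: the categorizer ranks controlled-vocabulary concepts for the French query, the top N are kept, and their English labels form the translated bag of words. This is only an illustration under stated assumptions; the concept identifiers `C1` to `C5`, the `MESH_LABELS` table, and `translate_query` are hypothetical placeholders, not real MeSH identifiers or the authors' code.

```python
# Hypothetical ranked categorizer output for the French query, best first.
# The identifiers are placeholders, not real MeSH unique IDs.
RANKED_CONCEPTS = ["C1", "C2", "C3", "C4", "C5"]

# Hypothetical bilingual view of the thesaurus: one concept, two labels.
MESH_LABELS = {
    "C1": {"fr": "mucoviscidose", "en": "cystic fibrosis"},
    "C2": {"fr": "calcium", "en": "calcium"},
    "C3": {"fr": "physique", "en": "physics"},
    "C4": {"fr": "humanités", "en": "humanities"},
    "C5": {"fr": "mucus", "en": "mucus"},
}


def translate_query(ranked_concepts, labels, n, target_lang="en"):
    """Keep the top-n concepts and return their target-language labels."""
    return [labels[concept][target_lang] for concept in ranked_concepts[:n]]


for n in (1, 3, 5):
    print(n, translate_query(RANKED_CONCEPTS, MESH_LABELS, n))
```

Because the concept is the pivot, the same ranked list can be rendered in any language the thesaurus covers by changing `target_lang`.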


Resources and Work Plan

• A multilingual controlled vocabulary
  • Medical Subject Headings (~20 000 concepts)
  • Thesaurus: 120 020 (UMLS) – 5000 (UMLF)

• Collection
  • OHSUMED
  • Tuning: OHSUMED queries translated into French by an expert

• Development (if possible language-independent)
  • Translation: Use the categorizer and MeSH [as interlingua] for CLIR purposes


Example of MEDLINE Record

PMID: 11924965
Simple multiplex genotyping by surface-enhanced resonance Raman scattering.
Graham D, Mallinder BJ, Whitcombe D, Watson ND, Smith WE.

The accurate detection of DNA sequences is essential for a variety of post human genome projects including detection of specific gene variants for medical diagnostics and pharmacogenomics. A specific DNA sequence detection assay based on surface-enhanced resonance Raman scattering (SERRS) and an amplification refractory mutation system (ARMS) is reported. Initially, generation of PCR products was achieved by using specifically designed allele-specific SERRS active primers. Detection by SERRS of the PCR products confirmed the presence of the sequence tested for by the allele-specific oligonucleotides. This lead directly to the multiplex genotyping of human DNA samples for the deltaF508 mutational status of the cystic fibrosis transmembrane conductance regulator gene using SERRS active primers in an ARMS assay. Removal of the unincorporated primers allowed fast and accurate analysis in this system in a multiplex format without any separation of amplicons. The results indicate that SERRS can be used in modern genetic analysis and offers an opportunity for the development of novel assays. This is the first demonstration of the use of SERRS in multiplex genotyping and shows potential advantages over fluorescence as a detection technique with considerable promise for future development.

Major MeSH: Cystic Fibrosis*; DNA*; Genotype*; Polymerase Chain Reaction
Minor MeSH: HLA-DQ Antigens; Human; Reverse Transcriptase; Sequence Analysis; Spectrum Analysis, Raman; Support, Non-U.S. Gov't


ATC General Strategies

• Empirical learning of text-concept associations from a training set of texts and their associated concepts:
  • Reuters (Bayesian classifiers, Lewis 1992): 100 classes
  • Text categorization/filtering paradigm [Sebastiani: hundreds…]

Effective but Learning Conditions and Scalability…

• Retrieval based on word-matching, which attributes concepts to text based on lexical similarities:
  • Cross Language IR (SAPHIRE Int., Hersh et al. 1998)
  • Recent and rare
  • Hypothesis: sufficient for mapping queries/documents and MeSH terms


Basic Classifiers

• C1: FSA Pattern matcher + thesaurus [RegEx]

word1…wordn → word1… _[*,2] …wordn

word1…wordn → word1… [wordi]*…wordn

Boolean scoring

• C2: Vector Space: Porter stems + TF*IDF weighting [VS]

Cosine distance/Similarity/Pivoted normalization

• C’: UMLS Thesaural resources

• C’’: Linguistically-motivated indexing units (NP)
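The two basic classifiers can be illustrated with a small sketch. It assumes that `_[*,2]` denotes a gap of up to two arbitrary tokens between the words of a thesaurus term, and it leaves out Porter stemming and pivoted normalization. The `TERMS` list and all function names are hypothetical; this is not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus; the real system draws on MeSH/UMLS terms.
TERMS = ["cystic fibrosis", "calcium", "mucus", "physical therapy"]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_pattern(term, max_gap=2):
    """Regex matching the term's words in order, allowing up to max_gap
    arbitrary tokens between consecutive words."""
    words = [re.escape(w) for w in tokenize(term)]
    gap = r"(?:\s+\w+){0,%d}\s+" % max_gap
    return re.compile(r"\b" + gap.join(words) + r"\b", re.IGNORECASE)


def c1_match(text, terms=TERMS):
    """C1 sketch, Boolean scoring: a term either matches or it does not."""
    return [t for t in terms if term_pattern(t).search(text)]


def c2_rank(text, terms=TERMS):
    """C2 sketch: rank terms by cosine similarity of TF*IDF vectors."""
    term_tokens = [tokenize(t) for t in terms]
    df = Counter(tok for toks in term_tokens for tok in set(toks))
    idf = {tok: math.log(1 + len(terms) / d) for tok, d in df.items()}

    def vector(tokens):
        counts = Counter(tokens)
        return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}

    def cosine(u, v):
        dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vector(tokenize(text))
    scored = [(cosine(query_vec, vector(toks)), term)
              for term, toks in zip(terms, term_tokens)]
    # Keep only terms sharing at least one word with the text, best first.
    return sorted((s for s in scored if s[0] > 0), reverse=True)


QUERY = ("What are the effects of calcium on the physical properties "
         "of mucus from cystic fibrosis patients?")
print(c1_match(QUERY))
print([term for _, term in c2_rank(QUERY)])
```

In this toy setup the vector-space ranker also proposes "physical therapy" through the shared word "physical", while the pattern matcher requires every word of a term to be present.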


Metrics: Recall and Precision

• Relevant retrieved
  How many terms are found for the complete run?

• Mean Reciprocal Rank (maximize precision)
  Precision of the top-ranked category

• Mean Average Precision
  Average precision over 11 recall points
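The metrics above can be made concrete with a short sketch. It assumes the standard definitions of reciprocal rank and 11-point interpolated average precision, and that each ranked list contains no duplicates. The function names and the example ranking are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def eleven_point_average_precision(ranked, relevant):
    """Mean of interpolated precision at recall 0.0, 0.1, ..., 1.0.
    Assumes `ranked` holds no duplicate items."""
    if not relevant:
        return 0.0
    points = []  # (recall, precision) observed at each relevant hit
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in range(11):
        recall_level = level / 10
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= recall_level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11


def mean(values):
    """Average over queries; gives MRR or MAP from per-query scores."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranking and relevance judgments for one query.
ranked = ["cystic fibrosis", "humanities", "calcium", "mucus"]
relevant = {"cystic fibrosis", "calcium"}
print(reciprocal_rank(ranked, relevant))
print(round(eleven_point_average_precision(ranked, relevant), 4))
```

Averaging the per-query values with `mean` yields the Mean Reciprocal Rank and the Mean Average Precision reported for a full run.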


Query Translation

What are the effects of calcium on the physical properties of mucus from cystic fibrosis patients?

Top 1: cystic fibrosis
Top 2: calcium + cystic fibrosis
Top 3: calcium + physics + cystic fibrosis
Top 4: calcium + physics + cystic fibrosis + humanities
Top 5: calcium + physics + cystic fibrosis + humanities + mucus
Top 10: cystic fibrosis + calcium + physics + humanities + humanism + mucus + health physics + human rights + calcium compounds + physical therapy

Automatic Text Categorization:
Fine-tuning on an OHSUMED document mapping task (200-300 tokens), then on short queries (7-30 tokens)


Query Length as Parameter

What are the effects of calcium on cystic fibrosis patients?

0840|cystic fibrosis transmembrane conductance regulator|
1001|humanism|
1001|humanities|
T=3
2501|calcium|
2724|fibrosis|
6070|cystic fibrosis|

What are the effects of calcium on the physical properties of mucus from cystic fibrosis patients?

1001|humanities|
T=4
1807|mucus|
2501|calcium|
2724|fibrosis|
6070|cystic fibrosis|

Maximum ~ 3 for query mapping, what about document?
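The threshold T shown above can be sketched as a simple cut on the scored output. The sketch assumes that each line has the form `score|term|` and that a higher score means a better match; `parse_scores` and `top_terms` are hypothetical helper names.

```python
RAW_OUTPUT = """\
0840|cystic fibrosis transmembrane conductance regulator|
1001|humanism|
1001|humanities|
2501|calcium|
2724|fibrosis|
6070|cystic fibrosis|
"""


def parse_scores(raw):
    """Parse 'score|term|' lines into (score, term) pairs."""
    pairs = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        score, term = line.split("|")[:2]
        pairs.append((int(score), term))
    return pairs


def top_terms(pairs, t):
    """Return the t highest-scoring terms, best first (assumed ordering)."""
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    return [term for _, term in ranked[:t]]


print(top_terms(parse_scores(RAW_OUTPUT), t=3))
```

Raising `t` admits lower-scoring and often noisier terms such as "humanities", which is why the cutoff is tuned separately for short queries and for longer documents.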


Results

• Medical ImageCLEF

Top 3: MAP = 0.1913
Top 5: MAP = 0.1967
Top 8: MAP = 0.2255 [GE_8EN.treceval.eval]
[…]
Top 20?


Conclusion

• Data-poor Text Categorization is effective for CLIR
• Mostly language independent

• Try with more than 8 terms!
• Recompute fusion with image features!


Thank you for your attention…

- EAGL Consortium: Swiss National Foundation
- Swiss-Prot Group: Anne-Lise Veuthey, Violaine Pillet