Text Mining and Knowledge Management
Junichi Tsujii
GENIA Project, Kototoi Project (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
Computer Science, University of Tokyo


Page 2: Increase in Medline

[Figure: Increase in Medline, 1964 to 2002. One axis shows yearly increments (0 to 600,000); the other shows accumulation (0 to 14,000,000).]

Page 4: TEXT MINING for Bio-Medicine in Japan

1. Institute for Medical Science (IMS), U-Tokyo: Information Extraction from Text for Signal Pathways
2. Japan Bio-Informatics Research Centre (JBIRC): Interpretation of Micro-Array Data
3. Research Institute for Genetics (RIG): Disease-Gene Associations
4. Research Institute for Natural Science (Riken): Tool for Curators for GO Annotation

Resource Building for TM in BM: GENIA Project (1998 - )
• GENIA Corpus (Annotated Text)

Information Exploitation System: Kototoi Project (2000 - )
• Adaptable POS Tagger (Bio-Tagger), NER adapted for BM
• Parser based on HPSG (Enju), ML for Text Processing

Page 5: TEXT Mining = DATA Mining + BOW ?

BOW : "Bag of Words" Model

TM products on the market come with fanciful visualization facilities and trend analysis tools.

The model does not work because
(1) Language is a complex system
(2) Language is inherently associated with knowledge

Mining + NLP + Knowledge Management
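The slide does not give an example of where the bag-of-words model breaks down. The following is a minimal illustrative sketch, not part of the original talk: the function name `bag_of_words` and the two toy sentences are assumptions (the gene names PHO2 and PHO5 are borrowed from a later slide). It shows that two statements with opposite meanings get the same representation once word order is discarded.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(sentence.lower().split())


# Two toy sentences (hypothetical) that state opposite relations.
a = "PHO2 activates PHO5"
b = "PHO5 activates PHO2"

# The bags are identical, so a pure BOW model cannot tell who activates whom.
print(bag_of_words(a) == bag_of_words(b))  # True
```

Recovering the direction of the relation needs some structural analysis of the sentence, which is the role the talk assigns to NLP.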

Page 6: Effective management of knowledge and information is the key

A huge amount of
• Raw Data
• Unstructured Information (Text)
• Semi-structured Information (XML + Text)
• Structured Information (Databases)

Ontology-based KMS
Natural Language Processing
Information Exploitation

Page 7: Overview of GENIA System

Information Extraction Module
• Identify & classify terms
• Identify events

Retrieval Module
• Request enhancement
• Spawn request
• Classify documents

Interface Module
• GUI
• HTML conversion
• System integration

Concept Module

Corpus Module
• Markup generation / compilation
• Annotated corpus construction

Database Module
• DB design / access / management
• DB construction
• BK design / construction / compilation

User
• IR Request
• Abstract
• Full Paper

Other elements shown in the diagram: MEDLINE, Raw (OCR) Text, Structure, Annotated Corpus, Document, Named-Entity, Event, Database, Ontology, Markup language, Data model, Background Knowledge, Security

Page 8: Non-Trivial Mappings

Language Domain: Linguistic expressions
Knowledge Domain: Concepts and Relationships among Them (motivated independently of language)

1. Size of Knowledge
2. Context Dependency
3. Evolving Nature of Science
4. Hypothetical Nature of Ontology
5. Inconsistency

Terminology, NLP, Paraphrasing

Page 10: Terms and Concepts

Term: address

Concepts:
• address-as-a-speech
• address-as-a-mail-address
• address-as-a-street-address

A term is introduced, without explicit understanding of what it means, in order for one to make statements about it.

Semantic Web, Tim Berners-Lee et al., Scientific American (2001)

Page 11: Language Domain and Concept Domain

A cluster of realizations of terms

Page 12: Automatically Generated Variants

Score   Variant                                     Count
1.000   NF kappa B                                  128
0.500   Transcription Factor NF kappa B             0
0.429   NF-kappa B                                  912
0.286   NF kB, Transcription Factor                 0
0.286   NF kB                                       0
0.286   Immunoglobulin Enhancer-Binding Protein     0
0.286   Immunoglobulin Enhancer Binding Protein     0
0.286   Enhancer-Binding Protein, Immunoglobulin    0
0.286   kappa B Enhancer Binding Protein            0
0.286   Transcription Factor NF-kB                  0
0.286   Transcription Factor NF kB                  0
0.286   Factor NF-kB, Transcription                 0
0.286   nuclear factor kappa beta                   2
0.286   NF kappaB                                   1
0.273   NF kappa B chain                            0
0.273   NF kappa B subunit                          0
0.214   Transcription Factor NF-kappa B             0
0.214   NF-kB, Transcription Factor                 0
0.214   NF-kB                                       67
0.200   Neurofibromatosis Type kappa B              0
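The slide does not say how the variants were generated or scored. As a rough illustration only, the sketch below groups surface forms that differ in hyphenation, case, word order, or the "kB" abbreviation under one normalized key. The function `normalize`, the rewrite rules, and the choice of forms are assumptions made for this example, not the GENIA method.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Map a surface form to an order-insensitive key (hypothetical rules)."""
    t = term.lower()
    t = re.sub(r"[-,]", " ", t)          # hyphens and commas become spaces
    t = t.replace("kappab", "kappa b")   # "kappaB" -> "kappa b"
    t = re.sub(r"\bkb\b", "kappa b", t)  # "kB" -> "kappa b"
    return " ".join(sorted(t.split()))   # ignore word order ("X, Y" vs "Y X")


# A few of the forms listed on the slide.
variants = [
    "NF kappa B", "NF-kappa B", "NF kB", "NF kappaB", "NF-kB",
    "Transcription Factor NF-kB", "NF-kB, Transcription Factor",
    "nuclear factor kappa beta",
]

clusters = defaultdict(list)
for v in variants:
    clusters[normalize(v)].append(v)

for key, members in clusters.items():
    print(key, "<-", members)
```

Normalization of this kind only covers spelling-level variation. Forms such as "Immunoglobulin Enhancer-Binding Protein" are true synonyms and would need a thesaurus or other background knowledge to be linked to the same concept.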

Page 14: Non-trivial Mapping

Language Domain and Knowledge Domain (motivated independently of language)

• Spelling Variants
• Synonyms
• Acronyms
• Same relations with different structures

[A] protein activates [B] (Pathway extraction)

• Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
• Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
• Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.

Retrieval using Regional Algebra: [sentence] > ([arg1_activate] > [protein])
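The query on this slide can be read as "sentences that contain an arg1-of-activate region that in turn contains a protein region". The sketch below is one possible reading, assuming that `>` means "containing", as is common in region-algebra notation. The `Region` type, the `containing` helper, and the character offsets are hypothetical and are not taken from the GENIA or Kototoi software.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int  # inclusive start offset
    end: int    # exclusive end offset


def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(outers: List[Region], inners: List[Region]) -> List[Region]:
    """The assumed `>` operator: regions of `outers` containing some region of `inners`."""
    return [o for o in outers if any(contains(o, i) for i in inners)]


# Hypothetical annotation layers over a two-sentence text.
sentence = [Region(0, 50), Region(51, 100)]
arg1_activate = [Region(0, 20)]            # subject of "activate" in sentence 1
protein = [Region(5, 12), Region(60, 70)]  # one protein mention per sentence

# [sentence] > ([arg1_activate] > [protein])
hits = containing(sentence, containing(arg1_activate, protein))
print(hits)  # [Region(start=0, end=50)]
```

Only the first sentence is returned: the second also mentions a protein, but not as the first argument of "activate". This is what lets the query match the same relation across different surface structures.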

Page 15: Predicate-argument structure

Parser based on Probabilistic HPSG (Enju)

Sentence:            The   protein   is    activated   by   it
POS tags:            DT    NN        VBZ   VBN         IN   PRP
Lexical categories:  dt    np        vp    vp          pp   np

Phrases built above the words: np, pp, vp, vp, s
Predicate-argument labels: arg1, arg2, mod
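The point of a predicate-argument structure is that active and passive sentences map to the same representation. The toy sketch below illustrates only that idea with two regular expressions. It is not how Enju works (Enju is an HPSG parser), and the `PredArg` type, the patterns, and the convention that arg1 is the activator and arg2 the thing activated are assumptions for this example.

```python
import re
from typing import NamedTuple, Optional


class PredArg(NamedTuple):
    predicate: str
    arg1: str  # assumed: logical subject (the activator)
    arg2: str  # assumed: logical object (the thing activated)


# Toy patterns for one verb only; a real parser derives this from syntax.
PASSIVE = re.compile(r"^(?P<obj>.+?) is activated by (?P<subj>.+?)\.?$", re.IGNORECASE)
ACTIVE = re.compile(r"^(?P<subj>.+?) activates (?P<obj>.+?)\.?$", re.IGNORECASE)


def to_pred_arg(sentence: str) -> Optional[PredArg]:
    """Return a normalized predicate-argument tuple, or None if no pattern matches."""
    s = sentence.strip()
    for pattern in (PASSIVE, ACTIVE):
        m = pattern.match(s)
        if m:
            return PredArg("activate", m.group("subj").lower(), m.group("obj").lower())
    return None


print(to_pred_arg("The protein is activated by it"))
print(to_pred_arg("It activates the protein"))
```

Both calls yield the same tuple, so a retrieval pattern written against arg1 and arg2 matches either surface form.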

Page 20: Term: Ribosomal large subunit assembly and maintenance

• and in its absence, deficient 60S ribosomes are assembled which are inactive in protein synthesis, resulting in cell lethality.
• Mutations that completely abolish recognition of 26S rRNA, however, block the formation of 60S particles, demonstrating that binding of L25 to this rRNA is an essential step in the assembly of the large ribosomal subunit.
• Depletion of Saccharomyces cerevisiae ribosomal protein L16 causes decrease in 60S ribosomal subunits and formation of half-mer polyribosomes.
• Without L3, apparent synthesis of several 60S subunit proteins diminished, and 60S subunit did not assemble. A similar phenomenon occurred when, in a second strain, synthesis of ribosomal protein L29 was prevented.

Page 21: Language Domain and Concept Domain

Concept Domain: Process of Ribosomal subunit assembly
Language Domain: A cluster of realizations of terms

Page 22: Information and Knowledge Exploitation System

Information and Knowledge Exploitation System as an integrated management system of raw data, semi-structured data, text and structured data base

+

Mining Tools (Task Specific Software)

Page 23: Text Archive with Feature Objects

Managing texts, data representation and their semantics

Layers: Text, Data representation, Semantics
Stores: Text DB, DB of Feature Objects (Data Base Module)

Text: Ubiquitin E is bound with

Feature object attached to a region of text:
• extent
  • textid: Text ID
  • startp: Start Position of the region
  • endp: End Position of the region
• dc
  • creator: Annotator
• content: Content (an Event with agent and binded; the agent's content is Ubiquitin)

Values shown in the diagram: wsj, 02, 10, 30, ninomi, Pr
A second content value shown in the diagram: nuclear development issue

Copy and Unification
Specialization by unification:
• content: event, type bind, agent ubiquitin
• interaction protein:

Adding more augmented information induced by inference, type restriction, unification
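"Specialization by unification" can be illustrated with feature structures represented as nested dictionaries. This is a minimal sketch under stated assumptions: the `unify` function, the use of plain dicts without a type hierarchy, and the concrete attribute values (text ID, offsets, annotator) are hypothetical, chosen only to mirror the attribute names on the slide. The actual Kototoi representation is not given in the slides.

```python
from typing import Any, Dict, Optional

FeatureStructure = Dict[str, Any]


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature values; return None when atomic values conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None          # conflict somewhere below this key
                result[key] = merged
            else:
                result[key] = b_val      # new information is simply added
        return result
    return a if a == b else None         # atoms unify only if identical


# A feature object attached to a text region (all values hypothetical).
annotation: FeatureStructure = {
    "extent": {"textid": "wsj", "startp": 10, "endp": 30},
    "dc": {"creator": "ninomi"},
    "content": {"event": {"agent": "ubiquitin"}},
}

# Specializing the object by unifying it with extra information.
specialized = unify(annotation, {"content": {"event": {"type": "bind"}}})
print(specialized["content"])  # {'event': {'agent': 'ubiquitin', 'type': 'bind'}}

# Conflicting information makes unification fail.
print(unify(specialized, {"content": {"event": {"type": "phosphorylate"}}}))  # None
```

Because unification never overwrites existing values, the original annotation is preserved and the specialized copy only gains information, which matches the "Copy and Unification" step named on the slide.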
