Intelligent Information Directory System for Clinical Documents Qinghua Zou 6/3/2005 Dr. Wesley W. Chu (Advisor)

Page 1: Intelligent Information Directory System for Clinical Documents

Intelligent Information Directory System for Clinical Documents

Qinghua Zou

6/3/2005

Dr. Wesley W. Chu (Advisor)

Page 2: Intelligent Information Directory System for Clinical Documents

When searching clinical reports

Keyword search problems:
• Hard to compose good keywords
• Lacks an overview of the content
• Interchangeable words

Page 3: Intelligent Information Directory System for Clinical Documents

Intelligent Directory System

1. Overview
2. Extracting Key Concepts
3. Mining Topics
4. Building Directories
5. Searching
6. Conclusion

Page 4: Intelligent Information Directory System for Clinical Documents

1. System Overview

Page 5: Intelligent Information Directory System for Clinical Documents

2. Concept Extraction

2.1 Introduction
2.2 Our approach: IndexFinder
  • Index Phase (Offline)
  • Search Phase (Real Time)
2.3 Experiments
2.4 Summary

Page 6: Intelligent Information Directory System for Clinical Documents

2.1 Motivation

• Clinical texts are valuable in medical practice
  • Search relevant reports
  • Search similar patients
• What is key information? UMLS provides key medical concepts
• Our goal: extract UMLS concepts from clinical texts

Clinical texts
• Extract key info.
• Standard terms

Page 7: Intelligent Information Directory System for Clinical Documents

2.1 Previous Approaches

[Figure: free text → NLP parser (parse tree of "lambs will eat oats") → noun phrases (lambs, oats) → mapping against UMLS → UMLS concepts]

Page 8: Intelligent Information Directory System for Clinical Documents

2.1 Problems of Previous Approaches

• Concepts cannot be discovered if they are not in a single noun phrase.
  • E.g., in “second, third, and fourth ribs”, “second rib” cannot be discovered.
• Difficult to scale to large volumes of text.
  • Natural language processing requires significant computing resources.

Page 9: Intelligent Information Directory System for Clinical Documents

2.2 Our Approach: IndexFinder

• Previous: free text → UMLS (free text → NLP parser → noun phrases → mapping → UMLS concepts)
• Our approach: UMLS → free text
• Suppose UMLS contains only “lung cancer”. We would discard all words in the text except “lung” and “cancer”.

[Figure: index phase (offline): UMLS (2 GB) → indexing → index data (~80 MB). Search phase (real time): free text → extracting → filtering → concepts]

Page 10: Intelligent Information Directory System for Clinical Documents

2.2 Our Approach: What’s New?

• Knowledge-based approach
• Uses the compact index data, without using any database system
• Permutes words in a sentence to generate UMLS concept candidates
• Uses filters to eliminate irrelevant concepts

Page 11: Intelligent Information Directory System for Clinical Documents

2.2 Concept Candidates Generation

Assumptions
• The knowledge base provides a phrase table.
• Each phrase (concept) is a set of words.
• An input text T is represented as a set of words.

Goal
• Combine words in T to generate concept candidates.

Example: T = {D, E, F}

Answer: 5
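The candidate generation described above can be sketched as follows. This is a minimal illustration, not the IndexFinder implementation: the phrase table `PHRASES`, its concept ids, and the function names `build_index` and `candidates` are all hypothetical. It treats each phrase and the input text as sets of words and reports every phrase whose words all occur in the text.

```python
from collections import defaultdict

def build_index(phrase_table):
    """Index phase (offline): map each word to the ids of the phrases containing it."""
    index = defaultdict(set)
    for pid, words in phrase_table.items():
        for w in words:
            index[w].add(pid)
    return index

def candidates(text_words, phrase_table, index):
    """Search phase: ids of phrases whose words all occur in text_words."""
    hits = defaultdict(int)
    for w in set(text_words):
        for pid in index.get(w, ()):
            hits[pid] += 1  # one more word of this phrase was seen in the text
    return {pid for pid, n in hits.items() if n == len(phrase_table[pid])}

# Hypothetical phrase table: concept id -> set of words (an assumption for this sketch)
PHRASES = {
    "c1": frozenset({"lung", "cancer"}),
    "c2": frozenset({"lung"}),
    "c3": frozenset({"second", "rib"}),
    "c4": frozenset({"heart", "disease"}),
}
INDEX = build_index(PHRASES)

found = candidates({"cancer", "of", "the", "left", "lung"}, PHRASES, INDEX)
print(sorted(found))  # ['c1', 'c2']
```

Because only words that appear in the index are ever looked up, all other words in the text are discarded for free, which matches the "UMLS → free text" direction of the approach.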

Page 12: Intelligent Information Directory System for Clinical Documents

2.2 Search Phase: Filtering

Use filters to eliminate irrelevant concepts.
• Syntactic filter: word combination is limited to within a sentence.
• Semantic filter:
  • Filter out irrelevant concepts using semantic types (e.g., body part, disease, treatment, diagnosis).
  • Filter out general concepts using the ISA relationship and keep the more specific ones.
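A minimal sketch of the two semantic filters, under stated assumptions: the candidate names, the `SEM_TYPE` map, and the `ISA` map below are hypothetical stand-ins for what a knowledge base would supply, and the function names are illustrative. The syntactic filter is not shown; it would amount to generating candidates one sentence at a time.

```python
def semantic_filter(cands, sem_type, wanted):
    """Keep only candidates whose semantic type is one of the wanted types."""
    return {c for c in cands if sem_type.get(c) in wanted}

def ancestors(concept, isa):
    """All concepts reachable from `concept` by following ISA links upward."""
    seen, stack = set(), [concept]
    while stack:
        for parent in isa.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def keep_most_specific(cands, isa):
    """Drop any candidate that is an ISA ancestor of another candidate."""
    general = set()
    for c in cands:
        general |= ancestors(c, isa)
    return {c for c in cands if c not in general}

# Hypothetical knowledge (assumptions for this sketch)
SEM_TYPE = {"lung cancer": "disease", "cancer": "disease",
            "lung": "body part", "left": "spatial concept"}
ISA = {"lung cancer": ["cancer"]}

cands = {"lung cancer", "cancer", "lung", "left"}
typed = semantic_filter(cands, SEM_TYPE, {"disease", "body part"})
print(sorted(keep_most_specific(typed, ISA)))  # ['lung', 'lung cancer']
```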

Page 13: Intelligent Information Directory System for Clinical Documents

2.3 Experiment: Comparison with MetaMap [3]

Input: A small mass was found in the left hilum of the lung.

[Figure: concepts extracted from this input by MetaMap and by IndexFinder]

Page 14: Intelligent Information Directory System for Clinical Documents

2.4 Summary

• An efficient method that maps from UMLS to free text for extracting concepts, without using any database system.
• Syntactic and semantic filters are used to eliminate irrelevant candidates.
• IndexFinder is able to find more specific concepts than NLP approaches.
• IndexFinder is scalable and can be operated in real time.

Page 15: Intelligent Information Directory System for Clinical Documents

3. Mining Topics: SmartMiner

3.1 Introduction
3.2 Search Space
3.3 SmartMiner
3.4 Experiment
3.5 Summary

Page 16: Intelligent Information Directory System for Clinical Documents

3.1 Introduction

• A topic (assumption): a set of concepts, a frequent pattern
• Finding topics by data mining: frequent patterns, or maximal frequent patterns
• Requires efficient data mining

Page 17: Intelligent Information Directory System for Clinical Documents

3.1 Data Mining Problem

Dataset (id: item set), MinSup = 2

id   item set
1    a b c d e
2    a b c d
3    b c d
4    b e
5    c d e

What itemsets are frequent itemsets (FI)?

a, b, c, d, e, ab, ac, ad, bc, bd, be, cd, ce, de, abc, abd, acd, bcd, cde, abcd

Maximal frequent itemset (MFI): no superset is frequent.

MFI: abcd, be, cde
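The example on this slide can be checked with a brute-force sketch. This is only an illustration of the definitions, not SmartMiner or any efficient miner: it enumerates every itemset, which is exponential in the number of items. The function names are hypothetical.

```python
from itertools import combinations

# Dataset from the slide: transaction id -> items
DATASET = {1: "abcde", 2: "abcd", 3: "bcd", 4: "be", 5: "cde"}
MIN_SUP = 2

def support(itemset, dataset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for items in dataset.values() if set(itemset) <= set(items))

def frequent_itemsets(dataset, min_sup):
    """All non-empty itemsets whose support is at least min_sup (brute force)."""
    items = sorted(set().union(*dataset.values()))
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(c, dataset) >= min_sup
    }

def maximal(itemsets):
    """Itemsets that have no proper superset in the same collection."""
    return {s for s in itemsets if not any(s < t for t in itemsets)}

fi = frequent_itemsets(DATASET, MIN_SUP)
mfi = maximal(fi)
print(len(fi))                                  # 20
print(sorted("".join(sorted(s)) for s in mfi))  # ['abcd', 'be', 'cde']
```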

Page 18: Intelligent Information Directory System for Clinical Documents

3.1 Why MFI not FI?

• Mining FI is infeasible when long FIs exist. E.g., suppose we have a 20-item frequent set a1 a2 … a20. All of its subsets are frequent, i.e., 2^20 = 1,048,576 itemsets.
• Mining MFI is fast, and we can generate all the FI from the MFI.
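A short sketch of why the MFI are enough: every frequent itemset is a subset of some maximal one, so the FI can be regenerated by enumerating subsets. The function name `expand` is hypothetical, and the sketch recovers only the itemsets, not their support counts.

```python
from itertools import combinations

def expand(mfi):
    """All non-empty subsets of the given maximal frequent itemsets."""
    out = set()
    for m in mfi:
        items = sorted(m)
        for k in range(1, len(items) + 1):
            out.update(frozenset(c) for c in combinations(items, k))
    return out

# MFI of the five-transaction example: abcd, be, cde
all_fi = expand(["abcd", "be", "cde"])
print(len(all_fi))  # 20

# Size of the subset lattice of a single 20-item frequent set
print(2 ** 20)      # 1048576
```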

Page 19: Intelligent Information Directory System for Clinical Documents

3.1 Previous work

• Superset checking: a study shows that the CPU spends 40% of its time on superset checking.
• The search tree is too large: a large number of support counts.
• A more efficient method is needed.

Page 20: Intelligent Information Directory System for Clinical Documents

3.2 Search space

Given 5 items: a, b, c, d, e. What is the search space?

Ø, a, b, c, d, e, ab, ac, ad, ae, bc, …, abcde

We use “head:tail” to denote the space as Ø:abcde, simplified to :abcde.

What is the space of ab:cd?

ab, abc, abd, abcd

Page 21: Intelligent Information Directory System for Clinical Documents

3.2 Space decomposition

For a space :abcde, if abcg is frequent, then:
• Known space: any subset of abc is frequent, so the known space is :abc.
• Unknown space: any itemset that contains d or e, i.e., d:abce and e:abc.

:abcde = d:abce + e:abc + :abc
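The head:tail notation and the decomposition above can be checked with a small sketch. The helper names `space` and `show` are hypothetical; `space(head, tail)` enumerates every itemset formed by the head plus any subset of the tail.

```python
from itertools import combinations

def space(head, tail):
    """All itemsets head + X, where X is any subset of tail (X may be empty)."""
    return {
        frozenset(head) | frozenset(extra)
        for k in range(len(tail) + 1)
        for extra in combinations(tail, k)
    }

def show(itemsets):
    """Readable listing, ordered by size and then alphabetically."""
    return sorted(("".join(sorted(s)) for s in itemsets), key=lambda s: (len(s), s))

print(show(space("ab", "cd")))  # ['ab', 'abc', 'abd', 'abcd']

# :abcde = d:abce + e:abc + :abc
whole = space("", "abcde")
parts = [space("d", "abce"), space("e", "abc"), space("", "abc")]
assert set().union(*parts) == whole                    # the parts cover the space
assert sum(len(p) for p in parts) == len(whole) == 32  # and they do not overlap
```

The three parts are disjoint because the first contains exactly the itemsets with d, the second those with e but not d, and the third those with neither.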

Page 22: Intelligent Information Directory System for Clinical Documents

3.3 The basic idea

SmartMiner takes advantage of the information from previous steps.

[Figure: (a) Previous approach: under node A1, B2 is created before B1 is explored. (b) SmartMiner strategy: under node A1, B’ is created after B1 is explored, using information from B1 to prune the space at B’.]

Page 23: Intelligent Information Directory System for Clinical Documents

3.3 The tail information

For the space :abcde, if we know abcf, abcg and abfg are frequent, then we project them onto the space: abcf → abc, abcg → abc, abfg → ab.

Thus Tinf(abcf, abcg, abfg | :abcde) = {abc}
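The projection in this example can be sketched as below. Two assumptions are made that the slide only implies: the space has an empty head, and only the maximal projections are kept (which is why ab is dropped in favor of abc). The function name `tail_info` is hypothetical.

```python
def tail_info(known_frequent, space_items):
    """Project known frequent itemsets onto a space and keep the maximal projections."""
    items = frozenset(space_items)
    projected = {frozenset(k) & items for k in known_frequent}
    projected.discard(frozenset())  # an empty projection carries no information
    return {p for p in projected if not any(p < q for q in projected)}

tinf = tail_info(["abcf", "abcg", "abfg"], "abcde")
print(sorted("".join(sorted(s)) for s in tinf))  # ['abc']
```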

Page 24: Intelligent Information Directory System for Clinical Documents

3.4 Running time on Mushroom

[Figure: total time (sec) versus minimum support (%), from 10 down to 0.01, for SmartMiner, GenMax, and Mafia]

Page 25: Intelligent Information Directory System for Clinical Documents

3.5 Summary

SmartMiner uses tail information to guide the mining. It is efficient because it has:
• A smaller search tree.
• No superset checking.
• Fewer support counts.

Page 26: Intelligent Information Directory System for Clinical Documents

4. Building Directories

4.1 Introduction
4.2 Knowledge Hierarchies
4.3 User Specification
4.4 Directory Generation
4.5 Integrating various directories
4.6 Summary

Page 27: Intelligent Information Directory System for Clinical Documents

4.1 Introduction

Three inputs:
• Topics → key content
• Knowledge trees → meaningful
• User specs → customized

Page 28: Intelligent Information Directory System for Clinical Documents

4.2 Knowledge Hierarchies

UMLS concept hierarchies
• PA: parent-child relationship
• RA: rather-than relationship

Problems
• A concept has several parents, at different granularity:
  • [lung cancer] → [Neoplasms, Respiratory Tract]
  • [lung cancer] → [Neoplasms, Respiratory System]
• A concept has hundreds of paths to roots:
  • [lung cancer]: 233 different paths in UMLS by PA

Page 29: Intelligent Information Directory System for Clinical Documents

4.2 Select Proper Hierarchies

• Set a source preference order, e.g.,
  • [disease]: ICD9 > SNOMED > MeSH
  • [body part]: SNOMED > ICD9
• Select the proper granularity
  • C: a set of concepts; n: a path node
  • Score function for selecting the node n: S(n) = |{ci | ci → n, ci in C}|
• Expert review
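A minimal sketch of the score function, reading S(n) as the number of concepts in C whose path to the root passes through node n. That reading is an assumption, since the relation between ci and n is not spelled out on the slide. The concept and node names, and the function name `node_scores`, are hypothetical.

```python
from collections import Counter

def node_scores(paths):
    """paths: concept -> nodes on its chosen path to the root. Returns S(n) per node."""
    scores = Counter()
    for concept, path in paths.items():
        for node in set(path):  # count each concept at most once per node
            scores[node] += 1
    return scores

# Hypothetical concept set C with one selected path per concept
PATHS = {
    "c1": ["n1", "root"],
    "c2": ["n1", "root"],
    "c3": ["n2", "root"],
}
scores = node_scores(PATHS)
print(scores["n1"], scores["n2"], scores["root"])  # 2 1 3
```

How the scores are then turned into a choice of granularity is not stated on the slide, so the sketch stops at computing them.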

Page 30: Intelligent Information Directory System for Clinical Documents

4.3 User Specifications

• A good directory ~ usage pattern
• User spec → usage pattern
• Users may have different specs
• A spec: a series of knowledge names
  • [disease] + [body part], or
  • [body part] + [disease]
• Build a directory for a spec by the ordering

Page 31: Intelligent Information Directory System for Clinical Documents

4.4 Directory Generation

An example

User spec 1: d + p [disease] + [body part]

User spec 2: p + d [body part] + [disease]
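How the ordering in a spec shapes a directory can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithm from the talk: each document carries a single topic with one concept per knowledge name, the knowledge hierarchies are ignored, and the document ids, concept labels, and the `_docs` key are hypothetical.

```python
def build_directory(docs, spec):
    """docs: doc id -> {knowledge name: concept}; spec: ordered knowledge names."""
    root = {}
    for doc_id, topic in docs.items():
        node = root
        for name in spec:  # descend one directory level per name in the spec
            node = node.setdefault(topic[name], {})
        node.setdefault("_docs", []).append(doc_id)
    return root

# Hypothetical documents (assumptions for this sketch)
DOCS = {
    "doc1": {"disease": "d1", "body part": "p1"},
    "doc2": {"disease": "d1", "body part": "p2"},
    "doc3": {"disease": "d2", "body part": "p1"},
}

d_then_p = build_directory(DOCS, ["disease", "body part"])  # spec 1: d + p
p_then_d = build_directory(DOCS, ["body part", "disease"])  # spec 2: p + d
print(sorted(d_then_p))        # ['d1', 'd2']
print(sorted(p_then_d["p1"]))  # ['d1', 'd2']
```

The same three documents land in differently shaped trees, which is the point of letting the user spec drive the ordering.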

Page 32: Intelligent Information Directory System for Clinical Documents

4.4 ~ An example

[Figure: the directories generated for d + p and for p + d]

Page 33: Intelligent Information Directory System for Clinical Documents

4.4 ~ Algorithm

Page 34: Intelligent Information Directory System for Clinical Documents

4.5 Integrating various directories

• For each Di, get all directory paths to Di
• A Di is a tree: XML
• Keywords can be associated with tree nodes
• Query: XPath
• Redundant information exists

Page 35: Intelligent Information Directory System for Clinical Documents

4.5 Simplified model

• Keep only the first-level knowledge trees
• For //d6//p6, we use the XPath query //doc[.//d6 and .//p6]
• Smaller size, but requires some computation
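A small sketch of the simplified-model lookup, using Python's standard `xml.etree.ElementTree`. Its XPath subset has no `and` in predicates, so the two conditions are tested in Python, each relative to one `doc` element. The XML layout, the `disease` and `bodypart` element names, and the `docs_with` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: each doc keeps only its first-level knowledge trees
XML = """
<docs>
  <doc id="1"><disease><d1><d6/></d1></disease><bodypart><p2><p6/></p2></bodypart></doc>
  <doc id="2"><disease><d1><d6/></d1></disease><bodypart><p3/></bodypart></doc>
  <doc id="3"><disease><d2/></disease><bodypart><p2><p6/></p2></bodypart></doc>
</docs>
"""

def docs_with(root, *tags):
    """Ids of doc elements that have a descendant element for every given tag."""
    return [
        doc.get("id")
        for doc in root.iter("doc")
        if all(doc.find(f".//{tag}") is not None for tag in tags)
    ]

root = ET.fromstring(XML)
print(docs_with(root, "d6", "p6"))  # ['1']
```

The extra computation mentioned on the slide shows up here as the per-document test over two separate trees instead of a single stored path.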

Page 36: Intelligent Information Directory System for Clinical Documents

4.6 Summary

• Build directories from:
  • Topics
  • Knowledge hierarchies
  • User specifications
• Map directories to XML
  • By collecting directory paths for each document
  • Leverage existing XML technologies