Extracting patient data from tables in clinical literature
Case study on extraction of BMI, weight and number of patients
Nikola Milosevic, Cassie Gregson, Robert Hernandez, Goran Nenadic
Clinical trial literature
• PubMed contains nearly 800 000 clinical trial publications
• Researchers are challenged by the amount of published literature
Help from text mining?
• Text mining provides methods to process text on a large scale
• Current text mining efforts have focused mainly on text, rather than on tables and figures
Tables in clinical documents
• A clinical trial publication contains 2.1 tables on average
• Tables often contain information about the settings and findings of experiments
Challenges for table mining
• Dense content
• Variety of layouts
• Variety of value representation formats
• Misleading visualization markup
• Lack of resources (labelled datasets)
• How to automatically make sense of tables
Aim – a case study
• Extract information about number of patients, patients' BMI and weight from tables in clinical trial literature
• A multi-layered approach to mining information from tables
– to facilitate large-scale semi-automated extraction
– curation of data stored in tables
Methodology overview
• Rule-based methodology
– Rules created based on a manual analysis of a small subset of tables
• Five processing layers
– Detection
– Functional
– Structural
– Syntactic
– Semantic
Table model
• We model 4 main types of tables
– List
– Matrix
– Super-row
– Multi-table
• Based on table dimensionality
Table types (1)
• List table
Table types (2)
• Matrix table
Table types (3)
• Super-row table
Table types (4)
• Multi-table
1. Functional analysis
• Classifies cells into functional classes
– Header
– Super-row
– Stub
– Data
• Uses heuristics based on content and position
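The slides do not list the actual heuristics, so the following is only an illustrative sketch of what content- and position-based cell classification could look like. The function name `classify_cells` and all three rules (first row is the header, a row with only its first cell filled is a super-row, the first column is the stub) are assumptions, not the authors' rules.

```python
def classify_cells(rows):
    """Label every cell of a table given as a list of rows of strings.

    Hypothetical rules, NOT taken from the slides:
      - position: every cell in the first row is a "header"
      - content:  a row where only the first cell is non-empty is a "super-row"
      - position: the first cell of any other row is a "stub"
      - everything else is "data"
    """
    labels = []
    for r, row in enumerate(rows):
        only_first_filled = bool(row[0].strip()) and not any(
            cell.strip() for cell in row[1:]
        )
        row_labels = []
        for c, _ in enumerate(row):
            if r == 0:
                row_labels.append("header")
            elif only_first_filled:
                row_labels.append("super-row")
            elif c == 0:
                row_labels.append("stub")
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels


table = [
    ["Variable", "Placebo", "Drug"],
    ["Demographics", "", ""],
    ["Age", "54", "56"],
]
labels = classify_cells(table)
```

For the small table above, the second row is labelled as a super-row because only its first cell has content, and the third row splits into one stub cell and two data cells.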
2. Structural analysis
• Determines relationships between cells
• Using cell functions and table structure, classifies the table into one of the structural table types:
– List
– Matrix
– Super-row
– Multi-table
• Based on the type, a set of rules resolves the relationships
3.1 Extracting number of patients
• Heuristic-based approach
• Searches captions, headers, cells
• In captions, 2 rules:
– n=%d
– %d Adj*(patients|participants|subjects|individuals)
– Usually the total number of patients is found
• In headers
– usually n=%d
– can be partial, needs adding up
• In cells
– stub contains a defined word or phrase
– can be partial, needs adding up
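The caption and header rules above can be sketched with regular expressions. This is a minimal sketch under stated assumptions: the function names are hypothetical, `Adj*` is approximated by up to three arbitrary words, and whitespace around `=` is tolerated, none of which the slides specify.

```python
import re

# Caption/header rule from the slide: "n=%d" (optional spaces are an assumption).
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

# Caption rule from the slide: "%d Adj*(patients|participants|subjects|individuals)".
# "Adj*" is approximated by zero to three intervening words (an assumption).
COUNT_NOUN = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,3}?"
    r"(?:patients|participants|subjects|individuals)\b",
    re.IGNORECASE,
)


def patients_from_caption(caption):
    """Return the first patient count matched in a caption, or None."""
    for pattern in (N_EQUALS, COUNT_NOUN):
        match = pattern.search(caption)
        if match:
            return int(match.group(1))
    return None


def patients_from_headers(headers):
    """Add up partial 'n=%d' counts over header cells; None if none is found."""
    counts = []
    for header in headers:
        match = N_EQUALS.search(header)
        if match:
            counts.append(int(match.group(1)))
    return sum(counts) if counts else None


caption_total = patients_from_caption(
    "Baseline characteristics of 120 obese adult patients"
)
header_total = patients_from_headers(
    ["Variable", "Placebo (n=30)", "Treatment (n=32)"]
)
```

Summing the header counts reflects the note that header values can be partial (one count per study arm) and need adding up.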
3.2 Extracting BMI
• Based on a list of trigger phrases (BMI, body mass index) and a black list (change, increase)
• Trigger words in the stub or header indicate that a BMI value may appear
• If a black-listed word is in the vicinity, the value is discarded
• Accepted values are in the range 14-40
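A minimal sketch of the trigger, black-list and range checks described above. The trigger phrases, black-listed words and the 14-40 range come from the slide; the function name `extract_bmi`, treating "in the vicinity" as "in the same stub or header label", and taking the first number in the data cell are assumptions.

```python
import re

TRIGGERS = ("bmi", "body mass index")  # trigger phrases from the slide
BLACK_LIST = ("change", "increase")    # black-listed words from the slide
BMI_MIN, BMI_MAX = 14.0, 40.0          # accepted range from the slide

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_bmi(label, data_cell):
    """Return a BMI value for a (stub/header label, data cell) pair, or None.

    Assumption: "vicinity" means the same label text; the first number in the
    data cell is taken as the candidate value.
    """
    text = label.lower()
    if not any(trigger in text for trigger in TRIGGERS):
        return None  # no trigger phrase in the label
    if any(word in text for word in BLACK_LIST):
        return None  # black-listed word next to the trigger
    match = NUMBER.search(data_cell)
    if match is None:
        return None
    value = float(match.group())
    return value if BMI_MIN <= value <= BMI_MAX else None
```

With these assumptions, `extract_bmi("BMI (kg/m2)", "27.4 (3.1)")` yields a value, while a label such as "Change in BMI" or a cell value of 52.0 is rejected.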
3.3 Extracting weights
• Based on trigger words and black lists
• Looks in the stub and header for words from the lists, and in data cells for values
• Not useful to set a range
– A person can weigh 40 – 150 kg
– In lbs: 80 – 350 lbs
– A baby can weigh 1500 – 5000 g
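One way to read the point about ranges: the three ranges quoted above overlap or sit far apart, so a bare number is often plausible under more than one unit and a single range filter excludes little. The sketch below uses only the ranges from the slide; the names `PLAUSIBLE` and `plausible_units` are hypothetical.

```python
# Ranges quoted on the slide, keyed by unit.
PLAUSIBLE = {
    "kg": (40, 150),
    "lbs": (80, 350),
    "g": (1500, 5000),
}


def plausible_units(value):
    """List the units under which a bare number falls inside a quoted range."""
    return [unit for unit, (low, high) in PLAUSIBLE.items() if low <= value <= high]


ambiguous = plausible_units(100)   # fits both the kg and the lbs range
baby = plausible_units(3200)       # fits only the gram range
```

A value of 100 is compatible with both kilograms and pounds, so the unit has to come from the stub or header text rather than from the number itself.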
Results
• Corpus contained 3573 tables in 1273 documents
• Each table had on average 80 cells
• Evaluating functional and structural processing:
– Selected 100 random tables of each type and evaluated them
• Evaluating information extraction:
– Number of patients:
• 758 contained data
• 50 random documents
– BMI and weight:
• 113 documents containing this information
Functional analysis results
Results for information extraction
• Extracting number of patients:
• Extracting weight and BMI:
Discussion
• Better-scoped values, such as BMI, can be modelled, giving better performance
• Define exhaustive white and black lists
• Variety of presentation formats and means
• Misleading markup
• However, promising results
Summary
• Large-scale table mining to harvest population details from clinical trials
• Classified tables based on layout
• Case study on clinical trial patient number, BMI and weight
• Promising performance