Extracting patient data from tables in clinical literature
Case study on extraction of BMI, weight and number of patients
Nikola Milosevic, Cassie Gregson, Robert Hernandez, Goran Nenadic
Clinical trial literature
• PubMed contains nearly 800 000 clinical trial publications
• Researchers are challenged by the amount of published literature
Help from text mining?
• Text mining provides methods to process text on a large scale
• Current text mining efforts were mainly focused on text, rather than tables and figures
Tables in clinical documents
• A clinical trial publication contains 2.1 tables on average
• Tables often contain information about the settings and findings of experiments
Challenges for table mining
• Dense content
• Variety of layouts
• Variety of value representation formats
• Misleading visualization markup
• Lack of resources (labelled datasets)
• How to automatically make sense of tables
Aim – a case study
• Extract information about the number of patients and patients' BMI and weight from tables in clinical trial literature
• A multi-layered approach to mining information from tables
  – to facilitate large-scale semi-automated extraction
  – curation of data stored in tables
Methodology overview
• Rule-based methodology
  – Rules created based on a manual analysis of a small subset of tables
• Five processing layers
  – Detection
  – Functional
  – Structural
  – Syntactic
  – Semantic
Table model
• We model 4 main types of tables
  – List
  – Matrix
  – Super-row
  – Multi-table
• Based on table dimensionality
Table types (1)
• List table
Table types (2)
• Matrix table
Table types (3)
• Super-row table
Table types (4)
• Multi-table
1. Functional analysis
• Classifies cells into functional classes
  – Header
  – Super-row
  – Stub
  – Data
• Uses heuristics based on content and position
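
Heuristics based on content and position could look roughly like the sketch below. The specific rules (first row is a header, a row whose only non-empty cell is the first one is a super-row, other first-column cells are stubs) are illustrative assumptions and not the rules used in the study; `classify_cells` is a hypothetical name.

```python
from typing import List


def classify_cells(table: List[List[str]]) -> List[List[str]]:
    """Label each cell as 'header', 'super-row', 'stub' or 'data'.

    Assumed heuristics (illustrative only):
      - cells in the first row are headers
      - a later row whose only non-empty cell is the first one is a super-row
      - remaining first-column cells are stubs
      - everything else is data
    """
    labels = []
    for r, row in enumerate(table):
        non_empty = [cell for cell in row if cell.strip()]
        # Content cue: a single non-empty cell sitting in the first column.
        only_first = len(non_empty) == 1 and row[0].strip() != ""
        row_labels = []
        for c in range(len(row)):
            if r == 0:
                row_labels.append("header")      # position cue: top row
            elif only_first and len(row) > 1:
                row_labels.append("super-row")   # spans an otherwise empty row
            elif c == 0:
                row_labels.append("stub")        # position cue: first column
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels
```

For a small table with a header row, one "Baseline" super-row and one "BMI" row, this labels the three rows as headers, super-row cells, and a stub followed by data cells.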
2. Structural analysis
• Determines relationships between cells
• Using cell functions and table structure, classifies the table into one of the structural table types:
  – List
  – Matrix
  – Super-row
  – Multi-table
• Based on the type, a set of rules resolves the relationships
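
As a rough illustration of how cell functions might drive the choice of structural type, the sketch below takes a grid of functional labels and returns one of the four types. The decision rules are assumptions made for this example (a header row reappearing after non-header rows suggests a multi-table, any super-row cell suggests a super-row table, headers plus stubs suggest a matrix, anything else is treated as a list); `table_type` is a hypothetical name.

```python
from typing import List


def table_type(labels: List[List[str]]) -> str:
    """Guess the structural type from per-cell functional labels.

    The rules below are illustrative assumptions, not the study's rules.
    """
    # Indices of rows made up entirely of header cells.
    header_rows = [
        i for i, row in enumerate(labels)
        if row and all(label == "header" for label in row)
    ]
    has_super_row = any("super-row" in row for row in labels)
    has_stub = any("stub" in row for row in labels)

    # A header row that follows a non-header row: several tables stacked.
    if any(i > 0 and (i - 1) not in header_rows for i in header_rows):
        return "multi-table"
    if has_super_row:
        return "super-row"
    if header_rows and has_stub:
        return "matrix"
    return "list"
```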
3.1 Extracting number of patients
• Heuristic-based approach
• Searches captions, headers, cells
• In captions, 2 rules:
  – n=%d
  – %d Adj*(patients|participants|subjects|individuals)
  – Usually the total number of patients is found
• In header
  – usually n=%d
  – can be partial, needs adding up
• In cells
  – stub contains a defined word or phrase
  – can be partial, needs adding up
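
The two caption rules can be approximated with regular expressions, as in the sketch below. It assumes that "Adj*" can be approximated by up to three arbitrary words, which over-matches real adjectives; the function names `patient_counts` and `total_from_headers` are hypothetical.

```python
import re
from typing import List

# Rule 1: "n=%d", allowing optional spaces around "=".
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

# Rule 2: "%d Adj*(patients|participants|subjects|individuals)".
# Assumption: adjectives are approximated by at most three arbitrary words.
COUNT_NOUN = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,3}?"
    r"(?:patients|participants|subjects|individuals)\b",
    re.IGNORECASE,
)


def patient_counts(text: str) -> List[int]:
    """Return every patient count matched by either rule in `text`."""
    counts = [int(m.group(1)) for m in N_EQUALS.finditer(text)]
    counts += [int(m.group(1)) for m in COUNT_NOUN.finditer(text)]
    return counts


def total_from_headers(headers: List[str]) -> int:
    """Add up partial counts found in column headers (one per study arm)."""
    return sum(n for header in headers for n in patient_counts(header))
```

For example, `total_from_headers(["Drug (n=40)", "Placebo (n = 38)"])` adds the two partial counts to give 78, which mirrors the "can be partial, needs adding up" case for headers.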
3.2 Extracting BMI
• Based on a list of trigger phrases (BMI, body mass index) and a black list (change, increase)
• Trigger words in the stub or header signal that a BMI value may appear
• If a black-listed word is in the vicinity, the value is discarded
• Accepted values are in the range 14-40
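
A minimal sketch of this trigger/black list/range filter follows. It assumes that "vicinity" means the same stub or header text and that the first number in the data cell is the candidate value; both are simplifications made for this example, and `extract_bmi` is a hypothetical name.

```python
import re
from typing import Optional

TRIGGERS = ("bmi", "body mass index")   # trigger phrases from the slide
BLACK_LIST = ("change", "increase")     # black-listed words from the slide
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_bmi(stub: str, header: str, cell: str) -> Optional[float]:
    """Return a BMI value from `cell`, or None if the filters reject it."""
    context = f"{stub} {header}".lower()
    if not any(trigger in context for trigger in TRIGGERS):
        return None                      # no trigger phrase nearby
    if any(word in context for word in BLACK_LIST):
        return None                      # e.g. "Change in BMI"
    match = NUMBER.search(cell)          # assumption: first number is the value
    if match is None:
        return None
    value = float(match.group())
    return value if 14 <= value <= 40 else None   # range from the slide
```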
3.3 Extracting weights
• Based on trigger words and black lists
• Looking in stub and header for words from the lists, and for values in data cells
• Not useful to set a range
  – A person can weigh 40-150 kg
  – In lbs: 80-350 lbs
  – A baby can weigh 1500-5000 g
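
The small sketch below illustrates why a single range filter is of little use for weight: the per-unit ranges listed above overlap, and together they span 40 to 5000. The ranges are taken from the slide; the helper `plausible_units` is a hypothetical name introduced for this example.

```python
from typing import List

# Per-unit ranges as listed on the slide.
WEIGHT_RANGES = {
    "kg": (40, 150),
    "lbs": (80, 350),
    "g": (1500, 5000),
}


def plausible_units(value: float) -> List[str]:
    """Return the units for which `value` falls inside the listed range."""
    return [
        unit for unit, (low, high) in WEIGHT_RANGES.items()
        if low <= value <= high
    ]
```

A value of 100 is plausible in both kg and lbs, so the number alone cannot be validated without first knowing the unit from the stub or header.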
Results
• Corpus contained 3573 tables in 1273 documents
• Each table has on average 80 cells
• Evaluating functional and structural processing:
  – Selected 100 random tables of each type and evaluated them
• Evaluating information extraction:
  – Number of patients:
    • 758 contained data
    • 50 random documents
  – BMI and weight:
    • 113 documents containing this information
Functional analysis results
Results for information extraction
• Extracting number of patients:
• Extracting weight and BMI:
Discussion
• Better-scoped values, such as BMI, can be modelled, leading to better performance
• Define exhaustive white and black lists
• Variety of presentation formats and means
• Misleading markup
• However, results are promising
Summary
• Large-scale table mining to harvest population details from clinical trials
• Classified tables based on layout
• Case study on clinical trial patient number, BMI and weight
• Promising performance