Extracting patient data from tables in clinical literature

Case study on extraction of BMI, weight and number of patients

Nikola Milosevic, Cassie Gregson, Robert Hernandez, Goran Nenadic

Clinical trial literature

• PubMed contains nearly 800 000 clinical trial publications

• Researchers challenged with the amount of published literature

Help from text mining?

• Text mining provides methods to process text on a large scale

• Text mining efforts so far have focused mainly on text, rather than on tables and figures

Tables in clinical documents

• A clinical trial publication contains 2.1 tables on average

• Tables often contain information about the settings and findings of experiments

Challenges for table mining

• Dense content

• Variety of layouts

• Variety of value representation formats

• Misleading visualization markup

• Lack of resources (labelled datasets)

• How to automatically make sense of tables

Aim – a case study

• Extract information about the number of patients, patients' BMI and weight from tables in clinical trial literature

• A multi-layered approach to mining information from tables
– to facilitate large-scale semi-automated extraction
– curation of data stored in tables

Methodology overview

• Rule-based methodology
– Rules created based on a manual analysis of a small subset of tables

• Five processing layers
– Detection
– Functional
– Structural
– Syntactic
– Semantic

Table model

• We model 4 main types of tables
– List
– Matrix
– Super-row
– Multi-table

• Based on table dimensionality

Table types (1)

• List table

Table types (2)

• Matrix table

Table types (3)

• Super-row table

Table types (4)

• Multi-table

1. Functional analysis

• Classifies cells into functional classes
– Header
– Super-row
– Stub
– Data

• Uses heuristics based on content and position
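The slides do not list the actual heuristics, so the following is only a minimal sketch of what content- and position-based cell classification could look like. Every rule in it is an assumption made for illustration: the first row is treated as the header, a row whose only non-empty cell is the first one is treated as a super-row, and in all other rows the first cell is the stub. The function name `classify_cells` is hypothetical.

```python
def classify_cells(rows, header_rows=1):
    """Label each cell as 'header', 'super-row', 'stub' or 'data'.

    `rows` is a rectangular list of rows, each a non-empty list of strings.
    The rules below are illustrative assumptions, not the rules of the
    original system.
    """
    labels = []
    for r, row in enumerate(rows):
        non_empty = [c for c in row if c.strip()]
        if r < header_rows:
            # Position heuristic: the top row(s) hold column headers.
            labels.append(["header"] * len(row))
        elif len(row) > 1 and len(non_empty) == 1 and row[0].strip():
            # Content heuristic: only the first cell is filled in.
            labels.append(["super-row"] * len(row))
        else:
            # Position heuristic: first column is the stub, the rest is data.
            labels.append(["stub"] + ["data"] * (len(row) - 1))
    return labels


table = [
    ["", "Placebo", "Drug"],
    ["Men", "", ""],
    ["BMI", "27.1", "26.8"],
]
labels = classify_cells(table)
```

A real system would need further cues (markup, typography, numeric content) because, as noted under the challenges, visual markup in published tables is often misleading.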

2. Structural analysis

• Determines relationships between cells

• Using cell functions and table structure, classifies the table into one of the structural table types
– List
– Matrix
– Super-row
– Multi-table

• Based on the type, a set of rules resolves the relationships
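To make "resolving the relationships" concrete, here is a small sketch for a super-row table. It assumes a hypothetical layout in which row 0 is the header, column 0 is the stub, and a row with only its first cell filled in is a super-row that applies to the rows beneath it. The function `resolve_cells` and these layout rules are illustrative assumptions, not the rule set described in the talk.

```python
def resolve_cells(rows):
    """Yield (super_row, stub, header, value) for every non-empty data cell.

    Assumes row 0 is the header and column 0 is the stub (hypothetical
    layout chosen for this sketch).
    """
    header = rows[0]
    super_row = None
    for row in rows[1:]:
        # A row with text only in its first cell opens a new super-row group.
        if row[0].strip() and not any(c.strip() for c in row[1:]):
            super_row = row[0].strip()
            continue
        for col in range(1, len(row)):
            if row[col].strip():
                yield (super_row, row[0].strip(), header[col].strip(), row[col].strip())


table = [
    ["", "Placebo", "Drug"],
    ["Men", "", ""],
    ["BMI", "27.1", "26.8"],
]
resolved = list(resolve_cells(table))
```

Each data cell ends up linked to the navigational cells that give it meaning, which is what the later syntactic and semantic layers need in order to interpret the value.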

3.1 Extracting number of patients

• Heuristic-based approach

• Searches captions, headers, cells

• In captions, 2 rules:
– n=%d
– %d Adj*(patients|participants|subjects|individuals)
– Usually the total number of patients is found

• In header
– usually n=%d
– can be partial, needs adding up

• In cells
– stub contains a defined word or phrase
– can be partial, needs adding up
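The two caption rules and the header rule map naturally onto regular expressions. The sketch below is one possible reading of them, not the original implementation: "Adj*" is approximated by up to three arbitrary words, whitespace around "=" is tolerated, and the function names are made up for this example.

```python
import re

# Rule 1: "n=%d" (tolerating spaces around "=" is an assumption).
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

# Rule 2: "%d Adj*(patients|participants|subjects|individuals)".
# "Adj*" is approximated by zero to three arbitrary words (an assumption).
COUNT_NOUN = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,3}?"
    r"(?:patients|participants|subjects|individuals)\b",
    re.IGNORECASE,
)


def counts_in_caption(caption):
    """Return every patient count matched by either caption rule."""
    found = [int(m.group(1)) for m in N_EQUALS.finditer(caption)]
    found += [int(m.group(1)) for m in COUNT_NOUN.finditer(caption)]
    return found


def total_from_headers(headers):
    """Add up partial 'n=%d' counts found in column headers (one per arm)."""
    return sum(int(m.group(1)) for h in headers for m in N_EQUALS.finditer(h))


caption_counts = counts_in_caption("Baseline characteristics of 120 obese adult patients")
header_total = total_from_headers(["Placebo (n=40)", "Treatment (n = 42)"])
```

Summing header counts reflects the "can be partial, needs adding up" note: each column header typically reports the size of one study arm rather than the whole population.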

3.2 Extracting BMI

• Based on a list of trigger phrases (BMI, body mass index) and a black list (change, increase)

• Trigger words in the stub or header indicate that a BMI value may appear

• If a black-listed word is in the vicinity, the value is discarded

• Values are expected in the range 14-40
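The trigger list, black list and range check can be combined into a small filter. This is a sketch under stated assumptions: "vicinity" is interpreted as the stub and header text of the same cell, the first number in the data cell is taken as the candidate value, and `extract_bmi` is a hypothetical name.

```python
import re

# Trigger phrases, black-listed words and range are taken from the slide.
TRIGGER = re.compile(r"\b(?:bmi|body mass index)\b", re.IGNORECASE)
# Matching word prefixes ("changes", "increased") is an assumption.
BLACKLIST = re.compile(r"\b(?:change|increase)", re.IGNORECASE)
BMI_MIN, BMI_MAX = 14.0, 40.0

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_bmi(stub, header, cell):
    """Return the BMI value in `cell`, or None if the rules reject it."""
    label = f"{stub} {header}"
    if not TRIGGER.search(label):
        return None  # no trigger phrase in stub or header
    if BLACKLIST.search(label):
        return None  # e.g. "Change in BMI" is a difference, not a BMI
    match = NUMBER.search(cell)
    if match is None:
        return None
    value = float(match.group())
    return value if BMI_MIN <= value <= BMI_MAX else None


bmi = extract_bmi("BMI (kg/m2)", "Placebo (n=40)", "27.4 ± 3.1")
```

The range check acts as a last safeguard: a number next to a BMI label that falls outside 14-40 is more likely a count, a percentage or a p-value than a BMI.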

3.3 Extracting weights

• Based on trigger words and black lists

• Looks in the stub and header for words from the lists, and in data cells for values

• Not useful to set a range
– A person can weigh 40 – 150 kg
– In lbs: 80 – 350 lbs
– A baby can weigh 1500 – 5000 g
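A short check illustrates why a single numeric range does not help for weight. Using only the three ranges quoted above, the sketch lists every unit under which a bare number would be plausible. The dictionary and function names are made up for this example.

```python
# Plausible ranges quoted on the slide, keyed by unit.
PLAUSIBLE = {"kg": (40, 150), "lbs": (80, 350), "g": (1500, 5000)}


def plausible_units(value):
    """Return every unit under which `value` would be a plausible weight."""
    return [unit for unit, (low, high) in PLAUSIBLE.items() if low <= value <= high]


ambiguous = plausible_units(100)   # plausible both as kg and as lbs
newborn = plausible_units(3200)    # plausible only as grams
```

Because the ranges overlap and together span 40 to 5000, a range filter on the bare number would accept a large share of the numeric cells in a table, so the trigger words and black lists have to carry the decision.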

Results

• Corpus contained 3573 tables in 1273 documents

• Each table has 80 cells on average

• Evaluating functional and structural processing:
– Selected 100 random tables of each type and evaluated them

• Evaluating information extraction:
– Number of patients:
  • 758 contained data
  • 50 random documents
– BMI and weight:
  • 113 documents containing this information

Functional analysis results

Results for information extraction

• Extracting number of patients:

• Extracting weight and BMI:

Discussion

• Better-scoped values, such as BMI, can be modelled more precisely, which gives better performance

• Define exhaustive white and black lists

• Variety of presentation formats and means

• Misleading markup

• However, promising results

Summary

• Large-scale table mining to harvest population details from clinical trials

• Classified tables based on layout

• Case study on clinical trial patient number, BMI and weight

• Promising performance

Nikola Milosevic
