Worldwide Safety Strategy
A Hit-Miss Model for Duplicate Detection in the WHO Drug Safety Database
Andrew Bate
Senior Director, Analytics Team Lead, Epidemiology, Worldwide Safety Strategy
Person Validation and Entity Resolution Conference
Washington DC, May 23, 2011
Acknowledgements
• This research was wholly funded by the WHO Collaborating Centre for International Drug Monitoring
• I was at the time an employee of the WHO Centre
• This presentation is my current opinion of the completed research
• Co-authors Niklas Norén and Roland Orre played an instrumental role in this research; Niklas Norén developed many of these slides
• For more information please contact:
Overview
• Background
– Post marketing safety surveillance
– WHO Programme for International Drug Monitoring
– The problem of duplicate reports
• Method for detecting duplicate reports
• Results
• Concluding remarks
The WHO International Drug Monitoring Programme, 2004
The WHO International Drug Monitoring Programme
• Aim to discover suspected adverse drug reactions (ADRs) not identified in clinical trials, when drugs are on the market
• Collect reports from healthcare professionals and consumers internationally on suspected ADR incidents in clinical practice
• Run by WHO Collaborating Centre, Sweden
• Analysis based on a combination of quantitative methods for exploratory data analysis and expert clinical review
Spontaneous reporting limitations
• Clinical information on reports is often limited, and satisfactory secondary case evaluation is not always possible
• Not all ADRs that occur will be recognized as drug induced by a healthcare professional
• Even those that are suspected will not necessarily be reported
• Suspicion can mistakenly rest on the drug, resulting in coincidental spontaneous ADR case reports
• Control information is not collected as part of spontaneous reporting systems, drug use is not known, and there is no direct information on disease incidence
Ref: Bate et al. 2008, FCP
The WHO database of suspected drug side effects
• Strengths
– Database size (>6 million case reports, 200+ fields), now more than 1 million per year
– International coverage since 1967
– Reporting of all marketed drugs from 100+ countries
• Spontaneous reporting remains the data primarily used for post-marketing identification of suspected ADRs
Quantitative signal detection
• Detect potential signals for further investigation that are not readily recognisable on a single case report nor otherwise readily apparent at case entry
• Enhance rather than replace other methods of signal detection
– Clinical review remains critical
• Methods assume independence between reports
Duplicate case reports
• Unlinked case reports related to the same ADR incident: ‘duplicates’
• Duplication may be due to:
– Different reporting sources (health professionals, national authorities, different companies) having provided separate case reports related to the same incident
– Mistakes in linking follow-up case reports to the earlier database records
Problem extent
• Case report duplication is one of the most important data quality problems in post-marketing drug safety data, and therefore limits ADR identification capability
• There was no published research on methods for automated duplicate detection in this type of data
• No studies on how common duplicates really are (studies on vaccine ADR data suggest 5%, but for specific case series, rates around 20% have been reported)
Data extraction
• Hundreds of possible record fields for each case report (administrative information, incident information, patient information) but most case reports carry little information
• Anonymised data (but patient age and gender may be available)
• The following record fields are used: age, gender, country, date, drug substances, ADRs, outcome.
• Note: no free-text fields are used, as free text is rarely included in the anonymised reports entered into the WHO database
Duplicate case report characteristics
• Typically more similar than other record pairs
• Sometimes VERY different
• Great variety of discrepancies – no “safe” record fields
• Missing data can complicate things
Example: impact of missing data
• Consider the following two case reports:

Patient age | Patient gender | Country | Drug substances | ADR terms | Onset date | Outcome
?           | ?              | USA     | Bactrim         | Rash      | ?          | ?
?           | ?              | USA     | Bactrim         | Rash      | ?          | ?

• Likely duplicates?
• Identical case reports, but too little information for the evidence to be considered strong!
The hit-miss model (Copas & Hilton, 1990)
• Compare the probability of a certain matching event under the assumption that the two records are related, to the same probability under the assumption that they are independent
• Under additional assumption of independence between record fields, the weights for the different record fields can be added to provide an overall match score
• Hit-miss model provides model for P(x,y) – the probability for different matching events between related records
$$W_{xy} = \log \frac{P(x, y)}{P(x)\,P(y)}$$
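As a minimal sketch of the weight defined above, the following Python computes the log ratio for one record field and adds the weights across fields. The base-2 logarithm, the function names and the example probabilities are all assumptions made for illustration; the slides do not specify them.

```python
import math


def match_weight(p_joint, p_x, p_y):
    """Weight for one record field: log2 of P(x, y) / (P(x) * P(y)).

    p_joint: probability of the value pair (x, y) on two related records
    p_x, p_y: probabilities of x and y on independent records
    The base-2 logarithm is an assumption of this sketch.
    """
    return math.log2(p_joint / (p_x * p_y))


def match_score(per_field_probs):
    """Add the field weights, which assumes the record fields are independent."""
    return sum(match_weight(pj, px, py) for pj, px, py in per_field_probs)


# Hypothetical probabilities for two record fields.
fields = [
    (0.02, 0.05, 0.05),   # agreement on a fairly rare value: ratio 8
    (0.001, 0.10, 0.04),  # disagreement: ratio 0.25
]
score = match_score(fields)
print(round(score, 3))
```

A ratio above 1 gives a positive weight (evidence for a duplicate), and a ratio below 1 gives a negative weight (evidence against).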
Hit-miss model weights
• Matches receive positive weights (greater rewards for matches on rare events)
• Mismatches receive negative weights (greater penalties for mismatches in record field with few errors in training data)
• Record fields for which at least one of the records has missing data receive weight 0
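The three weight rules above can be illustrated with a small Python sketch. It assumes one simple hit-miss parameterisation, which is not spelled out on the slides: each recorded value is the true value with probability a (hit), a random draw from the field's value distribution beta with probability b (miss), or blank with probability c, where a + b + c = 1. Under that assumption a match on value j has weight log2(1 + r(1 - beta_j)/beta_j) and a mismatch has weight log2(1 - r), with r = (a/(1 - c))^2. All names and parameter values are hypothetical.

```python
import math


def hit_miss_weight(x, y, beta, a, c):
    """Weight for one categorical field under a simple hit-miss parameterisation.

    beta: dict mapping each value to its relative frequency among non-missing values
    a: probability that a recorded value is a hit (the true value)
    c: probability that a value is blank; the miss probability is b = 1 - a - c
    """
    if not 0 < a < 1 - c:
        raise ValueError("need 0 < a < 1 - c so that the miss probability is positive")
    if x is None or y is None:
        return 0.0  # blank on either record: no evidence for or against
    r = (a / (1 - c)) ** 2
    if x == y:
        # Match: the rarer the value, the larger the reward.
        return math.log2(1 + r * (1 - beta[x]) / beta[x])
    # Mismatch: the fewer misses expected (a close to 1 - c), the larger the penalty.
    return math.log2(1 - r)


def score_pair(rec1, rec2, params):
    """Add field weights; params maps field name -> (beta, a, c)."""
    return sum(
        hit_miss_weight(rec1.get(f), rec2.get(f), beta, a, c)
        for f, (beta, a, c) in params.items()
    )


# Hypothetical parameters for two fields.
params = {
    "country": ({"USA": 0.5, "NOR": 0.01, "other": 0.49}, 0.90, 0.05),
    "gender": ({"F": 0.6, "M": 0.4}, 0.85, 0.10),
}
sparse = {"country": "USA", "gender": None}
full = {"country": "NOR", "gender": "F"}
print(score_pair(sparse, sparse, params))  # only the country match contributes
print(score_pair(full, full, params))      # rarer country plus a gender match scores higher
```

Two identical but sparse records therefore score lower than two identical records with more, and rarer, information.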
Properties
• Accounts for both the level of agreement and the amount of information
• Imposes no strict criteria that a record pair must fulfil in order to be highlighted
• Allows the threshold for manual review to be adjusted based on the available resources
• Robust with respect to small amounts of training data
Fitting standard hit-miss models to the WHO database
• Model fitting is based on simple parameter estimation
• The probability for different values and for missing data in each record field can be estimated based on the data set as a whole
• The probability for a miss in a given record field needs to be estimated based on labelled duplicates (38 pairs available for the WHO database)
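The estimation steps above might look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, and the pseudo-count smoothing is an addition of this example rather than something stated on the slides. The second function returns the share of labelled duplicate pairs that disagree on a field, which is the raw quantity a miss probability would be derived from.

```python
from collections import Counter


def estimate_value_probs(values):
    """Estimate value frequencies (beta) and the blank probability (c) from all records."""
    present = [v for v in values if v is not None]
    c = 1 - len(present) / len(values)
    counts = Counter(present)
    beta = {v: n / len(present) for v, n in counts.items()}
    return beta, c


def estimate_discordance(labelled_pairs, pseudo=0.5):
    """Smoothed share of labelled duplicate pairs that disagree on this field.

    Pairs with a blank on either side are skipped. The pseudo-count is an
    assumption of this sketch, so that a handful of pairs never yields 0 or 1.
    """
    both = [(x, y) for x, y in labelled_pairs if x is not None and y is not None]
    mismatches = sum(1 for x, y in both if x != y)
    return (mismatches + pseudo) / (len(both) + 2 * pseudo)


# Hypothetical gender values across a data set, and a few labelled duplicate pairs.
beta, c = estimate_value_probs(["F", "F", "M", None])
rate = estimate_discordance([("F", "F"), ("F", "M"), (None, "F"), ("M", "M")])
print(beta, c, rate)
```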
Extending the hit-miss mixture model
• A generalisation of the standard hit-miss model to numerical record fields
• In addition to hits, misses and blanks, the hit-miss mixture model includes deviations
• Motivation: many types of errors in numerical record fields are likely to lead to small differences compared to the true value, rather than to random values
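To make the idea of a deviation concrete, here is a hedged Python sketch for an integer-valued field such as age in years. It assumes that the difference d between two related records is an exact hit with probability a_hit, a small normally distributed deviation with probability a_dev, and otherwise behaves like the difference between unrelated records. The normal distributions, the rounding to integers and all parameter values are assumptions of this sketch, not the fitted model from the research.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def rounded_normal_pmf(d, sd):
    """Probability that a zero-mean normal variable rounds to the integer d."""
    return normal_cdf((d + 0.5) / sd) - normal_cdf((d - 0.5) / sd)


def mixture_weight(d, a_hit, a_dev, sd_dev, sd_unrelated):
    """log2 ratio for an integer difference d between two numerical values.

    Related records: exact hit (d == 0) with probability a_hit, a small deviation
    with probability a_dev, otherwise a miss that behaves like an unrelated pair.
    """
    a_miss = 1 - a_hit - a_dev
    p_unrelated = rounded_normal_pmf(d, sd_unrelated)
    p_related = (
        (a_hit if d == 0 else 0.0)
        + a_dev * rounded_normal_pmf(d, sd_dev)
        + a_miss * p_unrelated
    )
    return math.log2(p_related / p_unrelated)


# Hypothetical parameters for patient age in years.
for d in (0, 1, 30):
    w = mixture_weight(d, a_hit=0.85, a_dev=0.10, sd_dev=1.0, sd_unrelated=28.0)
    print(d, round(w, 2))
```

With these parameters an exact match earns the largest weight, a one-year difference still earns a positive weight, and a large difference is penalised about as much as an ordinary miss.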
Evaluation on Norwegian data
• The last Norwegian batch of reports from 2004 included 19 confirmed duplicates
• We used the hit-miss model to highlight suspected duplicates in this batch of 1559 case reports
• The match score threshold for likely duplicates was set to 37.5 (based on an assumed 5% duplicates in the data set, and chosen to achieve an estimated false-alarm rate below 0.05)
Results
• 17 record pairs were highlighted as suspected duplicates
• 12 of these were confirmed duplicates, 5 were not
– 5 false positives
– 7 false negatives
– 63% recall
– 71% precision
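The recall and precision figures follow directly from the counts reported for the Norwegian batch (19 confirmed duplicates, 17 highlighted pairs, 12 of them confirmed). The variable names below are illustrative only.

```python
confirmed_duplicates = 19  # confirmed duplicates in the Norwegian batch
highlighted = 17           # record pairs scoring above the threshold
true_positives = 12        # highlighted pairs that were confirmed duplicates

false_positives = highlighted - true_positives           # highlighted but not confirmed
false_negatives = confirmed_duplicates - true_positives  # confirmed but not highlighted
recall = true_positives / confirmed_duplicates           # 12 / 19
precision = true_positives / highlighted                 # 12 / 17

print(f"recall {recall:.0%}, precision {precision:.0%}")
```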
Top scoring record pair
• The highest match score in the study is for an alleged false positive
– Only near-matches on age and date, no matching ADR terms
– BUT 6 matching drug substances (not commonly co-prescribed)
– ... and ADR terms are semantically close
Age | Gender | Country | Drug substances        | ADR terms   | Onset date | Outcome | Score
51  | F      | NOR     | 6 matched, 1 unmatched | 3 unmatched | 2004-04-30 | ?       | +76.97
50  | F      | NOR     |                        |             | 2004-04-20 | ?       |
Follow-up
The Norwegian centre informed us that:
• The top scoring record pair does relate to a set of confirmed duplicates (submitted by different doctors in the same hospital)
• One of the other 'false positives' corresponds to a pair of suspected but as yet unconfirmed duplicates
Duplicates?
• Cluster of 3 reports highlighted in a specific country
• Onset date: 16th Dec 2003
• Age: 8, 18 and 29
• All female
• All had one drug listed and one AE listed – both drug and AE quite rarely reported
Duplicates?
• No – confirmed by reporting country
• But: all were reported by the same dentist
• Clearly these reports are not completely independent
• Analysis methods treat all reports as equally important and weight them equally
– Can use duplicate detection algorithm to down-weight very similar reports that are less likely to be ‘independent’
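One hypothetical way to down-weight very similar reports is sketched below in Python. The slides do not describe a specific scheme, so everything here is an assumption: reports linked by a match score at or above a threshold are grouped into clusters, and each report in a cluster of size k receives weight 1/k, so that the cluster counts as a single report in later analysis.

```python
def down_weights(n_reports, scored_pairs, threshold):
    """Give each report weight 1 / (size of its cluster of very similar reports).

    scored_pairs: iterable of (i, j, score) with report indices i and j
    threshold: pairs scoring at or above this value are linked into one cluster
    """
    parent = list(range(n_reports))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    sizes = {}
    for i in range(n_reports):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1 / sizes[find(i)] for i in range(n_reports)]


# Hypothetical scores: reports 0, 1 and 2 are mutually similar; 3 and 4 are not.
pairs = [(0, 1, 40.0), (1, 2, 39.0), (3, 4, 10.0)]
weights = down_weights(5, pairs, threshold=37.5)
print(weights)
```

A softer variant could scale the weight by the match score instead of using a hard threshold; the choice depends on how the downstream analysis uses the weights.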
References
Copas J, Hilton F. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society: Series A, 1990. 153:287-320.
Norén GN, Orre R, Bate A. A hit-miss model for duplicate detection in the WHO drug safety database. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005 (awarded Best Application Paper at the SIGKDD annual meeting, Chicago 2005).
Norén GN, Hopstadius J, Bate A, Star K, Edwards IR. Temporal Pattern Discovery in Electronic Patient Records. Data Mining and Knowledge Discovery, 2010. 20(3):361-387.
Conclusions
• The extended hit-miss model has several beneficial theoretical properties for this application
• The model performs well enough on real-world post-marketing drug safety data to be very useful for duplicate detection
• The hit-miss mixture model's capability to account for near-matches on age and date is important in real-world applications