Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Practical challenges in assessing indirectness and the implications for integrating multiple streams of evidence in systematic reviewsPaul WhaleyEvidence Based Toxicology Collaboration Research FellowLancaster Environment Centre, UKNational Academy of Sciences, Engineering and Medicine, Washington DCEvidence Integration Workshop, 3 June 2019
About me
• Researcher at Lancaster University and the Evidence-Based
Toxicology Collaboration at Johns Hopkins Bloomberg School of
Public Health
• Editor for Systematic Reviews at Environment International (IF 7.297)
• Focus on systematic review methods for environmental health
research: frameworks for systematic evidence surveillance and
synthesis; critical appraisal tools; research standards; quality
assurance and control in SR publishing
2
Today’s themes
• Systematic review as a grounded approach to evidence review
• A PECO-based framework for assessing external validity of studies
• Evidence that successfully grounding SRs is extremely challenging
• How our PECO framework anticipates a computational approach to SR
• Research needs for delivering grounded, computational SRs
3
Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits
What is a systematic review?
• A systematic review is a research project which tests a hypothesis using pre-existing evidence instead of conducting a novel experiment
• The test should minimise bias introduced by (a) the evidence included in the review, and (b) by the performance of the review• Include all the evidence relevant to testing the
hypothesis (search and screening)
• Appraise the quality of the evidence (at level of individual study and body of evidence)
• Synthesise the evidence into a summary result (qualitative & quantitative methods)
5
Systematic review = grounded interpretation
• SR is an advance on traditional narrative review because it uses explicit,
discussable methods to ground the test of the hypothesis
• SRs are grounded when they connect interpretation of the validity of
study procedures and observations with:
a. the textual record in the study documents of those procedures & observations
b. empirical evidence of the validity of the procedures described in that record
• Can’t take grounding for granted, but because SR methods are explicit,
they can be repeated, evaluated and deliberately changed
6
7
Interpretation of the textual record of the
procedures and observations of multiple
studies to determine extent to which existing evidence
already supports the SR hypothesis
Aggregation of textual record of existing studies
relevant to testing hypothesis
SR hypothesis
Search
Selection
Appraisal
Synthesis and integration
Textual record of a study
Hypothesis
Procedures
Observations
Data extraction: recording and coding of procedures and observations relevant to testing SR hypothesis
Bias; from meta-epidemiology
External validity; based on what?
Claim to validity of study design
Claim to validity of SR design
Derivation of summary results;
assessment of certainty based on
emergent properties of the aggregated
evidence, including its indirectness and
risk of bias
What is “evidence integration”?
• Evidence integration is based on a concept of dividing evidence into streams (or lines) of readily-comparable populations – usually animal vs. human, though could be a species, genus, or family
• Evidence is synthesised to produce summary results of effect of exposure in each stream
• Certainty of the evidence for the effect is assessed for each stream
• Integration is a function of combined certainty across each stream, generating a judgement of the overall level of evidence
• In the OHAT framework, mechanistic data can inform changes to the level of evidence; in the 2019 update to the IARC preamble, mechanistic evidence is a distinct stream in its own right
8
Integrating mechanistic information in SRs
• Current approaches were designed to support qualitative hazard classification, not obviously applicable to complex analysis objectives (e.g. quantifying health effects of exposures)
• We already exclude or combine multiple study designs according to principles of relevance or similarity which are informed by mechanistic data
• Mechanistic studies are conducted because they describe and/or predict health outcomes in a target population – why separate them from the whole-organism models of which they are intended to be informative?
• Can we do more to systematically incorporate mechanistic evidence into systematic reviews of exposures?
9
Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits
The role of PECO statements in SRs
• SR = test of a hypothesis using existing evidence
• Hypothesis interpreted as a research question, formulated as a Population-Exposure-Comparator-Outcome statement
• Common research scenario in environmental health: there is a suspected relationship between an exposure and an outcome, but the nature of the relationship is unknown (scenario 1, right)
11
P: Among adult females, what is the effect ofE: 1 μg/kg bw childhood organochlorine levels in blood, versusC: 1 μg/kg bw incremental increase onO: endometriosis?
Including indirect evidence
• Necessary in a SR of an exposure-outcome relationship when we do
not have certain evidence within the strict confines of the PECO
• Look at intermediate outcomes, disease markers, animal models,
similar chemicals (read-across), etc. etc.
• All indirect evidence but still relevant to the question, and therefore
could increase certainty in test of hypothesis
• There are lots of ways in which this evidence can be organised
12
Example: Matta et al. (2019)
13
Interpreting Matta et al. into PECOs
Level Population Exposure Comparator Outcome
1°
Non-human primate Chronic ? Spontaneous endometriosis
Non-human primate with implanted tissue Transgen ? Invasiveness of implanted tissue
Rodent Chronic ? Proliferation of endometriotic tissue
2°
In vivo ? ? PR-B/A expression (progesterone resistance)
In vitro ? ? CYP19A1 expression (aromatase pathway)
In vivo ? ? Inflammation
14
• As described, relationship between included studies, hypotheses under test and the relevant PECOs are ambiguous – characteristics need to be more tightly defined
• In actuality, we probably don’t need to define in advance all the potentially relevant sub-PECOs (cumbersome, p-hacking) – can’t we just observe how direct the evidence is?
Proposal: PECOs as a directness framework
Relative to the PECO which is the target of a SR, all evidence is to some
extent indirect, and may therefore be evaluated as follows:
• Define the target PECO (tPECO) for the SR, as we do already
• Extract the experimental PECO (ePECO) from each included study
• Evaluate the similarity of each ePECO to the SR tPECO (ePECOtPECO)
• Describe directness of the evidence overall as a function of how the
ePECOs map in aggregate onto the SR tPECO
15
What this might look like…
Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome
Target HumanWhole
organismPre-
menopauseFemale OC 1 μg/kg bw
Pre-puberty
1 μg/kg bwincrements
Endometriosis
Ref013 HumanWhole
organismAdult Female Furan mix
High exposure group
Up to 16 years age
Low exposure group
Endometriosis
Ref852 Human HESC cells - Female TCDD10uM
solution-
10 uMincrements
Migration
Ref134 Wistar RatWhole
organism24 months Male Chlorpyrifos
1000 μg/kg bw/d
Until weaning
VehiclePR-B/A
expression
P features E features C features O features
16
• Allows us to describe all types of study design using the same set of categories
• We can make comparisons between experimental PECOs and our target question, without having to divide evidence up into streams beforehand
• Makes explicit the information being interpreted (if not yet the rules for interpretation)
Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome
Target HumanWhole
organismPre-
menopauseFemale OC 1 μg/kg bw
Pre-puberty
1 μg/kg bwincrements
Endometriosis
Ref013 HumanWhole
organismAdult Female Furan mix
High exposure group
Up to 16 years age
Low exposure group
Endometriosis
Ref852 Human HESC cells - Female TCDD10uM
solution-
10 uMincrements
Migration
Ref134 Wistar RatWhole
organism24 months Male Chlorpyrifos
1000 μg/kg bw/d
Until weaning
VehiclePR-B/A
expression
P features E features C features O features
17
A
B
C
Judgement of similarity at level ofA. whole study
B. broad PECO element
C. individual PECO sub-element
• How do we ensure these
judgements are valid?
Rules for interpretation? Maybe in AOPs
• Is the observed intermediate event strongly predictive of the target outcome?
• Do the mechanisms in the observed population also happen in the target population?
• Intuition: the more certain the answer, the lower the sense that the evidence is indirect
• If true, maybe judgement of similarity can be derived from a function of certainty in the AOP network
• Potential for grounding judgements of directness in biological knowledge (so long as that knowledge is gathered systematically)
18
Exposure
Initiating Event Event 2 Event 3 Event 4
Outcome
causes causes causes
causes
causes
Assay A
Assay B
Experiment
Z
investigates
Level of cellular organisation
Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits
Two major, practical threats to grounded SR
• Implementing valid processes
• Overwhelming data volume
20
Prepublication data on EH systematic reviews
• At Environment International, we triage submissions on six key features of a SR:
1. Are objectives appropriate to investigating research question?
2. Does the search methodology miss relevant evidence?
3. Do the exclusion criteria and screening process exclude relevant evidence?
4. Have included studies been appraised using a valid risk of bias instrument?
5. Have appropriate quantitative and qualitative been used to synthesise the evidence?
6. Has certainty in the evidence been assessed using appropriate, defined criteria?
• We score the methods on a Likert scale of 1-5 (1 = serious concerns)
• A score of 1 or 2 in any domain is a critical shortcoming and results in desk-rejection*
21
*Authors receive a detailed triage report and editor feedback on identified issues; as often as possible issues are discussed with authors with a view to enabling resubmission
Summary of Triage Decisions
22
Period April 2018 - May 2019, since introduction of triage tool. n=52
Methods for Study Appraisal
23
No
. of
sub
mis
sio
ns
Editor’s Score
56%
27%
6%8%
4%
0
5
10
15
20
25
30
35
40
45
50
55
1 2 3 4 5
>80% of submissions use invalid study appraisal instruments, or
often none at all
Synthesis methods
24
33%29%
27%
8%4%
0
5
10
15
20
25
30
35
40
45
50
55
1 2 3 4 5
60% of SR submissions employ obviously flawed methods for synthesising the findings of included studies
Methods for Certainty Assessment
25
76%
6%10%
0%
8%
0
5
10
15
20
25
30
35
40
45
50
55
1 2 3 4 5
>80% of submissions fail to assess certainty in the evidence according to a
defined set of criteria
Post-publication data from medical SRs
• 8989 PubMed records tagged by 2004 as “systematic review” yet actual number of stringently-defined SRs was ≈2500 (Moher et al. 2007)
• Most published SRs have major flaws in conduct and reporting (Page et al. 2016)
• ≈3% of manuscripts are “decent and clinically useful” (Ioannidis 2016)
• What about Cochrane?
• Propadalo et al. 2019: 29% of Cochrane reviews arediscrepant with guidance on allocation concealment
• Babic et al. 2019 : “Assessments of attrition bias inCochrane systematic reviews are highly inconsistent”
• These are intervention reviews, not aetiology
26
Educating our way out of this challenge?
• Most EH research teams do not successfully apply even the simpler, well-documented instruments (e.g. OHAT, Navigation Guide, GRADE) which would better ground their SR methods
• Even if we ended up doing as well on average as the medics, we wouldn’t be doing well enough
• Doing as well as the outlier (setting up a Cochrane for EH research) is not a near-future event
• Complex tools like ROBINS-E: what prospects for successful use given the above?
27
The data volume problem
28
From: Villeneuve et al. (2018) Adverse Outcome Pathway Network Analytics
CYP19 AOP network>50 biokinetic events>65 event/event relationships
The integration challenge, in a nutshell
• If the stars align, simple SRs can successfully be conducted
• But in most normal scenarios, SR methods are out of reach of most
researchers’ capacity to apply them successfully
• Methods for integrating mechanistic data into SRs are unlikely to be
any easier to apply successfully – plus, they overwhelm us with data
• We can’t escape this challenge: the methods need to be applied in
order for SRs to be grounded
• So we need a scalable approach to grounded integration methods
29
Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits
In favour of algorithms
• By turning features into numbers, we can make processes repeatable and scalable (i.e. computers can do the work for us)
• Discussable inputs which can be changed deliberately
• The challenge is preserving the links in the chain of evidence that keeps the process grounded (score-text-design-validity)
• How do we do that for complex SR questions, e.g. predicting dose-response relationships in human populations using indirect evidence?
31
32
• We can readily turn judgements of similarity into
numbers within our tPECO framework
Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome
Target HumanWhole
organismPre-
menopauseFemale OC 1 μg/kg bw
Pre-puberty
1 μg/kg bwincrements
Endometriosis
Ref013 HumanWhole
organismAdult Female Furan mix
High exposure group
Up to 16 years age
Low exposure group
Endometriosis
Ref852 Human HESC cells - Female TCDD10uM
solution-
10 uMincrements
Migration
Ref134 Wistar RatWhole
organism24 months Male Chlorpyrifos
1000 μg/kg bw/d
Until weaning
VehiclePR-B/A
expression
P features E features C features O features
Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome
Target HumanWhole
organismPre-
menopauseFemale OC 1 μg/kg bw
Pre-puberty
1 μg/kg bwincrements
Endometriosis
Ref013 1 1 2 1 3 4 4 2 1
Ref852 1 3 - 1 1 2 - 2 3
Ref134 3 1 2 4 1 6 2 1 4
33
Any adult Furans Up to 16
1 2 3 4 5 6 7
Same DifferentHow similar?
Human
How do we ground similarity scores?
• In our mechanistic study, what makes a rat score a 3? Or PR-B/A a 4?
• The million (multi-trillion?) dollar question
34
Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome
Target HumanWhole
organismPre-
menopauseFemale OC 1 μg/kg bw
Pre-puberty
1 μg/kg bwincrements
Endometriosis
Ref013 1 1 2 1 3 4 4 2 1
Ref852 1 3 - 1 1 2 - 2 3
Ref134 3 1 2 4 1 6 2 1 4
P features E features C features O features
Rat PR-B/A expression
Research for grounding similarity scores
• Grounding requires us to connect the numbers to the textual record,
and to the empirical evidence for their interpretation (their value)
• There are at least three big jobs that need to be done
1. Systematic methods for AOP development
2. Automated data extraction
3. Machine-learning models for weighting evidence
• Probably all three need doing, because it looks like a big-data challenge
35
1. Systematic approach to AOP development
• Data model for external validity is underpinned by AOPs
• But we haven’t formalised the key features from which AOPs are built
• What information in the textual record should we use when developing an AOP?
• What rules should we follow in developing valid AOPs / determining their plausibility?
• This will need to be grounded, and therefore systematic*
• If we figure this out, we will know what rules the machines should be
following when identifying and evaluating putative AOPs for us
36
*SR approach to AOPs is subject of EBTC GRADE pre-meeting in Hamilton next week
2. Automated data extraction
• PECO features and AOP information need extracting from narrative text in full study reports
• This will be a very large extraction job: high level of granularity across thousands of documents
• Would require automation to be practically doable, therefore natural language processing (NLP) approach
• NLP methods can’t yet differentiate the features we are interested in, at level of full text, with enough reliability to do data extraction for us
• The step-change which is required implies need for a full-text toxicology corpus training set
37
Teaching computers to read
• Computers “read” by building statistical models to attempt to discern the same regularities in a written document that people respond to when discerning the meaning therein (the written concept “rat” will have a certain statistical shape in a document)
• The problem is there are lots of things which will, to the statistical models, look like regularities which are not meaningful (i.e. look like rats but are not rats), while many meaningful regularities will be invisible to them (are rats, but do not look like them)
• To help, we can manually annotate a large, representative set of documents (a corpus) to show the machines the parts which are meaningful to us (where the rats actually are). The machine can heavily weight this information in its statistical model, massively improving its performance for a data extraction task
Rats have four legs, big ears
and a tail.
We breed Han-Wistars.
Paul is being ratty because he
is tired and hungry.
Paul isn’t ratty enough for
Warfarin to poison him.
Artists like a golden ratio.
No rats were harmed during
filming.
Tree-rats keep stealing food
from our bird-feeders.
Machine-learning models for weighting evidence
• Starts off with responding to the features we know are important
(blinding, species, vehicle, event, dose regimen, formulation etc.)
• Uses statistical models of those features to repeat human processes
at high volume (e.g. judges risk of bias, indirectness, etc.)
• Large datasets yielded by success with NLP implies quantitative
models for interpreting meaning of dataset features
• Over time, the machine identifies predictive features we are not
aware of, and improves its performance beyond human capability
39
Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits
Summary
• Successful evidence integration requires us to ground complex
judgements of the directness of evidence in (a) the textual record
of research and (b) in biological knowledge
• We have proposed a framework for using PECO statements to
structure judgments about external validity, which seems to
necessitate a computational implementation
• We have outlined a research roadmap toward how such an
implementation can be realised and grounded
41
Thanks to…*
• Stephen Wattam, WAP Academic Consultancy Ltd
• Daniele Wikoff, Toxstrategies LLC
• Oliver Wild, Lancaster Environment Centre, UK
• Taylor Wolffe, Lancaster Environment Centre, UK
• Paul Rayson, Lancaster University School of Computing and Communications
• John Vidler, Lancaster University School of Computing and Communications
• EBTC staff: Katya Tsaioun, Sebastian Hoffmann, Rob de Vries
• Patient listeners: Rebecca Morgan, Michelle Angrish
42
*Credit is theirs, mistakes are mine
Thank you for [email protected] / @paul_msg
Image: Brian Henderson / Flickr / CC-BY-NC-2.0
Wikoff Model* for quantitative integration
• Model measures the extent to which
a body of evidence relevant to the
potential carcinogenicity of a
chemical fulfils the KCCs
• Uses three inputs (1-3) and an
algorithm (4) to provide a numeric
description (5) of how well the
evidence “matches” the KCCs
• It works a bit like calculating Flesch-
Kincaid readability scores in word
processors: overall target
characteristic described as a function
of some measurable properties,
normalised onto a scale
44
2 31
4
-1 0 1
KCCs not fulfilled KCCs fulfilled
5
*Oversimplified version presented here, see Wikoff et al. (2019) for detail
From Wikoff to grounded integration
45
Empirical evidence of similarity from being embedded in biological knowledge, i.e. AOPs – systematic
development and assessment of certainty in AOPs
Annotation in the textual record of the study characteristics which are being interpreted in making
judgements about external validity
Internal validity model: study characteristics
predictive of systematic error in results
External validity model: study features predictive
of generalisability
Statistical approachesto quantifying
exposure/outcome relationships
Algorithm responding to large number of interdependent features
Task-dependent output
2 31
4
5
Grounding
Not e.g. KlimischSimilarity to tPECO rather
than KCC featureNot vote counting
(3) is already done computationally; we think we can extend a computational approach to (2) with the framework we are describing; the challenge is then in ensuring (2) is grounded. Note: the principles of our approach also apply to (1)
Machine learning rather than simple operation