Practical challenges in assessing indirectness and the ... · a. the textual record in the study documents of those procedures & observations b. empirical evidence of the validity

Practical challenges in assessing indirectness and the implications for integrating multiple streams of evidence in systematic reviewsPaul WhaleyEvidence Based Toxicology Collaboration Research FellowLancaster Environment Centre, UKNational Academy of Sciences, Engineering and Medicine, Washington DCEvidence Integration Workshop, 3 June 2019

About me

• Researcher at Lancaster University and the Evidence-Based

Toxicology Collaboration at Johns Hopkins Bloomberg School of

Public Health

• Editor for Systematic Reviews at Environment International (IF 7.297)

• Focus on systematic review methods for environmental health

research: frameworks for systematic evidence surveillance and

synthesis; critical appraisal tools; research standards; quality

assurance and control in SR publishing

2

Today’s themes

• Systematic review as a grounded approach to evidence review

• A PECO-based framework for assessing external validity of studies

• Evidence that successfully grounding SRs is extremely challenging

• How our PECO framework anticipates a computational approach to SR

• Research needs for delivering grounded, computational SRs

3

Recap of systematic review and evidence integrationA PECO-based framework for evidence integrationPractical challenges in achieving grounded analysisSolution: a computational approachConclusions and credits

What is a systematic review?

• A systematic review is a research project which tests a hypothesis using pre-existing evidence instead of conducting a novel experiment

• The test should minimise bias introduced by (a) the evidence included in the review, and (b) by the performance of the review• Include all the evidence relevant to testing the

hypothesis (search and screening)

• Appraise the quality of the evidence (at level of individual study and body of evidence)

• Synthesise the evidence into a summary result (qualitative & quantitative methods)

5

Systematic review = grounded interpretation

• SR is an advance on traditional narrative review because it uses explicit,

discussable methods to ground the test of the hypothesis

• SRs are grounded when they connect interpretation of the validity of

study procedures and observations with:

a. the textual record in the study documents of those procedures & observations

b. empirical evidence of the validity of the procedures described in that record

• Can’t take grounding for granted, but because SR methods are explicit,

they can be repeated, evaluated and deliberately changed

6

7

Interpretation of the textual record of the

procedures and observations of multiple

studies to determine extent to which existing evidence

already supports the SR hypothesis

Aggregation of textual record of existing studies

relevant to testing hypothesis

SR hypothesis

Search

Selection

Appraisal

Synthesis and integration

Textual record of a study

Hypothesis

Procedures

Observations

Data extraction: recording and coding of procedures and observations relevant to testing SR hypothesis

Bias; from meta-epidemiology

External validity; based on what?

Claim to validity of study design

Claim to validity of SR design

Derivation of summary results;

assessment of certainty based on

emergent properties of the aggregated

evidence, including its indirectness and

risk of bias

What is “evidence integration”?

• Evidence integration is based on a concept of dividing evidence into streams (or lines) of readily-comparable populations – usually animal vs. human, though could be a species, genus, or family

• Evidence is synthesised to produce summary results of effect of exposure in each stream

• Certainty of the evidence for the effect is assessed for each stream

• Integration is a function of combined certainty across each stream, generating a judgement of the overall level of evidence

• In the OHAT framework, mechanistic data can inform changes to the level of evidence; in the 2019 update to the IARC preamble, mechanistic evidence is a distinct stream in its own right

8

Integrating mechanistic information in SRs

• Current approaches were designed to support qualitative hazard classification, not obviously applicable to complex analysis objectives (e.g. quantifying health effects of exposures)

• We already exclude or combine multiple study designs according to principles of relevance or similarity which are informed by mechanistic data

• Mechanistic studies are conducted because they describe and/or predict health outcomes in a target population – why separate them from the whole-organism models of which they are intended to be informative?

• Can we do more to systematically incorporate mechanistic evidence into systematic reviews of exposures?

9


The role of PECO statements in SRs

• SR = test of a hypothesis using existing evidence

• Hypothesis interpreted as a research question, formulated as a Population-Exposure-Comparator-Outcome statement

• Common research scenario in environmental health: there is a suspected relationship between an exposure and an outcome, but the nature of the relationship is unknown (scenario 1, right)

11

P: Among adult females, what is the effect ofE: 1 μg/kg bw childhood organochlorine levels in blood, versusC: 1 μg/kg bw incremental increase onO: endometriosis?

Including indirect evidence

• Necessary in a SR of an exposure-outcome relationship when we do

not have certain evidence within the strict confines of the PECO

• Look at intermediate outcomes, disease markers, animal models,

similar chemicals (read-across), etc. etc.

• All indirect evidence but still relevant to the question, and therefore

could increase certainty in test of hypothesis

• There are lots of ways in which this evidence can be organised

12

Example: Matta et al. (2019)

13

Interpreting Matta et al. into PECOs

Level Population Exposure Comparator Outcome

1°

Non-human primate Chronic ? Spontaneous endometriosis

Non-human primate with implanted tissue Transgen ? Invasiveness of implanted tissue

Rodent Chronic ? Proliferation of endometriotic tissue

2°

In vivo ? ? PR-B/A expression (progesterone resistance)

In vitro ? ? CYP19A1 expression (aromatase pathway)

In vivo ? ? Inflammation

14

• As described, relationship between included studies, hypotheses under test and the relevant PECOs are ambiguous – characteristics need to be more tightly defined

• In actuality, we probably don’t need to define in advance all the potentially relevant sub-PECOs (cumbersome, p-hacking) – can’t we just observe how direct the evidence is?

Proposal: PECOs as a directness framework

Relative to the PECO which is the target of a SR, all evidence is to some

extent indirect, and may therefore be evaluated as follows:

• Define the target PECO (tPECO) for the SR, as we do already

• Extract the experimental PECO (ePECO) from each included study

• Evaluate the similarity of each ePECO to the SR tPECO (ePECOtPECO)

• Describe directness of the evidence overall as a function of how the

ePECOs map in aggregate onto the SR tPECO

15

What this might look like…

Study Specie L. Org. Age Sex Chem Dose Timing Dose Outcome

Target HumanWhole

organismPre-

menopauseFemale OC 1 μg/kg bw

Pre-puberty

1 μg/kg bwincrements

Endometriosis

Ref013 HumanWhole

organismAdult Female Furan mix

High exposure group

Up to 16 years age

Low exposure group

Endometriosis

Ref852 Human HESC cells - Female TCDD10uM

solution-

10 uMincrements

Migration

Ref134 Wistar RatWhole

organism24 months Male Chlorpyrifos

1000 μg/kg bw/d

Until weaning

VehiclePR-B/A

expression

P features E features C features O features

16

• Allows us to describe all types of study design using the same set of categories

• We can make comparisons between experimental PECOs and our target question, without having to divide evidence up into streams beforehand

• Makes explicit the information being interpreted (if not yet the rules for interpretation)


Target HumanWhole

organismPre-


Pre-puberty


Endometriosis

Ref013 HumanWhole


High exposure group

Up to 16 years age

Low exposure group

Endometriosis


solution-

10 uMincrements

Migration



1000 μg/kg bw/d

Until weaning

VehiclePR-B/A

expression


17

A

B

C

Judgement of similarity at level ofA. whole study

B. broad PECO element

C. individual PECO sub-element

• How do we ensure these

judgements are valid?

Rules for interpretation? Maybe in AOPs

• Is the observed intermediate event strongly predictive of the target outcome?

• Do the mechanisms in the observed population also happen in the target population?

• Intuition: the more certain the answer, the lower the sense that the evidence is indirect

• If true, maybe judgement of similarity can be derived from a function of certainty in the AOP network

• Potential for grounding judgements of directness in biological knowledge (so long as that knowledge is gathered systematically)

18

Exposure

Initiating Event Event 2 Event 3 Event 4

Outcome

causes causes causes

causes

causes

Assay A

Assay B

Experiment

Z

investigates

Level of cellular organisation


Two major, practical threats to grounded SR

• Implementing valid processes

• Overwhelming data volume

20

Prepublication data on EH systematic reviews

• At Environment International, we triage submissions on six key features of a SR:

1. Are objectives appropriate to investigating research question?

2. Does the search methodology miss relevant evidence?

3. Do the exclusion criteria and screening process exclude relevant evidence?

4. Have included studies been appraised using a valid risk of bias instrument?

5. Have appropriate quantitative and qualitative been used to synthesise the evidence?

6. Has certainty in the evidence been assessed using appropriate, defined criteria?

• We score the methods on a Likert scale of 1-5 (1 = serious concerns)

• A score of 1 or 2 in any domain is a critical shortcoming and results in desk-rejection*

21

*Authors receive a detailed triage report and editor feedback on identified issues; as often as possible issues are discussed with authors with a view to enabling resubmission

Summary of Triage Decisions

22

Period April 2018 - May 2019, since introduction of triage tool. n=52

Methods for Study Appraisal

23

No

. of

sub

mis

sio

ns

Editor’s Score

56%

27%

6%8%

4%

0

5

10

15

20

25

30

35

40

45

50

55

1 2 3 4 5

>80% of submissions use invalid study appraisal instruments, or

often none at all

Synthesis methods

24

33%29%

27%

8%4%

0

5

10

15

20

25

30

35

40

45

50

55

1 2 3 4 5

60% of SR submissions employ obviously flawed methods for synthesising the findings of included studies

Methods for Certainty Assessment

25

76%

6%10%

0%

8%

0

5

10

15

20

25

30

35

40

45

50

55

1 2 3 4 5

>80% of submissions fail to assess certainty in the evidence according to a

defined set of criteria

Post-publication data from medical SRs

• 8989 PubMed records tagged by 2004 as “systematic review” yet actual number of stringently-defined SRs was ≈2500 (Moher et al. 2007)

• Most published SRs have major flaws in conduct and reporting (Page et al. 2016)

• ≈3% of manuscripts are “decent and clinically useful” (Ioannidis 2016)

• What about Cochrane?

• Propadalo et al. 2019: 29% of Cochrane reviews arediscrepant with guidance on allocation concealment

• Babic et al. 2019 : “Assessments of attrition bias inCochrane systematic reviews are highly inconsistent”

• These are intervention reviews, not aetiology

26

Educating our way out of this challenge?

• Most EH research teams do not successfully apply even the simpler, well-documented instruments (e.g. OHAT, Navigation Guide, GRADE) which would better ground their SR methods

• Even if we ended up doing as well on average as the medics, we wouldn’t be doing well enough

• Doing as well as the outlier (setting up a Cochrane for EH research) is not a near-future event

• Complex tools like ROBINS-E: what prospects for successful use given the above?

27

The data volume problem

28

From: Villeneuve et al. (2018) Adverse Outcome Pathway Network Analytics

CYP19 AOP network>50 biokinetic events>65 event/event relationships

The integration challenge, in a nutshell

• If the stars align, simple SRs can successfully be conducted

• But in most normal scenarios, SR methods are out of reach of most

researchers’ capacity to apply them successfully

• Methods for integrating mechanistic data into SRs are unlikely to be

any easier to apply successfully – plus, they overwhelm us with data

• We can’t escape this challenge: the methods need to be applied in

order for SRs to be grounded

• So we need a scalable approach to grounded integration methods

29


In favour of algorithms

• By turning features into numbers, we can make processes repeatable and scalable (i.e. computers can do the work for us)

• Discussable inputs which can be changed deliberately

• The challenge is preserving the links in the chain of evidence that keeps the process grounded (score-text-design-validity)

• How do we do that for complex SR questions, e.g. predicting dose-response relationships in human populations using indirect evidence?

31

32

• We can readily turn judgements of similarity into

numbers within our tPECO framework


Target HumanWhole

organismPre-


Pre-puberty


Endometriosis

Ref013 HumanWhole


High exposure group

Up to 16 years age

Low exposure group

Endometriosis


solution-

10 uMincrements

Migration



1000 μg/kg bw/d

Until weaning

VehiclePR-B/A

expression



Target HumanWhole

organismPre-


Pre-puberty


Endometriosis

Ref013 1 1 2 1 3 4 4 2 1

Ref852 1 3 - 1 1 2 - 2 3

Ref134 3 1 2 4 1 6 2 1 4

33

Any adult Furans Up to 16

1 2 3 4 5 6 7

Same DifferentHow similar?

Human

How do we ground similarity scores?

• In our mechanistic study, what makes a rat score a 3? Or PR-B/A a 4?

• The million (multi-trillion?) dollar question

34


Target HumanWhole

organismPre-


Pre-puberty


Endometriosis

Ref013 1 1 2 1 3 4 4 2 1

Ref852 1 3 - 1 1 2 - 2 3

Ref134 3 1 2 4 1 6 2 1 4


Rat PR-B/A expression

Research for grounding similarity scores

• Grounding requires us to connect the numbers to the textual record,

and to the empirical evidence for their interpretation (their value)

• There are at least three big jobs that need to be done

1. Systematic methods for AOP development

2. Automated data extraction

3. Machine-learning models for weighting evidence

• Probably all three need doing, because it looks like a big-data challenge

35

1. Systematic approach to AOP development

• Data model for external validity is underpinned by AOPs

• But we haven’t formalised the key features from which AOPs are built

• What information in the textual record should we use when developing an AOP?

• What rules should we follow in developing valid AOPs / determining their plausibility?

• This will need to be grounded, and therefore systematic*

• If we figure this out, we will know what rules the machines should be

following when identifying and evaluating putative AOPs for us

36

*SR approach to AOPs is subject of EBTC GRADE pre-meeting in Hamilton next week

2. Automated data extraction

• PECO features and AOP information need extracting from narrative text in full study reports

• This will be a very large extraction job: high level of granularity across thousands of documents

• Would require automation to be practically doable, therefore natural language processing (NLP) approach

• NLP methods can’t yet differentiate the features we are interested in, at level of full text, with enough reliability to do data extraction for us

• The step-change which is required implies need for a full-text toxicology corpus training set

37

Teaching computers to read

• Computers “read” by building statistical models to attempt to discern the same regularities in a written document that people respond to when discerning the meaning therein (the written concept “rat” will have a certain statistical shape in a document)

• The problem is there are lots of things which will, to the statistical models, look like regularities which are not meaningful (i.e. look like rats but are not rats), while many meaningful regularities will be invisible to them (are rats, but do not look like them)

• To help, we can manually annotate a large, representative set of documents (a corpus) to show the machines the parts which are meaningful to us (where the rats actually are). The machine can heavily weight this information in its statistical model, massively improving its performance for a data extraction task

Rats have four legs, big ears

and a tail.

We breed Han-Wistars.

Paul is being ratty because he

is tired and hungry.

Paul isn’t ratty enough for

Warfarin to poison him.

Artists like a golden ratio.

No rats were harmed during

filming.

Tree-rats keep stealing food

from our bird-feeders.

Machine-learning models for weighting evidence

• Starts off with responding to the features we know are important

(blinding, species, vehicle, event, dose regimen, formulation etc.)

• Uses statistical models of those features to repeat human processes

at high volume (e.g. judges risk of bias, indirectness, etc.)

• Large datasets yielded by success with NLP implies quantitative

models for interpreting meaning of dataset features

• Over time, the machine identifies predictive features we are not

aware of, and improves its performance beyond human capability

39


Summary

• Successful evidence integration requires us to ground complex

judgements of the directness of evidence in (a) the textual record

of research and (b) in biological knowledge

• We have proposed a framework for using PECO statements to

structure judgments about external validity, which seems to

necessitate a computational implementation

• We have outlined a research roadmap toward how such an

implementation can be realised and grounded

41

Thanks to…*

• Stephen Wattam, WAP Academic Consultancy Ltd

• Daniele Wikoff, Toxstrategies LLC

• Oliver Wild, Lancaster Environment Centre, UK

• Taylor Wolffe, Lancaster Environment Centre, UK

• Paul Rayson, Lancaster University School of Computing and Communications

• John Vidler, Lancaster University School of Computing and Communications

• EBTC staff: Katya Tsaioun, Sebastian Hoffmann, Rob de Vries

• Patient listeners: Rebecca Morgan, Michelle Angrish

42

*Credit is theirs, mistakes are mine

Thank you for [email protected] / @paul_msg

Image: Brian Henderson / Flickr / CC-BY-NC-2.0

Wikoff Model* for quantitative integration

• Model measures the extent to which

a body of evidence relevant to the

potential carcinogenicity of a

chemical fulfils the KCCs

• Uses three inputs (1-3) and an

algorithm (4) to provide a numeric

description (5) of how well the

evidence “matches” the KCCs

• It works a bit like calculating Flesch-

Kincaid readability scores in word

processors: overall target

characteristic described as a function

of some measurable properties,

normalised onto a scale

44

2 31

4

-1 0 1

KCCs not fulfilled KCCs fulfilled

5

*Oversimplified version presented here, see Wikoff et al. (2019) for detail

From Wikoff to grounded integration

45

Empirical evidence of similarity from being embedded in biological knowledge, i.e. AOPs – systematic

development and assessment of certainty in AOPs

Annotation in the textual record of the study characteristics which are being interpreted in making

judgements about external validity

Internal validity model: study characteristics

predictive of systematic error in results

External validity model: study features predictive

of generalisability

Statistical approachesto quantifying

exposure/outcome relationships

Algorithm responding to large number of interdependent features

Task-dependent output

2 31

4

5

Grounding

Not e.g. KlimischSimilarity to tPECO rather

than KCC featureNot vote counting

(3) is already done computationally; we think we can extend a computational approach to (2) with the framework we are describing; the challenge is then in ensuring (2) is grounded. Note: the principles of our approach also apply to (1)

Machine learning rather than simple operation

Documents

Practical challenges in assessing indirectness and the ... · a. the textual record in the study documents of those procedures & observations b. empirical evidence of the validity