Cis-Regulatory/Text Mining Interface

Discussion

Questions

(1) What does ORegAnno want from text mining?
– Curation queue
– Document mark-up
– Mapping to database IDs

(2) What does text mining need from ORegAnno?

(3) What can text mining provide?
– What level of performance is needed?

(4) What is the right way to proceed?
– Data sets for BioCreAtIvE?
– Custom tools for individual “early adopters”?

Answers: (1) What Does ORegAnno Want from Text Mining?

• Management of the curation queue
– Ideally user-customized, so that each user annotates the documents of immediate interest to her/him
• Document mark-up to highlight relevant passages
– A workflow pipeline that makes the HTML or PDF version of the document available, with the (potentially) relevant terms highlighted
– Support for “cut and paste” transfer of relevant regions to the database comment fields
• Mapping to IDs and ontology codes
– Gene, transcription factor (protein), organism, cell and tissue type, evidence types
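
The term highlighting described above could look roughly like the following. This is only a minimal sketch, assuming plain text as input and a term list supplied by some upstream tagger; the function name `highlight_terms` and the use of `<mark>` tags are illustrative choices, not part of any ORegAnno tool.

```python
import html
import re


def highlight_terms(text, terms):
    """Return HTML in which each occurrence of a term is wrapped in <mark>.

    Hypothetical helper: `terms` would come from a gene/TF tagger.
    Matching is case-insensitive and restricted to whole tokens.
    """
    if not terms:
        return html.escape(text)
    # Try longer terms first so a long name wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    pieces, last = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text, then wrap the matched term.
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight_terms("GATA1 binds the HBB promoter.", ["GATA1", "HBB"]))
```

Real documents would need PDF-to-text conversion and term normalization first; the sketch covers only the final highlighting step.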

Answers: (2) What Does Text Mining Need from ORegAnno?

• A significant quantity of reliably annotated data to train text mining systems
– Annotated at a level useful for natural language processing (e.g., marked for evidence at the phrase, sentence, or passage level, depending on the task)
• This requires that ORegAnno have:
– A clear statement of the scope of the ORegAnno database and a stable set of annotation guidelines
– Annotations with high inter-annotator agreement
– Tracking of entries by annotator, including depth of annotation (different annotators will annotate to different levels of detail, depending on their interests)
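
Inter-annotator agreement can be quantified in several ways; the slides do not say which measure was used. As one common option, the sketch below computes Cohen's kappa for two annotators who labeled the same items (for example, sentences marked as evidence or not). The function name and the label values are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["evidence", "evidence", "other", "other"]
    annotator_2 = ["evidence", "other", "other", "other"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa applies to categorical labels on a shared set of items; agreement on free-text fields or on passage boundaries would need a different comparison.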

Answers: (3) What Can Text Mining Provide?

• Curation queue management
– Document classification approaches (e.g., from TREC Genomics or BioCreAtIvE) can be applied and evaluated, making use of new training data from pre-jamboree and jamboree annotation
– We can experiment with “user-defined” criteria, based on restrictions for gene, transcription factor, organism, tissue, etc.
• Document mark-up
– Users could be provided with a list of the genes/transcription factors in a paper, with hot links into the paper to find relevant passages
– This would allow the annotator to drive the annotation process, selecting only those annotations that are correct and relevant. The selections in turn provide feedback: ORegAnno annotations can be used to validate and train the text mining
– Such a tool should make it easy for the annotator to supply the underlying text passages as evidence for the annotation, which provides more training data
• Mapping to unique identifiers/controlled vocabulary/ontology
– For each entity type (gene, transcription factor, organism, tissue type...), a tool can provide a mapping to the correct identifier; where there is possible ambiguity, the tool could provide a ranked list for the annotator to choose from
– A tool can also flag different evidence types, with suggested code(s)
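
The ranked list for ambiguous mentions might be produced along the following lines. This is a rough sketch that scores candidates by string similarity using the standard-library `difflib`; a real system would more likely use context and curated synonym lists. The lexicon, its `EX:` identifiers, and the function name are all hypothetical.

```python
from difflib import SequenceMatcher


def rank_candidates(mention, lexicon, top_n=3):
    """Rank identifiers by how closely any of their names matches `mention`.

    `lexicon` maps an identifier to a non-empty list of names/synonyms.
    Returns up to `top_n` (identifier, score) pairs, best first.
    """
    needle = mention.lower()
    scored = []
    for identifier, names in lexicon.items():
        # Score an identifier by its best-matching name.
        best = max(
            SequenceMatcher(None, needle, name.lower()).ratio() for name in names
        )
        scored.append((best, identifier))
    # Highest score first; ties broken by identifier for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(identifier, round(score, 3)) for score, identifier in scored[:top_n]]


if __name__ == "__main__":
    # Hypothetical lexicon with made-up identifiers, for illustration only.
    example_lexicon = {
        "EX:1": ["GATA1", "GATA-binding factor 1"],
        "EX:2": ["GATA2"],
        "EX:3": ["TP53", "p53"],
    }
    for identifier, score in rank_candidates("gata1", example_lexicon):
        print(identifier, score)
```

Showing the scores alongside the identifiers lets the annotator see how close the alternatives are before choosing one.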

Answers: (4) How to Proceed?

• Stabilize the guidelines and redo the inter-annotator agreement experiment (and write it up)
• Prepare a gold-standard data set of expert-annotated data for training new annotators
• Collect a sufficient amount of training data for the various tasks (queue management, document mark-up, automated mapping)
• Develop an end-to-end pipeline (in the style of the FlySlip project) to capture whole documents in machine-readable form for mark-up

Recommendations: Training Materials & Tools

• Case studies and gold-standard annotated articles
• On-line training
– Perhaps with a way for new annotators to test themselves against a set of gold-standard annotations
– This will require automated comparison of annotations for certain fields
• Links to best tools
• Tools:
– Copy mechanism for largely duplicated records
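
The automated comparison against gold-standard annotations mentioned under on-line training could be sketched as below. It assumes each annotation is a simple record of field/value pairs keyed by a record ID; the field names, the record layout, and the function names are illustrative assumptions, not the ORegAnno schema.

```python
def _normalize(value):
    """Compare values case-insensitively, ignoring surrounding whitespace."""
    return str(value).strip().lower()


def compare_to_gold(trainee, gold, fields):
    """Per-field fraction of gold-standard values the trainee reproduced.

    `trainee` and `gold` map record_id -> {field: value}.
    A field with no gold values is reported as None.
    """
    report = {}
    for field in fields:
        matched = 0
        total = 0
        for record_id, gold_record in gold.items():
            if field not in gold_record:
                continue
            total += 1
            answer = trainee.get(record_id, {}).get(field)
            if answer is not None and _normalize(answer) == _normalize(gold_record[field]):
                matched += 1
        report[field] = matched / total if total else None
    return report


if __name__ == "__main__":
    # Made-up records for illustration only.
    gold = {
        "r1": {"gene": "HBB", "organism": "Homo sapiens"},
        "r2": {"gene": "Gata1", "organism": "Mus musculus"},
    }
    trainee = {
        "r1": {"gene": "hbb ", "organism": "Homo sapiens"},
        "r2": {"gene": "Gata2", "organism": "Mus musculus"},
    }
    print(compare_to_gold(trainee, gold, ["gene", "organism"]))
```

Exact matching after normalization suits fields with controlled values (identifiers, organism names, evidence codes); free-text comment fields would still need manual review.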

Documents
