An Identity Crisis in the Life Sciences
Jun Zhao, Carole Goble and Robert Stevens
The University of Manchester, UK
Thanks to: Tom Oinn, Matthew Pocock, Daniele Turi
And our users
And the EPSRC
UK e-Science project
Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
Bioinformatics workflows
Taverna workflow workbench
collected metabolic pathway
computed BLAST report
• Data pipelines
• Collect data
• Compute data
• Frequently updated public resources
• Open world
• Get the same data product in different experiment contexts
Bioinformatician users
[Figure: provenance graph. Data generated by services/workflows (urn:data1, urn:data2, urn:data:3, urn:data12, urn:data:f1, urn:data:f2 and the hits urn:hit1… to urn:hit50…) is linked to service invocations (urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5) and to concepts (Blast_report, SwissProt_seq, Sequence_hit, the task "Find similar sequence"). Properties shown: [input], [output], [instanceOf], [type], [hasName], [hasHits], [contains], [performsTask], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom]. Literal names: "New sequence", "Missed sequence". Legend: Concept, Data, Services, Properties, literals, LSDatum, DatumCollection.]
Fusion between different data models using shared concepts and shared data
[Figure: two provenance data models fused through shared concepts and shared data. Data and runs: urn:data2, urn:data:3, urn:BlastNInvocation3, urn:run5, urn:run7, urn:williamsA, urn:williamsB, urn:genbank1… to urn:genbank50…. Concepts: Blast_report, DNA_sequence, Blast_service. Properties: instanceOf, inputOf, outputOf, runOf, createdBy, createdFrom, contains_similiar_seq_to. Other labels: GenBank, UniProt, LSID.]
[Figure: the provenance graph of data, services and concepts, shown again.]
Add assertions, Add rules
Reason over assertions
Putting Provenance to Use
• Single workflow
  – audit trail
  – recipe
• Multiple workflow runs (versions)
  – Aggregation - gathering
  – Integration - merging
  – Comparison - differencing
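
Aggregation and comparison across runs can be pictured as set operations over provenance statements. The following is only a minimal sketch, assuming provenance is held as (subject, property, object) triples; the `urn:example:` identifiers and the function names are invented for illustration and are not taken from the system described in the slides.

```python
# Illustrative provenance statements as (subject, property, object) triples.
# All identifiers below are made up for this sketch.
run1 = {
    ("urn:example:report1", "my:derivedFrom", "urn:example:seq"),
    ("urn:example:hits1", "my:derivedFrom", "urn:example:report1"),
}
run2 = {
    ("urn:example:report2", "my:derivedFrom", "urn:example:seq"),
    ("urn:example:hits2", "my:derivedFrom", "urn:example:report2"),
}


def aggregate(*runs):
    """Aggregation: gather the statements of several runs into one set."""
    return set().union(*runs)


def compare(a, b):
    """Comparison: statements found only in a, and statements found only in b."""
    return a - b, b - a


def nodes(graph):
    """All subjects and objects mentioned in a set of triples."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}


combined = aggregate(run1, run2)
only_run1, only_run2 = compare(run1, run2)
common = nodes(run1) & nodes(run2)  # the only node both runs mention
```

In this toy data the two runs share a single node, the input sequence, because every computed product received a fresh identifier in each run.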
Any idea?
• 30350027
• gi:30350027
Life Science Identifier
A ruddy great lump of RDF
URIs for Data
urn:lsid:mygrid.ac.uk:data:49841:1
• Life Science Identifier
• Protocol for allocation and resolution
• Adopted by a range of data providers
• LSIDs from the data providers' databases for the data we collect during workflow execution
• LSIDs for the data products we compute during workflow execution
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
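
An LSID has the form urn:lsid:authority:namespace:object, with an optional trailing revision. As a rough sketch, the example identifier above can be split into those parts as follows; the `LSID` tuple and `parse_lsid` helper are names made up here, and the check is deliberately simpler than full validation against the specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:<authority>:<namespace>:<object>[:<revision>] into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:mygrid.ac.uk:data:49841:1")
```

Shortened identifiers of the kind used on the later slides, such as urn:lsid:tav:ic531, lack a namespace and would be rejected by this sketch.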
Having a BLAST in every workflow!
[Figure: example workflow. Services: BLAST, BLAST_simplifer, GenBank_retrieve. Data labels: Seq, database, score, BlastReport, A list of Sequences, GenBankReport. Example: Alignment of sequence AC005089.]
Computed Collections and Collected data items
[Figure: repeated workflow runs. Each BLAST Report lists either Sequence1, Sequence2 and Sequence3 or Sequence1, Sequence2 and Sequence4; each BLASTsimplifer step yields a SEQ listOf collection.]
Equivalent data
Corresponding data
Data Co-references
Context of the workflow
Aggregation of repeated runs
[Figure: provenance of Run1 and Run2 for sequence AC005089. Identifiers: urn:lsid:tav:ic531, urn:lsid:tav:ic537, urn:lsid:tav:ic538, urn:lsid:tav:57b6, urn:lsid:tav:57b13, urn:lsid:tav:57b14. Types (rdf:type): BLASTReport, DNASeq. Properties: derivedFrom, refersTo.]
External Duplicates
gi:15145617
ac073846
urn:lsid:myg:ac073846
mmu:11423
Different providers
A replica
Different tool providers
Sequence
LSID Assignment Process
Workflow enactor
Provenance service
Data service
External domain service
Data storage group
wfEvents
Taverna LSID Authority
MySQL relational stores
KAVE
BAKLAVA
CustomizedDB
CustomizedDB
Jena/Sesame RDF store
Equivalent data in repeated runs
Duplicate ids for these data
Provenance from two repeated runs
[Figure: Run1 and Run2 give parallel graphs with no convergence. Run1: urn:lsid:tav:brpt1, urn:lsid:tav:seqcollection1, urn:lsid:tav:seq1. Run2: urn:lsid:tav:brpt2, urn:lsid:tav:seqcollection2, urn:lsid:tav:seq2. Properties: my:derivedFrom, my:hasElement. Shared input: Sequence1 (urn:gb:seq1).]
Execution duplicates
[Figure: the workflow (BLAST → BlastReport → BLAST_simplifer → A list of Seq → GenBank_retrieve) run twice on urn:gb:seq1. Each run yields a BLAST report and a SEQ1 listOf collection of Sequence1, Sequence2 and Sequence3. The collections are urn:tav:seqc1 and urn:tav:seqc2, with urn:tav:seq1 and urn:tav:seq2 for Sequence1. The duplicates exist but are hidden ("But hidden!!").]
Managing identity co-reference
• Identity co-reference:
  – Identifying duplicate identities that refer to the same object while keeping their context
• An approach:
  – An IDSet entity
    • Identity equivalence for collected data
    • Identity correspondence for computed data
  – An identity service
  – Identity normalisation and cleansing activity
IDSet entity
• IDSet(BLASTrpt) = {{urn:tav:brpt1}, {urn:tav:brpt3}}
[Figure: an IDSet can be queried by an identity (urn:gb:seq1, a Sequence) or by content. Labels: IDSet1, merge, IDSet created by another organization, IDSet3, urn:lsid:tav:brpt1 (BLASTreport).]
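
One way to picture an IDSet and the merge step is a store that maps every identifier to the set of identifiers it co-refers with. This is a sketch under that assumption only: the `IDSetStore` class and its methods are hypothetical, and the slides do not say how IDSets are actually represented or merged.

```python
class IDSetStore:
    """Hypothetical store mapping each identifier to its IDSet."""

    def __init__(self):
        self._by_id = {}

    def idset_of(self, identifier):
        # An identifier seen for the first time starts in an IDSet of its own.
        return self._by_id.setdefault(identifier, {identifier})

    def merge(self, a, b):
        """Declare a and b co-referent by merging their IDSets."""
        set_a, set_b = self.idset_of(a), self.idset_of(b)
        if set_a is set_b:
            return set_a
        merged = set_a | set_b
        for member in merged:
            self._by_id[member] = merged  # every member points at the merged set
        return merged


store = IDSetStore()
store.merge("urn:tav:brpt1", "urn:tav:brpt3")
```

No identifier is discarded by a merge, so the context attached to each original identity stays reachable.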
Extended Architecture
Workflow enactor
Provenance service
Data service
External domain service
Data storage group
wfEvents
Taverna LSID Authority
MySQL relational stores
BAKLAVA
CustomizedDB
CustomizedDB
Identity service
KAVE+
Jena/Sesame RDF store
MySQL relational store
Identity store
KAVE
Identifying collected product
[Figure: the identity service, the identity store and KAVE+ handling urn:gb:seq1]
1. Receive an identity
2. Look for or create its IDSet
3. Store the id and the IDSet
Identifying a collection product
[Figure: the identity service, the identity store and KAVE+ handling the collection urn:lsid:seqc1 (Seq1, Seq2, Seq3) and the SEQ2 listOf collection urn:lsid:seqc2; the IDSet holds urn:lsid:seqc1 and urn:lsid:seqc2]
1. Receive an identity
2. Look for or create its IDSet
3. Store the id and the IDSet
• Look for equivalent collection
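
The steps for collected items and for collections might be sketched as below. Everything here is an assumption made for illustration: the `IdentityService` class is an in-memory stand-in, urn:gb:seq2 and urn:gb:seq3 are invented member identifiers, and two collections are treated as equivalent when they hold the same members regardless of order, a criterion the slides do not spell out.

```python
class IdentityService:
    """Hypothetical in-memory stand-in for the identity service and its store."""

    def __init__(self):
        self.store = {}          # identity -> its IDSet
        self._collections = {}   # frozenset of member ids -> IDSet of collection ids

    def identify_item(self, identity):
        idset = self.store.get(identity)   # 2. look for its IDSet ...
        if idset is None:
            idset = {identity}             #    ... or create it
            self.store[identity] = idset   # 3. store the id and the IDSet
        return idset

    def identify_collection(self, collection_id, member_ids):
        members = list(member_ids)
        for member in members:
            self.identify_item(member)
        # Assumption: same members, in any order, means an equivalent collection.
        key = frozenset(members)
        idset = self._collections.setdefault(key, set())
        idset.add(collection_id)
        self.store[collection_id] = idset
        return idset


service = IdentityService()
members = ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq3"]
service.identify_collection("urn:lsid:seqc1", members)
idset = service.identify_collection("urn:lsid:seqc2", members)
```

After the second call both collection identifiers sit in one IDSet, while a collection with different members would start an IDSet of its own.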
Putting the Identity Service to Use
Provenance Integration
Provenance Aggregation
Identity Management
Provenance Normalization
[Figure: Run1 (b1, c1, s1) and Run2 (b2, c2, s2)]
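
Provenance normalisation can be sketched as rewriting every identifier to one representative of its IDSet before the runs are aggregated. The triples below reuse the b, c and s labels from the figure, but the links between them, the choice of the smallest identifier as representative, and the `normalise` function are all assumptions made for this example.

```python
def normalise(triples, idsets):
    """Rewrite each identifier to a representative of its IDSet."""
    # Assumption: the lexicographically smallest member stands for the IDSet.
    canonical = {member: min(idset) for idset in idsets for member in idset}
    return {
        (canonical.get(s, s), p, canonical.get(o, o))
        for s, p, o in triples
    }


# Illustrative statements for two repeated runs.
run1 = {("c1", "my:derivedFrom", "b1"), ("c1", "my:hasElement", "s1")}
run2 = {("c2", "my:derivedFrom", "b2"), ("c2", "my:hasElement", "s2")}

# IDSets as an identity service might report them.
idsets = [{"b1", "b2"}, {"c1", "c2"}, {"s1", "s2"}]

raw = run1 | run2
merged = normalise(run1, idsets) | normalise(run2, idsets)
```

Without normalisation the aggregate keeps four unrelated statements; with it the two runs collapse onto the same two statements, and the original identifiers remain recoverable from the IDSets.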
Discussion
• Scalability issues:
  – Normalizing provenance graphs
  – Building IDSet for collections with multiple hierarchies
• Open world data type-free context
• Use experimental context more effectively: workflows are not independently executed.
• Granularity of identity
• Identity aware operations in workflow
• Multiple naming schemes
• Migration duplicates
• Compacting data results
Conclusion
• Combining provenance depends on finding points of commonality, such as data identity.
• Duplicate identities will occur in an open world
• Hard to achieve uniqueness without community commitment
• Different types of equivalent objects
• How much can be avoided?
• And how much has to be repaired?