Content-based Comparison for Collections Identification
Weijia Xu1, Ruizhu Huang1, Maria Esteva1, Jawon Song1, Ramona Walls2,
1 Texas Advanced Computing Center, University of Texas at Austin 2 Cyverse.org
IEEE BigData’16, Workshop on Computational Archival Science
Dec 8, 2016 @ Washington D.C.
Panorama
• Bio data collections evolve in a redundant, unstable, and distributed research environment
• Data is big, has multiple components at different stages of completion, and may be stored across repositories
• Pre- and post-publication events are difficult to document
• Auto-archiving is now a ubiquitous practice for research data
• Metadata is not enough to establish data uniqueness
Identifier Services (IDS) Research Project
• Automated lifecycle identifier management
• Use-case driven, with a focus on genomics data
• IDS services to:
  • Bind dispersed data objects
  • Track and represent provenance
  • Validate data location and integrity over time
  • Aid data identity
• Cyberinfrastructure to:
  • Deal with large data/metadata
  • Deal with data evolving over time
  • Deal with big data tasks in a distributed environment
• Goal: DATA AUTHENTICITY
IDS Architecture
[Diagram: users connect through the IDS Web front end (identifierservices.org) to an Agave tenant exposing the Metadata, Systems, Files, Apps, and Jobs APIs; computation runs on an HPC system and a public cloud, pulling from multiple data repositories]
1. The user registers repository data, providing access mechanism(s), and selects files from a registered storage system.
2. IDS queues the corresponding Agave apps.
3. A data app pulls the data from the repository and computes the analysis on a high performance computing system.
4. IDS updates the metadata with the data analysis results.
Content-based Comparison Service: Motivation
• To identify changes, connections, and differences between datasets
  • Infer provenance issues
  • Establish data identity
  • Promote data reuse
• How is the uniqueness of datasets determined today?
  • Data curators mostly use manual processes
  • They rely on metadata:
    • Data description
    • File checksums/fixity
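Checksums establish file-level fixity but say nothing about record-level equivalence when formats or orderings differ. A minimal sketch of the fixity check curators rely on, assuming SHA-256 via Python's standard `hashlib` (function names are illustrative, not IDS code):

```python
import hashlib

def sha256_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum by streaming the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_fixity(path_a, path_b):
    """Two files pass a fixity check only when their checksums match exactly."""
    return sha256_checksum(path_a) == sha256_checksum(path_b)
```

Note that two datasets with identical records but different compression or file layout would fail this check, which is precisely why a content-based comparison is needed.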
Questions
• If two datasets share the same metadata, are they the same dataset?
• If two similar or identical datasets have different metadata, are they two different datasets?
• In either case, how can curators apply globally unique identifiers and corresponding metadata?
Content-based Data Comparison
• Goals:
  • Provide automated methods to verify data identity
  • Provide additional information about two or more data collections
    • Provenance: documentation of data origin and changes
  • Increase data reuse
• Challenges:
  • Diverse data formats, e.g. FASTA vs. FASTQ vs. SRA
  • Different naming conventions and identifier structures
  • Performance and scalability
    • Tens of gigabytes on disk
    • Millions of records per dataset (all pairwise record comparisons are not feasible)
Algorithm Overview
• Determine the composition of the collections to be compared
• Decide on record readers: gff, fastq, fasta
• List record pairs for comparison
  • File sizes of 3 to 40 GB yield ~19 million to 165 million records to compare
• Compare all the records from the lists using the Spark framework
• Report results

Pipeline:
1. Collections analysis: determine the records in each collection (records list in A, records list in B)
2. Records analysis: determine the best pairs of records for comparison (pairs of records from each collection)
3. Records comparison: compare the records in each pair and produce the comparison report
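The three pipeline stages above can be sketched in plain Python. This is a toy serial version with illustrative names: stage 2 pairs records by shared ID, and stage 3 uses exact match as a stand-in for the configurable distance function; in the actual service, the comparison stage is distributed under Spark.

```python
def collections_analysis(collection):
    """Stage 1: read each collection into a {record_id: value} map."""
    return {rec_id: seq for rec_id, seq in collection}

def records_analysis(records_a, records_b):
    """Stage 2: pair records that share an ID; note the unmatched IDs."""
    shared = sorted(records_a.keys() & records_b.keys())
    only_a = sorted(records_a.keys() - records_b.keys())
    only_b = sorted(records_b.keys() - records_a.keys())
    pairs = [(records_a[k], records_b[k]) for k in shared]
    return pairs, only_a, only_b

def records_comparison(pairs):
    """Stage 3: compare each pair; exact match stands in for a distance."""
    return sum(1 for a, b in pairs if a == b)
```

A usage run on two tiny collections: pairing `[("r1", "ACGT"), ("r2", "GGGG")]` against `[("r1", "ACGT"), ("r3", "TTTT")]` yields one shared pair (`r1`), with `r2` and `r3` reported as existing in only one collection.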
Algorithm
• The comparison works as follows:
  • Convert each list of records into (ID, value) pairs
  • Sort the records by ID
  • Start with pointers a and b at the heads of lists A and B, respectively
  • At each step, evaluate a score function: if score(a, b) <= 0, record the result and move a forward; if score(a, b) > 0, record the result and move b forward
• Different distance functions can be used to compare records: Hamming distance, edit distance, prefix distance, etc.
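A minimal sketch of this merge-style linear pass, assuming both record lists are already sorted by ID, with Hamming distance as one of the interchangeable record-level distances (function and variable names are illustrative):

```python
def merge_compare(records_a, records_b, distance):
    """Single linear merge pass over two ID-sorted lists of (id, value).

    Returns (matches, only_a, only_b): matches maps each shared ID to the
    distance between its two values; only_a/only_b collect unmatched IDs.
    """
    i = j = 0
    matches, only_a, only_b = {}, [], []
    while i < len(records_a) and j < len(records_b):
        id_a, val_a = records_a[i]
        id_b, val_b = records_b[j]
        if id_a == id_b:                  # shared ID: compare the values
            matches[id_a] = distance(val_a, val_b)
            i += 1
            j += 1
        elif id_a < id_b:                 # a is behind: record and advance a
            only_a.append(id_a)
            i += 1
        else:                             # b is behind: record and advance b
            only_b.append(id_b)
            j += 1
    only_a.extend(id_ for id_, _ in records_a[i:])
    only_b.extend(id_ for id_, _ in records_b[j:])
    return matches, only_a, only_b

def hamming(a, b):
    """Hamming distance on the overlap, penalizing any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
```

Because each pointer only moves forward, the pass is O(|A| + |B|) record comparisons rather than the infeasible |A| x |B| of all-pairs comparison.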
Presentation of Results for Decision-making
• Number of records identified and compared
• Number of matched keys (IDs) and/or values (sequences) between collections
• Number of records that exist in only one of the collections (A or B)
• Degree of similarity of records, as a histogram of the match score distribution
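The match-score histogram in the report could be assembled with a helper like the following (a hypothetical sketch, not the service's actual reporting code), bucketing per-record similarity scores in [0, 1] into fixed-width bins:

```python
def score_histogram(scores, bins=10):
    """Bucket similarity scores in [0, 1] into `bins` equal-width bins."""
    counts = [0] * bins
    for s in scores:
        # A score of exactly 1.0 falls in the last bin rather than past it.
        counts[min(int(s * bins), bins - 1)] += 1
    return counts
```

For example, the scores `[0.0, 0.05, 0.44, 1.0]` fall into bins 0, 0, 4, and 9 respectively.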
Case 1
• The dataset: rice genomic variations
• Two copies (105 gff files) available at:
  • Cyverse Data Commons: direct connection to an HPC resource for data analysis
  • Dryad: integrates the data with the journal publication
• At a glance:
  • Each repository has a different functionality
  • Metadata does not match exactly
  • Different formats: compressed file vs. individual files
• Per the content-based comparison, both datasets are identical
• The metadata record in Cyverse will be updated to reflect the relationship between the datasets
Case 2
• A research group wants to publish a complete dataset resulting from the analysis of five maize lines
• The input data, whole genome bisulfite sequencing FASTQ files, have been published via SRA (Sequence Read Archive)
• At a glance:
  • The working copy consists of two FASTQ format files
  • The researchers believe the working copy is also available from SRA, 20.7 GB (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR850328)
Results from Case 2
• Set A: 165 million sequences from the working copy
• Set B: 167 million sequences from the data archived with SRA
• From the results, the researchers were able to infer that the working copy had been processed by adaptive trimming (provenance)
• They concluded that the difference between the datasets was not significant; both the trimmed and un-trimmed collections could be considered the same "work"
• The datasets will receive different unique identifiers and clarifications in their metadata, and the identifiers will be related in the metadata
Case 3
• Two datasets published with almost identical metadata
• The datasets are referenced as:
  • http://www.onekp.com/samples/single.php?id=DDEV
  • http://www.onekp.com/samples/single.php?id=IFCJ
• At a glance:
  • Same dataset organization
  • 12 of 14 metadata fields contain identical information
  • No provenance information or stated relationship between the datasets
Results from Case 3
• Set A: solexa reads of DDEV from 1kp (13 million records)
• Set B: solexa reads of IFCJ from 1kp (18 million records)
• Prefix match score: e.g. for abcde and abdc, the shared prefix "ab" gives a match score of 2*2/(5+4) = 0.44
• [Figure: distribution of prefix match scores]
• Researchers could infer that:
  • One of the plant samples may be contaminated
  • The matching initial ~20% segments may be due to the use of the same sequencing primer
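The prefix match score in this case follows directly from the slide's formula, 2 * (shared prefix length) / (len(a) + len(b)); a minimal sketch (the function name is illustrative):

```python
def prefix_match_score(a, b):
    """Prefix match score: 2 * (shared prefix length) / (len(a) + len(b))."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 2 * common / (len(a) + len(b))
```

For the slide's example, `prefix_match_score("abcde", "abdc")` gives 2*2/9, about 0.44; identical sequences score 1.0 and sequences with no shared prefix score 0.0.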
Performance and Scalability
• All computations were conducted on Wrangler, the data intensive system at TACC
• Implementation uses the Spark data processing framework
• Data format detection uses the Biopython package API
• For comparison, mpiBLAST takes hours instead of minutes
• [Chart: execution time in seconds for the data reading, records match, and prefix check stages when comparing the two sequencing files of use case 3, on 4 vs. 8 nodes]
Conclusions
• Identity and provenance of data have to be managed over time, as copies and versions of a dataset are generated, reused, and published
• Metadata alone may not be enough to determine the uniqueness of the data
• Metadata needs to be updated, and relationships between datasets need to be clarified
• Determining data uniqueness requires content-based comparison
• Comparison results help curators understand the reasons for differences and similarities
• More automated, scalable computation services are needed for data curation
Thanks & Questions?
Acknowledgement
This work is supported through funding provided by the National Science Foundation via the following projects:
• Evaluating Identifier Services for the Lifecycle of Biological Data (#26100741)
• The iPlant Collaborative: Cyberinfrastructure for the Life Sciences (#1265383; data and use cases)
• Wrangler: A Transformational Data Intensive Resource for the Open Science Community (#1341711; computational resource support)