Content-based Comparison for Collections Identification
Weijia Xu1, Ruizhu Huang1, Maria Esteva1, Jawon Song1, Ramona Walls2,
1 Texas Advanced Computing Center, University of Texas at Austin 2 Cyverse.org
IEEE BigData’16, Workshop on Computational Archival Science
Dec 8, 2016 @ Washington D.C.
Panorama
• Bio data collections evolve in a redundant, unstable, and distributed research environment
• Data is big, has multiple components at different stages of completion, and may be stored across repositories
• Pre- and post-publication events are difficult to document
• Auto-archiving is now a ubiquitous practice for research data
• Metadata is not enough to establish data uniqueness
Identifier Services (IDS) Research Project
• Automated lifecycle identifier management
• Use-case driven, with a focus on genomics data
• IDS services to:
  • Bind dispersed data objects
  • Track and represent provenance
  • Validate data location and integrity over time
  • Aid data identity
• Cyberinfrastructure to:
  • Deal with large data/metadata
  • Deal with data evolving over time
  • Deal with big data tasks in a distributed environment
• Goal: DATA AUTHENTICITY
IDS Architecture
[Diagram: users connect through the IDS Web front end (identifierservices.org) to an Agave tenant exposing the Metadata, Systems, Files, Apps, and Jobs APIs; computation runs on an HPC system and a public cloud, pulling from multiple data repositories]
1. The user registers repository data, providing access mechanism(s), and selects files from a registered storage system.
2. IDS queues the corresponding Agave apps.
3. A data app pulls the data from the repository and computes the analysis on a high performance computing system.
4. IDS updates the metadata with the data analysis results.
Content-based Comparison Service: Motivation
• To identify changes, connections, and differences between datasets
  • Infer provenance issues
  • Establish data identity
  • Promote data reuse
• How is the uniqueness of datasets determined today?
  • Data curators mostly use manual processes
  • They rely on metadata:
    • Data description
    • File checksums/fixity
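Checksums establish file-level fixity but say nothing about record-level equivalence when formats or orderings differ. A minimal sketch of the fixity check curators rely on, assuming SHA-256 via Python's standard `hashlib` (function names are illustrative, not IDS code):

```python
import hashlib

def sha256_checksum(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum by streaming the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_fixity(path_a, path_b):
    """Two files pass a fixity check only when their checksums match exactly."""
    return sha256_checksum(path_a) == sha256_checksum(path_b)
```

Note that two datasets with identical records but different compression or file layout would fail this check, which is precisely why a content-based comparison is needed.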
Questions
• If two datasets share the same metadata, are they the same dataset?
• If two similar or identical datasets have different metadata, are they two different datasets?
• In either case, how can curators apply globally unique identifiers and corresponding metadata?
Content-based Data Comparison
• Goals:
  • Provide automated methods to verify data identity
  • Provide additional information about two or more data collections
    • Provenance: documentation of data origin and changes
  • Increase data reuse
• Challenges:
  • Diverse data formats, e.g. FASTA vs. FASTQ vs. SRA
  • Different naming conventions and identifier structures
  • Performance and scalability
    • Tens of gigabytes on disk
    • Millions of records per dataset (all pairwise record comparisons are not feasible)
Algorithm Overview
• Determine the composition of the collections to be compared
• Decide on record readers: gff, fastq, fasta
• List record pairs for comparison
  • File sizes of 3 to 40 GB yield ~19 million to 165 million records to compare
• Compare all the records from the lists using the Spark framework
• Report results

Pipeline:
1. Collections analysis: determine the records in each collection (records list in A, records list in B)
2. Records analysis: determine the best pairs of records for comparison (pairs of records from each collection)
3. Records comparison: compare the records in each pair and produce the comparison report
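The three pipeline stages above can be sketched in plain Python. This is a toy serial version with illustrative names: stage 2 pairs records by shared ID, and stage 3 uses exact match as a stand-in for the configurable distance function; in the actual service, the comparison stage is distributed under Spark.

```python
def collections_analysis(collection):
    """Stage 1: read each collection into a {record_id: value} map."""
    return {rec_id: seq for rec_id, seq in collection}

def records_analysis(records_a, records_b):
    """Stage 2: pair records that share an ID; note the unmatched IDs."""
    shared = sorted(records_a.keys() & records_b.keys())
    only_a = sorted(records_a.keys() - records_b.keys())
    only_b = sorted(records_b.keys() - records_a.keys())
    pairs = [(records_a[k], records_b[k]) for k in shared]
    return pairs, only_a, only_b

def records_comparison(pairs):
    """Stage 3: compare each pair; exact match stands in for a distance."""
    return sum(1 for a, b in pairs if a == b)
```

A usage run on two tiny collections: pairing `[("r1", "ACGT"), ("r2", "GGGG")]` against `[("r1", "ACGT"), ("r3", "TTTT")]` yields one shared pair (`r1`), with `r2` and `r3` reported as existing in only one collection.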
Algorithm
• The comparison works as follows:
  • Convert each list of records into (ID, value) pairs
  • Sort the records by ID
  • Start with pointers a and b at the heads of lists A and B, respectively
  • At each step, evaluate a score function: if score(a, b) <= 0, record the result and move a forward; if score(a, b) > 0, record the result and move b forward
• Different distance functions can be used to compare records: Hamming distance, edit distance, prefix distance, etc.
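A minimal sketch of this merge-style linear pass, assuming both record lists are already sorted by ID, with Hamming distance as one of the interchangeable record-level distances (function and variable names are illustrative):

```python
def merge_compare(records_a, records_b, distance):
    """Single linear merge pass over two ID-sorted lists of (id, value).

    Returns (matches, only_a, only_b): matches maps each shared ID to the
    distance between its two values; only_a/only_b collect unmatched IDs.
    """
    i = j = 0
    matches, only_a, only_b = {}, [], []
    while i < len(records_a) and j < len(records_b):
        id_a, val_a = records_a[i]
        id_b, val_b = records_b[j]
        if id_a == id_b:                  # shared ID: compare the values
            matches[id_a] = distance(val_a, val_b)
            i += 1
            j += 1
        elif id_a < id_b:                 # a is behind: record and advance a
            only_a.append(id_a)
            i += 1
        else:                             # b is behind: record and advance b
            only_b.append(id_b)
            j += 1
    only_a.extend(id_ for id_, _ in records_a[i:])
    only_b.extend(id_ for id_, _ in records_b[j:])
    return matches, only_a, only_b

def hamming(a, b):
    """Hamming distance on the overlap, penalizing any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
```

Because each pointer only moves forward, the pass is O(|A| + |B|) record comparisons rather than the infeasible |A| x |B| of all-pairs comparison.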
Presentation of Results for Decision-making
• Number of records identified and compared
• Number of matched keys (IDs) and/or values (sequences) between collections
• Number of records that exist in only one of the collections (A or B)
• Degree of similarity of records, as a histogram of the match score distribution
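The match-score histogram in the report could be assembled with a helper like the following (a hypothetical sketch, not the service's actual reporting code), bucketing per-record similarity scores in [0, 1] into fixed-width bins:

```python
def score_histogram(scores, bins=10):
    """Bucket similarity scores in [0, 1] into `bins` equal-width bins."""
    counts = [0] * bins
    for s in scores:
        # A score of exactly 1.0 falls in the last bin rather than past it.
        counts[min(int(s * bins), bins - 1)] += 1
    return counts
```

For example, the scores `[0.0, 0.05, 0.44, 1.0]` fall into bins 0, 0, 4, and 9 respectively.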
Case 1
• The dataset: rice genomic variations
• Two copies (105 gff files) available at:
  • Cyverse Data Commons: direct connection to an HPC resource for data analysis
  • Dryad: integrates the data with the journal publication
• At a glance:
  • Each repository has a different functionality
  • Metadata does not match exactly
  • Different formats: compressed file vs. individual files
• Per the content-based comparison, both datasets are identical
• The metadata record in Cyverse will be updated to reflect the relationship between the datasets
Case 2
• A research group wants to publish a complete dataset resulting from the analysis of five maize lines
• The input data, whole genome bisulfite sequencing FASTQ files, have been published via SRA (Sequence Read Archive)
• At a glance:
  • The working copy consists of two FASTQ format files
  • The researchers believe the working copy is also available from SRA, 20.7 GB (http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR850328)
Results from Case 2
• Set A: 165 million sequences from the working copy
• Set B: 167 million sequences from the data archived with SRA
• From the results, the researchers were able to infer that the working copy had been processed by adaptive trimming (provenance)
• They concluded that the difference between the datasets was not significant; both the trimmed and un-trimmed collections could be considered the same "work"
• The datasets will receive different unique identifiers and clarifications in their metadata, and the identifiers will be related in the metadata
Case 3
• Two datasets published with almost identical metadata
• The datasets are referenced as:
  • http://www.onekp.com/samples/single.php?id=DDEV
  • http://www.onekp.com/samples/single.php?id=IFCJ
• At a glance:
  • Same dataset organization
  • 12 of 14 metadata fields contain identical information
  • No provenance information or stated relationship between the datasets
Results from Case 3
• Set A: solexa reads of DDEV from 1kp (13 million records)
• Set B: solexa reads of IFCJ from 1kp (18 million records)
• Prefix match score: e.g. for abcde and abdc, the shared prefix "ab" gives a match score of 2*2/(5+4) = 0.44
• [Figure: distribution of prefix match scores]
• Researchers could infer that:
  • One of the plant samples may be contaminated
  • The matching initial ~20% segments may be due to the use of the same sequencing primer
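The prefix match score in this case follows directly from the slide's formula, 2 * (shared prefix length) / (len(a) + len(b)); a minimal sketch (the function name is illustrative):

```python
def prefix_match_score(a, b):
    """Prefix match score: 2 * (shared prefix length) / (len(a) + len(b))."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 2 * common / (len(a) + len(b))
```

For the slide's example, `prefix_match_score("abcde", "abdc")` gives 2*2/9, about 0.44; identical sequences score 1.0 and sequences with no shared prefix score 0.0.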
Performance and Scalability
• All computations were conducted on Wrangler, the data intensive system at TACC
• Implementation uses the Spark data processing framework
• Data format detection uses the Biopython package API
• For comparison, mpiBLAST takes hours instead of minutes
• [Chart: execution time in seconds for the data reading, records match, and prefix check stages when comparing the two sequencing files of use case 3, on 4 vs. 8 nodes]
Conclusions
• Identity and provenance of data have to be managed over time, as copies and versions of a dataset are generated, reused, and published
• Metadata alone may not be enough to determine the uniqueness of the data
• Metadata needs to be updated, and relationships between datasets need to be clarified
• Determining data uniqueness requires content-based comparison
• Comparison results help curators understand the reasons for differences and similarities
• More automated, scalable computation services are needed for data curation
Thanks & Questions?
Acknowledgement
This work is supported through funding provided by the National Science Foundation via the following projects:
• Evaluating Identifier Services for the Lifecycle of Biological Data (#26100741)
• The iPlant Collaborative: Cyberinfrastructure for the Life Sciences (#1265383; data and use cases)
• Wrangler: A Transformational Data Intensive Resource for the Open Science Community (#1341711; computational resource support)