BioMart Query Network

Arek Kasprzyk, European Bioinformatics Institute, 8 January 2005

Biological databases

• Distributed
• Different format
• Different focus
• Different release schedule
• Scalability factor

BioMart

[Figure: BioMart retrieval overview. Labels: myDatabase, SNP, Vega, Ensembl, UniProt, MSD, myMart; BioMart API (Java, Perl); MartExplorer, MartShell, MartView.]

[Figure: BioMart system overview. Labels: Schema transformation, MartBuilder; XML, MartEditor, Configuration; Databases, Public data (local or remote), BioMart@Ensembl; MartView, MartShell, MartExplorer.]

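Database

[Figure: source database tables connected through primary keys (PK) and foreign keys (FK).]

Schema

[Figure: schema diagram, tables connected through PK and FK.]

Schema

[Figure: schema diagram, tables connected through PK and FK.]

Schema

[Figure: schema diagram with main tables (labels main1, 2) keyed by PK1 and PK2, and dimension tables (dm) carrying FK1 and FK2.]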

Schema - ‘reversed star’

Fixed schema transformation

[Figure: diagram with labels A, B, C, TA, TB.]

Schema transformation

• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ a 1:n table
  – Link tables are first decomposed into a set of 1:n relations
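One way to read "longest n:1, 1:1 path" is as a search over the relation graph that follows only 1:1 and n:1 edges. The sketch below is that reading only, not MartBuilder's code. The `relations` structure, the function name, the `exon` table and the cardinalities shown are all illustrative assumptions; the codes `11`, `n1`, `1n` follow the MartBuilder prompts later in the deck.

```python
def longest_merge_path(relations, start, seen=None):
    """Return the longest chain of tables reachable from `start`
    by following only 1:1 ("11") and n:1 ("n1") relations.

    `relations` maps a table name to a list of (other_table, cardinality).
    This layout is an assumption made for the sketch.
    """
    seen = (seen or frozenset()) | {start}
    best = [start]
    for other, cardinality in relations.get(start, []):
        if cardinality in ("11", "n1") and other not in seen:
            candidate = [start] + longest_merge_path(relations, other, seen)
            if len(candidate) > len(best):
                best = candidate
    return best


# Illustrative relations only; not taken from a real schema dump.
EXAMPLE_RELATIONS = {
    "transcript": [("gene", "n1"), ("exon", "1n")],
    "gene": [("analysis", "n1"), ("gene_description", "11")],
}
```

Calling `longest_merge_path(EXAMPLE_RELATIONS, "transcript")` walks transcript, gene, analysis; the 1:n relation to `exon` is left out of the chain, which is where a dimension table would come in.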

MartBuilder

• Input
  – central object
  – database meta data
  – cardinalities
• Output
  – Set of SQL statements:
      • “create table as select …”
• Transformations
  – represented as asymmetric tree
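The input/output contract above can be sketched in a few lines: given a central table and the tables related to it 1:1 or n:1, emit a chain of "create table as select" statements. This is a minimal sketch, not MartBuilder's implementation. The function name, the `(table, key, columns)` tuple layout and the toy rows are assumptions; SQLite stands in for MySQL or Oracle so the statements can be run locally.

```python
import sqlite3


def build_main_table_sql(central, merges, final_name):
    """Fold each table in `merges` into `central`, one join per statement.

    `merges` is a list of (table, key, columns) tuples (assumed layout).
    Intermediate results are named TEMP0, TEMP1, ... and dropped once used.
    """
    statements = []
    current = central
    for i, (table, key, columns) in enumerate(merges):
        is_last = i == len(merges) - 1
        target = final_name if is_last else f"TEMP{i}"
        extra = ", ".join(f"{table}.{col}" for col in columns)
        statements.append(
            f"CREATE TABLE {target} AS SELECT {current}.*, {extra} "
            f"FROM {current}, {table} "
            f"WHERE {table}.{key} = {current}.{key}"
        )
        if current != central:
            # The previous temporary table is no longer needed.
            statements.append(f"DROP TABLE {current}")
        current = target
    return statements


def demo():
    """Run the generated statements on toy data (values are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (gene_id INTEGER, type TEXT);
        CREATE TABLE gene_description (gene_id INTEGER, description TEXT);
        CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT, version INTEGER);
        INSERT INTO gene VALUES (1, 'protein_coding');
        INSERT INTO gene_description VALUES (1, 'example gene');
        INSERT INTO gene_stable_id VALUES (1, 'GENE0001', 1);
    """)
    statements = build_main_table_sql(
        "gene",
        [("gene_description", "gene_id", ["description"]),
         ("gene_stable_id", "gene_id", ["stable_id", "version"])],
        "hsapiens_gene_ensembl__gene__MAIN",
    )
    for sql in statements:
        conn.execute(sql)
    rows = conn.execute(
        "SELECT * FROM hsapiens_gene_ensembl__gene__MAIN").fetchall()
    return statements, rows
```

With two merged tables the generator yields three statements: create TEMP0, create the final main table, drop TEMP0.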

MartBuilder

DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:

CREATE TABLE TEMP0 AS
SELECT gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
       gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
       gene.display_xref_id,
       gene_description.gene_id AS gene_id_TEMP0, gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS
SELECT TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
       TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
       TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
       gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
       gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;

DROP TABLE TEMP0;

Transformation configuration

satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats M repeats repeats S
satellog_repeats M repeats transcripts S
satellog_repeats M repeats ugcount S
satellog_repeats M repeats ugstats S
satellog_repeats M repeats rep_class n1
satellog_repeats D ugcount ugcount S
satellog_repeats D ugcount ugstats S
satellog_repeats D ugcount gc S
satellog_repeats D ugcount repeats n1r
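The configuration rows above appear to carry five whitespace-separated fields. A small parser makes that structure explicit. The field names (`dataset`, `table_type`, `central`, `referenced`, `cardinality`) are assumptions inferred from the MartBuilder prompts, and treating `S` as "skip" follows the `[SKIP S]` option shown there; the slides do not document the file format itself.

```python
from collections import namedtuple

# Field names are assumed; the slide shows only the raw rows.
Rule = namedtuple("Rule", "dataset table_type central referenced cardinality")


def parse_transformation_config(text):
    """Parse whitespace-separated configuration rows into Rule tuples."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or malformed lines
        rules.append(Rule(*fields))
    return rules


def tables_to_merge(rules, table_type, central):
    """List referenced tables that are not skipped for a given target table."""
    return [r.referenced for r in rules
            if r.table_type == table_type
            and r.central == central
            and r.cardinality != "S"]


# A few rows copied from the configuration shown on the slide.
SAMPLE = """\
satellog_repeats M repeats disease n1
satellog_repeats M repeats gc 11
satellog_repeats M repeats linkage_depth S
satellog_repeats D ugcount repeats n1r
"""
```

For the sample rows, `tables_to_merge(parse_transformation_config(SAMPLE), "M", "repeats")` keeps `disease` and `gc` and drops the skipped `linkage_depth`.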

Data access

Dataset – Key Abstraction

• Dataset
  – Organised into a single schema
  – BioMart database contains one or more dataset(s)
  – Attribute
  – Filter
  – Exportable/Importable (Links)

• Dataset - an equivalent of relational table
  – Exportable/Importable = PK/FK

Key Abstractions

[Figure: table GENE CENTRAL with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description; annotated with the labels Mart, Dataset, Attribute, Filter.]

Exportables, Importables and Links

• Exportable = ordered list of attributes
• Importable = ordered list of filters
  – WHERE filt1=value1
  – WHERE filt1=value1 or filt1=value2
  – WHERE filt1>value1 and filt2<value2
• Links = matching importable and exportable
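The exportable/importable pairing above behaves like a join across datasets: values projected through one dataset's exportable become filter values for the other dataset's importable. The sketch below shows that idea on in-memory rows. It is not the BioMart API; the function names, the dictionary row layout and the sample records are all hypothetical.

```python
def export_values(rows, exportable):
    """Project each row onto the exportable's ordered attribute list."""
    return {tuple(row[attr] for attr in exportable) for row in rows}


def import_filter(rows, importable, values):
    """Keep rows whose importable filter columns match an exported tuple."""
    return [row for row in rows
            if tuple(row[flt] for flt in importable) in values]


def link(source_rows, exportable, target_rows, importable):
    """Link two datasets: an exportable must match an importable in length."""
    if len(exportable) != len(importable):
        raise ValueError("exportable and importable must have the same length")
    exported = export_values(source_rows, exportable)
    return import_filter(target_rows, importable, exported)


# Hypothetical toy datasets; identifiers are made up.
GENES = [
    {"gene_stable_id": "GENE0001", "chromosome": "1"},
    {"gene_stable_id": "GENE0002", "chromosome": "2"},
]
SNPS = [
    {"snp_id": "snp_a", "gene_id": "GENE0001"},
    {"snp_id": "snp_b", "gene_id": "GENE0003"},
]
```

Here `link(GENES, ["gene_stable_id"], SNPS, ["gene_id"])` returns only the `snp_a` record, the equivalent of a `WHERE filt1=value1 or filt1=value2` clause built from the exported gene identifiers.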

MartView

Dataset Configuration

• Dataset configuration
• Attributes
• Filters
• Trees, Groups, Collections
• Links
• Semantics
• Relational mapping

• User interface
• Linking datasets
• XML-based

Dataset Configuration

[Figure: dataset configuration diagram with three elements labelled XML.]

Table naming convention

Naïve configuration

• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
  – Boolean filter: _bool
  – List filter: _list
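Because the convention is purely name-based, a naïve configuration can in principle be derived by inspecting table and column names. The sketch below classifies names by the suffixes listed above. The function names and return shapes are assumptions, as are the case-insensitive suffix match (the generated table earlier in the deck ends in `__MAIN`) and the choice to label every other column an "attribute".

```python
def classify_table(name):
    """Classify a table name using the dataset__content__type convention."""
    if name.startswith("meta_"):
        return {"kind": "meta", "content": name[len("meta_"):]}
    parts = name.split("__")
    if len(parts) == 3 and parts[2].lower() in ("main", "dm"):
        dataset, content, suffix = parts
        kind = "main" if suffix.lower() == "main" else "dimension"
        return {"kind": kind, "dataset": dataset, "content": content}
    return {"kind": "unknown"}


def classify_column(name):
    """Classify a column by its suffix; anything else is assumed to be
    a plain attribute (an assumption of this sketch)."""
    for suffix, kind in (("_key", "key"),
                         ("_bool", "boolean filter"),
                         ("_list", "list filter")):
        if name.endswith(suffix):
            return kind
    return "attribute"
```

For example, `classify_table("hsapiens_gene_ensembl__gene__MAIN")` reports a main table of dataset `hsapiens_gene_ensembl` with content `gene`.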

MartEditor

MartEditor

• Naïve configuration
• Updates
• Links
• Automatic discovery of new tables

Class diagram - configuration

Class diagram - querying

Information flow

• Read connections
• Register individual datasets and create linked datasets
• Get input from the user, split queries to individual datasets
• Find the shortest path between datasets (Dijkstra)
• Compile SQL
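The shortest-path step can be illustrated with a standard Dijkstra search over a graph whose nodes are datasets and whose edges are links. This is a generic textbook version, not the BioMart library code. The graph below, including which datasets are linked and the unit edge weights, is a made-up example; only the dataset names echo those mentioned in the deck.

```python
import heapq


def shortest_path(links, start, goal):
    """Dijkstra's algorithm over a dict of {dataset: {neighbour: cost}}.

    Returns (total_cost, [datasets on the path]) or None if unreachable.
    """
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in links.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(
                    queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Hypothetical link graph; every link is given cost 1 for illustration.
LINKS = {
    "snp": {"ensembl": 1},
    "ensembl": {"snp": 1, "vega": 1, "uniprot": 1},
    "vega": {"ensembl": 1},
    "uniprot": {"ensembl": 1, "msd": 1},
    "msd": {"uniprot": 1},
}
```

In this toy graph a query spanning `snp` and `msd` would be routed through `ensembl` and `uniprot`, and the per-dataset queries would then be compiled to SQL along that path.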

Summary

BioMart

• Domain independent
• Platform independent
  – MySQL 4
  – Oracle 9i
• Plugin architecture

BioMart model

• Already applied
  – Ensembl
  – Vega
  – dbSNP
  – UniProt
  – MSD
  – Variety of small projects
• In development
  – ArrayExpress
  – WormBase
  – RGD

Future work

• BioMart v 0.2 to be released later in January

• Java library to be upgraded over coming months to the new architecture

• BioMart has been integrated with Taverna

• MartBuilder - to be properly implemented

BioMart

• www.ebi.ac.uk/biomart
• Open source (LGPL)
• Public MySQL server
• ftp
• mart-dev@ebi.ac.uk
• mart-announce@ebi.ac.uk

Acknowledgments

• BioMart
  – Damian Smedley
  – Darin London
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)
  – Will Spooner (CSHL)
