
Creation and Maintenance of GeneKeyDB
Research being conducted by Kevin Kastner
Under the direction of Dr. Erich Baker

Page 2: The Problem

- There exist thousands of biomedical data sources.
- In 2006, there were ~557 relevant public resources in molecular biology.
- This is growing rapidly.
  - 203 sources in 1999
  - 226 sources in 2000
  - 277 sources in 2001

Page 3: The Problem

- Traditional database approaches are too structured.
- Scientific objects change identification over time.
- Gene names change over time.
- The Human Genome Nomenclature Database (HUGO) contains 13,594 active symbols, 9,635 literature aliases, and 2,739 withdrawn symbols.
- SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
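
The renaming problem can be made concrete with a small lookup sketch. This is only an illustration: `SymbolRecord`, `build_index`, and `resolve` are hypothetical names, not part of GeneKeyDB or HUGO, and the only data used is the SIR2L1/SIRT1 example from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolRecord:
    """One approved gene symbol plus the other names that point to it."""
    approved: str
    aliases: set = field(default_factory=set)    # literature aliases
    withdrawn: set = field(default_factory=set)  # retired symbols


def build_index(records):
    """Map every known name (case-insensitive) to its approved symbol."""
    index = {}
    for rec in records:
        for name in {rec.approved} | rec.aliases | rec.withdrawn:
            index[name.lower()] = rec.approved
    return index


def resolve(index, name):
    """Return the approved symbol for a name, or None if it is unknown."""
    return index.get(name.lower())


# Example from the slide: SIR2L1 is withdrawn, "sir2-like 1" is an alias.
records = [SymbolRecord("SIRT1", aliases={"sir2-like 1"}, withdrawn={"SIR2L1"})]
index = build_index(records)
print(resolve(index, "SIR2L1"))
```

A flat name-to-symbol map like this assumes each alias belongs to exactly one approved symbol. Where an alias is shared by several genes, the map would need to hold a set of candidates instead.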

Page 4: Scientific Object Identities

Hugo Name   GDB        GenAtlas   OMIM   GeneCards   LocusLink
TP53        1          33         52     22          13
P53         1 (same)   17         188    69          63
SIRT1       1          0          5      1           2
SIR2L1      0          0          1      1 (same)    1 (same)

Page 5: The Solution: GeneKeyDB

- A gene-centered relational database developed to enhance data mining in biological data sets.
- GeneKeyDB relies primarily on existing database identifiers derived from community databases (NCBI, GO, Ensembl, et al.) as well as the known relationships among those identifiers.
- Version 1 is already out! http://www.biomedcentral.com/1471-2105/6/72

Page 6: Weaknesses of Version 1

- Can no longer be updated
- Complex queries must be made to the database in order to obtain desired information

Page 8: Complex Queries

    SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
    FROM ll_xp_cdd, ll_np_cdd, ll_locus
    WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
      AND ll_id IN
        (SELECT ll_id
         FROM ll_refseq_xm
         WHERE ll_refseq_xm_id IN
           (SELECT ll_refseq_xm_id
            FROM ll_xp_cdd, ll_np_cdd
            WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
      AND ll_id IN
        (SELECT ll_id
         FROM ll_refseq_nm
         WHERE ll_refseq_nm_id IN
           (SELECT ll_refseq_nm_id
            FROM ll_xp_cdd, ll_np_cdd
            WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
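
To see what the query on this slide asks for, it can be run against a toy in-memory SQLite database. This is a sketch under stated assumptions. The table and column names come from the slide, but which column lives in which table is a guess, and every row value (including the "XM_1" and "NM_1" identifiers) is invented for illustration.

```python
import sqlite3

# Assumed layout: only the names appear on the slide, not the schema.
SCHEMA = """
CREATE TABLE ll_locus     (ll_id INTEGER, organism TEXT);
CREATE TABLE ll_refseq_xm (ll_id INTEGER, ll_refseq_xm_id TEXT);
CREATE TABLE ll_refseq_nm (ll_id INTEGER, ll_refseq_nm_id TEXT);
CREATE TABLE ll_xp_cdd    (ll_refseq_xm_id TEXT, cdd_name TEXT, cdd_score INTEGER);
CREATE TABLE ll_np_cdd    (ll_refseq_nm_id TEXT, cdd_name TEXT, cdd_score INTEGER);

-- Invented toy rows.
INSERT INTO ll_locus     VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO ll_refseq_xm VALUES (1, 'XM_1'), (2, 'XM_2');
INSERT INTO ll_refseq_nm VALUES (1, 'NM_1');
INSERT INTO ll_xp_cdd    VALUES ('XM_1', 'domA', 50), ('XM_2', 'domB', 70);
INSERT INTO ll_np_cdd    VALUES ('NM_1', 'domA_cur', 50);
"""

# The query from the slide, unchanged apart from layout.
QUERY = """
SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
FROM ll_xp_cdd, ll_np_cdd, ll_locus
WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
  AND ll_id IN
    (SELECT ll_id FROM ll_refseq_xm
     WHERE ll_refseq_xm_id IN
       (SELECT ll_refseq_xm_id FROM ll_xp_cdd, ll_np_cdd
        WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
  AND ll_id IN
    (SELECT ll_id FROM ll_refseq_nm
     WHERE ll_refseq_nm_id IN
       (SELECT ll_refseq_nm_id FROM ll_xp_cdd, ll_np_cdd
        WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
conn.close()
```

Even on five small tables, the user has to know which identifier links which pair of tables. That is the burden described under the weaknesses of Version 1.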

Page 9: Current Research

- Creation of APIs to validate data in the database and to make querying much easier for the user.
- One-step updating of the database and the information it contains.

Page 10: API Alternative

    // fxn(search_params, desired_info), returns ll_id
    curated.cdd(score[ ], null)
    curated_score[ ] <- score[ ]
    locus_id1[ ] <- gaa.cdd((name[ ], score[ ]), score[ ])
    gaa_name[ ] <- name[ ]
    gaa_score[ ] <- score[ ]
    locus_id2[ ] <- curated.cdd(name[ ], score[ ])
    curated_name[ ] <- name[ ]
    locus_id[ ] <- intersect(locus_id1[ ], locus_id2[ ])
    locus(organism[ ], locus_id[ ])
    print(gaa_name[ ], curated_name[ ], organism[ ])
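
One possible reading of the pseudocode on this slide, sketched in Python. The slides plan the real APIs in Java; Python is used here only to keep the example short. The functions `cdd`, `intersect`, and `report` are hypothetical stand-ins for the API calls, and the toy rows are invented, so this shows the shape of the idea rather than GeneKeyDB's actual interface.

```python
# Toy rows (locus_id, cdd_name, cdd_score); all values are invented.
GAA_CDD = [(1, "domA", 50), (2, "domB", 70)]
CURATED_CDD = [(1, "domA_cur", 50), (3, "domC", 90)]
ORGANISM = {1: "Homo sapiens", 2: "Mus musculus", 3: "Homo sapiens"}


def cdd(rows, scores=None):
    """Hypothetical API call: return rows whose score is in `scores`.

    Passing scores=None returns every row, like the `null` argument
    in the pseudocode.
    """
    return [row for row in rows if scores is None or row[2] in scores]


def intersect(ids_a, ids_b):
    """Locus ids present in both collections, in ascending order."""
    return sorted(set(ids_a) & set(ids_b))


def report():
    """Return (gaa_name, curated_name, organism) for loci whose hits share a score."""
    curated_scores = {score for _, _, score in cdd(CURATED_CDD)}
    gaa_hits = cdd(GAA_CDD, curated_scores)
    gaa_scores = {score for _, _, score in gaa_hits}
    curated_hits = cdd(CURATED_CDD, gaa_scores)

    locus_ids = intersect((r[0] for r in gaa_hits), (r[0] for r in curated_hits))

    rows = []
    for locus_id in locus_ids:
        for g_id, g_name, g_score in gaa_hits:
            for c_id, c_name, c_score in curated_hits:
                if g_id == c_id == locus_id and g_score == c_score:
                    rows.append((g_name, c_name, ORGANISM[locus_id]))
    return rows


for row in report():
    print(*row, sep="\t")
```

Each step is a named call with plain inputs and outputs, so the user never has to know how the underlying tables are linked.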

Page 11: External Implementations

- Some databases have APIs as well.
  - Ensembl
    - APIs are done in Perl.
- APIs for GeneKeyDB will be done in Java.
  - More structured language.
  - Easier to read.

Page 12: The Future of GeneKeyDB

- GeneKeyDB will join even more external and widely used databases together.
- Code for updating GeneKeyDB will tie into database information that will change in expected ways.
  - Lowers the required number of code rewrites.
- GeneKeyDB will be dynamically updated.

Page 13: The Future of GeneKeyDB

- APIs will also be written in Perl.
  - Perl is used often, almost exclusively, by biologists.
  - Can have Perl APIs tie into Java APIs, rather than creating all new ones.

Page 14: Comments? Questions?

http://genereg.ornl.gov/gkdb/