Genomics and Personalized Care in Health Systems Lecture 2. Databases Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management


Genomics and Personalized Care in Health Systems

Lecture 2. Databases

Leming Zhou, PhD
School of Health and Rehabilitation Sciences
Department of Health Information Management

Outline
• Nucleotide and protein sequence databases
  – NCBI
    • GenBank, RefSeq, dbEST, UniGene
  – PDB
• FlyBase
• dbSNP
• OMIM and HuGE Navigator

Molecular Biology Databases
• Categories
  – Nucleotide sequence databases
  – Protein sequence databases
  – Structure databases
  – Metabolic and signaling pathways
  – Human genes and diseases
  – Microarray data and other expression databases
  – ...
• Each database contains specific information
• These databases are interrelated

Nucleotide and Protein Sequence Databases

NCBI
• Created as a part of the National Library of Medicine in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Databases
  – Sequence, such as GenBank, RefSeq, dbSNP
  – Literature, such as PubMed, OMIM
• Tools
  – Entrez, BLAST, Cn3D, etc.

NCBI Homepage

Molecular Databases
• Primary databases
  – Original submissions by experimentalists
  – Database staff organize but don't add additional information, for instance GenBank
• Derivative databases
  – Human curated
    • Compilation and correction of data
    • Examples: SWISS-PROT, NCBI RefSeq
  – Computationally derived
    • Example: UniGene

GenBank
• http://www.ncbi.nlm.nih.gov/Genbank
• Nucleotide-only sequence database
• GenBank data
  – Direct submissions of individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – FTP accounts established for sequencing centers
• Data are shared nightly among three collaborating databases
  – GenBank
  – DNA Data Bank of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank
• Full release every two months
• Incremental and cumulative updates daily

Release 187.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs

GenBank Record (Header)
LOCUS       NM_001963   5600 bp   mRNA   linear   PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
  PUBMED    20832399

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP..."
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
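The feature locations can be turned into lengths with simple arithmetic. The sketch below assumes the CDS location on the slide is 453..4076 and the first exon is 1..579 (start..end, 1-based, inclusive). The helper names are hypothetical, and only plain "start..end" locations are handled, not joins or complements.

```python
def parse_location(location: str) -> tuple[int, int]:
    """Split a simple 'start..end' location into 1-based, inclusive integers."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def span_length(location: str) -> int:
    """Number of bases covered by a simple 'start..end' location."""
    start, end = parse_location(location)
    return end - start + 1


cds_bases = span_length("453..4076")   # coding bases, stop codon included
codons = cds_bases // 3                # codons, stop codon included
residues = codons - 1                  # amino acids, stop codon excluded
print(cds_bases, codons, residues)
```

Because both ends are inclusive, the length is end minus start plus one. Forgetting the plus one is a common off-by-one error when working with GenBank coordinates.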

GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
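Each ORIGIN row begins with the 1-based coordinate of its first base, so a position from the feature table can be looked up directly in the sequence. The sketch below copies two rows from the record above and checks the codon at positions 453 to 455, assuming (from the Features slide) that the CDS starts at 453. The function name is hypothetical.

```python
def parse_origin_rows(rows):
    """Map each 1-based position to its base, using the coordinate that leads each row."""
    bases = {}
    for row in rows:
        parts = row.split()
        start = int(parts[0])            # coordinate of the first base in this row
        sequence = "".join(parts[1:])    # join the blocks of ten bases
        for offset, base in enumerate(sequence):
            bases[start + offset] = base
    return bases


rows = [
    "421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc",
    "481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg",
]
bases = parse_origin_rows(rows)
start_codon = "".join(bases[position] for position in range(453, 456))
print(start_codon)
```

The codon found at 453 is atg, which agrees with the M that begins the /translation qualifier.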

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
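A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text and splits an NCBI-style header, assuming the header on the slide reads ">gi|371502116|ref|NM_001126113.2| ..." with "|" as the field separator. Both function names are hypothetical, and only the first 40 bases of the sequence are used.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_ncbi_header(header):
    """Split a 'gi|number|ref|accession| description' header into its parts."""
    identifiers, _, description = header.partition(" ")
    fields = identifiers.split("|")
    return {"gi": fields[1], "accession": fields[3], "description": description}


sample = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCC
CTCCCATGTGCTCAAGACTG
"""
records = parse_fasta(sample)
info = split_ncbi_header(records[0][0])
print(info["accession"], len(records[0][1]))
```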

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
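The prefix list above is enough to tell what kind of molecule a RefSeq accession refers to. A minimal sketch follows; the lookup table contains only the prefixes named on the slide, and the function name is hypothetical.

```python
# Prefixes and molecule types exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def classify_refseq(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


for accession in ("NM_001963", "NP_001954.2", "ss5586300"):
    print(accession, "->", classify_refseq(accession))
```

Identifiers from other databases, such as the dbSNP submission number in the last example, fall through to "unknown".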

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
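"Complementary to mRNA" can be made concrete with a few lines of code. The sketch below computes the first-strand cDNA for a short, made-up mRNA fragment. The function name and the example sequence are illustrative assumptions, not material from the slides.

```python
# Base pairing used when copying RNA into DNA.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the cDNA strand (written 5'->3') complementary to an mRNA given 5'->3'."""
    # The two strands are antiparallel, so the mRNA is read in reverse.
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGCUG"))
```

The complement is read in reverse because paired strands run in opposite directions. This is why an EST can correspond to either the 5' or the 3' end of a transcript.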

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  – http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  – http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  – http://www.expasy.ch/sprot
• UniProt
  – http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, and annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)
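To get a feel for what 0.1% means in absolute terms, the percentages can be applied to a genome length. The sketch below assumes a haploid human genome of roughly 3 billion base pairs, a round figure that is not stated on this slide. The function name is hypothetical, and integer arithmetic is used to avoid rounding noise.

```python
ASSUMED_GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3 billion bp, not from the slide


def expected_differences(genome_size_bp, identical_per_mille=999, snp_percent=90):
    """Return (differing positions, positions attributed to SNPs) between two genomes."""
    differing = genome_size_bp * (1000 - identical_per_mille) // 1000
    snp_like = differing * snp_percent // 100
    return differing, snp_like


differing, snp_like = expected_differences(ASSUMED_GENOME_SIZE_BP)
print(f"{differing:,} differing positions, about {snp_like:,} of them SNPs")
```

Under that assumption, two unrelated individuals differ at a few million positions, most of them single-base changes.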

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
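The 1% rule can be written as a small function over allele counts. This is a sketch of the textbook definition only. The function name and the example counts are hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide leaves that case unspecified.

```python
def classify_single_base_change(variant_allele_count, total_alleles, threshold=0.01):
    """Label a single-base change as 'SNP' or 'mutation' by its allele frequency."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    frequency = variant_allele_count / total_alleles
    # Assumption: the boundary case (exactly 1%) is reported as a mutation.
    return "SNP" if frequency > threshold else "mutation"


# 500 diploid individuals carry 1000 alleles in total.
print(classify_single_base_change(30, 1000))
print(classify_single_base_change(5, 1000))
```

Frequencies are counted over alleles rather than individuals, so a sample of N diploid people contributes 2N alleles.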

SNPs
• SNPs can occur anywhere on a genome; they are classified by their location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as by the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single-base substitution (an A replaced by a T) that makes the inserted amino acid valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
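The effect of a single-base swap on the encoded amino acid can be illustrated with one codon. The sketch below uses the commonly cited codon change for this variant, GAG to GTG. These codons are not shown on the slide, and the two-entry table is only a subset of the standard genetic code.

```python
# Subset of the standard genetic code (DNA coding-strand codons).
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A becomes T
print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```

This is a non-synonymous change: one base differs, and the codon now specifies a different amino acid.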

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record identifiers start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP Cluster (refSNP) identifiers start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
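The IUPAC table maps directly onto a lookup dictionary, which makes it easy to see which bases a variation code in a dbSNP sequence stands for. A minimal sketch follows; the function names are hypothetical and the short test sequence is made up for illustration.

```python
# IUPAC nucleotide codes, as listed in the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code: str) -> set:
    """Return the set of bases that an IUPAC code can represent."""
    return set(IUPAC_CODES[code.upper()])


def count_ambiguous(sequence: str) -> int:
    """Count positions whose code stands for more than one base (spaces are ignored)."""
    return sum(1 for code in sequence.replace(" ", "") if len(possible_bases(code)) > 1)


print(sorted(possible_bases("R")))
print(count_ambiguous("CATYTGR"))
```

In the record above, the R at the variant position is consistent with the A/G alleles given in the header.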

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

• BRC*[Gene Name]
  – Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
• 1[CHR] AND (frameshift[Function_Class])
  – Search SNPs located on chromosome 1 with function class frame-shift
• 1[CHR] OR 2[CHR]
  – Search all SNPs on chromosome 1 or 2
• 1[CHR] OR 2[CHR] NOT unknown[METHOD]
  – Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
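Query strings like these follow a regular pattern: a value, a field tag in square brackets, and Boolean operators between terms. The sketch below assembles two of the example queries as plain strings. The helper names are hypothetical, only field tags shown in the examples are used, and nothing is sent to NCBI.

```python
def tagged(value: str, field: str) -> str:
    """Format a search term as value[Field]."""
    return f"{value}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with one Boolean operator (AND, OR, or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


frameshift_on_chr1 = combine(
    "AND", tagged("1", "CHR"), "(" + tagged("frameshift", "Function_Class") + ")"
)
chr1_or_chr2 = combine("OR", tagged("1", "CHR"), tagged("2", "CHR"))
known_method = combine("NOT", chr1_or_chr2, tagged("unknown", "METHOD"))

print(frameshift_on_chr1)
print(known_method)
```

Building queries from small pieces makes it easier to vary one term, such as the chromosome, without retyping the whole expression.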

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
• len/totalLen: total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
• taxid: NCBI taxonomy id
• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details
• alleles: lists the alleles of the SNP, separated by /
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• Green text: assay sequence (observed by the submitter)
• Black text: flank sequence (extracted from sequence databases)
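The field rules above are enough to parse a dbSNP FASTA header and recover the flank lengths. The sketch below assumes the header fields are separated by "|" and written as key=value after the accession, with alleles separated by "/". The function names are hypothetical.

```python
def parse_dbsnp_header(line: str) -> dict:
    """Split a dbSNP FASTA header into its positional and key=value fields."""
    fields = line.lstrip(">").strip().split("|")
    record = {"source": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")  # tolerate optional quoting
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length); the variation counts as one base."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen") or record["len"])
    return allele_pos - 1, total - allele_pos


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)
print(record["accession"], record["alleles"].split("/"), flank_lengths(record))
```

For rs799916 this gives 300 bases on each side of the variation, which matches the 30 blocks of ten bases that precede the M in the record.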

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
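A 10-digit variant number can be split into the entry it belongs to and the variant within that entry. The sketch below assumes the usual OMIM layout of a six-digit MIM number, a period, and a four-digit variant suffix, which the slide does not spell out. The identifier 131530.0001 is used only as an illustration, built from the EGF MIM number in the GenBank record earlier in the lecture.

```python
import re

# Assumed layout: six-digit MIM number, a period, four-digit allelic variant suffix.
OMIM_VARIANT_PATTERN = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(identifier: str) -> tuple:
    """Return (MIM number, variant index) for an identifier such as '131530.0001'."""
    match = OMIM_VARIANT_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {identifier!r}")
    return match.group(1), int(match.group(2))


mim_number, variant_index = split_omim_variant("131530.0001")
print(mim_number, variant_index)
```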

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Department of Health Information Management

Outlinebull Nucleotide and protein sequence databases

ndash NCBI

bull GenBank RefSeq dbEST UniGene

ndash PDB

bull Flybasebull dbSNPbull OMIM and HuGE Navigator

Department of Health Information Management

Molecular Biology Databasesbull Categories

ndash Nucleotide Sequence Databasesndash Protein Sequence Databasesndash Structure Databasesndash Metabolic and Signaling Pathwaysndash Human Genes and Diseasesndash Microarray Data and other Expression Databasesndash hellip

bull Each Database contains specific information

bull Each of these databases is interrelated

Nucleotide and Protein Sequence Databases

Department of Health Information Management

NCBIbull Created as a part of National Library of

Medicine in 1988ndash Establish public databases

ndash Perform research in computational biology

ndash Develop software tools for sequence analysis

ndash Disseminate biomedical information

bull Databasesndash Sequence such as GenBank RefSeq dbSNP

ndash Literature such as PubMed OMIM

bull Toolsndash Entrez Blast Cn3D etc

NCBI Homepage

Department of Health Information Management

Molecular Databasesbull Primary Databases

ndash Original submissions by experimentalists

ndash Database staff organize but donrsquot add additional

ndash Information for instance GenBank

bull Derivative Databasesndash Human curated

bull Compilation and correction of data

bull Example SWISS-PROT NCBI RefSeq

ndash Computationally Derived

bull Example UniGene

Department of Health Information Management

GenBankbull httpwwwncbinlmnihgovGenbank

bull Nucleotide only sequence database

bull GenBank Datandash Direct submissions individual records (BankIt Sequin)

ndash Batch submissions via email (EST GSS STS)

ndash ftp accounts established for sequencing centers

bull Data shared nightly amongst three collaborating databasesndash GenBank

ndash DNA Database of Japan (DDBJ)

ndash European Molecular Biology Laboratory Database (EMBL)

Department of Health Information Management

Department of Health Information Management

GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily

Release 1810 (12152011)

bull 146413798 Sequences bull 135117731375 Base Pairs

Department of Health Information Management

GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)

transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata

Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo

REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF

TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes

JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399

Department of Health Information Management

GenBank Record (Features)FEATURES LocationQualifiers source 15600

organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25

gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530

exon 1579 number=1

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

Department of Health Information Management

FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted


Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
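The prefix table above lends itself to a simple lookup. The following is a minimal sketch, with hypothetical function names, that maps a RefSeq accession to the molecule type listed on the slide and separates the version suffix. The note that XM_ and XP_ mark computationally predicted (model) records is added here and is not stated on the slide.

```python
from typing import Optional, Tuple

# Prefixes as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def classify_refseq(accession: str) -> str:
    """Return the molecule type for a RefSeq accession, based on its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "prefix not listed")


def is_predicted_model(accession: str) -> bool:
    """XM_ and XP_ accessions denote computationally predicted records."""
    return accession[:3].upper() in ("XM_", "XP_")


def split_version(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_001963.4' into ('NM_001963', 4); version is None if absent."""
    base, _, version = accession.partition(".")
    return base, int(version) if version else None


for acc in ("NM_001963.4", "NP_001954.2", "NC_000004"):
    print(acc, classify_refseq(acc), split_version(acc))
```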

EST

EST
• mRNA: transcribed from the genomic regions that are actively expressed in a cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
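To make "complementary to mRNA" concrete, here is a small sketch that derives the first-strand cDNA from an mRNA sequence by reverse complementing it (with U pairing to A and the cDNA written in DNA letters). The function name and the 12-base example are illustrative assumptions, not taken from the lecture.

```python
# Base pairing from mRNA (RNA letters) to cDNA (DNA letters).
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA, written 5'->3', for an mRNA written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


mrna = "AUGCUGCUCACU"
cdna = first_strand_cdna(mrna)
print(cdna)
```

An EST would correspond to a short read from one end of such a cDNA, which is why ESTs are described as partial cDNA sequences.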

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  http://www.expasy.ch/sprot
• UniProt
  http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available from techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps in understanding how it works
• As of January 2010, there were 62,787 searchable structures in the PDB
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about the protein, and literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
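The frequency rule above can be written as a short function. This is a sketch only: the function names are hypothetical, the allele counts in the usage lines are invented for illustration, and a frequency of exactly 1% is reported separately because the slide defines only the cases above and below that threshold.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff from the slide


def allele_frequency(allele_count: int, chromosomes_sampled: int) -> float:
    """Fraction of sampled chromosomes that carry the variant allele."""
    if chromosomes_sampled <= 0:
        raise ValueError("chromosomes_sampled must be positive")
    return allele_count / chromosomes_sampled


def classify_single_base_change(frequency: float) -> str:
    """Label a single base change as 'SNP' (>1%) or 'mutation' (<1%)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    if frequency < SNP_FREQUENCY_THRESHOLD:
        return "mutation"
    return "exactly 1% (not defined on the slide)"


# Illustrative counts: 1000 people sampled, i.e. 2000 chromosomes.
print(classify_single_base_change(allele_frequency(50, 2000)))
print(classify_single_base_change(allele_frequency(12, 2000)))
```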

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps
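Whether a coding SNP is synonymous or non-synonymous can be checked by translating the codon before and after the change. The sketch below builds the standard genetic code table and compares the two amino acids. The function name is hypothetical, and the two examples are illustrative: GAA to GAG leaves glutamic acid unchanged, while GAG to GTG replaces glutamic acid with valine (the change known from sickle cell anemia).

```python
import itertools

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_change(ref_codon: str, alt_codon: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


print(classify_coding_change("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_change("GAG", "GTG"))  # Glu -> Val
```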

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide swap (an A replaced by a T), which causes valine to be inserted in place of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: (1) normal red blood cells; (2) sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
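The two SNP figures on this slide (about 10 million SNPs, about 1 per 300 bp) are consistent with each other if the human genome is taken to be roughly 3 billion base pairs. That genome size is an assumption added here and is not stated on the slide. A quick check of the arithmetic:

```python
GENOME_SIZE_BP = 3_000_000_000   # assumed haploid human genome size (~3 billion bp)
AVERAGE_SNP_SPACING_BP = 300     # "on average 1 per 300 bp" from the slide

# Expected number of SNP positions at one SNP every 300 bp.
expected_snps = GENOME_SIZE_BP // AVERAGE_SNP_SPACING_BP

# "Fewer than half" of these in the database gives an upper bound.
upper_bound_in_database = expected_snps // 2

print(f"{expected_snps:,} expected SNPs")
print(f"fewer than {upper_bound_in_database:,} stored in dbSNP at the time")
```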

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (ss) record accessions start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles="A/G"|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
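The IUPAC table can be used in both directions: to expand an ambiguity code in a flanking sequence into the concrete alleles, and to find the single code that represents a set of alleles. The following is a minimal sketch with hypothetical function names; the short sequence in the usage line is invented for illustration.

```python
import itertools
from typing import List

# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def expand(sequence: str) -> List[str]:
    """List every concrete sequence that an IUPAC-coded sequence can represent."""
    choices = (IUPAC[base] for base in sequence.upper())
    return ["".join(combo) for combo in itertools.product(*choices)]


def encode(alleles: str) -> str:
    """Return the IUPAC code for alleles written like 'A/G' or 'AG'."""
    return CODE_FOR_BASES[frozenset(alleles.upper().replace("/", ""))]


print(expand("CCAMGG"))
print(encode("A/G"))
```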

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), and the AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
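Queries like those in the table can also be sent programmatically through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI. The function names are hypothetical, and it is an assumption that the field tags from this slide (such as [CHR] and [Function_Class]) are still accepted by the current dbSNP index.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for the SNP database; special characters are URL-encoded."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def decode_term(url: str) -> str:
    """Recover the search term from a URL built above (for checking the encoding)."""
    return parse_qs(urlparse(url).query)["term"][0]


queries = [
    "BRC*[Gene Name]",
    "1[CHR] AND (frameshift[Function_Class])",
    "1[CHR] OR 2[CHR] NOT unknown[METHOD]",
]
for query in queries:
    print(build_snp_search_url(query))
```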

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles="A/C"|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

Field                      Description
Header                     The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs.
allelePos                  Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1.
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation.
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id.
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria.
snpclass                   Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism).
alleles                    Lists the alleles of the SNP, separated by /.
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements.
ATCG (green)               Green color is used for assay sequence (observed by the submitter).
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases).
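The field descriptions above are enough to parse a dbSNP FASTA header and recover the flank lengths. This is a sketch with hypothetical function names: it keeps only the key=value fields, so the submitter handle and id of an ss record (which carry no key) are skipped, and it assumes the alleles are written with a / separator as described in the table.

```python
from typing import Dict, Tuple


def parse_dbsnp_header(line: str) -> Dict[str, str]:
    """Parse a dbSNP FASTA header into a dict of its named fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = line[1:].strip().split("|")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        # Only key=value fields are kept; bare fields are ignored.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
    return record


def flank_lengths(record: Dict[str, str]) -> Tuple[int, int]:
    """Return (5' flank length, 3' flank length); the variation counts as 1 base."""
    total = int(record.get("totalLen") or record["len"])
    allele_pos = int(record["allelePos"])
    return allele_pos - 1, total - allele_pos


header = ('>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|'
          'snpclass=1|alleles="A/C"|mol=Genomic|build=130')
record = parse_dbsnp_header(header)
print(record["accession"], record["alleles"], flank_lengths(record))
```

For the rs799916 record shown earlier, this gives flanks of 300 bases on each side of the variation, matching allelePos=301 and totalLen=601.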

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Molecular Biology Databasesbull Categories

ndash Nucleotide Sequence Databasesndash Protein Sequence Databasesndash Structure Databasesndash Metabolic and Signaling Pathwaysndash Human Genes and Diseasesndash Microarray Data and other Expression Databasesndash hellip

bull Each Database contains specific information

bull Each of these databases is interrelated

Nucleotide and Protein Sequence Databases

Department of Health Information Management

NCBIbull Created as a part of National Library of

Medicine in 1988ndash Establish public databases

ndash Perform research in computational biology

ndash Develop software tools for sequence analysis

ndash Disseminate biomedical information

bull Databasesndash Sequence such as GenBank RefSeq dbSNP

ndash Literature such as PubMed OMIM

bull Toolsndash Entrez Blast Cn3D etc

NCBI Homepage

Department of Health Information Management

Molecular Databasesbull Primary Databases

ndash Original submissions by experimentalists

ndash Database staff organize but donrsquot add additional

ndash Information for instance GenBank

bull Derivative Databasesndash Human curated

bull Compilation and correction of data

bull Example SWISS-PROT NCBI RefSeq

ndash Computationally Derived

bull Example UniGene

Department of Health Information Management

GenBankbull httpwwwncbinlmnihgovGenbank

bull Nucleotide only sequence database

bull GenBank Datandash Direct submissions individual records (BankIt Sequin)

ndash Batch submissions via email (EST GSS STS)

ndash ftp accounts established for sequencing centers

bull Data shared nightly amongst three collaborating databasesndash GenBank

ndash DNA Database of Japan (DDBJ)

ndash European Molecular Biology Laboratory Database (EMBL)

Department of Health Information Management

Department of Health Information Management

GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily

Release 1810 (12152011)

bull 146413798 Sequences bull 135117731375 Base Pairs

Department of Health Information Management

GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)

transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata

Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo

REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF

TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes

JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399

Department of Health Information Management

GenBank Record (Features)FEATURES LocationQualifiers source 15600

organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25

gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530

exon 1579 number=1

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

Department of Health Information Management

FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT

Department of Health Information Management

Too Many Results

Department of Health Information Management

Search Limits

Department of Health Information Management

Reduced Search Results

Department of Health Information Management

Gene Record

Department of Health Information Management

RefSeqbull Database of reference sequences

ndash httpwwwncbinlmnihgovRefSeq

bull Curatedndash Many experimentally validated

ndash Some partially validated via ESTs

ndash Some computationally predicted

bull Non-redundant one record for each gene or each splice variant from each organism represented

bull Status Codesndash Provisional (temporary)

ndash Reviewed

ndash Predicted

Department of Health Information Management

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues to ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1–100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
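The frequency-based terminology above can be sketched in a few lines. This is only an illustration: the function names and the allele counts are hypothetical, and the handling of a frequency of exactly 1% is an assumption, since the slide defines only the >1% and <1% cases.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff from the slide


def allele_frequency(allele_count: int, total_alleles: int) -> float:
    """Fraction of sampled alleles that carry the single base change."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return allele_count / total_alleles


def classify_single_base_change(allele_count: int, total_alleles: int) -> str:
    frequency = allele_frequency(allele_count, total_alleles)
    # Assumption: exactly 1% is grouped with mutations, because the
    # slide does not say which side of the cutoff it belongs to.
    return "SNP" if frequency > SNP_FREQUENCY_THRESHOLD else "mutation"


# Hypothetical counts: 1,000 diploid individuals carry 2,000 alleles.
print(classify_single_base_change(30, 2000))  # 1.5% of alleles
print(classify_single_base_change(5, 2000))   # 0.25% of alleles
```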

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington’s disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer’s disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single nucleotide change (an A swapped for a T), which makes the encoded amino acid valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
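The effect of the single A-to-T swap can be traced at the codon level. The sketch below assumes the standard textbook description of the sickle cell change, a GAG codon (glutamic acid) in the beta-globin gene becoming GTG (valine); the codons are not stated on the slide, and the table holds only the entries needed here.

```python
# Partial codon table (standard genetic code), limited to this example.
CODON_TO_AMINO_ACID = {
    "GAA": "glutamic acid",
    "GAG": "glutamic acid",
    "GTA": "valine",
    "GTG": "valine",
}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at 0-based `index` replaced."""
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A becomes T

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```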

A Few Relevant Concepts
• Allele: a specific “version” of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number likely falls between those of SNPs and VNTRs
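As a quick check on the numbers quoted in this lecture, the arithmetic can be written out. The genome size of about 3 billion base pairs is an assumption added here (a commonly cited approximation); the 1-per-300-bp density, the 0.1% difference between individuals, and the ~90% SNP share are the lecture's own figures.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate human genome size
SNP_SPACING_BP = 300            # lecture: on average 1 SNP per 300 bp

# Expected number of SNP positions across the population.
expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP

# Two unrelated individuals differ at about 0.1% of positions,
# and about 90% of those differences are SNPs.
differing_positions = GENOME_SIZE_BP // 1000
snp_differences = differing_positions * 9 // 10

print(f"Expected SNPs in the population: {expected_snps:,}")
print(f"Differences between two individuals: {differing_positions:,}")
print(f"Of which SNPs (~90%): {snp_differences:,}")
```

Under these assumptions the 1-per-300-bp density reproduces the "roughly 10 million" figure.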

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with “ss”
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with “rs”

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
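The IUPAC codes above map naturally onto a lookup table. The following is a small sketch of how a program might expand an ambiguity code into its possible bases, or find the code for a set of alleles; the function names are illustrative only.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code: str) -> set:
    """Return the set of bases that an IUPAC code can represent."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles: str) -> str:
    """Return the IUPAC code that represents exactly the given alleles."""
    wanted = set(alleles.upper())
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(sorted(possible_bases("R")))  # bases covered by R
print(code_for_alleles("AC"))       # code for an A/C variation
```

This is why an A/G variation appears as a single R in a dbSNP sequence, and an A/C variation as a single M.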

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frame-shift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
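Queries of this kind are plain strings, so they can be assembled programmatically. The sketch below builds two of the example queries and wraps one in a URL for NCBI's E-utilities `esearch` endpoint, without sending any request. The helper names are hypothetical, and the field tags (`CHR`, `Function_Class`) are copied from the slide; whether the current dbSNP index still accepts them is an assumption.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (no request is made in this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format one search term restricted to a field, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a Boolean operator (AND, OR, or NOT)."""
    return f" {operator} ".join(terms)


def esearch_url(query: str, db: str = "snp") -> str:
    """Build the search URL with the query safely percent-encoded."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': query})}"


frameshift_on_chr1 = combine(
    "AND", field_term("1", "CHR"), field_term("frameshift", "Function_Class")
)
chr1_or_chr2 = combine("OR", field_term("1", "CHR"), field_term("2", "CHR"))

print(frameshift_on_chr1)
print(chr1_or_chr2)
print(esearch_url(frameshift_on_chr1))
```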

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

Header: The FASTA header line starts with “>” and has fields separated by “|”. Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. “ss” refers to a submitted SNP accession; “rs” refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' length plus 1
len/totalLen: Total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by “/”
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green): Green is used for assay sequence (observed by the submitter)
ATCG (black): Black is used for flank sequence (extracted from sequence databases)
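The field layout described above is regular enough to parse with a few string operations. The sketch below is one possible reading, not an official parser: `parse_dbsnp_header` is a hypothetical name, the “/” between alleles and the optional quotes around them are assumptions about the original header, and fields without “=” (such as the submitter handle) are simply skipped.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its named fields."""
    fields = header.strip().lstrip(">").split("|")
    if fields[:2] != ["gnl", "dbSNP"]:
        raise ValueError("not a dbSNP FASTA header")
    record = {"accession": fields[2]}
    for field in fields[3:]:
        key, separator, value = field.partition("=")
        if separator:  # skip positional fields such as handle|submitted_snp_id
            record[key] = value.strip("'\"")
    # Submitted (ss) records use len; refSNP (rs) records use totalLen.
    total_length = int(record.get("totalLen") or record["len"])
    allele_position = int(record["allelePos"])
    # allelePos is the 5' flank length plus 1; the variation counts as 1 base.
    record["five_prime_length"] = allele_position - 1
    record["three_prime_length"] = total_length - allele_position
    return record


ref_snp = parse_dbsnp_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
print(ref_snp["accession"], ref_snp["alleles"].split("/"))
print(ref_snp["five_prime_length"], ref_snp["three_prime_length"])
```

For the rs799916 header shown in this lecture, allelePos=301 and totalLen=601 give two 300-base flanks, one on each side of the variation.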

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI – OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge about the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases
• OPHG’s efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on the population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene’s Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method “genome-wide association study” (GWAS). The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein



Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wildcard (*), range (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by any method except unknown
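Queries like these can also be assembled programmatically. The sketch below builds field-tagged terms and, as an assumption beyond the slide, wraps them in a URL for the NCBI E-utilities ESearch endpoint with db=snp. The field tags are taken from the slide and may differ in current Entrez releases; the code only builds strings and makes no network request.

```python
from urllib.parse import urlencode

# ASSUMPTION: ESearch is used here as a scripted counterpart to the
# Entrez SNP web form; the lecture itself only shows the web interface.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value: str, field: str) -> str:
    """Attach an Entrez field tag to a search value, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def esearch_url(query: str, db: str = "snp") -> str:
    """Return an ESearch URL for the query (string building only)."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': query})}"


if __name__ == "__main__":
    # Field tags below are copied from the slide and may be outdated.
    by_gene = term("BRC*", "Gene Name")
    by_chr = f"{term('1', 'CHR')} OR {term('2', 'CHR')}"
    known_method = f"{by_chr} NOT {term('unknown', 'METHOD')}"
    for query in (by_gene, by_chr, known_method):
        print(query)
        print(esearch_url(query))
```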

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles='A/C'|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
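The field rules above can be exercised with a small parser. This is a minimal sketch, assuming the header uses ">" and "|" as delimiters and quotes the allele list as 'A/C' (the usual dbSNP convention; the punctuation is not legible in the slide text). The function names are illustrative, and positional fields without "=" (submitter handle and ID) are skipped.

```python
from typing import Dict, Tuple


def parse_snp_fasta_header(header: str) -> Dict[str, str]:
    """Parse a dbSNP FASTA header line into a dict of named fields."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    if len(fields) < 3 or fields[0] != "gnl" or fields[1] != "dbSNP":
        raise ValueError("not a gnl|dbSNP header")
    record = {"database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        # Only key=value fields are kept; positional fields are ignored.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
    return record


def flank_lengths(record: Dict[str, str]) -> Tuple[int, int]:
    """Derive (5' flank length, 3' flank length) from allelePos and length.

    allelePos = 5' length + 1, and the total length counts the variation
    as a single base, so 3' length = total length - allelePos.
    """
    total = record.get("totalLen") or record.get("len")
    if total is None:
        raise KeyError("header has neither totalLen nor len")
    allele_pos = int(record["allelePos"])
    return allele_pos - 1, int(total) - allele_pos


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
    rec = parse_snp_fasta_header(header)
    print(rec["accession"], rec["alleles"], flank_lengths(rec))
```

For the rs799916 header, allelePos=301 and totalLen=601 imply 300 flanking bases on each side of the variant.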

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when retrieved, view its variants
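A 10-digit variant number can be split into its parts in code. The sketch below assumes the 10 digits are a six-digit MIM number followed by a four-digit allelic variant number, separated by a period, as in OMIM's allelic variant listings; the slide does not spell out this layout, and the example number is illustrative.

```python
import re
from typing import Tuple

# ASSUMPTION: variant numbers look like "NNNNNN.NNNN", i.e. a six-digit
# MIM number, a period, and a four-digit allelic variant number.
_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(variant: str) -> Tuple[str, str]:
    """Split an OMIM variant number into (MIM number, variant number)."""
    match = _VARIANT.match(variant.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Illustrative input: 113705 is the MIM number of the BRCA1 gene entry.
    print(split_variant_number("113705.0001"))
```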

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.


bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

NCBI Homepage

Molecular Databases
• Primary databases
  - Original submissions by experimentalists
  - Database staff organize the data but don't add additional information
  - Example: GenBank
• Derivative databases
  - Human curated
    • Compilation and correction of data
    • Examples: SWISS-PROT, NCBI RefSeq
  - Computationally derived
    • Example: UniGene

GenBank
• http://www.ncbi.nlm.nih.gov/Genbank
• Nucleotide-only sequence database
• GenBank data
  - Direct submissions of individual records (BankIt, Sequin)
  - Batch submissions via email (EST, GSS, STS)
  - FTP accounts established for sequencing centers
• Data are shared nightly among three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory Database (EMBL)

GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank
• Full release every two months
• Incremental and cumulative updates daily

Release 181.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs

GenBank Record (Header)
LOCUS       NM_001963    5600 bp    mRNA    linear    PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo
REFERENCE   1 (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced EGFR
            surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
  PUBMED    20832399

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP..."
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
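Feature locations in a GenBank record are 1-based, inclusive ranges, so the length of a feature is end minus start plus one. The following is a minimal Python sketch of that arithmetic for the EGF features listed on this slide. It assumes the simple `start..end` notation only (no joins or complements), and the helper names `parse_location` and `span_length` are hypothetical, chosen for illustration.

```python
def parse_location(location):
    """Parse a simple GenBank location such as '453..4076' into (start, end)."""
    start_text, end_text = location.split("..")
    return int(start_text), int(end_text)


def span_length(location):
    """Length in bases of a 1-based, inclusive location."""
    start, end = parse_location(location)
    return end - start + 1


# Feature locations as listed for the EGF record (NM_001963).
features = {
    "exon 1": "1..579",
    "CDS": "453..4076",
    "exon 2": "580..779",
    "exon 3": "780..961",
}

for name, location in features.items():
    print(name, location, span_length(location), "bp")

# A complete CDS spans a whole number of codons; the final codon is the stop.
cds_codons = span_length(features["CDS"]) // 3
protein_length = cds_codons - 1
print("CDS codons:", cds_codons, "protein length:", protein_length)
```

The CDS starts inside exon 1 (position 453), which is why the coding region and the first exon overlap rather than coincide.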

GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
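A FASTA record is a header line beginning with `>` followed by one or more lines of sequence. The sketch below shows one way to read such text in plain Python. The function name `parse_fasta` is hypothetical, and the example uses only the first 120 bases of the TP53 record shown on this slide, wrapped at 60 characters per line as an illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTT
CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTT
"""

records = parse_fasta(example)
for header, sequence in records:
    # In this header style the accession is the fourth '|'-separated field.
    accession = header.split("|")[3]
    print(accession, len(sequence), "bases")
```

For real work a maintained parser is usually preferable; this sketch only illustrates the structure of the format.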

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  - http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  - Many experimentally validated
  - Some partially validated via ESTs
  - Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  - Provisional (temporary)
  - Reviewed
  - Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other molecular record
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  - Complete chromosome: NC_
  - Genomic contig: NT_
  - mRNA (DNA format): NM_, XM_
  - Protein: NP_, XP_
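Because the RefSeq prefix encodes the molecule type, a record can be sorted into a category from its accession alone. A minimal Python sketch of that lookup follows. It covers only the prefixes listed on this slide, and the names `REFSEQ_PREFIXES` and `refseq_molecule_type` are hypothetical.

```python
# Prefix-to-type mapping taken from the list on this slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown")


for accession in ["NM_001963.4", "NP_001954.2", "AB123456"]:
    print(accession, "->", refseq_molecule_type(accession))
```

The third accession is a made-up identifier without a RefSeq prefix, included only to show the fallback.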

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using mRNA as a template
  - Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200-500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify
    • Genes, alternative splicing, etc.
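The statement that cDNA is complementary to its mRNA template can be illustrated with a short Python sketch. It assumes the mRNA is written 5' to 3' with RNA letters (A, U, G, C) and returns the first-strand cDNA, also written 5' to 3', which is the reverse complement. The function name `first_strand_cdna` and the 12-base fragment are illustrative assumptions, not part of the lecture.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    complement = [RNA_TO_DNA_COMPLEMENT[base] for base in mrna.upper()]
    # The two strands are antiparallel, so the complement is reversed.
    return "".join(reversed(complement))


mrna = "AUGCUGCUCACU"  # short illustrative fragment
cdna = first_strand_cdna(mrna)
print("mRNA 5'-" + mrna + "-3'")
print("cDNA 5'-" + cdna + "-3'")
```

Sequence databases store mRNA and EST records in DNA letters (T instead of U), so in practice the same operation appears as a reverse complement of a DNA string.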

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  - http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  - http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  - http://www.expasy.ch/sprot
• UniProt
  - http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) that may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps in understanding how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, and annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
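The frequency-based terminology on this slide amounts to a simple threshold rule, sketched below in Python. The slide does not say how a frequency of exactly 1% is labeled, so treating it as a mutation is an assumption of this sketch, as are the names `SNP_FREQUENCY_THRESHOLD` and `classify_variant`.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff named on the slide


def classify_variant(allele_frequency):
    """Label a single-base change by its population allele frequency.

    Assumption: a frequency of exactly 1% is labeled a mutation, since the
    slide only defines the cases above and below the cutoff.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    return "mutation"


for frequency in [0.25, 0.05, 0.002]:
    print(frequency, "->", classify_variant(frequency))
```

As the last bullet notes, databases such as dbSNP do not enforce this distinction, so the rule describes terminology rather than database contents.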

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription quantitatively or qualitatively
  - Affect gene translation quantitatively or qualitatively
  - Change protein structure and function
  - Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide substitution (an A replaced by a T), which causes valine to be incorporated instead of glutamic acid in hemoglobin

http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm

1. Normal red blood cells  2. Sickled red blood cells
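The sickle cell substitution can be traced at the codon level with a small Python sketch. It assumes the affected codon is GAG (glutamic acid) in the beta-globin gene and that the middle base changes from A to T, giving GTG (valine). These details are standard textbook facts rather than content from the slide, and the two-entry codon table and the helper `substitute` are deliberately minimal, hypothetical constructs.

```python
# Only the two codons needed for this example, not a full genetic code table.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
sickle_codon = substitute(normal_codon, 1, "T")  # A -> T at the middle base

print(normal_codon, CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, CODON_TO_AMINO_ACID[sickle_codon])
```

Because the substitution changes the encoded amino acid, it is a non-synonymous change in the sense used on the earlier slide about the effect of SNPs.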

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
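The IUPAC table is what allows a dbSNP flanking sequence to mark a variable position with a single letter, for example R for an A/G variation. A minimal Python sketch of the lookup in both directions follows. The dictionary reproduces the table above, and the function names `expand_code` and `code_for_alleles` are hypothetical.

```python
# IUPAC nucleotide codes from the table on this slide.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_code(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code that represents exactly the given bases."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError("no IUPAC code for alleles: " + ", ".join(sorted(wanted)))


print(sorted(expand_code("R")))
print(code_for_alleles(["A", "G"]))
print(code_for_alleles(["A", "C"]))
```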

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), and the AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
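Query strings like the examples above follow a regular pattern: a value, a field tag in square brackets, and Boolean operators between terms. The Python sketch below assembles such strings programmatically. It only builds text and does not contact NCBI; the helper names `field_term` and `combine` are hypothetical, and the field tags are the ones shown in the table.

```python
VALID_OPERATORS = {"AND", "OR", "NOT"}


def field_term(value, field):
    """Format one search term, e.g. field_term('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with a Boolean operator, e.g. '1[CHR] OR 2[CHR]'."""
    if operator not in VALID_OPERATORS:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(terms)


frameshift_on_chr1 = combine(
    "AND",
    field_term("1", "CHR"),
    "(" + field_term("frameshift", "Function_Class") + ")",
)
chr1_or_chr2 = combine("OR", field_term("1", "CHR"), field_term("2", "CHR"))

print(frameshift_on_chr1)
print(chr1_or_chr2)
```

The resulting strings can be pasted into the Entrez SNP search box in the same way as the hand-written examples.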

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

Header field              Description
(header line)             The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs.
allelePos                 Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1.
len/totalLen              Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation.
handle|submitted_snp_id   Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID.
taxid                     NCBI taxonomy ID
mol                       Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria.
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism).
alleles                   Lists the alleles of the SNP, separated by "/"
Lower or upper case       Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)              Green is used for assay sequence (observed by the submitter)
ATCG (black)              Black is used for flank sequence (extracted from sequence databases)
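The header fields described above can be pulled apart with a few lines of Python. The sketch below is a minimal illustration under two assumptions: the header uses `|` between fields and `/` between alleles, as in the rs799916 record from the SNP Location slide, and fields without an `=` sign (such as the submitter handle and ID in submitted records) are simply skipped. The function name `parse_dbsnp_header` is hypothetical.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        # Only key=value fields are kept; bare fields are ignored here.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("\"'")
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

allele_pos = int(record["allelePos"])
total_len = int(record["totalLen"])
five_prime_flank = allele_pos - 1          # allelePos is the 5' length plus 1
three_prime_flank = total_len - allele_pos  # the variation itself counts as 1

print(record["accession"], record["alleles"].split("/"))
print("5' flank:", five_prime_flank, "3' flank:", three_prime_flank)
```

The flank lengths recovered this way match the rule in the table: the 5' flank, the single-letter variation, and the 3' flank add up to totalLen.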

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - Information on population prevalence of genetic variants
  - Gene-disease associations
  - Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in databases
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Molecular Databasesbull Primary Databases

ndash Original submissions by experimentalists

ndash Database staff organize but donrsquot add additional

ndash Information for instance GenBank

bull Derivative Databasesndash Human curated

bull Compilation and correction of data

bull Example SWISS-PROT NCBI RefSeq

ndash Computationally Derived

bull Example UniGene

Department of Health Information Management

GenBankbull httpwwwncbinlmnihgovGenbank

bull Nucleotide only sequence database

bull GenBank Datandash Direct submissions individual records (BankIt Sequin)

ndash Batch submissions via email (EST GSS STS)

ndash ftp accounts established for sequencing centers

bull Data shared nightly amongst three collaborating databasesndash GenBank

ndash DNA Database of Japan (DDBJ)

ndash European Molecular Biology Laboratory Database (EMBL)

Department of Health Information Management

Department of Health Information Management

GenBank Release 1870bull ftpftpncbinihgovgenbankbull Full release every two monthsbull Incremental and cumulative updates daily

Release 1810 (12152011)

bull 146413798 Sequences bull 135117731375 Base Pairs

Department of Health Information Management

GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)

transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata

Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo

REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF

TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes

JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399

Department of Health Information Management

GenBank Record (Features)FEATURES LocationQualifiers source 15600

organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25

gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530

exon 1579 number=1

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

Department of Health Information Management

FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq

• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers

• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
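The prefix list above can be turned into a small lookup. This is a sketch only: it covers just the six prefixes named on the slide, the helper names are hypothetical, and it assumes the accession.version form (for example NP_001954.2) seen in the GenBank record.

```python
# Only the prefixes listed on the slide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def classify_refseq(accession):
    """Return the molecule type implied by the accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "not a listed RefSeq prefix")


def split_version(accession):
    """Split 'NP_001954.2' into ('NP_001954', 2); version is None if absent."""
    base, _, version = accession.partition(".")
    return base, int(version) if version else None


for acc in ("NM_001126113.2", "NP_001954.2"):
    print(acc, classify_refseq(acc), split_version(acc))
```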

EST

EST

• mRNA: transcribed from genomic regions that are actively expressed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5′ or 3′
  – Typical size: 200–500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify genes, alternative splicing, etc.
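To make "complementary to mRNA" concrete, the sketch below computes the first-strand cDNA for a short mRNA fragment. The function name is hypothetical, and the input is simply the first three codons of the EGF coding sequence written as RNA.

```python
# DNA base that pairs with each RNA base.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the cDNA strand complementary to an mRNA, both written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGCUGCUC"))
```

Because the two strands are antiparallel, the cDNA is the reverse complement of the mRNA, not just a base-by-base complement.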

dbEST (release 120111, Dec 1, 2011)

• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function
• InterPro: protein families and domains
  – http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  – http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  – http://www.expasy.ch/sprot
• UniProt
  – http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals
• Genetic variations influence our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1–100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
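The 1% rule above can be applied directly to allele counts. The sketch below is illustrative only: the function names and the counts are made up, and the slide does not say how to treat a frequency of exactly 1%, so this version assumes the threshold must be strictly exceeded.

```python
def minor_allele_frequency(allele_counts):
    """Frequency of the second most common allele at one position."""
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    ordered = sorted(allele_counts.values(), reverse=True)
    return ordered[1] / total


def classify_site(allele_counts, threshold=0.01):
    """Label a site using the >1% / <1% convention (threshold is an assumption at exactly 1%)."""
    maf = minor_allele_frequency(allele_counts)
    if maf == 0:
        return "monomorphic"
    return "SNP" if maf > threshold else "mutation"


# Hypothetical counts from 1,000 diploid individuals (2,000 chromosomes).
print(classify_site({"A": 1960, "G": 40}))
print(classify_site({"A": 1995, "G": 5}))
```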

SNPs

• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affects gene transcription, quantitatively or qualitatively
  – Affects gene translation, quantitatively or qualitatively
  – Changes protein structure and function
  – Changes gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single-base change, swapping an A for a T, which causes the amino acid inserted into hemoglobin to be valine instead of glutamic acid

http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm

1. Normal red blood cells 2. Sickled red blood cells
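The sickle cell change is a standard example of a non-synonymous substitution. The sketch below uses a deliberately partial codon table (three codons only) to contrast it with a synonymous change; the codons GAG and GTG are the usual coding-strand description of this beta-globin substitution and are not shown on the slide, and the function name is hypothetical.

```python
# Partial codon table: only the codons needed for this example.
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val"}


def classify_change(ref_codon, alt_codon):
    """Describe a codon change as synonymous or non-synonymous."""
    ref_aa = CODON_TO_AMINO_ACID[ref_codon]
    alt_aa = CODON_TO_AMINO_ACID[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return f"{ref_codon}->{alt_codon}: {ref_aa}->{alt_aa} ({kind})"


print(classify_change("GAG", "GTG"))  # A -> T at the second codon position
print(classify_change("GAG", "GAA"))  # a third-position change that keeps Glu
```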

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
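The two SNP figures quoted above are consistent with each other, as a quick check shows. The genome size used here (about 3 billion base pairs for the haploid human genome) is an assumption added for the calculation; it is not stated on the slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp

expected_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"{expected_snps:,}")
```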

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (ss) records have accessions starting with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles="A/G"|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
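The table above maps naturally onto a dictionary. The sketch below encodes it and adds a reverse lookup from a set of alleles to the one-letter code; the helper name alleles_to_code and the slash-separated allele format are assumptions made for the example.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: the set of bases -> one-letter code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def alleles_to_code(alleles):
    """Map an allele string such as 'A/G' to its IUPAC code."""
    return CODE_FOR_BASES[frozenset(alleles.upper().split("/"))]


print(alleles_to_code("A/G"), alleles_to_code("A/C"))
```

This matches the dbSNP records in this lecture, where an A/G variation is written as R and an A/C variation as M in the sequence.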

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of ss records; batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
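Queries like those in the table are plain strings, so they can be assembled programmatically. The helpers below are hypothetical names used for illustration; they only build query text and do not contact NCBI. Parentheses are added around each combined group to make the intended grouping explicit rather than relying on the order in which the operators are evaluated.

```python
def field(value, name):
    """Tag a search value with a field name, e.g. 1[CHR]."""
    return f"{value}[{name}]"


def any_of(*terms):
    """Combine terms with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Combine terms with AND, wrapped in parentheses."""
    return "(" + " AND ".join(terms) + ")"


def excluding(query, term):
    """Remove matches of term from query with NOT."""
    return f"{query} NOT {term}"


chr1_or_2 = any_of(field("1", "CHR"), field("2", "CHR"))
queries = [
    field("BRC*", "Gene Name"),
    all_of(field("1", "CHR"), field("frameshift", "Function_Class")),
    chr1_or_2,
    excluding(chr1_or_2, field("unknown", "METHOD")),
]
for query in queries:
    print(query)
```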

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles="A/C"|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

Field                      Description
Header                     The FASTA header line starts with > and has fields separated by |. Each field is explained below.
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence. It is always the 5′ flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the lengths of the 5′ flank, the 3′ flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
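The field rules above are enough to parse a header and recover the flank lengths. The following is a sketch with hypothetical function names; the quoting of the alleles value and the / separator in the example headers are reconstructed assumptions, and the parser accepts the value with or without quotes.

```python
def parse_snp_header(line):
    """Split a dbSNP FASTA header line into a dict of its fields."""
    parts = line.strip().lstrip(">").split("|")
    record = {"database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # fields without a key (handle, submitter ID) are skipped
            key, value = part.split("=", 1)
            record[key] = value.strip('"')
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen") or record["len"])
    return allele_pos - 1, total - allele_pos


def record_class(record):
    """ss accessions are submitted SNPs; rs accessions are refSNP clusters."""
    kinds = {"ss": "submitted SNP", "rs": "refSNP cluster"}
    return kinds.get(record["accession"][:2], "unknown")


headers = [
    '>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles="A/G"|mol=Genomic',
    '>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles="A/C"|mol=Genomic|build=130',
]
for header in headers:
    rec = parse_snp_header(header)
    print(rec["accession"], record_class(rec), rec["alleles"], flank_lengths(rec))
```

The flank lengths follow from the rule that allelePos is the 5′ flank length plus 1 and that the variation counts as one base in the total length.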

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI – OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper about a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

Department of Health Information Management

FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT

Department of Health Information Management

Too Many Results

Department of Health Information Management

Search Limits

Department of Health Information Management

Reduced Search Results

Department of Health Information Management

Gene Record

Department of Health Information Management

RefSeqbull Database of reference sequences

ndash httpwwwncbinlmnihgovRefSeq

bull Curatedndash Many experimentally validated

ndash Some partially validated via ESTs

ndash Some computationally predicted

bull Non-redundant one record for each gene or each splice variant from each organism represented

bull Status Codesndash Provisional (temporary)

ndash Reviewed

ndash Predicted

Department of Health Information Management

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency

  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)

  – When a single base change occurs at <1%, it is considered to be a mutation

• A SNP is a polymorphic position where the point mutation has been fixed in the population

• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
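The frequency cutoff described on this slide can be written as a small function. This is only a sketch: the name `classify_variant` is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide defines only the >1% and <1% cases.

```python
def classify_variant(allele_frequency: float) -> str:
    """Label a single-base change by its population frequency.

    Assumption: a frequency of exactly 1% is treated as a mutation,
    because the slide defines only the >1% and <1% cases.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return "SNP" if allele_frequency > 0.01 else "mutation"


print(classify_variant(0.12))   # a common single-base change
print(classify_variant(0.002))  # a rare single-base change
```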

SNPs

• SNPs can occur anywhere on a genome; they are classified based on their locations

  – Many SNPs are in genomic non-coding regions

  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions

    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)

  – No consequence

  – Affect gene transcription quantitatively or qualitatively

  – Affect gene translation quantitatively or qualitatively

  – Change protein structure and function

  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene

  – e.g., Huntington's disease, cystic fibrosis, etc.

• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors

  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single-nucleotide substitution, swapping an A for a T, which causes valine to be inserted instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells  2. Sickled red blood cells
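The single-base change can be illustrated at the codon level. The sketch below assumes the commonly cited coding-strand codons for this position in beta-globin (GAG in the normal allele, GTG in the sickle allele), which are not given on the slide; the helper `substitute` and the two-entry codon table are hypothetical and cover only this example.

```python
# Partial codon table: only the two codons needed for this illustration.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GTG": "valine",
}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
sickle_codon = substitute(normal_codon, 1, "T")  # A -> T at the middle base

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```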

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP

  – http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)

  – http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project

  – http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)

  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes

  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)

    • Roughly 10 million in the human population, or on average 1 per 300 bp

    • Less than half of these SNPs are identified and stored in the database

  – Microsatellite repeat variations (short tandem repeats, STRs)

    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome

  – Small-scale multi-base deletions or insertions

    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
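The two SNP figures on this slide (about 10 million SNPs, about 1 per 300 bp) agree with each other if the human genome is taken to be roughly 3 billion base pairs. That genome size is an assumption brought in for this check; it is not stated on the slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp

estimated_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"{estimated_snps:,} SNPs")  # 10,000,000 SNPs
```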

dbSNP Data Types

• dbSNP contains two classes of records

  – Submitted records

    • The original observations of sequence variation; submitted SNP (ss) record accessions start with ss

  – Computationally annotated records

    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
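The IUPAC table translates directly into a lookup. The sketch below stores it as a dictionary and lists every unambiguous sequence that an IUPAC-coded fragment could stand for; the function name `expand` and the example fragment are hypothetical.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence: str) -> list[str]:
    """List every unambiguous sequence an IUPAC-coded sequence can stand for."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand("CCARCT"))  # ['CCAACT', 'CCAGCT']
```

The number of expansions grows multiplicatively with each ambiguous position, so this approach is only practical for short fragments.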

Different Ways to Search SNPs in dbSNP

• dbSNP web site

  – Direct search of ss records and batch search; allows SNP record submission; no search limits

• Entrez SNP

  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wildcard (*), range (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frameshift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
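Queries of this form are plain strings, so they can be assembled programmatically before being pasted into the search box. The sketch below only builds the query text and does not contact NCBI; the helper names `term` and `combine` are hypothetical, and the field tags are the ones shown in the examples.

```python
def term(value: str, field: str) -> str:
    """Format one field-tagged search term, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a Boolean operator (AND, OR, or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


chromosomes = combine("OR", term("1", "CHR"), term("2", "CHR"))
query = combine("NOT", chromosomes, term("unknown", "METHOD"))
print(query)  # 1[CHR] OR 2[CHR] NOT unknown[METHOD]
```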

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer

• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

Header: the FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: internal use

dbSNP: database name

ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs.

allelePos: variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1.

len/totalLen: total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation.

handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id.

taxid: NCBI taxonomy id

mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria.

snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism).

alleles: lists the alleles of the SNP, separated by /.

Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements.

ATCG (green): green color is used for assay sequence (observed by the submitter)

ATCG (black): black color is used for flank sequence (extracted from sequence databases)
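The field rules above can be checked against the rs799916 header shown earlier. The sketch below splits a header on `|` and derives the flank lengths from allelePos and the total length. The function name `parse_dbsnp_header` and the derived keys `flank5_len` and `flank3_len` are hypothetical, and the `/` between alleles is an assumption about how the header was originally written.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its positional and key=value fields."""
    fields = header.lstrip(">").rstrip("|").split("|")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value
    # The length field is named "len" in some headers and "totalLen" in others.
    total = int(record.get("totalLen", record.get("len")))
    allele_pos = int(record["allelePos"])
    record["flank5_len"] = allele_pos - 1      # allelePos = 5' length + 1
    record["flank3_len"] = total - allele_pos  # total = 5' + 1 (variation) + 3'
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
info = parse_dbsnp_header(header)
print(info["accession"], info["flank5_len"], info["flank3_len"])  # rs799916 300 300
```

Run on the ss5586300 header (allelePos=214, len=475), the same function gives a 213-base 5' flank and a 261-base 3' flank.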

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information

  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways

  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included

  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on

  • conducting population-based genomic research

  • assessing the role of family health history in disease risk and prevention

  • supporting a systematic process for evaluating genetic tests

  • translating genomics into public health research and programs

  • strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)

  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources

  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

  – Information on population prevalence of genetic variants

  – Gene-disease associations

  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in

  – OMIM: http://www.ncbi.nlm.nih.gov/Omim

  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

GenBank Release 187.0

• ftp://ftp.ncbi.nih.gov/genbank

• Full release every two months

• Incremental and cumulative updates daily

Release 181.0 (12/15/2011)

• 146,413,798 sequences

• 135,117,731,375 base pairs

GenBank Record (Header)

LOCUS       NM_001963    5600 bp    mRNA    linear    PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1, mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced
            EGFR surface signalling and redistribution into recycling endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
  PUBMED    20832399

GenBank Record (Features)

FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP..."
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
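Feature locations give 1-based, inclusive start and end coordinates, so a feature's length is end minus start plus 1. The sketch below applies this to the CDS, assuming its location reads `453..4076` (the `..` separator appears to have been lost in extraction); the function name `span_length` is hypothetical.

```python
def span_length(location: str) -> int:
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


cds_bases = span_length("453..4076")
codons, remainder = divmod(cds_bases, 3)
print(cds_bases, codons, remainder)  # 3624 1208 0
```

A remainder of 0 shows the CDS is a whole number of codons. If the final codon is a stop codon, 1208 codons would correspond to a 1207-residue protein.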

GenBank Record (Sequence)

ORIGIN

1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
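The ORIGIN block interleaves position numbers and 10-base groups, so it has to be flattened before the sequence can be used elsewhere. The sketch below does this for the first two lines shown above; the function name `origin_to_sequence` is hypothetical.

```python
def origin_to_sequence(lines: list[str]) -> str:
    """Join ORIGIN lines into one sequence, dropping the leading position numbers."""
    sequence = []
    for line in lines:
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip blank lines and the ORIGIN keyword itself
        sequence.extend(parts[1:])
    return "".join(sequence)


origin_lines = [
    "ORIGIN",
    "1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc",
    "61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt",
]
sequence = origin_to_sequence(origin_lines)
print(len(sequence), sequence[:10])  # 120 aaaaagagaa
```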

FASTA Format

>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
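A FASTA record is a header line beginning with `>` followed by one or more sequence lines. The sketch below reads a single record and extracts the accession from a `gi|...|ref|...|` style header. The function name `parse_fasta` is hypothetical, the sequence is shortened and wrapped arbitrarily for illustration, and the `.2` version suffix assumes the period was dropped from the slide text.

```python
def parse_fasta(text: str) -> tuple[str, str]:
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    return lines[0][1:], "".join(lines[1:])


# Shortened record; the line wrapping here is illustrative only.
record = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGG
CGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGT
"""
header, sequence = parse_fasta(record)
identifiers = header.split(" ", 1)[0].split("|")
print(identifiers[3], len(sequence))
```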

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq

• Database of reference sequences

  – http://www.ncbi.nlm.nih.gov/RefSeq

• Curated

  – Many experimentally validated

  – Some partially validated via ESTs

  – Some computationally predicted

• Non-redundant: one record for each gene, or each splice variant, from each organism represented

• Status codes

  – Provisional (temporary)

  – Reviewed

  – Predicted

Accession Numbers

• DNA sequences and other molecular data are tagged with accession numbers, which are used to identify a sequence or other record relevant to molecular data

• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence

• RefSeq identifiers include the following formats

  – Complete chromosome: NC_

  – Genomic contig: NT_

  – mRNA (DNA format): NM_, XM_

  – Protein: NP_, XP_
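Because the prefix encodes the molecule type, an accession can be classified with a simple lookup. The sketch below covers only the prefixes listed on this slide; the function name `refseq_type` is hypothetical.

```python
# Prefixes and meanings taken from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_type(accession: str) -> str:
    """Map a RefSeq accession to the molecule type implied by its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


print(refseq_type("NM_001963"))  # mRNA
print(refseq_type("NP_001954"))  # protein
```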

EST

EST

• mRNA: genomic regions actively transcribed in the cell

• cDNA (complementary DNA)

  – Copy of mRNA made using mRNA as a template

  – Sequence is complementary to mRNA

• EST: expressed sequence tag (a short sub-sequence of a transcribed cDNA sequence)

  – Partial cDNA sequence

  – Can be 5' or 3'

  – Typical size: 200-500 bp

  – Represents mRNA actively transcribed in the cell

  – Used to identify

    • Genes, alternative splicing, etc.
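The statement that cDNA is complementary to its mRNA template can be made concrete with base pairing. The sketch below returns the complementary DNA strand written 5' to 3', which means reversing the mRNA and pairing each base; the function name `first_strand_cdna` and the example mRNA fragment are hypothetical.

```python
# RNA base -> complementary DNA base.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGCUGCUC"))  # GAGCAGCAT
```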

dbEST (release 120111, Dec 1, 2011)

• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html

• Number of entries: 71,276,166

  – Homo sapiens (human): 8,315,294

  – Mus musculus (mouse): 4,853,562

  – Arabidopsis thaliana (thale cress): 1,529,700

  – Danio rerio (zebrafish): 1,488,275

  – Drosophila melanogaster (fruit fly): 821,005

  – Gallus gallus (chicken): 600,433

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez

• The nucleotide sequences may be searched using the BLAST server

• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function

• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro

• Protein Information Resource (PIR): http://pir.georgetown.edu

• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot

• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains) which may have functional significance

• Databases exist to store protein families, motifs, and structural domains

  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

  • Pfam: http://www.sanger.ac.uk/Software/Pfam

  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure

• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography

  – PDB: http://www.pdb.org

  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop

  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids

• Understanding the shape of a molecule helps to understand how it works

• As of January 2010, there are 62,787 searchable structures in the PDB database

• PDB provides

  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about the protein, and literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Meaning
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG                 Assay sequence (observed by the submitter)
Black ATCG                 Flank sequence (extracted from sequence databases)
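The field layout above can be sketched in a few lines of Python. This is only an illustrative parser, not a dbSNP tool: the function name parse_dbsnp_header is made up for this example, and the quoting and "/" separator in the alleles field are assumed from the usual dbSNP header style.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dict of named fields (illustrative helper)."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    parts = [p for p in header[1:].strip().split("|") if p]
    # The first three fields are positional: gnl, database name, ss/rs accession.
    fields = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value.strip("'\"")
        else:
            # Submitter handle and submitted SNP id (ss records only) carry no key.
            fields.setdefault("extra", []).append(part)
    for key in ("allelePos", "totalLen", "len", "taxid"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1, and the variation counts as one base.
five_prime_len = record["allelePos"] - 1
three_prime_len = record["totalLen"] - record["allelePos"]
print(record["accession"], five_prime_len, three_prime_len)
```

For the rs799916 header this gives a 300-base flank on each side of the variant, which matches the 601-base total.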

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  - Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  - Search for a gene or a disease; once the record is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  - Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  - conducting population-based genomic research
  - assessing the role of family health history in disease risk and prevention
  - supporting a systematic process for evaluating genetic tests
  - translating genomics into public health research and programs
  - strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  - Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  - HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  - information on population prevalence of genetic variants
  - gene-disease associations
  - gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  - OMIM: http://www.ncbi.nlm.nih.gov/Omim
  - Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper that concerns a genetic disease (or a disease with a known genetic basis) and uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


GenBank Release 187.0
• ftp://ftp.ncbi.nih.gov/genbank/
• Full release every two months
• Incremental and cumulative updates daily

Release 187.0 (12/15/2011)
• 146,413,798 sequences
• 135,117,731,375 base pairs

GenBank Record (Header)
LOCUS       NM_001963               5600 bp    mRNA    linear   PRI 15-JAN-2012
DEFINITION  Homo sapiens epidermal growth factor (EGF), transcript variant 1,
            mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.4  GI:296011011
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 5600)
  AUTHORS   de Diesbach,M.T., Cominelli,A., N'Kuli,F., Tyteca,D. and
            Courtoy,P.J.
  TITLE     Acute ligand-independent Src activation mimics low EGF-induced
            EGFR surface signalling and redistribution into recycling
            endosomes
  JOURNAL   Exp. Cell Res. 316 (19), 3239-3253 (2010)
   PUBMED   20832399
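The LOCUS line packs several facts into one whitespace-delimited row. A minimal sketch of pulling them apart is shown here; the function name and dictionary keys are hypothetical, and it only handles nucleotide records laid out like the EGF example.

```python
from datetime import datetime


def parse_locus_line(line):
    """Split a GenBank LOCUS line into named fields (illustrative helper)."""
    tokens = line.split()
    if len(tokens) < 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a nucleotide LOCUS line in the expected layout")
    return {
        "name": tokens[1],
        "length_bp": int(tokens[2]),
        "molecule": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        # Month abbreviations are read as English names (assumes a C/English locale).
        "date": datetime.strptime(tokens[7].title(), "%d-%b-%Y").date(),
    }


locus = parse_locus_line("LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012")
print(locus["name"], locus["length_bp"], locus["division"], locus["date"])
```

PRI is the GenBank division for primate sequences, which is why a human mRNA record carries it.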

GenBank Record (Features)
FEATURES             Location/Qualifiers
     source          1..5600
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4q25"
     gene            1..5600
                     /gene="EGF"
                     /gene_synonym="HOMG4; URG"
                     /note="epidermal growth factor"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
     exon            1..579
                     /number=1
     CDS             453..4076
                     /codon_start=1
                     /protein_id="NP_001954.2"
                     /db_xref="GI:166362728"
                     /db_xref="GeneID:1950"
                     /db_xref="MIM:131530"
                     /translation="MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP..."
     exon            580..779
                     /number=2
     exon            780..961
                     /number=3
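Feature locations are 1-based and inclusive, so a span's length is end minus start plus 1. The sketch below works through the EGF features, assuming the standard GenBank start..end notation; the helper name span_length is made up for this example and it does not handle joins or complements.

```python
def span_length(location):
    """Length in bases of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


features = {
    "exon 1": "1..579",
    "CDS": "453..4076",
    "exon 2": "580..779",
    "exon 3": "780..961",
}
for name, location in features.items():
    print(name, location, span_length(location), "bp")

# A coding region should be a whole number of codons (stop codon included).
cds_length = span_length(features["CDS"])
codons = cds_length // 3
print("CDS codons:", codons, "remainder:", cds_length % 3)
```

Note that the CDS starts inside exon 1 (position 453), so the first 452 bases of the transcript are 5' untranslated sequence.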

GenBank Record (Sequence)
ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
      121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
      181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
      241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
      301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
      361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
      421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
      481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
      541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
      601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
      661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
      721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
      781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
      841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
      901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
      961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
     1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
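Each ORIGIN line starts with the 1-based coordinate of its first base, followed by six blocks of ten bases. As a small sketch (the helper name is hypothetical), the line that begins at position 421 can be used to locate the start codon, which should agree with the CDS start given in the feature table:

```python
def parse_origin_line(line):
    """Return (start_position, bases) for one numbered ORIGIN line."""
    tokens = line.split()
    return int(tokens[0]), "".join(tokens[1:])


line = "421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc"
start, bases = parse_origin_line(line)

# Convert the 0-based index within the line to a 1-based sequence coordinate.
atg_position = start + bases.find("atg")
print(len(bases), atg_position)
```

The first ATG on this line falls at coordinate 453, the same position where the CDS feature begins.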

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
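A FASTA record is a ">" header line followed by one or more sequence lines. The following is a minimal reader written for illustration only (parse_fasta is not a library function), applied to the first 60 bases of the TP53 record with the header punctuation written out in the usual NCBI style.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTG
GCGCTAAAAGTTTTGAGCTT
"""
records = parse_fasta(example)
header, sequence = records[0]
accession = header.split("|")[3]
print(accession, len(sequence))
```

Sequence lines are simply concatenated, so the reader does not care how the sequence was wrapped.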

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  - http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  - Many experimentally validated
  - Some partially validated via ESTs
  - Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  - Provisional (temporary)
  - Reviewed
  - Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  - Complete chromosome: NC_
  - Genomic contig: NT_
  - mRNA (DNA format): NM_, XM_
  - Protein: NP_, XP_
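The prefix list above maps directly onto a lookup table. This sketch covers only the six prefixes named on the slide (RefSeq defines others), and the function name is made up for illustration.

```python
# Only the prefixes listed on the slide; other RefSeq prefixes exist.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession):
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


def split_version(accession):
    """Separate 'NM_001963.4' into the accession and its version number."""
    base, _, version = accession.partition(".")
    return base, int(version) if version else None


for acc in ("NM_001963.4", "NP_001954.2", "AB123456"):
    print(acc, refseq_molecule_type(acc), split_version(acc))
```

The EGF record used earlier illustrates the pairing: NM_001963 is the mRNA and NP_001954 is the protein it encodes. The last accession is a made-up non-RefSeq string to show the fallback.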

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  - Copy of mRNA, made using mRNA as a template
  - Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  - Partial cDNA sequence
  - Can be 5' or 3'
  - Typical size: 200-500 bp
  - Represents mRNA actively transcribed in the cell
  - Used to identify
    • Genes, alternative splicing, etc.
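To make "complementary to the mRNA" concrete, the sketch below derives a first-strand cDNA from an mRNA string by reverse-complementing it and switching from the RNA to the DNA alphabet. The function name is hypothetical, and the nine-base input is the start of the EGF coding region written as RNA.

```python
# RNA base -> complementary DNA base
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA, written 5'->3', complementary to an mRNA."""
    mrna = mrna.upper()
    # Reverse so the result reads 5'->3', then complement each base.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


mrna = "AUGCUGCUC"
print(mrna, "->", first_strand_cdna(mrna))
```

Database records, including ESTs, are normally reported in the DNA alphabet, which is why the GenBank mRNA record above contains t rather than u.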

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  - Homo sapiens (human): 8,315,294
  - Mus musculus (mouse): 4,853,562
  - Arabidopsis thaliana (thale cress): 1,529,700
  - Danio rerio (zebrafish): 1,488,275
  - Drosophila melanogaster (fruit fly): 821,005
  - Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  - CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  - Pfam: http://www.sanger.ac.uk/Software/Pfam
  - PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  - PDB: http://www.pdb.org
  - SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  - MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  - Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about the protein, literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  - The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  - Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  - Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  - Deletions, inversions, or translocations of large DNA fragments
  - Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  - A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  - When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
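The frequency rule above amounts to a single comparison. In this sketch the function name is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide only defines the cases above and below the threshold.

```python
def classify_single_base_change(allele_frequency, threshold=0.01):
    """Label a single-base change as 'SNP' or 'mutation' by population frequency.

    allele_frequency is a fraction between 0 and 1. Exactly 1% is labeled
    'mutation' here, which is an arbitrary choice for the boundary case.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency > threshold else "mutation"


for frequency in (0.25, 0.02, 0.005):
    print(frequency, classify_single_base_change(frequency))
```

As the last bullet notes, databases such as dbSNP do not enforce this distinction, so a record's presence there says nothing about which label applies.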

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  - Many SNPs lie in genomic non-coding regions
  - SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  - No consequence
  - Affect gene transcription, quantitatively or qualitatively
  - Affect gene translation, quantitatively or qualitatively
  - Change protein structure and function
  - Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  - e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  - e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single nucleotide substitution (an A swapped for a T) that changes the encoded amino acid in hemoglobin from glutamic acid to valine

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
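The sickle cell substitution can be traced at the codon level. The slide does not give the codon; the sketch below assumes the textbook case, in which the sixth codon of the beta-globin gene (HBB) changes from GAG to GTG, and it uses a deliberately tiny codon table.

```python
# Minimal codon table: only the codons needed for this example.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid (normal beta-globin, assumed codon)
    "GAA": "Glu",
    "GTG": "Val",  # valine (sickle beta-globin)
    "CAG": "Gln",  # glutamine, listed to show it is a different amino acid and codon
}


def substitute_base(codon, index, new_base):
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A -> T
print(normal_codon, CODON_TO_AMINO_ACID[normal_codon], "->",
      sickle_codon, CODON_TO_AMINO_ACID[sickle_codon])
```

Because the change alters the encoded amino acid, this is a non-synonymous substitution in the sense used on the earlier slide about SNP effects.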

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  - http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  - http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  - http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  - http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  - Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  - Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  - Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
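The two SNP figures quoted above are consistent with each other, as a quick back-of-the-envelope check shows. The genome size of about 3 billion base pairs is an assumption brought in for this calculation and is not stated on the slides; the 0.1% and 90% figures come from the earlier Polymorphisms slide.

```python
# Assumption (not from the slides): approximate haploid human genome size.
GENOME_SIZE_BP = 3_000_000_000

# "On average 1 SNP per 300 bp" across the population.
SNP_SPACING_BP = 300
expected_snps = GENOME_SIZE_BP // SNP_SPACING_BP

# Two unrelated individuals differ at about 0.1% of positions,
# and roughly 90% of those differences are SNPs.
pairwise_differences = GENOME_SIZE_BP // 1000
pairwise_snps = pairwise_differences * 90 // 100

print(f"{expected_snps:,} SNPs expected in the population")
print(f"{pairwise_differences:,} differing positions between two individuals")
print(f"{pairwise_snps:,} of those are SNPs")
```

Integer arithmetic is used throughout so that the results are exact rather than subject to floating-point rounding.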

dbSNP Data Types
• dbSNP contains two classes of records
  - Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  - Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
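The IUPAC table can be written as a dictionary and used in both directions: expanding an ambiguity code into the bases it stands for, and finding the code for a given pair of alleles. The function names below are made up for this sketch, and the "A/G" allele notation is the style used in dbSNP headers.

```python
from itertools import product

# Each code maps to the bases it represents, stored in alphabetical order.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_ambiguous(sequence):
    """List every unambiguous sequence encoded by a short IUPAC string."""
    choices = [IUPAC_CODES[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def code_for_alleles(alleles):
    """Find the IUPAC code for a set of alleles, e.g. 'A/G' -> 'R'."""
    wanted = "".join(sorted(set(alleles.upper().split("/"))))
    for code, bases in IUPAC_CODES.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(expand_ambiguous("CCARCT"))
print(code_for_alleles("A/G"), code_for_alleles("A/C"))
```

The expansion grows multiplicatively with each ambiguity code, so it is only practical for short stretches such as the bases around a single variant.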

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  - Direct search of ss records, batch search, and SNP record submission; no search limits
• Entrez SNP
  - http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  - Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
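Queries like the ones in the examples are plain strings made of value[Field] terms joined by Boolean operators, so they are easy to assemble programmatically. The helpers below are hypothetical and only build the text of a query; they do not contact NCBI, and the field names are taken from the examples above.

```python
def field_term(value, field):
    """Format one search term, e.g. field_term('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with AND, OR, or NOT, in the order given."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(terms)


frameshift_on_chr1 = combine(
    "AND",
    field_term("1", "CHR"),
    "(" + field_term("frameshift", "Function_Class") + ")",
)
chr1_or_chr2 = combine("OR", field_term("1", "CHR"), field_term("2", "CHR"))
known_method_only = combine("NOT", chr1_or_chr2, field_term("unknown", "METHOD"))

print(frameshift_on_chr1)
print(chr1_or_chr2)
print(known_method_only)
```

Building queries from parts like this makes it harder to mistype a field tag when the same search has to be repeated for many genes or chromosomes.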



Department of Health Information Management

GenBank Record (Header)LOCUS NM_001963 5600 bp mRNA linear PRI 15-JAN-2012 DEFINITION Homo sapiens epidermal growth factor (EGF)

transcript variant 1 mRNA ACCESSION NM_001963 VERSION NM_0019634 GI296011011 KEYWORDS SOURCE Homo sapiens (human) ORGANISM Homo sapiens Eukaryota Metazoa Chordata

Craniata Vertebrata Euteleostomi Mammalia Eutheria Euarchontoglires Primates Haplorrhini Catarrhini Hominidae Homo

REFERENCE 1 (bases 1 to 5600) AUTHORS de DiesbachMT CominelliA NKuliF

TytecaD and CourtoyPJ TITLE Acute ligand-independent Src activation mimics low EGF-induced EGFR surface signalling and redistribution into recycling endosomes

JOURNAL Exp Cell Res 316 (19) 3239-3253 (2010) PUBMED 20832399

Department of Health Information Management

GenBank Record (Features)FEATURES LocationQualifiers source 15600

organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25

gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530

exon 1579 number=1

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

FASTA Format

>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
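A FASTA file is a header line beginning with ">" followed by one or more sequence lines. The following is a minimal reading sketch, not a replacement for an established parser; the function name and the shortened example record are illustrative assumptions.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example based on the TP53 record above (first 20 bases only).
example = ">gi|371502116|ref|NM_001126113.2| Homo sapiens TP53\nGATGGGATTG\nGGGTTTTCCC\n"
for header, sequence in parse_fasta(example):
    print(header.split("|")[3], len(sequence))
```

Splitting the header on "|" recovers the individual identifiers, here the RefSeq accession with its version.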

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq

• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers

• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
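The prefix list above maps directly onto a lookup table. This is a small sketch covering only the prefixes named on the slide; the function name is hypothetical, and RefSeq defines further prefixes that are not handled here.

```python
# Prefix-to-record-type table, limited to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown prefix")


print(refseq_record_type("NM_001963.4"))   # the EGF mRNA record shown earlier
print(refseq_record_type("NP_001954.2"))   # its protein product
```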

EST

EST

• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be from the 5' or 3' end
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
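To illustrate what "complementary to the mRNA" means, the sketch below derives the first-strand cDNA for a short made-up mRNA fragment. It assumes the usual convention of writing both strands 5' to 3', so the complement is also reversed; the function name and the example sequence are illustrative only.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    complement = [RNA_TO_DNA_COMPLEMENT[base] for base in mrna.upper()]
    return "".join(reversed(complement))


print(first_strand_cdna("AUGGCC"))
```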

dbEST (release 120111, Dec 1, 2011)

• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, and annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of variable-length sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
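The frequency rule above can be written as a short function. This is only a sketch of the slide's definition: the function name is made up, and because the slide does not say how a frequency of exactly 1% is classified, that case is reported separately rather than guessed.

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Apply the 1% allele-frequency threshold to a single base change."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > 0.01:
        return "SNP"
    if allele_frequency < 0.01:
        return "mutation"
    return "exactly 1%: not specified by the definition"


for frequency in (0.25, 0.002):
    print(frequency, classify_single_base_change(frequency))
```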

SNPs

• SNPs can occur anywhere on a genome; they are classified by location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Caused by a single nucleotide substitution (an A replaced by a T), which changes the encoded amino acid in hemoglobin from glutamic acid to valine

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
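The single-base change described above can be traced through the genetic code. The slide does not name the codon; the sketch below assumes the commonly cited textbook example of the beta-globin codon GAG (glutamic acid) changing to GTG (valine), and it includes only the two codon-table entries needed for that example.

```python
# Two entries of the standard genetic code (DNA coding-strand codons).
CODON_TO_AMINO_ACID = {"GAG": "Glu", "GTG": "Val"}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal = "GAG"                       # assumed normal codon (not on the slide)
sickle = substitute(normal, 1, "T")  # middle base A replaced by T

print(normal, CODON_TO_AMINO_ACID[normal])
print(sickle, CODON_TO_AMINO_ACID[sickle])
```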

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; the number is likely to fall between those of SNPs and VNTRs
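The two SNP figures above (about 10 million SNPs, about 1 per 300 bp) are consistent with each other for a genome of roughly 3 billion base pairs. The genome size is an assumption introduced for this check and is not stated on the slide.

```python
# Assumption (not on the slide): the human genome is roughly 3 billion bp.
ASSUMED_GENOME_SIZE_BP = 3_000_000_000
BASES_PER_SNP = 300                      # "on average 1 per 300 bp"

expected_snps = ASSUMED_GENOME_SIZE_BP // BASES_PER_SNP
print(f"{expected_snps:,} SNPs expected")
```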

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles="A/G"|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
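The code table above is easy to hold as a dictionary, which makes it simple to see which bases an ambiguity code in a dbSNP flanking sequence can stand for. The helper name is illustrative.

```python
# IUPAC nucleotide codes and the bases each one represents (from the table).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def code_matches(code: str, base: str) -> bool:
    """True if the given base is one of the bases the IUPAC code allows."""
    return base.upper() in IUPAC_CODES[code.upper()]


print(IUPAC_CODES["R"])          # the two alleles of an A/G variation
print(code_matches("Y", "T"))
```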

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frame-shift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
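Queries like the examples above follow a value[Field] pattern joined by Boolean operators, so they can be assembled as plain strings. The helpers below are hypothetical conveniences for building the text typed into the search box; they do not call any NCBI service, and the field names are the ones shown on the slide.

```python
def term(value: str, field: str) -> str:
    """Format a single search term as value[Field]."""
    return f"{value}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a Boolean operator (AND, OR, or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(terms)


query = combine("AND", term("1", "CHR"), term("frameshift", "Function_Class"))
print(query)
```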

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles="A/C"|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len/totalLen: Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by "/"
Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green): Green color is used for assay sequence (observed by the submitter)
ATCG (black): Black color is used for flank sequence (extracted from sequence databases)
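The field layout described above can be read with a few lines of string handling. This is a minimal sketch, assuming the header uses ">" as its first character, "|" between fields, and key=value pairs after the accession, with allele values quoted and separated by "/"; the function name is made up, and the positional handle and submitter ID fields of submitted records are simply skipped.

```python
def parse_dbsnp_header(line: str) -> dict:
    """Split a dbSNP FASTA header into its positional and key=value fields."""
    fields = [f for f in line.lstrip(">").strip().split("|") if f]
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:                       # fields without '=' are skipped
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
    return record


header = ('>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|'
          'snpclass=1|alleles="A/C"|mol=Genomic|build=130')
rec = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1; the rest of totalLen is the 3' flank.
flank5 = int(rec["allelePos"]) - 1
flank3 = int(rec["totalLen"]) - int(rec["allelePos"])
print(rec["accession"], rec["alleles"], flank5, flank3)
```

For the rs799916 record, both flanks come out at 300 bases, which matches the 30 ten-base blocks on either side of the M code in the sequence.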

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

GenBank Record (Features)FEATURES LocationQualifiers source 15600

organism=Homo sapiens mol_type=mRNA db_xref=taxon9606 chromosome=4 map=4q25

gene 15600 gene=EGF gene_synonym=HOMG4 URG note=epidermal growth factor db_xref=GeneID1950ldquo db_xref=MIM131530

exon 1579 number=1

CDS 4534076 codon_start=1 protein_id=NP_0019542 db_xref=GI166362728ldquo db_xref=GeneID1950ldquo db_xref=MIM131530 translation=MLLTLIILLPVVSKFSFVSLSAPQHWSCPEGTLAGNGNSTCVGP hellip

exon 580779 number=2

exon 780961 number=3

Department of Health Information Management

GenBank Record (Sequence)ORIGIN 1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc

Department of Health Information Management

FASTA Formatgtgi|371502116|ref|NM_0011261132| Homo sapiens tumor protein p53 (TP53) transcript variant 4 mRNA GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT

Department of Health Information Management

Too Many Results

Department of Health Information Management

Search Limits

Department of Health Information Management

Reduced Search Results

Department of Health Information Management

Gene Record

Department of Health Information Management

RefSeqbull Database of reference sequences

ndash httpwwwncbinlmnihgovRefSeq

bull Curatedndash Many experimentally validated

ndash Some partially validated via ESTs

ndash Some computationally predicted

bull Non-redundant one record for each gene or each splice variant from each organism represented

bull Status Codesndash Provisional (temporary)

ndash Reviewed

ndash Predicted

Department of Health Information Management

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where a point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
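The frequency-based terminology above can be written as a small rule. This is only a sketch: the function name is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide only defines the >1% and <1% cases.

```python
def classify_single_base_change(allele_frequency, threshold=0.01):
    """Label a single-base change by its population allele frequency.

    The 1% cutoff follows the slide's definition. A frequency of exactly
    1% is treated as a mutation here, which is an assumption.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency > threshold else "mutation"


print(classify_single_base_change(0.15))   # SNP
print(classify_single_base_change(0.002))  # mutation
```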

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter region, coding region, intronic region, exonic region, UTR, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to one SNP in the beta-globin gene (an A swapped for a T), causing the encoded amino acid to be valine instead of glutamic acid in hemoglobin

http://mmcenters.discoveryhospital.com/shared/enc/img_htm/IM-56.htm

1. Normal red blood cells 2. Sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) records start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP Cluster (refSNP) records start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
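The IUPAC table can be turned into a lookup for spotting variable positions in a dbSNP flanking sequence. The sketch below is illustrative only; the function names are hypothetical, and the test string is a 10-base group taken from the ss5586300 record shown earlier.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_iupac(code):
    """Return the list of bases represented by one IUPAC code."""
    return sorted(IUPAC_CODES[code.upper()])


def ambiguous_positions(sequence):
    """Return (1-based position, code, possible bases) for every non-ACGT code.

    Spaces (used to group bases in tens) are ignored.
    """
    hits = []
    for position, code in enumerate(sequence.replace(" ", "").upper(), start=1):
        if code not in "ACGT":
            hits.append((position, code, expand_iupac(code)))
    return hits


print(ambiguous_positions("CATTCAATRT"))  # [(9, 'R', ['A', 'G'])]
```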

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of ss records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
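Queries like those in the table can also be sent programmatically. The sketch below only builds a request URL and does not contact NCBI. It assumes the NCBI E-utilities `esearch` endpoint with `db=snp`, which is not mentioned on the slide, and the field tags copied from the slide (such as `[Function_Class]`) may not match the current dbSNP search fields.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not part of the lecture).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def snp_search_url(term, retmax=20):
    """Build an esearch URL for a dbSNP query term; nothing is fetched."""
    query = urlencode({"db": "snp", "term": term, "retmax": retmax})
    return ESEARCH + "?" + query


url = snp_search_url("1[CHR] AND frameshift[Function_Class]")
print(url)
```

Keeping the query term as a plain string means the examples from the table can be pasted in unchanged; `urlencode` takes care of escaping the brackets and spaces.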

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len/totalLen: Total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid: NCBI taxonomy id
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green): Green color is used for assay sequence (observed by the submitter)
ATCG (black): Black color is used for flank sequence (extracted from sequence databases)
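The field descriptions above map directly onto a small parser. This is a sketch under stated assumptions: the function name and the returned dictionary keys are hypothetical, and the header is assumed to use `key=value` fields separated by `|` with alleles separated by `/`, as on the rs799916 example slide.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dictionary of fields."""
    parts = header.lstrip(">").strip().split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    # ss = submitted SNP, rs = reference SNP cluster (per the slide).
    kinds = {"ss": "submitted SNP", "rs": "reference SNP cluster"}
    record["record_type"] = kinds.get(parts[2][:2], "unknown")
    for field in parts[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("\"'")
    for key in ("allelePos", "len", "totalLen", "taxid"):
        if key in record:
            record[key] = int(record[key])
    if "alleles" in record:
        record["alleles"] = record["alleles"].split("/")
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
info = parse_dbsnp_header(header)
five_prime_flank = info["allelePos"] - 1               # allelePos = 5' length + 1
three_prime_flank = info["totalLen"] - info["allelePos"]
print(info["accession"], info["record_type"], info["alleles"])
print(five_prime_flank, three_prime_flank)             # 300 300
```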

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


GenBank Record (Sequence)
ORIGIN
1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc

61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt

121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt

181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc

241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga

301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag

361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg

421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc

481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg

541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt

601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg

661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt

721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga

781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag

841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt

901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa

961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

1021 tgggagtgaa ggctctgttg gagacatcag agaaaataac agctgtgtca ttggatgtgc
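The ORIGIN block interleaves position counters with groups of ten bases, so extracting the plain sequence means dropping the counters and the spaces. The following is a minimal sketch with a hypothetical function name; it assumes the sequence is written in lower case, as in the record above.

```python
def origin_to_sequence(origin_block):
    """Extract the nucleotide sequence from a GenBank ORIGIN block.

    Position counters are numeric and the ORIGIN keyword is upper case,
    so only lower-case alphabetic tokens are kept.
    """
    bases = []
    for line in origin_block.splitlines():
        for token in line.split():
            if token.isalpha() and token.islower():
                bases.append(token)
    return "".join(bases).upper()


block = """ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
"""
sequence = origin_to_sequence(block)
print(len(sequence), sequence[:10])
```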

FASTA Format
>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
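A FASTA record is a header line beginning with > followed by one or more sequence lines. A minimal reader might look like the sketch below; the function name is hypothetical, and the example input is a shortened toy record rather than the full TP53 sequence.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = """>toy_record_1 first example
GATGGGATTG
GGGTTTTCCC
>toy_record_2 second example
ACGT ACGT
"""
for name, seq in read_fasta(example).items():
    print(name, len(seq))
```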

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
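The prefix list above is enough to tell what kind of molecule a RefSeq accession refers to. The sketch below covers only the prefixes named on the slide; the function name is hypothetical, and other RefSeq prefixes are reported as unknown rather than guessed.

```python
# RefSeq accession prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


print(refseq_molecule_type("NM_001126113.2"))  # mRNA
print(refseq_molecule_type("AB123456"))        # unknown
```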

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
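The statement that cDNA is complementary to its mRNA template can be illustrated with a reverse complement. This is a toy sketch with a hypothetical function name; it assumes an unambiguous mRNA string over A, C, G, and U and writes the resulting DNA strand 5' to 3'.

```python
# Map each RNA base to its complementary DNA base.
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna):
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    return mrna.upper().translate(RNA_TO_CDNA)[::-1]


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```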

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous ftp, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  – http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  – http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  – http://www.expasy.ch/sprot
• UniProt
  – http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information

– description and clinical features of a disorder or of a gene involved in genetic disorders
– biochemical and other features
– cytogenetics and mapping
– molecular and population genetics
– diagnosis and clinical management
– animal models for the disorder
– allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways

– Search for a gene or a disease; when the record is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included

– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on

– conducting population-based genomic research
– assessing the role of family health history in disease risk and prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)

– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources

– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

– information on population prevalence of genetic variants

– gene-disease associations

– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases, and what is known about them, are stored in databases

– OMIM: http://www.ncbi.nlm.nih.gov/Omim

– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


FASTA Format

>gi|371502116|ref|NM_001126113.2| Homo sapiens tumor protein p53 (TP53), transcript variant 4, mRNA
GATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGACCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAAT
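A FASTA record is a header line beginning with ">" followed by one or more sequence lines. The sketch below shows one way to read such text in plain Python; `parse_fasta` is a hypothetical helper written for this illustration, and the two short records are made-up test input rather than database entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


demo = """>seq1 demo record
GATGGGATTG
GGGTTTTCCC
>seq2 another demo record
ACGT
"""
for name, seq in parse_fasta(demo):
    print(name, len(seq))
```

For real work, an established parser such as the one in Biopython is usually a better choice than hand-written code.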

Too Many Results

Search Limits

Reduced Search Results

Gene Record

RefSeq

• Database of reference sequences

– http://www.ncbi.nlm.nih.gov/RefSeq

• Curated

– Many experimentally validated
– Some partially validated via ESTs
– Some computationally predicted

• Non-redundant: one record for each gene, or each splice variant, from each organism represented

• Status codes

– Provisional (temporary)
– Reviewed
– Predicted


Accession Numbers

• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data

• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence

• RefSeq identifiers include the following formats

– Complete chromosome: NC_
– Genomic contig: NT_
– mRNA (DNA format): NM_, XM_
– Protein: NP_, XP_
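The prefix list above maps directly onto a lookup table. A minimal sketch, assuming only the six prefixes named on the slide (RefSeq defines others that are not covered here); `refseq_record_type` is a hypothetical helper name.

```python
# Prefix-to-type table, restricted to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession):
    """Classify a RefSeq accession by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


for acc in ("NM_001126113", "NP_000537", "NC_000017", "AB123456"):
    print(acc, "->", refseq_record_type(acc))
```

Anything outside the table, such as a GenBank accession without an underscore prefix, is reported as "unknown" rather than guessed.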

EST

EST

• mRNA: genomic regions actively transcribed in the cell

• cDNA (complementary DNA)

– Copy of mRNA, made using mRNA as a template
– Sequence is complementary to mRNA

• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)

– Partial cDNA sequence
– Can be 5' or 3'
– Typical size: 200-500 bp
– Represents mRNA actively transcribed in the cell
– Used to identify genes, alternative splicing, etc.
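To make "complementary to mRNA" concrete, the sketch below computes the first-strand cDNA for a short RNA string by taking the reverse complement in DNA letters. The function name and the six-base input are illustrative assumptions, not part of the lecture.

```python
# RNA base -> complementary DNA base
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the cDNA strand complementary to an mRNA, written 5' to 3'."""
    mrna = mrna.upper()
    # Walk the mRNA from its 3' end so the result reads 5' to 3'.
    return "".join(COMPLEMENT[base] for base in reversed(mrna))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```

Note that sequence databases normally store cDNA and EST entries in DNA letters, so the U-to-T conversion is part of the step shown here.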

dbEST (release 120111, Dec 1, 2011)

• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html

• Number of entries: 71,276,166

– Homo sapiens (human): 8,315,294

– Mus musculus (mouse): 4,853,562

– Arabidopsis thaliana (thale cress): 1,529,700

– Danio rerio (zebrafish): 1,488,275

– Drosophila melanogaster (fruit fly): 821,005

– Gallus gallus (chicken): 600,433

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez

• The nucleotide sequences may be searched using the BLAST server

• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function

• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro

• Protein Information Resource (PIR): http://pir.georgetown.edu

• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot

• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains) which may have functional significance

• Databases exist to store protein families, motifs, and structural domains

– CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
– Pfam: http://www.sanger.ac.uk/Software/Pfam
– PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure

• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography

– PDB: http://www.pdb.org

– SCOP: http://scop.mrc-lmb.cam.ac.uk/scop

– MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids

• Understanding the shape of a molecule helps to understand how it works

• As of January 2010, there were 62,787 searchable structures in the PDB database

• PDB provides

– Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical

• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals

• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals

• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation

– The majority of SNPs do NOT directly contribute to any phenotype

• Insertion or deletion of one or more nucleotides

– Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)

  • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)

  • Many cause genetic diseases

– Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)

• Gross chromosomal aberration

– Deletions, inversions, or translocations of large DNA fragments

– Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency

– A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)

– When a single base change occurs at <1%, it is considered to be a mutation

• A SNP is a polymorphic position where the point mutation has been fixed in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
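The frequency rule above amounts to a one-line threshold test. In the sketch below, the function name is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide only defines the cases above and below 1%.

```python
SNP_THRESHOLD = 0.01  # the 1% allele-frequency cutoff from the slide


def classify_single_base_change(allele_frequency):
    """Label a single base change as 'SNP' or 'mutation' by population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: exactly 1% falls on the 'mutation' side of the cutoff.
    return "SNP" if allele_frequency > SNP_THRESHOLD else "mutation"


print(classify_single_base_change(0.05))    # SNP
print(classify_single_base_change(0.001))   # mutation
```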

SNPs

• SNPs can occur anywhere in a genome; they are classified based on their locations

– Many SNPs lie in genomic non-coding regions

– SNPs in gene regions, including promoter regions, coding regions, intronic and exonic regions, UTRs, etc.

  • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)

– No consequence

– Affect gene transcription, quantitatively or qualitatively

– Affect gene translation, quantitatively or qualitatively

– Change protein structure and function

– Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene

– e.g., Huntington's disease, cystic fibrosis, etc.

• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors

– e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single nucleotide substitution (an A swapped for a T), which causes valine to be inserted instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
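The single-base swap can be traced at the codon level. The sketch below uses the commonly cited sickle cell change, codon GAG (glutamic acid) to GTG (valine) in the beta-globin gene; the codon details are background knowledge rather than something stated on the slide, and the two-entry table is deliberately not a full genetic code.

```python
# Only the two codons needed for this example (DNA coding-strand letters).
CODON_TABLE = {"GAG": "Glu", "GTG": "Val"}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal = "GAG"
sickle = substitute(normal, 1, "T")  # swap the middle A for a T
print(normal, CODON_TABLE[normal])   # GAG Glu
print(sickle, CODON_TABLE[sickle])   # GTG Val
```

One changed base in one codon is enough to change the amino acid, which is why this variant is a standard example of a non-synonymous substitution.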

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP

– http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)

– http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project

– http://www.hapmap.org

• Genome Variation Server (SeattleSNPs)

– http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes

– Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)

  • Roughly 10 million in the human population, or on average 1 per 300 bp

  • Fewer than half of these SNPs have been identified and stored in the database

– Microsatellite repeat variations (short tandem repeats, STRs)

  • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome

– Small-scale multi-base deletions or insertions

  • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types

• dbSNP contains two classes of records

– Submitted records

  • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"

– Computationally annotated records

  • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Codes and Meanings

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
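The IUPAC table can be held in a small dictionary and used in both directions: expanding a code into its bases, or finding the code for a set of alleles. The helper names below are hypothetical and chosen only for this sketch.

```python
# IUPAC nucleotide codes, as listed in the table above.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code):
    """Return the set of bases that an IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def code_for(alleles):
    """Return the IUPAC code matching a collection of alleles, e.g. 'AG' -> 'R'."""
    wanted = {a.upper() for a in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError("no IUPAC code for alleles: %r" % (sorted(wanted),))


print(sorted(expand("R")))   # ['A', 'G']
print(code_for("AC"))        # M
```

This is how a single letter such as R or M can mark the variant position inside a dbSNP flanking sequence.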

Different Ways to Search SNPs in dbSNP

• dbSNP web site

– Direct search of SS records, batch search, and SNP record submission; no search limits

• Entrez SNP

– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

– Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
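Query strings like the examples above follow a simple "value[FIELD]" pattern joined by Boolean operators. The sketch below only assembles such strings; it does not contact NCBI, and the helper names (`term`, `any_of`, `all_of`, `excluding`) are hypothetical. The field names are the ones shown on the slide.

```python
def term(value, field):
    """Format one search term as value[FIELD]."""
    return "%s[%s]" % (value, field)


def any_of(*terms):
    """Join terms with OR."""
    return " OR ".join(terms)


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


def excluding(query, unwanted):
    """Append a NOT clause to an existing query string."""
    return "%s NOT %s" % (query, unwanted)


chromosomes = any_of(term("1", "CHR"), term("2", "CHR"))
print(chromosomes)
print(excluding(chromosomes, term("unknown", "METHOD")))
print(all_of(term("1", "CHR"), term("frameshift", "Function_Class")))
```

Building the string in code makes it easy to generate the same search for many chromosomes or genes; the resulting text can then be pasted into the Entrez SNP search box.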

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer

• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP

• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA


Department of Health Information Management

Too Many Results

Department of Health Information Management

Search Limits

Department of Health Information Management

Reduced Search Results

Department of Health Information Management

Gene Record

Department of Health Information Management

RefSeqbull Database of reference sequences

ndash httpwwwncbinlmnihgovRefSeq

bull Curatedndash Many experimentally validated

ndash Some partially validated via ESTs

ndash Some computationally predicted

bull Non-redundant one record for each gene or each splice variant from each organism represented

bull Status Codesndash Provisional (temporary)

ndash Reviewed

ndash Predicted

Department of Health Information Management

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
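The frequency rule on this slide can be written as a small function. This is a minimal sketch in Python: the function name `classify_single_base_change` is hypothetical, and the handling of a frequency of exactly 1%, which the slide does not define, is an assumption made for illustration.

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single-base change using the 1% allele-frequency convention.

    allele_frequency is a fraction between 0 and 1 (0.01 means 1%).
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be a fraction between 0 and 1")
    if allele_frequency > 0.01:
        return "SNP"
    if allele_frequency < 0.01:
        return "mutation"
    # Assumption: exactly 1% is not covered by the slide, so it is flagged
    # rather than forced into either category.
    return "undefined"


if __name__ == "__main__":
    for freq in (0.25, 0.004):
        print(f"{freq:.3f} -> {classify_single_base_change(freq)}")
```

As the last bullet notes, real SNP databases do not enforce this cutoff, so a label like this is only a reading aid, not a property of a database record.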

SNPs
• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in non-coding genomic regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single-nucleotide change, an A swapped for a T, which causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

Image source: httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure legend: 1. Normal red blood cells; 2. Sickled red blood cells
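A single-base substitution can be traced through to the amino acid it changes. The sketch below assumes the commonly cited sickle cell substitution in the beta-globin gene, where the codon GAG becomes GTG. The codon table is deliberately limited to the codons needed here, and the helper name `substitute_base` is hypothetical.

```python
# Only the codons needed for this example; a full codon table has 64 entries.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GTG": "Val",  # valine
    "CAG": "Gln",  # glutamine, listed only for comparison
}


def substitute_base(codon: str, index: int, new_base: str) -> str:
    """Return a copy of a three-base codon with one base replaced (0-based index)."""
    if len(codon) != 3 or not 0 <= index < 3 or len(new_base) != 1:
        raise ValueError("expected a 3-base codon, an index 0-2, and one new base")
    return codon[:index] + new_base + codon[index + 1:]


# Assumption: GAG is the normal codon at the affected beta-globin position.
normal_codon = "GAG"
sickle_codon = substitute_base(normal_codon, 1, "T")  # middle A replaced by T

if __name__ == "__main__":
    print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
    print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```

The substitution is non-synonymous: one changed base gives a different amino acid, which is the kind of change the earlier slide on SNP effects lists under altered protein structure and function.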

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number likely falls between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the originally submitted data; Reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Codes and Meanings

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
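The code table maps directly onto a lookup structure, which is how the single ambiguity letter at a variant position (for example the R in a dbSNP FASTA record) can be turned back into its possible bases. The following is a minimal sketch; `IUPAC_CODES` and `expand_ambiguous` are illustrative names, not part of any NCBI tool.

```python
from itertools import product

# IUPAC nucleotide codes as listed in the table: code -> bases it can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand_ambiguous(sequence: str) -> list[str]:
    """List every concrete sequence that an IUPAC-coded sequence can represent."""
    choices = [IUPAC_CODES[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    # "CCAR" stands for the two alleles CCAA and CCAG.
    print(expand_ambiguous("CCAR"))
```

The number of expansions multiplies with every ambiguous position, so this approach is only practical for short stretches around a variant.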

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
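Queries like these can also be sent programmatically through NCBI's E-utilities `esearch` service. The sketch below only builds the request URL and makes no network call. It assumes the field tags shown on the slide (`[CHR]`, `[Function_Class]`) are still accepted by the `snp` database, which may no longer be the case; `build_snp_search_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL for the dbSNP ("snp") database without sending it."""
    query = urlencode({"db": "snp", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


# Search term taken from the slide; whether it still returns results is not checked here.
example_url = build_snp_search_url("1[CHR] AND (frameshift[Function_Class])")

# Decoding the URL again shows that brackets and spaces survive the encoding.
decoded_term = parse_qs(urlparse(example_url).query)["term"][0]

if __name__ == "__main__":
    print(example_url)
    print(decoded_term)
```

Building the URL separately from sending it makes it easy to inspect exactly what will be requested before running a batch of searches.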

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
• len/totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by /
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• Green text: used for assay sequence (observed by the submitter)
• Black text: used for flank sequence (extracted from sequence databases)
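The field layout described above is regular enough to parse with a few lines of code. The sketch below splits a header on `|`, collects the `key=value` fields, and uses the allelePos rule (5' flank length plus 1) to recover both flank lengths. The function names are hypothetical, and the `/` inside the `alleles` value of the example header is an assumption about the original formatting.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its accession and key=value fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = [field for field in header[1:].strip().split("|") if field]
    if len(fields) < 3:
        raise ValueError("expected at least gnl, database name, and accession")
    record = {"accession": fields[2], "unnamed": []}
    for field in fields[3:]:
        key, separator, value = field.partition("=")
        if separator:
            record[key] = value.strip("\"'")
        else:
            # Fields without '=' (e.g. submitter handle) are kept in order.
            record["unnamed"].append(field)
    return record


def flank_lengths(record: dict) -> tuple[int, int]:
    """Return (5' flank length, 3' flank length) from allelePos and total length."""
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    allele_pos = int(record["allelePos"])
    # The variation itself counts as one base, so it is excluded from both flanks.
    return allele_pos - 1, total - allele_pos


# Example header adapted from the rs799916 record shown on the earlier slide.
rs_record = parse_dbsnp_fasta_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)

if __name__ == "__main__":
    print(rs_record["accession"], flank_lengths(rs_record))
```

For the rs799916 header, allelePos=301 and totalLen=601 give 300 bases on each side of the variant, which matches the thirty 10-base blocks shown before the ambiguity code in that record.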

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge about the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; once the record is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Search Limits

Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
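Because the record type is encoded in the first three characters, it can be read off an accession with a simple lookup. The sketch below covers only the prefixes listed on this slide; the function name and the placeholder accession numbers are hypothetical and do not refer to real records.

```python
# Prefix -> record type, limited to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_record_type(accession: str) -> str:
    """Return the record type implied by a RefSeq accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise each branch.
    for accession in ("NM_000000", "XP_000000", "AB123456"):
        print(accession, "->", refseq_record_type(accession))
```

Accessions without one of these prefixes, such as primary GenBank submissions, fall through to "unknown" rather than being guessed.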

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
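The statement that cDNA is complementary to its mRNA template can be made concrete with a short function. This is a minimal sketch assuming an unambiguous mRNA string over A, C, G, and U; `first_strand_cdna` is a hypothetical name, and the result is written 5' to 3', so the complemented bases are reversed.

```python
# Watson-Crick pairing from an RNA template base to the DNA base copied opposite it.
COMPLEMENT_OF_RNA_BASE = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the cDNA strand complementary to an mRNA, written 5' to 3'."""
    complement = [COMPLEMENT_OF_RNA_BASE[base] for base in mrna.upper()]
    # The two strands are antiparallel, so reverse to read the cDNA 5' to 3'.
    return "".join(reversed(complement))


if __name__ == "__main__":
    print(first_strand_cdna("AUGGCC"))
```

An EST would then be a partial read, typically a few hundred bases, taken from one end of such a cDNA.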

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) that may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on:
  • Conducting population-based genomic research
  • Assessing the role of family health history in disease risk and prevention
  • Supporting a systematic process for evaluating genetic tests
  • Translating genomics into public health research and programs
  • Strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology:
  – Information on the population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in:
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Reduced Search Results

Gene Record

RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record of molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats:
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
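The prefix list above can be turned into a small lookup. This is a minimal sketch: the function name and the example accessions are made-up placeholders, and only the six prefixes named on the slide are covered.

```python
# Prefix-to-category map copied from the slide; other RefSeq prefixes are not handled.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_category(accession):
    """Return the record category for a RefSeq-style accession.

    Returns None when the three-character prefix is not in the table.
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions, used only to exercise the lookup.
for acc in ("NM_000001.1", "NP_000001.1", "AB000001"):
    print(acc, "->", refseq_category(acc))
```

An accession without one of these prefixes (the last placeholder) returns None rather than raising, so the caller decides how to treat non-RefSeq records.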

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify:
    • Genes, alternative splicing, etc.
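To make "complementary to the mRNA" concrete, the sketch below derives the complementary DNA strand from a short mRNA string. The function name and the six-base input are illustrative only, and real cDNA synthesis involves details (priming, partial copies) that are not modeled here.

```python
# Base pairing used when copying RNA into DNA: A-T, U-A, G-C, C-G.
PAIRING = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the DNA strand complementary to an mRNA, written 5'->3'.

    The mRNA is read in reverse so that the result is also in 5'->3' order.
    """
    return "".join(PAIRING[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```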

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) that may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides:
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, and annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensic and parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of > 1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at < 1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
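The frequency rule on this slide reduces to a single comparison. A minimal sketch follows; the function name is illustrative, and because the slide does not say how to label a frequency of exactly 1%, this version arbitrarily treats it as a mutation.

```python
SNP_THRESHOLD = 0.01  # 1% allele frequency, the cutoff given on the slide


def classify_single_base_change(allele_frequency):
    """Label a single-base change as 'SNP' or 'mutation' by population frequency.

    Assumption: exactly 1% is labeled 'mutation'; the slide leaves this case open.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency > SNP_THRESHOLD else "mutation"


print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```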

SNPs
• SNPs can occur anywhere in a genome; they are classified by their locations
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription, quantitatively or qualitatively
  – Affect gene translation, quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single nucleotide change (an A replaced by a T), which puts valine in place of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells
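The sickle cell change can be traced at the codon level. The sketch below assumes the commonly described form of the mutation, in which the codon GAG (glutamic acid) becomes GTG (valine); the slide itself states only the A-to-T swap. The two-entry codon table is deliberately partial.

```python
# Only the two codons needed for this example; a full codon table has 64 entries.
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
sickle_codon = substitute(normal_codon, 1, "T")  # the middle A becomes T

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```

Because the amino acid changes, this is a non-synonymous substitution in the sense used on the earlier slide about the effect of SNPs.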

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes:
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
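The two SNP figures on this slide (about 10 million in total, about 1 per 300 bp) agree with each other if the human genome is taken to be roughly 3 billion base pairs. That genome size is a rounded assumption added here for the arithmetic; it does not appear on the slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: rounded human genome size, not from the slide
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp

expected_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"{expected_snps:,} SNPs expected")  # 10,000,000 SNPs expected
```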

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
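The IUPAC table maps directly onto a dictionary, which makes it easy to go from a code to its bases or from a set of alleles back to a code. A minimal sketch; the function names are illustrative.

```python
# IUPAC nucleotide codes from the table above; each value lists the bases in alphabetical order.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code):
    """Return the list of bases that an IUPAC code stands for."""
    return list(IUPAC[code.upper()])


def code_for(bases):
    """Return the IUPAC code for a collection of bases, e.g. alleles A and G give R."""
    wanted = "".join(sorted(set(bases.upper())))
    for code, members in IUPAC.items():
        if members == wanted:
            return code
    raise ValueError(f"no IUPAC code for bases {bases!r}")


print(expand("R"))     # ['A', 'G']
print(code_for("CA"))  # M
```

This is how the variation in a dbSNP FASTA record fits in a single character: alleles A and G appear as R, and alleles A and C appear as M.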

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Allows direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
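Queries like these can also be submitted programmatically through the NCBI E-utilities ESearch service. The sketch below only builds the request URL and sends nothing. The function name and the retmax default are illustrative, and it is an assumption that the field tags from this slide are still accepted by the current SNP database.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, retmax=20):
    """Build an ESearch URL for the SNP database without sending a request."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_snp_search_url("1[CHR] AND (frameshift[Function_Class])")
print(url)

# Decoding the query string recovers the original search term.
print(parse_qs(urlparse(url).query)["term"][0])
```

urlencode handles the brackets, parentheses, and spaces in the term, which would otherwise produce an invalid URL.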

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles='A/C'|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

gnl: Internal use.
dbSNP: Database name.
ss or rs number: dbSNP accession for the SNP. An ss number is the accession of a submitted SNP; an rs number is the accession of a refSNP cluster of one or more submitted SNPs.
allelePos: Position of the variation allele (1-based) in the FASTA sequence. It is always the length of the 5' flank plus 1.
len / totalLen: Total number of bases in the FASTA sequence; the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation.
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID.
taxid: NCBI taxonomy ID.
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria.
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details.
alleles: Lists the alleles of the SNP, separated by "/".
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements.
ATCG (green): Green is used for assay sequence (observed by the submitter).
ATCG (black): Black is used for flank sequence (extracted from sequence databases).
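The field descriptions above are enough to parse a header line into a dictionary and to recover the flank lengths from allelePos and totalLen. This is a minimal sketch, not an official parser. The function name is illustrative, the quoting of the alleles value and its "/" separator are assumptions about the original file format, and fields without "=" (the submitter handle and SNP ID on ss records) are skipped.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dict of its fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"gnl": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # key=value fields; bare fields such as the handle are skipped
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
    for key in ("allelePos", "len", "totalLen"):
        if key in record:
            record[key] = int(record[key])
    if "alleles" in record:
        record["alleles"] = record["alleles"].split("/")
    return record


# Header of the rs799916 record shown earlier in the lecture.
header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
record = parse_dbsnp_header(header)

flank5 = record["allelePos"] - 1                   # allelePos is the 5' length plus 1
flank3 = record["totalLen"] - record["allelePos"]  # the variation itself counts as 1 base
print(record["accession"], record["alleles"], flank5, flank3)
```

For this record both flanks come out at 300 bases, which matches the thirty 10-base groups on either side of the M code in the sequence.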


Department of Health Information Management

Gene Record

Department of Health Information Management

RefSeqbull Database of reference sequences

ndash httpwwwncbinlmnihgovRefSeq

bull Curatedndash Many experimentally validated

ndash Some partially validated via ESTs

ndash Some computationally predicted

bull Non-redundant one record for each gene or each splice variant from each organism represented

bull Status Codesndash Provisional (temporary)

ndash Reviewed

ndash Predicted

Department of Health Information Management

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change: an A is swapped for a T, causing the amino acid inserted into hemoglobin to be valine instead of glutamic acid

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: 1. Normal red blood cells; 2. Sickled red blood cells
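A minimal sketch of how one base change alters the encoded amino acid. The slide does not name the codon; the sketch assumes the commonly cited GAG to GTG change in the beta-globin gene, and the helper name and two-entry codon table are illustrative only.

```python
# Only the two codons needed for this illustration (standard genetic code).
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GTG": "valine",
}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"  # assumption: the affected beta-globin codon
sickle_codon = substitute_base(normal_codon, 1, "T")  # the middle A becomes T

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```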

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
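The "roughly 10 million" and "1 per 300 bp" figures are consistent with each other, as a quick calculation shows. The genome length used here is an assumption (an approximate human genome size of 3 billion bp), not a number taken from the slides.

```python
GENOME_LENGTH_BP = 3_000_000_000  # assumption: approximate human genome size
BP_PER_SNP = 300                  # from the slide: on average 1 SNP per 300 bp

estimated_snps = GENOME_LENGTH_BP // BP_PER_SNP
print(f"{estimated_snps:,} SNPs")  # prints "10,000,000 SNPs"
```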

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with rs
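The two record classes can be told apart from the accession prefix alone. The following is a minimal sketch; the function name and the assumption that an accession is simply "ss" or "rs" followed by digits are illustrative, not an official dbSNP validation rule.

```python
import re

# Assumption: an accession is the prefix "ss" or "rs" followed by digits.
ACCESSION_PATTERN = re.compile(r"^(ss|rs)(\d+)$")


def classify_dbsnp_accession(accession: str) -> str:
    """Return the dbSNP record class implied by the accession prefix."""
    match = ACCESSION_PATTERN.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    if match.group(1) == "ss":
        return "submitted SNP"
    return "reference SNP cluster"


print(classify_dbsnp_accession("ss5586300"))
print(classify_dbsnp_accession("rs799916"))
```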

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
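The code table above maps naturally onto a lookup dictionary. The sketch below is illustrative (the function names are hypothetical); it expands an ambiguity code into its possible bases and finds the code for a given set of alleles.

```python
# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code: str) -> set:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles) -> str:
    """Return the IUPAC code that represents exactly the given bases."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(sorted(possible_bases("R")))   # the two bases behind R
print(code_for_alleles(["A", "C"]))  # the code for an A/C variation
```

In the two FASTA records shown in these slides, the code at the variation position (R for alleles A and G, M for alleles A and C) follows this mapping.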

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frame-shift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
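Queries like the ones above can also be assembled programmatically. The sketch below only builds strings and makes no network request. It assumes the NCBI E-utilities ESearch endpoint, which the slides do not mention, and the field tags are copied from the examples above and may differ in current dbSNP releases. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not part of the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Format one search term in Entrez 'value[Field]' style."""
    return f"{value}[{field}]"


def build_snp_search_url(term: str) -> str:
    """Return an ESearch URL for the SNP database with a URL-encoded term."""
    return ESEARCH_URL + "?" + urlencode({"db": "snp", "term": term})


term = field_term("1", "CHR") + " AND " + field_term("frameshift", "Function_Class")
print(term)
print(build_snp_search_url(term))
```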

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.

Header field               Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len/totalLen               Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
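The header layout described above is simple enough to parse by splitting on |. The following is a rough sketch, not an NCBI tool. The function name is hypothetical, the example header is the rs799916 header from the SNP Location slide, and the "/" between alleles is assumed.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into its accession and key=value fields."""
    fields = header.lstrip(">").strip().split("|")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # skips empty or positional fields such as a handle
            key, value = field.split("=", 1)
            record[key] = value.strip("\"'")
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_fasta_header(header)

allele_pos = int(record["allelePos"])
total_len = int(record["totalLen"])
five_prime_len = allele_pos - 1           # allelePos is the 5' length plus 1
three_prime_len = total_len - allele_pos  # the variation itself counts as 1 base

print(record["accession"], record["alleles"].split("/"))
print(five_prime_len, three_prime_len)
```

For this record both flanks come out at 300 bases, which matches the 30 ten-base blocks on each side of the M in the sequence.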

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


RefSeq
• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq
• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted
• Non-redundant: one record for each gene, or each splice variant, from each organism represented
• Status codes
  – Provisional (temporary)
  – Reviewed
  – Predicted

Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
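The prefix list above can be turned into a small lookup. This sketch covers only the six prefixes named on the slide; the function name and the sample accessions are illustrative.

```python
# Only the prefixes listed on the slide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.strip()[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"prefix {prefix!r} is not in the slide's list")
    return REFSEQ_PREFIXES[prefix]


# Illustrative accessions, used only to exercise the lookup.
print(refseq_molecule_type("NM_007294"))
print(refseq_molecule_type("NP_009225"))
```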

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
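To make "complementary to mRNA" concrete, the sketch below derives a first-strand cDNA sequence from an mRNA sequence by complementing each base and reversing the result so it reads 5' to 3'. The function name and the six-base input are made up for illustration.

```python
# Base pairing for an RNA template copied into DNA: A-T, U-A, G-C, C-G.
RNA_TO_CDNA = str.maketrans("AUGC", "TACG")


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') complementary to an mRNA (5'->3')."""
    complement = mrna.upper().translate(RNA_TO_CDNA)
    return complement[::-1]  # reverse so the result also reads 5' to 3'


print(first_strand_cdna("AUGGCC"))  # prints "GGCCAT"
```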

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources


Accession Numbers
• DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record relevant to molecular data
• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence
• RefSeq identifiers include the following formats
  – Complete chromosome: NC_
  – Genomic contig: NT_
  – mRNA (DNA format): NM_, XM_
  – Protein: NP_, XP_
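The prefix list above can be read as a small lookup table. The following is a minimal sketch, not an NCBI tool: the function name and the example accession strings are placeholders invented for illustration, and only the prefix-to-type mapping comes from the slide.

```python
# Prefix-to-type mapping taken from the RefSeq formats listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete chromosome",
    "NT_": "genomic contig",
    "NM_": "mRNA",
    "XM_": "mRNA",
    "NP_": "protein",
    "XP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "unknown")


# Placeholder accessions (hypothetical, for illustration only).
for example in ("NC_000000.0", "NM_000000.0", "XP_000000.0", "AB000000"):
    print(example, "->", refseq_molecule_type(example))
```

Anything that does not start with one of the listed prefixes is reported as "unknown" rather than guessed.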

EST

EST
• mRNA: genomic regions actively transcribed in the cell
• cDNA (complementary DNA)
  – Copy of mRNA, made using mRNA as a template
  – Sequence is complementary to mRNA
• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
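To make "sequence is complementary to mRNA" concrete, here is a small sketch that derives the complementary DNA strand from an mRNA string. The function name and the six-base example are illustrative assumptions; the sketch models only base pairing, not the laboratory procedure.

```python
# Watson-Crick pairing from an RNA template base to the DNA base it pairs with.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the DNA strand complementary to an mRNA, written 5' to 3'."""
    template = mrna.strip().upper()
    # Reverse the template so the result reads 5'->3', then complement each base.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(template))


print(first_strand_cdna("AUGGCC"))  # GGCCAT
```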

dbEST (release 120111, Dec 1, 2011)
• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data
• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins is available from techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensic, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
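The frequency-based terminology above amounts to a one-line rule. The sketch below is illustrative only: the function name is invented, and because the slide does not say how a frequency of exactly 1% is labelled, counting it as a SNP is an assumption made here.

```python
def classify_single_base_change(allele_frequency: float, threshold: float = 0.01) -> str:
    """Label a single-base change as 'SNP' or 'mutation' by population allele frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: a frequency exactly at the threshold is counted as a SNP.
    return "SNP" if allele_frequency >= threshold else "mutation"


print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```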

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affects gene transcription quantitatively or qualitatively
  – Affects gene translation quantitatively or qualitatively
  – Changes protein structure and function
  – Changes gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g. Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g. cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single base change: swapping an A for a T causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells 2. Sickled red blood cells
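The sickle cell change is a convenient worked example of a non-synonymous substitution. The sketch below assumes the commonly cited codon change GAG to GTG, which is general background knowledge and not stated on the slide, and uses a deliberately tiny codon table covering only the codons needed here. The function names are illustrative.

```python
# Minimal excerpt of the standard genetic code: only the codons used below.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid
    "GTG": "Val",  # valine
}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Compare the encoded amino acids of two codons."""
    same = CODON_TO_AMINO_ACID[ref_codon] == CODON_TO_AMINO_ACID[alt_codon]
    return "synonymous" if same else "non-synonymous"


sickle_codon = substitute_base("GAG", 1, "T")       # A -> T in the middle base
print(sickle_codon)                                 # GTG
print(classify_codon_change("GAG", sickle_codon))   # non-synonymous (Glu -> Val)
print(classify_codon_change("GAG", "GAA"))          # synonymous (Glu -> Glu)
```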

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
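A quick arithmetic check ties the round numbers on these slides together. The sketch below only multiplies the quoted figures (10 million SNPs, 1 per 300 bp, 0.1% difference, ~90% SNPs); the variable names are illustrative, and the results are order-of-magnitude estimates rather than measured values.

```python
# Figures quoted on the slides (round numbers, not measurements).
snp_count = 10_000_000           # SNPs in the human population
bp_per_snp = 300                 # on average one SNP per 300 bp

# Genome length implied by those two figures.
implied_genome_bp = snp_count * bp_per_snp
print(implied_genome_bp)         # 3000000000

# 0.1% difference between two unrelated individuals, of which ~90% are SNPs.
differing_positions = implied_genome_bp // 1000
snp_differences = differing_positions * 90 // 100
print(differing_positions)       # 3000000
print(snp_differences)           # 2700000
```

The implied length of about 3 billion bp agrees with the usual round figure for the human genome, so the two SNP numbers on the slide are consistent with each other.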

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with rs
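Because the two record classes are distinguished by the ss or rs prefix, an identifier can be sorted into its class with a simple pattern match. This is a minimal sketch with an invented function name; the two example identifiers are the ones that appear in the FASTA headers on these slides.

```python
import re

# An identifier is a two-letter prefix (ss or rs) followed by digits.
_DBSNP_ID = re.compile(r"^(ss|rs)(\d+)$")


def dbsnp_record_class(identifier: str) -> tuple[str, int]:
    """Return (record class, numeric part) for a dbSNP identifier."""
    match = _DBSNP_ID.match(identifier.strip().lower())
    if match is None:
        raise ValueError(f"not a dbSNP ss/rs identifier: {identifier!r}")
    prefix, number = match.groups()
    record_class = "submitted record" if prefix == "ss" else "reference SNP cluster"
    return record_class, int(number)


print(dbsnp_record_class("ss5586300"))  # ('submitted record', 5586300)
print(dbsnp_record_class("rs799916"))   # ('reference SNP cluster', 799916)
```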

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
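The code table translates directly into a dictionary, which makes it easy to see why the variant position in an A/G record is written as R. This is a small sketch with illustrative function names; the mapping itself is the table above.

```python
# IUPAC nucleotide codes mapped to the bases they stand for (from the table above).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_iupac(code: str) -> set[str]:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def iupac_code_for(alleles) -> str:
    """Return the single IUPAC code that represents exactly the given bases."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


print(sorted(expand_iupac("R")))   # ['A', 'G']
print(iupac_code_for(["A", "G"]))  # R
print(iupac_code_for(["A", "C"]))  # M
```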

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e. BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
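Queries like the ones in the table can also be sent programmatically. The sketch below only builds a request URL for the NCBI E-utilities esearch endpoint, which is not mentioned on the slide and is brought in here as an assumption. No request is sent, and whether the field tags from the slide are still accepted by the current dbSNP index is not verified. The function name is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# NCBI E-utilities search endpoint (assumption: not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL for the SNP database without sending the request."""
    query = urlencode({"db": "snp", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


# Query string copied from the slide's second example.
example_term = "1[CHR] AND (frameshift[Function_Class])"
example_url = build_snp_search_url(example_term)
print(example_url)

# Round trip: decoding the URL gives back the original query term.
print(parse_qs(urlparse(example_url).query)["term"][0])
```

Building the URL with urlencode keeps the brackets and spaces in the query safely escaped.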

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen             Total number of bases in the FASTA sequence: the sum of the 5' flank length, the 3' flank length, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs: the two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
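The field descriptions above are enough to split a header into named values and to recover the flank lengths from allelePos and the total length. The following is a sketch under stated assumptions: the function names are invented, and the alleles value is written with a / separator as the table describes, which may differ in quoting from the files dbSNP actually distributes.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = [part for part in header[1:].strip().split("|") if part]
    if len(parts) < 3:
        raise ValueError("expected at least gnl, database name, and accession")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        else:
            # Positional fields such as submitter handle and submitter SNP ID.
            record.setdefault("positional", []).append(part)
    return record


def flank_lengths(record: dict) -> tuple[int, int]:
    """Return (5' flank length, 3' flank length) from allelePos and len/totalLen."""
    total_text = record.get("totalLen", record.get("len"))
    if total_text is None:
        raise KeyError("header has neither totalLen nor len")
    allele_pos = int(record["allelePos"])
    # allelePos is the 5' flank length plus 1; the variation itself counts as 1 base.
    return allele_pos - 1, int(total_text) - allele_pos


rs_header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
             "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rs_record = parse_dbsnp_fasta_header(rs_header)
print(rs_record["accession"], rs_record["alleles"])  # rs799916 A/C
print(flank_lengths(rs_record))                      # (300, 300)
```

Applied to the submitted record shown earlier (allelePos=214, len=475), the same calculation gives flanks of 213 and 261 bases.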

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder or a gene involved in genetic disorders, biochemical and other features, cytogenetics and mapping, molecular and population genetics, diagnosis and clinical management, animal models for the disorder, allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
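For the "save their sequences in FASTA format" step, the sketch below shows one way to turn an identifier and a raw sequence string into FASTA text. The function name, the 60-character line width, and the example identifier and sequence are all illustrative choices, not requirements of the assignment.

```python
def format_fasta(identifier: str, sequence: str, description: str = "", width: int = 60) -> str:
    """Return one FASTA record: a '>' header line followed by wrapped sequence lines."""
    # Drop spaces and line breaks that often come along when copying a sequence.
    cleaned = "".join(sequence.split()).upper()
    header = f">{identifier} {description}".rstrip()
    lines = [cleaned[start:start + width] for start in range(0, len(cleaned), width)]
    return "\n".join([header, *lines]) + "\n"


# Hypothetical identifier and sequence, for illustration only.
record_text = format_fasta("example_gene", "acgt acgt\nac", width=4)
print(record_text, end="")
```

The returned string can be written to a file with a plain `open(path, "w")` call if the record needs to be saved.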

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Page 26

Accession Numbersbull DNA sequences and other molecular data are

tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data

bull RefSeq provides an expertly curated accession number that corresponds to the most stable agreed-upon ldquoreferencerdquo version of a sequence

bull RefSeq identifiers include the following formatsndash Complete chromosome NC_

ndash Genomic contig NT_

ndash mRNA (DNA format) NM_ XM_

ndash Protein NP_ XP_

EST

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
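The code table maps directly onto a lookup structure. The following is a minimal sketch of expanding an ambiguity code into its possible bases and of finding the code for a set of alleles; the function names `expand` and `encode` are illustrative and not part of any dbSNP tool.

```python
# IUPAC nucleotide ambiguity codes, as listed in the table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: a set of bases -> the single code that represents it.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def expand(code):
    """Return the sorted list of bases an IUPAC code may stand for."""
    return sorted(IUPAC[code.upper()])


def encode(bases):
    """Return the IUPAC code for a collection of bases such as 'AG'."""
    return BASES_TO_CODE[frozenset(b.upper() for b in bases)]


print(expand("R"))   # bases covered by R
print(encode("CA"))  # code for an A/C variation
```

For example, an A/G variation is written as R in a dbSNP FASTA record, and an A/C variation as M.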

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search-limit options

• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search-limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by any method except "unknown"
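Queries like these can also be assembled programmatically. The sketch below only builds the query string and a URL for NCBI's E-utilities `esearch` endpoint; it does not contact the server. The field tags are copied from the slide, and whether the current service still accepts them is an assumption. The helper names `term` and `esearch_url` are illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (no request is sent in this sketch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value, field):
    """Format one Entrez search term, e.g. term('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{field}]"


def esearch_url(query, db="snp", retmax=20):
    """Return an esearch URL for `query`; special characters are URL-encoded."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# Chromosome 1 SNPs with function class frameshift (field tags from the slide).
query = f"{term('1', 'CHR')} AND ({term('frameshift', 'Function_Class')})"
print(query)
print(esearch_url(query))
```

Keeping term formatting separate from URL construction makes it easy to combine terms with AND, OR, and NOT before encoding.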

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer

• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP

• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
• len / totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP id
• taxid: NCBI taxonomy id
• mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
• ATCG (green): green is used for assay sequence (observed by the submitter)
• ATCG (black): black is used for flank sequence (extracted from sequence databases)
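The field layout above lends itself to a small parser. This is a minimal sketch, not NCBI's own code: it assumes the allele separator is "/" and that any value quotes can simply be stripped, and the function name `parse_dbsnp_header` is illustrative. It also derives the flank lengths from allelePos and totalLen.

```python
def parse_dbsnp_header(line):
    """Split a dbSNP FASTA header into a dict of its fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = [p for p in line[1:].strip().split("|") if p]
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for field in parts[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip('"')
        else:
            # e.g. submitter handle and submitter SNP id on ss records
            record.setdefault("extra", []).append(field)
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1; the variation itself counts as 1.
five_prime = int(rec["allelePos"]) - 1
three_prime = int(rec["totalLen"]) - int(rec["allelePos"])
print(rec["accession"], rec["alleles"].split("/"), five_prime, three_prime)
```

For rs799916 this gives 300 bases of flank on each side of the variation, matching the 601-base total in the header.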

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes: Disease-Centric Databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


EST

• mRNA: represents genomic regions actively transcribed in the cell

• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA

• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5' or 3'
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify
    • Genes, alternative splicing, etc.
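The complementarity between mRNA and cDNA can be made concrete with a few lines of code. This is a minimal sketch of deriving a first-strand cDNA sequence from an mRNA sequence written 5' to 3'; the toy input sequence and the function name are illustrative only.

```python
# Base pairing used when DNA is copied from an RNA template.
RNA_TO_DNA_PAIR = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA given 5'->3'.

    The cDNA is antiparallel to its template, so the mRNA is read in
    reverse and each base is replaced by its DNA partner.
    """
    return "".join(RNA_TO_DNA_PAIR[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))  # toy mRNA fragment
```

An EST would then be a single sequencing read, typically a few hundred bases, taken from one end of such a cDNA.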

dbEST (release 120111, Dec 1, 2011)

• http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html

• Number of entries: 71,276,166
  – Homo sapiens (human): 8,315,294
  – Mus musculus (mouse): 4,853,562
  – Arabidopsis thaliana (thale cress): 1,529,700
  – Danio rerio (zebrafish): 1,488,275
  – Drosophila melanogaster (fruit fly): 821,005
  – Gallus gallus (chicken): 600,433

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez

• The nucleotide sequences may be searched using the BLAST server

• EST sequences are also available as a flat file in FASTA format by anonymous FTP, in the repository/dbEST directory at ftp.ncbi.nih.gov
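A FASTA flat file is simple enough to read without special tools. The following is a minimal sketch of a FASTA reader that works on any iterable of lines, such as an open file; the two records used to demonstrate it are made up and are not real dbEST entries.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Drop spaces so block-formatted sequences are joined cleanly.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records for illustration only.
toy = [">est1 toy record", "ACGT", "TTGA", "", ">est2", "GGC"]
records = list(read_fasta(toy))
for name, seq in records:
    print(name, len(seq))
```

Because the function is a generator, it can stream a very large flat file one record at a time instead of loading all entries into memory.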

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function

• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro

• Protein Information Resource (PIR): http://pir.georgetown.edu

• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot

• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains), which may have functional significance

• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure

• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids

• Understanding the shape of a molecule helps to understand how it works

• As of January 2010, there were 62,787 searchable structures in the PDB database

• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical

• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)
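These percentages translate into concrete counts. The sketch below assumes a haploid human genome of roughly 3 billion base pairs, a figure that is not given on the slide, and uses the slide's 99.9% identity and ~90% SNP share; the function name is illustrative.

```python
GENOME_BP = 3_000_000_000  # assumption: approximate haploid human genome size


def expected_differences(genome_bp, identity=0.999, snp_fraction=0.9):
    """Return (differing positions, positions attributable to SNPs)."""
    differing = round(genome_bp * (1 - identity))
    snps = round(differing * snp_fraction)
    return differing, snps


differing, snps = expected_differences(GENOME_BP)
print(f"{differing:,} differing positions, about {snps:,} of them SNPs")
```

Under these assumptions, two unrelated individuals differ at about 3 million positions, of which roughly 2.7 million are single-base variations.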

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals

• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals

• Genetic variations reveal clues to ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype

• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)

• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at a frequency of <1%, it is considered a mutation

• A SNP is a polymorphic position where the point mutation has been fixed in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
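The frequency-based terminology amounts to a simple threshold rule. This minimal sketch applies the 1% cutoff to allele counts; the slide does not say how a frequency of exactly 1% should be labeled, so treating it as a mutation here is an assumption, and the function name and counts are illustrative.

```python
def classify_variant(variant_allele_count, total_alleles, threshold=0.01):
    """Label a single-base change as 'SNP' or 'mutation' by allele frequency.

    Frequencies above the threshold are called SNPs; everything else,
    including exactly 1% (an assumption), is called a mutation.
    """
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    frequency = variant_allele_count / total_alleles
    return "SNP" if frequency > threshold else "mutation"


# Hypothetical counts: 200 chromosomes sampled from 100 diploid individuals.
print(classify_variant(10, 200))    # 5% of alleles
print(classify_variant(1, 2000))    # 0.05% of alleles
```

Note that the label depends on the population sampled: the same base change can be a SNP in one population and rare enough to count as a mutation in another.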

SNPs

• SNPs can occur anywhere in a genome; they are classified by location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by where it occurs (gene or non-gene region) and by the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affects gene transcription, quantitatively or qualitatively
  – Affects gene translation, quantitatively or qualitatively
  – Changes protein structure and function
  – Changes gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

ESTbull mRNA Genomic regions actively transcribed in

cellbull cDNA (complementary DNA)

ndash Copy of mRNA using mRNA as a templatendash Sequence is complementary to mRNA

bull EST Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)ndash Partial cDNA sequencendash Can be 5rsquo or 3rsquondash Typical size 200 - 500 bpndash Represents mRNA actively transcribed in cellndash Use to identify

bull Genes Alternative splicing etc

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often the result of localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
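The 1% convention can be written as a small rule. The sketch below is an illustration only: the function names and the allele counts are hypothetical, and because the slide does not say how to treat a frequency of exactly 1%, that case is reported as "borderline" rather than guessed.

```python
def minor_allele_frequency(allele_counts):
    """Return the frequency of the second most common allele at one position."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles were observed")
    ranked = sorted(allele_counts.values(), reverse=True)
    return ranked[1] / total if len(ranked) > 1 else 0.0


def classify_single_base_change(frequency, threshold=0.01):
    """Apply the allele-frequency convention: >1% is a SNP, <1% is a mutation."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if frequency > threshold:
        return "SNP"
    if frequency < threshold:
        return "mutation"
    return "borderline"  # exactly 1% is not defined on the slide


# Hypothetical counts of each allele seen among 200 sampled chromosomes.
counts = {"A": 180, "G": 20}
maf = minor_allele_frequency(counts)
print(maf, classify_single_base_change(maf))
```

The label depends on the population that was sampled, so the same base change could be a SNP in one population and a rare mutation in another.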

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affects gene transcription quantitatively or qualitatively
  – Affects gene translation quantitatively or qualitatively
  – Changes protein structure and function
  – Changes gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change (an A swapped for a T) that causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells; 2. Sickled red blood cells
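The effect of the single A-to-T change can be checked by translating the affected codon. The sketch below builds the standard genetic code and translates a three-codon DNA fragment; the fragment (CCT GAG GAG) is an illustrative stretch around the affected codon rather than a full gene sequence, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA fragment, in frame, into one-letter amino acids."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def differing_positions(seq_a, seq_b):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


normal = "CCTGAGGAG"  # Pro-Glu-Glu
sickle = "CCTGTGGAG"  # the middle codon changes from GAG to GTG

print(differing_positions(normal, sickle))
print(translate(normal), translate(sickle))
```

In one-letter codes, E is glutamic acid and V is valine, so the single base substitution changes exactly one amino acid in the fragment.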

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) are over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
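The two SNP figures quoted above (roughly 10 million, and 1 per 300 bp) are consistent with each other if the human genome is taken to be about 3 billion base pairs. That genome size is an assumption added here for the arithmetic; it is not stated on the slide.

```python
# Assumed approximate size of the haploid human genome, in base pairs.
GENOME_SIZE_BP = 3_000_000_000

# Average spacing between SNPs quoted on the slide, in base pairs.
SNP_SPACING_BP = 300


def expected_variant_count(genome_size_bp, spacing_bp):
    """Estimate how many variants fit in a genome at a given average spacing."""
    if spacing_bp <= 0:
        raise ValueError("spacing must be positive")
    return genome_size_bp // spacing_bp


estimate = expected_variant_count(GENOME_SIZE_BP, SNP_SPACING_BP)
print(f"{estimate:,} SNPs expected at 1 per {SNP_SPACING_BP} bp")
```

Dividing 3,000,000,000 by 300 gives 10,000,000, which matches the "roughly 10 million" figure.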

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; accessions of submitted SNP (SS) records start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; accessions of Reference SNP clusters (refSNP) start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
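The IUPAC table maps directly onto a lookup dictionary, which makes it easy to expand an ambiguity code or to find the code for a pair of alleles. This is a minimal sketch; the function names are made up for illustration, and alleles are assumed to be written with a slash between them (for example A/G).

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: a set of bases -> the single code that represents it.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code for slash-separated alleles such as 'A/G'."""
    bases = frozenset(allele.strip().upper() for allele in alleles.split("/"))
    return BASES_TO_CODE[bases]


def ambiguous_positions(sequence):
    """List (1-based position, code) for every ambiguity code in a sequence."""
    compact = sequence.replace(" ", "").upper()
    return [
        (position, base)
        for position, base in enumerate(compact, start=1)
        if base in IUPAC_CODES and base not in "ACGT"
    ]


print(sorted(expand("R")))
print(code_for_alleles("A/G"), code_for_alleles("A/C"))
print(ambiguous_positions("ACGR TY"))
```

Scanning a flanking sequence with `ambiguous_positions` is one way to spot neighbouring variable positions in addition to the variation the record describes.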

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
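Queries like these are plain strings, so they can be assembled programmatically before being pasted into the search box. The helper names below are hypothetical, and the field tags (CHR, Function_Class, METHOD) are taken from the examples above; whether a given tag is still accepted by the current dbSNP search is not verified here. Parentheses are added so that the intended grouping of OR and NOT is explicit.

```python
def term(value, field):
    """Format one search term with its field tag, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def any_of(*terms):
    """Combine terms with OR, wrapped in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Combine terms with AND."""
    return " AND ".join(terms)


def excluding(query, unwanted):
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {unwanted}"


frameshift_on_chr1 = all_of(term("1", "CHR"), term("frameshift", "Function_Class"))
known_method_chr1_or_2 = excluding(
    any_of(term("1", "CHR"), term("2", "CHR")),
    term("unknown", "METHOD"),
)

print(frameshift_on_chr1)
print(known_method_chr1_or_2)
```

Building the string from small pieces makes it harder to misplace a bracket when a query combines several fields.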

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
• len / totalLen: total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details
• alleles: lists the alleles of the SNP, separated by /
• Lower or upper case: sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• ATCG (green): green color is used for assay sequence (observed by the submitter)
• ATCG (black): black color is used for flank sequence (extracted from sequence databases)
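The header fields described above can be pulled apart with a short parser. This is a sketch under stated assumptions: the function names are made up, alleles are assumed to be separated by "/", and fields without an "=" sign (such as the submitter handle and submitter SNP ID) are simply skipped.

```python
INTEGER_FIELDS = ("allelePos", "len", "totalLen", "taxid", "snpclass", "build")


def parse_dbsnp_header(line):
    """Split a dbSNP FASTA header line into a dictionary of its fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = [field.strip() for field in line[1:].strip().split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least prefix, database, and accession")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" in field:  # key=value fields; anything else is ignored
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
    for key in INTEGER_FIELDS:
        if key in record:
            record[key] = int(record[key])
    kinds = {"ss": "submitted SNP", "rs": "refSNP cluster"}
    record["record_type"] = kinds.get(record["accession"][:2], "unknown")
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header."""
    total = record.get("totalLen", record.get("len"))
    position = record["allelePos"]
    # allelePos is the 5' length plus 1, and the variation itself counts as 1.
    return position - 1, total - position


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)
print(record["accession"], record["record_type"], record["alleles"])
print(flank_lengths(record))
```

The flank lengths give a quick sanity check on a record: the two flanks plus one base for the variation should add up to the total length.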

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
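A 10-digit variant number can be split into its two parts with a regular expression. The sketch assumes the common written form of six digits for the parent OMIM entry, a decimal point, and four digits for the variant (for example 113705.0001); the function name and the example identifier are used for illustration only.

```python
import re

# Assumed layout: 6-digit entry number, a dot, then a 4-digit variant number.
VARIANT_PATTERN = re.compile(r"(\d{6})\.(\d{4})")


def parse_omim_variant(identifier):
    """Split an identifier such as '113705.0001' into (entry number, variant index)."""
    match = VARIANT_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {identifier!r}")
    return match.group(1), int(match.group(2))


entry, variant_index = parse_omim_variant("113705.0001")
print(entry, variant_index)
```

Keeping the entry number as a string preserves it exactly as written, while the variant index is converted to an integer so that variants of one entry can be sorted.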

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

dbEST (release 120111 Dec 1

2011)bull httpwwwncbinlmnihgovdbESTdbEST_summaryhtml

bull Number of Entries 71276166ndash Homo sapiens (human) 8315294

ndash Mus musculus (mouse) 4853562

ndash Arabidopsis thaliana (thale cress) 1529700

ndash Danio rerio (zebrafish) 1488275

ndash Drosophila melanogaster (fruit fly) 821005

ndash Gallus gallus (chicken) 600433

Department of Health Information Management

Access to dbEST Databull EST sequences are included in the EST division of

GenBank available from NCBI by anonymous ftp and through Entrez

bull The nucleotide sequences may be searched using the BLAST server

bull EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the repositorydbEST directory at ftpncbinihgov

Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on:
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology:
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in:
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.


Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
• The nucleotide sequences may be searched using the BLAST server
• EST sequences are also available as a flat file in FASTA format by anonymous FTP in the repository/dbEST directory at ftp.ncbi.nih.gov

Protein Structure

Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases

• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases

• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases

• Proteins take on 3D structure
• 3D data for some proteins is available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)

• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
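To make "repeating in tandem with variable copy number" concrete, the sketch below counts back-to-back copies of a motif in two alleles. The function name `tandem_copies`, the "CA" motif, and both allele strings are invented for illustration and do not come from any database record.

```python
def tandem_copies(sequence: str, motif: str) -> int:
    """Return the longest run of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        copies = 0
        pos = start
        # Walk forward one motif length at a time while the motif keeps matching.
        while sequence.startswith(motif, pos):
            copies += 1
            pos += len(motif)
        best = max(best, copies)
    return best


# Two hypothetical alleles at the same locus that differ only in copy number.
allele_1 = "GGT" + "CA" * 5 + "TTG"
allele_2 = "GGT" + "CA" * 8 + "TTG"
print(tandem_copies(allele_1, "CA"), tandem_copies(allele_2, "CA"))  # 5 8
```

The flanking sequence is identical in both alleles; only the repeat count differs, which is the property that makes such loci useful as genetic markers.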

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at a frequency of <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
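The frequency-based terminology can be written as a small rule. This is only a sketch of the 1% convention stated on the slide; the function name is hypothetical, and a frequency of exactly 1% is reported as unspecified because the slide defines only the >1% and <1% cases.

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single base change using the 1% allele-frequency convention."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    if allele_frequency > 0.01:
        return "SNP"
    if allele_frequency < 0.01:
        return "mutation"
    # The slide does not say how exactly 1% is treated.
    return "unspecified"


print(classify_single_base_change(0.15))   # SNP
print(classify_single_base_change(0.002))  # mutation
```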

SNPs

• SNPs can occur anywhere in a genome; they are classified by their location
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as by the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Caused by a single-nucleotide change, an A swapped for a T, so that the amino acid incorporated into hemoglobin is valine instead of glutamic acid

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: 1. Normal red blood cells; 2. Sickled red blood cells
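The single-base change can be illustrated at the codon level. The slide names only the A-to-T swap and the resulting valine; the specific codons used below (GAG and GTG) are standard genetic-code background added for illustration, and the helper `substitute` is a hypothetical name.

```python
# Only the two codons needed for this example; in the standard genetic code
# GAG encodes glutamic acid and GTG encodes valine.
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
sickle_codon = substitute(normal_codon, 1, "T")  # the A in the middle becomes T
print(normal_codon, "->", sickle_codon)
print(CODON_TO_AMINO_ACID[normal_codon], "->", CODON_TO_AMINO_ACID[sickle_codon])
```

Changing the middle base alone is enough to switch the encoded amino acid, which is why a single nucleotide substitution can alter the protein.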

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
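The two SNP figures on this slide are consistent with each other, as a quick calculation shows. The genome length used here (about 3 billion base pairs) is an assumption brought in for the arithmetic; it is not stated on the slide.

```python
# Assumed approximate length of the human genome in base pairs (not from the slide).
GENOME_LENGTH_BP = 3_000_000_000
# From the slide: on average one SNP per 300 bp.
BP_PER_SNP = 300

expected_snps = GENOME_LENGTH_BP // BP_PER_SNP
print(f"{expected_snps:,}")  # 10,000,000
```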

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
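The code table translates directly into a lookup. The sketch below, with hypothetical helper names, expands a code into its possible bases and lists the ambiguous positions in a short fragment taken from the ss5586300 record shown earlier.

```python
# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code: str) -> set:
    """Return the set of bases an IUPAC code can represent."""
    return set(IUPAC[code.upper()])


def ambiguous_positions(sequence: str) -> list:
    """Return (1-based position, code) for every non-ACGT symbol, ignoring spaces."""
    bases = sequence.replace(" ", "")
    return [(i + 1, b) for i, b in enumerate(bases) if b.upper() not in "ACGT"]


fragment = "TTTAAAGRAG CCA R CTCAAGCAAT"
print(sorted(possible_bases("R")))    # ['A', 'G']
print(ambiguous_positions(fragment))  # [(8, 'R'), (14, 'R')]
```

In a dbSNP FASTA record the symbol at `allelePos` is one of these codes, so `possible_bases` gives the alleles that position can carry.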

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of ss records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), and the AND, OR, and NOT operators

Example: Description
BRC*[Gene Name]: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class]): Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]: Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
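Queries of this form can also be assembled programmatically. The sketch below builds one of the example queries and URL-encodes it; the helper `tagged` is a hypothetical name, and the E-utilities `esearch.fcgi` endpoint is an assumption added here, since the slides only describe the web search pages. Whether the service still accepts these exact field tags is not verified, and no request is sent.

```python
from urllib.parse import urlencode


def tagged(value: str, field: str) -> str:
    """Attach an Entrez-style field tag to a search value, e.g. 1[CHR]."""
    return f"{value}[{field}]"


# Rebuild the second example from the table above.
query = f'{tagged("1", "CHR")} AND ({tagged("frameshift", "Function_Class")})'

# Assumed endpoint (not from the slides): NCBI E-utilities ESearch.
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = base_url + "?" + urlencode({"db": "snp", "term": query})

print(query)  # 1[CHR] AND (frameshift[Function_Class])
print(url)
```

Keeping the field tag separate from the value makes it easy to combine several conditions with AND, OR, and NOT without retyping the brackets.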

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920


Protein Structure

Department of Health Information Management

Cn3D ftpftpncbinihgovcn3dCn3D-43msi

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
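As a minimal sketch of how the code table can be used, the snippet below expands an IUPAC-coded sequence into the concrete sequences it stands for. The function name `expand` and the short test fragment (taken from around the `R` position in the ss5586300 record) are illustrative choices, not part of dbSNP.

```python
from itertools import product

# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence):
    """List every concrete sequence matched by an IUPAC-coded sequence."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combination) for combination in product(*choices)]


print(expand("CCARCTC"))  # ['CCAACTC', 'CCAGCTC']
```

A sequence with one ambiguity code expands to as many sequences as that code has bases, so expansion is only practical for short fragments with few ambiguous positions.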

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records and batch search; allows SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
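Queries like these can also be assembled programmatically. The sketch below builds one of the example terms and wraps it in an NCBI E-utilities `esearch.fcgi` URL without sending any request. The helper names `field` and `esearch_url` are hypothetical, and whether the field tags from the slide are still accepted by the current dbSNP index is an assumption that should be checked against NCBI's documentation.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value, tag):
    """Attach an Entrez field tag, e.g. field('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{tag}]"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an ESearch request URL for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Field tags copied from the slide; treat them as unverified.
term = f"{field('1', 'CHR')} AND ({field('frameshift', 'Function_Class')})"
url = esearch_url(term)
print(term)
print(url)
```

Keeping the term construction separate from the URL makes it easy to inspect the query string before using it in Entrez or in a script.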

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

Header: the FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                     Description
gnl                       Internal use
dbSNP                     Database name
ss or rs number           dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                 Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len or totalLen           Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id   Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                     NCBI taxonomy id
mol                       Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                  Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                   Lists the alleles of the SNP, separated by "/"
Lower or upper case       Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)              Green is used for assay sequence (observed by the submitter)
ATCG (black)              Black is used for flank sequence (extracted from sequence databases)
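The field layout above can be turned into a small parser. This is a sketch under stated assumptions rather than an official dbSNP tool: the function name `parse_dbsnp_header` is hypothetical, the header is the rs799916 example with the allele separator written as "/", and fields without "=" after the accession (such as the submitter handle) are simply skipped.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into positional and key=value fields."""
    parts = header.lstrip(">").strip().split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # fields without "=" (e.g. submitter handle) are skipped
            key, value = part.split("=", 1)
            record[key] = value.strip('"')
    # "ss" marks a submitted SNP, "rs" a refSNP cluster.
    is_submitted = record["accession"].startswith("ss")
    record["record_type"] = "submitted" if is_submitted else "refSNP"
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1, and the variation counts as length 1.
five_prime_len = int(rec["allelePos"]) - 1
three_prime_len = int(rec["totalLen"]) - int(rec["allelePos"])
print(rec["accession"], rec["record_type"], five_prime_len, three_prime_len)
```

For this record both flanks come out at 300 bases, which agrees with the 30 ten-base blocks shown before the `M` code in the sequence.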

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
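For the "save in FASTA format" step, a sequence copied from a GenBank record can be written out with a few lines of code. This is a minimal sketch: the function name `to_fasta`, the identifier, and the description are hypothetical placeholders, and the short sequence is just the first 30 bases of the ss5586300 record shown earlier in the lecture.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([f">{identifier} {description}"] + lines) + "\n"


# Hypothetical identifier and description, used only for illustration.
record = to_fasta("example_id", "illustrative fragment",
                  "ATAAACATGG ACTTTTACAA AACCCATATC", width=20)
print(record, end="")

# To save the record, write the string to a file, for example:
# with open("example.fasta", "w") as handle:
#     handle.write(record)
```

The header line starts with ">" followed by the identifier, and the sequence is wrapped to a fixed width, which is the layout most sequence tools expect.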


Cn3D: ftp://ftp.ncbi.nih.gov/cn3d/Cn3D-4.3.msi

Crystal Structure of a Protein

Protein Databases
• Proteins have structure and function
• InterPro: protein families and domains
  http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR)
  http://pir.georgetown.edu
• SWISS-PROT/TrEMBL: curated protein sequences
  http://www.expasy.ch/sprot
• UniProt
  http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there were 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)
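To make the "99.9% identical" figure concrete, the sketch below computes percent identity between two equal-length sequences. The sequences are toy data invented for illustration (1,000 bases with a single difference), and `percent_identity` is a hypothetical helper that assumes the sequences are already aligned without gaps.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions that match between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Toy data: a 1,000-base reference and a copy with one base changed.
reference = "ACGT" * 250
variant = reference[:499] + "A" + reference[500:]  # position 500 is T in the reference

identity = percent_identity(reference, variant)
print(f"{identity:.1f}% identical")
```

One difference in 1,000 bases corresponds to the 0.1% divergence quoted on the slide.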

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often the result of localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
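Tandem repeat polymorphisms differ between individuals in the number of back-to-back copies of a motif. A minimal sketch of counting that copy number is shown below; the two alleles and the CAG motif are made-up illustration data, and `longest_tandem_run` is a hypothetical helper name.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(match.group(0)) // len(motif) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Two hypothetical alleles that differ only in repeat copy number.
allele_1 = "TTGA" + "CAG" * 5 + "TTC"
allele_2 = "TTGA" + "CAG" * 8 + "TTC"

print(longest_tandem_run(allele_1, "CAG"), longest_tandem_run(allele_2, "CAG"))
```

Comparing the copy number at the same locus across samples is the basic idea behind using such repeats as genetic markers.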

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
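The frequency-based terminology can be written as a simple rule. The slide only covers frequencies above and below 1%, so how exactly 1% is handled here (treated as a mutation) is an assumption of this sketch, as are the function and constant names.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # 1%, from the slide


def classify_single_base_change(allele_frequency):
    """Label a single base change as 'SNP' or 'mutation' by its population frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% is treated as a mutation.
    return "SNP" if allele_frequency > SNP_FREQUENCY_THRESHOLD else "mutation"


for frequency in (0.25, 0.02, 0.005):
    print(frequency, classify_single_base_change(frequency))
```

As the last bullet notes, databases do not enforce this distinction, so a record's presence in dbSNP says nothing by itself about its frequency.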

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter regions, coding regions, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps
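The synonymous versus non-synonymous distinction depends only on whether the encoded amino acid changes. The sketch below shows the idea with a deliberately partial codon table (three codons from the standard genetic code); the function name and the example codons are illustrative choices, not taken from the slide.

```python
# Partial codon table from the standard genetic code; a real tool needs all 64 codons.
CODONS = {"CTT": "Leu", "CTC": "Leu", "CCT": "Pro"}


def classify_coding_change(reference_codon, variant_codon):
    """Return 'synonymous' if both codons encode the same amino acid."""
    same_amino_acid = CODONS[reference_codon] == CODONS[variant_codon]
    return "synonymous" if same_amino_acid else "non-synonymous"


print(classify_coding_change("CTT", "CTC"))  # third-base change, still leucine
print(classify_coding_change("CTT", "CCT"))  # second-base change, leucine to proline
```

A synonymous change can still matter through effects on transcription, translation, or regulation, which is why the slide lists those consequences separately from protein changes.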

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Crystal Structure of A Protein

Department of Health Information Management

Protein Databasesbull Proteins have structure and functionbull InterPro Protein families and domains

httpwwwebiacukinterprobull Protein Information Resource (PIR)

httppirgeorgetownedubull SWISS-PROTTrEMBL curated protein sequences

httpwwwexpasychsprot bull UniProt

httpwwwexpasyuniprotorgindexshtml

Department of Health Information Management

Protein Sequence Motifs Databasesbull Proteins have conserved regions (motifs

domains) which may have functional significance

bull Databases exist to store protein families motifs and structural domainsbull CDD

httpwwwncbinlmnihgovStructurecddcddshtml bull Pfam httpwwwsangeracukSoftwarePfam bull PROSITE httpwwwexpasyorgprosite

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases
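
The idea of a tandem repeat with variable copy number can be sketched in a few lines. The example below counts the longest run of back-to-back copies of a motif; the function name and the two toy alleles are hypothetical and are not taken from any real locus.

```python
import re


def max_tandem_copies(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    sequence, motif = sequence.upper(), motif.upper()
    # One or more back-to-back copies of the motif form a single run.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Two made-up alleles of the same locus that differ only in CA copy number.
allele_1 = "GGT" + "CA" * 5 + "TTG"
allele_2 = "GGT" + "CA" * 8 + "TTG"
print(max_tandem_copies(allele_1, "CA"), max_tandem_copies(allele_2, "CA"))
```

Because the copy number differs between individuals, a panel of such loci can serve as the kind of genetic marker mentioned above.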

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
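
The frequency-based terminology above can be written as a small rule. This is a minimal sketch with hypothetical function names; the slide does not say how a frequency of exactly 1% should be labeled, so treating it as a mutation is an assumption made here.

```python
def allele_frequency(allele_count: int, chromosomes_sampled: int) -> float:
    """Fraction of sampled chromosomes that carry the variant allele."""
    if chromosomes_sampled <= 0 or not 0 <= allele_count <= chromosomes_sampled:
        raise ValueError("counts must satisfy 0 <= allele_count <= chromosomes_sampled")
    return allele_count / chromosomes_sampled


def classify_single_base_change(frequency: float, threshold: float = 0.01) -> str:
    """Label a single base change as 'SNP' (>1%) or 'mutation' (otherwise).

    ASSUMPTION: a frequency of exactly 1% is labeled 'mutation',
    because the lecture defines only the >1% and <1% cases.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return "SNP" if frequency > threshold else "mutation"


# 1000 individuals carry 2000 chromosomes at an autosomal locus.
print(classify_single_base_change(allele_frequency(30, 2000)))  # frequency 1.5%
print(classify_single_base_change(allele_frequency(3, 2000)))   # frequency 0.15%
```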

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter regions, coding regions, introns, exons, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affects gene transcription quantitatively or qualitatively
  – Affects gene translation quantitatively or qualitatively
  – Changes protein structure and function
  – Changes gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single-base change (A to T) that causes valine to be inserted in place of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: (1) normal red blood cells; (2) sickled red blood cells
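
The distinction between synonymous and non-synonymous changes can be checked against the standard genetic code. The sketch below builds the standard codon table and classifies a single codon change; the function name is hypothetical, and the codons used for sickle cell (GAG to GTG in the beta-globin coding strand) come from general knowledge rather than from the slide.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> tuple[str, str, str]:
    """Return (reference amino acid, altered amino acid, effect)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    effect = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, effect


# Sickle cell change: glutamic acid (E) is replaced by valine (V).
print(classify_codon_change("GAG", "GTG"))
# A change in the third base of the same codon that leaves the protein unchanged.
print(classify_codon_change("GAG", "GAA"))
```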

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (SeattleSNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
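
The IUPAC code table maps directly onto a lookup structure. The sketch below stores the table as a dictionary and converts in both directions, which is how an A/G SNP ends up written as R in a dbSNP sequence; the function names are hypothetical.

```python
# IUPAC nucleotide codes, exactly as listed in the table.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: an unordered set of bases maps back to its single-letter code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand(code: str) -> set[str]:
    """Return the set of bases that an IUPAC code stands for."""
    return set(IUPAC_CODES[code.upper()])


def encode(alleles) -> str:
    """Return the IUPAC code that represents a collection of alleles."""
    return BASES_TO_CODE[frozenset(allele.upper() for allele in alleles)]


print(sorted(expand("R")))   # bases covered by R
print(encode(["A", "G"]))    # code for an A/G SNP
```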

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frame-shift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
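
Queries like these can also be composed programmatically. The sketch below builds query strings from the field tags shown in the examples and wraps one in an NCBI E-utilities ESearch URL without sending any request. The helper names are hypothetical, and the field tags are those from the slides, so they may not match what the current dbSNP search accepts.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint; db and term are its standard parameters.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value: str, field: str) -> str:
    """Format one fielded search term, e.g. term('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{field}]"


def all_of(*terms: str) -> str:
    """Combine terms with AND."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Combine terms with OR, parenthesized so the group can be nested."""
    return "(" + " OR ".join(terms) + ")"


def esearch_url(query: str, db: str = "snp") -> str:
    """Build (but do not send) an ESearch request URL for the query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


# Frameshift SNPs on chromosome 1, using the field tags from the examples.
query = all_of(term("1", "CHR"), term("frameshift", "Function_Class"))
print(query)
print(esearch_url(query))
```

Building the string in code keeps the operators and brackets consistent when many queries are generated, for example one per chromosome.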

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green ATCG: Green color is used for assay sequence (observed by the submitter)
Black ATCG: Black color is used for flank sequence (extracted from sequence databases)
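
The field descriptions above are enough to parse a header line mechanically. The following is a minimal sketch, not an official dbSNP parser: the function name is hypothetical, and it assumes the header uses `key=value` fields separated by `|` and alleles separated by `/`, as described in the field list.

```python
def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = [f.strip() for f in header[1:].split("|") if f.strip()]
    if len(fields) < 3 or fields[0] != "gnl" or fields[1] != "dbSNP":
        raise ValueError("expected a header of the form >gnl|dbSNP|<accession>|...")

    accession = fields[2]
    if accession.startswith("ss"):
        record_type = "submitted SNP"
    elif accession.startswith("rs"):
        record_type = "refSNP cluster"
    else:
        record_type = "unknown"
    record = {"accession": accession, "record_type": record_type}

    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # Unlabeled fields, e.g. submitter handle and submitter SNP ID.
            record.setdefault("unlabeled", []).append(field)

    # Numeric fields are converted when present.
    for key in ("allelePos", "len", "totalLen", "taxid", "snpclass", "build"):
        if key in record:
            record[key] = int(record[key])
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_fasta_header(header)
five_prime_flank = record["allelePos"] - 1                   # bases before the variation
three_prime_flank = record["totalLen"] - record["allelePos"]  # bases after it
print(record["record_type"], record["alleles"], five_prime_flank, three_prime_flank)
```

For this rs record, both flanks come out at 300 bases, which agrees with the rule that allelePos is the 5' flank length plus 1.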

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number (the 6-digit entry number followed by a 4-digit variant number) and can be searched in two ways
  – Search for a gene or a disease; once the record is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – Information on population prevalence of genetic variants
  – Gene-disease associations
  – Gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method named "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Protein Databases
• Proteins have structure and function
• InterPro (protein families and domains): http://www.ebi.ac.uk/interpro
• Protein Information Resource (PIR): http://pir.georgetown.edu
• SWISS-PROT/TrEMBL (curated protein sequences): http://www.expasy.ch/sprot
• UniProt: http://www.expasy.uniprot.org/index.shtml

Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains), which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  • CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  • Pfam: http://www.sanger.ac.uk/Software/Pfam
  • PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, and annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNPs or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.


Protein Sequence Motifs Databases
• Proteins have conserved regions (motifs, domains) which may have functional significance
• Databases exist to store protein families, motifs, and structural domains
  – CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
  – Pfam: http://www.sanger.ac.uk/Software/Pfam
  – PROSITE: http://www.expasy.org/prosite

Protein Structure Databases
• Proteins take on a 3D structure
• 3D data for some proteins are available thanks to techniques such as NMR and X-ray crystallography
  – PDB: http://www.pdb.org
  – SCOP: http://scop.mrc-lmb.cam.ac.uk/scop
  – MMDB: http://www.ncbi.nlm.nih.gov/Structure

PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62,787 searchable structures in the PDB database
• PDB provides
  – Sequence, atomic coordinates, derived geometric data, secondary structure content, annotations about protein literature references

PDB Statistics
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase
http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
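
The frequency-based terminology above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and because the slide leaves a frequency of exactly 1% undefined, the sketch assumes that boundary case counts as a SNP.

```python
def classify_single_base_change(allele_frequency, threshold=0.01):
    """Label a single-base change by its population allele frequency.

    Assumption: a frequency of exactly 1% is treated as a SNP, because
    the definition only covers the >1% and <1% cases.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency >= threshold else "mutation"


# A variant carried on 5% of chromosomes versus one carried on 0.2%.
print(classify_single_base_change(0.05))
print(classify_single_base_change(0.002))
```

As the last bullet notes, real database records do not follow this cutoff strictly, so a label produced this way describes the terminology only.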

SNPs
• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change, swapping an A for a T, which causes the encoded amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: (1) normal red blood cells; (2) sickled red blood cells
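
The single-base change behind sickle cell anemia can be traced at the codon level. The slide does not give the codons, so the sketch below supplies them from the standard genetic code: GAG encodes glutamic acid and GTG encodes valine. The helper name and the two-entry codon table are illustrative only.

```python
# Minimal codon table holding only the two codons used in this example.
CODON_TO_AMINO_ACID = {"GAG": "glutamic acid", "GTG": "valine"}


def substitute_base(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


normal_codon = "GAG"
# Swap the middle A for a T, as described on the slide.
sickle_codon = substitute_base(normal_codon, 1, "T")

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```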

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
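
The two SNP figures quoted for dbSNP are consistent with each other, which a short calculation shows. The genome length of about 3 billion base pairs is an assumption introduced here; it does not appear on the slide.

```python
# Assumption: approximate length of the haploid human genome in base pairs.
GENOME_LENGTH_BP = 3_000_000_000

# From the slide: on average one SNP per 300 bp.
BP_PER_SNP = 300

expected_snps = GENOME_LENGTH_BP // BP_PER_SNP
print(f"Expected SNPs across the genome: {expected_snps:,}")

# The same genome length gives the scale of the 0.1% difference
# between two unrelated individuals (1 base in 1,000).
differing_bases = GENOME_LENGTH_BP // 1000
print(f"Bases differing between two individuals: {differing_bases:,}")
```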

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with "rs"

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
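
The IUPAC table translates directly into a lookup structure. The sketch below stores each code with its bases in alphabetical order and adds two small helpers; the helper names are hypothetical and chosen only for this illustration.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code):
    """Return the list of bases an IUPAC code can represent."""
    return list(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code that represents exactly the given alleles."""
    wanted = "".join(sorted({allele.upper() for allele in alleles}))
    for code, bases in IUPAC_CODES.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(possible_bases("R"))
print(code_for_alleles(["A", "C"]))
```

This is how a record with alleles A and G comes to show R at the variant position, and one with alleles A and C shows M.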

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
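
Queries like these can also be assembled in a script. The sketch below only builds the query string and a request URL; it does not contact NCBI. It assumes the NCBI E-utilities `esearch` endpoint with its `db`, `term`, and `retmax` parameters, and it takes the field tags from the examples above without checking whether the current Entrez SNP index still accepts them. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field):
    """Format one search term as value[Field], as in the slide examples."""
    return f"{value}[{field}]"


def build_snp_search_url(term, max_results=20):
    """Return an esearch URL for the SNP database; no request is sent."""
    params = {"db": "snp", "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Chromosome 1 SNPs with function class frameshift.
term = (
    f"{field_term('1', 'CHR')} AND "
    f"({field_term('frameshift', 'Function_Class')})"
)
print(term)
print(build_snp_search_url(term))
```

Building the term from small pieces keeps the Boolean operators and brackets in one place, which helps when several limits are combined.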

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated, non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format
The FASTA header line starts with > and has fields separated by |. Each field is explained below.

Header field               Explanation
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len/totalLen               Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green text                 Used for assay sequence (observed by the submitter)
Black text                 Used for flank sequence (extracted from sequence databases)
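
The field descriptions above are enough to parse a dbSNP FASTA header mechanically. The following is a minimal sketch, not an official parser: the function name is hypothetical, it assumes alleles are separated by "/" in the header, and it simply skips fields that have no "=" (such as the submitter handle and submitter SNP ID).

```python
def parse_dbsnp_fasta_header(header):
    """Split a dbSNP FASTA header into a dictionary of its fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    if len(fields) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    # Remaining fields are key=value pairs; fields without '=' are skipped.
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value
    # Convert the numeric fields described in the table to integers.
    for key in ("allelePos", "len", "totalLen", "taxid", "snpclass"):
        if key in record:
            record[key] = int(record[key])
    if "alleles" in record:
        record["alleles"] = record["alleles"].strip("'\"").split("/")
    # ss accessions are submitted SNPs; rs accessions are refSNP clusters.
    is_submitted = record["accession"].startswith("ss")
    record["record_type"] = "submitted" if is_submitted else "refSNP"
    return record


header = ">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic"
record = parse_dbsnp_fasta_header(header)

# allelePos is the 5' flank length plus 1, so both flank lengths follow.
five_prime_flank = record["allelePos"] - 1
three_prime_flank = record["len"] - record["allelePos"]
print(record["accession"], record["alleles"], five_prime_flank, three_prime_flank)
```

For the submitted record shown earlier this gives a 5' flank of 213 bases and a 3' flank of 261 bases, which together with the single variant position add up to the stated length of 475.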

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records


  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Protein Structure Databasesbull Proteins take on 3D structure

bull 3D data for some proteins is available due to techniques such as NMR and X-Ray crystallographyndash PDB httpwwwpdborg

ndash SCOP httpscopmrc-lmbcamacukscop

ndash MMDB httpwwwncbinlmnihgovStructure

Department of Health Information Management

PDB (wwwpdborg)bull The Protein Data Bank (PDB) is the single

worldwide depository of information about the 3D structures of large biological molecules including proteins and nucleic acids

bull Understanding the shape of a molecule helps to understand how it works

bull As of January 2010 there are 62787 searchable structures in the PDB database

bull PDB providesndash Sequence Atomic Coordinates Derived geometric data

Secondary structure content Annotations about protein literature references

Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder, or of a gene involved in genetic disorders
  – biochemical and other features
  – cytogenetics and mapping
  – molecular and population genetics
  – diagnosis and clinical management
  – animal models for the disorder
  – allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various kinds of mutation and variation, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number (the 6-digit number of the parent entry followed by a 4-digit variant number) and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
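The sequence-saving step of the homework can also be scripted. The sketch below is only an illustration: it assumes NCBI's E-utilities `efetch` endpoint (not mentioned in the lecture), and the function names and the accession shown are placeholders. It builds the request URL without contacting NCBI, and includes a small helper that writes a sequence in FASTA layout.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint; the lecture itself only uses the web pages.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fasta_url(accession, db="nuccore"):
    """Build (but do not send) a request for one record in FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def to_fasta(header, sequence, width=70):
    """Format a header and a sequence as FASTA text with fixed-width lines."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "ACCESSION" is a placeholder, not a real record identifier.
    print(genbank_fasta_url("ACCESSION"))
    print(to_fasta("example sequence", "ACGTACGTACGT", width=4), end="")
```

Replace the placeholder with the accession found through the GenBank search, and open the printed URL in a browser or fetch it with any HTTP client to obtain the FASTA text.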


PDB (www.pdb.org)
• The Protein Data Bank (PDB) is the single worldwide depository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
• Understanding the shape of a molecule helps to understand how it works
• As of January 2010, there are 62787 searchable structures in the PDB database
• PDB provides
  – Sequence, Atomic coordinates, Derived geometric data, Secondary structure content, Annotations about protein literature references

PDB Statistics

http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
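The frequency-based terminology can be written down as a small rule. This is a minimal sketch: the function name is hypothetical, and because the slide does not say how a frequency of exactly 1% should be labeled, the code reports that case as unspecified rather than guessing.

```python
def classify_single_base_change(allele_frequency):
    """Label a single-base change by its population allele frequency (0.0 to 1.0)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if allele_frequency > 0.01:
        return "SNP"        # frequency above 1%
    if allele_frequency < 0.01:
        return "mutation"   # frequency below 1%
    return "unspecified"    # exactly 1%: not covered by the definition above


if __name__ == "__main__":
    for frequency in (0.25, 0.002, 0.01):
        print(frequency, classify_single_base_change(frequency))
```

As the last bullet notes, real database records do not follow this cutoff strictly, so the rule describes the terminology rather than the contents of dbSNP.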

SNPs
• SNPs can occur anywhere in a genome; they are classified by their location
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter regions, coding regions, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Caused by a single-base change: an A is swapped for a T, so that valine is incorporated instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
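The sickle cell change is a compact example of a non-synonymous substitution. The sketch below is an illustration added here, not part of the lecture: it uses a deliberately partial codon table (only the three codons needed) and the commonly cited GAG to GTG change in the beta-globin gene, and contrasts it with a synonymous change at the third codon position.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid (synonymous with GAG)
    "GTG": "Val",  # valine
}


def substitute(codon, position, new_base):
    """Return the codon with the base at 0-based `position` replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify(codon_before, codon_after):
    """Report both amino acids and whether the change is synonymous."""
    before = CODON_TO_AMINO_ACID[codon_before]
    after = CODON_TO_AMINO_ACID[codon_after]
    kind = "synonymous" if before == after else "non-synonymous"
    return before, after, kind


if __name__ == "__main__":
    sickle = substitute("GAG", 1, "T")   # A -> T at the middle position
    print("GAG ->", sickle, classify("GAG", sickle))
    silent = substitute("GAG", 2, "A")   # G -> A at the third position
    print("GAG ->", silent, classify("GAG", silent))
```

The first change replaces glutamic acid with valine; the second leaves the amino acid unchanged, which is why the position of a SNP within a codon matters.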

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying the genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
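The figures quoted for SNP density can be checked with simple arithmetic. The sketch below assumes a human genome size of about 3 billion base pairs, a round number that is not stated on the slide; the function names are hypothetical. It relates "1 per 300 bp" to "roughly 10 million" and applies the 99.9% identity and ~90% SNP share quoted earlier in the lecture.

```python
# Assumption: approximate human genome size in base pairs (not given on the slide).
GENOME_BP = 3_000_000_000


def expected_snp_sites(genome_bp, bp_per_snp=300):
    """Number of SNP sites if one occurs on average every `bp_per_snp` bases."""
    return genome_bp // bp_per_snp


def pairwise_differences(genome_bp, identical_per_thousand=999):
    """Bases that differ between two genomes that are 99.9% identical."""
    return genome_bp * (1000 - identical_per_thousand) // 1000


def snp_share(differences, snp_percent=90):
    """Portion of the differences attributed to SNPs (~90%)."""
    return differences * snp_percent // 100


if __name__ == "__main__":
    print(expected_snp_sites(GENOME_BP))                       # SNP sites in the population
    differences = pairwise_differences(GENOME_BP)
    print(differences, snp_share(differences))                 # per pair of individuals
```

Under this assumption, one SNP per 300 bp gives 10 million sites across the population, while any two individuals differ at about 3 million positions, roughly 2.7 million of them SNPs.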

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (RefSNP) IDs start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles='A/G'|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
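The IUPAC codes translate directly into a lookup table. The sketch below (function names are hypothetical) expands a code into its possible bases, finds the code for a set of alleles, and lists the ambiguous positions in a sequence, which is how the variant position in a dbSNP FASTA record can be spotted.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
# Reverse lookup keyed by the (unordered) set of bases.
BASES_TO_IUPAC = {frozenset(bases): code for code, bases in IUPAC_TO_BASES.items()}


def expand(code):
    """Return the sorted list of bases that an IUPAC code can represent."""
    return sorted(IUPAC_TO_BASES[code.upper()])


def encode(alleles):
    """Return the single IUPAC code for a collection of alleles, e.g. A and G."""
    return BASES_TO_IUPAC[frozenset(allele.upper() for allele in alleles)]


def ambiguous_positions(sequence):
    """List (1-based position, code) for every non-ACGT symbol, ignoring spaces."""
    compact = sequence.replace(" ", "")
    return [(i + 1, base) for i, base in enumerate(compact) if base.upper() not in "ACGT"]


if __name__ == "__main__":
    print(expand("R"), encode(["A", "G"]), ambiguous_positions("CCA R CTC"))
```

For example, the alleles A and G are written as R, and A and C as M.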

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frameshift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by any method except unknown
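Queries like these can be assembled programmatically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the field tags are the ones shown in the examples (they may differ in current versions of the database), and the URL uses NCBI's E-utilities `esearch` endpoint, which the lecture does not cover. Parentheses are added around OR groups so that the intended grouping is explicit.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint; the lecture uses the web form instead.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag(value, field):
    """Attach a search field tag, e.g. tag("1", "CHR") -> "1[CHR]"."""
    return f"{value}[{field}]"


def any_of(*terms):
    """Combine terms with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Combine terms with AND inside parentheses."""
    return "(" + " AND ".join(terms) + ")"


def but_not(term, excluded):
    """Exclude records matching `excluded` from those matching `term`."""
    return f"{term} NOT {excluded}"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an esearch request for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    query = but_not(any_of(tag("1", "CHR"), tag("2", "CHR")), tag("unknown", "METHOD"))
    print(query)
    print(esearch_url(all_of(tag("1", "CHR"), tag("frameshift", "Function_Class"))))
```

The same strings can be pasted into the Entrez SNP search box, so the helpers are useful mainly when many similar queries have to be generated.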

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles='A/C'|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC
AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA
TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT
GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA
AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT
CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT
CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA
GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC
ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT
CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT
GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below

gnl: Internal use

dbSNP: Database name

ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a RefSNP cluster of one or more submitted SNPs

allelePos: Position of the variation allele (1-based) in the FASTA sequence. It is always the 5' flank length plus 1

len / totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation

handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID

taxid: NCBI taxonomy ID

mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)

alleles: Lists the alleles of the SNP, separated by /

Lower or upper case: Lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements

Green text: assay sequence (observed by the submitter)

Black text: flank sequence (extracted from sequence databases)
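The header layout described above is regular enough to parse with a few lines of code. The following is a minimal sketch, not an official parser: the function names are hypothetical, and the quoting of the alleles value (`'A/C'`) is an assumption about how the punctuation appears in the original record. The flank lengths follow from the rules above: allelePos is the 5' flank length plus 1, and the total length counts the variation as one base.

```python
def parse_dbsnp_fasta_header(line):
    """Split a dbSNP FASTA header into a dictionary of its fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = line[1:].strip().split("|")
    if len(fields) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2], "other": []}
    for field in fields[3:]:
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")   # drop optional quotes around the value
        elif field:
            record["other"].append(field)      # e.g. handle and submitted_snp_id on ss records
    if "alleles" in record:
        record["alleles"] = record["alleles"].split("/")
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) from allelePos and len/totalLen."""
    total = int(record.get("totalLen", record.get("len")))
    position = int(record["allelePos"])
    return position - 1, total - position


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
    parsed = parse_dbsnp_fasta_header(header)
    print(parsed["accession"], parsed["alleles"], flank_lengths(parsed))
```

For the rs799916 header this gives 300 bases of flank on each side of the variant, which matches the sequence shown on the SNP Location slide, where the code M sits at position 301.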


Department of Health Information Management

PDB Statistics

httpwwwrcsborgpdbstatisticscontentGrowthChartdocontent=totalampseqid=100

FlyBase

httpwwwflybaseorg

Department of Health Information Management

FlyBase Introduction

Department of Health Information Management

Quick Searches

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, and "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
• len or totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: lower case marks sequence identified by RepeatMasker as low-complexity or repetitive elements
• Green text: assay sequence (observed by the submitter)
• Black text: flank sequence (extracted from sequence databases)
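The header fields described above can be read programmatically. The following is a minimal sketch, not an official NCBI parser: the function names `parse_dbsnp_header` and `flank_lengths` are hypothetical, and the example header is the rs799916 record from the slide written with a "/" between the alleles, which is an assumption about the original punctuation.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header into positional and key=value fields."""
    fields = header.lstrip(">").strip().strip("|").split("|")
    # The first three fields are positional: prefix, database name, accession.
    record = {"prefix": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        # Fields without "=" (e.g. submitter handle and id) are skipped here.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")
    return record


def flank_lengths(record: dict) -> tuple:
    """Derive the 5' and 3' flank lengths from allelePos and the total length."""
    allele_pos = int(record["allelePos"])
    total = int(record.get("totalLen", record.get("len")))
    five_prime = allele_pos - 1       # allelePos is the 5' flank length plus 1
    three_prime = total - allele_pos  # the variation itself counts as one base
    return five_prime, three_prime


# Example header (assumed punctuation: "/" between alleles).
header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)
print(rec["accession"], rec["alleles"], flank_lengths(rec))
```

For this record the sketch gives two flanks of 300 bases each, which matches the 30 ten-base blocks shown on either side of the IUPAC code M in the sequence.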

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes

Disease-centric databases:

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on:
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.


FlyBase

http://www.flybase.org

FlyBase Introduction

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
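The frequency-based terminology above amounts to a simple threshold rule. The sketch below illustrates it; the function name `classify_single_base_change` is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, because the slide only covers the cases above and below 1%.

```python
def classify_single_base_change(allele_frequency: float,
                                threshold: float = 0.01) -> str:
    """Label a single-base change by its population allele frequency.

    Frequencies above the threshold (1% by default) are called SNPs;
    anything else is called a mutation. The handling of a frequency
    exactly equal to the threshold is an assumption of this sketch.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency > threshold else "mutation"


for freq in (0.25, 0.012, 0.002):
    print(f"{freq:.3f} -> {classify_single_base_change(freq)}")
```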

SNPs

• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Caused by a single nucleotide substitution (an A swapped for a T), which makes the inserted amino acid valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
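The sickle cell change is a compact example of a non-synonymous SNP. The sketch below is an illustration only: it uses a four-entry excerpt of the standard genetic code (GAG and GAA encode glutamic acid, GTG and GTA encode valine) and assumes the affected codon is GAG changing to GTG, a detail the slide does not spell out. The helper names are hypothetical.

```python
# Excerpt of the standard genetic code; only the codons needed here.
CODON_TABLE = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val", "GTA": "Val"}


def substitute(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify_change(ref_codon: str, alt_codon: str) -> tuple:
    """Translate both codons and report whether the amino acid changes."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


normal_codon = "GAG"                            # assumed reference codon
sickle_codon = substitute(normal_codon, 1, "T")  # A -> T at the middle base
print(sickle_codon, classify_change(normal_codon, sickle_codon))
print(classify_change("GAG", "GAA"))             # a silent change, for contrast
```

The contrast case shows why the nature of the substitution matters: a third-base change from GAG to GAA leaves the amino acid unchanged, while the middle-base change replaces it.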

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes:
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
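The figures quoted in this lecture can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the slide numbers (10 million SNPs at 1 per 300 bp, a 0.1% difference between individuals, and a ~90% SNP share of that difference); the implied genome size is derived from those numbers, not stated in the slides.

```python
SNP_COUNT = 10_000_000      # "roughly 10 million" SNPs in the population
AVERAGE_SPACING_BP = 300    # "on average 1 per 300 bp"

# Genome length implied by the two figures above.
implied_genome_bp = SNP_COUNT * AVERAGE_SPACING_BP

# 0.1% of positions differ between two unrelated individuals.
differing_bp = implied_genome_bp // 1000

# About 90% of those differences are SNPs.
snp_differences = differing_bp * 90 // 100

print(f"implied genome size: {implied_genome_bp:,} bp")
print(f"differing positions between two individuals: {differing_bp:,}")
print(f"of which SNPs (~90%): {snp_differences:,}")
```

The implied length of about 3 billion bp is in line with the commonly cited size of the human genome, so the two SNP figures on the slide are consistent with each other.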

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with "rs"
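Because the two record classes are distinguished by their accession prefix, an accession can be sorted into its class with a short pattern match. This is a minimal sketch; the function name `classify_accession` and the returned labels are hypothetical, and the pattern assumes an accession is simply the prefix followed by digits.

```python
import re

# Assumed accession shape: "ss" or "rs" followed by one or more digits.
ACCESSION_RE = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession: str) -> tuple:
    """Return (record class, numeric id) for a dbSNP accession."""
    match = ACCESSION_RE.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    prefix, number = match.groups()
    kind = "submitted SNP" if prefix == "ss" else "reference SNP cluster"
    return kind, int(number)


for acc in ("ss5586300", "rs799916"):
    print(acc, classify_accession(acc))
```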

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code   Meaning
A            A
C            C
G            G
T            T
M            A or C
R            A or G
W            A or T
S            C or G
Y            C or T
K            G or T
V            A or C or G
H            A or C or T
D            A or G or T
B            C or G or T
N            G or A or T or C
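The IUPAC table maps directly onto a lookup structure, which is how the single ambiguity letter in a dbSNP flank sequence can be turned back into its alleles. The sketch below encodes the table as given; the helper names `alleles_for` and `code_for` are hypothetical.

```python
# IUPAC nucleotide codes from the table, with the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def alleles_for(code: str) -> list:
    """Expand an IUPAC code into the list of bases it represents."""
    return sorted(IUPAC[code.upper()])


def code_for(alleles) -> str:
    """Find the IUPAC code that represents exactly the given set of bases."""
    key = "".join(sorted({base.upper() for base in alleles}))
    for code, bases in IUPAC.items():
        if bases == key:
            return code
    raise ValueError(f"no IUPAC code for {alleles!r}")


print(alleles_for("M"))       # the code at the variant position of rs799916
print(code_for(["A", "G"]))   # the alleles listed for ss5586300
```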

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of ss records, batch search, SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                   Description
BRC*[Gene Name]                           Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])   Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                          Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]      Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
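Queries like the examples above can also be assembled in code and sent to NCBI's E-utilities `esearch` endpoint instead of being typed into the web form. The sketch below only builds the request URL and does not contact NCBI. The helper names are hypothetical, and the field tags (`CHR`, `Function_Class`) are taken from the slide; whether the current Entrez SNP index still accepts them is an assumption.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term: str, tag: str) -> str:
    """Attach an Entrez field tag to a search term, e.g. 1[CHR]."""
    return f"{term}[{tag}]"


def build_esearch_url(query: str, retmax: int = 20) -> str:
    """Build (but do not send) an esearch request against the SNP database."""
    params = {"db": "snp", "term": query, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Second example from the table: frameshift SNPs on chromosome 1.
query = f"{field('1', 'CHR')} AND ({field('frameshift', 'Function_Class')})"
url = build_esearch_url(query)
print(query)
print(url)
```

Keeping the query construction separate from the URL encoding makes it easy to check the human-readable query against the table before any request is sent.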

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp



Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Quick Searches

Quick Search Results

Gene Report Page: gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms
• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at a frequency of <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
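The frequency convention above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and because the slide does not say how a frequency of exactly 1% is labeled, the sketch assumes it counts as a SNP.

```python
def classify_single_base_change(allele_frequency: float) -> str:
    """Label a single-base change using the 1% population-frequency convention."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must be between 0 and 1")
    # Assumption: exactly 1% is treated as a SNP (the convention leaves it open).
    return "SNP" if allele_frequency >= 0.01 else "mutation"


if __name__ == "__main__":
    for frequency in (0.25, 0.01, 0.002):
        print(frequency, classify_single_base_change(frequency))
```

As the last bullet notes, real SNP databases do not enforce this cutoff, so the label is a convention rather than a property of a database record.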

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change, swapping an A for a T, which causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
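A minimal sketch of why this single-base swap is non-synonymous. The slide does not give the codon; the sketch assumes the commonly cited change GAG to GTG and uses a deliberately partial codon table (three codons only). All names are illustrative.

```python
# Partial codon table: only the codons needed for this illustration.
CODON_TO_AMINO_ACID = {
    "GAG": "Glu",  # glutamic acid
    "GAA": "Glu",  # glutamic acid (a synonymous alternative)
    "GTG": "Val",  # valine
}


def substitute_base(codon: str, position: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify_substitution(reference_codon: str, variant_codon: str) -> str:
    """Compare the amino acids encoded before and after the change."""
    before = CODON_TO_AMINO_ACID[reference_codon]
    after = CODON_TO_AMINO_ACID[variant_codon]
    return "synonymous" if before == after else "non-synonymous"


if __name__ == "__main__":
    sickle_codon = substitute_base("GAG", 1, "T")  # A -> T at the middle base
    print(sickle_codon, classify_substitution("GAG", sickle_codon))
    print("GAA", classify_substitution("GAG", "GAA"))
```

The second call shows the contrast: a change in the third base of the same codon (GAG to GAA) still encodes glutamic acid, so it would be synonymous.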

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases
• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between the numbers of SNPs and VNTRs
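The "10 million" and "1 per 300 bp" figures are two views of the same estimate. A quick arithmetic check, assuming a haploid human genome of roughly 3 billion base pairs (a round number that is not stated on the slide):

```python
# Assumption: approximate haploid human genome size in base pairs.
GENOME_SIZE_BP = 3_000_000_000
# From the slide: on average one SNP per 300 bp.
BP_PER_SNP = 300

estimated_snp_count = GENOME_SIZE_BP // BP_PER_SNP

if __name__ == "__main__":
    print(f"Estimated SNPs: {estimated_snp_count:,}")
```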

dbSNP Data Types
• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record IDs start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) IDs start with "rs"
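The two record classes can be told apart from the accession prefix alone. A small sketch; the function name and the returned labels are illustrative, not part of dbSNP:

```python
import re

# An accession is the prefix "ss" or "rs" followed by digits, e.g. ss5586300.
ACCESSION_PATTERN = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession: str) -> tuple[str, int]:
    """Return the record class and numeric ID for a dbSNP accession."""
    match = ACCESSION_PATTERN.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not a dbSNP ss/rs accession: {accession!r}")
    prefix, number = match.groups()
    record_class = "submitted SNP" if prefix == "ss" else "reference SNP cluster"
    return record_class, int(number)


if __name__ == "__main__":
    print(classify_accession("ss5586300"))
    print(classify_accession("rs799916"))
```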

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
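The table maps naturally onto a lookup dictionary. The sketch below expands a code into its possible bases and, in the other direction, finds the code for a set of alleles; the function names are illustrative.

```python
# IUPAC nucleotide codes from the table, each mapped to the bases it can stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_code(code: str) -> set[str]:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles: list[str]) -> str:
    """Return the IUPAC code that represents exactly the given bases."""
    wanted = {allele.upper() for allele in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {sorted(wanted)}")


if __name__ == "__main__":
    print(sorted(expand_code("R")))      # the variant position in a record with alleles A and G
    print(code_for_alleles(["A", "C"]))  # the code used for an A/C variation
```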

Different Ways to Search SNPs in dbSNP
• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wildcard (*), range (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
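The same query strings can be sent programmatically. The sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint, which the slides do not cover and which is introduced here as an assumption; it does not contact NCBI. The field tags are copied from the slide and may not match the field names the current dbSNP index accepts.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for the SNP database without sending the request."""
    query = urlencode({"db": "snp", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    # Query taken from the examples table; brackets and spaces are percent-encoded.
    print(build_snp_search_url("1[CHR] AND (frameshift[Function_Class])"))
```

Building the URL separately from sending it keeps the query text easy to inspect before any request is made.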

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP; "ss" refers to a submitted SNP accession, "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
• allelePos: variation allele position (1-based) on the FASTA sequence; it is always the 5' flank length plus 1
• len / totalLen: total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation; the variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
• handle|submitted_snp_id: only for submitted SNPs; the two fields after totalLen are the submitter handle and the submitter SNP ID
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• ATCG in green: assay sequence (observed by the submitter)
• ATCG in black: flank sequence (extracted from sequence databases)
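A minimal sketch of a parser for this header layout, written against the rs799916 header shown earlier. The function name and the returned dictionary keys for the first three fields are illustrative choices. The positional handle and submitted_snp_id fields of ss records are skipped, because they carry no "key=" prefix.

```python
INTEGER_FIELDS = ("allelePos", "len", "totalLen", "taxid", "snpclass", "build")


def parse_dbsnp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = header[1:].strip().split("|")
    record = {"source": fields[0], "database": fields[1], "accession": fields[2]}
    for field in fields[3:]:
        if "=" not in field:
            continue  # empty trailing field or positional submitter fields
        key, value = field.split("=", 1)
        record[key] = value.strip("'\"")  # drop quotes around values if present
    for key in INTEGER_FIELDS:
        if key in record:
            record[key] = int(record[key])
    return record


if __name__ == "__main__":
    header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
              "snpclass=1|alleles=A/C|mol=Genomic|build=130")
    record = parse_dbsnp_fasta_header(header)
    five_prime_length = record["allelePos"] - 1
    three_prime_length = record["totalLen"] - record["allelePos"]
    print(record["accession"], five_prime_length, three_prime_length)
```

The two derived lengths follow directly from the field definitions: allelePos is the 5' flank length plus 1, and the variation itself counts as one base in totalLen.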

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
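A small sketch of how such a variant number can be split. It assumes the 10 digits are written in the usual OMIM allelic-variant notation: a 6-digit entry number, a decimal point, and a 4-digit variant number. The slide itself only says "10-digit", and the identifier in the usage example is given purely to show the format.

```python
import re

# Assumed layout: 6-digit OMIM entry number, ".", 4-digit allelic variant number.
VARIANT_PATTERN = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(variant_id: str) -> tuple[str, int]:
    """Return the OMIM entry number and the variant index within that entry."""
    match = VARIANT_PATTERN.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not in NNNNNN.NNNN form: {variant_id!r}")
    entry_number, variant_number = match.groups()
    return entry_number, int(variant_number)


if __name__ == "__main__":
    print(split_omim_variant("113705.0001"))
```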

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases, and what is known about them, are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
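For the "save their sequences in FASTA format" step, a minimal sketch of a FASTA writer. The function name, the 70-column line width, and the example identifier are illustrative choices, not requirements of the assignment.

```python
def to_fasta(identifier: str, sequence: str, line_width: int = 70) -> str:
    """Format one sequence as FASTA text: a '>' header line, then wrapped sequence lines."""
    if line_width < 1:
        raise ValueError("line_width must be at least 1")
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    lines = [cleaned[i:i + line_width] for i in range(0, len(cleaned), line_width)]
    return "\n".join([">" + identifier] + lines) + "\n"


if __name__ == "__main__":
    # Hypothetical identifier and a short made-up sequence, for illustration only.
    print(to_fasta("example_gene", "atggatttat ctgctcttcg cgttgaagaa", line_width=10), end="")
```

The returned text can be written to a file with a ".fasta" extension using ordinary file I/O.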

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Quick Search Results

Department of Health Information Management

Gene Report Page gfzf

Department of Health Information Management

More Details Gene Model amp Product

Department of Health Information Management

Sequence Searches (BLAST)

Department of Health Information Management

Choosing Database Inputting Sequence

41

Department of Health Information Management

More BLAST Options

Department of Health Information Management

BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1
len or totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
Green color: Used for assay sequence (observed by the submitter)
Black color: Used for flank sequence (extracted from sequence databases)
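The field layout described above maps naturally onto a small parser. This is a minimal sketch rather than an official dbSNP tool: the function name `parse_dbsnp_header` and the dictionary keys `accession`, `kind`, and `positional` are hypothetical, and the `/` separator inside `alleles` is assumed from the field description.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dict of its fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Drop empty pieces so a trailing '|' does not create a blank field.
    parts = [p for p in header[1:].strip().split("|") if p]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("not a dbSNP header")
    accession = parts[2]
    record = {
        "accession": accession,
        # ss = submitted SNP, rs = reference SNP cluster
        "kind": {"ss": "submitted", "rs": "refSNP"}.get(accession[:2], "unknown"),
    }
    positional = []
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # e.g. submitter handle and submitter SNP id on ss records
            positional.append(part)
    if positional:
        record["positional"] = positional
    for key in ("allelePos", "len", "totalLen", "taxid", "snpclass"):
        if key in record:
            record[key] = int(record[key])
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_dbsnp_header(header)

# allelePos is the 5' flank length plus 1; the variation itself counts as 1 base.
five_prime = rec["allelePos"] - 1
three_prime = rec["totalLen"] - rec["allelePos"]
print(rec["accession"], rec["kind"], five_prime, three_prime)
```

For the rs799916 header this recovers a 300-base flank on each side of the variant, consistent with allelePos=301 and totalLen=601.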

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated from the results of published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways
– Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
– Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
– conducting population-based genomic research
– assessing the role of family health history in disease risk and prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Gene Report Page gfzf

More Details: Gene Model & Product

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical

• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs, single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals

• Genetic variations influence our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals

• Genetic variations provide clues to ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
– The majority of SNPs do NOT directly contribute to any phenotype

• Insertion or deletion of one or more nucleotides
– Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeated in tandem with variable copy number)
  • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
  • Many cause genetic diseases
– Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)

• Gross chromosomal aberration
– Deletions, inversions, or translocations of large DNA fragments
– Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
– A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
– When a single base change occurs at <1%, it is considered to be a mutation

• A SNP is a polymorphic position where a point mutation has become established in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
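The frequency-based terminology can be written as a one-line rule. The sketch below is illustrative only: the function names are hypothetical, and because the slide does not say how to label a change at exactly 1%, assigning that boundary case to "mutation" is an arbitrary choice made here.

```python
SNP_THRESHOLD = 0.01  # the 1% cut-off quoted on the slide


def allele_frequency(variant_count, total_alleles):
    """Fraction of sampled chromosomes that carry the variant allele."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return variant_count / total_alleles


def classify_single_base_change(frequency):
    """Label a single-base change as 'SNP' (>1%) or 'mutation' (otherwise)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    # Assumption: exactly 1% is treated as a mutation.
    return "SNP" if frequency > SNP_THRESHOLD else "mutation"


# 1000 diploid individuals carry 2000 alleles at an autosomal locus.
common = allele_frequency(30, 2000)   # 1.5%
rare = allele_frequency(4, 2000)      # 0.2%
print(classify_single_base_change(common))
print(classify_single_base_change(rare))
```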

SNPs

• SNPs can occur anywhere in a genome; they are classified by location
– Many SNPs lie in non-coding genomic regions
– SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
  • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP depends strongly on where it occurs (gene or non-gene region) and on the nature of the mutation (synonymous or non-synonymous)
– No consequence
– Affect gene transcription, quantitatively or qualitatively
– Affect gene translation, quantitatively or qualitatively
– Change protein structure and function
– Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
– e.g., Huntington's disease, cystic fibrosis, etc.

• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
– e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Caused by a single-base substitution (an A swapped for a T) that places valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
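A single-base change can be checked against the genetic code to see whether it is synonymous or non-synonymous. The sketch below is an illustration, not part of the lecture: it builds the standard codon table and applies it to the sickle-cell change, in which the β-globin codon GAG (glutamic acid, E) becomes GTG (valine, V). The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def classify_substitution(codon, position, new_base):
    """Return (class, amino acid before, amino acid after) for a base change."""
    before = CODON_TABLE[codon]
    after = CODON_TABLE[substitute(codon, position, new_base)]
    kind = "synonymous" if before == after else "non-synonymous"
    return kind, before, after


# Sickle-cell change: the middle base of GAG goes from A to T.
print(classify_substitution("GAG", 1, "T"))
# A change in the third base of the same codon leaves the amino acid unchanged.
print(classify_substitution("GAG", 2, "A"))
```

The second call shows why location within the codon matters: many third-position changes are silent at the protein level.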

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in a different phenotypic trait

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP
– http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)
– http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project
– http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)
– http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes
– Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
  • Roughly 10 million in the human population, or on average 1 per 300 bp
  • Fewer than half of these SNPs have been identified and stored in the database
– Microsatellite repeat variations (short tandem repeats, STRs)
  • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
– Small-scale multi-base deletions or insertions
  • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
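The figures quoted in this lecture can be cross-checked with a few lines of arithmetic. This is a rough sketch that assumes a haploid human genome of about 3 billion base pairs, a round number that does not appear on the slides; the other inputs (1 SNP per 300 bp, 0.1% difference between individuals, about 90% of that due to SNPs) are taken from the slides.

```python
GENOME_BP = 3_000_000_000   # assumed genome length, not stated on the slides
SNP_SPACING_BP = 300        # "on average 1 per 300 bp"

# Expected number of SNP sites in the population.
expected_snps = GENOME_BP // SNP_SPACING_BP

# Two unrelated individuals differ at about 0.1% of positions.
differing_bases = GENOME_BP // 1000

# About 90% of those differences are single-base variations.
snp_differences = differing_bases * 90 // 100

print(f"expected SNP sites:        {expected_snps:,}")
print(f"differences between two:   {differing_bases:,}")
print(f"of which SNPs (about 90%): {snp_differences:,}")
```

Under that assumption, 1 SNP per 300 bp reproduces the "roughly 10 million" figure, and two individuals would differ at about 3 million positions.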

dbSNP Data Types

• dbSNP contains two classes of records
– Submitted records
  • The original observations of sequence variation; submitted SNP (SS) record identifiers start with ss
– Computationally annotated records
  • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) identifiers start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
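The IUPAC table translates directly into a lookup structure. The following is a minimal sketch with hypothetical function names: it expands a code into its bases, finds the code for a set of alleles, and locates ambiguity codes in a sequence fragment.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: set of bases -> single-letter code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def expand(code):
    """Return the sorted list of bases an IUPAC code can represent."""
    return sorted(IUPAC[code.upper()])


def code_for(alleles):
    """Return the IUPAC code for a collection of alleles, e.g. ['A', 'G']."""
    return CODE_FOR_BASES[frozenset(a.upper() for a in alleles)]


def ambiguous_positions(sequence):
    """Return 1-based positions holding a code other than A, C, G, or T."""
    return [i for i, base in enumerate(sequence.upper(), start=1)
            if base not in "ACGT"]


print(expand("R"))                        # bases covered by R
print(code_for("A/G".split("/")))         # code for an A/G variation
print(ambiguous_positions("CATTCAATRT"))  # fragment from the ss5586300 record
```

This is how an A/G variation appears as R, and an A/C variation as M, in the dbSNP FASTA records shown in this lecture.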

Different Ways to Search SNPs in dbSNP

• dbSNP web site
– Direct search of SS records, batch search, and SNP record submission; no search limits

• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval



Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

The Fasta header line starts with > and has fields separated by |. Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) in the Fasta sequence. It is always the 5' length plus 1
len or totalLen            Total number of bases in the Fasta sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by /
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green color is used for assay sequence (observed by the submitter)
ATCG (black)               Black color is used for flank sequence (extracted from sequence databases)
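The field layout described above can be checked with a few lines of code. The following is a minimal sketch, not dbSNP's own tooling: `parse_dbsnp_fasta_header` and `positions_are_consistent` are hypothetical helper names, and the short record at the bottom is invented for illustration. It splits the header on |, keeps the key=value fields, and verifies the two rules stated above: allelePos equals the 5' length plus 1, and the total length counts the variation as a single base.

```python
def parse_dbsnp_fasta_header(header):
    """Split a dbSNP Fasta header line into a dictionary of its fields."""
    parts = header.strip().lstrip(">").rstrip("|").split("|")
    record = {"database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:  # key=value fields such as allelePos=301
            key, value = part.split("=", 1)
            record[key] = value
    return record


def positions_are_consistent(record, five_prime, variation, three_prime):
    """Check allelePos and the total length against the three sequence parts."""
    total = record.get("totalLen") or record.get("len")
    expected_total = len(five_prime) + len(variation) + len(three_prime)
    return (
        int(record["allelePos"]) == len(five_prime) + 1
        and int(total) == expected_total
    )


# Hypothetical short record, invented for illustration only.
header = (
    ">gnl|dbSNP|rs0000000|allelePos=5|totalLen=9|taxid=9606"
    "|snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
record = parse_dbsnp_fasta_header(header)
print(record["accession"], record["alleles"])
print(positions_are_consistent(record, "ACGT", "M", "TTGA"))
```

Fields without an equals sign, such as the submitter handle and submitter SNP id on ss records, are skipped by this sketch. Real sequences printed in blocks of ten bases would need their spaces removed before the length check.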

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated from the results of published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder, or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
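As an illustration of the 10-digit variant number, the sketch below assumes the usual written form of an OMIM allelic variant: a six-digit MIM number, a period, and a four-digit variant number. That layout is not spelled out on the slide, the helper name `split_omim_variant` is hypothetical, and the identifier used is only an example of the format.

```python
import re

# Assumed layout: six-digit MIM number, a period, four-digit variant number.
OMIM_VARIANT = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(identifier):
    """Split an OMIM allelic variant number into (MIM number, variant number)."""
    match = OMIM_VARIANT.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {identifier!r}")
    return match.group(1), match.group(2)


mim_number, variant_number = split_omim_variant("141900.0243")
print(mim_number, variant_number)
```

The first part identifies the gene or disorder entry to retrieve, and the second part identifies one variant listed within that entry.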

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in databases
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper about a genetic disease (or a disease with a known genetic basis) that uses the research method named "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

Sequence Searches (BLAST)

Choosing Database, Inputting Sequence

More BLAST Options

BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical

• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)
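To put these percentages in absolute terms, here is a rough back-of-the-envelope calculation. The genome length of about 3 billion base pairs is an assumption added for illustration, not a figure from the slide, and integer arithmetic is used so the results are exact.

```python
# Assumed approximate length of the haploid human genome (not from the slide).
GENOME_LENGTH_BP = 3_000_000_000

# 0.1% of positions differ between two unrelated individuals.
differing_bp = GENOME_LENGTH_BP // 1000

# About 90% of those differences are SNPs.
snp_differences = differing_bp * 90 // 100

print(differing_bp)     # 3000000
print(snp_differences)  # 2700000
```

Under these assumptions, two unrelated genomes differ at roughly 3 million positions, about 2.7 million of which are single-base differences.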

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals

• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals

• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype

• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)

• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation

• A SNP is a polymorphic position where the point mutation has become established in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
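The frequency-based terminology amounts to a simple threshold rule. The sketch below is illustrative only: the function name `classify_single_base_change` is hypothetical, and because the slide does not say how a frequency of exactly 1% is labeled, the code assumes it falls on the mutation side.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cut-off used in the definition


def classify_single_base_change(allele_frequency):
    """Label a single base change as 'SNP' or 'mutation' by allele frequency."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: exactly 1% is treated as a mutation (the slide is silent).
    if allele_frequency > SNP_FREQUENCY_THRESHOLD:
        return "SNP"
    return "mutation"


print(classify_single_base_change(0.05))   # SNP
print(classify_single_base_change(0.001))  # mutation
```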

SNPs

• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including promoter, coding, intronic, exonic, and UTR regions
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.

• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single nucleotide change swapping an A for a T, causing the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
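The sickle cell change can be traced at the codon level. The sketch below uses a three-entry excerpt of the standard genetic code and the well-known beta-globin codon change GAG to GTG; the codons are not shown on the slide, and the helper names `substitute` and `classify_change` are hypothetical. It also contrasts the change with a synonymous one, the distinction drawn under "The Effect of SNPs".

```python
# Excerpt of the standard genetic code (DNA codons on the coding strand).
CODON_TABLE = {"GAG": "Glu", "GAA": "Glu", "GTG": "Val"}


def substitute(codon, position, base):
    """Return the codon with the base at a 0-based position replaced."""
    return codon[:position] + base + codon[position + 1:]


def classify_change(ref_codon, alt_codon):
    """Return (reference amino acid, altered amino acid, kind of change)."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


sickle_codon = substitute("GAG", 1, "T")      # A -> T in the middle position
print(sickle_codon)                           # GTG
print(classify_change("GAG", sickle_codon))   # ('Glu', 'Val', 'non-synonymous')
print(classify_change("GAG", "GAA"))          # ('Glu', 'Glu', 'synonymous')
```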

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying what genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project
  – http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs
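The two SNP figures quoted above agree with each other if the genome is taken to be about 3 billion base pairs long. That length is an assumption added here for the check, not a number from the slide.

```python
# Assumed approximate length of the haploid human genome (not from the slide).
GENOME_LENGTH_BP = 3_000_000_000

# One SNP on average every 300 bp, as stated above.
BP_PER_SNP = 300

expected_snps = GENOME_LENGTH_BP // BP_PER_SNP
print(expected_snps)  # 10000000
```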

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) accessions start with ss
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with rs
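The two record classes can be told apart from the accession prefix alone. A minimal sketch, with the hypothetical helper name `classify_accession`, applied to the two accessions that appear in this lecture:

```python
import re

# An accession is the prefix ss or rs followed by digits.
ACCESSION = re.compile(r"^(ss|rs)(\d+)$")


def classify_accession(accession):
    """Return (record class, numeric id) for a dbSNP ss or rs accession."""
    match = ACCESSION.match(accession.strip().lower())
    if match is None:
        raise ValueError(f"not an ss or rs accession: {accession!r}")
    prefix, number = match.groups()
    kind = "submitted SNP" if prefix == "ss" else "reference SNP cluster"
    return kind, int(number)


print(classify_accession("ss5586300"))  # ('submitted SNP', 5586300)
print(classify_accession("rs799916"))   # ('reference SNP cluster', 799916)
```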

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
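The code table translates directly into a lookup. The sketch below stores the table as a dictionary and converts in both directions; the helper names `expand` and `encode` are hypothetical. It reproduces the codes seen in the two example records: alleles A/C are written as M and alleles A/G as R.

```python
# IUPAC nucleotide codes mapped to the bases they stand for (sorted strings).
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code):
    """Return the list of bases represented by one IUPAC code."""
    return list(IUPAC_CODES[code.upper()])


def encode(alleles):
    """Return the IUPAC code for an allele string such as 'A/C'."""
    wanted = "".join(sorted(set(alleles.upper().replace("/", ""))))
    for code, bases in IUPAC_CODES.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles: {alleles!r}")


print(expand("M"))    # ['A', 'C']
print(encode("A/C"))  # M
print(encode("A/G"))  # R
```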

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits

• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

• gnl: internal use
• dbSNP: database name
• ss or rs number: dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs.
• allelePos: position of the variation allele (1-based) in the FASTA sequence. It is always the length of the 5' flank plus 1.
• len / totalLen: total number of bases in the FASTA sequence, i.e., the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation.
• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID.
• taxid: NCBI taxonomy ID
• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria.
• snpclass: variation class of the SNP; the most common value is 1 (single nucleotide polymorphism). Click on snpclass for details.
• alleles: lists the alleles of the SNP, separated by "/"
• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
• ATCG in green: assay sequence (observed by the submitter)
• ATCG in black: flank sequence (extracted from sequence databases)
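The field descriptions above translate directly into a small parser. This is a minimal sketch under stated assumptions: the function names `parse_snp_fasta_header` and `flank_lengths` are hypothetical, the alleles field is assumed to use "/" as its separator, and positional fields without "=" (such as the submitter handle and SNP ID) are simply collected under an `extra` key.

```python
def parse_snp_fasta_header(header: str) -> dict:
    """Split a dbSNP FASTA header into the fields described above."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = header[1:].strip().split("|")
    if len(parts) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    # "ss" marks a submitted SNP, "rs" a refSNP cluster.
    record["record_type"] = {"ss": "submitted", "rs": "refSNP"}.get(
        record["accession"][:2], "unknown"
    )
    extras = []
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        elif part:
            extras.append(part)  # e.g. handle and submitted_snp_id
    if extras:
        record["extra"] = extras
    return record


def flank_lengths(record: dict) -> tuple:
    """Derive the 5' and 3' flank lengths from allelePos and len/totalLen."""
    allele_pos = int(record["allelePos"])
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    five_prime = allele_pos - 1       # allelePos = 5' flank length + 1
    three_prime = total - allele_pos  # the variation itself counts as 1
    return five_prime, three_prime


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_snp_fasta_header(header)
print(rec["accession"], rec["record_type"], rec["alleles"].split("/"))
print(flank_lengths(rec))
```

For the rs799916 record, the derived flank lengths agree with the 300 bases shown on either side of the variation code in the sequence.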

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases:

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies.
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources.

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities.
• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included.
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance.
• Most of the variants represent disease-producing mutations, NOT polymorphisms.
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders.
• Few neutral polymorphisms are included in OMIM.
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records.

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997.
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases.
• OPHG's efforts focus on:
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology:
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease.
• Many diseases are well studied.
• Descriptions of diseases and what is known about them are stored in:
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method named "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.


More BLAST Options


BLAST Results

Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical.
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called Single Nucleotide Polymorphisms (SNPs; single-base variations).

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals.
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals.
• Genetic variations reveal clues about ancestral human migration history.

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency:
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has become established in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
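The 1% allele-frequency convention above can be written as a small rule. This is an illustrative sketch only: the function names are hypothetical, and because the slide does not say how a change at exactly 1% is classified, the code treats that case as a mutation by assumption.

```python
SNP_FREQUENCY_THRESHOLD = 0.01  # the 1% cutoff quoted on the slide


def allele_frequency(allele_count: int, chromosomes_sampled: int) -> float:
    """Fraction of sampled chromosomes that carry the variant allele."""
    if chromosomes_sampled <= 0:
        raise ValueError("chromosomes_sampled must be positive")
    if not 0 <= allele_count <= chromosomes_sampled:
        raise ValueError("allele_count must lie between 0 and the sample size")
    return allele_count / chromosomes_sampled


def classify_single_base_change(frequency: float) -> str:
    """Label a single base change as 'SNP' (>1%) or 'mutation' (otherwise)."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% falls on the 'mutation' side.
    return "SNP" if frequency > SNP_FREQUENCY_THRESHOLD else "mutation"


# Hypothetical sample: 1000 diploid individuals = 2000 chromosomes.
common = allele_frequency(30, 2000)  # 1.5%
rare = allele_frequency(4, 2000)     # 0.2%
print(common, classify_single_base_change(common))
print(rare, classify_single_base_change(rare))
```

As the last bullet of the slide notes, databases do not enforce this distinction strictly, so the label is a terminology aid rather than a filter that dbSNP itself applies.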

SNPs

• SNPs can occur anywhere in a genome; they are classified by location:
  – Many SNPs lie in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTRs, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous):
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and function
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single nucleotide change, swapping an A for a T, which causes the amino acid inserted into hemoglobin to be valine instead of glutamic acid

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells
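The single-base swap behind sickle cell anemia can be traced through the genetic code. In the standard genetic code, the coding-strand codon GAG encodes glutamic acid and GTG encodes valine, so changing the middle A to a T changes the amino acid. The sketch below is illustrative: the codon table is deliberately partial, and the helper name `point_substitution` is hypothetical.

```python
# Partial codon table (coding-strand DNA, standard genetic code):
# only the two codons needed for this example are listed.
CODON_TO_AMINO_ACID = {
    "GAG": "glutamic acid",
    "GTG": "valine",
}


def point_substitution(codon: str, index: int, new_base: str) -> str:
    """Return the codon with the base at a 0-based index replaced."""
    if not 0 <= index < len(codon):
        raise IndexError("index is outside the codon")
    return codon[:index] + new_base + codon[index + 1:]


normal_codon = "GAG"
# Swap the middle base: A -> T.
sickle_codon = point_substitution(normal_codon, 1, "T")

print(normal_codon, "->", CODON_TO_AMINO_ACID[normal_codon])
print(sickle_codon, "->", CODON_TO_AMINO_ACID[sickle_codon])
```

This is an example of a non-synonymous change: one base differs, and the encoded amino acid differs as a result.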

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes:
  – Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Fewer than half of these SNPs have been identified and stored in the database
  – Microsatellite repeat variations (or short tandem repeats, STRs)
    • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
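The SNP counts quoted above follow from simple arithmetic. The sketch below is a back-of-the-envelope check, not a result from the lecture: it assumes a haploid human genome of about 3 billion base pairs (a figure not stated on the slides) and combines it with the 1-per-300-bp density given here and the 99.9% identity and ~90% SNP share given on the Polymorphisms slide.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3 billion bp haploid genome
BP_PER_SNP = 300                # slide: on average 1 SNP per 300 bp
PAIRWISE_IDENTITY = 0.999       # slide: two individuals are 99.9% identical
SNP_SHARE = 0.90                # slide: ~90% of the variation is SNPs

# SNP positions across the population implied by the 1-per-300-bp density.
population_snps = GENOME_SIZE_BP // BP_PER_SNP

# Positions that differ between two unrelated individuals (the 0.1%).
pairwise_differences = round(GENOME_SIZE_BP * (1 - PAIRWISE_IDENTITY))

# Portion of those differences that are single-base variations.
pairwise_snp_differences = round(pairwise_differences * SNP_SHARE)

print(f"SNP positions in the population: {population_snps:,}")
print(f"Differences between two people:  {pairwise_differences:,}")
print(f"Of which SNPs (~90%):            {pairwise_snp_differences:,}")
```

Under the assumed genome size, the 1-per-300-bp density reproduces the "roughly 10 million" figure on the slide.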

dbSNP Data Types

• dbSNP contains two classes of records:
  – Submitted records
    • The original observations of sequence variation; submitted SNP (SS) record accessions start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; Reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code: Meaning
A: A
C: C
G: G
T: T
M: A or C
R: A or G
W: A or T
S: C or G
Y: C or T
K: G or T
V: A or C or G
H: A or C or T
D: A or G or T
B: C or G or T
N: G or A or T or C
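The IUPAC table above maps naturally onto a lookup in both directions: from an ambiguity code to the bases it stands for, and from a set of alleles to the single code used in a dbSNP FASTA sequence. The following is a minimal sketch; the names `IUPAC_CODES`, `expand_code`, and `code_for_alleles` are hypothetical.

```python
# IUPAC nucleotide codes and the bases each one represents.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

# Reverse lookup: an unordered set of bases gives its single-letter code.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def expand_code(code: str) -> list:
    """Return the sorted list of bases an IUPAC code stands for."""
    return sorted(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles) -> str:
    """Return the IUPAC code that represents the given alleles."""
    return BASES_TO_CODE[frozenset(allele.upper() for allele in alleles)]


print(expand_code("M"))                    # bases behind the code M
print(code_for_alleles("A/G".split("/")))  # code for an A/G variation
print(code_for_alleles(["C", "T"]))        # code for a C/T variation
```

This is why a record with alleles A and G shows the letter R at its allelePos, and a record with alleles A and C shows M.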

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records and batch search; allows SNP record submission. No search limits.
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval



BLAST Results

Genetic Variations

Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Field                      Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len or totalLen            Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid                      NCBI taxonomy ID
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
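The field layout described above can be turned into a small parser. The following is a minimal sketch, assuming the header uses exactly the "|"-separated key=value layout shown on the slides; the function names parse_dbsnp_header and flank_lengths are hypothetical and are not part of any NCBI tool.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP-style FASTA header into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    parts = [p.strip() for p in header[1:].split("|") if p.strip()]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("not a dbSNP-style header")

    accession = parts[2]
    if accession.startswith("rs"):
        record_type = "refSNP cluster"
    elif accession.startswith("ss"):
        record_type = "submitted SNP"
    else:
        record_type = "unknown"

    record = {"accession": accession, "record_type": record_type, "extra": []}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value.strip("'\"")
        else:
            # Fields without '=' (e.g. submitter handle, submitted SNP ID)
            record["extra"].append(part)

    # Numeric fields named on the slide are converted to integers.
    for key in ("allelePos", "len", "totalLen", "taxid", "snpclass", "build"):
        if key in record:
            record[key] = int(record[key])
    if "alleles" in record:
        record["alleles"] = record["alleles"].split("/")
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length); the variation counts as 1 base."""
    total = record["totalLen"] if "totalLen" in record else record["len"]
    five_prime = record["allelePos"] - 1
    three_prime = total - record["allelePos"]
    return five_prime, three_prime


rs = parse_dbsnp_header(
    ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
    "snpclass=1|alleles=A/C|mol=Genomic|build=130"
)
print(rs["accession"], rs["alleles"], flank_lengths(rs))
```

For the rs799916 record, the two flank lengths plus one base for the variation add up to totalLen, which matches the rule that allelePos is the 5' flank length plus 1.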

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "genome-wide association study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Genetic Variations

Polymorphisms

• Genomic sequences from two unrelated individuals are 99.9% identical
• The 0.1% difference is due to genetic variations, mainly (~90%) one form of variation called single nucleotide polymorphisms (SNPs; single-base variations)

Importance of Genetic Variations

• Genetic variations underlie phenotypic differences among individuals
• Genetic variations determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals
• Genetic variations reveal clues about ancestral human migration history

Major Types of Genetic Variations

• Single nucleotide mutation
  – The majority of SNPs do NOT directly contribute to any phenotype
• Insertion or deletion of one or more nucleotides
  – Tandem repeat polymorphisms (genomic regions consisting of sequence motifs of variable length, usually 1-100 bases long, repeating in tandem with variable copy number)
    • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
    • Many cause genetic diseases
  – Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)
• Gross chromosomal aberration
  – Deletions, inversions, or translocations of large DNA fragments
  – Often cause serious genetic diseases

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
  – A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
  – When a single base change occurs at <1%, it is considered to be a mutation
• A SNP is a polymorphic position where the point mutation has been fixed in the population
• In practice, however, SNP databases contain multiple types of variations, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
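The 1% frequency rule above can be written as a one-line classifier. This is a minimal sketch: the function name is hypothetical, and because the slide does not say how to label a change at exactly 1%, treating that boundary as a SNP is an assumption made here.

```python
def classify_single_base_change(allele_frequency):
    """Label a single base change by its population allele frequency (0.0 to 1.0)."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # Assumption: a frequency of exactly 1% is counted as a SNP.
    return "SNP" if allele_frequency >= 0.01 else "mutation"


print(classify_single_base_change(0.05))   # common variant
print(classify_single_base_change(0.001))  # rare variant
```

As the last bullet notes, records in SNP databases are not filtered by this rule, so the label describes terminology rather than what a database will contain.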

SNPs

• SNPs can occur anywhere in a genome; they are classified based on their locations
  – Many SNPs are in genomic non-coding regions
  – SNPs in gene regions, including the promoter region, coding region, intronic and exonic regions, UTR, etc.
    • Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
  – No consequence
  – Affect gene transcription quantitatively or qualitatively
  – Affect gene translation quantitatively or qualitatively
  – Change protein structure and functions
  – Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
  – e.g., Huntington's disease, cystic fibrosis, etc.
• Many complex diseases are the result of mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
  – e.g., cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Caused by a single nucleotide substitution (an A replaced by a T) that changes the encoded amino acid in hemoglobin from glutamic acid to valine

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure: 1. Normal red blood cells; 2. Sickled red blood cells
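A single-base substitution of this kind can be checked against the standard genetic code. The sketch below assumes the commonly cited textbook codon change GAG to GTG for the sickle cell variant, which is not stated on the slide; the function name substitution_effect is hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def substitution_effect(codon, position, new_base):
    """Apply a single-base change (0-based position) to a coding-strand DNA codon."""
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


# Assumed example: A to T at the middle position of GAG (glutamic acid, E).
print(substitution_effect("GAG", 1, "T"))
# A change at the third position of the same codon leaves the amino acid unchanged.
print(substitution_effect("GAG", 2, "A"))
```

The two calls illustrate the distinction made in "The Effect of SNPs": the first change alters the encoded amino acid, while the second does not.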

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying what genotype a person has for any given locus (or loci)

Genetic Variations Databases

• dbSNP
  – http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
  – http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
  – http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
  – http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
  – Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
    • Roughly 10 million in the human population, or on average 1 per 300 bp
    • Less than half of these SNPs are identified and stored in the database
  – Microsatellite repeat variations (short tandem repeats, STRs)
    • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
  – Small-scale multi-base deletions or insertions
    • Short insertions/deletions are difficult to quantify; their number is likely to fall between those of SNPs and VNTRs
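The figures quoted on the slides are consistent with each other, which a short calculation can show. This sketch assumes a haploid human genome of roughly 3 billion base pairs, a value not given on the slides; the function names are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid human genome size
BP_PER_SNP = 300                # from the slide: on average 1 SNP per 300 bp


def expected_snp_count():
    """SNPs expected across the genome at 1 per 300 bp."""
    return GENOME_SIZE_BP // BP_PER_SNP


def pairwise_snp_differences():
    """Differences between two unrelated individuals that are SNPs.

    Uses the 0.1% sequence difference and the ~90% SNP share quoted
    on the Polymorphisms slide.
    """
    differing_bases = GENOME_SIZE_BP // 1000  # 0.1% of the genome
    return differing_bases * 90 // 100        # about 90% of those are SNPs


print(expected_snp_count())        # 10000000
print(pairwise_snp_differences())  # 2700000
```

Under that assumption, 1 SNP per 300 bp gives the slide's figure of roughly 10 million SNPs in the population, while any two individuals differ at about 3 million positions, most of them SNPs.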

dbSNP Data Types

• dbSNP contains two classes of records
  – Submitted records
    • The original observations of sequence variation; accessions of submitted SNP (SS) records start with "ss"
  – Computationally annotated records
    • Generated during the dbSNP build cycle by computation based on the original submitted data; accessions of reference SNP clusters (refSNP) start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC
ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA
CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA
TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT
TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA
GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT
AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA
ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG
AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT
TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
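The IUPAC table maps directly onto a lookup dictionary. The following is a minimal sketch; the names IUPAC_CODES, expand_code, and code_for_alleles are hypothetical helpers, and the "A/G" allele notation follows the dbSNP header format shown in these slides.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def expand_code(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC_CODES[code.upper()])


def code_for_alleles(alleles):
    """Return the IUPAC code for alleles given as 'A/G' or as a list of bases."""
    if isinstance(alleles, str):
        alleles = alleles.split("/")
    wanted = {a.upper() for a in alleles}
    for code, bases in IUPAC_CODES.items():
        if set(bases) == wanted:
            return code
    raise ValueError("no IUPAC code for alleles: %r" % sorted(wanted))


print(expand_code("R"))          # the two bases behind the R in the ss record
print(code_for_alleles("A/C"))   # the code shown at the variant position of rs799916
```

This is why the ss5586300 record (alleles A/G) shows R at the variant position and the rs799916 record (alleles A/C) shows M.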

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
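The same query strings can be sent programmatically through NCBI's E-utilities esearch endpoint instead of the web form. The sketch below only builds the request URL and makes no network call; the helper name build_snp_search_url is hypothetical, and the field tags are copied from the slide, so whether each tag is still accepted by the current dbSNP index is an assumption to verify.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities search endpoint (db, term, and retmax are standard parameters).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, retmax=20):
    """Build an esearch URL for the dbSNP database without sending it."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Field tags below are taken from the slide's examples.
url = build_snp_search_url("1[CHR] AND (frameshift[Function_Class])")
print(url)

# Decoding the URL recovers the original query text.
print(parse_qs(urlparse(url).query)["term"][0])
```

Building the URL separately from fetching it keeps the example runnable offline; the URL can then be opened in a browser or passed to any HTTP client.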



Department of Health Information Management

Polymorphismsbull Genomic sequences from two unrelated

individuals are 999 identical

bull The 01 difference is due to genetic variations and mainly (~90) one form of variation called Single Nucleotide Polymorphisms (SNPs single-base variations)

Department of Health Information Management

Importance of Genetic Variationsbull Genetic variations underlie phenotypic differences

among different individuals

bull Genetic variations determine our predisposition to diseases and responses to drugs therapies and environmental insults such as bacteria virus and chemicals

bull Genetic variations reveal clues of ancestral human migration history

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use

dbSNP: Database name

ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs

allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1

len / totalLen: Total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id

taxid: NCBI taxonomy id

mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

snpclass: Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)

alleles: Lists the alleles of the SNP, separated by /

Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

Green letters: assay sequence (observed by the submitter)

Black letters: flank sequence (extracted from sequence databases)
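The field descriptions above can be turned into a small parser. The following is a minimal sketch, not NCBI's own tooling: the function names `parse_dbsnp_header` and `flank_lengths` are hypothetical, and the sketch assumes the alleles are written with a `/` separator and that every field after the accession is either `key=value` or a bare positional value.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP-style FASTA header line into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    parts = [p.strip() for p in header[1:].split("|") if p.strip()]
    if len(parts) < 3:
        raise ValueError("expected at least gnl|dbSNP|accession")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    # ss = submitted SNP, rs = refSNP cluster (per the field table above).
    kinds = {"ss": "submitted SNP", "rs": "refSNP cluster"}
    record["record_type"] = kinds.get(parts[2][:2], "unknown")
    extras = []
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            value = value.strip("'\"")
            # Numeric fields (allelePos, totalLen, taxid, ...) become integers.
            record[key] = int(value) if value.isdigit() else value
        else:
            # Bare values, e.g. handle and submitted_snp_id on ss records.
            extras.append(part)
    record["extras"] = extras
    return record


def flank_lengths(record: dict) -> tuple:
    """Return (5' flank length, 3' flank length); the variation counts as 1 base."""
    total = record.get("totalLen", record.get("len"))
    five_prime = record["allelePos"] - 1
    three_prime = total - record["allelePos"]
    return five_prime, three_prime


if __name__ == "__main__":
    example = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
               "snpclass=1|alleles=A/C|mol=Genomic|build=130")
    rec = parse_dbsnp_header(example)
    print(rec["accession"], rec["record_type"], flank_lengths(rec))
```

For the rs799916 header, allelePos=301 and totalLen=601 imply 300 flanking bases on each side of the variation, which matches the 30 blocks of 10 bases shown on either side of the M code in the sequence.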

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes
Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated from the results of published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
– description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways:
– Search for a gene or a disease; when it is retrieved, view its variants
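As a small illustration of the 10-digit variant number, the sketch below splits such an identifier into its parts. It assumes the usual written form of a six-digit OMIM entry number, a period, and a four-digit variant suffix (the period is not counted among the ten digits). The function name `split_omim_variant` and the identifier `123456.0001` are made up for illustration and do not refer to a real record.

```python
import re


def split_omim_variant(variant_id: str) -> tuple:
    """Split 'NNNNNN.NNNN' into (entry number, variant suffix).

    Assumed layout: six digits for the parent OMIM entry, then a period,
    then four digits numbering the allelic variant within that entry.
    """
    match = re.fullmatch(r"(\d{6})\.(\d{4})", variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Hypothetical identifier, used only to show the two parts.
    entry, variant = split_omim_variant("123456.0001")
    print(entry, variant)
```

The entry number alone retrieves the gene or disease record, which is the search route described in the bullet above.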

Variants in OMIM Records
• For most genes, only selected mutations are included
– Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on:
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in:
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method named "Genome-Wide Association Study" or "GWAS". The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
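The FASTA step of the homework can also be done programmatically rather than through the web pages. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service and does not contact the network. The helper name `genbank_fasta_url` is hypothetical, and the accession `NM_007294` is used purely as an illustrative value; substitute the accession of the gene chosen from your paper.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    params = {
        "db": db,             # nucleotide records; use "protein" for proteins
        "id": accession,      # accession or GI of the record
        "rettype": "fasta",   # return the sequence in FASTA format
        "retmode": "text",    # plain text rather than XML
    }
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(genbank_fasta_url("NM_007294"))
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the record's sequence, which can then be saved with a `.fasta` extension.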


Importance of Genetic Variations
• Genetic variations underlie phenotypic differences among individuals

• Genetic variations help determine our predisposition to diseases and our responses to drugs, therapies, and environmental insults such as bacteria, viruses, and chemicals

• Genetic variations reveal clues to ancestral human migration history

Major Types of Genetic Variations
• Single nucleotide mutation
– The majority of SNPs do NOT directly contribute to any phenotype

• Insertion or deletion of one or more nucleotides
– Tandem repeat polymorphisms (genomic regions consisting of sequence motifs, usually 1-100 bases long, repeating in tandem with variable copy number)
  • Used as genetic markers for DNA fingerprinting (forensics, parentage testing)
  • Many cause genetic diseases
– Insertion/deletion polymorphisms (often resulting from localized rearrangements between homologous tandem repeats)

• Gross chromosomal aberration
– Deletions, inversions, or translocations of large DNA fragments
– Often cause serious genetic diseases

SNPs and Mutations
• Terminology for variation at a single nucleotide position is defined by allele frequency
– A single base change occurring in a population at a frequency greater than 1% is termed a single nucleotide polymorphism (SNP)
– When a single base change occurs at a frequency less than 1%, it is considered to be a mutation

• A SNP is a polymorphic position where a point mutation has become established in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
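The frequency rule above can be written as a one-line test. This is a minimal sketch with a hypothetical function name; the slide leaves the case of exactly 1% undefined, so treating it as a SNP here is an assumption.

```python
def classify_single_base_change(allele_frequency: float,
                                threshold: float = 0.01) -> str:
    """Label a single base change by its population allele frequency.

    Frequencies are fractions (0.01 means 1%). A change at exactly the
    threshold is labelled 'SNP', which is a choice made for this sketch.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return "SNP" if allele_frequency >= threshold else "mutation"


if __name__ == "__main__":
    for freq in (0.25, 0.01, 0.002):
        print(freq, classify_single_base_change(freq))
```

As the last bullet notes, database membership does not follow this rule strictly, so the label describes the terminology rather than what dbSNP actually stores.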

SNPs
• SNPs can occur anywhere on a genome; they are classified based on their locations
– Many SNPs are in genomic non-coding regions
– SNPs in gene regions, including promoter region, coding region, intronic and exonic regions, UTR, etc.
  • Often play an important role in differentiation and disease

The Effect of SNPs
• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
– No consequence
– Affect gene transcription quantitatively or qualitatively
– Affect gene translation quantitatively or qualitatively
– Change protein structure and functions
– Change gene regulation at different steps
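To make the synonymous versus non-synonymous distinction concrete, the sketch below translates a reference codon and a variant codon with the standard genetic code and compares the resulting amino acids. The function name `classify_coding_snp` is hypothetical, and the sketch assumes both codons are given on the coding (sense) strand and differ at exactly one base.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> tuple:
    """Return (label, reference amino acid, variant amino acid)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be three bases long")
    differences = sum(a != b for a, b in zip(ref_codon, alt_codon))
    if differences != 1:
        raise ValueError("codons must differ at exactly one base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    # Same amino acid: the protein sequence is unchanged.
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa


if __name__ == "__main__":
    print(classify_coding_snp("GAA", "GAG"))  # both codons encode glutamic acid
    print(classify_coding_snp("GAG", "GTG"))  # glutamic acid to valine
```

The GAG to GTG change is the A-to-T substitution in the beta-globin gene associated with sickle cell anemia, a case where one base changes the encoded amino acid.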

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
– e.g. Huntington's disease, cystic fibrosis, etc.

• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
– e.g. cancers, heart diseases, Alzheimer's disease, diabetes, asthma, obesity, etc.

Sickle Cell Anemia
• Due to a single nucleotide change: swapping an A for a T causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1. Normal red blood cells 2. Sickled red blood cells

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or one of the alternative DNA sequences at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying what genotype a person has at any given locus (or loci)

Genetic Variations Databases
• dbSNP
– http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)
– http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project
– http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)
– http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes:
– Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
  • Roughly 10 million in the human population, or on average 1 per 300 bp
  • Less than half of these SNPs are identified and stored in the database
– Microsatellite repeat variations (or short tandem repeats, STRs)
  • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
– Small-scale multi-base deletions or insertions
  • Short insertions/deletions are difficult to quantify, and their number is likely to fall between those of SNPs and VNTRs
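The two SNP figures on this slide are consistent with each other, which a short calculation shows. The sketch assumes a round haploid human genome size of 3 billion base pairs; that number is not stated on the slide.

```python
# Assumed round figure for the haploid human genome, in base pairs.
HUMAN_GENOME_BP = 3_000_000_000


def expected_snp_count(genome_bp: int, bp_per_snp: int) -> int:
    """Number of SNPs expected if one occurs, on average, every bp_per_snp bases."""
    return genome_bp // bp_per_snp


if __name__ == "__main__":
    # One SNP per 300 bp across 3 billion bp gives the slide's 10 million.
    print(expected_snp_count(HUMAN_GENOME_BP, 300))
```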

dbSNP Data Types
• dbSNP contains two classes of records
– Submitted records
  • The original observations of sequence variation: submitted SNP (SS) records, with accessions starting with ss
– Computationally annotated records
  • Generated during the dbSNP build cycle by computation based on the original submitted data: Reference SNP clusters (refSNP), with accessions starting with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
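The IUPAC table maps directly onto a lookup dictionary. The sketch below expands an ambiguity code into its bases and, in the other direction, finds the code for a set of alleles. The function names are hypothetical, and the allele string is assumed to use `/` as the separator.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: a set of bases back to its single-letter code.
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def expand_iupac(code: str) -> set:
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC[code.upper()])


def iupac_code_for(alleles: str) -> str:
    """Return the IUPAC code for alleles written like 'A/G'."""
    bases = frozenset(a.strip().upper() for a in alleles.split("/"))
    return CODE_FOR_BASES[bases]


if __name__ == "__main__":
    print(sorted(expand_iupac("R")))
    print(iupac_code_for("A/C"))
```

This is how the variation appears inside a dbSNP FASTA sequence: a record whose alleles are A and G carries an R at the allele position, and one whose alleles are A and C carries an M.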

Different Ways to Search SNPs in dbSNP

• dbSNP web site
– Allows direct search of SS records, batch search, and SNP record submission. No search limits

• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (i.e. BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frame-shift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
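Queries like the examples above can be assembled from small helper functions, which helps avoid typos in field tags and makes the grouping of operators explicit. This is a sketch of plain string composition only; the helper names are hypothetical, and the field tags (CHR, Function_Class, METHOD) are the ones shown in the examples.

```python
def field(value: str, tag: str) -> str:
    """Attach an Entrez field tag to a search value, e.g. 1[CHR]."""
    return f"{value}[{tag}]"


def all_of(*terms: str) -> str:
    """Combine terms so that every one must match."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Combine terms so that at least one must match; parenthesised for clarity."""
    return "(" + " OR ".join(terms) + ")"


def exclude(term: str, unwanted: str) -> str:
    """Keep matches of term while dropping those that also match unwanted."""
    return f"{term} NOT {unwanted}"


if __name__ == "__main__":
    print(all_of(field("1", "CHR"), field("frameshift", "Function_Class")))
    print(exclude(any_of(field("1", "CHR"), field("2", "CHR")),
                  field("unknown", "METHOD")))
```

The second query adds parentheses around the OR group that the slide's version leaves implicit; the resulting string can be pasted into the Entrez SNP search box.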


Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Major Types of Genetic Variationsbull Single nucleotide mutation

ndash Majority of SNPs do NOT directly contribute to any phenotypes

bull Insertion or deletion of one or more nucleotidesndash Tandem repeat polymorphisms (Genomic regions consisting of

variable length usually 1-100 bases long of sequence motifs repeating in tandem with variable copy number)

bull Used as genetic markers for DNA finger printing (forensic parentage testing)

bull Many cause genetic diseases

ndash InsertionDeletion polymorphisms (Often resulted from localized rearrangements between homologous tandem repeats)

bull Gross chromosomal aberrationndash Deletions inversions or translocation of large DNA fragments

ndash Often causing serious genetic diseases

Department of Health Information Management

SNPs and Mutationsbull Terminology for variation at a single nucleotide

position is defined by allele frequencyndash A single base change occurring in a population at a

frequency of gt1 is termed a single nucleotide polymorphism (SNP)

ndash When a single base change occurs at lt1 it is considered to be a mutation

bull A SNP is a polymorphic position where the point mutation has been fixed in the population

bull In practice however SNPs databases contains multiple types of variations including SNPs mutations insertions deletions tandem repeats copy number variations etc

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number (the six-digit MIM number of the parent entry followed by a four-digit variant number) and can be searched in two ways
– Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included
– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper related to a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

SNPs and Mutations

• Terminology for variation at a single nucleotide position is defined by allele frequency
– A single base change occurring in a population at a frequency of >1% is termed a single nucleotide polymorphism (SNP)
– When a single base change occurs at <1%, it is considered to be a mutation

• A SNP is a polymorphic position where the point mutation has been fixed in the population

• In practice, however, SNP databases contain multiple types of variation, including SNPs, mutations, insertions, deletions, tandem repeats, copy number variations, etc.
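The frequency-based terminology above can be illustrated with a small calculation. This is a minimal Python sketch under stated assumptions: the genotype counts are invented for illustration, the function names are hypothetical, and a frequency of exactly 1% (which the definition leaves open) is treated here as a mutation.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return allele frequencies from diploid genotypes such as 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # each character is one allele
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def classify_by_frequency(frequency, threshold=0.01):
    """Apply the 1% convention: above the threshold is a SNP."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    return "SNP" if frequency > threshold else "mutation"


# Invented sample of 100 individuals genotyped at one position.
sample = ["AA"] * 90 + ["AG"] * 9 + ["GG"] * 1
freqs = allele_frequencies(sample)
minor_allele = min(freqs, key=freqs.get)
print(minor_allele, freqs[minor_allele], classify_by_frequency(freqs[minor_allele]))
```

In this sample the G allele accounts for 11 of 200 chromosomes (5.5%), so the position would be called a SNP under the 1% convention.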

SNPs

• SNPs can occur anywhere on a genome; they are classified based on their locations
– Many SNPs are in genomic non-coding regions
– SNPs in gene regions, including promoter region, coding region, intronic region, exonic region, UTR, etc.

• Often play an important role in differentiation and disease

The Effect of SNPs

• The phenotypic consequence of a SNP is significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)
– No consequence
– Affect gene transcription quantitatively or qualitatively
– Affect gene translation quantitatively or qualitatively
– Change protein structure and functions
– Change gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs

• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
– e.g., Huntington's, cystic fibrosis, etc.

• Many complex diseases are the result of mutations in multiple genes, the interactions among them, and their interactions with environmental factors
– e.g., cancers, heart diseases, Alzheimer's, diabetes, asthma, obesity, etc.

Sickle Cell Anemia

• Due to a single nucleotide change, swapping an A for a T, which causes the inserted amino acid to be valine instead of glutamic acid in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells; 2. Sickled red blood cells
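The effect of a single-base change on the encoded amino acid can be checked by translating the codon before and after the change. The sketch below is illustrative only: it builds the standard genetic code table, and the GAG to GTG codon pair is taken from general knowledge of the sickle-cell substitution in beta-globin rather than from this slide. The function name `classify_codon_change` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(reference, variant):
    """Return (reference amino acid, variant amino acid, effect)."""
    ref_aa = CODON_TABLE[reference.upper()]
    var_aa = CODON_TABLE[variant.upper()]
    effect = "synonymous" if ref_aa == var_aa else "non-synonymous"
    return ref_aa, var_aa, effect


# A-to-T change in the middle of the codon: glutamic acid (E) becomes valine (V).
print(classify_codon_change("GAG", "GTG"))
# A change in the third position that leaves the amino acid unchanged.
print(classify_codon_change("GAA", "GAG"))
```

The same one-letter comparison distinguishes the synonymous and non-synonymous cases mentioned in the discussion of SNP effects.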

A Few Relevant Concepts

• Allele: A specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying what genotype a person has for any given locus (loci)

Genetic Variations Databases

• dbSNP
– http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)
– http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project
– http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)
– http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes
– Single-base nucleotide substitutions (or single nucleotide polymorphisms, SNPs)
  • Roughly 10 million in the human population, or on average 1 per 300 bp
  • Less than half of these SNPs are identified and stored in the database
– Microsatellite repeat variations (or short tandem repeats, STRs)
  • In silico estimates of potentially polymorphic variable number tandem repeats (VNTRs) exceed 100,000 across the human genome
– Small-scale multi-base deletions or insertions
  • The short insertions/deletions are difficult to quantify, and the number is likely to fall between those of SNPs and VNTRs

dbSNP Data Types

• dbSNP contains two classes of records
– Submitted record
  • The original observations of sequence variation; submitted SNP (ss) record accessions start with "ss"
– Computationally annotated record
  • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with "rs"

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
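The IUPAC codes above map naturally onto a lookup table. The following is a minimal Python sketch (the names `IUPAC`, `expand`, and `code_for` are hypothetical) that expands an ambiguous sequence into the concrete sequences it represents and finds the code for a set of alleles written with an assumed "/" separator.

```python
from itertools import product

# Each IUPAC nucleotide code and the bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence):
    """List every unambiguous sequence an IUPAC-coded sequence can represent."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def code_for(alleles):
    """Return the IUPAC code for alleles written like 'A/G'."""
    wanted = "".join(sorted(set(alleles.upper().replace("/", ""))))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(expand("CCARCT"))   # the R position becomes A or G
print(code_for("A/G"))    # the code used at the variant position
print(code_for("A/C"))
```

Note that the number of expanded sequences grows multiplicatively with each ambiguous position, so `expand` is only practical for short stretches around a variant.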

Different Ways to Search SNPs in dbSNP

• dbSNP web site
– Direct search of ss records, batch search, and SNP record submission. No search limits

• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (i.e., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frame-shift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
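Queries of this form can also be assembled programmatically. The sketch below is a hedged illustration: `term`, `combine`, and `esearch_url` are hypothetical helper names, "*" is assumed to be the wild-card character, and the field tags are copied from the examples in this lecture and may not match the search fields that dbSNP currently supports. The code only builds an NCBI E-utilities ESearch URL and does not contact the server.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value, field):
    """Format one search term with its field tag, e.g. 1[CHR]."""
    return f"{value}[{field}]"


def combine(operator, left, right):
    """Join two query fragments with AND, OR, or NOT."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return f"{left} {operator} {right}"


def esearch_url(query, db="snp"):
    """Build (but do not fetch) an ESearch URL for the query."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': query})}"


chr1_or_chr2 = combine("OR", term("1", "CHR"), term("2", "CHR"))
queries = [
    term("BRC*", "Gene Name"),
    combine("AND", term("1", "CHR"), "(" + term("frameshift", "Function_Class") + ")"),
    chr1_or_chr2,
    combine("NOT", chr1_or_chr2, term("unknown", "METHOD")),
]
for query in queries:
    print(query)
    print(esearch_url(query))
```

Building the query as a string first keeps it easy to paste into the Entrez search box for a manual check before any automated retrieval is attempted.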

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer

• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130

AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

SNPsbull SNPs can occur anywhere on a genome they are

classified based on their locationsndash Many SNPs in genomic non-coding regions

ndash SNPs in gene regions including promoter region coding region intronic exonic regioin UTR etc

bull Often play an important role in differentiation and disease

Department of Health Information Management

The Effect of SNPsbull The phenotypic consequence of a SNP is

significantly affected by the location where it occurs (gene or non-gene) as well as the nature of the mutation (synonymous or non-synonymous)ndash No consequence

ndash Affect gene transcription quantitatively or qualitatively

ndash Affect gene translation quantitatively or qualitatively

ndash Change protein structure and functions

ndash Change gene regulation at different steps

Department of Health Information Management

SimpleComplex Genetic Diseases and SNPsbull Simple genetic diseases (Mendelian diseases) are

often caused by mutations in a single genendash eg Huntingtonrsquos Cystic fibrosis etc

bull Many complex diseases are the result of mutations in multiple genes the interactions among them as well as between the environmental factorsndash eg cancers heart diseases Alzheimers diabetes

asthmas obesity etc

Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs


The Effect of SNPs
• The phenotypic consequence of a SNP depends strongly on where it occurs (within a gene or outside one) and on the nature of the mutation (synonymous or non-synonymous)
– No consequence
– Affects gene transcription, quantitatively or qualitatively
– Affects gene translation, quantitatively or qualitatively
– Changes protein structure and function
– Changes gene regulation at different steps

Simple/Complex Genetic Diseases and SNPs
• Simple genetic diseases (Mendelian diseases) are often caused by mutations in a single gene
– e.g., Huntington's disease, cystic fibrosis
• Many complex diseases result from mutations in multiple genes, the interactions among those genes, and their interactions with environmental factors
– e.g., cancers, heart disease, Alzheimer's disease, diabetes, asthma, obesity

Sickle Cell Anemia
• Caused by a single nucleotide substitution (an A swapped for a T) that places valine instead of glutamic acid in the beta chain of hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

Figure labels: 1. Normal red blood cells 2. Sickled red blood cells
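The difference between a synonymous and a non-synonymous change can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` is hypothetical, and the codons are chosen for the example (GAG to GTG is the well-known sickle cell change from glutamic acid, E, to valine, V; GAA to GAG is a silent change).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base, then second, then third).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change as synonymous, nonsense, or non-synonymous."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":  # the change introduces a stop codon
        return "nonsense"
    return "non-synonymous"


# Sickle cell example: the middle base A becomes T.
print(CODON_TABLE["GAG"], "->", CODON_TABLE["GTG"])
print(classify_substitution("GAG", "GTG"))
# A third-position change that leaves the amino acid unchanged.
print(classify_substitution("GAA", "GAG"))
```

The table is built from a 64-character string rather than typed out codon by codon, which keeps the sketch short; a real analysis would normally rely on an established library instead.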

A Few Relevant Concepts
• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits
• Genotype: the genetic constitution of a cell, an organism, or an individual
• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)
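To make the allele and genotype definitions concrete, here is a small Python sketch that lists the diploid genotypes possible at a locus with known alleles. The function names are hypothetical, and the sketch assumes a diploid organism and ignores phase (AG and GA are treated as the same genotype).

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """Return the unordered diploid genotypes for the alleles at one locus."""
    return [
        "".join(pair)
        for pair in combinations_with_replacement(sorted(alleles), 2)
    ]


def is_heterozygous(genotype):
    """A diploid genotype is heterozygous when its two alleles differ."""
    return genotype[0] != genotype[1]


# A biallelic SNP with alleles A and G has three possible genotypes.
for genotype in possible_genotypes(["A", "G"]):
    label = "heterozygous" if is_heterozygous(genotype) else "homozygous"
    print(genotype, label)
```

Genotyping, in these terms, is the experimental step that determines which of the listed genotypes a given person carries.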

Genetic Variations Databases
• dbSNP
– http://www.ncbi.nlm.nih.gov/SNP
• Online Mendelian Inheritance in Man (OMIM)
– http://www.ncbi.nlm.nih.gov/omim
• International HapMap Project
– http://www.hapmap.org
• Genome Variation Server (Seattle SNPs)
– http://gvs.gs.washington.edu/GVS

dbSNP
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations
• This collection of polymorphisms includes
– Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)
  • Roughly 10 million in the human population, or on average 1 per 300 bp
  • Fewer than half of these SNPs have been identified and stored in the database
– Microsatellite repeat variations (short tandem repeats, STRs)
  • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome
– Small-scale multi-base deletions or insertions
  • Short insertions/deletions are difficult to quantify; their number likely falls between those of SNPs and VNTRs

dbSNP Data Types
• dbSNP contains two classes of records
– Submitted records
  • The original observations of sequence variation; submitted SNP (SS) record IDs start with ss
– Computationally annotated records
  • Generated computationally during the dbSNP build cycle from the original submitted data; Reference SNP cluster (refSNP) IDs start with rs

A dbSNP Record
>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
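The IUPAC table above maps naturally onto a lookup dictionary. The Python sketch below is illustrative only (the names `IUPAC_CODES`, `expand`, and `code_for` are hypothetical); it shows how the single ambiguity letter in a dbSNP flanking sequence relates to the alleles listed in the record header.

```python
# IUPAC nucleotide codes; each value lists the bases in alphabetical order.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(code):
    """Return the set of bases that an IUPAC code can stand for."""
    return set(IUPAC_CODES[code.upper()])


def code_for(alleles):
    """Return the IUPAC code that represents exactly the given alleles."""
    wanted = "".join(sorted({allele.upper() for allele in alleles}))
    for code, bases in IUPAC_CODES.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


print(expand("R"))           # the bases A and G
print(code_for(["A", "G"]))  # the code used for an A/G SNP
print(code_for(["A", "C"]))  # the code used for an A/C SNP
```

With this mapping, an A/G variation appears in the sequence as R and an A/C variation as M, which is how the variable position can be spotted in a dbSNP FASTA record.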

Different Ways to Search SNPs in dbSNP
• dbSNP web site
– Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page
• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples
Search using wildcard (*), ranging (:), AND, OR, and NOT operators

Example: BRC*[Gene Name]
Description: Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)

Example: 1[CHR] AND (frameshift[Function_Class])
Description: Search SNPs located on chromosome 1 with function class frameshift

Example: 1[CHR] OR 2[CHR]
Description: Search all SNPs on chromosome 1 or 2

Example: 1[CHR] OR 2[CHR] NOT unknown[METHOD]
Description: Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
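Queries like the ones above can also be assembled programmatically. The following Python sketch builds the query string and a request URL for the NCBI E-utilities ESearch endpoint without sending anything over the network. The helper names `field` and `esearch_url` are hypothetical, and whether the field tags shown on the slide (CHR, Function_Class, METHOD) are still accepted by the current dbSNP index is an assumption.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, tag):
    """Attach an Entrez field tag to a search term, e.g. 1[CHR]."""
    return f"{term}[{tag}]"


def esearch_url(query, db="snp", retmax=20):
    """Build (but do not send) an ESearch request URL for the query."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Field tags below are copied from the slide; treat them as assumptions.
query = f"{field('1', 'CHR')} AND ({field('frameshift', 'Function_Class')})"
print(query)
print(esearch_url(query))
```

Keeping query construction separate from the HTTP call makes it easy to inspect the exact search string before running it in a browser or a script.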

Legend in Results

Search dbSNP Example
• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

Entrez SNP Search Results

dbSNP Ref
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location
>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use
dbSNP: Database name
ss or rs number: dbSNP accession for the SNP; ss refers to a submitted SNP accession, rs to the accession of a refSNP cluster of one or more submitted SNPs
allelePos: Position of the variation allele (1-based) in the FASTA sequence; it is always the 5' flank length plus 1
len / totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and counts as length 1 in the totalLen calculation
handle|submitted_snp_id: Submitted SNPs only. The two fields after totalLen are the submitter handle and the submitter SNP ID
taxid: NCBI taxonomy ID
mol: Molecular source of the sequence; valid values are genomic, cDNA, or mitochondria
snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)
alleles: Lists the alleles of the SNP, separated by /
Lower or upper case: Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG in green: Green is used for assay sequence (observed by the submitter)
ATCG in black: Black is used for flank sequence (extracted from sequence databases)
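The field descriptions above are enough to parse a header line mechanically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the two headers are the ones shown on the earlier slides, and the `/` between alleles is assumed to have been lost in extraction. It also checks the stated rule that allelePos equals the 5' flank length plus 1.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into a dictionary of named fields."""
    parts = header.lstrip(">").strip().split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        # Fields without "=" (submitter handle and ID) are skipped here.
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
    is_refsnp = record["accession"].startswith("rs")
    record["record_type"] = "refSNP cluster" if is_refsnp else "submitted SNP"
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header."""
    total = int(record.get("totalLen", record.get("len")))
    five_prime = int(record["allelePos"]) - 1  # allelePos = 5' length + 1
    three_prime = total - five_prime - 1       # the variation counts as 1
    return five_prime, three_prime


rs_header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
             "snpclass=1|alleles=A/C|mol=Genomic|build=130")
ss_header = (">gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|"
             "alleles=A/G|mol=Genomic")

for header in (rs_header, ss_header):
    rec = parse_dbsnp_header(header)
    print(rec["accession"], rec["record_type"], flank_lengths(rec))
```

For rs799916 the header implies 300 bases on each side of the variable position; for ss5586300 it implies 213 bases upstream and 261 downstream.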

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes
Disease-centric databases
• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated from the results of published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
– Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant
• The OMIM database includes genetic disorders caused by various kinds of mutation/variation, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
– Search for a gene or a disease; when the record is retrieved, view its variants
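One reading of the "10-digit OMIM number" is a 6-digit MIM entry number followed by a 4-digit allelic variant suffix, written with a decimal point between them. The Python sketch below splits such an identifier under that assumption. The function name is hypothetical and `123456.0001` is a made-up placeholder, not a real OMIM entry.

```python
import re

# Assumed layout: 6-digit MIM entry number, a dot, 4-digit variant number.
VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(identifier):
    """Return (MIM entry number, variant index) for an allelic variant ID."""
    match = VARIANT_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a 6+4 digit variant identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


entry, variant = split_omim_variant("123456.0001")  # placeholder identifier
print(entry, variant)
```

Splitting the identifier this way gives the entry number needed to retrieve the gene or disease record, after which its variants can be viewed as described above.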

Variants in OMIM Records
• For most genes, only selected mutations are included
– Criteria for inclusion: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene's Associated Diseases

Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1
• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The disease could be obesity, type II diabetes, etc.
• Choose one or a few SNPs or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation
• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein


Department of Health Information Management

Sickle Cell Anemiabull Due to 1 swapping an A for a T causing inserted amino acid

to be valine instead of glutamine in hemoglobin

httpmmcentersdiscoveryhospitalcomsharedencimg_htmIM-56htm

1 Normal red blood cells 2 Sickled red blood cells

Department of Health Information Management

A Few Relevant Conceptsbull Allele A specific ldquoversionrdquo of a gene or an

alternative DNA sequences at the same physical locus which may or may not result in different phenotypic traits

bull Genotype the genetic constitution of a cell an organism or an individual

bull Genotyping the process of identifying what genotype a person has for any given locus (loci)

Department of Health Information Management

Genetic Variations Databasesbull dbSNP

ndash httpwwwncbinlmnihgovSNP

bull Online Mendelian Inheritance in Man (OMIM)ndash httpwwwncbinlmnihgovomim

bull International HapMap Projectndash httpwwwhapmaporg

bull Genome Variation Server (Seattle SNPs)ndash httpgvsgswashingtoneduGVS

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

Header field: Description

• gnl: Internal use

• dbSNP: Database name

• ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs

• allelePos: Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1

• len / totalLen: Total number of bases of the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation

• handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id

• taxid: NCBI taxonomy id

• mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

• snpclass: Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)

• alleles: Lists the alleles of the SNP, separated by /

• Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

• ATCG (green): Green color is used for assay sequence (observed by the submitter)

• ATCG (black): Black color is used for flank sequence (extracted from sequence databases)
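
The header fields described above lend themselves to a small consistency check: allelePos should equal the 5' flank length plus 1, and the total length should equal the number of bases once spaces are removed. The following is a minimal sketch under stated assumptions: the function names (parse_dbsnp_header, split_flanks) are hypothetical, the toy accession and sequence are made up for illustration, and fields without an = sign (such as the submitter handle) are skipped.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP FASTA header line into named fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = header[1:].strip().split("|")
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("not a dbSNP FASTA header")
    # The accession prefix tells submitted (ss) from reference (rs) records.
    fields = {"accession": parts[2], "record_type": parts[2][:2]}
    for part in parts[3:]:
        # Fields without '=' (e.g. handle and submitted_snp_id on ss records)
        # are skipped in this sketch.
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value.strip("'\"")
    return fields


def split_flanks(header: str, sequence: str) -> tuple:
    """Return (5' flank, variation code, 3' flank) after checking the header."""
    fields = parse_dbsnp_header(header)
    bases = "".join(sequence.split())  # drop spaces and line breaks
    total_len = int(fields.get("totalLen", fields.get("len")))
    if len(bases) != total_len:
        raise ValueError(f"expected {total_len} bases, found {len(bases)}")
    pos = int(fields["allelePos"])  # 1-based position of the variation
    return bases[: pos - 1], bases[pos - 1], bases[pos:]


# Toy record: the accession and the sequence are made up for illustration.
toy_header = ">gnl|dbSNP|rs0|allelePos=6|totalLen=11|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic"
five, variation, three = split_flanks(toy_header, "ACGTA M GGTCA")
print(five, variation, three)
```

Both the len and totalLen spellings are accepted because both appear in the example headers in these slides.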

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information

– description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number (the 6-digit number of the parent entry followed by a 4-digit variant number) and can be searched in two ways

– Search for a gene or a disease; when it is retrieved, view its variants

Variants in OMIM Records

• For most genes, only selected mutations are included

– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on

– conducting population-based genomic research

– assessing the role of family health history in disease risk and prevention

– supporting a systematic process for evaluating genetic tests

– translating genomics into public health research and programs

– strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)

– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources

– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

– information on population prevalence of genetic variants

– gene-disease associations

– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in

– OMIM: http://www.ncbi.nlm.nih.gov/Omim

– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
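
For the FASTA step, the web interface is sufficient, but the same download can be scripted. The following is a minimal sketch under stated assumptions: it uses the NCBI E-utilities EFetch endpoint rather than anything shown in the lecture, the helper names (fasta_fetch_url, format_fasta) are hypothetical, and the accession NM_007294 (a BRCA1 RefSeq transcript) is only an illustrative choice. The code builds the request URL and formats a record without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: sequences are retrieved through the E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_fetch_url(accession: str, db: str = "nuccore") -> str:
    """Build (but do not send) an EFetch URL that returns a FASTA record."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def format_fasta(description: str, sequence: str, width: int = 70) -> str:
    """Format a sequence as a FASTA record with fixed-width sequence lines."""
    bases = "".join(sequence.split()).upper()  # drop whitespace, normalise case
    lines = [bases[i : i + width] for i in range(0, len(bases), width)]
    return "\n".join([f">{description}"] + lines) + "\n"


# Illustrative accession; replace it with the gene record found for the homework.
print(fasta_fetch_url("NM_007294"))
# Made-up toy sequence, only to show the line wrapping.
print(format_fasta("toy_record made-up sequence", "acgt acgt acgt", width=8), end="")
```

For protein records the same URL builder can be called with db="protein".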

A Few Relevant Concepts

• Allele: a specific "version" of a gene, or an alternative DNA sequence at the same physical locus, which may or may not result in different phenotypic traits

• Genotype: the genetic constitution of a cell, an organism, or an individual

• Genotyping: the process of identifying which genotype a person has at a given locus (or loci)

Genetic Variations Databases

• dbSNP

– http://www.ncbi.nlm.nih.gov/SNP

• Online Mendelian Inheritance in Man (OMIM)

– http://www.ncbi.nlm.nih.gov/omim

• International HapMap Project

– http://www.hapmap.org

• Genome Variation Server (Seattle SNPs)

– http://gvs.gs.washington.edu/GVS

dbSNP

• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic variations

• This collection of polymorphisms includes

– Single-base nucleotide substitutions (single nucleotide polymorphisms, SNPs)

  • Roughly 10 million in the human population, or on average 1 per 300 bp

  • Fewer than half of these SNPs have been identified and stored in the database

– Microsatellite repeat variations (short tandem repeats, STRs)

  • In silico estimates put the number of potentially polymorphic variable number tandem repeats (VNTRs) at over 100,000 across the human genome

– Small-scale multi-base deletions or insertions

  • Short insertions/deletions are difficult to quantify; their number is likely to fall between that of SNPs and that of VNTRs

dbSNP Data Types

• dbSNP contains two classes of records

– Submitted records

  • The original observations of sequence variation; submitted SNP (ss) record accessions start with ss

– Computationally annotated records

  • Generated during the dbSNP build cycle by computation based on the original submitted data; reference SNP cluster (refSNP) accessions start with rs

A dbSNP Record

>gnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=A/G|mol=Genomic
ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCC ATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAA CTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACA TTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA
R
CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCA GTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTG AAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
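
The IUPAC codes above map directly onto a lookup table. The following is a minimal sketch of how an ambiguity code in a dbSNP flank can be expanded into the concrete alleles, and how an alleles field such as A/G maps back to a single code. The names IUPAC, expand, and code_for are hypothetical and chosen for illustration.

```python
from itertools import product

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def expand(sequence: str) -> list:
    """List every concrete sequence that an ambiguous sequence can stand for."""
    choices = [IUPAC[base] for base in sequence.upper()]
    return ["".join(combo) for combo in product(*choices)]


def code_for(alleles: str) -> str:
    """Return the IUPAC code for a slash-separated allele list such as 'A/G'."""
    wanted = "".join(sorted(set(alleles.upper().replace("/", ""))))
    for code, bases in IUPAC.items():
        if bases == wanted:
            return code
    raise ValueError(f"no IUPAC code for alleles {alleles!r}")


# The bases around the variation in the ss5586300 record: CCA R CTC.
print(expand("CCARCTC"))
print(code_for("A/G"))
```

Because expand enumerates all combinations, it is only practical for short windows with a few ambiguity codes.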

Different Ways to Search SNPs in dbSNP

• dbSNP web site

– Direct search of ss records, batch search, and SNP record submission; no search limits

• Entrez SNP

– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp

– Search limit options allow precise retrieval

Department of Health Information Management

dbSNPbull The Single Nucleotide Polymorphism database (dbSNP) is a

public- domain archive for a broad collection of simple genetic variations

bull This collection of polymorphisms includesndash Single-base nucleotide substitutions (or single nucleotide

polymorphisms -SNPs)

bull Roughly 10 million in human population or on average 1 per 300 bps

bull Less than half of these SNPs are identified and stored in the database

ndash Microsatellite repeat variations (or short tandem repeats - STRs)

bull In sillico estimation of potentially polymorphic variable number tandem repeats (VNTR) are over 100000 across the human genome

ndash Small-scale multi-base deletions or insertions

bull The short insertiondeletions are difficult to quantify and the number is likely to fall in between SNPs and VNTR

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

International Union of Pure and Applied Chemistry (IUPAC) Code and Meaning

IUPAC code    Meaning
A             A
C             C
G             G
T             T
M             A or C
R             A or G
W             A or T
S             C or G
Y             C or T
K             G or T
V             A or C or G
H             A or C or T
D             A or G or T
B             C or G or T
N             G or A or T or C
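The IUPAC table maps directly onto a lookup dictionary. The sketch below encodes the table and lists the ambiguous positions in a sequence; the helper names `possible_bases` and `ambiguous_positions` are hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def possible_bases(code):
    """Return the set of bases an IUPAC code can represent."""
    return set(IUPAC[code.upper()])


def ambiguous_positions(sequence):
    """Return (1-based position, code) for every symbol that is not A, C, G or T."""
    seq = sequence.replace(" ", "")
    return [(i, ch) for i, ch in enumerate(seq, start=1) if ch.upper() not in "ACGT"]


print(sorted(possible_bases("R")))               # ['A', 'G']
print(ambiguous_positions("TTTAAAGRAG CCA R C"))  # [(8, 'R'), (14, 'R')]
```

In a dbSNP FASTA record, the variant itself appears as one such code, for example R for an A/G polymorphism.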

Different Ways to Search SNPs in dbSNP

• dbSNP web site
  – Direct search of SS records, batch search, and SNP record submission; no search limits
• Entrez SNP
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
  – Search limit options allow precise retrieval

Search SNPs from dbSNP Web Page

• http://www.ncbi.nlm.nih.gov/SNP/index.html

dbSNP Search Examples

Search using wild-card (*), ranging (:), AND, OR, and NOT operators

Example                                    Description
BRC*[Gene Name]                            Search SNPs on all genes with names starting with the letters BRC (e.g., BRCA1 and BRCA2)
1[CHR] AND (frameshift[Function_Class])    Search SNPs located on chromosome 1 with function class frameshift
1[CHR] OR 2[CHR]                           Search all SNPs on chromosome 1 or 2
1[CHR] OR 2[CHR] NOT unknown[METHOD]       Search all SNPs on chromosome 1 or 2 detected by all methods except unknown
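Queries like those above can also be assembled programmatically. The sketch below only builds a request URL for NCBI's E-utilities `esearch` endpoint and sends nothing over the network. The field tags (`CHR`, `METHOD`) are copied from the slide; whether the current dbSNP index still accepts them is an assumption, and the helper names `field` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(value, tag):
    """Format one Entrez search term, e.g. field('1', 'CHR') -> '1[CHR]'."""
    return f"{value}[{tag}]"


def esearch_url(term, db="snp", retmax=20):
    """Build (but do not send) an esearch request URL for the given query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Last example from the slide: chromosome 1 or 2, excluding method "unknown".
term = f"{field('1', 'CHR')} OR {field('2', 'CHR')} NOT {field('unknown', 'METHOD')}"
url = esearch_url(term)

# Decoding the URL recovers the original query string.
print(parse_qs(urlparse(url).query)["term"][0])
```

Entrez applies Boolean operators from left to right, so parentheses are needed only when a different grouping is intended.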


Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in early-onset breast cancer
• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP
• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp


Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP Fasta Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

Header field               Description
gnl                        Internal use
dbSNP                      Database name
ss or rs number            dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs
allelePos                  Variation allele position (1-based) on the FASTA sequence. It is always the 5' flank length plus 1
len / totalLen             Total number of bases of the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation
handle|submitted_snp_id    Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP id
taxid                      NCBI taxonomy id
mol                        Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria
snpclass                   Variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)
alleles                    Lists the alleles of the SNP, separated by "/"
Lower or upper case        Lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements
ATCG (green)               Green is used for assay sequence (observed by the submitter)
ATCG (black)               Black is used for flank sequence (extracted from sequence databases)
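The field rules above are enough to parse a header and recover the flank lengths. The sketch below is illustrative only: the function names are hypothetical, the "/" allele separator is assumed (the slide text lost its punctuation), and the positional `handle|submitted_snp_id` fields of submitted records are simply skipped because they carry no `key=value` form.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header line into a dictionary of fields."""
    parts = header.strip().lstrip(">").split("|")
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        key, sep, value = part.partition("=")
        if sep:  # keep only key=value fields
            record[key] = value
    return record


def flank_lengths(record):
    """Return (5' flank length, 3' flank length) implied by the header.

    allelePos = 5' length + 1 and total = 5' length + 1 + 3' length.
    """
    total = int(record["totalLen"] if "totalLen" in record else record["len"])
    allele_pos = int(record["allelePos"])
    return allele_pos - 1, total - allele_pos


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
record = parse_dbsnp_header(header)
print(record["accession"], flank_lengths(record))  # rs799916 (300, 300)
```

Applied to the ss5586300 header shown earlier (allelePos=214, len=475), the same arithmetic gives flanks of 213 and 261 bases.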


GeneView of a SNP


Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim
• CDC HuGE Navigator: http://hugenavigator.net
• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php
• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
• OMIM is a database of human genetic disorders, built and curated using results from published studies
• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
  – Description and clinical features of a disorder or of a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants
• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities
• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
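As an illustration of the 10-digit variant number, the sketch below assumes the usual OMIM layout of a six-digit MIM number, a dot, and a four-digit variant serial. That layout is not spelled out on the slide, and both the function name and the identifier used in the example are for illustration only.

```python
def split_omim_variant(variant_id):
    """Split an OMIM allelic variant id 'NNNNNN.NNNN' into (MIM number, serial)."""
    mim, sep, serial = variant_id.strip().partition(".")
    well_formed = (
        sep == "."
        and len(mim) == 6 and mim.isdigit()
        and len(serial) == 4 and serial.isdigit()
    )
    if not well_formed:
        raise ValueError(f"unexpected OMIM variant id: {variant_id!r}")
    return mim, serial


mim_number, variant_serial = split_omim_variant("113705.0001")
print(mim_number, variant_serial)  # 113705 0001
```

The two parts together contain ten digits, matching the description on the slide.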

Variants in OMIM Records

• For most genes, only selected mutations are included
  – Criteria for inclusion include: the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance
• Most of the variants represent disease-producing mutations, NOT polymorphisms
• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders
• Few neutral polymorphisms are included in OMIM
• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997
• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases
• OPHG's efforts focus on
  – conducting population-based genomic research
  – assessing the role of family health history in disease risk and prevention
  – supporting a systematic process for evaluating genetic tests
  – translating genomics into public health research and programs
  – strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease
• HuGENet™ resources
  – HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book
• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions


HuGE Navigator


Finding Disease Causing Genes

Finding a Gene's Associated Diseases

Disease Databases

• Genes are involved in disease
• Many diseases are well studied
• Descriptions of diseases and what is known about them are stored in
  – OMIM: http://www.ncbi.nlm.nih.gov/Omim
  – Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.
• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.
• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
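For the "save in FASTA format" step, a sequence can be written out with a few lines of code. This is only a sketch: the function name `to_fasta`, the 60-column line width, and the identifier and sequence in the example are hypothetical placeholders, not data from any GenBank record.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one sequence as FASTA text: a '>' header, then fixed-width lines."""
    header = f">{identifier} {description}".rstrip()
    seq = "".join(sequence.split())  # drop spaces and line breaks
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


# Placeholder record for illustration only.
text = to_fasta("demo_gene_1", "hypothetical example", "ACGT" * 20)
print(text, end="")
```

The returned string can be written to a file with the usual `open(path, "w")` call.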

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

dbSNP Data Typesbull The dbSNP contains two classes of records

ndash Submitted record

bull The original observations of sequence variation submitted SNPs (SS) records started with ss

ndash Computationally annotated record

bull Generated during the dbSNP build cycle by computation based the original submitted data Reference SNP Clusters (ref SNP) start with rs

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

A dbSNP Recordgtgnl|dbSNP|ss5586300|allelePos=214|len=475|taxid=9606|alleles=AG|mol=Genomic

ATAAACATGG ACTTTTACAA AACCCATATC GTATACCACC ACTTTTTCCCATCAAGTCAT YTGTTAAAAC TAAATGTAAG AAAAATCTGC TAGAGGAAAACTTTGAGGAA CATTCAATRT CACCTGAAAG AGAAATGGGA AATGAGAACATTCCAAGTAC AGTGAGCACA ATTAGCCGTA ATAACATTAG AGAAAATGTT TTTAAAGRAG CCA R CTCAAGCAAT ATTAATGAAG TAGGTTCCAG TACTAATGAA GTGGGCTCCAGTATTAATGA AATAGGTTCC AGTGATGAAA ACATTCAAGC AGAACTAGGT AGAAACAGAG GGCCAAAATT GAATGCTATG CTTAGATTAG GGGTTTTGCA ACCTGAGGTC TATAAACAAA GTCTTCCTGG AAGTAATTGT AAGCATCCTGAAATAAAAAA GCAAGAATAT GAAGAAGTAG TTCAGACTGT TAATACAGAT TTCTCTCCAT A

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

International Union of Pure and Applied Chemistry (IUPAC) Code and MeaningIUPAC code MeaningA AC CG GT TM A or CR A or GW A or TS C or GY C or TK G or TV A or C or GH A or C or TD A or G or TB C or G or TN G or A or T or C

Department of Health Information Management

Different Ways to Search SNPs in dbSNP

bull dbSNP web site

ndash Direct search of SS record batch search allow SNP record submission No search limit

bull Entrez SNP

ndash httpwwwncbinlmnihgovsitesentrezdb=Snp

ndash Search limits options allows precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

gnl: Internal use

dbSNP: Database name

ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs

allelePos: Variation allele position (1-based) in the FASTA sequence. It is always the 5' flank length plus 1

len/totalLen: Total number of bases in the FASTA sequence: the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID

taxid: NCBI taxonomy ID

mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)

alleles: Lists the alleles of the SNP, separated by /

Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG (green): Green is used for assay sequence (observed by the submitter)

ATCG (black): Black is used for flank sequence (extracted from sequence databases)
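The field descriptions above translate directly into a small parser. This is a sketch under stated assumptions rather than an official dbSNP tool: the function names parse_snp_header and flank_lengths are hypothetical, the alleles are assumed to be written as A/C, and fields without an = sign (the submitter handle and SNP ID on ss records) are simply collected in a list.

```python
def parse_snp_header(line):
    """Split a dbSNP FASTA header of the form described above into a dict."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Drop empty pieces so a trailing '|' does not create a blank field.
    parts = [p for p in line[1:].strip().split("|") if p]
    record = {"prefix": parts[0], "database": parts[1], "accession": parts[2]}
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            record[key] = value
        else:
            # Unlabelled fields: submitter handle and submitter SNP ID.
            record.setdefault("extra", []).append(part)
    # Convert the numeric fields that are present.
    for key in ("allelePos", "totalLen", "taxid", "snpclass", "build"):
        if key in record:
            record[key] = int(record[key])
    return record


def flank_lengths(record):
    """Return (5' length, 3' length); the variation itself counts as 1 base."""
    five = record["allelePos"] - 1
    three = record["totalLen"] - record["allelePos"]
    return five, three


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles=A/C|mol=Genomic|build=130")
rec = parse_snp_header(header)
print(rec["accession"], rec["alleles"], flank_lengths(rec))
```

For the rs799916 header, the flank lengths come out as 300 and 300, which matches the rule that allelePos is the 5' length plus 1 and that totalLen is the 5' length plus the 3' length plus 1.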

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information
– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations and variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number (the 6-digit entry number followed by a 4-digit variant number) and can be searched in two ways
– Search for a gene or a disease; when it is retrieved, view its variants
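To make the 10-digit variant number concrete, here is a small sketch that splits one into its two parts. It assumes the usual written form of a 6-digit MIM entry number, a dot, and a 4-digit allelic variant number; the function name split_variant_number is hypothetical, and the number 113705.0001 (113705 being the BRCA1 entry) is used only as an illustration.

```python
import re

# Assumed written form: six digits, a dot, four digits (ten digits in total).
_VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(text):
    """Split 'NNNNNN.NNNN' into (entry number, variant number) as strings."""
    match = _VARIANT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {text!r}")
    return match.group(1), match.group(2)


entry, variant = split_variant_number("113705.0001")
print(entry, variant)  # 113705 0001
```

Keeping both parts as strings preserves the leading zeros of the variant number.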

Variants in OMIM Records

• For most genes, only selected mutations are included
– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
– conducting population-based genomic research
– assessing the role of family health history in disease risk and prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease Causing Genes

Finding Gene's Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes; extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
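The homework can be done entirely through the web pages, but the PubMed search and the FASTA download can also be scripted. The sketch below is one possible approach, not a required method: it assumes the NCBI E-utilities esearch and efetch endpoints, the helper names are hypothetical, and the accession NM_007294 (a BRCA1 RefSeq mRNA) is only a placeholder to be replaced with the genes from the chosen paper. The code builds the request URLs and leaves the actual download to the reader.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities base URL.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(term, retmax=10):
    """URL for an esearch request against PubMed with the given query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def fasta_fetch_url(accession, db="nuccore"):
    """URL for an efetch request returning one record in FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Example query and placeholder accession; both are illustrative only.
print(pubmed_search_url("genome-wide association study AND obesity"))
print(fasta_fetch_url("NM_007294"))
# Passing db="protein" with a protein accession would request the protein FASTA.
```

The text returned by the second URL can be saved as a .fasta file.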


Different Ways to Search SNPs in dbSNP

• dbSNP web site
– Direct search of SS records and batch search; allows SNP record submission. No search limits

• Entrez SNP
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
– Search limit options allow precise retrieval

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Search SNPs from dbSNP Web Page

bull httpwwwncbinlmnihgovSNPindexhtml

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

dbSNP Search Examples

Search using wild-card() ranging() AND OR and NOT operatorsExample DescriptionBRC[Gene Name] Search SNPs on all genes with names

starting with the letter BRC (ie BRCA1 and BRCA2)

1[CHR] AND (frameshift[Function_Class])

Search SNPs located on chromosome 1 with function class frame-shift

1[CHR] OR 2[CHR] Search all SNPs on chromosome 1 or 21[CHR] OR 2[CHR] NOT unknown[METHOD]

Search all SNPs on chromosome 1 or 2 detected by all methods except unknown

Department of Health Information Management

Legend in Results

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

Legend in Results

Search dbSNP Example

• Some mutations in the human BRCA1 gene have been reported to be involved in the early onset of breast cancer

• Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

• Start from Entrez SNP: http://www.ncbi.nlm.nih.gov/sites/entrez?db=Snp
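The same kind of search can also be issued programmatically. The sketch below only builds a query URL for the NCBI E-utilities ESearch service against the SNP database; it does not send the request. The function name `build_snp_search_url` and the query string `BRCA1[Gene Name]` are illustrative assumptions, and the "validated" and "non-synonymous" limits chosen in the web form are not encoded here, so the field tags should be checked against the Entrez SNP search help before use.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (programmatic counterpart of the Entrez web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, retmax=20):
    """Return an ESearch URL that queries the Entrez SNP database for `term`."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query: the field tag is an assumption, not taken from the slides.
example_url = build_snp_search_url("BRCA1[Gene Name]")
print(example_url)
```

Keeping the search term as a plain parameter makes it easy to try different field tags or limits without changing the function.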

Entrez SNP Search Results

dbSNP Ref

http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=799920

SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles='A/C'|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
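A header like the one for rs799916 can be split into named fields with a few lines of Python. This is a minimal sketch: the function name `parse_dbsnp_header` and the keys `tag`, `database`, and `accession` are illustrative choices, and it assumes the first three fields are always the `gnl` tag, the database name, and the accession.

```python
def parse_dbsnp_header(header):
    """Split a dbSNP FASTA header into its accession and key=value fields."""
    parts = header.strip().lstrip(">").rstrip("|").split("|")
    record = {"tag": parts[0], "database": parts[1], "accession": parts[2]}
    for field in parts[3:]:
        # Fields without "=" (e.g. submitter handle and id on ss records) are skipped.
        if "=" in field:
            key, value = field.split("=", 1)
            record[key] = value.strip("'\"")  # drop quotes around values such as 'A/C'
    return record


header = (">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
          "snpclass=1|alleles='A/C'|mol=Genomic|build=130")
info = parse_dbsnp_header(header)
print(info["accession"], info["allelePos"], info["alleles"])
```

All values are kept as strings; converting `allelePos` and `totalLen` to integers is left to the caller.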

SNP FASTA Header Format

The FASTA header line starts with ">" and has fields separated by "|". Each field is explained below.

• gnl: internal use

• dbSNP: database name

• ss or rs number: dbSNP accession for the SNP. "ss" refers to a submitted SNP accession; "rs" refers to the accession of a refSNP cluster of one or more submitted SNPs

• allelePos: position of the variation allele (1-based) in the FASTA sequence. It is always the length of the 5' flank plus 1

• len / totalLen: total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation

• handle|submitted_snp_id: only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID

• taxid: NCBI taxonomy ID

• mol: molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

• snpclass: variation class of the SNP. The most common value is 1 (single nucleotide polymorphism)

• alleles: lists the alleles of the SNP, separated by "/"

• Lower or upper case: lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

• Green ATCG: green color is used for assay sequence (observed by the submitter)

• Black ATCG: black color is used for flank sequence (extracted from sequence databases)
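The position and length rules in the field descriptions can be checked mechanically. The sketch below is illustrative only: `check_snp_record` is a hypothetical helper, the flanks are toy sequences rather than real dbSNP data, and the lookup table covers only the two-base IUPAC ambiguity codes.

```python
# IUPAC nucleotide ambiguity codes for two-allele variations.
IUPAC_TWO_BASE = {
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def check_snp_record(five_prime, variation, three_prime,
                     allele_pos, total_len, alleles):
    """Check the header rules: allelePos, totalLen, and the IUPAC code."""
    return {
        # allelePos is the 5' flank length plus 1 (1-based).
        "allele_pos_ok": allele_pos == len(five_prime) + 1,
        # totalLen counts both flanks plus 1 for the single IUPAC code.
        "total_len_ok": total_len == len(five_prime) + 1 + len(three_prime),
        # The IUPAC code must stand for exactly the listed alleles.
        "alleles_ok": IUPAC_TWO_BASE.get(variation.upper()) == set(alleles.split("/")),
    }


# Toy example: 5-base flanks around an A/C variation written as "M".
result = check_snp_record("ACGTA", "M", "TTGCA",
                          allele_pos=6, total_len=11, alleles="A/C")
print(result)
```

For the rs799916 record, the same rules give 300 + 1 = 301 for allelePos and 300 + 1 + 300 = 601 for totalLen, with "M" standing for A/C.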

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information
– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways
– Search for a gene or a disease; when retrieved, view its variants
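As a small illustration of the 10-digit variant number, the sketch below assumes the usual OMIM layout: the 6-digit MIM number of the entry, a period, and a 4-digit allelic variant suffix (for example 113705.0001). The helper name `split_omim_variant` is hypothetical.

```python
import re

# Assumed layout: 6-digit MIM number, a period, 4-digit allelic variant suffix.
_VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_omim_variant(number):
    """Split an OMIM allelic variant number into (MIM number, variant index)."""
    match = _VARIANT_RE.match(number.strip())
    if match is None:
        raise ValueError("not a 10-digit OMIM variant number: %r" % number)
    return match.group(1), int(match.group(2))


mim_number, variant_index = split_omim_variant("113705.0001")
print(mim_number, variant_index)
```

The MIM number identifies the gene or disorder entry, and the suffix identifies one variant listed within that entry.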

Variants in OMIM Records

• For most genes, only selected mutations are included
– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC

• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country’s leading chronic, infectious, environmental, and occupational diseases

• OPHG’s efforts focus on
– conducting population-based genomic research
– assessing the role of family health history in disease risk and prevention
– supporting a systematic process for evaluating genetic tests
– translating genomics into public health research and programs
– strengthening capacity for public health genomics in disease prevention programs

HuGENet

• The Human Genome Epidemiology Network (HuGENet™)
– Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
– HuGE Navigator, Coordinating centers, Collaborators, Workshops, Reviews, Case studies, Book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
– information on population prevalence of genetic variants
– gene-disease associations
– gene-gene and gene-environment interactions

HuGE Navigator

Finding Disease-Causing Genes

Finding a Gene’s Associated Diseases

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in
– OMIM: http://www.ncbi.nlm.nih.gov/Omim
– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper about a genetic disease (or a disease with a known genetic basis) that uses the research method named “Genome-Wide Association Study” or “GWAS”. The journal could be “PLoS Genetics”, “Nature Genetics”, etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein
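For the step of saving sequences in FASTA format, a record is a ">" header line followed by the sequence wrapped to a fixed width. The sketch below is one minimal way to produce such text; `to_fasta`, the identifier, and the toy sequence are all hypothetical and are not taken from any GenBank record.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as FASTA text: a '>' header plus wrapped sequence lines."""
    header = ">" + identifier
    if description:
        header += " " + description
    # Remove spaces and line breaks copied from a record, and normalize case.
    cleaned = "".join(sequence.split()).upper()
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header] + lines) + "\n"


# Toy sequence with spacing as it might be pasted from a record display.
record = to_fasta("example_id", "acgt acgt ac", width=4)
print(record, end="")
```

The returned string can be written to a file with a `.fasta` extension using an ordinary `open(..., "w")` call.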

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Search dbSNP Example bull Some mutations on human BRCA1 gene have been

reported to be involved in the early onset of breast cancer

bull Retrieve all validated non-synonymous coding reference SNPs for BRCA1 from dbSNP

bull Starting from the Entrez SNP httpwwwncbinlmnihgovsitesentrezdb=Snp

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Entrez SNP Search Results

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

dbSNP RefhttpwwwncbinlmnihgovprojectsSNPsnp_refcgirs=799920

Department of Health Information Management

SNP Locationgtgnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|

snpclass=1|alleles=AC|mol=Genomic|build=130 AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC M GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA

Department of Health Information Management

SNP Fasta Header FormatHeader

Fasta header line starts with gt and has fields separated by | Each field is explained below

Gnl Internal usedbSNP Database name

ss or rs numberdbSNP accession for the snp ss refers to submitted snp accession rs refers to the accession of refSNP cluster of one or more submitted snp

allelePosVariation allele position(1 based) on the fasta It is always the 5 length plus 1

lentotalLenTotal number of bases of the fasta sequence a sum of length of 5 3 and variation Variation is expressed in one IUPack code and has a length of 1 in the totalLen calculation

handle|submitted_snp_id

Only for submitted snp The two fields after totalLen are the submitter handle and submitter snp id

Taxid NCBI taxonomy id

MolMolecular source of the sequence Valid values are genomic cDNA or mitochondria

snpclassVariation class of the snp most common value is 1 - single nucleotide polymorphism Click on snpclass for details

Alleles Lists alleles of the snp separated by

Lower or upper caseSequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

ATCG Green color is used for assay sequence (observed by the submitter)

ATCGBlack color is used for flank sequence (extracted from sequence databases )

Department of Health Information Management

GeneView of a SNP

Department of Health Information Management

Links to Various Gene Records

Gene and Disease

Department of Health Information Management

Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

Department of Health Information Management

NCBImdashOMIM

Department of Health Information Management

Online Mendelian Inheritance in Man (OMIM)bull httpwwwncbinlmnihgoventrezqueryfcgidb=OMIM

bull OMIM is a human genetic disorders database built and curated using results from published studies

bull Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder which contains the following informationndash description and clinical features of a disorder or a gene involved

in genetic disorders biochemical and other features cytogenetics and mapping molecular and population genetics diagnosis and clinical management animal models for the disorder allelic variants

bull OMIM is searchable via NCBI Entrez and its records are cross-linked to other NCBI resources

Department of Health Information Management

OMIM Variantbull The OMIM database includes genetic disorders

caused by various mutationvariation from SNPs to large-scale chromosomal abnormalities

bull Variants are represented by a 10-digit OMIM number and can be searched in two waysndash Search for a gene or a disease when retrieved view its

variants

Department of Health Information Management

Variants in OMIM Recordsbull For most genes only selected mutations are included

ndash Criteria for inclusion include the first mutation to be discovered high population frequency distinctive phenotype historic significance unusual mechanism of mutation unusual pathogenetic mechanism and distinctive inheritance

bull Most of the variants represent disease-producing mutations NOT polymorphisms

bull A few polymorphisms are included many of which show a positive statistical correlation with particular common disorders

bull Few neutral polymorphisms are included in OMIM

bull Some SNPs in the dbSNP records are not linked to the corresponding OMIM records

Department of Health Information Management

Office of Public Health Genomics CDCbull The CDC established the Office of Public Health

Genomics (OPHG) in 1997 bull OPHG aims to integrate genomics into public health

research policy and programs Doing so could improve interventions designed to prevent and control the countryrsquos leading chronic infectious environmental and occupational diseases

bull OPHGs efforts focus on bull conducting population-based genomic research bull assessing the role of family health history in disease risk and

preventionbull supporting a systematic process for evaluating genetic testsbull translating genomics into public health research and

programsbull strengthening capacity for public health genomics in disease

prevention programs

Department of Health Information Management

HuGENetbull The Human Genome Epidemiology Network (HuGENettrade)

ndash Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis interpretation and dissemination of population-based data on human genetic variation in health and disease

bull HuGENetTM resourcesndash HuGE Navigator Coordinating centers Collaborators Workshops

Reviews Case studies Book

bull HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology

ndash information on population prevalence of genetic variants

ndash gene-disease associations

ndash gene-gene and gene- environment interactions

Department of Health Information Management

HuGE Navigator

Department of Health Information Management

Finding Disease Causing Genes

Department of Health Information Management

Finding Genersquos Associated Diseases

Department of Health Information Management

Disease Databases

• Genes are involved in disease

• Many diseases are well studied

• Descriptions of diseases and what is known about them are stored in

– OMIM: http://www.ncbi.nlm.nih.gov/Omim

– Tumor Gene Family Databases: http://www.tumor-gene.org/tgdf.html

Homework 1

• Using PubMed, search for a recent paper on a genetic disease (or a disease with a known genetic basis) that uses the research method "Genome-Wide Association Study" (GWAS). The journal could be "PLoS Genetics", "Nature Genetics", etc. The genetic disease could be obesity, type II diabetes, etc.

• Choose one or a few SNP records or genes mentioned in the paper that may be highly relevant to the disease, and search for the corresponding records in dbSNP. Summarize the genetic variation, the relevant genes, and the location of the variation.

• Search for the GenBank records of the relevant genes, then extract and save their sequences in FASTA format. Identify the corresponding protein sequences and use Cn3D to display the structure of the protein.
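The database lookups in the homework can also be scripted. The sketch below is only an illustration, assuming NCBI's E-utilities web service (not covered in these slides) is used instead of the web pages: it builds `esearch` and `efetch` request URLs without sending them. The helper names, the example search terms, and the `ACCESSION_HERE` placeholder are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (an assumption of this sketch;
# the slides themselves only describe the web interfaces).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a search URL for one Entrez database (e.g. pubmed, snp)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_fasta_url(db: str, record_id: str) -> str:
    """Build a URL that asks for one record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # Step 1: look for GWAS papers (the search term is only an example).
    print(esearch_url("pubmed", "genome-wide association study AND obesity"))
    # Step 2: look up a SNP by its rs number.
    print(esearch_url("snp", "rs799916"))
    # Step 3: fetch a nucleotide record as FASTA; replace the placeholder
    # with a real GenBank accession taken from the paper.
    print(efetch_fasta_url("nuccore", "ACCESSION_HERE"))
```

The URLs can be pasted into a browser or passed to any HTTP client. Building them separately from sending them keeps the sketch usable offline.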


SNP Location

>gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|snpclass=1|alleles=A/C|mol=Genomic|build=130
AAAATAATCA AGAAGAGCAA AGCATGGATT CAAACTTAGG TATTGGAACC AGGTTTTTGT
GTTTGCCCCA GTCTATTTAT AGAAGTGAGC TAAATGTTTA TGCTTTTGGG GAGCACATTT
TACAAATTTC CAAGTATAGT TAAAGGAACT GCTTCTTAAA CTTGAAACAT GTTCCTCCTA
AGGTGCTTTT CATAGAAAAA AGTCCTTCAC ACAGCTAGGA CGTCATCTTT GACTGAATGA
GCTTTAACAT CCTAATTACT GGTGGACTTA CTTCTGGTTT CATTTTATAA AAGCAAATCC
M
GGTGTCCCAA AGCAAGGAAT TTAATCATTT TGTGTGACAT GAAAGTAAAT CCAGTCCTGC
CAATGAGAAG AAAAAGACAC AGCAAGTTGC AGCGTTTATA GTCTGCTTTT ACATCTGAAC
CTCTGTTTTT GTTATTTAAG GTGAAGCAGC ATCTGGGTGT GAGAGTGAAA CAAGCGTCTC
TGAAGACTGC TCAGGGCTAT CCTCTCAGAG TGACATTTTA ACCACTCAGG TAAAAAGCGT
GTGTGTGTGT GCACATGCGT GTGTGTGGTG TCCTTTGCAT TCAGTAGTAT GTATCCCACA
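To see how the variant position in such a record relates to the sequence itself, the following sketch scans a flanked sequence for the single IUPAC ambiguity code (here `M`, meaning A or C) and reports its 1-based position and the total length. The function name and the shortened toy sequence are illustrative assumptions, and only two-base ambiguity codes are handled.

```python
# Two-base IUPAC nucleotide ambiguity codes and the alleles they stand for.
IUPAC_TWO_BASE = {
    "R": {"A", "G"},
    "Y": {"C", "T"},
    "M": {"A", "C"},
    "K": {"G", "T"},
    "S": {"C", "G"},
    "W": {"A", "T"},
}


def locate_variant(sequence: str):
    """Return (1-based position, alleles, total length) for the first
    two-base IUPAC ambiguity code found in a flanked sequence."""
    # Remove the spaces and line breaks used for display; upper-casing also
    # covers lower-case (repeat-masked) flanking sequence.
    bases = "".join(sequence.split()).upper()
    for index, base in enumerate(bases):
        if base in IUPAC_TWO_BASE:
            return index + 1, IUPAC_TWO_BASE[base], len(bases)
    raise ValueError("no two-base IUPAC ambiguity code found")


if __name__ == "__main__":
    # Toy input: a 10-base 5' flank, the code M, and a 10-base 3' flank.
    position, alleles, total_len = locate_variant("AAGCAAATCC M GGTGTCCCAA")
    print(position, sorted(alleles), total_len)  # 11 ['A', 'C'] 21
```

Applied to the full rs799916 sequence, the same function should report position 301 and length 601, matching the `allelePos` and `totalLen` fields in the header.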

SNP FASTA Header Format

The FASTA header line starts with > and has fields separated by |. Each field is explained below.

• gnl: Internal use

• dbSNP: Database name

• ss or rs number: dbSNP accession for the SNP. ss refers to a submitted SNP accession; rs refers to the accession of a refSNP cluster of one or more submitted SNPs

• allelePos: Variation allele position (1-based) in the FASTA sequence. It is always the 5' length plus 1

• len / totalLen: Total number of bases in the FASTA sequence, the sum of the lengths of the 5' flank, the 3' flank, and the variation. The variation is expressed as one IUPAC code and has a length of 1 in the totalLen calculation

• handle|submitted_snp_id: Only for submitted SNPs. The two fields after totalLen are the submitter handle and the submitter SNP ID

• taxid: NCBI taxonomy ID

• mol: Molecular source of the sequence. Valid values are genomic, cDNA, or mitochondria

• snpclass: Variation class of the SNP; the most common value is 1 (single nucleotide polymorphism)

• alleles: Lists the alleles of the SNP, separated by /

• Lower or upper case: Sequence in lower case is used for sequence identified by RepeatMasker as low-complexity or repetitive elements

• ATCG (green): Green color is used for assay sequence (observed by the submitter)

• ATCG (black): Black color is used for flank sequence (extracted from sequence databases)
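The field layout described above can be read mechanically. The following is a minimal sketch, not an official dbSNP parser: it splits a header on `|`, keeps `key=value` pairs in a dictionary, and converts the numeric fields. The function name is hypothetical, and the example header is the rs799916 record from the SNP Location slide with the `/` between the alleles restored, which is an assumption about the original text.

```python
def parse_dbsnp_header(header: str) -> dict:
    """Split a dbSNP-style FASTA header into a dictionary of fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Drop the leading '>' and any empty field left by a trailing '|'.
    parts = [part for part in header[1:].strip().split("|") if part]
    if len(parts) < 3 or parts[0] != "gnl" or parts[1] != "dbSNP":
        raise ValueError("not a gnl|dbSNP header")

    fields = {"accession": parts[2]}  # the ss or rs number
    positional = []
    for part in parts[3:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key] = value.strip("'\"")  # tolerate quoted values
        else:
            # For submitted SNPs: submitter handle and submitted SNP ID.
            positional.append(part)
    if positional:
        fields["extra"] = positional

    # Convert the fields that hold integers.
    for key in ("allelePos", "len", "totalLen", "taxid", "snpclass", "build"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    example = (
        ">gnl|dbSNP|rs799916|allelePos=301|totalLen=601|taxid=9606|"
        "snpclass=1|alleles=A/C|mol=Genomic|build=130"
    )
    record = parse_dbsnp_header(example)
    five_prime = record["allelePos"] - 1                      # 5' flank length
    three_prime = record["totalLen"] - record["allelePos"]    # 3' flank length
    print(record["accession"], record["alleles"], five_prime, three_prime)
```

For this record the two flank lengths both come out as 300, which agrees with the rule that allelePos is the 5' length plus 1 and that the variation counts as one base in totalLen.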

GeneView of a SNP

Links to Various Gene Records

Gene and Disease

Disease-Causing Genes

Disease-centric databases

• OMIM: http://www.ncbi.nlm.nih.gov/omim

• CDC HuGE Navigator: http://hugenavigator.net

• HGMD: https://portal.biobase-international.com/hgmd/pro/start.php

• A Catalog of Published Genome-Wide Association Studies: http://www.genome.gov/26525384

NCBI - OMIM

Online Mendelian Inheritance in Man (OMIM)

• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated using results from published studies

• Each OMIM record provides a summary of the current state of knowledge of the genetic basis of a disorder, which contains the following information

– description and clinical features of a disorder or a gene involved in genetic disorders; biochemical and other features; cytogenetics and mapping; molecular and population genetics; diagnosis and clinical management; animal models for the disorder; allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources

OMIM Variant

• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways

– Search for a gene or a disease; when it is retrieved, view its variants
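As a small illustration of the 10-digit variant number, the sketch below assumes the layout OMIM uses for allelic variants: a six-digit MIM entry number, a period, and a four-digit variant number. The function name is hypothetical and the identifier in the example is made up, not a real OMIM record.

```python
import re

# Assumed layout: six-digit MIM number, a period, four-digit variant number.
VARIANT_ID = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_id(variant_id: str):
    """Split an OMIM allelic variant ID into (MIM number, variant number)."""
    match = VARIANT_ID.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 6+4 digit variant ID: {variant_id!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # Made-up identifier used only to show the format.
    print(split_variant_id("123456.0001"))  # ('123456', '0001')
```

Keeping both parts as strings preserves the leading zeros of the variant number.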

Variants in OMIM Records

• For most genes, only selected mutations are included

– Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in the dbSNP records are not linked to the corresponding OMIM records











Disease Causing GenesDisease centric databases

bull OMIM httpwwwncbinlmnihgovomim

bull CDC HugeNavigator httphugenavigatornet

bull HGMD httpsportalbiobase-internationalcomhgmdprostartphp

bull A Catalog of Published Genome-Wide Association Studies httpwwwgenomegov26525384

NCBI: OMIM

Online Mendelian Inheritance in Man (OMIM)
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

• OMIM is a database of human genetic disorders, built and curated from the results of published studies

• Each OMIM record summarizes the current state of knowledge of the genetic basis of a disorder and contains the following information:
  – description and clinical features of a disorder, or of a gene involved in genetic disorders
  – biochemical and other features
  – cytogenetics and mapping
  – molecular and population genetics
  – diagnosis and clinical management
  – animal models for the disorder
  – allelic variants

• OMIM is searchable via NCBI Entrez, and its records are cross-linked to other NCBI resources
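As a rough illustration of searching OMIM through Entrez from a program, the sketch below builds an E-utilities ESearch request against the `omim` database and pulls the ID list out of a JSON response. It makes no network call. The function names are hypothetical, and the sample response is hand-written to mirror the `esearchresult` / `idlist` layout of ESearch JSON output; it is not real output.

```python
import json
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def omim_search_url(term, retmax=10):
    """Build an ESearch URL that searches the Entrez OMIM database."""
    query = urlencode(
        {"db": "omim", "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS}/esearch.fcgi?{query}"


def extract_ids(esearch_json):
    """Return the list of record IDs from an ESearch JSON response string."""
    result = json.loads(esearch_json)["esearchresult"]
    return result.get("idlist", [])


url = omim_search_url("cystic fibrosis")

# Hand-written sample in the ESearch JSON layout; the IDs are made up.
sample_response = '{"esearchresult": {"count": "2", "idlist": ["111111", "222222"]}}'
ids = extract_ids(sample_response)
```

The returned IDs can then be passed to other Entrez tools to follow the cross-links to related NCBI records.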

OMIM Variant
• The OMIM database includes genetic disorders caused by various mutations/variations, from SNPs to large-scale chromosomal abnormalities

• Variants are represented by a 10-digit OMIM number and can be searched in two ways
  – Search for a gene or a disease; when it is retrieved, view its variants
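To make the 10-digit variant number concrete, the following sketch splits such an identifier into its two parts. It assumes the usual written form: a six-digit OMIM entry number, a period, and a four-digit variant suffix. The function name is hypothetical, and `141900.0243` is used only as an identifier in that format.

```python
import re

# Six-digit entry number, a period, then a four-digit variant suffix.
_VARIANT_RE = re.compile(r"^(\d{6})\.(\d{4})$")


def split_variant_number(variant_id):
    """Split an OMIM variant number into (entry_number, variant_suffix)."""
    match = _VARIANT_RE.match(variant_id.strip())
    if match is None:
        raise ValueError(f"not a 10-digit OMIM variant number: {variant_id!r}")
    return match.group(1), match.group(2)


entry_number, variant_suffix = split_variant_number("141900.0243")
```

The entry number identifies the gene or disease record to retrieve, and the suffix identifies one variant listed within that record.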

Variants in OMIM Records
• For most genes, only selected mutations are included
  – Criteria for inclusion include the first mutation to be discovered, high population frequency, distinctive phenotype, historic significance, unusual mechanism of mutation, unusual pathogenetic mechanism, and distinctive inheritance

• Most of the variants represent disease-producing mutations, NOT polymorphisms

• A few polymorphisms are included, many of which show a positive statistical correlation with particular common disorders

• Few neutral polymorphisms are included in OMIM

• Some SNPs in dbSNP records are not linked to the corresponding OMIM records

Office of Public Health Genomics, CDC
• The CDC established the Office of Public Health Genomics (OPHG) in 1997

• OPHG aims to integrate genomics into public health research, policy, and programs. Doing so could improve interventions designed to prevent and control the country's leading chronic, infectious, environmental, and occupational diseases

• OPHG's efforts focus on
  • conducting population-based genomic research
  • assessing the role of family health history in disease risk and prevention
  • supporting a systematic process for evaluating genetic tests
  • translating genomics into public health research and programs
  • strengthening capacity for public health genomics in disease prevention programs

HuGENet
• The Human Genome Epidemiology Network (HuGENet™)
  – Established to help translate genetic research findings into opportunities for preventive medicine and public health by advancing the synthesis, interpretation, and dissemination of population-based data on human genetic variation in health and disease

• HuGENet™ resources
  – HuGE Navigator, coordinating centers, collaborators, workshops, reviews, case studies, book

• HuGE Navigator provides access to a continuously updated knowledge base in human genome epidemiology
  – information on population prevalence of genetic variants
  – gene-disease associations
  – gene-gene and gene-environment interactions

Department of Health Information Management

Disease Databasesbull Genes are involved in disease

bull Many diseases are well studied

bull Description of diseases and what is known about them is storedndash OMIM httpwwwncbinlmnihgovOmim

ndash Tumor Gene Family Databases httpwwwtumor-geneorgtgdfhtml

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease

Department of Health Information Management

Homework 1bull Using PubMed search for a recent paper related to genetic

disease (or disease with known genetic basis) and with a research method named ldquoGenome-Wide Association Study or GWASrdquo The title of the journals could be ldquoPLoS geneticsrdquo ldquoNature Geneticsrdquo etc The genetic disease could be obesity type II diabetes etc

bull Choose one or a few SNPs record or genes mentioned in the paper which may highly relevant to the disease search for the corresponding dbSNP records in dbSNP Summarize the genetic variation relevant genes and location of the variation

bull Search for the Genbank record of the relevant genes extract and save their sequences in FASTA format Identify the corresponding protein sequences and use Cn3D to display the structure of the protein

  • Genomics and Personalized Care in Health Systems Lecture 2 Databases
  • Nucleotide and Protein Sequence Databases
  • NCBI Homepage
  • EST
  • Protein Structure
  • FlyBase
  • Genetic Variations
  • Gene and Disease