RefSeq curation and annotation of the reference human genome GRCh38 Kim D. Pruitt National Center for Biotechnology Information National Library of Medicine National Institutes of Health www.ncbi.nlm.nih.gov/refseq/

Ashg2015 grc-pruitt



RefSeq Background

• RefSeq provides:
  • Human genome annotation
  • Known transcripts & proteins (manually curated)
  • Model transcripts & proteins (annotation pipeline)

• Collaborations:
  • Genome Reference Consortium (GRC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Consensus CDS (CCDS) Collaboration (HAVANA curators)
  • RefSeqGene/Locus Reference Genomic (LRG)/LSDB

RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/

An NCBI project to provide reference sequence standards that incorporate current knowledge.

Archaea – Bacteria – Eukaryotes - Virus

Curation support of genic regions of the reference human assembly

• RefSeqGene and LRG collaboration
  • Genomic and cDNA standards for clinical reporting
  • Report potential issues to the GRC

• Consensus CDS collaboration
  • Stabilized human CDS annotation
  • Report potential issues to the GRC

• RefSeq
  • Curation of genes, transcript & protein records
  • Report potential issues to the GRC
  • Review GRC patch updates for gene annotation impact

Genome annotation leverages curation + computation

Genes:
• Type, location, length

Sequence:
• Accuracy, length
• Alternate splice products
• Functional annotation

Evidence-based genome annotation pipeline:
• Align curated RefSeqs
• Align transcripts, proteins
• Align RNA-Seq
• Filter best alignments
• Build model RefSeqs
• Assign accessions, GeneID

Manual curation: sequence, literature

                 Transcripts   Proteins
Known RefSeqs         50,540     39,363
Model RefSeqs        112,735     60,599

Annotated genes    Count
Protein-coding    20,576
Non-coding        18,037
Pseudogene        12,474

Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
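As a quick sanity check on these numbers, the sketch below adds up the three status categories and compares the total with the Known RefSeq transcript count reported earlier in the deck. It assumes the three categories are meant to partition the Known transcript set, which the slides do not state explicitly, and "~600" is treated as exactly 600.

```python
# Counts copied from the slides; "pending" is approximate on the slide ("~600").
known_transcripts = 50_540          # Known RefSeq transcripts
identical_to_genome = 47_031        # identical to GRCh38
retain_mismatch_or_indel = 2_916    # intentionally differ from GRCh38
pending = 600                       # still under review (approximate)

# Assumption: these three categories together cover the Known transcripts.
accounted = identical_to_genome + retain_mismatch_or_indel + pending
fraction_identical = identical_to_genome / known_transcripts

print(f"accounted for: {accounted:,} of {known_transcripts:,}")
print(f"identical to genome: {fraction_identical:.1%}")
```

Under that assumption the categories sum to within a few records of the Known transcript total, which is consistent with the approximate "pending" figure.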

[Figure: number of updates per quarter, 2013 Q1 to 2015 Q3 (axis: 0 to 1,200 updates). * marks the GRCh38 release, 12/24/2013.]

Updating RefSeq to match GRCh38

• Post GRCh38 review:
  • NM_173477 updated to match genome (NM_173477.4)
  • Model RefSeq XM_005257026.1 promoted to Known RefSeq
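The Known versus Model distinction is visible in the accession prefix: curated records use NM_, NR_ and NP_, while annotation-pipeline models use XM_, XR_ and XP_. The helper below is a minimal sketch of that convention; the function name `classify_accession` is hypothetical, and only these six prefixes are handled.

```python
import re

# Accession prefix -> (category, molecule type).
PREFIXES = {
    "NM": ("Known", "mRNA"),
    "NR": ("Known", "non-coding RNA"),
    "NP": ("Known", "protein"),
    "XM": ("Model", "mRNA"),
    "XR": ("Model", "non-coding RNA"),
    "XP": ("Model", "protein"),
}

# Two-letter prefix, underscore, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return (category, molecule, version) for a RefSeq accession string."""
    match = ACCESSION_RE.match(accession)
    if match is None or match.group(1) not in PREFIXES:
        raise ValueError(f"unsupported accession: {accession!r}")
    category, molecule = PREFIXES[match.group(1)]
    version = int(match.group(3)) if match.group(3) else None
    return category, molecule, version


print(classify_accession("NM_173477.4"))
print(classify_accession("XM_005257026.1"))
```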

[Figure: alignments to GRCh38 and GRCh37]

RefSeq curation & genome maintenance

[Figure: the same region in GRCh38 and GRCh37, with the GRCh37 gap marked]

GRCh37 issue:
• SCX duplication
• MROH1 split

GRCh38 update:
• Gap closed
• MROH1 complete
• One SCX gene

RefSeq curation & genome maintenance

• POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38

• This maintains the correct reading frame

[Figure: alignment to GRCh38]
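To see why a transcript might deliberately omit two genomic bases, the toy example below compares an in-frame coding sequence with a copy that carries two extra nucleotides. The sequences are invented for illustration and are not POLR2A; the function name `first_in_frame_stop` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    for start in range(0, len(cds) - 2, 3):
        if cds[start:start + 3] in STOP_CODONS:
            return start // 3
    return None


# Toy transcript CDS: ATG GCT GAA ACC TGA (stop is the last codon).
transcript_cds = "ATGGCTGAAACCTGA"

# Toy genome-derived CDS: the same sequence with 2 extra nt ("TA") inserted
# after the second codon, shifting every downstream codon out of frame.
genome_cds = transcript_cds[:6] + "TA" + transcript_cds[6:]

print(first_in_frame_stop(transcript_cds))  # stop at the final codon
print(first_in_frame_stop(genome_cds))      # premature stop after the shift
```

In this toy case the two extra bases create a premature stop codon; a transcript record that leaves them out keeps the intended frame.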

RefSeq curation & genome maintenance

• RefSeq reported this sequence issue to the GRC

GRCh38 ALT LOCI and PATCHES

Pre-patch & ALT review
Polymorphic pseudogenes
Haplotype & CNV variation

ALT-specific RefSeq records
Curator-stored placement data

Evidence-based genome annotation pipeline
Manual curation

Assembly-ALT alignments
Alignment quality reports

Subsequent genome annotation build corrects the annotation

Interim alignment updates

Polymorphic pseudogenes

• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene

• Curators store assembly placement information (chromosome versus ALT) in a local database

• The annotation pipeline uses this information to ensure correct annotation

An example – GSTT cluster on chromosome 22:

Assembly Unit     GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
GRCh38 chr22      null     pseudo   coding   pseudo   null
ALT_REF_LOCI_1    coding   coding   coding   pseudo   pseudo
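The placement data for this cluster can be pictured as a lookup keyed by assembly unit and gene. The sketch below is only an illustration of that idea: the dictionary layout and the function name `genes_that_differ` are hypothetical and do not describe the curators' actual database.

```python
# Gene status per assembly unit, copied from the GSTT table on the slide.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}


def genes_that_differ(unit_a, unit_b):
    """List genes whose status differs between two assembly units."""
    a, b = PLACEMENT[unit_a], PLACEMENT[unit_b]
    return [gene for gene in a if a[gene] != b[gene]]


differing = genes_that_differ("GRCh38 chr22", "ALT_REF_LOCI_1")
print(differing)
```

A lookup like this makes explicit which genes need assembly-specific handling: here GSTT1, GSTT2 and GSTTP2 differ between the chromosome and the ALT locus.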

GSTT* variation, chromosome 22

• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more

• Accurate gene annotation is important to downstream users

[Figure: GSTT region on GRCh38 chr22 and on the GRCh38 ALT locus; labels: pseudogene, chr22 = null allele, coding allele]

ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer

GSTT2 polymorphism

[Figure: GSTT2 on GRCh38 chr22 and on the GRCh38 ALT locus. One allele is labeled "AT splice donor, premature stop codon"; the other is labeled "GT splice donor, stop codon". GRCh38 chr22 GSTT2 is shown as a pseudogene.]
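Most introns begin with the dinucleotide GT at the splice donor site, so a GT to AT change is one simple signal that an allele may not splice normally. The check below is a toy sketch with invented intron sequences, not the GSTT2 sequence, and it ignores rare non-canonical intron classes such as AT-AC introns.

```python
def has_canonical_donor(intron):
    """True if the intron starts with the canonical GT donor dinucleotide."""
    return intron[:2].upper() == "GT"


# Invented intron starts for illustration only.
gt_allele_intron = "GTAAGTCTTC"
at_allele_intron = "ATAAGTCTTC"

print(has_canonical_donor(gt_allele_intron))
print(has_canonical_donor(at_allele_intron))
```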

Data access
• Genes:
  • <…ncbi root url…>/gene/
  • ftp://ftp.ncbi.nlm.nih.gov/gene/
  • NCBI YouTube ‘Download genomic sequence for a gene’
    • https://www.youtube.com/watch?v=RHz2nZbzjpA
• RefSeq transcripts and proteins:
  • Links from NCBI Gene
  • Nucleotide/protein query:
    • human[organism] + use facets to specify RefSeq and molecule type
  • ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
• NCBI Genome Annotation
  • Links from NCBI Assembly or Genome resources
    • <ncbi>/assembly/ or <ncbi>/genome/
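The nucleotide query above can also be approximated programmatically through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and does not send it. The filter terms `srcdb_refseq[PROP]` and `biomol_mrna[PROP]` are assumed here as rough equivalents of the web facets, and the function name `build_refseq_query_url` is hypothetical.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_query_url(molecule="mrna", retmax=20):
    """Build an esearch URL for human RefSeq nucleotide records."""
    # Assumption: these property filters stand in for the web-page facets.
    term = (
        "human[organism] AND srcdb_refseq[PROP] "
        f"AND biomol_{molecule}[PROP]"
    )
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_refseq_query_url()
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return matching record identifiers; NCBI's usage guidelines for E-utilities apply to any such requests.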

Data access to annotated genome

Gene

Assembly details

Genome FTP formats
• FASTA
  • genome, transcripts, proteins
• GenBank file format
  • genome, transcripts, proteins
• GFF genome annotation
• Feature table
  • features and locations in tabular format
• AGP, Assembly details & statistics
• Repeat masker results
• Md5checksums
• Documentation
  • README files
  • <ncbi>/genome/doc/ftpfaq/
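As an illustration of working with the GFF annotation file, the sketch below tallies gene features by biotype from GFF3 lines, which have nine tab-separated columns with `key=value` attributes in the last one. It assumes gene records carry a `gene_biotype` attribute and use the feature types `gene` or `pseudogene`; the sample records are made up and the function names are hypothetical.

```python
from collections import Counter


def parse_attributes(field):
    """Split a GFF3 column-9 string such as 'ID=x;Name=y' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def count_gene_biotypes(lines):
    """Count gene and pseudogene features by their gene_biotype attribute."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("gene", "pseudogene"):
            continue  # only gene-level features are tallied
        counts[parse_attributes(cols[8]).get("gene_biotype", "unknown")] += 1
    return counts


# Made-up records for illustration only (not real annotation).
sample = [
    "##gff-version 3",
    "\t".join(["toy_chr", "example", "gene", "100", "900", ".", "+", ".",
               "ID=gene0;gene_biotype=protein_coding"]),
    "\t".join(["toy_chr", "example", "mRNA", "100", "900", ".", "+", ".",
               "ID=rna0;Parent=gene0"]),
    "\t".join(["toy_chr", "example", "pseudogene", "1500", "2100", ".", "-", ".",
               "ID=gene1;gene_biotype=pseudogene"]),
]

biotype_counts = count_gene_biotypes(sample)
print(dict(biotype_counts))
```

Run over a real annotation file, a tally like this would be one way to reproduce gene counts by category, subject to how each release encodes biotypes.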

Acknowledgements

RefSeq Curators: Eric Cox, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Dave Webb, Matt Wright

Annotation pipeline: Paul Kitts, Terence Murphy, Francoise Thibaud-Nissen

Susan Hiatt

www.ncbi.nlm.nih.gov/refseq/

Collaborators
• Elspeth Bruford (HGNC)
• Jen Harrow (HAVANA)
• Locus-Specific Databases
• Expert databases
• Individual scientists

NCBI Posters & Booth 2405