1
Experimental Setup Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011] Synthetic hybrids with different levels of heterozygosity generated by pooling reads from C57/BL6 and four other strains Read statistics Strain variation Inference accuracy Allele Specific Gene/Isoform Expression Make cDNA & shatter into fragments A B C D E Map reads Allele Specific Gene Expression (GE) Allele Specific Isoform Expression (IE) A B C D E Sequence fragment ends H0 H1 H0 H1 H0 H1 H0 H1 H0 H1 H1 H0 H1 Inference of Allele Specific Expression Levels from RNA-Seq Data Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering Dept., University of Connecticut Current Approaches [Gregg et al., 2010]: parent-of-origin effect in hybrids of inbred mouse strains [McManus et al., 2010]: cis- and trans-regulatory effects in hybrids of inbred drosophila species [Heap et al., 2010]: allelic expression imbalance in human primary cells by allele coverage analysis for heterozygous SNP sites within transcripts [Turro et al., 2011]: allele specific isoform expression through SNP calling and diploid transcriptome construction [Missirian et al. , 2012]: parentally biased gene expression in Arabidopsis hybrids RNA-PhASE Analysis Pipeline Methods Hybrid Mapping Approach Independently map reads onto reference genome and transcriptome using bowtie (for Illumina or SOLiD reads) or tmap (for ION Torrent and 454 reads) Discard reads with multiple alignments in either genome or transcriptome, or unique but discordant alignments in both Discordance determined at base level to accomodate local alignments of long reads with indel errors (ION Torrent and 454) SNV Calling and Genotyping (SNVQ) [Duitama et al. 2012] Bayesian model for SNV discovery and genotype calling from RNA-Seq reads Phasing SNVs RefHap [Duitama et al. 2010] Based on finding a maximum-weight cut in each connected component of the read graph with edges between reads with overlapping alignments ; edge weights given by #mismateches Coverage Based Phasing Haplotypes in disconnected blocks of SNVs connected based on allele coverage at the their closest SNV sites Inference of Allele Specific Isoform Expression Diploid extension of IsoEM [Nicolae et al. 2011] Expectation-Maximization algorithm based on a probabilistic model that incorporates fragment length distribution, quality scores, read pairing and, if available, strand information Detection of Allelic Expression Imbalance Fisher Exact test for isoforms/genes with allelic expression change fold over a certain threshold Preliminary Results J. Duitama, et al., ReFHap: A Reliable and fast algorithm for Single Individual Haplotyping, Proc. ACM-BCB, pp. 160-169, 2010 J. Duitama and P.K. Srivastava and I.I. Mandoiu, Towards accurate detection and genotyping of expressed variants from Whole Transcriptome Sequencing data, BMC Genomics 13(Suppl 2):S6, 2012 C. Gregg et al., Sex-specific parent-of-origin allelic expression in the mouse brain, Science 239:682-685, 2010 G.A. Heap, et al, Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing, Human Molecular Genetics, 19(1):122134, 2010 T.M. Keane, et al., Mouse genomic variation and its efect on phenotypes and gene regulation, Nature 477(7364):289-294, 2011 C.J. McManus, et, al., Regulatory divergence in Drosophila revealed by mRNA-seq, Genome Research 20:816-825, 2010 V. Missirian, I. Henry, L. Comai, and Vladimir Filkov, POPE: Pipeline of Parentally-Biased Expression, Proc. ISBRA, LNCS 7292:177-188, 2012 M. Nicolae, S. Mangul, I.I. Mandoiu, A. Zelikovsky, Estimation of alternative splicing isoform frequencies from RNA-Seq data, Algorithms for Molecular Biology 6:9,2011 E. Turro, et al., Haplotype and isoform specific expression estimation using multimapping RNA-Seq reads, Genome Biology 12(2):R13, 2011 ACKNOWLEDGEMENTS: This work is supported in part by awards IIS-0546457 from Strain SNPs Private SNPs C57BL 9,844 1,488 BALBc 3,920,925 29,973 A/J 4,198,324 44,837 CAST 17,673,726 5,368,019 SPRET 35,441,735 23,455,525 Strain/Hybrid # read pairs # m apped pairs % m apped C57BL 57,187,342 21,756,070 38.0 BALBc 62,465,347 28,358,653 45.4 A/J 46,993,887 22,449,227 47.8 CAST 54,569,423 22,307,194 40.9 SPRET 57,411,555 19,016,949 33.1 C57BLxBALBc 114,374,684 47,682,108 41.7 C57BLxAJ 93,987,774 35,353,398 37.6 C57BLxCAST 109,138,846 43,134,951 39.5 C57BLxSPRET 114,374,684 40,780,806 35.7 C57BL x Strain Hybrid C57BL IE Strain IE C57BL GE Strain GE C57BL x BALBc 0.705 0.675 0.706 0.675 C57BL x AJ 0.855 0.902 0.856 0.903 C57BL x CAST 0.872 0.824 0.924 0.882 C57BL x SPRET 0.952 0.726 0.951 0.725 Pearson correlation between strain-specific FPKM values inferred from separate strain RNA-Seq reads vs. those inferred from pooled reads RNA-PhASE pipeline addresses limitations of existing ASE methods Does not require prior availability of diploid genome/transcriptome Mapping reads against the diploid transcriptome reconstructed on-the-fly resolves bias towards reference alleles EM model improves inference accuracy by using all reads, including those that map to more than one isoform Conclusions and Ongoing Work References & Acknowledements

Allele Specific Gene/Isoform Expression

  • Upload
    sheba

  • View
    68

  • Download
    3

Embed Size (px)

DESCRIPTION

Inference of Allele S pecific Expression Levels from RNA- Seq Data. H0. H1. Sahar Al Seesi and Ion M ă ndoiu. Allele Specific Gene/Isoform Expression. Make cDNA & shatter into fragments. Computer Science and Engineering Dept., University of Connecticut. - PowerPoint PPT Presentation

Citation preview

Page 1: Allele Specific Gene/Isoform Expression

Experimental SetupWhole brain RNA-Seq Data from Sanger Institute Mouse Genomes

Project [Keane et al. 2011]Synthetic hybrids with different levels of heterozygosity generated by

pooling reads from C57/BL6 and four other strains

Read statistics Strain variation

Inference accuracy

Allele Specific Gene/Isoform Expression

Make cDNA & shatter into fragments

A B C D E

Map reads

Allele Specific Gene Expression (GE) Allele Specific Isoform Expression (IE)

A B C D E

Sequence fragment ends

H0 H1

H0 H1

H0

H1

H0H1 H0

H1H1

H0H1

Inference of Allele Specific Expression Levels from RNA-Seq DataSahar Al Seesi and Ion Măndoiu

Computer Science and Engineering Dept., University of Connecticut

Current Approaches [Gregg et al., 2010]: parent-of-origin effect in hybrids of inbred mouse

strains [McManus et al., 2010]: cis- and trans-regulatory effects in hybrids of

inbred drosophila species [Heap et al., 2010]: allelic expression imbalance in human primary cells by

allele coverage analysis for heterozygous SNP sites within transcripts [Turro et al., 2011]: allele specific isoform expression through SNP calling

and diploid transcriptome construction [Missirian et al. , 2012]: parentally biased gene expression in Arabidopsis

hybrids

RNA-PhASE Analysis Pipeline

Methods Hybrid Mapping Approach

Independently map reads onto reference genome and transcriptome using bowtie (for Illumina or SOLiD reads) or tmap (for ION Torrent and 454 reads)

Discard reads with multiple alignments in either genome or transcriptome, or unique but discordant alignments in both

Discordance determined at base level to accomodate local alignments of long reads with indel errors (ION Torrent and 454)

SNV Calling and Genotyping (SNVQ) [Duitama et al. 2012] Bayesian model for SNV discovery and genotype calling from RNA-

Seq reads Phasing SNVs

RefHap [Duitama et al. 2010] Based on finding a maximum-weight cut in each connected

component of the read graph with edges between reads with overlapping alignments ; edge weights given by #mismateches

Coverage Based Phasing Haplotypes in disconnected blocks of SNVs connected based

on allele coverage at the their closest SNV sites Inference of Allele Specific Isoform Expression

Diploid extension of IsoEM [Nicolae et al. 2011] Expectation-Maximization algorithm based on a probabilistic model

that incorporates fragment length distribution, quality scores, read pairing and, if available, strand information

Detection of Allelic Expression Imbalance Fisher Exact test for isoforms/genes with allelic expression change

fold over a certain threshold

Preliminary Results

• J. Duitama, et al., ReFHap: A Reliable and fast algorithm for Single Individual Haplotyping, Proc. ACM-BCB, pp. 160-169, 2010• J. Duitama and P.K. Srivastava and I.I. Mandoiu, Towards accurate detection and genotyping of expressed variants from Whole

Transcriptome Sequencing data, BMC Genomics 13(Suppl 2):S6, 2012• C. Gregg et al., Sex-specific parent-of-origin allelic expression in the mouse brain, Science 239:682-685, 2010• G.A. Heap, et al, Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome

resequencing, Human Molecular Genetics, 19(1):122134, 2010• T.M. Keane, et al., Mouse genomic variation and its efect on phenotypes and gene regulation, Nature 477(7364):289-294, 2011• C.J. McManus, et, al., Regulatory divergence in Drosophila revealed by mRNA-seq, Genome Research 20:816-825, 2010• V. Missirian, I. Henry, L. Comai, and Vladimir Filkov, POPE: Pipeline of Parentally-Biased Expression, Proc. ISBRA, LNCS 7292:177-188, 2012• M. Nicolae, S. Mangul, I.I. Mandoiu, A. Zelikovsky, Estimation of alternative splicing isoform frequencies from RNA-Seq data, Algorithms

for Molecular Biology 6:9,2011• E. Turro, et al., Haplotype and isoform specific expression estimation using multimapping RNA-Seq reads, Genome Biology 12(2):R13,

2011ACKNOWLEDGEMENTS: This work is supported in part by awards IIS-0546457 from NSF, Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA NIFA, and a Collaborative Research Compact award from Life Technologies Corporation.

Strain SNPs Private SNPsC57BL 9,844 1,488 BALBc 3,920,925 29,973

A/J 4,198,324 44,837 CAST 17,673,726 5,368,019

SPRET 35,441,735 23,455,525

Strain/Hybrid # read pairs # mapped pairs % mapped

C57BL 57,187,342 21,756,070 38.0BALBc 62,465,347 28,358,653 45.4

A/J 46,993,887 22,449,227 47.8CAST 54,569,423 22,307,194 40.9SPRET 57,411,555 19,016,949 33.1

C57BL x BALBc 114,374,684 47,682,108 41.7C57BL x AJ 93,987,774 35,353,398 37.6

C57BL x CAST 109,138,846 43,134,951 39.5C57BL x SPRET 114,374,684 40,780,806 35.7

C57BL x StrainHybrid C57BL IE Strain IE C57BL GE Strain GE

C57BL x BALBc 0.705 0.675 0.706 0.675

C57BL x AJ 0.855 0.902 0.856 0.903

C57BL x CAST 0.872 0.824 0.924 0.882

C57BL x SPRET 0.952 0.726 0.951 0.725

Pearson correlation between strain-specific FPKM values inferred from separate strain RNA-Seq reads vs. those inferred from pooled reads

RNA-PhASE pipeline addresses limitations of existing ASE methods Does not require prior availability of diploid genome/transcriptome Mapping reads against the diploid transcriptome reconstructed on-

the-fly resolves bias towards reference alleles EM model improves inference accuracy by using all reads, including

those that map to more than one isoform In collaboration with the Michael and Rachel O’Neill labs, RNA-PhASE is

being used to identify parentally imprinted genes associated with autism

Conclusions and Ongoing WorkReferences & Acknowledements