Exploring the Role of Non-Coding DNA in the Function of the Human Genome through Variation.
Christine Bird
Hypothesis: Conserved non-coding DNA has a function in the human genome
Does human variation data suggest selection is acting on noncoding DNA?
Are conserved non-coding sequences selectively constrained?
Detection of fast evolving conserved non-coding sequence.
Exploring the properties and genomic context of human fast evolving non-coding regions.
The Human Genome:
~25,000 genes
1 to 1.5% of human DNA is coding
Is the remaining 98.5% “junk”?
Selective constraint in mammalian genomes
Waterston et al. Nature 2002
[Figure: genome divided into neutral and constrained (5%) sequence.]
Proportions of Lineage Specific Conserved non-coding (CNC) sequences
Margulies et al. PNAS 2005
418 MCSs (Multiple vertebrate Conserved Sequences) in 571 kb: 58 coding, 46 UTRs and 314 non-coding. ~27 species.
CNCs are evenly distributed in the human genome
Dermitzakis et al. Nat Rev Genet 2005
The density of CNCs and exons is negatively correlated
Dermitzakis et al. Nat Rev Genet 2005
Why study conserved non-coding DNA?
Abundance beyond that expected under neutral evolution.
If function is gene regulation, understanding is limited.
Gene regulation is considered a crucial contributor to evolutionary change (King and Wilson, 1975).
Conserved non-coding sequences (CNCs) may well harbour critical regulatory changes that have driven recent human evolution.
Conserved non-coding sequences
Top conserved 5% of the human genome as detected with a phylogenetic hidden Markov model (phyloHMM) (Siepel, 2005).
Best-in-genome pairwise alignments by blastz, followed by chaining.
A multiple alignment constructed by MULTIZ.
PhastCons constructs a two-state phylo-HMM for conserved and non-conserved regions.
Remove overlap with Ensembl gene annotation.
http://genome.ucsc.edu/
Are conserved non-coding sequences selectively constrained?
Conservation of non-coding sequence due to forces acting on the human genome.
CNC SNP density is only 82% of that of non-conserved non-coding sequence: 3.9 x 10^-4 vs. 4.8 x 10^-4; χ2 = 686, 1 df; p < 10^-99
Is this just due to low local mutation rates? Or are new alleles deleterious, and therefore less likely to be fixed in the population?
Address this by looking at the derived allele frequency (DAF) spectrum, which is unaffected by local mutation rates.
Drake et al. Nat Genet 2006
Derived Allele Frequency
Selective constraint shifts the distribution of constrained alleles toward rarer frequencies (Fay & Wu, 2000).
Allele frequencies in 4 populations from 210 unrelated individuals in the HapMap project:
CEU - American of European ancestry (60)
YRI - Yoruba from Nigeria (60)
JPT - Japanese from Tokyo (45)
CHB - Han Chinese from Beijing (45)
Derived Allele Frequency (DAF) was generated for 1 million Phase I HapMap SNPs & 4 million Phase II.
The ancestral allele was inferred by comparison to chimp and/or macaque.
SNPs were assigned to defined genomic features to allow comparison.
Drake et al. Nat Genet 2006
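The DAF steps described above (infer the ancestral allele from an outgroup, take the frequency of the other allele, then bin) can be sketched as follows. This is a minimal illustration, not the pipeline used for the HapMap data: the function names, the rule of discarding SNPs where chimp and macaque disagree, and the example counts are all assumptions made for the sketch.

```python
from collections import Counter

def infer_ancestral(human_alleles, chimp=None, macaque=None):
    """Return the ancestral allele, or None when it cannot be called."""
    outgroup = {base for base in (chimp, macaque) if base is not None}
    if len(outgroup) != 1:
        return None  # no outgroup base, or chimp and macaque disagree
    (base,) = outgroup
    return base if base in human_alleles else None

def derived_allele_frequency(allele_counts, chimp=None, macaque=None):
    """allele_counts maps each of the two human alleles to its chromosome count."""
    if len(allele_counts) != 2:
        return None  # only biallelic SNPs are handled in this sketch
    ancestral = infer_ancestral(allele_counts.keys(), chimp, macaque)
    if ancestral is None:
        return None
    derived = next(a for a in allele_counts if a != ancestral)
    return allele_counts[derived] / sum(allele_counts.values())

def daf_spectrum(dafs, n_bins=10):
    """Fraction of segregating SNPs falling in each equal-width DAF bin."""
    bins = Counter(min(int(d * n_bins), n_bins - 1) for d in dafs if 0 < d < 1)
    total = sum(bins.values())
    return [bins[i] / total for i in range(n_bins)] if total else [0.0] * n_bins

# Made-up example: 120 chromosomes, chimp and macaque both carry C,
# so T is the derived allele.
example = derived_allele_frequency({"C": 90, "T": 30}, chimp="C", macaque="C")
spectrum = daf_spectrum([0.05, 0.25, 0.25, 0.95])
```

Computing one spectrum per genomic feature (CNC, exon, intron, rest) gives the distributions that are compared on the following slides.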
CNCs are selectively constrained
Drake et al. Nat Genet 2006
[Figure: fraction of SNPs (y-axis) by binned derived allele frequency, low to high (x-axis); series: Conserved, Non-conserved; annotated "Selective constraint".]
Mann-Whitney U test; P << 10^-4
CNCs have an excess of low frequency derived alleles compared to Introns
[Figure: fraction of SNPs (y-axis) by binned derived allele frequency, low to high (x-axis); series: CNC, Exons, Introns, Rest.]
Mann-Whitney U test; CNC vs Introns P << 10^-16
CNC sequences are selectively constrained and not mutation cold spots
Nucleotide variation revealed strong selective constraints upon CNCs in human populations.
SNP density in CNCs is only 82% of that in non-conserved non-coding sequence (18% lower).
CNCs have an excess of low frequency derived alleles.
CNCs subject to purifying selection in humans, likely to harbour functionally important variants.
Drake et al. Nat Genet 2006
Why are they conserved?
Regions of the genome are therefore selectively constrained despite being non-coding.
But what is the reason for this conservation…?
What is novel about their biology? How can we tackle this question for so many elements? What are the most interesting regions?
A subset of CNCs undergoing rapid change with potential common properties or roles.
Why study fast-evolving non-coding?
If CNCs are part of chimpanzee-human lineage differentiation by changes in gene regulation then changes in their nucleotide sequence should be expected despite their overall conservation.
Following gene duplication, subfunctionalization can occur by the partitioning of gene regulation among descendant copies (Force, 1999).
Older models of gene duplication proposed an important role for positive selection after duplication (Bridges 1935, Ohno 1970, Ohta, 1987).
Subfunctionalization
[Diagram: duplicated gene with separated tissue-specific regulation (heart, brain).]
Duplicated genes preserved through subfunctionalization by the Duplication-Degeneration-Complementation model.
If CNCs are regulatory elements involved in this process they would have changed rapidly since duplication.
Lynch and Force, Genetics 2000
Detecting fast-evolving non-coding sequences
MULTIZ alignments (Webb Miller) of human, chimp and macaque:

Human    GACTACGTTTGGTTTAGAGAT   5 lineage-specific substitutions
Chimp    GACTGGCTTTACTTTTGAGAT   1 lineage-specific substitution
Macaque  GTCTGGGTTTACTTTTCAGAT   2 lineage-specific substitutions

Tajima's relative rate test, where S1 and S2 are the lineage-specific substitution counts on the human and chimp branches:

\chi^2 = \frac{(S_1 - S_2)^2}{S_1 + S_2}
Tajima, Genetics 1993
χ2 test of base substitutions.
Alignments = 304,291
Power to detect acceleration = 26,477
P < 0.05 Accelerated = 2,794 (11%)
Accelerated in chimp = 1438
Accelerated in human = 1356
ANC (Accelerated Non-Coding)
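The relative rate test above can be sketched in a few lines. The three 21-base sequences are the example alignment from the slide. The function names are made up for illustration, and treating macaque as the outgroup, skipping sites where all three species differ, and taking the p-value from a χ2 distribution with 1 df are assumptions of this sketch rather than details confirmed by the slides.

```python
import math

def lineage_specific_counts(ingroup1, ingroup2, outgroup):
    """Count substitutions unique to each lineage in an ungapped 3-way alignment."""
    if not (len(ingroup1) == len(ingroup2) == len(outgroup)):
        raise ValueError("aligned sequences must have equal length")
    s1 = s2 = s_out = 0
    for a, b, o in zip(ingroup1, ingroup2, outgroup):
        if a != b and b == o:
            s1 += 1      # change on the first ingroup lineage
        elif a != b and a == o:
            s2 += 1      # change on the second ingroup lineage
        elif a == b and a != o:
            s_out += 1   # change on the outgroup lineage
        # sites where all three differ are uninformative and skipped
    return s1, s2, s_out

def tajima_relative_rate(s1, s2):
    """Chi-square statistic (1 df) and p-value for unequal rates on two lineages."""
    if s1 + s2 == 0:
        raise ValueError("no lineage-specific substitutions to test")
    chi2 = (s1 - s2) ** 2 / (s1 + s2)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

human = "GACTACGTTTGGTTTAGAGAT"
chimp = "GACTGGCTTTACTTTTGAGAT"
macaque = "GTCTGGGTTTACTTTTCAGAT"

counts = lineage_specific_counts(human, chimp, macaque)
chi2, p_value = tajima_relative_rate(counts[0], counts[1])
print(counts, round(chi2, 3), round(p_value, 3))
```

For this toy alignment the human and chimp counts are 5 and 1, which is not significant at P < 0.05. A short block needs several substitutions before the test can reach significance at all, which is presumably why only a subset of alignments had power to detect acceleration.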
Are Accelerated Non-Coding (ANCs) sequences functional?
Compare to 3 sets of control sequences:
Power CNCs (not lineage specific): CNCs with >= 4 substitutions = 23,683
Non-accelerated CNCs: CNCs with < 4 substitutions = 277,814
DAF controls 1 & 2: 1,356 x 20 kb windows 500 kb from 5’ & 3’ of ANCs.
Repeat analyses excluding potential confounders: Segmental Duplications (SD), Copy Number Variants (CNV), pseudogenes and retroposed genes.
Are ANC sequences functional?
Does nucleotide variation data indicate particular modes of selection implying function? (Is acceleration recent or ancient?)
Derived allele frequency spectrum comparisons
Population differentiation, FST
Are ANCs involved in subfunctionalization? Is there enrichment in recently duplicated sequences?
What function do these rapidly evolving sequences have?
Association of ANC variation with expression levels of nearby genes
Excess of high frequency derived alleles in ANCs
Loss of constraint & Directional Selection?
[Figure: fraction of SNPs (y-axis) by binned derived allele frequency, in bins of 0.1 from >0-0.1 to 0.9-<1 (x-axis); series: Non-accelerated CNC, Control, ANC; annotated "Selective constraint".]
Mann-Whitney U test; Non-accelerated CNC vs ANCs P = 1.63 x 10^-6
Power CNCs are neutral
[Figure: fraction of SNPs (y-axis) by binned derived allele frequency (x-axis); series: Non-accelerated CNC, ANC, Control, Power.]
Mann-Whitney U test; Power CNC vs Control P = 0.15
Excess of rare alleles in ANCs excluding confounding elements
Loss of constraint & Directional Selection?
[Figure: fraction of SNPs (y-axis) by binned derived allele frequency, in bins of 0.1 from >0-0.1 to 0.9-<1 (x-axis); series: Non-accelerated CNC, Control, Power, ANC, ANC no confounding.]
Mann-Whitney U test; ANCs vs ANC no confounders P = 0.48
Detecting recent evolution and population-specific selection
A measure of population structure: Wright’s FST.
Compares the mean amount of genetic diversity found within subpopulations to that of the meta-population.
Sampling from 2 diverged subpopulations as if they were one panmictic population gives an excess of homozygotes & a deficiency of heterozygotes.
FST can be defined as:

F_{ST} = \frac{H_T - H_S}{H_T}

Calculated for ANCs, where:
MSG - mean square error within populations
MSP - mean square error between populations
nc - variance-corrected average sample size
Weir and Cockerham, Evolution 1984
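The heterozygosity definition of FST given above can be illustrated for a single biallelic SNP. This is a minimal sketch only: it takes HT as the expected heterozygosity of the pooled population and HS as the size-weighted mean of the within-population expected heterozygosities, and it does not attempt the sample-size corrections (MSG, MSP, nc) of the Weir and Cockerham estimator cited on the slide. The function names and example frequencies are assumptions.

```python
def heterozygosity(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def fst(freqs, sizes=None):
    """FST = (HT - HS) / HT from per-population allele frequencies.

    freqs: frequency of one allele in each subpopulation.
    sizes: optional sample sizes used as weights (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [n / total for n in sizes]
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = heterozygosity(p_bar)
    if h_t == 0:
        return float("nan")  # monomorphic in the pooled sample: FST undefined
    h_s = sum(w * heterozygosity(p) for w, p in zip(weights, freqs))
    return (h_t - h_s) / h_t

# Made-up frequencies for one SNP in two populations of equal size.
print(fst([0.2, 0.8]))
```

Unlike this plain version, a sample-size-corrected estimator can return slightly negative values for SNPs with little differentiation.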
ANC FST values higher than non-accelerated CNCs
[Figure: frequency (y-axis) by FST bin, in steps of 0.05 from "-0.05 to 0" up to "0.95 to 1" (x-axis); series: ANCs no confounding, ANCs, Power CNCs, Non-accelerated CNCs.]
Mann-Whitney-U-test; Non-accelerated CNCs vs ANCs P = 0.0504 ; Non-accelerated CNCs vs ANCs no confounders P = 0.0363
Enrichment in Segmental Duplications
Approximately 5-6% of the human genome in SDs (Bailey et al, Science 2002)
ANCs 8%
Power CNCs 10%
Non-accelerated CNCs 5%
Excess of ANCs and power CNCs in SDs (chi-square; P < 10^-4).
The general enrichment in SDs is not surprising, as it has been observed that sequence divergence is elevated in duplicated sequences (Hurles et al. GenBio. 2004; She et al. GenRes. 2006).
Excess of recent segmental duplications associated with ANCs
[Figure: fraction of category overlapping SDs (y-axis, 0 to 0.2) by % identity of SDs, 90 to 100 (x-axis); series: Non-accelerated CNCs, Power CNCs, ANC; annotated "Human Specific".]
Mann-Whitney U test; P << 10^-4
Testing for evidence of involvement in Gene Regulation
[Diagram: a SNP within an ANC near a gene, tested for association with the gene's mRNA level.]
ANC SNP- Expression Association
What is the functional impact of ANC variation on gene expression phenotypes?
47,294 transcripts probed in lymphoblastoid cell lines of 210 unrelated HapMap individuals.
Associate SNP genotypes within ANCs with transcript expression levels by linear regression.
Statistical significance adjusted following 10,000 permutations per gene.
Additive association model: linear regression, e.g. CC = 0, CT = 1, TT = 2.
[Figure: expression level (y-axis, 8.0 to 9.5) by genotype CC, CT, TT, coded 0, 1, 2 (x-axis).]
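The additive model and permutation step can be sketched as below. This is an illustrative outline under stated assumptions, not the analysis code behind the slides: genotypes and expression values are made up, the helper names are hypothetical, the permutation p-value is computed for a single SNP-transcript pair (the slide adjusts significance per gene), and fewer permutations are run than the 10,000 quoted to keep the example fast.

```python
import random

GENOTYPE_CODE = {"CC": 0, "CT": 1, "TT": 2}  # additive coding from the slide

def slope_and_r2(x, y):
    """Least-squares slope and r-squared of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0:
        raise ValueError("genotype is monomorphic in this sample")
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return sxy / sxx, r2

def permutation_p(x, y, n_perm=10000, seed=0):
    """Share of expression-label shuffles whose r-squared matches or beats the observed."""
    rng = random.Random(seed)
    _, observed = slope_and_r2(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, r2 = slope_and_r2(x, shuffled)
        if r2 >= observed - 1e-12:  # small tolerance for floating-point noise
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up genotypes and log expression values for six individuals.
genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
expression = [8.0, 8.2, 8.6, 8.8, 9.2, 9.4]
coded = [GENOTYPE_CODE[g] for g in genotypes]

slope, r2 = slope_and_r2(coded, expression)
p_value = permutation_p(coded, expression, n_perm=2000)
print(round(slope, 3), round(r2, 3), p_value)
```

Shuffling expression values against genotypes keeps each variable's own distribution intact while breaking any link between them, which is what makes the permuted r-squared values a usable null distribution.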
SNPs within ANCs are significantly associated with gene expression phenotypes.
Significant SNPs at the 0.01 permutation threshold:
68% of ANC SNPs tested (496 out of 729)
9% of Power CNC SNPs tested (1047 out of 11468)
A SNP within an ANC is 7 times more likely to be associated with gene expression levels than a SNP within a power CNC.
Significant at the 0.01 permutation threshold:
16% of ANCs tested (59 out of 366)
3% of Power CNCs tested (165 out of 5968)
Nucleotide variation within ANCs is 5 times more likely to be associated with gene expression levels than variation in a power CNC.
Tendency for derived alleles within ANCs to be associated with lower expression levels.
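The percentages and fold differences quoted above follow directly from the counts on the slide, as this short check shows. The variable and function names are made up for the illustration.

```python
def fraction(significant, tested):
    """Proportion of tested items that were significant."""
    return significant / tested

# Counts taken from the slide (0.01 permutation threshold).
snp_anc = fraction(496, 729)
snp_power = fraction(1047, 11468)
cnc_anc = fraction(59, 366)
cnc_power = fraction(165, 5968)

snp_ratio = snp_anc / snp_power
cnc_ratio = cnc_anc / cnc_power

print(f"SNPs: ANC {snp_anc:.1%}, power CNC {snp_power:.1%}, ratio {snp_ratio:.1f}")
print(f"CNCs: ANC {cnc_anc:.1%}, power CNC {cnc_power:.1%}, ratio {cnc_ratio:.1f}")
```

The exact ratios fall between 7 and 8 at the SNP level and between 5 and 6 at the CNC level, so the fold figures on the slide are rounded down.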
Summary
CNCs are not mutation cold spots but selectively constrained.
Fast evolving noncoding sequences in the human lineage have lost this constraint and some are potentially undergoing positive selection.
This may have contributed to some recent differentiation in human populations.
ANCs are enriched in the most recent segmental duplications.
SNPs in ANCs are associated with significant change in gene expression phenotypes.
Acknowledgements
Thanks to my joint supervisors Emmanouil Dermitzakis and Matthew Hurles and the members of their teams: Barbara Stranger, Dan Jeffares, Catherine Ingle, Julian Huppert, Antigone Dimas, Sarah Lindsay, Dan Andrews, Dan Turner, Chris Barnes.
Particular thanks to my other co-authors:
Webb Miller - human-chimpanzee-macaque alignments
Daryl Thomas - DAF for both phase I and II SNPs
Maureen Liu - quantifying gene density
The Rhesus Macaque Genome Sequencing Consortium (RMGSC) and the HapMap consortium for making data available, and the Wellcome Trust and MRC for funding.
Margulies et al. PNAS 2005
Fig. 3. Phylogenetic tree of vertebrate species. By using the generated 27-species multisequence alignment, branch lengths were calculated based on analysis of synonymous coding positions. The branch lengths (as substitutions per synonymous site) between human and each species are listed (with additional pair-wise branch lengths provided in the supporting information). The last common ancestor among the catarrhine primates (A) is estimated at 25 mya (36, 37), between the rodents and primates (B) at 75 mya (5, 6), between eutherians and metatherians (C) at 185 mya (14), between monotremes and other therians (D) at 200 mya (14), and between mammals and birds (E) at 310 mya (13).
Proportions of Lineage Specific Conserved non-coding sequences
Margulies et al. PNAS 2005
Fig. 4. Lineage specificity of MCSs. The proportion of nonexonic MCSs found in the sequences of species in each category is indicated. Note that virtually all MCSs overlapping known exonic sequences are present in all mammals (data not shown). All Mammals: cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus; Eutherian: cat, dog, cow, pig, rat, and mouse; Marsupials: N.A. opossum and wallaby; and Other: species combinations containing <2% of the analyzed MCSs (see the supporting information for the complete data set). Hashed areas of "All Mammals" reflect portions lacking one or both rodents, and hashed portions of "Eutherian Marsupials" reflect portions lacking both rodents.
Distribution of large and small CNCs (Conserved Non-Coding sequences) and exons on Hsa21
[Figure: frequency per Mb of exons, big CNCs and small CNCs along the long arm of Hsa21 (x-axis in megabases).]
Big CNCs: 70% ID, 100 bps ungapped
Small CNCs: 85% ID, 35-99 bps ungapped
Dermitzakis et al. Nature 2002
Conservation of CNCs in multiple species
[Figure: number of conserved sequences (x-axis, 0 to 220) by species (y-axis): Human, Mouse, Green Monkey, Lemur, Rabbit, Pig, Bat, Cat, Shrew, Elephant, Platypus, Wallaby; with a schematic of a human-mouse conserved block.]
Dermitzakis et al. 2003 Science
Drake et al. Nat Genet 2006
Testing DAF spectrum distributions
Non-parametric distributions of unequal sample size
Mann-Whitney U-test: compares the medians of two populations using the rank order of values in the two samples.
Kolmogorov-Smirnov (KS) test: measures differences between the entire distributions of two samples, in both shape and location, at the cost of being less sensitive to differences in location only.
KS is therefore less powerful than the Mann-Whitney U-test against the alternative hypothesis of a difference in location.
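A rank-based comparison of two DAF samples of unequal size can be sketched as below. This is a minimal illustration of the Mann-Whitney U test using midranks for ties and a normal approximation without continuity correction. It is not the implementation used for the slides, and the function name and DAF values are made up for the example.

```python
import math

def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    n1, n2 = len(sample_a), len(sample_b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] share one tied value
        midrank = (i + 1 + j) / 2       # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u_a, 1.0                 # all values tied: no evidence of a shift
    z = (u_a - mean_u) / math.sqrt(var_u)
    return u_a, math.erfc(abs(z) / math.sqrt(2))

# Made-up DAF values: the first sample is shifted toward rare derived alleles.
constrained = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15]
neutral = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
u_stat, p_value = mann_whitney_u(constrained, neutral)
print(u_stat, p_value)
```

Because only ranks enter the statistic, the test makes no assumption about the shape of the DAF distribution, which suits spectra that are strongly skewed toward rare alleles.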
No. of significant CNC to gene associations, and no. of significant CNCs of those tested, at thresholds 0.01, 0.001 and 0.0001

Population  Set    No. of       No. of  No. of probes  No. of        Significant associations   Significant CNCs of those tested
                   tested CNCs  SNPs    tested         associations  0.01   0.001  0.0001       0.01       0.001     0.0001
CEPH        ANC    387          555     8673           23330         77     9      0            59 (15%)   9 (2%)    0 (0%)
CEPH        Power  6232         8388    14906          350309        181    36     18           149 (2%)   33 (1%)   17 (0%)
CHB         ANC    356          499     8092           21291         83     13     0            56 (16%)   11 (3%)   0 (0%)
CHB         Power  5737         7579    14893          317518        202    41     15           159 (3%)   39 (1%)   15 (0%)
CHB & JPT   ANC    342          466     7919           20163         109    11     1            59 (17%)   9 (3%)    1 (0%)
CHB & JPT   Power  5474         7162    14852          301636        203    12     1            149 (3%)   12 (0%)   1 (0%)
JPT         ANC    355          490     8197           21166         88     12     0            59 (17%)   11 (3%)   0 (0%)
JPT         Power  5674         7531    14852          315476        241    48     20           194 (3%)   42 (1%)   19 (0%)
YRI         ANC    391          583     9118           24310         113    15     2            64 (16%)   15 (4%)   2 (1%)
YRI         Power  6724         9218    14908          381407        196    32     15           173 (3%)   30 (0%)   14 (0%)