Finding Sequence Motifs in Alu Transposons that Enhance the Expression of Nearby Genes
Kendra Baughman
York Marahrens’ Lab
UCLA
Overview
Goal
Background
Prior Studies
Strategy
Results
Remaining Tasks
Future Directions
Goal
Determine if there are motifs present among Alu elements near highly expressed genes, and missing from Alu elements near poorly expressed genes, that might contribute to gene expression
Background: Alu Elements
Repetitive sequence
Transposons (DNA sequences that make copies of themselves and insert elsewhere in the genome)
Over 1 million in human genome
~50 subfamilies categorized by sequence differences
Prior Studies
“Repetitive sequence environment distinguishes housekeeping genes”
Eller, Daniel et al. submitted
“Alu abundance positively correlates with gene expression level”
C.D. Eller et al., submitted
Higher Alu concentration near widely expressed genes
[Figure: percent Alu (y-axis, 0 to 20) for HK, TS and RS genes; p = 2e-45]
Higher Alu concentration near highly expressed genes
Alu Subfamilies
[Figure: number of Alu in the subfamily (y-axis) by subfamily (x-axis)]
Data
Human gene expression levels from microarray data (Stan Nelson’s lab, UCLA)
Alu information from UCSC Genome Browser, RepeatMasker tracks
Goal, reiterated
Determine if there are motifs present among Alu elements near highly expressed genes, and missing from Alu elements near poorly expressed genes, that might contribute to gene expression
Strategy
Find Alu “near” high and low expression genes (within 20kb)
Perform multiple sequence alignment on Alu sequences
Identify motifs preferentially conserved around highly expressed genes (these motifs could help the genes be highly expressed)
Strategy, step 1: Find Alu “near” high and low expression genes (within 20kb)
Used Perl scripts to extract information from MySQL databases
Grouped genes by expression level in R
Chose genes in top and bottom 20%
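The selection of the top and bottom 20% of genes can be sketched as follows. This is a minimal illustration in Python, not the R code used for the slides; the function name `split_by_expression`, the gene names, and the expression values are hypothetical.

```python
def split_by_expression(expression, fraction=0.2):
    """Return (high, low): gene names in the top and bottom `fraction` by expression.

    `expression` maps gene name -> expression level (hypothetical input format).
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    ranked = sorted(expression, key=expression.get)  # lowest expression first
    k = int(len(ranked) * fraction)                  # number of genes per group
    low = ranked[:k]
    high = ranked[len(ranked) - k:]
    return high, low


# Hypothetical toy data: ten genes with increasing expression levels.
expression = {
    "geneA": 1.0, "geneB": 2.5, "geneC": 3.1, "geneD": 4.8, "geneE": 5.0,
    "geneF": 6.2, "geneG": 7.7, "geneH": 8.4, "geneI": 9.9, "geneJ": 12.0,
}
high, low = split_by_expression(expression)
```

With ten genes and a 20% cutoff, each group holds two genes; the middle 60% is left out of both groups.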
[Figure: expression level (y-axis) by gene (x-axis)]
Screening the genes…
Used MySQL queries to determine flanking region
Used Perl scripts to screen
Alu located within 20kb of genes
Omitted Alu in overlapping flanking regions
Percentages of Alu thrown out

Flanking region   Chrom19 (1st 20 Mb)   Chrom10   Chrom1 (1st 20 Mb)
50kb              50%                   11%       17%
20kb              28%                   7%        7%
10kb              20%                   6%        3%
Screening the Alu…
[Figure: HI-gene and LO-gene with Alu labelled HI-Alu, ??-Alu and LO-Alu]
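The screening rule can be illustrated with a short Python sketch (the slides used MySQL queries and Perl scripts; this is not that code). It assumes that an Alu counts as "near" a gene when it lies within the gene body or within 20kb on either side, and that an Alu is ambiguous ("??") when it falls in the flanking regions of both a HI and a LO gene. The function name `classify_alu` and all coordinates are hypothetical.

```python
FLANK = 20_000  # 20kb flanking region, as on the slides


def classify_alu(alu_pos, genes, flank=FLANK):
    """Label an Alu by the expression class of the nearby genes.

    genes: list of (start, end, label) on one chromosome, label "HI" or "LO".
    Returns "HI", "LO", "??" (conflicting flanking regions; omitted from the
    analysis) or None (not near any selected gene).
    """
    labels = {
        label
        for start, end, label in genes
        if start - flank <= alu_pos <= end + flank
    }
    if not labels:
        return None
    if len(labels) > 1:
        return "??"  # assumption: overlapping HI and LO flanks are ambiguous
    return labels.pop()


# Hypothetical coordinates: one HI gene and one LO gene 25kb apart.
genes = [(100_000, 110_000, "HI"), (135_000, 150_000, "LO")]
alus = [95_000, 125_000, 160_000, 300_000]
result = {pos: classify_alu(pos, genes) for pos in alus}
```

In this toy layout the Alu at 125,000 sits inside both flanking regions and would be thrown out, which is the situation the percentages in the table above count.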
Strategy, step 2: Perform multiple sequence alignment on Alu sequences
Alignment Process…
First alignment tool: ClustalW
- Slow, inaccurate
Second alignment tool: T-COFFEE
- Can’t handle hundreds of sequences
Third alignment tool: MUSCLE
Aligning thousands of sequences = big gaps and processing limitations
Chose to analyze by subfamily (S, Sp/q)
- Aligned elements around highly expressed genes
- Aligned elements around poorly expressed genes
- Profile high/low alignment
- Consensus sequence alignment
Alignments of AluSp/q and AluS Elements
[Figure: alignments of AluS and AluSp/q viewed in Jalview; labels: High Alu, High conserv., Low conserv.]
Strategy, step 3: Identify motifs preferentially conserved around highly expressed genes (these motifs could help the genes be highly expressed)
AluS
Alu consensus sequence and frequency of consensus base
(frequency rows above High Alu refer to the High Alu consensus; rows below Low Alu refer to the Low Alu consensus)

All Alu:        0444762289674300448576809499545545409449808
Alu w/ a base:  *5547666896759699995769699999999999*9989979
High Alu:       TATCCACGCCTGCAAAATCTCAGCCACTCCCAAAGTTGCTGCG
Low Alu:        CANCC-CGCCT-CGTAATCCCAA--------AATGTT--TG-G
Alu w/ a base:  77488 66899 67444999995        455645  98 9
All Alu:        76044 55899 37444989894        454045  98 8

AluSp/q
Alu consensus sequence and frequency of consensus base
(frequency rows above High Alu refer to the High Alu consensus; rows below Low Alu refer to the Low Alu consensus)

All Alu:        0860005458443600233333323333333345400000000
Alu w/ a base:  596**65559458765699999978999999966566******
High Alu:       TGCTCAGAAATTTCTCGGCTCACTGCAACCTCCGTATCACCCC
Low Alu:        CG---A-AA--------------------CTCCGT--T---CT
Alu w/ a base:  56   5 69                    555655  6   99
All Alu:        55   4 58                    444544  0   77
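The two frequency rows can be reproduced from an alignment with a short Python sketch. It rests on assumptions the slides do not state: each digit is taken to be the frequency of the consensus base in tenths (rounded down), and "*" is taken to mean 100%. "All Alu" divides by every sequence in the alignment, while "Alu w/ a base" divides only by sequences that have a base (not a gap) in that column. The function names and the toy alignment are hypothetical, and the consensus here is simply the most common base per column.

```python
from collections import Counter


def digit(count, total):
    """Assumed encoding: tenths rounded down, '*' when count equals total."""
    if total == 0:
        return " "
    return "*" if count == total else str(10 * count // total)


def consensus_rows(aligned, gap="-"):
    """Return (consensus, all_alu_row, with_base_row) for equal-length sequences."""
    consensus, all_row, base_row = [], [], []
    for column in zip(*aligned):
        bases = [b for b in column if b != gap]
        if not bases:  # column made only of gaps
            consensus.append(gap)
            all_row.append("0")
            base_row.append(" ")
            continue
        base, count = Counter(bases).most_common(1)[0]
        consensus.append(base)
        all_row.append(digit(count, len(column)))  # among all sequences
        base_row.append(digit(count, len(bases)))  # among sequences with a base
    return "".join(consensus), "".join(all_row), "".join(base_row)


# Hypothetical four-sequence alignment.
aligned = ["ACGT", "ACGA", "A-GT", "T--T"]
consensus, all_alu, with_base = consensus_rows(aligned)
```

A column where the two rows differ sharply (low among all Alu, high among Alu with a base) is one where most elements have a gap but the few that carry a base agree with each other.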
Remaining Tasks
Analyze the remaining subfamilies
Determine whether identified motifs agree across subfamilies
BLAST motifs against all Alu sequences and correlate alignment scores with expression level
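The last task, correlating alignment scores with expression level, could look roughly like the Python sketch below. It does not run BLAST: the scores and expression values are hypothetical placeholders standing in for one alignment score per Alu and the expression level of its nearby gene, and Pearson correlation is only one possible choice of measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: motif alignment score per Alu, expression of nearby gene.
scores = [10.0, 20.0, 30.0, 40.0]
expression = [1.0, 2.0, 3.0, 4.0]
r = pearson(scores, expression)
```

A coefficient near +1 would suggest that Alu matching the motif more closely sit near more highly expressed genes; a rank-based measure could be substituted if the relationship is not linear.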
Future Directions
Cluster alignments into a relationship tree to see if HI and LO Alu groups cluster differently from each other
- Create a matrix of pairwise alignments and cluster these into a tree using nearest neighbour clustering
Use Hidden Markov Models or Gibbs sampling to identify sequence motifs (non-multiple sequence alignment method of motif finding)
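The clustering idea in the first item can be sketched in Python as nearest neighbour (single-linkage) clustering over a pairwise distance matrix. As an assumption for illustration, the distance between two aligned sequences is the fraction of columns where they differ; the slides do not say which pairwise score would be used. The function names, sequence names, and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned columns where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage(names, seqs):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    dist = {
        frozenset((i, j)): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
    # cluster id -> (subtree, indices of member sequences)
    clusters = {i: (names[i], [i]) for i in range(len(seqs))}
    next_id = len(seqs)

    def link(a, b):
        # Nearest neighbour rule: distance between the closest pair of members.
        return min(
            dist[frozenset((i, j))]
            for i in clusters[a][1]
            for j in clusters[b][1]
        )

    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: link(*p))
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical aligned Alu fragments from HI and LO groups.
names = ["HI_1", "HI_2", "LO_1", "LO_2"]
seqs = ["ACGTACGT", "ACGTACGA", "TTGTCCGT", "TTGTCCGA"]
tree = single_linkage(names, seqs)
```

In this toy case the HI sequences merge with each other before joining the LO pair, which is the kind of separation the proposed analysis would look for in real data.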
Acknowledgements
Danny Eller
York Marahrens
Marc Suchard
Chiara Sabatti
SoCalBSI
NIH/NSF