Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
BMS353
Bioinforma)csforhigh-throughputdata
BMS353
PartA: Introduc)ononhigh-throughputdata Experimentaldesign
Pla<ormsforgeneexpressiondata
Break–10min(ques5onsifany)PartB: Exponen)alandlogs
Lowlevelanalysis• GeneExpressionEs)ma)on• Normalisa)on
ClassAc)vity:debate.
Today’sOutline
BMS353
Todaywewillbelearning:
• Characteris)csofhigh-throughputdataandhowitisproduced
• WhatweintendforOMICS
• Howwedefinehigh-throughputdata
• Todefineprinciplesofexperimentaldesignandpipelines
• Methodsforgeneexpressionquan)fica)on
• Toes)mategeneexpressionlevelsfromdataanddifferenceinmethodsapplied
• Tonormalisethedataandwhatitmeans
BMS353
High-throughputdata
High-throughputdata is a large amountof data collectedusing automatedmethodsandnon-conven)onal technologies. This is to be able to performa largenumberofexperimentsatthesame)metomonitorabehaviorofasystem.
High throughput cell biology is the use of automa)on equipmentwith classical cellbiology techniques. This is to address biological ques)ons that are otherwiseunaPainableusingconven)onalmethods.High-throughput biology has created a new field of biology called OMICS. It is aresearch filed that interface between large scale biology (genome, proteome,transcriptome),technologyandcomputa)onalmethods.
BMS353
Characteris)csofhigh-throughputdata
Whatarethecharacteris)csofhigh-throughputdata?• Largeinsize
• Pronetomanyfalseposi)ves(lowspecificity)
• CapturebiologicalNoise
• Accurate
• Technicallynoisy
BMS353
NextGenera)onSequencing
OMICSdataisproducedmainlythroughmoderntechniquesforsequencingDNA.Itdiffersfromtradi)onalSangersequencingforitspowerofgenera)ngfastandefficientlyhigh-throughputdata.NextGenera)onSequencing(NGS)itisatechniqueusedtosequencemillionsofsmallindividualfragmentsofDNAinparallel.Thesesegmentsarethanmappedontothereferencegenomeunderstudytoiden)fy:- Genomicmuta)ons:SNPs,Dele)onInser)onandtransloca)on- Quan)fica)onofgeneexpression- Quan)fica)onandiden)fica)onofDNAmethyla)onloci- 3DGenomeofsingle–cellusingHi-C,amolecularmethodthatenablesthe
iden)fica)onofgenomicregionsthatareincloseproximity
BMS353
The highly interconnected hierarchical organiza)on and func)onal complexityneedasophis)catedintegratedsystemsapproach.
JournalofExperimentalBotany,Vol.66,No.2pp.479–493,2015
OMICSanditshierarchy
Tomicroscopictomacroscopic
BMS353
Whydoweneedmathema)cstounderstandbiology?
HandlingcomplexityHighdimensionsPaPernsinthedata
Computa(onalBiology---Bioinforma(cs
Quan)fica)onofgenomewidegeneexpressionGenenetworksandtargetpredic)onsProtein-ProteinInterac)onNetworksAlignmenttoolsMiningoflargedataAccessinglargedatafromrepositories
Biologicalsystems
LargeData
Pa9ernsandstructures
Quan)fica)onofUncertainty:Beabletopredictwhatwecannotmeasureandhaveatheore)calwayofquan)fyaccuracy
Modeling
BMS353
Toolsweneed
• Op)malExperimentalDesign–minimisethenoiseinthemeasurement
• Mathema)calModels–definerules(func)ons)todescribeprocesses
• Sta)s)caltools–quan)fyaccuracyinpredic)onandsensi)vityines)ma)on
• Computa)onalSkills–handlinglargeamountofdatainautomatedway
• Visualisa)ontools–iden)fypaPernsinthedata
BMS353
Theroleofexperimentaldesign
Definingyourresearchques(ons,keepinginmindlimita(onandeffec(veuseofthedata
Weneedtolimittheuncertaintyofthe“unknown”bydefineveryclearques)ons.Thishelpsto:• reducevariabilityinthedata.• reducingthecomplexityofthedatabyfocusingthesearchofinforma)onon
theques)onsyouhaveinmind.
AproperexperimentaldesignMUSTreflect:• thebiologicalques)onsthatyouareasking,• theprotocolsop)misedtominimisethevaria)onsinthedata• iden)fythelimita)onsofthedatacollected.
BMS353
1. Importance of defining your research ques)ons,keeping in mind limita)ons and effec)ve use ofthedata
2.Consistencyinsampleprepara)on,op)misa)onofthesamples,extensiveQCofthedata.LOOKatthedatageneratedandQCbeforeprocessing
3. Choose the correct model to analyse your data,define appropriate parameters (RNA-Seq analysis)togetthemaximuminforma)onoutofyourdata
4. Use the best tool to visualise your data, todiscriminate, cluster and rank your significanttargets
5.Usingofpathwayanalysisfordefiningnovelhypothesisthatcanbeinves)gatedwith“specifictools”,mathema)calandexperimental
BMS353
WorkflowandPipelinesinouranalysis
• Theyhelptodefinethecorrectimplementa)onofthedata
• Makesusthinkonthelogicalstepsweneedtotake
• Ouranalysiscanbereproducedonnewdataordifferentdata
Inthelabwehavethetasktoanalysearealcasestudy:datacollectedfromqPCRsingle-celldata
1.Reorganisa)onofdataQCdata
2.Datatransforma)on
3.Dataanalysis
normalisa)on
Exercise9and10
BMS353
All the varia)on and to ensure that the analysis of the data is asreproducible as the experimental collec)on of samples generate theneedtodefinepipelinesfortheanalysisofthedata
PIPELINES:Reproducibleandrobustprotocolsfornumericalexperimenta;ons.Incaseofbiologicaldatatheyaretailoredtothe
system/organismunderstudy.
Pipelines
HOWDOWEGETTHEM?
BMS353
Geneexpressiones)ma)on
DifferentplaLormsthatgenerategeneexpression:• TwoorOnecolorspoPedcDNAarrays
• Affymetrix-newHuman/Mousewholetranscriptomearrays• IlluminaArrays
• Reads–RNA-SeqandotherNGSassays
Interpretandanalysethedatabyfirstunderstandingwherethedataiscomingfrom
BMS353
20µm
Millions of copies of a specificoligonucleotide sequence element
Image of Hybridised Array
~ 1,000,000 differentcomplementary oligonucleotides
Single stranded, labeled RNA sampleOligonucleotide element
1.28cm
GeneChip® Array
Slide courtesy of Affymetrix
* **
**
BMS353
Example:HG_U133Plusv2AffymetrixgeneChipThesequencesfromwhichtheseprobesetswerederivedwereselectedfromGenBank®,dbEST,andRefSeq.Asinglearraycontainswithmorethan54,000probesetsrepresen)ngapproximately38,500genes(es)matedbyUniGenecoverage).70percentoftheprobesetsrepresentsubclusterassembliescontainingoneormorenon-ESTsequences.Ofthe16,737EST-basedprobesets,approximately9,000probesetscannowbeassociatedwithanmRNAorothernon-ESTsequence.
NowwithnewarraysHJAY…..
BMS353
_at:probesetsarepredictedtoperfectlymatchonlyasingletranscript _s_at:arepredictedtoperfectlymatchmul)pletranscripts,whichmaybefromdifferentgenesa_at:allprobesintheprobesethitalternatetranscriptofthesamegene_x_at:probesetswillcontainsomeprobesthatareiden)calorhighlysimilartoothersequencesfromdifferentgene.Hybridizeuniformlyacrossprobepairstotheintendedtarget
ProbeSetNota)on
(“_x”suffix)
BMS353
Gene expression quantification using RNA-Seq
• Sharedexons• Biologicalvarianceoffragments• Splicingvaria)ons
High-throughputsequencingofcDNA:
BMS353
ExampleofRNA-Seqworkflow
BMS353
Illuminasequencing–HiSeq
• Mappedreadcountpropor)onaltoabundanceoffragments• Abundanceoffragments≈(geneexpression)x(length)• Problemswhichlength?whichtranscripts?• Otherdifficul)es:mismatches,varyingqualityofreads,non-uniform
readdistribu)on
Whatareweabletodetect?
IlluminaHiSeqsequencer
BMS353
PartB
BMS353
What’sintheGeneChipdata?
Howwedefineameasurethatbestrepresenttheabsoluteexpressionlevelofeachgeneonthechip?
0500
1000150020002500
1 2 3 4 5 6 7 8 910111213141516 0200400600800
1000120014001600
1 2 3 4 5 6 7 8 9 10111213141516
PMMM
1.Summarisetoasingleexpressionleveltheprobeintensi)esforeachprobeset
2.Es)matethevaria)onsintroducedbybackgroundeffectprobeaffinityeffect
3.SomePM/MMpairsaremorereliablethanothers4.Thesignalneedstobescaledbeforecomparingdatafromdifferentarrays
BMS353
UseSta(s(calmethods(singlepointsta(s(cs)makeuseoftheinforma)onwehavetodefinevaluesthates)mategeneexpression
MAS5.RMA–GCRMAPLIER
Useaprobabilis(capproach(inWeek10-probabilityfunc(ons)
makeuseoftheobserveddatatoes)matefunc)onthathavegeneratedthatdataEs)matesofgeneexpressionwillbethemostprobablevaluethatsummarisestheprobeset
Theapproaches
BMS353
Datatransforma)on
Insta)s)cs,datatransforma)onreferstothemodifica)onofeverypointinadatasetbyamathema)calfunc)on.Data transforma)on ismostosenused to changedata to theappropriate formforapar)cularsta)s)caltestormethod.
Data transforma(on converts a set of data values from the format of itsoriginal source into a new format that bePer suits the des)na)on systemwherethedataisthenmanipulated/used/minedetc...
• calculatethegeneexpressionasthe2^{-deltaCt}Inprac)ce:
deltaCt<-sweep(Ct_full,MARGIN=2,Ct_full[1,],FUN="-")Gene_expression=2^-deltaCt
BMS353
DataTransforma)on:Exponen)alfunc)ons
y = f (x) = xe = exp(x)
y = f (x,µ,σ ) = 1σ 2π
exp(− (x −µ)2
2σ 2 )
ThenumbereistheEuler’snumber,anirra)onalnumberThefirstfewdigitsare:2.7182818284590452353602874713527(andmore...)
GaussianFunc)on.
1+ x + x2 +…+ xn
xn * xm = xn−m
xn / xm = xn−m
BMS353
y=logbxifandonlyifby=x,wherex>0,b>0,andb≠1.
logb(xy) = logbx + logby logb(x/y) = logbx - logby logb(xn) = n logbx logbx = logax / logab
DataTransforma)on:Logarithmicfunc)ons
BMS353
Es)ma)onofGeneexpression
1.Howwedefineameasurethatbestrepresenttheabsoluteexpressionlevelofeachgeneonthechip?
0500
1000150020002500
1 2 3 4 5 6 7 8 910111213141516 0200400600800
1000120014001600
1 2 3 4 5 6 7 8 9 10111213141516
PMMM
2.Howweensurethatcomparisonacrosschipsinunbiased?
BMS353
Normalisa)on:removingbias
Whydoweneedtonormalisethedata?
1.wewanttocompareacrosschips2.weneedtoensurethatallthedataisequallycomparedacrossbaselinewithinthechip
Mostmethodswillhavenormalisa)onstepincorporated,someotherwillneedtoperformitasergeneexpressiones)ma)onScaling–MeanandMedianQuan)leLoess(notrelevantforaffymetrixdataandsequencingdata)
BMS353
Normalisa)on:scaling
Theassump)onthatnormalisingusingquan(lesorscalingisreasonable,isbasedontheassump)onthat“mostgenesdon’tchange”Ifthisunderlyingassump)onisdoub<ul,thenusingtheabovemethodsisnotadvisable.
Scalingnormalisa)on,linearlyscalethegeneexpressionvaluessothattheoverallmean(ormedian)arethesame.Themedianismorescale-invariant,butforthemostpartthereisliPleprac)caldifference.
BMS353
Normalisa)on:quan)le
Assumethatthedistribu)onsofprobeintensi)esshouldbecompletelythesameacrosschips.Startwithnarrays,andpprobes,andforma[p,n]matrixX.Rankfirst:SortthecolumnsofX,sothattheentriesinagivenrowcorrespondtoafixedquan)le(Q2=themedian==mean).Thenalign:Replaceallentriesinthatrowwiththeirmeanvalue.
In sta)s)cs, quan5le normaliza5on is a technique for making two distribu)onsiden)calinsta)s)calproper)es.Whenwequan)le-normaliseasampledistribu)ontoareferencedistribu)onofthesamelength,wealignthesampledistribu)ontothereferencesotomakethemthesame.
BMS353
Microarray Suite (MAS5.0)
Signal~TukeyBiweight(log2(PMj–IMj))
• Signal = Smoothed average over PM,MM pairs representing a gene
• Signal is always positive: Absent - Present Call
Correction for global background.- based on 11 sectors on each array
Ideal mismatch (IM) intensity calculated from MM value and subtracted from PM.
- if MM < PM then IM = MM - if MM > PM then IM = PM – correction value
BMS353
MAS5:characteris)cs
• Notveryprecise• accurateonlywhenmanyreplicatesareavailable.• DependentstronglyonMM• Useslinearscalingnormalisa)on
BMS353
• Subtract background for each array from PM
• Intensity- dependent normalisation of PM-Bkgd
– Quantile normalisation :Fitallthechipstothesamedistribu)on.Scalethechipssothattheyhavethesamemean.
• Log transform
Robust Multi-array Average (RMA)
Signal~Tukey (log2(PMj–bkgdj))
Signal = regression-based average over PM pair representing a gene
BMS353
• Precise• Onlyworksiftherearereplicates• accurateonlywhenmanyreplicatesareavailable.• Quan)lenormalisa)onflaPensthetails.Only
strongsignalsaredetected.
Robust Multi-array Average (RMA)
BMS353
• Visualisetheeffect:M-Aplot
• Correc)onoftheintensitydependantvaria)ons:
BMS353
DataVisualisa(on
ScaPerPlot
BoxPlot
Slide2Cy3Cy5Slide1
Cy3Cy5
median
Q3=75thpercen)le
Q1=25thpercen)le
minimum
maximum
MAPlot
LogAbundance
LogFoldChange
BMS353
Classac)vity:debatethefollowingques)ons:
1.HowmuchdoIneedtoknowaboutthesystemthatIamstudying?
2.Howmuchthetechnologiesthatareavailablefordatacollec)onneedtotosensi)veformysystem?
3.Whatissensi)vityandspecificity?
4.InsequencingwhatisareferencegenomeandhowIgetit?
5.Whenthehigh-throughputapproachisthecorrectapproachformyresearchques)on?