Bioinformacs for high- throughput data - Open Dataopendsi.cc/bioinformatics/assets/Lecture_Wk9.pdf · High-throughput data High-throughput data is a large amount of data collected

BMS353

Bioinforma)csforhigh-throughputdata

BMS353

PartA: Introduc)ononhigh-throughputdata Experimentaldesign

Pla<ormsforgeneexpressiondata

Break–10min(ques5onsifany)PartB: Exponen)alandlogs

Lowlevelanalysis•  GeneExpressionEs)ma)on•  Normalisa)on

ClassAc)vity:debate.

Today’sOutline

BMS353

Todaywewillbelearning:

•  Characteris)csofhigh-throughputdataandhowitisproduced

•  WhatweintendforOMICS

•  Howwedefinehigh-throughputdata

•  Todefineprinciplesofexperimentaldesignandpipelines

•  Methodsforgeneexpressionquan)fica)on

•  Toes)mategeneexpressionlevelsfromdataanddifferenceinmethodsapplied

•  Tonormalisethedataandwhatitmeans

BMS353

High-throughputdata

High-throughputdata is a large amountof data collectedusing automatedmethodsandnon-conven)onal technologies. This is to be able to performa largenumberofexperimentsatthesame)metomonitorabehaviorofasystem.

High throughput cell biology is the use of automa)on equipmentwith classical cellbiology techniques. This is to address biological ques)ons that are otherwiseunaPainableusingconven)onalmethods.High-throughput biology has created a new field of biology called OMICS. It is aresearch filed that interface between large scale biology (genome, proteome,transcriptome),technologyandcomputa)onalmethods.

BMS353

Characteris)csofhigh-throughputdata

Whatarethecharacteris)csofhigh-throughputdata?•  Largeinsize

•  Pronetomanyfalseposi)ves(lowspecificity)

•  CapturebiologicalNoise

•  Accurate

•  Technicallynoisy

BMS353

NextGenera)onSequencing

OMICSdataisproducedmainlythroughmoderntechniquesforsequencingDNA.Itdiffersfromtradi)onalSangersequencingforitspowerofgenera)ngfastandefficientlyhigh-throughputdata.NextGenera)onSequencing(NGS)itisatechniqueusedtosequencemillionsofsmallindividualfragmentsofDNAinparallel.Thesesegmentsarethanmappedontothereferencegenomeunderstudytoiden)fy:-  Genomicmuta)ons:SNPs,Dele)onInser)onandtransloca)on-  Quan)fica)onofgeneexpression-  Quan)fica)onandiden)fica)onofDNAmethyla)onloci-  3DGenomeofsingle–cellusingHi-C,amolecularmethodthatenablesthe

iden)fica)onofgenomicregionsthatareincloseproximity

BMS353

The highly interconnected hierarchical organiza)on and func)onal complexityneedasophis)catedintegratedsystemsapproach.

JournalofExperimentalBotany,Vol.66,No.2pp.479–493,2015

OMICSanditshierarchy

Tomicroscopictomacroscopic

BMS353

Whydoweneedmathema)cstounderstandbiology?

HandlingcomplexityHighdimensionsPaPernsinthedata

Computa(onalBiology---Bioinforma(cs

Quan)fica)onofgenomewidegeneexpressionGenenetworksandtargetpredic)onsProtein-ProteinInterac)onNetworksAlignmenttoolsMiningoflargedataAccessinglargedatafromrepositories

Biologicalsystems

LargeData

Pa9ernsandstructures

Quan)fica)onofUncertainty:Beabletopredictwhatwecannotmeasureandhaveatheore)calwayofquan)fyaccuracy

Modeling

BMS353

Toolsweneed

•  Op)malExperimentalDesign–minimisethenoiseinthemeasurement

•  Mathema)calModels–definerules(func)ons)todescribeprocesses

•  Sta)s)caltools–quan)fyaccuracyinpredic)onandsensi)vityines)ma)on

•  Computa)onalSkills–handlinglargeamountofdatainautomatedway

•  Visualisa)ontools–iden)fypaPernsinthedata

BMS353

Theroleofexperimentaldesign

Definingyourresearchques(ons,keepinginmindlimita(onandeffec(veuseofthedata

Weneedtolimittheuncertaintyofthe“unknown”bydefineveryclearques)ons.Thishelpsto:•  reducevariabilityinthedata.•  reducingthecomplexityofthedatabyfocusingthesearchofinforma)onon

theques)onsyouhaveinmind.

AproperexperimentaldesignMUSTreflect:•  thebiologicalques)onsthatyouareasking,•  theprotocolsop)misedtominimisethevaria)onsinthedata•  iden)fythelimita)onsofthedatacollected.

BMS353

1. Importance of defining your research ques)ons,keeping in mind limita)ons and effec)ve use ofthedata

2.Consistencyinsampleprepara)on,op)misa)onofthesamples,extensiveQCofthedata.LOOKatthedatageneratedandQCbeforeprocessing

3. Choose the correct model to analyse your data,define appropriate parameters (RNA-Seq analysis)togetthemaximuminforma)onoutofyourdata

4. Use the best tool to visualise your data, todiscriminate, cluster and rank your significanttargets

5.Usingofpathwayanalysisfordefiningnovelhypothesisthatcanbeinves)gatedwith“specifictools”,mathema)calandexperimental

BMS353

WorkflowandPipelinesinouranalysis

•  Theyhelptodefinethecorrectimplementa)onofthedata

•  Makesusthinkonthelogicalstepsweneedtotake

•  Ouranalysiscanbereproducedonnewdataordifferentdata

Inthelabwehavethetasktoanalysearealcasestudy:datacollectedfromqPCRsingle-celldata

1.Reorganisa)onofdataQCdata

2.Datatransforma)on

3.Dataanalysis

normalisa)on

Exercise9and10

BMS353

All the varia)on and to ensure that the analysis of the data is asreproducible as the experimental collec)on of samples generate theneedtodefinepipelinesfortheanalysisofthedata

PIPELINES:Reproducibleandrobustprotocolsfornumericalexperimenta;ons.Incaseofbiologicaldatatheyaretailoredtothe

system/organismunderstudy.

Pipelines

HOWDOWEGETTHEM?

BMS353

Geneexpressiones)ma)on

DifferentplaLormsthatgenerategeneexpression:•  TwoorOnecolorspoPedcDNAarrays

•  Affymetrix-newHuman/Mousewholetranscriptomearrays•  IlluminaArrays

•  Reads–RNA-SeqandotherNGSassays

Interpretandanalysethedatabyfirstunderstandingwherethedataiscomingfrom

BMS353

20µm

Millions of copies of a specificoligonucleotide sequence element

Image of Hybridised Array

~ 1,000,000 differentcomplementary oligonucleotides

Single stranded, labeled RNA sampleOligonucleotide element

1.28cm

GeneChip® Array

Slide courtesy of Affymetrix

* **

**

BMS353

Example:HG_U133Plusv2AffymetrixgeneChipThesequencesfromwhichtheseprobesetswerederivedwereselectedfromGenBank®,dbEST,andRefSeq.Asinglearraycontainswithmorethan54,000probesetsrepresen)ngapproximately38,500genes(es)matedbyUniGenecoverage).70percentoftheprobesetsrepresentsubclusterassembliescontainingoneormorenon-ESTsequences.Ofthe16,737EST-basedprobesets,approximately9,000probesetscannowbeassociatedwithanmRNAorothernon-ESTsequence.

NowwithnewarraysHJAY…..

BMS353

_at:probesetsarepredictedtoperfectlymatchonlyasingletranscript _s_at:arepredictedtoperfectlymatchmul)pletranscripts,whichmaybefromdifferentgenesa_at:allprobesintheprobesethitalternatetranscriptofthesamegene_x_at:probesetswillcontainsomeprobesthatareiden)calorhighlysimilartoothersequencesfromdifferentgene.Hybridizeuniformlyacrossprobepairstotheintendedtarget

ProbeSetNota)on

(“_x”suffix)

BMS353

Gene expression quantification using RNA-Seq

•  Sharedexons•  Biologicalvarianceoffragments•  Splicingvaria)ons

High-throughputsequencingofcDNA:

BMS353

ExampleofRNA-Seqworkflow

BMS353

Illuminasequencing–HiSeq

•  Mappedreadcountpropor)onaltoabundanceoffragments•  Abundanceoffragments≈(geneexpression)x(length)•  Problemswhichlength?whichtranscripts?•  Otherdifficul)es:mismatches,varyingqualityofreads,non-uniform

readdistribu)on

Whatareweabletodetect?

IlluminaHiSeqsequencer

BMS353

PartB

BMS353

What’sintheGeneChipdata?

Howwedefineameasurethatbestrepresenttheabsoluteexpressionlevelofeachgeneonthechip?

0500

1000150020002500

1 2 3 4 5 6 7 8 910111213141516 0200400600800

1000120014001600

1 2 3 4 5 6 7 8 9 10111213141516

PMMM

1.Summarisetoasingleexpressionleveltheprobeintensi)esforeachprobeset

2.Es)matethevaria)onsintroducedbybackgroundeffectprobeaffinityeffect

3.SomePM/MMpairsaremorereliablethanothers4.Thesignalneedstobescaledbeforecomparingdatafromdifferentarrays

BMS353

UseSta(s(calmethods(singlepointsta(s(cs)makeuseoftheinforma)onwehavetodefinevaluesthates)mategeneexpression

MAS5.RMA–GCRMAPLIER

Useaprobabilis(capproach(inWeek10-probabilityfunc(ons)

makeuseoftheobserveddatatoes)matefunc)onthathavegeneratedthatdataEs)matesofgeneexpressionwillbethemostprobablevaluethatsummarisestheprobeset

Theapproaches

BMS353

Datatransforma)on

Insta)s)cs,datatransforma)onreferstothemodifica)onofeverypointinadatasetbyamathema)calfunc)on.Data transforma)on ismostosenused to changedata to theappropriate formforapar)cularsta)s)caltestormethod.

Data transforma(on converts a set of data values from the format of itsoriginal source into a new format that bePer suits the des)na)on systemwherethedataisthenmanipulated/used/minedetc...

•  calculatethegeneexpressionasthe2^{-deltaCt}Inprac)ce:

deltaCt<-sweep(Ct_full,MARGIN=2,Ct_full[1,],FUN="-")Gene_expression=2^-deltaCt

BMS353

DataTransforma)on:Exponen)alfunc)ons

y = f (x) = xe = exp(x)

y = f (x,µ,σ ) = 1σ 2π

exp(− (x −µ)2

2σ 2 )

ThenumbereistheEuler’snumber,anirra)onalnumberThefirstfewdigitsare:2.7182818284590452353602874713527(andmore...)

GaussianFunc)on.

1+ x + x2 +…+ xn

xn * xm = xn−m

xn / xm = xn−m

BMS353

y=logbxifandonlyifby=x,wherex>0,b>0,andb≠1.

logb(xy) = logbx + logby logb(x/y) = logbx - logby logb(xn) = n logbx logbx = logax / logab

DataTransforma)on:Logarithmicfunc)ons

BMS353

Es)ma)onofGeneexpression

1.Howwedefineameasurethatbestrepresenttheabsoluteexpressionlevelofeachgeneonthechip?

0500

1000150020002500

1 2 3 4 5 6 7 8 910111213141516 0200400600800

1000120014001600

1 2 3 4 5 6 7 8 9 10111213141516

PMMM

2.Howweensurethatcomparisonacrosschipsinunbiased?

BMS353

Normalisa)on:removingbias

Whydoweneedtonormalisethedata?

1.wewanttocompareacrosschips2.weneedtoensurethatallthedataisequallycomparedacrossbaselinewithinthechip

Mostmethodswillhavenormalisa)onstepincorporated,someotherwillneedtoperformitasergeneexpressiones)ma)onScaling–MeanandMedianQuan)leLoess(notrelevantforaffymetrixdataandsequencingdata)

BMS353

Normalisa)on:scaling

Theassump)onthatnormalisingusingquan(lesorscalingisreasonable,isbasedontheassump)onthat“mostgenesdon’tchange”Ifthisunderlyingassump)onisdoub<ul,thenusingtheabovemethodsisnotadvisable.

Scalingnormalisa)on,linearlyscalethegeneexpressionvaluessothattheoverallmean(ormedian)arethesame.Themedianismorescale-invariant,butforthemostpartthereisliPleprac)caldifference.

BMS353

Normalisa)on:quan)le

Assumethatthedistribu)onsofprobeintensi)esshouldbecompletelythesameacrosschips.Startwithnarrays,andpprobes,andforma[p,n]matrixX.Rankfirst:SortthecolumnsofX,sothattheentriesinagivenrowcorrespondtoafixedquan)le(Q2=themedian==mean).Thenalign:Replaceallentriesinthatrowwiththeirmeanvalue.

In sta)s)cs, quan5le normaliza5on is a technique for making two distribu)onsiden)calinsta)s)calproper)es.Whenwequan)le-normaliseasampledistribu)ontoareferencedistribu)onofthesamelength,wealignthesampledistribu)ontothereferencesotomakethemthesame.

BMS353

Microarray Suite (MAS5.0)

Signal~TukeyBiweight(log2(PMj–IMj))

•  Signal = Smoothed average over PM,MM pairs representing a gene

•  Signal is always positive: Absent - Present Call

Correction for global background.- based on 11 sectors on each array

Ideal mismatch (IM) intensity calculated from MM value and subtracted from PM.

- if MM < PM then IM = MM - if MM > PM then IM = PM – correction value

BMS353

MAS5:characteris)cs

•  Notveryprecise•  accurateonlywhenmanyreplicatesareavailable.•  DependentstronglyonMM•  Useslinearscalingnormalisa)on

BMS353

•  Subtract background for each array from PM

•  Intensity- dependent normalisation of PM-Bkgd

–  Quantile normalisation :Fitallthechipstothesamedistribu)on.Scalethechipssothattheyhavethesamemean.

•  Log transform

Robust Multi-array Average (RMA)

Signal~Tukey (log2(PMj–bkgdj))

Signal = regression-based average over PM pair representing a gene

BMS353

•  Precise•  Onlyworksiftherearereplicates•  accurateonlywhenmanyreplicatesareavailable.•  Quan)lenormalisa)onflaPensthetails.Only

strongsignalsaredetected.

Robust Multi-array Average (RMA)

BMS353

•  Visualisetheeffect:M-Aplot

•  Correc)onoftheintensitydependantvaria)ons:

BMS353

DataVisualisa(on

ScaPerPlot

BoxPlot

Slide2Cy3Cy5Slide1

Cy3Cy5

median

Q3=75thpercen)le

Q1=25thpercen)le

minimum

maximum

MAPlot

LogAbundance

LogFoldChange

BMS353

Classac)vity:debatethefollowingques)ons:

1.HowmuchdoIneedtoknowaboutthesystemthatIamstudying?

2.Howmuchthetechnologiesthatareavailablefordatacollec)onneedtotosensi)veformysystem?

3.Whatissensi)vityandspecificity?

4.InsequencingwhatisareferencegenomeandhowIgetit?

5.Whenthehigh-throughputapproachisthecorrectapproachformyresearchques)on?

Documents

Bioinformacs for high- throughput data - Open Dataopendsi.cc/bioinformatics/assets/Lecture_Wk9.pdf · High-throughput data High-throughput data is a large amount of data collected