Upload
lamthuan
View
213
Download
0
Embed Size (px)
Citation preview
Alison Motsinger-Reif, PhDBioinformatics Research Center
Department of StatisticsNorth Carolina State University
Pathway and Gene Set AnalysisPart 1
Theearlystepsofamicroarraystudy
• ScientificQuestion(biological)• Studydesign(biological/statistical)• ConductingExperiment(biological)• Preprocessing/NormalizingData(statistical)• Findingdifferentiallyexpressedgenes(statistical)
Adataexample
• Leeetal(2005)comparedadiposetissue(abdominalsubcutaenousadipocytes)betweenobeseandleanPimaIndians
• SampleswerehybridisedonHGu95e-Affymetrixarrays(12639genes/probesets)
• AvailableasGDS1498ontheGEOdatabase• Weselectedthemalesamplesonly
– 10obesevs9lean
The“Result”Probe Set ID log.ratio pvalue adj.p73554_at 1.4971 0.0000 0.000491279_at 0.8667 0.0000 0.001774099_at 1.0787 0.0000 0.010483118_at -1.2142 0.0000 0.013981647_at 1.0362 0.0000 0.013984412_at 1.3124 0.0000 0.022290585_at 1.9859 0.0000 0.025884618_at -1.6713 0.0000 0.025891790_at 1.7293 0.0000 0.035080755_at 1.5238 0.0000 0.035185539_at 0.9303 0.0000 0.035190749_at 1.7093 0.0000 0.035174038_at -1.6451 0.0000 0.035179299_at 1.7156 0.0000 0.035172962_at 2.1059 0.0000 0.035188719_at -3.1829 0.0000 0.035172943_at -2.0520 0.0000 0.035191797_at 1.4676 0.0000 0.035178356_at 2.1140 0.0001 0.035990268_at 1.6552 0.0001 0.0421
WhathappenedtotheBiology???
SlightlymoreinformativeresultsProbe Set ID Gene SymbolGene Title go biological process termgo molecular function term log.ratio pvalue adj.p73554_at CCDC80 coiled-coil domain containing 80--- --- 1.4971 0.0000 0.000491279_at C1QTNF5 /// MFRPC1q and tumor necrosis factor related protein 5 /// membrane frizzled-related proteinvisual perception /// embryonic development /// response to stimulus--- 0.8667 0.0000 0.001774099_at --- --- --- --- 1.0787 0.0000 0.010483118_at RNF125 ring finger protein 125 immune response /// modification-dependent protein catabolic processprotein binding /// zinc ion binding /// ligase activity /// metal ion binding-1.2142 0.0000 0.013981647_at --- --- --- --- 1.0362 0.0000 0.013984412_at SYNPO2 synaptopodin 2 --- actin binding /// protein binding1.3124 0.0000 0.022290585_at C15orf59 chromosome 15 open reading frame 59--- --- 1.9859 0.0000 0.025884618_at C12orf39 chromosome 12 open reading frame 39--- --- -1.6713 0.0000 0.025891790_at MYEOV myeloma overexpressed (in a subset of t(11;14) positive multiple myelomas)--- --- 1.7293 0.0000 0.035080755_at MYOF myoferlin muscle contraction /// blood circulationprotein binding 1.5238 0.0000 0.035185539_at PLEKHH1 pleckstrin homology domain containing, family H (with MyTH4 domain) member 1--- binding 0.9303 0.0000 0.035190749_at SERPINB9 serpin peptidase inhibitor, clade B (ovalbumin), member 9anti-apoptosis /// signal transductionendopeptidase inhibitor activity /// serine-type endopeptidase inhibitor activity /// serine-type endopeptidase inhibitor activity /// protein binding1.7093 0.0000 0.035174038_at --- --- --- --- -1.6451 0.0000 0.035179299_at --- --- --- --- 1.7156 0.0000 0.035172962_at BCAT1 branched chain aminotransferase 1, cytosolicG1/S transition of mitotic cell cycle /// metabolic process /// cell proliferation /// amino acid biosynthetic process /// branched chain family amino acid metabolic process /// branched chain family amino acid biosynthetic process /// branched chain family amino acid biosynthetic processcatalytic activity /// branched-chain-amino-acid transaminase activity /// branched-chain-amino-acid transaminase activity /// transaminase activity /// transferase activity /// identical protein binding2.1059 0.0000 0.035188719_at C12orf39 chromosome 12 open reading frame 39--- --- -3.1829 0.0000 0.035172943_at --- --- --- --- -2.0520 0.0000 0.035191797_at LRRC16A leucine rich repeat containing 16A--- --- 1.4676 0.0000 0.035178356_at TRDN triadin muscle contraction receptor binding 2.1140 0.0001 0.035990268_at C5orf23 chromosome 5 open reading frame 23--- --- 1.6552 0.0001 0.0421
Ifwearelucky,someofthetopgenesmeansomething tous
Butwhatiftheydon’t?
Andhowwhataretheresultsforothergeneswithsimilarbiologicalfunctions
Howtoincorporatebiologicalknowledge
• Thetypeofknowledgewedealwithisrathersimple:
Weknowgroups/setsofgenesthatforexample– Belongtothesamepathway– Haveasimilarfunction– Arelocatedonthesamechromosome,etc…
• Wewillassumethesegroupingstobegiven,i.e.wewillnotyetdiscussmethodsusedtodetectpathways,networks,geneclusters• Wewilllater!
Whatisapathway?
• Nocleardefinition– Wikipedia:“In biochemistry,metabolicpathways are
seriesofchemicalreactionsoccurringwithinacell.Ineachpathway,aprincipalchemicalismodifiedbychemicalreactions.”
– Thesepathwaysdescribeenzymesandmetabolites
• Butoftentheword“pathway”isalsousedtodescribegeneregulatorynetworksorproteininteractionnetworks
• Inallcasesapathwaydescribesabiologicalfunctionveryspecifically
WhatisaGeneSet?• Justwhatitsays:asetofgenes!
– AllgenesinvolvedinapathwayareanexampleofaGeneSet
– AllgenescorrespondingtoaGeneOntologytermareaGeneSet
– AllgenesmentionedinapaperofSmithetalmightformaGeneSet
• AGeneSetisamuchmoregeneralandlessspecificconceptthanapathway
• Still:wewillsometimesusetwowordsinterchangeably,astheanalysismethodsaremainlythesame
WhereDoGeneSets/ListsComeFrom?
• Molecularprofilinge.g.mRNA,protein– Identificationà Genelist– Quantificationà Genelist+values– Ranking,Clustering(biostatistics)
• Interactions:Proteininteractions,Transcriptionfactorbindingsites(ChIP)
• Geneticscreene.g.ofknockoutlibrary• Associationstudies(Genome-wide)
– Singlenucleotidepolymorphisms(SNPs)– Copynumbervariants(CNVs)
– ……..
WhatisGeneSet/Pathwayanalysis?
• Theaimistogiveonenumber(score,p-value)toaGeneSet/Pathway– Aremanygenesinthepathwaydifferentiallyexpressed(up-regulated/downregulated)
– Canwegiveanumber(p-value)totheprobabilityofobservingthesechangesjustbychance?
Goals• Pathwayandgenesetdataresources
• Geneattributes• Databaseresources
• GO,KeGG,Wikipathways,MsigDB• Geneidentifiersandissueswithmapping
• Differencesbetweenpathwayanalysistools• Selfcontainedvs.competitivetests• Cut-offmethodsvs.globalmethods• Issueswithmultipletesting
Goals• Pathwayandgenesetdataresources
• Geneattributes• Databaseresources
• GO,KeGG,Wikipathways,MsigDB• Geneidentifiersandissueswithmapping
• Differencesbetweenpathwayanalysistools• Selfcontainedvs.competitivetests• Cut-offmethodsvs.globalmethods• Issueswithmultipletesting
GeneAttributes• Functionalannotation
– Biologicalprocess,molecularfunction,celllocation
• Chromosomeposition• Diseaseassociation• DNAproperties
– TFbindingsites,genestructure(intron/exon),SNPs
• Transcriptproperties– Splicing,3’UTR,microRNA bindingsites
• Proteinproperties– Domains,secondaryandtertiarystructure,PTMsites
• Interactionswithothergenes
GeneAttributes• Functionalannotation
– Biologicalprocess,molecularfunction,celllocation
• Chromosomeposition• Diseaseassociation• DNAproperties
– TFbindingsites,genestructure(intron/exon),SNPs
• Transcriptproperties– Splicing,3’UTR,microRNA bindingsites
• Proteinproperties– Domains,secondaryandtertiarystructure,PTMsites
• Interactionswithothergenes
DatabaseResources• Usefunctionalannotationtoaggregategenesintopathways/genesets
• Anumberofdatabasesareavailable– Differentanalysistoolslinktodifferentdatabases– Toomanydatabasestogointodetailoneveryone– Commonlyusedresources:
• GO• KeGG• MsigDB• WikiPathways
PathwayandGeneSetdataresources
• TheGeneOntology(GO)database– http://www.geneontology.org/– GOoffersarelational/hierarchicaldatabase– Parentnodes:moregeneralterms– Childnodes:morespecificterms– Attheendofthehierarchytherearegenes/proteins– Atthetopthereare3parentnodes:biologicalprocess,molecularfunctionandcellularcomponent
• Example:wesearchthedatabasefortheterm“inflammation”
Thegenesonourarraythatcodeforoneofthe44geneproductswouldformthecorresponding “inflammation”geneset
WhatistheGeneOntology(GO)?
• Setofbiologicalphrases(terms)whichareappliedtogenes:– proteinkinase– apoptosis– membrane
• Ontology:Aformalsystemfordescribingknowledge
GOStructure• Termsarerelatedwithinahierarchy– is-a– part-of
• Describesmultiplelevelsofdetailofgenefunction
• Termscanhavemorethanoneparentorchild
GOStructurecell
membrane chloroplast
mitochondrial chloroplastmembrane membrane
is-apart-of
Speciesindependent.Somelower-leveltermsarespecifictoagroup, buthigher leveltermsarenot
WhatGOCovers?
• GOtermsdividedintothreeaspects:– cellularcomponent– molecularfunction– biologicalprocess
glucose-6-phosphate isomeraseactivity
Celldivision
Terms• WheredoGOtermscomefrom?
– GOtermsareaddedbyeditorsatEBIandgeneannotationdatabasegroups
– Termsaddedbyrequest– Expertshelpwithmajordevelopment– 27734terms,98.9%withdefinitions.
• 16731biological_process• 2385cellular_component• 8618molecular_function
• Genesarelinked,orassociated,withGOtermsbytrainedcuratorsatgenomedatabases– Knownas‘geneassociations’orGOannotations– Multipleannotationspergene
• SomeGOannotationscreatedautomatically
Annotations
AnnotationSources• Manualannotation
– Createdbyscientificcurators• Highquality• Smallnumber(time-consumingtocreate)
• Electronicannotation– Annotationderivedwithouthumanvalidation
• Computationalpredictions(accuracyvaries)• Lower‘quality’thanmanualcodes
• Keypoint:beawareofannotationorigin
EvidenceTypes• ISS: Inferred from Sequence/Structural Similarity• IDA: Inferred from Direct Assay• IPI: Inferred from Physical Interaction• IMP: Inferred from Mutant Phenotype• IGI: Inferred from Genetic Interaction• IEP: Inferred from Expression Pattern• TAS: Traceable Author Statement• NAS: Non-traceable Author Statement• IC: Inferred by Curator• ND: No Data available
• IEA: Inferred from electronic annotation
SpeciesCoverage
• Allmajoreukaryoticmodelorganismspecies
• HumanviaGOAgroupatUniProt
• SeveralbacterialandparasitespeciesthroughTIGRandGeneDB atSanger
• Newspeciesannotationsindevelopment
VariableCoverage
LomaxJ.GetreadytoGO!Abiologist's guidetotheGeneOntology.BriefBioinform.2005Sep;6(3):298-304.
ContributingDatabases– BerkeleyDrosophila GenomeProject(BDGP)– dictyBase (Dictyostelium discoideum)– FlyBase (Drosophilamelanogaster)– GeneDB (Schizosaccharomyces pombe,Plasmodiumfalciparum, Leishmania
major andTrypanosoma brucei)– UniProtKnowledgebase (Swiss-Prot/TrEMBL/PIR-PSD)andInterPro databases– Gramene (grains, including rice,Oryza)– MouseGenomeDatabase(MGD)andGeneExpressionDatabase(GXD) (Mus
musculus)– RatGenomeDatabase(RGD) (Rattus norvegicus)– Reactome– Saccharomyces GenomeDatabase(SGD) (Saccharomyces cerevisiae)– TheArabidopsis InformationResource(TAIR) (Arabidopsis thaliana)– TheInstituteforGenomicResearch(TIGR):databasesonseveralbacterial
species– WormBase (Caenorhabditis elegans)– ZebrafishInformationNetwork(ZFIN):(Danio rerio)
GOSlimSets• GOhastoomanytermsforsomeuses– Summaries(e.g.Piecharts)
• GOSlimisanofficialreducedsetofGOterms– Generic,plant,yeast
GOSoftwareTools
• GOresourcesarefreelyavailabletoanyonewithoutrestriction– Includestheontologies,geneassociationsandtoolsdevelopedbyGO
• OthergroupshaveusedGOtocreatetoolsformanypurposes– http://www.geneontology.org/GO.tools
AccessingGO:QuickGO
http://www.ebi.ac.uk/ego/
OtherOntologies
http://www.ebi.ac.uk/ontology-lookup
KEGGpathwaydatabase
• KEGG=KyotoEncyclopediaofGenesandGenomes– http://www.genome.jp/kegg/pathway.html– ThepathwaydatabasegivesfarmoredetailedinformationthanGO• Relationshipsbetweengenesandgeneproducts
– But:thisdetailedinformationisonlyavailableforselectedorganismsandprocesses
– Example:Adipocytokinesignalingpathway
KEGGpathwaydatabase
• Clickingonthenodesinthepathwayleadstomoreinformationongenes/proteins– Otherpathwaysthenodeisinvolvedwith– EntriesinGene/Proteindatabases– References– Sequenceinformation
• UltimatelythisallowstofindcorrespondinggenesonthemicroarrayanddefineaGeneSetforthepathway
Wikipathways
• http://www.wikipathways.org
• Awikipedia forpathways– Onecanseeanddownloadpathways– Butalsoeditandcontributepathways
• TheprojectislinkedtotheGenMAPP andPathvisio analysis/visualisationtools
MSigDB• MSigDB =MolecularSignatureDatabasehttp://www.broadinstitute.org/gsea/msigdb
• RelatedtothetheanalysisprogramGSEA• MSigDB offersgenesetsbasedonvariousgroupings– Pathways– GOterms– Chromosomalposition,…
SomeWarnings• Inmanycasesthedefinitionofapathway/genesetina
databasemightdifferfromthatofascientist
• Thenodesinpathwaysareoftenproteinsormetabolites;theactivityofthecorrespondinggenesetisnotnecessarilyagoodmeasurementoftheactivityofthepathway
• Therearemanymoreresourcesoutthere(BioCarta,BioPax)
• Commercialpackagesoftenusetheirownpathway/genesetdefinitions(Ingenuity,Metacore,Genomatix,…)
• GenesinagenesetareusuallynotgivenbyaProbeSetID,butrefertosomegenedatabase(Entrez IDs,Unigene IDs)
• Conversioncanleadtoerrors!
SomeWarnings• Inmanycasesthedefinitionofapathway/genesetina
databasemightdifferfromthatofascientist
• Thenodesinpathwaysareoftenproteinsormetabolites;theactivityofthecorrespondinggenesetisnotnecessarilyagoodmeasurementoftheactivityofthepathway
• Therearemanymoreresourcesoutthere(BioCarta,BioPax)
• Commercialpackagesoftenusetheirownpathway/genesetdefinitions(Ingenuity,Metacore,Genomatix,…)
• GenesinagenesetareusuallynotgivenbyaProbeSetID,butrefertosomegenedatabase(Entrez IDs,Unigene IDs)
• Conversioncanleadtoerrors!
GeneAttributes• Functionalannotation
– Biologicalprocess,molecularfunction,celllocation• Chromosomeposition• Diseaseassociation• DNAproperties
– TFbindingsites,genestructure(intron/exon),SNPs• Transcriptproperties
– Splicing,3’UTR,microRNA bindingsites• Proteinproperties
– Domains,secondaryandtertiarystructure,PTMsites• Interactionswithothergenes
SourcesofGeneAttributes
• Ensembl BioMart (eukaryotes)– http://www.ensembl.org
• Entrez Gene(general)– http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Modelorganismdatabases– E.g.SGD:http://www.yeastgenome.org/
• Manyothers…..
EnsemblBioMart• Convenientaccesstogenelistannotation
Selectgenome
Selectfilters
Selectattributestodownload
GeneandProteinIdentifiers• Identifiers(IDs)areideallyunique,stablenamesornumbersthathelptrackdatabaserecords– E.g.SocialInsuranceNumber,EntrezGeneID41232
• Geneandproteininformationstoredinmanydatabases– à GeneshavemanyIDs
• Recordsfor:Gene,DNA,RNA,Protein– Importanttorecognizethecorrectrecordtype– E.g.Entrez Generecordsdon’tstoresequence.TheylinktoDNAregions,RNAtranscriptsandproteins.
NCBIDatabaseLinks
http://www.ncbi.nlm.nih.gov/Database/datamodel/data_nodes.swf
NCBI:U.S.NationalCenterforBiotechnologyInformation
PartofNationalLibraryofMedicine(NLM)
CommonIdentifiersSpecies-specificHUGOHGNCBRCA2MGIMGI:109337RGD2219ZFINZDB-GENE-060510-3FlyBase CG9097WormBase WBGene00002299orZK1067.1SGDS000002187orYDL029WAnnotationsInterPro IPR015252OMIM600185Pfam PF09104GeneOntologyGO:0000724SNPs rs28897757ExperimentalPlatformAffymetrix 208368_3p_s_atAgilentA_23_P99452CodeLink GE60169Illumina GI_4502450-S
GeneEnsembl ENSG00000139618EntrezGene675UnigeneHs.34012
RNAtranscriptGenBankBC026160.1RefSeq NM_000059Ensembl ENST00000380152
ProteinEnsembl ENSP00000369497RefSeq NP_000050.2UniProt BRCA2_HUMANorA1YBP1_HUMANIPIIPI00412408.1EMBLAF309413PDB1MIU
Red=Recommended
IdentifierMapping
• SomanyIDs!– Mapping(conversion)isaheadache
• Fourmainuses– Searchingforafavoritegenename– Linktorelatedresources– Identifiertranslation
• E.g.Genestoproteins,Entrez GenetoAffy– Unificationduringdatasetmerging
• Equivalentrecords
IDMappingServices
• Synergizer– http://llama.med.harvard.edu/syner
gizer/translate/
• EnsemblBioMart– http://www.ensembl.org
• UniProt– http://www.uniprot.org/
IDMappingChallenges• Avoiderrors:mapIDscorrectly• Genenameambiguity– notagoodID
– e.g.FLJ92943,LFS1,TRP53,p53– Bettertousethestandardgenesymbol:TP53
• Excelerror-introduction– OCT4ischangedtoOctober-4
• Problemsreaching100%coverage– E.g.duetoversionissues– Usemultiplesourcestoincreasecoverage
ZeebergBRetal.Mistakenidentifiers: genenameerrorscanbeintroducedinadvertentlywhenusingExcelinbioinformatics BMCBioinformatics.2004Jun 23;5:80
Goals• Pathwayandgenesetdataresources
• Geneattributes• Databaseresources
• GO,KeGG,Wikipathways,MsigDB• Geneidentifiersandissueswithmapping
• Differencesbetweenpathwayanalysistools• Selfcontainedvs.competitivetests• Cut-offmethodsvs.globalmethods• Issueswithmultipletesting
Goals• Pathwayandgenesetdataresources
• Geneattributes• Databaseresources
• GO,KeGG,Wikipathways,MsigDB• Geneidentifiersandissueswithmapping
• Differencesbetweenpathwayanalysistools• Selfcontainedvs.competitivetests• Cut-offmethodsvs.globalmethods• Issueswithmultipletesting
AimsofAnalysis
• Reminder:Theaimistogiveonenumber(score,p-value)toaGeneSet/Pathway– Aremanygenesinthepathwaydifferentiallyexpressed(up-regulated/downregulated)?
– Canwegiveanumber(p-value)totheprobabilityofobservingthesechangesjustbychance?
– Similartosinglegeneanalysisstatisticalhypothesistestingplaysanimportantrole
Generaldifferencesbetweenanalysistools
• Selfcontainedvs competitivetest– Thedistinctionbetween“self-contained”and“competitive”methodsgoesbacktoGoeman andBuehlman (2007)
– Aself-containedmethodonlyusesthevaluesforthegenesofageneset
• Thenullhypothesishereis:H={“NogenesintheGeneSetaredifferentiallyexpressed”}
– Acompetitivemethodcomparesthegeneswithinthegenesetwiththeothergenesonthearrays
• HerewetestagainstH:{“ThegenesintheGeneSetarenotmoredifferentiallyexpressedthanothergenes”}
Example:AnalysisfortheGO-Term“inflammatoryresponse”(GO:0006954)
BacktotheRealDataExample
• UsingBioconductor softwarewecanfind96probesetsonthearraycorrespondingtothisterm
• 8outofthesehaveap-value<5%
• Howmanysignificantgeneswouldweexpectbychance?
• Dependsonhowwedefine“bychance”
The“self-contained”version
• Bychance(i.e.ifitisNOTdifferentiallyexpressed)ageneshouldbesignificantwithaprobabilityof5%
• Wewouldexpect96x 5%=4.8significantgenes
• Usingthebinomialdistributionwecancalculatetheprobabilityofobserving8ormoresignificantgenesasp=0.108,i.e.notquitesignificant
The“competitive”version• Overall1272outof12639
genesaresignificantinthisdataset(10.1%)
• Ifwerandomlypick96geneswewouldexpect96x 10.1%=9.7genestobesignificant“bychance”
• Ap-valuecanbecalculatedbasedonthe 2x2table
• Testsforassociation:Chi-Square-TestorFisher’sexacttest
In GS Not in GSsig 8 1264
non-sig 88 11 279
P-valuefromFisher’sexacttest(one-sided):0.733,i.e veryfarfrombeingsignificant
CompetitiveTests• Competitiveresultsdependhighlyonhowmanygenesareon
thearrayandpreviousfiltering– Onasmalltargetedarraywhereallgenesarechanged,acompetitive
methodmightdetectnodifferentialGeneSetsatall
• Competitivetestscanalsobeusedwithsmallsamplesizes,evenforn=1– BUT:Theresultgivesnoindicationofwhetheritholdsforawider
populationofsubjects,thep-valueconcernsapopulationofgenes!
• Competitiveteststypicallygivelesssignificantresultsthanself-contained(asseenwiththeexample)
• Fisher’sexacttest(competitive)isprobablythemostwidelyusedmethod!
Cut-offmethodsvs wholegenelistmethods
• Aproblemwithbothtestsdiscussedsofaris,thattheyrelyonanarbitrarycut-off
• Ifwecallagenesignificantfor10%alphathresholdtheresultswillchange– Inourexamplethebinomialtestyieldsp=0.022,i.e.forthiscut-offtheresultissignificant!
• Wealsoloseinformationbyreducingap-valuetoabinary(“significant”,“non-significant”)variable– Itshouldmakeadifference,whetherthenon-significantgenesinthesetarenearlysignificantorcompletelyunsignificant
P-value histogram for inflammation genes
pvalue[incl]
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
05
1015
• Wecanstudythedistributionofthep-valuesinthegeneset
• Ifnogenesaredifferentiallyexpressedthisshouldbeauniformdistribution
• Apeakontheleftindicates,thatsomegenesaredifferentiallyexpressed
• WecantestthisforexamplebyusingtheKolmogorov-Smirnov-Test
• Herep=0.082,i.e.notquitesignificant
•Thiswouldbea“self-contained”test,asonlythegenesinthegenesetarebeingused
Kolmogorov-SmirnovTest
• TheKS-testcomparesanobservedwithanexpectedcumulativedistribution
• TheKS-statisticisgivenbythemaximumdeviationbetweenthetwo
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
Observed and Expected culmulative distribution
x
Fn(x)
Histogram of the ranks of p-values for inflammation genes
p.rank[incl]
Frequency
0 2000 4000 6000 8000 10000 12000 14000
05
1015
• AlternativelywecouldlookatthedistributionoftheRANKSofthep-valuesinourgeneset
• Thiswouldbeacompetitivemethod,i.e wecompareourgenesetwiththeothergenes
• AgainonecanusetheKolmogorov-Smirnovtesttotestforuniformity
• Here:p=0.851,i.e.veryfarfromsignificance
Othergeneralissues• Directionofchange
– Inourexamplewedidn’tdifferentiatebetweenupordown-regulatedgenes
– Thatcanbeachievedbyrepeatingtheanalysisforp-valuesfromone-sidedtest
• Eg.wecouldfindGO-Termsthataresignificantlyup-regulated– Withmostsoftwarebothapproachesarepossible
• MultipleTesting– AswearetestingmanyGeneSets,weexpectsomesignificantfindings
“bychance”(falsepositives)– Controllingthefalsediscoveryrateistricky:Thegenesetsdooverlap,
sotheywillnotbeindependent!• EvenmoretrickyinGOanalysiswherecertainGOtermsaresubsetofothers
– TheBonferroni-Methodismostconservative,butalwaysworks!
• Resampling strategies(dependencebetweengenes)– Themethodsweusedsofarinourexampleassumethatgenesareindependentofeachother…ifthisisviolatedthep-valuesareincorrect
– Resampling ofgroup/phenotypelabelscancorrectforthis
– Wegiveanexampleforourdataset
Multiple TestingforPathways
ExampleResampling Approach1. Calculatetheteststatistic,e.g.thepercentageofsignificant
genesintheGeneSet
2. Randomlyre-shufflethegrouplabels(lean,obese)betweenthesamples
3. Repeattheanalysisforthere-shuffleddatasetandcalculateare-shuffledversionoftheteststatistic
4. Repeat2and3manytimes(thousands…)
5. Weobtainadistributionofre-shuffled%ofsignificantgenes:thepercentageofre-shuffledvaluesthatarelargerthantheoneobservedin1isourp-value
• Thereshufflingtakesgenetogenecorrelationsintoaccount
• Manyprogramsalsooffertoresamplethegenes:ThisdoesNOTtakecorrelationsintoaccount
• Roughlyspeaking:– Resampling phenotypes:correspondstoself-containedtest
– Resampling genes:correspondstocompetitivetest
Resampling Approach
• Genesbeingpresentmorethanonce– Commonapproaches
• Combineduplicates(average,median,maximum,…)• Ignore(i.e treatduplicateslikedifferentgenes)
• Usingsummarystatisticsvs usingalldata– Ourexamplesusedp-valuesasdatasummaries– Otherapproachesusefold-changes,signaltonoiseratios,etc…
– Somemethodsarebasedontheoriginaldataforthegenesinthegenesetratherthanonasummarystatistic
Resampling Approaches
Resampling Approaches
• Theresamplingapproachesarehighlycomputationallyintensive
• Newmethodsarebeingdevelopedtospeedthisup– Empiricalapproximationsofpermutations– Empiricalpathwayanalysis,withoutpermutation.
• ZhouYH,BarryWT,WrightFA.Biostatistics.2013Jul;14(3):573-85.doi:10.1093/biostatistics/kxt004.Epub 2013Feb20.
Summary
• Databases• Choicemakesadifference• NotallusethesameIDs– watchoutJ• Majordifferencesbetweenmethods• Issueswithmultipletesting
• Nextlecture,willgointomoredetailonafewmethods
Questions?