scRNAseq normalization and gene set selection

Åsa Björklund (asa.bjorklund@scilifelab.se)

Outline

• Introduction
• Normalization
• Gene set selection
• Removal of confounders

Biological and technical variation

• Biological variation:
  – Cell type/state
  – Cell cycle
  – Cell size
  – Sex, age, …
  – Etc.

• Technical variation:
  – Cell quality
  – Library prep efficiency
  – Batch effects
  – Etc.

To identify cell types we would like to remove all other sources of variation.

UMIs do not solve the problem

Vallejos et al. Nature Methods 2017

Normalization

• Count normalization: for uneven sequencing depth
• Gene length normalization: for differences in gene detection due to gene length
• Drop-out rate normalization: for differences in RNA content/drop-out rates

Bulk RNAseq methods

• CPM: Controls for sequencing depth by dividing by the total count.
• RPKM/FPKM: Controls for sequencing depth and gene length. Good for technical replicates, not good for sample-to-sample comparisons due to compositional bias. Assumes total RNA output is the same in all samples.
• TPM: Similar to RPKM/FPKM. Corrects for sequencing depth and gene length. Also comparable between samples, but no correction for compositional bias.

Xi: observed count, li: length of the transcript, N: number of fragments sequenced
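As a rough illustration of the three measures, here is a minimal Python sketch using the symbols above (Xi, li, N). The toy counts and transcript lengths are made up for the example, and N is taken to be the sum of the counts shown.

```python
def cpm(counts):
    """Counts per million: X_i / N * 1e6, with N the total count."""
    n = sum(counts)
    return [x / n * 1e6 for x in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: X_i / (l_i / 1e3) / (N / 1e6)."""
    n = sum(counts)
    return [x / (l / 1e3) / (n / 1e6) for x, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: divide by length first, then rescale to 1e6."""
    rates = [x / l for x, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


counts = [100, 200, 700]        # X_i: toy observed counts for three genes
lengths = [1000, 2000, 7000]    # l_i: toy transcript lengths in bp

cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

In this toy case the counts are exactly proportional to transcript length, so CPM differs between the genes while RPKM and TPM are identical across them. TPM values always sum to one million within a sample, which is what makes them easier to compare between samples.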

• TMM/RLE/MRN: Improved assumption: the output between samples is similar only for a core set of genes. Corrects for compositional bias. RLE and MRN are very similar and correlate well with sequencing depth. edgeR::calcNormFactors() implements TMM, TMMwsp, RLE & UQ. DESeq2::estimateSizeFactors() implements the median ratio method (RLE). Does not correct for gene length.

• VST/RLOG/VOOM: Variance is stabilised across the range of mean values. For use in exploratory analyses. vst() and rlog() functions from DESeq2. The voom() function from limma transforms counts to log-CPM with precision weights, so that methods assuming normally distributed data can be used.
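The median ratio idea can be sketched in a few lines of Python. This is a simplified illustration, not the DESeq2 implementation: the function name and toy samples are hypothetical, and only genes with non-zero counts in every sample are used so that the geometric mean is defined.

```python
import math
from statistics import median


def median_ratio_size_factors(samples):
    """samples: list of per-sample count lists, all over the same genes."""
    n_genes = len(samples[0])
    # Keep only genes detected in every sample (the log of zero is undefined).
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in samples)]
    # Log geometric mean of each usable gene across samples (pseudo-reference).
    log_ref = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in usable
    }
    # Size factor: median ratio of a sample's counts to the pseudo-reference.
    return [
        math.exp(median(math.log(s[g]) - log_ref[g] for g in usable))
        for s in samples
    ]


sample_a = [10, 20, 30, 0]   # toy counts; the last gene is dropped (zero in A)
sample_b = [20, 40, 60, 5]   # same genes sequenced twice as deep
size_factors = median_ratio_size_factors([sample_a, sample_b])
```

Because the sketch needs genes that are non-zero in every sample, it also hints at why such bulk methods struggle on sparse single-cell data.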

scRNAseq normalization

• Deconvolution/Scran (Pooling-Across-Cells)
• SCnorm (Expression-Depth Relation)
• SCTransform
• Census
• Linnorm
• ZINB-WaVE
• BASiCS
• More…

Log transformation

• Log-transformed values approach a normal distribution for bulk RNAseq data.
• For scRNAseq, they are more similar to a zero-inflated binomial distribution.
• Non-transformed data, however, is hard to fit.

Depth normalization and log transformation

• The simplest normalization is to divide by sequencing depth, multiply by a scale factor, and log-transform the data.
• Scater normalize: uses total counts or size factors. Default is return_log = TRUE.
• Seurat NormalizeData: returns log-normalized data with scale.factor = 10K by default.
• Scanpy normalize_per_cell/normalize_total: normalizes by sequencing depth; log1p then needs to be run separately.
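The recipe described above (divide by depth, multiply by a scale factor, take log(1 + x)) can be written out as a small standard-library Python sketch. The function name and the toy cells are assumptions for illustration; in practice the package functions listed above would be used.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Depth-normalize one cell's counts, then apply log(1 + x)."""
    total = sum(cell_counts)  # sequencing depth of this cell
    return [math.log1p(x / total * scale_factor) for x in cell_counts]


# Two toy cells with the same composition but a 10x difference in depth.
shallow_cell = [1, 0, 3, 6]
deep_cell = [10, 0, 30, 60]

shallow_norm = log_normalize(shallow_cell)
deep_norm = log_normalize(deep_cell)
```

Both toy cells end up with the same normalized values, and zeros stay at zero because log1p(0) = 0.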

Depth normalization

• Assumes the same RNA content in all cells; may work well in a homogeneous cell population.
• In most cases the amount of RNA, and of UMIs/reads, differs between cells.
• Also important to check for outlier genes that constitute a large proportion of the reads!

Deconvolution

Lun et al. Genome Biol. 2016

Scran - computeSumFactors

• Deconvolution with all cells
  – The assumption is that most genes are not differentially expressed (DE) between cells.
• Deconvolution within clusters (fast clustering beforehand, e.g. with quickCluster)
  – Size factors are computed within each cluster and rescaled by normalization between clusters.
  – Use when many genes are DE between clusters in a heterogeneous population.
• computeSumFactors will also remove low abundance genes.

Normalization with gene groups

• Global scale factors may lead to overcorrection for weakly and moderately expressed genes and under-normalization for highly expressed genes.
• Solution: normalize genes separately at different expression levels.

SCnorm: Expression vs. Depth Bias Correction

Bacher et al. Nature Methods 2017

Quantile regression to estimate the count-depth relationship

Identical cells split into two groups should result in no DE and FC = 1 if normalization was effective (Bacher et al. Nature Methods 2017).

SCTransform (Seurat)

Hafemeister & Satija Genome Biology 2019

Pearson residuals from regularized negative binomial (NB) regression (Hafemeister & Satija Genome Biology 2019)
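To give a feel for what a Pearson residual under an NB model looks like, here is a deliberately simplified Python sketch. It is not the SCTransform procedure: instead of a regularized per-gene regression, the expected count is taken from cell depth and gene totals alone, and the overdispersion parameter theta is fixed at an assumed value of 100. The function name and toy matrix are hypothetical.

```python
import math


def pearson_residuals(counts, theta=100.0):
    """counts: list of cells, each a list of per-gene UMI counts.

    Assumes every cell and every gene has at least one count, so mu > 0.
    """
    cell_totals = [sum(cell) for cell in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    residuals = []
    for cell, cell_total in zip(counts, cell_totals):
        row = []
        for x, gene_total in zip(cell, gene_totals):
            # Expected count if only sequencing depth mattered.
            mu = cell_total * gene_total / grand_total
            # NB variance: Poisson part plus overdispersion mu^2 / theta.
            variance = mu + mu ** 2 / theta
            row.append((x - mu) / math.sqrt(variance))
        residuals.append(row)
    return residuals


# Toy data: 3 cells x 2 genes; the second gene is mostly seen in the last cell.
counts = [[10, 0], [8, 2], [1, 9]]
residuals = pearson_residuals(counts)

# If counts differ only by depth, all residuals are zero.
depth_only = pearson_residuals([[1, 2], [2, 4]])
```

Positive residuals mark counts above what depth alone would predict, negative residuals mark counts below it, and a matrix that differs only in depth gives residuals of zero.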

• Note: the SCTransform function in Seurat also does variable gene selection in the same step, with a slightly different method than the default in Seurat.
• But you can also specify which genes to run it on.
• You can also run regression in the same step.

Zero-Inflated Negative Binomial-based Wanted Variation Extraction (ZINB-WaVE)

• Both gene-level and sample-level covariates
• Extension of the RUV model
• Reduces technical influence on PCA, also batch effect.

Risso et al. Nat. Comm. 2018

Size factors with different normalizations

Vieth et al. Nature Comm. 2019

DE with different normalizations

Vieth et al. Nature Comm. 2019

Imputation

• scRNAseq has a lot of zeros in the expression matrix
• Imputation of SNPs is common for GWAS data
• Many methods recently published:
  – SAVER
  – DrImpute
  – scImpute
  – MAGIC
  – kNN-smoothing
  – Deep count autoencoder

Imputation can introduce false correlations

Andrews et al. F1000Research 2018

Imputation has little effect on DE detection

Vieth et al. Nature Comm. 2019

Normalization + imputation comparison

Tian et al. Nature Methods 2019

Scaling data: Z-score transformation

• Z-score transformation: linearly transform the data to a mean of zero and a standard deviation of 1.
• Without scaling, PCA or any other type of analysis will be dominated by highly expressed genes with high variance.
• It can be wise to center and scale each gene before performing PCA.
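A per-gene Z-score is simple enough to write out directly. The following Python sketch uses the population standard deviation, which is an assumption (packages may use the sample standard deviation instead), and the two toy genes are made up to show the effect.

```python
from statistics import mean, pstdev


def zscore(values):
    """Center one gene to mean 0 and scale it to standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A constant gene carries no variation; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


low_gene = [1, 2, 3]          # weakly expressed toy gene
high_gene = [100, 200, 300]   # same pattern, 100x higher expression

low_scaled = zscore(low_gene)
high_scaled = zscore(high_gene)
```

After scaling, both genes have identical values, so the highly expressed gene no longer dominates purely because of its magnitude.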

What normalization should you use?

• Normalization has a big impact on differential gene expression, but not as much on clustering.
• In most cases it is enough to do sequencing depth normalization.
• When working with highly similar subtypes of the same cell type, or with cell types of very different sizes, individual size factors could help.
• Binning by gene level (SCTransform) helps to remove the effect of different gene detection across cells.

Selecting genes

• Excluding invariable genes that do not contribute interesting information:
  – Improved signal-to-noise ratio
  – Reduced computational requirements
• Highly variable genes (HVGs)
• Correlated gene pairs/groups
• Top PCA loadings

Variable gene selection

• Genes which behave differently from a null model describing technical noise
  – Mean-variance trend: genes with higher than expected variance
  – Coefficient of variation (Brennecke et al. 2013)
• High dropout genes
  – Number of zeros unexpectedly high compared to null model
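The mean-variance idea can be illustrated with a very crude null model. The Python sketch below ranks genes by variance divided by mean, which would be about 1 under pure Poisson noise. This is a stand-in chosen for brevity, not the fitted trend used by the methods cited above, and the gene names and counts are invented.

```python
from statistics import mean, pvariance


def dispersion(values):
    """Variance-to-mean ratio; about 1 is expected under Poisson noise."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


def top_variable_genes(expr, n_top):
    """expr: dict mapping gene name -> list of counts across cells."""
    ranked = sorted(expr, key=lambda gene: dispersion(expr[gene]), reverse=True)
    return ranked[:n_top]


expr = {
    "stable_gene": [5, 6, 5, 4, 5, 6],    # similar in every cell
    "marker_gene": [0, 0, 12, 0, 11, 0],  # on in a subset of cells
    "silent_gene": [0, 0, 0, 0, 0, 0],    # never detected
}
selected = top_variable_genes(expr, 1)
```

In the toy data the marker-like gene has far more variance than its mean would predict, so it is the one selected.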

Highly variable genes (HVGs)

(Brennecke et al. Nature Methods 2013)

Fit a gamma generalized linear model

No ERCCs? -> estimate technical noise based on all genes

HVGs with spike-in controls: normalization matters

M3Drop

• Reverse transcription is an enzyme reaction and can thus be modelled using the Michaelis-Menten equation, where S is the average expression and KM is the Michaelis-Menten constant.
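In the Michaelis-Menten form used for dropouts, the expected dropout rate of a gene is 1 - S / (KM + S). The Python sketch below evaluates that curve and flags a gene whose observed dropout rate exceeds it. In M3Drop the constant is fitted to the data; the value of KM, the margin and the helper names here are assumptions for illustration.

```python
def expected_dropout(s, k_m):
    """Michaelis-Menten dropout rate: 1 - S / (K_M + S)."""
    return 1.0 - s / (k_m + s)


def has_excess_dropout(observed_rate, s, k_m, margin=0.1):
    """True if a gene has clearly more zeros than the curve predicts."""
    return observed_rate > expected_dropout(s, k_m) + margin


K_M = 10.0  # assumed constant; in practice estimated from all genes

half_point = expected_dropout(10.0, K_M)   # S equal to K_M
high_expr = expected_dropout(90.0, K_M)    # highly expressed gene
flagged = has_excess_dropout(observed_rate=0.6, s=90.0, k_m=K_M)
```

When S equals KM the expected dropout rate is one half, and it falls towards zero as average expression increases. A highly expressed gene with many zeros therefore stands out as a candidate for differential expression between cell populations.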

Confounding factors

• Any source of variation that you do not expect to give separation of the cell types.
  – Cell cycle
  – Cell size
  – Sequencing depth
  – Cell quality
  – Batch
  – More…

Linear regression

• Fit a line to the gene expression vs. the variable of interest.
• Calculate residuals.
• Remove the variance explained by the variable of interest by taking the residuals.
• Use multiple linear regression if there are multiple factors.
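The steps above can be sketched for a single gene and a single covariate with ordinary least squares. This is a minimal standard-library Python illustration; the function name and toy values are hypothetical, and it assumes the covariate is not constant across cells.

```python
from statistics import mean


def regress_out(expression, covariate):
    """Return residuals of a simple linear fit of expression on covariate."""
    x_bar = mean(covariate)
    y_bar = mean(expression)
    # Least-squares slope and intercept (covariate must vary between cells).
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the value predicted by the covariate.
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]


depth = [1.0, 2.0, 3.0, 4.0]        # toy covariate, e.g. scaled sequencing depth
gene_expr = [2.1, 3.9, 6.2, 7.8]    # toy expression that mostly follows depth

residuals = regress_out(gene_expr, depth)
```

The residuals sum to zero and are uncorrelated with the covariate, so whatever structure is left in them is not explained by depth.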

Other tools to remove unwanted variance

• RUVSeq or svaseq()
• Linear models with e.g. removeBatchEffect() in limma or scater
• ComBat() in sva

What confounders should you remove?

• Percent mitochondrial reads: often correlates with quality of cell
• Sequencing depth
• Gene detection rate: relates to amount of RNA per cell
• Cell cycle
• Batch effects (sample, sort date, sex, etc.)

ALWAYS check QC parameters after analysis and see how they influence your data. BUT, be careful that your confounders are not related to your biological question!

Scaling and regression in practice

• Seurat ScaleData: does Z-score transformation and regression of variables in vars.to.regress. Can use linear (default), poisson or negbinom models.
• Scran: runs scaling but not centering automatically in PCA step. The trendVar function estimates unwanted variation either with a design matrix or with block factors. Use decomposeVar or denoisePCA to remove unwanted variation.
• Scanpy: pp.regress_out and pp.scale functions.

Cell cycle effect

Buettner et al. Nature Biotech. 2015

Predict cell cycle stage/scores

• Seurat CellCycleScoring: builds on G2M- & S-phase human gene lists from the Tirosh et al. paper.
• Scran cyclone function: trained on mouse cell cycle sorted cells. Uses relative expression of pairs of genes.
• Scanpy tl.score_genes_cell_cycle: uses the same gene list as Seurat.

Cell cycle removal

• Regression on cell cycle scores.
• scLVM (beta pre-release): designed for cell-cycle variation correction. Also corrects for other confounding variables.
• ccRemover (stable version from CRAN). "ccRemover outperforms scLVM slightly."
• Oscope
• reCAT

Conclusions

• Normalization has a big impact on differential gene expression.
• Many different methods exist to remove unwanted variance; this is often an important step!
• Selection of variable genes is important to remove noise in the data. Always subset genes before running PCA/clustering.
• Always aim for the same sequencing depth in all samples, to avoid at least one confounding factor.

Do not worry!

If you have distinct cell types, the clustering will be the same regardless of how you treat the data.

But, for subclustering of similar cell types, normalization and removal of confounders may be crucial.
