Overview of common data normalization approaches with applications to 2 large scale metabolomic studies.
Data Normalization Approaches for Large-Scale Metabolomic Studies
Dmitry Grapov, PhD
Analytical Variance
Variation in sample measurements due to sample handling, data acquisition, processing, etc.:
• Masks true biological variability
• Calculated based on replicated measurements
• Can be accounted for using data normalization approaches
Common approaches to minimize analytical variance
• Analytical Standards
• Quality Control Based Normalizations
• Scaling or variance stabilizing normalizations
Case Study: The Environmental Determinants of Diabetes in the Young (TEDDY, https://teddy.epi.usf.edu/)
• >1,000 analytes (GC-TOF and LC-Q-TOF)
• ~12,000 samples collected over 3 yrs (currently >5,500 samples)
Need for Normalization
Remove non-biological (e.g. analytical) drift/variance/artifacts in measurements
[Figure: Principal Component Analysis (PCA) of all analytes, showing QC sample scores. Panels: acquisition order; processing/acquisition batches.]
Batch Effects
Drift in >400 replicated measurements across >100 analytical batches for a single analyte
[Figure: abundance vs. acquisition batch]
QCs embedded among >5,500 samples (1:10) collected over 1.5 yrs
If the biological effect size is smaller than the analytical variance, the experiment is likely to yield a false negative: a real effect reported as not significant.
Estimation of Analytical Variance
Median inter- and intra-batch %RSD of replicated measurements
• Inter-batch: analyte-specific performance across the whole study
• Intra-batch: performance within each batch
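The inter- and intra-batch %RSD estimates can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the QC values are hypothetical, and %RSD is assumed to be 100 × sample standard deviation / mean.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)

def inter_batch_rsd(qc_values):
    """%RSD of one analyte over all QC replicates in the study."""
    return percent_rsd(qc_values)

def median_intra_batch_rsd(qc_values, batches):
    """Median of the per-batch %RSD values for one analyte."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, batches):
        by_batch[batch].append(value)
    # A batch needs at least two QC replicates to have a standard deviation.
    rsds = [percent_rsd(v) for v in by_batch.values() if len(v) >= 2]
    return median(rsds)

# Hypothetical QC abundances for one analyte: two batches with a level shift.
qc = [100.0, 102.0, 98.0, 150.0, 153.0, 147.0]
batch = [1, 1, 1, 2, 2, 2]
inter = inter_batch_rsd(qc)                 # inflated by the batch shift
intra = median_intra_batch_rsd(qc, batch)   # precision within a batch
print(f"inter-batch %RSD = {inter:.1f}, median intra-batch %RSD = {intra:.1f}")
```

In this made-up example each batch is precise on its own, so the gap between the two numbers reflects the batch effect rather than measurement noise.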
Common Normalization Approaches
Sample-wise scalar corrections
• L2 norm, mean, median, sum, etc.
Quality control (QC) or reference sample
• Loess (Dunn et al., 2011; locally estimated scatterplot smoothing)
• Batch ratio (mean, median)
• Hierarchical mixed effects (Jauhiainen et al., 2014)
• Quantile (Bolstad et al., 2003; minimize variance in metabolite distribution)
Internal standard (ISTD)
• Ratio response (metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
Variance based
• RUV-2 (De Livera et al., 2012; variance removal for hypothesis testing)
• Variance stabilizing normalizations (Huber et al., 2002)
Scalar Normalization
Assumption: equal X signal per sample where X can be sample sum, mean, median, etc.
• Can correct for batch effects when valid
• Simple
• Can hide true biological trends or create false ones
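A scalar normalization can be sketched in a few lines. The function name, the choice of the study-wide mean as the common target, and the example data are assumptions made for illustration only.

```python
from statistics import median

def scalar_normalize(samples, statistic=sum):
    """Scale each sample so its chosen statistic equals the study-wide mean."""
    factors = [statistic(s) for s in samples]
    target = sum(factors) / len(factors)
    return [[x * target / f for x in s] for s, f in zip(samples, factors)]

# Hypothetical data: sample 2 is sample 1 acquired at twice the overall signal.
samples = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
normalized = scalar_normalize(samples)                           # sum
normalized_median = scalar_normalize(samples, statistic=median)  # median

# When the assumption fails: only the third metabolite truly changes,
# yet after sum normalization the first two appear to decrease.
biased = scalar_normalize([[10.0, 10.0, 10.0], [10.0, 10.0, 40.0]])
```

The `biased` case shows how a false trend can arise: the two unchanged metabolites end up at 15 in the first sample and 7.5 in the second, purely because the total signal is forced to be equal.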
LOESS Optimization: locally weighted non-linear model
• The LOESS span has a large effect on the model fit
• The span (α) controls the degree of smoothing
[Figure: LOESS Normalization. Panels: raw; span = 0.75; span = 0.005. Points: training set, test set.]
LOESS Normalization
Replicated measurements are used to optimize a locally weighted non-linear model:
1. Double cross-validation (33% test set)
   • Span optimized with k-fold cross-validation on the training set
2. Model validated on the test set
3. Use the QC-derived model to remove analytical variance from samples
Image: http://pingax.com/regularization-implementation-r/?utm_source=rss&utm_medium=rss&utm_campaign=regularization-implementation-r
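The LOESS normalization steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the study's implementation. It uses a tricube-weighted local linear fit, rescales samples to the QC median, and scores a span with leave-one-out error instead of the k-fold and double cross-validation described above. All function names and data are hypothetical.

```python
from statistics import median

def loess_predict(x, y, x0, span=0.75):
    """Predict y at x0 from a tricube-weighted local linear fit."""
    n = len(x)
    k = min(n, max(3, round(span * n)))       # points in the local window
    # Bandwidth: distance to the k-th nearest point, inflated slightly so
    # that point keeps a small non-zero weight (a choice made for this sketch).
    h = sorted(abs(xi - x0) for xi in x)[k - 1] * 1.001 + 1e-12
    w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:                              # all weight on a single x value
        return my
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return my + (sxy / sxx) * (x0 - mx)

def loess_normalize(qc_order, qc_values, sample_order, sample_values, span=0.75):
    """Divide each sample by the QC trend at its acquisition order,
    then rescale to the QC median (assumed target level)."""
    target = median(qc_values)
    return [v * target / loess_predict(qc_order, qc_values, o, span)
            for o, v in zip(sample_order, sample_values)]

def loo_cv_error(x, y, span):
    """Leave-one-out mean squared error of the QC fit (x and y are lists)."""
    err = 0.0
    for i in range(len(x)):
        xt, yt = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        err += (loess_predict(xt, yt, x[i], span) - y[i]) ** 2
    return err / len(x)

# Hypothetical QCs every 10th injection with a linear upward drift.
qc_order = [float(o) for o in range(0, 101, 10)]
qc_values = [100.0 + 0.5 * o for o in qc_order]
# Samples share the drift; their drift-free level is twice the QC level.
sample_order = [float(o) for o in range(5, 100, 10)]
sample_values = [2.0 * (100.0 + 0.5 * o) for o in sample_order]

best_span = min([0.3, 0.5, 0.75],
                key=lambda s: loo_cv_error(qc_order, qc_values, s))
corrected = loess_normalize(qc_order, qc_values,
                            sample_order, sample_values, best_span)
```

On real data the candidate spans would be compared on held-out QCs, since a very small span follows noise in the training QCs and can inflate variance in the samples.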
LOESS Validation
Avoiding overfitting is critical
[Figure: LOESS fits on training data and test data]
Batch Ratio (BR) Normalization
[Figure: training set and test set]
Calculate:
• Batch/analyte-specific correction factor = (batch median / global median)
• Apply ratio to samples
• Simple
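The batch ratio calculation can be sketched for a single analyte as below. Two assumptions are made here: the global median is taken over all QC values, and applying the ratio means dividing each sample by its batch's factor. The function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def batch_ratio_factors(qc_values, qc_batches):
    """Correction factor per batch: batch QC median / global QC median."""
    by_batch = defaultdict(list)
    for value, batch in zip(qc_values, qc_batches):
        by_batch[batch].append(value)
    global_median = median(qc_values)
    return {b: median(v) / global_median for b, v in by_batch.items()}

def apply_batch_ratio(values, batches, factors):
    """Divide each sample by the correction factor of its batch."""
    return [v / factors[b] for v, b in zip(values, batches)]

# Hypothetical QC abundances for one analyte in two batches.
qc = [100.0, 102.0, 98.0, 150.0, 153.0, 147.0]
qc_batch = [1, 1, 1, 2, 2, 2]
factors = batch_ratio_factors(qc, qc_batch)

# Two samples with the same true level, measured in different batches.
corrected = apply_batch_ratio([200.0, 300.0], [1, 2], factors)
```

Because each factor rests on one batch median, a batch with few or outlying QCs gives an unreliable correction.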
Case Study I: TEDDY GC-TOF
• 310 metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QCs/samples (487 QCs or 9%)
• No Internal Standards (ISTDs)
Normalizations Implemented
• Batch ratio
• LOESS
• Sum known metabolite signal (mTIC) normalization
Median RSD (%)   count   cumulative %
0-10                17        6
10-20              103       39
20-30              112       75
30-40               57       93
40-50               12       97
50-60                5       99
60-70                3      100
70-80                1      100
Normalization Performance
Median RSD (%)   count   cumulative %
10-20               75       57
20-30               51       96
30-40                4       99
40-50                1       99
50-60                1      100
LOESS normalization showed optimal performance
[Figure panels: Intra-batch; Inter-batch]
PCA of Normalization Methods
[Figure: PCA panels: Raw; LOESS; Batch Ratio; Sum Normalized. Colors = ~3 months.]
BR Normalization Limitations
[Figure: training set and test set]
• Very susceptible to outliers
• Requires many QCs
• Can inflate variance when training and test set trends do not match
LOESS Limitations
[Figure: training set and test set]
LOESS normalization can inflate variance when:
• Overtrained
• Training examples do not match the test set
Case Study II: TEDDY LC-Q-TOF
• 340+ metabolites for 4930 samples
• 132 batches
• ~41 samples per batch
• ~1:10 QC/samples (524 QCs or 11%)
• NIST reference (63 or 1%)
• 14 internal standards (ISTDs)
• NOMIS (IS = ISTD)
• qcISTD
Internal Standards Normalization
Methods
• qcISTD (QC optimized metabolite/ISTD)
• NOMIS (Sysi-Aho et al., 2007; selection of optimal combination of ISTDs)
• CCRMN (Redestig et al., 2009; removal of metabolite cross contribution to ISTDs)
[Figure: NOMIS]
qcISTD Normalization
Use replicated measurements to define optimal internal standard/analyte ratio pairs
1. Double cross-validation (33% test set)
2. k-fold cross-validation on the training set
3. Validate on the test set
4. Apply QC-defined surrogate/analyte pairs to samples
[Figure: number of corrected analytes]
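The pairing step of qcISTD can be sketched as below. The slides do not state the selection criterion, so this sketch assumes the optimal ISTD is the one whose analyte/ISTD ratio gives the lowest %RSD across QC replicates, with the raw analyte kept when no ratio improves on it. The cross-validation steps are omitted, and all names and values are hypothetical.

```python
from statistics import mean, stdev

def percent_rsd(values):
    """Relative standard deviation in percent: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)

def select_istd(analyte_qc, istd_qc):
    """Return (ISTD name, QC %RSD) for the best analyte/ISTD ratio.

    The name is None when no ratio beats the raw analyte's %RSD.
    """
    best_name, best_rsd = None, percent_rsd(analyte_qc)
    for name, istd_values in istd_qc.items():
        ratio = [a / s for a, s in zip(analyte_qc, istd_values)]
        rsd = percent_rsd(ratio)
        if rsd < best_rsd:
            best_name, best_rsd = name, rsd
    return best_name, best_rsd

# Hypothetical QC replicates: ISTD_A tracks the analyte's analytical
# variation, while ISTD_B is constant and therefore cannot correct it.
analyte_qc = [100.0, 120.0, 80.0, 110.0]
istd_qc = {
    "ISTD_A": [10.0, 12.0, 8.0, 11.0],
    "ISTD_B": [10.0, 10.0, 10.0, 10.0],
}
best_name, best_rsd = select_istd(analyte_qc, istd_qc)

# Apply the QC-defined pair to hypothetical study samples.
sample_analyte = [90.0, 130.0]
sample_istd_a = [9.0, 13.0]
sample_ratio = [a / s for a, s in zip(sample_analyte, sample_istd_a)]
```

Choosing pairs on all QCs at once, as done here, risks overfitting; the held-out test set described above is what would guard against that.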
Comparison of Normalizations
• qcISTD performs better than LOESS
• qcISTD + LOESS leads to highest replicate precision
[Figure panels: Intra-batch; Inter-batch]
PCA of Normalization Methods
[Figure: PCA panels: Raw (RSD = 13%); qcISTD (9%); LOESS (12%); qcISTD + LOESS (8%)]
Only normalizations that include LOESS effectively remove analytical batch effects
[email protected] metabolomics.ucdavis.edu
This research was supported in part by NIH 1 U24 DK097154