University of Groningen Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry Mitra, Vikram IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record Publication date: 2017 Link to publication in University of Groningen/UMCG research database Citation for published version (APA): Mitra, V. (2017). Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry. University of Groningen. Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license. More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne- amendment. Take-down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum. Download date: 12-04-2022


Supporting information for chapter 3

Rat CSF sampling and LC-MS acquisition

1.1 Experimental design

In total, 7 male Lewis rats (Harlan Laboratories B.V.) with an average starting weight of 200 g were used to model the onset and development of experimental autoimmune encephalomyelitis (EAE) induced with guinea pig myelin basic protein. At the start of the study (day 0), EAE was induced in four male Lewis rats by subcutaneous injection in the left hind paw (under isoflurane anesthesia) of 100 μL of a saline-based emulsion containing 20 μg guinea pig myelin basic protein (MBP, Vrije Universiteit Amsterdam), 500 μg Mycobacterium tuberculosis type H37Ra (Difco) and 50 μL complete Freund's adjuvant (CFA) (EAE groups B, F and H). Three male Lewis rats were used as inflammatory controls and received the same emulsion without MBP (EAE inflammatory control groups A, C and G). Three animals were treated with minocycline at the onset of the study (day 0) at 50 mg/kg bodyweight, injected intraperitoneally (groups C, G and H). Four rats were treated with vehicle (groups A, B and F). The animals were housed three by three in type III cages in random order, with food and water available ad libitum. The animal groups and samples included in the present study, with the corresponding file names and LC-MS analysis laboratories, are summarized in Table S1.

Group  Treatment          Sample  File name in laboratory 1 (Orbitrap)  File name in laboratory 2 (qTOF)
A      CFA + vehicle      15      100419O2c1_MS_28_EAE_Mino_15_002      20100612_15_Exclusion_400
B      EAE + vehicle      50      100505O2c1_MS_28_EAE_Mino_50_053      20100612_50_Exclusion_400
C      CFA + minocycline  53      100628O2c1_MS_28_EAE_Mino_53_004      20100609_EAE_53_MSMS_400_2
F      EAE + vehicle      98      100614O2c1_MS_28_EAE_Mino_98_063      20100609_EAE_98_MSMS_400_2
F      EAE + vehicle      6       100419O2c1_MS_28_EAE_Mino_6_017       20100612_6_Exclusion_400
G      CFA + minocycline  12      100419O2c1_MS_28_EAE_Mino_12_021      20100612_12_Exclusion_400
H      EAE + minocycline  111     00625O2c1_MS_28_EAE_Mino_111_006      20100609_EAE_111_MSMS_400_2


Table S1. Treatment summary of the samples analysed in the rat EAE CSF dataset. Treatments are as follows: CFA: animals treated with complete Freund's adjuvant (inflammatory control animals); EAE: animals treated with CFA and myelin basic protein (MBP) (diseased animals); minocycline: animals treated with minocycline; vehicle: animals not treated with minocycline, only with saline solution.

1.2. CSF Sampling

Three of the animals were euthanized at day 10 (groups A-C) and the remaining animals (groups F-H) were euthanized at day 14 using CO2/O2. For the CSF sampling procedure, the head of each rat was held in a fixed position using a holder. The arachnoid membrane was exposed by a skin incision followed by an incision in the musculus trapezius pars descendens. The CSF was collected from the cisterna magna using an insulin syringe needle (Myjector, 29 G × 1/2" - 0.33 × 12 mm, 0.3 mL = 30 units); a maximum of 60 μL was collected from each animal. Each sample was centrifuged at 2000 g for 10 minutes at 4 °C within 20 minutes of collection. The supernatant was divided into five aliquots of ~10 μL each and stored at -80 °C until analysis. Samples showing blood contamination on visual inspection were discarded from the study.

1.3. Sample preparation

The protein digestion was performed in random order according to the following procedure: 10 μL CSF was added to a tube containing 10 μL 0.1% RapiGest™ (Waters) dissolved in 50 mM ammonium bicarbonate. The proteins in the CSF were reduced by the addition of 0.5 μL 1,4-dithiothreitol (DTT) (0.5 M) and incubation at 60 °C for 30 min. Samples were cooled down to room temperature and subsequently alkylated with 1 μL iodoacetamide (IAM) (0.3 M) in the dark for 30 min at room temperature. One microliter of sequencing grade modified trypsin (Promega, Madison, WI, USA, part # V5111) at a concentration of 1 μg/μL was added to each sample, which was then incubated for ~16 h at 37 °C under agitation (450 rpm). At the end of the digestion, 3 μL hydrochloric acid (0.5 M) was added to the sample solution, which was incubated for a further 30 min at 37 °C. The samples were centrifuged at 13 250 g for 10 min at 4 °C to remove RapiGest™ particles, transferred to glass sample vials and kept at -80 °C. Each sample was exposed to two freeze-thaw cycles prior to the LC-MS analysis.


1.4. LC-MS proteomic analysis in two laboratories

The digested CSF samples were analyzed in random order. Before and after the analysis of the 7 samples, a blank (water with 0.1% formic acid) and a quality control sample (horse heart cytochrome c spiked into a pooled CSF sample after digestion at 200 fmol/μL) were injected to check the technical quality of the analysis. Samples from groups A, C and G were injected at a volume of 1 μL and samples from groups B, F and H at a volume of 0.2 μL. The injected amount differed in order to normalize the total ion chromatograms (TIC) of the samples, which were approximately five times higher in groups B, F and H than in the other groups (as determined by a pre-analysis of samples from all groups at the same volume), and thereby to avoid overloading the trap column.

1.4.1. Orbitrap nanoLC-MS/MS analysis (laboratory 1, Rotterdam)

Digested rat CSF samples were analyzed by LC-MS/MS using an Ultimate 3000 nano LC system (Dionex, Germering, Germany) coupled online to a hybrid linear ion trap/Orbitrap mass spectrometer (LTQ Orbitrap XL; Thermo Fisher Scientific, Bremen, Germany). Five microliters of digest were loaded onto a C18 trap column (C18 PepMap, 300 µm ID × 5 mm, 5 µm particle size, 100 Å pore size; Dionex, The Netherlands) and desalted for 10 minutes at a flow rate of 20 µL/min. The trap column was then switched online with the analytical column (PepMap C18, 75 μm ID × 150 mm, 3 μm particle size and 100 Å pore size; Dionex, The Netherlands) and peptides were eluted with the following binary gradient: 0% - 25% eluent B over 120 min and 25% - 50% eluent B over a further 60 minutes, where eluent A consisted of 2% acetonitrile and 0.1% formic acid in ultra-pure water and eluent B consisted of 80% acetonitrile and 0.08% formic acid in water. The column flow rate was set to 300 nL/min. For MS/MS analysis, a data-dependent acquisition method was used, consisting of a high-resolution survey scan between 400 and 1800 m/z performed with the Orbitrap (automatic gain control (AGC) target 10^6, resolution 30 000 at 400 m/z; lock mass set to 445.120025 m/z (protonated (SiO(CH3)2)6)). Based on this survey scan, the 5 most intense ions were consecutively isolated (AGC target set to 10^4 ions) and fragmented by collision-activated dissociation (CAD) applying 35% normalized collision energy in the linear ion trap. Once a precursor had been selected, it was excluded for 3 minutes.

1.4.2. qTOF chipLC-MS/MS proteomic analysis (laboratory 2, Groningen)

The peptide separation was performed on a reverse phase LC-chip (Protein ID chip #3; G4240-63001 SPQ110; Agilent Technologies, Santa Clara, USA; analytical column: 150 mm × 75 μm Zorbax 300SB-C18, particle size of 5 μm and pore size of 300 Å; trap column: 160 nL Zorbax 300SB-C18, 5 μm) coupled to a nanoLC system (Agilent 1200) with a 40 μL injection loop. The chip was interfaced via electrospray ionization to a quadrupole time-of-flight (QTOF) mass spectrometer (Agilent 6510). The MassHunter software (version B.02.00; Agilent Technologies) was used for data acquisition. The LC separation was performed using the following eluents: A: ultra-pure water (conductivity 18.2 MΩ, obtained with a Sartorius Stedim purification system, Nieuwegein, The Netherlands) containing 0.1% formic acid (98-100%, pro analysis, Merck, Darmstadt, Germany); B: acetonitrile (HPLC-S gradient grade, Biosolve, Valkenswaard, The Netherlands) containing 0.1% formic acid.

All samples were desalted and enriched on the trap column for 10 minutes at a flow rate of 3 μL/min (3% B). The samples were then transferred to the separation column at a flow rate of 250 nL/min. For the elution of the peptides, the following gradient was used: a 100 min linear gradient from 3 to 50% B, a 5 min linear gradient from 50 to 70% B and finally a 4 min linear gradient from 70 to 3% B.

MS analysis was performed using 2 GHz extended dynamic range mode under the following conditions: mass range: 275-2000 m/z, acquisition rate: 1 spectrum/sec, data storage: profile and centroid mode, fragmentor: 175 V, skimmer: 65 V, OCT 1 RF Vpp: 750 V, spray voltage: ~1900 V, drying gas temp: 325 °C, drying gas flow (N2): 6 L/min. Mass correction was performed during analysis using internal standards of 371.31559 m/z (originating from a ubiquitous background ion, the plasticizer dioctyl adipate, DOA) and 1221.990637 m/z (HP-1221 calibration standard, evaporating from a wetted wick inside the spray chamber).

To assess the repeatability of the LC-MS analysis, the relative standard deviation (RSD) was calculated for the mass accuracy, retention time and peak area based on selected cytochrome c peaks in the QC samples. The peaks were first smoothed (Gaussian function, width 15 points (15 sec)) and subsequently integrated; the peak area RSD was within +/- 25%, the retention time deviation was less than +/- 0.3% (or 5 sec) and the mass accuracy (calculated as the mean of five measurements from each selected cytochrome c peak) was within +/- 9 ppm of the theoretical value.
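The two repeatability measures used above can be illustrated with a minimal sketch. It assumes that the RSD is the sample standard deviation divided by the mean and that mass accuracy is the signed error relative to the theoretical m/z; the function names and the numerical values are hypothetical and do not come from the study.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation: sample standard deviation / mean, in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def ppm_error(measured_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return 1e6 * (measured_mz - theoretical_mz) / theoretical_mz


# Hypothetical QC measurements of one peak (illustration only).
peak_areas = [100.0, 110.0, 90.0]
print(rsd_percent(peak_areas))
print(ppm_error(500.005, 500.0))
```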

Mouse experimental design dataset

2.1 Experimental design

The mouse experimental design dataset was obtained from the National Cancer Institute's Mouse Proteomic Technology Initiative (http://proteomics.cancer.gov/programs/mouse), which was launched in 2004 to assess proteomic strategies for discovering candidate biomarkers for early detection of cancer from genetically modified mouse models of human cancer. The dataset that we obtained was specifically designed to identify the sources of variation and the factors that have the largest influence on the compound variability of serum from mouse models of human cancer. The following four factors were studied: laboratory (2 levels), depletion method (4 levels), disease or healthy stage (2 levels) and type of cancer (3 levels). This modified version of the Latin square experimental design(1) thus provides 48 analyses in total, with 24 identical samples analysed in two laboratories. In a standard Latin square design, the number of levels must be the same for each factor. In this particular experiment, the factor mouse model has three levels, laboratory and disease status have two levels each and the depletion method has four levels, hence a modified Latin square. In this design, the overall variation is partitioned into four sources, which allows the experimenter to isolate the effects of mouse model, laboratory, disease status and depletion method. The data set is available on-line at http://www.proteomecommons.org/dev/dfs/examples/nci-mouse-models/index.html.
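The sample and analysis counts quoted above can be checked with a short enumeration. This is only an arithmetic sketch: it assumes that every combination of mouse model, disease status and depletion method corresponds to one sample and that each sample is analysed once in each laboratory. The level labels are hypothetical placeholders; only the number of levels per factor is taken from the text.

```python
from itertools import product

# Placeholder level labels (assumptions); only the level counts come from the text.
mouse_models = ["model_1", "model_2", "model_3"]
disease_status = ["disease", "healthy"]
depletion_methods = ["depletion_1", "depletion_2", "depletion_3", "depletion_4"]
laboratories = ["laboratory_1", "laboratory_2"]

# One sample per combination of the three sample-level factors.
samples = list(product(mouse_models, disease_status, depletion_methods))

# Each sample is analysed in both laboratories.
analyses = [(lab,) + sample for lab in laboratories for sample in samples]

print(len(samples), len(analyses))
```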

2.2 Depletion methods and protein digestion

MARS Immunoaffinity Depletion. 40 μL of each mouse model plasma sample was used for the immunoaffinity depletion. Three high-abundance proteins (albumin, IgG, and transferrin) that compose 75-80% of the total protein mass in mouse plasma were removed simultaneously using a 4.6 × 50 mm murine MARS column (Agilent, Palo Alto, CA, USA) per the manufacturer's instructions. The flow-through fractions were concentrated in iCON concentrators with 9 kDa molecular weight cutoffs (Pierce, Rockford, IL) followed by buffer exchange into 50 mM NH4HCO3 in the same unit per the manufacturer's instructions.

Cysteinyl Peptide Enrichment. Cysteinyl peptides were captured from the tryptic digest as

previously described(2, 3). All solutions used in this method were degassed to prevent

oxidation of the thiol content. The peptides resulting from the above protein digestion step

were reduced with 5 mM DTT in 80 mL of 50 mM Tris buffer (pH 7.5), 1 mM EDTA (coupling

buffer) for 30 min at 37ºC, after which the samples were diluted 5-fold by adding coupling

buffer. Thiopropyl Sepharose 6B thiol-affinity resin (100 μL; Amersham Biosciences,

Uppsala, Sweden) was prepared from dried powder per the manufacturer's instructions.

Briefly, the dried powder was rehydrated in water for 15 min and washed by 50 bed volumes

of water, followed by 50 bed volumes of coupling buffer in a Handee Mini-Spin column

(Pierce, Rockford, USA, IL). The reduced peptide sample was then incubated with the resin

for 1 h at room temperature with gentle mixing, and the unbound portion (non-cysteinyl


peptides) was removed by spinning the column at low speed. The resin was washed in the

spin column sequentially with 0.5 mL of each of the following solutions: 1) 50 mM Tris buffer

(pH 8.0), 1 mM EDTA (washing buffer); 2) 2 M NaCl; 3) 80% ACN/0.1% TFA solution; and

4) washing buffer. To release the captured cysteinyl peptides, 100 μL of 20 mM DTT freshly

prepared in washing buffer was added to the resin and incubated for 30 min at room

temperature. The resin was further washed with 100 μL of 80% ACN which was pooled with

the previous DTT eluate. The sample was alkylated with 80 mM iodoacetamide for 30 min

at room temperature in the dark. The eluted cysteinyl peptides were desalted by using a SPE

C18 column and lyophilized. Cysteinyl peptide samples were reconstituted in 25 mM

NH4HCO3 and stored at -80ºC until time for LC-MS analysis (MARS+Cys sample).

Plasma Protein Digestion. The MARS flow-through proteins were denatured and reduced

in 50 mM NH4HCO3 (pH 8.2), 8 M urea, 10 mM dithiothreitol (DTT) for 1 h at 37ºC. The

resulting protein mixture was diluted 10 fold with 50 mM NH4HCO3, and then sequencing

grade modified porcine trypsin (Promega, Madison, USA, WI) was added at a trypsin:protein

ratio of 1:50. The sample was incubated overnight at 37ºC. The following day, the trypsin

digested sample was loaded on a 1 mL SPE C18 column (Supelco, Bellefonte, USA, PA)

and washed with 4 mL of 0.1% trifluoroacetic acid (TFA)/5% acetonitrile (ACN). Peptides

were eluted from the SPE column with 1 mL of 0.1% TFA/80% ACN and lyophilized

afterwards. Peptide samples were reconstituted in 25 mM NH4HCO3 and stored at -80ºC

until LC-MS analysis.

Reversed-Phase Capillary LC-MS Analyses. A custom-built high-pressure capillary LC system(4) coupled on-line to an Agilent LC/MSD TOF (G1969A, laboratory 2) via an in-house-manufactured electrospray ionization interface was used to analyze the peptide samples. In the other laboratory, an LC-MS system with a time-of-flight detector was used (Waters LCT Premier, laboratory 1). The reversed-phase capillary column was prepared by slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a 65-cm long, 75 μm i.d. fused silica capillary (Polymicro Technologies, Phoenix, AZ) that incorporated a retaining stainless steel screen in an HPLC union (Valco Instruments Co., Houston, TX). The mobile phases consisted of 0.2% acetic acid and 0.05% TFA in water (A) and 0.1% TFA in 90% ACN/10% water (B) and were degassed on-line using a vacuum degasser (Jones Chromatography Inc., Lakewood, CO). After loading 5 μL of sample solution onto the column, an exponential gradient elution was achieved by increasing the mobile-phase composition in a stainless steel mixing chamber from 0 to 70% B over 120 min. The TOF mass spectrometer was scanned in the m/z range of 400-2000 at 1 scan/second.

Monte-Carlo simulated dataset

The Monte-Carlo simulation imitating the outcome of the peak-matching procedure was performed with the following criteria:

a) There are two types of peak pairs. The first type consists of accurately matched peak pairs, where the retention time coordinates of the matched peaks follow a non-linear, monotonic trend. In case of peak order inversion, the retention times of the accurately matched peak pairs fluctuate along this trend, up to the maximal retention time difference of peaks changing elution order. The second type of peak pairs is obtained by randomly matching peaks between the two chromatograms and simulates the error in the peak matching procedure. These peak pairs are distributed randomly throughout the retention time space while taking the initial peak density distribution in the two chromatograms into account.

b) The non-linear monotonic trend is simulated using a cubic spline function and peak

elution order inversion is represented as random fluctuation (orthogonal residuals) along

this trend. Distribution of peak pairs along the non-linear monotonic retention time trend

is sampled directly from the peak distribution of a real LC-MS chromatogram.

c) The parameters of the simulation that can be set by the user are the following: (1)

number of accurately matched peak pairs, (2) number of randomly matched peak pairs,

(3) fluctuation in minutes of the accurately matched peak pairs simulating the amount of

maximal retention time differences related to changes in peak elution order, (4-6) three

LC-MS peak distributions expressed as a histogram along the retention time (one is

used to sample the peak distribution of the accurately matched peak pairs along the

main monotonic retention time correspondence trend and the other two are used to

sample randomly matched peak pairs from two LC-MS/MS chromatograms).

Parameters for the Monte Carlo simulations were the following:

Total number of matched peak pairs (MPPs): 100, 250, 500, 750 and 1000

Fluctuation of accurately matched peak pairs (AMPPs) around the monotonic retention time trend: 0.05, 0.1, 1, 5 and 15 minutes

Ratio of AMPPs relative to the MPPs: 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00

Number of repetitions: 3
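The structure of such a simulation can be sketched as follows. This is a simplified illustration and not the code used in the study: the monotonic trend is a fixed cubic polynomial standing in for the cubic spline, retention times are drawn uniformly instead of from real LC-MS peak histograms, and the fluctuation is applied along the y axis rather than orthogonally to the trend. The function names, the 120 min run length and the seed are assumptions.

```python
import random


def trend(t, t_max=120.0):
    """Monotonic, non-linear stand-in for the cubic spline trend (assumption)."""
    u = t / t_max
    # The derivative 0.7 + 0.9 * u**2 is positive, so the trend is strictly increasing.
    return t_max * (0.7 * u + 0.3 * u ** 3)


def simulate_pairs(n_total, ampp_ratio, fluctuation, t_max=120.0, seed=0):
    """Return (rt_1, rt_2, is_accurate) tuples with retention times in minutes."""
    rng = random.Random(seed)
    n_accurate = round(n_total * ampp_ratio)
    pairs = []
    # Accurately matched pairs fluctuate around the monotonic trend.
    for _ in range(n_accurate):
        x = rng.uniform(0.0, t_max)
        y = trend(x, t_max) + rng.uniform(-fluctuation, fluctuation)
        pairs.append((x, y, True))
    # Randomly matched pairs are spread over the whole retention time space.
    for _ in range(n_total - n_accurate):
        pairs.append((rng.uniform(0.0, t_max), rng.uniform(0.0, t_max), False))
    return pairs


pairs = simulate_pairs(n_total=500, ampp_ratio=0.5, fluctuation=1.0)
print(len(pairs), sum(1 for p in pairs if p[2]))
```

Replacing the uniform draws with sampling from retention time histograms would bring the sketch closer to criteria (b) and (c) above.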


Detailed description of the time alignment algorithm

Pre-processing of single stage LC-MS data

Figure 1 shows the main parts of the quality assessment procedure and indicates in red the modules where the procedure can stop due to improper conditions for time alignment, such as a low number of matched peak pairs with respect to random peak pairing, a low number of accurately matched peak pairs or a high probability of peak order inversion. The quality control procedure is a pairwise method and expects that the subsequent time alignment method changes only the retention time of one chromatogram (referred to here as the sample chromatogram, and shown as peak list 2 in Figure 1). To process the raw LC-MS/MS data, data in vendor-specific format were converted to mzXML format using the msconvert tool of the ProteoWizard library(5). The single stage part of the LC-MS/MS datasets in mzXML format was submitted to data pre-processing, which included peak detection and quantification, deconvolution of isotopic peak clusters, charge state determination of isotopologue peak clusters and summing of the most abundant isotopologues of each charge state per peptide. The initial noise filtering and peak quantification were carried out using the PeakPicker module of the OpenMS pipeline(6). The signal-to-noise ratio parameter of the PeakPicker algorithm was set to 10. Detected isotopologues (chemical species of the same compound with the same atomic, but different isotopic composition) of one particular charge state are then clustered, and clusters that are not in accordance with the isotope wavelet model following the “averagine” peptide composition(6) are filtered out during the feature finding step. The charge state of each detected isotope cluster is then determined, and the decharged mass of the most abundant isotopologue is calculated and assigned as the mass of a single peptide. This is followed by summing of the most abundant decharged isotopologues with the same mass (mass tolerance within ±0.01 Da) within ±30 seconds of retention time. The final quantitative value for each compound is characterized by the mass of the decharged most abundant isotopologue and the average retention time of all charge states. This information, along with the ion counts, is exported to a tab-delimited text file, which is referred to as a “peak list” in the article.
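The decharging and charge-state summing steps can be sketched as below. This is a minimal illustration under stated assumptions and not the OpenMS-based implementation: it assumes positive-mode protonated ions, so that the decharged mass is z × (m/z − proton mass), and it groups features greedily, keeping the mass of the first feature of each group. The function names and the input tuples are hypothetical.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def decharged_mass(mz, charge):
    """Neutral mass of a protonated ion observed at m/z with the given charge."""
    return charge * (mz - PROTON_MASS)


def sum_charge_states(features, mass_tol=0.01, rt_tol=30.0):
    """Sum ion counts of features with the same decharged mass and close retention time.

    features: iterable of (mz, charge, rt_seconds, ion_count).
    Returns a list of (mass, mean_rt_seconds, summed_ion_count).
    """
    compounds = []
    for mz, charge, rt, count in features:
        mass = decharged_mass(mz, charge)
        for compound in compounds:
            mean_rt = sum(compound["rts"]) / len(compound["rts"])
            if abs(compound["mass"] - mass) <= mass_tol and abs(mean_rt - rt) <= rt_tol:
                compound["rts"].append(rt)
                compound["ion_count"] += count
                break
        else:
            # No existing compound matches: start a new one.
            compounds.append({"mass": mass, "rts": [rt], "ion_count": count})
    return [(c["mass"], sum(c["rts"]) / len(c["rts"]), c["ion_count"]) for c in compounds]


# Hypothetical features: two charge states of one peptide plus an unrelated peak.
features = [(1001.007276, 1, 600.0, 100.0), (501.007276, 2, 610.0, 50.0), (700.0, 2, 900.0, 10.0)]
for row in sum_charge_states(features):
    print(row)
```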

Intensity-rank-based peak matching of LC-MS data (left part of step 1 in Figure 1)

Prior to peak matching, all peak lists were sorted and ranked according to decreasing intensity. Correspondences between a pair of peak lists are determined by finding peak pairs that are close in mass and intensity rank. The peak correspondences between a pair of intensity-sorted peak lists are identified using a sliding window technique with the following parameters: (1) the maximal m/z difference allowed between paired peaks. Peak pairs should be close in m/z, and this threshold should be set according to the maximal mass calibration differences between the two LC-MS chromatograms. To improve mass calibration, it is advisable to recalibrate the mass axis using either known masses of background contaminants(7) or accurate masses of identified peptides from MS/MS data, if available(8); (2) the number of the most abundant peaks used to identify peak pairs. The end of the intensity-sorted peak lists contains noise and other data processing artifacts, therefore this parameter should be set to exclude such non-informative items; (3) the length of the sliding window used in the peak matching procedure. This window defines the largest difference between the intensity ranks of paired peaks that is considered by the matching algorithm. In case of multiple hits for the same mass within the sliding window, the algorithm selects only the peak with the lowest difference in intensity rank. Figure S2 in the supporting information provides a visual summary of the peak matching procedure used to define peak pairs between two intensity-sorted LC-MS peak lists using the sliding window approach.
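The three parameters described above map directly onto a small matching routine. The following is a minimal sketch and not the implementation used in the study: the peak tuple layout and the function name are assumptions, and the sketch does not enforce one-to-one matching between the two lists.

```python
def match_by_intensity_rank(peaks_1, peaks_2, max_mass_diff, n_top, window):
    """Pair peaks that are close in mass and in intensity rank.

    peaks_x: list of (mass, retention_time, intensity).
    Returns a list of (retention_time_1, retention_time_2) pairs.
    """
    # Sort by decreasing intensity and keep only the most abundant peaks.
    ranked_1 = sorted(peaks_1, key=lambda p: -p[2])[:n_top]
    ranked_2 = sorted(peaks_2, key=lambda p: -p[2])[:n_top]
    pairs = []
    for rank_1, peak_1 in enumerate(ranked_1):
        best = None
        lo = max(0, rank_1 - window)
        hi = min(len(ranked_2), rank_1 + window + 1)
        for rank_2 in range(lo, hi):
            if abs(peak_1[0] - ranked_2[rank_2][0]) <= max_mass_diff:
                # Several hits: keep the one with the smallest rank difference.
                if best is None or abs(rank_1 - rank_2) < abs(rank_1 - best):
                    best = rank_2
        if best is not None:
            pairs.append((peak_1[1], ranked_2[best][1]))
    return pairs


# Hypothetical peak lists (mass in Da, retention time in minutes, intensity).
list_1 = [(1000.000, 10.0, 900.0), (1500.000, 20.0, 800.0), (2000.000, 30.0, 700.0)]
list_2 = [(1500.004, 21.0, 950.0), (1000.003, 11.0, 850.0), (2500.000, 40.0, 600.0)]
print(match_by_intensity_rank(list_1, list_2, max_mass_diff=0.01, n_top=3, window=1))
```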

Optimizing peak matching parameters of single stage LC-MS peak lists (parameter optimisation in step 1 in Figure 1)

All peak matching algorithms provide a certain ratio of accurately and inaccurately matched peak pairs. The accurately matched peak pairs are common peaks between the two chromatograms and contain the information for the correction of the retention time differences between the two LC-MS chromatograms. When the ratio of accurately matched peaks is high within the dataset, the retention time coordinates of the accurately matched peaks accumulate along the retention time correspondence trend. Bivariate kernel density estimation (2D-KDE) is applied over the retention time vectors of the matched peak pairs to identify the regions where peaks accumulate in higher density compared to what is expected

from random pairing of peaks from two LC-MS/MS chromatograms. In 2D-KDE, for n peak pairs with retention time coordinates $(x_i, y_i)$, the estimated probability density function $\hat{f}_H$ is given by:

$$\hat{f}_H(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i,\, y - y_i) \qquad (1)$$

where $K_H$ is a bivariate, ellipsoidal, symmetric Gaussian kernel that integrates to 1 from $-\infty$ to $+\infty$ for x and y values. H is the bandwidth, described by the sigma of the two-dimensional Gaussian kernel in the x ($\sigma_x$) and y ($\sigma_y$) directions, and is greater than zero. $K_H$ determines the smoothing extent of the 2-dimensional density histogram and is expressed by the following equation:

$$K_H(x - x_i,\, y - y_i) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x - x_i)^2}{\sigma_x^2} + \frac{(y - y_i)^2}{\sigma_y^2} - \frac{2\rho\,(x - x_i)(y - y_i)}{\sigma_x\sigma_y}\right]\right) \qquad (2)$$

ρ is the correlation between the two 1-dimensional Gaussian kernel functions and defines the rotation of the Gaussian kernel. The bandwidth parameter was optimally set using a plug-in bandwidth matrix approach developed by Botev et al.(9). An important feature of the full bandwidth matrix is that it does not use any normal reference rules and is data centric. For n peak pairs, the algorithm estimates a square density matrix of size $2^i \times 2^i$, where $i = \arg\min_i \{2^i \ge n\}$, and the data matrix covers the entire retention time domains of the matched peak pairs. The exponent is capped at i = 7 to avoid long calculation times for the 2-dimensional Kolmogorov-Smirnov test (see two paragraphs below).
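A direct evaluation of the density estimate of equation (1) with the Gaussian kernel of equation (2) can be sketched as follows. The bandwidths and the correlation are passed in by hand here, which is an assumption made for illustration; the plug-in bandwidth selection of Botev et al. is not reproduced. The function names and the example coordinates are hypothetical.

```python
import math


def gaussian_kernel(dx, dy, sigma_x, sigma_y, rho):
    """Bivariate Gaussian kernel evaluated at the offsets (dx, dy)."""
    one_minus_rho2 = 1.0 - rho ** 2
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    q = ((dx / sigma_x) ** 2 + (dy / sigma_y) ** 2
         - 2.0 * rho * dx * dy / (sigma_x * sigma_y))
    return norm * math.exp(-q / (2.0 * one_minus_rho2))


def kde2d(pairs, sigma_x, sigma_y, rho, grid_x, grid_y):
    """Density matrix with one row per grid_y value and one column per grid_x value."""
    n = len(pairs)
    return [[sum(gaussian_kernel(x - xi, y - yi, sigma_x, sigma_y, rho)
                 for xi, yi in pairs) / n
             for x in grid_x]
            for y in grid_y]


# Hypothetical matched peak pairs (retention times in minutes).
pairs = [(10.0, 10.5), (20.0, 21.0), (30.0, 31.2)]
grid = [float(t) for t in range(0, 41, 10)]
density = kde2d(pairs, sigma_x=2.0, sigma_y=2.0, rho=0.5, grid_x=grid, grid_y=grid)
print(len(density), len(density[0]))
```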

Peaks paired from the two peak lists contain randomly matched peak pairs and accurately matched peak pairs. The ratio of correctly and incorrectly matched pairs using decharged and isotope-deconvoluted peak lists depends on the molecular composition of the two samples and on the parameters of the peak matching procedure. In order to assess the ratio of correctly and incorrectly matched peak pairs statistically, a p-value is calculated using the 2-dimensional Kolmogorov-Smirnov (2D-KS) test between the 2D-KDE matrix obtained from the matched peak pair distribution and the density matrix calculated for random peak pairing. The density matrix for random peak pairing is obtained as the cross product of the 1-dimensional KDEs of the peak distribution of each LC-MS chromatogram, using x and y for the corresponding chromatograms. The 2D-KS test therefore measures the statistical probability that the peak pair distribution originates from the distribution of random pairing of peaks from the two chromatograms.
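The reference density for random pairing can be sketched as an outer product of the two 1-dimensional densities. This is a minimal illustration that assumes both densities are already evaluated on grids; the function name and the example values are hypothetical.

```python
def random_pairing_density(density_x, density_y):
    """Outer product of two normalised 1-D densities.

    Rows follow the y chromatogram and columns the x chromatogram.
    """
    total_x = sum(density_x)
    total_y = sum(density_y)
    return [[(dy / total_y) * (dx / total_x) for dx in density_x] for dy in density_y]


# Hypothetical 1-D peak densities on a two-point grid.
matrix = random_pairing_density([1.0, 3.0], [2.0, 2.0])
print(matrix)
```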

The 1-dimensional Kolmogorov-Smirnov (1D-KS) test provides the non-parametric

probability that two distributions is equivalent and that the observed differences between the

two distributions is due to random sampling. The 1D-KS uses the maximum absolute

difference between the two cumulative distributions to calculate the probability of the equality

of two empirical distributions. Extending KS statistic to multi-dimensional space is

challenging, while there are 2d-1 number of independent cumulative distributions in d

Page 12: University of Groningen Mastering data pre-processing for

145

dimensions. We have slightly modified the algorithm developed by Peacock et al.(10), which

estimates the largest difference between the two cumulative distributions for any possible

ordering for two dimensions. Given n points in a two-dimensional space defined by the

retention time domains of the two chromatograms, this amounts to calculating the

cumulative distribution functions in 4n2-1 quadrants. Our modification comprises that the

cumulative functions is not calculated for each peak pairs, but it is obtained directly from the

two 2D-KDE matrices, one obtained with the cross product of two 1D-KDE calculated from

the peak distribution in the two LC-MS chromatograms and the other obtained with the peak

pairing algorithm described in previous section 3.2. The DKS test statistic is then obtained by

calculating the largest difference between cumulative distributions considering all possible

4n2-1 quadrant divisions, where n in this case corresponds to dimension of the 2D-KDE

square matrices. The null hypothesis considering, that the distribution of peak pairs obtained

with random peak paring and the distribution obtained with matching of intensity sorted peak

lists is same is rejected at a significance level of α if

αKS Z>Dn2

(3)

where $Z_{\alpha}$ is the cumulative standard normal deviate for the corresponding probability α. The exact p-value of the 2D-KS estimate is obtained from the left side of inequality (3). The 2D-KS test is then used to optimise the three parameters of the intensity-rank-based peak pairing algorithm (the length of the intensity rank window, the threshold for the m/z difference and the number of most abundant peaks considered by the peak matching procedure) over a predefined set of parameter values (the exact values are presented in section 7 of the supporting information). The 2D-KDE and 2D-KS calculations are performed only for peak lists that yield a minimum of 100 matched peak pairs.

If the minimum number of matched peak pairs is not reached during the whole optimisation procedure, the two-step time alignment procedure stops. The alignment procedure also stops if the 2D-KS test p-value, i.e. the probability that the intensity-rank-based peak pair distribution is the same as the one that would be obtained with random peak pairing, is higher than 0.001. The peak matching parameters have a large effect on peak matching accuracy (the ratio of accurately to inaccurately matched peak pairs and the number of peak pairs obtained). This step is therefore crucial for finding the optimal peak matching parameters, namely those that provide the peak pair distribution in the retention time space of the two chromatograms that differs most from the distribution obtained with random peak pairing. A few examples of the effect of the parameters on the peak matching results are shown in Figure S3 of this supporting information. Figure S4 of this supporting information shows plots presenting the main steps of the selection of accurately matched peaks.
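The grid-based comparison of two density matrices described above can be illustrated with a minimal sketch. It assumes two non-negative matrices of identical shape (nested lists), normalises each to unit sum, accumulates them in the four quadrant orientations and returns the largest absolute difference. The function names are illustrative and are not taken from the original MATLAB code, and the conversion of the statistic into a p-value is deliberately left out.

```python
def quadrant_cdf(density):
    """Normalised cumulative sum over the quadrant ending at each grid cell."""
    total = sum(sum(row) for row in density)
    if total <= 0:
        raise ValueError("density matrix must have a positive sum")
    cdf = []
    for i, row in enumerate(density):
        running = 0.0  # sum of the current row up to column j
        cdf_row = []
        for j, value in enumerate(row):
            running += value
            above = cdf[i - 1][j] if i > 0 else 0.0
            cdf_row.append(running + above)
        cdf.append(cdf_row)
    return [[v / total for v in row] for row in cdf]


def flip(matrix, rows, cols):
    """Return a copy of the matrix, optionally reversed along rows and/or columns."""
    ordered = matrix[::-1] if rows else matrix
    return [list(r[::-1]) if cols else list(r) for r in ordered]


def ks2d_grid(density_a, density_b):
    """Largest cumulative difference between two gridded densities
    over the four quadrant orientations (hypothetical helper)."""
    d_max = 0.0
    for rows in (False, True):
        for cols in (False, True):
            cdf_a = quadrant_cdf(flip(density_a, rows, cols))
            cdf_b = quadrant_cdf(flip(density_b, rows, cols))
            for row_a, row_b in zip(cdf_a, cdf_b):
                for va, vb in zip(row_a, row_b):
                    d_max = max(d_max, abs(va - vb))
    return d_max


paired = [[1, 0], [0, 1]]   # density concentrated along a trend
random_ = [[1, 1], [1, 1]]  # density expected from random pairing
print(ks2d_grid(paired, random_))
```

Working on the density matrices rather than on individual peak pairs keeps the cost tied to the grid size instead of the number of peak pairs, which is the motivation given for the modification.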

Selection of accurately matched peak pairs (step 2. in Figure 1)

An optimal threshold selection method is required to select the dense region of a 2D-KDE obtained with any type of peak pairing method that yields a mixture of accurately and inaccurately matched peak pairs. This threshold is calculated by constructing a one-dimensional histogram from all the density values of the 2D-KDE matrix (the number of histogram bins equals the size of the square 2D-KDE matrix). The threshold (d) is set to the density value at which the positive part of the first derivative of the histogram frequency is closest to its median:

$$d = \arg\min \left| \left(\frac{\partial h}{\partial d}\right)^{+} - \widetilde{\left(\frac{\partial h}{\partial d}\right)^{+}} \right|$$

where h corresponds to the abundance of the histogram, d is the density estimate, the + sign refers to the positive values of $\partial h / \partial d$ and the ~ sign to the median of those values. Peak pairs located in areas with a density higher than this threshold are selected as accurately matched peak pairs, while the other peak pairs are considered randomly grouped, mismatched peak pairs. It should be noted that this threshold selection is sensitive to the peak distribution, and slight manual readjustment of the threshold value may improve the accuracy of the selection of accurately matched peak pairs.
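A minimal sketch of this threshold rule is given below. It makes several assumptions not stated in the text: the derivative is approximated by a forward finite difference between neighbouring bins, the threshold is reported as the centre of the left-hand bin of the selected difference, and ties are resolved in favour of the lowest density. The function name is hypothetical.

```python
from statistics import median


def density_threshold(density, n_bins=None):
    """Pick the density at which the positive histogram slope is closest to its median."""
    values = [v for row in density for v in row]
    n_bins = n_bins or len(density)  # default: size of the square matrix
    lo, hi = min(values), max(values)
    if hi <= lo:
        raise ValueError("density values must not all be equal")
    width = (hi - lo) / n_bins

    # Histogram of all density values.
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    centers = [lo + (k + 0.5) * width for k in range(n_bins)]

    # Forward finite difference of the counts with respect to density.
    slope = [(counts[k + 1] - counts[k]) / width for k in range(n_bins - 1)]
    positive = [(s, k) for k, s in enumerate(slope) if s > 0]
    if not positive:
        raise ValueError("histogram has no rising part")

    target = median(s for s, _ in positive)
    _, best = min(positive, key=lambda item: abs(item[0] - target))
    return centers[best]


toy_kde = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 3, 3],
    [0, 0, 3, 3],
]
print(density_threshold(toy_kde))
```

Because the rule depends on the shape of the histogram, a quick visual check of the selected threshold against the density image is a sensible safeguard, in line with the remark on manual readjustment.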

Monotonic non-linear alignment function (step 3. in Figure 1)

The retention time coordinates of the selected accurately matched peak pairs are used to calculate a monotonic non-linear global alignment function using Locally Weighted Scatterplot Smoothing (LOWESS)(11) regression in combination with the bagging resampling technique(12). A robust version of LOWESS regression, which assigns lower weight to outliers, was used to calculate the alignment function. The method assigns zero weight to peak pairs lying more than six median absolute deviations of the residuals from the tested position. The span is set to four times the root mean square of the 2D-KDE bandwidth, and a third-order polynomial is used for the LOWESS regression. The final smoothed regression points are calculated as the average of 100 bootstrap resamplings. The bootstrap resampling is performed uniformly with replacement using all extracted peak pairs. This procedure reduces the variance of the LOWESS predictor and helps to avoid overfitting. When the elution order of common peaks is the same in two chromatograms, the one-to-one peak correspondence is expressed by a monotonic function between the retention times of accurately matched peaks. For that reason the main retention time correspondence trend, i.e. the alignment function, should be monotonic. To make the main time alignment function monotonic, least-squares linear optimisation with a monotonic constraint is applied to the average LOWESS regression points of the 100 bootstraps. A piecewise cubic Hermite interpolating polynomial (PCHIP) function with cubic spline(13) is used to perform monotonic interpolation for transforming the retention times of peaks between the retention time spaces of the two chromatograms. Partitioning of the data for PCHIP was performed on the basis of the span used in LOWESS (root mean square of the 2D-KDE bandwidth). Before performing PCHIP, linear interpolation was performed between experimental data points for partitions containing no data points, to avoid large jumps in the main monotonic alignment function.
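The text does not name the solver used for the least-squares fit under a monotonic constraint. One standard algorithm for this kind of problem is pool-adjacent-violators (PAVA), sketched below as an illustration only; it is an assumption that it would give results comparable to the original implementation. The input is taken to be the averaged LOWESS points, already sorted by the retention time of the first chromatogram, and the result is non-decreasing rather than strictly increasing.

```python
def pava(y, weights=None):
    """Least-squares non-decreasing fit to the sequence y (pool adjacent violators)."""
    weights = [1.0] * len(y) if weights is None else list(weights)
    blocks = []  # each block holds [mean, total weight, number of points]
    for value, weight in zip(y, weights):
        blocks.append([float(value), weight, 1])
        # Merge backwards while the monotonic order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, w2, n2 = blocks.pop()
            mean1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(mean1 * w1 + mean2 * w2) / w, w, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Averaged regression points with one local order violation (3 > 2).
print(pava([1, 3, 2, 4]))
```

Flat stretches produced by pooling are still compatible with a monotonic interpolant, but they map several retention times of one chromatogram to the same value in the other, which is worth checking before interpolation.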

Probability of peak elution order similarity between two chromatograms (step 3. in Figure 1)

When the elution order of common peaks is the same in two chromatograms, the accurately matched peak pairs follow a non-linear monotonic trend between the chromatograms without any fluctuation of their retention time coordinates along this trend. However, small scattering may be observed due to imprecise determination of the peak maxima. In this case the monotonic alignment function makes it possible to determine the one-to-one correspondence of peaks in the two chromatograms unambiguously, that is, to find the same peaks or to determine that a peak has no counterpart in the other chromatogram. When the elution order of common peaks differs between the two chromatograms, the fluctuation of the correctly matched peak pairs around the non-linear monotonic retention time trend becomes larger. In this case it is not possible to match peaks between the two chromatograms unambiguously. The corresponding peak could be anywhere within the fluctuation domain of the accurately matched peak pairs, and the non-linear monotonic function only represents the average retention time correspondence function.

The probability of peak order inversion can be calculated by comparing the orthogonal residual variance of the accurately matched peak pairs between two chromatograms that have the same elution order of common peaks (e.g. two chromatograms of two samples with similar molecular composition acquired in the same batch) with the orthogonal residual variance obtained for the two chromatograms of interest. It is advantageous for the two chromatogram pairs to share at least one chromatogram, to minimise differences due to different samples and/or different LC-MS acquisitions. By conducting an F-test on the orthogonal residual variances obtained for the two conditions, the probability of peak elution order similarity, which is the null hypothesis of the F-test, can be estimated. When comparing two chromatograms obtained under different conditions (e.g. acquired in two different laboratories), it is possible to perform two separate F-tests, in which the orthogonal residual variance with no peak order inversion is determined for each of the two chromatograms separately. For the final decision on peak elution order similarity, the F-test providing the smaller p-value should be taken into consideration. If the probability of peak elution order similarity is lower than 0.01, the algorithm stops, because the chance of similar peak elution order is low and it is therefore not possible to establish an unambiguous one-to-one correspondence between peaks or chromatographic locations of the two chromatograms.

The orthogonal residuals are calculated differently from the residuals of a usual regression analysis. In regression analysis the dependent and independent variable axes are fixed, whereas in time alignment the two axes should be interchangeable (e.g. the same results should be obtained by aligning chromatogram A to B and B to A). For this reason we calculated the orthogonal residual distance from the main monotonic function after transforming the retention times of the peaks of one chromatogram with the main retention time correspondence function. In this case the main monotonic retention time correspondence function becomes a line at 45° to the two retention time axes of the scatter plot. The procedure for calculating orthogonal residuals and performing the F-test to assess the probability of peak elution order similarity is demonstrated in Figure S5.

$D_{max}$ is calculated for the orthogonal variance. The components of $D_{max}$ along the two chromatogram axes provide the retention time error for determining retention time locations in the other chromatogram after alignment.
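As an illustration of the orthogonal residuals, the sketch below computes the signed distance of each peak pair from the 45° line after one retention time axis has been transformed, and then forms the variance ratio on which an F-test would operate. The function names are hypothetical, and it is an assumption that the variance of the pair of interest is placed in the numerator. The p-value is not computed here; it could be obtained from the F distribution with n1 - 1 and n2 - 1 degrees of freedom using a statistics library.

```python
import math
from statistics import variance


def orthogonal_residuals(rt_fixed, rt_transformed):
    """Signed orthogonal distance of each peak pair from the line y = x."""
    return [(y - x) / math.sqrt(2) for x, y in zip(rt_fixed, rt_transformed)]


def variance_ratio(residuals_interest, residuals_reference):
    """F statistic: residual variance of the pair of interest
    over that of a reference pair without elution order inversion."""
    return variance(residuals_interest) / variance(residuals_reference)


# Toy example: the pair of interest scatters twice as widely as the reference pair.
reference = orthogonal_residuals([10, 20, 30, 40], [10.5, 19.5, 30.5, 39.5])
interest = orthogonal_residuals([10, 20, 30, 40], [11.0, 19.0, 31.0, 39.0])
print(variance_ratio(interest, reference))
```

Because the distance is measured perpendicular to the diagonal, swapping the two axes only changes the sign of each residual, so the variance is the same whichever chromatogram is treated as the reference.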

Retention time correction (step 4. in Figure 1)

In the case of a high probability of peak elution order similarity of common peaks in two chromatograms (the null hypothesis of the F-test is not rejected), the alignment function is used to correct the retention times of the peaks in the sample chromatogram with respect to a reference chromatogram. The method does not depend on which chromatogram is selected as reference or sample, as the monotonic nature of the retention time trend between the two chromatograms allows the same one-to-one correspondence of common peaks to be determined. Figure S12 in the supporting information shows that the non-linear main retention time correspondence trends obtained with the two different orders of LC-MS chromatograms are highly similar. The corrected retention times of the peaks in the sample chromatogram are calculated by interpolation using the monotonic alignment function. The algorithm finally yields a sample peak list aligned to the reference peak list. It should be noted that any other type of time alignment method that derives a monotonic non-linear retention time correspondence function can be used instead of the proposed monotonically constrained LOWESS/PCHIP approach.
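The correction step can be sketched as a lookup through the alignment function. For brevity the example below uses piecewise linear interpolation between the knots of the alignment function instead of the PCHIP interpolation named in the text, so it is a simplified stand-in rather than the described method. The knots are assumed to be strictly increasing in the sample chromatogram, and the function name is hypothetical.

```python
from bisect import bisect_right


def correct_retention_times(sample_rt, knots_sample, knots_reference):
    """Map sample retention times onto the reference time axis
    by linear interpolation between alignment knots."""
    corrected = []
    last_segment = len(knots_sample) - 2
    for t in sample_rt:
        # Index of the segment containing t; end segments are extended linearly.
        i = min(max(bisect_right(knots_sample, t) - 1, 0), last_segment)
        x0, x1 = knots_sample[i], knots_sample[i + 1]
        y0, y1 = knots_reference[i], knots_reference[i + 1]
        corrected.append(y0 + (t - x0) * (y1 - y0) / (x1 - x0))
    return corrected


# Alignment function passing through (0, 0), (10, 12) and (20, 20) minutes.
print(correct_retention_times([5, 15], [0, 10, 20], [0, 12, 20]))
```

As long as the knots are monotonic, the mapping preserves the elution order of the sample peaks, which is the property the alignment function is constructed to guarantee.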

Hardware and software environment

The Monte Carlo simulation, the intensity-rank-based peak matching, the 2D-KDE and the 2D-KS algorithms were written in the MATLAB scripting language using MathWorks MATLAB R2010b (version 7.11.0.584, 64-bit Linux version) and were run on a desktop computer equipped with an Intel Quad Q9300 CPU at 2.5 GHz, 8 GB RAM and the 64-bit Ubuntu Linux 10.04 operating system. The source code is available at https://trac.nbic.nl/pre-alignment.

Peptide identification parameters

Peptide and protein identification was performed with the Phenyx database search program (Geneva Bioinformatics, version 2.6, Geneva, Switzerland) using raw data in mzData format. Datasets were searched against the UniProt database (version 57.4) and against the reversed sequences of this database with the following parameters: taxonomy: Rattus norvegicus; instrument types were selected according to the mass spectrometer used; FDR rate: <1; scoring model: ESI-QTOF (QTOF) for QTOF data and CID_LTQ_scan_LTQ for Orbitrap data; parent ion charge states: +2, +3, +4 (with trusted medium charge). The search was performed in two subsequent cycles. The following search parameters were common to both cycles: peptide AC score: ≥5; peptide length: ≥5; p-value: <0.0001; cleaving enzyme: trypsin (KR); number of allowed missed cleavages: ≤1. The amino acid modifications differed between the two search cycles. Cycle 1: Cys_CAM (carboxyamidomethylation, fixed), Oxidation_M (oxidation of methionine, variable, ≤2). Cycle 2: Cys_CAM (fixed), Oxidation_M (variable, ≤2), Oxidation_HW (oxidation of histidine and tryptophan, variable, ≤2), DEAMID (deamidation, variable, ≤2), PHOS (phosphorylation, variable, ≤2). The cleavage mode was set to ‘normal’ for cycle 1 and to ‘half cleaved’ for cycle 2. The parent ion m/z tolerance was 600 ppm for the first and 800 ppm for the second cycle for Orbitrap data, and 800 ppm for both cycles for QTOF data. Only MS/MS spectra of ions with intensities above the background noise (50−100 counts) were considered in both search cycles. Source code, installation guide, user manual and an example dataset are available at https://trac.nbic.nl/pre-alignment.


Parameters used for optimisation of intensity-based peak matching

Delta mass: 0.005, 0.01, 0.05, 0.1 and 0.3

Maximal number of most abundant peaks: 50, 100, 200, 500

Rank window fraction: 0.50, 0.60, 0.80, 0.90 and 1
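The three value lists above define a grid of 5 × 4 × 5 = 100 parameter combinations. A minimal sketch of how such a grid could be enumerated and scanned is shown below; the dictionary keys and the `score` callback are illustrative assumptions, with the score standing for whatever quantity is maximised (for example the minus log of the 2D-KS p-value).

```python
from itertools import product

DELTA_MASS = [0.005, 0.01, 0.05, 0.1, 0.3]
MAX_ABUNDANT_PEAKS = [50, 100, 200, 500]
RANK_WINDOW_FRACTION = [0.50, 0.60, 0.80, 0.90, 1.0]


def parameter_grid():
    """All combinations of the three peak matching parameters."""
    return [
        {"delta_mass": dm, "max_peaks": n, "rank_window_fraction": f}
        for dm, n, f in product(DELTA_MASS, MAX_ABUNDANT_PEAKS, RANK_WINDOW_FRACTION)
    ]


def select_best(grid, score):
    """Return the parameter set with the highest score (hypothetical callback)."""
    return max(grid, key=score)


grid = parameter_grid()
print(len(grid))
```

An exhaustive scan is practical here because the grid is small; each combination requires one peak matching run followed by one 2D-KDE and 2D-KS evaluation.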

Label legend of the G-score versus 2D-KS plot in Figure 2

Parameters corresponding to the labels of the G-score versus 2D-KS plot in Figure 2. In addition to these parameters, the size of the marker indicates the ratio of trend peak pairs relative to the total number of peak pairs (values: 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00). A larger marker indicates a higher percentage of trend points used in the simulation.

[Legend plot: x-axis: G-score (0 to 1); y-axis: -log10(p-value) (0 to 70). Marker labels: N = 100, 250, 500, 750 and 1000 points, each with 0.05, 0.3, 1, 5 and 15 min.]


Supplementary figures for chapter 3

Figure S1. Uncertainty of determining peak maxima in single stage LC-MS peak detection (green double

arrow) and uncertainty of having an MS/MS event (red double arrow) during data dependent acquisition

demonstrated on extracted ion chromatogram of a chromatographic peak. The threshold of data-dependent

acquisition is presented with a red line.

[Figure S1 plot: extracted ion chromatogram; x-axis: retention time (min), 164 to 168; y-axis: intensity, 0 to 10000.]


Figure S2. Schematic representation of intensity-rank-based peak matching using the sliding window technique. The first and second sliding windows are represented as black dashed boxes in the two single-stage LC-MS peak lists. The length of the window in this case is 10. The window slides from the most abundant to less abundant peaks until it reaches a limit rank provided as a parameter. The grey arrows show corresponding peaks in the two single-stage LC-MS peak lists within a certain mass tolerance (here ±0.1 Da), which are included in the matched peak list. The resulting matched peak pairs are the outcome of the procedure.

Ref. Peaklist

RT          M/Z        Intensity
117.2238    679.326    1.23E+08
92.02026    633.8222   1.05E+08
76.86034    575.3129   89315800
60.25212    509.2758   72686704
76.83325    720.3972   72064400
102.3785    733.397    61257900
45.08288    699.4488   58027000
78.10147    485.94     52990000
44.11627    716.3162   52803800
139.4195    554.9541   51339200
139.6532    831.9259   47382100
57.61272    550.843    45829000
42.09376    416.8869   44581000
119.4607    679.8202   44010300
132.1907    740.432    39606300
23.37762    491.2722   39125600
82.66945    575.8082   37671400
138.7721    907.9263   36215800
41.55041    525.7471   35786200
23.39004    747.4247   35451300
117.7689    955.547    34737000
131.9638    493.9559   33383700
33.47533    538.2569   32477600
75.7529     700.974    32173600

Samp. Peaklist

RT          M/Z        Intensity
203.9339    733.3986   517989
107.2073    583.8935   511139
77.32381    716.3185   493099
329.0283    550.9585   465810
178.2039    485.945    376887
179.3107    485.9439   375440
122.622     772.8679   316733
187.6178    633.8263   312388
201.5122    628.322    285237
196.048     735.3841   283167
144.1663    575.3158   282728
95.91563    708.283    261429
78.25997    463.2606   254432
329.007     831.9324   252834
205.8059    733.397    244018
80.16918    716.3165   227115
150.6807    486.794    216580
177.1648    485.9433   212658
260.4004    753.8816   209966
79.97867    487.2531   206209
97.08019    708.284    204199
90.24291    416.8859   202004
77.71372    421.2163   197822
79.76277    322.6951   194113

Matched Peaklist

Ref. RT     Ref. M/Z   Ref. Intensity   Samp. RT    Samp. M/Z   Samp. Intensity
92.02026    633.8222   1.05E+08         187.6178    633.8263    312388
76.86034    575.3129   89315800         144.1663    575.3158    282728
57.61272    550.843    45829000         329.0283    550.9585    465810


Figure S3. Effect of the peak matching parameters (maximal m/z difference, length of the sliding window and rank fraction; the values are shown at the bottom of each plot) on the peak pairing results, presented as scatter plots. The Y axis is the retention time of matched peaks in laboratory 2 and the X axis is the retention time of matched peaks in laboratory 1. The scatter plots were obtained using the NCI mouse serum dataset and the chromatograms Lab1_GLY_LungEGFR_tumor and Lab2_GLY_LungEGFR_tumor.

[Four scatter plots, panels (a) to (d); axis labels: retention time axis in Lab 1 in minutes and retention time axis in Lab 2 in minutes. Parameter values printed below the panels, in extraction order: 0.1, 500, 0.2; 0.5, 500, 0.3; 0.1, 500, 0.3; 0.1, 750, 0.3.]


Figure S4. Visualization of the most important steps of the time alignment quality control procedure using simulated data with 500 matched peak pairs, 5 minutes of constant fluctuation of accurately matched peak pairs and a ratio of 0.75 of accurately matched peak pairs. Plot a) shows the initial data in a scatter plot with the retention times of the matched peak pairs in the two chromatograms. Blue dots represent the randomly matched peak pairs and red dots represent the accurately matched peak pairs following the main monotonic retention time trend. Plot b) shows the corresponding 2D-KDE density image of the peak pairs shown in plot a). Plot c) shows the 2D-KDE density image of the cross product of the two 1D-KDE peak densities of the two LC-MS chromatograms. This plot represents the peak density that is obtained with random pairing of peaks in the two chromatograms. The 2D-KS test is performed by comparing the cumulative density distributions of plots b) and c) when optimizing the parameters of the intensity-based peak matching procedure. Plot d) shows the histogram of the density values of plot b) with the natural logarithm of the counts within the histogram bins. The red line shows the location of the threshold (p = $9.144 \cdot 10^{-29}$) selected automatically and corresponding to the location where the positive first derivative of the histogram is closest to its median. Plot e) shows the 2D-KDE density image of plot b) enclosing the density region higher than the automatically selected threshold (white contours). Plot f) shows the scatter plot presented in a) with the contours of the high density regions (red contours) indicating the peak pairs selected and considered to be accurately matched.

[Six panels, a) to f). Panels a), b), c), e) and f): axes: retention time chromatogram 1 (min) and retention time chromatogram 2 (min), 0 to 150. Panel d): x-axis: density (0 to 1.6 x 10^-3); y-axis: ln(counts).]


Figure S5. Calculation of orthogonal residuals and calculation of the probability of peak elution order similarity. The original retention times (left plots; the chromatograms to be transformed are sample 12 in laboratory 1 and sample 6 in laboratory 2) of accurately matched peak pairs are transformed using the main retention time correspondence function to the retention time space of the other chromatogram (right plots). In the transformed scatter plot the main monotonic function is a diagonal line at 45° between the two axes, from which the orthogonal distance can be calculated using right-angled triangle rules. The orthogonal residual variance is then calculated for the two chromatograms of interest (upper right plot) and for a pair of LC-MS chromatograms that do not have peak elution order inversion, generally acquired within the same batch and with well controlled chromatographic parameters (lower right plot). It is preferable that one chromatogram of the reference chromatogram pair with no peak elution order inversion is chosen from the chromatograms that should be aligned (in this plot Sample6_Lab1). The F-test is calculated using these two orthogonal residual variances. It is possible to perform this transformation for both LC-MS chromatograms (Sample6_Lab1, as presented here, and Sample6_Lab2) and therefore to perform two F-test calculations, one with each of the chromatograms that should be aligned. The F-test providing the lowest p-value is considered for the final decision on whether or not the two LC-MS chromatograms of interest have the same elution order of common peaks.

[Four scatter plots: original axes (left) and uniform axes after axis transformation (right), for a between-labs pair and a within-lab pair. Axis labels: Sample6_Lab1, Sample6_Lab2, Sample12_Lab1, Sample6_Lab2(trans)*, Sample12_Lab1(trans)*. The two right-hand plots are linked by the label Comparison (F-TEST).]


Figure S6. Three-dimensional bar plots of specificity (top left), sensitivity (top right) and minus log of the 2D-KS test probability (bottom left) obtained with Monte Carlo simulation, representing the complete studied parameter space. The numbers of peak pairs were 100, 250, 500, 750 and 1000, the fluctuations of the accurately matched peaks were 0.05, 0.3, 1, 5, 10 and 15 minutes and the fractions of accurately matched peak pairs were 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00.

[Three bar plots titled Sensitivity, Specificity and -log10(p-value); horizontal axis: Fluctuation and N; third axis ticks: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 0.9, 1; vertical axes: sensitivity (0 to 1), specificity (0 to 1) and -log10(p-value) (0 to 70).]


Figure S7. Scatter plots of matched peak pairs obtained with the intensity-rank-based peak matching method from deisotoped LC-MS peak lists (left), and after deisotoping and decharging the same two LC-MS peak lists (right). Decharging the peak lists results in a lower number of matched peak pairs, but the peak pairs are richer in accurately matched peak pairs indicating the retention time trend. Peak matching was performed for Lab1_GLY_LungEGFR_normal vs Lab2_GLY_LungEGFR_normal using 500 as the window length, 0.1 Da as the maximal m/z difference and 0.9 as the rank fraction. The two analysed LC-MS peak lists had the following pre-analytical parameters: LC-MS 1: laboratory 1, GLY depletion, Lung EGFR cancer type, without tumor; LC-MS 2: laboratory 2, GLY depletion, Lung EGFR cancer type, without tumor.

[Two scatter plots; axes: retention time LC-MS 1 (min) and retention time LC-MS 2 (min).]


Figure S8. Peak with long tailing in the rat CSF dataset analysed in laboratory 2 (a) and histogram of peak width at half peak height (35 bins, taking the 1000 most abundant peaks) of the 4 samples in the two datasets acquired in different laboratories (b) (the chromatograms used are the same as those in the middle column of Figures 3 and S9). Plot a) was prepared with the OpenDX (http://www.opendx.org/) visualization software tool.

[Panel (b) histogram; x-axis: peak width at half peak height (min); y-axis: counts; legend: rat CSF Lab1, rat CSF Lab2, rat serum Lab1, rat serum Lab2.]


Figure S9. Extracted ion chromatograms (EIC) of three peptides from the same sample (sample 6 from the rat CSF dataset) in two laboratories using the original retention time values. Peptide LTLPQLEIR (green arrows) is located on the monotonic retention time correspondence function, while the peptides DIAPTLTLYVGK (red arrows) and VHQFFNVGLIQPGSVK (blue arrows) are located far from this function. Figure 5 shows the location of these peaks after alignment of one chromatogram to the other. The locations of the three peaks are shown in the scatter plot of Figure 4 with corresponding red, green and blue circles. The extracted ion chromatograms are normalized to the highest peaks; for that reason the Y axis represents ion counts relative to the intensity of the most abundant signal.

[EIC plot; x-axis: time (min); y-axis: ion count (cts); traces: 645.87, 541.83 and 590.66 ± 0.025 Da, original retention times, Lab 1 and Lab 2.]


Figure S10. Scatter plots of matched peaks between two LC-MS chromatograms with time alignment functions. All chromatograms were obtained from the National Cancer Institute’s Mouse Proteomic Technology Initiative and originate from an experimental design study of mouse serum analysis. The scatter plot in the middle column was obtained from two LC-MS chromatograms of the same sample prepared in two laboratories, while the right and left columns were obtained with two LC-MS chromatograms from the same laboratory, one of which was used in the middle scatter plot. Matched peak pairs were obtained using peak lists obtained from single-stage LC-MS data with the OpenMS workflow and the intensity-rank-based peak matching procedure. Peak pairs not selected as accurately matched are blue. The peak pairs selected as accurately matched are contoured with dashed red lines and highlighted with green circles. The main monotonic retention time correspondence function is shown as a solid red line. Samples have the following factors in the experimental design: GLY depletion method, Lung EGFR cancer type, tumor (middle plot) and tumor and healthy (side plots).

[Three scatter plots, panels (a) to (c), titled: Within laboratory (2 samples), Interlaboratory (same sample), Within laboratory (2 samples). Axis labels: retention time axis in Lab 1 in minutes and retention time axis in Lab 2 in minutes. Row label: Single-stage MS peak list, intensity-rank-based peak matching.]


Figure S11. Overlaid plot of multiple main monotonic retention time correspondence function in 7 chromatogram pairs of the same rat CSF samples (a, b and c) and 24 chromatogram pairs of mouse serum samples (d) measured in two laboratories. In a) pairs of LC-MS precursor ion peak lists were matched based on agreement of identified peptide sequence and PTMs, in b) two LC-MS precursor ion peak lists were matched using intensity-rank-based peak matching approach and in c) pairs of LC-MS single stage ion peak lists were matched using intensity-rank-based peak matching algorithm. Overlaid plot in d) was obtained with pairs of LC-MS single stage ion peak lists matched using intensity-rank-based peak matching algorithm and the main monotonic function is colored according to the applied depletion method (in red GLY, in green MARS, in blue M+CYS and in black NF). The high similarity of the main monotonic retention time correspondence functions shows that method using sequence information to match precursor ion peak lists and intensity-rank-based matched single stage LC-MS peak lists are robust with respect of biological variability and that the two methods provide highly similar correction of retention time. Single stage LC-MS peak lists with combination of intensity-rank-based peak matching is slightly less accurate, which is reflected by the larger variability of the main monotonic retention time correspondence functions obtained with this method compared with those obtained with precursor ion LC-MS peak lists matched using agreement of identified peptide sequence and PTMs.

[Figure S11 panels a) to d): x axis, retention time in laboratory 2 (in minutes); y axis, retention time in laboratory 1 (in minutes). Panel labels: Rat CSF; Rat serum.]


Figure S12. Monotonic nonlinear time alignment functions (solid red and green lines) determined with the two possible orders of two LC-MS/MS chromatograms as sample and reference chromatogram. Peak matching was performed using identified peptide sequence and post-translational modification data, and the blue dots show the matched peak pairs. The two chromatograms were from sample 6 in laboratory 1 and laboratory 2. The two monotonic retention time correspondence functions are highly similar, which shows that the time alignment procedure does not depend on the order of the chromatograms.

[Figure S12 axes: x axis, retention time in minutes (laboratory 1); y axis, retention time in minutes (laboratory 2).]


References

(1) Kendall MG, Buckland WR, Institute IS. A dictionary of statistical terms: Hafner Pub. Co.; 1971.

(2) Liu T, Qian WJ, Chen WN, Jacobs JM, Moore RJ, Anderson DJ, et al. Improved proteome coverage by using high efficiency cysteinyl peptide enrichment: the human mammary epithelial cell proteome. Proteomics. 2005;5:1263-73.

(3) Liu T, Qian WJ, Strittmatter EF, Camp DG, 2nd, Anderson GA, Thrall BD, et al. High-throughput comparative proteome analysis using a quantitative cysteinyl-peptide enrichment technology. Anal Chem. 2004;76:5345-53.

(4) Livesay EA, Tang K, Taylor BK, Buschbach MA, Hopkins DF, LaMarche BL, et al. Fully automated four-column capillary LC-MS system for maximizing throughput in proteomic analyses. Anal Chem. 2008;80:294-302.

(5) Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24:2534-6.

(6) Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, et al. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics. 2008;9:163.

(7) Scheltema RA, Kamleh A, Wildridge D, Ebikeme C, Watson DG, Barrett MP, et al. Increasing the mass accuracy of high-resolution LC-MS data using background ions: a case study on the LTQ-Orbitrap. Proteomics. 2008;8:4647-56.

(8) Palmblad M, van der Burgt YE, Dalebout H, Derks RJ, Schoenmaker B, Deelder AM. Improving mass measurement accuracy in mass spectrometry based proteomics by combining open source tools for chromatographic alignment and internal calibration. J Proteomics. 2009;72:722-4.

(9) Botev Z, Grotowski J, Kroese D. Kernel density estimation via diffusion. The Annals of Statistics. 2010;38:2916-57.

(10) Peacock JA. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society. 1983;202:23.

(11) Cleveland WS, Devlin SJ. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association. 1988;83:596-610.

(12) Breiman L. Bagging Predictors. Machine Learning. 1996;24:123-40.

(13) Fritsch FN, Carlson RE. Monotone Piecewise Cubic Interpolation. Siam Journal on Numerical Analysis. 1980;17:238-46.


Supporting information for chapter 4

Xrea score calculation

The Xrea score measures the quality of an MS/MS spectrum using the ranked cumulative intensity distribution of its fragment ions; the measure is independent of the identification status of the fragment spectrum. It is assumed that, in an MS/MS spectrum that contains only noise, the fragment ion intensities are evenly distributed. In contrast, in an MS/MS spectrum containing fragments from a compound, exclusively or mixed with noise, the fragment ion intensity distribution is uneven, reflecting the intensity difference between fragments originating from the compound and from noise. The Xrea score can take values between 0 (the MS/MS spectrum contains only noise fragments) and 1 (the MS/MS spectrum contains only compound-derived fragments). Details of the Xrea calculation are given in Na et al.1
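The idea of scoring a spectrum from its ranked cumulative intensity distribution can be illustrated with a simplified sketch. The function below is not the Xrea formula of Na et al.; it is a Gini-like stand-in, and its name and normalisation are assumptions made for illustration. It measures the area between the diagonal (perfectly even intensities) and the cumulative intensity curve of the ascending-sorted fragment intensities, scaled so that even intensities give 0 and a spectrum dominated by a few fragments approaches 1.

```python
def uneven_intensity_score(intensities):
    """Gini-like unevenness of fragment ion intensities (illustrative only).

    Assumption: this is NOT the published Xrea formula, only a sketch of the
    same idea (ranked cumulative intensity compared with an even distribution).
    """
    xs = sorted(float(x) for x in intensities if x > 0)
    n = len(xs)
    if n == 0:
        return 0.0
    total = sum(xs)
    area_under = 0.0   # area under the cumulative relative intensity curve
    previous = 0.0
    cumulative = 0.0
    for x in xs:
        cumulative += x
        current = cumulative / total
        # Trapezoid between two consecutive ranks; each rank step is 1/n wide.
        area_under += 0.5 * (previous + current) / n
        previous = current
    # The diagonal encloses an area of 0.5; normalise the gap to the range [0, 1).
    return (0.5 - area_under) / 0.5


even = uneven_intensity_score([1, 1, 1, 1])      # evenly distributed intensities
peaked = uneven_intensity_score([1, 1, 1, 97])   # one dominant fragment
```

With these inputs the evenly distributed spectrum scores 0 and the spectrum with one dominant fragment scores clearly higher, which mirrors the interpretation of low and high Xrea values given above.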

Supplementary figures for chapter 4

Figure S1. Example of MS1 isotopologue peaks with high and low quality MS/MS spectra. Panel a) shows an extracted ion chromatogram of m/z 994.02 Da with a chromatographic peak at a retention time of 52.04 minutes. The maximum ion intensity of this peak is 1.30·10^4 ion counts. This peak was submitted for MS/MS fragmentation at the retention time indicated with an arrow, with a precursor intensity of 1.15·10^4 ion counts. The resulting MS/MS spectrum (panel c) shows a random distribution of fragment ions with respect to ion intensity, indicating low MS/MS spectral quality, and has an Xrea value of 0.14. This spectrum did not obtain a peptide sequence annotation according to the applied search parameters and FDR settings. However, the chromatographic peak in the extracted ion chromatogram of m/z 625.31 Da at a retention time of 60.37 minutes (panel b) provided an MS/MS spectrum (panel d) sampled at the top of the peak, at the retention time indicated with an arrow, with the precursor ion intensity reaching 8.0·10^5 ion counts at the MS/MS precursor sampling time. Since the peptide feature was submitted for MS/MS fragmentation at its highest intensity, the peptide feature ion intensity corresponded to the precursor ion intensity of 8.0·10^5 ion counts. This MS/MS spectrum is of high quality, as shown by the uneven fragment ion distribution, which differs from that of the MS/MS spectrum in panel c). The high quality of the MS/MS spectrum is confirmed by the Xrea value of 0.85, and the spectrum obtained a successful PSM attributing the primary amino acid sequence TTPPVLDSDGSFFLYSK (panel e). This figure was produced for the LC-MS/MS file obtained from the 14th fraction of kidney using the bRP approach.

[Figure S1 panel annotations: m/z 994.02; RT 52.04; charge +2; scan 6139; Xrea 0.140. m/z 625.31; RT 60.37; charge +3; scan number 7301; Xrea 0.854.]

Figure S2. Overview of the main steps of the identification transfer workflow. Peptide sequences are first matched (based on amino acid sequences) and common annotated peptide features are used to assess orthogonality between the reference and sample datasets as described in Mitra et al.7 The retention time coordinates of the common annotated peptide features are used to correct the retention time of peptide features with a monotonic retention time correction function, followed by assessment of orthogonality between sample and reference chromatograms. Identification transfer is then performed between peptide features of the reference dataset having PSMs and unannotated peptide features of the sample dataset by matching peptide features based on m/z and retention time, with thresholds of 0.005 Da for m/z and 1 minute for retention time. The identification transfer is performed using the LOOCV procedure shown in Figure S3 and described in the section “4.2.4 Identification transfer” in the material and methods section of the manuscript.

[Figure S2 flowchart, identification transfer workflow: sample dataset (unidentified features, sample list) and reference dataset (identified features, reference list); match common peptide features between datasets and correct retention time for monotonic shift in sample chromatogram; assessment of orthogonality; transfer peptide identification with 0.005 Da and ≤ 1 minute thresholds; annotated peptides in sample list.]
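The matching step described in the Figure S2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the `Feature` class, the function name and the tie-breaking rule (closest retention time first, then closest m/z) are hypothetical, and retention times are assumed to have been corrected already. Only the tolerances (0.005 Da and 1 minute) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    mz: float                      # m/z of the peptide feature
    rt: float                      # retention time in minutes (already corrected)
    peptide: Optional[str] = None  # annotation, None if unidentified


def transfer_identifications(reference, sample, mz_tol=0.005, rt_tol=1.0):
    """Map indices of unannotated sample features to transferred peptides."""
    transferred = {}
    for index, feature in enumerate(sample):
        if feature.peptide is not None:
            continue  # already annotated, nothing to transfer
        candidates = [
            ref for ref in reference
            if ref.peptide is not None
            and abs(ref.mz - feature.mz) <= mz_tol
            and abs(ref.rt - feature.rt) <= rt_tol
        ]
        if candidates:
            # Assumed tie-break: closest retention time, then closest m/z.
            best = min(candidates,
                       key=lambda ref: (abs(ref.rt - feature.rt),
                                        abs(ref.mz - feature.mz)))
            transferred[index] = best.peptide
    return transferred


reference = [Feature(625.310, 60.4, "TTPPVLDSDGSFFLYSK")]
sample = [
    Feature(625.312, 60.9),                 # within both tolerances
    Feature(625.330, 60.4),                 # m/z difference too large
    Feature(994.020, 52.0, "PLACEHOLDER"),  # already annotated (dummy label)
]
result = transfer_identifications(reference, sample)
```

Only the first sample feature receives the reference annotation here; the second fails the m/z threshold and the third is skipped because it already carries an annotation.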

Figure S3. Scheme showing the error rate assessment of the identification transfer method using leave-one-out cross validation (LOOCV). In LOOCV, the peptide features with the same annotation in the reference and sample datasets (common peptide features) are divided into k (k = 5) subsets of equal size. Training subsets were constructed using 80% of the common peptide features (4 out of 5 data subsets) and a test subset included 20% of the common peptide features (1 out of 5 data subsets). The monotonic retention time shift is corrected in the sample set with a nonlinear monotonic LOWESS regression function, as described in Mitra et al.2, using common peptide features from the training set. Peak matching using all common peptide features was performed using different tolerances for m/z (0.005, 0.05, 0.10, 0.15 or 0.20 Da) and retention time (1.00, 3.25, 5.50, 7.75 or 10.00 minutes). Performance metrics, such as the FDR based on agreement and disagreement of peptide identity and the difference between the predicted and measured retention time of peptide features receiving annotation in the sample list, are calculated using the common peptide features in the test set. This procedure is repeated 5 times so that all common peptide features are included in a test set, and the complete training/test partitioning procedure was repeated 100 times.

[Figure S3 flowchart, leave-one-out cross validation, repeated 100 times: reference and sample datasets; select common annotated peptide features between datasets; at each kth iteration split into training set and test set; retention time correction function from the training set; correct retention time of peptide features in sample dataset; match all peptide features based on m/z and retention time coordinates; calculate FDR and retention time prediction error for the test set.]
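The partitioning scheme of Figure S3 (k = 5 subsets, each used once as test set, the whole procedure repeated 100 times, evaluated over a grid of m/z and retention time tolerances) can be sketched as below. The function names, the seeding and the FDR definition (fraction of matched test features whose peptide identity disagrees) are assumptions made for illustration; the retention time correction and peak matching steps themselves are not reproduced.

```python
import random
from itertools import product

# Tolerance grids quoted in the Figure S3 caption.
MZ_TOLERANCES = (0.005, 0.05, 0.10, 0.15, 0.20)    # Da
RT_TOLERANCES = (1.00, 3.25, 5.50, 7.75, 10.00)    # minutes


def repeated_kfold(n_items, k=5, repeats=100, seed=0):
    """Yield (train, test) index lists: k folds per repeat, reshuffled each repeat."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for test_fold in range(k):
            test = sorted(folds[test_fold])
            train = sorted(i for j, fold in enumerate(folds)
                           if j != test_fold for i in fold)
            yield train, test


def false_discovery_rate(n_agree, n_disagree):
    """Assumed definition: share of matches with disagreeing peptide identity."""
    total = n_agree + n_disagree
    return n_disagree / total if total else 0.0


splits = list(repeated_kfold(10, k=5, repeats=2))
tolerance_grid = list(product(MZ_TOLERANCES, RT_TOLERANCES))
```

Each repeat yields five training/test splits with an 80%/20% division, and the tolerance grid contains 25 combinations of m/z and retention time thresholds.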


Figure S4. Distributions (histogram) of Xrea scores of all MS/MS spectra (red trace) and MS/MS spectra that

have successful PSM annotation with PEAKS (blue trace).

[Figure S4 panels: kidney/bRP f14, f15 and f16; kidney/gel f14, f15 and f16; esophagus/bRP f14, f15 and f16; esophagus/gel f14, f15 and f16. x axis: Xrea; y axis: Counts; legend: MS2 Xrea, PSMs Xrea.]


Figure S5. Scatter plot of Xrea scores and log10 precursor ion intensity of unidentified MS/MS spectra (green dots),

MS/MS spectra identified using PEAKS (blue dots) and MS/MS spectra that were not identified with PEAKS, but

obtained identification with identification transfer (red dots). The marginal distributions are obtained for Xrea scores

(top-panel) and log10 precursor ion intensity (right panel) for each precursor ion class as described above.


Supporting information for chapter 5

Materials and methods

LC-MS/MS (QTOF)

For the analysis of 40 depleted serum samples, the HPLC equipment and elution program were identical to those of the LC-MS analyses on the iontrap (see section 1.4.1. in the main manuscript), while MS/MS analysis was performed using a quadrupole time-of-flight mass spectrometer (QTOF, Agilent 6510). Data-dependent LC-MS/MS analysis was performed in 2 GHz extended dynamic range mode, collecting 3 MS/MS spectra in one duty cycle, with the following additional parameters: mass range: 275-2000 m/z; acquisition duty cycle: 1 spectrum/sec; data storage: profile and centroid mode; fragmentor: 175 V; skimmer: 65 V; OCT 1 RF Vpp: 750 V; spray voltage: ~1900 V; drying gas temperature: 325 °C; drying gas flow (N2): 6 L/min. Mass correction was performed during analysis using internal standards of 371.31559 m/z (originating from a ubiquitous background ion, the plasticizer dioctyl adipate, DOA) and 1221.990637 m/z (HP-1221 calibration standard, evaporating from a wetted wick inside the spray chamber).

Data pre-processing and quantification

Figure 1 shows the workflow that illustrates the steps in the analysis of the experimental design

data. Raw iontrap single stage LC-MS datasets were obtained in Bruker Daltonics HPLC-MS.dat

format which was further converted into the mzXML proteomics standard format using the msConvert

tool from the ProteoWizard toolset1,2. The Threshold-Avoiding Proteomics Pipeline (TAPP)3 was

used to extract chromatographic peaks from the raw data and for data pre-processing. Centroid data

were smoothed and reduced using a normalized two dimensional Gaussian filter with a peak

resolution in m/z dimension of σm/z = 0.3 m/z. This parameter was obtained by optimizing the peak

detection quality upon visual inspection of one chromatogram (see Figure S5). Smoothing low-

resolution single stage iontrap data with σm/z = 0.3 m/z and σrt = 0.5 minutes using a 2-dimensional

Gaussian filter results in one peak without isotopic resolution for each peptide isotope cluster for a

given charge state. The non-linear retention time shifts between LC-MS peak lists were corrected

using Warp2D, which is a tool based on Correlation Optimised Warping (COW). To find the best

reference chromatogram, all possible pairwise combinations of alignments were performed resulting

in a total of 256 pairwise alignments using distributed grid computing with the Data Analysis

Framework (DAF)4. The parameters for time alignment were as follows: retention time width: 0.5

minutes; m/z width 0.3 Da; maximal retention time difference 0.6 minutes; maximum m/z difference

0.6 Da; windows size: 50 points; slack parameter: 10 points; maximal number of peaks/segment: 50;

number of total time points: 2 000; constant retention time shift: 0 min. The raw data was analysed

for peaks with 100-1 500 m/z and 65-135 minutes of retention time. Every alignment combination

with Warp2D produced a quality score between 0 (no alignment and/or no peak list similarity) and 1


(perfect alignment and peak list similarity). The peak list with the highest geometrical mean of the

sum of overlapping peak volumes normalized to the sum of peak volumes of the two individual

chromatograms after warping to all combinations was selected as the optimal reference (sample ID

16090525). All other peak lists were aligned and corrected to this reference and used for further

processing. Corresponding peaks in multiple chromatograms were matched with the MetaMatch

module of the TAPP pipeline using the following parameters: delta m/z: 0.3 Da; delta retention time:

0.5 minutes; minimal fraction of class occupancy of peaks: 0.50, meaning that a matched peak was

retained if the peak was identified in a minimum of 8 out of the 16 analysed samples. Isolated peaks

that did not belong to a peak cluster were removed as “orphaned” peaks. Finally a quantitative peak

matrix containing the intensity, average m/z ratio and average retention time information of 2 559

common peaks was obtained and used in the following statistical analysis. The resulting quantitative

peak matrix contained 13,492 zeros, corresponding to 32.95% of the total number of peaks. The

intensity of the orphaned peaks was considered as representative of the noise level. Gaussian

mixture curves were fitted to the natural logarithm of the intensity of the orphaned peaks. This

analysis resulted in a normal distribution for the noise, N(µ=6.042, σ=0.5391). This distribution was

used to noise-fill the zeros in the peak matrix.
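The noise-filling step can be illustrated with a short sketch. The distribution parameters N(µ=6.042, σ=0.5391) are taken from the text and refer to the natural logarithm of the orphaned peak intensities. Everything else is an assumption: the function names are hypothetical, and the matrix is assumed to hold intensities on the linear scale, so each draw is exponentiated before it replaces a zero. The second helper only re-checks the reported share of zeros (13,492 zeros in a 2,559 × 16 matrix).

```python
import math
import random

NOISE_MU = 6.042       # mean of ln(intensity) of the orphaned peaks (from the text)
NOISE_SIGMA = 0.5391   # standard deviation of ln(intensity) (from the text)


def noise_fill(matrix, mu=NOISE_MU, sigma=NOISE_SIGMA, seed=0):
    """Replace zeros by draws from the log-normal noise model (assumed linear scale)."""
    rng = random.Random(seed)
    return [
        [math.exp(rng.gauss(mu, sigma)) if value == 0 else value for value in row]
        for row in matrix
    ]


def zero_fraction(n_zeros, n_rows, n_cols):
    """Fraction of matrix entries that are zero."""
    return n_zeros / (n_rows * n_cols)


filled = noise_fill([[0, 10.0], [5.0, 0]])
percent_zeros = round(100 * zero_fraction(13492, 2559, 16), 2)
```

Measured values are left untouched and every zero becomes a positive noise-level intensity; the computed share of zeros agrees with the 32.95% stated above.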

Annotation of the aligned peak matrix

All depleted human serum samples for the experimental design were analysed in single stage (MS1)

mode, therefore no identification of peaks with MS/MS spectra was possible. In order to annotate

the quantitative peak matrix obtained with the TAPP pipeline, we used the data of 40 depleted human

serum samples obtained with the same sample preparation and analysed by a QTOF instrument

using the same LC columns and elution conditions. The obtained LC-MS/MS data were identified with the PEAKS 7.5 database search tool5 using the following parameters: database: Uniprot (July 22, 2015); parent mass error tolerance: 50.0 ppm; fragment mass error tolerance: 0.05 Da; precursor mass search type: monoisotopic; enzyme: trypsin; max missed cleavages: 2; non-specific cleavage: one; variable modifications: Oxidation (M); max variable PTM per peptide: 3; searched entries: 295,778; MS/MS quality filter: >0.65; FDR (peptide-spectrum matches): 0.1%; FDR (peptide sequences): 1.0%; FDR (protein): 0.0%. Identifications passing these FDR thresholds, determined with the reverse decoy approach, were retained and used for annotation transfer. The dataset contained 356,633 MS/MS scans, and the database search resulted in 174,331 peptide-spectrum matches, 1,884 identified unique peptide sequences, 2,417 unique peptides of different charge and oxidation states, 106 protein groups and 229 identified proteins. The FDR for PSMs, peptides and proteins was <1%. The iontrap data was identified using the same

parameters except for the following: parent mass error tolerance: 0.5 Da; fragment mass error

tolerance: 0.3 Da. The search for iontrap resulted in 334 peptide-spectrum matches, 163 unique

peptide sequences, 183 unique peptides of different charge and oxidation states, 33 protein groups

and 113 proteins. Since the iontrap LC-MS data used for quantification was obtained with a different

instrument than the QTOF LC-MS/MS data used for peptide identification it was necessary to check

if the elution order of common peaks in the two analysis batches is the same6. Two depleted human


serum samples (sample IDs 30B and 29B) were analysed in QTOF and in the iontrap using MS/MS

mode. A database search of these two analyses resulted in 181 unique peptides of different charge

and oxidation states identified in both Q-TOF and iontrap datasets. The analysis of two different

samples with QTOF and iontrap instruments separately resulted in 470 and 27 MS/MS with common

identification respectively (plots 3 and 1 in Figure S6 in supporting information). Due to the low

number of identified common peptides it was not possible to apply our quality control method

assessing peak elution order inversion6, however visual inspection of the scatter plot of the MS/MS

spectra and the calculated Dmax (Figure S6 in supporting information) showed slight orthogonality of

the separation despite the fact that the same liquid chromatography system, elution condition and

column were used7. This setup allows the peptide identifications from the 40 QTOF MS/MS files to be transferred, with a minimal error of 2 minutes as determined by Dmax, to annotate the peaks in the single stage MS profile of the experimental design dataset; however, the results should be considered with care. The scheme of the main steps of identification transfer is presented in Figure S7

(supporting information). Using the derived non-linear retention time correspondence function the

retention times of 2,417 unique peptides with different charge and oxidation states in the QTOF

datasets were aligned to the iontrap LC-MS dataset. The corrected retention time coordinates of

QTOF’s unique peptides were used to annotate peaks corresponding to peptides in the pre-

processed iontrap experimental design LC-MS dataset. In the annotation we allowed 0.85 m/z and

3.5 min of retention time difference between m/z and the retention time of peptide identifications and

peak position in the matched single stage experimental design dataset. The matching procedure

resulted in 629 peaks annotated with peptide and protein identification. For the annotation of peaks

most affected by significant factors in Figure 3 and Table S2 we have used protein names

corresponding to SwissProt identifiers.

Parameters of the simulated dataset

We simulated a data matrix X with the same dimensions as the data matrix X obtained from the experimental design LC-MS dataset, with the goal of assessing the performance of ASCA in identifying significant pre-analytical factors and the peaks affected by them, and in relation to peak selection using Volcano plot parameters (t-test p-value and fold change).

The X matrix of dimension (2,559 × 16) was obtained as follows: of the seven pre-analytical factors, three (factors 1, 3 and 5) were constructed to have a significant effect on 5%, 3.5% and 5% of randomly chosen peaks, with a mean difference in peak intensity between the two factor levels of 3, 4 and 6, respectively. Any peak not affected by a factor was sampled from the noise distribution found in the experimental design data, a normal distribution N(µ=6.042, σ=0.539). The peak intensities obtained with this approach for the seven factors were averaged, and the outcome was used as the simulated data matrix X for ASCA analysis.


To test for possible overfitting, we simulated a completely random data matrix X 15 times with the same dimensions as the experimental design dataset, where all factors were non-significant and no peak was affected by any of the factors. During the simulation we analysed the complete simulated data matrix and the matrix obtained after Volcano-based filtering, using the same set of thresholds as used during the assessment of ASCA performance with simulation. Figures from these analyses are available in the file ParameterOptFactdesRandom.pdf submitted as supporting information.

Main steps requiring bioinformatics intervention

Three steps require bioinformatics intervention: 1) planning the experimental design, providing the level distribution of the different factors, which can be performed with the MODDE software; 2) data pre-processing of LC-MS/MS data, resulting in a table that contains quantitative information on compounds for all samples designed in point 1, which can be performed with any single-stage LC-MS/MS processing workflow such as TAPP3, OpenMS6, mzMine7 or maxQuant8; and 3) the ASCA analysis (matlab script provided at https://github.com/vikrammitra/ASCA).

Supplementary figures for chapter 5

Figure S1. Volcano plots of the matched peak matrices using the low and high levels of seven factors, leading to 7 × 2,559 dots: a) simulated dataset, b) experimental design dataset. In the simulated dataset, peaks sampled with different means between the high and low levels of factors 1, 3 and 5 are shown with +, while all other peaks are represented by dots. The factors are represented by differently colored symbols. Peaks selected for ASCA analysis using thresholds of 2 for the fold ratio change and 0.05 for the t-test significance are encircled.
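The selection thresholds in the caption translate to the plot coordinates as log2(2) = 1 and -log10(0.05) ≈ 1.301. A minimal sketch of such a Volcano-type filter is given below; the function name, the input layout (one fold ratio and one t-test p-value per peak) and the inclusive comparisons are assumptions made for illustration, and the t-test itself is not recomputed here.

```python
import math


def volcano_select(fold_ratios, p_values, fold_threshold=2.0, p_threshold=0.05):
    """Return indices of peaks passing both the fold ratio and p-value thresholds."""
    log2_cut = math.log2(fold_threshold)         # 1 for a fold ratio of 2
    neg_log10_cut = -math.log10(p_threshold)     # about 1.301 for p = 0.05
    selected = []
    for index, (ratio, p) in enumerate(zip(fold_ratios, p_values)):
        # Up- and down-regulated peaks are treated symmetrically on the log2 scale.
        if abs(math.log2(ratio)) >= log2_cut and -math.log10(p) >= neg_log10_cut:
            selected.append(index)
    return selected


# Hypothetical fold ratios (high level / low level) and p-values for four peaks.
chosen = volcano_select([4.0, 1.5, 0.25, 3.0], [0.001, 0.001, 0.01, 0.2])
```

In this toy input the first and third peaks are selected: the second has too small a fold ratio and the fourth is not significant.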


[Figure S1 panels: a) simulated dataset (legend: Factor 1 to Factor 7); b) experimental design dataset (legend: blood collection tube, haemolysis, clotting time, freeze-thaw cycle, trypsin digestion, stopping trypsin, sample stability). Both panels: x axis, log2(fold change); y axis, -log10(p-value); title, Peak selection with -log10(p-value): 1.301 and log2(fold ratio): -1 and 1.]


Figure S2. Surface plots showing the value of SSQ (z axis) as a function of log2(fold change) and -log10(p-value) t-test

significance thresholds for the main effects in the simulated data set. Factors 1, 3 and 5 contained peaks affected by

the factors, while the other factors have no effect on any of the peaks.

[Figure S2 panels a) to g).]


Figure S3. Surface plots showing the factor’s ASCA variance significance (SSQ, z axis) as a function of log2(fold change) and -log10(p-value) t-test significance thresholds for the main effects in the simulated data set. Factors 1, 3 and 5 contained peaks affected by the factors, while the other factors have no effect on any of the peaks. SSQ significance values were set to 0.001, the lowest value that occurred in the permutation test.

[Figure S3 panels a) to g).]


Figure S4. Surface plots showing the recall (a), precision (b), g-score (c), f-score (d) and log10 number of selected variables (e) as a function of log2(fold change) and -log10(p-value) t-test significance thresholds for a simulated data matrix.

[Figure S4 panels a) to e).]


Figure S5. Optimisation of the smoothing parameter for optimal quantitative pre-processing using the TAPP pipeline.

The effect of parameters σrt and σm/z (half the peak width at the inflection point of the Gaussian distribution) of the 2-

dimensional Gaussian smoothing procedure on the noise content of single-stage MS data in Grid module of TAPP

pipeline. The value of σ in the retention time dimension (σrt) was 1 minute and the value of σ in the m/z dimension (σm/z)

was varied with 0.1, 0.25, 0.3 and 0.5 Da. The 0.3 Da σm/z provided the optimal settings for peak detection, without

missing peaks (too much smoothing) or peak splitting (too much noise). These settings smoothed out isotopic clusters

of one peptide with one charge state resulting in one Gaussian peak in the retention time, m/z and ion count space.

[Figure S5 panels: 0.1 Da m/z, 1 min rt; 0.25 Da m/z, 1 min rt; 0.3 Da m/z, 1 min rt; 0.5 Da m/z, 1 min rt.]


Figure S6. Scatterplots of identical MS/MS identifications in two chromatograms acquired with 1) ion trap (samples 30A and 30B), 2) ion trap and QTOF (sample 30B) and 3) QTOF (samples 29B and 30B). The black lines show the values of Dmax (0.50, 1.67 and 2.00, respectively), while the red lines correspond to the main retention time correspondence trend6.

[Figure S6 axes (retention time in minutes): plot 1, iontrap sample 30B (x) against iontrap sample 30A (y); plot 2, QTOF sample 30B (x) against iontrap sample 30B (y); plot 3, QTOF sample 29B (x) against QTOF sample 30B (y).]


Figure S7. Main steps of the peptide identification transfer used to annotate the quantitative peak matrix obtained from the 16 LC-MS iontrap datasets acquired to study the effect of pre-analytical factors on the depleted human serum peptide profile in the experimental design study. 2 iontrap and 40 QTOF LC-MS/MS files were subjected to peptide-spectrum match identification using the PEAKS database search tool. These datasets were combined using the retention time correspondence function obtained from the retention times of identical MS/MS spectra in the two datasets, aligning the identifications of the 40 QTOF files to the aligned retention time domain of the 2 iontrap LC-MS/MS datasets. The 2 iontrap datasets were aligned to the best reference chromatogram of the 16 iontrap LC-MS chromatograms, which allowed the transfer of the combined identifications from the 40 QTOF and 2 iontrap LC-MS/MS files by finding the highest peaks within a window of 3.5 min in retention time and 0.85 m/z.

Truth by Methods      Selected peaks         Not selected peaks
Affected peaks        True Positive (tp)     False Negative (fn)
Not affected peaks    False Positive (fp)    True Negative (tn)

Table S1. Confusion table. The columns correspond to features as predicted by a given method, while the rows correspond to the actual class of the features. Adapted from Christin et al.9

[Figure S7 flowchart: 40 QTOF LC-MS/MS analyses and 2 iontrap LC-MS/MS analyses (same sample, different instrument; same instruments, different samples); combined peptide identification (1% FDR); 16 LC-MS iontrap datasets; data pre-processing (TAPP); identification transfer; annotated quantitative peak matrix used in factorial design.]


Measure                                            Equation
Sensitivity = Recall = True Positive Rate (TPR)    $\mathrm{TPR} = \frac{tp}{tp + fn}$
Precision                                          $\mathrm{precision} = \frac{tp}{tp + fp}$
Specificity = True Negative Rate (TNR)             $\mathrm{TNR} = \frac{tn}{tn + fp}$
Geometric Mean Accuracy (g-score)                  $\sqrt{\mathrm{TPR} \cdot \mathrm{TNR}}$
f-score                                            $\frac{(1 + \beta^{2}) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}}$

Table S2. Definition of the scores that were used to compare the performance of different feature selection methods. In this manuscript the value of β in f-score calculation is 1. Adapted from Christin et al.9
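The scores of Table S2 follow directly from the four counts of the confusion table (Table S1). The sketch below evaluates them for a hypothetical set of counts; the function name, the returned dictionary layout and the convention of returning 0 for an empty denominator are assumptions made for illustration.

```python
import math


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def selection_scores(tp, fn, fp, tn, beta=1.0):
    """Compute recall, precision, specificity, g-score and f-score from counts."""
    recall = _ratio(tp, tp + fn)         # sensitivity, true positive rate
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)    # true negative rate
    g_score = math.sqrt(recall * specificity)
    f_score = _ratio((1 + beta ** 2) * precision * recall,
                     beta ** 2 * precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "g_score": g_score,
        "f_score": f_score,
    }


# Hypothetical counts: 10 affected peaks, 8 of them selected, 2 false selections.
scores = selection_scores(tp=8, fn=2, fp=2, tn=88)
```

With β = 1, as used in the manuscript, the f-score is the harmonic mean of precision and recall, so equal precision and recall of 0.8 give an f-score of 0.8.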


Factor               Peak rank   Peptide sequence                 Protein name
Haemolysis           1           VADALTNAVAHVDDMPNALSALSDLHAHK    Hemoglobin subunit alpha (P69905)
Haemolysis           2           FFESFGDLSTPDAVMGNPK              Hemoglobin subunit beta (P68871)
Haemolysis           3           VLGAFSDGLAHLDNLK                 Hemoglobin subunit beta (P68871)
Haemolysis           4           VADALTNAVAHVDDMPNALSALSDLHAHK    Hemoglobin subunit alpha (P69905)
Haemolysis           5           VGFYESDVMGR                      Alpha-2-macroglobulin (P01023)
Haemolysis           6           AIGYLNTGYQR                      Alpha-2-macroglobulin (P01023)
Haemolysis           7           HVIILMTDGLHNM(Ox)GGDPITVIDEIR    Complement factor B precursor (P00751)
Haemolysis           9           FVTWIEGVM(Ox)R                   Plasminogen (P00747)
Haemolysis           10          FFESFGDLSTPDAVMGNPK              Hemoglobin subunit beta (P68871)
Trypsin digestion    1           KFPSGTFEQVSQLVK                  Vitamin D-binding protein (P02774)
Trypsin digestion    4           EQLGPVTQEF                       Apolipoprotein A-I (P02647)
Trypsin digestion    6           AEAESLYQSK                       Keratin, type II cytoskeletal 1 (P04264)
Trypsin digestion    7           FVELTMPYSVIR                     Alpha-2-macroglobulin (P01023)
Trypsin digestion    8           PSLVPASAENVNK                    Inter-alpha-trypsin inhibitor heavy chain H4 (Q14624)
Trypsin digestion    10          ILTVPGHLDEM(Ox)QLDIQAR           Complement C4-A and B (P0C0L5)
Stopping trypsin     3           VVNNSPQPQNVVFDVQIPK              Inter-alpha-trypsin inhibitor heavy chain H2 (P19823)
Stopping trypsin     7           YFKPGMPFDLMV                     Complement C3 (P01024)
Stopping trypsin     9           DFVQPPTK                         Kininogen-1 (P01042)

Table S3. Peptide sequences and protein names of the most discriminating, annotated peaks for the 3 factors that affect depleted human serum peptide profiles. The peak rank reflects the discriminating rank of the peak according to the average absolute ASCA loadings obtained with 100 repetitions of the ASCA analysis as displayed in the bar diagrams of Figure 3. The protein name reflects the occurrence of the peptides in SwissProt entries.


References

(1) Holman, J. D.; Tabb, D. L.; Mallick, P. Current protocols in bioinformatics / editoral board, Andreas D. Baxevanis ... (et al.) 2014, 46, 13 24 11-19.

(2) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. Bioinformatics 2008, 24, 2534-2536.

(3) Suits, F.; Hoekman, B.; Rosenling, T.; Bischoff, R.; Horvatovich, P. Analytical Chemistry 2011, 83, 7786-7794.

(4) Ahmad, I.; Suits, F.; Hoekman, B.; Swertz, M. A.; Byelas, H.; Dijkstra, M.; Hooft, R.; Katsubo, D.; van Breukelen, B.; Bischoff, R.; Horvatovich, P. Bioinformatics (Oxford, England) 2011, 27, 1176-1178.

(5) Zhang, J.; Xin, L.; Shan, B.; Chen, W.; Xie, M.; Yuen, D.; Zhang, W.; Zhang, Z.; Lajoie, G. A.; Ma, B. Molecular & cellular proteomics : MCP 2012, 11, M111 010587.

(6) Zwanenburg, G.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Jansen, J. J.; Smilde, A. K. Journal of Chemometrics 2011, 25, 561-567.

(7) Mitra, V.; Smilde, A.; Hoefsloot, H.; Suits, F.; Bischoff, R.; Horvatovich, P. Journal of chromatography. A 2014, 1373, 61-72.

(8) Cox, J.; Mann, M. Nature biotechnology 2008, 26, 1367-1372.

(9) Christin, C.; Hoefsloot, H. C.; Smilde, A. K.; Hoekman, B.; Suits, F.; Bischoff, R.; Horvatovich, P. Mol Cell Proteomics 2013, 12, 263-276.