
BREAKTHROUGH TECHNOLOGIES

MVApp—Multivariate Analysis Application for Streamlined Data Analysis and Curation1[OPEN]

Magdalena M. Julkowska,a,2,3 Stephanie Saade,a Gaurav Agarwal,b Ge Gao,a Yveline Pailles,a

Mitchell Morton,a Mariam Awlia,a and Mark Testera

aBiological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia
bComputer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia

ORCID IDs: 0000-0002-4259-8296 (M.M.J.); 0000-0003-3817-7234 (S.S.); 0000-0001-6353-0955 (G.A.); 0000-0001-5488-8648 (G.G.); 0000-0003-2283-2759 (Y.P.); 0000-0001-8375-7954 (M.M.); 0000-0002-9372-0933 (M.A.); 0000-0002-5085-8801 (M.T.).

Modern phenotyping techniques yield vast amounts of data that are challenging to manage and analyze. When thoroughly examined, this type of data can reveal genotype-to-phenotype relationships and meaningful connections among individual traits. However, efficient data mining is challenging for experimental biologists with limited training in curating, integrating, and exploring complex datasets. Additionally, data transparency, accessibility, and reproducibility are important considerations for scientific publication. The need for a streamlined, user-friendly pipeline for advanced phenotypic data analysis is pressing. In this article we present an open-source, online platform for multivariate analysis (MVApp), which serves as an interactive pipeline for data curation, in-depth analysis, and customized visualization. MVApp builds on the available R-packages and adds extra functionalities to enhance the interpretability of the results. The modular design of the MVApp allows for flexible analysis of various data structures and includes tools underexplored in phenotypic data analysis, such as clustering and quantile regression. MVApp aims to enhance findable, accessible, interoperable, and reproducible data transparency, streamline data curation and analysis, and increase statistical literacy among the scientific community.

Advances in data acquisition methods have enabled the rapid collection of a vast range of multivariate biological datasets from plants, animals, single-cell systems, and more (Marx, 2013). The development of next-generation high-throughput phenotyping platforms in particular has led to data-rich experimental outputs that capture many physiological processes simultaneously across large populations and under various conditions through time. And yet, the usefulness of these emerging technologies is limited by high operational costs of commercial phenotypic platforms and the ability of researchers to curate, integrate, and explore these complex outputs (Howe et al., 2008). The appropriate interpretation of phenotypic data often requires expertise in programming (R, Python, MATLAB) and statistics or access to expensive statistical software, such as SPSS or JMP. Although great efforts have been made to develop open-source data curation and analysis platforms for exploring RNA sequencing (McCarthy et al., 2017) and metabolomics data (Xia et al., 2015), the platforms for large phenomic datasets are not as advanced. Although the interactive tools used for genome-wide association studies (Seren et al., 2012) include a brief section on analyzing phenotypic data, the scope of such analysis is not comprehensive. While the software typically provided with the commercial phenotyping platform provides an insight on the observed trends during the experiment, the underlying code is not available, making it inflexible and oftentimes not suitable for various experimental designs. Additionally, most off-the-shelf tools lack modules crucial for multivariate analysis, such as principal component analysis (PCA), multidimensional scaling (MDS), or regression models, and none of them provides a standardized pipeline for outlier identification. Yet, standardized methods of data curation, processing, and analysis before publication are increasingly necessary as the scientific community strives for findable, accessible, interoperable and reproducible (FAIR) data (Wilkinson et al., 2016; Reiser et al., 2018). Such optimization of data processing tools and protocols will not only accelerate and standardize data exploration and visualization but will also promote better reproducibility and transparency of data curation and analysis. Tools for high-throughput phenotyping, like HTPmod (Chen et al., 2018), are gradually providing easier access to the tools for data analysis, but integrating the community input, keeping the code open-source, and

1This work was supported by King Abdullah University of Science and Technology (baseline support to M.T. and Office of Sponsored Research Award No. 2302).

2Author for contact: [email protected]
3Senior author.
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Magdalena M. Julkowska ([email protected]).

[OPEN]Articles can be viewed without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.19.00235

Plant Physiology®, July 2019, Vol. 180, pp. 1261–1276, www.plantphysiol.org © 2019 American Society of Plant Biologists. All Rights Reserved.



keeping those tools up-to-date is crucial for sustainable and transparent data analysis.

To contribute toward this goal, we developed an open-source online platform called “MVApp,” an interactive and modular data curation and in-depth multivariate analysis pipeline. MVApp provides a comprehensive toolkit for the thorough exploration of diverse data structures, where the effect of multiple independent variables, including genotype, treatment, and time, can be examined. MVApp aims to help scientists and research staff with limited knowledge of “R” programming and statistics to curate their data in a standardized manner and to perform comprehensive and robust statistical tests within minutes. We developed MVApp in “R” using the “Shiny” framework (Chang et al., 2017), exploiting R’s computational potential by integrating various packages into a graphical user-interface. Examples of previously developed Shiny applications are abundant, including: box plot generators (Spitzer et al., 2014), RNA sequencing data analysis pipelines (Li and Andrade, 2017; Nelson et al., 2017), genomic prediction simulators (Morota, 2017), growth modeling (Chen et al., 2018), and tools for teaching plant breeding (Matias et al., 2018) or business analytics (Nijs, 2018). MVApp not only combines the standard methods for multivariate analysis but also helps with data interpretation. By following the comprehensive README (https://mmjulkowska.github.io/MVApp/), first-time users can decide which modules are of interest for their datasets. Further guidance is provided by in-app suggestions, which advise on decisions between specific methods (e.g. type of curve to fit, number of clusters to use) to apply for their datasets. The MVApp provides the output of the analytical tools and statistical tests along with guidelines for interpretation, e.g. what can be inferred from a statistical test based on the null hypothesis and resulting P value. Additionally, the R-commands used to make individual parts of the analysis can be viewed for each section, ensuring the full transparency of the analysis. To make our platform accessible and safe to use, the data uploaded to MVApp (http://MVApp.kaust.edu.sa/MVApp) are not saved on the server; the input is deleted when the session is closed.

In this article, we showcase the possibilities of MVApp using a dataset that includes the phenotypic data of nine Arabidopsis (Arabidopsis thaliana) accessions grown under either control or salt-stress conditions over the course of 8 d (Awlia et al., 2016). The dataset (Supplemental Dataset S1) was collected using the high-throughput phenotyping PlantScreen conveyor system (Photon Systems Instruments), which captured seven plant morphology traits that describe the rosette size and architecture, as well as 43 chlorophyll fluorescence traits, reflecting plant photosynthetic activity (Supplemental Dataset S2). In total, the dataset contains 50 phenotypes scored for 160 individual plants across seven time points. This dataset exemplifies data structures containing up to three independent variables. A similar analysis could be employed to examine natural diversity panels consisting of hundreds of genotypes, to compare the phenotypes of mutant and wild-type lines, or to explore phenotype-to-phenotype relationships across different genetic backgrounds. The spatial variation in individual phenotypes can also be inspected, though MVApp does not yet provide a spatial correction feature. The modular design of the MVApp ensures that other modules can be added in the future, dependent on community involvement, contributions, and feedback (https://github.com/mmjulkowska/MVApp/blob/master/CONTRIBUTING.md). We hope that the widespread use of MVApp will increase statistical literacy among early career peers, accelerate scientific discovery, and ensure transparency of data processing.

RESULTS

Dynamic Response Characterization

High-throughput phenotyping platforms allow for the nondestructive screening of plants and, therefore, the characterization of their dynamic temporal responses, such as growth or changes in photosynthetic efficiency. These dynamic responses can be summarized by fitting different types of curves (linear, quadratic, exponential, and square root) and polynomial functions to changes in the measured trait values. Here, we use the increase in rosette area as an example (Fig. 1). The simple functions (linear, quadratic, exponential, and square root) are fitted using a linear model through the data points. For quadratic, exponential, and square root functions, the phenotype (e.g. rosette area) is transformed using squaring, natural logarithm, or square root, respectively, before fitting the linear model. The linear fit for each sample can be examined using the fit-plot tab in MVApp (Fig. 1A). Based on the mean regression coefficient (R²) of the curve, which is calculated for each individual sample, MVApp suggests the model with the best fit. Polynomial functions can also be fitted to the data points using cubic and smoothed splines. Because polynomial functions have high R² values compared to the other models, they are not included in the best model-fit suggestions. In our example, we evaluate the increasing rosette area in the nine Arabidopsis accessions described above. We found that a quadratic function fit the dataset most accurately (R² values of 0.9711 and 0.913 for the control and salt-treated plants, respectively), while the exponential model had the second-best fit (R² values of 0.9668 and 0.907 for the control and salt-treated plants, respectively; Supplemental Dataset S3).
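The transform-then-fit procedure described above can be illustrated with a minimal sketch. It is not MVApp's R code: it is a plain-Python illustration that fits an ordinary least-squares line to each sample after each candidate transformation and ranks the transformations by mean R² across samples. The function names, the transformation labels, and the example growth values are hypothetical.

```python
import math

def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (intercept, slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Transformations applied to the trait before the linear fit.
# The labels name the transformation itself and are illustrative only.
TRANSFORMS = {
    "none": lambda v: v,
    "square": lambda v: v ** 2,
    "log": math.log,
    "sqrt": math.sqrt,
}

def best_transform(samples):
    """samples: {sample_id: (times, trait_values)}.
    Returns (name of the best transformation, {name: mean R^2 over samples})."""
    mean_r2 = {}
    for name, f in TRANSFORMS.items():
        r2s = [linear_fit_r2(t, [f(v) for v in y])[2] for t, y in samples.values()]
        mean_r2[name] = sum(r2s) / len(r2s)
    return max(mean_r2, key=mean_r2.get), mean_r2

# Hypothetical rosette areas that grow exponentially over eight days.
days = [0, 1, 2, 3, 4, 5, 6, 7]
samples = {
    "plant_1": (days, [10 * math.exp(0.30 * d) for d in days]),
    "plant_2": (days, [12 * math.exp(0.25 * d) for d in days]),
}
best, scores = best_transform(samples)
print(best, scores)
```

Because the example data are exactly exponential, the log transformation gives a perfectly linear relationship and is ranked first. With real measurements the ranking depends on the data, and the per-sample R² values can also be inspected individually.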

Curve fitting can also be used as a data curation method in which the user excludes samples whose R² is below a chosen threshold (R² < 0.7 is the default option). All of the fitted curves can be compared to the original data points and viewed on a fit plot to identify







samples that do not follow the expected dynamics observed in the experiment (Fig. 1B). For example, in our dataset, we identified seven samples with R² values < 0.7 (Supplemental Dataset S4). By examining pictures of those specific samples, we realized that those plants had died during the experiment. Therefore, those samples were removed from the curated dataset for subsequent analyses.

The MVApp offers the option to view summary statistics based on the fitted curves. The user can choose whether the summary statistics are calculated for all the data points or for the data curated based on the R²-value threshold. The summary of the growth dynamics can be viewed as box plots, violin plots, dot plots, or bar plots, where the significant differences between user-defined groups (e.g. based on genotype and treatment) are tested using one-way ANOVA and Tukey’s honestly significant difference (HSD) pairwise-comparison tests. For our dataset, we observed that the treatment, the genotype, and their interaction had significant effects on the rosette growth rate (Fig. 1C). “Col-0” and “Te” were the fastest growing accessions under control conditions, while the highest growth rate was observed in “Rsch” and “C24” under the salt-stress conditions. “Cvi” was the slowest growing accession under both the control and salt-stress conditions. Including the seven samples with R² values below the threshold level of 0.7 did not significantly affect the observed trends (Supplemental Fig. S1).

Outlier Identification

Data curation before any analysis is crucial but is oftentimes a lengthy, inconsistent, and manual process with a distinct paucity of standardized guidelines. To ensure consistency between individual experiments performed across different labs, methods used for data curation require standardization. MVApp accelerates and standardizes the curation of data by highlighting possible outliers based on user-selected grouping variables, such as treatment, genotype, and time point. These outliers are identified using various detection methods (Chaloner and Brant, 1988; Díaz-García and González-Farías, 2004; Leys et al., 2013), which include the interquartile range (1.5 × IQR), the Cook’s distance, the Bonferroni outlier test, or the SD from the median, based on one or multiple measured traits. If the input data contain a time element or a gradient series suitable for curve-fitting, then a low regression coefficient (indicating a poor fit of the trend line) can be used as an additional criterion for outlier selection. In our example, all traits were used for data curation. Samples were identified as outliers when the values for at least 12 measured traits from that specific sample were outside of the 1.5 × IQR. In this manner, 28 outliers were identified and removed from the curated dataset (Supplemental Dataset S5).

The graphical overview in the outlier removal tabs in MVApp allows the user to visualize the data both with and without the outliers to observe the effects of the selected outlier identification method. We use the trait

Figure 1. Dynamic responses summarized by using MVApp. The dynamic data can be analyzed in MVApp by transforming the phenotypic data and fitting linear models. A, The phenotype undergoes square root, quadratic, or natural logarithmic transformation, or no transformation at all before fitting a linear model. The model with the best fit in terms of the mean regression coefficient (R²) value is suggested to the user. B, The curve-fitting feature can be used to select outliers based on the R² value. The user defines the R² threshold and examines the fit of the model to individual samples, which are listed in order of their R² value. C, A summary of the growth dynamics can be viewed for the predefined groups. The ANOVA test, complemented by a pairwise Tukey’s HSD test, is used to examine significant differences among the predefined groups. Bars = average growth rate fitted with the quadratic function, as calculated from nine replicates. Error bars = SE. The letters indicate significantly different groups, as identified by the Tukey’s HSD pairwise test, with a P value threshold of 0.05. The figure was created using data curated based on the fit of the quadratic function to the observed data. The samples with R² < 0.7 were eliminated from the dataset.







rosette perimeter as an example (Fig. 2). The effects of outlier removal were particularly visible in the “Te” accession, where both the variance among individual time points and the number of data points outside the normal distribution range were visibly reduced. For other traits, such as nonphotochemical quenching (φ[NPQ]) or maximal quantum yield in the light-adapted state (Fv′/Fm′; Supplemental Fig. S2), the effects of outlier removal were not as pronounced, with only very extreme data points being removed. As the removal of outliers can result in the loss of important information, we strongly encourage the user to inform their decisions concerning outlier removal and data curation by reviewing their experimental notes, raw data, and images.
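The multi-trait 1.5 × IQR rule used in this section can be sketched as follows. This is a simplified plain-Python illustration rather than MVApp's R implementation. It assumes that quartiles are computed with linear interpolation (Python's "inclusive" method) and that flagging is done within a single grouping (one genotype, treatment, and time point). The function names and the example values are hypothetical.

```python
from statistics import quantiles

def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < lo or v > hi for v in values]

def outlier_samples(group, min_traits):
    """group: {trait: [one value per sample]} for a single grouping.
    Returns the indices of samples flagged in at least `min_traits` traits."""
    n_samples = len(next(iter(group.values())))
    counts = [0] * n_samples
    for values in group.values():
        for i, flagged in enumerate(iqr_flags(values)):
            counts[i] += flagged  # True counts as 1
    return [i for i, c in enumerate(counts) if c >= min_traits]

# Hypothetical group of six plants and three traits.
# The last plant is extreme in two traits; the fifth is slightly high in one.
group = {
    "area":      [10.0, 11.0, 10.5, 9.8, 10.2, 25.0],
    "perimeter": [30.0, 31.0, 29.5, 30.5, 30.2, 60.0],
    "Fv/Fm":     [0.80, 0.81, 0.79, 0.80, 0.82, 0.80],
}
flagged = outlier_samples(group, min_traits=2)
print(flagged)
```

Raising `min_traits` makes the rule more conservative: a sample that is unusual in a single trait is kept, whereas a sample that deviates across several traits is removed. In the analysis described above the threshold was 12 traits out of 50.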

Tests for Significant Differences

MVApp facilitates standard statistical tests, including parametric and nonparametric tests. Initially, measured traits can be evaluated for normal distribution and equal variance and be visually examined using histograms, quantile–quantile plots, and box plots to validate whether the parametric test assumptions have been met. Responsive in-app message boxes provide assistance with the interpretation of the P values and selection of parametric or nonparametric tests. In our dataset, the rosette area follows a normal distribution across all accessions according to the Shapiro–Wilk test (Fig. 3A). As the parametric tests assume that the samples being compared have equal variance, MVApp integrates two tests for variance (Bartlett and Levene) and a graphical visualization of the observed variance using box plots for the observed data (y), the subtracted median (y − med[y]), and the absolute deviation from the median (abs[y − med(y)]). We observed an equal variance among the accessions for all the salt-stressed plants 7 d after salt treatment (P values > 0.05), but not for plants grown under control conditions (P values of 1.65 × 10⁻⁶ and 0.026 for the Bartlett and Levene tests, respectively; Fig. 3B). This suggests that the differences among the measured accessions are more pronounced under control conditions than under salt stress. Furthermore, the traits related to maximum quantum yield under the light-adapted state (Fv′/Fm′) and φ(NPQ) showed a normal distribution and equal variance under both control and salt-stress conditions (Supplemental Fig. S3, A and B), indicating that genetic variation and the environment affect these traits less significantly than the rosette area.
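The absolute deviations from the median shown in the variance box plots are also the quantities on which a median-centred Levene test operates. The sketch below computes that test statistic in plain Python. It assumes median centring (often called the Brown–Forsythe variant); whether MVApp centres on the median or the mean is not stated here, and the function name and example values are hypothetical. The P value, which requires the F distribution with k − 1 and N − k degrees of freedom, is omitted.

```python
from statistics import mean, median

def levene_w(groups):
    """Median-centred Levene statistic W for a list of groups of observations."""
    # z_ij = |y_ij - med(y_i)|: absolute deviation from the group median
    z = [[abs(y - median(g)) for y in g] for g in groups]
    k = len(z)                                 # number of groups
    n = sum(len(zi) for zi in z)               # total number of observations
    grand = sum(sum(zi) for zi in z) / n       # grand mean of the deviations
    between = sum(len(zi) * (mean(zi) - grand) ** 2 for zi in z)
    within = sum((v - mean(zi)) ** 2 for zi in z for v in zi)
    return (n - k) / (k - 1) * between / within

# Hypothetical rosette areas: same spread in both groups vs. unequal spread.
equal_spread = levene_w([[1, 2, 3], [11, 12, 13]])
unequal_spread = levene_w([[1, 2, 3], [0, 10, 20]])
print(equal_spread, unequal_spread)
```

A value of W near zero indicates that the groups have similar spread around their medians; larger values indicate unequal variance, to be judged against the F distribution.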

Basic parametric tests such as ANOVA assume a normal distribution and equal variance. When these assumptions are not met, users should consider transforming the data or using a nonparametric test, such as Kruskal–Wallis. Both parametric and nonparametric tests are included in MVApp, together with Tukey’s test and the Mann–Whitney/Wilcoxon tests for pairwise comparison. As our data on rosette area did not meet the assumptions for a parametric test, we examined the differences among the accessions using the Kruskal–Wallis test. Significant differences among the accessions were observed on the final day of measurement under both control and salt-stress conditions (Fig. 3C). On the final day of measurements, the “Col-0,” “Te,” and “C24” accessions had developed the largest rosettes under both conditions studied, while “Nd,” “Can,” and “Cvi” had the smallest. “Rsch” and “Co” exhibited the highest and lowest quantum yields (Fv′/Fm′), respectively, measured across the nine

Figure 2. Data curation with visual inspection. The box plots represent the rosette perimeters from data curated using the quadratic fit (samples with R² < 0.7 were removed). Data from all the accessions were curated using the 1.5 × IQR method for all the measured traits. A data point was considered an outlier if the sample was identified as an outlier in at least 12 traits. The increase in the rosette perimeter of three studied accessions through time is presented using blue and red box plots, which indicate the trait values under control and salt-stress conditions, respectively, both (A) before and (B) after removal of the 28 samples that were identified as outliers. The average number of replicates per genotype, condition, and day is 9 in (A) and 6.74 in (B).







accessions under both the control and salt-stress conditions (Supplemental Fig. S3C). The exact opposite trend was observed for φ(NPQ; Supplemental Fig. S3G). This suggests that the traits reflecting chlorophyll fluorescence might be complementary to the traits commonly used to describe rosette size, providing new insight into plant performance under various conditions.

The parametric tests implemented in MVApp include the one- or two-sample t test and one- or two-way ANOVA using Tukey’s HSD pairwise test. The nonparametric tests available in MVApp include the Kolmogorov–Smirnov test for examining the differences between two samples, as well as the Kruskal–Wallis and Wilcoxon tests, which are commonly used to compare multiple samples. All of these methods of analyses are accompanied by graphical visualizations in the form of box plots, violin plots, bar plots, and scatter plots. Additionally, MVApp facilitates analysis of the interaction between two predefined factors, such as treatment and genotype, using two-way ANOVA. Depending on the data structure, this interaction can be examined in separate data subsets, e.g. across individual time points. In our dataset, we examined the interaction between genotype and treatment in the data subsets for individual time points. In the case of rosette area, we found a significant interaction between genotype and treatment starting from 4 d after the salt treatment and continuing through to the end of the experiment (Fig. 3D). Interestingly, the interaction

Figure 3. Hypothesis testing for parametric test assumptions. MVApp allows users to visually inspect the distribution of phenotypic data using histograms, and to test for a normal distribution using the Shapiro–Wilk test. A, The histogram represents the number of samples (y axis) from which the rosette perimeter measurements (x axis) were taken 7 d after treatment under either control or salt-stress conditions (left and right, respectively). The values observed for the different accessions are represented by different stacked colors. B, The equal variance among the samples was tested using Levene’s and Bartlett’s tests, and is visualized graphically using box plots for the observed data (y), subtracted median (y − med[y]), and the absolute deviation from the median (abs[y − med(y)]). The blue and red box plots indicate the variance in rosette area between plants grown under control and salt-stress conditions, respectively. C, Significant differences between the predefined groups can be tested using parametric or nonparametric tests (ANOVA or Kruskal–Wallis test) and a pairwise comparison test (Tukey HSD and Wilcoxon/Mann–Whitney test). D, The interaction between two predefined factors (genotype and treatment) was tested using two-way ANOVA and visually represented using interaction plots for the data subsets representing 1 h, 3 d, and 7 d after the salt treatment. The red lines represent a significant interaction between genotype and treatment, as identified by two-way ANOVA with a P value threshold of 0.05. For all the plots, the data were curated for outliers across all the measured traits using the 1.5 × IQR method. A data point was defined as an outlier if the sample was identified as an outlier in at least 12 traits.








between genotype and treatment was already apparent in Fv′/Fm′ and φ(NPQ) just 1 d after treatment (Supplemental Fig. S3, F–H). This indicates that the chlorophyll fluorescence traits are more responsive to genotype-by-environment interactions than plant size.
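For readers unfamiliar with the nonparametric alternative used in this section, the Kruskal–Wallis test replaces the observations with their ranks across all groups and compares the rank sums. The following plain-Python sketch computes the H statistic with the usual tie correction. It is an illustration only, not MVApp's R code; the function name and example values are hypothetical, and the P value (from a chi-squared distribution with k − 1 degrees of freedom) is left out.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of groups."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    ranks = {}      # value -> average rank of that value in the pooled data
    tie_term = 0    # sum over tied blocks of (t^3 - t)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_part = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_part - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))

# Hypothetical rosette areas for three accessions with no overlap between groups.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h)
```

With three completely separated groups of three observations, H takes its maximum possible value for this design. Pairwise follow-up comparisons, such as the Mann–Whitney/Wilcoxon test mentioned above, are then needed to determine which groups differ.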

Correlation Analysis

Measuring multiple traits facilitates the study of phenotype-to-genotype relationships (e.g. comparing mutants to wild-type varieties), but also allows a better understanding of the phenotype-to-phenotype interactions. For instance, correlation analyses can be used to identify phenotypic traits that are highly correlated. Changes in correlation caused by different environments or genetic contexts, for example, can also be detected. These subtleties often remain unexplored. Interactive scatterplots can be used to further inspect the correlation between two chosen traits. Various data structures are accommodated by allowing the user to subset the data before the correlation analysis. As such, the MVApp allows the user to examine the strength, variability, and significance of correlations across individual subsets.

We examined the correlations for similarities and differences between the control and salt-stressed plants in our dataset (Fig. 4; Supplemental Dataset S6), which allowed us to identify major clusters of traits. Traits related to Fv′/Fm′ and actual quantum yield in the light-adapted state (φ[P]) were strongly correlated with each other, while traits reflecting photochemical quenching and electron transport beyond PSII photochemistry (qP, qL, and ETR) were less correlated with the single time-point measurements of chlorophyll fluorescence traits reflecting the maximum, minimum, and instantaneous fluorescence (Fm, Fo, and Ft, respectively). All of these photosynthetic traits were found to be negatively correlated with traits reflecting NPQ parameters (φ[NPQ], non-photochemical quenching), but the strongest negative correlations were observed between Fv′/Fm′ and φ(P). Additionally, various photosynthetic traits measured at different photon irradiances (Lss1–Lss4) showed highly significant correlations. The rosette area was positively correlated with the quantum-yield cluster of traits and was negatively correlated with the NPQ traits. However, the relationship between plant size and the NPQ traits was less pronounced under salt-stress conditions; many of the correlations were not significant (P value > 0.05; Fig. 4).
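Subsetting the data before computing correlations, as done here for control and salt-stressed plants, can be sketched in a few lines. The example below is plain Python rather than MVApp's R code, assumes the Pearson coefficient, and uses hypothetical column names and values. Significance testing is omitted.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_subset(rows, subset_key, trait_a, trait_b):
    """Split rows by `subset_key` and correlate two traits within each subset."""
    subsets = {}
    for row in rows:
        xs, ys = subsets.setdefault(row[subset_key], ([], []))
        xs.append(row[trait_a])
        ys.append(row[trait_b])
    return {key: pearson_r(xs, ys) for key, (xs, ys) in subsets.items()}

# Hypothetical measurements: rosette area and an NPQ trait under two treatments.
rows = [
    {"treatment": "control", "area": 1, "npq": 4},
    {"treatment": "control", "area": 2, "npq": 3},
    {"treatment": "control", "area": 3, "npq": 2},
    {"treatment": "control", "area": 4, "npq": 1},
    {"treatment": "salt", "area": 1, "npq": 2},
    {"treatment": "salt", "area": 2, "npq": 1},
    {"treatment": "salt", "area": 3, "npq": 2},
    {"treatment": "salt", "area": 4, "npq": 1},
]
r = correlation_by_subset(rows, "treatment", "area", "npq")
print(r)
```

In this constructed example the negative relationship between the two traits is perfect under control conditions and much weaker under salt stress, mirroring the kind of condition-dependent change in correlation that the subset analysis is meant to expose.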

Reduction of Dimensionality

When multiple traits are measured from the same individual, a multidimensional phenotypic space that defines each sample can be obtained. However, the measured traits are likely to be highly correlated. Therefore, the dimensionality of the data can be reduced, thus simplifying the data while maintaining important trends and patterns.

The contribution of each measured trait to the overall observed variance can be examined using PCA. PCA allows users to determine the minimum number of dimensions required to adequately summarize the phenotypic variance. Observations of potentially correlated measured traits are transformed into a set of linearly uncorrelated variables, called “principal components” (PCs; Lever et al., 2017). The first PC accounts for the largest observed variance, providing the best possible summary of the variance observed across the experiment. As PCA is sensitive to relative scaling of the observed variables, we included the option to scale the data before performing PCA in MVApp. Additionally, depending on the data structure, the user can perform PCA on the specific data subsets, grouped by treatment or time point, for instance. In our example of a scaled dataset, 10 PCs were needed to describe at least 98% of the variance observed in the plants grown under both the control and salt-stress conditions (Fig. 5A; Supplemental Dataset S7). The highest contribution to the first PC came from traits related to NPQ and quantum yield under the light-adapted state (Fig. 5B; Supplemental Dataset S8), with no differences in contribution observed between the two conditions. When the data were reduced to the first two PCs, differences between individual accessions were found in both the salt-stress and control conditions (Fig. 5C). Although PCA is not uncommon for multivariate datasets, its implementation and interpretation can be challenging. Furthermore, the contributions of individual traits to PCs are seldom described. Those issues are addressed in MVApp, making it intuitive for the user to understand and explore the PCA, providing additional insight into the data.
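To make the notions of explained variance and trait contribution concrete, the sketch below extracts the first PC of a small scaled dataset. It is plain Python and not MVApp's R code. It uses power iteration on the correlation matrix instead of a full eigendecomposition, and it assumes a common definition of contribution (squared loading × 100) with a uniform reference line at 100 divided by the number of traits. The function names, trait names, and data are hypothetical.

```python
import math

def standardize(columns):
    """Scale each trait (column) to zero mean and unit sample variance."""
    out = []
    for col in columns:
        m = sum(col) / len(col)
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - m) / sd for v in col])
    return out

def first_pc(columns, iterations=500):
    """Return (loadings of PC1, fraction of variance explained by PC1)."""
    z = standardize(columns)
    p, n = len(z), len(z[0])
    # Correlation matrix of the scaled traits.
    c = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
         for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p          # starting vector
    for _ in range(iterations):           # power iteration towards PC1
        w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(c[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue / p              # trace of a correlation matrix is p

# Hypothetical traits: two perfectly correlated size traits and one unrelated trait.
traits = {
    "area":      [1, 2, 3, 4, 5],
    "perimeter": [2, 4, 6, 8, 10],
    "other":     [1, -1, 0, -1, 1],
}
loadings, explained = first_pc(list(traits.values()))
contributions = [100 * x * x for x in loadings]   # percent contribution to PC1
reference = 100 / len(traits)                     # value under uniform contribution
important = [name for name, c in zip(traits, contributions) if c > reference]
print(explained, contributions, important)
```

In this constructed example the two correlated traits share the whole of PC1, which explains two thirds of the total variance, and the unrelated trait contributes essentially nothing. Power iteration converges only when the first eigenvalue is clearly separated from the second, so a standard eigendecomposition routine is preferable for real data.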

Another method of reducing the dimensionality of a dataset containing multiple phenotypes is MDS. With MDS, the levels of similarity among the objects in a dataset are visualized based on their distances from each other. MDS can be done for individual samples in the dataset to see the similarity among individual replicates, or for all measured traits to determine their relationships and changes therein under different conditions. As MDS, like PCA, is also sensitive to the scaling of the variables, the scaling is optional. MVApp also allows the user to perform MDS on all the measured phenotypes and then cluster them using k-means clustering. In our dataset, we observed four clusters, which were related to rosette area, quantum yield, NPQ, and single time-point chlorophyll fluorescence parameters (Fig. 6). Interestingly, the traits were grouped into the same clusters and the positions of the traits respective to each other were almost mirror images for both the control and salt-stress conditions. These results suggest that the relationships among the traits do not change in response to salt stress.
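The k-means step applied to the MDS coordinates can be illustrated with a minimal version of Lloyd's algorithm. The sketch is plain Python and not MVApp's R code. It assumes that two-dimensional coordinates are already available and that the starting centroids are supplied by the caller, whereas typical implementations choose random starts. The function name and the example coordinates are hypothetical.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm. Returns (cluster label per point, final centroids)."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical MDS coordinates of six traits forming two well-separated groups.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = kmeans(coords, initial_centroids=[(0, 0), (10, 10)])
print(labels, centres)
```

The outcome of k-means depends on the number of clusters and on the starting centroids, which is why guidance on choosing the number of clusters is useful in practice.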

Cluster Analysis

Once the most interesting traits have been identified, they can be used to group samples with similar behavior into clusters. Clustering can be based on any











number of measured traits and is used to reveal subgroups of samples within an experiment that exhibit similar response patterns. In MVApp, users can perform either k-means or hierarchical clustering analyses. K-means clustering assigns the individual samples to a user-defined number of centroids. To aid the user, MVApp provides a number of visual and cumulative methods for defining the optimal number of k-means clusters. The clusters can be visualized using bar plots or scatter plots according to their measured traits to evaluate the contribution of the user-defined trait to the overall variation in the data. Hierarchical clustering, in contrast, does not require the user to predefine the number of clusters. Based on selected traits, the samples are sorted into a dendrogram. This is matched to a heatmap illustrating the value of each trait selected for clustering (Fig. 7A).

Using hierarchical clustering, we grouped the average values of the accessions scored on the final day of measurement based on traits reflecting rosette

Figure 4. Photosynthesis-focused clusters revealed by general correlation trends. The data from plants grown under control and salt-stressed conditions were examined separately for general correlations and their trends. The correlation coefficients for the plants grown under control and salt-stress conditions are shown in the lower left and upper right panels, respectively. The color of each individual cell reflects the strength of the correlation, with hues of green and pink reflecting positive and negative correlations, respectively. Nonsignificant correlations (P value > 0.05) are marked with a cross. The correlation coefficients were calculated using curated data from a total of 966 samples. The outliers were identified using the 1.5 × IQR method on all the measured traits, with at least 12 outlying traits necessary to qualify a sample as an outlier. The data were additionally curated using the quadratic fit (samples with R² < 0.7 were removed). RMS, rotational mass symmetry; SOL, slenderness of leaves.





architecture, quantum yield, and NPQ using Ward’s method (Fig. 7A). From this, we found that accessions with a large rosette area showed high quantum yield and low NPQ, while accessions with a small rosette area showed low quantum yield and high quenching. Using the dendrogram, which represented the relationships between the individual samples, we selected a specific distance at which the samples were separated into three clusters (Fig. 7B; Supplemental Dataset S9). MVApp allows the examination of the differences between clusters using box plots, ANOVA, or Tukey’s test for significance (Fig. 7C) and automatically lists the traits that are significantly different between clusters. Cluster 1 included accessions grown mainly under salt-stress conditions, with a medium-sized rosette area and high NPQ. Cluster 2 contained the accessions with the largest rosette area, measured exclusively under control conditions. In Cluster 3, accessions grown under both salt-stress and control conditions were found, and these accessions had the smallest rosette area but a high quantum yield. No differences among the clusters were observed in the traits reflecting rosette architecture (Supplemental Fig. S4). The high photosynthetic performances of Clusters 2 and 3 were also reflected in ETR (Supplemental Fig. S4). These results confirm the previous observation that plant size is defined by

Figure 5. Majority of the observed variance is explained by chlorophyll fluorescence traits. Multidimensional data can be simplified by performing PCA. A, Scree plots of the PCA eigenvectors illustrate the percentage of variance explained by individual PCs. B, The plot shows the contributions of the measured traits to the first PC. The traits are listed on the x axis according to their percentage of contribution; the percentage of contribution to PC1 is presented on the y axis. The red line represents the expected value if the contribution was uniform. For a given PC, any trait with contributions above the reference line is considered an important contributor to that PC. SOL, slenderness of leaves; RMS, rotational mass symmetry. C, The PC coordinates for individual samples for the two first PCs showed no correlation, and a distinct spread of data from the nine accessions studied. The individual points represent samples, where the colors indicate individual accessions. PCA was performed on the curated data, subset by treatment. The plots represent the subsets for control (left) and salt stress (right), respectively. The outliers were identified using the 1.5 × IQR method on all the measured traits, with at least 12 outlying traits necessary to qualify a sample as an outlier. The data were additionally curated using the quadratic fit (samples with R² < 0.7 were removed).


Julkowska et al.






photosynthetic efficiency (Fig. 4), which is represented by both quantum yield and NPQ. Additionally, the hierarchical clustering shown here indicates that a group of accessions can develop small rosettes despite high photosynthetic performance. As no differences in the single time-point measurements of chlorophyll fluorescence were found between Clusters 2 and 3, mechanisms other than photosynthetic efficiency may be limiting the growth of accessions belonging to Cluster 3. This type of cluster analysis for multivariate phenotypic data remains uncommon in the field, despite the intriguing insight it can provide into plant responses. Through its inclusion in MVApp, we aim to encourage the use of this type of analysis.

Quantile Regression

Advances in phenotyping methods have enabled scientists to simultaneously examine multiple traits that could be contributing to a trait of major interest, such as plant size or yield. However, correlation and PC analyses limit this examination to the general trends observed in an experiment. Furthermore, the ordinary least-squares regression models that are typically used to evaluate trait contributions to yield or survival are based on the trait average values observed across a population. However, the contribution of individual traits to biomass or yield can differ between the smallest and largest plants within an experiment. Therefore, we integrated quantile regression, which identifies those traits that might contribute significantly to the trait of major interest in each quantile, into MVApp. As quantile regression is relatively new, quite uncommon, and challenging to set up, MVApp brings it within the reach of any scientist. The various data structures can be accommodated by selecting multiple grouping variables and performing the quantile regression on the subsets of the data. Additionally, MVApp allows the results of quantile regression to be compared with those obtained from a traditional ordinary least-squares regression model.

For our dataset, we grouped the traits per treatment and performed the quantile regression for the subsets of individual time points. The φ(P) and ETR contributed to the size of the rosette area in plants grown under both the control and salt-stress conditions through time, though the contribution was higher for the plants

Figure 6. MDS reveals a cluster of traits unaltered by salt treatment. All the measured traits were scaled before MDS. The color of the groups indicates traits belonging to distinct k-means clusters. Traits in close proximity are more related than distant traits. MDS was performed on the curated data, which was subset by treatment. The plots represent the subsets for control (left) and salt-stress (right) conditions. The outliers were identified using the 1.5 × IQR method on all the measured traits, with at least 12 outlying traits necessary to qualify a sample as an outlier. The data were additionally curated using the quadratic fit (samples with R² < 0.7 were removed). RMS, rotational mass symmetry; SOL, slenderness of leaves.





grown under control conditions (Fig. 8A; Supplemental Dataset S10). Interestingly, neither quantum yield in the dark-adapted state nor NPQ contributed to the rosette area under either condition (Fig. 8A; Supplemental Dataset S10). When φ(NPQ) was chosen as the trait of major interest, we observed significant contributions of Fv′/Fm′ in most quantiles; this trait’s contribution was mainly observed under the control condition (Fig. 8B; Supplemental Dataset S10). When either φ(P) or ETR was chosen as the trait of major interest, these two traits contributed only to each other, and no other chlorophyll fluorescence parameters, such as φ(NPQ) or Fv′/Fm′, contributed to them (Fig. 8C; Supplemental Dataset S10). By performing quantile regression on different traits of interest, we developed an understanding of how the traits related to each other (Fig. 8D). We observed that ETR and Fv′/Fm′ were tightly linked to each other in all quantiles, affecting the majority of the measured traits and mutually affecting each other, but were not explained by any of the other measured traits. Although the trait contribution did not differ substantially among the individual quantiles in our example dataset, we think that it will be a useful feature for more complex datasets covering higher ranges of variation.

Estimation of Heritability

Many high-throughput phenotyping experiments use data from forward genetics studies. For in-depth analyses in these types of studies, estimating heritability based on a limited number of genotypes is of utmost importance. MVApp allows the estimation of broad-sense heritability from the measured traits, thereby enabling an informed decision on whether a specific phenotypic trait has sufficient genetic variance to be used for forward genetic studies. In MVApp,

Figure 7. Cluster analysis revealed mechanisms other than photosynthetic efficiency limiting plant performance. A, The heatmap reflects the rosette area, maximum quantum yield under the light-adapted state (Fv′/Fm′_Lss4), and NPQ (Φ[NPQ]_Lss4) measured at the highest photon irradiance (Lss4 = 440 µmol m⁻² s⁻¹). The columns represent the means of the individual samples, while the rows represent the high (red) and low (blue) values for selected traits, normalized using a z-Fisher transformation. B, In this dendrogram, the samples are clustered based on rosette area, Fv′/Fm′_Lss4, and Φ(NPQ)_Lss4, where the distance between the samples (x axis) is based on their dissimilarity as calculated using Ward’s method. Cutting the dendrogram at a distance of “5” resulted in three clusters. C, The box plots reflect the hierarchical-cluster validation results for rosette area, actual quantum yield (Φ[P]), and maximum quantum yield under the light-adapted state (Fv′/Fm′_Lss4), as well as NPQ (Φ[NPQ]), all measured at the highest photon irradiance (Lss4 = 440 µmol m⁻² s⁻¹) as arbitrary units (a.u.). The average values representing area, NPQ, and quantum yield per accession and treatment were grouped into hierarchical clusters based on area, Fv′/Fm′_Lss4, and Φ(NPQ)_Lss4. The letters above the box plots indicate groups shown to be significantly different by Tukey’s HSD pairwise test (P value < 0.05). The hierarchical clusters were calculated using data collected on the final day of phenotyping only (7 d after treatment) from the curated dataset. The outliers were identified using the 1.5 × IQR method on all the measured traits, with at least 12 outlying traits necessary to qualify a sample as an outlier. The data were additionally curated using the quadratic fit (samples with R² < 0.7 were removed).











Figure 8. Quantile regression reveals traits of major interest and importance. Quantile regression can reveal relationships among traits based on individual quantiles, rather than average values. A–C, Quantile regression was performed using (A) rosette area, (B) NPQ (Φ[NPQ]_Lss4), and (C) maximum quantum yield under the light-adapted state (Fv′/Fm′_Lss4) as the traits of major interest. The plots show the effects of the following plant traits on the response of individual quantiles to the trait of major interest: actual quantum yield (Φ[P]), electron transport rate moving beyond PSII (ETR4), maximum quantum yield under the light-adapted state (Fv′/Fm′_Lss4), and NPQ (Φ[NPQ]), all measured at the highest photon irradiance (Lss4 = 440 µmol m⁻² s⁻¹). The x axis represents the quantile level, and the y axis represents the estimated value of the regression coefficient. The colored data points indicate that a coefficient has a significant contribution to the trait of major interest, with a P value threshold of 0.05. Crosses represent a nonsignificant contribution of a coefficient to the trait of major interest. Blue and red lines indicate the quantile regression coefficients for the control and salt-stress conditions, respectively. The quantile regression was calculated using the raw data separated by day of measurement and treatment. The P values of the regression coefficients are listed in Supplemental Dataset S10. D, The relationships among the measured traits are modeled. The blue and red arrows indicate a significant contribution of one trait to another for all quantiles and time points under either the control or salt-stress conditions, respectively. The dashed lines indicate that a contribution is not significant for all of the studied time points and quantiles. The gray boxes indicate traits that were used as traits of major interest in the quantile regression analysis.







heritability is calculated as the ratio of the total genetic variance to the total phenotypic variance. Depending on the data structure, heritability can be estimated for individual data subsets, such as treatment and time point. For our datasets, we calculated heritability of individual traits per treatment and time point. The heritability of individual traits varied between 0 and 0.9862 (Supplemental Dataset S11). Traits measured within the first 2 d exhibited low heritability (<0.6), which increased to >0.8 after 4 d of measurement for all traits except the single time-point measurements of photosynthetic activity in light- and dark-adapted conditions (Fm, Fo, Ft in Lss1–Lss4). Traits associated with quantum yield, NPQ, and rosette area showed the highest heritability, while traits associated with rosette architecture and fluorescence in the dark-adapted state (Fm, Fo, and Ft) had the lowest heritability. The estimated heritability was slightly higher in plants grown under the control condition (0.9081) than in plants grown under salt stress (0.875; Supplemental Dataset S11). As the median estimated heritability for all the measured traits was >0.89, most of the measured traits seemed to have a genetic basis, which could be dissected using forward genetic screens.

DISCUSSION

We live in the exciting time of high-throughput phenotyping, which can generate an avalanche of data reflecting the complexity of different biological systems (Fahlgren et al., 2015). However, analysis of these data takes a significant amount of time and effort (Howe et al., 2008), and comparing data outputs requires standardization of data curation and analysis. The MVApp provides a flexible analytical pipeline, which can deal with multiple data structures and accommodates multiple independent variables, such as treatment, genotype, and time points, which can be used to subset or group the data for individual analyses. The existing tools, such as the software HTPmod (Chen et al., 2018), recognize the need for visualization and modeling of the high-throughput phenotyping data but do not offer opportunities for future extensions. MVApp strives to make a first step toward a future framework for standardizing data curation, analysis processing, and visualization of diverse datasets, and community input is invaluable in this process. Therefore, we encourage our peers to submit their comments and suggestions for new and improved features of the MVApp as indicated in the contribution guidelines (https://github.com/mmjulkowska/MVApp/blob/master/CONTRIBUTING.md). By including in-app messages that help to interpret the results of various tests, MVApp provides a comprehensive guideline for data curation, interpretation, and analysis, which is particularly valuable to first-time users and early career researchers. Streamlining and standardizing data analysis protocols will contribute to the FAIR data principles and improve the review process for publications and other scientific outputs. Our interactive tool for data curation and analysis, MVApp, strives to enhance the transparency of data curation and analysis by improving processing times and by reducing the need for expensive software and extensive knowledge of R or statistics. Despite the wide availability of data analysis tools, none of them are designed for time-efficient analyses of the big datasets generated by medium- and high-throughput phenotyping platforms. MVApp exploits the flexibility of R-based statistical analysis and combines it with a graphical user interface. The graphs and tables produced in MVApp can be downloaded in a publication-ready format with default figure legends, which contain information about the analysis performed and the preprocessing steps, including data curation, allowing the graphs to be easily reproduced. We think that applications such as MVApp will not only contribute to the availability of FAIR data, but also will encourage the scientific community to both share their raw data and standardize data curation so that the figures and analyses reported in scientific publications can be reproduced and better understood by the wider audience.

Outlier curation is an integral part of data analysis, but its importance is often overlooked. As none of the existing data-analysis pipelines includes outlier detection, the MVApp is a pioneer for streamlined data curation, providing a significant contribution to FAIR data processing. In the past, outlier curation methods were developed using coarse-to-fine models for the identification of abnormalities in phenotypes related to plant photosynthesis (Xu et al., 2015). The outlier selection methods in MVApp provide opportunities to identify the possible outliers independent of the nature of the phenotypic trait. In our first release of MVApp, we have included four different methods for outlier identification. Provided that the data contain a time or gradient component, curation can also be performed based on fitting a curve to the growth of individual plants. By visualizing the data before and after outlier identification, the user can make an informed decision about which samples to retain for further analysis. However, the user must be careful when removing data points, as the outliers themselves might contain important information (Altman and Krzywinski, 2016). Therefore, the original dataset containing all the samples, including outliers, remains available in the MVApp dropdown menu for subsequent analysis. We are not aware of any other application that integrates outlier curation that is as transparent and simple to use as the one integrated into MVApp.

For the large datasets that are often produced by high-throughput phenotyping platforms, MVApp offers different methods of dimensionality reduction. Dimension reduction by PCA or MDS can simplify the data by summarizing it in a limited number of dimensions. Both PCA and MDS are often used in metabolomic (Zhang et al., 2016), transcriptomic (Yano et al., 2018), and genomic studies (Miłobędzka and











Muszyński, 2017) to help identify interesting patterns among samples. However, PCA and MDS are underexplored for use in large phenotypic studies. We propose that these methods can clarify trait contributions to observed variances, phenotype-to-phenotype relationships, and trends that change in response to components of treatment, genotype, or time. Including PCA and MDS in MVApp facilitates the broader use of these methods by the scientific community.

Although most studies still use linear regression models, focusing only on the contribution of traits for an average plant (Sellam and Poovammal, 2016; Sitienei et al., 2017), MVApp integrates quantile regression, which models the contribution of traits across the entire distribution of plants. Quantile regression can be used as a hypothesis-generating tool, identifying novel plant phenotypes with significant contributions to yield or stress tolerance, and the trait contribution can be estimated for individual quantiles. The use of quantile regression in the field of plant phenotyping is a recent concept. Its integration into the MVApp will help to develop a better understanding of the phenotype-to-phenotype interactions and the contribution of individual traits to the trait of major interest, e.g. survival, yield, or metabolite production. The quantile-regression approach can be applied to the field of plant breeding, where understanding the traits contributing to productivity is key to the development of superior plant varieties with increased yields.

MVApp combines several existing statistical R libraries into a pipeline (presented in Fig. 9), guiding the user through the interactive process of data curation, exploration, and analysis. The graphical user interface of the MVApp and its in-app messages aim to provide better understanding and interpretation of the statistical test outputs, as well as to empower users without skills in command-line software to explore the full potential of multivariate analysis. In this article, we have presented the different functionalities of MVApp, which we aim to expand in the future. We encourage the submission of new modules, suggestions, and contributions from the scientific community to new releases of MVApp. Please see https://github.com/mmjulkowska/MVApp/blob/master/CONTRIBUTING.md for more information about how to contribute. Our goal for MVApp is that it will facilitate data analysis and statistical literacy across the scientific community by compiling different methods that are already in use across various disciplines. We hope that MVApp can unlock the potential of those methods and enhance the experience of data curation and exploration for researchers, especially those investigating phenotype-to-genotype and phenotype-to-phenotype relationships in phenotypic datasets.

Figure 9. An overview of the MVApp modules used for streamlined data curation, exploration, and analysis. Through a user-friendly interface, MVApp provides different statistical methods for curating data, testing hypotheses, identifying trends, and defining genotype-to-phenotype and phenotype-to-phenotype relationships. Data curation can be performed using 1.5 × IQR, Cook’s distance, SD from the median, or the Bonferroni outlier test. Both curated and noncurated data remain available for comparison and further analyses. For generating new hypotheses, MVApp offers a wide array of tools. The relationships among the different measured traits can be examined using correlation analysis. The contribution of traits to the observed variance can be evaluated using PCA, and the dimensionality of the dataset can be reduced using either PCA or MDS. Individual samples can be clustered into subgroups using hierarchical or k-means clustering. The relationships among the measured traits and the traits of major interest can be evaluated using quantile regression. For hypothesis testing, MVApp allows the user to visually inspect the conditions of parametric tests and to examine predefined groups for significant differences using both parametric and nonparametric tests. MVApp is an open-source project, and the online version is available at http://mvapp.kaust.edu.sa/MVApp/. Community contributions to improve existing modules or to add new ones are encouraged.








MATERIALS AND METHODS

MVApp Setup

MVApp was written in “R” (R Core Team, 2015), and its interactive user interface was built with the “shiny” library (Chang et al.). The interactive plots were produced with the “plotly” (Sievert et al., 2017), “ggplot2” (Wickham, 2009), and “gplots” (Warnes et al., 2016) libraries. The color scales for plots were enabled using the “RColorBrewer” (Neuwirth, 2014) and “colorRamps” (Keitt, 2012) libraries. The data table display was based on the “DT” library (Xie et al., 2018). Reshaping of the data tables into various formats was accomplished using the “reshape” and “reshape2” (Wickham, 2007) libraries. Users can download all the graphs in the “.pdf” format, and all the tables in the “.csv” format. The data were scaled using the “scale()” function. The values represented in the tables were rounded to four decimal places using the “round()” function, but were left unaltered in the table available for download. All functions were part of the “stats” library in “R” unless indicated otherwise.

Curve Fitting

The individual samples were separated based on sample ID, genotype, and selected independent variables. Depending on the fitted function, the trait data were either left untransformed (linear function) or transformed using the square root, quadratic, or natural logarithm functions (for the quadratic, square root, and exponential functions, respectively). After the transformation, the linear model was fitted using the “lm()” function, and the r-squared value was calculated using the “summary(lm())$r.squared” function. The cubic splines were calculated using “lm(trait ~ bs()),” while the smoothed splines were calculated using “smooth.spline().” Spline functions were developed in the “splines” library (R Core Team, 2015). The ANOVA was performed using the “aov()” function, while the Tukey-HSD pairwise comparison test was calculated using the “HSD.test()” function from the “agricolae” library (de Mendiburu, 2017).
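The transform-then-fit step can be illustrated with a short sketch. The Python code below is not MVApp's R code; it is a minimal illustration, assuming a single sample measured over time, of how a trait is transformed, fitted with an ordinary least-squares line, and scored by R². The names `TRANSFORMS` and `fit_r_squared` and the example values are hypothetical.

```python
import math

# Transformation applied to the trait before the linear fit, keyed by the
# growth function being tested (keys are illustrative, not MVApp identifiers).
TRANSFORMS = {
    "linear": lambda v: v,
    "quadratic": math.sqrt,            # quadratic growth: sqrt(trait) is linear in time
    "square_root": lambda v: v ** 2,   # square-root growth: trait^2 is linear in time
    "exponential": math.log,           # exponential growth: ln(trait) is linear in time
}

def fit_r_squared(times, trait, model="linear"):
    """Fit transformed trait ~ time by least squares; return (intercept, slope, R^2)."""
    y = [TRANSFORMS[model](v) for v in trait]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(y) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(times, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    ss_res = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, y))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Made-up example: a rosette area that grows exactly quadratically with time.
days = [0, 1, 2, 3, 4]
rosette_area = [(2 + 3 * d) ** 2 for d in days]
_, _, r2_quadratic = fit_r_squared(days, rosette_area, "quadratic")
_, _, r2_linear = fit_r_squared(days, rosette_area, "linear")

# Curation rule quoted in the figure legends: samples with R^2 < 0.7 are removed.
keep_sample = r2_quadratic >= 0.7
```

Comparing the R² values of the candidate functions for each sample is one way to choose the model on which such a curation threshold is applied.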

Data Curation

For outlier detection, the data were first grouped based on the selected independent variables (e.g. genotype, treatment, and time). The selected independent variables were fused into one ID (e.g. “genotype_treatment_time”) and used as grouping variables to identify the outliers.

In our example, the “boxplot()$out” function was used to identify outliers with the 1.5 × IQR method. For the Bonferroni test, a linear model was fitted with “lm(),” followed by “car::outlierTest()” from the “car” library (Fox and Weisberg, 2011). To identify outliers using the Cook’s distance, the linear model was fitted with “lm(),” followed by the “cooks.distance()” function from the “stats” library (R Core Team, 2015). The samples having a Cook’s distance larger than four times the mean Cook’s distance were classified as “outliers.” To identify outliers by their ±SD from the median, the median and SD were calculated per predefined grouping variables (e.g. genotype, treatment, time) using the “median()” and “sd()” functions within the “summaryBy()” function from the “doBy” library (Hojsgaard and Halekoh, 2018). The dataset was then subsetted for predefined grouping variables, and individual values outside of the median ± SD range for that specific subset were identified.
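The grouping and the 1.5 × IQR rule described above can be sketched as follows. This Python code is an illustration rather than MVApp's implementation: it assumes the hinge-based quartiles of Tukey's five-number summary, and the helper names, the column names, and the example rows are hypothetical.

```python
import math
from collections import defaultdict

def tukey_hinges(values):
    """Lower and upper hinges of Tukey's five-number summary."""
    xs = sorted(values)
    n = len(xs)
    n4 = ((n + 3) // 2) / 2  # 1-based, possibly fractional, position of the lower hinge

    def at(pos):
        # Average the two neighbouring order statistics when pos is fractional.
        return 0.5 * (xs[math.floor(pos) - 1] + xs[math.ceil(pos) - 1])

    return at(n4), at(n + 1 - n4)

def iqr_outliers(values, coef=1.5):
    """Values lying more than coef * (hinge spread) beyond the hinges."""
    lower, upper = tukey_hinges(values)
    spread = upper - lower
    return [v for v in values if v < lower - coef * spread or v > upper + coef * spread]

def flag_outliers(rows, trait, group_keys):
    """Fuse the grouping variables into one ID and flag outliers within each group."""
    groups = defaultdict(list)
    for row in rows:
        fused_id = "_".join(str(row[key]) for key in group_keys)
        groups[fused_id].append(row)
    flagged = []
    for members in groups.values():
        extreme = set(iqr_outliers([m[trait] for m in members]))
        flagged += [m["sample"] for m in members if m[trait] in extreme]
    return flagged

# Made-up rows for one genotype, treatment, and day.
rows = [
    {"sample": f"s{i}", "genotype": "G1", "treatment": "control", "day": 7, "area": a}
    for i, a in enumerate([10, 11, 12, 13, 14, 15, 40])
]
outliers = flag_outliers(rows, "area", ["genotype", "treatment", "day"])
```

Repeating `flag_outliers` for every trait and counting how often each sample is flagged would give the per-sample tally that rules such as "at least 12 outlying traits" rely on.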

Summary Statistics

The summary statistics were calculated using the “summaryBy()” function from the “doBy” library (Hojsgaard and Halekoh, 2018). The mean, median, SD, SE, min, max, sum, and number of samples were calculated using the “mean(),” “median(),” “sd(),” “std.error(),” “min(),” “max(),” “sum(),” and “length()” functions, respectively, from the “doBy” (Hojsgaard and Halekoh, 2018) and “plotrix” libraries (Lemon, 2006).

Hypothesis Testing Using Parametric and Nonparametric Tests

The normal distribution of the data was tested with the Shapiro–Wilk test using the “shapiro.test()” function, and the QQ-plots were produced using the “qqnorm()” and “qqline()” functions. The equal variance was tested using the Bartlett and Levene tests by applying the “bartlett.test()” function and the “leveneTest()” function from the “car” library (Fox and Weisberg, 2011). The equal variation was represented visually using the “hovPlot.bf()” function from the “HH” library (Heiberger, 2018). One- and two-sample t tests were performed using the “t.test()” function, while the Kolmogorov–Smirnov test was performed using “ks.test().” The ANOVA was performed using the “aov()” function, while the Tukey-HSD pairwise comparison test was calculated using the “HSD.test()” function. The nonparametric Kruskal–Wallis test was performed using “kruskal.test(),” while the pairwise Wilcoxon/Mann-Whitney test was executed using “pairwise.wilcox.test().” The two-way ANOVA was performed using the linear model “lm(),” followed by the “anova()” function. The two-way ANOVA interaction plot was produced by “interaction.plot(),” while the residual plot was produced by plotting “lm()$fitted” versus “lm()$residual.”

Correlation Analysis

For our data, the correlation plot was made using the “corrplot()” function from the “corrplot” library (Wei and Simko, 2017). The correlation coefficient was calculated by the selected input method (Pearson or Spearman), and the r-squared and P values were extracted from the “rcorr()” function. The correlation significance test was performed by “cor.mtest().”
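The difference between the two selectable methods can be shown with a small sketch. The Python code below is illustrative only and does not reproduce the R functions named above; it assumes that Spearman's coefficient is computed as Pearson's coefficient on average ranks, and the function names and example values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up traits with a monotonic but nonlinear relationship.
trait_a = [1.0, 2.0, 3.0, 4.0, 5.0]
trait_b = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson = pearson(trait_a, trait_b)    # below 1 because the relation is curved
r_spearman = spearman(trait_a, trait_b)  # 1 because the ordering is identical
```

The rank-based coefficient is therefore the more natural choice when two traits are expected to rise together without following a straight line.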

Reduction of Dimensionality

PCA was performed using the “PCA()” function from the “factoextra” (Kassambara and Mundt, 2017) and “FactoMineR” (Le et al., 2008) libraries. The eigenvectors were extracted by using “PCA$eig,” and the PCA contribution plot was made using the “fviz_pca_var()” function. The contribution of individual traits for each selected PC was calculated using “PCA$var$contrib,” while the plot was made using the “fviz_contrib()” function. MDS was performed using the “dist(),” “cmdscale(),” and “as_tibble()” functions, and the k-means clustering of the MDS results was performed using the “kmeans()$cluster” function. MDS of the measured traits was performed by transposing the dataset using the “t()” function and completing the MDS analysis as described above. Thus, dimensional relationships were examined among the traits rather than among the samples.
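How a trait's contribution to a PC relates to the uniform reference line in the contribution plot can be sketched as follows. This Python code is a minimal illustration, assuming that the contribution of a trait to one PC is its squared coordinate on that PC expressed as a percentage of the sum over all traits; the function name, trait labels, and coordinates are hypothetical.

```python
def trait_contributions(coordinates):
    """Percentage contribution of each trait to one PC from its squared coordinate."""
    total = sum(c ** 2 for c in coordinates.values())
    return {trait: 100 * c ** 2 / total for trait, c in coordinates.items()}

# Hypothetical coordinates of four traits on the first PC.
pc1 = {"quantum_yield": 0.9, "NPQ": -0.3, "rosette_area": 0.3, "SOL": 0.1}
contrib = trait_contributions(pc1)

# Expected contribution if every trait contributed equally (the reference line).
reference = 100 / len(pc1)

# Traits above the reference line are treated as important contributors.
important = sorted(t for t, pct in contrib.items() if pct > reference)
```

With four traits the reference line sits at 25%, so only a trait contributing more than a quarter of the variance on that PC is singled out.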

Cluster Analysis

For hierarchical clustering in MVApp, the correlations among the samples in terms of the chosen traits were calculated using the “cor()” function. Then, the “dist()” function was used to determine the distances between the samples. The hierarchical analysis was performed on the transposed dataset using the “t()” function to determine the relationship among the selected traits with functions from the “pvclust” (Suzuki and Shimodaira, 2015) and “NbClust” (Charrad et al., 2014) libraries. The hierarchical clustering was performed using the “hclust()” function. The heatmap was produced using the “heatmap.2()” function from the “gplots” library (Warnes et al., 2016), scaled per row. The dendrogram was produced from the output of the “hclust()” function using the “plot(as.dendrogram())” function. The number of clusters was determined by cutting the dendrogram at a user-identified distance using the “cutree()” function. The significant differences among individual clusters were identified by the ANOVA and Tukey-pairwise comparison tests using the “aov()” and “glht()” functions from the “multcomp” library (Hothorn et al., 2008), followed by the “cld()” function, which allows letters indicating significant differences among the predefined groups to be integrated into a box plot.

For k-means clustering in MVApp, the optimal cluster number was calculated using different methods. The elbow plot was produced using the “fviz_nbclust(method = “wss”)” function from the “factoextra” library (Kassambara and Mundt, 2017). The silhouette plot was made using the “fviz_nbclust(method = “silhouette”)” function. To ensure that the suggested number of clusters was neither too ambitious nor too conservative, the results of 30 other indices were compared (Charrad et al., 2014), and the best number of clusters was suggested according to “majority rule” using the “NbClust(distance = “euclidean”, min.nc = 2, max.nc = 10, method = “kmeans”)” function, followed by “fviz_nbclust().” Based on these results, the user selects the number of clusters, and the individual samples are subsequently assigned to a k-means cluster using “kmeans().”
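The idea behind the elbow plot can be sketched in a few lines. The Python code below is an illustration under simplifying assumptions, not the algorithm used by the R functions named above: it runs plain Lloyd iterations from a deterministic start (evenly spaced input points) and records the total within-cluster sum of squares for each candidate number of clusters. All names and data points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def kmeans(points, start, iterations=50):
    """Lloyd's algorithm from given starting centroids; returns (labels, wss)."""
    centroids = [tuple(c) for c in start]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wss

def elbow_curve(points, max_k):
    """Total within-cluster sum of squares for k = 1..max_k."""
    curve = {}
    for k in range(1, max_k + 1):
        start = [points[i * len(points) // k] for i in range(k)]
        curve[k] = kmeans(points, start)[1]
    return curve

# Made-up samples forming two well-separated groups in two scaled traits.
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
curve = elbow_curve(samples, 3)
labels_k2, _ = kmeans(samples, [samples[0], samples[3]])
```

In this example the sum of squares drops sharply from one to two clusters and only marginally from two to three, which is the "elbow" that points to two clusters.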

Estimation of Heritability

Broad-sense heritability was calculated as





$$h^2 = \frac{V_G}{V_P}$$

where $V_G$ was the genetic variance and $V_P$ was the total phenotypic variance. The phenotypic variance can be explicitly expressed as

$$V_P = V_G + \frac{V_{GL}}{l} + \frac{V_{GY}}{y} + \frac{V_{GLY}}{ly} + \frac{V_R}{rly}$$

where $V_{GL}$, $V_{GY}$, $V_{GLY}$, and $V_R$ represent the genotype-by-location, genotype-by-year, genotype-by-location-by-year, and residual variances, respectively. The numbers of locations, years, and replicates (within location and year) were represented by $l$, $y$, and $r$, respectively, and these values were input by the user. The variance components were estimated using the “VarCorr” function on the fitted linear mixed model “lmer(Trait ~ 1 + (1 | Genotype) + (1 | Year) + (1 | Location) + (1 | Genotype:Location) + (1 | Genotype:Year) + (1 | Genotype:Year:Location))” from the “lme4” library (Bates et al., 2015). Heritability values were rounded to two digits.
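Once the variance components are available, the heritability formula reduces to simple arithmetic. The following Python sketch is an illustration only; it assumes the components have already been estimated by a mixed model, and the function name and the numbers are hypothetical.

```python
def broad_sense_heritability(v_g, v_gl, v_gy, v_gly, v_r, l, y, r):
    """Broad-sense heritability from variance components.

    v_g, v_gl, v_gy, v_gly, v_r: genetic, genotype-by-location, genotype-by-year,
    genotype-by-location-by-year, and residual variances.
    l, y, r: numbers of locations, years, and replicates (within location and year).
    """
    v_p = v_g + v_gl / l + v_gy / y + v_gly / (l * y) + v_r / (r * l * y)
    return v_g / v_p

# Made-up components for a single-location, single-year experiment with
# nine replicates, in which the interaction variances are zero.
h2 = broad_sense_heritability(v_g=8.0, v_gl=0.0, v_gy=0.0, v_gly=0.0,
                              v_r=18.0, l=1, y=1, r=9)
```

Here the residual variance is divided by the nine replicates, so the phenotypic variance is 8 + 2 = 10 and the heritability is 0.8; more replicates shrink the residual term and raise the estimate.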

Quantile Regression

The quantile regression models for the traits of major interest were fitted using the “rq()” function from the “quantreg” library (Koenker, 2017) for different quantile levels. The estimated values of the regression coefficients and the P values were extracted from the summary of the fitted model using “summary(rq())$coefficients[, “Value”]” and “summary(rq())$coefficients[, “Pr(>|t|)”],” respectively.

The quantile plots (Agarwal et al., 2018) were produced by plotting the estimated regression coefficients against the quantile level using the “plot()” and “lines()” functions, and the legends were produced using the “legend()” function.
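Why regression coefficients can differ between quantiles is easiest to see on a toy example. The Python sketch below is not the algorithm behind “rq()”; it is a brute-force illustration for a single predictor, assuming that an optimal quantile-regression line passes through two of the data points, so that every such line can be scored with the check (pinball) loss. The function names and the data are hypothetical.

```python
from itertools import combinations

def pinball_loss(residual, tau):
    """Check loss: tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    return residual * (tau - (residual < 0))

def fit_quantile_line(x, y, tau):
    """One-predictor quantile regression by exhaustive search over point pairs."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair does not define a regression line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(pinball_loss(b - (intercept + slope * a), tau)
                   for a, b in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Made-up fan-shaped data: the smallest plants follow y = x and the largest
# plants follow y = 2x, so the effect of x grows towards the upper quantiles.
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5, 2, 4, 6, 8, 10]
low_intercept, low_slope = fit_quantile_line(x, y, tau=0.1)
high_intercept, high_slope = fit_quantile_line(x, y, tau=0.9)
```

In this example the fitted slope is 1 at the 0.1 quantile and 2 at the 0.9 quantile, which is the kind of difference that a single least-squares slope would average away.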

Supplemental Data

The following supplemental information is available.

Supplemental Figure S1. Effect of outliers identified using R2 on the dif-ferences observed between genotypes.

Supplemental Figure S2. Data curation assisted with visual inspection.

Supplemental Figure S3. Hypothesis testing including assumptions forANOVA.

Supplemental Figure S4. Hierarchical clusters validation.

Supplemental Dataset S1. A dataset from Awlia et al. (2016) is used to exhibit the functions of MVApp.

Supplemental Dataset S2. List of all the measured traits used for the data analysis in Supplemental Dataset S1.

Supplemental Dataset S3. The R² values for different models included in MVApp, reflecting their fit to the increase in rosette area under control and salt-stress conditions.

Supplemental Dataset S4. Outliers identified using the fit of quadratic function to rosette area in MVApp.

Supplemental Dataset S5. List of samples identified as outliers by the 1.5 × IQR method, based on all traits.

Supplemental Dataset S6. Correlation between individual traits measured for plants grown under salt-stress and control conditions.

Supplemental Dataset S7. Eigenvalues of the PCA using all measured traits and the curated dataset.

Supplemental Dataset S8. Contributions (%) of the individual measured traits to the PCs under control and salt-stress conditions.

Supplemental Dataset S9. Clustering of the phenotypes of nine Arabidopsis accessions using hierarchical clustering with the Ward method, and Area, FvFm, Lss4, and NPQ Lss4 as the major determinants.

Supplemental Dataset S10. Quantile regression of various traits of major interest, calculated using MVApp.

Supplemental Dataset S11. Broad-sense heritability estimations calculated for all individual traits by day of experiment and treatment for nine Arabidopsis accessions with an average of nine biological replicates.

ACKNOWLEDGMENTS

Figure 9 was produced by Ivan Gromicho, scientific illustrator at King Abdullah University of Science and Technology (KAUST). We thank Antonio Arena from Research Computing at KAUST for his help with putting MVApp on the server and making it accessible online; the KAUST IT Linux Systems Team, who provided the infrastructure for the online hosting of MVApp; and Veronica Tremblay, scientific editor at KAUST, for editing the article. Additionally, we thank Dr. Guillaume Lobet (Louvain/Jülich University), Dr. Sandra Schmöckel and Dr. Boubacar Kountche (KAUST), Prof. Julia Bailey-Serres (University of California-Riverside), and Dr. Nazgol Emrani (Kiel University) for their helpful comments on the MVApp design and functionality.

Received February 25, 2019; accepted April 30, 2019; published May 6, 2019.

LITERATURE CITED

Agarwal G, Saade S, Shahid M, Tester M, Sun Y (2018) Quantile function modeling applied to analysis of salinity tolerance of plants. OSF 10.17605/OSF.IO/WZHE7

Altman N, Krzywinski M (2016) Analyzing outliers: Influential or nuisance? Nat Methods 13: 281–282

Awlia M, Nigro A, Fajkus J, Schmoeckel SM, Negrão S, Santelia D, Trtílek M, Tester M, Julkowska MM, Panzarová K (2016) High-throughput non-destructive phenotyping of traits that contribute to salinity tolerance in Arabidopsis thaliana. Front Plant Sci 7: 1414

Bates D, Maechler M, Bolker B, Walker S (2015) Fitting linear mixed-effects models using lme4. J Stat Softw 67: 1–48

Chaloner K, Brant R (1988) A Bayesian approach to outlier detection and residual analysis. Biometrika 75: 651–659

Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J (2017) Shiny: Web application framework for R. R package version 1.0.4. https://CRAN.R-project.org/package=shiny

Charrad M, Ghazzali N, Boiteau V, Niknafs A (2014) NbClust: An R package for determining the relevant number of clusters in a data set. J Stat Softw 61: 1–36

Chen D, Fu L-Y, Hu D, Klukas C, Chen M, Kaufmann K (2018) The HTPmod Shiny application enables modeling and visualization of large-scale biological data. Commun Biol 1: 89

de Mendiburu F (2017) agricolae: Statistical procedures for agricultural research. R package version 1.2-8. https://CRAN.R-project.org/package=agricolae

Diaz-Garcia JA, González-Farías G (2004) A note on the Cook’s distance. J Stat Plan Inference 120: 119–136

Fahlgren N, Gehan MA, Baxter I (2015) Lights, camera, action: High-throughput plant phenotyping is ready for a close-up. Curr Opin Plant Biol 24: 93–99

Fox J, Weisberg S (2011) An R Companion to Applied Regression, 2nd Edition. SAGE, Thousand Oaks, CA

Heiberger RM (2018) HH: Statistical analysis and data display: Heiberger and Holland. R package version 3.1-35. https://CRAN.R-project.org/package=HH

Hojsgaard S, Halekoh U (2018) doBy: Groupwise statistics, lsmeans, linear contrasts, utilities. R package version 4.6-2. https://CRAN.R-project.org/package=doBy

Hothorn T, Bretz F, Westfall P (2008) Simultaneous inference in general parametric models. Biom J 50: 346–363

Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, et al (2008) Big data: The future of biocuration. Nature 455: 47–50

Kassambara A, Mundt F (2017) factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.5. https://CRAN.R-project.org/package=factoextra

Keitt T (2012) colorRamps: Builds color tables. R package version 2.3. https://CRAN.R-project.org/package=colorRamps

Koenker R (2017) quantreg: Quantile regression. R package version 5.34. https://CRAN.R-project.org/package=quantreg

Le S, Josse J, Husson F (2008) FactoMineR: An R package for multivariate analysis. J Stat Softw 25: 1–18

Lemon J (2006) Plotrix: A package in the red light district of R. R News 6: 8–12

Lever J, Krzywinski M, Altman N (2017) Points of significance: Principal component analysis. Nat Methods 14: 641–642

Leys C, Ley C, Klein O, Bernard P, Licata L (2013) Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psychol 49: 764–766

Li Y, Andrade J (2017) DEApp: An interactive web interface for differential expression analysis of next generation sequence data. Source Code Biol Med 12: 2

Marx V (2013) Biology: The big challenges of big data. Nature 498: 255–260

Matias FI, Granato I, Fritsche-Neto R (2018) Be-Breeder: An R/Shiny application for phenotypic data analyses in plant breeding. Crop Breed Appl Biotechnol 18: 241–243

McCarthy DJ, Campbell KR, Lun ATL, Wills QF (2017) Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33: 1179–1186

Miłobędzka A, Muszyński A (2017) Can DNA sequencing show differences between microbial communities in Polish and Danish wastewater treatment plants? Water Sci Technol 75: 1447–1454

Morota G (2017) ShinyGPAS: Interactive genomic prediction accuracy simulator based on deterministic formulas. Genet Sel Evol 49: 91

Nelson JW, Sklenar J, Barnes AP, Minnier J (2017) The START app: A web-based RNAseq analysis and visualization resource. Bioinformatics 33: 447–449

Neuwirth E (2014) RColorBrewer: ColorBrewer palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer

Nijs V (2018) radiant: Business Analytics using R and Shiny. R package version 0.9.0. https://github.com/radiant-rstats/radiant

R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Reiser L, Harper L, Freeling M, Han B, Luan S (2018) FAIR: A call to make published data more findable, accessible, interoperable, and reusable. Mol Plant 11: 1105–1108

Sellam V, Poovammal E (2016) Prediction of crop yield using regression analysis. Ind J Sci Tech 9 doi:10.17485/ijst/2016/v9i38/91714

Seren Ü, Vilhjálmsson BJ, Horton MW, Meng D, Forai P, Huang YS, Long Q, Segura V, Nordborg M (2012) GWAPP: A web application for genome-wide association mapping in Arabidopsis. Plant Cell 24: 4793–4805

Sievert C, Parmer C, Hocking T, Chamberlain S, Ram K, Corvellec M, Despouy P (2017) plotly: Create interactive web graphics via ‘plotly.js’. R package version 4.7.1. https://CRAN.R-project.org/package=plotly

Sitienei B, Juma S, Opere E (2017) On the use of regression models to predict tea crop yield responses to climate change: A case of Nandi East, sub-county of Nandi county, Kenya. Climate 5: 54

Spitzer M, Wildenhain J, Rappsilber J, Tyers M (2014) BoxPlotR: A web tool for generation of box plots. Nat Methods 11: 121–122

Suzuki R, Shimodaira H (2015) pvclust: Hierarchical clustering with p-values via multiscale bootstrap resampling. R package version 2.0-0. https://CRAN.R-project.org/package=pvclust

Warnes GR, Bolker B, Bonebakker L, Gentleman R, Liaw WHA, Lumley T, Maechler M, Magnusson A, Moeller S, Schwartz M, et al (2016) gplots: Various R programming tools for plotting data. R package version 3.0.1. https://CRAN.R-project.org/package=gplots

Wei T, Simko V (2017) R package “corrplot”: Visualization of a correlation matrix (Version 0.84). https://github.com/taiyun/corrplot

Wickham H (2007) Reshaping data with the Reshape package. J Stat Softw 21: 1–20

Wickham H (2009) ggplot2: Elegant Graphics for Data Analysis. Springer, New York

Wilkinson MD, et al (2016) Comment: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3: 160018

Xia J, Sinelnikov IV, Han B, Wishart DS (2015) MetaboAnalyst 3.0—making metabolomics more meaningful. Nucleic Acids Res 43: W251–W257

Xie Y, Cheng J, Tan X (2018) DT: A wrapper of the JavaScript Library ‘DataTables’. R package version 0.5. https://CRAN.R-project.org/package=DT

Xu L, Cruz JA, Savage LJ, Kramer DM, Chen J (2015) Plant photosynthesis phenomics data quality control. Bioinformatics 31: 1796–1804

Yano R, Nonaka S, Ezura H (2018) Melonet-DB, a grand RNA-seq gene expression atlas in melon (Cucumis melo L.). Plant Cell Physiol 59: e4–e4

Zhang J, Luo W, Zhao Y, Xu Y, Song S, Chong K (2016) Comparative metabolomic analysis reveals a reactive oxygen species-dominated dynamic model underlying chilling environment adaptation and tolerance in rice. New Phytol 211: 1295–1310

