
Item response models for the longitudinal analysis of health-related quality of life in cancer clinical trials

Antoine Barbieri*,1,2,3, Jean Peyhardi1,4,5, Thierry Conroy6,7, Sophie Gourgou2, Christian Lavergne3,8 and Caroline Mollevi2,9,10

Abstract

Statistical research regarding health-related quality of life (HRQoL) is a major challenge to better evaluate the impact of treatments on patients’ everyday life and to improve patients’ care. Among the models used for the longitudinal analysis of HRQoL, we focus on the mixed models from item response theory, which analyze the raw questionnaire data directly. Using a recent classification of generalized linear models for categorical data, we discuss a conceptual selection of these models for the longitudinal analysis of HRQoL in cancer clinical trials. On methodological and practical grounds, the adjacent and cumulative models seem particularly suitable for this context. In cancer clinical trials specifically, and for the comparison between two groups, the cumulative models have the advantage of providing intuitive illustrations of results. To complement the comparison studies already reported in the literature, a simulation study focusing on the random part of the mixed models is then carried out to compare the classically used linear mixed model with the item response models discussed. As expected, the item response models are more sensitive than the linear mixed model in detecting a random effect with low variance. Finally, a longitudinal analysis of HRQoL data from a cancer clinical trial is carried out using an item response cumulative model.

Keywords: Item response theory; Mixed models; Ordinal categorical data; Longitudinal analysis; Health-related quality of life.

1 Introduction

Endpoints refer to biological and clinical measures used to assess the efficacy of new therapeutic strategies. Overall survival is the gold-standard endpoint for showing a clinical benefit of these strategies and treatments. As treatments become more effective and prolong patients’ lives, the overall survival endpoint may become insufficient to show a significant difference between two treatments. It is then necessary to consider a longer follow-up or a larger cohort of patients to have a sufficient number

*Corresponding author: [email protected]
1 Université de Montpellier, Montpellier, France
2 Institut régional du Cancer Montpellier (ICM) - Val d’Aurelle, Biometrics Unit, Montpellier, France
3 Institut Montpelliérain Alexander Grothendieck (IMAG), Montpellier, France
4 CIRAD, AGAP and Inria, Virtual Plants, France
5 Institut de génomique fonctionnelle, Montpellier, France
6 National Quality of Life in Oncology Platform, France
7 Institut de Cancérologie de Lorraine, Nancy, France
8 Université Paul-Valéry Montpellier 3, Montpellier, France
9 IRCM, Institut de Recherche en Cancérologie de Montpellier, Montpellier, France
10 INSERM, U1194, Montpellier, France

arXiv:1611.06851v1 [stat.AP] 21 Nov 2016

of events and good statistical power (Fiteni et al., 2014), both of which entail considerable costs. Thus, to establish the benefit of a new treatment, other endpoints have emerged, and health-related quality of life (HRQoL) is currently one of the most important. In cancer clinical trials, patient-reported outcomes are increasingly used to assess clinical benefit for medical decision-making (Fiteni et al., 2014). The HRQoL endpoint may be more relevant for showing the value of a new therapy in some settings, such as palliative or geriatric care. However, conceptual and methodological obstacles underlie the concept and the assessment of HRQoL. First, HRQoL is a multidimensional concept covering physical, psychological and social functioning as well as symptoms associated with the disease and treatments. Another conceptual obstacle is the subjectivity of its measurement: patients report how they feel about their HRQoL through self-reported questionnaires. Both arguments preclude the use of HRQoL as the sole primary endpoint in clinical trials.

In oncology, HRQoL is assessed using a general questionnaire for a set of different cancers, and an additional specific questionnaire associated with each type of cancer (Aaronson et al., 1993; Cella et al., 1993). Each questionnaire decomposes HRQoL into several sub-concepts (dimensions of HRQoL), each of which comprises one or several items. The items are built on Likert scales, in which the response variable is ordinal. Thus, considering several items for a given dimension, HRQoL data are composed of multiple ordinal responses. In addition, the questionnaires are filled in by the subjects themselves and collected at different times defined in the trial protocol (usually at inclusion, during treatment and during follow-up). These repeated measures are used to assess the evolution of the subject’s HRQoL over time. In Europe, these questionnaires are developed and validated by the European Organisation for Research and Treatment of Cancer (EORTC). The standard questionnaire currently used in oncology is the EORTC Quality of Life Questionnaire - Core 30 (EORTC QLQ-C30) (Aaronson et al., 1993), together with the scoring procedure proposed by the EORTC (Fayers et al., 2001). A score is calculated for each dimension and for each subject; it corresponds to the average of the item responses for a single dimension, expressed on a scale ranging from 0 to 100. High functional scores reflect good functional capacities and a good HRQoL level; conversely, high symptomatic scores represent strong symptoms and point to difficulties. Scoring procedures are common in practice because the statistical methods for quantitative variables are more powerful and easier to implement and interpret (Grilli and Rampichini, 2011). However, in a Likert scale, the gap separating adjacent response categories ("not at all", "a little", "quite a bit" and "very much") may not be constant, and the HRQoL score calculation does not take this characteristic into account. Another drawback of the score is that subjects with different item outcomes can obtain the same score, in which case the score does not distinguish between these subjects (Gorter et al., 2015).

The longitudinal statistical models classically used in oncology are applied to the summary score, using linear mixed models (LMM) or time-to-event models (Anota et al., 2014). In the LMM, the variable associated with the HRQoL score is treated as Gaussian although it has the characteristics of an ordinal variable, being non-continuous and bounded. These models take into account the correlation introduced by repeated measurements on the same patient (collection of the HRQoL questionnaires over time) and different covariates such as time, treatment group or age. However, the use of the LMM for HRQoL analysis is scientifically questionable given the characteristics of the score. Furthermore, many symptomatic dimensions comprise only one item; the HRQoL score then has exactly the same properties as ordinal categorical data, and using the LMM is not appropriate. Moreover, if a ceiling or floor effect is observed, the categorical feature is even more marked, since one of the two extreme categories is over-represented. The second approach for the longitudinal analysis


of HRQoL is based on time-to-event models: the time to deterioration (TTD) and the time until definitive deterioration (TUDD) (Hamidou et al., 2011). Survival approaches are often used, and thus well known, in oncology, and are appreciated because their results are easy to interpret and well understood by clinicians. In these models, an event is classically defined by the (definitive or not) deterioration of the HRQoL score between baseline and a follow-up time, given a minimal clinically important difference (Anota et al., 2013). The lack of homogeneity of the methods used for HRQoL data analyses across oncology clinical trials is also a real obstacle to the comparison of results. Indeed, the LMM and TUDD approaches yield results that may sometimes seem contradictory, with different interpretations, although they may also be complementary. An example is provided by two similar cancer clinical trials investigating the effect of bevacizumab. In the first trial (Chinot et al., 2014), HRQoL analysis through TUDD showed that patients in the bevacizumab group had a later deterioration of HRQoL than patients in the standard group. Conversely, in the second trial (Gilbert et al., 2014), HRQoL analysis using the LMM showed a worse HRQoL over time in the bevacizumab group.

Interest in the HRQoL endpoint is growing rapidly in cancer clinical trials, and it is essential to use a suitable methodology to analyze HRQoL data, taking into account the data properties (repeated measurements of multiple ordinal responses). In our study, we first focused on the models best suited to analyzing HRQoL from raw data, i.e. directly from the item outcomes. The psychometric properties of questionnaires such as those used for HRQoL have long been studied (Edelen and Reeve, 2007; Jafari et al., 2012) within the framework known as item response theory (IRT). The IRT models link the individual’s item responses to the latent variable which represents the studied HRQoL concept. They can be seen as generalized linear mixed models (GLMM) for ordinal responses with a particular parameterization of the linear predictor. Interest in this kind of model for data analysis, including longitudinal analyses, is increasing (Titman et al., 2013; Hardouin et al., 2015; Gorter et al., 2015; Santos et al., 2016). However, to our knowledge, no work discusses the choice of one IRT model over the others, especially for the longitudinal analysis of HRQoL. First, we propose in section 2 a conceptual selection of these models through practical and methodological arguments. In this selection, we place IRT models in the GLMM framework using the new specification of generalized linear models (GLM) for categorical responses proposed by Peyhardi et al. (2015). Then, to complement the comparisons already made between the IRT models and the LMM regarding their capacity to detect fixed effects, we focus in section 3 on the sensitivity of these models in detecting random effects, through a simulation study. Finally, section 4 presents an application of the chosen IRT model to real data from a multicenter randomized phase III clinical trial in first-line metastatic pancreatic cancer patients (Conroy et al., 2011).

2 Conceptual selection of IRT models

This section concerns a conceptual selection of IRT models for the longitudinal analysis of HRQoL in cancer clinical trials. HRQoL raw data are repeated measurements of multiple ordinal responses. The GLMM for ordinal responses seem well suited to analyzing this kind of data. The use of random effects takes into account the inter-patient variability and the correlation between the repeated measurements for each patient. The IRT models have been increasingly used to analyze health data derived from self-administered questionnaires made of polytomous responses (Hardouin et al., 2012; Anota et al., 2014; Barbieri et al., 2015). These models turn out to be GLMM for polytomous data with a specific parameterization


of the linear predictor taking into account the multiple outcomes. For ordinal responses, three families of regression models are described: the families of adjacent models (Masters, 1982; Agresti, 2010), cumulative models (Samejima, 1968; McCullagh, 1980), and sequential models (Tutz, 1990; Fahrmeir and Tutz, 2001). Many IRT models are proposed for the analysis of this kind of data, often with no explanation regarding the choice of one model over another.

In this section, we use the new specification of the GLM for categorical data proposed by Peyhardi et al. (2015) to discuss the relevance of the models adopted in the context of the longitudinal analysis of HRQoL in cancer clinical trials. Each model is defined according to three components $(r, F, Z)$: the ratio of probabilities ($r$), the cumulative distribution function ($F$), and the parameterization of the linear predictor determined by the design matrix ($Z$). For the GLMM framework, we extended this specification to the quadruplet $(r, F, Z, U)$, with $Z$ the design matrix of fixed effects and $U$ the design matrix of random effects. The relationship between these components is determined by $R = F(Z\beta + U\xi)$. Given the linear predictor $\eta = Z\beta + U\xi$ and $\pi_{iv}^{(j)} = (\pi_{iv0}^{(j)}, \ldots, \pi_{iv,M-1}^{(j)})$ the truncated vector$^{1}$ of conditional probabilities, with $\pi_{ivm}^{(j)} = \Pr(Y_{iv}^{(j)} = m \mid \xi_i)$ the conditional probability that subject $i$ ($i = 1, \ldots, n$) selects the category $m \in \{0, \ldots, M_j\}$ for item $j$ ($j = 1, \ldots, J$) at visit $v$ ($v = 1, \ldots, V$) given the individual random effect, we defined:

• $R = \{ r_m(\pi_{iv}^{(j)}) \}_{i,j,v,m}$;

• $F\{(\eta_{ivm}^{(j)})_{i,j,v,m}\} = \{F(\eta_{ivm}^{(j)})\}_{i,j,v,m}$, where $\eta = (\eta_{ivm}^{(j)})_{i,j,v,m}$.

After presenting the IRT parameterization of the linear predictor, we compare different polytomous IRT models on the basis of the link function (ratio of probabilities and cumulative distribution function (CdF)) using methodological and practical arguments.

2.1 The IRT parameterization of the linear predictor

The IRT probabilistic models emerged following the work of Georg Rasch (Rasch, 1960) on dichotomous responses, and were then extended to ordinal responses. For the three families of adjacent, cumulative and sequential models, there are three associated well-known IRT models (Boeck and Wilson, 2004; Bacci et al., 2014), respectively the (generalized) partial credit model (Masters, 1982; Muraki, 1992), the graded response model (Samejima, 1968), and the sequential model (Tutz, 1990). These models link the individual’s item responses to the unidimensional latent variable which represents a concept that is not directly measurable. In the oncology framework, the concept is HRQoL with respect to one specific HRQoL dimension.

From the IRT, the specific parameterization of the linear predictor $\eta_{im}^{(j)}$ is built from two parts: the individual part and the item part. The best-known way is to consider the following decomposition:

$$\eta_{im}^{(j)} = \alpha_j (\theta_i - \delta_{jm}), \tag{1}$$

where $\theta_i$ is a unidimensional random variable (usually assumed to be standard normal for identifiability) representing the latent value for subject $i$, and $\delta_{jm}$ and $\alpha_j$ are the item parameters.

$^{1}$ This truncated vector is sufficient to characterize the categorical distribution since $\pi_{ivM}^{(j)} = 1 - \sum_{c=0}^{M-1} \pi_{ivc}^{(j)}$.


Generally called the difficulty parameter, $\delta_{jm}$ is the intercept (or threshold) associated with item $j$ for category $m \in \{1, \ldots, M_j\}$. The parameter $\alpha_j$ is called the discrimination parameter of item $j$, and represents the sensitivity of each response probability to the value of the latent trait. Indeed, the higher the discrimination parameter, the better the item discriminates between two individuals with close latent trait values. However, the predictor is no longer linear for IRT models using discrimination parameters because it includes a product of parameters. Thus, these models do not belong to the class of GLMM (Liu and Hedeker, 2006).
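The role of the discrimination parameter can be illustrated numerically. The following minimal sketch assumes a dichotomous item with the logistic CdF and the predictor of equation (1); the latent values, the difficulty and the three values of the discrimination parameter are hypothetical and chosen only for illustration.

```python
import math


def logistic_cdf(eta):
    """Logistic CdF, F(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def response_probability(theta, alpha, delta):
    """Probability F(alpha * (theta - delta)) for a dichotomous item."""
    return logistic_cdf(alpha * (theta - delta))


# Two hypothetical subjects with close latent values, one item with delta = 0.
theta_low, theta_high, delta = -0.25, 0.25, 0.0

gaps = {}
for alpha in (0.5, 1.0, 2.0):  # hypothetical discrimination values
    gaps[alpha] = (response_probability(theta_high, alpha, delta)
                   - response_probability(theta_low, alpha, delta))
    print(f"alpha = {alpha}: probability gap = {gaps[alpha]:.3f}")
```

In this sketch the gap between the two response probabilities grows with the discrimination value, which is the sense in which a larger value separates close latent values better.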

In oncology, HRQoL analysis is carried out using IRT models which do not include the discrimination parameters (fixed to one for all items). Thus, these IRT models are within the class of GLMM. Concerning the longitudinal analysis, several studies proposed to extend some IRT models using a linear decomposition of the latent variable $\theta$ with fixed and random effects (Hardouin et al., 2012; Verhagen and Fox, 2013; Huber et al., 2013):

$$\theta_{iv} = x_{iv}'\beta + u_v'\xi_i, \tag{2}$$

where the vector $\beta$ is associated with the fixed effects, the vector $\xi_i$ with the subject-specific random effects, and the index $v$ denotes the current visit. In equation (2), $\theta_{iv}$ is thus the value of the latent process at visit $v$.

2.2 The probability ratio: structure of the models

The linear predictor is not related directly to the response probabilities but to a particular transformation of them, the ratio of probabilities. The choice of the ratio depends on the nature of the response, through the ordering assumption among categories. The reference ratio (Peyhardi et al., 2015), intended for nominal responses, is therefore excluded from this work because the HRQoL responses are ordinal. Consider the simple GLM situation of one item with $(M + 1)$ response categories. The three model families for ordinal data are distinguished by the choice of the ratio of probabilities $r(\pi) = (r_0(\pi), \ldots, r_{M-1}(\pi))$. Each model is summarized by $M$ equations $\{r_m(\pi) = F(\eta^{*}_m)\}_{m=0,\ldots,M-1}$ with $\eta^{*}_m = \delta_m - \theta$, highlighting the decomposition of the link function, which is determined through the ratio of probabilities and the CdF. The ratios of probabilities for the three families are, for the cumulative models,

$$r_m(\pi) = \pi_0 + \ldots + \pi_m, \quad m = 0, \ldots, M-1; \tag{3}$$

for the adjacent models,

$$r_m(\pi) = \frac{\pi_m}{\pi_m + \pi_{m+1}}, \quad m = 0, \ldots, M-1; \tag{4}$$

and, for the sequential models,

$$r_m(\pi) = \frac{\pi_m}{\pi_m + \ldots + \pi_M}, \quad m = 0, \ldots, M-1.$$
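To make the three ratios concrete, the sketch below evaluates them on one probability vector. The function names and the example vector are illustrative assumptions; only the three formulas come from the text.

```python
def cumulative_ratios(pi):
    """r_m = pi_0 + ... + pi_m, for m = 0, ..., M-1 (equation 3)."""
    M = len(pi) - 1
    return [sum(pi[: m + 1]) for m in range(M)]


def adjacent_ratios(pi):
    """r_m = pi_m / (pi_m + pi_{m+1}), for m = 0, ..., M-1 (equation 4)."""
    M = len(pi) - 1
    return [pi[m] / (pi[m] + pi[m + 1]) for m in range(M)]


def sequential_ratios(pi):
    """r_m = pi_m / (pi_m + ... + pi_M), for m = 0, ..., M-1."""
    M = len(pi) - 1
    return [pi[m] / sum(pi[m:]) for m in range(M)]


# Hypothetical response probabilities for an item with M + 1 = 4 categories.
pi = [0.1, 0.2, 0.3, 0.4]
cumulative = cumulative_ratios(pi)
adjacent = adjacent_ratios(pi)
sequential = sequential_ratios(pi)
print(cumulative, adjacent, sequential)
```

For the last index, m = M - 1, the adjacent and sequential ratios coincide because both compare the last two categories only.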

In the IRT, adjacent and cumulative models are usually presented under the reverse permutation (Samejima, 1968; Masters, 1982; Bacci et al., 2014). This permutation is defined as the reversal of category order (McCullagh, 1980). Assuming that the considered CdF is symmetric (i.e. the corresponding probability density function is symmetric about the y-axis), these models are invariant under this permutation (Peyhardi et al., 2015). For our application context, this is an advantage for result


interpretation. A lower item-response category reflects a lower level of symptoms for the symptomatic dimensions, whereas it represents a higher level of capacity for the functional dimensions. Applying the reverse permutation to the functional dimensions makes it easier and more intuitive for clinicians to present their results. It homogenizes the interpretation of results, as is done in the scoring procedure proposed by the EORTC (for functional dimensions, the score scale is reversed compared with the order of the item response categories) (Fayers et al., 2001). However, sequential models correspond to process ordering, and reversing the process may change its nature. These models are not reversible (i.e. not invariant under the reverse permutation). Thus, sequential models will not be used, and only the adjacent and cumulative models, which correspond to scale ordering as used for HRQoL measurements, will be considered.

The cumulative models also have additional properties (McCullagh, 1980), including invariance when successive categories are merged. Thus, if one category is not observed, it can be combined with a successive category without changing the model. Another advantage of the cumulative models is their interpretation through a continuous latent variable. Indeed, a continuous latent variable $\tilde{Y}$ underlies the model, and it is linked directly to the response variable $Y$ through thresholds presumed to be strictly increasing ($-\infty = \delta_0 < \delta_1 < \ldots < \delta_M < \delta_{M+1} = +\infty$), such that:

$$\{Y = m\} \ \text{if} \ \{\delta_m < \tilde{Y} \le \delta_{m+1}\}, \quad m = 0, \ldots, M,$$

where $\tilde{Y} = \theta + \varepsilon$ and $\varepsilon$ is the error term distributed according to the CdF. Here, the latent variable $\tilde{Y}$ represents HRQoL, and its interpretation is then equivalent to the interpretation of the response variable in an LMM.
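The threshold mechanism described above can be checked by simulation. The sketch below draws the continuous latent variable underlying the cumulative model as the latent trait plus a standard logistic error, cuts it at the thresholds, and compares the observed category frequencies with the probabilities implied by the logistic CdF. The latent value, thresholds, seed and number of draws are hypothetical.

```python
import math
import random


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_logistic(rng):
    """Standard logistic draw by inverse transform of a uniform variate."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return math.log(u / (1.0 - u))


def categorize(y_latent, thresholds):
    """Return m such that delta_m < y_latent <= delta_{m+1}."""
    return sum(1 for delta in thresholds if y_latent > delta)


rng = random.Random(0)
theta = 0.5                    # hypothetical latent trait value
thresholds = [-1.0, 0.0, 1.5]  # hypothetical delta_1 < delta_2 < delta_3
n_draws = 100_000

counts = [0] * (len(thresholds) + 1)
for _ in range(n_draws):
    counts[categorize(theta + sample_logistic(rng), thresholds)] += 1
empirical = [c / n_draws for c in counts]

# Implied probabilities: Pr(Y <= m) = F(delta_{m+1} - theta).
cum = [logistic_cdf(d - theta) for d in thresholds] + [1.0]
theoretical = [cum[0]] + [cum[m] - cum[m - 1] for m in range(1, len(cum))]
print(empirical, theoretical)
```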

However, an advantage of the adjacent models is that no constraint affects the model estimation. The cumulative models, in contrast, must satisfy constraints that can make estimation difficult, particularly with a non-proportional design of the linear predictor (Peyhardi et al., 2015). For the proportional design, a common slope ($\theta$) is considered for all categories; otherwise the slope depends on the category ($\theta_m$). Consider the simple parameterization of the linear predictor $\eta_m = \theta_m - \delta_m$ for $m \in \{1, \ldots, M\}$, where $\theta_m$ and $\delta_m$ are the slope and the intercept associated with category $m$, respectively. Under the proportional design ($\theta = \theta_1 = \ldots = \theta_M$), the cumulative models rely on the principle of thresholds (McCullagh, 1980; Hedeker and Gibbons, 1994), which are constrained to be strictly increasing: $-\infty < \delta_1 < \ldots < \delta_M < +\infty$. For non-proportional odds models, the constraint becomes $-\infty < \eta_M < \ldots < \eta_1 < +\infty$, which is more difficult to verify. For the longitudinal analysis of HRQoL data in oncology, the proportional design is considered, so the constraint only has to be verified on the thresholds, which is easier.
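The effect of the ordering constraint can be seen on a small example. This sketch assumes the logistic CdF and the descending-order form of the cumulative probabilities given in equation (5); the thresholds and the category-specific slopes are hypothetical values chosen so that the non-proportional predictor violates the constraint.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_probabilities(etas):
    """pi_0, ..., pi_M from eta_1, ..., eta_M by successive differences of F."""
    F = [logistic_cdf(e) for e in etas]
    probs = [1.0 - F[0]]
    probs += [F[m] - F[m + 1] for m in range(len(F) - 1)]
    probs.append(F[-1])
    return probs


def satisfies_constraint(etas):
    """Check eta_M < ... < eta_1, i.e. strictly decreasing predictors."""
    return all(etas[m] > etas[m + 1] for m in range(len(etas) - 1))


deltas = [-1.0, 0.2, 1.4]  # hypothetical increasing thresholds

# Proportional design: one common slope, so ordered thresholds are enough.
theta = 0.3
etas_prop = [theta - d for d in deltas]

# Non-proportional design: hypothetical category-specific slopes.
thetas = [0.3, 1.8, 0.4]
etas_nonprop = [t - d for t, d in zip(thetas, deltas)]

probs_prop = cumulative_probabilities(etas_prop)
probs_nonprop = cumulative_probabilities(etas_nonprop)
print(satisfies_constraint(etas_prop), probs_prop)
print(satisfies_constraint(etas_nonprop), probs_nonprop)
```

With the hypothetical slopes above, the second predictor exceeds the first, so one of the computed "probabilities" is negative: the model is not defined at that point.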

Table 1 summarizes some characteristics of these three families of models which are important for the longitudinal analysis of HRQoL in cancer clinical trials. In this context, a proportional design of the linear predictor is preferred. Under this parameterization, the constraints of the cumulative models are easy to satisfy and the models are easy to estimate. The adjacent models seem more flexible than the cumulative models because they are always defined, whatever the linear predictor. However, results are more intuitive to interpret with the cumulative models than with the adjacent models.

2.3 The cumulative distribution function

The last component discussed in the IRT model selection is the CdF. Each model probability can be defined with any CdF. As commonly seen in the IRT models, and thanks to the reversibility property, the


Table 1: Summary of the characteristics for the three model families

| Characteristics | Adjacent | Cumulative | Sequential |
|---|---|---|---|
| Reversibility | yes | yes | no |
| Interpretation using the latent variable | no | yes | yes |
| Always defined | yes | yes (no$^{1}$) | yes |

$^{1}$: for some non-proportional design models

adjacent and cumulative models are used in descending order. For the cumulative model, the probabilities are defined from equation (3), given $F$, as:

$$\begin{cases} \pi_0 = 1 - F(\eta_1) \\ \pi_m = F(\eta_m) - F(\eta_{m+1}), \quad m = 1, \ldots, M-1 \\ \pi_M = F(\eta_M) \end{cases} \tag{5}$$

Then, the general expression of the sequential model, whatever the CdF used, is

$$\pi_m = F(\eta_m) \prod_{k=1}^{m-1} \{1 - F(\eta_k)\},$$

where $m = 1, \ldots, M$ and $\prod_{k=1}^{0} \{\cdot\} = 1$ (Fahrmeir and Tutz, 2001).

Such general equations have never been presented for adjacent models, which are only described with the logistic CdF (Masters, 1982; Muraki, 1992; Fahrmeir and Tutz, 2001; Agresti, 2010; Hardouin et al., 2012; Anota et al., 2014). However, the different response probabilities can be derived from the adjacent ratio for a general CdF ($F$):

$$\begin{cases} \pi_0 = \dfrac{1}{1 + \sum_{c=1}^{M} \prod_{k=1}^{c} \left( \dfrac{F(\eta_k)}{1 - F(\eta_k)} \right)} \\[3ex] \pi_m = \dfrac{\prod_{k=1}^{m} \left( \dfrac{F(\eta_k)}{1 - F(\eta_k)} \right)}{1 + \sum_{c=1}^{M} \prod_{k=1}^{c} \left( \dfrac{F(\eta_k)}{1 - F(\eta_k)} \right)}, \quad m = 1, \ldots, M \end{cases} \tag{6}$$
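Equation (6) translates directly into a few lines of code. The sketch below is one possible implementation for an arbitrary CdF passed as a function, with the logistic CdF as the default; the predictor values are hypothetical. It then checks that the adjacent ratios of the resulting probabilities, taken in descending order, recover the CdF values.

```python
import math


def logistic_cdf(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def adjacent_model_probs(etas, cdf=logistic_cdf):
    """pi_0, ..., pi_M from eta_1, ..., eta_M following equation (6)."""
    odds_products = []
    running = 1.0
    for eta in etas:
        f = cdf(eta)
        running *= f / (1.0 - f)       # product of F / (1 - F) up to category m
        odds_products.append(running)
    denom = 1.0 + sum(odds_products)
    return [1.0 / denom] + [p / denom for p in odds_products]


etas = [1.2, 0.1, -0.9]                # hypothetical eta_1, eta_2, eta_3
probs = adjacent_model_probs(etas)

# Descending-order adjacent ratio: pi_m / (pi_{m-1} + pi_m) should equal F(eta_m).
recovered = [probs[m] / (probs[m - 1] + probs[m]) for m in range(1, len(probs))]
print(probs)
print(recovered, [logistic_cdf(e) for e in etas])
```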

The choice of the CdF mainly serves to best fit the data. Let us consider four CdFs of two different kinds: the most commonly used symmetric distributions, the logistic and Gaussian distributions, and two asymmetric distributions, the Gumbel min and Gumbel max distributions. The latter two are defined by $F(\eta) = \exp(-\exp(-\eta))$ for the Gumbel max distribution and by $F(\eta) = 1 - \exp(-\exp(\eta))$ for the Gumbel min distribution.
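The four CdFs can be written with the standard library alone. The sketch below also checks numerically which of them satisfy the symmetry property $F(-\eta) = 1 - F(\eta)$; the helper name `is_symmetric` and the evaluation grid are illustrative assumptions.

```python
import math


def logistic_cdf(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def gaussian_cdf(eta):
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def gumbel_max_cdf(eta):
    return math.exp(-math.exp(-eta))


def gumbel_min_cdf(eta):
    return 1.0 - math.exp(-math.exp(eta))


def is_symmetric(cdf, grid, tol=1e-12):
    """Check F(-x) = 1 - F(x) on a grid of points."""
    return all(abs(cdf(-x) - (1.0 - cdf(x))) < tol for x in grid)


grid = [0.0, 0.5, 1.0, 2.0]  # hypothetical evaluation points
symmetry = {
    "logistic": is_symmetric(logistic_cdf, grid),
    "gaussian": is_symmetric(gaussian_cdf, grid),
    "gumbel_max": is_symmetric(gumbel_max_cdf, grid),
    "gumbel_min": is_symmetric(gumbel_min_cdf, grid),
}
print(symmetry)
```

The two Gumbel CdFs are mirror images of each other, since $1 - \exp(-\exp(\eta))$ is one minus the Gumbel max CdF evaluated at $-\eta$.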

Figure 1a shows different slopes depending on the particular CdF. The CdF determines how a change in the linear predictor ($\eta$) affects the response probability. In the general IRT parameterization (equation 1), the slope adjustment is handled by the discrimination parameter. Figure 1b presents the logistic CdF as a function of the individual latent variable for different values of the discrimination parameter. This item parameter adjusts the CdF for each item so as to better distinguish the different responses.


(a) CdF (b) CdF adjustment

Figure 1: Relationship between the CdF and the IRT parameterization for dichotomous items, where $F(\eta) = F(\alpha_j(\theta - \delta_j))$. Figure 1a presents the different CdFs for one item $j$ given $\alpha_j = 1$ and $\delta_j = 0$. Figure 1b presents the logistic CdF adjustment for three items ($j = 1, 2, 3$) with different $\alpha_j$ and $\delta_j = 0$, given the linear predictor defined in equation (1).

In the literature, the cumulative model is presented with several of the previously mentioned CdFs (Samejima, 1968; Fahrmeir and Tutz, 2001; Liu and Hedeker, 2006), while the adjacent models are most often presented with the logistic CdF. However, for both, we recommend the use of a symmetric CdF. As seen in Figure 2b, the IRT parameterization is a subtle and effective way to take into account the characteristics of multiple item outcomes in GLMM for categorical responses. In the context of HRQoL in clinical trials, an HRQoL dimension comprises a small set of items which are correlated and measure a unique latent variable. The discrimination parameter is routinely not used in this kind of analysis.

Relative to the literature, Table 2 shows the specification of the well-known polytomous IRT models in terms of the different components. For IRT models within the class of GLMM, we propose to define them with the four components $(r, F, Z_p, U_r)$. The index $p$ indicates the kind of location item parameters considered: $p = 1$ when only difficulty parameters are included, and $p = 2$ for the rating scale model (Andrich, 1978) parameterization, in which the difficulty parameters are common to all items and one shift parameter is considered for each item. Regarding the random part, the index $r$ indicates the number of random effects. For the classical IRT parameterization presented in Table 2, only one random effect ($r = 1$) is taken into account: the capacity parameter $\theta$. For IRT models including a discrimination parameter for each item, we propose to replace the components $Z$ and $U$ by a component specifying that the predictor is no longer linear (nl), giving $(r, F, \text{nl})$.


Table 2: Specification of the well-known IRT models in terms of the components: $(r, F, Z_p, U_r)$ for the GLMM and $(r, F, \text{nl})$ for IRT models whose predictor is no longer linear. Index $p$ denotes the number of kinds of item parameters considered in the IRT model and $r$ the number of random effects.

| IRT models | $\eta_{im}^{(j)}$ | $(r, F, Z_p, U_r)$ |
|---|---|---|
| Rating scale model | $\theta_i - (\delta_m + \tau_j)$ | (adjacent, logistic, $Z_2$, $U_1$) |
| Partial credit model | $\theta_i - \delta_{jm}$ | (adjacent, logistic, $Z_1$, $U_1$) |
| Sequential Rasch model | $\theta_i - \delta_{jm}$ | (sequential, logistic, $Z_1$, $U_1$) |
| Graded response model | $\alpha_j(\theta_i - \delta_{jm})$ | (cumulative, logistic, nl) |
| Generalized partial credit model | $\alpha_j(\theta_i - \delta_{jm})$ | (adjacent, logistic, nl) |

3 Simulation study

In the previous section, we focused on the use of mixed models for ordinal data analysis and discussed their relevance for HRQoL analysis in oncology. Some studies comparing these different approaches exist (Blanchin et al., 2011; Anota et al., 2014; Barbieri et al., 2015), mainly regarding the fixed part of the mixed models, to identify the trend of the latent trait. Anota et al. (2014) showed an equivalent capacity to detect a fixed effect for the LMM and for one of the IRT models. Indeed, even though the LMM is based on the HRQoL score, which is a summary variable, this approach is at least equivalent to the IRT models in terms of power.

In this simulation study, the adjacent and cumulative models with the same parameterization of the linear predictor and the logistic CdF were used (as is usual in IRT models). The aim of this section is to extend these comparisons between the LMM and the IRT models to the random part of the mixed models. The datasets were simulated from an IRT model (adjacent or cumulative). Regarding the parameterization, two subject-specific random effects $\xi_{i0}$ and $\xi_{i1}$ were considered, associated with the intercept and the slope, respectively. Of course, the usefulness of introducing random effects in the model depends strongly on the observed data. HRQoL is a subjective endpoint, and the individual random effect $\xi_{i0}$ is thus entirely justified: it is easy to imagine that each patient has a different level of HRQoL at baseline. The random slope is more questionable, since the assumption that the HRQoL evolution of a single patient diverges from the average evolution of the whole population is less obvious. In this section, the capacity of the mixed models to detect the slope random effect was thus studied. No group effect was considered in this simulation study.

Design

We want to study the capacity of each model to detect the random effect $\xi_{i1}$ associated with time (random slope). The two subject-specific random effects are considered independent, with $\xi_{i0} \sim \mathcal{N}(0, \sigma_0^2)$ and $\xi_{i1} \sim \mathcal{N}(0, \sigma_1^2)$. The model choice is based on the Bayesian information criterion (BIC), with two models considered: $\mathcal{M}_2$ with the two random effects $(r, F, Z_1, U_2)$ and $\mathcal{M}_1$ excluding the random slope $(r, F, Z_1, U_1)$. For the IRT models, the linear decomposition of the latent trait $\theta_{iv}$ only took time into account as a fixed effect. The two considered models with proportional design are:

$$\begin{aligned} \mathcal{M}_2 &: \theta_{iv} = (t_v - t_0)\beta_1 + \xi_{i0} + (t_v - t_0)\xi_{i1} \\ \mathcal{M}_1 &: \theta_{iv} = (t_v - t_0)\beta_1 + \xi_{i0} \end{aligned} \tag{7}$$
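The two decompositions in equation (7) can be simulated as follows. This is a minimal sketch, not the code used in the study: the visit times (taken here as 0, ..., 7), the standard deviations of the random effects and the seed are assumptions made for illustration, while the value -0.3 for the time coefficient is one of the values discussed in the results.

```python
import random


def simulate_latent_trait(n_subjects, times, beta1, sd0, sd1, rng):
    """Simulate theta_iv = (t_v - t_0) * beta1 + xi_i0 + (t_v - t_0) * xi_i1.

    Setting sd1 = 0 removes the random slope, which gives model M1.
    """
    t0 = times[0]
    theta = []
    for _ in range(n_subjects):
        xi0 = rng.gauss(0.0, sd0)                       # random intercept
        xi1 = rng.gauss(0.0, sd1) if sd1 > 0 else 0.0   # random slope
        theta.append([(t - t0) * beta1 + xi0 + (t - t0) * xi1 for t in times])
    return theta


rng = random.Random(1)
times = list(range(8))  # assumed visit times t_0, ..., t_7

# Hypothetical standard deviations for the two random effects.
theta_m2 = simulate_latent_trait(300, times, beta1=-0.3, sd0=1.0, sd1=0.5, rng=rng)
theta_m1 = simulate_latent_trait(300, times, beta1=-0.3, sd0=1.0, sd1=0.0, rng=rng)

slopes_m1 = [(row[-1] - row[0]) / (times[-1] - times[0]) for row in theta_m1]
slopes_m2 = [(row[-1] - row[0]) / (times[-1] - times[0]) for row in theta_m2]
print(min(slopes_m1), max(slopes_m1))
print(min(slopes_m2), max(slopes_m2))
```

Under M1 every subject follows the same slope and only the baseline level differs; under M2 the individual slopes spread around the common coefficient.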

In order to best reflect the EORTC QLQ-C30 questionnaire, the simulation study was designed around the most frequent kind of HRQoL dimension, with two items ($j = 1, 2$) comprising four response categories ($m \in \{0, \ldots, M\}$ with $M = 3$). A sample size of three hundred subjects ($i = 1, \ldots, n$ with $n = 300$) and eight follow-up times ($v = 0, \ldots, 7$), as in the trial presented in section 4, were considered. The datasets were simulated from a multinomial distribution. The response probabilities $\{\pi_{ivm}^{(j)} = \Pr(Y_{iv}^{(j)} = m \mid \theta_{iv}, \delta_j)\}$ of subject $i$ for item $j$ were determined by equation (6) for the adjacent model and by equation (5) for the cumulative model, given:

• the item parameters $(\delta_{j1}, \delta_{j2}, \delta_{j3})_{j=1,2}$;

• the latent trait ($\theta_{iv}$) deduced in accordance with equation (7);

• the logistic CdF,

$$F(\eta_{ivm}^{(j)}) = \frac{\exp(\eta_{ivm}^{(j)})}{1 + \exp(\eta_{ivm}^{(j)})},$$

where $\eta_{ivm}^{(j)} = \theta_{iv} - \delta_{jm}$.
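The three ingredients listed above can be combined into a small response generator. The sketch below covers the cumulative model only (equation (5) with the logistic CdF) and draws one response per item for a single subject and visit. The difficulty values are the "far" values reported in Table 3; the latent trait value, the seed and the function names are hypothetical.

```python
import math
import random


def logistic_cdf(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def cumulative_model_probs(theta, deltas):
    """Equation (5) with eta_m = theta - delta_m and the logistic CdF."""
    F = [logistic_cdf(theta - d) for d in deltas]
    return [1.0 - F[0]] + [F[m] - F[m + 1] for m in range(len(F) - 1)] + [F[-1]]


def draw_response(probs, rng):
    """Draw one category in {0, ..., M} with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(2)
# "Far" difficulty parameters of the two items, as reported in Table 3.
item_deltas = {1: (-2.1, 1.0, 2.75), 2: (-1.25, 1.4, 3.3)}
theta_iv = 0.0  # hypothetical latent trait value for one subject at one visit

item_probs = {j: cumulative_model_probs(theta_iv, d) for j, d in item_deltas.items()}
responses = {j: draw_response(p, rng) for j, p in item_probs.items()}
print(item_probs)
print(responses)
```

Repeating the draw over subjects and visits, with the latent trait taken from equation (7), would give a dataset of the kind described in the text.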

The parameter values were derived from the pain symptom data of the clinical trial presented in section 4. We considered two kinds of difficulty parameters: near, $\delta^{ne} = (\delta_1^{ne}, \delta_2^{ne})$, and far, $\delta^{fa} = (\delta_1^{fa}, \delta_2^{fa})$. These parameter values were chosen to illustrate the scenarios described in Table 3. The scenarios arise from the different combinations of the model used to simulate the data, $(\text{adjacent}, \text{logistic}, Z_1, U_r)_{r=1,2}$ or $(\text{cumulative}, \text{logistic}, Z_1, U_r)_{r=1,2}$, and the values considered for the difficulty parameters. Table 3 shows the simulated responses expected at baseline ($t = 0$). The responses simulated over time depended on the coefficient $\beta_1$ considered. Each scenario was simulated $N = 500$ times.

Table 3: Values of difficulty parameters used to simulate the data and expected responses at $t_0$ under each studied scenario.

| Models $(r, F, Z_1, U_r)_{r=1,2}$ | $\delta_1^{ne} = (-1.6, 1, 1.45)$, $\delta_2^{ne} = (-0.8, 1.15, 1.9)$ | $\delta_1^{fa} = (-2.1, 1, 2.75)$, $\delta_2^{fa} = (-1.25, 1.4, 3.3)$ |
|---|---|---|
| $(\text{adjacent}, \text{logistic}, Z_1, U_r)_{r=1,2}$ | balanced responses | focus on center categories (1 and 2) |
| $(\text{cumulative}, \text{logistic}, Z_1, U_r)_{r=1,2}$ | focus on extreme categories (0, 1 and 3) | balanced responses |

Concerning the LMM, the scoring procedure proposed by the EORTC was applied (Fayers et al., 2001), and the score associated with a symptomatic dimension was first calculated from the simulated data. Given the two simulated ordinal outcomes $y_{iv}^{(1)}$ and $y_{iv}^{(2)}$ of individual $i$ at visit $v$, the related score was:

$$S_{iv} = \left( \frac{\sum_{j=1}^{J=2} y_{iv}^{(j)}}{2} \right) \frac{100}{M}$$
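The score formula above is easy to reproduce. The sketch below applies it to two hypothetical response patterns for a two-item dimension with categories coded 0 to 3, and illustrates the drawback noted in the introduction: different item outcomes can lead to the same score. The function name is an assumption made for the example.

```python
def symptom_score(item_responses, max_category):
    """Mean of the item responses rescaled to 0-100 (categories coded 0..M)."""
    mean_response = sum(item_responses) / len(item_responses)
    return mean_response * 100.0 / max_category


M = 3  # four response categories coded 0, 1, 2, 3

score_a = symptom_score([0, 3], M)  # very different answers to the two items
score_b = symptom_score([1, 2], M)  # close answers to the two items
print(score_a, score_b)
```

Both patterns sum to 3, so they receive the same score even though the underlying item outcomes differ.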


Similarly to the parameterization in equation (7), we considered the choice between the following two models:

$$M_2: \; S_{iv} = \beta^{l}_{0} + (t_v - t_0)\,\beta^{l}_{1} + \xi_{i0} + (t_v - t_0)\,\xi_{i1} + \varepsilon_{iv}$$

$$M_1: \; S_{iv} = \beta^{l}_{0} + (t_v - t_0)\,\beta^{l}_{1} + \xi_{i0} + \varepsilon_{iv}$$

where βl0 is the fixed parameter associated with the intercept, ξi0 and ξi1 are the random effects, normally distributed with mean zero, and εiv ∼ N(0, σ2ε) is the error term.
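In this simulation study, the choice between M1 and M2 relies on the BIC (see the captions of Tables 4 and 5). A minimal sketch of that selection rule is given below. The log-likelihoods and parameter counts are hypothetical placeholders, not results from the paper, and the sample size entering the penalty (number of observations versus number of subjects) is left as an explicit argument because conventions differ between software packages.

```python
import math


def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: -2 log L + k log(n)."""
    return -2.0 * log_likelihood + n_parameters * math.log(n_observations)


def select_model(fits, n_observations):
    """Return the name of the model with the smallest BIC.

    fits maps a model name to a (log_likelihood, n_parameters) pair.
    """
    return min(fits, key=lambda name: bic(fits[name][0], fits[name][1], n_observations))


# Hypothetical fits: M2 adds a slope variance and a covariance to M1.
fits = {"M1": (-4210.0, 4), "M2": (-4205.0, 6)}
print(select_model(fits, n_observations=800))
```

In this made-up example the gain in log-likelihood of M2 does not offset the penalty for its two extra parameters, so the random slope would be counted as "not detected".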

Results

Table 4 shows the capacity of the three models (adjacent model, cumulative model and LMM) to detect the random slope under the different scenarios (Table 1). When the data were simulated under M2 with the random effect variances estimated from real data, each model detected the random slope (ξi1) in 100% of cases in every situation. Conversely, under M1, the simulated model M1 was correctly chosen in most cases. Across the simulations under M1, the cumulative model tended to detect the random slope although it was not included in the simulation step. Moreover, the IRT model that was not used to simulate the data wrongly detected this random effect given a negative value of β1 and the difficulty parameters δne. This could be explained by the fact that the difficulty parameters were not evenly spread around zero and were too close to one another. Indeed, given β1 < 0, the probabilities of observing the upper categories were very small over time, and these categories were under-represented in comparison with the lower ones (as illustrated in Figure 2a). In the specific case where β1 = −0.3, the IRT model that did not simulate the data could not explain the different outcomes with the fixed effect and the random intercept alone, and it compensated for the lack of information with the random slope. We could then expect symmetric results for positive values of β1 with difficulty parameters of the opposite sign, because of the reversibility property of the IRT models.

By contrast, the LMM was stable and thus selected the correct model whatever the β1 value and the IRT model used to simulate the data. For the IRT models, when only one of the two detected the random effect ξi1, the most suitable model seemed to be the one that did not detect it.

The capacity of the different models to detect the random slope as its variance changes is presented in Table 5. All models were sensitive to the signal-to-noise ratio: the more β1 increased, the less information the random effect provided. Accordingly, when the signal was strong, the random effect was detected only for greater variances; in this case, the signal provided the essential information explaining the different responses. In the model comparison, the LMM was less sensitive than the IRT models: whatever the β1 value, it detected the random slope only for a greater variance. This result was expected because the LMM is based on the HRQoL score, a summary variable carrying less information than the raw data. Thus, the IRT models are more sensitive in all cases. Comparing the two IRT models, there was a tendency for the random slope model to be preferred under the cumulative model, regardless of whether it was the true model or not. By contrast, in the specific case where β1 = −0.3, the IRT model used to simulate the data was less efficient than the other IRT model, which detected a random slope to remedy the lack of information. This was coherent with our previous results shown in Table 4. Finally, the closer β1 was to zero, the lower the variance at which the models detected the random slope.


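Table 4: Frequency (on N = 500 datasets) of the M1 selection according to the BIC, given tv = (0, 1, 2, 4, 6, 8, 10, 12) and σ₀² = 1.5. For r = 1, 2, the (adjacent,logistic,Z1, Ur) models and the (cumulative,logistic,Z1, Ur) models are denoted by AM and CM, respectively. For the random component, U1 is used if σ₁² = 0 and U2 if σ₁² > 0. Column groups are the simulated scenarios; within each group, LMM, AM and CM are the fitted models.

| σ₁² | β1 | AM using δne: LMM | AM using δne: AM | AM using δne: CM | CM using δfa: LMM | CM using δfa: AM | CM using δfa: CM | CM using δne: LMM | CM using δne: AM | CM using δne: CM | AM using δfa: LMM | AM using δfa: AM | AM using δfa: CM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2 | −0.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.2 | 0.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | −0.5 | 97.67 | 99.29 | 56.49 | 100 | 94.63 | 92.98 | 100 | 61.33 | 95.71 | 100 | 99.66 | 89.54 |
| 0 | −0.3 | 99.00 | 100 | 33.04 | 100 | 88.63 | 93.30 | 100 | 36.33 | 94.91 | 100 | 100 | 83.33 |
| 0 | −0.2 | 100 | 99.62 | 49.28 | 100 | 94.56 | 93.81 | 100 | 71.67 | 95.77 | 100 | 99.64 | 79.02 |
| 0 | −0.1 | 98.67 | 95.65 | 94.78 | 100 | 98.65 | 89.62 | 100 | 98.98 | 90.41 | 100 | 100 | 88.10 |
| 0 | 0.0 | 95.60 | 100 | 94.55 | 99.00 | 99.66 | 91.75 | 99.00 | 99.66 | 89.71 | 97.00 | 99.66 | 94.42 |
| 0 | 0.1 | 83.00 | 100 | 94.78 | 93.33 | 100 | 92.63 | 97.00 | 100 | 90.91 | 87.33 | 100 | 94.69 |
| 0 | 0.3 | 98.33 | 99.64 | 90.61 | 100 | 99.64 | 89.05 | 100 | 100 | 93.67 | 100 | 99.65 | 93.78 |
| 0 | 0.5 | 100 | 100 | 94.29 | 100 | 99.32 | 94.71 | 100 | 100 | 97.61 | 100 | 100 | 97.19 |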

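Table 5: Frequency (on N = 500 datasets) of the M2 selection according to the BIC, given tv = (0, 1, 2, 4, 6, 8, 10, 12) and σ₀² = 1.5. For r = 1, 2, the (adjacent,logistic,Z1, Ur) models and the (cumulative,logistic,Z1, Ur) models are denoted by AM and CM, respectively. For the random component, U1 is used if σ₁² = 0 and U2 if σ₁² > 0. Column groups are the simulated scenarios; within each group, LMM, AM and CM are the fitted models.

| β1 | σ₁² | AM using δne: LMM | AM using δne: AM | AM using δne: CM | CM using δfa: LMM | CM using δfa: AM | CM using δfa: CM | CM using δne: LMM | CM using δne: AM | CM using δne: CM | AM using δfa: LMM | AM using δfa: AM | AM using δfa: CM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0 | 2.33 | 24.92 | 0 | 5.03 | 6.94 | 0 | 2.69 | 3.70 | 0.33 | 6.44 | 24.75 |
| 1 | 0.02 | 0 | 21.40 | 54.67 | 0 | 37.58 | 44.11 | 0 | 17.73 | 18.12 | 0 | 50.00 | 77.00 |
| 1 | 0.03 | 0 | 61.00 | 90.97 | 0 | 75.67 | 80.00 | 0 | 41.33 | 45.58 | 0 | 86.33 | 98.33 |
| 1 | 0.05 | 0 | 97.67 | 99.66 | 0 | 100 | 100 | 0.33 | 89.00 | 90.00 | 0 | 99.33 | 100 |
| 1 | 0.2 | 39.33 | 100 | 100 | 40.67 | 100 | 100 | 10.67 | 100 | 100 | 57.67 | 100 | 100 |
| 1 | 0.5 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 0.3 | 0.002 | 16.67 | 6.33 | 21.40 | 0 | 2.03 | 3.97 | 0 | 3.06 | 3.94 | 11.00 | 11.04 | 15.25 |
| 0.3 | 0.005 | 72.33 | 86.33 | 92.67 | 30.67 | 55.33 | 59.00 | 0 | 32.33 | 46.00 | 85.67 | 87.33 | 91.67 |
| 0.3 | 0.008 | 97.67 | 100 | 100 | 86.00 | 97.3 | 98.00 | 4.00 | 76.33 | 88.33 | 99.33 | 99.67 | 100 |
| 0.3 | 0.01 | 100 | 100 | 100 | 96.33 | 99.67 | 99.33 | 17.33 | 94.00 | 97.00 | 100 | 100 | 100 |
| 0.3 | 0.02 | 100 | 100 | 100 | 100 | 100 | 100 | 96.67 | 100 | 100 | 100 | 100 | 100 |
| −0.3 | 0.002 | 0.67 | 4.36 | 61.43 | 0 | 54.00 | 5.09 | 0 | 93.33 | 1.82 | 0 | 2.07 | 18.57 |
| −0.3 | 0.005 | 5.67 | 62.33 | 79.00 | 0 | 95.67 | 40.40 | 0 | 99.67 | 33.22 | 0 | 56.00 | 48.33 |
| −0.3 | 0.008 | 23.67 | 96.33 | 97.33 | 0 | 100 | 86.67 | 0 | 100 | 82.67 | 1.67 | 96.33 | 86.33 |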

4 Application to HRQoL data

In this section, we performed a longitudinal analysis of the HRQoL data from a multicenter randomized phase III clinical trial in first-line metastatic pancreatic cancer patients: PRODIGE4/ACCORD11 (Conroy et al., 2011). Three hundred and forty-two patients were randomly assigned to the Folfirinox (experimental arm) or Gemcitabine (control arm) regimen. The detailed inclusion and exclusion criteria, the study design and protocol, the treatment, the compliance with the questionnaires, and the HRQoL analyses have been published previously (Conroy et al., 2011; Gourgou-Bourgade et al., 2012; Barbieri et al., 2015). The patients completed the EORTC QLQ-C30 questionnaire themselves at the follow-up times defined in the protocol: at baseline, day 15, day 30, and at months 2, 4, 6, 8, and 10. These repeated measurement times reflected the longitudinal aspect of the HRQoL and allowed the change in HRQoL to be assessed for each dimension.

Analyses were performed using the SAS software (version 9.3) (Institute, 2011; Boeck and Wilson, 2004). For the reasons given above, the cumulative models are preferred for the longitudinal analysis of HRQoL. We therefore considered the (cumulative,logistic,Z1, U2) model to analyze the data. In HRQoL studies in oncology, the analysis is carried out for each HRQoL dimension. Given one HRQoL dimension with a few correlated items, the discrimination parameters could be considered equal to one for each item. The distinction between the multiple-item responses is then achieved only through the difficulty parameters (thresholds) (Anota et al., 2014; Barbieri et al., 2015). For the HRQoL longitudinal analysis with subject i (i = 1, . . . , 342), visit v (v = 1, . . . , 8) and item j with Mj response categories, the (cumulative,logistic,Z1, U2) model is defined by:

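$$\Pr\left(Y^{(j)}_{iv} \geq m \mid \theta_{i}\right) = \frac{\exp\left(\eta^{(j)}_{ivm}\right)}{1 + \exp\left(\eta^{(j)}_{ivm}\right)}, \tag{8}$$

with the following linear predictor considered in the analyses:

$$\begin{aligned}
\eta^{(j)}_{ivm} &= \theta_{iv} - \delta_{jm} \\
\theta_{iv} &= g_i\,\beta_1 + (t_v - t_0)\,\beta_2 + g_i\,(t_v - t_0)\,\beta_3 + \xi_{i0} + (t_v - t_0)\,\xi_{i1}
\end{aligned} \tag{9}$$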

where:

• δjm is the difficulty parameter (threshold) associated with the category m of item j;

• tv is the date of the visit v, and t0 is the date of baseline;

• gi = 1 if the patient i belongs to the experimental group (Folfirinox), gi = 0 if the patient i belongs to the control group (Gemcitabine);

• β1 is the effect difference at baseline between Folfirinox and Gemcitabine groups;

• β2 is the slope (evolution) of health-related quality-of-life perception for the Gemcitabine group;

• β2+β3 is the slope (evolution) of health-related quality-of-life perception for the Folfirinox group;

• ξi0 and ξi1 are respectively the subject-specific random effects associated with the intercept and the slope, such that (ξi0, ξi1)′ ∼ N(0, Σ), Σ being the unstructured covariance matrix.
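As a worked step, the probability of each response category follows from equation (8) by differencing consecutive cumulative probabilities. The sketch below assumes categories coded 0, . . . , Mj − 1 (as for the four-category items coded 0 to 3 used in this paper), the conventions Pr(Y ≥ 0) = 1 and Pr(Y ≥ Mj) = 0, and writes π for the category probabilities and F for the logistic CdF:

$$\begin{aligned}
\pi^{(j)}_{iv0} &= 1 - F\left(\theta_{iv} - \delta_{j1}\right), \\
\pi^{(j)}_{ivm} &= F\left(\theta_{iv} - \delta_{jm}\right) - F\left(\theta_{iv} - \delta_{j,m+1}\right), \qquad m = 1, \ldots, M_j - 2, \\
\pi^{(j)}_{iv,M_j-1} &= F\left(\theta_{iv} - \delta_{j,M_j-1}\right).
\end{aligned}$$

These quantities are valid probabilities only when the thresholds are ordered, δj1 ≤ . . . ≤ δj,Mj−1, which is the constraint on the item parameters that the cumulative models require.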


These HRQoL data have already been analyzed with different approaches. Gourgou-Bourgade et al. (2012) presented the results using time-to-event models and concluded that HRQoL was better in the Folfirinox arm than in the Gemcitabine arm. Then, Barbieri et al. (2015) presented the results obtained with the LMM and with the partial credit model extended for longitudinal analysis, (adjacent,logistic,Z1, U2). The conclusions of both mixed models were similar.

For the (cumulative,logistic,Z1, U2) model, Table 6 shows the estimates of the fixed parameters, their standard errors and the associated P-values of the Wald test. For the functional dimensions, we applied a reverse permutation to the functional scales to allow an intuitive interpretation. An increase of the latent trait θ is thus associated with an increase of the functional capacity (improvement of HRQoL) or an increase of the symptoms (deterioration of HRQoL). For all HRQoL dimensions, there should be no difference at baseline (β1 = 0) in a randomized clinical trial. However, we observed a significant difference between the two groups at baseline for the diarrhea symptom (p = 0.007∗∗). This reflected a difference between the two arms at day 15 and day 30 (during the treatment period), but not necessarily at baseline. The perception of the diarrhea symptom then remained higher in the Folfirinox arm over time. This result was expected because Folfirinox is more toxic than Gemcitabine and is known to cause more diarrhea.

Regarding the other dimensions, HRQoL changed over time for several of them (emotional functioning, pain, insomnia, constipation and appetite loss), reflecting a significant improvement in HRQoL perception. Only pain showed a significantly different evolution between the two arms (p = 0.04): the pain perceived by patients receiving Folfirinox decreased significantly more over time than that perceived by patients receiving Gemcitabine.
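The P-values quoted above can be approximately recovered from the estimates and standard errors in Table 6 with a Wald test. The sketch below assumes a standard normal reference distribution; the software used for the analysis may rely on a t distribution instead, and the table entries are rounded, so small discrepancies with the published P-values are to be expected. The function name is illustrative.

```python
import math


def wald_test(estimate, standard_error):
    """Return the Wald statistic z and its two-sided P-value under N(0, 1)."""
    z = estimate / standard_error
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Estimates and standard errors as reported in Table 6.
for label, est, se in (("Diarrhea beta1", 0.739, 0.272),
                       ("Pain beta3", -0.188, 0.092)):
    z, p = wald_test(est, se)
    print(f"{label}: z = {z:.2f}, p = {p:.3f}")
```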

Interpretability and intuitive illustration are among the many advantages of the cumulative models. The constraints on the item parameters in these models allow an interpretation through the latent variable (e.g. comparing the proportions of the response categories for one specific item over time, or between groups at a fixed time). Figure 2 presents how the response probabilities evolve over time (Figure 2a) or differ between groups (Figure 2b). This example uses the first item of the pain symptom from the clinical trial presented above. The probability (πm) that a patient responds with category m corresponds to the area under the curve delimited by the horizontal lines. For both groups, Figure 2a shows that the probability of choosing category 2 or 3 decreased over time, while the probability of choosing category 0 increased. At baseline, the response proportions for categories 0, 1, 2 and 3 were respectively π0 = 0.10, π1 = 0.62, π2 = 0.22 and π3 = 0.06 in each group. The evolution of the proportions then showed a decrease in the level of pain between baseline and the 4-month visit and, more generally, a decrease of the latent trait over time. Likewise, Figure 2b shows the response proportions of the two groups at four months. In the control group, the proportions were π0 = 0.29, π1 = 0.61, π2 = 0.08 and π3 = 0.02 for categories 0, 1, 2 and 3. In the experimental group, they were π0 = 0.47, π1 = 0.48, π2 = 0.04 and π3 = 0.01. Category 3 was the least likely response in both groups, and was even less likely for patients in the experimental group than in the control group. Conversely, category 0 was more likely in the experimental group than in the control group. The observed gap corresponded to the difference between the linear predictors of the two groups only four months after baseline. One of the interests of this illustration lies in its clinical interpretation. The IRT models thus offer a complete analysis: the general analysis of an HRQoL dimension and the specific analysis of each item (Edelen and Reeve, 2007).
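The proportions discussed above can be approximated from the reported estimates. The following sketch plugs the pain coefficients of Table 6 and the item 9 thresholds given in the caption of Figure 2 into the cumulative logistic model. It rests on several assumptions that are not stated in the paper: time is measured in months, the baseline group effect β1 for pain is set to 0 (it is not reported in Table 6), and the random effects are set to their mean of zero. The output should therefore be close to, but not identical to, the values read from Figure 2b.

```python
import math

ITEM9_DELTAS = (-2.1, 1.0, 2.75)        # thresholds of item 9 (Figure 2 caption)
BETA2_PAIN, BETA3_PAIN = -0.330, -0.188  # slopes for pain (Table 6)


def logistic_cdf(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def response_proportions(group, months, beta1=0.0):
    """Category probabilities of item 9 for a patient with zero random effects.

    group is 0 for the control arm and 1 for the experimental arm.
    beta1 = 0 is an assumption of this sketch.
    """
    theta = group * beta1 + months * BETA2_PAIN + group * months * BETA3_PAIN
    upper = [1.0] + [logistic_cdf(theta - d) for d in ITEM9_DELTAS] + [0.0]
    return [upper[m] - upper[m + 1] for m in range(len(ITEM9_DELTAS) + 1)]


for name, group in (("control", 0), ("experimental", 1)):
    probs = response_proportions(group, months=4.0)
    print(name, [round(p, 2) for p in probs])
```

Evaluating the same function at `months=0.0` gives identical proportions in both arms, which mirrors the absence of a baseline difference assumed here.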


Table 6: Estimates of the fixed effect parameters (βp)p=1,2,3 of the (cumulative,logistic,Z1, U2) model. All HRQoL dimensions of the EORTC QLQ-C30 are considered.

| HRQoL dimension | Parameter | Coefficient | Standard error | P-value |
|---|---|---|---|---|
| Global Health Status | β2 | 0.098 | 0.070 | 0.166 |
| | β3 | 0.130 | 0.085 | 0.128 |
| Physical functioning | β2 | -0.150 | 0.077 | 0.051 |
| | β3 | 0.122 | 0.098 | 0.212 |
| Role functioning | β2 | -0.011 | 0.081 | 0.892 |
| | β3 | 0.157 | 0.103 | 0.131 |
| Emotional functioning | β2 | 0.335 | 0.070 | < .001∗∗∗ |
| | β3 | 0.001 | 0.086 | 0.992 |
| Cognitive functioning | β2 | -0.002 | 0.054 | 0.972 |
| | β3 | 0.088 | 0.067 | 0.189 |
| Social functioning | β2 | 0.010 | 0.073 | 0.888 |
| | β3 | 0.116 | 0.093 | 0.211 |
| Fatigue | β2 | -0.087 | 0.085 | 0.308 |
| | β3 | -0.033 | 0.107 | 0.761 |
| Nausea and vomiting | β2 | -0.052 | 0.060 | 0.393 |
| | β3 | -0.069 | 0.072 | 0.336 |
| Pain | β2 | -0.330 | 0.076 | < .001∗∗∗ |
| | β3 | -0.188 | 0.092 | 0.040∗ |
| Dyspnea | β2 | -0.060 | 0.075 | 0.420 |
| | β3 | -0.093 | 0.088 | 0.295 |
| Insomnia | β2 | -0.359 | 0.080 | < .001∗∗∗ |
| | β3 | 0.046 | 0.083 | 0.627 |
| Appetite loss | β2 | -0.354 | 0.072 | < .001∗∗∗ |
| | β3 | -0.026 | 0.080 | 0.747 |
| Constipation | β2 | -0.325 | 0.077 | < .001∗∗∗ |
| | β3 | 0.003 | 0.083 | 0.974 |
| Diarrhea | β1 | 0.739 | 0.272 | 0.007∗∗ |
| | β2 | 0.018 | 0.067 | 0.792 |
| | β3 | -0.026 | 0.076 | 0.786 |
| Financial difficulties | β2 | -0.522 | 0.282 | 0.066 |
| | β3 | 0.302 | 0.208 | 0.146 |


(a) Pain evolution (b) Outcomes at t4

Figure 2: Interpretation of the (cumulative,logistic,Z1, U2) model through its underlying latent variable for the pain symptom (items 9 and 19). The estimated difficulty parameters are δ9,1 = −2.1, δ9,2 = 1 and δ9,3 = 2.75 for item 9, and δ19,1 = −1.28, δ19,2 = 1.40 and δ19,3 = 3.34 for item 19. Figure 2a: evolution of the latent variable and of the response variables (Y(j))j=9,19 in the two groups. Figure 2b: proportions (πm) of the response categories of Y(9) in the two groups four months after baseline (t4).

5 Discussion

We have explored suitable mixed models for the longitudinal analysis of HRQoL in oncology. Because these data come from questionnaires based on Likert scales, we focused on regression models for ordinal data. These models are specified by three components: the parameterization of the linear predictor, the ratio of probabilities and the CdF (Peyhardi et al., 2015). In oncology, because the analysis is performed on multiple-item measurements associated with one HRQoL dimension (Fayers et al., 2001), the specific IRT parameterization of the linear predictor is used. The item parameters make it possible to distinguish the outcomes of different items that measure a single unidimensional latent variable. This latent variable was decomposed linearly to take into account the covariates in the fixed part of the model and to incorporate subject-specific random effects. The analysis with IRT models is richer because it is based on the raw data (Gorter et al., 2015). It can be made on one specific item through the item parameters or on the whole HRQoL dimension (Edelen and Reeve, 2007). Indeed, these models take into consideration all the information available in the data, which is why this kind of model is increasingly studied (Gorter et al., 2015).

Then, concerning the choice of the model family, the cumulative and adjacent models are preferred. Owing to the ratios of probabilities that characterize them and to the symmetric CdF, their practical invariance property seems important for interpreting the results. The cumulative models also assume an underlying continuous latent variable that is associated with a linear mixed regression model (McCullagh, 1980; Hedeker and Gibbons, 1994). This allows a better interpretation and illustration of the results, such as a straightforward analysis of how the response proportions of the different categories evolve over time or differ between groups, given one item. The adjacent models have the advantage of requiring no constraint for model estimation. These models can thus be preferred when the regression is performed on the item part of the linear predictor, given a non-proportional design. Finally, the choice of the CdF essentially depends on the observed data and on the properties of interest to the user. These IRT models are reversible only if the CdF is symmetric. Thus, the use of a common symmetric CdF (the logistic or the Gaussian distribution) is preferred. From this conceptual IRT model selection, the cumulative model seems the most suitable, given its advantages, for the longitudinal analysis of HRQoL in cancer clinical trials.
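The link between reversibility and the symmetry of the CdF can be checked numerically. In the sketch below, a cumulative model is evaluated once with parameters (θ, δ) and once with the category order reversed, that is, with −θ and the negated thresholds taken in reverse order. With the logistic CdF the two sets of category probabilities mirror each other exactly; with an asymmetric CdF (here the Gumbel-type CdF behind the complementary log-log link, chosen only as an illustration) they do not. The parameter values are arbitrary and the function names are not from the paper.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def gumbel_min_cdf(x):
    """Asymmetric CdF (complementary log-log link), used as a counter-example."""
    return 1.0 - math.exp(-math.exp(x))


def cumulative_probs(theta, deltas, cdf):
    """Category probabilities when Pr(Y >= m) = cdf(theta - delta_m)."""
    upper = [1.0] + [cdf(theta - d) for d in deltas] + [0.0]
    return [upper[m] - upper[m + 1] for m in range(len(deltas) + 1)]


def reversibility_gap(cdf, theta, deltas):
    """Largest discrepancy between the mirrored model and the original one."""
    original = cumulative_probs(theta, deltas, cdf)
    # Reversed category order: negate theta, negate and reverse the thresholds.
    reversed_deltas = tuple(-d for d in reversed(deltas))
    mirrored = cumulative_probs(-theta, reversed_deltas, cdf)
    return max(abs(a - b) for a, b in zip(reversed(original), mirrored))


theta, deltas = 0.4, (-2.1, 1.0, 2.75)  # arbitrary illustrative values
print("logistic gap:", reversibility_gap(logistic_cdf, theta, deltas))
print("gumbel gap:  ", reversibility_gap(gumbel_min_cdf, theta, deltas))
```

The logistic gap is zero up to floating-point error because F(−x) = 1 − F(x), while the asymmetric CdF leaves a clearly non-zero gap.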

The simulation study showed that the capacity of the IRT models to detect the random effect was better than that of the LMM currently used. This result seems natural because the LMM is based on a summary variable carrying less information; the variability in the data is thus also reduced. Of course, the usefulness of introducing the random effect in the model is strongly associated with the observed data. Moreover, the more distinct the difficulty parameters and the stronger the influence of the covariates, the less information the random effect provided. All these results confirmed that the IRT models allow a more detailed analysis for interpreting the results of a specific dimension or item. Whatever the IRT model used to generate the data, the LMM remained competitive in these simulations. However, the IRT model that did not generate the data seemed more sensitive to the random slope than the IRT model used to simulate the dataset. Indeed, in some cases, it tended to detect the random slope when it did not exist. When only one of the two models detects the random slope, using the model that does not detect the effect seems the most appropriate data-driven choice. However, we recommend using only one kind of model (with the same components discussed previously) so that the results are comparable across HRQoL dimensions.

An aspect that remains to be discussed is the multidimensional nature of HRQoL. Currently in oncology, the different dimensions are analyzed independently of one another, which raises the problem of multiple testing. One approach to considering all HRQoL dimensions together would be structural equation modeling. This would make it possible to show the influence of each HRQoL dimension, through some factors, on the global HRQoL, as well as potential structural links between the latent variables.

Funding

This study was supported by a grant from the French Public Health Research Institute (www.iresp.net) under the 2012 call for projects as part of the 2009-2013 Cancer Plan.

Acknowledgements

We thank Dr. Hélène de Forges for her editorial assistance and UNICANCER for the data from the PRODIGE4/ACCORD11 clinical trial used in this paper.

References

Aaronson, N. K., Ahmedzai, S., Bergman, B., Bullinger, M., Cull, A., Duez, N. J., Filiberti, A., Flechtner, H., Fleishman, S. B., and de Haes, J. C. (1993). The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. Journal of the National Cancer Institute, 85(5):365–376.

Agresti, A. (2010). Analysis of Ordinal Categorical Data. John Wiley & Sons.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4):561–573.

Anota, A., Barbieri, A., Savina, M., Pam, A., Gourgou-Bourgade, S., Bonnetain, F., and Bascoul-Mollevi, C. (2014). Comparison of three longitudinal analysis models for the health-related quality of life in oncology: a simulation study. Health and Quality of Life Outcomes, 12.

Anota, A., Hamidou, Z., Paget-Bailly, S., Chibaudel, B., Bascoul-Mollevi, C., Auquier, P., Westeel, V., Fiteni, F., Borg, C., and Bonnetain, F. (2013). Time to health-related quality of life score deterioration as a modality of longitudinal analysis for health-related quality of life studies in oncology: do we need RECIST for quality of life to achieve standardization? Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation.

Bacci, S., Bartolucci, F., and Gnaldi, M. (2014). A Class of Multidimensional Latent Class IRT Models for Ordinal Polytomous Item Responses. Communications in Statistics - Theory and Methods, 43(4):787–800.

Barbieri, A., Anota, A., Conroy, T., Gourgou-Bourgade, S., Juzyna, B., Bonnetain, F., Lavergne, C., and Bascoul-Mollevi, C. (2015). Applying the Longitudinal Model from Item Response Theory to Assess the Health-Related Quality of Life in the PRODIGE 4/ACCORD 11 Randomized Trial. Medical Decision Making: An International Journal of the Society for Medical Decision Making.

Blanchin, M., Hardouin, J.-B., Neel, T. L., Kubis, G., Blanchard, C., Mirallié, E., and Sébille, V. (2011). Comparison of CTT and Rasch-based approaches for the analysis of longitudinal Patient Reported Outcomes. Statistics in Medicine, 30(8):825–838.

Boeck, P. d. and Wilson, M. (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer, New York.

Cella, D. F., Tulsky, D. S., Gray, G., Sarafian, B., Linn, E., Bonomi, A., Silberman, M., Yellen, S. B., Winicour, P., and Brannon, J. (1993). The Functional Assessment of Cancer Therapy scale: development and validation of the general measure. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology, 11(3):570–579.

Chinot, O. L., Wick, W., Mason, W., Henriksson, R., Saran, F., Nishikawa, R., Carpentier, A. F., Hoang-Xuan, K., Kavan, P., Cernea, D., Brandes, A. A., Hilton, M., Abrey, L., and Cloughesy, T. (2014). Bevacizumab plus Radiotherapy–Temozolomide for Newly Diagnosed Glioblastoma. New England Journal of Medicine, 370(8):709–722.

Conroy, T., Desseigne, F., Ychou, M., Bouché, O., Guimbaud, R., Bécouarn, Y., Adenis, A., Raoul, J.-L., Gourgou-Bourgade, S., de la Fouchardière, C., Bennouna, J., Bachet, J.-B., Khemissa-Akouz, F., Péré-Vergé, D., Delbaldo, C., Assenat, E., Chauffert, B., Michel, P., Montoto-Grillot, C., and Ducreux, M. (2011). FOLFIRINOX versus gemcitabine for metastatic pancreatic cancer. The New England Journal of Medicine, 364(19):1817–1825.


Edelen, M. O. and Reeve, B. B. (2007). Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 16 Suppl 1:5–18.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer.

Fayers, P. M., Aaronson, N. K., Bjordal, K., Groenvold, M., Curran, D., and Bottomley, A., on behalf of the EORTC Quality of Life Group (2001). EORTC QLQ-C30 Scoring Manual (3rd edition). EORTC, Brussels.

Fiteni, F., Westeel, V., Pivot, X., Borg, C., Vernerey, D., and Bonnetain, F. (2014). Endpoints in cancer clinical trials. Journal of Visceral Surgery, 151(1):17–22.

Gilbert, M. R., Dignam, J. J., Armstrong, T. S., Wefel, J. S., Blumenthal, D. T., Vogelbaum, M. A., Colman, H., Chakravarti, A., Pugh, S., Won, M., Jeraj, R., Brown, P. D., Jaeckle, K. A., Schiff, D., Stieber, V. W., Brachman, D. G., Werner-Wasik, M., Tremont-Lukats, I. W., Sulman, E. P., Aldape, K. D., Curran, W. J., and Mehta, M. P. (2014). A Randomized Trial of Bevacizumab for Newly Diagnosed Glioblastoma. New England Journal of Medicine, 370:699–708.

Gorter, R., Fox, J.-P., and Twisk, J. W. (2015). Why item response theory should be used for longitudinal questionnaire data analysis in medical research. BMC Medical Research Methodology, 15(1):55.

Gourgou-Bourgade, S., Bascoul-Mollevi, C., Desseigne, F., Ychou, M., Bouché, O., Guimbaud, R., Bécouarn, Y., Adenis, A., Raoul, J.-L., Boige, V., Bérille, J., and Conroy, T. (2012). Impact of FOLFIRINOX Compared With Gemcitabine on Quality of Life in Patients With Metastatic Pancreatic Cancer: Results From the PRODIGE 4/ACCORD 11 Randomized Trial. Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology.

Grilli, L. and Rampichini, C. (2011). Multilevel Models for Ordinal Data. In Kenett, R. S. and Salini, S., editors, Modern Analysis of Customer Surveys, pages 391–411. John Wiley & Sons, Ltd.

Hamidou, Z., Dabakuyo, T. S., Mercier, M., Fraisse, J., Causeret, S., Tixier, H., Padeano, M.-M., Loustalot, C., Cuisenier, J., Sauzedde, J.-M., Smail, M., Combier, J.-P., Chevillote, P., Rosburger, C., Arveux, P., and Bonnetain, F. (2011). Time to Deterioration in Quality of Life Score as a Modality of Longitudinal Analysis in Patients with Breast Cancer. The Oncologist, 16(10):1458–1468.

Hardouin, J.-B., Audureau, E., Leplège, A., and Coste, J. (2012). Spatio-temporal Rasch analysis of quality of life outcomes in the French general population. Measurement invariance and group comparisons. BMC Medical Research Methodology, 12(1):182.

Hardouin, J.-B., Blanchin, M., Feddag, M.-L., Néel, T. L., Perrot, B., and Sébille, V. (2015). Power and sample size determination for group comparison of patient-reported outcomes using polytomous Rasch models. Statistics in Medicine, pages n/a–n/a.

Hedeker, D. and Gibbons, R. D. (1994). A random-effects ordinal regression model for multilevel analysis. Biometrics, 50(4):933–944.

Huber, C., Limnios, N., Mesbah, M., and Nikulin, M. (2013). Mathematical Methods in Survival Analysis, Reliability and Quality of Life. John Wiley & Sons.


Institute, S. (2011). SAS/STAT 9.3 User’s Guide: Mixed Modeling (Book Excerpt). SAS Institute.

Jafari, P., Bagheri, Z., Ayatollahi, S. M., and Soltani, Z. (2012). Using Rasch rating scale model to reassess the psychometric properties of the Persian version of the PedsQL™ 4.0 Generic Core Scales in school children. Health and Quality of Life Outcomes, 10(1):27.

Liu, L. C. and Hedeker, D. (2006). A mixed-effects regression model for longitudinal multivariate ordinal data. Biometrics, 62(1):261–268.

Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2):149–174.

McCullagh, P. (1980). Regression models for ordinal data (with discussion). Journal of the Royal Statistical Society, Series B, 42:109–142.

Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16(2):159–176.

Peyhardi, J., Trottier, C., and Guédon, Y. (2015). A new specification of generalized linear models for categorical responses. Biometrika, page asv042.

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Danmarks Paedagogiske Institut.

Samejima, F. (1968). Estimation of Latent Ability Using a Response Pattern of Graded Scores. ETS Research Bulletin Series, 1968(1):i–169.

Santos, V. L. F., Moura, F. A. S., Andrade, D. F., and Gonçalves, K. C. M. (2016). Multidimensional and longitudinal item response models for non-ignorable data. Computational Statistics & Data Analysis, 103:91–110.

Titman, A. C., Lancaster, G. A., and Colver, A. F. (2013). Item response theory and structural equation modelling for ordinal data: Describing the relationship between KIDSCREEN and Life-H. Statistical Methods in Medical Research.

Tutz, G. (1990). Sequential item response models with an ordered response. British Journal of Mathematical and Statistical Psychology, 43(1):39–55.

Verhagen, J. and Fox, J.-P. (2013). Longitudinal measurement in health-related surveys. A Bayesian joint growth model for multivariate ordinal responses. Statistics in Medicine, 32(17):2988–3005.

