
Behavioral economics in software quality engineering

Radosław Hofman1
Sygnity Research, Poland

1 EUR ING, Sygnity Research, Poland, email: [email protected]

This article describes empirical research results regarding the “history effect” in software quality

evaluation processes.

Most software quality models and evaluation process models assume that software quality can be evaluated deterministically, especially when it is evaluated by experts. Consequently, software developers focus on the technical characteristics of the software product. A similar assumption is common in most engineering disciplines. However, for other kinds of goods, direct violations of the objective-evaluation assumption have been demonstrated and attributed to the limitations of human cognitive processes. The ongoing discussion in the area of behavioral economics raises the question: are experts prone to observation biases? If they are, then software quality models overlook an important aspect of software quality evaluation.

This article proposes an experiment that aims to trace the influence of users’ knowledge on software quality assessment. Measuring the influence of single variables on the software quality perception process is a complex task. There is no valid quality model for the precise measurement of product quality, and consequently software engineering has no tools to freely manipulate the quality level of a product. This article proposes a simplified method to manipulate the observed quality level, thereby making it possible to conduct such research.

The proposed experiment has been conducted among professional software evaluators. The results show a significant negative influence (large effect size) of users’ negative experience on their final opinion about software quality, regardless of its actual level.

Index terms— software, quality perception, cognitive psychology, behavioral

economics.

Introduction

Motivation

Software quality has been a subject of study since the 1970’s when software development

techniques started to be perceived as an engineering discipline. The first quality models were

published by McCall (McCall, Richards, & Walters, 1977) and Boehm (Boehm, Brown, Lipow, &

MacCleod, 1978). The most disseminated model among those currently developed is the SQuaRE


(Software product QUality Requirements and Evaluation) model developed within the ISO/IEC 25000 standards series. This new approach is perceived as the new generation of software quality models (Suryn & Abran, 2003) and is being used for the decomposition of end users’ perspectives regarding software component requirements (Abramowicz, Hofman, Suryn, & Zyskowski, 2008).

The general conclusion about software quality models is that there is no commonly accepted

model or evaluation method.

The valuation of goods has been studied by economic scientists for centuries (Camerer &

Loewenstein, 2003). The neo-classical economic model of human behavior generated assumptions

about utility maximization, equilibrium and efficiency. These assumptions correspond with the

classical model of human behavior known as homo economicus. This concept appeared in the book

considered to be the beginning of economic science (Smith, 1776). Although the assumptions

discussed in this text are widely accepted, they are just simplifications made for the purpose of

modeling decision processes or economic behavior. Publications in the last decades have criticized

the above assumptions (Camerer & Loewenstein, 2003).

Economists have started to accept the counterexamples to neo-classical models based on

empirical observation results. A new approach in psychology, cognitive psychology, has proposed

a new model of the human brain using the metaphor of an information processing system (Nęcka,

Orzechowski, & Szymura, 2008). Psychologists have started to compare their psychological

models with those of economics. In the 1950s Herbert A. Simon proposed a decision theory based on the concept of bounded rationality (Simon, 1956). In the 1970s Daniel Kahneman and Amos Tversky published their research on heuristics in decision making (Tversky & Kahneman, 1974) and their prospect theory (Kahneman & Tversky, 1979). These publications are considered to be the most important milestones of behavioral economics (Camerer & Loewenstein, 2003).

The works of Herbert A. Simon (Simon, 1956), Gary S. Becker (Becker, 1968), George A. Akerlof (Akerlof, 1970), A. Michael Spence, Joseph E. Stiglitz, and Daniel Kahneman (Kahneman & Tversky, 1979) were awarded the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel in 1978, 1992, 2001 and 2002 respectively. The prize for Daniel Kahneman was shared with Vernon L. Smith, who received the award for his research results in experimental economics (Nobel Foundation, 2009).

In subsequent years several models of judgment formulation and decision making were

developed (Levitt & List, 2008). However, there were also some drawbacks (List, 2004). Several

scientists regard irrational behavior as being a result of a lack of experience in the area where the

decision is to be made (Brookshire & Coursey, 1987). This unresolved discussion, along with the slow build-up process of cognitive structures (Camerer & Hogarth, 1999), leaves room for the analysis of empirical evidence.

The summary of the background analysis, described further by Hofman (Hofman, 2010),

stresses that the analyzed areas - software engineering, software quality engineering, behavioral

economics and cognitive psychology - are continually developing. Up until now, much research

has been conducted regarding the understanding of human cognitive processes. Researchers have


documented a long list of cognitive biases, heuristics etc. which are associated with the judgment

formulation process. Their research results may be used to discover potential benefits for the

software engineering discipline, including a better understanding of users’ ways of thinking and

ways of perceiving software quality. Consequently, software engineering researchers will be able

to propose more effective methods for the management of software delivery projects, aiming to

improve users’ satisfaction levels.

Related work

The experimental methods developed for psychological, sociological and behavioral economics

purposes have also been adopted by software engineering scientists (Basili, 1993). However, this

research focuses on the observable impacts of the use of certain techniques, methods, tools etc. on

software engineering processes (Basili, 2007). The customer’s perspective is considered mainly as

being a stakeholder’s perspective during the software project.

The evaluation of product quality was investigated by Xenos et al. (Xenos & Christodoulakis, 1995). Their model takes into account users’ qualifications in terms of general computer skills. It follows the dimensions proposed by Pressman (Pressman, 1992). This model assumes that users hold initial opinions, then continuously discover features of the software product, gaining new information and revising their beliefs. The authors concluded that users finally come to an objective opinion about the software product’s quality (Stavrinoudis, Xenos, Peppas, & Christodoulakis, 2005).

The above approach is based on the assumption that users are rational agents using deductive

reasoning and that beliefs may be represented in a formal system. The authors do not analyze the

context of the user (the context of the purpose) or the user’s personal aspects (tiredness, attitude towards the experiment, etc.). The most important problem with the results is that of repetitive observations within the same group of users: the phenomenon observed by the authors may also be explained as a regression-to-the-mean effect (Shaughnessy, Zechmeister, & Zechmeister, 2005).

Another approach, used commonly for the evaluation of services (not exclusively those of IT) is

the SERVQUAL model (Miguel, daSilva, Chiosini, & Schützer, 2007). This approach measures

the difference between observed characteristics and a subjectively ideal object.

One of the first quality perception models, not related to software, was proposed by Steenkamp

et al. for food quality perception (Steenkamp, Wierenga, & Meulenberg, 1986). Their research on

the model’s validity was conducted using a psychological research paradigm. Using an independent groups plan, the authors showed that a statement presented to subjects influenced their quality assessment level.

The new approach to software quality perception modeling and the model presented by Hofman

(Hofman, 2010) is being validated in a series of empirical studies. This model is presented in fig 1.


Fig. 1, The software quality perception model

The overall quality grade depends on the knowledge-based importance of characteristics and on the current needs saturation level. If the observed quality is above expectations, then diminishing marginal increases in satisfaction from each additional “quality unit” may be expected (compare to Gossen’s law). On the other hand, if the observed quality is below expectations, then radical dissatisfaction disproportional to the “quality gap” may be expected (compare to positive-negative asymmetry) (Tversky & Kahneman, 1982).
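This asymmetry can be illustrated with a stylized, prospect-theory-like value function around the expectation level; the formula below is only an illustration of the referenced positive-negative asymmetry, not a formula belonging to the quality perception model itself:

% Stylized illustration (LaTeX): value of observed quality q relative to expected quality e,
% with diminishing sensitivity (0 < \alpha, \beta < 1) and loss aversion (\lambda > 1).
v(q) =
\begin{cases}
(q - e)^{\alpha} & \text{if } q \ge e \\
-\lambda \, (e - q)^{\beta} & \text{if } q < e
\end{cases}

With \lambda > 1 the negative branch is steeper, so a quality shortfall of a given size lowers the grade more than an equally sized surplus raises it.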

Furthermore, both the observer’s knowledge and mental state influence their perception of

software quality attributes (e.g. they supply the process with valid or invalid associations etc.).

Both of these structures also influence the behavior of the attention filter.

The general perceived quality grade is a non-linear combination of perceived attributes, with an open question as to which comes first: judgments about attributes or the general grade judgment. This article focuses on the influence of observers’ knowledge (including their past experience, learning results, etc.).

Experiment planning

Goal

The motivation for the experiment is strictly based on the industrial experience of the author. In many real-world projects, managers aim to deliver the software while overlooking the process of communication with the customer. First prototypes or versions are presented to unprepared users. Moreover, the quality level of these releases is below the acceptance level. Evaluators gain experience evaluating these applications. The question is: does experience with previous versions influence the final version’s quality assessment?


The goal of the experiment presented in this article is to trace the influence of knowledge on

software quality perception. Roughly speaking, the research question is: does the quality of

sequential software versions and users’ motivation affect the quality assessment of the final

version of the software?

Users’ knowledge is built up during the learning process (Nęcka, Orzechowski, & Szymura, 2008). According to Kahneman and Tversky, if knowledge results from personal experience, then it is overestimated by the observer (Tversky & Kahneman, 1982). In order to isolate the influence of personal experience, a research method that allows the research variables to be observed and manipulated has to be defined.

Method

Experiments are conducted in order to trace the cause-effect relations between certain values of an independent variable and the resulting level of the dependent variable(s). A controllable approach for such research may be planned as an independent groups plan (Shaughnessy, Zechmeister, & Zechmeister, 2005). In designing the control methods for this plan it is important to identify a set of possible differences.

As stated before, the goal of this research is to trace the influence of two factors: the sequence of quality levels associated with sequential versions of software, and the different motivation levels of evaluators. Although establishing motivation is purely organizational, the purposive manipulation of software quality may cause difficulties. As mentioned in the previous section, there is no precise model for objective software quality assessment; therefore, it is difficult to compare the quality levels of two applications. A comparison of different applications by independent groups would face the problem of identifying the exact value of the quality difference between the applications (the difference could depend on the subjective preferences of evaluators – e.g. a comparison of the quality levels of the Microsoft Windows and Macintosh operating systems).

On the other hand, the evaluation of the same application by independent groups poses another

problem - the deliberate manipulation of the application’s quality level (e.g. if an additional feature

is added then it may be said that the application has changed and that it is a different one). In the

literature, however, there are no examples of such manipulation.

Considering these restrictions, quality level manipulation may be limited to the manipulation of the quality level order. Denote the quality level of a version v as Qv, and let the order relation Qv1 < Qv2 mean that the quality level of v2 is higher than the quality level of v1. The quality level order is transitive: if Qv1 < Qv2 and Qv2 < Qv3 then Qv1 < Qv3.

The manipulation of quality levels, which aims to construct a set of versions with a known order, is possible with the use of the fault probability function fp. For versions of the same application, fault probability is, ceteris paribus (i.e. with other things being the same), negatively correlated with the quality level. The quality level order may thus be indicated by a set of versions ordered by fault probability.

The general plan for the experiment is based on an independent groups plan. This research aims to investigate the influence of two potential effects: the “history effect” and the “motivation effect.”

Therefore, four groups are required to analyze the treatments independently. In each case secondary perception (an opinion formed from evaluators’ reports rather than from direct use) will also be analyzed; each group thus has a manager receiving the evaluators’ reports. The general overview of this research plan is presented in fig 2.

Fig. 2, Four independent groups layout

The subjects will be located in physically separated locations (different cities) and have no

communication. Additionally, the test team managers will be located in different locations, and

will be able only to read the evaluation reports. The experiment begins with a personal survey. The

aim of this survey is to gather information about each subject’s domain knowledge (in this case

regarding the scope of TestApp1 and TestApp2) and their preferences for software quality

characteristics. The results of the survey will be used to analyze the groups’ homogeneity (see

section “Threats to validity” below). The personal survey is followed by a pre-test, which also

aims to verify the homogeneity of quality assessment levels among groups for a task handled in

TestLab. Then, for the succeeding 5 days the subjects will evaluate sequential versions of the

application. At the end of each evaluation task they will fill out a survey regarding the

application’s quality level. These phases of the experiment are shown in fig 3.

Fig. 3, Phases of the experiment


The two treatments being investigated are: the different patterns of quality levels of sequential

versions of the applications, and the existence of additional motivation for the evaluators. The fault

probability related to the quality level of the versions of the application for patterns A and B is

presented in table 1.

version (v)   fp(v)      version (v)   fp(v)
A.0.1         0.00       B.0.1         0.00
A.0.2         0.00       B.0.2         0.00
A.0.3         0.10       B.0.3         0.80
A.0.4         0.15       B.0.4         0.50
A.0.5         0.10       B.0.5         0.10

Tab. 1, Fault probability (fp) patterns A and B
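As a minimal illustration of how the fp values imply a quality order, the Python sketch below (using the pattern B values from table 1; TestLab itself is not shown) sorts versions by descending fault probability, from lowest to highest quality:

# Planned fault probabilities for pattern B, taken from table 1.
fault_probability = {"B.0.1": 0.00, "B.0.2": 0.00, "B.0.3": 0.80, "B.0.4": 0.50, "B.0.5": 0.10}

# fp is (ceteris paribus) negatively correlated with quality, so sorting by
# descending fp orders versions from lowest to highest quality level; versions
# with equal fp (here B.0.1 and B.0.2) are treated as having equal quality.
versions_by_quality = sorted(fault_probability, key=lambda v: -fault_probability[v])
print(versions_by_quality)  # ['B.0.3', 'B.0.4', 'B.0.5', 'B.0.1', 'B.0.2']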

It should be noted that the final versions for both patterns (for all groups) will have the same

quality level. The goal of the experiment is to address the potential difference in the assessed

quality level by both subjects and test managers.

The second treatment will use additional motivation for groups A.H and B.H (groups A.L and B.L have no such additional motivation treatment). The *.H groups will be told that the proper evaluation of the software is of key importance for their employer because a strategic decision is associated with the evaluation results.

Tools

This section describes the tools prepared for the experiment. The dedicated, realistic

environment prepared for the experiment was named the “TestLab” framework. The details of this

tool will not be discussed, although the most important factors regarding the requirements of the

experiment will be presented.

TestLab is a framework that allows for the handling of subjects’ profiles and assignments, the

monitoring of evaluation tasks, and the gathering of feedback from subjects etc. A more important

factor is its ability to deploy real-like applications (called TestApps). TestLab allows the

experimenter to set the fault probability for each task. Therefore, it allows the generation of a set

of versions with a known quality level order. The general concept of the platform is presented in

fig 4.


Fig. 4, Logical architecture of the TestLab platform

TestLab generates 12 failure types (a list based on categorized bug reports from >100 real projects). This list contains:

1. “Blue screen” - the application fails, producing literally a blue screen containing a vast

amount of technical information about the error in the application.

2. Performance error - after an action is initiated by a user the application waits for 90

seconds before issuing the answer.

3. Data lost error on write - after a completed form is submitted to the application the

answer is “you have not completed the form.”

4. Data lost error on read – when selecting data (a list of records or a single record) the

application returns an empty information set.

5. Data change error on write - after data is submitted to the application it stores truncated

data (for example “J” instead of “John”).

6. Data change error on read – this type of failure is similar to number 5 above, but data is

being truncated only during the presentation. When the user requests data for the

second time they may receive correct information.

7. Calculation error - the application returns an incorrect calculation result.

8. Presentation error - the application presents the screen as a maze of objects (objects are

located in incorrect positions, and also have incorrect sizes, color sets etc).

9. Static text error – the application randomly swaps static texts on the screen.

10. Form reset error – when this error is generated the user is unable to finish filling in a

form on the screen. The form is cleared before completion (every 2-10 seconds).

11. Inaccessibility of functions – the application disables the possibility of using any of the

active controls on the screen (for example, after filling out a form a user cannot submit

it).

12. Irritating messages error - the application shows sequential messages about errors but without any real failure. The user has to confirm each message by clicking on the “OK” button.
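A minimal Python sketch of how such fp-driven fault injection can be organized is given below; the names and structure are hypothetical illustrations, not TestLab’s actual implementation. On each user action a failure is injected with probability fp, and the failure type is drawn from the catalogue above:

import random
from typing import Optional

# Catalogue of the 12 failure types listed above (identifiers are illustrative).
FAILURE_TYPES = [
    "blue_screen", "performance", "data_lost_on_write", "data_lost_on_read",
    "data_change_on_write", "data_change_on_read", "calculation_error",
    "presentation_error", "static_text_swap", "form_reset",
    "functions_inaccessible", "irritating_messages",
]

def maybe_inject_failure(fp: float, rng: random.Random) -> Optional[str]:
    """With probability fp, return the failure type to inject into the current
    user action; otherwise return None and let the action behave normally."""
    if rng.random() < fp:
        return rng.choice(FAILURE_TYPES)
    return None

# Example: version B.0.3 (fp = 0.80) disturbs most actions, B.0.5 (fp = 0.10) only a few.
rng = random.Random(2010)
print([maybe_inject_failure(0.80, rng) for _ in range(5)])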

TestLab supports three types of tasks: a survey about a subject, an evaluation task, and a

manager’s evaluation task. The personal survey allows for the gathering of statistical information

about a subject’s preferences, their experience, attitude etc. The evaluation task consists of the

presentation of evaluation instructions, access to the tested application with a predefined quality

level (as discussed above), and the final survey about quality. The manager’s evaluation task refers

to the real life situation where higher level management has to present their own opinion about

software quality, but they do not perform the actual evaluation. The manager’s evaluation task

contains the evaluation instruction given to the subjects and their evaluation reports outlining their

personal opinion about the quality. A quality assessment survey is to be filled out by the manager

after they read all of the reports submitted by members of their particular group.

Surveys in TestLab may be constructed using all of the typical answer types (single/multiple choice, free text, etc.). Additionally, a Likert-type scale was implemented, which enables bipolar terms at the ends to be defined, following Osgood’s semantic differential (Osgood, Suci, & Tannenbaum, 1957). Following the ISO/IEC 25010 draft, questions were asked regarding the following characteristics: rich functionality, general software quality and compliance with formal rules, efficiency, productivity, satisfaction, learnability, adaptability, reliability, safety and security (the list is based on Software Product Quality in Use as defined in the ISO/IEC 25010 Commission Draft, 2009). The ends of the scale are anchored to definitions: “Application is the absolute negation of this attribute” [value=1] and “Application is the ideal example of this attribute” [value=11]. At the middle point, the neutral anchor is defined as “Application has this quality attribute, but not on an exceptional level” [value=6]. The scale is intended to look like a continuous scale (using red at the negative end and green at the positive end, with a gradient change between them). This is represented in fig 5.

Fig. 5, Scale example for the general quality question

For the purpose of the described experiment two TestApps were prepared: the issue submitting

system (TestApp1) and the internet banking system (TestApp2). Each application has complete

documentation for evaluation purposes, requirements, test scenarios etc.

Threats to validity

Every experiment has to be considered against threats to its validity. The considered experiment

plan uses the purposive sampling method, selecting subjects from a group of professional

evaluators. The decision regarding the sampling method may affect both the internal and external

validity of the results. The random assignment of tasks to groups and the doubling of the number

of groups will occur in order to mitigate the risks to validity. The external validity threat resulting


from the selection of professional software evaluators is not mitigated, however, as the results will

be discussed only in regard to commercial projects involving professional evaluators.

Groups, however, could have systematic differences between them (e.g. a subject from one city

could have a different way of evaluating software or could have some prior experience which

could affect the result). One way of counteracting this threat is to randomize the tasks assigned to

groups. Additionally, the experiment plan includes a series of homogeneity tests on the results

from the personal survey, the pre-test and the evaluation of version *.0.1 (see fig 3).

Another threat to the internal validity is the possibility of external influence on the results (e.g.

information in the mass media about a software security scandal or even information causing

strong emotions (Zelenski, 2007)) and the possibility of the uncontrolled flow of information

between groups (e.g. a subject from one group could have shared their opinion about quality level

with a subject from another group). To avoid these threats, the experiment will be conducted at the

same time in all groups (the subjects will have similar external information) but in physically

separated locations.

A typical external validity threat results from the sample size and the possible lack of statistical significance. For experiments based on the psychological research paradigm, external validity may be verified using the effect size (Mook, 1983). This approach relies on the observation that behavioral patterns are rather constant even in different situations (Underwood & Shaughnessy, 1975); thus the size of the effect can be used to estimate the likelihood of the effect being replicated in a real situation (Shaughnessy, Zechmeister, & Zechmeister, 2005).

The method of data collection may also possibly influence the result and threaten the

experiment’s validity. This threat is countered by the plan of having a pre-test and a version *.0.1

evaluated on the same quality level. The homogeneity check will compare the results to see if the

reaction was similar in all groups.

Experiment execution

Subject selection

For the purpose of the experiment a purposive sampling method was chosen. Four groups in

four different locations were tested simultaneously in order to avoid information exchange. Each

group consisted of five evaluators and a manager located in a different building.

Professional software evaluators were used as the subjects because their levels of experience

seemed to be similar. The application TestApp2 was designed to simulate an internet banking

system and 100% of the evaluators were users of such applications in the real world.

During the experiment four subjects were lost due to absence from work. These individuals were

removed from the result analysis. The final number of subjects in each of the four groups who

completed each stage of the experiment was four evaluators and one manager (i.e. there were 16

evaluators and 4 managers in total).


Executed tasks

The experiment was conducted according to an independent groups plan (as discussed in the

previous section). There were four independent groups (the experiment was doubled – see fig. 2).

The “history effect” was tested using the different patterns of sequential version quality (refer to table 1), while the “motivation effect” was tested using the additional information provided to the evaluators.

The evaluation of TestApp1 (the pre-test) and the first two days of the TestApp2 tests passed without remarkable events. On the third day of the TestApp2 tests, the fp was set to a significant value. The evaluators wanted to stop the evaluation and complained to their managers about the dramatically low quality level. They were told to continue and to complete as many test scenarios as they were able to (the mail message was sent equally to all 4 groups). The following days did not bring any occurrences not expected in the plan.

It seems that the evaluators treated the task seriously. They provided extensive written reports,

screen shots, and conclusions as attachments to their quantifiable opinions about quality. The

managers also seem to have tried to do their best for the evaluation, as they prepared written

opinions for each version including advice for the artificial “development team”.

Analysis

Results

The analysis of the results focuses on the general quality grade. The grade was assessed by

subjects during the personal survey and after the pre-test and each version evaluation. For this

analysis it is important to consider the survey results for the pre-test as well as versions *.0.1 (for

the homogeneity test) and *.0.5 (for the final effect result). The data gathered is presented in

table 2.

Stage of the experiment                 A.L   A.H   B.L   B.H
Personal survey (evaluators)             10     9    11    11
                                         11    10    10    10
                                         11     8    10    10
                                         11    10    11    11
Personal survey (managers)               10    11    11    11
TestApp1 (evaluators)                     7     7     6     8
                                          6     7     9     8
                                          9     8     6     8
                                          8    10     2     6
TestApp1 (managers)                       6     7     4     6
TestApp2 version: *.0.1 (evaluators)      7     8     5     7
                                          6     8     9     8
                                          9     9     8     8
                                          6     6     5     4
TestApp2 version: *.0.1 (managers)        6     8     7     6
TestApp2 version: *.0.5 (evaluators)     10     3     2     1
                                          3     3     4     2
                                          3     6     2     2
                                          4     3     1     2
TestApp2 version: *.0.5 (managers)        4     3     2     1

Tab. 2, Data gathered from the first surveys, the pre-test and the evaluation of versions *.0.1 and *.0.5

The reactions of the subject teams seemed to be uniform, except for one answer in the A.L group for the A.0.5 version of TestApp2. The entry could have been removed if there had been clues suggesting that it was entered mistakenly. The results were analyzed both with this entry included and with it omitted; in both cases the conclusions were similar (the calculated values of F and d differed, but in both cases the interpretation of the result did not change).

The negative grades could have been affected by the floor of the scale. The negative end of the scale was described as the level at which the application is the absolute negation of quality; therefore, it seems that this effect could not have been avoided. It seems that, under negative emotions, the evaluators selected the worst possible grade for the purpose of “punishing” the producers (compare to the negative side of the Kahneman and Tversky curve (Kahneman & Tversky, 1979)).

Interpretation of results

The results will be interpreted using the following assumptions:

• The main effect of each variable will be calculated for the joint groups (the “history

effect” will be calculated by comparing the reaction of the A.* and B.* groups, and the

“motivation effect” will compare the *.L and *.H groups)

• The significance level is set to α = 5%

• The manager’s opinion estimation will be compared to the results of the evaluation of

TestApp1 and all versions of TestApp2 (personal surveys will not be analyzed as the

managers were not shown subjects’ personal surveys)

The results will be interpreted using the ANOVA method (Shaughnessy, Zechmeister, &

Zechmeister, 2005) and the null hypothesis verification procedure.

For the homogeneity analysis the null hypotheses H0: MA.*=MB.* and H0: M*.L=M*.H were assumed for the two group splits discussed above. The results from the analysis using the ANOVA method are presented in tables 3 and 4.

Test                                 MA.*   MB.*   SS     SSE    F     p     Fcrit 5%
H0: MA.*=MB.* (personal survey)      10.0   10.5    1.0   10.0   1.4   0.27  4.6
H0: MA.*=MB.* (TestApp1)              7.8    6.6    5.1   45.4   1.6   0.23  4.6
H0: MA.*=MB.* (TestApp2 v: *.0.1)     7.4    6.8    1.6   35.4   0.6   0.44  4.6

Tab. 3, ANOVA table for homogeneity tests for H0: MA.*=MB.*

Test                                 M*.L   M*.H   SS     SSE    F     p     Fcrit 5%
H0: M*.L=M*.H (personal survey)      10.6    9.9    2.3    8.8   3.6   0.08  4.6
H0: M*.L=M*.H (TestApp1)              6.6    7.8    5.1   45.4   1.6   0.23  4.6
H0: M*.L=M*.H (TestApp2 v: *.0.1)     6.9    7.3    0.6   36.4   0.2   0.65  4.6

Tab. 4, ANOVA table for homogeneity tests for H0: M*.L=M*.H
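These homogeneity tests are plain one-way ANOVAs on the grades in table 2. As a cross-check, a minimal Python sketch is shown below (assuming SciPy is available) for the personal-survey comparison of the A.* and B.* groups; small rounding differences from the tables are possible.

from scipy.stats import f_oneway

# Evaluators' personal-survey grades from table 2, grouped by quality pattern.
grades_a = [10, 11, 11, 11, 9, 10, 8, 10]    # groups A.L and A.H
grades_b = [11, 10, 10, 11, 11, 10, 10, 11]  # groups B.L and B.H

f_stat, p_value = f_oneway(grades_a, grades_b)
print(f"F = {f_stat:.1f}, p = {p_value:.2f}")  # F ≈ 1.4, cf. the first row of table 3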

There is no reason to reject the null hypothesis in any of the above tests. Therefore, it should be assumed that the groups are comparable. It is also observable that the similarity of the groups was stronger for the evaluation of TestApp2 version *.0.1 than for the TestApp1 application. A likely explanation is that TestApp1 was an artificial application for a bank’s helpdesk (subjects could have had no personal experience in using such an application), while TestApp2 was an internet banking application (all of the subjects declared themselves to be users of internet banking applications in the real world).

For analyzing the main effects, a similar procedure was followed. The null hypotheses H0: MA.*=MB.* and H0: M*.L=M*.H were assumed for the two group splits, as discussed at the beginning of this section. The analyses are presented in tables 5 and 6.

Test                                 MA.*   MB.*   SS     SSE    F     p     Fcrit 5%
H0: MA.*=MB.* (TestApp2 v: *.0.5)     4.4    2.0   22.6   49.9   6.3   0.02  4.6

Tab. 5, ANOVA table for main effect test for H0: MA.*=MB.*

Test                                 M*.L   M*.H   SS     SSE    F     p     Fcrit 5%
H0: M*.L=M*.H (TestApp2 v: *.0.5)     3.6    2.8    3.1   69.4   0.6   0.44  4.6

Tab. 6, ANOVA table for main effect test for H0: M*.L=M*.H

The null hypothesis MA.*=MB.* (“history effect”) has to be rejected. For the second hypothesis, M*.L=M*.H (“motivation effect”), there is no reason to reject the null hypothesis.

An estimation of the effect size requires an additional statistic to be calculated: Cohen’s d

(Shaughnessy, Zechmeister, & Zechmeister, 2005). The value of d for the comparison of A.* and

B.* (“history effect”) is d=1.08. According to Cohen (Cohen, 1988) this value is interpreted as a

large effect.
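For reference, the textbook definition of Cohen’s d for two independent groups is given below (the article does not state which pooled-variance convention was used, so the reported value should be read against this formula only approximately):

% Standard Cohen's d (LaTeX), with the pooled standard deviation of the two groups.
d = \frac{M_{A.*} - M_{B.*}}{s_{pooled}},
\qquad
s_{pooled} = \sqrt{\frac{(n_A - 1)\, s_A^2 + (n_B - 1)\, s_B^2}{n_A + n_B - 2}}

Cohen’s conventional thresholds are 0.2 (small), 0.5 (medium) and 0.8 (large), so d = 1.08 is well above the “large” threshold.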

An important part of the experiment was the analysis of the feedback information provided to the test managers. For each version, summary statistics of the evaluators’ grades reported to the test manager were calculated and compared with the grade assessed by the manager. The comparison of the chosen estimators is presented in table 7.

Estimator          Arithmetic   Geometric   Harmonic   Harmonic mean   Median
                   mean         mean        mean       (rounded)
Estimator value     4.77         4.53        4.30        4.29           4.69
Estimator error    -0.48        -0.24       -0.01        0.00          -0.40
Error std. dev.     0.84         0.76        0.73        0.78           0.92
Pearson's r         0.94         0.95        0.95        0.94           0.93
Error range        -2.75        -2.24       -1.62       -2.00          -3.50

Tab. 7, Estimators comparison for the opinion of managers

The most effective estimator among those tested is the harmonic mean of the evaluators’

answers. A version of this estimator where the values were rounded to the nearest integer (test

managers provided answers on a discrete scale) was also tested and the result was similar.
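The estimators in table 7 are simple summary statistics of the evaluators’ grades. A minimal Python sketch is shown below, using one group’s grades from table 2 purely as an illustration; since the grades are positive, the harmonic mean can never exceed the geometric or arithmetic mean, which is consistent with managers’ opinions landing below the simple average after negative reports.

import statistics

evaluator_grades = [7, 6, 9, 6]  # A.L group, TestApp2 version *.0.1 (table 2)

estimators = {
    "arithmetic mean": statistics.mean(evaluator_grades),
    "geometric mean": statistics.geometric_mean(evaluator_grades),
    "harmonic mean": statistics.harmonic_mean(evaluator_grades),
    "harmonic mean (rounded)": round(statistics.harmonic_mean(evaluator_grades)),
    "median": statistics.median(evaluator_grades),
}
for name, value in estimators.items():
    print(f"{name}: {value:.2f}")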

Discussion and conclusions

The experiment’s results support the thesis that the actual processes undertaken by evaluators

differ from the theoretical models presented in the software engineering literature, which assume

an objective and complete overview of quality characteristics (McCall’s, Boehm’s, ISO/IEC 9126, ISO/IEC 25000, etc.). The scale of the experiment was quite large: in many cases software products delivered to the customer are evaluated by smaller teams. However, there are

cases where the evaluation team is more than ten times larger (e.g. when a new core system is

being introduced in a bank the evaluation team consists of representatives from each business

department). The results cannot be directly used to predict evaluators’ behavior in larger or smaller

groups, although following the observation that humans tend to have rather stable behavioral

patterns (Underwood & Shaughnessy, 1975) it could be expected that the effect is only slightly

affected by the size of a group. The explanation of this effect can be based on the observation that

people are convinced that objects do not change their properties rapidly (Nęcka, Orzechowski, &

Szymura, 2008).

During the experiment, no difference in quality level assessment resulting from the additional motivation of subjects, as in Baron’s experiment (Baron, Vandello, & Brunsman, 1996), was observed.

undertaken regularly by evaluators. Testers who verify software in large organizations are

pressured from two sides (compare (Simon, 1956)). If they declare a malfunction, then they

assume the risk of personal consequences if no malfunction actually occurred. On the other hand,

if the malfunction is not noted then they risk facing even more severe consequences. In such a

situation, additional motivation is far too weak to have any significant influence on the process.

Regarding the research question, this experiment has shown that the quality level of sequential

versions influenced the quality level assessed for the final version of the software. However, there is no evidence that, in a professional environment, users’ motivation influences the software quality assessment process.

The experiment regarding secondary perception has shown that, when negative information is received, the opinion of the information recipient is worse than the simple average of the analyzed grades. In fact, the most effective estimator among the tested simple estimators was the harmonic mean of the evaluators’ grades. It should be noted that this estimator is the one giving the lowest value among those tested.

It has to be noted that a floor effect may have occurred during the experiment; a more precise effect size estimation could have been calculated had it not. Strictly speaking, the negative anchor of the scale should not have been used by the evaluators even if the product was hardly operating. However,

when evaluators got frustrated with the product quality (when they were forced to continue testing

even though the application had an unacceptable quality level), their frustration led to the

assignment of the most critical grade without any analysis of its accuracy for the situation. This is

also an important conclusion regarding the escalation of negative opinions during evaluation.

In this article, the influence of personal, experience-based knowledge on the evaluation of software quality was investigated. Professional evaluators working in testing teams

were chosen to simulate the process taking place in organizations acquiring software and

evaluating its quality. The results may not be applicable to the non-professional evaluation of

software products (e.g. computer games).

Application and future work

The described experiment presents a study of the influence of knowledge on software quality perception. Although the results extend the evaluation process models presented in the software engineering literature, this research is only the beginning of a descriptive analysis of the software customer.

However, these results demonstrate the importance of proper communication between the

project team and the customer regarding software quality. Unlike normative models, the actual

quality assessment process is neither objective nor supplied with complete information about the

product. The customer makes their judgment even though they may or may not be aware that they

are making a biased judgment or that they are overlooking important information about the

product. How should these results be used by software project managers? Software project

managers not only have to care about product quality, but they also have to build up an image of a

high quality product.

Building an image of a high quality product may be perceived as a complex task. However,

project managers can adopt a simple set of activities to improve customer satisfaction, for

example:

• They must not display components of low or questionable quality

• They should adopt a method of communication which will emphasize the vendor’s

professionalism

• They should underline in all written documents their personal professionalism and quality-oriented approach, etc.

The expected results, as in the described experiment, include the improvement of customer

satisfaction and the increased probability of the product’s acceptance.

The behavioral approach allows the vendor to improve customer satisfaction in a highly cost-effective manner. It also generates advice regarding what must not be done if a vendor does not want to decrease customer satisfaction.

The area investigated in this experiment is just an introduction to the behavioral approach to understanding and influencing software quality perception.

There may exist several areas of influence which affect the quality level perceived by customers

and users. Behavioral economists have reported an extensive list of biases which have the potential

to influence the communication and final results of software projects. The identification,

classification and modeling of the adoption patterns of descriptive methods pose a challenge for

upcoming years. Behavioral economics has proven its usefulness for the understanding of market

decisions, and the application of its achievements and research methods in software engineering

opens new possibilities for understanding and satisfying customers.

It is expected that a code of good practice for companies delivering software will be developed,

following an investigation of several phenomena. The results of the experiment presented in this

article represent just one bias affecting quality perception. Other influences need to be

investigated.

The behavioral approach will not replace software engineering methods aiming to obtain a good product, although it will advise on how to properly deliver a good product and how to increase the probability of its acceptance.

References

Abramowicz, W., Hofman, R., Suryn, W., & Zyskowski, D. (2008). SQuaRE based Web Services

Quality Model. International Conference on Internet Computing and Web Services. Hong

Kong: International Association of Engineers.

Akerlof, G. (1970). The Market for 'Lemons': Quality Uncertainty and the Market Mechanism.

Quarterly Journal of Economics (84).

Baron, R., Vandello, J., & Brunsman, B. (1996). The forgotten variable in conformity research:

Impact of task importance on social influence. Journal of Personality and Social Psychology,

71 (5).

Basili, V. (1993). The experimental paradigm in software engineering. In D. Rombach, V. Basili, & R. Selby, Lecture Notes in Computer Science. Springer-Verlag.

Basili, V. (2007). The role of controlled experiments in software engineering research. In V.

Basili, Empirical Software Engineering Issues. Berlin: Springer-Verlag.

Becker, G. (1968). Crime and punishment: An Economic Approach. Journal of Political Economy.

Boehm, B., Brown, J., Lipow, M., & MacCleod, G. (1978). Characteristics of software quality.

New York: American Elsevier.

Boring, E. (1950). A History of Experimental Psychology (Second Edition ed.). Prentice-Hall.

Brookshire, D., & Coursey, D. (1987). Measuring the Value of a Public Good: An Empirical

Comparison of Elicitation Procedures. American Economic Review , 77 (4).

Camerer, C., & Hogarth, R. (1999). The Effects of Financial Incentives in Experiments: A Review

and Capital-Labor-Production Framework. Journal of Risk and Uncertainty , 19.

Camerer, C., & Loewenstein, G. (2003). Behavioral Economics: Past, Present, Future

(introduction for Advances in Behavioral Economics). Mimeo: Carnegie Mellon University.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Second Edition).

Hillsdale, NJ: Erlbaum.

Hofman, R. (2010). Software quality perception. In Advanced Techniques in Computing Sciences

and Software Engineering, Khaled Elleithy (ed.), Springer 2010, ISBN: 978-90-481-3659-9.

Kahneman, D., & Tversky, A. (1979). Prospect theory: an analysis of decision under risk. Econometrica (47).

Khaleefa, O. (1999). Who Is the Founder of Psychophysics and Experimental Psychology?

American Journal of Islamic Social Sciences (16).

Levitt, S., & List, J. (2008). Field Experiments in Economics: The Past, The Present, and The

Future. Cambridge: National Bureau of Economic Research Working Paper Series.

List, J. A. (2004). Neoclassical Theory Versus Prospect Theory: Evidence From The Marketplace. Econometrica (72).

McCall, J., Richards, P., & Walters, G. (1977). Factors in Software Quality. Griffiss Air Force Base, NY: Rome Air Development Center, Air Force Systems Command.

Miguel, P., daSilva, M., Chiosini, E., & Schützer, K. (2007). Assessment of service quality

dimensions: a study in a vehicle repair service chain. POMS College of Service Operations

and EurOMA Conference New Challenges in Service Operations. London.

Mook, D. (1983). In defense of external invalidity. American Psychologist (38).

Nęcka, E., Orzechowski, J., & Szymura, B. (2008). Psychologia poznawcza [Cognitive psychology]. Warszawa: Wydawnictwo Naukowe PWN.

Nobel Foundation. (2009). Nobelprize.org. Retrieved July 27, 2009, from All Laureates in Economics: http://nobelprize.org/nobel_prizes/economics/laureates/

Osgood, C., Suci, G., & Tannenbaum, P. (1957). The measurement of meaning. Urbana, IL:

University of Illinois Press.

Pressman, R. (1992). Software Engineering. A Practitioner's Approach. McGraw Hill.

Shaughnessy, J., Zechmeister, E., & Zechmeister, J. (2005). Research Methods in Psychology

(Seventh edition). McGraw-Hill.

Simon, H. (1956). Rational choice and the structure of the environment. Psychological Review (63).

Smith, A. (1776). An Inquiry into the Nature and Causes of the Wealth of Nations.

Smith, A. (1759). The theory of moral sentiments. London: A. Millar.

Stavrinoudis, D., Xenos, M., Peppas, P., & Christodoulakis, D. (2005). Early Estimation of Users'

Perception of Software Quality. Software Quality Journal , 13.

Steenkamp, J., Wierenga, B., & Meulenberg, M. (1986). Kwaliteitsperceptie van voedingsmiddelen, deel 1 [Quality perception of food products, part 1]. Den Haag: Swoka.

Suryn, W., & Abran, A. (2003). ISO/IEC SQuaRE. The second generation of standards for software product quality. IASTED 2003.

Tversky, A., & Kahneman, D. (1982). Judgment under uncertainty: heuristics and biases. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: heuristics and biases. Cambridge: Cambridge University Press.

Tversky, A., & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science

(185).


Underwood, B., & Shaughnessy, J. (1975). Experimentation in psychology. New York: Wiley.

Xenos, M., & Christodoulakis, D. (1995). Software quality: The user's point of view. In M. Lee, B.

Barta, & P. Juliff, Software Quality and Productivity: Theory, practice, education and

training. Chapman and Hall Publications.

Zelenski, J. (2007). The Role of Personality in Emotion, Judgment and Decision Making. In K. Vohs, R. Baumeister, & G. Loewenstein (Eds.), Do Emotions Help or Hurt Decision Making? Russell Sage Foundation.