2. Measurement

7/28/2019 2. Measurement

1/43

Measurement


2/43

Measurement, test, evaluation

Measurement: the process of quantifying the characteristics of persons according to explicit procedures and rules

Quantification: assigning numbers, as distinguished from qualitative descriptions

Characteristics: mental attributes such as aptitude, intelligence, motivation, field dependence/independence, attitude, native language, fluency

Rules and procedures: observations must be replicable in other contexts and with other individuals.


3/43

Test

Carroll: a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual

Elicitation: to obtain a specific sample of behavior

Interagency Language Roundtable (ILR) oral interview: a test of speaking consisting of (1) a set of elicitation procedures, including a sequence of activities and sets of question types and topics; (2) a measurement scale of language proficiency ranging from a low level of 0 to a high level of 5.


4/43

Test?

Years of informal contact with a child used to rate the child's oral proficiency: the rater did not follow the procedures

A rating based on a collection of personal letters to indicate an individual's ability to write effective argumentative editorials for a news magazine.

A teacher's rating based on informal interactive social language use to indicate the student's ability to use language to perform various cognitive/academic language functions.


5/43

Evaluation

Definition: the systematic gathering of information for the purpose of making decisions.

Evaluation need not be exclusively quantitative: verbal descriptions, performance profiles, letters of reference, overall impressions


6/43

Relation Between Evaluation, Test, and Measurement


7/43

Relation Between Evaluation, Test, and Measurement

1. qualitative descriptions of student performance for diagnosing learning problems

2. teacher's ranking for assigning grades

3. achievement test to determine student progress

4. proficiency test as a criterion in second language acquisition research

5. assigning code numbers to subjects in second language research according to native language


8/43

What Is It: Measurement, Test, or Evaluation?

placement test

classroom quiz

grading of composition

rating of classroom fast reading exercise

rating of dictation


9/43

Measurement Qualities

A test must be reliable and valid.


10/43

Reliability

Free from errors of measurement.

If a student takes a test twice within a short time, and if the test is reliable, the results of the two administrations should be the same.

If two raters rate the same writing sample, the ratings should be consistent if they are to be reliable.

The primary concern in examining reliability is to identify the different sources of error, then to use the appropriate empirical procedures for estimating the effect of these sources of error on test scores.
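The test-retest idea above can be sketched with a correlation between two sittings of the same test. This is a minimal illustration, not a procedure prescribed by the slides: the choice of the Pearson correlation as the consistency index, the function name `pearson_r`, and the scores are all assumptions made for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores of five students on two sittings of the same test.
first_sitting = [12, 15, 9, 18, 14]
second_sitting = [13, 15, 10, 17, 14]

# A value close to 1 suggests the two sittings rank students consistently.
retest_r = pearson_r(first_sitting, second_sitting)
print(round(retest_r, 3))
```

The same function could be applied to two raters' ratings of the same writing samples, with the caveat that ratings on an ordinal scale may call for a rank-based statistic instead.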


11/43

Validity

Validity: the extent to which the inferences or decisions made on the basis of test scores are meaningful, appropriate and useful.

The test should measure the ability and very little else.

If a test is not reliable, it is not valid.

Validity is a quality of test interpretation and use.

The investigation of validity is both a matter of judgment and of empirical research.


12/43

Reliability and Validity

Both are essential to the use of tests.

Neither is a quality of tests themselves: reliability is a quality of test scores, while validity is a quality of the interpretations or uses that are made of test scores.

Neither is absolute: we can never attain perfectly error-free measures, and the appropriateness of a particular use of a test score depends upon many factors outside the test itself.


13/43

Properties of Measurement Scales

4 properties

distinctiveness: different numbers are assigned to persons with different values of the attribute

ordered in magnitude: the larger the number, the larger the amount of the attribute

equal intervals: equal differences between numbers correspond to equal differences between ability levels

absolute zero point: zero represents the absence of the attribute


14/43

Four Types of Scales

Nominal: naming classes or categories.

Ordinal: placing categories in an order with respect to each other.

Interval: the distances between the levels are equal.

Ratio: includes the absolute zero point.


15/43

Nominal

Examples: license plate numbers; Social Security numbers; names of people, places, objects; numbers used to identify football players

Limitations: Cannot specify quantitative differences among categories


16/43

Ordinal

Examples: Letter grades (ratings from excellent to failing), military ranks, order of finishing a test

Limitations: Restricted to specifying relative differences without regard to absolute amount of difference


17/43

Interval

Examples: Temperature (Celsius and Fahrenheit), calendar dates

Limitations: Ratios are meaningless; the zero point is arbitrarily defined
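A short worked example may make the limitation concrete. The sketch below compares the same two temperatures on the Celsius, Fahrenheit and Kelvin scales; the two temperature values and the variable names are chosen only for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin (a ratio scale with a true zero)."""
    return c + 273.15


cold_c, warm_c = 10.0, 20.0  # illustrative temperatures

# The "ratio" of the same two temperatures depends on the arbitrary zero point.
ratio_c = warm_c / cold_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

# Differences, by contrast, agree up to the size of the unit (1 C = 1.8 F).
diff_c = warm_c - cold_c
diff_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cold_c)

print(ratio_c, ratio_f, round(ratio_k, 4))
print(diff_c, diff_f)
```

Because the Celsius and Fahrenheit ratios disagree, saying that 20 degrees is "twice as hot" as 10 degrees has no meaning on an interval scale; only the Kelvin ratio, taken from an absolute zero, is interpretable.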


18/43

Ratio

Examples: Distance, weight, temperature in kelvin, time required to learn a skill or subject

Limitations: None except that few educational variables have ratio characteristics


19/43

Nominal, Ordinal, Interval or Ratio?

5 in IELTS

550 in TOEFL

C in BEC

8 in CET-4 writing

58 in the final evaluation of a student


20/43

Property and Type of Scale

                               Type of Scale
Property               Nominal  Ordinal  Interval  Ratio
Distinctiveness           +        +        +        +
Ordering                  -        +        +        +
Equal intervals           -        -        +        +
Absolute zero point       -        -        -        +
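The table can also be read as a lookup from a set of properties to a scale type. The following is a minimal sketch of that reading; the dictionary `SCALE_PROPERTIES` and the helper names are hypothetical and exist only for this illustration.

```python
# Each scale type mapped to the properties marked "+" in the table.
SCALE_PROPERTIES = {
    "nominal": {"distinctiveness"},
    "ordinal": {"distinctiveness", "ordering"},
    "interval": {"distinctiveness", "ordering", "equal_intervals"},
    "ratio": {"distinctiveness", "ordering", "equal_intervals", "absolute_zero"},
}


def has_property(scale, prop):
    """True if the given scale type has the given property."""
    return prop in SCALE_PROPERTIES[scale]


def scale_type(properties):
    """Return the scale type whose property set matches exactly, else None."""
    wanted = set(properties)
    for name, props in SCALE_PROPERTIES.items():
        if props == wanted:
            return name
    return None


print(scale_type({"distinctiveness", "ordering"}))
print(has_property("interval", "absolute_zero"))
```

Note that the properties are cumulative: each scale type has every property of the type before it plus one more, which is why a set with a gap (for example ordering without distinctiveness) matches no scale.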


21/43

Limitations in Measurement

It is essential for us to understand the characteristics of measures of mental abilities and the limitations these characteristics place on our interpretation of test scores.

These limitations are of two kinds: limitations in specification and limitations in observation and quantification.


22/43

Limitation in Specification

Two levels of the specification of language ability

Theoretical level

Task: we need to specify the ability in relation to, or in contrast to, other language abilities and other factors that may affect test performance.

Reality: the large number of different individual characteristics (cognitive, affective, physical) that could potentially affect test performance makes the task nearly impossible.


23/43

Limitation in Specification

Operational level

Task: we need to specify the instances of language performance that serve as indicators of the ability we wish to measure.

Reality: the complexity of, and the interrelationships among, the factors that affect performance on language tests force us to make simplifying assumptions in designing language tests and interpreting test scores.


24/43

Conclusion

Our interpretations and uses of test scores will be of limited validity.

Any theory of language test performance we develop is likely to be underspecified, and we have to rely on measurement theory to deal with the problem of underspecification.


25/43

Limitations in Observation and Quantification

All measures of mental ability are indirect, incomplete, imprecise, subjective and relative.


26/43

Indirectness

The relationship between test scores and the abilities we want to measure is indirect. Language tests are indirect indicators of the underlying traits in which we are interested. Because scores from language tests are indirect indicators of ability, the valid interpretation and use of such scores depends crucially on the adequacy of the way we have specified the relationship between the test score and the ability we believe it indicates. To the extent that this relationship is not adequately specified, the interpretations and uses made of the test score may be invalid.


27/43

Incompleteness

The performance we observe and measure in a language test is a sample of an individual's total performance in that language.


28/43

Incompleteness

Since we cannot observe an individual's total language use, one of our main concerns in language testing is assuring that the sample we do observe is representative of that total use, a potentially infinite set of utterances, whether written or spoken.


29/43

Incompleteness

It is vitally important that we incorporate into our measurement design principles or criteria that will guide us in determining what kinds of performance will be most relevant to and representative of the abilities we want to measure, for example, real-life language use.


30/43

Imprecision

Because of the nature of language, it is virtually impossible (and probably not desirable) to write tests with 'pure' items that test a single construct or to be sure that all items are equally representative of a given ability. Likewise, it is extremely difficult to develop tests in which all the tasks or items are at the exact level of difficulty appropriate for the individuals being tested.


31/43

Subjectivity

As Pilliner (1968) noted, language tests are subjective in nearly all aspects.

Test developers

Test writers

Test takers

Test scorers


32/43

Relativeness

The presence or absence of language abilities is impossible to define in an absolute sense.

The concept of 'zero' language ability is a complex one.

The individual with absolutely complete language ability does not exist.

All measures of language ability based on domain specifications of actual language performance must be interpreted as relative to some 'norm' of performance.


33/43

Steps in Measurement

Three steps

1. identify and define the construct theoretically

2. define the construct operationally

3. establish procedures for quantifying observations


34/43

Defining Constructs Theoretically

Historically, there were two distinct approaches to defining language proficiency.

Real-life approach: language proficiency itself is not defined, but a domain of actual language use is identified.

The approach assumes that if we measure features present in language use, we measure language proficiency.


35/43

Real Life Approach: Example

American Council on the Teaching of Foreign Languages (ACTFL): definition of advanced level

Able to satisfy the requirements of everyday situations and routine school and work requirements. Can handle with confidence but not with facility complicated tasks and social situations, such as elaborating, complaining, and apologizing. Can narrate and describe with some details, linking sentences together smoothly. Can communicate facts and talk casually about topics of current public and personal interest, using general vocabulary.


36/43

Interactional/ability Approach

Language proficiency is defined in terms of its component abilities. These components can be the skills of reading, writing, listening and speaking (Lado), a functional framework (Halliday), or communicative frameworks (Munby).


37/43

Example of Pragmatic Competence

The knowledge necessary, in addition to organizational competence, for appropriately producing or comprehending discourse. Specifically, it includes illocutionary competence, or the knowledge of how to perform speech acts, and sociolinguistic competence, or the knowledge of the sociolinguistic conventions which govern language use.


38/43

Defining Constructs Operationally

This step involves determining how to isolate the construct and make it observable.

We must decide what specific procedures we will follow to elicit the kind of performance that will indicate the degree to which the given construct is present in the individual.

The context in which the language testing takes place influences the operations we would follow.

The test must elicit language performance in a standard way, under uniform conditions.


39/43

Quantifying Observations

The units of measurement of language tests are typically defined in two ways.

1. points or levels of language performance.

From zero to five in an oral interview

Different levels in mechanics, grammar, organization, content in writing

Mostly an ordinal scale, therefore needing appropriate statistics for ordinal scales.

2. the number of tasks successfully completed


40/43

Quantifying Observations

2. the number of tasks successfully completed

We generally treat such a score as one with an interval scale.

Conditions for an interval scale


41/43

Quantifying Observations

The performance tasks must be defined and selected in a way that enables us to determine their relative difficulty and the extent to which they represent the construct being tested.

Relative difficulty: determined from the statistical analysis of responses to individual test items.

How well they represent the construct: depends on the adequacy of the theoretical definition of the construct.
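One simple statistical analysis of item responses is the proportion of test takers who answer each item correctly, often called item facility. The slides do not say which statistic is intended, so treating item facility as the measure of relative difficulty is an assumption here, as are the function name `item_facility` and the response data.

```python
def item_facility(responses):
    """Proportion of test takers answering each item correctly.

    `responses` holds one row per test taker; each row holds one entry
    per item, scored 1 for correct and 0 for incorrect.
    """
    if not responses:
        raise ValueError("need at least one test taker")
    n_items = len(responses[0])
    if any(len(row) != n_items for row in responses):
        raise ValueError("every test taker must have a score for every item")
    n_takers = len(responses)
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


# Hypothetical responses of four test takers to four items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]

facility = item_facility(responses)  # higher value = easier item
total_scores = [sum(row) for row in responses]  # number of tasks completed

print(facility)
print(total_scores)
```

In this made-up data the first item is answered correctly by everyone and the third by only one test taker, so the third item would be judged the most difficult.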


42/43

Score Sorting

Raw score

Score Class

1. Range

2. Number of groups: $K = 1.87(N-1)^{2/5}$, where $N$ is the number of scores

3. Interval: $I = R/K$, where $R$ is the range

4. Highest and lowest limits of each group

5. Arrange the data into groups
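The five steps can be sketched as a small grouping routine. This is only one way to carry them out: it assumes whole-number scores, rounds K to the nearest integer and rounds the interval up to a whole number, none of which is specified on the slide. The function name `group_scores` and the scores are made up for the example.

```python
from math import ceil


def group_scores(scores):
    """Group integer scores into equal-width classes.

    Returns a list of (lower_limit, upper_limit, count) tuples.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    low, high = min(scores), max(scores)
    score_range = high - low                    # step 1: range R
    k = max(1, round(1.87 * (n - 1) ** 0.4))    # step 2: number of groups K
    width = max(1, ceil(score_range / k))       # step 3: interval I = R / K
    groups = []
    lower = low
    while lower <= high:
        upper = lower + width - 1               # step 4: limits of the group
        count = sum(1 for s in scores if lower <= s <= upper)
        groups.append((lower, upper, count))    # step 5: tally the group
        lower = upper + 1
    return groups


# Hypothetical raw scores of twenty students.
scores = [41, 45, 52, 56, 58, 61, 63, 65, 67, 70,
          72, 74, 75, 78, 80, 83, 85, 88, 91, 95]

for lower, upper, count in group_scores(scores):
    print(f"{lower}-{upper}: {count}")
```

Because the limits are inclusive whole numbers, the routine can produce one more group than K when the highest score falls exactly on a boundary; a different rounding convention would give slightly different classes.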


43/43

Central Tendency & Dispersion

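Mean: $\bar{x} = \sum x / N$

Median: the middle score when the scores are arranged in order

Mode: the most frequent score, around which the bulk of the data congregate

Variance: $V = \sum (x - \bar{x})^2 / (N - 1)$

Standard deviation: $S = \sqrt{\sum (x - \bar{x})^2 / (N - 1)}$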
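The statistics above can be computed directly from a list of raw scores. The sketch below writes the variance out by hand with the N - 1 denominator and uses Python's standard `statistics` module for the rest; the scores are invented for illustration.

```python
import statistics
from math import sqrt

# Hypothetical raw scores of seven students.
scores = [55, 60, 60, 70, 75, 80, 90]

n = len(scores)
mean = sum(scores) / n                       # sum of scores divided by N
median = statistics.median(scores)           # middle score of the ordered list
mode = statistics.mode(scores)               # most frequent score

# Variance with the N - 1 denominator, written out term by term.
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
std_dev = sqrt(variance)

print(mean, median, mode)
print(round(variance, 2), round(std_dev, 2))
```

The hand-written variance should agree with `statistics.variance`, which also divides by N - 1; `statistics.pvariance` divides by N instead and would give a smaller value.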
