Designing LSP tests LSP TESTING: Good Practice Procedure Pointers for good practice

Executive summary


Ideally a test is written by a small team of writers and reviewers. Subject specialists should be consulted for specific knowledge.

Before a test is ready for administration, it passes through four phases of development during which feasibility, validity, authenticity and reliability are determined.

When composing an LSP test, it is important to make sure that the skill-based tasks are representative of professional practice.

Test tasks are ideally composed of a rubric and subject-specific input materials, both of which contribute to authenticity.

Quality control of a test is an essential step in the writing process. A number of qualitative and quantitative procedures are available.

Contents

Test writing

Test timing

Test content

Test tasks

Test analysis

Further reading

Test writers


The people involved in test development:

Test writers: Ideally, two or more people are responsible for designing, writing and revising the test

Test reviewers: One or two reviewers give feedback to the writers

Subject specialists: People who work in the LSP subject field give input on test content and goals

Representative end users: They resemble the actual test-taking population as closely as possible and make up the sample population in test piloting

Test timing


Phase 1: Planning

Test writers decide on test goals, test format and test content

Phase 2: Design

Test writers collect material and compose a first draft

Test reviewers evaluate and rewrite the draft

Phase 3: Development

Final draft is piloted with a group of representative test takers

Final draft is adjusted based on qualitative and quantitative conclusions from pilot

Phase 4: Live test

Test is ready for use




Phase 1: Planning

Test writers decide on test goals, test format and test content

Goal: Ideally, the test goals should be linked to professional reality. To ensure this, subject specialists might be consulted. Getting an idea of their routine language tasks may help in drawing up the test goals. Research has shown that non-representative tasks cause irritation or uncertainty among test takers.

Format: Although computer-based tests are as reliable as paper-based ones, their limitations and possibilities differ from those of their paper-based counterparts.

Task type: Depending on the language level, the test goals, the format and the time available, various task types are available.




Phase 2: Design

Test writers collect material and compose a first draft

Test reviewers evaluate and rewrite the draft

Collecting material: In collaboration with subject specialists, the writers collect authentic and representative material.

First draft: Ideally, the first draft contains a large number of tasks and task types, so that the reviewers can select which tasks are carried forward to the next phase.

Evaluation: The reviewers compare the draft to the test goals. They check the test for validity (i.e. do we test what we want to test?) and authenticity (i.e. does this test represent realistic interactions and situations?) and suggest revisions.




Phase 3: Development

The final test draft is now piloted: a group of representative end users take the test in conditions similar or identical to the live test setting. Ideally 30 to 50 respondents are used.

This pilot offers information on the following aspects of the test:

- authenticity: to what extent does the test include situations and interactions that are meaningful or representative for the test taker?

- validity: to what extent does the test measure what it is meant to measure?

- reliability: to what extent do test scores reflect actual language ability?

- feasibility: is the test practical in terms of, for example, timing and rating?

Note: If representative end users cannot be reached, a group of colleagues can also be used. In this case, the test’s reliability cannot be determined, but the feedback will be useful nonetheless.




Phase 4: Live test

When the test has been adjusted based on the conclusions from the third phase, it can move on to the final phase: live testing.

The test is ready for use. If any remarks still arise during administration, they are reported to the test writers, who keep them in mind for later versions.

Test content


The specificity of content in LSP tests is the cause of many debates among researchers.

One end of the spectrum holds that content and tasks cannot be too specific, whereas the other extreme argues that LSP testing does not make sense, since all language has a specific purpose.

Example:

A test of English for biomedical science should not use the same material as a test of English for the humanities, even if the required language proficiency is identical.

Both sides of the debate do agree on the importance of face validity (i.e. how do test takers perceive a test; how representative do they feel it is).

Interviews with representative end users also point to the importance of using familiar material dealing with familiar topics.



For skill-based exercises the importance of authenticity cannot be overstressed.

Make sure that both the content and the context of the task relate to the professional reality of the specific purpose.

Example:

Whereas writing a reflective essay might be a representative task within the humanities, it is alien to the biomedical sciences.

Test takers will respond negatively to tasks they perceive as non-representative.

Task content: Ask the students to write, speak or read about something within their professional field of expertise.

Task context: Clarify the context in which a communicative act is taking place as accurately as possible. If you ask students to present at a conference, state which one and give ample information regarding the setting and audience.



For knowledge-related exercises, such as grammar or vocabulary exercises, the context does not appear to matter as much as for skill-based tasks.

Face validity is an important element here as well, though: the texts, examples and stimuli should be related to the test takers’ field of expertise.

Quote:

“LSP testing cannot be about testing for subject specific knowledge. It must be about testing for the ability / abilities to manipulate language functions appropriately in a wide variety of ways. […] No doubt for face validity reasons, the stimuli in such tests will be field related, however.”

(Davies, 2001)



The role of subject specialists

Using the expertise of subject specialists is a much contested theme in LSP testing research.

In any case, when designing a test for specific purposes within a field outside of your expertise, it is always useful to get in touch with people who are in tune with the specific purpose. They will be able to tell you which tasks and texts are representative and which aren’t.

Test tasks


There is a myriad of possible task types that can be used in LSP testing. However, since this overview is restricted to online testing with Curios, the next pages only cover the task types that are available through Curios.

To ensure task completeness, Bachman & Palmer (1996) and Douglas (2001) suggest including the following elements in each task.

Rubric: “Characteristics that specify how test takers are expected to proceed in taking the test.” (Bachman, 1990)

- Objective: The goal of the test task, e.g. to write a text summary.

- Procedure for responding: Information on how the test taker is expected to respond (e.g. checking boxes, writing full sentences...).

- Structure: Information on the number of tasks in the communicative event, their importance, and the degree of distinction among them.

- Time allotment: The test taker is told how much time can be spent on the task.

- Evaluation criteria: The explicit information concerning the criteria that will be used to judge the performance. (Douglas, 2000)



Input: The specific purpose material in the TLU (target language use) situation that language users process and respond to.

- Prompt: Contextual information that clarifies the setting in which a communicative event takes place.

- Input data: The data to be processed during a (communicative) task. Here, the degree of authenticity of the material matters a great deal.

  - Situational authenticity: to what extent does the situation in the test represent reality?

  - Interactional authenticity: to what extent does the communicative act from the test correspond to reality?


Expected response: What the test developer intends the test takers to do in response to the rubric and input. If the actual response and the intended response do not generally match, the test task is most probably unclear.

Assessment criteria: The criteria and procedures by which a language performance is judged, and the scores that are derived from that judgement.

Also according to Bachman & Palmer (1996), the rater’s manual should include at least the following:



Example 3: assessment criteria (taken from TOEFL iBT)

Task types and Curios


Curios is the Ghent University online testing environment. It can be accessed through Minerva and Zephyr and allows for the following multiple choice task types:

Single response: A multiple choice task where only one option is possible.

Multiple response: A multiple choice task where more than one option is possible.

True/False: The student is given a statement and should indicate whether it is correct or not.

Matching: A multiple choice task in which the test taker combines two or more items.

Hotspot: The test taker digitally pinpoints or highlights areas on a picture or in a text.



Curios also allows for the following open answer task types:

Text/numeric: As an answer to a question, students fill in words, short sentences or numbers.

Cloze: In a running text, one or more words or numbers have been deleted. It is up to the test taker to fill in the gaps.

C-Cloze: A cloze test which includes the first letter of each deleted word.

Extended: Students can be asked to reply to an open question or to produce longer answers.
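The difference between a cloze and a C-cloze item can be illustrated with a short Python sketch. It is not how Curios builds its items. The function name make_cloze, the rule of deleting every fifth word and the underscore notation for gaps are all assumptions made for the sake of the example.

```python
import re


def make_cloze(text, every=5, keep_first_letter=False):
    """Blank out every `every`-th word of `text`.

    Returns the gapped text and the list of deleted words (the answer key).
    With keep_first_letter=True the first letter of each deleted word is
    kept, as in the C-cloze described above.
    The fixed deletion interval is an assumption, not a Curios rule.
    """
    answers = []
    count = 0

    def blank(match):
        nonlocal count
        word = match.group(0)
        count += 1
        if count % every != 0:
            return word  # this word stays in the text
        answers.append(word)
        if keep_first_letter:
            return word[0] + "_" * (len(word) - 1)
        return "_" * len(word)

    gapped = re.sub(r"[A-Za-z]+", blank, text)
    return gapped, answers


text = "The test taker fills in every gap in the running text."
print(make_cloze(text, every=5))
print(make_cloze(text, every=5, keep_first_letter=True))
```

In practice a test writer would usually choose the gaps by hand, so that they target the vocabulary or structures the test is meant to measure.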

Getting started with Curios


Step 1: Accessing Curios. Access Curios via Zephyr or Minerva.

Step 2: “Nieuwe vragenreeks” (new question series). Each new test starts with this.

Step 3a: “Nieuwe vraag” (new question), MC. Creating a multiple choice question.

Step 3b: “Nieuwe vraag” (new question), cloze. Creating a cloze question.


Step 4: Double checking the scoring. Always double check the questions using “geavanceerde editeermethode” (advanced editing method).

Step 5: Publishing the test. Students can only access tests through Minerva or Zephyr.

Step 6: Taking the test. Try a test before making it public.

Step 7: Checking the results. Have a look at the scores.

Test analysis


Determining the quality of a language test should take place in the third phase of development, but it should also be a persistent concern of test developers.

When the test has been piloted, qualitative and quantitative analyses can help to improve the reliability and validity of the live test.

In the case of LSP tests, three concepts are of vital importance:

1. RELIABILITY: reliable scores reflect one’s ability

2. VALIDITY: valid questions test what is intended to be tested

3. AUTHENTICITY: authentic tests reflect real-life interactions and situations


Reliability

An efficient way to check a test’s reliability is to perform an Item Reliability Analysis. This statistical procedure indicates the discriminating potential of a test item. In other words, it checks to what degree more able students get a hard item right and less able students do not.

The graph on the right shows the statistical data resulting from a reliability analysis. The column showing the Corrected Item-Total Correlation indicates the reliability of each item.

Items with a Corrected Item-Total Correlation between -.3 and .3 are considered unreliable and should be removed or rewritten.
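For readers who want to see what the Corrected Item-Total Correlation measures, the following Python sketch computes it by hand: each item is correlated with the total score on all other items. The score matrix is invented for illustration and the function names are our own. The sketch flags every item below .3, which also catches strongly negative correlations. That goes slightly beyond the range given above and is an assumption on our part.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # no variation, correlation undefined
    return cov / sqrt(vx * vy)


def corrected_item_total(scores):
    """scores: one row per test taker, one 0/1 column per item.

    For each item, correlate the item score with the total score
    on the remaining items (the item itself is left out).
    """
    n_items = len(scores[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        result.append(pearson(item, rest))
    return result


def flag_items(scores, threshold=0.3):
    """Indices of items whose correlation is below the threshold (or undefined)."""
    return [i for i, r in enumerate(corrected_item_total(scores))
            if not r >= threshold]


# Invented pilot data: 6 test takers, 4 items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(corrected_item_total(scores))
print(flag_items(scores))  # item index 3 is flagged for this data
```

With so few respondents the figures are only illustrative. Real pilot data, ideally from 30 to 50 respondents, would be needed for a meaningful analysis.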




Reliability: How to perform an Item Reliability Analysis

Enter the results of all test takers on all test items in SPSS (available on Athena). The easiest way to do this is to assign a score of 1 to a correctly answered question and 0 to an incorrect answer.

Click here for information on quantitative test analysis.

Next, click Analyze – Scale – Reliability analysis and indicate the items you want an analysis for.

Do not forget to check the box which states “scale if item deleted”.

Please click here for a clip on performing an Item Reliability Analysis.
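As far as we know, the “scale if item deleted” option adds item-total statistics to the SPSS output, including Cronbach's Alpha if Item Deleted. The sketch below shows in Python what that figure means, using the standard formula for Cronbach's alpha on a 0/1 score matrix. The data are invented and the function names are our own. It is not a replacement for the SPSS procedure.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (rows = test takers, columns = items).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(scores):
    """Alpha recomputed with each item left out in turn."""
    k = len(scores[0])
    return [cronbach_alpha([row[:i] + row[i + 1:] for row in scores])
            for i in range(k)]


# Invented pilot data: 6 test takers, 4 items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 3))
print([round(a, 3) for a in alpha_if_item_deleted(scores)])
```

If alpha rises clearly when an item is left out, that item is lowering the internal consistency of the test and is a candidate for removal or rewriting.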




Validity

Validity is the extent to which scores on a test enable inferences to be made which are appropriate, meaningful and useful, given the purpose of the test (i.e. does the test measure what it intends to measure?).

There are various subclassifications of validity. The most significant ones in this LSP testing project are:

A test has construct validity if scores reflect a theory about a construct. It could be predicted, for example, that two valid tests of listening comprehension would rank learners in the same way, but each would have a weaker relationship with scores on a test of grammatical competence.

A test is said to have content validity if the items or tasks of which it is made up constitute a representative sample of items or tasks for the area of knowledge or ability to be tested. These are often related to a syllabus or course.

Face validity refers to the extent to which a test appears to candidates, or those choosing it on behalf of candidates, to be an acceptable measure of the ability they wish to measure. This is a subjective judgement rather than one based on any objective analysis of the test.
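The construct validity prediction above (two listening tests rank learners in the same way, while a grammar test relates to them more weakly) can be checked with a rank correlation. The Python sketch below uses Spearman's rank correlation, which is our choice of statistic and not one prescribed by the source. All scores are invented for illustration.

```python
from math import sqrt


def ranks(values):
    """Rank values from 1 (lowest) upwards; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented scores for five learners on three tests.
listening_a = [12, 15, 9, 20, 17]
listening_b = [30, 34, 25, 41, 38]
grammar = [20, 18, 19, 25, 22]

print(spearman(listening_a, listening_b))  # the two listening tests
print(spearman(listening_a, grammar))      # listening versus grammar
```

A clearly higher correlation between the two listening tests than between listening and grammar would be in line with the prediction. Five learners are of course far too few to draw conclusions from.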




Determining construct validity implies a thorough knowledge of the construct to be tested.

A construct is a theoretical concept related to linguistic knowledge, e.g. listening comprehension, metacognition or pragmatic competence.

Example: If you mean to test listening skills and ask students to write an essay about an audio sample, are you then testing receptive or productive skills?

Example: If you ask students to type an essay within thirty minutes, are you then testing writing skills or typing speed?




The most effective way of determining an LSP test’s content validity is by having interviews with subject specialists. Various interview types are possible to determine whether the test tasks correspond to reality.

Unstructured: The success of this kind of interview depends on the interaction between the researcher and the respondent. There is no fixed interview schedule, but rather a number of themes that are to be addressed.

Semi-structured: The researcher follows a preset schedule. It is possible, however, to deviate from this when interesting issues arise.

Structured: The interviewer goes through a fixed series of written questions without deviation. This type of interview closely resembles a questionnaire.

One on one: This kind of interview allows the researcher to zoom in on the views of individual respondents.

Group: The advantage of interviewing larger numbers at once is that group interactions might spark observations that would otherwise have gone unnoticed.

Note that the interviewer should get the chance to practice his/her interview skills beforehand. Ideally, the pilot settings will resemble the actual conditions as accurately as possible.




Since face validity is a subjective measure imposed by the test takers, only test takers can be the judge of it. During or after the pilot test, ask the representative end users to give a verbal report. There are various ways of going about this.

Talk aloud: Informants voice their thoughts while taking the test.

Think aloud: Informants say what they are thinking and also provide non-verbal information, such as physical movements.

Concurrent: The verbal report is given in real time.

Retrospective: The verbal report is given afterwards.

Mediated: The researcher occasionally intervenes.

Non-mediated: The researcher does not intervene.

Click here for information on qualitative test analysis.




Authenticity

If you have interviewed subject specialists and representative end users have given a verbal report, you will also have a good understanding of a test’s situational authenticity (the extent to which a test / task represents real situations) and interactional authenticity (the extent to which a test / task represents realistic conversational interactions).

Note that material intended for L1 users does not always appear interactionally authentic to test takers, since it may depict interactions that are meaningless to them.

Example: The abovementioned OET of English for veterinary sciences is not representative of the professional practice of researchers within the field of veterinary sciences. A doctor-patient dialogue is, however, very relevant for students who would like to start working in a practice later.

Further reading


For more information on LSP testing, please consult:

ABRASKEVICIUTE, Ausra et al. (2003). Handbook of LSP Examinations. Tut Press.

BACHMAN, Lyle F. (2000). “Modern Language Testing at the Turn of the Century: assuring that what we count counts”. Language Testing. 17(1)

BROADFOOT, Patricia and Paul Black. (2004). “Redefining Assessment? The first ten years of Assessment in Education”. Assessment in Education. 11(1)

CLAPHAM, C. (2000). “Assessment for academic purposes: where next?”. System. 28(4)

DAVIES, A. (2001). “The logic of testing Languages for Specific Purposes”. Language Testing. 18(2)

DOUGLAS, Dan. (2000). Assessing Languages for Specific Purposes. Cambridge University Press

DOUGLAS, Dan. (2001). “Language for Specific Purposes assessment criteria: where do they come from?”. Language Testing. 18(2)

DOVEY, T. (2006). “What purposes, specifically? Re-thinking purposes and specificity in the context of the ‘new vocationalism’”. English for Specific Purposes. 25(4)

HYLAND, K. (2002). “Specificity revisited: how far should we go now?”. English for Specific Purposes. 21(4)

ROEVER, Carsten. (2001). “Web-Based Language Testing”. Language Learning & Technology. 5(2)
