1/26/2017
CONFIDENTIAL & PROPRIETARY Copyright © 2016 Pearson Education, Inc. or its affiliates. All rights reserved.
March 1, 2017
Lions, Tigers & Validity, Oh My!
Lions, and tigers, and bears, oh, my!
Fundamentals We Will Cover
What is validity and where does it come from?
Is there more than one type of validity?
How do you know if you've achieved validity?
What types of security concerns impact validity in a testing program?
What is validity & where does it come from?
Two questions underlie the concept of validity
1. Does the test measure what it was designed to measure?
2. Are the interpretations drawn from the test legitimate and justifiable?
1. Does the test measure what it was designed to measure?
Why are we testing?
What is the purpose of the test?
What about the test takers needs to be measured to achieve this purpose?
Purpose of testing?
Substitution for other measures (e.g., psychiatrist)
• Depression inventory
Prediction of future behavior (e.g., college entrance)
• Success in the first year of college
Reflection of test-taker performance on content domain (e.g., licensure/certification)
• Assessment of minimal competence
1. Does the test measure what it was designed to measure?
For licensure and certification examinations:
The purpose of testing is to establish whether the candidate has the minimal competence needed for professional practice.
And…
The test is designed to measure the knowledge, skills and abilities (KSAs) needed to successfully perform the tasks one would perform on the job.
These are usually determined through a job analysis study.
2. Are the interpretations drawn from the test legitimate & justifiable?
Purpose → Interpretation
Substitution for other measures (e.g., depression inventory) → The degree to which an individual exhibits the measured personality trait(s)
Prediction of future behaviour (e.g., success at college) → Whether or not the test-taker is likely to persist after their first year of college
Reflection of test-taker performance on content domain (e.g., assessment of minimal competence) → Whether or not the test-taker is competent to practice a profession
Two questions underlie validity
1. Does the test measure what it was designed to measure?
2. Are the interpretations drawn from the test legitimate and justifiable?
Validity is established to the extent that you can answer Yes to these questions
What I’ve NOT said…
“The test is valid”
• It is not the test itself that is valid; it is the interpretations of the test scores that are deemed to be valid or not.
This is an important distinction because…
The validity of this interpretation is derived not just from the test items but from all factors that go into arriving at that score interpretation:
Is the test blueprint an adequate reflection of the job and the tasks individuals would be expected to perform on the job?
Does the content of the test items reflect current, competent practice?
Has the standard for competent practice (i.e., the cut score) been appropriately set?
Is there more than one type of validity?
Validating a Test
Process of gathering evidence to justify the interpretations you want to make from the test scores
In the case of a licensure or certification exam, the question is:
Do the test scores reflect the knowledge, skills, and abilities needed to practice the profession competently?
Evaluating Validity
Types of evidence to gather when validating a test:
Content validity
Construct validity
Criterion-related validity
Content Validity
The test is valid to the extent that it is a representative sample of the behavior domain. Not a random sample!
Includes only relevant topics
Content validity is not evaluated after the test is developed and administered; it is built into the exam from the beginning, starting with the decision about what should be tested and how that decision is made.
Construct Validity
Answers the question: What trait or characteristic is measured by the test?
Real interest is the underlying theory.
Accumulation of evidence (not just one empirical relationship)
Interpretation of test scores: People who pass the test are “minimally competent.”
Not predicting success within the profession per se – only answering the question “Is this person minimally competent?”
Criterion-Related Validity
Typically involves a decision-making process:
Should I use the test to select students for my university?
Should I use the test to select salespeople for my company?
Should I use the test to replace a psychiatric interview?
The test results are valid to the extent that they improve the effectiveness of the decision.
Criterion-Related Validity
Predictive
• Test is used to predict behaviour
• Real interest is the criterion behaviour
• Example: University admissions test is used to predict success at university
• Criterion: First-year grades
Concurrent
• Test data are collected at the same time as the external behaviour
• Real interest is in using the test as a substitute for something else
• Example: Test measuring depression used as a substitute for formal diagnosis by a psychiatrist
• Criterion: Psychiatrist's diagnosis
Criterion-Related Validity
Cautions:
1. Validity evidence is only as good as the criterion.
2. Validity is determined for a group of test-takers; errors will be made for individuals.
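Criterion-related validity is typically summarised as a validity coefficient: the correlation between test scores and the criterion measure. A minimal sketch of the predictive case (the admissions scores and grades below are invented for illustration, not data from this presentation):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: admissions-test scores and the criterion
# (first-year grade point average) for eight test-takers.
test_scores = [520, 610, 450, 700, 580, 640, 490, 560]
first_year_gpa = [2.8, 3.4, 2.5, 3.9, 3.1, 3.6, 2.6, 3.0]

r = pearson_r(test_scores, first_year_gpa)
print(f"validity coefficient r = {r:.2f}")
```

The coefficient is only as meaningful as the criterion itself (caution 1 above): if first-year grades are an unreliable measure of "success at university", a high r overstates the evidence.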
How do you know if you’ve achieved validity?
Content validity: Types of evidence
1. Are the test items linked to a test plan that has been developed using the participation of subject matter experts (SMEs)?
2. Have SMEs been involved in the item writing and review process?
3. Are the correct answers to test items validated by learning materials or a well-recognised reference, such as a textbook?
4. Have items been written using formatting and style guidelines?
5. Is an item bias and sensitivity review a component of the item development process?
6. Is there a review process for accepting items into the operational pool which involves reviewing the statistical properties of the items?
7. Is there a plan for the periodic review of operational items to ensure they still reflect current practice?
Tests are “imperfect measures of constructs because they either leave out something that should be included…or else include something that should be left out, or both”
~Samuel Messick, 1989, p. 34
Threats to validity
Construct-irrelevant variance
Some test scores are either inflated or suppressed due to extraneous variables that the test was not designed to assess (what you actually measure extends beyond what you want to measure).
Construct underrepresentation
Content coverage is not adequate to generalise test-taker performance to the construct in question (what you want to measure extends beyond what you actually measure).
Construct & Criterion-Related Validity: Types of evidence
Internal structure: rxx
• Internally consistent? Stable over time?
Correlations with other measures
• High correlations with other measures of the same construct
• Low correlations with measures of different constructs
Criterion-related validity studies
• What does the test predict?
• Group differences
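Internal-consistency evidence (rxx) is often summarised with Cronbach's alpha. A minimal sketch, using invented 0/1 item responses rather than any data from this presentation:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, where each item is a list of
    scores (one per test-taker, in the same order for every item)."""
    k = len(item_scores)            # number of items
    n = len(item_scores[0])         # number of test-takers

    def variance(xs):               # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_var = sum(variance(item) for item in item_scores)
    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    # alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Hypothetical right/wrong (1/0) responses: 4 items, 6 test-takers
items = [
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
]
print(f"alpha = {cronbach_alpha(items):.2f}")  # → alpha = 0.82
```

Alpha speaks to internal structure only; a high alpha does not by itself show that the items cover the construct (see construct underrepresentation above).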
What types of security concerns impact validity in a testing program?
Cheating
Acting dishonestly or unfairly in order to gain an advantage. In testing, that means claiming knowledge, skills or abilities that the test taker does not really have.
Cheating impacts validity because if someone cheats, their score report is not a true reflection of the knowledge, skills and abilities they have.
Examples: Test-taker collaboration, using hidden notes, etc.
Prior knowledge of test content
Prior knowledge of test content is gained through item leakage.
Prior knowledge impacts validity because the test-taker's score report may not be an accurate reflection of their knowledge, skills and abilities.
Examples: Brain-dump sites, proctor collaboration, test-taker collaboration, program design flaws (same form over and over again, shallow item banks)
Differences in administration procedures
These differences impact validity because the score result may not be valid or reliable (consistent).
Examples: Allowing some test-takers to have more time than others, not checking IDs at one site but checking at another (allowing proxy-testing to occur), differences in administration sites (noisy vs. quiet, dim vs. bright, etc.)
Summary
The validity chain

Purpose of the test defined → Test blueprint developed → Test content written according to the test blueprint → Standard setting conducted to set cut score → Testing occurs → Scoring methods applied

Leads to: Candidate's observed performance on the test (test score)

Interpreted as: Whether or not the candidate has the KSAs required of competent practice
The interpretative argument is valid to the extent the inferences and assumptions made along the path are plausible, as determined by well-documented evidence gained through logical
and/or empirical validation.
QUESTIONS?