Educational Measurement

assessment:
definition and purpose
any of a variety of ways to look at performance.
how well does the individual perform?
test:
definition and purpose
instrument OR systematic procedure with uniform questions to sample behavior
how well does the individual perform?
can have NRT or CRT framework
latent variable
a continuum (line) that represents an unobservable construct
ex: knowledge of chemical properties of bases
content standards v. performance standards
“what do students need to know?” v.
“how good is good enough?” (judgment! – cut score)
4 parts of assessment procedure
Establish the: nature (max/typical), form (MC/constructed response/performance), use (placement/formative/summative/diagnostic), and method of interpreting (CRT vs. NRT) the assessment
natures of assessment
this is one part of the assessment procedure. Are you looking for “maximum/can do” or “typical/will do” performance? implied assessment types: achievement tests vs. surveys/observations
illustrative assessments for measuring “max performance” vs. “typical performance”
max -> achievement or aptitude test
typical -> attitude surveys, observations
these categories are examples of deciding the “nature” of assessment
forms of assessment
one part of the assessment procedure. Forms include:
MC, constructed response, performance task
uses of assessment
one part of the assessment procedure. Uses include:
placement, formative, diagnostic, summative
compare assessment types: placement, formative, diagnostic, summative (see p. 41 table 2.1)
placement and summative are higher stakes. formative is FYI, correction & reinforcement
diagnostic determines causes of struggle
placement can be just for goals/modality
questions we are asking with assessment
what do students know?
what are they able to do?
what do they need to do next?
methods of assessment (hint)
hint: methods of interpreting
CRT vs. NRT
CRT
criterion-referenced test – no details yet
NRT
norm-referenced test. No details yet
central policy issues
DIE!
How do test scores become meaningful?
this is an essential question. The answer should address all aspects of validity (how many are there?) and reliability (specify the variety of types that might be of interest).
How can we use tests to improve education and society?
another essential question from lecture 2.
the answer should include plenty of hedged recommendations.
validity:
definition & types
the degree to which an assessment instrument or procedure can be used/interpreted for a specific purpose (context dependent).
assessment should: (1) cover content it purports to test, (2) correlate with specified, appropriate criteria, (3) generate results consistent with implications of stated constructs (difficulty of items; Bloom’s taxonomy), (4) have consequences that are fair and appropriate.
validity determinations are largely a matter of judgment.
reliability (table 5.1):
definition, types, & methods
the degree of consistency of the outcomes of an assessment. 5 diff. measures of reliability, + method(s) for each. One might measure (1) stability across time [using test-retest], (2) equivalence [using equivalent forms], (3) BOTH stability and equivalence [using equivalent forms with a time interval], (4) internal consistency [using split-half, KR-20/KR-21, or Cronbach’s alpha], or (5) consistency of ratings [using interrater methods]

reliability is reported as statistical coefficients (0-1)

validity v. reliability
analogous to accuracy (I got what I wanted) vs. precision (I got the same result consistently).
with tests, reliability is necessary but not sufficient for validity.
VALID is specific to a particular stated purpose. RELIABLE is specific to a particular “sample” of takers (aka, context, group)
content-related validity
the degree to which an assessment instrument or procedure covers content it purports to test. 4 steps to establishing: (1) objectives?, (2) know-do blueprint (bloom), (3) make test, (4) judge alignment
procedure for attaining content validity
(1) identify objectives/goals, (2) build table of specs (KNOW content, DO Bloom), (3) construct test, (4) panel to evaluate alignment
criterion-related validity
Measure of scores’ correlation to an “appropriate” criterion, which may be concurrent (e.g., current GPA) or predictive (e.g., future GPA). Although this aspect of validity involves a correlation coefficient, judgment is still required to decide what degree of correlation is good enough.
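A rough sketch of how such a validity coefficient might be computed (the scores and GPA values below are invented for illustration, not from the course):

```python
import numpy as np

# Hypothetical data: each student's test score paired with a criterion
# measure (here, GPA). Concurrent validity would use a current GPA;
# predictive validity would use a GPA collected later.
test_scores = np.array([72, 85, 90, 65, 78, 88, 54, 95])
gpa = np.array([2.8, 3.4, 3.7, 2.5, 3.0, 3.5, 2.1, 3.9])

# Pearson correlation between the test scores and the criterion
r = np.corrcoef(test_scores, gpa)[0, 1]
print(f"criterion-related validity coefficient: r = {r:.2f}")
# Whether this r is "good enough" for the intended use is still a judgment call.
```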
construct-related validity
the degree to which an assessment generates results that are consistent with the implications of stated constructs (difficulty of items; Bloom’s taxonomy). When you PROPOSE that an item fits a specific construct (e.g., “this is a comprehension question – it’s easy”), that construct implies the sorts of scores you should get (HIGH). If the evidence (scores) fits that prediction, then your proposed construct interpretation is valid.
consequential validity
the degree to which an assessment instrument or procedure (including interpretation) has consequences that are fair and appropriate.
test-retest
measure of stability of test scores over time (one type of reliability)
equivalent forms
measure of stability of test scores from different versions of test (one type of reliability)
split-half
measure of consistency of test scores from the two halves of items within a single test. Requires use of the Spearman-Brown formula to estimate full-test reliability
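A minimal sketch of the split-half method with the Spearman-Brown correction, assuming a small invented 0/1 response matrix (students × items):

```python
import numpy as np

def split_half_reliability(item_scores):
    """Split-half reliability with the Spearman-Brown correction.

    item_scores: 2-D array, rows = students, columns = items (0/1).
    """
    odd = item_scores[:, 0::2].sum(axis=1)    # half-test score from odd items
    even = item_scores[:, 1::2].sum(axis=1)   # half-test score from even items
    r_half = np.corrcoef(odd, even)[0, 1]     # correlation between the halves
    # Spearman-Brown: estimate the reliability of the full-length test
    return (2 * r_half) / (1 + r_half)

# Hypothetical 0/1 item responses for 6 students on an 8-item test
data = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])
print(f"split-half reliability: {split_half_reliability(data):.2f}")
```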
KR20
KR20 and KR21 and Cronbach’s alpha coefficient are calculations that measure the internal consistency of a single test, which is one measure of reliability
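A minimal sketch of an internal-consistency calculation (Cronbach’s alpha, which for 0/1 items is essentially KR-20), again with invented responses:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha; for dichotomous (0/1) items this is essentially KR-20.

    item_scores: 2-D array, rows = students, columns = items.
    """
    k = item_scores.shape[1]                          # number of items
    item_var = item_scores.var(axis=0, ddof=1)        # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Hypothetical 0/1 response matrix (6 students x 8 items)
data = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0],
])
print(f"internal consistency (alpha / KR-20): {cronbach_alpha(data):.2f}")
```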
interrater methods
ways to measure the consistency of scores when the same test is scored by different raters. This is one type of reliability.
ways that tests may be consistent or not, aka, sources of variation (table 5.4)
1. testing procedure (use any method except interrater)
2. student “characteristics”/response (use a time-interval method)
3. sample of items (use equivalent forms or internal consistency)
4. judgmental scoring (use interrater methods)
consistency in testing procedure
part of reliability; inconsistency will be detected by all methods of reliability estimation EXCEPT interrater
consistency in student characteristics (how kids respond to test)
part of reliability; inconsistency will be detected by any time-interval method, and to a lesser extent by test-retest
consistency over diff. samples of items
part of reliability; inconsistency will be detected by equivalent-forms OR internal consistency methods
internal consistency
one aspect of reliability; it can be measured by the split-half method (which requires the Spearman-Brown formula), KR-20, KR-21, or Cronbach’s alpha coefficient (remember “generalizability theory” too?)
SEM
Standard Error of measurement (need to know formula?). To get range of likely values for a student’s “true” score, add a “confidence band” of +/- 1 SEM around every score for a category/domain.

SEM is determined by the standard deviation of the test scores AND the test’s reliability coefficient (table 5.6)
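The standard formula (SD is the standard deviation of the test scores, r_xx the reliability coefficient):

$$ SEM = SD\sqrt{1 - r_{xx}} $$

For example (made-up numbers), SD = 10 and r_xx = 0.91 give SEM = 10 × √0.09 = 3, so an observed score of 75 gets a ±1 SEM confidence band of roughly 72–78.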

table of specifications
2nd step of the process for establishing content validity. The table can separate out both content and construct level (Bloom’s taxonomy). In the 4th step, items from the test are placed into “cells” in the spec table to see whether they are distributed in the way you intended
reliability coefficient
A correlation coefficient that relates to the reliability of a test (e.g., correlation between two forms of the test, test-retest correlation, or correlation between odd and even items). Correlation coefficients range from -1 to +1, but reliability can only go as low as zero; a negative correlation coefficient is reported as “zero” reliability.
