Publication Date
In 2025 | 0 |
Since 2024 | 0 |
Since 2021 (last 5 years) | 0 |
Since 2016 (last 10 years) | 1 |
Since 2006 (last 20 years) | 1 |
Descriptor
Measurement Techniques | 7 |
Interrater Reliability | 4 |
Scores | 3 |
Test Interpretation | 3 |
Test Reliability | 3 |
Test Validity | 3 |
Evaluation Methods | 2 |
Reliability | 2 |
Scoring | 2 |
Test Use | 2 |
Accuracy | 1 |
More ▼ |
Source
Applied Measurement in… | 7 |
Author
Cohen, Allan | 1 |
Dunbar, Stephen B. | 1 |
Feldt, Leonard S. | 1 |
Fisher, Steve | 1 |
Johnson, Robert L. | 1 |
Kane, Michael | 1 |
Kuhs, Therese | 1 |
Lunz, Mary E. | 1 |
Penny, Jim | 1 |
Qualls, Audrey L. | 1 |
Raczynski, Kevin | 1 |
More ▼ |
Publication Type
Journal Articles | 7 |
Reports - Evaluative | 5 |
Reports - Research | 2 |
Speeches/Meeting Papers | 1 |
Education Level
Grade 7 | 1 |
Audience
Location
Laws, Policies, & Programs
Assessments and Surveys
What Works Clearinghouse Rating
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Johnson, Robert L.; Penny, Jim; Fisher, Steve; Kuhs, Therese – Applied Measurement in Education, 2003
When raters assign different scores to a performance task, a method for resolving rating differences is required to report a single score to the examinee. Recent studies indicate that decisions about examinees, such as pass/fail decisions, differ across resolution methods. Previous studies also investigated the interrater reliability of…
Descriptors: Test Reliability, Test Validity, Scores, Interrater Reliability

Feldt, Leonard S. – Applied Measurement in Education, 1990
Sampling theory for the intraclass reliability coefficient, a Spearman-Brown extrapolation of alpha to a single measurement for each examinee, is less recognized and less cited than that of coefficient alpha. Techniques for constructing confidence intervals and testing hypotheses for the intraclass coefficient are presented. (SLD)
Descriptors: Hypothesis Testing, Measurement Techniques, Reliability, Sampling

Dunbar, Stephen B.; And Others – Applied Measurement in Education, 1991
Issues pertaining to the quality of performance assessments, including reliability and validity, are discussed. The relatively limited generalizability of performance across tasks is indicative of the care needed to evaluate performance assessments. Quality control is an empirical matter when measurement is intended to inform public policy. (SLD)
Descriptors: Educational Assessment, Generalization, Interrater Reliability, Measurement Techniques

Qualls, Audrey L. – Applied Measurement in Education, 1995
Classically parallel, tau-equivalently parallel, and congenerically parallel models representing various degrees of part-test parallelism and their appropriateness for tests composed of multiple item formats are discussed. An appropriate reliability estimate for a test with multiple item formats is presented and illustrated. (SLD)
Descriptors: Achievement Tests, Estimation (Mathematics), Measurement Techniques, Test Format

Kane, Michael – Applied Measurement in Education, 1996
This overview of the role of error and tolerance for error in measurement asserts that the generic precision associated with a measurement procedure is defined as the root mean square error, or standard error, in some relevant population. This view of precision is explored in several applications of measurement. (SLD)
Descriptors: Error of Measurement, Error Patterns, Generalizability Theory, Measurement Techniques

Lunz, Mary E.; And Others – Applied Measurement in Education, 1990
An extension of the Rasch model is used to obtain objective measurements for examinations graded by judges. The model calibrates elements of each facet of the examination on a common log-linear scale. Real examination data illustrate the way correcting for judge severity improves fairness of examinee measures. (SLD)
Descriptors: Certification, Difficulty Level, Interrater Reliability, Judges