Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators

Ercikan, Kadriye; Oliveri, María Elena – Applied Measurement in Education, 2016
Assessing complex constructs such as those discussed under the umbrella of 21st century constructs highlights the need for a principled assessment design and validation approach. In our discussion, we made a case for three considerations: (a) taking construct complexity into account across various stages of assessment development such as the…
Descriptors: Evaluation Methods, Test Construction, Design, Scaling

Phillips, Gary W. – Applied Measurement in Education, 2015
This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for, they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling…
Descriptors: State Programs, Sampling, Research Design, Error of Measurement

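The underestimation Phillips describes can be illustrated with a small sketch using the Kish design effect, DEFF = 1 + (m − 1)ρ, which inflates the variance of a mean under cluster sampling. The function names and the numbers below are illustrative assumptions, not values from the article:

```python
import math

def design_effect(cluster_size: float, icc: float) -> float:
    """Kish design effect for equal-sized clusters: DEFF = 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1.0) * icc

def standard_error(sd: float, n: int, deff: float = 1.0) -> float:
    """Standard error of a mean, optionally inflated by the design effect."""
    return sd * math.sqrt(deff / n)

# Hypothetical assessment: 2,000 students sampled as 80 classrooms of 25,
# with a modest intraclass correlation of 0.10.
sd, n = 15.0, 2000
deff = design_effect(cluster_size=25, icc=0.10)

srs_se = standard_error(sd, n)         # naive: treats the sample as SRS
true_se = standard_error(sd, n, deff)  # accounts for clustering

print(f"DEFF = {deff:.2f}")
print(f"naive SE = {srs_se:.3f}, clustered SE = {true_se:.3f}")
```

Even this modest intraclass correlation inflates the standard error by a factor of √3.4 ≈ 1.84, so a "significant" score gain under the naive calculation may not be significant at all.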
Sireci, Stephen G.; Hauger, Jeffrey B.; Wells, Craig S.; Shea, Christine; Zenisky, April L. – Applied Measurement in Education, 2009
The National Assessment Governing Board used a new method to set achievement level standards on the 2005 Grade 12 NAEP Math test. In this article, we summarize our independent evaluation of the process used to set these standards. The evaluation data included observations of the standard-setting meeting, observations of advisory committee meetings…
Descriptors: Advisory Committees, Mathematics Tests, Standard Setting, National Competency Tests

Bart, William M.; Williams-Morris, Ruth – Applied Measurement in Education, 1990
Refined item digraph analysis (RIDA) is a way of studying diagnostic and prescriptive testing. It permits assessment of a test item's diagnostic value by examining the extent to which the item has properties of ideal items. RIDA is illustrated with the Orange Juice Test, which assesses the proportionality concept. (TJH)
Descriptors: Diagnostic Tests, Evaluation Methods, Item Analysis, Mathematical Models

Byrne, Barbara M. – Applied Measurement in Education, 1990
Methodological procedures used in validating the theoretical structure of academic self-concept and validating associated measurement instruments are reviewed. Substantive findings from research related to modes of inquiry are summarized, and recommendations for future research are outlined. (TJH)
Descriptors: Classification, Construct Validity, Evaluation Methods, Literature Reviews

Johnson, Robert L.; Penny, Jim; Fisher, Steve; Kuhs, Therese – Applied Measurement in Education, 2003
When raters assign different scores to a performance task, a method for resolving rating differences is required to report a single score to the examinee. Recent studies indicate that decisions about examinees, such as pass/fail decisions, differ across resolution methods. Previous studies also investigated the interrater reliability of…
Descriptors: Test Reliability, Test Validity, Scores, Interrater Reliability
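The point that pass/fail decisions differ across resolution methods can be sketched concretely. The three methods below (averaging, taking the higher score, and third-rater adjudication that averages the two closest scores) are common options in the performance-assessment literature, not necessarily the ones Johnson et al. studied, and the cut score and ratings are hypothetical:

```python
def resolve_average(r1, r2):
    """Average the two original ratings."""
    return (r1 + r2) / 2

def resolve_higher(r1, r2):
    """Award the higher of the two original ratings."""
    return max(r1, r2)

def resolve_tertium(r1, r2, r3):
    """Third-rater adjudication: keep the two closest scores and average them."""
    pairs = [(abs(a - b), a, b) for a, b in [(r1, r2), (r1, r3), (r2, r3)]]
    _, a, b = min(pairs)
    return (a + b) / 2

cut = 4.0
r1, r2, r3 = 3, 5, 3  # discrepant pair plus an adjudicator's score

for name, score in [
    ("average",     resolve_average(r1, r2)),
    ("higher",      resolve_higher(r1, r2)),
    ("adjudicated", resolve_tertium(r1, r2, r3)),
]:
    print(f"{name:11s} -> {score:.1f} ({'pass' if score >= cut else 'fail'})")
```

With these ratings the averaging and higher-score methods pass the examinee while adjudication fails them, which is exactly the kind of decision disagreement the abstract describes.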

Mehrens, William A.; Popham, W. James – Applied Measurement in Education, 1992
This paper discusses how to determine whether a test was developed in a legally defensible manner, reviewing general issues, specific cases bearing on different types of test use, some evaluative dimensions, and evidence of test quality. Tests constructed and used according to existing standards will generally stand legal scrutiny. (SLD)
Descriptors: College Entrance Examinations, Compliance (Legal), Constitutional Law, Court Litigation