Publication Date
| In 2026 | 0 |
| Since 2025 | 1 |
| Since 2022 (last 5 years) | 6 |
| Since 2017 (last 10 years) | 35 |
| Since 2007 (last 20 years) | 80 |
Descriptor
| Interrater Reliability | 136 |
| Test Items | 136 |
| Scoring | 42 |
| Test Construction | 36 |
| Difficulty Level | 35 |
| Test Reliability | 35 |
| Foreign Countries | 32 |
| Scores | 25 |
| Test Validity | 25 |
| Evaluators | 21 |
| Correlation | 19 |
Source
Author
| Chang, Lei | 3 |
| Alonzo, Julie | 2 |
| Anderson, Daniel | 2 |
| Avery, Marybell | 2 |
| Carifio, James | 2 |
| Dempster, Edith R. | 2 |
| Dyson, Ben | 2 |
| Fisette, Jennifer L. | 2 |
| Fox, Connie | 2 |
| Franck, Marian | 2 |
| Friedman, Greg | 2 |
Audience
| Researchers | 4 |
| Practitioners | 1 |
| Teachers | 1 |
Location
| California | 3 |
| Florida | 3 |
| Netherlands | 3 |
| Pennsylvania | 3 |
| Taiwan | 3 |
| Turkey | 3 |
| United States | 3 |
| Australia | 2 |
| Canada | 2 |
| Germany | 2 |
| Japan | 2 |
Hui Jin; Cynthia Lima; Limin Wang – Educational Measurement: Issues and Practice, 2025
Although AI transformer models have demonstrated notable capability in automated scoring, it is difficult to examine how and why these models fall short in scoring some responses. This study investigated how transformer models' language processing and quantification processes can be leveraged to enhance the accuracy of automated scoring. Automated…
Descriptors: Automation, Scoring, Artificial Intelligence, Accuracy
Parker, Mark A. J.; Hedgeland, Holly; Jordan, Sally E.; Braithwaite, Nicholas St. J. – European Journal of Science and Mathematics Education, 2023
The study covers the development and testing of the alternative mechanics survey (AMS), a modified force concept inventory (FCI), which used automatically marked free-response questions. Data were collected over a period of three academic years from 611 participants who were taking physics classes at high school and university level. A total of…
Descriptors: Test Construction, Scientific Concepts, Physics, Test Reliability
Atilgan, Hakan; Demir, Elif Kübra; Ogretmen, Tuncay; Basokcu, Tahsin Oguz – International Journal of Progressive Education, 2020
What the reliability level would be when open-ended questions are used in large-scale selection tests has become a critical question. One of the aims of the present study is to determine what the reliability would be when the answers given by test-takers are scored by experts and open-ended short-answer questions are used in…
Descriptors: Foreign Countries, Secondary School Students, Test Items, Test Reliability
Kaharu, Sarintan N.; Mansyur, Jusman – Pegem Journal of Education and Instruction, 2021
This study aims to develop a test that can be used to explore mental models and representation patterns of objects in liquids. Test development, which adapted Reeves's Development Model, was carried out in several stages, namely: determining the orientation and test segments; initial survey; preparation of the initial draft; try out;…
Descriptors: Test Construction, Schemata (Cognition), Scientific Concepts, Water
Koriakin, Taylor A.; McKee, Sarah L.; Schwartz, Marlene B.; Chafouleas, Sandra M. – Journal of School Health, 2020
Background: Stakeholders increasingly recognize the role of policy in implementing Whole School, Whole Community, Whole Child (WSCC) frameworks in schools; however, few tools are currently available to assess alignment between district policies and WSCC concepts. The purpose of this study was to expand the Wellness School Assessment Tool (WellSAT)…
Descriptors: School Policy, Health Services, Health Promotion, Wellness
Kilic, Abdullah Faruk; Uysal, Ibrahim – International Journal of Assessment Tools in Education, 2022
Most researchers investigate the corrected item-total correlation of items when analyzing item discrimination in multidimensional structures under Classical Test Theory, which might lead to underestimating item discrimination and thereby to removing items from the test. Researchers might investigate the corrected item-total correlation with the…
Descriptors: Item Analysis, Correlation, Item Response Theory, Test Items
Martin, David; Jamieson-Proctor, Romina – International Journal of Research & Method in Education, 2020
In Australia, one of the key findings of the Teacher Education Ministerial Advisory Group was that not all graduating pre-service teachers possess adequate pedagogical content knowledge (PCK) to teach effectively. The concern is that higher education providers working with pre-service teachers are using pedagogical practices and assessments which…
Descriptors: Test Construction, Preservice Teachers, Pedagogical Content Knowledge, Foreign Countries
Bimpeh, Yaw; Pointer, William; Smith, Ben Alexander; Harrison, Liz – Applied Measurement in Education, 2020
Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we…
Descriptors: Scoring, Generalizability Theory, Interrater Reliability, Foreign Countries
Walker, Grant M.; Basilakos, Alexandra; Fridriksson, Julius; Hickok, Gregory – Journal of Speech, Language, and Hearing Research, 2022
Purpose: Meaningful changes in picture naming responses may be obscured when measuring accuracy instead of quality. A statistic that incorporates information about the severity and nature of impairments may be more sensitive to the effects of treatment. Method: We analyzed data from repeated administrations of a naming test to 72 participants with…
Descriptors: Naming, Change, Aphasia, Severity (of Disability)
Dempster, Edith R.; Kirby, Nicola F. – Perspectives in Education, 2018
Taxonomies of cognitive demand are frequently used to ensure that assessment tasks include questions ranging from low to high cognitive demand. This paper investigates inter-rater agreement among four evaluators on the cognitive demand of the South African National Senior Certificate Life Sciences examinations after training, practice and…
Descriptors: Interrater Reliability, Biological Sciences, Cognitive Processes, Test Items
Abdalla, Widad – ProQuest LLC, 2019
Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed-response items are administered in multiple test administrations. In trend scoring, a set of responses from Time "A" is rescored by raters at Time "B." The purpose of this study is to examine the ability of…
Descriptors: Scoring, Interrater Reliability, Test Items, Error Patterns
Dempster, Edith R.; Kirby, Nicki F. – South African Journal of Education, 2018
Public perception of "declining standards" in school-leaving examinations often accompanies increases in pass rates in those examinations. To the public, "declining standards" means easier examination papers. The present study evaluates a South African attempt to estimate the level of difficulty, as distinct from…
Descriptors: Foreign Countries, Interrater Reliability, Difficulty Level, Science Tests
Nieto, Ricardo; Casabianca, Jodi M. – Journal of Educational Measurement, 2019
Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple-structured tests such as these, rely on multiple sections of multiple-choice and/or constructed-response items to generate multiple…
Descriptors: Tests, Scoring, Responses, Test Items
Bais, Frank; Schouten, Barry; Lugtig, Peter; Toepoel, Vera; Arends-Tòth, Judit; Douhou, Salima; Kieruj, Natalia; Morren, Mattijn; Vis, Corrie – Sociological Methods & Research, 2019
Item characteristics can have a significant effect on survey data quality and may be associated with measurement error. Literature on data quality and measurement error is often inconclusive. This could be because item characteristics used for detecting measurement error are not coded unambiguously. In our study, we use a systematic coding…
Descriptors: Foreign Countries, National Surveys, Error of Measurement, Test Items
Benton, Tom; Leech, Tony; Hughes, Sarah – Cambridge Assessment, 2020
In the context of examinations, the phrase "maintaining standards" usually refers to any activity designed to ensure that it is no easier (or harder) to achieve a given grade in one year than in another. Specifically, it tends to mean activities associated with setting examination grade boundaries. Benton et al. (2020) describe a method…
Descriptors: Mathematics Tests, Equated Scores, Comparative Analysis, Difficulty Level
