Showing all 13 results
Peer reviewed
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams relative to human teachers. Aspects investigated include consistency, large discrepancies, and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
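The consistency and discrepancy checks named in the abstract reduce to comparisons between two grade vectors. A minimal sketch in Python 3.10+ (for statistics.correlation); the grades, the 0-100 scale, and the discrepancy threshold are invented for illustration, not taken from the study:

    import statistics

    # Hypothetical grades for the same ten exams, on an assumed 0-100 scale.
    human = [72, 58, 90, 45, 81, 66, 77, 53, 88, 61]
    ai    = [70, 64, 85, 52, 80, 71, 74, 49, 91, 58]

    # Mean difference: does the AI grade systematically higher or lower?
    mean_diff = statistics.mean(a - h for a, h in zip(ai, human))

    # Pearson correlation: do the two graders rank exams similarly?
    r = statistics.correlation(human, ai)

    # Large discrepancies: pairs differing by more than an assumed threshold.
    THRESHOLD = 10
    large = [(h, a) for h, a in zip(human, ai) if abs(h - a) > THRESHOLD]

    print(f"mean AI-minus-human difference: {mean_diff:+.2f}")
    print(f"correlation: {r:.3f}")
    print(f"exams with |difference| > {THRESHOLD}: {len(large)}")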
Peer reviewed
PDF on ERIC
Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022
Agreement between raters is examined within the scope of the concept of "inter-rater reliability." Although inter-rater agreement and inter-rater reliability are clearly defined concepts, there is no clear guidance on the conditions under which agreement and reliability methods are appropriate to…
Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory
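The distinction the authors probe, raters assigning identical scores (agreement) versus rating consistently (reliability), can be made concrete with two standard indices. A minimal sketch using textbook percent agreement and Cohen's kappa, not methods from the article; the ratings are invented:

    from collections import Counter

    # Hypothetical categorical ratings from two raters for the same ten essays.
    rater1 = ["A", "B", "B", "C", "A", "C", "B", "A", "C", "B"]
    rater2 = ["A", "B", "C", "C", "A", "B", "B", "A", "C", "C"]
    n = len(rater1)

    # Observed agreement: proportion of identical ratings.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement from the raters' marginal category proportions.
    m1, m2 = Counter(rater1), Counter(rater2)
    p_e = sum(m1[c] * m2[c] for c in m1.keys() | m2.keys()) / n**2

    # Cohen's kappa corrects observed agreement for chance.
    kappa = (p_o - p_e) / (1 - p_e)

    print(f"percent agreement: {p_o:.2f}")
    print(f"Cohen's kappa:     {kappa:.2f}")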
Peer reviewed
Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024
We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners
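Targeted double scoring is the study's focus, but the rater-agreement approach to resolution is the simplest to illustrate: when two ratings disagree beyond a tolerance, obtain a third rating and resolve. A minimal sketch; the tolerance, the 0-6 rubric, and the median-of-three resolution rule are illustrative assumptions, not the study's procedure:

    import statistics

    def resolve(score1, score2, adjudicate, tolerance=1):
        """Average two ratings that agree within `tolerance`; otherwise
        obtain a third rating and take the median of the three."""
        if abs(score1 - score2) <= tolerance:
            return (score1 + score2) / 2
        return statistics.median([score1, score2, adjudicate()])

    # Hypothetical double-scored constructed responses on a 0-6 rubric.
    pairs = [(4, 5), (2, 6), (3, 3), (1, 4)]
    third_ratings = iter([4, 2])  # pretend adjudications for discrepant pairs

    final = [resolve(s1, s2, lambda: next(third_ratings)) for s1, s2 in pairs]
    print(final)  # [4.5, 4, 3.0, 2]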
Greifer, Noah – ProQuest LLC, 2018
There has been some research on the use of propensity scores in the context of measurement error in confounding variables; one recommended method is to generate estimates of the mis-measured covariate using a latent variable model and to use those estimates (i.e., factor scores) in place of the covariate. I describe a simulation study…
Descriptors: Evaluation Methods, Probability, Scores, Statistical Analysis
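The recommended method the abstract describes, substituting model-based estimates for a mis-measured covariate, can be gestured at without fitting a full latent variable model. A minimal sketch that builds a crude factor-score proxy (the mean of standardized indicators); the data are invented, and a real analysis would use factor scores from a fitted measurement model, as the abstract says:

    import statistics

    # Invented indicators of one latent confounder, each measured with error.
    indicators = [
        [3.1, 2.9, 3.4],
        [1.2, 1.5, 1.1],
        [2.5, 2.2, 2.8],
        [0.8, 1.1, 0.6],
    ]

    # Standardize each indicator column, then average within subject.
    cols = list(zip(*indicators))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    factor_proxy = [
        statistics.mean((x - m) / s for x, m, s in zip(row, means, sds))
        for row in indicators
    ]
    print(factor_proxy)  # one score per subject

The proxy would then enter the propensity-score model in place of any single error-prone indicator.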
Peer reviewed
Lin, Chih-Kai – Language Testing, 2017
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…
Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy
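As background for the two sparse-data methods, a minimal sketch of the fully crossed persons-by-raters (p x r) G-study they adapt, using the standard expected-mean-squares estimators; the scores are invented, and the rating and subdividing methods themselves are not implemented here:

    # Hypothetical fully crossed data: 4 persons each rated by 3 raters.
    scores = [
        [7, 6, 8],
        [4, 5, 4],
        [9, 8, 9],
        [5, 6, 5],
    ]
    n_p, n_r = len(scores), len(scores[0])

    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(col) / n_p for col in zip(*scores)]

    # Mean squares for persons, raters, and the residual interaction.
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimators of the variance components.
    var_pr = ms_pr                 # residual / interaction
    var_p = (ms_p - ms_pr) / n_r   # persons (the object of measurement)
    var_r = (ms_r - ms_pr) / n_p   # raters

    # Generalizability coefficient for the mean over n_r raters.
    g = var_p / (var_p + var_pr / n_r)
    print(f"var_p={var_p:.3f} var_r={var_r:.3f} var_pr={var_pr:.3f} g={g:.3f}")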
Peer reviewed
PDF on ERIC
Falk, Carl F.; Cai, Li – Grantee Submission, 2015
In this paper, we present a flexible full-information approach to modeling multiple user-defined response styles across multiple constructs of interest. The model is based on a novel parameterization of the multidimensional nominal response model that separates estimation of overall item slopes from the scoring functions (indicating the order of…
Descriptors: Response Style (Tests), Item Response Theory, Outcome Measures, Models
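The separation the abstract describes, an overall item slope times category scoring-function values, appears directly in the nominal model's category probabilities. A minimal sketch of that computation for a single unidimensional item; the parameter values are invented, and this is the textbook nominal response model rather than the authors' full multidimensional formulation:

    import math

    def nominal_probs(theta, a, s, c):
        """P(k | theta) proportional to exp(a * s[k] * theta + c[k]), with the
        overall slope `a` separated from the scoring-function values `s`."""
        logits = [a * sk * theta + ck for sk, ck in zip(s, c)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # Invented 4-category item: ordered scoring function 0..3, slope 1.2.
    probs = nominal_probs(theta=0.5, a=1.2, s=[0, 1, 2, 3], c=[0.0, 0.4, 0.1, -0.6])
    print([round(p, 3) for p in probs])  # sums to 1.0

In this parameterization, departures from the default 0..3 ordering of `s` are what allow response styles to be represented separately from the overall slope.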
Peer reviewed
Collins, Rebecca L.; Martino, Steven C.; Elliott, Marc N. – Developmental Psychology, 2011
Longitudinal research has demonstrated a link between exposure to sexual content in media and subsequent changes in adolescent sexual behavior, including initiation of intercourse and various noncoital sexual activities. Based on a reanalysis of one of the data sets involved, Steinberg and Monahan (2011) have challenged these findings. However,…
Descriptors: Sexuality, Mass Media Effects, Adolescents, Evaluation Methods
Simpson, J. D. – Audio-Visual Language Journal, 1974
Some basic statistical concepts relevant to teachers--mean scores, standard deviation, normal and skewed distributions, z scores, item analysis, standard error of measurement, and reliability--and their classroom use are explained. (RM)
Descriptors: Error of Measurement, Evaluation Methods, Norm Referenced Tests, Scoring
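Most of the concepts listed are one-line computations. A minimal sketch with invented scores; the reliability of 0.85 is assumed for illustration, and the standard error of measurement uses the classical formula SEM = SD * sqrt(1 - reliability):

    import statistics

    scores = [55, 62, 70, 70, 74, 78, 81, 85, 90, 95]  # invented class results

    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation

    # z score: distance from the mean in standard-deviation units.
    z = [(x - mean) / sd for x in scores]

    # Standard error of measurement, under an assumed reliability of 0.85.
    reliability = 0.85
    sem = sd * (1 - reliability) ** 0.5

    print(f"mean={mean:.1f} sd={sd:.2f} sem={sem:.2f}")
    print(f"z for top score: {z[-1]:+.2f}")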
Zwick, Rebecca; Thayer, Dorothy T. – 1994
Several recent studies have investigated the application of statistical inference procedures to the analysis of differential item functioning (DIF) in test items that are scored on an ordinal scale. Mantel's extension of the Mantel-Haenszel test is a possible hypothesis-testing method for this purpose. The development of descriptive statistics for…
Descriptors: Error of Measurement, Evaluation Methods, Hypothesis Testing, Item Bias
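Mantel's extension compares, within each total-score stratum, the focal group's observed sum of item scores to its expectation under no DIF. A minimal sketch following Mantel's (1963) statistic; the counts are invented, and the descriptive statistics developed in the report are not implemented:

    # Each stratum maps an item score level to (reference_count, focal_count).
    strata = [
        {0: (10, 12), 1: (20, 15), 2: (30, 18)},
        {0: (5, 8), 1: (25, 20), 2: (40, 25)},
    ]

    sum_F = sum_E = sum_V = 0.0
    for table in strata:
        n_R = sum(r for r, f in table.values())
        n_F = sum(f for r, f in table.values())
        N = n_R + n_F
        s1 = sum(y * (r + f) for y, (r, f) in table.items())      # score sum
        s2 = sum(y * y * (r + f) for y, (r, f) in table.items())  # squares
        F = sum(y * f for y, (r, f) in table.items())  # focal score sum
        sum_F += F
        sum_E += n_F * s1 / N
        sum_V += n_R * n_F * (N * s2 - s1 * s1) / (N * N * (N - 1))

    chi2 = (sum_F - sum_E) ** 2 / sum_V  # ~ chi-square, 1 df, under no DIF
    print(f"Mantel chi-square: {chi2:.3f}")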
Peer reviewed
Hopwood, Christopher J.; Richard, David C. S. – Assessment, 2005
Research on the Wechsler Adult Intelligence Scale-Revised and Wechsler Adult Intelligence Scale-Third Edition (WAIS-III) suggests that practicing clinical psychologists and graduate students make item-level scoring errors that affect IQ, index, and subtest scores. Studies have been limited in that Full-Scale IQ (FSIQ) and examiner administration,…
Descriptors: Scoring, Psychologists, Intelligence Quotient, Graduate Students
Rudner, Lawrence M. – 1992
Several common sources of error in assessments that depend on the use of judges are identified, and ways to reduce the impact of rating errors are examined. Numerous threats to the validity of scores based on ratings exist. These threats include: (1) the halo effect; (2) stereotyping; (3) perception differences; (4) leniency/stringency error; and…
Descriptors: Alternative Assessment, Error of Measurement, Evaluation Methods, Evaluators
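Of the threats listed, leniency/stringency error is the most directly measurable: a rater whose mean rating sits far from the pool's is a candidate for retraining or adjustment. A minimal sketch with invented ratings; the one-standard-deviation cutoff is an arbitrary illustration:

    import statistics

    # Hypothetical ratings awarded by each rater on a 1-6 scale.
    ratings = {
        "R1": [4, 5, 4, 5, 6, 5],
        "R2": [2, 3, 2, 1, 3, 2],  # possibly stringent
        "R3": [4, 4, 3, 5, 4, 4],
    }

    pool = [s for scores in ratings.values() for s in scores]
    pool_mean = statistics.mean(pool)
    pool_sd = statistics.stdev(pool)

    # Flag raters whose mean deviates from the pool mean by more than 1 SD.
    for rater, scores in ratings.items():
        dev = statistics.mean(scores) - pool_mean
        flag = "  <- possible leniency/stringency" if abs(dev) > pool_sd else ""
        print(f"{rater}: mean deviation {dev:+.2f}{flag}")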
Rizavi, Saba; Way, Walter D.; Davey, Tim; Herbert, Erin – Educational Testing Service, 2004
Item parameter estimates vary for a variety of reasons, including estimation error, characteristics of the examinee samples, and context effects (e.g., item location effects, section location effects, etc.). Although we expect variation based on theory, there is reason to believe that observed variation in item parameter estimates exceeds what…
Descriptors: Adaptive Testing, Test Items, Computation, Context Effect
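A simple check on whether observed variation exceeds estimation error is to compare the change in an item's parameter estimate against the standard errors combined in quadrature. A minimal sketch with invented IRT difficulty estimates; the two-standard-error criterion is an illustrative convention, not the report's analysis:

    # (b_first, se_first, b_second, se_second) across two calibrations.
    items = {
        "item01": (-0.42, 0.08, -0.35, 0.09),
        "item02": (0.10, 0.07, 0.55, 0.08),  # drifts beyond estimation error
        "item03": (1.20, 0.11, 1.05, 0.12),
    }

    for name, (b1, se1, b2, se2) in items.items():
        diff = b2 - b1
        se_diff = (se1**2 + se2**2) ** 0.5  # SEs combine in quadrature
        flagged = abs(diff) > 2 * se_diff   # ~95% criterion if errors normal
        print(f"{name}: diff={diff:+.2f}, 2*SE={2 * se_diff:.2f}, "
              f"{'possible drift' if flagged else 'within error'}")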
Jaeger, Richard M.; Busch, John Christian – 1986
This study explores the use of the modified caution index (MCI) for identifying judges whose patterns of recommendations suggest that their judgments might be based on incomplete information, flawed reasoning, or inattention to their standard-setting tasks. It also examines the effect on test standards and passing rates when the test standards of…
Descriptors: Criterion Referenced Tests, Error of Measurement, Evaluation Methods, High Schools
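The modified caution index has a more involved normalization than fits here, but the family's underlying idea, comparing an observed response pattern to the Guttman pattern implied by item difficulty, can be sketched with the simpler Sato-style caution index. All data are invented, and the judges' recommendations are treated as 0/1 endorsements:

    def covariance(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

    def caution_index(responses, p_values):
        """Sato-style caution index: 1 - cov(observed, p) / cov(Guttman, p).
        0 means a perfectly consistent pattern; larger values mean the
        pattern departs from what the item p-values predict."""
        order = sorted(range(len(p_values)), key=lambda j: -p_values[j])
        guttman = [0] * len(responses)
        for j in order[: sum(responses)]:  # credit the easiest items
            guttman[j] = 1
        return 1 - covariance(responses, p_values) / covariance(guttman, p_values)

    # Five items with pool-level endorsement rates, and two judges' patterns.
    p = [0.9, 0.8, 0.6, 0.4, 0.2]
    consistent = [1, 1, 1, 0, 0]  # endorses exactly the easiest items
    aberrant = [0, 1, 0, 1, 1]    # endorses against the pool's ordering

    print(caution_index(consistent, p))  # 0.0
    print(caution_index(aberrant, p))    # well above 0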