Showing all 13 results
Peer reviewed
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams relative to human teachers. Aspects investigated include consistency, large discrepancies, and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
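The consistency and discrepancy checks named in the abstract reduce to comparisons between two grade vectors. A minimal sketch in Python 3.10+ (for statistics.correlation); the grades, the 0-100 scale, and the discrepancy threshold are invented for illustration, not taken from the study:

    import statistics

    # Hypothetical grades for the same ten exams, on an assumed 0-100 scale.
    human = [72, 58, 90, 45, 81, 66, 77, 53, 88, 61]
    ai    = [70, 64, 85, 52, 80, 71, 74, 49, 91, 58]

    # Mean difference: does the AI grade systematically higher or lower?
    mean_diff = statistics.mean(a - h for a, h in zip(ai, human))

    # Pearson correlation: do the two graders rank exams similarly?
    r = statistics.correlation(human, ai)

    # Large discrepancies: pairs differing by more than an assumed threshold.
    THRESHOLD = 10
    large = [(h, a) for h, a in zip(human, ai) if abs(h - a) > THRESHOLD]

    print(f"mean AI-minus-human difference: {mean_diff:+.2f}")
    print(f"correlation: {r:.3f}")
    print(f"exams with |difference| > {THRESHOLD}: {len(large)}")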
Peer reviewed
PDF on ERIC
Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022
Agreement between raters is examined within the scope of the concept of "inter-rater reliability." Although inter-rater agreement and inter-rater reliability are clearly defined concepts, there is no clear guidance on the conditions under which agreement and reliability methods are appropriate to…
Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory
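The distinction the authors probe, raters assigning identical scores (agreement) versus rating consistently (reliability), can be made concrete with two standard indices. A minimal sketch using textbook percent agreement and Cohen's kappa, not methods from the article; the ratings are invented:

    from collections import Counter

    # Hypothetical categorical ratings from two raters for the same ten essays.
    rater1 = ["A", "B", "B", "C", "A", "C", "B", "A", "C", "B"]
    rater2 = ["A", "B", "C", "C", "A", "B", "B", "A", "C", "C"]
    n = len(rater1)

    # Observed agreement: proportion of identical ratings.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement from the raters' marginal category proportions.
    m1, m2 = Counter(rater1), Counter(rater2)
    p_e = sum(m1[c] * m2[c] for c in m1.keys() | m2.keys()) / n**2

    # Cohen's kappa corrects observed agreement for chance.
    kappa = (p_o - p_e) / (1 - p_e)

    print(f"percent agreement: {p_o:.2f}")
    print(f"Cohen's kappa:     {kappa:.2f}")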
Peer reviewed
Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024
We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners
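Targeted double scoring is the study's focus, but the rater-agreement approach to resolution is the simplest to illustrate: when two ratings disagree beyond a tolerance, obtain a third rating and resolve. A minimal sketch; the tolerance, the 0-6 rubric, and the median-of-three resolution rule are illustrative assumptions, not the study's procedure:

    import statistics

    def resolve(score1, score2, adjudicate, tolerance=1):
        """Average two ratings that agree within `tolerance`; otherwise
        obtain a third rating and take the median of the three."""
        if abs(score1 - score2) <= tolerance:
            return (score1 + score2) / 2
        return statistics.median([score1, score2, adjudicate()])

    # Hypothetical double-scored constructed responses on a 0-6 rubric.
    pairs = [(4, 5), (2, 6), (3, 3), (1, 4)]
    third_ratings = iter([4, 2])  # pretend adjudications for discrepant pairs

    final = [resolve(s1, s2, lambda: next(third_ratings)) for s1, s2 in pairs]
    print(final)  # [4.5, 4, 3.0, 2]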
Greifer, Noah – ProQuest LLC, 2018
There has been some research on the use of propensity scores in the context of measurement error in confounding variables; one recommended method is to generate estimates of the mis-measured covariate using a latent variable model and to use those estimates (i.e., factor scores) in place of the covariate. I describe a simulation study…
Descriptors: Evaluation Methods, Probability, Scores, Statistical Analysis
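The recommended method the abstract describes, substituting model-based estimates for a mis-measured covariate, can be gestured at without fitting a full latent variable model. A minimal sketch that builds a crude factor-score proxy (the mean of standardized indicators); the data are invented, and a real analysis would use factor scores from a fitted measurement model, as the abstract says:

    import statistics

    # Invented indicators of one latent confounder, each measured with error.
    indicators = [
        [3.1, 2.9, 3.4],
        [1.2, 1.5, 1.1],
        [2.5, 2.2, 2.8],
        [0.8, 1.1, 0.6],
    ]

    # Standardize each indicator column, then average within subject.
    cols = list(zip(*indicators))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    factor_proxy = [
        statistics.mean((x - m) / s for x, m, s in zip(row, means, sds))
        for row in indicators
    ]
    print(factor_proxy)  # one score per subject

The proxy would then enter the propensity-score model in place of any single error-prone indicator.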
Peer reviewed
Lin, Chih-Kai – Language Testing, 2017
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…
Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy
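As background for the two sparse-data methods, a minimal sketch of the fully crossed persons-by-raters (p x r) G-study they adapt, using the standard expected-mean-squares estimators; the scores are invented, and the rating and subdividing methods themselves are not implemented here:

    # Hypothetical fully crossed data: 4 persons each rated by 3 raters.
    scores = [
        [7, 6, 8],
        [4, 5, 4],
        [9, 8, 9],
        [5, 6, 5],
    ]
    n_p, n_r = len(scores), len(scores[0])

    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(col) / n_p for col in zip(*scores)]

    # Mean squares for persons, raters, and the residual interaction.
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimators of the variance components.
    var_pr = ms_pr                 # residual / interaction
    var_p = (ms_p - ms_pr) / n_r   # persons (the object of measurement)
    var_r = (ms_r - ms_pr) / n_p   # raters

    # Generalizability coefficient for the mean over n_r raters.
    g = var_p / (var_p + var_pr / n_r)
    print(f"var_p={var_p:.3f} var_r={var_r:.3f} var_pr={var_pr:.3f} g={g:.3f}")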
Peer reviewed
PDF on ERIC
Falk, Carl F.; Cai, Li – Grantee Submission, 2015
In this paper, we present a flexible full-information approach to modeling multiple user-defined response styles across multiple constructs of interest. The model is based on a novel parameterization of the multidimensional nominal response model that separates estimation of overall item slopes from the scoring functions (indicating the order of…
Descriptors: Response Style (Tests), Item Response Theory, Outcome Measures, Models
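The separation the abstract describes, an overall item slope times category scoring-function values, appears directly in the nominal model's category probabilities. A minimal sketch of that computation for a single unidimensional item; the parameter values are invented, and this is the textbook nominal response model rather than the authors' full multidimensional formulation:

    import math

    def nominal_probs(theta, a, s, c):
        """P(k | theta) proportional to exp(a * s[k] * theta + c[k]), with the
        overall slope `a` separated from the scoring-function values `s`."""
        logits = [a * sk * theta + ck for sk, ck in zip(s, c)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # Invented 4-category item: ordered scoring function 0..3, slope 1.2.
    probs = nominal_probs(theta=0.5, a=1.2, s=[0, 1, 2, 3], c=[0.0, 0.4, 0.1, -0.6])
    print([round(p, 3) for p in probs])  # sums to 1.0

In this parameterization, departures from the default 0..3 ordering of `s` are what allow response styles to be represented separately from the overall slope.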
Peer reviewed
Collins, Rebecca L.; Martino, Steven C.; Elliott, Marc N. – Developmental Psychology, 2011
Longitudinal research has demonstrated a link between exposure to sexual content in media and subsequent changes in adolescent sexual behavior, including initiation of intercourse and various noncoital sexual activities. Based on a reanalysis of one of the data sets involved, Steinberg and Monahan (2011) have challenged these findings. However,…
Descriptors: Sexuality, Mass Media Effects, Adolescents, Evaluation Methods
Simpson, J. D. – Audio-Visual Language Journal, 1974
Some basic statistical concepts relevant to teachers--mean scores, standard deviation, normal and skewed distributions, z scores, item analysis, standard error of measurement, and reliability--and their classroom use are explained. (RM)
Descriptors: Error of Measurement, Evaluation Methods, Norm Referenced Tests, Scoring
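Most of the concepts listed are one-line computations. A minimal sketch with invented scores; the reliability of 0.85 is assumed for illustration, and the standard error of measurement uses the classical formula SEM = SD * sqrt(1 - reliability):

    import statistics

    scores = [55, 62, 70, 70, 74, 78, 81, 85, 90, 95]  # invented class results

    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation

    # z score: distance from the mean in standard-deviation units.
    z = [(x - mean) / sd for x in scores]

    # Standard error of measurement, under an assumed reliability of 0.85.
    reliability = 0.85
    sem = sd * (1 - reliability) ** 0.5

    print(f"mean={mean:.1f} sd={sd:.2f} sem={sem:.2f}")
    print(f"z for top score: {z[-1]:+.2f}")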
Zwick, Rebecca; Thayer, Dorothy T. – 1994
Several recent studies have investigated the application of statistical inference procedures to the analysis of differential item functioning (DIF) in test items that are scored on an ordinal scale. Mantel's extension of the Mantel-Haenszel test is a possible hypothesis-testing method for this purpose. The development of descriptive statistics for…
Descriptors: Error of Measurement, Evaluation Methods, Hypothesis Testing, Item Bias
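Mantel's extension compares, within each total-score stratum, the focal group's observed sum of item scores to its expectation under no DIF. A minimal sketch following Mantel's (1963) statistic; the counts are invented, and the descriptive statistics developed in the report are not implemented:

    # Each stratum maps an item score level to (reference_count, focal_count).
    strata = [
        {0: (10, 12), 1: (20, 15), 2: (30, 18)},
        {0: (5, 8), 1: (25, 20), 2: (40, 25)},
    ]

    sum_F = sum_E = sum_V = 0.0
    for table in strata:
        n_R = sum(r for r, f in table.values())
        n_F = sum(f for r, f in table.values())
        N = n_R + n_F
        s1 = sum(y * (r + f) for y, (r, f) in table.items())      # score sum
        s2 = sum(y * y * (r + f) for y, (r, f) in table.items())  # squares
        F = sum(y * f for y, (r, f) in table.items())  # focal score sum
        sum_F += F
        sum_E += n_F * s1 / N
        sum_V += n_R * n_F * (N * s2 - s1 * s1) / (N * N * (N - 1))

    chi2 = (sum_F - sum_E) ** 2 / sum_V  # ~ chi-square, 1 df, under no DIF
    print(f"Mantel chi-square: {chi2:.3f}")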
Peer reviewed
Hopwood, Christopher J.; Richard, David C. S. – Assessment, 2005
Research on the Wechsler Adult Intelligence Scale-Revised and Wechsler Adult Intelligence Scale-Third Edition (WAIS-III) suggests that practicing clinical psychologists and graduate students make item-level scoring errors that affect IQ, index, and subtest scores. Studies have been limited in that Full-Scale IQ (FSIQ) and examiner administration,…
Descriptors: Scoring, Psychologists, Intelligence Quotient, Graduate Students
Rudner, Lawrence M. – 1992
Several common sources of error in assessments that depend on the use of judges are identified, and ways to reduce the impact of rating errors are examined. Numerous threats to the validity of scores based on ratings exist. These threats include: (1) the halo effect; (2) stereotyping; (3) perception differences; (4) leniency/stringency error; and…
Descriptors: Alternative Assessment, Error of Measurement, Evaluation Methods, Evaluators
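Of the threats listed, leniency/stringency error is the most directly measurable: a rater whose mean rating sits far from the pool's is a candidate for retraining or adjustment. A minimal sketch with invented ratings; the one-standard-deviation cutoff is an arbitrary illustration:

    import statistics

    # Hypothetical ratings awarded by each rater on a 1-6 scale.
    ratings = {
        "R1": [4, 5, 4, 5, 6, 5],
        "R2": [2, 3, 2, 1, 3, 2],  # possibly stringent
        "R3": [4, 4, 3, 5, 4, 4],
    }

    pool = [s for scores in ratings.values() for s in scores]
    pool_mean = statistics.mean(pool)
    pool_sd = statistics.stdev(pool)

    # Flag raters whose mean deviates from the pool mean by more than 1 SD.
    for rater, scores in ratings.items():
        dev = statistics.mean(scores) - pool_mean
        flag = "  <- possible leniency/stringency" if abs(dev) > pool_sd else ""
        print(f"{rater}: mean deviation {dev:+.2f}{flag}")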
Rizavi, Saba; Way, Walter D.; Davey, Tim; Herbert, Erin – Educational Testing Service, 2004
Item parameter estimates vary for a variety of reasons, including estimation error, characteristics of the examinee samples, and context effects (e.g., item location effects, section location effects, etc.). Although we expect variation based on theory, there is reason to believe that observed variation in item parameter estimates exceeds what…
Descriptors: Adaptive Testing, Test Items, Computation, Context Effect
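A simple check on whether observed variation exceeds estimation error is to compare the change in an item's parameter estimate against the standard errors combined in quadrature. A minimal sketch with invented IRT difficulty estimates; the two-standard-error criterion is an illustrative convention, not the report's analysis:

    # (b_first, se_first, b_second, se_second) across two calibrations.
    items = {
        "item01": (-0.42, 0.08, -0.35, 0.09),
        "item02": (0.10, 0.07, 0.55, 0.08),  # drifts beyond estimation error
        "item03": (1.20, 0.11, 1.05, 0.12),
    }

    for name, (b1, se1, b2, se2) in items.items():
        diff = b2 - b1
        se_diff = (se1**2 + se2**2) ** 0.5  # SEs combine in quadrature
        flagged = abs(diff) > 2 * se_diff   # ~95% criterion if errors normal
        print(f"{name}: diff={diff:+.2f}, 2*SE={2 * se_diff:.2f}, "
              f"{'possible drift' if flagged else 'within error'}")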
Jaeger, Richard M.; Busch, John Christian – 1986
This study explores the use of the modified caution index (MCI) for identifying judges whose patterns of recommendations suggest that their judgments might be based on incomplete information, flawed reasoning, or inattention to their standard-setting tasks. It also examines the effect on test standards and passing rates when the test standards of…
Descriptors: Criterion Referenced Tests, Error of Measurement, Evaluation Methods, High Schools
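The modified caution index has a more involved normalization than fits here, but the family's underlying idea, comparing an observed response pattern to the Guttman pattern implied by item difficulty, can be sketched with the simpler Sato-style caution index. All data are invented, and the judges' recommendations are treated as 0/1 endorsements:

    def covariance(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

    def caution_index(responses, p_values):
        """Sato-style caution index: 1 - cov(observed, p) / cov(Guttman, p).
        0 means a perfectly consistent pattern; larger values mean the
        pattern departs from what the item p-values predict."""
        order = sorted(range(len(p_values)), key=lambda j: -p_values[j])
        guttman = [0] * len(responses)
        for j in order[: sum(responses)]:  # credit the easiest items
            guttman[j] = 1
        return 1 - covariance(responses, p_values) / covariance(guttman, p_values)

    # Five items with pool-level endorsement rates, and two judges' patterns.
    p = [0.9, 0.8, 0.6, 0.4, 0.2]
    consistent = [1, 1, 1, 0, 0]  # endorses exactly the easiest items
    aberrant = [0, 1, 0, 1, 1]    # endorses against the pool's ordering

    print(caution_index(consistent, p))  # 0.0
    print(caution_index(aberrant, p))    # well above 0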