Showing all 15 results
Peer reviewed
Rios, Joseph A. – Applied Measurement in Education, 2022
Testing programs are confronted with the decision of whether to report individual scores for examinees who have engaged in rapid guessing (RG). As noted by the "Standards for Educational and Psychological Testing," this decision should be based on a documented criterion that determines score exclusion. To this end, a number of heuristic…
Descriptors: Testing, Guessing (Tests), Academic Ability, Scores
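One common family of heuristics alluded to in the abstract above flags rapid guesses from response times. A minimal sketch in Python, assuming a fixed 3-second threshold and a 10% RG-rate exclusion criterion; both values are hypothetical illustrations, not the paper's documented criterion:

    import numpy as np

    def rapid_guessing_rate(response_times, threshold=3.0):
        # Flag responses faster than a fixed threshold (in seconds) as
        # rapid guesses; return the proportion flagged for this examinee.
        rt = np.asarray(response_times, dtype=float)
        return float(np.mean(rt < threshold))

    times = [1.1, 14.2, 2.4, 22.8, 0.9, 17.5]  # hypothetical per-item response times
    rate = rapid_guessing_rate(times)
    print(f"RG rate {rate:.2f}; exclude score: {rate > 0.10}")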
Peer reviewed
Kim, Stella Yun; Lee, Won-Chan – Applied Measurement in Education, 2023
This study evaluates various scoring methods, including number-correct scoring, IRT theta scoring, and hybrid scoring, in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of…
Descriptors: Scoring, Comparative Analysis, Item Response Theory, Simulation
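For orientation, a minimal sketch of two of the scoring methods compared above: number-correct scoring and IRT theta scoring, the latter via EAP under a 2PL model with a standard normal prior. The item parameters and the grid-quadrature estimator are illustrative assumptions, not the authors' simulation code:

    import numpy as np

    def number_correct(responses):
        # Number-correct (raw) scoring: the sum of dichotomous item scores.
        return int(np.sum(responses))

    def eap_theta(responses, a, b, grid=np.linspace(-4, 4, 81)):
        # IRT theta scoring by EAP: 2PL likelihood times a standard
        # normal prior, integrated by simple quadrature on a grid.
        p = 1.0 / (1.0 + np.exp(-a[:, None] * (grid[None, :] - b[:, None])))
        like = np.prod(np.where(responses[:, None] == 1, p, 1.0 - p), axis=0)
        post = like * np.exp(-grid ** 2 / 2)
        return float(np.sum(grid * post) / np.sum(post))

    x = np.array([1, 1, 0, 1, 0])
    a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])    # made-up discriminations
    b = np.array([-0.5, 0.0, 0.3, 0.8, -1.0])  # made-up difficulties
    print(number_correct(x), round(eap_theta(x, a, b), 3))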
Peer reviewed
Abu-Ghazalah, Rashid M.; Dubins, David N.; Poon, Gregory M. K. – Applied Measurement in Education, 2023
Multiple choice results are inherently probabilistic outcomes, as correct responses reflect a combination of knowledge and guessing, while incorrect responses additionally reflect blunder, a confidently committed mistake. To objectively resolve knowledge from responses in an MC test structure, we evaluated probabilistic models that explicitly…
Descriptors: Guessing (Tests), Multiple Choice Tests, Probability, Models
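To make the idea concrete, here is a toy knowledge-guessing-blunder model of the general kind the abstract describes, fit by grid-search maximum likelihood. The specific parameterization, the four-option guessing rate, and the fixed blunder rate are assumptions for illustration, not the authors' models:

    import numpy as np

    def neg_log_lik(k, correct, total, m=4, blunder=0.05):
        # P(correct): the examinee knows the answer and does not blunder,
        # or does not know it and guesses correctly among m options.
        p = k * (1 - blunder) + (1 - k) / m
        return -(correct * np.log(p) + (total - correct) * np.log(1 - p))

    ks = np.linspace(0.001, 0.999, 999)
    k_hat = ks[np.argmin(neg_log_lik(ks, correct=32, total=50))]
    print(f"estimated knowledge probability: {k_hat:.2f}")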
Peer reviewed
Mehrazmay, Roghayeh; Ghonsooly, Behzad; de la Torre, Jimmy – Applied Measurement in Education, 2021
The present study aims to examine gender differential item functioning (DIF) in the reading comprehension section of a high-stakes test using cognitive diagnosis models. Based on the multiple-group generalized deterministic inputs, noisy "and" gate (MG G-DINA) model, the Wald test and likelihood ratio test are used to detect DIF. The flagged…
Descriptors: Test Bias, College Entrance Examinations, Gender Differences, Reading Tests
Peer reviewed
Pan, Tianshu; Yin, Yue – Applied Measurement in Education, 2017
In this article, we propose using the Bayes factors (BF) to evaluate person fit in item response theory models under the framework of Bayesian evaluation of an informative diagnostic hypothesis. We first discuss the theoretical foundation for this application and how to analyze person fit using BF. To demonstrate the feasibility of this approach,…
Descriptors: Bayesian Statistics, Goodness of Fit, Item Response Theory, Monte Carlo Methods
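The core quantity here is a Bayes factor: a ratio of marginal likelihoods under two competing hypotheses about one examinee's responses. A toy sketch, assuming a Rasch model against a blind-guessing alternative on four-option items; this illustrates the BF mechanics only, not the authors' informative-hypothesis framework:

    import numpy as np

    def rasch_marginal_likelihood(x, b, grid=np.linspace(-4, 4, 81)):
        # Marginal likelihood of a response pattern under the Rasch model,
        # integrating theta over a standard normal prior by quadrature.
        p = 1.0 / (1.0 + np.exp(-(grid[None, :] - b[:, None])))
        like = np.prod(np.where(x[:, None] == 1, p, 1.0 - p), axis=0)
        w = np.exp(-grid ** 2 / 2)
        return float(np.sum(like * w) / np.sum(w))

    def person_fit_bf(x, b):
        # BF > 1: the pattern is better explained by model-consistent
        # responding than by blind guessing (P = 0.25 per item).
        m_guess = 0.25 ** x.sum() * 0.75 ** (len(x) - x.sum())
        return rasch_marginal_likelihood(x, b) / m_guess

    x = np.array([1, 1, 1, 0, 1, 0, 1, 1])
    b = np.array([-1.0, -0.5, 0.0, 0.2, 0.5, 0.8, -0.2, 0.1])  # made-up
    print(f"Bayes factor: {person_fit_bf(x, b):.2f}")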
Peer reviewed
Solano-Flores, Guillermo – Applied Measurement in Education, 2014
This article addresses validity and fairness in the testing of English language learners (ELLs)--students in the United States who are developing English as a second language. It discusses limitations of current approaches to examining the linguistic features of items and their effect on the performance of ELL students. The article submits that…
Descriptors: English Language Learners, Test Items, Probability, Test Bias
Peer reviewed
Clauser, Jerome C.; Clauser, Brian E.; Hambleton, Ronald K. – Applied Measurement in Education, 2014
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a…
Descriptors: Standard Setting (Scoring), Validity, Reliability, Correlation
Peer reviewed
Kannan, Priya; Sgammato, Adrienne; Tannenbaum, Richard J.; Katz, Irvin R. – Applied Measurement in Education, 2015
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e.,…
Descriptors: Reliability, Standard Setting (Scoring), Cutting Scores, Test Items
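The aggregation behind the study's question can be sketched in a few lines: does a cut score computed from a random item subset track the full-pool cut score? The data below are simulated placeholders, and the G-theory variance decomposition itself is not reproduced here:

    import numpy as np

    rng = np.random.default_rng(0)
    # ratings[j, i]: judge j's estimated probability that a minimally
    # competent examinee answers item i correctly (hypothetical data).
    ratings = rng.uniform(0.3, 0.9, size=(10, 60))

    full_cut = ratings.mean()                        # expected proportion correct
    subset = rng.choice(60, size=20, replace=False)  # one random 20-item subset
    subset_cut = ratings[:, subset].mean()
    print(f"full pool: {full_cut:.3f}  subset: {subset_cut:.3f}")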
Peer reviewed
Oliveri, Maria Elena; Lawless, Rene; Robin, Frederic; Bridgeman, Brent – Applied Measurement in Education, 2018
We analyzed a pool of items from an admissions test for differential item functioning (DIF) for groups based on age, socioeconomic status, citizenship, or English language status using Mantel-Haenszel and item response theory. DIF items were systematically examined to identify possible sources of DIF by item type, content, and wording. DIF was…
Descriptors: Test Bias, Comparative Analysis, Item Banks, Item Response Theory
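For reference, the Mantel-Haenszel procedure named above pools 2x2 tables across total-score strata into a common odds ratio, often reported on the ETS delta scale. A minimal sketch with made-up counts; the table layout and variable names are assumptions:

    import numpy as np

    def mh_common_odds_ratio(tables):
        # tables: one 2x2 count array per total-score stratum,
        # rows = (reference, focal), columns = (correct, incorrect).
        num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)
        den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)
        return num / den

    def ets_delta(alpha):
        # ETS delta metric: values near 0 indicate negligible DIF;
        # |delta| >= 1.5 is the usual threshold for large (C-level) DIF.
        return -2.35 * np.log(alpha)

    strata = [np.array([[40, 10], [30, 20]]), np.array([[25, 5], [20, 10]])]
    print(f"MH delta: {ets_delta(mh_common_odds_ratio(strata)):.2f}")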
Peer reviewed
Humphry, Stephen; Heldsinger, Sandra; Andrich, David – Applied Measurement in Education, 2014
One of the best-known methods for setting a benchmark standard on a test is that of Angoff and its modifications. When scored dichotomously, judges estimate the probability that a benchmark student has of answering each item correctly. As in most methods of standard setting, it is assumed implicitly that the unit of the latent scale of the…
Descriptors: Foreign Countries, Standard Setting (Scoring), Judges, Item Response Theory
Peer reviewed
DeMars, Christine E. – Applied Measurement in Education, 2011
Three types of effect sizes for DIF are described in this exposition: log of the odds-ratio (differences in log-odds), differences in probability-correct, and proportion of variance accounted for. Using these indices involves conceptualizing the degree of DIF in different ways. This integrative review discusses how these measures are impacted in…
Descriptors: Effect Size, Test Bias, Probability, Difficulty Level
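The first two effect-size metrics in the review are easy to compute directly. A sketch evaluating both at a single matched-ability point under a 2PL model; the item parameters are invented for illustration:

    import numpy as np

    def p_correct(theta, a, b):
        # 2PL item response function.
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = 0.0                                 # matched-ability point
    p_ref = p_correct(theta, a=1.2, b=-0.1)     # reference-group curve
    p_foc = p_correct(theta, a=1.2, b=0.4)      # focal-group curve (made up)

    log_odds_diff = np.log(p_ref / (1 - p_ref)) - np.log(p_foc / (1 - p_foc))
    prob_diff = p_ref - p_foc
    print(f"log-odds diff: {log_odds_diff:.3f}  probability diff: {prob_diff:.3f}")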
Peer reviewed
Phillips, Gary W. – Applied Measurement in Education, 2015
This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling…
Descriptors: State Programs, Sampling, Research Design, Error of Measurement
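The arithmetic behind the article's argument is the standard cluster-sampling design effect, deff = 1 + (m - 1) * rho. With the hypothetical numbers below, ignoring clustering understates the standard error by a factor of about two:

    import math

    m, rho, n = 25, 0.15, 2000            # cluster size, intraclass corr., total n
    deff = 1 + (m - 1) * rho              # design effect for cluster sampling
    n_eff = n / deff                      # effective sample size
    se_srs = 0.5 / math.sqrt(n)           # naive SE of a proportion near 0.5
    se_clustered = se_srs * math.sqrt(deff)
    print(f"deff={deff:.2f}  effective n={n_eff:.0f}  SE {se_srs:.4f} -> {se_clustered:.4f}")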
Peer reviewed
Clauser, Brian E.; Harik, Polina; Margolis, Melissa J.; McManus, I. C.; Mollon, Jennifer; Chis, Liliana; Williams, Simon – Applied Measurement in Education, 2009
Numerous studies have compared the Angoff standard-setting procedure to other standard-setting methods, but relatively few studies have evaluated the procedure based on internal criteria. This study uses a generalizability theory framework to evaluate the stability of the estimated cut score. To provide a measure of internal consistency, this…
Descriptors: Generalizability Theory, Group Discussion, Standard Setting (Scoring), Scoring
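As a deliberately simplified stand-in for the paper's generalizability analysis, "stability of the estimated cut score" can be conveyed by treating each judge's mean rating as an independent replicate of the panel cut score; the data and the one-facet simplification are assumptions:

    import numpy as np

    def cut_score_and_se(ratings):
        # ratings[j, i]: judge j's Angoff judgment for item i (hypothetical).
        # Report the panel-mean cut score and its standard error across judges.
        judge_means = ratings.mean(axis=1)
        nj = len(judge_means)
        return judge_means.mean(), judge_means.std(ddof=1) / np.sqrt(nj)

    rng = np.random.default_rng(1)
    cut, se = cut_score_and_se(rng.uniform(0.4, 0.8, size=(12, 50)))
    print(f"cut = {cut:.3f} +/- {se:.3f}")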
Peer reviewed
Edwards, Don; Cummings, Cynthia B. – Applied Measurement in Education, 1990
An evolved form of the Edwards and Beckworth (1989) model for the probability of selection for Scholastic Achievement Test takers, using truncated normal distributions, is presented. It is shown that the arguments of Holland and Wainer are not sufficient to reject this model. (SLD)
Descriptors: College Entrance Examinations, Models, Participation, Probability
Peer reviewed
Bergstrom, Betty A.; And Others – Applied Measurement in Education, 1992
Effects of altering test difficulty on examinee ability measures and test length in a computer adaptive test were studied for 225 medical technology students in 3 test difficulty conditions. Results suggest that, with an item pool of sufficient depth and breadth, acceptable targeting of test difficulty is possible. (SLD)
Descriptors: Ability, Adaptive Testing, Change, College Students
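The difficulty-targeting mechanism manipulated in this study can be sketched as an item-selection rule for a Rasch pool: pick the unused item whose difficulty puts the expected success probability closest to a target. The selection rule and the pool below are illustrative assumptions, not the study's algorithm:

    import numpy as np

    def next_item(theta, b_pool, administered, target_p=0.5):
        # target_p = 0.5 gives maximum information at the current theta;
        # higher targets produce a deliberately easier test.
        b_target = theta - np.log(target_p / (1 - target_p))
        unused = [i for i in range(len(b_pool)) if i not in administered]
        return min(unused, key=lambda i: abs(b_pool[i] - b_target))

    b_pool = np.array([-1.5, -0.8, -0.2, 0.0, 0.4, 0.9, 1.6])  # made-up pool
    print(next_item(theta=0.3, b_pool=b_pool, administered={3}, target_p=0.6))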