ERIC - Search Results

Publication Date

In 2025	0
Since 2024	0
Since 2021 (last 5 years)	1
Since 2016 (last 10 years)	4
Since 2006 (last 20 years)	13

Descriptor

Error of Measurement	15
Test Items	15
Equated Scores	9
Statistical Analysis	7
Comparative Analysis	6
Item Response Theory	5
Simulation	5
Test Length	5
Sample Size	4
Test Construction	4
Ability	3
Evaluation Criteria	3
Statistical Bias	3
Computation	2
Computer Assisted Testing	2
Difficulty Level	2
Evaluation Methods	2
Factor Analysis	2
Generalizability Theory	2
Item Analysis	2
Methods	2
Raw Scores	2
Reliability	2
Sampling	2
Scaling	2
More ▼

Source

ETS Research Report Series

Publication Type

Journal Articles	15
Reports - Research	15
Numerical/Quantitative Data	1
Speeches/Meeting Papers	1

Education Level

Grade 3	1
Higher Education	1
Postsecondary Education	1

Audience

Location

Laws, Policies, & Programs

Assessments and Surveys

SAT (College Admission Test)

What Works Clearinghouse Rating

Showing all 15 results Save | Export

Robustness of Weighted Differential Item Functioning (DIF) Analysis: The Case of Mantel-Haenszel DIF Statistics. Research Report. ETS RR-21-12

Peer reviewed
PDF on ERIC

Download full text

Lu, Ru; Guo, Hongwen; Dorans, Neil J. – ETS Research Report Series, 2021

Two families of analysis methods can be used for differential item functioning (DIF) analysis. One family is DIF analysis based on observed scores, such as the Mantel-Haenszel (MH) and the standardized proportion-correct metric for DIF procedures; the other is analysis based on latent ability, in which the statistic is a measure of departure from…

Descriptors: Robustness (Statistics), Weighted Scores, Test Items, Item Analysis

Different Methods of Adjusting for Form Difficulty under the Rasch Model: Impact on Consistency of Assessment Results. Research Report. ETS RR-19-08

Peer reviewed
PDF on ERIC

Download full text

Manna, Venessa F.; Gu, Lixiong – ETS Research Report Series, 2019

When using the Rasch model, equating with a nonequivalent groups anchor test design is commonly achieved by adjustment of new form item difficulty using an additive equating constant. Using simulated 5-year data, this report compares 4 approaches to calculating the equating constants and the subsequent impact on equating results. The 4 approaches…

Descriptors: Item Response Theory, Test Items, Test Construction, Sample Size

A Modified "a"-Stratified Method for Computerized Adaptive Testing. Research Report. ETS RR-19-10

Peer reviewed
PDF on ERIC

Download full text

Gu, Lixiong; Ling, Guangming; Qu, Yanxuan – ETS Research Report Series, 2019

Research has found that the "a"-stratified item selection strategy (STR) for computerized adaptive tests (CATs) may lead to insufficient use of high a items at later stages of the tests and thus to reduced measurement precision. A refined approach, unequal item selection across strata (USTR), effectively improves test precision over the…

Descriptors: Computer Assisted Testing, Adaptive Testing, Test Use, Test Items

An Information-Correction Method for Testlet-Based Test Analysis: From the Perspectives of Item Response Theory and Generalizability Theory. Research Report. ETS RR-17-27

Peer reviewed
PDF on ERIC

Download full text

Li, Feifei – ETS Research Report Series, 2017

An information-correction method for testlet-based tests is introduced. This method takes advantage of both generalizability theory (GT) and item response theory (IRT). The measurement error for the examinee proficiency parameter is often underestimated when a unidimensional conditional-independence IRT model is specified for a testlet dataset. By…

Descriptors: Item Response Theory, Generalizability Theory, Tests, Error of Measurement

A Criterion to Evaluate the Individual Raw-to-Scale Equating Conversions. Research Report. ETS RR-13-05

Peer reviewed
PDF on ERIC

Download full text

Guo, Hongwen; Puhan, Gautam; Walker, Michael – ETS Research Report Series, 2013

In this study we investigated when an equating conversion line is problematic in terms of gaps and clumps. We suggest using the conditional standard error of measurement (CSEM) to measure the scale scores that are inappropriate in the overall raw-to-scale transformation.

Descriptors: Equated Scores, Test Items, Evaluation Criteria, Error of Measurement

Exploring Alternative Test Form Linking Designs with Modified Equating Sample Size and Anchor Test Length. Research Report. ETS RR-13-02

Peer reviewed
PDF on ERIC

Download full text

Wang, Lin; Qian, Jiahe; Lee, Yi-Hsuan – ETS Research Report Series, 2013

The purpose of this study was to evaluate the combined effects of reduced equating sample size and shortened anchor test length on item response theory (IRT)-based linking and equating results. Data from two independent operational forms of a large-scale testing program were used to establish the baseline results for evaluating the results from…

Descriptors: Test Construction, Item Response Theory, Testing Programs, Simulation

Comparison of the Effects of Discrete Anchor Items and Assage-Based Anchor Items on Observed-Score Equating Results. Research Report. ETS RR-09-44

Peer reviewed
PDF on ERIC

Download full text

Zu, Jiyun; Liu, Jinghua – ETS Research Report Series, 2009

Equating of tests composed of both discrete and passage-based items using the nonequivalent groups with anchor test (NEAT) design is popular in practice. This study investigated the impact of discrete anchor items and passage-based anchor items on observed score equating via simulation. Results suggested that an anchor with a larger proportion of…

Descriptors: Comparative Analysis, Equated Scores, Test Items, Simulation

Methods of Linking with Small Samples in a Common-Item Design: An Empirical Comparison. Research Report. ETS RR-09-38

Peer reviewed
PDF on ERIC

Download full text

Kim, Sooyeon; Livingston, Samuel A. – ETS Research Report Series, 2009

A series of resampling studies was conducted to compare the accuracy of equating in a common item design using four different methods: chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, and the circle-arc method. Four operational test forms, each containing more than 100 items, were used for…

Descriptors: Sampling, Sample Size, Accuracy, Test Items

Impossible Scores Resulting in Zero Frequencies in the Anchor Test: Impact on Smoothing and Equating. Research Report. ETS RR-08-10

Peer reviewed
PDF on ERIC

Download full text

Puhan, Gautam; vonDavier, Alina; Gupta, Shaloo – ETS Research Report Series, 2008

Equating under the external anchor design is frequently conducted using scaled scores on the anchor test. However, scaled scores often lead to the unique problem of creating zero frequencies in the score distribution because there may not always be a one-to-one correspondence between raw and scaled scores. For example, raw scores of 17 and 18 may…

Descriptors: Equated Scores, Test Items, Raw Scores, Statistical Analysis

Comparing Different Approaches of Bias Correction for Ability Estimation in IRT Models. Research Report. ETS RR-08-13

Peer reviewed
PDF on ERIC

Download full text

Lee, Yi-Hsuan; Zhang, Jinming – ETS Research Report Series, 2008

The method of maximum-likelihood is typically applied to item response theory (IRT) models when the ability parameter is estimated while conditioning on the true item parameters. In practice, the item parameters are unknown and need to be estimated first from a calibration sample. Lewis (1985) and Zhang and Lu (2007) proposed the expected response…

Descriptors: Item Response Theory, Comparative Analysis, Computation, Ability

Reliability and the Nonequivalent Groups with Anchor Test Design. Research Report. ETS RR-07-16

Peer reviewed
PDF on ERIC

Download full text

Moses, Tim; Kim, Sooyeon – ETS Research Report Series, 2007

This study evaluated the impact of unequal reliability on test equating methods in the nonequivalent groups with anchor test (NEAT) design. Classical true score-based models were compared in terms of their assumptions about how reliability impacts test scores. These models were related to treatment of population ability differences by different…

Descriptors: Reliability, Equated Scores, Test Items, Statistical Analysis

An Evaluation of the Kernel Equating Method: A Special Study with Pseudotests Constructed from Real Test Data. Research Report. ETS RR-06-02

Peer reviewed
PDF on ERIC

Download full text

von Davier, Alina A.; Holland, Paul W.; Livingston, Samuel A.; Casabianca, Jodi; Grant, Mary C.; Martin, Kathleen – ETS Research Report Series, 2006

This study examines how closely the kernel equating (KE) method (von Davier, Holland, & Thayer, 2004a) approximates the results of other observed-score equating methods--equipercentile and linear equatings. The study used pseudotests constructed of item responses from a real test to simulate three equating designs: an equivalent groups (EG)…

Descriptors: Equated Scores, Statistical Analysis, Simulation, Tests

Choice of Anchor Test in Equating. Research Report. ETS RR-06-35

Peer reviewed
PDF on ERIC

Download full text

Sinharay, Sandip; Holland, Paul – ETS Research Report Series, 2006

It is a widely held belief that anchor tests should be miniature versions (i.e., minitests), with respect to content and statistical characteristics of the tests being equated. This paper examines the foundations for this belief. It examines the requirement of statistical representativeness of anchor tests that are content representative. The…

Descriptors: Test Items, Equated Scores, Evaluation Methods, Difficulty Level

When Can Subscores Have Value? Research Report. ETS RR-05-08

Peer reviewed
PDF on ERIC

Download full text

Haberman, Shelby J. – ETS Research Report Series, 2005

In educational tests, subscores are often generated from a portion of the items in a larger test. Guidelines based on mean-squared error are proposed to indicate whether subscores are worth reporting. Alternatives considered are direct reports of subscores, estimates of subscores based on total score, combined estimates based on subscores and…

Descriptors: Scores, Test Items, Error of Measurement, Computation

Factor Structure of the LanguEdge™ Test across Language Groups. TOEFL® Monograph Series. MS-32. ETS RR-05-12

Peer reviewed
PDF on ERIC

Download full text

Stricker, Lawrence J.; Rock, Donald A.; Lee, Yong-Won – ETS Research Report Series, 2005

This study assessed the factor structure of the LanguEdge™ test and the invariance of its factors across language groups. Confirmatory factor analyses of individual tasks and subsets of items in the four sections of the test, Listening, Reading, Speaking, and Writing, was carried out for Arabic-, Chinese-, and Spanish-speaking test takers. Two…

Descriptors: Factor Structure, Language Tests, Factor Analysis, Semitic Languages

Gu, Lixiong	2
Guo, Hongwen	2
Kim, Sooyeon	2
Lee, Yi-Hsuan	2
Livingston, Samuel A.	2
Puhan, Gautam	2
Casabianca, Jodi	1
Dorans, Neil J.	1
Grant, Mary C.	1
Gupta, Shaloo	1
Haberman, Shelby J.	1
Holland, Paul	1
Holland, Paul W.	1
Lee, Yong-Won	1
Li, Feifei	1
Ling, Guangming	1
Liu, Jinghua	1
Lu, Ru	1
Manna, Venessa F.	1
Martin, Kathleen	1
Moses, Tim	1
Qian, Jiahe	1
Qu, Yanxuan	1
Rock, Donald A.	1
Sinharay, Sandip	1
More ▼