Showing 1 to 15 of 22 results
Peer reviewed
Kim, Stella Yun; Lee, Won-Chan – Applied Measurement in Education, 2023
This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of…
Descriptors: Scoring, Comparative Analysis, Item Response Theory, Simulation
Peer reviewed
Robitzsch, Alexander; Lüdtke, Oliver – Journal of Educational and Behavioral Statistics, 2022
One of the primary goals of international large-scale assessments in education is the comparison of country means in student achievement. This article introduces a framework for discussing differential item functioning (DIF) for such mean comparisons. We compare three different linking methods: concurrent scaling based on full invariance,…
Descriptors: Test Bias, International Assessment, Scaling, Comparative Analysis
Peer reviewed
Kang, Hyeon-Ah; Lu, Ying; Chang, Hua-Hua – Applied Measurement in Education, 2017
Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent…
Descriptors: Item Response Theory, Accuracy, Educational Assessment, Test Items
Peer reviewed
PDF on ERIC
Attali, Yigal; Saldivia, Luis; Jackson, Carol; Schuppan, Fred; Wanamaker, Wilbur – ETS Research Report Series, 2014
Previous investigations of the ability of content experts and test developers to estimate item difficulty have, for the most part, produced disappointing results. These investigations were based on a noncomparative method of independently rating the difficulty of items. In this article, we argue that, by eliciting comparative judgments of…
Descriptors: Test Items, Difficulty Level, Comparative Analysis, College Entrance Examinations
Peer reviewed
He, Yong; Cui, Zhongmin; Fang, Yu; Chen, Hanwei – Applied Psychological Measurement, 2013
Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter…
Descriptors: Regression (Statistics), Item Response Theory, Test Items, Equated Scores
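The regression-based check this abstract alludes to can be illustrated in a few lines: regress new-form common-item difficulty estimates on old-form estimates and flag large standardized residuals. The numbers below are invented, and the 2.0 cutoff is an arbitrary illustration, not the authors' criterion:

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for 8 common items on an old
# and a new form; item index 6 is perturbed to mimic an inconsistent
# (drifting) anchor item.
b_old = np.array([-1.2, -0.8, -0.3, 0.0, 0.4, 0.7, 1.1, 1.5])
b_new = np.array([-1.14, -0.70, -0.25, 0.13, 0.49, 0.86, 1.96, 1.70])

slope, intercept = np.polyfit(b_old, b_new, 1)   # least-squares line
resid = b_new - (slope * b_old + intercept)      # deviations from the line
z = resid / resid.std()                          # standardized residuals
flagged = np.where(np.abs(z) > 2.0)[0]           # candidate inconsistent items
print(flagged)                                   # → [6]
```

Items flagged this way would be candidates for removal from the anchor set before equating, since leaving them in can bias equated scores.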
Peer reviewed
Hartig, Johannes; Frey, Andreas; Nold, Gunter; Klieme, Eckhard – Educational and Psychological Measurement, 2012
The article compares three different methods to estimate effects of task characteristics and to use these estimates for model-based proficiency scaling: prediction of item difficulties from the Rasch model, the linear logistic test model (LLTM), and an LLTM including random item effects (LLTM+e). The methods are applied to empirical data from a…
Descriptors: Item Response Theory, Models, Methods, Computation
Peer reviewed
Almehrizi, Rashid S. – Applied Psychological Measurement, 2013
The majority of large-scale assessments develop various score scales that are either linear or nonlinear transformations of raw scores for better interpretations and uses of assessment results. The current formula for coefficient alpha (α; the commonly used reliability coefficient) only provides internal consistency reliability estimates of raw…
Descriptors: Raw Scores, Scaling, Reliability, Computation
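For reference, the standard coefficient alpha on raw scores (the quantity this entry generalizes to transformed scale scores) can be computed directly from an item-score matrix. A minimal sketch with made-up 0/1 response data, not the paper's scale-score extension:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # per-item score variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of raw total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical dichotomous responses: 6 examinees, 4 items
x = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
print(round(cronbach_alpha(x), 3))              # → 0.833
```

Applying this formula to nonlinearly transformed scale scores is exactly what the article argues the standard formula does not license.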
Adams, Ray; Berezner, Alla; Jakubowski, Maciej – OECD Publishing (NJ1), 2010
This paper uses an approximate average percent-correct methodology to compare the ranks that would be obtained for PISA 2006 countries if the rankings had been derived from items judged by each country to be of highest priority for inclusion. The results reported show a remarkable consistency in the country rank orderings across different sets of…
Descriptors: Science Tests, Preferences, Test Items, Scores
Powers, Sonya; Turhan, Ahmet; Binici, Salih – Pearson, 2012
The population sensitivity of vertical scaling results was evaluated for a state reading assessment spanning grades 3-10 and a state mathematics test spanning grades 3-8. Subpopulations considered included males and females. The 3-parameter logistic model was used to calibrate math and reading items and a common item design was used to construct…
Descriptors: Scaling, Equated Scores, Standardized Tests, Reading Tests
Wu, Margaret – OECD Publishing (NJ1), 2010
This paper makes an in-depth comparison of the PISA (OECD) and TIMSS (IEA) mathematics assessments conducted in 2003. First, a comparison of survey methodologies is presented, followed by an examination of the mathematics frameworks in the two studies. The methodologies and the frameworks in the two studies form the basis for providing…
Descriptors: Mathematics Achievement, Foreign Countries, Gender Differences, Comparative Analysis
Peer reviewed
PDF on ERIC
Moses, Tim; Kim, Sooyeon – ETS Research Report Series, 2007
This study evaluated the impact of unequal reliability on test equating methods in the nonequivalent groups with anchor test (NEAT) design. Classical true score-based models were compared in terms of their assumptions about how reliability impacts test scores. These models were related to treatment of population ability differences by different…
Descriptors: Reliability, Equated Scores, Test Items, Statistical Analysis
Peer reviewed
Van Onna, Marieke J. H. – Applied Psychological Measurement, 2004
Coefficient "H" is used as an index of scalability in nonparametric item response theory (NIRT). It indicates the degree to which a set of items rank orders examinees. Theoretical sampling distributions, however, have only been derived asymptotically and only under restrictive conditions. Bootstrap methods offer an alternative possibility to…
Descriptors: Sampling, Item Response Theory, Scaling, Comparative Analysis
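The bootstrap approach this entry describes can be sketched briefly: compute coefficient H (the Loevinger/Mokken scalability coefficient) on the observed data, then resample examinees with replacement to approximate its sampling distribution. The data-generating model and interval below are illustrative assumptions, not Van Onna's design:

```python
import numpy as np

def scalability_H(items):
    """Loevinger/Mokken coefficient H for a binary examinees-by-items
    matrix: sum of inter-item covariances over the maximum attainable
    given the item marginals."""
    p = items.mean(axis=0)                        # item popularities
    cov = np.cov(items, rowvar=False, ddof=0)     # population covariances
    num = den = 0.0
    k = items.shape[1]
    for i in range(k):
        for j in range(i + 1, k):
            num += cov[i, j]
            lo, hi = sorted((p[i], p[j]))
            den += lo * (1.0 - hi)                # max covariance given marginals
    return num / den

def bootstrap_H(items, n_boot=500, seed=1):
    """Percentile bootstrap interval for H, resampling examinees."""
    rng = np.random.default_rng(seed)
    n = items.shape[0]
    hs = [scalability_H(items[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.percentile(hs, [2.5, 97.5])

# Made-up responses from a simple latent-trait model so items scale together
rng = np.random.default_rng(0)
theta = rng.normal(size=300)
difficulty = np.array([-1.0, -0.3, 0.3, 1.0])
x = (theta[:, None] - difficulty + rng.normal(size=(300, 4)) > 0).astype(int)
print(scalability_H(x), bootstrap_H(x))
```

The percentile interval is one of several bootstrap variants; the article's point is that such resampling avoids the restrictive conditions under which the asymptotic distribution of H was derived.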
Peer reviewed
Fitzpatrick, Anne R.; And Others – Journal of Educational Measurement, 1996
One-parameter (1PPC) and two-parameter partial credit (2PPC) models were compared using real and simulated data with constructed response items present. Results suggest that the more flexible three-parameter logistic-2PPC model combination produces better model fit than the combination of the one-parameter logistic and the 1PPC models. (SLD)
Descriptors: Comparative Analysis, Constructed Response, Goodness of Fit, Performance Based Assessment
Eignor, Daniel R. – 1993
Procedures used to establish the comparability of scores derived from the College Board Admissions Testing Program (ATP) computer adaptive Scholastic Aptitude Test (SAT) prototype and the paper-and-pencil SAT are described in this report. Both the prototype, which is made up of Verbal and Mathematics computer adaptive tests (CATs), and a form of…
Descriptors: Adaptive Testing, College Entrance Examinations, Comparative Analysis, Computer Assisted Testing
Ban, Jae-Chun; Hanson, Bradley A.; Wang, Tianyou; Yi, Qing; Harris, Deborah J. – 2000
The purpose of this study was to compare and evaluate five online pretest item calibration/scaling methods in computerized adaptive testing (CAT): (1) the marginal maximum likelihood estimate with one-EM cycle (OEM); (2) the marginal maximum likelihood estimate with multiple EM cycles (MEM); (3) Stocking's Method A (M. Stocking, 1988); (4)…
Descriptors: Adaptive Testing, Comparative Analysis, Computer Assisted Testing, Estimation (Mathematics)