ERIC - Search Results

Showing 1 to 15 of 30 results

Maintaining Score Scales over Time: A Comparison of Five Scoring Methods

Peer reviewed

Kim, Stella Yun; Lee, Won-Chan – Applied Measurement in Education, 2023

This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of…

Descriptors: Scoring, Comparative Analysis, Item Response Theory, Simulation

Mean Comparisons of Many Groups in the Presence of DIF: An Evaluation of Linking and Concurrent Scaling Approaches

Peer reviewed

Robitzsch, Alexander; Lüdtke, Oliver – Journal of Educational and Behavioral Statistics, 2022

One of the primary goals of international large-scale assessments in education is the comparison of country means in student achievement. This article introduces a framework for discussing differential item functioning (DIF) for such mean comparisons. We compare three different linking methods: concurrent scaling based on full invariance,…

Descriptors: Test Bias, International Assessment, Scaling, Comparative Analysis

IRT Item Parameter Scaling for Developing New Item Pools

Peer reviewed

Kang, Hyeon-Ah; Lu, Ying; Chang, Hua-Hua – Applied Measurement in Education, 2017

Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. The three scaling procedures are considered: (a) concurrent…

Descriptors: Item Response Theory, Accuracy, Educational Assessment, Test Items

Multidimensional Classification of Examinees Using the Mixture Random Weights Linear Logistic Test Model

Peer reviewed

Choi, In-Hee; Wilson, Mark – Educational and Psychological Measurement, 2015

An essential feature of the linear logistic test model (LLTM) is that item difficulties are explained using item design properties. By taking advantage of this explanatory aspect of the LLTM, in a mixture extension of the LLTM, the meaning of latent classes is specified by how item properties affect item difficulties within each class. To improve…

Descriptors: Classification, Test Items, Difficulty Level, Statistical Analysis

Estimating Item Difficulty with Comparative Judgments. Research Report. ETS RR-14-39

Peer reviewed
PDF on ERIC

Attali, Yigal; Saldivia, Luis; Jackson, Carol; Schuppan, Fred; Wanamaker, Wilbur – ETS Research Report Series, 2014

Previous investigations of the ability of content experts and test developers to estimate item difficulty have, for the most part, produced disappointing results. These investigations were based on a noncomparative method of independently rating the difficulty of items. In this article, we argue that, by eliciting comparative judgments of…

Descriptors: Test Items, Difficulty Level, Comparative Analysis, College Entrance Examinations

Using a Linear Regression Method to Detect Outliers in IRT Common Item Equating

Peer reviewed

He, Yong; Cui, Zhongmin; Fang, Yu; Chen, Hanwei – Applied Psychological Measurement, 2013

Common test items play an important role in equating alternate test forms under the common item nonequivalent groups design. When the item response theory (IRT) method is applied in equating, inconsistent item parameter estimates among common items can lead to large bias in equated scores. It is prudent to evaluate inconsistency in parameter…

Descriptors: Regression (Statistics), Item Response Theory, Test Items, Equated Scores

Assessing the Development of Educational Research Literacy: The Effect of Courses on Research Methods in Studies of Educational Science

Peer reviewed

Groß Ophoff, Jana; Schladitz, Sandra; Leuders, Juliane; Leuders, Timo; Wirtz, Markus A. – Peabody Journal of Education, 2015

The ability to purposefully access, reflect, and use evidence from educational research (Educational Research Literacy) is expected of future professionals in educational practice. Based on the presented conceptual framework, a test instrument was developed to assess the different competency aspects: Information Literacy, Statistical Literacy, and…

Descriptors: Educational Research, Research Methodology, Literacy, Educational Development

An Application of Explanatory Item Response Modeling for Model-Based Proficiency Scaling

Peer reviewed

Hartig, Johannes; Frey, Andreas; Nold, Gunter; Klieme, Eckhard – Educational and Psychological Measurement, 2012

The article compares three different methods to estimate effects of task characteristics and to use these estimates for model-based proficiency scaling: prediction of item difficulties from the Rasch model, the linear logistic test model (LLTM), and an LLTM including random item effects (LLTM+e). The methods are applied to empirical data from a…

Descriptors: Item Response Theory, Models, Methods, Computation

Coefficient Alpha and Reliability of Scale Scores

Peer reviewed

Almehrizi, Rashid S. – Applied Psychological Measurement, 2013

The majority of large-scale assessments develop various score scales that are either linear or nonlinear transformations of raw scores for better interpretations and uses of assessment results. The current formula for coefficient alpha (α; the commonly used reliability coefficient) only provides internal consistency reliability estimates of raw…

Descriptors: Raw Scores, Scaling, Reliability, Computation

Analysis of PISA 2006 Preferred Items Ranking Using the Percent-Correct Method. OECD Education Working Papers, No. 46

Adams, Ray; Berezner, Alla; Jakubowski, Maciej – OECD Publishing (NJ1), 2010

This paper uses an approximate average percent-correct methodology to compare the ranks that would be obtained for PISA 2006 countries if the rankings had been derived from items judged by each country to be of highest priority for inclusion. The results reported show a remarkable consistency in the country rank orderings across different sets of…

Descriptors: Science Tests, Preferences, Test Items, Scores

Population Invariance of Vertical Scaling Results

Powers, Sonya; Turhan, Ahmet; Binici, Salih – Pearson, 2012

The population sensitivity of vertical scaling results was evaluated for a state reading assessment spanning grades 3-10 and a state mathematics test spanning grades 3-8. Subpopulations considered included males and females. The 3-parameter logistic model was used to calibrate math and reading items and a common item design was used to construct…

Descriptors: Scaling, Equated Scores, Standardized Tests, Reading Tests

A Multilevel Nonlinear Profile Analysis Model for Dichotomous Data

Peer reviewed

Culpepper, Steven Andrew – Multivariate Behavioral Research, 2009

This study linked nonlinear profile analysis (NPA) of dichotomous responses with an existing family of item response theory models and generalized latent variable models (GLVM). The NPA method offers several benefits over previous internal profile analysis methods: (a) NPA is estimated with maximum likelihood in a GLVM framework rather than…

Descriptors: Profiles, Item Response Theory, Models, Maximum Likelihood Statistics

Comparing the Similarities and Differences of PISA 2003 and TIMSS. OECD Education Working Papers, No. 32

Wu, Margaret – OECD Publishing (NJ1), 2010

This paper makes an in-depth comparison of the PISA (OECD) and TIMSS (IEA) mathematics assessments conducted in 2003. First, a comparison of survey methodologies is presented, followed by an examination of the mathematics frameworks in the two studies. The methodologies and the frameworks in the two studies form the basis for providing…

Descriptors: Mathematics Achievement, Foreign Countries, Gender Differences, Comparative Analysis

Reliability and the Nonequivalent Groups with Anchor Test Design. Research Report. ETS RR-07-16

Peer reviewed
PDF on ERIC

Moses, Tim; Kim, Sooyeon – ETS Research Report Series, 2007

This study evaluated the impact of unequal reliability on test equating methods in the nonequivalent groups with anchor test (NEAT) design. Classical true score-based models were compared in terms of their assumptions about how reliability impacts test scores. These models were related to treatment of population ability differences by different…

Descriptors: Reliability, Equated Scores, Test Items, Statistical Analysis

Estimates of the Sampling Distribution of Scalability Coefficient H

Peer reviewed

Van Onna, Marieke J. H. – Applied Psychological Measurement, 2004

Coefficient "H" is used as an index of scalability in nonparametric item response theory (NIRT). It indicates the degree to which a set of items rank orders examinees. Theoretical sampling distributions, however, have only been derived asymptotically and only under restrictive conditions. Bootstrap methods offer an alternative possibility to…

Descriptors: Sampling, Item Response Theory, Scaling, Comparative Analysis
