Publication Date
In 2025 | 1 |
Since 2024 | 3 |
Since 2021 (last 5 years) | 6 |
Since 2016 (last 10 years) | 16 |
Since 2006 (last 20 years) | 34 |
Descriptor
Testing | 66 |
Test Items | 19 |
Scores | 15 |
Test Reliability | 15 |
Comparative Analysis | 14 |
Measurement Techniques | 11 |
Test Construction | 11 |
Psychometrics | 10 |
Scoring | 9 |
Simulation | 9 |
Test Interpretation | 9 |
Source
Journal of Educational… | 66 |
Author
Puhan, Gautam | 6 |
Chen, Ping | 2 |
Ding, Shuliang | 2 |
Gierl, Mark J. | 2 |
Guo, Hongwen | 2 |
Hakstian, A. Ralph | 2 |
Kansup, Wanlop | 2 |
Kim, Sooyeon | 2 |
Linn, Robert L. | 2 |
Lord, Frederic M. | 2 |
Sinharay, Sandip | 2 |
Publication Type
Journal Articles | 45 |
Reports - Research | 22 |
Reports - Evaluative | 10 |
Reports - Descriptive | 8 |
Opinion Papers | 3 |
Guides - Non-Classroom | 1 |
Information Analyses | 1 |
Tests/Questionnaires | 1 |
Education Level
Higher Education | 1 |
Postsecondary Education | 1 |
Secondary Education | 1 |
Audience
Practitioners | 1 |
Researchers | 1 |
Location
Canada | 1 |
Augustin Mutak; Robert Krause; Esther Ulitzsch; Sören Much; Jochen Ranger; Steffi Pohl – Journal of Educational Measurement, 2024
Understanding the intraindividual relation between an individual's speed and ability in testing scenarios is essential to assure a fair assessment. Different approaches exist for estimating this relationship; they rely either on specific study designs or on specific assumptions. This paper aims to add to the toolbox of approaches for estimating…
Descriptors: Testing, Academic Ability, Time on Task, Correlation
Gregory M. Hurtz; Regi Mucino – Journal of Educational Measurement, 2024
The Lognormal Response Time (LNRT) model measures the speed of test-takers relative to the normative time demands of items on a test. The resulting speed parameters and model residuals are often analyzed for evidence of anomalous test-taking behavior associated with fast and poorly fitting response time patterns. Extending this model, we…
Descriptors: Student Reaction, Reaction Time, Response Style (Tests), Test Items
Park, Seohee; Kim, Kyung Yong; Lee, Won-Chan – Journal of Educational Measurement, 2023
Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the widespread use of multiple measures, there is little research on their classification consistency and accuracy. Accordingly, this study introduces an…
Descriptors: Testing, Computation, Classification, Accuracy
Baldwin, Peter; Clauser, Brian E. – Journal of Educational Measurement, 2022
While score comparability across test forms typically relies on common (or randomly equivalent) examinees or items, innovations in item formats, test delivery, and efforts to extend the range of score interpretation may require a special data collection before examinees or items can be used in this way--or may be incompatible with common examinee…
Descriptors: Scoring, Testing, Test Items, Test Format
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction
Puhan, Gautam; Kim, Sooyeon – Journal of Educational Measurement, 2022
As a result of the COVID-19 pandemic, at-home testing has become a popular delivery mode in many testing programs. When programs offer at-home testing to expand their service, the score comparability between test takers testing remotely and those testing in a test center is critical. This article summarizes statistical procedures that could be…
Descriptors: Scores, Scoring, Comparative Analysis, Testing
Castellano, Katherine E.; McCaffrey, Daniel F. – Journal of Educational Measurement, 2020
Testing programs are often interested in using a student growth measure. This article presents analytic derivations of the accuracy of common student growth measures on both the raw scale of the test and the percentile rank scale in terms of the proportional reduction in mean squared error and the squared correlation between the estimator and…
Descriptors: Student Evaluation, Accuracy, Testing, Student Development
Sinharay, Sandip – Journal of Educational Measurement, 2017
Person-fit assessment (PFA) is concerned with uncovering atypical test performance as reflected in the pattern of scores on individual items on a test. Existing person-fit statistics (PFSs) include both parametric and nonparametric statistics. Comparison of PFSs has been a popular research topic in PFA, but almost all comparisons have employed…
Descriptors: Goodness of Fit, Testing, Test Items, Scores
Zhang, Jinming; Li, Jie – Journal of Educational Measurement, 2016
An IRT-based sequential procedure is developed to monitor items for enhancing test security. The procedure uses a series of statistical hypothesis tests to examine whether the statistical characteristics of each item under inspection have changed significantly during CAT administration. This procedure is compared with a previously developed…
Descriptors: Computer Assisted Testing, Test Items, Difficulty Level, Item Response Theory
Zwick, Rebecca; Ye, Lei; Isham, Steven – Journal of Educational Measurement, 2018
In typical differential item functioning (DIF) assessments, an item's DIF status is not influenced by its status in previous test administrations. An item that has shown DIF at multiple administrations may be treated the same way as an item that has shown DIF in only the most recent administration. Therefore, much useful information about the…
Descriptors: Test Bias, Testing, Test Items, Bayesian Statistics
Palermo, Corey; Bunch, Michael B.; Ridge, Kirk – Journal of Educational Measurement, 2019
Although much attention has been given to rater effects in rater-mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large-scale, multi-state summative assessment program.…
Descriptors: Scoring, Interrater Reliability, Measurement, Summative Evaluation
Wang, Wenyi; Song, Lihong; Chen, Ping; Ding, Shuliang – Journal of Educational Measurement, 2019
Most of the existing classification accuracy indices of attribute patterns lose effectiveness when the response data is absent in diagnostic testing. To handle this issue, this article proposes new indices to predict the correct classification rate of a diagnostic test before administering the test under the deterministic noise input…
Descriptors: Cognitive Tests, Classification, Accuracy, Diagnostic Tests
Cheng, Ying; Liu, Cheng – Journal of Educational Measurement, 2016
For a certification, licensure, or placement exam, allowing examinees to take multiple attempts at the test could effectively change the pass rate. Change in the pass rate can occur without any change in the underlying latent trait, and can be an artifact of multiple attempts and imperfect reliability of the test. By deriving formulae to compute…
Descriptors: Testing, Computation, Change, Simulation
Heritage, Margaret; Kingston, Neal M. – Journal of Educational Measurement, 2019
Classroom assessment and large-scale assessment have, for the most part, existed in mutual isolation. Some experts have felt this is for the best and others have been concerned that the schism limits the potential contribution of both forms of assessment. Margaret Heritage has long been a champion of best practices in classroom assessment. Neal…
Descriptors: Measurement, Psychometrics, Context Effect, Classroom Environment
Huang, Hung-Yu – Journal of Educational Measurement, 2017
Cognitive diagnosis models (CDMs) have been developed to evaluate the mastery status of individuals with respect to a set of defined attributes or skills that are measured through testing. When individuals are repeatedly administered a cognitive diagnosis test, a new class of multilevel CDMs is required to assess the changes in their attributes…
Descriptors: Testing, Cognitive Measurement, Test Items, Classification