Showing all 15 results
Peer reviewed
Thompson, W. Jake; Clark, Amy K.; Nash, Brooke – Applied Measurement in Education, 2019
As the use of diagnostic assessment systems transitions from research applications to large-scale assessments for accountability purposes, reliability methods that provide evidence at each level of reporting are needed. The purpose of this paper is to summarize one simulation-based method for estimating and reporting reliability for an…
Descriptors: Test Reliability, Diagnostic Tests, Classification, Computation
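The abstract is cut off, but the core idea of simulation-based reliability evidence for classification-level reporting can be sketched: simulate examinees' true mastery states, generate two independent classifications under an assumed accuracy rate, and report how often the decisions agree. A minimal illustration (the base rate and accuracy values are assumptions for the example, not figures from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_classification_consistency(n_examinees=10_000,
                                        base_rate=0.6,
                                        accuracy=0.85):
    """Estimate classification consistency by simulation.

    Illustrative only: true mastery states are Bernoulli(base_rate),
    and each of two parallel administrations recovers the true state
    with probability `accuracy`, independently.
    """
    true_state = rng.random(n_examinees) < base_rate
    obs1 = np.where(rng.random(n_examinees) < accuracy, true_state, ~true_state)
    obs2 = np.where(rng.random(n_examinees) < accuracy, true_state, ~true_state)
    return np.mean(obs1 == obs2)  # proportion classified the same twice

print(f"simulated consistency: {simulate_classification_consistency():.3f}")
```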
Peer reviewed
Wan, Lei; Henly, George A. – Applied Measurement in Education, 2012
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats--the figural response (FR) and constructed response (CR) formats used in a K-12 computerized…
Descriptors: Test Items, Test Format, Computer Assisted Testing, Measurement
Peer reviewed
Nichols, Paul; Kuehl, Barbara Jean – Applied Measurement in Education, 1999
An approach is presented that can predict internal consistency of cognitively complex assessments on two dimensions, those of adding tasks with similar or different solution strategies and adding test takers with different solution strategies. Data from the 1992 National Assessment of Educational Progress mathematics assessment are used to…
Descriptors: Cognitive Tests, Mathematics Tests, Prediction, Test Reliability
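The authors' prediction method rests on a cognitive analysis of solution strategies, which the truncated abstract only hints at. As a baseline for the first dimension, adding tasks, the classical Spearman-Brown prophecy formula predicts internal consistency after lengthening a test with parallel tasks; a sketch of that generic formula (not the article's strategy-sensitive approach):

```python
def spearman_brown(rho: float, k: float) -> float:
    """Predicted reliability after changing test length by factor k,
    assuming the added tasks are parallel to the existing ones."""
    return k * rho / (1 + (k - 1) * rho)

# e.g., doubling a test whose current alpha is .70
print(f"{spearman_brown(0.70, 2):.3f}")  # ~0.824
```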
Peer reviewed
Gierl, Mark J.; Gotzmann, Andrea; Boughton, Keith A. – Applied Measurement in Education, 2004
Differential item functioning (DIF) analyses are used to identify items that operate differently between two groups, after controlling for ability. The Simultaneous Item Bias Test (SIBTEST) is a popular DIF detection method that matches examinees on a true score estimate of ability. However, in some testing situations, such as test translation and…
Descriptors: True Scores, Simulation, Test Bias, Student Evaluation
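As a rough illustration of the matching idea behind SIBTEST: condition on a matching-subtest score, compare the two groups' mean scores on the studied item at each score level, and pool the differences. The sketch below omits SIBTEST's regression correction, which adjusts the matching scores toward true-score estimates, so it is a simplified stand-in for the published procedure:

```python
import numpy as np

def simple_beta_uni(item, match, group):
    """Crude SIBTEST-style DIF index.

    item  : scores on the studied item
    match : raw scores on the matching subtest
    group : 0 = reference, 1 = focal

    Pools the reference-minus-focal difference in studied-item means
    across matching-score levels.  Omits SIBTEST's regression
    correction, so this is an illustration, not the published method.
    """
    item, match, group = (np.asarray(a) for a in (item, match, group))
    num = den = 0.0
    for s in np.unique(match):
        ref = item[(match == s) & (group == 0)]
        foc = item[(match == s) & (group == 1)]
        if len(ref) and len(foc):
            w = len(ref) + len(foc)            # weight by examinees at level s
            num += w * (ref.mean() - foc.mean())
            den += w
    return num / den if den else float("nan")
```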
Peer reviewed
Qualls, Audrey L. – Applied Measurement in Education, 1995
Classically parallel, tau-equivalently parallel, and congenerically parallel models representing various degrees of part-test parallelism and their appropriateness for tests composed of multiple item formats are discussed. An appropriate reliability estimate for a test with multiple item formats is presented and illustrated. (SLD)
Descriptors: Achievement Tests, Estimation (Mathematics), Measurement Techniques, Test Format
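The abstract does not reproduce the estimator, but a standard reliability estimate for a composite of distinct item-format parts is stratified coefficient alpha, which pools each part's error variance; whether this is the exact statistic Qualls illustrates is an assumption here:

```python
import numpy as np

def stratified_alpha(part_vars, part_alphas, total_var):
    """Stratified coefficient alpha for a composite of item-format parts.

    part_vars   : observed score variance of each part
    part_alphas : reliability (e.g., alpha) of each part
    total_var   : variance of the composite total score
    """
    part_vars = np.asarray(part_vars, dtype=float)
    part_alphas = np.asarray(part_alphas, dtype=float)
    error = np.sum(part_vars * (1.0 - part_alphas))
    return 1.0 - error / total_var

# e.g., an MC part and a CR part; total_var includes their covariance
print(f"{stratified_alpha([40.0, 25.0], [0.85, 0.75], 90.0):.3f}")
```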
Peer reviewed
Feldt, Leonard S. – Applied Measurement in Education, 1993
The recommendation that the reliability of multiple-choice tests will be enhanced if the distribution of item difficulties is concentrated at approximately 0.50 is reinforced and extended in this article by viewing the 0/1 item scoring as a dichotomization of an underlying normally distributed ability score. (SLD)
Descriptors: Ability, Difficulty Level, Guessing (Tests), Mathematical Models
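Feldt's point can be checked numerically: generate a normally distributed ability, dichotomize item-level continua at thresholds that produce various difficulties, and observe that coefficient alpha peaks when difficulty is near .50. The one-factor model and loading below are assumptions for the demonstration, not the article's derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha(scores):
    """Cronbach's alpha for an examinees x items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

n, k = 5_000, 30
theta = rng.normal(size=(n, 1))                     # latent ability
noise = rng.normal(size=(n, k))
latent = 0.7 * theta + noise * (1 - 0.7**2) ** 0.5  # item-level continuum

for p in (0.50, 0.70, 0.90):
    cut = np.quantile(latent, 1 - p)                # threshold giving difficulty p
    print(p, round(alpha((latent > cut).astype(float)), 3))
```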
Peer reviewed
Wise, Steven L. – Applied Measurement in Education, 2006
In low-stakes testing, the motivation levels of examinees are often a matter of concern to test givers because a lack of examinee effort represents a direct threat to the validity of the test data. This study investigated the use of response time to assess the amount of examinee effort received by individual test items. In 2 studies, it was found…
Descriptors: Computer Assisted Testing, Motivation, Test Validity, Item Response Theory
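Wise's response-time approach classifies responses faster than an item-specific threshold as rapid guesses; a per-examinee effort index is then the proportion of items answered with solution behavior. A minimal sketch, where the threshold rule (a fixed fraction of each item's median time) is one simple choice among several explored in this line of work:

```python
import numpy as np

def response_time_effort(rt, threshold_frac=0.10):
    """Per-examinee effort index: share of items whose response time
    exceeds a rapid-guessing threshold.

    rt : examinees x items matrix of response times (seconds)
    threshold_frac : threshold as a fraction of each item's median
        response time -- an assumed, simple rule for illustration.
    """
    rt = np.asarray(rt, dtype=float)
    thresholds = threshold_frac * np.median(rt, axis=0)  # one per item
    solution_behavior = rt > thresholds                  # True = effortful
    return solution_behavior.mean(axis=1)                # effort per examinee
```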
Peer reviewed
Wainer, Howard; Thissen, David – Applied Measurement in Education, 1993
Because assessment instruments of the future may well be composed of a combination of question types, a way to combine those scores effectively is discussed. Two new graphic tools are presented that show it may not be practical to equalize the reliability of different components. (SLD)
Descriptors: Constructed Response, Educational Assessment, Graphs, Item Response Theory
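The trade-off the authors examine can be made concrete with the reliability of a weighted composite, which depends on the weights and on the components' variances, reliabilities, and covariance. The standard composite-reliability identity below is offered as context, not as the article's graphic tools:

```python
import numpy as np

def composite_reliability(weights, cov, rels):
    """Reliability of a weighted composite of test components.

    weights : component weights w
    cov     : observed covariance matrix of the components
    rels    : reliability of each component
    Error variances add: rel_C = 1 - sum(w_i^2 var_i (1 - rel_i)) / var_C.
    """
    w = np.asarray(weights, float)
    cov = np.asarray(cov, float)
    rels = np.asarray(rels, float)
    var_c = w @ cov @ w                          # composite variance
    err = np.sum(w**2 * np.diag(cov) * (1 - rels))
    return 1 - err / var_c

# e.g., multiple-choice and constructed-response sections
cov = [[36.0, 12.0], [12.0, 25.0]]
print(f"{composite_reliability([0.6, 0.4], cov, [0.90, 0.70]):.3f}")
```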
Peer reviewed
Klein, Stephen P.; And Others – Applied Measurement in Education, 1995
Portfolios are the centerpiece of Vermont's statewide assessment program in mathematics. Portfolio scores in the first two years were not reliable enough to permit the reporting of student-level results, but increasing the number of readers or the number of portfolio pieces is not operationally feasible. (SLD)
Descriptors: Educational Assessment, Elementary Secondary Education, Mathematics Tests, Performance Based Assessment
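The feasibility question (more readers or more portfolio pieces) is usually explored with decision-study projections from estimated variance components. A sketch under an assumed person x task x reader decomposition; the component values are invented to mirror the pattern the authors report, where large person-by-task variance blunts the payoff from extra readers:

```python
def projected_reliability(var_person, var_px_task, var_residual,
                          n_tasks, n_readers):
    """D-study style projection: score reliability when averaging over
    n_tasks portfolio pieces and n_readers per piece.

    Variance components are assumed known (e.g., from a G study); the
    decomposition is a simplified person x task x reader model.
    """
    error = var_px_task / n_tasks + var_residual / (n_tasks * n_readers)
    return var_person / (var_person + error)

for readers in (1, 2, 4):
    print(readers, round(projected_reliability(0.30, 0.45, 0.25, 5, readers), 3))
```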
Peer reviewed
Linn, Robert L.; Kiplinger, Vonda L. – Applied Measurement in Education, 1995
The adequacy of linking statewide standardized test results to the National Assessment of Educational Progress by using equipercentile equating procedures was investigated using statewide mathematics data from four states. Results suggest that the linkings are not sufficiently trustworthy to make comparisons based on the tails of the distribution.…
Descriptors: Comparative Analysis, Educational Assessment, Equated Scores, Mathematics Tests
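Equipercentile linking maps a score on one test to the score with the same percentile rank on the other. A bare-bones sketch without the presmoothing used in practice; note that percentile estimates in the tails rest on few examinees, which is exactly where the authors find the linkings least trustworthy:

```python
import numpy as np

def equipercentile_link(state_scores, naep_scores):
    """Return a function mapping a state-test score to the NAEP score
    with the same percentile rank (no smoothing; operational work
    typically presmooths both distributions first).
    """
    state_scores = np.sort(np.asarray(state_scores, float))
    naep_scores = np.asarray(naep_scores, float)

    def link(x):
        # percentile rank of x in the state-test distribution
        pr = np.searchsorted(state_scores, x, side="right") / len(state_scores)
        # NAEP score at the same percentile
        return np.quantile(naep_scores, pr)

    return link
```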
Peer reviewed
Hambleton, Ronald K.; Slater, Sharon C. – Applied Measurement in Education, 1997
A brief history of developments in the assessment of the reliability of credentialing examinations is presented, and some new results are outlined that highlight the interactions among scoring, standard setting, and the reliability and validity of pass-fail decisions. Decision consistency is an important concept in evaluating credentialing…
Descriptors: Certification, Credentials, Decision Making, Interaction
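Decision consistency asks how often the same pass-fail decision would be made on retesting. With two parallel forms it can be computed directly, as sketched below; single-administration estimates (e.g., Subkoviak's method) need stronger assumptions:

```python
import numpy as np

def decision_consistency(form1, form2, cut):
    """Proportion of examinees receiving the same pass/fail decision
    on two parallel forms, plus the chance-corrected kappa."""
    p1 = np.asarray(form1) >= cut
    p2 = np.asarray(form2) >= cut
    agree = np.mean(p1 == p2)
    # chance agreement from the marginal pass rates
    chance = p1.mean() * p2.mean() + (1 - p1.mean()) * (1 - p2.mean())
    kappa = (agree - chance) / (1 - chance)
    return agree, kappa
```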
Peer reviewed
Stone, Clement A.; Lane, Suzanne – Applied Measurement in Education, 1991
A model-testing approach for evaluating the stability of item response theory item parameter estimates (IPEs) in a pretest-posttest design is illustrated. Nineteen items from the Head Start Measures Battery were used. A moderately high degree of stability in the IPEs for 5,510 children assessed on 2 occasions was found. (TJH)
Descriptors: Comparative Testing, Compensatory Education, Computer Assisted Testing, Early Childhood Education
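A simple descriptive screen for the stability question: place the second occasion's difficulty estimates on the first occasion's scale with a mean-sigma transformation, then examine the correlation and root-mean-square difference. This is far weaker than the model-testing approach the authors illustrate and is offered only as a quick check:

```python
import numpy as np

def parameter_stability(b_time1, b_time2):
    """Descriptive stability check for IRT item difficulty estimates
    from two occasions: mean-sigma linking, then correlation and RMSD."""
    b1, b2 = np.asarray(b_time1, float), np.asarray(b_time2, float)
    a = b1.std(ddof=1) / b2.std(ddof=1)          # mean-sigma slope
    b2_linked = a * (b2 - b2.mean()) + b1.mean() # put b2 on the b1 scale
    r = np.corrcoef(b1, b2_linked)[0, 1]
    rmsd = np.sqrt(np.mean((b1 - b2_linked) ** 2))
    return r, rmsd
```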
Peer reviewed
Dunbar, Stephen B.; And Others – Applied Measurement in Education, 1991
Issues pertaining to the quality of performance assessments, including reliability and validity, are discussed. The relatively limited generalizability of performance across tasks is indicative of the care needed to evaluate performance assessments. Quality control is an empirical matter when measurement is intended to inform public policy. (SLD)
Descriptors: Educational Assessment, Generalization, Interrater Reliability, Measurement Techniques
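The limited generalizability across tasks that the authors emphasize is quantified with a generalizability coefficient. For a crossed persons-by-tasks design with one score per cell, the variance components follow from the usual random-effects ANOVA identities; a sketch (a full G study would also separate a rater facet):

```python
import numpy as np

def g_coefficient(scores, n_tasks_d=None):
    """Generalizability coefficient for a crossed persons x tasks design
    with one score per cell.

    Uses the identity E[MS_p] = var_pt,e + n_t * var_p, so
    var_p = (MS_p - MS_pt) / n_t.
    """
    x = np.asarray(scores, float)
    n_p, n_t = x.shape
    grand = x.mean()
    ms_p = n_t * np.sum((x.mean(axis=1) - grand) ** 2) / (n_p - 1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    ms_pt = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1))
    var_p = max((ms_p - ms_pt) / n_t, 0.0)  # person (universe score) variance
    k = n_tasks_d or n_t                    # number of tasks in the D study
    return var_p / (var_p + ms_pt / k)      # relative error: var_pt,e / k
```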
Peer reviewed
Millman, Jason – Applied Measurement in Education, 1991
Alternatives to multiple-choice tests for teacher licensing examinations are described, and their advantages are cited. Concerns are expressed in the areas of cost and practicality, reliability, corruptibility, and validity. A suggestion for reducing costs using multiple-choice responses calibrated to constructed-response tasks is proposed. (SLD)
Descriptors: Beginning Teachers, Constructed Response, Cost Effectiveness, Educational Assessment
Peer reviewed
Mehrens, William A.; Popham, W. James – Applied Measurement in Education, 1992
This paper discusses how to determine whether a test was developed in a legally defensible manner, reviewing general issues, specific cases bearing on different types of test use, some evaluative dimensions, and evidence of test quality. Tests constructed and used according to existing standards will generally stand legal scrutiny. (SLD)
Descriptors: College Entrance Examinations, Compliance (Legal), Constitutional Law, Court Litigation