Kate E. Walton; Cristina Anguiano-Carrasco – ACT, Inc., 2024
Large language models (LLMs), such as ChatGPT, are becoming increasingly prominent. They are increasingly used to assist with simple tasks such as summarizing documents, translating languages, rephrasing sentences, and answering questions. Reports like McKinsey's (Chui & Yee, 2023) estimate that by implementing LLMs,…
Descriptors: Artificial Intelligence, Man Machine Systems, Natural Language Processing, Test Construction
OECD Publishing, 2023
Advances in artificial intelligence (AI) are ushering in a large and rapid technological transformation. Understanding how AI capabilities relate to human skills and how they develop over time is crucial for understanding this process. In 2016, the OECD assessed AI capabilities with its Survey of Adult Skills (PIAAC). The present report…
Descriptors: Artificial Intelligence, Adults, Reading Tests, Mathematics Tests
Ercikan, Kadriye; Guo, Hongwen; He, Qiwei – Educational Assessment, 2020
Comparing groups is one of the key uses of large-scale assessment results, which are used to gain insights that inform policy and practice and to examine the comparability of scores and score meaning. Such comparisons typically focus on examinees' final answers and responses to test questions, ignoring response process differences groups may engage…
Descriptors: Data Use, Responses, Comparative Analysis, Test Bias
Falabella, Alejandra – British Journal of Sociology of Education, 2016
Using qualitative data from two Chilean public schools, I interrogate the expectation that standardised testing motivates staff to critically self-assess and to be accountable for failing evaluations. The research findings offer new insights into the ways in which school members, especially head managers, strategically debate,…
Descriptors: Tests, Scores, Accountability, Criticism
Jones, Jason P.; McConnell, David A. – Journal of Geoscience Education, 2023
In the past couple of decades, the geoscience education community has made great strides toward investigating how to provide effective student learning experiences in the college setting. While experiences such as student-centered teaching strategies and course design elements are useful for the instructor, they may not make important elements of…
Descriptors: Geology, Introductory Courses, Science Instruction, Teaching Methods
Ramineni, Chaitanya; Trapani, Catherine S.; Williamson, David M. – ETS Research Report Series, 2015
Automated scoring models were trained and evaluated for the essay task in the "Praxis I"® writing test. Prompt-specific and generic "e-rater"® scoring models were built, and evaluation statistics, such as quadratic weighted kappa, Pearson correlation, and standardized differences in mean scores, were examined to evaluate the…
Descriptors: Writing Tests, Licensing Examinations (Professions), Teacher Competency Testing, Scoring
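Among the evaluation statistics named in the abstract, quadratic weighted kappa is the standard agreement measure between automated and human scores. As a minimal illustration (a generic sketch, not ETS's e-rater evaluation code):

```python
import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, num_levels):
    """Quadratic weighted kappa between two integer score vectors in 0..num_levels-1.
    1.0 = perfect agreement; 0.0 = chance-level agreement."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    # Joint (observed) distribution of score pairs
    observed = np.zeros((num_levels, num_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Expected distribution under independence of the two raters
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: zero on the diagonal, growing with distance
    idx = np.arange(num_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_levels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Identical score vectors yield 1.0; systematic disagreement drives the value toward -1.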
Baird, Jo-Anne; Meadows, Michelle; Leckie, George; Caro, Daniel – Assessment in Education: Principles, Policy & Practice, 2017
This study evaluated rater accuracy with rater-monitoring data from high stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems…
Descriptors: Foreign Countries, High Stakes Tests, Accuracy, Hierarchical Linear Modeling
Nguyen, David J. – Tertiary Education and Management, 2016
International student assessments have become the "lifeblood" of the accountability movement in educational policy contexts. Drawing upon Stuart Hall's concept of representation, I critically examined who comprises epistemic communities responsible for developing the Organization for Economic Co-operation and Development's Assessment of…
Descriptors: Student Evaluation, Foreign Students, Epistemology, Expertise
Frankel, Lois; Brownstein, Beth; Soiffer, Neil; Hansen, Eric – ETS Research Report Series, 2016
The work described in this report is the first phase of a project to provide easy-to-use tools for authoring and rendering secondary-school algebra-level math expressions in synthesized speech that is useful for students with blindness or low vision. This report describes the initial development, software implementation, and evaluation of the…
Descriptors: Algebra, Automation, Secondary School Mathematics, Artificial Speech
Kolen, Michael J.; Lee, Won-Chan – Educational Measurement: Issues and Practice, 2011
This paper illustrates that the psychometric properties of scores and scales that are used with mixed-format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is…
Descriptors: Test Use, Test Format, Error of Measurement, Raw Scores
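Conditional standard errors of measurement vary with score level rather than being a single test-wide constant. One classical illustration for number-correct scores is Lord's binomial-error formula (a deliberate simplification; the paper itself addresses the more complex mixed-format case):

```python
import math

def binomial_csem(raw_score, num_items):
    """Lord's binomial-error conditional SEM for a number-correct raw score x
    on an n-item test: CSEM(x) = sqrt(x * (n - x) / (n - 1)).
    Largest near mid-scale, zero at the score extremes."""
    return math.sqrt(raw_score * (num_items - raw_score) / (num_items - 1))
```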
Guskey, Thomas R. – Journal of Staff Development, 2016
Effective professional learning evaluation requires consideration of five critical stages or levels of information. These five levels, which are presented in this article, represent an adaptation of an evaluation model developed by Kirkpatrick (1959, 1998) for judging the value of supervisory training programs in business and industry.…
Descriptors: Hierarchical Linear Modeling, Outcomes of Education, Supervisory Training, Faculty Development
What Works Clearinghouse, 2012
The study reviewed in this report examined the effectiveness of the "Milwaukee Parental Choice Program" ("MPCP"), which provides vouchers for low-income students to attend private schools. The study analyzed data on about 600 students who were given "MPCP" vouchers in the 2006-07 school year. The authors created a…
Descriptors: Private Schools, Evaluation, Reading Tests, Standardized Tests
Sunnassee, Devdass – ProQuest LLC, 2011
Small sample equating remains a largely unexplored area of research. This study attempts to fill in some of the research gaps via a large-scale, IRT-based simulation study that evaluates the performance of seven small-sample equating methods under various test characteristic and sampling conditions. The equating methods considered are typically…
Descriptors: Test Length, Test Format, Sample Size, Simulation
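For context on what such equating methods do: the simplest approaches map scores from one test form onto another by matching distributional moments. A hypothetical sketch of single-group linear equating (the study's IRT-based methods are considerably more elaborate):

```python
import statistics

def linear_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching means and SDs:
    e(x) = mu_y + (sd_y / sd_x) * (x - mu_x)."""
    mu_x = statistics.fmean(form_x_scores)
    mu_y = statistics.fmean(form_y_scores)
    sd_x = statistics.pstdev(form_x_scores)
    sd_y = statistics.pstdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x - mu_x)
```

With small samples the moment estimates above become unstable, which is exactly the problem the dissertation's simulation study probes.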
Shin, Sun-Young; Lidster, Ryan – Language Testing, 2017
In language programs, it is crucial to place incoming students into appropriate levels to ensure that course curriculum and materials are well targeted to their learning needs. Deciding how and where to set cutscores on placement tests is thus of central importance to programs, but previous studies in educational measurement disagree as to which…
Descriptors: Language Tests, English (Second Language), Standard Setting (Scoring), Student Placement
Woods, Carol M.; Cai, Li; Wang, Mian – Educational and Psychological Measurement, 2013
Differential item functioning (DIF) occurs when the probability of responding in a particular category to an item differs for members of different groups who are matched on the construct being measured. The identification of DIF is important for valid measurement. This research evaluates an improved version of Lord's X² Wald test for…
Descriptors: Test Bias, Item Response Theory, Computation, Comparative Analysis
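The Wald statistic at the core of this approach compares an item's estimated parameters across two groups. A minimal sketch with hypothetical inputs (the basic form only; the authors' improvement concerns how the parameter covariances are estimated):

```python
import numpy as np

def lords_wald_statistic(params_ref, params_focal, cov_ref, cov_focal):
    """Wald chi-square for one item's parameter vector across two groups:
    diff' * inv(cov_ref + cov_focal) * diff, with df = number of parameters.
    A large value flags the item as a DIF candidate."""
    diff = np.asarray(params_ref, float) - np.asarray(params_focal, float)
    pooled = np.asarray(cov_ref, float) + np.asarray(cov_focal, float)
    # Solve the linear system instead of forming an explicit inverse
    return float(diff @ np.linalg.solve(pooled, diff))
```

The statistic is referred to a chi-square distribution with degrees of freedom equal to the number of item parameters tested.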