Publication Date
In 2025 | 3 |
Since 2024 | 3 |
Since 2021 (last 5 years) | 15 |
Since 2016 (last 10 years) | 25 |
Since 2006 (last 20 years) | 35 |
Descriptor
Difficulty Level | 48 |
Evaluators | 48 |
Test Items | 15 |
Scoring | 13 |
Second Language Learning | 13 |
Language Tests | 12 |
Interrater Reliability | 11 |
Item Analysis | 11 |
Foreign Countries | 10 |
Item Response Theory | 10 |
English (Second Language) | 9 |
More ▼ |
Source
Author
Publication Type
Journal Articles | 35 |
Reports - Research | 31 |
Reports - Evaluative | 11 |
Speeches/Meeting Papers | 5 |
Dissertations/Theses -… | 4 |
Tests/Questionnaires | 4 |
Information Analyses | 1 |
Reports - Descriptive | 1 |
Education Level
Audience
Location
California | 2 |
Florida | 2 |
Iran | 2 |
United Kingdom (England) | 2 |
Europe | 1 |
Idaho | 1 |
Illinois | 1 |
Indonesia | 1 |
Japan | 1 |
Maryland | 1 |
Massachusetts | 1 |
More ▼ |
Laws, Policies, & Programs
Assessments and Surveys
Test of English as a Foreign… | 3 |
Flesch Kincaid Grade Level… | 1 |
Fry Readability Formula | 1 |
National Adult Literacy… | 1 |
National Teacher Examinations | 1 |
Test of English for… | 1 |
edTPA (Teacher Performance… | 1 |
What Works Clearinghouse Rating
Wang, Jue; Engelhard, George; Combs, Trenton – Journal of Experimental Education, 2023
Unfolding models are frequently used to develop scales for measuring attitudes. Recently, unfolding models have been applied to examine rater severity and accuracy within the context of rater-mediated assessments. One of the problems in applying unfolding models to rater-mediated assessments is that the substantive interpretations of the latent…
Descriptors: Writing Evaluation, Scoring, Accuracy, Computational Linguistics
Reza Shahi; Hamdollah Ravand; Golam Reza Rohani – International Journal of Language Testing, 2025
The current paper intends to exploit the Many Facet Rasch Model to investigate and compare the impact of situations (items) and raters on test takers' performance on the Written Discourse Completion Test (WDCT) and Discourse Self-Assessment Tests (DSAT). In this study, the participants were 110 English as a Foreign Language (EFL) students at…
Descriptors: Comparative Analysis, English (Second Language), Second Language Learning, Second Language Instruction
Jin, Kuan-Yu; Eckes, Thomas – Measurement: Interdisciplinary Research and Perspectives, 2022
Recent research on rater effects in performance assessments has increasingly focused on rater centrality, the tendency to assign scores clustering around the rating scale's middle categories. In the present paper, we adopted Jin and Wang's (2018) extended facets modeling approach and constructed a centrality continuum, ranging from raters…
Descriptors: Performance Based Assessment, Evaluators, Scoring, Sample Size
Makiko Kato – Journal of Education and Learning, 2025
This study aims to examine whether differences exist in the factors influencing the difficulty of scoring English summaries and determining scores based on the raters' attributes, and to collect candid opinions, considerations, and tentative suggestions for future improvements to the analytic rubric of summary writing for English learners. In this…
Descriptors: Writing Evaluation, Scoring, Writing Skills, English (Second Language)
Roger Young; Emily Courtney; Alexander Kah; Mariah Wilkerson; Yi-Hsin Chen – Teaching of Psychology, 2025
Background: Multiple-choice item (MCI) assessments are burdensome for instructors to develop. Artificial intelligence (AI, e.g., ChatGPT) can streamline the process without sacrificing quality. The quality of AI-generated MCIs and human experts is comparable. However, whether the quality of AI-generated MCIs is equally good across various domain-…
Descriptors: Item Response Theory, Multiple Choice Tests, Psychology, Textbooks
Lestari, Santi B.; Brunfaut, Tineke – Language Testing, 2023
Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating…
Descriptors: Reading Writing Relationship, Writing Tests, Rating Scales, Writing Processes
Chan, Kinnie Kin Yee; Bond, Trevor; Yan, Zi – Language Testing, 2023
We investigated the relationship between the scores assigned by an Automated Essay Scoring (AES) system, the Intelligent Essay Assessor (IEA), and grades allocated by trained, professional human raters to English essay writing by instigating two procedures novel to written-language assessment: the logistic transformation of AES raw scores into…
Descriptors: Computer Assisted Testing, Essays, Scoring, Scores
Isbell, Daniel R.; Son, Young-A – Studies in Second Language Acquisition, 2022
Elicited Imitation Tests (EITs) are commonly used in second language acquisition (SLA)/bilingualism research contexts to assess the general oral proficiency of study participants. While previous studies have provided valuable EIT construct-related validity evidence, some key gaps remain. This study uses an integrative data analysis to further…
Descriptors: Bilingualism, Imitation, Language Tests, Second Language Learning
Clauser, Brian E.; Kane, Michael; Clauser, Jerome C. – Journal of Educational Measurement, 2020
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item…
Descriptors: Cutting Scores, Generalization, Decision Making, Standard Setting
Alexander, David E.; Davis, Ian R. – Quality Assurance in Education: An International Perspective, 2019
Purpose: The purpose of this paper is to review the issues and challenges associated with examining PhD theses in the modern, rapidly changing academic world. The PhD degree has been described as the "pinnacle of academic qualifications", but it is under threat in terms of the quality of supervision and the outcome of examinations. By…
Descriptors: Doctoral Dissertations, Doctoral Degrees, Educational Quality, Difficulty Level
Cahyono, Sulistio Mukti; Kartawagiran, Badrun; Mahmudah, Fitri Nur – European Journal of Educational Research, 2021
Teachers who can adapt and be ready for all changes will also be able to provide a balance to increase the competence of vocational high school students. This is also not denied when teachers become assessors in student competency tests. The objectives of this study were to produce an instrument for the readiness of teachers as assessors; to…
Descriptors: Readiness, Vocational Education Teachers, Vocational High Schools, High School Students
Ilhan, Mustafa – International Journal of Assessment Tools in Education, 2019
This study investigated the effectiveness of statistical adjustments applied to rater bias in many-facet Rasch analysis. Some changes were first made in the dataset that did not include "rater × examinee" bias to cause to have "rater × examinee" bias. Later, bias adjustment was applied to rater bias included in the data file,…
Descriptors: Statistical Analysis, Item Response Theory, Evaluators, Bias
Moylan, Laura A.; Johnson, Evelyn S.; Zheng, Yuzhu – Reading & Writing Quarterly, 2022
This study describes the development of a special education teacher observation protocol detailing the elements of effective decoding instruction. The psychometric properties of the protocol were investigated through many-facet Rasch measurement (MFRM). Video observations of classroom decoding instruction from 20 special education teachers across…
Descriptors: Decoding (Reading), Special Education Teachers, Psychometrics, Video Technology
Lyness, Scott A.; Peterson, Kent; Yates, Kenneth – Education Sciences, 2021
The Performance Assessment for California Teachers (PACT) is a high stakes summative assessment that was designed to measure pre-service teacher readiness. We examined the inter-rater reliability (IRR) of trained PACT evaluators who rated 19 candidates. As measured by Cohen's weighted kappa, the overall IRR estimate was 0.17 (poor strength of…
Descriptors: High Stakes Tests, Performance Based Assessment, Teacher Effectiveness, Academic Language
Hidri, Sahbi – Language Testing in Asia, 2021
The study investigated the alignment process of the International English Language Competency Assessment (IELCA) suite examinations' four levels, B1, B2, C1 and C2, onto the Common European Framework of Reference (CEFR) by explaining and discussing the five linking stages (Council of Europe (CoE 2009). Unlike previous studies, this study used the…
Descriptors: Literacy, Second Language Learning, Second Language Instruction, English (Second Language)