Yun, Jiyeo – ProQuest LLC, 2017
Since researchers began investigating automated scoring systems for writing assessments, they have examined the relationships between human and machine scoring and have proposed evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of, and relationships among, indices for inter-rater agreement used…
Descriptors: Interrater Reliability, Essays, Scoring, Evaluators
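The abstract does not list which agreement indices the study compared. As a hedged illustration only, the sketch below computes three indices commonly reported for human-machine essay scoring: exact agreement, adjacent agreement (within one score point), and unweighted Cohen's kappa. It assumes integer scores for the same essays from two raters; the function name `agreement_indices` is hypothetical.

```python
from collections import Counter


def agreement_indices(rater_a, rater_b):
    """Exact agreement, adjacent agreement, and unweighted Cohen's kappa.

    Assumes both lists hold integer scores for the same essays in the same
    order (e.g. a human rater and an automated scorer).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(rater_a)
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / n  # within one point
    # Chance agreement from each rater's marginal score distribution
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    if expected == 1.0:
        kappa = float("nan")  # both raters gave one identical score: undefined
    else:
        kappa = (exact - expected) / (1.0 - expected)
    return {"exact": exact, "adjacent": adjacent, "kappa": kappa}


print(agreement_indices([2, 3, 4, 4], [2, 4, 4, 3]))
# prints {'exact': 0.5, 'adjacent': 1.0, 'kappa': 0.2}
```

Adjacent agreement is usually much higher than exact agreement on short score scales, which is one reason studies of this kind report several indices side by side.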
Ebuoh, Casmir N. – World Journal of Education, 2018
The literature shows that methods of scoring essay tests have been criticized as unreliable, and that this unreliability is likely to be greater in internal examinations than in external examinations. The purpose of this study is to find out the effects of analytical and holistic scoring patterns on scorer reliability in…
Descriptors: Holistic Approach, Scoring, Essay Tests, Biology
Steedle, Jeffrey T.; Ferrara, Steve – Applied Measurement in Education, 2016
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation.…
Descriptors: Essays, Scoring, Comparative Analysis, Evaluators
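The abstract does not name the scaling model used to aggregate the judgments. Comparative-judgment scores are often derived with a Bradley-Terry (Rasch-type) model, so the sketch below assumes that model: each judgment is a (winner, loser) pair of essay IDs, and log-strength scores are fitted by gradient ascent. The small L2 penalty is an added assumption so that an essay that never wins still gets a finite score; the function and parameter names are hypothetical.

```python
import math


def comparative_judgment_scores(judgments, lr=0.1, l2=0.01, n_iter=2000):
    """Fit Bradley-Terry log-strengths from pairwise decisions.

    judgments: list of (winner, loser) essay IDs.
    Returns scores centred at 0; higher means judged better more often.
    """
    items = sorted({essay for pair in judgments for essay in pair})
    theta = {i: 0.0 for i in items}
    for _ in range(n_iter):
        grad = {i: -l2 * theta[i] for i in items}  # L2 penalty keeps scores finite
        for w, l in judgments:
            p_win = 1.0 / (1.0 + math.exp(theta[l] - theta[w]))  # P(w beats l)
            grad[w] += 1.0 - p_win
            grad[l] -= 1.0 - p_win
        for i in items:
            theta[i] += lr * grad[i]
    mean = sum(theta.values()) / len(theta)
    return {i: theta[i] - mean for i in items}


judgments = [("A", "B")] * 2 + [("B", "C")] * 2 + [("A", "C")] * 2 + [("B", "A"), ("C", "B")]
scores = comparative_judgment_scores(judgments)
print(sorted(scores, key=scores.get, reverse=True))  # essays from best to worst
```

Because every judgment only compares two essays, no judge needs to internalize an absolute rubric, which is the property the abstract credits with reducing training requirements.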
In'nami, Yo; Koizumi, Rie – Language Testing, 2016
We addressed Deville and Chalhoub-Deville's (2006), Schoonen's (2012), and Xi and Mollaun's (2006) call for research into the contextual features that are considered related to person-by-task interactions in the framework of generalizability theory in two ways. First, we quantitatively synthesized the generalizability studies to determine the…
Descriptors: Evaluators, Second Language Learning, Writing Skills, Oral Language
Buzick, Heather; Oliveri, Maria Elena; Attali, Yigal; Flor, Michael – Applied Measurement in Education, 2016
Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K-12 large-scale assessment. In this…
Descriptors: Essays, Learning Disabilities, Attention Deficit Hyperactivity Disorder, Scoring
Han, Turgay; Huang, Jinyan – PASAA: Journal of Language Teaching and Learning in Thailand, 2017
Using generalizability (G-) theory and rater interviews as both quantitative and qualitative approaches, this study examined the impact of scoring methods (i.e., holistic versus analytic scoring) on the scoring variability and reliability of an EFL institutional writing assessment at a Turkish university. Ten raters were invited to rate 36…
Descriptors: English (Second Language), Second Language Learning, Second Language Instruction, Scoring
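As a hedged illustration of the G-theory quantities involved, the sketch below assumes the simplest fully crossed persons × raters (p × r) design with one score per cell; the study's actual design may include further facets, such as scoring method. It estimates variance components from two-way ANOVA mean squares, truncates negative estimates to zero (one common convention), and returns the relative G coefficient and the absolute (phi) coefficient for a chosen number of raters. The function name is hypothetical.

```python
def g_study_p_by_r(scores, n_raters_decision=None):
    """Variance components and G/phi coefficients for a crossed p x r design.

    scores[p][r] is the score person p received from rater r (no missing cells).
    n_raters_decision: raters assumed in the D-study (defaults to those in the data).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    ms_p = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates truncated to zero
    var_pr_e = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)

    k = n_raters_decision or n_r
    g_rel = var_p / (var_p + var_pr_e / k)          # relative (rank-order) decisions
    phi = var_p / (var_p + (var_r + var_pr_e) / k)  # absolute decisions
    return {"var_p": var_p, "var_r": var_r, "var_pr_e": var_pr_e,
            "g": g_rel, "phi": phi}
```

Running it once on holistic scores and once on analytic scores, then comparing the two sets of components, mirrors the kind of comparison the abstract describes; raising `n_raters_decision` shows how much averaging over more raters would help.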
Kim, Jiyun; Lantolf, James P. – Language Teaching Research, 2018
This article reports on a pedagogical project aimed at helping second language (L2) learners of English develop the ability to detect and appropriately interpret spoken sarcasm. The study used a pre- and posttest procedure to assess the development of learners' ability to both detect sarcasm and impute appropriate speaker intentions and attitudes…
Descriptors: Concept Formation, Native Language, Korean, Language Usage
Ramineni, Chaitanya; Trapani, Catherine S.; Williamson, David M.; Davey, Tim; Bridgeman, Brent – ETS Research Report Series, 2012
Scoring models for the "e-rater"® system were built and evaluated for the "TOEFL"® exam's independent and integrated writing prompts. Prompt-specific and generic scoring models were built, and evaluation statistics, such as weighted kappas, Pearson correlations, standardized differences in mean scores, and correlations with…
Descriptors: Scoring, Prompting, Evaluators, Computer Software
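The abstract names the evaluation statistics but not their exact formulas. The sketch below computes common versions of three of them for one set of essays: quadratic weighted kappa on an integer score scale, the Pearson correlation, and a standardized mean difference. The SMD denominator (square root of the mean of the two sample variances) is an assumption, since conventions differ across reports, and automated scores would need to be rounded to the scale before computing kappa. This is not the e-rater implementation, and the function names are hypothetical.

```python
import math
import statistics


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Quadratic weighted kappa for two lists of integer scores on one scale."""
    k = max_score - min_score + 1  # number of score points (must be >= 2)
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # quadratic disagreement weight
            num += w * observed[i][j]              # observed disagreement
            den += w * hist_a[i] * hist_b[j] / n   # chance-expected disagreement
    return 1.0 - num / den


def pearson_r(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def standardized_mean_difference(human, machine):
    """(machine mean - human mean) / sqrt((var_human + var_machine) / 2)."""
    pooled_sd = math.sqrt((statistics.variance(human) + statistics.variance(machine)) / 2)
    return (statistics.mean(machine) - statistics.mean(human)) / pooled_sd


human, machine = [1, 2, 3, 4, 4], [2, 2, 3, 4, 5]
print(quadratic_weighted_kappa(human, machine, 1, 5), pearson_r(human, machine),
      standardized_mean_difference(human, machine))
```

A perfect Pearson correlation can coexist with a large standardized difference (a scorer that is uniformly one point too high), which is why evaluations report both.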
Winke, Paula; Gass, Susan; Myford, Carol – ETS Research Report Series, 2011
This study investigated whether raters' second language (L2) background and the first language (L1) of test takers taking the TOEFL iBT® Speaking test were related through scoring. After an initial 4-hour training period, a group of 107 raters (mostly learners of Chinese, Korean, and Spanish) listened to a selection of 432 speech samples that…
Descriptors: Second Language Learning, Evaluators, Speech Tests, English (Second Language)
Wang, Jinhao; Brown, Michelle Stallone – Contemporary Issues in Technology and Teacher Education (CITE Journal), 2008
The purpose of the current study was to analyze the relationship between automated essay scoring (AES) and human scoring in order to determine the validity and usefulness of AES for large-scale placement tests. Specifically, a correlational research design was used to examine the correlations between AES performance and human raters' performance.…
Descriptors: Scoring, Essays, Computer Assisted Testing, Sentence Structure
Lee, Yong-Won; Gentile, Claudia; Kantor, Robert – ETS Research Report Series, 2008
The main purpose of the study was to investigate the distinctness and reliability of analytic (or multitrait) rating dimensions and their relationships to holistic scores and "e-rater"® essay feature variables in the context of the TOEFL® computer-based test (CBT) writing assessment. Data analyzed in the study were analytic and holistic…
Descriptors: English (Second Language), Language Tests, Second Language Learning, Scoring

Park, Hyun-Sook; And Others – Journal of Experimental Education, 1990
The reliability of visual inspection in single-case research was investigated by determining agreement among 5 judges visually inspecting 44 graphs depicting behavior from baseline to intervention. Agreement between visual inspection and statistical procedures was determined. Implications for single-case research are discussed. (SLD)
Descriptors: Behavior Patterns, Evaluation Methods, Evaluators, Graphs
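The abstract does not say which agreement index was used. For agreement among more than two judges who each assign every graph to a category (for example, "effect" versus "no effect"), Fleiss' kappa is one common choice; the sketch below assumes that index and that every graph was rated by the same number of judges. The function name and category labels are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each categorised by the same number of judges.

    ratings[i] is the list of category labels the judges gave item i.
    """
    n = len(ratings[0])  # judges per item
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    N = len(ratings)
    per_item = [Counter(r) for r in ratings]
    categories = set().union(*per_item)
    # Mean observed agreement: share of agreeing judge pairs per item
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n) / (n * (n - 1)) for cnt in per_item
    ) / N
    # Chance agreement from the overall category proportions
    p_e = sum((sum(cnt[cat] for cnt in per_item) / (N * n)) ** 2 for cat in categories)
    if p_e == 1.0:
        return float("nan")  # every judge used a single category: undefined
    return (p_bar - p_e) / (1.0 - p_e)


ratings = [
    ["effect"] * 5,
    ["effect", "effect", "effect", "no effect", "no effect"],
    ["no effect"] * 4 + ["effect"],
]
print(round(fleiss_kappa(ratings), 3))  # prints 0.306
```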
Cramer, Stephen E. – 1990
A standard-setting procedure was developed for the Georgia Teacher Certification Testing Program as tests in 30 teaching fields were revised. A list of important characteristics of a standard-setting procedure was derived, drawing on the work of R. A. Berk (1986). The best method was found to be a highly formalized judgmental, empirical Angoff…
Descriptors: Computer Assisted Testing, Cutting Scores, Data Collection, Elementary Secondary Education