Publication Date
In 2025: 0
Since 2024: 0
Since 2021 (last 5 years): 1
Since 2016 (last 10 years): 1
Since 2006 (last 20 years): 2
Descriptor
Comparative Analysis: 35
Interrater Reliability: 35
Evaluators: 10
Higher Education: 8
Evaluation Methods: 7
Language Tests: 7
Scoring: 7
Test Items: 7
Testing: 7
English (Second Language): 6
Evaluation Criteria: 6
Author
Lunz, Mary E.: 2
Myford, Carol M.: 2
O'Neill, Thomas R.: 2
Adams, R. J.: 1
Alvermann, Donna E.: 1
Beasley, T. Mark: 1
Bridgeman, Brent: 1
Chang, Lei: 1
Chavez, Oscar: 1
Chen, H. Julie: 1
Christine, Charles T.: 1
Publication Type
Speeches/Meeting Papers: 35
Reports - Research: 21
Reports - Evaluative: 11
Information Analyses: 2
Tests/Questionnaires: 2
Collected Works - Serials: 1
Opinion Papers: 1
Education Level
Elementary Secondary Education: 1
High Schools: 1
Secondary Education: 1
Audience
Researchers: 3
Practitioners: 2
Teachers: 1
Location
California: 1
Assessments and Surveys
Graduate Management Admission…: 1
National Assessment of…: 1
Test of English as a Foreign…: 1
The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
Tack, Anaïs; Piech, Chris – International Educational Data Mining Society, 2022
How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports…
Descriptors: Artificial Intelligence, Dialogs (Language), Bayesian Statistics, Decision Making
Chavez, Oscar; Papick, Ira; Ross, Dan J.; Grouws, Douglas A. – Online Submission, 2010
The purpose of this paper was to describe the process of development of assessment instruments for the Comparing Options in Secondary Mathematics: Investigating Curriculum (COSMIC) project. The COSMIC project was a three-year longitudinal comparative study focusing on evaluating high school students' mathematics learning from two distinct…
Descriptors: Mathematics Education, Mathematics Achievement, Interrater Reliability, Scoring Rubrics
O'Neill, Thomas R.; Lunz, Mary E. – 1997
This paper illustrates a method to study rater severity across exam administrations. A multi-facet Rasch model defined the ratings as being dominated by four facets: examinee ability, rater severity, project difficulty, and task difficulty. Ten years of data from administrations of a histotechnology performance assessment were pooled and analyzed…
Descriptors: Ability, Comparative Analysis, Equated Scores, Interrater Reliability
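The abstract above names a many-facet Rasch model in which examinee ability, rater severity, project difficulty, and task difficulty jointly determine a rating. A minimal sketch of that decomposition, assuming the standard additive logit form (the parameter values below are invented for illustration, not taken from the study):

```python
import math

# Many-facet Rasch model: the log-odds of success decompose
# additively across the facets named in the abstract.
def rating_probability(ability, severity, project_diff, task_diff):
    logit = ability - severity - project_diff - task_diff
    return 1.0 / (1.0 + math.exp(-logit))

# A more severe rater lowers the probability of success for the
# same examinee, project, and task:
lenient = rating_probability(1.0, -0.5, 0.2, 0.3)
severe = rating_probability(1.0, 0.5, 0.2, 0.3)
print(lenient > severe)  # True
```

Because severity enters the logit with its own term, pooled data across administrations lets rater severity be estimated separately from examinee ability, which is what makes cross-administration comparisons of severity possible.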
Spolsky, Bernard – 1990
A discussion of the differences between the Test of English as a Foreign Language (TOEFL), an American test battery, and the Cambridge English Examinations (Cambridge), a British battery, focuses on the different approaches to language test development embodied in the tests as the source of difficulty in translating between them for individual…
Descriptors: Comparative Analysis, Cultural Differences, English (Second Language), Foreign Countries
Beasley, T. Mark; Leitner, Dennis W. – 1993
The L statistic of E. B. Page (1963) tests the agreement of a single group of judges with an a priori ordering of alternative treatments. This paper extends the two-group test of D. W. Leitner and C. M. Dayton (1976), itself an extension of the L test, to analyze the difference in consensus between two unequally sized groups of judges. Exact critical values…
Descriptors: Comparative Analysis, Equations (Mathematics), Estimation (Mathematics), Evaluators
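The basic L statistic this entry builds on can be sketched in a few lines: each judge ranks the k treatments, and L is the sum of the hypothesized rank times the column sum of observed ranks, so perfect agreement with the predicted ordering maximizes L. This is an illustrative sketch of Page's (1963) statistic only, not of the two-group extension:

```python
# Page's L statistic: agreement of a group of judges with an
# a priori ordering of treatments.
def page_l(rankings, predicted_order):
    # rankings[i][j] is the rank judge i gave treatment j (1 = lowest);
    # predicted_order[j] is the hypothesized rank of treatment j.
    k = len(predicted_order)
    column_sums = [sum(r[j] for r in rankings) for j in range(k)]
    return sum(c * s for c, s in zip(predicted_order, column_sums))

# Three judges, three treatments, perfect agreement with the
# predicted ordering 1 < 2 < 3:
judges = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(page_l(judges, [1, 2, 3]))  # 1*3 + 2*6 + 3*9 = 42
```

The two-group test compares L-based consensus between two groups of judges; the 1991 paper's contribution is handling groups of unequal size.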
Schael, Jocelyne; Dionne, Jean-Paul – 1991
The basis of agreement or disagreement among judges/evaluators when applying a coding scheme to concurrent verbal protocols was studied. The sample included 20 university graduates, from varied backgrounds; 10 subjects had and 10 subjects did not have experience in protocol analysis. The total sample was divided into four balanced groups according…
Descriptors: Adults, College Graduates, Comparative Analysis, Encoding (Psychology)
Christine, Charles T.; And Others – 1982
Thirty-two children aged 7 to 12 participated in a study to determine the reliability of the Ekwall Reading Inventory (ERI) and the Classroom Reading Inventory (CRI). The children were randomly assigned to take one of the two inventories, which were administered by four different specially trained teachers. The study used a test-retest design, in…
Descriptors: Comparative Analysis, Elementary Secondary Education, Informal Reading Inventories, Interrater Reliability
Debate Philosophy Statements as Predictors of Critic Attitudes: A Summary and Direction of Research.
Dudczak, Craig; Day, Donald – 1991
Philosophy statements have been used in the National Debate Tournament (NDT) since the mid-1970s and the Cross Examination Debate Association (CEDA) National Tournament since its 1986 inception. The statements should help debaters adapt to critics' expressed preferences. Moreover, philosophy statements can guide the study of argumentation theory…
Descriptors: Comparative Analysis, Content Analysis, Debate, Higher Education
Chen, H. Julie – 1995
A study investigated 42 native English-speakers' (NSs) perceptions of the pragmatic appropriateness of refusal statements. The NSs rated the appropriateness of 24 written statements in 4 different refusal scenarios, which were collected from both native speakers and non-native speakers. Four weeks later, as a reliability check, the subjects rated…
Descriptors: Attitudes, Comparative Analysis, English (Second Language), Interrater Reliability
Nicolai, Michael T. – 1987
To determine if there is a distinction between the forensics community's idea of quality and that of the general population, tournament rankings of forensics judges and those of a lay audience were compared. Undergraduate students enrolled in a variety of speech related courses were asked to attend rounds of competition at a midwest collegiate…
Descriptors: Communication Research, Comparative Analysis, Debate, Evaluation Criteria
Kenyon, Dorry; Stansfield, Charles W. – 1993
This paper examines whether individuals who train themselves to score a performance assessment will rate acceptably when compared to known standards. Research on the efficacy of rater self-training materials developed by the Center for Applied Linguistics for the Texas Oral Proficiency Test (TOPT) is examined. The rater self-training materials are described…
Descriptors: Bilingual Education, Comparative Analysis, Evaluators, Individual Characteristics

Yates, Beverly J. – 1991
The predictive validity of the National Association of Secondary School Principals (NASSP) assessment center evaluation process for principals is compared with the perceived effectiveness of a selected population of principals. The NASSP assessment center approach includes a case study, a personal interview, two exercises, and a scholastic…
Descriptors: Administrator Evaluation, Assessment Centers (Personnel), Case Studies, Comparative Analysis
O'Neill, Thomas R.; Lunz, Mary E. – 1996
To generalize test results beyond the particular test administration, an examinee's ability estimate must be independent of the particular items attempted, and the item difficulty calibrations must be independent of the particular sample of people attempting the items. This stability is a key concept of the Rasch model, a latent trait model of…
Descriptors: Ability, Benchmarking, Comparative Analysis, Difficulty Level

Jaeger, Richard M.; Usher, Claire H. – 1991
This paper reports on a study of the foundation and application of two procedures used to specify appropriate weights to be applied to components in determining the overall quality of a school. These procedures are multiattribute utility technology (MAUT) and policy capturing, and the paper presents the results of applying them, using key…
Descriptors: Achievement Tests, Comparative Analysis, Curriculum Evaluation, Educational Assessment
Crews, William E., Jr. – 1991
As part of a study of teacher evaluation of student replies to open-ended questions, a second question--the best method of determining interrater reliability--was examined. The standard method, the Pearson Product-Moment correlation, overestimated the degree of match between researchers' and teachers' scoring of tests. The simpler percent…
Descriptors: Comparative Analysis, Elementary School Teachers, Evaluation Methods, Evaluators
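The Pearson-versus-percent-agreement contrast in the entry above is easy to demonstrate: a constant scoring offset yields a perfect correlation yet zero exact agreement. A minimal sketch with invented scores (not the study's data):

```python
from statistics import mean, stdev

# Pearson product-moment correlation: measures linear association,
# so it ignores any constant offset between two raters' scores.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Percent exact agreement: the fraction of items scored identically.
def percent_agreement(x, y):
    return sum(a == b for a, b in zip(x, y)) / len(x)

# The teacher scores every item one point higher than the researcher:
researcher = [1, 2, 3, 4, 5]
teacher = [2, 3, 4, 5, 6]
print(pearson(researcher, teacher))            # close to 1.0
print(percent_agreement(researcher, teacher))  # 0.0 — no exact matches
```

This is the sense in which the correlation "overestimates" the match between researchers' and teachers' scoring: it credits raters who disagree on every item, provided they disagree consistently.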