ERIC Number: ED656779
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Bridging Human and Machine Scoring in Experimental Assessments of Writing: Tools, Tips, and Lessons Learned from a Field Trial in Education
Reagan Mozer; Luke Miratrix; Jackie Relyea; Jimmy Kim
Society for Research on Educational Effectiveness
Background: In a randomized trial that collects text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by human raters. An impact analysis can then be conducted to compare treatment and control groups, using the hand-coded scores as a measured outcome. This process is both time- and labor-intensive, which creates a persistent barrier for large-scale assessments of text (Shermis and Burstein, 2003; Yan et al., 2020). Furthermore, enriching one's understanding of a found impact on text outcomes via secondary analyses can be difficult without additional scoring efforts. Machine-based text analytic and data mining tools offer one potential avenue to help facilitate research in this domain. Modern methods based on natural language processing (NLP) allow for the automatic evaluation of an array of linguistic properties, including measures of grammatical accuracy (Allen, 1995), discourse structure (Graesser et al., 2004), sentiment (Pennebaker et al., 2001), and social cognition (Crossley et al., 2017). These features characterize a rich set of abilities reflected in student writing, many of which are "invisible" to a human rater coding for higher-level constructs (Pennebaker et al., 2014). By examining patterns of impacts across a wide array of such outcomes, we can explore how different micro-features of text may be driving top-line impacts.
Purpose: In this work, we consider how to leverage automated methods to expand the scope of what researchers might measure, in terms of a treatment impact, when using text as an outcome. Drawing on techniques from the automated scoring literature, we propose several different methods for supplementary analysis in this spirit. We then present a case study of using these methods to enrich an evaluation of a classroom intervention designed to enhance reading and writing skills in young children. In particular, we consider a randomized trial recently conducted by Authors (in press) to evaluate the impacts of a content literacy intervention (i.e., treatment) compared to typical instruction (i.e., control) on first and second graders' domain knowledge in science and social studies as reflected by their performance on an argumentative writing assessment. To assess whether their treatment "worked," they constructed a scoring rubric and hand-coded thousands of essays on a measure of holistic writing quality. Findings showed positive treatment effects on these human-coded writing outcomes in science and social studies (Authors, in press). In the present study, we seek to enrich these findings by analyzing treatment impacts across a suite of secondary outcomes generated using machine-based text analytic and data mining tools. We also gauge to what extent the results from an automated impact analysis taken in isolation, in this context, replicate the gold-standard findings based on careful human coding.
Case Study and Data: Our case study examines a subset of data from the cluster-randomized trial described in Authors (in press). Our sample was composed of 2,929 first- and second-grade students from a total of 302 classrooms across 30 different elementary schools. Within the sample, we observed a total of 2,746 science essays and 2,548 social studies essays.
Methods: For the impact analysis, we use a hierarchical linear model with fixed effects for each school and cluster-robust standard errors clustered at the teacher level (illustrative sketches of this analysis appear after the abstract). We fit separate models for writing outcomes in science and social studies, regressing the human-coded essay quality scores in each domain on an indicator for treatment and other observed covariates. Next, we turn to unpacking these top-level impacts on overall writing quality by examining other text attributes that may not have been directly assessed in the human coding process. This is accomplished by generating a rich set of secondary outcomes that capture different descriptive characteristics of the text. We build these features from the raw text itself using a mix of simple statistical procedures (i.e., counting frequencies of words), pre-trained neural network embeddings such as Word2Vec (Mikolov et al., 2013), indices derived from different linguistic and discourse representations of text (Graesser et al., 2004), and other off-the-shelf software packages for text analysis such as Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2001). In our case study, our final feature set comprised nearly 120 unique text-based outcomes, which we then compare between treatment and control groups using the same cluster-robust approach described above. For each grade-by-domain combination, we then apply multiple testing corrections to identify the set of textual features most prominently impacted by treatment in a manner that controls the false discovery rate.
Results: Figure 1 shows the impact estimates for each subject and grade level with respect to ten common text statistics, including word count, readability, and four measures of higher-level thinking. Figure 2 then shows the estimated effects for additional auxiliary outcomes that were found to have significant treatment impacts in at least one of the four groups. Across both grade levels, we see that essays from the treatment group score significantly higher on the dimension for clout, suggesting that students who received the intervention are writing from a perspective of greater expertise and with more confidence, on average, than control students. Treated students also write with significantly lower levels of authenticity and emotional tone than students in control, which indicates a more formal and distanced form of discourse. In addition, we find significant differences between treatment and control groups at both grade levels on a number of dimensions related to students' drives, needs, and motives. These findings uncover several potential mechanisms as to how the treatment was effective for improving students' writing abilities in each domain.
Conclusions: Overall, we find that by examining a variety of text-based features generated using automated methods, we can uncover important potential mechanisms as to why the treatment was effective. We argue that this rich array of findings moves us from "it worked" to "it worked because" by revealing how the observed improvements in writing were likely due, in part, to students having learned to marshal evidence and speak with more authority. Relying exclusively on human scoring, by contrast, is a lost opportunity.
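To make the impact-analysis specification concrete, the following is a minimal sketch in Python, assuming a data frame with one row per essay and hypothetical columns quality_score, treatment (a 0/1 indicator), school_id, and teacher_id. The abstract does not state which software was used; this sketch substitutes a fixed-effects OLS regression with cluster-robust standard errors (via statsmodels) for the authors' hierarchical specification.

    # Illustrative sketch only (not the authors' code): OLS with school fixed
    # effects and standard errors clustered at the teacher level.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("essays_scored.csv")  # hypothetical file, one row per essay

    # Regress the human-coded quality score on a 0/1 treatment indicator plus
    # school fixed effects; further covariates would be added to the formula.
    model = smf.ols("quality_score ~ treatment + C(school_id)", data=df)
    fit = model.fit(cov_type="cluster", cov_kwds={"groups": df["teacher_id"]})
    print(fit.summary())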
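The feature-generation and screening steps could likewise look roughly like the sketch below: compute a few simple descriptive text statistics per essay, estimate a treatment effect for each with the same clustered specification, and correct the resulting p-values for multiple testing. The Benjamini-Hochberg procedure is assumed here since the abstract does not name the FDR method; the richer features (Word2Vec embeddings, discourse indices, LIWC dimensions) are omitted for brevity, and all column names are illustrative.

    # Illustrative sketch only (not the authors' code): simple per-essay text
    # features, per-feature treatment effects, and an FDR correction.
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.multitest import multipletests

    df = pd.read_csv("essays_text.csv")  # hypothetical file with an 'essay' column

    # A few simple descriptive features (the full analysis used ~120 outcomes).
    df["word_count"] = df["essay"].str.split().str.len()
    df["type_token_ratio"] = df["essay"].apply(
        lambda t: len(set(t.lower().split())) / max(len(t.split()), 1)
    )

    results = []
    for feat in ["word_count", "type_token_ratio"]:
        fit = smf.ols(f"{feat} ~ treatment + C(school_id)", data=df).fit(
            cov_type="cluster", cov_kwds={"groups": df["teacher_id"]}
        )
        results.append((feat, fit.params["treatment"], fit.pvalues["treatment"]))

    # Control the false discovery rate across features (Benjamini-Hochberg assumed).
    reject, p_adj, _, _ = multipletests([p for _, _, p in results], method="fdr_bh")
    for (feat, est, _), p, sig in zip(results, p_adj, reject):
        print(f"{feat}: estimate={est:.3f}, adjusted p={p:.3f}, significant={sig}")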
Descriptors: Elementary School Students, Grade 1, Grade 2, Science Education, Social Studies, Writing (Composition), Essay Tests, Scoring Rubrics, Test Scoring Machines, Persuasive Discourse, Human Factors Engineering
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: Elementary Education; Early Childhood Education; Grade 1; Primary Education; Grade 2
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A