It’s time to use what we have discussed and learned in the first five chapters of this book in a supervised machine learning context. In the next two chapters, we will focus on putting into practice machine learning algorithms such as:
- naive Bayes,
- support vector machines (SVM) (Boser, Guyon, and Vapnik 1992), and
- regularized linear models such as those implemented in glmnet (Friedman, Hastie, and Tibshirani 2010).
We start in Chapter 6 by exploring regression models and continue in Chapter 7 with classification models. We will use the tidymodels framework for resampling, preprocessing, fitting, and evaluation. As you read through these next chapters, notice the modeling process moving through these stages; we’ll discuss the structure of this process in more detail in the foreword for the deep learning chapters.
Before we start fitting these models to real datasets, let’s consider how to think about algorithmic bias for predictive modeling. Rachel Thomas proposed a checklist at ODSC West 2019 for algorithmic bias in machine learning.
Should we even be doing this?
This is always the first step. Machine learning algorithms involve math and data, but that does not mean they are neutral. They can be used for purposes that are helpful, harmful, or even unethical.
What bias is already in the data?
Chapter 6 uses a dataset of United States Supreme Court opinions, with an uneven distribution of years. There are many more opinions from more recent decades than from earlier ones. Bias like this is extremely common in datasets and must be considered in modeling. In this case, we show how using regularized linear models results in better predictions across years than other approaches (Section 6.4).
Can the code and data be audited?
In the case of this book, the code and data are all publicly available. You as a reader can audit our methods and what kinds of bias exist in the datasets. When you take what you have learned in this book and apply it to your real-world work, consider how accessible your code and data are to internal and external stakeholders.
What are the error rates for sub-groups?
In Section 7.4 we demonstrate how to measure model performance for a multiclass classifier, but you can also compute model metrics for sub-groups that are not explicitly in your model as class labels or predictors. Using tidy data principles and the yardstick package makes this task well within the reach of data practitioners.
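As a rough illustration of the idea, the sketch below computes accuracy separately for each sub-group. It is written in plain Python rather than with yardstick, and the group names, labels, and the helper `accuracy_by_group` are all hypothetical; the sub-group is attached to each prediction but is not a model input or a class label.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: accuracy} for (group, truth, prediction) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in records:
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical predictions, each tagged with a sub-group that the model never saw.
records = [
    ("group_a", "credit", "credit"),
    ("group_a", "mortgage", "mortgage"),
    ("group_a", "credit", "credit"),
    ("group_a", "mortgage", "credit"),
    ("group_b", "credit", "mortgage"),
    ("group_b", "mortgage", "mortgage"),
]

per_group = accuracy_by_group(records)
overall = sum(truth == pred for _, truth, pred in records) / len(records)

print("overall:", round(overall, 3))
for group, acc in sorted(per_group.items()):
    print(group, acc)
```

In this made-up data the overall accuracy hides a gap between the two groups, which is the kind of difference this checklist question is meant to surface.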
What is the accuracy of a simple rule-based alternative?
Chapter 7 shows how to train models to predict the category of a user complaint using sophisticated preprocessing steps and machine learning algorithms, but such a complaint could be categorized using simple regular expressions (Appendix 11), perhaps combined with other rules. Straightforward heuristics are easy to implement, maintain, and audit, compared to machine learning models; consider comparing the accuracy of models to simpler options.
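A minimal sketch of such a rule-based baseline might look like the following. It uses plain Python and its `re` module; the patterns, category names, and example complaints are invented for illustration and are not taken from the dataset used in Chapter 7.

```python
import re

# Hypothetical rules as (pattern, category) pairs; the first matching rule wins.
RULES = [
    (re.compile(r"\b(mortgage|escrow|foreclosure)\b", re.IGNORECASE), "mortgage"),
    (re.compile(r"\b(credit card|card)\b", re.IGNORECASE), "credit_card"),
    (re.compile(r"\b(student loan|tuition)\b", re.IGNORECASE), "student_loan"),
]
DEFAULT_CATEGORY = "other"

def classify(text):
    """Assign the category of the first rule whose pattern matches the text."""
    for pattern, category in RULES:
        if pattern.search(text):
            return category
    return DEFAULT_CATEGORY

def accuracy(examples):
    """Fraction of (text, label) examples that the rules classify correctly."""
    hits = sum(classify(text) == label for text, label in examples)
    return hits / len(examples)

# Hypothetical labeled complaints used to score the baseline.
labeled = [
    ("My mortgage servicer lost my escrow payment.", "mortgage"),
    ("There is an unknown charge on my credit card.", "credit_card"),
    ("My student loan balance is wrong.", "student_loan"),
    ("The bank closed my checking account.", "other"),
    ("I was charged a late fee on my card for my mortgage payment.", "credit_card"),
]

baseline_accuracy = accuracy(labeled)
print(baseline_accuracy)
```

The last example is misclassified because the mortgage rule fires first, which shows both how quickly such rules can be audited and where they tend to break. The resulting accuracy is the number a trained model would need to beat to justify its extra complexity.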
What processes are in place to handle appeals or mistakes?
If models such as those built in Chapter 7 were put into production by an organization, what would happen if a complaint was classified incorrectly? We as data practitioners typically (hopefully) have a reasonable estimate of the true positive rate and true negative rate for models we train, so processes to handle misclassifications can be built with a good understanding of how often they will be used.
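To make the planning step concrete, here is a small sketch in plain Python for the two-class case. The confusion-matrix counts and the expected volumes are hypothetical, as are the helper names `rates` and `expected_errors`; the point is only that estimated rates translate directly into an estimate of how many cases an appeals process would need to handle.

```python
def rates(tp, fn, tn, fp):
    """Return (true positive rate, true negative rate) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr

def expected_errors(n_positive, n_negative, tpr, tnr):
    """Expected number of misclassified cases given class volumes and rates."""
    missed_positives = n_positive * (1 - tpr)
    false_alarms = n_negative * (1 - tnr)
    return missed_positives + false_alarms

# Hypothetical counts from a held-out test set.
tpr, tnr = rates(tp=90, fn=10, tn=160, fp=40)

# Hypothetical monthly volume once the model is in production.
monthly_errors = expected_errors(n_positive=1000, n_negative=3000, tpr=tpr, tnr=tnr)
print(tpr, tnr, round(monthly_errors))
```

This assumes that the production data resemble the test set; if they do not, the estimate of how often the process will be used is correspondingly less reliable.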
How diverse is the team that built it?
The two-person team that wrote this book includes perspectives from a man and a woman, and from someone who has always lived inside the United States and someone who is from a European country. However, we are both white with similar educational backgrounds. We must be aware of how the limited life experiences of individuals training and assessing machine learning models can cause unintentional harm.
Real world effects
Questions like these are helpful checks against building inappropriate or even harmful machine learning models for text. Models affect real people in real ways. As the school year of 2020 began with many schools in the United States operating online only because of the novel coronavirus pandemic, a parent of a junior high student reported that her son was deeply upset and filled with doubt because of the way the algorithm of an ed tech company automatically scored his text answers. The parent and child discovered how to “game” the ed tech system’s scoring.
Algorithm update. He cracked it: Two full sentences, followed by a word salad of all possibly applicable keywords. 100% on every assignment. Students on @EdgenuityInc, there’s your ticket. He went from an F to an A+ without learning a thing.
We can’t know the details of the proprietary modeling and/or heuristics that make up the ed tech system’s scoring algorithm, but there is enough detail in this student’s experience to draw some conclusions. We surmise that this is a count-based method or model, perhaps a linear one but not necessarily so. The success of “word salad” submissions indicates that the model or heuristic being applied has not learned that complex, or even correct, language is important for the score.
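To see why a purely count-based score is vulnerable in this way, consider the toy scorer below, written in plain Python. It is entirely hypothetical and is not a reconstruction of the company's system: the keyword list, the function `keyword_score`, and both answers are invented. It only illustrates how a score that depends on which keywords appear cannot tell a coherent answer from a list of terms.

```python
import re

# Hypothetical keywords that a grader might associate with a correct answer.
KEYWORDS = {"photosynthesis", "chlorophyll", "sunlight", "glucose", "oxygen"}

def keyword_score(answer):
    """Score an answer as the fraction of expected keywords it mentions."""
    tokens = set(re.findall(r"[a-z]+", answer.lower()))
    return len(tokens & KEYWORDS) / len(KEYWORDS)

thoughtful = "Plants use sunlight to turn water and carbon dioxide into glucose."
word_salad = (
    "Plants are green. They grow. "
    "photosynthesis chlorophyll sunlight glucose oxygen"
)

thoughtful_score = keyword_score(thoughtful)
salad_score = keyword_score(word_salad)
print(thoughtful_score, salad_score)
```

Under this toy scorer the word salad earns a perfect score while the coherent sentence does not, because word order and meaning never enter the calculation.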
What could a team building this kind of score do to avoid these problems? It seems like “word salad” type submissions were not included in the training data as negative examples (i.e., with low scores), indicating that the training data was biased; it did not reflect the full spectrum of submissions that the system sees in real life. The system (code and data) is not auditable for teachers or students, and the ed tech company does not have a direct process in place to handle appeals or mistakes in the score itself. This ed tech company does claim that these scores are used only to provide scoring guidance to teachers and that teachers can either accept or overrule such scores, but it is not clear how often teachers overrule scores. This returns us to the first question, whether such a model or system should be built at all; with its current performance, this system is failing at what educators and students understand its goals to be, and is doing harm (almost certainly unevenly distributed harm) to its users.
Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. http://www.jstatsoft.org/v33/i01/.