Overview

It’s time to use what we have discussed and learned in the first five chapters of this book in a supervised machine learning context, to make predictions from text data. In the next two chapters, we will focus on putting into practice such machine learning algorithms as:

  • regularized linear models (Friedman, Hastie, and Tibshirani 2010) and

  • support vector machines (Boser, Guyon, and Vapnik 1992).

We start in Chapter 6 with exploring regression models and continue in Chapter 7 with classification models. These are different types of prediction problems, but in both, we can use the tools of supervised machine learning to connect our input, which may exist entirely or partly as text data, with our outcome of interest. Most supervised models for text data are built with one of three purposes in mind:

  • The main goal of a predictive model is to generate the most accurate predictions possible.

  • An inferential model is created to test a hypothesis or draw conclusions about a population.

  • The main purpose of a descriptive model is to describe the properties of the observed data.

Many learning algorithms can be used for more than one of these purposes. Concerns about a model’s predictive capacity may be as important for an inferential or descriptive model as for a model designed purely for prediction, and model interpretability and explainability may be important for a solely predictive or descriptive model as well as for an inferential model. We will use the tidymodels framework to address all of these issues, with its consistent approach to resampling, preprocessing, fitting, and evaluation.

The tidymodels framework (Kuhn and Wickham 2021a) is a collection of R packages for modeling and machine learning using tidyverse principles (Wickham et al. 2019). These packages facilitate resampling, preprocessing, modeling, and evaluation. There are core packages that you can load all together via library(tidymodels) and then extra packages for more specific tasks.
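
As a minimal sketch of what that looks like in practice (here textrecipes stands in as one example of an extra package, providing preprocessing steps for text):

```r
# Core tidymodels packages (such as parsnip, recipes, rsample,
# tune, and yardstick) all load together
library(tidymodels)

# Extra packages for more specific tasks are loaded on their own;
# for example, textrecipes extends recipes with preprocessing
# steps for text data
library(textrecipes)
```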

As you read through these next chapters, notice the modeling process moving through these stages; we’ll discuss the structure of this process in more detail in the overview for the deep learning chapters.

Before we start fitting these models to real data sets, let’s consider how to think about algorithmic bias for predictive modeling. Rachel Thomas proposed a checklist of questions at ODSC West 2019 for identifying algorithmic bias in machine learning.

Should we even be doing this?

This is always the first step. Machine learning algorithms involve math and data, but that does not mean they are neutral. They can be used for purposes that are helpful, harmful, or even unethical.

What bias is already in the data?

Chapter 6 uses a data set of United States Supreme Court opinions, with an uneven distribution of years. There are many more opinions from more recent decades than from earlier ones. Bias like this is extremely common in data sets and must be considered in modeling. In this case, we show how using regularized linear models results in better predictions across years than other approaches (Section 6.3).
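
As a quick sketch of how to surface this kind of imbalance, assuming only a data frame of opinions with a year column (the small stand-in data here is hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for the Supreme Court opinions data;
# the real data set skews heavily toward recent decades
opinions <- tibble(
  year = c(1852, 1901, 1955, 1987, 1999, 2003, 2008, 2012, 2015, 2019)
)

# Count opinions per decade to make the uneven distribution visible
opinions %>%
  mutate(decade = 10 * (year %/% 10)) %>%
  count(decade)
```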

Can the code and data be audited?

In the case of this book, the code and data are all publicly available. You as a reader can audit our methods and what kinds of bias exist in the data sets. When you take what you have learned in this book and apply it to your real-world work, consider how accessible your code and data are to internal and external stakeholders.

What are the error rates for sub-groups?

In Section 7.6 we demonstrate how to measure model performance for a multiclass classifier, but you can also compute model metrics for sub-groups that are not explicitly in your model as class labels or predictors. Using tidy data principles and the yardstick package makes this task well within the reach of data practitioners.

In tidymodels, the yardstick package (Kuhn and Vaughan 2021a) has functions for model evaluation.
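
As a minimal sketch, with a hypothetical tibble of predictions and a sub-group column: yardstick metric functions respect dplyr groups, so grouping before computing a metric yields one estimate per sub-group.

```r
library(dplyr)
library(yardstick)

# Hypothetical model predictions, with a sub-group column that was
# not used as a class label or predictor in the model itself
predictions <- tibble(
  truth    = factor(c("yes", "no", "yes", "no", "yes", "no")),
  estimate = factor(c("yes", "no", "no", "no", "yes", "yes")),
  subgroup = c("A", "A", "A", "B", "B", "B")
)

# Grouping first yields one accuracy estimate per sub-group
predictions %>%
  group_by(subgroup) %>%
  accuracy(truth, estimate)
```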

What is the accuracy of a simple rule-based alternative?

Chapter 7 shows how to train models to predict the category of a user complaint using sophisticated preprocessing steps and machine learning algorithms, but such a complaint could be categorized using simple regular expressions (Appendix A), perhaps combined with other rules. Straightforward heuristics are easier to implement, maintain, and audit than machine learning models; consider comparing the accuracy of your models to simpler options like the sketch below.
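
As an illustration, a rule-based baseline can be only a few lines; the complaints, patterns, and category names below are all hypothetical.

```r
library(stringr)

# Two made-up complaints for illustration
complaints <- c(
  "I was charged a late fee on my mortgage payment",
  "My credit card application was denied without explanation"
)

# Hypothetical heuristic: label a complaint "mortgage" when a simple
# regular expression matches, and "other" otherwise
ifelse(
  str_detect(complaints, regex("mortgage|escrow|refinanc", ignore_case = TRUE)),
  "mortgage",
  "other"
)
#> [1] "mortgage" "other"
```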

What processes are in place to handle appeals or mistakes?

If models such as those built in Chapter 7 were put into production by an organization, what would happen if a complaint were classified incorrectly? We as data practitioners typically (hopefully) have a reasonable estimate of the true positive rate and true negative rate for the models we train, so processes to handle misclassifications can be built with a good understanding of how often they will be used.
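
For example, here is a sketch of estimating those two rates with yardstick, using a hypothetical set of two-class predictions.

```r
library(yardstick)

# Hypothetical two-class predictions; "complaint" is the event level
two_class <- tibble::tibble(
  truth    = factor(c("complaint", "complaint", "other", "other")),
  estimate = factor(c("complaint", "other", "other", "complaint"))
)

sens(two_class, truth, estimate)  # sensitivity, the true positive rate
spec(two_class, truth, estimate)  # specificity, the true negative rate
```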

How diverse is the team that built it?

The two-person team that wrote this book includes perspectives from a man and a woman, and from someone who has always lived inside the United States and someone who is from a European country. However, we are both white, with similar educational backgrounds. We must be aware of how the limited life experiences of individuals training and assessing machine learning models can cause unintentional harm.

References

Boser, B. E., Guyon, I. M., and Vapnik, V. N. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–152. COLT ’92. New York, NY: Association for Computing Machinery. https://doi.org/10.1145/130385.130401.
Friedman, J. H., Hastie, T., and Tibshirani, R. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://www.jstatsoft.org/v033/i01.
Kuhn, M., and Vaughan, D. 2021a. yardstick: Tidy Characterizations of Model Performance. R package version 0.0.8. https://CRAN.R-project.org/package=yardstick.
Kuhn, M., and Wickham, H. 2021a. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.” RStudio PBC. https://www.tidymodels.org.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.