Preface
Modeling as a statistical practice can encompass a wide variety of activities. This book focuses on supervised or predictive modeling for text, using text data to make predictions about the world around us. We use the tidymodels framework for modeling, a consistent and flexible collection of R packages developed to encourage good statistical practice.
Supervised machine learning using text data involves building a statistical model to estimate some output from input that includes language. The two types of models we train in this book are regression and classification. Think of regression models as predicting numeric or continuous outputs, such as predicting the year of a United States Supreme Court opinion from the text of that opinion. Think of classification models as predicting outputs that are discrete quantities or class labels, such as predicting whether a GitHub issue is about documentation or not from the text of the issue. Models like these can be used to make predictions for new observations, to understand what features or characteristics contribute to differences in the output, and more. We can evaluate our models using performance metrics to determine which are best, which are acceptable for our specific context, and even which are fair.
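To give a flavor of how these two kinds of models are expressed in tidymodels, here is a minimal sketch of one regression and one classification model specification using parsnip; the glmnet engine and the penalty and mixture values are placeholder choices for illustration only, not recommendations.

```r
library(tidymodels)

# Regression: predict a numeric outcome, such as the year of a Supreme Court opinion
reg_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Classification: predict a class label, such as whether a GitHub issue is about documentation
class_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")
```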
Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features (predictors) for machine learning from language.
Natural language that we as speakers and/or writers use must be dramatically transformed to a machine-readable, numeric representation to be ready for computation. In this book, we explore typical text preprocessing steps from the ground up and consider the effects of these steps. We also show how to fluently use the textrecipes R package (Hvitfeldt 2020a) to prepare text data within a modeling pipeline.
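As a small taste of what is to come, the sketch below shows a minimal textrecipes preprocessing pipeline that tokenizes text, keeps the most frequent tokens, and weights them with tf-idf; the tiny example data frame and its column names are made up purely for illustration.

```r
library(tidymodels)
library(textrecipes)

# Made-up example data, for illustration only
example_data <- tibble(
  text  = c("this opinion was issued in 1985", "please update the documentation"),
  label = factor(c("no", "yes"))
)

# Tokenize the text, keep the most frequent tokens, and weight them with tf-idf
text_rec <- recipe(label ~ text, data = example_data) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 100) %>%
  step_tfidf(text)

# prep() estimates the preprocessing from the data; bake() applies it
bake(prep(text_rec), new_data = NULL)
```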
Silge and Robinson (2017) provides a practical introduction to text mining with R using tidy data principles, based on the tidytext package. If you have already started on the path of gaining insight from your text data, a next step is using that text directly in predictive modeling. Text data contains within it latent information that can be used for insight, understanding, and better decision-making, and predictive modeling with text can bring that information and insight to light. If you have already explored how to analyze text as demonstrated in Silge and Robinson (2017), this book will move one step further to show you how to learn and make predictions from that text data with supervised models. If you are unfamiliar with this previous work, this book will still provide a robust introduction to how text can be represented in useful ways for modeling and a diverse set of supervised modeling approaches for text.
Outline
The book is divided into three sections. We make a (perhaps arbitrary) distinction between machine learning methods and deep learning methods by defining deep learning as any kind of multilayer neural network (LSTM, bi-LSTM, CNN) and machine learning as anything else (regularized regression, naive Bayes, SVM, random forest). We make this distinction both because these different methods use separate software packages and modeling infrastructure, and because, from a pragmatic point of view, it is helpful to split up the chapters this way.
Natural language features: How do we transform text data into a representation useful for modeling? In these chapters, we explore the most common preprocessing steps for text, when they are helpful, and when they are not.
Machine learning methods: We investigate the power of some of the simpler and more lightweight models in our toolbox.
Deep learning methods: Given more time and resources, we see what is possible once we turn to neural networks.
Some of the topics in the second and third sections overlap as they provide different approaches to the same tasks.
Throughout the book, we will demonstrate with examples and build models using a selection of text data sets. A description of these data sets can be found in Appendix B.
We use three kinds of info boxes throughout the book to draw attention to notes and other ideas.
Some boxes call out warnings or possible problems to watch out for.
Boxes marked with hexagons highlight information about specific R packages and how they are used. We use bold for the names of R packages.
Topics this book will not cover
This book serves as a thorough introduction to prediction and modeling with text, along with detailed practical examples, but there are many areas of natural language processing we do not cover. The CRAN Task View on Natural Language Processing provides details on other ways to use R for computational linguistics. Specific topics we do not cover include:
Reading text data into memory: Text data may come to a data practitioner in any of a long list of heterogeneous formats. Text data exists in PDFs, databases, plain text files (single or multiple for a given project), websites, APIs, literal paper, and more. The skills needed to access and sometimes wrangle text data sets so that they are in memory and ready for analysis are so varied and extensive that we cannot hope to cover them in this book. We point readers to R packages such as readr (Wickham and Hester 2020), pdftools (Ooms 2020a), and httr (Wickham 2020), which we have found helpful in these tasks; a brief sketch of these packages in action follows this list.
Unsupervised machine learning for text: Silge and Robinson (2017) provide an introduction to one method of unsupervised text modeling, and Chapter 5 does dive deep into word embeddings, which learn from the latent structure in text data. However, many more unsupervised machine learning algorithms can be used for the goal of learning about the structure or distribution of text data when there are no outcome or output variables to predict.
Text generation: The deep learning model architectures we discuss in Chapters 8, 9, and 10 can be used to generate new text, as well as to model existing text. Chollet and Allaire (2018) provide details on how to use neural network architectures and training data for text generation.
Speech processing: Models that detect words in audio recordings of speech are typically based on many of the principles outlined in this book, but the training data is audio rather than written text. R users can access pre-trained speech-to-text models via large cloud providers, such as Google Cloud’s Speech-to-Text API accessible in R through the googleLanguageR package (Edmondson 2020).
Machine translation: Machine translation of text between languages, based on either older statistical methods or newer neural network methods, is a complex, involved topic. Today, the most successful and well-known implementations of machine translation are proprietary, because large tech companies have access to both the right expertise and enough data in multiple languages to train successful models for general machine translation. Google is one such example, and Google Cloud’s Translation API is again available in R through the googleLanguageR package.
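Returning to the first item above, getting text into memory, here is a rough sketch of what readr, pdftools, and httr look like in use; the file paths and URL are hypothetical placeholders rather than real resources.

```r
library(readr)
library(pdftools)
library(httr)

# Hypothetical paths and URL, for illustration only
txt_lines <- read_lines("data/opinions.txt")        # plain text, one element per line
pdf_pages <- pdf_text("data/annual_report.pdf")     # one element per page of the PDF
response  <- GET("https://example.com/api/issues")  # raw response from a web API
raw_text  <- content(response, as = "text")         # extract the response body as text
```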
Who is this book for?
This book is designed to provide practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate text into their modeling pipelines.
We assume that the reader is somewhat familiar with R, predictive modeling concepts for non-text data, and the tidyverse family of packages (Wickham et al. 2019). For users who don’t have this background with tidyverse code, we recommend R for Data Science (Wickham and Grolemund 2017). Helpful resources for getting started with modeling and machine learning include a free interactive course developed by one of the authors (JS) and Hands-On Machine Learning with R (Boehmke and Greenwell 2019), as well as An Introduction to Statistical Learning (James et al. 2013).
We don’t assume an extensive background in text analysis, but Text Mining with R (Silge and Robinson 2017), by one of the authors (JS) and David Robinson, provides helpful skills in exploratory data analysis for text that will promote successful text modeling. This book is more advanced than Text Mining with R and will help practitioners use their text data in ways not covered in that book.
Acknowledgments
We are so thankful for the contributions, help, and perspectives of people who have supported us in this project. There are several we would like to thank in particular.
We would like to thank Max Kuhn and Davis Vaughan for their investment in the tidymodels packages, David Robinson for his collaboration on the tidytext package, and Yihui Xie for his work on knitr, bookdown, and the R Markdown ecosystem. Thank you to Desirée De Leon for the site design of the online work and to Sarah Lin for the expert creation of the published work’s index. We would also like to thank Carol Haney, Kasia Kulma, David Mimno, Kanishka Misra, and an additional anonymous technical reviewer for their detailed, insightful feedback that substantively improved this book, as well as our editor John Kimmel for his perspective and guidance during the process of writing and publishing.
This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks go to the four people who contributed via GitHub pull requests (in alphabetical order by username): @fellennert, Riva Quiroga (@rivaquiroga), Darrin Speegle (@speegled), Tanner Stauss (@tmstauss).
Note box icons by Smashicons from flaticon.com.
Colophon
This book was written in RStudio using bookdown. The website is hosted via GitHub Pages, and the complete source is available on GitHub. We generated all plots in this book using ggplot2 and its light theme (theme_light()). The autoplot() method for conf_mat() has been modified slightly to allow colors; modified code can be found online.
Because of changes in package versions since the publication of the first edition, you may notice slight differences in some results when comparing this online work and the published paper edition.
This version of the book was built with R version 4.2.0 (2022-04-22) and the following packages:
package | version | source |
---|---|---|
bench | 1.1.2 | CRAN (R 4.2.0) |
bookdown | 0.26 | CRAN (R 4.2.0) |
broom | 0.8.0 | CRAN (R 4.2.0) |
corpus | 0.10.2 | CRAN (R 4.2.0) |
dials | 0.1.1 | CRAN (R 4.2.0) |
discrim | 0.2.0 | CRAN (R 4.2.0) |
doParallel | 1.0.17 | CRAN (R 4.2.0) |
glmnet | 4.1-4 | CRAN (R 4.2.0) |
gt | 0.5.0 | CRAN (R 4.2.0) |
hcandersenr | 0.2.0 | CRAN (R 4.2.0) |
htmltools | 0.5.2 | CRAN (R 4.2.0) |
htmlwidgets | 1.5.4 | CRAN (R 4.2.0) |
hunspell | 3.0.1 | CRAN (R 4.2.0) |
irlba | 2.3.5 | CRAN (R 4.2.0) |
jiebaR | 0.11 | CRAN (R 4.2.0) |
jsonlite | 1.8.0 | CRAN (R 4.2.0) |
kableExtra | 1.3.4 | CRAN (R 4.2.0) |
keras | 2.8.0 | CRAN (R 4.2.0) |
klaR | 1.7-0 | CRAN (R 4.2.0) |
LiblineaR | 2.10-12 | CRAN (R 4.2.0) |
lime | 0.5.2 | CRAN (R 4.2.0) |
lobstr | 1.1.1 | CRAN (R 4.2.0) |
naivebayes | 0.9.7 | CRAN (R 4.2.0) |
parsnip | 0.2.1 | CRAN (R 4.2.0) |
prismatic | 1.1.0 | CRAN (R 4.2.0) |
quanteda | 3.2.1 | CRAN (R 4.2.0) |
ranger | 0.13.1 | CRAN (R 4.2.0) |
recipes | 0.2.0 | CRAN (R 4.2.0) |
remotes | 2.4.2 | CRAN (R 4.2.0) |
reticulate | 1.24 | CRAN (R 4.2.0) |
rsample | 0.1.1 | CRAN (R 4.2.0) |
rsparse | 0.5.0 | CRAN (R 4.2.0) |
scico | 1.3.0 | CRAN (R 4.2.0) |
scotus | 1.0.0 | Github (EmilHvitfeldt/scotus) |
servr | 0.24 | CRAN (R 4.2.0) |
sessioninfo | 1.2.2 | CRAN (R 4.2.0) |
slider | 0.2.2 | CRAN (R 4.2.0) |
SnowballC | 0.7.0 | CRAN (R 4.2.0) |
spacyr | 1.2.1 | CRAN (R 4.2.0) |
stopwords | 2.3 | CRAN (R 4.2.0) |
styler | 1.7.0 | CRAN (R 4.2.0) |
text2vec | 0.6.1 | CRAN (R 4.2.0) |
textdata | 0.4.2 | CRAN (R 4.2.0) |
textfeatures | 0.3.3 | CRAN (R 4.2.0) |
textrecipes | 0.5.2 | CRAN (R 4.2.0) |
tfruns | 1.5.0 | CRAN (R 4.2.0) |
themis | 0.2.1 | CRAN (R 4.2.0) |
tidymodels | 0.2.0 | CRAN (R 4.2.0) |
tidytext | 0.3.2 | CRAN (R 4.2.0) |
tidyverse | 1.3.1 | CRAN (R 4.2.0) |
tokenizers | 0.2.1 | CRAN (R 4.2.0) |
tokenizers.bpe | 0.1.0 | CRAN (R 4.2.0) |
tufte | 0.12 | CRAN (R 4.2.0) |
tune | 0.2.0 | CRAN (R 4.2.0) |
UpSetR | 1.4.0 | CRAN (R 4.2.0) |
vip | 0.3.2 | CRAN (R 4.2.0) |
widyr | 0.1.4 | CRAN (R 4.2.0) |
workflows | 0.2.6 | CRAN (R 4.2.0) |
yardstick | 0.0.9 | CRAN (R 4.2.0) |