Modeling as a statistical practice can encompass a wide variety of activities. This book focuses on supervised or predictive modeling for text, using text data to make predictions about the world around us. We use the tidymodels framework for modeling, a consistent and flexible collection of R packages developed to encourage good statistical practice.
Supervised machine learning using text data involves building a statistical model to estimate some output from input that includes language. The two types of models we train in this book are regression and classification. Think of regression models as predicting numeric or continuous outputs, such as predicting the year of a United States Supreme Court opinion from the text of that opinion. Think of classification models as predicting outputs that are discrete quantities or class labels, such as predicting whether a GitHub issue is about documentation or not from the text of the issue. Models like these can be used to make predictions for new observations, to understand what features or characteristics contribute to differences in the output, and more. We can evaluate our models using performance metrics to determine which are best, which are acceptable for our specific context, and even which are fair.
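As a minimal sketch of the difference between the two model types, the following uses tidymodels with an invented toy data set. The column names `n_words`, `year`, and `is_doc` and all of the values are hypothetical, and a single word count stands in for the richer text features built later in the book.

```r
library(tidymodels)

# Hypothetical toy data: `n_words` stands in for a feature derived from text.
toy <- tibble(
  n_words = c(120, 340, 95, 410, 150, 250, 60, 300),
  year    = c(1901, 1975, 1890, 1999, 1950, 1932, 1885, 1968),
  is_doc  = factor(c("no", "yes", "no", "yes", "yes", "no", "no", "yes"))
)

# Regression: the outcome `year` is numeric.
reg_spec <- linear_reg() %>% set_engine("lm")
reg_fit  <- fit(reg_spec, year ~ n_words, data = toy)

# Classification: the outcome `is_doc` is a class label (a factor).
cls_spec <- logistic_reg() %>% set_engine("glm")
cls_fit  <- fit(cls_spec, is_doc ~ n_words, data = toy)

# Predictions for a new observation.
new_obs  <- tibble(n_words = 200)
reg_pred <- predict(reg_fit, new_data = new_obs)  # tibble with a numeric `.pred` column
cls_pred <- predict(cls_fit, new_data = new_obs)  # tibble with a factor `.pred_class` column
```

The two specifications differ only in the model function and the type of the outcome; the fitting and prediction calls have the same shape in both cases.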
Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features (predictors) for machine learning from language.
Natural language that we as speakers and/or writers use must be dramatically transformed to a machine-readable, numeric representation to be ready for computation. In this book, we explore typical text preprocessing steps from the ground up, and consider the effects of these steps. We also show how to fluently use the textrecipes R package (Hvitfeldt 2020b) to prepare text data within a modeling pipeline.
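The transformation from raw text to numeric features might be sketched as follows with textrecipes. This is an illustration under stated assumptions rather than the pipeline used in later chapters: the `issues` data, its `text` and `label` columns, and the choice of `max_tokens = 10` are all invented for the example.

```r
library(tidymodels)
library(textrecipes)

# Hypothetical toy data: four short issue titles with a class label.
issues <- tibble(
  text = c(
    "Fix typo in the README documentation",
    "App crashes when loading large files",
    "Update documentation for the install steps",
    "Crash on startup after the latest update"
  ),
  label = factor(c("docs", "bug", "docs", "bug"))
)

# Describe the preprocessing: tokenize, keep the most frequent tokens,
# then weight the remaining tokens by tf-idf.
text_rec <- recipe(label ~ text, data = issues) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 10) %>%
  step_tfidf(text)

# Estimate the steps on the training data and return the processed data.
text_prep <- prep(text_rec)
features  <- bake(text_prep, new_data = NULL)
```

After baking, the `text` column is replaced by one numeric column per retained token, alongside the `label` outcome, which is the machine-readable representation a model can consume.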
Silge and Robinson (2017) provide a practical introduction to text mining with R using tidy data principles, based on the tidytext package. If you have already started on the path of gaining insight from your text data, a next step is using that text directly in predictive modeling. Text data contains latent information that can be used for insight, understanding, and better decision-making, and predictive modeling with text can bring that information to light. If you have already explored how to analyze text as demonstrated in Silge and Robinson (2017), this book will move one step further to show you how to learn and make predictions from that text data with supervised models. If you are unfamiliar with this previous work, this book will still provide a robust introduction to how text can be represented in useful ways for modeling and to a diverse set of supervised modeling approaches for text.
The book is divided into three sections. We make a (perhaps arbitrary) distinction between machine learning methods and deep learning methods by defining deep learning as any kind of multi-layer neural network (LSTM, bi-LSTM, CNN) and machine learning as anything else (regularized regression, naive Bayes, SVM, random forest). We make this distinction because these methods use separate software packages and modeling infrastructure and because, from a pragmatic point of view, it is helpful to split up the chapters this way.
Natural language features: How do we transform text data into a representation useful for modeling? In these chapters, we explore the most common preprocessing steps for text, when they are helpful, and when they are not.
Machine learning methods: We investigate the power of some of the simpler and more lightweight models in our toolbox.
Deep learning methods: Given more time and resources, we see what is possible once we turn to neural networks.
Some of the topics in the second and third sections overlap as they provide different approaches to the same tasks.
Throughout the book, we will demonstrate with examples and build models using a selection of text datasets. A description of these datasets can be found in Appendix 11.
Topics this book will not cover
This book serves as a thorough introduction to prediction and modeling with text, along with detailed practical examples, but there are many areas of natural language processing we do not cover. The CRAN Task View on Natural Language Processing provides details on other ways to use R for computational linguistics. Specific topics we do not cover include:
Unsupervised machine learning for text: Silge and Robinson (2017) provide an introduction to one method of unsupervised text modeling, and Chapter 5 does dive deep into word embeddings, which learn from the latent structure in text data. However, many more unsupervised machine learning algorithms can be used for the goal of learning about the structure or distribution of text data when there are no outcome or output variables to predict.
Text generation: The deep learning model architectures we discuss in Chapters 8 and 9 can be used to generate new text, as well as to model existing text. Chollet and Allaire (2018) provide details on how to use neural network architectures and training data for text generation.
Speech processing: Models that detect words in audio recordings of speech are typically based on many of the principles outlined in this book, but the training data is audio rather than written text. R users can access pre-trained speech-to-text models via large cloud providers, such as Google Cloud’s Speech-to-Text API accessible in R through the googleLanguageR package (Edmondson 2020).
Machine translation: Machine translation of text between languages, based on either older statistical methods or newer neural network methods, is a complex, involved topic. Today, the most successful and well-known implementations of machine translation are proprietary, because large tech companies have access to both the right expertise and enough data in multiple languages to train successful models for general machine translation. Google is one such example, and Google Cloud’s Translation API is again available in R through the googleLanguageR package.
Who is this book for?
This book is designed to provide practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate text into their modeling pipelines.
We assume that the reader is somewhat familiar with R, predictive modeling concepts for non-text data, and the tidyverse family of packages. For users who don’t have this background with tidyverse code, we recommend R for Data Science (Wickham and Grolemund 2017). Helpful resources for getting started with modeling and machine learning include a free interactive course developed by one of the authors (JS) and Hands-On Machine Learning with R (Boehmke and Greenwell 2019), as well as An Introduction to Statistical Learning (James et al. 2013).
We don’t assume an extensive background in text analysis, but Text Mining with R (Silge and Robinson 2017), by one of the authors (JS) and David Robinson, provides helpful skills in exploratory data analysis for text that will promote successful text modeling. This book is more advanced than Text Mining with R and will help practitioners use their text data in ways not covered in that book.
Acknowledgments
We are so thankful for the contributions, help, and perspectives of people who have supported us in this project. There are several we would like to thank in particular.
We would like to thank Max Kuhn and Davis Vaughan for their investment in the tidymodels packages, David Robinson for his collaboration on the tidytext package, the anonymous technical reviewers for their substantive and insightful feedback, and Desirée De Leon for the site design for this online work.
This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks go to the two people who contributed via GitHub pull requests (in alphabetical order by username): @fellennert and Tanner Stauss (@tmstauss).
Note box icons by Smashicons from flaticon.com
Colophon
This book was written in RStudio using bookdown. The website is hosted via Netlify and automatically built after every push by GitHub Actions. The complete source is available on GitHub. We generated all plots in this book using ggplot2 and its light theme.
This version of the book was built with R version 4.0.3 (2020-10-10) and the following packages:
|package|version|source|
|---|---|---|
|bench|1.1.1|CRAN (R 4.0.2)|
|corpus|0.10.1|CRAN (R 4.0.2)|
|discrim|0.1.1|CRAN (R 4.0.2)|
|doParallel|1.0.16|CRAN (R 4.0.2)|
|glmnet|4.0-2|CRAN (R 4.0.2)|
|hcandersenr|0.2.0|CRAN (R 4.0.2)|
|hunspell|3.0|CRAN (R 4.0.2)|
|irlba|2.3.3|CRAN (R 4.0.2)|
|jiebaR|0.11|CRAN (R 4.0.2)|
|jsonlite|1.7.1|CRAN (R 4.0.2)|
|keras|22.214.171.124|CRAN (R 4.0.2)|
|klaR|0.6-15|CRAN (R 4.0.2)|
|liquidSVM|1.2.4|CRAN (R 4.0.2)|
|lobstr|1.1.1|CRAN (R 4.0.2)|
|naivebayes|0.9.7|CRAN (R 4.0.2)|
|quanteda|2.1.2|CRAN (R 4.0.2)|
|ranger|0.12.1|CRAN (R 4.0.2)|
|remotes|2.2.0|CRAN (R 4.0.2)|
|rsparse|0.4.0|CRAN (R 4.0.2)|
|scico|1.2.0|CRAN (R 4.0.2)|
|servr|0.20|CRAN (R 4.0.2)|
|sessioninfo|1.1.1|CRAN (R 4.0.2)|
|slider|0.1.5|CRAN (R 4.0.2)|
|SnowballC|0.7.0|CRAN (R 4.0.2)|
|spacyr|1.2.1|CRAN (R 4.0.2)|
|stopwords|2.0|CRAN (R 4.0.2)|
|styler|1.3.2|CRAN (R 4.0.2)|
|text2vec|0.6|CRAN (R 4.0.2)|
|textdata|0.4.1|CRAN (R 4.0.2)|
|textfeatures|0.3.3|CRAN (R 4.0.2)|
|themis|0.1.3|CRAN (R 4.0.2)|
|tidymodels|0.1.2|CRAN (R 4.0.3)|
|tidytext|0.2.6|CRAN (R 4.0.2)|
|tidyverse|1.3.0|CRAN (R 4.0.2)|
|tokenizers|0.2.1|CRAN (R 4.0.2)|
|tufte|0.8|CRAN (R 4.0.2)|
|UpSetR|1.4.0|CRAN (R 4.0.2)|
|vip|0.2.2|CRAN (R 4.0.2)|
|widyr|0.1.3|CRAN (R 4.0.2)|
References
Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. 1st ed. Boca Raton: CRC Press.
Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.
Edmondson, Mark. 2020. GoogleLanguageR: Call Google’s ‘Natural Language’ API, ‘Cloud Translation’ API, ‘Cloud Speech’ API and ‘Cloud Text-to-Speech’ API. https://CRAN.R-project.org/package=googleLanguageR.
Hvitfeldt, Emil. 2020b. Textrecipes: Extra ‘Recipes’ for Text Processing. https://CRAN.R-project.org/package=textrecipes.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.