In Chapters 6 and 7, we use algorithms such as regularized linear models, support vector machines, and naive Bayes models to predict outcomes from predictors including text data. Deep learning models approach the same tasks and have the same goals, but the algorithms involved are different. Deep learning models are “deep” in the sense that they use multiple layers to learn how to map from input features to output outcomes; this is in contrast to the kinds of models we used in the previous two chapters which use a shallow (single) mapping.
Deep learning models can be effective for text prediction problems because they use these multiple layers to capture complex relationships in language.
The layers in a deep learning model are connected in a network and these models are called neural networks, although they do not work much like a human brain. The layers can be connected in different configurations called network architectures. Two of the most common architectures used for text data are a convolutional neural network (CNN), and long short-term memory (LSTM)9. These architectures sometimes incorporate word embeddings, as described in Chapter 5.
For the following chapters, we will use tidymodels packages along with Tensorflow and the R interface to Keras (Allaire and Chollet 2020) for preprocessing, modeling, and evaluation. Table 7.1 presents some key differences between deep learning and what, in this book, we call machine learning methods.
|Machine learning||Deep learning|
|Faster to train||Takes more time to train|
|Software is typically easier to install||Software can be more challenging to install|
|Can achieve good performance with less data||Requires more data for good performance|
|Depends on preprocessing to model more than very simple relationships||Can model highly complex relationships|
Deep learning and more traditional machine learning algorithms are different, but the structure of the modeling process is largely the same, no matter what the specific details of prediction or algorithm are.
Spending your data budget
A limited amount of data is available for any given modeling project, and this data must be allocated to different tasks to balance competing priorities. We espouse an approach of first splitting data in testing and training sets, holding the testing set back until all modeling tasks are completed, including feature engineering and tuning. This testing set is then used as a final check on model performance, to estimate how the final model will perform on new data.
The training data is available for tasks from model parameter estimation to determining which features are important and more. To compare or tune model options or parameters, this training set can be further split so that models can be evaluated on a validation set, or it can be resampled as described in Section 6.1.2 to create new simulated datasets for the purpose of evaluation.
Text data requires extensive processing to be appropriate for modeling, whether via an algorithm like regularized regression or a neural network. Chapters 1 through 5 covered several of the most important techniques that are used to transform language into a representation appropriate for computation. This feature engineering part of the modeling process can be intensive for text, sometimes more computationally expensive than the fitting a model algorithm. We espouse an approach of implementing feature engineering on training data only, typically using resampled datasets, to avoid obtaining an overly optimistic estimate of model performance.
Fitting and tuning
Many different kinds of models are appropriate for text data, from more straightforward models like the linear models explored deeply in Chapter 6 to the neural network models we cover in Chapters ?? and 9. Some of these models have hyperparameters which cannot be learned from data during fitting, like the regularization parameter of the models in Chapter 6; these hyperparameters can be tuned using resampled datasets.
Once models are trained and perhaps tuned, we can evaluate their performance quantitatively using metrics appropriate for the kind of practical problem being dealt with. Model explanation analysis, such as feature importance, also helps us understand how well and why models are behaving the way they do.
Putting the model process in context
This outline of the model process depends on us as data practitioners coming prepared for modeling with a healthy understanding of our datasets from exploratory data analysis. Silge and Robinson (2017) provide a guide for exploratory data analysis for text.
Also, in practice, the structure of a real modeling project is iterative. After fitting and tuning a first model or set of a models, a practitioner will often return to build more or better features, then refit models, and evaluate in a more detailed way. Notice that we take this approach in each chapter, both for more straightforward machine learning and deep learning; we start with a simpler model and then go back again and again to improve it in several ways. This iterative approach is healthy and appropriate, as long as good practices in data “spending” are observed. The testing set cannot be used during this iterative back-and-forth, and using resampled datasets can set us as practitioners up for more accurate estimates of performance.
Allaire, JJ, and François Chollet. 2020. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.
In other situations you may do best using a different architecture, for example, when working with dense, tabular data.↩︎