Chapter 3 Stop words
Once we have split text into tokens, it often becomes clear that not all words carry the same amount of information, if any information at all, for a predictive modeling task. Common words that carry little (or perhaps no) meaningful information are called stop words. It is common advice and practice to remove stop words for various NLP tasks, but the task of stop word removal is more nuanced than many resources may lead you to believe. In this chapter, we will investigate what stop word lists are, the differences among them, and the effects of using them in your preprocessing workflow.
The concept of stop words has a long history with Hans Peter Luhn credited with coining the term in 1960 (Luhn 1960). Examples of these words in English are “a,” “the,” “of,” and “didn’t.” These words are very common and typically don’t add much to the meaning of a text but instead ensure the structure of a sentence is sound.
Categorizing words as either informative or non-informative is limiting, and we prefer to consider words as having a more fluid or continuous amount of information associated with them. This informativeness is context-specific as well. In fact, stop words themselves are often important in genre or authorship identification.
Historically, one of the main reasons for removing stop words was to decrease the computational time for text mining; it can be regarded as a dimensionality reduction of text data and was commonly used in search engines to give better results (Huston and Croft 2010).
Stop words can have different roles in a corpus. We generally categorize stop words into three groups: global, subject, and document stop words.
Global stop words are words that are almost always low in meaning in a given language; these are words such as “of” and “and” in English that are needed to glue text together. These words are likely a safe bet for removal, but they are low in number. You can find some global stop words in pre-made stop word lists (Section 3.1).
Next up are subject-specific stop words. These words are uninformative for a given subject area. Subjects can be broad like finance and medicine or can be more specific like obituaries, health code violations, and job listings for librarians in Kansas. Words like “bath,” “bedroom,” and “entryway” are generally not considered stop words in English, but they may not provide much information for differentiating suburban house listings and could be subject stop words for certain analyses. You will likely need to manually construct such a stop word list (Section 3.2). These kinds of stop words may improve your performance if you have the domain expertise to create a good list.
Lastly, we have document-level stop words. These words do not provide any or much information for a given document. They are difficult to classify, and it is rarely worth the trouble to identify them. Even if you can find document stop words, it is not obvious how to incorporate this kind of information into a regression or classification task.
3.1 Using premade stop word lists
A quick option for using stop words is to get a list that has already been created. This is appealing because it is not difficult, but be aware that not all lists are created equal. Nothman, Qin, and Yurchak (2018) found some alarming results in a study of 52 stop word lists available in open-source software packages. Among some of the more grave issues were misspellings (“fify” instead of “fifty”), the inclusion of clearly informative words such as “computer” and “cry,” and internal inconsistencies, such as including the word “has” but not the word “does.” This is not to say that you should never use a stop word list that has been included in an open-source software project. However, you should always inspect and verify the list you are using, both to make sure it hasn’t changed since you used it last, and also to check that it is appropriate for your use case.
There is a broad selection of stop word lists available today. For the purpose of this chapter, we will focus on three of the lists of English stop words provided by the stopwords package (Benoit, Muhr, and Watanabe 2021). The first is from the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, an information retrieval system developed at Cornell University in the 1960s (Lewis et al. 2004). The second is the English Snowball stop word list (Porter 2001), and the last is the English list from the Stopwords ISO collection. These stop word lists are all considered general purpose and not domain-specific.
The stopwords package contains a comprehensive collection of stop word lists in one place for ease of use in analysis and other packages.
Before we start delving into the content inside the lists, let’s take a look at how many words are included in each.
library(stopwords)
length(stopwords(source = "smart"))
#> [1] 571
length(stopwords(source = "snowball"))
#> [1] 175
length(stopwords(source = "stopwords-iso"))
#> [1] 1298
The lengths of these lists are quite different, with the longest list being over seven times longer than the shortest! Let’s examine the overlap of the words that appear in the three lists in an UpSet plot in Figure 3.1. An UpSet plot (Lex et al. 2014) visualizes intersections and aggregates of intersections of sets using a matrix layout, presenting the number of elements as well as summary statistics.
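A minimal sketch of how a similar UpSet plot could be drawn with the UpSetR package (the package choice and the stopword_sets name are our own assumptions; the figure in this chapter may have been created differently):

library(UpSetR)

stopword_sets <- list(
  smart = stopwords(source = "smart"),
  snowball = stopwords(source = "snowball"),
  iso = stopwords(source = "stopwords-iso")
)

# fromList() turns the named list into a binary membership data frame
upset(fromList(stopword_sets))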
The UpSet plot in Figure 3.1 shows us that these three lists are almost true subsets of each other. The only exception is a set of 10 words that appear in Snowball and ISO but not in the SMART list. What are those words?
setdiff(stopwords(source = "snowball"),
        stopwords(source = "smart"))
#> [1] "she's" "he'd" "she'd" "he'll" "she'll" "shan't" "mustn't"
#> [8] "when's" "why's" "how's"
All these words are contractions. This is not because the SMART lexicon doesn’t include contractions; if we look, there are almost 50 of them.
library(stringr)
str_subset(stopwords(source = "smart"), "'")
#> [1] "a's" "ain't" "aren't" "c'mon" "c's" "can't"
#> [7] "couldn't" "didn't" "doesn't" "don't" "hadn't" "hasn't"
#> [13] "haven't" "he's" "here's" "i'd" "i'll" "i'm"
#> [19] "i've" "isn't" "it'd" "it'll" "it's" "let's"
#> [25] "shouldn't" "t's" "that's" "there's" "they'd" "they'll"
#> [31] "they're" "they've" "wasn't" "we'd" "we'll" "we're"
#> [37] "we've" "weren't" "what's" "where's" "who's" "won't"
#> [43] "wouldn't" "you'd" "you'll" "you're" "you've"
We seem to have stumbled upon an inconsistency: why does SMART include "he's" but not "she's"? It is hard to say, but this could be worth rectifying before applying these stop word lists to an analysis or model preprocessing. This stop word list was likely generated by selecting the most frequent words across a large corpus of text that had more representation for text about men than women. This is once again a reminder that we should always look carefully at any pre-made word list or other artifact we use to make sure it works well with our needs.³
It is perfectly acceptable to start with a premade word list and remove or append additional words according to your particular use case.
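For instance, a minimal sketch of patching the inconsistency we just found (the object name my_stopwords is our own):

library(stopwords)

# Add the missing "she's" to the SMART list...
my_stopwords <- union(stopwords(source = "smart"), "she's")

# ...or, alternatively, drop "he's" so that neither contraction is removed
# my_stopwords <- setdiff(stopwords(source = "smart"), "he's")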
When you select a stop word list, it is important that you consider its size and breadth. Having a small and concise list of words can moderately reduce your token count while not having too great of an influence on your models, assuming that you picked appropriate words. As the size of your stop word list grows, each word added will have a diminishing positive effect with the increasing risk that a meaningful word has been placed on the list by mistake. In Section 6.4, we show the effects of different stop word lists on model training.
3.1.1 Stop word removal in R
Now that we have seen stop word lists, we can move forward with removing these words. The particular way we remove stop words depends on the shape of our data. If you have your text in a tidy format with one word per row, you can use filter() from dplyr with a negated %in% if you have the stop words as a vector, or you can use anti_join() from dplyr if the stop words are in a tibble(). Like in our previous chapter, let’s examine the text of “The Fir-Tree” by Hans Christian Andersen, and use tidytext to tokenize the text into words.
library(hcandersenr)
library(tidyverse)
library(tidytext)
fir_tree <- hca_fairytales() %>%
  filter(book == "The fir tree",
         language == "English")

tidy_fir_tree <- fir_tree %>%
  unnest_tokens(word, text)
Let’s use the Snowball stop word list as an example. Since the stop words return from this function as a vector, we will use filter().
tidy_fir_tree %>%
  filter(!(word %in% stopwords(source = "snowball")))
#> # A tibble: 1,547 × 3
#> book language word
#> <chr> <chr> <chr>
#> 1 The fir tree English far
#> 2 The fir tree English forest
#> 3 The fir tree English warm
#> 4 The fir tree English sun
#> 5 The fir tree English fresh
#> 6 The fir tree English air
#> 7 The fir tree English made
#> 8 The fir tree English sweet
#> 9 The fir tree English resting
#> 10 The fir tree English place
#> # … with 1,537 more rows
If we use the get_stopwords() function from tidytext instead, then we can use the anti_join() function.
tidy_fir_tree %>%
  anti_join(get_stopwords(source = "snowball"))
#> # A tibble: 1,547 × 3
#> book language word
#> <chr> <chr> <chr>
#> 1 The fir tree English far
#> 2 The fir tree English forest
#> 3 The fir tree English warm
#> 4 The fir tree English sun
#> 5 The fir tree English fresh
#> 6 The fir tree English air
#> 7 The fir tree English made
#> 8 The fir tree English sweet
#> 9 The fir tree English resting
#> 10 The fir tree English place
#> # … with 1,537 more rows
The result of these two stop word removals is the same since we used the same stop word list in both cases.
3.2 Creating your own stop words list
Another way to get a stop word list is to create one yourself. Let’s explore a few different ways to find appropriate words to use. We will use the tokenized data from “The Fir-Tree” as a first example. Let’s take the words and rank them by their count or frequency.
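A minimal sketch of how such a ranking can be computed with dplyr’s count(), reusing the tidy_fir_tree object from above:

tidy_fir_tree %>%
  count(word, sort = TRUE)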
1-10: the, and, tree, it, a, in, of, to, i, was
11-20: they, fir, were, all, with, but, on, then, had, is
21-30: at, little, so, not, said, what, as, that, he, you
31-40: its, out, be, them, this, branches, came, for, now, one
41-50: story, would, forest, have, how, know, thought, mice, trees, we
51-60: been, down, oh, very, when, where, who, children, dumpty, humpty
61-70: or, shall, there, while, will, after, by, come, happy, my
71-80: old, only, their, which, again, am, are, beautiful, evening, him
81-90: like, me, more, about, christmas, do, fell, fresh, from, here
91-100: last, much, no, princess, tall, young, asked, can, could, cried
101-110: going, grew, if, large, looked, made, many, seen, stairs, think
111-120: too, up, yes, air, also, away, birds, corner, cut, did
We recognize many of what we would consider stop words here, with three big exceptions: "tree" at rank 3, "fir" at rank 12, and "little" at rank 22. These words appear high on our list, but they do provide valuable information as they all reference the main character. What went wrong with this approach? Creating a stop word list using high-frequency words works best when it is created from a corpus of documents, not an individual document, because the words found in a single document will be document-specific and the overall pattern of words will not generalize well.
The word "tree" does seem important as it references the main character, but it could also be appearing so often that it stops providing any information. Let’s try a different approach: extracting high-frequency words from the corpus of all English fairy tales by H.C. Andersen.
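A sketch of how this corpus-wide ranking could be computed (the object name english_fairytales is our own):

english_fairytales <- hca_fairytales() %>%
  filter(language == "English") %>%
  unnest_tokens(word, text)

english_fairytales %>%
  count(word, sort = TRUE)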
1-10: the, and, of, a, to, in, was, it, he, that
11-20: i, she, had, his, they, but, as, her, with, for
21-30: is, on, said, you, not, were, so, all, be, at
31-40: one, there, him, from, have, little, then, which, them, this
41-50: old, out, could, when, into, now, who, my, their, by
51-60: we, will, like, are, what, if, me, up, very, would
61-70: no, been, about, over, where, an, how, only, came, or
71-80: down, great, good, do, more, here, its, did, man, see
81-90: can, through, beautiful, must, has, away, thought, still, than, well
91-100: people, time, before, day, other, stood, too, went, come, never
101-110: much, house, know, every, looked, many, again, eyes, our, quite
111-120: young, even, shall, tree, go, your, long, upon, two, water
This list is more appropriate for our concept of stop words, and now it is time for us to make some choices. How many words do we want to include in our stop word list? Which words should we add and/or remove based on prior information? Selecting the number of words to remove is best done on a case-by-case basis, as it can be difficult to determine a priori how many different “meaningless” words appear in a corpus. Our suggestion is to start with a low number like 20 and increase by 10 words until you get to words that are not appropriate as stop words for your analytical purpose.
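A sketch of this incremental approach, using the corpus-wide counts computed above (the starting threshold of 20 follows the suggestion in the text; the object name candidate_stopwords is our own):

candidate_stopwords <- english_fairytales %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%   # raise n in steps of 10 and re-inspect
  pull(word)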
It is worth keeping in mind that such a list is not perfect. Depending on how your text was generated or processed, strange tokens can surface as possible stop words due to encoding or optical character recognition errors. Further, these results are based on the corpus of documents we have available, which is potentially biased. In our example here, all the fairy tales were written by the same European white man from the early 1800s.
This bias can be minimized by removing words we would expect to be over-represented or by adding words we expect to be under-represented. Easy examples are to include the complements of the words in the list if they are not already present: include “big” if “small” is present, and “old” if “young” is present. This example list has words associated with women often ranked lower than words associated with men. With "man" at rank 79 but "woman" at rank 179, choosing a threshold of 100 would lead to only one of these words being included. Depending on how important you think such nouns are going to be in your texts, consider either adding "woman" or deleting "man".⁴
Figure 3.2 shows how the words associated with men have a higher rank than the words associated with women. By using a single threshold to create a stop word list, you would likely only include one form of such words.
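One way to check this directly is to look up the ranks of paired gendered words; a sketch (the word pairs chosen here are our own examples):

english_fairytales %>%
  count(word, sort = TRUE) %>%
  mutate(rank = row_number()) %>%
  filter(word %in% c("he", "she", "him", "her", "man", "woman"))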
Imagine now that we would like to create a stop word list that spans multiple different genres, in such a way that the subject-specific stop words don’t overlap. In this case, we would like a word to be denoted as a stop word only if it is a stop word in all the genres. You could find the words individually in each genre and take the right intersections, but that approach might take a substantial amount of time.
Below is an approach, which turns out to work poorly, where we try to create a multi-language list of stop words. To accomplish this, we calculate the inverse document frequency (IDF) of each word. The IDF of a word is a quantity that is low for commonly used words in a collection of documents and high for words rarely used in a collection of documents. It is typically defined as
\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]
If the word “dog” appears in 4 out of 100 documents, then it has idf("dog") = ln(100/4) = 3.22, and if the word “cat” appears in 99 out of 100 documents, then it has idf("cat") = ln(100/99) = 0.01. Notice how the IDF value decreases toward zero the more documents a word is contained in; in fact, when a term appears in all the documents, its IDF is exactly zero, since ln(100/100) = ln(1) = 0.
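The same arithmetic in R, where log() is the natural logarithm:

log(100 / 4)   # idf for "dog"
#> [1] 3.218876
log(100 / 99)  # idf for "cat"
#> [1] 0.01005034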
What happens if we create a stop word list based on words with the lowest IDF? The following function takes a tokenized data frame and returns a data frame with a column for the words and a column for their IDF values.
library(rlang)

calc_idf <- function(df, word, document) {
  # All unique words in the data
  words <- df %>% pull({{word}}) %>% unique()
  # Total number of documents
  n_docs <- length(unique(pull(df, {{document}})))
  # For each word, the number of documents that contain it
  n_words <- df %>%
    nest(data = c({{word}})) %>%
    pull(data) %>%
    map_dfc(~ words %in% unique(pull(.x, {{word}}))) %>%
    rowSums()

  tibble(word = words,
         idf = log(n_docs / n_words))
}
Here is the result when we try to create a cross-language list of stop words by taking each fairy tale as a document. It is not very good! Very few words appear in every language, and those few overlapping words are what we mostly see in this list.
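A sketch of the call that could produce such a ranking, treating each fairy tale (book) as a document and sorting by ascending IDF:

hca_fairytales() %>%
  unnest_tokens(word, text) %>%
  calc_idf(word, book) %>%
  arrange(idf)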
1-10: a, de, man, en, da, se, es, an, in, her
11-20: me, so, no, i, for, den, at, der, was, du
21-30: er, dem, over, sin, he, alle, ja, have, to, mit
31-40: all, oh, will, am, la, sang, le, des, y, un
41-50: que, on, men, stand, al, si, son, han, ser, et
51-60: lo, die, just, bien, vor, las, del, still, land, under
61-70: has, los, by, as, not, end, fast, hat, see, but
71-80: from, is, and, o, alt, war, ni, su, time, von
81-90: hand, the, that, it, of, there, sit, with, por, el
91-100: con, una, be, they, one, como, pero, them, had, vi
101-110: das, his, les, sagte, ist, ein, und, zu, para, sol
111-120: auf, sie, nicht, aber, sich, then, were, said, into, más
This didn’t work very well because there is so little overlap between the common words of different languages. Instead, let’s limit the calculation to only one language and calculate the IDF of each word, looking for the words that appear in the most documents.
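A sketch of the corresponding single-language calculation:

hca_fairytales() %>%
  filter(language == "English") %>%
  unnest_tokens(word, text) %>%
  calc_idf(word, book) %>%
  arrange(idf)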
1-10: a, the, and, to, in, that, it, but, of, was
11-20: as, there, on, at, is, for, with, all, not, they
21-30: one, he, his, so, them, be, from, had, then, were
31-40: said, into, by, have, which, this, up, out, what, who
41-50: no, an, now, i, only, old, like, when, if, little
51-60: over, are, very, you, him, we, great, how, their, came
61-70: been, down, would, where, or, she, can, could, about, her
71-80: will, time, good, must, my, than, away, more, has, thought
81-90: did, other, still, do, even, before, me, know, much, see
91-100: here, well, through, day, too, people, own, come, its, whole
101-110: just, many, never, made, stood, yet, looked, again, say, may
111-120: yes, went, every, each, such, world, some, long, eyes, go
This time we get better results. The list starts with “a,” “the,” “and,” and “to” and continues with many more reasonable choices of stop words. We still need to inspect the results manually to turn them into a list, going as far down in rank as we are comfortable with. You as a data practitioner are in full control of how you create the list; you might leave out “little” while still adding “are,” even though it appears lower in the ranking.
3.3 All stop word lists are context-specific
Context is important in text modeling, so it is important to ensure that the stop word lexicon you use reflects the word space that you are planning on using it in. One common concern to consider is how pronouns bring information to your text. Pronouns are included in many different stop word lists (although inconsistently), but they will often not be noise in text data. Similarly, Bender et al. (2021) discuss how a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words” was used to filter and remove text before training a trillion-parameter large language model, to protect it from learning offensive language; the authors point out that in some community contexts, such words are reclaimed or used to describe marginalized identities.
On the other hand, sometimes you will have to add in words yourself, depending on the domain. If you are working with texts for dessert recipes, certain ingredients (sugar, eggs, water) and actions (whisking, baking, stirring) may be frequent enough to pass your stop word threshold, but you may want to keep them as they may be informative. Throwing away “eggs” as a common word would make it harder or downright impossible to determine whether certain recipes are vegan, while “whisking” and “stirring” may be fine to remove, since distinguishing between recipes that do and don’t require a whisk might not be that big of a deal.
3.4 What happens when you remove stop words
We have discussed different ways of finding and removing stop words; now let’s see what happens once you do remove them. First, let’s explore the impact of the number of words that are included in the list. Figure 3.3 shows what percentage of words are removed as a function of the number of words in a text. The different colors represent the three different stop word lists we have considered in this chapter.
We notice, as we would predict, that larger stop word lists remove more words than shorter stop word lists. In this example with fairy tales, over half of the words have been removed, with the largest list removing over 80% of the words. We observe that shorter texts have a lower percentage of stop words. Since we are looking at fairy tales, this could be explained by the fact that a story has to be told regardless of the length of the fairy tale, so shorter texts are going to be denser with more informative words.
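As a rough single-text check, a sketch computing the fraction of tokens in “The Fir-Tree” that each list would remove:

c(smart = "smart", snowball = "snowball", iso = "stopwords-iso") %>%
  map_dbl(~ mean(tidy_fir_tree$word %in% stopwords(source = .x)))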
Another problem you may face is dealing with misspellings.
Most premade stop word lists assume that all the words are spelled correctly.
Handling misspellings when using premade lists can be done by manually adding common misspellings. You could imagine creating all words that are a certain string distance away from the stop words, but we do not recommend this as you would quickly include informative words this way.
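A sketch of extending a premade list this way (the misspellings shown are hypothetical examples):

snowball_plus <- c(stopwords(source = "snowball"),
                   "teh", "adn", "whith")  # hypothetical misspellings of stop words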
One of the downsides of creating your own stop word lists using frequencies is that you are limited to using words that you have already observed. It could happen that “she’d” is included in your training corpus but the word “he’d” did not reach the threshold. This is a case where you need to look at your words and adjust accordingly. Here the large premade stop word lists can serve as inspiration for missing words.
In Section 6.4, we investigate the influence of removing stop words in the context of modeling. Given the right list of words, we see no harm to the model performance, and sometimes find improvement due to noise reduction (Feldman and Sanger 2007).
3.5 Stop words in languages other than English
So far in this chapter, we have focused on English stop words, but English is not representative of every language. The notion of “short” and “long” lists we have used so far is specific to English as a language. You should expect different languages to have different numbers of “uninformative” words, and for this number to depend on the morphological richness of a language; lists that contain all possible morphological variants of each stop word could become quite large.
Different languages have different numbers of words in each word class. An example is how grammatical case influences the articles used in German. The following tables show the definite and indefinite articles in German. Notice how German nouns have three genders (masculine, feminine, and neuter), which is not uncommon in languages around the world. Articles are almost always considered stop words in English, as they carry very little information. German articles, however, give some indication of grammatical case, which can be taken into account when selecting a list of stop words in German.
German Definite Articles (the)

|            | Masculine | Feminine | Neuter | Plural |
|------------|-----------|----------|--------|--------|
| Nominative | der       | die      | das    | die    |
| Accusative | den       | die      | das    | die    |
| Dative     | dem       | der      | dem    | den    |
| Genitive   | des       | der      | des    | der    |

German Indefinite Articles (a/an)

|            | Masculine | Feminine | Neuter |
|------------|-----------|----------|--------|
| Nominative | ein       | eine     | ein    |
| Accusative | einen     | eine     | ein    |
| Dative     | einem     | einer    | einem  |
| Genitive   | eines     | einer    | eines  |
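As a quick check, a sketch that intersects the German Snowball list with the article forms from the tables above (whether every form appears depends on the list):

intersect(stopwords("de", source = "snowball"),
          c("der", "die", "das", "den", "dem", "des",
            "ein", "eine", "einen", "einem", "einer", "eines"))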
Building lists of stop words in Chinese has been done both manually and automatically (Zou, Wang, Deng, Han, and Wang 2006) but so far none has been accepted as a standard (Zou, Wang, Deng, and Han 2006). A full discussion of stop word identification in Chinese text would be out of scope for this book, so we will just highlight some of the challenges that differentiate it from English.
Chinese text is much more complex than portrayed here. With different systems and billions of users, there is much we won’t be able to touch on here.
The main difference from English is the use of logograms instead of letters to convey information. However, Chinese characters should not be confused with Chinese words. The majority of words in modern Chinese are composed of multiple characters. This means that inferring the presence of words is more complicated, and the notion of stop words will affect how this segmentation of characters is done.
3.6 Summary
In many standard NLP workflows, the removal of stop words is presented as a default or the correct choice without comment. Although removing stop words can improve the accuracy of machine learning models built on text data, the choices around such a step are complex. The content of existing stop word lists varies tremendously, and the available strategies for building your own can have subtle to not-so-subtle effects on your model results.
Footnotes

3. This advice applies to any kind of pre-made lexicon or word list, not just stop words. For instance, the same concerns apply to sentiment lexicons. The NRC sentiment lexicon of Mohammad and Turney (2013) associates the word “white” with trust and the word “black” with sadness, which could have unintended consequences when analyzing text about racial groups.

4. On the other hand, the more biased stop word list may be helpful when modeling a corpus with gender imbalance, depending on your goal; words like “she” and “her” can identify where women are mentioned.