Chapter 3 Stop words

Once we have split text into tokens, it often becomes clear that not all words carry the same amount of information, if any information at all, for a predictive modeling task. Common words that carry little (or perhaps no) meaningful information are called stop words. It is common advice and practice to remove stop words for various NLP tasks, but the task of stop word removal is more nuanced than many resources may lead you to believe. In this chapter, we will investigate what a stop word list is, how such lists differ, and the effects of using them in your preprocessing workflow.

The concept of stop words has a long history, with Hans Peter Luhn credited with coining the term in 1960 (Luhn 1960). Examples of these words in English are “a,” “the,” “of,” and “didn’t.” These words are very common and typically don’t add much to the meaning of a text but instead ensure the structure of a sentence is sound.

Categorizing words as either informative or non-informative is limiting, and we prefer to consider words as having a more fluid or continuous amount of information associated with them. This informativeness is context-specific as well. In fact, stop words themselves are often important in genre or authorship identification.

Historically, one of the main reasons for removing stop words was to decrease the computational time for text mining; it can be regarded as a dimensionality reduction of text data and was commonly used in search engines to give better results (Huston and Croft 2010).

Stop words can have different roles in a corpus. We generally categorize stop words into three groups: global, subject, and document stop words.

Global stop words are words that are almost always low in meaning in a given language; these are words such as “of” and “and” in English that are needed to glue text together. These words are likely a safe bet for removal, but they are low in number. You can find some global stop words in pre-made stop word lists (Section 3.1).

Next up are subject-specific stop words. These words are uninformative for a given subject area. Subjects can be broad like finance and medicine or can be more specific like obituaries, health code violations, and job listings for librarians in Kansas. Words like “bath,” “bedroom,” and “entryway” are generally not considered stop words in English, but they may not provide much information for differentiating suburban house listings and could be subject stop words for certain analyses. You will likely need to manually construct such a stop word list (Section 3.2). These kinds of stop words may improve your performance if you have the domain expertise to create a good list.

Lastly, we have document-level stop words. These words provide little or no information for a given document. They are difficult to classify and usually won’t be worth the trouble to identify. Even if you can find document stop words, it is not obvious how to incorporate this kind of information in a regression or classification task.

3.1 Using premade stop word lists

A quick option for using stop words is to get a list that has already been created. This is appealing because it is not difficult, but be aware that not all lists are created equal. Nothman, Qin, and Yurchak (2018) found some alarming results in a study of 52 stop word lists available in open-source software packages. Among the more grave issues were misspellings (“fify” instead of “fifty”), the inclusion of clearly informative words such as “computer” and “cry,” and internal inconsistencies, such as including the word “has” but not the word “does.” This is not to say that you should never use a stop word list that has been included in an open-source software project. However, you should always inspect and verify the list you are using, both to make sure it hasn’t changed since you used it last, and also to check that it is appropriate for your use case.

There is a broad selection of stop word lists available today. For the purpose of this chapter, we will focus on three of the lists of English stop words provided by the stopwords package (Benoit, Muhr, and Watanabe 2021). The first is from the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, an information retrieval system developed at Cornell University in the 1960s (Lewis et al. 2004). The second is the English Snowball stop word list (Porter 2001), and the last is the English list from the Stopwords ISO collection. These stop word lists are all considered general purpose and not domain-specific.

The stopwords package contains a comprehensive collection of stop word lists in one place for ease of use in analysis and other packages.

Before we start delving into the content inside the lists, let’s take a look at how many words are included in each.

library(stopwords)
length(stopwords(source = "smart"))
length(stopwords(source = "snowball"))
length(stopwords(source = "stopwords-iso"))
#> [1] 571
#> [1] 175
#> [1] 1298

The lengths of these lists are quite different, with the longest list being over seven times longer than the shortest! Let’s examine the overlap of the words that appear in the three lists in an UpSet plot in Figure 3.1. An UpSet plot visualizes intersections and aggregates of intersections of sets using a matrix layout, presenting the number of elements as well as summary statistics.
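The intersection sizes that an UpSet plot displays can also be tabulated directly. The following is a minimal base R sketch, assuming three tiny made-up lists (`small`, `medium`, `large`) that stand in for real stop word lists; the vectors returned by `stopwords(source = ...)` could be placed in the list instead.

```r
# Toy stand-ins for three stop word lists (hypothetical data, not the real lists).
lists <- list(
  small  = c("a", "the", "of"),
  medium = c("a", "the", "of", "and", "she's"),
  large  = c("a", "the", "of", "and", "she's", "computer")
)

# All distinct words across the lists.
vocab <- sort(unique(unlist(lists)))

# Logical membership matrix: one row per word, one column per list.
membership <- sapply(lists, function(l) vocab %in% l)
rownames(membership) <- vocab

# Label each word by the combination of lists that contain it, then count.
pattern <- apply(membership, 1, function(row) {
  paste(names(lists)[row], collapse = " & ")
})
overlap_counts <- table(pattern)
overlap_counts
```

Each entry of `overlap_counts` corresponds to one bar of an UpSet plot: the number of words that belong to exactly that combination of lists.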

The UpSet plot in Figure 3.1 shows us that these three lists are almost true subsets of each other. The only exception is a set of 10 words that appear in Snowball and ISO but not in the SMART list. What are those words?

setdiff(stopwords(source = "snowball"),
        stopwords(source = "smart"))
#>  [1] "she's"   "he'd"    "she'd"   "he'll"   "she'll"  "shan't"  "mustn't"
#>  [8] "when's"  "why's"   "how's"

All these words are contractions. This is not because the SMART lexicon doesn’t include contractions; if we look, there are almost 50 of them.

library(stringr)
str_subset(stopwords(source = "smart"), "'")
#>  [1] "a's"       "ain't"     "aren't"    "c'mon"     "c's"       "can't"
#>  [7] "couldn't"  "didn't"    "doesn't"   "don't"     "hadn't"    "hasn't"
#> [13] "haven't"   "he's"      "here's"    "i'd"       "i'll"      "i'm"
#> [19] "i've"      "isn't"     "it'd"      "it'll"     "it's"      "let's"
#> [25] "shouldn't" "t's"       "that's"    "there's"   "they'd"    "they'll"
#> [31] "they're"   "they've"   "wasn't"    "we'd"      "we'll"     "we're"
#> [37] "we've"     "weren't"   "what's"    "where's"   "who's"     "won't"
#> [43] "wouldn't"  "you'd"     "you'll"    "you're"    "you've"

We seem to have stumbled upon an inconsistency: why does SMART include "he's" but not "she's"? It is hard to say, but this could be worth rectifying before applying these stop word lists to an analysis or model preprocessing. This stop word list was likely generated by selecting the most frequent words across a large corpus of text that had more representation for text about men than women. This is once again a reminder that we should always look carefully at any pre-made word list or other artifact we use to make sure it works well with our needs (see footnote 1).
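One way to rectify such an inconsistency is to add the missing words before using the list. The sketch below uses base R only; `smart_excerpt` is a short excerpt copied from the SMART output above rather than the full list, and in practice you would start from `stopwords(source = "smart")`.

```r
# Ten contractions that the output above shows in Snowball but not in SMART.
missing_from_smart <- c("she's", "he'd", "she'd", "he'll", "she'll",
                        "shan't", "mustn't", "when's", "why's", "how's")

# A short excerpt of SMART contractions, copied from the output above;
# in practice this would be the full vector from stopwords(source = "smart").
smart_excerpt <- c("he's", "i'd", "i'll", "they'd", "they'll", "you'd", "you'll")

# union() appends only the words that are not already present.
patched <- union(smart_excerpt, missing_from_smart)

# Check that the pair that prompted the fix is now treated consistently.
all(c("he's", "she's") %in% patched)
```

Whether to add the missing contractions or to remove their counterparts is a modeling decision; the point is that the list should treat such pairs the same way.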

When you select a stop word list, it is important that you consider its size and breadth. Having a small and concise list of words can moderately reduce your token count while not having too great an influence on your models, assuming that you picked appropriate words. As the size of your stop word list grows, each word added will have a diminishing positive effect, with an increasing risk that a meaningful word has been placed on the list by mistake. In Section 6.4, we show the effects of different stop word lists on model training.

3.1.1 Stop word removal in R

Now that we have seen stop word lists, we can move forward with removing these words. The particular way we remove stop words depends on the shape of our data. If you have your text in a tidy format with one word per row, you can use filter() from dplyr with a negated %in% if you have the stop words as a vector, or you can use anti_join() from dplyr if the stop words are in a tibble(). Like in our previous chapter, let’s examine the text of “The Fir-Tree” by Hans Christian Andersen, and use tidytext to tokenize the text into words.

library(hcandersenr)
library(tidyverse)
library(tidytext)

fir_tree <- hca_fairytales() %>%
  filter(book == "The fir tree",
         language == "English")

tidy_fir_tree <- fir_tree %>%
  unnest_tokens(word, text)

Let’s use the Snowball stop word list as an example. Since stopwords() returns the stop words as a vector, we will use filter().

tidy_fir_tree %>%
  filter(!(word %in% stopwords(source = "snowball")))
#> # A tibble: 1,547 × 3
#>    book         language word
#>    <chr>        <chr>    <chr>
#>  1 The fir tree English  far
#>  2 The fir tree English  forest
#>  3 The fir tree English  warm
#>  4 The fir tree English  sun
#>  5 The fir tree English  fresh
#>  6 The fir tree English  air
#>  7 The fir tree English  made
#>  8 The fir tree English  sweet
#>  9 The fir tree English  resting
#> 10 The fir tree English  place
#> # … with 1,537 more rows

If we use the get_stopwords() function from tidytext instead, then we can use the anti_join() function.

tidy_fir_tree %>%
  anti_join(get_stopwords(source = "snowball"), by = "word")
#> # A tibble: 1,547 × 3
#>    book         language word
#>    <chr>        <chr>    <chr>
#>  1 The fir tree English  far
#>  2 The fir tree English  forest
#>  3 The fir tree English  warm
#>  4 The fir tree English  sun
#>  5 The fir tree English  fresh
#>  6 The fir tree English  air
#>  7 The fir tree English  made
#>  8 The fir tree English  sweet
#>  9 The fir tree English  resting
#> 10 The fir tree English  place
#> # … with 1,537 more rows

The result of these two stop word removals is the same since we used the same stop word list in both cases.
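If your tokens are not in a tidy data frame but in a list with one character vector per document, the same idea applies. Below is a minimal base R sketch; the documents, the stop word vector, and the helper `remove_stop_words()` are all hypothetical and only illustrate the shape of the operation.

```r
# Hypothetical tokenized documents stored as a named list of character vectors.
docs <- list(
  doc1 = c("far", "down", "in", "the", "forest"),
  doc2 = c("the", "sun", "and", "the", "fresh", "air")
)

# A tiny illustrative stop word vector (an assumption; swap in a real list).
stop_words <- c("the", "in", "and", "down")

# Keep only tokens that are not in the stop word vector, preserving the
# order of the tokens and any repeated tokens within each document.
remove_stop_words <- function(tokens, stop_words) {
  tokens[!(tokens %in% stop_words)]
}

cleaned <- lapply(docs, remove_stop_words, stop_words = stop_words)
cleaned
```

Logical indexing is used here instead of `setdiff()` because `setdiff()` would also drop repeated tokens, which changes the word counts of each document.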

3.2 Creating your own stop words list

Another way to get a stop word list is to create one yourself. Let’s explore a few different ways to find appropriate words to use. We will use the tokenized data from “The Fir-Tree” as a first example. Let’s take the words and rank them by their count or frequency.

1: the          31: its         61: or          91: last
2: and          32: out         62: shall       92: much
3: tree         33: be          63: there       93: no
4: it           34: them        64: while       94: princess
5: a            35: this        65: will        95: tall
6: in           36: branches    66: after       96: young
7: of           37: came        67: by
8: to           38: for         68: come        98: can
9: i            39: now         69: happy       99: could
10: was         40: one         70: my          100: cried
11: they        41: story       71: old         101: going
12: fir         42: would       72: only        102: grew
13: were        43: forest      73: their       103: if
14: all         44: have        74: which       104: large
15: with        45: how         75: again       105: looked
16: but         46: know        76: am
17: on          47: thought     77: are         107: many
18: then        48: mice        78: beautiful   108: seen
                49: trees       79: evening     109: stairs
20: is          50: we          80: him         110: think
21: at          51: been        81: like        111: too
22: little      52: down        82: me          112: up
23: so          53: oh          83: more        113: yes
24: not         54: very                        114: air
25: said        55: when        85: christmas   115: also
26: what        56: where       86: do          116: away
27: as          57: who         87: fell        117: birds
28: that        58: children    88: fresh       118: corner
29: he          59: dumpty      89: from        119: cut
30: you         60: humpty      90: here        120: did

We recognize many of what we would consider stop words among the top-ranked words here, with three big exceptions. We see "tree" at 3, "fir" at 12, and "little" at 22. These words appear high on our list, but they do provide valuable information as they all reference the main character. What went wrong with this approach? Creating a stop word list using high-frequency words works best when it is created on a corpus of documents, not an individual document. This is because the words found in a single document will be document-specific and the overall pattern of words will not generalize that well.
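A frequency ranking like the one discussed above can be produced by counting tokens and sorting the counts. Here is a minimal base R sketch on a made-up sentence (not the actual fairy tale); with the tidy data used in this chapter, counting the `word` column of `tidy_fir_tree` would play the same role.

```r
# A toy "document" (hypothetical text, not the actual fairy tale).
text <- "The tree stood in the forest, and the tree was little."

# Lowercase and split on anything that is not a letter or apostrophe.
tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

# Count each word and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)

ranked <- data.frame(
  rank = seq_along(freq),
  word = names(freq),
  n    = as.integer(freq)
)
head(ranked, 3)
```

Even in this tiny example, the document-specific word "tree" ranks directly behind "the", which is the same effect that puts "tree" and "fir" near the top of the single-document ranking.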

In NLP, a corpus is a set of texts or documents. The set of Hans Christian Andersen’s fairy tales can be considered a corpus, with each fairy tale a document within that corpus. The set of United States Supreme Court opinions can be considered a different corpus, with each written opinion being a document within that corpus. Both data sets are described in more detail in Appendix B.

The word "tree" does seem important as it is about the main character, but it could also be appearing so often that it stops providing any information. Let’s try a different approach, extracting high-frequency words from the corpus of all English fairy tales by H.C. Andersen.

1: the          31: one         61: no          91: people
2: and          32: there       62: been        92: time
3: of           33: him                         93: before
4: a            34: from        64: over        94: day
5: to           35: have        65: where       95: other
6: in           36: little      66: an          96: stood
7: was          37: then        67: how         97: too
8: it           38: which       68: only        98: went
9: he           39: them        69: came        99: come
10: that        40: this        70: or          100: never
11: i           41: old         71: down        101: much
12: she         42: out         72: great       102: house
                43: could       73: good        103: know
14: his         44: when        74: do          104: every
15: they        45: into        75: more        105: looked
16: but         46: now         76: here        106: many
17: as          47: who         77: its         107: again
18: her         48: my          78: did         108: eyes
19: with        49: their       79: man         109: our
20: for         50: by          80: see         110: quite
21: is          51: we          81: can         111: young
22: on          52: will        82: through     112: even
23: said        53: like        83: beautiful   113: shall
24: you         54: are         84: must        114: tree
25: not         55: what        85: has         115: go
26: were        56: if          86: away        116: your
27: so          57: me          87: thought     117: long
28: all         58: up          88: still       118: upon
29: be          59: very        89: than        119: two
30: at          60: would       90: well        120: water

This list is more appropriate for our concept of stop words, and now it is time for us to make some choices. How many do we want to include in our stop word list? Which words should we add and/or remove based on prior information? Selecting the number of words to remove is best done on a case-by-case basis, as it can be difficult to determine a priori how many different “meaningless” words appear in a corpus. Our suggestion is to start with a low number like 20 and increase by 10 words until you get to words that are not appropriate as stop words for your analytical purpose.
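The "start at 20 and increase by 10" suggestion can be organized as a series of review batches. The helper below, `review_batches()`, is a hypothetical name introduced only for this sketch; it assumes a non-empty character vector of words already sorted from most to least frequent.

```r
# Split a frequency-ranked word vector into review batches:
# the first `start` words, then successive groups of `step` words.
review_batches <- function(ranked_words, start = 20, step = 10) {
  n <- length(ranked_words)
  stopifnot(n > 0)
  start <- min(start, n)
  # Upper bounds of each batch: start, start + step, ..., and finally n.
  upper <- unique(c(seq(start, n, by = step), n))
  # Lower bounds: 1 for the first batch, then one past the previous upper bound.
  lower <- c(1, head(upper, -1) + 1)
  Map(function(lo, hi) ranked_words[lo:hi], lower, upper)
}

# Placeholder words standing in for a real ranked vocabulary.
ranked_words <- sprintf("word%02d", 1:45)
batches <- review_batches(ranked_words)
lengths(batches)
```

You would read the batches in order and stop at the first one that contains words you consider informative; the stop word list is then the concatenation of the batches accepted so far.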

It is worth keeping in mind that such a list is not perfect. Depending on how your text was generated or processed, strange tokens can surface as possible stop words due to encoding or optical character recognition errors. Further, these results are based on the corpus of documents we have available, which is potentially biased. In our example here, all the fairy tales were written by the same European white man from the early 1800s.

This bias can be minimized by removing words we would expect to be over-represented or adding words we expect to be under-represented.

An easy way to start is to include the complements of the words in the list if they are not already present. Include “big” if “small” is present, “old” if “young” is present. This example list has words associated with women often listed lower in rank than words associated with men. With "man" being at rank 79 but "woman" at rank 179, choosing a threshold of 100 would lead to only one of these words being included. Depending on how important you think such nouns are going to be in your texts, consider either adding "woman" or deleting "man" (see footnote 2).
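Checking for missing complements can be partly automated. The following base R sketch assumes a hand-written table of word pairs; both the candidate list and the `pairs` table are hypothetical and would need to be adapted to your own corpus.

```r
# Candidate stop words from a frequency threshold (hypothetical example).
candidates <- c("the", "and", "he", "his", "man", "old", "little")

# Pairs we want treated consistently; this pairing table is an assumption
# that you would adapt to your own corpus and goals.
pairs <- data.frame(
  word       = c("he",  "his", "man",   "old",   "small"),
  complement = c("she", "her", "woman", "young", "big"),
  stringsAsFactors = FALSE
)

# For every pair with exactly one member present, report the absent member.
has_word <- pairs$word %in% candidates
has_comp <- pairs$complement %in% candidates
to_review <- c(pairs$complement[has_word & !has_comp],
               pairs$word[has_comp & !has_word])
to_review

# One option is to add the missing members; the other is to drop their partners.
balanced <- union(candidates, to_review)
```

The sketch only flags the imbalance. Whether to add the absent word or remove its partner still depends on how informative such words are for your task.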

Figure 3.2 shows how the words associated with men have a higher rank than the words associated with women. By using a single threshold to create a stop word list, you would likely only include one form of such words.

Imagine now we would like to create a stop word list that spans multiple different genres, in such a way that the subject-specific stop words don’t overlap. For this case, we would like a word to be denoted as a stop word only if it is a stop word in all the genres. You could find the words individually in each genre and then take the intersection across genres. However, that approach might take a substantial amount of time.

Below is a bad approach where we try to create a multi-language list of stop words. To accomplish this we calculate the inverse document frequency (IDF) of each word. The IDF of a word is a quantity that is low for commonly used words in a collection of documents and high for words not used often in a collection of documents. It is typically defined as
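The intersection approach described above could be sketched as follows in base R, assuming you have already built one candidate list per genre. The three genre lists here are invented for illustration.

```r
# Hypothetical per-genre stop word candidates.
genre_stop_words <- list(
  finance  = c("the", "of", "and", "stock", "market"),
  medicine = c("the", "of", "and", "patient", "dose"),
  recipes  = c("the", "and", "of", "cup", "stir")
)

# Fold intersect() over the list to keep only words present in every genre.
shared <- Reduce(intersect, genre_stop_words)
shared
```

The code itself is short; the time-consuming part is constructing a good candidate list for each genre in the first place.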

$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$

If the word “dog” appears in 4 out of 100 documents, then idf("dog") = log(100/4) = 3.22, and if the word “cat” appears in 99 out of 100 documents, then idf("cat") = log(100/99) = 0.01. Notice how the IDF goes to zero the more documents a term is contained in; when a term appears in all the documents, its IDF is exactly log(100/100) = log(1) = 0. What happens if we create a stop word list based on words with the lowest IDF? The following function takes a tokenized data frame and returns a data frame with one row per word, with a column for the word and a column for its IDF.

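library(rlang)

# Compute the inverse document frequency of every distinct word in `df`.
# `word` and `document` are unquoted column names.
calc_idf <- function(df, word, document) {
  # All distinct words in the data
  words <- df %>% pull({{word}}) %>% unique()
  # Number of distinct documents
  n_docs <- length(unique(pull(df, {{document}})))
  # Count the documents containing each word. nest() groups by every column
  # other than `word`, so those columns are assumed to identify a document.
  n_words <- df %>%
    nest(data = c({{word}})) %>%
    pull(data) %>%
    map_dfc(~ words %in% unique(pull(.x, {{word}}))) %>%
    rowSums()

  tibble(word = words,
         idf = log(n_docs / n_words))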
}
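The arithmetic in the worked example can be checked directly, and the same quantity can be computed on a toy collection without any packages. This is a base R sketch, separate from `calc_idf()` above; the four documents are made up for illustration.

```r
# Check the worked arithmetic: log() in R is the natural logarithm.
round(log(100 / 4), 2)    # "dog" in 4 of 100 documents
#> [1] 3.22
round(log(100 / 99), 2)   # "cat" in 99 of 100 documents
#> [1] 0.01
log(100 / 100)            # a term that appears in every document
#> [1] 0

# Hypothetical toy documents: a named list of token vectors.
docs <- list(
  d1 = c("the", "dog", "barked"),
  d2 = c("the", "cat", "slept"),
  d3 = c("the", "cat", "and", "the", "dog"),
  d4 = c("the", "bird")
)

# Document frequency: the number of documents that contain each word,
# counting a word once per document no matter how often it repeats.
vocab <- unique(unlist(docs))
doc_freq <- vapply(vocab, function(w) {
  sum(vapply(docs, function(d) w %in% d, logical(1)))
}, numeric(1))

idf <- log(length(docs) / doc_freq)

# Lowest IDF first: these are the stop word candidates.
sort(idf)
```

In the toy collection, "the" appears in all four documents and gets an IDF of 0, while words that occur in a single document get the highest value, log(4).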

Here is the result when we try to create a cross-language list of stop words, by taking each fairy tale as a document. It is not very good!

The overlap between words that appear in each language is very small, but these words are what we mostly see in this list.

1: a            31: all         61: has         91: con
2: de           32: oh          62: los         92: una
3: man          33: will        63: by          93: be
4: en           34: am          64: as          94: they
5: da           35: la          65: not         95: one
6: se           36: sang        66: end         96: como
7: es           37: le          67: fast        97: pero
8: an           38: des         68: hat         98: them
9: in           39: y           69: see
10: her         40: un          70: but         100: vi
11: me          41: que         71: from        101: das
12: so          42: on          72: is          102: his
13: no          43: men         73: and         103: les
14: i           44: stand       74: o           104: sagte
15: for         45: al          75: alt         105: ist
16: den         46: si          76: war         106: ein
17: at          47: son         77: ni          107: und
18: der         48: han         78: su          108: zu
19: was         49: ser         79: time        109: para
20: du          50: et          80: von         110: sol
21: er          51: lo          81: hand        111: auf
22: dem         52: die         82: the         112: sie
23: over        53: just        83: that        113: nicht
24: sin         54: bien        84: it          114: aber
25: he          55: vor         85: of          115: sich
26: alle        56: las         86: there       116: then
27: ja          57: del         87: sit         117: were
28: have        58: still       88: with        118: said
29: to          59: land        89: por         119: into
30: mit         60: under       90: el          120: más

This didn’t work very well because there is very little overlap between common words across languages. Instead, let us limit the calculation to a single language and rank the words by IDF, so that the words appearing in the most documents come first.

1: a            31: said        61: been        91: here
2: the          32: into        62: down        92: well
3: and          33: by          63: would       93: through
4: to           34: have        64: where       94: day
5: in           35: which       65: or          95: too
6: that         36: this        66: she         96: people
7: it           37: up          67: can         97: own
8: but          38: out         68: could       98: come
9: of           39: what                        99: its
10: was         40: who         70: her         100: whole
11: as          41: no          71: will        101: just
12: there       42: an          72: time        102: many
13: on          43: now         73: good        103: never
14: at          44: i           74: must
15: is          45: only        75: my          105: stood
16: for         46: old         76: than        106: yet
17: with        47: like        77: away        107: looked
18: all         48: when        78: more        108: again
19: not         49: if          79: has         109: say
20: they        50: little      80: thought     110: may
21: one         51: over        81: did         111: yes
22: he          52: are         82: other       112: went
23: his         53: very        83: still       113: every
24: so          54: you         84: do          114: each
25: them        55: him         85: even        115: such
26: be          56: we          86: before      116: world
27: from        57: great       87: me          117: some
                58: how         88: know        118: long
29: then        59: their       89: much        119: eyes
30: were        60: came        90: see         120: go

This time we get better results. The list starts with “a,” “the,” “and,” and “to” and continues with many more reasonable choices of stop words. We need to look at these results manually to turn this into a list. We need to go as far down in rank as we are comfortable with. You as a data practitioner are in full control of how you want to create the list. If you don’t want to include “little” you are still able to add “are” to your list even though it is lower on the list.

3.3 All stop word lists are context-specific

Context is important in text modeling, so it is important to ensure that the stop word lexicon you use reflects the word space that you are planning on using it in. One common concern to consider is how pronouns bring information to your text. Pronouns are included in many different stop word lists (although inconsistently), but they will often not be noise in text data. Similarly, Bender et al. (2021) discuss how a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words” was used to filter and remove text before training a trillion-parameter large language model, to protect it from learning offensive language, but the authors point out that in some community contexts, such words are reclaimed or used to describe marginalized identities.

On the other hand, sometimes you will have to add in words yourself, depending on the domain. If you are working with texts for dessert recipes, certain ingredients (sugar, eggs, water) and actions (whisking, baking, stirring) may be frequent enough to pass your stop word threshold, but you may want to keep them as they may be informative. Throwing away “eggs” as a common word would make it harder or downright impossible to determine if certain recipes are vegan or not while whisking and stirring may be fine to remove as distinguishing between recipes that do and don’t require a whisk might not be that big of a deal.
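One simple way to act on this is to keep a short list of protected words that are never treated as stop words, regardless of their frequency. The sketch below is base R with invented data; which words deserve protection is an assumption that depends entirely on your modeling goal.

```r
# Words that crossed a frequency threshold in a hypothetical recipe corpus.
frequent <- c("the", "and", "sugar", "eggs", "water", "whisking", "stirring")

# Words judged informative for the task (here: detecting vegan recipes).
# This allowlist is an assumption that depends on your modeling goal.
protected <- c("sugar", "eggs", "water")

# Remove the protected words from the stop word candidates.
stop_words <- setdiff(frequent, protected)
stop_words
```

Keeping the protected words in their own vector documents the decision and makes it easy to revisit when the task changes.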

3.4 What happens when you remove stop words

We have discussed different ways of finding and removing stop words; now let’s see what happens once you do remove them. First, let’s explore the impact of the number of words that are included in the list. Figure 3.3 shows what percentage of words are removed as a function of the number of words in a text. The different colors represent the three different stop word lists we have considered in this chapter.

We notice, as we would predict, that larger stop word lists remove more words than shorter stop word lists. In this example with fairy tales, over half of the words have been removed, with the largest list removing over 80% of the words. We observe that shorter texts have a lower percentage of stop words. Since we are looking at fairy tales, this could be explained by the fact that a story has to be told regardless of the length of the fairy tale, so shorter texts are going to be denser with more informative words.
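The percentage of tokens removed by a given list is straightforward to compute for your own texts. Below is a minimal base R sketch; `percent_removed()` is a hypothetical helper, and the tokens and the two lists are made up to show the effect of list size.

```r
# Share of tokens (as a percentage) that a stop word list would remove.
percent_removed <- function(tokens, stop_words) {
  100 * sum(tokens %in% stop_words) / length(tokens)
}

# Hypothetical tokens and two illustrative lists of different sizes.
tokens <- c("the", "fir", "tree", "stood", "in", "the",
            "forest", "and", "was", "little")
short_list <- c("the", "and")
long_list  <- c("the", "and", "in", "was", "little")

percent_removed(tokens, short_list)
#> [1] 30
percent_removed(tokens, long_list)
#> [1] 60
```

Applying such a function per document, and plotting the result against document length, gives the kind of comparison described above.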

Another problem you may face is dealing with misspellings.

Most premade stop word lists assume that all the words are spelled correctly.

Handling misspellings when using premade lists can be done by manually adding common misspellings. You could imagine creating all words that are a certain string distance away from the stop words, but we do not recommend this as you would quickly include informative words this way.
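To see why string distance is a blunt tool here, consider the neighbors of a single stop word. This sketch uses `adist()` from base R, which computes a generalized Levenshtein distance; the vocabulary is invented and contains one misspelling ("teh") next to ordinary words.

```r
# Edit-distance neighbors of a stop word, using base R's adist().
stop_word <- "the"

# A hypothetical vocabulary containing one misspelling and several real words.
vocabulary <- c("teh", "thee", "then", "tie", "toe", "she", "tree")

# Generalized Levenshtein distance from the stop word to each vocabulary word.
distances <- drop(adist(stop_word, vocabulary))
names(distances) <- vocabulary
distances

# Everything within distance 1 would be swept into the stop word list.
names(distances)[distances <= 1]
```

With these default costs a swapped pair of letters counts as two edits, so the misspelling "teh" is at distance 2 while "tie" and "toe" are at distance 1. A threshold wide enough to catch the misspelling would also catch "tree".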

One of the downsides of creating your own stop word lists using frequencies is that you are limited to using words that you have already observed. It could happen that “she’d” is included in your training corpus but the word “he’d” did not reach the threshold. This is a case where you need to look at your words and adjust accordingly. Here the large premade stop word lists can serve as inspiration for missing words.

In Section 6.4, we investigate the influence of removing stop words in the context of modeling. Given the right list of words, we see no harm to the model performance, and sometimes find improvement due to noise reduction (Feldman and Sanger 2007).

3.5 Stop words in languages other than English

So far in this chapter, we have focused on English stop words, but English is not representative of every language. The notion of “short” and “long” lists we have used so far is specific to English as a language. You should expect different languages to have a different number of “uninformative” words, and for this number to depend on the morphological richness of a language; lists that contain all possible morphological variants of each stop word could become quite large.

Different languages have different numbers of words in each class of words. An example is how the grammatical case influences the articles used in German. The following tables show the use of definite and indefinite articles in German. Notice how German nouns have three genders (masculine, feminine, and neuter), which is not uncommon in languages around the world. Articles are almost always considered to be stop words in English as they carry very little information. However, German articles give some indication of the case, which can be used when selecting a list of stop words in German.

German Definite Articles (the)
Case         Masculine  Feminine  Neuter  Plural
Nominative   der        die       das     die
Accusative   den        die       das     die
Dative       dem        der       dem     den
Genitive     des        der       des     der

German Indefinite Articles (a/an)
Case         Masculine  Feminine  Neuter
Nominative   ein        eine      ein
Accusative   einen      eine      ein
Dative       einem      einer     einem
Genitive     eines      einer     eines
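To make the size difference concrete, the article forms in the tables above can be collected into a vector and deduplicated. This is a small base R sketch; the forms are transcribed from the tables, and the variable names are only illustrative.

```r
# Article forms transcribed from the tables above
# (one line per case; columns follow the order of the tables).
definite <- c("der", "die", "das", "die",
              "den", "die", "das", "die",
              "dem", "der", "dem", "den",
              "des", "der", "des", "der")
indefinite <- c("ein",   "eine",  "ein",
                "einen", "eine",  "ein",
                "einem", "einer", "einem",
                "eines", "einer", "eines")

# Distinct forms that a German stop word list would need to cover.
german_articles <- sort(unique(c(definite, indefinite)))
german_articles
length(german_articles)
#> [1] 12
```

English covers the same ground with three forms ("the", "a", and "an"), whereas German needs twelve distinct ones, which is one reason stop word lists for morphologically richer languages tend to be longer.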

Building lists of stop words in Chinese has been done both manually and automatically, but so far none has been accepted as a standard (Zou, Wang, Deng, and Han 2006; Zou, Wang, Deng, Han, and Wang 2006). A full discussion of stop word identification in Chinese text would be out of scope for this book, so we will just highlight some of the challenges that differentiate it from English.

Chinese text is much more complex than portrayed here. With different systems and billions of users, there is much we won’t be able to touch on here.

The main difference from English is the use of logograms instead of letters to convey information. However, Chinese characters should not be confused with Chinese words. The majority of words in modern Chinese are composed of multiple characters. This means that inferring the presence of words is more complicated, and the notion of stop words will affect how this segmentation of characters is done.

3.6 Summary

In many standard NLP workflows, the removal of stop words is presented as a default or the correct choice without comment. Although removing stop words can improve the accuracy of your machine learning using text data, choices around such a step are complex. The content of existing stop word lists varies tremendously, and the available strategies for building your own can have subtle to not-so-subtle effects on your model results.

3.6.1 In this chapter, you learned:

• what a stop word is and how to remove stop words from text data

• how different stop word lists can vary

• that the impact of stop word removal is different for different kinds of texts

• about the bias built in to stop word lists and strategies for building such lists

References

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. FAccT ’21. New York, NY: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.
Benoit, K., Muhr, D., and Watanabe, K. 2021. stopwords: Multilingual Stopword Lists. R package version 2.2. https://CRAN.R-project.org/package=stopwords.
Feldman, R., and Sanger, J. 2007. The Text Mining Handbook. Cambridge: Cambridge University Press.
Huston, S., and Croft, W. B. 2010. “Evaluating Verbose Query Processing Techniques.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 291–298. SIGIR ’10. New York, NY: ACM. http://doi.acm.org/10.1145/1835449.1835499.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. 2004. “Rcv1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5: 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., and Pfister, H. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–1992. https://doi.org/10.1109/TVCG.2014.2346248.
Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (kwic Index).” American Documentation 11 (4): 288–295. https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.5090110403.
Mohammad, S. M., and Turney, P. D. 2013. “Crowdsourcing a Word–Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–465. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x.
Nothman, J., Qin, H., and Yurchak, R. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 7–12. Melbourne, Australia: Association for Computational Linguistics. https://www.aclweb.org/anthology/W18-2502.
Porter, M. F. 2001. “Snowball: A Language for Stemming Algorithms.” https://snowballstem.org.
Zou, F., Wang, F. L., Deng, X., and Han, S. 2006. “Evaluation of Stop Word Lists in Chinese Language.” In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC06). Genoa, Italy: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2006/pdf/273_pdf.pdf.
Zou, F., Wang, F. L., Deng, X., Han, S., and Wang, L. S. 2006. “Automatic Construction of Chinese Stop Word List.” In Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 1009–1014. ACOS’06. Stevens Point, Wisconsin: World Scientific and Engineering Academy and Society (WSEAS). http://dl.acm.org/citation.cfm?id=1973598.1973793.

1. This advice applies to any kind of pre-made lexicon or word list, not just stop words. For instance, the same concerns apply to sentiment lexicons. The NRC sentiment lexicon of Mohammad and Turney (2013) associates the word “white” with trust and the word “black” with sadness, which could have unintended consequences when analyzing text about racial groups.

2. On the other hand, the more biased stop word list may be helpful when modeling a corpus with gender imbalance, depending on your goal; words like “she” and “her” can identify where women are mentioned.