References

Allaire, JJ, and François Chollet. 2020. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.

Appleby, Austin. 2008. “MurmurHash.” https://sites.google.com/site/murmurhash.

Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 1–20. https://journal.r-project.org/archive/2017/RJ-2017-035/index.html.

Bender, Emily M. 2011. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6 (3): 1–26.

———. 2013. “Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax.” Synthesis Lectures on Human Language Technologies 6 (3): 1–184.

Benoit, Kenneth, and Akitaka Matsuo. 2019. Spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://CRAN.R-project.org/package=spacyr.

Benoit, Kenneth, David Muhr, and Kohei Watanabe. 2019. Stopwords: Multilingual Stopword Lists. https://CRAN.R-project.org/package=stopwords.

Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. 1st ed. Boca Raton: CRC Press.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606.

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. “Quantifying and Reducing Stereotypes in Word Embeddings.” CoRR abs/1606.06121. http://arxiv.org/abs/1606.06121.

Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52.

Briscoe, Ted. 2013. “Introduction to Linguistics for Natural Language Processing.” https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf.

Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2018. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks.” http://arxiv.org/abs/1802.08232.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103.

Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.

Edmondson, Mark. 2020. GoogleLanguageR: Call Google’s ’Natural Language’ API, ’Cloud Translation’ API, ’Cloud Speech’ API and ’Cloud Text-to-Speech’ API. https://CRAN.R-project.org/package=googleLanguageR.

Ethayarajh, Kawin, David Duvenaud, and Graeme Hirst. 2019. “Understanding Undesirable Word Embedding Associations.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1696–1705. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1166.

Feldman, R., and J. Sanger. 2007. The Text Mining Handbook. Cambridge University Press.

Forman, George, and Evan Kirshenbaum. 2008. “Extremely Fast Text Feature Extraction for Classification and Indexing.” In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 1221–30. CIKM ’08. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1458082.1458243.

Frank, Eibe, and Remco R. Bouckaert. 2006. “Naive Bayes for Text Classification with Unbalanced Classes.” In Knowledge Discovery in Databases: PKDD 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 503–10. Berlin, Heidelberg: Springer Berlin Heidelberg.

Fredrikson, Matthew, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” In 23rd USENIX Security Symposium (USENIX Security 14), 17–32. San Diego, CA: USENIX Association. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew.

Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. 2015. “Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1322–33. https://doi.org/10.1145/2810103.2813677.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. http://www.jstatsoft.org/v33/i01/.

Gagolewski, Marek. 2019. R Package Stringi: Character String Processing Facilities. http://www.gagolewski.com/software/stringi/.

Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–E3644. https://doi.org/10.1073/pnas.1720347115.

Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but Do Not Remove Them.” CoRR abs/1903.03862. http://arxiv.org/abs/1903.03862.

The Open Group. 2018. “The Open Group Base Specifications Issue 7, 2018 Edition.” https://pubs.opengroup.org/onlinepubs/9699919799/.

Harman, Donna. 1991. “How Effective Is Suffixing?” Journal of the American Society for Information Science 42 (1): 7–15.

Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”

Howard, Jeremy, and Sebastian Ruder. 2018. “Fine-Tuned Language Models for Text Classification.” CoRR abs/1801.06146. http://arxiv.org/abs/1801.06146.

Huang, Weipeng, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. 2019. “Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning.”

Huston, Samuel, and W. Bruce Croft. 2010. “Evaluating Verbose Query Processing Techniques.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 291–98. SIGIR ’10. New York, NY, USA: ACM. https://doi.org/10.1145/1835449.1835499.

Hvitfeldt, Emil. 2019a. Hcandersenr: H.C. Andersen’s Fairy Tales. https://CRAN.R-project.org/package=hcandersenr.

———. 2019b. Scotus: Collection of Supreme Court of the United States’ Opinions. https://github.com/EmilHvitfeldt/scotus.

———. 2020a. Textdata: Download and Load Various Text Datasets. https://github.com/EmilHvitfeldt/textdata.

———. 2020b. Textrecipes: Extra ’Recipes’ for Text Processing. https://CRAN.R-project.org/package=textrecipes.

Islam, Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2016. “Semantics Derived Automatically from Language Corpora Necessarily Contain Human Biases.” CoRR abs/1608.07187. http://arxiv.org/abs/1608.07187.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In Proceedings of the 10th European Conference on Machine Learning, 137–42. ECML’98. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/BFb0026683.

Kibriya, Ashraf M., Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. 2005. “Multinomial Naive Bayes for Text Categorization Revisited.” In AI 2004: Advances in Artificial Intelligence, edited by Geoffrey I. Webb and Xinghuo Yu, 488–99. Berlin, Heidelberg: Springer Berlin Heidelberg.

Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” CoRR abs/1405.4053. http://arxiv.org/abs/1405.4053.

Levy, Omer, and Yoav Goldberg. 2014. “Dependency-Based Word Embeddings.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–8. Baltimore, Maryland: Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-2050.

Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5 (December): 361–97. http://dl.acm.org/citation.cfm?id=1005332.1005345.

Lex, Alexander, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–92.

Lovins, Julie B. 1968. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics 11: 22–31.

Lu, Kaiji, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. “Gender Bias in Neural Natural Language Processing.” CoRR abs/1807.11714. http://arxiv.org/abs/1807.11714.

Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (KWIC Index).” American Documentation 11 (4): 288–95. https://doi.org/10.1002/asi.5090110403.

Ma, Ji, Kuzman Ganchev, and David Weiss. 2018. “State-of-the-Art Chinese Word Segmentation with Bi-LSTMs.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4902–8. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1529.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

McCulloch, Gretchen. 2015. “Move over Shakespeare, Teen Girls Are the Real Language Disruptors.” Quartz. https://qz.com/474671/move-over-shakespeare-teen-girls-are-the-real-language-disruptors/.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Miller, George A. 1995. “WordNet: A Lexical Database for English.” Communications of the ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.

Moody, Chris. 2017. “Stop Using Word2vec.” MultiThreaded. Stitch Fix. https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/.

Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3 (23): 655. https://doi.org/10.21105/joss.00655.

Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 7–12. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2502.

Olson, Randal S., William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems.” http://arxiv.org/abs/1708.05070.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.

Perry, Patrick O. 2020. Corpus: Text Corpus Analysis. https://CRAN.R-project.org/package=corpus.

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” CoRR abs/1802.05365. http://arxiv.org/abs/1802.05365.

Porter, Martin F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37. https://doi.org/10.1108/eb046814.

———. 2001. “Snowball: A Language for Stemming Algorithms.”

“Quantifiers +, *, ? and {n}.” 2019. The Modern JavaScript Tutorial. https://javascript.info/regexp-quantifiers.

Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. 2006. “Some Effective Techniques for Naive Bayes Text Classification.” IEEE Transactions on Knowledge and Data Engineering 18 (11): 1457–66.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1163.

Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.

Selivanov, Dmitriy, and Qing Wang. 2018. Text2vec: Modern Text Mining Framework for R. https://CRAN.R-project.org/package=text2vec.

Sheng, Emily, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. “The Woman Worked as a Babysitter: On Biases in Language Generation.” http://arxiv.org/abs/1909.01326.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). https://doi.org/10.21105/joss.00037.

———. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.

Speer, Robyn. 2017. “How to Make a Racist AI Without Really Trying.” ConceptNet Blog. http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/.

Sugisaki, Kyoko, and Don Tuggener. 2018. “German Compound Splitting Using the Compound Productivity of Morphemes,” October.

Tang, Cheng, Damien Garreau, and Ulrike von Luxburg. 2018. “When Do Random Forests Fail?” In Advances in Neural Information Processing Systems, 2983–93.

Van-Tu, Nguyen, and Le Anh-Cuong. 2016. “Improving Question Classification by Feature Extraction and Selection.” Indian Journal of Science and Technology 9 (May). https://doi.org/10.17485/ijst/2016/v9i17/93160.

Vaughan, Davis. 2020. Slider: Sliding Window Functions. https://CRAN.R-project.org/package=slider.

Vaughan, Davis, and Matt Dancho. 2018. Furrr: Apply Mapping Functions in Parallel Using Futures. https://CRAN.R-project.org/package=furrr.

Wagner, Claudia, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. “Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia.” EPJ Data Science 5 (1): 5.

Weinberger, Kilian, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ICML ’09. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1553374.1553516.

Wickham, Hadley. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.

Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program: Electronic Library and Information Systems 40 (3): 219–23. http://eprints.whiterose.ac.uk/1434/.

Zou, Feng, Fu Lee Wang, Xiaotie Deng, and Song Han. 2006. “Evaluation of Stop Word Lists in Chinese Language,” January.

Zou, Feng, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang. 2006. “Automatic Construction of Chinese Stop Word List.” In Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 1009–14. ACOS’06. Stevens Point, Wisconsin, USA: World Scientific and Engineering Academy and Society (WSEAS). http://dl.acm.org/citation.cfm?id=1973598.1973793.


  1. The English word “raccoon” derives from an Algonquin word meaning “scratches with his hands”!↩︎

  2. On the other hand, the more biased stop word list may be helpful when modeling a corpus with gender imbalance, depending on your goal; words like “she” and “her” can identify where women are mentioned.↩︎

  3. This simple, “weak” stemmer is handy to have in your toolkit for many applications. Notice how we implement it here using dplyr::case_when().↩︎

  4. Part-of-speech information is also sometimes used directly in machine learning.↩︎

  5. Google has since worked to correct this problem.↩︎

  6. The random forest implementation in the ranger package, demonstrated in Section @ref(comparerf), does not handle special characters in column names well.↩︎

  7. In other situations you may do best using a different architecture, for example, when working with dense, tabular data.↩︎