References

2007. Boost C++ Libraries. https://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html.
Allaire, JJ, and François Chollet. 2020. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Appleby, Austin. 2008. “MurmurHash.” https://sites.google.com/site/murmurhash.
Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 1–20. https://journal.r-project.org/archive/2017/RJ-2017-035/index.html.
Bates, Douglas, and Martin Maechler. 2019. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.
Bender, Emily M. 2011. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6 (3): 1–26.
———. 2013. “Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax.” Synthesis Lectures on Human Language Technologies 6 (3): 1–184.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.
Benoit, Kenneth, and Akitaka Matsuo. 2019. Spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://CRAN.R-project.org/package=spacyr.
Benoit, Kenneth, David Muhr, and Kohei Watanabe. 2019. Stopwords: Multilingual Stopword Lists. https://CRAN.R-project.org/package=stopwords.
Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. 1st ed. Boca Raton: CRC Press.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606.
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. “Quantifying and Reducing Stereotypes in Word Embeddings.” CoRR abs/1606.06121. http://arxiv.org/abs/1606.06121.
Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52.
Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC Press.
Briscoe, Ted. 2013. “Introduction to Linguistics for Natural Language Processing.” https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf.
Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2018. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks.” http://arxiv.org/abs/1802.08232.
Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103.
Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.
Edmondson, Mark. 2020. googleLanguageR: Call Google’s ’Natural Language’ API, ’Cloud Translation’ API, ’Cloud Speech’ API and ’Cloud Text-to-Speech’ API. https://CRAN.R-project.org/package=googleLanguageR.
Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14 (2): 179–211. https://doi.org/10.1016/0364-0213(90)90002-E.
Ethayarajh, Kawin, David Duvenaud, and Graeme Hirst. 2019. “Understanding Undesirable Word Embedding Associations.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1696–1705. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1166.
Feathers, Todd. 2019. “Flawed Algorithms Are Grading Millions of Students’ Essays.” Motherboard. VICE. https://www.vice.com/en/article/pa7dj9/flawed-algorithms-are-grading-millions-of-students-essays.
Feldman, R., and J. Sanger. 2007. The Text Mining Handbook. Cambridge University Press.
Forman, George, and Evan Kirshenbaum. 2008. “Extremely Fast Text Feature Extraction for Classification and Indexing.” In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 1221–30. CIKM ’08. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1458082.1458243.
Frank, Eibe, and Remco R. Bouckaert. 2006. “Naive Bayes for Text Classification with Unbalanced Classes.” In Knowledge Discovery in Databases: PKDD 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 503–10. Berlin, Heidelberg: Springer Berlin Heidelberg.
Fredrikson, Matthew, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” In 23rd USENIX Security Symposium (USENIX Security 14), 17–32. San Diego, CA: USENIX Association. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew.
Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. 2015. “Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1322–33. https://doi.org/10.1145/2810103.2813677.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. http://www.jstatsoft.org/v33/i01/.
Gage, P. 1994. “A New Algorithm for Data Compression.” The C Users Journal Archive 12: 23–38.
Gagolewski, Marek. 2019. R Package Stringi: Character String Processing Facilities. http://www.gagolewski.com/software/stringi/.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44. https://doi.org/10.1073/pnas.1720347115.
Golub, G. H., and C. Reinsch. 1970. “Singular Value Decomposition and Least Squares Solutions.” Numer. Math. 14 (5): 403–20. https://doi.org/10.1007/BF02163027.
Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but Do Not Remove Them.” CoRR abs/1903.03862. http://arxiv.org/abs/1903.03862.
Group, The Open. 2018. “The Open Group Base Specifications Issue 7, 2018 Edition.” https://pubs.opengroup.org/onlinepubs/9699919799/.
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Dino Pedreschi, and Fosca Giannotti. 2018. “A Survey of Methods for Explaining Black Box Models.” http://arxiv.org/abs/1802.01933.
Harman, Donna. 1991. “How Effective Is Suffixing?” Journal of the American Society for Information Science 42 (1): 7–15.
Helleputte, Thibault. 2017. LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Comput. 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”
Howard, Jeremy, and Sebastian Ruder. 2018. “Fine-Tuned Language Models for Text Classification.” CoRR abs/1801.06146. http://arxiv.org/abs/1801.06146.
Huang, Weipeng, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. 2019. “Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning.”
Huston, Samuel, and W. Bruce Croft. 2010. “Evaluating Verbose Query Processing Techniques.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 291–98. SIGIR ’10. New York, NY, USA: ACM. https://doi.org/10.1145/1835449.1835499.
Hvitfeldt, Emil. 2019a. Hcandersenr: H.C. Andersens Fairy Tales. https://CRAN.R-project.org/package=hcandersenr.
———. 2019b. Scotus: Collection of Supreme Court of the United States’ Opinions. https://github.com/EmilHvitfeldt/scotus.
———. 2020a. Textdata: Download and Load Various Text Datasets. https://github.com/EmilHvitfeldt/textdata.
———. 2020b. Textrecipes: Extra ’Recipes’ for Text Processing. https://CRAN.R-project.org/package=textrecipes.
———. 2020c. Wordsalad: Provide Tools to Extract and Analyze Word Vectors. https://CRAN.R-project.org/package=wordsalad.
Islam, Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2016. “Semantics Derived Automatically from Language Corpora Necessarily Contain Human Biases.” CoRR abs/1608.07187. http://arxiv.org/abs/1608.07187.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In Proceedings of the 10th European Conference on Machine Learning, 137–42. ECML’98. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/BFb0026683.
Johnson, Stephen B. 1999. “A Semantic Lexicon for Medical Language Processing.” Journal of the American Medical Informatics Association 6 (3): 205–18.
Kearney, Michael W. 2019. Textfeatures: Extracts Features from Text. https://CRAN.R-project.org/package=textfeatures.
Kibriya, Ashraf M., Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. 2005. “Multinomial Naive Bayes for Text Categorization Revisited.” In AI 2004: Advances in Artificial Intelligence, edited by Geoffrey I. Webb and Xinghuo Yu, 488–99. Berlin, Heidelberg: Springer Berlin Heidelberg.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–51. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1181.
Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980.
Lampinen, Andrew K., and James L. McClelland. 2018. “One-Shot and Few-Shot Learning of Word Embeddings.” http://arxiv.org/abs/1710.10280.
Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” CoRR abs/1405.4053. http://arxiv.org/abs/1405.4053.
Levy, Omer, and Yoav Goldberg. 2014. “Dependency-Based Word Embeddings.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–8. Baltimore, Maryland: Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-2050.
Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” J. Mach. Learn. Res. 5 (December): 361–97. http://dl.acm.org/citation.cfm?id=1005332.1005345.
Lex, Alexander, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–92.
Lingolia. 2021. “Articles in German Grammar.” https://deutsch.lingolia.com/en/grammar/nouns-and-articles/articles-noun-markers.
Lovins, Julie B. 1968. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics 11: 22–31.
Lu, Kaiji, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. “Gender Bias in Neural Natural Language Processing.” CoRR abs/1807.11714. http://arxiv.org/abs/1807.11714.
Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (KWIC Index).” American Documentation 11 (4): 288–95. https://doi.org/10.1002/asi.5090110403.
Ma, Ji, Kuzman Ganchev, and David Weiss. 2018. “State-of-the-Art Chinese Word Segmentation with Bi-LSTMs.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4902–8. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1529.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
McCulloch, Gretchen. 2015. “Move over Shakespeare, Teen Girls Are the Real Language Disruptors.” Quartz. https://qz.com/474671/move-over-shakespeare-teen-girls-are-the-real-language-disruptors/.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Miller, George A. 1995. “WordNet: A Lexical Database for English.” Commun. ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.
Minaee, Shervin, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. “Deep Learning Based Text Classification: A Comprehensive Review.” arXiv Preprint arXiv:2004.03705.
Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word–Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–65. https://doi.org/10.1111/j.1467-8640.2012.00460.x.
Moody, Chris. 2017. “Stop Using Word2vec.” Multithreaded. StitchFix. https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/.
Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3: 655. https://doi.org/10.21105/joss.00655.
Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 7–12. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2502.
Olson, Randal S., William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems.” http://arxiv.org/abs/1708.05070.
Ooms, Jeroen. 2020. Pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.
Perry, Patrick O. 2020. Corpus: Text Corpus Analysis. https://CRAN.R-project.org/package=corpus.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” CoRR abs/1802.05365. http://arxiv.org/abs/1802.05365.
Porter, Martin F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37. https://doi.org/10.1108/eb046814.
———. 2001. “Snowball: A Language for Stemming Algorithms.”
“Quantifiers +, *, ? and {n}.” 2019. The Modern JavaScript Tutorial. https://javascript.info/regexp-quantifiers.
Ramineni, Chaitanya, and David Williamson. 2018. “Understanding Mean Score Differences Between the e-Rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test.” ETS Research Report Series 2018 (1): 1–31. https://doi.org/10.1002/ets2.12192.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” http://arxiv.org/abs/1602.04938.
Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. 2006. “Some Effective Techniques for Naive Bayes Text Classification.” IEEE Transactions on Knowledge and Data Engineering 18 (11): 1457–66.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1163.
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.
Selivanov, Dmitriy, and Qing Wang. 2018. Text2vec: Modern Text Mining Framework for R. https://CRAN.R-project.org/package=text2vec.
Sheng, Emily, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. “The Woman Worked as a Babysitter: On Biases in Language Generation.” http://arxiv.org/abs/1909.01326.
Shrikumar, Avanti, Peyton Greenside, and Anshul Kundaje. 2019. “Learning Important Features Through Propagating Activation Differences.” http://arxiv.org/abs/1704.02685.
Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” http://arxiv.org/abs/1703.00810.
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). https://doi.org/10.21105/joss.00037.
———. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.
Speer, Robyn. 2017. “How to Make a Racist AI Without Really Trying.” ConceptNet Blog. http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (56): 1929–58. http://jmlr.org/papers/v15/srivastava14a.html.
Sugisaki, Kyoko, and Don Tuggener. 2018. “German Compound Splitting Using the Compound Productivity of Morphemes,” October.
Sweeney, Latanya. 2000. “Simple Demographics Often Identify People Uniquely.” Health (San Francisco) 671 (2000): 1–34.
Tang, Cheng, Damien Garreau, and Ulrike von Luxburg. 2018. “When Do Random Forests Fail?” In Advances in Neural Information Processing Systems, 2983–93.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
“Unicode Text Segmentation.” 2019. https://www.unicode.org/reports/tr29/tr29-35.html#Default_Word_Boundaries.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2020. Reticulate: Interface to ’Python’. https://github.com/rstudio/reticulate.
Van-Tu, Nguyen, and Le Anh-Cuong. 2016. “Improving Question Classification by Feature Extraction and Selection.” Indian Journal of Science and Technology 9 (May). https://doi.org/10.17485/ijst/2016/v9i17/93160.
Vaughan, Davis. 2020. Slider: Sliding Window Functions. https://CRAN.R-project.org/package=slider.
Vaughan, Davis, and Matt Dancho. 2018. Furrr: Apply Mapping Functions in Parallel Using Futures. https://CRAN.R-project.org/package=furrr.
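Vosoughi, Soroush, Prashanth Vijayaraghavan, and Deb Roy. 2016. “Tweet2Vec: Learning Tweet Embeddings Using Character-Level CNN-LSTM Encoder-Decoder.” In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1041–44. https://doi.org/10.1145/2911451.2914762.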
Wagner, Claudia, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. “Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia.” EPJ Data Science 5 (1): 5.
Weinberger, Kilian, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ICML ’09. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1553374.1553516.
Wickham, Hadley. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2020. Httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.
Wickham, Hadley, and Jim Hester. 2020. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program: Electronic Library and Information Systems 40 (3): 219–23. http://eprints.whiterose.ac.uk/1434/.
Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-Level Convolutional Networks for Text Classification.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, 649–57. NIPS’15. Cambridge, MA, USA: MIT Press.
Zou, Feng, Fu Lee Wang, Xiaotie Deng, and Song Han. 2006. “Evaluation of Stop Word Lists in Chinese Language,” January.
Zou, Feng, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang. 2006. “Automatic Construction of Chinese Stop Word List.” In Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 1009–14. ACOS’06. Stevens Point, Wisconsin, USA: World Scientific and Engineering Academy and Society (WSEAS). http://dl.acm.org/citation.cfm?id=1973598.1973793.