References

Allaire, JJ, and François Chollet. 2020. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.

Appleby, Austin. 2008. “MurmurHash.” https://sites.google.com/site/murmurhash.

Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 1–20. https://journal.r-project.org/archive/2017/RJ-2017-035/index.html.

Bender, Emily M. 2011. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6 (3): 1–26.

———. 2013. “Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax.” Synthesis Lectures on Human Language Technologies 6 (3): 1–184.

Benoit, Kenneth, and Akitaka Matsuo. 2019. Spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://CRAN.R-project.org/package=spacyr.

Benoit, Kenneth, David Muhr, and Kohei Watanabe. 2019. Stopwords: Multilingual Stopword Lists. https://CRAN.R-project.org/package=stopwords.

Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. 1st ed. Boca Raton: CRC Press.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606.

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. “Quantifying and Reducing Stereotypes in Word Embeddings.” CoRR abs/1606.06121. http://arxiv.org/abs/1606.06121.

Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52.

Briscoe, Ted. 2013. “Introduction to Linguistics for Natural Language Processing.” https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf.

Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2018. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks.” http://arxiv.org/abs/1802.08232.

Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103.

Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.

Edmondson, Mark. 2020. GoogleLanguageR: Call Google’s ’Natural Language’ API, ’Cloud Translation’ API, ’Cloud Speech’ API and ’Cloud Text-to-Speech’ API. https://CRAN.R-project.org/package=googleLanguageR.

Ethayarajh, Kawin, David Duvenaud, and Graeme Hirst. 2019. “Understanding Undesirable Word Embedding Associations.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1696–1705. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1166.

Feldman, R., and J. Sanger. 2007. The Text Mining Handbook. Cambridge University Press.

Forman, George, and Evan Kirshenbaum. 2008. “Extremely Fast Text Feature Extraction for Classification and Indexing.” In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 1221–30. CIKM ’08. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1458082.1458243.

Frank, Eibe, and Remco R. Bouckaert. 2006. “Naive Bayes for Text Classification with Unbalanced Classes.” In Knowledge Discovery in Databases: PKDD 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 503–10. Berlin, Heidelberg: Springer Berlin Heidelberg.

Fredrikson, Matthew, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” In 23rd USENIX Security Symposium (USENIX Security 14), 17–32. San Diego, CA: USENIX Association. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/fredrikson_matthew.

Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. 2015. “Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1322–33. https://doi.org/10.1145/2810103.2813677.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. http://www.jstatsoft.org/v33/i01/.

Gagolewski, Marek. 2019. R Package Stringi: Character String Processing Facilities. http://www.gagolewski.com/software/stringi/.

Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–E3644. https://doi.org/10.1073/pnas.1720347115.

Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but Do Not Remove Them.” CoRR abs/1903.03862. http://arxiv.org/abs/1903.03862.

The Open Group. 2018. “The Open Group Base Specifications Issue 7, 2018 Edition.” https://pubs.opengroup.org/onlinepubs/9699919799/.

Harman, Donna. 1991. “How Effective Is Suffixing?” Journal of the American Society for Information Science 42 (1): 7–15.

Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”

Howard, Jeremy, and Sebastian Ruder. 2018. “Fine-Tuned Language Models for Text Classification.” CoRR abs/1801.06146. http://arxiv.org/abs/1801.06146.

Huang, Weipeng, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. 2019. “Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning.”

Huston, Samuel, and W. Bruce Croft. 2010. “Evaluating Verbose Query Processing Techniques.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 291–98. SIGIR ’10. New York, NY, USA: ACM. https://doi.org/10.1145/1835449.1835499.

Hvitfeldt, Emil. 2019a. Hcandersenr: H.C. Andersen’s Fairy Tales. https://CRAN.R-project.org/package=hcandersenr.

———. 2019b. Scotus: Collection of Supreme Court of the United States’ Opinions. https://github.com/EmilHvitfeldt/scotus.

———. 2020a. Textdata: Download and Load Various Text Datasets. https://github.com/EmilHvitfeldt/textdata.

———. 2020b. Textrecipes: Extra ’Recipes’ for Text Processing. https://CRAN.R-project.org/package=textrecipes.

Islam, Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2016. “Semantics Derived Automatically from Language Corpora Necessarily Contain Human Biases.” CoRR abs/1608.07187. http://arxiv.org/abs/1608.07187.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In Proceedings of the 10th European Conference on Machine Learning, 137–42. ECML’98. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/BFb0026683.

Kibriya, Ashraf M., Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. 2005. “Multinomial Naive Bayes for Text Categorization Revisited.” In AI 2004: Advances in Artificial Intelligence, edited by Geoffrey I. Webb and Xinghuo Yu, 488–99. Berlin, Heidelberg: Springer Berlin Heidelberg.

Le, Quoc V., and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” CoRR abs/1405.4053. http://arxiv.org/abs/1405.4053.

Levy, Omer, and Yoav Goldberg. 2014. “Dependency-Based Word Embeddings.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–8. Baltimore, Maryland: Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-2050.

Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5 (December): 361–97. http://dl.acm.org/citation.cfm?id=1005332.1005345.

Lex, Alexander, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–92.

Lovins, Julie B. 1968. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics 11: 22–31.

Lu, Kaiji, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. “Gender Bias in Neural Natural Language Processing.” CoRR abs/1807.11714. http://arxiv.org/abs/1807.11714.

Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (KWIC Index).” American Documentation 11 (4): 288–95. https://doi.org/10.1002/asi.5090110403.

Ma, Ji, Kuzman Ganchev, and David Weiss. 2018. “State-of-the-Art Chinese Word Segmentation with Bi-LSTMs.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4902–8. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1529.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

McCulloch, Gretchen. 2015. “Move over Shakespeare, Teen Girls Are the Real Language Disruptors.” Quartz. https://qz.com/474671/move-over-shakespeare-teen-girls-are-the-real-language-disruptors/.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Miller, George A. 1995. “WordNet: A Lexical Database for English.” Communications of the ACM 38 (11): 39–41. https://doi.org/10.1145/219717.219748.

Moody, Chris. 2017. “Stop Using Word2vec.” MultiThreaded. Stitch Fix. https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/.

Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3 (23): 655. https://doi.org/10.21105/joss.00655.

Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 7–12. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-2502.

Olson, Randal S., William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2017. “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems.” http://arxiv.org/abs/1708.05070.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Empirical Methods in Natural Language Processing (EMNLP), 1532–43. http://www.aclweb.org/anthology/D14-1162.

Perry, Patrick O. 2020. Corpus: Text Corpus Analysis. https://CRAN.R-project.org/package=corpus.

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” CoRR abs/1802.05365. http://arxiv.org/abs/1802.05365.

Porter, Martin F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–37. https://doi.org/10.1108/eb046814.

———. 2001. “Snowball: A Language for Stemming Algorithms.”

“Quantifiers +, *, ? and {n}.” 2019. The Modern JavaScript Tutorial. https://javascript.info/regexp-quantifiers.

Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. 2006. “Some Effective Techniques for Naive Bayes Text Classification.” IEEE Transactions on Knowledge and Data Engineering 18 (11): 1457–66.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1163.

Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.

Selivanov, Dmitriy, and Qing Wang. 2018. Text2vec: Modern Text Mining Framework for R. https://CRAN.R-project.org/package=text2vec.

Sheng, Emily, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. “The Woman Worked as a Babysitter: On Biases in Language Generation.” http://arxiv.org/abs/1909.01326.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). https://doi.org/10.21105/joss.00037.

———. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.

Speer, Robyn. 2017. “How to Make a Racist AI Without Really Trying.” ConceptNet Blog. http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/.

Sugisaki, Kyoko, and Don Tuggener. 2018. “German Compound Splitting Using the Compound Productivity of Morphemes,” October.

Tang, Cheng, Damien Garreau, and Ulrike von Luxburg. 2018. “When Do Random Forests Fail?” In Advances in Neural Information Processing Systems, 2983–93.

Van-Tu, Nguyen, and Le Anh-Cuong. 2016. “Improving Question Classification by Feature Extraction and Selection.” Indian Journal of Science and Technology 9 (May). https://doi.org/10.17485/ijst/2016/v9i17/93160.

Vaughan, Davis. 2020. Slider: Sliding Window Functions. https://CRAN.R-project.org/package=slider.

Vaughan, Davis, and Matt Dancho. 2018. Furrr: Apply Mapping Functions in Parallel Using Futures. https://CRAN.R-project.org/package=furrr.

Wagner, Claudia, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. “Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia.” EPJ Data Science 5 (1): 5.

Weinberger, Kilian, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ICML ’09. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1553374.1553516.

Wickham, Hadley. 2019. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.

Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program: Electronic Library and Information Systems 40 (3): 219–23. http://eprints.whiterose.ac.uk/1434/.

Zou, Feng, Fu Lee Wang, Xiaotie Deng, and Song Han. 2006. “Evaluation of Stop Word Lists in Chinese Language,” January.

Zou, Feng, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang. 2006. “Automatic Construction of Chinese Stop Word List.” In Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 1009–14. ACOS’06. Stevens Point, Wisconsin, USA: World Scientific and Engineering Academy and Society (WSEAS). http://dl.acm.org/citation.cfm?id=1973598.1973793.


  1. The English word “raccoon” derives from an Algonquin word meaning “scratches with his hands”!↩︎

  2. On the other hand, the more biased stop word list may be helpful when modeling a corpus with gender imbalance, depending on your goal; words like “she” and “her” can identify where women are mentioned.↩︎

  3. This simple, “weak” stemmer is handy to have in your toolkit for many applications. Notice how we implement it here using dplyr::case_when().↩︎

  4. Part-of-speech information is also sometimes used directly in machine learning.↩︎

  5. Google has since worked to correct this problem.↩︎

  6. The random forest implementation in the ranger package, demonstrated in Section @ref(comparerf), does not handle special characters in column names well.↩︎

  7. In other situations you may do best using a different architecture, for example, when working with dense, tabular data.↩︎