References

Allaire, JJ, and François Chollet. 2021. keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Appleby, Austin. 2008. “MurmurHash.” https://sites.google.com/site/murmurhash.
Arnold, Taylor. 2017. “A Tidy Data Model for Natural Language Processing Using cleanNLP.” The R Journal 9 (2): 248–267. doi:10.32614/RJ-2017-035.
Bates, Douglas, and Martin Maechler. 2021. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.
Bender, Emily M. 2011. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6 (3): 1–26.
Bender, Emily M. 2013. “Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax.” Synthesis Lectures on Human Language Technologies 6 (3). Morgan & Claypool Publishers: 1–184.
Bender, Emily M. 2019. “The #BenderRule: On Naming the Languages We Study and Why It Matters.” The Gradient. https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/.
Bender, Emily M, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. FAccT ’21. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3442188.3445922.
Benoit, Kenneth, and Akitaka Matsuo. 2020. spacyr: Wrapper to the ’spaCy’ ’NLP’ Library. https://CRAN.R-project.org/package=spacyr.
Benoit, Kenneth, David Muhr, and Kohei Watanabe. 2021. stopwords: Multilingual Stopword Lists. https://CRAN.R-project.org/package=stopwords.
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. doi:10.21105/joss.00774.
Boehmke, Brad, and Brandon M. Greenwell. 2019. Hands-on Machine Learning with R. 1st ed. Boca Raton: CRC Press.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5: 135–146. doi:10.1162/tacl_a_00051.
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. “Quantifying and Reducing Stereotypes in Word Embeddings.” CoRR abs/1606.06121. http://arxiv.org/abs/1606.06121.
Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–152. COLT ’92. New York, NY, USA: Association for Computing Machinery. doi:10.1145/130385.130401.
Bouchet-Valat, Milan. 2020. SnowballC: Snowball Stemmers Based on the C ’Libstemmer’ UTF-8 Library. https://CRAN.R-project.org/package=SnowballC.
Breiman, Leo, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. 1984. Classification and Regression Trees. Boca Raton: CRC Press.
Briscoe, Ted. 2013. “Introduction to Linguistics for Natural Language Processing.” https://www.cl.cam.ac.uk/teaching/1314/L100/introling.pdf.
Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human-Like Biases.” Science 356 (6334). American Association for the Advancement of Science: 183–186. doi:10.1126/science.aal4230.
Carlini, Nicholas, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks.” In Proceedings of the 28th USENIX Conference on Security Symposium, 267–284. SEC’19. USA: USENIX Association.
Caruana, Rich, Nikos Karampatziakis, and Ainur Yessenalina. 2008. “An Empirical Evaluation of Supervised Learning in High Dimensions.” In Proceedings of the 25th International Conference on Machine Learning, 96–103. ICML ’08. New York, NY, USA: Association for Computing Machinery. doi:10.1145/1390156.1390169.
Chin, Monica. 2020. “These Students Figured Out Their Tests Were Graded by AI.” The Verge. https://www.theverge.com/2020/9/2/21419012/edgenuity-online-class-ai-grading-keyword-mashing-students-school-cheating-algorithm-glitch.
Chollet, F., and J. J. Allaire. 2018. Deep Learning with R. Manning Publications. https://www.manning.com/books/deep-learning-with-r.
Edmondson, Mark. 2020. googleLanguageR: Call Google’s ’Natural Language’ API, ’Cloud Translation’ API, ’Cloud Speech’ API and ’Cloud Text-to-Speech’ API. https://CRAN.R-project.org/package=googleLanguageR.
Elman, Jeffrey L. 1990. “Finding Structure in Time.” Cognitive Science 14 (2): 179–211. doi:10.1207/s15516709cog1402_1.
Ethayarajh, Kawin, David Duvenaud, and Graeme Hirst. 2019. “Understanding Undesirable Word Embedding Associations.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1696–1705. Florence, Italy: Association for Computational Linguistics. doi:10.18653/v1/P19-1166.
Feathers, Todd. 2019. “Flawed Algorithms Are Grading Millions of Students’ Essays.” Motherboard. VICE. https://www.vice.com/en/article/pa7dj9/flawed-algorithms-are-grading-millions-of-students-essays.
Feldman, R., and J. Sanger. 2007. The Text Mining Handbook. Cambridge University Press.
Forman, George, and Evan Kirshenbaum. 2008. “Extremely Fast Text Feature Extraction for Classification and Indexing.” In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 1221–1230. CIKM ’08. New York, NY, USA: Association for Computing Machinery. doi:10.1145/1458082.1458243.
Frank, Eibe, and Remco R. Bouckaert. 2006. “Naive Bayes for Text Classification with Unbalanced Classes.” In Knowledge Discovery in Databases: PKDD 2006, edited by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, 503–510. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/11871637_49.
Fredrikson, Matthew, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. 2014. “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing.” In Proceedings of the 23rd USENIX Conference on Security Symposium, 17–32. SEC’14. USA: USENIX Association.
Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. 2015. “Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures.” In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 1322–1333. CCS ’15. New York, NY, USA: Association for Computing Machinery. doi:10.1145/2810103.2813677.
Friedman, Jerome H., Trevor Hastie, and Rob Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software, Articles 33 (1): 1–22. doi:10.18637/jss.v033.i01.
Gage, P. 1994. “A New Algorithm for Data Compression.” The C Users Journal 12: 23–38.
Gagolewski, Marek. 2020. R Package stringi: Character String Processing Facilities. http://www.gagolewski.com/software/stringi/.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16). National Academy of Sciences: E3635–E3644. doi:10.1073/pnas.1720347115.
Golub, G. H., and C. Reinsch. 1970. “Singular Value Decomposition and Least Squares Solutions.” Numerische Mathematik 14 (5). Berlin, Heidelberg: Springer-Verlag: 403–420. doi:10.1007/BF02163027.
Gonen, Hila, and Yoav Goldberg. 2019. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings but Do Not Remove Them.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 609–614. Minneapolis, Minnesota: Association for Computational Linguistics. doi:10.18653/v1/N19-1061.
Guidotti, Riccardo, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. “A Survey of Methods for Explaining Black Box Models.” ACM Computing Surveys 51 (5). New York, NY, USA: Association for Computing Machinery. doi:10.1145/3236009.
Harman, Donna. 1991. “How Effective Is Suffixing?” Journal of the American Society for Information Science 42 (1): 7–15. doi:10.1002/(SICI)1097-4571(199101)42:1<7::AID-ASI2>3.0.CO;2-P.
Helleputte, Thibault. 2021. LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library. https://CRAN.R-project.org/package=LiblineaR.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8). Cambridge, MA, USA: MIT Press: 1735–1780. doi:10.1162/neco.1997.9.8.1735.
Honnibal, Matthew, and Ines Montani. 2017. “spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing.”
Howard, Jeremy, and Sebastian Ruder. 2018. “Universal Language Model Fine-Tuning for Text Classification.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339. Melbourne, Australia: Association for Computational Linguistics. doi:10.18653/v1/P18-1031.
Huang, Weipeng, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. 2020. “Towards Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning.” In Proceedings of the 28th International Conference on Computational Linguistics, 2062–2072. Barcelona, Spain (Online): International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.186.
Huston, Samuel, and W. Bruce Croft. 2010. “Evaluating Verbose Query Processing Techniques.” In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 291–298. SIGIR ’10. New York, NY, USA: ACM. doi:10.1145/1835449.1835499.
Hvitfeldt, Emil. 2019a. hcandersenr: H.C. Andersens Fairy Tales. https://CRAN.R-project.org/package=hcandersenr.
Hvitfeldt, Emil. 2019b. scotus: Collection of Supreme Court of the United States’ Opinions. https://github.com/EmilHvitfeldt/scotus.
Hvitfeldt, Emil. 2020a. textrecipes: Extra ’Recipes’ for Text Processing. https://CRAN.R-project.org/package=textrecipes.
Hvitfeldt, Emil. 2020b. textdata: Download and Load Various Text Datasets. https://CRAN.R-project.org/package=textdata.
Hvitfeldt, Emil. 2020c. wordsalad: Provide Tools to Extract and Analyze Word Vectors. https://CRAN.R-project.org/package=wordsalad.
Hvitfeldt, Emil. 2020d. themis: Extra Recipes Steps for Dealing with Unbalanced Data. https://CRAN.R-project.org/package=themis.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In Proceedings of the 10th European Conference on Machine Learning, 137–142. ECML’98. Berlin, Heidelberg: Springer-Verlag. doi:10.1007/BFb0026683.
Johnson, Stephen B. 1999. “A Semantic Lexicon for Medical Language Processing.” Journal of the American Medical Informatics Association 6 (3). BMJ Group: 205–218. doi:10.1136/jamia.1999.0060205.
Kearney, Michael W. 2019. textfeatures: Extracts Features from Text. https://CRAN.R-project.org/package=textfeatures.
Kibriya, Ashraf M., Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. 2005. “Multinomial Naive Bayes for Text Categorization Revisited.” In AI 2004: Advances in Artificial Intelligence, edited by Geoffrey I. Webb and Xinghuo Yu, 488–499. Berlin, Heidelberg: Springer Berlin Heidelberg. doi:10.1007/978-3-540-30549-1_43.
Kim, Sang-Bum, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. 2006. “Some Effective Techniques for Naive Bayes Text Classification.” IEEE Transactions on Knowledge and Data Engineering 18 (11): 1457–1466. doi:10.1109/TKDE.2006.180.
Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751. Doha, Qatar: Association for Computational Linguistics. doi:10.3115/v1/D14-1181.
Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980.
Kuhn, Max. 2020. dials: Tools for Creating Tuning Parameter Values. https://CRAN.R-project.org/package=dials.
Kuhn, Max, and Davis Vaughan. 2021a. yardstick: Tidy Characterizations of Model Performance. https://CRAN.R-project.org/package=yardstick.
Kuhn, Max, and Davis Vaughan. 2021b. parsnip: A Common API to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.
Kuhn, Max, and Hadley Wickham. 2020. Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Kuhn, Max, and Hadley Wickham. 2021. recipes: Preprocessing Tools to Create Design Matrices. https://CRAN.R-project.org/package=recipes.
Lampinen, Andrew K., and James L. McClelland. 2018. “One-Shot and Few-Shot Learning of Word Embeddings.” http://arxiv.org/abs/1710.10280.
Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In Proceedings of the 31st International Conference on Machine Learning, edited by Eric P. Xing and Tony Jebara, 32:1188–1196. Proceedings of Machine Learning Research 2. Beijing, China: PMLR. http://proceedings.mlr.press/v32/le14.html.
Goyvaerts, Jan, and Steven Levithan. 2012. Regular Expressions Cookbook. 2nd ed. O’Reilly Media, Inc.
Levy, Omer, and Yoav Goldberg. 2014. “Dependency-Based Word Embeddings.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 302–308. Baltimore, Maryland: Association for Computational Linguistics. doi:10.3115/v1/P14-2050.
Lewis, David D., Yiming Yang, Tony G. Rose, and Fan Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5: 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
Lex, Alexander, Nils Gehlenborg, Hendrik Strobelt, Romain Vuillemot, and Hanspeter Pfister. 2014. “UpSet: Visualization of Intersecting Sets.” IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983–1992. doi:10.1109/TVCG.2014.2346248.
Lovins, Julie B. 1968. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics 11: 22–31.
Lu, Kaiji, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2020. “Gender Bias in Neural Natural Language Processing.” In Logic, Language, and Security: Essays Dedicated to Andre Scedrov on the Occasion of His 65th Birthday, edited by Vivek Nigam, Tajana Ban Kirigin, Carolyn Talcott, Joshua Guttman, Stepan Kuznetsov, Boon Thau Loo, and Mitsuhiro Okada, 189–202. Cham: Springer International Publishing. doi:10.1007/978-3-030-62077-6_14.
Luhn, H. P. 1960. “Key Word-in-Context Index for Technical Literature (KWIC Index).” American Documentation 11 (4): 288–295. doi:10.1002/asi.5090110403.
Ma, Ji, Kuzman Ganchev, and David Weiss. 2018. “State-of-the-Art Chinese Word Segmentation with Bi-LSTMs.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4902–4908. Brussels, Belgium: Association for Computational Linguistics. doi:10.18653/v1/D18-1529.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.
McCulloch, Gretchen. 2015. “Move over Shakespeare, Teen Girls Are the Real Language Disruptors.” Quartz. https://qz.com/474671/move-over-shakespeare-teen-girls-are-the-real-language-disruptors/.
Mikolov, Tomas, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” http://arxiv.org/abs/1301.3781.
Miller, George A. 1995. “WordNet: A Lexical Database for English.” Communications of the ACM 38 (11). New York, NY, USA: ACM: 39–41. doi:10.1145/219717.219748.
Minaee, Shervin, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. “Deep Learning–Based Text Classification: A Comprehensive Review.” ACM Computing Surveys 54 (3). New York, NY, USA: Association for Computing Machinery. doi:10.1145/3439726.
Mohammad, Saif M., and Peter D. Turney. 2013. “Crowdsourcing a Word–Emotion Association Lexicon.” Computational Intelligence 29 (3): 436–465. doi:10.1111/j.1467-8640.2012.00460.x.
Moody, Chris. 2017. “Stop Using word2vec.” Multithreaded. StitchFix. https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/.
Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3: 655. doi:10.21105/joss.00655.
Nothman, Joel, Hanmin Qin, and Roman Yurchak. 2018. “Stop Word Lists in Free Open-Source Software Packages.” In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 7–12. Melbourne, Australia: Association for Computational Linguistics. doi:10.18653/v1/W18-2502.
Olson, Randal S., William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. 2018. “Data-Driven Advice for Applying Machine Learning to Bioinformatics Problems.” In Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, 192–203. World Scientific. doi:10.1142/9789813235533_0018.
Ooms, Jeroen. 2020a. pdftools: Text Extraction, Rendering and Converting of PDF Documents. https://CRAN.R-project.org/package=pdftools.
Ooms, Jeroen. 2020b. hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker. https://CRAN.R-project.org/package=hunspell.
Pedersen, Thomas Lin, and Michaël Benesty. 2021. lime: Local Interpretable Model-Agnostic Explanations. https://CRAN.R-project.org/package=lime.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar: Association for Computational Linguistics. doi:10.3115/v1/D14-1162.
Perry, Patrick O. 2020. corpus: Text Corpus Analysis. https://CRAN.R-project.org/package=corpus.
Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. “Deep Contextualized Word Representations.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. New Orleans, Louisiana: Association for Computational Linguistics. doi:10.18653/v1/N18-1202.
Porter, Martin F. 1980. “An Algorithm for Suffix Stripping.” Program 14 (3): 130–137. doi:10.1108/eb046814.
Porter, Martin F. 2001. “Snowball: A Language for Stemming Algorithms.”
Ramineni, Chaitanya, and David Williamson. 2018. “Understanding Mean Score Differences Between the e-Rater® Automated Scoring Engine and Humans for Demographically Based Groups in the GRE® General Test.” ETS Research Report Series 2018 (1): 1–31. doi:10.1002/ets2.12192.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. KDD ’16. New York, NY, USA: Association for Computing Machinery. doi:10.1145/2939672.2939778.
Robinson, David. 2020. widyr: Widen, Process, Then Re-Tidy Data. https://CRAN.R-project.org/package=widyr.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–1678. Florence, Italy: Association for Computational Linguistics. doi:10.18653/v1/P19-1163.
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. doi:10.1162/tacl_a_00099.
Selivanov, Dmitriy, Manuel Bickel, and Qing Wang. 2020. text2vec: Modern Text Mining Framework for R. https://CRAN.R-project.org/package=text2vec.
Sheng, Emily, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. “The Woman Worked as a Babysitter: On Biases in Language Generation.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3407–3412. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1339.
Shrikumar, Avanti, Peyton Greenside, and Anshul Kundaje. 2017. “Learning Important Features Through Propagating Activation Differences.” In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 3145–3153. ICML’17. Sydney, NSW, Australia: JMLR.org.
Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” http://arxiv.org/abs/1703.00810.
Silge, Julia, Fanny Chow, Max Kuhn, and Hadley Wickham. 2021. rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. doi:10.21105/joss.00037.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. 1st ed. O’Reilly Media, Inc.
Speer, Robyn. 2017. “How to Make a Racist AI Without Really Trying.” ConceptNet Blog. http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/.
Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (56): 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html.
Sugisaki, Kyoko, and Don Tuggener. 2018. “German Compound Splitting Using the Compound Productivity of Morphemes.” Verlag der Österreichischen Akademie der Wissenschaften.
Sweeney, Latanya. 2000. Simple Demographics Often Identify People Uniquely. Data Privacy Working Paper 3. Carnegie Mellon University. https://dataprivacylab.org/projects/identifiability/.
Tang, Cheng, Damien Garreau, and Ulrike von Luxburg. 2018. “When Do Random Forests Fail?” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2987–2997. NIPS’18. Red Hook, NY, USA: Curran Associates Inc.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, Series B (Methodological) 58 (1): 267–288. http://www.jstor.org/stable/2346178.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2021. reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.
Van-Tu, Nguyen, and Le Anh-Cuong. 2016. “Improving Question Classification by Feature Extraction and Selection.” Indian Journal of Science and Technology 9 (May). doi:10.17485/ijst/2016/v9i17/93160.
Vaughan, Davis. 2021a. slider: Sliding Window Functions. https://CRAN.R-project.org/package=slider.
Vaughan, Davis. 2021b. workflows: Modeling Workflows. https://CRAN.R-project.org/package=workflows.
Vaughan, Davis, and Matt Dancho. 2021. furrr: Apply Mapping Functions in Parallel Using Futures. https://CRAN.R-project.org/package=furrr.
Vaughan, Davis, and Max Kuhn. 2020. hardhat: Construct Modeling Packages. https://CRAN.R-project.org/package=hardhat.
Vosoughi, Soroush, Prashanth Vijayaraghavan, and Deb Roy. 2016. “Tweet2Vec: Learning Tweet Embeddings Using Character-Level CNN-LSTM Encoder-Decoder.” In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1041–1044. SIGIR ’16. New York, NY, USA: Association for Computing Machinery. doi:10.1145/2911451.2914762.
Wagner, Claudia, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. “Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia.” EPJ Data Science 5 (1). SpringerOpen: 5. doi:10.1140/epjds/s13688-016-0066-4.
Weinberger, Kilian, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–1120. ICML ’09. New York, NY, USA: Association for Computing Machinery. doi:10.1145/1553374.1553516.
Wenfeng, Qin, and Wu Yanyi. 2019. jiebaR: Chinese Text Segmentation. https://CRAN.R-project.org/package=jiebaR.
Wickham, Hadley. 2019. stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
Wickham, Hadley. 2020. httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43). The Open Journal: 1686. doi:10.21105/joss.01686.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.
Wickham, Hadley, and Jim Hester. 2020. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program: Electronic Library and Information Systems 40 (3). Emerald: 219–223. doi:10.1108/00330330610681295.
Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. “Character-Level Convolutional Networks for Text Classification.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, 649–657. NIPS’15. Cambridge, MA, USA: MIT Press.
Zou, Feng, Fu Lee Wang, Xiaotie Deng, and Song Han. 2006. “Evaluation of Stop Word Lists in Chinese Language.” In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC06). Genoa, Italy: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2006/pdf/273_pdf.pdf.
Zou, Feng, Fu Lee Wang, Xiaotie Deng, Song Han, and Lu Sheng Wang. 2006. “Automatic Construction of Chinese Stop Word List.” In Proceedings of the 5th WSEAS International Conference on Applied Computer Science, 1009–1014. ACOS’06. Stevens Point, Wisconsin, USA: World Scientific; Engineering Academy; Society (WSEAS). http://dl.acm.org/citation.cfm?id=1973598.1973793.