Chapter 11 Data
This section includes brief explanations of the various datasets we will be using in this book.
- 156 in English,
- 154 in Spanish,
- 150 in German,
- 138 in Danish and
- 58 in French
The package comes with a dataset for each language, named following a convention in which ** is a country code.
Each dataset comes as a data.frame with two columns, including book, with the text divided into strings of up to 80 characters.
The package also comes with a dataset called EK, which includes information about the publication date, language of origin, and names in the different languages.
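As a minimal sketch of working with data in this shape, the base R code below builds a small toy data.frame and collapses the line-level strings back into one string per fairy tale. The toy rows, the object names, and the column name text are assumptions for illustration; only book is named in the description above.

```r
# Toy stand-in for one of the language datasets (hypothetical rows;
# the column name `text` is an assumption, not taken from the package).
fairytales <- data.frame(
  text = c("A soldier came marching", "along the high road.",
           "Far out in the ocean", "the water is blue."),
  book = c("The tinder-box", "The tinder-box",
           "The little mermaid", "The little mermaid"),
  stringsAsFactors = FALSE
)

# Check that every string is at most 80 characters long.
stopifnot(all(nchar(fairytales$text) <= 80))

# Collapse the lines of each fairy tale into a single string.
full_text <- aggregate(text ~ book, data = fairytales,
                       FUN = paste, collapse = " ")
full_text
```

Collapsing by book is convenient when a later step, such as tokenization, should see each fairy tale as one document rather than as many short lines.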
The scotus (Hvitfeldt 2019b) package contains a sample of the Supreme Court of the United States’ opinions.
The scotus_sample data.frame includes one opinion per row along with the year, case name, docket number, and a unique ID number.
The text has had minimal preprocessing and includes the header information in the text field. An example of the beginning of a court opinion is shown below.
## No. 97-1992
## VAUGHN L. MURPHY, Petitioner v. UNITED PARCEL SERVICE, INC.
## ON WRIT OF CERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE TENTH
## CIRCUIT
## [June 22, 1999]
## Justice O'Connor delivered the opinion of the Court.
## Respondent United Parcel Service, Inc. (UPS), dismissed petitioner Vaughn
## L. Murphy from his job as a UPS mechanic because of his high blood pressure.
## Petitioner filed suit under Title I of the Americans with Disabilities Act of
## 1990 (ADA or Act), 104 Stat. 328, 42 U.S.C. § 12101 et seq., in Federal District
## Court. The District Court granted summary judgment to respondent, and the Court
## of Appeals for the Tenth Circuit affirmed. We must decide whether the Court
## of Appeals correctly considered petitioner in his medicated state when it held
## that petitioner's impairment does not "substantially limi[t]" one or more of
## his major life activities and whether it correctly determined that petitioner
## is not "regarded as disabled." See §12102(2). In light of our decision in Sutton
## v. United Air Lines, Inc., ante, p. ____, we conclude that the Court of Appeals'
## resolution of both issues was correct.
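Because the header is part of the text field, it can be useful to separate it from the body before modeling. The sketch below shows one possible heuristic in base R, using a shortened, hypothetical opinion modeled on the excerpt above. It assumes that lines are separated by newlines and that the header ends at the line containing "delivered the opinion of the Court"; neither assumption is guaranteed to hold for every opinion.

```r
# Shortened, hypothetical opinion text modeled on the excerpt above.
opinion <- paste(
  "No. 97-1992",
  "VAUGHN L. MURPHY, Petitioner v. UNITED PARCEL SERVICE, INC.",
  "[June 22, 1999]",
  "Justice O'Connor delivered the opinion of the Court.",
  "Respondent United Parcel Service, Inc. (UPS), dismissed petitioner.",
  sep = "\n"
)

# Split the opinion into lines (assumes newline-separated text).
opinion_lines <- strsplit(opinion, "\n", fixed = TRUE)[[1]]

# Heuristic (an assumption, not a property of the dataset): everything up to
# and including the "delivered the opinion of the Court" line is header.
marker <- grep("delivered the opinion of the Court", opinion_lines, fixed = TRUE)

if (length(marker) > 0) {
  header <- opinion_lines[seq_len(marker[1])]
  body <- opinion_lines[-seq_len(marker[1])]
} else {
  # No marker found: treat the whole text as body.
  header <- character(0)
  body <- opinion_lines
}
```

Opinions that lack the marker sentence fall through to the else branch unchanged, so the heuristic never drops text it cannot classify.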
11.3 GitHub issues
This dataset includes 1161 GitHub issue titles and an indicator of whether each issue was about documentation or not. It has been converted to be accessible from the ghissuesdata (???) package. The dataset is split into a training set and an evaluation set.
11.4 US Consumer Finance Complaints
This dataset includes 117214 consumer complaints about financial products and services, sent to companies for response. Each complaint comes with a complaint_id, various categorical variables, and a text column, consumer_complaint_narrative, containing the written complaint.
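To make this structure concrete, the sketch below builds a toy data.frame in the same shape and computes two simple summaries in base R. The rows are invented, and the categorical column product is a hypothetical stand-in for the dataset's categorical variables; only complaint_id and consumer_complaint_narrative are named above.

```r
# Toy data in the shape described above (hypothetical rows; `product` is an
# assumed example of a categorical variable).
complaints <- data.frame(
  complaint_id = c(1L, 2L, 3L),
  product = c("Mortgage", "Credit card", "Mortgage"),
  consumer_complaint_narrative = c(
    "My payment was applied to the wrong account.",
    "I was charged a late fee twice.",
    "The servicer lost my documents."
  ),
  stringsAsFactors = FALSE
)

# Number of complaints per level of the categorical variable.
product_counts <- table(complaints$product)

# Rough word count per narrative, splitting on whitespace.
n_words <- lengths(strsplit(complaints$consumer_complaint_narrative, "\\s+"))

product_counts
n_words
```

Summaries like these are a quick first check of class balance and text length before fitting a model on the narratives.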