Chapter 11 Data

This section includes brief explanations of the various datasets we will be using in this book.

11.1 hcandersenr

The hcandersenr (Hvitfeldt 2019a) package includes the text of the 157 known fairy tales by the Danish author H.C. Andersen. The texts are available in 5 different languages, with

  • 156 in English,
  • 154 in Spanish,
  • 150 in German,
  • 138 in Danish and
  • 58 in French

The package comes with a dataset for each language with the naming convention hcandersen_**, where ** is a language code. Each dataset comes as a data.frame with two columns, text and book, where the text variable has the text divided into strings of up to 80 characters and the book variable identifies the fairy tale each string belongs to.
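As a rough sketch of working with this layout, the snippet below counts the lines of text per tale in the English dataset. It assumes the hcandersenr package is installed; when it is not, it falls back to a small made-up stand-in with the same two columns, so the placeholder strings and tale names in the fallback are illustrative only.

```r
# Use the real English dataset when hcandersenr is installed; otherwise fall
# back to a tiny hypothetical stand-in with the same two columns.
if (requireNamespace("hcandersenr", quietly = TRUE)) {
  tales <- hcandersenr::hcandersen_en
} else {
  tales <- data.frame(
    text = c("First line of tale A", "Second line of tale A",
             "First line of tale B"),
    book = c("Tale A", "Tale A", "Tale B"),  # placeholder titles
    stringsAsFactors = FALSE
  )
}

# Each row is one line of text, so counting rows per book gives the
# number of lines in each tale.
lines_per_tale <- table(tales$book)
n_tales <- length(lines_per_tale)

# Length in characters of the longest line in the dataset.
longest_line <- max(nchar(tales$text))

# The tales with the most lines come first.
head(sort(lines_per_tale, decreasing = TRUE))
```

The same steps should carry over to the other languages by swapping in the corresponding hcandersen_** dataset.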

The package also comes with a dataset called EK which includes information about the publication date, language of origin and names in the different languages.

11.2 scotus

The scotus (Hvitfeldt 2019b) package contains a sample of opinions of the Supreme Court of the United States. The scotus_sample data.frame includes one opinion per row along with the year, case name, docket number, and a unique ID number.

The text has had minimal preprocessing and still includes the header information in the text field. An example of the beginning of a court opinion is shown below.

## No. 97-1992
## VAUGHN L. MURPHY, Petitioner v. UNITED PARCEL SERVICE, INC.
## ON WRIT OF CERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE TENTH
## CIRCUIT
## [June 22, 1999]
## Justice O'Connor delivered the opinion of the Court.
## Respondent United Parcel Service, Inc. (UPS), dismissed petitioner Vaughn
## L. Murphy from his job as a UPS mechanic because of his high blood pressure.
## Petitioner filed suit under Title I of the Americans with Disabilities Act of
## 1990 (ADA or Act), 104 Stat. 328, 42 U.S.C. § 12101 et seq., in Federal District
## Court. The District Court granted summary judgment to respondent, and the Court
## of Appeals for the Tenth Circuit affirmed. We must decide whether the Court
## of Appeals correctly considered petitioner in his medicated state when it held
## that petitioner's impairment does not "substantially limi[t]" one or more of
## his major life activities and whether it correctly determined that petitioner
## is not "regarded as disabled." See §12102(2). In light of our decision in Sutton
## v. United Air Lines, Inc., ante, p. ____, we conclude that the Court of Appeals'
## resolution of both issues was correct.
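Because the header is left in the text field, details such as the docket number and decision date can be recovered with regular expressions. The following is a minimal sketch in base R. The opinion string is a stand-in built from the first lines of the excerpt above, and it assumes that header lines are separated by newlines and that the text starts with "No. "; the actual text field may differ.

```r
# Stand-in for one value of the text field, copied from the excerpt above.
# Assumption: header lines are separated by newlines.
opinion <- paste(
  "No. 97-1992",
  "VAUGHN L. MURPHY, Petitioner v. UNITED PARCEL SERVICE, INC.",
  "ON WRIT OF CERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE TENTH CIRCUIT",
  "[June 22, 1999]",
  "Justice O'Connor delivered the opinion of the Court.",
  sep = "\n"
)

# Docket number: match "No. <digits>-<digits>" at the start of the text,
# then drop the "No. " prefix.
docket_match <- regmatches(opinion, regexpr("^No\\. [0-9]+-[0-9]+", opinion))
docket <- sub("^No\\. ", "", docket_match)

# Decision date: the first bracketed "Month day, year" string,
# with the surrounding square brackets removed.
date_match <- regmatches(
  opinion,
  regexpr("\\[[A-Za-z]+ [0-9]{1,2}, [0-9]{4}\\]", opinion)
)
decision_date <- gsub("^\\[|\\]$", "", date_match)

docket
decision_date
```

Opinions whose headers follow a different layout would return an empty match here, so the patterns should be checked against a sample of the real data before relying on them.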

11.3 GitHub issues

This dataset includes 1161 GitHub issue titles and an indicator of whether each issue was about documentation or not. It has been converted to be accessible from the ghissuesdata package. The dataset is split into a training data set and an evaluation data set.

# Attach the data package, then preview the columns of the training set
library(ghissuesdata)

dplyr::glimpse(github_issues_training)

11.4 US Consumer Finance Complaints

This dataset includes 117214 complaints that consumers submitted about financial products and services and that were sent to companies for response. Each complaint comes with a complaint_id, various categorical variables, and a text column consumer_complaint_narrative containing the written complaint.
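To illustrate this structure, here is a small sketch in base R that checks the ID column and summarizes narrative length by a categorical variable. The data frame is a made-up stand-in: only complaint_id and consumer_complaint_narrative are named in the description above, while the product column and all values are hypothetical.

```r
# Hypothetical stand-in with the two documented columns plus one
# illustrative categorical variable (product is an assumed name).
complaints <- data.frame(
  complaint_id = c(1L, 2L, 3L),
  product = c("Mortgage", "Credit card", "Mortgage"),
  consumer_complaint_narrative = c(
    "My payment was applied to the wrong account.",
    "I was charged a late fee although I paid on time.",
    "The servicer never responded to my request."
  ),
  stringsAsFactors = FALSE
)

# Each complaint should appear exactly once.
ids_unique <- anyDuplicated(complaints$complaint_id) == 0

# Rough word count per narrative: split on runs of whitespace.
n_words <- lengths(
  strsplit(complaints$consumer_complaint_narrative, "[[:space:]]+")
)

# Average narrative length within each level of the categorical variable.
words_by_product <- tapply(n_words, complaints$product, mean)
words_by_product
```

Splitting on whitespace is only a crude approximation of tokenization, but it is enough for a first look at how long the narratives are.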

References

Hvitfeldt, Emil. 2019a. Hcandersenr: H.C. Andersens Fairy Tales. https://CRAN.R-project.org/package=hcandersenr.

Hvitfeldt, Emil. 2019b. Scotus: What the Package Does (One Line, Title Case). https://github.com/EmilHvitfeldt/scotus.