
Some datasets for teaching data science

In this post I describe the dslabs package, which contains some datasets that I use in my data science courses.

A much discussed topic in stats education is that computing should play a more prominent role in the curriculum. I strongly agree, but I think the main improvement will come from bringing applications to the forefront and mimicking, as best as possible, the challenges applied statisticians face in real life. I therefore try to avoid widely used toy examples, such as the mtcars dataset, when I teach data science. However, my experience has been that finding examples that are realistic, interesting, and appropriate for beginners is not easy. After a few years of teaching I have collected a few datasets that I think fit these criteria. To facilitate their use in introductory classes, I include them in the dslabs package:

install.packages("dslabs")

Below I show some examples of how you can use these datasets. You can see the datasets that are included here:

library("dslabs")
data(package="dslabs")

Note that the package also includes some of the scripts used to wrangle the data from their original source:

list.files(system.file("script", package = "dslabs"))
##  [1] "make-admissions.R"                   
##  [2] "make-divorce_margarine.R"            
##  [3] "make-gapminder-rdas.R"               
##  [4] "make-murders-rda.R"                  
##  [5] "make-na_example-rda.R"               
##  [6] "make-outlier_example.R"              
##  [7] "make-polls_us_election_2016.R"       
##  [8] "make-reported_heights-rda.R"         
##  [9] "make-research_funding_rates.R"       
## [10] "make-weekly_us_contagious_diseases.R"
## [11] "save-gapminder-example-csv.R"

If you want to learn more about how we use these datasets in class, you can read this paper or this online book.

US murders

This dataset includes gun murder data for US states in 2010. I use this dataset to introduce the basics of R programming.

data("murders")
library(tidyverse)
library(ggthemes)
library(ggrepel)

r <- murders %>%
  summarize(pop=sum(population), tot=sum(total)) %>%
  mutate(rate = tot/pop*10^6) %>% .$rate

ds_theme_set()
murders %>% ggplot(aes(x = population/10^6, y = total, label = abb)) +
  geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
  geom_point(aes(color=region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name="Region") 
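
The plot above is what the class builds up to. As a simpler illustration of the kind of basic operations I start with (the snippet below is only a sketch of my own, not code from the course materials), one can compute and sort murder rates by state:

# Illustrative sketch: murder rate per 100,000 people, highest first
murders %>%
  mutate(rate = total / population * 10^5) %>%
  select(state, region, rate) %>%
  arrange(desc(rate)) %>%
  head()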

Gapminder

This dataset includes health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016. I use this dataset to teach data visualization and ggplot2.

data("gapminder")

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>%
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))

filter(gapminder, year %in% c(1962, 2013) & !is.na(group) &
         !is.na(fertility) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population/10^6) %>%
  ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) +
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(30, 85)) +
  xlab("Fertility rate (births per woman)") +
  ylab("Life Expectancy") +
  geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
  facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

Contagious disease data for US states

This dataset contains yearly counts for Hepatitis A, measles, mumps, pertussis, polio, rubella, and smallpox for US states. Original data courtesy of the Tycho Project. I use it to show ways one can plot more than 2 dimensions.

library(RColorBrewer)
data("us_contagious_diseases")
the_disease <- "Measles"
us_contagious_diseases %>%
  filter(!state %in% c("Hawaii","Alaska") & disease ==  the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate)) %>%
  ggplot(aes(year, state,  fill = rate)) +
  geom_tile(color = "gray50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colours = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")

Fivethirtyeight 2016 Poll Data

This data includes poll results from the US 2016 presidential election aggregated from HuffPost Pollster, RealClearPolitics, polling firms and news reports. The dataset also includes election results (popular vote) and electoral college votes in results_us_election_2016. I use this dataset to teach inference.

data(polls_us_election_2016)
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate>="2016-07-01") %>%
  select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
  rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
  gather(candidate, percentage, -enddate, -pollster) %>% 
  mutate(candidate = factor(candidate, levels = c("Trump","Clinton")))%>%
  group_by(pollster) %>%
  filter(n()>=10) %>%
  ungroup() %>%
  ggplot(aes(enddate, percentage, color = candidate)) +  
  geom_point(show.legend = FALSE, alpha=0.4)  + 
  geom_smooth(method = "loess", span = 0.15) +
  scale_y_continuous(limits = c(30,50))
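
As a quick sketch of the kind of inference exercise this dataset supports (the calculation below is only an illustration of my own, not the analysis from the course), one can estimate the Clinton versus Trump spread from the final week of national polls, with a naive standard error that ignores pollster effects:

# Illustrative sketch: average spread and a rough 95% interval from late national polls
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate >= "2016-11-01") %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
  summarize(avg = mean(spread),
            se = sd(spread) / sqrt(n()),
            lower = avg - 1.96 * se,
            upper = avg + 1.96 * se)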

Student reported heights

These are self-reported heights in inches for males and females from a data science course across several years. I use this to teach distributions and summary statistics.

data("heights")
heights %>% 
  ggplot(aes(height, fill=sex)) + 
  geom_density(alpha = 0.2)
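
A simple companion to the density plot, and the kind of summary statistics I have in mind here (the snippet below is just an illustration), is the average and standard deviation by sex:

# Illustrative sketch: summary statistics of height by sex
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), sd = sd(height), n = n())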

These data were highly wrangled, as students would often report heights in values other than inches. The original entries are here:

data("reported_heights")
reported_heights %>% filter(is.na(as.numeric(height))) %>% select(height) %>% .$height 
## Warning in evalq(is.na(as.numeric(height)), <environment>): NAs introduced
## by coercion
##  [1] "5' 4""                 "165cm"                 
##  [3] "5'7"                    ">9000"                 
##  [5] "5'7""                  "5'3""                 
##  [7] "5 feet and 8.11 inches" "5'11"                  
##  [9] "5'9''"                  "5'10''"                
## [11] "5,3"                    "6'"                    
## [13] "6,8"                    "5' 10"                 
## [15] "5 foot 8 inches"        "5'5""                 
## [17] "5'2""                  "5,4"                   
## [19] "5'3"                    "5'10''"                
## [21] "5'3''"                  "5'7''"                 
## [23] "5'12"                   "2'33"                  
## [25] "5'11"                   "5'3""                 
## [27] "5,8"                    "5'6''"                 
## [29] "5'4"                    "1,70"                  
## [31] "5'7.5''"                "5'7.5''"               
## [33] "5'2""                  "5' 7.78""             
## [35] "yyy"                    "5'5"                   
## [37] "5'8"                    "5'6"                   
## [39] "5 feet 7inches"         "6*12"                  
## [41] "5 .11"                  "5 11"                  
## [43] "5'4"                    "5'8""                 
## [45] "5'5"                    "5'7"                   
## [47] "5'6"                    "5'11""                
## [49] "5'7""                  "5'7"                   
## [51] "5'8"                    "5' 11""               
## [53] "6'1""                  "69""                  
## [55] "5' 7""                 "5'10''"                
## [57] "5'10"                   "5'10"                  
## [59] "5ft 9 inches"           "5 ft 9 inches"         
## [61] "5'2"                    "5'11"                  
## [63] "5'11''"                 "5'8""                 
## [65] "708,661"                "5 feet 6 inches"       
## [67] "5'10''"                 "5'8"                   
## [69] "6'3""                  "649,606"               
## [71] "728,346"                "6 04"                  
## [73] "5'9"                    "5'5''"                 
## [75] "5'7""                  "6'4""                 
## [77] "5'4"                    "170 cm"                
## [79] "7,283,465"              "5'6"                   
## [81] "5'6"

We use this example to teach string processing and regex.
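
As a rough sketch of what that wrangling involves (the pattern and object names below are a simplified illustration of my own, not the actual script used to build the heights dataset), entries written as feet and inches can be detected and converted with a regular expression:

# Simplified illustration, not the actual wrangling script:
# convert entries like "5'7" or "5' 4\"" into inches
library(stringr)   # dplyr and tibble already loaded via library(tidyverse) above
pattern <- "^([4-7])\\s*'\\s*(\\d{1,2}(\\.\\d+)?)[\"']{0,2}$"
m <- str_match(str_trim(reported_heights$height), pattern)
converted <- tibble(original = reported_heights$height,
                    feet = as.numeric(m[, 2]),
                    inches = as.numeric(m[, 3])) %>%
  filter(!is.na(feet)) %>%                        # keep only entries the pattern matched
  mutate(height_in_inches = 12 * feet + inches)
head(converted)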

Margarine and divorce rate

Finally, here is a silly example from the website Spurious Correlations that I use when teaching that correlation does not imply causation.

the_title <- paste("Correlation =",
                round(with(divorce_margarine,
                           cor(margarine_consumption_per_capita, divorce_rate_maine)),2))
information(divorce_margarine)
divorce_margarine %>%
  ggplot(aes(margarine_consumption_per_capita, divorce_rate_maine)) +
  geom_point(cex=3) +
  geom_smooth(method = "lm") +
  ggtitle(the_title) +
  xlab("Margarine Consumption per Capita (lbs)") +
  ylab("Divorce rate in Maine (per 1000)")

