Tidy Tuesday

I decided to start posting items for the R for Data Science Tidy Tuesday project where data sets are released each Monday as part of the R4DS Online Learning Community. A quote from the home page states -

The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data.

The Data

I found the datasets used in this post by accident, while looking for another item for a totally unrelated project, and thought this would be a good starter for my first contribution.

The dataset is available here and is a zip file and the contents extracted to the data sub-folder.

The file is large, over 600Mb, so if you choose to use this data, be mindful of where you store the csv file.

# Read data to a data frame

df <- read_csv("P00000001-ALL.csv",
               show_col_types = FALSE)

str(df)

spec_tbl_df [4,084,074 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ cmte_id          : chr [1:4084074] "C00420224" "C00420224" "C00420224" "C00420224" ...
 $ cand_id          : chr [1:4084074] "P80002983" "P80002983" "P80002983" "P80002983" ...
 $ cand_nm          : chr [1:4084074] "Cox, John H" "Cox, John H" "Cox, John H" "Cox, John H" ...
 $ contbr_nm        : chr [1:4084074] "BROWN, CHARLENE" "KELLY, RAY" "CINGEL, KEITH" "DUNAWAY, JONATHON" ...
 $ contbr_city      : chr [1:4084074] "EAGLE RIVER" "HUNTSVILLE" "SEVERN" "DEATSVILLE" ...
 $ contbr_st        : chr [1:4084074] "AK" "AL" "AL" "AL" ...
 $ contbr_zip       : chr [1:4084074] "99577" "35801" "20999" "36022" ...
 $ contbr_employer  : chr [1:4084074] NA "ARKTECH" "SANTA CLAUS" "CSC" ...
 $ contbr_occupation: chr [1:4084074] "STUDENT" "RETIRED" "SNOWMAN" "TECHNICAL MANAGER" ...
 $ contb_receipt_amt: num [1:4084074] 25 25 50 10 25 25 20 5 10 10 ...
 $ contb_receipt_dt : chr [1:4084074] "01-MAR-07" "25-JAN-07" "17-MAY-07" "18-JAN-07" ...
 $ receipt_desc     : chr [1:4084074] NA NA NA NA ...
 $ memo_cd          : chr [1:4084074] NA NA NA NA ...
 $ memo_text        : chr [1:4084074] NA NA NA NA ...
 $ form_tp          : chr [1:4084074] "SA17A" "SA17A" "SA17A" "SA17A" ...
 $ file_num         : num [1:4084074] 288757 288757 305408 288757 288757 ...
 - attr(*, "spec")=
  .. cols(
  ..   cmte_id = col_character(),
  ..   cand_id = col_character(),
  ..   cand_nm = col_character(),
  ..   contbr_nm = col_character(),
  ..   contbr_city = col_character(),
  ..   contbr_st = col_character(),
  ..   contbr_zip = col_character(),
  ..   contbr_employer = col_character(),
  ..   contbr_occupation = col_character(),
  ..   contb_receipt_amt = col_double(),
  ..   contb_receipt_dt = col_character(),
  ..   receipt_desc = col_character(),
  ..   memo_cd = col_character(),
  ..   memo_text = col_character(),
  ..   form_tp = col_character(),
  ..   file_num = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

16 variables are available in the data, but I will concentrate on the contbr_st, contb_receipt_dt and contb_receipt_amt variables to create a map showing which states contributed the most dollars to campaigns. I’m not happy with those variable names, so lets rename them.

Tidy the data

df <- df %>% 
  select(
    city = contbr_city,
    state_abb = contbr_st,
    amount = contb_receipt_amt,
    date = contb_receipt_dt
  )

Add variables

Let’s continue some basic tidying up of the data by ensuring the date variable is recognised as a proper date values and add a year variable.

df <- df %>% 
  mutate(date = dmy(date),
         yr = as.numeric(year(date)))

What’s in the data?

What sort of spread of data do we have for each year in the data?

df %>% 
  group_by(yr) %>% 
  count(yr)

# A tibble: 5 × 2
# Groups:   yr [5]
     yr       n
  <dbl>   <int>
1  2004       3
2  2005     106
3  2006    6089
4  2007  691650
5  2008 3386226

It looks like the first three years worth of the data contain low numbers, so let’s exclude them from the data set - we may want to facet the data later when we plot the map.

df <- df %>% 
  filter(date >= as.Date("2007-01-01"))

US State data

I had been aware of the built-in data sets within R and the various libraries, but had never used them in a project. Let’s change that and get some data relating to the US States.

state_names <- data.frame(state_abb = state.abb,
                          state_name = state.name)

head(state_names)

  state_abb state_name
1        AL    Alabama
2        AK     Alaska
3        AZ    Arizona
4        AR   Arkansas
5        CA California
6        CO   Colorado

Looking at the data_map("state") data set that comes with ggplot2, containing the longitude and latitude variables for states, the region name is in lowercase, so let’s create that data frame again, with a lower case state name

state_names <- data.frame(state_abb = state.abb,
                          state_name = str_to_lower(state.name))

head(state_names)

  state_abb state_name
1        AL    alabama
2        AK     alaska
3        AZ    arizona
4        AR   arkansas
5        CA california
6        CO   colorado

Create a new data frame containing the longitude and latitude variables and join it to the state names

map_coords <- map_data("state") %>% 
  inner_join(state_names, by = c("region" = "state_name"))

head(map_coords)

       long      lat group order  region subregion state_abb
1 -87.46201 30.38968     1     1 alabama      <NA>        AL
2 -87.48493 30.37249     1     2 alabama      <NA>        AL
3 -87.52503 30.37249     1     3 alabama      <NA>        AL
4 -87.53076 30.33239     1     4 alabama      <NA>        AL
5 -87.57087 30.32665     1     5 alabama      <NA>        AL
6 -87.58806 30.32665     1     6 alabama      <NA>        AL

Calculating the plot data

To plot the data for amounts donated to the Presidential Campaigns, we need the total amounts by state.

donations_by_state <- df %>% 
  group_by(state_abb, yr) %>% 
  summarise(total_amount = sum(amount), .groups = "drop")

head(donations_by_state)

# A tibble: 6 × 3
  state_abb    yr total_amount
  <chr>     <dbl>        <dbl>
1 -          2008          50 
2 -1         2008        3100 
3 0          2008       -4368.
4 1          2008         376.
5 60         2007         800 
6 60         2008        1030

Joining the data

We can see from the donations_by_state data frame, we have multiple rows with invalid state names.

unique(donations_by_state$state_abb)

  [1] "-"  "-1" "0"  "1"  "60" "75" "AA" "AB" "AC" "AE" "AF" "AK" "AL"
 [14] "AM" "AN" "AP" "AR" "AS" "AU" "AZ" "BA" "BC" "BE" "BH" "BM" "BR"
 [27] "C"  "CA" "CH" "CN" "CO" "CT" "DC" "DE" "DF" "DI" "DJ" "EN" "ES"
 [40] "EU" "FF" "FI" "FL" "FM" "FR" "GA" "GE" "GH" "GR" "GU" "HE" "HI"
 [53] "HK" "HO" "HU" "IA" "ID" "IL" "IN" "IO" "IR" "IS" "IT" "JA" "JP"
 [66] "JT" "KE" "KS" "KT" "KY" "LA" "LN" "LO" "LU" "MA" "MB" "MD" "ME"
 [79] "MH" "MI" "ML" "MN" "MO" "MP" "MS" "MT" "MU" "MX" "MY" "N"  "N."
 [92] "N/" "NC" "ND" "NE" "NH" "NJ" "NL" "NM" "NO" "NS" "NT" "NU" "NV"
[105] "NY" "OC" "OH" "OK" "ON" "OR" "OT" "PA" "PE" "PO" "PR" "PW" "QB"
[118] "QC" "QL" "QU" "RE" "RH" "RI" "RM" "SA" "SC" "SD" "SE" "SK" "SO"
[131] "SP" "ST" "SU" "SW" "TA" "TE" "TH" "TK" "TN" "TO" "TP" "TR" "TU"
[144] "TX" "TZ" "U." "UK" "US" "UT" "VA" "VE" "VI" "VK" "VT" "WA" "WE"
[157] "WI" "WV" "WY" "XX" "YT" "ZU" "ZZ" NA

By joining the data frame to the map_coords data frame, these invalid rows will be removed.

plot_data <- inner_join(x = map_coords, y = donations_by_state)

head(plot_data)

       long      lat group order  region subregion state_abb   yr
1 -87.46201 30.38968     1     1 alabama      <NA>        AL 2007
2 -87.46201 30.38968     1     1 alabama      <NA>        AL 2008
3 -87.48493 30.37249     1     2 alabama      <NA>        AL 2007
4 -87.48493 30.37249     1     2 alabama      <NA>        AL 2008
5 -87.52503 30.37249     1     3 alabama      <NA>        AL 2007
6 -87.52503 30.37249     1     3 alabama      <NA>        AL 2008
  total_amount
1      2382083
2      3661394
3      2382083
4      3661394
5      2382083
6      3661394

Building the plot

Now we have the data joined together, let’s make the first plot to see what we have.

p <- ggplot(plot_data, aes(long, lat)) +
  geom_polygon(aes(group = group, fill = total_amount),
               colour = "grey30",
               size = .2)

p

I definitely do not like that standard blue for a continuous scale, we will change that later on!

We can tell which are the top states making donations, but my being a Brit, I’m not always confident on which states are which (I know California, Texas and New York though!), so let’s find out the top 5 states making donations by looking at the donations_by_state data frame again. We can use this later to create some descriptive text for the plot subtitle.

top_five_state_abb <- donations_by_state %>% 
  group_by(state_abb) %>% 
  summarise(total_amount = sum(total_amount)) %>% 
  arrange(desc(total_amount)) %>% 
  top_n(n = 5)

top_five_state_abb

# A tibble: 5 × 2
  state_abb total_amount
  <chr>            <dbl>
1 CA          177005057.
2 NY          122472922.
3 TX           65920709.
4 FL           58941780.
5 IL           54418090.

Creating plot labels

So we can have a plot that explains what is being shown, we can create some text variables that will be used by the labs function when creating the plot.

plot_title <- "Which states donated the most for the US Presidential Campaigns?"

plot_caption <- "Data Source: https://ocw.mit.edu/courses/res-6-009-how-to-process-analyze-and-visualize-data-january-iap-2012/pages/datasets-and-code/"

Making a descriptive subtitle for the plot, we need to look back at some of the earlier data frames and summarise the data a little further.

total_donations <- inner_join(x = state_names, y = df) %>% 
  summarise(total_amount = sum(amount)) %>% pull(total_amount)

total_donations

[1] 995334194

Create the total amount donated by the top five states and calculate the percentage of all donations.

top_five_amount <- sum(top_five_state_abb$total_amount)

top_five_pct <- percent(top_five_amount / total_donations, accuracy = .2)

top_five_amount <- dollar(top_five_amount, accuracy = .2, scale = 1e-6, suffix = "m")

total_donations <- dollar(total_donations, accuracy = .2, scale = 1e-6, suffix = "m")

top_five_amount

[1] "$478.8m"

top_five_pct

[1] "48.2%"

total_donations

[1] "$995.4m"

Create a text string for the top five states donating.

top_five_states <- toString(top_five_state_abb %>% pull(state_abb))

Now we have the amounts needed for the subtitle, lets make the text variable, with a line break at the start to add some spacing.

plot_subtitle <- glue::glue(
  "A total of {total_donations} was contributed from all states, with the top five states of {top_five_states}, contributing a total of {top_five_amount},<br />representing {top_five_pct} of all donations in 2007 & 2008")

Making the final plot

p <- p +
  labs(
    title = plot_title,
    subtitle = plot_subtitle,
    caption = plot_caption,
    fill = NULL # Remove legend title
  )

p

Change those colours

As I said previously, I do not like the standard blue colour that comes with a continuous scale, let’s change and change the type of map projection shown in the plot.

p <- p +
  scale_fill_continuous(
    low = '#FFF8DC',
    high = '#8B1A1A',
    labels = label_dollar(
      scale = 1e-6,
      suffix = "m",
      accuracy = .2
    )
  ) +
  coord_map(projection = "mollweide")

p

Improving the theme

Load my preferred fonts for plot text and titles.

font_add_google(family = "roboto-slab", "Roboto Slab")
font_add_google(family = "roboto-condensed", "Roboto Condensed")

showtext_auto()

Void the theme by using theme_void

p <- p + theme_void()

p

As the amount of text for the title and subtitle overflow the plot area, we need to use the element_textbox_simple from the ggtext package to allow for overflowing text and add all the other plot theme elements.

p <- p +
  theme(
    text = element_text(family = "roboto-condensed", size = 22),
    plot.margin = margin(rep(1, 4), unit = "cm"),
    legend.direction = "horizontal",
    legend.position = "bottom",
    legend.key.height = unit(.8, units = "lines"),
    legend.key.width = unit(3.5, units = "lines"),
    legend.margin = margin(b = 1, unit = "lines"),
    plot.title = element_text(
      face = "bold",
      size = 26,
      family = "roboto-slab",
      colour = "#8B1A1A"
    ),
    plot.title.position = "plot",
    plot.subtitle = element_markdown(),
    plot.caption = element_text(size = 14, hjust = 0, face = "italic"),
    plot.caption.position = "plot",
    panel.grid = element_line(
      colour = "grey30",
      size = .2,
      linetype = "dashed"
    )
  )

p

Conclusion

Overall, for a first attempt at using the maps data sets and plotting data, I’m happy with the result.

Presidential Donations