How much were donations made to Presidential Campaigns?.
I decided to start posting items for the
R for Data Science Tidy Tuesday
project where data sets are
released each Monday as part of the
R4DS Online Learning Community
. A quote from the home page
states -
The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data.
I found the datasets used in this post by accident, while looking for another item for a totally unrelated project, and thought this would be a good starter for my first contribution.
The dataset is available here and is a zip file and the contents extracted to the data sub-folder.
The file is large, over 600Mb, so if you choose to use this data, be mindful of where you store the csv file.
spec_tbl_df [4,084,074 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ cmte_id : chr [1:4084074] "C00420224" "C00420224" "C00420224" "C00420224" ...
$ cand_id : chr [1:4084074] "P80002983" "P80002983" "P80002983" "P80002983" ...
$ cand_nm : chr [1:4084074] "Cox, John H" "Cox, John H" "Cox, John H" "Cox, John H" ...
$ contbr_nm : chr [1:4084074] "BROWN, CHARLENE" "KELLY, RAY" "CINGEL, KEITH" "DUNAWAY, JONATHON" ...
$ contbr_city : chr [1:4084074] "EAGLE RIVER" "HUNTSVILLE" "SEVERN" "DEATSVILLE" ...
$ contbr_st : chr [1:4084074] "AK" "AL" "AL" "AL" ...
$ contbr_zip : chr [1:4084074] "99577" "35801" "20999" "36022" ...
$ contbr_employer : chr [1:4084074] NA "ARKTECH" "SANTA CLAUS" "CSC" ...
$ contbr_occupation: chr [1:4084074] "STUDENT" "RETIRED" "SNOWMAN" "TECHNICAL MANAGER" ...
$ contb_receipt_amt: num [1:4084074] 25 25 50 10 25 25 20 5 10 10 ...
$ contb_receipt_dt : chr [1:4084074] "01-MAR-07" "25-JAN-07" "17-MAY-07" "18-JAN-07" ...
$ receipt_desc : chr [1:4084074] NA NA NA NA ...
$ memo_cd : chr [1:4084074] NA NA NA NA ...
$ memo_text : chr [1:4084074] NA NA NA NA ...
$ form_tp : chr [1:4084074] "SA17A" "SA17A" "SA17A" "SA17A" ...
$ file_num : num [1:4084074] 288757 288757 305408 288757 288757 ...
- attr(*, "spec")=
.. cols(
.. cmte_id = col_character(),
.. cand_id = col_character(),
.. cand_nm = col_character(),
.. contbr_nm = col_character(),
.. contbr_city = col_character(),
.. contbr_st = col_character(),
.. contbr_zip = col_character(),
.. contbr_employer = col_character(),
.. contbr_occupation = col_character(),
.. contb_receipt_amt = col_double(),
.. contb_receipt_dt = col_character(),
.. receipt_desc = col_character(),
.. memo_cd = col_character(),
.. memo_text = col_character(),
.. form_tp = col_character(),
.. file_num = col_double()
.. )
- attr(*, "problems")=<externalptr>
16 variables are available in the data, but I will concentrate on the
contbr_st
, contb_receipt_dt
and
contb_receipt_amt
variables to create a map showing which
states contributed the most dollars to campaigns. I’m not happy with
those variable names, so lets rename them.
Let’s continue some basic tidying up of the data by ensuring the
date
variable is recognised as a proper date values and add
a year
variable.
df <- df %>%
mutate(date = dmy(date),
yr = as.numeric(year(date)))
What sort of spread of data do we have for each year in the data?
# A tibble: 5 × 2
# Groups: yr [5]
yr n
<dbl> <int>
1 2004 3
2 2005 106
3 2006 6089
4 2007 691650
5 2008 3386226
It looks like the first three years worth of the data contain low numbers, so let’s exclude them from the data set - we may want to facet the data later when we plot the map.
I had been aware of the built-in data sets within R and the various libraries, but had never used them in a project. Let’s change that and get some data relating to the US States.
state_names <- data.frame(state_abb = state.abb,
state_name = state.name)
head(state_names)
state_abb state_name
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
Looking at the data_map("state")
data set that comes
with ggplot2
, containing the longitude and latitude
variables for states, the region name is in lowercase, so let’s create
that data frame again, with a lower case state name
state_names <- data.frame(state_abb = state.abb,
state_name = str_to_lower(state.name))
head(state_names)
state_abb state_name
1 AL alabama
2 AK alaska
3 AZ arizona
4 AR arkansas
5 CA california
6 CO colorado
Create a new data frame containing the longitude and latitude
variables and join it to the state names
map_coords <- map_data("state") %>%
inner_join(state_names, by = c("region" = "state_name"))
head(map_coords)
long lat group order region subregion state_abb
1 -87.46201 30.38968 1 1 alabama <NA> AL
2 -87.48493 30.37249 1 2 alabama <NA> AL
3 -87.52503 30.37249 1 3 alabama <NA> AL
4 -87.53076 30.33239 1 4 alabama <NA> AL
5 -87.57087 30.32665 1 5 alabama <NA> AL
6 -87.58806 30.32665 1 6 alabama <NA> AL
To plot the data for amounts donated to the Presidential Campaigns, we need the total amounts by state.
donations_by_state <- df %>%
group_by(state_abb, yr) %>%
summarise(total_amount = sum(amount), .groups = "drop")
head(donations_by_state)
# A tibble: 6 × 3
state_abb yr total_amount
<chr> <dbl> <dbl>
1 - 2008 50
2 -1 2008 3100
3 0 2008 -4368.
4 1 2008 376.
5 60 2007 800
6 60 2008 1030
We can see from the donations_by_state
data frame, we
have multiple rows with invalid state names.
unique(donations_by_state$state_abb)
[1] "-" "-1" "0" "1" "60" "75" "AA" "AB" "AC" "AE" "AF" "AK" "AL"
[14] "AM" "AN" "AP" "AR" "AS" "AU" "AZ" "BA" "BC" "BE" "BH" "BM" "BR"
[27] "C" "CA" "CH" "CN" "CO" "CT" "DC" "DE" "DF" "DI" "DJ" "EN" "ES"
[40] "EU" "FF" "FI" "FL" "FM" "FR" "GA" "GE" "GH" "GR" "GU" "HE" "HI"
[53] "HK" "HO" "HU" "IA" "ID" "IL" "IN" "IO" "IR" "IS" "IT" "JA" "JP"
[66] "JT" "KE" "KS" "KT" "KY" "LA" "LN" "LO" "LU" "MA" "MB" "MD" "ME"
[79] "MH" "MI" "ML" "MN" "MO" "MP" "MS" "MT" "MU" "MX" "MY" "N" "N."
[92] "N/" "NC" "ND" "NE" "NH" "NJ" "NL" "NM" "NO" "NS" "NT" "NU" "NV"
[105] "NY" "OC" "OH" "OK" "ON" "OR" "OT" "PA" "PE" "PO" "PR" "PW" "QB"
[118] "QC" "QL" "QU" "RE" "RH" "RI" "RM" "SA" "SC" "SD" "SE" "SK" "SO"
[131] "SP" "ST" "SU" "SW" "TA" "TE" "TH" "TK" "TN" "TO" "TP" "TR" "TU"
[144] "TX" "TZ" "U." "UK" "US" "UT" "VA" "VE" "VI" "VK" "VT" "WA" "WE"
[157] "WI" "WV" "WY" "XX" "YT" "ZU" "ZZ" NA
By joining the data frame to the map_coords
data frame,
these invalid rows will be removed.
plot_data <- inner_join(x = map_coords, y = donations_by_state)
head(plot_data)
long lat group order region subregion state_abb yr
1 -87.46201 30.38968 1 1 alabama <NA> AL 2007
2 -87.46201 30.38968 1 1 alabama <NA> AL 2008
3 -87.48493 30.37249 1 2 alabama <NA> AL 2007
4 -87.48493 30.37249 1 2 alabama <NA> AL 2008
5 -87.52503 30.37249 1 3 alabama <NA> AL 2007
6 -87.52503 30.37249 1 3 alabama <NA> AL 2008
total_amount
1 2382083
2 3661394
3 2382083
4 3661394
5 2382083
6 3661394
Now we have the data joined together, let’s make the first plot to see what we have.
p <- ggplot(plot_data, aes(long, lat)) +
geom_polygon(aes(group = group, fill = total_amount),
colour = "grey30",
size = .2)
p
I definitely do not like that standard blue for a continuous scale, we will change that later on!
We can tell which are the top states making donations, but my being a
Brit, I’m not always confident on which states are which (I know
California, Texas and New York though!), so let’s find out the top 5
states making donations by looking at the
donations_by_state
data frame again. We can use this later
to create some descriptive text for the plot subtitle.
top_five_state_abb <- donations_by_state %>%
group_by(state_abb) %>%
summarise(total_amount = sum(total_amount)) %>%
arrange(desc(total_amount)) %>%
top_n(n = 5)
top_five_state_abb
# A tibble: 5 × 2
state_abb total_amount
<chr> <dbl>
1 CA 177005057.
2 NY 122472922.
3 TX 65920709.
4 FL 58941780.
5 IL 54418090.
So we can have a plot that explains what is being shown, we can
create some text variables that will be used by the labs
function when creating the plot.
plot_title <- "Which states donated the most for the US Presidential Campaigns?"
plot_caption <- "Data Source: https://ocw.mit.edu/courses/res-6-009-how-to-process-analyze-and-visualize-data-january-iap-2012/pages/datasets-and-code/"
Making a descriptive subtitle for the plot, we need to look back at some of the earlier data frames and summarise the data a little further.
total_donations <- inner_join(x = state_names, y = df) %>%
summarise(total_amount = sum(amount)) %>% pull(total_amount)
total_donations
[1] 995334194
Create the total amount donated by the top five states and calculate the percentage of all donations.
top_five_amount <- sum(top_five_state_abb$total_amount)
top_five_pct <- percent(top_five_amount / total_donations, accuracy = .2)
top_five_amount <- dollar(top_five_amount, accuracy = .2, scale = 1e-6, suffix = "m")
total_donations <- dollar(total_donations, accuracy = .2, scale = 1e-6, suffix = "m")
top_five_amount
[1] "$478.8m"
top_five_pct
[1] "48.2%"
total_donations
[1] "$995.4m"
Create a text string for the top five states donating.
Now we have the amounts needed for the subtitle, lets make the text variable, with a line break at the start to add some spacing.
plot_subtitle <- glue::glue(
"A total of {total_donations} was contributed from all states, with the top five states of {top_five_states}, contributing a total of {top_five_amount},<br />representing {top_five_pct} of all donations in 2007 & 2008")
p <- p +
labs(
title = plot_title,
subtitle = plot_subtitle,
caption = plot_caption,
fill = NULL # Remove legend title
)
p
As I said previously, I do not like the standard blue colour that comes with a continuous scale, let’s change and change the type of map projection shown in the plot.
p <- p +
scale_fill_continuous(
low = '#FFF8DC',
high = '#8B1A1A',
labels = label_dollar(
scale = 1e-6,
suffix = "m",
accuracy = .2
)
) +
coord_map(projection = "mollweide")
p
Load my preferred fonts for plot text and titles.
font_add_google(family = "roboto-slab", "Roboto Slab")
font_add_google(family = "roboto-condensed", "Roboto Condensed")
showtext_auto()
Void the theme by using theme_void
p <- p + theme_void()
p
As the amount of text for the title and subtitle overflow the plot
area, we need to use the element_textbox_simple
from the
ggtext
package to allow for overflowing text and add all
the other plot theme elements.
p <- p +
theme(
text = element_text(family = "roboto-condensed", size = 22),
plot.margin = margin(rep(1, 4), unit = "cm"),
legend.direction = "horizontal",
legend.position = "bottom",
legend.key.height = unit(.8, units = "lines"),
legend.key.width = unit(3.5, units = "lines"),
legend.margin = margin(b = 1, unit = "lines"),
plot.title = element_text(
face = "bold",
size = 26,
family = "roboto-slab",
colour = "#8B1A1A"
),
plot.title.position = "plot",
plot.subtitle = element_markdown(),
plot.caption = element_text(size = 14, hjust = 0, face = "italic"),
plot.caption.position = "plot",
panel.grid = element_line(
colour = "grey30",
size = .2,
linetype = "dashed"
)
)
p
Overall, for a first attempt at using the maps data sets and plotting data, I’m happy with the result.