Visualising UK Covid-19 Data - Part One

A recent news report suggested that Covid-19 cases in England and Wales are increasing once more.

Author: Graham Cox

Published: March 2, 2023

Overview

A recent news report (I cannot remember where) suggested that the number of Covid-19 cases in England and Wales has been increasing since the start of February 2023.

During the first Covid-19 lockdown in the UK, I started downloading the UK government’s Covid-19 datasets to build my knowledge and skills in using the ggplot2 package to create visualisations of the data.

The range of data available has expanded considerably since then, and the data can now be downloaded via a URL rather than by using an API and custom functions.

Downloading Covid-19 Data

Previously, a package was available to download the data. I used it on several occasions but found it slow, and the data was sometimes inconsistent or had missing values.

Looking again at the UK government’s Covid-19 Dashboard, an option is available to create a custom URL that will create a CSV file containing the required data.

Download URL

Using the link above and searching through the many available metrics, I settled on the metric named newCasesBySpecimenDate. The Metrics Documentation for this item states:

COVID-19 cases are identified by taking specimens from people and testing them for the SARS-CoV-2 virus. If the test is positive, this is a case.

Using the URL built on the download page, we can obtain the data using read_csv from the readr package.

# Load packages (lubridate and scales are used in later steps)
library(tidyverse)
library(lubridate)
library(scales)

# Download URL, built in two parts so the string contains no line break
url <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
              "areaType=utla&metric=newCasesBySpecimenDate&format=csv")

df <- read_csv(url) %>% 
  # clean column names
  janitor::clean_names() %>%
  # add new variable based on first letter of area_code
  mutate(country = if_else(substr(area_code, 1, 1) == "E", 
                           "England", 
                           "Wales")) %>% 
  # drop column listing government area type
  select(-area_type)

# Write to CSV file
write_csv(df, file = "all_areas.csv")

# A tibble: 6 × 5
  area_code area_name                 date       new_cases_b…¹ country
  <chr>     <chr>                     <date>             <dbl> <chr>  
1 E06000003 Redcar and Cleveland      2023-03-08             5 England
2 E06000014 York                      2023-03-08            14 England
3 E06000050 Cheshire West and Chester 2023-03-08            18 England
4 E08000001 Bolton                    2023-03-08             7 England
5 E08000016 Barnsley                  2023-03-08            14 England
6 E08000031 Wolverhampton             2023-03-08            12 England
# … with abbreviated variable name ¹new_cases_by_specimen_date
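The download URL used above is a base address followed by three query parameters (areaType, metric and format). As a minimal sketch, a small helper can assemble it so that the metric or area type can be swapped without editing a long string. The function name build_covid_url is hypothetical and not part of any package; the default values are simply the ones used in this post.

```r
# Hypothetical helper: assemble the dashboard download URL from its parts.
build_covid_url <- function(area_type = "utla",
                            metric = "newCasesBySpecimenDate",
                            format = "csv") {
  base <- "https://api.coronavirus.data.gov.uk/v2/data"
  query <- paste0("areaType=", area_type,
                  "&metric=", metric,
                  "&format=", format)
  paste0(base, "?", query)
}

url <- build_covid_url()
print(url)
```

The result can be passed straight to read_csv(), and because the string is built with paste0() it cannot pick up a stray line break.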

Initial Analysis and Plot

What sort of data do we have in the dataset? The output of str() and summary() gives an overview.

spc_tbl_ [232,016 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ area_code                 : chr [1:232016] "E06000003" "E06000014" "E06000050" "E08000001" ...
 $ area_name                 : chr [1:232016] "Redcar and Cleveland" "York" "Cheshire West and Chester" "Bolton" ...
 $ date                      : Date[1:232016], format: "2023-03-08" ...
 $ new_cases_by_specimen_date: num [1:232016] 5 14 18 7 14 12 23 12 3 5 ...
 $ country                   : chr [1:232016] "England" "England" "England" "England" ...
 - attr(*, "spec")=
  .. cols(
  ..   area_code = col_character(),
  ..   area_name = col_character(),
  ..   date = col_date(format = ""),
  ..   new_cases_by_specimen_date = col_double(),
  ..   country = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

  area_code          area_name              date           
 Length:232016      Length:232016      Min.   :2020-01-30  
 Class :character   Class :character   1st Qu.:2020-11-30  
 Mode  :character   Mode  :character   Median :2021-08-28  
                                       Mean   :2021-08-29  
                                       3rd Qu.:2022-05-26  
                                       Max.   :2023-03-08  
 new_cases_by_specimen_date   country         
 Min.   :   0.0             Length:232016     
 1st Qu.:   9.0             Class :character  
 Median :  31.0             Mode  :character  
 Mean   : 104.2                               
 3rd Qu.: 108.0                               
 Max.   :6865.0
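The country column created earlier assumes that every area code starts with either E or W. A quick count of the first letters would confirm this. The sketch below runs on a few illustrative codes rather than the downloaded data, and the toy data frame is an assumption made purely for the example.

```r
# Toy data standing in for the downloaded tibble (values are illustrative).
toy <- data.frame(
  area_code = c("E06000003", "E08000001", "W06000015", "S12000036"),
  stringsAsFactors = FALSE
)

# Count rows by the first letter of the area code.
prefix_counts <- table(substr(toy$area_code, 1, 1))
print(prefix_counts)

# Any prefix other than E or W would be labelled "Wales" by the if_else() call.
unexpected <- setdiff(names(prefix_counts), c("E", "W"))
print(unexpected)
```

If unexpected comes back empty on the real data, the two-way split into England and Wales is safe.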

Let’s do a quick plot of the data for the whole period and all areas.

# Basic line ggplot of all data
df %>%
  ggplot(aes(date, new_cases_by_specimen_date)) +
  geom_line()

Figure 1: A simple line plot

There is definitely a peak of cases at the end of 2021 or the start of 2022. Let’s tidy the data and change the plot type to a column chart so we can see a clearer picture of the data.
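Part of the reason the line plot is hard to read is that every date has one row per area, so geom_line() draws through many values for the same date. One possible alternative, not used in this post, is to sum the areas to a daily total before plotting. The sketch below uses made-up numbers in place of the downloaded data.

```r
library(dplyr)
library(ggplot2)

# Made-up data: two areas reporting on the same three dates.
toy <- tibble(
  area_name = rep(c("York", "Bolton"), each = 3),
  date = rep(as.Date("2023-03-06") + 0:2, times = 2),
  new_cases_by_specimen_date = c(10, 12, 14, 5, 6, 7)
)

# One row per date, so the line has a single value at each point in time.
daily <- toy %>%
  group_by(date) %>%
  summarise(cases = sum(new_cases_by_specimen_date), .groups = "drop")

p <- ggplot(daily, aes(date, cases)) +
  geom_line()
```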

A better plot

So that there is not so much data in the plot, let’s add new variables for year and month and group the data by those values to obtain a clearer view.

Summarise the data

all_areas_df <- df %>%
  # Add yr, month labels
  mutate(yr = year(date) %>% as.factor(),
         mth = month(date)
         ) %>%
  # Group data
  group_by(yr, mth) %>%
  # Calculate total cases
  summarise(cases = sum(new_cases_by_specimen_date), .groups = "drop")

Rows: 39
Columns: 3
$ yr    <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, …
$ mth   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, …
$ cases <dbl> 1, 57, 36486, 132491, 78120, 27923, 20589, 34239, 1491…
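A quick way to check a summary like this is to confirm that the monthly totals add back up to the overall total. The following is a small sketch on made-up data, using the same grouping steps; the toy tibble is an assumption for illustration only.

```r
library(dplyr)
library(lubridate)

# Made-up data spanning a year boundary.
toy <- tibble(
  date = as.Date(c("2022-12-30", "2022-12-31", "2023-01-01", "2023-01-02")),
  new_cases_by_specimen_date = c(3, 4, 5, 6)
)

monthly <- toy %>%
  mutate(yr = as.factor(year(date)),
         mth = month(date)) %>%
  group_by(yr, mth) %>%
  summarise(cases = sum(new_cases_by_specimen_date), .groups = "drop")

# Monthly totals should sum to the total of the raw data.
stopifnot(sum(monthly$cases) == sum(toy$new_cases_by_specimen_date))
```

The same stopifnot() line, pointed at all_areas_df and df, would flag any rows lost during grouping.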

Basic column plot

all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Facet data to show years and months
  facet_wrap(~ yr) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J", 
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01,0))

Figure 2: A simple faceted column plot
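The twelve single-letter labels can also be derived from R's built-in month.abb constant rather than typed by hand. This is only a convenience; it should give the same labels as the vector used in the plot code.

```r
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec".
month_initials <- substr(month.abb, 1, 1)
print(month_initials)
```

The result could then be supplied as labels = month_initials in scale_x_continuous().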

Our initial reading, that the main peak of cases came at the end of 2021 and the start of 2022, was correct, but we can still do better.

Changing the layout and palette

The standard ggplot2 colour scheme could definitely be better, so let’s make a new palette of colours using the colorspace package.

# Create colour palette using the burgyl (Burgundy-Yellow) palette
pal <- colorspace::sequential_hcl(length(unique(all_areas_df$yr)), palette = "burgyl")
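sequential_hcl() returns a plain character vector of hex colour codes, one per requested colour. As a sketch, the colours can be given names so that each one is tied to a particular year; scale_fill_manual() matches a named values vector to the factor levels by name rather than by position. The yrs vector below is a hypothetical stand-in for the year levels in all_areas_df.

```r
library(colorspace)

# Hypothetical year levels standing in for levels(all_areas_df$yr).
yrs <- c("2020", "2021", "2022", "2023")

# One colour per year, returned as hex strings.
pal <- sequential_hcl(length(yrs), palette = "BurgYl")

# Naming the colours ties each one to a specific year.
pal_named <- setNames(pal, yrs)
print(pal_named)
```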

Our new layout will plot all the years horizontally across the plot, with a different colour for each year from the pal palette created above.

all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J", 
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01,0)) +
  # Facet years in a single row, with strip labels below the panels
  facet_wrap(~ yr, nrow = 1, strip.position = "bottom") +
  # Apply colour palette to columns
  scale_fill_manual(
    breaks = unique(all_areas_df$yr),
    values = pal
  ) +
  # Expand y axis and format labels
  scale_y_continuous(labels = label_comma(scale = 1e-6, accuracy = .2, suffix = " m"),
                     expand = expansion(mult = c(0,.1)))

Figure 3: A horizontal facet column plot
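The y-axis labels come from label_comma(), which returns a function that turns numbers into text. Here scale = 1e-6 converts raw counts to millions, suffix appends " m", and accuracy is the value the scaled number is rounded to, so .2 rounds to the nearest 0.2 million (accuracy = .1 would round to every tenth instead). A short sketch with illustrative input values shows the labeller on its own:

```r
library(scales)

# Same labeller as in the plot: divide by one million, round to the
# nearest 0.2, and append " m".
millions <- label_comma(scale = 1e-6, accuracy = .2, suffix = " m")

labels <- millions(c(0, 2e6, 4e6))
print(labels)
```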

Adding labels

We can now add some labels with some descriptive text explaining the chart to the viewer.

# Title
title_text <- "How many Covid-19 cases were submitted each month?"

# Extract items for use in subtitle
# Sum of all cases
all_cases <- sum(all_areas_df$cases)

# The year with the max number of cases
yr_max <- with(all_areas_df, yr[which.max(cases)])

# The month name with the max number of cases
mth_max <- month.name[with(all_areas_df, mth[which.max(cases)])]

# The actual max number of cases for month
mth_cases_max <- with(all_areas_df, cases[which.max(cases)])

# Calculate the percent
mth_cases_pct <- percent(mth_cases_max / all_cases, accuracy = .2)

# Create Subtitle
subtitle_text <-
  paste(
    mth_max,
    yr_max,
    "has the highest number of Covid-19 cases in England and Wales,\nwith a total of",
    comma(mth_cases_max),
    "cases. This represents",
    mth_cases_pct,
    "of the",
    comma(all_cases),
    "cases submitted for the complete period."
  )

# Create Caption
caption_text <-
  "Source: UK Health Security Agency at https://coronavirus.data.gov.uk/"
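The lookups above each call which.max() separately. As an alternative sketch, the peak row can be pulled out once and its fields read from it, and base R's strwrap() can break a long sentence into lines instead of a manual line break. The toy data frame and the width of 40 characters are assumptions chosen for illustration.

```r
# Made-up monthly totals standing in for all_areas_df.
toy <- data.frame(
  yr = factor(c("2021", "2021", "2022")),
  mth = c(11, 12, 1),
  cases = c(100, 250, 400)
)

# Pull the peak row once, then read each field from it.
peak <- toy[which.max(toy$cases), ]
peak_label <- paste(month.name[peak$mth], peak$yr)

# Wrap a long sentence at roughly 40 characters per line.
sentence <- paste(peak_label, "has the highest number of cases,",
                  "with a total of", peak$cases, "cases.")
wrapped <- paste(strwrap(sentence, width = 40), collapse = "\n")
cat(wrapped, "\n")
```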

Create the final plot

Now we have our labels, let’s create the final plot, along with some amendments to the legend and theme.

all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Add labels
  labs(
    title = title_text,
    subtitle = subtitle_text,
    caption = caption_text) +
  # Facet data to show years and months
  facet_wrap(~ yr, nrow = 1) +
  # Apply colour palette to columns
  scale_fill_manual(
    breaks = unique(all_areas_df$yr),
    values = pal,
    guide = guide_legend(
      title = "Year", title.position = "top",
      title.theme = element_text(size = 10, 
                                 family = "roboto-condensed", 
                                 face = "bold"),
      label.position = "bottom"
    )
  ) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J", 
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01,0)) +
  # Expand y axis and format labels
  scale_y_continuous(labels = label_comma(scale = 1e-6, 
                                          accuracy = .2, 
                                          suffix = " m"),
                     expand = expansion(mult = c(0,.1))) +
  # Amend theme for plot elements
  theme(
    # Axis elements
    axis.line.x = element_line(colour = "grey70"),
    axis.text.x = element_text(colour = "grey60"),
    axis.title = element_blank(),
    # Panel elements
    panel.grid.major.y = element_line(colour = "grey70", 
                                      linewidth = .2, 
                                      linetype = "dashed"),
    panel.spacing = unit(0,'lines'),
    # Remove Facet strip
    strip.text = element_blank(),
    # Legend formatting
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.key.height = unit(.6, "lines"),
    legend.key.width = unit(2, "lines"),
    legend.spacing.x = unit(1, "lines")
  )

Figure 4: The final plot

Conclusion

In the next part, we shall zoom in on the data just for 2023 and take a closer look at the data for the area of the UK where I live: North East England.