A recent news report suggested that Covid-19 cases in England and Wales are increasing once more.
A recent news report (I cannot remember where) suggested that the number of Covid-19 cases in England and Wales has been increasing since the start of February 2023.
During the first Covid-19 lockdown in the UK, I started downloading the UK government’s Covid-19 datasets to build my knowledge of the ggplot2 package and practise creating visualisations of the data.
The range of data available has expanded considerably since then, and it can now be downloaded via a URL rather than through an API and custom functions.
Previously, a package was available to download the data. I used it on several occasions but found it slow, and the data was sometimes inconsistent or had missing values.
The UK government’s Covid-19 Dashboard now offers an option to build a custom URL that returns a CSV file containing the required data.
Using the link above and searching through the many metrics available, I settled on the metric named newCasesBySpecimenDate. The Metrics Documentation for this item states:
COVID-19 cases are identified by taking specimens from people and testing them for the SARS-CoV-2 virus. If the test is positive, this is a case.
Using the URL built on the download page, we can obtain the data using read_csv from the readr package.
# Load packages
library(tidyverse)  # readr, dplyr, ggplot2
library(lubridate)  # year(), month()
library(scales)     # label_comma(), percent(), comma()

# Download URL (joined with paste0 so the string contains no line break)
url <- paste0("https://api.coronavirus.data.gov.uk/v2/data?",
              "areaType=utla&metric=newCasesBySpecimenDate&format=csv")
df <- read_csv(url) %>%
  # clean column names
  janitor::clean_names() %>%
  # add new variable based on first letter of area_code
  mutate(country = if_else(substr(area_code, 1, 1) == "E",
                           "England",
                           "Wales")) %>%
  # drop column listing government area type
  select(-3)
# Write to CSV file
write_csv(df, file = "all_areas.csv")
# A tibble: 6 × 5
area_code area_name date new_cases_b…¹ country
<chr> <chr> <date> <dbl> <chr>
1 E06000003 Redcar and Cleveland 2023-03-08 5 England
2 E06000014 York 2023-03-08 14 England
3 E06000050 Cheshire West and Chester 2023-03-08 18 England
4 E08000001 Bolton 2023-03-08 7 England
5 E08000016 Barnsley 2023-03-08 14 England
6 E08000031 Wolverhampton 2023-03-08 12 England
# … with abbreviated variable name ¹new_cases_by_specimen_date
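The download URL above is a base address followed by key=value query parameters. As a minimal sketch, it could also be assembled from its parts, which makes it easier to swap in a different metric. The helper name `build_covid_url` is hypothetical and not part of any package. The base address and parameter names are copied from the URL used above.

```r
# Hypothetical helper: assemble a dashboard download URL from its parts.
# The base address and parameter names are taken from the URL used above.
build_covid_url <- function(area_type, metric, format = "csv") {
  base <- "https://api.coronavirus.data.gov.uk/v2/data"
  query <- c(areaType = area_type, metric = metric, format = format)
  # Percent-encode each value, then join as key=value pairs separated by "&"
  pairs <- paste0(names(query), "=",
                  vapply(query, utils::URLencode, character(1), reserved = TRUE))
  paste0(base, "?", paste(pairs, collapse = "&"))
}

url <- build_covid_url("utla", "newCasesBySpecimenDate")
print(url)
```

The resulting string can be passed to read_csv() in the same way as the hand-written URL.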
What sort of data do we have in the dataset?
spc_tbl_ [232,016 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ area_code : chr [1:232016] "E06000003" "E06000014" "E06000050" "E08000001" ...
$ area_name : chr [1:232016] "Redcar and Cleveland" "York" "Cheshire West and Chester" "Bolton" ...
$ date : Date[1:232016], format: "2023-03-08" ...
$ new_cases_by_specimen_date: num [1:232016] 5 14 18 7 14 12 23 12 3 5 ...
$ country : chr [1:232016] "England" "England" "England" "England" ...
- attr(*, "spec")=
.. cols(
.. area_code = col_character(),
.. area_name = col_character(),
.. date = col_date(format = ""),
.. new_cases_by_specimen_date = col_double(),
.. country = col_character()
.. )
- attr(*, "problems")=<externalptr>
area_code area_name date
Length:232016 Length:232016 Min. :2020-01-30
Class :character Class :character 1st Qu.:2020-11-30
Mode :character Mode :character Median :2021-08-28
Mean :2021-08-29
3rd Qu.:2022-05-26
Max. :2023-03-08
new_cases_by_specimen_date country
Min. : 0.0 Length:232016
1st Qu.: 9.0 Class :character
Median : 31.0 Mode :character
Mean : 104.2
3rd Qu.: 108.0
Max. :6865.0
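Given the earlier problems with missing values, a couple of quick checks may be worth running before plotting. The sketch below uses a small invented tibble with the same column names as df, so the values are purely illustrative. It counts missing case values and tallies rows by the first letter of area_code. The second check matters because the if_else() used above labels every code that does not start with "E" as "Wales".

```r
library(dplyr)

# Toy stand-in for df with the same column names (values are invented)
toy <- tibble(
  area_code = c("E06000003", "E06000014", "W06000001", "S12000005"),
  new_cases_by_specimen_date = c(5, NA, 3, 2)
)

# Count missing case values
n_missing <- sum(is.na(toy$new_cases_by_specimen_date))

# Count rows by the first letter of area_code
prefix_counts <- toy %>%
  count(prefix = substr(area_code, 1, 1))

print(n_missing)
print(prefix_counts)
```

Running the same two checks on df would show whether any prefixes other than "E" and "W" are present.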
Let’s do a quick plot of the data for the whole period and all areas.
# Basic line ggplot of all data
df %>%
  ggplot(aes(date, new_cases_by_specimen_date)) +
  geom_line()
Figure 1: A simple line plot
There is definitely a peak of cases at the end of 2021 or the start of 2022. Let’s tidy the data and change the plot type to a column chart to get a clearer picture.
To reduce the amount of data in each plot, let’s add new variables for year and month, then group the data by those values and total the cases for each month.
all_areas_df <- df %>%
  # Add yr, month labels
  mutate(yr = year(date) %>% as.factor(),
         mth = month(date)
  ) %>%
  # Group data
  group_by(yr, mth) %>%
  # Calculate total cases
  summarise(cases = sum(new_cases_by_specimen_date), .groups = "drop")
Rows: 39
Columns: 3
$ yr <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, …
$ mth <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, …
$ cases <dbl> 1, 57, 36486, 132491, 78120, 27923, 20589, 34239, 1491…
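To make the grouping step concrete, here is a small worked example on invented data. It is a sketch of one alternative rather than the approach used in this post: it groups on a single month-start date created with lubridate's floor_date() instead of separate yr and mth columns. The na.rm = TRUE argument is an added assumption so that a missing daily value would not turn a whole monthly total into NA.

```r
library(dplyr)
library(lubridate)

# Invented daily values for illustration only
toy <- tibble(
  date = as.Date(c("2022-01-05", "2022-01-20", "2022-02-01", "2023-01-10")),
  new_cases_by_specimen_date = c(10, 5, 7, 3)
)

monthly <- toy %>%
  # Collapse each date to the first day of its month
  mutate(month_start = floor_date(date, "month")) %>%
  group_by(month_start) %>%
  # Total cases per month, ignoring missing values
  summarise(cases = sum(new_cases_by_specimen_date, na.rm = TRUE),
            .groups = "drop")

print(monthly)
```

A single date column keeps months in chronological order across years, while separate yr and mth columns are more convenient for faceting by year as done here.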
all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Facet data to show years and months
  facet_wrap(~ yr) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J",
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01, 0))
Figure 2: A simple faceted column plot
Our initial reading of the main peak of cases, at the end of 2021 and the start of 2022, was correct, but we can still do better.
The standard ggplot2 colour scheme could definitely be better, so let’s make a new palette of colours using the colorspace package.
# Create colour palette using the burgyl (Burgundy-Yellow) palette
pal <- colorspace::sequential_hcl(length(unique(all_areas_df$yr)), palette = "burgyl")
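As an optional variation, the palette can be stored as a named vector so that each colour is tied to a specific year rather than to its position. This is a sketch under the assumption that the years in the data are 2020 to 2023, and the name `pal_named` is made up for illustration.

```r
library(colorspace)

# Assumed year levels, matching the years present in all_areas_df$yr
yrs <- c("2020", "2021", "2022", "2023")

# One colour per year, named by the year it belongs to
pal_named <- setNames(sequential_hcl(length(yrs), palette = "BurgYl"), yrs)

print(pal_named)
```

Passing a named vector to scale_fill_manual(values = pal_named) matches colours to factor levels by name, so the mapping stays the same if the data are filtered or reordered.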
Our new layout will plot all the years horizontally across the plot, with a different colour for each year from the pal palette created above.
all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J",
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01, 0)) +
  # Facet data to show years in a single row, with strips at the bottom
  facet_wrap(~ yr, nrow = 1, strip.position = "bottom") +
  # Apply colour palette to columns
  scale_fill_manual(
    breaks = unique(all_areas_df$yr),
    values = pal
  ) +
  # Expand y axis and format labels
  scale_y_continuous(labels = label_comma(scale = 1e-6, accuracy = .2, suffix = " m"),
                     expand = expansion(mult = c(0, .1)))
Figure 3: A horizontal facet column plot
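The y axis labels come from scales::label_comma(), which returns a function that converts numbers to text. The short sketch below shows how the scale, accuracy and suffix arguments combine. Note that accuracy is the value to round to, so this example uses 0.1 (one decimal place) purely for illustration, and the name `to_millions` is hypothetical.

```r
library(scales)

# Build a labelling function: divide by one million, round to 0.1, add " m"
to_millions <- label_comma(scale = 1e-6, accuracy = 0.1, suffix = " m")

# Apply it to a few raw case counts
labels <- to_millions(c(0, 2500000, 3000000))
print(labels)
```

Because the result is an ordinary function, it can be tested on a few values like this before being passed to scale_y_continuous(labels = ...).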
We can now add some labels with some descriptive text explaining the chart to the viewer.
# Title
title_text <- "How many Covid-19 cases were submitted each month?"
# Extract items for use in subtitle
# Sum of all cases
all_cases <- sum(all_areas_df$cases)
# The year with the max number of cases
yr_max <- with(all_areas_df, yr[which.max(cases)])
# The month name with the max number of cases
mth_max <- month.name[with(all_areas_df, mth[which.max(cases)])]
# The actual max number of cases for month
mth_cases_max <- with(all_areas_df, cases[which.max(cases)])
# Calculate the percent (accuracy is the value to round to, here 0.1)
mth_cases_pct <- percent(mth_cases_max / all_cases, accuracy = 0.1)
# Create Subtitle
subtitle_text <-
  paste(
    mth_max,
    yr_max,
    "has the highest number of Covid-19 cases in England and Wales,\nwith a total of",
    comma(mth_cases_max),
    "cases. This represents",
    mth_cases_pct,
    "of the",
    comma(all_cases),
    "cases submitted for the complete period."
  )
# Create Caption
caption_text <-
  "Source: UK Health Security Agency at https://coronavirus.data.gov.uk/"
Now we have our labels, let’s create the final plot, along with some amendments to the legend and theme.
all_areas_df %>%
  ggplot(aes(mth, cases)) +
  geom_col(aes(fill = yr)) +
  # Add labels
  labs(
    title = title_text,
    subtitle = subtitle_text,
    caption = caption_text) +
  # Facet data to show years and months
  facet_wrap(~ yr, nrow = 1) +
  # Apply colour palette to columns
  scale_fill_manual(
    breaks = unique(all_areas_df$yr),
    values = pal,
    guide = guide_legend(
      title = "Year", title.position = "top",
      title.theme = element_text(size = 10,
                                 family = "roboto-condensed",
                                 face = "bold"),
      label.position = "bottom"
    )
  ) +
  # Apply labels to numeric values
  scale_x_continuous(breaks = 1:12,
                     labels = c("J", "F", "M", "A", "M", "J",
                                "J", "A", "S", "O", "N", "D"),
                     expand = c(0.01, 0)) +
  # Expand y axis and format labels
  scale_y_continuous(labels = label_comma(scale = 1e-6,
                                          accuracy = .2,
                                          suffix = " m"),
                     expand = expansion(mult = c(0, .1))) +
  # Amend theme for plot elements
  theme(
    # Axis elements
    axis.line.x = element_line(colour = "grey70"),
    axis.text.x = element_text(colour = "grey60"),
    axis.title = element_blank(),
    # Panel elements
    panel.grid.major.y = element_line(colour = "grey70",
                                      linewidth = .2,
                                      linetype = "dashed"),
    panel.spacing = unit(0, "lines"),
    # Remove Facet strip
    strip.text = element_blank(),
    # Legend formatting
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.justification = "left",
    legend.key.height = unit(.6, "lines"),
    legend.key.width = unit(2, "lines"),
    legend.spacing.x = unit(1, "lines")
  )
Figure 4: The final plot
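The peak month quoted in the subtitle was found with three separate which.max() lookups. As a possible alternative, dplyr's slice_max() can pull out the whole peak row in one step. The sketch below runs on invented numbers with the same column names as all_areas_df, and the object names `peak`, `peak_label` and `peak_share` are hypothetical.

```r
library(dplyr)

# Invented monthly totals with the same columns as all_areas_df
toy <- tibble(
  yr = factor(c(2021, 2021, 2022, 2022)),
  mth = c(11, 12, 1, 2),
  cases = c(100, 250, 400, 150)
)

# Keep only the row with the largest number of cases
peak <- toy %>%
  slice_max(cases, n = 1, with_ties = FALSE)

# Build a "Month Year" label and the share of all cases
peak_label <- paste(month.name[peak$mth], peak$yr)
peak_share <- peak$cases / sum(toy$cases)

print(peak_label)
print(peak_share)
```

Keeping the peak as a single row avoids repeating the same lookup and makes it harder for the year, month and count to drift out of step.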
In the next part, we shall zoom in on the data for 2023 alone and take a closer look at the area of the UK where I live: North East England.