Last semester (Fall 2020) I taught a new course in healthcare data science for the Johnson Shoyama Graduate School of Public Policy. One of the final topics of the course was querying application programming interfaces (APIs) from within R. The example we used was querying data on the Covid-19 pandemic from the Covid-19 Tracker Canada, which has a simple API that’s easy to work with. In this post I’ll show how we accessed the API from within R and converted the query responses into something we can work with easily.
There are many ways of querying APIs in R via a range of packages. Here, I’m going to use httr 📦 to query the API and jsonlite 📦 to convert the API’s response into something more useful. The packages we need are listed in the chunk below; if you don’t have them, uncomment the `install.packages()` line and change `Ncpus` to something suitable for your computer.
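A minimal sketch of such a setup chunk follows. The package list is my assumption, based on the packages mentioned in this post, and `Ncpus = 4` is only a placeholder value.

```r
# Uncomment to install; Ncpus = 4 is a placeholder, adjust for your machine
# install.packages(c("httr", "jsonlite", "ggplot2"), Ncpus = 4)

pkgs <- c("httr", "jsonlite", "ggplot2")

# Load each package; logical.return = TRUE gives TRUE/FALSE instead of an error
loaded <- vapply(pkgs, library, logical(1),
                 character.only = TRUE, logical.return = TRUE)
loaded
```

Any `FALSE` in `loaded` points to a package that still needs installing.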
The kind of API we’re going to query is a RESTful API, where REST stands for REpresentational State Transfer. To do the query we need to identify the resource we want to query and then send the query using HTTP, the HyperText Transfer Protocol. The resource identity is specified using a uniform resource identifier, or URI.
The URI comprises four parts:
- the protocol
- the base URI
- the endpoint
- additional query parameters
For the Covid-19 Tracker Canada, we’ll use the HTTPS protocol for secure HTTP, and its base URI is `api.covid19tracker.ca`. The endpoint is the specific location of the data you want to access. For the API we’re querying, endpoints include `/reports` and `/reports/province`, both of which are used in this post.
Endpoints can also allow multiple sub-resources; these are variables and take the form `:var_name`. For example, the `/reports/province` endpoint allows the province to be specified as a sub-resource. It is documented as `/reports/province/:code`, so we would specify endpoints as `/reports/province/SK` etc., where we are setting `:code` to `SK`.
The final part of the URI is the set of query parameters, which allow some fine control over what is requested from the endpoint. These are added as key-value pairs following a `?`, and pairs are separated with `&`. The key is the name of the parameter, and the value is what you want to pass to that parameter. For example, when querying cases, we can specify the province as one key-value pair and how many cases are returned per page as another.
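To make the key-value syntax concrete, here is a small base R sketch that assembles a query string from a named list. The parameter names `province` and `per_page` and the value `50` are assumptions for illustration; check the API documentation for the names it actually accepts.

```r
# Hypothetical parameter names and values, for illustration only
params <- list(province = "SK", per_page = 50)

# Join each key to its value with "=", the pairs with "&", and prefix with "?"
query <- paste0(
  "?",
  paste(names(params), unlist(params), sep = "=", collapse = "&")
)
query
```

This simple version does no URL encoding, so it only suits values without spaces or special characters. In practice you can instead pass a named list to the `query` argument of httr's `GET()`, which handles the encoding for you.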
Which endpoints and query parameters are supported is documented by each API, so always take some time to familiarise yourself with the documentation of the API you are trying to access. For the Covid-19 Tracker Canada the documentation is also at api.covid19tracker.ca.
It’s usually best to build the URI up from these parts, stored as separate objects within R.
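One way this might look, using the protocol, base URI, and endpoint described above (the object names are my own choice):

```r
# The parts of the URI, stored separately so each is easy to change
protocol <- "https"
base_uri <- "api.covid19tracker.ca"
endpoint <- "/reports/province/SK"

# Glue the parts together into the full URI
uri <- paste0(protocol, "://", base_uri, endpoint)
uri
```

Keeping the parts separate means that switching province or endpoint only touches one object.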
The HTTP request involves using a verb and the URI; here we will use the `GET` verb. In httr 📦 the `GET` verb is found in the `GET()` function.
The response consists of two parts
- the headers
- the body
The headers contain information about the request and response, while the body contains the result of the query. You can access these components of the response using the `headers()` and `content()` functions.
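Putting the request and the two parts of the response together, a sketch might look like the following. The `tryCatch()` wrapper and the `timeout(10)` are my additions so that the code fails gracefully without a network connection; they are not required for the request itself.

```r
library(httr)

uri <- "https://api.covid19tracker.ca/reports/province/SK"

# Send the GET request; on a connection error `response` is set to NULL
response <- tryCatch(GET(uri, timeout(10)), error = function(e) NULL)

if (!is.null(response)) {
  print(status_code(response))                # HTTP status of the response
  print(headers(response)[["content-type"]])  # a single header field
  body <- content(response, as = "text", encoding = "UTF-8")
  print(substr(body, 1, 80))                  # first 80 characters of the body
}
```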
When you print `response` you’ll see a brief summary of the response metadata.
The status code is important; 200 means success and anything else likely indicates some form of failure. Keep an eye on the status code of your queries. If you’re wrapping this code in a function, use the `warn_for_status()` and `stop_for_status()` functions to query the status; they throw a warning or an error, respectively, if the request failed.
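As a rough sketch of how that could look inside a function, here is a hypothetical wrapper, `checked_get()`, whose `strict` argument (also my invention) chooses between the two behaviours:

```r
library(httr)

# Hypothetical wrapper around GET(); `strict` picks error versus warning
checked_get <- function(uri, strict = TRUE) {
  response <- GET(uri)
  if (strict) {
    stop_for_status(response)  # throw an error if the request failed
  } else {
    warn_for_status(response)  # only warn; execution continues
  }
  response
}

# Usage (needs a network connection):
# response <- checked_get("https://api.covid19tracker.ca/reports/province/SK")
# status_code(response)
```

Erroring early is usually the safer default, because later code can then assume it has a successful response to work with.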
The body of the response can be accessed as a generic R list, as the raw bytes of the response, or as plain text. Viewed as text, the body turns out to be formatted as JSON.
Above, I used the `prettify()` function to display the JSON in a human-readable format. Note also that I’m specifying the encoding explicitly to be UTF-8 as that’s what my Linux system uses. If you’re not sure about the encoding for your system, just leave the `encoding` argument off and you’ll see a message indicating what encoding was used.
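To see what `prettify()` does without hitting the API, here is a sketch using a short, made-up JSON string in place of the text you would get from `content(response, as = "text", encoding = "UTF-8")`. The field names and values are placeholders, not real API output.

```r
library(jsonlite)

# Made-up JSON standing in for the text body of a real response
txt <- '{"province":"SK","data":[{"date":"2021-01-01","change_cases":10}]}'

# Re-indent the JSON over multiple lines so it is easier to read
pretty <- prettify(txt)
pretty
```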
To actually parse the JSON into a similar R object we use the `fromJSON()` function from jsonlite 📦.
What we’re most interested in is the `$data` component, but you can see that jsonlite 📦 has converted the JSON to an R list and, where appropriate, has converted arrays to data frames, as for `$data` here. Exactly what is returned will be specific to each API, so read the documentation for the API you want and look at the structure of what is returned to identify the names of relevant components etc.
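The conversion is easy to try on a small example. The JSON below is made up to mimic the general shape described above (a top-level object with a `data` array); the `change_cases` field and all of the values are placeholders, not real case counts.

```r
library(jsonlite)

# Made-up JSON: a top-level object holding an array of records
txt <- '{
  "province": "SK",
  "data": [
    {"date": "2021-01-01", "change_cases": 10},
    {"date": "2021-01-02", "change_cases": 12}
  ]
}'

parsed <- fromJSON(txt)

str(parsed)   # a list; the array of records has become a data frame
parsed$data
```

By default `fromJSON()` simplifies an array of similar objects into a data frame, which is why `$data` comes back ready to use.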
Covid-19 cases per day
Now that we’ve had a crash course in querying an API, let’s do something substantive and query the Covid-19 case data for my adopted home province of Saskatchewan. For this we want the `/reports` endpoint and we can specify the province as a sub-resource.
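A sketch of the whole query, from URI to data frame, could look like this. Both helper functions are hypothetical names of my own, and the code assumes the parsed response has a `data` component as described earlier.

```r
library(httr)
library(jsonlite)

# Build the URI for one province's reports from its parts
reports_uri <- function(code) {
  paste0("https://", "api.covid19tracker.ca", "/reports/province/", code)
}

# Hypothetical helper: request the reports and return the `data` component
get_reports <- function(code) {
  response <- GET(reports_uri(code))
  stop_for_status(response)  # throw an error if the request failed
  txt <- content(response, as = "text", encoding = "UTF-8")
  fromJSON(txt)$data
}

# Usage (needs a network connection):
# sk <- get_reports("SK")
# head(sk)
```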
At the moment the `date` variable is stored as a simple character vector. If we convert that to a `'Date'` object, ggplot2 📦 will draw a nicely formatted time axis for us.
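Here is a sketch of the conversion and plot using a toy data frame in place of the real reports. The `change_cases` column name, the `YYYY-MM-DD` date format, and the values are all assumptions for illustration; the numbers are made up and are not real case counts.

```r
library(ggplot2)

# Toy stand-in for the reports data; values are made up
reports <- data.frame(
  date = c("2021-01-01", "2021-01-02", "2021-01-03"),
  change_cases = c(10, 12, 9),
  stringsAsFactors = FALSE
)

# Convert the character dates (assumed YYYY-MM-DD) to class 'Date'
reports$date <- as.Date(reports$date)

# With a 'Date' on the x axis, ggplot2 chooses a date scale automatically
p <- ggplot(reports, aes(x = date, y = change_cases)) +
  geom_line() +
  labs(x = NULL, y = "New cases per day")
p
```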
Yeah, we’re not doing very well in this province 😞🤬
Hope you enjoyed the post — if you have comments or questions, ask them in the Comment section below.