Dividing the world: spatial techniques to acquire online data at scale
As we continue to adapt to the challenges posed by the COVID-19 pandemic, UBDC researchers have moved quickly to develop new programmes of research, focused on the virus’ impacts, both during its current phase and throughout the subsequent recovery period. This ambition is enabled partly through the use of the Centre’s existing data collections, but also demands the rapid acquisition of new data.
Online data sources offer rich opportunities for information extraction, related to themes including mobility, economic activity, education, skills and labour markets, and the use of space and the built environment.
As well as the potential for information discovery, these sources present practical challenges. While our data collection and research are legitimised by laws such as the text and data mining copyright exception, we must navigate often challenging technical limitations in undertaking data collection at scale.
In this technical blog, UBDC data scientists explain the different approaches to the challenges of collecting spatial data (any data with a direct or indirect reference to a specific location or geographical area) and share the method we use.
When you are asked to generate a dataset from spatial data, the task is usually straightforward: you use your favourite tool, service or library and query your dataset with some form of the following strategy:
For everything inside my boundary
Your datasets, held in shapefiles or spatially enabled databases, then determine whatever falls inside your boundaries and return the results.
Then it’s up to you, GIS specialist, to perform the necessary spatial extract-transform-load (ETL) operations to clean the data, bring it up to the required specifications and make it ready for your team to explore further!
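In practice this "everything inside my boundary" query is usually a one-liner in a spatial library or database. The sketch below shows the underlying idea in plain Python, with a hand-rolled point-in-polygon test standing in for what a library such as Shapely or a spatially enabled database would do for you; the boundary and points are illustrative.

```python
# Minimal stand-in for a spatial "inside my boundary" query.
# A real workflow would use Shapely/GeoPandas or a spatial database instead.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside `polygon`, a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray extending right from (x, y)
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A square boundary and a handful of candidate points (illustrative data)
boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
points = [(5, 5), (12, 3), (1, 9), (-2, -2)]

inside = [p for p in points if point_in_polygon(p[0], p[1], boundary)]
```

The filtered list keeps only the two points that fall inside the square; everything downstream (the ETL steps above) then works on that subset.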
At such times, your primary source of information is likely to be an external provider, often one that generates data as part of, or as an output of, its business. In many cases, such data are made accessible through one or more RESTful APIs maintained by the data owner to reflect their own needs.
A RESTful API is an application programming interface (API) that uses HTTP requests to GET, PUT, POST and DELETE data from a usually remote source.
Complications when targeting specific spatial areas
The picture becomes more complicated when data collection efforts target specific spatial areas. Unfortunately, most of the time data APIs are not designed to understand boundaries, as these can be very complex: they can range from a simple rectangular box to the shape of a country, and their serialisation and de-serialisation is not a trivial matter. Nevertheless, API developers may still be expected to provide some spatial filtering. They may choose to implement something very basic, such as supporting just a bounding box (i.e. west, north, east and south coordinates) instead of a real-world or more complex boundary.
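A bounding-box filter of this kind typically arrives at the API as a query-string parameter. The sketch below builds such a request URL with the standard library; the endpoint and parameter names (`bbox`, `limit`) are invented for illustration, since every real API documents its own.

```python
# Hypothetical sketch of serialising a bounding box into a GET request URL.
# The endpoint and parameter names are assumptions, not a real API.
from urllib.parse import urlencode

def build_query_url(base_url, bbox, limit=100):
    """Serialise a bounding box (a 4-tuple of coordinates) into a query string."""
    params = {"bbox": ",".join(str(c) for c in bbox), "limit": limit}
    return f"{base_url}?{urlencode(params)}"

url = build_query_url("https://api.example.com/v1/points", (-4.4, 55.8, -4.1, 55.9))
# The URL would then be fetched with e.g. urllib.request or the requests library.
```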
This can present challenges in a workflow, as it is now up to you, the requester, to translate your area of interest (AOI) boundary into a set of bounding boxes that can be used as target inputs.
The most straightforward approach might be to generate a single bounding box from your boundary and use its coordinates as the input to the API. This would be partially effective, but probably quite limited. Most APIs are configured to return only a set number of results, up to an upper threshold, and your area of interest and corresponding bounding box could conceivably cover the whole world! Depending on the contents of the database, it could take ages to fetch the data and serialise them for delivery over the Internet.
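Computing that single bounding box is trivial, which is part of its appeal. The sketch below does so for an L-shaped area of interest (illustrative coordinates); note how the resulting box also covers the whole notch the boundary excludes, which is exactly why this approach over-fetches.

```python
# Sketch of the naive approach: collapse an arbitrary boundary to one
# axis-aligned bounding box and use that as the API input.

def bounding_box(boundary):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a vertex list."""
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    return (min(xs), min(ys), max(xs), max(ys))

# An L-shaped area of interest: the box also covers the empty notch
aoi = [(0, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]
print(bounding_box(aoi))  # → (0, 0, 10, 10)
```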
The bottom-up approach
At the other extreme, you could fragment your boundary into many small boxes – sub-containers of the overall boundary. You could then have more confidence that the API would return a full set of results for each one.
There are two main problems with this approach. Firstly, it’s empirical, meaning that you have no guarantee that your chosen scale will work in all cases – after all, who decides what is small enough? Secondly, it could create a very dense grid, in which many cells contain little or even no data, generating redundant calls to the API. Since many providers also impose limits on the number of requests permitted within a period, a balance must be struck to ensure the requests that are issued are likely to return an optimal number of results. Even with a permissive API, you would see an increase in overall processing time, and an ever-relevant real-world constraint is that if, for example, you wish to collect hourly data updates, you must be able to do so within an hour!
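The dense-grid pitfall is easy to see in a sketch. Tiling the bounding box with fixed-size cells, at an arbitrarily chosen cell size (the "who decides what is small enough?" problem), immediately turns one request into hundreds, many of which may cover empty space:

```python
# Sketch of the naive fine-grained fragmentation: tile a bounding box with
# fixed-size cells. The cell size here is an arbitrary guess, and every cell
# becomes one API request whether it contains data or not.

def uniform_grid(west, south, east, north, cell_size):
    """Split a bbox into fixed-size cells (the last row/column may be clipped)."""
    cells = []
    y = south
    while y < north:
        x = west
        while x < east:
            cells.append((x, y, min(x + cell_size, east), min(y + cell_size, north)))
            x += cell_size
        y += cell_size
    return cells

cells = uniform_grid(0, 0, 10, 10, 1)  # 100 cells, hence 100 API requests
```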
Generally, one should adopt a balanced approach, aiming for grids of varying size that generate roughly the same amount of data. But this exercise is not trivial, particularly if you want to avoid overlapping rectangular grids (and with them redundant requests and data collection). Trying to combine smaller grids into bigger units might be described as a bottom-up approach.
The top-down approach
A different approach is to work top-down. Starting from one big grid, divide it into four equally sized quarters, each containing roughly the same number of elements, and repeat iteratively until you have achieved an optimised grid. This is essentially an implementation of the quadtree data structure.
A quadtree is a tree data structure in which each internal node has exactly four children. Among other uses, quadtrees can divide 2D space into regions or sectors. The decision on when to split a node into children depends on how many elements that node holds. That way, you can have ‘grids’, or sectors, that each contain at most x elements.
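A minimal point quadtree can be sketched in a few dozen lines. This is an illustrative implementation, not the Centre's production code: each node stores points until it exceeds a capacity, then splits into four equal quadrants and pushes its points down; the leaf bounding boxes are the ‘grids’ described above.

```python
# Minimal point-quadtree sketch: a node holds up to `capacity` points and
# splits into four equal quadrants when that capacity is exceeded.

class QuadTree:
    def __init__(self, west, south, east, north, capacity=4):
        self.bounds = (west, south, east, north)
        self.capacity = capacity
        self.points = []
        self.children = None  # becomes a list of four QuadTrees on split

    def insert(self, x, y):
        w, s, e, n = self.bounds
        if not (w <= x <= e and s <= y <= n):
            return False  # point lies outside this node
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity:
                self._split()
            return True
        # any() short-circuits, so a boundary point lands in one child only
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        w, s, e, n = self.bounds
        mx, my = (w + e) / 2, (s + n) / 2
        self.children = [QuadTree(w, s, mx, my, self.capacity),   # SW
                         QuadTree(mx, s, e, my, self.capacity),   # SE
                         QuadTree(w, my, mx, n, self.capacity),   # NW
                         QuadTree(mx, my, e, n, self.capacity)]   # NE
        for x, y in self.points:  # push stored points down into the quadrants
            any(child.insert(x, y) for child in self.children)
        self.points = []

    def leaves(self):
        """Bounding boxes of all leaf nodes -- the 'grids' used for requests."""
        if self.children is None:
            return [self.bounds]
        return [b for child in self.children for b in child.leaves()]

# Five points clustered in the south-west corner force two levels of splitting
qt = QuadTree(0, 0, 100, 100, capacity=4)
for x, y in [(10, 10), (20, 20), (30, 30), (40, 40), (12, 18)]:
    qt.insert(x, y)
request_boxes = qt.leaves()  # dense areas yield smaller boxes
```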
The beautiful thing about trees is that they retain the same amount of information in each leaf. If the information is denser (in a geographical region with disproportionately high numbers of data points available for instance), the leaves are divided more vigorously:
Image above from “Quadtrees” published by Geographic Information Technology Training Alliance under Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Licence
How we use quadtrees to collect data
Within our use case, we use this principle to generate grids, each appropriately sized to reflect the data points it contains, and use these as the basis for our API requests.
We start with a bounding box from our initial boundary – our area of interest – and from there define an initial, overlapping big grid. Following the top-down approach, we then query the API and establish the number of results within each location. If the results exceed a given threshold (e.g. a limit on results per request determined by the API maintainer), we recursively divide into smaller areas until we are within that threshold, while remaining optimised to maximise results per request.
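The recursion can be sketched as follows. Here `count_results` stands in for the API call that reports how many records fall inside a bounding box; in this illustrative version it is mocked with an in-memory point set rather than a live endpoint.

```python
# Sketch of the top-down subdivision: quarter a bbox until each cell is
# expected to return no more than `threshold` results from the (mocked) API.

def subdivide(bbox, count_results, threshold):
    """Recursively quarter `bbox` until each cell holds <= threshold results."""
    if count_results(bbox) <= threshold:
        return [bbox]
    w, s, e, n = bbox
    mx, my = (w + e) / 2, (s + n) / 2
    quarters = [(w, s, mx, my), (mx, s, e, my), (w, my, mx, n), (mx, my, e, n)]
    return [cell for q in quarters for cell in subdivide(q, count_results, threshold)]

# Mock "API": 200 points clustered in the south-west corner of a large area
points = [(i % 20, i // 20) for i in range(200)]
count = lambda b: sum(b[0] <= x <= b[2] and b[1] <= y <= b[3] for x, y in points)

cells = subdivide((0, 0, 100, 100), count, threshold=60)
# Every returned cell is now safely below the per-request result limit
```

The dense corner ends up covered by many small cells while the empty remainder stays as a few large ones, which is exactly the varying-size grid the balanced approach calls for.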
Last graph in the above figure taken from 'Damn Cool Algorithms: Spatial indexing with Quadtrees and Hilbert Curves'
At the end of the process, we have a set of grids that together describe our original boundary. If the above process sounds familiar, you are correct: it is a geohash algorithm. Moreover, if we use a common initial grid to start the splitting procedure, we can be confident that all the resulting grids are aligned, with no overlaps.