blog | 13.04.2021 | David McArthur

When five times three equals zero: the curious case of Strava Metro

At UBDC, we have long been big fans of Strava Metro data, the dataset collected from users of the activity-tracking app Strava.

We have published several papers using the dataset looking at how well it corresponds to manual counts, how to visualise it, how to use it to evaluate infrastructure (see also 'Can providing safe cycling infrastructure encourage people to cycle more when it rains? The use of crowdsourced cycling data') and how to use it to understand COVID-19 impacts. We have also supplied the data through our data service. Those of you who have used the data will know that it comes in three main components. The one we have made the most use of was minute-by-minute cycle counts for each road link. It is a rich source of data that allows us to do all sorts of interesting things.

Recently, Strava changed the specification of this data. This was seemingly motivated by a desire to ensure the privacy of Strava users (a worthy aim). These changes meant that cycle counts would no longer be provided for every minute of the day and instead would be provided for each hour. That sounds reasonable, as we usually aggregated the data before working with it anyway. Of more concern to us was a new system of rounding they intended to employ. Counts of three or fewer cyclists would be rounded down to zero. All other counts would be rounded up to the nearest multiple of five. We set out to understand how this might affect our work.
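To make the rounding rule concrete, here is a minimal Python sketch of it as we have described it above. The function name `strava_round` is our own label for illustration; it is not taken from any Strava specification or tool.

```python
def strava_round(count):
    """Apply the rounding rule described in this post.

    Counts of three or fewer become zero; every other count is
    rounded up to the nearest multiple of five.
    """
    if count < 0:
        raise ValueError("a count cannot be negative")
    if count <= 3:
        return 0
    # Integer arithmetic for "round up to the next multiple of 5".
    return ((count + 4) // 5) * 5


for true_count in [0, 3, 4, 5, 6, 10, 11]:
    print(true_count, "->", strava_round(true_count))
```

Note that a true count of 4 and a true count of 5 both come out as 5, while 3 comes out as 0, so the rule can both inflate and erase small counts.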

The first thing we noticed was that we lost between 75% and 90% of the activities recorded on Glasgow’s roads, depending on what year we looked at! Many roads in Glasgow have only a few cyclists on them over the course of an hour. If they do not have more than three, then they are recorded as having none. At first glance, three cyclists doesn’t sound like very much. However, only a small proportion of cyclists use the Strava app, so three Strava users could represent around 30 cyclists (according to our work for Glasgow). This was worrying, although it could be argued that these roads are not as important for cycling because they aren’t used all that much. We still captured the busier routes. By using daily or monthly aggregations, we were able to reduce the data loss quite substantially. Perhaps all was not lost.
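The effect of the time resolution can be sketched in a few lines of Python. The hourly counts below are invented for illustration and are not real Strava data. The sketch also assumes that, for a daily product, the rounding is applied once to the day's total rather than to each hour separately. The helper `strava_round` is a hypothetical name for the rounding rule as we have described it.

```python
def strava_round(count):
    """Rounding rule as described in this post: 0-3 -> 0, otherwise up to a multiple of 5."""
    return 0 if count <= 3 else ((count + 4) // 5) * 5


# Invented hourly counts for one quiet road over part of a day (not real data).
hourly_counts = [0, 0, 1, 2, 3, 2, 1, 0, 1, 2, 3, 1]

true_total = sum(hourly_counts)

# Hourly product: every hour is rounded on its own, then we add the hours up.
hourly_then_sum = sum(strava_round(c) for c in hourly_counts)

# Daily product (assumed): the day's total is rounded once.
sum_then_round = strava_round(true_total)

print("true total:", true_total)
print("rounded hourly, then summed:", hourly_then_sum)
print("summed, then rounded:", sum_then_round)
```

In this made-up case no single hour exceeds three cyclists, so the hourly data show nothing at all on the road, while the daily figure survives, although inflated by the rounding.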

Our next realisations were more troubling. To illustrate the point, let us imagine we are studying two small towns which each have five roads. Each road is 1 km long such that observing a cyclist on a road means that 1 km has been cycled. Imagine we use the original data to see how the level of cycling has changed from one day to the next and we obtain the following results.

 

         Town 1           Town 2
Road     Day 1   Day 2    Day 1   Day 2
1        4       3        3       6
2        4       3        3       6
3        0       3        3       0
4        0       3        3       0
5        0       3        3       0
Total    8       15       15      12

In Town 1, cycling increased from 8 km to 15 km representing an increase of 88%. In Town 2, cycling declined from 15 km to 12 km, representing a decrease of 20%. This gives us an insight into what is going on in each town. Now let’s suppose we had conducted the same study with the new data specification. To do this, we apply the rounding rules to our first table and derive the following table.

 

         Town 1           Town 2
Road     Day 1   Day 2    Day 1   Day 2
1        5       0        0       10
2        5       0        0       10
3        0       0        0       0
4        0       0        0       0
5        0       0        0       0
Total    10      0        0       20

The same data (but this time rounded) now tell us that cycling declined in Town 1 from 10 km to 0 km, i.e. a 100% decrease. In Town 2 it seems that cycling increased from 0 km to 20 km. This is precisely the opposite of the conclusion we drew before. It is also an incorrect conclusion. The rounding can introduce serious problems. The extent of the problem depends on the overall level of cycling to start with and how it is spread out around the city. If levels are high and activity is concentrated on fewer routes, then the rounding will be less important. If levels are lower and activity is more dispersed, then rounding is more problematic and more likely to give spurious results.
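For readers who want to check the arithmetic, the two-town example can be reproduced with a short Python sketch. The per-road counts are those from the original table; the names `strava_round`, `total_km` and `percent_change` are our own hypothetical helpers.

```python
def strava_round(count):
    """Rounding rule as described in this post: 0-3 -> 0, otherwise up to a multiple of 5."""
    return 0 if count <= 3 else ((count + 4) // 5) * 5


# Cyclists observed per road (each road is 1 km), taken from the original table.
TRUE_COUNTS = {
    "Town 1": {"Day 1": [4, 4, 0, 0, 0], "Day 2": [3, 3, 3, 3, 3]},
    "Town 2": {"Day 1": [3, 3, 3, 3, 3], "Day 2": [6, 6, 0, 0, 0]},
}


def total_km(counts, rounded=False):
    """Sum the per-road counts, optionally rounding each road first."""
    return sum(strava_round(c) if rounded else c for c in counts)


def percent_change(before, after):
    """Percentage change, or None when the starting value is zero."""
    if before == 0:
        return None
    return 100 * (after - before) / before


results = {}
for town, days in TRUE_COUNTS.items():
    for rounded in (False, True):
        day1 = total_km(days["Day 1"], rounded)
        day2 = total_km(days["Day 2"], rounded)
        results[(town, rounded)] = (day1, day2, percent_change(day1, day2))
        label = "rounded" if rounded else "original"
        print(town, label, day1, "km ->", day2, "km")
```

The percentage change for Town 2 in the rounded data is left undefined because the starting value is zero, which is itself a symptom of the problem.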

The rounding introduces considerable uncertainty. For Town 1, cycling volume could be anywhere between 8 km and 19 km on Day 1 and between 0 km and 15 km on Day 2. It is, therefore, difficult to say what has happened. Volumes may have risen or declined.
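These ranges come from inverting the rounding rule: a published 0 could hide anything from 0 to 3 cyclists, a published 5 means 4 or 5, and a published 10 means 6 to 10. A minimal Python sketch of that inversion follows, assuming each road's count was rounded separately as in our toy example. The function names are hypothetical.

```python
def true_count_bounds(rounded):
    """Return (lowest, highest) true count consistent with a rounded value."""
    if rounded < 0 or rounded % 5 != 0:
        raise ValueError("rounded counts must be non-negative multiples of 5")
    if rounded == 0:
        return (0, 3)
    # A non-zero value covers the four counts below it, but never fewer than 4.
    return (max(4, rounded - 4), rounded)


def total_bounds(rounded_counts):
    """Sum the per-road bounds to get a range for the total."""
    lows, highs = zip(*(true_count_bounds(r) for r in rounded_counts))
    return (sum(lows), sum(highs))


# Rounded counts for Town 1, taken from the second table.
town1_day1 = [5, 5, 0, 0, 0]
town1_day2 = [0, 0, 0, 0, 0]

print("Town 1, Day 1:", total_bounds(town1_day1))
print("Town 1, Day 2:", total_bounds(town1_day2))
```

Because the two ranges overlap, the rounded data alone cannot tell us whether cycling in Town 1 went up or down.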

This raises serious doubts about the quality of the data. Researchers and planners must take these limitations into account. We explore the issues in more depth in our new paper and show how some current studies in the literature might have given misleading results if they had been conducted using the rounded dataset. We also provide some recommendations about the best way to proceed. While Strava Metro data may still be useful, it must now be handled with much greater caution than before. We must also be much more critical when evaluating evidence based on these data.

David McArthur

Dr David McArthur is the Associate Director for Training and Capacity Building at UBDC and is a Senior Lecturer in Transport Studies at the University of Glasgow.

Comments

    In response to Martin Laban: Glad you found it helpful. It is indeed a shame that this valuable source of data has been degraded in this way. I think the saddest part is that the aim of protecting privacy could have been achieved without reducing the utility of the data quite as much.

  • 2 years ago
  • |
  • David McArthur

    Thanks for the useful review, but fundamentally a shame to see a reduced level of detail available from Strava.

  • 2 years ago
  • |
  • Martin Laban
