Practical approaches for web scraping for research – using Airbnb as an example data provider
- Tuesday 22 June 2021
- 10:00 - 12:00 (BST)
- Online (via Zoom)
It is no exaggeration to say that the web is a fertile source of data - offering deep insights into people’s beliefs, opinions, transactions, movements and many other aspects of their lives.
For social science academics and data scientists, the UK’s legal environment appears (although not definitively) to provide opportunities to capture these data at scale in service of research goals.
Referencing UBDC’s project and open-source software platform to scrape short-term-let data from Airbnb, this webinar provides practical guidance on how researchers, technologists and data scientists can approach web scraping, from the selection of online sources to the planning, conceptualisation, governance, risk management and implementation of technical approaches.
Participants will receive practical training and code examples, developing an understanding not only of how scraping works but also of how to systematise approaches to scale up data collection while avoiding common pitfalls.
Throughout the session, a series of practical examples will cover scraping data from Airbnb’s online platform using UBDC’s established method.
The webinar will include a series of short talks. These will be interspersed with the main content of the webinar: a sequence of practical demonstrations in which UBDC’s web scraping platform is introduced, installed, configured and deployed.
What you will learn
Following completion of this webinar, attendees will be able to:
- Describe, and subsequently recreate within their own environment, the installation, configuration and deployment of UBDC’s open-source web scraping platform
- Explain what web scraping means and describe the variety of approaches available
- Outline the legal, ethical and data governance issues that must be considered when designing a web scraping project (including coverage of relevant intellectual property, contract and privacy law)
- Summarise limitations on data use and sharing
- Identify and select appropriate datasets for web scraping
- Explain what an API (Application Programming Interface) is and query APIs to retrieve data
- Explain how to negotiate technical barriers to scraping (including managing call limits and avoiding blacklisting)
- Systematise, scale and optimise approaches, including planning for wide geographical or temporal data coverage
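As a rough illustration of what "managing call limits" can involve, the sketch below retries a request with exponentially increasing waits whenever the server signals that a limit has been reached. It is not taken from UBDC's platform: the names `fetch_with_backoff` and `RateLimited`, the default delays and the injected `fetch` function are all assumptions made for this example.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical signal that the server refused a call (e.g. HTTP 429)."""


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), doubling the wait after each rate-limit signal.

    Returns the response body, or None if every attempt was rate limited.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimited:
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... before retrying.
            sleep(base_delay * (2 ** attempt))
    return None
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of any particular HTTP library, so it can be exercised without touching the network.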
Who should attend
Academic researchers or technical support staff who are interested in learning how to capture online data systematically and at scale using web scraping techniques. The session’s core content is principally aimed at technologists/implementers charged with building web scraping systems.
Prior knowledge requirements
Attendees wishing to subsequently recreate the software deployment covered within the session should have some experience of the Python programming language or similar languages and be comfortable running code within their computing environment. No specific technical proficiencies are required to attend and engage with the webinar content itself.
Participants may also find it beneficial to attend the related webinar, “Using daily Airbnb web scraped data to provide spatial and temporal understanding of short-term lets activity”, on 24 June (10:00 - 11:00 BST).
Data and software requirements
Although there is no explicit technical participatory component to the webinar, code examples, documentation and practical exercises will be available for webinar attendees from UBDC’s GitHub repository. Instructions for installing core libraries and software will also be made available to attendees during and following the session.
Registration for this online event is free and available via Eventbrite. Full details and instructions for joining will be circulated post-registration.