Harnessing APIs and web scraping to collect data

What is an API?

"API" stands for application programming interface.

What is an API?

"API" stands for application programming interface.

What is an API?

"API" stands for application programming interface.

Interacting with APIs

The REST API

REST stands for "representational state transfer".
It imposes a set of architectural constraints on the design of an API:

  • REST APIs operate in a client-server architecture with clients, servers and resources.
  • Requests and responses between clients and servers are transferred via the HTTP protocol.
  • REST APIs are stateless: no client information is stored by the server.
    Endpoints

    An API endpoint is the point at which your communication channel reaches the API. An API can expose different endpoints that lead to different resources.

    Example: New York Times API

    • /archive: get NYT article metadata for a given month.

    • /search: search for NYT articles with filters.

    • /books: get the NYT best sellers lists and look up book reviews.

    • /mostpopular: get popular articles on NYTimes.com.
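
    As a sketch of how such endpoints are addressed, the request below queries the /search endpoint with the requests package; the query and the placeholder API key are arbitrary examples:

        import requests

        # query the article search endpoint
        # (replace YOUR_API_KEY with a key from developer.nytimes.com)
        url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
        params = {"q": "election", "api-key": "YOUR_API_KEY"}
        response = requests.get(url, params=params)

        # the matching articles sit in response["response"]["docs"]
        docs = response.json()["response"]["docs"]
        print(docs[0]["headline"]["main"])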

    Anatomy of a request
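
    A request combines an HTTP method (e.g. GET), a base URL, an endpoint path, query parameters and headers (for example an authorization token). A minimal sketch with placeholder values:

        import requests

        response = requests.get(
            "https://api.example.com/v1/resource",    # base URL + endpoint (placeholder)
            params={"query": "value"},                # query parameters
            headers={"Authorization": "Bearer XXX"},  # authentication header
        )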

    Anatomy of a response

    Very often API responses are formatted as JSON objects.

    
        {
            "data": [
                {
                    "end": "2022-07-01T00:00:00.000Z",
                    "start": "2022-06-30T15:04:14.000Z",
                    "tweet_count": 2
                },
                {
                    "end": "2022-07-02T00:00:00.000Z",
                    "start": "2022-07-01T00:00:00.000Z",
                    "tweet_count": 2
                }
            ]
        }
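
    In Python, a JSON response like this parses directly into dictionaries and lists. A small sketch, assuming the object above is stored in a variable response_json, sums the daily tweet counts:

        # add up the per-day tweet counts from the JSON object above
        total = sum(bucket["tweet_count"] for bucket in response_json["data"])
        print(total)  # 4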

    Wrappers

    Writing requests manually is tedious. Wrappers add a layer of abstraction that makes this job easier for us.

    
        # Example: interacting with the Twitter API through twarc
        from twarc import Twarc2

        twarc_client = Twarc2(bearer_token="XXX")

        granularity = "day"
        query = "from:joebiden"

        # retrieve tweet counts matching the query, bucketed by day
        day_count = []
        for count in twarc_client.counts_recent(query, granularity=granularity):
            day_count.extend(count["data"])

    Rate limits & quotas

    Rate limiting is a strategy for limiting network traffic. It puts a cap on how often we can send requests to an API before being temporarily blocked.

    Quotas limit how much data you can download in a given time frame.

    Example: Twitter v2 API academic-level access

    • Full-archive tweet search: 300 requests / 15 min & 1 request / second

    • Quota: 10 million tweets / month

    • Follower lookup: 15 requests / 15 min
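
    A minimal, general-purpose way to stay within such limits is to back off when the server answers with HTTP status 429 ("Too Many Requests"); a sketch, not specific to any API:

        import time
        import requests

        def get_with_backoff(url, wait=1.0, max_retries=5):
            """Send a GET request, sleeping and retrying when rate-limited."""
            for _ in range(max_retries):
                response = requests.get(url)
                if response.status_code != 429:  # 429 = Too Many Requests
                    return response
                time.sleep(wait)
                wait *= 2  # exponential backoff
            raise RuntimeError("still rate-limited after retries")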

    API tour-de-force

    See also the API access code snippets

    The Google Trends API

    What do you get? Search trends for a search query, time range and country, normalized to the maximum search volume in the given time range.

    Usefulness for CSS: Used heavily in the early days of CSS, though results are hard to quality-control. Can serve as a first indicator that something interesting is going on.

    Accessibility: Very easy to use, no authentication required.

    Example publication: What Can Digital Disease Detection Learn from (an External Revision to) Google Flu Trends?
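
    Google Trends is usually accessed from Python through the unofficial pytrends wrapper. A minimal sketch (search term, time range and country are arbitrary examples):

        from pytrends.request import TrendReq

        pytrends = TrendReq(hl="en-US")
        pytrends.build_payload(["flu"], timeframe="2019-01-01 2019-12-31", geo="US")
        # returns a pandas DataFrame, normalized to the maximum (=100)
        trends = pytrends.interest_over_time()
        print(trends.head())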

    The Twitter v2 API

    What do you get? Historic tweets back to 2010 and in real time, highly developed query language, followers, full conversation trees, ...

    Usefulness for CSS: Very useful, although a bit overused because of the very good accessibility & data quality. A LOT of CSS publications use Twitter data.

    Accessibility: Academic access requires an application process. Very good documentation, well-maintained Python wrapper, restrictive rate limits & monthly quota require some planning for data retrieval.

    Example publication: Collective Emotions and Social Resilience in the Digital Traces After a Terrorist Attack
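
    The same twarc client from the wrapper example above can also retrieve tweets; a sketch using the recent-search endpoint (the query is an arbitrary example):

        from twarc import Twarc2

        twarc_client = Twarc2(bearer_token="XXX")

        # search_recent yields pages of results; tweets sit in page["data"]
        tweets = []
        for page in twarc_client.search_recent("from:joebiden"):
            tweets.extend(page["data"])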

    The PushShift API and Reddit data

    What do you get? Complete history of submissions, comments, scores.

    Usefulness for CSS: Very useful, subreddits are a nice way to retrieve topic-specific text, used in many CSS research projects.

    Accessibility: No application process, rather restrictive rate limit of 1 request / second, good data set descriptors.

    Example publication: Patterns of Routes of Administration and Drug Tampering for Nonmedical Opioid Consumption: Data Mining and Content Analysis of Reddit Discussions
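
    A sketch of querying the PushShift submission search endpoint directly with requests (subreddit and parameters are arbitrary examples):

        import requests

        url = "https://api.pushshift.io/reddit/search/submission/"
        params = {"subreddit": "opiates", "size": 100}
        response = requests.get(url, params=params)
        submissions = response.json()["data"]  # list of submission dictionaries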

    The Telegram API

    What do you get? Messages & media in a given channel, channel members.

    Usefulness for CSS: Very useful to get data from fringe groups that are not represented on big social media platforms. Severely underused in CSS research.

    Accessibility: Clunky authentication process, lacking documentation, Python wrapper not really designed for data collection.

    Example publication: What they do in the shadows: examining the far-right networks on Telegram
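
    A minimal sketch using the Telethon wrapper (the channel name is a placeholder; api_id and api_hash are obtained from my.telegram.org):

        from telethon.sync import TelegramClient

        api_id, api_hash = 12345, "XXX"  # placeholders

        with TelegramClient("session", api_id, api_hash) as client:
            # iterate over the most recent messages in a public channel
            for message in client.iter_messages("channel_name", limit=100):
                print(message.date, message.text)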

    The MediaWiki API

    What do you get? Gateway to Wikipedia and Wikimedia Commons content. Access Wikipedia articles, article discussions & revision histories, ...

    Usefulness for CSS: Very useful to research public knowledge creation, knowledge structure, volunteer interactions in a large organisation, ...

    Accessibility: No authentication needed, good documentation, large number of endpoints and search options, good Python wrapper.

    Example publication: Volunteer contributions to Wikipedia increased during COVID-19 mobility restrictions
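
    The MediaWiki API can also be queried directly with requests; a sketch fetching the ten most recent revisions of one article (the title is an arbitrary example):

        import requests

        url = "https://en.wikipedia.org/w/api.php"
        params = {
            "action": "query",
            "format": "json",
            "titles": "Computational social science",
            "prop": "revisions",
            "rvprop": "timestamp|user",
            "rvlimit": 10,
        }
        response = requests.get(url, params=params)
        pages = response.json()["query"]["pages"]  # revisions keyed by page ID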

    The New York Times API

    What do you get? Article abstracts and metadata, not full article texts.

    Usefulness for CSS: Somewhat limited because full texts are missing, but abstracts can serve as a window into historic high-quality journalistic texts.

    Accessibility: Easy authentication, good documentation, nice Python wrapper, hard to get article texts.

    Example publication: The rise and fall of rationality in language

    Steam reviews & comments

    What do you get? Reviews & comments to reviews of games on Steam.

    Usefulness for CSS: Niche usefulness for investigating gamers; some comment threads are used as chats. An underused data source.

    Accessibility: Easy authentication, sparse documentation, OK Python wrapper.

    Example publication: A Study of Analyzing on Online Game Reviews using a Data Mining Approach: STEAM Community Data
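
    A sketch of fetching reviews through Steam's public appreviews endpoint (the app ID, 570 for Dota 2, is an arbitrary example):

        import requests

        app_id = 570  # Dota 2, as an example
        url = f"https://store.steampowered.com/appreviews/{app_id}"
        response = requests.get(url, params={"json": 1, "num_per_page": 100})
        reviews = response.json()["reviews"]  # list of review dictionaries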

    OpenStreetMap

    What do you get? Map data (nodes, ways, relations), an alternative to Google Maps.

    Usefulness for CSS: Very useful for research that needs location data. HUGE amounts of available data. Data availability can be biased because coverage depends on volunteer activity.

    Accessibility: No authentication required, very powerful API & query language with a steep learning curve. For big data queries, use command-line tools.

    Example publication: Generating Open-Source Datasets for Power Distribution Network Using OpenStreetMaps
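
    Map data is commonly retrieved through the Overpass API with its own query language; a sketch fetching cafés within one kilometre of a point in Vienna (coordinates and radius are arbitrary examples):

        import requests

        overpass_url = "https://overpass-api.de/api/interpreter"
        query = """
        [out:json];
        node["amenity"="cafe"](around:1000,48.2082,16.3738);
        out;
        """
        response = requests.get(overpass_url, params={"data": query})
        cafes = response.json()["elements"]  # list of matching nodes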

    CrossRef

    What do you get? Article titles, authors, journals, number of references.

    Usefulness for CSS: Useful for scientometrics research, alternative to costly closed services such as Web of Science.

    Accessibility: No authentication required, slow and a bit unstable data retrieval (?), OK Python wrapper, good dataset descriptor.

    Example publication: Citation Intent Classification Using Word Embedding
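
    The CrossRef REST API requires no key; a sketch querying the works endpoint with requests (the query string is an arbitrary example):

        import requests

        url = "https://api.crossref.org/works"
        params = {"query": "computational social science", "rows": 5}
        response = requests.get(url, params=params)
        items = response.json()["message"]["items"]  # article metadata records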

    Spotify

    What do you get? Metadata about artists, playlists & tracks, song features like "danceability".

    Usefulness for CSS: Useful to research characteristics and popularity of songs, song features are "black boxes" - hard to validate.

    Accessibility: Easy authentication via Spotify account, somewhat clunky to use because there is no query language and artist / playlist / track URIs are needed to get data, good Python wrapper.

    Example publication: Cultural Divergence in popular music: the increasing diversity of music consumption on Spotify across countries
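
    A minimal sketch using the spotipy wrapper with client-credentials authentication (client ID, secret and the track URI are placeholders):

        import spotipy
        from spotipy.oauth2 import SpotifyClientCredentials

        sp = spotipy.Spotify(
            auth_manager=SpotifyClientCredentials(client_id="XXX", client_secret="XXX")
        )
        # audio features, including "danceability", for one track
        features = sp.audio_features(["spotify:track:TRACK_ID"])
        print(features[0]["danceability"])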

    Facebook & Instagram

    Data access is provided through CrowdTangle (see the API documentation).

    Web scraping
    - what to do if there is no API -

    What is web scraping?

    Web scraping means extracting information directly from web pages.

    Example: the page source of Wikipedia's "List of United States presidential elections".

    How does it work?

    Write code that sends a request to the server hosting the web page of interest. In Python, the "requests" package helps to send requests and receive responses.

    
        import requests

        # URL of the web page of interest (a weather.gov forecast page)
        url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"
        page = requests.get(url)

    How does it work?

    The server sends back the source code of the web page: (mostly) HTML. Web browsers render the HTML code to display a nice-looking website. Our code doesn't do that - we need to parse the HTML ourselves to identify the elements of interest.

    How does it work?

    Enter: the Python package BeautifulSoup.

    Finding useful information within the page

    
        from bs4 import BeautifulSoup

        # parse the HTML source and locate the seven-day forecast section
        soup = BeautifulSoup(page.content, "html.parser")
        seven_day = soup.find(id="seven-day-forecast")
        forecast_items = seven_day.find_all(class_="tombstone-container")
        tonight = forecast_items[0]
        print(tonight.prettify())
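
    Individual fields can then be pulled out of each forecast item; a sketch, assuming the page marks them with the class names "period-name" and "short-desc" (as in the tutorial linked below):

        # extract one field from every forecast item
        periods = [item.find(class_="period-name").get_text()
                   for item in forecast_items]
        descriptions = [item.find(class_="short-desc").get_text()
                        for item in forecast_items]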

    Further reading

    Excellent tutorial for a workflow from HTML to pandas DataFrame.

    Web scraping code snippets.

    API access code snippets.

    Chris Bail's SICSS talk on APIs (in R).

    Crowd-sourced list of useful APIs.

    Summary

    APIs provide a structured way of accessing data.

    Use a wrapper to access APIs from Python.

    Pay attention to rate limits and quotas.

    Plan your research given the specifics of an API.

    Web scraping gives you access to data when no APIs are available.

    Use requests to get the HTML of a webpage.

    Inspect the page and its HTML to identify the tags that contain the relevant information.

    Use BeautifulSoup to parse the HTML and access individual fields.

    Exercise
