Online platforms such as Yelp, Twitter, Amazon, or Instagram are large-scale, rich and relevant sources of data. Researchers in the social sciences increasingly tap into these data for field evidence when studying various phenomena.
In this course, you will learn how to find, acquire, store, and manage data from such sources and prepare them for follow-up statistical analysis for your own research.
After a short introduction into the relevance of data science skills for the social sciences, we will review R as a programming language and its basic data formats. We will then use R to program simple scrapers that systematically extract data from websites. We will use the packages rvest, httr, and RSelenium, among others, for this purpose. You will further need to learn how to read HTML, CSS, JSON, or XML codes, to use regular expressions, and to handle string, text and image data. To store the data, we will look into relational databases, (My)SQL, and related R packages. Many websites such as Twitter and Yelp offer convenient application-programming interfaces (APIs) that facilitate the extraction of data and we will look into accessing them from R. Finally, we will highlight some options for feature extraction from images and text, which allows us to augment our collected data with meaningful variables we can use in our analysis.
At the end of this course, students should be able to identify valuable online data sources, to write basic scrapers, and to prepare the collected data such that they can use them for statistical analysis as part of their own research projects.
Throughout the course, students will work on a data-scraping project related to their theses. This project will be presented at the final day of the course.
All data scraping code and other sources will be made available on