Syllabi
University of St. Gallen
Prerequisites (knowledge of topic)
Advanced knowledge in statistics and econometrics (gained, for example, following the specific courses in a master in quantitative methods/economics/finance).
Hardware
Individual laptop (with no particular requirements).
Software
Examples and code are shown using the R software (freely downloadable from https://www.r-project.org/).
Course Content
Computational Statistics is the area of specialization within statistics that includes statistical visualization and other computationally-intensive methods of statistics for mining large, nonhomogeneous, multi-dimensional datasets so as to discover knowledge in the data. As in all areas of statistics, probability models are important, and results are qualified by statements of confidence or of probability. An important activity in computational statistics is model building and evaluation.
First, basic multiple linear regression is reviewed. Then, some nonparametric procedures for regression and classification are introduced and explained. In particular, kernel estimators, smoothing splines, classification and regression trees, additive models, projection pursuit and finally neural nets will be considered; some of these have a straightforward interpretation, while others are useful for obtaining good predictions.
The main problems arising in computational statistics, such as the curse of dimensionality, will be discussed. Moreover, the goodness of a given (complex) model for estimation and prediction is analyzed using resampling, bootstrap, and cross-validation techniques.
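The course examples use R, but the logic of k-fold cross-validation described above is language-agnostic. The following minimal Python sketch illustrates the idea, with a hypothetical mean-predictor "model" standing in for a real estimator:

```python
import random

def k_fold_cv(xs, ys, k, fit, predict, loss):
    """Estimate the prediction error of a model by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend(loss(predict(model, xs[i]), ys[i]) for i in fold)
    return sum(errors) / len(errors)

# Toy stand-in "model": always predicts the training mean of y.
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda m, x: m
sq_loss = lambda yhat, y: (yhat - y) ** 2

random.seed(1)
data_x = list(range(20))
data_y = [x + random.gauss(0, 1) for x in data_x]
cv_err = k_fold_cv(data_x, data_y, k=5, fit=fit, predict=predict, loss=sq_loss)
```

Replacing `fit`/`predict` with a real estimator yields a cross-validated estimate of its prediction error; the bootstrap proceeds analogously, by resampling the index set with replacement.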
Structure
Outline
- Overview of supervised learning
Introductory examples, two simple approaches to prediction, statistical decision theory, local methods in high dimensions, structured regression models, bias-variance tradeoff, multiple testing and use of p-values.
- Linear methods for regression
Multiple regression, analysis of residuals, subset selection and coefficient shrinkage.
- Methods for classification
Bayes classifier, linear regression of an indicator matrix, discriminant analysis, logistic regression.
- Nonparametric density estimation and regression
Histogram, kernel density estimation, kernel regression estimator, local polynomial nonparametric regression estimator, smoothing splines and penalized regression.
- Model assessment and selection
Bias, variance and model complexity, bias-variance decomposition, optimism of the training error rate, AIC and BIC, cross-validation, bootstrap methods.
- Flexible regression and classification methods
Additive models; multivariate adaptive regression splines (MARS); neural networks; projection pursuit regression; classification and regression trees (CART).
- Bagging and Boosting
The bagging algorithm, bagging for trees, subagging, the AdaBoost procedure, steepest descent and gradient boosting.
- Introduction to the idea of a Superlearner
Structure (Chapters refer to the outline above)
Days 1 and 2: Chapters 1, 2, and 3
Day 3: Chapter 5
Day 4: Chapter 4
Days 5 and 6: Chapters 6, 7, and 8.
Literature
Mandatory
F. Audrino, Lecture Notes (can be downloaded from StudyNet or requested directly from the lecturer).
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer Series in Statistics, Springer, New York.
Supplementary / voluntary
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
Moreover, references to related published papers will be given during the course.
Additional online resources:
A complete version of the main reference book can be downloaded online: http://statweb.stanford.edu/~hastie/ElemStatLearn/
Moreover, the R package for the examples in the book is available: https://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf
The web-page of the book on Targeted Learning: http://www.targetedlearningbook.com/
https://stat.ethz.ch/education/semesters/ss2015/CompStat (a largely overlapping Computational Statistics class taught at ETH Zürich)
R software information and download: https://www.r-project.org/
Online course of Hastie and Tibshirani on Statistical Learning:
Official course at Stanford Online: https://lagunita.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
Quicker access to videos: http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Link to the website of an introductory book related to the course: http://www-bcf.usc.edu/~gareth/ISL/index.html
Mandatory readings before course start
–
Examination part
Decentral: 100% group examination paper (term paper). In accordance with St. Gallen quality standards, an individual examination paper is also possible.
Supplementary aids
The examination paper consists of the analysis of a data set chosen by the students, involving the methods learned in the lecture.
Examination content
The whole outline of the lecture described above.
Literature
Audrino, Lecture Notes.
These workshop lectures are designed to introduce participants to one of the most vibrant, freely available statistical computing environments: R. In this course you will learn how to use R for effective data analysis. We will cover a range of topics, from basic ones (e.g., reading data into R, data structures (i.e., data frames, lists, matrices), data manipulation, statistical graphics) to more advanced ones (e.g., writing functions, control statements, loops, reshaping data, string manipulations, and statistical models in R).
This course is also helpful as a primer for other summer program courses that will use R, such as the courses on Computational Statistics, Data Mining, or Advanced Regression Modeling, among other courses. No prerequisites are required for this course.
These intensive lectures provide an opportunity for a participant to initially develop, or perhaps refresh, an intuitive understanding of the concepts of basic matrix algebra (the first three days) and calculus (the final day). The ultimate goal is to develop a sufficient level of understanding of these foundational mathematical tools that will allow a participant to adequately comprehend and successfully apply them in subsequent GSERM coursework. In addition to the lectures, problems and solutions will be provided to enhance the learning process and provide for self-evaluation. It is assumed that the participants are proficient in rudimentary algebra. This series of workshop lectures carries no formal academic credit.
These intensive lectures provide an opportunity for a participant to refresh their understanding of, and familiarity with, the concepts and assumptions of bivariate and multiple ordinary least squares (OLS) linear regression. The ultimate goal is to possess a sufficient level of understanding of these topics so that a participant can adequately comprehend and successfully apply them in subsequent GSERM coursework. In addition to the lectures, problems and solutions will be provided to enhance the learning process and provide for self-evaluation. It is assumed that the participants are proficient in rudimentary statistics (e.g., basic hypothesis testing); while it would be helpful to have been (at least) exposed to OLS regression, this is not a prerequisite. This series of workshop lectures carries no formal academic credit.
Day 1 – Fundamentals in Python: operators, data types, control flow, style guide.
Day 2 – Introduction to pandas (Python Data Analysis Library).
Day 3 – Introduction to NumPy (Numerical Python) and Matplotlib (visualization with Python).
For each topic, we solve small programming tasks in class and discuss possible solutions.
This course assumes no prior experience with machine learning or R, though it may be helpful to be familiar with introductory statistics and programming.
Hardware
A laptop computer is required to complete the in-class exercises.
Software
R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
Machine learning, put simply, involves teaching computers to learn from experience, typically for the purpose of identifying or responding to patterns or making predictions about what may happen in the future. This course is intended to be an introduction to machine learning methods through the exploration of real-world examples. We will cover the basic math and statistical theory needed to understand and apply many of the most common machine learning techniques, but no advanced math or programming skills are required. The target audience may include social scientists or practitioners who are interested in understanding more about these methods and their applications. Students with extensive programming or statistics experience may be better served by a more theoretical course on these methods.
Structure
The course is designed to be interactive, with ample time for hands-on practice with the machine learning methods. Each day will include several lectures on a machine learning topic, in addition to hands-on “lab” sections for applying what has been learned to new datasets (or your own data, if desired).
The schedule will be as follows:
Day 1: Introducing Machine Learning with R
- How machines learn
- Using R, RStudio, and R Markdown
- k-Nearest Neighbors
- Lab sections – installing R, using R Markdown, choosing own dataset (if desired)
Day 2: Intermediate ML Methods – Classification Models
- Quiz on Day 1 material
- Naïve Bayes
- Decision Trees and Rule Learners
- Lab sections – practicing with Naïve Bayes and decision trees
Day 3: Intermediate ML Methods – Numeric Prediction
- Quiz on Day 2 material
- Linear Regression
- Regression trees
- Logistic regression
- Lab sections – practicing with regression methods
Day 4: Advanced Classification Models
- Quiz on Day 3 material
- Neural Networks
- Support Vector Machines
- Random Forests
- Lab section – practice with neural networks, SVMs, and random forests
Day 5: Other ML Methods
- Quiz on Day 4 material
- Association Rules
- Hierarchical clustering
- k-Means clustering
- Lab section – practice with these methods, work on final report
Literature
Mandatory
Machine Learning with R (3rd ed.) by Brett Lantz (2019). Packt Publishing
Supplementary / voluntary
None required.
Mandatory readings before course start
Please install R and RStudio on your laptop prior to the first class. Be sure that these are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
100% of the course grade will be based on a project and final report (approximately 10 pages), to be delivered within 2-3 weeks after the course. The project is intended to demonstrate your ability to apply the course materials to a dataset of your own choosing. Students should feel free to use a project related to their career or field of study; for example, one may use this opportunity to advance their dissertation research or complete a task for their job. The exact scoring criteria for this assignment will be provided on the first day of class. The report will be graded on its use of the methods covered in class as well as on the appropriateness of the conclusions drawn from the data.
There will also be brief quizzes at the start of each lecture, which cover the previous day’s materials. These are ungraded and are designed to provoke thought and discussion.
Supplementary aids
Students may reference literature and class materials as needed when writing the final project report.
Examination content
The final project report should illustrate an ability to apply machine learning methods to a new dataset, which may be on a topic of the student’s choosing. The student should explore the data and explain the methods applied. Detailed instructions will be provided on the first day of class.
Course content
The primary goal is to develop an applied and intuitive (as opposed to purely theoretical or mathematical) understanding of the topics and procedures. Whenever possible, presentations will be in “Words,” “Pictures,” and “Math” languages in order to appeal to a variety of learning styles. Some more advanced regression topics will be covered later in the course, but only after the introductory foundations have been established.
We will begin with a quick review of basic univariate statistics and hypothesis testing.
After that we will cover various topics in bivariate and then multiple regression, including:
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
Structure
This course will utilize approximately 525 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in eleven Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These eleven Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in-class examples are presented using SPSS, participants are free (and encouraged!) to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive “Tutorial and Answer Key” will be provided for each.
Prerequirements
This course is intended for participants who are comfortable with algebra and basic introductory statistics, and now want to learn applied ordinary least squares (OLS) multiple regression analysis for their own research and to understand the work of others.
Note: We will not use matrix algebra or calculus in this course.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information pertaining to several supplemental (and optional) readings for each of the eleven Packets of Lecture Transcripts.
• Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
• The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
• Finally, I have included several articles from a number of journals across several academic disciplines.
Some of these optional supplemental readings are older classics and others are more recently written and published.
Examination part
A written Final Examination will be administered during the last meeting of the course.
Since this Final Examination is the only artifact that will be formally graded in the course, it will determine the course grade.
Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the Final Examination.
Supplementary aids
The Final Examination will be written, open-book (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed), and open-note. No other materials, including laptops, cell phones, or other electronic devices, will be permitted.
The Final Exam will be two hours in length and administered during the last course meeting.
I will provide more specific “practical matter” details about this exam early in the course.
Examination content
The potential substantive content areas for the Final Examination are:
• Basic univariate statistics and hypothesis testing.
• Fundamental concepts of bivariate regression and multiple regression.
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
Literature
Literature relevant to the exam:
• Lecture Transcripts (eleven Packets; approximately 525 pages).
• Class notes (taken by each participant individually).
• Tutorial and Answer Key documents (for each optional data analysis project assignment).
Supplementary / voluntary literature not directly relevant to the exam:
• Optional supplemental readings listed in the course syllabus (and discussed earlier).
• Any other textbooks, articles, etc., the participant reads before or during the course.
Work load
At least 24 units of 45 minutes each on 5 consecutive days.
Prerequisites (knowledge of topic)
– Basic programming skills, Python recommended
– Undergraduate-level linear algebra, analysis and statistics
Hardware
– Personal laptop running macOS, Linux, or Windows
– Tablets (iOS, Windows) will not work for this lecture
Software
– Web browser (Chrome, Safari, Firefox)
– Text editor
– Jupyter Notebook
– Local Python installation including NumPy, SciPy, scikit-learn, PyTorch
(there will be an installation session on the first day for participants)
Course content
– Machine Learning Refresher
o Supervised Learning vs. Unsupervised Learning
o Traditional Machine Learning vs. End-to-End Learning
– Fundamentals of Neural Networks:
o Rosenblatt Perceptron and Neurons
o Network Structure (feed-forward, recurrent), matrix notation, forward evaluation
– Training as optimization
o Loss and Error functions
o Backpropagation
o SGD and other optimizers
– Activation functions and topologies
o Convolutional neural networks
o Generative Adversarial Networks
o Long short-term memory networks
o Special layer types (inception, resnet)
o Embeddings
o Attention Mechanism & Transformers
– Applications to real-world problems:
o Acoustic keyword recognition (audio/speech processing)
o Sentiment analysis (text processing)
o Digit recognition (image processing)
o Tiny Image Recognition (image processing)
o Face Detection and Tracking (image/video processing)
o Stock market prediction (time series prediction)
– Training on large data sets (Hardware, GPU)
– Trustworthy AI
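The “training as optimization” topic above (a loss function minimized by stochastic gradient descent, with gradients supplied by backpropagation) can be previewed in plain Python on the simplest possible network, a single linear neuron. The course labs use PyTorch; this dependency-free toy is only a sketch of the idea:

```python
import random

# Stochastic gradient descent on a single linear neuron,
# y_hat = w*x + b, with squared-error loss L = (y_hat - y)^2.
# dL/dw = 2*(y_hat - y)*x and dL/db = 2*(y_hat - y): the one-neuron
# special case of backpropagation (chain rule).
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in [i / 10 for i in range(-10, 11)]]

w, b, lr = 0.0, 0.0, 0.05
for epoch in range(200):
    random.shuffle(data)           # "stochastic": visit samples in random order
    for x, y in data:
        y_hat = w * x + b          # forward evaluation
        grad = 2.0 * (y_hat - y)   # dL/dy_hat
        w -= lr * grad * x         # chain rule through the weight
        b -= lr * grad             # chain rule through the bias

print(round(w, 2), round(b, 2))    # converges towards w=2, b=1
```

PyTorch automates exactly these two ingredients: `loss.backward()` computes the gradients and an optimizer such as `torch.optim.SGD` applies the update step.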
Structure
The course consists of theoretical content in the morning and practical exercises in the afternoon, in the form of Jupyter notebook lab programming.
Literature
Goodfellow, I., Bengio, Y. and Courville, A., Deep Learning, MIT Press, 2016.
Supplementary
• http://jupyter.org/
• http://www.numpy.org/
• https://www.scipy.org/
• https://www.tensorflow.org
• https://pytorch.org
Examination part
– Completed Jupyter notebook labs 1-8 (40%, closed-book), in class
– Completed Jupyter notebook assignments 1-8 (60%, open-book), at home
Prerequisites (knowledge of topic)
Linear regression (strong), Maximum Likelihood Estimation (some familiarity), Linear/Matrix Algebra (some exposure is helpful), R (not required, but helpful).
Hardware
Access to a laptop will be useful, but not absolutely necessary.
Software
R/RStudio, JAGS (both are freely available online).
Learning objectives
To understand what the Bayesian approach to statistical modeling is and to appreciate the differences between the Bayesian and Frequentist approaches. The students will be able to estimate a wide variety of models in the Bayesian framework and to adjust example code to fit their specific modeling needs.
Course content
-Theory/foundations of the Bayesian approach including:
-objective vs subjective probability
-how to derive and incorporate prior information
-the basics of MCMC sampling
-assessing convergence of Markov Chains
-Bayesian difference of means/ANOVA
-Bayesian versions of: Linear models, logit/probit (dichotomous/ordered/unordered choice models), Count models, Latent variable and measurement models, Multilevel models
-presentation of results
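As a concrete preview of the Beta-Binomial model listed for Day 1, conjugacy makes the posterior available in closed form; a minimal Python sketch with hypothetical counts:

```python
# Conjugate Beta-Binomial updating: with a Beta(a, b) prior on a success
# probability p and k successes in n Bernoulli trials, the posterior is
# Beta(a + k, b + n - k). (The numbers below are for illustration only.)

def beta_binomial_posterior(a, b, k, n):
    """Return the parameters of the Beta posterior."""
    return a + k, b + (n - k)

# Flat Beta(1, 1) prior; observe 7 successes in 10 trials.
a_post, b_post = beta_binomial_posterior(1, 1, 7, 10)
post_mean = a_post / (a_post + b_post)      # posterior mean a/(a+b)
print(a_post, b_post, round(post_mean, 3))  # Beta(8, 4), mean 0.667
```

For models without conjugate priors, the same posterior is instead approximated by the MCMC sampling methods (e.g., Gibbs sampling in JAGS) covered on Day 2.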
Structure
Day 1 a.m.: Overview of Bayesian approach—Bayes vs Frequentism. History of Bayesian statistics, Problems with the NHST, The Beta-Binomial model
Day 1 p.m.: Review of GLM/MLE. Probability review. Application of Bayes Rule.
Day 2 a.m.: Priors, Sampling methods (Inversion, Rejection, Gibbs sampling)
Day 2 p.m.: Convergence diagnostics. Using JAGS to estimate Bayesian models.
Day 3 a.m.: Estimating parameters of the Normal Distribution
Day 3 p.m.: Bayesian linear models, imputing missing data.
Day 4 a.m.: Choice models (dichotomous, ordered, unordered)
Day 4 p.m.: Latent variable models
Day 5 a.m.: Multilevel models: linear models.
Day 5 p.m.: Multilevel models: non-linear models, best practices for model
presentation.
Literature
Mandatory
Gill, J. (2008). Bayesian Methods: A Social And Behavioral Sciences Approach. Chapman and Hall, Boca Raton, FL
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York.
Jackman, S. (2000). Estimation and Inference Are Missing Data Problems: Unifying Social Science Statistics via Bayesian Simulation. Political Analysis, 8(4):307–332. http://pan.oxfordjournals.org/content/8/4/307.full.pdf+html
Supplementary / voluntary
Siegfried, T. (2010). Odds are, it’s wrong: Science fails to face the shortcomings of statistics. Science News, 177(7):26–29. http://dx.doi.org/10.1002/scin.5591770721
Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science.
Bakker, R. (2009). Re-measuring left–right: A comparison of SEM and Bayesian measurement models for extracting left–right party placements. Electoral Studies, 28(3):413–421
Bakker, R. and Poole, K. T. (2013). Bayesian Metric Multidimensional Scaling. Political Analysis, 21(1):125–140
For those unfamiliar with R: John Fox and Sanford Weisberg. An R Companion to Applied Regression. Sage, 2011.
Mandatory readings before course start
Western, B. and Jackman, S. (1994). Bayesian Inference for Comparative Research. American Political Science Review, 88(2):412–423. http://www.jstor.org/stable/2944713
Efron, B. (1986). Why Isn’t Everyone a Bayesian? The American Statistician, 40(1):1–5. http://www.jstor.org/stable/2683105
Examination part
A written homework assignment which consists of estimating a variety of models using JAGS as well as a brief essay describing how the students would go about incorporating Bayesian methods in their own work and what they see as the main advantages/disadvantages of doing so.
Supplementary aids
Open book/practical examinations. The students should use the example code from the lectures to help complete the practical component as well as both required texts to help answer the essay component. Specifically, the linear model and dichotomous choice model examples will be very useful as well as the first 3 chapters of the Gill text and Section 3 of the Gelman and Hill text.
Examination content
Bayesian versions of the linear and dichotomous choice models, including presenting the appropriate results in a professionally acceptable manner. This includes creating graphical representations of the model results as well as a thorough discussion of how to interpret the results.
For the essay component, students will need to be aware of the benefits of the Bayesian approach for their own research (or the lack thereof) and to describe, in detail, the types of choices they would need to make in order to apply Bayesian methods to their own work. This includes a detailed description and justification of what priors they would choose as well as what differences they would expect to see between the Bayesian and Frequentist approaches, if any, and why they would expect such differences.
Literature
The only required literature to complete the examinations are the 2 required texts and the code examples from the lectures.
Prerequisites (knowledge of topic)
Each student is to submit an outline (no more than 500 words in length) of a specific research question and/or a set of hypotheses that s/he would like to examine via an experimental approach. This outline (in PDF format, file name format: “LastName-FirstName-ResQues.pdf”) should be e-mailed to ghaeubl@ualberta.ca with “GSERM-EMBS” as the subject line by 23:00 (St. Gallen time) on the Friday prior to course start.
As part of the introductions on the first morning of the course, students will be asked to give 2-minute presentations on these research questions/hypotheses (and to say a few words about their areas of research interest more broadly).
The objectives behind this assignment are:
• to facilitate learning by ensuring that students have their own concrete research questions/hypotheses in mind as they engage with the material covered in the course
• to provide the instructor with input for tailoring the course content and/or class discussions to students’ interests
Course content
The objective of this course is to provide students with an understanding of the essential principles and techniques for conducting scientific experiments on human behavior. It is tailored for individuals with an interest in doing research (using experimental methods) in areas such as psychology, judgment and decision making, behavioral economics, consumer behavior, organizational behavior, and human performance. The course covers a variety of topics, including the formulation of research hypotheses, the construction of experimental designs, the development of experimental tasks and stimuli, how to avoid confounds and other threats to validity, procedural aspects of administering experiments, the analysis of experimental data, and the reporting of results obtained from experiments. Classes are conducted in an interactive seminar format, with extensive discussion of concrete examples, challenges, and solutions.
Topics
The topics covered in the course include:
• Basic principles of experimental research
• Formulation of research question and hypothesis development
• Experimental paradigms
• Design and manipulation
• Measurement
• Factorial designs
• Implementation of experiments
• Data analysis and reporting of results
• Advanced methods and complex experimental designs
• Ethical issues
Literature
Recommended
There is no textbook for this course.
However, here are some recommended books on the design (and analysis) of experiments:
Abdi, Edelman, Valentin, and Dowling (2009), Experimental Design and Analysis for Psychology, Oxford University Press.
Field and Hole (2003), How to Design and Report Experiments, Sage.
Keppel and Wickens (2004), Design and Analysis: A Researcher’s Handbook, Pearson.
Kirk (2013), Experimental Design: Procedures for the Behavioral Sciences, Sage.
Martin (2007), Doing Psychology Experiments, Wadsworth.
Oehlert (2010), A First Course in Design and Analysis of Experiments, available online at:
http://users.stat.umn.edu/~gary/book/fcdae.pdf.
In addition, the following papers are recommended as background readings for the course:
Cumming, Geoff (2014), “The New Statistics: Why and How,” Psychological Science, 25, 1, 7-29.
Elrod, Häubl, and Tipps (2012), “Parsimonious Structural Equation Models for Repeated Measures Data, With Application to the Study of Consumer Preferences,” Psychometrika, 77, 2, 358-387.
Goodman and Paolacci (2017), “Crowdsourcing Consumer Research,” Journal of Consumer Research, 44, 1, 196-210.
McShane and Böckenholt (2017), “Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability,” Journal of Consumer Research, 43, 6, 1048-1063.
Meyvis and Van Osselaer (2018), “Increasing the Power of Your Study by Increasing the Effect Size,” Journal of Consumer Research, 44, 5, 1157-1173.
Morales, Amir, and Lee (2017), “Keeping It Real in Experimental Research-Understanding When, Where, and How to Enhance Realism and Measure Consumer Behavior,” Journal of Consumer Research, 44, 2, 465-476.
Oppenheimer, Meyvis, and Davidenko (2009), “Instructional Manipulation Checks: Detecting Satisficing to Increase Statistical Power,” Journal of Experimental Social Psychology, 45, 867-872.
Pieters (2017), “Meaningful Mediation Analysis: Plausible Causal Inference and Informative Communication,” Journal of Consumer Research, 44, 3, 692-716.
Simmons, Nelson, and Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22, 11, 1359-1366.
Simonsohn, Nelson, and Simmons (2014), “P-Curve: A Key to the File-Drawer,” Journal of Experimental Psychology: General, 143, 2, 534-547.
Spiller, Fitzsimons, Lynch, and McClelland (2013), “Spotlights, Floodlights, and the Magic Number Zero: Simple Effects Tests in Moderated Regression,” Journal of Marketing Research, 50, 277-288.
Zhao, Lynch, and Chen (2010), “Reconsidering Baron and Kenny: Myths and Truths about Mediation Analysis,” Journal of Consumer Research, 37, 197-206.
Examination part
Students are to complete a (2-hour) written exam in the afternoon of the last day of class. In the exam, students are given a description of a research question, along with specific hypotheses. They are to produce a proposal for an experiment, or a series of experiments, for testing these hypotheses. The exam is “open book” – that is, students are free to use any appropriate local resources they wish in developing their proposal. (Here, “local” means that students may not access the Internet or other communication networks.)
Regular attendance and active participation in class discussion are expected.
Common standards of academic integrity apply. Work submitted by students must be their own – submitting what someone else has created is not acceptable.
Grading
A student’s overall grade is based on the following components:
– Initial Assignment and Presentation: 10%
– Class Participation: 20%
– Exam: 70%
Prerequisites (knowledge of topic)
Students should have previous exposure to social research methods, including basic training in quantitative methods, at the post-baccalaureate level.
Hardware
Laptop (PC or Mac): Students should bring a laptop. The course will include instruction in the use of the software package fsQCA (for both Windows and Mac).
Software
Please install the fsQCA software package ahead of the course. It can be downloaded for free at fsqca.com
Learning objectives
Qualitative comparative analysis (QCA) is a research approach consisting of both an analytical technique and a conceptual perspective for researchers interested in studying configurational phenomena. QCA is particularly appropriate for the analysis of causally complex phenomena marked by multiple, conjunctural causation where multiple causes combine to bring about outcomes in complex ways.
QCA was developed in the 1980s by Charles Ragin, a sociologist and political scientist, as an alternative comparative approach that lies midway between the primarily qualitative, case-oriented approach and the primarily quantitative, variable-oriented approach, with the goal of bridging both by combining their advantages and tackling situations where causality is complex and conjunctural. QCA uses Boolean algebra for the analysis of set relations and allows researchers to formally analyze patterns of necessity and sufficiency regarding outcomes of interest. Since its inception, QCA has developed into a broad set of techniques that share their set-analytic nature and include both descriptive and inferential techniques.
Many researchers have drawn on QCA because it offers a means to systematically analyze data sets with only a few observations. In fact, QCA was originally applied to small-n situations of between 10 and 50 cases; situations where there are frequently too many cases to pursue a classical qualitative approach but too few cases for conventional statistical analysis. However, more recently, researchers have also applied QCA to medium- and large-n situations marked by hundreds or even thousands of cases. While these applications require some changes to how QCA is applied, they retain many advantages for analyzing situations that are configurational in nature and marked by causal complexity.
The goal of this workshop is to provide a ground-up introduction to Qualitative Comparative Analysis (QCA) and fuzzy sets. Participants will get intensive instruction and hands-on experience with the fsQCA software package and on completion should be prepared to design and execute research projects using the set-analytic approach.
After successful completion of the course, you should be able to:
1. understand the goals, assumptions, and key concepts of QCA
2. conduct data analysis using the fsQCA software package
3. design and execute research projects using a set-analytic approach
4. apply advanced forms of set-analytic investigation
I would like this workshop to be as useful to you as possible. To get the most out of this workshop, you would ideally already be working on an empirical project that might be aided by taking a configurational approach, but that is not essential. Over the course of this workshop, I hope you will be thinking about how you can apply these methods to your research, and I will do my best to be of assistance.
Course content
See below under structure
Structure
Day 1: Units 1-3
Day 2: Units 3-4
Day 3: Units 5-6
Day 4: Units 6-7
Day 5: Student Presentations
Unit 1. Introduction to the Comparative Method
The goal of this first unit is to offer an introduction to the logic of comparative research, as this perspective will be fundamental in informing our thinking for the coming days. The focus is on understanding social research from a set-analytic perspective as well as examining the distinctive place of configurational and comparative research.
Key Readings:
Ragin, 2008 (“Redesigning Social Inquiry”): Chapters 1-2
Unit 2. The Basics of QCA
We’ll move on to the basics of QCA. We will begin with an Introduction to Boolean algebra and set-analytic methods. Other issues we will cover include set-analytic analysis vs. correlational analysis, the concepts of necessity and sufficiency as well as consistency, coverage, and set coincidence. Time permitting, we will also examine case-oriented research strategies for theory building.
Key Readings:
Ragin, 2000: Chapters 3-5
Ragin, 2008: Chapters 1-3
Unit 3. Crisp Set Analysis
In this unit, we will dive into crisp-set QCA (csQCA), the simpler version of QCA using binary data sets. This will include the coding of data, the construction of truth tables, and understanding the three solutions—complex, parsimonious, and intermediate. We will also begin to examine the importance of counterfactual analysis based on easy versus difficult counterfactuals. Topics also include understanding consistency and coverage in crisp-set truth table analysis.
Key Readings:
Ragin, 2000: Chapters 3-5
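To make the truth-table idea concrete, its core logic can be sketched in a few lines of code. The course itself works with the fsQCA software; the sketch below is plain Python with entirely hypothetical binary data, and it only illustrates how cases are grouped into configurations and how sufficiency consistency (the share of cases in a configuration showing the outcome) is computed:

```python
from collections import Counter

# Hypothetical binary data: each case records two causal conditions
# (A, B) and an outcome Y, all coded 0/1 as in crisp-set QCA.
cases = [
    {"A": 1, "B": 1, "Y": 1},
    {"A": 1, "B": 1, "Y": 1},
    {"A": 1, "B": 0, "Y": 1},
    {"A": 1, "B": 0, "Y": 0},
    {"A": 0, "B": 1, "Y": 0},
    {"A": 0, "B": 0, "Y": 0},
]

def truth_table(cases, conditions=("A", "B"), outcome="Y"):
    """Group cases by configuration and compute sufficiency consistency:
    the share of cases in each configuration that show the outcome."""
    n = Counter()         # number of cases per configuration
    positive = Counter()  # cases with outcome = 1 per configuration
    for case in cases:
        config = tuple(case[c] for c in conditions)
        n[config] += 1
        positive[config] += case[outcome]
    return {config: (n[config], positive[config] / n[config])
            for config in sorted(n)}

for config, (count, consistency) in truth_table(cases).items():
    print(config, count, round(consistency, 2))
```

In real applications the minimization of the resulting truth table (producing the complex, parsimonious, and intermediate solutions) is handled by fsQCA itself; this sketch stops at the table.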
Unit 4. Fuzzy Set Analysis I
Fuzzy set analysis presents a slightly more complex version of QCA. We will start with the notions of fuzzy sets and fuzzy set relations before moving on to calibrating fuzzy sets and fuzzy set consistency, coverage, and coincidence.
Key Readings:
Ragin, 2008: Chapters 4-5
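The fuzzy-set consistency and coverage measures covered in this unit have simple closed forms (Ragin 2008): consistency of "X is sufficient for Y" is the sum of min(x, y) divided by the sum of x, and coverage is the sum of min(x, y) divided by the sum of y. A minimal, language-neutral sketch with hypothetical calibrated membership scores (the course uses fsQCA for this):

```python
# Hypothetical fuzzy membership scores for a causal condition X and an
# outcome Y across six cases (values between 0 and 1 after calibration).
X = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
Y = [1.0, 0.7, 0.7, 0.5, 0.1, 0.3]

def consistency(X, Y):
    """Ragin's consistency of 'X is sufficient for Y':
    sum of min(x, y) over sum of x."""
    return sum(min(x, y) for x, y in zip(X, Y)) / sum(X)

def coverage(X, Y):
    """Coverage: how much of the outcome the condition accounts for:
    sum of min(x, y) over sum of y."""
    return sum(min(x, y) for x, y in zip(X, Y)) / sum(Y)

print(round(consistency(X, Y), 3), round(coverage(X, Y), 3))
```

High consistency with modest coverage indicates a sufficient but not necessary condition; the reverse pattern suggests necessity.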
Unit 5. The Fuzzy-Set Truth Table Algorithm
We will next cover the fuzzy-set truth table algorithm. Building on crisp set analysis, we will further examine issues around limited diversity, fuzzy sets and counterfactual analysis. We also will work with sample data sets.
Unit 6. Advanced Topics in QCA
This unit provides us with an opportunity to catch up and delve deeper into some of the topics introduced above. Should we feel comfortable enough, we will move on to some more advanced topics in QCA, including the testing of causal recipes and substitutable causal conditions.
Key Readings:
Ragin, 2008: Chapters 7-10
Unit 7. Large-N Applications of QCA
The last unit will provide examples of recent large-N applications of QCA. These examples will give us an opportunity to raise further questions about how to execute research using a set-analytic approach. We will also reserve some time for further questions that have come up during the workshop.
Literature
There are four key books for the course, and required chapters are posted here in pdf format. I recommend reading the remainder of the books, but this is not required.
Ragin, Charles C. 1987. The Comparative Method: Moving beyond Qualitative and Quantitative Strategies. Berkeley, CA: University of California Press
Ragin, Charles C. 2000. Fuzzy Set Social Science. Chicago, IL: University of Chicago Press.
Ragin, Charles C. 2008. Redesigning Social Inquiry: Fuzzy-Sets and Beyond. Chicago, IL: University of Chicago Press.
Ragin, Charles, C., and Fiss, Peer C. 2017. Intersectional Inequality: Race, Class, Test Scores, and Poverty. Chicago, IL: University of Chicago Press.
Mandatory
Background Reading: The Comparative Method, chapters 6-8; Redesigning Social Inquiry, chapters 1-5; and Fuzzy Set Social Science, chapters 3-5.
Supplementary / voluntary
Goertz, Gary. 2006. Social Science Concepts: A User’s Guide. Princeton, NJ: Princeton University Press.
Goertz, Gary and James Mahoney. 2012. A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences. Princeton, NJ: Princeton University Press.
Rihoux, Benoit and Charles C. Ragin (eds.) 2008. Configurational Comparative Methods. Thousand Oaks, CA: Sage.
Schneider, Carsten and Claudius Wagemann. 2012. Set-Theoretic Methods for the Social Sciences: A Guide to QCA. New York: Cambridge.
Mandatory readings before course start
The above chapters.
Examination part
Presentation (individual) (50%)
Research proposal written at home (individual) (50%)
Supplementary aids
To get inspiration for research proposals, I recommend that participants review recent research projects in their field using QCA. A bibliography of such projects is available at http://compasss.org/bibliography/
Examination content
The “structure” section above presents a complete list of topics relevant to the examination. Specifically, the oral presentation and subsequent examination paper will focus on using course materials to develop a research proposal.
Examination relevant literature
All required chapters listed above are part of the examination relevant literature, as are all course materials such as PPTs and additional materials distributed to the participants during the course.
Prerequisites (knowledge of topic)
The course is designed for Master's and PhD students and practitioners in the social and policy sciences, including political science, sociology, public policy, public administration, business, and economics. It is especially suitable for MA students in these fields who have an interest in carrying out research. Previous courses in research methods and philosophy of science are helpful but not required. Materials not in the books assigned for purchase and not easily available through online library databases will be made available electronically. Bringing a laptop to class will be helpful but is not essential.
Hardware
Laptop helpful but not required
Software
None
Course content
The central goal of the seminar is to enable students to create and critique methodologically sophisticated case study research designs in the social sciences. To do so, the seminar will explore the techniques, uses, strengths, and limitations of case study methods, while emphasizing the relationships among these methods, alternative methods, and contemporary debates in the philosophy of science. The research examples used to illustrate methodological issues will be drawn primarily from international relations and comparative politics. The methodological content of the course is also applicable, however, to the study of history, sociology, education, business, economics, and other social and behavioral sciences.
Course structure
The seminar will begin with a focus on the philosophy of science, theory construction, theory testing, causality, and causal inference. With this epistemological grounding, the seminar will then explore the core issues in case study research design, including methods of structured and focused comparisons of cases, typological theory, case selection, process tracing, and the use of counterfactual analysis. Next, the seminar will look at the epistemological assumptions, comparative strengths and weaknesses, and proper domain of case study methods and alternative methods, particularly statistical methods and formal modeling, and address ways of combining these methods in a single research project. The seminar then examines field research techniques, including archival research and interviews.
Course Assignments and Assessment
In addition to doing the reading and participating in course discussions, students will be required to present an outline for a research design, either as a written document or in PowerPoint, in the final sessions of the class for a constructive critique by fellow students and Professor Bennett. Students will then develop this outline into a research design paper of about 3,000 words (12 pages, double-spaced).
Presumably, students will choose to present the research design for their PhD or MA thesis, though students could also present a research design for a separate project, article, or edited volume. Research designs should address all of the following tasks (elaborated upon in the assigned readings and course sessions):
1) specification of the research problem and research objectives, in relation to the current stage of development and research needs of the relevant research program, related literatures, and alternative explanations;
2) specification of the independent and dependent variables of the main hypothesis of interest and alternative hypotheses;
3) selection of a historical case or cases that are appropriate in light of the first two tasks, and justification of why these cases were selected and others were not;
4) consideration of how variance in the variables can best be described for testing and/or refining existing theories;
5) specification of the data requirements, including both process tracing data and measurements of the independent and dependent variables for the main hypotheses of interest, including alternative explanations.
Students will be assessed on how well their research design achieves these tasks, and on how useful their suggestions are on other students’ research designs. Students will also be assessed on the general quality of their contributions to class discussions.
Literature
Mandatory:
Assigned Readings for GSERM Case Study Methods Course
Andrew Bennett, Georgetown University
Students should obtain and read these books in advance of the course (see below for specific page assignments):
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development in the Social Sciences (MIT Press 2005).
•Henry Brady and David Collier, Rethinking Social Inquiry (second edition, 2010)
•Gary Goertz, Social Science Concepts: A User’s Guide (Princeton, 2006).
•Andrew Bennett and Jeffrey Checkel, eds., Process Tracing: From Metaphor to Analytic Tool (Cambridge University Press, 2014).
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry (Princeton University Press, 1994).
Lecture 1: Inferences About Causal Effects and Causal Mechanisms
This lecture addresses the philosophy of science issues relevant to case study research.
Readings:
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development, preface and chapter 7, pages 127-150.
•King, Keohane, and Verba, Designing Social Inquiry pp. 3-33, 76-91, 99-114.
Lecture 2: Critiques and Justifications of Case Study Methods
Readings:
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry, pp. 46-48, 118-121, 208-230.
•Brady and Collier, Rethinking Social Inquiry, 1-64, 123-201 (or if you have the first edition, pages 3-20, 36-50, 195-266)
•George and Bennett, Case Studies and Theory Development, Chapter 1, pages 3-36.
Lecture 3: Concept Formation and Measurement
Readings:
•Gary Goertz, Social Science Concepts, chapters 1, 2, 3, and 9, pages 1-94, 237-268.
•Gary Goertz, Exercises, available at
http://press.princeton.edu/releases/m8089.pdf
Please think through the following exercises: 7, 21, 48, 49, 52, 163, 252, 253, 256, 257.
Lecture 4: Designs for Single and Comparative Case Studies
Readings:
•George and Bennett, Case Studies and Theory Development, chapter 4, pages 73-88.
•Jason Seawright and John Gerring, Case Selection Techniques In Case Study Research. Political Research Quarterly June 2008. Available at: http://blogs.bu.edu/jgerring/files/2013/06/CaseSelection.pdf
Lecture 5: Typological Theory, Fuzzy Set Analysis
Readings:
•George and Bennett, Case Studies and Theory Development chapter 11, pages 233-262.
•Excerpt from Andrew Bennett, "Causal mechanisms and typological theories in the study of civil conflict," in Jeff Checkel, ed., Transnational Dynamics of Civil War, Columbia University Press, 2012.
•Charles Ragin, "From Fuzzy Sets to Crisp Truth Tables," available at:
http://www.compasss.org/files/WPfiles/Raginfztt_April05.pdf
Lecture 6: Process Tracing, Congruence Testing, and Counterfactual Analysis
Readings:
•Andrew Bennett and Jeff Checkel, Process Tracing, chapter 1, conclusions, and appendix on Bayesianism.
•David Collier, online process tracing exercises. Look at exercises 3, 4, 7, and 8 at:
http://polisci.berkeley.edu/sites/default/files/people/u3827/Teaching%20Process%20Tracing.pdf
Lecture 7: Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
Readings:
•Andrew Bennett and Bear Braumoeller, "Where the Model Frequently Meets the Road: Combining Statistical, Formal, and Case Study Methods," draft paper.
•Evan Lieberman, "Nested Analysis as a Mixed-Method Strategy for Comparative Research," American Political Science Review August 2005, pp. 435-52.
Lecture 8: Field Research Techniques: Archives, Interviews, and Surveys
Readings:
•Cameron Thies, "A Pragmatic Guide to Qualitative Historical Analysis in the Study of International Relations," International Studies Perspectives 3 (4) (November 2002) pp. 351-72.
Lecture 9 & 10: Student research design presentations
Read and be ready to constructively critique your fellow students’ research designs.
Supplementary / voluntary:
The following readings are useful for students interested in exploring the topic further, but they are not required:
I) Philosophy of Science and Epistemological Issues
Henry Brady, "Causation and Explanation in Social Science," in Janet Box-Steffensmeier, Henry Brady, and David Collier, eds., Oxford Handbook of Political Methodology (Oxford, 2008) pp. 217-270.
II) Case Study Methods
George and Bennett, Case Studies and Theory Development, Chapter 1.
Gerardo Munck, "Canons of Research Design in Qualitative Analysis," Studies in Comparative International Development, Fall 1998.
Timothy McKeown, "Case Studies and the Statistical World View," International Organization Vol. 53, No. 1 (Winter, 1999) pp. 161-190.
Concept Formation and Measurement
John Gerring, "What Makes a Concept Good?," Polity Spring 1999: 357-93.
Robert Adcock and David Collier, "Measurement Validity: A Shared Standard for Qualitative and Quantitative Research," APSR Vol. 95, No. 3 (September, 2001) pp. 529-546.
Robert Adcock and David Collier, "Democracy and Dichotomies," Annual Review of Political Science, Vol. 2, 1999, pp. 537-565.
David Collier and Steven Levitsky, "Democracy with Adjectives: Conceptual Innovation in Comparative Research," World Politics, Vol. 49, No. 3 (April 1997) pp. 430-451.
David Collier, "Data, Field Work, and Extracting New Ideas at Close Range," APSA -CP Newsletter Winter 1999 pp. 1-6.
Gerardo Munck and Jay Verkuilen, "Conceptualizing and Measuring Democracy: Evaluating Alternative Indices," Comparative Political Studies Feb. 2002, pp. 5-34.
Designs for Single and Comparative Case Studies and Alternative Research Goals
Aaron Rapport, "Hard Thinking about Hard and Easy Cases in Security Studies," Security Studies 24:3 (2015), pp. 431-465.
Van Evera, Guide to Methodology, pp. 77-88.
Richard Nielsen, "Case Selection via Matching," Sociological Methods and Research
(forthcoming).
Typological Theory and Case Selection
Colin Elman, "Explanatory Typologies and Property Space in Qualitative Studies of International Politics," International Organization, Spring 2005, pp. 293-326.
Gary Goertz and James Mahoney, "Negative Case Selection: The Possibility Principle," in Goertz, chapter 7.
David Collier, Jody LaPorte, and Jason Seawright, "Putting Typologies to Work: Concept Formation, Measurement, and Analytic Rigor," Political Research Quarterly, 2012.
Process Tracing
Tasha Fairfield and Andrew Charman, 2015 APSA paper on Bayesian process tracing.
David Waldner, "Process Tracing and Causal Mechanisms." In Harold Kincaid, ed., The Oxford Handbook of Philosophy of Social Science (Oxford University Press, 2012), pp. 65‐84.
Gary Goertz and Jack Levy, "Causal Explanation, Necessary Conditions, and Case Studies: The Causes of World War I," manuscript, Dec. 2002.
Counterfactual Analysis, Natural Experiments
Jack Levy, paper in Security Studies on counterfactual analysis.
Thad Dunning, "Design-Based Inference: Beyond the Pitfalls of Regression Analysis?" in Brady and Collier, pp. 273-312.
Thad Dunning, Natural Experiments in the Social Sciences: A Design‐Based Approach (Cambridge University Press, 2012), Chapters 1,7
Philip Tetlock and Aaron Belkin, eds., Counterfactual Thought Experiments, chapters 1, 12.
Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
David Dessler, "Beyond Correlations: Toward a Causal Theory of War," International Studies Quarterly vol. 35 no. 3 (September, 1991), pp. 337-355.
Alexander George and Andrew Bennett, Case Studies and Theory Development, Chapter 2.
James Mahoney, "Nominal, Ordinal, and Narrative Appraisal in MacroCausal Analysis," American Journal of Sociology, Vol. 104, No.3 (January 1999).
Field Research Techniques: Archives, Interviews, and Surveys
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Field Research in Political Science: Practices and Principles," chapter 1 in Field Research in Political Science: Practices and Principles (Cambridge University Press). Read pages 15-33.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Interviews, Oral Histories, and Focus Groups" in Field Research in Political Science: Practices and Principles (Cambridge University Press).
Elisabeth Jean Wood, "Field Research," in Carles Boix and Susan Stokes, eds., Oxford Handbook of Comparative Politics, Oxford University Press 2007, pp. 123-146.
Soledad Loaeza, Randy Stevenson, and Devra C. Moehler. 2005. "Symposium: Should Everyone Do Fieldwork?" APSA-CP 16(2) 2005: 8-18.
Layna Mosley, ed., Interview Research in Political Science, Cornell University Press, 2013.
Hope Harrison, "Inside the SED Archives," CWIHP Bulletin
Ian Lustick, "History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias," APSR September 1996, pp. 605-618.
Symposium on interview methods in political science in PS: Political Science and Politics (December, 2002), articles by Beth Leech ("Asking Questions: Techniques for Semistructured Interviews"), Kenneth Goldstein ("Getting in the Door: Sampling and Completing Elite Interviews"), Joel Aberbach and Bert Rockman ("Conducting and Coding Elite Interviews"), Laura Woliver ("Ethical Dilemmas in Personal Interviewing"), and Jeffrey Berry ("Validity and Reliability Issues in Elite Interviewing"), pp. 665-682.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "A Historical and Empirical Overview of Field Research in the Discipline," Chapter 2 in Field Research in Political Science: Practices and Principles (Cambridge University Press, forthcoming).
Mandatory readings before course start:
It is advisable to do as much of the mandatory reading as possible before the course starts.
Prerequisites (knowledge of topic)
• Some prior knowledge in R and/or programming beneficial, but not required
Hardware
• Bring your own laptop
Software
• RStudio & R, most recent version (download free versions)
• You may want to bring your own credit card to create your own cloud accounts (for the database server and certain APIs). Accounts are typically free, but some require a credit card on file.
Course content
Online platforms such as Yelp, Twitter, Amazon, or Instagram are large-scale, rich and relevant sources of data. Researchers in the social sciences increasingly tap into these data for field evidence when studying various phenomena.
In this course, you will learn how to find, acquire, store, and manage data from such sources and prepare them for follow-up statistical analysis for your own research.
After a short introduction to the relevance of data science skills for the social sciences, we will review R as a programming language and its basic data formats. We will then use R to program simple scrapers that systematically extract data from websites. We will use the packages rvest, httr, and RSelenium, among others, for this purpose. You will further need to learn how to read HTML, CSS, JSON, or XML code, to use regular expressions, and to handle string, text, and image data. To store the data, we will look into relational databases, (My)SQL, and related R packages. Many websites such as Twitter and Yelp offer convenient application programming interfaces (APIs) that facilitate the extraction of data, and we will look into accessing them from R. Finally, we will highlight some options for feature extraction from images and text, which allows us to augment our collected data with meaningful variables we can use in our analysis.
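The core scraping idea, downloading markup and extracting structured fields from it with patterns, can be sketched in a few lines. The course does this in R with httr and rvest; purely as a language-neutral illustration, here is the same idea in Python's standard library, applied to a hypothetical HTML fragment (the class names and review texts are invented):

```python
import re

# Hypothetical HTML fragment of the kind a scraper might download
# (in the course this is fetched with httr and parsed with rvest;
# the structure and class names here are invented for illustration).
html = """
<div class="review"><span class="stars">4</span>
  <p class="text">Great coffee, friendly staff.</p></div>
<div class="review"><span class="stars">2</span>
  <p class="text">Too noisy for working.</p></div>
"""

# Regular expressions pull out each rating and review text. Real
# scrapers should prefer a proper HTML parser over regex, but the
# extract-fields-from-markup principle is the same.
pattern = re.compile(
    r'<span class="stars">(\d+)</span>.*?<p class="text">(.*?)</p>',
    re.DOTALL)

records = [{"stars": int(s), "text": t.strip()}
           for s, t in pattern.findall(html)]
print(records)
```

Each record is now a clean row ready for storage in a database or a data frame for statistical analysis.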
At the end of this course, students should be able to identify valuable online data sources, to write basic scrapers, and to prepare the collected data such that they can use them for statistical analysis as part of their own research projects.
Throughout the course, students will work on a data-scraping project related to their theses. This project will be presented on the final day of the course.
All data scraping code and other sources will be made available at https://www.data-scraping.org.
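The storage step, writing scraped records into a relational database and querying them back, can be sketched as follows. The course uses MySQL in the cloud; this illustration uses Python's built-in sqlite3 module instead, and the table layout and sample rows are hypothetical:

```python
import sqlite3

# The course stores scraped data in MySQL; this sketch uses SQLite
# (Python's built-in sqlite3 module) to show the same relational idea:
# one table per entity, rows inserted as they are scraped.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews (
                  id INTEGER PRIMARY KEY,
                  business TEXT,
                  stars INTEGER,
                  text TEXT)""")

scraped = [("Cafe Alpha", 4, "Great coffee."),
           ("Cafe Alpha", 2, "Too noisy.")]
conn.executemany(
    "INSERT INTO reviews (business, stars, text) VALUES (?, ?, ?)",
    scraped)

# SQL queries then aggregate across rows, e.g. average rating per business.
avg = conn.execute(
    "SELECT business, AVG(stars) FROM reviews GROUP BY business"
).fetchone()
print(avg)  # ('Cafe Alpha', 3.0)
```

Keeping the data in a database rather than in flat files lets a scraper resume after interruptions and lets several analyses share one consistent copy of the data.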
Structure
Preliminary schedule:
Day 1
Intro to data scraping
Define students’ scraping projects
Review of R and introduction to programming with R
Afternoon: R programming exercises
Day 2
The anatomy of the internet and relevant data formats
Intro to web scraping with R (with httr, rvest, RSelenium)
Introduction to APIs
Afternoon: Scraping exercises
Day 3
Relational databases and SQL
Data management with R
Afternoon: Database design and implementation project (with MySQL in the cloud)
Day 4
Scraping examples from Yelp, Crowdspring, Twitter, and Instagram
Scaling up your scraper with parallel code and proxies
Feature extraction examples
Afternoon: Work on your scraping projects
Day 5
Wrap-up of course
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Literature
Mandatory
None, all readings will be provided during the course
Supplementary / voluntary
None, all readings will be provided during the course
Examination part
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Supplementary aids
Individual quiz: Closed book
Presentation of students’ scraping projects: Closed book
Examination content
Lecture slides covering key concepts of R and programming, the anatomy of the internet, relational databases, and scraping (slides will be provided as PDFs the day before classes).
• Students will need to understand R code when they see it but they will not be required to code during the exam.
Literature
None.
Prerequisites (knowledge of topic)
Basic knowledge of descriptive statistics, data analysis and R is useful, but not necessary. Participants need to bring their own laptop and complete our detailed installation instructions for R and RStudio (both open source software) shared prior to the course.
Learning objectives
The creation and communication of data visualizations is a critical step in any data analytic project. Modern open-source software packages offer ever more powerful data visualization tools. Users competent in these tools who apply psychological and design principles can produce data visualizations that easily tell more than a thousand words. In this course, participants learn how to employ state-of-the-art data visualization tools within the programming language R to create stunning, publication-ready data visualizations that communicate critical insights about data. Prior to, during, and after the course, participants work on their own data visualization projects.
Course content
Each day will contain a series of short lectures and demonstrations that introduce and discuss new topics. The bulk of each day will be dedicated to hands-on, step-by-step exercises to help participants ‘learn by doing’. In these exercises, participants will learn how to read in and prepare data, how to create various types of static and interactive data visualizations, how to tweak them to exactly fit one’s needs, and how to embed them in digital reports. Accompanying the course, each participant will work on his or her own data visualization project, turning an initial visualization sketch into a one-page academic paper featuring a polished, well-designed figure. To advance these projects, participants will be able to draw on support from the instructors in the afternoons of course days two to four.
Structure
Day 1
Morning: Cognitive and design principles of good data visualizations
Afternoon: Introduction to R
Day 2
Morning: Reading-in, organizing and transforming data
Afternoon: Project sketch pitches
Day 3
Morning: Creating plots using the grammar of graphics
Afternoon: Visualizing statistical uncertainty, facets, networks, and maps
Day 4
Morning: Styling and exporting plots
Afternoon: Making visualizations interactive
Day 5
Morning: Reporting visualizations using Markdown
Afternoon: Final presentation and competition
Literature
Voluntary readings:
Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons.
Healy, K. (2018). Data visualization: a practical introduction. Princeton University Press.
Examination part
The course grade is determined based on the quality of the initial project sketch (20%), the data visualization produced during the course (40%), and the one-page paper submitted after the course (40%).
Prerequisites and content
Prerequisite knowledge for the course includes the fundamentals of probability and statistics, especially hypothesis testing and regression analysis. This intermediate level course assumes that students can interpret the results of Ordinary Least Squares, Probit, and Logit regressions. They should also be familiar with the problems that are most common in regression, such as multicollinearity, heteroscedasticity, and endogeneity. Finally, students should be comfortable working with computers and data. No prior knowledge of R or network analysis is required.
The concept of “social networks” is increasingly a part of social discussion, organizational strategy, and academic research. The rising interest in social networks has been coupled with a proliferation of widely available network data, but there has not been a concomitant increase in understanding how to analyze social network data. This course presents concepts and methods applicable for the analysis of a wide range of social networks, such as those based on family ties, business collaboration, political alliances, and social media.
Classical statistical analysis is premised on the assumption that observations are sampled independently of one another. In the case of social networks, however, observations are not independent of one another, but are dependent on the structure of the social network. The dependence of observations on one another is a feature of the data, rather than a nuisance. This course is an introduction to statistical models that attempt to understand this feature as both a cause and an effect of social processes.
Since network data are generated in a different way than many other kinds of social data, the course begins by considering the research designs, sampling strategies, and data formats that are commonly associated with network analysis. A key aspect of performing network analysis is describing various elements of the network’s structure. To this end, the course covers the calculation of a variety of descriptive statistics on networks, such as density, centralization, centrality, connectedness, reciprocity, and transitivity. We consider various ways of visualizing networks, including multidimensional scaling and spring embedding. We learn methods of estimating regressions in which network ties are the dependent variable, including the quadratic assignment procedure and exponential random graph models (ERGMs). We consider extensions of ERGMs, including models for two-mode data and networks over time.
Instruction is split between lectures and hands-on computer exercises. Students may find it to their advantage to bring with them a social network data set that is relevant to their research interests, but doing so is not required. The instructor will provide data sets necessary for completing the course exercises.
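As a small illustration of the descriptive statistics discussed above, density and degree centrality can be computed directly from an adjacency matrix. The course works in R with packages such as network and sna; the sketch below is plain Python over a hypothetical four-actor directed network, invented purely for illustration:

```python
# Hypothetical directed network of four actors as an adjacency matrix:
# A[i][j] = 1 means actor i sends a tie to actor j.
A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

n = len(A)

# Density: observed ties over possible ties, n*(n-1) for a directed graph.
ties = sum(sum(row) for row in A)
density = ties / (n * (n - 1))

# Degree centrality: out-degree counts ties sent by each actor,
# in-degree counts ties received.
out_degree = [sum(row) for row in A]
in_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]

print(density, out_degree, in_degree)
```

Statistics like betweenness, transitivity, and centralization build on the same matrix representation, which is why the course spends time early on getting network data into this form.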
Structure
Day 1: Fundamentals of Network Analysis
- Why undertake network analysis?
- How network analysis differs from other statistical methods
- Elements of networks (Nodes, links, modes, attributes, matrices, graphs)
- Key concepts (directionality, symmetry)
- Visualization
- Sampling
- Survey methods
- Working with network data in R
Day 2: Descriptive and Inferential Statistics
- Density
- Degree distributions
- Centrality (degree, betweenness, closeness, power)
- Centralization
- Components and cores
- Triads, triples, and transitivity
- Clustering
- Correlation and the Quadratic Assignment Procedure
- Random graphs
- Descriptive and inferential statistics in R
Day 3: Exponential Random Graph Models (ERGMs)
- Theory
- Specification
- Estimation
- Goodness of Fit
- Working with one-mode and two-mode ERGMs in R
Day 4: Network Data over Time Using Temporal ERGMs
Day 5: Student Presentations and Extensions of ERGM
- Student Presentations
- Additional extension of ERGMs, if time allows
- Concluding Discussion
Literature
Breiger, Ronald L. 1974. “The Duality of Persons and Groups.” Social Forces 53 (2): 181-190.
Burt, Ronald S. 1992. Structural Holes: The Social Structure of Competition. Cambridge, MA: Harvard University Press. Pp. 8-49.
Butts, Carter T. 2008. “network: A Package for Managing Relational Data in R.” Journal of Statistical Software 24 (2): 1-36.
Butts, Carter T. 2008. “Social Network Analysis with sna.” Journal of Statistical Software 24 (6): 1-51.
Cranmer, Skyler J., Bruce A. Desmarais and Jason W. Morgan. 2021. Inferential Network Analysis. New York: Cambridge University Press.
Cranmer, Skyler J., Philip Leifeld, Scott D. McClurg, and Meredith Rolfe. 2017. “Navigating the Range of Statistical Tools for Inferential Network Analysis.” American Journal of Political Science 61 (1): 237-251.
Denny, Matthew J. 2016. “Getting Started with GERGM.” https://www.mjdenny.com/getting_started_with_GERGM.html
Emirbayer, Mustafa. 1997. “Manifesto for a Relational Sociology.” American Journal of Sociology 103 (2): 281-317.
Freeman, Linton C. 1977. “A Set of Measures of Centrality Based on Betweenness.” Sociometry 40 (1): 35-41.
Gould, Roger V., and Roberto M. Fernandez. 1989. “Structures of Mediation: A Formal Approach to Brokerage in Transaction Networks.” Sociological Methodology 19: 89-126.
Granovetter, Mark. 1973. “The Strength of Weak Ties.” American Journal of Sociology 78 (6): 1360-1380.
Heaney, Michael T. 2014. “Multiplex Networks and Interest Group Influence Reputation: An Exponential Random Graph Model.” Social Networks 36 (1): 66-81.
Heaney, Michael T., and Philip Leifeld. 2018. “Contributions by Interest Groups to Lobbying Coalitions.” Journal of Politics 80 (2): 494-509.
Heckathorn, Douglas D. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174-199.
Hunter, David R., Mark S. Handcock, Carter T. Butts, Steven M. Goodreau and Martina Morris. 2008. “ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks.” Journal of Statistical Software 24 (3): 1-29.
Knoke, David, Mario Diani, James Hollway, and Dimitris Christopolous. 2021. Multimodal Political Networks. New York: Cambridge University Press. Pp. 134-157.
Krackhardt, David. 1992. “The Strength of Strong Ties: The Importance of Philos in Organizations.” Pp. 216-239 in Nitin Nohria and Robert Eccles, eds., Networks and Organizations: Structure, Form, and Action. Boston, MA: Harvard Business School Press.
Laumann, Edward O., Peter V. Marsden, and David Prensky. 1983. “The Boundary Specification Problem in Network Analysis.” Pp. 18-34 in Ronald S. Burt and Michael Minor, eds., Applied Network Analysis. Beverly Hills, CA: Sage.
Leifeld, Philip, and Skyler J. Cranmer. 2019. “A Theoretical and Empirical Comparison of the Temporal Exponential Random Graph Model and the Stochastic Actor-Oriented Model.” Network Science 7 (1): 20-51.
Leifeld, Philip, Skyler J. Cranmer, and Bruce A. Desmarais. 2018. “Temporal Exponential Random Graph Models with btergm: Estimation and Bootstrap Confidence Intervals.” Journal of Statistical Software 83 (6): 1-36.
McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27: 415-444.
Morris, Martina, Mark S. Handcock, and David R. Hunter. 2008. “Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects.” Journal of Statistical Software 24 (4): 1-24.
Podolny, Joel M. 2001. “Networks as the pipes and prisms of the market.” American Journal of Sociology 107 (1): 33-60.
Scott, John T. 2017. Social Network Analysis, 4th ed. London: Sage.
Strogatz, Steven. 2010. “The Enemy of My Enemy.” New York Times (February 14).
Watts, Duncan. 1999. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton: Princeton University Press. Pp. 3-40.
Exam
75%: There will be one written computer-based problem set each day from Monday through Thursday (four assignments in total). Time will be allocated in class to complete the assignments, which must be submitted each day.
25%: On the final day of the course, each student will make a presentation to the class on the results of his or her research project for the week. Giving a presentation to the course is required to receive a satisfactory grade in the course.
Course content
The goal is to develop an applied and intuitive (not purely theoretical or mathematical) understanding of the topics and procedures, so that participants can use them in their own research and also understand the work of others. Whenever possible, presentations will be in “Words,” “Picture,” and “Math” languages in order to appeal to a variety of learning styles.
Advanced regression topics will be covered only after the foundations have been established. The ordinary least squares multiple regression topics that will be covered include:
- Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
- Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
- Dichotomous dependent variables: Logit and Probit analysis.
- Outliers, influence, and leverage.
- Advanced diagnostic plots and graphical techniques.
- Matrix algebra: A quick primer. (Optional)
- Regression models… now from a matrix perspective.
- Heteroskedasticity: Definition, consequences, detection, and correction.
- Autocorrelation: Definition, consequences, detection, and correction.
- Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
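Several of these topics reduce to small closed-form computations. Purely as an illustration (the course itself works in SPSS), here is a Python sketch of weighted least squares for a simple regression, the standard correction once heteroskedasticity has been diagnosed; the function name and toy data are mine:

```python
def wls_simple(x, y, w):
    """Weighted least squares for y = b0 + b1*x.
    Weights are typically 1/variance_i, down-weighting the
    high-variance (heteroskedastic) observations."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    b0 = ybar - b1 * xbar
    return b0, b1

# On an exact line y = 1 + 2x, any positive weights recover b0 = 1, b1 = 2.
x = [0, 1, 2, 3]
y = [1, 3, 5, 7]
b0, b1 = wls_simple(x, y, [1, 1, 4, 4])
print(round(b0, 6), round(b1, 6))
```

With all weights equal, the formula collapses to ordinary least squares; unequal weights shift the fit toward the more reliable observations.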
Structure
This course will utilize approximately 325 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in nine Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These nine Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in‑class examples are presented using SPSS, participants are free and encouraged to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive “Tutorial and Answer Key” will be provided for each.
Prerequisites
This course is a continuation of Tim McDaniel’s “Regression I – Introduction” course. While it is not necessary that participants have taken that specific course, they will need to be familiar with many of the topics that are covered in it.
Note: We will use matrix algebra in the second half of the course. We will not use calculus.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information pertaining to several supplemental (and optional) readings for each of the nine Packets of Lecture Transcripts.
- Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
- The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
- Finally, I have included several articles from a number of journals across several academic disciplines. Some of these optional supplemental readings are older classics, while others were written and published more recently.
Examination part
Decentral ‑ Written examination (100%)
Supplementary aids
Open Book
Examination content
The potential substantive content areas for the Final Examination are:
- Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
- Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
- Dichotomous dependent variables: Logit and Probit analysis.
- Outliers, influence, and leverage.
- Advanced diagnostic plots and graphical techniques.
- Regression models… now from a matrix perspective.
- Heteroskedasticity: Definition, consequences, detection, and correction.
- Autocorrelation: Definition, consequences, detection, and correction.
- Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
Since this final examination is the only artifact that will be formally graded in the course, it will determine the course grade. Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the final examination.
The final examination will be written and open‑book (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed). No other materials, including laptops, cell phones, or other electronic devices, will be permitted. The written final exam will be two hours in length and administered during the last course meeting.
Literature
Literature relevant to the exam:
- Lecture Transcripts (nine Packets; approximately 325 pages).
- Class notes (taken by each participant individually).
- Assignment Tutorial and Answer Key documents (for each optional data analysis project).
Supplementary/Voluntary literature not directly relevant to the exam:
- Optional supplemental readings listed in the course syllabus (and discussed earlier).
- Any other textbooks, articles, etc., the participant reads before or during the course.
Prerequisites (knowledge of topic)
Participants should have a basic working knowledge of the principles and practice of multiple regression and elementary statistical inference. No knowledge of matrix algebra is required or assumed, nor is matrix algebra ever used in the course.
Hardware
Participants are strongly encouraged to bring their own laptops (Mac or Windows).
Software
Computer applications will focus on the use of OLS regression and the PROCESS macro for SPSS and SAS developed by Andrew F. Hayes (processmacro.org), which makes the analyses described in this class much easier than they otherwise would be. Because this is a hands-on course, participants are strongly encouraged to bring their own laptops (Mac or Windows) with a recent version of SPSS Statistics (version 19 or later) or SAS (release 9.2 or later) installed. SPSS users should ensure their installed copy is patched to its latest release. SAS users should ensure that the IML product is part of the installation. R and Stata users can benefit from the course content, but PROCESS, which makes these analyses much easier, is not available for R or Stata.
Course content
Statistical mediation and moderation analyses are among the most widely used data analysis techniques in social science, health, and business fields. Mediation analysis is used to test hypotheses about various intervening mechanisms by which causal effects operate. Moderation analysis is used to examine and explore questions about the contingencies or conditions of an effect, also called “interaction”. Increasingly, moderation and mediation are being integrated analytically in the form of what has become known as “conditional process analysis,” used when the goal is to understand the contingencies or conditions under which mechanisms operate. An understanding of the fundamentals of mediation and moderation analysis is in the job description of almost any empirical scholar. In this course, you will learn about the underlying principles and the practical applications of these methods using ordinary least squares (OLS) regression analysis and the PROCESS macro for SPSS and SAS.
Topics covered in this five-day course include:
- Path analysis: Direct, indirect, and total effects in mediation models.
- Estimation and inference about indirect effects in single mediator models.
- Models with multiple mediators
- Mediation analysis in the two-condition within-subject design.
- Estimation of moderation and conditional effects.
- Probing and visualizing interactions.
- Conditional Process Analysis (also known as “moderated mediation”)
- Quantification of and inference about conditional indirect effects.
- Testing a moderated mediation hypothesis and comparing conditional indirect effects
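The course carries out these analyses with the PROCESS macro. Purely to illustrate the logic of the first two topics, here is a Python sketch (all function names and toy data are mine, not PROCESS output) of the single-mediator computation: the indirect effect a·b, with a percentile-bootstrap confidence interval of the kind PROCESS reports:

```python
import math
import random

def ols_slope(x, y):
    """Slope from the simple regression of y on x (the a path: X -> M)."""
    xb, yb = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
            / sum((xi - xb) ** 2 for xi in x))

def b_path(x, m, y):
    """Slope of M in the regression of Y on M and X (2x2 normal equations)."""
    n = len(x)
    xb, mb, yb = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    smm = sum((mi - mb) ** 2 for mi in m)
    sxm = sum((xi - xb) * (mi - mb) for xi, mi in zip(x, m))
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    smy = sum((mi - mb) * (yi - yb) for mi, yi in zip(m, y))
    return (smy * sxx - sxy * sxm) / (smm * sxx - sxm ** 2)

def indirect(x, m, y):
    """Indirect effect a*b: the X -> M path times the M -> Y path."""
    return ols_slope(x, m) * b_path(x, m, y)

def boot_ci(x, m, y, reps=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n, ests = len(x), []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        ests.append(indirect([x[i] for i in idx],
                             [m[i] for i in idx],
                             [y[i] for i in idx]))
    ests.sort()
    return ests[int(reps * alpha / 2)], ests[int(reps * (1 - alpha / 2)) - 1]

# Toy data with true paths a = 0.5 and b = 0.7 (true indirect effect 0.35).
x = list(range(20))
m = [0.5 * xi + 0.3 * math.sin(7 * i) for i, xi in enumerate(x)]
y = [0.7 * mi + 0.4 * math.sin(11 * i) for i, mi in enumerate(m)]
lo, hi = boot_ci(x, m, y)
print(round(indirect(x, m, y), 2), "CI:", (round(lo, 2), round(hi, 2)))
```

The bootstrap avoids assuming normality of a·b, which is the main reason it is preferred over the normal-theory Sobel test.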
As an introductory-level course, we focus primarily on research designs that are experimental or cross-sectional in nature with continuous outcomes. We do not cover complex models involving dichotomous outcomes, latent variables, models with more than two repeated measures, nested data (i.e., multilevel models), or the use of structural equation modeling.
This course will be helpful for researchers in any field—including psychology, sociology, education, business, human development, political science, public health, communication—and others who want to learn how to apply the latest methods in moderation and mediation analysis using readily-available software packages such as SPSS and SAS.
Structure
The schedule for the course will be determined partly by students’ previous experience and existing familiarity with mediation and moderation. The schedule below is a rough approximation.
Day 1
- Path analysis: Direct, indirect, and total effects in mediation models.
- Estimation and inference about indirect effects in single mediator models.
Day 2
- Models with multiple mediators
- Mediation analysis in the two-condition within-subject design.
Day 3
- Estimation of moderation and conditional effects.
- Probing and visualizing interactions.
- Moderation analysis in the two-condition within-subject design
Days 4 & 5
- Estimation of conditional process models (also known as “moderated mediation”)
- Quantification of and inference about conditional indirect effects.
- Testing a moderated mediation hypothesis and comparing conditional indirect effects
Literature
This course is a companion to Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press. The content of the course overlaps with the book to some extent, but many of the examples are different, and this course includes material not in the first edition of the book. A copy of the book is not required to benefit from the course, but it could help reinforce understanding.
Beyond IMMCPA additional materials include:
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6-27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1-22.
Mandatory:
No materials are mandatory, but students will benefit greatly from reading Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press
Supplementary / voluntary:
Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6-27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1-22.
Mandatory readings before course start:
N/A
Examination part
100% of the assessment will be based on a written final examination at the end of the course. The exam will be a combination of multiple-choice questions and short-answer/fill-in-the-blank questions, along with some interpretation of computer output. Students will take the examination home on the last day of class and return it to the instructor within one week.
During the examination students will be allowed to use all course materials, such as PDFs of PowerPoint slides, student notes taken during class, and any other materials distributed or student-generated during class. Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the exam, students may use the book if desired during the exam.
A computer is not required during the exam, though students may use a computer if desired, for example as a storage and display device for class notes provided to them during class.
Examination content
Topics on the exam may include how to quantify and interpret path analysis models, calculate direct, indirect, and total effects, and determine whether evidence of a mediation effect exists in a data set based on computer output provided or other information. Also covered will be testing moderation of an effect, interpreting evidence of interaction, and probing interactions. Students will be asked to generate or interpret conditional indirect effects from computer output given to them and/or determine whether an indirect effect is moderated. Students may be asked to construct computer commands that will conduct certain analyses. All questions will come from the content listed in “Course Content” above.
Literature
Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the assignments, students may use the book if desired.
Prerequisites (knowledge of topic)
Required: knowledge of statistics and introductory econometrics (or equivalent biometrics, technometrics, etc.), comprising basic statistics, estimation and testing in multivariate linear regression models, and simple calculus (including with vectors and matrices). Knowledge of estimation should include moment, likelihood, and least squares methods.
Some knowledge of inference in non- or generalized linear models is an advantage.
Hardware
Laptops.
Software
The statistical language R should be installed on the laptops.
Learning objectives
The topic is estimation and testing of regression problems typically considered in microeconometrics by means of (standard) nonparametric methods.
The concept/content is: nonparametric density estimation (univariate, joint, conditional); nonparametric estimation of conditional moments; miscellaneous (model selection, bandwidth choice, conditional distribution); semiparametric estimation of generalized structured models; nonparametric testing.
The approach is teaching half intuition, half (asymptotic) theory.
After successful completion, students will know, understand, and be able to apply nonparametric methods for data analysis, in particular estimation and regression. Moreover, the mixed approach enables them to broaden and deepen their knowledge in this direction, so that they can also apply non- and semiparametric methods in much more complex situations than those outlined in this course.
Course content
Nonparametric density estimation (histograms and kernel densities) for uni- and multivariate distributions; Nonparametric regression (with kernels, kNN, series estimators and splines); Miscellaneous of nonparametric regression (model selection, bandwidth selection, practical issues including implementation); Semiparametric estimation of regression functions and probabilities (in particular backfitting and marginal integration for generalized structured models); Nonparametric specification testing (of parametric, semiparametric and structural hypotheses).
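The course works in R; as a language-agnostic illustration of the first topic, here is a Python sketch (names and toy data are mine) of the Gaussian kernel density estimator, f̂(x) = (1/nh) Σᵢ K((x − Xᵢ)/h):

```python
import math

def kde(x, data, h):
    """Gaussian-kernel density estimate at point x with bandwidth h."""
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gauss((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.0, 1.2, 2.8, 3.0, 3.1]
# A valid density must integrate to one; check with a coarse Riemann sum.
step = 0.01
grid = [i * step for i in range(-500, 1000)]   # covers -5.0 ... 10.0
area = sum(kde(g, data, h=0.4) for g in grid) * step
print(round(area, 3))
```

The bandwidth h plays the role that bin width plays for a histogram: small h gives a wiggly, low-bias/high-variance estimate, large h a smooth, high-bias/low-variance one, which is exactly the selection problem treated later in the course.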
Structure
Day 1:
Morning session: 1. The basic model for studying variables
(a) From histograms and empirical distribution function to kernel densities;
(b) What model selection means when there is no model
Afternoon session: 2. Toward the study of relations of variables
(a) Multivariate/joint densities; (b) Conditional densities
Day 2:
Morning session: 3. Conditional Moments: regression without model specification.
(a) From conditional distributions to conditional moments; (b) Local vs global fits
Afternoon session: continued …
(c) Mixtures of global and local fits
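As an illustration of a purely local fit from the Day 2 material (the course uses R; this Python sketch and its names are mine), the Nadaraya-Watson estimator of a conditional mean, i.e., a kernel-weighted average of the responses near x₀:

```python
import math

def nadaraya_watson(x0, xs, ys, h):
    """Local constant estimate of E[Y | X = x0]:
    Gaussian-kernel-weighted average of the observed responses."""
    w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

# Noiseless sine curve on a regular grid: the local fit tracks it closely.
xs = [i / 10 for i in range(31)]       # 0.0 ... 3.0
ys = [math.sin(xi) for xi in xs]
est = nadaraya_watson(1.5, xs, ys, h=0.2)
print(round(est, 3), round(math.sin(1.5), 3))
```

A global fit (e.g., one straight line through all points) would miss the curvature entirely; the local estimator uses only nearby observations and therefore adapts to it, at the price of the bias-variance trade-off governed by h.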
Day 3:
Morning session: 4. Miscellaneous of nonparametric regression
(a) Model selection and its applications;
Afternoon session: continued …
(b) Conditional c.d.f. (c) Comments on causality
Day 4:
Morning session: 5. Generalized Structured Models
(a) Basic principles of semiparametrics; (b) Marginal integration; (c) Linear Mixed Models
Afternoon session: continued …
(d) Backfitting; (e) Likelihood-related approaches
Day 5:
Morning session: 6. Validation of economic models
(a) Bootstrap in non- and semiparametrics; (b) Nonparametric tests
Afternoon session: continued …
(c) Semiparametric tests; (d) Notes on subsampling
Literature
Mandatory
None.
Recommendations
W. Härdle, M. Müller, S. Sperlich, A. Werwatz (2004) Nonparametric and Semiparametric Models,
Springer Series in Statistics, Springer-Verlag, Heidelberg, NY; ISBN: 3-540-20722-8
For R-codes and more visit http://www.marlenemueller.de/nspm.html
Qi Li and Jeffrey Scott Racine (2006) Nonparametric Econometrics: Theory and Practice, Princeton University Press, Princeton; ISBN: 978069112611
D.J. Henderson and C.F. Parmeter (2015) Applied Nonparametric Econometrics, Cambridge University Press, NY, ISBN: 978-0-521-27968-0
Mandatory readings before course start
None.
Examination part
The examination consists 100% of a written test (closed book if possible); for the content and structure of the test, see below.
Supplementary aids
None.
Examination content
The test will be written, lasting 120 minutes, at the end of the course.
Half of the test consists of multiple-choice questions with four possible answers (one correct); the other half is a list of short questions to be answered in text and mathematical formulas. Explicit proofs and calculations are not demanded.
The questions in both parts are intended to cover all subjects treated (see “Structure” for the detailed list). Students will mainly be asked comprehension questions that test both their understanding and their knowledge of how nonparametric estimation methods work.
Literature
None.
Prerequisites (knowledge of topic)
This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of at least several machine learning classification methods. Students having equivalent real-world experience (via other ML courses or on-the-job experiences) are also welcome.
Hardware
A laptop computer is required to complete the in-class exercises.
Software
R (https://www.r-project.org/) and R Studio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover the practical techniques that are not often found in textbooks but are discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and learning more advanced techniques to improve the performance of traditional ML methods.
Structure
The course will be designed to be interactive, with ample time for hands-on practice. Each day will include at least one lecture based on the day’s topic in addition to a hands-on “lab” section to apply the learnings to a competition dataset (or one’s own data).
The tentative schedule is as follows:
Day 1: Handling messy data
Discussion: Typically, 80% of the time spent on ML is for data preparation. Why?
Lecture: Learning to explore data
Lecture: Missing values – imputation and other strategies
Lecture: The R data pipeline – tidyverse
Lab: Getting to know your data
Day 2: Understanding ML performance
Discussion: What makes a successful ML model?
Lecture: Getting beyond accuracy – other performance measures
Lecture: The “no free lunch” theorem
Lecture: Estimating future performance – sampling methods, model selection
Lab: Comparing models on your dataset with ROC curves
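The Day 2 lab compares models with ROC curves in R; the underlying computation is simple enough to sketch from scratch. A hedged Python illustration (names and toy scores are mine; score ties are not specially handled) of building the ROC points and the trapezoidal AUC:

```python
def roc_points(scores, labels):
    """ROC curve: sweep the decision threshold from high to low and
    record (false-positive rate, true-positive rate) at each cut."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, lab in pairs:
        if lab == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A classifier that ranks every positive above every negative: AUC = 1.
pts = roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(auc(pts))  # 1.0
```

Unlike raw accuracy, AUC is threshold-free and insensitive to class imbalance, which is why it is the usual competition metric for ranking-quality comparisons.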
Day 3: Improving ML performance
Discussion: What factors keep ML models from perfect prediction?
Lecture: Tuning stock models – automated parameter tuning
Lecture: Meta-learning – ensembles, stacked models
Lab: Machine Learning Competition (Round 1)
Day 4: “Big data” problems
Discussion: Is more data always better? Why or why not?
Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
Lecture: Imbalanced datasets – under- and over-sampling strategies
Lecture: Improving R’s performance on big data
Lab: Machine Learning Competition (Round 2)
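The simplest of the Day 4 sampling strategies is random over-sampling of the minority class. A hedged Python sketch (function name and toy rows are mine; each row's last element is its class label):

```python
import random

def oversample(data, seed=0):
    """Randomly duplicate minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    pos = [row for row in data if row[-1] == 1]
    neg = [row for row in data if row[-1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return data + extra

data = [(0.1, 0)] * 9 + [(0.9, 1)]                   # 9:1 imbalance
balanced = oversample(data)
print(sum(row[-1] for row in balanced), len(balanced))  # 9 18
```

Duplication must happen only inside the training split: over-sampling before the train/test split leaks copies of test rows into training and inflates performance estimates.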
Day 5: Next-generation “Black Box” methods
Discussion: What are the strengths and weaknesses of man versus machine?
Lecture: Deep Learning – Keras, Tensorflow
Lecture: Text embeddings – word2vec
Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
Discussion: Results of ML Competition – winners’ tips and tricks
Lab: Work on your final project
Literature
Mandatory
PDFs with readings will be distributed prior to the start of each class day.
Supplementary / voluntary
None required.
Mandatory readings before course start
Students should have R and R Studio installed on their laptops prior to the first class. Be sure that these are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
80% of the course grade will be based on a project and final report (approximately 5-10 pages), to be delivered within 2-3 weeks after the course in R Notebook format. The project is based on a challenging real-world dataset given to all course participants. The project will be graded based on its use of the methods covered in class as well as making appropriate conclusions from the data.
The remaining 20% of the course grade will be based on participation during in-class discussions and performance during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.
Supplementary aids
Students may reference any literature as needed when writing the final report.
Examination content
The primary goal of the final project is for students to demonstrate the ability to solve difficult ML tasks. The project should reflect an understanding of the material covered throughout the week, as well as an ability to apply the material in new and innovative ways.
Literature
Not applicable.
Prerequisites (knowledge of topic)
As long ago as 2010, Eric Schmidt, the executive chairman of Alphabet, observed that every two days we generate as much information as was created in the entire history of civilization up to 2003. The problem is that much of this information is unstructured: it is not organized in a pre-defined manner. This lack of structure complicates extracting useful insights from these massively growing data sources. Students should have some familiarity with Python or R programming. Please bring a laptop to class. You will also need a Google account to practice using Colab.
Learning objectives and course content
In this class, we will explore different statistical approaches that have proven useful in making sense of unstructured data. The course is centered around business applications that involve the analysis of text, social networks, and images, as well as their relationships with meta-data. For most of the analyses, we will use Python/R and dedicate some of the class sessions to hands-on time. Students are invited to bring their own unstructured data sets, but doing so is not required.
Structure
Day 1: Text mining: text representation, word2vec, sentiment analysis, topic modeling.
Day 2: Supervised and unsupervised machine learning: regression, random forest, K-means.
Day 3: Social network analysis: centralities, community detection, and representation learning.
Day 4: Image analysis: image processing, deep learning.
Day 5: Discussion of student projects.
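As a taste of the Day 1 material, the most basic form of sentiment analysis is lexicon counting over a bag-of-words representation. A Python sketch (the tiny word lists and function name are mine, purely illustrative):

```python
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment_score(text):
    """Lexicon-based sentiment: +1 per positive token, -1 per negative
    token, normalized by the total number of tokens."""
    tokens = text.lower().split()
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return score / len(tokens)

print(sentiment_score("great product love it"))   # 0.5
print(sentiment_score("terrible support"))        # -0.5
```

Approaches covered later in the week (word2vec, topic models) replace these hand-built word lists with representations learned from the corpus itself.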
Literature
The following books provide useful background material for the class. I will refer to more specialized publications as part of my lecture.
Introduction to information retrieval:
https://nlp.stanford.edu/IR-book/
Deep Learning:
https://www.deeplearningbook.org/
Community detection in graphs:
https://www.sciencedirect.com/science/article/pii/S0370157309002841
Graph representation learning book:
https://www.cs.mcgill.ca/~wlh/grl_book/
Python:
https://docs.python.org/3/
Examination Part
Final grades are based on a portfolio of assigned exercises. The solutions are due about two weeks after the end of the course.
Prerequisites (knowledge of topic)
• Basic knowledge of probability and statistics, such as conditional probability, hypothesis testing, and regression.
• Experience working with data (with any software package).
• No prior knowledge of causality and causal inference is required.
Hardware
A laptop for the practical sessions.
Software
R and R Studio (both are freely available).
Learning objectives
Upon completion of the course, you will be able to:
• Recognize research problems that require causal inference.
• Describe different perspectives and models in causal inference.
• Implement a selection of tools in causal inference.
• Critically evaluate lab and field experiments and other research designs.
• Develop research designs to address causal inference problems.
Course content
Causal questions in the form of how X influences Y are pervasive in real life. It is therefore imperative for us to know how to address these questions, especially given the “big data revolution” in the last decade. Moreover, without the understanding of causal inference, we can easily fall victim to misinformation. For example, in response to Apple’s new privacy policy on the mobile system, Facebook launched a series of full-page newspaper ads, claiming that Apple’s new privacy policy would hurt small business advertisers. Facebook concluded that for small business advertisers, the new policy would lead to “a cut of 60% in their sales for every dollar they spend.” However, is the claim credible? How do we judge its credibility? To answer these questions, in this course, you are introduced to the exciting area of causal inference.
This course provides you with conceptual understanding, as well as tools to learn causality from data. These understandings and tools come from the rapidly developing science of causal inference. On the conceptual level, the course covers basic concepts such as causation vs. correlation, causal inference, causal identification, and counterfactuals. It also presents perspectives and tools to help you formalize and conceptualize causal relationships. These perspectives and tools are synthesized from multiple disciplines, including statistics (e.g., the Rubin Causal Model or Potential Outcome Framework), computer science (e.g., the Pearlian Causal Model or Causal Graph), and econometrics (e.g., identification strategies and the local average treatment effect).
In this course, we will also discuss a selection of tools in causal inference. We will start with the completely randomized experiment and discuss the assignment mechanism, Fisher’s exact p-value, and Neyman’s repeated sampling approach. From then on, we will gradually relax the assumption of complete randomization and discuss situations where complete randomization does not hold. Specifically, we will discuss the following: First, block randomization and conditional random assignment, with a focus on matching and weighting estimators; Second, non-compliance, where the random assignment fails, and the local average treatment effect; Third, attrition, where some outcomes are missing, and the bounding approach; Fourth, research designs when the assignment mechanism is unknown to us, including the difference-in-differences approach and the regression discontinuity design. Moreover, on the last day of the course, we will discuss the ethics of causal inference and the assessment of the unconfoundedness assumption. As the “finale,” we will go through the most recent developments at the intersection of causal inference and machine learning, where machine learning techniques are used to address causal inference problems.
One major distinction of the course is its emphasis on practical relevance. Throughout the course, you are given cases and real data to apply what you learn to real causal inference problems. The course is split between lectures and practical sessions. Cases and data will be provided by the instructor before class.
Structure
Day 1:
1. Introduction to causality and causal inference.
2. Overview of the course.
3. Causal graph: a new language of causality.
4. Causal identification.
5. The formulation of an empirical strategy.
6. Practical session: Using DAGitty to draw and analyze causal graphs.
Day 2:
1. The potential outcome framework.
2. The assignment mechanism.
3. Fisher’s vs. Neyman’s treatment of completely randomized experiments.
4. Counterfactuals: Rubin vs. Pearl.
5. Block randomization.
6. Selection on observables: subclassification, matching and weighting.
7. Practical session: Matching and weighting estimation of the effect of virtual fitting on sales.
Day 3:
1. Non-compliance problem in the assignment mechanism.
2. The intention-to-treat and the local average treatment effect (LATE).
3. The link of LATE to the instrumental variables strategy.
4. The attrition problem after randomization.
5. Classic treatments of the attrition problem.
6. The bounding approach and Lee bounds.
7. Practical session: Estimating LATE in a field experiment of coupons.
Day 4:
1. Canonical difference-in-difference (DID) model.
2. Two-way fixed-effects estimation of the multiple-period DID.
3. DID with staggered adoptions of the treatment.
4. The basic idea of the regression discontinuity design (RDD).
5. How to estimate the LATE with RDD.
6. Regression kink design and the bunching method.
7. Practical session: DID analysis of the Low Carbon London Electricity Trial.
Day 5:
1. Ethics of causal inference: being transparent about your assumptions.
2. Assessing the assumption of unconfoundedness.
3. Sensitivity analysis in regression adjustment.
4. Placebo and falsification test in DID and RDD.
5. Recent applications of machine learning techniques in causal inference.
Literature
Lecture notes will be shared. You are encouraged to read them before class.
Books (all supplementary):
• Gerber, Alan S., and Donald P. Green. Field experiments: Design, analysis, and interpretation. WW Norton, 2012.
• Guido W. Imbens and Donald B. Rubin. Causal inference for statistics, social, and biomedical sciences: An Introduction. Cambridge University Press, 2015.
• Lee, Myoung-jae. Matching, regression discontinuity, difference in differences, and beyond. Oxford University Press, 2016.
• Morgan, Stephen L., and Christopher Winship. Counterfactuals and causal inference. Cambridge University Press, 2015.
• Pearl, Judea, Madelyn Glymour, and Nicholas P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.
Articles:
(starred ones are mandatory and others are supplementary)
(arranged in chronological order of course contents)
Day 1:
• * Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669-688. [Section 1-3].
• Lewbel, A. (2019). The identification zoo: Meanings of identification in econometrics. Journal of Economic Literature, 57(4), 835-903.
• * Angrist, J. (2022). Empirical strategies in economics: Illuminating the path from cause to effect (No. w29726). National Bureau of Economic Research.
Day 2:
• * Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.
• Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), 1129-79.
• * Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1.
• Imbens, G. W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2), 373-419.
Day 3:
• * Imbens, G. W., & Angrist, J. D. (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica, 62(2), 467-475.
• Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444-455.
• DiNardo, J., McCrary, J., & Sanbonmatsu, L. (2006). Constructive proposals for dealing with attrition: An empirical example. Working paper, University of Michigan.
• * Manski, C. F. (1990). Non-parametric bounds on treatment effects. American Economic Review, 80(2), 319-323.
• * Lee, D. S. (2009). Training, wages, and sample selection: Estimating sharp bounds on treatment effects. Review of Economic Studies, 76(3), 1071-1102.
Day 4:
• * Lechner, M. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends® in Econometrics, 4(3), 165-224.
• Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
• * Cattaneo, M. D., & Titiunik, R. (2022). Regression discontinuity designs. Annual Review of Economics, 14, 821-851.
• Lee, D. S., & Lemieux, T. (2010). Regression Discontinuity Designs in Economics. Journal of Economic Literature, 48, 281-355.
• Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142, 615-635.
Day 5:
• * Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1), 13-26.
• * Altonji, J. G., Elder, T. E., & Taber, C. R. (2005). Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools. Journal of Political Economy, 113(1), 151.
• Oster, E. (2019). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2), 187-204.
• Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
• Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters: Double/debiased machine learning. The Econometrics Journal, 21(1). C1-C68.
Examination part
The performance of participants will be assessed by a take-home assignment at the end of the course (100%). The assignment will be due three weeks after the end of the course.
Supplementary aids
The examination aids and documents are listed below:
1. Lecture notes for all the lectures.
2. Starred articles in the literature list.
3. R notebooks shared after the practical sessions.
Examination content
The topics covered in the assignment are:
• The basics of the causal graph.
• Formulating an identification strategy with the causal graph.
• Evaluating empirical strategies.
• The basics of the potential outcome framework.
• Calculating Fisher’s exact p-value and Neyman’s repeated sampling statistics.
• Matching and weighting estimators.
• Propensity scores.
• The non-compliance problem.
• Calculating local average treatment effects.
• The attrition problem in experiments.
• The bounding approach.
• Basics of the difference-in-difference approach.
• Basics of the regression discontinuity design.
• Sensitivity analysis.
• Falsification and placebo tests.
Examination relevant literature
Only the starred articles in the literature list are necessary for the examination; books and the remaining articles are supplementary.
Prerequisites (knowledge of topic)
Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Students should have completed Ph.D.-level courses in introductory statistics, and in linear and generalized linear regression models (including logistic regression, etc.), up to the level of Regression III. Familiarity with discrete and continuous univariate probability distributions will be helpful.
Hardware
Students will be required to provide their own laptop computers.
Software
All analyses will be conducted using the R statistical software. R is free, open-source, and runs on all contemporary operating systems. The instructor will also offer support for students wishing to use Stata.
Learning objectives
Students will learn how to visualize, analyze, and conduct diagnostics on models for observational data that has both cross-sectional and temporal variation.
Course content
Analysts increasingly find themselves presented with data that vary both over cross-sectional units and across time. Such panel data provide unique and valuable opportunities to address substantive questions in the economic, social, and behavioral sciences. This course will begin with a discussion of the relevant dimensions of variation in such data, and discuss some of the challenges and opportunities that such data provide. It will then progress to linear models for one-way unit effects (fixed, between, and random), models for complex panel error structures, dynamic panel models, nonlinear models for discrete dependent variables, and models that leverage panel data to make causal inferences in observational contexts. Students will learn the statistical theory behind the various models, details about estimation and inference, and techniques for the visualization and substantive interpretation of their statistical results. Students will also develop statistical software skills for fitting and interpreting the models in question, and will use the models in both simulated and real data applications. Students will leave the course with a thorough understanding of both the theoretical and practical aspects of conducting analyses of panel data.
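As a flavor of the unit-effects material, the “within” (fixed-effects) estimator can be sketched in a few lines: demeaning each unit’s data removes time-invariant unit effects, so a regression on the demeaned data recovers the slope even when those effects are correlated with the regressor. This is an illustrative simulation, not course code (the course itself works in R and Stata), and all names and parameter values are invented.

```python
import random
from collections import defaultdict

rng = random.Random(42)
beta = 2.0
units, periods = 50, 10

# Simulate a panel where the unit effect enters both the regressor and
# the outcome, so pooled OLS would be biased.
y, x, unit_id = [], [], []
for i in range(units):
    alpha_i = rng.gauss(0, 3)              # time-invariant unit effect
    for t in range(periods):
        x_it = alpha_i + rng.gauss(0, 1)   # regressor correlated with the effect
        y_it = alpha_i + beta * x_it + rng.gauss(0, 1)
        unit_id.append(i)
        x.append(x_it)
        y.append(y_it)

def within_estimator(y, x, unit_id):
    """OLS slope after subtracting unit-specific means from y and x."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for yi, xi, g in zip(y, x, unit_id):
        s = sums[g]
        s[0] += yi
        s[1] += xi
        s[2] += 1
    num = den = 0.0
    for yi, xi, g in zip(y, x, unit_id):
        sy, sx, n = sums[g]
        num += (xi - sx / n) * (yi - sy / n)
        den += (xi - sx / n) ** 2
    return num / den

beta_hat = within_estimator(y, x, unit_id)
print(round(beta_hat, 2))  # close to the true slope of 2.0
```

Because the unit effect here raises both x and y, pooled OLS on these data would overstate the slope, while the within transformation recovers it; the between and random-effects estimators covered in the course trade off this robustness against efficiency.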
Structure
Day One:
Morning:
• (Very) Brief Review of Linear Regression
• Overview of Panel Data: Visualization, Pooling, and Variation
• Regression with Panel Data
Afternoon:
• Unit Effects Models: Fixed-, Between-, and Random-Effects
Day Two:
Morning:
• Dynamic Panel Data Models: The Instrumental Variables / Generalized Method of Moments Framework
Afternoon:
• More Dynamic Models: Orthogonalization-Based Methods
Day Three:
Morning:
• Unit-Effects and Dynamic Models for Discrete Dependent Variables
Afternoon:
• GLMs for Panel Data: Generalized Estimating Equations (GEEs)
Day Four:
Morning:
• Introduction to Causal Inference with Panel Data (Including Unit Effects)
Afternoon:
• Models for Causal Inference: Differences-In-Differences, Synthetic Controls, and Other Methods
Day Five:
Morning:
• Practical Issues: Model Selection, Specification, and Interpretation
Afternoon:
• Course Examination
Literature
Mandatory
Hsiao, Cheng. 2014. Analysis of Panel Data, 3rd Ed. New York: Cambridge University Press.
OR
Croissant, Yves, and Giovanni Millo. 2018. Panel Data Econometrics with R. New York: Wiley.
Supplementary / voluntary
Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies 72:1-19.
Anderson, T. W., and C. Hsiao. 1981. “Estimation Of Dynamic Models With Error Components.” Journal of the American Statistical Association 76:598-606.
Antonakis, John, Samuel Bendahan, Philippe Jacquart, and Rafael Lalive. 2010. “On Making Causal Claims: A Review and Recommendations.” The Leadership Quarterly 21(6):1086-1120.
Arellano, M. and S. Bond. 1991. “Some Tests Of Specification For Panel Data: Monte Carlo Evidence And An Application To Employment Equations.” Review of Economic Studies 58:277-297.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What To Do (And Not To Do) With Time-Series Cross-Section Data.” American Political Science Review 89(September): 634-647.
Bliese, P. D., D. J. Schepker, S. M. Essman, and R. E. Ployhart. 2020. “Bridging Methodological Divides Between Macro- and Microresearch: Endogeneity and Methods for Panel Data.” Journal of Management, 46(1):70-99.
Clark, Tom S. and Drew A. Linzer. 2015. “Should I Use Fixed Or Random Effects?” Political Science Research and Methods 3(2):399-408.
Doudchenko, Nikolay, and Guido Imbens. 2016. “Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis.” Working paper: Graduate School of Business, Stanford University.
Gaibulloev, K., Todd Sandler, and D. Sul. 2014. “Of Nickell Bias, Cross-Sectional Dependence, and Their Cures: Reply.” Political Analysis 22: 279-280.
Hill, T. D., A. P. Davis, J. M. Roos, and M. T. French. 2020. “Limitations of Fixed-Effects Models for Panel Data.” Sociological Perspectives 63:357-369.
Hu, F. B., J. Goldberg, D. Hedeker, B. R. Flay, and M. A. Pentz. 1998. “Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes.” American Journal of Epidemiology 147(7):694-703.
Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 62:467-490.
Keele, Luke, and Nathan J. Kelly. 2006. “Dynamic Models for Dynamic Theories: The Ins and Outs of Lagged Dependent Variables.” Political Analysis 14(2):186-205.
Lancaster, Tony. 2002. “Orthogonal Parameters and Panel Data.” Review of Economic Studies 69:647-666.
Liu, Licheng, Ye Wang, Yiqing Xu. 2019. “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data.” Working paper: Stanford University.
Mummolo, Jonathan, and Erik Peterson. 2018. “Improving the Interpretation of Fixed Effects Regression Results.” Political Science Research and Methods 6:829-835.
Neuhaus, J. M., and J. D. Kalbfleisch. 1998. “Between- and Within-Cluster Covariate Effects in the Analysis of Clustered Data.” Biometrics 54(2): 638-645.
Pickup, Mark and Vincent Hopkins. 2020. “Transformed-Likelihood Estimators for Dynamic Panel Models with a Very Small T.” Political Science Research & Methods, forthcoming.
Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25:57-76.
Zorn, Christopher. 2001. “Generalized Estimating Equation Models for Correlated Data: A Review with Applications.” American Journal of Political Science 45(April):470-90.
Mandatory readings before course start
Hsiao, Cheng. 2007. “Panel Data Analysis — Advantages and Challenges.” Test 16:1-22.
Examination part
Students will be evaluated on two written homework assignments that will be completed during the course (20% each) and a final examination (60%). Homework assignments will typically involve a combination of simulation-based exercises and “real data” analyses, and will be completed during the evenings while the class is in session. For the final examination, students will have two alternatives:
• “In-Class”: Complete the final examination in the afternoon of the last day of class (from roughly noon until 6:00 p.m. local time), or
• “Take-Home”: Complete the final examination during the week following the end of the course (due date: TBA).
Additional details about the final examination will be discussed in the morning session on the first day of the course.
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data with a panel / time-series cross-sectional component. Students will be required to specify, estimate, and interpret various statistical models, to conduct and present diagnostics and robustness checks, and to give detailed justifications for their choices.
Examination relevant literature
See above. Details of the examination literature will be finalized prior to the start of class.
Prerequisites (knowledge of topic)
- Basic knowledge of the R programming language
- Basic statistical knowledge, including graduate-level statistics
Hardware
- A laptop computer with an Internet connection. The laptop should have at least 4 GB of RAM (preferably more, because text mining is memory-intensive).
Software
- A modern web browser (e.g., Chrome)
- R (https://www.r-project.org/), RStudio (https://www.rstudio.com/products/rstudio/), and git (https://git-scm.com/downloads) are available at no cost and are needed for this course. Please install all three on your personal laptop prior to class.
- As a backup, students should also sign up at https://rstudio.cloud/
- Specific R packages will be shared prior to class for installation onto the laptop. The installation script will be emailed to participants and shared on the class GitHub repository.
Course content
Text mining is the art and science of extracting insights from large amounts of natural language. The course will help students add natural language processing techniques to their research and data science toolsets. As a technical course with some machine learning elements, it requires limited exposure to programming, graduate-level statistics, and mathematical theory, but the vast majority of the course content will be focused on applying popular text mining methods. As a result, the target audience may also include qualitative researchers looking to add quantitative analysis to interviews, media, and other language-based field research, as long as participants have some basic R background.
Students who stay engaged in the course and complete the suggested readings and code will be able to think systematically about how information can be obtained from diverse natural language. They will learn how to implement a variety of popular text mining algorithms in R (a free and open-source software environment) to identify insights, extract information, and measure emotional content.
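As a taste of the bag-of-words techniques covered during the week, here is a minimal TF-IDF scoring sketch. It is written in Python purely for illustration (the course itself works in R), and the toy documents are invented: terms that appear in many documents are down-weighted relative to terms that are distinctive of one document.

```python
import math
from collections import Counter

# Toy documents invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def tf_idf(docs):
    """Term frequency times inverse document frequency, per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
# "the" appears in two of the three documents, so it is down-weighted
# relative to "cat", which is unique to the first document.
print(scores[0]["cat"] > scores[0]["the"])  # → True
```

Real corpora need the cleaning steps the course covers first (lowercasing, punctuation and stop-word removal, stemming); the weighting idea itself is exactly what TF-IDF-weighted term-document matrices encode.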
Structure
Overall, the course is meant to be a practical examination of text mining, with some overlap with machine learning techniques for natural language. Following the adult learning model, each day will have a lecture, a demonstration, and a co-working session; finally, students will have a standalone lab where they can apply the technique to new data with instructor support.
Specifically, each morning session will include a lecture and a code step-through demonstrating a text mining technique. In the afternoon, the technique will be applied to a new data set, followed by a lab. During the lab, yet another data set will be provided, or students can apply the day’s technique to their own data.
Day 1: R Basics & What is text mining?
Intro to R programming
String Manipulation & Text Cleaning
Lab Section: Clean tweets and prepare them for a bag-of-words examination
Day 2: Common Text Mining Visuals
Word Frequency & Term-Frequency Inverse Document Frequency (TF-IDF)
Term Document, & Document Term Matrices
Word Clouds – Comparison Clouds, Commonality Clouds
Other Visuals – Word Networks, Associations, Pyramid Plots, Treemaps
Lab Section: Create various visualizations with news articles
Day 3: Sentiment Analysis & Unsupervised Learning: Topic Modeling & Clustering
Sentiment Lexicons – Negation, Amplification, Valence Shifters
K-Means & Spherical K-Means
Correlated Topic Modeling
Lab Section: Clustering Professional Resumes/CVs
Day 4: Supervised Learning: Document Classification
Elastic Net (Lasso & Ridge Regression)
Data Science Ethics – IBM Watson’s use of text for cancer diagnosis
Lab Section: Classify clickbait from news headlines
Day 5: OpenNLP & Text Sources
Named Entity Recognition
APIs, web-scraping basics, Microsoft Office documents
Afternoon Session: Final Examination (no lab)
Literature
Mandatory
- Text Mining in Practice with R by Ted Kwartler; Wiley & Sons Publishing
ISBN: 978-1-119-28201-3
- Two data ethics articles, assigned in class, to spur reflection for the ethics essay.
Supplementary / voluntary
None.
Mandatory readings before course start
- Read chapter 1 of Text Mining in Practice with R, entitled “What is Text Mining?”
- Please install R & RStudio on your laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. As a backup, sign up for an account at RStudio’s cloud environment https://rstudio.cloud.
Examination part
20% Ethics Paper – Due at midnight on the last course day
- 500-750 word essay with personal reflection on the ethical implications of text mining research methods
80% Final Exam – Proctored on the final day of the week
- 30 multiple-choice questions (2 pts each)
- 1 code-review section (20 pts) asking students to describe what specific code steps are doing and why
- 4 short-answer questions (5 pts each), each requiring one paragraph (2-4 sentences)
Supplementary aids
Students may bring a handwritten “index card” to the final examination period. It may be double-sided, and should be functionally equivalent to the UK standard 3in by 5in notecard. Students may put any information they deem important for the final on their notecard and use it as a supplement during the exam. Use of an exam-supporting notecard is optional.
Examination content
Topic | Example Topic
R Coding principles and basic functions | How to read in data, and data types
Steps in a machine learning or analytical project workflow | SEMMA, EDA functions, partitioning if modeling
Steps in a text mining workflow | Problem statement > unorganized state > organized state
R text mining libraries and functions | Which functions are appropriate for text uses
Text Preprocessing Steps | Why perform “cleaning” steps
Bag of Words Text Processing | What is Bag of Words?
Sentiment analysis | Lexicons, their application and implications for understanding author emotion
Document Classification | Elastic Net machine learning for document classification
Topic Extraction | Unsupervised machine learning for topic extraction – K-means, Spherical K-means, Hierarchical Clustering
Text as inputs for Machine Learning Algorithms | Classification and prediction using mixed training sets including extracted text features as independent variables
Text Mining Visuals | Word frequencies, disjoint comparisons, and other common visuals
Named Entity Recognition | Examples of named entities in large corpora
Text Sources | APIs, web scraping, OCR and other text sources
Examination relevant literature
The exam will be based on the lectures and mandatory assigned reading from Text Mining in Practice with R.
Prerequisites (knowledge of topic)
None.
The course is designed for Master’s and PhD students and practitioners in the social and policy sciences, including political science, sociology, public policy, public administration, business, and economics. It is especially suitable for MA students in these fields who have an interest in carrying out research. This course follows naturally from Andrew Bennett’s “Case Study Methods;” however, participants do not need any previous exposure to either Bayesian analysis or the qualitative methods literature.
Hardware
Laptop.
Software
None.
Learning objectives
Upon completing the course, students will be equipped with a concrete set of Bayesian-inspired best practices to deploy in their own research, as well as widely-applicable analytic skills that will help them to better evaluate and critique socio-political analysis.
Course content
The way we intuitively approach qualitative case research is similar to how we read detective novels. We consider various hypotheses to explain what occurred—whether a major tax reform in Chile, or the death of Samuel Ratchett on the Orient Express—drawing on the literature we have read (e.g. theories of policy change, or other Agatha Christie mysteries) and any salient previous experiences we have had. As we gather evidence and discover new clues, we update our beliefs about which hypothesis provides the best explanation—or we may introduce a new alternative that occurs to us along the way. Bayesianism provides a natural framework, both logically rigorous and grounded in common sense, that governs how we should revise our degree of belief in the truth of a hypothesis—e.g., “the imperative of attracting globally-mobile capital motivated policymakers to reform the tax system,” or “a lone gangster sneaked onboard the train and killed Ratchett as revenge for being swindled”—given our relevant prior knowledge and new information that we learn during our investigation. Bayesianism is enjoying a revival across many fields, and it offers a powerful tool for improving inference and analytic transparency in qualitative case-study research.
This interactive course introduces the principles of Bayesian reasoning, with applications to process-tracing, comparative case studies, and multimethod research. Participants will learn how to construct well-articulated rival hypotheses to compare, systematically assess the inferential weight of qualitative evidence, avoid common cognitive biases that can lead to sloppy reasoning, and evaluate which hypothesis provides the best explanation through Bayesian updating. The course will also address key aspects of research design, including case selection. We will further explore the potential for Bayesianism to serve as a bridge between quantitative and qualitative research. Throughout, we will conduct a wide range of exercises and group work to give participants hands-on practice applying Bayesian techniques. Upon completing the course, participants will be able to read case studies more critically, evaluate whether and to what extent the evidence presented supports the authors’ conclusions, and apply Bayesian principles in their own research.
Structure
Day 1. Introduction to Bayesian reasoning
Session A: Preview of best practices and foundations of Bayesian probability
Session B: Constructing rival hypotheses and evaluating prior odds
Because we live in a world of uncertainty, we intuitively reason in terms of probabilities. For example, we think about how likely it is to rain in the afternoon, given the weather conditions we observe in the morning. If it does rain in the afternoon, we adjust our expectations about the likelihood of rain tomorrow. But even though probabilistic thinking is common in daily life, we don’t always reason correctly about probabilities, especially when we’re making quick judgements. Our first session will begin with an introduction to the fundamentals of Bayesian probability. We will come up to speed on conditional probability and Bayes’ rule, with some fun practice problems along the way. In the second session, we will go on to talk about how to construct well-articulated rival hypotheses to compare, and how to assign prior odds, which express our initial view about which hypothesis is more plausible given the background knowledge that we bring to our research.
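The fundamentals covered in Session A can be summarized in a single formula. Bayes’ rule relates the posterior probability of a hypothesis $H$ given evidence $E$ to the prior probability and the conditional probability of the evidence:

```latex
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)},
\qquad
P(E) = P(E \mid H)\,P(H) + P(E \mid \lnot H)\,P(\lnot H)
```

Prior odds, introduced in Session B, are simply the ratio of prior probabilities $P(H_1)/P(H_2)$ for two rival hypotheses.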
Day 2. Assessing evidentiary import
Session A: Likelihood ratios
Session B: Silver Blaze exercise
We now turn to identifying evidence, which includes any salient observation about the world that helps us adjudicate between rival hypotheses, and evaluating likelihood ratios, which is how we figure out how strongly the evidence supports a hypothesis over rivals. Evaluating likelihood ratios is the key step in Bayesian analysis that tells us how to update our prior odds. Here we must ask which hypothesis makes the evidence more expected. The key is to “mentally inhabit the world” of each hypothesis. We need to think about the most plausible scenario that would lead us to observe the evidence in the world of a particular hypothesis, and then ask whether that story would be more or less plausible than the most sensible scenario we can envision in the world of the alternative hypothesis. We will consider concrete examples from research on international investment treaties signed by South Africa (Poulsen 2015, CUP) and research on social spending in Mexico (Garay 2016, CUP), along with an exercise using clues from the famous Sherlock Holmes story Silver Blaze.
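In symbols, the likelihood ratio is the factor that converts prior odds into posterior odds: for rival hypotheses $H_1$ and $H_2$ and evidence $E$,

```latex
\underbrace{\frac{P(H_1 \mid E)}{P(H_2 \mid E)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(E \mid H_1)}{P(E \mid H_2)}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{prior odds}}
```

Asking which hypothesis makes the evidence more expected is exactly asking whether this ratio is greater or less than one.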
Day 3. Log-odds updating
Session A: Weight of evidence; State building exercise
Session B: Market reform exercise
We will now introduce the log-odds form of Bayes’ rule and the weight of evidence, which is closely related to the likelihood ratio. This approach greatly simplifies Bayesian analysis, and it allows us to quantify probabilities in order to better communicate our views and aggregate the probative value of multiple pieces of evidence more systematically. We will practice this approach with group exercises that examine case studies on state-building and market reform.
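Taking logarithms of the odds form of Bayes’ rule turns multiplication into addition, which is what makes weights of evidence from successive clues aggregate so conveniently (note that the second clue is conditioned on the first):

```latex
\log \frac{P(H_1 \mid E_1, E_2)}{P(H_2 \mid E_1, E_2)}
=
\log \frac{P(H_1)}{P(H_2)}
+ \log \frac{P(E_1 \mid H_1)}{P(E_1 \mid H_2)}
+ \log \frac{P(E_2 \mid H_1, E_1)}{P(E_2 \mid H_2, E_1)}
```

Each logarithmic term after the prior log-odds is a weight of evidence; summing them gives the total support the clues lend one hypothesis over its rival.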
Day 4. Bayesian reasoning for comparative case studies
Session A: Multiple hypotheses and multiple cases
Session B: Case selection and research design
Our next step is to apply Bayesian reasoning in situations where we wish to compare more than just two rival hypotheses, and in research that involves studying more than a single case. When working with multiple rival hypotheses, we simply carry out a set of pairwise comparisons. When working with multiple cases, weights of evidence aggregate across the cases in the same way that weights of evidence add up for multiple clues pertaining to a single case. The second session will introduce an information-theoretic approach to case selection, where the goal is to choose cases that will be highly informative for developing theory and/or for comparing rival hypotheses. We will discuss a number of practical guidelines for case selection that emerge from this approach. If time allows, we will also discuss guidelines for how to proceed when evidence we discover inspires us to devise a new hypothesis.
Day 5. Bayesian reasoning in methodological perspective
Session A: Contrasting Bayesianism with alternative approaches to qualitative research
Session B: Applying Bayesian reasoning across multiple types of evidence
We will conclude by highlighting the relative advantages of Bayesianism and how it differs from frequentist statistical inference, as well as other methodologies for process tracing and qualitative research, with some fun exercises along the way. We will further discuss how Bayesianism can be applied across very different kinds of evidence with an application to the question of SARS-CoV-2 origins.
Literature
Mandatory
Various short worksheets (TBA)
For Day 1:
Tasha Fairfield & Andrew Charman (2022), Social Inquiry and Bayesian Inference, Cambridge University Press, Chap. 1 and Chap. 3, pp. 73-77.
For Day 2:
Fairfield & Charman (2022), Chap. 3, pp. 101-119.
Sir Arthur Conan Doyle, The Adventure of Silver Blaze
Tasha Fairfield and Candelaria Garay. 2017. “Redistribution under the Right in Latin America: Electoral Competition and Organized Actors in Policymaking,” Comparative Political Studies 50 (14). Read only the Mexico case, pp. 1882-1885.
For Day 3:
Fairfield & Charman (2022), Chap. 4, pp.124-136.
For Day 4:
Dan Slater, 2009. “Revolutions, Crackdowns, and Quiescence: Communal Elites and Democratic Mobilization in Southeast Asia.” American Journal of Sociology 115 (1): 203-254. Read only the Philippines and Vietnam cases.
For Day 5:
Fairfield and Charman (2019), “A Dialogue with the Data: The Bayesian Foundations of Iterative Research in Qualitative Social Science,” Perspectives on Politics 17 (1).
Supplementary / voluntary
Fairfield & Charman (2022), Chap. 3, pp. 86-101; Chap. 5, 9, 10, 13.
Mandatory readings before course start
Andrew Bennett & Jeffrey Checkel (2015), Chapter 1, in Andrew Bennett and Jeffrey Checkel, eds., Process Tracing in the Social Sciences: From Metaphor to Analytic Tool, Cambridge University Press. Read only pp. 23-31.
Examination part
Oral participation (20%)
Final Exam (80%)
Supplementary aids
Final exam will be closed-book, administered online using Qualtrics (can be accessed with any web browser)
Examination content
Final exam will cover all material discussed during the 5 days: Bayes’ rule, heuristic Bayesian process tracing, explicit Bayesian process tracing, iterative research and case selection principles.
Examination relevant literature
None.
Prerequisites (knowledge of topic)
A course in regression (e.g., GSERM Regression I) is essential. A second course in regression (e.g., GSERM Regression II) is recommended. Regression topics that are particularly important: (i) assessing and dealing with non-linearity; (ii) dummy variables (including block F-tests); (iii) standardization.
Hardware
Participants should bring laptops loaded with the software identified below.
Software
We will make primary use of the lavaan package in R but will also demonstrate the sem procedure in STATA. The following R packages should be installed on participant laptops: lavaan, haven, semTools. STATA will be available in a computer lab at the University of St. Gallen for participants who do not have it installed on their own laptops.
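Getting the setup above in place takes only a few lines of R; a minimal sketch using the package names listed in this section:

```r
# Install the R packages used in the course (run once)
install.packages(c("lavaan", "haven", "semTools"))

# Load them at the start of each session
library(lavaan)
library(haven)
library(semTools)
```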
Learning objectives
The course will provide a conceptual introduction to structural equation models, provide a thorough outline of model “fitting” and assessment, teach how to effectively program structural equation models using available software, demonstrate how to extend basic models into multiple group situations, and provide an introduction to models where common model assumptions regarding missing and non-normal data are not met.
Course content
1. Introduction to latent variable models, measurement error, path diagrams.
2. Estimation, identification, interpretation of model parameters.
3. Scaling and interpretation issues.
4. Scalar programming for structural equation models in R-lavaan and STATA.
5. Mediation models in the structural equation framework.
6. Model fit and model improvement.
7. General linear parameter constraints.
8. Multiple-group models.
9. Introduction to models for means and intercepts.
10. The FIML approach to analysis with missing data.
11. Alternative estimators for non-normal data.
Structure
Schedule may vary slightly according to class progress.
Day 1 Morning
Path models, mediation. Introduction to latent variable conceptualization. Diagrams, equations and model parameters. Moving from equations to diagrams and vice versa; listing model parameters.
Day 1 Afternoon
Introduction to computer SEM software. Computer exercises: A simple single-indicator model. A latent variable measurement model.
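To give a flavor of the Day 1 computer exercise, here is a minimal lavaan measurement model fit to lavaan's built-in HolzingerSwineford1939 example dataset; the dataset and variable names are illustrative only, not the course data:

```r
library(lavaan)

# One latent variable ("visual" ability) measured by three indicators
model <- '
  visual =~ x1 + x2 + x3
'

# Fit the measurement model to lavaan's bundled example dataset
fit <- cfa(model, data = HolzingerSwineford1939)

# Loadings, variances, and global fit indices
summary(fit, fit.measures = TRUE, standardized = TRUE)
```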
Day 2 Morning
Identification. Variances, scaling. Covariance algebra for structural equation models. Applications. Class exercises: a) identification b) covariance algebra. Equality constraints and dummy variables in SEM models.
Day 2 Afternoon
Computer exercises (R, STATA): A latent variable measurement model with covariates. Model diagnostics, fit improvement approaches. Mediation with manifest and latent variables.
Day 3 Morning
Nested models, Wald and LM tests, mixing single- and multiple-indicator measurement models. Fit functions. Estimation. Dealing with estimation problems, including negative variance estimates and non-convergence.
Day 3 Afternoon
Computer exercise (R/lavaan): SEM model with multiple latent variables, single-indicator and multiple-indicator covariates. Improving model fit, assessing diagnostics. Non-standard models. Multiple group models: conceptual introduction.
Day 4 Morning
Multiple Group Models. Measurement equation equivalence across groups (tests, assessment). Construct equation equivalence. Software applications, formal versus substantive comparisons. Reporting SEM model results. Computer exercise (R): a multiple-group model.
Day 4 Afternoon
Computer exercise: multiple-group models in STATA. Alternative estimators and scaled variance estimators: dealing with missing data and non-normal data. Item parcels (pro and con).
Day 5 Morning
Computer exercise (R/lavaan) for datasets with missing and/or non-normal data. An introduction to models for means and intercepts.
Day 5 Afternoon
Computer exercises (R/lavaan and STATA): a model for means and intercepts.
Literature
Mandatory
Nine PDF files will be made available to participants as reading materials for this course, titled Notes(Section1) through Notes(Section9).
Supplementary / voluntary
Randall Schumacker and Richard Lomax, A Beginner’s Guide to Structural Equation Modeling. 4th edition (Routledge, 2016). This reading is helpful but not essential. Earlier versions of this text can be used.
Mandatory readings before course start
There are no mandatory pre-course readings. Participants are encouraged to read through Section 1 of the course notes in advance of the class, but may instead read it while the class is in progress.
Examination part
Two computer exercises, 20% each: 40%.
First exercise is due Thursday during the course. Second exercise is due Monday immediately following the course.
One major exercise: 60%.
This exercise will consist of a series of 5-7 questions requiring essay-style responses (approx. 8-14 pp. total). Some questions will involve the interpretation of computer output listings, while other questions will deal with conceptual issues discussed in the course. The exercise is due within 2 weeks of the end of the course.
Supplementary aids
For the computer exercises, the following materials will be helpful: a) lab exercise materials and descriptions, b) an abbreviated software user manual/guide (one available for each of STATA and lavaan), and c) the PDF course text files. For the major project, the PDF course files will be very helpful.
Examination content
For the final exercise, students will need to understand the following subject matter:
1. Converting equations to path diagrams and vice versa.
2. Principles of mediation assessment: total, direct and indirect effects in structural equation path models
3. Determining whether a model is identified or not
4. Dealing with estimation difficulties
5. Interpreting model parameters in the metric of the manifest variables
6. Interpreting standardized model parameters
7. Determining whether the fit of a model is acceptable
8. Hypothesis testing: simultaneous tests for b=0; tests for equality
9. Interpreting models with parameter constraints
10. Testing measurement model equivalence in multiple-group models
11. Testing construct equation equivalence in multiple group models; assessing individual parameters and groups of parameters for cross-group differences
12. Dummy exogenous variables in structural equation models
13. Approaches to missing data in SEM models.
14. Dealing with non-normal data: ADF, DWLS estimators, Bentler-Satorra and other variance adjustment approaches.
Examination relevant literature
For the major assignment exercise, students should have access to the course powerpoint slide materials and the course text PDF files.
Qualitative Research Methods and Data Analysis presents strategies for analyzing and making sense of qualitative data. Both descriptive and interpretive qualitative studies will be discussed, as will more defined qualitative approaches such as grounded theory, narrative analysis, and case studies. The course will briefly cover research design and data collection strategies but will largely focus on analysis. In particular, we will consider how researchers develop codes and integrate memo writing into a larger analytic process. The purpose of coding is to provide a focus to qualitative analysis; it is critical to have a handle on coding practices as you move deeper into analysis. The course will present coding and memo writing as concurrent tasks that occur during an active review of interviews, documents, focus groups, and/or multi‑media data. We will discuss deductive and inductive coding and how a codebook evolves, that is, how codes might “emerge” and shift during analysis. Managing codes includes developing code hierarchies, identifying code “constellations,” and building multidimensional themes.
The class will present memo writing as a strategy for capturing analytical thinking, inscribed meaning, and cumulative evidence for condensed meanings. Memos can also resemble early writing for reports, articles, chapters, and other forms of presentation. Researchers can also mine memos for codes and use memos to build evocative themes and theory. Coding and memo writing are discussed in the context of data-driven qualitative research beginning with design and moving toward presentation of findings. The course will also discuss using visual tools in analysis, such as diagramming core quotations from data to holistically present the participant’s key narratives. Visual tools can also assist in looking horizontally across many transcripts to identify connective themes and link the parts to the whole.
Software
We will spend one day learning a qualitative analysis software package:
GSERM St. Gallen: ATLAS.ti
GSERM Ljubljana: NVivo
If the course is held in a remote format, we will work with MAXQDA.
The methods discussed in the course will be applicable to qualitative studies in a range of fields, including the behavioral sciences, social sciences, health sciences, communications, and business.
Structure
Day 1
- Core Principles and Practices in Qualitative Data Inquiry
- Qualitative Research Design: An Overview
- Data types
- Comparative strategies
- Qualitative sampling
- Triangulation
- Analysis Task 1: Memo Writing
- Document summary memos
- Key-quote memos
- Methods memos
Day 2
- Analysis Task 2: Using Visual Tools
- Episode profiles
- Making sense of data using diagrams
- Working with core quotations
- Analysis Task 3: Coding Qualitative Data
- Descriptive coding
- Interpretive coding
- Strategies for coding
- Line‑by‑line coding
- Creating a codebook
Day 3
- Introduction to Qualitative Software: MAXQDA (see information at “Software”)
a. Overview
b. Beginning a project
c. Writing comments and memos
d. Coding data
- Hands‑on Exercises Using MAXQDA
- Analysis in MAXQDA
- Exploring codes and memos in queries
- Matrices and diagrams
- Blending quantitative and qualitative data
Day 4
- Methodological Traditions
a. Grounded theory
b. Narrative analysis
c. Case study
d. Pragmatic qualitative analysis
Day 5
- Qualitative Research Design: Revisiting Strategies
- Data Collection Considerations
- Data types:
•Interviews
•Focus groups
•Other types of data
- Developing interviewing skills
- Evaluating qualitative articles
- Class discussion
Suggested Reading (Articles)
Electronic version of these articles will be provided to registered participants:
Ahlsen, Birgitte, et al. 2013. “(Un)doing Gender in a Rehabilitation Context: A Narrative Analysis of Gender and Self in Stories of Chronic Muscle Pain.” Disability and Rehabilitation 1‑8.
Charmaz, Kathy. 1999. “Stories of Suffering: Subjective Tales and Research Narratives.” Qualitative Health Research 9:362‑82.
Sandelowski, Margarete. 2000. “Whatever Happened to Qualitative Description?” Research in Nursing and Health 23:334‑40.
Rouch, Gareth, et al. 2010. “Public, Private and Personal: Qualitative Research on Policymakers’ Opinions on Smokefree Interventions to Protect Children in ‘Private’ Spaces.” BMC Public Health 10:797‑807.
Suggested Reading (Books)
Charmaz, Kathy. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. Sage.
Marshall, Catherine, and Gretchen B. Rossman. 2006. Designing Qualitative Research. 4th ed. Sage.
Yin, Robert. 2013. Case Study Research: Design and Methods. Sage.
Examination
Participants will be asked to read several interviews or journal entries and generate a preliminary analysis of the data using techniques discussed during the course. This examination will be due three weeks after the course ends.
Examination content
Students will have to demonstrate familiarity with the differences between grounded theory, narrative analysis, case study, and pragmatic analysis. The assignment will require them to choose one of these approaches to design a study and analyze several documents provided by the instructor. Their preliminary analysis will include memos, a codebook, diagrams, early findings, and reflection on next steps.
Prerequisites (knowledge of topic)
Participants should have a basic working knowledge of the principles and practice of multiple regression and elementary statistical inference. Because this is a second course, participants should either be familiar with the contents of the first edition of Introduction to Mediation, Moderation, and Conditional Process Analysis and the statistical procedures discussed therein or should have taken the first course through GSERM or elsewhere. Participants should also have experience using syntax in SPSS, SAS, or R, and it is assumed that participants will already have some experience using the PROCESS macro. No knowledge of matrix algebra is required or assumed, nor is matrix algebra ever used in the course.
Hardware
Students are strongly encouraged to bring their own laptops (Mac or Windows)
Software
Laptops need a recent version of SPSS Statistics (version 19 or later), SAS (release 9.2 or later) or R (3.6 or later) installed. SPSS users should ensure their installed copy is patched to its latest release. SAS users should ensure that the IML product is part of the installation. PROCESS for R has not yet been publicly released. STATA users can benefit from the course content, but PROCESS makes these analyses much easier and is not available for STATA.
Learning objectives
- Apply and report on tests of moderated mediation using the index of moderated mediation
- Identify models for which partial and conditional moderated mediation are appropriate.
- Apply and report mediation analysis with multicategorical independent variables.
- Test and probe an interaction involving a multicategorical independent variable or moderator.
- Apply and report tests of moderated mediation involving a multicategorical independent variable.
- Generalize the index of moderated mediation to models with serial mediation
- Estimate and conduct inference in mediation, moderation, and moderated mediation contexts for two-instance repeated-measures designs.
- Generate and specify custom models in PROCESS
Course content
Statistical mediation and moderation analyses are among the most widely used data analysis techniques. Mediation analysis is used to test various intervening mechanisms by which causal effects operate. Moderation analysis is used to examine and explore questions about the contingencies or conditions of an effect, also called “interaction.” Conditional process analysis is the integration of mediation and moderation analysis, used when one seeks to understand the conditional nature of processes (i.e., “moderated mediation”).
In Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression‑Based Approach (www.guilford.com/p/hayes3) Dr. Andrew Hayes describes the fundamentals of mediation, moderation, and conditional process analysis using ordinary least squares regression. He also explains how to use PROCESS, a freely‑available and handy tool he invented that brings modern approaches to mediation and moderation analysis within convenient reach.
This seminar, a second course, picks up where the first edition of the book and the first course offered by GSERM leave off. After a review of basic principles, it covers material in the second and third editions of the book as well as new material, including recently published methodological research.
Topics covered include:
- Review of the fundamentals of mediation, moderation, and conditional process analysis.
- Testing whether an indirect effect is moderated and probing moderation of indirect effects.
- Partial and conditional moderated mediation.
- Mediation analysis with a multicategorical independent variable.
- Moderation analysis with a multicategorical (3 or more groups) independent variable or moderator.
- Conditional process analysis with a multicategorical independent variable
- Moderation of indirect effects in the serial mediation model.
- Mediation, Moderation, and Conditional Process Analysis in Two-Instance Repeated-Measures Designs
- Advanced uses of PROCESS, such as how to modify a numbered model or customize your own model.
We focus primarily on research designs that are experimental or cross‑sectional in nature with continuous outcomes. We do not cover complex models involving dichotomous outcomes, latent variables, nested data (i.e., multilevel models), or the use of structural equation modeling.
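To make the starting point concrete, the basic single-mediator model from the first course reduces to two OLS regressions plus a bootstrap for the indirect effect. Below is a minimal base-R sketch with simulated data; the variable names X, M, and Y are generic, and this hand-rolled bootstrap is only an illustration of the computation that PROCESS automates:

```r
set.seed(1)
n <- 500
X <- rnorm(n)                         # independent variable
M <- 0.5 * X + rnorm(n)               # mediator (path a = 0.5)
Y <- 0.4 * M + 0.2 * X + rnorm(n)     # outcome  (path b = 0.4, direct effect c' = 0.2)
d <- data.frame(X, M, Y)

# Indirect effect = a * b from two OLS regressions
ab <- function(dat) {
  a <- coef(lm(M ~ X, data = dat))["X"]
  b <- coef(lm(Y ~ M + X, data = dat))["M"]
  unname(a * b)
}

# Percentile bootstrap confidence interval for the indirect effect
boot <- replicate(2000, ab(d[sample(n, replace = TRUE), ]))
c(estimate = ab(d), quantile(boot, c(.025, .975)))
```

With the simulation values above, the true indirect effect is 0.5 * 0.4 = 0.2.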
Structure
Day 1:
- Review of fundamentals
- Testing whether an indirect effect is moderated
- Estimating conditional indirect effects
Day 2:
- Representing multicategorical predictors
- Mediation analysis with a multicategorical independent variable
- Estimating moderation models with a multicategorical independent variable
Day 3:
- Probing moderation models with a multicategorical independent variable
- Estimating and probing moderation models with a multicategorical moderator
- Estimating conditional process models with a multicategorical independent variable
Day 4:
- Inference and probing of conditional process models with a multicategorical independent variable
- Conditional process analysis involving serial mediation
- Custom models in PROCESS
Day 5:
- Mediation analysis in two-instance repeated-measures designs
- Moderation analysis in two-instance repeated-measures designs
- Conditional process analysis in two-instance repeated-measures designs
Literature
Introduction to Mediation, Moderation, and Conditional Process Analysis (3rd edition)
Examination part
Homework delivered during week of the course (4 assignments, 60%)
Homework delivered after the course (1 assignment, 40%)
Supplementary aids
Open Book
Examination content
Homework 1 (Due Tuesday Morning):
- Review of fundamentals
- Testing whether an indirect effect is moderated
- Estimating conditional indirect effects
Homework 2 (Due Wednesday Morning):
- Representing multicategorical predictors
- Mediation analysis with a multicategorical independent variable
- Estimating moderation models with a multicategorical independent variable
Homework 3 (Due Thursday Morning):
- Probing moderation models with a multicategorical independent variable
- Estimating and probing moderation models with a multicategorical moderator
- Estimating conditional process models with a multicategorical independent variable
Homework 4 (Due Friday Morning):
- Inference and probing of conditional process models with a multicategorical independent variable
- Conditional process analysis involving serial mediation
- Custom models in PROCESS
Homework 5 (Due within 2 weeks of end of course):
- All content listed above, plus:
- Mediation analysis in two-instance repeated-measures designs
- Moderation analysis in two-instance repeated-measures designs
- Conditional process analysis in two-instance repeated-measures designs
Examination relevant literature
Hayes, Andrew F., Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression‑Based Approach (www.guilford.com/p/hayes3).
Prerequisites (knowledge of topic)
Students should be interested in spatial topics such as real estate markets, urban economics, crime, pollution, spatial distribution of political preferences, and trade flows. We assume that students are familiar with matrix algebra, and have had courses in probability theory and econometrics. The course emphasizes programming and empirical application. The empirical implementation of spatial models is done in R, hence some familiarity in R is useful but not required for the course. The course is open to students from the PiF/PEF and other external PhD programs.
Learning objectives
The goal of this course is to provide students with the main tools for analyzing and visualizing spatial data. Students will learn how to estimate and interpret a range of spatial models and how to program their own models in R.
Course content
This course focuses on the visualization and modeling of spatial data. Examples are taken from different research areas such as political science, empirical international trade, criminology, and real estate. It offers a detailed explanation of individual estimation methods and their implementation in R. In this course, students will learn:
• How to generate a variety of different maps that visualize the location of spatial units
• How maximum likelihood estimation works and how to set up and optimize a likelihood function in R
• How to deal with computational problems that are frequently encountered when working with spatial data
• How to increase computation speed using concentrated maximum likelihood and the matrix exponential spatial specification model
• How to estimate a spatial regression model with both cross‑sectional and time‑series data
• How to properly interpret the output from a spatial regression model and how to investigate policy interventions.
• A basic background on spatial interaction models, heterogeneous coefficient SAR models, and spatio‑temporal models
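As a flavor of the likelihood material, the following is a minimal sketch of setting up and numerically optimizing a (non-spatial) normal log-likelihood in R with optim(); the spatial models covered in the course add a log-determinant term to exactly this kind of function. The data are simulated purely for illustration:

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # true parameters: beta = (1, 2), sigma = 0.5

# Negative log-likelihood of the linear model y = b0 + b1*x + e, e ~ N(0, s^2)
negll <- function(par) {
  b0 <- par[1]; b1 <- par[2]; s <- exp(par[3])  # exp() keeps sigma positive
  -sum(dnorm(y, mean = b0 + b1 * x, sd = s, log = TRUE))
}

# Numerically minimize the negative log-likelihood
fit <- optim(c(0, 0, 0), negll, method = "BFGS")
est <- c(b0 = fit$par[1], b1 = fit$par[2], sigma = exp(fit$par[3]))
round(est, 3)   # estimates should be close to (1, 2, 0.5)
```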
What students do NOT learn in this course:
• Estimation of spatial regression models with other estimation techniques such as IV, NLS, and Bayesian Methods
• The use of a specialized Geographic Information System such as ArcGIS
Structure
Monday
Lecture 1: 09:15 ‑ 12:00
R Tutorial 1: 13:00 ‑ 15:00
Tuesday
Lecture 2: 09:15 ‑ 12:00
R Tutorial 2: 13:00 ‑ 15:00
Wednesday
Lecture 3: 09:15 ‑ 12:00
R Tutorial 3: 13:00 ‑ 15:00
Thursday
Lecture 4: 09:15 ‑ 12:00
R Tutorial 4: 13:00 ‑ 15:00
Friday
Lecture 5: 09:15 ‑ 12:00
R Tutorial 5: 13:00 ‑ 15:00
Times and room information in the timetable apply.
Literature
Mandatory
LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”, CRC Press.
Supplementary / voluntary
Elhorst, J.P. (2014), “Spatial Econometrics: From Cross‑Sectional Data to Spatial Panels”, Springer.
Holly, S., M.H. Pesaran, and T. Yamagata (2011), “The Spatial and Temporal Diffusion of House Prices in the UK”, Journal of Urban Economics 69, 2‑23.
LeSage, J. (2014), “What Regional Scientists Need to Know about Spatial Econometrics”, The Review of Regional Studies 44, 13‑32.
Examination part
Examination paper written at home (100%)
Remark: Paper Replication or own research idea.
Examination content
• SAR model, SDM model, CML, MESS, Spatial Interaction model, Spatial Panel model, HSAR model
• Implementing maximum likelihood estimation in R: Full Maximum Likelihood, Concentrated Maximum Likelihood, Matrix Exponential Spatial Specification.
Examination relevant literature
• LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”. CRC Press, Chapter 1, 2, 3, 4, 8, and 9.
• LeSage, J., and Y.-Y. Chih (2016), “Interpreting Heterogeneous Coefficient Spatial Autoregressive Panel Models”, Economics Letters 142, 1–5.
Prerequisites (knowledge of topic)
– A graduate statistics course at an introductory level;
– Some knowledge of regression analysis is desirable but not expected of participants.
Hardware
Your own laptop with R on it.
Software
R (version 4.2).
Learning objectives
– To become familiar with the basics as well as some intermediate to advanced elements of multilevel and longitudinal modeling;
– To acquire a nuanced understanding and skills for assessing the need for multilevel modeling of data from an empirical study;
– To be able at the end of the course to carry out multilevel and longitudinal modeling with R;
– To be in a position to interpret statistically and substantively the results from a multilevel and/or longitudinal modeling session with R;
– To be in a position to report the findings from such analyses in a publishable document.
Course content
This intermediate level 5-day course presents the highly popular methodology of multilevel modeling (MLM) across the social, behavioral, educational, life, biomedical, marketing, and business sciences. In addition, it discusses its specific applications in the analysis of longitudinal data that are currently very frequently collected in these and cognate disciplines.
MLM provides a widely applicable approach to modeling and accounting for clustering effects that impact the majority of contemporary empirical studies in these sciences. A key feature of these effects is the associated relationship among and similarity of the observed scores collected from members of studied groups or clusters of units of analysis (usually persons, respondents, patients, employees, students, clients, etc., but could also be higher-order aggregates of them). If this similarity is not properly handled, as when using standard analysis methods, incorrect statistical results ensue, followed by substantive conclusions that can be seriously misleading. A key achievement of MLM is the proper handling of the clustering effects, yielding valid and dependable statistical and substantive results.
A particular field of application of MLM is that of longitudinal data analysis and modeling. Designs and studies producing such data are very frequently utilized in the social, behavioral, medical, life, and business as well as related sciences. Through appropriate use of MLM, whose detailed coverage is also part of this course, data resulting from such studies involving repeated measures can be analyzed and modeled applying relevant statistical procedures that leads to valid and dependable results entailing correct substantive conclusions.
Structure
Day 1:
1. Resources for course and what it is about
– Literature,
– Software,
– What this course is about,
– Why use clustered (nested) settings, studies, and data.
2. A brief introduction to R
– What is R?
– Why use R?
– R installation, packages, and libraries,
– R packages needed in this course
– Reading data into R.
3. Fitting single-level regression models using R
– Some important nomenclature
– The meaning of the intercept and slope parameters
– Estimation of model parameters
– How good is the model?
– Multiple regression
– The generalized linear model (GLIM)
– Logistic regression as an important GLIM.
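The Day 1 single-level models can be sketched in a few lines of base R; the data below are simulated purely for illustration:

```r
set.seed(123)
n <- 300
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)            # continuous outcome
z <- rbinom(n, 1, plogis(-0.5 + x))    # binary outcome

# Simple linear regression: intercept and slope, fit quality via R-squared
summary(lm(y ~ x))

# Logistic regression as a GLIM: binomial family with logit link
summary(glm(z ~ x, family = binomial(link = "logit")))
```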
Day 2:
4. Why do we need multilevel models?
– What is multilevel modeling (MLM), why can’t we do without it, and how come aggregation and disaggregation do not do the job?
• Examples of nested data and the hallmark of multilevel modeling
• Another important instance of multilevel modeling
• Aggregation and disaggregation of scores
• Analytic benefits of multilevel modeling
– The beginnings of multilevel modeling – why what we already know about regression analysis will be so useful
• Multilevel models as sets of regression equations
• An insightful look at key model parameters
• Graphing of hierarchical (multilevel) data.
– Appendix: Restricted maximum likelihood (REML) as a widely applicable multilevel model estimation method.
5. The intra-class correlation coefficient (ICC)
– The fully unconditional two-level model, assumptions, and
definition of the ICC
– The meanings of the ICC
– Estimation of the ICC using R
– Appendices – MLM glossary and design effect.
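The ICC computation just outlined can be sketched in R; this minimal example fits the fully unconditional two-level model with the lme4 package (one standard choice; the course's own package may differ) to simulated data:

```r
library(lme4)
set.seed(7)

# Simulate two-level data: 50 clusters of 20 units,
# between-cluster sd = 1, within-cluster sd = 2
g <- rep(1:50, each = 20)
u <- rnorm(50, sd = 1)[g]             # cluster-level random effects
y <- 10 + u + rnorm(1000, sd = 2)
d <- data.frame(y, g = factor(g))

# Fully unconditional (null) two-level model: y_ij = gamma00 + u_j + e_ij
m0 <- lmer(y ~ 1 + (1 | g), data = d)

# ICC = between-cluster variance / (between + within variance)
vc  <- as.data.frame(VarCorr(m0))
icc <- vc$vcov[1] / sum(vc$vcov)
round(icc, 3)   # simulation truth: 1 / (1 + 4) = 0.2
```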
Day 3:
6. How many levels? – Proportion third and second level variance
– The fully unconditional three-level model
– Proportion third level variance (PTLV)
– Proportion second level variance (PSLV) – has a level been omitted in past analyses?
– Evaluation of PTLV and PSLV using R.
7. Random intercept models
– Introduction to multilevel modeling with covariates
– Statistical underpinnings of random intercept models
– Fitting a random intercept model with R
– R-squares for multilevel models
– Conditional three-level models.
Day 4:
8. Robust modeling of lower-level variable relationships in the presence of clustering effects
– Introduction to robust modeling with clustering effects,
– Robust modeling of empirical data from a multilevel study.
9. Mixed effect models
– What are mixed models, what are they made of, and why are they so useful?
• An illustration of the difference between fixed and random effects
• Examples of mixed modeling frameworks
– Random regression models (RRMs)
• Why we need RRMs
• Standard regression as a mixed model
• An RRM as a mixed model
• Multiple random slopes
– Numerical issues and how to resolve them.
Day 5:
10. Multilevel models with discrete responses
– Introduction – why bother?
– An important statistical fact – a refresher
– Random intercept models with discrete outcomes
– Random regression models with discrete outcomes
– Model choice with discrete outcomes.
11. Longitudinal multilevel modeling
– Introduction – why do we need longitudinal modeling?
– The need for modeling individual temporal development
– Multilevel modeling of repeated measures data – unconditional and conditional longitudinal analysis
– Using R to fit unconditional and conditional growth curve models.
12. Outlook and conclusion.
Literature
Recommended Overviews
– Snijders, T. A. B., & Bosker, R. J. (2013). Multilevel analysis. An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.
– Finch, W. H., Bolin, J. E., & Kelley, K. (2019). Multilevel modeling using R. Boca Raton, FL: CRC Press (Taylor & Francis).
Supplementary / voluntary readings
– Raykov, T. & Marcoulides, G. A. (2012). Basic statistics. An introduction with R. New York, NY: Rowman & Littlefield.
Additional background readings
– Rabe-Hesketh, S., & Skrondal, A. (2021). Multilevel and longitudinal modeling. College Station, TX: Stata Press.
Examination part
Take-home assignment, to be submitted within 3 weeks of course completion.
Participants may use any literature they can find, including the lecture-notes volume that will be provided to them in PDF form before the course commences.
Supplementary aids
Course participants are allowed to use any literature they can access, incl. the lecture notes.
Examination content
Multilevel modeling with clustering effects. Robust modeling of multiple dependent variables in the presence of nesting effects. Evaluation of unique significance of predictor effects.
Examination relevant literature
Finch, W. H., Bolin, J. E., & Kelley, K. (2019). Multilevel modeling using R. Boca Raton, FL: CRC Press (Taylor & Francis).
Prerequisites (knowledge of topic)
There is no prerequisite knowledge. However, some familiarity with experimental design (e.g., through the GSERM course “Experimental Methods for Behavioral Science”) is very useful, as the course delves deep into practical aspects of conducting experimental research online.
Hardware
Students will complete course work on their own laptop computers.
Software
The course will rely on several online research tools, including Amazon Mechanical Turk, Prolific, CloudResearch, Qualtrics, and more. Accounts are typically free to open (at least in trial version), though some might require depositing a credit card number.
Learning objectives
– conducting online “lab” surveys and experiments on several alternative platforms
– using advanced features to conduct richer and innovative “lab” studies
– conducting “field” studies on social media platforms, search engines, and beyond
– improving research designs by scraping web data
– making online research more publishable, reproducible, and replicable
Course content
The Internet is revolutionizing how empirical research is conducted across the social sciences. Without the need for intermediaries, individual researchers can now conduct large-scale experiments on human participants, longitudinal surveys of rare populations, A/B tests on social media, and more. In this course, you will learn how to harness these opportunities while avoiding the many pitfalls of online research. The course is tailored for researchers in psychology, economics, business, and any other area of academia or industry who investigate human behavior.
We will cover the nuts and bolts of conducting “lab” experiments on alternative Internet platforms, including techniques to maximize the validity and reproducibility of research findings. We will also discuss how to unlock the potential of the Internet for more elaborate, richer designs (e.g., longitudinal, interactive) that go beyond simple survey experiments. Additionally, we will teach you how to scrape publicly available information and to conduct “field” experiments on social media, gathering real-world, immediately applicable insights about consumers, workers, and Internet users more generally. Importantly, technical and practical insights will explicitly serve the goal to improve the rigor and the publishability of participants’ own research. To this end, we will include discussions on whether and how to combine online and offline investigations, how to preregister and report online research in a paper, and more.
The course relies on a mix of discussions, demonstrations, and exercises that use participants’ own research needs and projects as starting points. At the end of the week, participants will be fully equipped to design, execute, and report valid online research for their own investigations.
Structure
Each class day includes lecturing, applications, and discussions. The morning session is primarily, though not exclusively, devoted to introducing new concepts and techniques in online behavioral research. Afternoon sessions are mostly devoted to putting this content into practice, with students designing, setting up, and discussing online research applications.
Day 1: Conducting virtual “lab” studies
How is online research different?
Data collection tutorials:
MTurk
CloudResearch
Prolific
(and more)
Choosing between platforms
Day 2: Data quality in the virtual lab
Attrition and sample selection issues
Study impostors
Deception
Inattention, miscomprehension, insufficient effort
Participant experience and nonnaiveté
Ethical experimentation online
Preregistration for online research
Ensuring reproducibility in online “lab” research
Day 3: Advanced “lab” designs
Studying rare populations
Cross-cultural studies
Longitudinal studies
Incentivizing participants
Participant interaction
Beyond surveys and self-reports
Reporting online experiments
Day 4: Running digital quasi-experiments
What is digital quasi-experimentation?
How is it different from other forms of (online) behavioral research?
Running A/B tests on social media platforms
Running A/B tests on search engines
Challenges of using platforms and search engines for (academic) research
Navigating validity trade-offs in digital quasi-experimentation
Day 5: Scaling and automating online behavioral research
A primer on web scraping and APIs for online behavioral research
Collecting web data at scale for experimental stimuli
Overview of use cases of web scraping and APIs for online behavioral research
Automating online behavioral research
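To give a flavor of the scraping material outlined above: the course does not prescribe a language or library, but the mechanics of pulling structured data out of a web page can be sketched with Python’s standard-library HTML parser. The HTML snippet below is hypothetical, standing in for a page that would in practice be fetched with an HTTP client or an API call (and always in line with the site’s terms of service).

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for fetched HTML;
# in a real study this would come from an HTTP request or an API.
PAGE = """
<ul>
  <li class="item">Alpha</li>
  <li class="item">Beta</li>
  <li class="item">Gamma</li>
</ul>
"""

class ItemScraper(HTMLParser):
    """Collect the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

scraper = ItemScraper()
scraper.feed(PAGE)
print(scraper.items)  # → ['Alpha', 'Beta', 'Gamma']
```

The same extraction logic carries over to dedicated scraping libraries; only the parsing layer changes.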
Literature
There is no textbook for the course, and there are no mandatory readings. Below we recommend some overviews of the general topics that we will address in the course, and list a number of additional background readings that zoom into selected aspects of online behavioral research. Note that the course will also cover material which is not presented in any reading.
Recommended Overviews
Boegershausen, J., Datta, H., Borah, A., & Stephen, A. T. (2022). Fields of Gold: Scraping Web Data for Marketing Insights. Journal of Marketing, 86(5), 1-20.
Hauser, D., Paolacci, G., Chandler, J. (2019). Common Concerns with MTurk as a Participant Pool: Evidence and Solutions. In Handbook of Research Methods in Consumer Psychology, ed. F. R. Kardes, P. M. Herr, and N. Schwarz, Routledge.
Orazi, D. C., & Johnston, A. C. (2020). Running field experiments using Facebook split test. Journal of Business Research, 118, 189-198.
Stewart, N., Chandler, J., Paolacci, G. (2017). Crowdsourcing Samples in Cognitive Science. Trends in Cognitive Sciences, 21(10), 736-748.
Additional Background Readings
Arechar, A. A., Gächter, S., & Molleman, L. (2018). Conducting interactive experiments online. Experimental Economics, 21(1), 99–131.
Casey, L. S., Chandler, J., Levine, A. S., Proctor, A., & Strolovitch, D. Z. (2017). Intertemporal differences among MTurk workers. SAGE Open, 7(2).
Chandler, J., Paolacci, G. (2017). Lie for a Dime: When Most Prescreening Responses Are Honest but Most Study Participants Are Impostors. Social Psychological and Personality Science, 8(5), 500-508.
Chandler, J., Paolacci, G., Hauser, D. (2020). Data Quality Issues on MTurk. In Conducting Online Research on Amazon Mechanical Turk and Beyond, ed. L. Litman, Sage.
Coppock, A. (2018). Generalizing from survey experiments conducted on Mechanical Turk: A replication approach. Political Science Research and Methods, 1–16.
Chandler, J., Mueller, P., Paolacci, G. (2014). Nonnaïveté Among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers. Behavior Research Methods, 46(1), 112-130.
Chandler, J., Sisso, I., & Shapiro, D. (2020). Participant carelessness and fraud: Consequences for clinical research and potential solutions. Journal of Abnormal Psychology, 129(1), 49–55.
Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19.
Eckles, D., Gordon, B. R., & Johnson, G. A. (2018). Field Studies of Psychologically Targeted Ads Face Threats to Internal Validity. Proceedings of the National Academy of Sciences, 115(23), E5254.
Goldfarb, A., Tucker, C., & Wang, Y. (2022). Conducting research in marketing with quasi-experiments. Journal of Marketing, 86(3), 1-20.
Goodman, J. K., & Paolacci, G. (2017). Crowdsourcing Consumer Research. Journal of Consumer Research, 44(1), 196-210.
Litman, L., Robinson, J., & Rosenzweig, C. (2015). The relationship between motivation, monetary compensation, and data quality among US- and India-based workers on Mechanical Turk. Behavior Research Methods, 47(2), 519–528.
Molnar, A. (2019). SMARTRIQS: A Simple Method Allowing Real-Time Respondent Interaction in Qualtrics Surveys. Journal of Behavioral and Experimental Finance, 22, 161-169.
Morales, A. C., Amir, O., & Lee, L. (2017). Keeping it real in experimental research—Understanding when, where, and how to enhance realism and measure consumer behavior. Journal of Consumer Research, 44(2), 465-476.
Moss, A. J., Rosenzweig, C., Robinson, J., & Litman, L. (2020). Is it ethical to use Mechanical Turk for behavioral research? Relevant data from a representative survey of MTurk participants and wages. PsyArXiv.
Paolacci, G., Chandler, J., Ipeirotis, P. G. (2010). Running Experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411-419.
Paolacci, G., Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a participant pool. Current Directions in Psychological Science, 23(3), 184-188.
Peer, E., Rothschild, D., Gordon, A., Evernden, Z., Damer, E. (2022). Data quality of platforms and panels for online behavioral research. Behavior Research Methods, 54, 1643-1662.
Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123-1128.
Weinberg, J., Freese, J., & McElhattan, D. (2014). Comparing data characteristics and results of an online factorial survey between a population-based and a crowdsource-recruited sample. Sociological Science, 1, 292–310.
Woike, J. K. (2019). Upon repeated reflection: Consequences of frequent exposure to the cognitive reflection test for Mechanical Turk participants. Frontiers in Psychology, 10.
Zallot, C., Paolacci, G., Chandler, J., Sisso, I. (2022). Crowdsourcing in observational and experimental research. Handbook of Computational Social Science. Volume 2 Data Science, Statistical Modelling, and Machine Learning Methods, eds. U. Engel, A. Quan-Haase, S. Xun Liu, & L.E. Lyberg, Routledge.
Zhou, H., & Fishbach, A. (2016). The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. Journal of Personality and Social Psychology, 111(4), 493–504.
Examination part
Performance will be evaluated with individual assignments during the course. In these assignments, students will put the course content into practice. For example, they will preregister an online research design to test a hypothesis, detailing all their methodological choices. Though class participation is not graded, regular attendance and active participation in class discussions are expected.
Supplementary aids
The assignments are “open book”. Lecture slides and notes are recommended, and all background readings are additional recommended sources.
Examination content
Lecture slides and notes.
Examination relevant literature
Lecture slides and notes.
Prerequisites (knowledge of topic)
Mathematics: Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Familiarity with discrete and continuous univariate probability distributions will be helpful. Statistics: Students should have completed Ph.D.-level courses in introductory statistics and linear regression models, up to the level of GSERM’s Regression II.
Hardware
Students will complete course work on their own laptop computers. Microsoft Windows, Apple macOS, and Linux variants are all supported; please contact the instructor to ascertain the viability of other operating systems for course work.
Software
Basic proficiency with at least one statistical software package/language is not required but is highly recommended. Preferred software packages include the R statistical computing language and Stata. Course content will be presented using R; computer code for all course materials (analyses, graphics, course slides, examples, exercises) will be made available to students. Students choosing to use R are encouraged to arrive at class with current versions of both R (https://www.r-project.org) and RStudio (https://www.rstudio.com) on their laptops.
Course content
This course builds directly upon the foundations laid in Regression II, with a focus on successfully applying linear and generalized linear regression models. After a brief review of the linear regression model, the course addresses a series of practical issues in the application of such models: presentation and discussion of results (including tabular, graphical, and textual modes of presentation); fitting, presentation, and interpretation of two- and three-way multiplicative interaction terms; model specification for dealing with nonlinearities in covariate effects; and post-estimation diagnostics, including specification and sensitivity testing. The course then moves to a discussion of generalized linear models, including logistic, probit, and Poisson regression, as well as textual, tabular, and graphical methods for presentation and discussion of such models. The course concludes with a “participants’ choice” session, where we will discuss specific issues and concerns raised by students’ own research projects and agendas.
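The interpretation step mentioned above — turning GLM estimates into quantities a reader can understand — can be sketched briefly. The course presents its material in R (via glm() with a binomial family); the stdlib-Python sketch below uses hypothetical fitted logit coefficients to show the two standard moves: exponentiating a coefficient into an odds ratio, and comparing predicted probabilities at chosen covariate values.

```python
import math

# Hypothetical fitted logit coefficients (intercept and one covariate);
# in the course these would come from R's glm(..., family = binomial).
beta0, beta1 = -1.5, 0.8

def predicted_prob(x):
    """Inverse-logit transform: P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Odds ratio for a one-unit increase in x: exp(beta1).
odds_ratio = math.exp(beta1)
print(round(odds_ratio, 3))        # → 2.226

# Substantive interpretation: predicted probabilities at x = 0 and x = 1.
p0, p1 = predicted_prob(0), predicted_prob(1)
print(round(p0, 3), round(p1, 3))  # → 0.182 0.332
```

Note that the probability difference depends on where you evaluate it — the nonlinearity of the link function is exactly why graphical presentation of GLM results is emphasized.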
Structure
Day One (morning session): Review of linear regression.
Day One (afternoon session): Presentation and interpretation of linear regression models.
Day Two (morning session): Fitting and interpreting models with multiplicative interactions.
Day Two (afternoon session): Nonlinearity: Specification, presentation, and interpretation.
Day Three (morning session): Anticipating criticisms: Model diagnostics and sensitivity tests.
Day Three (afternoon session): Introduction to logit, probit, and other Generalized Linear Models (GLMs).
Day Four (morning session): GLMs: Presentation, interpretation, and discussion.
Day Four (afternoon session): GLMs: Practical considerations, plus extensions.
Day Five (morning session): “Participants’ choice” session.
Day Five (afternoon session): Examination period.
Literature
Mandatory
The course has one required text:
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Additional readings will also be assigned as necessary; a list of those readings will be sent to course participants a few weeks before the course begins. All additional readings will be available on the course github repository and/or through online library services (e.g., JSTOR).
Supplementary / Voluntary
None.
Mandatory readings before course start
None.
Examination part
Grading:
– Two written homework assignments (20% each)
– A final examination (50%)
– Oral / class participation (10%)
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam. Additional useful materials include:
Fox, John, and Sanford Weisberg. 2011. An R Companion to Applied Regression, Second Edition. Thousand Oaks, CA: Sage Publications.
Nagler, Jonathan. 1996. “Coding Style and Good Computing Practices.” The Political Methodologist 6(2):2-8.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data using linear and/or generalized linear regression. Students will be required to specify, estimate, and interpret various forms of regression models, to present tabular and graphical interpretations of those model results, to conduct and present diagnostics and robustness checks, and to give detailed explanations and justifications for their responses.
Literature
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel / Hierarchical Models. New York: Cambridge University Press.
Over the past 60 years, econometrics has provided us with many tools to uncover many different types of correlations. The technical level of this literature is impressive (see the PEF course Advanced Microeconometrics). At the end of the day, however, correlations are of limited interest if they have no causal implication. For example, the fact that smokers are more likely to die earlier than other people does not tell us much about the effect of smoking: it might simply be that smokers are the type of people who face more health and crime risks for quite different (social or genetic) reasons. The same problem arises with almost any correlation of economic or financial variables. The interesting question is always whether these correlations are spurious, or whether they tell us something about the underlying causal link between the variables involved.
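The confounding problem behind the smoking example can be made concrete with a small simulation. The sketch below (in Python for self-containedness; the course itself recommends GAUSS or Stata for the empirical project) uses a hypothetical data-generating process in which X has no causal effect on Y at all, yet the two are strongly correlated because a shared confounder Z drives both.

```python
import random

random.seed(1)
n = 5000

# Data-generating process: a latent confounder Z drives both X and Y.
# X has NO causal effect on Y, so any X-Y correlation is spurious.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def corr(a, b):
    """Pearson correlation coefficient of two equal-length samples."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# X and Y are clearly correlated (around 0.5 under this design)
# even though the causal effect of X on Y is exactly zero.
print(corr(x, y))
```

A naive regression of Y on X in these data would report a significant "effect"; identifying the true (null) causal effect requires the conditions and methods this course is about.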
In this course we review and organize the rapidly developing literature on causal analysis in economics and econometrics and consider the conditions and methods required for drawing causal inferences from the data. Empirical applications play an important role in this course.
Active participation by the PhD students in this course is expected. During the second part of the course, participants will conduct their own empirical study and present their results.
General structure and rules
Student activities
Active participation of the students in this course is the key to its success. Students are expected to do the following:
- Read the papers marked as ‘compulsory reading’ in the reading list BEFORE the lecture concerned with the topic.
- Each morning students will present a paper (15‑30 minutes each, depending on the number of participants), followed by a general discussion of these papers. Students who are not presenting are expected at least to have skimmed the papers so that they can participate in the discussion.
- Small groups of students (group size depends on the number of participants) will conduct an independent empirical study (using software of their own choice; GAUSS or Stata is recommended). In the empirical project students will show that they have understood the basic concepts and are able to apply them to a ‘real-life’ situation.
Grades
- Written exam about 4 weeks after the last lecture (2 hours; 40%).
- Students’ active participation in general discussions during lectures and presentations (20%).
- Presentation of papers (20%).
- Empirical project (based on two presentations; 20%).
Prerequisites
As defined for the econometrics specialisation of PEF.
Course literature
To be published shortly before the lecture
Examination content
Empirical work, literature, contents of lecture
Examination relevant literature
To be defined during the lecture