Syllabuses
University of St.Gallen
Prerequisites (knowledge of topic)
Advanced knowledge of statistics and econometrics (gained, for example, by following specialized courses in a master's program in quantitative methods, economics, or finance).
Hardware
Individual laptop (no particular requirements).
Software
Examples and code are shown using the R software (freely downloadable from https://www.r-project.org/).
Course Content
Computational Statistics is the area of specialization within statistics that includes statistical visualization and other computationally intensive methods of statistics for mining large, non-homogeneous, multidimensional datasets so as to discover knowledge in the data. As in all areas of statistics, probability models are important, and results are qualified by statements of confidence or of probability. An important activity in computational statistics is model building and evaluation.
First, basic multiple linear regression is reviewed. Then, some nonparametric procedures for regression and classification are introduced and explained. In particular, kernel estimators, smoothing splines, classification and regression trees, additive models, projection pursuit, and eventually neural nets will be considered; some of these have a straightforward interpretation, while others are useful for obtaining good predictions.
The main problems arising in computational statistics, such as the curse of dimensionality, will be discussed. Moreover, the goodness of a given (complex) model for estimation and prediction is analyzed using resampling, bootstrap, and cross-validation techniques.
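The resampling idea above is simple enough to sketch in a few lines. The course works in R; the following illustration (in Python, with made-up data) shows a bootstrap estimate of the standard error of a sample mean:

```python
import random
import statistics

def bootstrap_se(data, n_boot=2000, seed=0):
    """Estimate the standard error of the sample mean by resampling
    the data with replacement n_boot times and taking the standard
    deviation of the resampled means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in range(len(data))]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1, 2.5, 3.7]
se = bootstrap_se(data)
```

Cross-validation follows the same resampling logic, except that each held-out portion of the data is used to assess predictions rather than to re-estimate a statistic.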
Structure
Outline
1. Overview of supervised learning: introductory examples, two simple approaches to prediction, statistical decision theory, local methods in high dimensions, structured regression models, bias-variance trade-off, multiple testing and use of p-values.
2. Linear methods for regression: multiple regression, analysis of residuals, subset selection and coefficient shrinkage.
3. Methods for classification: Bayes classifier, linear regression of an indicator matrix, discriminant analysis, logistic regression.
4. Nonparametric density estimation and regression: histogram, kernel density estimation, kernel regression estimator, local polynomial nonparametric regression estimator, smoothing splines and penalized regression.
5. Model assessment and selection: bias, variance and model complexity, bias-variance decomposition, optimism of the training error rate, AIC and BIC, cross-validation, bootstrap methods.
6. Flexible regression and classification methods: additive models; multivariate adaptive regression splines (MARS); neural networks; projection pursuit regression; classification and regression trees (CART).
7. Bagging and boosting: the bagging algorithm, bagging for trees, subagging, the AdaBoost procedure, steepest descent and gradient boosting.
8. Introduction to the idea of a Super Learner.
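As a taste of the bagging chapter: the bagging algorithm draws bootstrap resamples of the training data, fits the base learner to each, and averages the resulting predictions. The sketch below (in Python; the course itself uses R, and the base learner, data, and function names here are invented for illustration) bags a deliberately high-variance 1-nearest-neighbour fit:

```python
import random
import statistics

def bag_predict(x_train, y_train, x_new, fit, n_bags=25, seed=1):
    """Bagging: fit the base learner on bootstrap resamples of the
    training data and average the resulting predictions at x_new."""
    rng = random.Random(seed)
    n = len(x_train)
    preds = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([x_train[i] for i in idx], [y_train[i] for i in idx])
        preds.append(model(x_new))
    return statistics.mean(preds)

# Base learner: 1-nearest-neighbour regression, a simple high-variance fit
# that benefits from the variance reduction bagging provides.
def fit_1nn(xs, ys):
    def predict(x):
        j = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[j]
    return predict

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.2, 1.9, 3.2, 3.9]
yhat = bag_predict(x, y, 2.5, fit_1nn)
```

Bagging for trees replaces the base learner with a regression or classification tree; the aggregation step is identical.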
Structure (Chapters refer to the outline above)
Days 1 and 2: Chapters 1, 2, and 3
Day 3: Chapter 5
Day 4: Chapter 4
Days 5 and 6: Chapters 6, 7, and 8
Literature
Mandatory
F. Audrino, Lecture Notes (can be downloaded from StudyNet or requested directly from the lecturer).
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York.
Supplementary / voluntary
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
Moreover: References to related published papers will be given during the course.
Additional online resources:
A complete version of the main reference book can be downloaded online: http://statweb.stanford.edu/~hastie/ElemStatLearn/
Moreover, the R package for the examples in the book is available: https://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf
The webpage of the book on Targeted Learning: http://www.targetedlearningbook.com/
https://stat.ethz.ch/education/semesters/ss2015/CompStat (a largely overlapping Computational Statistics class taught at ETH Zürich)
Rsoftware information and download: https://www.rproject.org/
Online course by Hastie and Tibshirani on Statistical Learning:
Official course at Stanford Online: https://lagunita.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
Quicker access to the videos: http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Link to the website of an introductory book related to the course: http://www-bcf.usc.edu/~gareth/ISL/index.html
Mandatory readings before course start
–
Examination part
Decentral: 100% group examination paper (term paper). In line with St. Gallen quality standards, an individual examination paper is also possible.
Supplementary aids
The examination paper consists of the analysis of a data set chosen by the students, involving the methods learned in the lecture.
Examination content
The whole outline of the lecture described above.
Literature
Audrino, Lecture Notes.
These workshop lectures are designed to introduce participants to one of the most vibrant, free-of-charge statistical computing environments: R. In this course you will learn how to use R for effective data analysis. We will cover a range of topics, from basic ones (e.g., reading data into R, data structures (i.e., data frames, lists, matrices), data manipulation, statistical graphics) to more advanced ones (e.g., writing functions, control statements, loops, reshaping data, string manipulations, and statistical models in R).
This course is also helpful as a primer for other summer program courses that will use R, such as Computational Statistics, Data Mining, or Advanced Regression Modeling. No prerequisites are required for this course.
Course content
The primary goal is to develop an applied and intuitive (as opposed to purely theoretical or mathematical) understanding of the topics and procedures. Whenever possible, presentations will be given in "Words," "Picture," and "Math" languages in order to appeal to a variety of learning styles. Some more advanced regression topics will be covered later in the course, but only after the introductory foundations have been established.
We will begin with a quick review of basic univariate statistics and hypothesis testing.
After that we will cover various topics in bivariate and then multiple regression, including:
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
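To make the dummy-variable and interaction topics concrete, here is a minimal sketch (in Python with NumPy; the in-class examples use SPSS, and the data below are invented) of fitting y = b0 + b1*x + b2*d + b3*(x*d) by ordinary least squares, where d is a 0/1 dummy and b3 measures how the slope on x differs between the two groups:

```python
import numpy as np

# Toy data: outcome y, continuous predictor x, dummy d (e.g. group membership).
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
d = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y = np.array([2.1, 3.9, 6.2, 7.8, 13.1, 15.9, 19.2, 21.8])

# Design matrix with intercept, main effects, and the x*d interaction term.
X = np.column_stack([np.ones_like(x), x, d, x * d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta = [intercept, slope for d=0, intercept shift for d=1, extra slope for d=1]
```

Because the model contains the full interaction, it reproduces the two group-specific regressions exactly: beta[1] is the d=0 slope and beta[1] + beta[3] is the d=1 slope.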
Structure
This course will utilize approximately 525 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in eleven Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These eleven Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in-class examples are presented using SPSS, participants are free (and encouraged!) to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive "Tutorial and Answer Key" will be provided for each.
Prerequisites
This course is intended for participants who are comfortable with algebra and basic introductory statistics, and now want to learn applied ordinary least squares (OLS) multiple regression analysis for their own research and to understand the work of others.
Note: We will not use matrix algebra or calculus in this course.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information for several supplemental (and optional) readings for each of the eleven Packets of Lecture Transcripts.
• Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
• The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
• Finally, I have included several articles from a number of journals across several
academic disciplines.
Some of these optional supplemental readings are older classics and others are more recently written and published.
Examination part
A written Final Examination will be administered during the last meeting of the course.
Since this Final Examination is the only artifact that will be formally graded in the course, it will determine the course grade.
Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the Final Examination.
Supplementary aids
The Final Examination will be written and open-book/open-note (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed). No other materials, including laptops, cell phones, or other electronic devices, will be permitted.
The Final Exam will be two hours in length and administered during the last course meeting.
I will provide more specific “practical matter” details about this exam early in the course.
Examination content
The potential substantive content areas for the Final Examination are:
• Basic univariate statistics and hypothesis testing.
• Fundamental concepts of bivariate regression and multiple regression.
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
Literature
Literature relevant to the exam:
• Lecture Transcripts (eleven Packets; approximately 525 pages).
• Class notes (taken by each participant individually).
• Tutorial and Answer Key documents (for each optional data analysis project assignment).
Supplementary/Voluntary literature not directly relevant to the exam:
• Optional supplemental readings listed in the course syllabus (and discussed earlier).
• Any other textbooks, articles, etc., the participant reads before or during the course.
Work load
At least 24 units of 45 minutes each, on 5 consecutive days.
Prerequisites (knowledge of topic)
As long ago as 2010, Eric Schmidt, the executive chairman of Alphabet, observed that every two days we generate as much information as was created in the entire history of civilization up to 2003. The problem is that much of this information is unstructured: it is not organized in a predefined manner. This lack of structure complicates extracting useful insights from these massively growing data sources. Students should have some familiarity with Python/R programming. Please bring a laptop to class. You will also need a Google account to practice using Colab.
Learning objectives and course content
In this class, we will explore different statistical approaches that have proven useful in making sense of unstructured data. The course is centered around business applications that involve the analysis of text, social networks, and images, as well as their relationships with metadata. For most of the analyses, we will use Python/R and dedicate some of the class sessions to hands-on time. Students are invited to bring their own unstructured data sets, but doing so is not required.
Structure
Day 1: Text mining: text representation, word2vec, sentiment analysis, topic modeling.
Day 2: Supervised and unsupervised machine learning: regression, random forest, k-means.
Day 3: Social network analysis: centralities, community detection, and representation learning.
Day 4: Image analysis: image processing, deep learning.
Day 5: Discussion of student projects.
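As a flavour of the Day 2 material, k-means clustering alternates two steps: assign each observation to its nearest centre, then move each centre to the mean of its cluster. Below is a minimal one-dimensional sketch (in Python with NumPy; the data are invented, and in class we will use the Python/R tooling directly):

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points to the
    nearest centre and recomputing each centre as its cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize centres as k distinct data points chosen at random.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.abs(points[:, None] - centres[None, :]), axis=1)
        centres = np.array([points[labels == j].mean() for j in range(k)])
    return centres, labels

pts = np.array([1.0, 1.2, 0.8, 9.7, 10.1, 10.3])
centres, labels = kmeans(pts, k=2)
```

On this toy data the two centres converge to the means of the two obvious groups; in higher dimensions only the distance computation changes.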
Literature
The following books provide useful background material for the class. I will refer to more specialized publications as part of my lecture.
Introduction to information retrieval:
https://nlp.stanford.edu/IR-book/
Deep Learning:
https://www.deeplearningbook.org/
Community detection in graphs:
https://www.sciencedirect.com/science/article/pii/S0370157309002841
Graph representation learning book:
https://www.cs.mcgill.ca/~wlh/grl_book/
Python:
https://docs.python.org/3/
Examination Part
Final grades are based on a portfolio of assigned exercises. The solutions are due about two weeks after the end of the course.
Prerequisites (knowledge of topic)
Linear regression (strong), Maximum Likelihood Estimation (some familiarity), Linear/Matrix Algebra (some exposure is helpful), R (not required, but helpful).
Hardware
Access to a laptop will be useful, but not absolutely necessary.
Software
R/RStudio, JAGS (both are freely available online).
Learning objectives
To understand what the Bayesian approach to statistical modeling is and to appreciate the differences between the Bayesian and Frequentist approaches. The students will be able to estimate a wide variety of models in the Bayesian framework and to adjust example code to fit their specific modeling needs.
Course content
Theory/foundations of the Bayesian approach including:
objective vs subjective probability
how to derive and incorporate prior information
the basics of MCMC sampling
assessing convergence of Markov Chains
Bayesian difference of means/ANOVA
Bayesian versions of: Linear models, logit/probit (dichotomous/ordered/unordered choice models), Count models, Latent variable and measurement models, Multilevel models
presentation of results
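A worked example of the Bayesian machinery in its simplest, conjugate form (the Beta-Binomial model, sketched here in Python; in class we will use R and JAGS): with a Beta(a, b) prior on a success probability and k successes in n trials, the posterior is available in closed form, so no MCMC sampling is needed.

```python
# Conjugate Beta-Binomial update: with a Beta(a, b) prior on a success
# probability and k successes observed in n trials, the posterior is
# Beta(a + k, b + n - k).

def beta_binomial_posterior(a, b, k, n):
    a_post = a + k
    b_post = b + n - k
    mean = a_post / (a_post + b_post)  # posterior mean of the success probability
    return a_post, b_post, mean

# Flat Beta(1, 1) prior, 7 successes in 10 trials.
a_post, b_post, post_mean = beta_binomial_posterior(1, 1, 7, 10)
# Posterior is Beta(8, 4) with mean 8/12 = 2/3.
```

For non-conjugate models this closed form is unavailable, which is exactly where the MCMC sampling methods covered on Day 2 come in.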
Structure
Day 1 a.m.: Overview of the Bayesian approach: Bayes vs. Frequentism, history of Bayesian statistics, problems with the NHST, the Beta-Binomial model
Day 1 p.m.: Review of GLM/MLE. Probability review. Application of Bayes Rule.
Day 2 a.m.: Priors, Sampling methods (Inversion, Rejection, Gibbs sampling)
Day 2 p.m.: Convergence diagnostics. Using JAGS to estimate Bayesian models.
Day 3 a.m.: Estimating parameters of the Normal Distribution
Day 3 p.m.: Bayesian linear models, imputing missing data.
Day 4 a.m.: Choice models (dichotomous, ordered, unordered)
Day 4 p.m.: Latent variable models
Day 5 a.m.: Multilevel models: linear models.
Day 5 p.m.: Multilevel models: nonlinear models, best practices for model presentation.
Literature
Mandatory
Gill, J. (2008). Bayesian Methods: A Social And Behavioral Sciences Approach. Chapman and Hall, Boca Raton, FL
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Jackman, S. (2000). Estimation and Inference Are Missing Data Problems: Unifying Social Science Statistics via Bayesian Simulation. Political Analysis, 8(4):307–332. http://pan.oxfordjournals.org/content/8/4/307.full.pdf+html
Supplementary / voluntary
Siegfried, T. (2010). Odds are, it’s wrong: Science fails to face the shortcomings of statistics. Science News, 177(7):26–29. http://dx.doi.org/10.1002/scin.5591770721
Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science.
Bakker, R. (2009). Remeasuring left–right: A comparison of SEM and Bayesian measurement models for extracting left–right party placements. Electoral Studies, 28(3):413–421
Bakker, R. and Poole, K. T. (2013). Bayesian Metric Multidimensional Scaling. Political Analysis, 21(1):125–140
For those unfamiliar with R: John Fox and Sanford Weisberg. An R Companion to Applied Regression. Sage, 2011.
Mandatory readings before course start
Western, B. and Jackman, S. (1994). Bayesian Inference for Comparative Research. American Political Science Review, 88(2):412–423. http://www.jstor.org/stable/2944713
Efron, B. (1986). Why Isn’t Everyone a Bayesian? The American Statistician, 40(1):1–5. http://www.jstor.org/stable/2683105
Examination part
A written homework assignment which consists of estimating a variety of models using JAGS as well as a brief essay describing how the students would go about incorporating Bayesian methods in their own work and what they see as the main advantages/disadvantages of doing so.
Supplementary aids
Open book/practical examinations. The students should use the example code from the lectures to help complete the practical component as well as both required texts to help answer the essay component. Specifically, the linear model and dichotomous choice model examples will be very useful as well as the first 3 chapters of the Gill text and Section 3 of the Gelman and Hill text.
Examination content
Bayesian versions of the linear and dichotomous choice models, including presenting the appropriate results in a professionally acceptable manner. This includes creating graphical representations of the model results as well as a thorough discussion of how to interpret the results.
For the essay component, students will need to be aware of the benefits of the Bayesian approach for their own research (or the lack thereof) and to describe, in detail, the types of choices they would need to make in order to apply Bayesian methods to their own work. This includes a detailed description and justification of what priors they would choose as well as what differences they would expect to see between the Bayesian and Frequentist approaches, if any, and why they would expect such differences.
Literature
The only literature required to complete the examinations is the two required texts and the code examples from the lectures.
The course is designed for Master's and PhD students and practitioners in the social and policy sciences, including political science, sociology, public policy, public administration, business, and economics. It is especially suitable for MA students in these fields who have an interest in carrying out research. Previous courses in research methods and philosophy of science are helpful but not required. Materials not in the books assigned for purchase and not easily available through online library databases will be made available electronically. Bringing a laptop to class will be helpful but is not essential.
Hardware
Laptop helpful but not required
Software
None
Course content
The central goal of the seminar is to enable students to create and critique methodologically sophisticated case study research designs in the social sciences. To do so, the seminar will explore the techniques, uses, strengths, and limitations of case study methods, while emphasizing the relationships among these methods, alternative methods, and contemporary debates in the philosophy of science. The research examples used to illustrate methodological issues will be drawn primarily from international relations and comparative politics. The methodological content of the course is also applicable, however, to the study of history, sociology, education, business, economics, and other social and behavioral sciences.
Course structure
The seminar will begin with a focus on the philosophy of science, theory construction, theory testing, causality, and causal inference. With this epistemological grounding, the seminar will then explore the core issues in case study research design, including methods of structured and focused comparisons of cases, typological theory, case selection, process tracing, and the use of counterfactual analysis. Next, the seminar will look at the epistemological assumptions, comparative strengths and weaknesses, and proper domain of case study methods and alternative methods, particularly statistical methods and formal modeling, and address ways of combining these methods in a single research project. The seminar then examines field research techniques, including archival research and interviews.
Course Assignments and Assessment
In addition to doing the reading and participating in course discussions, students will be required to orally present an outline for a research design, either written or in PowerPoint, in the final sessions of the class for a constructive critique by fellow students and Professor Bennett. Students will then develop this into a research design paper about 3000 words long (12 pages, double-spaced).
Presumably, students will choose to present the research design for their PhD or MA thesis, though students could also present a research design for a separate project, article, or edited volume. Research designs should address all of the following tasks (elaborated upon in the assigned readings and course sessions): 1) specification of the research problem and research objectives, in relation to the current stage of development and research needs of the relevant research program, related literatures, and alternative explanations; 2) specification of the independent and dependent variables of the main hypothesis of interest and alternative hypotheses; 3) selection of a historical case or cases that are appropriate in light of the first two tasks, and justification of why these cases were selected and others were not; 4) consideration of how variance in the variables can best be described for testing and/or refining existing theories; 5) specification of the data requirements, including both process tracing data and measurements of the independent and dependent variables for the main hypotheses of interest, including alternative explanations.
Students will be assessed on how well their research design achieves these tasks, and on how useful their suggestions are on other students’ research designs. Students will also be assessed on the general quality of their contributions to class discussions.
Literature
Mandatory:
Assigned Readings for GSERM Case Study Methods Course
Andrew Bennett, Georgetown University
Students should obtain and read these books in advance of the course (see below for specific page assignments):
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development in the Social Sciences (MIT Press 2005).
•Henry Brady and David Collier, Rethinking Social Inquiry (second edition, 2010)
•Gary Goertz, Social Science Concepts: A User’s Guide, (Princeton, 2005).
•Andrew Bennett and Jeffrey Checkel, eds., Process Tracing: From Metaphor to Analytic Tool (Cambridge University Press, 2014).
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry (Princeton University Press, 1994).
Lecture 1: Inferences About Causal Effects and Causal Mechanisms
This lecture addresses the philosophy of science issues relevant to case study research.
Readings:
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development, preface and chapter 7, pages 127–150.
•King, Keohane, and Verba, Designing Social Inquiry, pp. 3–33, 76–91, 99–114.
Lecture 2: Critiques and Justifications of Case Study Methods
Readings:
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry, pp. 46–48, 118–121, 208–230.
•Brady and Collier, Rethinking Social Inquiry, pp. 1–64, 123–201 (or, if you have the first edition, pages 3–20, 36–50, 195–266)
•George and Bennett, Case Studies and Theory Development, Chapter 1, pages 3–36.
Lecture 3: Concept Formation and Measurement
Readings:
•Gary Goertz, Social Science Concepts, chapters 1, 2, 3, and 9, pages 1–94, 237–268.
•Gary Goertz, Exercises, available at
http://press.princeton.edu/releases/m8089.pdf
Please think through the following exercises: 7, 21, 48, 49, 52, 163, 252, 253, 256, 257.
Lecture 4: Designs for Single and Comparative Case Studies
Readings:
•George and Bennett, Case Studies and Theory Development, chapter 4, pages 73–88.
•Jason Seawright and John Gerring, "Case Selection Techniques in Case Study Research," Political Research Quarterly, June 2008. Available at: http://blogs.bu.edu/jgerring/files/2013/06/CaseSelection.pdf
Lecture 5: Typological Theory, Fuzzy Set Analysis
Readings:
•George and Bennett, Case Studies and Theory Development, chapter 11, pages 233–262.
•Excerpt from Andrew Bennett, "Causal mechanisms and typological theories in the study of civil conflict," in Jeff Checkel, ed., Transnational Dynamics of Civil War, Columbia University Press, 2012.
•Charles Ragin, "From Fuzzy Sets to Crisp Truth Tables," available at:
http://www.compasss.org/files/WPfiles/Raginfztt_April05.pdf
Lecture 6: Process Tracing, Congruence Testing, and Counterfactual Analysis
Readings:
•Andrew Bennett and Jeff Checkel, Process Tracing, chapter 1, conclusions, and appendix on Bayesianism.
•David Collier, online process tracing exercises. Look at exercises 3, 4, 7, and 8 at:
http://polisci.berkeley.edu/sites/default/files/people/u3827/Teaching%20Process%20Tracing.pdf
Lecture 7: Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
Readings:
•Andrew Bennett and Bear Braumoeller, "Where the Model Frequently Meets the Road: Combining Statistical, Formal, and Case Study Methods," draft paper.
•Evan Lieberman, "Nested Analysis as a Mixed-Method Strategy for Comparative Research," American Political Science Review, August 2005, pp. 435–52.
Lecture 8: Field Research Techniques: Archives, Interviews, and Surveys
Readings:
•Cameron Thies, "A Pragmatic Guide to Qualitative Historical Analysis in the Study of International Relations," International Studies Perspectives 3 (4) (November 2002), pp. 351–72.
Lecture 9 & 10: Student research design presentations
Read and be ready to constructively critique your fellow students’ research designs.
Supplementary / voluntary:
The following readings are useful for students interested in exploring the topic further, but they are not required:
I) Philosophy of Science and Epistemological Issues
Henry Brady, "Causation and Explanation in Social Science," in Janet Box-Steffensmeier, Henry Brady, and David Collier, eds., Oxford Handbook of Political Methodology (Oxford, 2008), pp. 217–270.
II) Case Study Methods
George and Bennett, Case Studies and Theory Development, Chapter 1.
Gerardo Munck, "Canons of Research Design in Qualitative Analysis," Studies in Comparative International Development, Fall 1998.
Timothy McKeown, "Case Studies and the Statistical World View," International Organization Vol. 53, No. 1 (Winter, 1999), pp. 161–190.
Concept Formation and Measurement
John Gerring, "What Makes a Concept Good?," Polity, Spring 1999: 357–93.
Robert Adcock and David Collier, "Measurement Validity: A Shared Standard for Qualitative and Quantitative Research," APSR Vol. 95, No. 3 (September, 2001), pp. 529–546.
Robert Adcock and David Collier, "Democracy and Dichotomies," Annual Review of Political Science, Vol. 2, 1999, pp. 537–565.
David Collier and Steven Levitsky, "Democracy with Adjectives: Conceptual Innovation in Comparative Research," World Politics, Vol. 49, No. 3 (April 1997), pp. 430–451.
David Collier, "Data, Field Work, and Extracting New Ideas at Close Range," APSA-CP Newsletter, Winter 1999, pp. 1–6.
Gerardo Munck and Jay Verkuilen, "Conceptualizing and Measuring Democracy: Evaluating Alternative Indices," Comparative Political Studies, Feb. 2002, pp. 5–34.
Designs for Single and Comparative Case Studies and Alternative Research Goals
Aaron Rapport, "Hard Thinking about Hard and Easy Cases in Security Studies," Security Studies 24:3 (2015), 431–465.
Van Evera, Guide to Methodology, pp. 77–88.
Richard Nielsen, "Case Selection via Matching," Sociological Methods and Research
(forthcoming).
Typological Theory and Case Selection
Colin Elman, "Explanatory Typologies and Property Space in Qualitative Studies of International Politics," International Organization, Spring 2005, pp. 293–326.
Gary Goertz and James Mahoney, "Negative Case Selection: The Possibility Principle," in Goertz, chapter 7.
David Collier, Jody LaPorte, and Jason Seawright, "Putting typologies to work: concept formation, measurement, and analytic rigor," Political Research Quarterly, 2012.
Process Tracing
Tasha Fairfield and Andrew Charman, 2015 APSA paper on Bayesian process tracing.
David Waldner, "Process Tracing and Causal Mechanisms." In Harold Kincaid, ed., The Oxford Handbook of Philosophy of Social Science (Oxford University Press, 2012), pp. 65‐84.
Gary Goertz and Jack Levy, "Causal Explanation, Necessary Conditions, and Case Studies: The Causes of World War I," manuscript, Dec. 2002.
Counterfactual Analysis, Natural Experiments
Jack Levy, paper in Security Studies on counterfactual analysis.
Thad Dunning, "Design-Based Inference: Beyond the Pitfalls of Regression Analysis?" in Brady and Collier, pp. 273–312.
Thad Dunning, Natural Experiments in the Social Sciences: A Design-Based Approach (Cambridge University Press, 2012), Chapters 1 and 7.
Philip Tetlock and Aaron Belkin, eds., Counterfactual Thought Experiments, chapters 1, 12.
Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
David Dessler, "Beyond Correlations: Toward a Causal Theory of War," International Studies Quarterly Vol. 35, No. 3 (September, 1991), pp. 337–355.
Alexander George and Andrew Bennett, Case Studies and Theory Development, Chapter 2.
James Mahoney, "Nominal, Ordinal, and Narrative Appraisal in MacroCausal Analysis," American Journal of Sociology, Vol. 104, No.3 (January 1999).
Field Research Techniques: Archives, Interviews, and Surveys
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Field Research in Political Science: Practices and Principles," chapter 1 in Field Research in Political Science: Practices and Principles (Cambridge University Press). Read pages 15–33.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Interviews, Oral Histories, and Focus Groups," in Field Research in Political Science: Practices and Principles (Cambridge University Press).
Elisabeth Jean Wood, "Field Research," in Carles Boix and Susan Stokes, eds., Oxford Handbook of Comparative Politics, Oxford University Press, 2007, pp. 123–146.
Soledad Loaeza, Randy Stevenson, and Devra C. Moehler. 2005. "Symposium: Should Everyone Do Fieldwork?" APSA-CP 16(2) 2005: 8–18.
Layna Mosley, ed., Interview Research in Political Science, Cornell University Press, 2013.
Hope Harrison, "Inside the SED Archives," CWIHP Bulletin
Ian Lustick, "History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias," APSR, September 1996, pp. 605–618.
Symposium on interview methods in political science in PS: Political Science and Politics (December, 2002), articles by Beth Leech ("Asking Questions: Techniques for Semistructured Interviews"), Kenneth Goldstein ("Getting in the Door: Sampling and Completing Elite Interviews"), Joel Aberbach and Bert Rockman ("Conducting and Coding Elite Interviews"), Laura Woliver ("Ethical Dilemmas in Personal Interviewing"), and Jeffrey Berry ("Validity and Reliability Issues in Elite Interviewing"), pp. 665–682.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "A Historical and Empirical Overview of Field Research in the Discipline," Chapter 2 in Field Research in Political Science: Practices and Principles (Cambridge University Press, forthcoming).
Mandatory readings before course start:
It is advisable to do as much of the mandatory reading as possible before the course starts.
This course assumes no prior experience with machine learning or R, though it may be helpful to be familiar with introductory statistics and programming.
Hardware
A laptop computer is required to complete the in-class exercises.
Software
R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
Machine learning, put simply, involves teaching computers to learn from experience, typically for the purpose of identifying or responding to patterns or making predictions about what may happen in the future. This course is an introduction to machine learning methods through the exploration of real-world examples. We will cover the basic math and statistical theory needed to understand and apply many of the most common machine learning techniques, but no advanced math or programming skills are required. The target audience includes social scientists and practitioners who are interested in understanding more about these methods and their applications. Students with extensive programming or statistics experience may be better served by a more theoretical course on these methods.
Structure
The course is designed to be interactive, with ample time for hands-on practice with the machine learning methods. Each day will include several lectures on a machine learning topic, in addition to hands-on “lab” sections in which students apply what they have learned to new datasets (or their own data, if desired).
The schedule will be as follows:
Day 1: Introducing Machine Learning with R
 How machines learn
 Using R, RStudio, and R Markdown
 k-Nearest Neighbors
 Lab sections – installing R, using R Markdown, choosing own dataset (if desired)
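To give a flavor of the Day 1 material, k-Nearest Neighbors can be run in a few lines of R. The sketch below is illustrative only (not part of the course materials); it uses the built-in iris data and the class package that is bundled with R:

```r
# k-NN sketch on the built-in iris data ('class' ships with R)
library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)     # 100 rows for training, the rest for testing
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])     # share of test flowers classified correctly
```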
Day 2: Intermediate ML Methods – Classification Models
 Quiz on Day 1 material
 Naïve Bayes
 Decision Trees and Rule Learners
 Lab sections – practicing with Naïve Bayes and decision trees
Day 3: Intermediate ML Methods – Numeric Prediction
 Quiz on Day 2 material
 Linear Regression
 Regression trees
 Logistic regression
 Lab sections – practicing with regression methods
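As a preview of the Day 3 methods, linear and logistic regression use R's built-in lm() and glm() functions. This sketch is illustrative only, using the built-in mtcars data:

```r
# Numeric prediction: linear regression of fuel economy on weight and horsepower
lin <- lm(mpg ~ wt + hp, data = mtcars)
summary(lin)

# Binary prediction: logistic regression of transmission type (am) on weight
log_fit <- glm(am ~ wt, data = mtcars, family = binomial)
predict(log_fit, newdata = data.frame(wt = 3), type = "response")  # predicted probability
```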
Day 4: Advanced Classification Models
 Quiz on Day 3 material
 Neural Networks
 Support Vector Machines
 Random Forests
 Lab section – practice with neural networks, SVMs, and random forests
Day 5: Other ML Methods
 Quiz on Day 4 material
 Association Rules
 Hierarchical clustering
 k-Means clustering
 Lab section – practice with these methods, work on final report
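The clustering methods of Day 5 are available in base R. As an illustrative sketch (not part of the course materials), k-Means on the scaled iris measurements:

```r
# k-Means clustering of the iris measurements (species labels withheld)
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
table(km$cluster, iris$Species)   # compare recovered clusters to the true species
```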
Literature
Mandatory
Machine Learning with R (3rd ed.) by Brett Lantz (2019). Packt Publishing.
Supplementary / voluntary
None required.
Mandatory readings before course start
Please install R and RStudio on your laptop prior to the first class. Make sure that both are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
100% of the course grade will be based on a project and final report (approximately 10 pages), to be delivered within 2-3 weeks after the course. The project is intended to demonstrate your ability to apply the course materials to a dataset of your own choosing. Students should feel free to choose a project related to their career or field of study; for example, the project may advance their dissertation research or complete a task for their job. The exact scoring criteria for this assignment will be provided on the first day of class. The report will be graded on its use of the methods covered in class as well as on drawing appropriate conclusions from the data.
There will also be brief quizzes at the start of each lecture, which cover the previous day’s materials. These are ungraded and are designed to provoke thought and discussion.
Supplementary aids
Students may reference literature and class materials as needed when writing the final project report.
Examination content
The final project report should illustrate an ability to apply machine learning methods to a new dataset, which may be on a topic of the student’s choosing. The student should explore the data and explain the methods applied. Detailed instructions will be provided on the first day of class.
Prerequisites (knowledge of topic)
Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Students should have completed Ph.D.-level courses in introductory statistics, and in linear and generalized linear regression models (including logistic regression, etc.), up to the level of Regression III. Familiarity with discrete and continuous univariate probability distributions will be helpful.
Hardware
Students will be required to provide their own laptop computers.
Software
All analyses will be conducted using the R statistical software. R is free, open-source, and runs on all contemporary operating systems. The instructor will also offer support for students wishing to use Stata.
Learning objectives
Students will learn how to visualize, analyze, and conduct diagnostics on models for observational data that have both cross-sectional and temporal variation.
Course content
Analysts increasingly find themselves presented with data that vary both over cross-sectional units and across time. Such panel data provide unique and valuable opportunities to address substantive questions in the economic, social, and behavioral sciences. This course will begin with a discussion of the relevant dimensions of variation in such data, and discuss some of the challenges and opportunities that such data provide. It will then progress to linear models for one-way unit effects (fixed, between, and random), models for complex panel error structures, dynamic panel models, nonlinear models for discrete dependent variables, and models that leverage panel data to make causal inferences in observational contexts. Students will learn the statistical theory behind the various models, details about estimation and inference, and techniques for the visualization and substantive interpretation of their statistical results. Students will also develop statistical software skills for fitting and interpreting the models in question, and will use the models in both simulated and real data applications. Students will leave the course with a thorough understanding of both the theoretical and practical aspects of conducting analyses of panel data.
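To preview the software side, one-way fixed-effects models of this kind can be fit with R's plm package. The sketch below is illustrative only (it assumes plm is installed) and uses the classic Grunfeld investment panel that ships with the package:

```r
# One-way fixed-effects (within) estimator with plm
library(plm)
data("Grunfeld", package = "plm")      # firm-year investment panel

fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"),   # unit and time identifiers
          model = "within")            # fixed (within) effects
summary(fe)
```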
Structure
Day One:
Morning:
• (Very) Brief Review of Linear Regression
• Overview of Panel Data: Visualization, Pooling, and Variation
• Regression with Panel Data
Afternoon:
• Unit Effects Models: Fixed, Between, and Random Effects
Day Two:
Morning:
• Dynamic Panel Data Models: The Instrumental Variables / Generalized Method of Moments Framework
Afternoon:
• More Dynamic Models: Orthogonalization-Based Methods
Day Three:
Morning:
• Unit-Effects and Dynamic Models for Discrete Dependent Variables
Afternoon:
• GLMs for Panel Data: Generalized Estimating Equations (GEEs)
Day Four:
Morning:
• Introduction to Causal Inference with Panel Data (Including Unit Effects)
Afternoon:
• Models for Causal Inference: Differences-in-Differences, Synthetic Controls, and Other Methods
Day Five:
Morning:
• Practical Issues: Model Selection, Specification, and Interpretation
Afternoon:
• Course Examination
Literature
Mandatory
Hsiao, Cheng. 2014. Analysis of Panel Data, 3rd Ed. New York: Cambridge University Press.
OR
Croissant, Yves, and Giovanni Millo. 2018. Panel Data Econometrics with R. New York: Wiley.
Supplementary / voluntary
Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies 72:1-19.
Anderson, T. W., and C. Hsiao. 1981. “Estimation of Dynamic Models with Error Components.” Journal of the American Statistical Association 76:598-606.
Antonakis, John, Samuel Bendahan, Philippe Jacquart, and Rafael Lalive. 2010. “On Making Causal Claims: A Review and Recommendations.” The Leadership Quarterly 21(6):1086-1120.
Arellano, M., and S. Bond. 1991. “Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations.” Review of Economic Studies 58:277-297.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What To Do (And Not To Do) With Time-Series Cross-Section Data.” American Political Science Review 89(September):634-647.
Bliese, P. D., D. J. Schepker, S. M. Essman, and R. E. Ployhart. 2020. “Bridging Methodological Divides Between Macro- and Microresearch: Endogeneity and Methods for Panel Data.” Journal of Management 46(1):70-99.
Clark, Tom S., and Drew A. Linzer. 2015. “Should I Use Fixed or Random Effects?” Political Science Research and Methods 3(2):399-408.
Doudchenko, Nikolay, and Guido Imbens. 2016. “Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis.” Working paper: Graduate School of Business, Stanford University.
Gaibulloev, K., Todd Sandler, and D. Sul. 2014. “Of Nickell Bias, Cross-Sectional Dependence, and Their Cures: Reply.” Political Analysis 22:279-280.
Hill, T. D., A. P. Davis, J. M. Roos, and M. T. French. 2020. “Limitations of Fixed-Effects Models for Panel Data.” Sociological Perspectives 63:357-369.
Hu, F. B., J. Goldberg, D. Hedeker, B. R. Flay, and M. A. Pentz. 1998. “Comparison of Population-Averaged and Subject-Specific Approaches for Analyzing Repeated Binary Outcomes.” American Journal of Epidemiology 147(7):694-703.
Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 62:467-490.
Keele, Luke, and Nathan J. Kelly. 2006. “Dynamic Models for Dynamic Theories: The Ins and Outs of Lagged Dependent Variables.” Political Analysis 14(2):186-205.
Lancaster, Tony. 2002. “Orthogonal Parameters and Panel Data.” Review of Economic Studies 69:647-666.
Liu, Licheng, Ye Wang, and Yiqing Xu. 2019. “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data.” Working paper: Stanford University.
Mummolo, Jonathan, and Erik Peterson. 2018. “Improving the Interpretation of Fixed Effects Regression Results.” Political Science Research and Methods 6:829-835.
Neuhaus, J. M., and J. D. Kalbfleisch. 1998. “Between- and Within-Cluster Covariate Effects in the Analysis of Clustered Data.” Biometrics 54(2):638-645.
Pickup, Mark, and Vincent Hopkins. 2020. “Transformed-Likelihood Estimators for Dynamic Panel Models with a Very Small T.” Political Science Research & Methods, forthcoming.
Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25:57-76.
Zorn, Christopher. 2001. “Generalized Estimating Equation Models for Correlated Data: A Review with Applications.” American Journal of Political Science 45(April):470-90.
Mandatory readings before course start
Hsiao, Cheng. 2007. “Panel Data Analysis — Advantages and Challenges.” Test 16:1-22.
Examination part
Students will be evaluated on two written homework assignments that will be completed during the course (20% each) and a final examination (60%). Homework assignments will typically involve a combination of simulationbased exercises and “real data” analyses, and will be completed during the evenings while the class is in session. For the final examination, students will have two alternatives:
• “InClass”: Complete the final examination in the afternoon of the last day of class (from roughly noon until 6:00 p.m. local time), or
• “TakeHome”: Complete the final examination during the week following the end of the course (due date: TBA).
Additional details about the final examination will be discussed in the morning session on the first day of the course.
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data with a panel / timeseries crosssectional component. Students will be required to specify, estimate, and interpret various statistical models, to conduct and present diagnostics and robustness checks, and to give detailed justifications for their choices.
Examination relevant literature
See above. Details of the examination literature will be finalized prior to the start of class.
Prerequisites (knowledge of topic)
• Some prior knowledge of R and/or programming is beneficial, but not required
Hardware
• Bring your own laptop
Software
• R and RStudio, most recent versions (free downloads available)
• You may want to bring your own credit card to create your own cloud accounts (for the database server and certain APIs). Accounts are typically free, but some require depositing a credit card number.
Course content
Online platforms such as Yelp, Twitter, Amazon, or Instagram are large-scale, rich, and relevant sources of data. Researchers in the social sciences increasingly tap into these data for field evidence when studying various phenomena.
In this course, you will learn how to find, acquire, store, and manage data from such sources and prepare them for follow-up statistical analysis in your own research.
After a short introduction to the relevance of data science skills for the social sciences, we will review R as a programming language and its basic data formats. We will then use R to program simple scrapers that systematically extract data from websites, using the packages rvest, httr, and RSelenium, among others. You will also learn how to read HTML, CSS, JSON, and XML code, to use regular expressions, and to handle string, text, and image data. To store the data, we will look into relational databases, (My)SQL, and related R packages. Many websites such as Twitter and Yelp offer convenient application programming interfaces (APIs) that facilitate the extraction of data, and we will look into accessing them from R. Finally, we will highlight some options for feature extraction from images and text, which allow us to augment the collected data with meaningful variables for use in our analysis.
At the end of this course, students should be able to identify valuable online data sources, to write basic scrapers, and to prepare the collected data such that they can use them for statistical analysis as part of their own research projects.
Throughout the course, students will work on a data-scraping project related to their theses. This project will be presented on the final day of the course.
All data scraping code and other sources will be made available on
https://www.datascraping.org.
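To give a sense of what such a scraper looks like, here is a minimal rvest sketch; the URL and the CSS selector are hypothetical placeholders, not a real target site:

```r
# Minimal web-scraping sketch with rvest (the URL and the ".review"
# selector are illustrative placeholders only)
library(rvest)

page    <- read_html("https://example.com/reviews")   # fetch and parse the HTML
reviews <- page |>
  html_elements(".review") |>                         # select nodes by CSS selector
  html_text2()                                        # extract cleaned text
head(reviews)
```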
Structure
Preliminary schedule:
Day 1
Intro to data scraping
Define students’ scraping projects
Review of R and introduction to programming with R
Afternoon: R programming exercises
Day 2
The anatomy of the internet and relevant data formats
Intro to web scraping with R (with httr, rvest, RSelenium)
Introduction to APIs
Afternoon: Scraping exercises
Day 3
Relational databases and SQL
Data management with R
Afternoon: Database design and implementation project (with MySQL in the cloud)
Day 4
Scraping examples from Yelp, Crowdspring, Twitter, and Instagram
Scaling up your scraper with parallel code and proxies
Feature extraction examples
Afternoon: Work on your scraping projects
Day 5
Wrap-up of course
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Literature
Mandatory
None, all readings will be provided during the course
Supplementary / voluntary
None, all readings will be provided during the course
Examination part
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Supplementary aids
Individual quiz: Closed book
Presentation of students’ scraping projects: Closed book
Examination content
Lecture slides covering key concepts of R and programming, the anatomy of the internet, relational databases, and scraping (slides will be provided as PDFs the day before classes).
• Students will need to understand R code when they see it but they will not be required to code during the exam.
Examination relevant literature
None.
Prerequisites (knowledge of topic)
Substantive Background: Students taking this course should have a general familiarity with the types of data that can be obtained through survey research. While not absolutely required, it would be useful if students bring to the course survey datasets from their own fields. But, even if students do not bring their own data, the instructor will provide several survey datasets for course use.
Statistical Methods: Students in this course should be familiar with multiple regression analysis and comfortable with the process of employing regression models to analyze empirical data.
Computing: Students in this course should have some prior exposure and basic experience with the R statistical computing environment. But specific packages and functions will be introduced and explained in detail throughout the course.
Hardware
Students in this course should bring their own laptop computers to class so they can access the software required to carry out the analyses in course examples and exercises.
Software
This course will rely on the R statistical computing environment. Students should install the latest version of R on their computers before the first class session. While not absolutely required, it is strongly recommended that students also install RStudio. Doing so will make it much easier to interact with the R system in productive ways.
The course material will use several R packages. Students should install the optiscale, psych, mokken, and smacof packages before the first class session. Additional R packages will be made available and used throughout the course.
Course content
This course is aimed at demonstrating to students how to complete three critical tasks with survey data: (1) combine several survey items into a more reliable and powerful scale, (2) assess the dimensionality of a set of attitudes, and (3) produce geometric maps of attitudes and preferences, so that the fundamental structure of people’s beliefs can be more readily interpreted. More generally, this course is aimed at aiding researchers in better measuring the phenomena they are interested in. Though researchers of all sorts recognize measurement as a fundamental and crucial step of the scientific process, the topic is rarely given formal attention in core graduate courses beyond a cursory treatment of the concepts of reliability and validity.
The course will cover a variety of strategies for producing quantitative (usually intervallevel) variables from qualitative survey responses (which are usually believed to be measured at the nominal or ordinal level). We will begin with a discussion of measurement theory, giving detailed consideration to such concepts as measurement level and measurement accuracy. This will lead us to optimal scaling strategies, for assigning numbers to objects. Following that, we will cover a variety of methods for combining multiple survey responses in order to produce higherquality summary measures. These include: summated rating (or “Likert”) scales and reliability of measurement; principal components analysis; item response theory; factor analysis; multidimensional scaling; the vector model for profile data; and correspondence analysis. Each of these methods applies a measurement model to empirical data in order to generate a quantitative representation of the observations and survey items. The results provide new variables that can be employed as input to subsequent statistical models. These methods are not just “mere” measurement tools; in addition to quantifying observations, they often provide useful new insights about the systematic structure that exists within those observations. And, from a practical perspective, consideration of measurement theory and scaling methods can guide researchers to construct more powerful batteries of survey questions.
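As a small preview of the software side, the reliability (Cronbach's alpha) of a summated rating scale can be computed with the psych package. The sketch below is illustrative only and uses simulated Likert-type items rather than real survey data:

```r
# Illustrative reliability sketch with simulated Likert-type items
library(psych)                        # assumed installed; provides alpha()
set.seed(1)
latent <- rnorm(200)                  # a common latent trait
items  <- sapply(1:5, function(i) {
  x <- round(latent + rnorm(200, sd = 0.8))   # five noisy indicators
  pmin(pmax(x, -3), 3)                        # clip to a 7-point response range
})
colnames(items) <- paste0("item", 1:5)
psych::alpha(as.data.frame(items))    # Cronbach's alpha and item statistics
```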
Structure
On each class day, the morning session will be used to introduce new concepts, models, and techniques. Some of this discussion may extend into the afternoon sessions. But most of the time during the afternoon sessions will be devoted to class exercises that give students an opportunity to apply the material discussed during the morning session.
Day 1
General introduction and basic concepts
Measurement theory
Optimal scaling
Summated rating scales (or, additive indexes)
Day 2
Reliability
Cumulative scales (or, Mokken scaling, IRT)
Day 3
Biplots
Principal components analysis
Day 4
Factor analysis (exploratory and confirmatory)
Multidimensional scaling
Day 5
More multidimensional scaling
Correspondence analysis
Literature
Mandatory
Unfortunately, there is no single textbook that covers all of the topics in this course. In addition, many of the texts that are available have certain drawbacks that limit their usefulness for our purposes: They tend to be very expensive; they usually assume a high level of mathematical sophistication; they often contain sections that are out of date. Because of these considerations, the required readings can be taken from two alternative sources: (1) The Sage series on Quantitative Applications in the Social Sciences (i.e., the “little green books”); or (2) chapters from The Wiley Handbook of Psychometric Testing, edited by Paul Irwing, Tom Booth, and David J. Hughes.
Sage QASS monographs:
Dunteman, George H. (1989) Principal Components Analysis.
Jacoby, William G. (1991) Data Theory and Dimensional Analysis.
Kim, Jae-On and Charles W. Mueller. (1978a) Introduction to Factor Analysis.
Kim, Jae-On and Charles W. Mueller. (1978b) Factor Analysis: Statistical Methods and Practical Issues.
Kruskal, Joseph B. and Myron Wish. (1978) Multidimensional Scaling.
McIver, John and Edward G. Carmines. (1981) Unidimensional Scaling.
Van Schuur, Wijbrandt. (2011) Ordinal Item Response Theory: Mokken Scale Analysis.
Weller, Susan C. and A. Kimball Romney. (1990) Metric Scaling: Correspondence Analysis.
Chapters from The Wiley Handbook of Psychometric Testing:
DeMars, Christine. “Classical Test Theory and Item Response Theory.”
Hughes, David J. “Psychometric Validity: Establishing the Accuracy and Appropriateness of Psychometric Measures.”
Jacoby, William G. and David J. Ciuk. “Multidimensional Scaling: An Introduction.”
Jennrich, Robert J. “Rotation.”
Meijer, Rob R. and Jorge N. Tendeiro. “Unidimensional Item Response Theory.”
Mulaik, Stanley A. “Fundamentals of Common Factor Analysis.”
Revelle, William and David M. Condon. “Reliability.”
Timmerman, Marieke E.; Urbano Lorenzo-Seva; Eva Ceulemans. “The Number of Factors Problem.”
Supplementary / voluntary
Armstrong II, David A.; Ryan Bakker; Royce Carroll; Christopher Hare; Keith T. Poole; Howard Rosenthal. (2014) Analyzing Spatial Models of Choice and Judgment with R.
Bartholomew, David J.; Fiona Steele; Irini Moustaki; Jane I. Galbraith. (2008) Analysis of Multivariate Social Science Data (Second Edition).
Borg, Ingwer and Patrick Groenen. (2005) Modern Multidimensional Scaling: Theory and Applications (Second Edition).
Cudeck, Robert and Robert C. MacCallum, Editors (2007) Factor Analysis at 100.
Lattin, James; J. Douglas Carroll; Paul E. Green. (2003) Analyzing Multivariate Data.
Mulaik, Stanley A. (2010) Foundations of Factor Analysis (Second Edition).
Wickens, Thomas D. (1995) The Geometry of Multivariate Statistics.
Mandatory readings before course start
None.
Examination part
Course participants will be evaluated on the basis of oral participation (20%) and a major homework exercise (80%). In the homework exercise, course participants will apply one or more of the techniques covered in the class to actual survey data. Ideally, students will have their own survey data drawn from their respective substantive fields. But, if not, the course instructor can provide some survey data drawn from political science and sociological applications.
Prerequisites (knowledge of topic)
Each student is to submit an outline (no more than 500 words in length) of a specific research question and/or a set of hypotheses that s/he would like to examine via an experimental approach. This outline (in PDF format, file name format: “LastNameFirstNameResQues.pdf”) should be emailed to ghaeubl@ualberta.ca with “GSERMEMBS” as the subject line by 23:00 (St. Gallen time) on the Friday prior to course start.
As part of the introductions on the first morning of the course, students will be asked to give 2-minute presentations on these research questions/hypotheses (and to say a few words about their broader areas of research interest).
The objectives behind this assignment are:
• to facilitate learning by ensuring that students have their own concrete research questions/hypotheses in mind as they engage with the material covered in the course
• to provide the instructor with input for tailoring the course content and/or class discussions to students’ interests
Course content
The objective of this course is to provide students with an understanding of the essential principles and techniques for conducting scientific experiments on human behavior. It is tailored for individuals with an interest in doing research (using experimental methods) in areas such as psychology, judgment and decision making, behavioral economics, consumer behavior, organizational behavior, and human performance. The course covers a variety of topics, including the formulation of research hypotheses, the construction of experimental designs, the development of experimental tasks and stimuli, how to avoid confounds and other threats to validity, procedural aspects of administering experiments, the analysis of experimental data, and the reporting of results obtained from experiments. Classes are conducted in an interactive seminar format, with extensive discussion of concrete examples, challenges, and solutions.
Topics
The topics covered in the course include:
• Basic principles of experimental research
• Formulation of research question and hypothesis development
• Experimental paradigms
• Design and manipulation
• Measurement
• Factorial designs
• Implementation of experiments
• Data analysis and reporting of results
• Advanced methods and complex experimental designs
• Ethical issues
Literature
Recommended
There is no textbook for this course.
However, here are some recommended books on the design (and analysis) of experiments:
Abdi, Edelman, Valentin, and Dowling (2009), Experimental Design and Analysis for Psychology, Oxford University Press.
Field and Hole (2003), How to Design and Report Experiments, Sage.
Keppel and Wickens (2004), Design and Analysis: A Researcher’s Handbook, Pearson.
Kirk (2013), Experimental Design: Procedures for the Behavioral Sciences, Sage.
Martin (2007), Doing Psychology Experiments, Wadsworth.
Oehlert (2010), A First Course in Design and Analysis of Experiments, available online at:
http://users.stat.umn.edu/~gary/book/fcdae.pdf.
In addition, the following papers are recommended as background readings for the course:
Cumming, Geoff (2014), “The New Statistics: Why and How,” Psychological Science, 25, 1, 7-29.
Elrod, Häubl, and Tipps (2012), “Parsimonious Structural Equation Models for Repeated Measures Data, With Application to the Study of Consumer Preferences,” Psychometrika, 77, 2, 358-387.
Goodman and Paolacci (2017), “Crowdsourcing Consumer Research,” Journal of Consumer Research, 44, 1, 196-210.
McShane and Böckenholt (2017), “Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability,” Journal of Consumer Research, 43, 6, 1048-1063.
Meyvis and Van Osselaer (2018), “Increasing the Power of Your Study by Increasing the Effect Size,” Journal of Consumer Research, 44, 5, 1157-1173.
Morales, Amir, and Lee (2017), “Keeping It Real in Experimental Research: Understanding When, Where, and How to Enhance Realism and Measure Consumer Behavior,” Journal of Consumer Research, 44, 2, 465-476.
Oppenheimer, Meyvis, and Davidenko (2009), “Instructional Manipulation Checks: Detecting Satisficing to Increase Statistical Power,” Journal of Experimental Social Psychology, 45, 867-872.
Pieters (2017), “Meaningful Mediation Analysis: Plausible Causal Inference and Informative Communication,” Journal of Consumer Research, 44, 3, 692-716.
Simmons, Nelson, and Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22, 11, 1359-1366.
Simonsohn, Nelson, and Simmons (2014), “P-Curve: A Key to the File-Drawer,” Journal of Experimental Psychology: General, 143, 2, 534-547.
Spiller, Fitzsimons, Lynch, and McClelland (2013), “Spotlights, Floodlights, and the Magic Number Zero: Simple Effects Tests in Moderated Regression,” Journal of Marketing Research, 50, 277-288.
Zhao, Lynch, and Chen (2010), “Reconsidering Baron and Kenny: Myths and Truths about Mediation Analysis,” Journal of Consumer Research, 37, 197-206.
Examination part
Students are to complete a (2-hour) written exam in the afternoon of the last day of class. In the exam, students are given a description of a research question, along with specific hypotheses. They are to produce a proposal for an experiment, or a series of experiments, for testing these hypotheses. The exam is “open book” – that is, students are free to use any appropriate local resources they wish in developing their proposal. (Here, “local” means that students may not access the Internet or other communication networks.)
Regular attendance and active participation in class discussion are expected.
Common standards of academic integrity apply. Work submitted by students must be their own – submitting what someone else has created is not acceptable.
Grading
A student’s overall grade is based on the following components:
– Initial Assignment and Presentation: 10%
– Class Participation: 20%
– Exam: 70%
Prerequisites (knowledge of topic)
Probability theory at a good level, an affinity for mathematical problems, and advanced econometrics.
Hardware
Laptops for the PC sessions.
Software
Exercises will require the usage of the statistical software R.
Learning objectives
The goal of this course is to provide a comprehensive overview of the mathematical theory behind machine learning. How can we characterize a good prediction? How can we construct good predictions based on machine learning methods? What is the relationship between (1) estimation error, (2) sample size, and (3) model complexity? How do these abstract concepts apply in particular machine learning methods such as Boosting, Support Vector Machines, Ridge, and LASSO? The objective of the course is to give detailed and intuitively clear answers to those questions. As a result, participants will receive a good preparation for theoretical and empirical work with and on machine learning methods.
Course content
1. Principles of statistical theory (loss function and risk, approximation vs estimation error, no free lunch theorems)
2. Concentration inequalities for bounded loss functions (Hoeffding’s Lemma, Azuma-Hoeffding’s inequality, bounded difference inequality, Bernstein’s inequality, McDiarmid’s inequality)
3. Classification (binary case and its loss function, Bayesian classifier, Optimality of the Bayesian Classifier, Oracle inequalities for the Bayesian classifier, Finite dictionary learning case, The impact of noise on convergence rates, infinite dictionary)
4. General case (general loss functions, symmetrization, Rademacher complexity, Covering numbers, Chaining)
5. Applications Part 1: Support Vector Machines, boosting
6. The mathematics and statistics of regularization methods (LASSO, Ridge, elastic net)
7. Applications Part 2: applying LASSO and Ridge
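To fix ideas for topic 2, the prototypical concentration result is Hoeffding's inequality: for independent random variables X_1, ..., X_n with X_i taking values in [a_i, b_i], the sample mean concentrates around its expectation at an exponential rate:

```latex
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - \mathbb{E}[X_i]\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n}(b_i - a_i)^2} \right).
```

Bounds of this type are the basic tool for controlling the estimation error of empirical risk minimizers in the later topics.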
Structure
Part I.
Concepts of statistical learning: Concentration inequalities, concepts of statistical theory (topics 1 and 2 from the course content)
Part II.
The math of Machine learning and Classification. (topic 3 from course content)
Part III.
The Machine learning methods and the general case (topics 4 and 5 from the course content)
Part IV.
LASSO and Ridge (topics 6 and 7 from the course content).
Literature
Mandatory
There will be a lecture script.
Supplementary / voluntary
The book “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman gives a nice introduction to Boosting and Support Vector Machines.
Further topic-specific, non-obligatory references will be given during the lecture.
Examination part
Final written examination at the end of the course (100%)
Supplementary aids
‘Closed Book’. No external references allowed.
Examination content
While the course is very technical, only intuition is required for the exam. “Intuition” here means: state which assumptions are necessary for a result and give a verbal, possibly graphical, reason (participants able to give precise mathematical reasoning may do so instead). Participants should be able to give the intuition for the following concepts: calculating a loss function; the main concentration inequalities; the optimality of the Bayesian classifier; the finite dictionary learning case; the impact of noise (Massart’s noise condition, the Mammen–Tsybakov noise condition); the infinite dictionary learning problem; defining and explaining Rademacher complexity and its relation to cardinality; calculating the VC dimension and its applications to empirical risk; explaining symmetrization and its relation to Rademacher complexity in the general case; how these concepts apply to Support Vector Machines and Boosting; and the mathematical intuition behind the LASSO and Ridge methods, in particular under orthonormal design.
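For instance, the Rademacher complexity listed above is standardly defined (quoted here only for orientation) for a function class $\mathcal{F}$ and sample $X_1,\dots,X_n$ as

```latex
\mathcal{R}_n(\mathcal{F})
  \;=\; \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right],
```

where the $\sigma_i$ are i.i.d. Rademacher variables taking the values $\pm 1$ with probability $1/2$ each; its relation to cardinality comes from Massart's finite-class lemma.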
Examination relevant literature
Lecture script
Qualitative Research Methods and Data Analysis presents strategies for analyzing and making sense of qualitative data. Both descriptive and interpretive qualitative studies will be discussed, as will more defined qualitative approaches such as grounded theory, narrative analysis, and case studies. The course will briefly cover research design and data collection strategies but will largely focus on analysis. In particular, we will consider how researchers develop codes and integrate memo writing into a larger analytic process. The purpose of coding is to provide a focus to qualitative analysis; it is critical to have a handle on coding practices as you move deeper into analysis. The course will present coding and memo writing as concurrent tasks that occur during an active review of interviews, documents, focus groups, and/or multi‑media data. We will discuss deductive and inductive coding and how a codebook evolves, that is, how codes might “emerge” and shift during analysis. Managing codes includes developing code hierarchies, identifying code “constellations,” and building multidimensional themes.
The class will present memo writing as a strategy for capturing analytical thinking, inscribed meaning, and cumulative evidence for condensed meanings. Memos can also resemble early writing for reports, articles, chapters, and other forms of presentation. Researchers can also mine memos for codes and use memos to build evocative themes and theory. Coding and memo writing are discussed in the context of data-driven qualitative research beginning with design and moving toward presentation of findings. The course will also discuss using visual tools in analysis, such as diagramming core quotations from data to holistically present the participant’s key narratives. Visual tools can also assist in looking horizontally across many transcripts to identify connective themes and link the parts to the whole.
Software
We will spend one day learning a qualitative analysis software package:
GSERM St. Gallen Atlas.TI
GSERM Ljubljana NVIVO
If the course is held in a remote format, we will work with MAXQDA.
The methods discussed in the course will be applicable to qualitative studies in a range of fields, including the behavioral sciences, social sciences, health sciences, communications, and business.
Structure
Day 1
 Core Principles and Practices in Qualitative Data Inquiry
 Qualitative Research Design: An Overview
 Data types
 Comparative strategies
 Qualitative sampling
 Triangulation
Analysis Task 1: Memo Writing
 Document summary memos
 Key-quote memos
 Methods memos
Day 2
 Analysis Task 2: Using Visual Tools
 Episode profiles
 Making sense of data using diagrams
 Working with core quotations
Analysis Task 3: Coding Qualitative Data
 Descriptive coding
 Interpretive coding
 Strategies for coding
 Line‑by‑line coding
 Creating a codebook
Day 3
 Introduction to Qualitative Software: MAXQDA (see information at “Software”)
a. Overview
b. Beginning a project
c. Writing comments and memos
d. Coding data
 Hands‑on Exercises Using MAXQDA
 Analysis in MAXQDA
 Exploring codes and memos in queries
 Matrices and diagrams
 Blending quantitative and qualitative data
Day 4
 Methodological Traditions
a. Grounded theory
b. Narrative analysis
c. Case study
d. Pragmatic qualitative analysis
Day 5
 Qualitative Research Design: Revisiting Strategies
 Data Collection considerations
 Types
• Interviews
• Focus groups
• Other types of data
 Developing interviewing skills
 Other data types
 Evaluating qualitative articles
 Class discussion
Suggested Reading (Articles)
Electronic version of these articles will be provided to registered participants:
Ahlsen, Birgitte, et al. 2013. “(Un)doing Gender in a Rehabilitation Context: A Narrative Analysis of Gender and Self in Stories of Chronic Muscle Pain.” Disability and Rehabilitation: 1–8.
Charmaz, Kathy. 1999. “Stories of Suffering: Subjective Tales and Research Narratives.” Qualitative Health Research 9:362‑82.
Sandelowski, Margarete. 2000. “Whatever Happened to Qualitative Description?” Research in Nursing and Health 23:334‑40.
Rouch, Gareth, et al. 2010. “Public, Private and Personal: Qualitative Research on Policymakers’ Opinions on Smokefree Interventions to Protect Children in ‘Private’ Spaces.” BMC Public Health 10:797‑807.
Suggested Reading (Books)
Charmaz, Kathy. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. Sage.
Marshall, Catherine, and Gretchen B. Rossman. 2006. Designing Qualitative Research. 4th ed. Sage.
Yin, Robert. 2013. Case Study Research Design and Methods. Sage.
Examination
Participants will be asked to read several interviews or journal entries and generate a preliminary analysis of the data using techniques discussed during the course. This examination will be due three weeks after the course ends.
Examination content
Students will have to demonstrate familiarity with the differences between grounded theory, narrative analysis, case study, and pragmatic analysis. The assignment will require them to choose one of these approaches to design a study and analyze several documents provided by the instructor. Their preliminary analysis will include memos, a codebook, diagrams, early findings, and reflection on next steps.
Prerequisites and content
Prerequisite knowledge for the course includes the fundamentals of probability and statistics, especially hypothesis testing and regression analysis. This intermediate level course assumes that students can interpret the results of Ordinary Least Squares, Probit, and Logit regressions. They should also be familiar with the problems that are most common in regression, such as multicollinearity, heteroscedasticity, and endogeneity. Finally, students should be comfortable working with computers and data. No prior knowledge of R or network analysis is required.
The concept of “social networks” is increasingly a part of social discussion, organizational strategy, and academic research. The rising interest in social networks has been coupled with a proliferation of widely available network data, but there has not been a concomitant increase in understanding how to analyze social network data. This course presents concepts and methods applicable for the analysis of a wide range of social networks, such as those based on family ties, business collaboration, political alliances, and social media.
Classical statistical analysis is premised on the assumption that observations are sampled independently of one another. In the case of social networks, however, observations are not independent of one another, but are dependent on the structure of the social network. The dependence of observations on one another is a feature of the data, rather than a nuisance. This course is an introduction to statistical models that attempt to understand this feature as both a cause and an effect of social processes.
Since network data are generated in a different way than many other kinds of social data, the course begins by considering the research designs, sampling strategies, and data formats that are commonly associated with network analysis. A key aspect of performing network analysis is describing various elements of the network’s structure. To this end, the course covers the calculation of a variety of descriptive statistics on networks, such as density, centralization, centrality, connectedness, reciprocity, and transitivity. We consider various ways of visualizing networks, including multidimensional scaling and spring embedding. We learn methods of estimating regressions in which network ties are the dependent variable, including the quadratic assignment procedure and exponential random graph models (ERGMs). We consider extensions of ERGMs, including models for two-mode data and networks over time.
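Two of the descriptive statistics mentioned above can be illustrated in a few lines of code. The sketch below is plain Python rather than the R used in class, and the function names are my own; it computes density and normalized degree centrality for a toy undirected network.

```python
# Hedged illustration (plain Python, though the course itself works in R):
# density and normalized degree centrality from an undirected edge list.

def density(n_nodes, edges):
    """Share of possible undirected ties that are actually present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible

def degree_centrality(n_nodes, edges):
    """Each node's degree divided by the maximum possible degree, n - 1."""
    degree = {v: 0 for v in range(n_nodes)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {v: d / (n_nodes - 1) for v, d in degree.items()}

# A toy 4-node "star" network: node 0 is tied to everyone else.
edges = [(0, 1), (0, 2), (0, 3)]
print(density(4, edges))            # 3 of 6 possible ties -> 0.5
print(degree_centrality(4, edges))  # node 0 is maximally central -> 1.0
```

The same quantities are one-liners with R packages such as `sna`; the point here is only that these measures are simple functions of the adjacency structure.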
Instruction is split between lectures and handson computer exercises. Students may find it to their advantage to bring with them a social network data set that is relevant to their research interests, but doing so is not required. The instructor will provide data sets necessary for completing the course exercises.
Structure
Day 1: Fundamentals of Network Analysis
 Why undertake network analysis?
 How network analysis differs from other statistical methods
 Elements of networks (Nodes, links, modes, attributes, matrices, graphs)
 Key concepts (directionality, symmetry)
 Visualization
 Sampling
 Survey methods
 Working with network data in R
Day 2: Descriptive and Inferential Statistics
 Density
 Degree distributions
 Centrality (degree, betweenness, closeness, power)
 Centralization
 Components and cores
 Triads, triples, and transitivity
 Clustering
 Correlation and the Quadratic Assignment Procedure
 Random graphs
 Descriptive and inferential statistics in R
Day 3: Exponential Random Graph Models (ERGMs)
 Theory
 Specification
 Estimation
 Goodness of Fit
 Working with one-mode and two-mode ERGMs in R
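For orientation, the ERGMs covered on Day 3 share a standard general form (as in, e.g., the Hunter et al. 2008 reference in the literature list):

```latex
\Pr(\mathbf{Y} = \mathbf{y})
  \;=\; \frac{\exp\{\theta^{\top} g(\mathbf{y})\}}{\kappa(\theta)},
```

where $g(\mathbf{y})$ is a vector of network statistics (edges, triangles, homophily terms, and so on), $\theta$ the coefficients to be estimated, and $\kappa(\theta)$ the normalizing constant obtained by summing over all possible networks; the intractability of $\kappa(\theta)$ is why estimation is a topic in its own right on the schedule.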
Day 4: Network Data over Time Using Temporal ERGMs
Day 5: Student Presentations and Extensions of ERGM
 Student Presentations
 Additional extension of ERGMs, if time allows
 Concluding Discussion
Literature
Breiger, Ronald L. 1974. “The Duality of Persons and Groups.” Social Forces 53 (2): 181–190.
Burt, Ronald S. 1992. Structural Holes: The Social Structure of Competition. Cambridge, MA: Harvard University Press. Pp. 8–49.
Butts, Carter T. 2008. “network: A Package for Managing Relational Data in R.” Journal of Statistical Software 24 (2): 1–36.
Butts, Carter T. 2008. “Social Network Analysis with sna.” Journal of Statistical Software 24 (6): 1–51.
Cranmer, Skyler J., Bruce A. Desmarais and Jason W. Morgan. 2021. Inferential Network Analysis. New York: Cambridge University Press.
Cranmer, Skyler J., Philip Leifeld, Scott D. McClurg, and Meredith Rolfe. 2017. “Navigating the Range of Statistical Tools for Inferential Network Analysis.” American Journal of Political Science 61 (1): 237–251.
Denny, Matthew J. 2016. “Getting Started with GERGM.” https://www.mjdenny.com/getting_started_with_GERGM.html
Emirbayer, Mustafa. 1997. “Manifesto for a Relational Sociology.” American Journal of Sociology 103 (2): 281–317.
Freeman, Linton C. 1977. “A Set of Measures of Centrality Based on Betweenness.” Sociometry 40 (1): 35–41.
Gould, Roger V., and Roberto M. Fernandez. 1989. “Structures of Mediation: A Formal Approach to Brokerage in Transaction Networks.” Sociological Methodology 19: 89–126.
Granovetter, Mark. 1973. “The Strength of Weak Ties.” American Journal of Sociology 78 (6): 1360–1380.
Heaney, Michael T. 2014. “Multiplex Networks and Interest Group Influence Reputation: An Exponential Random Graph Model.” Social Networks 36 (1): 66–81.
Heaney, Michael T., and Philip Leifeld. 2018. “Contributions by Interest Groups to Lobbying Coalitions.” Journal of Politics 80 (2): 494–509.
Heckathorn, Douglas D. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174–199.
Hunter, David R., Mark S. Handcock, Carter T. Butts, Steven M. Goodreau, and Martina Morris. 2008. “ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks.” Journal of Statistical Software 24 (3): 1–29.
Krackhardt, David. 1992. “The Strength of Strong Ties: The Importance of Philos in Organizations.” Pp. 216–239 in Nitin Nohria and Robert Eccles, eds., Networks and Organizations: Structure, Form, and Action. Boston, MA: Harvard Business School Press.
Laumann, Edward O., Peter V. Marsden, and David Prensky. 1983. “The Boundary Specification Problem in Network Analysis.” Pp. 18–34 in Ronald S. Burt and Michael Minor, eds., Applied Network Analysis. Beverly Hills, CA: Sage.
Leifeld, Philip, and Skyler J. Cranmer. 2019. “A Theoretical and Empirical Comparison of the Temporal Exponential Random Graph Model and the Stochastic Actor-Oriented Model.” Network Science 7 (1): 20–51.
Leifeld, Philip, Skyler J. Cranmer, and Bruce A. Desmarais. 2018. “Temporal Exponential Random Graph Models with btergm: Estimation and Bootstrap Confidence Intervals.” Journal of Statistical Software 83 (6): 1–36.
McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27: 415–444.
Morris, Martina, Mark S. Handcock, and David R. Hunter. 2008. “Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects.” Journal of Statistical Software 24 (4): 1–24.
Podolny, Joel M. 2001. “Networks as the Pipes and Prisms of the Market.” American Journal of Sociology 107 (1): 33–60.
Scott, John T. 2017. Social Network Analysis, 4th ed. London: Sage.
Strogatz, Steven. 2010. “The Enemy of My Enemy.” New York Times (February 14).
Watts, Duncan. 1999. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton: Princeton University Press. Pp. 340.
Exam
75%: There will be one written, computer-based problem set each day, Monday through Thursday (four assignments in total). Time will be allocated in class to complete the assignments, which must be submitted each day.
25%: On the final day of the course, each student will give a presentation to the class on the results of her or his research project for the week. Giving a presentation to the course is required to receive a satisfactory grade in the course.
Course content
The goal is to develop an applied and intuitive (not purely theoretical or mathematical) understanding of the topics and procedures, so that participants can use them in their own research and also understand the work of others. Whenever possible, presentations will be given in “Words,” “Picture,” and “Math” languages in order to appeal to a variety of learning styles.
Advanced regression topics will be covered only after the foundations have been established. The ordinary least squares multiple regression topics that will be covered include:
 Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
 Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
 Dichotomous dependent variables: Logit and Probit analysis.
 Outliers, influence, and leverage.
 Advanced diagnostic plots and graphical techniques.
 Matrix algebra: A quick primer. (Optional)
 Regression models… now from a matrix perspective.
 Heteroskedasticity: Definition, consequences, detection, and correction.
 Autocorrelation: Definition, consequences, detection, and correction.
 Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
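For orientation, the standard result behind the final bullet can be stated compactly: when the error variance is not constant, $\operatorname{Var}(\varepsilon \mid X) = \sigma^2 \Omega$ with $\Omega \neq I$, OLS remains unbiased but is no longer efficient, and the GLS estimator

```latex
\hat{\beta}_{GLS} = \left( X^{\top} \Omega^{-1} X \right)^{-1} X^{\top} \Omega^{-1} y
```

restores efficiency; WLS is the special case in which $\Omega$ is diagonal, so each observation is simply weighted by the inverse of its error variance.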
Structure
This course will utilize approximately 325 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in nine Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These nine Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in‑class examples are presented using SPSS, participants are free and encouraged to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive “Tutorial and Answer Key” will be provided for each.
Prerequisites
This course is a continuation of Tim McDaniel’s “Regression I – Introduction” course. While it is not necessary that participants have taken that specific course, they will need to be familiar with many of the topics that are covered in it.
Note: We will use matrix algebra in the second half of the course. We will not use calculus.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information pertaining to several supplemental (and optional) readings for each of the nine Packets of Lecture Transcripts.
 Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
 The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
 Finally, I have included several articles from a number of journals across several academic disciplines. Some of these optional supplemental readings are older classics and others are more recently written and published.
Examination part
Decentral ‑ Written examination (100%)
Supplementary aids
Open Book
Examination content
The potential substantive content areas for the Final Examination are:
 Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
 Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
 Dichotomous dependent variables: Logit and Probit analysis.
 Outliers, influence, and leverage.
 Advanced diagnostic plots and graphical techniques.
 Regression models… now from a matrix perspective.
 Heteroskedasticity: Definition, consequences, detection, and correction.
 Autocorrelation: Definition, consequences, detection, and correction.
 Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
Since this final examination is the only artifact that will be formally graded in the course, it will determine the course grade. Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the final examination.
The final examination will be written, open-book (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed), and open-note. No other materials, including laptops, cell phones, or other electronic devices, will be permitted. The written final exam will be two hours in length and administered during the last course meeting.
Literature
Literature relevant to the exam:
 Lecture Transcripts (nine Packets; approximately 325 pages).
 Class notes (taken by each participant individually).
 Assignment Tutorial and Answer Key documents (for each optional data analysis project).
Supplementary/Voluntary literature not directly relevant to the exam:
 Optional supplemental readings listed in the course syllabus (and discussed earlier).
 Any other textbooks, articles, etc., the participant reads before or during the course.
Prerequisites (knowledge of topic)
This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of at least several machine learning classification methods. Students with equivalent real-world experience (via other ML courses or on-the-job experience) are also welcome.
Hardware
A laptop computer is required to complete the inclass exercises.
Software
R (https://www.rproject.org/) and R Studio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover the practical techniques that are not often found in textbooks but discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and in learning more advanced techniques to improve the performance of traditional ML methods.
Structure
The course will be designed to be interactive, with ample time for handson practice. Each day will include at least one lecture based on the day’s topic in addition to a handson “lab” section to apply the learnings to a competition dataset (or one’s own data).
The tentative schedule is as follows:
Day 1: Handling messy data
Discussion: Typically, 80% of the time spent on ML is for data preparation. Why?
Lecture: Learning to explore data
Lecture: Missing values – imputation and other strategies
Lecture: The R data pipeline – tidyverse
Lab: Getting to know your data
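The simplest baseline from the imputation lecture, mean imputation, can be sketched in a few lines. This is a plain-Python illustration with hypothetical function names, not the tidyverse workflow used in class:

```python
# Hedged sketch: replace missing numeric values (None) with the mean of
# the observed values -- a common, if crude, baseline imputation strategy.

def impute_mean(values):
    """Return a copy of `values` with None entries replaced by the mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [23, None, 31, 26, None]
print(impute_mean(ages))  # [23, 26.666..., 31, 26, 26.666...]
```

More careful strategies (model-based imputation, missingness indicators) preserve more of the data's structure; the lecture covers the trade-offs.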
Day 2: Understanding ML performance
Discussion: What makes a successful ML model?
Lecture: Getting beyond accuracy – other performance measures
Lecture: The “no free lunch” theorem
Lecture: Estimating future performance – sampling methods, model selection
Lab: Comparing models on your dataset with ROC curves
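To illustrate "getting beyond accuracy," the sketch below (plain Python with hypothetical names, not the R workflow used in class) computes precision, recall, and F1, which can diverge sharply from accuracy on imbalanced data:

```python
# Hedged sketch: precision, recall, and F1 from true vs. predicted labels.

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# 90% accuracy, yet the model finds only half of the rare positives:
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # precision 1.0, recall 0.5
```

ROC curves extend this idea by tracing the true-positive/false-positive trade-off across classification thresholds rather than at a single cut-point.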
Day 3: Improving ML performance
Discussion: What factors keep ML models from perfect prediction?
Lecture: Tuning stock models – automated parameter tuning
Lecture: Meta-learning – ensembles, stacked models
Lab: Machine Learning Competition (Round 1)
Day 4: “Big data” problems
Discussion: Is more data always better? Why or why not?
Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
Lecture: Imbalanced datasets – under- and oversampling strategies
Lecture: Improving R’s performance on big data
Lab: Machine Learning Competition (Round 2)
Day 5: Next-generation “Black Box” methods
Discussion: What are the strengths and weaknesses of man versus machine?
Lecture: Deep Learning – Keras, Tensorflow
Lecture: Text embeddings – word2vec
Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
Discussion: Results of ML Competition – winners’ tips and tricks
Lab: Work on your final project
Literature
Mandatory
PDFs with readings will be distributed prior to the start of each class day.
Supplementary / voluntary
None required.
Mandatory readings before course start
Students should have R and R Studio installed on their laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
80% of the course grade will be based on a project and final report (approximately 5–10 pages), to be delivered within 2–3 weeks after the course in R Notebook format. The project is based on a challenging real-world dataset given to all course participants. The project will be graded based on its use of the methods covered in class as well as on making appropriate conclusions from the data.
The remaining 20% of the course grade will be based on participation during in-class discussions and performance during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.
Supplementary aids
Students may reference any literature as needed when writing the final report.
Examination content
The primary goal of the final project is for students to gain an ability to solve difficult ML tasks. The project should reflect an understanding of the material covered throughout the week, as well as an ability to apply the material in new and innovative ways.
Literature
Not applicable.
Prerequisites (knowledge of topic)
Note that this course is designed for the applied analyst; its focus is on teaching you the tools you need for effective model evaluation, presentation, and interpretation. Support and code for model estimation and post-estimation are provided for R as well as Stata.
At the end of this course, you should have a clear understanding as to which types of models and methods are available to answer different research questions, and also have experience applying a varied toolkit of these models.
Course content
Course website: http://www.shawnasmith.net/gserm
Software & Computing:
Models for this course are presented in broad strokes; however, a major component of this course is application through model estimation, post-estimation, and interpretation. For pedagogical purposes, I will use Stata 15 in course lectures; however, course support will be provided for both Stata and R. While Stata—and most popular statistical software packages—includes native estimation (and even post-estimation) commands for categorical models, we will also use a set of ado-files written for Stata by Scott Long & Jeremy Freese that facilitate the (at times complicated) interpretation of categorical models within Stata. This suite of commands is called SPost. These post-estimation commands can also be emulated in R, although this will require more investigation on the part of the student. A variety of R packages relevant to the course now exist, and we are happy to provide guidance on them when possible. With respect to the machine learning models, we will be making use of several user-written packages in both Stata and R. If you are taking this course for credit, you will need to complete assignments using either Stata with SPost 13 commands or R with appropriate post-estimation commands.
• Getting Access to Stata/R:
o Stata: Access to Stata is available through the GSERM labs. Several versions are also available to purchase at different price points; I am happy to provide guidance as to which would be most appropriate for your needs.
o R: R is free. You can download it at https://www.rproject.org/
R Studio is a free program that greatly upgrades R’s user interface and can be downloaded at https://www.rstudio.com/
• Getting Started using Stata: New to Stata? No worries—this course will catch you up quickly. However, I strongly suggest working through the “Getting Started using Stata” document available on the course website (http://shawnasmith.net/gserm/) prior to Day 1 of class. Feel free to get in touch if you have questions.
o New to R? As R has a steeper learning curve, I would not recommend attempting to learn R solely for the purposes of this course. However, I am happy to recommend resources for those of you so inclined:
The two textbooks recommended above provide good introductions for ‘getting started’ with R, as well as lots of in situ code.
Mike Marin (UBC Public Health) has a great series of videos introducing R online at http://www.statslectures.com/index.php/rstatsvideostutorials/gettingstartedwithr.
• Downloading Stata packages: If you will be using Stata on a personal computer, then you will need to install several user-written packages. Here’s the step-by-step:
o Prereqs: Internet access & administrative privileges
o In Stata, type search {package name} into the command line.
o In the viewer window that appears, click the link for the package
o Follow directions to install
• Accessing course data and usecda: Course data will be available for download through the course website. It is also available for use in Stata with the usecda command. usecda is a command written specifically for this course to expedite access to course datasets & examples. It is currently only available for download through Shawna’s Github account. To download on any computer:
o Tell Stata where to download the file from by using the following command in the Stata command line or in a do-file:
net from “https://shawnana79.github.io/data”
o Install the program by either: (a) clicking on the blue usecda link that appears in the output following the previous command; or (b) using the command: net install usecda
o Check out the help file by typing help usecda in the Stata command line
By nature or by measurement, dependent variables of interest to social and behavioral scientists are frequently categorical. Outcomes that include several ranked or unranked, non-continuous categories—like vote choice, social media platform preference, brand loyalty, and/or condom use—are often of interest, with scientists expressly interested in developing models to explain or classify variation therein. Explanatory models are process-focused and aim to determine the individual impact of factors that contribute to a particular outcome, often based on a priori theory—e.g., “How does social class affect whether an individual voted for the Conservatives in 2019?” Classification models, alternatively, are outcome-focused and aim to identify the set of factors that most accurately classify (or predict) a particular outcome—e.g., “How do the Tories best use information from polls, geography, weather, Twitter feeds, and/or social demographics to predict who voted Conservative in 2019?”
Chances are your research involves a categorical outcome—binary, ordinal, or multinomial—and options thus abound for the modeling approach(es) you might take to address your research question of interest. This course is designed to provide an overview of a number of parametric and nonparametric approaches to exploring your outcome of interest, from both explanatory and classification perspectives.
Structure
N.B.: The exact content of the course will vary depending on the background & interests of participants. In other words, this schedule is subject to change.
Topic on Monday:
• Overview of class; introduction to models; some vocabulary (Pt 1)
Explanatory Models
• The 30-Minute Review of linear regression; Identification; Maximum Likelihood Estimation
• Linear probability model; Identification of Pr(y=1); Two philosophies: transformational and latent variable approach for binary outcomes
• Estimation of BRM; Odds ratios
• Using Pr(y=1) to interpret the BRM (pt. 1): tables & plots; discrete change
Suggested Readings:
• Long Ch. 1
• HT&J Ch. 1
• JWH&T Ch. 1
• Long Ch. 2; P&X Ch. 2; L&F Ch. 1–2; F&W Ch. 1–2 or Monogan Ch. 1–2 (R)
• Long Ch. 3; P&X Ch.1
Due:
A1: Math Review
Topic on Tuesday:
• Using Pr(y=1) to interpret the BRM (pt. 2): plots; difference at means vs. mean of difference; partial change/margins
• Hypothesis testing; Wald and LR tests; Confidence intervals
• Scalar measures of fit: pseudo-R2, AIC, BIC
• Ordinal variables; a latent variable model
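For a concrete sense of the scalar fit measures listed above, here is a minimal sketch (written in Python for self-containment, though the course software is R/Stata) computing McFadden's pseudo-R2, AIC, and BIC from hypothetical log-likelihood values:

```python
import math

def fit_measures(ll_model, ll_null, k, n):
    """McFadden's pseudo-R2, AIC, and BIC from the fitted and null
    (intercept-only) log-likelihoods, for k parameters and n cases."""
    pseudo_r2 = 1.0 - ll_model / ll_null
    aic = -2.0 * ll_model + 2.0 * k
    bic = -2.0 * ll_model + k * math.log(n)
    return pseudo_r2, aic, bic

# Hypothetical log-likelihood values, for illustration only.
r2, aic, bic = fit_measures(ll_model=-120.5, ll_null=-150.0, k=4, n=500)
```

Note that BIC penalizes parameters more heavily than AIC whenever n > 7 or so, which is why the two criteria can rank competing models differently.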
Suggested Readings:
• Long Ch. 4
• Long Ch. 5; P&X Ch. 7
Topic on Wednesday:
• Estimation of ORM; latent variable interpretations; Pr(y=k)
• Odds ratios; parallel regression assumption and proportional odds
• Multinomial logit as a set of BLMs
• Calculating predicted probabilities; Interpretation using Pr(y=k)
• Odds ratio plots; Discrete change plots
• Tests for the MNLM; IIA
Suggested Readings:
• Long Ch. 6; P&X Ch. 8
Topic on Thursday:
Classification Models
• Overview of classification models; some vocabulary (Pt. 2)
• Using Pr(y=1) to interpret the BRM (pt. 3): AUC, ROC, penalization
• Strengths & limitations of BRM for classification
• Introduction to partition-based models for classification
• CART models: Testing, evaluating, improving
• Random forests: Testing, evaluating, improving
• Strengths & limitations of CART methods for classification
Suggested Readings:
• JWH&T Ch. 4.1–4.3, 5; HT&J Ch. 4.4
• JWH&T Ch. 8; HT&J Ch. 9.1–9.3
Due:
A2: BRM + T&F
Topic on Friday:
• Introduction to semi-linear models for classification
• kNN models: Testing, evaluating, improving
• Strengths & limitations of kNN methods for classification
• Support Vector Machine (SVM) models: Testing, evaluating, improving
• Strengths & limitations of SVM models for classification
• Course wrapup/other topics on request, as time allows
Suggested Readings:
• JWH&T Ch. 4.6; HT&J Ch. 2.3; 13.1, 13.3
• JWH&T Ch. 9; HT&J Ch. 12.1–12.3
Sunday:
A3: ORM & MNLM and A4: Classification Methods
due via email to shawnana@umich.edu; include "GSERM-Cat:" in the subject line
The first half of this course will focus on explanatory methods, with an emphasis on regression methods for categorical outcomes. Although regression models for categorical outcomes are often conceptualized as extensions of linear regression models (i.e., ‘generalized linear models’), categorical outcomes violate key assumptions of the simple linear regression framework, and thus require both alternative estimation strategies and additional identification assumptions. These assumptions have implications for model interpretation, notably for interpreting coefficients, comparing coefficients, testing for significance, and assessing model fit. We will begin by deriving the logit and probit models for binary outcomes, and introduce a variety of post-estimation tools for interpreting the effects of predictor variables on binary outcomes. We will then extend these models and methods of interpretation from binary to ordinal outcomes using the ordinal logit and probit models, and to multinomial outcomes with the multinomial logit model. Methods for examining model fit and evaluating significance tests will also be discussed.
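To make the interpretation issues concrete: in a logit model the odds ratio exp(b) is constant across the whole range of a predictor, while the discrete change in Pr(y=1) is not. A minimal sketch with hypothetical coefficients (written in Python for self-containment; the course itself works in R and Stata):

```python
import math

# Hypothetical logit coefficients (in practice these come from maximum
# likelihood estimation of the binary regression model).
intercept, b_age = -2.0, 0.05

def pr_y1(age):
    """Inverse-logit: Pr(y = 1 | age) = 1 / (1 + exp(-(intercept + b_age*age)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + b_age * age)))

# The odds ratio for a one-unit increase in age is exp(b_age), the same
# everywhere on the age scale...
odds_ratio = math.exp(b_age)

# ...but the discrete change in Pr(y = 1) depends on where it is evaluated:
change_at_30 = pr_y1(31) - pr_y1(30)
change_at_60 = pr_y1(61) - pr_y1(60)
```

This nonlinearity is precisely why the course devotes so much time to tables, plots, and discrete-change summaries of Pr(y=1) rather than raw coefficients.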
In the second half of the course, we will turn our attention to methods for classification and prediction. We will begin by re-examining what we have already learned (namely, logit models for categorical outcomes) and discussing the strengths and weaknesses of these models for prediction and classification. These familiar models will also be used to introduce the concepts of training a model, evaluating model performance, and improving model performance. As before, our focus will be on the binary case, with extensions for ordinal and multinomial cases. We will then move on to partition-based models, specifically Classification and Regression Tree (CART) models and random forests, followed by semi-linear models, namely k-Nearest Neighbors (kNN) and Support Vector Machines (SVM). The focus of these approaches will be on classification and prediction for binary outcomes, but extensions for outcomes with multiple categories will also be presented.
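As a preview of the classification workflow, the sketch below trains a k-Nearest Neighbors classifier on a tiny invented data set and evaluates it on held-out points; the data and settings are purely illustrative (Python is used here for self-containment, though the course works in R):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy data: two well-separated classes (not course data).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

# Evaluate on held-out points: accuracy is the share classified correctly.
held_out = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]
accuracy = sum(knn_predict(train, q) == lab for q, lab in held_out) / len(held_out)
```

The train/held-out split above is the template for everything in the second half of the course: fit on one subset of cases, judge performance on another.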
Literature
Mandatory
Lecture Notes for Foundations of Machine Learning and Regression Methods for Categorical Outcomes. This coursepack contains copies of the overheads for the lectures, data set codebooks, and materials used in the computing lab. It will be provided at the beginning of our first class session. Be sure to bring these notes to all lecture and lab sessions.
• For participants who prefer electronic versions, component parts are also available on the course website.
Supplementary / voluntary
Explanatory models
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Hereafter: Long
Powers, Daniel A. & Yu Xie. 2008. Statistical Methods for Categorical Data Analysis. 2nd Edition. Bingley, UK: Emerald Press. Hereafter: P&X
For the Stata devotees: Long, J. Scott & Jeremy Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd Edition. College Station, TX: Stata Press. Hereafter: L&F
Or if you like R: I’m still searching for my favorite here, but a couple of good ones are:
• Monogan, James E. III. 2015. Political Analysis Using R. New York, NY: Springer. Hereafter: Monogan.
• Fox, John & Sanford Weisberg. 2010. An R Companion to Applied Regression. Thousand Oaks, CA: Sage. Hereafter: F&W.
Classification models
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer. Hereafter: HT&J
James, G., Witten, D., Hastie, T., & Tibshirani, R. 2013. An Introduction to Statistical Learning. New York: Springer. Hereafter: JWH&T
Examination part
Decentral – examination paper written at home (individual) 100%
Grading
Participants’ overall grades are based on completion of four assignments weighted as follows:
• A1 Math Review: 1/6
• A2 BRM + T&F: 2/6
• A3 ORM & MNLM: 1/6
• A4 Classification Methods: 2/6
Literature
None.
Additional Course Information
Getting Help
I am available to provide feedback or answer questions during lunch breaks & after course hours. Due to the compressed nature of this course (& my desire for you all to digest as much material as possible!), I encourage you to bring up questions or concerns early & often. If you would like to discuss questions or concerns related to the methods presented here for a particular paper or thesis, I would encourage you to make an appointment to meet before or after lecture one day, or during the lunch break.
• I can also be reached by email both during & after this course at shawnana@umich.edu. Ensure a prompt response to your email by prefacing your subject with “GSERM-Cat:”.
Academic Integrity
It is not possible for us to have an intellectual community without honor. I expect that you demonstrate respect by recognizing the labor of those who create intellectual products. Academic dishonesty (including cheating and plagiarism) will not be tolerated and will be dealt with according to university policy.
Prerequisites (knowledge of topic)
Mathematics: Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Familiarity with discrete and continuous univariate probability distributions will be helpful. Statistics: Students should have completed Ph.D.-level courses in introductory statistics and linear regression models, up to the level of GSERM’s Regression II.
Hardware
Students will complete course work on their own laptop computers. Microsoft Windows, Apple OSX, and Linux variants are all supported; please contact the instructor to ascertain the viability of other operating systems for course work.
Software
Basic proficiency with at least one statistical software package/language is not required but is highly recommended. Preferred software packages include the R statistical computing language and Stata. Course content will be presented using R; computer code for all course materials (analyses, graphics, course slides, examples, exercises) will be made available to students. Students choosing to use R are encouraged to arrive at class with current versions of both R (https://www.r-project.org) and RStudio (https://www.rstudio.com) on their laptops.
Course content
This course builds directly upon the foundations laid in Regression II, with a focus on successfully applying linear and generalized linear regression models. After a brief review of the linear regression model, the course addresses a series of practical issues in the application of such models: presentation and discussion of results (including tabular, graphical, and textual modes of presentation); fitting, presentation, and interpretation of two- and three-way multiplicative interaction terms; model specification for dealing with nonlinearities in covariate effects; and postestimation diagnostics, including specification and sensitivity testing. The course then moves to a discussion of generalized linear models, including logistic, probit, and Poisson regression, as well as textual, tabular, and graphical methods for presentation and discussion of such models. The course concludes with a “participants’ choice” session, where we will discuss specific issues and concerns raised by students’ own research projects and agendas.
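To illustrate the interaction material: in a model with a multiplicative term, y = b0 + b1·x + b2·z + b3·(x·z), the marginal effect of x is b1 + b3·z, so it must be evaluated and presented at specific values of the moderator. A sketch with hypothetical coefficients (Python here for self-containment; in class this is done in R):

```python
# In a linear model with a multiplicative interaction,
#   y = b0 + b1*x + b2*z + b3*(x*z),
# the marginal effect of x is dy/dx = b1 + b3*z: it depends on the moderator z.
# Coefficients below are hypothetical, for illustration only.
b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.3

def marginal_effect_x(z):
    """Conditional effect of x at a given value of the moderator z."""
    return b1 + b3 * z

# Tabulate the conditional effect across moderator values, as one would plot it:
effects = {z: marginal_effect_x(z) for z in (-1, 0, 1, 2)}
```

This is why the course stresses graphical presentation: a single coefficient (b1) describes the effect of x only at z = 0, which may not even be a meaningful value of z.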
Structure
Day One (morning session): Review of linear regression.
Day One (afternoon session): Presentation and interpretation of linear regression models.
Day Two (morning session): Fitting and interpreting models with multiplicative interactions.
Day Two (afternoon session): Nonlinearity: Specification, presentation, and interpretation.
Day Three (morning session): Anticipating criticisms: Model diagnostics and sensitivity tests.
Day Three (afternoon session): Introduction to logit, probit, and other Generalized Linear Models (GLMs).
Day Four (morning session): GLMs: Presentation, interpretation, and discussion.
Day Four (afternoon session): GLMs: Practical considerations, plus extensions.
Day Five (morning session): “Participants’ choice” session.
Day Five (afternoon session): Examination period.
Literature
Mandatory
The course has one required text:
Fox, John R. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Additional readings will also be assigned as necessary; a list of those readings will be sent to course participants a few weeks before the course begins. All additional readings will be available on the course github repository and/or through online library services (e.g., JSTOR).
Supplementary / Voluntary
None.
Mandatory readings before course start
None.
Examination part
Grading:
– Two written homework assignments (20% each)
– A final examination (50%)
– Oral / class participation (10%)
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam. Additional useful materials include:
Fox, John, and Sanford Weisberg. 2011. An R and S-Plus Companion to Applied Regression, Second Edition. Thousand Oaks, CA: Sage Publications.
Nagler, Jonathan. 1996. “Coding Style and Good Computing Practices.” The Political Methodologist 6(2): 2–8.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data using linear and/or generalized linear regression. Students will be required to specify, estimate, and interpret various forms of regression models, to present tabular and graphical interpretations of those model results, to conduct and present diagnostics and robustness checks, and to give detailed explanations and justifications for their responses.
Literature
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel / Hierarchical Models. New York: Cambridge University Press.
Prerequisites (knowledge of topic)
Participants should have a basic working knowledge of the principles and practice of multiple regression and elementary statistical inference. No knowledge of matrix algebra is required or assumed, nor is matrix algebra ever used in the course.
Hardware
Participants are strongly encouraged to bring their own laptops (Mac or Windows)
Software
Computer applications will focus on the use of OLS regression and the PROCESS macro for SPSS and SAS developed by Andrew F. Hayes (processmacro.org), which makes the analyses described in this class much easier than they otherwise would be. Because this is a hands-on course, participants are strongly encouraged to bring their own laptops (Mac or Windows) with a recent version of SPSS Statistics (version 19 or later) or SAS (release 9.2 or later) installed. SPSS users should ensure their installed copy is patched to its latest release. SAS users should ensure that the IML product is part of the installation. R and Stata users can still benefit from the course content, but PROCESS is not available for R or Stata.
Course content
Statistical mediation and moderation analyses are among the most widely used data analysis techniques in social science, health, and business fields. Mediation analysis is used to test hypotheses about various intervening mechanisms by which causal effects operate. Moderation analysis is used to examine and explore questions about the contingencies or conditions of an effect, also called “interaction”. Increasingly, moderation and mediation are being integrated analytically in the form of what has become known as “conditional process analysis,” used when the goal is to understand the contingencies or conditions under which mechanisms operate. An understanding of the fundamentals of mediation and moderation analysis is in the job description of almost any empirical scholar. In this course, you will learn about the underlying principles and the practical applications of these methods using ordinary least squares (OLS) regression analysis and the PROCESS macro for SPSS and SAS.
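As a flavor of the core computation: in a single-mediator model the indirect effect is the product a·b, where a is the effect of X on M and b is the effect of M on Y controlling for X, and PROCESS bases inference on a percentile bootstrap. The sketch below (plain Python on simulated data; an illustration of the logic, not PROCESS itself) estimates a·b by OLS and bootstraps a confidence interval:

```python
import random

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

def indirect_effect(x, m, y):
    """a*b in a single-mediator model: a from OLS of M on X,
    b from OLS of Y on M controlling for X (2x2 normal equations)."""
    sxx, smm, sxm = cross(x, x), cross(m, m), cross(x, m)
    smy, sxy = cross(m, y), cross(x, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (smm * sxx - sxm ** 2)
    return a * b

# Simulated data with true a = 0.5 and true b = 0.4 (illustration only).
random.seed(1)
n = 300
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.2 * xi + random.gauss(0, 1) for xi, mi in zip(x, m)]

ab = indirect_effect(x, m, y)

# Percentile bootstrap for the indirect effect: resample cases with
# replacement, re-estimate a*b each time, and read off percentiles.
boot = []
for _ in range(500):
    idx = [random.randrange(n) for _ in range(n)]
    boot.append(indirect_effect([x[i] for i in idx],
                                [m[i] for i in idx],
                                [y[i] for i in idx]))
boot.sort()
ci_low, ci_high = boot[12], boot[487]   # approximate 95% percentile interval
```

The bootstrap is preferred here because the sampling distribution of a product of coefficients is not normal in small samples, so a symmetric normal-theory interval can be misleading.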
Topics covered in this fiveday course include:
 Path analysis: Direct, indirect, and total effects in mediation models.
 Estimation and inference about indirect effects in single mediator models.
 Models with multiple mediators
 Mediation analysis in the twocondition withinsubject design.
 Estimation of moderation and conditional effects.
 Probing and visualizing interactions.
 Conditional Process Analysis (also known as “moderated mediation”)
 Quantification of and inference about conditional indirect effects.
 Testing a moderated mediation hypothesis and comparing conditional indirect effects
As an introductory-level course, we focus primarily on research designs that are experimental or cross-sectional in nature with continuous outcomes. We do not cover complex models involving dichotomous outcomes, latent variables, models with more than two repeated measures, nested data (i.e., multilevel models), or the use of structural equation modeling.
This course will be helpful for researchers in any field (including psychology, sociology, education, business, human development, political science, public health, and communication) who want to learn how to apply the latest methods in moderation and mediation analysis using readily available software packages such as SPSS and SAS.
Structure
The schedule for the course will be partially determined by students’ previous experience and existing familiarity with mediation and moderation. The schedule below is a rough approximation.
Day 1
 Path analysis: Direct, indirect, and total effects in mediation models.
 Estimation and inference about indirect effects in single mediator models.
Day 2
 Models with multiple mediators
 Mediation analysis in the twocondition withinsubject design.
Day 3
 Estimation of moderation and conditional effects.
 Probing and visualizing interactions.
 Moderation analysis in the twocondition withinsubject design
Days 4 & 5
 Estimation of conditional process models (also known as “moderated mediation”)
 Quantification of and inference about conditional indirect effects.
 Testing a moderated mediation hypothesis and comparing conditional indirect effects
Literature
This course is a companion to Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press. The content of the course overlaps the book to some extent, but many of the examples are different, and this course includes material not in the first edition of the book. A copy of the book is not required to benefit from the course, but it could be helpful to reinforce understanding.
Beyond IMMCPA additional materials include:
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6–27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1–22.
Mandatory:
No materials are mandatory, but students will benefit greatly from reading Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press.
Supplementary / voluntary:
Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6–27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1–22.
Mandatory readings before course start:
N/A
Examination part
100% of assessment will be based on a written final examination at the end of the course. The exam will be a combination of multiple-choice questions and short-answer/fill-in-the-blank questions, along with some interpretation of computer output. Students will take the examination home on the last day of class and return it to the instructor within one week.
During the examination students will be allowed to use all course materials, such as PDFs of PowerPoint slides, student notes taken during class, and any other materials distributed or studentgenerated during class. Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the exam, students may use the book if desired during the exam.
A computer is not required during the exam, though students may use a computer if desired, for example as a storage and display device for class notes provided to them during class.
Examination content
Exam topics may include how to quantify and interpret path analysis models; how to calculate direct, indirect, and total effects; and how to determine whether evidence of a mediation effect exists in a data set, based on computer output or other information provided. Also covered will be testing for moderation of an effect, interpreting evidence of interaction, and probing interactions. Students will be asked to generate or interpret conditional indirect effects from computer output given to them and/or to determine whether an indirect effect is moderated. Students may also be asked to construct computer commands that will conduct certain analyses. All questions will come from the content listed in “Course Content” above.
Literature
Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the assignments, students may use the book if desired.
Prerequisites (knowledge of topic)
 Basic knowledge of the R programming language
 Basic statistical knowledge, including graduate-level statistics
Hardware
 A laptop computer with an Internet connection. The laptop should have at least 4 GB of RAM (preferably more, because text mining is memory-intensive).
Software
 A modern web browser (e.g., Chrome)
 R (https://www.r-project.org/), RStudio (https://www.rstudio.com/products/rstudio/) and git (https://git-scm.com/downloads) are available at no cost and are needed for this course. Please install all three on your personal laptop prior to class.
 As a backup, students should also sign up at https://rstudio.cloud/
 Specific R Packages will be shared prior to class for installation onto the laptop. The installation script will be shared via email with participants and shared on the class github repository.
Course content
Text mining is the art and science of extracting insights from large amounts of natural language. The topics of Text Mining will help students add natural language processing techniques to their research and data science toolset. As a technical course with some machine learning elements, it requires limited exposure to programming, graduate-level statistics, and mathematical theory, but the vast majority of the course content focuses on applying popular text mining methods. As a result, the target audience may also include qualitative researchers looking to add quantitative analysis to interviews, media, and other language-based field research, as long as participants have some basic R background.
If you stay engaged in the course and complete the suggested readings and code, you will be able to think systematically about how information can be obtained from diverse natural language. You will learn how to implement a variety of popular text mining algorithms in R (a free and open-source software environment) to identify insights, extract information, and measure emotional content.
Structure
Overall, the course is meant to be a practical examination of text mining, with some overlap with machine learning techniques for natural language. Following the adult learning model, each day will have a lecture, a demonstration, a co-working session, and finally a standalone lab where students can apply the technique to new data with instructor support.
Specifically, each morning session will include a lecture and code step through demonstrating a text mining technique. In the afternoon, the technique will be applied to a new data set followed by a lab. During the lab yet another data set will be provided or students can apply the day’s technique to their own data.
Day 1: R Basics & What is text mining?
Intro to R programming
String Manipulation & Text Cleaning
Lab Section: Clean tweets and prepare them for bag-of-words examination
Day 2: Common Text Mining Visuals
Word Frequency & Term Frequency–Inverse Document Frequency (TF-IDF)
Term-Document & Document-Term Matrices
Word Clouds – Comparison Clouds, Commonality Clouds
Other Visuals – Word Networks, Associations, Pyramid Plots, Treemaps
Lab Section: Create various visualizations with news articles
Day 3: Sentiment Analysis & Unsupervised Learning: Topic Modeling & Clustering
Sentiment Lexicons – Negation, Amplification, Valence Shifters
K-Means & Spherical K-Means
Correlated Topic Modeling
Lab Section: Clustering Professional Resumes/CVs
Day 4: Supervised Learning: Document Classification
Elastic Net (Lasso & Ridge Regression)
Data Science Ethics – IBM Watson’s use of text for cancer diagnosis
Lab Section: Classify clickbait from news headlines
Day 5: OpenNLP & Text Sources
Named Entity Recognition
APIs, webscraping basics, Microsoft Office documents
Afternoon Session: Final Examination (no lab)
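As a taste of the Day 2 material: term frequency–inverse document frequency (TF-IDF) up-weights terms that are frequent in one document but rare across the corpus, and zeroes out terms that appear everywhere. A hand-rolled sketch on an invented toy corpus (written in Python for self-containment; in class this is done with R's text mining packages):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Bag-of-words TF-IDF: weight = term frequency * log(N / document frequency).
    Terms common to every document (like 'the') get weight zero."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus, invented for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran home"]
w = tf_idf(docs)
# 'the' occurs in all three documents, so its weight is log(3/3) = 0 everywhere.
```

In practice the tokenization and cleaning steps of Day 1 (lowercasing, punctuation and stop-word removal) happen before weighting, which is why the course treats preprocessing as its own topic.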
Literature
Mandatory
 Text Mining in Practice with R by Ted Kwartler; Wiley & Sons Publishing
ISBN: 9781119282013
 Two data ethics articles assigned in class to spur reflection for the ethics essay.
Supplementary / voluntary
None.
Mandatory readings before course start
 Read chapter 1 of Text Mining in Practice with R, entitled “What is Text Mining?”
 Please install R & R Studio on your laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. As a backup, sign up for an account at RStudio’s cloud environment https://rstudio.cloud.
Examination part
20% Ethics Paper – due at midnight on the last course day
 500–750-word essay with personal reflection on the ethical implications of text mining research methods
80% Final Exam – proctored on the final day of the week
 30 multiple-choice questions (2 pts each)
 1 code review section (20 pts) asking students to describe what specific code steps are doing and why
 4 short-form questions (5 pts each), each requiring a one-paragraph answer (2–4 sentences)
Supplementary aids
Students may bring a handwritten “index card” to the final examination period. It may be double-sided and should be functionally equivalent to the UK standard 3-inch by 5-inch notecard. Students may put any information they deem important for the final on their notecard and use it as a supplement during the exam. Use of an exam-supporting notecard is optional.
Examination content
The exam covers the following topics (example content in parentheses):
 R coding principles and basic functions (how to read in data; data types)
 Steps in a machine learning or analytical project workflow (SEMMA; EDA functions; partitioning if modeling)
 Steps in a text mining workflow (problem statement > unorganized state > organized state)
 R text mining libraries and functions (which functions are appropriate for text uses)
 Text preprocessing steps (why “cleaning” steps are performed)
 Bag-of-words text processing (what is Bag of Words?)
 Sentiment analysis (lexicons, their application, and implications for understanding author emotion)
 Document classification (Elastic Net machine learning for document classification)
 Topic extraction (unsupervised machine learning for topic extraction: K-Means, Spherical K-Means, hierarchical clustering)
 Text as inputs for machine learning algorithms (classification and prediction using mixed training sets that include extracted text features as independent variables)
 Text mining visuals (word frequencies, disjoint comparisons, and other common visuals)
 Named entity recognition (examples of named entities in large corpora)
 Text sources (APIs, web scraping, OCR, and other text sources)
Literature
The exam will be based on the lectures and mandatory assigned reading from Text Mining in Practice with R.
Prerequisites (knowledge of topic)
Students should be interested in spatial topics such as real estate markets, urban economics, crime, pollution, spatial distribution of political preferences, and trade flows. We assume that students are familiar with matrix algebra, and have had courses in probability theory and econometrics. The course emphasizes programming and empirical application. The empirical implementation of spatial models is done in R, hence some familiarity in R is useful but not required for the course. The course is open to students from the PiF/PEF and other external PhD programs.
Learning objectives
The goal of this course is to provide students with the main tools for analyzing and visualizing spatial data. Students will learn how to estimate and interpret a range of spatial models and how to program their own models in R.
Course content
This course focuses on the visualization and modeling of spatial data. Examples are taken from different research areas such as political science, empirical international trade, criminology, and real estate. It offers a detailed explanation of individual estimation methods and their implementation in R. In this course, students will learn:
• How to generate a variety of different maps that visualize the location of spatial units
• How maximum likelihood estimation works and how to set up and optimize a likelihood function in R
• How to deal with computational problems that are frequently encountered when working with spatial data
• How to increase computation speed using concentrated maximum likelihood and the matrix exponential spatial specification model
• How to estimate a spatial regression model with both cross-sectional and time-series data
• How to properly interpret the output from a spatial regression model and how to investigate policy interventions.
• A basic background on spatial interaction models, heterogeneous coefficient SAR models, and spatio‑temporal models
What students do NOT learn in this course:
• Estimation of spatial regression models with other estimation techniques such as IV, NLS, and Bayesian Methods
• The use of a specialized Geographic Information System such as ArcGIS
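As a small preview of the lecture material: the spatial lag Wy that appears in the SAR model y = ρWy + Xβ + ε is, with a row-standardized weight matrix, simply the average outcome among each unit's neighbors. A toy three-unit example (plain Python with invented numbers; the course implements everything in R with proper sparse matrices):

```python
# Row-standardized spatial weight matrix for three units: entry W[i][j] is
# nonzero when j is a neighbor of i, and each row sums to one.
W = [[0.0, 0.5, 0.5],   # unit 1's neighbors: units 2 and 3
     [1.0, 0.0, 0.0],   # unit 2's only neighbor: unit 1
     [0.5, 0.5, 0.0]]   # unit 3's neighbors: units 1 and 2

y = [2.0, 4.0, 6.0]

# Sanity check on row-standardization.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in W)

# The spatial lag Wy: each element is the average outcome of i's neighbors.
Wy = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in W]
# Wy == [5.0, 2.0, 3.0]
```

Because y appears on both sides of the SAR equation, Wy cannot simply be added as an ordinary regressor; this endogeneity is exactly why the course spends so much time on maximum likelihood estimation.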
Structure
Monday
Lecture 1: 09:15 ‑ 12:00
R Tutorial 1: 13:00 ‑ 15:00
Tuesday
Lecture 2: 09:15 ‑ 12:00
R Tutorial 2: 13:00 ‑ 15:00
Wednesday
Lecture 3: 09:15 ‑ 12:00
R Tutorial 3: 13:00 ‑ 15:00
Thursday
Lecture 4: 09:15 ‑ 12:00
R Tutorial 4: 13:00 ‑ 15:00
Friday
Lecture 5: 09:15 ‑ 12:00
R Tutorial 5: 13:00 ‑ 15:00
Times and room information in the timetable apply.
Literature
Mandatory
LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”. CRC Press.
Supplementary / voluntary
Elhorst, J.P. (2014), “Spatial Econometrics: From Cross-Sectional Data to Spatial Panels”, Springer.
Holly, S., M.H. Pesaran, and T. Yamagata (2011), “The Spatial and Temporal Diffusion of House Prices in the UK”, Journal of Urban Economics 69, 2–23.
LeSage, J. (2014), “What Regional Scientists Need to Know about Spatial Econometrics”, The Review of Regional Studies 44, 13–32.
Examination part
Examination paper written at home (100%)
Remark: Paper Replication or own research idea.
Examination content
• SAR model, SDM model, CML, MESS, Spatial Interaction model, Spatial Panel model, HSAR model
Implementing maximum likelihood estimation in R: Full Maximum Likelihood, Concentrated Maximum Likelihood, Matrix Exponential Spatial Specification.
Examination relevant literature
• LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”. CRC Press, Chapters 1, 2, 3, 4, 8, and 9.
• LeSage, J., and Y.Y. Chih (2016), “Interpreting Heterogeneous Coefficient Spatial Autoregressive Panel Models”, Economics Letters 142, 1–5.
Course Description
As in many other fields, economists are increasingly making use of high-dimensional models – models with many unknown parameters that need to be inferred from the data. Such models arise naturally in modern data sets that include rich information for each unit of observation (a type of “big data”) and in nonparametric applications where researchers wish to learn, rather than impose, functional forms. High-dimensional models provide a vehicle for modeling and analyzing complex phenomena and for incorporating rich sources of confounding information into economic models.
Our goal in this course is twofold. First, we wish to provide an overview and introduction to several modern methods, largely coming from statistics and machine learning, which are useful for exploring high-dimensional data and for building prediction models in high-dimensional settings. Second, we will present recent proposals that adapt high-dimensional methods to the problem of doing valid inference about model parameters and illustrate applications of these proposals for doing inference about economically interesting parameters.
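One recurring building block of the methods covered is soft-thresholding, through which lasso-type estimators set small coefficients exactly to zero and thereby perform variable selection. A minimal sketch (Python for self-containment; this is the special orthonormal-design case, where the lasso solution is simply soft-thresholded OLS):

```python
def soft_threshold(z, lam):
    """The soft-thresholding operator at the heart of lasso-type estimators:
    shrinks z toward zero, and returns exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# With orthonormal predictors, the lasso solution is soft-thresholded OLS:
# small estimated effects are dropped entirely, large ones are shrunk.
ols = [2.5, -0.3, 0.75, 0.05]
lasso = [soft_threshold(b, 0.5) for b in ols]   # -> [2.0, 0.0, 0.25, 0.0]
```

The shrinkage-induced bias visible here (2.5 becomes 2.0) is also why naive post-selection inference fails, motivating the post-selection proposals covered in Lecture 4.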
Course prerequisites
The course is a PhD level course. Basic knowledge of parametric statistical models and associated asymptotic theory is expected.
Preliminary Outline
Lecture 1 (Hansen): Introduction to High-Dimensional Modeling
 Breiman, L. (1996), “Bagging Predictors,” Machine Learning 24: 123–140
 Friedman, J., T. Hastie, and R. Tibshirani (2000), “Additive logistic regression: A statistical view of boosting (with discussion),” Annals of Statistics, 28, 337–407
 Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Elements from Chapters 2, 5, 7, 8.7, 10]
 James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Elements from Chapters 2, 3, 5, 7, 8.2]
 Li, Q. and J. S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press. [Elements from Chapters 2, 14]
 Schapire, R. (1990), “The strength of weak learnability,” Machine Learning, 5, 197227
Lecture 2 (Spindler): Introduction to Distributed Computing for Very Large Data Sets
Lecture 3 (Hansen): Tree-based Methods
• Athey, S. and G. Imbens (2015), “Machine Learning Methods for Estimating Heterogeneous Causal Effects,” working paper, http://arxiv.org/abs/1504.01132
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapters 9, 10, 15, 16]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 8]
• Wager, S. and S. Athey (2015), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” working paper, http://arxiv.org/abs/1510.04342
• Wager, S. and G. Walther (2015), “Uniform Convergence of Random Forests via Adaptive Concentration,” working paper, http://arxiv.org/abs/1503.06388
• Wager, S., T. Hastie, and B. Efron (2014), “Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife,” Journal of Machine Learning Research, 15, 1625–1651
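As a taste of the material in this lecture, a random forest fit in R might look like the following minimal sketch; the randomForest package and the built-in iris data are illustrative assumptions, since the reading list does not prescribe software.

```r
# A minimal sketch, assuming the randomForest package (not named in the syllabus).
library(randomForest)

data(iris)            # small built-in data set, for illustration only
set.seed(1)           # make the bootstrap resampling reproducible
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)            # out-of-bag (OOB) error estimate and confusion matrix
importance(fit)       # per-variable importance measures
```

Bagging (Breiman 1996, above) corresponds to growing the trees on bootstrap samples; a random forest additionally samples a subset of predictors at each split.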
Lecture 4 (Spindler): An Overview of HighDimensional Inference
• Belloni, A. and V. Chernozhukov (2013), “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012), “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80(6), 2369–2429
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “High-Dimensional Methods and Inference on Structural and Treatment Effects,” Journal of Economic Perspectives, 28(2), 29–50
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and C. Hansen (2015), “Inference in High-Dimensional Panel Models with an Application to Gun Control,” forthcoming, Journal of Business and Economic Statistics
• Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013), “Program Evaluation with High-Dimensional Data,” working paper, http://arxiv.org/abs/1311.2645
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments,” American Economic Review, 105(5), 486–490
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,” Annual Review of Economics, 7, 649–688
Lecture 5 (Hansen): Penalized Estimation Methods
• Belloni, A. and V. Chernozhukov (2013), “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547
• Fan, J. and J. Lv (2008), “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society, Series B, 70(5), 849–911
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapters 3, 4, 5, 18]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 6]
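A compact illustration of penalized estimation in a p > n setting; glmnet is one widely used R implementation (an assumption, as the lecture does not name a package), and the data below are simulated purely for illustration.

```r
# A minimal Lasso sketch, assuming the glmnet package.
library(glmnet)

set.seed(1)
n <- 100; p <- 200                       # more coefficients than observations
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))     # sparse true coefficient vector
y <- X %*% beta + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 selects the Lasso penalty
coef(cvfit, s = "lambda.min")            # sparse estimates at the CV-chosen lambda
```

The cross-validated penalty level trades off fit against sparsity, which connects directly to the post-selection readings listed for Lecture 4.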
Lecture 6 (Spindler): Moderate p Asymptotics
Lecture 7 (Hansen): Examples
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012), “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80(6), 2369–2429
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “High-Dimensional Methods and Inference on Structural and Treatment Effects,” Journal of Economic Perspectives, 28(2), 29–50
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and C. Hansen (2015), “Inference in High-Dimensional Panel Models with an Application to Gun Control,” forthcoming, Journal of Business and Economic Statistics
• Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013), “Program Evaluation with High-Dimensional Data,” working paper, http://arxiv.org/abs/1311.2645
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments,” American Economic Review, 105(5), 486–490
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,” Annual Review of Economics, 7, 649–688
• Gentzkow, M., J. Shapiro, and M. Taddy (2015), “Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech,” working paper, http://www.brown.edu/Research/Shapiro/
• Hansen, C. and D. Kozbur (2014), “Instrumental Variables Estimation with Many Weak Instruments Using Regularized JIVE,” Journal of Econometrics, 182(2), 290–308
• Kleinberg, J., J. Ludwig, S. Mullainathan, and Z. Obermeyer (2015), “Prediction Policy Problems,” American Economic Review: Papers and Proceedings, 105(5), 491–495
Lecture 8 (Spindler): Inference: Computation
Lecture 9 (Hansen): Introduction to Unsupervised Learning
• Blei, D., A. Ng, and M. Jordan (2003), “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3(4–5), 993–1022
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapter 14]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 10]
• Li, Q. and J. S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press. [Chapter 1]
• Stock, J. H., and M. W. Watson (2002), “Forecasting using principal components from a large number of predictors,” Journal of the American Statistical Association, 97, 1167–1179
Lecture 10 (Spindler): Very Large p Asymptotics
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80, 2369–2429. (ArXiv, 2010)
• Belloni, A., and V. Chernozhukov (2011): “ℓ1-penalized quantile regression in high-dimensional sparse models,” Annals of Statistics, 39(1), 82–130. (ArXiv, 2009)
• Belloni, A., and V. Chernozhukov (2013): “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547. (ArXiv, 2009)
• Belloni, A., V. Chernozhukov, and C. Hansen (2010): “Inference for High-Dimensional Sparse Econometric Models,” Advances in Economics and Econometrics, 10th World Congress of the Econometric Society, Shanghai, 2010. (ArXiv, 2011)
• Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and K. Kato (2013): “Uniform Post Selection Inference for LAD Regression Models,” arXiv:1304.0282. (ArXiv, 2013)
• Belloni, A., V. Chernozhukov, and L. Wang (2011a): “Square-Root LASSO: Pivotal Recovery of Sparse Signals via Conic Programming,” Biometrika, 98(4), 791–806. (ArXiv, 2010)
• Belloni, A., V. Chernozhukov, and L. Wang (2011b): “Square-Root LASSO: Pivotal Recovery of Nonparametric Regression Functions via Conic Programming.” (ArXiv, 2011)
• Belloni, A., V. Chernozhukov, and Y. Wei (2013): “Honest Confidence Regions for Logistic Regression with a Large Number of Controls,” arXiv preprint arXiv:1304.3969. (ArXiv, 2013)
• Bickel, P., Y. Ritov, and A. Tsybakov (2009): “Simultaneous analysis of Lasso and Dantzig selector,” Annals of Statistics.
• Candes, E., and T. Tao (2007): “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics.
• Donald, S., and W. Newey (1994): “Series estimation of semilinear models,” Journal of Multivariate Analysis.
• Tibshirani, R. (1996): “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser. B.
• Frank, I. E., and J. H. Friedman (1993): “A Statistical View of Some Chemometrics Regression Tools,” Technometrics, 35(2), 109–135.
• Gautier, E., and A. Tsybakov (2011): “High-dimensional Instrumental Variables Regression and Confidence Sets,” arXiv:1105.2454v2
• Hahn, J. (1998): “On the role of the propensity score in efficient semiparametric estimation of average treatment effects,” Econometrica, pp. 315–331.
• Heckman, J., R. LaLonde, and J. Smith (1999): “The economics and econometrics of active labor market programs,” Handbook of Labor Economics, 3, 1865–2097.
• Imbens, G. W. (2004): “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86(1), 4–29.
• Leeb, H., and B. M. Pötscher (2008): “Can one estimate the unconditional distribution of post-model-selection estimators?,” Econometric Theory, 24(2), 338–376.
• Robinson, P. M. (1988): “Root-N-consistent semiparametric regression,” Econometrica, 56(4), 931–954.
• Rudelson, M., and R. Vershynin (2008): “On sparse reconstruction from Fourier and Gaussian measurements,” Comm. Pure Appl. Math., 61, 1024–1045.
• Jing, B.-Y., Q.-M. Shao, and Q. Wang (2003): “Self-normalized Cramér-type large deviations for independent random variables,” Ann. Probab., 31(4), 2167–2215.
Course literature
Course notes and a list of readings provided at the beginning of the course.
Examination
Written examination (100%). Participants get a take-home final exam on the last day, to be completed over the next couple of weeks.
Examination content
Content of the lectures
Examination relevant literature
To be discussed in class
Prerequisites (knowledge of topic)
Students should have basic knowledge of quantitative research and qualitative research.
Hardware
Projector for slide shows. Laptops and software are not required for this course.
Software
None
Learning objectives
At the end of the course the student will be able to do the following:
• Compare quantitative, qualitative, and mixed methods research.
• Define and justify mixed methods research.
• Explain the major kinds of mixed methods research.
• Write research questions for mixed methods research studies.
• Explain how to construct basic and advanced research designs in mixed methods research.
• Explain the multiple kinds of data collection appropriate in mixed methods research.
• Explain the major methodologies, theoretical frameworks, and paradigms popular in mixed methods research.
• Explain sampling methods used in mixed methods research.
• Explain how to produce high quality/justified mixed methods research, including the multiple kinds of validity.
• Explain how mixed methods data analysis adds to traditional quantitative and qualitative data analysis, including crossover analysis and integration.
• Explain how to structure reports and articles in mixed methods research.
• Explain how to publish and disseminate mixed methods research results.
Course content
The content is elaborated, day by day, in the section on structure. We will complete an inventory of philosophical/scientific and methodological beliefs; examine major paradigms in mixed methods research; compare qualitative, quantitative, and mixed methods (always including important scholars in the fields); examine multiple types of mixed methods research; learn how to write research questions that suggest the need for mixed methods research; examine major methods of data collection, including between- and within-method mixing of well-known methods; learn about the “popular mixed methods designs” and how to construct better designs appropriate for your research questions and situation; learn about producing defensible/justifiable mixed methods research, including the major types of validity to be addressed in particular studies; learn how to move beyond quantitative and qualitative data analysis and conduct mixed methods data analysis, including crossover analysis; and learn how to structure and write mixed methods proposals and manuscripts.
Structure
Day 1:
• Comparative overview of quantitative, qualitative, and mixed methods research.
• Students complete an inventory so they can articulate their philosophical/scientific and methodological beliefs and assumptions.
• Review the history of mixed methods research in the social and behavioral sciences, including business research.
• Discuss a short empirical mixed methods research article.
Day 2:
• Examine the many purposes of mixing methods (and methodologies) in mixed methods research.
• Discuss how to write research questions in mixed methods research.
• Discuss multiple methods of data collection in mixed methods research.
• Discuss how to transform traditional methodologies into mixed methodologies (e.g., mixed experiments, mixed grounded theory).
• Begin examination of common mixed methods research designs and how to construct more nuanced designs for your mixed methods research study.
• Discuss an empirical and a methodological journal article.
Day 3:
• Continue discussion of mixed methods research designs.
• Examine the multiple and combined sampling methods used in mixed methods research.
• Discuss the issue of causation in mixed methods research.
• Examine the kinds of validity used in mixed methods research.
• Discuss assigned empirical and methodological mixed methods journal articles.
Day 4:
• Catch up on previous topics.
• Discuss data analysis approaches in mixed methods research, including crossover analysis.
• Discuss how to structure mixed methods research proposals and research manuscripts.
• Discuss assigned empirical and methodological mixed methods journal articles.
Day 5:
• Discuss how to write mixed methods research manuscripts.
• Each student presents their brief research proposal, following the format provided and discussed earlier in the class.
Literature
Mandatory
The exact reading list and pdf files will be available to download a month before the course begins.
Examination part
Class participation: 20%
Assignment 1, research questions: 5% (written)
Assignment 2, article summaries/critiques: 25% (written)
Assignment 3, brief research proposal: 40% (written)
Assignment 4, brief presentation with handout: 10% (oral presentation)
Assignment 1: Research questions: Written statement of 2–5 research questions of interest for your research. (Even if you are not sure about conducting the study, it will still be useful in this class to think about a study that would be of interest.)
Assignment 2: Article summary/critique of 10 mixed methods research journal articles (some are assigned to everyone), including several in your area of interest (one page long for each summary/critique).
Assignment 3: Brief empirical research proposal: Describe your research question/s, data collection, analysis plan, and expected contributions (no longer than 10 pages).
Here are the starting headings for a typical proposal in this course/workshop:
1. Working Title
2. Introduction (short overview of topic/problem/opportunity and theoretical motivation)
3. Purpose of the Study (“The purpose of the proposed study is ____.”)
4. Research Questions (typically 2–5 research questions)
5. Paradigm(s), Methodology(ies), and Methods
6. Mixed Methods Research Design
7. Sampling Methods
8. Validity Types and Strategies
9. Expected Data Analyses
10. Strengths and Weaknesses of the Proposed Study (optional)
Assignment 4: A very short presentation (depending on the number of students) with a handout to share with the others for discussion of your proposed study.
There will not be a cumulative “knowledge exam” in this course. What is most important is to learn the material/readings, discuss the material/readings, and apply your knowledge in a short research proposal of interest to you.
Supplementary aids
See above.
Examination content
See above.
Examination relevant literature
See above.
Prerequisites (knowledge of topic)
A course in regression (e.g., GSERM Regression I) is essential. A second course in regression (e.g., GSERM Regression II) is recommended. Regression topics that are particularly important: i) assessing and dealing with nonlinearity, ii) dummy variables (including block F-tests), and iii) standardization.
Hardware
Participants should bring laptops loaded with the software identified below.
Software
We will make primary use of the lavaan package in R but will also demonstrate the sem procedure in STATA. The following R packages should be installed on participant laptops: lavaan, haven, semTools. STATA will be available in a computer lab at the University of St. Gallen for participants who do not have it installed on their own laptops.
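The package setup described above amounts to a single line in R:

```r
# Installs the three R packages named in the course description.
install.packages(c("lavaan", "haven", "semTools"))
```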
Learning objectives
The course will provide a conceptual introduction to structural equation models, provide a thorough outline of model “fitting” and assessment, teach how to effectively program structural equation models using available software, demonstrate how to extend basic models into multiple group situations, and provide an introduction to models where common model assumptions regarding missing and non-normal data are not met.
Course content
1. Introduction to latent variable models, measurement error, path diagrams.
2. Estimation, identification, interpretation of model parameters.
3. Scaling and interpretation issues
4. Scalar programming for structural equation models in R/lavaan and STATA.
5. Mediation models in the structural equation framework.
6. Model fit and model improvement
7. General linear parameter constraints
8. Multiplegroup models
9. Introduction to models for means and intercepts
10. The FIML approach to analysis with missing data
11. Alternative estimators for non-normal data.
Structure
Schedule may vary slightly according to class progress.
Day 1 Morning
Path models, mediation. Introduction to latent variable conceptualization. Diagrams, equations and model parameters. Moving from equations to diagrams and vice versa; listing model parameters.
Day 1 Afternoon
Introduction to computer SEM software. Computer exercises: A simple single-indicator model. A latent variable measurement model.
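To preview the style of the computer exercises, a latent variable measurement model in lavaan can be specified as below; the HolzingerSwineford1939 data set ships with lavaan, and this generic one-factor model is an illustration rather than the actual course exercise.

```r
# A minimal sketch of a one-factor measurement model in lavaan.
library(lavaan)

model <- '
  visual =~ x1 + x2 + x3   # latent factor "visual" measured by three indicators
'
fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```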
Day 2 Morning
Identification. Variances, scaling. Covariance algebra for structural equation models. Applications. Class exercises: a) identification b) covariance algebra. Equality constraints and dummy variables in SEM models.
Day 2 Afternoon
Computer exercises (R, STATA): A latent variable measurement model with covariates. Model diagnostics, fit improvement approaches. Mediation with manifest and latent variables.
Day 3 Morning
Nested models, Wald and LM tests, mixing single- and multiple-indicator measurement models. Fit functions. Estimation. Dealing with estimation problems, including negative variance estimates and nonconvergence.
Day 3 Afternoon
Computer exercise (R/lavaan): SEM model with multiple latent variables, single-indicator and multiple-indicator covariates. Improving model fit, assessing diagnostics. Nonstandard models. Multiple-group models: conceptual introduction.
Day 4 Morning
Multiple Group Models. Measurement equation equivalence across groups (tests, assessment). Construct equation equivalence. Software applications, formal versus substantive comparisons. Reporting SEM model results. Computer exercise (R): a multiple-group model.
Day 4 Afternoon
Computer exercise: multiple-group models in STATA. Alternative estimators and scaled variance estimators: dealing with missing data and non-normal data. Item parcels (pro and con).
Day 5 Morning
Computer exercise (R/lavaan) for datasets with missing and/or non-normal data. An introduction to models for means and intercepts.
Day 5 Afternoon
Computer exercises (R/lavaan and STATA): a model for means and intercepts
Literature
Mandatory
Nine PDF files will be made available to participants as reading materials for this course, titled Notes(Section1) through Notes(Section9).
Supplementary / voluntary
Randall Schumacker and Richard Lomax, A Beginner’s Guide to Structural Equation Modeling. 4th edition (Routledge, 2016). This reading is helpful but not essential. Earlier versions of this text can be used.
Mandatory readings before course start
There are no mandatory pre-course readings. Participants are encouraged to read through Section 1 of the course notes in advance of the class, but may choose to read this while the class is in progress.
Examination part
Two computer exercises, 20% each: 40%.
First exercise is due Thursday during the course. Second exercise is due Monday immediately following the course.
One major exercise: 60%.
This exercise will consist of a series of 5–7 questions requiring essay-style responses (approx. 8–14 pp. total). Some questions will involve the interpretation of computer output listings, while other questions will deal with conceptual issues discussed in the course. The exercise is due within 2 weeks of the end of the course.
Supplementary aids
For the computer exercises, the following materials will be helpful: a) lab exercise materials and descriptions, b) an abbreviated software user manual/guide (one available for each of STATA and lavaan), and c) the PDF course text files. For the major project, the PDF course files will be very helpful.
Examination content
For the final exercise, students will need to understand the following subject matter:
1. Converting equations to path diagrams and vice versa.
2. Principles of mediation assessment: total, direct and indirect effects in structural equation path models
3. Determining whether a model is identified or not
4. Dealing with estimation difficulties
5. Interpreting model parameters in the metric of the manifest variables
6. Interpreting standardized model parameters
7. Determining whether the fit of a model is acceptable
8. Hypothesis testing: simultaneous tests for b=0; tests for equality
9. Interpreting models with parameter constraints
10. Testing measurement model equivalence in multiple-group models
11. Testing construct equation equivalence in multiple-group models; assessing individual parameters and groups of parameters for cross-group differences
12. Dummy exogenous variables in structural equation models
13. Approaches to missing data in SEM models.
14. Dealing with non-normal data: ADF, DWLS estimators, Satorra-Bentler and other variance adjustment approaches.
Examination relevant literature
For the major assignment exercise, students should have access to the course PowerPoint slide materials and the course text PDF files.
Prerequisites (knowledge of topic)
Students should have previous exposure to social research methods, including basic training in quantitative methods, at the post-baccalaureate level.
Hardware
Laptop (PC or Mac): Students should bring a laptop. The course will include instruction in the use of the software package fsQCA (for both Windows and Mac).
Software
Please install the fsQCA software package ahead of the course. It can be downloaded for free at fsqca.com.
Learning objectives
Qualitative comparative analysis (QCA) is a research approach consisting of both an analytical technique and a conceptual perspective for researchers interested in studying configurational phenomena. QCA is particularly appropriate for the analysis of causally complex phenomena marked by multiple, conjunctural causation where multiple causes combine to bring about outcomes in complex ways.
QCA was developed in the 1980s by Charles Ragin, a sociologist and political scientist, as an alternative comparative approach that lies midway between the primarily qualitative, case-oriented approach and the primarily quantitative, variable-oriented approach, with the goal of bridging both by combining their advantages and tackling situations where causality is complex and conjunctural. QCA uses Boolean algebra for the analysis of set relations and allows researchers to formally analyze patterns of necessity and sufficiency regarding outcomes of interest. Since its inception, QCA has developed into a broad set of techniques that share their set-analytic nature and include both descriptive and inferential techniques.
Many researchers have drawn on QCA because it offers a means to systematically analyze data sets with only a few observations. In fact, QCA was originally applied to small-n situations of between 10 and 50 cases; situations where there are frequently too many cases to pursue a classical qualitative approach but too few cases for conventional statistical analysis. However, more recently, researchers have also applied QCA to medium- and large-n situations marked by hundreds of thousands of cases. While these applications require some changes to how QCA is applied, they retain many advantages for analyzing situations that are configurational in nature and marked by causal complexity.
The goal of this workshop is to provide a ground-up introduction to Qualitative Comparative Analysis (QCA) and fuzzy sets. Participants will get intensive instruction and hands-on experience with the fsQCA software package and on completion should be prepared to design and execute research projects using the set-analytic approach.
After successful completion of the course, you should be able to:
1. understand the goals, assumptions, and key concepts of QCA
2. conduct data analysis using the fsQCA software package
3. design and execute research projects using a set-analytic approach
4. apply advanced forms of set-analytic investigation
I would like this workshop to be as useful to you as possible. To get the most out of this workshop, you would ideally already be working on an empirical project that might be aided by taking a configurational approach, but that is not essential. Over the course of this workshop, I hope you will be thinking about how you can apply these methods to your research, and I will do my best to be of assistance.
Course content
See below under structure
Structure
Day 1: Units 1–3
Day 2: Units 3–4
Day 3: Units 5–6
Day 4: Units 6–7
Day 5: Student Presentations
Unit 1. Introduction to the Comparative Method
The goal of this first unit is to offer an introduction to the logic of comparative research, as this perspective will be fundamental in informing our thinking for the coming days. The focus is on understanding social research from a set-analytic perspective as well as examining the distinctive place of configurational and comparative research.
Key Readings:
Ragin, 2008 (“Redesigning Social Inquiry”): Chapters 1–2
Unit 2. The Basics of QCA
We’ll move on to the basics of QCA. We will begin with an introduction to Boolean algebra and set-analytic methods. Other issues we will cover include set-analytic analysis vs. correlational analysis, the concepts of necessity and sufficiency as well as consistency, coverage, and set coincidence. Time permitting, we will also examine case-oriented research strategies for theory building.
Key Readings:
Ragin, 2000: Chapters 3–5
Ragin, 2008: Chapters 1–3
Unit 3. Crisp Set Analysis
In this unit, we will dive into crisp-set QCA (csQCA), the simpler version of QCA using binary data sets. This will include the coding of data, the construction of truth tables, and understanding the three solutions—complex, parsimonious, and intermediate. We will also begin to examine the importance of counterfactual analysis based on easy versus difficult counterfactuals. Topics also include understanding consistency and coverage in crisp-set truth table analysis.
Key Readings:
Ragin, 2000: Chapters 3–5
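The truth table idea from this unit can be illustrated in a few lines of base R with made-up binary data (the course itself uses the fsQCA package for this):

```r
# Illustrative crisp-set data: conditions A and B, outcome Y, one row per case.
d <- data.frame(A = c(1, 1, 1, 0, 0, 1),
                B = c(1, 0, 0, 1, 1, 1),
                Y = c(1, 1, 0, 0, 0, 1))
d$config <- paste0(d$A, d$B)     # each case's configuration of conditions
table(d$config)                  # number of cases per truth table row
tapply(d$Y, d$config, mean)      # consistency: share of cases showing the outcome
```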
Unit 4. Fuzzy Set Analysis I
Fuzzy set analysis presents a slightly more complex version of QCA. We will start with the notions of fuzzy sets and fuzzy set relations before moving on to calibrating fuzzy sets and fuzzy set consistency, coverage, and coincidence.
Key Readings:
Ragin, 2008: Chapters 4–5
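The consistency and coverage measures covered in this unit have simple closed forms (Ragin's standard definitions); with illustrative membership scores:

```r
# Fuzzy-set consistency and coverage for "X is sufficient for Y"
# (illustrative membership scores, not course data).
x <- c(0.9, 0.7, 0.6, 0.2, 0.1)           # membership in condition X
y <- c(1.0, 0.8, 0.5, 0.4, 0.0)           # membership in outcome Y
consistency <- sum(pmin(x, y)) / sum(x)   # degree to which X is a subset of Y
coverage    <- sum(pmin(x, y)) / sum(y)   # share of Y accounted for by X
c(consistency = consistency, coverage = coverage)
```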
Unit 5. The Fuzzy-Set Truth Table Algorithm
We will next cover the fuzzy-set truth table algorithm. Building on crisp set analysis, we will further examine issues around limited diversity, fuzzy sets, and counterfactual analysis. We also will work with sample data sets.
Unit 6. Advanced Topics in QCA
This unit provides us with an opportunity to catch up and delve deeper into some of the topics introduced above. Should we feel comfortable enough, we will move on to some more advanced topics in QCA, including the testing of causal recipes and substitutable causal conditions.
Key Readings:
Ragin, 2008: Chapters 7–10
Unit 7. Large-N Applications of QCA
The last unit will provide examples of recent large-N applications of QCA. These examples will give us an opportunity to raise further questions about how to execute research using a set-analytic approach. We will also reserve some time for further questions that have come up during the workshop.
Literature
There are four key books for the course, and required chapters are posted here in PDF format. I recommend reading the remainder of the books, but this is not required.
Ragin, Charles C. 1987. The Comparative Method: Moving beyond Qualitative and Quantitative Strategies. Berkeley, CA: University of California Press.
Ragin, Charles C. 2000. Fuzzy Set Social Science. Chicago, IL: University of Chicago Press.
Ragin, Charles C. 2008. Redesigning Social Inquiry: Fuzzy Sets and Beyond. Chicago, IL: University of Chicago Press.
Ragin, Charles C., and Peer C. Fiss. 2017. Intersectional Inequality: Race, Class, Test Scores, and Poverty. Chicago, IL: University of Chicago Press.
Mandatory
Background Reading: The Comparative Method, chapters 6–8; Redesigning Social Inquiry, chapters 1–5; and Fuzzy Set Social Science, chapters 3–5.
Supplementary / voluntary
Goertz, Gary. 2006. Social Science Concepts: A User’s Guide. Princeton, NJ: Princeton University Press.
Goertz, Gary and James Mahoney. 2012. A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences. Princeton, NJ: Princeton University Press.
Rihoux, Benoit and Charles C. Ragin (eds.) 2008. Configurational Comparative Methods. Thousand Oaks, CA: Sage.
Schneider, Carsten and Claudius Wagemann. 2012. SetTheoretic Methods for the Social Sciences: A Guide to QCA. New York: Cambridge.
Mandatory readings before course start
The above chapters.
Examination part
Presentation (individual) (50%)
Research proposal written at home (individual) (50%)
Supplementary aids
To get inspiration for research proposals, I recommend that participants review recent research projects in their field using QCA. A bibliography of such projects is available at http://compasss.org/bibliography/
Examination content
The “structure” section above presents a complete list of topics relevant to the examination. Specifically, the oral presentation and subsequent examination paper will focus on using course materials to develop a research proposal.
Examination relevant literature
All required chapters listed above are part of the examination relevant literature, as are all course materials such as PPTs and additional materials distributed to the participants during the course.
Prerequisites (knowledge of topic)
A strong background in linear regression is a necessity. Background exposure to maximum likelihood models like logistic regression would be very helpful but is not strictly necessary. Some previous background exposure to multilevel, longitudinal, panel, or mixed effects models would be very helpful but is not necessary. People without a background in multilevel models should (time permitting) order a copy of either Multilevel Analysis: Techniques and Applications by Joop Hox, Mirjam Moerbeek, and Rens van de Schoot (2017) or Multilevel Analysis by Tom Snijders and Roel Bosker (2011) and attempt to read the early chapters ahead of time. Again, this is not a requirement to attend the class but will help you to absorb the material in lecture much more easily.
Hardware
A laptop—preferably a PC, as that is what I use. Please ensure that you have administrator access on your machine or that someone who does can help you install the needed software prior to the workshop. Without doing this you will be unable to follow along with the labs in class.
Software
The course will use R and RStudio, which are both free and open source. We will mainly be using the R packages lme4 and brms as well as some extensions. Please have both programs and the specific packages installed on your machine before you arrive. Note that brms requires Rtools to be installed on your machine, and that requires administrator access. An install script for all needed packages will be provided for registered students.
Note: if you are primarily a Stata user then I can provide you with some code (for version 16) to do many of the things covered in the course. However, we will not have time to go through it in class.
Course content
This course is designed to provide a practical guide to fitting advanced multilevel models. It is pitched for people from widely different backgrounds, so a significant amount of attention is paid to translating concepts across fields. My approach to the class combines work from econometrics, statistics/biostatistics, and psychometrics. The class is structured using a maximum likelihood framework with practical applied Bayesian extensions on different topics. R packages are selected specifically to make the transition from MLE to Bayesian multilevel models as straightforward and seamless as possible. This is a very applied course with annotated code provided and time in class for lab work. However, it is necessary to spend some class time working through theory and interpretation as well as the logic of mixed effects models.
Specific topics include:
• Random intercept and random slope models
• Cross-classified and multiple membership models
• Generalized linear mixed models
• Special topics chosen by students
The last day of class will have material chosen by the students from a predetermined list of possible topics. In order for your topic to be considered you must respond to the course survey by the end of lunch on Monday so that we can discuss updates during the afternoon.
While you will not be an expert in multilevel modeling after one week—this takes years of practice—you will have the tools to go home and fit many advanced models in your own work. By the end of the week you will have practical experience fitting both Bayesian and likelihood versions of basic and advanced multilevel models with RStudio. You will be able to produce diagnostics and results and hopefully interpret them correctly. If you use the models in your own work and read the supplementary materials for the course, you will end up with a very high level of knowledge in multilevel modeling over time. While we do cover Bayesian extensions for multilevel models, this course is not a substitute for a fully-fledged course on Bayesian data analysis. However, it will leave you very well prepared for such a course or for reading a Bayesian analysis textbook.
Structure
Day 1 – Introduction
Morning Lecture
• The basic multilevel modeling toolkit
o Random intercept models
o Random coefficient models
o Fixed effects models
• The model fitting checklist
o Data structures
o Missing data and selection bias
o Omitted variable bias
o Latent dependency structures
Afternoon Discussion: Modeling Test Scores
• This is the basic ‘students in classes in schools’ example on steroids. We will discuss differences between designs where we have some exogenous intervention for students (i.e., a causal inference model) and ones where we have observational data and can only really model correlations but have potentially very complex structures at work.
Afternoon Lab
• Software Introduction to lme4 and brms
• Fitting random intercept and random slope models with lme4 and brms
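The labs themselves use R, but the quantity a random intercept model partitions can be previewed in a few lines of code. The sketch below is illustrative only, not course material: it uses Python with NumPy (all variable names are hypothetical) to recover the intraclass correlation from simulated grouped data via ANOVA-style moment estimates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated grouped data: y_ij = u_j + e_ij, with a true
# between-group SD of 2 and within-group SD of 1, so the
# true intraclass correlation (ICC) is 4 / (4 + 1) = 0.8.
n_groups, n_per = 50, 20
u = rng.normal(0.0, 2.0, n_groups)                        # group intercepts
y = u[:, None] + rng.normal(0.0, 1.0, (n_groups, n_per))  # observations

# ANOVA-style moment estimates of the two variance components
# (the decomposition a random intercept model estimates).
group_means = y.mean(axis=1)
var_within = y.var(axis=1, ddof=1).mean()                 # within-group variance
var_between = group_means.var(ddof=1) - var_within / n_per
icc = var_between / (var_between + var_within)
```

The estimated `icc` should land near the true value of 0.8; in lme4 the same decomposition comes from the variance components of `lmer(y ~ 1 + (1 | group))`.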
Day 2 – Complex Data Structures
Morning Lecture
• Review of the toolkit and checklist
• Grouping structures
o What kinds of groups do you have?
o How many groups do you have?
• Latent dependency structures
o Groups and experiments
o Groups and time
o Groups and space
o Groups and networks
Afternoon Discussion: Modeling State Policymaking
• We will work out the logic of how to model environmental policy adoption in US states over time. We will mainly follow along with the design and analysis for my paper on state environmental policy adoption. This example highlights latent dependency structures (time, space, networks, latent classes) and complicated grouping structures.
Afternoon Lab
• Fitting cross-classified models with lme4 and brms
• Fitting multiple membership models with lme4 and brms
• Modeling problems for interference, time, space, and networks
Day 3 – Bias from Selection and Omitted Variables
Morning Lecture
• Review of the toolkit and checklist
• Selection bias and missing data
o Multilevel multiple imputation
o Multilevel selection models
• Omitted variable bias
o Mundlak and Hausman
o Fixed vs mixed effects models
o Random coefficients and Bayesian shrinkage
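The Mundlak device listed above can be demonstrated in a few lines: when a predictor is correlated with an omitted group effect, pooled OLS is biased, but adding the group mean of the predictor as a regressor recovers the within-group coefficient. This is a minimal simulation sketch, not course code (Python/NumPy, hypothetical names; the course labs use R).

```python
import numpy as np

rng = np.random.default_rng(0)
G, n = 200, 10
a = rng.normal(0, 1, G)                    # omitted group effects
xbar = a + rng.normal(0, 0.5, G)           # group-level x, correlated with a
x = xbar[:, None] + rng.normal(0, 1, (G, n))
y = 1.0 * x + 2.0 * a[:, None] + rng.normal(0, 1, (G, n))  # true within effect = 1

xf, yf = x.ravel(), y.ravel()
xbar_f = np.repeat(xbar, n)                # group mean of x, repeated per observation

# Pooled OLS of y on x: biased, since x correlates with the omitted a.
X1 = np.column_stack([np.ones_like(xf), xf])
b_pooled = np.linalg.lstsq(X1, yf, rcond=None)[0]

# Mundlak device: add the group mean of x; the coefficient on x
# is now the within-group effect, matching a fixed effects estimate.
X2 = np.column_stack([np.ones_like(xf), xf, xbar_f])
b_mundlak = np.linalg.lstsq(X2, yf, rcond=None)[0]
```

Here `b_pooled[1]` overshoots the true within-group slope of 1.0, while `b_mundlak[1]` recovers it; this is the logic behind the Mundlak and Hausman comparisons of fixed vs mixed effects models.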
Afternoon Discussion: Modeling Changes in Public Opinion Over Time
• We will go through some basic and not-so-basic strategies to model dynamic changes in public opinion over time. We will mainly use a data set on Jewish Israeli public support for a two-state solution over a 20-year period. This example provides an excellent illustration of time-varying confounding, also known as contextual effect moderation through time.
Afternoon Lab
• Review of previous labs
• Multilevel multiple imputation
• Multilevel selection models
• Fixed, random, and mixed effects model comparisons
Day 4 – Practical Model Fitting, Diagnostics, and Model Comparison
Morning Lecture
• Review of the toolkit and checklist
• Building an analysis plan
• Building a coherent workflow
• Model comparison
Morning Lab
• Model diagnostics in lme4
• Model comparison with lme4
Afternoon Lab
• Priors in brms
• Model diagnostics in brms
• Model comparison with brms
Afternoon Lecture
• Explaining and justifying what you’ve done to others
Day 5 – Special Topics
Participants choose among the following advanced topics based on personal interest. I will have the final list of topics that I plan to cover updated by Tuesday afternoon. While I will attempt to accommodate everyone in the class, it is unlikely that there will be sufficient time to cover all requested topics.
• Review of any topic already covered in the class
• Multilevel categorical modeling
• Multilevel ordered choice modeling
• Multilevel survival modeling
• Multilevel propensity scores
• Multilevel regression and poststratification
• Multilevel structural equation modeling
Literature
Mandatory readings before course start
Gill, J. and A. J. Womack (2013). The Multilevel Model Framework. The SAGE handbook of multilevel modeling. M. A. Scott, J. S. Simonoff and B. D. Marx, Sage.
Enders, C. K. (2013). Centering predictors and contextual effects. SAGE Handbook of Multilevel Modeling. M. A. Scott, J. S. Simonoff and B. D. Marx.
Fielding, Antony, and Harvey Goldstein. 2006. “Cross-classified and multiple membership structures in multilevel models: An introduction and review.”
Bell, A. and K. Jones (2015). “Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data.” Political Science Research and Methods 3(1): 133–153.
Examination part
Students will have two weeks after the last day of class to complete a homework assignment worth 100% of their grade in the course. This assignment will include coding practice, theory, results interpretation, and research design problems. Students may submit it for feedback at least 72 hours before the deadline. As this is a homework assignment, course notes, readings, and R scripts are obviously allowed. Examples of these problems will be worked through during class time.
Homework will be a mix of research design applications, coding, and fitting models.
1. Students will be given research questions and be required to outline a set of potential analyses designed to answer them. This will include tradeoffs and potential weaknesses in their analysis.
2. Students will be required to diagram R code and explain the purpose and use of each segment. They will be required to articulate how different sections of the code work “under the hood” and outline any relevant implications.
3. Students will be required to fit models, perform diagnostics, and report/interpret results accurately.
The material needed for study will be lecture notes, the required readings in the list above, and the R package documentation for packages used in the course.
Prerequisites (knowledge of topic)
– Basic programming skills, Python recommended
– Undergraduate-level linear algebra, analysis, and statistics
Hardware
– Personal laptop running macOS, Linux, or Windows
– Tablets (iOS, Windows) will not work for this lecture
Software
– Webbrowser (Chrome, Safari, Firefox)
– Text editor
– Jupyter Notebook
– Local Python installation including NumPy, SciPy, scikit-learn, PyTorch
(there will be an installation session on the first day for participants)
Course content
– Machine Learning Refresher
o Supervised Learning vs. Unsupervised Learning
o Traditional Machine Learning vs. End-to-End Learning
– Fundamentals of Neural Networks:
o Rosenblatt Perceptron and Neurons
o Network Structure (feedforward, recurrent), matrix notation, forward evaluation
– Training as optimization
o Loss and Error functions
o Backpropagation
o SGD and other optimizers
– Activation functions and topologies
o Convolutional neural networks
o Generative Adversarial Networks
o Long short-term memory networks
o Special layer types (Inception, ResNet)
o Embeddings
o Attention mechanisms & Transformers
– Applications to realworld problems:
o Acoustic keyword recognition (audio/speech processing)
o Sentiment analysis (text processing)
o Digit recognition (image processing)
o Tiny Image Recognition (image processing)
o Face Detection and Tracking (image/video processing)
o Stock market prediction (time series prediction)
– Training on large data sets (Hardware, GPU)
– Trustworthy AI
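As an illustration of the “Training as optimization” block above, here is a minimal sketch, not course material, of a two-layer network trained on XOR by backpropagation and full-batch gradient descent (Python/NumPy; all names are hypothetical, and the course labs use their own notebooks).

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR data: the classic problem a single Rosenblatt perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer (4 tanh units), sigmoid output, small random init.
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

lr, losses = 0.5, []
for _ in range(5000):
    # Forward evaluation
    h = np.tanh(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y - t) ** 2)))   # MSE loss

    # Backpropagation: chain rule applied layer by layer
    dy = 2 * (y - t) / len(X) * y * (1 - y)       # grad at output pre-activation
    dW2 = h.T @ dy;  db2 = dy.sum(axis=0)
    dh = (dy @ W2.T) * (1 - h ** 2)               # tanh derivative
    dW1 = X.T @ dh;  db1 = dh.sum(axis=0)

    # Gradient descent update (full batch; SGD would use minibatches)
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2
```

The loss curve in `losses` should fall steadily; the same forward/backward structure underlies the PyTorch labs, where autograd computes the gradients instead.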
Structure
The course combines theoretical content in the morning with practical exercises in the afternoon, in the form of Jupyter notebook lab programming.
Literature
Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016
Supplementary
• http://jupyter.org/
• http://www.numpy.org/
• https://www.scipy.org/
• https://www.tensorflow.org
• https://pytorch.org
Examination part
– Completed Jupyter notebook labs 1–8 (40%, closed book), in class
– Completed Jupyter notebook assignments 1–8 (60%, open book), at home
Prerequisites (knowledge of topic)
Basic knowledge of descriptive statistics, data analysis and R is useful, but not necessary. Participants need to bring their own laptop and complete our detailed installation instructions for R and RStudio (both open source software) shared prior to the course.
Learning objectives
The creation and communication of data visualizations is a critical step in any data analytic project. Modern open-source software packages offer ever more powerful data visualization tools. When applied with psychological and design principles in mind, users competent in these tools can produce data visualizations that easily tell more than a thousand words. In this course, participants learn how to employ state-of-the-art data visualization tools within the programming language R to create stunning, publication-ready data visualizations that communicate critical insights about data. Prior to, during, and after the course, participants work on their own data visualization project.
Course content
Each day will contain a series of short lectures and demonstrations that introduce and discuss new topics. The bulk of each day will be dedicated to hands-on, step-by-step exercises to help participants ‘learn by doing’. In these exercises, participants will learn how to read in and prepare data, how to create various types of static and interactive data visualizations, how to tweak them to exactly fit one’s needs, and how to embed them in digital reports. Accompanying the course, each participant will work on his or her own data visualization project, turning an initial visualization sketch into a one-page academic paper featuring a polished, well-designed figure. To advance these projects, participants will be able to draw on support from the instructors in the afternoons of course days two to four.
Structure
Day 1
Morning: Cognitive and design principles of good data visualizations
Afternoon: Introduction to R
Day 2
Morning: Reading in, organizing, and transforming data
Afternoon: Project sketch pitches
Day 3
Morning: Creating plots using the grammar of graphics
Afternoon: Visualizing statistical uncertainty, facets, networks, and maps
Day 4
Morning: Styling and exporting plots
Afternoon: Making visualizations interactive
Day 5
Morning: Reporting visualizations using Markdown
Afternoon: Final presentation and competition
Literature
Voluntary readings:
Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons.
Healy, K. (2018). Data visualization: a practical introduction. Princeton University Press.
Examination part
The course grade is determined based on the quality of the initial project sketch (20%), the data visualization produced during the course (40%), and the onepage paper submitted after the course (40%).
In the past 60 years, econometrics has provided us with many tools to uncover many different types of correlations. The technical level of this literature is impressive (see the PEF course Advanced Microeconometrics). However, at the end of the day, correlations are of limited interest if they have no causal implication. For example, the fact that smokers are more likely to die earlier than other people does not tell us much about the effect of smoking: it might just be that smokers are the type of people who face more health and crime risks for quite different (social or genetic) reasons. The same problem occurs with almost any correlation of economic or financial variables. The interesting question is always whether these correlations are spurious, or whether they tell us something about the underlying causal link between the variables involved.
In this course we review and organize the rapidly developing literature on causal analysis in economics and econometrics and consider the conditions and methods required for drawing causal inferences from the data. Empirical applications play an important role in this course.
Active participation of PhD students participating in this course is expected. During the second part of the course, participants will conduct their own empirical study and present their results.
General structure and rules
Students activities
Active participation of the students in this course is the key to its success. Students are expected to do the following:
 Read the papers listed as ‘compulsory reading’ in the reading list BEFORE the lecture concerned with the topic.
 Each morning students will present a paper (15‑30 minutes each, depending on the number of participants), and there will be some general discussion of these papers. Students not presenting will be expected to have at least skimmed the papers so that they can participate in the discussion.
 Small groups of students (group size depends on the number of participants) will conduct an independent empirical study (using software of their own choice; GAUSS or Stata is recommended). In the empirical project, students will show that they have understood the basic concepts and are able to apply them to a ‘real life’ situation.
Grades
 Written Exam about 4 weeks after the last lecture (2 hours) (40%).
 Students’ active participation in general discussions during lectures and presentations (20%).
 Presentation of papers (20%).
 Empirical project (based on two presentations; 20%).
Prerequisites
As defined for the econometrics specialisation of PEF.
Course literature
To be published shortly before the lecture.
Examination content
Empirical work, literature, contents of lecture
Examination relevant literature
To be defined during the lecture