Syllabuses
University of St.Gallen
Prerequisites (knowledge of topic)
Advanced knowledge of statistics and econometrics (gained, for example, by following specialized courses in a master's program in quantitative methods, economics, or finance).
Hardware
Individual laptop (no particular requirements).
Software
Examples and code are shown using the R software (freely downloadable from https://www.r-project.org/).
Course Content
Computational Statistics is the area of specialization within statistics that includes statistical visualization and other computationally intensive methods of statistics for mining large, non-homogeneous, multidimensional datasets so as to discover knowledge in the data. As in all areas of statistics, probability models are important, and results are qualified by statements of confidence or of probability. An important activity in computational statistics is model building and evaluation.
First, basic multiple linear regression is reviewed. Then, some nonparametric procedures for regression and classification are introduced and explained. In particular, kernel estimators, smoothing splines, classification and regression trees, additive models, projection pursuit, and eventually neural nets will be considered; some of these have a straightforward interpretation, while others are useful for obtaining good predictions.
The main problems arising in computational statistics, such as the curse of dimensionality, will be discussed. Moreover, the goodness of a given (complex) model for estimation and prediction is analyzed using resampling, bootstrap, and cross-validation techniques.
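The resampling idea above is simple enough to sketch in a few lines. The course works in R; the following illustration (in Python, with made-up data) shows a bootstrap estimate of the standard error of a sample mean:

```python
import random
import statistics

def bootstrap_se(data, n_boot=2000, seed=0):
    """Estimate the standard error of the sample mean by resampling
    the data with replacement n_boot times and taking the standard
    deviation of the resampled means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in range(len(data))]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1, 2.5, 3.7]
se = bootstrap_se(data)
```

Cross-validation follows the same resampling logic, except that each held-out portion of the data is used to assess predictions rather than to re-estimate a statistic.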
Structure
Outline
1. Overview of supervised learning: introductory examples, two simple approaches to prediction, statistical decision theory, local methods in high dimensions, structured regression models, bias-variance trade-off, multiple testing and use of p-values.
2. Linear methods for regression: multiple regression, analysis of residuals, subset selection and coefficient shrinkage.
3. Methods for classification: Bayes classifier, linear regression of an indicator matrix, discriminant analysis, logistic regression.
4. Nonparametric density estimation and regression: histogram, kernel density estimation, kernel regression estimator, local polynomial nonparametric regression estimator, smoothing splines and penalized regression.
5. Model assessment and selection: bias, variance and model complexity, bias-variance decomposition, optimism of the training error rate, AIC and BIC, cross-validation, bootstrap methods.
6. Flexible regression and classification methods: additive models; multivariate adaptive regression splines (MARS); neural networks; projection pursuit regression; classification and regression trees (CART).
7. Bagging and boosting: the bagging algorithm, bagging for trees, subagging, the AdaBoost procedure, steepest descent and gradient boosting.
8. Introduction to the idea of a Super Learner.
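As a taste of the bagging chapter: the bagging algorithm draws bootstrap resamples of the training data, fits the base learner to each, and averages the resulting predictions. The sketch below (in Python; the course itself uses R, and the base learner, data, and function names here are invented for illustration) bags a deliberately high-variance 1-nearest-neighbour fit:

```python
import random
import statistics

def bag_predict(x_train, y_train, x_new, fit, n_bags=25, seed=1):
    """Bagging: fit the base learner on bootstrap resamples of the
    training data and average the resulting predictions at x_new."""
    rng = random.Random(seed)
    n = len(x_train)
    preds = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([x_train[i] for i in idx], [y_train[i] for i in idx])
        preds.append(model(x_new))
    return statistics.mean(preds)

# Base learner: 1-nearest-neighbour regression, a simple high-variance fit
# that benefits from the variance reduction bagging provides.
def fit_1nn(xs, ys):
    def predict(x):
        j = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[j]
    return predict

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 1.2, 1.9, 3.2, 3.9]
yhat = bag_predict(x, y, 2.5, fit_1nn)
```

Bagging for trees replaces the base learner with a regression or classification tree; the aggregation step is identical.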
Structure (Chapters refer to the outline above)
Days 1 and 2: Chapters 1, 2, and 3
Day 3: Chapter 5
Day 4: Chapter 4
Days 5 and 6: Chapters 6, 7, and 8
Literature
Mandatory
F. Audrino, Lecture Notes (can be downloaded from StudyNet or requested directly from the lecturer).
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York.
Supplementary / voluntary
Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.
van der Laan, M.J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
Moreover: References to related published papers will be given during the course.
Additional online resources:
A complete version of the main reference book can be downloaded online: http://statweb.stanford.edu/~hastie/ElemStatLearn/
Moreover, the R package for the examples in the book is available: https://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf
The webpage of the book on Targeted Learning: http://www.targetedlearningbook.com/
https://stat.ethz.ch/education/semesters/ss2015/CompStat (a largely overlapping Computational Statistics class taught at ETH Zürich)
Rsoftware information and download: https://www.rproject.org/
Online course by Hastie and Tibshirani on Statistical Learning:
Official course at Stanford Online: https://lagunita.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
Quicker access to the videos: http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Link to the website of an introductory book related to the course: http://www-bcf.usc.edu/~gareth/ISL/index.html
Mandatory readings before course start
–
Examination part
Decentral: 100% group examination paper (term paper). In line with St. Gallen quality standards, an individual examination paper is also possible.
Supplementary aids
The examination paper consists of the analysis of a data set chosen by the students, involving the methods learned in the lecture.
Examination content
The whole outline of the lecture described above.
Literature
Audrino, Lecture Notes.
These workshop lectures are designed to introduce participants to one of the most vibrant, free-of-charge statistical computing environments: R. In this course you will learn how to use R for effective data analysis. We will cover a range of topics, from basic ones (e.g., reading data into R, data structures (i.e., data frames, lists, matrices), data manipulation, statistical graphics) to more advanced ones (e.g., writing functions, control statements, loops, reshaping data, string manipulations, and statistical models in R).
This course is also helpful as a primer for other summer program courses that will use R, such as Computational Statistics, Data Mining, or Advanced Regression Modeling. No prerequisites are required for this course.
Course content
The primary goal is to develop an applied and intuitive (as opposed to purely theoretical or mathematical) understanding of the topics and procedures. Whenever possible, presentations will be given in "Words," "Picture," and "Math" languages in order to appeal to a variety of learning styles. Some more advanced regression topics will be covered later in the course, but only after the introductory foundations have been established.
We will begin with a quick review of basic univariate statistics and hypothesis testing.
After that we will cover various topics in bivariate and then multiple regression, including:
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
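To make the dummy-variable and interaction topics concrete, here is a minimal sketch (in Python with NumPy; the in-class examples use SPSS, and the data below are invented) of fitting y = b0 + b1*x + b2*d + b3*(x*d) by ordinary least squares, where d is a 0/1 dummy and b3 measures how the slope on x differs between the two groups:

```python
import numpy as np

# Toy data: outcome y, continuous predictor x, dummy d (e.g. group membership).
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
d = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
y = np.array([2.1, 3.9, 6.2, 7.8, 13.1, 15.9, 19.2, 21.8])

# Design matrix with intercept, main effects, and the x*d interaction term.
X = np.column_stack([np.ones_like(x), x, d, x * d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta = [intercept, slope for d=0, intercept shift for d=1, extra slope for d=1]
```

Because the model contains the full interaction, it reproduces the two group-specific regressions exactly: beta[1] is the d=0 slope and beta[1] + beta[3] is the d=1 slope.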
Structure
This course will utilize approximately 525 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in eleven Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These eleven Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in-class examples are presented using SPSS, participants are free (and encouraged!) to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive "Tutorial and Answer Key" will be provided for each.
Prerequisites
This course is intended for participants who are comfortable with algebra and basic introductory statistics, and now want to learn applied ordinary least squares (OLS) multiple regression analysis for their own research and to understand the work of others.
Note: We will not use matrix algebra or calculus in this course.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information for several supplemental (and optional) readings for each of the eleven Packets of Lecture Transcripts.
• Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
• The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
• Finally, I have included several articles from a number of journals across several
academic disciplines.
Some of these optional supplemental readings are older classics and others are more recently written and published.
Examination part
A written Final Examination will be administered during the last meeting of the course.
Since this Final Examination is the only artifact that will be formally graded in the course, it will determine the course grade.
Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the Final Examination.
Supplementary aids
The Final Examination will be written and open-book/open-note (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed). No other materials, including laptops, cell phones, or other electronic devices, will be permitted.
The Final Exam will be two hours in length and administered during the last course meeting.
I will provide more specific “practical matter” details about this exam early in the course.
Examination content
The potential substantive content areas for the Final Examination are:
• Basic univariate statistics and hypothesis testing.
• Fundamental concepts of bivariate regression and multiple regression.
• Model specification and interpretation.
• Diagnostic tests and plots.
• Analysis of residuals and outliers.
• Transformations to induce linearity.
• Interaction (“Multiplicative”) terms.
• Multicollinearity.
• Dichotomous (“Dummy”) independent variables.
• Categorical (e.g., Likert scale) independent variables.
Literature
Literature relevant to the exam:
• Lecture Transcripts (eleven Packets; approximately 525 pages).
• Class notes (taken by each participant individually).
• Tutorial and Answer Key documents (for each optional data analysis project assignment).
Supplementary/Voluntary literature not directly relevant to the exam:
• Optional supplemental readings listed in the course syllabus (and discussed earlier).
• Any other textbooks, articles, etc., the participant reads before or during the course.
Work load
At least 24 units of 45 minutes each, on 5 consecutive days.
Prerequisites (knowledge of topic)
As long ago as 2010, Eric Schmidt, the executive chairman of Alphabet, observed that every two days we generate as much information as was created in the entire history of civilization up to 2003. The problem is that much of this information is unstructured: it is not organized in a predefined manner. This lack of structure complicates extracting useful insights from these massively growing data sources. Students should have some familiarity with Python/R programming. Please bring a laptop to class. You will also need a Google account to practice using Colab.
Learning objectives and course content
In this class, we will explore different statistical approaches that have proven useful in making sense of unstructured data. The course is centered around business applications that involve the analysis of text, social networks, and images, as well as their relationships with metadata. For most of the analyses, we will use Python/R and dedicate some of the class sessions to hands-on time. Students are invited to bring their own unstructured data sets, but doing so is not required.
Structure
Day 1: Text mining: text representation, word2vec, sentiment analysis, topic modeling.
Day 2: Supervised and unsupervised machine learning: regression, random forest, k-means.
Day 3: Social network analysis: centralities, community detection, and representation learning.
Day 4: Image analysis: image processing, deep learning.
Day 5: Discussion of student projects.
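As a flavour of the Day 2 material, k-means clustering alternates two steps: assign each observation to its nearest centre, then move each centre to the mean of its cluster. Below is a minimal one-dimensional sketch (in Python with NumPy; the data are invented, and in class we will use the Python/R tooling directly):

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning points to the
    nearest centre and recomputing each centre as its cluster mean."""
    rng = np.random.default_rng(seed)
    # Initialize centres as k distinct data points chosen at random.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.abs(points[:, None] - centres[None, :]), axis=1)
        centres = np.array([points[labels == j].mean() for j in range(k)])
    return centres, labels

pts = np.array([1.0, 1.2, 0.8, 9.7, 10.1, 10.3])
centres, labels = kmeans(pts, k=2)
```

On this toy data the two centres converge to the means of the two obvious groups; in higher dimensions only the distance computation changes.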
Literature
The following books provide useful background material for the class. I will refer to more specialized publications as part of my lecture.
Introduction to information retrieval:
https://nlp.stanford.edu/IR-book/
Deep Learning:
https://www.deeplearningbook.org/
Community detection in graphs:
https://www.sciencedirect.com/science/article/pii/S0370157309002841
Graph representation learning book:
https://www.cs.mcgill.ca/~wlh/grl_book/
Python:
https://docs.python.org/3/
Examination Part
Final grades are based on a portfolio of assigned exercises. The solutions are due about two weeks after the end of the course.
Prerequisites (knowledge of topic)
Linear regression (strong), Maximum Likelihood Estimation (some familiarity), Linear/Matrix Algebra (some exposure is helpful), R (not required, but helpful).
Hardware
Access to a laptop will be useful, but not absolutely necessary.
Software
R/RStudio, JAGS (both are freely available online).
Learning objectives
To understand what the Bayesian approach to statistical modeling is and to appreciate the differences between the Bayesian and Frequentist approaches. The students will be able to estimate a wide variety of models in the Bayesian framework and to adjust example code to fit their specific modeling needs.
Course content
Theory/foundations of the Bayesian approach including:
objective vs subjective probability
how to derive and incorporate prior information
the basics of MCMC sampling
assessing convergence of Markov Chains
Bayesian difference of means/ANOVA
Bayesian versions of: Linear models, logit/probit (dichotomous/ordered/unordered choice models), Count models, Latent variable and measurement models, Multilevel models
presentation of results
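A worked example of the Bayesian machinery in its simplest, conjugate form (the Beta-Binomial model, sketched here in Python; in class we will use R and JAGS): with a Beta(a, b) prior on a success probability and k successes in n trials, the posterior is available in closed form, so no MCMC sampling is needed.

```python
# Conjugate Beta-Binomial update: with a Beta(a, b) prior on a success
# probability and k successes observed in n trials, the posterior is
# Beta(a + k, b + n - k).

def beta_binomial_posterior(a, b, k, n):
    a_post = a + k
    b_post = b + n - k
    mean = a_post / (a_post + b_post)  # posterior mean of the success probability
    return a_post, b_post, mean

# Flat Beta(1, 1) prior, 7 successes in 10 trials.
a_post, b_post, post_mean = beta_binomial_posterior(1, 1, 7, 10)
# Posterior is Beta(8, 4) with mean 8/12 = 2/3.
```

For non-conjugate models this closed form is unavailable, which is exactly where the MCMC sampling methods covered on Day 2 come in.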
Structure
Day 1 a.m.: Overview of the Bayesian approach: Bayes vs. Frequentism, history of Bayesian statistics, problems with the NHST, the Beta-Binomial model
Day 1 p.m.: Review of GLM/MLE. Probability review. Application of Bayes Rule.
Day 2 a.m.: Priors, Sampling methods (Inversion, Rejection, Gibbs sampling)
Day 2 p.m.: Convergence diagnostics. Using JAGS to estimate Bayesian models.
Day 3 a.m.: Estimating parameters of the Normal Distribution
Day 3 p.m.: Bayesian linear models, imputing missing data.
Day 4 a.m.: Choice models (dichotomous, ordered, unordered)
Day 4 p.m.: Latent variable models
Day 5 a.m.: Multilevel models: linear models.
Day 5 p.m.: Multilevel models: nonlinear models, best practices for model presentation.
Literature
Mandatory
Gill, J. (2008). Bayesian Methods: A Social And Behavioral Sciences Approach. Chapman and Hall, Boca Raton, FL
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Jackman, S. (2000). Estimation and Inference Are Missing Data Problems: Unifying Social Science Statistics via Bayesian Simulation. Political Analysis, 8(4):307–332. http://pan.oxfordjournals.org/content/8/4/307.full.pdf+html
Supplementary / voluntary
Siegfried, T. (2010). Odds are, it’s wrong: Science fails to face the shortcomings of statistics. Science News, 177(7):26–29. http://dx.doi.org/10.1002/scin.5591770721
Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science.
Bakker, R. (2009). Remeasuring left–right: A comparison of SEM and Bayesian measurement models for extracting left–right party placements. Electoral Studies, 28(3):413–421
Bakker, R. and Poole, K. T. (2013). Bayesian Metric Multidimensional Scaling. Political Analysis, 21(1):125–140
For those unfamiliar with R: John Fox and Sanford Weisberg. An R Companion to Applied Regression. Sage, 2011.
Mandatory readings before course start
Western, B. and Jackman, S. (1994). Bayesian Inference for Comparative Research. American Political Science Review, 88(2):412–423. http://www.jstor.org/stable/2944713
Efron, B. (1986). Why Isn’t Everyone a Bayesian? The American Statistician, 40(1):1–5. http://www.jstor.org/stable/2683105
Examination part
A written homework assignment which consists of estimating a variety of models using JAGS as well as a brief essay describing how the students would go about incorporating Bayesian methods in their own work and what they see as the main advantages/disadvantages of doing so.
Supplementary aids
Open book/practical examinations. The students should use the example code from the lectures to help complete the practical component as well as both required texts to help answer the essay component. Specifically, the linear model and dichotomous choice model examples will be very useful as well as the first 3 chapters of the Gill text and Section 3 of the Gelman and Hill text.
Examination content
Bayesian versions of the linear and dichotomous choice models, including presenting the appropriate results in a professionally acceptable manner. This includes creating graphical representations of the model results as well as a thorough discussion of how to interpret the results.
For the essay component, students will need to be aware of the benefits of the Bayesian approach for their own research (or the lack thereof) and to describe, in detail, the types of choices they would need to make in order to apply Bayesian methods to their own work. This includes a detailed description and justification of what priors they would choose as well as what differences they would expect to see between the Bayesian and Frequentist approaches, if any, and why they would expect such differences.
Literature
The only literature required to complete the examinations is the two required texts and the code examples from the lectures.
The course is designed for Master's and PhD students and practitioners in the social and policy sciences, including political science, sociology, public policy, public administration, business, and economics. It is especially suitable for MA students in these fields who have an interest in carrying out research. Previous courses in research methods and philosophy of science are helpful but not required. Materials not in the books assigned for purchase and not easily available through online library databases will be made available electronically. Bringing a laptop to class will be helpful but is not essential.
Hardware
Laptop helpful but not required
Software
None
Course content
The central goal of the seminar is to enable students to create and critique methodologically sophisticated case study research designs in the social sciences. To do so, the seminar will explore the techniques, uses, strengths, and limitations of case study methods, while emphasizing the relationships among these methods, alternative methods, and contemporary debates in the philosophy of science. The research examples used to illustrate methodological issues will be drawn primarily from international relations and comparative politics. The methodological content of the course is also applicable, however, to the study of history, sociology, education, business, economics, and other social and behavioral sciences.
Course structure
The seminar will begin with a focus on the philosophy of science, theory construction, theory testing, causality, and causal inference. With this epistemological grounding, the seminar will then explore the core issues in case study research design, including methods of structured and focused comparisons of cases, typological theory, case selection, process tracing, and the use of counterfactual analysis. Next, the seminar will look at the epistemological assumptions, comparative strengths and weaknesses, and proper domain of case study methods and alternative methods, particularly statistical methods and formal modeling, and address ways of combining these methods in a single research project. The seminar then examines field research techniques, including archival research and interviews.
Course Assignments and Assessment
In addition to doing the reading and participating in course discussions, students will be required to orally present an outline for a research design, either written or in PowerPoint, in the final sessions of the class for a constructive critique by fellow students and Professor Bennett. Students will then develop this into a research design paper about 3000 words long (12 pages, double-spaced).
Presumably, students will choose to present the research design for their PhD or MA thesis, though students could also present a research design for a separate project, article, or edited volume. Research designs should address all of the following tasks (elaborated upon in the assigned readings and course sessions): 1) specification of the research problem and research objectives, in relation to the current stage of development and research needs of the relevant research program, related literatures, and alternative explanations; 2) specification of the independent and dependent variables of the main hypothesis of interest and alternative hypotheses; 3) selection of a historical case or cases that are appropriate in light of the first two tasks, and justification of why these cases were selected and others were not; 4) consideration of how variance in the variables can best be described for testing and/or refining existing theories; 5) specification of the data requirements, including both process tracing data and measurements of the independent and dependent variables for the main hypotheses of interest, including alternative explanations.
Students will be assessed on how well their research design achieves these tasks, and on how useful their suggestions are on other students’ research designs. Students will also be assessed on the general quality of their contributions to class discussions.
Literature
Mandatory:
Assigned Readings for GSERM Case Study Methods Course
Andrew Bennett, Georgetown University
Students should obtain and read these books in advance of the course (see below for specific page assignments):
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development in the Social Sciences (MIT Press 2005).
•Henry Brady and David Collier, Rethinking Social Inquiry (second edition, 2010)
•Gary Goertz, Social Science Concepts: A User’s Guide, (Princeton, 2005).
•Andrew Bennett and Jeffrey Checkel, eds., Process Tracing: From Metaphor to Analytic Tool (Cambridge University Press, 2014).
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry (Princeton University Press, 1994).
Lecture 1: Inferences About Causal Effects and Causal Mechanisms
This lecture addresses the philosophy of science issues relevant to case study research.
Readings:
•Alexander L. George and Andrew Bennett, Case Studies and Theory Development, preface and chapter 7, pages 127–150.
•King, Keohane, and Verba, Designing Social Inquiry, pp. 3–33, 76–91, 99–114.
Lecture 2: Critiques and Justifications of Case Study Methods
Readings:
•Gary King, Robert Keohane, and Sidney Verba, Designing Social Inquiry, pp. 46–48, 118–121, 208–230.
•Brady and Collier, Rethinking Social Inquiry, pp. 1–64, 123–201 (or, if you have the first edition, pages 3–20, 36–50, 195–266)
•George and Bennett, Case Studies and Theory Development, Chapter 1, pages 3–36.
Lecture 3: Concept Formation and Measurement
Readings:
•Gary Goertz, Social Science Concepts, chapters 1, 2, 3, and 9, pages 1–94, 237–268.
•Gary Goertz, Exercises, available at
http://press.princeton.edu/releases/m8089.pdf
Please think through the following exercises: 7, 21, 48, 49, 52, 163, 252, 253, 256, 257.
Lecture 4: Designs for Single and Comparative Case Studies
Readings:
•George and Bennett, Case Studies and Theory Development, chapter 4, pages 73–88.
•Jason Seawright and John Gerring, "Case Selection Techniques in Case Study Research," Political Research Quarterly, June 2008. Available at: http://blogs.bu.edu/jgerring/files/2013/06/CaseSelection.pdf
Lecture 5: Typological Theory, Fuzzy Set Analysis
Readings:
•George and Bennett, Case Studies and Theory Development, chapter 11, pages 233–262.
•Excerpt from Andrew Bennett, "Causal mechanisms and typological theories in the study of civil conflict," in Jeff Checkel, ed., Transnational Dynamics of Civil War, Columbia University Press, 2012.
•Charles Ragin, "From Fuzzy Sets to Crisp Truth Tables," available at:
http://www.compasss.org/files/WPfiles/Raginfztt_April05.pdf
Lecture 6: Process Tracing, Congruence Testing, and Counterfactual Analysis
Readings:
•Andrew Bennett and Jeff Checkel, Process Tracing, chapter 1, conclusions, and appendix on Bayesianism.
•David Collier, online process tracing exercises. Look at exercises 3, 4, 7, and 8 at:
http://polisci.berkeley.edu/sites/default/files/people/u3827/Teaching%20Process%20Tracing.pdf
Lecture 7: Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
Readings:
•Andrew Bennett and Bear Braumoeller, "Where the Model Frequently Meets the Road: Combining Statistical, Formal, and Case Study Methods," draft paper.
•Evan Lieberman, "Nested Analysis as a Mixed-Method Strategy for Comparative Research," American Political Science Review, August 2005, pp. 435–52.
Lecture 8: Field Research Techniques: Archives, Interviews, and Surveys
Readings:
•Cameron Thies, "A Pragmatic Guide to Qualitative Historical Analysis in the Study of International Relations," International Studies Perspectives 3 (4) (November 2002), pp. 351–72.
Lecture 9 & 10: Student research design presentations
Read and be ready to constructively critique your fellow students’ research designs.
Supplementary / voluntary:
The following readings are useful for students interested in exploring the topic further, but they are not required:
I) Philosophy of Science and Epistemological Issues
Henry Brady, "Causation and Explanation in Social Science," in Janet Box-Steffensmeier, Henry Brady, and David Collier, eds., Oxford Handbook of Political Methodology (Oxford, 2008), pp. 217–270.
II) Case Study Methods
George and Bennett, Case Studies and Theory Development, Chapter 1.
Gerardo Munck, "Canons of Research Design in Qualitative Analysis," Studies in Comparative International Development, Fall 1998.
Timothy McKeown, "Case Studies and the Statistical World View," International Organization Vol. 53, No. 1 (Winter, 1999), pp. 161–190.
Concept Formation and Measurement
John Gerring, "What Makes a Concept Good?," Polity, Spring 1999: 357–93.
Robert Adcock and David Collier, "Measurement Validity: A Shared Standard for Qualitative and Quantitative Research," APSR Vol. 95, No. 3 (September, 2001), pp. 529–546.
Robert Adcock and David Collier, "Democracy and Dichotomies," Annual Review of Political Science, Vol. 2, 1999, pp. 537–565.
David Collier and Steven Levitsky, "Democracy with Adjectives: Conceptual Innovation in Comparative Research," World Politics, Vol. 49, No. 3 (April 1997), pp. 430–451.
David Collier, "Data, Field Work, and Extracting New Ideas at Close Range," APSA-CP Newsletter, Winter 1999, pp. 1–6.
Gerardo Munck and Jay Verkuilen, "Conceptualizing and Measuring Democracy: Evaluating Alternative Indices," Comparative Political Studies, Feb. 2002, pp. 5–34.
Designs for Single and Comparative Case Studies and Alternative Research Goals
Aaron Rapport, "Hard Thinking about Hard and Easy Cases in Security Studies," Security Studies 24:3 (2015), 431–465.
Van Evera, Guide to Methodology, pp. 77–88.
Richard Nielsen, "Case Selection via Matching," Sociological Methods and Research
(forthcoming).
Typological Theory and Case Selection
Colin Elman, "Explanatory Typologies and Property Space in Qualitative Studies of International Politics," International Organization, Spring 2005, pp. 293–326.
Gary Goertz and James Mahoney, "Negative Case Selection: The Possibility Principle," in Goertz, chapter 7.
David Collier, Jody LaPorte, and Jason Seawright, "Putting typologies to work: concept formation, measurement, and analytic rigor," Political Research Quarterly, 2012.
Process Tracing
Tasha Fairfield and Andrew Charman, 2015 APSA paper on Bayesian process tracing.
David Waldner, "Process Tracing and Causal Mechanisms." In Harold Kincaid, ed., The Oxford Handbook of Philosophy of Social Science (Oxford University Press, 2012), pp. 65‐84.
Gary Goertz and Jack Levy, "Causal Explanation, Necessary Conditions, and Case Studies: The Causes of World War I," manuscript, Dec. 2002.
Counterfactual Analysis, Natural Experiments
Jack Levy, paper in Security Studies on counterfactual analysis.
Thad Dunning, "Design-Based Inference: Beyond the Pitfalls of Regression Analysis?" in Brady and Collier, pp. 273–312.
Thad Dunning, Natural Experiments in the Social Sciences: A Design-Based Approach (Cambridge University Press, 2012), Chapters 1 and 7.
Philip Tetlock and Aaron Belkin, eds., Counterfactual Thought Experiments, chapters 1, 12.
Multimethod Research: Combining Case Studies with Statistics and/or Formal Modeling
David Dessler, "Beyond Correlations: Toward a Causal Theory of War," International Studies Quarterly Vol. 35, No. 3 (September, 1991), pp. 337–355.
Alexander George and Andrew Bennett, Case Studies and Theory Development, Chapter 2.
James Mahoney, "Nominal, Ordinal, and Narrative Appraisal in MacroCausal Analysis," American Journal of Sociology, Vol. 104, No.3 (January 1999).
Field Research Techniques: Archives, Interviews, and Surveys
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Field Research in Political Science: Practices and Principles," chapter 1 in Field Research in Political Science: Practices and Principles (Cambridge University Press). Read pages 15–33.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "Interviews, Oral Histories, and Focus Groups," in Field Research in Political Science: Practices and Principles (Cambridge University Press).
Elisabeth Jean Wood, "Field Research," in Carles Boix and Susan Stokes, eds., Oxford Handbook of Comparative Politics, Oxford University Press, 2007, pp. 123–146.
Soledad Loaeza, Randy Stevenson, and Devra C. Moehler. 2005. "Symposium: Should Everyone Do Fieldwork?" APSA-CP 16(2) 2005: 8–18.
Layna Mosley, ed., Interview Research in Political Science, Cornell University Press, 2013.
Hope Harrison, "Inside the SED Archives," CWIHP Bulletin
Ian Lustick, "History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias," APSR, September 1996, pp. 605–618.
Symposium on interview methods in political science in PS: Political Science and Politics (December, 2002), articles by Beth Leech ("Asking Questions: Techniques for Semistructured Interviews"), Kenneth Goldstein ("Getting in the Door: Sampling and Completing Elite Interviews"), Joel Aberbach and Bert Rockman ("Conducting and Coding Elite Interviews"), Laura Woliver ("Ethical Dilemmas in Personal Interviewing"), and Jeffrey Berry ("Validity and Reliability Issues in Elite Interviewing"), pp. 665–682.
Diana Kapiszewski, Lauren M. MacLean, and Benjamin L. Read, "A Historical and Empirical Overview of Field Research in the Discipline," Chapter 2 in Field Research in Political Science: Practices and Principles (Cambridge University Press, forthcoming).
Mandatory readings before course start:
It is advisable to do as much of the mandatory reading as possible before the course starts.
This course assumes no prior experience with machine learning or R, though it may be helpful to be familiar with introductory statistics and programming.
Hardware
A laptop computer is required to complete the in-class exercises.
Software
R (https://www.r-project.org/) and RStudio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
Machine learning, put simply, involves teaching computers to learn from experience, typically for the purpose of identifying or responding to patterns or making predictions about what may happen in the future. This course is an introduction to machine learning methods through the exploration of real-world examples. We will cover the basic math and statistical theory needed to understand and apply many of the most common machine learning techniques, but no advanced math or programming skills are required. The target audience includes social scientists and practitioners who are interested in understanding more about these methods and their applications. Students with extensive programming or statistics experience may be better served by a more theoretical course on these methods.
Structure
The course is designed to be interactive, with ample time for hands-on practice with the machine learning methods. Each day will include several lectures on a machine learning topic, in addition to hands-on “lab” sections in which students apply what they have learned to new datasets (or their own data, if desired).
The schedule will be as follows:
Day 1: Introducing Machine Learning with R
 How machines learn
 Using R, RStudio, and R Markdown
 k-Nearest Neighbors
 Lab sections – installing R, using R Markdown, choosing own dataset (if desired)
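To give a flavor of the Day 1 material, k-Nearest Neighbors can be run in a few lines of R. The sketch below is illustrative only (not part of the course materials); it uses the built-in iris data and the class package that is bundled with R:

```r
# k-NN sketch on the built-in iris data ('class' ships with R)
library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)     # 100 rows for training, the rest for testing
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
pred  <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])     # share of test flowers classified correctly
```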
Day 2: Intermediate ML Methods – Classification Models
 Quiz on Day 1 material
 Naïve Bayes
 Decision Trees and Rule Learners
 Lab sections – practicing with Naïve Bayes and decision trees
Day 3: Intermediate ML Methods – Numeric Prediction
 Quiz on Day 2 material
 Linear Regression
 Regression trees
 Logistic regression
 Lab sections – practicing with regression methods
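As a preview of the Day 3 methods, linear and logistic regression use R's built-in lm() and glm() functions. This sketch is illustrative only, using the built-in mtcars data:

```r
# Numeric prediction: linear regression of fuel economy on weight and horsepower
lin <- lm(mpg ~ wt + hp, data = mtcars)
summary(lin)

# Binary prediction: logistic regression of transmission type (am) on weight
log_fit <- glm(am ~ wt, data = mtcars, family = binomial)
predict(log_fit, newdata = data.frame(wt = 3), type = "response")  # predicted probability
```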
Day 4: Advanced Classification Models
 Quiz on Day 3 material
 Neural Networks
 Support Vector Machines
 Random Forests
 Lab section – practice with neural networks, SVMs, and random forests
Day 5: Other ML Methods
 Quiz on Day 4 material
 Association Rules
 Hierarchical clustering
 k-Means clustering
 Lab section – practice with these methods, work on final report
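The clustering methods of Day 5 are available in base R. As an illustrative sketch (not part of the course materials), k-Means on the scaled iris measurements:

```r
# k-Means clustering of the iris measurements (species labels withheld)
set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
table(km$cluster, iris$Species)   # compare recovered clusters to the true species
```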
Literature
Mandatory
Machine Learning with R (3rd ed.) by Brett Lantz (2019). Packt Publishing.
Supplementary / voluntary
None required.
Mandatory readings before course start
Please install R and RStudio on your laptop prior to the first class. Make sure that both are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
100% of the course grade will be based on a project and final report (approximately 10 pages), to be delivered within 2-3 weeks after the course. The project is intended to demonstrate your ability to apply the course materials to a dataset of your own choosing. Students should feel free to choose a project related to their career or field of study; for example, the project may advance their dissertation research or complete a task for their job. The exact scoring criteria for this assignment will be provided on the first day of class. The report will be graded on its use of the methods covered in class as well as on drawing appropriate conclusions from the data.
There will also be brief quizzes at the start of each lecture, which cover the previous day’s materials. These are ungraded and are designed to provoke thought and discussion.
Supplementary aids
Students may reference literature and class materials as needed when writing the final project report.
Examination content
The final project report should illustrate an ability to apply machine learning methods to a new dataset, which may be on a topic of the student’s choosing. The student should explore the data and explain the methods applied. Detailed instructions will be provided on the first day of class.
Prerequisites (knowledge of topic)
Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Students should have completed Ph.D.-level courses in introductory statistics, and in linear and generalized linear regression models (including logistic regression, etc.), up to the level of Regression III. Familiarity with discrete and continuous univariate probability distributions will be helpful.
Hardware
Students will be required to provide their own laptop computers.
Software
All analyses will be conducted using the R statistical software. R is free, open-source, and runs on all contemporary operating systems. The instructor will also offer support for students wishing to use Stata.
Learning objectives
Students will learn how to visualize, analyze, and conduct diagnostics on models for observational data that have both cross-sectional and temporal variation.
Course content
Analysts increasingly find themselves presented with data that vary both over cross-sectional units and across time. Such panel data provide unique and valuable opportunities to address substantive questions in the economic, social, and behavioral sciences. This course will begin with a discussion of the relevant dimensions of variation in such data, and discuss some of the challenges and opportunities that such data provide. It will then progress to linear models for one-way unit effects (fixed, between, and random), models for complex panel error structures, dynamic panel models, nonlinear models for discrete dependent variables, and models that leverage panel data to make causal inferences in observational contexts. Students will learn the statistical theory behind the various models, details about estimation and inference, and techniques for the visualization and substantive interpretation of their statistical results. Students will also develop statistical software skills for fitting and interpreting the models in question, and will use the models in both simulated and real data applications. Students will leave the course with a thorough understanding of both the theoretical and practical aspects of conducting analyses of panel data.
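To preview the software side, one-way fixed-effects models of this kind can be fit with R's plm package. The sketch below is illustrative only (it assumes plm is installed) and uses the classic Grunfeld investment panel that ships with the package:

```r
# One-way fixed-effects (within) estimator with plm
library(plm)
data("Grunfeld", package = "plm")      # firm-year investment panel

fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"),   # unit and time identifiers
          model = "within")            # fixed (within) effects
summary(fe)
```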
Structure
Day One:
Morning:
• (Very) Brief Review of Linear Regression
• Overview of Panel Data: Visualization, Pooling, and Variation
• Regression with Panel Data
Afternoon:
• Unit Effects Models: Fixed, Between, and Random Effects
Day Two:
Morning:
• Dynamic Panel Data Models: The Instrumental Variables / Generalized Method of Moments Framework
Afternoon:
• More Dynamic Models: Orthogonalization-Based Methods
Day Three:
Morning:
• Unit-Effects and Dynamic Models for Discrete Dependent Variables
Afternoon:
• GLMs for Panel Data: Generalized Estimating Equations (GEEs)
Day Four:
Morning:
• Introduction to Causal Inference with Panel Data (Including Unit Effects)
Afternoon:
• Models for Causal Inference: Differences-in-Differences, Synthetic Controls, and Other Methods
Day Five:
Morning:
• Practical Issues: Model Selection, Specification, and Interpretation
Afternoon:
• Course Examination
Literature
Mandatory
Hsiao, Cheng. 2014. Analysis of Panel Data, 3rd Ed. New York: Cambridge University Press.
OR
Croissant, Yves, and Giovanni Millo. 2018. Panel Data Econometrics with R. New York: Wiley.
Supplementary / voluntary
Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies 72:1-19.
Anderson, T. W., and C. Hsiao. 1981. “Estimation of Dynamic Models with Error Components.” Journal of the American Statistical Association 76:598-606.
Antonakis, John, Samuel Bendahan, Philippe Jacquart, and Rafael Lalive. 2010. “On Making Causal Claims: A Review and Recommendations.” The Leadership Quarterly 21(6):1086-1120.
Arellano, M., and S. Bond. 1991. “Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations.” Review of Economic Studies 58:277-297.
Beck, Nathaniel, and Jonathan N. Katz. 1995. “What To Do (And Not To Do) With Time-Series Cross-Section Data.” American Political Science Review 89(September):634-647.
Bliese, P. D., D. J. Schepker, S. M. Essman, and R. E. Ployhart. 2020. “Bridging Methodological Divides Between Macro- and Microresearch: Endogeneity and Methods for Panel Data.” Journal of Management 46(1):70-99.
Clark, Tom S., and Drew A. Linzer. 2015. “Should I Use Fixed or Random Effects?” Political Science Research and Methods 3(2):399-408.
Doudchenko, Nikolay, and Guido Imbens. 2016. “Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis.” Working paper: Graduate School of Business, Stanford University.
Gaibulloev, K., Todd Sandler, and D. Sul. 2014. “Of Nickell Bias, Cross-Sectional Dependence, and Their Cures: Reply.” Political Analysis 22:279-280.
Hill, T. D., A. P. Davis, J. M. Roos, and M. T. French. 2020. “Limitations of Fixed-Effects Models for Panel Data.” Sociological Perspectives 63:357-369.
Hu, F. B., J. Goldberg, D. Hedeker, B. R. Flay, and M. A. Pentz. 1998. “Comparison of Population-Averaged and Subject-Specific Approaches for Analyzing Repeated Binary Outcomes.” American Journal of Epidemiology 147(7):694-703.
Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 62:467-490.
Keele, Luke, and Nathan J. Kelly. 2006. “Dynamic Models for Dynamic Theories: The Ins and Outs of Lagged Dependent Variables.” Political Analysis 14(2):186-205.
Lancaster, Tony. 2002. “Orthogonal Parameters and Panel Data.” Review of Economic Studies 69:647-666.
Liu, Licheng, Ye Wang, and Yiqing Xu. 2019. “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data.” Working paper: Stanford University.
Mummolo, Jonathan, and Erik Peterson. 2018. “Improving the Interpretation of Fixed Effects Regression Results.” Political Science Research and Methods 6:829-835.
Neuhaus, J. M., and J. D. Kalbfleisch. 1998. “Between- and Within-Cluster Covariate Effects in the Analysis of Clustered Data.” Biometrics 54(2):638-645.
Pickup, Mark, and Vincent Hopkins. 2020. “Transformed-Likelihood Estimators for Dynamic Panel Models with a Very Small T.” Political Science Research & Methods, forthcoming.
Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25:57-76.
Zorn, Christopher. 2001. “Generalized Estimating Equation Models for Correlated Data: A Review with Applications.” American Journal of Political Science 45(April):470-90.
Mandatory readings before course start
Hsiao, Cheng. 2007. “Panel Data Analysis — Advantages and Challenges.” Test 16:1-22.
Examination part
Students will be evaluated on two written homework assignments that will be completed during the course (20% each) and a final examination (60%). Homework assignments will typically involve a combination of simulationbased exercises and “real data” analyses, and will be completed during the evenings while the class is in session. For the final examination, students will have two alternatives:
• “InClass”: Complete the final examination in the afternoon of the last day of class (from roughly noon until 6:00 p.m. local time), or
• “TakeHome”: Complete the final examination during the week following the end of the course (due date: TBA).
Additional details about the final examination will be discussed in the morning session on the first day of the course.
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data with a panel / timeseries crosssectional component. Students will be required to specify, estimate, and interpret various statistical models, to conduct and present diagnostics and robustness checks, and to give detailed justifications for their choices.
Examination relevant literature
See above. Details of the examination literature will be finalized prior to the start of class.
Prerequisites (knowledge of topic)
• Some prior knowledge of R and/or programming is beneficial, but not required
Hardware
• Bring your own laptop
Software
• R and RStudio, most recent versions (free downloads available)
• You may want to bring your own credit card to create your own cloud accounts (for the database server and certain APIs). Accounts are typically free, but some require depositing a credit card number.
Course content
Online platforms such as Yelp, Twitter, Amazon, or Instagram are large-scale, rich, and relevant sources of data. Researchers in the social sciences increasingly tap into these data for field evidence when studying various phenomena.
In this course, you will learn how to find, acquire, store, and manage data from such sources and prepare them for follow-up statistical analysis in your own research.
After a short introduction to the relevance of data science skills for the social sciences, we will review R as a programming language and its basic data formats. We will then use R to program simple scrapers that systematically extract data from websites, using the packages rvest, httr, and RSelenium, among others. You will also learn how to read HTML, CSS, JSON, and XML code, to use regular expressions, and to handle string, text, and image data. To store the data, we will look into relational databases, (My)SQL, and related R packages. Many websites such as Twitter and Yelp offer convenient application programming interfaces (APIs) that facilitate the extraction of data, and we will look into accessing them from R. Finally, we will highlight some options for feature extraction from images and text, which allow us to augment the collected data with meaningful variables for use in our analysis.
At the end of this course, students should be able to identify valuable online data sources, to write basic scrapers, and to prepare the collected data such that they can use them for statistical analysis as part of their own research projects.
Throughout the course, students will work on a data-scraping project related to their theses. This project will be presented on the final day of the course.
All data scraping code and other sources will be made available on
https://www.datascraping.org.
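To give a sense of what such a scraper looks like, here is a minimal rvest sketch; the URL and the CSS selector are hypothetical placeholders, not a real target site:

```r
# Minimal web-scraping sketch with rvest (the URL and the ".review"
# selector are illustrative placeholders only)
library(rvest)

page    <- read_html("https://example.com/reviews")   # fetch and parse the HTML
reviews <- page |>
  html_elements(".review") |>                         # select nodes by CSS selector
  html_text2()                                        # extract cleaned text
head(reviews)
```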
Structure
Preliminary schedule:
Day 1
Intro to data scraping
Define students’ scraping projects
Review of R and introduction to programming with R
Afternoon: R programming exercises
Day 2
The anatomy of the internet and relevant data formats
Intro to web scraping with R (with httr, rvest, RSelenium)
Introduction to APIs
Afternoon: Scraping exercises
Day 3
Relational databases and SQL
Data management with R
Afternoon: Database design and implementation project (with MySQL in the cloud)
Day 4
Scraping examples from Yelp, Crowdspring, Twitter, and Instagram
Scaling up your scraper with parallel code and proxies
Feature extraction examples
Afternoon: Work on your scraping projects
Day 5
Wrap-up of course
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Literature
Mandatory
None, all readings will be provided during the course
Supplementary / voluntary
None, all readings will be provided during the course
Examination part
Individual quiz (60% of grade)
Presentation of students’ scraping projects (40% of grade)
Supplementary aids
Individual quiz: Closed book
Presentation of students’ scraping projects: Closed book
Examination content
Lecture slides covering key concepts of R and programming, the anatomy of the internet, relational databases, and scraping (slides will be provided as PDFs the day before classes).
• Students will need to understand R code when they see it but they will not be required to code during the exam.
Examination relevant literature
None.
Prerequisites (knowledge of topic)
Substantive Background: Students taking this course should have a general familiarity with the types of data that can be obtained through survey research. While not absolutely required, it would be useful if students bring to the course survey datasets from their own fields. But, even if students do not bring their own data, the instructor will provide several survey datasets for course use.
Statistical Methods: Students in this course should be familiar with multiple regression analysis and comfortable with the process of employing regression models to analyze empirical data.
Computing: Students in this course should have some prior exposure and basic experience with the R statistical computing environment. But specific packages and functions will be introduced and explained in detail throughout the course.
Hardware
Students in this course should bring their own laptop computers to class so they can access the software required to carry out the analyses in course examples and exercises.
Software
This course will rely on the R statistical computing environment. Students should install the latest version of R on their computers before the first class session. While not absolutely required, it is strongly recommended that students also install RStudio. Doing so will make it much easier to interact with the R system in productive ways.
The course material will use several R packages. Students should install the optiscale, psych, mokken, and smacof packages before the first class session. Additional R packages will be made available and used throughout the course.
Course content
This course is aimed at demonstrating to students how to complete three critical tasks with survey data: (1) combine several survey items into a more reliable and powerful scale, (2) assess the dimensionality of a set of attitudes, and (3) produce geometric maps of attitudes and preferences, so that the fundamental structure of people’s beliefs can be more readily interpreted. More generally, this course is aimed at aiding researchers in better measuring the phenomena they are interested in. Though researchers of all sorts recognize measurement as a fundamental and crucial step of the scientific process, the topic is rarely given formal attention in core graduate courses beyond a cursory treatment of the concepts of reliability and validity.
The course will cover a variety of strategies for producing quantitative (usually intervallevel) variables from qualitative survey responses (which are usually believed to be measured at the nominal or ordinal level). We will begin with a discussion of measurement theory, giving detailed consideration to such concepts as measurement level and measurement accuracy. This will lead us to optimal scaling strategies, for assigning numbers to objects. Following that, we will cover a variety of methods for combining multiple survey responses in order to produce higherquality summary measures. These include: summated rating (or “Likert”) scales and reliability of measurement; principal components analysis; item response theory; factor analysis; multidimensional scaling; the vector model for profile data; and correspondence analysis. Each of these methods applies a measurement model to empirical data in order to generate a quantitative representation of the observations and survey items. The results provide new variables that can be employed as input to subsequent statistical models. These methods are not just “mere” measurement tools; in addition to quantifying observations, they often provide useful new insights about the systematic structure that exists within those observations. And, from a practical perspective, consideration of measurement theory and scaling methods can guide researchers to construct more powerful batteries of survey questions.
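As a small preview of the software side, the reliability (Cronbach's alpha) of a summated rating scale can be computed with the psych package. The sketch below is illustrative only and uses simulated Likert-type items rather than real survey data:

```r
# Illustrative reliability sketch with simulated Likert-type items
library(psych)                        # assumed installed; provides alpha()
set.seed(1)
latent <- rnorm(200)                  # a common latent trait
items  <- sapply(1:5, function(i) {
  x <- round(latent + rnorm(200, sd = 0.8))   # five noisy indicators
  pmin(pmax(x, -3), 3)                        # clip to a 7-point response range
})
colnames(items) <- paste0("item", 1:5)
psych::alpha(as.data.frame(items))    # Cronbach's alpha and item statistics
```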
Structure
On each class day, the morning session will be used to introduce new concepts, models, and techniques. Some of this discussion may extend into the afternoon sessions. But most of the time during the afternoon sessions will be devoted to class exercises that give students an opportunity to apply the material discussed during the morning session.
Day 1
General introduction and basic concepts
Measurement theory
Optimal scaling
Summated rating scales (or, additive indexes)
Day 2
Reliability
Cumulative scales (or, Mokken scaling, IRT)
Day 3
Biplots
Principal components analysis
Day 4
Factor analysis (exploratory and confirmatory)
Multidimensional scaling
Day 5
More multidimensional scaling
Correspondence analysis
Literature
Mandatory
Unfortunately, there is no single textbook that covers all of the topics in this course. In addition, many of the texts that are available have certain drawbacks that limit their usefulness for our purposes: They tend to be very expensive; they usually assume a high level of mathematical sophistication; they often contain sections that are out of date. Because of these considerations, the required readings can be taken from two alternative sources: (1) The Sage series on Quantitative Applications in the Social Sciences (i.e., the “little green books”); or (2) chapters from The Wiley Handbook of Psychometric Testing, edited by Paul Irwing, Tom Booth, and David J. Hughes.
Sage QASS monographs:
Dunteman, George H. (1989) Principal Components Analysis.
Jacoby, William G. (1991) Data Theory and Dimensional Analysis.
Kim, Jae-On and Charles W. Mueller. (1978a) Introduction to Factor Analysis.
Kim, Jae-On and Charles W. Mueller. (1978b) Factor Analysis: Statistical Methods and Practical Issues.
Kruskal, Joseph B. and Myron Wish. (1978) Multidimensional Scaling.
McIver, John and Edward G. Carmines. (1981) Unidimensional Scaling.
Van Schuur, Wijbrandt. (2011) Ordinal Item Response Theory: Mokken Scale Analysis.
Weller, Susan C. and A. Kimball Romney. (1990) Metric Scaling: Correspondence Analysis.
Chapters from The Wiley Handbook of Psychometric Testing:
DeMars, Christine. “Classical Test Theory and Item Response Theory.”
Hughes, David J. “Psychometric Validity: Establishing the Accuracy and Appropriateness of Psychometric Measures.”
Jacoby, William G. and David J. Ciuk. “Multidimensional Scaling: An Introduction.”
Jennrich, Robert J. “Rotation.”
Meijer, Rob R. and Jorge N. Tendeiro. “Unidimensional Item Response Theory.”
Mulaik, Stanley A. “Fundamentals of Common Factor Analysis.”
Revelle, William and David M. Condon. “Reliability.”
Timmerman, Marieke E.; Urbano Lorenzo-Seva; Eva Ceulemans. “The Number of Factors Problem.”
Supplementary / voluntary
Armstrong II, David A.; Ryan Bakker; Royce Carroll; Christopher Hare; Keith T. Poole; Howard Rosenthal. (2014) Analyzing Spatial Models of Choice and Judgment with R.
Bartholomew, David J.; Fiona Steele; Irini Moustaki; Jane I. Galbraith. (2008) Analysis of Multivariate Social Science Data (Second Edition).
Borg, Ingwer and Patrick Groenen. (2005) Modern Multidimensional Scaling: Theory and Applications (Second Edition).
Cudeck, Robert and Robert C. MacCallum, Editors (2007) Factor Analysis at 100.
Lattin, James; J. Douglas Carroll; Paul E. Green. (2003) Analyzing Multivariate Data.
Mulaik, Stanley A. (2010) Foundations of Factor Analysis (Second Edition).
Wickens, Thomas D. (1995) The Geometry of Multivariate Statistics.
Mandatory readings before course start
None.
Examination part
Course participants will be evaluated on the basis of oral participation (20%) and a major homework exercise (80%). In the homework exercise, course participants will apply one or more of the techniques covered in the class to actual survey data. Ideally, students will have their own survey data drawn from their respective substantive fields. But, if not, the course instructor can provide some survey data drawn from political science and sociological applications.
Prerequisites (knowledge of topic)
Each student is to submit an outline (no more than 500 words in length) of a specific research question and/or a set of hypotheses that s/he would like to examine via an experimental approach. This outline (in PDF format, file name format: “LastNameFirstNameResQues.pdf”) should be emailed to ghaeubl@ualberta.ca with “GSERMEMBS” as the subject line by 23:00 (St. Gallen time) on the Friday prior to course start.
As part of the introductions on the first morning of the course, students will be asked to give 2-minute presentations on these research questions/hypotheses (and to say a few words about their broader areas of research interest).
The objectives behind this assignment are:
• to facilitate learning by ensuring that students have their own concrete research questions/hypotheses in mind as they engage with the material covered in the course
• to provide the instructor with input for tailoring the course content and/or class discussions to students’ interests
Course content
The objective of this course is to provide students with an understanding of the essential principles and techniques for conducting scientific experiments on human behavior. It is tailored for individuals with an interest in doing research (using experimental methods) in areas such as psychology, judgment and decision making, behavioral economics, consumer behavior, organizational behavior, and human performance. The course covers a variety of topics, including the formulation of research hypotheses, the construction of experimental designs, the development of experimental tasks and stimuli, how to avoid confounds and other threats to validity, procedural aspects of administering experiments, the analysis of experimental data, and the reporting of results obtained from experiments. Classes are conducted in an interactive seminar format, with extensive discussion of concrete examples, challenges, and solutions.
Topics
The topics covered in the course include:
• Basic principles of experimental research
• Formulation of research question and hypothesis development
• Experimental paradigms
• Design and manipulation
• Measurement
• Factorial designs
• Implementation of experiments
• Data analysis and reporting of results
• Advanced methods and complex experimental designs
• Ethical issues
Literature
Recommended
There is no textbook for this course.
However, here are some recommended books on the design (and analysis) of experiments:
Abdi, Edelman, Valentin, and Dowling (2009), Experimental Design and Analysis for Psychology, Oxford University Press.
Field and Hole (2003), How to Design and Report Experiments, Sage.
Keppel and Wickens (2004), Design and Analysis: A Researcher’s Handbook, Pearson.
Kirk (2013), Experimental Design: Procedures for the Behavioral Sciences, Sage.
Martin (2007), Doing Psychology Experiments, Wadsworth.
Oehlert (2010), A First Course in Design and Analysis of Experiments, available online at:
http://users.stat.umn.edu/~gary/book/fcdae.pdf.
In addition, the following papers are recommended as background readings for the course:
Cumming, Geoff (2014), “The New Statistics: Why and How,” Psychological Science, 25, 1, 7-29.
Elrod, Häubl, and Tipps (2012), “Parsimonious Structural Equation Models for Repeated Measures Data, With Application to the Study of Consumer Preferences,” Psychometrika, 77, 2, 358-387.
Goodman and Paolacci (2017), “Crowdsourcing Consumer Research,” Journal of Consumer Research, 44, 1, 196-210.
McShane and Böckenholt (2017), “Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability,” Journal of Consumer Research, 43, 6, 1048-1063.
Meyvis and Van Osselaer (2018), “Increasing the Power of Your Study by Increasing the Effect Size,” Journal of Consumer Research, 44, 5, 1157-1173.
Morales, Amir, and Lee (2017), “Keeping It Real in Experimental Research: Understanding When, Where, and How to Enhance Realism and Measure Consumer Behavior,” Journal of Consumer Research, 44, 2, 465-476.
Oppenheimer, Meyvis, and Davidenko (2009), “Instructional Manipulation Checks: Detecting Satisficing to Increase Statistical Power,” Journal of Experimental Social Psychology, 45, 867-872.
Pieters (2017), “Meaningful Mediation Analysis: Plausible Causal Inference and Informative Communication,” Journal of Consumer Research, 44, 3, 692-716.
Simmons, Nelson, and Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” Psychological Science, 22, 11, 1359-1366.
Simonsohn, Nelson, and Simmons (2014), “P-Curve: A Key to the File-Drawer,” Journal of Experimental Psychology: General, 143, 2, 534-547.
Spiller, Fitzsimons, Lynch, and McClelland (2013), “Spotlights, Floodlights, and the Magic Number Zero: Simple Effects Tests in Moderated Regression,” Journal of Marketing Research, 50, 277-288.
Zhao, Lynch, and Chen (2010), “Reconsidering Baron and Kenny: Myths and Truths about Mediation Analysis,” Journal of Consumer Research, 37, 197-206.
Examination part
Students are to complete a (2-hour) written exam in the afternoon of the last day of class. In the exam, students are given a description of a research question, along with specific hypotheses. They are to produce a proposal for an experiment, or a series of experiments, for testing these hypotheses. The exam is “open book” – that is, students are free to use any appropriate local resources they wish in developing their proposal. (Here, “local” means that students may not access the Internet or other communication networks.)
Regular attendance and active participation in class discussion are expected.
Common standards of academic integrity apply. Work submitted by students must be their own – submitting what someone else has created is not acceptable.
Grading
A student’s overall grade is based on the following components:
– Initial Assignment and Presentation: 10%
– Class Participation: 20%
– Exam: 70%
Prerequisites (knowledge of topic)
Probability theory at a good level, an affinity for mathematical problems, and advanced econometrics.
Hardware
Laptops for the PC sessions.
Software
Exercises will require the usage of the statistical software R.
Learning objectives
The goal of this course is to provide a comprehensive overview of the mathematical theory behind machine learning. How can we characterize a good prediction? How can we construct good predictions based on machine learning methods? What is the relationship between (1) estimation error, (2) sample size, and (3) model complexity? How do these abstract concepts apply in particular machine learning methods such as Boosting, Support Vector Machines, Ridge, and LASSO? The objective of the course is to give detailed and intuitively clear answers to those questions. As a result, participants will receive a good preparation for theoretical and empirical work with and on machine learning methods.
Course content
1. Principles of statistical theory (loss function and risk, approximation vs estimation error, no free lunch theorems)
2. Concentration inequalities for bounded loss functions (Hoeffding’s Lemma, Azuma-Hoeffding’s inequality, bounded difference inequality, Bernstein’s inequality, McDiarmid’s inequality)
3. Classification (binary case and its loss function, Bayesian classifier, Optimality of the Bayesian Classifier, Oracle inequalities for the Bayesian classifier, Finite dictionary learning case, The impact of noise on convergence rates, infinite dictionary)
4. General case (general loss functions, symmetrization, Rademacher complexity, Covering numbers, Chaining)
5. Applications Part 1: Support Vector Machines, boosting
6. The mathematics and statistics of regularization methods (LASSO, Ridge, elastic net)
7. Applications Part 2: applying LASSO and Ridge
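To fix ideas for topic 2, the prototypical concentration result is Hoeffding's inequality: for independent random variables X_1, ..., X_n with X_i taking values in [a_i, b_i], the sample mean concentrates around its expectation at an exponential rate:

```latex
\Pr\left( \left| \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - \mathbb{E}[X_i]\bigr) \right| \ge t \right)
\;\le\; 2\exp\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n}(b_i - a_i)^2} \right).
```

Bounds of this type are the basic tool for controlling the estimation error of empirical risk minimizers in the later topics.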
Structure
Part I.
Concepts of statistical learning: Concentration inequalities, concepts of statistical theory (topics 1 and 2 from the course content)
Part II.
The math of Machine learning and Classification. (topic 3 from course content)
Part III.
The Machine learning methods and the general case (topics 4 and 5 from the course content)
Part IV.
LASSO and Ridge (topics 6 and 7 from the course content).
Literature
Mandatory
There will be a lecture script.
Supplementary / voluntary
The book “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman gives a nice introduction to Boosting and Support Vector Machines.
Further topic-specific, non-obligatory references will be given during the lecture.
Examination part
Final written examination at the end of the course (100%)
Supplementary aids
‘Closed Book’. No external references allowed.
Examination content
While the course is very technical, only intuition is required for the exam. “Intuition” here means: state which assumptions are necessary for a result and give a verbal, possibly graphical, reason (participants able to give precise mathematical reasoning may do so instead). Participants should be able to give the intuition for the following concepts: calculating a loss function; the main concentration inequalities; the optimality of the Bayesian classifier; the finite dictionary learning case; the impact of noise (Massart’s noise condition, the Mammen–Tsybakov noise condition); the infinite dictionary learning problem; defining and explaining Rademacher complexity and its relation to cardinality; calculating the VC dimension and its applications to empirical risk; explaining symmetrization and its relation to Rademacher complexity in the general case; how these concepts apply to Support Vector Machines and Boosting; and the mathematical intuition behind the LASSO and Ridge methods, in particular under orthonormal design.
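For instance, the Rademacher complexity listed above is standardly defined (quoted here only for orientation) for a function class $\mathcal{F}$ and sample $X_1,\dots,X_n$ as

```latex
\mathcal{R}_n(\mathcal{F})
  \;=\; \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \right],
```

where the $\sigma_i$ are i.i.d. Rademacher variables taking the values $\pm 1$ with probability $1/2$ each; its relation to cardinality comes from Massart's finite-class lemma.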
Examination relevant literature
Lecture script
Qualitative Research Methods and Data Analysis presents strategies for analyzing and making sense of qualitative data. Both descriptive and interpretive qualitative studies will be discussed, as will more defined qualitative approaches such as grounded theory, narrative analysis, and case studies. The course will briefly cover research design and data collection strategies but will largely focus on analysis. In particular, we will consider how researchers develop codes and integrate memo writing into a larger analytic process. The purpose of coding is to provide a focus to qualitative analysis; it is critical to have a handle on coding practices as you move deeper into analysis. The course will present coding and memo writing as concurrent tasks that occur during an active review of interviews, documents, focus groups, and/or multi‑media data. We will discuss deductive and inductive coding and how a codebook evolves, that is, how codes might “emerge” and shift during analysis. Managing codes includes developing code hierarchies, identifying code “constellations,” and building multidimensional themes.
The class will present memo writing as a strategy for capturing analytical thinking, inscribed meaning, and cumulative evidence for condensed meanings. Memos can also resemble early writing for reports, articles, chapters, and other forms of presentation. Researchers can also mine memos for codes and use memos to build evocative themes and theory. Coding and memo writing are discussed in the context of data-driven qualitative research beginning with design and moving toward presentation of findings. The course will also discuss using visual tools in analysis, such as diagramming core quotations from data to holistically present the participant’s key narratives. Visual tools can also assist in looking horizontally across many transcripts to identify connective themes and link the parts to the whole.
Software
We will spend one day learning a qualitative analysis software package:
GSERM St. Gallen Atlas.TI
GSERM Ljubljana NVIVO
If the course is held in a remote format, we will work with MAXQDA.
The methods discussed in the course will be applicable to qualitative studies in a range of fields, including the behavioral sciences, social sciences, health sciences, communications, and business.
Structure
Day 1
 Core Principles and Practices in Qualitative Data Inquiry
 Qualitative Research Design: An Overview
 Data types
 Comparative strategies
 Qualitative sampling
 Triangulation
Analysis Task 1: Memo Writing
 Document summary memos
 Key-quote memos
 Methods memos
Day 2
 Analysis Task 2: Using Visual Tools
 Episode profiles
 Making sense of data using diagrams
 Working with core quotations
Analysis Task 3: Coding Qualitative Data
 Descriptive coding
 Interpretive coding
 Strategies for coding
 Line‑by‑line coding
 Creating a codebook
Day 3
 Introduction to Qualitative Software: MAXQDA (see information at “Software”)
a. Overview
b. Beginning a project
c. Writing comments and memos
d. Coding data
 Hands‑on Exercises Using MAXQDA
 Analysis in MAXQDA
 Exploring codes and memos in queries
 Matrices and diagrams
 Blending quantitative and qualitative data
Day 4
 Methodological Traditions
a. Grounded theory
b. Narrative analysis
c. Case study
d. Pragmatic qualitative analysis
Day 5
 Qualitative Research Design: Revisiting Strategies
 Data Collection considerations
 Types
• Interviews
• Focus groups
• Other types of data
 Developing interviewing skills
 Other data types
 Evaluating qualitative articles
 Class discussion
Suggested Reading (Articles)
Electronic version of these articles will be provided to registered participants:
Ahlsen, Birgitte, et al. 2013. “(Un)doing Gender in a Rehabilitation Context: A Narrative Analysis of Gender and Self in Stories of Chronic Muscle Pain.” Disability and Rehabilitation: 1–8.
Charmaz, Kathy. 1999. “Stories of Suffering: Subjective Tales and Research Narratives.” Qualitative Health Research 9:362‑82.
Sandelowski, Margarete. 2000. “Whatever Happened to Qualitative Description?” Research in Nursing and Health 23:334‑40.
Rouch, Gareth, et al. 2010. “Public, Private and Personal: Qualitative Research on Policymakers’ Opinions on Smokefree Interventions to Protect Children in ‘Private’ Spaces.” BMC Public Health 10:797‑807.
Suggested Reading (Books)
Charmaz, Kathy. 2006. Constructing Grounded Theory: A Practical Guide through Qualitative Analysis. Sage.
Marshall, Catherine, and Gretchen B. Rossman. 2006. Designing Qualitative Research. 4th ed. Sage.
Yin, Robert. 2013. Case Study Research Design and Methods. Sage.
Examination
Participants will be asked to read several interviews or journal entries and generate a preliminary analysis of the data using techniques discussed during the course. This examination will be due three weeks after the course ends.
Examination content
Students will have to demonstrate familiarity with the differences between grounded theory, narrative analysis, case study, and pragmatic analysis. The assignment will require them to choose one of these approaches to design a study and analyze several documents provided by the instructor. Their preliminary analysis will include memos, a codebook, diagrams, early findings, and reflection on next steps.
Prerequisites and content
Prerequisite knowledge for the course includes the fundamentals of probability and statistics, especially hypothesis testing and regression analysis. This intermediate level course assumes that students can interpret the results of Ordinary Least Squares, Probit, and Logit regressions. They should also be familiar with the problems that are most common in regression, such as multicollinearity, heteroscedasticity, and endogeneity. Finally, students should be comfortable working with computers and data. No prior knowledge of R or network analysis is required.
The concept of “social networks” is increasingly a part of social discussion, organizational strategy, and academic research. The rising interest in social networks has been coupled with a proliferation of widely available network data, but there has not been a concomitant increase in understanding how to analyze social network data. This course presents concepts and methods applicable for the analysis of a wide range of social networks, such as those based on family ties, business collaboration, political alliances, and social media.
Classical statistical analysis is premised on the assumption that observations are sampled independently of one another. In the case of social networks, however, observations are not independent of one another, but are dependent on the structure of the social network. The dependence of observations on one another is a feature of the data, rather than a nuisance. This course is an introduction to statistical models that attempt to understand this feature as both a cause and an effect of social processes.
Since network data are generated in a different way than many other kinds of social data, the course begins by considering the research designs, sampling strategies, and data formats that are commonly associated with network analysis. A key aspect of performing network analysis is describing various elements of the network’s structure. To this end, the course covers the calculation of a variety of descriptive statistics on networks, such as density, centralization, centrality, connectedness, reciprocity, and transitivity. We consider various ways of visualizing networks, including multidimensional scaling and spring embedding. We learn methods of estimating regressions in which network ties are the dependent variable, including the quadratic assignment procedure and exponential random graph models (ERGMs). We consider extensions of ERGMs, including models for two-mode data and networks over time.
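Two of the descriptive statistics mentioned above can be illustrated in a few lines of code. The sketch below is plain Python rather than the R used in class, and the function names are my own; it computes density and normalized degree centrality for a toy undirected network.

```python
# Hedged illustration (plain Python, though the course itself works in R):
# density and normalized degree centrality from an undirected edge list.

def density(n_nodes, edges):
    """Share of possible undirected ties that are actually present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible

def degree_centrality(n_nodes, edges):
    """Each node's degree divided by the maximum possible degree, n - 1."""
    degree = {v: 0 for v in range(n_nodes)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {v: d / (n_nodes - 1) for v, d in degree.items()}

# A toy 4-node "star" network: node 0 is tied to everyone else.
edges = [(0, 1), (0, 2), (0, 3)]
print(density(4, edges))            # 3 of 6 possible ties -> 0.5
print(degree_centrality(4, edges))  # node 0 is maximally central -> 1.0
```

The same quantities are one-liners with R packages such as `sna`; the point here is only that these measures are simple functions of the adjacency structure.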
Instruction is split between lectures and handson computer exercises. Students may find it to their advantage to bring with them a social network data set that is relevant to their research interests, but doing so is not required. The instructor will provide data sets necessary for completing the course exercises.
Structure
Day 1: Fundamentals of Network Analysis
 Why undertake network analysis?
 How network analysis differs from other statistical methods
 Elements of networks (Nodes, links, modes, attributes, matrices, graphs)
 Key concepts (directionality, symmetry)
 Visualization
 Sampling
 Survey methods
 Working with network data in R
Day 2: Descriptive and Inferential Statistics
 Density
 Degree distributions
 Centrality (degree, betweenness, closeness, power)
 Centralization
 Components and cores
 Triads, triples, and transitivity
 Clustering
 Correlation and the Quadratic Assignment Procedure
 Random graphs
 Descriptive and inferential statistics in R
Day 3: Exponential Random Graph Models (ERGMs)
 Theory
 Specification
 Estimation
 Goodness of Fit
 Working with one-mode and two-mode ERGMs in R
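For orientation, the ERGMs covered on Day 3 share a standard general form (as in, e.g., the Hunter et al. 2008 reference in the literature list):

```latex
\Pr(\mathbf{Y} = \mathbf{y})
  \;=\; \frac{\exp\{\theta^{\top} g(\mathbf{y})\}}{\kappa(\theta)},
```

where $g(\mathbf{y})$ is a vector of network statistics (edges, triangles, homophily terms, and so on), $\theta$ the coefficients to be estimated, and $\kappa(\theta)$ the normalizing constant obtained by summing over all possible networks; the intractability of $\kappa(\theta)$ is why estimation is a topic in its own right on the schedule.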
Day 4: Network Data over Time Using Temporal ERGMs
Day 5: Student Presentations and Extensions of ERGM
 Student Presentations
 Additional extension of ERGMs, if time allows
 Concluding Discussion
Literature
Breiger, Ronald L. 1974. “The Duality of Persons and Groups.” Social Forces 53 (2): 181–190.
Burt, Ronald S. 1992. Structural Holes: The Social Structure of Competition. Cambridge, MA: Harvard University Press. Pp. 8–49.
Butts, Carter T. 2008. “network: A Package for Managing Relational Data in R.” Journal of Statistical Software 24 (2): 1–36.
Butts, Carter T. 2008. “Social Network Analysis with sna.” Journal of Statistical Software 24 (6): 1–51.
Cranmer, Skyler J., Bruce A. Desmarais and Jason W. Morgan. 2021. Inferential Network Analysis. New York: Cambridge University Press.
Cranmer, Skyler J., Philip Leifeld, Scott D. McClurg, and Meredith Rolfe. 2017. “Navigating the Range of Statistical Tools for Inferential Network Analysis.” American Journal of Political Science 61 (1): 237–251.
Denny, Matthew J. 2016. “Getting Started with GERGM.” https://www.mjdenny.com/getting_started_with_GERGM.html
Emirbayer, Mustafa. 1997. “Manifesto for a Relational Sociology.” American Journal of Sociology 103 (2): 281–317.
Freeman, Linton C. 1977. “A Set of Measures of Centrality Based on Betweenness.” Sociometry 40 (1): 35–41.
Gould, Roger V., and Roberto M. Fernandez. 1989. “Structures of Mediation: A Formal Approach to Brokerage in Transaction Networks.” Sociological Methodology 19: 89–126.
Granovetter, Mark. 1973. “The Strength of Weak Ties.” American Journal of Sociology 78 (6): 1360–1380.
Heaney, Michael T. 2014. “Multiplex Networks and Interest Group Influence Reputation: An Exponential Random Graph Model.” Social Networks 36 (1): 66–81.
Heaney, Michael T., and Philip Leifeld. 2018. “Contributions by Interest Groups to Lobbying Coalitions.” Journal of Politics 80 (2): 494–509.
Heckathorn, Douglas D. 1997. “Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations.” Social Problems 44 (2): 174–199.
Hunter, David R., Mark S. Handcock, Carter T. Butts, Steven M. Goodreau, and Martina Morris. 2008. “ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks.” Journal of Statistical Software 24 (3): 1–29.
Krackhardt, David. 1992. “The Strength of Strong Ties: The Importance of Philos in Organizations.” Pp. 216–239 in Nitin Nohria and Robert Eccles, eds., Networks and Organizations: Structure, Form, and Action. Boston, MA: Harvard Business School Press.
Laumann, Edward O., Peter V. Marsden, and David Prensky. 1983. “The Boundary Specification Problem in Network Analysis.” Pp. 18–34 in Ronald S. Burt and Michael Minor, eds., Applied Network Analysis. Beverly Hills, CA: Sage.
Leifeld, Philip, and Skyler J. Cranmer. 2019. “A Theoretical and Empirical Comparison of the Temporal Exponential Random Graph Model and the Stochastic Actor-Oriented Model.” Network Science 7 (1): 20–51.
Leifeld, Philip, Skyler J. Cranmer, and Bruce A. Desmarais. 2018. “Temporal Exponential Random Graph Models with btergm: Estimation and Bootstrap Confidence Intervals.” Journal of Statistical Software 83 (6): 1–36.
McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27: 415–444.
Morris, Martina, Mark S. Handcock, and David R. Hunter. 2008. “Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects.” Journal of Statistical Software 24 (4): 1–24.
Podolny, Joel M. 2001. “Networks as the Pipes and Prisms of the Market.” American Journal of Sociology 107 (1): 33–60.
Scott, John T. 2017. Social Network Analysis, 4th ed. London: Sage.
Strogatz, Steven. 2010. “The Enemy of My Enemy.” New York Times (February 14).
Watts, Duncan. 1999. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton: Princeton University Press. Pp. 340.
Exam
75%: There will be one written, computer-based problem set each day, Monday through Thursday (four assignments in total). Time will be allocated in class to complete the assignments, which must be submitted each day.
25%: On the final day of the course, each student will give a presentation to the class on the results of her or his research project for the week. Giving a presentation to the course is required to receive a satisfactory grade in the course.
Course content
The goal is to develop an applied and intuitive (not purely theoretical or mathematical) understanding of the topics and procedures, so that participants can use them in their own research and also understand the work of others. Whenever possible, presentations will be given in “Words,” “Picture,” and “Math” languages in order to appeal to a variety of learning styles.
Advanced regression topics will be covered only after the foundations have been established. The ordinary least squares multiple regression topics that will be covered include:
 Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
 Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
 Dichotomous dependent variables: Logit and Probit analysis.
 Outliers, influence, and leverage.
 Advanced diagnostic plots and graphical techniques.
 Matrix algebra: A quick primer. (Optional)
 Regression models… now from a matrix perspective.
 Heteroskedasticity: Definition, consequences, detection, and correction.
 Autocorrelation: Definition, consequences, detection, and correction.
 Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
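For orientation, the standard result behind the final bullet can be stated compactly: when the error variance is not constant, $\operatorname{Var}(\varepsilon \mid X) = \sigma^2 \Omega$ with $\Omega \neq I$, OLS remains unbiased but is no longer efficient, and the GLS estimator

```latex
\hat{\beta}_{GLS} = \left( X^{\top} \Omega^{-1} X \right)^{-1} X^{\top} \Omega^{-1} y
```

restores efficiency; WLS is the special case in which $\Omega$ is diagonal, so each observation is simply weighted by the inverse of its error variance.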
Structure
This course will utilize approximately 325 pages of “Lecture Transcripts.” These Lecture Transcripts are organized in nine Packets and will serve as the sole required textbook for this course. (They also will serve as an information resource after the course ends.) In addition, the Lecture Transcripts will significantly reduce the amount of notes participants have to write during class, which means they can concentrate much more on learning and understanding the material itself. These nine Packets will be provided at the beginning of the first class.
It is important to note that this is a course on regression analysis, not on computer or software usage. While in‑class examples are presented using SPSS, participants are free and encouraged to use the statistical software package of their choice to replicate these examples and to analyze their own datasets. Note that many statistical software packages can be used with the material in this course. Participants can, at their option, complete several formative data analysis projects; a detailed and comprehensive “Tutorial and Answer Key” will be provided for each.
Prerequisites
This course is a continuation of Tim McDaniel’s “Regression I – Introduction” course. While it is not necessary that participants have taken that specific course, they will need to be familiar with many of the topics that are covered in it.
Note: We will use matrix algebra in the second half of the course. We will not use calculus.
Literature
The aforementioned Lecture Transcript Packets that we will use in each class serve as the de facto required textbook for this course.
In addition, the course syllabus includes full bibliographic information pertaining to several supplemental (and optional) readings for each of the nine Packets of Lecture Transcripts.
 Some of these readings are from four traditional textbooks, each of which takes a somewhat (though at times only subtly) different pedagogical approach.
 The optional supplemental readings also include several “little green books” from the Sage Series on Quantitative Applications in the Social Sciences.
 Finally, I have included several articles from a number of journals across several academic disciplines. Some of these optional supplemental readings are older classics and others are more recently written and published.
Examination part
Decentral ‑ Written examination (100%)
Supplementary aids
Open Book
Examination content
The potential substantive content areas for the Final Examination are:
 Various F‑tests (e.g., group significance test; Chow test; relative importance of variables and groups of variables; comparison of overall model performance).
 Categorical independent variables (e.g., new tests for “Intervalness” and “Collapsing”).
 Dichotomous dependent variables: Logit and Probit analysis.
 Outliers, influence, and leverage.
 Advanced diagnostic plots and graphical techniques.
 Regression models… now from a matrix perspective.
 Heteroskedasticity: Definition, consequences, detection, and correction.
 Autocorrelation: Definition, consequences, detection, and correction.
 Generalized Least Squares (GLS) and Weighted Least Squares (WLS).
Since this final examination is the only artifact that will be formally graded in the course, it will determine the course grade. Note that class attendance, discussion participation, and studying the material outside of class are indirectly very important for earning a good score on the final examination.
The final examination will be written, open-book (i.e., class notes, Lecture Transcripts, and Tutorial and Answer Key documents are allowed), and open-note. No other materials, including laptops, cell phones, or other electronic devices, will be permitted. The written final exam will be two hours in length and administered during the last course meeting.
Literature
Literature relevant to the exam:
 Lecture Transcripts (nine Packets; approximately 325 pages).
 Class notes (taken by each participant individually).
 Assignment Tutorial and Answer Key documents (for each optional data analysis project).
Supplementary/Voluntary literature not directly relevant to the exam:
 Optional supplemental readings listed in the course syllabus (and discussed earlier).
 Any other textbooks, articles, etc., the participant reads before or during the course.
Prerequisites (knowledge of topic)
This course is a continuation of Introductory Machine Learning with R and assumes a basic knowledge of at least several machine learning classification methods. Students with equivalent real-world experience (via other ML courses or on-the-job experience) are also welcome.
Hardware
A laptop computer is required to complete the inclass exercises.
Software
R (https://www.rproject.org/) and R Studio (https://www.rstudio.com/products/rstudio/) are available at no cost and are needed for this course.
Course content
With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches for preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover the practical techniques that are not often found in textbooks but discovered through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle (https://www.kaggle.com/). The target audience includes students who are interested in applying ML knowledge to more difficult problems and in learning more advanced techniques to improve the performance of traditional ML methods.
Structure
The course will be designed to be interactive, with ample time for handson practice. Each day will include at least one lecture based on the day’s topic in addition to a handson “lab” section to apply the learnings to a competition dataset (or one’s own data).
The tentative schedule is as follows:
Day 1: Handling messy data
Discussion: Typically, 80% of the time spent on ML is for data preparation. Why?
Lecture: Learning to explore data
Lecture: Missing values – imputation and other strategies
Lecture: The R data pipeline – tidyverse
Lab: Getting to know your data
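The simplest baseline from the imputation lecture, mean imputation, can be sketched in a few lines. This is a plain-Python illustration with hypothetical function names, not the tidyverse workflow used in class:

```python
# Hedged sketch: replace missing numeric values (None) with the mean of
# the observed values -- a common, if crude, baseline imputation strategy.

def impute_mean(values):
    """Return a copy of `values` with None entries replaced by the mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [23, None, 31, 26, None]
print(impute_mean(ages))  # [23, 26.666..., 31, 26, 26.666...]
```

More careful strategies (model-based imputation, missingness indicators) preserve more of the data's structure; the lecture covers the trade-offs.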
Day 2: Understanding ML performance
Discussion: What makes a successful ML model?
Lecture: Getting beyond accuracy – other performance measures
Lecture: The “no free lunch” theorem
Lecture: Estimating future performance – sampling methods, model selection
Lab: Comparing models on your dataset with ROC curves
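To illustrate "getting beyond accuracy," the sketch below (plain Python with hypothetical names, not the R workflow used in class) computes precision, recall, and F1, which can diverge sharply from accuracy on imbalanced data:

```python
# Hedged sketch: precision, recall, and F1 from true vs. predicted labels.

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# 90% accuracy, yet the model finds only half of the rare positives:
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # precision 1.0, recall 0.5
```

ROC curves extend this idea by tracing the true-positive/false-positive trade-off across classification thresholds rather than at a single cut-point.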
Day 3: Improving ML performance
Discussion: What factors keep ML models from perfect prediction?
Lecture: Tuning stock models – automated parameter tuning
Lecture: Meta-learning – ensembles, stacked models
Lab: Machine Learning Competition (Round 1)
Day 4: “Big data” problems
Discussion: Is more data always better? Why or why not?
Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
Lecture: Imbalanced datasets – under- and oversampling strategies
Lecture: Improving R’s performance on big data
Lab: Machine Learning Competition (Round 2)
Day 5: Next-generation “Black Box” methods
Discussion: What are the strengths and weaknesses of man versus machine?
Lecture: Deep Learning – Keras, Tensorflow
Lecture: Text embeddings – word2vec
Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
Discussion: Results of ML Competition – winners’ tips and tricks
Lab: Work on your final project
Literature
Mandatory
PDFs with readings will be distributed prior to the start of each class day.
Supplementary / voluntary
None required.
Mandatory readings before course start
Students should have R and R Studio installed on their laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.
Examination part
80% of the course grade will be based on a project and final report (approximately 5–10 pages), to be delivered within 2–3 weeks after the course in R Notebook format. The project is based on a challenging real-world dataset given to all course participants. The project will be graded based on its use of the methods covered in class as well as on making appropriate conclusions from the data.
The remaining 20% of the course grade will be based on participation during in-class discussions and performance during the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for this competition will be provided prior to the competition.
Supplementary aids
Students may reference any literature as needed when writing the final report.
Examination content
The primary goal of the final project is for students to gain an ability to solve difficult ML tasks. The project should reflect an understanding of the material covered throughout the week, as well as an ability to apply the material in new and innovative ways.
Literature
Not applicable.
Prerequisites (knowledge of topic)
Note that this course is designed for the applied analyst; its focus is on teaching you the tools you need for effective model evaluation, presentation, and interpretation. Support and code for model estimation and post-estimation are provided for R as well as Stata.
At the end of this course, you should have a clear understanding as to which types of models and methods are available to answer different research questions, and also have experience applying a varied toolkit of these models.
Course content
Course website: http://www.shawnasmith.net/gserm
Software & Computing:
Models for this course are presented in broad strokes; however, a major component of this course is application through model estimation, post-estimation, and interpretation. For pedagogical purposes, I will use Stata 15 in course lectures; however, course support will be provided for both Stata and R. While Stata—and most popular statistical software packages—includes native estimation (and even post-estimation) commands for categorical models, we will also use a set of ado-files written for Stata by Scott Long & Jeremy Freese that facilitate the (at times complicated) interpretation of categorical models within Stata. This suite of commands is called SPost. These post-estimation commands can also be emulated in R, although this will require more investigation on the part of the student. A variety of R packages relevant to the course now exist, and we are happy to provide guidance on them when possible. With respect to the machine learning models, we will be making use of several user-written packages in both Stata and R. If you are taking this course for credit, you will need to complete assignments using either Stata with SPost 13 commands or R with appropriate post-estimation commands.
• Getting Access to Stata/R:
o Stata: Access to Stata is available through the GSERM labs. Several versions are also available to purchase at different price points; I am happy to provide guidance as to which would be most appropriate for your needs.
o R: R is free. You can download it at https://www.rproject.org/
R Studio is a free program that greatly upgrades R’s user interface and can be downloaded at https://www.rstudio.com/
• Getting Started using Stata: New to Stata? No worries—this course will catch you up quickly. However, I strongly suggest working through the “Getting Started using Stata” document available on the course website (http://shawnasmith.net/gserm/) prior to Day 1 of class. Feel free to get in touch if you have questions.
o New to R? As R has a steeper learning curve, I would not recommend attempting to learn R solely for the purposes of this course. However, I am happy to recommend resources for those of you so inclined:
The two textbooks recommended above provide good introductions for ‘getting started’ with R, as well as lots of in situ code.
Mike Marin (UBC Public Health) has a great series of videos introducing R online at http://www.statslectures.com/index.php/rstatsvideostutorials/gettingstartedwithr.
• Downloading Stata packages: If you will be using Stata on a personal computer, then you will need to install several user-written packages. Here’s the step-by-step:
o Prereqs: Internet access & administrative privileges
o In Stata, type search {package name} into the command line.
o In the viewer window that appears, click the link for the package
o Follow directions to install
• Accessing course data and usecda: Course data will be available for download through the course website. It is also available for use in Stata with the usecda command. usecda is a command written specifically for this course to expedite access to course datasets & examples. It is currently only available for download through Shawna’s Github account. To download on any computer:
o Tell Stata where to download the file from by using the following command in the Stata command line or in a do-file:
net from “https://shawnana79.github.io/data”
o Install the program by either: (a) clicking on the blue usecda link that appears in the output following the previous command; or (b) using the command: net install usecda
o Check out the help file by typing help usecda in the Stata command line
By nature or by measurement, dependent variables of interest to social and behavioral scientists are frequently categorical. Outcomes that include several ranked or unranked, non-continuous categories—like vote choice, social media platform preference, brand loyalty, and/or condom use—are often of interest, with scientists expressly interested in developing models to explain or classify variation therein. Explanatory models are process-focused and aim to determine the individual impact of factors that contribute to a particular outcome, often based on a priori theory—e.g., “How does social class affect whether an individual voted for the Conservatives in 2019?” Classification models, alternatively, are outcome-focused and aim to identify the set of factors that most accurately classify (or predict) a particular outcome—e.g., “How do the Tories best use information from polls, geography, weather, Twitter feeds, and/or social demographics to predict who voted Conservative in 2019?”
Chances are your research involves a categorical outcome—binary, ordinal, or multinomial—and options thus abound for the modeling approach(es) you might take to address your research question of interest. This course is designed to provide an overview of a number of parametric and nonparametric approaches to exploring your outcome of interest, from both explanatory and classification perspectives.
Structure
N.B.: The exact content of the course will vary depending on the background & interests of participants. In other words, this schedule is subject to change.
Topic on Monday:
• Overview of class; introduction to models; some vocabulary (Pt 1)
Explanatory Models
• The 30-Minute Review of linear regression; Identification; Maximum Likelihood Estimation
• Linear probability model; Identification of Pr(y=1); Two philosophies: transformational and latent variable approach for binary outcomes
• Estimation of BRM; Odds ratios
• Using Pr(y=1) to interpret the BRM (pt. 1): tables & plots; discrete change
Suggested Readings:
• Long Ch. 1
• HT&J Ch. 1
• JWH&T Ch. 1
• Long Ch. 2; P&X Ch. 2; L&F Ch. 1–2; F&W Ch. 1–2 or Monogan Ch. 1–2 (R)
• Long Ch. 3; P&X Ch.1
Due:
A1: Math Review
Topic on Tuesday:
• Using Pr(y=1) to interpret the BRM (pt. 2): plots; difference at means vs. mean of difference; partial change/margins
• Hypothesis testing; Wald and LR tests; Confidence intervals
• Scalar measures of fit: pseudo-R2, AIC, BIC
• Ordinal variables; a latent variable model
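For a concrete sense of the scalar fit measures listed above, here is a minimal sketch (written in Python for self-containment, though the course software is R/Stata) computing McFadden's pseudo-R2, AIC, and BIC from hypothetical log-likelihood values:

```python
import math

def fit_measures(ll_model, ll_null, k, n):
    """McFadden's pseudo-R2, AIC, and BIC from the fitted and null
    (intercept-only) log-likelihoods, for k parameters and n cases."""
    pseudo_r2 = 1.0 - ll_model / ll_null
    aic = -2.0 * ll_model + 2.0 * k
    bic = -2.0 * ll_model + k * math.log(n)
    return pseudo_r2, aic, bic

# Hypothetical log-likelihood values, for illustration only.
r2, aic, bic = fit_measures(ll_model=-120.5, ll_null=-150.0, k=4, n=500)
```

Note that BIC penalizes parameters more heavily than AIC whenever n > 7 or so, which is why the two criteria can rank competing models differently.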
Suggested Readings:
• Long Ch. 4
• Long Ch. 5; P&X Ch. 7
Topic on Wednesday:
• Estimation of ORM; latent variable interpretations; Pr(y=k)
• Odds ratios; parallel regression assumption and proportional odds
• Multinomial logit as a set of BLMs
• Calculating predicted probabilities; Interpretation using Pr(y=k)
• Odds ratio plots; Discrete change plots
• Tests for the MNLM; IIA
Suggested Readings:
• Long Ch. 6; P&X Ch. 8
Topic on Thursday:
Classification Models
• Overview of classification models; some vocabulary (Pt. 2)
• Using Pr(y=1) to interpret the BRM (pt. 3): AUC, ROC, penalization
• Strengths & limitations of BRM for classification
• Introduction to partition-based models for classification
• CART models: Testing, evaluating, improving
• Random forests: Testing, evaluating, improving
• Strengths & limitations of CART methods for classification
Suggested Readings:
• JWH&T Ch. 4.1–4.3, 5; HT&J Ch. 4.4
• JWH&T Ch. 8; HT&J Ch. 9.1–9.3
Due:
A2: BRM + T&F
Topic on Friday:
• Introduction to semi-linear models for classification
• kNN models: Testing, evaluating, improving
• Strengths & limitations of kNN methods for classification
• Support Vector Machine (SVM) models: Testing, evaluating, improving
• Strengths & limitations of SVM models for classification
• Course wrapup/other topics on request, as time allows
Suggested Readings:
• JWH&T Ch. 4.6; HT&J Ch. 2.3; 13.1, 13.3
• JWH&T Ch. 9; HT&J Ch. 12.1–12.3
Sunday:
A3: ORM & MNLM and A4: Classification Methods
due via email to shawnana@umich.edu; include "GSERM-Cat:" in the subject line
The first half of this course will focus on explanatory methods, with an emphasis on regression methods for categorical outcomes. Although regression models for categorical outcomes are often conceptualized as extensions of linear regression models (i.e., ‘generalized linear models’), categorical outcomes violate key assumptions of the simple linear regression framework, and thus require both alternative estimation strategies and additional identification assumptions. These assumptions have implications for model interpretation, notably for interpreting coefficients, comparing coefficients, testing for significance, and assessing model fit. We will begin by deriving the logit and probit models for binary outcomes, and introduce a variety of post-estimation tools for interpreting the effects of predictor variables on binary outcomes. We will then extend these models and methods of interpretation from binary to ordinal outcomes using the ordinal logit and probit models, and to multinomial outcomes with the multinomial logit model. Methods for examining model fit and evaluating significance tests will also be discussed.
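To make the interpretation issues concrete: in a logit model the odds ratio exp(b) is constant across the whole range of a predictor, while the discrete change in Pr(y=1) is not. A minimal sketch with hypothetical coefficients (written in Python for self-containment; the course itself works in R and Stata):

```python
import math

# Hypothetical logit coefficients (in practice these come from maximum
# likelihood estimation of the binary regression model).
intercept, b_age = -2.0, 0.05

def pr_y1(age):
    """Inverse-logit: Pr(y = 1 | age) = 1 / (1 + exp(-(intercept + b_age*age)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + b_age * age)))

# The odds ratio for a one-unit increase in age is exp(b_age), the same
# everywhere on the age scale...
odds_ratio = math.exp(b_age)

# ...but the discrete change in Pr(y = 1) depends on where it is evaluated:
change_at_30 = pr_y1(31) - pr_y1(30)
change_at_60 = pr_y1(61) - pr_y1(60)
```

This nonlinearity is precisely why the course devotes so much time to tables, plots, and discrete-change summaries of Pr(y=1) rather than raw coefficients.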
In the second half of the course, we will turn our attention to methods for classification and prediction. We will begin by re-examining what we have already learned (namely, logit models for categorical outcomes) and discussing the strengths and weaknesses of these models for prediction and classification. These familiar models will also be used to introduce the concepts of training a model, evaluating model performance, and improving model performance. As before, our focus will be on the binary case, with extensions for ordinal and multinomial cases. We will then move on to partition-based models, specifically Classification and Regression Tree (CART) models and random forests, followed by semi-linear models, namely k-Nearest Neighbors (kNN) and Support Vector Machines (SVM). The focus of these approaches will be on classification and prediction for binary outcomes, but extensions for outcomes with multiple categories will also be presented.
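As a preview of the classification workflow, the sketch below trains a k-Nearest Neighbors classifier on a tiny invented data set and evaluates it on held-out points; the data and settings are purely illustrative (Python is used here for self-containment, though the course works in R):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented toy data: two well-separated classes (not course data).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]

# Evaluate on held-out points: accuracy is the share classified correctly.
held_out = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]
accuracy = sum(knn_predict(train, q) == lab for q, lab in held_out) / len(held_out)
```

The train/held-out split above is the template for everything in the second half of the course: fit on one subset of cases, judge performance on another.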
Literature
Mandatory
Lecture Notes for Foundations of Machine Learning and Regression Methods for Categorical Outcomes. This coursepack contains copies of the overheads for the lectures, data set codebooks, and materials used in the computing lab. It will be provided at the beginning of our first class session. Be sure to bring these notes to all lecture and lab sessions.
• For participants who prefer electronic versions, component parts are also available on the course website.
Supplementary / voluntary
Explanatory models
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage. Hereafter: Long
Powers, Daniel A. & Yu Xie. 2008. Statistical Methods for Categorical Data Analysis. 2nd Edition. Bingley, UK: Emerald Press. Hereafter: P&X
For the Stata devotees: Long, J. Scott & Jeremy Freese. 2014. Regression Models for Categorical Dependent Variables Using Stata. 3rd Edition. College Station, TX: Stata Press. Hereafter: L&F
Or if you like R: I’m still searching for my favorite here, but a couple of good ones are:
• Monogan, James E. III. 2015. Political Analysis Using R. New York, NY: Springer. Hereafter: Monogan.
• Fox, John & Sanford Weisberg. 2010. An R Companion to Applied Regression. Thousand Oaks, CA: Sage. Hereafter: F&W.
Classification models
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. New York: Springer. Hereafter: HT&J
James, G., Witten, D., Hastie, T., & Tibshirani, R. 2013. An Introduction to Statistical Learning. New York: Springer. Hereafter: JWH&T
Examination part
Decentral – examination paper written at home (individual) 100%
Grading
Participants’ overall grades are based on completion of four assignments weighted as follows:
• A1 Math Review: 1/6
• A2 BRM + T&F: 2/6
• A3 ORM & MNLM: 1/6
• A4 Classification Methods: 2/6
Literature
None.
Additional Course Information
Getting Help
I am available to provide feedback or answer questions during lunch breaks & after course hours. Due to the compressed nature of this course (& my desire for you all to digest as much material as possible!), I encourage you to bring up questions or concerns early & often. If you would like to discuss questions or concerns related to the methods presented here for a particular paper or thesis, I would encourage you to make an appointment to meet before or after lecture one day, or during the lunch break.
• I can also be reached by email both during & after this course at shawnana@umich.edu. Ensure a prompt response to your email by prefacing your subject with “GSERM-Cat:”.
Academic Integrity
It is not possible for us to have an intellectual community without honor. I expect that you demonstrate respect by recognizing the labor of those who create intellectual products. Academic dishonesty (including cheating and plagiarism) will not be tolerated and will be dealt with according to university policy.
Prerequisites (knowledge of topic)
Mathematics: Comfortable familiarity with univariate differential and integral calculus, basic probability theory, and linear algebra is required. Familiarity with discrete and continuous univariate probability distributions will be helpful. Statistics: Students should have completed Ph.D.-level courses in introductory statistics and linear regression models, up to the level of GSERM’s Regression II.
Hardware
Students will complete course work on their own laptop computers. Microsoft Windows, Apple OSX, and Linux variants are all supported; please contact the instructor to ascertain the viability of other operating systems for course work.
Software
Basic proficiency with at least one statistical software package/language is not required but is highly recommended. Preferred software packages include the R statistical computing language and Stata. Course content will be presented using R; computer code for all course materials (analyses, graphics, course slides, examples, exercises) will be made available to students. Students choosing to use R are encouraged to arrive at class with current versions of both R (https://www.r-project.org) and RStudio (https://www.rstudio.com) on their laptops.
Course content
This course builds directly upon the foundations laid in Regression II, with a focus on successfully applying linear and generalized linear regression models. After a brief review of the linear regression model, the course addresses a series of practical issues in the application of such models: presentation and discussion of results (including tabular, graphical, and textual modes of presentation); fitting, presentation, and interpretation of two- and three-way multiplicative interaction terms; model specification for dealing with nonlinearities in covariate effects; and postestimation diagnostics, including specification and sensitivity testing. The course then moves to a discussion of generalized linear models, including logistic, probit, and Poisson regression, as well as textual, tabular, and graphical methods for presentation and discussion of such models. The course concludes with a “participants’ choice” session, where we will discuss specific issues and concerns raised by students’ own research projects and agendas.
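To illustrate the interaction material: in a model with a multiplicative term, y = b0 + b1·x + b2·z + b3·(x·z), the marginal effect of x is b1 + b3·z, so it must be evaluated and presented at specific values of the moderator. A sketch with hypothetical coefficients (Python here for self-containment; in class this is done in R):

```python
# In a linear model with a multiplicative interaction,
#   y = b0 + b1*x + b2*z + b3*(x*z),
# the marginal effect of x is dy/dx = b1 + b3*z: it depends on the moderator z.
# Coefficients below are hypothetical, for illustration only.
b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.3

def marginal_effect_x(z):
    """Conditional effect of x at a given value of the moderator z."""
    return b1 + b3 * z

# Tabulate the conditional effect across moderator values, as one would plot it:
effects = {z: marginal_effect_x(z) for z in (-1, 0, 1, 2)}
```

This is why the course stresses graphical presentation: a single coefficient (b1) describes the effect of x only at z = 0, which may not even be a meaningful value of z.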
Structure
Day One (morning session): Review of linear regression.
Day One (afternoon session): Presentation and interpretation of linear regression models.
Day Two (morning session): Fitting and interpreting models with multiplicative interactions.
Day Two (afternoon session): Nonlinearity: Specification, presentation, and interpretation.
Day Three (morning session): Anticipating criticisms: Model diagnostics and sensitivity tests.
Day Three (afternoon session): Introduction to logit, probit, and other Generalized Linear Models (GLMs).
Day Four (morning session): GLMs: Presentation, interpretation, and discussion.
Day Four (afternoon session): GLMs: Practical considerations, plus extensions.
Day Five (morning session): “Participants’ choice” session.
Day Five (afternoon session): Examination period.
Literature
Mandatory
The course has one required text:
Fox, John R. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Additional readings will also be assigned as necessary; a list of those readings will be sent to course participants a few weeks before the course begins. All additional readings will be available on the course github repository and/or through online library services (e.g., JSTOR).
Supplementary / Voluntary
None.
Mandatory readings before course start
None.
Examination part
Grading:
– Two written homework assignments (20% each)
– A final examination (50%)
– Oral / class participation (10%)
Supplementary aids
The exam will be a “practical examination” (see below for content). Students will be allowed access to (and encouraged to reference) all course materials, notes, help files, and other documentation in completing their exam. Additional useful materials include:
Fox, John, and Sanford Weisberg. 2011. An R and S-Plus Companion to Applied Regression, Second Edition. Thousand Oaks, CA: Sage Publications.
Nagler, Jonathan. 1996. “Coding Style and Good Computing Practices.” The Political Methodologist 6(2): 2–8.
Examination content
The examination will involve the application of the techniques taught in the class to one or more “live” data example(s). These will typically take the form of either (a) a replication and extension of an existing published work, or (b) an original analysis of observational data using linear and/or generalized linear regression. Students will be required to specify, estimate, and interpret various forms of regression models, to present tabular and graphical interpretations of those model results, to conduct and present diagnostics and robustness checks, and to give detailed explanations and justifications for their responses.
Literature
Fox, John. 2016. Applied Regression Analysis and Generalized Linear Models, Third Edition. Thousand Oaks, CA: Sage Publications.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel / Hierarchical Models. New York: Cambridge University Press.
Prerequisites (knowledge of topic)
Participants should have a basic working knowledge of the principles and practice of multiple regression and elementary statistical inference. No knowledge of matrix algebra is required or assumed, nor is matrix algebra ever used in the course.
Hardware
Participants are strongly encouraged to bring their own laptops (Mac or Windows)
Software
Computer applications will focus on the use of OLS regression and the PROCESS macro for SPSS and SAS developed by Andrew F. Hayes (processmacro.org), which makes the analyses described in this class much easier than they otherwise would be. Because this is a hands-on course, participants are strongly encouraged to bring their own laptops (Mac or Windows) with a recent version of SPSS Statistics (version 19 or later) or SAS (release 9.2 or later) installed. SPSS users should ensure their installed copy is patched to its latest release. SAS users should ensure that the IML product is part of the installation. R and Stata users can still benefit from the course content, but PROCESS is not available for R or Stata.
Course content
Statistical mediation and moderation analyses are among the most widely used data analysis techniques in social science, health, and business fields. Mediation analysis is used to test hypotheses about various intervening mechanisms by which causal effects operate. Moderation analysis is used to examine and explore questions about the contingencies or conditions of an effect, also called “interaction”. Increasingly, moderation and mediation are being integrated analytically in the form of what has become known as “conditional process analysis,” used when the goal is to understand the contingencies or conditions under which mechanisms operate. An understanding of the fundamentals of mediation and moderation analysis is in the job description of almost any empirical scholar. In this course, you will learn about the underlying principles and the practical applications of these methods using ordinary least squares (OLS) regression analysis and the PROCESS macro for SPSS and SAS.
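As a flavor of the core computation: in a single-mediator model the indirect effect is the product a·b, where a is the effect of X on M and b is the effect of M on Y controlling for X, and PROCESS bases inference on a percentile bootstrap. The sketch below (plain Python on simulated data; an illustration of the logic, not PROCESS itself) estimates a·b by OLS and bootstraps a confidence interval:

```python
import random

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

def indirect_effect(x, m, y):
    """a*b in a single-mediator model: a from OLS of M on X,
    b from OLS of Y on M controlling for X (2x2 normal equations)."""
    sxx, smm, sxm = cross(x, x), cross(m, m), cross(x, m)
    smy, sxy = cross(m, y), cross(x, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (smm * sxx - sxm ** 2)
    return a * b

# Simulated data with true a = 0.5 and true b = 0.4 (illustration only).
random.seed(1)
n = 300
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.2 * xi + random.gauss(0, 1) for xi, mi in zip(x, m)]

ab = indirect_effect(x, m, y)

# Percentile bootstrap for the indirect effect: resample cases with
# replacement, re-estimate a*b each time, and read off percentiles.
boot = []
for _ in range(500):
    idx = [random.randrange(n) for _ in range(n)]
    boot.append(indirect_effect([x[i] for i in idx],
                                [m[i] for i in idx],
                                [y[i] for i in idx]))
boot.sort()
ci_low, ci_high = boot[12], boot[487]   # approximate 95% percentile interval
```

The bootstrap is preferred here because the sampling distribution of a product of coefficients is not normal in small samples, so a symmetric normal-theory interval can be misleading.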
Topics covered in this fiveday course include:
 Path analysis: Direct, indirect, and total effects in mediation models.
 Estimation and inference about indirect effects in single mediator models.
 Models with multiple mediators
 Mediation analysis in the twocondition withinsubject design.
 Estimation of moderation and conditional effects.
 Probing and visualizing interactions.
 Conditional Process Analysis (also known as “moderated mediation”)
 Quantification of and inference about conditional indirect effects.
 Testing a moderated mediation hypothesis and comparing conditional indirect effects
As an introductory-level course, we focus primarily on research designs that are experimental or cross-sectional in nature with continuous outcomes. We do not cover complex models involving dichotomous outcomes, latent variables, models with more than two repeated measures, nested data (i.e., multilevel models), or the use of structural equation modeling.
This course will be helpful for researchers in any field (including psychology, sociology, education, business, human development, political science, public health, and communication) who want to learn how to apply the latest methods in moderation and mediation analysis using readily available software packages such as SPSS and SAS.
Structure
The schedule for the course will be partially determined by students’ previous experience and existing familiarity with mediation and moderation. The schedule below is a rough approximation.
Day 1
 Path analysis: Direct, indirect, and total effects in mediation models.
 Estimation and inference about indirect effects in single mediator models.
Day 2
 Models with multiple mediators
 Mediation analysis in the twocondition withinsubject design.
Day 3
 Estimation of moderation and conditional effects.
 Probing and visualizing interactions.
 Moderation analysis in the twocondition withinsubject design
Days 4 & 5
 Estimation of conditional process models (also known as “moderated mediation”)
 Quantification of and inference about conditional indirect effects.
 Testing a moderated mediation hypothesis and comparing conditional indirect effects
Literature
This course is a companion to Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press. The content of the course overlaps the book to some extent, but many of the examples are different, and this course includes material not in the first edition of the book. A copy of the book is not required to benefit from the course, but it could be helpful to reinforce understanding.
Beyond IMMCPA additional materials include:
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6–27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1–22.
Mandatory:
No materials are mandatory, but students will benefit greatly from reading Andrew Hayes’s book Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press.
Supplementary / voluntary:
Introduction to Mediation, Moderation, and Conditional Process Analysis (IMMCPA), published by The Guilford Press
Montoya, A. K., & Hayes, A. F. (2017). Two-condition within-participant statistical mediation analysis: A path-analytic framework. Psychological Methods, 22(1), 6–27.
Hayes, A. F. (2015). An index and test of linear moderated mediation. Multivariate Behavioral Research, 50, 1–22.
Mandatory readings before course start:
N/A
Examination part
100% of assessment will be based on a written final examination at the end of the course. The exam will be a combination of multiple-choice questions and short-answer/fill-in-the-blank questions, along with some interpretation of computer output. Students will take the examination home on the last day of class and return it to the instructor within one week.
During the examination students will be allowed to use all course materials, such as PDFs of PowerPoint slides, student notes taken during class, and any other materials distributed or studentgenerated during class. Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the exam, students may use the book if desired during the exam.
A computer is not required during the exam, though students may use a computer if desired, for example as a storage and display device for class notes provided to them during class.
Examination content
Exam topics may include how to quantify and interpret path analysis models; how to calculate direct, indirect, and total effects; and how to determine whether evidence of a mediation effect exists in a data set, based on computer output or other information provided. Also covered will be testing for moderation of an effect, interpreting evidence of interaction, and probing interactions. Students will be asked to generate or interpret conditional indirect effects from computer output given to them and/or to determine whether an indirect effect is moderated. Students may also be asked to construct computer commands that will conduct certain analyses. All questions will come from the content listed in “Course Content” above.
Literature
Although the book mentioned in “Literature” is not a requirement of the course nor is it necessary to complete the assignments, students may use the book if desired.
Prerequisites (knowledge of topic)
 Basic knowledge of the R programming language
 Basic statistical knowledge, including graduate-level statistics
Hardware
 A laptop computer with an Internet connection. The laptop should have at least 4 GB of RAM (preferably more, because text mining is memory-intensive).
Software
 A modern web browser (e.g., Chrome)
 R (https://www.r-project.org/), RStudio (https://www.rstudio.com/products/rstudio/) and git (https://git-scm.com/downloads) are available at no cost and are needed for this course. Please install all three on your personal laptop prior to class.
 As a backup, students should also sign up at https://rstudio.cloud/
 Specific R Packages will be shared prior to class for installation onto the laptop. The installation script will be shared via email with participants and shared on the class github repository.
Course content
Text mining is the art and science of extracting insights from large amounts of natural language. The topics of Text Mining will help students add natural language processing techniques to their research and data science toolset. As a technical course with some machine learning elements, it requires limited exposure to programming, graduate-level statistics, and mathematical theory, but the vast majority of the course content focuses on applying popular text mining methods. As a result, the target audience may also include qualitative researchers looking to add quantitative analysis to interviews, media, and other language-based field research, as long as participants have some basic R background.
If you stay engaged in the course and complete the suggested readings and code, you will be able to think systematically about how information can be obtained from diverse natural language. You will learn how to implement a variety of popular text mining algorithms in R (a free and open-source software environment) to identify insights, extract information, and measure emotional content.
Structure
Overall, the course is meant to be a practical examination of text mining, with some overlap with machine learning techniques for natural language. Following the adult learning model, each day will have a lecture, a demonstration, a co-working session, and finally a standalone lab where students can apply the technique to new data with instructor support.
Specifically, each morning session will include a lecture and code step through demonstrating a text mining technique. In the afternoon, the technique will be applied to a new data set followed by a lab. During the lab yet another data set will be provided or students can apply the day’s technique to their own data.
Day 1: R Basics & What is text mining?
Intro to R programming
String Manipulation & Text Cleaning
Lab Section: Clean tweets and prepare them for bag-of-words examination
Day 2: Common Text Mining Visuals
Word Frequency & Term Frequency–Inverse Document Frequency (TF-IDF)
Term-Document & Document-Term Matrices
Word Clouds – Comparison Clouds, Commonality Clouds
Other Visuals – Word Networks, Associations, Pyramid Plots, Treemaps
Lab Section: Create various visualizations with news articles
Day 3: Sentiment Analysis & Unsupervised Learning: Topic Modeling & Clustering
Sentiment Lexicons – Negation, Amplification, Valence Shifters
K-Means & Spherical K-Means
Correlated Topic Modeling
Lab Section: Clustering Professional Resumes/CVs
Day 4: Supervised Learning: Document Classification
Elastic Net (Lasso & Ridge Regression)
Data Science Ethics – IBM Watson’s use of text for cancer diagnosis
Lab Section: Classify clickbait from news headlines
Day 5: OpenNLP & Text Sources
Named Entity Recognition
APIs, webscraping basics, Microsoft Office documents
Afternoon Session: Final Examination (no lab)
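As a taste of the Day 2 material: term frequency–inverse document frequency (TF-IDF) up-weights terms that are frequent in one document but rare across the corpus, and zeroes out terms that appear everywhere. A hand-rolled sketch on an invented toy corpus (written in Python for self-containment; in class this is done with R's text mining packages):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Bag-of-words TF-IDF: weight = term frequency * log(N / document frequency).
    Terms common to every document (like 'the') get weight zero."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus, invented for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran home"]
w = tf_idf(docs)
# 'the' occurs in all three documents, so its weight is log(3/3) = 0 everywhere.
```

In practice the tokenization and cleaning steps of Day 1 (lowercasing, punctuation and stop-word removal) happen before weighting, which is why the course treats preprocessing as its own topic.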
Literature
Mandatory
 Text Mining in Practice with R by Ted Kwartler; Wiley & Sons Publishing
ISBN: 9781119282013
 Two data ethics articles assigned in class to spur reflection for the ethics essay.
Supplementary / voluntary
None.
Mandatory readings before course start
 Read chapter 1 of Text Mining in Practice with R, entitled “What is Text Mining?”
 Please install R & R Studio on your laptop prior to the 1st class. Be sure that these are working correctly and that external packages can be installed. As a backup, sign up for an account at RStudio’s cloud environment https://rstudio.cloud.
Examination part
20% Ethics Paper – due at midnight on the last course day
 500–750-word essay with personal reflection on the ethical implications of text mining research methods
80% Final Exam – proctored on the final day of the week
 30 multiple-choice questions (2 pts each)
 1 code review section (20 pts) asking students to describe what specific code steps are doing and why
 4 short-form questions (5 pts each), each requiring a one-paragraph answer (2–4 sentences)
Supplementary aids
Students may bring a handwritten “index card” to the final examination period. It may be double-sided and should be functionally equivalent to the UK standard 3-inch by 5-inch notecard. Students may put any information they deem important for the final on their notecard and use it as a supplement during the exam. Use of an exam-supporting notecard is optional.
Examination content
The exam covers the following topics (example content in parentheses):
 R coding principles and basic functions (how to read in data; data types)
 Steps in a machine learning or analytical project workflow (SEMMA; EDA functions; partitioning if modeling)
 Steps in a text mining workflow (problem statement > unorganized state > organized state)
 R text mining libraries and functions (which functions are appropriate for text uses)
 Text preprocessing steps (why “cleaning” steps are performed)
 Bag-of-words text processing (what is Bag of Words?)
 Sentiment analysis (lexicons, their application, and implications for understanding author emotion)
 Document classification (Elastic Net machine learning for document classification)
 Topic extraction (unsupervised machine learning for topic extraction: K-Means, Spherical K-Means, hierarchical clustering)
 Text as inputs for machine learning algorithms (classification and prediction using mixed training sets that include extracted text features as independent variables)
 Text mining visuals (word frequencies, disjoint comparisons, and other common visuals)
 Named entity recognition (examples of named entities in large corpora)
 Text sources (APIs, web scraping, OCR, and other text sources)
Literature
The exam will be based on the lectures and mandatory assigned reading from Text Mining in Practice with R.
Prerequisites (knowledge of topic)
Students should be interested in spatial topics such as real estate markets, urban economics, crime, pollution, spatial distribution of political preferences, and trade flows. We assume that students are familiar with matrix algebra, and have had courses in probability theory and econometrics. The course emphasizes programming and empirical application. The empirical implementation of spatial models is done in R, hence some familiarity in R is useful but not required for the course. The course is open to students from the PiF/PEF and other external PhD programs.
Learning objectives
The goal of this course is to provide students with the main tools for analyzing and visualizing spatial data. Students will learn how to estimate and interpret a range of spatial models and how to program their own models in R.
Course content
This course focuses on the visualization and modeling of spatial data. Examples are taken from different research areas such as political science, empirical international trade, criminology, and real estate. It offers a detailed explanation of individual estimation methods and their implementation in R. In this course, students will learn:
• How to generate a variety of different maps that visualize the location of spatial units
• How maximum likelihood estimation works and how to set up and optimize a likelihood function in R
• How to deal with computational problems that are frequently encountered when working with spatial data
• How to increase computation speed using concentrated maximum likelihood and the matrix exponential spatial specification model
• How to estimate a spatial regression model with both cross-sectional and time-series data
• How to properly interpret the output from a spatial regression model and how to investigate policy interventions.
• A basic background on spatial interaction models, heterogeneous coefficient SAR models, and spatio‑temporal models
What students do NOT learn in this course:
• Estimation of spatial regression models with other estimation techniques such as IV, NLS, and Bayesian Methods
• The use of a specialized Geographic Information System such as ArcGIS
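As a small preview of the lecture material: the spatial lag Wy that appears in the SAR model y = ρWy + Xβ + ε is, with a row-standardized weight matrix, simply the average outcome among each unit's neighbors. A toy three-unit example (plain Python with invented numbers; the course implements everything in R with proper sparse matrices):

```python
# Row-standardized spatial weight matrix for three units: entry W[i][j] is
# nonzero when j is a neighbor of i, and each row sums to one.
W = [[0.0, 0.5, 0.5],   # unit 1's neighbors: units 2 and 3
     [1.0, 0.0, 0.0],   # unit 2's only neighbor: unit 1
     [0.5, 0.5, 0.0]]   # unit 3's neighbors: units 1 and 2

y = [2.0, 4.0, 6.0]

# Sanity check on row-standardization.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in W)

# The spatial lag Wy: each element is the average outcome of i's neighbors.
Wy = [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in W]
# Wy == [5.0, 2.0, 3.0]
```

Because y appears on both sides of the SAR equation, Wy cannot simply be added as an ordinary regressor; this endogeneity is exactly why the course spends so much time on maximum likelihood estimation.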
Structure
Monday
Lecture 1: 09:15 ‑ 12:00
R Tutorial 1: 13:00 ‑ 15:00
Tuesday
Lecture 2: 09:15 ‑ 12:00
R Tutorial 2: 13:00 ‑ 15:00
Wednesday
Lecture 3: 09:15 ‑ 12:00
R Tutorial 3: 13:00 ‑ 15:00
Thursday
Lecture 4: 09:15 ‑ 12:00
R Tutorial 4: 13:00 ‑ 15:00
Friday
Lecture 5: 09:15 ‑ 12:00
R Tutorial 5: 13:00 ‑ 15:00
Times and room information in the timetable apply.
Literature
Mandatory
LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”. CRC Press.
Supplementary / voluntary
Elhorst, J.P. (2014), “Spatial Econometrics: From Cross-Sectional Data to Spatial Panels”, Springer.
Holly, S., M.H. Pesaran, and T. Yamagata (2011), “The Spatial and Temporal Diffusion of House Prices in the UK”, Journal of Urban Economics 69, 2–23.
LeSage, J. (2014), “What Regional Scientists Need to Know about Spatial Econometrics”, The Review of Regional Studies 44, 13–32.
Examination part
Examination paper written at home (100%)
Remark: Paper Replication or own research idea.
Examination content
• SAR model, SDM model, CML, MESS, Spatial Interaction model, Spatial Panel model, HSAR model
Implementing maximum likelihood estimation in R: Full Maximum Likelihood, Concentrated Maximum Likelihood, Matrix Exponential Spatial Specification.
Examination relevant literature
• LeSage, J., and R.K. Pace (2009), “Introduction to Spatial Econometrics”. CRC Press, Chapters 1, 2, 3, 4, 8, and 9.
• LeSage, J., and Y.Y. Chih (2016), “Interpreting Heterogeneous Coefficient Spatial Autoregressive Panel Models”, Economics Letters 142, 1–5.
Course Description
As in many other fields, economists are increasingly making use of high-dimensional models – models with many unknown parameters that need to be inferred from the data. Such models arise naturally in modern data sets that include rich information for each unit of observation (a type of “big data”) and in nonparametric applications where researchers wish to learn, rather than impose, functional forms. High-dimensional models provide a vehicle for modeling and analyzing complex phenomena and for incorporating rich sources of confounding information into economic models.
Our goal in this course is twofold. First, we wish to provide an overview and introduction to several modern methods, largely coming from statistics and machine learning, which are useful for exploring high-dimensional data and for building prediction models in high-dimensional settings. Second, we will present recent proposals that adapt high-dimensional methods to the problem of doing valid inference about model parameters and illustrate applications of these proposals for doing inference about economically interesting parameters.
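One recurring building block of the methods covered is soft-thresholding, through which lasso-type estimators set small coefficients exactly to zero and thereby perform variable selection. A minimal sketch (Python for self-containment; this is the special orthonormal-design case, where the lasso solution is simply soft-thresholded OLS):

```python
def soft_threshold(z, lam):
    """The soft-thresholding operator at the heart of lasso-type estimators:
    shrinks z toward zero, and returns exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# With orthonormal predictors, the lasso solution is soft-thresholded OLS:
# small estimated effects are dropped entirely, large ones are shrunk.
ols = [2.5, -0.3, 0.75, 0.05]
lasso = [soft_threshold(b, 0.5) for b in ols]   # -> [2.0, 0.0, 0.25, 0.0]
```

The shrinkage-induced bias visible here (2.5 becomes 2.0) is also why naive post-selection inference fails, motivating the post-selection proposals covered in Lecture 4.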
Course prerequisites
The course is a PhD level course. Basic knowledge of parametric statistical models and associated asymptotic theory is expected.
Preliminary Outline
Lecture 1 (Hansen): Introduction to High-Dimensional Modeling
 Breiman, L. (1996), “Bagging Predictors,” Machine Learning 24: 123–140
 Friedman, J., T. Hastie, and R. Tibshirani (2000), “Additive logistic regression: A statistical view of boosting (with discussion),” Annals of Statistics, 28, 337–407
 Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Elements from Chapters 2, 5, 7, 8.7, 10]
 James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Elements from Chapters 2, 3, 5, 7, 8.2]
 Li, Q. and J. S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press. [Elements from Chapters 2, 14]
 Schapire, R. (1990), “The strength of weak learnability,” Machine Learning, 5, 197227
Lecture 2 (Spindler): Introduction to Distributed Computing for Very Large Data Sets
Lecture 3 (Hansen): Tree-based Methods
• Athey, S. and G. Imbens (2015), “Machine Learning Methods for Estimating Heterogeneous Causal Effects,” working paper, http://arxiv.org/abs/1504.01132
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapters 9, 10, 15, 16]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 8]
• Wager, S. and S. Athey (2015), “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” working paper, http://arxiv.org/abs/1510.04342
• Wager, S. and G. Walther (2015), “Uniform Convergence of Random Forests via Adaptive Concentration,” working paper, http://arxiv.org/abs/1503.06388
• Wager, S., T. Hastie, and B. Efron (2014), “Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife,” Journal of Machine Learning Research, 15, 1625–1651
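As a taste of the material in this lecture, a random forest fit in R might look like the following minimal sketch; the randomForest package and the built-in iris data are illustrative assumptions, since the reading list does not prescribe software.

```r
# A minimal sketch, assuming the randomForest package (not named in the syllabus).
library(randomForest)

data(iris)            # small built-in data set, for illustration only
set.seed(1)           # make the bootstrap resampling reproducible
fit <- randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)            # out-of-bag (OOB) error estimate and confusion matrix
importance(fit)       # per-variable importance measures
```

Bagging (Breiman 1996, above) corresponds to growing the trees on bootstrap samples; a random forest additionally samples a subset of predictors at each split.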
Lecture 4 (Spindler): An Overview of HighDimensional Inference
• Belloni, A. and V. Chernozhukov (2013), “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012), “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80(6), 2369–2429
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “High-Dimensional Methods and Inference on Structural and Treatment Effects,” Journal of Economic Perspectives, 28(2), 29–50
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and C. Hansen (2015), “Inference in High-Dimensional Panel Models with an Application to Gun Control,” forthcoming, Journal of Business and Economic Statistics
• Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013), “Program Evaluation with High-Dimensional Data,” working paper, http://arxiv.org/abs/1311.2645
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments,” American Economic Review, 105(5), 486–490
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,” Annual Review of Economics, 7, 649–688
Lecture 5 (Hansen): Penalized Estimation Methods
• Belloni, A. and V. Chernozhukov (2013), “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547
• Fan, J. and J. Lv (2008), “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society, Series B, 70(5), 849–911
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapters 3, 4, 5, 18]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 6]
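A compact illustration of penalized estimation in a p > n setting; glmnet is one widely used R implementation (an assumption, as the lecture does not name a package), and the data below are simulated purely for illustration.

```r
# A minimal Lasso sketch, assuming the glmnet package.
library(glmnet)

set.seed(1)
n <- 100; p <- 200                       # more coefficients than observations
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))     # sparse true coefficient vector
y <- X %*% beta + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)      # alpha = 1 selects the Lasso penalty
coef(cvfit, s = "lambda.min")            # sparse estimates at the CV-chosen lambda
```

The cross-validated penalty level trades off fit against sparsity, which connects directly to the post-selection readings listed for Lecture 4.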
Lecture 6 (Spindler): Moderate p Asymptotics
Lecture 7 (Hansen): Examples
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012), “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80(6), 2369–2429
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “High-Dimensional Methods and Inference on Structural and Treatment Effects,” Journal of Economic Perspectives, 28(2), 29–50
• Belloni, A., V. Chernozhukov, and C. Hansen (2014), “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and C. Hansen (2015), “Inference in High-Dimensional Panel Models with an Application to Gun Control,” forthcoming, Journal of Business and Economic Statistics
• Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2013), “Program Evaluation with High-Dimensional Data,” working paper, http://arxiv.org/abs/1311.2645
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments,” American Economic Review, 105(5), 486–490
• Chernozhukov, V., C. Hansen, and M. Spindler (2015), “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,” Annual Review of Economics, 7, 649–688
• Gentzkow, M., J. Shapiro, and M. Taddy (2015), “Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech,” working paper, http://www.brown.edu/Research/Shapiro/
• Hansen, C. and D. Kozbur (2014), “Instrumental Variables Estimation with Many Weak Instruments Using Regularized JIVE,” Journal of Econometrics, 182(2), 290–308
• Kleinberg, J., J. Ludwig, S. Mullainathan, and Z. Obermeyer (2015), “Prediction Policy Problems,” American Economic Review: Papers and Proceedings, 105(5), 491–495
Lecture 8 (Spindler): Inference: Computation
Lecture 9 (Hansen): Introduction to Unsupervised Learning
• Blei, D., A. Ng, and M. Jordan (2003), “Latent Dirichlet allocation,” Journal of Machine Learning Research, 3(4–5), 993–1022
• Hastie, T., R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer. [Chapter 14]
• James, G., D. Witten, T. Hastie, and R. Tibshirani (2014), An Introduction to Statistical Learning with Applications in R, Springer. [Chapter 10]
• Li, Q. and J. S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press. [Chapter 1]
• Stock, J. H., and M. W. Watson (2002), “Forecasting using principal components from a large number of predictors,” Journal of the American Statistical Association, 97, 1167–1179
Lecture 10 (Spindler): Very Large p Asymptotics
• Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica, 80, 2369–2429. (ArXiv, 2010)
• Belloni, A., and V. Chernozhukov (2011): “ℓ1-penalized quantile regression in high-dimensional sparse models,” Annals of Statistics, 39(1), 82–130. (ArXiv, 2009)
• Belloni, A., and V. Chernozhukov (2013): “Least Squares After Model Selection in High-dimensional Sparse Models,” Bernoulli, 19(2), 521–547. (ArXiv, 2009)
• Belloni, A., V. Chernozhukov, and C. Hansen (2010): “Inference for High-Dimensional Sparse Econometric Models,” Advances in Economics and Econometrics, 10th World Congress of the Econometric Society, Shanghai, 2010. (ArXiv, 2011)
• Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects after Selection amongst High-Dimensional Controls,” Review of Economic Studies, 81(2), 608–650
• Belloni, A., V. Chernozhukov, and K. Kato (2013): “Uniform Post Selection Inference for LAD Regression Models,” arXiv:1304.0282. (ArXiv, 2013)
• Belloni, A., V. Chernozhukov, and L. Wang (2011a): “Square-Root LASSO: Pivotal Recovery of Sparse Signals via Conic Programming,” Biometrika, 98(4), 791–806. (ArXiv, 2010)
• Belloni, A., V. Chernozhukov, and L. Wang (2011b): “Square-Root LASSO: Pivotal Recovery of Nonparametric Regression Functions via Conic Programming.” (ArXiv, 2011)
• Belloni, A., V. Chernozhukov, and Y. Wei (2013): “Honest Confidence Regions for Logistic Regression with a Large Number of Controls,” arXiv preprint arXiv:1304.3969. (ArXiv, 2013)
• Bickel, P., Y. Ritov, and A. Tsybakov (2009): “Simultaneous analysis of Lasso and Dantzig selector,” Annals of Statistics.
• Candes, E., and T. Tao (2007): “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics.
• Donald, S., and W. Newey (1994): “Series estimation of semilinear models,” Journal of Multivariate Analysis.
• Tibshirani, R. (1996): “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser. B.
• Frank, I. E., and J. H. Friedman (1993): “A Statistical View of Some Chemometrics Regression Tools,” Technometrics, 35(2), 109–135.
• Gautier, E., and A. Tsybakov (2011): “High-dimensional Instrumental Variables Regression and Confidence Sets,” arXiv:1105.2454v2
• Hahn, J. (1998): “On the role of the propensity score in efficient semiparametric estimation of average treatment effects,” Econometrica, pp. 315–331.
• Heckman, J., R. LaLonde, and J. Smith (1999): “The economics and econometrics of active labor market programs,” Handbook of Labor Economics, 3, 1865–2097.
• Imbens, G. W. (2004): “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review,” The Review of Economics and Statistics, 86(1), 4–29.
• Leeb, H., and B. M. Pötscher (2008): “Can one estimate the unconditional distribution of post-model-selection estimators?,” Econometric Theory, 24(2), 338–376.
• Robinson, P. M. (1988): “Root-N-consistent semiparametric regression,” Econometrica, 56(4), 931–954.
• Rudelson, M., and R. Vershynin (2008): “On sparse reconstruction from Fourier and Gaussian measurements,” Comm. Pure Appl. Math., 61, 1024–1045.
• Jing, B.-Y., Q.-M. Shao, and Q. Wang (2003): “Self-normalized Cramér-type large deviations for independent random variables,” Ann. Probab., 31(4), 2167–2215.
Course literature
Course notes and a list of readings provided at the beginning of the course.
Examination
Written examination (100%). Participants get a take-home final exam on the last day, to be completed over the next couple of weeks.
Examination content
Content of the lectures
Examination relevant literature
To be discussed in class
Prerequisites (knowledge of topic)
Students should have basic knowledge of quantitative research and qualitative research.
Hardware
Projector for slide shows. Laptops and software are not required for this course.
Software
None
Learning objectives
At the end of the course the student will be able to do the following:
• Compare quantitative, qualitative, and mixed methods research.
• Define and justify mixed methods research.
• Explain the major kinds of mixed methods research.
• Write research questions for mixed methods research studies.
• Explain how to construct basic and advanced research designs in mixed methods research.
• Explain the multiple kinds of data collection appropriate in mixed methods research.
• Explain the major methodologies, theoretical frameworks, and paradigms popular in mixed methods research.
• Explain sampling methods used in mixed methods research.
• Explain how to produce high quality/justified mixed methods research, including the multiple kinds of validity.
• Explain how mixed methods data analysis adds to traditional quantitative and qualitative data analysis, including crossover analysis and integration.
• Explain how to structure reports and articles in mixed methods research.
• Explain how to publish and disseminate mixed methods research results.
Course content
The content is elaborated, day by day, in the section on structure. We will complete an inventory of philosophical/scientific and methodological beliefs; examine major paradigms in mixed methods research; compare qualitative, quantitative, and mixed methods (always including important scholars in the fields); examine multiple types of mixed methods research; learn how to write research questions that suggest the need for mixed methods research; examine major methods of data collection, including between- and within-method mixing of well-known methods; learn about the “popular mixed methods designs” and how to construct better designs appropriate for your research questions and situation; learn about producing defensible/justifiable mixed methods research, including the major types of validity to be addressed in particular studies; learn how to move beyond quantitative and qualitative data analysis and conduct mixed methods data analysis, including crossover analysis; and learn how to structure and write mixed methods proposals and manuscripts.
Structure
Day 1:
• Comparative overview of quantitative, qualitative, and mixed methods research.
• Students complete an inventory so they can articulate their philosophical/scientific and methodological beliefs and assumptions.
• Review the history of mixed methods research in the social and behavioral sciences, including business research.
• Discuss a short empirical mixed methods research article.
Day 2:
• Examine the many purposes of mixing methods (and methodologies) in mixed methods research.
• Discuss how to write research questions in mixed methods research.
• Discuss multiple methods of data collection in mixed methods research.
• Discuss how to transform traditional methodologies into mixed methodologies (e.g., mixed experiments, mixed grounded theory).
• Begin examination of common mixed methods research designs and how to construct more nuanced designs for your mixed methods research study.
• Discuss an empirical and a methodological journal article.
Day 3:
• Continue discussion of mixed methods research designs.
• Examine the multiple and combined sampling methods used in mixed methods research.
• Discuss the issue of causation in mixed methods research.
• Examine the kinds of validity used in mixed methods research.
• Discuss assigned empirical and methodological mixed methods journal articles.
Day 4:
• Catch up on previous topics.
• Discuss data analysis approaches in mixed methods research, including crossover analysis.
• Discuss how to structure mixed methods research proposals and research manuscripts.
• Discuss assigned empirical and methodological mixed methods journal articles.
Day 5:
• Discuss how to write mixed methods research manuscripts.
• Each student presents their brief research proposal, following the format provided and discussed earlier in the class.
Literature
Mandatory
The exact reading list and pdf files will be available to download a month before the course begins.
Examination part
Class participation: 20%
Assignment 1, research questions: 5% (written)
Assignment 2, article summaries/critiques: 25% (written)
Assignment 3, brief research proposal: 40% (written)
Assignment 4, brief presentation with handout: 10% (oral presentation)
Assignment 1: Research questions: Written statement of 2–5 research questions of interest for your research. (Even if you are not sure about conducting the study, it will still be useful in this class to think about a study that would be of interest.)
Assignment 2: Article summary/critique of 10 mixed methods research journal articles (some are assigned to everyone), including several in your area of interest (one page long for each summary/critique).
Assignment 3: Brief empirical research proposal: Describe your research question/s, data collection, analysis plan, and expected contributions (no longer than 10 pages).
Here are the starting headings for a typical proposal in this course/workshop:
1. Working Title
2. Introduction (short overview of topic/problem/opportunity and theoretical motivation)
3. Purpose of the Study (“The purpose of the proposed study is ____.”)
4. Research Questions (typically 2–5 research questions)
5. Paradigm(s), Methodology(ies), and Methods
6. Mixed Methods Research Design
7. Sampling Methods
8. Validity Types and Strategies
9. Expected Data Analyses
10. Strengths and Weaknesses of the Proposed Study (optional)
Assignment 4: A very short presentation (depending on the number of students) with a handout to share with the others for discussion of your proposed study.
There will not be a cumulative “knowledge exam” in this course. What is most important is to learn the material/readings, discuss the material/readings, and apply your knowledge in a short research proposal of interest to you.
Supplementary aids
See above.
Examination content
See above.
Examination relevant literature
See above.
Prerequisites (knowledge of topic)
A course in regression (e.g., GSERM Regression I) is essential. A second course in regression (e.g., GSERM Regression II) is recommended. Regression topics that are particularly important: i) assessing and dealing with nonlinearity, ii) dummy variables (including block F-tests), and iii) standardization.
Hardware
Participants should bring laptops loaded with the software identified below.
Software
We will make primary use of the lavaan package in R but will also demonstrate the sem procedure in STATA. The following R packages should be installed on participant laptops: lavaan, haven, semTools. STATA will be available in a computer lab at the University of St. Gallen for participants who do not have it installed on their own laptops.
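The package setup described above amounts to a single line in R:

```r
# Installs the three R packages named in the course description.
install.packages(c("lavaan", "haven", "semTools"))
```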
Learning objectives
The course will provide a conceptual introduction to structural equation models, provide a thorough outline of model “fitting” and assessment, teach how to effectively program structural equation models using available software, demonstrate how to extend basic models into multiple group situations, and provide an introduction to models where common model assumptions regarding missing and non-normal data are not met.
Course content
1. Introduction to latent variable models, measurement error, path diagrams.
2. Estimation, identification, interpretation of model parameters.
3. Scaling and interpretation issues
4. Scalar programming for structural equation models in R/lavaan and STATA.
5. Mediation models in the structural equation framework.
6. Model fit and model improvement
7. General linear parameter constraints
8. Multiplegroup models
9. Introduction to models for means and intercepts
10. The FIML approach to analysis with missing data
11. Alternative estimators for non-normal data.
Structure
Schedule may vary slightly according to class progress.
Day 1 Morning
Path models, mediation. Introduction to latent variable conceptualization. Diagrams, equations and model parameters. Moving from equations to diagrams and vice versa; listing model parameters.
Day 1 Afternoon
Introduction to computer SEM software. Computer exercises: A simple single-indicator model. A latent variable measurement model.
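To preview the style of the computer exercises, a latent variable measurement model in lavaan can be specified as below; the HolzingerSwineford1939 data set ships with lavaan, and this generic one-factor model is an illustration rather than the actual course exercise.

```r
# A minimal sketch of a one-factor measurement model in lavaan.
library(lavaan)

model <- '
  visual =~ x1 + x2 + x3   # latent factor "visual" measured by three indicators
'
fit <- cfa(model, data = HolzingerSwineford1939)
summary(fit, fit.measures = TRUE, standardized = TRUE)
```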
Day 2 Morning
Identification. Variances, scaling. Covariance algebra for structural equation models. Applications. Class exercises: a) identification b) covariance algebra. Equality constraints and dummy variables in SEM models.
Day 2 Afternoon
Computer exercises (R, STATA): A latent variable measurement model with covariates. Model diagnostics, fit improvement approaches. Mediation with manifest and latent variables.
Day 3 Morning
Nested models, Wald and LM tests, mixing single- and multiple-indicator measurement models. Fit functions. Estimation. Dealing with estimation problems, including negative variance estimates and nonconvergence.
Day 3 Afternoon
Computer exercise (R/lavaan): SEM model with multiple latent variables, single-indicator and multiple-indicator covariates. Improving model fit, assessing diagnostics. Nonstandard models. Multiple-group models: conceptual introduction.
Day 4 Morning
Multiple Group Models. Measurement equation equivalence across groups (tests, assessment). Construct equation equivalence. Software applications, formal versus substantive comparisons. Reporting SEM model results. Computer exercise (R): a multiple-group model.
Day 4 Afternoon
Computer exercise: multiple-group models in STATA. Alternative estimators and scaled variance estimators: dealing with missing data and non-normal data. Item parcels (pro and con).
Day 5 Morning
Computer exercise (R/lavaan) for datasets with missing and/or non-normal data. An introduction to models for means and intercepts.
Day 5 Afternoon
Computer exercises (R/lavaan and STATA): a model for means and intercepts
Literature
Mandatory
Nine PDF files will be made available to participants as reading materials for this course, titled Notes(Section1) through Notes(Section9).
Supplementary / voluntary
Randall Schumacker and Richard Lomax, A Beginner’s Guide to Structural Equation Modeling. 4th edition (Routledge, 2016). This reading is helpful but not essential. Earlier versions of this text can be used.
Mandatory readings before course start
There are no mandatory pre-course readings. Participants are encouraged to read through Section 1 of the course notes in advance of the class, but may choose to read this while the class is in progress.
Examination part
Two computer exercises, 20% each: 40%.
First exercise is due Thursday during the course. Second exercise is due Monday immediately following the course.
One major exercise: 60%.
This exercise will consist of a series of 5–7 questions requiring essay-style responses (approx. 8–14 pp. total). Some questions will involve the interpretation of computer output listings, while other questions will deal with conceptual issues discussed in the course. The exercise is due within 2 weeks of the end of the course.
Supplementary aids
For the computer exercises, the following materials will be helpful: a) lab exercise materials and descriptions, b) an abbreviated software user manual/guide (one available for each of STATA and lavaan), and c) the PDF course text files. For the major project, the PDF course files will be very helpful.
Examination content
For the final exercise, students will need to understand the following subject matter:
1. Converting equations to path diagrams and vice versa.
2. Principles of mediation assessment: total, direct and indirect effects in structural equation path models
3. Determining whether a model is identified or not
4. Dealing with estimation difficulties
5. Interpreting model parameters in the metric of the manifest variables
6. Interpreting standardized model parameters
7. Determining whether the fit of a model is acceptable
8. Hypothesis testing: simultaneous tests for b=0; tests for equality
9. Interpreting models with parameter constraints
10. Testing measurement model equivalence in multiple-group models
11. Testing construct equation equivalence in multiple-group models; assessing individual parameters and groups of parameters for cross-group differences
12. Dummy exogenous variables in structural equation models
13. Approaches to missing data in SEM models.
14. Dealing with non-normal data: ADF, DWLS estimators, Satorra-Bentler and other variance adjustment approaches.
Examination relevant literature
For the major assignment exercise, students should have access to the course PowerPoint slide materials and the course text PDF files.
Prerequisites (knowledge of topic)
Students should have previous exposure to social research methods, including basic training in quantitative methods, at the post-baccalaureate level.
Hardware
Laptop (PC or Mac): Students should bring a laptop. The course will include instruction in the use of the software package fsQCA (for both Windows and Mac).
Software
Please install the fsQCA software package ahead of the course. It can be downloaded for free at fsqca.com.
Learning objectives
Qualitative comparative analysis (QCA) is a research approach consisting of both an analytical technique and a conceptual perspective for researchers interested in studying configurational phenomena. QCA is particularly appropriate for the analysis of causally complex phenomena marked by multiple, conjunctural causation where multiple causes combine to bring about outcomes in complex ways.
QCA was developed in the 1980s by Charles Ragin, a sociologist and political scientist, as an alternative comparative approach that lies midway between the primarily qualitative, case-oriented approach and the primarily quantitative, variable-oriented approach, with the goal of bridging both by combining their advantages and tackling situations where causality is complex and conjunctural. QCA uses Boolean algebra for the analysis of set relations and allows researchers to formally analyze patterns of necessity and sufficiency regarding outcomes of interest. Since its inception, QCA has developed into a broad set of techniques that share their set-analytic nature and include both descriptive and inferential techniques.
Many researchers have drawn on QCA because it offers a means to systematically analyze data sets with only a few observations. In fact, QCA was originally applied to small-n situations of between 10 and 50 cases; situations where there are frequently too many cases to pursue a classical qualitative approach but too few cases for conventional statistical analysis. However, more recently, researchers have also applied QCA to medium- and large-n situations marked by hundreds of thousands of cases. While these applications require some changes to how QCA is applied, they retain many advantages for analyzing situations that are configurational in nature and marked by causal complexity.
The goal of this workshop is to provide a ground-up introduction to Qualitative Comparative Analysis (QCA) and fuzzy sets. Participants will get intensive instruction and hands-on experience with the fsQCA software package and on completion should be prepared to design and execute research projects using the set-analytic approach.
After successful completion of the course, you should be able to:
1. understand the goals, assumptions, and key concepts of QCA
2. conduct data analysis using the fsQCA software package
3. design and execute research projects using a set-analytic approach
4. apply advanced forms of set-analytic investigation
I would like this workshop to be as useful to you as possible. To get the most out of this workshop, you would ideally already be working on an empirical project that might be aided by taking a configurational approach, but that is not essential. Over the course of this workshop, I hope you will be thinking about how you can apply these methods to your research, and I will do my best to be of assistance.
Course content
See below under structure
Structure
Day 1: Units 1–3
Day 2: Units 3–4
Day 3: Units 5–6
Day 4: Units 6–7
Day 5: Student Presentations
Unit 1. Introduction to the Comparative Method
The goal of this first unit is to offer an introduction to the logic of comparative research, as this perspective will be fundamental in informing our thinking for the coming days. The focus is on understanding social research from a set-analytic perspective as well as examining the distinctive place of configurational and comparative research.
Key Readings:
Ragin, 2008 (“Redesigning Social Inquiry”): Chapters 1–2
Unit 2. The Basics of QCA
We’ll move on to the basics of QCA. We will begin with an introduction to Boolean algebra and set-analytic methods. Other issues we will cover include set-analytic analysis vs. correlational analysis, the concepts of necessity and sufficiency as well as consistency, coverage, and set coincidence. Time permitting, we will also examine case-oriented research strategies for theory building.
Key Readings:
Ragin, 2000: Chapters 3–5
Ragin, 2008: Chapters 1–3
Unit 3. Crisp Set Analysis
In this unit, we will dive into crisp-set QCA (csQCA), the simpler version of QCA using binary data sets. This will include the coding of data, the construction of truth tables, and understanding the three solutions—complex, parsimonious, and intermediate. We will also begin to examine the importance of counterfactual analysis based on easy versus difficult counterfactuals. Topics also include understanding consistency and coverage in crisp-set truth table analysis.
Key Readings:
Ragin, 2000: Chapters 3–5
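The truth table idea from this unit can be illustrated in a few lines of base R with made-up binary data (the course itself uses the fsQCA package for this):

```r
# Illustrative crisp-set data: conditions A and B, outcome Y, one row per case.
d <- data.frame(A = c(1, 1, 1, 0, 0, 1),
                B = c(1, 0, 0, 1, 1, 1),
                Y = c(1, 1, 0, 0, 0, 1))
d$config <- paste0(d$A, d$B)     # each case's configuration of conditions
table(d$config)                  # number of cases per truth table row
tapply(d$Y, d$config, mean)      # consistency: share of cases showing the outcome
```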
Unit 4. Fuzzy Set Analysis I
Fuzzy set analysis presents a slightly more complex version of QCA. We will start with the notions of fuzzy sets and fuzzy set relations before moving on to calibrating fuzzy sets and fuzzy set consistency, coverage, and coincidence.
Key Readings:
Ragin, 2008: Chapters 4–5
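The consistency and coverage measures covered in this unit have simple closed forms (Ragin's standard definitions); with illustrative membership scores:

```r
# Fuzzy-set consistency and coverage for "X is sufficient for Y"
# (illustrative membership scores, not course data).
x <- c(0.9, 0.7, 0.6, 0.2, 0.1)           # membership in condition X
y <- c(1.0, 0.8, 0.5, 0.4, 0.0)           # membership in outcome Y
consistency <- sum(pmin(x, y)) / sum(x)   # degree to which X is a subset of Y
coverage    <- sum(pmin(x, y)) / sum(y)   # share of Y accounted for by X
c(consistency = consistency, coverage = coverage)
```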
Unit 5. The Fuzzy-Set Truth Table Algorithm
We will next cover the fuzzy-set truth table algorithm. Building on crisp set analysis, we will further examine issues around limited diversity, fuzzy sets, and counterfactual analysis. We also will work with sample data sets.
Unit 6. Advanced Topics in QCA
This unit provides us with an opportunity to catch up and delve deeper into some of the topics introduced above. Should we feel comfortable enough, we will move on to some more advanced topics in QCA, including the testing of causal recipes and substitutable causal conditions.
Key Readings:
Ragin, 2008: Chapters 7–10
Unit 7. Large-N Applications of QCA
The last unit will provide examples of recent large-N applications of QCA. These examples will give us an opportunity to raise further questions about how to execute research using a set-analytic approach. We will also reserve some time for further questions that have come up during the workshop.
Literature
There are four key books for the course, and required chapters are posted here in PDF format. I recommend reading the remainder of the books, but this is not required.
Ragin, Charles C. 1987. The Comparative Method: Moving beyond Qualitative and Quantitative Strategies. Berkeley, CA: University of California Press.
Ragin, Charles C. 2000. Fuzzy Set Social Science. Chicago, IL: University of Chicago Press.
Ragin, Charles C. 2008. Redesigning Social Inquiry: Fuzzy Sets and Beyond. Chicago, IL: University of Chicago Press.
Ragin, Charles C., and Peer C. Fiss. 2017. Intersectional Inequality: Race, Class, Test Scores, and Poverty. Chicago, IL: University of Chicago Press.
Mandatory
Background Reading: The Comparative Method, chapters 6–8; Redesigning Social Inquiry, chapters 1–5; and Fuzzy Set Social Science, chapters 3–5.
Supplementary / voluntary
Goertz, Gary. 2006. Social Science Concepts: A User’s Guide. Princeton, NJ: Princeton University Press.
Goertz, Gary and James Mahoney. 2012. A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences. Princeton, NJ: Princeton University Press.
Rihoux, Benoit and Charles C. Ragin (eds.) 2008. Configurational Comparative Methods. Thousand Oaks, CA: Sage.
Schneider, Carsten and Claudius Wagemann. 2012. SetTheoretic Methods for the Social Sciences: A Guide to QCA. New York: Cambridge.
Mandatory readings before course start
The above chapters.
Examination part
Presentation (individual) (50%)
Research proposal written at home (individual) (50%)
Supplementary aids
To get inspiration for research proposals, I recommend that participants review recent research projects in their field using QCA. A bibliography of such projects is available at http://compasss.org/bibliography/
Examination content
The “structure” section above presents a complete list of topics relevant to the examination. Specifically, the oral presentation and subsequent examination paper will focus on using course materials to develop a research proposal.
Examination relevant literature
All required chapters listed above are part of the examination relevant literature, as are all course materials such as PPTs and additional materials distributed to the participants during the course.
Prerequisites (knowledge of topic)
A strong background in linear regression is a necessity. Background exposure to maximum likelihood models like logistic regression would be very helpful but is not strictly necessary. Some previous background exposure to multilevel, longitudinal, panel, or mixed effects models would be very helpful but is not necessary. People without a background in multilevel models should (time permitting) order a copy of either Multilevel Analysis: Techniques and Applications by Joop Hox, Mirjam Moerbeek, and Rens van de Schoot (2017) or Multilevel Analysis by Tom Snijders and Roel Bosker (2011) and attempt to read the early chapters ahead of time. Again, this is not a requirement to attend the class but will help you to absorb the material in lecture much more easily.
Hardware
A laptop—preferably a PC, as that is what I use. Please ensure that you have administrator access on your machine or that someone who does can help you install the needed software prior to the workshop. Without doing this you will be unable to follow along with the labs in class.
Software
The course will use R and RStudio, which are both free and open source. We will mainly be using the R packages lme4 and brms as well as some extensions. Please have both programs and the specific packages installed on your machine before you arrive. Note that brms requires Rtools to be installed on your machine, and that requires administrator access. An install script for all needed packages will be provided for registered students.
Note: if you are primarily a Stata user then I can provide you with some code (for version 16) to do many of the things covered in the course. However, we will not have time to go through it in class.
Course content
This course is designed to provide a practical guide to fitting advanced multilevel models. It is pitched for people from widely different backgrounds, so a significant amount of attention is paid to translating concepts across fields. My approach to the class combines work from econometrics, statistics/biostatistics, and psychometrics. The class is structured using a maximum likelihood framework with practical applied Bayesian extensions on different topics. R packages are selected specifically to make the transition from MLE to Bayesian multilevel models as straightforward and seamless as possible. This is a very applied course with annotated code provided and time in class for lab work. However, it is necessary to spend some class time working through theory and interpretation as well as the logic of mixed effects models.
Specific topics include:
• Random intercept and random slope models
• Cross-classified and multiple membership models
• Generalized linear mixed models
• Special topics chosen by students
The last day of class will have material chosen by the students from a predetermined list of possible topics. In order for your topic to be considered you must respond to the course survey by the end of lunch on Monday so that we can discuss updates during the afternoon.
While you will not be an expert in multilevel modeling after one week—this takes years of practice—you will have the tools to go home and fit many advanced models in your own work. By the end of the week you will have practical experience fitting both Bayesian and likelihood versions of basic and advanced multilevel models with RStudio. You will be able to produce diagnostics and results and hopefully interpret them correctly. If you use the models in your own work and read the supplementary materials for the course, you will end up with a very high level of knowledge in multilevel modeling over time. While we do cover Bayesian extensions for multilevel models, this course is not a substitute for a fully-fledged course on Bayesian data analysis. However, it will leave you very well prepared for such a course or for reading a Bayesian analysis textbook.
Structure
Day 1 – Introduction
Morning Lecture
• The basic multilevel modeling toolkit
o Random intercept models
o Random coefficient models
o Fixed effects models
• The model fitting checklist
o Data structures
o Missing data and selection bias
o Omitted variable bias
o Latent dependency structures
Afternoon Discussion: Modeling Test Scores
• This is the basic ‘students in classes in schools’ example on steroids. We will discuss differences between designs where we have some exogenous intervention for students (i.e., a causal inference model) and ones where we have observational data and can only really model correlations but have potentially very complex structures at work.
Afternoon Lab
• Software Introduction to lme4 and brms
• Fitting random intercept and random slope models with lme4 and brms
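The labs themselves use R, but the quantity a random intercept model partitions can be previewed in a few lines of code. The sketch below is illustrative only, not course material: it uses Python with NumPy (all variable names are hypothetical) to recover the intraclass correlation from simulated grouped data via ANOVA-style moment estimates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated grouped data: y_ij = u_j + e_ij, with a true
# between-group SD of 2 and within-group SD of 1, so the
# true intraclass correlation (ICC) is 4 / (4 + 1) = 0.8.
n_groups, n_per = 50, 20
u = rng.normal(0.0, 2.0, n_groups)                        # group intercepts
y = u[:, None] + rng.normal(0.0, 1.0, (n_groups, n_per))  # observations

# ANOVA-style moment estimates of the two variance components
# (the decomposition a random intercept model estimates).
group_means = y.mean(axis=1)
var_within = y.var(axis=1, ddof=1).mean()                 # within-group variance
var_between = group_means.var(ddof=1) - var_within / n_per
icc = var_between / (var_between + var_within)
```

The estimated `icc` should land near the true value of 0.8; in lme4 the same decomposition comes from the variance components of `lmer(y ~ 1 + (1 | group))`.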
Day 2 – Complex Data Structures
Morning Lecture
• Review of the toolkit and checklist
• Grouping structures
o What kinds of groups do you have?
o How many groups do you have?
• Latent dependency structures
o Groups and experiments
o Groups and time
o Groups and space
o Groups and networks
Afternoon Discussion: Modeling State Policymaking
• We will work out the logic of how to model environmental policy adoption in US states over time. We will mainly follow along with the design and analysis for my paper on state environmental policy adoption. This example highlights latent dependency structures (time, space, networks, latent classes) and complicated grouping structures.
Afternoon Lab
• Fitting cross-classified models with lme4 and brms
• Fitting multiple membership models with lme4 and brms
• Modeling problems for interference, time, space, and networks
Day 3 – Bias from Selection and Omitted Variables
Morning Lecture
• Review of the toolkit and checklist
• Selection bias and missing data
o Multilevel multiple imputation
o Multilevel selection models
• Omitted variable bias
o Mundlak and Hausman
o Fixed vs mixed effects models
o Random coefficients and Bayesian shrinkage
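The Mundlak device listed above can be demonstrated in a few lines: when a predictor is correlated with an omitted group effect, pooled OLS is biased, but adding the group mean of the predictor as a regressor recovers the within-group coefficient. This is a minimal simulation sketch, not course code (Python/NumPy, hypothetical names; the course labs use R).

```python
import numpy as np

rng = np.random.default_rng(0)
G, n = 200, 10
a = rng.normal(0, 1, G)                    # omitted group effects
xbar = a + rng.normal(0, 0.5, G)           # group-level x, correlated with a
x = xbar[:, None] + rng.normal(0, 1, (G, n))
y = 1.0 * x + 2.0 * a[:, None] + rng.normal(0, 1, (G, n))  # true within effect = 1

xf, yf = x.ravel(), y.ravel()
xbar_f = np.repeat(xbar, n)                # group mean of x, repeated per observation

# Pooled OLS of y on x: biased, since x correlates with the omitted a.
X1 = np.column_stack([np.ones_like(xf), xf])
b_pooled = np.linalg.lstsq(X1, yf, rcond=None)[0]

# Mundlak device: add the group mean of x; the coefficient on x
# is now the within-group effect, matching a fixed effects estimate.
X2 = np.column_stack([np.ones_like(xf), xf, xbar_f])
b_mundlak = np.linalg.lstsq(X2, yf, rcond=None)[0]
```

Here `b_pooled[1]` overshoots the true within-group slope of 1.0, while `b_mundlak[1]` recovers it; this is the logic behind the Mundlak and Hausman comparisons of fixed vs mixed effects models.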
Afternoon Discussion: Modeling Changes in Public Opinion Over Time
• We will go through some basic and not-so-basic strategies to model dynamic changes in public opinion over time. We will mainly use a data set on Jewish Israeli public support for a two-state solution over a 20-year period. This example provides an excellent illustration of time-varying confounding, also known as contextual effect moderation through time.
Afternoon Lab
• Review of previous labs
• Multilevel multiple imputation
• Multilevel selection models
• Fixed, random, and mixed effects model comparisons
Day 4 – Practical Model Fitting, Diagnostics, and Model Comparison
Morning Lecture
• Review of the toolkit and checklist
• Building an analysis plan
• Building a coherent workflow
• Model comparison
Morning Lab
• Model diagnostics in lme4
• Model comparison with lme4
Afternoon Lab
• Priors in brms
• Model diagnostics in brms
• Model comparison with brms
Afternoon Lecture
• Explaining and justifying what you’ve done to others
Day 5 – Special Topics
Participants choose among the following advanced topics based on personal interest. I will have the final list of topics that I plan to cover updated by Tuesday afternoon. While I will attempt to accommodate everyone in the class, it is unlikely that there will be sufficient time to cover all requested topics.
• Review of any topic already covered in the class
• Multilevel categorical modeling
• Multilevel ordered choice modeling
• Multilevel survival modeling
• Multilevel propensity scores
• Multilevel regression and poststratification
• Multilevel structural equation modeling
Literature
Mandatory readings before course start
Gill, J. and A. J. Womack (2013). The Multilevel Model Framework. The SAGE handbook of multilevel modeling. M. A. Scott, J. S. Simonoff and B. D. Marx, Sage.
Enders, C. K. (2013). Centering predictors and contextual effects. SAGE Handbook of Multilevel Modeling. M. A. Scott, J. S. Simonoff and B. D. Marx.
Fielding, Antony, and Harvey Goldstein. 2006. “Cross-classified and multiple membership structures in multilevel models: An introduction and review.”
Bell, A. and K. Jones (2015). “Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data.” Political Science Research and Methods 3(1): 133–153.
Examination part
Students will have two weeks after the last day of class to complete a homework assignment worth 100% of their grade in the course. This assignment will include coding practice, theory, results interpretation, and research design problems. Students may submit it for feedback at least 72 hours before the deadline. As this is a homework assignment, course notes, readings, and R scripts are obviously allowed. Examples of these problems will be worked through during class time.
Homework will be a mix of research design applications, coding, and fitting models.
1. Students will be given research questions and be required to outline a set of potential analyses designed to answer them. This will include tradeoffs and potential weaknesses in their analysis.
2. Students will be required to diagram R code and explain the purpose and use of each segment. They will be required to articulate how different sections of the code work “under the hood” and outline any relevant implications.
3. Students will be required to fit models, perform diagnostics, and report/interpret results accurately.
The material needed for study will be lecture notes, the required readings in the list above, and the R package documentation for packages used in the course.
Prerequisites (knowledge of topic)
– Basic programming skills, Python recommended
– Undergraduate-level linear algebra, analysis, and statistics
Hardware
– Personal laptop running macOS, Linux, or Windows
– Tablets (iOS, Windows) will not work for this lecture
Software
– Webbrowser (Chrome, Safari, Firefox)
– Text editor
– Jupyter Notebook
– Local Python installation including NumPy, SciPy, scikit-learn, PyTorch
(there will be an installation session on the first day for participants)
Course content
– Machine Learning Refresher
o Supervised Learning vs. Unsupervised Learning
o Traditional Machine Learning vs. End-to-End Learning
– Fundamentals of Neural Networks:
o Rosenblatt Perceptron and Neurons
o Network Structure (feedforward, recurrent), matrix notation, forward evaluation
– Training as optimization
o Loss and Error functions
o Backpropagation
o SGD and other optimizers
– Activation functions and topologies
o Convolutional neural networks
o Generative Adversarial Networks
o Long short-term memory networks
o Special layer types (Inception, ResNet)
o Embeddings
o Attention mechanisms & Transformers
– Applications to realworld problems:
o Acoustic keyword recognition (audio/speech processing)
o Sentiment analysis (text processing)
o Digit recognition (image processing)
o Tiny Image Recognition (image processing)
o Face Detection and Tracking (image/video processing)
o Stock market prediction (time series prediction)
– Training on large data sets (Hardware, GPU)
– Trustworthy AI
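As an illustration of the “Training as optimization” block above, here is a minimal sketch, not course material, of a two-layer network trained on XOR by backpropagation and full-batch gradient descent (Python/NumPy; all names are hypothetical, and the course labs use their own notebooks).

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR data: the classic problem a single Rosenblatt perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer (4 tanh units), sigmoid output, small random init.
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

lr, losses = 0.5, []
for _ in range(5000):
    # Forward evaluation
    h = np.tanh(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y - t) ** 2)))   # MSE loss

    # Backpropagation: chain rule applied layer by layer
    dy = 2 * (y - t) / len(X) * y * (1 - y)       # grad at output pre-activation
    dW2 = h.T @ dy;  db2 = dy.sum(axis=0)
    dh = (dy @ W2.T) * (1 - h ** 2)               # tanh derivative
    dW1 = X.T @ dh;  db1 = dh.sum(axis=0)

    # Gradient descent update (full batch; SGD would use minibatches)
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2
```

The loss curve in `losses` should fall steadily; the same forward/backward structure underlies the PyTorch labs, where autograd computes the gradients instead.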
Structure
The course combines theoretical content in the morning with practical exercises in the afternoon, in the form of Jupyter notebook lab programming.
Literature
Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016
Supplementary
• http://jupyter.org/
• http://www.numpy.org/
• https://www.scipy.org/
• https://www.tensorflow.org
• https://pytorch.org
Examination part
– Completed Jupyter notebook labs 1–8 (40%, closed book), in class
– Completed Jupyter notebook assignments 1–8 (60%, open book), at home
Prerequisites (knowledge of topic)
Basic knowledge of descriptive statistics, data analysis and R is useful, but not necessary. Participants need to bring their own laptop and complete our detailed installation instructions for R and RStudio (both open source software) shared prior to the course.
Learning objectives
The creation and communication of data visualizations is a critical step in any data analytic project. Modern open-source software packages offer ever more powerful data visualization tools. When applied with psychological and design principles in mind, users competent in these tools can produce data visualizations that easily tell more than a thousand words. In this course, participants learn how to employ state-of-the-art data visualization tools within the programming language R to create stunning, publication-ready data visualizations that communicate critical insights about data. Prior to, during, and after the course, participants work on their own data visualization project.
Course content
Each day will contain a series of short lectures and demonstrations that introduce and discuss new topics. The bulk of each day will be dedicated to hands-on, step-by-step exercises to help participants ‘learn by doing’. In these exercises, participants will learn how to read in and prepare data, how to create various types of static and interactive data visualizations, how to tweak them to exactly fit one’s needs, and how to embed them in digital reports. Accompanying the course, each participant will work on his or her own data visualization project, turning an initial visualization sketch into a one-page academic paper featuring a polished, well-designed figure. To advance these projects, participants will be able to draw on support from the instructors in the afternoons of course days two to four.
Structure
Day 1
Morning: Cognitive and design principles of good data visualizations
Afternoon: Introduction to R
Day 2
Morning: Reading in, organizing, and transforming data
Afternoon: Project sketch pitches
Day 3
Morning: Creating plots using the grammar of graphics
Afternoon: Visualizing statistical uncertainty, facets, networks, and maps
Day 4
Morning: Styling and exporting plots
Afternoon: Making visualizations interactive
Day 5
Morning: Reporting visualizations using Markdown
Afternoon: Final presentation and competition
Literature
Voluntary readings:
Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons.
Healy, K. (2018). Data visualization: a practical introduction. Princeton University Press.
Examination part
The course grade is determined based on the quality of the initial project sketch (20%), the data visualization produced during the course (40%), and the onepage paper submitted after the course (40%).
In the past 60 years, econometrics has provided us with many tools to uncover many different types of correlations. The technical level of this literature is impressive (see the PEF course Advanced Microeconometrics). However, at the end of the day, correlations are of limited interest if they have no causal implication. For example, the fact that smokers are more likely to die earlier than other people does not tell us much about the effect of smoking: it might just be that smokers are the type of people who face more health and crime risks for quite different (social or genetic) reasons. The same problem occurs with almost any correlation of economic or financial variables. The interesting question is always whether these correlations are spurious, or whether they tell us something about the underlying causal link between the variables involved.
In this course we review and organize the rapidly developing literature on causal analysis in economics and econometrics and consider the conditions and methods required for drawing causal inferences from the data. Empirical applications play an important role in this course.
Active participation of PhD students participating in this course is expected. During the second part of the course, participants will conduct their own empirical study and present their results.
General structure and rules
Students activities
Active participation of the students in this course is the key to its success. Students are expected to do the following:
 Read the papers listed as ‘compulsory reading’ in the reading list BEFORE the lecture concerned with the topic.
 Each morning students will present a paper (15‑30 minutes each, depending on the number of participants), and there will be some general discussion of these papers. Students not presenting will be expected to have at least skimmed the papers so that they can participate in the discussion.
 Small groups of students (group size depends on the number of participants) will conduct an independent empirical study (using software of their own choice; GAUSS or Stata is recommended). In the empirical project, students will show that they have understood the basic concepts and are able to apply them to a ‘real life’ situation.
Grades
 Written Exam about 4 weeks after the last lecture (2 hours) (40%).
 Students’ active participation in general discussions during lectures and presentations (20%).
 Presentation of papers (20%).
 Empirical project (based on two presentations; 20%).
Prerequisites
As defined for the econometrics specialisation of PEF.
Course literature
To be published shortly before the lecture.
Examination content
Empirical work, literature, contents of lecture
Examination relevant literature
To be defined during the lecture