Mixed models

aka Random effects models aka hierarchical models aka multilevel models

Aleš Vomáčka

The Plan

  • Mixed models, technically

  • Applications

  • Technical comments

No, Complete & Partial Pooling

GPA Over Time and Gender

  • We are interested in how academic skills develop over time, depending on gender.

  • Data from university students measured across 6 semesters.

  • Academic skills measured as GPA (Grade Point Average).

  • There are multiple ways to model this.

Option 1: Complete Pooling

  • We can pool all information from all students together and estimate a single effect:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion, \epsilon) \]

  • In R code:
gpa_complete <- lm(gpa ~ occasion, data = gpa)

Option 1: Complete Pooling

  • Single regression line for everyone

Option 1: Complete Pooling

  • Complete pooling assumes “clusters” (students) are interchangeable.

    • Trend is the same for all students

    • Observations within students are no more correlated than across students.

  • Advantage - Efficient, both statistically and computationally

  • Disadvantages - Ignores effect heterogeneity, overly optimistic standard errors

Option 2: No Pooling

  • We can estimate the effect for each student separately:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]

  • In R code:
gpa_no <- lm(gpa ~ occasion * student, data = gpa)

Option 2: No Pooling

  • Individual regression line for every student

Option 2: No Pooling

  • No pooling assumes every “cluster” (student) is unique

    • Every student has a unique trend.

    • Students are unique - information about one says nothing about the rest.

  • Advantages - Allows for effect heterogeneity, accounts for dependent observations.

  • Disadvantages - Extremely inefficient (large standard errors), no generalization to new students.

Option 3: Partial Pooling

  • Assuming all students are interchangeable is unrealistic (complete pooling).

  • Assuming all students are completely unique is impractical (no pooling).

  • What is one to do?

  • Assume that students are distinct, but similar (partial pooling)!

Option 3: Partial Pooling

  • We can assume all students come from the same population.

  • By convention, we (usually) assume the population is normally distributed.

  • That is, the regression lines for most students will be similar to each other, with a small number of outliers (both positive and negative).

Option 3: Partial Pooling - Random Intercepts

  • We can assume students have different starting points, but the same progression.

  • We call this a Random Intercept model.
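
  • In R code (a minimal sketch with lme4, assuming the same gpa data as above):

library(lme4)

# one common slope for occasion, a separate intercept for each student
gpa_ri <- lmer(gpa ~ occasion + (1 | student), data = gpa)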

Option 3: Partial Pooling - Random Slopes

  • We can assume both the starting points and the progression are individual.

  • We call this a Random Slopes model.
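
  • In R code (again a sketch with lme4):

# each student gets their own intercept and their own occasion slope
gpa_rs <- lmer(gpa ~ occasion + (1 + occasion | student), data = gpa)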

Option 3: Partial Pooling

  • Partial pooling assumes “clusters” (students) are distinct, but similar.

    • General trend + individual effects

    • Final prediction for each student uses both the information we have about them specifically and information we have about all others.

  • Advantages:

    • Allows for effect heterogeneity, accounts for dependent observations, statistically efficient.

  • Disadvantages:

    • Computationally demanding, harder to set up.

Complete/No/Partial Pooling Comparison

Comparing Regression Coefficients/AMEs

Questions?

A Bit of Math

  • Classical linear regression (no pooling) looks like this:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]

  • Partial pooling models look like this:

\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0, \delta) \end{align} \]

A Bit of Math

\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0, \delta) \end{align} \]

  • The model has multiple levels, hence multilevel models

  • \(\beta_{2j}\) is a random variable, not a fixed coefficient, hence random effects models.

  • Models that have both fixed and random coefficients are mixed effects models.

  • The nomenclature is… difficult.

  • There are actually 8-10 different names for this kind of model.

  • Fixed/random effects can also mean different things.

A Bit of Math

  • The predicted gpa for a “cluster” (e.g. student) is

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • Final prediction for a student is a weighted average of the information we know about the student and the information we know about everyone else.

A Bit of Math

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • The more observations we have from a student (\(n_j\)) and the smaller the residual variance (\(\sigma^2_y\)), the closer we get to the no pooling approach.

  • The fewer observations we have from a student and the smaller the variance between students (\(\sigma^2_\alpha\)), the closer we get to the complete pooling approach.
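
  • A toy example with made-up numbers: say \(\overline{gpa_j} = 3.5\), \(\overline{gpa_{all}} = 3.0\), \(\sigma^2_y = 0.4\) and \(\sigma^2_\alpha = 0.1\), so the weights are \(n_j / 0.4\) and \(1 / 0.1 = 10\):

\[ n_j = 8: \quad \frac{20 \cdot 3.5 + 10 \cdot 3.0}{20 + 10} \approx 3.33 \qquad\qquad n_j = 2: \quad \frac{5 \cdot 3.5 + 10 \cdot 3.0}{5 + 10} \approx 3.17 \]

  • With 8 observations the prediction stays close to the student’s own mean; with only 2, it is pulled towards the grand mean.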

A Bit of Math

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • (Roughly speaking) Mixed models dynamically balance the amount of pooling.

    • Clusters we have a lot of information about don’t need much pooling, and the results will reflect that.

    • Clusters we have little information about need more “support”, so we borrow information from the rest of the data.

Questions?

Applications

Mixed Models in Practice

  • Accounting for dependent observations

  • Regularization

  • Individual vs Group effects

Accounting for dependent observations

  • By far the most common use in social sciences.

  • Common regression assumes observations are independent of each other.

  • Multilevel models allow observations within groups to be correlated with each other.

InteRmezzo!

GPA Model Recap

  • We can give each student their own intercept and/or slope.

  • We can look at average trend or individual random effects.

  • When interpreting, we need to choose between (see the sketch below):

    • Average effect across students (re.form = NULL)

    • Effect for the average student (re.form = NA)

  • Other model interpretation and diagnostics work the same as usual.

  • Predictive importance of the random components can be assessed by the ICC.
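
  • A sketch of the two options in lme4 (using the hypothetical gpa_rs model from earlier):

# average effect across students: keep the estimated random effects
predict(gpa_rs, re.form = NULL)

# effect for the average student: set all random effects to zero
predict(gpa_rs, re.form = NA)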

Intraclass Correlation Coefficient (ICC)

  • How much variance in the outcome can be predicted by the random components.

    • 0 = no difference between clusters (e.g. students).

    • 1 = all variance is due to cluster differences.

  • The ICC is (usually) also the correlation between two observations from the same cluster.

  • In random slopes models, the ICC depends on the value of the predictor with the random slope.
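
  • For a random intercept model, using the notation from the shrinkage formula above, the ICC is:

\[ ICC = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_y} \]

  • In R code (a sketch via the performance package, with the hypothetical gpa_ri model from earlier):

performance::icc(gpa_ri)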

Accounting for dependent observations

  • Most common application.

  • Bit of an overkill on its own - simpler tools like fixed effects can handle this (e.g. the fixest package; see the sketch below).

  • Multilevel models can do more.
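
  • For illustration, a fixed effects sketch with fixest (hypothetical, same gpa data as before):

library(fixest)

# student fixed effects absorb all stable between-student differences
gpa_fe <- feols(gpa ~ occasion | student, data = gpa)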

Questions?

Regularization

Sexual Violence by Region

  • (All data here are simulated!)

  • We are interested in the prevalence of sexual harassment in the population by region.

  • We have data from a survey asking people whether they have experienced sexual harassment in the last 6 months.

  • Problem: The survey only included 1 000 respondents.

Sexual Violence by Region - Simple Estimates

Sexual Violence by Region - Pooled Estimates

Regularization

  • Two sources of error:

    • Bias - systematic errors

    • Variance - random errors

  • Traditional (applied) statistics placed a large focus on reducing bias.

  • Today, bias and variance can be traded off against each other (the bias-variance trade-off).

Regularization

  • Regularization - technique that increases bias, but lowers variance.

    • LASSO, Ridge regression

    • Bayesian priors

    • Partial pooling

  • Hopefully, the decrease in variance is much larger than the increase in bias.

Non-regularized Predictions

Regularized Predictions (Partial Pooling)

Regularization

  • Simple estimates have no bias, but high variance.

  • Regularized estimates have high bias, but low variance.

  • Bias can be reduced by including cluster (region) level predictors (see the sketch below):

    • Unemployment rate

    • Education level

  • (And of course by increasing sample size)
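
  • A minimal sketch (hypothetical survey data with a binary harassed outcome):

library(lme4)

# region-level predictors shift the mean each region is shrunk towards
m_region <- glmer(harassed ~ unemployment + education + (1 | region),
                  family = binomial, data = survey)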

Regularization to Prevent False Positives

  • Partial pooling can be used to lower the false positive rate.

  • By partially pooling data, we are squishing unusually large estimates towards the grand mean, unless we have a lot of data supporting the large estimate.

  • Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don’t Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness, 5(2), 189–211. https://doi.org/10.1080/19345747.2011.618213

Questions?

InteRmezzo!

Individual vs Group Effects

Individual vs Group Effects

  • We know that students with higher SES perform better at school.

  • But does this effect exist on the individual level, the group level, or both?

    • Do low income students benefit from going to high SES schools?

  • Mixed models allow for estimation of both effects at once!

Individual vs Group Effects

(Figure: Jakub Lisek, 2023)

Individual vs Group Effects

  • Sometimes called the Mundlak approach.

  • Pretty easy, just add cluster means and de-meaned individual values as predictors (see the sketch after this list).

  • You can even add random slopes to account for each cluster (school) having a different relationship between SES and test scores.

  • Works the same with categorical predictors.
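
  • A sketch of the setup (hypothetical schools data; variable names match the convergence example later on):

library(dplyr)
library(lme4)

schools <- schools |>
  group_by(primary_school_id) |>
  mutate(ses_school = mean(ses),           # cluster (school) mean
         ses_within = ses - ses_school) |> # de-meaned individual value
  ungroup()

# ses_within = individual effect, ses_school = group effect
mundlak <- lmer(achievement ~ ses_within + ses_school + (1 | primary_school_id),
                data = schools)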

Questions?

InteRmezzo!

Technical stuff

Convergence problems

  • Multilevel models often run into computational problems because of:

    • Low sample size

    • Small number of clusters

    • Large number of parameters

    • Parameter estimates too close to their bounds

  • It matters how you fit them!

Convergence problems

  • Often, trying different optimizers works

  • lme4 offers several options:

    • "nloptwrap" - fast, but unstable

    • "bobyqa" - slower, but more reliable

  • You can set optimizer like this:

glmer(achievement ~ ses + ses_school + (1|primary_school_id),
      data = schools, # hypothetical data frame, as in the Mundlak sketch above
      control = glmerControl(optimizer = "bobyqa"))

Convergence problems

  • If choosing a different optimizer and re-specifying the model both fail, there is the nuclear option.

  • Simulation based estimation.

    • Advantage - almost guaranteed to work

    • Disadvantage - much slower to compute

  • In R, there is the brms package, which uses the same syntax:

brm(achievement ~ ses + ses_school + (1|primary_school_id),
    data = schools, # hypothetical data frame, as above
    iter = 2000,
    cores = 4)

Minimal Number of Clusters

  • Many rules of thumb (15, 30, 50) - most of them are wrong.

  • Read this (in order):

    • Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science, 57(3), 748–761. https://doi.org/10.1111/ajps.12001

    • Elff, M., Heisig, J., Schaeffer, M., & Shikano, S. (2016). No Need to Turn Bayesian in Multilevel Analysis with Few Clusters: How Frequentist Methods Provide Unbiased Estimates and Accurate Inference.

Minimal Number of Clusters

  • Theoretical minimum is 3 clusters

    • (fewer gives the same result as normal regression).

  • Practically, the minimum is 5-6

    • (with fewer, partial pooling offers little benefit).

  • You need to be a bit careful though.

Minimal Number of Clusters

  • Bayesian/simulation based models work out of the box (just shove the data into brms)

  • Frequentist/maximum likelihood models (e.g. lme4) need some consideration:

    • Use restricted maximum likelihood (REML) - lme4 does this by default

    • Assume a t distribution for the random effects - honestly, no idea…

  • You’ll get good results (unbiased estimates, proper coverage, etc.), but more data is obviously better.

Questions?