Mixed models

aka Random effects models aka hierarchical models aka multilevel models

Aleš Vomáčka

The Plan

  • Mixed models, technically

  • Applications

  • Technical comments

No, Complete & Partial Pooling

GPA Over Time and Gender

  • We are interested in how academic skills develop over time, depending on gender.

  • Data from university students measured across 6 semesters.

  • Academic skills measured as GPA (Grade Point Average).

  • There are multiple ways to model this.

Option 1: Complete Pooling

  • We can pool all information from all students together and estimate a single effect:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion, \epsilon) \]

  • In R code:
gpa_complete <- lm(gpa ~ occasion, data = gpa)

Option 1: Complete Pooling

  • Single regression line for everyone

Option 1: Complete Pooling

  • Complete pooling assumes “clusters” (students) are interchangeable.

    • Trend is the same for all students

    • Observations within students are no more correlated than across students.

  • Advantage - Efficient, both statistically and computationally

  • Disadvantages - Ignores effect heterogeneity, overly optimistic standard errors

Option 2: No Pooling

  • We can estimate the effect for each student separately:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]

  • In R code:
gpa_no <- lm(gpa ~ occasion * student, data = gpa)

Option 2: No Pooling

  • Individual regression line for every student

Option 2: No Pooling

  • No pooling assumes every “cluster” (student) is unique

    • Every student has a unique trend.

    • Students are unique - information about one says nothing about the rest.

  • Advantages - Allows for effect heterogeneity, accounts for dependent observations.

  • Disadvantages - Extremely inefficient (large standard errors), no generalization to new students.

Option 3: Partial Pooling

  • Assuming all students are interchangeable is unrealistic (complete pooling).

  • Assuming all students are completely unique is impractical (no pooling).

  • What is one to do?

  • Assume that students are distinct, but similar (partial pooling)!

Option 3: Partial Pooling

  • We can assume all students come from the same population.

  • By convention, we (usually) assume the population is normally distributed.

  • That is, the regression lines for most students will be similar to each other, with a small number of outliers (both positive and negative).

Option 3: Partial Pooling - Random Intercepts

  • We can assume students have different starting points, but the same progression.

  • We call this a Random Intercept model.
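
  • In R code (a minimal sketch with lme4, assuming the same gpa data as above):

library(lme4)

# one common slope for occasion, a separate intercept for each student
gpa_ri <- lmer(gpa ~ occasion + (1 | student), data = gpa)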

Option 3: Partial Pooling - Random Slopes

  • We can assume both the starting points and the progression are individual.

  • We call this a Random Slopes model.
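
  • In R code (again a sketch with lme4):

# each student gets their own intercept and their own occasion slope
gpa_rs <- lmer(gpa ~ occasion + (1 + occasion | student), data = gpa)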

Option 3: Partial Pooling

  • Partial pooling assumes “clusters” (students) are distinct, but similar.

    • General trend + individual effects

    • Final prediction for each student uses both the information we have about them specifically and information we have about all others.

  • Advantages:

    • Allows for effect heterogeneity, accounts for dependent observations, statistically efficient.

  • Disadvantages:

    • Computationally demanding, harder to set up.

Complete/No/Partial Pooling Comparison

Comparing Regression Coefficients/AMEs

Questions?

A Bit of Math

  • Classical linear regression (no pooling) looks like this:

\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]

  • Partial pooling models look like this:

\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0, \delta) \end{align} \]

A Bit of Math

\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0, \delta) \end{align} \]

  • The model has multiple levels, hence multilevel models

  • \(\beta_{2j}\) is a random variable, not a fixed coefficient, hence random effects models.

  • Models that have both fixed and random coefficients are mixed effects models.

  • The nomenclature is… difficult.

  • There are actually 8-10 different names for this kind of model.

  • Fixed/random effects can also mean different things.

A Bit of Math

  • The predicted gpa for a “cluster” (e.g. student) is

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • Final prediction for a student is a weighted average of the information we know about the student and the information we know about everyone else.

A Bit of Math

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • The more observations we have from a student (\(n_j\)) and the smaller the residual variance (\(\sigma^2_y\)), the closer we get to the no pooling approach.

  • The fewer observations we have from a student and the smaller the variance between students (\(\sigma^2_\alpha\)), the closer we get to the complete pooling approach.
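
  • A toy example with made-up numbers: say \(\overline{gpa_j} = 3.5\), \(\overline{gpa_{all}} = 3.0\), \(\sigma^2_y = 0.4\) and \(\sigma^2_\alpha = 0.1\), so the weights are \(n_j / 0.4\) and \(1 / 0.1 = 10\):

\[ n_j = 8: \quad \frac{20 \cdot 3.5 + 10 \cdot 3.0}{20 + 10} \approx 3.33 \qquad\qquad n_j = 2: \quad \frac{5 \cdot 3.5 + 10 \cdot 3.0}{5 + 10} \approx 3.17 \]

  • With 8 observations the prediction stays close to the student’s own mean; with only 2, it is pulled towards the grand mean.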

A Bit of Math

\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]

  • (Roughly speaking) Mixed models dynamically balance the amount of pooling.

    • Clusters we have a lot of information about don’t need much pooling, and the results will reflect that.

    • Clusters we have little information about need more “support”, so we borrow information from the rest of the data.

Questions?

Applications

Mixed Models in Practice

  • Accounting for dependent observations

  • Regularization

  • Individual vs Group effects

Accounting for dependent observations

  • By far the most common use in social sciences.

  • Common regression assumes observations are independent of each other.

  • Multilevel models allow observations within groups to be correlated with each other.

InteRmezzo!

GPA Model Recap

  • We can give each student their own intercept and/or slope.

  • We can look at average trend or individual random effects.

  • When interpreting, we need to choose between (see the sketch below):

    • Average effect across students (re.form = NULL)

    • Effect for the average student (re.form = NA)

  • Other model interpretation and diagnostics work the same as usual.

  • Predictive importance of the random components can be assessed by the ICC.
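
  • A sketch of the two options in lme4 (using the hypothetical gpa_rs model from earlier):

# average effect across students: keep the estimated random effects
predict(gpa_rs, re.form = NULL)

# effect for the average student: set all random effects to zero
predict(gpa_rs, re.form = NA)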

Intraclass Correlation Coefficient (ICC)

  • How much variance in the outcome can be predicted by the random components.

    • 0 = no difference between clusters (e.g. students).

    • 1 = all variance is due to cluster differences.

  • The ICC is (usually) also the correlation between two observations from the same cluster.

  • In random slopes models, the ICC depends on the value of the predictor with the random slope.
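
  • For a random intercept model, using the notation from the shrinkage formula above, the ICC is:

\[ ICC = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_y} \]

  • In R code (a sketch via the performance package, with the hypothetical gpa_ri model from earlier):

performance::icc(gpa_ri)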

Accounting for dependent observations

  • Most common application.

  • Bit of an overkill on its own - simpler tools like fixed effects can handle this (e.g. the fixest package; see the sketch below).

  • Multilevel models can do more.
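
  • For illustration, a fixed effects sketch with fixest (hypothetical, same gpa data as before):

library(fixest)

# student fixed effects absorb all stable between-student differences
gpa_fe <- feols(gpa ~ occasion | student, data = gpa)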

Questions?

Regularization

Sexual Violence by Region

  • (All data here are simulated!)

  • We are interested in the prevalence of sexual harassment in the population by region.

  • We have data from a survey asking people whether they have experienced sexual harassment in the last 6 months.

  • Problem: The survey only included 1 000 respondents.

Sexual Violence by Region - Simple Estimates

Sexual Violence by Region - Pooled Estimates

Regularization

  • Two sources of error:

    • Bias - systematic errors

    • Variance - random errors

  • Traditional (applied) statistics placed a large focus on reducing bias.

  • Today, bias and variance can be traded off against each other (the bias-variance trade-off).

Regularization

  • Regularization - technique that increases bias, but lowers variance.

    • LASSO, Ridge regression

    • Bayesian priors

    • Partial pooling

  • Hopefully, the decrease in variance is much larger than the increase in bias.

Non-regularized Predictions

Regularized Predictions (Partial Pooling)

Regularization

  • Simple estimates have no bias, but high variance.

  • Regularized estimates have high bias, but low variance.

  • Bias can be reduced by including cluster (region) level predictors (see the sketch below):

    • Unemployment rate

    • Education level

  • (And of course by increasing sample size)
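
  • A minimal sketch (hypothetical survey data with a binary harassed outcome):

library(lme4)

# region-level predictors shift the mean each region is shrunk towards
m_region <- glmer(harassed ~ unemployment + education + (1 | region),
                  family = binomial, data = survey)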

Regularization to Prevent False Positives

  • Partial pooling can be used to lower the false positive rate.

  • By partially pooling data, we are squishing unusually large estimates towards the grand mean, unless we have a lot of data supporting the large estimate.

  • Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don’t Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness, 5(2), 189–211. https://doi.org/10.1080/19345747.2011.618213

Questions?

InteRmezzo!

Individual vs Group Effects

Individual vs Group Effects

  • We know that students with higher SES perform better at school.

  • But does this effect exist on the individual level, the group level, or both?

    • Do low income students benefit from going to high SES schools?

  • Mixed models allow for estimation of both effects at once!

Individual vs Group Effects

(Figure: Jakub Lisek, 2023)

Individual vs Group Effects

  • Sometimes called the Mundlak approach.

  • Pretty easy, just add cluster means and de-meaned individual values as predictors (see the sketch after this list).

  • You can even add random slopes to account for each cluster (school) having a different relationship between SES and test scores.

  • Works the same with categorical predictors.
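
  • A sketch of the setup (hypothetical schools data; variable names match the convergence example later on):

library(dplyr)
library(lme4)

schools <- schools |>
  group_by(primary_school_id) |>
  mutate(ses_school = mean(ses),           # cluster (school) mean
         ses_within = ses - ses_school) |> # de-meaned individual value
  ungroup()

# ses_within = individual effect, ses_school = group effect
mundlak <- lmer(achievement ~ ses_within + ses_school + (1 | primary_school_id),
                data = schools)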

Questions?

InteRmezzo!

Technical stuff

Convergence problems

  • Multilevel models often run into computational problems because of:

    • Low sample size

    • Small number of clusters

    • Large number of parameters

    • Parameter estimates too close to their bounds

  • It matters how you fit them!

Convergence problems

  • Often, trying different optimizers works

  • lme4 offers several options:

    • "nloptwrap" - fast, but unstable

    • "bobyqa" - slower, but more reliable

  • You can set optimizer like this:

glmer(achievement ~ ses + ses_school + (1|primary_school_id),
      data = schools, # hypothetical data frame, as in the Mundlak sketch above
      control = glmerControl(optimizer = "bobyqa"))

Convergence problems

  • If choosing a different optimizer and re-specifying the model both fail, there is the nuclear option.

  • Simulation based estimation.

    • Advantage - almost guaranteed to work

    • Disadvantage - much slower to compute

  • In R, there is the brms package, which uses the same syntax:

brm(achievement ~ ses + ses_school + (1|primary_school_id),
    data = schools, # hypothetical data frame, as above
    iter = 2000,
    cores = 4)

Minimal Number of Clusters

  • Many rules of thumb (15, 30, 50) - most of them are wrong.

  • Read this (in order):

    • Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science, 57(3), 748–761. https://doi.org/10.1111/ajps.12001

    • Elff, M., Heisig, J., Schaeffer, M., & Shikano, S. (2016). No Need to Turn Bayesian in Multilevel Analysis with Few Clusters: How Frequentist Methods Provide Unbiased Estimates and Accurate Inference.

Minimal Number of Clusters

  • Theoretical minimum is 3 clusters

    • (fewer gives the same result as normal regression).

  • Practically, the minimum is 5-6

    • (with fewer, partial pooling offers little benefit).

  • You need to be a bit careful though.

Minimal Number of Clusters

  • Bayesian/simulation based models work out of the box (just shove the data into brms)

  • Frequentist/maximum likelihood models (e.g. lme4) need some consideration:

    • Use restricted maximum likelihood (REML) - lme4 does this by default

    • Assume a t distribution for the random effects - honestly, no idea…

  • You’ll get good results (unbiased estimates, proper coverage, etc.), but more data is obviously better.

Questions?