aka Random effects models aka hierarchical models aka multilevel models
Mixed models technically.
Application
Technical comments
We are interested how academic skills develop over time, based on gender.
Data from university students measured across 6 semesters.
Academic skills measured as GPA (Graded Point Average).
\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion, \epsilon) \]
Complete pooling assumes “clusters” (students) are interchangeble.
Trend is the same for all students
Observations within students are no more correlated than across students.
Advantage - Efficient, both statistically and computationally
Disadvantages - Ignores effect heterogeneity, overly optimistic standard errors
\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]
No pooling assumes every “cluster” (student) is unique
Every student has unique trend.
Students are unique - information about one says nothing about the rest.
Advantage - Allows for effect heterogeneity, accounts for dependent observations.
Disadvantage - Extremely inefficient (large standard errors), No generalization to new students.
Assuming all students are interchangeable is unrealistic (complete pooling).
Assuming all students are completely unique is impractical (no pooling).
We can assume students have different starting points, but the same progression.
We call this Random Intercept model.
We can assume both starting points and progression is individual.
We call this Random Slopes model.
Partial pooling assumes “clusters” (students) are distinct, but similar.
General trend + individual effects
Final prediction for each student uses both the information we have about them specifically and information we have about all others.
Advantage:
Disadvantage:
\[ gpa \sim N(\beta_0 + \beta_1 \cdot occasion + \beta_2 \cdot student + \beta_3 \cdot (occasion \cdot student), \epsilon) \]
\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0 + \gamma_1 \cdot student, \delta) \end{align} \]
\[ \begin{align} gpa & \sim N(\beta_1 \cdot occasion + \beta_{2j} \cdot student, \epsilon) \\ \beta_{2j} & \sim N(\gamma_0 + \gamma_1 \cdot student, \delta) \end{align} \]
The model has multiple levels, hence multilevel models
\(\beta_2j\) is random variable, not a fixed coefficient, hence random effect models.
Models that have both fixed and random coefficients are mixed effects models.
The nomenclature is… difficult.
There actually 8-10 different names for this kind of models.
Fixed/random effects can also mean different things.
\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]
\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]
The more observation we have from the student and the smaller their variance, the closer to no pooling approach.
The less observations we have from the student and the bigger the variance across all students, the closer to complete pooling approach.
\[ gpa_j^{multilevel} \approx \frac{\color{blue}{\frac{n_j}{\sigma^2_y} \overline{gpa_j} }+ \color{red}{ \frac{1}{\sigma^2_\alpha} \overline{gpa_{all}}}}{\color{blue}{\frac{n_j}{\sigma^2_y}} + \color{red} {\frac{1}{\sigma^2_\alpha}}} \]
(Roughly speaking) Mixed models dynamically balance estimate pooling.
Clusters we have a lot of information about don’t need pooling and the results will reflect that.
Cluster we have little information about need more “support”, so we add info from the rest of the data.
Accounting for dependent observations
Regularization
Individual vs Group effects
By far the most common use in social sciences.
Common regression assumes observations are independent of each other.
With multilevel models, observations inside groups are assumed to be dependent on each other.
We can give each student own intercept and/or slope
We can look at average trend or individual random effects.
When interpreting, we need to choose between:
Average effect across student (re.form = NULL)
Effect for average student (re.form = NA}
Other model interpretation and diagnostic works the same as usual.
Predictive importance of random components can be assessed by ICC.
How much variance in the outcome can be predicted by the random components.
0 = no difference between clusters (e.g. students).
1 = all variance is due to cluster differences.
ICC is correlation between values from the same cluster (usually).
In random slopes model, ICC depends on the value of the random variable.
Most common application.
Bit of an overkill - easier way to that like fixed effects (fixest package)
Multilevel models can do more.
(All data here are simulated!)
We are interested in the prevalence of sexual harassment in the population by region.
We have data from survey asking people whether they have experienced sexual harassment in last 6 month.
Problem: The survey only included 1 000 respondents.
Two sources of error:
Bias - systematic errors
Variance - random errors
Traditional (applied) statistics placed large focus on reducing bias.
Today, bias and variance can be exchanged (Bias-Variance trade off)
Regularization - technique that increases bias, but lowers variance.
LASSO, Ridge regression
Bayesian priors
Partial pooling
Hopefully, the decrease in variance is much higher than the increase of bias.
Simple estimates have no bias, but high variance.
Regularized estimates have high bias, but low variance
Bias can be reduced by including cluster (region) level predictors:
Unemployment rate
education level
(And of course by increasing sample size)
Partial pooling can be used to lower false positive rate.
By partialy pooling data, we are squishing unusually large estimates towards the grand mean, unless we have a lot of data supporting the large estimate.
We know that students with higher SES perform better at school.
But does this effect exists on individual level, group level or both?
Jakub Lisek 2023
Sometimes called Mundlak approach.
Pretty easy, just add cluster means and de-meaned individual values as predictors.
You can even add random slopes to account for each cluster (school) having different relationship between SES and test scores.
Works the same with categorical predictors.
Multilevel problems often have computational problems because of:
Low sample size
Small number of clusters
Large number of parameters
Parameter estimates to close to bounds
Often, trying different optimizers works
lme4 offers several options:
"nloptwrap" - fast, but unstable
"bobyqa" - slower, but more reliable
You can set optimizer like this:
If choosing different optimizer and respecifing model fail, there is the nuclear option.
Simulation based estimation.
Advantage - almost guaranteed to work
Disadvantage - much slower to compute
In R, there is brms package, using the same syntax:
Many rules of thumb (15, 30, 50) - most of them are wrong.
Read this (in order):
Stegmueller, D. (2013). How Many Countries for Multilevel Modeling? A Comparison of Frequentist and Bayesian Approaches. American Journal of Political Science, 57(3), 748–761. https://doi.org/10.1111/ajps.12001
Elff, M., Heisig, J., Schaeffer, M., & Shikano, S. (2016). No Need to Turn Bayesian in Multilevel Analysis with Few Clusters: How Frequentist Methods Provide Unbiased Estimates and Accurate Inference.
Theoretical minimum is 3 clusters
Practically, the minimum is 5-6
Bayesian/simulation based models work out of the box (just shove the data into brms)
Frequentist/maximum likelihood models (e.g. lme4) need some consideration:
Use Restricted maximum likelihood (REML) - lme4 does by default
Assume T distribution for random effect - honestly no idea…