Binomial and Beta
Test scores
Likert sum scores
Number of received votes in elections
We’ve already used binomial regression for modeling categorical data.
Every respondent had one “attempt” to succeed.
\[ logit(Y) \sim Binom(X\beta, 1) \]
\[ logit(Y) \sim Binom(X\beta, k) \]
Test questions
“Members of which of the following institutions are elected by the people directly?”
“Who is currently the President of the European Commission?”
“Which two countries are not EU members?”
“In which of the following areas does Czechia have the least freedom in setting its own policy as a result of its membership in the EU?”
Almost identical to “classical” logistic regression, but we need to tell R the number of trials, i.e. the maximum possible score.
Every package is a bit different: glm() requires us to specify the number of successes and the number of failures.
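A minimal sketch of what that looks like in base R (the data frame `d` and its columns `correct`, `n_items`, `age`, `edu`, and `eu_index` are hypothetical stand-ins for the survey data above):

```r
# Successes and failures are bound together on the left-hand side;
# `correct` is the number of correct answers out of `n_items` questions.
fit <- glm(
  cbind(correct, n_items - correct) ~ age + edu + eu_index,
  family = binomial(link = "logit"),
  data   = d
)
summary(fit)
```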
| Term | Coefficient |
|---|---|
| (Intercept) | 0.09 |
| Age | 0.01 |
| Edu: Without diploma | - |
| Edu: With diploma | 0.41 |
| Edu: University | 0.84 |
| EU index: Strongly pro-EU | - |
| EU index: Pro-EU | -0.84 |
| EU index: Eurosceptic | -1.15 |
| EU index: Strongly eurosceptic | -1.21 |
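The estimates above are on the log-odds scale; a quick way to make them tangible (assuming the “-” rows mark the reference categories) is to convert them to odds ratios or probabilities:

```r
exp(0.84)    # ~2.32: university graduates have roughly 2.3x the odds of a
             # correct answer compared to the reference education category
plogis(0.09) # ~0.52: implied probability of a correct answer for the reference
             # group when all other predictors are at zero
```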
Validity and reliability
Representativity
Linearity between logit of dependent variable and predictors
The probability of answering correctly is the same for each item, i.e. all items have the same difficulty.
We can treat a single Likert item as a “score”.
E.g. Attitudes towards EU
0 Strongly eurosceptic
1 Eurosceptic
2 Pro-EU
3 Strongly pro-EU
Unlike linear regression, this approach naturally accounts for the item’s bounds.
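A sketch of this idea in code, again with a hypothetical data frame `d`; the item value is treated as the number of “successes” out of 3 trials:

```r
# `eu_item` takes values 0-3, so the maximum possible score is 3.
fit_item <- glm(
  cbind(eu_item, 3 - eu_item) ~ age + edu,
  family = binomial,
  data   = d
)
```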
Votes in a district
Proportion of returning customers
Proportion of the day spent commuting
Likert mean scores (kinda)
Used for proportions and bounded continuous data
In (0, 1), endpoints excluded!
Two parameters, \(\alpha\) and \(\beta\)
mean = \(\frac{\alpha}{\alpha + \beta} = \frac{\text{successes}}{\text{successes} + \text{failures}}\)
variance = \(\frac{\alpha\cdot\beta}{(\alpha + \beta)^2 \cdot (\alpha + \beta + 1)}\)
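For example, with \(\alpha = 8\) and \(\beta = 2\) the distribution is centred at 0.8:

```r
a <- 8; b <- 2
a / (a + b)                          # mean = 0.8
(a * b) / ((a + b)^2 * (a + b + 1))  # variance ~ 0.0145
curve(dbeta(x, a, b))                # the density lives strictly inside (0, 1)
```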
\[ logit(Y) \sim Beta(X\beta) \]
Logit link! Non-collapsibility, we meet again…
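A minimal sketch of a beta regression, assuming a hypothetical data frame `d` with a proportion `prop` strictly inside (0, 1):

```r
library(glmmTMB)

fit_prop <- glmmTMB(prop ~ age + edu, family = beta_family(), data = d)
```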
Doesn’t work if there are 0s or 1s in our data. Solution?
Multiple ways to incorporate 0s and 1s into our model.
A neat one is ordered beta (other option is zero-one-inflated beta)
(Roughly speaking) We predict whether an observation is 0 or 1 using logistic regression. If not, predict the proportion using beta.
\[ \begin{align} logit(Y = 0) &\sim Binomial(X\beta) \\ logit(0 < Y < 1) &\sim Beta(X\beta) \\ logit(Y = 1) &\sim Binomial(X\beta) \end{align} \]
\[ \begin{align} logit(EU\;score) \sim OrdBeta(&Age + Knowledge + \\ &Education + Knowledge \cdot Education) \end{align} \]
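A sketch of that model using glmmTMB’s ordered-beta family (the data frame `d` and variable names are hypothetical; the EU score is assumed to be rescaled to [0, 1], where 0s and 1s are allowed):

```r
library(glmmTMB)

fit_eu <- glmmTMB(
  eu_score ~ age + knowledge * edu,  # knowledge * edu expands to the interaction above
  family = ordbeta(),
  data   = d
)
```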
Validity and reliability
Representative
Linearity
Conditional distribution is beta
By remembering a few distributions, you can make modeling much easier.
A properly specified generalized linear model respects the properties of the data and serves as a base for more complicated models:
Item response theory
Conjoint analysis
Propensity score matching
and more…
Don’t rely on regression coefficients alone; use marginal effects and predicted-probability plots.
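With the marginaleffects package this is a one-liner for any of the models sketched above (using the hypothetical `fit` from earlier):

```r
library(marginaleffects)

avg_slopes(fit)                          # average marginal effect of each predictor
plot_predictions(fit, condition = "age") # predicted outcome across the range of age
```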
Model fit can be checked using randomized residuals and posterior predictive plots.
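For example (`fit` is the hypothetical frequentist fit from earlier; `fit_brms` stands for any brms fit, like the one sketched later):

```r
# Randomized quantile residuals via DHARMa (works for glm and glmmTMB fits)
DHARMa::simulateResiduals(fit, plot = TRUE)

# Posterior predictive check for a brms fit
brms::pp_check(fit_brms)
```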
| Variable type | Distribution | Example | R package |
|---|---|---|---|
| Binary | Binomial | Voter turnout | base::glm() |
| Ordinal | Binomial | Subjective health | ordinal::clm() |
| Multinomial | Binomial | Party preference | nnet::multinom() |
| Counts | Neg. binomial | No. of absences | glmmTMB::glmmTMB() |
| Bounded counts | (Beta-)binomial | Test scores | base::glm(), glmmTMB::glmmTMB() |
| Proportions | (Ordered) beta | Vote shares | glmmTMB::glmmTMB() |
Modelling functions are spread across a number of packages.
Takes time to find the right package
Compatibility problems
Different syntax
Solution?
A few general packages (glmmTMB, VGAM).
By far the best (IMHO) is brms.
Bayesian Regression Models using Stan
(can be used for frequentist estimates)
Everything in one place
All the distributions (and more)
Diagnostic plots
Perfectly compatible with marginaleffects
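For instance, the binomial test-score model from earlier could be written in brms like this (same hypothetical data frame `d`):

```r
library(brms)

fit_brms <- brm(
  correct | trials(n_items) ~ age + edu + eu_index,  # number of trials is passed via trials()
  family = binomial(),
  data   = d
)
```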
So why not teach it?
brms uses simulations to estimate models (via the Stan language).
Advantages:
Extremely flexible
Easy to work with and post-process
More robust
Disadvantages:
It takes much longer to estimate models
Need to make sure the simulation went ok
When a maximum-likelihood-based model takes 0.5 seconds to estimate, brms takes 1-5 minutes.
When an ML-based model takes 3 minutes to estimate, brms takes 30-60 minutes.
Is it worth it?
From my experience, yes.
Waiting 5 minutes for a model > Spending 1 hour looking for packages and fixing compatibility issues.
Waiting 1 hour for a model > Spending 4 hours making sure a model converges.
Sometimes, you straight up don’t have a choice
So give brms a try!