Binomial and Beta
Test scores
Likert sum scores
Number of received votes in elections
We’ve already used binomial regression for modeling categorical data.
Every respondent had one “attempt” to succeed.
\[ logit(Y) \sim Binom(X\beta, 1) \]
\[ logit(Y) \sim Binom(X\beta, k) \]
Test questions
“Members of which of the following institutions are elected by the people directly?”
“Who is currently the President of the European Commission?”
“Which two countries are not EU members?”
“In which of the following areas does Czechia have the least freedom in setting its own policy as a result of its membership in the EU?”
Almost identical to “classical” logistic regression, but we need to tell R the number of trials, i.e. the maximum possible score.
Every package is a bit different: glm() requires us to specify the number of successes and the number of failures.
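A minimal sketch of what that looks like in base R (the data frame `d` and its columns `correct`, `n_items`, `age`, `edu`, and `eu_index` are hypothetical stand-ins for the survey data above):

```r
# Successes and failures are bound together on the left-hand side;
# `correct` is the number of correct answers out of `n_items` questions.
fit <- glm(
  cbind(correct, n_items - correct) ~ age + edu + eu_index,
  family = binomial(link = "logit"),
  data   = d
)
summary(fit)
```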
| Term | Coefficient |
|---|---|
| (Intercept) | 0.09 |
| Age | 0.01 |
| Edu: Without diploma | - |
| Edu: With diploma | 0.41 |
| Edu: University | 0.84 |
| EU index: Strongly pro-EU | - |
| EU index: Pro-EU | -0.84 |
| EU index: Eurosceptic | -1.15 |
| EU index: Strongly eurosceptic | -1.21 |
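The estimates above are on the log-odds scale; a quick way to make them tangible (assuming the “-” rows mark the reference categories) is to convert them to odds ratios or probabilities:

```r
exp(0.84)    # ~2.32: university graduates have roughly 2.3x the odds of a
             # correct answer compared to the reference education category
plogis(0.09) # ~0.52: implied probability of a correct answer for the reference
             # group when all other predictors are at zero
```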
Validity and reliability
Representativity
Linearity between logit of dependent variable and predictors
The probability of answering correctly is the same for each item, i.e. all items have the same difficulty.
We can treat a single Likert item as a “score”.
E.g. Attitudes towards EU
0 Strongly eurosceptic
1 Eurosceptic
2 Pro-EU
3 Strongly pro-EU
Unlike linear regression, this approach naturally accounts for the item’s bounds.
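A sketch of this idea in code, again with a hypothetical data frame `d`; the item value is treated as the number of “successes” out of 3 trials:

```r
# `eu_item` takes values 0-3, so the maximum possible score is 3.
fit_item <- glm(
  cbind(eu_item, 3 - eu_item) ~ age + edu,
  family = binomial,
  data   = d
)
```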
Votes in a district
Proportion of returning customers
Proportion of the day spent commuting
Likert mean scores (kinda)
Used for proportions and bounded continuous data
In (0, 1), endpoints excluded!
Two parameters, \(\alpha\) and \(\beta\)
mean = \(\frac{\alpha}{\alpha + \beta} = \frac{\text{successes}}{\text{successes} + \text{failures}}\)
variance = \(\frac{\alpha\cdot\beta}{(\alpha + \beta)^2 \cdot (\alpha + \beta + 1)}\)
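For example, with \(\alpha = 8\) and \(\beta = 2\) the distribution is centred at 0.8:

```r
a <- 8; b <- 2
a / (a + b)                          # mean = 0.8
(a * b) / ((a + b)^2 * (a + b + 1))  # variance ~ 0.0145
curve(dbeta(x, a, b))                # the density lives strictly inside (0, 1)
```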
\[ logit(Y) \sim Beta(X\beta) \]
Logit link! Non-collapsibility, we meet again…
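A minimal sketch of a beta regression, assuming a hypothetical data frame `d` with a proportion `prop` strictly inside (0, 1):

```r
library(glmmTMB)

fit_prop <- glmmTMB(prop ~ age + edu, family = beta_family(), data = d)
```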
Doesn’t work if there are 0s or 1s in our data. Solution?
Multiple ways to incorporate 0s and 1s into our model.
A neat one is ordered beta (other option is zero-one-inflated beta)
(Roughly speaking) We predict whether an observation is 0 or 1 using logistic regression. If not, predict the proportion using beta.
\[ \begin{align} logit(Y = 0) &\sim Binomial(X\beta) \\ logit(0 < Y < 1) &\sim Beta(X\beta) \\ logit(Y = 1) &\sim Binomial(X\beta) \end{align} \]
\[ \begin{align} logit(EU\;score) \sim OrdBeta(&Age + Knowledge + \\ &Education + Knowledge \cdot Education) \end{align} \]
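A sketch of that model using glmmTMB’s ordered-beta family (the data frame `d` and variable names are hypothetical; the EU score is assumed to be rescaled to [0, 1], where 0s and 1s are allowed):

```r
library(glmmTMB)

fit_eu <- glmmTMB(
  eu_score ~ age + knowledge * edu,  # knowledge * edu expands to the interaction above
  family = ordbeta(),
  data   = d
)
```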
Validity and reliability
Representative
Linearity
Conditional distribution is beta
By remembering a few distributions, you can make modeling much easier.
A properly specified generalized linear model respects the properties of the data and serves as a base for more complicated models:
Item response theory
Conjoint analysis
Propensity score matching
and more…
Don’t rely on regression coefficients alone; use marginal effects and predicted-probability plots.
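With the marginaleffects package this is a one-liner for any of the models sketched above (using the hypothetical `fit` from earlier):

```r
library(marginaleffects)

avg_slopes(fit)                          # average marginal effect of each predictor
plot_predictions(fit, condition = "age") # predicted outcome across the range of age
```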
Model fit can be checked using randomized residuals and posterior predictive plots.
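For example (`fit` is the hypothetical frequentist fit from earlier; `fit_brms` stands for any brms fit, like the one sketched later):

```r
# Randomized quantile residuals via DHARMa (works for glm and glmmTMB fits)
DHARMa::simulateResiduals(fit, plot = TRUE)

# Posterior predictive check for a brms fit
brms::pp_check(fit_brms)
```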
| Variable type | Distribution | Example | R package |
|---|---|---|---|
| Binary | Binomial | Voter turnout | base::glm() |
| Ordinal | Binomial | Subjective health | ordinal::clm() |
| Multinomial | Binomial | Party preference | nnet::multinom() |
| Counts | Neg. binomial | No. of absences | glmmTMB::glmmTMB() |
| Bounded counts | (Beta-)binomial | Test scores | base::glm(), glmmTMB::glmmTMB() |
| Proportions | (Ordered) beta | Vote shares | glmmTMB::glmmTMB() |
Modelling functions are spread across a number of packages.
Takes time to find the right package
Compatibility problems
Different syntax
Solution?
A few general packages (glmmTMB, VGAM).
By far the best (IMHO) is brms.
Bayesian Regression Models using Stan
(can be used for frequentist estimates)
Everything in one place
All the distributions (and more)
Diagnostic plots
Perfectly compatible with marginaleffects
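For instance, the binomial test-score model from earlier could be written in brms like this (same hypothetical data frame `d`):

```r
library(brms)

fit_brms <- brm(
  correct | trials(n_items) ~ age + edu + eu_index,  # number of trials is passed via trials()
  family = binomial(),
  data   = d
)
```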
So why not teach it?
brms uses simulations to estimate models (via the Stan language).
Advantages:
Extremely flexible
Easy to work with and post-process
More robust
Disadvantages:
It takes much longer to estimate models
Need to make sure the simulation went ok
When a maximum-likelihood-based model takes 0.5 seconds to estimate, brms takes 1-5 minutes.
When an ML-based model takes 3 minutes to estimate, brms takes 30-60 minutes.
Is it worth it?
From my experience, yes.
Waiting 5 minutes for a model > Spending 1 hour looking for packages and fixing compatibility issues.
Waiting 1 hour for a model > Spending 4 hours making sure a model converges.
Sometimes, you straight up don’t have a choice
So give brms a try!