Modeling proportions

Binomial and Beta

Aleš Vomáčka

Modeling Bounded Counts

Bounded counts are pretty common

  • Test scores

  • Likert sum scores

  • Number of received votes in elections

Binomial regression, our old friend

  • We’ve already used binomial regression for modeling categorical data.

  • Every respondent had one “attempt” to succeed.

\[ logit(Y) \sim Binom(X\beta, 1) \]

  • Now every respondent gets k attempts, so we account for the number of trials as well as the probability of success.

\[ logit(Y) \sim Binom(X\beta, k) \]
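
Spelled out, the shorthand on these slides combines a distribution for the outcome with a link for its success probability. One explicit way to write it, where \(p_i\) (respondent i's probability of success on a single trial) is a symbol introduced here for illustration:

\[ Y_i \sim Binomial(k, p_i), \qquad logit(p_i) = \log\frac{p_i}{1 - p_i} = X_i\beta \]

Setting \(k = 1\) gives back ordinary logistic regression.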

How well do you EU?

  • We administered a knowledge test on the EU. We want to model sum scores based on education, age and attitudes towards the EU.


Test questions

  • “Members of which of the following institutions are elected by the people directly?”

  • “Who is currently the President of the European Commission?”

  • “Which two countries are not EU members?”

  • “In which of the following areas does Czechia have the least freedom in setting its own policy as a result of its membership in the EU?”

Specifying binomial regression

  • Almost identical to “classical” logistic regression, but we need to tell R what the number of trials, i.e. the maximum possible score, is.

  • Every package does this differently; glm() requires the number of successes and the number of failures.
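
A minimal sketch of the glm() specification on simulated data. The variable names (score, age, edu), the maximum score k = 10 and all effect sizes are assumptions made up for illustration; they are not the lecture's dataset.

```r
# Simulated stand-in data; names and values are illustrative only
set.seed(42)
n <- 500
k <- 10                                   # maximum possible score (number of items)
d <- data.frame(
  age = round(runif(n, 18, 80)),
  edu = factor(sample(c("none", "diploma", "university"), n, replace = TRUE),
               levels = c("none", "diploma", "university"))
)
# Per-item success probability on the logit scale
eta <- -0.5 + 0.01 * d$age + 0.4 * (d$edu == "diploma") + 0.8 * (d$edu == "university")
d$score <- rbinom(n, size = k, prob = plogis(eta))

# glm() takes a two-column response: successes and failures
fit <- glm(cbind(score, k - score) ~ age + edu, family = binomial(), data = d)
round(coef(fit), 2)
```

The two-column response is what tells glm() that each respondent had k trials; with a plain 0/1 outcome the same call reduces to logistic regression.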

Term                              Coefficient
(Intercept)                       0.09
age                               0.01
Edu: Without diploma              (reference)
Edu: With diploma                 0.41
Edu: University                   0.84
EU index: Strongly pro-EU         (reference)
EU index: Pro-EU                  -0.84
EU index: Eurosceptic             -1.15
EU index: Strongly eurosceptic    -1.21

Marginal effects for binomial model

  • Interpreted, as usual, through marginal effects.
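
One way to see what a marginal effect means for this model is to compute an average one by hand in base R. The sketch below uses simulated data with a hypothetical binary predictor uni (1 = university degree); dedicated packages automate the same idea.

```r
# Simulated stand-in data; names and effect sizes are illustrative only
set.seed(1)
n <- 400
k <- 10
d <- data.frame(age = round(runif(n, 18, 80)),
                uni = rbinom(n, 1, 0.4))
d$score <- rbinom(n, k, plogis(-0.3 + 0.01 * d$age + 0.8 * d$uni))
fit <- glm(cbind(score, k - score) ~ age + uni, family = binomial(), data = d)

# Predicted probability of a correct answer if everyone had / did not have a degree
p1 <- predict(fit, newdata = transform(d, uni = 1), type = "response")
p0 <- predict(fit, newdata = transform(d, uni = 0), type = "response")

ame_prob  <- mean(p1 - p0)      # average marginal effect on the probability scale
ame_score <- k * ame_prob       # the same effect expressed in expected test points
c(probability = ame_prob, points = ame_score)
```

Multiplying by k turns the effect on the per-item probability into an effect on the expected sum score, which is usually the easier quantity to communicate.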

Assumptions

  • Validity and reliability

  • Representativeness

  • Linearity between the logit of the dependent variable and the predictors

  • The probability of answering correctly is the same for every item, i.e. all items are equally difficult.

    • If violated, difficulty can be modeled directly using beta-binomial regression.

Single Likert items

  • We can treat a single Likert item as a “score”.

  • E.g. Attitudes towards EU

    • 0 Strongly eurosceptic

    • 1 Eurosceptic

    • 2 Pro-EU

    • 3 Strongly pro-EU

  • Unlike linear regression, this approach naturally accounts for the item’s bounds.
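
A small simulated comparison of the two approaches for an item coded 0 to 3. The predictor x and all coefficients are assumptions for illustration; the point is only where the predictions can end up.

```r
set.seed(7)
n <- 300
x <- rnorm(n)                                    # any standardized predictor
# Simulated item coded 0-3 (0 = strongly eurosceptic, ..., 3 = strongly pro-EU)
item <- rbinom(n, size = 3, prob = plogis(0.5 + 1.2 * x))

fit_bin <- glm(cbind(item, 3 - item) ~ x, family = binomial())
fit_lm  <- lm(item ~ x)

new <- data.frame(x = c(-4, 0, 4))               # includes extreme predictor values
pred_bin <- 3 * predict(fit_bin, newdata = new, type = "response")  # expected item score
pred_lm  <- predict(fit_lm, newdata = new)
cbind(new, binomial = pred_bin, linear = pred_lm)
```

The binomial predictions stay between 0 and 3 by construction, while the linear model's predictions at extreme values of x are free to leave the scale.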

EU attitudes and binomial regression

Questions?

InteRmezzo!

Proportions

Good old proportions

  • Votes in a district

  • Proportion of returning customers

  • Proportion of the day spent commuting

  • Likert mean scores (kinda)

Beta distribution

  • Used for proportions and bounded continuous data

  • In (0, 1), non-inclusive!

  • Two parameters, \(\alpha\) and \(\beta\)

  • mean = \(\frac{\alpha}{\alpha + \beta} = \frac{success}{success + failure}\)

  • variance = \(\frac{\alpha\cdot\beta}{(\alpha + \beta)^2 \cdot (\alpha + \beta + 1)}\)
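
The two formulas can be checked against simulated draws. The shape values below (2 "successes", 6 "failures") are arbitrary illustration values.

```r
a <- 2   # alpha, illustrative
b <- 6   # beta, illustrative

beta_mean <- a / (a + b)
beta_var  <- (a * b) / ((a + b)^2 * (a + b + 1))

set.seed(123)
draws <- rbeta(1e5, shape1 = a, shape2 = b)

rbind(formula   = c(mean = beta_mean, variance = beta_var),
      simulated = c(mean = mean(draws), variance = var(draws)))
range(draws)   # all draws lie strictly inside (0, 1)
```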

Modeling mean scores

  • The beta distribution assumes values between 0 and 1, so we need to rescale the data.

\[ logit(Y) \sim Beta(X\beta) \]

  • Logit link! Non-collapsibility, we meet again…

  • Doesn’t work if there are 0s or 1s in our data. Solution?
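
A short sketch of the rescaling step and of what goes wrong at the boundaries. The scores are hypothetical Likert means on a 1 to 5 scale, and rescale01() is a helper defined here, not a package function.

```r
# Hypothetical Likert mean scores on a 1-5 scale
score <- c(1, 1.8, 2.5, 3.2, 4.6, 5)

# Helper defined here for illustration: map [lo, hi] onto [0, 1]
rescale01 <- function(x, lo, hi) (x - lo) / (hi - lo)
y <- rescale01(score, lo = 1, hi = 5)
y

qlogis(y)                      # logit is -Inf at 0 and Inf at 1
dbeta(y, 2, 3, log = TRUE)     # beta log-density is -Inf at both boundaries
```

Respondents at the minimum or maximum of the scale end up exactly at 0 or 1, where neither the logit nor the beta log-likelihood is finite.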

Ordered beta

  • Multiple ways to incorporate 0s and 1s into our model.

  • A neat one is ordered beta (another option is zero-one-inflated beta)

  • (Roughly speaking) We predict whether an observation is 0 or 1 using logistic regression. If not, predict the proportion using beta.

\[ \begin{aligned} logit(Y = 0) &\sim Binomial(X\beta) \\ logit(0 < Y < 1) &\sim Beta(X\beta) \\ logit(Y = 1) &\sim Binomial(X\beta) \end{aligned} \]
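
To make the three-part structure concrete, here is a sketch of how the probabilities of the three outcomes can be computed. It assumes the common ordered beta formulation in which a single linear predictor is shared by all three parts and two ordered cutpoints separate them; ordbeta_probs(), cut0 and cut1 are names invented here for illustration.

```r
# Sketch only: outcome probabilities under an ordered beta model,
# assuming one shared linear predictor (eta) and two ordered cutpoints
ordbeta_probs <- function(eta, cut0, cut1) {
  stopifnot(cut0 < cut1)
  p_zero <- 1 - plogis(eta - cut0)                    # P(Y = 0)
  p_mid  <- plogis(eta - cut0) - plogis(eta - cut1)   # P(0 < Y < 1)
  p_one  <- plogis(eta - cut1)                        # P(Y = 1)
  cbind(zero = p_zero, middle = p_mid, one = p_one)
}

eta <- c(-3, 0, 3)                  # illustrative values of the linear predictor
probs <- ordbeta_probs(eta, cut0 = -2, cut1 = 2)
round(probs, 3)
rowSums(probs)                      # each row sums to 1
plogis(eta)                         # mean of the beta part, used when 0 < Y < 1
```

Because the same linear predictor drives all three parts, a higher value moves probability away from 0, towards 1, and raises the mean of the continuous part at the same time.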

Modeling EU index

  • Are EU attitudes associated with knowing how the EU works, after controlling for age and education?

\[ \begin{aligned} logit(EU\;score) \sim OrdBeta(&Age + Knowledge + \\ &Education + Knowledge \cdot Education) \end{aligned} \]

  • Interpreted, as usual, through marginal effects and plots of predicted values.
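
A hedged sketch of how such a model might be specified with glmmTMB and its ordbeta() family. The data are simulated and every variable name (eu_score, age, knowledge, education) is an assumption for illustration; the fit only runs if glmmTMB is installed.

```r
# Simulated stand-in data; names and effect sizes are made up for illustration
set.seed(2024)
n <- 300
d <- data.frame(
  age       = round(runif(n, 18, 80)),
  knowledge = rbinom(n, 10, 0.5) / 10,                      # share of correct answers
  education = factor(sample(c("none", "diploma", "university"), n, replace = TRUE))
)
# Attitude index on [0, 1], rounded so that exact 0s and 1s occur
d$eu_score <- round(plogis(-0.5 + 1.5 * d$knowledge + rnorm(n, sd = 2)), 1)
table(boundary = d$eu_score %in% c(0, 1))

# knowledge * education expands to both main effects plus their interaction
if (requireNamespace("glmmTMB", quietly = TRUE)) {
  fit <- glmmTMB::glmmTMB(eu_score ~ age + knowledge * education,
                          family = glmmTMB::ordbeta(), data = d)
  print(glmmTMB::fixef(fit))
}
```

The simulated outcome is not generated from an ordered beta model, so the estimates themselves mean little; the sketch only shows the shape of the call.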

Modeling EU index

Assumptions

  • Validity and reliability

  • Representativeness

  • Linearity

  • Conditional distribution is beta

Questions?

InteRmezzo!

Wrapping up GLMs

GLMs wrap up

  • By remembering a few distributions, you can make modeling much easier.

  • A proper generalized linear model respects the properties of the data and serves as a base for more complicated models.

    • Item response theory

    • Conjoint analysis

    • Propensity score matching

    • and more…

  • Don’t rely on regression coefficients; use marginal effects and plots of predicted probabilities.

  • Model fit can be checked using randomized residuals and posterior predictive plots.
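
Both checks can be sketched in base R for a binomial model. The data are simulated, and the second check is a plug-in analogue of a posterior predictive check (it simulates from the fitted probabilities rather than from a posterior).

```r
set.seed(99)
n <- 500
k <- 10
x <- rnorm(n)
y <- rbinom(n, k, plogis(0.2 + 0.5 * x))
fit <- glm(cbind(y, k - y) ~ x, family = binomial())

# Randomized quantile residuals: draw uniformly between the fitted CDF just below
# and at the observed count, then map the draw to the normal scale
p_hat <- fitted(fit)
lower <- pbinom(y - 1, size = k, prob = p_hat)
upper <- pbinom(y,     size = k, prob = p_hat)
rq_resid <- qnorm(runif(n, min = lower, max = upper))
c(mean = mean(rq_resid), sd = sd(rq_resid))   # roughly 0 and 1 if the model fits

# Predictive check: compare observed scores with scores simulated from the fit
y_rep <- rbinom(n, size = k, prob = p_hat)
rbind(observed  = table(factor(y,     levels = 0:k)),
      simulated = table(factor(y_rep, levels = 0:k)))
```

If the model is adequate, the residuals should look like draws from a standard normal distribution and the two rows of the table should look alike.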

GLMs in a table

Variable type    Distribution       Example              R package
Binary           Binomial           Voter turnout        stats::glm()
Ordinal          Binomial           Subjective health    ordinal::clm()
Multinomial      Binomial           Party preference     nnet::multinom()
Counts           Neg. binomial      No. of absences      glmmTMB::glmmTMB()
Bounded counts   (Beta-)binomial    Test scores          stats::glm(), glmmTMB::glmmTMB()
Proportions      (Ordered) beta     Vote shares          glmmTMB::glmmTMB()

GLMs and the R ecosystem

  • Modeling functions are spread across a number of packages.

    • Takes time to find the right package

    • Compatibility problems

    • Different syntax

  • Solution?

The Beauty of the BRMS package

  • There are a few general packages (glmmTMB, VGAM).

  • By far the best (IMHO) is brms.

  • Bayesian Regression ModelS

  • (can be used for frequentist estimates)

  • Everything in one place

    • All the distributions (and more)

    • Diagnostic plots

    • Perfectly compatible with marginaleffects

  • So why not teach it?

BRMS trade-offs

  • brms estimates models by simulation (MCMC sampling, run through the Stan language).

  • Advantage:

    • Extremely flexible

    • Easy to work with and post-process

    • More robust

  • Disadvantage:

    • It takes much longer to estimate models

    • Need to make sure the simulation went ok
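
For orientation, a sketch of what the binomial test-score model could look like in brms. It is wrapped in a function so that nothing is compiled or sampled until you call it; the function name and the data columns (score, k, age, edu) are hypothetical.

```r
# Sketch only; calling this requires brms and a working Stan toolchain,
# and sampling can take minutes. Assumes `d` has columns score, k, age, edu.
fit_brms_binomial <- function(d) {
  fit <- brms::brm(
    score | trials(k) ~ age + edu,   # brms syntax for "score out of k trials"
    family = binomial(),
    data   = d,
    chains = 4,
    seed   = 1
  )
  print(summary(fit))      # check Rhat (close to 1) and effective sample sizes
  print(brms::pp_check(fit))   # posterior predictive plot
  fit
}
```

The summary is where you check whether the simulation went OK: Rhat values near 1 and reasonably large effective sample sizes suggest the chains agree.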

BRMS trade-offs

  • When a maximum-likelihood-based model takes 0.5 seconds to estimate, brms takes 1-5 minutes.

  • When an ML-based model takes 3 minutes to estimate, brms takes 30-60 minutes.

  • Is it worth it?

Time management, modeling style

  • In my experience, yes.

  • Waiting 5 minutes for a model > Spending 1 hour looking for packages and fixing compatibility issues.

  • Waiting 1 hour for a model > Spending 4 hours making sure a model converges.

  • Sometimes, you straight up don’t have a choice

  • So give brms a try!

Questions?

Next time: Causal inference