Fitting count and zero-inflated count GLMMs with mgcv
A couple of days ago, Mollie Brooks and coauthors posted a preprint on BioRχiv illustrating the use of the glmmTMB R package for fitting zero-inflated GLMMs (Brooks et al., 2017). In the paper, glmmTMB is compared with several other GLMM-fitting packages. mgcv has recently gained the ability to fit a wider range of families beyond the exponential family of distributions, including zero-inflated Poisson models. mgcv can also fit simple GLMMs through a spline equivalent of a Gaussian random effect. So, whilst I was waiting on some Bayesian GAMs to finish sampling, I decided to see how mgcv compared against glmmTMB on the two examples used in the paper.
For this post I’ll be using a couple of packages beyond glmmTMB and mgcv; make sure you have ggplot2 and ggstance installed if you wish to run through the code below.
There are several ways in which mgcv allows GLMMs to be fitted, but the way that interests me here is via
gam() and the random effect spline basis. Penalised splines of the type provided in mgcv can also be represented in mixed model form, such that GAMs can also be fitted using mixed effect modelling software. The general idea is that the spline is decomposed into two parts:
- the perfectly smooth parts of the basis, namely those functions, including constant and linear functions, in the penalty null space of the spline. These are added to the fixed effects model matrix, whilst,
- the remaining wiggly parts of the basis are treated as random effects.
Given this duality between splines and random effects, you can reverse the idea and create a spline basis that is the equivalent of a simple Gaussian i.i.d random effect, such that you can fit a GLMM or GAMM using GAM software like mgcv. mgcv has the
re basis for this, and I’ll exploit that to fit the zero-inflated GLMMs to the two examples.
In Brooks et al. (2017), two example data sets are used;
Salamanders— Seven combinations of different salamander species and life-stages were repeatedly sampled four times at 23 sites in Applachian streams (Price et al., 2016). Some of the streams were impacted by mountaintop removal and valley filling from coal mining. The data are available from Price et al. (2015), as well as the glmmTMB package.
Owls— the second example is a well-studied one in mixed modelling papers and textbooks (Zuur et al., 2009, @Bolker2013–vl), and relates to the begging behaviour of owl nestlings. The data were originally reported in Roulin and Bersier (2007).
Brooks et al. (2017) fit several count models to the
Salamander data set, including standard Poisson GLMMs, negative binomial GLMMs, with \(\theta\) estimated and modelled via a linear predictor, as well as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models. Of these,
gam() can currently fit all but the negative binomial with \(\theta\) modelled via a linear predictor and the ZINB models.
The best fitting model of those presented was a negative binomial model, whilst Brooks et al. (2017) also illustrate how to generate fitted values from the ZIP. Rather than go through fitting all of the Brooks et al. (2017) models, I restrict fitting here to these two models. A gist with code to fit all the models that
gam() is capable of is available on Github. I have named the models similarly to Brooks et al. (2017) to facilitate comparison.
glmmTMB() is currently only capable of fitting models using maximum likelihood, not REML, I use the Laplace approximate maximum likelihood estimation method for
gam(). The new
nb family in mgcv is for the negative binomial distribution with the (fixed) dispersion parameter \(\theta\) estimated as a model parameter, in the same way that
lme4::glmer.nb() models do.
gam() model, the random effect is specified using the standard
s() smooth function with the
"re" basis selected. The named variable, here
site, should be stored as a factor in the data object to avoid problems.
The figure below compares the coefficient estimates returned by
gam(); they are very similar, which is encouraging.
The values (posterior modes, or means) for the
site random effect can also be compared
As the figure above shows, these too are essentially equivalent for the two fits.
summary() output for the
glmmTMB() model conveniently provides some additional useful information, in the context of GLMMs most notably the estimated variances (or standard deviations) of the random effect terms. As
gam() wasn’t designed with GLMMs specifically in mind, the same information is not provided in the the
summary() method for
gam() model fits. However, Simon Wood has provided the
gam.vcomp() function, which can be used to return the variance components of the model in a way that allows comparison with other mixed-models specific software.
gam() version, conveniently with a confidence interval
One further analysis that Brooks et al. (2017) do with the
Salamander data (in their Appendix B) is to demonstrate how to generate and plot fitted values from the model. To do this, the analyst needs to consider whether to and how to marginalise over or condition on the random effects. The Appendix has some details on this more generally (via a linked reference) and more-specific pointers on how to go about doing this with
glmmTMB() models. In the next few code chunks I will illustrate how to achieve the result from their section Alternative prediction method, where the aim is to predict at the population mode by setting the random effect component to 0. To illustrate this, Brooks et al. (2017) use the more complex ZIP model with linear predictors for both the mean and the zero-inflation components of the model. I fit those models first
glmmTMB() model has the zero-inflation linear predictor specified via the
ziformula argument (abbreviated to
zi above). With
gam() however, multiple linear predictors are specified via a list of formula objects, only the first of which has a response (left-hand-side). The first formula, with the response, is for the Poisson mean, whilst the second is for the zero-inflation component. Note also that we use the special
ziplss() family and that now the model is being estimated using REML, because that is the only option available for these models, which Simon Wood calls general smooth models (Wood et al., 2016). Do note that there is (as of writing) no
link argument(s) for the
ziplss() family. This is due to the way the model is parameterised internally in the software. This will require us to pay particular attention to the implementation shortly.
To recreate part of Figure B.3 in Appendix B (Brooks et al., 2017), the code below predicts from the fitted
gam() model for all combinations of the factors
spp. Notice how we have to specify a
site in the prediction data, otherwise
predict() will throw a tantrum. To set the random effect for
site to zero, use the
exclude argument. To exclude (i.e. set to zero) any model term, you supply a character vector or list of terms to
exclude. For smooth terms, these must be named as they appear in
summary(model), hence the use of
"s(site)". The final step is to call
type = "link". This will return a two column matrix (or a list of two-column matrices if
se.fit = TRUE is also used).
The first column is the predicted value of the response from the Poisson part of the model on the scale of the linear predictor (the log scale). The second column is the predicted value from the zero-inflation component and is on the complementary log-log scale. Both of these need to be back transformed to the respective response scales and then multiplied together. To do this for the zero-inflation part, I copied the code from the base R
binomial() family with the appropriate link specified. The second line of code below adds the predicted values for each combination of
spp to the prediction data object. Note that each component is back-transformed using the appropriate link, and then multiplied together.
A plot of the predicted values is then easily produced
Because of the way the
gam() model is implemented, I could also have computed the Bayesian credible intervals using the Bayesian covariance matrix of the model parameters via the
se.fit argument to
predict(). I’ll perhaps save that for another day…
Owls data are also available in the glmmTMB package, which I load and then do a little processing of the data to simplify the name of the response variable and to mean centre the
Two ZIP models are considered
- a ZIP with constant zero-inflation (an intercept-only model for the zero-inflation), and
- a ZIP with complex zero-inflation, where one covariate and a random effect for
Nestare included in the linear predictor of the zero-inflation part of the model.
The constant zero-inflation models are fitted using the
ziformula argument for
family = poisson, whilst for
gam() we use a list of two formula objects, the second for the ZI linear predictor, and the
ziplss family. Note that this model could also be fitted using the
Zip() family in mgcv but that employs a different, simpler fitting algorithm so to facilitate comparison with the more complex model I use
Again note that these models are not estimated in the same way;
glmmTMB() estimates the model parameters using maximum likelihood, whilst only REML estimation is available for the
ziplss() family with
gam(), the intercept-only ZI linear predictor is specified with the formula
To compare the estimates of the model coefficients I wrote a little function to extract the estimated values and their standard errors from the two model objects
Passing each of the models to
results in a tidy data frame suitable for plotting with
As can be seen in the figure, the estimates from the two functions are quite similar.
The more-complex models with covariates in the ZI linear predictor are fitted next
As before, we gather the model coefficients
and plot them
and likewise as before, the estimates of the fixed effect terms are very similar indeed.
The comparisons shown above show that
glmmTMB() produce very similar estimates for the two models. And some crude timings showed that
gam() was 20–40% faster than
glmmTMB() at fitting the examples discussed in the paper. So all is roses, right!? Who needs
That would however, be totally the wrong message to take from this comparison. Most notably, and something that isn’t surfaced in these simple examples is that
gam() is limited in the complexity of the random effects it can efficiently represent in models:
- it can’t do correlated random effects for random slopes and intercepts models (as far as I can tell anyway), and, and this is probably the deal breaker,
- model fitting with
gam()gets bogged down quickly if the number of levels in a random effect gets large. Jamie Ashander did some quick tests with a larger version of the Salamander with 100s of
And that’s fine;
gam() was not designed to fit GLMMs — there are no less than three implementations by Simon Wood alone of functions to fit GAMs with complex random effects in mixed model software (
gamm() to fit with
gamm4() to fit using
jagam() in mgcv to fit via JAGS). Furthermore,
glmmTMB() is currently more flexible in the range of models that it can fit than any these implementations, except for JAGS, because the
ziplss families only work with
What the above comparison illustrates, however, is that if you either don’t have complex or many random effects or that you don’t mind running models overnight,
gam() is a good option for fitting GLMMs. Plus you have the advantage of estimating smooth functions of covariates, which is one area where
glmmTMB() is currently very lacking compared to
That said, it should be possible to emulate what Paul-Christian Bürkner has done in his brms package (and similar implementations by Simon Wood in
gamm4()) to use mgcv to set up the correct model matrices for the random effect representation of splines which can then be fitted using
Finally, this was a fun exercise to replicate the analyses in Brooks et al. (2017), motivated by a desire to understand what mgcv and
gam() are doing with these random effect splines. It wasn’t intended as a prize-fight between two title contenders — hopefully this write-up didn’t come across that way. I also learned a lot more about glmmTMB, which is shaping up nicely and looks like it’ll have a place in my modelling toolbox.
Bolker, B. M., Gardner, B., Maunder, M., Berg, C. W., Brooks, M., Comita, L., et al. (2013). Strategies for fitting nonlinear ecological models in r, AD model builder, and BUGS. Methods in ecology and evolution / British Ecological Society 4, 501–512. doi:10.1111/2041-210X.12044.
Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling Zero-Inflated count data with glmmTMB. bioRxiv, 132753. doi:10.1101/132753.
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2015). Data from: Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. doi:10.5061/dryad.5m8f6.
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2016). Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. The Journal of applied ecology 53, 459–468. doi:10.1111/1365-2664.12585.
Roulin, A., and Bersier, L.-F. (2007). Nestling barn owls beg more intensely in the presence of their mother than in the presence of their father. Animal behaviour 74, 1099–1106.
Wood, S. N., Pya, N., and Säfken, B. (2016). Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association 111, 1548–1563. doi:10.1080/01621459.2016.1180986.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., and Smith, G. M. (2009). Mixed effects models and extensions in ecology with r: Springer New York doi:10.1007/978-0-387-87458-6.