Pangaea and R and open palaeo data (also GAM all the things!)
16 December 2016 /posted in: R
For a while now, I’ve been wanting to experiment with rOpenSci’s pangaear package (Chamberlain et al., 2016), which allows you to search, and download data from, Pangaea, a major data repository for the earth and environmental sciences. Earlier in the year, as a member of the editorial board of Scientific Data, Springer Nature’s open data journal, I was handling a data descriptor submission that described a new 2,200-year foraminiferal δ18O record from the Gulf of Taranto in the Ionian Sea (Taricco et al., 2016). The data descriptor was recently published, and as part of the submission Carla Taricco deposited the data set in Pangaea. So, what better opportunity to test out pangaear? (Oh, and to fit a GAM to the data while I’m at it!)
Chamberlain, S., Woo, K., MacDonald, A., Zimmerman, N., and Simpson, G. (2016). Pangaear: Client for the ’pangaea’ database. Available at: https://CRAN.R-project.org/package=pangaear.
Taricco, C., Alessio, S., Rubinetti, S., Vivaldo, G., and Mancuso, S. (2016). A foraminiferal δ18O record covering the last 2,200 years. Scientific Data 3, 160042. doi:10.1038/sdata.2016.42.
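The workflow the post describes can be sketched in a few lines. The search terms below are illustrative, and the column and list structure of the results assume pangaear’s current API; check the hits before downloading.

```r
## A minimal sketch of searching and downloading from Pangaea with pangaear;
## the query string is an assumption, not the exact one used in the post.
library("pangaear")

## search Pangaea for the Taricco et al. record
res <- pg_search(query = "Taricco Taranto foraminiferal")
res  # a data frame of hits, including each data set's DOI

## download the first hit; pg_data() caches the file locally
rec <- pg_data(doi = res$doi[1])
d18o <- rec[[1]]$data  # the measurements as a data frame
```

From there, fitting a GAM to the downloaded series is a one-liner with mgcv, e.g. `gam(d18O ~ s(Age), data = d18o, method = "REML")`, where the column names are hypothetical and will depend on the deposited file.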
Simultaneous intervals for smooths revisited: correcting a silly mistake
15 December 2016 /posted in: R
Eighteen months ago I wrote a post in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalised spline. It was a nice post that attracted some interest. It was also wrong. I have no idea what I was thinking when I thought the intervals described in that post were simultaneous. Here I hope to rectify that past mistake.
I’ll tackle the issue of simultaneous intervals for the derivatives of a penalised spline in a follow-up post. Here, I demonstrate one way to compute a simultaneous interval for a penalised spline in a fitted GAM. As example data, I’ll use the strontium isotope data set included in the SemiPar package, which is extensively analyzed in the monograph Semiparametric Regression (Ruppert et al., 2003). First, load the packages we’ll need as well as the data, which is the data set fossil. If you don’t have SemiPar installed, install it using install.packages("SemiPar") before proceeding.
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge University Press.
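The simulation-based approach the post develops can be sketched as follows, assuming the fossil data from SemiPar and a thin-plate spline fit in mgcv; the critical value is the 95th percentile of the maximum absolute standardised deviation across simulated fits.

```r
## A sketch of a simulation-based simultaneous interval for a penalised spline.
library("mgcv")
library("SemiPar")
data(fossil)

m <- gam(strontium.ratio ~ s(age), data = fossil, method = "REML")

newd <- data.frame(age = seq(min(fossil$age), max(fossil$age), length = 200))
pred <- predict(m, newd, se.fit = TRUE)

## simulate coefficient deviations from the Bayesian posterior
Vb <- vcov(m)                          # covariance of the coefficients
Cg <- predict(m, newd, type = "lpmatrix")
set.seed(42)
sims <- MASS::mvrnorm(10000, mu = rep(0, nrow(Vb)), Sigma = Vb)
dev  <- Cg %*% t(sims)                 # deviations of simulated fits from the mean

## critical value: 95th percentile of the max absolute standardised deviation
crit <- quantile(apply(abs(dev) / pred$se.fit, 2, max), prob = 0.95)

upper <- pred$fit + crit * pred$se.fit
lower <- pred$fit - crit * pred$se.fit
```

Note that crit will exceed the ~1.96 of a pointwise interval, which is exactly what makes the band simultaneous rather than pointwise.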
ISEC 2016 Talk
02 July 2016 /posted in: Science
My ISEC 2016 talk, Estimating temporal change in mean and variance of community composition via location, scale additive models, describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores.
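The kind of model the talk describes can be sketched in mgcv with the gaulss family, which gives one linear predictor for the mean and a second for the standard deviation (on a log-type scale); the data below are simulated stand-ins for a sediment-core series.

```r
## A hedged sketch of a location-scale additive model via mgcv's gaulss family.
library("mgcv")

set.seed(1)
time <- seq(0, 100, length = 300)
y <- sin(time / 10) + rnorm(300, sd = 0.2 + 0.3 * time / 100)  # variance grows

m <- gam(list(y ~ s(time),   # smooth trend in the mean
              ~ s(time)),    # smooth trend in the standard deviation
         family = gaulss())
summary(m)
```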
Rootograms: a new way to assess count models
07 June 2016 /posted in: R
Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on arXiv, I came across Kleiber and Zeileis (2016), who propose the rootogram as an improved approach to the assessment of fit of a count regression model. The paper is illustrated using R and the authors’ countreg package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms.
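The basic idea can be sketched as below, assuming countreg is installed from R-Forge (`install.packages("countreg", repos = "http://R-Forge.R-project.org")`); the simulated abundances are overdispersed, so a Poisson fit should show the characteristic misfit pattern.

```r
## A minimal sketch of a rootogram for a deliberately underdispersed model.
library("countreg")

## simulated, overdispersed species abundances
set.seed(2)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + 2 * x), size = 1)

m_pois <- glm(y ~ x, family = poisson)  # ignores the overdispersion
rootogram(m_pois)  # hanging rootogram; bars dipping below zero flag misfit
```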
Harvesting more Canadian climate data
24 May 2016 /posted in: R
A while back I wrote some code to download climate data from the Government of Canada’s historical climate/weather data website for one of our students. In May this year (2016) the Government of Canada changed their website a little; the API that responded to requests had a new URL, and some of the GET parameters had also changed. In fixing those functions I also noted that the original code only downloaded hourly data, yet not all useful weather variables are recorded hourly; precipitation, for example, is only available in the daily and monthly data formats. This post updates the earlier one, explaining what changed and how the code has been updated. As an added benefit, the functions can now download daily and monthly data files as well as the hourly files that the original could handle.
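The shape of the request the updated functions make can be sketched with httr. The URL and GET parameter names here follow Environment Canada’s bulk-data endpoint as I understand it (timeframe = 1 for hourly, 2 for daily, 3 for monthly), and the station ID is purely illustrative; treat all of these as assumptions to verify against the site.

```r
## A hedged sketch of a bulk-data request to the historical climate site.
library("httr")

base <- "https://climate.weather.gc.ca/climate_data/bulk_data_e.html"
resp <- GET(base, query = list(format = "csv",
                               stationID = 51442,   # illustrative station
                               Year = 2016, Month = 5, Day = 1,
                               timeframe = 2))      # daily data
csv <- content(resp, as = "text")
```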
A new default plot for multivariate dispersions: tribulations of base graphics programming
17 April 2016 /posted in: R
This weekend, prompted by a pull request from Michael Friendly, I finally got round to improving the plot method for betadisper() in the vegan package.
betadisper() is an implementation of Marti Anderson’s Permdisp method, a multivariate analogue of Levene’s test for homogeneity of variances. In improving the default plot and allowing customisation of plot features, I was reminded of how much I dislike programming plot functions that use base graphics. But don’t worry, this isn’t going to degenerate into a ggplot love-in nor a David Robinson-esque dig at Jeff Leek.
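For context, the workflow whose plot method the post improves can be run in a few lines with one of vegan’s built-in data sets; the grouping factor below is made up for illustration.

```r
## A short sketch of the betadisper() workflow, with an illustrative grouping.
library("vegan")
data(varespec)

d <- vegdist(varespec)                     # Bray-Curtis dissimilarities
groups <- factor(rep(c("grazed", "ungrazed"), each = 12))
mod <- betadisper(d, groups)               # multivariate dispersions

plot(mod)        # the default plot discussed in the post
anova(mod)       # Levene-like test of homogeneity of dispersions
```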
10 April 2016 /posted in: Science
It’s fair to say I have gotten a bee1 in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it when it comes time to analyse our data. In many instances, “poor job” means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps due to its promotion in influential books, papers, and software. I am far from innocent in this regard, having taught LOESS and its use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to consign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
an entire hive is perhaps more apt!↩
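To make the contrast concrete, here is a hedged sketch comparing a LOESS fit, whose span is fixed at an arbitrary default, with a GAM, whose smoothness is estimated from the data; the series is simulated to stand in for an irregularly sampled core record.

```r
## LOESS (fixed default span) versus a GAM with REML-selected smoothness.
library("mgcv")

set.seed(3)
age <- sort(runif(150, 0, 10000))          # irregular sample ages
y   <- sin(age / 1000) + rnorm(150, sd = 0.3)

lo <- loess(y ~ age)                            # span fixed at the 0.75 default
m  <- gam(y ~ s(age, k = 30), method = "REML")  # smoothness chosen by REML

plot(age, y, pch = 16, cex = 0.5)
lines(age, fitted(lo), col = "red")
lines(age, fitted(m),  col = "blue")
```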
Soap-film smoothers & lake bathymetries
27 March 2016 /posted in: R
A number of years ago, whilst I was still working at ENSIS, the consultancy arm of the ECRC at UCL, I worked on a project for the (then) Countryside Council for Wales (CCW; now part of Natural Resources Wales). I don’t recall why they were doing this project, but we were tasked with producing a standardised set of bathymetric maps for Welsh lakes. The brief called for the bathymetries to be provided in standard GIS formats. Either CCW’s project manager or the project lead at ENSIS had proposed to use inverse distance weighting (IDW) to smooth the point bathymetric measurements. This probably stemmed from the person who initiated our bathymetric programme at ENSIS being a GIS wizard, schooled in the ways of ArcGIS. My involvement was mainly data processing of the IDW results. I was, however, at the time also somewhat familiar with the problem of finite area smoothing1 and had read a paper of Simon Wood’s on his then-new soap-film smoother (Wood et al., 2008). So, as well as writing scripts to process and present the IDW-based bathymetry data in the report, I snuck a task into the work programme that allowed me to investigate using soap-film smoothers for modelling lake bathymetric data. The timing was never great to write up this method (two children and a move to Canada have occurred since the end of this project), so I’ve not done anything with the idea. Until now…
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. Journal of the Royal Statistical Society. Series B, Statistical methodology 70, 931–955. doi:10.1111/j.1467-9868.2008.00665.x.
smoothing over a domain with known boundaries, like a lake↩
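A soap-film smoother of the kind Wood et al. (2008) describe can be sketched in mgcv via `bs = "so"`; the circular “lake” boundary, soundings, and knot grid below are all simulated for illustration, standing in for a real shoreline and depth survey.

```r
## A minimal soap-film smoother sketch: depth over a bounded "lake" domain.
library("mgcv")

## closed boundary loop (first point equals last) tracing the shoreline
theta <- seq(0, 2 * pi, length = 100)
bnd <- list(list(x = cos(theta), y = sin(theta)))

## simulated depth soundings strictly inside the boundary
set.seed(4)
pts <- data.frame(x = runif(500, -1, 1), y = runif(500, -1, 1))
pts <- pts[with(pts, x^2 + y^2 < 0.95), ]
pts$depth <- with(pts, 10 * (1 - x^2 - y^2)) + rnorm(nrow(pts))

## interior knots for the soap film, kept well inside the boundary
knots <- expand.grid(x = seq(-0.7, 0.7, by = 0.35),
                     y = seq(-0.7, 0.7, by = 0.35))
knots <- knots[with(knots, x^2 + y^2 < 0.6), ]

m <- gam(depth ~ s(x, y, bs = "so", xt = list(bnd = bnd)),
         data = pts, knots = knots)
```

The key difference from IDW or a thin-plate spline is that the smooth cannot “leak” across the boundary, so depths in one arm of a lake do not influence estimates in another.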
Additive modelling global temperature time series: revisited
25 March 2016 /posted in: R
Quite some time ago, back in 2011, I wrote a post that used an additive model to fit a smooth trend to the then-current Hadley Centre/CRU global temperature time series data set. Since then the media and scientific papers have been full of reports of record warm temperatures in the past couple of years, of (imagined) controversies regarding data changes to suit the hypothesis of human-induced global warming, and the brouhaha over whether global warming had stalled; the great global warming hiatus or pause. So it seemed like a good time to revisit that analysis and update it using the latest HadCRUT data.
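The heart of that analysis can be sketched as below. Because the HadCRUT file URL and column layout change between versions, a small simulated anomaly series stands in for the real data here; only the model call is the point.

```r
## A sketch of the additive-model trend fit; the anomaly series is simulated.
library("mgcv")

set.seed(5)
gtemp <- data.frame(Year = 1850:2015)
gtemp$Anomaly <- 0.0001 * (gtemp$Year - 1850)^2 / 3 + rnorm(166, sd = 0.1)

m <- gam(Anomaly ~ s(Year, k = 20), data = gtemp, method = "REML")
plot(m, shade = TRUE, residuals = TRUE, pch = 16, cex = 0.5)
```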
Better use of transfer functions?
16 December 2015 /posted in: Science
Transfer functions have had a bit of a hard time of late, following Steve Juggins’ (2013) convincing demonstration that 1) secondary gradients can influence your model, and 2) variation down-core in a secondary variable can induce a signal in the thing being reconstructed. This was followed up by further comment on diatom-TP reconstructions (Juggins et al., 2013), and, not to be left out, chironomid transfer functions have come in for some heat, if the last IPS meeting I attended was any indication. In a session at the 2015 Fall Meeting of the AGU, my interest was piqued by Yarrow Axford’s talk using chironomid temperature reconstructions, but not for the reasons you might be thinking.
Juggins, S. (2013). Quantitative reconstructions in palaeolimnology: New paradigm or sick science? Quaternary Science Reviews 64, 20–32. doi:10.1016/j.quascirev.2012.12.014.
Juggins, S., John Anderson, N., Ramstack Hobbs, J. M., and Heathcote, A. J. (2013). Reconstructing epilimnetic total phosphorus using diatoms: Statistical and ecological constraints. Journal of Paleolimnology 49, 373–390. doi:10.1007/s10933-013-9678-x.