Samplers, samplers, everywhere...
This notebook aims to provide basic examples of how to run a variety of MCMC and nested sampling codes in Python. It won't go into detail about MCMC methods in general (a bit of prior knowledge of them is assumed), nor will it discuss all the various bells and whistles that each sampler can use, but it will hopefully help people get started with writing code that uses these samplers via an understandable (albeit "toy") example. A very nice, quite similar, article on this topic by Jake VanderPlas can be found here. There are also very good online notes from the Computational Statistics and Statistical Computing course at Duke University by Cliburn Chan & Janice McCarthy, covering some of this material (including PyMC3 and PyStan examples that are very similar to what is covered here). Section 2 of Brewer & Foreman-Mackey (2016) gives a nice summary of the advantages and disadvantages of some of the algorithms shown here. A good general description of fitting a model to data in a Bayesian formalism, including various useful examples, can be found in Hogg, Bovy & Lang (2010).
The code within this notebook is designed to work with Python 3. An older version that should be compatible with Python 2 can be found here.
You can jump straight to the examples for the different samplers with the links below:
An example using the bilby package, which provides a unified interface to a variety of different samplers, is also included.
Background¶
MCMC methods are most often used in the context of Bayesian parameter estimation, e.g., you have a model, $y$, defined by a set of parameters, ${\vec{\theta}}$, with certain prior probability distributions $p(\vec{\theta}|I)$, and you want to efficiently sample from the marginalised posterior probability distributions on those parameters given some data, $\mathbf{d}$, and a particular likelihood function, $p(\mathbf{d}|\vec{\theta},I)$, ($I$ is just a substitute for all the implicit assumptions that have gone into defining our model). From Bayes theorem we have the joint posterior on the whole set of parameters,
$$ p(\vec{\theta}|\mathbf{d},I) = \frac{p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I)}{p(\mathbf{d}|I)}, $$with marginalised posteriors on an individual parameter, say $\theta_1$, given by nested integrals over all other parameters,
$$ p(\theta_1|\mathbf{d},I) = \frac{1}{p(\mathbf{d}|I)}\int^{\forall \theta_i \in (\vec{\theta}\neq \theta_1)} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$($\forall \theta_i \in (\vec{\theta}\neq \theta_1)$ can be translated into "for all parameters $\theta_i$ in $\vec{\theta}$ that are not equal to $\theta_1$"). $p(\mathbf{d}|I)$ is the marginal likelihood, or evidence, for the data given the particular model used, and is given by
$$ p(\mathbf{d}|I) = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$which is a normalising constant for the posterior. It will be discussed more below in relation to nested sampling.
This is the only context I (and probably many people in the physical sciences) have encountered using MCMC methods, and is the context I will use below.
Toy model¶
The very basic example that I will use to demonstrate coding up the samplers is that of sampling from the posterior of two parameters, $m$ and $c$, defining the model:
$$ y(\mathbf{x};m,c) = m\mathbf{x} + c, $$where $\mathbf{x}$ is a vector of values of $x$. This is basically fitting the gradient and $y$-intercept of a straight line, which should be fairly familiar as linear regression, and is obviously a solved problem (assuming uniform priors!) using, e.g., least squares fitting. But, this provides a simple example that can hopefully be extended to more realistic scenarios.
Our aim is to produce samples drawn from the posterior $p(m, c|\mathbf{d},I)$, which can be used (by, e.g., histogramming the sample values) to approximate the marginal posteriors $p(m|\mathbf{d},I)$ and $p(c|\mathbf{d},I)$.
The vector of data samples, $\mathbf{d}$, will consist of the model defined at a set of $M$ values, $\mathbf{x}$, with additive Gaussian noise of known standard deviation, i.e., data point $i$ will be defined by
$$ d_i = y(x_i;m,c) + n_i, $$with
$$ n_i \sim N(0, \sigma_i), $$which means the noise is drawn from a Gaussian (or Normal) distribution of zero mean and standard deviation of $\sigma_i$.
Note: I'll generally use the term Gaussian distribution rather than "Normal" distribution, as this is what I first heard it called. But, in many of the packages discussed below "Normal" is the more common name for the distribution.
Setting the likelihood¶
A sensible and fairly standard likelihood function (due to the Central Limit Theorem and maximum entropy arguments) for a single data point given values of $m$ and $c$ and a known noise standard deviation, will be given by a Gaussian distribution of the form
$$ p(d_i|m,c,I) = \left(2\pi\sigma_i^2\right)^{-1/2} \exp{\left(-\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2} \right)}, $$and the joint likelihood for the whole dataset of $M$ points (assuming independent noise) will be the product of the individual likelihoods
$$ p(\mathbf{d}|m,c,I) = \prod_{i=1}^M p(d_i|m,c,I) = \left(2\pi\right)^{-M/2}\left(\prod_{i=1}^M \sigma_i^{-1} \right) \exp{\left(-\sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2} \right)}. $$When computing likelihoods, numerical considerations mean that one almost always works with the natural logarithm of the likelihood, so we have
$$ \ln{p(\mathbf{d}|m,c,I)} \equiv \log{L} = -\frac{M}{2}\ln{(2\pi)} - \ln\left(\prod_{i=1}^M \sigma_i\right) - \sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2}. $$In many cases we can ignore the terms that do not include our required parameters when sampling from their marginalised posteriors (as they will just be constants), so often the log-likelihood can be set to
$$ \log{L} = - \sum_{i=1}^M\frac{\left[d_i - y(x_i;m,c)\right]^2}{2\sigma_i^2}. $$Note: If you want to evaluate the marginal likelihood for the data (e.g., using nested sampling, which is discussed later) to use it for model comparison, the constants may be required.
Setting the prior¶
In addition to the likelihood, we need to define prior distributions on the parameters. To demonstrate a couple of different, and reasonably common, priors, I'll use a Gaussian probability distribution for the prior on $m$ and a uniform probability distribution for the prior on $c$:
$$ p(m|\mu_m,\sigma_m,I) = \frac{1}{\sqrt{2\pi\sigma_m^2}}\exp{\left(-\frac{(m-\mu_m)^2}{2\sigma_m^2}\right)}, $$and
$$ p(c|c_{\rm min}, c_{\rm max}) = \Bigg\{\begin{array}{cl} \frac{1}{c_{\rm max}-c_{\rm min}} & \text{if}~c_{\rm min} < c < c_{\rm max}, \\ 0 & \text{otherwise}. \end{array} $$In our example we'll set:
- $\mu_m = 0$ and $\sigma_m = 10$,
- $c_{\rm min} = -10$ and $c_{\rm max} = 10$.
Note: These priors are not intended to be the least informative for the parameterisation that we have, but are just chosen for demonstration purposes to show how to code up two different distributions. A better prior for this particular problem might be the one discussed here, which could be coded up for PyMC3 using the example here.
Creating the data¶
So, let's get started creating our data. I'll (arbitrarily) use:
- $m = 3.5$,
- $c = 1.2$,
for the model and
- $\mathbf{x} \in [0, 10)$ in $M=50$ uniform steps ($[a,b)$ means inclusive of $a$, but exclusive of $b$),
- $\sigma_i = \sigma = 2$ (i.e., each noise point is independent, but has the same standard deviation),
for the data.
# show plots inline in the notebook
%matplotlib inline
import numpy as np # import numpy
from time import time # use for timing functions
# useful modules!
import os
import sys
# make the plots look a bit nicer with some defaults
from matplotlib import pyplot as pl # import pyplot from matplotlib
import matplotlib as mpl
rcparams = {}
rcparams['axes.linewidth'] = 0.5
rcparams['font.family'] = 'serif'
rcparams['font.size'] = 22
rcparams['legend.fontsize'] = 16
rcparams['mathtext.fontset'] = "stix"
# functions for plotting posteriors
import corner
from scipy.stats import gaussian_kde
# set the true values of the model parameters for creating the data
m = 3.5 # gradient of the line
c = 1.2 # y-intercept of the line
# set the "predictor variable"/abscissa
M = 50
xmin = 0.
xmax = 10.
stepsize = (xmax - xmin) / M
x = np.arange(xmin, xmax, stepsize)
# define the model function
def straight_line(x, m, c):
"""
A straight line model: y = m*x + c
Args:
x (list): a set of abscissa points at which the model is defined
m (float): the gradient of the line
c (float): the y-intercept of the line
"""
return m * x + c
# seed our random number generator, so we have reproducible data
np.random.seed(sum([ord(v) for v in 'samplers']))
# create the data - the model plus Gaussian noise
sigma = 2.0 # standard deviation of the noise
data = straight_line(x, m, c) + np.random.normal(scale=sigma, size=M)
# plot the data
mpl.rcParams.update(rcparams) # update plot parameters
fig, ax = pl.subplots(figsize=(9,6))
ax.plot(x, data, 'bo', alpha=0.5, label='data')
ax.plot(x, straight_line(x, m, c), 'r-', lw=2, label='model')
ax.legend()
ax.set_xlim([xmin, xmax])
ax.set_xlabel(r'$x$');
MCMC samplers¶
First up I'll deal with MCMC samplers that are purely written in Python, then a couple that are wrappers to other libraries.
emcee¶
emcee (Foreman-Mackey et al, 2013) is a Python MCMC implementation that uses an affine invariant ensemble sampler (Goodman & Weare, 2010). This basically means that it doesn't just evolve a single point in the model parameter space, but evolves an ensemble of points, and steps in the evolution are tuned based on the current state of the ensemble. The steps scale naturally (the affine invariant bit) as the ensemble closes in on the posterior. This code has been well used in the astrophysics community (and as an astrophysicist myself I'm showing it as my first example!).
emcee is available on PyPI and is installable via pip
with:
pip install emcee
or via Conda with:
conda install -c conda-forge emcee
The source code is available on GitHub here. The example here is very similar to the line model example given in the emcee documentation.
First up, we need to define the likelihood function, prior functions, and posterior probability function. These all need to be defined as the natural logarithms of the functions. All the functions take in a sample tuple or list, where a sample is a vector $\vec{\theta}$ of parameter values, that can be unpacked to give the individual parameter values. Other arguments to the posterior function can be user defined.
Posterior¶
As we do not care about the marginal likelihood, the posterior will just be the product of the likelihood and prior, and will take the following form:
def logposterior(theta, data, sigma, x):
"""
The natural logarithm of the joint posterior.
Args:
theta (tuple): a sample containing individual parameter values
data (list): the set of data/observations
sigma (float): the standard deviation of the data points
x (list): the abscissa values at which the data/model is defined
"""
lp = logprior(theta) # get the prior
# if the prior is not finite return a probability of zero (log probability of -inf)
if not np.isfinite(lp):
return -np.inf
# return the likelihood times the prior (log likelihood plus the log prior)
return lp + loglikelihood(theta, data, sigma, x)
Likelihood¶
The likelihood will take the following form:
def loglikelihood(theta, data, sigma, x):
"""
The natural logarithm of the joint Gaussian likelihood.
Args:
theta (tuple): a sample containing individual parameter values
data (list): the set of data/observations
sigma (float): the standard deviation of the data points
x (list): the abscissa values at which the data/model is defined
Note:
We do not include the normalisation constants (as discussed above).
"""
# unpack the model parameters from the tuple
m, c = theta
# evaluate the model (assumes that the straight_line model is defined as above)
md = straight_line(x, m, c)
# return the log likelihood
return -0.5 * np.sum(((md - data) / sigma)**2)
Prior¶
The prior function will take the following form:
def logprior(theta):
"""
The natural logarithm of the prior probability.
Args:
theta (tuple): a sample containing individual parameter values
Note:
We can ignore the normalisations of the prior here.
"""
lp = 0.
# unpack the model parameters from the tuple
m, c = theta
# uniform prior on c
cmin = -10. # lower range of prior
cmax = 10. # upper range of prior
# set prior to 1 (log prior to 0) if in the range and zero (-inf) outside the range
# (we don't care about it being properly normalised, but you can if you want)
lp = 0. if cmin < c < cmax else -np.inf
# Gaussian prior on m
mmu = 0. # mean of the Gaussian prior
msigma = 10. # standard deviation of the Gaussian prior
lp -= 0.5 * ((m - mmu) / msigma)**2
return lp
MCMC set up¶
We need to decide on an initial number of ensemble points and initialise the samples, i.e., set starting points for them. The initial samples can be drawn from the prior distributions (in the emcee documentation example it starts the opposite way round, i.e., with initial samples tightly packed around the true value). We'll choose 100 ensemble points (the number of so-called walkers) and initialise them with:
Nens = 100 # number of ensemble points
mmu = 0. # mean of the Gaussian prior
msigma = 10. # standard deviation of the Gaussian prior
mini = np.random.normal(mmu, msigma, Nens) # initial m points
cmin = -10. # lower range of prior
cmax = 10. # upper range of prior
cini = np.random.uniform(cmin, cmax, Nens) # initial c points
inisamples = np.array([mini, cini]).T # initial samples
ndims = inisamples.shape[1] # number of parameters/dimensions
Because our initial samples are drawn from the prior it can take time for them to converge on sampling from the posterior distributions that we want. To try and get around this we can run the sampler with a burn-in period (although see, e.g., here or here for some discussions about whether burn-ins are actually necessary), and we will ignore samples during this burn-in.
Note: Alternative options could be to use some optimisation routine to try and find the posterior mode and start samples at that point, or use all samples, work out an autocorrelation length and thin the samples accordingly so they are uncorrelated (as shown a bit later), which gets rid of correlations during the initial convergence of the samples.
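As a concrete illustration of the first of those options, a minimal sketch (not part of the workflow used here) that locates the posterior mode with scipy.optimize and scatters the walkers tightly around it might look like the following; the starting guess and the size of the scatter are arbitrary choices:
from scipy.optimize import minimize
# negative log-posterior to minimise (logposterior, data, sigma and x are defined above)
negLogPost = lambda theta: -logposterior(theta, data, sigma, x)
# minimise from an arbitrary starting guess inside the prior bounds
optres = minimize(negLogPost, x0=[1.0, 0.0], method="Nelder-Mead")
# scatter the ensemble in a small ball around the mode (the 1e-3 scale is arbitrary)
inisamples_mode = optres.x + 1e-3 * np.random.randn(Nens, ndims)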
We'll choose a burn-in of 500 samples, and then run the chain for another 500 samples, which will give an output with $(500+500)\times 100 = 100000$ samples.
Nburnin = 500 # number of burn-in samples
Nsamples = 500 # number of final posterior samples
The sampler first has to be initialised and told the required arguments, including any additional arguments for the posterior function:
import emcee # import the emcee package
print('emcee version: {}'.format(emcee.__version__))
# for bookkeeping set number of likelihood calls to zero
loglikelihood.ncalls = 0
# set additional args for the posterior (the data, the noise std. dev., and the abscissa)
argslist = (data, sigma, x)
# set up the sampler
sampler = emcee.EnsembleSampler(Nens, ndims, logposterior, args=argslist)
Now the sampler can be run, and the burn-in samples removed, with:
# pass the initial samples and total number of samples required
t0 = time() # start time
sampler.run_mcmc(inisamples, Nsamples + Nburnin);
t1 = time()
timeemcee = (t1-t0)
print("Time taken to run 'emcee' is {} seconds".format(timeemcee))
# extract the samples (removing the burn-in)
samples_emcee = sampler.get_chain(flat=True, discard=Nburnin)
Now we can plot the resulting posteriors using corner.py, and also show smoothed versions using a Gaussian kernel density estimate.
# plot the resulting posteriors
mpl.rcParams.update({'font.size': 16})
def plotposts(samples, **kwargs):
"""
Function to plot posteriors using corner.py and scipy's gaussian KDE function.
"""
if "truths" not in kwargs:
kwargs["truths"] = [m, c]
fig = corner.corner(samples, labels=[r'$m$', r'$c$'], hist_kwargs={'density': True}, **kwargs)
# plot KDE smoothed version of distributions
for axidx, samps in zip([0, 3], samples.T):
kde = gaussian_kde(samps)
xvals = fig.axes[axidx].get_xlim()
xvals = np.linspace(xvals[0], xvals[1], 100)
fig.axes[axidx].plot(xvals, kde(xvals), color='firebrick')
# make plot
plotposts(samples_emcee)
# lets store some results for showing together later
resdict = {}
resdict['memcee_mu'] = np.mean(samples_emcee[:,0]) # mean of m samples
resdict['memcee_sig'] = np.std(samples_emcee[:,0]) # standard deviation of m samples
resdict['cemcee_mu'] = np.mean(samples_emcee[:,1]) # mean of c samples
resdict['cemcee_sig'] = np.std(samples_emcee[:,1]) # standard deviation of c samples
resdict['ccemcee'] = np.corrcoef(samples_emcee.T)[0,1] # correlation coefficient between parameters
The time to convergence can depend greatly on how the initial points are set up, so if you have a better guess as to the location and shape of the posterior, then using those guesses is advised. But, beware of overtuning! emcee has some handy utilities to draw samples from around an initial guess. In the example below I show the use of the (now deprecated) sample_ball()
function, which creates a ball of samples assuming no correlations between parameters. However, if you knew that your parameter uncertainties had a particular covariance, you could use sample_ellipsoid()
to generate a sample cloud using a supplied covariance matrix.
from emcee.utils import sample_ball
m_guess, c_guess = 1., 1. # guess at the mean of the distribution
m_std, c_std = 1., 1. # guess at the width of the distribution
# generate samples from a ball based on those guesses
inisamples = sample_ball((m_guess, c_guess), (m_std, c_std), size=Nens)
# reset so we don't pick up the result from earlier
sampler.reset()
# pass the initial samples and total number of samples required
t0 = time() # start time
sampler.run_mcmc(inisamples, Nsamples+Nburnin);
t1 = time()
timeemcee = (t1-t0)
print("Time taken to run 'emcee' is {} seconds".format(timeemcee))
# extract the samples (removing the burn-in)
samples_emcee = sampler.get_chain(flat=True, discard=Nburnin)
# plot the resulting posteriors
plotposts(samples_emcee)
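For reference, if you did have an estimate of the parameter covariance, a minimal sketch using the (also deprecated) sample_ellipsoid() utility mentioned above might look like the following; the covariance matrix here is purely an illustrative guess:
from emcee.utils import sample_ellipsoid
# an illustrative (made-up) guess at the parameter covariance matrix
guess_cov = np.array([[1.0, 0.1],
                      [0.1, 1.0]])
# draw the ensemble of walkers from an ellipsoid centred on the guessed means
inisamples_ell = sample_ellipsoid((m_guess, c_guess), guess_cov, size=Nens)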
Once the sampler has settled, the samples that come out of the MCMC will be drawn from the posterior. However, even in the best of situations, adjacent draws are often correlated with each other. The number of "independent" samples, that is, fully independent draws from the posterior, can be much smaller than the number of samples in hand.
The emcee EnsembleSampler
class has an inbuilt function, get_autocorr_time()
, to calculate the autocorrelation length (ACL) (or time) - this is an estimate of the number of samples between independent samples in the chain. You can use this number to estimate convergence of chains, and also to thin out chains to produce independent samples.
m_acl, c_acl = sampler.get_autocorr_time(quiet=True)
print("The autocorrelation length for m is {0} and c is {1}".format(m_acl, c_acl))
# thin out the chain
samples_emcee = sampler.get_chain(flat=True, discard=Nburnin, thin=int(max([m_acl, c_acl])))
print("Number of independent samples is {}".format(len(samples_emcee)))
resdict['esemcee'] = len(samples_emcee)
resdict['essemcee'] = int(resdict['esemcee'] / timeemcee)
print("Effective samples per second: {}".format(resdict['essemcee']))
# plot the resulting posteriors
plotposts(samples_emcee)
In practice, if you have a minimum number of independent samples that you require then you can repeatedly run the sampler, checking the ACL each time (and maybe adjusting the c
input parameter for get_autocorr_time()
), until you have collected the required number.
Checking the ACL is something that can be done for any of the samplers discussed in the rest of this notebook. Different sampling methods can lead to different ACLs, and methods that may take more time per sample might end up having shorter ACLs. So, examining the various trade-offs between overall speed and the final number of independent samples can be worthwhile.
PyMC3¶
PyMC3 (Salvatier, Wiecki & Fonnesbeck, 2016) is a Python MCMC implementation that can use a variety of modern sampling methods, including "No-U-turn sampling" (NUTS) (Hoffman & Gelman, 2014) and Hamiltonian Monte Carlo (Duane et al, 1987), both of which make use of the gradient of the posterior to efficiently sample the space. An example of getting started using PyMC3 can be found here, upon which this example is based.
This code can be installed using pip
with:
pip install pymc3
or via Conda with:
conda install -c conda-forge pymc3
and the source is available on GitHub here.
The majority of distributions that can be used for priors and likelihood functions are predefined in PyMC3, so you do not have to worry about writing them yourself (although you can create your own custom distribution; a short sketch of this is given after the model definition below).
Note that in PyMC3 the prior and likelihood distributions have to be defined within the context of a PyMC3 Model()
class, and cannot be defined outside that context (the with
statement). Therefore, to set up the model we can't just use the straight_line()
function defined above, but can do the following:
import pymc3 as pm # import PyMC3
print('PyMC3 version: {}'.format(pm.__version__))
# set the PyMC3 model
linear_model = pm.Model()
with linear_model:
# set prior parameters
cmin = -10. # lower range of uniform distribution on c
cmax = 10. # upper range of uniform distribution on c
mmu = 0. # mean of Gaussian distribution on m
msigma = 10. # standard deviation of Gaussian distribution on m
# set priors for unknown parameters
cmodel = pm.Uniform('c', lower=cmin, upper=cmax) # uniform prior on y-intercept
mmodel = pm.Normal('m', mu=mmu, sd=msigma) # Gaussian prior on gradient
sigmamodel = sigma # set a single standard deviation
# Expected value of outcome, aka "the model"
mu = mmodel*x + cmodel
# Gaussian likelihood (sampling distribution) of observations, "data"
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigmamodel, observed=data)
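As noted above, if a distribution you need is not predefined you can supply your own. The following is a minimal sketch, based on the pm.DensityDist pattern in PyMC3 3.x with its Theano backend (an assumption on my part, and not used in the run below), that reproduces the Gaussian likelihood by hand:
import theano.tensor as tt
with pm.Model() as custom_model:
    # same priors as before
    cmodel = pm.Uniform('c', lower=cmin, upper=cmax)
    mmodel = pm.Normal('m', mu=mmu, sd=msigma)
    # the straight line model
    mu = mmodel * x + cmodel
    # a hand-written (unnormalised) Gaussian log-likelihood
    def logp(value):
        return -0.5 * tt.sum(((value - mu) / sigma) ** 2)
    pm.DensityDist('likelihood', logp, observed=data)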
Now we can perform the sampling. By default, for continuous variables like the ones we have, the sampling step method will be NUTS, but the method can be chosen manually. We'll run with 1000 samples, and tune (like the burn-in discussed above) with 1000 samples (discarding these tuning samples):
%%capture
Nsamples = 1000 # final number of samples
Ntune = 1000 # number of tuning samples
# perform sampling
t0 = time()
with linear_model:
trace = pm.sample(Nsamples, tune=Ntune, discard_tuned_samples=True); # perform sampling
t1 = time()
timepymc3 = (t1 - t0)
print("Time taken to run 'PyMC3' is {} seconds".format(timepymc3))
Now, let's extract the samples from the trace
object and make a corner plot!
samples_pymc3 = np.vstack((trace['m'], trace['c'])).T
resdict['mpymc3_mu'] = np.mean(samples_pymc3[:,0]) # mean of m samples
resdict['mpymc3_sig'] = np.std(samples_pymc3[:,0]) # standard deviation of m samples
resdict['cpymc3_mu'] = np.mean(samples_pymc3[:,1]) # mean of c samples
resdict['cpymc3_sig'] = np.std(samples_pymc3[:,1]) # standard deviation of c samples
resdict['ccpymc3'] = np.corrcoef(samples_pymc3.T)[0,1] # correlation coefficient between parameters
plotposts(samples_pymc3)
# calculate the autocorrelation length
acts = []
for param in ["m", "c"]:
acts.append(emcee.autocorr.integrated_time(trace[param]))
print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
resdict['espymc3'] = int(len(samples_pymc3) / np.max(acts))
resdict['esspymc3'] = int(resdict['espymc3'] / timepymc3)
print("Effective samples per second: {}".format(resdict['esspymc3']))
TensorFlow Probability¶
TensorFlow Probability (TFP) is a probabilistic modelling framework built upon the TensorFlow library. TFP contains some samplers for performing MCMC. At the time of writing TFP is still fairly new, and currently includes a Hamiltonian MC sampler, a Metropolis-Hastings sampler, and a slice sampler.
TFP can be installed from PyPI via pip
with:
pip install tensorflow tensorflow-probability
where tensorflow
must be explicitly installed as it's not a dependency of tensorflow-probability
.
Using TFP for MCMC sampling has some similarities to PyMC3, however it can be fairly opaque, and (for me at least!) non-intuitive. The example given does not come with a detailed explanation of the constructions I've used, because I don't have any in-depth knowledge of them! It was mainly cobbled together by looking at the example given here.
Firstly, we can set up the probabilistic model (with a JointDistributionSequential
class) that contains the prior distributions on the parameters of interest and a lambda
function defining the likelihood function, within which we write out the straight line model for the data. Some things to be aware of are: the arguments to the lambda
function defining the likelihood must be given in the opposite order to that in which those arguments are placed within the JointDistributionSequential
class; the name
of the Independent
distribution that holds the likelihood function should be the same as the name of the variable that holds the data.
# import TensorFlow probability
import tensorflow_probability as tfp
from tensorflow_probability import distributions as tfd
import tensorflow as tf
cmin = -10. # lower range of uniform distribution on c
cmax = 10. # upper range of uniform distribution on c
mmu = 0. # mean of Gaussian distribution on m
msigma = 10. # standard deviation of Gaussian distribution on m
# convert x values and data to 32 bit float
xtfp = x.astype(np.float32) # x is being used globally here
datatfp = data.astype(np.float32)
# set model - contains priors and the expected linear model
model = tfd.JointDistributionSequential([
tfd.Normal(loc=mmu, scale=msigma, name="m"), # m prior
tfd.Uniform(cmin, cmax, name="c"), # c prior
lambda c, m: (tfd.Independent(
tfd.Normal( # the Normal likelihood function
loc=(m[..., tf.newaxis] * xtfp + c[..., tf.newaxis]), # the straight line model
scale=sigma # the data standard deviation
),
name="datatfp", # the name of the variable holding the data
reinterpreted_batch_ndims=1,
))
])
print('TensorFlow version: {}'.format(tf.__version__))
print('TensorFlow Probability version: {}'.format(tfp.__version__))
Next, the probabilistic model must be wrapped in a function that takes in values of the parameters being sampled and returns the probability.
def target_log_prob_fn(mvalue, cvalue):
return model.log_prob(
(mvalue, cvalue, datatfp)
)
We can now set up the sampler; here we'll use the Hamiltonian Monte Carlo sampler. For the set up of the HamiltonianMonteCarlo
kernel we have to specify a step size (i.e., the size of the proposals) and a number of leapfrog steps. Some experimentation for our particular case shows a step size of 0.05 is reasonable, but it won't work for every case and must be tuned by hand. It also means that if different parameters have quite different dynamic ranges this method will not work well (it may produce jumps that are always far too big or too small for certain sets of parameters).
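One way to reduce the hand-tuning (a sketch only, assuming TFP's SimpleStepSizeAdaptation wrapper; it is not used in the run below) is to let the step size adapt during the burn-in:
import tensorflow_probability as tfp
# wrap the HMC kernel so the step size is tuned automatically during burn-in
adaptive_kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        step_size=0.05,  # initial guess, refined during adaptation
        num_leapfrog_steps=5,
    ),
    num_adaptation_steps=3200,  # e.g., roughly 80% of the burn-in steps
)
# note: with this wrapper the acceptance information moves to
# kernel_results.inner_results.is_accepted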
Wrapping the sampling within a function (do_sampling()
) with the @tf.function
decorator compiles the model and means it runs far faster than without the wrapper.
We'll draw initial values of the m
and c
parameters from distributions that match their priors.
Nsamples = 4000 # final number of samples
Nburn = 4000 # number of tuning samples
# set up Hamiltonian MC (within TensorFlow function to speed up computation)
@tf.function(autograph=False)
def do_sampling():
# set initial state (random draw from prior)
qc = tf.random.uniform([], minval=cmin, maxval=cmax, name="init_c")
qm = tf.random.normal([], stddev=msigma, mean=mmu, name="init_m")
hmc_kernel = tfp.mcmc.HamiltonianMonteCarlo(
target_log_prob_fn=target_log_prob_fn,
step_size=0.05,
num_leapfrog_steps=5,
)
return tfp.mcmc.sample_chain(
num_results=Nsamples,
current_state=[qm, qc],
kernel=hmc_kernel,
num_burnin_steps=Nburn,
)
t0 = time()
states, kernel_results = do_sampling()
t1 = time()
timetfp = (t1-t0)
# extract the samples
ms, cs = states
# convert output states to numpy arrays
samples_tfp = np.vstack((ms.numpy(), cs.numpy())).T
num_accepted = np.sum(kernel_results.is_accepted)
print("Acceptance rate: {}".format(num_accepted / Nsamples)) # print out acceptance rate
print("Time taken to run 'TFP' is {} seconds".format(timetfp))
resdict['mtfp_mu'] = np.mean(samples_tfp[:,0]) # mean of m samples
resdict['mtfp_sig'] = np.std(samples_tfp[:,0]) # standard deviation of m samples
resdict['ctfp_mu'] = np.mean(samples_tfp[:,1]) # mean of c samples
resdict['ctfp_sig'] = np.std(samples_tfp[:,1]) # standard deviation of c samples
resdict['cctfp'] = np.corrcoef(samples_tfp.T)[0,1] # correlation coefficient between parameters
plotposts(samples_tfp)
# calculate the autocorrelation length
acts = []
for i, param in enumerate(["m", "c"]):
acts.append(emcee.autocorr.integrated_time(samples_tfp[:,i]))
print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
resdict['estfp'] = int(len(samples_tfp) / np.max(acts))
resdict['esstfp'] = int(resdict['estfp'] / timetfp)
print("Effective samples per second: {}".format(resdict['esstfp']))
Zeus¶
Zeus (Karamanis & Beutler, 2020, Karamanis, Beutler & Peacock, 2021) is a pure Python MCMC code that uses ensemble slice sampling. It is very similar in usage to emcee.
Zeus is available via PyPI and can be installed using pip
with:
pip install zeus-mcmc
or via Conda with:
conda install -c conda-forge zeus-mcmc
The Zeus source code is available on Github.
The Zeus documentation gives a nice example of fitting a straight line. The interface is so similar to emcee that we can reuse the logprior
, loglikelihood
and logposterior
functions that we used in the emcee case (I won't bother redefining them, so you'll have to scroll back up!). So, that just leaves the sampling. We will leave any tunable parameters at their default values, but note that you can choose between a few different proposal distributions.
import zeus
print('Zeus version: {}'.format(zeus.__version__))
ndim = 2 # Number of parameters/dimensions (e.g. m and c)
Nens = 100 # for consistency with the emcee example we'll use 100
Nburn = 500 # burn-in samples
Nsamples = 500 # Nsamples
# set the initial samples (we'll draw from the prior)
mini = np.random.normal(mmu, msigma, Nens) # initial m points
cini = np.random.uniform(cmin, cmax, Nens) # initial c points
inisamples = np.array([mini, cini]).T # initial samples
# Initialise the sampler
sampler = zeus.EnsembleSampler(Nens, ndim, logposterior, args=[data, sigma, x])
t0 = time()
# Run the sampler
sampler.run_mcmc(inisamples, Nburn + Nsamples)
t1 = time()
timezeus = t1 - t0
# print summary diagnostics
sampler.summary
print("Time taken to run 'Zeus' is {} seconds".format(timezeus))
The EnsembleSampler
object has a method to extract the posterior samples and we can discard the burn-in and thin the samples if required.
samples_zeus = sampler.get_chain(flat=True, discard=Nburn)
resdict['mzeus_mu'] = np.mean(samples_zeus[:,0]) # mean of m samples
resdict['mzeus_sig'] = np.std(samples_zeus[:,0]) # standard deviation of m samples
resdict['czeus_mu'] = np.mean(samples_zeus[:,1]) # mean of c samples
resdict['czeus_sig'] = np.std(samples_zeus[:,1]) # standard deviation of c samples
resdict['cczeus'] = np.corrcoef(samples_zeus.T)[0,1] # correlation coefficient between parameters
plotposts(samples_zeus)
# calculate the autocorrelation length
acts = []
for i, param in enumerate(["m", "c"]):
acts.append(emcee.autocorr.integrated_time(samples_zeus[:, i]))
print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
resdict['eszeus'] = int(len(samples_zeus) / np.max(acts))
resdict['esszeus'] = int(resdict['eszeus'] / timezeus)
print("Effective samples per second: {}".format(resdict['esszeus']))
MCMC wrapper samplers¶
There are other MCMC samplers that are not pure Python packages, but have a Python-based wrapper to a different library. With these you can't write pure Python model and probability functions, but instead have to create a code string that is then compiled and run using the underlying library functions. We will look at PyStan and PyJAGS.
PyStan¶
PyStan is a Python interface to the popular Stan library. As with PyMC3, it uses the modern NUTS and Hamiltonian MC sampling methods. It can be installed using pip
with:
pip install pystan
or via Conda with:
conda install -c conda-forge pystan
and the source code is available on GitHub here.
To define the model and distributions you need to write a code string like this:
# model and distributions for straight line fitting
line_code = """
data {{
int N; // number of data points
real y[N]; // observed data points
real x[N]; // abscissa points
real sigma; // standard deviation
}}
parameters {{
// parameters for the fit
real m;
real c;
}}
transformed parameters {{
real theta[N];
for (j in 1:N)
theta[j] = m * x[j] + c; // straight line model
}}
model {{
m ~ normal({mmu}, {msigma}); // prior on m (gradient)
c ~ uniform({clower}, {cupper}); // prior on c (y-intercept)
y ~ normal(theta, sigma); // likelihood of the data given the model
}}
"""
You can then define the inputs to the data
section in a dictionary:
# set the data and the abscissa
linear_data = {
'N': M, # number of data points
'y': data, # the observed data
'x': x, # the abscissa points
'sigma': sigma, # standard deviation
}
Now you can compile the model and run the sampler, with 1000 iterations and 4 chains, with:
%%capture
import pystan # import PyStan
Nsamples = 1000 # set the number of iterations of the sampler
chains = 4 # set the number of chains to run with
# dictionary for inputs into line_code
linedict = {}
linedict['mmu'] = 0.0 # mean of Gaussian prior distribution for m
linedict['msigma'] = 10 # standard deviation of Gaussian prior distribution for m
linedict['clower'] = -10 # lower bound on uniform prior distribution for c
linedict['cupper'] = 10 # upper bound on uniform prior distribution for c
t0 = time()
sm = pystan.StanModel(model_code=line_code.format(**linedict)); # compile model
t1 = time()
timepystancomp = (t1 - t0)
t0 = time()
fit = sm.sampling(data=linear_data, iter=Nsamples, chains=chains); # perform sampling
t1 = time()
timepystan = (t1 - t0)
print('PyStan version: {}'.format(pystan.__version__))
print("Time taken to comile 'PyStan' model is {} seconds".format(timepystancomp))
print("Time taken to run 'PyStan' is {} seconds".format(timepystan))
PyStan does have a plotting function (you could run fit.plot()
), but we'll extract the chains and use corner.py
:
la = fit.extract(permuted=True) # return a dictionary of arrays
samples_pystan = np.vstack((la['m'], la['c'])).T
resdict['mpystan_mu'] = np.mean(samples_pystan[:,0]) # mean of m samples
resdict['mpystan_sig'] = np.std(samples_pystan[:,0]) # standard deviation of m samples
resdict['cpystan_mu'] = np.mean(samples_pystan[:,1]) # mean of c samples
resdict['cpystan_sig'] = np.std(samples_pystan[:,1]) # standard deviation of c samples
resdict['ccpystan'] = np.corrcoef(samples_pystan.T)[0,1] # correlation coefficient between parameters
# plot using corner.py
plotposts(samples_pystan)
# calculate the autocorrelation length
acts = []
for i, param in enumerate(["m", "c"]):
acts.append(emcee.autocorr.integrated_time(samples_pystan[:, i]))
print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
resdict['espystan'] = int(len(samples_pystan) / np.max(acts))
resdict['esspystan'] = int(resdict['espystan'] / timepystan)
print("Effective samples per second: {}".format(resdict['esspystan']))
PyJAGS¶
PyJAGS is another wrapper to a non-Python library, in this case the JAGS library (Plummer, 2003). As with PyStan, you have to write a code string that is then compiled and run. To use PyJAGS you need to install JAGS, e.g. on a Debian-based system this can be done using:
sudo apt-get install jags
and then install PyJAGS using pip
with:
pip install pyjags
The source code for PyJAGS is available on GitHub here.
A linear regression fitting example using PyJAGS is given here, on which this example is based.
We can define the model string using (note that in this model the Gaussian/Normal distribution dnorm
takes the inverse variance as the second argument, rather than the standard deviation):
# create the JAGS code for the linear model
line_code_jags = """
model {{
for (i in 1:N) {{
y[i] ~ dnorm(c + m * x[i], {invvar}) # Gaussian likelihood
}}
m ~ dnorm({mmu}, {minvvar}) # Gaussian prior on m
c ~ dunif({clower}, {cupper}) # Uniform prior on c
}}
"""
You can then define the data inputs section in a dictionary:
datadict = {
'x': x, # the abscissa points
'N': M, # number of data points
'y': data, # the observed data
}
Now you can compile the model and run the sampler, with 1000 iterations and 4 chains, with:
%%capture
import pyjags # import PyJAGS
Nsamples = 1000 # set the number of iterations of the sampler
chains = 4 # set the number of chains to run with
# dictionary for inputs into line_code
linedict = {}
linedict['mmu'] = 0.0 # mean of Gaussian prior distribution for m
linedict['minvvar'] = 1 / 10**2 # inverse variance of Gaussian prior distribution for m
linedict['clower'] = -10 # lower bound on uniform prior distribution for c
linedict['cupper'] = 10 # upper bound on uniform prior distribution for c
linedict['invvar'] = 1 / sigma**2 # inverse variance of the data
# compile model
t0 = time()
model = pyjags.Model(line_code_jags.format(**linedict), data=datadict, chains=chains)
t1 = time()
timepyjagscomp = (t1 - t0)
t0 = time()
samples = model.sample(Nsamples, vars=['m', 'c']) # perform sampling
t1 = time()
timepyjags = (t1 - t0)
print('PyJAGS version: {}'.format(pyjags.__version__))
print("Time taken to comile 'PyJAGS' model is {} seconds".format(timepyjagscomp))
print("Time taken to run 'PyJAGS' is {} seconds".format(timepyjags))
Now we can plot the samples using corner.py
:
mchainjags = samples['m'].flatten()
cchainjags = samples['c'].flatten()
samples_pyjags = np.vstack((mchainjags, cchainjags)).T
resdict['mpyjags_mu'] = np.mean(samples_pyjags[:,0]) # mean of m samples
resdict['mpyjags_sig'] = np.std(samples_pyjags[:,0]) # standard deviation of m samples
resdict['cpyjags_mu'] = np.mean(samples_pyjags[:,1]) # mean of c samples
resdict['cpyjags_sig'] = np.std(samples_pyjags[:,1]) # standard deviation of c samples
resdict['ccpyjags'] = np.corrcoef(samples_pyjags.T)[0,1] # correlation coefficient between parameters
# plot using corner.py
plotposts(samples_pyjags)
# calculate the autocorrelation length
acts = []
for i, param in enumerate(["m", "c"]):
acts.append(emcee.autocorr.integrated_time(samples_pyjags[:, i]))
print("Autocorrelation length for '{}' is {}".format(param, acts[-1]))
print("Maximum autocorrelation length is {0:.1f}".format(np.max(acts)))
resdict['espyjags'] = int(len(samples_pyjags) / np.max(acts))
resdict['esspyjags'] = int(resdict['espyjags'] / timepyjags)
print("Effective samples per second: {}".format(resdict['esspyjags']))
That ends our exploration of MCMC codes. We now move on to Nested Sampling!
Nested Sampling¶
Nested sampling (Skilling, 2006) is a method to numerically perform the integral required to evaluate the marginal likelihood of the data given a particular model. This is not a value that is produced by most standard MCMC methods, but is a value that is very useful if wanting to do Bayesian model comparison. Above, we defined the marginal likelihood, or evidence, as
$$ p(\mathbf{d}|I) = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta},I) p(\vec{\theta}|I) {\rm d}{\theta_i}, $$but, we can make this more explicit for a given hypothesis, or model, $H_j$, by stating:
$$ p(\mathbf{d}|H_j, I) \equiv Z = \int^{\forall \theta_i \in \vec{\theta}} p(\mathbf{d}|\vec{\theta}, H_j, I) p(\vec{\theta}|H_j,I) {\rm d}{\theta_i}, $$where previously the $H_j$ had been subsumed inside our implicit assumptions, $I$.
Nested sampling algorithms generally start by drawing a set of points (sometimes called "live points", or "active points") from the prior distribution. It relies on these points being independent draws from the prior, otherwise the resulting marginal likelihood value can be biased. The likelihood is evaluated at each point, and the point with the smallest likelihood is found. This point is then removed from the set of live points (and goes into the sum calculating the marginal likelihood integral) and a new point is drawn from the prior with the constraint that it has a likelihood larger than that of the just-removed point. Drawing truly independent points from this constrained prior is the main tricky part of nested sampling (particularly when the number of parameters is not small), and a variety of methods (including MCMC methods) can be used.
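To make this a little more concrete, here is a highly simplified sketch of the core loop (not production code): the helper functions draw_from_prior() and draw_from_constrained_prior() are hypothetical stand-ins, with the latter hiding the genuinely hard part described above, and the final contribution from the remaining live points is ignored.
import numpy as np
def nested_sampling_sketch(loglike, draw_from_prior, draw_from_constrained_prior,
                           nlive=100, niter=1000):
    # draw the initial set of live points from the prior and evaluate their likelihoods
    live = [draw_from_prior() for _ in range(nlive)]
    live_logL = np.array([loglike(p) for p in live])
    logZ = -np.inf  # running estimate of ln(Z)
    logX = 0.0      # ln of the remaining prior volume (starts at ln(1) = 0)
    for i in range(niter):
        worst = np.argmin(live_logL)   # find the lowest-likelihood live point
        logX_new = -(i + 1) / nlive    # expected (log) shrunken prior volume
        logw = np.log(np.exp(logX) - np.exp(logX_new))  # weight assigned to the dead point
        logZ = np.logaddexp(logZ, live_logL[worst] + logw)  # accumulate the evidence sum
        logX = logX_new
        # replace the dead point with a new prior draw with likelihood above the removed one
        live[worst] = draw_from_constrained_prior(live_logL[worst])
        live_logL[worst] = loglike(live[worst])
    return logZ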
The uncertainty on the estimate of the natural logarithm of the marginal likelihood, $\ln{Z}$, can be found if the information gain (or Kullback-Leibler divergence), $h$, in going from the prior to the posterior is calculated. If $h$ (in nats) is available then the uncertainty is (Skilling, 2006):
$$ \sqrt{\frac{h}{N_{\rm live}}}, $$where $N_{\rm live}$ is the number of live points used.
Note: This is the statistical uncertainty on the value, but if there are any biases due to the sampling not being independent draws from the prior then there can be systematic uncertainties too.
As well as producing a value for the marginal likelihood, a by-product of the nested sampling algorithm is the production of a set of samples that can be re-sampled from to provide samples drawn from the posterior distribution.
Note: MCMC methods can be used to estimate the marginal likelihood, e.g.,
emcee
can be run with a parallel tempering sampling method and used to compute the evidence using thermodynamic integration (e.g., Goggans & Chi, 2004), or potentially using Slice sampling with PyMC3.
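For reference, thermodynamic integration estimates the marginal likelihood using the identity
$$ \ln{Z} = \int_0^1 \left\langle \ln{L} \right\rangle_\beta \, {\rm d}\beta, $$where $\left\langle \ln{L} \right\rangle_\beta$ is the average log-likelihood over samples drawn from the "tempered" posterior $\propto p(\vec{\theta}|I)\,p(\mathbf{d}|\vec{\theta},I)^\beta$, with the integral approximated using chains run at a discrete set of inverse temperatures $\beta$.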
As with MCMC, there are various codes available to perform the nested sampling algorithm. We will again look at pure Python implementations and some that are wrappers to other libraries.
Nestle¶
Nestle is a pure Python nested sampling algorithm. It has three options for the method with which it draws new samples: "classic", "single ellipsoid", and "multiple ellipsoids". In this example we'll make use of the last of these. This uses the MultiNest method (also see PyMultiNest below) (Feroz et al, 2009) in which the current live points are clustered using a recursive k-means method (however, clustering is not re-done at every iteration of the algorithm), ellipsoids completely bounding each cluster are found, an ellipsoid is randomly chosen with a probability proportional to its volume, and a new sample is drawn uniformly from it.
Nestle is available on PyPI and installable using pip
with:
pip install nestle
or via Conda with:
conda install -c conda-forge nestle
The source code is available on GitHub here.
The set up of the likelihood function is quite similar to that for emcee, but the priors are defined via a transformation from a unit hypercube (an $n$-dimensional cube with all sides equal to one), so parameters must be mapped from this space into their true ranges. This is simple for uniform priors, but requires an appropriate re-parameterisation for more complex priors (if possible).
For our Gaussian prior on $m$ the mapping from the unit hypercube is not a simple rescaling, but we can use inverse transform sampling: take samples distributed uniformly between 0 and 1 and convert them to draws from the Gaussian, e.g.,
$$ m = \mu_m + \sqrt{2}\sigma_m {\rm erf}^{-1}\left(2m' - 1\right), $$where $m' \sim {\rm Uniform}(0,1)$ and ${\rm erf}^{-1}$ is the inverse error function. As shown in the Nestle documentation we can use the scipy.special.ndtri
function for this.
An example of using Nestle for the linear regression problem, but with uniform priors on both parameters, is given here.
Setting the prior transform¶
Given the above description, rather than setting a prior function, we set a prior transformation function that maps from parameters drawn from the unit hypercube into the true parameters.
# import the inverse error function from scipy
from scipy.special import ndtri
def prior_transform(theta):
"""
A function defining the tranform between the parameterisation in the unit hypercube
to the true parameters.
Args:
theta (tuple): a tuple containing the parameters.
Returns:
tuple: a new tuple or array with the transformed parameters.
"""
mprime, cprime = theta # unpack the parameters (in their unit hypercube form)
cmin = -10. # lower bound on uniform prior on c
cmax = 10. # upper bound on uniform prior on c
mmu = 0. # mean of Gaussian prior on m
msigma = 10. # standard deviation of Gaussian prior on m
m = mmu + msigma*ndtri(mprime) # convert back to m
c = cprime*(cmax-cmin) + cmin # convert back to c
return (m, c)
Setting the likelihood function¶
This is similar to setting the likelihood function for emcee except, as we're calculating the marginal likelihood, we will keep the normalisation constants. As before, we work with the natural logarithm of the likelihood function. The function can only take in a tuple variable containing the parameter values, so any additional required values must be hard coded, or set as global variables.
# set the natural logarithm of 2pi, so that it doesn't have to be recalculated
LN2PI = np.log(2.*np.pi)
LNSIGMA = np.log(sigma) # natural log of the data noise standard deviation
def loglikelihood_nestle(theta):
"""
The log-likelihood function.
"""
m, c = theta # unpack the parameters
# normalisation
norm = -0.5 * M * LN2PI - M * LNSIGMA
# chi-squared (data, sigma and x are global variables defined early on in this notebook)
chisq = np.sum(((data - straight_line(x, m, c)) / sigma)**2)
return norm - 0.5 * chisq
Now you can run the sampler. We'll use 1024 live/active points (there's nothing special about this number, and powers of two are not required). The more live points you use the longer the code will take to run, but the uncertainty on the final marginal likelihood value will decrease with greater numbers of live points.
Unlike MCMC, where you specify the number of iterations to run and the sampler stops after that point, nested sampling needs to be supplied with a stopping criterion at which it will terminate. One common criterion is based on the ratio between an estimate of the total evidence and the currently calculated evidence value; once this falls below some set tolerance the algorithm terminates.
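As a rough sketch of how this criterion is often implemented (the details vary between codes), the remaining evidence at iteration $i$ can be approximated using the maximum likelihood amongst the current live points, $L_{\rm max}$, and the remaining prior volume, $X_i$, with the algorithm stopping once the estimated change in the log-evidence is small:
$$ \Delta Z_i \approx L_{\rm max} X_i, \quad {\rm stop~when}~\ln{\left(Z_i + \Delta Z_i\right)} - \ln{Z_i} < {\rm tol}. $$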
# import nestle
import nestle
print('Nestle version: {}'.format(nestle.__version__))
nlive = 1024 # number of live points
method = 'multi' # use MutliNest algorithm
ndims = 2 # two parameters
tol = 0.1 # the stopping criterion
t0 = time()
res = nestle.sample(loglikelihood_nestle, prior_transform, ndims, method=method, npoints=nlive, dlogz=tol)
t1 = time()
timenestle = (t1-t0)
print("Time taken to run 'Nestle' is {} seconds".format(timenestle))
We can extract the value of the marginal likelihood, $\ln{Z}$, and its uncertainties with the following:
logZnestle = res.logz # value of logZ
infogainnestle = res.h # value of the information gain in nats
logZerrnestle = np.sqrt(infogainnestle/nlive) # estimate of the statistical uncertainty on logZ
print("log(Z) = {} ± {}".format(logZnestle, logZerrnestle))
A summary of the above values can also be output using:
print(res.summary())
As mentioned above, the samples output by nested sampling are not drawn from the posterior, and as such need to be resampled from using certain weights. These weights are output by Nestle in the results. We can get posterior samples with the following:
# re-scale weights to have a maximum of one
nweights = res.weights/np.max(res.weights)
# get the probability of keeping a sample from the weights
keepidx = np.where(np.random.rand(len(nweights)) < nweights)[0]
# get the posterior samples
samples_nestle = res.samples[keepidx,:]
resdict['mnestle_mu'] = np.mean(samples_nestle[:,0]) # mean of m samples
resdict['mnestle_sig'] = np.std(samples_nestle[:,0]) # standard deviation of m samples
resdict['cnestle_mu'] = np.mean(samples_nestle[:,1]) # mean of c samples
resdict['cnestle_sig'] = np.std(samples_nestle[:,1]) # standard deviation of c samples
resdict['ccnestle'] = np.corrcoef(samples_nestle.T)[0,1] # correlation coefficient between parameters
resdict['nestle_npos'] = len(samples_nestle) # number of posterior samples
resdict['nestle_time'] = timenestle # run time
resdict['nestle_logZ'] = logZnestle # log marginalised likelihood
resdict['nestle_logZerr'] = logZerrnestle # uncertainty on log(Z)
print('Number of posterior samples is {}'.format(len(samples_nestle)))
Now we can plot the posteriors:
plotposts(samples_nestle)
With nested sampling the samples should be uncorrelated by construction (in reality, for some complex problems, some methods used to draw new samples within the nested sampling algorithm can leave correlations). During nested sampling, you can estimate the number of effective posterior samples from the full chain of samples using, for example, (an unnormalised version of) the effective sample size calculation given here.
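In terms of the (unnormalised) weights $w_i$ this is essentially the Kish effective sample size,
$$ N_{\rm eff} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}, $$which the function below computes in an algebraically equivalent form.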
def ess(weights):
"""
Estimate the effective sample size from the weights.
Args:
weights (array_like): an array of weights values for each nested sample
Returns:
int: the effective sample size
"""
N = len(weights)
w = weights / weights.sum()
ess = N / (1.0 + ((N * w - 1)**2).sum() / N)
return int(ess)
resdict['esnestle'] = ess(res.weights)
resdict['essnestle'] = int(resdict['esnestle'] / timenestle)
print("Effective samples per second: {}".format(resdict['essnestle']))
CPNest¶
CPNest (Veitch & del Pozzo, 2017) is another pure Python nested sampling algorithm. To perform the sampling from the restricted prior it uses a Metropolis-Hastings based MCMC, with sample proposal distributions that use a combination of the affine invariant ensemble methods described for emcee (Goodman & Weare, 2010), differential evolution, or jumps based on the eigenvectors of the covariance matrix of live points.
CPNest is available on PyPI and can be installed using pip
with:
pip install cpnest
or via Conda with:
conda install -c conda-forge cpnest
The source code is available on GitHub here.
To run CPNest we have to define a class containing the likelihood and prior functions. CPNest also requires explicit bounds on all parameters from which it will uniformly draw an initial set of live points. Prior to running the nested sampling algorithm, it will run an MCMC using the prior as the target distribution to make sure the samples are drawn from the required prior, so it does not require any mapping between the prior and a unit-hypercube.
# import the Model class from CPNest
from cpnest.model import Model
LN2PI = np.log(2.*np.pi)
# set the model, likelihood and prior
class StraightLineModel(Model):
"""
A simple straight line model, with a Gaussian likelihood.
Args:
data (:class:`numpy.ndarray`): an array containing the observed data
abscissa (:class:`numpy.ndarray`): an array containing the points at which the data were taken
modelfunc (function): a function defining the model
sigma (float): the standard deviation of the noise in the data
"""
names = ['m','c'] # parameter names (this is a required variable for the class)
# define the boundaries using for initial sampling of the live points
cmin = -10. # lower range on c (the same as the uniform c prior lower bound)
cmax = 10. # upper range on c (the same as the uniform c prior upper bound)
mmu = 0. # mean of the Gaussian prior on m
msigma = 10. # standard deviation of the Gaussian prior on m
cbounds = [cmin, cmax]
# set the m bounds to be +/- 10sigma about the mean
mbounds = [mmu-10.*msigma, mmu+10.*msigma]
# NOTE: the bounds could instead be set through arguments passed to the __init__
# function for the class if wanted
bounds=[mbounds, cbounds] # upper and lower bounds on each parameter (required for the class)
def __init__(self, data, abscissa, modelfunc, sigma):
# set the data
self._data = data # observed data
self._abscissa = abscissa # points at which the observed data are taken
self._sigma = sigma # standard deviation(s) of the data
self._logsigma = np.log(sigma) # log sigma here to save computations in the likelihood
self._ndata = len(data) # number of data points
self._model = modelfunc # model function
def log_likelihood(self, params):
"""
The natural logarithm of the likelihood function.
Args:
params (dict): a dictionary keyed to the parameter names.
Returns:
float: the log likelihood value.
"""
# extract the parameters
m = params['m']
c = params['c']
# calculate the model
model = self._model(x, m, c)
# normalisation
norm = -0.5*self._ndata*LN2PI - self._ndata*self._logsigma
# chi-squared
chisq = np.sum(((self._data - model)/(self._sigma))**2)
return norm - 0.5*chisq
def log_prior(self,p):
"""
The natural logarithm of the prior function.
Args:
p (dict): a dictionary keyed to parameter names.
Returns:
float: The log prior value.
"""
# the uniform priors are dealt with by just checking that we're within the bounds
if not self.in_bounds(p): return -np.inf # check parameters are all in bounds
# Gaussian prior on m
lp = 0.
m = p['m']
lp -= 0.5 * ((m - self.mmu) / self.msigma)**2 # no need for normalisation constant on the prior
return lp
We can now run CPNest. As with Nestle we will use 1024 live points. CPNest can use multiple CPU cores, but we'll set it to use one to be comparable to the other tests. The number of iterations used for the MCMC sampling is automatically worked out within the code by calculating autocorrelation lengths to see how many iterations are needed to produce independent samples, but a maximum chain length can also be set. We'll set this to 1024. As with Nestle a stopping criterion is required, which for CPNest is defined in the same way as for Nestle. The current CPNest default is 0.1 and this cannot be altered by the user.
%%capture
# import cpnest
import cpnest
print('CPNest version: {}'.format(cpnest.__version__))
nlive = 1024 # number of live point
maxmcmc = 1024 # maximum MCMC chain length
nthreads = 1 # use one CPU core
# set up the algorithm
work = cpnest.CPNest(
StraightLineModel(data, x, straight_line, sigma), verbose=0,
nthreads=nthreads, nlive=nlive, maxmcmc=maxmcmc
);
# run the algorithm
t0 = time()
work.run();
t1 = time()
timecpnest = (t1 - t0)
print("Time taken to run 'CPNest' is {} seconds".format(timecpnest))
We can again extract the value of $\ln{Z}$ and its uncertainty. One simple bit of model comparison that we may want to do is to compare the evidence for the model of the data containing a straight line (with the given priors on the parameters), with the evidence for a model in which the data just contains independent Gaussian noise with the given standard deviation. We can calculate the latter evidence easily by just setting the model to zero in the log-likelihood function, e.g.:
$$ \ln{Z}_{\rm noise} = -\frac{M}{2}\ln{(2\pi)} - \ln\left(\prod_{i=1}^M \sigma_i\right) - \sum_{i=1}^M\frac{d_i^2}{2\sigma_i^2}. $$The natural logarithm of the Bayes factor comparing the two models is then:
$$ \ln{\mathcal{B}} = \ln{Z} - \ln{Z}_{\rm noise}. $$We will also extract this value.
logZcpnest = work.NS.logZ # value of log Z
infogaincpnest = work.NS.state.info # value of the information gain
logZerrcpnest = np.sqrt(infogaincpnest/nlive) # estimate of the statistical uncertainty on logZ
# get the null log likelihood (evidence that the data is Gaussian noise with zero mean and the given standard deviation)
logZnull = work.user.log_likelihood({'m': 0., 'c': 0.})
print("log(Z) = {} ± {}".format(logZcpnest, logZerrcpnest))
# output the log Bayes factor
print('log Bayes factor is {}'.format(logZcpnest - logZnull))
CPNest will itself resample from the nested samples to give posterior samples held in a dictionary, which we can plot.
mchain_cpnest = work.posterior_samples['m'] # extract chain of m values
cchain_cpnest = work.posterior_samples['c'] # extract chain of c values
samples_cpnest = np.vstack((mchain_cpnest, cchain_cpnest)).T
# print out the number of posterior samples
print('Number of posterior samples is {}'.format(samples_cpnest.shape[0]))
resdict['mcpnest_mu'] = np.mean(samples_cpnest[:,0]) # mean of m samples
resdict['mcpnest_sig'] = np.std(samples_cpnest[:,0]) # standard deviation of m samples
resdict['ccpnest_mu'] = np.mean(samples_cpnest[:,1]) # mean of c samples
resdict['ccpnest_sig'] = np.std(samples_cpnest[:,1]) # standard deviation of c samples
resdict['cccpnest'] = np.corrcoef(samples_cpnest.T)[0,1] # correlation coefficient between parameters
resdict['cpnest_npos'] = len(samples_cpnest) # number of posterior samples
resdict['cpnest_time'] = timecpnest # run time
resdict['cpnest_logZ'] = logZcpnest # log marginalised likelihood
resdict['cpnest_logZerr'] = logZerrcpnest # uncertainty on log(Z)
# plot using corner.py
plotposts(samples_cpnest)
from cpnest.nest2pos import compute_weights
_, weights = compute_weights(work.get_nested_samples()["logL"], work.nlive)
weights = np.exp(weights)
resdict['escpnest'] = ess(weights)
resdict['esscpnest'] = int(resdict['escpnest'] / timecpnest)
print("Effective samples per second: {}".format(resdict['esscpnest']))
dynesty¶
dynesty (Speagle, 2019) is a pure Python-based implementation of the dynamic nested sampling algorithm (Higson et al, 2017). It is very similar in usage to Nestle. Dynamic nested sampling is a way of dynamically adapting the number of live points used to provide more accurate evidence estimates and/or larger numbers of independent posterior samples. The code can run using dynamic adjustment of live points or just in a "static" mode with a fixed number of live points. There are many options for how the new live points are drawn, both in terms of the bounds from which they are drawn and the sampling within those bounds. The bound options include using bounding ellipsoids in the style of MultiNest, or the RadFriends algorithm of UltraNest, among others. The sampling options include uniform draws from the bounds, random walks, or slice sampling in the style of PolyChord. If not explicitly specified, dynesty will select a sampling algorithm based on the number of dimensions of the problem being tackled.
dynesty is available on PyPI and can be installed with pip
using:
pip install dynesty
or using Conda with:
conda install -c conda-forge dynesty
Note: Other dynamic nested sampling codes exist such as dyPolyChord and
perfectns
.
The likelihood function and prior transform can be set up in a way that is identical to Nestle (we'll define both again below for completeness), with the differences being in how the sampler is called. Here we will show the sampling being used for both "static" and "dynamic" nested sampling.
# import ndtri (the inverse cumulative distribution function of the standard normal) from scipy
from scipy.special import ndtri
def prior_transform(theta):
"""
A function defining the transform between the parameterisation in the unit hypercube
to the true parameters.
Args:
theta (tuple): a tuple containing the parameters.
Returns:
tuple: a new tuple or array with the transformed parameters.
"""
mprime, cprime = theta # unpack the parameters (in their unit hypercube form)
cmin = -10. # lower bound on uniform prior on c
cmax = 10. # upper bound on uniform prior on c
mmu = 0. # mean of Gaussian prior on m
msigma = 10. # standard deviation of Gaussian prior on m
m = mmu + msigma*ndtri(mprime) # convert back to m
c = cprime*(cmax-cmin) + cmin # convert back to c
return (m, c)
# set the natural logarithm of 2pi, so that it doesn't have to be recalculated
LN2PI = np.log(2. * np.pi)
LNSIGMA = np.log(sigma) # natural log of the data noise standard deviation
def loglikelihood_dynesty(theta):
"""
The log-likelihood function.
"""
m, c = theta # unpack the parameters
# normalisation
norm = -0.5 * M * LN2PI - M * LNSIGMA
# chi-squared (data, sigma and x are global variables defined early on in this notebook)
chisq = np.sum(((data-straight_line(x, m, c))/sigma)**2)
return norm - 0.5 * chisq
# import dynesty
import dynesty
print('dynesty version: {}'.format(dynesty.__version__))
nlive = 1024 # number of live points
bound = 'multi' # use MultiNest algorithm for bounds
ndims = 2 # two parameters
sample = 'unif' # uniform sampling
tol = 0.1 # the stopping criterion
from dynesty import NestedSampler
sampler = NestedSampler(loglikelihood_dynesty, prior_transform, ndims,
bound=bound, sample=sample, nlive=nlive)
t0 = time()
sampler.run_nested(dlogz=tol, print_progress=False) # don't output progress bar
t1 = time()
timedynesty = (t1-t0)
print("Time taken to run 'dynesty' (in static mode) is {} seconds".format(timedynesty))
We can extract the value of the marginal likelihood, $\ln{Z}$, and its uncertainties with the following:
res = sampler.results # get results dictionary from sampler
logZdynesty = res.logz[-1] # value of logZ
logZerrdynesty = res.logzerr[-1] # estimate of the statistical uncertainty on logZ
print("log(Z) = {} ± {}".format(logZdynesty, logZerrdynesty))
Or, get a summary of this information using:
print(res.summary())
As with Nestle we need to draw posterior samples from the nested samples based on their prior weights.
# get function that resamples from the nested samples to give samples with equal weight
from dynesty.utils import resample_equal
# draw posterior samples
weights = np.exp(res['logwt'] - res['logz'][-1])
samples_dynesty = resample_equal(res.samples, weights)
resdict['mdynesty_mu'] = np.mean(samples_dynesty[:,0]) # mean of m samples
resdict['mdynesty_sig'] = np.std(samples_dynesty[:,0]) # standard deviation of m samples
resdict['cdynesty_mu'] = np.mean(samples_dynesty[:,1]) # mean of c samples
resdict['cdynesty_sig'] = np.std(samples_dynesty[:,1]) # standard deviation of c samples
resdict['ccdynesty'] = np.corrcoef(samples_dynesty.T)[0,1] # correlation coefficient between parameters
resdict['dynesty_npos'] = len(samples_dynesty) # number of posterior samples
resdict['dynesty_time'] = timedynesty # run time
resdict['dynesty_logZ'] = logZdynesty # log marginalised likelihood
resdict['dynesty_logZerr'] = logZerrdynesty # uncertainty on log(Z)
print('Number of posterior samples is {}'.format(len(samples_dynesty)))
# plot using corner.py
plotposts(samples_dynesty)
resdict['esdynesty'] = ess(np.exp(res["logwt"] - logZdynesty))
resdict['essdynesty'] = int(resdict['esdynesty'] / timedynesty)
print("Effective samples per second: {}".format(resdict['essdynesty']))
We can now do the same thing, but with the "dynamic" sampler. For this we can specify an initial number of live points (we will stick with 1024), but the actual number used will evolve. The stopping criterion for the dynamic method is more complex than for the static case, and we will just leave it at the default values.
# import the dynamic nested sampler
from dynesty import DynamicNestedSampler
# set up the sampler
dsampler = DynamicNestedSampler(loglikelihood_dynesty, prior_transform, ndims,
bound=bound, sample=sample)
# run the sampler (note that we have not included the dlogz tolerance)
t0 = time()
dsampler.run_nested(nlive_init=nlive, print_progress=False)
t1 = time()
timedynestydynamic = (t1-t0)
print("Time taken to run 'dynesty' (in dynamic mode) is {} seconds".format(timedynestydynamic))
Let's just compare the marginal likelihoods and numbers of posterior samples between the static and dynamic methods.
dres = dsampler.results # get results dictionary from sampler
logZdynestydynamic = dres.logz[-1] # value of logZ
logZerrdynestydynamic = dres.logzerr[-1] # estimate of the statistical uncertainty on logZ
print("Static: log(Z) = {} ± {}".format(logZdynesty, logZerrdynesty))
print("Dynamic: log(Z) = {} ± {}".format(logZdynestydynamic, logZerrdynestydynamic))
# draw posterior samples
dweights = np.exp(dres['logwt'] - dres['logz'][-1])
samples_dynestydynamic = resample_equal(dres.samples, dweights)
essdynestydynamic = ess(np.exp(dres["logwt"] - logZdynestydynamic))
print('Static: num. posterior samples: {}'.format(len(samples_dynesty)))
print('Dynamic: num. posterior samples: {}'.format(len(samples_dynestydynamic)))
print('Static: effective sample size: {}'.format(resdict['esdynesty']))
print('Dynamic: effective sample size: {}'.format(essdynestydynamic))
UltraNest¶
UltraNest is a Python-based implementation of nested sampling that contains several sampling algorithms designed with correctness of the sampling in mind. One particular method it provides is called RadFriends (Buchner, 2014), which is designed to be robust against effects caused by inaccuracies in drawing new samples from the constrained prior.
UltraNest is available from its GitHub repository, but can be installed from PyPI using
pip install ultranest
or using Conda with:
conda install -c conda-forge ultranest
Similarly to Nestle and dynesty, UltraNest samples from a unit hypercube, and therefore you have to define a prior transform function, so we can set up these functions in a very similar way to what we have seen above.
# define the prior transform
from scipy.special import ndtri
def prior_transform_ultranest(theta):
"""
A function defining the transform between the parameterisation in the unit hypercube
to the true parameters.
Args:
theta (list): a list/array containing the parameters.
Returns:
list: a new list/array with the transformed parameters.
"""
# unpack the parameters (in their unit hypercube form)
mprime = theta[0]
cprime = theta[1]
cmin = -10. # lower bound on uniform prior on c
cmax = 10. # upper bound on uniform prior on c
mmu = 0. # mean of Gaussian prior on m
msigma = 10. # standard deviation of Gaussian prior on m
m = mmu + msigma*ndtri(mprime) # convert back to m
c = cprime*(cmax-cmin) + cmin # convert back to c
return np.array([m, c])
We can again define the likelihood in a very similar way to Nestle.
LN2PI = np.log(2.*np.pi)
LNSIGMA = np.log(sigma)
# define the Gaussian likelihood function
def loglikelihood_ultranest(theta):
"""
The log-likelihood function.
"""
# unpack the parameters
m = theta[0]
c = theta[1]
# normalisation
norm = -0.5*M*LN2PI - M*LNSIGMA
# chi-squared (data, sigma and x are global variables defined early on in this notebook)
chisq = np.sum(((data-straight_line(x, m, c))/sigma)**2)
return norm - 0.5*chisq
To run UltraNest we'll use the ReactiveNestedSampler
method. Unlike other nested sampling methods, this does not require you to specify the number of live points to use. It will automatically adjust the number as required, starting with a default minimum of 400.
import ultranest
print('UltraNest version: {}'.format(ultranest.__version__))
# set the ReactiveNestedSampler method
sampler = ultranest.ReactiveNestedSampler(["m", "c"], loglikelihood_ultranest, prior_transform_ultranest)
ndims = 2 # two parameters
tol = 0.1 # the stopping criterion
# run the nested sampling algorithm
t0 = time()
result = sampler.run(dlogz=tol)
t1 = time()
timeultranest = (t1 - t0)
print("Time taken to run 'UltraNest' is {} seconds".format(timeultranest))
We can extract the value of the marginal likelihood, $\ln{Z}$, and its uncertainties with the following:
logZultranest = result['logz'] # value of logZ
logZerrultranest = result['logzerr'] # estimate of the statistical uncertainty on logZ
# output marginal likelihood
print("Marginalised evidence is {} ± {}".format(logZultranest, logZerrultranest))
We can now resample from the nested samples to get the posterior samples.
points = np.array(result["weighted_samples"]["points"])  # the nested samples
weights = np.array(result["weighted_samples"]["weights"])  # their posterior weights
# rejection sample to get approximately equally-weighted posterior samples
scaledweights = weights / weights.max()
mask = np.random.rand(len(scaledweights)) < scaledweights
samples_ultranest = points[mask, :]
resdict['multranest_mu'] = np.mean(samples_ultranest[:,0]) # mean of m samples
resdict['multranest_sig'] = np.std(samples_ultranest[:,0]) # standard deviation of m samples
resdict['cultranest_mu'] = np.mean(samples_ultranest[:,1]) # mean of c samples
resdict['cultranest_sig'] = np.std(samples_ultranest[:,1]) # standard deviation of c samples
resdict['ccultranest'] = np.corrcoef(samples_ultranest.T)[0,1] # correlation coefficient between parameters
resdict['ultranest_npos'] = len(samples_ultranest) # number of posterior samples
resdict['ultranest_time'] = timeultranest # run time
resdict['ultranest_logZ'] = logZultranest # log marginalised likelihood
resdict['ultranest_logZerr'] = logZerrultranest # uncertainty on log(Z)
# plot using corner.py
plotposts(samples_ultranest)
print('Number of posterior samples is {}'.format(samples_ultranest.shape[0]))
resdict['esultranest'] = ess(weights)
resdict['essultranest'] = int(resdict['esultranest'] / timeultranest)
print("Effective samples per second: {}".format(resdict['essultranest']))
We will now look at some nested sampling algorithms that are not pure Python, but do have Python wrappers.
PyMultiNest¶
PyMultiNest (Buchner et al, 2014) is a wrapper to the popular MultiNest code (Feroz & Hobson, 2008 & Feroz et al, 2009), which has been particularly taken up by the astrophysics and cosmology community. The basics underlying MultiNest were described in relation to Nestle, although MultiNest can use more sophisticated sampling methods, such as importance nested sampling (Feroz et al, 2014).
To use PyMultiNest, the MultiNest library itself must be installed (MultiNest can be downloaded from GitHub here). Installation of PyMultiNest and its requirements are described here; once MultiNest is installed you can then install PyMultiNest using pip
with:
pip install pymultinest
A simpler solution, that comes bundled with MultiNest, is to install using Conda with:
conda install -c conda-forge pymultinest
Note: If using MultiNest, and software wrapping it, for your work please note the software license restrictions here.
Some examples of running PyMultiNest can be found here, although an interface through a solver
function and Solver
class are also available. We'll use the Solver
class method, which can take in the same initialisation keyword arguments as the run()
method.
We use the Solver
class to create a new class for our specific example. As with CPNest we use this class to define our log likelihood function, and like Nestle we define a prior transform to go from a unit hypercube to the true parameterisation. These functions can only take in arrays of the parameter values that we are exploring and cannot take in additional arguments. Therefore any additional values required must either be global variables, or a new __init__
method for the class must be defined. We'll show the latter of these.
Note: PyMultiNest is also available in Python through the Monte Python package.
# import the Solver class
import pymultinest
from pymultinest.solve import Solver
from scipy.special import ndtri
LN2PI = np.log(2.*np.pi)
class StraightLineModelPyMultiNest(Solver):
"""
A simple straight line model, with a Gaussian likelihood.
Args:
data (:class:`numpy.ndarray`): an array containing the observed data
abscissa (:class:`numpy.ndarray`): an array containing the points at which the data were taken
modelfunc (function): a function defining the model
sigma (float): the standard deviation of the noise in the data
**kwargs: keyword arguments for the run method
"""
# define the prior parameters
cmin = -10. # lower range on c (the same as the uniform c prior lower bound)
cmax = 10. # upper range on c (the same as the uniform c prior upper bound)
mmu = 0. # mean of the Gaussian prior on m
msigma = 10. # standard deviation of the Gaussian prior on m
def __init__(self, data, abscissa, modelfunc, sigma, **kwargs):
# set the data
self._data = data # observed data
self._abscissa = abscissa # points at which the observed data are taken
self._sigma = sigma # standard deviation(s) of the data
self._logsigma = np.log(sigma) # log sigma here to save computations in the likelihood
self._ndata = len(data) # number of data points
self._model = modelfunc # model function
Solver.__init__(self, **kwargs)
def Prior(self, cube):
"""
The prior transform going from the unit hypercube to the true parameters. This function
has to be called "Prior".
Args:
cube (:class:`numpy.ndarray`): an array of values drawn from the unit hypercube
Returns:
:class:`numpy.ndarray`: an array of the transformed parameters
"""
# extract values
mprime = cube[0]
cprime = cube[1]
m = self.mmu + self.msigma*ndtri(mprime) # convert back to m
c = cprime*(self.cmax-self.cmin) + self.cmin # convert back to c
return np.array([m, c])
def LogLikelihood(self, cube):
"""
The log likelihood function. This function has to be called "LogLikelihood".
Args:
cube (:class:`numpy.ndarray`): an array of parameter values.
Returns:
float: the log likelihood value.
"""
# extract parameters
m = cube[0]
c = cube[1]
# calculate the model
model = self._model(x, m, c)
# normalisation
norm = -0.5*self._ndata*LN2PI - self._ndata*self._logsigma
# chi-squared
chisq = np.sum(((self._data - model)/(self._sigma))**2)
return norm - 0.5*chisq
Now we can run the sampler. As before we'll use 1024 live points, and like with Nestle we'll use a stopping criterion of 0.1. By default the sampler will use importance sampling.
nlive = 1024 # number of live points
ndim = 2 # number of parameters
tol = 0.1 # stopping criterion
# run the algorithm
t0 = time()
solution = StraightLineModelPyMultiNest(data, x, straight_line, sigma, n_dims=ndim,
n_live_points=nlive, evidence_tolerance=tol);
t1 = time()
timepymultinest = (t1-t0)
print("Time taken to run 'PyMultiNest' is {} seconds".format(timepymultinest))
Now we can extract the value of $\ln{Z}$ and its uncertainty.
logZpymnest = solution.logZ # value of log Z
logZerrpymnest = solution.logZerr # estimate of the statistical uncertainty on logZ
print("log(Z) = {} ± {}".format(logZpymnest, logZerrpymnest))
These solutions, and the mean and standard deviation of the $m$ and $c$ parameters, can also be shown by using:
print(solution)
We can extract and plot the posteriors. The solution.samples
value contains an array of the posterior samples rather than the nested samples, so no resampling is required.
mchain_pymnest = solution.samples[:,0] # extract chain of m values
cchain_pymnest = solution.samples[:,1] # extract chain of c values
samples_pymnest = np.vstack((mchain_pymnest, cchain_pymnest)).T
resdict['mpymnest_mu'] = np.mean(samples_pymnest[:,0]) # mean of m samples
resdict['mpymnest_sig'] = np.std(samples_pymnest[:,0]) # standard deviation of m samples
resdict['cpymnest_mu'] = np.mean(samples_pymnest[:,1]) # mean of c samples
resdict['cpymnest_sig'] = np.std(samples_pymnest[:,1]) # standard deviation of c samples
resdict['ccpymnest'] = np.corrcoef(samples_pymnest.T)[0,1] # correlation coefficient between parameters
resdict['pymnest_npos'] = len(samples_pymnest) # number of posterior samples
resdict['pymnest_time'] = timepymultinest # run time
resdict['pymnest_logZ'] = logZpymnest # log marginalised likelihood
resdict['pymnest_logZerr'] = logZerrpymnest # uncertainty on log(Z)
# print out the number of posterior samples
print('Number of posterior samples is {}'.format(samples_pymnest.shape[0]))
# plot using corner.py
plotposts(samples_pymnest)
# note: I'm not calculating the correct ESS here as I can't access the weights
resdict['espymnest'] = len(samples_pymnest)
resdict['esspymnest'] = int(resdict['espymnest'] / timepymultinest)
print("Effective samples per second: {}".format(resdict['esspymnest']))
DNest4¶
DNest4 (Brewer & Foreman-Mackey, 2016) uses a nested sampling algorithm called Diffusive Nested Sampling (Brewer et al, 2011). The main difference between this method and others is that it can at any point still propose samples throughout the whole prior, rather than being constrained to draw points within the current minimum likelihood bound. This allows it to find volumes of high likelihood that might be missed by other methods because they lie in an area where no live points remain, i.e., it is good for sampling narrow multi-modal posteriors. It can also be used in cases where the number of parameters in the model that is being fit is itself a variable, e.g., if the line model could either be a straight line, or a quadratic, or a cubic. The method used is called Reversible Jump MCMC (Green, 1995), although this is not currently available in the Python interface for DNest4.
An example of using the Python wrapper for DNest4 for a straight line model is given here, but we will reproduce it with minor modifications (in that example $\sigma$ is also included in the fitting).
You can install the development version of DNest4 by cloning it from its GitHub repository and following the instructions here. However, the Python interface can be installed from PyPI via pip with
pip install dnest4
or through Conda with
conda install -c conda-forge dnest4
For DNest4 we need to create a Model
class that contains the following methods:
from_prior()
: a function that draws a sample from the prior distributions of the parametersperturb()
: a function that proposes new sample valueslog_likelihood()
: the natural logarithm of the likelihood function
The perturb()
method also needs to calculate the natural logarithm of the Metropolis-Hastings ratio required to give the acceptance probability,
$$ H = \frac{p(\vec{\theta}'|I)}{p(\vec{\theta}|I)} \frac{q(\vec{\theta}|\vec{\theta}')}{q(\vec{\theta}'|\vec{\theta})}, $$where the first ratio on the right-hand side is the ratio of priors at the proposed point $\vec{\theta}'$ and current point $\vec{\theta}$, and the second is the ratio of the proposal distributions. In our case the proposal distribution is symmetric, so its ratio is unity, and the prior ratio depends on the Gaussian prior on $m$, but not on the uniform prior on $c$.
Parameters cannot be passed to the class on initialisation, so things such as the data must be global variables.
# import the randh function, which draws points from a heavy-tailed distribution, and wrap, which wraps a point within a given range
from dnest4 import randh, wrap
LN2PI = np.log(2.*np.pi)
LNSIGMA = np.log(sigma)
# define prior range values as global variables
cmin = -10. # lower range on c (the same as the uniform c prior lower bound)
cmax = 10. # upper range on c (the same as the uniform c prior upper bound)
mmu = 0. # mean of the Gaussian prior on m
msigma = 10. # standard deviation of the Gaussian prior on m
class DNest4Model(object):
"""
Specify the model in Python.
"""
def __init__(self):
"""
Parameter values *are not* stored inside the class
"""
pass
def from_prior(self):
"""
Draw points from the prior distribution.
Returns:
:class:`np.ndarray`: a numpy array containing the parameter values.
"""
m = np.random.normal(mmu, msigma)
c = np.random.uniform(cmin, cmax)
return np.array([m, c])
def perturb(self, params):
"""
Perturb the current parameters by proposing a new position. This takes a numpy array of
parameters as input, and modifies it in-place. It just perturbs one parameter at a time.
Args:
params (:class:`np.ndarray`): the current parameters (this is modified by the function)
Returns:
float: the natural logarithm of the Metropolis-Hastings proposal ratio.
"""
logH = 0.0 # log of the Metropolis-Hastings prior x proposal ratio
# randomly choose which parameter to perturb
which = np.random.randint(len(params))
mmu = 0.
msigma = 10.
if which == 0:
# update H for Gaussian prior
logH -= -0.5*((params[which]-mmu)/msigma)**2
params[which] += 1.*randh() # we could scale this or use different distributions for each parameter, but in our case we know our distribution is quite compact and unimodal
if which == 0:
# update H for Gaussian prior
logH += -0.5*((params[which]-mmu)/msigma)**2
else:
# wrap c value so that it stays within the prior range
params[which] = wrap(params[which], cmin, cmax)
return logH
def log_likelihood(self, params):
"""
Gaussian sampling distribution.
"""
m, c = params # unpack parameters
norm = -0.5*M*LN2PI - M*LNSIGMA
chisq = np.sum(((data - straight_line(x, m, c))/sigma)**2)
return norm - 0.5*chisq
We can now run the sampler. The values that are passed to the sampler are described in Section 7 of Brewer & Foreman-Mackey, 2016, but we'll stick with the defaults used in the example code. Unlike with the other nested sampling algorithms we do not pass a tolerance for the stopping criterion, but just specify the number of levels and steps to perform. The num_steps
gives the total number of MCMC iterations; if it is not set, the code will by default run until the process is killed by the user.
# import dnest4
import dnest4
print('DNest4 version: {}'.format(dnest4.__version__))
# Create a model object and a sampler
model = DNest4Model()
sampler = dnest4.DNest4Sampler(model, backend=dnest4.backends.CSVBackend(".", sep=" "))
# Set up the sampler. The first argument is max_num_levels
gen = sampler.sample(max_num_levels=30, num_steps=1000, new_level_interval=10000,
num_per_step=10000, thread_steps=100, num_particles=5, lam=10, beta=100)
# Do the sampling (one iteration here = one particle save)
t0 = time()
for sample in enumerate(gen):
pass
t1 = time()
timednest4 = (t1-t0)
print("Time taken to run 'DNest4' is {} seconds".format(timednest4))
To create the posterior samples the output of DNest4 has to be processed, which can be performed using the postprocess()
function. Once this is done the posterior samples will be held in a posterior_sample.txt
file containing a column for each parameter. As stated in Brewer & Foreman-Mackey, 2016:
"it is harder to compute justified error bars on $\ln{(Z)}$ in DNS than it is in standard NS."
so we will not quote a value here.
# Run the postprocessing to get marginal likelihood and generate posterior samples
logZdnest4, infogaindnest4, _ = dnest4.postprocess(plot=False);
samples_dnest4 = np.loadtxt('posterior_sample.txt')
Now let's plot the posteriors (noting that the effective number of posterior samples is quite a bit smaller than for the other samplers).
# print out the number of posterior samples
print('Number of posterior samples is {}'.format(samples_dnest4.shape[0]))
resdict['mdnest4_mu'] = np.mean(samples_dnest4[:,0]) # mean of m samples
resdict['mdnest4_sig'] = np.std(samples_dnest4[:,0]) # standard deviation of m samples
resdict['cdnest4_mu'] = np.mean(samples_dnest4[:,1]) # mean of c samples
resdict['cdnest4_sig'] = np.std(samples_dnest4[:,1]) # standard deviation of c samples
resdict['ccdnest4'] = np.corrcoef(samples_dnest4.T)[0,1] # correlation coefficient between parameters
resdict['dnest4_npos'] = len(samples_dnest4) # number of posterior samples
resdict['dnest4_time'] = timednest4 # run time
resdict['dnest4_logZ'] = logZdnest4 # log marginalised likelihood
# plot using corner.py
plotposts(samples_dnest4)
# note: I'm not calculating the correct ESS here as I can't access the weights
resdict['esdnest4'] = len(samples_dnest4)
resdict['essdnest4'] = len(samples_dnest4) / timednest4
print("Effective samples per second: {0:.2f}".format(resdict['essdnest4']))
PyPolyChord¶
PolyChord (Handley et al, 2015a, Handley et al, 2015b) is a nested sampling implementation that uses slice sampling when drawing new points from the constrained prior.
The PolyChord package itself provides a Python interface called PyPolyChord for using the algorithm. PolyChord can be downloaded from Github here with:
git clone https://github.com/PolyChord/PolyChordLite.git
PyPolyChord can then be installed with
cd PolyChordLite
python setup.py install
The installation will attempt to find a copy of the libmpi.so
library. If you have not got MPI (or are having further problems with MPI) you can disable it by instead installing with:
python setup.py --no-mpi install
As with Nestle and PyMultiNest, PolyChord samples from a unit hypercube, so you need to define a prior transform function. We also need a log likelihood function. These can be very similar to those used for Nestle, but the functions should take in and return lists rather than NumPy arrays, and the likelihood function must also return a list of any derived parameters (an empty list if there are none). Values such as the data array cannot be passed to the functions explicitly, so must be set as global variables.
Note: If using PolyChord for your work, please note the license restrictions described here.
from scipy.special import ndtri
LN2PI = np.log(2. * np.pi)
LNSIGMA = np.log(sigma)
def prior_transform_polychord(cube):
"""
A function defining the tranform between the parameterisation in the unit hypercube
to the true parameters.
Args:
cube (array, list): a list containing the parameters as drawn from a unit hypercube.
Returns:
list: the transformed parameters.
"""
# unpack the parameters (in their unit hypercube form)
mprime = cube[0]
cprime = cube[1]
cmin = -10. # lower bound on uniform prior on c
cmax = 10. # upper bound on uniform prior on c
mmu = 0. # mean of Gaussian prior on m
msigma = 10. # standard deviation of Gaussian prior on m
m = mmu + msigma * ndtri(mprime) # convert back to m
c = cprime * (cmax - cmin) + cmin # convert back to c
theta = [m, c]
return theta
def loglikelihood_polychord(theta):
"""
The log-likelihood function.
Args:
theta (array, list): the set of parameter values.
Returns:
float: the log-likelihood value.
list: A list of any derived parameters (an empty list if there are none).
"""
# unpack the parameters
m = theta[0]
c = theta[1]
# normalisation
norm = -0.5 * M * LN2PI - M*LNSIGMA
# chi-squared (data, sigma and x are global variables defined early on in this notebook)
chisq = np.sum(((data-straight_line(x, m, c)) / sigma)**2)
return norm - 0.5 * chisq, []
As before we'll use 1024 live points (the default is to use 25 times the dimensionality of the problem) and a stopping criterion of 0.1 (the default value is 0.001). The current version of PyPolyChord will return values of the marginal likelihood and its uncertainty from the PyPolyChord
object that runs the algorithm, but the class does not currently contain the posterior samples. It instead outputs these to files. The location of these files is set by setting the keyword arguments base_dir
(defaults to "chains
") and file_root
(defaults to "test
") in a PolyChordSettings
object. This sets {root} = {base_dir}/{file_root}
, and within this the posterior samples can be found in {root}.txt
and the other statistics found in {root}.stats
.
You may find that the code produces a "Segmentation fault" soon after it starts running. The PolyChord documentation says:
"The slice sampling & clustering steps use a recursive procedure. The default memory allocated to recursive procedures is embarassingly small (to guard against memory leaks)."
and the solution is to:
Try increasing the stack size:
Linux:ulimit -s unlimited
OSX:ulimit -s hard
Alternatively, within a Python script you can achieve this using the resource
module, as shown below.
import resource
curlimit = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (resource.RLIM_INFINITY,resource.RLIM_INFINITY))
nlive = 1024 # number of live points
ndims = 2 # number of parameters
nderived = 0 # number of derived parameters (this is zero)
tol = 0.1 # stopping criterion
basedir = os.path.join(os.getcwd(), 'polychord') # output base directory
if not os.path.isdir(basedir):
os.makedirs(basedir) # create base directory
os.makedirs(os.path.join(basedir, 'clusters'))
fileroot = 'straightline' # output file name
broot = os.path.join(basedir, fileroot)
# import PolyChord
import pypolychord
from pypolychord.settings import PolyChordSettings # class for passing setup information
from pypolychord.priors import UniformPrior # pre-defined class for a uniform prior
# setup run settings using the PolyChordSetting class
pargs = {'nlive': nlive,
'precision_criterion': tol,
'base_dir': basedir,
'file_root': fileroot,
'write_resume': False, # don't output a resume file
'read_resume': False} # don't read a resume file
settings = PolyChordSettings(ndims, nderived, **pargs)
t0 = time()
output = pypolychord.run_polychord(loglikelihood_polychord, ndims, nderived, settings, prior_transform_polychord)
t1 = time()
timepolychord = (t1-t0)
print("Time taken to run 'PyPolyChord' is {} seconds".format(timepolychord))
# reset stack resource limit
resource.setrlimit(resource.RLIMIT_STACK, curlimit)
We can extract the marginal likelihood and the uncertainty from the output
variable.
print("log(Z) = {} ± {}".format(output.logZ, output.logZerr))
We can also extract the posterior samples from the file with the _equal_weights.txt
suffix. Given we have two parameters, the chains for these will be the final two columns within the samples file.
samplefile = broot+'_equal_weights.txt'
samples_polychord = np.loadtxt(samplefile)
samples_polychord = samples_polychord[:,-ndims:] # extract the last 'ndims' columns
# print out the number of posterior samples
print('Number of posterior samples is {}'.format(samples_polychord.shape[0]))
resdict['mpolychord_mu'] = np.mean(samples_polychord[:,0]) # mean of m samples
resdict['mpolychord_sig'] = np.std(samples_polychord[:,0]) # standard deviation of m samples
resdict['cpolychord_mu'] = np.mean(samples_polychord[:,1]) # mean of c samples
resdict['cpolychord_sig'] = np.std(samples_polychord[:,1]) # standard deviation of c samples
resdict['ccpolychord'] = np.corrcoef(samples_polychord.T)[0,1] # correlation coefficient between parameters
resdict['polychord_npos'] = len(samples_polychord) # number of posterior samples
resdict['polychord_time'] = timepolychord # run time
resdict['polychord_logZ'] = output.logZ # log marginalised likelihood
resdict['polychord_logZerr'] = output.logZerr # uncertainty on log(Z)
# plot using corner.py
plotposts(samples_polychord)
weights = np.loadtxt(broot+'.txt')[:,0] # get weights
resdict['espolychord'] = ess(weights)
resdict['esspolychord'] = int(resdict['espolychord'] / timepolychord)
print("Effective samples per second: {}".format(resdict['esspolychord']))
One code to rule them all!¶
There are a variety of different styles, syntaxes and formats required to use the various packages described above, so wouldn't it be nice to have a unified interface to access a range of different samplers?
Well, there are in fact a few different packages that allow you to use more than one of the above samplers, e.g., Monte Python, cronus, Gleipnir and bilby (others may, and probably do, exist).
bilby¶
Of these, bilby (Ashton et al, 2019) provides access to the widest range of samplers (I may be biased as one of the developers), including both MCMC and nested sampling packages. An example of using bilby for a linear regression model can be found here, but I'll reproduce something equivalent to the example in this notebook below.
Bilby has a selection of built-in likelihood functions and prior distributions. You can define your own likelihood, but here we'll use the built-in Gaussian likelihood function (the source code for which can be found here). By default, bilby will use the dynesty sampler, but we'll explicitly set it in this example.
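For reference, a custom likelihood just needs to subclass bilby.Likelihood, pass a dictionary of parameter names to the parent constructor, and provide a log_likelihood() method that reads from self.parameters; below is a minimal sketch, intended to be equivalent to the built-in Gaussian likelihood used afterwards (it assumes the x, data and sigma globals defined earlier in this notebook).
import numpy as np
import bilby
# a minimal sketch of a custom Gaussian likelihood class (equivalent to the built-in GaussianLikelihood used below)
class MyGaussianLikelihood(bilby.Likelihood):
    def __init__(self, x, y, sigma):
        super().__init__(parameters={"m": None, "c": None})  # names of the model parameters
        self.x = x          # abscissa values
        self.y = y          # observed data
        self.sigma = sigma  # known noise standard deviation
    def log_likelihood(self):
        m = self.parameters["m"]
        c = self.parameters["c"]
        residual = self.y - (m * self.x + c)
        return -0.5 * np.sum((residual / self.sigma)**2 + np.log(2. * np.pi * self.sigma**2))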
%%capture
import bilby
# import the likelihood function and prior function
from bilby.core.likelihood import GaussianLikelihood
from bilby.core.prior import Gaussian, Uniform
# define the likelihood function
likelihood = bilby.core.likelihood.GaussianLikelihood(x, data, straight_line, sigma=sigma)
cmin = -10. # lower range of uniform distribution on c
cmax = 10. # upper range of uniform distribution on c
mmu = 0. # mean of Gaussian distribution on m
msigma = 10. # standard deviation of Gaussian distribution on m
# define the priors in a dictionary
priors = {
"m": Gaussian(mmu, msigma, name="m"),
"c": Uniform(cmin, cmax, name="c"),
}
# run the sampler (using dynesty)
bilbyresult = bilby.run_sampler(
likelihood=likelihood,
priors=priors,
outdir="bilby", # the directory to output the results to
label="bilby", # a label to apply to the output file
plot=False, # by default this is True and a corner plot will be produced
injection_parameters={"m": m, "c": c}, # (optional) true values for adding to plots
sampler='dynesty', # set the name of the sampler
nlive=1024, # add in any keyword arguments used by that sampler
sample='unif',
)
The bilby Result
object contains all the posterior samples (as a Pandas data frame) and has methods and attributes to extract the required information. We can use these to produce a corner plot:
bilbyresult.plot_corner()
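The posterior samples sit in the result's posterior attribute as a Pandas DataFrame, and the evidence estimate is also stored on the result, so summary values can be pulled out directly, e.g. (attribute names as in current bilby releases):
# extract summary statistics from the posterior DataFrame and the evidence estimate
print("m = {:.3f} ± {:.3f}".format(bilbyresult.posterior["m"].mean(), bilbyresult.posterior["m"].std()))
print("c = {:.3f} ± {:.3f}".format(bilbyresult.posterior["c"].mean(), bilbyresult.posterior["c"].std()))
print("log(Z) = {} ± {}".format(bilbyresult.log_evidence, bilbyresult.log_evidence_err))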
Summary¶
The timings given in this notebook are not directly comparable as the various codes have not been run with identical settings, and the number of likelihood evaluations produced can be quite different. But, they may be helpful in providing a rough ball-park of expected run times. As mentioned in relation to emcee above, it is important to note that different samplers will produce different numbers of truly independent samples, so some methods that take longer to run may actually give more independent samples. Therefore comparisons of the number of effective samples per second may be more useful, although these will still be quite problem dependent. It is also worth noting that some of the codes can make use of multiple CPU cores, or GPUs, but these have not been used here. So, speed-ups could be achieved by using these in-built parallelisations.
The number of posterior samples produced by the nested sampling codes can depend strongly on the number of initial live points used. However, for the codes where we have specified a number of live points the final posterior sizes are very comparable. It should be noted that comparing posterior sample sizes and run times for DNest4 in particular is not entirely fair, as it is designed with more complex problems in mind and is not created to be efficient at simple examples where other methods work easily.
Something that is also worth noting, in particular with regard to nested sampling algorithms, is that they will generally take longer to run when there is a larger degree of difference between the prior and the posterior (i.e., the data contains a lot of new information, which, in technical terms, is called having a large Kullback-Leibler (KL) divergence). So, estimating the parameters of a model when the data contains a high signal-to-noise ratio signal will often take longer than the same model but with a weaker signal in the data.
Now, let's overplot everything (it will look a bit horrific)!
from matplotlib.pyplot import cm
from matplotlib.colors import to_hex
# get a list of all the samples
chains = [samples_emcee, samples_pymc3, samples_zeus, samples_tfp, samples_pystan,
samples_pyjags, samples_nestle, samples_cpnest, samples_dynesty, samples_ultranest,
samples_pymnest, samples_dnest4, samples_polychord]
color = cm.gnuplot(np.linspace(0, 1, len(chains)))
hist2dkwargs = {'plot_datapoints': False,
'plot_density': False,
'levels': 1.0 - np.exp(-0.5 * np.arange(1.5, 2.1, 0.5) ** 2)} # roughly 1 and 2 sigma
for i, chain in enumerate(chains):
cv = to_hex(color[i])
if i == 0:
fig = corner.corner(
chain,
labels=[r"$m$", r"$c$"],
color=cv,
hist_kwargs={'density': True},
**hist2dkwargs,
)
else:
if i < len(chains)-1:
corner.corner(
chain,
labels=[r"$m$", r"$c$"],
fig=fig,
color=cv,
hist_kwargs={'density': True},
**hist2dkwargs,
)
else: # plot true values on last plot
corner.corner(
chain,
labels=[r"$m$", r"$c$"],
truths=[m, c],
fig=fig,
color=cv,
hist_kwargs={'density': True},
**hist2dkwargs,
)
fig.set_size_inches(10.0, 9.0)
from IPython.display import Markdown
outputtable = """
| Sampler | $m$ | $c$ | corr. coeff | num. effective samples | effective samples / s |
|--- | :---: | :---: | :---: | :---: | :---: |
| emcee | {memcee_mu:.3f} ± {memcee_sig:.3f} | {cemcee_mu:.3f} ± {cemcee_sig:.3f} | {ccemcee:.3f} | {esemcee:d} | {essemcee:d} |
| PyMC3 | {mpymc3_mu:.3f} ± {mpymc3_sig:.3f} | {cpymc3_mu:.3f} ± {cpymc3_sig:.3f} | {ccpymc3:.3f} | {espymc3:d} | {esspymc3:d} |
| TFP | {mtfp_mu:.3f} ± {mtfp_sig:.3f} | {ctfp_mu:.3f} ± {ctfp_sig:.3f} | {cctfp:.3f} | {estfp:d} | {esstfp:d} |
| zeus | {mzeus_mu:.3f} ± {mzeus_sig:.3f} | {czeus_mu:.3f} ± {czeus_sig:.3f} | {cczeus:.3f} | {eszeus:d} | {esszeus:d} |
| PyStan | {mpystan_mu:.3f} ± {mpystan_sig:.3f} | {cpystan_mu:.3f} ± {cpystan_sig:.3f} | {ccpystan:.3f} | {espystan:d} | {esspystan:d} |
| PyJAGS | {mpyjags_mu:.3f} ± {mpyjags_sig:.3f} | {cpyjags_mu:.3f} ± {cpyjags_sig:.3f} | {ccpyjags:.3f} | {espyjags:d} | {esspyjags:d} |
| Nestle | {mnestle_mu:.3f} ± {mnestle_sig:.3f} | {cnestle_mu:.3f} ± {cnestle_sig:.3f} | {ccnestle:.3f} | {esnestle:d} | {essnestle:d} |
| CPNest | {mcpnest_mu:.3f} ± {mcpnest_sig:.3f} | {ccpnest_mu:.3f} ± {ccpnest_sig:.3f} | {cccpnest:.3f} | {escpnest:d} | {esscpnest:d} |
| dynesty | {mdynesty_mu:.3f} ± {mdynesty_sig:.3f} | {cdynesty_mu:.3f} ± {cdynesty_sig:.3f} | {ccdynesty:.3f} | {esdynesty:d} | {essdynesty:d} |
| UltraNest | {multranest_mu:.3f} ± {multranest_sig:.3f} | {cultranest_mu:.3f} ± {cultranest_sig:.3f} | {ccultranest:.3f} | {esultranest:d} | {essultranest:d} |
| PyMultiNest | {mpymnest_mu:.3f} ± {mpymnest_sig:.3f} | {cpymnest_mu:.3f} ± {cpymnest_sig:.3f} | {ccpymnest:.3f} | {espymnest:d} | {esspymnest:d} |
| DNest4 | {mdnest4_mu:.3f} ± {mdnest4_sig:.3f} | {cdnest4_mu:.3f} ± {cdnest4_sig:.3f} | {ccdnest4:.3f} | {esdnest4:.2f} | {essdnest4:.2f} |
| PyPolyChord | {mpolychord_mu:.3f} ± {mpolychord_sig:.3f} | {cpolychord_mu:.3f} ± {cpolychord_sig:.3f} | {ccpolychord:.3f} | {espolychord:d} | {esspolychord:d} |
"""
Markdown(outputtable.format(**resdict))
outputnest = """
#### Nested sampling
| Sampler | $\ln{{Z}}$ | num. posterior samples | run time (s) |
| --- | --- | :---: | --- |
| Nestle | {nestle_logZ:.3f} ± {nestle_logZerr:.3f} | {nestle_npos} | {nestle_time:.3f} |
| CPNest | {cpnest_logZ:.3f} ± {cpnest_logZerr:.3f} | {cpnest_npos} | {cpnest_time:.3f} |
| dynesty | {dynesty_logZ:.3f} ± {dynesty_logZerr:.3f} | {dynesty_npos} | {dynesty_time:.3f} |
| UltraNest | {ultranest_logZ:.3f} ± {ultranest_logZerr:.3f} | {ultranest_npos} | {ultranest_time:.3f} |
| PyMultiNest | {pymnest_logZ:.3f} ± {pymnest_logZerr:.3f} | {pymnest_npos} | {pymnest_time:.3f} |
| DNest4 | {dnest4_logZ:.3f} | {dnest4_npos} | {dnest4_time:.3f} |
| PyPolyChord| {polychord_logZ:.3f} ± {polychord_logZerr:.3f} | {polychord_npos} | {polychord_time:.3f} |
"""
Markdown(outputnest.format(**resdict))
import cpuinfo
print('This was run on a single core "{}" CPU'.format(cpuinfo.get_cpu_info()['brand_raw']))
Running this notebook¶
I have created a Docker image, available on Quay.io, in which Python 3.7 and all the above samplers are installed (this is roughly 5 GB in size, so may take some time to download). This can be used to run this notebook in Jupyter. You can download this notebook from here.
If you have Docker installed then under Linux, Mac OS or Windows you can run the notebook by opening a terminal (e.g., the Powershell in Windows) and running:
sudo docker run -i -t -p 8888:8888 -v ${notebook_location}:/samplers quay.io/mattpitkin/samplers-demo:latest /bin/bash -c "jupyter notebook --notebook-dir=/samplers --ip='*' --port=8888 --no-browser --allow-root --MultiKernelManager.default_kernel_name=Samplers"
where {notebook_location}
is replaced with the directory path containing the downloaded notebook. This will provide a link that can be pasted into your browser to show the notebook, and the Samplers.ipynb
file can be selected. Under Windows remove the sudo
from the above command.