Some Important and Weighty Stuff in Statistics

Feb 08 2011 Published by under Education, Statistics

(note: to create maximal puzzlement I have cross-posted this at my other blog)

Right, now let's talk about methods for stochastic numerical integration, in particular calculating importance weights. This is a technique that is probably underused, largely because...

Why are you staring at the screen like that?

OK, OK. I'm sorry for putting up such a technical post,but I was going to write this up for a colleague, and then thought that there might be 3 other people in this world that would also find it interesting.

The problem we're facing is integrating over probability distributions. For example, if we want to calculate the mean of a distribution, we need to calculate this:

\[mu = int theta p(theta) dtheta\]

where p(θ) is the probability distribution we want to integrate over. More generally, for any function f(θ) we want to calculate

\[omega = int f(theta) p(theta) dtheta\]

which is just the mean of f(θ). f(θ) could be anything from am mean or standard deviation to a p-value or a prediction from a model, like predicting Saturday's football results. Underneath this, we need to estimate the probability density, p(θ).

As an example, let's consider a Beta distribution (all of the R code I used here - plus some extras - are in this text file over on my other blog). The probability density function is

\[f(x sim alpha, beta) = frac{Gamma(alpha+beta)}{Gamma(alpha)Gamma(beta)} x^{alpha-1}(1-x)^{beta-1}\]

and it looks like this, for α=2 and β=10;

Now the Beta isn't too difficult to work with, but in practice we find ourselves dealing with horrible distributions which have lots of parameters. It's then not possible to use a pen and paper to work with them, or even to find a way of calculating the integrals numerically, so we resort to Dark Arts of simulation and throw a computer and some random number generators at it.

The ideal simulation strategy is to simulate from the distribution, p*(θ) (I'm inventing some notation by indicating with an asterisk a distribution that is simulated), to produce a lot of random numbes, θ*, and then calculate

\[mu = frac{1}{n}sum f(theta^*)\]

For example, to calculate the standard deviation for the beta distribution above, we can simulate it using R's functions:

simBeta=rbeta(n=1000, shape1=2, shape2=10)
Mean=mean(simBeta)

This is simple, but with more complicated distributions, p*(θ) is difficult to sample from, partly because the shape usually isn't obvious a priori, but also because the area under the distribution has to equal 1, so we need to normalise the density (that's what those Γ's are for in the equation for the Beta density). This means integrating over the whole density, which can be a nightmare.

It is usually fairly easy to calculate p(θ), i.e. the probability (or probability density) at any point, so several algorithms have been developed which only need to make this calculation, plus some easier ones. One popular group of algorithms are called MCMC. This works, and often works very well, but it can be slow and difficult to implement. So there are times when a different approach is called for. Importance sampling is one approach. It has different strengths and weaknesses. One strength is that it is easier to code than MCMC (the only reason MCMC is use by non-specialists is because of programmes like OpenBUGS have been written for us).

The basic idea behind importance sampling is not to to simulate from p(θ), but to simulate from a simpler distribution, q(θ), and then correct the mean for doing the wrong thing by weighting the simulations:

\[omega = frac{1}{n}sum f(theta^*) w(theta^*)\]

θ* is the numbers simulated from the simpler distribution, q(), and w(θ*) are the weights.

Now we just have to find the correct weight. For this we just need a bit of algebra. What we want is

\[omega = int f(theta) p(theta) dtheta\]

but by simulating from q() and correcting, we find ourselves with

\[omega = int f(theta) w(theta) q^*(theta)\]

So, if we use w(θ)= p(θ)/q(θ) we get

\[omega = int frac{f(theta) p(theta)}{q(theta)} q^{*}(theta)\]

and the q()'s cancel out. Which means we simulate θ* from q(),
calculate the weights p(θ*)/q(θ*) and the function f(θ*) and find the mean:

\[omega = frac{1}{n}sum frac{f(theta^*) p(theta^*)}{q(theta^*)}\]

This works because if a value of θ is drawn from q() that is less likely than it would be from the target distribution, p(), it gets up-weighted, so it has a bigger effect in the summation: it becomes more important. Conversely if a value is drawn that is more likely, it gets down-weighted. The weights ensure that this is done by the right amount.

Let's see how this works for the Beta distribution. We'll use a triangular distribution to sample from. It looks like this:

This is the R code to do the sampling and calculate the importance weights. The first line simulates θ from the distribution (rTriangular is a function I wrote, which I've put with all of the code), and the second line calculates the weights, w(θ*):

tria.samp=rTriangular(1e4, min=TriPars$min,max=TriPars$max,mode=TriPars$mode)

tria.wt.unsc=dbeta(tria.samp, TruePars$al, TruePars$be)/dTriangular(tria.samp, min=TriPars$min, max=TriPars$max,mode= TriPars$mode)

tria.wt=tria.wt.unsc/sum(tria.wt.unsc)

The third line simply scales the weights, so they sum up to 1. This is not always necessary, as the raw weights can be used, and the scaling applied later:

tria.mean=sum(tria.samp*tria.wt)/sum(tria.wt)
tria.sd=sum(tria.wt*(tria.samp-tria.mean)^2)/sum(tria.wt)

The density from the simulated distribution (along with the simulation from the true distribution) looks like this:

The simulations are pretty close, but the importance sample density is a bit more spread out. This is because there is also some extra variation because this is a simulation: if we simulate more values, this decreases.

Why is the simulation variation larger than from the correct distribution? Let's look at the histogram of the weights:

We can see that most of them are small: about a quarter have less than 1% of the largest weight. So, the larger weights dominate the calculation, and there are relatively few of these. The simulation of values of θ that have small weights is a waste of time, as they add little to the final sum. So we want to find a good sampling distribution that doesn't have a lot of small weights (but which still covers the target distribution well). Finding this distribution can be a problem, but even if a good distribution can't be found, a simple fix is just to simulate more values: that's quick, at least.

To sum up

That's importance sampling in a nutshell. The advantage of it is that it's easy to code: the distribution θ is sampled from should be easy to simulate (otherwise there's no point!), and any simulation approach will need to calculate the target density. Beyond that it's just a case of calculating the weights, which is straightforward.

The problem with using importance sampling is that a good simulation distribution is needed. With small problems, a rough guess at the distribution will probably be OK because more samples can be simulated, so that enough "good" weights are found. In general, as the number of parameters increases so does the proportion of parameter space with small weights, so it becomes more important to think about the sampling distribution: having 1% of weights being OK may be fine, but 0.1%, 0.0001%? My impression is that, in general, MCMC will work better for big problems. When you have 100s or 1000s of parameters (as I do regularly), it will be almost impossible to find good weights for all.

The one area where importance sampling has been used a lot is with time series. The reason for this is that it is easy to update the weights as more information comes in. There are problems, though, because as more data is added, fewer draws of θ have large weights, so the estimates become dominated by these draws. But there are some Really Dark Arts that can be applied to improve this. learning them is something that's on my list of stuff to do sometime, but even this simple importance sampling has its uses.

3 responses so far