There has been some interesting traffic on the PyMC mailing list lately. It seems that there is a common trouble with the “Adaptive Metropolis” step method, and it’s failure to converge. I’ve been quite impressed with this approach, and I haven’t had the problems that others reported, so I started wondering: Have I been lucky? Have I been not looking carefully?

I decided to do some experiments to make Metropolis and Adaptive Metropolis shine, and since I’m an aspiring math filmmaker these days, I made my experiments into movies.

I consider the Metropolis step method the essence of MCMC. You have a particular point in parameter space, and you tentatively perturb it to a new random point, and then decide to accept or reject this new point with a carefully designed probability, so that the stationary distribution of the Markov chain is something you’re interested in. It’s magic, but it’s good, simple magic.

Here is the simplest example I could come up with, sampling from a uniform distribution on the unit square in the plane using Metropolis steps. Any self-respecting step method can sample from the unit square in two dimensions!

It is worth mentioning that this model takes nearly no code in PyMC (making the movie takes a little bit of work, though). Setting up and running the model is just this:

X = mc.Uniform('X', lower=-1, upper=1, value=[0,0]) mod = mc.MCMC({'X': X}) mod.use_step_method(mc.Metropolis, mod.X) mod.sample(5000)

But the Metropolis approach doesn’t fare so well when the parameters are correlated (to speak in statistician’s terms). Geometrically speaking, Metropolis is bad when the sample space is, for example, a diagonal strip, instead of an axis-aligned square:

This is the setting where Adaptive Metropolis really wins big. By updating its proposal distribution after a bit of exploring, Adaptive Metropolis effectively converts this long, narrow diagonal region back into a square. You can see how the autocorrelation drops after the first batch of steps are used for tuning:

Tuning the proposal distribution is not always a good idea, however. In this X-shaped region, the Metropolis stepper isn’t great, but Adaptive Metropolis is tricked really badly. It thinks it has discovered that it lives in a diagonal, and so it gets stuck when it starts to explore the other diagonal in the X.

The situation is much more complicated in the models that arise in real statistical modeling applications, but I think this is a useful sketch of why Metropolis is sometimes better and sometimes worse than Adaptive Metropolis.

*What do you do if you need to sample something analogous to the X?* I offer here a “proof of concept” PyMC implementation of a very cool step method, the hit-and-run sampler. It’s a pretty simple concept, and I hope that someone can take it to the industrial strength level required to really be used in PyMC. You pick a random direction, and look forwards and backwards in that direction to figure out what the posterior probability density looks like on that line. Then you hop to some point, with probability proportional to the posterior at each point.

To be explicit,

class HitAndRun(Gibbs): def __init__(self, stochastic, proposal_sd=None, verbose=None): Metropolis.__init__(self, stochastic, proposal_sd=proposal_sd, verbose=verbose, tally=False) self.proposal_tau = self.proposal_sd**-2. def step(self): x0 = copy(self.stochastic.value) dx = rnormal(zeros(len(x0)), self.proposal_tau) logp = [self.logp_plus_loglike] x_prime = [x0] for direction in [-1, 1]: for i in xrange(25): delta = direction*exp(.1*i)*dx try: self.stochastic.value = x0 + delta logp.append(self.logp_plus_loglike) x_prime.append(x0 + delta) except ZeroProbability: self.stochastic.value = x0 i = rcategorical(exp(array(logp) - flib.logsum(logp))) self.stochastic.value = x_prime[i] if i == 0: self.rejected += 1 else: self.accepted += 1

And what does this get you? A step method that cuts through the cross-diagonal region like it was the unit square:

Related paper: http://www.jstor.org/stable/1390645

Related discussion: http://stats.stackexchange.com/questions/5077/hit-and-run-mcmc

Research Idea: Combine Hit-and-Run with Adaptive Metroplis (to get the H-RAM!) This will work better than hit-and-run in the diagonal, and not be fooled like AM in the X.

Code to make the movies: https://github.com/aflaxman/pymc-mixing-movies

Hi!

That’s a really nice and complete illustration of these fancy McMC techniques!

The hit-and-run technique seems rather alike a “randomly oriented” variant of the slice sampling [1] technique to me… I think it is even more robust for your crossing lines example though.

And also, I’d be keen on seeing how it performs on some curvy contour — e.g. an elliptical ring. I may find it useful for refining a Support Vector or Gaussian Process Classifier, in an adaptive online learning scheme.

Thank you again for sharing the tip and your neat examples…

Cheers,

Vincent

[1] http://eco.uninsubria.it/webdocenti/amira/papers/annals-disc.pdf

Great. Well explained and technically brilliant. Thank you for sharing.

Adaptive Metropolis is _not_ a silver bullet (as you have aptly illustrated). I haven’t read the discussions, but perhaps these people are using adaptive Metropolis when it doesn’t have the proper stationary distribution (there are cases where it doesn’t, which the PyMC documentation doesn’t mention!). If I recall correctly, the original algorithm required bounded support. Usually not a big deal in practice, since we can take the bounds to be huge, but it could make problems.

A more recent review is here:

I don’t know much about Adaptive Metropolis, but if the issue is combining hit-and-run with an affine transformation that attempts to make the target distribution more isotropic, then this reminds me of some things that Santosh Vempala has done. For example, see Section 5.1 of his paper with Bertsimas called Solving convex programs by random walks.

@Vincent: Thanks for the words and the reference. It should be pretty straightforward to extend my code to other objective functions. Another request from a colleague was to try it on the Rosenbrock function.

@Xavier: Thanks! 🙂

@Jared: I should take another look at the assumptions required by Adaptive Metropolis. Maybe PyMC could even check for them automatically. Thanks for the more recent review.

@aram: This is a really cool approach to combining Hit-and-Run with Adaptive Metropolis. And, unlike Adaptive Metropolis, it’s actually a Markov chain! Definitely worth further investigation.

Pingback: Introducing the H-RAM | Healthy Algorithms

@Vincent: Anand found a paper by Hans Andersen and Percy Diaconis ,Hit and Run as a Unifying Device which shows precisely how you can view Slice Sampling as a specialization of a generalization of Hit and Run. Cool stuff!

Pingback: MCMC in Python: Part II of PyMC Step Methods and their pitfalls | Healthy Algorithms

Pingback: MCMC in Python: Part IIb of PyMC Step Methods and their pitfalls | Healthy Algorithms