Jake

An Example of Calibrating Posterior Distributions

Updated: Aug 26

For the Objective Bayesian, the Calibration norm means ensuring that inference is consistent with known frequency information. I've previously discussed examples from Jon Williamson which do this for point estimation. However, one of the major selling points of Bayesian inference is the presence of a genuine posterior distribution. Here I want to demonstrate a fairly simple method for enforcing a calibration constraint on a posterior distribution using some modern ML packages. I'm sure this has been done elsewhere and better, but I hope it can be illustrative. After a quick walkthrough, I want to speculate on the relationship between calibrated posterior distributions and point estimation with an eye toward Williamson's approach.


Explicitly Calibrating Posterior Distributions

From an Objective Bayesian perspective, the problem of calibrating a posterior distribution reduces to finding the probability distribution that places the target amount of density on an identified coverage interval while minimizing the KL-divergence from the uncalibrated distribution. That is, we want the closest calibrated distribution to our uncalibrated distribution (as measured by KL-divergence).
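
In symbols, writing p for the uncalibrated posterior density, q for the candidate calibrated density, (a, b) for the identified coverage interval, and c for the target coverage (notation mine), the problem is roughly:

\min_{q} \; D_{\mathrm{KL}}(p \,\|\, q) \quad \text{subject to} \quad \int_{a}^{b} q(x)\,dx \geq c

The inequality (rather than an equality) reflects the one-sided reading of coverage discussed below, and the direction of the KL-divergence matches the code, which treats p as the "true" distribution to be approximated by q.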


Of course, most Bayesian posterior distributions are not tractable, posing a problem for such optimizations. To allow for an efficient search, let us approximate our calibrated distribution as a Gaussian mixture with a sufficient number of components. Then, our search reduces to a search over the parameters of the component Gaussians as well as the mixture coefficients. Much more tractable.


In this example, I write this search out using Optuna, but I'm sure there are plenty of better ways to do this:


import numpy as np
import scipy
from scipy.integrate import quad
from scipy.stats import norm
from sklearn.mixture import GaussianMixture
import optuna
import matplotlib.pyplot as plt

def kl_divergence(p, q):
    # Pointwise KL divergence summed over the evaluation grid (a simple grid approximation)
    return np.sum(scipy.special.kl_div(p, q))

metric_loss = kl_divergence

def mixture_density(x, gmm):
    # Calculate the density of a Gaussian Mixture Model at x
    return np.exp(gmm.score_samples(np.array(x).reshape(-1, 1)))

def coverage_constraint(new_gmm, interval, coverage_prob):
    # Integrate the new density over the interval and compare with the target coverage probability
    new_density_func = lambda x: mixture_density(x, new_gmm)
    new_coverage, _ = quad(new_density_func, *interval)
    return new_coverage - coverage_prob

def find_closest_gmm(target_gmm, interval, coverage_prob, new_n_components):
    def objective(trial):
        # Sample weights from a Dirichlet distribution
        # https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-suggest-variables-which-represent-the-proportion-that-is-are-in-accordance-with-dirichlet-distribution 
        xs = np.array([-np.log(trial.suggest_float(f"x_{i}", 0, 1)) for i in range(new_n_components)])
        weights = xs/np.sum(xs)
        for i, w in enumerate(weights):
            trial.set_user_attr(f"weight_{i}", w)
        interval_len = 10 * (interval[1] - interval[0])  # This is just a search interval heuristic

        means = np.array([trial.suggest_float(f'mean_{i}', interval[0]-interval_len, interval[1]+interval_len) for i in range(new_n_components)])

        covariances = np.array([trial.suggest_float(f'covariance_{i}', 0.1, 5.0) for i in range(new_n_components)])

        new_gmm = GaussianMixture(n_components=new_n_components)
        new_gmm.weights_ = weights
        new_gmm.means_ = means.reshape(-1, 1)
        new_gmm.covariances_ = covariances.reshape(-1, 1, 1)
        new_gmm.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(new_gmm.covariances_))

        # This determines the support used for calculating the loss
        x = np.linspace(interval[0]-interval_len, interval[1]+interval_len, 1000)
        p = mixture_density(x, target_gmm)
        q = mixture_density(x, new_gmm)

        # Check the coverage constraint
        coverage_diff = coverage_constraint(new_gmm, interval, coverage_prob)
        coverage_penalty = -100 * coverage_diff * (coverage_diff<0)

        # p is the "true" distribution to be (constrained) approximated by q
        divergence_penalty = metric_loss(p, q)
        return divergence_penalty + coverage_penalty

    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=1000, show_progress_bar=True, n_jobs=-1)

    best_trial = study.best_trial

    best_weights = np.array([best_trial.user_attrs[f'weight_{i}'] for i in range(new_n_components)])                        

    best_means = np.array([best_trial.params[f'mean_{i}'] for i in range(new_n_components)])

    best_covariances = np.array([best_trial.params[f'covariance_{i}'] for i in range(new_n_components)])

    best_gmm = GaussianMixture(n_components=new_n_components)
    best_gmm.weights_ = best_weights
    best_gmm.means_ = best_means.reshape(-1, 1)
    best_gmm.covariances_ = best_covariances.reshape(-1, 1, 1)
    best_gmm.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(best_gmm.covariances_))
    return best_gmm

The code above does a few things, but its primary goal is to conduct a search over a space of Gaussian mixtures and measure their fitness according to how well they satisfy the target density constraint while staying close to the original distribution. Note that the constraint is only enforced if the proposed distribution does not have enough density in the interval, not if it is denser than the target. This is because frequentist coverage intervals aim to provide at least their target coverage, if not more. So, a proposed distribution that is denser in the interval is consistent with the "frequentist information" constraint.


Now, let's run an example:


np.random.seed(0)
target_gmm = GaussianMixture(n_components=2)
target_gmm.means_ = np.array([[0], [3]])
target_gmm.covariances_ = np.array([[[1]], [[1.5]]])
target_gmm.weights_ = np.array([0.6, 0.4])
target_gmm.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(target_gmm.covariances_))

interval = (-1, 0)  # Example interval
coverage_prob = 0.3  # Example coverage probability
new_n_components = 10  # Example number of components for the new GMM

new_gmm = find_closest_gmm(target_gmm, interval, coverage_prob, new_n_components)
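
To visualize the comparison, something like the following plots both densities together with the target interval (shown in red):

x_plot = np.linspace(-6, 9, 1000)
plt.plot(x_plot, mixture_density(x_plot, target_gmm), label="Target GMM")
plt.plot(x_plot, mixture_density(x_plot, new_gmm), label="New GMM")
plt.axvspan(*interval, color="red", alpha=0.2, label="Target interval")
plt.legend()
plt.show()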


Here we can see some of the results of this optimization. First, we see the distribution we're trying to approximate ("Target GMM") as well as the target interval (-1, 0). As seen above, we're targeting at least 30% density in this interval. The result ("New GMM") is indeed much improved as demonstrated below:


def mix_norm_cdf(x, gmm: GaussianMixture):
    # CDF of a one-dimensional Gaussian mixture: the weighted sum of the component normal CDFs
    mcdf = np.sum([gmm.weights_[i] * norm.cdf(x, loc=gmm.means_[i], scale=np.sqrt(gmm.covariances_[i]))
                   for i in range(len(gmm.weights_))])
    return mcdf
mix_norm_cdf(interval[1], new_gmm) - mix_norm_cdf(interval[0], new_gmm)

0.2739577966904756

The optimized distribution has nearly the target coverage while the original distribution was not close:

mix_norm_cdf(interval[1], target_gmm) - mix_norm_cdf(interval[0], target_gmm)

0.20744985629298662

While this is a bit oversimplified and hand-wavy, it is one reasonable way a Bayesian could calibrate their posterior distributions. Of course, it is only possible once one has the prerequisites: an uncalibrated posterior distribution and a coverage interval with its coverage level. However, I've gestured in previous posts at the ability to use techniques like conformal prediction to generate frequentist coverage intervals for ML-based predictions, though more traditional techniques could also work, as appropriate.
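
For concreteness, here's a minimal split-conformal sketch of how one might get such a frequentist interval for an ML prediction (the model and calibration data here are placeholders); the resulting (interval, coverage_prob) pair is exactly what find_closest_gmm needs:

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    # Absolute residuals on a held-out calibration set
    residuals = np.abs(y_cal - model.predict(X_cal))
    n = len(residuals)
    # Conformal quantile with the usual finite-sample correction
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level)
    pred = model.predict(np.array(x_new).reshape(1, -1))[0]
    # This interval has marginal coverage of at least (1 - alpha)
    return (pred - q, pred + q)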


Discussion and Thoughts

The result of this density-targeting optimization is a posterior distribution that satisfies the coverage constraint provided by frequency information while minimizing deviation from the original distribution. In that sense, these optimizations are conservative, fitting with common themes in generalized probabilistic inference.


Williamson's Constrained Point Estimation vs Point Estimation via Calibrated Posterior

One observation here is how point estimation would differ when using a calibrated posterior as opposed to the direct approach previously discussed by Williamson. Recall that in that direct approach, a point estimate is directly constrained to the coverage interval and chosen based on minimizing its distance to the MAP estimate of the (uncalibrated) posterior. In the example above, we might consider a point estimate of x=0 with this approach (using the Target GMM and the red interval).
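
As a rough sketch of that direct approach (my reading of it, using a simple grid search rather than anything of Williamson's):

grid = np.linspace(-6, 9, 10000)
map_uncalibrated = grid[np.argmax(mixture_density(grid, target_gmm))]
# Take the point in the coverage interval closest to the uncalibrated MAP
constrained_estimate = np.clip(map_uncalibrated, interval[0], interval[1])
# The Target GMM's mode sits just above 0, so the constrained estimate lands on the boundary at x = 0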


With the approach here, it seems more appropriate, once the distribution is calibrated, to simply calculate the MAP of the newly calibrated distribution, yielding something like x = -0.5 in this case. However, it's worth noting that the point estimate is not constrained to lie in the target interval -- depending on the "prior" distribution and the target interval density, the target interval may still not contain the calibrated MAP. This may seem counterintuitive at first, but I think it is actually very acceptable. After all, we are Bayesians, and we recognize that being most consistent with our data (as frequentists are wont to target with their point estimates) does not mean being most likely (at least not on its own). Insofar as our posterior distribution is correctly consistent with all the information we know (including frequentist coverage properties), we should not further artificially constrain our point estimates to the frequentist interval.
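
In code, this second approach is just the MAP of the calibrated distribution, again via a simple grid search (the exact value will vary from run to run of the optimizer):

grid = np.linspace(-6, 9, 10000)
map_calibrated = grid[np.argmax(mixture_density(grid, new_gmm))]
# Around x = -0.5 in my run; note that nothing forces this to fall inside the interval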


This second approach is, thus, inconsistent with Williamson's articulation of the role of the coverage interval. Recall that in his approach, once we identify a target coverage level representing an appropriate level of empirical risk and calculate our associated coverage interval, we are to "accept" that the parameter lies in that interval. Hence, our point estimate must be constrained to that interval. In contrast here, we avoid the commitment step, choosing to instead make our posterior density merely consistent with the frequentist coverage requirement. Off the cuff, this seems to make more sense to me than a hard clamp of point estimates to an interval -- if prior information is very strong, then we should yield to it to the degree that it remains consistent with available empirical information.


This does initially raise the question of whether the risk-based analysis proposed by Williamson works the same for this second approach as for his own. I think that it does -- the main thing that changes is the role of the coverage interval. It no longer constrains point estimates in an all-or-nothing way but instead acts as a density constraint on the posterior, which sits several steps further down in the OB inference process. It can be seen as a kind of empirical "regularizer" on inferred posteriors.


Another question arises when providing an interval estimate of uncertainty: which should be reported, the frequentist interval or a Bayesian credible interval? I think that in this case we should recognize that we're not dealing with one notion of uncertainty but at least two. As such, we should report both: a Bayesian HDI as well as the frequentist interval used to calibrate the posterior. The discrepancy between the two can further inform consumers about how much information is coming from the "prior" and how much is based on the current empirical data alone. While the net Bayesian estimate is most appropriate for decision making, the latter informs about the degree of narrowly empirical support.
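
As a rough sketch, here is one grid-based way to pull an HDI out of the calibrated GMM for reporting alongside the frequentist interval (for a multimodal posterior the true highest-density region may be a union of intervals; this simply returns its hull):

def hdi_from_gmm(gmm, mass=0.9, lo=-6.0, hi=9.0, n=5000):
    xs = np.linspace(lo, hi, n)
    dens = mixture_density(xs, gmm)
    dx = xs[1] - xs[0]
    # Keep the highest-density grid points until they account for the target mass
    order = np.argsort(dens)[::-1]
    cum = np.cumsum(dens[order]) * dx
    keep = order[: np.searchsorted(cum, mass) + 1]
    return xs[keep].min(), xs[keep].max()

hdi_from_gmm(new_gmm, mass=0.9)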


Inferences about Prior Distributions

One interesting thing to investigate further is how this sort of constrained optimization changes the implied prior distributions. That is, we can interpret this optimization as equivalent to a search over prior distributions that induce the desired coverage in the posterior. I believe this shares themes with Gelman's test-reject-modify iterative approach to modeling, as he also agrees that frequency properties are relevant for Bayesian modeling and understands prior distributions to be a modeling choice. His approach, as I understand it, prefers to let modelers make informed decisions over how to negotiate between prior distributions and posterior behaviors. While all statistical modeling is open to charges of subjectivity on some level, it would be better to close off additional channels for it where we can do so in a principled way. The approach taken here seems to be one such way. That is, we sidestep the issue of trying to specify a sufficiently frequency-matching prior distribution, focusing instead on directly incorporating frequency constraints via optimization.


Conclusion

No conclusion today; just food for thought as I try to make sense of things and am probably wrong out loud on the internet.


---

Additional Comments 8/10/24:


Circling around the same tree ---


It's worth pointing out that the posterior distribution, even one "calibrated" like this, should not be read as having frequentist coverage in general -- only over the particular interval used for calibration. That is, while the "calibrated" posterior distribution matches the frequentist coverage over the identified interval, it isn't the case that any other interval of the posterior distribution carries a frequentist coverage guarantee. Nor should we expect this even in the best case, I think, as is discussed with conformal prediction -- producing perfectly calibrated X-conditional coverage is impossible, and that is what would be necessary for posterior distributions to be completely calibrated in this strong sense. Don't read posterior distributions frequentist-ly in general, even if you inform those distributions with frequentist information.

This could be the basis of a small retort by the frequentist, who might say that if, even after injecting frequentist information into the posterior, one still can't get coverage, then there's little point in doing this at all. However, the posterior does have coverage precisely where the frequentist does; it just doesn't (necessarily) have it where the frequentist doesn't. The Bayesian doesn't need to accept the frequentist's challenge here -- just let them be disappointed. Those who expect probability statements to be uncompromisingly coverage-oriented really should just be frequentists, it seems to me, since this amounts to an operational commitment to a fundamentally frequentist definition of the notion. (But, in a Moorean shift, since probability statements should indeed be concerned with more than just coverage, we should not be frequentists. Indeed, consistency alone provides elements of a reductio against this sort of coverage fundamentalism, which is seemingly expressed at times by folks in the literature.)

The subjectivist might similarly object and say that if we can't generalize frequentist coverage with the posterior distribution, then there's little point in playing this game with the frequentist. However, as mentioned before, the subjectivist leaves relevant frequency information on the table that should inform our beliefs. So, even if our posterior doesn't gain a bunch of frequentist properties as a result, it nonetheless should be constrained by the totality of our information, and failing to do so is an error.

