
An OB workflow for PyMC

Updated: Apr 9

In response to one of my previous blog posts, one of my colleagues asked me for a practical example of how to put some Objective Bayesian (OB) ideas into practice, preferably using existing software. I put together the following walk-through to sketch out one seemingly plausible workflow that employs OB ideas on a simple dataset. Afterward, I offer some thoughts.


import statsmodels.api as sm
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
import numpy as np

First, let's generate some example data:


rng = np.random.RandomState(314)
n_points = 50
x = 12 * rng.rand(n_points)
# True line: y=3x-17
y = 3 * x - 17 + 9 * rng.randn(n_points)
x_data, y_data = x, y
plt.scatter(x, y);


Our first step is to generate (frequentist) coverage intervals (i.e., confidence intervals) for the predictions. For generality in the ML setting, let's do that using conformal prediction. I won't go into any background here, but conformal prediction is a general way of producing coverage intervals that has recently become popular in the ML world (mostly due to so-called split-conformal prediction). Any method of producing (narrow) coverage intervals should work.


In order to produce conformal intervals, we need hold-out data and a predictive model. Let's make the hold-out sets first. The first line creates a test hold-out set; the second creates a calibration hold-out set, which we will use for the conformal interval estimation.


x_train, x_test, y_train, y_test = train_test_split(x, y)
x_prop_train, x_cal, y_prop_train, y_cal = train_test_split(x_train, y_train)


For the predictive model, let's just use an OLS model fit on the proper training set, as follows:


from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)

model.fit(x_prop_train[:, None], y_prop_train)

xfit = np.linspace(0, 12, 1000)
yfit = model.predict(xfit[:, None])

plt.scatter(x, y)
plt.plot(xfit, yfit, color='red');


Now, we'll use the crepes library to produce a split-conformal predictor. This library also allows for data-adaptive conformal prediction intervals via its DifficultyEstimator class. Let's throw that in for good measure.

from crepes import WrapRegressor
lr = WrapRegressor(LinearRegression())
lr.fit(x_prop_train[:, None], y_prop_train)
from crepes.extras import DifficultyEstimator
de = DifficultyEstimator()
de.fit(x_prop_train[:, None], y_prop_train, k=25)
sigmas_cal = de.apply(x_cal[:, None])
lr.calibrate(x_cal[:, None], y_cal, sigmas=sigmas_cal)
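
For intuition, plain (non-adaptive) split conformal boils down to taking a quantile of the absolute residuals on the calibration set; the difficulty estimator just scales that width point by point. Here's a rough sketch of the plain version, reusing the objects above (it ignores the small finite-sample adjustment to the quantile that the exact method uses):

# Plain split-conformal intervals by hand, for intuition only -- crepes handles the
# difficulty-weighted version above, with the proper finite-sample quantile correction.
cal_resid = np.abs(y_cal - model.predict(x_cal[:, None]))
half_width = np.quantile(cal_resid, 0.90)

manual_lb = model.predict(xfit[:, None]) - half_width
manual_ub = model.predict(xfit[:, None]) + half_width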

So, now we have a calibrated conformal interval predictor. Let's see what those intervals look like graphically. For organization, we can put these into a pandas DataFrame.


sigmas_test = de.apply(xfit[:, None])
pred_ints = lr.predict_int(xfit[:, None], sigmas=sigmas_test,
	confidence=0.90)
import pandas as pd
df_conf = pd.DataFrame()
df_conf['x'] = xfit
df_conf['lb'] = pred_ints[:, 0]
df_conf['ub'] = pred_ints[:, 1]
plt.scatter(x, y)
yfit = model.predict(xfit[:, None])
plt.plot(xfit, yfit, color='red')
plt.fill_between(df_conf.x, df_conf.lb, df_conf.ub, alpha=0.2, 
	color="yellow")

It looks like the conformal predictor isn't detecting much variation in the interval size. This, however, is controlled by the k parameter, which by default is set to 25: the number of neighboring points the difficulty estimator uses. If we lower it to, say, 5, we get something less smooth:
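
For reference, here's a minimal sketch of how that k=5 variant could be produced, reusing the objects defined above (the names de5, lr5, and pred_ints_5 are just illustrative):

# Re-fit the difficulty estimator with fewer neighbours and calibrate a fresh wrapper.
de5 = DifficultyEstimator()
de5.fit(x_prop_train[:, None], y_prop_train, k=5)

lr5 = WrapRegressor(LinearRegression())
lr5.fit(x_prop_train[:, None], y_prop_train)
lr5.calibrate(x_cal[:, None], y_cal, sigmas=de5.apply(x_cal[:, None]))

pred_ints_5 = lr5.predict_int(xfit[:, None], sigmas=de5.apply(xfit[:, None]),
	confidence=0.90)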



This is a hyperparameter for you to play with. We'll stick with k=25 to generate our conformal intervals for now.


Now that we have a set of conformal intervals over our space, we want to turn these into constraints that we can use for our Bayesian analysis. In PyMC, constraints can be specified as Potentials: terms that get added to the model's log-probability during MCMC sampling. Let's fit a Bayesian linear model subject to these constraints:


with pm.Model():

    x_grid = xfit  # prediction grid (kept separate so we don't clobber the raw data in x)
    intercept = pm.Normal('intercept', mu=0, sigma=100)
    slope = pm.Normal('slope', mu=0, sigma=100)
    y_ = slope * x_grid + intercept

    y_out_sigma = pm.HalfNormal('y_sig', sigma=100)
    y_out = pm.Normal('y', mu=y_, sigma=y_out_sigma)

    # How far each sampled y falls relative to the conformal band (as plain numpy arrays).
    lb_constraint_diff = y_out - df_conf.lb.values
    ub_constraint_diff = y_out - df_conf.ub.values

    # Soft one-sided penalties: zero inside the band, increasingly negative outside it.
    lb_constraint = pm.Potential('lb_constraint',
		pm.math.switch(lb_constraint_diff > 0, 0, lb_constraint_diff))
    ub_constraint = pm.Potential('ub_constraint',
		pm.math.switch(ub_constraint_diff < 0, 0, -ub_constraint_diff))

    idata = pm.sample(10000, tune=10000)
    # Note that the constraints will add divergences, but this seems to
    # help some. A reparameterization would be better! Alas another day...

For computational reasons, we specify the constraints as soft constraints whose penalty grows the further a proposal violates the band; otherwise, the sampler seems to have a difficult time even getting started. We can validate the degree of compliance afterward to see whether this is a practical issue.
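
As a quick numeric illustration of what the lower-bound Potential contributes (the values here are made up):

# Mirrors pm.math.switch(diff > 0, 0, diff): no penalty inside the band,
# a linearly growing negative term below it.
def lb_penalty(y_val, lb):
    diff = y_val - lb
    return np.where(diff > 0, 0.0, diff)

lb_penalty(np.array([5.0, 3.0, 1.0]), lb=3.0)   # -> array([ 0.,  0., -2.])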


Note a few differences from a usual Bayesian analysis here. First, note the use of very wide prior distributions. From the OB perspective, we should inject as little 'prior' information as possible by maximizing our prior entropy (at least in this approach). Some subjectivist Bayesians also suggest this, but note that subjectivism does not require that your priors be broad.


Second, there is no observed keyword used to constrain y_out (sampled as 'y'). In a subjectivist Bayesian analysis, this would be necessary, because subjectivists use conditioning via Bayes' rule to inject data-information into their distributions. The OB practitioner has a different approach. Instead of seeing updating on data as a matter of conditioning on new information, we see updating as a matter of constraining our otherwise maximal degree of uncertainty. While a subjectivist Bayesian may even be inspired to use so-called maximum-entropy priors (or, at least, very broad priors as specified above), their conditioning-based approach must consist of two phases at this point: first, they condition their model on the training data (using x_prop_train and y_prop_train); then they predict on what I've called xfit to produce the graphs below. For the OB, the constraints above have already been informed by data, so no additional conditioning is necessary here.
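
For contrast, here is a minimal sketch of that two-phase, conditioning-based workflow (it assumes PyMC's MutableData / set_data pattern for out-of-sample prediction; idata_cond and ppc are just illustrative names):

with pm.Model() as conditioned_model:
    x_shared = pm.MutableData('x_shared', x_prop_train)
    intercept = pm.Normal('intercept', mu=0, sigma=100)
    slope = pm.Normal('slope', mu=0, sigma=100)
    sigma = pm.HalfNormal('y_sig', sigma=100)
    # Phase one: condition on the training data via the observed keyword.
    pm.Normal('y', mu=slope * x_shared + intercept, sigma=sigma,
		observed=y_prop_train, shape=x_shared.shape)
    idata_cond = pm.sample()

with conditioned_model:
    # Phase two: swap in the prediction grid and draw posterior predictive samples.
    pm.set_data({'x_shared': xfit})
    ppc = pm.sample_posterior_predictive(idata_cond)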


Now that we have our OB posterior samples, let's check how well the OB samples obey our desired constraints (recall that we expressed them as soft constraints for computational purposes).


import arviz as az

# Calculate Bayesian credence intervals as highest-density intervals.
cred_ints = az.hdi(idata['posterior'], hdi_prob=0.90).y

df_cred = pd.DataFrame()
df_cred['x'] = xfit
df_cred['lb'] = cred_ints[:, 0]
df_cred['ub'] = cred_ints[:, 1]

df_intervals = pd.DataFrame()
df_intervals['x'] = xfit

# Let's use our original conformal intervals here
df_intervals['conf_lb'] = df_conf.lb
df_intervals['conf_ub'] = df_conf.ub

# And we'll use our Bayesian HDIs as above
df_intervals['cred_lb'] = df_cred.lb
df_intervals['cred_ub'] = df_cred.ub

One obvious check is that, at the very least, the 90% credible intervals should be within the calculated 90% coverage intervals.


np.all(df_intervals.conf_lb < df_intervals.cred_lb)
True

np.all(df_intervals.conf_ub > df_intervals.cred_ub)
True

And we can view their relationship as follows:
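
A sketch of how the overlay could be drawn, reusing the DataFrame of intervals (x_data and y_data are the raw data saved at the start):

plt.scatter(x_data, y_data)
plt.fill_between(df_intervals.x, df_intervals.conf_lb, df_intervals.conf_ub,
	alpha=0.2, color="yellow", label="90% conformal coverage")
plt.fill_between(df_intervals.x, df_intervals.cred_lb, df_intervals.cred_ub,
	alpha=0.4, color="blue", label="90% credence (HDI)")
plt.legend();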



Notice how the Bayesian model places its 90% credence in a band that is slightly narrower than the conformal band near the middle of the chart and expands toward the edges. This is a common occurrence with linear models (Bayesian or not), since the constraints at the line's endpoints do a lot of work, restricting the plausible values near the middle. The main takeaway here is that the OB credence bands can and do vary within the frequency constraints.


However, we ideally want all of the Bayesian credence intervals to be contained in the coverage constraints. We can see that this isn't strictly the case if we consider the 99.9% Bayesian credence band:
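
A sketch of how that wider band could be computed and compared (reusing the objects above; cred_999 is just an illustrative name):

# Compute the 99.9% HDI band and overlay it on the conformal band.
cred_999 = az.hdi(idata, hdi_prob=0.999).y.values

plt.scatter(x_data, y_data)
plt.fill_between(df_conf.x, df_conf.lb, df_conf.ub, alpha=0.2, color="yellow")
plt.fill_between(xfit, cred_999[:, 0], cred_999[:, 1], alpha=0.4, color="blue")

# The containment check from before is expected to fail here (see below):
np.all(df_conf.lb.values < cred_999[:, 0])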



Here we can see, particularly toward the edges, that the 99.9% credence band leaks out of the frequency band that is supposed to be constraining it. This is a result of the soft constraint that we used for computational convenience with the MCMC sampler. If we care about the values at the edges and want to make use of such a wide credence band (or have other concerns), we can tweak the soft penalty until it provides the practical coverage we need. We might even consider truncation as a working solution. I won't worry about that here -- compliance is already very high.


Let's also see where the Bayesian model puts its, say, top 10% values within the coverage bands:
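
A sketch of one way to extract and plot that threshold, assuming the posterior variable is named 'y' as in the model above:

# The posterior 90th-percentile curve of y at each grid point, overlaid on the
# conformal band and the OLS fit.
y_post = idata.posterior['y'].stack(sample=('chain', 'draw')).values  # (grid, samples)
top10_threshold = np.quantile(y_post, 0.90, axis=1)

plt.scatter(x_data, y_data)
plt.fill_between(df_conf.x, df_conf.lb, df_conf.ub, alpha=0.2, color="yellow")
plt.plot(xfit, yfit, color='red', label='OLS fit')
plt.plot(xfit, top10_threshold, color='green', label='posterior 90th percentile')
plt.legend();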



Here we can see some of the anticipated disagreement with the frequentist OLS model, particularly on the right side of the graph. The Bayesian model seems to prefer a less extreme slope at the 10% credence level than the OLS model does. This makes sense if we consider (point-estimating) Bayesian models by analogy with regularized frequentist modeling.


So, what have we done?

We have a Bayesian linear regression model. This model provides a full probability distribution as output and, thus, is fully Bayesian (as opposed to, say, MAP-based, point-estimation models). This model is also calibrated to the empirical evidence by constraining its output with frequentist coverage intervals, generated by conformal prediction. This model also uses very wide, uninformative prior distributions over quantities of interest. This isn't quite maximal entropy, but even an OB practitioner can be satisfied with a sufficiently uninformative prior for practical purposes.


The role of coverage intervals

Perhaps some additional conversation about the coverage intervals is warranted here. After all, if we choose a different coverage level (typically represented as 1-α), this will impact the whole model. In usual scientific discourse, the coverage level is conventionally set at 0.95 without much further thought. Here, though, the coverage level encodes a kind of empirical risk: it determines which hypothesis spaces we're willing to take for granted. This is akin to the way null hypotheses are treated. A null hypothesis is one we're willing to take for granted unless presented with compelling anomalies -- something we'd expect to see less than 5% of the time, at a 95% coverage level, if the null hypothesis were true. Here, the Objective Bayesian uses a similar concept, but over the whole hypothesis space. We use the coverage interval to constrain the set of possible hypotheses to those that are reasonably compatible with the empirical evidence (at a 95% level). Once we have that empirically viable hypothesis space, the Bayesian does what Bayesians do -- figure out rational degrees of belief over the contending hypotheses.


Because the role of coverage in the Objective Bayesian context differs from its role in frequentist inference, it may or may not be appropriate to use the conventional 95% coverage level in an OB inference. This will depend on contextual factors and on an empirical risk profile that may be resolvable using something like decision theory. It falls to the analyst to settle these questions before proceeding, just as with any other responsible statistical analysis.


What other things can we do?

There are plenty of other things that can be done in this sort of workflow.


One thing to note is that there's no requirement for the Bayesian model to resemble the model used for generating the coverage intervals. Once the empirical constraints are expressed in PyMC, the actual Bayesian model can be otherwise divorced from the frequentist model. Wanna use BART instead? Do it.


Why?

Having seen how an Objective Bayesian analysis can be done using some existing tools, one question usually remains: Why? Why go through all of this when I can just use frequentist tools in the first place or, alternatively, Bayesian tools as they're usually intended?


In reverse order, the primary value that this sort of analysis brings to Bayesians is empirical calibration. Like it or not, empirical calibration is an established norm in scientific discourse and has been for a century. The analysis should address that and, ideally, the workflow used should have a principled method for producing analyses with desirable frequency properties. Some other approaches from empirically-minded Bayesians can do this in ways that, at least to me, seem more ad hoc and potentially overconfident with respect to the data. The Objective Bayesian framework establishes where calibration lives in the analysis and precisely how the Bayesian should make use of frequency information not amenable to typical Bayesian conditionalization.


On the other hand, one value that this brings over a frequentist analysis is the ability to answer the correct question. Most people, especially in industry, are not interested in discovering which statistical hypotheses best fit their particular data -- they want to know which hypotheses they should bet on with their actions. The OB analysis provides a probabilistically coherent, data-driven, empirically grounded way to trade off between hypotheses that are in real contention, allowing prescription of actions because some hypotheses are more likely than others. Frequentist analysis does not provide that, nor does it advertise that it can.


A second value that this brings over a frequentist analysis is more practical: genuine distributional output from the analysis. Point estimates suck from a practical perspective, since we know they're nearly always wrong, and interval estimation leaves you in limbo inside the interval. Distributional information is practically useful, and the maximum-entropy constraints proffered by Objective Bayesianism ensure that those distributions do not claim more than what the data coherently allow.


I intend to explore more and better ways to improve my Bayesian analysis with Objective Bayesian principles, and I hope that this has been useful for you to do the same.


 

Links and References:


PyMC (probabilistic programming): www.pymc.io/

Crepes (conformal prediction): crepes.readthedocs.io/


Williamson, Jon. 2010. In Defence of Objective Bayesianism. OUP Oxford.
