In the previous post, I responded to a betting scenario from a LinkedIn-er, and one of my key points was that the long run losses experienced by bettors in the bettors' market were disproportionately funded by uncalibrated bettors. While I think my reasoning there is pretty good, I wanted to make that point more tangible. So, I took to producing some simulations of this market and of how different bettors get exploited. This isn't going to be particularly in-depth, but it should be interesting.
Setting the Stage
First, I'm not interested in dealing with the original author's estimation scenario exactly. I'm just going to be concerned with having bettors solve the following simpler and related problem:
There is an idealized coin with unknown bias. You will be given the results of 14 previous flips of this coin. Using whatever statistical methods you prefer, estimate the probabilities that this coin will land Heads and Tails on its next flip.
The results of this estimation feed into the bettors' market, where you, now acting as a bookie, search the betting pool for sure-gain bets. That is, you want to form a pseudo-Dutch Book with pairs of bettors such that, while one bettor bets on Heads and the other bets on Tails, you walk away with cash regardless of how the coin lands.
The working hypothesis here is that you, as the bookie, will have monotonic gains precisely when you can form such a pseudo-Dutch Book, and that when uncalibrated bettors are no longer in the pool, such gains essentially vanish. One consequence of this is that while this pseudo-Dutch Book is problematic for subjectivist Bayesians who do not calibrate their estimates, objectivist Bayesians, who do calibrate their estimates, are not subject to this long run exploitation.
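To make the bookie's sure gain concrete: if the Heads bettor prices Heads at p_H and the Tails bettor prices Heads at p_T, the bookie collects both fair prices and pays out exactly one ticket, so it nets stakes × (p_H − p_T) no matter how the coin lands. A minimal sketch (the function name is mine, not the simulation's actual code):

```python
def bookie_sure_gain(p_heads_bettor, p_tails_bettor, stakes=100.0):
    """Bookie's guaranteed profit from pairing two bettors at their fair prices.

    The Heads bettor pays p_H * stakes for a ticket worth `stakes` if Heads;
    the Tails bettor pays (1 - p_T) * stakes for a ticket worth `stakes` if
    Tails. Exactly one ticket pays out, so the bookie nets
    stakes * (p_H - p_T) regardless of the outcome.
    """
    return stakes * (p_heads_bettor - p_tails_bettor)
```

A sure gain exists exactly when the Heads bettor's estimate exceeds the Tails bettor's; e.g., p_H = 0.7 against p_T = 0.6 nets the bookie $10 per bet at $100 stakes.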
Simulation Design
Bettors
For this simulation, I created a series of betting agents with different estimation strategies. Several of these are intended to act as baselines.
Random Betting (R)
This agent simply takes bets according to a probability selected uniformly at random for p(Heads). We should expect this agent to lose in the long run compared to other agents.
I had thought to include "incoherent" vs "coherent" random bettors, but since each bettor only ends up on one side of a paired bet with the bookie, incoherence will not be relevant.
Oracle Betting (O)
This agent knows the true bias of the coin and bets accordingly. We should expect this agent to win in the long run compared to other agents, since it essentially cheats.
Uncalibrated Bayesian Betting (UB)
This agent holds a Jeffreys prior (Beta(0.5, 0.5)) over the bias of the coin, updates it using Bayes' rule with a Binomial likelihood, and bets at its maximum a posteriori (MAP) estimate.
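Concretely, the Beta-Binomial update and MAP estimate can be sketched like this (a reconstruction under my reading of the setup; the helper name and edge-case handling are my own assumptions, not the post's code):

```python
def beta_binomial_map(heads, tails, a=0.5, b=0.5):
    """MAP estimate of p(Heads) under a Beta(a, b) prior and Binomial likelihood.

    The posterior is Beta(a + heads, b + tails); its mode is
    (a + heads - 1) / (a + b + heads + tails - 2) when both posterior
    parameters exceed 1.
    """
    post_a, post_b = a + heads, b + tails
    if post_a <= 1 and post_b <= 1:
        # No interior mode (e.g. zero flips under Jeffreys); fall back to 0.5.
        return 0.5
    if post_a <= 1:
        return 0.0   # mode pinned at 0
    if post_b <= 1:
        return 1.0   # mode pinned at 1
    return (post_a - 1) / (post_a + post_b - 2)
```

With a = b = 1 (the uniform prior of the Objective Bayesian below), the posterior mode reduces to the empirical proportion, i.e., the frequentist MLE.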
Extreme Subjective Bayesian Betting (ESB)
This agent, for subjective reasons, uses a Beta(13, 100) prior, but otherwise behaves like the Uncalibrated Bayesian.
Frequentist Plug-in Betting (FP)
This agent simply uses the empirically observed proportion of Heads as their estimate for p(Heads). Note that this is the frequentist maximum likelihood estimate (MLE). We should expect this agent to perform best in the long run, except for the Oracle, since the MLE minimizes long run expected loss.
Frequentist Confidence Lower Bound Betting (FLB)
This agent calculates a frequentist coverage interval at an alpha of 0.05 and bets using the lower bound as its probability estimate.
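The coverage interval itself can be computed exactly. As a sketch (the post doesn't say which interval construction was used, so I'm assuming the exact Clopper-Pearson interval), here's a pure-Python version that inverts the binomial tail probabilities by bisection; a library routine such as statsmodels' proportion_confint would do the same job:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

def clopper_pearson(k, n, alpha=0.05, tol=1e-9):
    """Exact (Clopper-Pearson) interval for a binomial proportion.

    The lower bound solves P(X >= k | p) = alpha/2; the upper bound solves
    P(X <= k | p) = alpha/2. Both tail functions are monotone in p, so a
    simple bisection suffices.
    """
    def bisect(f, target):           # f must be increasing in p
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else bisect(lambda p: binom_sf(k, n, p), alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper
```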
Frequentist Confidence Upper Bound Betting (FUB)
This agent is like the previous but bets using the upper bound as its probability estimate.
Frequentist Confidence Random Betting (FR)
This agent calculates a frequentist coverage interval at an alpha of 0.05 and bets according to a point in that interval chosen uniformly at random.
Calibrated Bayesian Betting (CB)
This agent estimates similarly to the Uncalibrated Bayesian but constrains their estimates to empirically-supported coverage intervals (with alpha level = 0.05). That is, this bettor uses frequentist coverage information to constrain their Bayesian estimate.
More specifically, this bettor first calculates its maximum a posteriori (MAP) estimate just like the uncalibrated Bayesian, but then also calculates its (1-alpha) coverage interval. If the MAP estimate lies in the interval, it returns that. If it lies below, it returns the lower bound of the coverage interval. If it lies above, it returns the upper bound.
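In other words, the calibration step is just a clamp of the MAP estimate to the coverage interval. A one-line sketch (naming is mine):

```python
def calibrated_estimate(map_est, lower, upper):
    """Clamp a Bayesian MAP estimate to a frequentist coverage interval
    [lower, upper]: return the MAP if it lies inside, else the nearer bound."""
    return min(max(map_est, lower), upper)
```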
High Alpha Calibrated Bayesian Betting (HCB)
Just like the Calibrated Bayesian but with alpha level = 0.95.
Calibrated Extreme Subjective Bayesian Betting (CESB)
This agent is the Extreme Subjective Bayesian analog to the Calibrated Bayesian. It uses a frequentist coverage interval to bound its otherwise subjective posterior MAP estimate.
High Alpha Calibrated Extreme Subjective Bayesian Betting (HCESB)
Just like the CESB but with alpha level = 0.95.
Objective Bayesian Betting (OB)
This agent estimates similarly to the Calibrated Bayesian but uses a uniform (Beta(1, 1)) prior. In this scenario, we should expect this to perform no worse than the Calibrated Bayesian.
Note that the uniform prior is the maximum entropy prior here, aligning with objective Bayesian norms. In this circumstance, this makes the Bayesian MAP and the frequentist MLE coincide.
High Alpha Objective Bayesian Betting (HOB)
Just like the OB but with alpha level = 0.95.
Experiment
This experiment has a number of deviations from the original author's scenario which will hopefully get to the root of the phenomena that are active in that scenario. As such, I will not consider a "pool" of arbitrary bettors here but will instead consider putting bettors in head-to-head matchups. (The arbitrary pool scenario can be seen as just a more complicated version of this, which I will explain later.)
Each bettor is given a sufficiently large bank of cash with which they may play until they go broke ($10,000 here). This balance is tracked over time, along with the bookie's net winnings. For each bet, a new "coin" is generated, according to some bias that is chosen uniformly at random. Any Oracle bettors are informed of this bias. Then this coin is simulated 14 times to provide data for those bettors that use it (everyone except the Random bettors). With the stakes of the bet fixed at $100, the bookie then evaluates the bettors to see if a sure-gain bet can be made using the fair bet prices of each bettor. If not, the bookie does not offer a bet this round. If a sure-gain bet is possible, the bookie chooses to make a bet with that pair of bettors, which they each accept at their fair price. After bets are made, a true result is generated using the coin, and rewards are paid out accordingly.
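A single round of this procedure can be sketched roughly as follows. This is a minimal reconstruction under my reading of the setup; the function and variable names are mine, not the actual simulation code:

```python
import random

def run_round(estimator_a, estimator_b, stakes=100.0, n_flips=14, rng=random):
    """One round of the head-to-head market.

    `estimator_a` / `estimator_b` map (heads, tails) counts to a p(Heads)
    estimate. Returns (bookie_gain, gain_a, gain_b), or None when no
    sure-gain pairing exists.
    """
    bias = rng.random()                          # new coin, bias uniform on [0, 1]
    heads = sum(rng.random() < bias for _ in range(n_flips))
    p_a = estimator_a(heads, n_flips - heads)
    p_b = estimator_b(heads, n_flips - heads)
    if p_a == p_b:
        return None                              # no sure gain; bookie sits out
    # Put the higher estimate on the Heads side so the bookie locks in
    # stakes * (p_high - p_low) regardless of the outcome.
    p_h, p_t = (p_a, p_b) if p_a > p_b else (p_b, p_a)
    outcome_heads = rng.random() < bias
    # Heads bettor pays p_h * stakes, wins stakes on Heads; Tails bettor
    # pays (1 - p_t) * stakes, wins stakes on Tails.
    gain_h = (stakes if outcome_heads else 0.0) - p_h * stakes
    gain_t = (0.0 if outcome_heads else stakes) - (1 - p_t) * stakes
    bookie = -(gain_h + gain_t)
    if p_a > p_b:
        return bookie, gain_h, gain_t
    return bookie, gain_t, gain_h
```

For example, pairing a bettor who always says 0.6 with one who always says 0.4 hands the bookie a guaranteed $20 per round at $100 stakes.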
To avoid overly long experimental runs, each experiment is truncated to 2500 bets, and each experimental run is replicated 10 times. Also note that I am not tracking how many rounds the bookie declines to make a bet. (Maybe in the future.) In the appendix, I also have some observations regarding how more data affects these inferences.
Head-to-Head Results and Discussion
Let us compare performance of these bettors when placed against each other, 1-v-1.
The diagonal results
Let us first consider the "diagonal" bets, which feature the same betting strategy on both sides of the bookie's bet. If the bookie gains here, then the betting strategy is subject to a true Dutch Book (see the previous post), because the bookie's gain must be funded by losses from that single strategy. To evaluate this, it's enough to look at the bookie's bank over the course of the runs.
R v R: Bookie gains (>17,500 over ~600 bets)
O v O: No gains
UB v UB: No gains (~2.5e-12 over ~2500 bets, within rounding error of 0)
ESB v ESB: No gains (~6e-13 over ~350 bets, within rounding error of 0)
FP v FP: No gains
FLB v FLB: No gains
FUB v FUB: Bookie gains (>17,500 over ~600 bets)
FR v FR: Bookie gains (>17,500 over ~1500 bets)
CB v CB: No gains
HCB v HCB: No gains
CESB v CESB: No gains (~1e-12 over ~800 bets, within rounding error of 0)
HCESB v HCESB: No gains (~8e-13 over ~750 bets, within rounding error of 0)
OB v OB: No gains
HOB v HOB: No gains
There aren't any surprises here. The Oracle is obviously not Dutch Book-able, and neither are any of the Bayesian methods. The plug-in estimator isn't Dutch Book-able here since it isn't making different decisions on either side of the bet and is implemented as being coherent.
Interestingly, the FLB isn't exploited here while the FUB is. I suspect that this is because of the way the problem is set up (with fixed, positive stakes). If we analyze the expected long-run gain associated with a bet, we find that it makes sense to be conservative. Let p be an estimated bias of the coin and let q be the 'true' bias of the coin. A bettor pays the fair price 100p for a bet on Heads and expects to win 100q on average, for an expected gain of 100(q - p). This is positive when q > p, so there's some advantage to underestimating the true bias of the coin. If the bookie could make the stakes negative (as in a traditional Dutch Book), then this underestimation would become just as unfavorable. So while FLB isn't Dutch Booked here, the fact that FUB is susceptible means that FLB is, in principle, Dutch Book-able as well.
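The 100(q − p) expectation is easy to check numerically. A quick Monte Carlo sketch (pure Python; names and parameters are mine, for illustration):

```python
import random

def mean_bettor_gain(q, p, stakes=100.0, n=200_000, seed=0):
    """Monte Carlo estimate of a bettor's average gain per bet when paying
    the fair price p * stakes for a Heads ticket on a coin with true bias q.
    The analytic expectation is stakes * (q - p)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        won = rng.random() < q
        total += (stakes if won else 0.0) - p * stakes
    return total / n

# With q = 0.7 and p = 0.6 the analytic expectation is 100 * (0.7 - 0.6) = 10.
```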
And FR is Dutch Book-able for obvious reasons -- it's probabilistically incoherent. All three of these strategies (R, FUB, and FR) are exploitable on their own, and the bookie's gain in those matchups is funded by their sure losses.
Off-diagonal results
More on the role of miscalibration
The off-diagonal head-to-head matches are perhaps more interesting, since in these cases the bookie's sure gain is not funded by the sure loss of any particular strategy but is instead funded by long run expected loss, to the degree that these strategies are uncalibrated. As discussed above with regard to FUB, the fact that the stakes are fixed and positive mitigates the punishment for estimating too low but retains the punishment for estimating too high, so we can say that this scenario punishes positive miscalibration in that sense. We'll see how this crops up in the results shortly. But keep in mind that if the bookie had the opportunity to alter the stakes from positive to negative, then the downsides faced by positively miscalibrated reasoners would apply to negatively miscalibrated ones as well, without corresponding benefit.
The poor performers
The win-count table in the appendix has some mixed results. It is not surprising that the Random bettor does very poorly -- worse than almost any other bettor. It is not only uncalibrated but also unresponsive to data. What is slightly surprising is that it is only tied for second worst, with FUB. The worst performer happens to be the Extreme Subjective Bayesian bettor. The poor result for the ESB is not surprising in itself, but the fact that it manages to underperform even FUB is a testament to the classic Humean advice to apportion one's belief to one's evidence. Extreme prior positions that are not warranted by evidence are worse than random choice, in this case.
The better performers
It is also unsurprising that the Oracle bettor wins compared to all others -- it essentially cheats by knowing what the coin's true bias is when it bets. Second best, however, is the Frequentist Plug-in bettor.
To be fair, that the plug-in estimator performs well on this problem shouldn't be very surprising either, given the other competitors. The plug-in bettor bets at the maximum likelihood estimate which, by design, minimizes long run expected loss -- the very quantity at issue in this scenario. The MLE sits at the intersection of all confidence intervals, being the peak of the confidence density, and so, in some sense, represents an asymptotically minimal confidence interval (~0%). In contrast, the various (non-high-alpha) calibrated estimators use a 95% confidence level. If we were to set the confidence level near zero, they too would match the MLE, and this is borne out by the two high-alpha Bayesian estimators, which match the win performance of the plug-in estimator. So, if we concede that the MLE is the best solution to the long-run expected loss problem and that the plug-in estimator gives us the MLE, then perhaps the more interesting results are in the estimators that use a nonzero confidence bound.
Just below those bettors, the Objective Bayes method, with its 95% confidence level (alpha = 0.05), wins two more rounds than the FLB at the same confidence level. This makes sense since the OB will typically bet within the interval when possible, which is closer to the MLE than the lower bound is. The Calibrated Bayes method wins one more round outright than the FLB, but when counting inconclusive sets of runs, these three methods appear to "win" similarly.
The rest of the performers
One interesting result for me is the middling performance of the two calibrated Extreme Subjective Bayes methods. We know that the uncalibrated method is the worst, but it apparently does so poorly that not even a 95% alpha level can recover its performance. Stay away from strong and unwarranted prior beliefs, kids. Or, better yet, become an Objective Bayesian ;P.
The bookie's bank, part deux
We looked at one aspect of the bookie's bank when considering the diagonal (symmetric) pairs of bettors and saw that the probabilistically consistent methods are insulated from sure loss. For the off-diagonal pairs, we know that the bookie is going to gain, but does the degree of gain of the bookie correspond to the miscalibration of a betting method? (See appendix)
What the results show is that calibration indeed limits the rate at which the bookie can exploit a betting pair. Looking at the mean bookie gain rates, we see that the terribly performing ESB bettor sits, as expected, near the bottom of the chart, while its calibrated cousin, CESB, is nearly on par with the Oracle, with only its high-alpha cousin between them.
So, it appears that being uncalibrated can indeed increase the rate at which the bookie exploits you in this scenario, but this cannot be the whole story. The Oracle, by definition correct and therefore ideally calibrated, itself sits in the middle of the chart. I suspect, especially given the dominant placement of FLB, that the selection bias of the bookie (only making advantageous bets) forces less-than-ideally calibrated bettors into what looks like a preference for underestimates, particularly when placed against other calibrated estimators. I suspect that in a scenario where the bookie controlled the sign of the stakes, the Oracle bettor would sit correctly at the top of this list.
Despite these complications, it appears that the bookie's sure gain is disproportionately funded by particularly badly calibrated estimators, and that calibrated estimators, when put head-to-head, severely limit the bookie's sure gain, as hypothesized.
Closing Thoughts and Takeaways
This experiment was interesting and had a number of shortcomings that I didn't anticipate beforehand, particularly the bias introduced by combining fixed, positive stakes with the bookie's selection criterion. I would like to revisit this shortcoming in the future and see whether removing it better supports (or invalidates) the claim that the bookie's gain is funded by the degree of miscalibration.
Also, I may choose to come back to this post and provide an analysis of a broader "sea" of bettors rather than just head-to-head analysis, though it seems to me that it would just degenerate to a mixture of these head-to-head analyses: the bookie should end up exploiting the worse-calibrated methods until they crash, then move on to the remaining contenders.
Walking away from this experiment, we can see that the scenario in the author's original post does not show that just anyone is exploitable. While it correctly demonstrates the dangers of uncalibrated inference procedures such as those used by subjectivist Bayesians, it also allows us to show that Bayesianism is not incompatible with calibration, and that objective Bayesianism in particular remains quite viable in the face of it. Indeed, the (high-alpha) objective Bayesian methods perform on par with the frequentist methods when put head-to-head, largely because they use frequentist methods to constrain their inference. Frequentists, however, must still contend with guaranteed sure loss in traditional Dutch Books, as the diagonal cases demonstrate.
Appendix
Head-to-Head Charts
Inc = Inconclusive (worth 0.5 point in secondary scoring)
| | Random | Oracle | Uncalibrated Bayes | Extreme Sub Bayes | Freq Plug In | Freq LB | Freq UB | Freq Random | Calibrated Bayes | High Alpha Cal Bayes | Calib Ext Sub Bayes | High Alpha CESB | Objective Bayes | High Alpha OB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | | Oracle | UB | Random | FP | FLB | Inc | FR | CB | HCB | CESB | HCESB | OB | HOB |
| Oracle | Oracle | | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle | Oracle |
| Uncalibrated Bayes | UB | Oracle | | UB | FP | FLB | UB | UB | CB | HCB | UB | UB | OB | HOB |
| Extreme Sub Bayes | Random | Oracle | UB | | FP | FLB | FUB | FR | CB | HCB | CESB | HCESB | OB | HOB |
| Freq Plug In | FP | Oracle | FP | FP | | Inc | FP | FP | FP | Inc | FP | FP | FP | Inc |
| Freq LB | FLB | Oracle | FLB | FLB | Inc | | FLB | Inc | Inc | Inc | FLB | FLB | Inc | Inc |
| Freq UB | Inc | Oracle | UB | FUB | FP | FLB | | FR | CB | HCB | Inc | Inc | OB | HOB |
| Freq Random | FR | Oracle | UB | FR | FP | Inc | FR | | CB | HCB | FR | FR | OB | HOB |
| Calibrated Bayes | CB | Oracle | CB | CB | FP | Inc | CB | CB | | HCB | CB | CB | OB | HOB |
| High Alpha Cal Bayes | HCB | Oracle | HCB | HCB | Inc | Inc | HCB | HCB | HCB | | HCB | HCB | HCB | Inc |
| Calib Ext Sub Bayes | CESB | Oracle | UB | CESB | FP | FLB | Inc | FR | CB | HCB | | CESB | OB | HOB |
| High Alpha CESB | HCESB | Oracle | UB | HCESB | FP | FLB | Inc | FR | CB | HCB | CESB | | OB | HOB |
| Objective Bayes | OB | Oracle | OB | OB | FP | Inc | OB | OB | OB | HCB | OB | OB | | HOB |
| High Alpha OB | HOB | Oracle | HOB | HOB | Inc | Inc | HOB | HOB | HOB | Inc | HOB | HOB | HOB | |
Win Counts
| Bettor | Wins | Wins with Incs |
| --- | --- | --- |
| Oracle | 13 | 13 |
| Freq Plug In | 9 | 10.5 |
| High Alpha Cal Bayes | 9 | 10.5 |
| High Alpha OB | 9 | 10.5 |
| Objective Bayes | 8 | 8.5 |
| Freq LB | 6 | 8.5 |
| Calibrated Bayes | 7 | 7 |
| Uncalibrated Bayes | 6 | 6 |
| Freq Random | 5 | 5.5 |
| Calib Ext Sub Bayes | 3 | 3.5 |
| High Alpha CESB | 2 | 2.5 |
| Freq UB | 1 | 2 |
| Random | 1 | 1.5 |
| Extreme Sub Bayes | 0 | 0 |
Bookie Gain Rates: Head-to-Head
The numbers in this grid are read off of charts generated under a linear assumption. There are better ways to do it, but I felt lazy.
| | Random | Oracle | Uncalibrated Bayes | Extreme Sub Bayes | Freq Plug In | Freq LB | Freq UB | Freq Random | Calibrated Bayes | High Alpha Cal Bayes | Calib Ext Sub Bayes | High Alpha CESB | Objective Bayes | High Alpha OB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | | 33.33 | 34.29 | 40.00 | 35.00 | 28.00 | 43.75 | 36.67 | 36.67 | 36.36 | 37.50 | 37.50 | 31.43 | 36.36 |
| Oracle | 33.33 | | 9.09 | 36.36 | 8.40 | 6.25 | 20.00 | 12.50 | 8.70 | 8.33 | 17.50 | 17.50 | 9.00 | 8.33 |
| Uncalibrated Bayes | 34.29 | 9.09 | | 37.00 | 2.00 | 3.86 | 19.17 | 10.00 | 0.53 | 2.00 | 1.71 | 16.67 | 0.60 | 2.00 |
| Extreme Sub Bayes | 40.00 | 36.36 | 37.00 | | 36.00 | 30.00 | 50.00 | 36.67 | 36.36 | 36.36 | 21.67 | 20.00 | 36.36 | 36.36 |
| Freq Plug In | 35.00 | 8.40 | 2.00 | 36.00 | | 0.00 | 17.50 | 8.93 | 1.50 | 0.00 | 15.71 | 15.25 | 1.56 | 0.00 |
| Freq LB | 28.00 | 6.25 | 3.86 | 30.00 | 0.00 | | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Freq UB | 43.75 | 20.00 | 19.17 | 50.00 | 17.50 | 0.00 | | 36.00 | 18.18 | 18.18 | 33.33 | 33.50 | 18.00 | 18.00 |
| Freq Random | 36.67 | 12.50 | 10.00 | 36.67 | 8.93 | 0.00 | 36.00 | | 9.17 | 9.23 | 17.50 | 16.67 | 9.60 | 9.17 |
| Calibrated Bayes | 36.67 | 8.70 | 0.53 | 36.36 | 1.50 | 0.00 | 18.18 | 9.17 | | 1.50 | 16.67 | 16.67 | 0.00 | 1.50 |
| High Alpha Cal Bayes | 36.36 | 8.33 | 2.00 | 36.36 | 0.00 | 0.00 | 18.18 | 9.23 | 1.50 | | 16.00 | 15.38 | 1.50 | 0.00 |
| Calib Ext Sub Bayes | 37.50 | 17.50 | 1.71 | 21.67 | 15.71 | 0.00 | 33.33 | 17.50 | 16.67 | 16.00 | | 0.00 | 16.67 | 16.67 |
| High Alpha CESB | 37.50 | 17.50 | 16.67 | 20.00 | 15.25 | 0.00 | 33.50 | 16.67 | 16.67 | 15.38 | 0.00 | | 0.00 | 16.67 |
| Objective Bayes | 31.43 | 9.00 | 0.60 | 36.36 | 1.56 | 0.00 | 18.00 | 9.60 | 0.00 | 1.50 | 16.67 | 0.00 | | 16.67 |
| High Alpha OB | 36.36 | 8.33 | 2.00 | 36.36 | 0.00 | 0.00 | 18.00 | 9.17 | 1.50 | 0.00 | 16.67 | 16.67 | 16.67 | |
Bookie Mean Gain Rate
The numbers in this chart are based on the columns of the previous chart. They indicate the association between the presence of a bettor in a pair and the per-bet exploitability of that pair.
| Bettor | Mean Gain Rate |
| --- | --- |
| Freq LB | 5.239010989 |
| Uncalibrated Bayes | 10.68574759 |
| Objective Bayes | 10.87546898 |
| Freq Plug In | 10.91141636 |
| High Alpha Cal Bayes | 11.14290837 |
| Calibrated Bayes | 11.34162359 |
| High Alpha OB | 12.44055944 |
| Oracle | 15.02309213 |
| High Alpha CESB | 15.83086785 |
| Calib Ext Sub Bayes | 16.22527473 |
| Freq Random | 16.31482108 |
| Freq UB | 25.0472028 |
| Extreme Sub Bayes | 34.85780886 |
| Random | 35.91217116 |