Forecasting measured the wrong thing

25 Apr 2026
The Oracle of Delphi Entranced, Heinrich Leutemann

Forecasting has become enormously better over the last fifteen years. Tournaments, superforecasters, and prediction markets have matured, and LLMs are now closing in on the best human forecasters. Metaculus, Manifold, Polymarket, and Kalshi have billions of dollars moving through them.

But the point of a prediction is to change a decision. Forecasters saw COVID coming, and yet the actual response to it was slow and fragmented. Making good predictions is legible and measurable, but how those predictions impact the world is not. We optimised the thing that was measurable and gave no thought to how those predictions were used in the real world.

Goodharting on predictions

The only purpose of a prediction is to have some impact on the real world; otherwise it’s just navel-gazing. But making good predictions is much more legible than coupling them to real-world impact. What gets measured gets improved, so people optimised for prediction accuracy without paying attention to how those predictions get used.

Actual impact is much harder to quantify: the lag between a prediction being made and that prediction being operationalised is long, and attribution is murky. No-one gets a career boost from focusing on this, so few people do. As a result, the impact of accurate predictions on the world is hard to detect.

Outcomes are an afterthought

IARPA’s ACE program was intended to “dramatically enhance the accuracy, precision, and timeliness of intelligence forecasts”, and it delivered. The Good Judgement team beat IARPA’s accuracy targets by such a margin that IARPA stopped funding the other teams after two years. But when asked whether the work influences policy decisions, Barbara Mellers, one of the project’s leaders, said “we certainly hope it does.”

Everything that followed inherited the frame of this tournament. Over the following years, forecasting and prediction markets entered the public consciousness and half a dozen platforms appeared, many of them trading real money. But ultimately people were trading either for status or to make money off their predictions – not for impact.

Ahead of the curve on COVID

A Metaculus user flagged the Wuhan coronavirus as a plausible starting point for a pandemic in January 2020, and by the end of that month the platform’s predictions were pricing in over 100,000 infections. Good Judgement was similarly predicting 100,000–200,000 cases by March 20.

This was a hit. The prediction markets were genuinely ahead. Some individuals did update, especially in the EA and rationalist spheres – stocking up and preparing themselves and their families for what was coming.

At a large scale, however, nothing happened. No-one who mattered actually updated on any of these predictions. Lots of things could have happened: public health decision-making pipelines could have included a step that consulted forecasts, and government bodies could have been clued in to what prediction markets were saying. There were even some prediction-market-pilled people in key roles, like Dominic Cummings, who pushed an evidence-based approach to the pandemic response in the UK government.

The notable exception is telling: Shannon Gifford was both a superforecaster and a projects officer in the Denver mayor’s office. In January she pushed her colleagues to listen to these predictions, and succeeded. She was someone who could bridge the two worlds and actually get something done with the predictions being made.

Essentially the bottleneck wasn’t accuracy; solid predictions existed and turned out to be directionally correct. But there was no mechanism for coupling those predictions to a response among the people who mattered.

Coupling forecasts to impact

In some sense forecasting is a solved problem and doesn’t need more resources directed at it – predictions will keep improving as AI starts to consistently beat even the best forecasters. But cheap LLM forecasting makes the prediction–action gap even worse: there will be more and better predictions, but without a working pipeline to translate them into outcomes they will have no effect.

Status shouldn’t be a function of prediction accuracy alone; forecasters should also be judged on their real-world impact. Informing the conversation or being widely cited is not sufficient – forecasters should ideally be able to point to a specific outcome or shift in government policy (or, at minimum, the meaningful inclusion of their predictions in a decision-making process with teeth).

So instead of pushing prediction accuracy ever higher, the focus needs to move to making sure those predictions get operationalised. Decisionmakers need to be persuaded to precommit to taking action in response to particular forecasts: if a pandemic-related prediction market moves past some threshold, the body in question is committed to acting on it. The CDC could have policy ‘triggers’ tied to Metaculus questions or to confident Good Judgement predictions.
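As a very rough sketch of what a precommitted trigger could look like if it were written down explicitly – the question wording, threshold, and response below are all hypothetical, and the probability is supplied by hand rather than read from any particular platform’s API:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicyTrigger:
    """A precommitted action tied to a forecast threshold."""
    question: str               # hypothetical forecast question this trigger watches
    threshold: float            # probability at which the agency has precommitted to act
    action: Callable[[], None]  # the precommitted response

    def evaluate(self, probability: float) -> bool:
        """Fire the precommitted action if the forecast crosses the threshold."""
        if probability >= self.threshold:
            self.action()
            return True
        return False


def activate_pandemic_plan() -> None:
    # Stand-in for a real precommitted response: stockpile orders, travel advisories, etc.
    print("Threshold crossed: activating precommitted pandemic response plan")


# Hypothetical trigger: act once the community forecast of a large outbreak exceeds 20%.
trigger = PolicyTrigger(
    question="Will there be more than 100,000 reported cases by 2020-03-31?",
    threshold=0.20,
    action=activate_pandemic_plan,
)

# In practice the probability would be read from a prediction platform;
# here it is supplied by hand to keep the sketch self-contained.
trigger.evaluate(probability=0.27)
```

The point is not the code but the precommitment: the threshold and the response are written down before the forecast moves, so acting on it no longer depends on anyone’s judgement in the moment.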

We should also try to get forecasters embedded in decision-making orgs to bridge the gap between prediction and action – we need more Shannon Giffords. Chief Forecasting Officer should be a prominent job title in government agencies: a specific point person who tracks prediction markets and superforecaster output and ensures it feeds into that agency’s decisionmaking process. Cummings attempted something like this in the UK government and failed, so it needs to happen at the agency level and within the civil service rather than in the upper echelons of government.

To reduce Goodharting, the stated outcomes of prediction markets can be changed to bake in the real-world effect of those predictions, pricing in implementation as well as accuracy. A funder could pilot funding-specific conditional markets asking, for example: if funding is granted to an org, will a specific real-world outcome follow?
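One way such a conditional market could resolve is sketched below; the three-way resolution rule and the refund-on-void convention are assumptions for illustration, not any existing platform’s mechanics:

```python
from enum import Enum
from typing import Optional


class Resolution(Enum):
    YES = "yes"    # funding was granted and the stated outcome occurred
    NO = "no"      # funding was granted but the stated outcome did not occur
    VOID = "void"  # funding was not granted, so trades are refunded


def resolve_conditional_market(funding_granted: bool,
                               outcome_occurred: Optional[bool]) -> Resolution:
    """Resolve a market that is conditional on the funder actually acting.

    Because the market only pays out on the branch where funding happens,
    traders are pricing the real-world effect of the grant, not merely
    whether the grant is made.
    """
    if not funding_granted:
        return Resolution.VOID
    return Resolution.YES if outcome_occurred else Resolution.NO


# Hypothetical example: the org was funded but missed the stated outcome.
print(resolve_conditional_market(funding_granted=True, outcome_occurred=False))
```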

Decision audits could be used to build accountability on the decisionmakers’ side – public retrospectives asking “did a relevant, high-quality forecast exist, and was it acted on?” This creates the missing career incentive for government bodies to incorporate forecasts into their decisionmaking process.

For the longest-tail risks, catastrophe-bond-style instruments (like the bonds insurers issue against hurricane risk) could incorporate prediction market data and pay out if the specific prediction pans out. This ties the prediction to a real financial consequence for the underwriter.

Without real-world outcomes, forecasting is just a hobby. As a field it needs to aim higher: changing the decisions of the people whose choices matter most. We need more Shannon Giffords, not more Tetlocks.