Dumb econometrics questions/bleg on forecast probabilities

I'm bad at econometrics. I've got a couple of simple questions that ought to have simple answers. Noah Smith's post (HT Mark Thoma) reminded me of it. There are probably other students at the back of the econometrics class who don't know the answer either, so I'm going to ask for all of us.


There are two weather forecasters. Each day each forecaster tells us the probability P(t) that it will rain tomorrow. We have data on 365 observations of P(t) for each forecaster. We also have data on 365 observations of a dummy variable R(t), where R(t)=1 means it did rain tomorrow and R(t)=0 means it did not rain tomorrow.

A perfect weather forecaster would only ever say "P(t)=1.0" or "P(t)=0.0", and would always be right. So R(t)=P(t) for all t.

But if both forecasters are imperfect, how do we use the data to tell us which forecaster was better?  Or how good each one is on a scale with pure guess at one end and a crystal ball at the other end? How can we estimate the degree to which a forecaster was over-confident or under-confident in his own forecasting ability? What about bias?

Simple intuitive answers preferred, if possible. Thanks.

76 comments

  1. Min

    Robert Cooper: “Suppose the forecaster gives a 70% chance of rain, and has a net worth of $1.00. They would bet $1.00 on “rain” at 7 to 3, and would bet $1.00 (on margin) for “no rain” at 3 to 7. If it rained, they’d end up with a net position of $2.33 (7/3 dollars) and if it shined, they’d end up with $0.43 (3/7 dollars).”
    Isn’t that backwards? The first bet is $1 vs. $0.43 in favor of rain. The second bet is $1 vs $2.33 against rain. If it rains, the gambler wins the first bet but loses the second. Result: $0.43. If it does not rain, the gambler loses the first bet but wins the second. Result: $2.33. If the probability of the forecast is correct, the expectation of both bets together is $1.
    If I think that the chance of rain is 70%, how do I find someone else to give me odds of 7:3 on rain?

  2. DocMerlin

    The problem is that this isn’t actually covered in econometrics or probability classes. You have to go to a meteorology department to learn how to do this without making stupid mistakes.
    A good example of why good skill scores are needed is this model, which was actually used by a real weather forecaster back in the day. It predicts that on any given day there will not be a tornado in a given town. The forecaster claimed he would be right 98% of the time, but what you care about is the day when there in fact is a tornado. He got no false positives, but way too many (100%) false negatives. His score was biased, in a sense very different from what econometricians call bias.
    The Heidke skill score is a better measure of forecast skill.
    Here’s a really simple example for yes-no forecasts:
    a = Forecast=Yes, Reality=Yes
    b = Forecast=Yes, Reality=No
    c = Forecast=No, Reality=Yes
    d = Forecast=No, Reality=No
    We build the Heidke skill score by comparing how well our model does relative to a coin flip, and how well a perfect model does relative to a coin flip. Then, to make the numbers nice, we take the ratio of those two results:
    HSS = (number correct – expected number correct with a coin flip)/(perfect model’s number correct – expected number correct with a coin flip)
    This simplifies to:
    HSS = 2(ad – bc)/[(a+c)(c+d) + (a+b)(b+d)]
    An HSS of 1 means a perfect forecaster, 0 means the forecaster has no skill, and a negative value means flipping a coin actually does better than the forecaster.
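    A quick sketch of the score, using the standard form of the formula and hypothetical counts for a year of tornado forecasts (the numbers are made up for illustration, not from the thread):

```python
# A minimal sketch of the Heidke Skill Score for a 2x2 contingency table.
# a, b, c, d follow the definitions above; the counts are hypothetical.

def heidke_skill_score(a, b, c, d):
    """HSS = 2(ad - bc) / [(a+c)(c+d) + (a+b)(b+d)]."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

# A perfect forecaster: no misses, no false alarms.
print(heidke_skill_score(30, 0, 0, 335))   # 1.0
# The "no tornado, ever" forecaster: right 335 days out of 365, yet HSS = 0.
print(heidke_skill_score(0, 0, 30, 335))   # 0.0
```

    Note how the always-"no" forecaster from the tornado story scores exactly zero skill despite being right 98% of the time.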
    There are many other types of skill scores. They differ in how they treat rare events and non-events, and systematic vs. random errors. You can extend skill scores from a 2×2 table to a larger table for more complex forecasts. These won’t do for probabilistic forecasts, though.
    For probabilistic forecasts, instead of weighing false positives against false negatives, you are weighing sharpness against reliability.
    Here are some skill scores for probabilistic forecasts.
    The Ignorance Skill Score:
    Let f_t be the predicted probability of the event occurring at time t, lying on the open interval (0,1). (The ignorance skill score assumes that we are never 100% sure about anything.) The ignorance skill score has units of “bits” — yes, it’s the same thing as we talk about when we speak of “bits” in a computer. It traces its foundations to information theory.
    And let:
    Ignorance_t(f_t) = –log_2(f_t) when the event happens at time t, and
    Ignorance_t(f_t) = –log_2(1 – f_t) when the event does not happen at time t.
    T = number of time periods t.
    The expected ignorance is computed the normal way:
    Ignorance(f) = Sum_over_all_t( Ignorance_t(f_t) ) / T
    Standard errors for our estimate of ignorance are also computed the normal way.
    Back to your original question: we can then compare the two forecasters by seeing which one is more ignorant, i.e. which one has the higher average ignorance.
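    A minimal sketch of that comparison in Python (the forecasts and outcomes are made up for illustration):

```python
import math

# Mean ignorance for a sequence of probability forecasts f_t and outcomes
# r_t (1 = event happened, 0 = it did not). All data here are hypothetical.

def mean_ignorance(forecasts, outcomes):
    total = 0.0
    for f, r in zip(forecasts, outcomes):
        total += -math.log2(f) if r == 1 else -math.log2(1 - f)
    return total / len(forecasts)

outcomes     = [1, 0, 1, 1, 0]
forecaster_a = [0.8, 0.3, 0.9, 0.7, 0.2]  # sharp and well-aimed
forecaster_b = [0.5, 0.5, 0.5, 0.5, 0.5]  # pure coin flip: 1 bit every time

print(mean_ignorance(forecaster_a, outcomes))  # well under 1 bit per forecast
print(mean_ignorance(forecaster_b, outcomes))  # exactly 1.0
```

    The less ignorant forecaster (lower mean ignorance) is the better one; the coin-flipper sits at exactly one bit per forecast.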
    Explanation:
    That was not intuitive, so next let’s try to come up with an intuitive way to explain it.
    Let’s define a function that is “a measure of the information content associated with the outcome of a random variable.” Since it’s a measure of information, it should have the following properties:
    1) The self-information of A_i depends only on the probability p_i of A_i happening.
    2) It’s a strictly decreasing function of p_i: the higher the probability of an event, the less information we gain when it occurs.
    3) It’s a continuous function of p_i. We don’t want finite changes in information from infinitesimal changes in probability.
    4) If an event A is the intersection of two independent events B and C, then the information we gain from learning that A happened should equal the information from learning B happened plus the information from learning C happened.
    Said another way: if p_1 = p_2 · p_3 then I(p_1) = I(p_2) + I(p_3).
    Luckily, only one class of functions fulfills these criteria:
    I(event x) = k · log(p(x)), where k can be any negative number, so we pick k to give us units of bits:
    I(event x) = (1/ln 2) · ln(1/p(x)) = –log_2(p(x)), where p(x) is the probability of that event happening.
    Now let’s define a sort of measure of our surprise: the information we gain from seeing the results of our predictions. If the event happened, the knowledge we gained from our probability forecast is –log_2(f_t). If instead the event did not happen, we gained evidence for the alternative event, so we gain –log_2(1 – f_t) bits.
    Let’s work this out for a few events.
    We think there’s a 10% chance of James winning an election in 2010. James loses, so we gain –log_2(0.9) ≈ 0.15 bits of info. We gained very little information, because a 10% chance of James winning means his loss was close to certain.
    We think there’s a 90% chance of Bill winning his election. Bill wins, so we gain –log_2(0.9) ≈ 0.15 bits. Again we gain very little information, because again 90% is close to certainty.
    Bill gets caught cheating on his wife with a goat. We think there’s a 1% chance of Bill winning his next election. He manages to win. We are very surprised! We gain a lot of information this time: –log_2(0.01) ≈ 6.64 bits.
    James turns out to have done a great job in office. We think there is a 90% chance that he gets reelected. But we are surprised; he loses. We gain –log_2(0.1) ≈ 3.32 bits of information.
    Then our total information gained is:
    0.15 + 0.15 + 6.64 + 3.32 ≈ 10.27 bits of information.
    Our expected ignorance as a forecaster is about:
    10.27/4 ≈ 2.57 bits per forecast.
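    The arithmetic in the four-election example can be checked directly (a tiny sketch; the probabilities are the ones given above):

```python
import math

# The four forecasts above, as (probability assigned to the event, whether
# the event happened).
forecast_event = [
    (0.10, False),  # James 2010: 10% to win, he loses
    (0.90, True),   # Bill: 90% to win, he wins
    (0.01, True),   # Bill after the goat: 1% to win, he wins anyway
    (0.90, False),  # James reelection: 90% to win, he loses
]

total = sum(-math.log2(f if happened else 1 - f)
            for f, happened in forecast_event)
print(round(total, 2))        # 10.27 bits in total
print(round(total / 4, 2))    # 2.57 bits per forecast
```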

  3. DocMerlin

    I wrote this way early in the morning, before going to bed, so it may have typos.

  4. DocMerlin

    Another way to think of the Ignorance skill score: it is an estimate of the difference in surprise, measured in binary bits, between you and an omniscient forecaster.

  5. DocMerlin

    Ok, a grammar, typo, and example corrected version of my explanation is here:
    http://entmod.blogspot.com/2012/11/skill-scores-re-nick-rowe.html

  6. Frank Restly

    Min,
    “Note: You can permute the wagers. For instance, suppose that the prognosticator predicts rain with a probability of 80% on one day and no rain with a probability of 60% on another day, and it rains both days
    Day 1: Prediction of rain of 80%. Bet of $0.60, which wins. New bankroll: $1.60.
    Day 2: Prediction of no rain of 60%. Bet of $0.32, which loses. New bankroll: $1.28.”
    What I was referring to is the effect that one long-odds bet has on the bankroll. Suppose the house odds for rain on a particular day are 1 million to 1 against (a 0.0001% chance of rain), but it happens to rain on that day.
    Player: Prediction of rain 100%. Bet of (1 + 1/1000000 – 1/1000000)=$1.00 which wins. New bankroll = $1,000,000
    What effect does one long odds bet have on the bankroll in a finite number of gambles? For instance suppose that two gamblers are given 1000 guesses, but on one of those days the odds of rain are significantly higher or lower than the total number of guesses given (1 million to 1, or 1 to a billion). In a betting strategy, that would place a premium on guessing that day correctly.
    “The Great Banker in the Sky does not care about winning or losing.”
    But the payout is a function of what the prevailing house odds are. The Great Banker in the Sky may not care about winning or losing, but the payout is determined by the house odds set by that Banker.
    “This is equivalent to the Bayes comparison.”
    I don’t think so. It has to do with the effect that one “lucky guess” can have on the net result. In the Bayes calculation, if the actual weather deviates from the house odds by 99.9999% on a single day, then the net effect of one long shot bet for that day on the results of 1000 bets is 99.9999% / 1000 bets = 0.099999%. Meaning it will shift the result of the Bayes calculation about 0.1%. However, the effect of the long shot bet for that day on the gamblers 1000 bets can be much more significant.

  7. DocMerlin

    The gambler idea is a bad skill score, because it makes later winnings dependent on earlier winnings. This weighs earlier predictions higher than later ones.

  8. Frank Restly

    DocMerlin,
    “The gambler idea is a bad skill score, because it makes later winnings dependent on earlier winnings. This weighs earlier predictions higher than later ones.”
    The betting strategy being discussed sets a bet amount as a percentage of total holdings, not a fixed amount. And it really doesn’t matter when a win occurs because the percentage gain on holdings will carry through.
    For instance, three bet results
    Bet one: Win 15% of holdings
    Bet two: Lose 5% of holdings
    Bet three: Lose 5% of holdings
    It really doesn’t matter in what order these bets occur. The net result is the same: (1 + 0.15)(1 – 0.05)(1 – 0.05) ≈ 1.0379
    My issue had more to do with when the odds of winning on a particular day are significantly greater than or less than the total number of gamble chances that are given. That places a premium on winning those days, since the payout on those days can be significantly higher than the rest.
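    Frank’s order-invariance point can be sketched with the three bet results above: with bets sized as a fraction of current holdings, every ordering of wins and losses ends at the same bankroll.

```python
from itertools import permutations

# Returns per bet as multipliers on holdings: +15%, -5%, -5%.
returns = [1.15, 0.95, 0.95]

finals = set()
for order in permutations(returns):
    bankroll = 1.0
    for r in order:
        bankroll *= r
    # Round to absorb tiny floating-point ordering differences.
    finals.add(round(bankroll, 10))

print(finals)   # a single value: multiplication commutes
```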

  9. Robert Cooper

    Frank Restly: You convinced me that expected ignorance is the way to go. The perplexity of the forecaster’s ignorance, which you can interpret as a gamble, might be more intuitive for people who don’t like to think in bits.
    Suppose our forecaster must, by law, give a probability for tomorrow’s weather and accept bets with payouts of 1 / (probability he assigns to the event). For example, if he gives a 75% forecast of rain, he must offer anyone bets that pay 4/3 if it rains and 4/1 if it doesn’t rain.
    If you could always perfectly predict the future and always bet correctly against the forecaster, how much money could you make? By what factor can you expect to multiply your net worth per forecast, on average?
    Answers:
    Gambler’s interpretation: multiply the forecaster’s payouts on your correct bets over the year, and take the 365th root (the geometric average).
    Information-theoretic interpretation: the forecaster racks up an expected b bits of ignorance per forecast, so you can expect to multiply your net worth by a factor of 2^b per forecast.
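    The two interpretations can be checked against each other in a short sketch (forecasts and outcomes are made up): the geometric mean of the omniscient bettor’s payout multipliers equals 2 raised to the forecaster’s mean ignorance.

```python
import math

# Hypothetical forecast probabilities for the event, and what happened.
forecasts = [0.75, 0.40, 0.90, 0.20]
outcomes  = [1, 0, 1, 0]

# Omniscient bettor's multiplier each day: 1/f if the event happened,
# 1/(1-f) if it did not (payouts are 1/probability, as described above).
payouts = [1 / f if r else 1 / (1 - f) for f, r in zip(forecasts, outcomes)]
geo_mean = math.prod(payouts) ** (1 / len(payouts))

# Forecaster's mean ignorance in bits.
ign = sum(-math.log2(f) if r else -math.log2(1 - f)
          for f, r in zip(forecasts, outcomes)) / len(forecasts)

print(geo_mean)   # geometric-average wealth multiplier per forecast
print(2 ** ign)   # the same number, via the ignorance score
```

    The identity is just 2^(mean of log2 payouts) = geometric mean of payouts, so the gambler’s number and the bits number carry the same information.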

  10. Min

    Frank Restly: “What I was referring to is the effect that one long odds bet has on the bankroll. Suppose the house odds for rain on a particular day are a 1 million : 1 against (0.0001%. chance of rain) but it happens to rain on that day.”
    First, the house odds do not have to exist. (I think that they do, but that’s another question.) As I said, that has to do with how well a forecaster does with regard to a probabilistic reality. What I was doing with the Kelly scheme, in a slightly roundabout way, was comparing two forecasters against each other, in a way that does not require probabilistic reality. Reality can be deterministic, or non-deterministic without probabilities, or non-deterministic with non-numeric probabilities, and there are surely other possibilities for reality! 🙂
    Second, the Kelly comparison does not take into account the variability of results. That is indeed a concern. 🙂

  11. Greg Ransom

    Noam Chomsky on calculating statistical probabilities between correlated weather events vs. understanding complex weather systems:
    http://tillerstillers.blogspot.com/2012/11/noam-chomsky-on-ai-bayesianism-and-big.html

  12. Frank Restly

    Min,
    I understand what you are saying, but without a probabilistic reality to compare the gamblers against, they could both be flat-out guessers and one of them just happened to guess right more often.
    That is why I think the Bayes calculation makes more sense: even when comparing two gamblers, the house odds are not thrown out.
    Value of Gambler #1 = (House Deviation from Actual – Gambler #1 Deviation from Actual) / House Deviation from Actual
    Value of Gambler #2 = (House Deviation from Actual – Gambler #2 Deviation from Actual) / House Deviation from Actual
    The relative value of Gambler #1 to Gambler #2 is:
    (House Deviation from Actual – Gambler #1 Deviation from Actual) / (House Deviation from Actual – Gambler #2 Deviation from Actual)
    The relative value of one gambler to another becomes smaller as the house deviation increases. In a truly random event that is non-deterministic without probabilities, the relative value of one gambler to another is always 1 because the house deviation from actual will be infinite. Meaning in a random event, one gambler’s guess is as good as another’s. In a random event, even if one gambler gets more guesses right, the relative value of that gambler to another is still 1.

  13. Min

    @ Frank Restly
    The “house odds” do not appear in the Bayesian comparison. What you have is P(data | predictions of A)/P(data | predictions of B). That’s it. You do not have P(data | predictions of the House).

  14. Michael Bishop

    I thought I’d point you to this common measure of accuracy and calibration. http://en.wikipedia.org/wiki/Brier_score

  15. Phil H

    There seems to be a conceptual confusion in here between predictions for repeated events and predictions for single events.
    A probabilistic prediction for a single event (like rain/no rain today) is worthless/meaningless. For a single event, you just want a yes/no answer or a number (maybe generated from a probabilistic model, but you need a definite answer). For sets of events, a probabilistic prediction could be useful.
    The Brier score that some have mentioned is about how probabilistic predictions fare over a defined set of similar events.

  16. Greg Ransom

    Has anyone written a clear and comprehensive paper on the limits of applying the pure logic of probabilities and statistical math to non-linear dynamic phenomena with open-ended evolutionary unfolding and essential sensitivity to initial conditions?
    For example, what does statistical ‘science’ have to tell us about the unfolding of a Mandelbrot set? Anything?
    http://en.wikipedia.org/wiki/Mandelbrot_set
    Mark Stone wrote a good piece in the 1990s explaining how this sort of phenomenon defeats ‘prediction’ as imagined by Laplacean determinism.
    Is there a literature on this general topic?

  17. Patrick

    Greg: Yes. Ergodic theory.
    And I think you’re guilty of misrepresenting Laplace’s view. People tend to forget that he concluded:
    “All these efforts in the search for truth tend to lead back continually to the vast intelligence which we have just mentioned, but from which it will always remain infinitely removed”

  18. Patrick

    Oh, and optimal filtering is the practical machinery.

  19. DocMerlin

    @Phil H
    “The Brier score that some have mentioned is about how probabilistic predictions fare over a defined set of similar events.”
    The ignorance score is more sensitive for events with very high or very low probability than the Brier score is. Plus it has really nice units in terms of information theory.

  20. Госбанк

    Phil:
    You wrote:
    “A probabilistic prediction for a single event (like rain/no rain today) is worthless/meaningless”
    Not at all. It depends on what you mean by “probability”. In the frequentist interpretation, your statement would be correct, but that’s just one of many interpretations. A Bayesian, for example, would be quite comfortable assigning a probability to a single event.
    Now, weather forecasting is closer to the frequentist’s use of probabilities as mentioned above.

  21. Determinant

    Do Austrian economists buy life insurance? Keynes was president of a life insurance company, he knew risk, probability and uncertainty very well. Without aggregation and risk modelling, life insurance would not work. In fact it has worked for centuries.

  22. DocMerlin

    @Determinant:
    WTF is this coming from?
    Of course Austrians believe in financial risk modeling, they just believe that it has no place in economic theory. They believe that sort of modeling isn’t rich enough to handle the complexities of human behavior.

  23. Ken Schulz

    From the Theory of Signal Detection (SDT), there is a procedure for plotting a Receiver Operating Characteristic (ROC) curve for an observer based on the confidence of the decision/prediction, self-rated:
    http://radiology.rsna.org/content/229/1/3.full.pdf+html
    If you interpret forecast probability as a measure of confidence, the application is straightforward. SDT characterizes performance with separate measures of sensitivity and bias. Also, the sensitivity measures have true zero points at the chance level of performance.
    Of course, this is not an ‘econometric’ methodology; rather, it has been applied in fields from radio communications to human sensory psychology to medical diagnosis – any case of decision-making under uncertainty.
    Good luck.

  24. Simon van Norden

    Nick:
    Galbraith, John W. and Simon van Norden (2012), “Assessing gross domestic product and inflation probability forecasts derived from Bank of England fan charts,” Journal of the Royal Statistical Society, Series A, 175(3), pp. 713–727.
    Think in terms of the regression of R(t) on a constant and P(t). The forecast will be unbiased if the constant is zero and the estimated coefficient on P(t) is 1. Statisticians say such a forecast is “well-calibrated”. More precisely, that means E(R|P) = P. However, not all such forecasts are created equal. We’d also want to consider the R^2 from our regression. A higher R^2 implies that our P(t) has more explanatory power for R(t). Statisticians say that such a forecast has higher “resolution”.
    For forecast comparisons, statisticians typically like to ensure that both forecasts are well-calibrated and then compare their resolution. However, you might instead want to just compare MSFE (or, if you have a different loss function in mind, just compare expected loss). You could also do tests in the spirit of forecast encompassing. For example, suppose you have P1 and P2 and you’d like to compare them. Well, just regress R on a constant, P1 and P2. Do both have significant coefficients? If not, then you can say that the one with the insignificant coefficient adds nothing significant to the other forecast.
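    A minimal sketch of that calibration regression (the data are simulated so that the forecaster is well-calibrated by construction; seed, sample size, and forecast range are arbitrary choices):

```python
import numpy as np

# Regress R(t) on a constant and P(t). For a well-calibrated forecaster the
# intercept should be near 0 and the slope near 1; the R^2 measures resolution.
rng = np.random.default_rng(0)
T = 365
P = rng.uniform(0.05, 0.95, T)                 # forecast probabilities
R = (rng.uniform(size=T) < P).astype(float)    # outcomes drawn to match P

X = np.column_stack([np.ones(T), P])
(intercept, slope), *_ = np.linalg.lstsq(X, R, rcond=None)
print(intercept, slope)   # close to 0 and 1 for a calibrated forecaster

resid = R - X @ np.array([intercept, slope])
r2 = 1 - (resid @ resid) / ((R - R.mean()) @ (R - R.mean()))
print(r2)                 # resolution: higher is better
```

    Adding a second forecast column to `X` gives the encompassing regression Simon describes; whichever coefficient is insignificant adds nothing to the other forecast.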

  25. Simon van Norden

    Ken Schulz: Actually, Oscar Jorda (among others) has been using ROC analysis in economic applications. His recent paper in American Economic Review is a good example.

  26. Ken Schulz

    Simon van Norden,
    Thanks for the pointer. I should have said ROC analysis is not specific to any one discipline. I’m glad to see it’s finding application in economics.
    A little while back I was reading a discussion of entrepreneurship in the US – apparently new-business starts are down, but the success rate is up. I thought, well, yes, if entrepreneurs and investors have any ability to discriminate good from poor opportunities, that’s exactly what should be expected; false alarms should drop off faster than hits as the criteria tighten. I didn’t find anything relevant in a quick Google Scholar search; jumped too quickly to the conclusion that ROC wasn’t being used much in Econ.
    I am an engineering psychologist; this is purely an avocational interest.
