Dumb econometrics questions/bleg on forecast probabilities

I'm bad at econometrics. I've got a couple of simple questions that ought to have simple answers. Noah Smith's post (HT Mark Thoma) reminded me of them. There are probably other students at the back of the econometrics class who don't know the answer either, so I'm going to ask for all of us.


There are two weather forecasters. Each day each forecaster tells us the probability P(t) that it will rain tomorrow. We have data on 365 observations of P(t) for each forecaster. We also have data on 365 observations of a dummy variable R(t), where R(t)=1 means it did in fact rain and R(t)=0 means it did not.

A perfect weather forecaster would only ever say "P(t)=1.0" or "P(t)=0.0", and would always be right. So R(t)=P(t) for all t.

But if both forecasters are imperfect, how do we use the data to tell us which forecaster was better?  Or how good each one is on a scale with pure guess at one end and a crystal ball at the other end? How can we estimate the degree to which a forecaster was over-confident or under-confident in his own forecasting ability? What about bias?

Simple intuitive answers preferred, if possible. Thanks.

Comments

  1. Daniel

    One obvious answer is to use the likelihood. For day t, this is P(t) if R(t)=1 and 1-P(t) if R(t)=0. Then take the product across all the days.
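
    A minimal sketch of that calculation in Python, with made-up forecasts and outcomes for four days standing in for the 365 observations:

    ```python
    import numpy as np

    # Hypothetical forecasts and outcomes (purely illustrative numbers).
    p = np.array([0.9, 0.7, 0.2, 0.6])       # forecast probabilities of rain
    rained = np.array([1, 0, 0, 1])          # R(t): 1 if it rained, 0 if not

    # Per-day likelihood: P(t) on rainy days, 1 - P(t) on dry days; then take the product.
    per_day = np.where(rained == 1, p, 1 - p)
    print(per_day.prod())                    # 0.9 * 0.3 * 0.8 * 0.6 = 0.1296
    ```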

  2. Britonomist

    What I would do is make an accuracy score for each prediction, for each day: if it rains the score is the forecaster’s probability it will rain, if it doesn’t rain the score is 1 – probability. Then you can either sum across all days to get the cumulative performance and compare with the other forecaster, or you can work out the average score as well as the standard deviation of each forecaster and compare that way, the standard deviation allowing you to test how significantly they deviate from each other. Furthermore, it allows you to compare against random chance, by seeing whether the score is above 0.5.
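
    For illustration, a sketch of this scoring scheme on invented data (the forecasts for A and B and the rain record are made up):

    ```python
    import numpy as np

    p_a = np.array([0.9, 0.7, 0.2, 0.6])    # forecaster A's probabilities (invented)
    p_b = np.array([0.6, 0.5, 0.4, 0.5])    # forecaster B's probabilities (invented)
    rained = np.array([1, 0, 0, 1])

    def daily_scores(p, rained):
        # Score = forecast probability if it rained, 1 - probability if it did not.
        return np.where(rained == 1, p, 1 - p)

    for name, p in [("A", p_a), ("B", p_b)]:
        s = daily_scores(p, rained)
        print(name, "sum:", s.sum(), "mean:", round(s.mean(), 3), "std:", round(s.std(ddof=1), 3))
    # A mean score above 0.5 indicates doing better than a pure coin-flip guess.
    ```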

  3. Nick Rowe

    Daniel and Britonomist: thanks. Are your answers the same, except Britonomist is adding up all the things that Daniel is multiplying together?

  4. Nick Rowe

    Or, wouldn’t sum squared [R(t)-P(t)] work the same, and be more Least Squaredish?

  5. Britonomist

    Well, if you take the nth-root of Daniel’s likelihood you have the geometric mean, which could be compared in a similar fashion to using my arithmetic mean.

  6. Ryan

    Maybe I’m stupid, but…
    …how do you assess the accuracy of P(t)? It either rains or it doesn’t, so unless P(t) is a perfect model, it will almost always be wrong, even in cases when P(t)=0.95 and it rains (because P(t) was not 1.0).
    Is the assumption that the 0.95 forecast was “5% incorrect” on the day it rained? If so, then I’m pretty sure the P(t) model will always appear less accurate than R(t), because P(t) is “inaccurate” even when it’s accurate.
    I can’t demonstrate this, but I’m putting it out there in hopes that the person who solves this problem confirms my suspicion.

  7. Nick Rowe

    Or, what about a forecaster who is really perfect, but who underestimates his own ability. So whenever he says “P=0.6” it always rains, and whenever he says “P=0.4” it never rains? How can we distinguish between someone like that, and another forecaster who is always perfectly confident but who sometimes gets it wrong?

  8. Nick Rowe

    Ryan: you are asking the same “stupid” question as me. But it has an answer. I know it must have an answer!

  9. Mathias

    I’ll give it a try: to determine which forecaster to prefer the ratio p(data|A)/p(data|B) could be a good indicator (reason below), where p(data|A) is the probability of observing the data given the predictions of forecaster A. If the ratio is much larger than 1 then A is better, if the ratio is much smaller than 1 then B. p(data|A) can be calculated as p(data|A) = PRODUCT_t [ (P(t) == 1.0) * R_A(t) + (P(t) == 0.0) * (1 – R_A(t)) ]. Similarly p(data|B). The ratio tells us which to prefer.

    I’ll take the argument for why this could be a way to do it from the chapter “Model Selection” in “Data Analysis: A Bayesian Tutorial” by D. S. Sivia: We have observed some data and the forecasts of two forecasters A and B. We want to determine which forecaster to prefer. If we denote by p(A|data) the “probability/belief that A is right” given the data, then we can calculate p(A|data)/p(B|data). If this is larger than one, then we prefer A; if it’s much smaller than one, we prefer B. Using Bayes’ theorem this can be written as
    p(A|data)/p(B|data) = (p(data|A)/p(data|B)) * (p(A)/p(B)). If we have no a priori preference for A or B then p(A)/p(B) = 1 and we are left with p(data|A)/p(data|B).

    I’m curious if this is helpful (and simple enough :-)). It also doesn’t help much to determine if one forecaster is just guessing. To do that, I would expect that we have to know more (Example: how often we expect it to rain. In the Sahara someone will do really well by guessing “no rain”). So the approach above just helps to differentiate between the two weather forecasters.
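
    For concreteness, a sketch of that likelihood-ratio comparison on invented data, written with the rain indicator R(t) selecting between P(t) and 1-P(t) (i.e. with P and R the other way around from the product formula above, which is the swap Nick asks about below), and computed in logs to avoid underflow over 365 days:

    ```python
    import numpy as np

    # Invented forecasts for A and B and the rain record R(t).
    p_a = np.array([0.9, 0.7, 0.2, 0.6])
    p_b = np.array([0.6, 0.5, 0.4, 0.5])
    rained = np.array([1, 0, 0, 1])

    def log_likelihood(p, r):
        # p(data | forecaster) = PRODUCT_t [ R(t) * P(t) + (1 - R(t)) * (1 - P(t)) ]
        return np.log(r * p + (1 - r) * (1 - p)).sum()

    log_ratio = log_likelihood(p_a, rained) - log_likelihood(p_b, rained)
    print(np.exp(log_ratio))   # > 1 favours A, < 1 favours B (given equal priors)
    ```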

  10. Unknown

    Mathias beat me to it. You revise in favour of the forecaster who gave the higher probability for the event that actually occurred, according to Bayes’ rule.

  11. Britonomist

    Remember that in reality is deterministic, not probabilistic (with the exception of quantum physics), if you have enough information you should be able to produce a greater (or lesser) probability than someone with less information, so my method both penalizes an inability to give informative predictions (low deviation from 0.5 probability), and being wrong. I suppose this doesn’t account for confidence explicitly, low confidence will be penalized, that just illustrates that low confidence makes you less useful (unless you’re better than the more confident forecaster, in which case the other forecaster’s overconfidence is making him worse instead).

  12. Britonomist

    Reality is deterministic*

  13. Ritwik

    My first thought was to go with the maximum likelihood estimator too.
    But this may not work.
    For example, say the long run average of rainy days is 10%. A moderately naive person may fix P(t) = 0.1 for each day. A more rational Bayesian updater might begin with a prior of 0.1, and then update it with each day of rain or no rain. Ideally we should reward this person.
    Now consider a third person, who forecasts P(t) = 0.2 every day, and wins out simply because in that year rainy days were indeed 20%, so he maximized likelihood (additive, multiplicative, std deviation etc. doesn’t matter; it can be shown that a constant ex ante prediction that matches the realized ex post data series maximizes all likelihoods).
    The answer to our question is dependent upon whether we should indeed be using the past long run average of rain as the prior, or whether a de novo prior is justified. In other words, is the guy who predicted 0.2 for every day Warren Buffett, or an idiot savant? (Is Warren Buffett an idiot savant?)
    The test of forecast efficiency suffers from the same problem as do all tests of market efficiency – they are joint tests of the market model and the forecast. Do we know the true model of rain? If yes, MLE gives us the right answer. If no, MLE might fail to tell us who is the better forecaster. What then matters is replicability and longevity. Can the forecaster teach another forecaster to outperform the naive long run average setter or the Bayesian updater? Does the forecaster’s method perform in out of sample tests?

  14. Britonomist

    Oh and I’m pretty sure that two identical forecasters with the same method and information, but with one being less confident than the other, will still have the average accuracy.

  15. Britonomist

    The same average accuracy*, bah! I wish there was an edit function…

  16. Nick Rowe

    Mathias: “p(data|A) can be calculated as p(data|A) = PRODUCT_t [ (P(t) == 1.0) * R_A(t) + (P(t) == 0.0) * (1 – R_A(t)) ].”
    Is there a typo there? Did you switch P and R? Or am I even more confused than I think I am?
    Suppose we estimated the regression: R(t) = a + b.P(t) + e(t) for each of them. Would a high R^2 but low estimate for b (much less than one) tell us the forecaster is better than he thinks he is?

  17. Britonomist

    Since the dependent variable is binary there, I’m not sure R^2 is valid; you should use probit instead.
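
    A sketch of both suggestions (Nick's linear regression and a probit) on simulated data; the data-generating step is invented, and the probit call assumes the statsmodels package is available:

    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    p = rng.uniform(0, 1, 365)                        # invented forecasts
    r = (rng.uniform(0, 1, 365) < p).astype(float)    # outcomes drawn to match the forecasts

    # Nick's linear regression R(t) = a + b*P(t) + e(t).
    b, a = np.polyfit(p, r, 1)
    r2 = 1 - np.sum((r - (a + b * p)) ** 2) / np.sum((r - r.mean()) ** 2)
    print(f"OLS: a = {a:.3f}, b = {b:.3f}, R^2 = {r2:.3f}")   # b near 1, a near 0 = well calibrated

    # Britonomist's probit, which respects the binary dependent variable.
    probit = sm.Probit(r, sm.add_constant(p)).fit(disp=0)
    print(probit.params)
    ```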

  18. William

    what about a forecaster who is really perfect, but who underestimates his own ability.
    What about a forecaster whose method is perfect but can’t interpret his own results? He only ever gives probabilities of one or zero, but every time he gives P = 0.0 for rain, it does rain, and when he predicts P = 1.0 it never rains. Is he the best forecaster or the worst possible one?

  19. Scott Sumner

    Nick, define “better.” There are two dimensions in which you could judge forecast accuracy. The first is making correct predictions in a binary rain/no-rain sense (throw out the 50-50 calls, or tell the forecasters to never use precisely 50%). See which forecaster gets it right more often.
    The second dimension is over/underconfidence. Suppose a forecaster always calls it 80-20 (or that ratio on average), sometimes for rain, sometimes that it won’t rain. In that case you test the accuracy of claimed confidence by comparing the 80-20 ratio to the percentage of correct calls. If the forecaster is neither under- nor over-confident, you would expect him or her to be right 80% of the time. If they are right 90% of the time they are underconfident, and if they are right 70% of the time they are overconfident.
    It’s very possible that one forecaster will be “better” in the sense of getting it right more often, and the other will be closer to the proper degree of confidence. In that case you’d obviously want to take some sort of linear combination of the two forecasts to get the best estimate of the probability of rain.

  20. Nick Rowe

    Scott: that was sort of my thinking too. There’s information-content, bias, over/underconfidence, and overall accuracy.
    Britonomist. Ah yes, probit, because the errors will have a very ugly distribution, or something.
    William: yep. That’s an example of a forecaster with very high information content but negative self-confidence!

  21. Frances Woolley

    Nick, do you care how much it rains? That is, is a person who forecasts sunny weather just before Superstorm Sandy blows in a worse forecaster than the person who forecasts dry but overcast when in fact it was overcast with a millimetre of rain?

  22. Phil Koop

    Nate Silver himself gives a perfectly intelligible summary of the difference between bias and accuracy, and what accuracy means in this context:
    Bias, in a statistical sense, means missing consistently in one direction: for example, overrating the Republican’s performance across a number of different examples, or the Democrat’s. It is to be distinguished from the term accuracy, which refers to how close you come to the outcome in either direction. If our forecasts miss high on Mr. Obama’s vote share by 10 percentage points in Nevada, but miss low on it by 10 percentage points in Iowa, our forecasts won’t have been very accurate, but they also won’t have been biased since the misses were in opposite directions (they’ll just have been bad).

  23. Phil Koop

    Your posited forecaster who “is really perfect, but who underestimates his own ability” is a contradiction; all that matters is the forecasts (probabilities) that the forecaster gives. The problem, as you have set it up, is like this: every day (or poll) is treated as a binary random variable (biased coin), where the probability of “heads” itself is drawn from an unknown distribution. To be accurate means to correctly guess this unobservable bias on each trial, not to guess the outcome of each trial. That is not even being attempted.

  24. Thomas Holloway - @Zerodown0

    Recall the meaning of a 40% forecast = “On 40% of days with these conditions it will rain”. Or perhaps, “On 40% of days with this forecast it will rain”. Suppose all the forecasts were multiples of 10% (or rounded as needed). For each of the eleven possible forecast values, compare the forecast with the empirical frequency of rain on the days it was issued. The upshot here is that you are usually comparing a number in (0,1) with another, instead of comparing it to a Bernoulli trial (zero or one).
    Now you have a vector of accuracy across the forecast spectrum. This gives some insight on the shape of the error/bias – perhaps the forecast quality is high in the low range and biased upward in the high range.
    If you need a final utility score come up with a metric on the vector.
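
    A sketch of that binned comparison on simulated forecasts (the data here are invented, drawn so that the forecaster is well calibrated):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.uniform(0, 1, 365)                          # invented forecasts
    rained = (rng.uniform(0, 1, 365) < p).astype(int)   # outcomes consistent with the forecasts

    for f in np.unique(np.round(p, 1)):                 # forecasts rounded to the nearest 10%
        mask = np.round(p, 1) == f
        print(f"forecast {f:.1f}: {mask.sum():3d} days, empirical rain frequency {rained[mask].mean():.2f}")
    # A well-calibrated forecaster shows empirical frequencies close to the forecast in every row.
    ```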

  25. Phil Koop

    OK, so pace Ritwik, likelihood is in fact the orthodox measure to employ here. The question of bias is distinct from that of accuracy, as noted earlier, and in the case of polls there is no earlier sequence of data to draw on to estimate an average. In the case of weather, the same holds, because weather is not stationary.
    There is another way to quantify the issue worth mentioning. We are observing a sequence of random variables X_i, with each variable drawn from a different binary distribution with unobservable Pr(rain) = mu_i; our premise is that mu_i is itself a random variable, or the problem would collapse to a standard convergence in law.
    You are surely familiar with the standard version of the central limit theorem, in which the normalized sum of iid random variables converges to a normal distribution. There are also extended versions of the limit theorem which do not require the independent variables to be identical, provided that they meet some other technical conditions. One of these is Lindeberg’s version (which is satisfied here), which says that the quantity
    (1 / s_n) * sum_i (X_i – mu_i), where s_n^2 = sum_i mu_i * (1 – mu_i),
    converges toward the standard normal distribution. For each day (or poll), X_i is 1 if it rained, and mu_i is the forecaster’s given probability of rain. The degree to which a given forecaster’s sequence of forecasts converges to the standard normal distribution in this transformation is a measure of the goodness of those forecasts.
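
    A sketch of that statistic on simulated data, taking s_n to be the square root of the summed Bernoulli variances mu_i(1 - mu_i) as in the reconstruction above:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    mu = rng.uniform(0, 1, 365)                        # forecaster's stated probabilities
    x = (rng.uniform(0, 1, 365) < mu).astype(float)    # rain indicators, drawn here to match mu

    s_n = np.sqrt(np.sum(mu * (1 - mu)))               # root of the summed Bernoulli variances
    z = np.sum(x - mu) / s_n                           # the normalized sum
    print(z)                                           # roughly N(0,1) if the stated mu_i are right
    ```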

  26. Nick Rowe

    Thomas: but that doesn’t really work. Suppose I know nothing, except that it rains 30% of the time. So every day I say “30%”, and I am exactly right, because the frequency is in fact 30% when I say it’s 30%. I am unbiased, but otherwise a useless forecaster.
    Phil: “Your posited forecaster who ‘is really perfect, but who underestimates his own ability’ is a contradiction; …”
    OK. Suppose one forecaster had a perfect model (it’s a crystal ball), but he doesn’t know it’s perfect and doesn’t trust it. So he takes an average of the model’s forecast and the population mean probability. When the model says 0% he says 10% and when the model says 100% he says 90%. A second forecaster has a flawed model that he trusts perfectly, and so always says 0% or 100%. Both forecasters are imperfect, but they are imperfect in different ways. And it is much easier to fix the first forecaster’s imperfections. And a useful metric would let us know if a forecaster is making mistakes that can easily be fixed.

  27. Nick Rowe

    Frances: good point. We need a loss function. I think I was (implicitly) assuming a simple loss function, where the amount of rain doesn’t matter, and the cost of carrying an umbrella conditional on no rain is the same as the cost of not carrying an umbrella if it does rain?

  28. Nick Rowe

    Phil (quoting nate Silver): “If our forecasts miss high on Mr. Obama’s vote share by 10 percentage points in Nevada, but miss low on it by 10 percentage points in Iowa, our forecasts won’t have been very accurate,…”
    But that is a point estimate of a continuous variable. I’m talking about a probabilistic estimate of a binary variable. “What is the probability of rain/O wins?”

  29. J.V. Dubois

    Hmm. In this case, given the data the best thing I could do is to evaluate those forecasters utilizing information theory. So if based on actual observations we know the “objective” probability distribution of binary outcomes (Rain/NotRain) we may then calculate the Shannon entropy and therefore an information value of respective forecasts. I would say that the winner is the forecaster who transmitted most information.
    PS: more information about the subject here: http://en.wikipedia.org/wiki/Entropy_(information_theory)
    PPS: anyways, contrary to what some people may think, this is not how I would evaluate forecasters. It would be better to evaluate their forecasting models instead of the actual probability forecasts. If you had access to the models, you would be able to evaluate not only a model’s raw predictive (or, to be more precise, explanatory) power, but also its stability. The best way to think about this is to imagine that observations are points in space where the dimensions are the variables that you use for prediction. The actual model can be represented by a plane that you constructed using these points in space.
    Now you can have a fantastic model with 95% predictive power and it could still be useless. Why? Imagine that your observations are tightly packed roughly along a straight line along the x dimension. So you construct a plane and that is your model. Now imagine that just one observation shifts a little. Suddenly your plane changes slope, and the further you get from your observations the greater the difference in prediction. You find out that slight changes in the observations lead to widely different predictions. Such a model is useless; it is just a random match to past data. There are mathematical tools that allow you to calculate the stability of your model. There are more complicated things involved, like a model that is generally stable but shows large local instabilities in some range of observations, etc.
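
    One way (not necessarily the one J.V. Dubois has in mind) to put a number on the information in the forecasts: compare the average surprise at the actual outcomes under the forecasts with the average surprise under the climatological base rate alone. Invented numbers again:

    ```python
    import numpy as np

    p = np.array([0.9, 0.7, 0.2, 0.6])   # invented forecasts
    rained = np.array([1, 0, 0, 1])
    base = rained.mean()                 # climatological base rate as the no-information reference

    # Average surprise (in bits) at the actual outcomes, under the forecasts and under the base rate.
    forecast_surprise = -np.log2(np.where(rained == 1, p, 1 - p)).mean()
    base_surprise = -np.log2(np.where(rained == 1, base, 1 - base)).mean()
    print("information gained per day (bits):", base_surprise - forecast_surprise)
    ```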

  30. Госбанк

    Nick:
    Not sure you can compare those two probabilities (“O wins” vs weather forecast) as they most likely mean different things in different contexts.
    The major and uncomfortable problem with the notion of “probability” is that it admits multiple, sometimes incompatible, interpretations (about half a dozen), thus leading to confusion while trying to understand what exactly the interlocutor means by “probability”. Perhaps the p-word has to be defined each time before any such use.
    While climatologists use something close to the standard frequentist interpretation that they modestly call “climatological probability”, the “O wins” probability can be most charitably interpreted as subjective probability of de Finetti’s kind (“subjective degree of belief”), the political scientists’ quantitative models notwithstanding.
    Re. “climatological probability”:

    Because each of these categories occurs 1/3 of the time (10 times) during 1981-2010, for any particular calendar 7-day period, the probability of any category being selected at random from the 1981-2010 set of 30 observations is one in three (1/3), or 33.33%.

    http://www.cpc.ncep.noaa.gov/products/predictions/erf_info.php?outlook=814&var=p

  31. Phil Koop

    “I’m talking about a probabilistic estimate of a binary variable.”
    Suppose that there are only two types of election, those which Republicans are 70% likely to win and those that Democrats are 70% to win, and these types occur with equal frequencies. Then a forecast that assigns every election a 50-50 chance is unbiased but useless.

  32. Phil Koop

    ‘if based on actual observations we know the “objective” probability distribution of binary outcomes’
    If we knew this, the problem would be much easier.

  33. Unknown

    That’s a useful canonical example that illustrates why we’d like to see the whole vector. Moreover, it’s the extreme version of the theoretically valid strategy of consistently averaging in the information-less prior of 30% to every forecast.
    In this way, every forecast-year is a linear combination of the information-less 30% constant and a non-trivial component. Perhaps there’s a way to parse each forecast-year into components.

  34. Ed Seedhouse

    Well the weather forecasters at least admit that their predictions are not perfect, and then go on to quantify the uncertainty of their prediction. So what they predict is not actually whether it will rain or not on any particular day, but a probability distribution. Thus they should be judged on how close their predicted probability distributions are to the actual distributions.
    So if the forecast is a 60% probability of precipitation, we expect rain on 60% of the days when this forecast is given, and if instead we consistently get 50% or 70% then this is evidence that the forecast is biased in one way or the other. If the forecaster is doing a good job we would expect a fairly narrow distribution, presumably roughly “normal” around the predicted frequency. Whether it rains or does not rain on any particular day taken by itself is beside the point, it seems to me. If the actual distribution of precipitation over time coincides closely with the predicted distribution then the forecaster is doing a good job. If it doesn’t then he isn’t.

  35. Phil Koop

    “Suppose one forecaster had a perfect model (it’s a crystal ball) … it is much easier to fix the first forecaster’s imperfections.”
    You are merely assuming the second sentence. There is no reason why it should be easier to increase “confidence”, interpreted as a metaphor, than to improve the “crystal ball”, interpreted as a metaphor, it is just that your choice of metaphors has made this seem plausible. The complete model is the combination of the crystal ball plus confidence (plus star-gazing plus sunspots plus whatever else you want to add) and the complete model is what we must assess.

  36. Arin D

    A bit surprised that no one has mentioned that the most common way to assess forecasts would be by taking the forecast with the smallest mean square prediction error E[(R(t)-P(t))^2]. Lowest MSPE is what you want to pick if you have a quadratic loss function.
    Of course, if you are particularly sensitive to certain contingencies, such as thresholds like Frances suggests, you have a different loss function. If you, say, are made worse off by rain when no rain is predicted, but not vice versa (because the cost of carrying an umbrella is small), then you would calculate on what fraction of the 365 days forecaster 1 made an error of that sort as opposed to forecaster 2. And pick whichever performed better.

  37. Frank Restly

    You would need some measure of what a person who is not a weather forecaster could reasonably guess. In that way you determine what value a weather forecaster adds. One method would be to gather a long enough time series to estimate the probability of rain on any particular day from past experience. This would serve as a baseline against which to measure both deviations in observed weather and deviations in predicted weather.
    P(T) is prediction of a weather forecaster for day T
    R(T) is weather on day T
    S(T) is probability of rain on day T based upon all previous R(T), or S(T) = (sum of R(t) as t goes from 1 to T-1) / ( T-1 )
    Suppose our non-weather forecaster bases his guess entirely on previous weather
    P(T) = S(T) = ( sum of [ R(t) ] as t goes from 1 to T-1 ) / ( T – 1 )
    The deviation of his guess at time T from the actual value would be:
    | S(T) – R(T) | / R(T)
    The average deviation A would be:
    A = ( sum of [ | S(t) – R(t) | / R(t) ] as t goes from 1 to T ) / T
    And so you can expect a non-weather forecaster to have an average deviation of A. For our weatherman:
    P(T) = F(T) * S(T)
    F(T) represents the factor that the weatherman applies to the data set to take into account his own experience and senses. The average deviation (AF) for our weatherman would be:
    AF = ( sum of [ | F(t) * S(t) – R(t) | / R(t) ] as t goes from 1 to T ) / T
    To calculate the value (V) of the weatherman, you must consider his accuracy above and beyond what a non-weatherman could guess.
    V = ( A – AF ) / A
    If the weatherman has exactly the same average deviation that the non-weatherman does, then his value is:
    V = ( A – A ) / A = 0
    If the weatherman calls the weather perfectly then his value is:
    V = (A – 0) / A = 1
    To compare two weathermen, you would need to compute the value of each.
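
    A sketch along these lines, with one change: the division by R(t) would blow up on dry days, so plain absolute deviations |S(t) - R(t)| and |P(t) - R(t)| are used instead (all numbers invented):

    ```python
    import numpy as np

    rained = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)   # invented rain record R(t)
    p = np.array([0.1, 0.2, 0.8, 0.1, 0.7, 0.2, 0.1, 0.9, 0.2, 0.1]) # the weatherman's forecasts P(t)

    # Naive benchmark S(t): running average of all previous days (0.5 on day 1, before any data).
    s = np.empty_like(rained)
    s[0] = 0.5
    s[1:] = np.cumsum(rained)[:-1] / np.arange(1, len(rained))

    A = np.abs(s - rained).mean()    # benchmark's average deviation
    AF = np.abs(p - rained).mean()   # weatherman's average deviation
    V = (A - AF) / A                 # value added: 0 = no better than the benchmark, 1 = perfect
    print(round(A, 3), round(AF, 3), round(V, 3))
    ```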

  38. rpl

    Nick, the metric you propose in your 9:03 post is called the “Brier Score” and is in fact commonly used to evaluate forecasting skill. The Wikipedia article on the subject describes a couple of ways of decomposing it into terms that can be related to intuitive concepts like “reliability”.
    http://en.wikipedia.org/wiki/Brier_score
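
    A sketch of the Brier score and the three-part decomposition the article describes (reliability - resolution + uncertainty), on invented forecasts:

    ```python
    import numpy as np

    p = np.array([0.1, 0.1, 0.7, 0.7, 0.7, 0.9, 0.9, 0.1, 0.1, 0.9])   # invented forecasts
    rained = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1], dtype=float)

    brier = np.mean((p - rained) ** 2)      # mean squared [R(t) - P(t)]

    # Murphy's decomposition over the distinct forecast values: Brier = REL - RES + UNC.
    base = rained.mean()
    rel = res = 0.0
    for f in np.unique(p):
        mask = p == f
        obs = rained[mask].mean()               # how often it actually rained when f was forecast
        rel += mask.sum() * (f - obs) ** 2      # reliability (calibration error)
        res += mask.sum() * (obs - base) ** 2   # resolution (how much the forecasts discriminate)
    rel, res, unc = rel / len(p), res / len(p), base * (1 - base)
    print(brier, rel - res + unc)           # the two numbers agree
    ```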

  39. Nick Rowe

    OK. I think I’m getting the intuition of Mathias’ answer (similar to Daniel’s answer in the first comment). You work out the likelihood of observing the data, conditional on the forecast being true. So if on day 1 A says 90%, and it does rain, that’s 90%. If on day 2 A says 70%, and it doesn’t rain, that’s 30%. Putting the two days together that’s a probability of 0.9 x 0.3 = 0.27. Etc for all 365 days. Then use Bayes’ theorem to get the probability of the forecast, conditional on the data, as the likelihood x the prior of the forecast/the prior of the data. When we do this for both forecasters, and take the ratio, and assume we have equal prior confidence in both forecasters, all the priors cancel out.
    That probably wasn’t very clear.

  40. Frank Restly

    Nick,
    From Mathias above:
    “I’m curious if this is helpful (and simple enough :-)). It also doesn’t help much to determine if one forecaster is just guessing. To do that, I would expect that we have to know more (Example: how often we expect it to rain. In the Sahara someone will do really well by guessing “no rain”). So the approach above just helps to differentiate between the two weather forecasters.”
    To determine the value of each weatherman, you would also need a reasonable set of predictions by a non-weatherman. Like Mathias mentions, having a weatherman in the Sahara desert forecast no rain does not tell you much about his / her value. The Brier score will tell you how accurate a weatherman is in relation to what the actual weather is, but will tell you nothing about how much more accurate a weatherman is compared to, say, sticking your head out the window or venturing a guess based upon the time of year.

  41. Nick Rowe

    Frank: we could always construct a fake weatherman C, who just makes the same average forecast every day, and repeat the likelihood ratio test.
    Neat stuff this Bayesian thing. I think I’m getting the gist of it.
    As for my other question, about how we could tell if a forecaster was over- or under-confident in his predictions, we could also construct a fake weatherman A’, by taking an S-shaped function of A’s forecasts, to push those forecasts either closer to 0%-100%, or closer to 50%, and then see if A’ does better than the original A.
    Arin: that was what I was wondering about in my 9.03 am comment, which rpl says is called a Brier Score.
    I didn’t realise there would be several quite different answers to my question.
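
    A sketch of the fake-forecaster idea: C issues the climatological average every day, and A' is built by scaling A's log-odds by a factor gamma (an S-shaped map; gamma > 1 pushes forecasts toward 0%/100%, gamma < 1 toward 50%). All data are simulated, with A deliberately made under-confident:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    true_p = rng.uniform(0, 1, 365)
    rained = (rng.uniform(0, 1, 365) < true_p).astype(float)
    p_a = 0.5 + 0.5 * (true_p - 0.5)          # forecaster A: right direction, shrunk toward 50%

    def log_lik(p, r):
        return np.log(np.where(r == 1, p, 1 - p)).sum()

    def recalibrate(p, gamma):
        # S-shaped map: scale the forecast's log-odds by gamma.
        odds = (p / (1 - p)) ** gamma
        return odds / (1 + odds)

    print("fake weatherman C (climatology):", round(log_lik(np.full(365, rained.mean()), rained), 1))
    for gamma in [0.5, 1.0, 2.0, 3.0]:
        print(f"A' with gamma = {gamma}:", round(log_lik(recalibrate(p_a, gamma), rained), 1))
    # If some gamma > 1 beats gamma = 1, A was under-confident; if gamma < 1 wins, over-confident.
    ```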

  42. Phil Koop

    I have to surrender to “Brier Score”; I apologize for my ignorance.
    The 3-component decomposition mentioned in the Wikipedia article is particularly instructive (the Brier score can be decomposed into “Uncertainty” – the unconditioned entropy of the event, “Reliability” – how well the forecast probabilities match the true probabilities, and “Resolution” – how much the forecast reduces the entropy of the event.)

  43. Frank Restly

    Suppose you had two weathermen #1 and #2. Weatherman #1 gets it right 95% of the time and weatherman #2 gets it right 90% of the time.
    In the Sahara, suppose a non-weatherman has an 80% chance of guessing the weather just from living there. The relative value of weatherman #1 to weatherman #2 is:
    (95% – 80%)/(90% – 80%) = 1.5
    On the Florida coast, suppose a non-weatherman has only a 50% chance of guessing the weather from living there. The relative value of weatherman #1 to weatherman #2 shrinks to:
    (95% – 50%)/(90% – 50%) = 1.13
    And of course there are other measures of value. The average deviation of a weatherman’s predictions tells us nothing about the volatility of those predictions. Which kind of weatherman would be more valuable?
    1. One that on average gets things pretty close, but swings wildly from overestimating the chances of rain to underestimating them
    2. One that on average overestimates the chances of rain, but also consistently overestimates the chances of rain

  44. rpl

    “Having a weatherman in the Sahara forecast no rain does not tell you much about his/her value.”
    Well, not until the unusual day when it does rain. Then you learn everything about the forecaster’s value (sort of — real weather forecasting has a dimension of time phasing as well). Metrics based on Bayesian methods have the same problem.
    “The Brier score will tell you how accurate a weatherman is in relation to what the actual weather is, but will tell you nothing about how much more accurate a weatherman is compared to say sticking your head out the window or venturing a guess based upon the time of year.”
    Typically, short-range weather forecasts are compared to the persistence forecast (i.e., that the weather won’t change from what it is now). Long-range forecasts might use the climatological average. For other applications you usually have two forecasting methods that you want to compare to one another to see which is more skillful, so each serves, after a fashion, as a benchmark for the other.
    Also note that if you interpret the ratio of the Bayesian-derived scores as an odds ratio, then you’re implicitly assuming that the events being forecast are independent, which may or may not be a good assumption. (For short-range weather forecasts, it probably isn’t.)

  45. Min

    Hmmm. Maybe Kelly’s Theorem works, as well. 🙂
    Give each forecaster an initial bankroll of $1, and let her make an even bet of (2P – 1)B of rain or no rain for each day, where B is their current bankroll and P is their probability of rain or no rain, whichever they predict to be more likely. (They are not betting against each other, but against the Banker in the Sky.) Who ends up with the bigger bankroll?

    The correct but timid prognosticator who is always right about which option is more likely, but underestimates, so that she predicts a 60% chance of rain or no rain, will always bet 20% of her bankroll and will always win, ending up with a bankroll of $1.2^365. The overconfident prognosticator who always predicts a 100% chance of rain or no rain will always bet 100% of her bankroll and will, unless perfect, go bust.

    Here is another betting scheme, which is kinder to the all or nothing predictor. Let the two predictors bet against each other, giving odds such that their subjective expectation for each bet is $1. See who wins overall.
    Suppose that the timid predictor above (T) predicts a 60% chance of rain, while the confident predictor (C) predicts rain. Then T bets $1 that it will not rain while C bets $2.50 that it will rain. C will win the $1. OTOH, when their predictions differ, T will bet $1 and C will bet $1⅔. T will win the $1⅔.
    Under this scheme C will win more money if she is correct more than 62.5% of the time, even though, from a Bayes/Kelly standpoint, she is the worse predictor because sometimes she makes a wrong prediction with 100% certainty. 😉
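
    A simulation sketch of this bankroll comparison (the forecasters and the weather are invented: a timid-but-right forecaster versus an all-or-nothing one working from the same information):

    ```python
    import numpy as np

    rng = np.random.default_rng(4)
    true_p = rng.uniform(0, 1, 365)
    rained = (rng.uniform(0, 1, 365) < true_p).astype(int)

    p_timid = np.where(true_p >= 0.5, 0.6, 0.4)        # right direction, never more than 60% sure
    p_confident = np.where(true_p >= 0.5, 1.0, 0.0)    # same information, always all-in

    def final_bankroll(p, rained):
        bank = 1.0
        for prob, r in zip(p, rained):
            side = 1 if prob >= 0.5 else 0             # bet on the outcome called more likely
            f = abs(2 * prob - 1)                      # fraction of the bankroll staked
            bank *= (1 + f) if r == side else (1 - f)  # even-money payoff
        return bank

    print("timid:", final_bankroll(p_timid, rained))
    print("confident:", final_bankroll(p_confident, rained))
    ```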

  46. Robert Cooper

    The best forecasters would be the best gambler, so gauge the forecasters by their expected return if they bet their total worth every day.
    Example:
    Suppose the forecaster gives a 70% chance of rain, and has a net worth of $1.00. They would bet $1.00 on “rain” at 7 to 3, and would bet $1.00 (on margin) for “no rain” at 3 to 7. If it rained, they’d end up with a net position of $2.33 (7/3 dollars) and if it shined, they’d end up with $0.43 (3/7 dollars).
    Weather in San Francisco is rainy about 20% of the time. A forecaster who always gives a 20% chance of rain will have an expected return of 0%. A forecaster who gives a 1% chance of rain will be “right” 80% of the time but will have a rate of return of about -95%. A forecaster who changed odds every day could have a very positive expected return.
    The logarithmic rate of return of these bets is an average of logits (http://en.wikipedia.org/wiki/Logit). An average of logits almost certainly has some information theoretic interpretation, but I’d need to sit down with pen and paper to work out exactly what.

  47. Thomas

    I’m siding with it being a higher-dimensional problem: you need to specify a loss or utility for a prediction of p with actual outcomes 0 or 1.
    In general, if you don’t want to specify a loss function you could use the one implied by the likelihood, or you could use the log predicted probability, which has some theoretically pretty properties (it’s related to entropy). In the case of independent binary outcomes these are the same.
    The log probability is the only proper scoring rule that is a function only of p, but “proper scoring rule” is less desirable than the name makes it sound. A proper scoring rule is one where it’s always to your advantage to quote your actual predicted probability of an event. It’s pretty clear that in weather forecasting this isn’t true. Nate Silver writes about weather forecasting in his new book and points out that the US national weather service quotes their actual probabilities (which are pretty well calibrated), but the Weather Channel quotes higher probabilities of rain at the low end because it is actually worse to be wrong by predicting no rain than to be wrong by predicting rain.
    The Brier score is another useful default, but like all defaults it sweeps under the carpet the question of what scoring rule you actually want.
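
    A small numerical check that the log score is proper, i.e. that quoting your honest probability maximizes your expected score (the true probability 0.3 and the grid are arbitrary):

    ```python
    import numpy as np

    def expected_log_score(p, q):
        # Expected log score when the true chance of rain is p but you quote q.
        return p * np.log(q) + (1 - p) * np.log(1 - q)

    p_true = 0.3
    grid = np.linspace(0.01, 0.99, 99)
    print(grid[np.argmax(expected_log_score(p_true, grid))])   # about 0.30: honesty maximizes the score
    ```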

  48. Greg Ransom

    Nick, your question seems to assume that weather is a pre-determined closed system which is subject to Laplacean determinism and Laplacean predictive mastery.
    If it is not such a system, then running history over in repeated ‘trials’ would produce different results each time from the same set of ‘given’ initial conditions knowable to the two weather forecasters.
    Re-run history and different forecasters can come out the ‘better’ forecaster.
    Are you familiar with the concept of sensitivity to initial conditions, or with nonlinear dynamic systems, or systems involving fluid dynamics with turbulence, which are mathematically intractable?

  49. Frank Restly

    Min,
    “Give each forecaster an initial bankroll of $1, and let her make an even bet of (2P – 1)B of rain or no rain for each day, where B is their current bankroll and P is their probability of rain or no rain, whichever they predict to be more likely. Who ends up with the bigger bankroll?”
    That doesn’t sound quite right. The Kelly bet is:
    p = probability of rain based on the gambler’s intuition
    b = payoff on wager (for instance 4:1, this represents the “house” odds)
    f = fraction of bankroll to bet
    r = result of wager (1 = win, 0 = loss) – Obviously, this is also the weather results (1 = rain, 0 = no rain)
    w = winnings from wagering
    You are assuming that b is always equal to 1, that the “house odds” for rain is always 50%. If that is the case, then you get the simplified form:
    f = 2p – 1
    w = 1 + f * ( rb – 1 )
    Instead the full form is:
    f = ( p + p/b – 1/b )
    w = 1 + f * ( rb – 1 ) = 1 + ( p + p/b – 1/b ) * ( rb – 1 )
    The total payout over a timeframe T would be:
    Product [ 1 + ( p + p/b – 1/b ) * ( rb – 1 ) ] as t goes from 1 to T – Note that p, b, and r are all functions of t
    Compare it to this:
    a = b / (1 + b) : This converts the house odds b (for instance 4:1 odds of rain) into a percentage a (for instance 80% chance of rain)
    “House” deviation (H)= Sum of [ | a – r | / T ] as t goes from 1 to T
    Gambler deviation (G) = Sum of [ | p – r | / T ] as t goes from 1 to T
    Value (V) of the gambler would be:
    V = ( H – G ) / H
    It seems to me that the payout method for evaluating a weatherman / gambler is only as good as the accuracy of the house odds. If the actual weather significantly differs from the weather predicted by the house, then that would skew the results of the betting strategy. A few long odds successful big bets against the house could overwhelm a lot of short odds losing bets.
    Instead I think it is better to look at a ratio of the gamblers deviations from actual weather in comparison with the house deviations.
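
    A sketch of the general-odds Kelly arithmetic with invented numbers; note the win multiplier is written here as 1 + f*b (a win pays f times the net odds), and negative Kelly fractions are floored at zero (no bet):

    ```python
    import numpy as np

    def kelly_fraction(p, b):
        # Kelly fraction for net odds b ("4:1" means b = 4) and subjective win probability p.
        return p + p / b - 1.0 / b                     # same as (b*p - (1 - p)) / b

    p = np.array([0.8, 0.3, 0.6])                      # invented subjective rain probabilities
    b = np.array([1.0, 4.0, 1.5])                      # invented house odds each day
    r = np.array([1, 0, 1])                            # 1 = rain (wager won), 0 = no rain (lost)

    f = np.clip(kelly_fraction(p, b), 0.0, None)       # never stake a negative fraction
    w = np.where(r == 1, 1 + f * b, 1 - f)             # a win pays f*b, a loss costs f
    print(np.prod(w))                                  # cumulative payout over the period
    ```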

  50. Min

    Frank Restly: “You are assuming that b is always equal to 1, that the “house odds” for rain is always 50%.”
    I am postulating that. No learning from the wagering occurs. The Great Banker in the Sky does not care about winning or losing.
    Frank Restly: “Instead I think it is better to look at a ratio of the gamblers deviations from actual weather in comparison with the house deviations.”
    I am not assessing the prognostication vs. actual weather, but one prognosticator vs. another. That makes a big difference. The comparison is indirect, by having both play against the house. The payoff for each is the product of the probabilities they assigned to the actual outcomes, divided by (0.5)^365. In the ratio of their payoffs, the denominators drop out. This is equivalent to the Bayes comparison.
    Note: You can permute the wagers. For instance, suppose that the prognosticator predicts rain with a probability of 80% on one day and no rain with a probability of 60% on another day, and it rains both days.
    Order 1:
    Day 1: Prediction of rain of 80%. Bet of $0.60, which wins. New bankroll: $1.60.
    Day 2: Prediction of no rain of 60%. Bet of $0.32, which loses. New bankroll: $1.28.
    Order 2:
    Day 1: Prediction of no rain of 60%. Bet of $0.20, which loses. New bankroll: $0.80.
    Day 2: Prediction of rain of 80%. Bet of $0.48, which wins. Now bankroll: $1.28.
    All same same. 🙂
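
    A two-line check of that permutation claim (same bets, both orders):

    ```python
    def settle(bank, p, won):
        f = 2 * p - 1                                   # even-money bet of (2P - 1) of the bankroll
        return bank * (1 + f) if won else bank * (1 - f)

    print(settle(settle(1.0, 0.8, True), 0.6, False))   # rain call wins, then no-rain call loses
    print(settle(settle(1.0, 0.6, False), 0.8, True))   # reverse order; both give 1.28
    ```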
