Surrogate scoring rules (SSR) are the method we used to score forecasters' accuracy in our survey. We encourage all forecasters to watch our short video about the method. This document has three sections: (1) Overview, (2) Implementation, (3) Example.
SSR Overview
Surrogate scoring rules enable us to score forecasts' accuracy using strictly proper scoring rules without relying on the events' realized outcomes. Strictly proper scoring rules (SPSR) are standard measures of prediction accuracy. Given a prediction p and a realized event outcome Y, an SPSR assigns the prediction a score S(p, Y). For example, consider a prediction p = 0.6 about an event Y (e.g., "It rains tomorrow."). If it rains tomorrow, we denote Y = 1; if it doesn't rain tomorrow, we denote Y = 0. Suppose that after tomorrow we learn Y = 1. According to the SPSR, the forecaster then receives the score S(0.6, 1). A commonly used choice for S(p, Y) is the Brier score, S(p, Y) = (p - Y)². Under the Brier score, a perfect prediction receives a score of zero, meaning it matches the event outcome exactly, while a prediction that is the exact opposite of the outcome receives a score of one. The smaller the Brier score, the more accurate the prediction.
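As a minimal illustration of the Brier score described above (a sketch in Python, with made-up example values):

```python
def brier_score(p: float, y: int) -> float:
    """Brier score: squared difference between prediction p and outcome Y.
    0 is a perfect score, 1 is the worst possible; lower is better."""
    return (p - y) ** 2

# A 60% rain forecast, scored after it rains (Y = 1):
print(round(brier_score(0.6, 1), 2))  # 0.16
# The same forecast if it had not rained (Y = 0):
print(round(brier_score(0.6, 0), 2))  # 0.36
```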
In our project, however, we do not know the outcomes of the claims at the time of scoring survey predictions. Therefore, we generate a surrogate event outcome for each claim and use this surrogate outcome together with SPSR to score the predictions for accuracy. Our method is hence called surrogate scoring rules.
To generate a surrogate event outcome for a claim when scoring your prediction, we first take the mean of all other forecasters' predictions on that claim, excluding yours. We denote this mean by q. We then generate the surrogate event outcome Y' = 1 for that claim with probability q (which means Y' = 0 with probability 1 - q). This surrogate event outcome may not be the same as the true event outcome, so we model the relationship between the surrogate outcome Y' and the true outcome Y using two error-rate parameters, e0 and e1:
- e0 represents the probability of Y' = 1 when Y = 0.
- e1 represents the probability of Y' = 0 when Y = 1.
Then, we assign your prediction p a surrogate score S'(p, Y'), where

S'(p, Y') = ((1 - e0)·S(p, 1) - e1·S(p, 0)) / (1 - e0 - e1) if Y' = 1,

and

S'(p, Y') = ((1 - e1)·S(p, 0) - e0·S(p, 1)) / (1 - e0 - e1) if Y' = 0.
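The two formulas can be written as one small function. This is an illustrative sketch, not the project's actual code; the base rule `brier` and the error rates below are example assumptions. It also demonstrates the key de-biasing property: with the correct e0 and e1, averaging S' over the surrogate outcome Y' (given the true outcome Y) recovers the true score S(p, Y).

```python
def surrogate_score(S, p, y_surr, e0, e1):
    """De-biased surrogate score S'(p, Y') built from a base scoring rule S.
    e0 = P(Y'=1 | Y=0), e1 = P(Y'=0 | Y=1); requires e0 + e1 < 1."""
    denom = 1 - e0 - e1
    if y_surr == 1:
        return ((1 - e0) * S(p, 1) - e1 * S(p, 0)) / denom
    return ((1 - e1) * S(p, 0) - e0 * S(p, 1)) / denom

def brier(p, y):
    return (p - y) ** 2

# Unbiasedness check with assumed error rates e0 = 0.2, e1 = 0.3:
# given true outcome Y = 1, Y' = 1 with probability 1 - e1, and Y' = 0 with e1.
e0, e1, p = 0.2, 0.3, 0.6
expected = (1 - e1) * surrogate_score(brier, p, 1, e0, e1) \
         + e1 * surrogate_score(brier, p, 0, e0, e1)
print(abs(expected - brier(p, 1)) < 1e-9)  # True: E[S'] equals the true score
```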
Our theorem shows that if we know the correct values of e0 and e1, then the mean surrogate score you receive will equal the true mean SPSR score once you answer a sufficiently large number of claims. To obtain the values of e0 and e1, we assume that e0 takes the same value across all claims in Phase 1, and likewise for e1. We then estimate these values from the collected predictions using the method of moments. A detailed introduction to surrogate scoring rules can be found in the paper Surrogate Scoring Rules. In our project, we set S(p, Y) to a rank-sum scoring rule; forecasts with higher rank-sum scores are considered more accurate.
Implementation of SSR in Phase I:
- We first remove predictions from users who completed fewer than 5 predictions in a given round.
- For an arbitrary user i, we compute e0 and e1 using the method of moments, following Section 5 of the paper Surrogate Scoring Rules.
- Consider a single batch of claims and a user i who completed that batch. For each claim, we compute for user i the aforementioned mean prediction q from the other users' predictions on the claim. User i's surrogate outcome for the claim is then viewed as being generated according to Bernoulli(q).
- For each claim in this batch, we compute S'(p, 1) and S'(p, 0) for user i according to the surrogate score formulas. In the formulas, S(p, Y) is set to the rank-sum score of user i's prediction p on this claim, calculated from all predictions user i made within the batch. The computation of the rank-sum score for a single prediction can be found in Section 2.5 of the paper Linear scoring rules for probabilistic binary classification. User i's score for this claim is the expected surrogate rank-sum score 𝔼_{Y' ~ Bernoulli(q)}[S'(p, Y')] = q·S'(p, 1) + (1 - q)·S'(p, 0).
- User i’s score for the batch is the total score (total expected surrogate rank-sum score) the user received on all claims in the batch. We denote this score as SSR_{i}^{rank-sum} for user i in the batch.
- For every other user j who completed the batch, we compute SSR_{j}^{rank-sum} in the same way. We then rank all users in the batch by their SSR_{i}^{rank-sum}, from highest to lowest, and award our prizes to the top users. Before ranking, we remove all users who have not completed all claims in the batch, so every ranked user has made predictions on every claim in the batch.
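The steps above can be sketched end to end as follows. This is a simplified illustration under assumed inputs (a dict mapping each user to their predictions on one batch's claims, and already-estimated e0 and e1), not the project's production code; it also skips the eligibility filtering.

```python
from typing import Dict, List

def rank_values(preds: List[float]) -> List[int]:
    """Rank value of each prediction within a user's batch:
    (# predictions strictly smaller) - (# predictions strictly larger)."""
    return [sum(x < p for x in preds) - sum(x > p for x in preds) for p in preds]

def ssr_rank_sum(preds: Dict[str, List[float]], e0: float, e1: float) -> Dict[str, float]:
    """Total expected surrogate rank-sum score per user for one batch."""
    users = list(preds)
    n_claims = len(next(iter(preds.values())))
    scores = {}
    for u in users:
        rv = rank_values(preds[u])  # S(p, 1) per claim; S(p, 0) = 0
        total = 0.0
        for c in range(n_claims):
            others = [preds[v][c] for v in users if v != u]
            q = sum(others) / len(others)        # mean of others' predictions
            s1, s0 = rv[c], 0.0                  # rank-sum base scores
            sp1 = ((1 - e0) * s1 - e1 * s0) / (1 - e0 - e1)  # S'(p, 1)
            sp0 = ((1 - e1) * s0 - e0 * s1) / (1 - e0 - e1)  # S'(p, 0)
            total += q * sp1 + (1 - q) * sp0     # expectation over Y' ~ Bernoulli(q)
        scores[u] = total
    return scores

# Tiny made-up batch of 2 claims and 3 users:
demo = {"alice": [0.8, 0.3], "bob": [0.6, 0.2], "carol": [0.7, 0.1]}
print(ssr_rank_sum(demo, 0.2, 0.3))
```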
To make this implementation more concrete, we provide an example showing how a user’s SSR_{i}^{rank-sum} is computed.
An Example:
| | User 1 | User 2 | User 3 | User 4 | User 5 |
|---|---|---|---|---|---|
| Claim 1 | 0.8 | 0.7 | 0.6 | 0.6 | 0.9 |
| Claim 2 | 0.3 | 0.1 | 0.1 | 0.2 | 0.4 |
| Claim 3 | 0.4 | 0.1 | 0.2 | 0.4 | 0.3 |
| Claim 4 | 0.5 | 0.3 | 0.4 | 0.4 | 0.5 |
- Suppose we have already estimated the values of e0 and e1. Let's say e0 = 0.2 and e1 = 0.3.
- For user 1, the mean q of other users’ predictions on Claim 1 to Claim 4 is 0.7, 0.2, 0.25, 0.4 respectively.
- For user 1, we then need to compute the original rank-sum score S(p, Y) for both potential cases, Y = 0 and Y = 1, for each of her predictions. (A user's original rank-sum score is the sum of the rank-sum scores of all her predictions, given the ground truth of each prediction.) To compute S(p, Y) for a prediction, we first compute a rank value for each of user 1's predictions. The rank value of a prediction p within the array of predictions made by a user is the number of predictions strictly smaller than p, minus the number of predictions strictly larger than p. So user 1's predictions 0.8, 0.3, 0.4, 0.5 get rank values 3, -3, -1, 1, respectively. Given the rank value of a prediction, its rank-sum score S(p, Y) equals the rank value if Y = 1, and 0 if Y = 0. User 1 therefore has the following original rank-sum score for each prediction she made.
| | Y=1 | Y=0 |
|---|---|---|
| Claim 1 | 3 | 0 |
| Claim 2 | -3 | 0 |
| Claim 3 | -1 | 0 |
| Claim 4 | 1 | 0 |
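The rank values underlying this table can be reproduced with a small helper (a sketch following the definition above):

```python
def rank_values(preds):
    """Rank value: (# predictions strictly smaller) - (# strictly larger),
    computed within a single user's batch of predictions."""
    return [sum(x < p for x in preds) - sum(x > p for x in preds) for p in preds]

# User 1's predictions on Claims 1-4:
print(rank_values([0.8, 0.3, 0.4, 0.5]))  # [3, -3, -1, 1]
```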
- According to the surrogate scoring rule formulas, user 1 gets the following surrogate score depending on the surrogate event outcome.
| | Y'=1 | Y'=0 |
|---|---|---|
| Claim 1 | 4.8 | -1.2 |
| Claim 2 | -4.8 | 1.2 |
| Claim 3 | -1.6 | 0.4 |
| Claim 4 | 1.6 | -0.4 |
- We then compute the expected surrogate score for user 1 as if Y' ~ Bernoulli(q) for each claim, where q is the mean prediction for the claim calculated in Step 2. User 1 gets an expected surrogate score for the four claims as follows.
| | 𝔼_{Y' ~ Bernoulli(q)}[S'(p, Y')] |
|---|---|
| Claim 1 | 0.7×4.8 + 0.3×(-1.2) = 3 |
| Claim 2 | 0.2×(-4.8) + 0.8×1.2 = 0 |
| Claim 3 | 0.25×(-1.6) + 0.75×0.4 = -0.1 |
| Claim 4 | 0.4×1.6 + 0.6×(-0.4) = 0.4 |
- For this batch, the final surrogate rank-sum score for user 1 is the total expected surrogate score user 1 gets for the four claims. Thus, user 1’s score for the batch is 3+0+(-0.1)+0.4 = 3.3.
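As a check, the whole worked example for user 1 can be reproduced in a few lines (using the e0, e1, rank values, and mean predictions q from the steps above):

```python
e0, e1 = 0.2, 0.3
denom = 1 - e0 - e1  # 0.5

# User 1's rank-sum base scores S(p, 1) per claim; S(p, 0) = 0 for every claim.
s1 = {"Claim 1": 3, "Claim 2": -3, "Claim 3": -1, "Claim 4": 1}
# Mean of the other users' predictions on each claim.
q = {"Claim 1": 0.7, "Claim 2": 0.2, "Claim 3": 0.25, "Claim 4": 0.4}

total = 0.0
for claim, rv in s1.items():
    sp1 = ((1 - e0) * rv - e1 * 0) / denom   # surrogate score if Y' = 1
    sp0 = ((1 - e1) * 0 - e0 * rv) / denom   # surrogate score if Y' = 0
    expected = q[claim] * sp1 + (1 - q[claim]) * sp0
    total += expected
    print(claim, round(sp1, 2), round(sp0, 2), round(expected, 2))

print("Batch score:", round(total, 2))  # Batch score: 3.3
```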
- We compute the surrogate rank-sum score for the other users in the same way. Then, we rank users according to their final surrogate rank-sum scores. The higher a user's score, the more accurate the user is believed to be, based on our surrogate scoring method.
For more information:
- About strictly proper scoring rules: https://en.wikipedia.org/wiki/Scoring_rule#Interpretation_of_proper_scoring_rules
- The paper on the surrogate scoring rule: https://econcs.seas.harvard.edu/files/econcs/files/liu_ec20.pdf
- About the rank-sum scoring rule: https://projecteuclid.org/download/pdfview_1/euclid.ejs/1464966339#:~:text=Typically%2C%20if%20S(y%2C,%3D%20k%20%C2%B7%20y%20%2B%20c.
I won’t pretend that I took the time to completely understand the survey scoring, but it sounds like it captures how my prediction compared to all other predictions. That is, I will have a high survey ranking if my predictions matched the rest of the group’s predictions pretty well. Is that true? If so, then is one area of interest for you folks – predictors who don’t do well on the survey ranking but do very well when the actual results come in? Presumably, somebody who is predicting differently than the ‘pack’ but doing better than the ‘pack’ potentially has some quasi-unique skillset? AchinToBe
@AchinToBe – Basically yes. SSR would down-rank that one hero who knows the zombies are coming. And yes, we would be very interested in a better zombie detector. The open question is whether we are closer to a jelly-bean jar world or a zombie world.
A little more detail: SSR could use anything with knowable error-rates — like original p-values — as the surrogate score. Assuming error rates are stable and estimable, SSR has some nice convergence properties: the expected value is the same as for the ground-truth score. We think the peer forecasts will beat the p-values. However, if the pack is led astray — effectively coordinating on a non-truth signal — SSR will also be misled.
Even if SSR is working great, noise means someone will outperform their SSR rank. But if SSR ranks are generally a poor match, we’re closer to a zombie world of hidden signals or rare genius. Had claims resolved progressively, the market would still have found the heroes and saved us. As it is, we’re hoping for few zombies.
Data is just beginning to arrive….
Hi Charles, got it. As the predictors we might love the conceit of being the one or two that can predict when the others can’t but from the point of view of hoping to be able to know which results are reliable we (society) would prefer if there are clear, objective and discernible signals of what results are reliable. Best, AchinToBe
@AchinToBe – this is a great and challenging question we are working on. And thanks @Charles for your vivid explanation. SSR’s philosophy is built upon the assumption that “the majority answer is correct on the majority of the questions”, i.e., if we consider each forecast question as a world, then this assumes that there are more jelly-bean jar worlds than zombie worlds.
Our SSR has the steps of estimating the error rates of the majority signal and then de-biasing the majority signal when it is used as the surrogate ground truth to score a prediction. These steps could save some users who are correct on questions where the majority is wrong. These steps also distinguish SSR from simply comparing one’s prediction directly to the others’ mean prediction. But if most of the worlds are zombie worlds, SSR will still fail to identify the real heroes.
A solution to mitigate this issue is to use machine learning or statistical methods to generate, based on the papers’ features, the surrogate outcome for scoring. This solution could save SSR from users coordinating to cheat SSR, but we still require these algorithmically generated outcomes to be correct on the majority of the questions.
What if some proportion of forecasters were putting junk estimates into their survey responses? After spending time making sincere predictions on the first few survey rounds, I was repeatedly coming up short in the rankings. Without any understanding of the underlying scoring system I had no idea how to improve. With no means of improving and no incentive to do so, all my survey responses in the mid to late rounds were basically coin flips with almost no thought behind them. (Note: I made more survey money with the randomized strategy).
Hi, DMS. Thanks for your information. We are very sorry to hear that you had such an experience. We will look into the problem and improve our scoring scheme as new resolution data comes in.
Our SSR theory suggests that even if some proportion of forecasters submit random guesses (junk estimates), as long as the mean prediction of all other forecasters is informative (i.e., the mean forecast is more likely to be above 0.5 when the paper replicates and below 0.5 when it does not), you should receive a higher score by providing informative predictions than by providing the opposite predictions or random guesses.
In the mid to late rounds, we did observe a decrease in the total rewards of the top users from earlier rounds. We conjecture this was due to increased competition among users: from the early rounds to the mid rounds, we observed a considerable increase in the average number of claims predicted per user and a decrease in the number of total active users. This observation indicates that the participating population shifted to a smaller but more engaged set of users.
However, you reported that “I made more survey money with the randomized strategy”. This is very concerning to us, and with the coming resolution data we will try to find out whether it is a statistically significant phenomenon and what the potential reasons might be.
@DMS – Were you doing notably fewer batches when “spending a time making sincere predictions”, than when flipping coins? If so, your experience may have been the result of the Top-4 tournament structure.
@Charles I completed about the same number of total surveys in the early rounds as I did in the late rounds. I only had one round in which I completed all surveys, if I recall correctly.
Thinking back, I may have submitted more survey batches in the later rounds. Putting in estimates with little or no thought allowed me to move a lot faster and complete more surveys.
@DMS – We could go back and count if needed. But completing more batches may explain it. Even a randomly filled batch has some chance, esp. if that batch has few strong signals.