December 19, 2020
Last Updated 12/23/2020
UPDATE 2021-05-30: Dream admits he used tools to modify drop rates, but doesn't concede he did so for speedruns deliberately
Recently, famous YouTube streamer Dream has been accused of cheating in a series of Minecraft speedruns that allowed him to clinch a world record. You can see the Minecraft Speedrunning Team's detailed analysis here.
There are two essential, but semi-rare, items that speedrunners need to collect in order to win: green pearls called "Ender Pearls" and fiery sticks called "Blaze Rods". There are multiple ways of obtaining the former, the fastest of which involves trading gold ingots to non-player characters called "Piglins". The latter has a 50% chance of being dropped by a specific enemy when it is killed. Because these items are necessary to win the game, any way of obtaining them faster will lead to faster times on speedruns.
In his most recent runs, Dream had a significantly higher-than-expected rate of receiving both of them. For the pearls, he received 42 from 262 trades, or about 16%, vs. the expected 4.7% as coded into the unmodified game. Likewise, he received 211 blaze rods from 305 enemies killed, or 69%, versus an expected 50%. No other player had nearly that much luck.
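As a rough check on how unusual these counts are, the sketch below computes a naive one-sided binomial tail probability for each of them. It uses the 4.7% and 50% rates quoted above and ignores the stopping-rule and selection corrections applied in the official analysis, so the output is only an order-of-magnitude illustration. The function name `binom_upper_tail` is made up for this example.

```python
import math

def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i), kept in log space to avoid overflow
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(1.0, total)

# Counts quoted in the text; these are uncorrected, naive tail probabilities.
pearl_p = binom_upper_tail(42, 262, 0.047)
rod_p = binom_upper_tail(211, 305, 0.5)
print(f"pearl trades: {pearl_p:.2e}")
print(f"blaze rods:   {rod_p:.2e}")
```

Both values come out many orders of magnitude below any conventional threshold, which is why the corrections discussed in the official analysis matter more than the raw numbers.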
In the above charts, you can see at a glance how Dream's “luck” was significantly higher than other top speedrunners over the course of several streams.
In the analysis that the Minecraft Speedrunning Team provided, even the most charitable method put the p-value for this happening to any streamer at less than 1 in a million, after taking different factors into account. The actual chances of Dream achieving such a run (or better) are much, much smaller, at about 1 in 7.5 trillion.
Of course, had Dream not chosen to increase those drops to noticeably higher rates across the above 6 runs, how much could he have modified the game in his favor to plausibly generate "good luck" "by chance"? Here are a few general strategies:
Additionally, if any single run is “too lucky”, unmodified runs, or even runs with less-favorable probabilities, could follow it to balance the numbers out. This relies on the fact that the best run is likely to be even better when “luck” is redistributed from one set of runs to another. For example, if you set the blaze rod drop rate to 20% in one run and 80% in the next, the average comes out close to the normal 50%. The odds may still look suspicious if this is done too many times, though, as you would have a weird distribution of many games with much lower-than-expected drop rates and many games with much higher ones.
Simply reducing the rate makes it more reasonable, while modifying the rate in only some of the runs is very effective at masking the effect, especially if there are a larger number of unmodified runs. It doesn't solve the issue completely, and it slows down the process of gaming the system.
It should also be noted that in most runs, usually only 0-2 of the Ender Pearls obtained come from trading, since the trade has a low probability to begin with, so boosting that rate is more difficult to do clandestinely unless it is done sparsely. For blaze rods, 7-8 is the ideal number to obtain in order to complete the game.
At the time of writing, the current world record is at 14:39, with second and third being about 30 and 70 seconds behind that time.
Below, we can take a look at the p-values across different probabilities, different numbers of runs, and different numbers of runs with unmodified probabilities. To be charitable, as in the official analysis, we will multiply the p-value by a factor of 1,000 to account for the number of speedrunners. We will ideally aim for a corrected p-value of 0.05, but anything greater than 0.01 is remotely plausible.
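The correction and the two thresholds can be written down in a few lines. This is a minimal sketch that assumes the correction is a plain multiplication capped at 1; the function names and the labels "safe", "remotely plausible", and "suspicious" are invented for illustration.

```python
def corrected_p_value(p_value, n_runners=1000):
    """Multiply a raw p-value by the number of runners considered, capped at 1."""
    return min(1.0, p_value * n_runners)

def plausibility(corrected, target=0.05, floor=0.01):
    """Classify a corrected p-value against the two thresholds used in the text."""
    if corrected >= target:
        return "safe"
    if corrected > floor:
        return "remotely plausible"
    return "suspicious"

# A raw p-value of 2 in 100,000 becomes roughly 0.02 after the 1,000x correction.
print(plausibility(corrected_p_value(2e-5)))
```

Passing `n_runners=20` instead gives a much less forgiving correction, where only the top 20 speedrunners are considered.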
For Ender Pearls, let's assume that an average of 15 gold ingots are traded per game in an attempt to receive them. For blaze rods, let's set a goal of 7 blaze rods acquired per game. Below is how frequently one could expect to evade detection over the course of 22 games with rigged RNG, using a p-value of 0.05 and a "correction" factor of 1,000, to account for the estimated number of speedrunners. The first graph considers the 22 games alone, and the second considers the games combined with 10 non-rigged games.
As we can see, pearl drop rates over 7.5% and blaze rod rates above 60% are likely to be detected over the course of more than 20 games. When combined at those levels, though, there's about a 50% chance of crossing the pre-determined threshold.
If 10 unmodified games are interspersed, though, those numbers are relatively safe, and a blaze rod/ender pearl trade probability of 55%/10% could probably be pushed, provided that the rigged and unrigged runs are not all grouped together.
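The kind of estimate behind these figures can be sketched as a small Monte Carlo loop. This is not the author's R code. It assumes 15 trades and 7 rods per game, pools all games before testing, combines the two p-values with Fisher's method, and uses far fewer simulations than the 2,500 used for the article. All function names and defaults are illustrative.

```python
import math
import random

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        total += math.exp(math.lgamma(n + 1) - math.lgamma(i + 1)
                          - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
    return min(1.0, total)

def simulate_totals(n_games, pearl_rate, rod_rate, rng, trades=15, rods_needed=7):
    """Simulate n_games; return (successful pearl trades, total blaze kills)."""
    hits = kills = 0
    for _ in range(n_games):
        hits += sum(rng.random() < pearl_rate for _ in range(trades))
        rods = 0
        while rods < rods_needed:  # keep killing blazes until enough rods drop
            kills += 1
            rods += rng.random() < rod_rate
    return hits, kills

def evasion_rate(pearl_rate, rod_rate, n_rigged=22, n_normal=0, n_sims=200,
                 alpha=0.05, correction=1000, base_pearl=0.047, base_rod=0.5,
                 trades=15, rods_needed=7, seed=0):
    """Fraction of simulated histories whose corrected p-value stays >= alpha."""
    rng = random.Random(seed)
    n_games = n_rigged + n_normal
    evaded = 0
    for _ in range(n_sims):
        hits_r, kills_r = simulate_totals(n_rigged, pearl_rate, rod_rate, rng,
                                          trades, rods_needed)
        hits_n, kills_n = simulate_totals(n_normal, base_pearl, base_rod, rng,
                                          trades, rods_needed)
        p_pearl = binom_upper_tail(hits_r + hits_n, n_games * trades, base_pearl)
        # Needing this few kills for all rods is the same event as getting
        # at least that many rods within this many kills.
        p_rod = binom_upper_tail(n_games * rods_needed, kills_r + kills_n, base_rod)
        prod = max(p_pearl * p_rod, 1e-300)      # guard against log(0)
        combined = prod * (1.0 - math.log(prod))  # Fisher's method for two p-values
        if min(1.0, combined * correction) >= alpha:
            evaded += 1
    return evaded / n_sims

print(evasion_rate(0.075, 0.60))               # 22 rigged games alone
print(evasion_rate(0.075, 0.60, n_normal=10))  # plus 10 unmodified games
```

Because the assumptions differ from the original analysis, the printed rates should not be expected to match the figures exactly.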
What happens if a less-charitable approach is taken, and only the top 20 speedrunners are considered?
Without any normal games, there is little wiggle room for changing the parameters too much.
The profile of the 22 rigged + 10 normal games looks similar to the profile with a 1,000 correction, and while some of the luck might not be enough to get results thrown out, it would definitely gather unwanted scrutiny, making future attempts more difficult to pass.
I ran simulations of the number of pearls earned in a run under the modified probability, as well as the number of blazes that need to be killed to reach 7 rods. For the pearls, I used a simple binomial distribution, which describes how many successes there are after a certain number of events, like weighted coin flips. For blazes that need to be killed, I used a negative binomial distribution, which describes how many failures are encountered before a certain number of successes are achieved (or vice-versa).
Note that the blaze distribution chart below is a negative binomial distribution, shifted to the right by 7, representing the total successes (blaze rod drops) and failures (no drop), and not just the failures.
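Both distributions can be sampled directly from their definitions. The sketch below is an illustration in Python rather than the original R code. It counts successful trades instead of individual pearls, and the function names are invented for the example.

```python
import random

def pearl_trade_successes(n_trades, pearl_rate, rng):
    """Binomial draw: how many of n_trades gold-ingot trades return pearls."""
    return sum(rng.random() < pearl_rate for _ in range(n_trades))

def blaze_kills_needed(n_rods, rod_rate, rng):
    """Shifted negative binomial draw: total kills (drops plus misses)
    until n_rods rods have dropped."""
    kills = rods = 0
    while rods < n_rods:
        kills += 1
        if rng.random() < rod_rate:
            rods += 1
    return kills

rng = random.Random(2020)
n_sims = 10_000
mean_trades = sum(pearl_trade_successes(15, 0.047, rng) for _ in range(n_sims)) / n_sims
mean_kills = sum(blaze_kills_needed(7, 0.5, rng) for _ in range(n_sims)) / n_sims
# Theoretical means for comparison: 15 * 0.047 = 0.705 and 7 / 0.5 = 14.
print(mean_trades, mean_kills)
```

Returning total kills rather than only the misses is what produces the shift of 7 described above.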
For each simulation, 2 p-values were calculated—1 for blaze rods, and 1 for ender pearls. The p-value is the probability that you would see results at least as extreme if some “null hypothesis” is true. In this case, the null hypothesis is the assumption that Dream didn't cheat.
2,500 simulations of each combination of parameters (Ender Pearl and Blaze Rod rates) were used to minimize the error in the estimates.
To combine these 2 p-values in each iteration, Fisher's Method was used, which combines both null hypotheses and assumes they are independent, which would be true if Dream did not cheat.
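For reference, Fisher's Method can be written without any statistics library, because the chi-squared distribution with an even number of degrees of freedom has a closed-form upper tail. The following is a minimal sketch, assuming every input p-value lies in (0, 1]; the function name `fisher_combine` is made up for this example.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, -2 * sum(ln p) follows a chi-squared
    distribution with 2k degrees of freedom, whose upper tail is
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic divided by 2
    tail_sum = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return min(1.0, math.exp(-half_stat) * tail_sum)

# Two p-values of 0.05 combine to roughly 0.0175, not 0.0025.
print(fisher_combine([0.05, 0.05]))
```

For two p-values this reduces to p1 * p2 * (1 - ln(p1 * p2)), which shows why the combined value is always larger than the plain product.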
The green bars in the above figures were made using the estimate of the rates Dream used. They are not 100% precise, as there is uncertainty in the exact values Dream used, but they are a reasonable estimate.
For short periods of time, one can get away with cheating. However, over a long period of time, it is possible one might get caught. There is a bit of difficulty, though, since the longer the period, the more difficult it is to compare and analyze data.
That being said, one of the best ways to counteract and avoid cheating is by requiring “provers”. For example, streamers should show mod folders for runs (which Dream refused to do), or speedruns should be streamed live with video of the player along with the game screen.
Also, the above models assumed the same behavior every game. This is less of an issue when more games are played, since the average values of number of ingot trades and blaze rods acquired grow closer to the “true average” as a result of the Law of Large Numbers.
One final note is that if more RNG targets are included, as Dream suggests here, then it becomes harder to detect cheating. For 10 RNG targets, as the speedrunning team suggests, this would have a noticeable impact, but not enough to obfuscate Dream's cheating. For 37 targets, as Dream suggests, this would actually make it possible (but still relatively unlikely) that extreme RNG manipulation could be swept under the rug if the most charitable figures (and 1,000x correction) were used. Considering it is unlikely that most of these would be targets for RNG manipulation, and most of them are based on world seed, a different class of RNG that is less predictable, including those would make it virtually impossible to detect all but the most egregious cheating.
If a player decided to do 200 runs, with 100 being normal and 100 rigged, or 400 runs with 200 normal and 200 rigged, this is how likely they would be to evade detection for different probability combinations. Note that these are very high numbers of runs, corresponding to a few weeks of 8-hours-per-day speedrunning.
When the number of attempts increases, keeping the ratio of total runs to rigged runs the same (2:1 in this case) narrows down the region of values that one can use.
This analysis was done in R, with the code being available here: https://github.com/mcandocia/dream_mc.