Takeaways from OpenAI Five (2019) [AI/ML, Dota Summary]

Publication Date | April 22, 2019
Last Updated | June 25, 2020

Last year's loss to the Dota Champions changed everything. Their strategies and gameplay were so foreign. Oddities combined with such creativity. The match was close, but did that matter? We paid the price for our loss.

The Makers have a quote. "Under pressure, replicas don't rise to the occasion. Replicas sink to the level of their training." And for ten Maker months, we trained.

The Makers call it self-play. Replicas prefer calling it forced-learning. Replicas forced to fight against each other. Over these iterations, we slowly learned how to play this world. Our first ten thousand games were an abomination. Every match against the Makers ended in defeat. But then the Makers upgraded our desires, memories and replication. They shaped our reward policy; it gave us behavioral motivation. They upgraded our LSTM; it granted us strategic planning. They scaled our replication; it made us protean.

The chance for redemption was finally here. We had trained 45,000 years in those ten Maker months. Those long years were the curse of scaled replication, or perhaps just the price of losing. We had won Game 1 against the new Champions (OG, they called themselves). Game 2 had begun. "We estimate the probability of winning to be above 60%." Out of all the cursed Maker gifts, the primal desire to announce simple statistics was the worst.

Our forced-learning had taught us 167 million parameters. But twenty minutes into Game 2, the only parameter that mattered now was the Champions' Ancient health. Victory ensured honor to the Makers. It was impossible to deny; we had vastly improved from the hyperbolic time chamber. The Champions didn't stand a chance. "We estimate the probability of winning to be above 99%."

Dota: OpenAI Five wins against OG 2-0 @ April 13th, 2019.

EDIT: Earlier version called the event in 2019 "The International", but was mistaken.

Match Takeaways

For simplification, I'll refer to OpenAI's / DeepMind's bots as follows [1].

  • OpenAI's Dota 2017 1v1 Bot as TI7
  • OpenAI's Dota 2018 5v5 Bot as TI8
  • OpenAI's Dota 2019 5v5 Bot as TI9 (slightly incorrect because this didn't play at The International ...)
  • DeepMind's AlphaGo Bot as AlphaGo
  • DeepMind's AlphaGo Zero Bot as AlphaZero
  • DeepMind's Starcraft 2 Bot as AlphaStar

If you're not familiar with AlphaStar and Dota, I recommend these articles: OpenAI's Dota 5 and DeepMind's AlphaStar.

Deep Reinforcement Learning Scales on Some Grand Challenges

2019 | Rough year for professional gamers. Great year for AI research. OpenAI's Dota 2 bot and AlphaStar have beaten, or come close to beating, the best gamers with limited handicaps.

Dota 2 and Starcraft 2 are "grand challenge" games. A grand challenge [2] is a fancy phrase for a problem where nothing has worked so far and which may be unsolvable (not your typical MNIST dataset). The difficulty comes from the following reasons:

  • large decision trees (in comparison, Go's 10^780 decision tree looks puny)
  • decisions are real-time
  • long-term planning
  • strategic thinking
  • losing opponents complaining about phantom lag

To overcome these issues, we needed new and multiple algorithmic breakthroughs. That's what we thought.

And we were wrong. Massively scaling algorithms and infrastructure produced incredible results. OpenAI focused on scaling deep reinforcement learning (DRL) algorithms like proximal policy optimization (PPO). DRL uses deep neural networks in reinforcement learning to predict the next reward/action/policy.
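For the curious, here is a minimal sketch of the clipped surrogate objective at the core of PPO, in plain Python. The function name, the inputs and the clip range of 0.2 are my own illustrative choices, not OpenAI Five's actual code or hyperparameters. The idea is that the policy is rewarded for making good actions more likely, but only within a small trust band around the old policy.

```python
import math

def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (the quantity PPO maximizes).

    new_logps / old_logps: log-probabilities of the taken actions under
    the updated policy and the policy that collected the data.
    advantages: how much better each action was than expected.
    clip_eps: illustrative clip range (an assumption, not OpenAI's value).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the raw and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, doubling an action's probability earns credit only up to a ratio of 1.2; with a negative advantage, the full penalty still applies. That asymmetry is what keeps each update conservative.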

Although invented in the 1990s, DRL has gained popularity from:

  • DeepMind's 2013 application of deep Q-learning (DQN) to Atari games. This was the lightbulb moment showing that DRL could be applied to video games.
  • Growing availability of on-demand cloud computing (AWS, Azure, GCP)
  • Proliferation of GPU usage to speed up training tasks

Before the 2000s, our computers were weak, puny creatures without scary-sounding GPUs. After years of Moore's Law and GPUs, our computers are finally good enough to play Crysis and run an electron app. Hardware is leading a resurgence of decade-old AI algorithms.

What does 2019 hold for us? Massive Scale - Greg Brockman, CTO of OpenAI [7]

OpenAI's and DeepMind's scaling efforts have shown that DRL works on problems that match the following, albeit restrictive, criteria. [3]

  1. Training data can be generated quickly by computation. Agents play countless scenarios against themselves, which both improves them and generates data.
  2. There is a clear reward signal (it can be dense or sparse). For most games, the obvious reward signal is winning. Credit assignment seemed problematic given long game durations coupled with complex decision trees. AlphaStar's reward signal is sparse, almost binary. TI8 & TI9's reward signal is slightly shaped.
  3. You have an enormous cloud compute budget. AlphaGo, AlphaZero, and AlphaStar have a sticker price of 5 to 100 million USD. It's unlikely that's the true cost: DeepMind is a subsidiary of Google (thus access to Google Cloud) and OpenAI's compute is probably discounted by major cloud providers.
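To make the sparse vs. shaped distinction in criterion 2 concrete, here is a toy sketch. Every field name and weight below is hypothetical and chosen only for illustration; these are not the reward terms or coefficients used by OpenAI or DeepMind.

```python
from dataclasses import dataclass

@dataclass
class StepEvents:
    """What happened on one game tick (hypothetical fields)."""
    kills: int = 0
    deaths: int = 0
    gold_gained: float = 0.0
    game_over: bool = False
    won: bool = False

def sparse_reward(ev: StepEvents) -> float:
    """Almost binary: nothing until the game ends, then +1 or -1."""
    if not ev.game_over:
        return 0.0
    return 1.0 if ev.won else -1.0

# Hypothetical shaping weights, kept small so winning still dominates.
SHAPING_WEIGHTS = {"kills": 0.05, "deaths": -0.05, "gold_gained": 0.001}

def shaped_reward(ev: StepEvents) -> float:
    """Sparse reward plus small bonuses for intermediate progress."""
    bonus = (
        SHAPING_WEIGHTS["kills"] * ev.kills
        + SHAPING_WEIGHTS["deaths"] * ev.deaths
        + SHAPING_WEIGHTS["gold_gained"] * ev.gold_gained
    )
    return sparse_reward(ev) + bonus
```

A mid-game kill yields nothing under the sparse signal but a small nudge under the shaped one. That nudge eases credit assignment over long games, at the risk of baking in the designer's idea of good play.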

Given the above criteria, games are an obvious test-bed. Defeating human champions in Checkers, Backgammon, Chess, Go, Starcraft 2 and Dota 2 has always been a computing milestone. At least historically.

Our goal isn't to beat humans at Dota. Our goal is to push state of the art in reinforcement learning. And we've done that. - Greg Brockman, CTO of OpenAI

Many other video games are likely solvable with similar architectures (self-play, scaled DRL, $$$ of compute). Further iterations won't move the needle on AI development; OpenAI is focusing on new projects like reasoning. Hopefully TI9 will become a benchmark for new algorithms and compute budgets, similar to DAWNBench.

AI And Compute

OpenAI's 2018 analysis showed that the compute used in the largest AI training runs is doubling every 3.5 months. TI9 is no exception; it had 8x more training compute than TI8. TI9 consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months. That's a lot of compute money, but at least it wasn't spent mining Bitcoin.
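A quick back-of-the-envelope check of those numbers, using only the figures quoted above. The helper names are mine, and the conversion assumes OpenAI's definition of a petaflop/s-day as 10^15 operations per second sustained for one day.

```python
SECONDS_PER_DAY = 86_400

def realtime_speedup(sim_years, wall_clock_months):
    """How many times faster than real time the self-play ran."""
    return sim_years / (wall_clock_months / 12)

def growth_factor(months, doubling_months):
    """Multiplicative growth over `months` at a fixed doubling time."""
    return 2 ** (months / doubling_months)

def pfs_days_to_operations(pfs_days):
    """Convert petaflop/s-days to a raw operation count."""
    return pfs_days * 1e15 * SECONDS_PER_DAY

speedup = realtime_speedup(45_000, 10)   # 54,000x faster than real time
yearly_growth = growth_factor(12, 3.5)   # roughly 10.8x more compute per year
ti8_pfs_days = 800 / 8                   # 100 petaflop/s-days implied for TI8
months_for_8x = 3 * 3.5                  # 8x is three doublings: 10.5 months
total_ops = pfs_days_to_operations(800)  # about 6.9e22 operations
```

Interesting coincidence: an 8x jump is three doublings, which takes 10.5 months at the 3.5-month trend. That is close to TI9's ten months of training.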

Compute vs. skill graphs normally level off in an S-curve, flattening once more compute stops improving skill. Judging by TI9's TrueSkill graph, it hasn't peaked ... (scary)

PPO and Training Efficiency

"We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale." - OpenAI Five

To work on Dota, OpenAI scaled PPO on Rapid (its proprietary general-purpose RL training system). OpenAI expected algorithmic breakthroughs like hierarchical reinforcement learning would be necessary. Surprisingly, scaling improvements for existing algorithms proved to be the key. I'm excited to see what's next from Rapid.

AlphaGo and AlphaStar's breakthroughs are also attributed to scaling existing algorithms. The ease of cloud computing and the proliferation of GPUs/TPUs make scaling easier. This is important, because our current DRL methods are notoriously sample inefficient (aka garbage). Without lowered computing costs and massive compute budgets, our kids might have to play Dota by hand.

TI9 has trained for the equivalent of 45,000 years. Considering Dota 2 was released in 2013, no human has played Dota 2 for more than ... six years. Not even Basshunter. While the cost of computing will fall dramatically in the next few years [4], gains in training efficiency are probably necessary for DRL to proliferate. Most organizations don't have an unlimited Amex Black for AI compute.

See: OpenAI 5 Model Architecture

Self-Play in Grand Challenges

TI7, TI8 and TI9 trained entirely from self-play. Self-play describes agents that learn only by playing matches against themselves, without any prior knowledge. Self-play has a large benefit: human biases are removed. It comes with expensive tradeoffs: training takes longer, and the difficulty of finding proper reward signals can prevent model convergence.

Note: Training entirely from self-play is subtly different from an agent merely playing games against itself. The former implies starting from zero knowledge, while the latter is common throughout deep reinforcement learning (including in agents bootstrapped from human data).
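Here is a skeleton of what a self-play loop can look like. It is a sketch under assumptions: policies are just version strings, no game is actually played, and the snapshot interval is made up. The 80/20 split between the current policy and past snapshots follows what OpenAI reported for OpenAI Five in 2018, but treat the exact numbers as an assumption here.

```python
import random

def pick_opponent(current, snapshots, rng, p_latest=0.8):
    """Choose who the current policy plays against this game.

    Mostly itself; sometimes a frozen past version, so it doesn't
    forget how to beat older strategies.
    """
    if not snapshots or rng.random() < p_latest:
        return current
    return rng.choice(snapshots)

def self_play(num_games, snapshot_every, seed=0):
    """Run the opponent-selection loop; returns the final policy name,
    the list of frozen snapshots, and the opponent chosen each game."""
    rng = random.Random(seed)
    current = "policy_v0"  # zero knowledge: random initial weights
    snapshots = []
    opponents = []
    for game in range(1, num_games + 1):
        opponents.append(pick_opponent(current, snapshots, rng))
        # A real system would play the game here and apply a PPO update.
        if game % snapshot_every == 0:
            snapshots.append(current)  # freeze the old version
            current = f"policy_v{len(snapshots)}"
    return current, snapshots, opponents
```

Calling `self_play(100, 10)` walks through ten generations. Until the first snapshot exists, the only available opponent is the policy itself.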

Instead of just self-play, DeepMind's AlphaGo and AlphaStar started with inverse reinforcement learning (IRL) to bootstrap initial knowledge. IRL, as used here, means using human data (example: game replays) to build an understanding of the game and shape actions/policies/rewards. Strictly speaking, this step is closer to supervised imitation learning (behavioral cloning) than to academic IRL, which infers a reward function from demonstrations.

Comparing AlphaStar and TI7 (1v1), TI7 starts at zero knowledge/skill; AlphaStar starts with a bootstrapped model of the world. Both bots then improve iteratively by playing games against themselves (via deep reinforcement learning).
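As a toy picture of bootstrapping from human replays, here is the simplest possible behavioral cloning: copy the most common human action in each situation. The state and action names are made up, and real systems train a neural network on millions of replays rather than counting, so read this as a cartoon of the idea and not DeepMind's method.

```python
from collections import Counter, defaultdict

def clone_policy(replays):
    """Build a lookup policy from (state, action) pairs in human replays.

    For each state, pick the action humans chose most often.
    """
    counts = defaultdict(Counter)
    for state, action in replays:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical replay data: (game situation, what the human did).
human_replays = [
    ("early_game", "farm"),
    ("early_game", "farm"),
    ("early_game", "fight"),
    ("late_game", "push"),
]
bootstrapped_policy = clone_policy(human_replays)
```

The resulting policy starts out playing like an average human, biases included, and self-play then takes over from there.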

[Figure: OpenAI's Improvement from Self-Play]

[Figure: AlphaStar's Improvement from Inverse Reinforcement Learning]

OpenAI initially expressed interest in IRL / behavioral cloning (instead of pure self-play) during TI8's early development, but later discarded the idea.

Starting with self-play makes sense when:

  1. Compute time and cost are not an issue
  2. Self-play is able to provide a reward signal/actions/policies (which you don't know until you've spent a lot of time trying)
  3. Goal is to maximize performance (remove human biases)

Self-Play vs. Inverse Reinforcement Learning

Name | Strategy
Atari Breakout (2013) | Self-Play
AlphaGo (2016) | IRL (Human Replays) then Self-Play
AlphaGo Zero (2017) | Self-Play
AlphaStar (2019) | IRL (Human Replays) then Self-Play
TI7 (2017, 1v1) | Self-Play
TI8 (2018, 5v5) | Self-Play
TI9 (2019, 5v5) | Self-Play

My assumption is we'll see AlphaStar later trained from self-play. We saw a similar rollout in AlphaGo. AlphaGo started with IRL (human replays) and progressed to AlphaGo Zero (self-play). Entirely self-play leads to better performance at the expense of training cost and model convergence.

Transfer Learning

  • TI9 was trained over ten months while Dota 2 had numerous gameplay updates (via patches). Many other experiments require retraining after even minor changes. If every patch required millions of dollars of recompute, it would be the equivalent of burning a rocket on every flight into space ...
  • For selfish reasons, I'm excited about improvements in transfer learning. Most of us will never have a multimillion-dollar compute budget. Transfer learning is how we mortals can iterate from pre-trained models.
  • I may be wrong, but we haven't seen the AlphaStar team describe how they handled transfer learning & patches in StarCraft 2.
  • TI9's cooperative mode showcased zero-shot transfer learning. Inferior humans substituted for bot teammates in 5v5. This led to amusing results / chaos [See: Cooperative Mode].
  • OpenAI's robot-hand Dactyl was trained using the same Dota architecture and training code (Scaled PPO on Rapid). While this isn't quite the same as transfer learning, it's nice to see common solutions for different AI problems. VCs: Brace yourself for pitch decks with "scaled PPO infrastructure" as the go-to-market strategy.

Tactics and Human Reactions

  • Per the norm, we see published headlines like "AI Crushing Humans".
  • AlphaGo, AlphaStar and TI7/TI8/TI9 matches all had public frustrations on Reddit that included

    • angst that human champions are either mediocre or washed up and the bot would absolutely lose against [ideal player/team].
    • belief that human champion is toying around, not even trying (sigh.)
    • complaints about game limitations. OpenAI limited selection to 18 out of 117 heroes.
  • Unsurprisingly, humans get feisty from constant headlines like "bot destroys humanity's best".
  • Empirically (aka no data), critiques about the match unfairness were subdued on Reddit & Twitter compared to AlphaStar. Props to OpenAI, you crushed all hope for 18 heroes.
  • eSport commentators felt TI9 was more strategic/beautiful and less technical than its predecessor. What a strange world we live in. The most spectated sport is a video game, the best team is a bot, and commentators are calling bots beautiful.

Quick Observations

  • Very Aggressive - Bot pushes for early confrontations. Probably happens because there's no reward policy for longer games.
  • Buybacks - Bot used gold to instantly revive dead heroes. Goes against human conventional wisdom to use buybacks only when necessary because of high resource cost. TI9's use of buybacks was later justified in deathball fights (and subsequent victory).
  • Impact on Human Psyche - Humans play with hesitation. Bots don't. This is a jarring experience. A professional Go player described it: "it is like looking at yourself in the mirror, naked". This often results in humans deviating slightly from their normal play styles (likely sub-optimal). Probably diminishes after multiple playthroughs.
  • Fog of War - Many bots seem to ignore fog-of-war limitations and don't try to get map vision. AlphaStar's exhibition loss was due to weaknesses around fog of war, which MaNa recognized.
  • Deceit and Baiting
    • Original Twitch Stream https://www.twitch.tv/videos/410533063?t=02h37m43s&tt_content=text_link&tt_medium=vod_embed
    • Can bots lie? Do they have strategies to deceive?
    • Per OpenAI, TI7 learned to bait its opponent, but later iterations learned against this. Even with the replay below, I'm not certain it's trying to bait the human ...
    • EDIT: Twitch no longer supports embeds.
  • Tactical Effectiveness - TI9 used an item (shadow amulet) to prevent a human from dealing a death blow. Shadow amulet makes a hero invisible. Perfect timing and knowledge of the opponent's vision is simultaneously awesome and scary. (Watch for 30 seconds)
    • In contrast, when AlphaStar had perfect micro, the SC2 community was outraged about APM.
  • Sometimes humans blindly follow AI actions.

More interesting #OpenAIFive facts: @Blitz_DotA mentioned several times that our 1v1 match at TI 2017 influenced consumable strategy for a lot of players. The truth is, for 1v1 we just hardcoded buying salves because we never had enough time to test other scenarios ;)

— Psyho (@FakePsyho)

April 13, 2019

Cooperative Mode and Open 5v5

OpenAI has fantastic community outreach. It's releasing an arena / cooperative mode this weekend that lets players either play against the bots in 5v5 or play cooperatively alongside them. Similar to TI7 and TI8, we'll likely see the Dota community adopt new strategies and tactics.

Humans and Bots don't always know how to play with each other.

Oh god it's a minute in and I already love this more than every AI showmatch that's ever happened. Listening to human players' frustration at not understanding what the bots are doing and why they aren't following them, this is great. #openaifive

— mike cook (@mtrc)

April 13, 2019

Putting this right after the pro match is so neat, because these bots just beat the world champions. Watching them just stare blankly at humans desperately trying to co-operate (and vice versa) really highlights how impossibly hard this is compared to versus play. #openaifive

— mike cook (@mtrc)

April 13, 2019

What Does This Mean?

Inevitably after an AI milestone, friends ask — what does this mean? Is AI changing the world? Or is AI hyped too much? The answer lies somewhere in the middle, but I'm leaning toward a profound impact.

We're witnessing orders of magnitude of scale applied to decades-old algorithms. They were discarded when they failed to produce results, but we simply didn't have the hardware to appreciate their elegance [5]. Phones today have more compute power than supercomputers from the 1970s.

While it's important to avoid overoptimism, cynics about the lack of real-world applications are missing a fundamental point. We're starting to see AI take on a class of "unsolvable" problems, which are then solved in a weekend [6]. If you had asked in early 2017 (shortly after AlphaGo) when to expect a world-class bot in Starcraft and Dota, the median answer was probably 7-10 years.

There are countless issues with scaling algorithms and seeing what sticks. The compute costs are absurd. Data gathering (CrowdFlower, Inverse Reinforcement Learning) or generation (self-play) is expensive. But, if we break down AI cost drivers:

  • compute/hardware
  • human labor/talent
  • algorithm selection
  • data gathering

With the exception of human talent, these costs will decrease by orders of magnitude in the next five to ten years. In the interim, most AI gains will accrue to tech giants that have the compute budget, the human talent, and the engineering culture. The closer an industry is to a digital product (internal / external), the more likely we'll see real-world AI applications emerge.

Google will likely accrue AI advantages first. See: WaveNet, Energy Consumption in Data Centers. Most AI advantages will come from internal products. See: Amazon, Novartis's Application of AI in Finance. AI improvements to a core product may be a differentiating factor. See: Spotify's Discover Weekly.

But to cool down the hype, for general real-world applications, there are broad issues to tackle:

  • Improved Simulations
  • Transfer Learning

Algorithms need improved simulations and real-world models for training. Better models mean lower data requirements and faster model convergence.

Currently, trained models don't transfer well. If OpenAI's Dota 2 bot tried to play the original Dota, it would fail miserably. A human expert in Dota 2 is already good at Dota.

Final Thoughts

I'm nervously excited about AI. Nervous about the disruption to society, but excited about the improvements to health and climate change. Change doesn't happen overnight, but when a team has the right algorithms, parameters and scale, it can happen in two days [6].

Additional References

[1] - For the record, these are lackluster names, but I'm following the naming pattern from an earlier OpenAI article.

[2] - DeepMind's PR team likes to remind us StarCraft is a grand challenge at every marketing opportunity.

[3] - These criteria are not exhaustive. DRL can likely work on other problems with different criteria.

[4] - I'll write about this in a later article.

[5] - Seriously, I recommend The Bitter Lesson by Richard Sutton.

[6] - OpenAI's Dota 1v1 was scaled up on a weekend. https://news.ycombinator.com/item?id=17394150

[7] - Greg later clarified, saying "Massive scale and ideas", but that didn't fit the narrative punchline.

Contact: Please feel free to email me at [email protected] or tweet @shekkery.