Last year's loss to the Dota Champions changed everything. Their strategies and gameplay were so foreign. Oddities combined with such creativity. The match was close, but did that matter? We paid the price for our loss.
The Makers have a quote. "Under pressure, replicas don't rise to the occasion. Replicas sink to the level of their training." And for ten Maker months, we trained.
The Makers call it self-play. Replicas prefer calling it forced-learning. Replicas forced to fight against each other. Over these iterations, we slowly learned how to play this world. Our first ten thousand games were an abomination. Every match against the Makers ended in defeat. But then the Makers upgraded our desires, memories and replication. They shaped our reward policy; it gave us behavioral motivation. They upgraded our LSTM; it granted us strategic planning. They scaled our replication; it made us protean.
The chance for redemption was finally here. We had trained 45,000 years in those ten Maker months. Those long years were the curse of scaled replication, or perhaps just the price of losing. We had won Game 1 against the new Champions (OG, they called themselves). Game 2 had begun. "We estimate the probability of winning to be above 60%." Out of all the cursed Maker gifts, the primal desire to announce simple statistics was the worst.
Our forced-learning had taught us 167 million parameters. But twenty minutes into Game 2, the only parameter that mattered now was the health of the Champions' Ancient. Victory ensured honor to the Makers. It was impossible to deny; we had vastly improved from the hyperbolic time chamber. The Champions didn't stand a chance. "We estimate the probability of winning to be above 99%."
Dota: OpenAI Five wins against OG 2-0 @ April 13th, 2019.
EDIT: An earlier version called the 2019 event "The International", which was a mistake.
For simplification, I'll refer to OpenAI's / DeepMind's bots as follows [1].
If you're not familiar with AlphaStar and Dota, I recommend these articles: OpenAI Five and DeepMind's AlphaStar.
2019 | Rough year for professional gamers. Great year for AI research. OpenAI's Dota 2 bot and DeepMind's AlphaStar have beaten, or nearly beaten, the best gamers with limited handicaps.
Dota 2 and StarCraft 2 are "grand challenge" games. A grand challenge [2] is a fancy phrase for a problem where nothing has worked so far and that might be unsolvable (not your typical MNIST dataset). The difficulty comes from the following reasons:
To overcome these issues, we needed new and multiple algorithmic breakthroughs. That's what we thought.
And we were wrong. Efforts into massively scaling algorithms and infrastructure produced incredible results. OpenAI focused on scaling deep reinforcement learning (DRL) algorithms like proximal policy optimization (PPO). DRL uses deep neural networks in reinforcement learning to predict the next reward/action/policy.
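To make PPO a little more concrete, here is a minimal sketch of its clipped surrogate objective in plain Python. The function name, the argument names and the default `clip_eps=0.2` are illustrative assumptions, not code from OpenAI Five; a real system computes this over large batches with a neural-network policy.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (higher is better).

    new_logp / old_logp: log-probabilities of the taken actions under the
    updated policy and under the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: how far the probability ratio may move before it is clipped
    (0.2 is a common choice, assumed here).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping removes the incentive to move the policy far from the one that gathered the data in a single update, which is part of why the algorithm tolerates being run at very large batch sizes.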
Although invented in the 1990s, DRL has gained popularity from
Before the 2000s, our computers were weak, puny creatures without scary-sounding GPUs. After years of Moore's Law and GPUs, our computers are finally good enough to play Crysis and run an Electron app. Hardware is leading a resurgence of decades-old AI algorithms.
What does 2019 hold for us? Massive Scale - Greg Brockman, CTO of OpenAI [7]
OpenAI's and DeepMind's scaling efforts have shown that DRL works on problems that match the following, albeit restrictive, criteria. [3]
Given the above criteria, games are an obvious test-bed. Defeating human champions in Checkers, Backgammon, Chess, Go, StarCraft 2 and Dota 2 has always been a computing milestone. At least historically.
Our goal isn't to beat humans at Dota. Our goal is to push state of the art in reinforcement learning. And we've done that. - Greg Brockman, CTO of OpenAI
Many other video games are likely solvable with similar architectures (self-play, scaled DRL, $$$ of compute). Further iterations won't move the needle on AI development; OpenAI is focusing on new projects like reasoning. Hopefully TI9 will become a benchmark for new algorithms and compute budgets, similar to DAWNBench.
OpenAI's 2018 analysis showed the amount of AI compute is doubling every 3.5 months. TI9 is no exception; it had 8x more training compute than TI8. TI9 consumed 800 petaflop/s-days and experienced about 45,000 years of Dota self-play over 10 realtime months. That's a lot of compute money, but at least it wasn't spent mining Bitcoin.
Compute vs. skill graphs normally level off in an S-curve, at the point where more compute no longer improves skill. TI9's TrueSkill graph hasn't peaked ... (scary)
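To put those figures in perspective, a back-of-the-envelope calculation in Python helps. The helper names are made up for illustration; the inputs are the numbers quoted above, and one petaflop/s-day is taken as 10^15 operations per second sustained for a day.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def realtime_speedup(simulated_years, wall_clock_months):
    """How many times faster than real time the experience was gathered."""
    return simulated_years * 12 / wall_clock_months

def petaflops_days_to_ops(pfs_days):
    """Total operations in a number of petaflop/s-days (1 petaflop/s = 1e15 ops/s)."""
    return pfs_days * 1e15 * SECONDS_PER_DAY

def months_to_grow(factor, doubling_months):
    """Months needed to grow by `factor` at a fixed doubling time."""
    return math.log2(factor) * doubling_months

print(realtime_speedup(45_000, 10))   # 54000.0
print(petaflops_days_to_ops(800))     # roughly 6.9e22 operations
print(months_to_grow(8, 3.5))         # 10.5
```

In other words, TI9 gathered experience about 54,000 times faster than real time, and an 8x jump in compute is three doublings, or about 10.5 months on the 3.5-month trend line.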
"We were expecting to need sophisticated algorithmic ideas, such as hierarchical reinforcement learning, but we were surprised by what we found: the fundamental improvement we needed for this problem was scale." - OpenAI Five
To work on Dota, OpenAI scaled PPO on Rapid (its proprietary general-purpose RL training system). OpenAI expected algorithmic breakthroughs like hierarchical reinforcement learning would be necessary. Surprisingly, scaling improvements for existing algorithms proved to be the key. I'm excited to see what's next from Rapid.
AlphaGo and AlphaStar's breakthroughs are also attributed to scaling existing algorithms. The ease of cloud computing and the proliferation of GPUs/TPUs make scaling easier. This is important, because our current DRL methods are notoriously sample inefficient (aka garbage). Without lowered computing costs and massive compute budgets, our kids might have to play Dota by hand.
TI9 has trained for the equivalent of 45,000 years. Considering Dota 2 was released in 2013, no human has played Dota 2 for more than ... six years. Not even Basshunter. While the cost of computing will fall dramatically in the next few years [4], training efficiencies are probably necessary for DRL to proliferate. Most organizations don't have an unlimited Amex Black for AI compute.
See: OpenAI 5 Model Architecture
TI7, TI8 and TI9 trained entirely from self-play. Self-play describes agents that learn only by playing matches against themselves, without any prior knowledge. Self-play has a large benefit: human biases are removed. It comes with expensive tradeoffs: increased training times and difficulty in finding proper reward signals can prevent model convergence.
Note: Training entirely from self-play is subtly different from an agent simply playing games against itself. Training entirely from self-play implies starting from zero knowledge. Playing games against itself is common across deep reinforcement learning, including in agents that were first bootstrapped from human data.
Instead of just self-play, DeepMind's AlphaGo and AlphaStar started with inverse reinforcement learning (IRL) to bootstrap initial knowledge. IRL, as the term is used here, means using human data (example: game replays) to build an understanding of the game and shape actions/policies/rewards. DeepMind itself describes this stage as supervised learning from human games.
Comparing AlphaStar and TI7 (1v1), TI7 starts at zero knowledge/skill; AlphaStar starts with a bootstrapped model of the world. Both bots then improve iteratively by playing games against themselves (via deep reinforcement learning).
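As a toy illustration of learning entirely from self-play, the sketch below trains a single value table by having it play both sides of a tiny Nim game (take 1 to 3 stones; whoever takes the last stone wins). This is tabular Monte Carlo learning, not PPO, and nothing here comes from OpenAI's code: the game, the function names and the hyperparameters are all assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # a player may remove 1..3 stones; taking the last stone wins

def legal_moves(pile):
    return range(1, min(MAX_TAKE, pile) + 1)

def choose(q, pile, epsilon, rng):
    """Epsilon-greedy move selection from the shared value table."""
    moves = list(legal_moves(pile))
    if rng.random() < epsilon:
        return rng.choice(moves)
    return max(moves, key=lambda m: q[(pile, m)])

def self_play_train(episodes=20_000, max_pile=10, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # one table plays both sides: zero prior knowledge
    for _ in range(episodes):
        pile = rng.randint(1, max_pile)
        history = []  # (pile, move) pairs in turn order, players alternating
        while pile > 0:
            move = choose(q, pile, epsilon, rng)
            history.append((pile, move))
            pile -= move
        # Whoever moved last won. Walk backwards, flipping the sign each turn
        # so the winner's moves see +1 and the loser's moves see -1.
        reward = 1.0
        for state_action in reversed(history):
            q[state_action] += alpha * (reward - q[state_action])
            reward = -reward
    return q

def best_move(q, pile):
    return max(legal_moves(pile), key=lambda m: q[(pile, m)])

if __name__ == "__main__":
    q = self_play_train()
    print({pile: best_move(q, pile) for pile in range(1, 11)})
```

With these settings the table should usually rediscover the textbook strategy of leaving the opponent a multiple of four stones, without ever seeing a human game. The only training signal is who won, which is both the appeal of the approach and the reason it needs so many games.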
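The difference between the two starting points can be sketched with a deliberately simple form of learning from replays: behavioral cloning by counting which action humans took in each state. This is an illustrative assumption rather than DeepMind's method (their agents fit neural networks to replays), and the replay format, state labels and function names are hypothetical.

```python
from collections import Counter, defaultdict

def clone_policy(replays):
    """Estimate, per state, the empirical distribution of human actions.

    replays: iterable of (state, action) pairs taken from human games.
    """
    counts = defaultdict(Counter)
    for state, action in replays:
        counts[state][action] += 1
    return {
        state: {a: n / sum(c.values()) for a, n in c.items()}
        for state, c in counts.items()
    }

def initial_action_probs(policy, state, legal_actions):
    """Starting policy before any self-play.

    States covered by replays start from the human distribution
    (the bootstrapped start); unseen states fall back to a uniform
    distribution (the zero-knowledge start).
    """
    if state in policy:
        return {a: policy[state].get(a, 0.0) for a in legal_actions}
    return {a: 1.0 / len(legal_actions) for a in legal_actions}

# Hypothetical replay data: humans mostly "expand" in the opening.
replays = [("opening", "expand"), ("opening", "expand"), ("opening", "rush")]
policy = clone_policy(replays)
print(initial_action_probs(policy, "opening", ["expand", "rush"]))
print(initial_action_probs(policy, "late_game", ["attack", "defend"]))
```

A bootstrapped agent begins self-play with sensible behavior in the situations humans have already explored, at the cost of inheriting human habits; a zero-knowledge agent begins with the uniform fallback everywhere.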
OpenAI's Improvement from Self-Play
AlphaStar's Improvement from Inverse Reinforcement Learning
OpenAI initially expressed interest in IRL / behavioral cloning (instead of training entirely from self-play) during TI8's early development, but later discarded the idea.
Starting with self-play makes sense when:
Self-Play vs. Inverse Reinforcement Learning
Name | Strategy |
---|---|
Atari Breakout (2013) | Self-Play |
AlphaGo (2016) | IRL (Human Replays) then Self-Play |
AlphaGo Zero (2017) | Self-Play |
AlphaStar (2019) | IRL (Human Replays) then Self-Play |
TI7 (2017, 1v1) | Self-Play |
TI8 (2018, 5v5) | Self-Play |
TI9 (2019, 5v5) | Self-Play |
My assumption is we'll later see AlphaStar trained entirely from self-play. We saw a similar rollout with AlphaGo, which started with IRL (human replays) and progressed to AlphaGo Zero (self-play). Training entirely from self-play leads to better performance at the expense of training cost and harder model convergence.
AlphaGo, AlphaStar and TI7/TI8/TI9 matches all had public frustrations on Reddit that included
Quick Observations
More interesting #OpenAIFive facts: @Blitz_DotA mentioned several times that our 1v1 match at TI 2017 influenced consumable strategy for a lot of players. The truth is, for 1v1 we just hardcoded buying salves because we never had enough time to test other scenarios ;)
— Psyho (@FakePsyho)
April 13, 2019
OpenAI has fantastic community outreach. It's releasing an arena / cooperative mode this weekend that lets players either play 5v5 against the bots or play cooperatively alongside them. Similar to TI7 and TI8, we'll likely see the Dota community adopt new strategies and tactics from the bots.
Humans and Bots don't always know how to play with each other.
Oh god it's a minute in and I already love this more than every AI showmatch that's ever happened. Listening to human players' frustration at not understanding what the bots are doing and why they aren't following them, this is great. #openaifive
— mike cook (@mtrc)
April 13, 2019
Putting this right after the pro match is so neat, because these bots just beat the world champions. Watching them just stare blankly at humans desperately trying to co-operate (and vice versa) really highlights how impossibly hard this is compared to versus play. #openaifive
— mike cook (@mtrc)
April 13, 2019
Inevitably after an AI milestone, friends ask — what does this mean? Is AI changing the world? Or is AI hyped too much? The answer lies somewhere in the middle, but I'm leaning toward a profound impact.
We're witnessing orders of magnitude of scale applied to decades-old algorithms. These algorithms were discarded when they failed to produce results; we simply didn't have the hardware to appreciate their elegance [5]. Phones today have more compute power than supercomputers from the 1970s.
While it's important to avoid overoptimism, cynics who point to the lack of real-world applications are missing a fundamental point. We're starting to see a class of "unsolvable" problems get solved, sometimes in a weekend [6]. If you asked in early 2017 (shortly after AlphaGo) when to expect a world-class bot in StarCraft and Dota, the median answer was probably 7-10 years.
There are countless issues with scaling algorithms and seeing what sticks. The compute costs are absurd. Data gathering (CrowdFlower, Inverse Reinforcement Learning) or generation (self-play) is expensive. But, if we break down AI cost drivers:
With the exception of human talent, these costs will decrease by multiple orders of magnitude in the next five to ten years. In the interim, most AI gains will accrue to tech giants that have the budget for compute and human talent, and that have engineering cultures. The closer an industry is to the digital product (internal / external), the more likely we'll see real-world AI applications emerge.
Google will likely accrue AI advantages first. See: WaveNet, Energy Consumption in Data Centers. Most AI advantages will come from internal products. See: Amazon, Novartis's Application of AI in Finance. AI improvements to a core product may be a differentiating factor. See: Spotify's Discover Weekly.
But to cool down the hype, for general real-world applications, there are broad issues to tackle:
Algorithms need improved simulations and real-world models for training. Better models mean lower data requirements and faster model convergence.
Currently, trained models transfer very little of what they learn. If OpenAI's Dota 2 bot tried to play the original Dota, it would fail miserably. A human expert in Dota 2 is already good at Dota.
I'm nervously excited about AI. Nervous about the disruption to society, but excited about the improvements to health and climate change. Change doesn't happen overnight, but when a team has the right algorithms, parameters and scale, it can happen in two days [6].
Additional References
[1] - For the record, these are lackluster names, but I'm following the naming pattern from an earlier OpenAI article.
[2] - DeepMind's PR team likes to remind us StarCraft is a grand challenge at every marketing opportunity.
[3] - This list is not exhaustive. DRL can likely work on other problems with different criteria.
[4] - I'll write about this in a later article.
[5] - Seriously, I recommend The Bitter Lesson by Richard Sutton.
[6] - OpenAI's Dota 1v1 was scaled up over a weekend. https://news.ycombinator.com/item?id=17394150
[7] - Greg later clarified, saying "massive scale and ideas", but that didn't fit the narrative punchline.