Balancing PokeHearth - part 1

Today we will be looking at how to balance a game using randomisation and evolution! How fun! Specifically, we will do this by looking at statistics.

There are two ways to balance a game. The one most often used relies on subjective statements about how the game feels to play, and on opinions as to what is balanced and what is not. However, opinions change from person to person. Others might then dismiss it as subjective. That's not my point. My point is that it's messy, having to deal with all these conflicting interests, which might slow down the process of balancing. This is also its strength; games are to be played by humans, not by computers. Perhaps something is completely balanced, objectively speaking, but still leads to uninteresting gameplay.

For now, though, I'll be using statistics to determine what is balanced and what is not. At all points, we need to keep in mind the human factor. Perhaps it is okay for small imbalances to exist - it might even be necessary. Rather, we should look for places where the imbalance is not fun. This might be due to:

  • Some strategies being so strong, they preclude other methods of play
  • Some cards being so strong, they take up spots for other cards
  • Strategies having no weaknesses, or countering the opponent's agency
  • The most interesting cards being too weak to be played
  • The most complex or random cards being too strong, destroying a feeling of coherence
To this end, I will be collecting several stats apart from just winrate. I have no idea if they are any good or not. Anyway, let's talk a bit about the game.


PokeHearth
The unholy amalgam of Hearthstone and Pokémon, PokeHearth is neither one nor the other. In it, two players, or here, computer-controlled players, battle to see who wins. Each uses a deck of 25 cards assembled from a library of 772 cards. The first goal is then to find out if all these 772 cards are equally usable, or if some of them are a waste of space.

Which is too strong, which is too weak?

However, each player must choose one of nine type-combinations, and that choice decides how many cards they have to pick from: at minimum 96 options for Bug/Fairy, and at maximum 145 options for Water/Electric. The second goal might be to see if all these nine type-combinations have some chance of winning, or if some are just much weaker than others.

In a game like this, one player must go first and the other second. The player going first is called White, and the second is called Black. To balance this discrepancy, Black is given an extra card called Black's Gambit, which for one turn only allows them to gain a lead in mana, the incremental resource used to play cards. However, is that enough? How close is it to a 50% chance of winning whether you go first or second? And even worse than that, are there certain decks that benefit from going first, compared to others, which are good with both?

Speaking of mana, the game starts with players having access to little mana, and ends with plenty of mana. This makes quite different early- and late-game strategies available. We should thus try to uncover whether both early- and late-game strategies are viable, and to what degree. Is the balance between them just right, or is one much stronger than the other?

So now we have four questions. Let's look them over once more:
  1. Are individual cards balanced?
  2. Are type-combinations balanced?
  3. Are White/Black differences balanced?
  4. Are different strategies available?




Tournament bracket
What makes competitive games so interesting is that there are metagames, that is, different strategies evolve to counter other strategies. The rules, in their way, shape balance. But so do the players.

This means we cannot just look at a given card, imagine all possible scenarios in which it can be played, and determine if it is balanced or not. Instead, we need to see it being played in the context of other cards, all being played as well as possible.

To simulate this, I will create 10,000 random decks, which will be playing against each other. My cruddy laptop can simulate around five matches every second, which means it will take about half an hour for all decks to play a couple of games - one where they go first, one where they go second. But to see what decks are good, we need muuuuch more than a couple of games. However, we can use a bit of selection to speed up the process.
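The half-hour estimate can be checked with a bit of arithmetic. This is only a sketch: the function name is made up, and it assumes each game seats one deck as White and another as Black, so that one game as White and one as Black per deck adds up to as many games as there are decks.

```python
def generation_minutes(num_decks: int, games_per_second: float) -> float:
    # Assumption: every deck plays once as White and once as Black.
    # Each game fills one White seat and one Black seat, so the number
    # of games in the round equals the number of decks.
    total_games = num_decks
    return total_games / games_per_second / 60


print(round(generation_minutes(10_000, 5), 1))  # 33.3
```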

If we continually select the very best decks, we will also avoid the biases that appear from using random decks. Like in a tournament bracket, first the weakest decks are weeded out. With 10,000 decks in total, at least some of them should be pretty good.

Just as I wrote this, the first generation is over. All decks which lost both games have been eliminated, leaving 7387 survivors. As we go further and further in, the rate of removal will be lowered, so the top 100 decks will get to play quite a few matches. Still, we will keep track of what cards lead the first losers to, well, lose. Already now, we are getting a sign that things are not completely balanced. If all decks had a 50% chance of winning, a full 7500 should have survived. However, if all decks either always win or always lose, it would be 5000. So we can tell we are at least closer to equal winrate than completely unequal.
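The two reference points, 7500 and 5000, come from the chance of losing both games. Here is a small sketch of that reasoning. It assumes the two games are independent and that a deck's winrate is the same whether it goes first or second, which is a simplification, and the function name is just for illustration.

```python
def expected_survivors(winrates):
    # A deck is eliminated only if it loses both games, which happens
    # with probability (1 - p)^2 for a deck with winrate p.
    return sum(1 - (1 - p) ** 2 for p in winrates)


n = 10_000
all_equal = [0.5] * n                                 # every deck wins half its games
all_or_nothing = [1.0] * (n // 2) + [0.0] * (n // 2)  # decks always win or always lose

print(expected_survivors(all_equal))       # 7500.0
print(expected_survivors(all_or_nothing))  # 5000.0
```

The observed 7387 survivors sit between the two, and much closer to the first.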

Apart from the number of wins and losses, we will also look at the decks which will make it all the way to the top of the bracket. This might give us a better insight into why certain cards are so strong. It might, for instance, explain why Raichu is so bad and Poliwhirl is so strong.


The stats
The primary statistic we are going to collect is winrate. Each time a game is won or lost, a script will run through all the cards in the two decks that faced each other, and add a win or a loss to a tally kept for every card in the library. This will give us the average winrate of decks which have this card in them - or almost. The tally counts twice when there are two copies of a card in the deck. But for most intents and purposes, we can read the number as the card's winrate.
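A tally like this could look roughly as follows. This is not the actual script, just a minimal sketch: the function name and the shape of the results list are assumptions, and the example decks are tiny made-up lists rather than real 25-card decks.

```python
from collections import Counter


def tally_winrates(results):
    """results: iterable of (winning_deck, losing_deck), each a list of card names."""
    wins, games = Counter(), Counter()
    for winning_deck, losing_deck in results:
        # A card that appears twice in a deck is counted twice.
        for card in winning_deck:
            wins[card] += 1
            games[card] += 1
        for card in losing_deck:
            games[card] += 1
    return {card: wins[card] / games[card] for card in games}


# Made-up toy results, only to show the bookkeeping.
example_results = [
    (["Pansear", "Pansear", "Magby"], ["Bidoof", "Magby"]),
    (["Bidoof", "Skiploom"], ["Pansear", "Magby"]),
]
rates = tally_winrates(example_results)
print(rates["Bidoof"])  # 0.5
```

Note how Pansear gets two wins from the first game alone, because the winning deck holds two copies.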

So let's just take a look at the ten best and worst cards, see what turns up!


Er, what a random smattering of cards. It's really difficult to get a picture of anything, really.

First, let me add a comment regarding the numbers. Pansear and Skiploom are the cards with the highest winrate, and 56% can sound like a really high number, sure. However, compared to the other balance rounds I've done, this is one of the lower numbers. I have had cards up around 60% before. 56% winrate is, well, sort of problematic, but it is not a one-word death sentence. Personally, I would be more interested in the prevalent typings when it comes to being overpowered, and mostly look at individual cards when it comes to being underpowered.

How does one fix a card? With cards like Pansear and Skiploom, it might be possible to only change their stats, that is, health or attack. Changing mana cost is possible too, but one change of mana has the same effect as changing both health AND attack, sending a card directly from 56% to 44%, or something like that.

There are of course other options. For instance, one might change the cost AND one of the stats. Sometimes reorganizing stats is enough, for instance changing a 4/5 to a 5/4 is a slight nerf. Generally, 4/5 is the best statline, then 3/6, and only then, 5/4. I don't remember if a 2/7 or 6/3 is better, but probably the latter.

Finally, the card's effect can be changed, either by tuning the numbers or by completely rewriting the card. In the case of Exactness and Improve, I might want to add an extra effect and then perhaps increase the cost. Both simply upgrade the Idol Power without giving an immediate benefit, making the cards both boring and difficult to use.

Before all of this, I need to do a bit of bug-testing. For instance, Bidoof is quite conspicuous, and testing it out, I find that it gives the opponent mana instead of the intended effect. But this just poses more questions - all the other cards around 40% must be equally terrible. Sort of. The AI knows the effect of Bidoof, so a 39% WR means "card which is almost never played and just works as a dead draw". It is thus pretty worrisome that Jigglypuff and the two hydras have similar winrates.

I see three Fairy type cards in the top ten worst; three Dark type cards, too. This makes me curious. Three of the ten best are Grass type, three Fire type and three Ground type. This makes me suspicious.


Types and Turnout
Because how did the tournament even turn out? What kinds of strategies are the strongest? What cards do they use? Might this explain why things are like they are?

Out of a full 20,000 decks, the top 180 decks are as follows:

So, several different options are possible, but the top four typings hold over 50% of the best decks, while the bottom eleven typings hold just 25%. This gives a much clearer view than just looking at the top 10 best cards. For instance, Fire is the third best typing and has the most cards in the top 10. But Grass probably has several strong cards just outside of the top 10.

This does not, however, mean that so many Grass cards are overpowered. Rather, it is the synergy between several good cards that in total makes the whole type overpowered. A bad Charmander in a deck of good Fire type cards will, on average, still have a positive winrate:


Don't be mistaken. The full bottom row are the five worst Fire-type cards. It just so happens that even the least useful Fire-type cards still have a fair winrate. This is something we need to change.

This looks really strange, and it is. Water and Grass, which both had more decks in the top 180, have a more even distribution.

But how best to hit the Fire type? Fire plays by quite different rules than the other types. Fire is all about burn, and this is why it has both the fastest and the most aggressive decks. This is also why I am focusing specifically on Fire rather than the other types. I know Fire is problematic. I've been over it before. This is why there is only an 8% difference between the best and worst Fire-type card. I've solved the single-card issues already. All that's left is to nerf the best of the cards that are not unfair in themselves, but together simply are too strong.

I hope, however, that we do not need to nerf all Fire type cards with a positive winrate. Hitting those five top cards might be enough to set Charmander down to 46% or whatever.

Pansear can be nerfed into a 2-cost 2/2 without much worry, since its strength really isn't in the stats either way. It might still be above 50% winrate afterwards.

Fiery Ascendance is just really strong. I could increase its cost by one, but instead I think I will switch the order of its effects, meaning that the newly evolved Pokémon will also take damage. This means that the card is still available as a control AOE, as if that was something Fire wanted to do.

Camerupt can have a point of health docked without it making a big difference. Generally, stat-changes are more important with small Pokémon than large Pokémon. Simisear will have a point of attack docked, though this might be too harsh.

I was about to do the same with Infernape, but then I noticed that the 3rd evolutions of the other starters of its generation cost 6 mana, so instead, I will increase its cost and its health by one each, creating a relative nerf of 1 attack.


Why I care about Fire
There are several reasons to care so much about Fire. First of all, its strategy is so direct - win as early as possible - that it might skew the whole meta. If there are a lot of aggressive decks running around, they will cut down the possibilities of other, more value- or combo-oriented strategies.

The Fire type, as mentioned, was very particular in its playstyle. This can also be seen in the difference in winrate between going first and going second. The average difference was 8.6%, but Fire was tied for second highest difference at 10.6%. Also, in the early part of the tournament, where lots of weak decks were running around, Fire had a 2.3 percentage points higher winrate than in the later rounds, significantly more than average.
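For reference, here is one way such a first/second difference could be computed. It is a sketch under assumptions: I take the difference to mean winrate as White minus winrate as Black, the function name and the record format are invented for the example, and the sample games are made up.

```python
def seat_gap(games):
    """games: list of (went_first, won) pairs of booleans for one deck or typing."""
    as_white = [won for went_first, won in games if went_first]
    as_black = [won for went_first, won in games if not went_first]
    # True counts as 1 and False as 0, so sum() gives the number of wins.
    return sum(as_white) / len(as_white) - sum(as_black) / len(as_black)


# Made-up sample: 3 wins out of 4 as White, 2 wins out of 4 as Black.
example_games = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, True),
]
print(seat_gap(example_games))  # 0.25
```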

But I should not only care about Fire. That would be shortsighted. Therefore, next time, we will look at the other types.


The Fastest Deck
The fastest deck is the deck whose games, wins and losses alike, ended on the lowest average turn. The award went to this Fire-type deck, with an average game length of less than 7 turns. To compare, the average is somewhere around 8-9, and the slowest deck was just below turn 13 - this, of course, includes the times this fast deck went up against the slowest, and vice versa.
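As a sketch of how this award could be computed: the snippet below assumes that every game is logged with the two decks involved and the turn it ended on, and that a deck's speed is the plain mean of those ending turns over all its games. The log format, the deck labels and the numbers are all made up for illustration.

```python
from statistics import mean


def fastest_deck(game_log):
    """game_log: list of (deck_a, deck_b, final_turn) tuples."""
    turns = {}
    for deck_a, deck_b, final_turn in game_log:
        # The ending turn counts for both decks, winner and loser alike.
        turns.setdefault(deck_a, []).append(final_turn)
        turns.setdefault(deck_b, []).append(final_turn)
    averages = {deck: mean(ending_turns) for deck, ending_turns in turns.items()}
    return min(averages, key=averages.get), averages


# Hypothetical decks "A", "B" and "C" with invented game lengths.
example_log = [("A", "B", 6), ("A", "C", 9), ("B", "C", 13)]
fastest, averages = fastest_deck(example_log)
print(fastest, averages["A"])  # A 7.5
```

Since a game's length is shared by both decks, a fast deck drags down the average of everyone it plays against, which is the effect mentioned above.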



Of the cards we nerfed, we actually only see Pansear here. This is a bit worrisome. Perhaps I should also go and nerf Combusken and Blaziken, the 6th and 7th strongest cards. I'll dock a point of health from both of them. We'll get back to Tyrogue once we look over the Fighting type cards, so in total, Xagcaro should be demolished.

Anyway, the deck itself. Contrary to what one might think considering the early game wins and losses, Xagcaro actually has a pretty balanced approach. It has some strange inclusions, like Heatmor, whose ability is useless since the deck does not have any powers, as well as Riolu, a primarily defensive card. Of course, building a wall can protect your own attackers, but... With several late-game cards, I am very uncertain how exactly Xagcaro manages such early games.

My only hypothesis is that it employs a dual strategy. The first, pure burn, manages to win some games early, by using Pansear with other cards like Magby, Combusken, Ponyta and Charmeleon, which are also able to deal direct damage to the enemy. The second is a more tempo-oriented strategy where it dominates the playing field and manages to end the game that way. Finally, it has a few late-game options to push through and make sure games do not drag out too long. Or at least, that's my theory. It is a bit of an enigma.

This is what happens whenever one uses neural nets or evolutionary methods - you end up with results you cannot explain. And really, that's why I do it. So thank you Xagcaro. I will still destroy your gameplan, though.

To prove I have no idea what I'm talking about, the best Fire deck, Caklerusl, which came #10 in the tournament, has little burn and is a tempo/value oriented deck. But this is not a meta report.
