Emily Greer at GDC 2018: Data-Driven or Data-Blinded?

Over the last decade, data analysis, A/B testing, and predictive modeling have transitioned from an afterthought to a given in the game industry. Data can be invaluable in understanding the player and making decisions, but it can just as easily lead the industry astray, or worse, narrow the way the industry thinks. When should you be driven by data, and when should you let your imagination roam free? This session will expose common mistakes and pitfalls, both technical and emotional, as well as provide practical guidance on how to improve the rigor of your tests and the quality of your data, and how to make sure you don’t miss the forest for the trees.
It’s the cornerstone of many of the biggest businesses in the US, including
Google & Amazon, and the backbone of most scientific undertakings.
But data is just a tool, and like almost every tool it has both uses and abuses,
not to mention just straight up errors. How many conflicting health studies
have you seen?
As a company Kongregate uses a lot of data, and some of you have probably
seen talks I’ve given before where I share a lot of that data. But a lot of the
time I’ve been unsure whether we’re this ship, charting a clean course to
treasure, or this ship, headed toward disaster. Both have happened! And since I
think that’s a pretty common phenomenon, I thought it would be a good talk to give.
I love using numbers & testing to understand the world. I still probably spend
at least an hour a day poking around dashboards and spreadsheets because
it’s so much more fun for me than meetings.
I’m mostly self-taught, majored in Eastern European Studies, not math or
econ. Stumbled into direct marketing, specifically catalogs, after college, and
fell in love with data. Taught myself SQL because I hated to wait for IT to pull
my data, took math & econ classes to understand more theory. After 10
years in catalogs & e-commerce and a near-miss with econ grad school I co-
founded Kongregate partly to do something completely different. But it hasn’t
turned out to be that different after all. User acquisition in particular is
fundamentally similar between catalogs & games.
Part of the reason I’m telling you this is to make my first point:
And for an organization to do data right you can’t toss analysis back and forth
over a wall to quants. It takes intimate knowledge of a game (and its
development) to do good analysis, and multiple perspectives and theories are
valuable.
Sometimes it’s immediately obvious. One of the first games we launched on
mobile was an endless runner. It wasn’t filtering purchases from jailbroken
phones and was showing an average revenue per player of $500. That’s not
very plausible and easily caught. But most issues are much more subtle –
tracking pixels not firing correctly for a particular game on a particular
browser, tutorial steps being completed twice by some players but not by
others, clients reporting strange timestamps, etc. For this reason I
recommend never relying on any analytic system where you can’t go in and
inspect individual records. If you can’t check the detail there are some
problems you’ll never find and fix.
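To make that concrete, here’s a minimal sketch of the kind of record-level sanity check described above. The field names, the $100 plausibility cutoff, and the records themselves are all invented for illustration; the point is that you flag individual records for manual inspection before trusting the aggregate.

```python
# Hypothetical raw purchase records -- fields and values are invented.
records = [
    {"player": "a", "revenue": 4.99,   "platform": "ios"},
    {"player": "b", "revenue": 499.99, "platform": "ios_jailbroken"},
    {"player": "c", "revenue": 0.99,   "platform": "android"},
]

def suspicious(rec, max_plausible=100.0):
    """Flag records worth inspecting by hand: implausibly large purchases,
    or purchases from jailbroken devices (a common source of fake revenue)."""
    return rec["revenue"] > max_plausible or "jailbroken" in rec["platform"]

flagged = [r for r in records if suspicious(r)]
clean = [r for r in records if not suspicious(r)]

# The aggregate looks wildly different before and after the record-level check.
arpu_raw = sum(r["revenue"] for r in records) / len(records)
arpu_clean = sum(r["revenue"] for r in clean) / len(clean)
```

Nothing here is sophisticated, which is the point: if your analytics system lets you see individual records, a check like this takes minutes; if it only shows aggregates, the $500-ARPU problem is invisible.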
Even when your data is accurate it can still be deceiving. This looks like four
separate pictures photoshopped together to create an appealing color grid, but
it isn’t what it appears to be. So much of data is like these pictures – a
set-up that appears straightforwardly to be one thing from one angle turns out
to be completely different from another.
People are playing game 1 longer than game 2, and buying repeatedly. But if
you just concentrated on daily monetization stats you could miss that entirely.
The witnesses may be lying or confused. The crime scene may have been contaminated.
You can’t trust any one piece of evidence but by cross-checking them
against each other you can figure out what’s true and false.
Client data (our SDK, Adjust) vs server data
Benchmarking against other games
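A sketch of what cross-checking sources can look like in practice, assuming you can pull daily counts out of each system. The source names, dates, and counts below are all made up; the idea is simply to flag days where two independent views of the same event disagree beyond a tolerance.

```python
# Hypothetical daily install counts from two independent sources,
# e.g. a client SDK vs server logs. All numbers are invented.
client_installs = {"2018-03-01": 1000, "2018-03-02": 1050, "2018-03-03": 400}
server_installs = {"2018-03-01": 990,  "2018-03-02": 1041, "2018-03-03": 980}

def discrepancies(a, b, tolerance=0.05):
    """Return the days where the two sources differ by more than
    `tolerance` (as a fraction of the larger count)."""
    bad = []
    for day in sorted(set(a) & set(b)):
        if abs(a[day] - b[day]) / max(a[day], b[day]) > tolerance:
            bad.append(day)
    return bad

suspect_days = discrepancies(client_installs, server_installs)
```

Small disagreements are normal (ad blockers, retries, clock skew), which is why the check uses a tolerance rather than exact equality; a day where one source shows less than half the other’s count is the kind of thing worth investigating by hand.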
Your goal should be to create a 3-dimensional view of your players and your
game. How people move through and interact with different parts. It’s a living,
changing system and flat views are not enough.
We tend to think of playerbases as monolithic but really they are
aggregations of all sorts of subgroups created by time in game, platform,
device, browser, demographics, source – and these subgroups are shifting
around. Changes in key KPIs are more often the result of changes in the
audience than they are of changes in the game.
These examples show dramatic changes, but more subtle audience changes
are happening all the time. Tracking cohorts by date of install/registration is a
good way to track metrics independent of certain types of mix issues, but
then it’s easy to lose track of events and changes in the game. So as ever,
it’s about building a true picture across multiple sources.
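The cohort idea above is just a group-by on install date. A minimal sketch, with invented players and retention flags, of computing D7 retention per install cohort next to the blended number:

```python
from collections import defaultdict

# Invented data: (install_week, retained_on_day_7) per player.
players = [
    ("w1", True), ("w1", True), ("w1", False), ("w1", True),
    ("w2", True), ("w2", False), ("w2", False), ("w2", False),
]

def d7_by_cohort(rows):
    """D7 retention rate grouped by install cohort."""
    hits, totals = defaultdict(int), defaultdict(int)
    for week, retained in rows:
        totals[week] += 1
        hits[week] += retained
    return {w: hits[w] / totals[w] for w in totals}

rates = d7_by_cohort(players)
blended = sum(r for _, r in players) / len(players)
```

The blended rate here sits between two cohorts that behave very differently; if the mix of w1-like and w2-like installs shifts, the blended number moves even though neither cohort changed, which is exactly the mix effect described above.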
This game saw a 75% ARPDAU decline, then a modest recovery to ~50% of the
original level. When you break out ARPDAU by player age you can see that
the decline isn’t nearly as dramatic. There’s some decline after a
big holiday sale, and then again some as we expanded UA
aggressively. But most of it comes from the audience mix shifting toward newer players.
This is for a collectible card game where the player who goes first has a significant advantage.
On this chart of player win rates for Tyrant it looks like Mission 24 is very
difficult (50% win rate) and Mission 25 is easy (95% win rate). It’s sort of true:
Mission 25 is relatively easy for those who attempt it. But by deck strength
it’s harder than 22, which has a 70% win rate. Mission 25 is easy for the
players who are strong enough & skilled enough to beat Mission 24, a
selected subgroup of those who attempted 24.
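This selection effect can be reproduced in a few lines. The deck strengths and the win condition below are invented purely to illustrate the mechanism: if only players strong enough to beat Mission 24 ever attempt Mission 25, its observed win rate is computed over a pre-filtered, stronger-than-average group.

```python
# Invented data: (deck_strength, beat_mission_24) per player.
players = [
    (30, False), (40, False), (55, True), (60, True), (80, True),
]

def wins_mission_25(strength):
    # Hypothetical difficulty: mission 25 needs strength >= 50 to win.
    return strength >= 50

# Only survivors of mission 24 ever attempt mission 25.
attempters = [s for s, beat_24 in players if beat_24]

win_rate_observed = sum(wins_mission_25(s) for s in attempters) / len(attempters)
win_rate_everyone = sum(wins_mission_25(s) for s, _ in players) / len(players)
```

The observed win rate among attempters is perfect, while the rate over the whole playerbase is much lower; the mission didn’t get easier, the population got stronger. The same logic applies to any funnel metric computed only over players who reached that step.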
So for the last 10 minutes I’ve been ranting about how important it is to look
at audience mix splits.
The most important metrics (revenue, sessions, battles, etc) in games are all
power distributions. Your business (especially in free-to-play games) is
driven by outliers, and their presence or absence distorts almost any data
you look at.
Your outliers are your best players so it’s a good idea to do individual
analysis on them to understand who they are, what drives them, and what
they’re most likely to distort.
Binary “yes/no” metrics like % buyer, D7 retention, and tutorial completion are a
lot more stable than averages involving revenue and engagement like
ARPPU, $/DAU, and average sessions, and can be looked at in much smaller samples.
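A quick simulation, with made-up spend parameters, illustrates why. With a heavy-tailed spend distribution (here a lognormal for the 5% of players who buy), the relative spread of ARPU across repeated small samples dwarfs that of % buyer, because a single whale landing in or out of a sample swings the average.

```python
import random

random.seed(7)

def simulate_player():
    """Invented model: 5% of players buy; spend is heavy-tailed,
    so a few whales dominate total revenue."""
    if random.random() < 0.05:
        return random.lognormvariate(3, 1.5)  # spend in dollars
    return 0.0

def sample_metrics(n):
    """% buyer and ARPU for one sample of n players."""
    spends = [simulate_player() for _ in range(n)]
    buyers = sum(s > 0 for s in spends)
    return buyers / n, sum(spends) / n

pct_buyer_runs, arpu_runs = [], []
for _ in range(200):                     # 200 repeated samples of 500 players
    p, a = sample_metrics(500)
    pct_buyer_runs.append(p)
    arpu_runs.append(a)

def rel_spread(xs):
    """Coefficient of variation: standard deviation / mean."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return var ** 0.5 / mean
```

Under these assumptions the run-to-run variation of ARPU is several times that of % buyer at the same sample size, which is why the binary metric is trustworthy in small cohorts long before the revenue average is.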
Sometimes we do it consciously, but more often it’s unconscious. I’ll look at a
group of cohorts and the best one is ALWAYS the most memorable. If you’re
in test market and hoping to hit 50% D1 retention, the days you hit that number
will imprint on your brain that your game has 50% D1 retention, even if the
average is well below that.
Confirmation bias: cherry picking’s great and good friend!
Part of building a mental model of your game is having theories about
behavior, and if you have a theory you should test it. But it’s really easy to
look for the data that supports your theory and miss the data that contradicts
it, or even just muddies the picture.
How you visualize data has a big impact on how you perceive it.
Ice cream consumption and drowning are correlated, because they’re both
more likely to happen in hot weather. But “ice cream kills” would be a terrible
conclusion. We’ve all heard this a thousand times but we need to keep hearing it
like a mantra every day because we all make this same mistake over and
over and over. We’re humans, we’re wired to search for causation. It’s our
superpower and a curse.
Almost every metric you look at will be positively correlated with engagement
because the most engaged users do everything more. Maybe Facebook is
increasing engagement. Maybe only engaged players were willing to hit the
button and potentially spam their friends.
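This kind of confounding is easy to simulate. In the sketch below, "sharing" has zero causal effect on retention by construction; a hidden engagement trait drives both sharing and retention, yet sharers still retain dramatically better. All parameters are invented.

```python
import random

random.seed(1)

rows = []
for _ in range(10_000):
    engagement = random.random()               # hidden trait, never observed
    shared = random.random() < engagement      # engaged players share more...
    retained = random.random() < engagement    # ...and also retain more
    rows.append((shared, retained))            # sharing never affects retention

def retention(rows, shared_value):
    """Retention rate within the sharers or non-sharers group."""
    group = [r for s, r in rows if s == shared_value]
    return sum(group) / len(group)

gap = retention(rows, True) - retention(rows, False)
```

The observed retention gap between sharers and non-sharers is large even though the causal effect is exactly zero; an analyst who only saw the two observed columns would happily conclude the share button "drives" retention.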
This is the real way to separate correlation from causation and understand
what’s really going on. But it’s not a magic bullet, because nothing is that
easy. Testing has real costs in engineering time & overhead, complexity, and
divisions/confusions for the players, and the more tests you’re running the worse
it gets. There are also a lot of ways to screw up A/B testing even though it seems so
foolproof. Most A/B test traps are variations on themes I’ve mentioned, but
some are new, particularly issues around how people get assigned to tests.
For example, if you’re A/B testing your store, don’t assign people to the test
unless they interact with the store. It’s often easier to split people as they
arrive in your game, or at some other point, but a) there’s a chance you would
end up with a non-equal distribution of interaction with the tested feature, and
b) any signal from the test group would get lost in the noise of a larger population.
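One common way to get interaction-time assignment is a stable hash of player id plus test name, evaluated the first time the player actually opens the store. This is a generic sketch, not Kongregate’s actual system; the test name and ids are invented.

```python
import hashlib

def variant(player_id, test_name, arms=("control", "treatment")):
    """Deterministic, roughly uniform bucketing: the same player always
    lands in the same arm of the same test, with no assignment storage
    strictly required."""
    digest = hashlib.sha256(f"{test_name}:{player_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

def on_store_open(player_id, enrolled):
    """Enroll the player only at the moment they open the store, so the
    test population contains only players who actually saw the feature."""
    arm = variant(player_id, "store_layout_v2")   # hypothetical test name
    enrolled.setdefault(player_id, arm)           # record first exposure
    return enrolled[player_id]

enrolled = {}
a1 = on_store_open("p123", enrolled)
a2 = on_store_open("p123", enrolled)   # same player gets the same arm
```

Hashing on player id plus test name also keeps arms independent across tests, so players in the treatment of one experiment aren’t systematically in the treatment of another.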
Tests can have unintended consequences, so you should look at additional
metrics beyond the one being tested to make sure that you get the full
picture. Commercial A/B products often make you choose one metric for a
test to prevent you from fishing for the good result to decide the test on. I
think it’s more important to understand the full effects of the change that you
made (though fishing is bad, too.)
Early results tend to be both volatile and fascinating – differences are
exaggerated or totally change direction. People tend to remember the early,
interesting results rather than the actual results. People also often want to
end the test early if they see a big swing, which is a bad idea. So I
recommend that you don’t look at early test results except to make sure the
test isn’t totally broken. How big should your test sample be? In my opinion
the bigger the better.
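An A/A simulation shows why peeking is dangerous: both arms below are identical by construction, yet checking a naive z-test at 20 interim looks "finds" a significant difference far more often than the nominal 5%. The conversion rate, sample size, and number of peeks are all invented.

```python
import random

random.seed(3)

def z_for_diff(a_hits, b_hits, n):
    """Two-proportion z statistic for equal-sized arms of n players each."""
    p = (a_hits + b_hits) / (2 * n)
    if p in (0.0, 1.0):
        return 0.0
    se = (2 * p * (1 - p) / n) ** 0.5
    return abs(a_hits / n - b_hits / n) / se

def run_aa_test(n_total, peeks):
    """Simulate one A/A test (both arms convert at 10%) and return True if
    ANY of the interim peeks shows |z| > 1.96, i.e. a false positive."""
    a = b = 0
    step = n_total // peeks
    for i in range(1, n_total + 1):
        a += random.random() < 0.10
        b += random.random() < 0.10
        if i % step == 0 and z_for_diff(a, b, i) > 1.96:
            return True
    return False

# Fraction of 300 simulated A/A tests declared "significant" at some peek.
with_peeking = sum(run_aa_test(2000, peeks=20) for _ in range(300)) / 300
```

With no real difference at all, the peeking procedure triggers well above the 5% a single end-of-test check would allow; each extra look is another chance for noise to cross the threshold, which is why stopping on an exciting early swing is a trap.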
When people talk about A/B tests you’ll often hear things like “we’ve got a
statistically significant 5% lift!” And most people hear that and think that
means the lift is definitely 5%. But that’s not how statistical significance works.
Statistical significance tests assume that there is some true difference in lift,
and that if you run the same test repeatedly there will be a bell curve
distribution of results, with the true lift as the average. Your 5% result could
be right on the mean, or it could be an outlier on either end. If it’s statistically
significant then the chance is low (usually 5% or less) that there’s no lift at all.
But the true lift could be 1% or 10%. Conversely, if you do a test that doesn’t
show a lift, or doesn’t pass the significance test for a small lift, that doesn’t
mean there ISN’T a lift.
This is why I like to run A/B tests with larger sample sizes. It’s like running
the test again and averaging the results. It’s possible you’d get two outlier
results in the same direction, but it becomes less and less likely, and more
likely that your test results represent the true mean.
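To see how wide a "significant" result really is, here’s a normal-approximation confidence interval for the difference of two conversion rates. The inputs (10.0% vs 10.5% conversion on 200,000 players per arm, i.e. a 5% relative lift) are invented for illustration.

```python
def diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% confidence interval for (rate_b - rate_a) under the usual
    normal approximation for two independent proportions."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = (pa * (1 - pa) / n_a + pb * (1 - pb) / n_b) ** 0.5
    diff = pb - pa
    return diff - z * se, diff + z * se

# Hypothetical test: 20,000 / 200,000 converted in control,
# 21,000 / 200,000 in treatment -- a "significant 5% lift".
low, high = diff_ci(20_000, 200_000, 21_000, 200_000)
```

Even with 200k players per arm the interval spans more than a two-fold range of plausible lifts; the result is significant (the interval excludes zero) but the true lift is pinned down far less precisely than "5%" suggests, and larger samples shrink this interval.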
Often 70-80% of a free-to-play game’s revenue will come from a small % of
buyers who spend more than $500.
This can be really frustrating, even demoralizing for a team. When you’re
going through the effort to make and test changes, you want them to mean
something! You want to make progress. And then you get another non-result
on a test. But finding out what doesn’t matter can actually be really powerful.
Here’s an extreme example of this from the team at Butterscotch
Shenanigans, who made the game Crashlands. They had written up an
elaborate, detailed description and decided to test how much impact it had
using Google’s store testing system on Android against the most extreme
possible variant: no description at all, just the accolades the game has won.
They were kind enough to share the results, and after 4 full months the test
showed absolutely no difference. That actually tells you a lot: specifically
that the description has very little impact, and this is consistent with the
testing we’ve done on our own games, as well. Time and resources are a
constraint for virtually everybody, and knowing what is not important allows
you to concentrate more on things that do matter. We used to argue
endlessly over game names, but after doing test after test and not seeing
much difference we’re all much more relaxed about it.
But it’s important not to extrapolate too much. Just because you get a null
result in one test doesn’t mean the same change can never matter. Late-game
content in particular is very difficult to test, as is anything involving
late-game players.
Daniel Cook from Spry Fox tweeted this recently. He was talking about
YouTube and algorithms, but I think it helps frame some of the limitations of
testing. As a player plays a game, the game is shaping their expectations
and experience, and training them to behave in certain ways. So the same
player might react very differently based on how long they had been playing
the game. And when engaged players start talking to each other in chat and
forums they affect each other, too. Plus you run into small sample sizes with
lots of outliers and other fun problems I’ve already talked about.
Tyrant was successful with a small core audience, but difficult to market.
CPIs for the live version of Castaway Cove are okay, but much higher than we’d
been targeting. There are lots of ways we probably went wrong.
So far data has helped us iterate on existing games, pointing us in the
direction that helped get us from Tyrant to Animation Throwdown. But in
creating something genuinely new, data can only take us so far.
But what we don’t know is as important as what we do know
Data is always going to tell you to make an existing successful game, but
better. It’s not going to tell you to make a game unlike anything people have
seen before.
But what we don’t know is as important as what we do know