Mobile games have become increasingly high-stakes over the last few years. Successful games make billions, but most games launch to failure, and few get a second chance from either platforms or players. Most developers test their games in various ways, from individual playtest sessions to geo-locked launches in Canada and Australia. But many games still launch with poor retention, monetization, tech problems, or some combination of the three. What's going wrong?
Kongregate has now put more than 20 games through test markets, learning valuable lessons along the way. This talk from GDC 2016 is a pragmatic guide to creating a test strategy, taking into account budget and schedule along with the benefits and pitfalls of various methods and the psychological traps that teams fall into as they evaluate results. The talk is illustrated by case studies and metrics from Kongregate's portfolio including AdVenture Capitalist, Spellstone, Raid Brigade and more.
A few years ago, when social games were exploding on Facebook, the
conventional wisdom was that you wanted to release your minimum viable product as
quickly as you could, and iterate on it in the wild with real data from players. But that
only made sense in a world where if you “wasted” your early traffic on a poor game it
was relatively easy to get more. Mobile is the opposite: traffic is always at a premium
and a strong global launch has become crucial to success. It’s the best chance to get
substantial features from the platforms, to get noticed in the new charts, and
potentially get picked up by recommendation algorithms.
Ideally we’d all be like Blizzard & Supercell and be able to polish and test games
internally for years, but there are all sorts of pressures and costs pushing games out
the door. For companies with <$1B in annual profit it’s important to be realistic about
what, why, and how to test to maximize our chance at success.
We started Kongregate back in 2006 as an open platform for browser games, a little
like YouTube for games: anyone can upload, we then add chat, comments, forums,
achievements, and a lot of other social features that make the site a whole game
itself. In 10 years more than 100,000 games have been put on our site, covering
pretty much every genre imaginable. A fairly wide array of games are popular, from
casual puzzles and launch games to MMOs and collectible card games. Overall our
audience trends male, with heavy overlap with console and Steam.
Four years ago we started publishing third party games on mobile, and have
launched more than 20 games in the last 3 years. Like on Kongregate itself we
publish a fairly broad range of games, from more niche, high monetizing RPGs and
CCGs to single-player games with mostly ad monetization. To give you a feel of the
range here are the games we published in 2015.
And like everything in life, these testing methods have different strengths and weaknesses. To use
them properly it’s important to understand what those are, so I’m going to spend a bit of
time going through the different types.
One note: I’m not going to be talking much about defect and bug-oriented QA testing,
which is not to say it’s not important. It’s very important, and you should do it, ideally
with dedicated in-house resources supplemented by 3rd party resources. But I only
have an hour so I have to skimp on some topics.
By Team Playtests I mean both the testing you do naturally as you add on features to
the game, and also scheduled team sessions to look at the game more broadly.
The game is always available, and everyone’s already getting paid to do this
You don’t play like a player: you know how things are supposed to work, and you
know games too well – every convention is obvious to you. And since you’re testing to
make sure features are working, you play the game in totally unnatural ways: “I’m
jumping right to x,” “I’m pushing all the buttons,” “I’m going to play with an OP account
to breeze through.”
By in-person playtests I mean getting a 3rd party unrelated to game development to
play the game. It could be as informal as handing somebody a device out in the wild,
or could be an organized, in-office thing where you hire people to come in and play
• They’re a pain-in-the-ass to arrange, whether you’re bringing strangers from
Craigslist into your office or harassing them at a coffee shop. And they take a lot
of skill to run & analyze well – not prompting the person, not jumping to solutions,
not conflating problems, hearing beyond what they say
• They are psychologically difficult – exposing your work is hard. It’s pretty
equivalent to the feeling most of us have about getting up to sing in a public
karaoke bar, only without it being appropriate to get a little drunk first. And when
combined with being a pain to arrange, they get pushed off indefinitely: “it’s not
ready,” “they’ll only tell us what we already know.”
• Depending on how you’re recruiting, the testers may be far from a representative sample
Companies like UserTesting.com (which we use) and others offer remote playtesting services
• Generally a directed task testers are supposed to complete, so less natural
experience than just picking up a game and exploring, higher willingness to
“figure it out”
• Limited view of body language, narration expresses conscious thoughts, not
unconscious, no chance to follow-up
• Still limited to first session experience, now with a time limit
• Still small sample, luck of the draw on testers, some selection bias in terms of
who takes these kind of gigs
By this I don’t mean in person, but sending out a mobile build or a link to a web
version to a group of people you know and seeing what happens
• More realistic experience, they’re testing the whole game across as many
sessions as they want to play
• Depending on who it goes to, you can get good qualitative feedback
• Audience tends to be biased in your support, professional game developers, or
both. a lot of people won’t want to hurt your feelings. You’re unlikely to hear “this
sucks” even if it’s true.
• Low sample size & unrepresentative/biased audience = mostly garbage metrics
The way to think about this: assume tutorial completion is around 80% globally. If I
get 400 installs representative of the global audience, then 95% of the time the
observed tutorial completion rate will be between 72% and 88%.
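That range is just a binomial confidence interval. Here is a minimal sketch of the standard normal-approximation version; the exact bounds you get depend on how conservative an approximation you use:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of a ~95% normal-approximation confidence
    interval for an observed proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Tutorial completion around 80%: the interval tightens with the
# square root of the sample size, so quadrupling installs only
# halves the uncertainty.
for n in (100, 400, 1600):
    print(f"n={n:5d}: 80% +/- {ci_half_width(0.80, n):.1%}")
```

That square-root relationship is why sample sizes that feel large can still produce fuzzy metrics.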
Now just because you have a smaller sample than this doesn’t mean metrics are
useless. In a normal distribution you are more likely to be closer to the mean than
farther away. My unscientific rule of thumb is that you start getting directionally useful
if not accurate metrics when an event has occurred 75 or more times. So for
directionally useful tutorial results you need about 100 people installing a game, and
for directionally useful buyer conversion rate more like 5,000.
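That arithmetic is easy to wrap in a helper. (The 1.5% buyer conversion rate below is my assumption, back-solved from the ~5,000 figure; it is not a quoted Kongregate number.)

```python
def installs_needed(event_rate, min_events=75):
    """Installs required before an event has occurred ~75 times,
    the unscientific threshold for directionally useful metrics."""
    return round(min_events / event_rate)

# ~80% tutorial completion -> roughly 100 installs
print(installs_needed(0.80))
# ~1.5% buyer conversion (assumed) -> roughly 5,000 installs
print(installs_needed(0.015))
```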
By this I mean inviting a broader group of players to play a game not yet released,
either on web or Android, usually volunteers from a fanbase
• All the benefits of friends & family tests with larger sample sizes!
• REALLY engaged audience excited to give feedback, not constrained by
politeness. They’ll tell you your game sucks.
• Chance to build a community for your game pre-release
On Kongregate.com beta access to games is a benefit of Kong+, our ad-free membership
that players pay $29.99/year for. We also gift it to our volunteer moderators and big spenders.
There are about 30,000 MAU, and the average beta game gets ~3k beta users.
We consistently have 5-10 games in closed betas, which we use for our publishing
portfolio, but the program is also open to other developers
Metrics are the product of the audience and the underlying game. You can get
average metrics by either putting poorly qualified traffic into an amazing game, or by
putting amazing traffic into a mediocre game. Now this is somewhat obvious, but if
you’ve been working on a game for a year it’s easy to forget about audience, and
think metrics are entirely about the game. It is especially easy to underestimate how
BIG the audience swing can be.
This was the most extreme split we’ve ever seen, with a 9X difference in % buyers, a
metric that is more commonly 3-4x higher in beta than in global release. But even though
it’s inflated it’s still useful: we could tell this was going to be a high-ARPU game with
mediocre initial retention but good long-term retention. (Note: web D1 tends to be
much lower than mobile, but they are comparable by D30.)
This is the now classic method for mobile, releasing fully but in just a few countries,
often Canada and/or Australia
The real thing is hard, and a lot of the weaknesses are just aspects of the mobile game business:
• It’s hard work releasing anything on mobile – builds, screenshots, but particularly
getting games working on such a broad range of devices
• Long Apple approval times make iteration slow even when you can quickly fix a problem
• Traffic doesn’t magically show up in games, and buying it is expensive – the average
is ~$3 per install in Canada & Australia, a bit less on Android
Australia & Canada may be good proxies for the US, but that’s likely to be <1/3 of your global revenue
They used to be closer, but the gap widened after the release of the iPhone 6 when a
lot of high end device users switched back to iOS
Especially for more niche, high-LTV genres like CCGs, RPGs & Strategy games,
whether it’s CPIs going up or retention & LTV going down after your first “golden”
cohort of early adopters is exhausted
In a small market a few big buyers can blow out a market. And test market spend
tends to be less ROI focused, so you see weird patterns.
Spellstone, a polished collectible card game from Synapse games, shows some of
the dynamics. You can see that the performance of iOS is generally much stronger
than Android, and that paid generally is quite a bit stronger than organics. But the
really dramatic number is the huge drop in performance on organic traffic coming from
the substantial features that the game got, with around a 70% drop in the ARPU on
both iOS and Android. This is most dramatic with more niche, high LTV genres like
CCGs, RPGs & Strategy games. For a more casual, broad audience game like
AdVenture Capitalist we don’t typically see a big delta.
Your game is driven by outliers, and their presence or absence distorts almost anything
you look at.
Binary “yes/no” metrics like D1 retention or tutorial completion are more reliable than
averages involving engagement or revenue. And the deeper your game, the less
spending is capped, the more unpredictable those averages get. So as much as
possible look at binary metrics that are proxies for the averages, or that answer the
same questions: % repeat buyers, for example, rather than ARPPU.
Don’t get me wrong, I love A/B tests. They disentangle causation from correlation, and
manage the audience mix problems well, too. But don’t expect to be able to run a lot
of A/B tests in test markets unless you’re willing to spend major $$s. The problem is
sample size. Take the numbers I was giving earlier for cohort sizes then double them
for an A/B test. And with an A/B test you need to measure the results fairly precisely –
directional numbers are not good enough, or you can make bad choices based on the noise.
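To see why the samples get expensive, here is a rough two-proportion power calculation; the retention and conversion figures are illustrative assumptions, not numbers from the talk:

```python
import math

def ab_sample_size(p1, p2):
    """Approximate players needed *per variant* to detect a shift
    from proportion p1 to p2 (two-sided test, alpha=0.05, 80% power)."""
    z_alpha, z_beta = 1.96, 0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a 35% -> 40% D1 retention lift: ~1,500 players per variant.
print(ab_sample_size(0.35, 0.40))
# Detecting a 1.5% -> 2.0% buyer conversion lift takes ~7x more.
print(ab_sample_size(0.015, 0.020))
```

Rarer events and smaller lifts both blow the required cohort sizes up fast, which is why revenue-side A/B tests in a small test market are usually out of reach.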
More of a strategy than a testing method, but one I think is underused. I’m biased of
course! I have a web portal. But this strategy has worked for a lot of big successes,
from King with Candy Crush to Blizzard recently with Hearthstone.
There are many web platforms to choose from – Steam, Kongregate, Facebook, Miniclip, Newgrounds, Addicting Games and
hundreds more. But it’s important to find the right audience fit – Facebook is a much
better choice for a very casual game than Steam or Kongregate
Better social feature support (forums, videos, streaming, etc) to build community
Comparable LTVs (at least on Kongregate)
Chrome no longer supports the Unity plug-in, and Firefox will likely kill it by the end of
the year. Flash is still going for now, but will likely be phased out in a few years. But
the WebGL export from Unity is improving rapidly, and there are a lot of other good
cross-platform frameworks to work with, such as Haxe.
Over the next 6 months they worked on polishing the UI, adding monetization, all
while releasing it on dozens of additional platforms across the web, and were able to
extend the content and improve the balance while building a bigger and bigger audience.
Mobile test markets lasted about 2 months, focused on mobile device stability, FTUE
and the new rewarded video integration, a huge addition to monetization
And since each type of test adds something, the ideal plan is to use all of them, in
approximately this order, with team playtests throughout taken as a given.
How much money and time you have are intimately related: time is generally the
biggest cost because each month of a studio’s burn rate adds up. But there are
situations where that’s not true: in a big company you may have intense pressure to
launch by a certain date but a lavish budget for test market marketing. Or an indie
doing work-for-hire or with a full-time job may have little time pressure but no money
to buy traffic. The indie should focus on in-person playtests and closed betas, while
with enough money the big studio can blast their way through geo-locked test markets.
A simple puzzle game or endless runner that is easy to pick up and play is going
to get more value out of the experiential testing that helps it nail the fun in the core
experience, but isolated, single session tests are much less useful to multiplayer
games with deep metagames and economies, where long-term metric-based testing
is crucial to getting things right.
Games that are graphics-heavy, or otherwise technically demanding, are going to
need extra time in mobile geo-locked test markets to deal with all the problems that
crop up with low memory and low GPU devices on both Android & iOS
And finally: what are your goals? Do you expect to get significant features from the
platforms? Are you looking for top 10 grossing? Top 1000? The bigger the launch you’re
expecting, the higher the stakes and the more crucial it is to have the game in the
best state possible at launch.
Assuming both time and money are at least somewhat constrained. Say a mid-sized
studio with most of their burn covered by existing game income, but not a big cushion.
This still assumes 6 months from friends/family to global mobile release. If you cut it
much shorter on a multiplayer game you will almost certainly regret it.
Single-player game made by a small cash-strapped team, mobile-specific controls
and ad-based monetization. In this case most of the real game testing and iteration
should be on the back of rigorous, frequent, in-person testing. Then mobile test
market can concentrate on just a few metrics.
During pre-production & production the key question you’re asking is “Is this fun? Are
we on the right track?” The sooner you figure out something isn’t working as you
expected, the easier it is to fix. The more you are departing from convention and
comparables the more you need to validate as you go along.
As you approach release you’re asking “Is this ready?” It’s a great time to do remote
playtests focused on the first time user experience, and make sure that analytics are
hooked up and firing correctly – that last is not a given, analytics are very easy to
screw up. This is also a great time to send your game to a 3rd party QA service to test
on a broad range of devices if you’re going straight to mobile.
It’s the next Clash of Clans! Or Crossy Road! Everything is broken! Who would even
play this piece of shit? Total failure.
Depending on the person, they may cherry-pick the good, or focus only on the bad.
This is where the rubber really meets the road. To know if the game is working, you
have to know what you’re looking for.
We have a very successful game with 20% D1 retention. We have very successful
games with $0.03 ARPDAUs.
So here are some sample metric ranges, from low to high, first for the genres I’ve been
using in the example test plans, then for all genres. This is loosely based on the metrics
we’ve seen from games in our portfolio and more generally on Kongregate.com. As
you can see the “all genres” low/medium/high can be pretty different from the ranges
by genres. Good retention for a multiplayer RPG game is drastically lower than an
idle game. Good monetization for a casual runner would be terrible for that same
RPG. What’s missing from this is expectations around traffic: that casual runner has
probably 10x the potential traffic of the multiplayer RPG.
Here are a couple of outcome models based on the average metrics for each of these
genres, then broken out by different levels of traffic. You’ll notice these profitability
scenarios aren’t great: games with average metrics need exceptional traffic to
become a sustainable business. In general for a success you need at least one or two
metrics to fall in one of the “high scenarios”.
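A back-of-envelope version of such an outcome model might look like this; every number here is an illustrative assumption, not Kongregate’s actual data:

```python
def outcome(installs, d30_retention, arpdau, days=180, cpi=3.00):
    """Crude profitability sketch: revenue from the retained daily
    audience over a launch window, minus the cost of paid installs."""
    avg_dau = installs * d30_retention       # rough steady-state DAU proxy
    revenue = avg_dau * arpdau * days
    return revenue - installs * cpi

# Average metrics on fully-paid traffic: deep in the red.
print(outcome(installs=1_000_000, d30_retention=0.02, arpdau=0.03))
# The same game with exceptional (free, featured) traffic flips positive.
print(outcome(installs=1_000_000, d30_retention=0.02, arpdau=0.03, cpi=0.0))
```

Even this toy model shows the shape of the problem: with average retention and monetization, paid installs at typical CPIs don’t pencil out, so something has to be exceptional.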
Set your goals by matching realistic expectations for the genre with acceptable (not ideal) outcomes.
If you need to hit top 20 grossing to justify huge budget/company expectations, your
goals should be much higher than if you mostly want to learn.
It’s not that you don’t look at other metrics, but you want to set the gates based on the
most important one for that stage.
One of the benefits of breaking your testing into stages is that on mobile it allows you
to use a wider variety of test markets. Canada & Australia are not only expensive
places to test stability, they’re a bad choice because they have a much lower % of the
low-end devices most likely to trigger issues. That’s better tested in emerging
markets. Overall testing in a range of countries from tier 1 to tier 3 will give you a
much more representative view of your global performance than limiting to a few
English-speaking markets. We’ve used more than 20 different countries in the last
year, all shown in this map.
Note that the sample sizes are cumulative. 12k isn’t enough for statistical significance
on buyer %, but 25k is.
This is not optional, but a surprising number of developers don’t track crashes. Crashes are
annoying to players, affecting both retention and your ratings and reviews in the app
stores. We recommend a 3rd party service like Fabric/Crashlytics or HockeyApp,
though there’s quite a bit of info in the Google Play console as well.
Stage 2 is where you’re really optimizing your game, and you should look at metrics in the
same stream that players are moving through the game, as that’s the order in which
sample sizes will get large enough, and because problems in one area will likely flow into
the next. You start with the progression through the first time user experience,
checking drop-off pre-tutorial, and then at each tutorial step. Then you’ll watch PVE
progression: what’s the progression through missions, what are the win rates? After
players have been in the game a little longer you want to look at PVP participation
and win rates, and then finally at the economy, where you should look at the full sinks
and sources flow, but keep a particularly close eye on resource balances and how they change over time.
Retention is the KPI reflection of progression. Without longer term retention, which
reflects commitment and engagement with the game, few people will pay. Conversion
reflects both engagement and balance: do they care enough to buy something, and is
there a good reason to. If retention is good but conversion is bad, then either what’s
being sold is not compelling, the balance is not challenging enough for it to be useful,
or the economy is imbalanced in a way where there’s no reason to purchase,
because you can get it for free. Note that there is some tension between retention
and conversion, though, because tight economies may make players more likely to
pay but also more likely to burn out and leave.
New buyer packs can be great at boosting conversion, while masking underlying
problems. One of the most important stats to look at is repeat conversion – how many
people buy a 2nd time? A 3rd? Repeat conversion shows both how players feel about
their first purchase and whether there is depth of spend. If you have a high
conversion rate but low repeat purchasing, your game will just pop and drop.
Note that I haven’t mentioned ARPPU. That’s because while it’s an important metric,
it’s not one you can really look at with any reliability because revenue is an
exponential distribution, and very erratic in small samples. However ARPPUs are
capped fairly low unless there are repeat purchases, so looking at that statistic, which
is approximately normally distributed, answers the same question with better stability.
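Repeat conversion is also trivial to compute from a raw purchase log. A minimal sketch, using a hypothetical log:

```python
from collections import Counter

def repeat_buyer_rate(purchases):
    """purchases: iterable of (player_id, amount) tuples.
    Returns the share of buyers who purchased more than once."""
    counts = Counter(player for player, _ in purchases)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c >= 2) / len(counts)

# Hypothetical log: three buyers, one of whom came back for a 2nd purchase
purchase_log = [("a", 4.99), ("b", 0.99), ("a", 9.99), ("c", 4.99)]
print(f"{repeat_buyer_rate(purchase_log):.0%}")  # 1 of 3 buyers repeated
```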
It’s human nature to project causation, so as you make changes to your game you’re
likely to look at daily numbers and think it’s the result of what you’ve done. Resist.
Even with a game doing a reasonably big test market you get tons and tons of
random variation in the daily numbers. These numbers are from Spellstone’s test
markets, which had between 1,000 and 1,500 DAU through most of this period, but the daily
numbers are all over the place.
We track rolling averages, which helps, but if you want to look at the impact of
changes roll up cohorts from before and after the change to get a statistically
significant sample to look at. And still take that with a grain of salt because of
audience mix and confounding effects from other changes you’re making.
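Both techniques are only a few lines of code; the daily figures below are made up for illustration:

```python
def rolling_average(daily_values, window=7):
    """Trailing window-day average to smooth a noisy daily metric."""
    smoothed = []
    for i in range(len(daily_values)):
        chunk = daily_values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def pooled_rate(events, cohort_sizes):
    """Roll several daily cohorts into a single rate, e.g. D1 retention
    before vs. after a change, instead of eyeballing each day's number."""
    return sum(events) / sum(cohort_sizes)

# Hypothetical D1 returners / installs around a change
before = pooled_rate([40, 55, 38, 47], [130, 160, 120, 140])
after = pooled_rate([52, 60, 49], [135, 150, 125])
print(f"before: {before:.1%}, after: {after:.1%}")
```

Pooling the cohorts this way gives one before/after comparison on a decent sample instead of a dozen noisy daily readings.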
An important part of mobile test markets is optimizing the assets you’re using in the
app store – we regularly see substantial gains testing icons, video, screenshots, and
copy, though we’ve never seen a significant difference from game name testing. But
context is important – results are often inconsistent between app stores and geos. For
that reason we don’t start our ASO until we’ve expanded beyond T3 markets.
Google’s tools for this are great, but for iOS you need to use a paid third-party testing service.
Test market marketing isn’t just about driving installs, it’s about testing the marketing
itself, optimizing creatives & targeting, generally figuring out: will this work? Will I be
able to drive audience into my game? Can I do it profitably? Test a lot of creatives
across a lot of networks. Keep refreshing. You never know what’s going to work
because again context matters a lot.
There’s nothing worse than getting a big feature and then having the servers crumble
under the load. It feels like flushing success down the toilet, and you are. Now “how to
load test a game” is a subject that deserves its own GDC talk, and I’m not the person
to give it. But if you have a server-based game and any hope of success, you need to load test before launch.
Now I want to go through a case study of what this looks like with a game that
was neither a triumph nor a failure.
Raid Brigade is a party-based Action RPG with base-building elements and an
unusual one-handed control scheme. It’s the first game from Ultrabit, a San Diego
studio made up mostly of Zynga veterans. After about 12 months of
development we released it to mobile test markets last June, skipping our normal PC
stage because of questions on how the controls might work. We were all excited for
the game, but the initial results were way below our goals and expectations: the first
few weeks saw dreadful performance with only 40% of people getting through the
tutorial and D7 retention 75% below the goal numbers.
The good news is that there were lots of obvious things to fix. There were long
loading times for assets being streamed in, essential to keep the initial download size
of a game with 3D art under 100MB. Improving those and the tutorial got tutorial
completion up to 70% and doubled D7 retention, though still well below our goal.
To help people get into the various branching systems we then switched from a linear
tutorial to one based on a series of quests, which again helped increase D7 retention
significantly, this time up to about 12%. Again it was great to see that much
improvement, now 3x what we started at…but still below goals. And gains in retention
after that became harder to get, though we kept working on it, along with many other improvements.
After 3 months in test markets we had to face the dilemma: the game was meeting
some of our goals (crash rate, tutorial, D1 retention, conversion) but not all of them
(D7, D30 retention, Repeat Buyer %) and the developer’s runway was starting to be a
concern. We could go ahead and launch in October as planned, or keep working on
it, cut into the developer’s runway, and go up against the glut of games coming out in
the holiday season.
When your metrics are good the decision path on when to launch is easy. When you
have plenty of time and money, you can usually keep working on the game, though
even there it’s important to be realistic about whether the game is fixable. If you’re
pretty sure you understand what’s wrong and have a good idea to fix it that’s great,
but there may be diminishing returns or unfixable flaws. When you get to the point
where you either can’t keep working on the game, or it doesn’t seem worthwhile then
the question becomes: do you launch?
At that point it’s time to think of the money you’ve spent on development as a sunk
cost, and think just about the effort needed to launch and support the game after
launch. For a single-player game without servers, this is fairly simple, and in most
cases you should go ahead and launch and see what happens. But with multiplayer
games this becomes more complicated as there are ongoing server costs and the
necessity of releasing additional content to drive revenue and engagement, as well as
critical player mass and opportunity cost issues. Supercell and some other companies
would only support a game if it’s a huge success, but exactly what that level should
be is going to have different answers by studio. But personally I don’t think you should
launch a game unless you can support it for a fairly extended time frame, a year plus
at least. Players are investing in your game, and that should be respected.
In 4 years of publishing we’ve never cancelled a single-player game, though we did
skip the Android version for one. But this year alone we canceled
three multiplayer games, one after a Kong+ beta and two during mobile test markets.
Making a game these days can feel like walking into a dark forest. But remember: you
have many tools at your disposal. Be prepared, as the Boy Scouts say: have a plan,
be realistic, and hopefully you can make it through the forest and find the treasure
you were looking for.