From Theory to Practice
A Phased Approach to User Testing

Aaron Norstad
Casual Connect Magazine, Winter 2009

What is the difference between creating a casual game and shipping a hit title? Certainly there is a vast array of factors (including solid design, a talented development team, a sufficient production budget, and a strong marketing plan) that contribute to the success of any title. But one often-overlooked key is continued user testing throughout the entire production cycle. That’s not to say that it is as simple as building the game and testing it along the way. If it were that easy, then there would be less mediocre content available and higher conversion rates across the board. The concept of user testing is incredibly easy to grasp. The practice of doing it, however, and doing it well, is another story.

User testing is not rocket science, but doing it correctly so that the testing yields useful results is hard and requires multiple testing cycles. In a sense, it is similar to the continued stages of crash testing automotive companies go through while engineering and developing new cars. Engineers continually road test and crash vehicles, analyze the data, refine designs, re-engineer, and then retest.

Testing games follows the same process: develop a concept, test, develop a design, test, build the game, test, redesign and rebuild the game, crash test, then re-engineer and retest. This process can be applied to the development of any game title, be it an educational game, a core console game, or a casual downloadable game. However, in casual development this iterative process is especially important because of the end user. People who play and purchase casual games are discerning consumers, but they aren’t always the most sophisticated online gamers. They grew up on cards, boards, and perhaps Pong paddles; they weren’t born with fourteen-button control devices in hand. This crowd of end users requires additional cycles of diligent user testing to further refine, simplify, and perfectly balance UI and level design.

From its inception, PlayFirst has spent significant resources refining a formula for creating and launching hit games. We have come up with a five-phase research program in an attempt to turn consumer testing into more of a science. Our methodology looks like this: Informal Usability, Formal Usability, Internal PlayDate, First Peek, and Public Beta.

We see strong ROI from our testing methodology, and Casual Connect asked us to share our experience by defining and stepping through these five phases in an effort to explain how to put the theory into practice. The take-home message is: test often, test smart, re-engineer, and retest.

Step One: Informal Usability
Testing isn’t cheap, and redesigning based on usability feedback is even more expensive. The entire idea of iterative testing is to find issues as quickly and cheaply as possible, and to make sure the product being developed actually entertains the target market as much as the designer feels the game should. Therefore, it is beneficial to begin performing informal testing as early in the development cycle as possible; however, you must balance the need for early feedback with the need to make sure the game is ready to test. Testing too early may result in false negatives simply because the game mechanic just isn’t playable. For this reason, we try to map the First Playable as a marker for when to begin doing Informal Usability.

Objective: The goal of Informal Usability is to refine the game concept and design and then make any changes necessary to get the Pre-Alpha build ready for more formal usability testing.

Methodology: During Informal Usability we bring people from the target market into our office to spend 30 to 60 minutes playing the First Playable build. It is best when administered by an impartial third party. We have a Marketing Brand Manager run these test sessions, and in a pinch we’ll use the game’s Designer or Producer. The preferred method is to have a game build that includes tutorials in place so a user can just sit down and begin playing with very little instruction. If the game doesn’t have the necessary tutorial scaffolding, then we conduct the test sessions with minimal guidance and watch the users stumble along rather than directing them how to play. We use various San Francisco Bay Area websites to recruit users for Informal Usability. This can be done fairly easily in most metropolitan areas by posting game testing ads on various community sites. Prepare a well-crafted qualifier document to use for selecting test candidates, and then begin interviewing candidates to narrow down the final pool of testers. We offer a small monetary payment along with free game coupons for the service of performing the usability tests.

Time Period: Informal Usability should take place over two to six weeks building up to Pre-Alpha. That said, it is the one test phase that can and should continue throughout all of production up through Beta. Phases of user testing can also begin prior to First Playable; however, those earlier phases would be more like paper and prototype testing and should have a different set of goals and criteria related to teasing out potential design issues. (One we’re always looking to determine as quickly as possible: “Is it fun?”)

Best Practice: Be objective, ask open questions without leading the testers, take good notes, and apply the results to make game design changes. Also, be sure to know the audience. It is great to sample a large cross-section of users, but temper results that come from users outside of the core demographic. Ideally, the testing is focused on core users with people outside of the core group supplementing the testing. For example: Does your mom like you? Then she’s not a good tester. Friends and family are great for initial prototype testing, and potentially late cycle testing, but not usability testing. Finally, make sure not to recycle testers from the pool of candidates. It’s not a good practice to be in the business of training professional testers.

The Scariest Moment I’ve Ever Had at a Usability Session: It wasn’t seeing people cry (which has happened by the way), but hearing our Creative Director look a game designer in the eye and say “the big issue you have with your game is that you don’t have a game.” After many months of design and development, one can imagine this didn’t go over very well. Incidentally, after the usability test we spent six months putting “a game” into the game, and it paid off. The game performed below average in Usability but then had a 4.28 (out of 5) ranking in First Peek with 36% of users saying they would purchase the game. That’s great!

Step Two: Formal Usability
Building from the rounds of Informal Usability and iterating on design changes, the next phase of user testing is to take a solid and stable Pre-Alpha build into a formal research center to conduct Formal Usability studies.

Objective: Identify authentic consumer experiences with the first 45 minutes of game play. The take-away is a detailed report capturing user rankings with top 10 bullet lists of what is and isn’t working in the game, along with suggested solutions for addressing what isn’t working.

Methodology: Use a professional third-party facilitator to conduct formal tests that are recorded on DVD. Design and development teams are on-site (or patched in via video conference) watching the usability studies in real time. PlayFirst uses XEO Design (www.xeodesign.com) in Oakland, California, for most Formal Usability studies.

Time Period: The Usability Study is one intense day of six to eight individual one-hour study sessions. The entire Formal Usability phase takes about three weeks: Week One is the kickoff, writing the test plan and recruiting users; Week Two is the pilot test (dry run) and the Usability Study; and Week Three is Usability analysis and review.

Best Practices: All key design decision makers on the development team should be present for the Formal Usability study. This is the most critical part of Usability. Spending a day working together and watching real users play and respond to the game live and in person is invaluable. Have an open mind, and expect the unexpected. Regardless of the type of game, there is always something new to discover. Typically the biggest concerns turn out to be non-issues, and the features thought to be most locked down often have the biggest usability issues and require the most redesign work. The morning after the Usability Study, gather everyone together and debrief to thoroughly understand all of the issues before beginning to work on solutions. As a publisher, we have found that experiencing usability testing together with one or more members of the development team is a critical component of maintaining alignment through often difficult decisions surrounding goals, scope, budget, and schedule.

Step Three: Internal “PlayDate”
PlayFirst has a fairly rigorous QA Alpha test cycle requiring a game to be feature-complete with a representation of all functionality and no missing assets before being approved for the Alpha milestone. Once a game hits that milestone it is then ready for an internal testing cycle we call PlayDate.

Objective: Identify issues and red flags with game mechanic, design, and look and feel in preparation for the First Peek release.

Methodology: Employees at PlayFirst play the first hour of the game and fill out a survey with feedback.

Time Period: It is one hour of testing, completed by various people over the course of one or two days. On rare occasions, a second PlayDate is conducted later in development as a way to substantiate design changes.

Best Practice: The people working at PlayFirst come from a large and varied talent pool. We have found over time that the organization as a whole is very good at predicting sales performance of a game once it hits the market. Timing this phase of testing is key to ensure the next phase is a success and yields optimal testing data. What is most important about the PlayDate phase is to obtain tangible feedback that can be acted upon and implemented in preparation for the First Peek phase. For developers that do not have fifty or more employees, you might need to be creative in coming up with a cheap but “clean” testing pool. You could consider some combination of friends, family and your most loyal end users.

Favorite All-time Quote from an Internal PlayDate: “I was starting to enjoy the game, but the headache-inducing clanking sound was so ear-piercing I couldn’t stand to play the game longer than three minutes.” Interesting note: This quote was specific to the PlayFirst title, Mahjong Roadshow. Although we fixed this sound effect, the game never turned a corner. It performed at or below average at each phase and its First Peek ranking is noted below. Unfortunately, the game never performed well once it hit the market. It’s an example of what can happen when you somewhat ignore the data telling you the game will be a miss with your target market.

Step Four: “First Peek”
Nothing is more eye-opening than reading feedback from a thousand real users stating why they hate a feature or why they love the game’s audio, art, or story. Actually, the one thing that is even more telling is seeing the real metrics data capturing how users played the game. It is very interesting to read survey feedback stating one thing and then to review the metrics data indicating the complete opposite. Using PlayFirst’s Playground SDK as the development architecture gives us the ability to easily track this data. As a process, PlayFirst has an analyst who works with the developer to create a metrics dictionary detailing the specific play session information we want to collect. Then, working from the hooks within the SDK framework, the developer is able to code the metrics and easily build a First Peek version of the game source.
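As a rough illustration of what a metrics dictionary and its tabulation might look like, here is a minimal Python sketch. It does not use the Playground SDK or reflect PlayFirst's actual tooling; the event names, the `tabulate` function, and the definition of "stuck" (a level started but never completed) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical metrics dictionary: event name -> what the event records.
METRICS_DICTIONARY = {
    "level_started": "player entered a level",
    "level_completed": "player finished a level",
    "click": "player clicked during a level",
}


def tabulate(events):
    """Summarize (user, event, level) tuples into per-level counts."""
    started = Counter()
    completed = Counter()
    clicks = Counter()
    for user, event, level in events:
        if event not in METRICS_DICTIONARY:
            # Only events agreed on in the dictionary are accepted.
            raise ValueError(f"unknown event: {event}")
        if event == "level_started":
            started[level] += 1
        elif event == "level_completed":
            completed[level] += 1
        else:
            clicks[level] += 1
    summary = {}
    for level in sorted(started):
        summary[level] = {
            "started": started[level],
            "completed": completed[level],
            # Assumption: "stuck" means started but never completed.
            "stuck": started[level] - completed[level],
            "clicks": clicks[level],
        }
    return summary
```

Agreeing on the event names up front, as the dictionary does here, is what lets survey answers be compared against what players actually did.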

Objective: The business models for the casual download space are ever-evolving, but the core model is still focused on the 60-minute trial. For this reason, the main objective of First Peek is to finely tune the game for the 60-minute trial in preparation for the final version of the shippable game.

Methodology: A 60-minute, content-limited build is released to several thousand users in the PlayFirst beta community. Users fill out a survey at the end of the trial, and metrics data is collected and tabulated at the end of the First Peek phase.

Time Period: The First Peek version is made available for one week and during that period users can play the trial and submit feedback. A large bulk of feedback comes within the first few days, which then allows the development team to immediately begin analyzing data and begin considering changes.

Outcome: First Peek is the most telling phase of user testing. Two incredibly useful pieces of information are gleaned: quantitative data that shows how users actually played the game, including data points such as where users got stuck and how many click strokes were made to complete a level; and the qualitative feedback with overall exit survey rankings. The quantitative data is used for level tuning, sometimes level redesign, and game-play balancing. It is also used as a measurement of success, or failure, of specific game features. The survey rankings have great accuracy at predicting a game’s conversion rate once launched to the public. It is a one to five ranking system, with five ranking best. The data has proven that users ranking a five have a high probability of purchasing, and thus the total percentage of fives is a marker for a game’s potential performance. For example, someone may say “I love this game and I can’t wait to buy it,” but then will rank it a three. That user may download the game, but most likely will never purchase it. The chart below puts this ranking phenomenon into context by providing an inside look at how various games have performed in First Peek. Any time over 35% of people rate a game a five (out of five), it’s good. Anything above 40% is really good. On the other hand, anything below 30% isn’t great, and anything below 20% is bad.
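The rule of thumb above can be expressed as a small calculation. The Python sketch below is illustrative only: the function names are hypothetical, and because the article gives no label for the 30% to 35% band, the sketch returns "unclassified" there as an assumption.

```python
def share_of_fives(rankings):
    """Return the percentage of exit-survey rankings that are a five."""
    if not rankings:
        raise ValueError("at least one ranking is required")
    fives = sum(1 for r in rankings if r == 5)
    return 100.0 * fives / len(rankings)


def rate_first_peek(pct_fives):
    """Map the percentage of fives to the article's rough bands."""
    if pct_fives > 40:
        return "really good"
    if pct_fives > 35:
        return "good"
    if pct_fives < 20:
        return "bad"
    if pct_fives < 30:
        return "not great"
    # Assumption: the article does not label the 30% to 35% band.
    return "unclassified"


# Usage: ten hypothetical survey rankings, five of which are fives.
rankings = [5, 5, 3, 4, 5, 2, 5, 1, 5, 3]
print(rate_first_peek(share_of_fives(rankings)))
```

Note that the measure counts only fives; a three or a four contributes nothing, which matches the observation that enthusiastic comments paired with a middling ranking rarely lead to a purchase.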

Best Practice: At PlayFirst, First Peek is a little bit like Groundhog Day in that depending on the outcome it tells how far or how close a game is to hitting a launch date. The critical business decision is to use the data wisely to determine how much more time and resources should be put into a game. If a game has an average ranking and a large number of users identify a specific problem, then we must make a business decision: Will the eventual rate of conversion be sufficiently high to justify the time and cost of “fixing” the problem?
In addition to the potential financial impact of addressing changes after First Peek, it is equally important to make sure a game is ready to go into First Peek. If we release into First Peek a game that we know has a flaw, it means we’ll have 500 to 1,000 users spending time telling us about that flaw. It ends up being a partial waste of time and the data collected is less valuable. Similarly, it’s critical to ensure that specific features we want user feedback on are in the game and functioning properly. It may seem obvious, but we learned this lesson the hard way. (For instance, if you want to get reactions to voiceover dialogue, make sure the audio is actually audible.)

Step Five: Public Beta
After months of development, four tough phases of user testing, and the grueling QA cycles, the game is ready to go live on www.playfirst.com. This is when the game developers sit back, and when the PlayFirst producers, marketing, sales, and PR folks really kick into gear.

Objective: Track sales, watch forum posts, read reviews, pay attention to leader boards, and prepare for Channel Launch.

Methodology: Marketing rolls out the go-to-market launch plan, PR begins building a buzz, press begins reviewing the game, and then the game launches on the PlayFirst site. Teams immediately begin tracking performance.

Time Period: Public Beta continues for the first six weeks after the game launches on PlayFirst, after which the game begins going live on partner sites.

Best Practice: Pay close attention to performance. Watch what users are saying, track customer service reports for any odd issues, and be patient. There is a tendency after a game launches to overreact to what people are saying or to make gross assumptions from early sales reports. There are a few occasions when PlayFirst has made a design change to a game after launch and then re-launched prior to releasing it to the Channel. This is only done when the risk is low and confidence is high that the changes will improve conversion. For cases like these, PlayFirst games on PlayFirst.com have an updater technology built in to facilitate updates post-launch so that the entire consumer base is on the same version regardless of when they downloaded the game.

Conclusion
So is this methodology truly a success formula? Well, after this process kicked into gear in the second quarter of 2007, three of the twelve games PlayFirst published in 2007 won Zeeby awards (Diner Dash: Hometown Hero, Chocolatier, and Dream Chronicles), and five others were huge financial successes. By comparison, several of the games that launched in the beginning of 2007—games that did not go through the full five cycles—did not perform very well. Then 2008 was another breakout year for PlayFirst and continued to yield great success with hits such as Dream Chronicles 2: The Eternal Maze, Cooking Dash, Wedding Dash 2: Rings Around the World, Pet Shop Hop, Doggie Dash, Dairy Dash, Parking Dash, and Nightshift Legacy: The Jaguar’s Eye. We believe that our hit rate would not have been possible without the insights derived from extensive consumer testing and related development iterations. That isn’t to say that everything is done to perfection and that there isn’t any room for improvement, but rather that there’s extraordinary value in robust testing.

Of course, a phased approach to user testing is by no means original to the PlayFirst publishing model. Furthermore, the practice of such an approach will not guarantee a hit game. There are games that suffer from lack of proper focus in the early phases of testing, which results in poor ratings at the later phases of testing, which in turn leads to mediocre sales performance because proper time and resources weren’t applied to making improvements. It is hard to properly conduct the phased approach to user testing, and it takes tremendous collaboration between the design, development, and publishing teams. Making games is fun, but it is painful. Most importantly, it requires humility and a lot of laughter, and a willingness to change.