When I was first learning Mixed Reality development, I found myself unsure how to get started. I’m an experienced game developer comfortable with both Unity and Unreal, but most “Intro to Mixed Reality” articles or videos I found assumed I knew nothing. While it’s important that those intro-level resources exist, I didn’t really have a great jumping-in point.
I want to fix that today. This article provides a brief introduction to mixed reality — what works, what doesn’t work, how to best approach the medium from a design standpoint — for folks coming in from a game design background.
Here’s the big secret: the difference between making games for “normal” screens and for VR or AR has very little to do with programming. The nuts and bolts of using Unity or Unreal to make a 3D game are broadly the same for both. Rendering at high frame rates on stereoscopic displays on smartphone-grade embedded hardware means you need to think a lot about performance, and APIs are different for handling different forms of input, but on a technical level a 3D Unity game is a 3D Unity game is a 3D Unity game.
Building games that truly take advantage of mixed reality is primarily an issue of design rather than engineering, and even more specifically about UX design and inclusive design. Novel forms of input and output all but require rapid UX prototyping beyond what you might be used to with traditional games playtesting, and hardware that can readily cause physical discomfort and illness requires you to always center player comfort in your design process.
Mixed reality uses a plethora of jargon and acronyms, so let’s start by reviewing some terminology.
Virtual reality, or VR, is about creating a new physical environment for you to inhabit, usually by putting on a headset that is closed off from the world around you. VR typically means a stereoscopic headset with head tracking, and often some sort of hand motion controllers. It can be stationary, meaning you sit or stand in place, or it can be room-scale, meaning the tracking system will let you physically walk around an environment.
Augmented reality, or AR, involves, well, augmenting your reality. Traditionally, this means superimposing computer graphics on top of the real world, either wearing a headset (with a see-through display or a pass-through camera view) or using something like a smartphone. I personally view AR as a broader umbrella containing any sort of technology that augments your sense of the real world with technology. Later, I’ll talk about a voice-powered board game and a spatial audio installation piece that I would classify as “AR” despite not involving a headset or a smartphone camera.
Mixed reality is an umbrella term that encompasses the whole spectrum of VR, AR, and everything in between. This could be abbreviated as MR, or also XR, with the “X” as a stand-in for whatever letter you want.
Instead of thinking of different forms of mixed reality as rigid distinct categories, I think it’s useful to think of MR as a spectrum: how much control do you have as a creator over the experience?
In a traditional screen-based 3D game, if you want the player to look at something, you can just move the camera and make the player look at it. If you try that in VR, the player will get motion sick. If you want to control the player’s attention and focus, you no longer have the tool of the camera and you need to use environmental design and psychological design — tools such as lighting, architecture, spatial audio — to convince the player to look at what you want them to look at.
AR is a step further even from that since you no longer have full control over the playspace. Your virtual environment being intermingled with the real world means you can’t control what might be surrounding a player, and have to deal with it regardless. Even more than that, while AR hardware lets you gather a lot of data about the real world around the player using tools like 3D scene mapping or GPS location, that’s never going to give you the whole picture. You’re not just designing for a space you can’t control, but a space you often can’t even detect.
All of these sound like overwhelming constraints, but they’re not! These elements are the strengths of mixed reality design. These are really interesting juicy design constraints that breed creativity. They’re first and foremost user interface and user experience design problems, but they’re ones that are deeply interconnected with game design problems.
Building a VR game is still building a game, with the same sorts of core loops and systems and feedback mechanisms that will be familiar to most any game dev. The way that players physically interact with these systems matters, and it’s important to design game systems that play to the strengths of interaction patterns that are drastically different from a mouse and keyboard or a gamepad.
As a practical example of this: a lot of first-time VR designers will lean into the spectacle of VR. If you build a beautiful environment, there’s often an assumption that players will be entertained simply by the spectacle of being immersed in this 3D scene. This is totally true, especially if you’re building a short 5-minute demo. But if you want to build a sustained game that people will play and love and enjoy over a long period of time, you can’t rely on the novelty of the medium. At the core of truly good VR games is a robust and fun game design that plays to the strengths of the immersive design of VR, but doesn’t rely on them as a gimmick. That’s a very delicate line to walk!
The first affordance of mixed reality I want to talk about is spatial audio. If you’re making traditional 3D games, you’re likely already using spatial audio. But it is meaningfully more important in virtual reality.
In general, audio is underrated for its ability to aid in immersion in a scene, and that’s certainly something that’s more important in VR than screen-based games. But just as important, high-quality positional audio is one of the best tools we have to draw players’ focus. Positional sound effects are a key tool in the toolbox to getting a player to look in a given direction, whatever your mixed reality platform is.
Again, this is usually not a particularly deep technical problem: whatever mixed reality platform you are targeting will likely have its own spatialization system that integrates with Unity or Unreal’s default audio systems with little to no work. This is a question of learning to approach audio design as a core pillar of your creative process. If you’re not already working this way, you may want to reconsider that as you begin work on mixed reality titles. In mixed reality, audio design is absolutely the difference between a good experience and a great experience.
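To make the attention-directing role of positional audio concrete, here’s a toy spatializer sketch in Python. It reduces spatialization to distance attenuation plus left/right panning; real platform spatializers use HRTFs and head tracking and are far more sophisticated, and every name here is illustrative rather than any real engine’s API.

```python
import math

def spatialize(listener_pos, listener_forward, source_pos):
    """Toy spatializer: returns (gain, pan) for a mono source.

    gain falls off with inverse distance; pan runs from -1 (hard left)
    to +1 (hard right) based on the angle to the source.
    Illustrative only -- real MR platforms use HRTF-based spatializers.
    """
    dx = source_pos[0] - listener_pos[0]
    dz = source_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dz)
    gain = 1.0 / max(dist, 1.0)  # clamp so nearby sources don't blow up
    # Angle between the listener's forward vector and the source direction.
    fx, fz = listener_forward
    angle = math.atan2(dx, dz) - math.atan2(fx, fz)
    pan = math.sin(angle)        # sideways component drives the panning
    return gain, pan
```

A sound straight ahead pans to center; a sound to the player’s right pans hard right, which is exactly the cue you lean on to pull their gaze in that direction.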
Most mixed reality, from a VR headset to smartphone AR, is going to have some way to track where the player is looking. If you’re on a smartphone, this will be “where the player is pointing their phone”, whereas with a top-tier AR headset like the HoloLens 2 you may have access to both gaze tracking and head tracking as separate inputs. These are broadly the same in terms of how to design from them. There are still a few principles to keep in mind.
First, this is not just an equivalent for a mouse or analog stick. If you design a twitchy action game where you have to look at things quickly, players are going to get exhausted very fast.
You need to treat gaze tracking as an input that requires explicit confirmation from the player before doing anything. It might be okay in some situations to do something in the game passively in response to where the player is looking, but in general, players aren’t going to notice something that subtle. There just isn’t a built-in expectation that passively looking at something will trigger gameplay, unless that’s something you actively incorporate as part of your core experience.
If you look at HoloLens design documentation, a recommended pattern for object selection is called "gaze and commit". Selecting an object means moving your cursor over to it and holding your gaze for a few seconds while the interface gives you explicit feedback that your gaze is being recognized.
Definitely do that. But also, don’t think of it as a mouse cursor. Patterns like gaze and commit work well for business applications meant to be used for short, focused interactions, and they are great for that. If you design a game like a point-and-click adventure where players interact with objects by looking at them repeatedly over a half-hour or hour-long play session, they will end up nearly as exhausted as in the twitchy action example.
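The dwell logic behind gaze and commit can be sketched in a few lines. This is an illustrative Python state machine, not any real toolkit’s API; frameworks like Microsoft’s Mixed Reality Toolkit ship their own versions of this pattern with built-in feedback visuals.

```python
class DwellSelector:
    """Toy gaze-and-commit: an object is 'committed' only after the
    player's gaze has rested on it for dwell_time seconds.
    Illustrative sketch only, not a real toolkit API."""

    def __init__(self, dwell_time=2.0):
        self.dwell_time = dwell_time
        self.target = None
        self.elapsed = 0.0

    def update(self, gazed_object, dt):
        """Call once per frame with the currently-gazed object (or None).
        Returns the committed object, or None while still dwelling."""
        if gazed_object is not self.target:
            self.target = gazed_object   # gaze moved: restart the timer
            self.elapsed = 0.0
            return None
        if gazed_object is None:
            return None
        self.elapsed += dt
        # self.elapsed / self.dwell_time would drive a progress ring here,
        # giving the player the explicit feedback the pattern requires.
        if self.elapsed >= self.dwell_time:
            self.elapsed = 0.0
            self.target = None           # require a fresh dwell next time
            return gazed_object
        return None
```

The key design detail is that the timer resets the moment the gaze wanders, and the progress feedback makes the system’s interpretation of the player’s gaze visible at all times.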
I tend to think of head and gaze tracking as mostly useful for situations where the player’s hands are busy, but you still need them to select menu items or similar items. To be honest, this is much more useful for enterprise AR applications than games. There are specific games that will very intentionally design a gaze mechanic, or will use head tracking to do things like ask the player to shake or nod their head as a dialog system, but in general I tend to think of head tracking as an input to be avoided by most games.
Hand tracking, on the other hand, is the primary way that players are going to interface with your game most of the time.
If you have an AR headset, you’ll likely have full-on hand tracking powered by computer vision, where the player’s “controller” is just their hands. If you have a VR headset, you may also have access to that sort of hand tracking, but it’s more likely you’ll have some sort of hand controller with both motion tracking and traditional control inputs. These are the bread-and-butter of how people will play your mixed reality games.
You need to design games that recognize and take advantage of the fact that this is not a dual analog controller, even if you’re using hand controllers that do technically have two analog sticks and a handful of buttons.
Most MR interactions work best when you use your hands as actual hands, and do things that you’d typically do in real life with them. Picking up objects and throwing them is a common example of this, and one you see a lot of commercial VR games lean into. This is also a reason shooters tend to do well in VR, despite there being other design considerations that can make first-person shooters a minefield from a motion sickness perspective: most VR controllers have trigger buttons, so asking you to act as if you’re holding a gun is a natural interaction. Conversely, a lot of early motion control games for platforms like the Wii and PlayStation Move felt awkward because asking players to perform precise gestures in the air is an unfamiliar action, and there also isn’t a good way to give players real-time feedback on how an AI-powered gesture recognizer is interpreting their motions.
A point I will keep reiterating is, even though I can’t specifically tell you “this is how your game should use this controller”, your initial instincts will likely be incorrect. You will need to spend time experimenting with what works. And that means putting your game in front of players early and often. This is the only way you can really find out what works and what doesn’t.
Telling you to playtest is probably something you already know as a game designer, but this sort of playtesting is both vitally important and subtly different from normal game playtesting: you’re doing something closer to HCI usability testing than testing your game design itself. When working with non-traditional interfaces and emerging technologies, the only way to prove or disprove your hypotheses about user interfaces is to test those interfaces with actual users, and to get that feedback loop as tight as possible. No matter what hardware you’re using or what you’re designing, user testing is key.
Most mixed reality platforms have some sort of built-in support for voice commands. There’ll often be a platform-specific plugin for Unity or Unreal, which will let you set up utterances to act as a simple trigger system: when the player says X, do Y in-game. Voice controls can be an effective way to provide fluidity to your input system, aid in immersion, and help make your game more accessible to those with motor impairments.
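The “when the player says X, do Y” trigger table is simple enough to sketch. This Python sketch stands in for the platform-specific keyword recognizer; the class and method names here are made up for illustration and don’t correspond to any real plugin’s API.

```python
class VoiceCommands:
    """Toy utterance -> action trigger table. Platform SDKs' keyword
    recognizers fire callbacks in much this shape; names are illustrative."""

    def __init__(self):
        self._handlers = {}

    def register(self, phrase, handler):
        """Map a fixed utterance to a zero-argument game action."""
        self._handlers[phrase.strip().lower()] = handler

    def on_utterance(self, recognized_text):
        """Call with the recognizer's output; returns True if handled."""
        handler = self._handlers.get(recognized_text.strip().lower())
        if handler is None:
            return False   # unrecognized: fall back to other input methods
        handler()
        return True
```

Returning `False` for unrecognized phrases matters: voice should augment your other inputs, so an unmatched utterance silently falls through rather than producing an error.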
You might, however, be tempted to make a game that’s controlled entirely by voice. I’m going to warn you not to do that.
A few years ago, I worked on an Alexa-powered board game called When in Rome. You buy a physical board game (with no electronics, just non-’smart’ cardboard components), provide an Alexa device, and then Alexa acts as a game show host for a travel trivia game.
I spent 12 weeks with a few colleagues prototyping what a hybrid voice-controlled board game could or should be before we settled on this design. If you look at the Alexa app marketplace, trivia is by far the most common game genre. That isn’t something we were consciously aware of when we settled on making a trivia game, but it also isn’t a coincidence. Of all the game genres we prototyped — narrative games, puzzle games, you name it — trivia games were best-suited to the limits of what voice recognition technology could do at the time.
It’s really difficult to take freeform human speech, translate it into something a computer knows what to do with, and then systematize that into a game system. A trivia game works really well here because it has such a rigid, formal conversation flow: the computer asks you a question, you respond with one of a fixed set of possible responses (one of the multiple-choice answers, or a generic fixed input like “can you repeat the question?”), and the computer responds. If the player says something that isn’t recognized as an acceptable input, it’s socially acceptable in this context for the computer to say “hey, I didn’t understand that, can you repeat it?” without breaking the flow of gameplay or frustrating users.
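That rigid conversation flow is easy to express in code. Here’s an illustrative sketch of a single trivia turn in Python, matching against the fixed answer set and a few generic meta-commands, and falling back to a reprompt otherwise; all names and strings are made up for illustration, not taken from When in Rome.

```python
def handle_answer(utterance, choices, meta_commands=("repeat the question",)):
    """Toy trivia turn: match a player's utterance against the fixed set
    of acceptable inputs, or ask them to try again.

    Returns a (kind, payload) pair: ("answer", ...), ("meta", ...),
    or ("reprompt", ...) when nothing matched."""
    text = utterance.strip().lower()
    if text in (c.lower() for c in choices):
        return ("answer", text)
    if text in meta_commands:
        return ("meta", text)
    # Socially acceptable in this flow: just ask the player to repeat.
    return ("reprompt", "Sorry, I didn't catch that -- can you say it again?")
```

Because every turn has an explicit, finite vocabulary, the recognizer only has to distinguish a handful of phrases, which is exactly the problem current speech technology is good at.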
The same thing holds true in mixed reality. Trying to do anything at fast real-time speeds will be frustrating, and if you try to maintain that the computer is an intelligent human being, this will also result in misaligned expectations. In mixed reality contexts, I recommend thinking of voice input as auxiliary UI input rather than a core gameplay input.
A stellar example of this is Elite Dangerous, a popular space-flight VR game. Arguably the optimal way to play Elite Dangerous is with a VR headset and a dedicated HOTAS controller (flight joystick / throttle), but even this complex setup is imperfect. The HOTAS is great for immersion when you’re actively flying your ship, but a large part of the game is navigating complex menus, which is awkward to do with a joystick. Since you’re wearing a headset, it’s even more awkward to try to switch to a mouse or controller when you can’t physically see them in your field of view.
The fan community solves this problem with third-party voice control mods. When you’re looking at a large trading screen with lots of goods you can buy or sell, you just say what you want to do, and the game selects the appropriate menu item for you.
This works well for the exact same reason trivia works so well for Alexa games: navigating these menus is a non-time-sensitive task with an explicit fixed set of acceptable inputs. Just saying what you want is both faster for players to enter than navigating with a mouse/keyboard or controller, and it’s a voice recognition problem that plays to the strengths of current technology. This division of interface labor works really well: you have your custom physical controller for the twitchy action gameplay, and your voice for the slower-paced sections where saying what you want is faster than cursor navigation.
Whether you’re designing for VR or AR, I would encourage you strongly to include voice input, but I’d similarly encourage you to think of it in this secondary menu UI input role. If you’re asking players to navigate menus, or putting them in any sort of low-pressure situation with a lot of options, voice may be a strong choice of input.
The interaction paradigms we’ve talked about so far are applicable to both VR and AR. Let’s look at some AR-specific inputs now. If you have access to what the player sees, by way of a camera and possibly mesh data, how do you interact with the real world? How do you use computer vision to interact with objects in the player’s physical space or change the scene based on where the player is in the world?
This is a tough problem that’s similar to what we discussed with voice recognition. There is a large gap between what theoretically could be, and what is feasible today.
As a first example, some devices like the HoloLens will use a depth camera to give you a 3D mesh of the world. This works well in a sterile environment, and it’s easy to put together compelling tech demos: apply physics to, say, a 3D ball and the world mesh, and the ball will roll across the floor and bump up against objects in the real world.
That’s great, but what happens when this is in a real player’s living room and a cat walks through the scene? From a technical standpoint, many headsets have fast enough mesh detection that they’ll handle that just fine, but what happens from a game design standpoint? How does your game make use of the world mesh, and will it still provide a coherent gameplay experience when unexpected things like that happen? I think this is an eminently solvable problem, but it requires a lot of playtesting in real-world environments with real players to make sure your assumptions about the world mesh aren’t artifacts of your testing environment.
Regardless of whether or not your target hardware will let you do 3D world mapping, you can still look at the raw RGB image coming out of the camera and do cool things with it. But even with out-of-the-box image recognition technology, you’ll quickly run into the limits of what modern machine learning can do in real time. It’s pretty easy to train a model to answer “is there an apple in this image, and if so, where is it?”. You can easily build something like an AR app that reads a hand of physical poker cards and tells the player what to do. But anything more general-purpose than that is pretty hard. Asking “tell me every object in this image” isn’t something that can be done reliably enough in real time to base an interesting game on.
As a game designer, if you’re interested in using CV, you probably want to think about it from a standpoint of “what is the specific object we want to track?” and “how do we design a mechanic around having a specific object?”.
A common way to do that is to use a fiducial marker, a known unique object that can be recognized and used as a stand-in to tell your game where to render a specific 3D object. You see this fairly often with AR experiences that offer paper or cards with printed QR codes, and as soon as you look at that QR code through your AR glasses it becomes a different 3D-rendered object. As you physically move the QR code through the real world the object moves too.
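The core of marker-based placement is just a transform: the virtual object’s pose is expressed relative to the marker, so the object rides along as the marker moves. Real AR SDKs hand you a full 6-DoF pose matrix per frame; this Python sketch shows the idea in 2D, and all names are illustrative.

```python
import math

def object_transform(marker_pos, marker_yaw, local_offset):
    """Toy fiducial anchoring: place a virtual object relative to a
    tracked marker's pose.

    marker_pos is (x, z) in world space, marker_yaw is in radians,
    local_offset is (x, z) in the marker's own frame. Real SDKs give
    you a full 6-DoF matrix; this 2D version just shows the idea."""
    c, s = math.cos(marker_yaw), math.sin(marker_yaw)
    ox, oz = local_offset
    # Rotate the offset into world space, then translate by the marker.
    wx = marker_pos[0] + c * ox - s * oz
    wz = marker_pos[1] + s * ox + c * oz
    return (wx, wz)
```

Re-evaluating this every frame from the freshly detected marker pose is what makes the rendered object appear glued to the physical card as the player moves it around.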
That pattern works well, but it also has limited use. Providing a physical object or asking players to print out a piece of paper can be a bit clunky if you’re not already designing an experience that involves a physical object you can use as a fiducial marker.
Beyond that, think hard about what you can do with technology today rather than what might be possible in 5-10 years with advances to technology.
Geolocation is very similar. The appeal of location-based games is that controlling the game based on where you are in the world is really powerful; the hard part is making that relationship legible and understandable. Any complex system in your game that is not legible to players is indistinguishable from randomness, and if you’re going to the effort of integrating a complex GPS-based system into your game, it’s important to design it so players will actually appreciate it.
A number of years ago, I built a site-specific generative poetry walk for a park in San Francisco. You go to Fort Mason Park, open an app on your phone, put your headphones on, and place your phone in your pocket. It’s augmented reality in the sense that we’re augmenting your reality with 3D audio, but there are no 3D graphics. As you walk around the park, a neural network makes up and reads you poems based on where you are, using your phone’s location services. Walk by the waterfront and hear poems of the sea but walk by the old cannons and you’ll hear poems of war.
The poetry walk app gives a sense of place, the sense that the experience is reacting to you and where you choose to go, and the path you take does feel like a legitimately meaningful choice. But it took a long time to get there.
GPS is usually only accurate within tens of meters, and that can degrade to hundreds of meters in the middle of a dense city. Figuring out the right level of precision to base data on was tricky. I experimented with changing the quality of the poetry based on how quickly you were moving, or changing the poetry based on the current weather, so you’d get different types of poems if it was raining or sunny or cloudy out. As I went into the park to walk around and test the system, I discovered that really wasn’t legible. Players didn’t understand the relationship between the sensor data I was using and the poetry that came out.
I ended up striking a balance with a simple solution. The park was divided into several areas, so you’d hear different types of poems while in the big grassy area, near the big statue, in the community garden. This mapping of “I’m roughly in this part of the park” ended up being easily understood by players, especially with the use of 3D audio soundscapes that explicitly changed as you moved across area boundaries (e.g. you’d hear the distant noise of kids playing while in the big grassy area, and lots of loud birds and animals while in the garden).
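That “roughly in this part of the park” mapping is deliberately coarse, and it’s easy to sketch. The zone names and coordinates below are made up for illustration; the real point is that a handful of big regions is about the right fidelity for tens-of-meters GPS error.

```python
# Illustrative zone lookup in the spirit of the poetry-walk design:
# coarse named areas instead of raw coordinates. Zones are axis-aligned
# (lat_min, lat_max, lon_min, lon_max) boxes; names and bounds are
# invented for this sketch, not the app's real data.
ZONES = {
    "meadow":  (37.8060, 37.8070, -122.4320, -122.4300),
    "garden":  (37.8050, 37.8060, -122.4300, -122.4285),
    "seawall": (37.8070, 37.8080, -122.4330, -122.4310),
}

def current_zone(lat, lon):
    """Map a GPS fix to a named area, or None if outside every zone."""
    for name, (lat_min, lat_max, lon_min, lon_max) in ZONES.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None
```

Driving both the poem selection and the ambient soundscape from the same zone lookup is what made the system legible: the player hears the boundary the moment they cross it.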
This is similar to what we’ve discussed with voice recognition or hand tracking. We’re primarily grappling with imperfect input technologies that players aren’t used to yet. Our primary jobs as designers are to (a) find the right level of input fidelity that will give us good data at a technical level, and (b) communicate those explicit expectations to the player and give adequate real-time feedback.
Whatever mixed reality input you’re talking about, it’s really the same: come up with a hypothesis for an input mechanism that can be communicated and enables meaningful design choices. You test that hypothesis by building a prototype, putting it in front of players, and trying again once it doesn’t work.
The most important element is getting the core design fundamentals right, a process that’s grounded in active experimentation and playtesting. Technical understanding about how to implement specific tools is far less important than a conceptual understanding of the strengths and weaknesses of those tools today.
There are two high-level ways you can design a game: top-down, where you start from a creative vision or design brief and then find the technology to serve it, and bottom-up, where you start from the affordances and constraints of your technology and let the design grow out of them.
Both ways can produce great games, but when working with mixed reality or any sort of emerging interface technology, a bottom-up approach is likely to yield better results. You’re dealing with fiddly technologies that often don’t behave the way you expect them to when imagining the science-fiction ideal of that interface, and you ultimately need to design for the hardware you have rather than the hardware you wish existed.
I’ve spent most of the last decade building games for novel interfaces and emerging technologies, and in this specific design space, a top-down approach is more likely to yield something that fits a marketing team or creative director’s design brief but ultimately feels gimmicky and shallow.
When playtesting, one thing that is unique to MR (even compared to other novel platforms) is that you need to prioritize player comfort at all times. That covers everything from whether the way you move through the world in a VR game causes motion sickness, to whether hand controls will make a player tired, to raw performance.
If your screen-based game consistently drops below 30fps, you might get angry players and a low Metacritic score. If your VR game drops below 72 or 90fps or whatever you’re targeting, players will get physically ill. This matters a lot more in mixed reality, and you really need to care about making something that respects players’ comfort. You also need to account for accessibility needs. A player with a motor control impairment can buy an Xbox Adaptive Controller and build a setup that works to let them play console games; mixed reality games that have bespoke hand gesture input currently instead need to handle concerns like that at a per-game level.
This again underscores that mixed reality design is UX design. The fundamentals of designing a game are the same across MR and flat screens, but what makes MR design special is this extra layer of designing an interface that delights and supports all players.