Game Engine Architecture


Jason Gregory

A K Peters, Ltd., Wellesley, Massachusetts
A K Peters/CRC Press is an imprint of Taylor & Francis Group, an Informa business
© 2009 by Taylor and Francis Group, LLC
International Standard Book Number-13: 978-1-4398-6526-2 (Ebook-PDF)

Dedicated to Trina, Evan and Quinn Gregory, in memory of our heroes, Joyce Osterhus and Kenneth Gregory.

Contents

Foreword
Preface

I Foundations

1 Introduction
1.1 Structure of a Typical Game Team
1.2 What Is a Game?
1.3 What Is a Game Engine?
1.4 Engine Differences Across Genres
1.5 Game Engine Survey
1.6 Runtime Engine Architecture
1.7 Tools and the Asset Pipeline

2 Tools of the Trade
2.1 Version Control
2.2 Microsoft Visual Studio
2.3 Profiling Tools
2.4 Memory Leak and Corruption Detection
2.5 Other Tools

3 Fundamentals of Software Engineering for Games
3.1 C++ Review and Best Practices
3.2 Data, Code, and Memory in C/C++
3.3 Catching and Handling Errors

4 3D Math for Games
4.1 Solving 3D Problems in 2D
4.2 Points and Vectors
4.3 Matrices
4.4 Quaternions
4.5 Comparison of Rotational Representations
4.6 Other Useful Mathematical Objects
4.7 Hardware-Accelerated SIMD Math
4.8 Random Number Generation

II Low-Level Engine Systems

5 Engine Support Systems
5.1 Subsystem Start-Up and Shut-Down
5.2 Memory Management
5.3 Containers
5.4 Strings
5.5 Engine Configuration

6 Resources and the File System
6.1 File System
6.2 The Resource Manager

7 The Game Loop and Real-Time Simulation
7.1 The Rendering Loop
7.2 The Game Loop
7.3 Game Loop Architectural Styles
7.4 Abstract Timelines
7.5 Measuring and Dealing with Time
7.6 Multiprocessor Game Loops
7.7 Networked Multiplayer Game Loops

8 Human Interface Devices (HID)
8.1 Types of Human Interface Devices
8.2 Interfacing with a HID
8.3 Types of Inputs
8.4 Types of Outputs
8.5 Game Engine HID Systems
8.6 Human Interface Devices in Practice

9 Tools for Debugging and Development
9.1 Logging and Tracing
9.2 Debug Drawing Facilities
9.3 In-Game Menus
9.4 In-Game Console
9.5 Debug Cameras and Pausing the Game
9.6 Cheats
9.7 Screen Shots and Movie Capture
9.8 In-Game Profiling

III Graphics and Motion

10 The Rendering Engine
10.1 Foundations of Depth-Buffered Triangle Rasterization
10.2 The Rendering Pipeline
10.3 Advanced Lighting and Global Illumination
10.4 Visual Effects and Overlays

11 Animation Systems
11.1 Types of Character Animation
11.2 Skeletons
11.3 Poses
11.4 Clips
11.5 Skinning and Matrix Palette Generation
11.6 Animation Blending
11.7 Post-Processing
11.8 Compression Techniques
11.9 Animation System Architecture
11.10 The Animation Pipeline
11.11 Action State Machines
11.12 Animation Controllers

12 Collision and Rigid Body Dynamics
12.1 Do You Want Physics in Your Game?
12.2 Collision/Physics Middleware
12.3 The Collision Detection System
12.4 Rigid Body Dynamics
12.5 Integrating a Physics Engine into Your Game
12.6 A Look Ahead: Advanced Physics Features

IV Gameplay

13 Introduction to Gameplay Systems
13.1 Anatomy of a Game World
13.2 Implementing Dynamic Elements: Game Objects
13.3 Data-Driven Game Engines
13.4 The Game World Editor

14 Runtime Gameplay Foundation Systems
14.1 Components of the Gameplay Foundation System
14.2 Runtime Object Model Architectures
14.3 World Chunk Data Formats
14.4 Loading and Streaming Game Worlds
14.5 Object References and World Queries
14.6 Updating Game Objects in Real Time
14.7 Events and Message-Passing
14.8 Scripting
14.9 High-Level Game Flow

V Conclusion

15 You Mean There's More?
15.1 Some Engine Systems We Didn't Cover
15.2 Gameplay Systems

References
Index

Foreword

The very first video game was built entirely out of hardware, but rapid advancements in microprocessors have changed all that. These days, video games are played on versatile PCs and specialized video game consoles that use software to make it possible to offer a tremendous variety of gaming experiences. It's been 50 years since those first primitive games, but the industry is still considered by many to be immature.

It may be young, but when you take a closer look, you will find that things have been developing rapidly. Video games are now a multibillion-dollar industry covering a wide range of demographics. Video games come in all shapes and sizes, falling into categories or "genres" covering everything from solitaire to massively multiplayer online role-playing games, and these games are played on virtually anything with a microchip in it. These days, you can get games for your PC, your cell phone, as well as a number of different specialized gaming consoles—both handheld and those that connect to your home TV. These specialized home consoles tend to represent the cutting edge of gaming technology, and the pattern of these platforms being released in cycles has come to be called console "generations." The powerhouses of this latest generation are Microsoft's Xbox 360 and Sony's PLAYSTATION 3, but the ever-present PC should never be overlooked, and the extremely popular Nintendo Wii represents something new this time around.

The recent explosion of downloadable and casual games has added even more complexity to the diverse world of commercial video games. Even so, big games are still big business. The incredible computing power available on today's complicated platforms has made room for increased complexity in the software. Naturally, all this advanced software has to be created by someone, and that has driven up the size of development teams—not to mention development costs. As the industry matures, we're always looking for better, more efficient ways to build our products, and development teams have begun compensating for the increased complexity by taking advantage of things like reusable software and middleware.

With so many different styles of game on such a wide array of platforms, there cannot be any single ideal software solution. However, certain patterns have developed, and there is a vast menu of potential solutions out there. The problem today is choosing the right solution to fit the needs of the particular project. Going deeper, a development team must consider all the different aspects of a project and how they fit together. It is rare to find any one software package that perfectly suits every aspect of a new game design.

Those of us who are now veterans of the industry found ourselves pioneering unknown territory. Few programmers of our generation have Computer Science degrees (Matt's is in Aeronautical Engineering, and Jason's is in Systems Design Engineering), but these days many colleges are starting to offer programs and degrees in video games. The students and developers of today need a good place to turn to for solid game-development information. For pure high-end graphics, there are a lot of sources of very good information, from research to practical jewels of knowledge.
However, these sources are often not directly applicable to production game environments or suffer from not having actual production-quality implementations. For the rest of game development, there are so-called beginner books that so gloss over the details and act as if they invented everything without giving references that they are just not useful or often even accurate. Then there are high-end specialty books for various niches like physics, collision, AI, etc. But these can be needlessly obtuse or too high level to be understood by all, or the piecemeal approach just doesn't all fit together. Many are even so directly tied to a particular piece of technology as to become rapidly dated as the hardware and software change.

Then there is the Internet, which is an excellent supplementary tool for knowledge gathering. However, broken links, widely inaccurate data, and variable-to-poor quality often make it not useful at all unless you know exactly what you are after.

Enter Jason Gregory, himself an industry veteran with experience at Naughty Dog—one of the most highly regarded video game studios in the world. While teaching a course in game programming at USC, Jason found himself facing a shortage of textbooks covering the fundamentals of video-game architecture. Luckily for the rest of us, he has taken it upon himself to fill that gap.

What Jason has done is pull together production-quality knowledge actually used in shipped game projects and bring together the entire game-development picture. His experience has allowed him to bring together not only the ideas and techniques but also actual code samples and implementation examples to show you how the pieces come together to actually make a game. The references and citations make it a great jumping-off point to dig deeper into any particular aspect of the process. The concepts and techniques are the actual ones we use to create games, and while the examples are often grounded in a technology, they extend way beyond any particular engine or API.

This is the kind of book we wanted when we were getting started, and we think it will prove very instructive to people just starting out as well as those with experience who would like some exposure to the larger context.

Jeff Lander
Matthew Whiting

Preface

Welcome to Game Engine Architecture. This book aims to present a complete discussion of the major components that make up a typical commercial game engine. Game programming is an immense topic, so we have a lot of ground to cover. Nevertheless, I trust you'll find that the depth of our discussions is sufficient to give you a solid understanding of both the theory and the common practices employed within each of the engineering disciplines we'll cover. That said, this book is really just the beginning of a fascinating and potentially life-long journey. A wealth of information is available on all aspects of game technology, and this text serves both as a foundation-laying device and as a jumping-off point for further learning.

Our focus in this book will be on game engine technologies and architecture. This means we'll cover both the theory underlying the various subsystems that comprise a commercial game engine and also the data structures, algorithms, and software interfaces that are typically used to implement them. The line between the game engine and the game is rather blurry.
We'll focus primarily on the engine itself, including a host of low-level foundation systems, the rendering engine, the collision system, the physics simulation, character animation, and an in-depth discussion of what I call the gameplay foundation layer. This layer includes the game's object model, world editor, event system, and scripting system. We'll also touch on some aspects of gameplay programming, including player mechanics, cameras, and AI. However, by necessity, the scope of these discussions will be limited mainly to the ways in which gameplay systems interface with the engine.

This book is intended to be used as a course text for a two- or three-course college-level series in intermediate game programming. Of course, it can also be used by amateur software engineers, hobbyists, self-taught game programmers, and existing members of the game industry alike. Junior engineers can use this text to solidify their understanding of game mathematics, engine architecture, and game technology. And some senior engineers who have devoted their careers to one particular specialty may benefit from the bigger picture presented in these pages, as well.

To get the most out of this book, you should have a working knowledge of basic object-oriented programming concepts and at least some experience programming in C++. Although a host of new and exciting languages are beginning to take hold within the game industry, industrial-strength 3D game engines are still written primarily in C or C++, and any serious game programmer needs to know C++. We'll review the basic tenets of object-oriented programming in Chapter 3, and you will no doubt pick up a few new C++ tricks as you read this book, but a solid foundation in the C++ language is best obtained from [39], [31], and [32]. If your C++ is a bit rusty, I recommend you refer to these or similar books to refresh your knowledge as you read this text. If you have no prior C++ experience, you may want to consider reading at least the first few chapters of [39], or working through a few C++ tutorials online, before diving into this book.

The best way to learn computer programming of any kind is to actually write some code. As you read through this book, I strongly encourage you to select a few topic areas that are of particular interest to you and come up with some projects for yourself in those areas. For example, if you find character animation interesting, you could start by installing Ogre3D and exploring its skinned animation demo. Then you could try to implement some of the animation blending techniques described in this book, using Ogre. Next you might decide to implement a simple joypad-controlled animated character that can run around on a flat plane. Once you have something relatively simple working, expand upon it! Then move on to another area of game technology. Rinse and repeat. It doesn't particularly matter what the projects are, as long as you're practicing the art of game programming, not just reading about it.

Game technology is a living, breathing thing that can never be entirely captured within the pages of a book. As such, additional resources, errata, updates, sample code, and project ideas will be posted from time to time on this book's website at http://gameenginebook.com.

Acknowledgments

No book is created in a vacuum, and this one is certainly no exception.
This book would not have been possible without the help of my family, friends, and colleagues in the game industry, and I'd like to extend warm thanks to everyone who helped me to bring this project to fruition.

Of course, the ones most impacted by a project like this one are invariably the author's family. So I'd like to start by offering a special thank-you to my wife Trina, who has been a pillar of strength during this difficult time, taking care of our two boys Evan (age 5) and Quinn (age 3) day after day (and night after night!) while I holed myself up to get yet another chapter under my belt, forgoing her own plans to accommodate my schedule, doing my chores as well as her own (more often than I'd like to admit), and always giving me kind words of encouragement when I needed them the most. I'd also like to thank my eldest son Evan for being patient as he endured the absence of his favorite video game playing partner, and his younger brother Quinn for always welcoming me home after a long day's work with huge hugs and endless smiles.

I would also like to extend special thanks to my editors, Matt Whiting and Jeff Lander. Their insightful, targeted, and timely feedback was always right on the money, and their vast experience in the game industry has helped to give me confidence that the information presented in these pages is as accurate and up-to-date as humanly possible. Matt and Jeff were both a pleasure to work with, and I am honored to have had the opportunity to collaborate with such consummate professionals on this project. I'd like to thank Jeff in particular for putting me in touch with Alice Peters and helping me to get this project off the ground in the first place.

A number of my colleagues at Naughty Dog also contributed to this book, either by providing feedback or by helping me with the structure and topic content of one of the chapters. I'd like to thank Marshall Robin and Carlos Gonzalez-Ochoa for their guidance and tutelage as I wrote the rendering chapter, and Pål-Kristian Engstad for his excellent and insightful feedback on the text and content of that chapter. I'd also like to thank Christian Gyrling for his feedback on various sections of the book, including the chapter on animation (which is one of his many specialties). My thanks also go to the entire Naughty Dog engineering team for creating all of the incredible game engine systems that I highlight in this book. Special thanks go to Keith Schaeffer of Electronic Arts for providing me with much of the raw content regarding the impact of physics on a game, found in Section 12.1. I'd also like to thank Paul Keet of Electronic Arts and Steve Ranck, the lead engineer on the Hydro Thunder project at Midway San Diego, for their mentorship and guidance over the years. While they did not contribute to the book directly, their influences are echoed on virtually every page in one way or another.

This book arose out of the notes I developed for a course called ITP-485: Programming Game Engines, which I have been teaching under the auspices of the Information Technology Program at the University of Southern California for approximately three years now. I would like to thank Dr. Anthony Borquez, the director of the ITP department at the time, for hiring me to develop the ITP-485 course curriculum in the first place. I'd also like to extend warm thanks to Ashish Soni, the current ITP director, for his continued support and encouragement as ITP-485 continues to evolve.
My extended family and friends also deserve thanks, in part for their unwavering encouragement, and in part for entertaining my wife and our two boys on so many occasions while I was working. I'd like to thank my sister- and brother-in-law, Tracy Lee and Doug Provins, my cousin-in-law Matt Glenn, and all of our incredible friends, including: Kim and Drew Clark, Sherilyn and Jim Kritzer, Anne and Michael Scherer, and Kim and Mike Warner. My father Kenneth Gregory wrote a book on investing in the stock market when I was a teenager, and in doing so he inspired me to write a book. For this and so much more, I am eternally grateful to him. I'd also like to thank my mother Erica Gregory, in part for her insistence that I embark on this project, and in part for spending countless hours with me when I was a child, beating the art of writing into my cranium—I owe my writing skills (not to mention my work ethic… and my rather twisted sense of humor…) entirely to her!

Last but certainly not least, I'd like to thank Alice Peters and Kevin Jackson-Mead, as well as the entire A K Peters staff, for their Herculean efforts in publishing this book. Alice and Kevin have both been a pleasure to work with, and I truly appreciate both their willingness to bend over backwards to get this book out the door under very tight time constraints, and their infinite patience with me as a new author.

Jason Gregory
April 2009

I Foundations

1 Introduction

When I got my first game console in 1979—a way-cool Intellivision system by Mattel—the term "game engine" did not exist. Back then, video and arcade games were considered by most adults to be nothing more than toys, and the software that made them tick was highly specialized to both the game in question and the hardware on which it ran. Today, games are a multi-billion-dollar mainstream industry rivaling Hollywood in size and popularity. And the software that drives these now-ubiquitous three-dimensional worlds—game engines like id Software's Quake and Doom engines, Epic Games' Unreal Engine 3, and Valve's Source engine—have become fully featured, reusable software development kits that can be licensed and used to build almost any game imaginable.

While game engines vary widely in the details of their architecture and implementation, recognizable coarse-grained patterns are emerging across both publicly licensed game engines and their proprietary in-house counterparts. Virtually all game engines contain a familiar set of core components, including the rendering engine, the collision and physics engine, the animation system, the audio system, the game world object model, the artificial intelligence system, and so on. Within each of these components, a relatively small number of semi-standard design alternatives are also beginning to emerge.

There are a great many books that cover individual game engine subsystems, such as three-dimensional graphics, in exhaustive detail. Other books cobble together valuable tips and tricks across a wide variety of game technology areas. However, I have been unable to find a book that provides its reader with a reasonably complete picture of the entire gamut of components that make up a modern game engine. The goal of this book, then, is to take the reader on a guided hands-on tour of the vast and complex landscape of game engine architecture.
In this book you will learn

• how real industrial-strength production game engines are architected;
• how game development teams are organized and work in the real world;
• which major subsystems and design patterns appear again and again in virtually every game engine;
• the typical requirements for each major subsystem;
• which subsystems are genre- or game-agnostic, and which ones are typically designed explicitly for a specific genre or game;
• where the engine normally ends and the game begins.

We'll also get a first-hand glimpse into the inner workings of some popular game engines, such as Quake and Unreal, and some well-known middleware packages, such as the Havok Physics library, the OGRE rendering engine, and Rad Game Tools' Granny 3D animation and geometry management toolkit.

Before we get started, we'll review some techniques and tools for large-scale software engineering in a game engine context, including

• the difference between logical and physical software architecture;
• configuration management, revision control, and build systems;
• some tips and tricks for dealing with one of the common development environments for C and C++, Microsoft Visual Studio.

In this book I assume that you have a solid understanding of C++ (the language of choice among most modern game developers) and that you understand basic software engineering principles. I also assume you have some exposure to linear algebra, three-dimensional vector and matrix math, and trigonometry (although we'll review the core concepts in Chapter 4). Ideally you should have some prior exposure to the basic concepts of real-time and event-driven programming. But never fear—I will review these topics briefly, and I'll also point you in the right direction if you feel you need to hone your skills further before we embark.

1.1. Structure of a Typical Game Team

Before we delve into the structure of a typical game engine, let's first take a brief look at the structure of a typical game development team. Game studios are usually composed of five basic disciplines: engineers, artists, game designers, producers, and other management and support staff (marketing, legal, information technology/technical support, administrative, etc.). Each discipline can be divided into various subdisciplines. We'll take a brief look at each below.

1.1.1. Engineers

The engineers design and implement the software that makes the game, and the tools, work. Engineers are often categorized into two basic groups: runtime programmers (who work on the engine and the game itself) and tools programmers (who work on the off-line tools that allow the rest of the development team to work effectively). On both sides of the runtime/tools line, engineers have various specialties. Some engineers focus their careers on a single engine system, such as rendering, artificial intelligence, audio, or collision and physics. Some focus on gameplay programming and scripting, while others prefer to work at the systems level and not get too involved in how the game actually plays. Some engineers are generalists—jacks of all trades who can jump around and tackle whatever problems might arise during development.

Senior engineers are sometimes asked to take on a technical leadership role.
Lead engineers usually still design and write code, but they also help to manage the team's schedule, make decisions regarding the overall technical direction of the project, and sometimes also directly manage people from a human resources perspective.

Some companies also have one or more technical directors (TD), whose job it is to oversee one or more projects from a high level, ensuring that the teams are aware of potential technical challenges, upcoming industry developments, new technologies, and so on. The highest engineering-related position at a game studio is the chief technical officer (CTO), if the studio has one. The CTO's job is to serve as a sort of technical director for the entire studio, as well as serving a key executive role in the company.

1.1.2. Artists

As we say in the game industry, "content is king." The artists produce all of the visual and audio content in the game, and the quality of their work can literally make or break a game. Artists come in all sorts of flavors:

• Concept artists produce sketches and paintings that provide the team with a vision of what the final game will look like. They start their work early in the concept phase of development, but usually continue to provide visual direction throughout a project's life cycle. It is common for screen shots taken from a shipping game to bear an uncanny resemblance to the concept art.
• 3D modelers produce the three-dimensional geometry for everything in the virtual game world. This discipline is typically divided into two subdisciplines: foreground modelers and background modelers. The former create objects, characters, vehicles, weapons, and the other objects that populate the game world, while the latter build the world's static background geometry (terrain, buildings, bridges, etc.).
• Texture artists create the two-dimensional images known as textures, which are applied to the surfaces of 3D models in order to provide detail and realism.
• Lighting artists lay out all of the light sources in the game world, both static and dynamic, and work with color, intensity, and light direction to maximize the artfulness and emotional impact of each scene.
• Animators imbue the characters and objects in the game with motion. The animators serve quite literally as actors in a game production, just as they do in a CG film production. However, a game animator must have a unique set of skills in order to produce animations that mesh seamlessly with the technological underpinnings of the game engine.
• Motion capture actors are often used to provide a rough set of motion data, which are then cleaned up and tweaked by the animators before being integrated into the game.
• Sound designers work closely with the engineers in order to produce and mix the sound effects and music in the game.
• Voice actors provide the voices of the characters in many games.
• Many games have one or more composers, who compose an original score for the game.

As with engineers, senior artists are often called upon to be team leaders. Some game teams have one or more art directors—very senior artists who manage the look of the entire game and ensure consistency across the work of all team members.

1.1.3. Game Designers

The game designers' job is to design the interactive portion of the player's experience, typically known as gameplay. Different kinds of designers work at different levels of detail.
Some (usually senior) game designers work at the macro level, determining the story arc, the overall sequence of chapters or levels, and the high-level goals and objectives of the player. Other designers work on individual levels or geographical areas within the virtual game world, laying out the static background geometry, determining where and when enemies will emerge, placing supplies like weapons and health packs, designing puzzle elements, and so on. Still other designers operate at a highly technical level, working closely with gameplay engineers and/or writing code (often in a high-level scripting language). Some game designers are ex-engineers, who decided they wanted to play a more active role in determining how the game will play.

Some game teams employ one or more writers. A game writer's job can range from collaborating with the senior game designers to construct the story arc of the entire game, to writing individual lines of dialogue.

As with other disciplines, some senior designers play management roles. Many game teams have a game director, whose job it is to oversee all aspects of a game's design, help manage schedules, and ensure that the work of individual designers is consistent across the entire product. Senior designers also sometimes evolve into producers.

1.1.4. Producers

The role of producer is defined differently by different studios. In some game companies, the producer's job is to manage the schedule and serve as a human resources manager. In other companies, producers serve in a senior game design capacity. Still other studios ask their producers to serve as liaisons between the development team and the business unit of the company (finance, legal, marketing, etc.). Some smaller studios don't have producers at all. For example, at Naughty Dog, literally everyone in the company, including the two co-presidents, plays a direct role in constructing the game; team management and business duties are shared between the senior members of the studio.

1.1.5. Other Staff

The team of people who directly construct the game is typically supported by a crucial team of support staff. This includes the studio's executive management team, the marketing department (or a team that liaises with an external marketing group), administrative staff, and the IT department, whose job is to purchase, install, and configure hardware and software for the team and to provide technical support.

1.1.6. Publishers and Studios

The marketing, manufacture, and distribution of a game title are usually handled by a publisher, not by the game studio itself. A publisher is typically a large corporation, like Electronic Arts, THQ, Vivendi, Sony, Nintendo, etc. Many game studios are not affiliated with a particular publisher. They sell each game that they produce to whichever publisher strikes the best deal with them. Other studios work exclusively with a single publisher, either via a long-term publishing contract, or as a fully owned subsidiary of the publishing company. For example, THQ's game studios are independently managed, but they are owned and ultimately controlled by THQ. Electronic Arts takes this relationship one step further, by directly managing its studios. First-party developers are game studios owned directly by the console manufacturers (Sony, Nintendo, and Microsoft). For example, Naughty Dog is a first-party Sony developer.
These studios produce games exclusively for the gaming hardware manufactured by their parent company.

1.2. What Is a Game?

We probably all have a pretty good intuitive notion of what a game is. The general term "game" encompasses board games like chess and Monopoly, card games like poker and blackjack, casino games like roulette and slot machines, military war games, computer games, various kinds of play among children, and the list goes on. In academia we sometimes speak of "game theory," in which multiple agents select strategies and tactics in order to maximize their gains within the framework of a well-defined set of game rules. When used in the context of console or computer-based entertainment, the word "game" usually conjures images of a three-dimensional virtual world featuring a humanoid, animal, or vehicle as the main character under player control. (Or for the old geezers among us, perhaps it brings to mind images of two-dimensional classics like Pong, Pac-Man, or Donkey Kong.) In his excellent book, A Theory of Fun for Game Design, Raph Koster defines a "game" to be an interactive experience that provides the player with an increasingly challenging sequence of patterns which he or she learns and eventually masters [26]. Koster's assertion is that the activities of learning and mastering are at the heart of what we call "fun," just as a joke becomes funny at the moment we "get it" by recognizing the pattern.

For the purposes of this book, we'll focus on the subset of games that comprise two- and three-dimensional virtual worlds with a small number of players (between one and 16 or thereabouts). Much of what we'll learn can also be applied to Flash games on the Internet, pure puzzle games like Tetris, or massively multiplayer online games (MMOG). But our primary focus will be on game engines capable of producing first-person shooters, third-person action/platform games, racing games, fighting games, and the like.

1.2.1. Video Games as Soft Real-Time Simulations

Most two- and three-dimensional video games are examples of what computer scientists would call soft real-time interactive agent-based computer simulations. Let's break this phrase down in order to better understand what it means.

In most video games, some subset of the real world—or an imaginary world—is modeled mathematically so that it can be manipulated by a computer. The model is an approximation to and a simplification of reality (even if it's an imaginary reality), because it is clearly impractical to include every detail down to the level of atoms or quarks. Hence, the mathematical model is a simulation of the real or imagined game world. Approximation and simplification are two of the game developer's most powerful tools. When used skillfully, even a greatly simplified model can sometimes be almost indistinguishable from reality—and a lot more fun.

An agent-based simulation is one in which a number of distinct entities known as "agents" interact. This fits the description of most three-dimensional computer games very well, where the agents are vehicles, characters, fireballs, power dots, and so on. Given the agent-based nature of most games, it should come as no surprise that most games nowadays are implemented in an object-oriented, or at least loosely object-based, programming language.
All interactive video games are temporal simulations, meaning that the virtual game world model is dynamic—the state of the game world changes over time as the game's events and story unfold. A video game must also respond to unpredictable inputs from its human player(s)—thus interactive temporal simulations. Finally, most video games present their stories and respond to player input in real time, making them interactive real-time simulations. One notable exception is in the category of turn-based games like computerized chess or non-real-time strategy games. But even these types of games usually provide the user with some form of real-time graphical user interface. So for the purposes of this book, we'll assume that all video games have at least some real-time constraints.

At the core of every real-time system is the concept of a deadline. An obvious example in video games is the requirement that the screen be updated at least 24 times per second in order to provide the illusion of motion. (Most games render the screen at 30 or 60 frames per second because these are multiples of an NTSC monitor's refresh rate.) Of course, there are many other kinds of deadlines in video games as well. A physics simulation may need to be updated 120 times per second in order to remain stable. A character's artificial intelligence system may need to "think" at least once every second to prevent the appearance of stupidity. The audio library may need to be called at least once every 1/60 second in order to keep the audio buffers filled and prevent audible glitches.

A "soft" real-time system is one in which missed deadlines are not catastrophic. Hence all video games are soft real-time systems—if the frame rate dies, the human player generally doesn't! Contrast this with a hard real-time system, in which a missed deadline could mean severe injury to or even the death of a human operator. The avionics system in a helicopter or the control-rod system in a nuclear power plant are examples of hard real-time systems.

Mathematical models can be analytic or numerical. For example, the analytic (closed-form) mathematical model of a rigid body falling under the influence of constant acceleration due to gravity is typically written as follows:

y(t) = ½gt² + v₀t + y₀.  (1.1)

An analytic model can be evaluated for any value of its independent variables, such as the time t in the above equation, given only the initial conditions v₀ and y₀ and the constant g. Such models are very convenient when they can be found. However, many problems in mathematics have no closed-form solution. And in video games, where the user's input is unpredictable, we cannot hope to model the entire game analytically. A numerical model of the same rigid body under gravity might be

y(t + Δt) = F(y(t), ẏ(t), ÿ(t), …).  (1.2)

That is, the height of the rigid body at some future time (t + Δt) can be found as a function of the height and its first and second time derivatives at the current time t. Numerical simulations are typically implemented by running calculations repeatedly, in order to determine the state of the system at each discrete time step. Games work in the same way. A main "game loop" runs repeatedly, and during each iteration of the loop, various game systems such as artificial intelligence, game logic, physics simulations, and so on are given a chance to calculate or update their state for the next discrete time step. The results are then "rendered" by displaying graphics, emitting sound, and possibly producing other outputs such as force feedback on the joypad.
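To make these ideas concrete, here is a minimal, hypothetical C++ sketch (not drawn from any real engine) of a fixed-time-step game loop that advances the falling rigid body of Equations (1.1) and (1.2) using a simple semi-implicit Euler integration step. The subsystem calls shown as comments (pollInput, updateAI, renderScene) and all other names are invented purely for illustration.

#include <cstdio>

struct RigidBody
{
    float y;    // height in meters
    float vy;   // vertical velocity in meters per second
};

// One discrete time step of the numerical model: given the state at time t,
// approximate the state at time t + dt (semi-implicit Euler integration).
void stepPhysics(RigidBody& body, float dt)
{
    const float g = -9.8f;     // constant acceleration due to gravity (m/s²)
    body.vy += g * dt;         // integrate acceleration into velocity
    body.y  += body.vy * dt;   // integrate velocity into position
}

int main()
{
    RigidBody ball = { 100.0f, 0.0f };   // initial conditions y0 and v0
    const float dt = 1.0f / 60.0f;       // 60 updates per second, a "soft" deadline

    // The main game loop: each iteration updates every subsystem by one
    // discrete time step and then renders the results.
    for (int frame = 0; frame < 600 && ball.y > 0.0f; ++frame)
    {
        // pollInput();     // read the player's (unpredictable) input
        // updateAI(dt);    // let the agents "think"
        stepPhysics(ball, dt);
        // renderScene();   // draw graphics, emit sound, drive force feedback
        std::printf("t = %.2f s, y = %.2f m\n", (frame + 1) * dt, ball.y);
    }
    return 0;
}

Each iteration advances the simulation by a fixed Δt of 1/60 of a second; Chapter 7 examines real game loops, time measurement, and multiprocessor variations in much more detail.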
1.3. What Is a Game Engine?

The term "game engine" arose in the mid-1990s in reference to first-person shooter (FPS) games like the insanely popular Doom by id Software. Doom was architected with a reasonably well-defined separation between its core software components (such as the three-dimensional graphics rendering system, the collision detection system, or the audio system) and the art assets, game worlds, and rules of play that comprised the player's gaming experience. The value of this separation became evident as developers began licensing games and re-tooling them into new products by creating new art, world layouts, weapons, characters, vehicles, and game rules with only minimal changes to the "engine" software. This marked the birth of the "mod community"—a group of individual gamers and small independent studios that built new games by modifying existing games, using free toolkits provided by the original developers. Towards the end of the 1990s, some games like Quake III Arena and Unreal were designed with reuse and "modding" in mind. Engines were made highly customizable via scripting languages like id's Quake C, and engine licensing began to be a viable secondary revenue stream for the developers who created them. Today, game developers can license a game engine and reuse significant portions of its key software components in order to build games. While this practice still involves considerable investment in custom software engineering, it can be much more economical than developing all of the core engine components in-house.

The line between a game and its engine is often blurry. Some engines make a reasonably clear distinction, while others make almost no attempt to separate the two. In one game, the rendering code might "know" specifically how to draw an orc. In another game, the rendering engine might provide general-purpose material and shading facilities, and "orc-ness" might be defined entirely in data. No studio makes a perfectly clear separation between the game and the engine, which is understandable considering that the definitions of these two components often shift as the game's design solidifies.

Arguably a data-driven architecture is what differentiates a game engine from a piece of software that is a game but not an engine. When a game contains hard-coded logic or game rules, or employs special-case code to render specific types of game objects, it becomes difficult or impossible to reuse that software to make a different game. We should probably reserve the term "game engine" for software that is extensible and can be used as the foundation for many different games without major modification.
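As a purely illustrative sketch of this data-driven idea (the types, fields, and file paths below are invented for this example and are not taken from any particular engine), an engine might expose only generic descriptions of renderable objects, leaving "orc-ness" entirely to data authored by the game team:

#include <iostream>
#include <string>
#include <vector>

// A generic description that the engine understands. Nothing engine-side
// refers to orcs specifically; the orc exists only in data like this.
struct RenderableDesc
{
    std::string meshPath;       // e.g. "characters/orc/orc.mesh"
    std::string materialPath;   // e.g. "characters/orc/orc_skin.material"
    float       scale;
};

struct GameObjectDesc
{
    std::string              typeName;    // "orc", "health_pack", "race_car", ...
    RenderableDesc           renderable;
    std::vector<std::string> scripts;     // gameplay behavior, also defined in data
};

// Engine-side code stays game-agnostic: it can handle any described object.
void drawObject(const GameObjectDesc& obj)
{
    std::cout << "drawing a " << obj.typeName
              << " using mesh " << obj.renderable.meshPath << "\n";
}

int main()
{
    // In a genuinely data-driven engine this description would be loaded
    // from an asset file rather than constructed in code.
    GameObjectDesc orc = {
        "orc",
        { "characters/orc/orc.mesh", "characters/orc/orc_skin.material", 1.0f },
        { "scripts/orc_brain.lua" }
    };
    drawObject(orc);
    return 0;
}

Because descriptions like these live in assets rather than in code, the same engine executable can, in principle, drive a completely different game simply by shipping different data.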
Clearly this is not a black-and-white distinction. We can think of a gamut of reusability onto which every engine falls. Figure 1.1 takes a stab at the locations of some well-known games/engines along this gamut.

Figure 1.1. Game engine reusability gamut, ranging from "cannot be used to build more than one game" (e.g., PacMan) through the Hydro Thunder Engine, the Quake III Engine, and Unreal Engine 3, toward the probably impossible ideal of an engine that "can be used to build any game imaginable."

One would think that a game engine could be something akin to Apple QuickTime or Microsoft Windows Media Player—a general-purpose piece of software capable of playing virtually any game content imaginable. However, this ideal has not yet been achieved (and may never be). Most game engines are carefully crafted and fine-tuned to run a particular game on a particular hardware platform. And even the most general-purpose multiplatform engines are really only suitable for building games in one particular genre, such as first-person shooters or racing games. It's safe to say that the more general-purpose a game engine or middleware component is, the less optimal it is for running a particular game on a particular platform.

This phenomenon occurs because designing any efficient piece of software invariably entails making trade-offs, and those trade-offs are based on assumptions about how the software will be used and/or about the target hardware on which it will run. For example, a rendering engine that was designed to handle intimate indoor environments probably won't be very good at rendering vast outdoor environments. The indoor engine might use a BSP tree or portal system to ensure that no geometry is drawn that is being occluded by walls or objects that are closer to the camera. The outdoor engine, on the other hand, might use a less-exact occlusion mechanism, or none at all, but it probably makes aggressive use of level-of-detail (LOD) techniques to ensure that distant objects are rendered with a minimum number of triangles, while using high-resolution triangle meshes for geometry that is close to the camera.

The advent of ever-faster computer hardware and specialized graphics cards, along with ever-more-efficient rendering algorithms and data structures, is beginning to soften the differences between the graphics engines of different genres. It is now possible to use a first-person shooter engine to build a real-time strategy game, for example. However, the trade-off between generality and optimality still exists. A game can always be made more impressive by fine-tuning the engine to the specific requirements and constraints of a particular game and/or hardware platform.

1.4. Engine Differences Across Genres

Game engines are typically somewhat genre specific. An engine designed for a two-person fighting game in a boxing ring will be very different from a massively multiplayer online game (MMOG) engine or a first-person shooter (FPS) engine or a real-time strategy (RTS) engine. However, there is also a great deal of overlap—all 3D games, regardless of genre, require some form of low-level user input from the joypad, keyboard, and/or mouse, some form of 3D mesh rendering, some form of heads-up display (HUD) including text rendering in a variety of fonts, a powerful audio system, and the list goes on. So while the Unreal Engine, for example, was designed for first-person shooter games, it has been used successfully to construct games in a number of other genres as well, including the wildly popular third-person shooter Gears of War by Epic Games; the character-based action-adventure game Grimm, by American McGee's Shanghai-based development studio, Spicy Horse; and Speed Star, a futuristic racing game by South Korea-based Acro Games.

Let's take a look at some of the most common game genres and explore some examples of the technology requirements particular to each.

1.4.1. First-Person Shooters (FPS)
The first-person shooter (FPS) genre is typified by games like Quake, Unreal Tournament, Half-Life, Counter-Strike, and Call of Duty (see Figure 1.2). These games have historically involved relatively slow on-foot roaming of a potentially large but primarily corridor-based world. However, modern first-person shooters can take place in a wide variety of virtual environments including vast open outdoor areas and confined indoor areas. Modern FPS traversal mechanics can include on-foot locomotion, rail-confined or free-roaming ground vehicles, hovercraft, boats, and aircraft. For an overview of this genre, see http://en.wikipedia.org/wiki/First-person_shooter.

Figure 1.2. Call of Duty 2 (Xbox 360/PLAYSTATION 3).

First-person games are typically some of the most technologically challenging to build, probably rivaled in complexity only by third-person shooter/action/platformer games and massively multiplayer games. This is because first-person shooters aim to provide their players with the illusion of being immersed in a detailed, hyperrealistic world. It is not surprising that many of the game industry's big technological innovations arose out of the games in this genre.

First-person shooters typically focus on technologies such as

• efficient rendering of large 3D virtual worlds;
• a responsive camera control/aiming mechanic;
• high-fidelity animations of the player's virtual arms and weapons;
• a wide range of powerful hand-held weaponry;
• a forgiving player character motion and collision model, which often gives these games a "floaty" feel;
• high-fidelity animations and artificial intelligence for the non-player characters (the player's enemies and allies);
• small-scale online multiplayer capabilities (typically supporting up to 64 simultaneous players), and the ubiquitous "death match" gameplay mode.

The rendering technology employed by first-person shooters is almost always highly optimized and carefully tuned to the particular type of environment being rendered. For example, indoor "dungeon crawl" games often employ binary space partitioning (BSP) trees or portal-based rendering systems. Outdoor FPS games use other kinds of rendering optimizations such as occlusion culling, or an offline sectorization of the game world with manual or automated specification of which target sectors are visible from each source sector.

Of course, immersing a player in a hyperrealistic game world requires much more than just optimized high-quality graphics technology. The character animations, audio and music, rigid-body physics, in-game cinematics, and myriad other technologies must all be cutting-edge in a first-person shooter. So this genre has some of the most stringent and broad technology requirements in the industry.

1.4.2. Platformers and Other Third-Person Games

"Platformer" is the term applied to third-person character-based action games where jumping from platform to platform is the primary gameplay mechanic. Typical games from the 2D era include Space Panic, Donkey Kong, Pitfall!, and Super Mario Brothers. The 3D era includes platformers like Super Mario 64, Crash Bandicoot, Rayman 2, Sonic the Hedgehog, the Jak and Daxter series (Figure 1.3), the Ratchet & Clank series, and more recently Super Mario Galaxy. See http://en.wikipedia.org/wiki/Platformer for an in-depth discussion of this genre.

Figure 1.3. Jak & Daxter: The Precursor Legacy.
In terms of their technological requirements, platformers can usually be lumped together with third-person shooters and third-person action/adventure games, like Ghost Recon, Gears of War (Figure 1.4), and Uncharted: Drake's Fortune.

Figure 1.4. Gears of War.

Third-person character-based games have a lot in common with first-person shooters, but a great deal more emphasis is placed on the main character's abilities and locomotion modes. In addition, high-fidelity full-body character animations are required for the player's avatar, as opposed to the somewhat less-taxing animation requirements of the "floating arms" in a typical FPS game. It's important to note here that almost all first-person shooters have an online multiplayer component, so a full-body player avatar must be rendered in addition to the first-person arms. However, the fidelity of these FPS player avatars is usually not comparable to the fidelity of the non-player characters in these same games; nor can it be compared to the fidelity of the player avatar in a third-person game.

In a platformer, the main character is often cartoon-like and not particularly realistic or high-resolution. However, third-person shooters often feature a highly realistic humanoid player character. In both cases, the player character typically has a very rich set of actions and animations.

Some of the technologies specifically focused on by games in this genre include

• moving platforms, ladders, ropes, trellises, and other interesting locomotion modes;
• puzzle-like environmental elements;
• a third-person "follow camera" which stays focused on the player character and whose rotation is typically controlled by the human player via the right joypad stick (on a console) or the mouse (on a PC—note that while there are a number of popular third-person shooters on PC, the platformer genre exists almost exclusively on consoles);
• a complex camera collision system for ensuring that the view point never "clips" through background geometry or dynamic foreground objects.

1.4.3. Fighting Games

Fighting games are typically two-player games involving humanoid characters pummeling each other in a ring of some sort. The genre is typified by games like Soul Calibur and Tekken (see Figure 1.5). The Wikipedia page http://en.wikipedia.org/wiki/Fighting_game provides an overview of this genre.

Figure 1.5. Tekken 3 (PlayStation).

Traditionally games in the fighting genre have focused their technology efforts on

• a rich set of fighting animations;
• accurate hit detection;
• a user input system capable of detecting complex button and joystick combinations;
• crowds, but otherwise relatively static backgrounds.

Since the 3D world in these games is small and the camera is centered on the action at all times, historically these games have had little or no need for world subdivision or occlusion culling. They would likewise not be expected to employ advanced three-dimensional audio propagation models, for example.

State-of-the-art fighting games like EA's Fight Night Round 3 (Figure 1.6) have upped the technological ante with features like

• high-definition character graphics, including realistic skin shaders with subsurface scattering and sweat effects;
• high-fidelity character animations;
• physics-based cloth and hair simulations for the characters.

Figure 1.6. Fight Night Round 3 (PLAYSTATION 3).

It's important to note that some fighting games like Heavenly Sword take place in a large-scale virtual world, not a confined arena.
In fact, many people consider this to be a separate genre, sometimes called a brawler. This kind of fighting game can have technical requirements more akin to those of a first-person shooter or real-time strategy game.

1.4.4. Racing Games

The racing genre encompasses all games whose primary task is driving a car or other vehicle on some kind of track. The genre has many subcategories. Simulation-focused racing games ("sims") aim to provide a driving experience that is as realistic as possible (e.g., Gran Turismo). Arcade racers favor over-the-top fun over realism (e.g., San Francisco Rush, Cruisin' USA, Hydro Thunder). A relatively new subgenre explores the subculture of street racing with tricked out consumer vehicles (e.g., Need for Speed, Juiced). Kart racing is a subcategory in which popular characters from platformer games or cartoon characters from TV are re-cast as the drivers of whacky vehicles (e.g., Mario Kart, Jak X, Freaky Flyers). "Racing" games need not always involve time-based competition. Some kart racing games, for example, offer modes in which players shoot at one another, collect loot, or engage in a variety of other timed and untimed tasks. For a discussion of this genre, see http://en.wikipedia.org/wiki/Racing_game.

A racing game is often very linear, much like older FPS games. However, travel speed is generally much faster than in a FPS. Therefore more focus is placed on very long corridor-based tracks, or looped tracks, sometimes with various alternate routes and secret short-cuts. Racing games usually focus all their graphic detail on the vehicles, track, and immediate surroundings. However, kart racers also devote significant rendering and animation bandwidth to the characters driving the vehicles. Figure 1.7 shows a screen shot from the latest installment in the well-known Gran Turismo racing game series, Gran Turismo 5.

Figure 1.7. Gran Turismo 5 (PLAYSTATION 3).

Some of the technological properties of a typical racing game include the following techniques.

• Various "tricks" are used when rendering distant background elements, such as employing two-dimensional cards for trees, hills, and mountains.
• The track is often broken down into relatively simple two-dimensional regions called "sectors." These data structures are used to optimize rendering and visibility determination, to aid in artificial intelligence and path finding for non-human-controlled vehicles, and to solve many other technical problems.
• The camera typically follows behind the vehicle for a third-person perspective, or is sometimes situated inside the cockpit first-person style. When the track involves tunnels and other "tight" spaces, a good deal of effort is often put into ensuring that the camera does not collide with background geometry.

1.4.5. Real-Time Strategy (RTS)

The modern real-time strategy (RTS) genre was arguably defined by Dune II: The Building of a Dynasty (1992). Other games in this genre include Warcraft, Command & Conquer, Age of Empires, and Starcraft. In this genre, the player deploys the battle units in his or her arsenal strategically across a large playing field in an attempt to overwhelm his or her opponent. The game world is typically displayed at an oblique top-down viewing angle. For a discussion of this genre, see http://en.wikipedia.org/wiki/Real-time_strategy.
The RTS player is usually prevented from signifi cantly changing the viewing angle in order to see across large distances. This restriction permits Figure 1.8. Age of Empires. 1.4. Engine Differnces Across Genres 22 1. Introduction developers to employ various optimizations in the rendering engine of an RTS game. Older games in the genre employed a grid-based (cell-based) world con- struction, and an orthographic projection was used to greatly simplify the ren- derer. For example, Figure 1.8 shows a screen shot from the classic RTS Age of Empires. Modern RTS games sometimes use perspective projection and a true 3D world, but they may still employ a grid layout system to ensure that units and background elements, such as buildings, align with one another properly. A popular example, Command & Conquer 3, is shown in Figure 1.9. Some other common practices in RTS games include the following tech- niques. Figure 1.9. Command & Conquer 3. 23 Each unit is relatively low-res, so that the game can support large num- bers of them on-screen at once. Height-fi eld terrain is usually the canvas upon which the game is de- signed and played. The player is oft en allowed to build new structures on the terrain in ad- dition to deploying his or her forces. User interaction is typically via single-click and area-based selection of units, plus menus or toolbars containing commands, equipment, unit types, building types, etc. 1.4.6. Massively Multiplayer Online Games (MMOG) The massively multiplayer online game (MMOG) genre is typifi ed by games like Neverwinter Nights, EverQuest, World of Warcraft , and Star Wars Galaxies, to name a few. An MMOG is defi ned as any game that supports huge numbers of simultaneous players (from thousands to hundreds of thousands), usually all 1.4. Engine Differnces Across Genres Figure 1.10. World of Warcraft. 24 1. Introduction playing in one very large, persistent virtual world (i.e., a world whose internal state persists for very long periods of time, far beyond that of any one player’s gameplay session). Otherwise, the gameplay experience of an MMOG is oft en similar to that of their small-scale multiplayer counterparts. Subcategories of this genre include MMO role-playing games (MMORPG), MMO real-time strategy games (MMORTS), and MMO fi rst-person shooters (MMOFPS). For a discussion of this genre, see htt p://en.wikipedia.org/wiki/MMOG. Figure 1.10 shows a screen shot from the hugely popular MMORPG World of Warcraft . At the heart of all MMOGs is a very powerful batt ery of servers. These servers maintain the authoritative state of the game world, manage users sign- ing in and out of the game, provide inter-user chat or voice-over-IP (VoIP) services, etc. Almost all MMOGs require users to pay some kind of regular subscription fee in order to play, and they may off er micro-transactions within the game world or out-of-game as well. Hence, perhaps the most important role of the central server is to handle the billing and micro-transactions which serve as the game developer’s primary source of revenue. Graphics fi delity in an MMOG is almost always lower than its non-mas- sively multiplayer counterparts, as a result of the huge world sizes and ex- tremely large numbers of users supported by these kinds of games. 1.4.7. Other Genres There are of course many other game genres which we won’t cover in depth here. 
Some examples include sports, with subgenres for each major sport (football, baseball, soccer, golf, etc.); role-playing games (RPG); God games, like Populus and Black & White; environmental/social simulation games, like SimCity or The Sims; puzzle games like Tetris; conversions of non-electronic games, like chess, card games, go, etc.; web-based games, such as those off ered at Electronic Arts’ Pogo site; and the list goes on. We have seen that each game genre has its own particular technologi- cal requirements. This explains why game engines have traditionally diff ered quite a bit from genre to genre. However, there is also a great deal of tech- nological overlap between genres, especially within the context of a single hardware platform. With the advent of more and more powerful hardware, 25 diff erences between genres that arose because of optimization concerns are beginning to evaporate. So it is becoming increasingly possible to reuse the same engine technology across disparate genres, and even across disparate hardware platforms. 1.5. Game Engine Survey 1.5.1. The Quake Family of Engines The fi rst 3D fi rst-person shooter (FPS) game is generally accepted to be Castle Wolfenstein 3D (1992). Writt en by id Soft ware of Texas for the PC platform, this game led the game industry in a new and exciting direction. Id Soft ware went on to create Doom, Quake , Quake II, and Quake III. All of these engines are very similar in architecture, and I will refer to them as the Quake family of engines. Quake technology has been used to create many other games and even other engines. For example, the lineage of Medal of Honor for the PC platform goes something like this: Quake I II (Id); Sin (Ritual); F.A.K.K. 2 (Ritual); Medal of Honor: Allied Assault (2015 & Dreamworks Interactive); Medal of Honor: Pacifi c Assault (Electronic Arts, Los Angeles). Many other games based on Quake technology follow equally circuitous paths through many diff erent games and studios. In fact, Valve’s Source engine (used to create the Half-Life games) also has distant roots in Quake technology. The Quake and Quake II source code is freely available, and the original Quake engines are reasonably well architected and “clean” (although they are of course a bit outdated and writt en entirely in C). These code bases serve as great examples of how industrial-strength game engines are built. The full source code to Quake and Quake II is available on id’s website at htt p://www. idsoft ware.com/business/techdownloads. If you own the Quake and/or Quake II games, you can actually build the code using Microsoft Visual Studio and run the game under the debugger using the real game assets from the disk. This can be incredibly instructive. You can set break points, run the game, and then analyze how the engine actually works by stepping through the code. I highly recommend down- loading one or both of these engines and analyzing the source code in this manner. 1.5. Game Engine Survey 26 1. Introduction 1.5.2. The Unreal Family of Engines Epic Games Inc. burst onto the FPS scene in 1998 with its legendary game Un- real . Since then, the Unreal Engine has become a major competitor to Quake technology in the FPS space. Unreal Engine 2 (UE2) is the basis for Unreal Tournament 2004 (UT2004) and has been used for countless “mods,” university projects, and commercial games. 
Unreal Engine 3 (UE3) is the next evolution- ary step, boasting some of the best tools and richest engine feature sets in the industry, including a convenient and powerful graphical user interface for creating shaders and a graphical user interface for game logic programming called Kismet. Many games are being developed with UE3 lately, including of course Epic’s popular Gears of War. The Unreal Engine has become known for its extensive feature set and cohesive, easy-to-use tools. The Unreal Engine is not perfect, and most devel- opers modify it in various ways to run their game optimally on a particular hardware platform. However, Unreal is an incredibly powerful prototyping tool and commercial game development platform, and it can be used to build virtually any 3D fi rst-person or third-person game (not to mention games in other genres as well). The Unreal Developer Network (UDN) provides a rich set of documenta- tion and other information about the various versions of the Unreal Engine (see htt p://udn.epicgames.com). Some of the documentation on Unreal Engine 2 is freely available, and “mods” can be constructed by anyone who owns a copy of UT2004. However, access to the balance of the UE2 docs and all of the UE3 docs are restricted to licensees of the engine. Unfortunately, licenses are extremely expensive, and hence out of reach for all independent game devel- opers and most small studios as well. But there are plenty of other useful web- sites and wikis on Unreal. One popular one is htt p://www.beyondunreal.com. 1.5.3. The Half Life Source Engine Source is the game engine that drives the smash hit Half-Life 2 and its sequels HL2: Episode One, HL2: Episode Two, Team Fortress 2, and Portal (shipped to- gether under the title The Orange Box). Source is a high-quality engine, rivaling Unreal Engine 3 in terms of graphics capabilities and tool set. 1.5.4. Microsoft’s XNA Game Studio Microsoft ’s XNA Game Studio is an easy-to-use and highly accessible game development platform aimed at encouraging players to create their own games and share them with the online gaming community, much as YouTube encourages the creation and sharing of home-made videos. 27 XNA is based on Microsoft ’s C# language and the Common Language Runtime (CLR). The primary development environment is Visual Studio or its free counterpart, Visual Studio Express. Everything from source code to game art assets are managed within Visual Studio. With XNA, develop- ers can create games for the PC platform and Microsoft ’s Xbox 360 console. Aft er paying a modest fee, XNA games can be uploaded to the Xbox Live network and shared with friends. By providing excellent tools at essentially zero cost, Microsoft has brilliantly opened the fl oodgates for the average person to create new games. XNA clearly has a bright and fascinating future ahead of it. 1.5.5. Other Commercial Engines There are lots of other commercial game engines out there. Although indie developers may not have the budget to purchase an engine, many of these products have great online documentation and/or wikis that can serve as a great source of information about game engines and game programming in general. For example, check out the C4 Engine by Terathon Soft ware (htt p:// www.terathon.com), a company founded by Eric Lengyel in 2001. Docu- mentation for the C4 Engine can be found on Terathon’s website, with ad- ditional details on the C4 Engine wiki (htt p://www.terathon.com/wiki/index. php?title=Main_Page). 1.5.6. 
Proprietary in-House Engines Many companies build and maintain proprietary in-house game engines. Electronic Arts built many of its RTS games on a proprietary engine called SAGE, developed at Westwood Studios. Naughty Dog’s Crash Bandicoot, Jak and Daxter series, and most recently Uncharted: Drake’s Fortune franchises were each built on in-house engines custom-tailored to the PlayStation, PlayStation 2, and PLAYSTATION 3 platforms, respectively. And of course, most commer- cially licensed game engines like Quake , Source, or the Unreal Engine started out as proprietary in-house engines. 1.5.7. Open Source Engines Open source 3D game engines are engines built by amateur and professional game developers and provided online for free. The term “open source” typi- cally implies that source code is freely available and that a somewhat open de- velopment model is employed, meaning almost anyone can contribute code. Li- censing, if it exists at all, is oft en provided under the Gnu Public License (GPL) or Lesser Gnu Public License (LGPL). The former permits code to be freely used 1.5. Game Engine Survey 28 1. Introduction by anyone, as long as their code is also freely available; the latt er allows the code to be used even in proprietary for-profi t applications. Lots of other free and semi-free licensing schemes are also available for open source projects. There are a staggering number of open source engines available on the web. Some are quite good, some are mediocre, and some are just plain aw- ful! The list of game engines provided online at htt p://cg.cs.tu-berlin.de/~ki/ engines.html will give you a feel for the sheer number of engines that are out there. OGRE 3D is a well-architected, easy-to-learn, and easy-to-use 3D render- ing engine. It boasts a fully featured 3D renderer including advanced lighting and shadows , a good skeletal character animation system, a two-dimensional overlay system for heads-up display s and graphical user interface s, and a post-processing system for full-screen eff ects like bloom . OGRE is, by its au- thors’ own admission, not a full game engine, but it does provide many of the foundational components required by prett y much any game engine. Some other well-known open source engines are listed here. Panda3D is a script-based engine. The engine’s primary interface is the Python custom scripting language. It is designed to make prototyping 3D games and virtual worlds convenient and fast. Yake is a relatively new fully featured game engine built on top of OGRE . Crystal Space is a game engine with an extensible modular architecture. Torque and Irrlicht are also well-known and widely used engines. 1.6. Runtime Engine Architecture A game engine generally consists of a tool suite and a runtime component. We’ll explore the architecture of the runtime piece fi rst and then get into tools architecture in the following section. Figure 1.11 shows all of the major runtime components that make up a typical 3D game engine. Yeah, it’s big! And this diagram doesn’t even account for all the tools. Game engines are defi nitely large soft ware systems. Like all soft ware systems, game engines are built in layers. Normally up- per layers depend on lower layers, but not vice versa. When a lower layer depends upon a higher layer, we call this a circular dependency. Dependency cycles are to be avoided in any soft ware system, because they lead to un- desirable coupling between systems, make the soft ware untestable, and in- hibit code reuse. 
This is especially true for a large-scale system like a game engine.

Figure 1.11. Runtime game engine architecture.

What follows is a brief overview of the components shown in the diagram in Figure 1.11. The rest of this book will be spent investigating each of these components in a great deal more depth and learning how these components are usually integrated into a functional whole.

1.6.1.
Target Hardware The target hardware layer, shown in isolation in Figure 1.12, represents the computer system or console on which the game will run. Typical platforms include Microsoft Windows- and Linux-based PCs, the Apple iPhone and Macintosh, Microsoft ’s Xbox and Xbox 360, Sony’s PlayStation, PlayStation 2, PlayStation Portable (PSP), and PLAYSTATION 3, and Nintendo’s DS, Game- Cube, and Wii. Most of the topics in this book are platform-agnostic, but we’ll also touch on some of the design considerations peculiar to PC or console development, where the distinctions are relevant. Hardware (PC, XBOX360, PS3, etc.) Figure 1.12. Hardware layer. Drivers Figure 1.13. Device driver layer. 1.6.2. Device Drivers As depicted in Figure 1.13, device drivers are low-level soft ware components provided by the operating system or hardware vendor. Drivers manage hard- ware resources and shield the operating system and upper engine layers from the details of communicating with the myriad variants of hardware devices available. 1.6.3. Operating System On a PC, the operating system (OS) is running all the time. It orchestrates the execution of multiple programs on a single computer, one of which is your game. The OS layer is shown in Figure 1.14. Operating systems like Microsoft Windows employ a time-sliced approach to sharing the hardware with mul- tiple running programs, known as pre-emptive multitasking . This means that a PC game can never assume it has full control of the hardware—it must “play nice” with other programs in the system. 31 1.6. Runtime Engine Architecture OS Figure 1.14. Operating system layer. 3rd Party SDKs Havok, PhysX, ODE etc. DirectX, OpenGL, libgcm, Edge, etc. Boost++ STL / STLPort etc.Kynapse EuphoriaGranny, Havok Animation, etc. Figure 1.15. Third-party SDK layer. On a console, the operating system is oft en just a thin library layer that is compiled directly into your game executable. On a console, the game typically “owns” the entire machine. However, with the introduction of the Xbox 360 and PLAYSTATION 3, this is no longer strictly the case. The operating sys- tem on these consoles can interrupt the execution of your game, or take over certain system resources, in order to display online messages, or to allow the player to pause the game and bring up the PS3’s Xross Media Bar or the Xbox 360’s dashboard, for example. So the gap between console and PC develop- ment is gradually closing (for bett er or for worse). 1.6.4. Third-Party SDKs and Middleware Most game engines leverage a number of third-party soft ware development kit s (SDKs) and middleware, as shown in Figure 1.15. The functional or class- based interface provided by an SDK is oft en called an application program- ming interface (API). We will look at a few examples. 1.6.4.1. Data Structures and Algorithms Like any soft ware system, games depend heavily on collection data structures and algorithms to manipulate them. Here are a few examples of third-party libraries which provide these kinds of services. STL. The C++ standard template library provides a wealth of code and algorithms for managing data structures, strings, and stream-based I/O. STLport . This is a portable, optimized implementation of STL. Boost . Boost is a powerful data structures and algorithms library, designed in the style of STL. (The online documentation for Boost is also a great place to learn a great deal about computer science!) Loki . 
Loki is a powerful generic programming template library which is exceedingly good at making your brain hurt! 32 1. Introduction Game developers are divided on the question of whether to use template libraries like STL in their game engines. Some believe that the memory alloca- tion patt erns of STL, which are not conducive to high-performance program- ming and tend to lead to memory fragmentation (see Section 5.2.1.4), make STL unusable in a game. Others feel that the power and convenience of STL outweigh its problems, and that most of the problems can in fact be worked around anyway. My personal belief is that STL is all right for use on a PC, be- cause its advanced virtual memory system renders the need for careful mem- ory allocation a bit less crucial (although one must still be very careful). On a console, with limited or no virtual memory facilities and exorbitant cache miss costs, you’re probably bett er off writing custom data structures that have pre- dictable and/or limited memory allocation patt erns. (And you certainly won’t go far wrong doing the same on a PC game project either.) 1.6.4.2. Graphics Most game rendering engines are built on top of a hardware interface library, such as the following: Glide is the 3D graphics SDK for the old Voodoo graphics cards. This SDK was popular prior to the era of hardware transform and lighting (hardware T&L) which began with DirectX 8. OpenGL is a widely used portable 3D graphics SDK. DirectX is Microsoft ’s 3D graphics SDK and primary rival to OpenGL . libgcm is a low-level direct interface to the PLAYSTATION 3’s RSX graph- ics hardware, which was provided by Sony as a more effi cient alterna- tive to OpenGL. Edge is a powerful and highly-effi cient rendering and animation engine produced by Naughty Dog and Sony for the PLAYSTATION 3 and used by a number of fi rst- and third-party game studios. 1.6.4.3. Collision and Physics Collision detection and rigid body dynamics (known simply as “physics” in the game development community) are provided by the following well- known SDKs. Havok is a popular industrial-strength physics and collision engine. PhysX is another popular industrial-strength physics and collision en- gine, available for free download from NVIDIA. Open Dynamics Engine (ODE) is a well-known open source physics/col- lision p ackage. 33 1.6.4.4. Character Animation A number of commercial animation packages exist, including but certainly not limited to the following. Granny . Rad Game Tools’ popular Granny toolkit includes robust 3D model and animation exporters for all the major 3D modeling and ani- mation packages like Maya, 3D Studio MAX, etc., a runtime library for reading and manipulating the exported model and animation data, and a powerful runtime animation system. In my opinion, the Granny SDK has the best-designed and most logical animation API of any I’ve seen, commercial or proprietary, especially its excellent handling of time. Havok Animation . The line between physics and animation is becoming increasingly blurred as characters become more and more realistic. The company that makes the popular Havok physics SDK decided to create a complimentary animation SDK, which makes bridging the physics- animation gap much easier than it ever has been. Edge. 
The Edge library produced for the PS3 by the ICE team at Naughty Dog, the Tools and Technology group of Sony Computer Entertainment America, and Sony’s Advanced Technology Group in Europe includes a powerful and effi cient animation engine and an effi cient geometry- processing engine for rendering. 1.6.4.5. Artifi cial Intelligence Kynapse . Until recently, artifi cial intelligence (AI) was handled in a cus- tom manner for each game. However, a company called Kynogon has produced a middleware SDK called Kynapse. This SDK provides low- level AI building blocks such as path fi nding, static and dynamic object avoidance, identifi cation of vulnerabilities within a space (e.g., an open window from which an ambush could come), and a reasonably good interface between AI and animation. 1.6.4.6. Biomechanical Character Models Endorphin and Euphoria . These are animation packages that produce character motion using advanced biomechanical models of realistic hu- man movement. As we mentioned above, the line between character animation and phys- ics is beginning to blur. Packages like Havok Animation try to marry physics and animation in a traditional manner, with a human animator providing the majority of the motion through a tool like Maya and with physics augmenting that motion at runtime. But recently a fi rm called Natural Motion Ltd. has pro- 1.6. Runtime Engine Architecture 34 1. Introduction duced a product that att empts to redefi ne how character motion is handled in games and other forms of digital media. Its fi rst product, Endorphin , is a Maya plug-in that permits animators to run full biomechanical simulations on characters and export the resulting animations as if they had been hand-animated. The biomechanical model ac- counts for center of gravity, the character’s weight distribution, and detailed knowledge of how a real human balances and moves under the infl uence of gravity and other forces. Its second product, Euphoria , is a real-time version of Endorphin intend- ed to produce physically and biomechanically accurate character motion at runtime under the infl uence of unpredictable forces. 1.6.5. Platform Independence Layer Most game engines are required to be capable of running on more than one hardware platform. Companies like Electronic Arts and Activision/Blizzard, for example, always target their games at a wide variety of platforms, because it exposes their games to the largest possible market. Typically, the only game studios that do not target at least two diff erent platforms per game are fi rst- party studios, like Sony’s Naughty Dog and Insomniac studios. Therefore, most game engines are architected with a platform independence layer, like the one shown in Figure 1.16. This layer sits atop the hardware, drivers, oper- ating system, and other third-party soft ware and shields the rest of the engine from the majority of knowledge of the underlying platform. By wrapping or replacing the most commonly used standard C library functions, operating system calls, and other foundational application pro- gramming interfaces (APIs), the platform independence layer ensures consis- tent behavior across all hardware platforms. This is necessary because there is a good deal of variation across platforms, even among “standardized” librar- ies like the standard C library. Platform Independence Layer Atomic Data TypesPlatform Detection Collections and Iterators Threading LibraryHi-Res TimerFile System Network Transport Layer (UDP/TCP) Graphics Wrappers Physics /Coll. 
Wrapper Figure 1.16. Platform independence layer. 1.6.6. Core Systems Every game engine, and really every large, complex C++ soft ware application, requires a grab bag of useful soft ware utilities. We’ll categorize these under the label “core systems.” A typical core systems layer is shown in Figure 1.17. Here are a few examples of the facilities the core layer usually provides. 35 Assertions are lines of error-checking code that are inserted to catch logi- cal mistakes and violations of the programmer’s original assumptions. Assertion checks are usually stripped out of the fi nal production build of the game. Memory management. Virtually every game engine implements its own custom memory allocation system(s) to ensure high-speed allocations and deallocations and to limit the negative eff ects of memory fragmen- tation (see Section 5.2.1.4). Math library. Games are by their nature highly mathematics-intensive. As such, every game engine has at least one, if not many, math libraries. These libraries provide facilities for vector and matrix math, quaternion rota- tions, trigonometry, geometric operations with lines, rays, spheres, frusta, etc., spline manipulation, numerical integration, solving systems of equa- tions, and whatever other facilities the game programmers require. Custom data structures and algorithms. Unless an engine’s designers de- cided to rely entirely on a third-party package such as STL, a suite of tools for managing fundamental data structures (linked lists, dynamic arrays, binary trees, hash maps, etc.) and algorithms (search, sort, etc.) is usually required. These are oft en hand-coded to minimize or elimi- nate dynamic memory allocation and to ensure optimal runtime perfor- mance on the target platform(s). A detailed discussion of the most common core engine systems can be found in Part II. 1.6.7. Resource Manager Present in every game engine in some form, the resource manager provides a unifi ed interface (or suite of interfaces) for accessing any and all types of game assets and other engine input data. Some engines do this in a highly centralized and consistent manner (e.g., Unreal ’s packages, OGRE 3D ’s Re- sourceManager class). Other engines take an ad hoc approach, oft en leaving it up to the game programmer to directly access raw fi les on disk or within compressed archives such as Quake ’s PAK fi les. A typical resource manager layer is depicted in Figure 1.18. 1.6. Runtime Engine Architecture Core Systems Module Start-Up and Shut-Down Parsers (CSV, XML, etc.) Assertions Unit Testing Math Library Strings and Hashed String Ids Debug Printing and LoggingMemory Allocation Engine Config (INI files etc.) Profiling / Stats Gathering Object Handles / Unique Ids RTTI / Reflection & Serialization Curves & Surfaces Library Random Number Generator Localization Services Asynchronous File I/O Movie Player Memory Card I/O (Older Consoles) Figure 1.17. Core engine systems. 36 1. Introduction Low-Level Renderer Primitive Submission Viewports & Virtual Screens Materials & Shaders Texture and Surface Mgmt. Graphics Device Interface Static & Dynamic Lighting Cameras Text & Fonts Debug Drawing (Lines etc.) Skeletal Mesh Rendering Figure 1.19. Low-level rendering engine. Resources (Game Assets) Resource Manager Texture Resource Material Resource 3D Model Resource Font Resource Collision Resource Physics Parameters Game World/Map etc.Skeleton Resource Figure 1.18. Resource manager. 1.6.8. 
Rendering Engine The rendering engine is one of the largest and most complex components of any game engine. Renderers can be architected in many diff erent ways. There is no one accepted way to do it, although as we’ll see, most modern rendering engines share some fundamental design philosophies, driven in large part by the design of the 3D graphics hardware upon which they depend. One common and eff ective approach to rendering engine design is to em- ploy a layered architecture as follows. 1.6.8.1. Low-Level Renderer The low-level renderer , shown in Figure 1.19, encompasses all of the raw ren- dering facilities of the engine. At this level, the design is focused on rendering a collection of geometric primitives as quickly and richly as possible, without much regard for which portions of a scene may be visible. This component is broken into various subcomponents, which are discussed below. Graphics Device Interface Graphics SDKs, such as DirectX and OpenGL, require a reasonable amount of code to be writt en just to enumerate the available graphics devices, initialize them, set up render surfaces (back-buff er, stencil buff er etc.), and so on. This 37 is typically handled by a component that I’ll call the graphics device interface (although every engine uses its own terminology). For a PC game engine, you also need code to integrate your renderer with the Windows message loop. You typically write a “ message pump ” that ser- vices Windows messages when they are pending and otherwise runs your render loop over and over as fast as it can. This ties the game’s keyboard poll- ing loop to the renderer’s screen update loop. This coupling is undesirable, but with some eff ort it is possible to minimize the dependencies. We’ll explore this topic in more depth later. Other Renderer Components The other components in the low-level renderer cooperate in order to collect submissions of geometric primitives (sometimes called render packet s), such as meshes, line lists, point lists, particles , terrain patches, text strings, and what- ever else you want to draw, and render them as quickly as possible. The low-level renderer usually provides a viewport abstraction with an associated camera -to-world matrix and 3D projection parameters, such as fi eld of view and the location of the near and far clip plane s. The low-level renderer also manages the state of the graphics hardware and the game’s shaders via its material system and its dynamic lighting system. Each submitt ed primitive is associated with a material and is aff ected by n dynamic lights. The mate- rial describes the texture (s) used by the primitive, what device state sett ings need to be in force, and which vertex and pixel shader to use when rendering the primitive. The lights determine how dynamic lighting calculations will be applied to the primitive. Lighting and shading is a complex topic, which is covered in depth in many excellent books on computer graphics, including [14], [42], and [1]. 1.6.8.2. Scene Graph/Culling Optimizations The low-level renderer draws all of the geometry submitt ed to it, without much regard for whether or not that geometry is actually visible (other than back-face culling and clipping triangles to the camera frustum). A higher-level component is usually needed in order to limit the number of primitives sub- mitt ed for rendering, based on some form of visibility determination. This layer is shown in Figure 1.20. 
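As a minimal illustration of what such a visibility test can look like, the sketch below checks an object's bounding sphere against the six planes of the camera's view frustum; this is exactly the simple per-object frustum cull discussed next. It is a hedged example, not any particular engine's code: the Plane and Sphere types and the convention that plane normals point into the frustum are assumptions made purely for illustration.

// Minimal sketch of a sphere-vs-frustum visibility test, as one example of
// the kind of check a culling layer performs before submitting geometry to
// the low-level renderer. Types and conventions here are illustrative only.
struct Plane  { float nx, ny, nz, d; };   // plane: n.p + d = 0, normal points into the frustum
struct Sphere { float cx, cy, cz, radius; };

// Returns true if the sphere is at least partially inside all six frustum planes.
bool IsSphereVisible(const Sphere& s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i)
    {
        const Plane& p = frustum[i];
        const float dist = p.nx * s.cx + p.ny * s.cy + p.nz * s.cz + p.d;
        if (dist < -s.radius)
            return false;   // completely outside this plane -- cull the object
    }
    return true;            // inside or intersecting every plane -- submit for rendering
}

A test like this is cheap enough to run on every object in a small world; the spatial subdivision structures described below exist to avoid even this per-object cost when the world contains many thousands of objects.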
For very small game worlds, a simple frustum cull (i.e., removing objects that the camera cannot “see”) is probably all that is required. For larger game worlds, a more advanced spatial subdivision data structure might be used to improve rendering effi ciency, by allowing the potentially visible set (PVS) of objects to be determined very quickly. Spatial subdivisions can take many 1.6. Runtime Engine Architecture 38 1. Introduction forms, including a binary space partitioning (BSP) tree, a quadtree , an octree , a kd-tree , or a sphere hierarchy . A spatial subdivision is sometimes called a scene graph, although technically the latt er is a particular kind of data struc- ture and does not subsume the former. Portals or occlusion culling methods might also be applied in this layer of the rendering engine. Ideally, the low-level renderer should be completely agnostic to the type of spatial subdivision or scene graph being used. This permits diff erent game teams to reuse the primitive submission code, but craft a PVS determination system that is specifi c to the needs of each team’s game. The design of the OGRE 3D open source rendering engine (htt p://www.ogre3d.org) is a great example of this principle in action. OGRE provides a plug-and-play scene graph architecture. Game developers can either select from a number of pre- implemented scene graph designs, or they can provide a custom scene graph implementation. 1.6.8.3. Visual Effects Modern game engines support a wide range of visual eff ects , as shown in Figure 1.21, including particle system s (for smoke, fi re, water splashes, etc.); decal systems (for bullet holes, foot prints, etc.); light mapping and environment mapping; dynamic shadows; full-screen post eff ects , applied aft er the 3D scene has been rendered to an off screen buff er. Scene Graph / Culling Optimizations LOD SystemOcclusion & PVSSpatial Subdivision (BSP Tr ee, kd-Tree, …) Figure 1.20. A typical scene graph/spatial subdivision layer, for culling optimization. Visual Effects Particle & Decal Systems Post Effects HDR Lighting PRT Lighting, Subsurf. Scatter Environment Mapping Light Mapping & Dynamic Shadows Figure 1.21. Visual effects. 39 Some examples of full-screen post eff ects include high dynamic range (HDR) lighting and bloom ; full-screen anti-aliasing (FSAA); color correction and color-shift eff ects, including bleach bypass , satura- tion and de-saturation eff ects, etc. It is common for a game engine to have an eff ects system component that manages the specialized rendering needs of particles, decals, and other vi- sual eff ects . The particle and decal systems are usually distinct components of the rendering engine and act as inputs to the low-level renderer . On the other hand, light mapping , environment mapping, and shadows are usually handled internally within the rendering engine proper. Full-screen post ef- fects are either implemented as an integral part of the renderer or as a separate component that operates on the renderer’s output buff ers. 1.6.8.4. Front End Most games employ some kind of 2D graphics overlaid on the 3D scene for various purposes. These include the game’s heads-up display (HUD); in-game menus, a console, and/or other development tools, which may or may not be shipped with the fi nal product; possibly an in-game graphical user interface (GUI), allowing the player to manipulate his or her character’s inventory, confi gure units for batt le, or perform other complex in-game tasks. This layer is shown in Figure 1.22. 
Two-dimensional graphics like these are usually implemented by drawing textured quads (pairs of triangles) with an orthographic projection . Or they may be rendered in full 3D, with the quads bill-boarded so they always face the camera . We’ve also included the full-motion video (FMV) system in this layer. This system is responsible for playing full-screen movies that have been recorded 1.6. Runtime Engine Architecture Front End Heads-Up Display (HUD) Full-Motion Video ( F MV) In-Game MenusIn-Game GUI Wrappers / Attract Mode In-Game Cinematics (IGC) Figure 1.22. Front end graphics. 40 1. Introduction earlier (either rendered with the game’s rendering engine or using another rendering package). A related system is the in-game cinematics (IGC) system. This component typically allows cinematic sequences to be choreographed within the game it- self, in full 3D. For example, as the player walks through a city, a conversation between two key characters might be implemented as an in-game cinematic. IGCs may or may not include the player character(s). They may be done as a deliberate cut-away during which the player has no control, or they may be subtly integrated into the game without the human player even realizing that an IGC is taking place. 1.6.9. Profi ling and Debugging Tools Games are real-time systems and, as such, game engineers oft en need to profi le the performance of their games in order to optimize performance. In addition, memory resources are usually scarce, so developers make heavy use of mem- ory analysis tools as well. The profi ling and debugging layer, shown in Figure 1.23, encompasses these tools and also includes in-game debugging facilities, such as debug drawing, an in-game menu system or console, and the ability to record and play back gameplay for testing and debugging purposes. There are plenty of good general-purpose soft ware profi ling tools avail- able, including Intel’s VTune, IBM’s Quantify and Purify (part of the PurifyPlus tool suite), Compuware’s Bounds Checker. However, most game engines also incorporate a suite of custom profi ling and debugging tools. For example, they might include one or more of the fol- lowing: a mechanism for manually instrumenting the code, so that specifi c sec- tions of code can be timed; a facility for displaying the profi ling statistics on-screen while the game is running; a facility for dumping performance stats to a text fi le or to an Excel spreadsheet; a facility for determining how much memory is being used by the en- gine, and by each subsystem, including various on-screen displays; the ability to dump memory usage, high-water mark, and leakage stats when the game terminates and/or during gameplay; Profiling & Debugging Memory & Performance Stats In-Game Menus or Console Recording & Playback Figure 1.23. Profi l- ing and debugging tools. 41 1.6. Runtime Engine Architecture tools that allow debug print statements to be peppered throughout the code, along with an ability to turn on or off diff erent categories of debug output and control the level of verbosity of the output; the ability to record game events and then play them back. This is tough to get right, but when done properly it can be a very valuable tool for tracking down bugs. 1.6.10. Collision and Physics Collision detection is important for every game. Without it, objects would in- terpenetrate, and it would be impossible to interact with the virtual world in any reasonable way. Some games also include a realistic or semi-realistic dynamics simulation . 
We call this the “physics system” in the game industry, although the term rigid body dynamics is really more appropriate, because we are usually only concerned with the motion (kinematics) of rigid bodies and the forces and torques (dynamics) that cause this motion to occur. This layer is depicted in Figure 1.24. Collision and physics are usually quite tightly coupled. This is because when collisions are detected, they are almost always resolved as part of the physics integration and constraint satisfaction logic. Nowadays, very few game companies write their own collision /physics engine. Instead, a third- party SDK is typically integrated into the engine. Havok is the gold standard in the industry today. It is feature-rich and performs well across the boards. PhysX by NVIDIA is another excellent collision and dynamics engine. It was integrated into Unreal Engine 3 and is also available for free as a standalone product for PC game development. PhysX was originally designed as the interface to Ageia’s new physics accelerator chip. The Collision & Physics Shapes/ Collidables Rigid Bodies Phantoms Ray/Shape Casting (Queries) Forces & Constraints Physics /Collision World Ragdoll Physics Figure 1.24. Collision and physics subsystem. 42 1. Introduction SDK is now owned and distributed by NVIDIA, and the company is adapting PhysX to run on its latest GPUs. Open source physics and collision engines are also available. Perhaps the best-known of these is the Open Dynamics Engine (ODE). For more informa- tion, see htt p://www.ode.org. I-Collide, V-Collide, and RAPID are other popu- lar non-commercial collision detection engines. All three were developed at the University of North Carolina (UNC). For more information, see htt p://www. cs.unc.edu/~geom/I_COLLIDE/index.html, htt p://www.cs.unc.edu/~geom/V_ COLLIDE/index.html, and htt p://www.cs.unc.edu/~geom/OBB/OBBT.html. 1.6.11. Animation Any game that has organic or semi-organic characters (humans, animals, car- toon characters, or even robots) needs an animation system. There are fi ve basic types of animation used in games: sprite/texture an imation, rigid body hierarchy animation, skeletal animation, vertex animation, and morph targets. Skeletal animation permits a detailed 3D character mesh to be posed by an animator using a relatively simple system of bones. As the bones move, the vertices of the 3D mesh move with them. Although morph targets and vertex animation are used in some engines, skeletal animation is the most prevalent animation method in games today; as such, it will be our primary focus in this book. A typical skeletal animation system is shown in Figure 1.25. Skeletal Animation Animation Decompression Inverse Kinematics (IK) Game-Specific Post-Processing Sub-skeletal Animation LERP and Additive Blending Animation Playback Animation State Tree & Layers Figure 1.25. Skeletal animation subsystem. 43 You’ll notice in Figure 1.11 that Skeletal Mesh Rendering is a component that bridges the gap between the renderer and the animation system. There is a tight cooperation happening here, but the interface is very well defi ned. The animation system produces a pose for every bone in the skeleton, and then these poses are passed to the rendering engine as a palett e of matrices. The renderer transforms each vertex by the matrix or matrices in the palett e, in order to generate a fi nal blended vertex position. This process is known as skinning. 
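To make the skinning step concrete, here is a brief C++ sketch of how a renderer might blend a matrix palette into final vertex positions. The Vector3 and Matrix4x4 types, the row-vector convention, and the four-weights-per-vertex layout are illustrative assumptions, not any particular engine's API.

#include <cstdint>

// Illustrative math types -- any engine math library provides equivalents.
struct Vector3   { float x, y, z; };
struct Matrix4x4 { float m[4][4]; };

// Transform a point by a 4x4 matrix (row-vector convention, w = 1 assumed).
Vector3 TransformPoint(const Matrix4x4& m, const Vector3& p)
{
    return Vector3 {
        p.x * m.m[0][0] + p.y * m.m[1][0] + p.z * m.m[2][0] + m.m[3][0],
        p.x * m.m[0][1] + p.y * m.m[1][1] + p.z * m.m[2][1] + m.m[3][1],
        p.x * m.m[0][2] + p.y * m.m[1][2] + p.z * m.m[2][2] + m.m[3][2]
    };
}

// A skinned vertex is typically weighted to a small, fixed number of joints.
struct SkinnedVertex
{
    Vector3  position;       // position in the bind pose (model space)
    uint8_t  jointIndex[4];  // indices into the matrix palette
    float    jointWeight[4]; // blend weights, assumed to sum to 1
};

// Blend each vertex against the matrix palette produced by the animation
// system. Each palette entry is the combined "inverse bind pose times
// current pose" skinning matrix for one joint.
void SkinVertices(const SkinnedVertex* in, Vector3* out, int count,
                  const Matrix4x4* palette)
{
    for (int v = 0; v < count; ++v)
    {
        Vector3 blended = { 0.0f, 0.0f, 0.0f };
        for (int j = 0; j < 4; ++j)
        {
            const Matrix4x4& skinMtx = palette[in[v].jointIndex[j]];
            const Vector3 p = TransformPoint(skinMtx, in[v].position);
            const float w = in[v].jointWeight[j];
            blended.x += w * p.x;
            blended.y += w * p.y;
            blended.z += w * p.z;
        }
        out[v] = blended;
    }
}

In practice this blend almost always runs in a vertex shader on the GPU rather than in a CPU loop like this one, but the arithmetic is the same.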
There is also a tight coupling between the animation and physics systems, when rag dolls are employed. A rag doll is a limp (oft en dead) animated char- acter, whose bodily motion is simulated by the physics system. The physics system determines the positions and orientations of the various parts of the body by treating them as a constrained system of rigid bodies. The animation system calculates the palett e of matrices required by the rendering engine in order to draw the character on-screen. 1.6.12. Human Interface Devices (HID) Every game needs to process input from the player, obtained from various human interface device s (HIDs) including the keyboard and mouse, a joypad, or other specialized game controllers, like steering wheels, fi shing rods, dance pads, the WiiMote, etc. We sometimes call this component the player I/O component, because we may also provide output to the player through the HID , such as force feed- back /rumble on a joypad or the audio produced by the WiiMote. A typical HID layer is shown in Figure 1.26. The HID engine component is sometimes architected to divorce the low-level details of the game controller(s) on a particular hardware platform from the high-level game controls. It massages the raw data coming from the hardware, introducing a dead zone around the center point of each joypad stick, de-bouncing butt on-press inputs, detecting butt on-down and butt on- up events, interpreting and smoothing accelerometer inputs (e.g., from the PLAYSTATION 3 Sixaxis controller), and more. It oft en provides a mecha- nism allowing the player to customize the mapping between physical controls and logical game functions. It sometimes also includes a system for detecting chords (multiple butt ons pressed together), sequences (butt ons pressed in se- quence within a certain time limit), and gestures (sequences of inputs from the butt ons, sticks, accelerometers, etc.). 1.6. Runtime Engine Architecture Human Interface Devices (HID) Physical Device I/O Game-Specific Interface Figure 1.26. The player input/out- put system, also known as the hu- man interface de- vice (HID) layer. 44 1. Introduction 1.6.13. Audio Audio is just as important as graphics in any game engine. Unfortunately, audio oft en gets less att ention than rendering, physics, animation, AI, and gameplay. Case in point: Programmers oft en develop their code with their speakers turned off ! (In fact, I’ve known quite a few game programmers who didn’t even have speakers or headphones.) Nonetheless, no great game is complete without a stunning audio engine. The audio layer is depicted in Figure 1.27. Audio engines vary greatly in sophistication. Quake ’s and Unreal ’s au- dio engines are prett y basic, and game teams usually augment them with custom functionality or replace them with an in-house solution. For DirectX platforms (PC and Xbox 360), Microsoft provides an excellent audio tool suite called XACT . Electronic Arts has developed an advanced, high-powered au- dio engine internally called SoundR!OT. In conjunction with fi rst-party stu- dios like Naughty Dog, Sony Computer Entertainment America (SCEA) pro- vides a powerful 3D audio engine called Scream, which has been used on a number of PS3 titles including Naughty Dog’s Uncharted: Drake’s Fortune. However, even if a game team uses a pre-existing audio engine, every game requires a great deal of custom soft ware development, integration work, fi ne- tuning, and att ention to detail in order to produce high-quality audio in the fi nal product. 1.6.14. 
Online Multiplayer/Networking Many games permit multiple human players to play within a single virtual world. Multiplayer games come in at least four basic fl avors. Single-screen multiplayer. Two or more human interface devices (joypads, keyboards, mice, etc.) are connected to a single arcade machine, PC, or console. Multiple player characters inhabit a single virtual world, and a single camera keeps all player characters in frame simultaneously. Ex- amples of this style of multiplayer gaming include Smash Brothers, Lego Star Wars, and Gauntlet. Split-screen multiplayer. Multiple player characters inhabit a single vir- tual world, with multiple HIDs att ached to a single game machine, but each with its own camera, and the screen is divided into sections so that each player can view his or her character. Networked multiplayer. Multiple computers or consoles are networked together, with each machine hosting one of the players. Massively multiplayer online games (MMOG). Literally hundreds of thousands of users can be playing simultaneously within a giant, per- Audio Audio Playback / Management DSP/Effects 3D Audio Model Figure 1.27. Audio subsystem. 45 sistent, online virtual world hosted by a powerful batt ery of central servers. The multiplayer networking layer is shown in Figure 1.28. Multiplayer games are quite similar in many ways to their single-player counterparts. However, support for multiple players can have a profound impact on the design of certain game engine components. The game world object model, renderer, human input device system, player control system, and animation systems are all aff ected. Retrofi tt ing multiplayer features into a pre-existing single-player engine is certainly not impossible, although it can be a daunting task. Still, many game teams have done it successfully. That said, it is usually bett er to design multiplayer features from day one, if you have that luxury. It is interesting to note that going the other way—converting a multi- player game into a single-player game—is typically trivial. In fact, many game engines treat single-player mode as a special case of a multiplayer game, in which there happens to be only one player. The Quake engine is well known for its client-on-top-of-server mode, in which a single executable, running on a single PC, acts both as the client and the server in single-player campaigns. 1.6.15. Gameplay Foundation Systems The term gameplay refers to the action that takes place in the game, the rules that govern the virtual world in which the game takes place, the abilities of the player character(s) (known as player mechanics) and of the other characters and objects in the world, and the goals and objectives of the player(s). Game- play is typically implemented either in the native language in which the rest of the engine is writt en, or in a high-level scripting language—or sometimes both. To bridge the gap between the gameplay code and the low-level engine systems that we’ve discussed thus far, most game engines introduce a layer 1.6. Runtime Engine Architecture Gameplay Foundations Event/Messaging System Dynamic Game Object Model Scripting System World Loading / Streaming Static World Elements Real-Time Agent- Based Simulation High-Level Game Flow System/FSM Hierarchical Object Attachment Figure 1.29. Gameplay foundation systems. Online Multiplayer Match-Making & Game Mgmt. Game State Replication Object Authority Policy Figure 1.28. On- line multiplayer subsystem. 46 1. 
Introduction that I’ll call the gameplay foundations layer (for lack of a standardized name). Shown in Figure 1.29, this layer provides a suite of core facilities, upon which game-specifi c logic can be implemented conveniently. 1.6.15.1. Game Worlds and Object Models The gameplay foundations layer introduces the notion of a game world, con- taining both static and dynamic elements. The contents of the world are usu- ally modeled in an object-oriented manner (oft en, but not always, using an object-oriented programming language). In this book, the collection of object types that make up a game is called the game object model. The game object model provides a real-time simulation of a heterogeneous collection of objects in the virtual game world. Typical types of game objects include static background geometry, like buildings, roads, terrain (oft en a spe- cial case), etc.; dynamic rigid bodies, such as rocks, soda cans, chairs, etc.; player characters (PC); non-player characters (NPC); weapons; projectiles; vehicles; lights (which may be present in the dynamic scene at run time, or only used for static lighting offl ine); cameras; and the list goes on. The game world model is intimately tied to a soft ware object model, and this model can end up pervading the entire engine. The term soft ware object model refers to the set of language features, policies, and conventions used to implement a piece of object-oriented soft ware. In the context of game engines, the soft ware object model answers questions, such as: Is your game engine designed in an object-oriented manner? What language will you use? C? C++? Java? OCaml? How will the static class hierarchy be organized? One giant monolithic hierarchy? Lots of loosely coupled components? Will you use templates and policy-based design, or traditional polymor- phism? How are objects referenced? Straight old pointers? Smart pointers? Handles? 47 How will objects be uniquely identifi ed? By address in memory only? By name? By a global unique identifi er (GUID)? How are the lifetimes of game objects managed? How are the states of the game objects simulated over time? We’ll explore soft ware object models and game object models in consider- able depth in Section 14.2. 1.6.15.2. Event System Game objects invariably need to communicate with one another. This can be accomplished in all sorts of ways. For example, the object sending the message might simply call a member function of the receiver object. An event-driven architecture, much like what one would fi nd in a typical graphical user inter- face, is also a common approach to inter-object communication. In an event- driven system, the sender creates a litt le data structure called an event or mes- sage, containing the message’s type and any argument data that are to be sent. The event is passed to the receiver object by calling its event handler function. Events can also be stored in a queue for handling at some future time. 1.6.15.3. Scripting System Many game engines employ a scripting language in order to make develop- ment of game-specifi c gameplay rules and content easier and more rapid. Without a scripting language, you must recompile and relink your game ex- ecutable every time a change is made to the logic or data structures used in the engine. But when a scripting language is integrated into your engine, changes to game logic and data can be made by modifying and reloading the script code. Some engines allow script to be reloaded while the game continues to run. 
Other engines require the game to be shut down prior to script recompi- lation. But either way, the turn-around time is still much faster than it would be if you had to recompile and relink the game’s executable. 1.6.15.4. Artifi cial Intelligence Foundations Traditionally, artifi cial intelligence (AI) has fallen squarely into the realm of game-specifi c soft ware—it was usually not considered part of the game en- gine per se. More recently, however, game companies have recognized pat- terns that arise in almost every AI system, and these foundations are slowly starting to fall under the purview of the engine proper. A company called Kynogon has developed a commercial AI engine called Kynapse , which acts as an “AI foundation layer” upon which game-specifi c AI logic can be quite easily developed. Kynapse provides a powerful suite of features, including 1.6. Runtime Engine Architecture 48 1. Introduction a network of path nodes or roaming volumes, that defi nes areas or paths where AI characters are free to move without fear of colliding with static world geometry; simplifi ed collision information around the edges of each free-roaming area; knowledge of the entrances and exits from a region, and from where in each region an enemy might be able to see and/or ambush you; a path-fi nding engine based on the well-known A* algorithm; hooks into the collision system and world model, for line-of-sight (LOS) traces and other perceptions; a custom world model which tells the AI system where all the entities of interest (friends, enemies, obstacles) are, permits dynamic avoidance of moving objects, and so on. Kynapse also provides an architecture for the AI decision layer, including the concept of brains (one per character), agents (each of which is responsible for executing a specifi c task, such as moving from point to point, fi ring on an enemy, searching for enemies, etc.), and actions (responsible for allowing the character to perform a fundamental movement, which oft en results in playing animations on the character’s skeleton). 1.6.16. Game-Specifi c Subsystems On top of the gameplay foundation layer and the other low-level engine com- ponents, gameplay programmers and designers cooperate to implement the features of the game itself. Gameplay systems are usually numerous, highly varied, and specifi c to the game being developed. As shown in Figure 1.30, these systems include, but are certainly not limited to the mechanics of the player character, various in-game camera systems, artifi cial intelligence for the control of non-player characters (NPCs), weapon systems, vehicles, and GAME-SPECIFIC SUBSYSTEMS Game-Specific Rendering Terrain Rendering Water Simulation & Rendering etc. Player Mechanics Collision Manifold Movement State Machine & Animation Game Cameras Player-Follow Camera Debug Fly- Through Cam Fixed Cameras Scripted/Animated Cameras AI Sight Traces & Perception Path Finding (A*) Goals & Decision- Making Actions (Engine Interface) Camera-Relative Controls (HID) Weapons Power-Ups etc.Vehicles Puzzles Figure 1.30. Game-specifi c subsystems. 49 1.7. Tools and the Asset Pipeline the list goes on. If a clear line could be drawn between the engine and the game, it would lie between the game-specifi c subsystems and the gameplay foundations layer. Practically speaking, this line is never perfectly distinct. At least some game-specifi c knowledge invariably seeps down through the gameplay foundations layer and sometimes even extends into the core of the engine itself. 1.7. 
1.7. Tools and the Asset Pipeline

Any game engine must be fed a great deal of data, in the form of game assets, configuration files, scripts, and so on. Figure 1.31 depicts some of the types of game assets typically found in modern game engines. The thicker dark-grey arrows show how data flows from the tools used to create the original source assets all the way through to the game engine itself. The thinner light-grey arrows show how the various types of assets refer to or use other assets.

1.7.1. Digital Content Creation Tools

Games are multimedia applications by nature. A game engine's input data comes in a wide variety of forms, from 3D mesh data to texture bitmaps to animation data to audio files. All of this source data must be created and manipulated by artists. The tools that the artists use are called digital content creation (DCC) applications.

A DCC application is usually targeted at the creation of one particular type of data—although some tools can produce multiple data types. For example, Autodesk's Maya and 3ds Max are prevalent in the creation of both 3D meshes and animation data. Adobe's Photoshop and its ilk are aimed at creating and editing bitmaps (textures). SoundForge is a popular tool for creating audio clips. Some types of game data cannot be created using an off-the-shelf DCC app. For example, most game engines provide a custom editor for laying out game worlds. Still, some engines do make use of pre-existing tools for game world layout. I've seen game teams use 3ds Max or Maya as a world layout tool, with or without custom plug-ins to aid the user. Ask most game developers, and they'll tell you they can remember a time when they laid out terrain height fields using a simple bitmap editor, or typed world layouts directly into a text file by hand. Tools don't have to be pretty—game teams will use whatever tools are available and get the job done. That said, tools must be relatively easy to use, and they absolutely must be reliable, if a game team is going to be able to develop a highly polished product in a timely manner.

Figure 1.31. Tools and the asset pipeline (DCC tools such as Maya, 3ds Max, Photoshop, Sound Forge and Houdini feed exporters, whose output—meshes, skeletal hierarchies, animation curves, textures, materials, sound banks, particle systems, game object definitions and game world data—passes through the asset conditioning pipeline on its way into the game).

1.7.2. Asset Conditioning Pipeline

The data formats used by digital content creation (DCC) applications are rarely suitable for direct use in-game. There are two primary reasons for this.

1. The DCC app's in-memory model of the data is usually much more complex than what the game engine requires. For example, Maya stores a directed acyclic graph (DAG) of scene nodes, with a complex web of interconnections. It stores a history of all the edits that have been performed on the file. It represents the position, orientation, and scale of every object in the scene as a full hierarchy of 3D transformations, decomposed into translation, rotation, scale, and shear components.
A 51 game engine typically only needs a tiny fraction of this information in order to render the model in-game. 2. The DCC application’s fi le format is oft en too slow to read at run time, and in some cases it is a closed proprietary format. Therefore, the data produced by a DCC app is usually exported to a more ac- cessible standardized format, or a custom fi le format, for use in-game. Once data has been exported from the DCC app, it oft en must be fur- ther processed before being sent to the game engine. And if a game studio is shipping its game on more than one platform, the intermediate fi les might be processed diff erently for each target platform. For example, 3D mesh data might be exported to an intermediate format, such as XML or a simple binary format. Then it might be processed to combine meshes that use the same ma- terial, or split up meshes that are too large for the engine to digest. The mesh data might then be organized and packed into a memory image suitable for loading on a specifi c hardware platform. The pipeline from DCC app to game engine is sometimes called the asset conditioning pipeline . Every game engine has this in some form. 1.7.3. 3D Model/Mesh Data The visible geometry you see in a game is typically made up of two kinds of data. 1.7.3.1. Brush Geometry Brush geometry is defi ned as a collection of convex hulls, each of which is de- fi ned by multiple planes. Brushes are typically created and edited directly in the game world editor. This is what some would call an “old school” approach to creating renderable geometry, but it is still used. Pros: fast and easy to create; accessible to game designers—oft en used to “block out” a game level for prototyping purposes; can serve both as collision volumes and as renderable geometry. Cons: low-resolution – diffi cult to create complex shapes; cannot support articulated objects or animated characters. 1.7.3.2. 3D Models (Meshes) For detailed scene elements, 3D models (also referred to as meshes) are superior to brush geometry. A mesh is a complex shape composed of triangles and ver- 1.7. Tools and the Asset Pipeline 52 1. Introduction tices. (A mesh might also be constructed from quads or higher-order subdivi- sion surfaces. But on today’s graphics hardware, which is almost exclusively geared toward rendering rasterized triangles, all shapes must eventually be translated into triangles prior to rendering.) A mesh typically has one or more materials applied to it, in order to defi ne visual surface properties (color, re- fl ectivity, bumpiness, diff use texture , etc.). In this book, I will use the term “mesh ” to refer to a single renderable shape, and “model” to refer to a com- posite object that may contain multiple meshes, plus animation data and other metadata for use by the game. Meshes are typically created in a 3D modeling package such as 3ds Max, Maya, or Soft Image. A relatively new tool called ZBrush allows ultra high- resolution meshes to be built in a very intuitive way and then down-converted into a lower-resolution model with normal maps to approximate the high- frequency detail. Exporters must be writt en to extract the data from the digital content creation (DCC) tool (Maya, Max, etc.) and store it on disk in a form that is digestible by the engine. The DCC apps provide a host of standard or semi-standard export formats, although none are perfectly suited for game development (with the possible exception of COLLADA). 
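To make the contrast concrete, the runtime side of a mesh often boils down to little more than flat arrays of vertices and triangle indices—something along these lines (the layout is purely illustrative; every engine chooses its own attribute set, packing and compression):

// Illustrative only: one possible game-ready vertex/mesh layout.
#include <cstdint>
#include <vector>

struct Vertex
{
    float position[3];
    float normal[3];
    float uv[2];        // texture coordinates
};

struct Mesh
{
    std::vector<Vertex>   vertices;
    std::vector<uint16_t> indices;    // three indices per triangle
    uint32_t              materialId; // index into a separate material table
};

Standard interchange formats can of course carry this information, but rarely in the exact layout a particular engine wants to load.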
Therefore, game teams oft en create custom fi le formats and custom exporters to go with them. 1.7.4. Skeletal Animation Data A skeletal mesh is a special kind of mesh that is bound to a skeletal hierarchy for the purposes of articulated animation. Such a mesh is sometimes called a skin, because it forms the skin that surrounds the invisible underlying skel- eton. Each vertex of a skeletal mesh contains a list of indices indicating to which joint(s) in the skeleton it is bound. A vertex usually also includes a set of joint weights, specifying the amount of infl uence each joint has on the vertex. In order to render a skeletal mesh , the game engine requires three distinct kinds of data. 1. the mesh itself, 2. the skeletal hierarchy (joint names, parent-child relationships and the base pose the skeleton was in when it was originally bound to the mesh ), and 3. one or more animation clips, which specify how the joints should move over time. 53 The mesh and skeleton are oft en exported from the DCC application as a sin- gle data fi le. However, if multiple meshes are bound to a single skeleton, then it is bett er to export the skeleton as a distinct fi le. The animations are usually exported individually, allowing only those animations which are in use to be loaded into memory at any given time. However, some game engines allow a bank of animations to be exported as a single fi le, and some even lump the mesh, skeleton, and animations into one monolithic fi le. An unoptimized skeletal animation is defi ned by a stream of 4 × 3 matrix samples, taken at a frequency of at least 30 frames per second, for each of the joints in a skeleton (of which there are oft en 100 or more). Thus animation data is inherently memory-intensive. For this reason, animation data is almost al- ways stored in a highly compressed format. Compression schemes vary from engine to engine, and some are proprietary. There is no one standardized for- mat for game-ready animation data. 1.7.5. Audio Data Audio clips are usually exported from Sound Forge or some other audio pro- duction tool in a variety of formats and at a number of diff erent data sam- pling rates. Audio fi les may be in mono, stereo, 5.1, 7.1, or other multichannel confi gurations. Wave fi les (.wav) are common, but other fi le formats such as PlayStation ADPCM fi les (.vag and .xvag) are also commonplace. Audio clips are oft en organized into banks for the purposes of organization, easy loading into the engine, and streaming. 1.7.6. Particle Systems Data Modern games make use of complex particle eff ects. These are authored by artists who specialize in the creation of visual eff ects . Third-party tools, such as Houdini, permit fi lm-quality eff ects to be authored; however, most game engines are not capable of rendering the full gamut of eff ects that can be cre- ated with Houdini. For this reason, many game companies create a custom particle eff ect editing tool, which exposes only the eff ects that the engine actu- ally supports. A custom tool might also let the artist see the eff ect exactly as it will appear in-game. 1.7.7. Game World Data and the World Editor The game world is where everything in a game engine comes together. To my knowledge, there are no commercially available game world editors (i.e., the game world equivalent of Maya or Max). However, a number of commercially available game engines provide good world editors. 1.7. Tools and the Asset Pipeline 54 1. 
Introduction Some variant of the Radiant game editor is used by most game engines based on Quake technology; The Half-Life 2 Source engine provides a world editor called Hammer; UnrealEd is the Unreal Engine’s world editor. This powerful tool also serves as the asset manager for all data types that the engine can con- sume. Writing a good world editor is diffi cult, but it is an extremely important part of any good game engine. 1.7.8. Some Approaches to Tool Architecture A game engine’s tool suite may be architected in any number of ways. Some tools might be standalone pieces of soft ware, as shown in Figure 1.32. Some tools may be built on top of some of the lower layers used by the runtime en- gine, as Figure 1.33 illustrates. Some tools might be built into the game itself. For example, Quake - and Unreal -based games both boast an in-game console that permits developers and “modders” to type debugging and confi guration commands while running the game. As an interesting and unique example, Unreal ’s world editor and asset manager, UnrealEd , is built right into the runtime game engine. To run the editor, you run your game with a command-line argument of “editor.” This unique architectural style is depicted in Figure 1.34. It permits the tools to have total access to the full range of data structures used by the engine and OS Drivers Hardware (PC, XBOX360, PS3, etc.) 3rd Party SDKs Platform Independence Layer Core Systems Run-Time Engine Tools and World Builder Figure 1.32. Standalone tools architecture. 55 avoids a common problem of having to have two representations of every data structure – one for the runtime engine and one for the tools. It also means that running the game from within the editor is very fast (because the game is actually already running). Live in-game editing, a feature that is normally very tricky to implement, can be developed relatively easily when the editor is a part of the game. However, an in-engine editor design like this does have its share of problems. For example, when the engine is crashing, the tools become unusable as well. Hence a tight coupling between engine and asset creation tools can tend to slow down production. OS Drivers Hardware (PC, XBOX360, PS3, etc.) 3rd Party SDKs Platform Independence Layer Core Systems Run-Time Engine Tools and World Builder Figure 1.33. Tools built on a framework shared with the game. OS Drivers Hardware (PC, XBOX360, PS3, etc.) 3rd Party SDKs Platform Independence Layer Core Systems Run-Time Engine Other Tools World Builder Figure 1.34. UnrealEngine’s tool architecture. 1.7. Tools and the Asset Pipeline 57 2 Tools of the Trade Before we embark on our journey across the fascinating landscape of game engine architecture, it is important that we equip ourselves with some ba- sic tools and provisions. In the next two chapters, we will review the soft ware engineering concepts and practices that we will need during our voyage. In Chapter 2, we’ll explore the tools used by the majority of professional game engineers. Then in Chapter 3, we’ll round out our preparations by reviewing some key topics in the realms of object-oriented programming, design pat- terns, and large-scale C++ programming. Game development is one of the most demanding and broad areas of soft - ware engineering, so believe me, we’ll want to be well equipped if we are to safely navigate the sometimes-treacherous terrain we’ll be covering. For some readers, the contents of this chapter and the next will be very familiar. 
How- ever, I encourage you not to skip these chapters entirely. I hope that they will serve as a pleasant refresher; and who knows—you might even pick up a new trick or two. 2.1. Version Control A version control system is a tool that permits multiple users to work on a group of fi les collectively. It maintains a history of each fi le, so that changes 58 2. Tools of the Trade can be tracked and reverted if necessary. It permits multiple users to modify fi les—even the same fi le—simultaneously, without everyone stomping on each other’s work. Version control gets its name from its ability to track the version history of fi les. It is sometimes called source control, because it is pri- marily used by computer programmers to manage their source code. Howev- er, version control can be used for other kinds of fi les as well. Version control systems are usually best at managing text fi les, for reasons we will discover below. However, many game studios use a single version control system to manage both source code fi les (which are text) and game assets like textures, 3D meshes, animations, and audio fi les (which are usually binary). 2.1.1. Why Use Version Control? Version control is crucial whenever soft ware is developed by a team of mul- tiple engineers. Version control provides a central repository from which engineers can share source code; keeps a history of the changes made to each source fi le; provides mechanisms allowing specifi c versions of the code base to be tagged and later retrieved; permits versions of the code to be branched off from the main develop- ment line, a feature oft en used to produce demos or make patches to older versions of the soft ware. A source control system can be useful even on a single-engineer project. Al- though its multiuser capabilities won’t be relevant, its other abilities, such as maintaining a history of changes, tagging versions, creating branches for demos and patches, tracking bugs, etc., are still invaluable. 2.1.2. Common Version Control Systems Here are the most common source control systems you’ll probably encounter during your career as a game engineer. SCCS and RCS. The Source Code Control System (SCCS) and the Revi- sion Control System (RCS) are two of the oldest version control systems. Both employ a command-line interface. They are prevalent primarily on UNIX platforms. CVS. The Concurrent Version System (CVS) is a heavy-duty profession- al-grade command-line-based source control system, originally built on top of RCS (but now implemented as a standalone tool). CVS is preva- 59 2.1. Version Control lent on UNIX systems but is also available on other development plat- forms such as Microsoft Windows. It is open source and licensed under the Gnu General Public License (GPL). CVSNT (also known as WinCVS) is a native Windows implementation that is based on, and compatible with, CVS. Subversion. Subversion is an open source version control system aimed at replacing and improving upon CVS. Because it is open source and hence free, it is a great choice for individual projects, student projects, and small studios. Git. This is an open source revision control system that has been used for many venerable projects, including the Linux kernel. In the git development model, the programmer makes changes to fi les and commits the changes to a branch. 
The programmer can then merge his changes into any other code branch quickly and easily, because git “knows” how to rewind a sequence of diff s and reapply them onto a new base revision—a process git calls rebasing. The net result is a revision control system that is highly effi cient and fast when dealing with multiple code branches. More information on git can be found at htt p://git-scm.com/. Perforce. Perforce is a professional-grade source control system, with both text-based and GUI interfaces. One of Perforce’s claims to fame is its concept of change lists. A change list is a collection of source fi les that have been modifi ed as a logical unit. Change lists are checked into the repository atomically – either the entire change list is submitt ed, or none of it is. Perforce is used by many game companies, including Naughty Dog and Electronic Arts. NxN Alienbrain. Alienbrain is a powerful and feature-rich source control system designed explicitly for the game industry. Its biggest claim to fame is its support for very large databases containing both text source code fi les and binary game art assets, with a customizable user interface that can be targeted at specifi c disciplines such as artists, producers, or programmers. ClearCase. ClearCase is professional-grade source control system aimed at very large-scale soft ware projects. It is powerful and employs a unique user interface that extends the functionality of Windows Explor- er. I haven’t seen ClearCase used much in the game industry, perhaps because it is one of the more expensive version control systems. Microsoft Visual SourceSafe. SourceSafe is a light-weight source control package that has been used successfully on some game projects. 60 2. Tools of the Trade 2.1.3. Overview of Subversion and TortoiseSVN I have chosen to highlight Subversion in this book for a few reasons. First off , it’s free, which is always nice. It works well and is reliable, in my experience. A Subversion central repository is quite easy to set up; and as we’ll see, there are already a number of free repository servers out there, if you don’t want to go to the trouble of sett ing one up yourself. There are also a number of good Windows and Mac Subversion clients, such as the freely available Tortois- eSVN for Windows. So while Subversion may not be the best choice for a large commercial project (I personally prefer Perforce for that purpose), I fi nd it perfectly suited to small personal and educational projects. Let’s take a look at how to set up and use Subversion on a Microsoft Windows PC development platform. As we do so, we’ll review core concepts that apply to virtually any version control system. Subversion, like most other version control systems, employs a client- server architecture. The server manages a central repository, in which a ver- sion-controlled directory hierarchy is stored. Clients connect to the server and request operations, such as checking out the latest version of the directory tree, committ ing new changes to one or more fi les, tagging revisions, branch- ing the repository, and so on. We won’t discuss sett ing up a server here; we’ll assume you have a server, and instead we will focus on sett ing up and using the client. You can learn how to set up a Subversion server by reading Chap- ter 6 of [37]. However you probably will never need to do so, because you can always fi nd free Subversion servers. For example, Google provides free Subversion code hosting at htt p://code.google.com/. 2.1.4. 
Setting up a Code Repository on Google The easiest way to get started with Subversion is to visit htt p://code.google. com/ and set up a free Subversion repository. Create a Google user name and password if you don’t already have one, then navigate to Project Hosting un- der Developer Resources (see Figure 2.1). Click “Create a new project,” then enter a suitable unique project name, like “mygoogleusername-code.” You can enter a summary and/or description if you like, and even provide tags so that other users all over the world can search for and fi nd your repository. Click the “Create Project” butt on and you’re off to the races. Once you’ve created your repository, you can administer it on the Google Code website. You can add and remove users, control options, and perform a wealth of advanced tasks. But all you really need to do next is set up a Subver- sion client and start using your repository. 61 2.1.5. Installing TortoiseSVN TortoiseSVN is a popular front-end for Subversion. It extends the functionality of the Microsoft Windows Explorer via a convenient right-click menu and over- lay icons to show you the status of your version-controlled fi les and folders. To get TortoiseSVN, visit htt p://tortoisesvn.tigris.org/. Download the lat- est version from the download page. Install it by double-clicking the .msi fi le that you’ve downloaded and following the installation wizard’s instructions. Once TortoiseSVN is installed, you can go to any folder in Windows Ex- plorer and right-click—TortoiseSVN’s menu extensions should now be vis- ible. To connect to an existing code repository (such as one you created on Google Code), create a folder on your local hard disk and then right-click and select “SVN Checkout….” The dialog shown in Figure 2.2 will appear. In the “URL of repository” fi eld, enter your repository’s URL. If you are using Google Code, it should be htt ps://myprojectname.googlecode.com/svn/trunk, where myprojectname is whatever you named your project when you fi rst cre- ated it (e.g., “mygoogleusername-code”). If you forget the URL of your repository, just log in to htt p://code.google. com/, go to “Project Hosting” as before, sign in by clicking the “Sign in” link in the upper right-hand corner of the screen, and then click the Sett ings link, also found in the upper right-hand corner of the screen. Click the “My Profi le” tab, and you should see your project listed there. Your project’s URL is htt ps:// myprojectname.googlecode.com/svn/trunk, where myprojectname is whatever name you see listed on the “My Profi le” tab. You should now see the dialog shown in Figure 2.3. The user name should be your Google login name. The password is not your Google login 2.1. Version Control Figure 2.1. Google Code home page, Project Hosting link. 62 2. Tools of the Trade password—it is an automatically generated password that can be obtained by signing in to your account on Goggle’s “Project Hosting” page and clicking on the “Sett ings” link. (See above for details.) Checking the “Save authenti- cation” option on this dialog allows you to use your repository without ever having to log in again. Only select this option if you are working on your own personal machine—never on a machine that is shared by many users. Once you’ve authenticated your user name, TortoiseSVN will download (“check out”) the entire contents of your repository to your local disk. If you just set up your repository, this will be … nothing! The folder you created will still be empty. 
But now it is connected to your Subversion repository on Google (or wherever your server is located). If you refresh your Windows Explorer window (hit F5), you should now see a litt le green and white check- mark on your folder. This icon indicates that the folder is connected to a Sub- version repository via TortoiseSVN and that the local copy of the repository is up-to-date. 2.1.6. File Versions, Updating, and Committing As we’ve seen, one of the key purposes of any source control system like Sub- version is to allow multiple programmers to work on a single soft ware code base by maintaining a central repository or “master” version of all the source code on a server. The server maintains a version history for each fi le, as shown in Figure 2.4. This feature is crucial to large-scale multiprogrammer soft ware development. For example, if someone makes a mistake and checks in code that “breaks the build,” you can easily go back in time to undo those changes (and check the log to see who the culprit was!). You can also grab a snapshot of the code as it existed at any point in time, allowing you to work with, dem- onstrate, or patch previous versions of the soft ware. Figure 2.2. TortoiseSVN initial check-out dialog. Figure 2.3. TortoiseSVN user authentication dialog. 63 Each programmer gets a local copy of the code on his or her machine. In the case of TortoiseSVN, you obtain your initial working copy by “ checking out” the repository, as described above. Periodically you should update your local copy to refl ect any changes that may have been made by other program- mers. You do this by right-clicking on a folder and selecting “SVN Update” from the pop-up menu. You can work on your local copy of the code base without aff ecting the other programmers on the team (Figure 2.5). When you are ready to share your changes with everyone else, you commit your changes to the repository (also known as submitt ing or checking in). You do this by right-clicking on the folder you want to commit and selecting “ SVN Commit…” from the pop-up 2.1. Version Control Figure 2.6. TortoiseSVN Commit dialog. Foo.cpp (version 1) Foo.cpp (version 2) Foo.cpp (version 3) Foo.cpp (version 4) Bar.cpp (version 1) Bar.cpp (version 2) Bar.cpp (version 3) Figure 2.4. File version histories. Foo.cpp (version 4) Foo.cpp (local edits) Figure 2.5. Editing the local copy of a ver- sion-controlled fi le. 64 2. Tools of the Trade menu. You will get a dialog like the one shown in Figure 2.6, asking you to confi rm the changes. During a commit operation, Subversion generates a diff between your lo- cal version of each fi le and the latest version of that same fi le in the repository. The term “diff ” means diff erence, and it is typically produced by performing a line-by-line comparison of the two versions of the fi le. You can double-click on any fi le in the TortoiseSVN Commit dialog (Figure 2.6) to see the diff s be- tween your version and the latest version on the server (i.e., the changes you made). Files that have changed (i.e., any fi les that “have diff s”) are committ ed. This replaces the latest version in the repository with your local version, add- ing a new entry to the fi le’s version history. Any fi les that have not changed (i.e., your local copy is identical to the latest version in the repository) are ignored by default during a commit. An example commit operation is shown in Figure 2.7. If you created any new fi les prior to the commit, they will be listed as “non-versioned” in the Commit dialog. 
You can check the little check boxes beside them in order to add them to the repository. Any files that you deleted locally will likewise show up as "missing"—if you check their check boxes, they will be deleted from the repository. You can also type a comment in the Commit dialog. This comment is added to the repository's history log, so that you and others on your team will know why these files were checked in.

Figure 2.7. Committing local edits to the repository (the local edits to Foo.cpp become version 5, replacing version 4).

2.1.7. Multiple Check-Out, Branching, and Merging

Some version control systems require exclusive check-out. This means that you must first indicate your intentions to modify a file by checking it out and locking it. The file(s) that are checked out to you are writable on your local disk and cannot be checked out by anyone else. All other files in the repository are read-only on your local disk. Once you're done editing the file, you can check it in, which releases the lock and commits the changes to the repository for everyone else to see. The process of exclusively locking files for editing ensures that no two people can edit the same file simultaneously.

Subversion, CVS, Perforce, and many other high-quality version control systems also permit multiple check-out; i.e., you can be editing a file while someone else is editing that same file. Whichever user's changes are committed first become the latest version of the file in the repository. Any subsequent commits by other users require that programmer to merge his or her changes with the changes made by the programmer(s) who committed previously.

Because more than one set of changes (diffs) have been made to the same file, the version control system must merge the changes in order to produce a final version of the file. This is often not a big deal, and in fact many conflicts can be resolved automatically by the version control system. For example, if you changed function f() and another programmer changed function g(), then your edits would have been to a different range of lines in the file than those of the other programmer. In this case, the merge between your changes and his or her changes will usually resolve automatically without any conflicts. However, if you were both making changes to the same function f(), then the second programmer to commit his or her changes will need to do a three-way merge (see Figure 2.8).

For three-way merges to work, the version control server has to be smart enough to keep track of which version of each file you currently have on your local disk. That way, when you merge the files, the system will know which version is the base version (the common ancestor, such as version 4 in Figure 2.8).

Figure 2.8. Three-way merge due to local edits by two different users (joe_b and suzie_q both start editing Foo.cpp at version 4; suzie_q commits first, producing version 5; joe_b must then do a three-way merge involving two sets of diffs—version 4 to his local version, and version 4 to version 5—to produce version 6).

Subversion permits multiple check-out, and in fact it doesn't require you to check out files explicitly at all. You simply start editing the files locally—all files are writable on your local disk at all times. (By the way, this is one reason that Subversion doesn't scale well to large projects, in my opinion. To determine which files you have changed, Subversion must search the entire tree of source files, which can be slow. Version control systems like Perforce, which explicitly keep track of which files you have modified, are usually easier to work with when dealing with large amounts of code. But for small projects, Subversion's approach works just fine.)

When you perform a commit operation by right-clicking on any folder and selecting "SVN Commit…" from the pop-up menu, you may be prompted to merge your changes with changes made by someone else. But if no one has changed the file since you last updated your local copy, then your changes will be committed without any further action on your part. This is a very convenient feature, but it can also be dangerous. It's a good idea to always check your commits carefully to be sure you aren't committing any files that you didn't intend to modify. When TortoiseSVN displays its Commit Files dialog, you can double-click on an individual file in order to see the diffs you made prior to hitting the "OK" button.

2.1.8. Deleting Files

When a file is deleted from the repository, it's not really gone. The file still exists in the repository, but its latest version is simply marked "deleted" so that users will no longer see the file in their local directory trees. You can still see and access previous versions of a deleted file by right-clicking on the folder in which the file was contained and selecting "Show log" from the TortoiseSVN menu.

You can undelete a deleted file by updating your local directory to the version immediately before the version in which the file was marked deleted. Then simply commit the file again. This replaces the latest deleted version of the file with the version just prior to the deletion, effectively undeleting the file.

2.2. Microsoft Visual Studio

Compiled languages, such as C++, require a compiler and linker in order to transform source code into an executable program. There are many compilers/linkers available for C++, but for the Microsoft Windows platform the most commonly used package is probably Microsoft Visual Studio. The fully featured Professional Edition of the product can be purchased at any store that sells Windows software. And Visual Studio Express, its lighter-weight cousin, is available for free download at http://www.microsoft.com/express/download/. Documentation on Visual Studio is available online at the Microsoft Developer's Network (MSDN) site (http://msdn.microsoft.com/en-us/library/52f3sw5c.aspx).

Visual Studio is more than just a compiler and linker. It is an integrated development environment (IDE), including a slick and fully featured text editor for source code and a powerful source-level and machine-level debugger. In this book, our primary focus is the Windows platform, so we'll investigate Visual Studio in some depth. Much of what you learn below will be applicable to other compilers, linkers, and debuggers, so even if you're not planning on ever using Visual Studio, I suggest you skim this section for useful tips on using compilers, linkers, and debuggers in general.

2.2.1. Source Files, Headers, and Translation Units

A program written in C++ is comprised of source files. These typically have a .c, .cc, .cxx, or .cpp extension, and they contain the bulk of your program's source code.
Source fi les are technically known as translation units, because the com- piler translates one source fi le at a time from C++ into machine code. A special kind of source fi le, known as a header fi le, is oft en used in order to share information, such as type declarations and function prototypes, between translation units. Header fi les are not seen by the compiler. Instead, the C++ preprocessor replaces each #include statement with the contents of the corre- sponding header fi le prior to sending the translation unit to the compiler. This is a subtle but very important distinction to make. Header fi les exist as distinct fi les from the point of view of the programmer—but thanks to the preproces- sor’s header fi le expansion, all the compiler ever sees are translation units. 2.2.2. Libraries, Executables, and Dynamic Link Libraries When a translation unit is compiled, the resulting machine code is placed in an object fi le (fi les with a .obj extension under Windows, or .o under UNIX- based operating systems). The machine code in an object fi le is relocatable, meaning that the memory addresses at which the code re- sides have not yet been determined, and unlinked, meaning that any external references to functions and global data that are defi ned outside the translation unit have not yet been re- solved. Object fi les can be collected into groups called libraries. A library is simply an archive, much like a Zip or tar fi le, containing zero or more object fi les. Li- braries exist merely as a convenience, permitt ing a large number of object fi les to be collected into a single easy-to-use fi le. Object fi les and libraries are linked into an executable by the linker. The executable fi le contains fully resolved machine code that can be loaded and run by the operating system . The linker’s jobs are to calculate the fi nal relative addresses of all the machine code, as it will appear in memory when the program is run, and 2.2. Microsoft Visual Studio 68 2. Tools of the Trade to ensure that all external references to functions and global data made by each translation unit (object fi le) are properly resolved. It’s important to remember that the machine code in an executable fi le is still relocatable, meaning that the addresses of all instructions and data in the fi le are still relative to an arbitrary base address, not absolute. The fi nal absolute base address of the program is not known until the program is actually loaded into memory, just prior to running it. A dynamic link library (DLL) is a special kind of library that acts like a hybrid between a regular static library and an executable. The DLL acts like a library, because it contains functions that can be called by any number of diff erent executables. However, a DLL also acts like an executable, because it can be loaded by the operating system independently, and it contains some start-up and shut-down code that runs much the way the main() function in a C++ executable does. The executables that use a DLL contain partially linked machine code. Most of the function and data references are fully resolved within the fi nal execut- able, but any references to external functions or data that exist in a DLL re- main unlinked. When the executable is run, the operating system resolves the addresses of all unlinked functions by locating the appropriate DLLs, load- ing them into memory if they are not already loaded, and patching in the necessary memory addresses. 
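On Windows, the functions a DLL makes available are marked with the compiler keywords __declspec(dllexport) and __declspec(dllimport). The header below is a minimal sketch—the MYMATH_EXPORTS macro and the SquareRoot() function are made-up names used purely for illustration:

// mymath.h -- shared between the DLL and the executables that use it.
// MYMATH_EXPORTS is defined only by the project that builds the DLL itself
// (both the macro and the function are hypothetical).
#ifdef MYMATH_EXPORTS
    #define MYMATH_API __declspec(dllexport)  // building the DLL: export
#else
    #define MYMATH_API __declspec(dllimport)  // using the DLL: import
#endif

MYMATH_API float SquareRoot(float x);

An executable that calls SquareRoot() is only partially linked, as described above; the loader patches in the function's real address once the DLL has been brought into memory.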
Dynamically linked libraries are a very useful operating system feature, because individual DLLs can be updated without changing the executable(s) that use them. 2.2.3. Projects and Solutions Now that we understand the diff erence between libraries, executables, and dynamic link libraries (DLLs), let’s see how to create them. In Visual Studio, a project is a collection of source fi les which, when compiled, produce a library, an executable, or a DLL. Projects are stored in project fi les with a .vcproj ex- tension. In Visual Studio .NET 2003 (version 7), Visual Studio 2005 (version 8), and Visual Studio 2008 (version 9), .vcproj fi les are in XML format, so they are reasonably easy for a human to read and even edit by hand if necessary. All versions of Visual Studio since version 7 (Visual Studio 2003) employ solution fi les (fi les with a .sln extension) as a means of containing and manag- ing collections of projects. A solution is a collection of dependent and/or in- dependent projects intended to build one or more libraries, executables and/ or DLLs. In the Visual Studio graphical user interface , the Solution Explorer is usually displayed along the right or left side of the main window, as shown in Figure 2.9. 69 The Solution Explorer is a tree view. The solution itself is at the root, with the projects as its immediate children. Source fi les and headers are shown as children of each project. A project can contain any number of user-defi ned folders, nested to any depth. Folders are for organizational purposes only and have nothing to do with the folder structure in which the fi les may reside on-disk. However it is common practice to mimic the on-disk folder structure when sett ing up a project’s folders. 2.2.4. Build Confi gurations The C/C++ preprocessor, compiler, and linker off er a wide variety of options to control how your code will be built. These options are normally specifi ed on the command line when the compiler is run. For example, a typical com- mand to build a single translation unit with the Microsoft compiler might look like this: C:\> cl /c foo.cpp /Fo foo.obj /Wall /Od /Zi This tells the compiler/linker to compile but not link (/c) the translation unit named foo.cpp, output the result to an object fi le named foo.obj (/Fo foo.obj), turn on all warnings (/Wall), turn off all optimizations (/Od), and generate debugging information (/Zi). Modern compilers provide so many options that it would be impracti- cal and error prone to specify all of them every time you build your code. That’s where build confi gurations come in. A build confi guration is really just a collection of preprocessor, compiler, and linker options associated with a particular project in your solution. You can defi ne any number of build con- 2.2. Microsoft Visual Studio Figure 2.9. The VisualStudio Solution Explorer window. 70 2. Tools of the Trade fi gurations, name them whatever you want, and confi gure the preprocessor, compiler, and linker options diff erently in each confi guration. By default, the same options are applied to every translation unit in the project, although you can override the global project sett ings on an individual translation unit basis. (I recommend avoiding this if at all possible, because it becomes diffi cult to tell which .cpp fi les have custom sett ings and which do not.) 
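To give a feel for what two configurations of the same project amount to under the hood, here is roughly how the same translation unit might be compiled in each one, in the spirit of the cl command line shown earlier (the exact flag choices are illustrative, not a recommendation):

C:\> cl /c foo.cpp /D_DEBUG /Od /Zi /MDd
C:\> cl /c foo.cpp /DNDEBUG /O2 /MD

The first line disables optimizations (/Od), generates debugging information (/Zi) and selects the debug runtime library (/MDd); the second optimizes for speed (/O2) and uses the release runtime (/MD).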
Most projects have at least two build configurations, typically called "Debug" and "Release." The release build is for the final shipping software, while the debug build is for development purposes. A debug build runs more slowly than a release build, but it provides the programmer with invaluable information for developing and debugging the program.

2.2.4.1. Common Build Options

This section lists some of the most common options you'll want to control via build configurations for a game engine project.

Preprocessor Settings

The C++ preprocessor handles the expansion of #included files and the definition and substitution of #defined macros. One extremely powerful feature of all modern C++ preprocessors is the ability to define preprocessor macros via command-line options (and hence via build configurations). Macros defined in this way act as though they had been written into your source code with a #define statement. For most compilers, the command-line option for this is -D or /D, and any number of these directives can be used.

This feature allows you to communicate various build options to your code, without having to modify the source code itself. As a ubiquitous example, the symbol _DEBUG is always defined for a debug build, while in release builds the symbol NDEBUG is defined instead. The source code can check these flags and in effect "know" whether it is being built in debug or release mode. This is known as conditional compilation. For example:

void f()
{
#ifdef _DEBUG
    printf("Calling function f()\n");
#endif
    // ...
}

The compiler is also free to introduce "magic" preprocessor macros into your code, based on its knowledge of the compilation environment and target platform. For example, the macro __cplusplus is defined by most C/C++ compilers when compiling a C++ file. This allows code to be written that automatically adapts to being compiled for C or C++.

As another example, every compiler identifies itself to the source code via a "magic" preprocessor macro. When compiling code under the Microsoft compiler, the macro _MSC_VER is defined; when compiling under the GNU compiler (gcc), the macro __GNUC__ is defined instead; and so on for the other compilers. The target platform on which the code will be run is likewise identified via macros. For example, when building for a 32-bit Windows machine, the symbol _WIN32 is always defined. These key features permit cross-platform code to be written, because they allow your code to "know" what compiler is compiling it and on which target platform it is destined to be run.

Compiler Settings

One of the most common compiler options controls whether or not the compiler should include debugging information with the object files it produces. This information is used by debuggers to step through your code, display the values of variables, and so on. Debugging information makes your executables larger on disk and also opens the door for hackers to reverse-engineer your code. So, it is always stripped from the final shipping version of your executable. However, during development, debugging information is invaluable and should always be included in your builds.

The compiler can also be told whether or not to expand inline functions. When inline function expansion is turned off, every inline function appears only once in memory, at a distinct address.
This makes the task of tracing through the code in the debugger much simpler, but obviously comes at the expense of the execution speed improvements normally achieved by inlin- ing. Inline function expansion is but one example of generalized code trans- formations known as optimizations. The aggressiveness with which the com- piler att empts to optimize your code, and the kinds of optimizations its uses, can be controlled via compiler options. Optimizations have a tendency to re- order the statements in your code, and they also cause variables to be stripped out of the code altogether, or moved around, and can cause CPU registers to be reused for new purposes later in the same function. Optimized code usu- ally confuses most debuggers, causing them to “lie” to you in various ways, and making it diffi cult or impossible to see what’s really going on. As a result, all optimizations are usually turned off in a debug build. This permits every variable and every line of code to be scrutinized as it was originally coded. But, of course, such code will run much more slowly than its fully optimized counterpart. 2.2. Microsoft Visual Studio 72 2. Tools of the Trade Linker Settings The linker also exposes a number of options. You can control what type of output fi le to produce—an executable or a DLL. You can also specify which external libraries should be linked into your executable, and which directory paths to search in order to fi nd them. A common practice is to link with de- bug libraries when building a debug executable and with optimized libraries when building in release mode. Linker options also control things like stack size, the preferred base ad- dress of your program in memory, what type of machine the code will run on (for machine-specifi c optimizations), and a host of other minutia with which we will not concern ourselves here. 2.2.4.2. Typical Build Confi gurations Game projects oft en have more than just two build confi gurations. Here are a few of the common confi gurations I’ve seen used in game development. Debug. A debug build is a very slow version of your program, with all optimizations turned off , all function inlining disabled, and full debug- ging information included. This build is used when testing brand new code and also to debug all but the most trivial problems that arise dur- ing development. Release. A release build is a faster version of your program, but with debugging information and assertions still turned on. (See Section 3.3.3.3 for a discussion of assertions.) This allows you to see your game running at a speed representative of the fi nal product, but still gives you some opportunity to debug problems. Production. A production confi guration is intended for building the fi nal game that you will ship to your customers. It is sometimes called a “Fi- nal” build or “Disk” build. Unlike a release build, all debugging informa- tion is stripped out of a production build, all assertions are usually turned off , and optimizations are cranked all the way up. A production build is very tricky to debug, but it is the fastest and leanest of all build types. Tools. Some game studios utilize code libraries that are shared between offl ine tools and the game itself. In this scenario, it oft en makes sense to defi ne a “Tools” build, which can be used to conditionally compile shared code for use by the tools. The tools build usually defi nes a pre- processor macro (e.g., TOOLS_BUILD) that informs the code that it is be- ing built for use in a tool. 
For example, one of your tools might require certain C++ classes to expose editing functions that are not needed by the game. These functions could be wrapped in an #ifdef TOOLS_ 73 BUILD directive. Since you usually want both debug and release ver- sions of your tools, you will probably fi nd yourself creating two tools builds, named something like “ToolsDebug” and “ToolsRelease.” Hybrid Builds A hybrid build is a build confi guration in which the majority of the translation units are built in release mode, but a small subset of them is built in debug mode. This permits the segment of code that is currently under scrutiny to be easily debugged, while the rest of the code continues to run at full speed. With a text-based build system like make, it is quite easy to set up a hybrid build which permits users to specify the use of debug mode on a per-transla- tion-unit basis. In a nutshell, we defi ne a make variable called something like $HYBRID_SOURCES, which lists the names of all translation units (.cpp fi les) that should be compiled in debug mode for our hybrid build. We set up build rules for compiling both debug and release versions of every translation unit, and arrange for the resulting object fi les (.obj/.o) to be placed into two diff er- ent folders, one for debug and one for release. The fi nal link rule is set up to link with the debug versions of the object fi les listed in $HYBRID_SOURCES and with the release versions of all other object fi les. If we’ve set it up properly, make’s dependency rules will take care of the rest. Unfortunately, this is not so easy to do in Visual Studio, because its build confi gurations are designed to be applied on a per-project basis, not per trans- lation unit. The crux of the problem is that we cannot easily defi ne a list of the translation units that we want to build in debug mode. However, if your source code is already organized into libraries, you can set up a “Hybrid” build confi guration at the solution level, which picks and chooses between debug and release builds on a per-project (and hence per-library) basis. This isn’t as fl exible as having control on a per-translation-unit basis, but it does work reasonably well if your libraries are suffi ciently granular. Build Confi gurations and Testability The more build confi gurations your project supports, the more diffi cult test- ing becomes. Although the diff erences between the various confi gurations may be slight, there’s a fi nite probability that a critical bug may exist in one of them but not in the others. Therefore, each build confi guration must be tested equally thoroughly. Most game studios do not formally test their debug builds, because the debug confi guration is primarily intended for internal use during initial development of a feature and for the debugging of problems found in one of the other confi gurations. However, if your testers spend most of their time testing your release confi guration, then you cannot simply make a production build of your game the night before Gold Master and expect it 2.2. Microsoft Visual Studio 74 2. Tools of the Trade to have an identical bug profi le to that of the release build. Practically speak- ing, the test team must test both your release and production builds equally throughout alpha and beta, to ensure that there aren’t any nasty surprises lurking in your production build. 
In terms of testability, there is a clear advan- tage to keeping your build confi gurations to a minimum, and in fact some stu- dios have no production build for this reason—they simply ship their release build once it has been thoroughly tested. 2.2.4.3. Project Confi guration Tutorial Right-clicking on any project in the Solution Explorer and selecting “Proper- ties…” from the menu brings up the project’s “Property Pages” dialog. The tree view on the left shows various categories of sett ings. Of these, the three we will use most are Confi guration Properties/General, Confi guration Properties/Debugging, Confi guration Properties/C++, Confi guration Properties/Linker. Confi gurations Drop-Down Combo Box Notice the drop-down combo box labeled “Confi guration:” at the top-left cor- ner of the window. All of the properties displayed on these property pages ap- ply separately to each build confi guration. If you set a property for the debug confi guration, this does not necessarily mean that the same sett ing exists for the release confi guration. If you click on the combo box to drop down the list, you’ll fi nd that you can select a single confi guration or multiple confi gurations, including “All confi gurations.” As a rule of thumb, try to do most of your build confi guration editing with “All confi gurations” selected. That way, you won’t have to make the same edits multiple times, once for each confi guration—and you don’t risk sett ing things up incorrectly in one of the confi gurations by accident. How- ever, be aware that some sett ings need to be diff erent between the debug and release confi gurations. For example, function inlining and code optimization sett ings should of course be diff erent between debug and release builds. General Tab On the General tab, shown in Figure 2.10, the most useful fi elds are the fol- lowing. Output directory. This defi nes where the fi nal product(s) of the build will go—namely the executable, library, or DLL that the compiler/linker ul- timately outputs. 75 Intermediate directory. This defi nes where intermediate fi les, primarily object fi les (.obj extension), are placed during a build. Intermediate fi les are never shipped with your fi nal program—they are only required during the process of building your executable, library, or DLL. Hence, it is a good idea to place intermediate fi les in a diff erent directory than the fi nal products (.exe, .lib or .dll fi les). Note that VisualStudio provides a macro facility which may be used when specifying directories and other sett ings in the “Project Property Pages” dialog. A macro is essentially a named variable that contains a global value and that can be referred to in your project confi guration sett ings. Macros are invoked by writing the name of the macro enclosed in paren- theses and prefi xed with a dollar sign (e.g., $(ConfigurationName)). Some commonly used macros are listed below. $(TargetFileName). The name of the fi nal executable, library, or DLL fi le being built by this project. $(TargetPath). The full path of the folder containing the fi nal execut- able, library, or DLL. $(ConfigurationName). The name of the build confi g, typically “De- bug” or “Release.” 2.2. Microsoft Visual Studio Figure 2.10. Visual Studio project property pages—General page. 76 2. Tools of the Trade $(OutDir). The value of the “Output Directory” fi eld specifi ed in this dialog. $(IntDir). The value of the “Intermediate Directory” fi eld in this dialog. $(VCInstallDir). 
The directory in which Visual Studio’s standard C library is currently installed. The benefi t of using macros instead of hard-wiring your confi guration sett ings is that a simple change of the global macro’s value will automatically aff ect all confi guration sett ings in which the macro is used. Also, some macros like $(ConfigurationName) automatically change their values depending on the build confi guration, so using them can permit you to use identical set- tings across all your confi gurations. To see a complete list of all available macros, click in either the “Output Directory” fi eld or the “Intermediate Directory” fi eld on the “General” tab, click the litt le arrow to the right of the text fi eld, select “Edit…” and then click the “Macros” butt on in the dialog that comes up. Debugging Tab The “Debugging” tab is where the name and location of the executable to debug is specifi ed. On this page, you can also specify the command-line argument(s) that should be passed to the program when it runs. We’ll discuss debugging your program in more depth below. C/C++ Tab The C/C++ tab controls compile-time language sett ings—things that aff ect how your source fi les will be compiled into object fi les (.obj extension). The sett ings on this page do not aff ect how your object fi les are linked into a fi nal executable or DLL. You are encouraged to explore the various subpages of the C/C++ tab to see what kinds of sett ings are available. Some of the most commonly used set- tings include the following. General Tab/Include Directories. This fi eld lists the on-disk directories that will be searched when looking for #included header fi les. Important: It is always best to specify these directories using relative paths and/or with Visual Studio macros like $(OutDir) or $(IntDir). That way, if you move your build tree to a diff erent location on disk or to another computer with a diff erent root folder, everything will continue to work properly. General Tab/Debug Information. This fi eld controls whether or not debug information is generated. Typically both debug and release confi gura- 77 tions include debugging information so that you can track down prob- lems during development of your game. The fi nal production build will have all the debug info stripped out to prevent hacking. Preprocessor Tab/Preprocessor Defi nitions. This very handy fi eld lists any number of C/C++ preprocessor symbols that should be defi ned in the code when it is compiled. See Preprocessor Sett ings in Section 2.2.4.1 for a discussion of preprocessor-defi ned symbols. Linker Tab The “Linker” tab lists properties that aff ect how your object code fi les will be linked into an executable or DLL. Again, you are encouraged to explore the various subpages. Some commonly used sett ings follow. General Tab/Output File. This sett ing lists the name and location of the fi nal product of the build, usually an executable or DLL. General Tab/Additional Library Directories. Much like the C/C++ Include Directories fi eld, this fi eld lists zero or more directories that will be searched when looking for libraries and object fi les to link into the fi nal executable. Input Tab/Additional Dependencies. This fi eld lists external libraries that you want linked into your executable or DLL. For example, the Ogre libraries would be listed here if you are building an Ogre-enabled application. Note that Visual Studio employs various “magic spells” to specify librar- ies that should be linked into an executable. 
For example, a special #pragma instruction in your source code can be used to instruct the linker to automati- cally link with a particular library. For this reason, you may not see all of the libraries you’re actually linking to in the “Additional Dependencies” fi eld. (In fact, that’s why they are called additional dependencies.) You may have noticed, for example, that Direct X applications do not list all of the DirectX libraries manually in their “Additional Dependencies” fi eld. Now you know why. 2.2.4.4. Creating New .vcproj Files With so many preprocessor, compiler, and linker options, all of which must be set properly, creating a new project may seem like an impossibly daunting task. I usually take one of the following two approaches when creating a new Visual Studio project. Use a Wizard Visual Studio provides a wide variety of wizards to create new projects of various kinds. If you can fi nd a wizard that does what you want, this is the easiest way to create a new project. 2.2. Microsoft Visual Studio 78 2. Tools of the Trade Copy an Existing Project If I am creating a project that is similar to an existing project that I know al- ready works, I’ll oft en just copy that .vcproj fi le and then modify it as neces- sary. In Visual Studio 2005, this is very easy. You simply copy the .vcproj fi le on disk, then add the newly copied project to your solution by right-clicking the solution in the Solution Explorer and selecting “Add…” and “Existing project…” from the pop-up menus. One caveat when copying project fi les is that the name of the project is stored inside the .vcproj fi le itself. So when you load up the new project for the fi rst time in Visual Studio 2005, it will still have the original name. To rectify this, you can simply select the project in the Solution Explorer window, and hit F2 to rename it appropriately. Another problem arises when the name of the executable, library, or DLL that the project creates is specifi ed explicitly in the .vcproj fi le. For example, the executable might be specifi ed as “C:\MyGame\bin\MyGame.exe” or “$(OutDir)\MyGame.exe.” In this case, you’ll need to open the .vcproj fi le and do a global search-and-replace of the executable, library, or DLL name and/or its directory path. This is not too diffi cult. Project fi les are XML, so you can rename your copied .vcproj fi le to have an “.xml” extension and then open it in Visual Studio (or any other XML or raw text editor). One elegant solution to this problem is to use Visual Studio’s macro system when specifying all out- put fi les in your project. For example, if you specify the output executable as “$(OutDir)\$(ProjectName).exe”, then the project’s name will automati- cally be refl ected in the name of the output executable fi le. I should mention that using a text editor to manipulate .vcproj fi les is not always to be avoided. In fact, the practice is quite common, at least in my ex- perience. For example, let’s say you decided to move the folder containing all of your graphics header fi les to a new path on disk. Rather than manually open each project in turn, open the Project Property Pages window, navigate to the C/C++ tab, and fi nally update the include path manually, it’s much easier and less error-prone to edit the fi les as XML text and do a search-and-replace. You can even do a “Replace in fi les” operation in Visual Studio for mass edits. 2.2.5. Debugging Your Code One of the most important skills any programmer can learn is how to eff ec- tively debug code. 
This section provides some useful debugging tips and tricks. Some are applicable to any debugger and some are specifi c to Microsoft Visual Studio. However, you can usually fi nd an equivalent to Visual Studio’s debugging features in other debuggers, so this section should prove useful even if you don’t use Visual Studio to debug your code. 79 2.2.5.1. The Start-Up Project A Visual Studio solution can contain more than one project. Some of these projects build executables, while others build libraries or DLLs. It’s possible to have more than one project that builds an executable in a single solution. However, you cannot debug more than one program at a time. For this reason, Visual Studio provides a sett ing known as the “Start-Up Project.” This is the project that is considered “current” for the purposes of the debugger. The start-up project is highlighted in bold in the Solution Explorer. Hitt ing F5 to run your program in the debugger will run the .exe built by the start-up project (if the start-up project builds an executable). 2.2.5.2. Break Points Break points are the bread and butt er of code debugging. A break point in- structs the program to stop at a particular line in your source code so that you can inspect what’s going on. In Visual Studio, select a line and hit F9 to toggle a break point. When you run your program and the line of code containing the break point is about to be executed, the debugger will stop the program. We say that the break point has been “hit.” A litt le arrow will show you which line of code the CPU’s pro- gram counter is currently on. This is shown in Figure 2.11. 2.2. Microsoft Visual Studio Figure 2.11. Setting a break point in Visual Studio. 2.2.5.3. Stepping through Your Code Once a break point has been hit, you can single-step your code by hitt ing the F10 key. The yellow program-counter arrow moves to show you the lines as they execute. Hitt ing F11 steps into a function call (i.e., the next line of code you’ll see is the fi rst line of the called function), while F10 steps over that func- 80 2. Tools of the Trade tion call (i.e., the debugger calls the function at full speed and then breaks again on the line right aft er the call). 2.2.5.4. The Call Stack The call stack window, shown in Figure 2.12, shows you the stack of functions that have been called at any given moment during the execution of your code. To display the call stack (if it is not already visible), go to the “Debug” menu on the main menu bar, select “Windows,” and then “Call Stack.” Once a break point has been hit (or the program is manually paused), you can move up and down the call stack by double-clicking on entries in the “Call Stack” window. This is very useful for inspecting the chain of function calls that were made between main() and the current line of code. For example, you might trace back to the root cause of a bug in a parent function which has manifested itself in a deeply nested child function. Figure 2.12. The call stack window. 2.2.5.5. The Watch Window As you step through your code and move up and down the call stack, you will want to be able to inspect the values of the variables in your program. This is what watch windows are for. To open a watch window, go to the “Debug” menu, select “Windows…,” then select “Watch…,” and fi nally select one of “Watch 1” through “Watch 4.” (Visual Studio allows you to open up to four watch windows simultaneously.) 
Once a watch window is open, you can type the names of variables into the window or drag expressions in directly from your source code. As you can see in Figure 2.13, variables with simple data types are shown with their values listed immediately to the right of their names. Complex data types are shown as litt le tree views that can be easily expanded to “drill 81 down” into virtually any nested structure. The base class of a class is always shown as the fi rst child of an instance of a derived class. This allows you to inspect not only the class’ data members, but also the data members of its base class(es). You can type virtually any valid C/C++ expression into the watch window, and Visual Studio will evaluate that expression and att empt to display the resulting value. For example, you could type “5+3” and Visual Studio will display “8.” You can cast variables from one type to another by using C or C++ casting syntax. For example, typing “(float)myIntegerVariable * 0.5f” in the watch window will display the value of myIntegerVariable divided by two, as a fl oating-point value. You can even call functions in your program from within the watch window. Visual Studio re-evaluates the expressions typed into the watch window(s) automatically, so if you invoke a function in the watch window, it will be called every time you hit a break point or single-step your code. This allows you to leverage the functionality of your program in order to save yourself work when trying to interpret the data that you’re inspecting in the debug- ger. For example, let’s say that your game engine provides a function called quatToAngleDeg() which converts a quaternion to an angle of rotation in degrees. You can call this function in the watch window in order to easily in- spect the rotation angle of any quaternion from within the debugger. You can also use various suffi xes on the expressions in the watch window in order to change the way Visual Studio displays the data, as shown in Fig- ure 2.14. The “,d” suffi x forces values to be displayed in decimal notation. The “,x” suffi x forces values to be displayed in hexadecimal notation. 2.2. Microsoft Visual Studio Figure 2.13. Visual Studio’s watch window. 82 2. Tools of the Trade The “,n” suffi x (where n is any positive integer) forces Visual Studio to treat the value as an array with n elements. This allows you to expand array data that is referenced through a pointer. Be careful when expanding very large data structures in the watch window, be- cause it can sometimes slow the debugger down to the point of being unusable. 2.2.5.6. Data Break Points Regular break points trip when the CPU’s program counter hits a particular machine instruction or line of code. However, another incredibly useful fea- ture of modern debuggers is the ability to set a break point that trips when- ever a specifi c memory address is writt en to (i.e., changed). These are called data break points, because they are triggered by changes to data, or sometimes hardware break points, because they are implemented via a special feature of the CPU’s hardware—namely the ability to raise an interrupt when a predefi ned memory address is writt en to. Here’s how data break points are typically used. Let’s say you are tracking down a bug that manifests itself as a zero (0.0f) value mysteriously appear- ing inside a member variable of a particular object called m_angle that should always contain a nonzero angle. You have no idea which function might be writing that zero into your variable. 
However, you do know the address of the variable. (You can just type “&object.m_angle” into the watch window to fi nd its address.) To track down the culprit, you can set a data break point on the address of object.m_angle, and then simply let the program run. When the value changes, the debugger will stop automatically. You can then inspect the call stack to catch the off ending function red-handed. To set a data break point in Visual Studio, take the following steps. Bring up the “Breakpoints” window found on the “Debug” menu under “Windows” and then “Breakpoints” (Figure 2.15). Select the “New” drop-down butt on in the upper-left corner of the win- dow. Figure 2.14. Comma suffi xes in the Visual Studio watch window. 83 Select “New Data Breakpoint.” Type in the raw address or an address-valued expression, such as “&myVariable” (Figure 2.16). The “Byte count” fi eld should almost always contain the value 4. This is because 32-bit Pentium CPUs can really only inspect 4-byte (32-bit) values na- tively. Specifying any other data size requires the debugger to do some trickery which tends to slow your program’s execution to a crawl (if it works at all). 2.2.5.7. Conditional Break Points You’ll also notice in the “Break Points” window that you can set conditions and hit counts on any type break point—data break points or regular line-of- code break points. A conditional break point causes the debugger to evaluate the C/C++ expres- sion you provide every time the break point is hit. If the expression is true, the debugger stops your program and gives you a chance to see what’s going on. If the expression is false, the break point is ignored and the program contin- ues. This is very useful for sett ing break points that only trip when a function is called on a particular instance of a class. For example, let’s say you have a game level with 20 tanks on-screen, and you want to stop your program Figure 2.16. Defi ning a data break point. Figure 2.15. The Visual Studio break points window. 2.2. Microsoft Visual Studio 84 2. Tools of the Trade when the third tank, whose memory address you know to be 0x12345678, is running. By sett ing the break point’s condition express to something like “(unsigned)this == 0x12345678”, you can restrict the break point only to the class instance whose memory address (this pointer) is 0x12345678. Specifying a hit count for a break point causes the debugger to decrement a counter every time the break point is hit, and only actually stop the program when that counter reaches zero. This is really useful for situations where your break point is inside a loop, and you need to inspect what’s happening during the 376th iteration of the loop (e.g., the 376th element in an array). You can’t very well sit there and hit the F5 key 375 times! But you can let the hit count feature of Visual Studio do it for you. One note of caution: Conditional break points cause the debugger to eval- uate the conditional expression every time the break point is hit, so they can bog down the performance of the debugger and your game. 2.2.5.8. Debugging Optimized Builds I mentioned above that it can be very tricky to debug problems using a release build, due primarily to the way the compiler optimizes the code. Ideally, every programmer would prefer to do all of his or her debugging in a debug build. However, this is oft en not possible. Sometimes a bug occurs so rarely that you’ll jump at any chance to debug the problem, even if it occurs in a release build on someone else’s machine. 
Other bugs only occur in your release build, but magically disappear whenever you run the debug build. These dreaded release-only bugs are sometimes caused by uninitialized variables, because vari- ables and dynamically allocated memory blocks are oft en set to zero in debug mode, but are left containing garbage in a release build. Other common causes of release-only bugs include code that has been accidentally omitt ed from the release build (e.g., when important code is erroneously placed inside an asser- tion statement), data structures whose size or data member packing changes between debug and release builds, bugs that are only triggered by inlining or compiler-introduced optimizations, and (in rare cases) bugs in the compiler’s optimizer itself, causing it to emit incorrect code in a fully optimized build. Clearly, it behooves every programmer to be capable of debugging prob- lems in a release build, unpleasant as it may seem. The best ways to reduce the pain of debugging optimized code is to practice doing it and to expand your skill set in this area whenever you have the opportunity. Here are a few tips. Learn to read and step through disassembly in the debugger. In a release build, the debugger oft en has trouble keeping track of which line of source code is currently being executed. Thanks to instruction reordering, you’ll oft en see the program counter jump around erratically within the 85 function when viewed in source code mode. However, things become sane again when you work with the code in disassembly mode (i.e., step through the assembly language instructions individually). Every C/C++ programmer should be at least a litt le bit familiar with the architecture and assembly language of their target CPU(s). That way, even if the de- bugger is confused, you won’t be. Use registers to deduce variables’ values or addresses. The debugger will sometimes be unable to display the value of a variable or the contents of an object in a release build. However, if the program counter is not too far away from the initial use of the variable, there’s a good chance its ad- dress or value is still stored in one of the CPU’s registers. If you can trace back through the disassembly to where the variable is fi rst loaded into a register, you can oft en discover its value or its address by inspecting that register. Use the register window, or type the name of the register into a watch window, to see its contents. Inspect variables and object contents by address. Given the address of a vari- able or data structure, you can usually see its contents by casting the address to the appropriate type in a watch window. For example, if we know that an instance of the Foo class resides at address 0x1378A0C0, we can type “(Foo*)0x1378A0C0” in a watch window, and the debugger will interpret that memory address as if it were a pointer to a Foo object. Leverage static and global variables. Even in an optimized build, the de- bugger can usually inspect global and static variables. If you cannot de- duce the address of a variable or object, keep your eye open for a static or global that might contain its address, either directly or indirectly. For example, if we want to fi nd the address of an internal object within the physics system, we might discover that it is in fact stored in a member variable of the global PhysicsWorld object. Modify the code. If you can reproduce a release-only bug relatively eas- ily, consider modifying the source code to help you debug the problem. 
Add print statements so you can see what’s going on. Introduce a global variable to make it easier to inspect a problematic variable or object in the debugger. Add code to detect a problem condition or to isolate a particular instance of a class. 2.3. Profi ling Tools Games are typically high-performance real-time programs. As such, game en- gine programmers are always looking for ways to speed up their code. There 2.3. Profi ling Tools 86 2. Tools of the Trade is a well-known, albeit rather unscientifi c, rule of thumb known as the Pareto principle (see htt p://en.wikipedia.org/wiki/Pareto_principle). It is also known as the 80-20 rule, because it states that in many situations, 80% of the eff ects of some event come from only 20% of the possible causes. In computer sci- ence, we oft en use a variant of this principle known as the 90-10 rule, which states that 90% of the wall clock time spent running any piece of soft ware is accounted for by only 10% of the code. In other words, if you optimize 10% of your code, you can potentially realize 90% of all the gains in execution speed you’ll ever realize. So, how do you know which 10% of your code to optimize? For that, you need a profi ler. A profi ler is a tool that measures the execution time of your code. It can tell you how much time is spent in each function. You can then di- rect your optimizations toward only those functions that account for the lion’s share of the execution time. Some profi lers also tell you how many times each function is called. This is an important dimension to understand. A function can eat up time for two reasons: (a) it takes a long time to execute on its own, or (b) it is called fre- quently. For example, a function that runs an A* algorithm to compute the optimal paths through the game world might only be called a few times each frame, but the function itself may take a signifi cant amount of time to run. On the other hand, a function that computes the dot product may only take a few cycles to execute, but if you call it hundreds of thousands of times per frame, it might drag down your game’s frame rate. Even more information can be obtained, if you use the right profi ler. Some profi lers report the call graph, meaning that for any given function, you can see which functions called it (these are known as parent functions) and which functions it called (these are known as child functions or descendants). You can even see what percentage of the function’s time was spent calling each of its descendants and the percentage of the overall running time accounted for by each individual function. Profi lers fall into two broad categories. 1. Statistical profi lers. This kind of profi ler is designed to be unobtrusive, meaning that the target code runs at almost the same speed, wheth- er or not profi ling is enabled. These profi lers work by sampling the CPU’s program counter register periodically and noting which func- tion is currently running. The number of samples taken within each function yields an approximate percentage of the total running time that is eaten up by that function. Intel’s VTune is the gold standard in statistical profi lers for Windows machines employing Intel Pentium processors, and it is now also available for Linux. See htt p://www. 87 intel.com/cd/soft ware/products/ asmo-na /eng /vtune /239144.htm for details. 2. Instrumenting profi lers. 
This kind of profi ler is aimed at providing the most accurate and comprehensive timing data possible, but at the ex- pense of real-time execution of the target program—when profi ling is turned on, the target program usually slows to a crawl. These profi lers work by preprocessing your executable and inserting special prologue and epilogue code into every function. The prologue and epilogue code calls into a profi ling library, which in turn inspects the program’s call stack and records all sorts of details, including which parent function called the function in question and how many times that parent has called the child. This kind of profi ler can even be set up to monitor every line of code in your source program, allowing it to report how long each line is taking to execute. The results are stunningly accurate and com- prehensive, but turning on profi ling can make a game virtually unplay- able. IBM’s Rational Quantify, available as part of the Rational Purify Plus tool suite, is an excellent instrumenting profi ler. See htt p://www. ibm.com/developerworks/rational/library/957.html for an introduction to profi ling with Quantify. Microsoft has also published a profi ler that is a hybrid between the two approaches. It is called LOP, which stands for low-overhead profi ler. It uses a statistical approach, sampling the state of the processor periodically, which means it has a low impact on the speed of the program’s execution. However, with each sample it analyzes the call stack, thereby determining the chain of parent functions that resulted in each sample. This allows LOP to provide information normally not available with a statistical profi ler, such as the dis- tribution of calls across parent functions. 2.3.1. List of Profi lers There are a great many profi ling tools available. See htt p://en.wikipedia.org/ wiki/List_of_performance_analysis_tool for a reasonably comprehensive list. 2.4. Memory Leak and Corruption Detection Two other problems that plague C and C++ programmers are memory leaks and memory corruption. A memory leak occurs when memory is allocated but never freed. This wastes memory and eventually leads to a potentially fatal out-of-memory condition. Memory corruption occurs when the program inadvertently writes data to the wrong memory location, overwriting the im- 2.4. Memory Leak and Corruption Detection 88 2. Tools of the Trade portant data that was there—while simultaneously failing to update the mem- ory location where that data should have been writt en. Blame for both of these problems falls squarely on the language feature known as the pointer. A pointer is a powerful tool. It can be an agent of good when used prop- erly—but it can also be all-too-easily transformed into an agent of evil. If a pointer points to memory that has been freed, or if it is accidentally assigned a nonzero integer or fl oating-point value, it becomes a dangerous tool for cor- rupting memory, because data writt en through it can quite literally end up anywhere. Likewise, when pointers are used to keep track of allocated mem- ory, it is all too easy to forget to free the memory when it is no longer needed. This leads to memory leaks. Clearly good coding practices are one approach to avoiding pointer-re- lated memory problems. And it is certainly possible to write solid code that essentially never corrupts or leaks memory. Nonetheless, having a tool to help you detect potential memory corruption and leak problems certainly can’t hurt. Thankfully, many such tools exist. 
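To make the two failure modes concrete, here is a small, deliberately buggy C++ sketch (my own illustration, not drawn from any particular tool's documentation) showing the kinds of defects that leak and corruption detectors are designed to catch:

#include <cstring>

void leak()
{
    float* pSamples = new float[256]; // allocated...
    pSamples[0] = 1.0f;
    // ...but never deleted: every call to leak() loses 1 KiB (a memory leak)
}

void corrupt()
{
    char* pName = new char[8];
    std::strcpy(pName, "this string is far too long"); // writes past the end of
                                                       // the allocation, silently
                                                       // trashing adjacent memory
    delete[] pName;
    pName[0] = '!'; // write through a dangling pointer (use-after-free)
}

A tool that instruments allocations and pointer dereferences can flag all three of these errors at the moment they occur.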
My personal favorite is IBM’s Rational Purify, which comes as part of the Purify Plus tool kit. Purify instruments your code prior to running it, in order to hook into all pointer dereferences and all memory allocations and dealloca- tions made by your code. When you run your code under Purify, you get a live report of the problems—real and potential—encountered by your code. And when the program exits, you get a detailed memory leak report. Each problem is linked directly to the source code that caused the problem, making tracking down and fi xing these kinds of problems relatively easy. You can fi nd more information on Purify at htt p://www-306.ibm.com/soft ware/awdtools /purify. Another popular tool is Bounds Checker by CompuWare. It is similar to Purify in purpose and functionality. You can fi nd more information on Bounds Checker at htt p://www.compuware.com/products/devpartner/visualc .htm. 2.5. Other Tools There are a number of other commonly used tools in a game programmer’s toolkit. We won’t cover them in any depth here, but the following list will make you aware of their existence and point you in the right direction if you want to learn more. Diff erence tools. A diff erence tool, or diff tool, is a program that com- pares two versions of a text fi le and determines what has changed be- 89 tween them. (See htt p://en.wikipedia.org/wiki/Diff for a discussion of diff tools.) Diff s are usually calculated on a line-by-line basis, although modern diff tools can also show you a range of characters on a changed line that have been modifi ed. Most version control systems come with a diff tool. Some programmers like a particular diff tool and confi gure their version control soft ware to use the tool of their choice. Popular tools include ExamDiff (htt p://www.prestosoft .com/edp_examdiff .asp), AraxisMerge (htt p://www.araxis.com), WinDiff (available in the Op- tions Packs for most Windows versions and available from many inde- pendent websites as well), and the GNU diff tools package (htt p://www. gnu.org/soft ware/diff utils/diff utils.html). Three-way merge tools. When two people edit the same fi le, two inde- pendent sets of diff s are generated. A tool that can merge two sets of diff s into a fi nal version of the fi le that contains both person’s changes is called a three-way merge tool. The name “three-way” refers to the fact that three versions of the fi le are involved: the original, user A’s version, and user B’s version. (See htt p://en.wikipedia.org/wiki/3-way_ merge#Three-way_merge for a discussion of two-way and three-way merge technologies.) Many merge tools come with an associated diff tool. Some popular merge tools include AraxisMerge (htt p://www.arax- is.com) and WinMerge (htt p://winmerge.org). Perforce also comes with an excellent three-way merge tool (htt p://www.perforce.com/perforce/ products/merge.html). Hex editors. A hex editor is a program used for inspecting and modify- ing the contents of binary fi les. The data are usually displayed as in- tegers in hexadecimal format, hence the name. Most good hex editors can display data as integers from one byte to 16 bytes each, in 32- and 64-bit fl oating point format and as ASCII text. Hex editors are particu- larly useful when tracking down problems with binary fi le formats or when reverse-engineering an unknown binary format—both of which are relatively common endeavors in game engine development circles. 
There are quite literally a million diff erent hex editors out there; I’ve had good luck with HexEdit by Expert Commercial Soft ware (htt p:// www.expertcomsoft .com/index.html), but your mileage may vary. As a game engine programmer you will undoubtedly come across other tools that make your life easier, but I hope this chapter has covered the main tools you’ll use on a day-to-day basis. 2.5. Other Tools 91 3 Fundamentals of Software Engineering for Games In this chapter, we’ll briefl y review the basic concepts of object-oriented pro- gramming and then delve into some advanced topics which should prove invaluable in any soft ware engineering endeavor (and especially when creat- ing games). As with Chapter 2, I hope you will not to skip this chapter en- tirely; it’s important that we all embark on our journey with the same set of tools and supplies. 3.1. C++ Review and Best Practices 3.1.1. Brief Review of Object-Oriented Programming Much of what we’ll discuss in this book assumes you have a solid understand- ing of the principles of object-oriented design. If you’re a bit rusty, the follow- ing section should serve as a pleasant and quick review. If you have no idea what I’m talking about in this section, I recommend you pick up a book or two on object-oriented programming (e.g., [5]) and C++ in particular (e.g., [39] and [31]) before continuing. 3.1.1.1. Classes and Objects A class is a collection of att ributes (data) and behaviors (code) which together form a useful, meaningful whole. A class is a specifi cation describing how in- 92 3. Fundamentals of Software Engineering for Games dividual instances of the class, known as objects, should be constructed. For example, your pet Rover is an instance of the class “dog.” Thus there is a one- to-many relationship between a class and its instances. 3.1.1.2. Encapsulation Encapsulation means that an object presents only a limited interface to the out- side world; the object’s internal state and implementation details are kept hid- den. Encapsulation simplifi es life for the user of the class, because he or she need only understand the class’ limited interface, not the potentially intricate details of its implementation. It also allows the programmer who wrote the class to ensure that its instances are always in a logically consistent state. 3.1.1.3. Inheritance Inheritance allows new classes to be defi ned as extensions to pre-existing class- es. The new class modifi es or extends the data, interface, and/or behavior of the existing class. If class Child extends class Parent, we say that Child in- herits from or is derived from Parent. In this relationship, the class Parent is known as the base class or superclass, and the class Child is the derived class or subclass. Clearly, inheritance leads to hierarchical (tree-structured) relation- ships between classes. Inheritance creates an “is-a” relationship between classes. For example, a circle is a type of shape. So if we were writing a 2D drawing application, it would probably make sense to derive our Circle class from a base class called Shape. We can draw diagrams of class hierarchies using the conventions defi ned by the Unifi ed Modeling Language (UML). In this notation, a rectangle repre- sents a class, and an arrow with a hollow triangular head represents inheritance. The inheritance arrow points from child class to parent. See Figure 3.1 for an ex- ample of a simple class hierarchy represented as a UML static class diagram. Shape Circle Rectangle Triangle Figure 3.1. 
UML static class diagram depicting a simple class hierarchy.

Multiple Inheritance
Some languages support multiple inheritance (MI), meaning that a class can have more than one parent class. In theory MI can be quite elegant, but in practice this kind of design usually gives rise to a lot of confusion and technical difficulties (see http://en.wikipedia.org/wiki/Multiple_inheritance). This is because multiple inheritance transforms a simple tree of classes into a potentially complex graph. A class graph can have all sorts of problems that never plague a simple tree—for example, the deadly diamond (http://en.wikipedia.org/wiki/Diamond_problem), in which a derived class ends up containing two copies of a grandparent base class (see Figure 3.2). In C++, virtual inheritance allows one to avoid this doubling of the grandparent's data. Most C++ software developers avoid multiple inheritance completely or only permit it in a limited form. A common rule of thumb is to allow only simple, parentless classes to be multiply inherited into an otherwise strictly single-inheritance hierarchy. Such classes are sometimes called mix-in classes because they can be used to introduce new functionality at arbitrary points in a class tree. See Figure 3.3 for a somewhat contrived example of a mix-in class.

Figure 3.2. "Deadly diamond" in a multiple inheritance hierarchy: ClassB and ClassC each derive from ClassA, and ClassD derives from both ClassB and ClassC, so ClassD's memory layout ends up containing two copies of ClassA.

Figure 3.3. Example of a mix-in class: Animator is a hypothetical mix-in class that adds animation functionality (an Animate() operation) to whatever class it is inherited by; here it is shown alongside the Shape hierarchy (Shape, Circle, Rectangle, and Triangle, each providing Draw()).

3.1.1.4. Polymorphism
Polymorphism is a language feature that allows a collection of objects of different types to be manipulated through a single common interface. The common interface makes a heterogeneous collection of objects appear to be homogeneous, from the point of view of the code using the interface.
For example, a 2D painting program might be given a list of various shapes to draw on-screen. One way to draw this heterogeneous collection of shapes is to use a switch statement to perform different drawing commands for each distinct type of shape.

void drawShapes(std::list<Shape*>& shapes)
{
    std::list<Shape*>::iterator pShape = shapes.begin();
    std::list<Shape*>::iterator pEnd = shapes.end();
    for ( ; pShape != pEnd; ++pShape)
    {
        switch ((*pShape)->mType)
        {
        case CIRCLE:
            // draw shape as a circle
            break;
        case RECTANGLE:
            // draw shape as a rectangle
            break;
        case TRIANGLE:
            // draw shape as a triangle
            break;
        //...
        }
    }
}

The problem with this approach is that the drawShapes() function needs to "know" about all of the kinds of shapes that can be drawn. This is fine in a simple example, but as our code grows in size and complexity, it can become difficult to add new types of shapes to the system. Whenever a new shape type is added, one must find every place in the code base where knowledge of the set of shape types is embedded—like this switch statement—and add a case to handle the new type. The solution is to insulate the majority of our code from any knowledge of the types of objects with which it might be dealing. To accomplish this, we can define classes for each of the types of shapes we wish to support.
All of these classes would inherit from the common base class Shape. A virtual function—the C++ language's primary polymorphism mechanism—named Draw() would be defined, and each distinct shape class would implement this function in a different way. Without "knowing" what specific types of shapes it has been given, the drawing function can now simply call each shape's Draw() function in turn.

struct Shape
{
    virtual void Draw() = 0; // pure virtual function
};

struct Circle : public Shape
{
    virtual void Draw()
    {
        // draw shape as a circle
    }
};

struct Rectangle : public Shape
{
    virtual void Draw()
    {
        // draw shape as a rectangle
    }
};

struct Triangle : public Shape
{
    virtual void Draw()
    {
        // draw shape as a triangle
    }
};

void drawShapes(std::list<Shape*>& shapes)
{
    std::list<Shape*>::iterator pShape = shapes.begin();
    std::list<Shape*>::iterator pEnd = shapes.end();
    for ( ; pShape != pEnd; ++pShape)
    {
        (*pShape)->Draw(); // call virtual function
    }
}

3.1.1.5. Composition and Aggregation
Composition is the practice of using a group of interacting objects to accomplish a high-level task. Composition creates a "has-a" or "uses-a" relationship between classes. (Technically speaking, the "has-a" relationship is called composition, while the "uses-a" relationship is called aggregation.) For example, a space ship has an engine, which in turn has a fuel tank. Composition/aggregation usually results in the individual classes being simpler and more focused. Inexperienced object-oriented programmers often rely too heavily on inheritance and tend to underutilize aggregation and composition.
As an example, imagine that we are designing a graphical user interface for our game's front end. We have a class Window that represents any rectangular GUI element. We also have a class called Rectangle that encapsulates the mathematical concept of a rectangle. A naïve programmer might derive the Window class from the Rectangle class (using an "is-a" relationship). But in a more flexible and well-encapsulated design, the Window class would refer to or contain a Rectangle (employing a "has-a" or "uses-a" relationship). This makes both classes simpler and more focused and allows the classes to be more easily tested, debugged, and reused.

3.1.1.6. Design Patterns
When the same type of problem arises over and over, and many different programmers employ a very similar solution to that problem, we say that a design pattern has arisen. In object-oriented programming, a number of common design patterns have been identified and described by various authors. The most well-known book on this topic is probably the "Gang of Four" book [17]. Here are a few examples of common general-purpose design patterns.
Singleton. This pattern ensures that a particular class has only one instance (the singleton instance) and provides a global point of access to it. (See the sketch below.)
Iterator. An iterator provides an efficient means of accessing the individual elements of a collection, without exposing the collection's underlying implementation. The iterator "knows" the implementation details of the collection, so that its users don't have to.
Abstract factory. An abstract factory provides an interface for creating families of related or dependent classes without specifying their concrete classes.
The game industry has its own set of design patterns, for addressing problems in every realm from rendering to collision to animation to audio.
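As a concrete illustration of the first pattern in the list above, here is a minimal C++ singleton sketch. The class name RenderManager is hypothetical, and this is only one of several possible implementations, not a recommendation:

class RenderManager
{
public:
    // Returns the one and only instance, constructing it on first use.
    static RenderManager& get()
    {
        static RenderManager s_instance;
        return s_instance;
    }

private:
    RenderManager() {}                               // no public construction
    RenderManager(const RenderManager&);             // copying disallowed
    RenderManager& operator=(const RenderManager&);  // assignment disallowed
};

// Usage: any code can write RenderManager::get() to reach the single instance.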
In a sense, this book is all about the high-level design patt erns prevalent in modern 3D game engine design. 97 3.1.2. Coding Standards: Why and How Much? Discussions of coding conventions among engineers can oft en lead to heated “religious” debates. I do not wish to spark any such debate here, but I will go so far as to suggest that following at least some minimal coding standards is a good idea. Coding standards exist for two primary reasons. 1. Some standards make the code more readable, understandable, and maintainable. 2. Other conventions help to prevent programmers from shooting them- selves in the foot. For example, a coding standard might encourage the programmer to use only a smaller, more testable, and less error-prone subset of the whole language. The C++ language is rife with possibili- ties for abuse, so this kind of coding standard is particularly important when using C++. In my opinion, the most important things to achieve in your coding con- ventions are the following. Interfaces are king. Keep your interfaces (.h fi les) clean, simple, minimal, easy to understand, and well-commented. Good names encourage understanding and avoid confusion. Stick to intuitive names that map directly to the purpose of the class, function, or vari- able in question. Spend time up-front identifying a good name. Avoid a naming scheme that requires programmers to use a look-up table in order to decipher the meaning of your code. Remember that high-level programming languages like C++ are intended for humans to read. (If you disagree, just ask yourself why you don’t write all your soft ware directly in machine language.) Don’t clutt er the global namespace. Use C++ namespaces or a common naming prefi x to ensure that your symbols don’t collide with symbols in other libraries. (But be careful not to overuse namespaces, or nest them too deeply.) Name #defined symbols with extra care; remember that C++ preprocessor macros are really just text substitutions, so they cut across all C/C++ scope and namespace boundaries. Follow C++ best practices. Books like the Eff ective C++ series by Scott Mey- ers [31, 32], Meyers’ Eff ective STL [33], and Large-Scale C++ Soft ware De- sign by John Lakos [27] provide excellent guidelines that will help keep you out of trouble. Be consistent. The rule I try to use is as follows: If you’re writing a body of code from scratch, feel free to invent any convention you like—then stick to it. When editing pre-existing code, try to follow whatever con- ventions have already been established. 3.1. C++ Review and Best Practices 98 3. Fundamentals of Software Engineering for Games Make errors stick out. Joel Spolsky wrote an excellent article on coding conventions, which can be found at htt p://www.joelonsoft ware.com / articles /Wrong.html. Joel suggests that the “cleanest” code is not neces- sarily code that looks neat and tidy on a superfi cial level, but rather the code that is writt en in a way that makes common programming errors easier to see. Joel’s articles are always fun and educational, and I highly recommend this one. 3.2. Data, Code, and Memory in C/C++ 3.2.1. Numeric Representations Numbers are at the heart of everything that we do in game engine development (and soft ware development in general). Every soft ware engineer should under- stand how numbers are represented and stored by a computer. This section will provide you with the basics you’ll need throughout the rest of the book. 3.2.1.1. 
Numeric Bases
People think most naturally in base ten, also known as decimal notation. In this notation, ten distinct digits are used (0 through 9), and each digit from right to left represents the next highest power of 10. For example, the number 7803 = (7×10^3) + (8×10^2) + (0×10^1) + (3×10^0) = 7000 + 800 + 0 + 3.
In computer science, mathematical quantities such as integers and real-valued numbers need to be stored in the computer's memory. And as we know, computers store numbers in binary format, meaning that only the two digits 0 and 1 are available. We call this a base-two representation, because each digit from right to left represents the next highest power of 2. Computer scientists sometimes use a prefix of "0b" to represent binary numbers. For example, the binary number 0b1101 is equivalent to decimal 13, because 0b1101 = (1×2^3) + (1×2^2) + (0×2^1) + (1×2^0) = 8 + 4 + 0 + 1 = 13.
Another common notation popular in computing circles is hexadecimal, or base 16. In this notation, the 10 digits 0 through 9 and the six letters A through F are used; the letters A through F replace the decimal values 10 through 15, respectively. A prefix of "0x" is used to denote hex numbers in the C and C++ programming languages. This notation is popular because computers generally store data in groups of 8 bits known as bytes, and since a single hexadecimal digit represents 4 bits exactly, a pair of hex digits represents a byte. For example, the value 0xFF = 0b11111111 = 255 is the largest number that can be stored in 8 bits (1 byte). Each digit in a hexadecimal number, from right to left, represents the next power of 16. So, for example, 0xB052 = (11×16^3) + (0×16^2) + (5×16^1) + (2×16^0) = (11×4096) + (0×256) + (5×16) + (2×1) = 45,138.

3.2.1.2. Signed and Unsigned Integers
In computer science, we use both signed and unsigned integers. Of course, the term "unsigned integer" is actually a bit of a misnomer—in mathematics, the whole numbers or natural numbers range from 0 (or 1) up to positive infinity, while the integers range from negative infinity to positive infinity. Nevertheless, we'll use computer science lingo in this book and stick with the terms "signed integer" and "unsigned integer."
Most modern personal computers and game consoles work most easily with integers that are 32 bits or 64 bits wide (although 8- and 16-bit integers are also used a great deal in game programming as well). To represent a 32-bit unsigned integer, we simply encode the value using binary notation (see above). The range of possible values for a 32-bit unsigned integer is 0x00000000 (0) to 0xFFFFFFFF (4,294,967,295).
To represent a signed integer in 32 bits, we need a way to differentiate between positive and negative values. One simple approach would be to reserve the most significant bit as a sign bit—when this bit is zero the value is positive, and when it is one the value is negative. This gives us 31 bits to represent the magnitude of the value, effectively cutting the range of possible magnitudes in half (but allowing both positive and negative forms of each distinct magnitude, including zero).
Most microprocessors use a slightly more efficient technique for encoding negative integers, called two's complement notation. This notation has only one representation for the value zero, as opposed to the two representations possible with a simple sign bit (positive zero and negative zero).
In 32-bit two's complement notation, the value 0xFFFFFFFF is interpreted to mean –1, and negative values count down from there. Any value with the most significant bit set is considered negative. So values from 0x00000000 (0) to 0x7FFFFFFF (2,147,483,647) represent positive integers, and 0x80000000 (–2,147,483,648) to 0xFFFFFFFF (–1) represent negative integers.

3.2.1.3. Fixed-Point Notation
Integers are great for representing whole numbers, but to represent fractions and irrational numbers we need a different format that expresses the concept of a decimal point. One early approach taken by computer scientists was to use fixed-point notation. In this notation, one arbitrarily chooses how many bits will be used to represent the whole part of the number, and the rest of the bits are used to represent the fractional part. As we move from left to right (i.e., from the most significant bit to the least significant bit), the magnitude bits represent decreasing powers of two (…, 16, 8, 4, 2, 1), while the fractional bits represent decreasing inverse powers of two (1/2, 1/4, 1/8, 1/16, …). For example, to store the number –173.25 in 32-bit fixed-point notation, with one sign bit, 16 bits for the magnitude and 15 bits for the fraction, we first convert the sign, the whole part, and the fractional part into their binary equivalents individually (negative = 0b1, 173 = 0b0000000010101101, and 0.25 = 1/4 = 0b010000000000000). Then we pack those values together into a 32-bit integer. The final result is 0x8056A000. This is illustrated in Figure 3.4.

Figure 3.4. Fixed-point encoding of –173.25 with a sign bit, a 16-bit magnitude, and a 15-bit fraction, giving the packed value 0x8056A000.

The problem with fixed-point notation is that it constrains both the range of magnitudes that can be represented and the amount of precision we can achieve in the fractional part. Consider a 32-bit fixed-point value with 16 bits for the magnitude, 15 bits for the fraction, and a sign bit. This format can only represent magnitudes up to ±65,535, which isn't particularly large. To overcome this problem, we employ a floating-point representation.

3.2.1.4. Floating-Point Notation
In floating-point notation, the position of the decimal place is arbitrary and is specified with the help of an exponent. A floating-point number is broken into three parts: the mantissa, which contains the relevant digits of the number on both sides of the decimal point, the exponent, which indicates where in that string of digits the decimal point lies, and a sign bit, which of course indicates whether the value is positive or negative. There are all sorts of different ways to lay out these three components in memory, but the most common standard is IEEE-754. It states that a 32-bit floating-point number will be represented with the sign in the most significant bit, followed by 8 bits of exponent, and finally 23 bits of mantissa.
The value v represented by a sign bit s, an exponent e and a mantissa m is

    v = s × 2^(e – 127) × (1 + m).

The sign bit s has the value +1 or –1. The exponent e is biased by 127 so that negative exponents can be easily represented. The mantissa begins with an implicit 1 that is not actually stored in memory, and the rest of the bits are interpreted as inverse powers of two. Hence the value represented is really 1 + m, where m is the fractional value stored in the mantissa.
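To make the decoding concrete, here is a small sketch (my own illustration, not from the original text) that extracts the three fields from a raw 32-bit pattern and evaluates the formula above for normalized values; the worked example that follows does the same computation by hand. The function name decodeFloatBits is hypothetical.

#include <cmath>
#include <cstdio>

// Decode a raw IEEE-754 single-precision bit pattern (normalized values only,
// i.e., the exponent field is neither 0 nor 255).
float decodeFloatBits(unsigned int bits)
{
    float s = (bits & 0x80000000u) ? -1.0f : 1.0f;   // sign bit
    int   e = (int)((bits >> 23) & 0xFFu);           // 8-bit biased exponent
    float m = (bits & 0x007FFFFFu) / 8388608.0f;     // 23 mantissa bits as a
                                                     // fraction of 2^23
    return s * std::pow(2.0f, (float)(e - 127)) * (1.0f + m);
}

int main()
{
    // 0x3E200000 is the bit pattern shown in Figure 3.5; this prints 0.156250.
    std::printf("%f\n", decodeFloatBits(0x3E200000u));
    return 0;
}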
For example, the bit pattern shown in Figure 3.5 represents the value 0.15625, because s = 0 (indicating a positive number), e = 0b01111100 = 124, and m = 0b0100… = 0×2^–1 + 1×2^–2 = 1/4. Therefore,

    v = s × 2^(e – 127) × (1 + m)
      = (+1) × 2^(124 – 127) × (1 + 1/4)
      = 2^–3 × 5/4                           (3.1)
      = 1/8 × 5/4
      = 0.125 × 1.25
      = 0.15625.

Figure 3.5. IEEE-754 32-bit floating-point format: a sign bit, an 8-bit exponent, and a 23-bit mantissa. The pattern shown (sign 0, exponent 0b01111100, mantissa 0b0100…0) encodes the value 0.15625.

The Trade-Off between Magnitude and Precision
The precision of a floating-point number increases as the magnitude decreases, and vice versa. This is because there are a fixed number of bits in the mantissa, and these bits must be shared between the whole part and the fractional part of the number. If a large percentage of the bits are spent representing a large magnitude, then a small percentage of bits are available to provide fractional precision. In physics the term significant digits is typically used to describe this concept (http://en.wikipedia.org/wiki/Significant_digits).
To understand the trade-off between magnitude and precision, let's look at the largest possible floating-point value, FLT_MAX ≈ 3.403×10^38, whose representation in 32-bit IEEE floating-point format is 0x7F7FFFFF. Let's break this down:
The largest absolute value that we can represent with a 23-bit mantissa is 0x00FFFFFF in hexadecimal, or 24 consecutive binary ones—that's 23 ones in the mantissa, plus the implicit leading one.
An exponent of 255 has a special meaning in the IEEE-754 format—it is used for values like not-a-number (NaN) and infinity—so it cannot be used for regular numbers. Hence the maximum 8-bit exponent is actually 254, which translates into 127 after subtracting the implicit bias of 127.
So FLT_MAX is 0x00FFFFFF × 2^104 = 0xFFFFFF00000000000000000000000000. In other words, our 24 binary ones were shifted up so that the leading one lands at bit position 127, leaving 127 – 23 = 104 binary zeros (or 104/4 = 26 hexadecimal zeros) after the least significant digit of the mantissa. Those trailing zeros don't correspond to any actual bits in our 32-bit floating-point value—they just appear out of thin air because of the exponent. If we were to subtract a small number (where "small" means any number composed of fewer than 26 hexadecimal digits) from FLT_MAX, the result would still be FLT_MAX, because those 26 least significant hexadecimal digits don't really exist!
The opposite effect occurs for floating-point values whose magnitudes are much less than one. In this case, the exponent is large but negative, and the significant digits are shifted in the opposite direction. We trade the ability to represent large magnitudes for high precision. In summary, we always have the same number of significant digits (or really significant bits) in our floating-point numbers, and the exponent can be used to shift those significant bits into higher or lower ranges of magnitude.
Another subtlety to notice is that there is a finite gap between zero and the smallest nonzero value we can represent with any floating-point notation.
The smallest nonzero magnitude we can represent is FLT_MIN = 2–126 ≈ 1.175×10–38, which has a binary representation of 0x00800000 (i.e., the exponent is 0x01, or –126 aft er subtracting the bias, and the mantissa is all zeros, except for the implicit leading one). There is no way to represent a nonzero magnitude that is smaller than 1.175×10–38, because the next smallest valid value is zero. Put another way, the real number line is quantized when using a fl oating-point representation. For a particular fl oating-point representation, the machine epsilon is de- fi ned to be the smallest fl oating-point value ε that satisfi es the equation, 1 + ε ≠ 1. For an IEEE-754 fl oating-point number, with its 23 bits of precision, the value of ε is 2–23, which is approximately 1.192×10–7. The most signifi cant digit of ε falls just inside the range of signifi cant digits in the value 1.0, so adding any value smaller than ε to 1.0 has no eff ect. In other words, any new bits con- tributed adding a value smaller than ε will get “chopped off ” when we try to fi t the sum into a mantissa with only 23 bits. The concepts of limited precision and the machine epsilon have real im- pacts on game soft ware. For example, let’s say we use a fl oating-point vari- able to track absolute game time in seconds. How long can we run our game before the magnitude of our clock variable gets so large that adding 1/30th of a second to it no longer changes its value? The answer is roughly 12.9 days. That’s longer than most games will be left running, so we can probably get away with using a 32-bit fl oating-point clock measured in seconds in a game. But clearly it’s important to understand the limitations of the fl oating-point format, so that we can predict potential problems and take steps to avoid them when necessary. 103 IEEE Floating-Point Bit Tricks See [7], Section 2.1, for a few really useful IEEE fl oating-point “bit tricks” that can make fl oating-point calculations lightning-fast. 3.2.1.5. Atomic Data Types As you know, C and C++ provide a number of atomic data types. The C and C++ standards provide guidelines on the relative sizes and signedness of these data types, but each compiler is free to defi ne the types slightly diff erently in order to provide maximum performance on the target hardware. char. A char is usually 8 bits and is generally large enough to hold an ASCII or UTF-8 character (see Section 5.4.4.1). Some compilers defi ne char to be signed, while others use unsigned chars by default. int, short, long. An int is supposed to hold a signed integer value that is the most effi cient size for the target platform; it is generally de- fi ned to be 32 bits wide on Pentium class PCs. A short is intended to be smaller than an int and is 16 bits on many machines. A long is as large as or larger than an int and may be 32 or 64 bits, depending on the hardware. float. On most modern compilers, a float is a 32-bit IEEE-754 fl oat- ing-point value. double. A double is a double-precision (i.e., 64-bit) IEEE-754 fl oating- point value. bool. A bool is a true/false value. The size of a bool varies widely across diff erent compilers and hardware architectures. It is never implemented as a single bit, but some compilers defi ne it to be 8 bits while others use a full 32 bits. Compiler-Specifi c Sized Types The standard C/C++ atomic data types were designed to be portable and therefore nonspecifi c. 
However, in many soft ware engineering endeavors, in- cluding game engine programming, it is oft en important to know exactly how wide a particular variable is. The Visual Studio C/C++ compiler defi nes the fol- lowing extended keywords for declaring variables that are an explicit number of bits wide: __int8, __int16, __int32, and __int64. SIMD Types The CPUs on many modern computers and game consoles have a special- ized type of arithmetic logic unit (ALU) referred to as a vector processor or vector unit. A vector processor supports a form of parallel processing known as single instruction, multiple data (SIMD), in which a mathematical operation 3.2. Data, Code, and Memory in C/C++ 104 3. Fundamentals of Software Engineering for Games is performed on multiple quantities in parallel, using a single machine in- struction. In order to be processed by the vector unit, two or more quanti- ties are packed into a 64- or 128-bit CPU register. In game programming, the most commonly used SIMD register format packs four 32-bit IEEE-754 fl oating-point quantities into a 128-bit SIMD register. This format allows us to perform calculations such as vector dot products and matrix multiplications much more effi ciently than would be possible with a SISD (single instruction, single data) ALU. Each microprocessor has a diff erent name for its SIMD instruction set, and the compilers that target those microprocessors use a custom syntax to declare SIMD variables. For example, on a Pentium class CPU, the SIMD in- struction set is known as SSE (streaming SIMD extensions), and the Microsoft Visual Studio compiler provides the built-in data type __m128 to represent a four-fl oat SIMD quantity. The PowerPC class of CPUs used on the PLAYSTA- TION 3 and Xbox 360 calls its SIMD instruction set Altivec, and the Gnu C++ compiler uses the syntax vector float to declare a packed four-fl oat SIMD variable. We’ll discuss how SIMD programming works in more detail in Sec- tion 4.7. Portable Sized Types Most other compilers have their own “sized” data types, with similar seman- tics but slightly diff erent syntax. Because of these diff erences between compil- ers, most game engines achieve source code portability by defi ning their own custom atomic data types. For example, at Naughty Dog we use the following atomic types: F32 is a 32-bit IEEE-754 fl oating-point value. U8, I8, U16, I16, U32, I32, U64, and I64 are unsigned and signed 8-, 16-, 32-, and 64-bit integers, respectively. U32F and I32F are “fast” unsigned and signed 32-bit values, respec- tively. Each of these data types acts as though it contains a 32-bit value, but it actually occupies 64 bits in memory. This permits the PS3’s cen- tral PowerPC-based processor (called the PPU) to read and write these variables directly into its 64-bit registers, providing a signifi cant speed boost over reading and writing 32-bit variables. VF32 represents a packed four-fl oat SIMD value. OGRE ’s Atomic Data Types OGRE defi nes a number of atomic types of its own. Ogre::uint8, Ogre:: uint16 and Ogre::uint32 are the basic unsigned sized integral types. 105 Ogre ::Real defi nes a real fl oating-point value. It is usually defi ned to be 32 bits wide (equivalent to a float), but it can be redefi ned globally to be 64 bits wide (like a double) by defi ning the preprocessor macro OGRE_DOU- BLE_PRECISION to 1. This ability to change the meaning of Ogre::Real is generally only used if one’s game has a particular requirement for double- precision math, which is rare. 
Graphics chips (GPUs) always perform their math with 32-bit or 16-bit fl oats, the CPU/FPU is also usually faster when working in single-precision, and SIMD vector instructions operate on 128-bit registers that contain four 32-bit fl oats each. Hence most games tend to stick to single-precision fl oating-point math. The data types Ogre ::uchar, Ogre::ushort, Ogre::uint and Ogre::ulong are just shorthand notations for C/C++’s unsigned char, un- signed short, and unsigned long, respectively. As such, they are no more or less useful than their native C/C++ counterparts. The types Ogre ::Radian and Ogre::Degree are particularly interest- ing. These classes are wrappers around a simple Ogre::Real value. The pri- mary role of these types is to permit the angular units of hard-coded literal constants to be documented and to provide automatic conversion between the two unit systems. In addition, the type Ogre::Angle represents an angle in the current “default” angle unit. The programmer can defi ne whether the default will be radians or degrees when the OGRE application fi rst starts up. Perhaps surprisingly, OGRE does not provide a number of sized atomic data types that are commonplace in other game engines. For example, it de- fi nes no signed 8-, 16-, or 64-bit integral types. If you are writing a game en- gine on top of OGRE, you will probably fi nd yourself defi ning these types manually at some point. 3.2.1.6. Multi-Byte Values and Endianness Values that are larger than eight bits (one byte) wide are called multi-byte quan- tities. They’re commonplace on any soft ware project that makes use of integers and fl oating-point values that are 16 bits or wider. For example, the integer value 4660 = 0x1234 is represented by the two bytes 0x12 and 0x34. We call 0x12 the most signifi cant byte (MSB) and 0x34 the least signifi cant byte (LSB). In a 32-bit value, such as 0xABCD1234, the MSB is 0xAB and the LSB is 0x34. The same concepts apply to 64-bit integers and to 32- and 64-bit fl oating-point values as well. Multi-byte integers can be stored into memory in one of two ways, and diff erent microprocessors may diff er in their choice of storage method (see Figure 3.6). 3.2. Data, Code, and Memory in C/C++ 106 3. Fundamentals of Software Engineering for Games Litt le-endian. If a microprocessor stores the least signifi cant byte (LSB) of a multi-byte value at a lower memory address than the most signifi cant byte (MSB), we say that the processor is litt le-endian. On a litt le-endian machine, the number 0xABCD1234 would be stored in memory using the consecutive bytes 0x34, 0x12, 0xCD, 0xAB. Big-endian. If a microprocessor stores the most signifi cant byte (MSB) of a multi-byte value at a lower memory address than the least signifi cant byte (LSB), we say that the processor is big-endian. On a big-endian ma- chine, the number 0xABCD1234 would be stored in memory using the bytes 0xAB, 0xCD, 0x12, 0x34. Most programmers don’t need to think much about endianness. How- ever, when you’re a game programmer, endianness can become a bit of a thorn in your side. This is because games are usually developed on a PC or Linux ma- chine running an Intel Pentium processor (which is litt le-endian), but run on a console such as the Wii, Xbox 360, or PLAYSTATION 3—all three of which utilize a variant of the PowerPC processor (which can be confi gured to use either endianness, but is big-endian by default). 
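If you're ever unsure which convention a given machine uses, it's easy to check at runtime. The little function below is a sketch of one common approach (not taken from any particular engine), written using the U32 and U8 sized types described earlier. It stores a known multi-byte value and then inspects which byte ends up at the lowest memory address.

inline bool isLittleEndian()
{
    U32 value = 0x01234567;

    // Examine the byte stored at the lowest address of 'value'.
    // On a little-endian CPU it will be the LSB (0x67); on a
    // big-endian CPU it will be the MSB (0x01).
    U8* pBytes = (U8*)&value;
    return (pBytes[0] == 0x67);
}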
Now imagine what happens when you generate a data fi le for consumption by your game engine on an Intel processor and then try to load that data fi le into your engine running on a PowerPC processor. Any multi-byte value that you wrote out into that data fi le will be stored in litt le-endian format. But when the game engine reads the fi le, it expects all of its data to be in big-endian format. The result? You’ll write 0xABCD1234, but you’ll read 0x3412CDAB, and that’s clearly not what you intended! There are at least two solutions to this problem. 1. You could write all your data fi les as text and store all multi-byte num- bers as sequences of decimal digits, one character (one byte) per digit. This would be an ineffi cient use of disk space, but it would work. U32 value = 0 xABCD 1234; U8* pBytes = ( U8*)&value ; Big-endian 0xAB 0xCD pBytes + 0x0 0x12 0x34 pBytes + 0x1 pBytes + 0x2 pBytes + 0x3 Little-endian 0x34 0x12 0xCD 0xAB pBytes + 0x0 pBytes + 0x1 pBytes + 0x2 pBytes + 0x3 Figure 3.6. Big- and little-endian representations of the value 0xABCD1234. 107 2. You can have your tools endian-swap the data prior to writing it into a binary data fi le. In eff ect, you make sure that the data fi le uses the endi- anness of the target microprocessor (the game console), even if the tools are running on a machine that uses the opposite endianness. Integer Endian-Swapping Endian-swapping an integer is not conceptually diffi cult. You simply start at the most signifi cant byte of the value and swap it with the least signifi cant byte; you continue this process until you reach the half-way point in the value. For example, 0xA7891023 would become 0x231089A7. The only tricky part is knowing which bytes to swap. Let’s say you’re writ- ing the contents of a C struct or C++ class from memory out to a fi le. To properly endian-swap this data, you need to keep track of the locations and sizes of each data member in the struct and swap each one appropriately based on its size. For example, the structure struct Example { U32 m_a; U16 m_b; U32 m_c; }; might be writt en out to a data fi le as follows: void writeExampleStruct(Example& ex, Stream& stream) { stream.writeU32(swapU32(ex.m_a)); stream.writeU16(swapU16(ex.m_b)); stream.writeU32(swapU32(ex.m_c)); } and the swap functions might be defi ned like this: inline U16 swapU16(U16 value) { return ((value & 0x00FF) << 8) | ((value & 0xFF00) >> 8); } inline U32 swapU32(U32 value) { return ((value & 0x000000FF) << 24) | ((value & 0x0000FF00) << 8) | ((value & 0x00FF0000) >> 8) | ((value & 0xFF000000) >> 24); } 3.2. Data, Code, and Memory in C/C++ 108 3. Fundamentals of Software Engineering for Games You cannot simply cast the Example object into an array of bytes and blindly swap the bytes using a single general-purpose function. We need to know both which data members to swap and how wide each member is; and each data member must be swapped individually. Floating-Point Endian-Swapping Let’s take a brief look at how fl oating-point endian-swapping diff ers from in- teger endian-swapping. As we’ve seen, an IEEE-754 fl oating-point value has a detailed internal structure involving some bits for the mantissa, some bits for the exponent, and a sign bit. However, you can endian-swap it just as if it were an integer, because bytes are bytes. You can reinterpret fl oats as integers by using C++’s reinterpret_cast operator on a pointer to the fl oat; this is known as type punning. But punning can lead to optimization bugs when strict aliasing is enabled. 
(See htt p://cocoawithlove.com/2008/04/using-pointers-to- recast-in-c-is-bad.html for an excellent description of this problem.) One con- venient approach is to use a union, as follows: union U32F32 { U32 m_asU32; F32 m_asF32; }; inline F32 swapF32(F32 value) { U32F32 u; u.m_asF32 = value; // endian-swap as integer u.m_asU32 = swapU32(u.m_asU32); return u.m_asF32; } 3.2.2. Declarations, Defi nitions, and Linkage 3.2.2.1. Translation Units Revisited As we saw in Chapter 2, a C or C++ program is comprised of translation units. The compiler translates one .cpp fi le at a time, and for each one it generates an output fi le called an object fi le (.o or .obj). A .cpp fi le is the smallest unit of translation operated on by the compiler; hence, the name “ translation unit.” An object fi le contains not only the compiled machine code for all of the func- tions defi ned in the .cpp fi le, but also all of its global and static variables. In ad- dition, an object fi le may contain unresolved references to functions and global variables defi ned in other .cpp fi les. 109 3.2. Data, Code, and Memory in C/C++ ??? Unresolved Reference ??? Multiply-Defined Symbol ??? foo.cpp U32 gGlobalA ; U32 gGlobalB ; void f () { // ... gGlobalC = 5.3f; gGlobalD = -2; // ... } extern U 32 gGlobalC ; bar.cpp F32 gGlobalC ; void g () { // ... U 32 a = gGlobalA ; // ... f (); // ... gGlobalB = 0; } extern U 32 gGlobalA ; extern U 32 gGlobalB ; extern void f (); spam.cpp U32 gGlobalA ; void h () { // ... } Figure 3.9. The two most common linker errors. foo.cpp U32 gGlobalA ; U32 gGlobalB ; void f () { // ... gGlobalC = 5.3f; // ... } extern U 32 gGlobalC ; bar.cpp F32 gGlobalC ; void g () { // ... U 32 a = gGlobalA ; // ... f (); // ... gGlobalB = 0; } extern U 32 gGlobalA ; extern U 32 gGlobalB ; extern void f (); Figure 3.7. Unresolved external references in two translation units. foo.cpp U32 gGlobalA ; U32 gGlobalB ; void f () { // ... gGlobalC = 5.3f; // ... } extern U 32 gGlobalC ; bar.cpp F32 gGlobalC ; void g () { // ... U 32 a = gGlobalA ; // ... f (); // ... gGlobalB = 0; } extern U 32 gGlobalA ; extern U 32 gGlobalB ; extern void f (); Figure 3.8. Fully resolved external references after successful linking. 110 3. Fundamentals of Software Engineering for Games The compiler only operates on one translation unit at a time, so whenever it encounters a reference to an external global variable or function, it must “go on faith” and assume that the entity in question really exists, as shown in Figure 3.7. It is the linker’s job to combine all of the object fi les into a fi nal executable image. In doing so, the linker reads all of the object fi les and at- tempts to resolve all of the unresolved cross-references between them. If it is successful, an executable image is generated containing all of the functions, global variables, and static variables, with all cross-translation-unit references properly resolved. This is depicted in Figure 3.8. The linker’s primary job is to resolve external references, and in this ca- pacity it can generate only two kinds of errors: 1. The target of an extern reference might not be found, in which case the linker generates an “unresolved symbol” error. 2. The linker might fi nd more than one variable or function with the same name, in which case it generates a “multiply defi ned symbol” error. These two situations are shown in Figure 3.9. 3.2.2.2. Declaration versus Defi nition In the C and C++ languages, variables and functions must be declared and de- fi ned before they can be used. 
It is important to understand the diff erence be- tween a declaration and a defi nition in C and C++. A declaration is a description of a data object or function. It provides the compiler with the name of the entity and its data type or function signature (i.e., return type and argument type(s)). A defi nition, on the other hand, describes a unique region of memory in the program. This memory might contain a variable, an instance of a struct or class, or the machine code of a function. In other words, a declaration is a reference to an entity, while a defi nition is the entity itself. A defi nition is always a declaration, but the reverse is not always the case—it is possible to write a pure declaration in C and C++ that is not a defi nition. Functions are defi ned by writing the body of the function immediately af- ter the signature, enclosed in curly braces: foo.cpp // definition of the max() function int max(int a, int b) { 111 return (a > b) ? a : b; } // definition of the min() function int min(int a, int b) { return (a <= b) ? a : b; } A pure declaration can be provided for a function so that it can be used in other translation units (or later in the same translation unit). This is done by writing a function signature followed by a semicolon, with an optional prefi x of extern: foo.h extern int max(int a, int b); // a function declaration int min(int a, int b); // also a declaration (the // ‘extern’ is optional/ // assumed) Variables and instances of classes and structs are defi ned by writing the data type followed by the name of the variable or instance, and an optional array specifi er in square brackets: foo.cpp // All of these are variable definitions: U32 gGlobalInteger = 5; F32 gGlobalFloatArray[16]; MyClass gGlobalInstance; A global variable defi ned in one translation unit can optionally be declared for use in other translation units by using the extern keyword: foo.h // These are all pure declarations: extern U32 gGlobalInteger; extern F32 gGlobalFloatArray[16]; extern MyClass gGlobalInstance; Multiplicity of Declarations and Defi nitions Not surprisingly, any particular data object or function in a C/C++ program can have multiple identical declarations, but each can have only one defi ni- tion. If two or more identical defi nitions exist in a single translation unit, the compiler will notice that multiple entities have the same name and fl ag an error. If two or more identical defi nitions exist in diff erent transla- 3.2. Data, Code, and Memory in C/C++ 112 3. Fundamentals of Software Engineering for Games tion units, the compiler will not be able to identify the problem, because it operates on one translation unit at a time. But in this case, the linker will give us a “multiply defi ned symbol” error when it tries to resolve the cross-references. Defi nitions in Header Files and Inlining It is usually dangerous to place defi nitions in header fi les. The reason for this should be prett y obvious: If a header fi le containing a defi nition is #included into more than one .cpp fi le, it’s a sure-fi re way of generating a “multiply de- fi ned symbol” linker error. Inline function defi nitions are an exception to this rule, because each in- vocation of an inline function gives rise to a brand new copy of that function’s machine code, embedded directly into the calling function. In fact, inline func- tion defi nitions must be placed in header fi les if they are to be used in more than one translation unit. 
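To see why non-inline definitions in header files are so dangerous, consider the following hypothetical header (the file and variable names are illustrative only, and the example uses the U32 sized type from earlier):

globals.h
// DANGEROUS: this is a definition. Every .cpp file that #includes
// this header gets its own copy of gPlayerCount, and the linker
// reports a "multiply defined symbol" error.
U32 gPlayerCount = 0;

globals.h (corrected)
// SAFE: this is a pure declaration, so it allocates no storage.
// The one-and-only definition (U32 gPlayerCount = 0;) lives in
// exactly one .cpp file.
extern U32 gPlayerCount;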
Note that it is not suffi cient to tag a function declara- tion with the inline keyword in a .h fi le and then place the body of that func- tion in a .cpp fi le. The compiler must be able to “see” the body of the function in order to inline it. For example: foo.h // This function definition will be inlined properly. inline int max(int a, int b) { return (a > b) ? a : b; } // This declaration cannot be inlined because the // compiler cannot “see” the body of the function. inline int min(int a, int b); foo.cpp // The body of min() is effectively “hidden” from the // compiler, and so it can ONLY be inlined within // foo.cpp. int min(int a, int b) { return (a <= b) ? a : b; } The inline keyword is really just a hint to the compiler. It does a cost/ benefi t analysis of each inline function, weighing the size of the function’s code versus the potential performance benefi ts of inling it, and the compiler gets the fi nal say as to whether the function will really be inlined or not. Some compilers provide syntax like __forceinline, allowing the programmer 113 to bypass the compiler’s cost/benefi t analysis and control function inlining directly. 3.2.2.3. Linkage Every defi nition in C and C++ has a property known as linkage. A defi nition with external linkage is visible to and can be referenced by translation units other than the one in which it appears. A defi nition with internal linkage can only be “seen” inside the translation unit in which it appears and thus cannot be referenced by other translation units. We call this property linkage because it dictates whether or not the linker is permitt ed to cross-reference the entity in question. So, in a sense, linkage is the translation unit’s equivalent of the public: and private: keywords in C++ class defi nitions. By default, defi nitions have external linkage. The static keyword is used to change a defi nition’s linkage to internal. Note that two or more identi- cal static defi nitions in two or more diff erent .cpp fi les are considered to be distinct entities by the linker (just as if they had been given diff erent names), so they will not generate a “multiply defi ned symbol” error. Here are some examples: foo.cpp // This variable can be used by other .cpp files // (external linkage). U32 gExternalVariable; // This variable is only usable within foo.cpp (internal // linkage). static U32 gInternalVariable; // This function can be called from other .cpp files // (external linkage). void externalFunction() { // ... } // This function can only be called from within foo.cpp // (internal linkage). static void internalFunction() { // ... } bar.cpp // This declaration grants access to foo.cpp’s variable. extern U32 gExternalVariable; 3.2. Data, Code, and Memory in C/C++ 114 3. Fundamentals of Software Engineering for Games // This ‘gInternalVariable’ is distinct from the one // defined in foo.cpp – no error. We could just as // well have named it gInternalVariableForBarCpp – the // net effect is the same. static U32 gInternalVariable; // This function is distinct from foo.cpp’s // version – no error. It acts as if we had named it // internalFunctionForBarCpp(). static void internalFunction() { // ... } // ERROR – multiply defined symbol! void externalFunction() { // ... } Technically speaking, declarations don’t have a linkage property at all, be- cause they do not allocate any storage in the executable image; therefore, there is no question as to whether or not the linker should be permitt ed to cross- reference that storage. 
A declaration is merely a reference to an entity defi ned elsewhere. However, it is sometimes convenient to speak about declarations as having internal linkage, because a declaration only applies to the transla- tion unit in which it appears. If we allow ourselves to loosen our terminology in this manner, then declarations always have internal linkage—there is no way to cross-reference a single declaration in multiple .cpp fi les. (If we put a declaration in a header fi le, then multiple .cpp fi les can “see” that declaration, but they are in eff ect each gett ing a distinct copy of the declaration, and each copy has internal linkage within that translation unit.) This leads us to the real reason why inline function defi nitions are permit- ted in header fi les: It is because inline functions have internal linkage by de- fault, just as if they had been declared static. If multiple .cpp fi les #include a header containing an inline function defi nition, each translation unit gets a private copy of that function’s body, and no “multiply defi ned symbol” errors are generated. The linker sees each copy as a distinct entity. 3.2.3. C/C++ Memory Layout A program writt en in C or C++ stores its data in a number of diff erent places in memory. In order to understand how storage is allocated and how the various 115 types of C/C++ variables work, we need to understand the memory layout of a C/C++ program. 3.2.3.1. Executable Image When a C/C++ program is built, the linker creates an executable fi le. Most UN- IX-like operating system s, including many game consoles, employ a popular executable fi le format called the executable and linking format (ELF). Executable fi les on those systems therefore have an .elf extension. The Windows execut- able format is similar to the ELF format; executables under Windows have an .exe extension. Whatever its format, the executable fi le always contains a partial image of the program as it will exist in memory when it runs. I say a “partial” image because the program generally allocates memory at runtime in addition to the memory laid out in its executable image. The executable image is divided into contiguous blocks called segments or sections. Every operating system lays things out a litt le diff erently, and the layout may also diff er slightly from executable to executable on the same op- erating system. But the image is usually comprised of at least the following four segments: 1. Text segment. Sometimes called the code segment, this block contains ex- ecutable machine code for all functions defi ned by the program. 2. Data segment. This segment contains all initialized global and static vari- ables. The memory needed for each global variable is laid out exactly as it will appear when the program is run, and the proper initial values are all fi lled in. So when the executable fi le is loaded into memory, the initialized global and static variables are ready to go. 3. BSS segment. “BSS” is an outdated name which stands for “block started by symbol.” This segment contains all of the uninitialized global and stat- ic variables defi ned by the program. The C and C++ languages explicitly defi ne the initial value of any uninitialized global or static variable to be zero. But rather than storing a potentially very large block of zeros in the BSS section, the linker simply stores a count of how many zero bytes are required to account for all of the uninitialized globals and statics in the segment. 
When the executable is loaded into memory, the operating system reserves the requested number of bytes for the BSS section and fi lls it with zeros prior to calling the program’s entry point (e.g. main() or WinMain()). 4. Read-only data segment. Sometimes called the rodata segment, this seg- ment contains any read-only (constant) global data defi ned by the pro- gram. For example, all fl oating-point constants (e.g., const float kPi 3.2. Data, Code, and Memory in C/C++ 116 3. Fundamentals of Software Engineering for Games = 3.141592f;) and all global object instances that have been declared with the const keyword (e.g., const Foo gReadOnlyFoo; ) reside in this segment. Note that integer constants (e.g., const int kMaxMon- sters = 255; ) are oft en used as manifest constants by the compiler, meaning that they are inserted directly into the machine code wherever they are used. Such constants occupy storage in the text segment, but they are not present in the read-only data segment. Global variables, i.e., variables defi ned at fi le scope outside any function or class declaration, are stored in either the data or BSS segments, depending on whether or not they have been initialized. The following global will be stored in the data segment, because it has been initialized: foo.cpp F32 gInitializedGlobal = -2.0f; and the following global will be allocated and initialized to zero by the operat- ing system , based on the specifi cations given in the BSS segment, because it has not been initialized by the programmer: foo.cpp F32 gUninitializedGlobal; We’ve seen that the static keyword can be used to give a global vari- able or function defi nition internal linkage, meaning that it will be “hidden” from other translation units. The static keyword can also be used to declare a global variable within a function. A function-static variable is lexically scoped to the function in which it is declared (i.e., the variable’s name can only be “seen” inside the function). It is initialized the fi rst time the function is called (rather than before main() is called as with fi le-scope statics). But in terms of memory layout in the executable image, a function-static variable acts identi- cally to a fi le-static global variable—it is stored in either the data or BSS seg- ment based on whether or not it has been initialized. void readHitchhikersGuide(U32 book) { static U32 sBooksInTheTrilogy = 5; // data segment static U32 sBooksRead; // BSS segment // ... } 3.2.3.2. Program Stack When an executable program is loaded into memory and run, the operating system reserves an area of memory for the program stack. Whenever a function is called, a contiguous area of stack memory is pushed onto the stack—we call this block of memory a stack frame. If function a() calls another function b(), 117 a new stack frame for b() is pushed on top of a()’s frame. When b() returns, its stack frame is popped, and execution continues wherever a() left off . A stack frame stores three kinds of data: 1. It stores the return address of the calling function, so that execution may continue in the calling function when the called function returns. 2. The contents of all relevant CPU registers are saved in the stack frame. This allows the new function to use the registers in any way it sees fi t, without fear of overwriting data needed by the calling function. Upon return to the calling function, the state of the registers is restored so that execution of the calling function may resume. 
The return value of the called function, if any, is usually left in a specifi c register so that the call- ing function can retrieve it, but the other registers are restored to their original values. 3. The stack frame also contains all local variables declared by the func- tion; these are also known as automatic variables. This allows each dis- tinct function invocation to maintain its own private copy of every local variable, even when a function calls itself recursively. (In practice, some local variables are actually allocated to CPU registers rather than being stored in the stack frame but, for the most part, such variables operate as if they were allocated within the function’s stack frame.) For example: 3.2. Data, Code, and Memory in C/C++ a()’s stack frame saved CPU registers return address aLocalsA1[5] localA2 a()’s stack frame saved CPU registers return address aLocalsA1[5] localA2 a()’s stack frame saved CPU registers return address aLocalsA1[5] localA2 b()’s stack frame saved CPU registers return address localB1 localB2 b()’s stack frame saved CPU registers return address localB1 localB2 saved CPU registers return address localC 1 c()’s stack frame function a() is called function b() is called function c() is called Figure 3.10. Stack frames. 118 3. Fundamentals of Software Engineering for Games void someFunction() { U32 anInteger; // ... } Pushing and popping stack frames is usually implemented by adjusting the value of a single register in the CPU, known as the stack pointer. Figure 3.10 illustrates what happens when the functions shown below are executed. void c() { U32 localC1; // ... } F32 b() { F32 localB1; I32 localB2; // ... c(); // call function c() // ... return localB1; } void a() { U32 aLocalsA1[5]; // ... F32 localA2 = b(); // call function b() // ... } When a function containing automatic variables returns, its stack frame is abandoned and all automatic variables in the function should be treated as if they no longer exist. Technically, the memory occupied by those variables is still there in the abandoned stack frame—but that memory will very likely be overwritt en as soon as another function is called. A common error involves returning the address of a local variable, like this: 119 U32* getMeaningOfLife() { U32 anInteger = 42; return &anInteger; } You might get away with this if you use the returned pointer immediately and don’t call any other functions in the interim. But more oft en than not, this kind of code will crash—in ways that can be diffi cult to debug. 3.2.3.3. Dynamic Allocation Heap Thus far, we’ve seen that a program’s data can be stored as global or static variables or as local variables. The globals and statics are allocated within the executable image, as defi ned by the data and BSS segments of the executable fi le. The locals are allocated on the program stack. Both of these kinds of stor- age are statically defi ned, meaning that the size and layout of the memory is known when the program is compiled and linked. However, a program’s memory requirements are oft en not fully known at compile time. A program usually needs to allocate additional memory dynamically. To allow for dynamic allocation, the operating system maintains a block of memory that can be allocated by a running program by calling malloc() and later returned to the pool for use by other programs by calling free(). This memory block is known as heap memory, or the free store. When we al- locate memory dynamically, we sometimes say that this memory resides on the heap. 
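For example, a C-style dynamic allocation and deallocation might look like this (a minimal sketch):

#include <cstdlib>   // for malloc() and free()

void someFunction()
{
    // Request a 1024-byte block from the heap at runtime.
    void* pBuffer = malloc(1024);

    if (pBuffer != NULL)
    {
        // ... fill and use the buffer ...

        // Return the block to the heap so it can be reused.
        free(pBuffer);
    }
}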
In C++, the global new and delete operators are used to allocate and free memory to and from the heap. Be wary, however—individual classes may overload these operators to allocate memory in custom ways, and even the global new and delete operators can be overloaded, so you cannot simply as- sume that new is always allocating from the heap. We will discuss dynamic memory allocation in more depth in Chap- ter 6. For additional information, see htt p://en.wikipedia.org/wiki/Dynamic_ memory_allocation. 3.2.4. Member Variables C structs and C++ classes allow variables to be grouped into logical units. It’s important to remember that a class or struct declaration allocates no memory. It is merely a description of the layout of the data—a cookie cutt er which can be used to stamp out instances of that struct or class later on. For example: 3.2. Data, Code, and Memory in C/C++ 120 3. Fundamentals of Software Engineering for Games struct Foo // struct declaration { U32 mUnsignedValue; F32 mFloatValue; bool mBooleanValue; }; Once a struct or class has been declared, it can be allocated (defi ned) in any of the ways that an atomic data type can be allocated, for example, as an automatic variable, on the program stack; void someFunction() { Foo localFoo; // ... } as a global, fi le-static or function-static; Foo gFoo; static Foo sFoo; void someFunction() { static Foo sLocalFoo; // ... } dynamically allocated from the heap. In this case, the pointer or refer- ence variable used to hold the address of the data can itself be allocated as an automatic, global, static, or even dynamically. Foo* gpFoo = NULL; // global pointer to a Foo void someFunction() { // allocate a Foo instance from the heap gpFoo = new Foo; // ... // allocate another Foo, assign to local // pointer Foo* pAnotherFoo = new Foo; // ... // allocate a POINTER to a Foo from the heap Foo** ppFoo = new Foo*; (*ppFoo) = pAnotherFoo; } 121 3.2.4.1. Class-Static Members As we’ve seen, the static keyword has many diff erent meanings depending on context: When used at fi le scope, static means “restrict the visibility of this variable or function so it can only be seen inside this .cpp fi le.” When used at function scope, static means “this variable is a global, not an automatic, but it can only be seen inside this function.” When used inside a struct or class declaration, static means “this variable is not a regular member variable, but instead acts just like a global.” Notice that when static is used inside a class declaration, it does not control the visibility of the variable (as it does when used at fi le scope)— rather, it diff erentiates between regular per-instance member variables and per-class variables that act like globals. The visibility of a class-static variable is determined by the use of public:, protected: or private: keywords in the class declaration. Class-static variables are automatically included within the namespace of the class or struct in which they are declared. So the name of the class or struct must be used to disambigu- ate the variable whenever it is used outside that class or struct (e.g., Foo::sVarName). Like an extern declaration for a regular global variable, the declaration of a class-static variable within a class allocates no memory. The memory for the class-static variable must be defi ned in a .cpp fi le. For example: foo.h class Foo { public: static F32 sClassStatic; // allocates no // memory! }; foo.cpp F32 Foo::sClassStatic = -1.0f; // define memory and // init 3.2.5. 
Object Layout in Memory It’s useful to be able to visualize the memory layout of your classes and structs. This is usually prett y straightforward—we can simply draw a box for the struct or class, with horizontal lines separating data members. An 3.2. Data, Code, and Memory in C/C++ 122 3. Fundamentals of Software Engineering for Games example of such a diagram for the struct Foo listed below is shown in Fig- ure 3.11. struct Foo { U32 mUnsignedValue; F32 mFloatValue; I32 mSignedValue; }; The sizes of the data members are important and should be represented in your diagrams. This is easily done by using the width of each data member to indicate its size in bits—i.e., a 32-bit integer should be roughly four times the width of an 8-bit integer (see Figure 3.12). struct Bar { U32 mUnsignedValue; F32 mFloatValue; bool mBooleanValue; // diagram assumes this is 8 bits }; 3.2.5.1. Alignment and Packing As we start to think more carefully about the layout of our structs and classes in memory, we may start to wonder what happens when small data members are interspersed with larger members. For example: struct InefficientPacking { U32 mU1; // 32 bits F32 mF2; // 32 bits U8 mB3; // 8 bits I32 mI4; // 32 bits bool mB5; // 8 bits char* mP6; // 32 bits }; You might imagine that the compiler simply packs the data members into memory as tightly as it can. However, this is not usually the case. Instead, the compiler will typically leave “holes” in the layout, as depicted in Fig- ure 3.13. (Some compilers can be requested not to leave these holes by us- ing a preprocessor directive like #pragma pack , or via command-line op- tions; but the default behavior is to space out the members as shown in Fig- ure 3.13.) mU1 mF2 mB3 mI4 mB5 mP6 +0x0 +0x4 +0x8 +0xC +0x10 +0x14 Figure 3.13. Ineffi cient struct packing due to mixed data member sizes. mUnsignedValue mFloatValue mSignedValue +0x0 +0x4 +0x8 Figure 3.11. Memory layout of a simple struct. mUnsignedValue mFloatValue mBooleanValue +0x0 +0x4 +0x8 Figure 3.12. A memory layout using width to indicate member sizes. 123 Why does the compiler leave these “holes?” The reason lies in the fact that every data type has a natural alignment which must be respected in order to permit the CPU to read and write memory eff ectively. The alignment of a data object refers to whether its address in memory is a multiple of its size (which is generally a power of two): An object with one-byte alignment resides at any memory address. An object with two-byte alignment resides only at even addresses (i.e., addresses whose least signifi cant nibble is 0x0, 0x2, 0x4, 0x8, 0xA, 0xC, or 0xE). An object with four-byte alignment resides only at addresses that are a multiple of four (i.e., addresses whose least signifi cant nibble is 0x0, 0x4, 0x8, or 0xC). A 16-byte aligned object resides only at addresses that are a multiple of 16 (i.e., addresses whose least signifi cant nibble is 0x0). Alignment is important because many modern processors can actually only read and write properly aligned blocks of data. For example, if a program requests that a 32-bit (four-byte) integer be read from address 0x6A341174, the memory controller will load the data happily because the address is four-byte aligned (in this case, its least signifi cant nibble is 0x4). However, if a request is made to load a 32-bit integer from address 0x6A341173, the memory control- ler now has to read two four-byte blocks: the one at 0x6A341170 and the one at 0x6A341174. 
It must then mask and shift the two parts of the 32-bit integer and logically OR them together into the destination register on the CPU. This is shown in Figure 3.14. Some microprocessors don’t even go this far. If you request a read or write of unaligned data, you might just get garbage. Or your program might just crash altogether! (The PlayStation 2 is a notable example of this kind of intol- erance for unaligned data.) Diff erent data types have diff erent alignment requirements. A good rule of thumb is that a data type should be aligned to a boundary equal to the width of the data type in bytes. For example, 32-bit values generally have a four-byte alignment requirement, 16-bit values should be two-byte aligned, and 8-bit values can be stored at any address (one-byte aligned). On CPUs that support SIMD vector math, the SIMD registers each contain four 32-bit fl oats, for a total of 128 bits or 16 bytes. And as you would guess, a four-fl oat SIMD vector typically has a 16-byte alignment requirement. This brings us back to those “holes” in the layout of struct Ineffi- cientPacking shown in Figure 3.13. When smaller data types like 8-bit bools are interspersed with larger types like 32-bit integers or floats in a structure 3.2. Data, Code, and Memory in C/C++ 124 3. Fundamentals of Software Engineering for Games or class, the compiler introduces padding (holes) in order to ensure that every- thing is properly aligned. It’s a good idea to think about alignment and pack- ing when declaring your data structures. By simply rearranging the members of struct InefficientPacking from the example above, we can eliminate some of the wasted padding space, as shown below and in Figure 3.15: struct MoreEfficientPacking { U32 mU1; // 32 bits (4-byte aligned) F32 mF2; // 32 bits (4-byte aligned) I32 mI4; // 32 bits (4-byte aligned) char* mP6; // 32 bits (4-byte aligned) U8 mB3; // 8 bits (1-byte aligned) bool mB5; // 8 bits (1-byte aligned) }; You’ll notice in Figure 3.15 that the size of the structure as a whole is now 20 bytes, not 18 bytes as we might expect, because it has been padded by two bytes at the end. This padding is added by the compiler to ensure proper alignment of the structure in an array context. That is, if an array of these structs is defi ned and the fi rst element of the array is aligned, then the padding at the end guarantees that all subsequent elements will also be aligned properly. The alignment of a structure as a whole is equal to the largest alignment requirement among its members. In the example above, the largest mem- ber alignment is four-byte, so the structure as a whole should be four-byte CPU alignedValue 0x6A341170 0x6A341174 0x6A341178 register -alignedValue 0x6A341170 0x6A341174 0x6A341178 un- -alignedValue un-shift shift -alignedValueun- Aligned read from 0x6A341174 Unaligned read from 0x6A341173 CPU register Figure 3.14. Aligned and unaligned reads of a 32-bit integer. (pad) mU1 mF2 mB3 mI4 mB5 mP6 +0x0 +0x4 +0x8 +0xC +0x10 Figure 3.15. More ef- fi cient packing by grouping small mem- bers together. 125 aligned. I usually like to add explicit padding to the end of my structs, to make the wasted space visible and explicit, like this: struct BestPacking { U32 mU1; // 32 bits (4-byte aligned) F32 mF2; // 32 bits (4-byte aligned) I32 mI4; // 32 bits (4-byte aligned) char* mP6; // 32 bits (4-byte aligned) U8 mB3; // 8 bits (1-byte aligned) bool mB5; // 8 bits (1-byte aligned) U8 _pad[2]; // explicit padding }; 3.2.5.2. 
Memory Layout of C++ Classes Two things make C++ classes a litt le diff erent from C structures in terms of memory layout: inheritance and virtual functions. When class B inherits from class A, B’s data members simply appear im- mediately aft er A’s in memory, as shown in Figure 3.16. Each new derived class simply tacks its data members on at the end, although alignment re- quirements may introduce padding between the classes. (Multiple inheritance does some whacky things, like including multiple copies of a single base class in the memory layout of a derived class. We won’t cover the details here, be- cause game programmers usually prefer to avoid multiple inheritance alto- gether anyway.) If a class contains or inherits one or more virtual functions, then four ad- ditional bytes (or however many bytes a pointer occupies on the target hard- ware) are added to the class layout, typically at the very beginning of the class’ layout. These four bytes are collectively called the virtual table pointer or vpointer, because they contain a pointer to a data structure known as the virtual function table or vtable. The vtable for a particular class contains pointers to all the virtual functions that it declares or inherits. Each concrete class has its own virtual table, and every instance of that class has a pointer to it, stored in its vpointer. The virtual function table is at the heart of polymorphism, because it al- lows code to be writt en that is ignorant of the specifi c concrete classes it is deal- ing with. Returning to the ubiquitous example of a Shape base class with de- rived classes for Circle, Rectangle, and Triangle, let’s imagine that Shape defi nes a virtual function called Draw(). The derived classes all override this function, providing distinct implementations named Circle::Draw(), Rectangle::Draw(), and Triangle::Draw(). The virtual table for any class derived from Shape will contain an entry for the Draw() function, but that entry will point to diff erent function implementations, depending on the 3.2. Data, Code, and Memory in C/C++ A B +0x0 +sizeof(A) Figure 3.16. Effect of inheritance on class layout. 126 3. Fundamentals of Software Engineering for Games concrete class. Circle’s vtable will contain a pointer to Circle::Draw(), while Rectangle’s virtual table will point to Rectangle::Draw(), and Tri- angle’s vtable will point to Triangle::Draw(). Given an arbitrary point- er to a Shape (Shape* pShape), the code can simply dereference the vtable pointer, look up the Draw() function’s entry in the vtable, and call it. The result will be to call Circle::Draw() when pShape points to an instance of Circle, Rectangle::Draw() when pShape points to a Rectangle, and Triangle::Draw() when pShape points to a Triangle. These ideas are illustrated by the following code excerpt. Notice that the base class Shape defi nes two virtual functions, SetId() and Draw(), the lat- ter of which is declared to be pure virtual. (This means that Shape provides no default implementation of the Draw() function, and derived classes must override it if they want to be instantiable.) Class Circle derives from Shape, adds some data members and functions to manage its center and radius, and overrides the Draw()function; this is depicted in Figure 3.17. Class Triangle also derives from Shape. It adds an array of Vector3 objects to store its three vertices and adds some functions to get and set the individual vertices. 
Class Triangle overrides Draw() as we’d expect, and for illustrative purposes it also overrides SetId(). The memory image generated by the Triangle class is shown in Figure 3.18. class Shape { public: virtual void SetId(int id) { m_id = id; } int GetId() const { return m_id; } virtual void Draw() = 0; // pure virtual – no impl. private: int m_id; }; Shape::m_id Circle::m_center Circle::m_radius vtable pointer pointer to SetId () pointer to Draw () +0x00 +0x04 +0x08 +0x14 pShape 1 Instance of Circle Circle’s Virtual Table Circle ::Draw () { // code to draw a Circle } Shape ::SetId (int id ) { m _id = id; } Figure 3.17. pShape1 points to an instance of class Circle. 127 class Circle : public Shape { public: void SetCenter(const Vector3& c) { m_center=c; } Vector3 GetCenter() const { return m_center; } void SetRadius(float r) { m_radius = r; } float GetRadius() const { return m_radius; } virtual void Draw() { // code to draw a circle } private: Vector3 m_center; float m_radius; }; class Triangle : public Shape { public: void SetVertex(int i, const Vector3& v); Vector3 GetVertex(int i) const { return m_vtx[i]; } virtual void Draw() { // code to draw a triangle } virtual void SetId(int id) { Shape::SetId(id); Figure 3.18. pShape2 points to an instance of class Triangle. Shape::m_id Triangle ::m_vtx[0] Triangle ::m_vtx[1] vtable pointer pointer to SetId () pointer to Draw () +0x00 +0x04 +0x08 +0x14 pShape 2 Instance of Triangle Triangle’s Virtual Table Triangle ::Draw() { // code to draw a Triangle } Triangle ::SetId (int id ) { Shape ::SetId (id); // do additional work // specific to Triangles } Triangle ::m_vtx[2]+0x20 3.2. Data, Code, and Memory in C/C++ 128 3. Fundamentals of Software Engineering for Games // do additional work specific to Triangles... } private: Vector3 m_vtx[3]; }; // ----------------------------- void main(int, char**) { Shape* pShape1 = new Circle; Shape* pShape2 = new Triangle; // ... pShape1->Draw(); pShape2->Draw(); // ... } 3.3. Catching and Handling Errors There are a number of ways to catch and handle error conditions in a game engine. As a game programmer, it’s important to understand these diff erent mechanisms, their pros and cons, and when to use each one. 3.3.1. Types of Errors In any soft ware project there are two basic kinds of error conditions: user er- rors and programmer errors. A user error occurs when the user of the program does something incorrect, such as typing an invalid input, att empting to open a fi le that does not exist, etc. A programmer error is the result of a bug in the code itself. Although it may be triggered by something the user has done, the essence of a programmer error is that the problem could have been avoided if the programmer had not made a mistake, and the user has a reasonable expec- tation that the program should have handled the situation gracefully. Of course, the defi nition of “user” changes depending on context. In the context of a game project, user errors can be roughly divided into two catego- ries: errors caused by the person playing the game and errors caused by the people who are making the game during development. It is important to keep track of which type of user is aff ected by a particular error and handle the er- ror appropriately. 129 There’s actually a third kind of user—the other programmers on your team. (And if you are writing a piece of game middleware soft ware, like Havok or OpenGL, this third category extends to other programmers all over the world who are using your library.) 
This is where the line between user er- rors and programmer errors gets blurry. Let’s imagine that programmer A writes a function f(), and programmer B tries to call it. If B calls f() with invalid arguments (e.g., a NULL pointer, or an out-of-range array index), then this could be seen as a user error by programmer A, but it would be a program- mer error from B’s point of view. (Of course, one can also argue that program- mer A should have anticipated the passing of invalid arguments and should have handled them gracefully, so the problem really is a programmer error, on A’s part.) The key thing to remember here is that the line between user and programmer can shift depending on context—it is rarely a black-and-white distinction. 3.3.2. Handling Errors When handling errors, the requirements diff er signifi cantly between the two types. It is best to handle user errors as gracefully as possible, displaying some helpful information to the user and then allowing him or her to continue working—or in the case of a game, to continue playing. Programmer errors, on the other hand, should not be handled with a graceful “inform and contin- ue” policy. Instead, it is usually best to halt the program and provide detailed low-level debugging information, so that a programmer can quickly identify and fi x the problem. In an ideal world, all programmer errors would be caught and fi xed before the soft ware ships to the public. 3.3.2.1. Handling Player Errors When the “user” is the person playing your game, errors should obviously be handled within the context of gameplay. For example, if the player att empts to reload a weapon when no ammo is available, an audio cue and/or an anima- tion can indicate this problem to the player without taking him or her “out of the game.” 3.3.2.2. Handling Developer Errors When the “user” is someone who is making the game, such as an artist, ani- mator or game designer, errors may be caused by an invalid asset of some sort. For example, an animation might be associated with the wrong skeleton, or a texture might be the wrong size, or an audio fi le might have been sampled at an unsupported sample rate. For these kinds of developer errors, there are two competing camps of thought. 3.3. Catching and Handling Errors 130 3. Fundamentals of Software Engineering for Games On the one hand, it seems important to prevent bad game assets from persisting for too long. A game typically contains literally thousands of assets, and a problem asset might get “lost,” in which case one risks the possibility of the bad asset surviving all the way into the fi nal shipping game. If one takes this point of view to an extreme, then the best way to handle bad game assets is to prevent the entire game from running whenever even a single problem- atic asset is encountered. This is certainly a strong incentive for the developer who created the invalid asset to remove or fi x it immediately. On the other hand, game development is a messy and iterative process, and generating “perfect” assets the fi rst time around is rare indeed. By this line of thought, a game engine should be robust to almost any kind of problem imaginable, so that work can continue even in the face of totally invalid game asset data. But this too is not ideal, because the game engine would become bloated with error-catching and error-handling code that won’t be needed once the development pace sett les down and the game ships. And the prob- ability of shipping the product with “bad” assets becomes too high. 
In my experience, the best approach is to fi nd a middle ground between these two extremes. When a developer error occurs, I like to make the error obvious and then allow the team to continue to work in the presence of the problem. It is extremely costly to prevent all the other developers on the team from working, just because one developer tried to add an invalid asset to the game. A game studio pays its employees well, and when multiple team mem- bers experience downtime, the costs are multiplied by the number of people who are prevented from working. Of course, we should only handle errors in this way when it is practical to do so, without spending inordinate amounts of engineering time, or bloating the code. As an example, let’s suppose that a particular mesh cannot be loaded. In my view, it’s best to draw a big red box in the game world at the places that mesh would have been located, perhaps with a text string hovering over each one that reads, “Mesh blah-dee-blah failed to load.” This is superior to printing an easy-to-miss message to an error log. And it’s far bett er than just crashing the game, because then no one will be able to work until that one mesh refer- ence has been repaired. Of course, for particularly egregious problems it’s fi ne to just spew an error message and crash. There’s no silver bullet for all kinds of problems, and your judgment about what type of error handling approach to apply to a given situation will improve with experience. 3.3.2.3. Handling Programmer Errors The best way to detect and handle programmer errors (a.k.a. bugs) is oft en to embed error-checking code into your source code and arrange for failed 131 error checks to halt the program. Such a mechanism is known as an assertion system; we’ll investigate assertions in detail in Section 3.3.3.3. Of course, as we said above, one programmer’s user error is another programmer’s bug; hence, assertions are not always the right way to handle every programmer error. Making a judicious choice between an assertion and a more graceful error handling technique is a skill that one develops over time. 3.3.3. Implementation of Error Detection and Handling We’ve looked at some philosophical approaches to handling errors. Now let’s turn our att ention to the choices we have as programmers when it comes to implementing error detection and handling code. 3.3.3.1. Error Return Codes A common approach to handling errors is to return some kind of failure code from the function in which the problem is fi rst detected. This could be a Bool- ean value indicating success or failure or it could be an “impossible” value, one that is outside the range of normally returned results. For example, a function that returns a positive integer or fl oating-point value could return a negative value to indicate that an error occurred. Even bett er than a Boolean or an “impossible” return value, the function could be designed to return an enu- merated value to indicate success or failure. This clearly separates the error code from the output(s) of the function, and the exact nature of the problem can be indicated on failure (e.g., enum Error { kSuccess, kAssetNot- Found, kInvalidRange, ... };). The calling function should intercept error return codes and act appro- priately. It might handle the error immediately. Or it might work around the problem, complete its own execution, and then pass the error code on to what- ever function called it. 3.3.3.2. 
Exceptions Error return codes are a simple and reliable way to communicate and respond to error conditions. However, error return codes have their drawbacks. Per- haps the biggest problem with error return codes is that the function that detects an error may be totally unrelated to the function that is capable of handling the problem. In the worst-case scenario, a function that is 40 calls deep in the call stack might detect a problem that can only be handled by the top-level game loop, or by main(). In this scenario, every one of the 40 functions on the call stack would need to be writt en so that it can pass an appropriate error code all the way back up to the top-level error-handling function. 3.3. Catching and Handling Errors 132 3. Fundamentals of Software Engineering for Games One way to solve this problem is to throw an exception. Structured excep- tion handling (SEH) is a very powerful feature of C++. It allows the function that detects a problem to communicate the error to the rest of the code with- out knowing anything about which function might handle the error. When an exception is thrown, relevant information about the error is placed into a data object of the programmer’s choice known as an exception object. The call stack is then automatically unwound, in search of a calling function that wrapped its call in a try-catch block. If a try-catch block is found, the exception object is matched against all possible catch blocks and if a match is found, the cor- responding catch block’s code is executed. The destructors of any automatic variables are called as needed during the stack unwinding. The ability to separate error detection from error handling in such a clean way is certainly att ractive, and exception handling is an excellent choice for some soft ware projects. However, SEH adds a lot of overhead to the program. Every stack frame must be augmented to contain additional information re- quired by the stack unwinding process. Also, the stack unwind is usually very slow—on the order of two to three times more expensive than simply return- ing from the function. Also, if even one function in your program (or a library that your program links with) uses SEH, your entire program must use SEH. The compiler can’t know which functions might be above you on the call stack when you throw an exception. Therefore, there’s a prett y strong argument for turning off structured ex- ception handling in your game engine altogether. This is the approach em- ployed at Naughty Dog and also on most of the projects I’ve worked on at Electronic Arts and Midway. Console game engines should probably never use SEH, because of a console’s limited memory and processing bandwidth. However, a game engine that is intended to be run on a personal computer might be able to use SEH without any problems. There are many interesting articles on this topic on the web. Here are links to a few of them: htt p://www.joelonsoft ware.com/items/2003/10/13.html htt p://www.nedbatchelder.com/text/exceptions-vs-status.html htt p://www.joelonsoft ware.com/items/2003/10/15.html 3.3.3.3. Assertions An assertion is a line of code that checks an expression. If the expression evalu- ates to true, nothing happens. But if the expression evaluates to false, the pro- gram is stopped, a message is printed, and the debugger is invoked if possible. Steve Maguire provides an in-depth discussion of assertions in his must-read book, Writing Solid Code [30]. 133 Assertions check a programmer’s assumptions. 
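For example, a function that assumes it will always be given a valid player index might state that assumption explicitly (a minimal sketch; the names are hypothetical, and it uses the ASSERT() macro developed below):

static const int kMaxPlayers = 4;          // hypothetical
static float gPlayerHealth[kMaxPlayers];   // hypothetical

void setHealth(int playerIndex, float health)
{
    // We assume the caller always passes a valid index and a sensible
    // health value. If either assumption is ever violated, that's a bug
    // in the calling code, and we want to know the moment it happens.
    ASSERT(playerIndex >= 0 && playerIndex < kMaxPlayers);
    ASSERT(health >= 0.0f && health <= 100.0f);

    gPlayerHealth[playerIndex] = health;
}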
They act like land mines for bugs. They check the code when it is fi rst writt en to ensure that it is func- tioning properly. They also ensure that the original assumptions continue to hold for long periods of time, even when the code around them is constantly changing and evolving. For example, if a programmer changes code that used to work, but accidentally violates its original assumptions, they’ll hit the land mine. This immediately informs the programmer of the problem and permits him or her to rectify the situation with minimum fuss. Without assertions , bugs have a tendency to “hide out” and manifest themselves later in ways that are diffi cult and time-consuming to track down. But with as- sertions embedded in the code, bugs announce themselves the moment they are introduced—which is usually the best moment to fi x the problem, while the code changes that caused the problem are fresh in the programmer’s mind. Assertions are implemented as a #define macro, which means that the assertion checks can be stripped out of the code if desired, by simply changing the #define. The cost of the assertion checks can usually be tolerated during development, but stripping out the assertions prior to shipping the game can buy back that litt le bit of crucial performance if necessary. Assertion Implementation Assertions are usually implemented via a combination of a #defined macro that evaluates to an if/else clause, a function that is called when the asser- tion fails (the expression evaluates to false), and a bit of assembly code that halts the program and breaks into the debugger when one is att ached. Here’s a typical implementation: #if ASSERTIONS_ENABLED // define some inline assembly that causes a break // into the debugger – this will be different on each // target CPU #define debugBreak() asm { int 3 } // check the expression and fail if it is false #define ASSERT(expr) \ if (expr) { } \ else \ { \ reportAssertionFailure(#expr, \ __FILE__, \ __LINE__); \ debugBreak(); \ } 3.3. Catching and Handling Errors 134 3. Fundamentals of Software Engineering for Games #else #define ASSERT(expr) // evaluates to nothing #endif Let’s break down this defi nition so we can see how it works: The outer #if/#else/#endif is used to strip assertions from the code base. When ASSERTIONS_ENABLED is nonzero, the ASSERT() macro is defi ned in its fully glory, and all assertion checks in the code will be included in the program. But when assertions are turned off , ASSERT(expr) evaluates to nothing, and all instances of it in the code are eff ectively removed. The debugBreak() macro evaluates to whatever assembly-language instructions are required in order to cause the program to halt and the debugger to take charge (if one is connected). This diff ers from CPU to CPU, but it is usually a single assembly instruction. The ASSERT() macro itself is defi ned using a full if/else statement (as opposed to a lone if). This is done so that the macro can be used in any context, even within other unbracketed if/else statements. Here’s an example of what would happen if ASSERT() were defi ned using a solitary if: #define ASSERT(expr) if (!(expr)) debugBreak() void f() { if (a < 5) ASSERT(a >= 0); else doSomething(a); } This expands to the following incorrect code: void f() { if (a < 5) if (!(a >= 0)) debugBreak(); else // Oops! Bound to the wrong if()! doSomething(a); } The else clause of an ASSERT() macro does two things. 
It displays some kind of message to the programmer indicating what went wrong, 135 and then it breaks into the debugger. Notice the use of #expr as the fi rst argument to the message display function. The pound (#) preprocessor operator causes the expression expr to be turned into a string, thereby allowing it to be printed out as part of the assertion failure message. Notice also the use of __FILE__ and __LINE__. These compiler-defi ned macros magically contain the .cpp fi le name and line number of the line of code on which they appear. By passing them into our message dis- play function, we can print the exact location of the problem. I highly recommend the use of assertions in your code. However, it’s im- portant to be aware of their performance cost. You may want to consider de- fi ning two kinds of assertion macros. The regular ASSERT() macro can be left active in all builds, so that errors are easily caught even when not running in debug mode. A second assertion macro, perhaps called SLOW_ASSERT(), could be activated only in debug builds. This macro could then be used in places where the cost of assertion checking is too high to permit inclusion in release builds. Obviously SLOW_ASSERT() is of lower utility, because it is stripped out of the version of the game that your testers play every day. But at least these assertions become active when programmers are debugging their code. It’s also extremely important to use assertions properly. They should be used to catch bugs in the program itself—never to catch user errors. Also, as- sertions should always cause the entire game to halt when they fail. It’s usu- ally a bad idea to allow assertions to be skipped by testers, artists, designers, and other non-engineers. (This is a bit like the boy who cried wolf: if assertions can be skipped, then they cease to have any signifi cance, rendering them inef- fective.) In other words, assertions should only be used to catch fatal errors. If it’s OK to continue past an assertion, then it’s probably bett er to notify the user of the error in some other way, such as with an on-screen message, or some ugly bright-orange 3D graphics. For a great discussion on the proper usage of assertions, see htt p://www.wholesalealgorithms.com/blog9. 3.3. Catching and Handling Errors 137 4 3D Math for Games A game is a mathematical model of a virtual world simulated in real-time on a computer of some kind. Therefore, mathematics pervades everything we do in the game industry. Game programmers make use of virtually all branches of mathematics, from trigonometry to algebra to statistics to calculus. How- ever, by far the most prevalent kind of mathematics you’ll be doing as a game programmer is 3D vector and matrix math (i.e., 3D linear algebra ). Even this one branch of mathematics is very broad and very deep, so we cannot hope to cover it in any great depth in a single chapter. Instead, I will att empt to provide an overview of the mathematical tools needed by a typical game programmer. Along the way, I’ll off er some tips and tricks which should help you keep all of the rather confusing concepts and rules straight in your head. For an excellent in-depth coverage of 3D math for games, I highly rec- ommend Eric Lengyel’s book on the topic [28]. 4.1. Solving 3D Problems in 2D Many of the mathematical operations we’re going to learn about in the follow- ing chapter work equally well in 2D and 3D. 
This is very good news, because it means you can sometimes solve a 3D vector problem by thinking and draw- ing pictures in 2D (which is considerably easier to do!) Sadly, this equivalence 138 4. 3D Math for Games Figure 4.1. A point rep- resented in Cartesian coordinates. Py x z y Pz Px P Ph P h r Pr θ Pθ Figure 4.2. A point represent- ed in cylindrical coordinates. between 2D and 3D does not hold all the time. Some operations, like the cross product, are only defi ned in 3D, and some problems only make sense when all three dimensions are considered. Nonetheless, it almost never hurts to start by thinking about a simplifi ed two-dimensional version of the problem at hand. Once you understand the solution in 2D, you can think about how the prob- lem extends into three dimensions. In some cases, you’ll happily discover that your 2D result works in 3D as well. In others, you’ll be able to fi nd a coor- dinate system in which the problem really is two-dimensional. In this book, we’ll employ two-dimensional diagrams wherever the distinction between 2D and 3D is not relevant. 4.2. Points and Vectors The majority of modern 3D games are made up of three-dimensional objects in a virtual world. A game engine needs to keep track of the positions, orien- tations, and scales of all these objects, animate them in the game world, and transform them into screen space so they can be rendered on screen. In games, 3D objects are almost always made up of triangles, the vertices of which are represented by points. So before we learn how to represent whole objects in a game engine, let’s fi rst take a look the point and its closely related cousin, the vector. 4.2.1. Points and Cartesian Coordinates Technically speaking, a point is a location in n-dimensional space. (In games, n is usually equal to 2 or 3.) The Cartesian coordinate system is by far the most common coordinate system employed by game programmers. It uses two or three mutually perpendicular axes to specify a position in 2D or 3D space. So a point P is represented by a pair or triple of real numbers, (Px , Py) or (Px , Py , Pz). Of course, the Cartesian coordinate system is not our only choice. Some other common systems include: Cylindrical coordinates . This system employs a vertical “height” axis h, a radial axis r emanating out from the vertical, and a yaw angle theta (θ). In cylindrical coordinates, a point P is represented by the triple of num- bers (Ph , Pr , Pθ). This is illustrated in Figure 4.2. Spherical coordinates . This system employs a pitch angle phi (φ), a yaw angle theta (θ), and a radial measurement r. Points are therefore rep- resented by the triple of numbers (Pr , Pφ , Pθ). This is illustrated in Fig- ure 4.3. 139 4.2. Points and Vectors Figure 4.3. A point represented in spherical coordinates. r θ φ Pr P Pθ Pφ Right-Handed x z y Left-Handed x y z Figure 4.4. Left- and right-handed Cartesian coordinate systems. Cartesian coordinates are by far the most widely used coordinate system in game programming. However, always remember to select the coordinate system that best maps to the problem at hand. For example, in the game Crank the Weasel by Midway Home Entertainment, the main character Crank runs around an art-deco city picking up loot. I wanted to make the items of loot swirl around Crank’s body in a spiral, gett ing closer and closer to him until they disappeared. I represented the position of the loot in cylindrical coor- dinates relative to the Crank character’s current position. 
To implement the spiral animation, I simply gave the loot a constant angular speed in θ, a small constant linear speed inward along its radial axis r, and a very slight constant linear speed upward along the h-axis so the loot would gradually rise up to the level of Crank’s pants pockets. This extremely simple animation looked great, and it was much easier to model using cylindrical coordinates than it would have been using a Cartesian system. 4.2.2. Left-Handed vs. Right-Handed Coordinate Systems In three-dimensional Cartesian coordinates, we have two choices when ar- ranging our three mutually perpendicular axes: right-handed (RH) and left - handed (LH). In a right-handed coordinate system, when you curl the fi ngers of your right hand around the z-axis with the thumb pointing toward positive z coordinates, your fi ngers point from the x-axis toward the y-axis. In a left - handed coordinate system the same thing is true using your left hand. The only diff erence between a left -handed coordinate system and a right- handed coordinate system is the direction in which one of the three axes is pointing. For example, if the y-axis points upward and x points to the right, then z comes toward us (out of the page) in a right-handed system, and away from us (into the page) in a left -handed system. Left - and right-handed Carte- sian coordinate systems are depicted in Figure 4.4. 140 4. 3D Math for Games It is easy to convert from LH to RH coordinates and vice-versa. We sim- ply fl ip the direction of any one axis, leaving the other two axes alone. It’s important to remember that the rules of mathematics do not change between LH and RH coordinate systems. Only our interpretation of the numbers—our mental image of how the numbers map into 3D space—changes. Left -handed and right-handed conventions apply to visualization only, not to the underly- ing mathematics. (Actually, handedness does matt er when dealing with cross products in physical simulations, but we can safely ignore these subtleties for the majority of our game programming tasks. For more information, see htt p://en.wikipedia.org/wiki/Pseudovector .) The mapping between the numerical representation and the visual repre- sentation is entirely up to us as mathematicians and programmers. We could choose to have the y-axis pointing up, with z forward and x to the left (RH) or right (LH). Or we could choose to have the z-axis point up. Or the x-axis could point up instead—or down. All that matt ers is that we decide upon a mapping, and then stick with it consistently. That being said, some conventions do tend to work bett er than others for certain applications. For example, 3D graphics programmers typically work with a left -handed coordinate system, with the y-axis pointing up, x to the right and positive z pointing away from the viewer (i.e., in the direction the virtual camera is pointing). When 3D graphics are rendered onto a 2D screen using this particular coordinate system, increasing z-coordinates correspond to increasing depth into the scene (i.e., increasing distance away from the vir- tual camera). As we will see in subsequent chapters, this is exactly what is required when using a z-buff ering scheme for depth occlusion. 4.2.3. Vectors A vector is a quantity that has both a magnitude and a direction in n-dimensional space. A vector can be visualized as a directed line segment extending from a point called the tail to a point called the head. 
Contrast this to a scalar (i.e., an ordinary real-valued number), which represents a magnitude but has no di- rection. Usually scalars are writt en in italics (e.g., v) while vectors are writt en in boldface (e.g., v). A 3D vector can be represented by a triple of scalars (x, y, z), just as a point can be. The distinction between points and vectors is actually quite subtle. Technically, a vector is just an off set relative to some known point. A vector can be moved anywhere in 3D space—as long as its magnitude and direction don’t change, it is the same vector. A vector can be used to represent a point, provided that we fi x the tail of the vector to the origin of our coordinate system. Such a vector is sometimes 141 4.2. Points and Vectors called a position vector or radius vector. For our purposes, we can interpret any triple of scalars as either a point or a vector, provided that we remember that a position vector is constrained such that its tail remains at the origin of the chosen coordinate system. This implies that points and vectors are treated in subtly diff erent ways mathematically. One might say that points are absolute, while vectors are relative. The vast majority of game programmers use the term “vector” to refer both to points (position vectors) and to vectors in the strict linear algebra sense (purely directional vectors). Most 3D math libraries also use the term “vector” in this way. In this book, we’ll use the term “direction vector ” or just “direc- tion” when the distinction is important. Be careful to always keep the diff er- ence between points and directions clear in your mind (even if your math library doesn’t). As we’ll see in Section 4.3.6.1, directions need to be treated diff erently from points when converting them into homogeneous coordinates for manipulation with 4 × 4 matrices, so gett ing the two types of vector mixed up can and will lead to bugs in your code. 4.2.3.1. Cartesian Basis Vectors It is oft en useful to defi ne three orthogonal unit vectors (i.e., vectors that are mu- tually perpendicular and each with a length equal to one), corresponding to the three principal Cartesian axes. The unit vector along the x-axis is typically called i, the y-axis unit vector is called j, and the z-axis unit vector is called k. The vectors i, j, and k are sometimes called Cartesian basis ve ctors . Any point or vector can be expressed as a sum of scalars (real numbers) multiplied by these unit basis vectors. For example, (5, 3, –2) = 5i + 3j – 2k. 4.2.4. Vector Operations Most of the mathematical operations that you can perform on scalars can be applied to vectors as well. There are also some new operations that apply only to vectors. 4.2.4.1. Multiplication by a Scalar Multiplication of a vector a by a scalar s is accomplished by multiplying the individual components of a by s: sa = ( sax , say , saz ). Multiplication by a scalar has the eff ect of scaling the magnitude of the vector, while leaving its direction unchanged, as shown in Figure 4.5. Multi- plication by –1 fl ips the direction of the vector (the head becomes the tail and vice-versa). 142 4. 3D Math for Games a + b –b b a a – b Figure 4.6. Vector addition and subtraction. The scale factor can be diff erent along each axis. We call this nonuniform scale , and it can be represented as the component-wise product of a scaling vector s and the vector in question, which we’ll denote with the ⊗ operator. Techni- cally speaking, this special kind of product between two vectors is known as the Hadamard product . 
It is rarely used in the game industry—in fact, nonuniform scaling is one of its only commonplace uses in games:

s ⊗ a = (sx ax , sy ay , sz az).   (4.1)

As we'll see in Section 4.3.7.3, a scaling vector s is really just a compact way to represent a 3 × 3 diagonal scaling matrix S. So another way to write Equation (4.1) is as follows:

aS = [ax ay az] [sx 0 0; 0 sy 0; 0 0 sz] = [sx ax  sy ay  sz az].

Figure 4.5. Multiplication of a vector by the scalar 2.

4.2.4.2. Addition and Subtraction
The addition of two vectors a and b is defined as the vector whose components are the sums of the components of a and b. This can be visualized by placing the head of vector a onto the tail of vector b—the sum is then the vector from the tail of a to the head of b:
a + b = [ (ax + bx), (ay + by), (az + bz) ].
Vector subtraction a – b is nothing more than addition of a and –b (i.e., the result of scaling b by –1, which flips it around). This corresponds to the vector whose components are the difference between the components of a and the components of b:
a – b = [ (ax – bx), (ay – by), (az – bz) ].
Vector addition and subtraction are depicted in Figure 4.6.

Adding and Subtracting Points and Directions
You can add and subtract direction vectors freely. However, technically speaking, points cannot be added to one another—you can only add a direction vector to a point, the result of which is another point. Likewise, you can take the difference between two points, resulting in a direction vector. These operations are summarized below:
direction + direction = direction
direction – direction = direction
point + direction = point
point – point = direction
point + point = nonsense (don't do it!)

4.2.4.3. Magnitude
The magnitude of a vector is a scalar representing the length of the vector as it would be measured in 2D or 3D space. It is denoted by placing vertical bars around the vector's boldface symbol. We can use the Pythagorean theorem to calculate a vector's magnitude, as shown in Figure 4.7:
|a| = √(ax² + ay² + az²).

Figure 4.7. Magnitude of a vector (shown in 2D for ease of illustration).

4.2.4.4. Vector Operations in Action
Believe it or not, we can already solve all sorts of real-world game problems given just the vector operations we've learned thus far. When trying to solve a problem, we can use operations like addition, subtraction, scaling, and magnitude to generate new data out of the things we already know. For example, if we have the current position vector of an A.I. character P1, and a vector v representing her current velocity, we can find her position on the next frame P2 by scaling the velocity vector by the frame time interval Δt, and then adding it to the current position. As shown in Figure 4.8, the resulting vector equation is P2 = P1 + (Δt)v. (This is known as explicit Euler integration—it's actually only valid when the velocity is constant, but you get the idea.)
As another example, let's say we have two spheres, and we want to know whether they intersect. Given that we know the center points of the two spheres, C1 and C2, we can find a direction vector between them by simply subtracting the points, d = C2 – C1. The magnitude of this vector d = |d| determines how far apart the spheres' centers are. If this distance is less than the sum of the spheres' radii, they are intersecting; otherwise they're not. This is shown in Figure 4.9.
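Here's one way this test might look in code: a minimal sketch using a hypothetical Vector3 struct rather than any particular engine's math library. Note that it compares squared quantities instead of taking a square root, for reasons explained next.

struct Vector3 { float x, y, z; };

// Hypothetical sketch of a sphere-sphere intersection test. It compares the
// squared center-to-center distance against the squared sum of the radii.
bool SpheresIntersect(const Vector3& c1, float r1,
                      const Vector3& c2, float r2)
{
    Vector3 d;                 // d = C2 - C1
    d.x = c2.x - c1.x;
    d.y = c2.y - c1.y;
    d.z = c2.z - c1.z;

    float distSq = d.x*d.x + d.y*d.y + d.z*d.z;   // |d|^2
    float radiusSum = r1 + r2;
    return distSq <= radiusSum * radiusSum;
}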
Square roots are expensive to calculate on most computers, so game programmers should always use the squared magnitude whenever it is valid to do so:
|a|² = ax² + ay² + az².
Using the squared magnitude is valid when comparing the relative lengths of two vectors ("is vector a longer than vector b?"), or when comparing a vector's magnitude to some other (squared) scalar quantity. So in our sphere-sphere intersection test, we should calculate d² = |d|² and compare this to the squared sum of the radii, (r1 + r2)², for maximum speed. When writing high-performance software, never take a square root when you don't have to!

Figure 4.8. Simple vector addition can be used to find a character's position in the next frame, given her position and velocity in the current frame.

Figure 4.9. A sphere-sphere intersection test involves only vector subtraction, vector magnitude, and floating-point comparison operations.

4.2.4.5. Normalization and Unit Vectors
A unit vector is a vector with a magnitude (length) of one. Unit vectors are very useful in 3D mathematics and game programming, for reasons we'll see below.
Given an arbitrary vector v of length v = |v|, we can convert it to a unit vector u that points in the same direction as v, but has unit length. To do this, we simply multiply v by the reciprocal of its magnitude. We call this normalization:
u = v / |v| = (1/v) v.

4.2.4.6. Normal Vectors
A vector is said to be normal to a surface if it is perpendicular to that surface. Normal vectors are highly useful in games and computer graphics. For example, a plane can be defined by a point and a normal vector. And in 3D graphics, lighting calculations make heavy use of normal vectors to define the direction of surfaces relative to the direction of the light rays impinging upon them. Normal vectors are usually of unit length, but they do not need to be. Be careful not to confuse the term "normalization" with the term "normal vector." A normalized vector is any vector of unit length. A normal vector is any vector that is perpendicular to a surface, whether or not it is of unit length.

4.2.4.7. Dot Product and Projection
Vectors can be multiplied, but unlike scalars there are a number of different kinds of vector multiplication. In game programming, we most often work with the following two kinds of multiplication: the dot product (a.k.a. scalar product or inner product), and the cross product (a.k.a. vector product or outer product).
The dot product of two vectors yields a scalar; it is defined by adding the products of the individual components of the two vectors:
a ⋅ b = ax bx + ay by + az bz  (a scalar).
The dot product can also be written as the product of the magnitudes of the two vectors and the cosine of the angle between them:
a ⋅ b = |a| |b| cos(θ).
The dot product is commutative (i.e., the order of the two vectors can be reversed) and distributive over addition:
a ⋅ b = b ⋅ a;
a ⋅ (b + c) = a ⋅ b + a ⋅ c.
And the dot product combines with scalar multiplication as follows:
(sa) ⋅ b = a ⋅ (sb) = s(a ⋅ b).

Vector Projection
If u is a unit vector (|u| = 1), then the dot product (a ⋅ u) represents the length of the projection of vector a onto the infinite line defined by the direction of u, as shown in Figure 4.10. This projection concept works equally well in 2D or 3D and is highly useful for solving a wide variety of three-dimensional problems.

Figure 4.10. Vector projection using the dot product.
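As a concrete illustration, here is a rough sketch (reusing the hypothetical Vector3 struct from the earlier sketch) that projects a vector a onto the direction of a unit vector u:

// Hypothetical sketch: project vector a onto the line defined by the unit
// vector u. The dot product (a . u) is the signed length of the projection;
// scaling u by that length yields the projected vector itself.
float Dot(const Vector3& a, const Vector3& b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

Vector3 ProjectOnto(const Vector3& a, const Vector3& u)  // u assumed unit length
{
    float length = Dot(a, u);
    Vector3 p;
    p.x = u.x * length;
    p.y = u.y * length;
    p.z = u.z * length;
    return p;
}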
Magnitude as a Dot Product
The squared magnitude of a vector can be found by taking the dot product of that vector with itself. Its magnitude is then easily found by taking the square root:
|a|² = a ⋅ a;
|a| = √(a ⋅ a).
This works because the cosine of zero degrees is 1, so all that is left is |a| |a| = |a|².

Dot Product Tests
Dot products are great for testing if two vectors are collinear or perpendicular, or whether they point in roughly the same or roughly opposite directions. For any two arbitrary vectors a and b, game programmers often use the following tests, as shown in Figure 4.11:
Collinear. (a ⋅ b) = |a| |b| = ab (i.e., the angle between them is exactly 0 degrees—this dot product equals +1 when a and b are unit vectors).
Collinear but opposite. (a ⋅ b) = –ab (i.e., the angle between them is 180 degrees—this dot product equals –1 when a and b are unit vectors).
Perpendicular. (a ⋅ b) = 0 (i.e., the angle between them is 90 degrees).
Same direction. (a ⋅ b) > 0 (i.e., the angle between them is less than 90 degrees).
Opposite directions. (a ⋅ b) < 0 (i.e., the angle between them is greater than 90 degrees).

Figure 4.11. Some common dot product tests.

Some Other Applications of the Dot Product
Dot products can be used for all sorts of things in game programming. For example, let's say we want to find out whether an enemy is in front of the player character or behind him. We can find a vector from the player's position P to the enemy's position E by simple vector subtraction (v = E – P). Let's assume we have a vector f pointing in the direction that the player is facing. (As we'll see in Section 4.3.10.3, the vector f can be extracted directly from the player's model-to-world matrix.) The dot product d = v ⋅ f can be used to test whether the enemy is in front of or behind the player—it will be positive when the enemy is in front and negative when the enemy is behind.
The dot product can also be used to find the height of a point above or below a plane (which might be useful when writing a moon-landing game, for example). We can define a plane with two vector quantities: a point Q lying anywhere on the plane, and a unit vector n that is perpendicular (i.e., normal) to the plane. To find the height h of a point P above the plane, we first calculate a vector from any point on the plane (Q will do nicely) to the point in question P. So we have v = P – Q. The dot product of vector v with the unit-length normal vector n is just the projection of v onto the line defined by n. But that is exactly the height we're looking for. Therefore,
h = v ⋅ n = (P – Q) ⋅ n.
This is illustrated in Figure 4.12.

Figure 4.12. The dot product can be used to find the height of a point above or below a plane.

4.2.4.8. Cross Product
The cross product (also known as the outer product or vector product) of two vectors yields another vector that is perpendicular to the two vectors being multiplied, as shown in Figure 4.13. The cross product operation is only defined in three dimensions:
a × b = [ (ay bz – az by), (az bx – ax bz), (ax by – ay bx) ]
      = (ay bz – az by) i + (az bx – ax bz) j + (ax by – ay bx) k.

Magnitude of the Cross Product
The magnitude of the cross product vector is the product of the magnitudes of the two vectors and the sine of the angle between them. (This is similar to the definition of the dot product, but it replaces the cosine with the sine.)
|a × b| = |a| |b| sin(θ).
The magnitude of the cross product is equal to the area of the parallelogram whose sides are a and b, as shown in Figure 4.14. Since a triangle is one-half of a parallelogram, the area of a triangle whose vertices are specified by the position vectors V1, V2, and V3 can be calculated as one-half of the magnitude of the cross product of any two of its sides:
Atriangle = ½ |(V2 – V1) × (V3 – V1)|.

Figure 4.13. The cross product of vectors a and b (right-handed).

Figure 4.14. Area of a parallelogram expressed as the magnitude of a cross product.

Direction of the Cross Product
When using a right-handed coordinate system, you can use the right-hand rule to determine the direction of the cross product. Simply cup your fingers such that they point in the direction you'd rotate vector a to move it on top of vector b, and the cross product (a × b) will be in the direction of your thumb.
Note that the cross product is defined by the left-hand rule when using a left-handed coordinate system. This means that the direction of the cross product changes depending on the choice of coordinate system. This might seem odd at first, but remember that the handedness of a coordinate system does not affect the mathematical calculations we carry out—it only changes our visualization of what the numbers look like in 3D space. When converting from a RH system to a LH system or vice-versa, the numerical representations of all the points and vectors stay the same, but one axis flips. Our visualization of everything is therefore mirrored along that flipped axis. So if a cross product just happens to align with the axis we're flipping (e.g., the z-axis), it needs to flip when the axis flips. If it didn't, the mathematical definition of the cross product itself would have to be changed so that the z-coordinate of the cross product comes out negative in the new coordinate system. I wouldn't lose too much sleep over all of this. Just remember: when visualizing a cross product, use the right-hand rule in a right-handed coordinate system and the left-hand rule in a left-handed coordinate system.

Properties of the Cross Product
The cross product is not commutative (i.e., order matters):
a × b ≠ b × a.
However, it is anti-commutative:
a × b = –b × a.
The cross product is distributive over addition:
a × (b + c) = (a × b) + (a × c).
And it combines with scalar multiplication as follows:
(sa) × b = a × (sb) = s(a × b).
The Cartesian basis vectors are related by cross products as follows:
(j × k) = –(k × j) = i,
(k × i) = –(i × k) = j,
(i × j) = –(j × i) = k.
These three cross products define the direction of positive rotations about the Cartesian axes. The positive rotations go from x to y (about z), from y to z (about x) and from z to x (about y). Notice how the rotation about the y-axis "reversed" alphabetically, in that it goes from z to x (not from x to z). As we'll see below, this gives us a hint as to why the matrix for rotation about the y-axis looks inverted when compared to the matrices for rotation about the x- and z-axes.

The Cross Product in Action
The cross product has a number of applications in games. One of its most common uses is for finding a vector that is perpendicular to two other vectors.
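In code, the component formula above translates directly into a small function. The following is again a rough sketch using the hypothetical Vector3 struct from the earlier examples, not any particular engine's library:

// Hypothetical sketch of the cross product component formula. The result is
// perpendicular to both a and b (assuming a and b are not parallel).
Vector3 Cross(const Vector3& a, const Vector3& b)
{
    Vector3 c;
    c.x = a.y*b.z - a.z*b.y;
    c.y = a.z*b.x - a.x*b.z;
    c.z = a.x*b.y - a.y*b.x;
    return c;
}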
As we'll see in Section 4.3.10.2, if we know an object's local unit basis vectors (ilocal, jlocal, and klocal), we can easily find a matrix representing the object's orientation. Let's assume that all we know is the object's klocal vector—i.e., the direction in which the object is facing. If we assume that the object has no roll about klocal, then we can find ilocal by taking the cross product between klocal (which we already know) and the world-space up vector jworld (which equals [0 1 0]). We do so as follows: ilocal = normalize(jworld × klocal). We can then find jlocal by simply crossing ilocal and klocal as follows: jlocal = klocal × ilocal.
A very similar technique can be used to find a unit vector normal to the surface of a triangle or some other plane. Given three points on the plane P1, P2, and P3, the normal vector is just n = normalize[(P2 – P1) × (P3 – P1)].
Cross products are also used in physics simulations. When a force is applied to an object, it will give rise to rotational motion if and only if it is applied off-center. This rotational force is known as a torque, and it is calculated as follows. Given a force F, and a vector r from the center of mass to the point at which the force is applied, the torque N = r × F.

4.2.5. Linear Interpolation of Points and Vectors
In games, we often need to find a vector that is midway between two known vectors. For example, if we want to smoothly animate an object from point A to point B over the course of two seconds at 30 frames per second, we would need to find 60 intermediate positions between A and B.
A linear interpolation is a simple mathematical operation that finds an intermediate point between two known points. The name of this operation is often shortened to LERP. The operation is defined as follows, where β ranges from 0 to 1 inclusive:
L = LERP(A, B, β) = (1 – β)A + βB
  = [ (1 – β)Ax + βBx , (1 – β)Ay + βBy , (1 – β)Az + βBz ].
Geometrically, L = LERP(A, B, β) is the position vector of a point that lies β percent of the way along the line segment from point A to point B, as shown in Figure 4.15. Mathematically, the LERP function is just a weighted average of the two input vectors, with weights (1 – β) and β, respectively. Notice that the weights always add to 1, which is a general requirement for any weighted average.

Figure 4.15. Linear interpolation (LERP) between points A and B, with β = 0.4.

4.3. Matrices
A matrix is a rectangular array of m × n scalars. Matrices are a convenient way of representing linear transformations such as translation, rotation, and scale. A matrix M is usually written as a grid of scalars Mrc enclosed in square brackets, where the subscripts r and c represent the row and column indices of the entry, respectively. For example, if M is a 3 × 3 matrix, it could be written as follows:
M = [ M11 M12 M13 ; M21 M22 M23 ; M31 M32 M33 ].
We can think of the rows and/or columns of a 3 × 3 matrix as 3D vectors. When all of the row and column vectors of a 3 × 3 matrix are of unit magnitude, we call it a special orthogonal matrix. This is also known as an isotropic matrix, or an orthonormal matrix. Such matrices represent pure rotations.
Under certain constraints, a 4 × 4 matrix can represent arbitrary 3D transformations, including translations, rotations, and changes in scale. These are called transformation matrices, and they are the kinds of matrices that will be most useful to us as game engineers. The transformations represented by a matrix are applied to a point or vector via matrix multiplication. We'll investigate how this works below.
An affine matrix is a 4 × 4 transformation matrix that preserves parallelism of lines and relative distance ratios, but not necessarily absolute lengths and angles. An affine matrix is any combination of the following operations: rotation, translation, scale and/or shear.

4.3.1. Matrix Multiplication
The product P of two matrices A and B is written P = AB. If A and B are transformation matrices, then the product P is another transformation matrix that performs both of the original transformations. For example, if A is a scale matrix and B is a rotation, the matrix P would both scale and rotate the points or vectors to which it is applied. This is particularly useful in game programming, because we can precalculate a single matrix that performs a whole sequence of transformations and then apply all of those transformations to a large number of vectors efficiently.
To calculate a matrix product, we simply take dot products between the rows of the nA × mA matrix A and the columns of the nB × mB matrix B. Each dot product becomes one component of the resulting matrix P. The two matrices can be multiplied as long as the inner dimensions are equal (i.e., mA = nB). For example, if A and B are 3 × 3 matrices, then P = AB, where
P11 = Arow1 ⋅ Bcol1 ;  P12 = Arow1 ⋅ Bcol2 ;  P13 = Arow1 ⋅ Bcol3 ;
P21 = Arow2 ⋅ Bcol1 ;  P22 = Arow2 ⋅ Bcol2 ;  P23 = Arow2 ⋅ Bcol3 ;
P31 = Arow3 ⋅ Bcol1 ;  P32 = Arow3 ⋅ Bcol2 ;  P33 = Arow3 ⋅ Bcol3 .
Matrix multiplication is not commutative. In other words, the order in which matrix multiplication is done matters: AB ≠ BA. We'll see exactly why this matters in Section 4.3.2.
Matrix multiplication is often called concatenation, because the product of n transformation matrices is a matrix that concatenates, or chains together, the original sequence of transformations in the order the matrices were multiplied.

4.3.2. Representing Points and Vectors as Matrices
Points and vectors can be represented as row matrices (1 × n) or column matrices (n × 1), where n is the dimension of the space we're working with (usually 2 or 3). For example, the vector v = (3, 4, –1) can be written either as
v = [3 4 –1],
or as its transpose, a column matrix:
vT = [3; 4; –1].
The choice between column and row vectors is a completely arbitrary one, but it does affect the order in which matrix multiplications are written. This happens because when multiplying matrices, the inner dimensions of the two matrices must be equal, so: to multiply a 1 × n row vector by an n × n matrix, the vector must appear to the left of the matrix (v′ = vM), whereas to multiply an n × n matrix by an n × 1 column vector, the vector must appear to the right of the matrix (v′ = Mv).
If multiple transformation matrices A, B, and C are applied in order to a vector v, the transformations "read" from left to right when using row vectors, but from right to left when using column vectors. The easiest way to remember this is to realize that the matrix closest to the vector is applied first. This is illustrated by the parentheses below:
v′ = ( ( ( vA ) B ) C )   Row vectors: read left-to-right;
v′ = ( C ( B ( Av ) ) )   Column vectors: read right-to-left.
In this book we'll adopt the row vector convention, because the left-to-right order of transformations is most intuitive to read for English-speaking people.
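To make the convention concrete, here is a rough sketch of a row-vector-times-matrix multiply, using the hypothetical Vector3 struct from earlier plus an equally hypothetical Matrix33 type rather than any specific engine's math library. With row vectors, a concatenated transform reads left to right: Mul(Mul(v, A), B) applies A first and then B.

struct Matrix33 { float m[3][3]; };  // hypothetical 3x3 matrix, one row per m[i]

// Row vector times matrix: each component of the result is the dot product of
// the row vector v with one column of M.
Vector3 Mul(const Vector3& v, const Matrix33& M)
{
    Vector3 r;
    r.x = v.x*M.m[0][0] + v.y*M.m[1][0] + v.z*M.m[2][0];
    r.y = v.x*M.m[0][1] + v.y*M.m[1][1] + v.z*M.m[2][1];
    r.z = v.x*M.m[0][2] + v.y*M.m[1][2] + v.z*M.m[2][2];
    return r;
}

// Usage: Vector3 vPrime = Mul(Mul(v, A), B);  // A is applied first, then B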
That said, be very careful to check which convention is used by your game engine, and by other books, papers, or web pages you may read. You can usually tell by seeing whether vector-matrix multiplications are written with the vector on the left (for row vectors) or the right (for column vectors) of the matrix. When using column vectors, you'll need to transpose all the matrices shown in this book.

4.3.3. The Identity Matrix
The identity matrix is a matrix that, when multiplied by any other matrix, yields the very same matrix. It is usually represented by the symbol I. The identity matrix is always a square matrix with 1's along the diagonal and 0's everywhere else:
I3×3 = [ 1 0 0 ; 0 1 0 ; 0 0 1 ];
AI = IA ≡ A.

4.3.4. Matrix Inversion
The inverse of a matrix A is another matrix (denoted A–1) that undoes the effects of matrix A. So, for example, if A rotates objects by 37 degrees about the z-axis, then A–1 will rotate by –37 degrees about the z-axis. Likewise, if A scales objects to be twice their original size, then A–1 scales objects to be half-sized. When a matrix is multiplied by its own inverse, the result is always the identity matrix, so
A(A–1) ≡ (A–1)A ≡ I.
Not all matrices have inverses. However, all affine matrices (combinations of pure rotations, translations, scales, and shears) do have inverses. Gaussian elimination or LU decomposition can be used to find the inverse, if one exists.
Since we'll be dealing with matrix multiplication a lot, it's important to note here that the inverse of a sequence of concatenated matrices can be written as the reverse concatenation of the individual matrices' inverses. For example, (ABC)–1 = C–1 B–1 A–1.

4.3.5. Transposition
The transpose of a matrix M is denoted MT. It is obtained by reflecting the entries of the original matrix across its diagonal. In other words, the rows of the original matrix become the columns of the transposed matrix, and vice-versa. For example, the transpose of [ a b c ; d e f ; g h i ] is [ a d g ; b e h ; c f i ].
The transpose is useful for a number of reasons. For one thing, the inverse of an orthonormal (pure rotation) matrix is exactly equal to its transpose—which is good news, because it's much cheaper to transpose a matrix than it is to find its inverse in general. Transposition can also be important when moving data from one math library to another, because some libraries use column vectors while others expect row vectors. The matrices used by a row-vector-based library will be transposed relative to those used by a library that employs the column vector convention.
As with the inverse, the transpose of a sequence of concatenated matrices can be rewritten as the reverse concatenation of the individual matrices' transposes. For example, (ABC)T = CT BT AT. This will prove useful when we consider how to apply transformation matrices to points and vectors.

4.3.6. Homogeneous Coordinates
You may recall from high-school algebra that a 2 × 2 matrix can represent a rotation in two dimensions. To rotate a vector r through an angle of φ degrees (where positive rotations are counter-clockwise), we can write
[ r′x  r′y ] = [ rx  ry ] [ cos φ  sin φ ; –sin φ  cos φ ].
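As a quick numerical sanity check of this formula, the following minimal sketch rotates the vector (1, 0) counter-clockwise by 90 degrees and should produce approximately (0, 1):

#include <cmath>
#include <cstdio>

// Sketch: rotate the 2D row vector r = (1, 0) by phi = 90 degrees using
// [r'x r'y] = [rx ry][cos(phi) sin(phi); -sin(phi) cos(phi)].
int main()
{
    const float phi = 3.14159265f * 0.5f;  // 90 degrees, in radians
    const float rx = 1.0f, ry = 0.0f;

    float rxPrime = rx * cosf(phi) - ry * sinf(phi);
    float ryPrime = rx * sinf(phi) + ry * cosf(phi);

    printf("(%f, %f)\n", rxPrime, ryPrime);  // prints approximately (0, 1)
    return 0;
}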
The two-dimensional example above is really just a three- dimensional rotation about the z-axis, so we can write The question naturally arises: Can a 3 × 3 matrix be used to represent translations? Sadly, the answer is no. The result of translating a point r by a translation t requires adding the components of t to the components of r in- dividually: Matrix multiplication involves multiplication and addition of matrix ele- ments, so the idea of using multiplication for translation seems promising. But, unfortunately, there is no way to arrange the components of t within a 3 × 3 matrix such that the result of multiplying it with the column vector r yields sums like (rx + tx). The good news is that we can obtain sums like this if we use a 4 × 4 matrix. What would such a matrix look like? Well, we know that we don’t want any rotational eff ects, so the upper 3 × 3 should contain an identity matrix. If we arrange the components of t across the bott om-most row of the matrix and set the fourth element of the r vector (usually called w) equal to 1, then taking the dot product of the vector r with column 1 of the matrix will yield (1 × rx) + (0 × ry) + (0 × rz) + (tx × 1) = (rx + tx), which is exactly what we want. If the bott om right-hand corner of the matrix contains a 1 and the rest of the fourth column contains zeros, then the resulting vector will also have a 1 in its w component. Here’s what the fi nal 4 × 4 translation matrix looks like: When a point or vector is extended from three dimensions to four in this manner, we say that it has been writt en in homogeneous coordinates. A point in homogeneous coordinates always has w = 1. Most of the 3D matrix math done by game engines is performed using 4 × 4 matrices with four-element points and vectors writt en in homogeneous coordinates. cos sin 0 [][] sincos0. 0 01 xyz xyzrrr rrr ⎡⎤φφ ⎢⎥′′′= −φ φ⎢⎥ ⎢⎥⎣⎦ [( )( )( )].xxyyzzrt rt rt+= + + +rt 1000 0 1 00 [1] 0010 1 [( )( )( )1]. xyz xyz xxyyzz rrr t rt rt rt ⎡⎤ ⎢⎥ ⎢⎥+= ⎢⎥ ⎢⎥⎣⎦ =+ + + rt tt 156 4. 3D Math for Games 4.3.6.1. Transforming Direction Vectors Mathematically, points (position vectors) and direction vectors are treated in subtly diff erent ways. When transforming a point by a matrix, the translation, rotation, and scale of the matrix are all applied to the point. But when trans- forming a direction by a matrix, the translational eff ects of the matrix are ig- nored. This is because direction vectors have no translation per se—applying a translation to a direction would alter its magnitude, which is usually not what we want. In homogeneous coordinates, we achieve this by defi ning points to have their w components equal to one, while direction vectors have their w com- ponents equal to zero. In the example below, notice how the w = 0 component of the vector v multiplies with the t vector in the matrix, thereby eliminating translation in the fi nal result: Technically, a point in homogeneous (four-dimensional) coordinates can be converted into non-homogeneous (three-dimensional) coordinates by di- viding the x, y, and z components by the w component: This sheds some light on why we set a point’s w component to one and a vec- tor’s w component to zero. Dividing by w = 1 has no eff ect on the coordinates of a point, but dividing a pure direction vector’s components by w = 0 would yield infi nity. A point at infi nity in 4D can be rotated but not translated, be- cause no matt er what translation we try to apply, the point will remain at in- fi nity. 
So in eff ect, a pure direction vector in three-dimensional space acts like a point at infi nity in four-dimensional homogeneous space. 4.3.7. Atomic Transformation Matrices Any affi ne transformation matrix can be created by simply concatenating a sequence of 4 × 4 matrices representing pure translations, pure rotations, pure scale operations, and/or pure shears. These atomic transformation building blocks are presented below. (We’ll omit shear from these discussions, as it tends to be used only rarely in games.) Notice that all affi ne 4 × 4 transformation matrices can be partitioned into four components: [ 0] [( 0 ) 0] [ 0].1 ⎡⎤=+ =⎢⎥⎣⎦ U0 vvU tvUt [] .yxzxyzw www ⎡⎤≡⎢⎥⎣⎦ 33 31 13 .1 ×× × ⎡⎤ ⎢⎥⎣⎦ U0 t 157 4.3. Matrices the upper 3 × 3 matrix U, which represents the rotation and/or scale, a 1 × 3 translation vector t, a 3 × 1 vector of zeros 0 = [ 0 0 0 ]T, and a scalar 1 in the bott om-right corner of the matrix. When a point is multiplied by a matrix that has been partitioned like this, the result is as follows: 4.3.7.1. Translation The following matrix translates a point by the vector t: or in partitioned shorthand: To invert a pure translation matrix, simply negate the vector t (i.e., negate tx , ty , and tz). 4.3.7.2. Rotation All 4 × 4 pure rotation matrices have the form: The t vector is zero and the upper 3 × 3 matrix R contains cosines and sines of the rotation angle, measured in radians. The following matrix represents rotation about the x-axis by an angle φ: 33 31 13 13 13 [ 1] [ 1] [( ) 1].1 ×× ×× × ⎡⎤′ = =+⎢⎥⎣⎦ U0 rr rU tt 1000 0 1 00 [ 1] 0010 1 [( )( )( )1], xyz xyz xxyyzz rrr t rt rt rt ⎡⎤ ⎢⎥ ⎢⎥+= ⎢⎥ ⎢⎥⎣⎦ =+ + + rt tt [ 1] [( ) 1].1 ⎡⎤=+⎢⎥⎣⎦ I0 rrtt [ 1] [ 1].1 ⎡⎤=⎢⎥⎣⎦ R0 rrR0 1 0 00 0 cos sin 0 rotate ( , ) [ 1] .0 sin cos 0 0 0 01 x xyzrrr ⎡⎤ ⎢⎥φφ⎢⎥φ= ⎢⎥−φ φ ⎢⎥⎣⎦ r 158 4. 3D Math for Games The matrix below represents rotation about the y-axis by an angle θ. Notice that this one is transposed relative to the other two—the positive and negative sine terms have been refl ected across the diagonal: This matrix represents rotation about the z-axis by an angle γ: Here are a few observations about these matrices: The 1 within the upper 3 × 3 always appears on the axis we’re rotating about, while the sine and cosine terms are off -axis. Positive rotations go from x to y (about z), from y to z (about x), and from z to x (about y). The z to x rotation “wraps around,” which is why the rotation matrix about the y-axis is transposed relative to the other two. (Use the right-hand or left -hand rule to remember this.) The inverse of a pure rotation is just its transpose. This works because inverting a rotation is equivalent to rotating by the negative angle. You may recall that cos(–θ) = cos(θ) while sin(–θ) = –sin(θ), so negating the angle causes the two sine terms to eff ectively switch places, while the cosine terms stay put. 4.3.7.3. Scale The following matrix scales the point r by a factor of sx along the x-axis, sy along the y-axis, and sz along the z-axis: cos 0 sin 0 01 0 0 rotate ( , ) [ 1] .sin 0 cos 0 00 0 1 y xyzrrr θ −θ⎡⎤ ⎢⎥ ⎢⎥θ= ⎢⎥θθ ⎢⎥⎣⎦ r cos sin 0 0 sin cos 0 0 rotate ( , ) [ 1] .0 0 10 0 0 01 z xyzrrr γγ⎡⎤ ⎢⎥−γ γ⎢⎥γ= ⎢⎥ ⎢⎥⎣⎦ r 0 00 0 00 [1] 00 0 0 0 01 [ 1]. x y xyz z xxyyzz s s rrr s sr sr sr ⎡⎤ ⎢⎥ ⎢⎥= ⎢⎥ ⎢⎥⎣⎦ = rS 159 4.3. Matrices or in partitioned shorthand: Here are some observations about this kind of matrix: To invert a scaling matrix, simply substitute sx , sy , and sz with their re- ciprocals (i.e., 1/sx , 1/sy , and 1/sz). 
When the scale factor along all three axes is the same (sx = sy = sz), we call this uniform scale. Spheres remain spheres under uniform scale, whereas under nonuniform scale they become ellipsoids. To keep the mathemat- ics of bounding sphere checks simple and fast, many game engines im- pose the restriction that only uniform scale may be applied to render- able geometry or collision primitives. When a uniform scale matrix Su and a rotation matrix R are concat- enated, the order of multiplication is unimportant (i.e., SuR = RSu). This only works for uniform scale! 4.3.8. 4 × 3 Matrices The rightmost column of an affi ne 4 × 4 matrix always contains the vector [ 0 0 0 1 ]T. As such, game programmers oft en omit the fourth column to save memory. You’ll encounter 4 × 3 affi ne matrices frequently in game math libraries. 4.3.9. Coordinate Spaces We’ve seen how to apply transformations to points and direction vectors us- ing 4 × 4 matrices. We can extend this idea to rigid objects by realizing that such an object can be thought of as an infi nite collection of points. Applying a transformation to a rigid object is like applying that same transformation to every point within the object. For example, in computer graphics an object is usually represented by a mesh of triangles, each of which has three vertices represented by points. In this case, the object can be transformed by applying a transformation matrix to all of its vertices in turn. We said above that a point is a vector whose tail is fi xed to the origin of some coordinate system. This is another way of saying that a point (position vector) is always expressed relative to a set of coordinate axes. The triplet of numbers representing a point changes numerically whenever we select a new set of coordinate axes. In Figure 4.16, we see a point P represented by two diff erent position vectors—the vector PA gives the position of P relative to the 33 33[ 1] [ 1].1 × × ⎡⎤=⎢⎥⎣⎦ S0 rrS0 160 4. 3D Math for Games “A” axes, while the vector PB gives the position of that same point relative to a diff erent set of axes “B.” In physics, a set of coordinate axes represents a frame of reference, so we sometimes refer to a set of axes as a coordinate frame (or just a frame). People in the game industry also use the term coordinate space (or simply space) to refer to a set of coordinate axes. In the following sections, we’ll look at a few of the most common coordinate spaces used in games and computer graphics. 4.3.9.1. Model Space When a triangle mesh is created in a tool such as Maya or 3DStudioMAX, the positions of the triangles’ vertices are specifi ed relative to a Cartesian coordi- nate system which we call model space (also known as object space or local space). The model space origin is usually placed at a central location within the object, such as at its center of mass, or on the ground between the feet of a humanoid or animal character. Most game objects have an inherent directionality. For example, an air- plane has a nose, a tail fi n, and wings that correspond to the front, up, and left /right directions. The model space axes are usually aligned to these natural directions on the model, and they’re given intuitive names to indicate their directionality as illustrated in Figure 4.17. Front. This name is given to the axis that points in the direction that the object naturally travels or faces. In this book, we’ll use the symbol F to refer to a unit basis vector along the front axis. Up. 
This name is given to the axis that points towards the top of the object. The unit basis vector along this axis will be denoted U. Left or right. The name “left ” or “right” is given to the axis that points toward the left or right side of the object. Which name is chosen de- pends on whether your game engine uses left -handed or right-handed xA yA xB yB PA = (2, 3) PB = (1, 5) Figure 4.16. Position vectors for the point P relative to different coordinate axes. 161 4.3. Matrices coordinates. The unit basis vector along this axis will be denoted L or R, as appropriate. The mapping between the (front, up, left ) labels and the (x, y, z) axes is com- pletely arbitrary. A common choice when working with right-handed axes is to assign the label front to the positive z-axis, the label left to the positive x-axis, and the label up to the positive y-axis (or in terms of unit basis vectors, F = k, L = i, and U = j). However, it’s equally common for +x to be front and +z to be right (F = i, R = k, U = j). I’ve also worked with engines in which the z-axis is oriented vertically. The only real requirement is that you stick to one conven- tion consistently throughout your engine. As an example of how intuitive axis names can reduce confusion, consid- er Euler angles (pitch, yaw, roll), which are oft en used to describe an aircraft ’s orientation. It’s not possible to defi ne pitch, yaw, and roll angles in terms of the (i, j, k) basis vectors because their orientation is arbitrary. However, we can defi ne pitch, yaw, and roll in terms of the (L, U, F) basis vectors, because their orientations are clearly defi ned. Specifi cally, pitch is rotation about L or R, yaw is rotation about U, and roll is rotation about F. 4.3.9.2. World Space World space is a fi xed coordinate space, in which the positions, orientations, and scales of all objects in the game world are expressed. This coordinate space ties all the individual objects together into a cohesive virtual world. The location of the world-space origin is arbitrary, but it is oft en placed near the center of the playable game space to minimize the reduction in fl oat- ing-point precision that can occur when (x, y, z) coordinates grow very large. Likewise, the orientation of the x-, y-, and z-axes is arbitrary, although most le front up Figure 4.17. One possible choice of the model-space front, left and up axis basis vectors for an airplane. 162 4. 3D Math for Games of the engines I’ve encountered use either a y-up or a z-up convention. The y-up convention was probably an extension of the two-dimensional conven- tion found in most mathematics textbooks, where the y-axis is shown going up and the x-axis going to the right. The z-up convention is also common, be- cause it allows a top-down orthographic view of the game world to look like a traditional two-dimensional xy-plot. As an example, let’s say that our aircraft ’s left wingtip is at (5, 0, 0) in mod- el space. (In our game, front vectors correspond to the positive z-axis in model space with y up, as shown in Figure 4.17.) Now imagine that the jet is facing down the positive x-axis in world space, with its model-space origin at some arbitrary location, such as (–25, 50, 8). Because the F vector of the airplane, which corresponds to +z in model space, is facing down the +x-axis in world space, we know that the jet has been rotated by 90 degrees about the world y-axis. So if the aircraft were sitt ing at the world space origin, its left wingtip would be at (0, 0, –5) in world space. 
But because the aircraft ’s origin has been translated to (–25, 50, 8), the fi nal position of the jet’s left wingtip in model space is (–25, 50, [8 – 5]) = (–25, 50, 3). This is illustrated in Figure 4.18. We could of course populate our friendly skies with more than one Lear jet. In that case, all of their left wingtips would have coordinates of (5, 0, 0) in model space. But in world space, the left wingtips would have all sorts of interesting coordinates, depending on the orientation and translation of each aircraft . 4.3.9.3. View Space View space (also known as camera space) is a coordinate frame fi xed to the cam- era. The view space origin is placed at the focal point of the camera. Again, any axis orientation scheme is possible. However, a y-up convention with z Airport z W xW xM z M (5,0,0)M (–25,50,3)W (–25,50,8)W Aircraft: Left Wingtip: Figure 4.18. A lear jet whose left wingtip is at (5, 0, 0) in model space. If the jet is rotated by 90 degrees about the world-space y-axis, and its model-space origin translated to (–25, 50, 8) in world space, then its left wingtip would end up at (–25, 50, 3) when expressed in world space coordinates. 163 4.3. Matrices increasing in the direction the camera is facing (left -handed) is typical because it allows z coordinates to represent depths into the screen . Other engines and APIs, such as OpenGL , defi ne view space to be right-handed, in which case the camera faces towards negative z, and z coordinates represent negative depths. 4.3.10. Change of Basis In games and computer graphics, it is oft en quite useful to convert an object’s position, orientation, and scale from one coordinate system into another. We call this operation a change of basis . 4.3.10.1. Coordinate Space Hierarchies Coordinate frames are relative. That is, if you want to quantify the position, orientation, and scale of a set of axes in three-dimensional space, you must specify these quantities relative to some other set of axes (otherwise the num- bers would have no meaning). This implies that coordinate spaces form a hi- erarchy—every coordinate space is a child of some other coordinate space, and the other space acts as its parent. World space has no parent; it is at the root of the coordinate-space tree, and all other coordinate systems are ultimately specifi ed relative to it, either as direct children or more-distant relatives. 4.3.10.2. Building a Change of Basis Matrix The matrix that transforms points and directions from any child coordinate system C to its parent coordinate system P can be writt en CP→M (pronounced “C to P”). The subscript indicates that this matrix transforms points and direc- tions from child space to parent space. Any child-space position vector PC can be transformed into a parent-space position vector PP as follows: Left-Handed x z y Right-Handed z x y Virtual Screen Virtual Screen Figure 4.19. Left- and right-handed examples of view space, also known as camera space. 164 4. 3D Math for Games In this equation, iC is the unit basis vector along the child space x-axis, expressed in par- ent space coordinates; jC is the unit basis vector along the child space y-axis, in parent space; kC is the unit basis vector along the child space z-axis, in parent space; tC is the translation of the child coordinate system relative to parent space. This result should not be too surprising. 
The tC vector is just the transla- tion of the child space axes relative to parent space, so if the rest of the ma- trix were identity, the point (0, 0, 0) in child space would become tC in parent space, just as we’d expect. The iC , jC , and kC unit vectors form the upper 3 × 3 of the matrix, which is a pure rotation matrix because these vectors are of unit length. We can see this more clearly by considering a simple example, such as a situation in which child space is rotated by an angle γ about the z-axis, with no translation. The matrix for such a rotation is given by (4.2) But in Figure 4.20, we can see that the coordinates of the iC and jC vectors, expressed in parent space, are iC = [ cos γ sin γ 0 ] and jC = [ –sin γ cos γ 0 ]. When we plug these vectors into our formula for CP→M , with kC = [ 0 0 1 ], it exactly matches the matrix rotatez(r, γ) from Equation (4.2). Scaling the Child Axes Scaling of the child coordinate system is accomplished by simply scaling the unit basis vectors appropriately. For example, if child space is scaled up by a cos sin 0 0 sin cos 0 0 rotate ( , ) [ 1] .0 0 10 0 0 01 z xyzrrr γγ⎡⎤ ⎢⎥−γ γ⎢⎥γ= ⎢⎥ ⎢⎥⎣⎦ r CP ; 0 0 0 0 0 0 .0 1 P C CP C C C C Cx Cy Cz Cx Cy Cz Cx Cy Cz Cx Cy Cz iii jjj kkk t → → = ⎡⎤ ⎢⎥ ⎢⎥=⎢⎥ ⎢⎥⎣⎦ ⎡⎤ ⎢⎥ ⎢⎥=⎢⎥ ⎢⎥⎣⎦ P PM i j M k t tt 165 4.3. Matrices factor of two, then the basis vectors iC , jC , and kC will be of length 2 instead of unit length. 4.3.10.3. Extracting Unit Basis Vectors from a Matrix The fact that we can build a change of basis matrix out of a translation and three Cartesian basis vectors gives us another powerful tool: Given any affi ne 4 × 4 transformation matrix, we can go in the other direction and extract the child-space basis vectors iC , jC , and kC from it by simply isolating the appropri- ate rows of the matrix (or columns if your math library uses column vectors). This can be incredibly useful. Let’s say we are given a vehicle’s model- to-world transform as an affi ne 4 × 4 matrix (a very common representation). This is really just a change of basis matrix, transforming points in model space into their equivalents in world space. Let’s further assume that in our game, the positive z-axis always points in the direction that an object is facing. So, to fi nd a unit vector representing the vehicle’s facing direction, we can simply ex- tract kC directly from the model-to-world matrix (by grabbing its third row). This vector will already be normalized and ready to go. 4.3.10.4. Transforming Coordinate Systems versus Vectors We’ve said that the matrix CP→M transforms points and directions from child space into parent space. Recall that the fourth row of CP→M contains tC , the translation of the child coordinate axes relative to the world space axes. There- fore, another way to visualize the matrix CP→M is to imagine it taking the parent coordinate axes and transforming them into the child axes. This is the reverse of what happens to points and direction vectors. In other words, if a matrix transforms vectors from child space to parent space, then it also trans- forms coordinate axes from parent space to child space. This makes sense when you think about it—moving a point 20 units to the right with the coordinate axes fi xed is the same as moving the coordinate axes 20 units to the left with the point fi xed. This concept is illustrated in Figure 4.21. x y cos(γ) sin(γ) –sin(γ) cos(γ) γ γ iC jC Figure 4.20. Change of basis when child axes are rotated by an angle γ relative to parent. 166 4. 
3D Math for Games Of course, this is just another point of potential confusion. If you’re think- ing in terms of coordinate axes, then transformations go in one direction, but if you’re thinking in terms of points and vectors, they go in the other direction! As with many confusing things in life, your best bet is probably to choose a single “canonical” way of thinking about things and stick with it. For ex- ample, in this book we’ve chosen the following conventions: Transformations apply to vectors (not coordinate axes). Vectors are writt en as rows (not columns). Taken together, these two conventions allow us to read sequences of ma- trix multiplications from left to right and have them make sense (e.g., D A AB BC CD→→→=P PM M M ). Obviously if you start thinking about the coordi- nate axes moving around rather than the points and vectors, you either have to read the transforms from right to left , or fl ip one of these two conventions around. It doesn’t really matt er what conventions you choose as long as you fi nd them easy to remember and work with. That said, it’s important to note that certain problems are easier to think about in terms of vectors being transformed, while others are easier to work with when you imagine the coordinate axes moving around. Once you get good at thinking about 3D vector and matrix math, you’ll fi nd it prett y easy to fl ip back and forth between conventions as needed to suit the problem at hand. 4.3.11. Transforming Normal Vectors A normal vector is a special kind of vector, because in addition to (usually!) be- ing of unit length, it carries with it the additional requirement that it should always remain perpendicular to whatever surface or plane it is associated with. Special care must be taken when transforming a normal vector, to ensure that both its length and perpendicularity properties are maintained. x y x' y' y x P' P P Figure 4.21. Two ways to interpret a transformation matrix. On the left, the point moves against a fi xed set of axes. On the right, the axes move in the opposite direction while the point remains fi xed. 167 In general, if a point or (non-normal) vector can be rotated from space A to space B via the 3   ×   3 marix AB→M , then a normal vector n will be transformed from space A to space B via the inverse transpose of that matrix, 1T AB()− →M . We will not prove or derive this result here (see [28], Section 3.5 for an excellent derivation). However, we will observe that if the matrix AB→M contains only uniform scale and no shear, then the angles between all surfaces and vectors in space B will be the same as they were in space A. In this case, the matrix AB→M will actually work just fi ne for any vector, normal or non-normal. However, if AB→M contains nonuniform scale or shear (i.e., is non-orthogonal), then the angles between surfaces and vectors are not preserved when moving from space A to space B. A vector that was normal to a surface in space A will not necessarily be perpendicular to that surface in space B. The inverse transpose operation accounts for this distortion, bringing normal vectors back into per- pendicularity with their surfaces even when the transformation involves non- uniform scale or shear. 4.3.12. Storing Matrices in Memory In the C and C++ languages, a two-dimensional array is oft en used to store a matrix. Recall that in C/C++ two-dimensional array syntax, the fi rst subscript is the row and the second is the column, and that the column index varies fast- est as you move through memory sequentially. 
float m[4][4]; // [row][col], col varies fastest // "flatten" the array to demonstrate ordering float* pm = &m[0][0]; ASSERT( &pm[0] == &m[0][0] ); ASSERT( &pm[1] == &m[0][1] ); ASSERT( &pm[2] == &m[0][2] ); // etc. We have two choices when storing a matrix in a two-dimensional C/C++ array. We can either store the vectors (1. iC , jC , kC , tC) contiguously in memory (i.e., each row contains a single vector), or store the vectors 2. strided in memory (i.e., each column contains one vector). The benefi t of approach (1) is that we can address any one of the four vec- tors by simply indexing into the matrix and interpreting the four contiguous values we fi nd there as a 4-element vector. This layout also has the benefi t of matching up exactly with row vector matrix equations (which is another reason why I’ve selected row vector notation for this book). Approach (2) is some- times necessary when doing fast matrix-vector multiplies using a vector-en- 4.3. Matrices 168 4. 3D Math for Games abled (SIMD) microprocessor, as we’ll see later in this chapter. In most game engines I’ve personally encountered, matrices are stored using approach (1), with the vectors in the rows of the two-dimensional C/C++ array. This is shown below: float M[4][4]; M[0][0]=ix; M[0][1]=iy; M[0][2]=iz; M[0][3]=0.0f; M[1][0]=jx; M[1][1]=jy; M[1][2]=jz; M[1][3]=0.0f; M[2][0]=kx; M[2][1]=ky; M[2][2]=kz; M[2][3]=0.0f; M[3][0]=tx; M[3][1]=ty; M[3][2]=tz; M[3][3]=1.0f; The matrix M looks like this when viewed in a debugger: M[][] [0] [0] ix [1] iy [2] iz [3] 0.0000 [1] [0] jx [1] jy [2] jz [3] 0.0000 [2] [0] kx [1] ky [2] kz [3] 0.0000 [3] [0] tx [1] ty [2] tz [3] 1.0000 One easy way to determine which layout your engine uses is to fi nd a function that builds a 4 × 4 translation matrix. (Every good 3D math library provides such a function.) You can then inspect the source code to see where the elements of the t vector are being stored. If you don’t have access to the source code of your math library (which is prett y rare in the game industry), you can always call the function with an easy-to-recognize translation like (4, 3, 2), and then inspect the resulting matrix. If row 3 contains the values 4.0, 3.0, 2.0, 1.0, then the vectors are in the rows, otherwise the vectors are in the columns. 169 4.4. Quaternions 4.4. Quaternions We’ve seen that a 3   ×   3 matrix can be used to represent an arbitrary rotation in three dimensions. However, a matrix is not always an ideal representation of a rotation, for a number of reasons: We need nine fl oating-point values to represent a rotation, which seems 1. excessive considering that we only have three degrees of freedom— pitch, yaw, and roll. Rotating a vector requires a vector-matrix multiplication, which involves 2. three dot products, or a total of nine multiplications and six additions. We would like to fi nd a rotational representation that is less expensive to calculate, if possible. In games and computer graphics, it’s oft en important to be able to fi nd 3. rotations that are some percentage of the way between two known rota- tions. For example, if we are to smoothly animate a camera from some starting orientation A to some fi nal orientation B over the course of a few seconds, we need to be able to fi nd lots of intermediate rotations be- tween A and B over the course of the animation. It turns out to be diffi - cult to do this when the A and B orientations are expressed as matrices. Thankfully, there is a rotational representation that overcomes these three problems. 
It is a mathematical object known as a quaternion. A quaternion looks a lot like a four-dimensional vector, but it behaves quite differently. We usually write quaternions using non-italic, non-boldface type, like this: q = [ qx qy qz qw ].

Quaternions were developed by Sir William Rowan Hamilton in 1843 as an extension to the complex numbers. They were first used to solve problems in the area of mechanics. Technically speaking, a quaternion obeys a set of rules known as a four-dimensional normed division algebra over the real numbers. Thankfully, we won't need to understand the details of these rather esoteric algebraic rules. For our purposes, it will suffice to know that the unit-length quaternions (i.e., all quaternions obeying the constraint qx² + qy² + qz² + qw² = 1) represent three-dimensional rotations.

There are a lot of great papers, web pages, and presentations on quaternions available on the web, for further reading. Here's one of my favorites: http://graphics.ucsd.edu/courses/cse169_w05/CSE169_04.ppt.

4.4.1. Unit Quaternions as 3D Rotations

A unit quaternion can be visualized as a three-dimensional vector plus a fourth scalar coordinate. The vector part qV is the unit axis of rotation, scaled by the sine of the half-angle of the rotation. The scalar part qS is the cosine of the half-angle. So the unit quaternion q can be written as follows:

q = [ qV  qS ] = [ a sin(θ/2)   cos(θ/2) ],

where a is a unit vector along the axis of rotation, and θ is the angle of rotation. The direction of the rotation follows the right-hand rule, so if your thumb points in the direction of a, positive rotations will be in the direction of your curved fingers.

Of course, we can also write q as a simple four-element vector:

q = [ qx qy qz qw ], where
qx = qVx = ax sin(θ/2),
qy = qVy = ay sin(θ/2),
qz = qVz = az sin(θ/2),
qw = qS = cos(θ/2).

A unit quaternion is very much like an axis+angle representation of a rotation (i.e., a four-element vector of the form [ a θ ]). However, quaternions are more convenient mathematically than their axis+angle counterparts, as we shall see below.

4.4.2. Quaternion Operations

Quaternions support some of the familiar operations from vector algebra, such as magnitude and vector addition. However, we must remember that the sum of two unit quaternions does not represent a 3D rotation, because such a quaternion would not be of unit length. As a result, you won't see any quaternion sums in a game engine, unless they are scaled in some way to preserve the unit length requirement.

4.4.2.1. Quaternion Multiplication

One of the most important operations we will perform on quaternions is that of multiplication. Given two quaternions p and q representing two rotations P and Q, respectively, the product pq represents the composite rotation (i.e., rotation Q followed by rotation P). There are actually quite a few different kinds of quaternion multiplication, but we'll restrict this discussion to the variety used in conjunction with 3D rotations, namely the Grassman product. Using this definition, the product pq is defined as follows:

$$\mathrm{pq} = \left[\;(p_S\,\mathbf{q}_V + q_S\,\mathbf{p}_V + \mathbf{p}_V \times \mathbf{q}_V)\quad (p_S\,q_S - \mathbf{p}_V \cdot \mathbf{q}_V)\;\right].$$

Notice how the Grassman product is defined in terms of a vector part, which ends up in the x, y, and z components of the resultant quaternion, and a scalar part, which ends up in the w component.
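A direct transcription of the Grassman product into code can help make the formula concrete. The sketch below is illustrative only; the Quaternion struct and function names are hypothetical rather than taken from any particular math library.

#include <cmath>

struct Quaternion { float x, y, z, w; };   // [ qV qS ] = [ x y z w ]

// Grassman product pq: vector part is (pS*qV + qS*pV + pV x qV),
// scalar part is (pS*qS - pV . qV).
Quaternion grassmanProduct(const Quaternion& p, const Quaternion& q)
{
    Quaternion r;
    r.x = p.w * q.x + q.w * p.x + (p.y * q.z - p.z * q.y);
    r.y = p.w * q.y + q.w * p.y + (p.z * q.x - p.x * q.z);
    r.z = p.w * q.z + q.w * p.z + (p.x * q.y - p.y * q.x);
    r.w = p.w * q.w - (p.x * q.x + p.y * q.y + p.z * q.z);
    return r;
}

// Build a unit quaternion from a unit-length axis and an angle in radians.
Quaternion fromAxisAngle(float ax, float ay, float az, float theta)
{
    const float s = std::sin(0.5f * theta);
    const Quaternion q = { ax * s, ay * s, az * s, std::cos(0.5f * theta) };
    return q;
}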
4.4.2.2. Conjugate and Inverse

The inverse of a quaternion q is denoted q⁻¹ and is defined as a quaternion which, when multiplied by the original, yields the scalar 1 (i.e., qq⁻¹ = 0i + 0j + 0k + 1). The quaternion [ 0 0 0 1 ] represents a zero rotation (which makes sense since sin(0) = 0 for the first three components, and cos(0) = 1 for the last component).

In order to calculate the inverse of a quaternion, we must first define a quantity known as the conjugate. This is usually denoted q* and it is defined as follows:

q* = [ −qV  qS ].

In other words, we negate the vector part but leave the scalar part unchanged. Given this definition of the quaternion conjugate, the inverse quaternion q⁻¹ is defined as follows:

q⁻¹ = q* / |q|².

Our quaternions are always of unit length (i.e., |q| = 1), because they represent 3D rotations. So, for our purposes, the inverse and the conjugate are identical:

q⁻¹ = q* = [ −qV  qS ]   when |q| = 1.

This fact is incredibly useful, because it means we can always avoid doing the (relatively expensive) division by the squared magnitude when inverting a quaternion, as long as we know a priori that the quaternion is normalized. This also means that inverting a quaternion is generally much faster than inverting a 3 × 3 matrix—a fact that you may be able to leverage in some situations when optimizing your engine.

Conjugate and Inverse of a Product

The conjugate of a quaternion product (pq) is equal to the reverse product of the conjugates of the individual quaternions:

(pq)* = q* p*.

Likewise the inverse of a quaternion product is equal to the reverse product of the inverses of the individual quaternions:

(pq)⁻¹ = q⁻¹ p⁻¹.   (4.3)

This is analogous to the reversal that occurs when transposing or inverting matrix products.

4.4.3. Rotating Vectors with Quaternions

How can we apply a quaternion rotation to a vector? The first step is to rewrite the vector in quaternion form. A vector is a sum involving the unit basis vectors i, j, and k. A quaternion is a sum involving i, j, and k, but with a fourth scalar term as well. So it makes sense that a vector can be written as a quaternion with its scalar term qS equal to zero. Given the vector v, we can write a corresponding quaternion v = [ v 0 ] = [ vx vy vz 0 ].

In order to rotate a vector v by a quaternion q, we pre-multiply the vector (written in its quaternion form v) by q and then post-multiply it by the inverse quaternion, q⁻¹. Therefore, the rotated vector v′ can be found as follows:

v′ = rotate(q, v) = q v q⁻¹.

This is equivalent to using the quaternion conjugate, because our quaternions are always unit length:

v′ = rotate(q, v) = q v q*.   (4.4)

The rotated vector v′ is obtained by simply extracting it from its quaternion form v′.

Quaternion multiplication can be useful in all sorts of situations in real games. For example, let's say that we want to find a unit vector describing the direction in which an aircraft is flying. We'll further assume that in our game, the positive z-axis always points toward the front of an object by convention. So the forward unit vector of any object in model space is always FM ≡ [ 0 0 1 ] by definition. To transform this vector into world space, we can simply take our aircraft's orientation quaternion q and use it with Equation (4.4) to rotate our model-space vector FM into its world space equivalent FW (after converting these vectors into quaternion form, of course):

FW = q FM q⁻¹ = q [ 0 0 1 0 ] q⁻¹.
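Continuing the hypothetical Quaternion sketch from Section 4.4.2.1, rotating a vector amounts to two Grassman products with the conjugate on the right. This is only a sketch of Equation (4.4), not the book's own implementation, and it assumes q is of unit length.

Quaternion conjugate(const Quaternion& q)
{
    const Quaternion r = { -q.x, -q.y, -q.z, q.w };
    return r;
}

// v' = q v q*  -- rotate the 3D vector v by the unit quaternion q.
void rotateVector(const Quaternion& q, const float v[3], float vOut[3])
{
    // Write the vector in quaternion form: [ vx vy vz 0 ].
    const Quaternion vq = { v[0], v[1], v[2], 0.0f };

    // The conjugate equals the inverse here because q is assumed unit length.
    const Quaternion r = grassmanProduct(grassmanProduct(q, vq), conjugate(q));

    vOut[0] = r.x;
    vOut[1] = r.y;
    vOut[2] = r.z;
}

For instance, the aircraft example above could be coded as float FM[3] = { 0.0f, 0.0f, 1.0f }; float FW[3]; rotateVector(q, FM, FW);.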
4.4.3.1. Quaternion Concatenation

Rotations can be concatenated in exactly the same way that matrix-based transformations can, by multiplying the quaternions together. For example, consider three distinct rotations, represented by the quaternions q1, q2, and q3, with matrix equivalents R1, R2, and R3. We want to apply rotation 1 first, followed by rotation 2 and finally rotation 3. The composite rotation matrix Rnet can be found and applied to a vector v as follows:

Rnet = R1 R2 R3;
v′ = v Rnet = v R1 R2 R3.

Likewise, the composite rotation quaternion qnet can be found and applied to vector v (in its quaternion form, v) as follows:

qnet = q3 q2 q1;
v′ = qnet v qnet⁻¹ = q3 q2 q1 v q1⁻¹ q2⁻¹ q3⁻¹.

Notice how the quaternion product must be performed in an order opposite to that in which the rotations are applied (q3q2q1). This is because quaternion rotations always multiply on both sides of the vector, with the uninverted quaternions on the left and the inverted quaternions on the right. As we saw in Equation (4.3), the inverse of a quaternion product is the reverse product of the individual inverses, so the uninverted quaternions read right-to-left while the inverted quaternions read left-to-right.

4.4.4. Quaternion-Matrix Equivalence

We can convert any 3D rotation freely between a 3 × 3 matrix representation R and a quaternion representation q. If we let q = [ qV qS ] = [ qVx qVy qVz qS ] = [ x y z w ], then we can find R as follows:

$$\mathbf{R} = \begin{bmatrix}
1 - 2y^2 - 2z^2 & 2xy + 2zw & 2xz - 2yw \\
2xy - 2zw & 1 - 2x^2 - 2z^2 & 2yz + 2xw \\
2xz + 2yw & 2yz - 2xw & 1 - 2x^2 - 2y^2
\end{bmatrix}.$$

Likewise, given R we can find q as follows (where q[0] = qVx, q[1] = qVy, q[2] = qVz, and q[3] = qS). This code assumes that we are using row vectors in C/C++ (i.e., that the rows of matrix R[row][col] correspond to the rows of the matrix R shown above). The code was adapted from a Gamasutra article by Nick Bobic, published on July 5, 1998, which is available here: http://www.gamasutra.com/view/feature/3278/rotating_objects_using_quaternions.php. For a discussion of some even faster methods for converting a matrix to a quaternion, leveraging various assumptions about the nature of the matrix, see http://www.euclideanspace.com/maths/geometry/rotations/conversions/matrixToQuaternion/index.htm.

void matrixToQuaternion(const float R[3][3], float q[/*4*/])
{
    float trace = R[0][0] + R[1][1] + R[2][2];

    // check the diagonal
    if (trace > 0.0f)
    {
        float s = sqrt(trace + 1.0f);
        q[3] = s * 0.5f;

        float t = 0.5f / s;
        q[0] = (R[2][1] - R[1][2]) * t;
        q[1] = (R[0][2] - R[2][0]) * t;
        q[2] = (R[1][0] - R[0][1]) * t;
    }
    else
    {
        // diagonal is negative
        int i = 0;
        if (R[1][1] > R[0][0]) i = 1;
        if (R[2][2] > R[i][i]) i = 2;

        static const int NEXT[3] = {1, 2, 0};
        int j = NEXT[i];
        int k = NEXT[j];

        float s = sqrt((R[i][i] - (R[j][j] + R[k][k])) + 1.0f);
        q[i] = s * 0.5f;

        float t;
        if (s != 0.0f) t = 0.5f / s;
        else           t = s;

        q[3] = (R[k][j] - R[j][k]) * t;
        q[j] = (R[j][i] + R[i][j]) * t;
        q[k] = (R[k][i] + R[i][k]) * t;
    }
}

4.4.5. Rotational Linear Interpolation

Rotational interpolation has many applications in the animation, dynamics and camera systems of a game engine. With the help of quaternions, rotations can be easily interpolated just as vectors and points can. The easiest and least computationally intensive approach is to perform a four-dimensional vector LERP on the quaternions you wish to interpolate.
Given two quaternions qA and qB representing rotations A and B, we can find an intermediate rotation qLERP that is β percent of the way from A to B as follows:

$$\mathrm{q}_{\mathrm{LERP}} = \mathrm{LERP}(\mathrm{q}_A, \mathrm{q}_B, \beta)
= \frac{(1-\beta)\,\mathrm{q}_A + \beta\,\mathrm{q}_B}{\left|\,(1-\beta)\,\mathrm{q}_A + \beta\,\mathrm{q}_B\,\right|}
= \mathrm{normalize}\!\left(\begin{bmatrix}
(1-\beta)\,q_{Ax} + \beta\,q_{Bx} \\
(1-\beta)\,q_{Ay} + \beta\,q_{By} \\
(1-\beta)\,q_{Az} + \beta\,q_{Bz} \\
(1-\beta)\,q_{Aw} + \beta\,q_{Bw}
\end{bmatrix}^{T}\right).$$

Notice that the resultant interpolated quaternion had to be renormalized. This is necessary because the LERP operation does not preserve a vector's length in general.

Geometrically, qLERP = LERP(qA, qB, β) is the quaternion whose orientation lies β percent of the way from orientation A to orientation B, as shown (in two dimensions for clarity) in Figure 4.22. Mathematically, the LERP operation results in a weighted average of the two quaternions, with weights (1 – β) and β (notice that (1 – β) + β = 1).

Figure 4.22. Linear interpolation (LERP) between quaternions qA and qB.

4.4.5.1. Spherical Linear Interpolation

The problem with the LERP operation is that it does not take account of the fact that quaternions are really points on a four-dimensional hypersphere. A LERP effectively interpolates along a chord of the hypersphere, rather than along the surface of the hypersphere itself. This leads to rotation animations that do not have a constant angular speed when the parameter β is changing at a constant rate. The rotation will appear slower at the end points and faster in the middle of the animation.

To solve this problem, we can use a variant of the LERP operation known as spherical linear interpolation, or SLERP for short. The SLERP operation uses sines and cosines to interpolate along a great circle of the 4D hypersphere, rather than along a chord, as shown in Figure 4.23. This results in a constant angular speed when β varies at a constant rate. The formula for SLERP is similar to the LERP formula, but the weights (1 – β) and β are replaced with weights wp and wq involving sines of the angle between the two quaternions:

SLERP(p, q, β) = wp p + wq q,

where

wp = sin((1 – β)θ) / sin(θ),
wq = sin(βθ) / sin(θ).

Figure 4.23. Spherical linear interpolation along a great circle arc of a 4D hypersphere.

The cosine of the angle between any two unit-length quaternions can be found by taking their four-dimensional dot product. Once we know cos(θ), we can calculate the angle θ and the various sines we need quite easily:

cos(θ) = p · q = px qx + py qy + pz qz + pw qw;
θ = cos⁻¹(p · q).

4.4.5.2. To SLERP or Not to SLERP (That's Still the Question)

The jury is still out on whether or not to use SLERP in a game engine. Jonathan Blow wrote a great article positing that SLERP is too expensive, and LERP's quality is not really that bad—therefore, he suggests, we should understand SLERP but avoid it in our game engines (see http://number-none.com/product/Understanding%20Slerp,%20Then%20Not%20Using%20It/index.html). On the other hand, some of my colleagues at Naughty Dog have found that a good SLERP implementation performs nearly as well as LERP. (For example, on the PS3's SPUs, Naughty Dog's Ice team's implementation of SLERP takes 20 cycles per joint, while its LERP implementation takes 16.25 cycles per joint.) Therefore, I'd personally recommend that you profile your SLERP and LERP implementations before making any decisions.
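To make the trade-off concrete, here is what LERP and SLERP might look like over the hypothetical Quaternion struct used in the earlier sketches. It is illustrative only (no attempt is made at the low-level optimizations discussed above), and a production version would typically also negate one input when the dot product is negative so that interpolation follows the shorter arc.

#include <cmath>

Quaternion normalizeQuat(const Quaternion& q)
{
    const float len = std::sqrt(q.x*q.x + q.y*q.y + q.z*q.z + q.w*q.w);
    const Quaternion r = { q.x/len, q.y/len, q.z/len, q.w/len };
    return r;
}

// LERP: blend componentwise, then renormalize back onto the hypersphere.
Quaternion lerp(const Quaternion& qA, const Quaternion& qB, float beta)
{
    const Quaternion r = { (1.0f - beta)*qA.x + beta*qB.x,
                           (1.0f - beta)*qA.y + beta*qB.y,
                           (1.0f - beta)*qA.z + beta*qB.z,
                           (1.0f - beta)*qA.w + beta*qB.w };
    return normalizeQuat(r);
}

// SLERP: interpolate along the great circle using the sine-based weights.
Quaternion slerp(const Quaternion& qA, const Quaternion& qB, float beta)
{
    const float cosTheta = qA.x*qB.x + qA.y*qB.y + qA.z*qB.z + qA.w*qB.w;

    // Nearly parallel quaternions: fall back to LERP to avoid dividing
    // by a vanishingly small sin(theta).
    if (std::fabs(cosTheta) > 0.9999f)
        return lerp(qA, qB, beta);

    const float theta = std::acos(cosTheta);
    const float wp = std::sin((1.0f - beta) * theta) / std::sin(theta);
    const float wq = std::sin(beta * theta) / std::sin(theta);

    const Quaternion r = { wp*qA.x + wq*qB.x,
                           wp*qA.y + wq*qB.y,
                           wp*qA.z + wq*qB.z,
                           wp*qA.w + wq*qB.w };
    return r;
}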
If the performance hit for SLERP isn’t unacceptable, I say go for it, because it may result in slightly bett er-looking animations. But if your SLERP is slow (and you cannot speed it up, or you just don’t have the time to do so), then LERP is usually good enough for most purposes. 4.5. Comparison of Rotational Representations We’ve seen that rotations can be represented in quite a few diff erent ways. This section summarizes the most common rotational representations and outlines their pros and cons. No one representation is ideal in all situations. Using the information in this section, you should be able to select the best representation for a particular application. 4.5.1. Euler Angles We briefl y explored Euler angles in Section 4.3.9.1. A rotation represented via Euler angles consists of three scalar values: yaw, pitch, and roll. These quanti- ties are sometimes represented by a 3D vector [ θY θP θR ]. The benefi ts of this representation are its simplicity, its small size (three fl oating-point numbers), and its intuitive nature—yaw, pitch, and roll are easy to visualize. You can also easily interpolate simple rotations about a single axis. For example, it’s trivial to fi nd intermediate rotations between two distinct yaw angles by linearly interpolating the scalar θY. However, Euler angles cannot be interpolated easily when the rotation is about an arbitrarily-oriented axis. In addition, Euler angles are prone to a condition known as gimbal lock . This occurs when a 90-degree rotation causes one of the three principal axes to “collapse” onto another principal axis. For example, if you rotate by 90 degrees about the x-axis, the y-axis collapses onto the z-axis. This prevents 4.5. Comparison of Rotational Representations 178 4. 3D Math for Games any further rotations about the original y-axis, because rotations about y and z have eff ectively become equivalent. Another problem with Euler angles is that the order in which the rotations are performed around each axis matt ers. The order could be PYR, YPR, RYP, and so on, and each ordering may produce a diff erent composite rotation. No one standard rotation order exists for Euler angles across all disciplines (although certain disciplines do follow specifi c conventions). So the rotation angles [ θY θP θR ] do not uniquely defi ne a particular rotation—you need to know the rotation order to interpret these numbers properly. A fi nal problem with Euler angles is that they depend upon the mapping from the x-, y -, and z-axes onto the natural front, left /right, and up directions for the object being rotated. For example, yaw is always defi ned as rotation about the up axis, but without additional information we cannot tell whether this corresponds to a rotation about x, y, or z. 4.5.2. 3 × 3 M atrices A 3 × 3 matrix is a convenient and eff ective rotational representation for a number of reasons. It does not suff er from gimbal lock , and it can represent arbitrary rotations uniquely. Rotations can be applied to points and vectors in a straightforward manner via matrix multiplication (i.e., a series of dot prod- ucts). Most CPUs and all GPUs now have built-in support for hardware-accel- erated dot products and matrix multiplication. Rotations can also be reversed by fi nding an inverse matrix, which for a pure rotation matrix is the same thing as fi nding the transpose—a trivial operation. And 4 × 4 matrices off er a way to represent arbitrary affi ne transformations—rotations, translations, and scaling—in a totally consistent way. 
However, rotation matrices are not particularly intuitive. Looking at a big table of numbers doesn’t help one picture the corresponding transformation in three-dimensional space. Also, rotation matrices are not easily interpolated. Finally, a rotation matrix takes up a lot of storage (nine fl oating-point num- bers) relative to Euler angles. 4.5.3. Axis + Angle We can represent rotations as a unit vector defi ning the axis of rotation plus a scalar for the angle of rotation. This is known as an axis+angle representation, and it is sometimes denoted by the four-dimensional vector [ a θ ] , where a is the axis of rotation and θ the angle in radians. In a right-handed coordinate system, the direction of a positive rotation is defi ned by the right-hand rule, while in a left -handed system we use the left -hand rule instead. 179 The benefi ts of the axis+angle representation are that it is reasonably intu- itive and also compact (only requires four fl oating-point numbers, as opposed to the nine required for a 3 × 3 matrix). One important limitation of the axis+angle representation is that rota- tions cannot be easily interpolated. Also, rotations in this format cannot be applied to points and vectors in a straightforward way—one needs to convert the axis+angle representation into a matrix or quaternion fi rst. 4.5.4. Quaternions As we’ve seen, a unit-length quaternion can represent 3D rotations in a man- ner analogous to the axis+angle representation. The primary diff erence be- tween the two representations is that a quaternion’s axis of rotation is scaled by the sine of the half angle of rotation, and instead of storing the angle in the fourth component of the vector, we store the cosine of the half angle. The quaternion formulation provides two immense benefi ts over the axis+angle representation. First, it permits rotations to be concatenated and applied directly to points and vectors via quaternion multiplication. Second, it permits rotations to be easily interpolated via simple LERP or SLERP op- erations. Its small size (four fl oating-point numbers) is also a benefi t over the matrix formulation. 4.5.5. SQT Transformations By itself, a quaternion can only represent a rotation, whereas a 4   ×   4 matrix can represent an arbitrary affi ne transformation (rotation, translation, and scale). When a quaternion is combined with a translation vector and a scale factor (either a scalar for uniform scaling or a vector for nonuniform scaling), then we have a viable alternative to the 4   ×   4 matrix representation of affi ne transformations. We sometimes call this an SQT transform, because it contains a scale factor, a quaternion for rotation, and a translation vector. or SQT transforms are widely used in computer animation because of their smaller size (eight fl oats for uniform scale, or ten fl oats for nonuniform scale, as opposed to the 12 fl oating-point numbers needed for a 4   ×   3 matrix) and their ability to be easily interpolated. The translation vector and scale factor are interpolated via LERP, and the quaternion can be interpolated with either LERP or SLERP. SQT [ q ] (uniform scale ),ss= t SQT [ q ] (non-uniform scale vector ).= st s 4.5. Comparison of Rotational Representations 180 4. 3D Math for Games 4.5.6. Dual Quaternions Complete transformations involving rotation, translation, and scale can be represented using a mathematical object known as a dual quaternion. 
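Before moving on to dual quaternions, here is a minimal sketch of what an SQT transform might look like as a data structure, with interpolation performed channel by channel as described above. The layout and names are hypothetical, and the sketch reuses the Quaternion type and slerp() function from the earlier quaternion sketches.

// A uniform-scale SQT transform: one scale, a unit quaternion and a
// translation, for a total of eight floats.
struct SQT
{
    float      s;    // uniform scale
    Quaternion q;    // rotation (unit length)
    float      t[3]; // translation
};

// Interpolate channel by channel: LERP the scale and translation,
// LERP or SLERP the rotation (see Section 4.4.5).
SQT interpolateSQT(const SQT& a, const SQT& b, float beta)
{
    SQT r;
    r.s = (1.0f - beta) * a.s + beta * b.s;
    r.q = slerp(a.q, b.q, beta);
    for (int i = 0; i < 3; ++i)
    {
        r.t[i] = (1.0f - beta) * a.t[i] + beta * b.t[i];
    }
    return r;
}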
A dual quaternion is like an ordinary quaternion, except that its four components are dual numbers instead of regular real-valued numbers. A dual number can be writt en as the sum of a non-dual part and a dual part as follows: 0ˆ .aa aε= +ε Here ε is a magical number called the dual unit, defi ned as 2 0.ε= (This is analogous to the imaginary number 1i=− used when writing a complex number as the sum of a real and an imaginary part: .c a ib=+ ) Because each dual number can be represented by two real numbers (the non-dual and dual parts), a dual quaternion can be represented by an eight-element vector. It can also be represented as the sum of two ordinary quaternions, where the second one is multiplied by the dual unit, as follows: A full discussion of dual numbers and dual quaternions is beyond our scope here. However, a number of excellent articles on them exist online and in the literature. I recommend starting with htt ps://www.cs.tcd.ie/publica- tions/tech-reports/reports.06/TCD-CS-2006-46.pdf. 4.5.7. Rotations and Degrees of Freedom The term “degrees of freedom ” (or DOF for short) refers to the number of mu- tually-independent ways in which an object’s physical state (position and ori- entation) can change. You may have encountered the phrase “six degrees of freedom” in fi elds such as mechanics, robotics, and aeronautics. This refers to the fact that a three-dimensional object (whose motion is not artifi cially constrained) has three degrees of freedom in its translation (along the x-, y-, and z-axes) and three degrees of freedom in its rotation (about the x-, y-, and z-axes), for a total of six degrees of freedom. The DOF concept will help us to understand how diff erent rotational rep- resentations can employ diff erent numbers of fl oating-point parameters, yet all specify rotations with only three degrees of freedom. For example, Euler angles require three fl oats, but axis+angle and quaternion representations use four fl oats, and a 3   ×   3 matrix takes up nine fl oats. How can these representa- tions all describe 3-DOF rotations? The answer lies in constraints . All 3D rotational representations employ three or more fl oating-point parameters, but some representations also have one or more constraints on those parameters. The constraints indicate that the parameters are not independent—a change to one parameter induces changes to the other parameters in order to maintain the validity of the constraint(s). 0ˆq q q.ε= +ε 181 If we subtract the number of constraints from the number of fl oating-point parameters, we arrive at the number of degrees of freedom—and this number should always be three for a 3D rotation: NDOF = Nparameters – Nconstraints. (4.5) The following list shows Equation (4.5) in action for each of the rotational representations we’ve encountered in this book. Euler Angles. 3 parameters – 0 constraints = 3 DOF. Axis+Angle. 4 parameters – 1 constraint = 3 DOF. Constraint: Axis is constrained to be unit length. Quaternion. 4 parameters – 1 constraint = 3 DOF. Constraint: Quaternion is constrained to be unit length. 3   ×   3 Matrix. 9 parameters – 6 constraints = 3 DOF. Constraints: All three rows and all three columns must be of unit length (when treated as three-element vectors). 4.6. Other Useful Mathematical Objects As game engineers, we will encounter a host of other mathematical objects, in addition to points, vectors, matrices and quaternions. This section briefl y outlines the most common of these. 4.6.1. 
Lines, Rays, and Line Segments An infi nite line can be represented by a point P0 plus a unit vector u in the direction of the line. A parametric equation of a line traces out every possible point P along the line by starting at the initial point P0 and moving an arbi- trary distance t along the direction of the unit vector u. The infi nitely large set of points P becomes a vector function of the scalar parameter t: P(t) = P0 + t  u, where    –∞ < t < +∞. (4.73) This is depicted in Figure 4.24. t = 0 t = 1 t = 2 t = 3 t = –1 uP0 Figure 4.24. Parametric equation of a line. 4.6. Other Useful Mathematical Objects 182 4. 3D Math for Games x z y C r Figure 4.27. Point-radius representation of a sphere. A ray is a line that extends to infi nity in only one direction. This is easily expressed as P(t) with the constraint t ≥ 0, as shown in Figure 4.25. A line segment is bounded at both ends by P0 and P1. It too can be repre- sented by P(t), in either one of the following two ways (where L = P1 – P0 and L = |L| is the length of the line segment):  1. P(t) = P0 + tu, where 0 ≤ t ≤ L, or  2. P(t) = P0 + tL, where 0 ≤ t ≤ 1. The latt er format, depicted in Figure 4.26, is particularly convenient because the parameter t is normalized; in other words, t always goes from zero to one, no matt er which particular line segment we are dealing with. This means we do not have to store the constraint L in a separate fl oating-point parameter; it is already encoded in the vector L = Lu (which we have to store anyway). t = 0 t = 1 t = 2 t = 3uP0 Figure 4.25. Parametric equation of a ray. t = 0 t = 1 L = P1 – P0 P0 P1 t = 0.5 Figure 4.26. Parametric equation of a line segment, with normalized parameter t. 4.6.2. Spheres Spheres are ubiquitous in game engine programming. A sphere is typically defi ned as a center point C plus a radius r, as shown in Figure 4.27. This packs 183 nicely into a four-element vector, [ Cx Cy Cz r ]. As we’ll see below when we dis- cuss SIMD vector processing, there are distinct benefi ts to being able to pack data into a vector containing four 32-bit fl oats (i.e., a 128-bit package). 4.6.3. Planes A plane is a 2D surface in 3D space. As you may recall from high school alge- bra, the equation of a plane is oft en writt en as follows: Ax + By + Cz + D = 0. This equation is satisfi ed only for the locus of points P = [ x y z ] that lie on the plane. Planes can be represented by a point P0 and a unit vector n that is normal to the plane. This is sometimes called point-normal form , as depicted in Fig- ure 4.28. It’s interesting to note that when the parameters A, B, and C from the tra- ditional plane equation are interpreted as a 3D vector, that vector lies in the di- rection of the plane normal. If the vector [ A B C ] is normalized to unit length, then the normalized sub-vector [ a b c ] = n, and the normalized parameter 222dD A B C= ++ is just the distance from the plane to the origin . The sign of d is positive if the plane’s normal vector (n) is pointing toward the origin (i.e., the origin is on the “front” side of the plane) and negative if the normal is pointing away from the origin (i.e., the origin is “behind” the plane). In fact, the normalized equation ax + by + cz + d = 0 is just another way of writing (n P) = –   d, which means that when any point P on the plane is projected onto the plane normal n, the length of that projection will be –   d. A plane can actually be packed into a four-element vector, much like a sphere can. 
To do so, we observe that to describe a plane uniquely, we need only the normal vector n = [ a b c ] and the distance from the origin d. The four-element vector L = [ n d ] = [ a b c d ] is a compact and convenient way to represent and store a plane in memory. Note that when P is writt en in ho- mogeneous coordinates with w = 1, the equation (L P) = 0 is yet another way of writing (n P) = –   d. (These equations are satisfi ed for all points P that lie on the plane L.) Planes defi ned in four-element vector form can be easily transformed from one coordinate space to another. Given a matrix AB→M that transforms points and (non-normal) vectors from space A to space B, we already know that to transform a normal vector such as the plane’s n vector, we need to use the inverse transpose of that matrix, 1T AB()− →M . So it shouldn’t be a big surprise to learn that applying the inverse transpose of a matrix to a four-element plane vector L will, in fact, correctly transform that plane from space A to space B. P0 n Figure 4.28. A plane in point-normal form. 4.6. Other Useful Mathematical Objects 184 4. 3D Math for Games We won’t derive or prove this result any further here, but a thorough explana- tion of why this litt le “trick” works is provided in Section 4.2.3 of [28]. 4.6.4. Axis-Aligned Bounding Boxes (AABB) An axis-aligned bounding box (AABB) is a 3D cuboid whose six rectangular faces are aligned with a particular coordinate frame’s mutually orthogonal axes. As such, an AABB can be represented by a six-element vector containing the minimum and maximum coordinates along each of the 3 principal axes, [ xmin , xmax , ymin , ymax , zmin , zmax ], or two points Pmin and Pmax. This simple representation allows for a particularly convenient and in- expensive method of testing whether a point P is inside or outside any given AABB. We simply test if all of the following conditions are true: Px ≥ xmin and Px ≤ xmax and Py ≥ ymin and Py ≤ ymax and Pz ≥ zmin  and Pz ≤ zmax. Because intersection tests are so speedy, AABBs are oft en used as an “early out” collision check; if the AABBs of two objects do not intersect, then there is no need to do a more detailed (and more expensive) collision test. 4.6.5. Oriented Bounding Boxes (OBB) An oriented bounding box (OBB) is a cuboid that has been oriented so as to align in some logical way with the object it bounds. Usually an OBB aligns with the local-space axes of the object. Hence it acts like an AABB in local space, although it may not necessarily align with the world space axes. Various techniques exist for testing whether or not a point lies within an OBB, but one common approach is to transform the point into the OBB’s “aligned” coordinate system and then use an AABB intersection test as pre- sented above. 4.6.6. Frusta As shown in Figure 4.29, a frustum is a group of six planes that defi ne a trun- cated pyramid shape. Frusta are commonplace in 3D rendering because they conveniently defi ne the viewable region of the 3D world when rendered via a perspective projection from the point of view of a virtual camera. Four of the planes bound the edges of the screen space, while the other two planes repre- sent the the near and far clipping planes (i.e., they defi ne the minimum and maximum z coordinates possible for any visible point). Near Far Left Right Top Bottom Figure 4.29. A frustum. 185 4.7. 
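As an illustration of how cheap these tests are, here is a sketch of a point-in-AABB test and a signed point-to-plane distance, which is the basic building block of the frustum test described next. The struct layouts are hypothetical, and the plane is assumed to be stored in normalized [ a b c d ] form.

struct AABB  { float min[3]; float max[3]; };   // [ xmin ymin zmin ], [ xmax ymax zmax ]
struct Plane { float a, b, c, d; };             // normalized: ax + by + cz + d = 0

// A point is inside an AABB if it lies within the min/max bounds on all
// three axes.
bool pointInsideAABB(const float p[3], const AABB& box)
{
    for (int i = 0; i < 3; ++i)
    {
        if (p[i] < box.min[i] || p[i] > box.max[i])
            return false;
    }
    return true;
}

// Signed distance from a point to a normalized plane: positive on the side
// the normal points toward, negative behind it. A frustum containment test
// simply checks the sign of this value against all six planes.
float signedDistanceToPlane(const float p[3], const Plane& plane)
{
    return plane.a * p[0] + plane.b * p[1] + plane.c * p[2] + plane.d;
}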
Hardware-Accelerated SIMD Math One convenient representation of a frustum is as an array of six planes, each of which is represented in point-normal form (i.e., one point and one normal vector per plane). Testing whether a point lies inside a frustum is a bit involved, but the basic idea is to use dot products to determine whether the point lies on the front or back side of each plane. If it lies inside all six planes, it is inside the frustum. A helpful trick is to transform the world-space point being tested, by applying the camera’s perspective projection to it. This takes the point from world space into a space known as homogeneous clip space. In this space, the frustum is just an axis-aligned cuboid (AABB). This permits much simpler in/ out tests to be performed. 4.6.7. Convex Polyhedral Regions A convex polyhedral region is defi ned by an arbitrary set of planes, all with nor- mals pointing inward (or outward). The test for whether a point lies inside or outside the volume defi ned by the planes is relatively straightforward; it is similar to a frustum test, but with possibly more planes. Convex regions are very useful for implementing arbitrarily-shaped trigger regions in games. Many engines employ this technique; for example, the Quake engine’s ubiqui- tous brushes are just volumes bounded by planes in exactly this way. 4.7. Hardware-Accelerated SIMD Math SIMD stands for “single instruction multiple data .” This refers to the ability of most modern microprocessors to perform a single mathematical operation on multiple data items in parallel, using a single machine instruction. For exam- ple, the CPU might multiply four pairs of fl oating-point numbers in parallel with a single instruction. SIMD is widely used in game engine math libraries, because it permits common vector operations such as dot products and matrix multiplication to be performed extremely rapidly. Intel fi rst introduced MMX instructions with their Pentium line of CPUs in 1994. These instructions permitt ed SIMD calculations to be performed on 8-, 16-, and 32-bit integers packed into special 64-bit MMX registers. Intel fol- lowed this up with various revisions of an extended instruction set called Streaming SIMD Extensions, or SSE, the fi rst version of which appeared in the Pentium III processor. The SSE instruction set utilizes 128-bit registers that can contain integer or IEEE fl oating-point data. The SSE mode most commonly used by game engines is called packed 32- bit fl oating-point mode. In this mode, four 32-bit float values are packed into 186 4. 3D Math for Games a single 128-bit register; four operations such as additions or multiplications are performed in parallel on four pairs of fl oats using a single instruction. This is just what the doctor ordered when multiplying a four-element vector by a 4   ×   4 matrix! 4.7.1.1. SSE Registers In packed 32-bit fl oating-point mode, each 128-bit SSE register contains four 32-bit fl oats. The individual fl oats within an SSE register are conveniently re- ferred to as [ x y z w ], just as they would be when doing vector/matrix math in homogeneous coordinates on paper (see Figure 4.30). To see how the SSE registers work, here’s an example of a SIMD instruction: addps xmm0, xmm1 The addps instruction adds the four fl oats in the 128-bit XMM0 register with the four fl oats in the XMM1 register, and stores the four results back into XMM0. Put another way: xmm0.x = xmm0.x + xmm1.x; xmm0.y = xmm0.y + xmm1.y; xmm0.z = xmm0.z + xmm1.z;    xmm0.w = xmm0.w + xmm1.w. 
The four fl oating-point values stored in an SSE register can be extracted to or loaded from memory or registers individually, but such operations tend to be comparatively slow. Moving data between the x87 FPU registers and the SSE registers is particularly bad, because the CPU has to wait for either the x87 or the SSE unit to spit out its pending calculations. This stalls out the CPU’s entire instruction execution pipeline and results in a lot of wasted cycles. In a nutshell, code that mixes regular float mathematics with SSE mathematics should be avoided like the plague. To minimize the costs of going back and forth between memory, x87 FPU registers, and SSE registers, most SIMD math libraries do their best to leave data in the SSE registers for as long as possible. This means that even scalar values are left in SSE registers, rather than transferring them out to float variables. For example, a dot product between two vectors produces a scalar result, but if we leave that result in an SSE register it can be used later in other x y z w 32 bits 32 bits 32 bits 32 bits Figure 4.30. The four components of an SSE register in 32-bit fl oating-point mode. 187 4.7. Hardware-Accelerated SIMD Math vector calculations without incurring a transfer cost. Scalars are represented by duplicating the single fl oating-point value across all four “slots” in an SSE register. So to store the scalar s in an SSE register, we’d set x = y = z = w = s. 4.7.1.2. The __m128 Data Type Using one of these magic SSE 128-bit values in C or C++ is quite easy. The Microsoft Visual Studio compiler provides a predefi ned data type called __m128. This data type can be used to declare global variables, automatic vari- ables, and even class and structure members. In many cases, variables of this type will be stored in RAM. But when used in calculations, __m128 values are manipulated directly in the CPU’s SSE registers. In fact, declaring automatic variables and function arguments to be of type __m128 oft en results in the compiler storing those values directly in SSE registers, rather than keeping them in RAM on the program stack. Alignment of __m128 Variables When an __m128 variable is stored in RAM, it is the programmer’s responsi- bility to ensure that the variable is aligned to a 16-byte address boundary. This means that the hexadecimal address of an __m128 variable must always end in the nibble 0x0. The compiler will automatically pad structures and classes so that if the entire struct or class is aligned to a 16-byte boundary, all of the __m128 data members within it will be properly aligned as well. If you de- clare an automatic or global struct/class containing one or more __m128s, the compiler will align the object for you. However, it is still your responsibility to align dynamically allocated data structures (i.e., data allocated with new or malloc()); the compiler can’t help you there. 4.7.1.3. Coding with SSE Intrinsics SSE mathematics can be done in raw assembly language, or via inline assem- bly in C or C++. However, writing code like this is not only non-portable, it’s also a big pain in the butt . To make life easier, modern compilers provide intrinsics —special commands that look and behave like regular C functions, but are really boiled down to inline assembly code by the compiler. Many in- trinsics translate into a single assembly language instruction, although some are macros that translate into a sequence of instructions. 
In order to use the __m128 data type and SSE intrinsics, your .cpp fi le must #include . As an example, let’s take another look at the addps assembly language instruction. This instruction can be invoked in C/C++ using the intrinsic _mm _add_ps(). Here’s a side-by-side comparison of what the code would look like with and without the use of the intrinsic. 188 4. 3D Math for Games __m128 addWithAssembly( __m128 a, __m128 b) { __m128 r; __asm { movaps xmm0, xmmword ptr [a] movaps xmm1, xmmword ptr [b] addps xmm0, xmm1 movaps xmmword ptr [r], xmm0 } return r; } __m128 addWithIntrinsics( __m128 a, __m128 b) { __m128 r = _mm_add_ps(a, b); return r; } In the assembly language version, we have to use the __asm keyword to invoke inline assembly instructions, and we must create the linkage between the input parameters a and b and the SSE registers xmm0 and xmm1 manually, via movaps instructions. On the other hand, the version using intrinsics is much more intuitive and clear, and the code is smaller. There’s no inline as- sembly, and the SSE instruction looks just like a regular function call. If you’d like to experiment with these example functions, they can be in- voked via the following test bed main() function. Notice the use of another intrinsic, _mm_load_ps(), which loads values from an in-memory array of floats into an __m128 variable (i.e., into an SSE register). Also notice that we are forcing our four global float arrays to be 16-byte aligned via the __declspec(align(16)) directive—if we omit these directives, the pro- gram will crash. #include // ... function definitions from above ... __declspec(align(16)) float A[]={2.0f,-1.0f,3.0f,4.0f}; __declspec(align(16)) float B[]={-1.0f,3.0f,4.0f,2.0f}; __declspec(align(16)) float C[]={0.0f,0.0f,0.0f,0.0f}; __declspec(align(16)) float D[]={0.0f,0.0f,0.0f,0.0f}; int main(int argc, char* argv[]) { // load a and b from floating-point data arrays above __m128 a = _mm_load_ps(&A[0]); __m128 b = _mm_load_ps(&B[0]); 189 // test the two functions __m128 c = addWithAssembly(a, b); __m128 d = addWithIntrinsics(a, b); // store the original values back to check that they // weren’t overwritten _mm_store_ps(&A[0], a); _mm_store_ps(&B[0], b); // store results into float arrays so we can print // them _mm_store_ps(&C[0], c); _mm_store_ps(&D[0], d); // inspect the results printf(“%g %g %g %g\n”, A[0], A[1], A[2], A[3]); printf(“%g %g %g %g\n”, B[0], B[1], B[2], B[3]); printf(“%g %g %g %g\n”, C[0], C[1], C[2], C[3]); printf(“%g %g %g %g\n”, D[0], D[1], D[2], D[3]); return 0; } 4.7.1.4. Vector-Matrix Multiplication with SSE Let’s take a look at how vector-matrix multiplication might be implemented using SSE instructions. We want to multiply the 1 × 4 vector v with the 4 × 4 matrix M to generate a result vector r. The multiplication involves taking the dot product of the row vector v with the columns of matrix M. So to do this calculation using SSE instructions, we might fi rst try storing v in an SSE register (__m128), and storing each of the columns of M in SSE registers as well. Then we could calculate all of the products vkMij in parallel using only four mulps instructions, like this: 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 ; [ ][ ] (((( )))) xyzw x y z w xxxx yyyy zzzz wwww MMMM MMMM rrrr v v vv MMMM MMMM vM vM vM vM vM vM vM vM vM vM vM vM vM vM vM vM = ⎡⎤ ⎢⎥ ⎢⎥= ⎢⎥ ⎢⎥⎣⎦ ⎡⎤ ⎢⎥++++⎢⎥=⎢⎥++++ ⎢⎥++++⎣⎦ r vM . 4.7. Hardware-Accelerated SIMD Math 190 4. 
3D Math for Games __m128 mulVectorMatrixAttempt1(__m128 v, __m128 Mcol1, __m128 Mcol2, __m128 Mcol3, __m128 Mcol4) { __m128 vMcol1 = _mm_mul_ps(v, Mcol1); __m128 vMcol2 = _mm_mul_ps(v, Mcol2); __m128 vMcol3 = _mm_mul_ps(v, Mcol3); __m128 vMcol4 = _mm_mul_ps(v, Mcol4); // ... then what? } The above code would yield the following intermediate results: vMcol1 = [ vxM11 vyM21 vzM31 vwM41 ]; vMcol2 = [ vxM12 vyM22 vzM32 vwM42 ]; vMcol3 = [ vxM13 vyM23 vzM33 vwM43 ]; vMcol4 = [ vxM14 vyM24 vzM34 vwM44 ]. But the problem with doing it this way is that we now have to add “across the registers” in order to generate the results we need. For example, rx = (vxM11 + vyM21 + vzM31 + vwM41), so we’d need to add the four components of vMcol1 together. Adding across a register like this is diffi cult and ineffi cient, and moreover it leaves the four components of the result in four separate SSE registers, which would need to be combined into the single result vector r. We can do bett er. The “trick” here is to multiply with the rows of M, not its columns. That way, we’ll have results that we can add in parallel, and the fi nal sums will end up in the four components of a single SSE register representing the output vector r. However, we don’t want to multiply v as-is with the rows of M—we want to multiply vx with all of row 1, vy with all of row 2, vz with all of row 3, and vw with all of row 4. To do this, we need to replicate a single component of v, such as vx, across a register to yield a vector like [ vx vx vx vx ]. Then we can multiply the replicated component vectors by the appropriate rows of M. Thankfully there’s a powerful SSE instruction which can replicate values like this. It is called shufps, and it’s wrapped by the intrinsic _mm_shuffle_ ps(). This beast is a bit complicated to understand, because it’s a general- purpose instruction that can shuffl e the components of an SSE register around in arbitrary ways. However, for our purposes we need only know that the following macros replicate the x, y, z or w components of a vector across an entire register: #define SHUFFLE_PARAM(x, y, z, w) \ ((x) | ((y) << 2) | ((z) << 4) | ((w) << 6)) 191 #define _mm_replicate_x_ps(v) \ _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(0, 0, 0, 0)) #define _mm_replicate_y_ps(v) \ _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(1, 1, 1, 1)) #define _mm_replicate_z_ps(v) \ _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(2, 2, 2, 2)) #define _mm_replicate_w_ps(v) \ _mm_shuffle_ps((v), (v), SHUFFLE_PARAM(3, 3, 3, 3)) Given these convenient macros, we can write our vector-matrix multipli- cation function as follows: __m128 mulVectorMatrixAttempt2(__m128 v, __m128 Mrow1, __m128 Mrow2, __m128 Mrow3, __m128 Mrow4) { __m128 xMrow1 = _mm_mul_ps(_mm_replicate_x_ps(v), Mrow1); __m128 yMrow2 = _mm_mul_ps(_mm_replicate_y_ps(v), Mrow2); __m128 zMrow3 = _mm_mul_ps(_mm_replicate_z_ps(v), Mrow3); __m128 wMrow4 = _mm_mul_ps(_mm_replicate_w_ps(v), Mrow4); __m128 result = _mm_add_ps(xMrow1, yMrow2); result = _mm_add_ps(result, zMrow3); result = _mm_add_ps(result, wMrow4); return result; } This code produces the following intermediate vectors: xMrow1 = [ vxM11 vxM12 vxM13 vxM14 ]; yMrow2 = [ vyM21 vyM22 vyM23 vyM24 ];  zMrow3 = [ vzM31 vzM32 vzM33 vzM34 ];     wMrow4 = [ vwM41 vwM42 vwM43 vwM44 ]. Adding these four vectors in parallel produces our result r: 4.7. Hardware-Accelerated SIMD Math 11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44 (((( . )))) xxxx yyyy zzzz wwww vM vM vM vM vM vM vM vM vM vM vM vM vM vM vM vM ⎡⎤ ⎢⎥++++⎢⎥=⎢⎥++++ ⎢⎥++++⎣⎦ r 192 4. 
3D Math for Games On some CPUs, the code shown above can be optimized even further by using a rather handy multiply-and-add instruction, usually denoted madd. This instruction multiplies its fi rst two arguments and then adds the result to its third argument. Unfortunately SSE doesn’t support a madd instruction, but we can fake it reasonably well with a macro like this: #define _mm_madd_ps(a, b, c) \ _mm_add_ps(_mm_mul_ps((a), (b)), (c)) __m128 mulVectorMatrixFinal(__m128 v, __m128 Mrow1, __m128 Mrow2, __m128 Mrow3, __m128 Mrow4) { __m128 result; result = _mm_mul_ps (_mm_replicate_x_ps(v), Mrow1); result = _mm_madd_ps(_mm_replicate_y_ps(v), Mrow2, result); result = _mm_madd_ps(_mm_replicate_z_ps(v), Mrow3, result); result = _mm_madd_ps(_mm_replicate_w_ps(v), Mrow4, result); return result; } We can of course perform matrix-matrix multiplication using a similar approach. Check out htt p://msdn.microsoft .com for a full listing of the SSE intrinsics for the Microsoft Visual Studio compiler. 4.8. Random Number Generation Random numbers are ubiquitous in game engines, so it behooves us to have a brief look at the two most common random number generators, the linear congruential generator and the Mersenne Twister. It’s important to realize that random number generators are just very complicated but totally deterministic pre-defi ned sequences of numbers. For this reason, we call the sequences they produce pseudo-random. What diff erentiates a good generator from a bad one is how long the sequence of numbers is before it repeats (its period), and how well the sequences hold up under various well-known randomness tests. 4.8.1. Linear Congruential Generators Linear congruential generators are a very fast and simple way to generate a sequence of pseudo-random numbers. Depending on the platform, this algo- rithm is sometimes used in the standard C library’s rand() function. How- 193 ever, your mileage may vary, so don’t count on rand() being based on any particular algorithm. If you want to be sure, you’ll be bett er off implementing your own random number generator. The linear congruential algorithm is explained in detail in the book Nu- merical Recipes in C, so I won’t go into the details of it here. What I will say is that this random number generator does not produce particularly high-quality pseudo-random sequences. Given the same initial seed value, the sequence is always exactly the same. The numbers produced do not meet many of the criteria widely accepted as desirable, such as a long period, low- and high-order bits that have similarly-long periods, and absence of sequential or spatial correlation between the generated values. 4.8.2. Mersenne Twister The Mersenne Twister pseudo-random number generator algorithm was de- signed specifi cally to improve upon the various problems of the linear con- gruential algorithm. Wikipedia provides the following description of the ben- efi ts of the algorithm: It was designed to have a colossal period of 21. 19937 − 1 (the creators of the algorithm proved this property). In practice, there is litt le reason to use larger ones, as most applications do not require 219937 unique combina- tions (219937 ≈ 4.3 × 106001). It has a very high order of dimensional equidistribution (see linear 2. congruential generator). Note that this means, by default, that there is negligible serial correlation between successive values in the output se- quence. It passes numerous tests for statistical randomness, including the strin-3. gent Diehard tests. It is fast.4. 
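The linear congruential generator described in Section 4.8.1 is simple enough to sketch in a few lines, which makes a nice contrast with the complexity of the Mersenne Twister. The wrapper class below is hypothetical; the multiplier and increment are the widely quoted "quick and dirty" constants from Numerical Recipes, and (as noted above) the resulting sequence is fast but not of particularly high quality.

#include <cstdint>

// A 32-bit linear congruential generator: x' = a*x + c (mod 2^32).
class LCG
{
public:
    explicit LCG(std::uint32_t seed) : m_state(seed) {}

    std::uint32_t nextUInt32()
    {
        // Unsigned arithmetic wraps, giving the "mod 2^32" for free.
        m_state = 1664525u * m_state + 1013904223u;
        return m_state;
    }

    // Convenience: a float in [0, 1). Uses the top 24 bits so the result
    // fits exactly in a float's mantissa.
    float nextFloat()
    {
        return (nextUInt32() >> 8) * (1.0f / 16777216.0f);
    }

private:
    std::uint32_t m_state;
};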
Various implementations of the Twister are available on the web, includ- ing a particularly cool one that uses SIMD vector instructions for an extra speed boost, called SFMT (SIMD-oriented fast Mersenne Twister). SFMT can be downloaded from htt p://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ SFMT/index.html. 4.8.3. Mother-of-All and Xorshift In 1994, George Marsaglia, a computer scientist and mathematician best known for developing the Diehard batt ery of tests of randomness (htt p://www.stat. fsu.edu/pub/diehard), published a pseudo-random number generation algo- 4.8. Random Number Generation 194 4. 3D Math for Games rithm that is much simpler to implement and runs faster than the Mersenne Twister algorithm. He claimed that it could produce a sequence of 32-bit pseu- do-random numbers with a period of non-repetition of 2250. It passed all of the Diehard tests and still stands today as one of the best pseudo-random number generators for high-speed applications. He called his algorithm the Mother of All Pseudo-Random Number Generators , because it seemed to him to be the only random number generator one would ever need. Later, Marsaglia published another generator called Xorshift , which is be- tween Mersenee and Mother-of-All in terms of randomness, but runs slightly faster than Mother. You can read about George Marsaglia at htt p://en.wikipedia.org/wiki/ George_Marsaglia, and about the Mother-of-All generator at ft p://ft p.forth. org/pub/C/mother.c and at htt p://www.agner.org/random. You can down- load a PDF of George’s paper on Xorshift at htt p://www.jstatsoft .org/v08/i14/ paper. Part II Low-Level Engine Systems 5 Engine Support Systems Every game engine requires some low-level support systems that manage mundane but crucial tasks, such as starting up and shutt ing down the en- gine, confi guring engine and game features, managing the engine’s memory usage, handling access to fi le system(s), providing access to the wide range of heterogeneous asset types used by the game (meshes, textures, animations, audio, etc.), and providing debugging tools for use by the game development team. This chapter will focus on the lowest-level support systems found in most game engines. In the chapters that follow, we will explore some of the larger core systems, including resource management, human interface devic- es, and in-game debugging tools. 5.1. Subsystem Start-Up and Shut-Down A game engine is a complex piece of soft ware consisting of many interacting subsystems. When the engine fi rst starts up, each subsystem must be confi g- ured and initialized in a specifi c order. Interdependencies between subsys- tems implicitly defi ne the order in which they must be started—i.e., if sub- system B depends on subsystem A, then A will need to be started up before B can be initialized. Shut-down typically occurs in the reverse order, so B would shut down fi rst, followed by A. 197 198 5. Engine Support Systems 5.1.1. C++ Static Initialization Order (or Lack Thereof) Since the programming language used in most modern game engines is C++, we should briefl y consider whether C++’s native start-up and shut-down se- mantics can be leveraged in order to start up and shut down our engine’s sub- systems. In C++, global and static objects are constructed before the program’s entry point (main(), or WinMain() under Windows) is called. However, these constructors are called in a totally unpredictable order. 
The destructors of global and static class instances are called aft er main() (or WinMain()) returns, and once again they are called in an unpredictable order. Clearly this behavior is not desirable for initializing and shutt ing down the subsystems of a game engine, or indeed any soft ware system that has interdependencies between its global objects. This is somewhat unfortunate, because a common design patt ern for im- plementing major subsystems such as the ones that make up a game engine is to defi ne a singleton class (oft en called a manager ) for each subsystem. If C++ gave us more control over the order in which global and static class instances were constructed and destroyed, we could defi ne our singleton instances as globals, without the need for dynamic memory allocation. For example, we could write: class RenderManager { public: RenderManager() { // start up the manager... } ~RenderManager() { // shut down the manager... } // ... }; // singleton instance static RenderManager gRenderManager; Alas, with no way to directly control construction and destruction order, this approach won’t work. 5.1.1.1. Construct On Demand There is one C++ “trick” we can leverage here. A static variable that is declared within a function will not be constructed before main() is called, but rather 199 5.1. Subsystem Start-Up and Shut-Down on the fi rst invocation of that function. So if our global singleton is function- static, we can control the order of construction for our global singletons. class RenderManager { public: // Get the one and only instance. static RenderManager& get() { // This function-static will be constructed on the // first call to this function. static RenderManager sSingleton; return sSingleton; } RenderManager() { // Start up other managers we depend on, by // calling their get() functions first... VideoManager::get(); TextureManager::get(); // Now start up the render manager. // ... } ~RenderManager() { // Shut down the manager. // ... } }; You’ll fi nd that many soft ware engineering textbooks suggest this de- sign, or a variant that involves dynamic allocation of the singleton as shown below. static RenderManager& get() { static RenderManager* gpSingleton = NULL; if (gpSingleton == NULL) { gpSingleton = new RenderManager; } ASSERT(gpSingleton); return *gpSingleton; } 200 5. Engine Support Systems Unfortunately, this still gives us no way to control destruction order. It is possible that C++ will destroy one of the managers upon which the RenderManager depends for its shut-down procedure, prior to the RenderManager’s destructor being called. In addition, it’s diffi cult to predict exactly when the RenderManager singleton will be constructed, because the construction will happen on the fi rst call to RenderManager::get()—and who knows when that might be? Moreover, the programmers using the class may not be expecting an innocuous-looking get() function to do something expensive, like allocating and initializing a heavy-weight singleton. This is an unpredictable and dangerous design. Therefore we are prompted to resort to a more direct approach that gives us greater control. 5.1.2. A Simple Approach That Works Let’s presume that we want to stick with the idea of singleton managers for our subsystems. In this case, the simplest “brute-force” approach is to defi ne explicit start-up and shut-down functions for each singleton manager class. 
These functions take the place of the constructor and destructor, and in fact we should arrange for the constructor and destructor to do absolutely nothing. That way, the start-up and shut-down functions can be explicitly called in the required order from within main() (or from some over-arching singleton object that manages the engine as a whole). For example: class RenderManager { public: RenderManager() { // do nothing } ~RenderManager() { // do nothing } void startUp() { // start up the manager... } void shutDown() { // shut down the manager... } 201 // ... }; class PhysicsManager { /* similar... */ }; class AnimationManager { /* similar... */ }; class MemoryManager { /* similar... */ }; class FileSystemManager { /* similar... */ }; // ... RenderManager gRenderManager; PhysicsManager gPhysicsManager; AnimationManager gAnimationManager; TextureManager gTextureManager; VideoManager gVideoManager; MemoryManager gMemoryManager; FileSystemManager gFileSystemManager; // ... int main(int argc, const char* argv) { // Start up engine systems in the correct order. gMemoryManager. startUp(); gFileSystemManager. startUp(); gVideoManager. startUp(); gTextureManager. startUp(); gRenderManager. startUp(); gAnimationManager. startUp(); gPhysicsManager. startUp(); // ... // Run the game. gSimulationManager. run(); // Shut everything down, in reverse order. // ... gPhysicsManager. shutDown(); gAnimationManager. shutDown(); gRenderManager. shutDown(); gFileSystemManager. shutDown(); gMemoryManager. shutDown(); return 0; } 5.1. Subsystem Start-Up and Shut-Down 202 5. Engine Support Systems There are “more elegant” ways to accomplish this. For example, you could have each manager register itself into a global priority queue and then walk this queue to start up all the managers in the proper order. You could defi ne the manger-to-manager dependency graph by having each manager explicitly list the other managers upon which it depends and then write some code to calculate the optimal start-up order given their interdependencies. You could use the construct-on-demand approach outlined above. In my experience, the brute-force approach always wins out, because: It’s simple and easy to implement.• It’s explicit. You can see and understand the start-up order immediately • by just looking at the code. It’s easy to debug and maintain. If something isn’t starting early enough, • or is starting too early, you can just move one line of code. One minor disadvantage to the brute-force manual start-up and shut- down method is that you might accidentally shut things down in an order that isn’t strictly the reverse of the start-up order. But I wouldn’t lose any sleep over it. As long as you can start up and shut down your engine’s subsystems successfully, you’re golden. 5.1.3. Some Examples from Real Engines Let’s take a brief look at some examples of engine start-up and shut-down taken from real game engines. 5.1.3.1. Ogre3D Ogre3D is by its authors’ admission a rendering engine, not a game engine per se. But by necessity it provides many of the low-level features found in full-fl edged game engines, including a simple and elegant start-up and shut- down mechanism. Everything in Ogre is controlled by the singleton object Ogre::Root. It contains pointers to every other subsystem in Ogre and man- ages their creation and destruction. This makes it very easy for a programmer to start up Ogre—just new an instance of Ogre::Root and you’re done. 
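For instance, a minimal client program might look something like the sketch below. This is not code from Ogre's own samples; the file names passed to the constructor are hypothetical, but the three parameters correspond to the Root constructor excerpted in the next section.

#include <Ogre.h>

int main()
{
    // Constructing the Root singleton creates and wires up Ogre's other
    // manager singletons (log manager, resource managers, and so on).
    Ogre::Root* pRoot = new Ogre::Root(
        "plugins.cfg",   // plug-in list (hypothetical file name)
        "ogre.cfg",      // engine configuration (hypothetical file name)
        "ogre.log");     // log file (hypothetical file name)

    // ... configure the render system, create a window, run the app ...

    // Deleting the Root shuts the subsystems down again in a
    // well-defined order.
    delete pRoot;
    return 0;
}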
Here are a few excerpts from the Ogre source code so we can see what it’s doing: OgreRoot.h class _OgreExport Root : public Singleton { // // Singletons LogManager* mLogManager; 203 ControllerManager* mControllerManager; SceneManagerEnumerator* mSceneManagerEnum; SceneManager* mCurrentSceneManager; DynLibManager* mDynLibManager; ArchiveManager* mArchiveManager; MaterialManager* mMaterialManager; MeshManager* mMeshManager; ParticleSystemManager* mParticleManager; SkeletonManager* mSkeletonManager; OverlayElementFactory* mPanelFactory; OverlayElementFactory* mBorderPanelFactory; OverlayElementFactory* mTextAreaFactory; OverlayManager* mOverlayManager; FontManager* mFontManager; ArchiveFactory *mZipArchiveFactory; ArchiveFactory *mFileSystemArchiveFactory; ResourceGroupManager* mResourceGroupManager; ResourceBackgroundQueue* mResourceBackgroundQueue; ShadowTextureManager* mShadowTextureManager; // etc. }; OgreRoot.cpp Root::Root(const String& pluginFileName, const String& configFileName, const String& logFileName): mLogManager(0), mCurrentFrame(0), mFrameSmoothingTime(0.0f), mNextMovableObjectTypeFlag(1), mIsInitialised(false) { // superclass will do singleton checking String msg; // Init mActiveRenderer = 0; mVersion = StringConverter::toString(OGRE_VERSION_MAJOR) + "." + StringConverter::toString(OGRE_VERSION_MINOR) + "." + StringConverter::toString(OGRE_VERSION_PATCH) + OGRE_VERSION_SUFFIX + " " + "(" + OGRE_VERSION_NAME + ")"; mConfigFileName = configFileName; // Create log manager and default log file if there // is no log manager yet 5.1. Subsystem Start-Up and Shut-Down 204 5. Engine Support Systems if(LogManager::getSingletonPtr() == 0) { mLogManager = new LogManager(); mLogManager->createLog(logFileName, true, true); } // Dynamic library manager mDynLibManager = new DynLibManager(); mArchiveManager = new ArchiveManager(); // ResourceGroupManager mResourceGroupManager = new ResourceGroupManager(); // ResourceBackgroundQueue mResourceBackgroundQueue = new ResourceBackgroundQueue(); // and so on... Ogre provides a templated Ogre::Singleton base class from which all of its singleton (manager) classes derive. If you look at its implementation, you’ll see that Ogre::Singleton does not use deferred construction, but instead relies on Ogre::Root to explicitly new each singleton. As we discussed above, this is done to ensure that the singletons are created and destroyed in a well- defi ned order. 5.1.3.2. Naughty Dog’s Uncharted: Drake’s Fortune The Uncharted: Drake’s Fortune engine created by Naughty Dog Inc. uses a similar explicit technique for starting up its subsystems. You’ll notice by look- ing at the following code that engine start-up is not always a simple sequence of allocating singleton instances. A wide range of operating system services, third party libraries, and so on must all be started up during engine initial- ization. Also, dynamic memory allocation is avoided wherever possible, so many of the singletons are statically-allocated objects (e.g., g_fileSystem, g_languageMgr, etc.) It’s not always prett y, but it gets the job done. Err BigInit() { init_exception_handler(); U8* pPhysicsHeap = new(kAllocGlobal, kAlign16) U8[ALLOCATION_GLOBAL_PHYS_HEAP]; PhysicsAllocatorInit(pPhysicsHeap, ALLOCATION_GLOBAL_PHYS_HEAP); g_textDb.Init(); g_textSubDb.Init(); 205 5.2. 
Memory Management g_spuMgr.Init(); g_drawScript.InitPlatform(); PlatformUpdate(); thread_t init_thr; thread_create(&init_thr, threadInit, 0, 30, 64*1024, 0, "Init"); char masterConfigFileName[256]; snprintf(masterConfigFileName, sizeof(masterConfigFileName), MASTER_CFG_PATH); { Err err = ReadConfigFromFile( masterConfigFileName); if (err.Failed()) { MsgErr("Config file not found (%s).\n", masterConfigFileName); } } memset(&g_discInfo, 0, sizeof(BootDiscInfo)); int err1 = GetBootDiscInfo(&g_discInfo); Msg("GetBootDiscInfo() : 0x%x\n", err1); if(err1 == BOOTDISCINFO_RET_OK) { printf("titleId : [%s]\n", g_discInfo.titleId); printf("parentalLevel : [%d]\n", g_discInfo.parentalLevel); } g_fileSystem.Init(g_gameInfo.m_onDisc); g_languageMgr.Init(); if (g_shouldQuit) return Err::kOK; // and so on... 5.2. Memory Management As game developers, we are always trying to make our code run more quickly. The performance of any piece of soft ware is dictated not only by the algo- rithms it employs, or the effi ciency with which those algorithms are coded, 206 5. Engine Support Systems but also by how the program utilizes memory (RAM). Memory aff ects perfor- mance in two ways:  1. Dynamic memory allocation via malloc() or C++’s global operator new is a very slow operation. We can improve the performance of our code by either avoiding dynamic allocation altogether or by making use of custom memory allocators that greatly reduce allocation costs. On modern CPUs, the performance of a piece of soft ware is oft en 2. dominated by its memory access patt erns . As we’ll see, data that is located in small, contiguous blocks of memory can be operated on much more effi ciently by the CPU than if that same data were to be spread out across a wide range of memory addresses. Even the most effi cient algorithm, coded with the utmost care, can be brought to its knees if the data upon which it operates is not laid out effi ciently in memory. In this section, we’ll learn how to optimize our code’s memory utilization along these two axes. 5.2.1. Optimizing Dynamic Memory Allocation Dynamic memory allocation via malloc() and free() or C++’s global new and delete operators—also known as heap allocation—is typically very slow. The high cost can be att ributed to two main factors. First, a heap allocator is a general-purpose facility, so it must be writt en to handle any allocation size, from one byte to one gigabyte. This requires a lot of management overhead, making the malloc() and free() functions inherently slow. Second, on most operating systems a call to malloc() or free() must fi rst context-switch from user mode into kernel mode, process the request, and then context-switch back to the program. These context switches can be extraordinarily expensive. One rule of thumb oft en followed in game development is: Keep heap allocations to a minimum, and never allocate from the heap within a tight loop. Of course, no game engine can entirely avoid dynamic memory alloca- tion, so most game engines implement one or more custom allocators. A custom allocator can have bett er performance characteristics than the oper- ating system’s heap allocator for two reasons. First, a custom allocator can satisfy requests from a preallocated memory block (itself allocated using malloc() or new, or declared as a global variable). This allows it to run in user mode and entirely avoid the cost of context-switching into the operat- 207 ing system. 
Second, by making various assumptions about its usage patterns, a custom allocator can be much more efficient than a general-purpose heap allocator.

In the following sections, we'll take a look at some common kinds of custom allocators. For additional information on this topic, see Christian Gyrling's excellent blog post, http://www.swedishcoding.com/2008/08/31/are-we-out-of-memory.

5.2.1.1. Stack-Based Allocators

Many games allocate memory in a stack-like fashion. Whenever a new game level is loaded, memory is allocated for it. Once the level has been loaded, little or no dynamic memory allocation takes place. At the conclusion of the level, its data is unloaded and all of its memory can be freed. It makes a lot of sense to use a stack-like data structure for these kinds of memory allocations.

A stack allocator is very easy to implement. We simply allocate a large contiguous block of memory using malloc() or global new, or by declaring a global array of bytes (in which case the memory is effectively allocated out of the executable's BSS segment). A pointer to the top of the stack is maintained. All memory addresses below this pointer are considered to be in use, and all addresses above it are considered to be free. The top pointer is initialized to the lowest memory address in the stack. Each allocation request simply moves the pointer up by the requested number of bytes. The most-recently allocated block can be freed by simply moving the top pointer back down by the size of the block.

It is important to realize that with a stack allocator, memory cannot be freed in an arbitrary order. All frees must be performed in an order opposite to that in which they were allocated. One simple way to enforce these restrictions is to disallow individual blocks from being freed at all. Instead, we can provide a function that rolls the stack top back to a previously-marked location, thereby freeing all blocks between the current top and the roll-back point. It's important to always roll the top pointer back to a point that lies at the boundary between two allocated blocks, because otherwise new allocations would overwrite the tail end of the top-most block. To ensure that this is done properly, a stack allocator often provides a function that returns a marker representing the current top of the stack. The roll-back function then takes one of these markers as its argument. This is depicted in Figure 5.1 (obtain a marker after allocating blocks A and B; allocate additional blocks C, D and E; free back to the marker, leaving only A and B).

The interface of a stack allocator often looks something like this.

class StackAllocator
{
public:
    // Stack marker: Represents the current top of the
    // stack. You can only roll back to a marker, not to
    // arbitrary locations within the stack.
    typedef U32 Marker;

    // Constructs a stack allocator with the given total
    // size.
    explicit StackAllocator(U32 stackSize_bytes);

    // Allocates a new block of the given size from stack
    // top.
    void* alloc(U32 size_bytes);

    // Returns a marker to the current stack top.
    Marker getMarker();

    // Rolls the stack back to a previous marker.
    void freeToMarker(Marker marker);

    // Clears the entire stack (rolls the stack back to
    // zero).
    void clear();

private:
    // ...
}; Double-Ended Stack Allocators A single memory block can actually contain two stack allocators—one which allocates up from the bott om of the block and one which allocates down from the top of the block. A double-ended stack allocator is useful because it uses memory more effi ciently by allowing a trade-off to occur between the memory usage of the bott om stack and the memory usage of the top stack. In some situ- ations, both stacks may use roughly the same amount of memory and meet in the middle of the block. In other situations, one of the two stacks may eat up a lot more memory than the other stack, but all allocation requests can still be satisfi ed as long as the total amount of memory requested is not larger than the block shared by the two stacks. This is depicted in Figure 5.2. In Midway’s Hydro Thunder arcade game, all memory allocations are made from a single large block of memory managed by a double-ended stack allocator. The bott om stack is used for loading and unloading levels (race tracks), while the top stack is used for temporary memory blocks that are al- located and freed every frame. This allocation scheme worked extremely well and ensured that Hydro Thunder never suff ered from memory fragmentation problems (see Section 5.2.1.4). Steve Ranck, Hydro Thunder’s lead engineer, de- scribes this allocation technique in depth in [6], Section 1.9. Lower Upper Figure 5.2. A double-ended stack allocator. 5.2. Memory Management 5.2.1.2. Pool Allocators It’s quite common in game engine programming (and soft ware engineering in general) to allocate lots of small blocks of memory, each of which are the same size. For example, we might want to allocate and free matrices, or iterators, or links in a linked list, or renderable mesh instances. For this type of memory allocation patt ern, a pool allocator is oft en the perfect choice. A pool allocator works by preallocating a large block of memory whose size is an exact multiple of the size of the elements that will be allocated. For example, a pool of 4   ×   4 matrices would be an exact multiple of 64 bytes (16 el- ements per matrix times four bytes per element). Each element within the pool is added to a linked list of free elements; when the pool is fi rst initialized, the free list contains all of the elements. Whenever an allocation request is made, 210 5. Engine Support Systems we simply grab the next free element off the free list and return it. When an element is freed, we simply tack it back onto the free list. Both allocations and frees are O(1) operations, since each involves only a couple of pointer ma- nipulations, no matt er how many elements are currently free. (The notation O(1) is an example of big “O” notation. In this case it means that the execution time of both allocations and frees are roughly constant and do not depend on things like the number of elements currently in the pool. See Section 5.3.3 for an explanation of big “O” notation.) The linked list of free elements can be a singly-linked list, meaning that we need a single pointer (four bytes on most machines) for each free ele- ment. Where should we obtain the memory for these pointers? Certainly they could be stored in a separate preallocated memory block, occupying (sizeof(void*) * numElementsInPool) bytes. However, this is unduly wasteful. We need only realize that the blocks on the free list are, by defi nition, free memory blocks. So why not use the free blocks themselves to store the free list’s “next” pointers? 
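Here is a small sketch of that idea in code. It is illustrative only, not taken from any particular engine; the U8/U32 typedefs mirror the integer types used elsewhere in this chapter, error handling is omitted, and the pool is assumed to contain at least one element, each at least as large as a pointer.

#include <cstddef>

typedef unsigned char U8;    // engine-style typedefs, as used in this chapter
typedef unsigned int  U32;

// Illustrative pool allocator sketch. Free elements double as free-list
// nodes: the first bytes of each free element hold the "next" pointer.
class PoolAllocator
{
public:
    // pMemory must point to (elementSize * elementCount) bytes, with
    // elementSize >= sizeof(void*) and elementCount >= 1.
    PoolAllocator(void* pMemory, U32 elementSize, U32 elementCount)
        : m_pFreeHead(pMemory)
    {
        // Thread a singly-linked free list through the elements themselves.
        U8* p = static_cast<U8*>(pMemory);
        for (U32 i = 0; i < elementCount - 1; ++i)
        {
            *reinterpret_cast<void**>(p + i * elementSize)
                = p + (i + 1) * elementSize;
        }
        *reinterpret_cast<void**>(p + (elementCount - 1) * elementSize) = NULL;
    }

    void* alloc()
    {
        // O(1): pop the head of the free list (NULL if the pool is exhausted).
        void* pBlock = m_pFreeHead;
        if (pBlock)
            m_pFreeHead = *static_cast<void**>(pBlock);
        return pBlock;
    }

    void free(void* pBlock)
    {
        // O(1): push the block back onto the head of the free list.
        *static_cast<void**>(pBlock) = m_pFreeHead;
        m_pFreeHead = pBlock;
    }

private:
    void* m_pFreeHead;   // head of the free list; NULL when no elements remain
};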
This litt le “trick” works as long as elementSize >= sizeof(void*). If each element is smaller than a pointer, then we can use pool element in- dices instead of pointers to implement our linked list. For example, if our pool contains 16-bit integers, then we can use 16-bit indices as the “next pointers” in our linked list. This works as long as the pool doesn’t contain more than 216 = 65,536 elements. 5.2.1.3. Aligned Allocations As we saw in Section 3.2.5.1, every variable and data object has an alignment requirement. An 8-bit integer variable can be aligned to any address, but a 32-bit integer or fl oating-point variable must be 4-byte aligned, meaning its address can only end in the nibbles 0x0, 0x4, 0x8 or 0xC. A 128-bit SIMD vector value generally has a 16-byte alignment requirement, meaning that its mem- ory address can end only in the nibble 0x0. On the PS3, memory blocks that are to be transferred to an SPU via the direct memory access (DMA) controller should be 128-bit aligned for maximum DMA throughput, meaning they can only end in the bytes 0x00 or 0x80. All memory allocators must be capable of returning aligned memory blocks. This is relatively straightforward to implement. We simply allocate a litt le bit more memory than was actually requested, adjust the address of the memory block upward slightly so that it is aligned properly, and then re- turn the adjusted address. Because we allocated a bit more memory than was requested, the returned block will still be large enough, even with the slight upward adjustment. 211 In most implementations, the number of additional bytes allocated is equal to the alignment. For example, if the request is for a 16-byte aligned memory block, we would allocate 16 additional bytes. This allows for the worst-case address adjustment of 15 bytes, plus one extra byte so that we can use the same calculations even if the original block is already aligned. This simplifi es and speeds up the code at the expense of one wasted byte per al- location. It’s also important because, as we’ll see below, we’ll need those extra bytes to store some additional information that will be used when the block is freed. We determine the amount by which the block’s address must be adjusted by masking off the least-signifi cant bits of the original block’s memory ad- dress, subtracting this from the desired alignment, and using the result as the adjustment off set. The alignment should always be a power of two (four- byte and 16-byte alignments are typical), so to generate the mask we simply subtract one from the alignment. For example, if the request is for a 16-byte aligned block, then the mask would be (16 – 1) = 15 = 0x0000000F. Taking the bitwise AND of this mask and any misaligned address will yield the amount by which the address is misaligned. For example, if the originally- allocated block’s address is 0x50341233, ANDing this address with the mask 0x0000000F yields 0x00000003, so the address is misaligned by three bytes. To align the address, we add (alignment – misalignment) = (16 – 3) = 13 = 0xD bytes to it. The fi nal aligned address is therefore 0x50341233 + 0xD = 0x50341240. Here’s one possible implementation of an aligned memory allocator: // Aligned allocation function. IMPORTANT: 'alignment' // must be a power of 2 (typically 4 or 16). void* allocateAligned(U32 size_bytes, U32 alignment) { // Determine total amount of memory to allocate. U32 expandedSize_bytes = size_bytes + alignment; // Allocate an unaligned block & convert address to a // U32. 
U32 rawAddress = (U32)allocateUnaligned(expandedSize_bytes); // Calculate the adjustment by masking off the lower // bits of the address, to determine how "misaligned" // it is. U32 mask = (alignment – 1); U32 misalignment = (rawAddress & mask); U32 adjustment = alignment – misalignment; 5.2. Memory Management 212 5. Engine Support Systems // Calculate the adjusted address, and return as a // pointer. U32 alignedAddress = rawAddress + adjustment; return (void*)alignedAddress; } When this block is later freed, the code will pass us the adjusted address, not the original address we allocated. How, then, do we actually free the mem- ory? We need some way to convert an adjusted address back into the original, possibly misaligned address. To accomplish this, we simply store some meta-information in those extra bytes we allocated in order to align the data in the fi rst place. The smallest adjustment we might make is one byte. That’s enough room to store the number of bytes by which the address was adjusted (since it will never be more than 256). We always store this information in the byte im- mediately preceding the adjusted address (no matt er how many bytes of adjustment we actually added), so that it is trivial to fi nd it again, given the adjusted address. Here’s how the modifi ed allocateAligned() function would look. // Aligned allocation function. IMPORTANT: ‘alignment’ // must be a power of 2 (typically 4 or 16). void* allocateAligned(U32 size_bytes, U32 alignment) { // Clients must call allocateUnaligned() and // freeUnaligned() if alignment == 1. ASSERT(alignment > 1); // Determine total amount of memory to allocate. U32 expandedSize_bytes = size_bytes + alignment; // Allocate an unaligned block & convert address to a // U32. U32 rawAddress = (U32)allocateUnaligned(expandedSize_bytes); // Calculate the adjustment by masking off the lower // bits of the address, to determine how “misaligned” // it is. U32 mask = (alignment – 1); U32 misalignment = (rawAddress & mask); U32 adjustment = alignment – misalignment; // Calculate the adjusted address, and return as a // pointer. U32 alignedAddress = rawAddress + adjustment; 213 // Store the adjustment in the four bytes immediately // preceding the adjusted address that we’re // returning. U32* pAdjustment = (U32*)(alignedAddress – 4); *pAdjustment = adjustment; return (void*)alignedAddress; } And here’s how the corresponding freeAligned() function would be imple- mented. void freeAligned(void* p) { U32 alignedAddress = (U32)p; U8* pAdjustment = (U8*)(alignedAddress – 4); U32 adjustment = (U32)*pAdjustment; U32 rawAddress = alignedAddress – adjustment; freeUnaligned((void*)rawAddress); } 5.2.1.4. Single-Frame and Double-Buffered Memory Allocators Virtually all game engines allocate at least some temporary data during the game loop. This data is either discarded at the end of each iteration of the loop or used on the next frame and then discarded. This allocation patt ern is so common that many engines support single- and double-buff ered allocators. Single-Frame Allocators A single-frame allocator is implemented by reserving a block of memory and managing it with a simple stack allocator as described above. At the beginning of each frame, the stack’s “top” pointer is cleared to the bott om of the memory block. Allocations made during the frame grow toward the top of the block. Rinse and repeat. StackAllocator g_singleFrameAllocator; // Main Game Loop while (true) { // Clear the single-frame allocator’s buffer every // frame. 
g_singleFrameAllocator. clear(); 5.2. Memory Management 214 5. Engine Support Systems // ... // Allocate from the single-frame buffer. We never // need to free this data! Just be sure to use it // only this frame. void* p = g_singleFrameAllocator.alloc(nBytes); // ... } One of the primary benefi ts of a single-frame allocator is that allocated memory needn’t ever be freed—we can rely on the fact that the allocator will be cleared at the start of every frame. Single-frame allocators are also blind- ingly fast. The one big negative is that using a single-frame allocator requires a reasonable level of discipline on the part of the programmer. You need to realize that a memory block allocated out of the single-frame buff er will only be valid during the current frame. Programmers must never cache a pointer to a single-frame memory block across the frame boundary! Double-Buffered Allocators A double-buff ered allocator allows a block of memory allocated on frame i to be used on frame (i + 1). To accomplish this, we create two single-frame stack allocators of equal size and then ping-pong between them every frame. class DoubleBufferedAllocator { U32 m_curStack; StackAllocator m_stack[2]; public: void swapBuffers() { m_curStack = (U32)!m_curStack; } void clearCurrentBuffer() { m_stack[m_curStack]. clear(); } void* alloc(U32 nBytes) { return m_stack[m_curStack].alloc(nBytes); } // ... }; 215 // ... DoubleBufferedAllocator g_doubleBufAllocator; // Main Game Loop while (true) { // Clear the single-frame allocator every frame as // before. g_singleFrameAllocator.clear(); // Swap the active and inactive buffers of the double // buffered allocator. g_doubleBufAllocator. swapBuffers(); // Now clear the newly active buffer, leaving last // frame’s buffer intact. g_doubleBufAllocator. clearCurrentBuffer(); // ... // Allocate out of the current buffer, without // disturbing last frame’s data. Only use this data // this frame or next frame. Again, this memory never // needs to be freed. void* p = g_doubleBufAllocator.alloc(nBytes); // ... } This kind of allocator is extremely useful for caching the results of asyn- chronous processing on a multicore game console like the Xbox 360 or the PLAYSTATION 3. On frame i, we can kick off an asynchronous job on one of the PS3’s SPUs, handing it the address of a destination buff er that has been allocated from our double-buff ered allocator. The job runs and produces its results some time before the end of frame i, storing them into the buff er we provided. On frame (i + 1), the buff ers are swapped. The results of the job are now in the inactive buff er, so they will not be overwritt en by any double- buff ered allocations that might be made during this frame. As long as we use the results of the job before frame (i + 2), our data won’t be overwritt en. 5.2.2. Memory Fragmentation Another problem with dynamic heap allocations is that memory can become fragmented over time. When a program fi rst runs, its heap memory is entirely free. When a block is allocated, a contiguous region of heap memory of the 5.2. Memory Management 216 5. Engine Support Systems appropriate size is marked as “in use,” and the remainder of the heap remains free. When a block is freed, it is marked as such, and adjacent free blocks are merged into a single, larger free block. Over time, as allocations and dealloca- tions of various sizes occur in random order, the heap memory begins to look like a patchwork of free and used blocks. 
We can think of the free regions as "holes" in the fabric of used memory. When the number of holes becomes large, and/or the holes are all relatively small, we say the memory has become fragmented. This is illustrated in Figure 5.3 (memory fragmentation: the heap after one allocation, after eight allocations, after eight allocations and three frees, and after n allocations and m frees).

The problem with memory fragmentation is that allocations may fail even when there are enough free bytes to satisfy the request. The crux of the problem is that allocated memory blocks must always be contiguous. For example, in order to satisfy a request of 128 kB, there must exist a free "hole" that is 128 kB or larger. If there are two holes, each of which is 64 kB in size, then enough bytes are available, but the allocation fails because they are not contiguous bytes.

Memory fragmentation is not as much of a problem on operating systems that support virtual memory. A virtual memory system maps discontiguous blocks of physical memory known as pages into a virtual address space, in which the pages appear to the application to be contiguous. Stale pages can be swapped to the hard disk when physical memory is in short supply and reloaded from disk when they are needed. For a detailed discussion of how virtual memory works, see http://lyle.smu.edu/~kocan/7343/fall05/slides/chapter08.ppt. Most embedded systems cannot afford to implement a virtual memory system. While some modern consoles do technically support it, most console game engines still do not make use of virtual memory due to the inherent performance overhead.

5.2.2.1. Avoiding Fragmentation with Stack and Pool Allocators

The detrimental effects of memory fragmentation can be avoided by using stack and/or pool allocators.

• A stack allocator is impervious to fragmentation because allocations are always contiguous, and blocks must be freed in an order opposite to that in which they were allocated. This is illustrated in Figure 5.4 (a stack allocator is free from fragmentation problems: the allocated blocks are always contiguous, as is the single free block above them).
• A pool allocator is also free from fragmentation problems. Pools do become fragmented, but the fragmentation never causes premature out-of-memory conditions as it does in a general-purpose heap. Pool allocation requests can never fail due to a lack of a large enough contiguous free block, because all of the blocks are exactly the same size. This is shown in Figure 5.5 (a pool allocator is not degraded by fragmentation: allocated and free blocks are all the same size).

5.2.2.2. Defragmentation and Relocation

When differently-sized objects are being allocated and freed in a random order, neither a stack-based allocator nor a pool-based allocator can be used. In such cases, fragmentation can be avoided by periodically defragmenting the heap. Defragmentation involves coalescing all of the free "holes" in the heap by shifting allocated blocks from higher memory addresses down to lower addresses (thereby shifting the holes up to higher addresses). One simple algorithm is to search for the first "hole" and then take the allocated block immediately above the hole and shift it down to the start of the hole. This has the effect of "bubbling up" the hole to a higher memory address.
If this process is repeated, eventually all the allocated blocks will occupy a contiguous region of memory at the low end of the heap’s address space, and all the holes will have bubbled up into one big hole at the high end of the heap. This is illus- trated in Figure 5.6. The shift ing of memory blocks described above is not particularly tricky to implement. What is tricky is accounting for the fact that we’re moving al- located blocks of memory around. If anyone has a pointer into one of these al- located blocks, then moving the block will invalidate the pointer. The solution to this problem is to patch any and all pointers into a shift ed memory block so that they point to the correct new address aft er the shift . This procedure is known as pointer relocation . Unfortunately, there is no gen- eral-purpose way to fi nd all the pointers that point into a particular region A B C D E A B C D E A B C D E A B C D E A B C D E Figure 5.6. Defragmentation by shifting allocated blocks to lower addresses. 219 of memory. So if we are going to support memory defragmentation in our game engine, programmers must either carefully keep track of all the pointers manually so they can be relocated, or pointers must be abandoned in favor of something inherently more amenable to relocation, such as smart pointers or handles. A smart pointer is a small class that contains a pointer and acts like a pointer for most intents and purposes. But because a smart pointer is a class, it can be coded to handle memory relocation properly. One approach is to arrange for all smart pointers to add themselves to a global linked list. When- ever a block of memory is shift ed within the heap, the linked list of all smart pointers can be scanned, and each pointer that points into the shift ed block of memory can be adjusted appropriately. A handle is usually implemented as an index into a non-relocatable ta- ble which itself contains the pointers. When an allocated block is shift ed in memory, the handle table can be scanned and all relevant pointers found and updated automatically. Because the handles are just indices into the pointer table, their values never change no matt er how the memory blocks are shift ed, so the objects that use the handles are never aff ected by memory relocation. Another problem with relocation arises when certain memory blocks can- not be relocated. For example, if you are using a third-party library that does not use smart pointers or handles, it’s possible that any pointers into its data structures will not be relocatable. The best way around this problem is usu- ally to arrange for the library in question to allocate its memory from a special buff er outside of the relocatable memory area. The other option is to simply accept that some blocks will not be relocatable. If the number and size of the non-relocatable blocks are both small, a relocation system will still perform quite well. It is interesting to note that all of Naughty Dog’s engines have supported defragmentation. Handles are used wherever possible to avoid the need to re- locate pointers. However, in some cases raw pointers cannot be avoided. These pointers are carefully tracked and relocated manually whenever a memory block is shift ed due to defragmentation. A few of Naughty Dog’s game object classes are not relocatable for various reasons. 
However, as mentioned above, this doesn’t pose any practical problems, because the number of such objects is always very small, and their sizes are tiny when compared to the overall size of the relocatable memory area. Amortizing Defragmentation Costs Defragmentation can be a slow operation because it involves copying memory blocks. However, we needn’t fully defragment the heap all at once. Instead, the cost can be amortized over many frames. We can allow up to N allocated 5.2. Memory Management 220 5. Engine Support Systems blocks to be shift ed each frame, for some small value of N like 8 or 16. If our game is running at 30 frames per second, then each frame lasts 1/30 of a sec- ond (33 ms). So the heap can usually be completely defragmented in less than one second without having any noticeable eff ect on the game’s frame rate. As long as allocations and deallocations aren’t happening at a faster rate than the defragmentation shift s, the heap will remain mostly defragmented at all times. This approach is only valid when the size of each block is relatively small, so that the time required to move a single block does not exceed the time al- lott ed to relocation each frame. If very large blocks need to be relocated, we can oft en break them up into two or more subblocks, each of which can be relocated independently. This hasn’t proved to be a problem in Naughty Dog’s engine, because relocation is only used for dynamic game objects, and they are never larger than a few kilobytes—and usually much smaller. 5.2.3. Cache Coherency To understand why memory access patt erns aff ect performance, we need fi rst to understand how modern processors read and write memory. Access- ing main system RAM is always a slow operation, oft en taking thousands of processor cycles to complete. Contrast this with a register access on the CPU itself, which takes on the order of tens of cycles or sometimes even a single cycle. To reduce the average cost of reading and writing to main RAM, mod- ern processors utilize a high-speed memory cache. A cache is a special type of memory that can be read from and writt en to by the CPU much more quickly than main RAM. The basic idea of memory caching is to load a small chunk of memory into the high-speed cache when- ever a given region of main RAM is fi rst read. Such a memory chunk is called a cache line and is usually between 8 and 512 bytes, depending on the micro- processor architecture. On subsequent read operations, if the requested data already exists in the cache, it is loaded from the cache directly into the CPU’s registers—a much faster operation than reading from main RAM. Only if the required data is not already in the cache does main RAM have to be accessed. This is called a cache miss . Whenever a cache miss occurs, the program is forced to wait for the cache line to be refreshed from main RAM. Similar rules may apply when writing data to RAM. The simplest kind of cache is called a write-through cache ; in such a cache design, all writes to the cache are simply mirrored to main RAM immediately. However, in a write-back (or copy-back ) cache design, data is fi rst writt en into the cache and the cache line is only fl ushed out to main RAM under certain circumstances, such as when a dirty cache line needs to be evicted in order to read in a new 221 cache line from main RAM, or when the program explicitly requests a fl ush to occur. Obviously cache misses cannot be totally avoided, since data has to move to and from main RAM eventually. 
However, the trick to high-performance computing is to arrange your data in RAM and code your algorithms in such a way that the minimum number of cache misses occur. We’ll see exactly how to accomplish this below. 5.2.3.1. Level 1 and Level 2 Caches When caching techniques were fi rst developed, the cache memory was locat- ed on the motherboard, constructed from a faster and more expensive type of memory module than main RAM in order to give it the required boost in speed. However, cache memory was expensive, so the cache size was usually quite small—on the order of 16 kB. As caching techniques evolved, an even faster type of cache memory was developed that was located on the CPU die itself. This gave rise to two distinct types of cache memory: an on-die level 1 (L1) cache and an on-motherboard level 2 (L2) cache. More recently, the L2 cache has also migrated onto the CPU die (see Figure 5.7). The rules for moving data back and forth between main RAM are of course complicated by the presence of a level 2 cache. Now, instead of data hopping from RAM to cache to CPU and back again, it must make two hops—fi rst from main RAM to the L2 cache, and then from L2 cache to L1 cache. We won’t go into the specifi cs of these rules here. (They diff er slightly from CPU to CPU anyway.) But suffi ce it to say that RAM is slower than L2 cache memory, and L2 cache is slower than L1 cache. Hence L2 cache misses are usually more expensive than L1 cache misses, all other things being equal. 5.2. Memory Management CPU Die CPU L1 Cache L2 Cache Main RAMslower slowestfast Figure 5.7. Level 1 and level 2 caches. 222 5. Engine Support Systems A load-hit-store is a particularly bad kind of cache miss, prevalent on the PowerPC architectures found in the Xbox 360 and PLAYSTATION 3, in which the CPU writes data to a memory address and then reads the data back before it has had a chance to make its way through the CPU’s instruction pipeline and out into the L1 cache. See htt p://assemblyrequired.crashworks.org/2008/07/08/ load-hit-stores-and-the-_ _restrict-keyword for more details. 5.2.3.2. Instruction Cache and Data Cache When writing high-performance code for a game engine or for any other per- formance-critical system, it is important to realize that both data and code are cached. The instruction cache (I-cache) is used to preload executable machine code before it runs, while the data cache (D-cache) is used to speed up reading and writing of data to main RAM. Most processors separate the two caches physically. Hence it is possible for a program to slow down because of an I- cache miss or because of a D-cache miss. 5.2.3.3. Avoiding Cache Misses The best way to avoid D-cache misses is to organize your data in contiguous blocks that are as small as possible and then access them sequentially. This yields the minimum number of cache misses. When the data is contiguous (i.e., you don’t “jump around” in memory a lot), a single cache miss will load the maximum amount of relevant data in one go. When the data is small, it is more likely to fi t into a single cache line (or at least a minimum number of cache lines). And when you access your data sequentially (i.e., you don’t “jump around” within the contiguous memory block), you achieve the mini- mum number of cache misses, since the CPU never has to reload a cache line from the same region of RAM. Avoiding I-cache misses follows the same basic principle as avoiding D- cache misses. However, the implementation requires a diff erent approach. 
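Before turning to the instruction cache, here is a small illustration of the data-cache guidance above. The particle type and both functions are hypothetical, a sketch rather than engine code; the point is only to contrast the two access patterns.

struct Particle
{
    float position[3];
    float velocity[3];
};

// Cache-friendly: the particles live in one contiguous array and are
// visited in order, so each cache line that is loaded is fully used
// before the next one is fetched.
void integrateContiguous(Particle* particles, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < 3; ++j)
            particles[i].position[j] += particles[i].velocity[j] * dt;
}

// Cache-unfriendly: each particle was allocated separately, so consecutive
// iterations may touch widely scattered addresses, and many cache lines
// are loaded only to use a small fraction of their contents.
void integrateScattered(Particle** particlePtrs, int count, float dt)
{
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < 3; ++j)
            particlePtrs[i]->position[j] += particlePtrs[i]->velocity[j] * dt;
}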
The compiler and linker dictate how your code is laid out in memory, so you might think you have litt le control over I-cache misses. However, most C/C++ linkers follow some simple rules that you can leverage, once you know what they are: The machine code for a single function is almost always contiguous in • memory. That is, the linker almost never splits a function up in order to intersperse another function in the middle. (Inline functions are the exception to this rule—more on this topic below.) Functions are laid out in memory in the order they appear in the • translation unit’s source code (.cpp fi le). 223 Therefore, functions in a single translation unit are always contiguous • in memory. That is, the linker never splits up a complied translation unit (.obj fi le) in order to intersperse code from some other translation unit. So, following the same principles that apply to avoiding D-cache misses, we should follow the rules of thumb listed below. Keep high-performance code • as small as possible, in terms of number of machine language instructions. (The compiler and linker take care of keeping our functions contiguous in memory.) Avoid calling functions• from within a performance-critical section of code. If you do have to call a function, place it as • close as possible to the calling function—preferably immediately before or aft er the calling function and never in a diff erent translation unit (because then you completely lose control over its proximity to the calling function). Use inline functions judiciously. Inlining a small function can be a big • performance boost. However, too much inlining bloats the size of the code, which can cause a performance-critical section of code to no longer fi t within the cache. Let’s say we write a tight loop that processes a large amount of data—if the entire body of that loop doesn’t fi t into the cache, then we are signing up for two I-cache misses during every iteration of the loop. In such a situation, it is probably best to rethink the algorithm and/or implementation so that less code is required within critical loops. 5.3. Containers Game programmers employ a wide variety of collection-oriented data struc- tures, also known as containers or collections. The job of a container is always the same—to house and manage zero or more data elements; however, the details of how they do this varies greatly, and each type of container has its pros and cons. Common container data types include, but are certainly not limited to, the following. Array• . An ordered, contiguous collection of elements accessed by index. The length of the array is usually statically defi ned at compile time. It may be multidimensional. C and C++ support these natively (e.g., int a[5]). Dynamic array• . An array whose length can change dynamically at runtime (e.g., STL’s std::vector) 5.3. Containers 224 5. Engine Support Systems Linked list• . An ordered collection of elements not stored contiguously in memory but rather linked to one another via pointers (e.g., STL’s std::list). Stack• . A container that supports the last-in-fi rst-out (LIFO) model for adding and removing elements, also known as push/pop (e.g., std::stack). Queue• . A container that supports the fi rst-in-fi rst-out (FIFO) model for adding and removing elements (e.g., std::queue). Deque• . A double-ended queue—supports effi cient insertion and removal at both ends of the array (e.g., std::deque). Priority queue• . 
A container that permits elements to be added in any or- der and then removed in an order defi ned by some property of the ele- ments themselves (i.e., their priority). It can be thought of as a list that stays sorted at all times. A priority queue is typically implemented as a binary search tree (e.g., std::priority_queue). Tree• . A container in which elements are grouped hierarchically. Each ele- ment (node) has zero or one parent and zero or more children. A tree is a special case of a DAG (see below). Binary search tree (BST)• . A tree in which each node has at most two chil- dren, with an order property to keep the nodes sorted by some well-de- fi ned criteria. There are various kinds of binary search trees, including red-black trees, splay trees, SVL trees, etc. Binary heap• . A binary tree that maintains itself in sorted order, much like a binary search tree, via two rules: the shape property, which specifi es that the tree must be fully fi lled and that the last row of the tree is fi lled from left to right; and the heap property, which states that every node is, by some user-defi ned criterion, “greater than” or “equal to” all of its children. Dictionary• . A table of key-value pairs. A value can be “looked up” ef- fi ciently given the corresponding key. A dictionary is also known as a map or hash table, although technically a hash table is just one possible implementation of a dictionary (e.g., std::map, std::hash_map). Set• . A container that guarantees that all elements are unique according to some criteria. A set acts like a dictionary with only keys, but no values. Graph• . A collection of nodes connected to one another by unidirectional or bidirectional pathways in an arbitrary patt ern. Directed acyclic graph (DAG)• . A collection of nodes with unidirectional (i.e., directed) interconnections, with no cycles (i.e., there is no non-empty path that starts and ends on the same node). 225 5.3.1. Container Operations Game engines that make use of container classes inevitably make use of vari- ous commonplace algorithms as well. Some examples include: Insert.• Add a new element to the container. The new element might be placed at the beginning of the list, or the end, or in some other location; or the container might not have a notion of ordering at all. Remove.• Remove an element from the container; may require a fi nd op- eration (see below). However if an iterator is available that refers to the desired element, it may be more effi cient to remove the element using the iterator. Sequential access (iteration).• Accessing each element of the container in some “natural” predefi ned order. Random access.• Accessing elements in the container in an arbitrary or- der. Find.• Search a container for an element that meets a given criterion. There are all sorts of variants on the fi nd operation, including fi nding in reverse, fi nding multiple elements, etc. In addition, diff erent types of data structures and diff erent situations call for diff erent algorithms (see htt p://en.wikipedia.org/wiki/Search_algorithm). Sort• . Sort the contents of a container according to some given criteria. There are many diff erent sorting algorithms, including bubble sort, se- lection sort, insertion sort, quicksort, and so on. (See htt p://en.wikipedia. org/wiki/Sorting_algorithm for details.) 5.3.2. Iterators An iterator is a litt le class that “knows” how to effi ciently visit the elements in a particular kind of container. 
It acts like an array index or pointer—it refers to one element in the container at a time, it can be advanced to the next element, and it provides some sort of mechanism for testing whether or not all elements in the container have been visited. As an example, the first of the following two code snippets iterates over a C-style array using a pointer, while the second iterates over an STL linked list using almost identical syntax.

void processArray(int container[], int numElements)
{
    int* pBegin = &container[0];
    int* pEnd = &container[numElements];

    for (int* p = pBegin; p != pEnd; ++p)
    {
        int element = *p;
        // process element...
    }
}

void processList(std::list<int>& container)
{
    std::list<int>::iterator pBegin = container.begin();
    std::list<int>::iterator pEnd = container.end();

    for (std::list<int>::iterator p = pBegin; p != pEnd; ++p)
    {
        int element = *p;
        // process element...
    }
}

The key benefits to using an iterator over attempting to access the container's elements directly are:

• Direct access would break the container class' encapsulation. An iterator, on the other hand, is typically a friend of the container class, and as such it can iterate efficiently without exposing any implementation details to the outside world. (In fact, most good container classes hide their internal details and cannot be iterated over without an iterator.)
An iterator can simplify the process of iterating. Most iterators act like • array indices or pointers, so a simple loop can be writt en in which the iterator is incremented and compared against a terminating condition— even when the underlying data structure is arbitrarily complex. For example, an iterator can make an in-order depth-fi rst tree traversal look no more complex than a simple array iteration. 5.3.2.1. Preincrement versus Postincrement Notice in the above example that we are using C++’s preincrement operator , ++p, rather than the postincrement operator , p++. This is a subtle but some- times important optimization. The preincrement operator returns the value of the operand aft er the increment has been performed, whereas postincrement returns the previous, unincremented value. Hence preincrement can simply increment the pointer or iterator in place and return a reference to it. Postin- crement must cache the old value, then increment the pointer or iterator, and fi nally return the cached value. This isn’t a big deal for pointers or integer 228 5. Engine Support Systems tion. If an algorithm executes a subalgorithm n times, and the subalgorithm is O(log n), then the resulting algorithm would be O(n log n). To select an appropriate container class, we should look at the opera- tions that we expect to be most common, then select the container whose per- formance characteristics for those operations are most favorable. The most common orders you’ll encounter are listed here from fastest to slowest: O(1), O(log n), O(n), O(n log n), O(n2), O(nk) for k > 2. We should also take the memory layout and usage characteristics of our containers into account. For example, an array (e.g., int a[5] or std::vector) stores its elements contiguously in memory and requires no overhead storage for anything other than the elements themselves. (Note that a dynamic array does require a small fi xed overhead.) On the other hand, a linked list (e.g., std::list) wraps each element in a “link” data structure that contains a pointer to the next element and possibly also a pointer to the previous element, for a total of up to eight bytes of overhead per element. Also, the elements in a linked list need not be contiguous in memory and oft en aren’t. A contiguous block of memory is usually much more cache-friendly than a set of disparate memory blocks. Hence, for high-speed algorithms, ar- rays are usually bett er than linked lists in terms of cache performance (unless the nodes of the linked list are themselves allocated from a small, contiguous memory block of memory, which is rare but not entirely unheard of). But a linked list is bett er for situations in which speed of inserting and removing elements is of prime importance. 5.3.4. Building Custom Container Classes Many game engines provide their own custom implementations of the com- mon container data structures. This practice is especially prevalent in console game engines and games targeted at mobile phone and PDA platforms. The reasons for building these classes yourself include: Total control.• You control the data structure’s memory requirements, the algorithms used, when and how memory is allocated, etc. Opportunities for optimization.• You can optimize your data structures and algorithms to take advantage of hardware features specifi c to the console(s) you are targeting; or fi ne-tune them for a particular applica- tion within your engine. 
Customizability.• You can provide custom algorithms not prevalent in third-party libraries like STL (for example, searching for the n most- relevant elements in a container, instead of just the single most-rele- vant). 229 Elimination of external dependencies.• Since you built the soft ware your- self, you are not beholden to any other company or team to maintain it. If problems arise, they can be debugged and fi xed immediately, rather than waiting until the next release of the library (which might not be until aft er you have shipped your game!) We cannot cover all possible data structures here, but let’s look at a few common ways in which game engine programmers tend to tackle contain- ers. 5.3.4.1. To Build or Not to Build We will not discuss the details of how to implement all of these data types and algorithms here—a plethora of books and online resources are available for that purpose. However, we will concern ourselves with the question of where to obtain implementations of the types and algorithms that you need. As game engine designers, we have a number of choices: Build the needed data structures manually.1. Rely on third-party implementations. Some common choices include2. the C++ standard template library (STL),a. a variant of STL, such as STLport,b. the powerful and robust Boost libraries (htt p://www.boost.org).c. Both STL and Boost are att ractive, because they provide a rich and power- ful set of container classes covering prett y much every type of data structure imaginable. In addition, both of these packages provide a powerful suite of template-based generic algorithms—implementations of common algorithms, such as fi nding an element in a container, which can be applied to virtually any type of data object. However, third-party packages like these may not be appropriate for some kinds of game engines. And even if we decide to use a third-party package, we must select between Boost and the various fl avors of STL, or another third-party library. So let’s take a moment to investigate some of the pros and cons of each approach. STL The benefi ts of the standard template library include: STL off ers a rich set of features.• Reasonably robust implementations are available on a wide variety of • platforms. STL comes “standard” with virtually all C++ compilers.• 5.3. Containers 230 5. Engine Support Systems However, the STL also has numerous drawbacks, including: STL has a steep learning curve. The documentation is now quite good, • but the header fi les are cryptic and diffi cult to understand on most plat- forms. STL is oft en slower than a data structure that has been craft ed specifi -• cally for a particular problem. STL also almost always eats up more memory than a custom-designed • data structure. STL does a lot of dynamic memory allocation, and it’s sometimes chal-• lenging to control its appetite for memory in a way that is suitable for high-performance, memory-limited console games. STL’s implementation and behavior varies slightly from compiler to • compiler, making its use in multiplatform engines more diffi cult. As long as the programmer is aware of the pitfalls of STL and uses it ju- diciously, it can have a place in game engine programming. It is best suited to a game engine that will run on a personal computer platform, because the advanced virtual memory systems on modern PCs make memory allocation cheaper, and the probability of running out of physical RAM is oft en negli- gible. 
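If you do use STL, one simple way to keep its dynamic allocation under control is to reserve a container's storage up front rather than letting it grow incrementally. The following sketch is illustrative only; the function and variable names are hypothetical and not taken from any particular engine.

#include <vector>

void buildTriangleIndices(std::vector<int>& indices, int numIndices)
{
    // Reserving the final capacity up front means the vector performs
    // a single allocation, instead of repeatedly reallocating and
    // copying its internal buffer as it grows.
    indices.reserve(numIndices);

    for (int i = 0; i < numIndices; ++i)
    {
        indices.push_back(i); // no reallocation occurs inside this loop
    }
}

Techniques like this do not eliminate STL's memory overhead, but they do make its allocation behavior far more predictable.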
On the other hand, STL is not generally well-suited for use on memory- limited consoles that lack advanced CPUs and virtual memory. And code that uses STL may not port easily to other platforms. Here are some rules of thumb that I use: First and foremost, be aware of the performance and memory character-• istics of the particular STL class you are using. Try to avoid heavier-weight STL classes in code that you believe will be • a performance bott leneck. Prefer STL in situations where memory is not at a premium. For ex-• ample, embedding a std::list inside a game object is OK, but em- bedding a std::list inside every vertex of a 3D mesh is probably not a good idea. Adding every vertex of your 3D mesh to a std::list is probably also not OK—the std::list class dynamically allocates a small “link” object for every element inserted into it, and that can result in a lot of tiny, fragmented memory allocations. If your engine is to be multiplatform, I highly recommend • STLport (htt p://www.stlport.org), an implementation of STL that was specifi cally designed to be portable across a wide range of compilers and target platforms, more effi cient, and more feature-rich than the original STL implementations. 231 The Medal of Honor: Pacifi c Assault engine for the PC made heavy use of STL, and while MOHPA did have its share of frame rate problems, the team was able to work around the performance problems caused by STL (primarily by carefully limiting and controlling its use). Ogre3D, the popular object-ori- ented rendering library that we use for some of the examples in this book, also makes heavy use of STL. Your mileage may vary. Using STL on a game engine project is certainly feasible, but it must be used with utmost care. Boost The Boost project was started by members of the C++ Standards Committ ee Library Working Group, but it is now an open-source project with many con- tributors from across the globe. The aim of the project is to produce libraries that extend and work together with STL, for both commercial and non-com- mercial use. Many of the Boost libraries have already been included in the C++ Standards Committ ee’s Library Technical Report (TR1), which is a step toward becoming part of a future C++ standard. Here is a brief summary of what Boost brings to the table: Boost provides a lot of useful facilities not available in STL.• In some cases, Boost provides alternatives to work around certain prob-• lems with STL’s design or implementation. Boost does a great job of handling some very complex problems, like • smart pointers. (Bear in mind that smart pointers are complex beasts, and they can be performance hogs. Handles are usually preferable; see Section 14.5 for details.) The Boost libraries’ documentation is usually excellent. Not only does • the documentation explain what each library does and how to use it, but in most cases it also provides an excellent in-depth discussion of the de- sign decisions, constraints, and requirements that went into construct- ing the library. As such, reading the Boost documentation is a great way to learn about the principles of soft ware design. If you are already using STL, then Boost can serve as an excellent exten- sion and/or alterative to many of STL’s features. However, be aware of the following caveats: Most of the core Boost classes are templates, so all that one needs in • order to use them is the appropriate set of header fi les. 
However, some of the Boost libraries build into rather large .lib fi les and may not be feasible for use in very small-scale game projects. While the world-wide Boost community is an excellent support net-• work, the Boost libraries come with no guarantees. If you encounter a 5.3. Containers 232 5. Engine Support Systems bug, it will ultimately be your team’s responsibility to work around it or fi x it. Backward compatibility may not be supported.• The Boost libraries are distributed under the Boost Soft ware License. • Read the license information (htt p://www.boost.org/more/license_info. html) carefully to be sure it is right for your engine. Loki There is a rather esoteric branch of C++ programming known as template meta- programming. The core idea is to use the compiler to do a lot of the work that would otherwise have to be done at runtime by exploiting the template fea- ture of C++ and in eff ect “tricking” the compiler into doing things it wasn’t originally designed to do. This can lead to some startlingly powerful and use- ful programming tools. By far the most well-known and probably most powerful template meta- programming library for C++ is Loki, a library designed and writt en by Andrei Alexandrescu (whose home page is at htt p://www.erdani.org). The library can be obtained from SourceForge at htt p://loki-lib.sourceforge.net. Loki is extremely powerful; it is a fascinating body of code to study and learn from. However, its two big weaknesses are of a practical nature: (a) its code can be daunting to read and use, much less truly understand, and (b) some of its components are dependent upon exploiting “side-eff ect” behav- iors of the compiler that require careful customization in order to be made to work on new compilers. So Loki can be somewhat tough to use, and it is not as portable as some of its “less-extreme” counterparts. Loki is not for the faint of heart. That said, some of Loki’s concepts such as policy-based pro- gramming can be applied to any C++ project, even if you don’t use the Loki library per se. I highly recommend that all soft ware engineers read Andrei’s ground-breaking book, Modern C++ Design [2], from which the Loki library was born. 5.3.4.2. Dynamic Arrays and Chunky Allocation Fixed-size C-style arrays are used quite a lot in game programming, because they require no memory allocation, are contiguous and hence cache-friendly, and support many common operations such as appending data and searching very effi ciently. When the size of an array cannot be determined a priori, programmers tend to turn either to linked lists or dynamic arrays. If we wish to maintain the performance and memory characteristics of fi xed-length arrays, then the dy- namic array is oft en the data structure of choice. 233 The easiest way to implement a dynamic array is to allocate an n-element buff er initially and then grow the list only if an att empt is made to add more than n elements to it. This gives us the favorable characteristics of a fi xed- size array but with no upper bound. Growing is implemented by allocating a new larger buff er, copying the data from the original buff er into the new buff er, and then freeing the original buff er. The size of the buff er is increased in some orderly manner, such as adding n to it on each grow, or doubling it on each grow. Most of the implementations I’ve encountered never shrink the array, only grow it (with the notable exception of clearing the array to zero size, which might or might not free the buff er). 
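To make the idea concrete, here is a minimal sketch of such a grow-only dynamic array of integers. It is illustrative only; a real implementation would handle arbitrary element types, alignment, and custom allocators, and the class and member names are hypothetical.

#include <cstddef> // for NULL

class IntDynamicArray
{
public:
    IntDynamicArray() : m_pData(NULL), m_size(0U), m_capacity(0U) { }
    ~IntDynamicArray() { delete[] m_pData; }

    void pushBack(int value)
    {
        if (m_size == m_capacity)
        {
            // Grow by doubling (or pick some small initial capacity).
            unsigned newCapacity = (m_capacity == 0U) ? 8U : m_capacity * 2U;
            int* pNewData = new int[newCapacity];

            // Copy the old contents into the new, larger buffer,
            // then free the old buffer.
            for (unsigned i = 0U; i < m_size; ++i)
                pNewData[i] = m_pData[i];
            delete[] m_pData;

            m_pData = pNewData;
            m_capacity = newCapacity; // the capacity never decreases
        }

        m_pData[m_size++] = value;
    }

    unsigned size() const { return m_size; }

private:
    // Copying is disallowed in this simple sketch.
    IntDynamicArray(const IntDynamicArray&);
    IntDynamicArray& operator=(const IntDynamicArray&);

    int*     m_pData;
    unsigned m_size;
    unsigned m_capacity;
};

Notice that the capacity only ever increases; this class never returns memory to the heap until the array is destroyed.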
Hence the size of the array becomes a sort of "high water mark." The STL std::vector class works in this manner. Of course, if you can establish a high water mark for your data, then you're probably better off just allocating a single buffer of that size when the engine starts up. Growing a dynamic array can be incredibly costly due to reallocation and data copying costs. The impact of these things depends on the sizes of the buffers involved. Growing can also lead to fragmentation when discarded buffers are freed. So, as with all data structures that allocate memory, caution must be exercised when working with dynamic arrays. Dynamic arrays are probably best used during development, when you are as yet unsure of the buffer sizes you'll require. They can always be converted into fixed-size arrays once suitable memory budgets have been established.

5.3.4.3. Linked Lists

If contiguous memory is not a primary concern, but the ability to insert and remove elements at random is paramount, then a linked list is usually the data structure of choice. Linked lists are quite easy to implement, but they're also quite easy to get wrong if you're not careful. This section provides a few tips and tricks for creating robust linked lists.

The Basics of Linked Lists

A linked list is a very simple data structure. Each element in the list has a pointer to the next element, and, in a doubly-linked list, it also has a pointer to the previous element. These two pointers are referred to as links. The list as a whole is tracked using a special pair of pointers called the head and tail pointers. The head pointer points to the first element, while the tail pointer points to the last element.

Inserting a new element into a doubly-linked list involves adjusting the next pointer of the previous element and the previous pointer of the next element to both point at the new element and then setting the new element's next and previous pointers appropriately as well. There are four cases to handle when adding a node to a linked list:

• adding the first element to a previously-empty list;
• prepending an element before the current head element;
• appending an element after the current tail element;
• inserting an interior element.

These cases are illustrated in Figure 5.8.

Figure 5.8. The four cases that must be handled when adding an element to a linked list: add first, prepend (push front), append (push back), and insert.

Removing an element involves the same kinds of operations in and around the node being removed. Again there are four cases: removing the head element, removing the tail element, removing an interior element, and removing the last element (emptying the list).

The Link Data Structure

Linked list code isn't particularly tough to write, but it can be error-prone. As such, it's usually a good idea to write a general-purpose linked list facility that can be used to manage lists of any element type. To do this, we need to separate the data structure that contains the links (i.e., the next and previous pointers) from the element data structure. The link data structure is typically a simple struct or class, often called something like Link, Node, or LinkNode, and templated on the type of element to which it refers. It will usually look something like this.
235 template< typename ELEMENT > struct Link { Link* m_pPrev; Link* m_pNext; ELEMENT* m_pElem; }; Extrusive Lists An extrusive list is a linked list in which the Link data structures are entirely separate from the element data structures. Each Link contains a pointer to the element, as shown in the example. Whenever an element is to be inserted into a linked list, a link is allocated for it, and the pointers to the element and the next and previous links are set up appropriately. When an element is removed from a linked list, its link can be freed. The benefi t of the extrusive design is that an element can reside in mul- tiple linked lists simultaneously—all we need is one link per list. The down side is that the Link objects must be dynamically allocated. Oft en a pool al- locator (see Section 5.2.1.2) is used to allocate links, because they are always exactly the same size (viz., 12 bytes on a machine with 32-bit pointers). A pool allocator is an excellent choice due to its speed and its freedom from fragmen- tation problems. Intrusive Lists An intrusive list is a linked list in which the Link data structure is embedded in the target element itself. The big benefi t of this approach is that we no lon- ger need to dynamically allocate the links—we get a link “for free” whenever we allocate an element. For example, we might have: class SomeElement { Link m_link; // other members... }; We can also derive our element class from class Link. Using inheri- tance like this is virtually identical to embedding a Link as the fi rst member of the class, but it has the additional benefi t of allowing a pointer to a link (Link*) to be down-cast into a pointer to the element itself (SomeElement*). This means we can eliminate the back-pointer to the ele- ment that would otherwise have to be embedded within the Link. Here’s how such a design might be implemented in C++. 5.3. Containers 236 5. Engine Support Systems template< typename ELEMENT > struct Link { Link* m_pPrev; Link* m_pNext; // No ELEMENT* pointer required, thanks to // inheritance. }; class SomeElement : public Link { // other members... }; The big pitfall of the intrusive linked list design is that it prevents an ele- ment from residing in more than one linked list at a time (because each ele- ment has one and only one link). We can allow an element to be a member of N concurrent lists by providing it with N embedded link instances (in which case we cannot use the inheritance method). However, the number N must be fi xed a priori, so this approach is still not quite as fl exible as the extrusive design. The choice between intrusive and extrusive linked lists depends on the application and the constraints under which you are operating. If dynamic memory allocation must be avoided at all costs, then an intrusive list is prob- ably best. If you can aff ord the overhead of pool allocation, then an extrusive design may be preferable. Sometimes only one of the two approaches will be feasible. For example, if we wish to store instances of a class defi ned by a third-party library in a linked list and are unable or unwilling to modify that library’s source code, then an extrusive list is the only option. Head and Tail Pointers: Circular Lists To fully implement a linked list, we need to provide a head and a tail pointer. The simplest approach is to embed these pointers in their own data structure, perhaps called LinkedList, as follows. 
template< typename ELEMENT > class LinkedList { Link* m_pTail; Link* m_pHead; // member functions for manipulating the list... }; You may have noticed that there isn’t much diff erence between a LinkedList and a Link—they both contain a pair of pointers to Link. As it 237 turns out, there are some distinct benefi ts to using an instance of class Link to manage the head and tail of the list, like this: template< typename ELEMENT > class LinkedList { Link m_root; // contains head and tail // member functions for manipulating the list... }; The embedded m_root member is a Link, no diff erent from any other Link in the list (except that its m_pElement member will always be NULL). This allows us to make the linked list circular as shown in Figure 5.9. In other words, the m_pNext pointer of the last “real” node in the list points to m_root, as does the m_pPrev pointer of the fi rst “real” node in the list. This design is preferable to the one involving two “loose” pointers for the head and tail, because it simplifi es the logic for inserting and removing ele- ments. To see why this is the case, consider the code that would be required to remove an element from a linked list when “loose” head and tail pointers are being used. void LinkedList::remove(Link& link) { if (link.m_pNext) link.m_pNext->m_pPrev = link.m_pPrev; else // Removing last element in the list. m_pTail = link.m_pPrev; if (link.m_pPrev) link.m_pPrev->m_pNext = link.m_pNext; else // Removing first element in the list. m_pHead = link.m_pNext; 5.3. Containers Head Tail m_root Figure 5.9. When the head and tail pointers are stored in a link, the linked list can be made circular, which simplifi es the implementation and has some additional benefi ts. 238 5. Engine Support Systems link.m_pPrev = link.m_pNext = NULL; } The code is a bit simpler when we use the m_root design: void LinkedList::remove(Link& link) { // The link must currently be a member of the list. ASSERT(link.m_pNext != NULL); ASSERT(link.m_pPrev != NULL); link.m_pNext->m_pPrev = link.m_pPrev; link.m_pPrev->m_pNext = link.m_pNext; // Do this to indicate the link is no longer in any // list. link.m_pPrev = link.m_pNext = NULL; } The example code shown above highlights an additional benefi t of the circularly linked list approach: A link’s m_pPrev and m_pNext pointers are never null, unless the link is not a member of any list (i.e., the link is unused/ inactive). This gives us a simple test for list membership. Contrast this with the “loose” head/tail pointer design. In that case, the m_pPrev pointer of the fi rst element in the list is always null, as is the m_pN- ext pointer of the last element. And if there is only one element in the list, that link’s next and previous pointers will both be null. This makes it impossible to know whether or not a given Link is a member of a list or not. Singly-Linked Lists A singly-linked list is one in which the elements have a next pointer, but no pre- vious pointer. (The list as a whole might have both a head and a tail pointer, or it might have only a head pointer.) Such a design is obviously a memory saver, but the cost of this approach becomes evident when inserting or removing an element from the list. We have no m_pPrev pointer, so we need to traverse the list from the head in order to fi nd the previous element, so that its m_pNext pointer can be updated appropriately. Therefore, removal is an O(1) operation for a doubly-linked list, but it’s an O(n) operation for a singly-linked list. 
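The following sketch shows where that O(n) cost comes from. The type and function names here are hypothetical (they are not part of the Link/LinkedList design shown above); the point is simply that, without an m_pPrev pointer, removal must walk the list from the head to find the predecessor of the node being removed.

#include <cstddef> // for NULL

template< typename ELEMENT >
struct SLink
{
    SLink*   m_pNext;
    ELEMENT* m_pElem;
};

template< typename ELEMENT >
void removeFromSinglyLinkedList(SLink<ELEMENT>*& pHead,
                                SLink<ELEMENT>*  pLink)
{
    if (pHead == pLink)
    {
        // Removing the head is the one O(1) case.
        pHead = pLink->m_pNext;
        pLink->m_pNext = NULL;
        return;
    }

    // Otherwise we must scan from the head to find the node whose
    // m_pNext points at pLink -- an O(n) traversal.
    for (SLink<ELEMENT>* p = pHead; p != NULL; p = p->m_pNext)
    {
        if (p->m_pNext == pLink)
        {
            p->m_pNext = pLink->m_pNext; // unlink the node
            pLink->m_pNext = NULL;
            return;
        }
    }
}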
This inherent insertion and removal cost is oft en prohibitive, so most linked lists are doubly linked. However, if you know for certain that you will only ever add and remove elements from the head of the list (as when imple- menting a stack), or if you always add to the head and remove from the tail (as with a queue—and your list has both a head and a tail pointer), then you can get away with a singly-linked list and save yourself some memory. 239 5.3.4.4. Dictionaries and Hash Tables A dictionary is a table of key-value pairs . A value in the dictionary can be looked up quickly, given its key. The keys and values can be of any data type. This kind of data structure is usually implemented either as a binary search tree or as a hash table. In a binary tree implementation, the key-value pairs are stored in the nodes of the binary tree, and the tree is maintained in key-sorted order. Look- ing up a value by key involves performing an O(log n) binary search. In a hash table implementation, the values are stored in a fi xed-size table, where each slot in the table represents one or more keys. To insert a key-value pair into a hash table, the key is fi rst converted into integer form via a pro- cess known as hashing (if it is not already an integer). Then an index into the hash table is calculated by taking the hashed key modulo the size of the table. Finally, the key-value pair is stored in the slot corresponding to that index. Recall that the modulo operator (% in C/C++) fi nds the remainder of dividing the integer key by the table size. So if the hash table has fi ve slots, then a key of 3 would be stored at index 3 (3 % 5 == 3), while a key of 6 would be stored at index 1 (6 % 5 == 1). Finding a key-value pair is an O(1) operation in the absence of collisions. Collisions: Open and Closed Hash Tables Sometimes two or more keys end up occupying the same slot in the hash table. This is known as a collision. There are two basic ways to resolve a collision, giv- ing rise to two diff erent kinds of hash tables: Open• . In an open hash table (see Figure 5.10), collisions are resolved by simply storing more than one key-value pair at each index, usually in the form of a linked list. This approach is easy to implement and imposes no upper bound on the number of key-value pairs that can be stored. However, it does require memory to be allocated dynamically whenever a new key-value pair is added to the table. Closed• . In a closed hash table (see Figure 5.11), collisions are resolved via a process of probing until a vacant slot is found. (“Probing” means apply- ing a well-defi ned algorithm to search for a free slot.) This approach is a bit more diffi cult to implement, and it imposes an upper limit on the number of key-value pairs that can reside in the table (because each slot can hold only one key-value pair). But the main benefi t of this kind of hash table is that it uses up a fi xed amount of memory and requires no dy- namic memory allocation. Therefore it is oft en a good choice in a console engine. 5.3. Containers 240 5. Engine Support Systems Hashing Hashing is the process of turning a key of some arbitrary data type into an integer, which can be used modulo the table size as an index into the table. 
Mathematically, given a key k, we want to generate an integer hash value h using the hash function H, and then find the index i into the table as follows:

h = H(k),
i = h mod N,

where N is the number of slots in the table, and the symbol mod represents the modulo operation, i.e., finding the remainder of the quotient h/N.

If the keys are unique integers, the hash function can be the identity function, H(k) = k. If the keys are unique 32-bit floating-point numbers, a hash function might simply re-interpret the bit pattern of the 32-bit float as if it were a 32-bit integer.

U32 hashFloat(float f)
{
    union
    {
        float asFloat;
        U32   asU32;
    } u;

    u.asFloat = f;
    return u.asU32;
}

Figure 5.10. An open hash table.

Figure 5.11. A closed hash table.

If the key is a string, we can employ a string hashing function, which combines the ASCII or UTF codes of all the characters in the string into a single 32-bit integer value.

The quality of the hashing function H(k) is crucial to the efficiency of the hash table. A "good" hashing function is one that distributes the set of all valid keys evenly across the table, thereby minimizing the likelihood of collisions. A hash function must also be reasonably quick to calculate and deterministic in the sense that it must produce the exact same output every time it is called with an identical input.

Strings are probably the most prevalent type of key you'll encounter, so it's particularly helpful to know a "good" string hashing function. Here are a few reasonably good ones:

• LOOKUP3 by Bob Jenkins (http://burtleburtle.net/bob/c/lookup3.c).
• Cyclic redundancy check functions, such as CRC-32 (http://en.wikipedia.org/wiki/Cyclic_redundancy_check).
• Message-digest algorithm 5 (MD5), a cryptographic hash which yields excellent results but is quite expensive to calculate (http://en.wikipedia.org/wiki/MD5).
• A number of other excellent alternatives can be found in an article by Paul Hsieh available at http://www.azillionmonkeys.com/qed/hash.html.

Implementing a Closed Hash Table

In a closed hash table, the key-value pairs are stored directly in the table, rather than in a linked list at each table entry. This approach allows the programmer to define a priori the exact amount of memory that will be used by the hash table. A problem arises when we encounter a collision—two keys that end up wanting to be stored in the same slot in the table. To address this, we use a process known as probing.

The simplest approach is linear probing. Imagining that our hashing function has yielded a table index of i, but that slot is already occupied, we simply try slots (i + 1), (i + 2), and so on until an empty slot is found (wrapping around to the start of the table when the probe index reaches N). Another variation on linear probing is to alternate searching forwards and backwards, (i + 1), (i – 1), (i + 2), (i – 2), and so on, making sure to modulo the resulting indices into the valid range of the table.

Linear probing tends to cause key-value pairs to "clump up." To avoid these clusters, we can use an algorithm known as quadratic probing. We start at the occupied table index i and use the sequence of probes (i ± j²) for j = 1, 2, 3, ….
In other words, we try (i + 1²), (i – 1²), (i + 2²), (i – 2²), and so on, remembering to always modulo the resulting index into the valid range of the table.

When using closed hashing, it is a good idea to make your table size a prime number. Using a prime table size in conjunction with quadratic probing tends to yield the best coverage of the available table slots with minimal clustering. See http://www.cs.utk.edu/~eijkhout/594-LaTeX/handouts/hashing-slides.pdf for a good discussion of why prime hash table sizes are preferable.

5.4. Strings

Strings are ubiquitous in almost every software project, and game engines are no exception. On the surface, the string may seem like a simple, fundamental data type. But when you start using strings in your projects, you will quickly discover a wide range of design issues and constraints, all of which must be carefully accounted for.

5.4.1. The Problem with Strings

The most fundamental question is how strings should be stored and managed in your program. In C and C++, strings aren't even an atomic type—they are implemented as arrays of characters. The variable length of strings means we either have to hard-code limitations on the sizes of our strings, or we need to dynamically allocate our string buffers. C++ programmers often prefer to use a string class, rather than deal directly with character arrays. But then, which string class should we use? STL provides a reasonably good string class, but if you've decided not to use STL you might be stuck writing your own.

Another big string-related problem is that of localization—the process of adapting your software for release in other languages. This is also known as internationalization, or I18N for short. Any string that you display to the user in English must be translated into whatever languages you plan to support. (Strings that are used internally to the program but are never displayed to the user are exempt from localization, of course.) This not only involves making sure that you can represent all the character glyphs of all the languages you plan to support (via an appropriate set of fonts), but it also means ensuring that your game can handle different text orientations. For example, Chinese text is oriented vertically instead of horizontally, and some languages like Hebrew read right-to-left. Your game also needs to gracefully deal with the possibility that a translated string will be either much longer, or much shorter, than its English counterpart.

Finally, it's important to realize that strings are used internally within a game engine for things like resource file names and object ids. For example, when a game designer lays out a level, it's highly convenient to permit him or her to identify the objects in the level using meaningful names, like "PlayerCamera," "enemy-tank-01," or "explosionTrigger."

How our engine deals with these internal strings often has pervasive ramifications on the performance of the game. This is because strings are inherently expensive to work with at runtime. Comparing or copying ints or floats can be accomplished via simple machine language instructions. On the other hand, comparing strings requires an O(n) scan of the character arrays using a function like strcmp() (where n is the length of the string). Copying a string requires an O(n) memory copy, not to mention the possibility of having to dynamically allocate the memory for the copy.
During one project I worked on, we profi led our game’s performance only to discover that strcmp() and strcpy() were the top two most expensive functions! By eliminating unnec- essary string operations and using some of the techniques outlined in this section, we were able to all but eliminate these functions from our profi le, and increase the game’s frame rate signifi cantly. (I’ve heard similar stories from developers at a number of diff erent studios.) 5.4.2. String Classes String classes can make working with strings much more convenient for the programmer. However, a string class can have hidden costs that are diffi cult to see until the game is profi led. For example, passing a string to a function using a C-style character array is fast because the address of the fi rst character is typically passed in a hardware register. On the other hand, passing a string object might incur the overhead of one or more copy constructors, if the func- tion is not declared or used properly. Copying strings might involve dynamic memory allocation, causing what looks like an innocuous function call to end up costing literally thousands of machine cycles. For this reason, in game programming I generally like to avoid string classes. However, if you feel a strong urge to use a string class, make sure you pick or implement one that has acceptable runtime performance character- istics—and be sure all programmers that use it are aware of its costs. Know your string class: Does it treat all string buff ers as read-only? Does it utilize the copy on write optimization? (See htt p://en.wikipedia.org/wiki/Copy-on- 5.4. Strings 244 5. Engine Support Systems write.) As a rule of thumb, always pass string objects by reference, never by value (as the latt er oft en incurs string-copying costs). Profi le your code early and oft en to ensure that your string class isn’t becoming a major source of lost frame rate! One situation in which a specialized string class does seem justifi able to me is when storing and managing fi le system paths . Here, a hypothetical Path class could add signifi cant value over a raw C-style character array. For example, it might provide functions for extracting the fi lename, fi le exten- sion or directory from the path. It might hide operating system diff erences by automatically converting Windows-style backslashes to UNIX-style forward slashes or some other operating system’s path separator. Writing a Path class that provides this kind of functionality in a cross-platform way could be high- ly valuable within a game engine context. (See Section 6.1.1.4 for more details on this topic.) 5.4.3. Unique Identifi ers The objects in any virtual game world need to be uniquely identifi ed in some way. For example, in Pac Man we might encounter game objects named “pac_ man,” “blinky,” “pinky,” “inky,” and “clyde.” Unique object identifi ers allow game designers to keep track of the myriad objects that make up their game worlds and also permit those objects to be found and operated on at runtime by the engine. In addition, the assets from which our game objects are con- structed—meshes, materials, textures, audio clips, animations, and so on—all need unique identifi ers as well. Strings seem like a natural choice for such identifi ers. Assets are oft en stored in individual fi les on disk, so they can usually be identifi ed uniquely by their fi le paths, which of course are strings. 
And game objects are created by game designers, so it is natural for them to assign their objects understandable string names, rather than have to remember integer object indices, or 64- or 128-bit globally unique identifi ers (GUIDs). However, the speed with which comparisons between unique identifi ers can be made is of paramount impor- tance in a game, so strcmp() simply doesn’t cut it. We need a way to have our cake and eat it too—a way to get all the descriptiveness and fl exibility of a string, but with the speed of an integer. 5.4.3.1. Hashed String Ids One good solution is to hash our strings. As we’ve seen, a hash function maps a string onto a semi-unique integer. String hash codes can be compared just like any other integers, so comparisons are fast. If we store the actual strings in a hash table, then the original string can always be recovered from the hash 245 code. This is useful for debugging purposes and to permit hashed strings to be displayed on-screen or in log fi les. Game programmers sometimes use the term string id to refer to such a hashed string. The Unreal engine uses the term name instead (implemented by class FName). As with any hashing system, collisions are a possibility (i.e., two diff erent strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all rea- sonable input strings we might use in our game. Aft er all, a 32-bit hash code represents more than four billion possible values. So if our hash function does a good job of distributing strings evenly throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 al- gorithm to hash our strings, and we didn’t encounter a single collision in over two years of development on Uncharted: Drake’s Fortune. 5.4.3.2. Some Implementation Ideas Conceptually, it’s easy enough to run a hash function on your strings in order to generate string ids. Practically speaking, however, it’s important to con- sider when the hash will be calculated. Most game engines that use string ids do the hashing at runtime. At Naughty Dog, we permit runtime hash- ing of strings, but we also preprocess our source code using a simple utility that searches for macros of the form SID(any-string) and translates each one directly into the appropriate hashed integer value. This permits string ids to be used anywhere that an integer manifest constant can be used, including the constant case labels of a switch statement. (The result of a function call that generates a string id at runtime is not a constant, so it cannot be used as a case label.) The process of generating a string id from a string is sometimes called interning the string, because in addition to hashing it, the string is typi- cally also added to a global string table. This allows the original string to be recovered from the hash code later. You may also want your tools to be capable of hashing strings into string ids. That way, when the tool generates data for consumption by your engine, the strings will already have been hashed. The main problem with interning a string is that it is a slow operation. The hashing function must be run on the string, which can be an expensive proposition, especially when a large number of strings are being interned. In addition, memory must be allocated for the string, and it must be copied into the lookup table. 
As a result (if you are not generating string ids at compile-time), it is usually best to intern each string only once and save off the result for later use. For example, it would be preferable to write code like the first of the two snippets below, because the latter implementation causes the strings to be unnecessarily re-interned every time the function f() is called.

static StringId sid_foo = internString("foo");
static StringId sid_bar = internString("bar");

// ...

void f(StringId id)
{
    if (id == sid_foo)
    {
        // handle case of id == "foo"
    }
    else if (id == sid_bar)
    {
        // handle case of id == "bar"
    }
}

This approach is much less efficient:

void f(StringId id)
{
    if (id == internString("foo"))
    {
        // handle case of id == "foo"
    }
    else if (id == internString("bar"))
    {
        // handle case of id == "bar"
    }
}

Here's one possible implementation of internString().

stringid.h

typedef U32 StringId;

extern StringId internString(const char* str);

stringid.cpp

static HashTable<StringId, const char*> gStringIdTable;

StringId internString(const char* str)
{
    StringId sid = hashCrc32(str);

    HashTable<StringId, const char*>::iterator it
        = gStringIdTable.find(sid);

    if (it == gStringIdTable.end())
    {
        // This string has not yet been added to the
        // table. Add it, being sure to copy it in case
        // the original was dynamically allocated and
        // might later be freed.
        gStringIdTable[sid] = strdup(str);
    }

    return sid;
}

Another idea employed by the Unreal Engine is to wrap the string id and a pointer to the corresponding C-style character array in a tiny class. In the Unreal Engine, this class is called FName.

Using Debug Memory for Strings

When using string ids, the strings themselves are only kept around for human consumption. When you ship your game, you almost certainly won't need the strings—the game itself should only ever use the ids. As such, it's a good idea to store your string table in a region of memory that won't exist in the retail game. For example, a PS3 development kit has 256 MB of retail memory, plus an additional 256 MB of "debug" memory that is not available on a retail unit. If we store our strings in debug memory, we needn't worry about their impact on the memory footprint of the final shipping game. (We just need to be careful never to write production code that depends on the strings being available!)

5.4.4. Localization

Localization of a game (or any software project) is a big undertaking. It is a task which is best handled by planning for it from day one and accounting for it at every step of development. However, this is not done as often as we all would like. Here are some tips that should help you plan your game engine project for localization. For an in-depth treatment of software localization, see [29].

5.4.4.1. Unicode

The problem for most English-speaking software developers is that they are trained from birth (or thereabouts!) to think of strings as arrays of 8-bit ASCII character codes (i.e., characters following the ANSI standard). ANSI strings work great for a language with a simple alphabet, like English. But they just don't cut it for languages with complex alphabets containing a great many more characters, sometimes totally different glyphs than English's 26 letters. To address the limitations of the ANSI standard, the Unicode character set system was devised.

Please set down this book right now and read the article entitled, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
You can fi nd it here: htt p://www.joelonsoft ware.com/articles/Unicode.html. (Once you’ve done that, please pick up the book again!) As Joel describes in his article, Unicode is not a single standard but actu- ally a family of related standards. You will need to select the specifi c standard that best suits your needs. The two most common choices I’ve seen used in game engines are UTF-8 and UTF-16. UTF-8 In UTF-8, the character codes are 8 bits each, but certain characters occupy more than one byte. Hence the number of bytes occupied by a UTF-8 character string is not necessarily the length of the string in characters. This is known as a multibyte character set (MBCS), because each character may take one or more bytes of storage. One of the big benefi ts of the UTF-8 encoding is that it is backwards-com- patible with the ANSI encoding. This works because the fi rst character of a multibyte character sequence always has its most signifi cant bit set (i.e., lies between 128 and 255, inclusive). Since the standard ANSI character codes are all less than 128, a plain old ANSI string is a valid and unambiguous UTF-8 string as well. UTF-16 The UTF-16 standard employs a simpler, albeit more expensive, approach. Each character takes up exactly 16 bits (whether it needs all of those bits or not). As a result, dividing the number of bytes occupied by the string by two yields the number of characters. This is known as a wide character set (WCS), because each character is 16 bits wide instead of the 8 bits used by “regular” ANSI chars. 249 Unicode under Windows Under Microsoft Windows, the data type wchar_t is used to represent a single “wide ” UTF-16 character (WCS), while the char type is used both for ANSI strings and for multibyte UTF-16 strings (MBCS). What’s more, Windows per- mits you to write code that is character set independent. To accomplish this, a data type known as TCHAR is provided. The data type TCHAR is a typedef to char when building your application in ANSI mode and is a typedef to wchar_t when building your application in UTF-16 (WCS) mode. (For consis- tency, the type WCHAR is also provided as a synonym for wchar_t.) Throughout the Windows API, a prefi x or suffi x of “w,” “wcs,” or “W” indicates wide (UTF-16) characters; a prefi x or suffi x of “t,” “tcs,” or “T” indicates the current character type (which might be ANSI or might be UTF- 16, depending on how your application was built); and no prefi x or suf- fi x indicates plain old ANSI. STL uses a similar convention—for example, std::string is STL’s ANSI string class, while std::wstring is its wide character equivalent. Prett y much every standard C library function that deals with strings has equivalent WCS and MBCS versions under Windows. Unfortunately, the API calls don’t use the terms UTF-8 and UTF-16, and the names of the functions aren’t always 100% consistent. This all leads to some confusion among pro- grammers who aren’t in the know. (But you aren’t one of those programmers!) Table 5.1 lists some examples. Windows also provides functions for translating between ANSI character strings, multibyte UTF-8 strings, and wide UTF-16 strings. For example, wcs- tombs() converts a wide UTF-16 string into a multibyte UTF-8 string. Complete documentation for these functions can be found on Microsoft ’s MSDN web site. 
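As a quick illustration, a call to wcstombs() might look like the following sketch. The buffer size and function name are arbitrary, and note that the exact multibyte encoding produced depends on the locale that is currently in effect.

#include <cstdio>    // printf()
#include <cstdlib>   // wcstombs()
#include <clocale>   // setlocale()

void printWideString(const wchar_t* wideStr)
{
    char buffer[256];

    // The conversion performed by wcstombs() is governed by the
    // current C locale, so select the user's default locale first.
    std::setlocale(LC_ALL, "");

    // Convert the wide string into a multibyte string. The return
    // value is the number of bytes written (excluding the NUL
    // terminator), or (size_t)-1 if a character couldn't be converted.
    std::size_t n = std::wcstombs(buffer, wideStr, sizeof(buffer) - 1);

    if (n != static_cast<std::size_t>(-1))
    {
        buffer[n] = '\0';
        std::printf("%s\n", buffer);
    }
}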
Here’s a link to the documentation for strcmp() and its ilk, from which you can quite easily navigate to the other related string-manip- ulation functions using the tree view on the left -hand side of the page, or via the search bar: htt p://msdn2.microsoft .com/en-us/library/kk6xf663(VS.80). aspx. ANSI WCS MBCS strcmp() wcscmp() _mbscmp() strcpy() wcscpy() _mbscpy() strlen() wcslen() _mbstrlen() Table 5.1. Variants of some common standard C library string functions for use with ANSI, wide and multibyte character sets. 5.4. Strings 250 5. Engine Support Systems Unicode on Consoles The Xbox 360 soft ware development kit (XDK) uses WCS strings prett y much exclusively, for all strings—even for internal strings like fi le paths. This is cer- tainly one valid approach to the localization problem, and it makes for very consistent string handling throughout the XDK. However, the UTF-16 encod- ing is a bit wasteful on memory, so diff erent game engines may employ diff er- ent conventions. At Naughty Dog, we use 8-bit char strings throughout our engine, and we handle foreign languages via a UTF-8 encoding. The choice of encoding is not important, as long as you select one as early in the project as possible and stick with it consistently. 5.4.4.2. Other Localization Concerns Even once you have adapted your soft ware to use Unicode characters, there are still a host of other localization problems to contend with. For one thing, strings aren’t the only place where localization issues arise. Audio clips in- cluding recorded voices must be translated. Textures may have English words painted into them that require translation. Many symbols have diff erent mean- ings in diff erent cultures. Even something as innocuous as a no-smoking sign might be misinterpreted in another culture. In addition, some markets draw the boundaries between the various game-rating levels diff erently. For example, in Japan a Teen-rated game is not permitt ed to show blood of any kind, whereas in North America small red blood spatt ers are considered acceptable. For strings, there are other details to worry about as well. You will need to manage a database of all human-readable strings in your game, so that they can all be reliably translated. The soft ware must display the proper lan- guage given the user’s installation sett ings. The formatt ing of the strings may be totally diff erent in diff erent languages—for example, Chinese is writt en vertically, and Hebrew reads right-to-left . The lengths of the strings will vary greatly from language to language. You’ll also need to decide whether to ship a single DVD or Blu-ray disc that contains all languages or ship diff erent discs for particular territories. The most crucial components in your localization system will be the cen- tral database of human-readable strings and an in-game system for looking up those strings by id. For example, let’s say you want a heads-up display that lists the score of each player with “Player 1 Score:” and “Player 2 Score:” labels and that also displays the text “Player 1 Wins” or “Player 2 Wins” at the end of a round. These four strings would be stored in the localization database under unique ids that are understandable to you, the developer of the game. So our database might use the ids “p1score,” “p2score,” “p1wins,” and “p2wins,” respectively. Once our game’s strings have been translated into 251 French, our database would look something like the simple example shown in Table 5.2. 
Id        English             French
p1score   "Player 1 Score"    "Grade Joueur 1"
p2score   "Player 2 Score"    "Grade Joueur 2"
p1wins    "Player 1 wins!"    "Joueur un gagne!"
p2wins    "Player 2 wins!"    "Joueur deux gagne!"

Table 5.2. Example of a string database used for localization.

Additional columns can be added for each new language your game supports. The exact format of this database is up to you. It might be as simple as a Microsoft Excel worksheet that can be saved as a comma-separated values (CSV) file and parsed by the game engine or as complex as a full-fledged Oracle database. The specifics of the string database are largely unimportant to the game engine, as long as it can read in the string ids and the corresponding Unicode strings for whatever language(s) your game supports. (However, the specifics of the database may be very important from a practical point of view, depending upon the organizational structure of your game studio. A small studio with in-house translators can probably get away with an Excel spreadsheet located on a network drive. But a large studio with branch offices in Britain, Europe, South America, and Japan would probably find some kind of distributed database a great deal more amenable.)

At runtime, you'll need to provide a simple function that returns the Unicode string in the "current" language, given the unique id of that string. The function might be declared like this:

wchar_t* getLocalizedString(const char* id);

and it might be used like this:

void drawScoreHud(const Vector3& score1Pos, const Vector3& score2Pos)
{
    renderer.displayTextOrtho(getLocalizedString("p1score"), score1Pos);
    renderer.displayTextOrtho(getLocalizedString("p2score"), score2Pos);
    // ...
}

Of course, you'll need some way to set the "current" language globally. This might be done via a configuration setting which is fixed during the installation of the game. Or you might allow users to change the current language on the fly via an in-game menu. Either way, the setting is not difficult to implement; it can be as simple as a global integer variable specifying the index of the column in the string table from which to read (e.g., column one might be English, column two French, column three Spanish, and so on).

Once you have this infrastructure in place, your programmers must remember to never display a raw string to the user. They must always use the id of a string in the database and call the look-up function in order to retrieve the string in question.

5.5. Engine Configuration

Game engines are complex beasts, and they invariably end up having a large number of configurable options. Some of these options are exposed to the player via one or more options menus in-game. For example, a game might expose options related to graphics quality, the volume of music and sound effects, or controller configuration. Other options are created for the benefit of the game development team only and are either hidden or stripped out of the game completely before it ships. For example, the player character's maximum walk speed might be exposed as an option so that it can be fine-tuned during development, but it might be changed to a hard-coded value prior to ship.

5.5.1. Loading and Saving Options

A configurable option can be implemented trivially as a global variable or a member variable of a singleton class.
However, confi gurable options are not particularly useful unless their values can be confi gured, stored on a hard disk, memory card, or other storage medium and later retrieved by the game. There are a number of simple ways to load and save confi guration options: Text confi guration fi les.• By far the most common method of saving and loading confi guration options is by placing them into one or more text fi les. The format of these fi les varies widely from engine to engine, but it is usually very simple. For example, Windows INI fi les (which are used by the Ogre3D renderer) consist of fl at lists of key-value pairs grouped into logical sections. [SomeSection] Key1=Value1 Key2=Value2 253 [AnotherSection] Key3=Value3 Key4=Value4 Key5=Value5 The XML format is another common choice for confi gurable game op- tions fi les. Compressed binary fi les.• Most modern consoles have hard disk drives in them, but older consoles could not aff ord this luxury. As a result, all game consoles since the Super Nintendo Entertainment System (SNES) have come equipped with proprietary removable memory cards that permit both reading and writing of data. Game options are sometimes stored on these cards, along with saved games. Compressed binary fi les are the format of choice on a memory card, because the storage space available on these cards is oft en very limited. The Windows registry• . The Microsoft Windows operating system pro- vides a global options database known as the registry. It is stored as a tree, where the interior nodes (known as registry keys) act like fi le fold- ers, and the leaf nodes store the individual options as key-value pairs. Any application, game or otherwise, can reserve an entire subtree (i.e., a registry key) for its exclusive use, and then store any set of options with- in it. The Windows registry acts like a carefully-organized collection of INI fi les, and in fact it was introduced into Windows as a replacement for the ever-growing network of INI fi les used by both the operating system and Windows applications. Command line options• . The command line can be scanned for option set- tings. The engine might provide a mechanism for controlling any option in the game via the command line, or it might expose only a small sub- set of the game’s options here. Environment variables• . On personal computers running Windows, Linux, or MacOS, environment variables are sometimes used to store confi gu- ration options as well. Online user profi les.• With the advent of online gaming communities like Xbox Live , each user can create a profi le and use it to save achievements, purchased and unlockable game features, game options, and other in- formation. The data is stored on a central server and can be accessed by the player wherever an Internet connection is available. 5.5.2. Per-User Options Most game engines diff erentiate between global options and per-user options . This is necessary because most games allow each player to confi gure the game 5.5. Engine Confi guration 254 5. Engine Support Systems to his or her liking. It is also a useful concept during development of the game, because it allows each programmer, artist, and designer to customize his or her work environment without aff ecting other team members. Obviously care must be taken to store per-user options in such a way that each player “sees” only his or her options and not the options of other play- ers on the same computer or console. 
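On a PC, one common approach is to build each user's options file path from a per-user location provided by the operating system. The sketch below uses the APPDATA environment variable on Windows; the folder and file names are placeholders, and a real engine would also handle other platforms.

#include <cstdlib>  // getenv()
#include <string>

std::string getPerUserOptionsPath()
{
    // On Windows, APPDATA points at the current user's
    // Application Data folder.
    const char* pAppData = std::getenv("APPDATA");

    if (pAppData == NULL)
    {
        // Fall back to the working directory if the variable isn't set.
        return std::string("options.ini");
    }

    return std::string(pAppData) + "\\MyGameEngine\\options.ini";
}

Because the path incorporates the user's own profile folder, two players on the same machine automatically end up with separate options files.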
In a console game, the user is typically allowed to save his or her progress, along with per-user options such as con- troller preferences, in “slots” on a memory card or hard disk. These slots are usually implemented as fi les on the media in question. On a Windows machine, each user has a folder under C:\Documents and Sett ings containing information such as the user’s desktop, his or her My Doc- uments folder, his or her Internet browsing history and temporary fi les, and so on. A hidden subfolder named Application Data is used to store per-user information on a per-application basis; each application creates a folder un- der Application Data and can use it to store whatever per-user information it requires. Windows games sometimes store per-user confi guration data in the reg- istry. The registry is arranged as a tree, and one of the top-level children of the root node, called HKEY_CURRENT_USER, stores sett ings for whichever user happens to be logged on. Every user has his or her own subtree in the registry (stored under the top-level subtree HKEY_USERS), and HKEY_CURRENT_USER is really just an alias to the current user’s subtree. So games and other applica- tions can manage per-user confi guration options by simply reading and writ- ing them to keys under the HKEY_CURRENT_USER subtree. 5.5.3. Confi guration Management in Some Real Engines In this section, we’ll take a brief look at how some real game engines manage their confi guration options. 5.5.3.1. Example: Quake’s CVARs The Quake family of engines uses a confi guration management system known as console variables, or CVARs for short. A CVAR is just a fl oating-point or string global variable whose value can be inspected and modifi ed from within Quake’s in-game console. The values of some CVARs can be saved to disk and later reloaded by the engine. At runtime, CVARs are stored in a global linked list. Each CVAR is a dy- namically-allocated instance of struct cvar_t, which contains the variable’s name, its value as a string or fl oat, a set of fl ag bits, and a pointer to the next CVAR in the linked list of all CVARs. CVARs are accessed by calling Cvar_ Get(), which creates the variable if it doesn’t already exist and modifi ed by 255 calling Cvar_Set(). One of the bit fl ags, CVAR_ARCHIVE, controls whether or not the CVAR will be saved into a confi guration fi le called confi g.cfg. If this fl ag is set, the value of the CVAR will persist across multiple runs of the game. 5.5.3.2. Example: Ogre3D The Ogre3D rendering engine uses a collection of text fi les in Windows INI format for its confi guration options. By default, the options are stored in three fi les, each of which is located in the same folder as the executable program: plugins.cfg• contains options specifying which optional engine plug-ins are enabled and where to fi nd them on disk. resources.cfg• contains a search path specifying where game assets (a.k.a. media, a.k.a. resources) can be found. ogre.cfg• contains a rich set of options specifying which renderer (DirectX or OpenGL) to use and the preferred video mode, screen size, etc. Out of the box, Ogre provides no mechanism for storing per-user confi gu- ration options. However, the Ogre source code is freely available, so it would be quite easy to change it to search for its confi guration fi les in the user’s C:\ Documents and Sett ings folder instead of in the folder containing the execut- able. 
The Ogre::ConfigFile class makes it easy to write code that reads and writes brand new configuration files, as well.

5.5.3.3. Example: Uncharted: Drake's Fortune

Naughty Dog's Uncharted engine makes use of a number of configuration mechanisms.

In-Game Menu Settings

The Uncharted engine supports a powerful in-game menu system, allowing developers to control global configuration options and invoke commands. The data types of the configurable options must be relatively simple (primarily Boolean, integer, and floating-point variables), but this limitation did not prevent the developers of Uncharted from creating literally hundreds of useful menu-driven options.

Each configuration option is implemented as a global variable. When the menu item that controls an option is created, the address of the global variable is provided, and the menu item directly controls its value. As an example, the following function creates a submenu item containing some options for Uncharted's rail vehicles (the vehicles used in the "Out of the Frying Pan" jeep chase level). It defines menu items controlling three global variables: two Booleans and one floating-point value. The items are collected onto a menu, and a special item is returned that will bring up the menu when selected. Presumably the code calling this function would add this item to the parent menu that it is building.

    DMENU::ItemSubmenu* CreateRailVehicleMenu()
    {
        extern bool  g_railVehicleDebugDraw2D;
        extern bool  g_railVehicleDebugDrawCameraGoals;
        extern float g_railVehicleFlameProbability;

        DMENU::Menu* pMenu = new DMENU::Menu("RailVehicle");

        pMenu->PushBackItem(
            new DMENU::ItemBool("Draw 2D Spring Graphs",
                                DMENU::ToggleBool,
                                &g_railVehicleDebugDraw2D));

        pMenu->PushBackItem(
            new DMENU::ItemBool("Draw Goals (Untracked)",
                                DMENU::ToggleBool,
                                &g_railVehicleDebugDrawCameraGoals));

        DMENU::ItemFloat* pItemFloat;
        pItemFloat = new DMENU::ItemFloat("FlameProbability",
                                          DMENU::EditFloat, 5, "%5.2f",
                                          &g_railVehicleFlameProbability);
        pItemFloat->SetRangeAndStep(0.0f, 1.0f, 0.1f, 0.01f);
        pMenu->PushBackItem(pItemFloat);

        DMENU::ItemSubmenu* pSubmenuItem;
        pSubmenuItem = new DMENU::ItemSubmenu("RailVehicle...", pMenu);

        return pSubmenuItem;
    }

The value of any option can be saved by simply marking it with the circle button on the PS3 joypad when the corresponding menu item is selected. The menu settings are saved in an INI-style text file, allowing the saved global variables to retain their values across multiple runs of the game. The ability to control which options are saved on a per-menu-item basis is highly useful, because any option which is not saved will take on its programmer-specified default value. If a programmer changes a default, all users will "see" the new value, unless of course a user has saved a custom value for that particular option.

Command Line Arguments

The Uncharted engine scans the command line for a predefined set of special options. The name of the level to load can be specified, along with a number of other commonly used arguments.

Scheme Data Definitions

The vast majority of engine and game configuration information in Uncharted is specified using a Lisp-like language called Scheme. Using a proprietary data compiler, data structures defined in the Scheme language are transformed into binary files that can be loaded by the engine.
The data compiler also spits out header files containing C struct declarations for every data type defined in Scheme. These header files allow the engine to properly interpret the data contained in the loaded binary files. The binary files can even be recompiled and reloaded on the fly, allowing developers to alter the data in Scheme and see the effects of their changes immediately (as long as data members are not added or removed, as that would require a recompile of the engine).

The following example illustrates the creation of a data structure specifying the properties of an animation. It then exports three unique animations to the game. You may have never read Scheme code before, but for this relatively simple example it should be pretty self-explanatory. One oddity you'll notice is that hyphens are permitted within Scheme symbols, so simple-animation is a single symbol (unlike in C/C++, where simple-animation would be the subtraction of two variables, simple and animation).

    simple-animation.scm

    ;; Define a new data type called simple-animation.
    (deftype simple-animation ()
      (
        (name              string)
        (speed             float :default 1.0)
        (fade-in-seconds   float :default 0.25)
        (fade-out-seconds  float :default 0.25)
      )
    )

    ;; Now define three instances of this data structure...
    (define-export anim-walk
      (new simple-animation
        :name "walk"
        :speed 1.0
      )
    )

    (define-export anim-walk-fast
      (new simple-animation
        :name "walk"
        :speed 2.0
      )
    )

    (define-export anim-jump
      (new simple-animation
        :name "jump"
        :fade-in-seconds 0.1
        :fade-out-seconds 0.1
      )
    )

This Scheme code would generate the following C/C++ header file:

    simple-animation.h

    // WARNING: This file was automatically generated from
    // Scheme. Do not hand-edit.
    struct SimpleAnimation
    {
        const char* m_name;
        float       m_speed;
        float       m_fadeInSeconds;
        float       m_fadeOutSeconds;
    };

In-game, the data can be read by calling the LookupSymbol() function, which is templated on the data type returned, as follows:

    #include "simple-animation.h"

    void someFunction()
    {
        SimpleAnimation* pWalkAnim     = LookupSymbol("anim-walk");
        SimpleAnimation* pFastWalkAnim = LookupSymbol("anim-walk-fast");
        SimpleAnimation* pJumpAnim     = LookupSymbol("anim-jump");

        // use the data here...
    }

This system gives the programmers a great deal of flexibility in defining all sorts of configuration data—from simple Boolean, floating-point, and string options all the way to complex, nested, interconnected data structures. It is used to specify detailed animation trees, physics parameters, player mechanics, and so on.

6 Resources and the File System

Games are by nature multimedia experiences. A game engine therefore needs to be capable of loading and managing a wide variety of different kinds of media—texture bitmaps, 3D mesh data, animations, audio clips, collision and physics data, game world layouts, and the list goes on. Moreover, because memory is usually scarce, a game engine needs to ensure that only one copy of each media file is loaded into memory at any given time. For example, if five meshes share the same texture, then we would like to have only one copy of that texture in memory, not five.

Most game engines employ some kind of resource manager (a.k.a. asset manager, a.k.a. media manager) to load and manage the myriad resources that make up a modern 3D game. Every resource manager makes heavy use of the file system.
On a per- sonal computer, the fi le system is exposed to the programmer via a library of operating system calls. However, game engines oft en “wrap” the native fi le system API in an engine-specifi c API, for two primary reasons. First, the engine might be cross-platform, in which case the game engine’s fi le system API can shield the rest of the soft ware from diff erences between diff erent target hardware platforms. Second, the operating system’s fi le system API might not provide all the tools needed by a game engine. For example, many engines support fi le streaming (i.e., the ability to load data “on the fl y” while the game is running), yet most operating systems don’t provide a streaming fi le system API out of the box. Console game engines also need to provide ac- 262 6. Resources and the File System cess to a variety of removable and non-removable media, from memory sticks to optional hard drives to a DVD-ROM or Blu-ray fi xed disk to network fi le systems (e.g., Xbox Live or the PlayStation Network , PSN). The diff erences between various kinds of media can likewise be “hidden” behind a game engine’s fi le system API. In this chapter, we’ll fi rst explore the kinds of fi le system APIs found in modern 3D game engines. Then we’ll see how a typical resource manager works. 6.1. File System A game engine’s fi le system API typically addresses the following areas of functionality: manipulating fi le names and paths, opening, closing, reading and writing individual fi les, scanning the contents of a directory, handling asynchronous fi le I/O requests (for streaming). We’ll take a brief look at each of these in the following sections. 6.1.1. File Names and Paths A path is a string describing the location of a fi le or directory within a fi le sys- tem hierarchy. Each operating system uses a slightly diff erent path format, but paths have essentially the same structure on every operating system. A path generally takes the following form: volume/directory1/ directory2/…/directoryN/fi le-name or volume/directory1/directory2/…/directory(N – 1)/directoryN In other words, a path generally consists of an optional volume specifi er fol- lowed by a sequence of path components separated by a reserved path separa- tor character such as the forward or backward slash (/ or \). Each component names a directory along the route from the root directory to the fi le or direc- tory in question. If the path specifi es the location of a fi le, the last compo- nent in the path is the fi le name; otherwise it names the target directory. The root directory is usually indicated by a path consisting of the optional volume specifi er followed by a single path separator character (e.g., / on UNIX, or C:\ on Windows). 263 6.1. File System 6.1.1.1. Differences Across Operating Systems Each operating system introduces slight variations on this general path struc- ture. Here are some of the key diff erences between Microsoft DOS , Microsoft Windows , the UNIX family of operating systems, and Apple Macintosh OS: UNIX uses a forward slash (/) as its path component separator, while DOS and older versions of Windows used a backslash (\) as the path separator. Recent versions of Windows allow either forward or back- ward slashes to be used to separate path components, although some applications still fail to accept forward slashes. Mac OS 8 and 9 use the colon (:) as the path separator character. Mac OS X is based on UNIX, so it supports UNIX’s forward slash notation. 
UNIX and its variants don’t support volumes as separate directory hi- erarchies. The entire fi le system is contained within a single monolithic hierarchy, and local disk drives, network drives, and other resources are mounted so that they appear to be subtrees within the main hierarchy. As a result, UNIX paths never have a volume specifi er. On Microsoft Windows, volumes can be specifi ed in two ways. A local disk drive is specifi ed using a single lett er followed by a colon (e.g., the ubiquitous C:). A remote network share can either be mounted so that it looks like a local disk, or it can be referenced via a volume specifi er consisting of two backslashes followed by the remote computer name and the name of a shared directory or resource on that machine (e.g., \\some-computer\some-share). This double backslash notation is an example of the Universal Naming Convention (UNC). Under DOS and early versions of Windows, a fi le name could be up to eight characters in length, with a three-character extension which was separated from the main fi le name by a dot. The extension described the fi le’s type, for example .txt for a text fi le or .exe for an executable fi le. In recent Windows implementations, fi le names can contain any number of dots (as they can under UNIX), but the characters aft er the fi nal dot are still interpreted as the fi le’s extension by many applications including the Windows Explorer. Each operating system disallows certain characters in the names of fi les and directories. For example, a colon cannot appear anywhere in a Win- dows or DOS path except as part of a drive lett er volume specifi er. Some operating systems permit a subset of these reserved characters to ap- pear in a path as long as the path is quoted in its entirety or the off end- ing character is escaped by preceding it with a backslash or some other 264 6. Resources and the File System reserved escape character. For example, fi le and directory names may contain spaces under Windows, but such a path must be surrounded by double quotes in certain contexts. Both UNIX and Windows have the concept of a current working directory or CWD (also known as the present working directory or PWD). The CWD can be set from a command shell via the cd (change directory) command on both operating systems, and it can be queried by typing cd with no arguments under Windows or by executing the pwd command on UNIX. Under UNIX there is only one CWD. Under Windows, each vol- ume has its own private CWD. Operating systems that support multiple volumes, like Windows, also have the concept of a current working volume. From a Windows com- mand shell, the current volume can be set by entering its drive lett er and a colon followed by the Enter key (e.g., C:). Consoles oft en also employ a set of predefi ned path prefi xes to repre- sent multiple volumes. For example, PLAYSTATION 3 uses the prefi x /dev_bdvd/ to refer to the Bluray disk drive, while /dev_hddx/ refers to one or more hard disks (where x is the index of the device). On a PS3 development kit, /app_home/ maps to a user-defi ned path on whatever host machine is being used for development. During development, the game usually reads its assets from /app_home/ rather than from the Bluray or the hard disk. 6.1.1.2. Absolute and Relative Paths All paths are specifi ed relative to some location within the fi le system. When a path is specifi ed relative to the root directory, we call it an absolute path . 
When it is relative to some other directory in the fi le system hierarchy, we call it a relative pa th . Under both UNIX and Windows, absolute paths start with a path sepa- rator character (/ or \), while relative paths have no leading path separator. On Windows, both absolute and relative paths may have an optional volume specifi er—if the volume is omitt ed, then the path is assumed to refer to the current working volume. The following paths are all absolute: Windows C:\Windows\System32 D:\ (root directory on the D: volume) \ (root directory on the current working volume) 265 \game\assets\animation\walk.anim (current working volume) \\joe-dell\Shared_Files\Images\foo.jpg (network path) UNIX /usr/local/bin/grep /game/src/audio/effects.cpp / (root directory) The following paths are all relative: Windows System32 (relative to CWD \Windows on the current volume) X:animation\walk.anim (relative to CWD \game\assets on the X: volume) UNIX bin/grep (relative to CWD /usr/local) src/audio/effects.cpp (relative to CWD /game) 6.1.1.3. Search Paths The term path should not be confused with the term search path. A path is a string representing the location of a single fi le or directory within the fi le system hierarchy. A search path is a string containing a list of paths, each sepa- rated by a special character such as a colon or semicolon, which is searched when looking for a fi le. For example, when you run any program from a com- mand prompt, the operating system fi nds the executable fi le by searching each directory on the search path contained in the shell’s PATH environment variable. Some game engines also use search paths to locate resource fi les. For ex- ample, the Ogre3D rendering engine uses a resource search path contained in a text fi le named resources.cfg. The fi le provides a simple list of directories and Zip archives that should be searched in order when trying to fi nd an as- set. That said, searching for assets at runtime is a time-consuming proposition. Usually there’s no reason our assets’ paths cannot be known a priori. Presum- ing this is the case, we can avoid having to search for assets at all—which is clearly a superior approach. 6.1.1.4. Path APIs Clearly paths are much more complex than simple strings. There are many things a programmer may need to do when dealing with paths, such as isolat- ing the directory, fi lename and extension, canonicalizing a path, converting 6.1. File System 266 6. Resources and the File System back and forth between absolute and relative paths, and so on. It can be ex- tremely helpful to have a feature-rich API to help with these tasks. Microsoft Windows provides an API for this purpose. It is implement- ed by the dynamic link library shlwapi.dll, and exposed via the header fi le shlwapi.h. Complete documentation for this API is provided on the Microsoft Developer’s Network (MSDN) at the following URL: htt p://msdn2. microsoft .com/en-us/library/bb773559(VS.85).aspx. Of course, the shlwapi API is only available on Win32 platforms. Sony provides a similar API for use on the PLAYSTATION 3. But when writing a cross-platform game engine, we cannot use platform-specifi c APIs directly. A game engine may not need all of the functions provided by an API like sh- lwapi anyway. For these reasons, game engines oft en implement a stripped- down path-handling API that meets the engine’s particular needs and works on every operating system targeted by the engine. 
Such an API can be implemented as a thin wrapper around the native API on each platform, or it can be written from scratch.

6.1.2. Basic File I/O

The standard C library provides two APIs for opening, reading, and writing the contents of files—one buffered and the other unbuffered. Every file I/O API requires data blocks known as buffers to serve as the source or destination of the bytes passing between the program and the file on disk. We say a file I/O API is buffered when the API manages the necessary input and output data buffers for you. With an unbuffered API, it is the responsibility of the programmer using the API to allocate and manage the data buffers. The standard C library's buffered file I/O routines are sometimes referred to as the stream I/O API, because they provide an abstraction which makes disk files look like streams of bytes. The standard C library functions for buffered and unbuffered file I/O are listed in Table 6.1.

    Operation                  Buffered API    Unbuffered API
    Open a file                fopen()         open()
    Close a file               fclose()        close()
    Read from a file           fread()         read()
    Write to a file            fwrite()        write()
    Seek to an offset          fseek()         seek()
    Return current offset      ftell()         tell()
    Read a single line         fgets()         n/a
    Write a single line        fputs()         n/a
    Read formatted string      fscanf()        n/a
    Write formatted string     fprintf()       n/a
    Query file status          fstat()         stat()

    Table 6.1. Buffered and unbuffered file operations in the standard C library.

The standard C library I/O functions are well documented, so we will not repeat detailed documentation for them here. For more information, please refer to http://msdn2.microsoft.com/en-us/library/c565h7xx(VS.71).aspx for Microsoft's implementation of the buffered (stream I/O) API, and to http://msdn2.microsoft.com/en-us/library/40bbyw78(VS.71).aspx for Microsoft's implementation of the unbuffered (low-level I/O) API.

On UNIX and its variants, the standard C library's unbuffered I/O routines are native operating system calls. However, on Microsoft Windows these routines are merely wrappers around an even lower-level API. The Win32 function CreateFile() creates or opens a file for writing or reading, ReadFile() and WriteFile() read and write data, respectively, and CloseHandle() closes an open file handle. The advantage to using low-level system calls as opposed to standard C library functions is that they expose all of the details of the native file system. For example, you can query and control the security attributes of files when using the Windows native API—something you cannot do with the standard C library.

Some game teams find it useful to manage their own buffers. For example, the Red Alert 3 team at Electronic Arts observed that writing data into log files was causing significant performance degradation. They changed the logging system so that it accumulated its output into a memory buffer, writing the buffer out to disk only when it was filled. Then they moved the buffer dump routine out into a separate thread to avoid stalling the main game loop.

6.1.2.1. To Wrap or Not To Wrap

A game engine can be written to use the standard C library's file I/O functions or the operating system's native API. However, many game engines wrap the file I/O API in a library of custom I/O functions. There are at least three advantages to wrapping the operating system's I/O API. First, the engine programmers can guarantee identical behavior across all target platforms, even when native libraries are inconsistent or buggy on a particular platform. Second, the API can be simplified down to only those functions actually required by the engine, which keeps maintenance efforts to a minimum. Third, extended functionality can be provided. For example, the engine's custom wrapper API might be capable of dealing with files on a hard disk, a DVD-ROM or Blu-ray disk on a console, files on a network (e.g., remote files managed by Xbox Live or PSN), and also with files on memory sticks or other kinds of removable media.

6.1.2.2. Synchronous File I/O

Both of the standard C library's file I/O libraries are synchronous, meaning that the program making the I/O request must wait until the data has been completely transferred to or from the media device before continuing. The following code snippet demonstrates how the entire contents of a file might be read into an in-memory buffer using the synchronous I/O function fread(). Notice how the function syncReadFile() does not return until all the data has been read into the buffer provided.

    bool syncReadFile(const char* filePath,
                      U8* buffer,
                      size_t bufferSize,
                      size_t& rBytesRead)
    {
        FILE* handle = fopen(filePath, "rb");
        if (handle)
        {
            // BLOCK here until all data has been read.
            size_t bytesRead = fread(buffer, 1, bufferSize, handle);

            int err = ferror(handle);   // get error if any
            fclose(handle);

            if (0 == err)
            {
                rBytesRead = bytesRead;
                return true;
            }
        }
        return false;
    }

    void main(int argc, const char* argv[])
    {
        U8     testBuffer[512];
        size_t bytesRead = 0;

        if (syncReadFile("C:\\testfile.bin",
                         testBuffer, sizeof(testBuffer), bytesRead))
        {
            printf("success: read %u bytes\n", bytesRead);
            // Contents of buffer can be used here...
        }
    }

6.1.3. Asynchronous File I/O

Streaming refers to the act of loading data in the background while the main program continues to run. Many games provide the player with a seamless, load-screen-free playing experience by streaming data for upcoming levels from the DVD-ROM, Blu-ray disk, or hard drive while the game is being played. Audio and texture data are probably the most commonly streamed types of data, but any type of data can be streamed, including geometry, level layouts, and animation clips.

In order to support streaming, we must utilize an asynchronous file I/O library, i.e., one which permits the program to continue to run while its I/O requests are being satisfied. Some operating systems provide an asynchronous file I/O library out of the box. For example, the Windows Common Language Runtime (CLR, the virtual machine upon which languages like Visual BASIC, C#, managed C++ and J# are implemented) provides functions like System.IO.BeginRead() and System.IO.BeginWrite(). An asynchronous API known as fios is available for the PLAYSTATION 3. If an asynchronous file I/O library is not available for your target platform, it is possible to write one yourself. And even if you don't have to write it from scratch, it's probably a good idea to wrap the system API for portability.

The following code snippet demonstrates how the entire contents of a file might be read into an in-memory buffer using an asynchronous read operation. Notice that the asyncReadFile() function returns immediately—the data is not present in the buffer until our callback function asyncReadComplete() has been called by the I/O library.
    AsyncRequestHandle g_hRequest;          // handle to async I/O request
    U8                 g_asyncBuffer[512];  // input buffer

    static void asyncReadComplete(AsyncRequestHandle hRequest);

    void main(int argc, const char* argv[])
    {
        // NOTE: This call to asyncOpen() might itself be an
        // asynchronous call, but we'll ignore that detail here
        // and just assume it's a blocking function.
        AsyncFileHandle hFile = asyncOpen("C:\\testfile.bin");

        if (hFile)
        {
            // This function requests an I/O read, then returns
            // immediately (non-blocking).
            g_hRequest = asyncReadFile(
                hFile,                  // file handle
                g_asyncBuffer,          // input buffer
                sizeof(g_asyncBuffer),  // size of buffer
                asyncReadComplete);     // callback function
        }

        // Now go on our merry way...
        // (This loop simulates doing real work while we wait
        // for the I/O read to complete.)
        for (;;)
        {
            OutputDebugString("zzz...\n");
            Sleep(50);
        }
    }

    // This function will be called when the data has been read.
    static void asyncReadComplete(AsyncRequestHandle hRequest)
    {
        if (hRequest == g_hRequest
            && asyncWasSuccessful(hRequest))
        {
            // The data is now present in g_asyncBuffer[] and can
            // be used. Query for the number of bytes actually read:
            size_t bytes = asyncGetBytesReadOrWritten(hRequest);

            char msg[256];
            sprintf(msg, "async success, read %u bytes\n", bytes);
            OutputDebugString(msg);
        }
    }

Most asynchronous I/O libraries permit the main program to wait for an I/O operation to complete some time after the request was made. This can be useful in situations where only a limited amount of work can be done before the results of a pending I/O request are needed. This is illustrated in the following code snippet.

    U8 g_asyncBuffer[512];  // input buffer

    void main(int argc, const char* argv[])
    {
        AsyncRequestHandle hRequest = ASYNC_INVALID_HANDLE;
        AsyncFileHandle    hFile    = asyncOpen("C:\\testfile.bin");

        if (hFile)
        {
            // This function requests an I/O read, then returns
            // immediately (non-blocking).
            hRequest = asyncReadFile(
                hFile,                  // file handle
                g_asyncBuffer,          // input buffer
                sizeof(g_asyncBuffer),  // size of buffer
                NULL);                  // no callback
        }

        // Now do some limited amount of work...
        for (int i = 0; i < 10; ++i)
        {
            OutputDebugString("zzz...\n");
            Sleep(50);
        }

        // We can't do anything further until we have that data,
        // so wait for it here.
        asyncWait(hRequest);

        if (asyncWasSuccessful(hRequest))
        {
            // The data is now present in g_asyncBuffer[] and can
            // be used. Query for the number of bytes actually read:
            size_t bytes = asyncGetBytesReadOrWritten(hRequest);

            char msg[256];
            sprintf(msg, "async success, read %u bytes\n", bytes);
            OutputDebugString(msg);
        }
    }

Some asynchronous I/O libraries allow the programmer to ask for an estimate of how long a particular asynchronous operation will take to complete. Some APIs also allow you to set deadlines on a request (effectively prioritizing the request relative to other pending requests) and to specify what happens when a request misses its deadline (e.g., cancel the request, notify the program and keep trying, etc.).

6.1.3.1. Priorities

It's important to remember that file I/O is a real-time system, subject to deadlines just like the rest of the game. Therefore, asynchronous I/O operations often have varying priorities. For example, if we are streaming audio from the hard disk or Blu-ray disk and playing it on the fly, loading the next buffer-full of audio data is clearly higher priority than, say, loading a texture or a chunk of a game level.
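As a purely illustrative sketch (these types and names are hypothetical and are not part of any real I/O library), one simple way to honor such priorities is to keep pending requests in a priority queue, so that the I/O system always services the most urgent outstanding request next:

    #include <functional>
    #include <queue>
    #include <vector>

    enum class IoPriority { Low, Normal, High, Critical };

    struct IoRequest
    {
        IoPriority            priority;
        std::function<void()> execute;  // the blocking read or write to perform
    };

    // Order requests so that the highest-priority one is popped first.
    struct CompareByPriority
    {
        bool operator()(const IoRequest& a, const IoRequest& b) const
        {
            return a.priority < b.priority;
        }
    };

    class IoRequestQueue
    {
    public:
        void submit(IoRequest request)
        {
            m_queue.push(std::move(request));
        }

        // Called repeatedly by the I/O thread.
        void serviceNext()
        {
            if (m_queue.empty())
                return;

            IoRequest next = m_queue.top();
            m_queue.pop();
            next.execute();  // perform the blocking I/O
        }

    private:
        std::priority_queue<IoRequest,
                            std::vector<IoRequest>,
                            CompareByPriority> m_queue;
    };

In a real engine, the I/O thread described below would pull its work from a structure like this rather than from a plain first-in, first-out queue, and would also need a policy for suspending a request that is already in flight.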
Asynchronous I/O systems must be capable of suspending lower-priority requests, so that higher-priority I/O requests have a chance to complete within their deadlines. 6.1.3.2. How Asynchronous File I/O Works Asynchronous fi le I/O works by handling I/O requests in a separate thread . The main thread calls functions that simply place requests on a queue and then return immediately. Meanwhile, the I/O thread picks up requests from the queue and handles them sequentially using blocking I/O routines like read() or fread(). When a request is completed, a callback provided by the main thread is called, thereby notifying it that the operation is done. If the main thread chooses to wait for an I/O request to complete, this is handled via a semaphore. (Each request has an associated semaphore, and the main thread can put itself to sleep waiting for that semaphore to be signaled by the I/O thread upon completion of the request.) Virtually any synchronous operation you can imagine can be transformed into an asynchronous operation by moving the code into a separate thread— or by running it on a physically separate processor, such as on one of the six synergistic processing units (SPUs) on the PLAYSTATION 3. See Section 7.6 for more details. 6.2. The Resource Manager Every game is constructed from a wide variety of resources (sometimes called assets or media). Examples include meshes, materials, textures, shader pro- grams, animations, audio clips, level layouts, collision primitives, physics pa- rameters, and the list goes on. A game’s resources must be managed, both in terms of the offl ine tools used to create them, and in terms of loading, unload- ing, and manipulating them at runtime. Therefore every game engine has a resource manager of some kind. Every resource manager is comprised of two distinct but integrated com- ponents. One component manages the chain of off -line tools used to create the assets and transform them into their engine-ready form. The other component 273 manages the resources at runtime, ensuring that they are loaded into memory in advance of being needed by the game and making sure they are unloaded from memory when no longer needed. In some engines, the resource manager is a cleanly-designed, unifi ed, centralized subsystem that manages all types of resources used by the game. In other engines, the resource manager doesn’t exist as a single subsystem per se, but is rather spread across a disparate collection of subsystems, per- haps writt en by diff erent individuals at various times over the engine’s long and sometimes colorful history. But no matt er how it is implemented, a re- source manager invariably takes on certain responsibilities and solves a well- understood set of problems. In this section, we’ll explore the functionality and some of the implementation details of a typical game engine resource manager. 6.2.1. Off-Line Resource Management and the Tool Chain 6.2.1.1. Revision Control for Assets On a small game project, the game’s assets can be managed by keeping loose fi les sitt ing around on a shared network drive with an ad hoc directory struc- ture. This approach is not feasible for a modern commercial 3D game, com- prised of a massive number and variety of assets. For such a project, the team requires a more formalized way to track and manage its assets. Some game teams use a source code revision control system to manage their resources. Art source fi les (Maya scenes, Photoshop .PSD fi les, Illustrator fi les, etc.) 
are checked in to Perforce or a similar package by the artists. This approach works reasonably well, although some game teams build custom asset management tools to help fl att en the learning curve for their artists. Such tools may be simple wrappers around a commercial revision control system, or they might be entirely custom. Dealing with Data Size One of the biggest problems in the revision control of art assets is the sheer amount of data. Whereas C++ and script source code fi les are small, relative to their impact on the project, art fi les tend to be much, much larger. Because many source control systems work by copying fi les from the central reposito- ry down to the user’s local machine, the sheer size of the asset fi les can render these packages almost entirely useless. I’ve seen a number of diff erent solutions to this problem employed at various studios. Some studios turn to commercial revision control systems like Alienbrain that have been specifi cally designed to handle very large data 6.2. The Resource Manager 274 6. Resources and the File System sizes. Some teams simply “take their lumps” and allow their revision control tool to copy assets locally. This can work, as long as your disks are big enough and your network bandwidth suffi cient, but it can also be ineffi cient and slow the team down. Some teams build elaborate systems on top of their revision control tool to ensure that a particular end-user only gets local copies of the fi les he or she actually needs. In this model, the user either has no access to the rest of the repository or can access it on a shared network drive when needed. At Naughty Dog we use a proprietary tool that makes use of UNIX symbol- ic links to virtually eliminate data copying, while permitt ing each user to have a complete local view of the asset repository. As long as a fi le is not checked out for editing, it is a symlink to a master fi le on a shared network drive. Sym- bolic links occupy very litt le space on the local disk, because it is nothing more than a directory entry. When the user checks out a fi le for editing, the symlink is removed, and a local copy of the fi le replaces it. When the user is done edit- ing and checks the fi le in, the local copy becomes the new master copy, its revi- sion history is updated in a master database, and the local fi le turns back into a symlink. This systems works very well, but it requires the team to build their own revision control system from scratch; I am unaware of any commercial tool that works like this. Also, symbolic links are a UNIX feature—such a tool could probably be built with Windows junctions (the Windows equivalent of a symbolic link), but I haven’t seen anyone try it as yet. 6.2.1.2. The Resource Database As we’ll explore in depth in the next section, most assets are not used in their original format by the game engine. They need to pass through some kind of asset conditioning pipeline, whose job it is to convert the assets into the binary format needed by the engine. For every resource that passes through the asset conditioning pipeline, there is some amount of metadata that describes how that resource should be processed. When compressing a texture bitmap, we need to know what type of compression best suits that particular image. When exporting an animation, we need to know what range of frames in Maya should be exported. When exporting character meshes out of a Maya scene containing multiple characters, we need to know which mesh corresponds to which character in the game. 
To manage all of this metadata, we need some kind of database. If we are making a very small game, this database might be housed in the brains of the developers themselves. I can hear them now: “Remember: the player’s anima- tions need to have the ‘fl ip X’ fl ag set, but the other characters must not have it set… or… rats… is it the other way around?” 275 Clearly for any game of respectable size, we simply cannot rely on the memories of our developers in this manner. For one thing, the sheer volume of assets becomes overwhelming quite quickly. Processing individual resource fi les by hand is also far too time-consuming to be practical on a full-fl edged commercial game production. Therefore, every professional game team has some kind of semi-automated resource pipeline, and the data that drives the pipeline is stored in some kind of resource database. The resource database takes on vastly diff erent forms in diff erent game engines. In one engine, the metadata describing how a resource should be built might be embedded into the source assets themselves (e.g., it might be stored as so-called blind data within a Maya fi le). In another engine, each source resource fi le might be accompanied by a small text fi le that describes how it should be processed. Still other engines encode their resource build- ing metadata in a set of XML fi les, perhaps wrapped in some kind of custom graphical user interface. Some engines employ a true relational database, such as Microsoft Access, MySQL, or conceivably even a heavy-weight database like Oracle. Whatever its form, a resource database must provide the following basic functionality: The ability to deal with multiple types of resources, ideally (but certainly not necessarily) in a somewhat consistent manner. The ability to create new resources. The ability to delete resources. The ability to inspect and modify existing resources. The ability to move a resource’s source fi le(s) from one location to an- other on-disk. (This is very helpful because artists and game designers oft en need to rearrange assets to refl ect changing project goals, re-think- ing of game designs, feature additions and cuts, etc.) The ability of a resource to cross-reference other resources (e.g., the ma- terial used by a mesh, or the collection of animations needed by level 17). These cross-references typically drive both the resource building process and the loading process at runtime. The ability to maintain referential integrity of all cross-references within the database and to do so in the face of all common operations such as deleting or moving resources around. The ability to maintain a revision history, complete with a log of who made each change and why. It is also very helpful if the resource database supports searching or querying in various ways. For example, a developer might want to 6.2. The Resource Manager 276 6. Resources and the File System know in which levels a particular animation is used or which textures are referenced by a set of materials. Or they might simply be trying to fi nd a resource whose name momentarily escapes them. It should be prett y obvious from looking at the above list that creating a reliable and robust resource database is no small task. When designed well and implemented properly, the resource database can quite literally make the diff erence between a team that ships a hit game and a team that spins its wheels for 18 months before being forced by management to abandon the project (or worse). 
I know this to be true, because I’ve personally experienced both. 6.2.1.3. Some Successful Resource Database Designs Every game team will have diff erent requirements and make diff erent deci- sions when designing their resource database. However, for what it’s worth, here are some designs that have worked well in my own experience: Unreal Engine 3 Unreal’s resource database is managed by their über-tool, UnrealEd . UnrealEd is responsible for literally everything, from resource metadata management to asset creation to level layout and more. UnrealEd has its drawbacks, but its single biggest benefi t is that UnrealEd is a part of the game engine itself. This permits assets to be created and then immediately viewed in their full glory, exactly as they will appear in-game. The game can even be run from within UnrealEd, in order to visualize the assets in their natural surroundings and see if and how they work in-game. Another big benefi t of UnrealEd is what I would call one-stop shopping. UnrealEd’s Generic Browser (depicted in Figure 6.1) allows a developer to access literally every resource that is consumed by the engine. Having a sin- gle, unifi ed, and reasonably-consistent interface for creating and managing all types of resources is a big win. This is especially true considering that the resource data in most other game engines is fragmented across countless in- consistent and oft en cryptic tools. Just being able to fi nd any resource easily in UnrealEd is a big plus. Unreal can be less error-prone than many other engines, because assets must be explicitly imported into Unreal’s resource database. This allows re- sources to be checked for validity very early in the production process. In most game engines, any old data can be thrown into the resource database, and you only know whether or not that data is valid when it is eventually built—or sometimes not until it is actually loaded into the game at runtime. But with Unreal, assets can be validated as soon as they are imported into 277 UnrealEd. This means that the person who created the asset gets immediate feedback as to whether his or her asset is confi gured properly. Of course, Unreal’s approach has some serious drawbacks. For one thing, all resource data is stored in a small number of large package fi les . These fi les are binary, so they are not easily merged by a revision control package like CVS, Subversion, or Perforce. This presents some major problems when more than one user wants to modify resources that reside in a single package. Even if the users are trying to modify diff erent resources, only one user can lock the package at a time, so the other has to wait. The severity of this problem can be reduced by dividing resources into relatively small, granular packages, but it cannot practically be eliminated. Referential integrity is quite good in UnrealEd, but there are still some problems. When a resource is renamed or moved around, all references to it are maintained automatically using a dummy object that remaps the old re- 6.2. The Resource Manager Figure 6.1. UnrealEd’s Generic Browser. 278 6. Resources and the File System source to its new name/location. The problem with these dummy remapping objects is that they hang around and accumulate and sometimes cause prob- lems, especially if a resource is deleted. Overall, Unreal’s referential integrity is quite good, but it is not perfect. 
Despite its problems, UnrealEd is by far the most user-friendly, well-in- tegrated, and streamlined asset creation toolkit, resource database, and asset- conditioning pipeline that I have ever worked with. Naughty Dog’s Uncharted: Drake’s Fortune Engine For Uncharted: Drake’s Fortune (UDF), Naughty Dog stored its resource metadata in a MySQL database. A custom graphical user interface was writt en to manage the contents of the database. This tool allowed artists, game design- ers, and programmers alike to create new resources, delete existing resources, and inspect and modify resources as well. This GUI was a crucial component of the system, because it allowed users to avoid having to learn the intricacies of interacting with a relational database via SQL. The original MySQL database used on UDF did not provide a useful his- tory of the changes made to the database, nor did it provide a good way to roll back “bad” changes. It also did not support multiple users editing the same resource, and it was diffi cult to administer. Naughty Dog has since moved away from MySQL in favor of an XML fi le-based asset database, managed under Perforce. Builder, Naughty Dog’s resource database GUI, is depicted in Figure 6.2. The window is broken into two main sections: a tree view showing all resourc- es in the game on the left and a properties window on the right, allowing the resource(s) that are selected in the tree view to be viewed and edited. The re- source tree contains folders for organizational purposes, so that the artists and game designers can organize their resources in any way they see fi t. Various types of resources can be created and managed within any folder, including actors and levels, and the various subresources that comprise them (primar- ily meshes, skeletons, and animations). Animations can also be grouped into pseudo-folders known as bundles. This allows large groups of animations to be created and then managed as a unit, and prevents a lot of wasted time drag- ging individual animations around in the tree view. The asset conditioning pipeline on UDF consists of a set of resource ex- porters, compilers, and linkers that are run from the command line. The engine is capable of dealing with a wide variety of diff erent kinds of data objects, but these are packaged into one of two types of resource fi les: actors and levels. An actor can contain skeletons, meshes, materials, textures, and/or animations. A level contains static background meshes, materials and textures, and also level-layout information. To build an actor, one simply types ba name-of-actor 279 on the command line; to build a level, one types bl name-of-level. These com- mand-line tools query the database to determine exactly how to build the actor or level in question. This includes information on how to export the assets from DCC tools like Maya, Photoshop etc., how to process the data, and how to package it into binary .pak fi les that can be loaded by the game engine. This is much simpler than in many engines, where resources have to be exported manually by the artists—a time-consuming, tedious, and error-prone task. The benefi ts of the resource pipeline design used by Naughty Dog in- clude: Granular resources. Resources can be manipulated in terms of logical en- tities in the game—meshes, materials, skeletons, and animations. These 6.2. The Resource Manager Figure 6.2. The front-end GUI for Naughty Dog’s off-line resource database, Builder. 280 6. 
Resources and the File System resource types are granular enough that the team almost never has confl icts in which two users want to edit the same resource simultane- ously. The necessary features (and no more). The Builder tool provides a powerful set of features that meet the needs of the team, but Naughty Dog didn’t waste any resources creating features they didn’t need. Obvious mapping to source fi les. A user can very quickly determine which source assets (native DCC fi les, like Maya .ma fi les or photoshop .psd fi les) make up a particular resource. Easy to change how DCC data is exported and processed. Just click on the resource in question and twiddle its processing properties within the resource database GUI. Easy to build assets. Just type ba or bl followed by the resource name on the command line. The dependency system takes care of the rest. Of course, the UDF tool chain has some drawbacks as well, including: Lack of visualization tools. The only way to preview an asset is to load it into the game or the model/animation viewer (which is really just a special mode of the game itself). The tools aren’t fully integrated. Naughty Dog uses one tool to lay out levels, another to manage the majority of resources in the resource data- base, and a third to set up materials and shaders (this is not part of the resource database front-end). Building the assets is done on the com- mand line. It might be a bit more convenient if all of these functions were to be integrated into a single tool. However, Naughty Dog has no plans to do this, because the benefi t would probably not outweigh the costs involved. Ogre’s Resource Manager System Ogre3D is a rendering engine, not a full-fl edged game engine. That said, Ogre does boast a reasonably complete and very well-designed runtime resource manager. A simple, consistent interface is used to load virtually any kind of resource. And the system has been designed with extensibility in mind. Any programmer can quite easily implement a resource manager for a brand new kind of asset and integrate it easily into Ogre’s resource framework. One of the drawbacks of Ogre’s resource manager is that it is a runtime- only solution. Ogre lacks any kind of off -line resource database. Ogre does provide some exporters which are capable of converting a Maya fi le into a mesh that can be used by Ogre (complete with materials, shaders, a skeleton and optional animations). However, the exporter must be run manually from 281 within Maya itself. Worse, all of the metadata describing how a particular Maya fi le should be exported and processed must be entered by the user do- ing the export. In summary, Ogre’s runtime resource manager is powerful and well-de- signed. But Ogre would benefi t a great deal from an equally powerful and modern resource database and asset conditioning pipeline on the tools side. Microsoft’s XNA XNA is a game development toolkit by Microsoft , targeted at the PC and Xbox 360 platforms. XNA’s resource management system is unique, in that it lever- ages the project management and build systems of the Visual Studio IDE to manage and build the assets in the game as well. XNA’s game development tool, Game Studio Express, is just a plug-in to Visual Studio Express. You can read more about Game Studio Express at htt p://msdn.microsoft .com/en-us/ library/bb203894.aspx. 6.2.1.4. 
The Asset Conditioning Pipeline In Section 1.7, we learned that resource data is typically created using ad- vanced digital content creation (DCC) tools like Maya, Z-Brush, Photoshop, or Houdini. However, the data formats used by these tools are usually not suit- able for direct consumption by a game engine. So the majority of resource data is passed through an asset conditioning pipeline (ACP) on its way to the game engine. The ACP is sometimes referred to as the resource conditioning pipeline (RCP), or simply the tool chain. Every resource pipeline starts with a collection of source assets in native DCC formats (e.g., Maya .ma or .mb fi les, Photoshop .psd fi les, etc.) These assets are typically passed through three processing stages on their way to the game engine:  1. Exporters . We need some way of gett ing the data out of the DCC’s na- tive format and into a format that we can manipulate. This is usually accomplished by writing a custom plug-in for the DCC in question. It is the plug-in’s job to export the data into some kind of intermediate fi le format that can be passed to later stages in the pipeline. Most DCC ap- plications provide a reasonably convenient mechanism for doing this. Maya actually provides three: a C++ SDK, a scripting language called MEL , and most recently a Python interface as well. In cases where a DCC application provides no customization hooks, we can always save the data in one of the DCC tool’s native formats. With any luck, one of these will be an open format, a reasonably-intuitive text format, or some other format that we can reverse engineer. Presuming 6.2. The Resource Manager 282 6. Resources and the File System this is the case, we can pass the fi le directly to the next stage of the pipe- line.  2. Resource compilers . We oft en have to “massage” the raw data exported from a DCC application in various ways in order to make it game-ready. For example, we might need to rearrange a mesh’s triangles into strips, or compress a texture bitmap, or calculate the arc lengths of the segments of a Catmull-Rom spline. Not all types of resources need to be compiled— some might be game-ready immediately upon being exported.  3. Resource linkers . Multiple resource fi les sometimes need to be combined into a single useful package prior to being loaded by the game engine. This mimics the process of linking together the object fi les of a compiled C++ program into an executable fi le, and so this process is sometimes called resource linking. For example, when building a complex compos- ite resource like a 3D model, we might need to combine the data from multiple exported mesh fi les, multiple material fi les, a skeleton fi le, and multiple animation fi les into a single resource. Not all types of resources need to be linked—some assets are game-ready aft er the export or com- pile steps. Resource Dependencies and Build Rules Much like compiling the source fi les in a C or C++ project and then linking them into an executable, the asset conditioning pipeline processes source as- sets (in the form of Maya geometry and animation fi les, Photoshop PSD fi les, raw audio clips, text fi les, etc.), converts them into game-ready form, and then links them together into a cohesive whole for use by the engine. And just like the source fi les in a computer program, game assets oft en have interdepen- dencies. (For example, a mesh refers to one or more materials, which in turn refer to various textures.) 
These interdependencies typically have an impact on the order in which assets must be processed by the pipeline. (For example, we might need to build a character’s skeleton before we can process any of that character’s animations.) In addition, the dependencies between assets tell us which assets need to be rebuilt when a particular source asset changes. Build dependencies revolve not only around changes to the assets them- selves, but also around changes to data formats. If the format of the fi les used to store triangle meshes changes, for instance, all meshes in the entire game may need to be reexported and/or rebuilt. Some game engines employ data formats that are robust to version changes. For example, an asset may contain a version number, and the game engine may include code that “knows” how to load and make use of legacy assets. The downside of such a policy is that asset fi les and engine code tend to become bulky. When data format changes 283 are relatively rare, it may be bett er to just bite the bullet and reprocess all the fi les when format changes do occur. Every asset conditioning pipeline requires a set of rules that describe the interdependencies between the assets, and some kind of build tool that can use this information to ensure that the proper assets are built, in the proper order, when a source asset is modifi ed. Some game teams roll their own build system. Others use an established tool, such as make. Whatever solution is selected, teams should treat their build dependency system with utmost care. If you don’t, changes to sources assets may not trigger the proper assets to be rebuilt. The result can be inconsistent game assets, which may lead to vi- sual anomalies or even engine crashes. In my personal experience, I’ve wit- nessed countness hours wasted in tracking down problems that could have been avoided had the asset interdependencies been properly specifi ed and the build system implemented to use them reliably. 6.2.2. Runtime Resource Management Let us turn our att ention now to how the assets in our resource database are loaded, managed, and unloaded within the engine at runtime. 6.2.2.1. Responsibilities of the Runtime Resource Manager A game engine’s runtime resource manager takes on a wide range of responsi- bilities, all related to its primary mandate of loading resources into memory: Ensures that only one copy of each unique resource exists in memory at any given time. Manages the lifetime of each resource loads needed resources and un- loads resources that are no longer needed. Handles loading of composite resources. A composite resource is a resource comprised of other resources. For example, a 3D model is a composite resource that consists of a mesh, one or more materials, one or more textures, and optionally a skeleton and multiple skeletal animations. Maintains referential integrity . This includes internal referential integrity (cross-references within a single resource) and external referential integ- rity (cross-references between resources). For example, a model refers to its mesh and skeleton; a mesh refers to its materials, which in turn refer to texture resources; animations refer to a skeleton, which ultimately ties them to one or more models. When loading a composite resource, the resource manager must ensure that all necessary subresources are loaded, and it must patch in all of the cross-references properly. Manages the memory usage of loaded resources and ensures that re- sources are stored in the appropriate place(s) in memory. 6.2. 
Permits custom processing to be performed on a resource after it has been loaded, on a per-resource-type basis. This process is sometimes known as logging in or load-initializing the resource.

Usually (but not always) provides a single unified interface through which a wide variety of resource types can be managed. Ideally a resource manager is also easily extensible, so that it can handle new types of resources as they are needed by the game development team.

Handles streaming (i.e., asynchronous resource loading), if the engine supports this feature.

6.2.2.2. Resource File and Directory Organization

In some game engines (typically PC engines), each individual resource is managed in a separate "loose" file on-disk. These files are typically contained within a tree of directories whose internal organization is designed primarily for the convenience of the people creating the assets; the engine typically doesn't care where resource files are located within the resource tree. Here's a typical resource directory tree for a hypothetical game called Space Evaders:

    SpaceEvaders        Root directory for entire game.
      Resources         Root of all resources.
        Characters      Non-player character models and animations.
          Pirate        Models and animations for pirates.
          Marine        Models and animations for marines.
          ...
        Player          Player character models and animations.
        Weapons         Models and animations for weapons.
          Pistol        Models and animations for the pistol.
          Rifle         Models and animations for the rifle.
          BFG           Models and animations for the big... uh... gun.
          ...
        Levels          Background geometry and level layouts.
          Level1        First level's resources.
          Level2        Second level's resources.
          ...
        Objects         Miscellaneous 3D objects.
          Crate         The ubiquitous breakable crate.
          Barrel        The ubiquitous exploding barrel.

Other engines package multiple resources together in a single file, such as a ZIP archive, or some other composite file (perhaps of a proprietary format). The primary benefit of this approach is improved load times. When loading data from files, the three biggest costs are seek times (i.e., moving the read head to the correct place on the physical media), the time required to open each individual file, and the time to read the data from the file into memory. Of these, the seek times and file-open times can be non-trivial on many operating systems. When a single large file is used, all of these costs are minimized. A single file can be organized sequentially on the disk, reducing seek times to a minimum. And with only one file to open, the cost of opening individual resource files is eliminated.

The Ogre3D rendering engine's resource manager permits resources to exist as loose files on disk, or as virtual files within a large ZIP archive. The primary benefits of the ZIP format are the following:

1. It is an open format. The zlib and zziplib libraries used to read and write ZIP archives are freely available. The zlib SDK is totally free (see http://www.zlib.net), while the zziplib SDK falls under the Lesser Gnu Public License (LGPL) (see http://zziplib.sourceforge.net).

2. The virtual files within a ZIP archive "remember" their relative paths. This means that a ZIP archive "looks like" a raw file system for most intents and purposes. The Ogre resource manager identifies all resources uniquely via strings that appear to be file system paths.
However, these paths sometimes identify virtual fi les within a ZIP archive instead of loose fi les on disk, and a game programmer needn’t be aware of the dif- ference in most situations.  3. ZIP archives may be compressed. This reduces the amount of disk space occupied by resources. But, more importantly, it again speeds up load times, as less data need be loaded into memory from the fi xed disk. This is especially helpful when reading data from a DVD-ROM or Blu-ray disk, as the data transfer rates of these devices are much slower than a hard disk drive. Hence the cost of decompressing the data aft er it has been loaded into memory is oft en more than off set by the time saved in loading less data from the device.  4. ZIP archives are modular. Resources can be grouped together into a ZIP fi le and managed as a unit. One particularly elegant application of this idea is in product localization. All of the assets that need to be local- ized (such as audio clips containing dialogue and textures that contain words or region-specifi c symbols) can be placed in a single ZIP fi le, and then diff erent versions of this ZIP fi le can be generated, one for each language or region. To run the game for a particular region, the engine simply loads the corresponding version of the ZIP archive. 6.2. The Resource Manager 286 6. Resources and the File System Unreal Engine 3 takes a similar approach, with a few important diff er- ences. In Unreal, all resources must be contained within large composite fi les known as packages (a.k.a. “pak fi les”) . No loose disk fi les are permitt ed. The format of a package fi le is proprietary. The Unreal Engine’s game editor, UnrealEd, allows developers to create and manage packages and the resourc- es they contain. 6.2.2.3. Resource File Formats Each type of resource fi le potentially has a diff erent format. For example, a mesh fi le is always stored in a diff erent format than that of a texture bitmap. Some kinds of assets are stored in standardized, open formats. For example, textures are typically stored as Targa fi les (TGA), Portable Network Graph- ics fi les (PNG), Tagged Image File Format fi les (TIFF), Joint Photographic Ex- perts Group fi les (JPEG), or Windows Bitmap fi les (BMP)—or in a standard- ized compressed format such as DirectX’s S3 Texture Compression family of formats (S3TC, also known as DXTn or DXTC). Likewise, 3D mesh data is oft en exported out of a modeling tool like Maya or Lightwave into a stan- dardized format such as OBJ or COLLADA for consumption by the game engine. Sometimes a single fi le format can be used to house many diff erent types of assets. For example, the Granny SDK by Rad Game Tools (htt p://www.rad- gametools.com) implements a fl exible open fi le format that can be used to store 3D mesh data, skeletal hierarchies, and skeletal animation data. (In fact the Granny fi le format can be easily repurposed to store virtually any kind of data imaginable.) Many game engine programmers roll their own fi le formats for various reasons. This might be necessary if no standardized format provides all of the information needed by the engine. Also, many game engines endeavor to do as much off -line processing as possible in order to minimize the amount of time needed to load and process resource data at runtime. 
If the data needs to conform to a particular layout in memory, for example, a raw binary format might be chosen so that the data can be laid out by an off -line tool (rather than att empting to format it at runtime aft er the resource has been loaded). 6.2.2.4. Resource GUIDs Every resource in a game must have some kind of globally unique identifi er (GUID). The most common choice of GUID is the resource’s fi le system path (stored either as a string or a 32-bit hash). This kind of GUID is intuitive, be- cause it clearly maps each resource to a physical fi le on-disk. And it’s guar- 287 anteed to be unique across the entire game, because the operating system al- ready guarantees that no two fi les will have the same path. However, a fi le system path is by no means the only choice for a resource GUID. Some engines use a less-intuitive type of GUID, such as a 128-bit hash code, perhaps assigned by a tool that guarantees uniqueness. In other engines, using a fi le system path as a resource identifi er is infeasible. For example, Unreal Engine 3 stores many resources in a single large fi le known as a pack- age, so the path to the package fi le does not uniquely identify any one re- source. To overcome this problem, an Unreal package fi le is organized into a folder hierarchy containing individual resources. Unreal gives each indi- vidual resource within a package a unique name which looks much like a fi le system path. So in Unreal, a resource GUID is formed by concatenating the (unique) name of the package fi le with the in-package path of the resource in question. For example, the Gears of War resource GUID Locust_Boomer. PhysicalMaterials. LocustBoomerLeather identifi es a material called LocustBoomerLeather within the PhysicalMaterials folder of the Locust_Boomer package fi le. 6.2.2.5. The Resource Registry In order to ensure that only one copy of each unique resource is loaded into memory at any given time, most resource managers maintain some kind of registry of loaded resources. The simplest implementation is a dictionary—i.e., a collection of key-value pairs . The keys contain the unique ids of the resources, while the values are typically pointers to the resources in memory. Whenever a resource is loaded into memory, an entry for it is added to the resource registry dictionary, using its GUID as the key. Whenever a resource is unloaded, its registry entry is removed. When a resource is requested by the game, the resource manager looks up the resource by its GUID within the re- source registry. If the resource can be found, a pointer to it is simply returned. If the resource cannot be found, it can either be loaded automatically or a failure code can be returned. At fi rst blush, it might seem most intuitive to automatically load a re- quested resource if it cannot be found in the resource registry. And in fact, some game engines do this. However, there are some serious problems with this approach. Loading a resource is a slow operation, because it involves lo- cating and opening a fi le on disk, reading a potentially large amount of data into memory (from a potentially slow device like a DVD-ROM drive), and also possibly performing post-load initialization of the resource data once it has been loaded. If the request comes during active gameplay, the time it takes to load the resource might cause a very noticeable hitch in the game’s frame 6.2. The Resource Manager 288 6. Resources and the File System rate, or even a multi-second freeze. 
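To make this concrete, here is a minimal sketch of the kind of registry dictionary described above. The class and function names are illustrative only (no particular engine's API is implied), and a real implementation would also have to deal with reference counting, asynchronous loading and error reporting.

#include <cstdint>
#include <unordered_map>

// Hypothetical GUID type: a 32-bit hash of the resource's file system path.
using ResourceGuid = std::uint32_t;

// Opaque stand-in for an engine-specific resource (mesh, texture, etc.).
struct Resource;

class ResourceRegistry
{
public:
    // Returns the already-loaded resource, or loads it on demand.
    Resource* AcquireResource(ResourceGuid guid)
    {
        auto it = m_registry.find(guid);
        if (it != m_registry.end())
            return it->second;                 // only one copy ever exists

        Resource* pResource = LoadResourceFromDisk(guid);  // may be slow!
        if (pResource)
            m_registry[guid] = pResource;
        return pResource;                      // nullptr on failure
    }

    void UnloadResource(ResourceGuid guid)
    {
        auto it = m_registry.find(guid);
        if (it != m_registry.end())
        {
            FreeResource(it->second);
            m_registry.erase(it);
        }
    }

private:
    // Assumed helpers -- these would be implemented by the engine.
    Resource* LoadResourceFromDisk(ResourceGuid guid);
    void      FreeResource(Resource* pResource);

    std::unordered_map<ResourceGuid, Resource*> m_registry;
};

The call to LoadResourceFromDisk() in this sketch is exactly the potentially slow, automatic load discussed above.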
For this reason, engines tend to take one of two alternative approaches:

1. Resource loading might be disallowed completely during active gameplay. In this model, all of the resources for a game level are loaded en masse just prior to gameplay, usually while the player watches a loading screen or progress bar of some kind.

2. Resource loading might be done asynchronously (i.e., the data might be streamed). In this model, while the player is engaged in level A, the resources for level B are being loaded in the background. This approach is preferable because it provides the player with a load-screen-free play experience. However, it is considerably more difficult to implement.

6.2.2.6. Resource Lifetime

The lifetime of a resource is defined as the time period between when it is first loaded into memory and when its memory is reclaimed for other purposes. One of the resource manager's jobs is to manage resource lifetimes—either automatically, or by providing the necessary API functions to the game, so it can manage resource lifetimes manually. Each resource has its own lifetime requirements: Some resources must be loaded when the game first starts up and must stay resident in memory for the entire duration of the game. That is, their lifetimes are effectively infinite. These are sometimes called load-and-stay-resident (LSR) resources. Typical examples include the player character's mesh, materials, textures and core animations, textures and fonts used on the heads-up display (HUD), and the resources for all of the standard-issue weapons used throughout the game. Any resource that is visible or audible to the player throughout the entire game (and cannot be loaded on the fly when needed) should be treated as an LSR resource. Other resources have a lifetime that matches that of a particular game level. These resources must be in memory by the time the level is first seen by the player and can be dumped once the player has permanently left the level. Some resources might have a lifetime that is shorter than the duration of the level in which they are found. For example, the animations and audio clips that make up an in-game cut-scene (a mini-movie that advances the story or provides the player with important information) might be loaded in advance of the player seeing the cut-scene and then dumped once the cut-scene has played.
We don’t want to unload a resource when level A is done, only to im- mediately reload it because level B needs the same resource. One solution to this problem is to reference-count the resources. When- ever a new game level needs to be loaded, the list of all resources used by that level is traversed, and the reference count for each resource is incremented by one (but they are not loaded yet). Next, we traverse the resources of any unneeded levels and decrement their reference counts by one; any resource whose reference count drops to zero is unloaded. Finally, we run through the list of all resources whose reference count just went from zero to one and load those assets into memory. For example, imagine that level 1 uses resources A, B, and C, and that level 2 uses resources B, C, D, and E. (B and C are shared between both levels.) Table 6.2 shows the reference counts of these fi ve resources as the player plays through levels 1 and 2. In this table, reference counts are shown in boldface type to indicate that the corresponding resource actually exists in memory, while a grey background indicates that the resource is not in memory. A refer- ence count in parentheses indicates that the corresponding resource data is being loaded or unloaded. 6.2.2.7. Memory Management for Resources Resource management is closely related to memory management , because we must inevitably decide where the resources should end up in memory once they have been loaded. The destination of every resource is not always the same. For one thing, certain types of resources must reside in video RAM. Typical examples include textures, vertex buff ers, index buff ers, and shader 6.2. The Resource Manager 290 6. Resources and the File System code. Most other resources can reside in main RAM, but diff erent kinds of resources might need to reside within diff erent address ranges. For example, a resource that is loaded and stays resident for the entire game (LSR resources) might be loaded into one region of memory, while resources that are loaded and unloaded frequently might go somewhere else. The design of a game engine’s memory allocation subsystem is usually closely tied to that of its resource manager. Sometimes we will design the re- source manager to take best advantage of the types of memory allocators we have available; or vice-versa, we may design our memory allocators to suit the needs of the resource manager. As we saw in Section 5.2.1.4, one of the primary problems facing any re- source management system is the need to avoid fragmenting memory as re- sources are loaded and unloaded. We’ll discuss a few of the more-common solutions to this problem below. Heap-Based Resource Allocation One approach is to simply ignore memory fragmentation issues and use a general-purpose heap allocator to allocate your resources (like the one imple- mented by malloc() in C, or the global new operator in C++). This works best if your game is only intended to run on personal computers, on operating systems that support advanced virtual memory allocation. On such a system, physical memory will become fragmented, but the operating system’s abil- ity to map non-contiguous pages of physical RAM into a contiguous virtual memory space helps to mitigate some of the eff ects of fragmentation. If your game is running on a console with limited physical RAM and only a rudimentary virtual memory manager (or none whatsoever), then fragmen- tation will become a problem. In this case, one alternative is to defragment your memory periodically. 
We saw how to do this in Section 5.2.2.2.

Event                              A    B    C    D    E
Initial state                      0    0    0    0    0
Level 1 counts incremented         1    1    1    0    0
Level 1 loads                     (1)  (1)  (1)   0    0
Level 1 plays                      1    1    1    0    0
Level 2 counts incremented         1    2    2    1    1
Level 1 counts decremented         0    1    1    1    1
Level 1 unloads, level 2 loads    (0)   1    1   (1)  (1)
Level 2 plays                      0    1    1    1    1

Table 6.2. Resource usage as two levels load and unload.

Stack-Based Resource Allocation

A stack allocator does not suffer from fragmentation problems, because memory is allocated contiguously and freed in an order opposite to that in which it was allocated. A stack allocator can be used to load resources if the following two conditions are met: The game is linear and level-centric (i.e., the player watches a loading screen, then plays a level, then watches another loading screen, then plays another level). Each level fits into memory in its entirety.

Presuming that these requirements are satisfied, we can use a stack allocator to load resources as follows: When the game first starts up, the load-and-stay-resident (LSR) resources are allocated first. The top of the stack is then marked, so that we can free back to this position later. To load a level, we simply allocate its resources on the top of the stack. When the level is complete, we can simply set the stack top back to the marker we took earlier, thereby freeing all of the level's resources in one fell swoop without disturbing the LSR resources. This process can be repeated for any number of levels, without ever fragmenting memory. Figure 6.3 illustrates how this is accomplished.

Figure 6.3. Loading resources using a stack allocator.

A double-ended stack allocator can be used to augment this approach. Two stacks are defined within a single large memory block. One grows up from the bottom of the memory area, while the other grows down from the top. As long as the two stacks never overlap, the stacks can trade memory resources back and forth naturally—something that wouldn't be possible if each stack resided in its own fixed-size block.

On Hydro Thunder, Midway used a double-ended stack allocator. The lower stack was used for persistent data loads, while the upper was used for temporary allocations that were freed every frame. Another way a double-ended stack allocator can be used is to ping-pong level loads. Such an approach was used at Bionic Games Inc. for one of their projects. The basic idea is to load a compressed version of level B into the upper stack, while the currently-active level A resides (in uncompressed form) in the lower stack. To switch from level A to level B, we simply free level A's resources (by clearing the lower stack) and then decompress level B from the upper stack into the lower stack. Decompression is generally much faster than loading data from disk, so this approach effectively eliminates the load time that would otherwise be experienced by the player between levels.

Pool-Based Resource Allocation

Another resource allocation technique that is common in game engines that support streaming is to load resource data in equally-sized chunks. Because the chunks are all the same size, they can be allocated using a pool allocator (see Section 5.2.1.2). When resources are later unloaded, the chunks can be freed without causing fragmentation. Of course, a chunk-based allocation approach requires that all resource data be laid out in a manner that permits division into equally-sized chunks.
We cannot simply load an arbitrary resource file in chunks, because the file might contain a contiguous data structure like an array or a very large struct that is larger than a single chunk. For example, if the chunks that contain an array are not arranged sequentially in RAM, the continuity of the array will be lost, and array indexing will cease to function properly. This means that all resource data must be designed with "chunkiness" in mind. Large contiguous data structures must be avoided in favor of data structures that are either small enough to fit within a single chunk or do not require contiguous RAM to function properly (e.g., linked lists).

Each chunk in the pool is typically associated with a particular game level. (One simple way to do this is to give each level a linked list of its chunks.) This allows the engine to manage the lifetimes of each chunk appropriately, even when multiple levels with different life spans are in memory concurrently. For example, when level A is loaded, it might allocate and make use of N chunks. Later, level B might allocate an additional M chunks. When level A is eventually unloaded, its N chunks are returned to the free pool. If level B is still active, its M chunks need to remain in memory. By associating each chunk with a specific level, the lifetimes of the chunks can be managed easily and efficiently. This is illustrated in Figure 6.4.

One big trade-off inherent in a "chunky" resource allocation scheme is wasted space. Unless a resource file's size is an exact multiple of the chunk size, the last chunk in a file will not be fully utilized (see Figure 6.5). Choosing a smaller chunk size can help to mitigate this problem, but the smaller the chunks, the more onerous the restrictions on the layout of the resource data. (As an extreme example, if a chunk size of one byte were selected, then no data structure could be larger than a single byte—clearly an untenable situation.) A typical chunk size is on the order of a few kilobytes. For example, at Naughty Dog, we use a chunky resource allocator as part of our resource streaming system, and our chunks are 512 kB in size. You may also want to consider selecting a chunk size that is a multiple of the operating system's I/O buffer size to maximize efficiency when loading individual chunks.

Figure 6.4. Chunky allocation of resources for two levels (in the figure, level X owns files A and D, while level Y owns files B, C and E).

Figure 6.5. The last chunk of a resource file is often not fully utilized (here, a 1638 kB file occupies four 512 kB chunks, leaving 410 kB of the last chunk unused).

Resource Chunk Allocators

One way to limit the effects of wasted chunk memory is to set up a special memory allocator that can utilize the unused portions of chunks. As far as I'm aware, there is no standardized name for this kind of allocator, but we will call it a resource chunk allocator for lack of a better name. A resource chunk allocator is not particularly difficult to implement. We need only maintain a linked list of all chunks that contain unused memory, along with the locations and sizes of each free block. We can then allocate from these free blocks in any way we see fit.
For example, we might manage the linked list of free blocks using a general-purpose heap allocator. Or we might map a small stack allocator onto each free block; whenever a request for memory comes in, we could then scan the free blocks for one whose stack has enough free RAM, and then use that stack to satisfy the request. Unfortunately, there’s a rather grotesque-looking fl y in our ointment here. If we allocate memory in the unused regions of our resource chunks, what hap- pens when those chunks are freed? We cannot free part of a chunk—it’s an all or nothing proposition. So any memory we allocate within an unused portion of a resource chunk will magically disappear when that resource is unloaded. A simple solution to this problem is to only use our free-chunk alloca- tor for memory requests whose lifetimes match the lifetime of the level with which a particular chunk is associated. In other words, we should only al- locate memory out of level A’s chunks for data that is associated exclusively with level A and only allocate from B’s chunks memory that is used exclu- sively by level B. This requires our resource chunk allocator to manage each level’s chunks separately. And it requires the users of the chunk allocator to specify which level they are allocating for, so that the correct linked list of free blocks can be used to satisfy the request. Thankfully, most game engines need to allocate memory dynamically when loading resources, over and above the memory required for the resource fi les themselves. So a resource chunk allocator can be a fruitful way to reclaim chunk memory that would otherwise have been wasted. Sectioned Resource Files Another useful idea that is related to “chunky” resource fi les is the concept of fi le sections. A typical resource fi le might contain between one and four sec- tions, each of which is divided into one or more chunks for the purposes of pool allocation as described above. One section might contain data that is des- tined for main RAM, while another section might contain video RAM data. Another section could contain temporary data that is needed during the load- ing process but is discarded once the resource has been completely loaded. Yet 295 another section might contain debugging information. This debug data could be loaded when running the game in debug mode, but not loaded at all in the fi nal production build of the game. The Granny SDK’s fi le system (htt p:// www.radgametools.com) is an excellent example of how to implement fi le sectioning in a simple and fl exible manner. 6.2.2.8. Composite Resources and Referential Integrity Usually a game’s resource database consists of multiple resource fi les, each fi le containing one or more data objects. These data objects can refer to and depend upon one another in arbitrary ways. For example, a mesh data structure might contain a reference to its material, which in turn contains a list of references to textures. Usually cross-references imply dependency (i.e., if resource A refers to resource B, then both A and B must be in memory in order for the resources to be functional in the game.) In general, a game’s resource database can be represented by a directed graph of interdependent data objects. Cross-references between data objects can be internal (a reference between two objects within a single fi le) or external (a reference to an object in a dif- ferent fi le). This distinction is important because internal and external cross- references are oft en implemented diff erently. 
When visualizing a game’s re- source database, we can draw dott ed lines surrounding individual resource fi les to make the internal/external distinction clear—any edge of the graph that crosses a dott ed line fi le boundary is an external reference, while edges that do not cross fi le boundaries are internal. This is illustrated in Fiure 6.6. 6.2. The Resource Manager Figure 6.6. Example of a resource database dependency graph. 296 6. Resources and the File System We sometimes use the term composite resource to describe a self-suffi cient cluster of interdependent resources. For example, a model is a composite re- source consisting of one or more triangle meshes, an optional skeleton, and an optional collection of animations. Each mesh is mapped with a material, and each material refers to one or more textures. To fully load a composite resource like a 3D model into memory, all of its dependent resources must be loaded as well. 6.2.2.9. Handling Cross-References between Resources One of the more-challenging aspects of implementing a resource manager is managing the cross-references between resource objects and guaranteeing that referential integrity is maintained. To understand how a resource man- ager accomplishes this, let’s look at how cross-references are represented in memory, and how they are represented on-disk. In C++, a cross-reference between two data objects is usually implemented via a pointer or a reference. For example, a mesh might contain the data mem- ber Material* m_pMaterial (a pointer) or Material& m_material (a ref- erence) in order to refer to its material. However, pointers are just memory addresses—they lose their meaning when taken out of the context of the run- ning application. In fact, memory addresses can and do change even between subsequent runs of the same application. Clearly when storing data to a disk fi le, we cannot use pointers to describe inter-object dependencies. GUIDs As Cross-References One good approach is to store each cross-reference as a string or hash code containing the unique id of the referenced object. This implies that every re- source object that might be cross-referenced must have a globally unique identi- fi er or GUID. To make this kind of cross-reference work, the runtime resource manager maintains a global resource look-up table. Whenever a resource object is load- ed into memory, a pointer to that object is stored in the table with its GUID as the look-up key. Aft er all resource objects have been loaded into memory and their entries added to the table, we can make a pass over all of the objects and convert all of their cross-references into pointers, by looking up the address of each cross-referenced object in the global resource look-up table via that object’s GUID. Pointer Fix-Up Tables Another approach that is oft en used when storing data objects into a binary fi le is to convert the pointers into fi le off sets. Consider a group of C structs or C++ objects that cross-reference each other via pointers. To store this group 297 of objects into a binary fi le, we need to visit each object once (and only once) in an arbitrary order and write each object’s memory image into the fi le se- quentially. This has the eff ect of serializing the objects into a contiguous image within the fi le, even when their memory images are not contiguous in RAM. This is shown in Figure 6.7. 
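Before turning to the offset-based alternative in detail, here is a minimal sketch of the GUID look-up and pointer-conversion pass just described. The material and texture types, the hashed GUID type and the two-vector layout are assumptions made purely for illustration.

#include <cstdint>
#include <unordered_map>
#include <vector>

using ResourceGuid = std::uint32_t;   // hypothetical 32-bit hashed GUID

struct Texture;

// On disk, a material stores the GUID of each texture it uses; in memory,
// we want real pointers. Both forms are kept here for simplicity.
struct Material
{
    std::vector<ResourceGuid> textureGuids;  // as read from the file
    std::vector<Texture*>     textures;      // patched in after loading
};

// Global resource look-up table, filled in as each object is loaded.
std::unordered_map<ResourceGuid, Texture*> g_textureTable;

// Second pass: convert GUID cross-references into pointers.
void FixUpMaterial(Material& material)
{
    material.textures.clear();
    for (ResourceGuid guid : material.textureGuids)
    {
        auto it = g_textureTable.find(guid);
        // A missing entry means a broken external reference -- a real
        // engine would report an error here rather than crash later.
        material.textures.push_back(it != g_textureTable.end() ? it->second
                                                               : nullptr);
    }
}

With that in mind, let us return to the offset-based scheme.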
Because the objects' memory images are now contiguous within the file, we can determine the offset of each object's image relative to the beginning of the file. During the process of writing the binary file image, we locate every pointer within every data object, convert each pointer into an offset, and store those offsets into the file in place of the pointers. We can simply overwrite the pointers with their offsets, because the offsets never require more bits to store than the original pointers. In effect, an offset is the binary file equivalent of a pointer in memory. (Do be aware of the differences between your development platform and your target platform. If you write out a memory image on a 64-bit Windows machine, its pointers will all be 64 bits wide and the resulting file won't be compatible with a 32-bit console.)

Of course, we'll need to convert the offsets back into pointers when the file is loaded into memory some time later. Such conversions are known as pointer fix-ups. When the file's binary image is loaded, the objects contained in the image retain their contiguous layout. So it is trivial to convert an offset into a pointer. We merely add the offset to the address of the file image as a whole. This is demonstrated by the code snippet below, and illustrated in Figure 6.8.

Figure 6.7. In-memory object images become contiguous when saved into a binary file.

Figure 6.8. Contiguous resource file image, after it has been loaded into RAM.

U8* ConvertOffsetToPointer(U32 objectOffset, U8* pAddressOfFileImage)
{
    U8* pObject = pAddressOfFileImage + objectOffset;
    return pObject;
}

The problem we encounter when trying to convert pointers into offsets, and vice-versa, is how to find all of the pointers that require conversion. This problem is usually solved at the time the binary file is written. The code that writes out the images of the data objects has knowledge of the data types and classes being written, so it has knowledge of the locations of all the pointers within each object. The locations of the pointers are stored into a simple table known as a pointer fix-up table. This table is written into the binary file along with the binary images of all the objects. Later, when the file is loaded into RAM again, the table can be consulted in order to find and fix up every pointer. The table itself is just a list of offsets within the file—each offset represents a single pointer that requires fixing up. This is illustrated in Figure 6.9.

Figure 6.9. A pointer fix-up table.

Storing C++ Objects as Binary Images: Constructors

One important step that is easy to overlook when loading C++ objects from a binary file is to ensure that the objects' constructors are called.
For example, if we load a binary image containing three objects—an instance of class A, an instance of class B, and an instance of class C—then we must make sure that the correct constructor is called on each of these three objects. There are two common solutions to this problem. First, you can simply decide not to support C++ objects in your binary fi les at all. In other words, restrict yourself to plain old data structures (PODS)—i.e., C structs and C++ structs and classes that contain no virtual functions and trivial do-nothing con- structors (See htt p://en.wikipedia.org/wiki/Plain_Old_Data_Structures for a more complete discussion of PODS.) Second, you can save off a table containing the off sets of all non-PODS objects in your binary image along with some indication of which class each object is an instance of. Then, once the binary image has been loaded, you can iterate through this table, visit each object, and call the appropriate construc- tor using placement new syntax (i.e., calling the constructor on a preallocated block of memory). For example, given the off set to an object within the binary image, we might write: void* pObject = ConvertOffsetToPointer(objectOffset); ::new(pObject) ClassName; // placement-new syntax where ClassName is the class of which the object is an instance. Handling External References The two approaches described above work very well when applied to resourc- es in which all of the cross-references are internal—i.e., they only reference objects within a single resource fi le. In this simple case, you can load the bi- nary image into memory and then apply the pointer fi x-ups to resolve all the cross-references. But when cross-references reach out into other resource fi les, a slightly augmented approach is required. To successfully represent an external cross-reference, we must specify not only the off set or GUID of the data object in question, but also the path to the resource fi le in which the referenced object resides. 6.2. The Resource Manager 300 6. Resources and the File System The key to loading a multi-fi le composite resource is to load all of the interdependent fi les fi rst. This can be done by loading one resource fi le and then scanning through its table of cross-references and loading any externally- referenced fi les that have not already been loaded. As we load each data object into RAM, we can add the object’s address to the master look-up table . Once all of the interdependent fi les have been loaded and all of the objects are pres- ent in RAM, we can make a fi nal pass to fi x up all of the pointers using the master look-up table to convert GUIDs or fi le off sets into real addresses. 6.2.2.10. Post-Load Initialization Ideally, each and every resource would be completely prepared by our off -line tools, so that it is ready for use the moment it has been loaded into memory. Practically speaking, this is not always possible. Many types of resources re- quire at least some “massaging” aft er having been loaded, in order to prepare them for use by the engine. In this book, I will use the term post-load initializa- tion to refer to any processing of resource data aft er it has been loaded. Other engines may use diff erent terminology. (For example, at Naughty Dog we call this logging in a resource.) Most resource managers also support some kind of tear-down step prior to a resource’s memory being freed. (At Naughty Dog, we call this logging out a resource.) 
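Before looking at post-load initialization in more detail, here is a minimal sketch that ties together the fix-up-table and placement-new steps described above. The file layout, the class table and the class names are illustrative assumptions rather than a prescribed format.

#include <cstdint>
#include <new>

using U8  = std::uint8_t;
using U32 = std::uint32_t;

// Hypothetical non-POD classes that might appear in a resource file.
struct ClassA { ClassA() { /* ... */ } };
struct ClassB { ClassB() { /* ... */ } };

// Walk the fix-up table: each entry is the offset of one pointer slot
// within the loaded image, and each slot holds an offset that must be
// converted back into a real address.
void ApplyPointerFixUps(U8* pImage, const U32* fixUpTable, U32 numPointers)
{
    for (U32 i = 0; i < numPointers; ++i)
    {
        void** pSlot = reinterpret_cast<void**>(pImage + fixUpTable[i]);
        std::uintptr_t offset = reinterpret_cast<std::uintptr_t>(*pSlot);
        *pSlot = pImage + offset;   // overwrite the offset with a pointer
    }
}

// Run constructors on the non-POD objects listed in a (hypothetical)
// class table, using placement new so that no memory is allocated.
struct ClassTableEntry { U32 objectOffset; U32 classId; };

void ConstructObjects(U8* pImage, const ClassTableEntry* table, U32 count)
{
    for (U32 i = 0; i < count; ++i)
    {
        void* pObject = pImage + table[i].objectOffset;
        switch (table[i].classId)
        {
            case 1: ::new(pObject) ClassA; break;
            case 2: ::new(pObject) ClassB; break;
            // ... one case per supported class
        }
    }
}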
Post-load initialization generally comes in one of two varieties: In some cases, post-load initialization is an unavoidable step. For ex- ample, the vertices and indices that describe a 3D mesh are loaded into main RAM, but they almost always need to be transferred into video RAM. This can only be accomplished at runtime, by creating a Direct X vertex buff er or index buff er, locking it, copying or reading the data into the buff er, and then unlocking it. In other cases, the processing done during post-load initialization is avoidable (i.e., could be moved into the tools), but is done for conve- nience or expedience. For example, a programmer might want to add the calculation of accurate arc lengths to our engine’s spline library. Rather than spend the time to modify the tools to generate the arc length data, the programmer might simply calculate it at runtime during post-load initialization. Later, when the calculations are perfected, this code can be moved into the tools, thereby avoiding the cost of doing the calcula- tions at runtime. Clearly, each type of resource has its own unique requirements for post- load initialization and tear-down. So resource managers typically permit these two steps to be confi gurable on a per-resource-type basis. In a non-object-ori- 301 ented language like C, we can envision a look-up table that maps each type of resource to a pair of function pointers, one for post-load initialization and one for tear-down. In an object-oriented language like C++, life is even easier—we can make use of polymorphism to permit each class to handle post-load ini- tialization and tear-down in a unique way. In C++, post-load initialization could be implemented as a special con- structor, and tear-down could be done in the class’ destructor. However, there are some problems with using constructors and destructors for this purpose. (For example, constructors cannot be virtual in C++, so it would be diffi cult for a derived class to modify or augment the post-load initialization of its base class.) Many developers prefer to defer post-load initialization and tear-down to plain old virtual functions. For example, we might choose to use a pair of virtual functions named something sensible like Init() and Destroy(). Post-load initialization is closely related to a resource’s memory allocation strategy, because new data is oft en generated by the initialization routine. In some cases, the data generated by the post-load initialization step augments the data loaded from the fi le. (For example, if we are calculating the arc lengths of the segments of a Catmull-Rom spline curve aft er it has been loaded, we would probably want to allocate some additional memory in which to store the results.) In other cases, the data generated during post-load initialization replaces the loaded data. (For example, we might allow mesh data in an older out-of-date format to be loaded and then automatically converted into the lat- est format for backwards compatibility reasons.) In this case, the loaded data may need to be discarded, either partially or in its entirety, aft er the post-load step has generated the new data. The Hydro Thunder engine had a simple but powerful way of handling this. It would permit resources to be loaded in one of two ways: (a) directly into its fi nal resting place in memory, or (b) into a temporary area of memory. 
In the latt er case, the post-load initialization routine was responsible for copy- ing the fi nalized data into its ultimate destination; the temporary copy of the resource would be discarded aft er post-load initialization was complete. This was very useful for loading resource fi les that contained both relevant and irrelevant data. The relevant data would be copied into its fi nal destination in memory, while the irrelevant data would be discarded. For example, mesh data in an out-of-date format could be loaded into temporary memory and then converted into the latest format by the post-load initialization routine, without having to waste any memory keeping the old-format data kicking around. 6.2. The Resource Manager 303 7 The Game Loop and Real-Time Simulation Games are real-time, dynamic, interactive computer simulations . As such, time plays an incredibly important role in any electronic game. There are many diff erent kinds of time to deal with in a game engine—real time , game time , the local timeline of an animation, the actual CPU cycles spent within a particular function, and the list goes on. Every engine system might defi ne and manipulate time diff erently. We must have a solid understanding of all the ways time can be used in a game. In this chapter, we’ll take a look at how real-time, dynamic simulation soft ware works and explore the common ways in which time plays a role in such a simulation. 7.1. The Rendering Loop In a graphical user interface (GUI), of the sort found on a Windows PC or a Macintosh, the majority of the screen’s contents are static. Only a small part of any one window is actively changing appearance at any given moment. Because of this, graphical user interfaces have traditionally been drawn on- screen via a technique known as rectangle invalidation , in which only the small portions of the screen whose contents have actually changed are re-drawn. Older 2D video games used similar techniques to minimize the number of pixels that needed to be drawn. 304 7. The Game Loop and Real-Time Simulation Real-time 3D computer graphics are implemented in an entirely diff erent way. As the camera moves about in a 3D scene, the entire contents of the screen or window change continually, so the concept of invalid rectangles no longer applies. Instead, an illusion of motion and interactivity is produced in much the same way that a movie produces it—by presenting the viewer with a se- ries of still images in rapid succession. Obviously, producing a rapid succession of still images on-screen requires a loop. In a real-time rendering application, this is sometimes known as the render loop . At its simplest, a rendering loop is structured as follows: while (!quit) { // Update the camera transform based on interactive // inputs or by following a predefined path. updateCamera(); // Update positions, orientations and any other // relevant visual state of any dynamic elements // in the scene. updateSceneElements(); // Render a still frame into an off-screen frame // buffer known as the "back buffer". renderScene(); // Swap the back buffer with the front buffer, making // the most-recently-rendered image visible // on-screen. (Or, in windowed mode, copy (blit) the // back buffer’s contents to the front buffer. swapBuffers(); } 7.2. The Game Loop A game is composed of many interacting subsystems, including device I/O, rendering, animation, collision detection and resolution, optional rigid body dynamics simulation, multiplayer networking, audio, and the list goes on. 
Most game engine subsystems require periodic servicing while the game is running. However, the rate at which these subsystems need to be serviced varies from subsystem to subsystem. Animation typically needs to be updated at a rate of 30 or 60 Hz, in synchronization with the rendering subsystem. However, a dynamics simulation may actually require more frequent updates (e.g., 120 Hz). Higher-level systems, like AI, might only need to be serviced once or twice per second, and they needn't necessarily be synchronized with the rendering loop at all.

There are a number of ways to implement the periodic updating of our game engine subsystems. We'll explore some of the possible architectures in a moment. But for the time being, let's stick with the simplest way to update our engine's subsystems—using a single loop to update everything. Such a loop is often called the game loop, because it is the master loop that services every subsystem in the engine.

7.2.1. A Simple Example: Pong

Pong is a well-known genre of table tennis video games that got its start in 1958, in the form of an analog computer game called Tennis for Two, created by William A. Higinbotham at the Brookhaven National Laboratory and displayed on an oscilloscope. The genre is best known by its later incarnations on digital computers—the Magnavox Odyssey game Table Tennis and the Atari arcade game Pong.

In Pong, a ball bounces back and forth between two movable vertical paddles and two fixed horizontal walls. The human players control the positions of the paddles via control wheels. (Modern re-implementations allow control via a joystick, the keyboard, or some other human interface device.) If the ball passes by a paddle without striking it, the other team wins the point and the ball is reset for a new round of play.

The following pseudocode demonstrates what the game loop of a Pong game might look like at its core:

void main() // Pong
{
    initGame();

    while (true) // game loop
    {
        readHumanInterfaceDevices();

        if (quitButtonPressed())
        {
            break; // exit the game loop
        }

        movePaddles();
        moveBall();
        collideAndBounceBall();

        if (ballImpactedSide(LEFT_PLAYER))
        {
            incrementScore(RIGHT_PLAYER);
            resetBall();
        }
        else if (ballImpactedSide(RIGHT_PLAYER))
        {
            incrementScore(LEFT_PLAYER);
            resetBall();
        }

        renderPlayfield();
    }
}

Clearly this example is somewhat contrived. The original Pong games were certainly not implemented by redrawing the entire screen at a rate of 30 frames per second. Back then, CPUs were so slow that they could barely muster the power to draw two lines for the paddles and a box for the ball in real time. Specialized 2D sprite hardware was often used to draw moving objects on-screen. However, we're only interested in the concepts here, not the implementation details of the original Pong.

As you can see, when the game first runs, it calls initGame() to do whatever set-up might be required by the graphics system, human I/O devices, audio system, etc. Then the main game loop is entered. The statement while (true) tells us that the loop will continue forever, unless interrupted internally. The first thing we do inside the loop is to read the human interface device(s). We check to see whether either human player pressed the "quit" button—if so, we exit the game via a break statement.
Next, the positions of the paddles are adjusted slightly upward or downward in movePaddles(), based on the current defl ection of the control wheels, joysticks, or other I/O devices. The function moveBall() adds the ball’s current velocity vector to its position in order to fi nd its new position next frame. In collideAndBounce- Ball(), this position is then checked for collisions against both the fi xed hori- zontal walls and the paddles. If collisions are detected, the ball’s position is re- calculated to account for any bounce. We also note whether the ball impacted either the left or right edge of the screen. This means that it missed one of the paddles, in which case we increment the other player’s score and reset the ball for the next round. Finally, renderPlayfield() draws the entire contents of the screen. 307 7.3. Game Loop Architectural Styles 7.3. Game Loop Architectural Styles Game loops can be implemented in a number of diff erent ways—but at their core, they usually boil down to one or more simple loops, with various embel- lishments. We’ll explore a few of the more common architectures below. 7.3.1. Windows Message Pumps On a Windows platform, games need to service messages from the Windows operating system in addition to servicing the various subsystems in the game engine itself. Windows games therefore contain a chunk of code known as a message pump . The basic idea is to service Windows messages whenever they arrive and to service the game engine only when no Windows messages are pending. A message pump typically looks something like this: while (true) { // Service any and all pending Windows messages. MSG msg; while (PeekMessage(&msg, NULL, 0, 0) > 0) { TranslateMessage(&msg); DispatchMessage(&msg); } // No more Windows messages to process – run one // iteration of our "real" game loop. RunOneIterationOfGameLoop(); } One of the side-eff ects of implementing the game loop like this is that Win- dows messages take precedence over rendering and simulating the game. As a result, the game will temporarily freeze whenever you resize or drag the game’s window around on the desktop. 7.3.2. Callback-Driven Frameworks Most game engine subsystems and third-party game middleware packages are structured as libraries . A library is a suite of functions and/or classes that 308 7. The Game Loop and Real-Time Simulation can be called in any way the application programmer sees fi t. Libraries pro- vide maximum fl exibility to the programmer. But libraries are sometimes dif- fi cult to use, because the programmer must understand how to properly use the functions and classes they provide. In contrast, some game engines and game middleware packages are structured as frameworks . A framework is a partially-constructed applica- tion—the programmer completes the application by providing custom im- plementations of missing functionality within the framework (or overriding its default behavior). But he or she has litt le or no control over the overall fl ow of control within the application, because it is controlled by the frame- work. In a framework-based rendering engine or game engine, the main game loop has been writt en for us, but it is largely empty. The game programmer can write callback functions in order to “fi ll in” the missing details. The Ogre3D rendering engine is an example of a library that has been wrapped in a frame- work. At the lowest level, Ogre provides functions that can be called directly by a game engine programmer. 
However, Ogre also provides a framework that encapsulates knowledge of how to use the low-level Ogre library eff ectively. If the programmer chooses to use the Ogre framework, he or she derives a class from Ogre::FrameListener and overrides two virtual functions: frame- Started() and frameEnded(). As you might guess, these functions are called before and aft er the main 3D scene has been rendered by Ogre, respectively. The Ogre framework’s implementation of its internal game loop looks some- thing like the following pseudocode. (See Ogre::Root::renderOneFrame() in OgreRoot.cpp for the actual source code.) while (true) { for (each frameListener) { frameListener. frameStarted(); } renderCurrentScene(); for (each frameListener) { frameListener. frameEnded(); } finalizeSceneAndSwapBuffers(); } 309 A particular game’s frame listener implementation might look something like this. class GameFrameListener : public Ogre::FrameListener { public: virtual void frameStarted(const FrameEvent& event) { // Do things that must happen before the 3D scene // is rendered (i.e., service all game engine // subsystems). pollJoypad(event); updatePlayerControls(event); updateDynamicsSimulation(event); resolveCollisions(event); updateCamera(event); // etc. } virtual void frameEnded(const FrameEvent& event) { // Do things that must happen after the 3D scene // has been rendered. drawHud(event); // etc. } }; 7.3.3. Event-Based Updating In games, an event is any interesting change in the state of the game or its environment. Some examples include: the human player pressing a butt on on the joypad, an explosion going off , an enemy character spott ing the player, and the list goes on. Most game engines have an event system, which permits various engine subsystems to register interest in particular kinds of events and to respond to those events when they occur (see Section 14.7 for details). A game’s event system is usually very similar to the event/messaging system underlying virtually all graphical user interfaces (for example, Microsoft Win- dows’ window messages, the event handling system in Java’s AWT, or the services provided by C#’s delegate and event keywords). Some game engines leverage their event system in order to implement the periodic servicing of some or all of their subsystems. For this to work, the event system must permit events to be posted into the future—that is, to be queued for later delivery. A game engine can then implement periodic updat- 7.3. Game Loop Architectural Styles 310 7. The Game Loop and Real-Time Simulation ing by simply posting an event. In the event handler, the code can perform whatever periodic servicing is required. It can then post a new event 1/30 or 1/60 of a second into the future, thus continuing the periodic servicing for as long as it is required. 7.4. Abstract Timelines In game programming, it can be extremely useful to think in terms of abstract timelines . A timeline is a continuous, one-dimensional axis whose origin (t = 0) can lie at any arbitrary location relative to other timelines in the system. A timeline can be implemented via a simple clock variable that stores absolute time values in either integer or fl oating-point format. 7.4.1. Real Time We can think of times measured directly via the CPU’s high-resolution timer register (see Section 7.5.3) as lying on what we’ll call the real timeline. The ori- gin of this timeline is defi ned to coincide with the moment the CPU was last powered on or reset. 
It measures times in units of CPU cycles (or some mul- tiple thereof), although these time values can be easily converted into units of seconds by multiplying them by the frequency of the high-resolution timer on the current CPU. 7.4.2. Game Time We needn’t limit ourselves to working with the real timeline exclusively. We can defi ne as many other timeline(s) as we need, in order to solve the prob- lems at hand. For example, we can defi ne a game timeline that is technically independent of real time. Under normal circumstances, game time coincides with real time. If we wish to pause the game, we can simply stop updating the game timeline temporarily. If we want our game to go into slow-motion, we can update the game clock more slowly than the real-time clock. All sorts of eff ects can be achieved by scaling and warping one timeline relative to an- other. Pausing or slowing down the game clock is also a highly useful debug- ging tool. To track down a visual anomaly, a developer can pause game time in order to freeze the action. Meanwhile, the rendering engine and debug fl y- through camera can continue to run, as long as they are governed by a dif- ferent clock (either the real-time clock, or a separate camera clock). This allows the developer to fl y the camera around the game world to inspect it from any angle desired. We can even support single-stepping the game clock, by 311 7.4. Abstract Timelines advancing the game clock by one target frame interval (e.g., 1/30 of a second) each time a “single-step” butt on is pressed on the joypad or keyboard while the game is in a paused state. When using the approach described above, it’s important to realize that the game loop is still running when the game is paused—only the game clock has stopped. Single-stepping the game by adding 1/30 of a second to a paused game clock is not the same thing as sett ing a break point in your main loop, and then hitt ing the F5 key repeatedly to run one iteration of the loop at a time. Both kinds of single-stepping can be useful for tracking down diff erent kinds of problems. We just need to keep the diff erences between these ap- proaches in mind. 7.4.3. Local and Global Timelines We can envision all sorts of other timelines. For example, an animation clip or audio clip might have a local timeline, with its origin (t = 0) defi ned to coincide with the start of the clip. The local timeline measures how time progressed when the clip was originally authored or recorded. When the clip is played back in-game, we needn’t play it at the original rate. We might want to speed up an animation, or slow down an audio sample. We can even play an anima- tion backwards by running its local clock in reverse. Any one of these eff ects can be visualized as a mapping between the lo- cal timeline and a global timeline, such as real time or game time. To play an animation clip back at its originally-authored speed, we simply map the start of the animation’s local timeline (t = 0) onto the desired start time start()τ =τ along the global timeline. This is shown in Figure 7.1. To play an animation clip back at half speed, we can imagine scaling the local timeline to twice its original size prior to mapping it onto the global timeline. To accomplish this, we simply keep track of a time scale factor or playback rate R, in addition to the clip’s global start time start.τ This is illus- trated in Figure 7.2. A clip can even be played in reverse, by using a negative time scale (R < 0) as shown in Figure 7.3. 
Figure 7.1. Playing an animation clip can be envisioned as mapping its local timeline onto the global game timeline.

Figure 7.2. Animation playback speed can be controlled by simply scaling the local timeline prior to mapping it onto the global timeline.

Figure 7.3. Playing an animation in reverse is like mapping the clip to the global timeline with a time scale of R = –1.

7.5. Measuring and Dealing with Time

In this section, we'll investigate some of the subtle and not-so-subtle distinctions between different kinds of timelines and clocks and see how they are implemented in real game engines.

7.5.1. Frame Rate and Time Deltas

The frame rate of a real-time game describes how rapidly the sequence of still 3D frames is presented to the viewer. The unit of Hertz (Hz), defined as the number of cycles per second, can be used to describe the rate of any periodic process. In games and film, frame rate is typically measured in frames per second (FPS), which is the same thing as Hertz for all intents and purposes. Films traditionally run at 24 FPS. Games in North America and Japan are typically rendered at 30 or 60 FPS, because this is the natural refresh rate of the NTSC color television standard used in these regions. In Europe and most of the rest of the world, games update at 50 FPS, because this is the natural refresh rate of a PAL or SECAM color television signal.

The amount of time that elapses between frames is known as the frame time, time delta, or delta time. This last term is commonplace because the duration between frames is often represented mathematically by the symbol Δt. (Technically speaking, Δt should really be called the frame period, since it is the inverse of the frame frequency: T = 1/f. But game programmers hardly ever use the term "period" in this context.) If a game is being rendered at exactly 30 FPS, then its delta time is 1/30 of a second, or 33.3 ms (milliseconds). At 60 FPS, the delta time is half as big, 1/60 of a second or 16.6 ms. To really know how much time has elapsed during one iteration of the game loop, we need to measure it. We'll see how this is done below.

We should note here that milliseconds are a common unit of time measurement in games. For example, we might say that animation is taking 4 ms, which implies that it occupies about 12% of the entire frame (4 / 33.3 ≈ 0.12). Other common units include seconds and machine cycles. We'll discuss time units and clock variables in more depth below.

7.5.2. From Frame Rate to Speed

Let's imagine that we want to make a spaceship fly through our game world at a constant speed of 40 meters per second (or in a 2D game, we might specify this as 40 pixels per second!). One simple way to accomplish this is to multiply the ship's speed v (measured in meters per second) by the duration of one frame Δt (measured in seconds), yielding a change in position Δx = v Δt (which is measured in meters per frame). This position delta can then be added to the ship's current position x1, in order to find its position next frame: x2 = x1 + Δx = x1 + v Δt. This is actually a simple form of numerical integration known as the explicit Euler method (see Section 12.4.4).
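In code, such an explicit Euler step is nothing more than the following sketch; the vector type and member names are placeholders rather than any specific engine's math library.

struct Vector3 { float x, y, z; };   // minimal stand-in for an engine math type

struct Ship
{
    Vector3 position;   // meters
    Vector3 velocity;   // meters per second
};

// Advance the ship by one frame using an explicit Euler step:
// x2 = x1 + v * dt, where dt is the frame time in seconds.
void UpdateShip(Ship& ship, float dt)
{
    ship.position.x += ship.velocity.x * dt;
    ship.position.y += ship.velocity.y * dt;
    ship.position.z += ship.velocity.z * dt;
}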
It works well as long as the speeds of our objects are roughly constant. To handle variable speeds, we need to resort to somewhat more-complex integration methods. But all numerical integration techniques make use of the elapsed frame time Δt in one way or another. So it is safe to say that the perceived speeds of the objects in a game are dependent upon the frame duration, Δt. Hence a central problem in game programming is to determine a suitable value for Δt. In the sections that follow, we’ll discuss various ways of doing this. 7.5.2.1. Old-School CPU-Dependent Games In many early video games, no att empt was made to measure how much real time had elapsed during the game loop. The programmers would essentially 7.5. Measuring and Dealing with Time 314 7. The Game Loop and Real-Time Simulation ignore Δt altogether and instead specify the speeds of objects directly in terms of meters (or pixels, or some other distance unit) per frame. In other words, they were, perhaps unwitt ingly, specifying object speeds in terms of Δx = v Δt, instead of in terms of v. The net eff ect of this simplistic approach was that the perceived speeds of the objects in these games were entirely dependent upon the frame rate that the game was actually achieving on a particular piece of hardware. If this kind of game were to be run on a computer with a faster CPU than the machine for which it was originally writt en, the game would appear to be running in fast forward. For this reason, I’ll call these games CPU-dependent games . Some older PCs provided a “Turbo” butt on to support these kinds of games. When the Turbo butt on was pressed, the PC would run at its fastest speed, but CPU-dependent games would run in fast forward. When the Turbo butt on was not pressed, the PC would mimic the processor speed of an older generation of PCs, allowing CPU-dependent games writt en for those PCs to run properly. 7.5.2.2. Updating Based on Elapsed Time To make our games CPU-independent, we must measure Δt in some way, rath- er than simply ignoring it. Doing this is quite straightforward. We simply read the value of the CPU’s high resolution timer twice—once at the beginning of the frame and once at the end. Then we subtract, producing an accurate mea- sure of Δt for the frame that has just passed. This delta is then made available to all engine subsystems that need it, either by passing it to every function that we call from within the game loop or by storing it in a global variable or en- capsulating it within a singleton class of some kind. (We’ll describe the CPU’s high resolution timer in more detail Section 7.5.3.) The approach outlined above is used by many game engines. In fact, I am tempted to go out on a limb and say that most game engines use it. However, there is one big problem with this technique: We are using the measured value of Δt taken during frame k as an estimate of the duration of the upcoming frame (k + 1). This isn’t necessarily very accurate. (As they say in investing, “past per- formance is not a guarantee of future results.”) Something might happen next frame that causes it to take much more time (or much less) than the current frame. We call such an event a frame-rate spike. Using last frame’s delta as an estimate of the upcoming frame can have some very real detrimental eff ects. For example, if we’re not careful it can put the game into a “viscious cycle” of poor frame times. Let’s assume that our physics simulation is most stable when updated once every 33.3 ms (i.e., at 30 Hz). 
If we get one bad frame, taking say 57 ms, then we might make the 315 mistake of stepping the physics system twice on the next frame, presumably to “cover” the 57 ms that has passed. Those two steps take roughly twice as long to complete as a regular step, causing the next frame to be at least as bad as this one was, and possibly worse. This only serves to exacerbate and prolong the problem. 7.5.2.3. Using a Running Average It is true that game loops tend to have at least some frame-to-frame coher- ency . If the camera is pointed down a hallway containing lots of expensive-to- draw objects on one frame, there’s a good chance it will still be pointed down that hallway on the next. Therefore, one reasonable approach is to average the frame-time measurements over a small number of frames and use that as the next frame’s estimate of Δt . This allows the game to adapt to varying frame rate, while soft ening the eff ects of momentary performance spikes. The longer the averaging interval, the less responsive the game will be to varying frame rate, but spikes will have less of an impact as well. 7.5.2.4. Governing the Frame Rate We can avoid the inaccuracy of using last frame’s Δt as an estimate of this frame’s duration altogether, by fl ipping the problem on its head. Rather than trying to guess at what next frame’s duration will be, we can instead att empt to guarantee that every frame’s duration will be exactly 33.3 ms (or 16.6 ms if we’re running at 60 FPS). To do this, we measure the duration of the current frame as before. If the measured duration is less than the ideal frame time, we simply put the main thread to sleep until the target frame time has elapsed. If the measured duration is more than the ideal frame time, we must “take our lumps” and wait for one more whole frame time to elapse. This is called frame-rate governing . Clearly this approach only works when your game’s frame rate is reason- ably close to your target frame rate on average. If your game is ping-ponging between 30 FPS and 15 FPS due to frequent “slow” frames, then the game’s quality can degrade signifi cantly. As such, it’s still a good idea to design all engine systems so that they are capable of dealing with arbitrary frame dura- tions. During development, you can leave the engine in “variable frame rate” mode, and everything will work as expected. Later on, when the game is get- ting closer to achieving its target frame rate consistently, we can switch on frame-rate governing and start to reap its benefi ts. Keeping the frame rate consistent can be important for a number of rea- sons. Some engine systems, such as the numerical integrators used in a phys- ics simulation, operate best when updated at a constant rate. A consistent 7.5. Measuring and Dealing with Time 316 7. The Game Loop and Real-Time Simulation frame rate also looks bett er, and as we’ll see in the next section, it can be used to avoid the tearing that can occur when the video buff er is updated at a rate that doesn’t match the refresh rate of the monitor. In addition, when elapsed frame times are consistent, features like record and play back become a lot more reliable. As its name implies, the record and play back feature allows a player’s gameplay experience to be recorded and later played back in exactly the same way. This can be a fun game feature, and it’s also a valuable testing and debugging tool. For example, diffi cult-to-fi nd bugs can be reproduced by simply playing back a recorded game that dem- onstrates the bug. 
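Before we look at how record and play back can be implemented, here is a minimal sketch of the frame-rate governing scheme described in Section 7.5.2.4. The timer helpers match those used elsewhere in this chapter; sleepSeconds() is a hypothetical wrapper around a platform call such as Win32's Sleep().

#include <windows.h>

// Timer helpers as used elsewhere in this chapter.
U64 readHiResTimer();
U64 getHiResTimerFrequency();

// Hypothetical wrapper around Win32's Sleep(). OS sleep calls have coarse
// granularity, so a real implementation might sleep slightly short of the
// target and spin-wait the remainder.
void sleepSeconds(F32 seconds)
{
    Sleep((DWORD)(seconds * 1000.0f));
}

// Called at the end of each frame; tFrameBegin is the timer value that was
// read at the start of the frame.
void governFrameRate(U64 tFrameBegin, F32 targetDtSeconds)
{
    U64 tNow = readHiResTimer();
    F32 elapsedSeconds = (F32)(tNow - tFrameBegin)
                         / (F32)getHiResTimerFrequency();

    if (elapsedSeconds < targetDtSeconds)
    {
        // Finished early: wait out the remainder of the target frame time.
        sleepSeconds(targetDtSeconds - elapsedSeconds);
    }
    // If we overshot the target, this simplified sketch just returns; as
    // described above, a real implementation would "take its lumps" and
    // wait for the next whole frame boundary.
}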
To implement record and play back, we make note of every relevant event that occurs during gameplay, saving each one in a list along with an accurate time stamp. The list of events can then be replayed with exactly the same tim- ing, using the same initial conditions, and an identical initial random seed. In theory, doing this should produce a gameplay experience that is indis- tinguishable from the original playthrough. However, if the frame rate isn’t consistent, things may not happen in exactly the same order. This can cause “drift ,” and prett y soon your AI characters are fl anking when they should have fallen back. 7.5.2.5. The Vertical Blanking Interval A visual anomaly known as tearing occurs when the back buff er is swapped with the front buff er while the electron gun in the CRT monitor is only part way through its scan. When tearing occurs, the top portion of the screen shows the old image, while the bott om portion shows the new one. To avoid tearing, many rendering engines wait for the vertical blanking interval of the monitor (the time during which the electron gun is being reset to the top-left corner of the screen) before swapping buff ers. Waiting for the v-blank interval is another form of frame-rate governing . It eff ectively clamps the frame rate of the main game loop to a multiple of the screen’s refresh rate. For example, on an NTSC monitor that refreshes at a rate of 60 Hz, the game’s real update rate is eff ectively quantized to a multiple of 1/60 of a second. If more than 1/60 of a second elapses between frames, we must wait until the next v-blank interval, which means waiting 2/60 of a second (30 FPS). If we miss two v-blanks, then we must wait a total of 3/60 of a second (20 FPS), and so on. Also, be careful not to make assumptions about the frame rate of your game, even when it is synchronized to the v-blank in- terval; remember that the PAL and SECAM standards are based around an update rate of 50 Hz, not 60 Hz. 317 7.5.3. Measuring Real Time with a High-Resolution Timer We’ve talked a lot about measuring the amount of real “wall clock” time that elapses during each frame. In this section, we’ll investigate how such timing measurements are made in detail. Most operating systems provide a function for querying the system time, such as the standard C library function time(). However, such functions are not suitable for measuring elapsed times in a real-time game, because they do not provide suffi cient resolution. For example, time() returns an integer representing the number of seconds since midnight, January 1, 1970, so its reso- lution is one second—far too coarse, considering that a frame takes only tens of milliseconds to execute. All modern CPUs contain a high-resolution timer , which is usually imple- mented as a hardware register that counts the number of CPU cycles (or some multiple thereof) that have elapsed since the last time the processor was pow- ered on or reset. This is the timer that we should use when measuring elapsed time in a game, because its resolution is usually on the order of the duration of a few CPU cycles. For example, on a 3 GHz Pentium processor, the high- resolution timer increments once per CPU cycle, or 3 billion times per second. Hence the resolution of the high-res timer is 1 / 3 billion = 3.33 × 10–10 seconds = 0.333 ns (one-third of a nanosecond). This is more than enough resolution for all of our time-measurement needs in a game. 
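To see concretely why a function like time() is unusable for frame timing, consider measuring a single frame with it. The fragment below is purely illustrative; the frame's work is elided.

#include <ctime>

time_t tBegin = time(NULL);
// ... run one iteration of the game loop (tens of milliseconds of work) ...
time_t tEnd = time(NULL);

// time() resolves only whole seconds, so this difference is almost always
// exactly zero -- useless as a per-frame delta.
double dtSeconds = difftime(tEnd, tBegin);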
Different microprocessors and different operating systems provide different ways to query the high-resolution timer. On a Pentium, a special instruction called rdtsc (read time-stamp counter) can be used, although the Win32 API wraps this facility in a pair of functions: QueryPerformanceCounter() reads the 64-bit counter register and QueryPerformanceFrequency() returns the number of counter increments per second for the current CPU. On a PowerPC architecture, such as the chips found in the Xbox 360 and PLAYSTATION 3, the instruction mftb (move from time base register) can be used to read the two 32-bit time base registers, while on other PowerPC architectures, the instruction mfspr (move from special-purpose register) is used instead.

A CPU's high-resolution timer register is 64 bits wide on most processors, to ensure that it won't wrap too often. The largest possible value of a 64-bit unsigned integer is 0xFFFFFFFFFFFFFFFF ≈ 1.8 × 10^19 clock ticks. So, on a 3 GHz Pentium processor that updates its high-res timer once per CPU cycle, the register's value will wrap back to zero once every 195 years or so—definitely not a situation we need to lose too much sleep over. In contrast, a 32-bit integer clock will wrap after only about 1.4 seconds at 3 GHz.

7.5.3.1. High-Resolution Clock Drift

Be aware that even timing measurements taken via a high-resolution timer can be inaccurate in certain circumstances. For example, on some multicore processors, the high-resolution timers are independent on each core, and they can (and do) drift apart. If you try to compare absolute timer readings taken on different cores to one another, you might end up with some strange results—even negative time deltas. Be sure to keep an eye out for these kinds of problems.

7.5.4. Time Units and Clock Variables

Whenever we measure or specify time durations in a game, we have two choices to make:

1. What time units should be used? Do we want to store our times in seconds, or milliseconds, or machine cycles... or in some other unit?
2. What data type should be used to store time measurements? Should we employ a 64-bit integer, or a 32-bit integer, or a 32-bit floating-point variable?

The answers to these questions depend on the intended purpose of a given measurement. This gives rise to two more questions: How much precision do we need? And what range of magnitudes do we expect to be able to represent?

7.5.4.1. 64-Bit Integer Clocks

We've already seen that a 64-bit unsigned integer clock, measured in machine cycles, supports both an extremely high precision (a single cycle is 0.333 ns in duration on a 3 GHz CPU) and a broad range of magnitudes (a 64-bit clock wraps once roughly every 195 years at 3 GHz). So this is the most flexible time representation, presuming you can afford 64 bits' worth of storage.

7.5.4.2. 32-Bit Integer Clocks

When measuring relatively short durations with high precision, we can turn to a 32-bit integer clock, measured in machine cycles. For example, to profile the performance of a block of code, we might do something like this:

// Grab a time snapshot.
U64 tBegin = readHiResTimer();

// This is the block of code whose performance we wish
// to measure.
doSomething();
doSomethingElse();
nowReallyDoSomething();

// Measure the duration.
U64 tEnd = readHiResTimer();
U32 dtCycles = static_cast<U32>(tEnd - tBegin);

// Now use or cache the value of dtCycles...
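As a concrete example of the Win32 route mentioned above, the hypothetical readHiResTimer() and getHiResTimerFrequency() helpers used throughout this chapter might be implemented roughly as follows on a Windows machine:

#include <windows.h>

U64 readHiResTimer()
{
    LARGE_INTEGER counter;
    QueryPerformanceCounter(&counter);     // read the 64-bit counter register
    return (U64)counter.QuadPart;
}

U64 getHiResTimerFrequency()
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency); // counter increments per second
    return (U64)frequency.QuadPart;
}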
Notice that in the profiling snippet above, we still store the raw time measurements in 64-bit integer variables. Only the time delta dtCycles is stored in a 32-bit variable. This circumvents potential problems with wrapping at the 32-bit boundary. For example, if tBegin == 0x12345678FFFFFFB7 and tEnd == 0x1234567900000039, then we would measure a negative time delta if we were to truncate the individual time measurements to 32 bits each prior to subtracting them.

7.5.4.3. 32-Bit Floating-Point Clocks

Another common approach is to store relatively small time deltas in floating-point format, measured in units of seconds. To do this, we simply divide a duration measured in CPU cycles by the CPU's clock frequency, which is in cycles per second. For example:

// Start off assuming an ideal frame time (30 FPS).
F32 dtSeconds = 1.0f / 30.0f;

// Prime the pump by reading the current time.
U64 tBegin = readHiResTimer();

while (true) // main game loop
{
    runOneIterationOfGameLoop(dtSeconds);

    // Read the current time again, and calculate the
    // delta.
    U64 tEnd = readHiResTimer();
    dtSeconds = (F32)(tEnd - tBegin)
                / (F32)getHiResTimerFrequency();

    // Use tEnd as the new tBegin for next frame.
    tBegin = tEnd;
}

Notice once again that we must be careful to subtract the two 64-bit time measurements before converting them into floating-point format. This ensures that we don't store too large a magnitude into a 32-bit floating-point variable.

7.5.4.4. Limitations of Floating-Point Clocks

Recall that in a 32-bit IEEE float, the 23 bits of the mantissa are dynamically distributed between the whole and fractional parts of the value, by way of the exponent (see Section 3.2.1.4). Small magnitudes require only a few bits, leaving plenty of bits of precision for the fraction. But once the magnitude of our clock grows too large, its whole part eats up more bits, leaving fewer bits for the fraction. Eventually, even the least-significant bits of the whole part become implicit zeros. This means that we must be cautious when storing long durations in a floating-point clock variable. If we keep track of the amount of time that has elapsed since the game was started, a floating-point clock will eventually become inaccurate to the point of being unusable.

Floating-point clocks are usually only used to store relatively short time deltas, measuring at most a few minutes, and more often just a single frame or less. If an absolute-valued floating-point clock is used in a game, you will need to reset the clock to zero periodically, to avoid accumulation of large magnitudes.

7.5.4.5. Other Time Units

Some game engines allow timing values to be specified in a game-defined unit that is fine-grained enough to permit a 32-bit integer format to be used, precise enough to be useful for a wide range of applications within the engine, and yet large enough that the 32-bit clock won't wrap too often. One common choice is a 1/300 second time unit. This works well because (a) it is fine-grained enough for many purposes, (b) it only wraps once every 165.7 days, and (c) it is an even multiple of both NTSC and PAL refresh rates. A 60 FPS frame would be 5 such units in duration, while a 50 FPS frame would be 6 units in duration.

Obviously a 1/300 second time unit is not precise enough to handle subtle effects, like time-scaling an animation. (If we tried to slow a 30 FPS animation down to less than 1/10 of its regular speed, we'd be out of precision!)
So for many purposes, it’s still best to use fl oating-point time units, or machine cycles. But a 1/300 second time unit can be used eff ectively for things like specifying how much time should elapse between the shots of an automatic weapon, or how long an AI -controlled character should wait before starting his patrol, or the amount of time the player can survive when standing in a pool of acid. 7.5.5. Dealing with Break Points When your game hits a break point, its loop stops running and the debug- ger takes over. However, the CPU continues to run, and the real-time clock continues to accrue cycles. A large amount of wall clock time can pass while you are inspecting your code at a break point. When you allow the program 321 to continue, this can lead to a measured frame time many seconds, or even minutes or hours in duration! Clearly if we allow such a huge delta-time to be passed to the subsystems in our engine, bad things will happen. If we are lucky, the game might con- tinue to function properly aft er lurching forward many seconds in a single frame. Worse, the game might just crash. A simple approach can be used to get around this problem. In the main game loop, if we ever measure a frame time in excess of some predefi ned up- per limit (e.g., 1/10 of a second), we can assume that we have just resumed ex- ecution aft er a break point, and we set the delta time artifi cially to 1/30 or 1/60 of a second (or whatever the target frame rate is). In eff ect, the game becomes frame-locked for one frame, in order to avoid a massive spike in the measured frame duration. // Start off assuming the ideal dt (30 FPS). F32 dt = 1.0f / 30.0f; // Prime the pump by reading the current time. U64 tBegin = readHiResTimer(); while (true) // main game loop { updateSubsystemA(dt); updateSubsystemB(dt); // ... renderScene(); swapBuffers(); // Read the current time again, and calculate an // estimate of next frame’s delta time. U64 tEnd = readHiResTimer(); dt = (F32)(tEnd – tBegin) / (F32) getHiResTimerFrequency(); // If dt is too large, we must have resumed from a // break point -- frame-lock to the target rate this // frame. if (dt > 1.0f/10.0f) { dt = 1.0f/30.0f; } // Use tEnd as the new tBegin for next frame. tBegin = tEnd; } 7.5. Measuring and Dealing with Time 322 7. The Game Loop and Real-Time Simulation 7.5.6. A Simple Clock Class Some game engines encapsulate their clock variables in a class. An engine might have a few instances of this class—one to represent real “wall clock” time, another to represent “game time” (which can be paused, slowed down or sped up relative to real time), another to track time for full-motion videos, and so on. A clock class is reasonably straightforward to implement. I’ll pres- ent a simple implementation below, making note of a few common tips, tricks, and pitfalls in the process. A clock class typically contains a variable that tracks the absolute time that has elapsed since the clock was created. As described above, it’s im- portant to select a suitable data type and time unit for this variable. In the following example, we’ll store absolute times in the same way the CPU does—with a 64-bit unsigned integer, measured in machine cycles. There are other possible implementations, of course, but this is probably the sim- plest. A clock class can support some nift y features, like time-scaling. This can be implemented by simply multiplying the measured time delta by an arbi- trary scale factor prior to adding it to the clock’s running total. 
We can also pause time by simply skipping its update while the clock is paused. Single- stepping a clock can be implemented by adding a fi xed time interval to a paused clock in response to a butt on press on the joypad or keyboard. All of this is demonstrated by the example Clock class shown below. class Clock { U64 m_timeCycles; F32 m_timeScale; bool m_isPaused; static F32 s_cyclesPerSecond; static inline U64 secondsToCycles(F32 timeSeconds) { return (U64)(timeSeconds * s_cyclesPerSecond); } // WARNING: Dangerous -- only use to convert small // durations into seconds. static inline F32 cyclesToSeconds(U64 timeCycles) { return (F32)timeCycles / s_cyclesPerSecond; } 323 public: // Call this when the game first starts up. static void init() { s_cyclesPerSecond = (F32)readHiResTimerFrequency(); } // Construct a clock. explicit Clock(F32 startTimeSeconds = 0.0f) : m_timeCycles( secondsToCycles(startTimeSeconds)), m_timeScale( 1.0f), // default to unscaled m_isPaused( false) // default to running { } // Return the current time in cycles. NOTE that we do // not return absolute time measurements in floating // point seconds, because a 32-bit float doesn’t have // enough precision. See calcDeltaSeconds(). U64 getTimeCycles() const { return m_timeCycles; } // Determine the difference between this clock’s // absolute time and that of another clock, in // seconds. We only return time deltas as floating // point seconds, due to the precision limitations of // a 32-bit float. F32 calcDeltaSeconds(const Clock& other) { U64 dt = m_timeCycles – other.m_timeCycles; return cyclesToSeconds(dt); } // This function should be called once per frame, // with the real measured frame time delta in seconds. void update(F32 dtRealSeconds) { if (!m_isPaused) { U64 dtScaledCycles = secondsToCycles( dtRealSeconds * m_timeScale); m_timeCycles += dtScaledCycles; } } 7.5. Measuring and Dealing with Time 324 7. The Game Loop and Real-Time Simulation void setPaused(bool isPaused) { m_isPaused = isPaused; } bool isPaused() const { return m_isPaused; } void setTimeScale(F32 scale) { m_timeScale = scale; } F32 getTimeScale() const { return m_timeScale; } void singleStep() { if (m_isPaused) { // Add one ideal frame interval; don’t forget // to scale it by our current time scale! U64 dtScaledCycles = secondsToCycles( ( 1.0f/30.0f) * m_timeScale); m_timeCycles += dtScaledCycles; } } }; 7.6. Multiprocessor Game Loops Now that we’ve investigated basic single-threaded game loops and learned some of the ways in which time is commonly measured and manipulated in a game engine, let’s turn our att ention to some more complex kinds of game loops. In this section, we’ll explore how game loops have evolved to take advantage of modern multiprocessor hardware. In the following sec- tion, we’ll see how networked multiplayer games typically structure their game loops. In 2004, microprocessor manufacturers industry-wide encountered a problem with heat dissipation that prevented them from producing faster 325 CPUs. Moore’s Law , which predicts an approximate doubling in transistor counts every 18 to 24 months, still holds true. But in 2004, its assumed cor- relation with doubling processor speeds was shown to be no longer val- id. As a result, microprocessor manufacturers shift ed their focus toward multicore CPUs. 
(For more information on this trend, see Microsoft ’s “The Manycore Shift Whitepaper,” available at htt p://www.microsoft post.com/ microsoft -download/the-manycore-shift -white-paper, and “Multicore Erod- ing Moore’s Law” by Dean Dauger, available at htt p://www.macresearch. org/multicore_eroding_moores_law.) The net eff ect on the soft ware industry was a major shift toward parallel processing techniques. As a result, mod- ern game engines running on multicore systems like the Xbox 360 and the PLAYSTATION 3 can no longer rely on a single main game loop to service their subsystems. The shift from single core to multicore has been painful. Multithreaded program design is a lot harder than single-threaded programming. Most game companies took on the transformation gradually, by selecting a hand- ful of engine subsystems for parallelization, and leaving the rest under the control of the old, single-threaded main loop. By 2008, most game studios had completed the transformation for the most part and have embraced parallel- ism to varying degrees within their engines. We don’t have room here for a full treatise on parallel programming architectures and techniques. (Refer to [20] for an in-depth discussion of this topic.) However, we will take a brief look at some of the most common ways in which game engines leverage multicore hardware. There are many diff erent soft ware architectures possible—but the goal of all of these archi- tectures is to maximize hardware utilization (i.e., to att empt to minimize the amount of time during which any particular hardware thread, core or CPU is idle). 7.6.1. Multiprocessor Game Console Architectures The Xbox 360 and the PLAYSTATION 3 are both multiprocessor consoles. In order to have a meaningful discussion of parallel soft ware architectures, let’s take a brief look at how these two consoles are structured internally. 7.6.1.1. Xbox 360 The Xbox 360 consists of three identical PowerPC processor cores. Each core has a dedicated L1 instruction cache and L1 data cache, and the three cores share a single L2 cache. The three cores and the GPU share a unifi ed 512 MB pool of RAM, which can be used for executable code, application data, tex- tures, video RAM—you name it. The Xbox 360 architecture is described in 7.6. Multiprocessor Game Loops 326 7. The Game Loop and Real-Time Simulation Main RAM (512 MB) PowerPC Core 0 PowerPC Core 1 PowerPC Core 2 L1 Data L1 Instr L1 Data L1 Instr L1 Data L1 Instr Shared L2 Cache GPU Figure 7.4. A simplifi ed view of the Xbox 360 hardware architecture. a great deal more depth in the PowerPoint presentation entited “Xbox 360 System Architecture,” by Jeff Andrews and Nick Baker of the Xbox Semicon- ductor Technology Group, available at htt p://www.hotchips.org/archives/ hc17/3_Tue/HC17.S8/HC17.S8T4.pdf. However, the preceding extremely brief overview should suffi ce for our purposes. Figure 7.4 shows the Xbox 360’s architecture in highly simplifi ed form. 7.6.1.2. PLAYSTATION 3 The PLAYSTATION 3 hardware makes use of the Cell Broadband Engine (CBE) architecture (see Figure 7.5), developed jointly by Sony, Toshiba, and IBM. The PS3 takes a radically diff erent approach to the one employed by the Xbox 360. Instead of three identical processors, it contains a number of diff er- ent types of processors, each designed for specifi c tasks. And instead of a uni- fi ed memory architecture, the PS3 divides its RAM into a number of blocks, each of which is designed for effi cient use by certain processing units in the system. 
The architecture is described in detail at htt p://www.blachford.info/ computer/Cell/Cell1_v2.html, but the following overview and the diagram shown in Figure 7.5 should suffi ce for our purposes. The PS3’s main CPU is called the Power Processing Unit (PPU). It is a PowerPC processor, much like the ones found in the Xbox 360. In addition to this central processor, the PS3 has six coprocessors known as Synergistic Processing Units (SPUs). These coprocessors are based around the PowerPC instruction set, but they have been streamlined for maximum performance. The GPU on the PS3 has a dedicated 256 MB of video RAM. The PPU has access to 256 MB of system RAM. In addition, each SPU has a dedicated high- speed 256 kB RAM area called its local store (LS). Local store memory performs about as effi ciently as an L1 cache, making the SPUs blindingly fast. 327 The SPUs never read directly from main RAM. Instead, a direct memory access (DMA) controller allows blocks of data to be copied back and forth between system RAM and the SPUs’ local stores. These data transfers happen in parallel, so both the PPU and SPUs can be doing useful calculations while they wait for data to arrive. 7.6.2. SIMD As we saw in Section 4.7, most modern CPUs (including the Xbox 360’s three PowerPC processors, and the PS3’s PPU and SPUs) provide a class of instruc- tions known as single instruction, multiple data (SIMD). Such instructions can perform a particular operation on more than one piece of data simultaneously, and as such they represent a fi ne-grained form of hardware parallelism. CPUs provide a number of diff erent SIMD instruction variants, but by far the most commonly-used in games are instructions that operate on four 32-bit fl oating- point values in parallel, because they allow 3D vector and matrix math to be performed four times more quickly than with their single instruction, single data (SISD) counterparts. Retrofi tt ing existing 3D math code to leverage SIMD instructions can be tricky, although the task is much easier if a well-encapsulated 3D math li- brary was used in the original code. For example, if a dot product is calcu- lated in long hand everywhere (e.g., float d = a.x * b.x + a.y * b.y + a.z * b.z;), then a very large amount of code will need to be re-writt en. However, if dot products are calculated by calling a function (e.g., float d = Dot(a, b);), and if vectors are treated largely as black boxes throughout the code base, then retrofi tt ing for SIMD can be accomplished by modifying the 7.6. Multiprocessor Game Loops Video RAM (256 MB) GPUSystem RAM (256 MB) ... PPU L1 Data L1 Instr L2 Cache SPU0 Local Store (256 kB) SPU1 Local Store (256 kB) SPU6 Local Store (256 kB) DMA ControllerDMA Bus Figure 7.5. Simplifi ed view of the PS3’s cell broadband architecture. 328 7. The Game Loop and Real-Time Simulation 3D math library, without having to modify much if any of the calling code (except perhaps to ensure alignment of vector data to 16-byte boundaries). 7.6.3. Fork and Join Another way to utilize multicore or multiprocessor hardware is to adapt di- vide-and-conquer algorithms for parallelism. This is oft en called the fork/join approach. The basic idea is to divide a unit of work into smaller subunits, dis- tribute these workloads onto multiple processing cores or hardware threads (fork), and then merge the results once all workloads have been completed (join). 
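A bare-bones sketch of this fork/join pattern using Win32 threads might look like the following. The WorkItem type and the doWork() function are hypothetical stand-ins for whatever workload is being parallelized.

#include <windows.h>

static const int kNumWorkers = 3;  // tuned to the available cores/hardware threads

struct WorkItem
{
    int firstElement;  // index of the first element in this batch
    int numElements;   // number of elements this worker should process
};

// Hypothetical: process one batch of work.
void doWork(const WorkItem& item)
{
    // ... operate on elements [firstElement, firstElement + numElements) ...
}

DWORD WINAPI workerThreadProc(LPVOID param)
{
    doWork(*static_cast<WorkItem*>(param));
    return 0;
}

void forkAndJoin(WorkItem items[kNumWorkers])
{
    HANDLE threads[kNumWorkers];

    // Fork: hand one batch to each worker thread.
    for (int i = 0; i < kNumWorkers; ++i)
    {
        threads[i] = CreateThread(NULL, 0, workerThreadProc,
                                  &items[i], 0, NULL);
    }

    // Join: block until every worker has finished its batch.
    WaitForMultipleObjects(kNumWorkers, threads, TRUE, INFINITE);

    for (int i = 0; i < kNumWorkers; ++i)
    {
        CloseHandle(threads[i]);
    }
}

In practice, creating and destroying threads every time is expensive, so engines usually keep a persistent pool of worker threads and hand them batches of work instead.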
When applied to the game loop, the fork/join architecture results in a main loop that looks very similar to its single-threaded counterpart, but with some of the major phases of the update loop being parallelized. This architec- ture is illustrated in Figure 7.6. Let’s take a look at a concrete example. Blending animations using linear interpolation (LERP) is an operation that can be done on each joint indepen- dently of all other joints within a skeleton (see Section 11.5.2.2). We’ll assume that we want to blend pairs of skeletal poses for fi ve characters, each of which has 100 joints, meaning that we need to process 500 pairs of joint poses. To parallelize this task, we can divide the work into N batches, each con- taining roughly 500/N joint-pose pairs, where N is selected based on the avail- Main Thread HID Update Game Objects Ragdoll Physics Post Animation Game Object Update Fork Join Fork Join etc. Pose Blending Pose Blending Pose Blending Simulate/ Integrate Simulate/ Integrate Simulate/ Integrate Figure 7.6. Fork and join used to parallelize selected CPU-intensive phases of the game loop. 329 able processing resources. (On the Xbox 360, N should probably be 3 or 6, because the console has three cores with two hardware threads each. On a PS3, N might range anywhere from 1 to 6, depending on how many SPUs are available.) We then “fork” (i.e., create) N threads, requesting each one to work on a diff erent group of pose pairs. The main thread can either continue doing some useful but work that is independent of the animation blending task, or it can go to sleep , waiting on a semaphore that will tell it when all of the worker threads have completed their tasks. Finally, we “join” the individual resultant joint poses into a cohesive whole—in this case, by calculating the fi nal global pose of each of our fi ve skeletons. (The global pose calculation needs access to the local poses of all the joints in each skeleton, so it doesn’t parallelize well within a single skeleton. However, we could imagine forking again to calculate the global pose, this time with each thread working on one or more whole skeletons.) You can fi nd sample code illustrating how to fork and join worker threads using Win32 system calls at htt p://msdn.microsoft .com/en-us/library/ ms682516(VS.85).aspx. 7.6.4. One Thread per Subsystem Yet another approach to multitasking is to assign particular engine subsys- tems to run in separate threads . A master thread controls and synchronizes the operations of these secondary subsystem threads and also continues to handle the lion’s share of the game’s high-level logic (the main game loop ). On a hardware platform with multiple physical CPUs or hardware threads, this design allows these threaded engine subsystems to execute in parallel. This design is well suited to any engine subsystem that performs a relative- ly isolated function repeatedly, such as a rendering engine, physics simula- tion, animation pipeline, or audio engine. The architecture is depicted in Figure 7.7. Threaded architectures are usually supported by some kind of thread library on the target hardware system. On a personal computer running Windows, the Win32 thread API is usually used. On a UNIX-based system, a library like pthreads might be the best choice. On the PLAYSTATION 3, a library known as SPURS permits workloads to be run on the six synergistic processing units (SPUs). SPURS provides two primary ways to run code on the SPUs—the task model and the job model . 
The task model can be used to segregate engine subsystems into coarse-grained independent units of execu- tion that act very much like threads. We’ll discuss the SPURS job model in the next section. 7.6. Multiprocessor Game Loops 330 7. The Game Loop and Real-Time Simulation 7.6.5. Jobs One problem with the multithreaded approach is that each thread represents a relatively coarse-grained chunk of work (e.g., all animation tasks are in one thread, all collision and physics tasks in another). This can place restrictions on how the various processors in the system can be utilized . If one of the subsystem threads has not completed its work, the progress of other threads, including that of the main game loop, may be blocked. Another way to take advantage of parallel hardware architecture is to divide up the work that is done by the game engine into multiple small, rela- tively independent jobs . A job is best thought of as a pairing between a chunk of data and a bit of code that operates on that data. When a job is ready to be run, it is placed on a queue, to be picked up and worked on by the next avail- able processing unit. This approach is supported on the PLAYSTATION 3 via the SPURS job model. The main game loop runs on the PPU, and the six SPUs are used as job processors. Each job’s code and data are sent to an SPU’s local store via a DMA transfer. The SPU processes the job, and then it DMAs its results back to main RAM. Figure 7.7. One thread per major engine subsystem. 331 As shown in Figure 7.8, the fact that jobs are relatively fi ne-grained and independent of one another helps to maximize processor utilization. It can also reduce or eliminate some of the restrictions placed on the main thread in the one-thread-per-subsystem design. This architecture also scales up or down naturally to hardware with any number of processing units (something the one-thread-per-subsystem architecture does not do particularly well). 7.6.6. Asynchronous Program Design When writing or retrofi tt ing a game engine to take advantage of multitasking hardware, programmers must be careful to design their code in an asynchro- nous manner. This means that the results of an operation will usually not be available immediately aft er requesting it, as they would be in a synchronous design. For example, a game might request that a ray be cast into the world, in order to determine whether the player has line-of-sight to an enemy character. In a synchronous design, the ray cast would be done immediately in response to the request, and when the ray casting function returned, the results would be available, as shown below. Figure 7.8. In a job architecture, work is broken down into fi ne-grained chunks that can be picked up by any available processor. This can help maximize processor utilization, while providing the main game loop with improved fl exibility. 7.6. Multiprocessor Game Loops 332 7. The Game Loop and Real-Time Simulation while (true) // main game loop { // ... // Cast a ray to see if the player has line of sight // to the enemy. RayCastResult r = castRay(playerPos, enemyPos); // Now process the results... if (r.hitSomething() && isEnemy(r.getHitObject())) { // Player can see the enemy. // ... } // ... } In an asynchronous design, a ray cast request would be made by calling a function that simply sets up and enqueues a ray cast job, and then returns immediately. The main thread can continue doing other unrelated work while the job is being processed by another CPU or core. 
Later, once the job has been completed, the main thread can pick up the results of the ray cast query and process them: while (true) // main game loop { // ... // Cast a ray to see if the player has line of sight // to the enemy. RayCastResult r; requestRayCast(playerPos, enemyPos, &r); // Do other unrelated work while we wait for the // other CPU to perform the ray cast for us. // ... // OK, we can’t do any more useful work. Wait for the // results of our ray cast job. If the job is // complete, this function will return immediately. // Otherwise, the main thread will idle until they // are ready... waitForRayCastResults(&r); // Process results... if (r.hitSomething() && isEnemy(r.getHitObject())) { // Player can see the enemy. // ... } 333 // ... } In many instances, asynchronous code can kick off a request on one frame, and pick up the results on the next. In this case, you may see code that looks like this: RayCastResult r; bool rayJobPending = false; while (true) // main game loop { // ... // Wait for the results of last frame’s ray cast job. if (rayJobPending) { waitForRayCastResults(&r); // Process results... if (r.hitSomething() && isEnemy(r.getHitObject())) { // Player can see the enemy. // ... } } // Cast a new ray for next frame. rayJobPending = true; requestRayCast(playerPos, enemyPos, &r); // Do other work... // ... } 7.7. Networked Multiplayer Game Loops The game loop of a networked multiplayer game is particularly interesting, so we’ll have a brief look at how such loops are structured. We don’t have room here to go into the all of the details of how multiplayer games work. (Refer to [3] for an excellent in-depth discussion of the topic.) However, we’ll provide a brief overview of the two most-common multiplayer architectures here, and then look at how these architectures aff ect the structure of the game loop. 7.7. Networked Multiplayer Game Loops 334 7. The Game Loop and Real-Time Simulation 7.7.1. Client-Server In the client-server model, the vast majority of the game’s logic runs on a single server machine. Hence the server’s code closely resembles that of a non-net- worked single-player game. Multiple client machines can connect to the server in order to take part in the online game. The client is basically a “dumb” ren- dering engine that also reads human interface devices and controls the local player character, but otherwise simply renders whatever the server tells it to render. Great pains are taken in the client code to ensure that the inputs of the local human player are immediately translated into the actions of the player’s character on-screen. This avoids what would otherwise be an extremely an- noying sense of delayed reaction on the part of the player character. But other than this so-called player prediction code, the client is usually not much more than a rendering and audio engine, combined with some networking code. The server may be running on a dedicated machine, in which case we say it is running in dedicated server mode. However, the client and server needn’t be on separate machines, and in fact it is quite typical for one of the client ma- chines to also be running the server. In fact, in many client-server multiplayer games, the single-player game mode is really just a degenerate multiplayer game, in which there is only one client, and both the client and server are run- ning on the same machine. This is known as client-on-top-of-server mode. The game loop of a client-server multiplayer game can be implemented in a number of diff erent ways. 
Since the client and server are conceptually separate entities, they could be implemented as entirely separate processes (i.e., separate applications). They could also be implemented as two separate threads of execution, within a single process. However, both of these ap- proaches require quite a lot of overhead to permit the client and server to communicate locally, when being run in client-on-top-of-server mode. As a result, a lot of multiplayer games run both client and server in a single thread, serviced by a single game loop. It’s important to realize that the client and server code can be updated at diff erent rates. For example, in Quake, the server runs at 20 FPS (50 ms per frame), while the client typically runs at 60 FPS (16.6 ms per frame). This is implemented by running the main game loop at the faster of the two rates (60 FPS) and then servicing the server code once roughly every three frames. In reality, the amount of time that has elapsed since the last server update is tracked, and when it reaches or exceeds 50 ms, a server frame is run and the timer is reset. Such a game loop might look something like this: F32 dtReal = 1.0f/30.0f; // the real frame delta time F32 dtServer = 0.0f; // the server’s delta time 335 U64 tBegin = readHiResTimer(); while (true) // main game loop { // Run the server at 50 ms intervals. dtServer += dtReal; if (dtServer >= 0.05f) // 50 ms { runServerFrame(0.05f); dtServer -= 0.05f; // reset for next update } // Run the client at maximum frame rate. runClientFrame(dtReal); // Read the current time, and calculate an estimate // of next frame’s real delta time. U64 tEnd = readHiResTimer(); dtReal = (F32)(tEnd – tBegin) / (F32)getHiResTimerFrequency(); // Use tEnd as the new tBegin for next frame. tBegin = tEnd; } 7.7.2. Peer-to-Peer In the peer-to-peer multiplayer architecture, every machine in the online game acts somewhat like a server, and somewhat like a client. One and only one machine has authority over each dynamic object in the game. So each machine acts like a server for those objects over which it has authority. For all other ob- jects in the game world, the machine acts like a client, rendering the objects in whatever state is provided to it by that object’s remote authority. The structure of a peer-to-peer multiplayer game loop is much simpler than a client-server game loop, in that at the top-most level, it looks very much like a single-player game loop. However, the internal details of the code can be a bit more confusing. In a client-server model, it is usually quite clear which code is running on the server and which code is client-side. But in a peer-to- peer architecture, much of the code needs to be set up to handle two possible cases: one in which the local machine has authority over the state of an object in the game, and one in which the object is just a dumb proxy for a remote authoritative representation. These two modes of operation are oft en imple- mented by having two kinds of game objects—a full-fl edged “real” game ob- 7.7. Networked Multiplayer Game Loops 336 7. The Game Loop and Real-Time Simulation ject, over which the local machine has authority and a “proxy ” version that contains a minimal subset of the state of the remote object. Peer-to-peer architectures are made even more complex because author- ity over an object sometimes needs to migrate from machine to machine. 
For example, if one computer drops out of the game, all of the objects over which it had authority must be picked up by the other machines in the game. Like- wise, when a new machine joins the game, it should ideally take over author- ity of some game objects from other machines, in order to balance the load. The details are beyond the scope of this book. The key point here is that multi- player architectures can have profound eff ects on the structure of a game’s main loop. 7.7.3. Case Study: Quake II The following is an excerpt from the Quake II game loop . The source code for Quake, Quake II, and Quake 3 Arena is available on Id Soft ware’s website, htt p:// www.idsoft ware.com. As you can see, all of the elements we’ve discussed are present, including the Windows message pump (in the Win32 version of the game), calculation of the real frame delta time , fi xed-time and time-scaled modes of operation, and servicing of both server-side and client-side engine systems. int WINAPI WinMain (HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow) { MSG msg; int time, oldtime, newtime; char *cddir; ParseCommandLine (lpCmdLine); Qcommon_Init (argc, argv); oldtime = Sys_Milliseconds (); /* main window message loop */ while (1) { // Windows message pump. while (PeekMessage (&msg, NULL, 0, 0, PM_NOREMOVE)) { if (!GetMessage (&msg, NULL, 0, 0)) Com_Quit (); sys_msg_time = msg.time; 337 TranslateMessage (&msg); DispatchMessage (&msg); } // Measure real delta time in milliseconds. do { newtime = Sys_Milliseconds (); time = newtime - oldtime; } while (time < 1); // Run a frame of the game. Qcommon_Frame (time); oldtime = newtime; } // never gets here return TRUE; } void Qcommon_Frame (int msec) { char *s; int time_before, time_between, time_after; // [some details omitted...] // Handle fixed-time mode and time scaling. if (fixedtime->value) msec = fixedtime->value; else if (timescale->value) { msec *= timescale->value; if (msec < 1) msec = 1; } // Service the in-game console. do { s = Sys_ConsoleInput (); if (s) Cbuf_AddText (va("%s\n",s)); } while (s); Cbuf_Execute (); 7.7. Networked Multiplayer Game Loops 338 7. The Game Loop and Real-Time Simulation // Run a server frame. SV_Frame (msec); // Run a client frame. CL_Frame (msec); // [some details omitted...] } 339 8 Human Interface Devices (HID) Games are interactive computer simulations, so the human player(s) need some way of providing inputs to the game. All sorts of human interface devices (HID) exist for gaming, including joysticks, joypads, keyboards and mice, track balls, the Wii remote, and specialized input devices like steering wheels, fi shing rods, dance pads, and even electric guitars. In this chapter, we’ll investigate how game engines typically read, process, and utilize the inputs from human interface devices. We’ll also have a look at how outputs from these devices provide feedback to the human player. 8.1. Types of Human Interface Devices A wide range of human interface devices are available for gaming purposes. Consoles like the Xbox 360 and PS3 come equipped with joypad controllers, as shown in Figure 8.1. Nintendo’s Wii console is well known for its unique and innovative WiiMote controller, shown in Figure 8.2. PC games are generally either controlled via a keyboard and the mouse, or via a joypad. (Microsoft designed the Xbox 360 joypad so that it can be used both on the Xbox 360 and on Windows/DirectX PC platforms.) 
As shown in Figure 8.3, arcade machines have one or more built-in controllers, such as a joystick and various butt ons, or a track ball, a steering wheel, etc. An arcade machine’s input device is usually 340 8. Human Interface Devices (HID) Figure 8.1. Standard joypads for the Xbox 360 and PLAYSTATION 3 consoles. Figure 8.2. The innovative WiiMote for the Nintendo Wii. Figure 8.3. Various custom input devices for the arcade game Mortal Kombat II by Midway. Figure 8.4. Many specialized input devices are available for use with consoles. 341 8.2. Interfacing with a HID somewhat customized to the game in question, although input hardware is oft en re-used among arcade machines produced by the same manufacturer. On console platforms, specialized input devices and adapters are usually available, in addition to the “standard” input device such as the joypad. For example, guitar and drum devices are available for the Guitar Hero series of games, steering wheels can be purchased for driving games, and games like Dance Dance Revolution use a special dance pad device. Some of these devices are shown in Figure 8.4. The Nintendo WiiMote is one of the most fl exible input devices on the market today. As such, it is oft en adapted to new purposes, rather than re- placed with an entirely new device. For example, Mario Kart Wii comes with a pastic steering wheel adapter into which the WiiMote can be inserted (see Figure 8.5). 8.2. Interfacing with a HID All human interface devices provide input to the game soft ware, and some also allow the soft ware to provide feedback to the human player via various kinds of outputs as well. Game soft ware reads and writes HID inputs and outputs in various ways, depending on the specifi c design of the device in question. 8.2.1. Polling Some simple devices, like game pads and old-school joysticks, are read by polling the hardware periodically (usually once per iteration of the main game loop). This means explicitly querying the state of the device, either by read- ing hardware registers directly, reading a memory-mapped I/O port, or via a higher-level soft ware interface (which, in turn, reads the appropriate registers or memory-mapped I/O ports). Likewise, outputs might be sent to the HID by Figure 8.5. Steering wheel adapter for the Nintendo Wii. 342 8. Human Interface Devices (HID) writing to special registers or memory-mapped I/O addresses, or via a higher- level API that does our dirty work for us. Microsoft ’s XInput API, for use with Xbox 360 game pads on both the Xbox 360 and Windows PC platforms, is a good example of a simple polling mechanism. Every frame, the game calls the function XInputGetState(). This function communicates with the hardware and/or drivers, reads the data in the appropriate way, and packages it all up for convenient use by the soft - ware. It returns a pointer to an XINPUT_STATE struct, which in turn contains an embedded instance of a struct called XINPUT_GAMEPAD. This struct con- tains the current states of all of the controls (butt ons, thumb sticks, and trig- gers) on the device. 8.2.2. Interrupts Some HIDs only send data to the game engine when the state of the controller changes in some way. For example, a mouse spends a lot of its time just sitt ing still on the mouse pad. There’s no reason to send a continuous stream of data between the mouse and the computer when the mouse isn’t moving—we need only transmit information when it moves, or a butt on is pressed or released. 
This kind of device usually communicates with the host computer via hardware interrupts . An interrupt is an electronic signal generated by the hard- ware, which causes the CPU to temporarily suspend execution of the main program and run a small chunk of code called an interrupt service routine (ISR). Interrupts are used for all sorts of things, but in the case of a HID, the ISR code will probably read the state of the device, store it off for later processing, and then relinquish the CPU back to the main program. The game engine can pick up the data the next time it is convenient to do so. 8.2.3. Wireless Devices The inputs and outputs of a Bluetooth device, like the WiiMote, the DualShock 3 and the Xbox 360 wireless controller, cannot be read and writ- ten by simply accessing registers or memory-mapped I/O ports. Instead, the soft ware must “talk” to the device via the Bluetooth protocol. The soft ware can request the HID to send input data (such as the states of its butt ons) back to the host, or it can send output data (such as rumble sett ings or a stream of audio data) to the device. This communication is oft en handled by a thread separate from the game engine’s main loop, or at least encapsulated behind a relatively simple interface that can be called from the main loop. So from the point of view of the game programmer, the state of a Bluetooth device can be made to look prett y much indistinguishable from a traditional polled device. 343 8.3. Types of Inputs 8.3. Types of Inputs Although human interface devices for games vary widely in terms of form factor and layout, most of the inputs they provide fall into one of a small num- ber of categories. We’ll investigate each category in depth below. 8.3.1. Digital Buttons Almost every HID has at least a few digital butt ons . These are butt ons that can only be in one of two states: pressed and not pressed. Game programmers oft en refer to a pressed butt on as being down and a non-pressed butt on as being up. Electrical engineers speak of a circuit containing a switch as being closed (meaning electricity is fl owing through the circuit) or open (no electricity is fl owing—the circuit has infi nite resistance). Whether closed corresponds to pressed or not pressed depends on the hardware. If the switch is normally open, then when it is not pressed (up), the circuit is open, and when it is pressed (down), the circuit is closed. If the switch is normally closed, the reverse is true— the act of pressing the butt on opens the circuit. In soft ware, the state of a digital butt on (pressed or not pressed) is usually represented by a single bit. It’s common for 0 to represent not pressed (up) and 1 to represent pressed (down). But again, depending on the nature of the circuitry, and the decisions made by the programmers who wrote the device driver, the sense of these values might be reversed. It is quite common for the states of all of the butt ons on a device to be packed into a single unsigned integer value. For example, in Microsoft ’s XInput API, the state of the Xbox 360 joypad is returned in a struct called XINPUT_GAMEPAD, shown below. typedef struct _XINPUT_GAMEPAD { WORD wButtons; BYTE bLeftTrigger; BYTE bRightTrigger; SHORT sThumbLX; SHORT sThumbLY; SHORT sThumbRX; SHORT sThumbRY; } XINPUT_GAMEPAD; This struct contains a 16-bit unsigned integer (WORD) variable named wButtons that holds the state of all butt ons. The following masks defi ne 344 8. Human Interface Devices (HID) which physical butt on corresponds to each bit in the word. 
(Note that bits 10 and 11 are unused.) #define XINPUT_GAMEPAD_DPAD_UP 0x0001 // bit 0 #define XINPUT_GAMEPAD_DPAD_DOWN 0x0002 // bit 1 #define XINPUT_GAMEPAD_DPAD_LEFT 0x0004 // bit 2 #define XINPUT_GAMEPAD_DPAD_RIGHT 0x0008 // bit 3 #define XINPUT_GAMEPAD_START 0x0010 // bit 4 #define XINPUT_GAMEPAD_BACK 0x0020 // bit 5 #define XINPUT_GAMEPAD_LEFT_THUMB 0x0040 // bit 6 #define XINPUT_GAMEPAD_RIGHT_THUMB 0x0080 // bit 7 #define XINPUT_GAMEPAD_LEFT_SHOULDER 0x0100 // bit 8 #define XINPUT_GAMEPAD_RIGHT_SHOULDER 0x0200 // bit 9 #define XINPUT_GAMEPAD_A 0x1000 // bit 12 #define XINPUT_GAMEPAD_B 0x2000 // bit 13 #define XINPUT_GAMEPAD_X 0x4000 // bit 14 #define XINPUT_GAMEPAD_Y 0x8000 // bit 15 An individual butt on’s state can be read by masking the wButtons word with the appropriate bit mask via C/C++’s bitwise AND operator (&) and then checking if the result is non-zero. For example, to determine if the A butt on is pressed (down), we would write: bool IsButtonADown(const XINPUT_GAMEPAD& pad) { // Mask off all bits but bit 12 (the A button). return ((pad.wButtons & XINPUT_GAMEPAD_A) != 0); } 8.3.2. Analog Axes and Buttons An analog input is one that can take on a range of values (rather than just 0 or 1). These kinds of inputs are oft en used to represent the degree to which a trigger is pressed, or the two-dimensional position of a joystick (which is represented using two analog inputs, one for the x-axis and one for the y-axis, 345 as shown in Figure 8.6). Because of this common usage, analog inputs are sometimes called analog axes , or just axes. On some devices, certain butt ons are analog as well, meaning that the game can actually detect how hard the player is pressing on them. However, the signals produced by analog butt ons are usually too noisy to be particu- larly usable. I have yet to see a game that uses analog butt on inputs eff ectively (although some may very well exist!) Strictly speaking, analog inputs are not really analog by the time they make it to the game engine. An analog input signal is usually digitized, mean- ing it is quantized and represented using an integer in soft ware. For example, an analog input might range from –32,768 to 32,767 if represented by a 16-bit signed integer. Sometimes analog inputs are converted to fl oating-point— the values might range from –1 to 1, for instance. But as we know from Sec- tion 3.2.1.3, fl oating-point numbers are really just quantized digital values as well. Reviewing the defi nition of XINPUT_GAMEPAD (repeated below), we can see that Microsoft chose to represent the defl ections of the left and right thumb sticks on the Xbox 360 gamepad using 16-bit signed integers (sThumbLX and sThumbLY for the left stick and sThumbRX and sThumbRY for the right). Hence, these values range from –32,768 (left or down) to 32,767 (right or up). However, to represent the positions of the left and right shoulder triggers, Microsoft chose to use 8-bit unsigned integers (bLeftTrigger and bRight- Trigger respectively). These input values range from 0 (not pressed) to 255 (fully pressed). Diff erent game machines use diff erent digital representions for their analog axes. typedef struct _XINPUT_GAMEPAD { WORD wButtons; x y (1, 1) (–1, –1) (0.1, 0.3) Figure 8.6. Two analog inputs can be used to represent the x and y defl ection of a joystick. 8.3. Types of Inputs 346 8. 
Human Interface Devices (HID) // 8-bit unsigned BYTE bLeftTrigger; BYTE bRightTrigger; // 16-bit signed SHORT sThumbLX; SHORT sThumbLY; SHORT sThumbRX; SHORT sThumbRY; } XINPUT_GAMEPAD; 8.3.3. Relative Axes The position of an analog butt on, trigger, joystick, or thumb stick is absolute, meaning that there is a clear understanding of where zero lies. However, the inputs of some devices are relative . For these devices, there is no clear location at which the input value should be zero. Instead, a zero input indicates that the position of the device has not changed, while non-zero values represent a delta from the last time the input value was read. Examples include mice, mouse wheels, and track balls. 8.3.4. Accelerometers The PLAYSTATION 3’s Sixaxis and DualShock 3 joypads, and the Nintendo WiiMote , all contain acceleration sensors (accelerometers ). These devices can detect acceleration along the three principle axes (x, y, and z), as shown in Fig- ure 8.7. These are relative analog inputs, much like a mouse’s two-dimensional axes. When the controller is not accelerating these inputs are zero, but when the controller is accelerating, they measure the acceleration up to ±3 g along each axis, quantized into three signed 8-bit integers, one for each of x, y, and z. x y z Figure 8.7. Accelerometer axes for the WiiMote. 8.3.5. 3D Orientation with the WiiMote or Sixaxis Some Wii and PS3 games make use of the three accelerometers in the WiiMote or Sixaxis joypad to estimate the orientation of the controller in the player’s 347 hand. For example, in Super Mario Galaxy, Mario hops onto a large ball and rolls it around with his feet. To control Mario in this mode, the WiiMote is held with the IR sensor facing the ceiling. Tilting the WiiMote left , right, forward, or back causes the ball to accelerate in the corresponding direction. A trio of accelerometers can be used to detect the orientation of the WiiMote or Sixaxis joypad, because of the fact that we are playing these games on the surface of the Earth where there is a constant downward acceleration due to gravity of 1g (≈ 9.8 m/s2). If the controller is held perfectly level, with the IR sensor pointing toward your TV set, the vertical (z) acceleration should be approximately –1 g. If the controller is held upright, with the IR sensor pointing toward the ceiling, we would expect to see a 0 g acceleration on the z sensor, and +1 g on the y sensor (because it is now experiencing the full gravitational eff ect). Holding the WiiMote at a 45-degree angle should produce roughly sin(45°) = cos(45°) = 0.707 g on both the y and z inputs. Once we’ve calibrated the accel- erometer inputs to fi nd the zero points along each axis, we can calculate pitch, yaw, and roll easily, using inverse sine and cosine operations. Two caveats here: First, if the person holding the WiiMote is not hold- ing it still, the accelerometer inputs will include this acceleration in their val- ues, invalidating our math. Second, the z-axis of the accelerometer has been calibrated to account for gravity, but the other two axes have not. This means that the z-axis has less precision available for detecting orientation. Many Wii games request that the user hold the WiiMote in a non-standard orientation, such as with the butt ons facing the player’s chest, or with the IR sensor point- ing toward the ceiling. This maximizes the precision of the orientation read- ing, by placing the x- or y-accelerometer axis in line with gravity, instead of the gravity-calibrated z- axis. 
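As a concrete illustration of this kind of calculation, the sketch below estimates the controller's pitch from calibrated y and z accelerometer readings, using the sign convention described above (roughly –1 g on z when the controller is held level and +1 g on y when it is held upright). It is a minimal, hypothetical example: the struct and function names are invented, and the result is only meaningful while the controller is being held still.

#include <cmath>

// Calibrated accelerometer readings, expressed in g, with the
// zero points already subtracted out. (Hypothetical struct.)
struct AccelSample
{
    float x, y, z;
};

// Estimate pitch (rotation about the controller's x-axis) in
// radians. When the controller is level (y = 0, z = -1) the
// result is 0; when it is upright (y = +1, z = 0) the result
// is +pi/2; at a 45-degree tilt it is roughly +pi/4.
float EstimatePitch(const AccelSample& a)
{
    // atan2() covers the full range of angles and tolerates
    // slightly noisy readings, unlike asin(a.y), which would
    // require clamping its argument to [-1, 1].
    return std::atan2(a.y, -a.z);
}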
For more information on this topic, see htt p://druid. caughq.org/presentations/turbo/Wiimote-Hacking.pdf and htt p://www.wiili. org/index.php/Motion_analysis. 8.3.6. Cameras The WiiMote has a unique feature not found on any other standard console HID—an infrared (IR) sensor. This sensor is essentially a low-resolution cam- era that records a two-dimension infrared image of whatever the WiiMote is pointed at. The Wii comes with a “sensor bar” that sits on top of your televi- sion set and contains two infrared light emitt ing diodes (LEDs). In the image recorded by the IR camera, these LEDs appear as two bright dots on an oth- erwise dark background. Image processing soft ware in the WiiMote analyzes the image and isolates the location and size of the two dots. (Actually, it can detect and transmit the locations and sizes of up to four dots.) This position 8.3. Types of Inputs 348 8. Human Interface Devices (HID) and size information can be read by the console via a Bluetooth wireless con- nection. The position and orientation of the line segment formed by the two dots can be used to determine the pitch, yaw, and roll of the WiiMote (as long as it is being pointed toward the sensor bar). By looking at the separation between the dots, soft ware can also determine how close or far away the WiiMote is from the TV. Some soft ware also makes use of the sizes of the dots. This is il- lustrated in Figure 8.8. Another popular camera device is Sony’s EyeToy for the PlayStation line of consoles, shown in Figure 8.9. This device is basically a high quality color camera, which can be used for a wide range of applications. It can be used for simple video conferencing, like any web cam. It could also conceivably be used much like the WiiMote’s IR camera, for position, orientation, and depth sensing. The gamut of possibilities for these kinds of advanced input devices has only begun to be tapped by the gaming community. 8.4. Types of Outputs Human interface devices are primarily used to transmit inputs from the play- er to the game soft ware. However, some HIDs can also provide feedback to the human player via various kinds of outputs. 8.4.1. Rumble Game pads like the PlayStation’s DualShock line of controllers and the Xbox and Xbox 360 controllers have a rumble feature. This allows the controller to vibrate in the player’s hands, simulating the turbulence or impacts that the Figure 8.8. The Wii sensor bar houses two infrared LEDs which produce two bright spots on the image recorded by the WiiMote’s IR camera. Figure 8.9. Sony’s Eye- Toy for the PlaySta- tion3. 349 8.4. Types of Outputs character in the game world might be experiencing. Vibrations are usually produced by one or more motors, each of which rotates a slightly unbalanced weight at various speeds. The game can turn these motors on and off , and con- trol their speeds to produce diff erent tactile eff ects in the player’s hands. 8.4.2. Force-Feedback Force feedback is a technique in which an actuator on the HID is driven by a motor in order to slightly resist the motion the human operator is trying to impart to it. It is common in arcade driving games, where the steering wheel resists the player’s att empt to turn it, simulating diffi cult driving conditions or tight turns. As with rumble, the game soft ware can typically turn the motor(s) on and off , and can also control the strength and direction of the forces ap- plied to the actuator. 8.4.3. Audio Audio is usually a stand-alone engine system. 
However, some HIDs provide outputs that can be utilized by the audio system. For example, the WiiMote contains a small, low-quality speaker. The Xbox 360 controller has a headset jack and can be used just like any USB audio device for both output (speak- ers) and input (microphone). One common use of USB headsets is for multi- player games, in which human players can communicate with one another via a voice over IP (VOIP) connection. 8.4.4. Other Inputs and Outputs Human interface devices may of course support many other kinds of inputs and outputs. On some older consoles like the Sega Dreamcast, the memory card slots were located on the game pad. The Xbox 360 game pad, the Sixaxis and DualShock 3, and the WiiMote all have four LEDs which can be illuminated by game soft ware if desired. And of course specialized devices like musical instru- ments, dance pads, etc. have their own particular kinds of inputs and outputs. Innovation is actively taking place in the fi eld of human interfaces. Some of the most interesting areas today are gestural interfaces and thought-con- trolled devices. We can certainly expect more innovation from console and HID manufacturers in years to come. 8.5. Game Engine HID Systems Most game engines don’t use “raw” HID inputs directly. The data is usually massaged in various ways to ensure that the inputs coming from the HID 350 8. Human Interface Devices (HID) translate into smooth, pleasing, intuitive behaviors in-game. In addition, most engines introduce at least one additional level of indirection between the HID and the game in order to abstract HID inputs in various ways. For example, a butt on-mapping table might be used to translate raw butt on inputs into logi- cal game actions, so that human players can re-assign the butt ons’ functions as they see fi t. In this section, we’ll outline the typical requirements of a game engine HID system and then explore each one in some depth. 8.5.1. Typical Requirements A game engine’s HID system usually provides some or all of the following features: dead zones, analog signal fi ltering, event detection (e.g., butt on up, butt on down), detection of butt on sequences and multibutt on combinations (known as chords), gesture detection, management of multiple HIDs for multiple players, multiplatform HID support, controller input re-mapping, context-sensitive inputs, the ability to temporarily disable certain inputs. 8.5.2. Dead Zone A joystick, thumb stick, shoulder trigger, or any other analog axis produces input values that range between a predefi ned minimum and maximum value, which we’ll call Imin and Imax. When the control is not being touched, we would expect it to produce a steady and clear “undisturbed” value, which we’ll call I0. The undisturbed value is usually numerically equal to zero, and it either lies half-way between Imin and Imax for a centered, two-way control like a joy- stick axis, or it coincides with Imin for a one-way control like a trigger. Unfortunately, because HIDs are analog devices by nature, the voltage pro- duced by the device is noisy, and the actual inputs we observe may fl uctuate slightly around I0. The most common solution to this problem is to introduce a small dead zone around I0. The dead zone might be defi ned as [I0 – δ , I0 + δ] for a joy stick, or [I0  , I0 + δ] for a trigger. Any input values that are within the dead zone are simply clamped to I0. 
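A minimal dead-zone clamp might look like the sketch below; the function and parameter names are illustrative rather than part of any particular engine's API.

// Clamp a raw analog input to its undisturbed value I0 whenever
// it falls within the dead zone [I0 - delta, I0 + delta]. For a
// one-way control such as a trigger, pass i0 = Imin so that the
// dead zone becomes [Imin, Imin + delta].
float ApplyDeadZone(float rawInput, float i0, float delta)
{
    if (rawInput > i0 - delta && rawInput < i0 + delta)
    {
        return i0; // indistinguishable from "not touched"
    }
    return rawInput;
}

Some engines go a step further and re-scale the values lying outside the dead zone so that the output still covers the full range from Imin to Imax, which avoids a small discontinuity at the dead zone boundary.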
The dead zone must be wide enough to account for the noisiest inputs generated by an undisturbed control, but small enough not to interfere with the player's sense of the HID's responsiveness.

8.5.3. Analog Signal Filtering

Signal noise is a problem even when the controls are not within their dead zones. This noise can sometimes cause the in-game behaviors controlled by the HID to appear jerky or unnatural. For this reason, many games filter the raw inputs coming from the HID. A noise signal is usually of a higher frequency than the signal produced by the human player. Therefore, one solution is to pass the raw input data through a simple low-pass filter, prior to it being used by the game.

A discrete first-order low-pass filter can be implemented by combining the current unfiltered input value with last frame's filtered input. If we denote the sequence of unfiltered inputs by the time-varying function u(t) and the filtered inputs by f(t), where t denotes time, then we can write

f(t) = (1 – a) f(t – Δt) + a u(t),    (8.1)

where the parameter a is determined by the frame duration Δt and a filtering constant RC (which is just the product of the resistance and the capacitance in a traditional analog RC low-pass filter circuit):

a = Δt / (RC + Δt).    (8.2)

This can be implemented trivially in C or C++ as follows, where it is assumed the calling code will keep track of last frame's filtered input for use on the subsequent frame. For more information, see http://en.wikipedia.org/wiki/Low-pass_filter.

F32 lowPassFilter(F32 unfilteredInput,
                  F32 lastFramesFilteredInput,
                  F32 rc,
                  F32 dt)
{
    F32 a = dt / (rc + dt);
    return (1 - a) * lastFramesFilteredInput
         + a * unfilteredInput;
}

Another way to filter HID input data is to calculate a simple moving average. For example, if we wish to average the input data over a 3/30 second (3 frame) interval, we simply store the raw input values in a 3-element circular buffer. The filtered input value is then the sum of the values in this array at any moment, divided by 3. There are a few minor details to account for when implementing such a filter. For example, we need to properly handle the first two frames of input, during which the 3-element array has not yet been filled with valid data. However, the implementation is not particularly complicated. The code below shows one way to properly implement an N-element moving average.

template< typename TYPE, int SIZE >
class MovingAverage
{
    TYPE m_samples[SIZE];
    TYPE m_sum;
    U32  m_curSample;
    U32  m_sampleCount;

public:
    MovingAverage() :
        m_sum(static_cast<TYPE>(0)),
        m_curSample(0),
        m_sampleCount(0)
    {
    }

    void addSample(TYPE data)
    {
        if (m_sampleCount == SIZE)
        {
            m_sum -= m_samples[m_curSample];
        }
        else
        {
            ++m_sampleCount;
        }

        m_samples[m_curSample] = data;
        m_sum += data;

        ++m_curSample;
        if (m_curSample >= SIZE)
        {
            m_curSample = 0;
        }
    }

    F32 getCurrentAverage() const
    {
        if (m_sampleCount != 0)
        {
            return static_cast<F32>(m_sum)
                 / static_cast<F32>(m_sampleCount);
        }
        return 0.0f;
    }
};

8.5.4. Detecting Input Events

The low-level HID interface typically provides the game with the current states of the device's various inputs. However, games are often interested in detecting events, such as changes in state, rather than just inspecting the current state each frame. The most common HID events are probably button down (pressed) and button up (released), but of course we can detect other kinds of events as well.

8.5.4.1.
Button Up and Button Down Let’s assume for the moment that our butt ons’ input bits are 0 when not pressed and 1 when pressed. The easiest way to detect a change in butt on state is to keep track of the butt ons’ state bits as observed last frame and compare them to the state bits observed this frame. If they diff er, we know an event occurred. The current state of each butt on tells us whether the event is a butt on-up or a butt on-down. We can use simple bit-wise operators to detect butt on-down and but- ton-up events. Given a 32-bit word buttonStates, containing the current state bits of up to 32 butt ons, we want to generate two new 32-bit words: one for butt on-down events which we’ll call buttonDowns and one for butt on-up events which we’ll call buttonUps. In both cases, the bit corre- sponding to each butt on will be 0 if the event has not occurred this frame and 1 if it has. To implement this, we also need last frame’s butt on states, prevButtonStates. The exclusive OR (XOR) operator produces a 0 if its two inputs are iden- tical and a 1 if they diff er. So if we apply the XOR operator to the previous and current butt on state words, we’ll get 1’s only for butt ons whose states have changed between last frame and this frame. To determine whether the event is a butt on-up or a butt on-down, we need to look at the current state of each butt on. Any butt on whose state has changed that is currently down generates a butt on-down event, and vice-versa for butt on-up events. The fol- lowing code applies these ideas in order to generate our two butt on event words: 8.5. Game Engine HID Systems 354 8. Human Interface Devices (HID) class ButtonState { U32 m_buttonStates; // current frame’s button // states U32 m_prevButtonStates; // previous frame’s states U32 m_buttonDowns; // 1 = button pressed this // frame U32 m_buttonUps; // 1 = button released this // frame void DetectButtonUpDownEvents() { // Assuming that m_buttonStates and // m_prevButtonStates are valid, generate // m_buttonDowns and m_buttonUps. // First determine which bits have changed via // XOR. U32 buttonChanges = m_buttonStates ^ m_prevButtonStates; // Now use AND to mask off only the bits that are // DOWN. m_buttonDowns = buttonChanges & m_buttonStates; // Use AND-NOT to mask off only the bits that are // UP. m_buttonUps = buttonChanges & (~m_buttonStates); } // ... }; 8.5.4.2. Chords A chord is a group of butt ons that, when pressed at the same time, produce a unique behavior in the game. Here are a few examples: Super Mario Galaxy’s start-up screen requires you to press the A and B butt ons on the WiiMote together in order to start a new game. Pressing the 1 and 2 butt ons on the WiiMote at the same time put it into Bluetooth discovery mode (no matt er what game you’re playing). The “grapple” move in many fi ghting games is triggered by a two-but- ton combination. 355 For development purposes, holding down both the left and right trig- gers on the DualShock 3 in Uncharted: Drake’s Fortune allows the player character to fl y anywhere in the game world, with collisions turned off . (Sorry, this doesn’t work in the shipping game!) Many games have a cheat like this to make development easier. (It may or may not be trig- gered by a chord, of course.) It is called no-clip mode in the Quake engine, because the character’s collision volume is not clipped to the valid play- able area of the world. Other engines use diff erent terminology. 
Detecting chords is quite simple in principle: We merely watch the states of two or more butt ons and only perform the requested operation when all of them are down. There are some subtleties to account for, however. For one thing, if the chord includes a butt on or butt ons that have other purposes in the game, we must take care not to perform both the actions of the individual butt ons and the action of chord when it is pressed. This is usually done by including a check that the other butt ons in the chord are not down when detecting the individual butt on-presses. Another fl y in the ointment is that humans aren’t perfect, and they oft en press one or more of the butt ons in the chord slightly earlier than the rest. So our chord-detection code must be robust to the possibility that we’ll observe one or more individual butt ons on frame i and the rest of the chord on frame i + 1 (or even multiple frames later). There are a number of ways to handle this: You can design your butt on inputs such that a chord always does the actions of the individual butt ons plus some additional action. For example, if pressing L1 fi res the primary weapon and L2 lobs a grenade, perhaps the L1   +   L2 chord could fi re the primary weapon, lob a grenade, and send out an energy wave that doubles the damage done by these weapons. That way, whether or not the individual butt ons are detected before the chord or not, the behavior will be identical from the point of view of the player. You can introduce a delay between when an individual butt on-down event is seen and when it “counts” as a valid game event. During the delay period (say 2 or 3 frames), if a chord is detected, then it takes precedence over the individual butt on-down events. This gives the human player some leeway in performing the chord. You can detect the chord when the butt ons are pressed, but wait to trigger the eff ect until the butt ons are released again. You can begin the single-butt on move immediately and allow it to be preempted by the chord move. 8.5. Game Engine HID Systems 356 8. Human Interface Devices (HID) 8.5.4.3. Sequences and Gesture Detection The idea of introducing a delay between when a butt on actually goes down and when it really “counts” as down is a special case of gesture detection. A gesture is a sequence of actions performed via a HID by the human player over a period of time. For example, in a fi ghting game or brawler, we might want to detect a sequence of butt on presses, such as A-B-A. We can extend this idea to non-butt on inputs as well. For example, A-B-A-Left -Right-Left , where the latt er three actions are side-to-side motions of one of the thumb sticks on the game pad. Usually a sequence or gesture is only considered to be valid if it is performed within some maximum time-frame. So a rapid A-B-A within a quarter of a second might “count,” but a slow A-B-A performed over a second or two might not. Gesture detection is generally implemented by keeping a brief history of the HID actions performed by the player. When the fi rst component of the gesture is detected, it is stored in the history buff er, along with a time stamp indicating when it occurred. As each subsequent component is detected, the time between it and the previous component is checked. If it is within the allowable time window, it too is added to the history buff er. 
If the entire sequence is completed within the allotted time (i.e., the history buffer is filled), an event is generated telling the rest of the game engine that the gesture has occurred. However, if any non-valid intervening inputs are detected, or if any component of the gesture occurs outside of its valid time window, the entire history buffer is reset, and the player must start the gesture over again. Let's look at three concrete examples, so we can really understand how this works.

Rapid Button Tapping

Many games require the user to tap a button rapidly in order to perform an action. The frequency of the button presses may or may not translate into some quantity in the game, such as the speed with which the player character runs or performs some other action. The frequency is usually also used to define the validity of the gesture—if the frequency drops below some minimum value, the gesture is no longer considered valid.

We can detect the frequency of a button press by simply keeping track of the last time we saw a button-down event for the button in question. We'll call this Tlast. The frequency f is then just the inverse of the time interval between presses (ΔT = Tcur – Tlast and f = 1/ΔT). Every time we detect a new button-down event, we calculate a new frequency f. To implement a minimum valid frequency, we simply check f against the minimum frequency fmin (or we can just check ΔT against the maximum period ΔTmax = 1/fmin directly). If this threshold is satisfied, we update the value of Tlast, and the gesture is considered to be on-going. If the threshold is not satisfied, we simply don't update Tlast. The gesture will be considered invalid until a new pair of rapid-enough button-down events occurs. This is illustrated by the following pseudocode:

class ButtonTapDetector
{
    U32 m_buttonMask; // which button to observe (bit mask)
    F32 m_dtMax;      // max allowed time between presses
    F32 m_tLast;      // last button-down event, in seconds

public:
    // Construct an object that detects rapid tapping of
    // the given button (identified by an index).
    ButtonTapDetector(U32 buttonId, F32 dtMax) :
        m_buttonMask(1U << buttonId),
        m_dtMax(dtMax),
        m_tLast(CurrentTime() - dtMax) // start out invalid
    {
    }

    // Call this at any time to query whether or not the
    // gesture is currently being performed.
    bool IsGestureValid() const
    {
        F32 t = CurrentTime();
        F32 dt = t - m_tLast;
        return (dt < m_dtMax);
    }

    // Call this once per frame.
    void Update()
    {
        if (ButtonsJustWentDown(m_buttonMask))
        {
            m_tLast = CurrentTime();
        }
    }
};

In the above code excerpt, we assume that each button is identified by a unique id. The id is really just an index, ranging from 0 to N – 1 (where N is the number of buttons on the HID in question). We convert the button id to a bit mask by shifting an unsigned 1 bit to the left by an amount equaling the button's index (1U << buttonId). The function ButtonsJustWentDown() returns a non-zero value if any one of the buttons specified by the given bit mask just went down this frame. Here, we're only checking for a single button-down event, but we can and will use this same function later to check for multiple simultaneous button-down events.

Multibutton Sequence

Let's say we want to detect the sequence A-B-A, performed within at most one second. We can detect this button sequence as follows: We maintain a variable that tracks which button in the sequence we're currently looking for.
If we define the sequence with an array of button ids (e.g., aButtons[3] = {A, B, A}), then our variable is just an index i into this array. It starts out initialized to the first button in the sequence, i = 0. We also maintain a start time for the entire sequence, Tstart, much as we did in the rapid button-pressing example.

The logic goes like this: Whenever we see a button-down event that matches the button we're currently looking for, we check its time stamp against the start time of the entire sequence, Tstart. If it occurred within the valid time window, we advance the current button to the next button in the sequence; for the first button in the sequence only (i = 0), we also update Tstart. If we see a button-down event that doesn't match the next button in the sequence, or if the time delta has grown too large, we reset the button index i back to the beginning of the sequence and set Tstart to some invalid value (such as 0). This is illustrated by the code below.

class ButtonSequenceDetector
{
    U32*    m_aButtonIds;  // sequence of buttons to watch for
    U32     m_buttonCount; // number of buttons in sequence
    F32     m_dtMax;       // max time for entire sequence
    EventId m_eventId;     // event to broadcast when complete
    U32     m_iButton;     // next button to watch for in seq.
    F32     m_tStart;      // start time of sequence, in seconds

public:
    // Construct an object that detects the given button
    // sequence. When the sequence is successfully
    // detected, the given event is broadcast, so the
    // rest of the game can respond in an appropriate way.
    ButtonSequenceDetector(U32* aButtonIds,
                           U32 buttonCount,
                           F32 dtMax,
                           EventId eventIdToSend) :
        m_aButtonIds(aButtonIds),
        m_buttonCount(buttonCount),
        m_dtMax(dtMax),
        m_eventId(eventIdToSend), // event to send when complete
        m_iButton(0),             // start of sequence
        m_tStart(0)               // initial value irrelevant
    {
    }

    // Call this once per frame.
    void Update()
    {
        ASSERT(m_iButton < m_buttonCount);

        // Determine which button we're expecting next, as
        // a bit mask (shift a 1 up to the correct bit
        // index).
        U32 buttonMask = (1U << m_aButtonIds[m_iButton]);

        // If any button OTHER than the expected button
        // just went down, invalidate the sequence. (Use
        // the bitwise NOT operator to check for all other
        // buttons.)
        if (ButtonsJustWentDown(~buttonMask))
        {
            m_iButton = 0; // reset
        }

        // Otherwise, if the expected button just went
        // down, check dt and update our state appropriately.
        else if (ButtonsJustWentDown(buttonMask))
        {
            if (m_iButton == 0)
            {
                // This is the first button in the sequence.
                m_tStart = CurrentTime();
                ++m_iButton; // advance to next button
            }
            else
            {
                F32 dt = CurrentTime() - m_tStart;

                if (dt < m_dtMax)
                {
                    // Sequence is still valid.
                    ++m_iButton; // advance to next button

                    // Is the sequence complete?
                    if (m_iButton == m_buttonCount)
                    {
                        BroadcastEvent(m_eventId);
                        m_iButton = 0; // reset
                    }
                }
                else
                {
                    // Sorry, not fast enough.
                    m_iButton = 0; // reset
                }
            }
        }
    }
};

Thumb Stick Rotation

As an example of a more complex gesture, let's see how we might detect when the player is rotating the left thumb stick in a clockwise circle. We can detect this quite easily by dividing the two-dimensional range of possible stick positions into quadrants, as shown in Figure 8.10. In a clockwise rotation, the stick passes through the upper-left quadrant, then the upper-right, then the lower-right, and finally the lower-left.
We can treat each of these cases like a butt on press and detect a full rotation with a slightly modifi ed version of the sequence detection code shown above. We’ll leave this one as an exercise for the reader. Try it! x y UL UR LL LR Figure 8.10. Detecting circular rotations of the stick by dividing the 2D range of stick inputs into quadrants. 361 8.5.5. Managing Multiple HIDs for Multiple Players Most game machines allow two or more HIDs to be att ached for multiplayer games. The engine must keep track of which devices are currently att ached and route each one’s inputs to the appropriate player in the game. This implies that we need some way of mapping controllers to players. This might be as simple as a one-to-one mapping between controller index and player index, or it might be something more sophisticated, such as assigning controllers to players at the time the user hits the Start butt on. Even in a single-player game with only one HID, the engine needs to be robust to various exceptional conditions, such as the controller being acciden- tally unplugged or running out of batt eries. When a controller’s connection is lost, most games pause gameplay, display a message, and wait for the con- troller to be reconnected. Some multiplayer games suspend or temporarily remove the avatar corresponding to a removed controller, but allow the other players to continue playing the game; the removed/suspended avatar might reactivate when the controller is reconnected. On systems with batt ery-operated HIDs, the game or the operating sys- tem is responsible for detecting low-batt ery conditions. In response, the play- er is usually warned in some way, for example via an unobtrusive on-screen message and/or a sound eff ect. 8.5.6. Cross-Platform HID Systems Many game engines are cross-platform. One way to handle HID inputs and outputs in such an engine would be to sprinkle conditional compilation di- rectives all over the code, wherever interactions with the HID take place, as shown below. This is clearly not an ideal solution, but it does work. #if TARGET_XBOX360 if (ButtonsJustWentDown(XB360_BUTTONMASK_A)) #elif TARGET_PS3 if (ButtonsJustWentDown(PS3_BUTTONMASK_TRIANGLE)) #elif TARGET_WII if (ButtonsJustWentDown(WII_BUTTONMASK_A)) #endif { // do something... } A bett er solution is to provide some kind of hardware abstraction layer, there- by insulating the game code from hardware-specifi c details. If we’re lucky, we can abstract most of the diff erences beween the HIDs on the diff erent platforms by a judicious choice of abstract butt on and axis 8.5. Game Engine HID Systems 362 8. Human Interface Devices (HID) ids. For example, if our game is to ship on Xbox 360 and PS3, the layout of the controls (butt ons, axes and triggers) on these two joypads are almost identical. The controls have diff erent ids on each platform, but we can come up with generic control ids that cover both types of joypad quite easily. 
For example: enum AbstractControlIndex { // Start and back buttons AINDEX_START, // Xbox 360 Start, PS3 Start AINDEX_BACK_PAUSE, // Xbox 360 Back, PS3 Pause // Left D-pad AINDEX_LPAD_DOWN, AINDEX_LPAD_UP, AINDEX_LPAD_LEFT, AINDEX_LPAD_RIGHT, // Right "pad" of four buttons AINDEX_RPAD_DOWN, // Xbox 360 A, PS3 X AINDEX_RPAD_UP, // Xbox 360 Y, PS3 Triangle AINDEX_RPAD_LEFT, // Xbox 360 X, PS3 Square AINDEX_RPAD_RIGHT, // Xbox 360 B, PS3 Circle // Left and right thumb stick buttons AINDEX_LSTICK_BUTTON, // Xbox 360 LThumb, PS3 L3, // Xbox white AINDEX_RSTICK_BUTTON, // Xbox 360 RThumb, PS3 R3, // Xbox black // Left and right shoulder buttons AINDEX_LSHOULDER, // Xbox 360 L shoulder, PS3 L1 AINDEX_RSHOULDER, // Xbox 360 R shoulder, PS3 R1 // Left thumb stick axes AINDEX_LSTICK_X, AINDEX_LSTICK_Y, // Right thumb stick axes AINDEX_RSTICK_X, AINDEX_RSTICK_Y, // Left and right trigger axes AINDEX_LTRIGGER, // Xbox 360 –Z, PS3 L2 AINDEX_RTRIGGER, // Xbox 360 +Z, PS3 R2 }; 363 Our abstraction layer can translate between the raw control ids on the cur- rent target hardware into our abstract control indices . For example, whenever we read the state of the butt ons into a 32-bit word, we can perform a bit-swiz- zling operation that rearranges the bits into the proper order to correspond to our abstract indices. Analog inputs can likewise be shuffl ed around into the proper order. In performing the mapping between physical and abstract controls, we’ll sometimes need to get a bit clever. For example, on the Xbox, the left and right triggers act as a single axis, producing negative values when the left trigger is pressed, zero when neither is trigger is pressed, and positive values when the right trigger is pressed. To match the behavior of the PlayStation’s DualShock controller, we might want to separate this axis into two distinct axes on the Xbox, scaling the values appropriately so the range of valid values is the same on all platforms. This is certainly not the only way to handle HID I/O in a multiplatform engine. We might want to take a more functional approach, for example, by naming our abstract controls according to their function in the game, rather than their physical locations on the joypad. We might introduce higher-level functions that detect abstract gestures, with custom detection code on each platform, or we might just bite the bullet and write platform-specifi c versions of all of the game code that requires HID I/O. The possibilities are numerous, but virtually all cross-platform game engines insulate the game from hard- ware details in some manner. 8.5.7. Input Re-Mapping Many games allow the player some degree of choice with regard to the func- tionality of the various controls on the physical HID. A common option is the sense of the vertical axis of the right thumb stick for camera control in a console game. Some folks like to push forward on the stick to angle the camera up, while others like an inverted control scheme, where pulling back on the stick angles the camera up (much like an airplane control stick). Other games allow the player to select between two or more predefi ned butt on mappings. Some PC games allow the user full control over the functions of individual keys on the keyboard, the mouse butt ons, and the mouse wheel, plus a choice between various control schemes for the two mouse axes. 
To implement this, we turn to a favorite saying of an old professor of mine, Professor Jay Black of the University of Waterloo, “Every problem in computer science can be solved with a level of indirection.” We assign each function in the game a unique id and then provide a simple table which maps each physical or abstract control index to a logical function in the game. When- 8.5. Game Engine HID Systems 364 8. Human Interface Devices (HID) ever the game wishes to determine whether a particular logical game function should be activated, it looks up the corresponding abstract or physical control id in the table and then reads the state of that control. To change the mapping, we can either swap out the entire table wholesale, or we can allow the user to edit individual entries in the table. We’re glossing over a few details here. For one thing, diff erent controls produce diff erent kinds of inputs. Analog axes may produce values ranging from –32,768 to 32,767, or from 0 to 255, or some other range. The states of all the digital butt ons on a HID are usually packed into a single machine word. Therefore, we must be careful to only permit control mappings that make sense. We cannot use a butt on as the control for a logical game func- tion that requires an axis, for example. One way around this problem is to normalize all of the inputs. For example, we could re-scale the inputs from all analog axes and butt ons into the range [0, 1]. This isn’t quite as helpful as you might at fi rst think, because some axes are inherently bidirectional (like a joy stick) while others are unidirectional (like a trigger). But if we group our controls into a few classes, we can normalize the inputs within those classes, and permit remapping only within compatible classes. A reason- able set of classes for a standard console joypad and their normalized input values might be: Digital butt ons. States are packed into a 32-bit word, one bit per butt on. Unidirectional absolute axes (e.g., triggers, analog butt ons). Produce fl oat- ing-point input values in the range [0, 1]. Bidirectional absolute axes (e.g., joy sticks). Produce fl oating-point input values in the range [–1, 1]. Relative axes (e.g., mouse axes, wheels, track balls). Produce fl oating-point input values in the range [–1, 1] , where ±1 represents the maximum relative off set possible within a single game frame (i.e., during a period of 1/30 or 1/60 of a second). 8.5.8. Context-Sensitive Controls In many games, a single physical control can have diff erent functions, depend- ing on context. A simple example is the ubiquitous “use” butt on. If pressed while standing in front of a door, the “use” butt on might cause the character to open the door. If it is pressed while standing near an object, it might cause the player character to pick up the object, and so on. Another common example is a modal control scheme. When the player is walking around, the controls are used to navigate and control the camera. When the player is riding a vehicle, the controls are used to steer the vehicle, and the camera controls might be diff erent as well. 365 Context-sensitive controls are reasonably straightforward to imple- ment via a state machine. Depending on what state we’re in, a particu- lar HID control may have a diff erent purpose. The tricky part is deciding what state to be in. 
For example, when the context-sensitive “use” butt on is pressed, the player might be standing at a point equidistant between a weapon and a health pack, facing the center point between them. Which object do we use in this case? Some games implement a priority system to break ties like this. Perhaps the weapon has a higher weight than the health pack, so it would “win” in this example. Implementing context-sensitive controls isn’t rocket science, but it invariably requires lots of trial-and-error to get it feeling and behaving just right. Plan on lots of iteration and focus testing! Another related concept is that of control ownership. Certain controls on the HID might be “owned” by diff erent parts of the game. For example, some inputs are for player control, some for camera control, and still others are for use by the game’s wrapper and menu system (pausing the game, etc.) Some game engines introduce the concept of a logical device, which is composed of only a subset of the inputs on the physical device. One logical device might be used for player control, while another is used by the camera system, and another by the menu system. 8.5.9. Disabling Inputs In most games, it is sometimes necessary to disallow the player from control- ling his or her character. For example, when the player character is involved in an in-game cinematic, we might want to disable all player controls temporar- ily; or when the player is walking through a narrow doorway, we might want to temporarily disable free camera rotation. One rather heavy-handed approach is to use a bit mask to disable indi- vidual controls on the input device itself. Whenever the control is read, the disable mask is checked, and if the corresponding bit is set, a neutral or zero value is returned instead of the actual value read from the device. We must be particularly cautious when disabling controls, however. If we forget to reset the disable mask, the game can get itself into a state where the player looses all control forever, and must restart the game. It’s important to check our logic carefully, and it’s also a good idea to put in some fail-safe mechanisms to en- sure that the disable mask is cleared at certain key times, such as whenever the player dies and re-spawns. Disabling a HID input masks it for all possible clients, which can be overly limiting. A better approach is probably to put the logic for disabling specific player actions or camera behaviors directly into the 8.5. Game Engine HID Systems 366 8. Human Interface Devices (HID) player or camera code itself. That way, if the camera decides to ignore the deflection of the right thumb stick, for example, other game engine systems still have the freedom to read the state of that stick for other purposes. 8.6. Human Interface Devices in Practice Correct and smooth handling of human interface devices is an important part of any good game. Conceptually speaking, HIDs may seem quite straightfor- ward. However, there can be quite a few “gotchas” to deal with, including variations between diff erent physical input devices, proper implementation of low-pass fi ltering, bug-free handling of control scheme mappings, achiev- ing just the right “feel” in your joypad rumble, limitations imposed by console manufacturers via their technical requirements checklists (TRCs), and the list goes on. A game team should expect to devote a non-trivial amount of time and engineering bandwidth to a careful and complete implementation of the human interface device system. 
This is extremely important because the HID system forms the underpinnings of your game’s most precious resource—its player mechanics. 367 9 Tools for Debugging and Development Developing game soft ware is a complex, intricate, math-intensive, and er- ror-prone business. So it should be no surprise that virtually every pro- fessional game team builds a suite of tools for themselves, in order to make the game development process easier and less error-prone. In this chapter, we’ll take a look at the development and debugging tools most oft en found in professional-grade game engines. 9.1. Logging and Tracing Remember when you wrote your fi rst program in BASIC or Pascal? (OK, may- be you don’t. If you’re signifi cantly younger than me—and there’s a prett y good chance of that—you probably wrote your fi rst program in Java, or maybe Python or Lua.) In any case, you probably remember how you debugged your programs back then. You know, back when you thought a debugger was one of those glowing blue insect zapper things? You probably used print statements to dump out the internal state of your program. C/C++ programmers call this printf debugging (aft er the standard C library function, printf()). It turns out that printf debugging is still a perfectly valid thing to do—even if you know that a debugger isn’t a device for frying hapless insects at night. Especially in real-time programming, it can be diffi cult to trace certain kinds 368 9. Tools for Debugging and Development of bugs using breakpoints and watch windows. Some bugs are timing-depen- dent; they only happen when the program is running at full speed. Other bugs are caused by a complex sequence of events too long and intricate to trace manually one-by-one. In these situations, the most powerful debugging tool is oft en a sequence of print statements. Every game platform has some kind of console or teletype (TTY) output device. Here are some examples: In a console application writt en in C/C++, running under Linux or Win32, you can produce output in the console by printing to stdout or stderr via printf(), fprintf(), or STL’s iostream interface. Unfortunately, printf() and iostream don’t work if your game is built as a windowed application under Win32, because there’s no console in which to display the output. However, if you’re running under the Visual Studio debugger, it provides a debug console to which you can print via the Win32 function OutputDebugString(). On the PLAYSTATION 3, an application known as the Target Manager runs on your PC and allows you to launch programs on the console. The Target Manager includes a set of TTY output windows to which messages can be printed by the game engine. So printing out information for debugging purposes is almost always as easy as adding calls to printf() throughout your code. However, most game en- gines go a bit farther than this. In the following sections, we’ll investigate the kinds of printing facilities most game engines provide. 9.1.1. Formatted Output with OutputDebugString() The Win32 function OutputDebugString() is great for printing debug- ging information to Visual Studio’s Debug Output window. However, unlike printf(), OutputDebugString() does not support formatt ed output—it can only print raw strings in the form of char arrays. 
For this reason, most Windows game engines wrap OutputDebugString() in a custom function, like this:

#include <stdarg.h> // for va_list et al

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN 1
#endif
#include <windows.h> // for OutputDebugString()

int VDebugPrintF(const char* format, va_list argList)
{
    const U32 MAX_CHARS = 1023;
    static char s_buffer[MAX_CHARS + 1];

    int charsWritten = vsnprintf(s_buffer, MAX_CHARS,
                                 format, argList);
    s_buffer[MAX_CHARS] = '\0'; // be sure to NIL-terminate

    // Now that we have a formatted string, call the
    // Win32 API.
    OutputDebugString(s_buffer);

    return charsWritten;
}

int DebugPrintF(const char* format, ...)
{
    va_list argList;
    va_start(argList, format);

    int charsWritten = VDebugPrintF(format, argList);

    va_end(argList);

    return charsWritten;
}

Notice that two functions are implemented: DebugPrintF() takes a variable-length argument list (specified via the ellipsis, ...), while VDebugPrintF() takes a va_list argument. This is done so that programmers can build additional printing functions in terms of VDebugPrintF(). (It's impossible to pass ellipses from one function to another, but it is possible to pass va_lists around.)

9.1.2. Verbosity

Once you've gone to the trouble of adding a bunch of print statements to your code in strategically chosen locations, it's nice to be able to leave them there, in case they're needed again later. To permit this, most engines provide some kind of mechanism for controlling the level of verbosity via the command line, or dynamically at runtime. When the verbosity level is at its minimum value (usually zero), only critical error messages are printed. When the verbosity is higher, more of the print statements embedded in the code start to contribute to the output.

The simplest way to implement this is to store the current verbosity level in a global integer variable, perhaps called g_verbosity. We then provide a VerboseDebugPrintF() function whose first argument is the verbosity level at or above which the message will be printed. This function could be implemented as follows:

int g_verbosity = 0;

void VerboseDebugPrintF(int verbosity,
                        const char* format, ...)
{
    // Only print when the global verbosity level is
    // high enough.
    if (g_verbosity >= verbosity)
    {
        va_list argList;
        va_start(argList, format);

        VDebugPrintF(format, argList);

        va_end(argList);
    }
}

9.1.3. Channels

It's also extremely useful to be able to categorize your debug output into channels. One channel might contain messages from the animation system, while another might be used to print messages from the physics system, for example. On some platforms, like the PLAYSTATION 3, debug output can be directed to one of 14 distinct TTY windows. In addition, messages are mirrored to a special TTY window that contains the output from all of the other 14 windows. This makes it very easy for a developer to focus in on only the messages he or she wants to see. When working on an animation problem, one can simply flip to the animation TTY and ignore all the other output. When working on a general problem of unknown origin, the "all" TTY can be consulted for clues.

Other platforms like Windows provide only a single debug output console. However, even on these systems it can be helpful to divide your output into channels. The output from each channel might be assigned a different color.
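Anticipating the filtering idea discussed next, the sketch below shows one way a channel argument and a mask of active channels might be layered on top of the VDebugPrintF() function defined earlier in this section. The channel names and the g_activeChannels variable are invented for this example (they are not part of any particular engine's API), and U32 is the sized-integer typedef used by the other code samples in this chapter.

// Hypothetical channel ids; each channel occupies one bit
// of a 32-bit mask.
enum DebugChannel
{
    kDbgChannelAnimation = 0,
    kDbgChannelPhysics   = 1,
    kDbgChannelAI        = 2,
    // ...
};

// Bit mask of currently active channels (all channels
// enabled by default).
U32 g_activeChannels = 0xFFFFFFFFu;

void ChannelDebugPrintF(DebugChannel channel,
                        const char* format, ...)
{
    // Only print if this channel's bit is set in the
    // active-channel mask.
    if (g_activeChannels & (1U << channel))
    {
        va_list argList;
        va_start(argList, format);

        VDebugPrintF(format, argList);

        va_end(argList);
    }
}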
You might also implement fi lters, which can be turned on and off at runtime, and restrict output to only a specifi ed channel or set of channels. In this model, if a developer is debugging an animation-related problem, for example, he or she can simply fi lter out all of the channels except the anima- tion channel. A channel-based debug output system can be implemented quite easily by adding an additional channel argument to our debug printing function. Channels might be numbered, or bett er, assigned symbolic values via a C/C++ enum declaration. Or channels might be named using a string or hashed string 371 id. The printing function can simply consult the list of active channels and only print the message if the specifi ed channel is among them. If you don’t have more than 32 or 64 channels, it can be helpful to identify the channels via a 32- or 64-bit mask. This makes implementing a channel fi lter as easy as specifying a single integer. When a bit in the mask is 1, the cor- responding channel is active; when the bit is 0, the channel is muted. 9.1.4. Mirroring Output to a File It’s a good idea to mirror all debug output to one or more log fi les (e.g., one fi le per channel). This permits problems to be diagnosed aft er the fact. Ideally the log fi le(s) should contain all of the debug output, independent of the cur- rent verbosity level and active channels mask. This allows unexpected prob- lems to be caught and tracked down by simply inspecting the most-recent log fi les. You may want to consider fl ushing your log fi le(s) aft er every call to your debug output function to ensure that if the game crashes the log fi le(s) won’t be missing the last buff er-full of output. The last data printed is usually the most crucial to determine the cause of a crash, so we want to be sure that the log fi le always contains the most up-to-date output. Of course, fl ushing the output buff er can be expensive. So you should only fl ush buff ers aft er every debug output call if either (a) you are not doing a lot of logging, or (b) you discover that it is truly necessary on your particular platform. If fl ushing is deemed to be necessary, you can always provide an engine confi guration op- tion to turn it on and off . 9.1.5. Crash Reports Some game engines produce special text output and/or log fi les when the game crashes. In most operating systems, a top-level exception handler can be installed that will catch most crashes. In this function, you could print out all sorts of useful information. You could even consider emailing the crash report to the entire programming team. This can be incredibly enlightening for the programmers: When they see just how oft en the art and design teams are crashing, they may discover a renewed sense of urgency in their debug- ging tasks! Here are just a few examples of the kinds of information you can include in a crash report: Current level(s) being played at the time of the crash. World-space location of the player character when the crash occurred. Animation/action state of the player when the game crashed. 9.1. Logging and Tracing 372 9. Tools for Debugging and Development Gameplay script(s) that were running at the time of the crash. (This can be especially helpful if the script is the cause of the crash!) Stack trace. Most operating systems provide a mechanism for walking the call stack (although they are nonstandard and highly platform specifi c). 
With such a facility, you can print out the symbolic names of all non-inline functions on the stack at the time the crash occurred. State of all memory allocators in the engine (amount of memory free, degree of fragmentation, etc.). This kind of data can be helpful when bugs are caused by low-memory conditions, for example. Any other information you think might be relevant when tracking down the cause of a crash. 9.2. Debug Drawing Facilities Modern interactive games are driven almost entirely by math. We use math to position and orient objects in the game world, move them around, test for collisions, cast rays to determine lines of sight, and of course use matrix mul- tiplication to transform objects from object space to world space and even- tually into screen space for rendering. Almost all modern games are three- dimensional, but even in a two-dimensional game it can be very diffi cult to mentally visualize the results of all these mathematical calculations. For this reason, most good game engines provide an API for drawing colored lines, simple shapes, and 3D text. We call this a debug drawing facility, because the lines, shapes, and text that are drawn with it are intended for visualization during development and debugging and are removed prior to shipping the game. A debug drawing API can save you huge amounts of time. For example, if you are trying to fi gure out why your projectiles are not hitt ing the enemy characters, which is easier? Deciphering a bunch of numbers in the debugger? Or drawing a line showing the trajectory of the projectile in three dimensions within your game? With a debug drawing API, logical and mathematical er- rors become immediately obvious. One might say that a picture is worth 1,000 minutes of debugging. Here are some examples of debug drawing in action within Naughty Dog’s Uncharted: Drake’s Fortune engine. The following screen shots were all taken within our play-test level, one of many special levels we use for testing out new features and debugging problems in the game. Figure 9.1 shows how a single line can help developers understand whether a target is within the line of sight of an enemy character. You’ll 373 Figure 9.1. Visualizing the line of sight from an NPC to the player. 9.2. Debug Drawing Facilities also notice some debug text rendered just above the head of the enemy, in this case showing weapon ranges, a damage multiplier, the distance to the target, and the character’s percentage chance of striking the tar- get. Being able to print out arbitrary information in three-dimensional space is an incredibly useful feature. Figure 9.2 shows how a wireframe sphere can be used to visualize the dynamically expanding blast radius of an explosion. Figure 9.3 shows how spheres can be used to visualize the radii used by Drake when searching for ledges to hang from in the game. A red line shows the ledge he is currently hanging from. Notice that in this diagram, white text is displayed in the upper left -hand corner of the screen. In the Uncharted: Drake’s Fortune engine, we have the ability to display text in two-dimensional screen space, as well as in full 3D. This can be useful when you want the text to be displayed independently of the current camera angle. Figure 9.4 shows an AI character that has been placed in a special de- bugging mode. In this mode, the character’s brain is eff ectively turned 374 9. Tools for Debugging and Development Figure 9.2. Visualizing the expanding blast sphere of an explosion. Figure 9.3. 
Spheres and vectors used in Drake’s ledge hang and shimmy system. 375 off , and the developer is given full control over the character’s move- ments and actions via a simple heads-up menu. The developer can paint target points in the game world by simply aiming the camera and can then instruct the character to walk, run, or sprint to the specifi ed points. The user can also tell the character to enter or leave nearby cover, fi re its weapon, and so on. 9.2.1. Debug Drawing API A debug drawing API generally needs to satisfy the following requirements: The API should be simple and easy to use. It should support a useful set of primitives , including (but not limited to): lines, □ spheres, □ points (usually represented as small crosses or spheres, because a □ single pixel is very diffi cult to see), Figure 9.4. Manually controlling an NPC’s actions for debugging purposes. 9.2. Debug Drawing Facilities 376 9. Tools for Debugging and Development coordinate axes (typically the □ x-axis is drawn in red, y in green and z in blue), bounding boxes, and □ formatt ed text. □ It should provide a good deal of fl exibility in controlling how primitives are drawn, including: color, □ line width, □ sphere radii, □ the size of points, lengths of coordinate axes, and dimensions of oth- □ er “canned” primitives. It should be possible to draw primitives in world space (full 3D, using the game camera’s perspective projection matrix) or in screen space (ei- ther using an orthographic projection, or possibly a perspective projec- tion). World-space primitives are useful for annotating objects in the 3D scene. Screen-space primitives are helpful for displaying debugging information in the form of a heads-up display that is independent of camera position or orientation. It should be possible to draw primitives with or without depth testing enabled. When depth testing is enabled, the primitives will be occluded by □ real objects in your scene. This makes their depth easy to visualize, but it also means that the primitives may sometimes be diffi cult to see or totally hidden by the geometry of your scene. With depth testing disabled, the primitives will “hover” over the real □ objects in the scene. This makes it harder to gauge their real depth, but it also ensures that no primitive is ever hidden from view. It should be possible to make calls to the drawing API from anywhere in your code. Most rendering engines require that geometry be submit- ted for rendering during a specifi c phase of the game loop, usually at the end of each frame. So this requirement implies that the system must queue up all incoming debug drawing requests, so that they may be submitt ed at the proper time later on. Ideally, every debug primitive should have a lifetime associated with it. The lifetime controls how long the primitive will remain on-screen aft er having been requested. If the code that is drawing the primitive is called every frame, the lifetime can be one frame—the primitive will remain on-screen because it will be refreshed every frame. However, if the code 377 that draws the primitive is called rarely or intermitt ently (e.g., a func- tion that calculates the initial velocity of a projectile), then you do not want the primitive to fl icker on-screen for just one frame and then dis- appear. In such situations the programmer should be able to give his or her debug primitives a longer lifetime, on the order of a few seconds. 
It’s also important that the debug drawing system be capable of han- dling a large number of debug primitives effi ciently. When you’re draw- ing debug information for 1,000 game objects, the number of primitives can really add up, and you don’t want your game to be unusable when debug drawing is turned on. The debug drawing API in Naughty Dog’s Uncharted: Drake’s Fortune en- gine looks something like this: class DebugDrawManager { public: // Adds a line segment to the debug drawing queue. void AddLine( const Point& fromPosition, const Point& toPosition, Color color, float lineWidth = 1.0f, float duration = 0.0f, bool depthEnabled = true); // Adds an axis-aligned cross (3 lines converging at // a point) to the debug drawing queue. void AddCross( const Point& position, Color color, float size, float duration = 0.0f, bool depthEnabled = true); // Adds a wireframe sphere to the debug drawing queue. void AddSphere( const Point& centerPosition, float radius, Color color, float duration = 0.0f, bool depthEnabled = true); // Adds a circle to the debug drawing queue. void AddCircle( const Point& centerPosition, const Vector& planeNormal, float radius, Color color, float duration = 0.0f, bool depthEnabled = true); 9.2. Debug Drawing Facilities 378 9. Tools for Debugging and Development // Adds a set of coordinate axes depicting the // position and orientation of the given // transformation to the debug drawing queue. void AddAxes( const Transform& xfm, Color color, float size, float duration = 0.0f, bool depthEnabled = true); // Adds a wireframe triangle to the debug drawing // queue. void AddTriangle( const Point& vertex0, const Point& vertex1, const Point& vertex2, Color color, float lineWidth = 1.0f, float duration = 0.0f, bool depthEnabled = true); // Adds an axis-aligned bounding box to the debug // queue. void AddAABB( const Point& minCoords, const Point& maxCoords, Color color, float lineWidth = 1.0f, float duration = 0.0f, bool depthEnabled = true); // Adds an oriented bounding box to the debug queue. void AddOBB( const Mat44& centerTransform, const Vector& scaleXYZ, Color color, float lineWidth = 1.0f, float duration = 0.0f, bool depthEnabled = true); // Adds a text string to the debug drawing queue. void AddString( const Point& pos, const char* text, Color color, float duration = 0.0f, bool depthEnabled = true); }; // This global debug drawing manager is configured for // drawing in full 3D with a perspective projection. extern DebugDrawManager g_debugDrawMgr; 379 9.3. In-Game Menus // This global debug drawing manager draws its // primitives in 2D screen space. The (x,y) coordinates // of a point specify a 2D location on-screen, and the // z coordinate contains a special code that indicates // whether the (x,y) coordidates are measured in absolute // pixels or in normalized coordinates that range from // 0.0 to 1.0. (The latter mode allows drawing to be // independent of the actual resolution of the screen.) extern DebugDrawManager g_debugDrawMgr2D; Here’s an example of this API being used within game code: void Vehicle::Update() { // Do some calculations... // Debug-draw my velocity vector. Point start = GetWorldSpacePosition(); Point end = start + GetVelocity(); g_debugDrawMgr.AddLine(start, end, kColorRed); // Do some other calculations... // Debug-draw my name and number of passengers. 
{ char buffer[128]; sprintf(buffer, "Vehicle %s: %d passengers", GetName(), GetNumPassengers()); g_debugDrawMgr.AddString(GetWorldSpacePosition(), buffer, kColorWhite, 0.0f, false); } } You'll notice that the names of the drawing functions use the verb "add" rather than "draw." This is because the debug primitives are typically not drawn immediately when the drawing function is called. Instead, they are added to a list of visual elements that will be drawn at a later time. Most high-speed 3D rendering engines require that all visual elements be maintained in a scene data structure so that they can be drawn efficiently, usually at the end of the game loop. We'll learn a lot more about how rendering engines work in Chapter 10. 9.3. In-Game Menus Every game engine has a large number of configuration options and features. In fact, each major subsystem, including rendering, animation, collision, physics, audio, networking, player mechanics, AI, and so on, exposes its own specialized configuration options. It is highly useful to programmers, artists, and game designers alike to be able to configure these options while the game is running, without having to change the source code, recompile and relink the game executable, and then rerun the game. This can greatly reduce the amount of time the game development team spends on debugging problems and setting up new levels or game mechanics. Figure 9.5. Main development menu in Uncharted. Figure 9.6. Rendering submenu. One simple and convenient way to permit this kind of thing is to provide a system of in-game menus. Items on an in-game menu can do any number of things, including (but certainly not limited to): toggling global Boolean settings, adjusting global integer and floating-point values, calling arbitrary functions, which can perform literally any task within the engine, bringing up submenus, allowing the menu system to be organized hierarchically for easy navigation. Figure 9.7. Mesh options subsubmenu. Figure 9.8. Background meshes turned off. An in-game menu should be easy and convenient to bring up, perhaps via a simple button-press on the joypad. (Of course, you'll want to choose a button combination that doesn't occur during normal gameplay.) Bringing up the menus usually pauses the game. This allows the developer to play the game until the moment just before a problem occurs, then pause the game by bringing up the menus, adjust engine settings in order to visualize the problem more clearly, and then un-pause the game to inspect the problem in depth. Let's take a brief look at how the menu system works in the Uncharted: Drake's Fortune engine, by Naughty Dog. Figure 9.5 shows the top-level menu. It contains submenus for each major subsystem in the engine. In Figure 9.6, we've drilled down one level into the Rendering… submenu. Since the rendering engine is a highly complex system, its menu contains many submenus controlling various aspects of rendering. To control the way in which 3D meshes are rendered, we drill down further into the Mesh Options… submenu, shown in Figure 9.7. On this menu, we can turn off rendering of all static background meshes, leaving only the dynamic foreground meshes visible. This is shown in Figure 9.8. 9.4. In-Game Console Some engines provide an in-game console, either in lieu of or in addition to an in-game menu system.
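Before looking at the console in more detail, here is a minimal sketch of how an in-game menu hierarchy like the one described in Section 9.3 might be structured. This is illustrative only—the class and member names (MenuItem, BoolMenuItem, FunctionMenuItem, SubMenuItem) are hypothetical assumptions, not the interface of any particular engine.

#include <vector>

class MenuItem
{
public:
    virtual ~MenuItem() { }
    virtual const char* GetLabel() const = 0;
    // Called when the user presses the "select" button while
    // this item is highlighted.
    virtual void OnSelect() = 0;
};

// Toggles a global Boolean setting.
class BoolMenuItem : public MenuItem
{
public:
    BoolMenuItem(const char* label, bool* pValue)
        : m_label(label), m_pValue(pValue) { }
    virtual const char* GetLabel() const { return m_label; }
    virtual void OnSelect() { *m_pValue = !*m_pValue; }
private:
    const char* m_label;
    bool*       m_pValue;
};

// Calls an arbitrary function, which can perform any task
// within the engine.
class FunctionMenuItem : public MenuItem
{
public:
    typedef void (*Callback)();
    FunctionMenuItem(const char* label, Callback fn)
        : m_label(label), m_fn(fn) { }
    virtual const char* GetLabel() const { return m_label; }
    virtual void OnSelect() { m_fn(); }
private:
    const char* m_label;
    Callback    m_fn;
};

// A submenu is itself just a menu item containing children,
// which is what allows the menu system to be organized
// hierarchically.
class SubMenuItem : public MenuItem
{
public:
    explicit SubMenuItem(const char* label) : m_label(label) { }
    void AddItem(MenuItem* pItem) { m_children.push_back(pItem); }
    virtual const char* GetLabel() const { return m_label; }
    virtual void OnSelect() { /* push this submenu onto the menu stack */ }
private:
    const char*            m_label;
    std::vector<MenuItem*> m_children;
};

Items for adjusting global integer and floating-point values could be handled by similar item types that respond to left/right presses by incrementing or decrementing the wrapped variable.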
An in-game console provides a command-line inter- face to the game engine’s features, much as a DOS command prompt provides users with access to various features of the Windows operating system, or a csh, tcsh, ksh or bash shell prompt provides users with access to the features of UNIX-like operating systems. Much like a menu system, the game engine console can provide commands allowing a developer to view and manipulate global engine sett ings, as well as running arbitrary commands. A console is somewhat less convenient than a menu system, especially for those who aren’t very fast typists. However, a console can be much more pow- erful than a menu. Some in-game consoles provide only a rudimentary set of hard-coded commands, making them about as fl exible as a menu system. But others provide a rich interface to virtually every feature of the engine. A screen shot of the in-game console in Quake 4 is shown in Figure 9.9. Some game engines provide a powerful scripting language that can be used by programmers and game designers to extend the functionality of the engine, or even build entirely new games. If the in-game console “speaks” 383 this same scripting language, then anything you can do in script can also be done interactively via the console. We’ll explore scripting languages in depth in Section 14.8. 9.5. Debug Cameras and Pausing the Game An in-game menu or console system is best accompanied by two other crucial features: (a) the ability to detach the camera from the player character and fl y it around the game world in order to scrutinize any aspect of the scene, and (b) the ability to pause , un-pause and single-step the game (see Section 7.5.6). When the game is paused, it is important to still be able to control the camera. To support this, we can simply keep the rendering engine and camera controls running, even when the game’s logical clock is paused. Slow motion mode is another incredibly useful feature for scrutinizing animations, particle eff ects, physics and collision behaviors, AI behaviors, and the list goes on. This feature is easy to implement. Presuming we’ve tak- en care to update all gameplay elements using a clock that is distinct from the real-time clock, we can put the game into slo-mo by simply updating the gameplay clock at a rate that is slower than usual. This approach can also be used to implement a fast-motion mode, which can be useful for moving 9.6. Cheats Figure 9.9. The in-game console in Quake 4, overlaid on top of the main game menu. 384 9. Tools for Debugging and Development rapidly through time-consuming portions of gameplay in order to get to an area of interest. 9.6. Cheats When developing or debugging a game, it’s important to allow the user to break the rules of the game in the name of expediency. Such features are aptly named cheats . For example, many engines allow you to “pick up” the player character and fl y him or her around in the game world, with collisions dis- abled so he or she can pass through all obstacles. This can be incredibly help- ful for testing out gameplay. Rather than taking the time to actually play the game in an att empt to get the player character into some desirable location, you can simply pick him up, fl y him over to where you want him to be, and then drop him back into his regular gameplay mode. Other useful cheats include, but are certainly not limited to: Invincible player. 
As a developer, you oft en don’t want to be bothered having to defend yourself from enemy characters, or worrying about falling from too high a height, as you test out a feature or track down a bug. Give player weapon. It’s oft en useful to be able to give the player any weapon in the game for testing purposes. Infi nite ammo. When you’re trying to kill bad guys to test out the weap- on system or AI hit reactions, you don’t want to be scrounging for clips! Select player mesh. If the player character has more than one “costume,” it can be useful to be able to select any of them for testing purposes. Obviously this list could go on for pages. The sky’s the limit—you can add whatever cheats you need in order to develop or debug the game. You might even want to expose some of your favorite cheats to the players of the fi nal shipping game. Players can usually activate cheats by entering unpublished cheat codes on the joypad or keyboard, and/or by accomplishing certain objec- tives in the game. 9.7. Screen Shots and Movie Capture Another extremely useful facility is the ability to capture screen shots and write them to disk in a suitable image format such as Windows Bitmap fi les 385 (.bmp) or Targa (.tga). The details of how to capture a screen shot vary from platform to platform, but they typically involve making a call to the graphics API that allows the contents of the frame buff er to be transferred from video RAM to main RAM, where it can be scanned and converted into the image fi le format of your choice. The image fi les are typically writt en to a predefi ned folder on disk and named using a date and time stamp to guarantee unique fi le names. You may want to provide your users with various options controlling how screen shots are to be captured. Some common examples include: Whether or not to include debug lines and text in the screen shot. Whether or not to include heads-up display (HUD) elements in the screen shot. The resolution at which to capture. Some engines allow high resolution screen shots to be captured, perhaps by modifying the projection matrix so that separate screen shots can be taken of the four quadrants of the screen at normal resolution and then combined into the fi nal high-res image. Simple camera animations. For example, you could allow the user to mark the starting and ending positions and orientations of the camera. A sequence of screen shots could then be taken while gradually interpo- lating the camera from the start location to the ending location. Some engines also provide a full-fl edged movie capture mode. Such a sys- tem captures a sequence of screen shots at the target frame rate of the game, which are typically processed offl ine to generate a movie fi le in a suitable format such as AVI or MP4. Capturing a screen shot is usually a relatively slow operation, due in part to the time required to transfer the frame buff er data from video RAM to main RAM (an operation for which the graphics hardware is usually not optimized), and in larger part to the time required to write image fi les to disk. If you want to capture movies in real time (or at least close to real time), you’ll almost certainly need to store the captured images to a buff er in main RAM, only writing them out to disk when the buff er has been fi lled (during which the game will typically be frozen). 9.8. In-Game Profi ling Games are real-time systems, so achieving and maintaining a high frame rate (usually 30 FPS or 60 FPS) is important. Therefore, part of any game program- 9.8. 
In-Game Profiling mer's job is ensuring that his or her code runs efficiently and within budget. As we saw when we discussed the 80-20 and 90-10 rules in Chapter 2, a large percentage of your code probably doesn't need to be optimized. The only way to know which bits require optimization is to measure your game's performance. We discussed various third-party profiling tools in Chapter 2. However, these tools have various limitations and may not be available at all on a console. Figure 9.10. The profile category display in the Uncharted 2: Among Thieves engine shows coarse timing figures for various top-level engine systems. Figure 9.11. The Uncharted 2 engine also provides a profile hierarchy display that allows the user to drill down into particular function calls to inspect their costs. For this reason, and/or for convenience, many game engines provide an in-game profiling tool of some sort. Typically an in-game profiler permits the programmer to annotate blocks of code which should be timed and give them human-readable names. The profiler measures the execution time of each annotated block via the CPU's hi-res timer and stores the results in memory. A heads-up display is provided which shows up-to-date execution times for each code block (examples are shown in Figure 9.10, Figure 9.11, and Figure 9.12). The display often provides the data in various forms, including raw numbers of cycles, execution times in microseconds, and percentages relative to the execution time of the entire frame. 9.8.1. Hierarchical Profiling Computer programs written in an imperative language are inherently hierarchical—a function calls other functions, which in turn call still more functions. For example, let's imagine that function a() calls functions b() and c(), and function b() in turn calls functions d(), e() and f(). The pseudocode for this is shown below. void a() { b(); c(); } void b() { d(); e(); f(); } void c() { ... } void d() { ... } void e() { ... } void f() { ... } Figure 9.12. The timeline mode in Uncharted 2 shows exactly when various operations are performed across a single frame on the PS3's SPUs, GPU and PPU. Assuming function a() is called directly from main(), this function call hierarchy is shown in Figure 9.13. When debugging a program, the call stack shows only a snapshot of this tree. Specifically, it shows us the path from whichever function in the hierarchy is currently executing all the way to the root function in the tree. In C/C++, the root function is usually main() or WinMain(), although technically this function is called by a start-up function that is part of the standard C runtime library (CRT), so that function is the true root of the hierarchy. If we set a breakpoint in function e(), for example, the call stack would look something like this: e() The currently-executing function. b() a() main() _crt_startup() Root of the call hierarchy. This call stack is depicted in Figure 9.14 as a pathway from function e() to the root of the function call tree. Figure 9.13. A hypothetical function call hierarchy. Figure 9.14. Call stack resulting from setting a breakpoint in function e(). 9.8.1.1.
Measuring Execution Times Hierarchically If we measure the execution time of a single function, the time we measure includes the execution time of any the child functions called and all of their grandchildren, great grandchildren, and so on as well. To properly interpret any profi ling data we might collect, we must be sure to take the function call hierarchy into account. Many commercial profi lers can automatically instrument every single function in your program. This permits them to measure both the inclusive and exclusive execution times of every function that is called during a profi l- ing session. As the name implies, inclusive times measure the execution time of the function including all of its children, while exclusive times measure only the time spent in the function itself. (The exclusive time of a function can be calculated by subtracting the inclusive times of all its immediate chil- dren from the inclusive time of the function in question.) In addition, some profi lers record how many times each function is called. This is an impor- tant piece of information to have when optimizing a program, because it al- lows you to diff erentiate between functions that eat up a lot of time internally and functions that eat up time because they are called a very large number of times. In contrast, in-game profi ling tools are not so sophisticated and usually rely on manual instrumentation of the code. If our game engine’s main loop is structured simply enough, we may be able to obtain valid data at a coarse level without thinking much about the function call hierarchy. For example, a typical game loop might look roughly like this: while (!quitGame) { PollJoypad(); UpdateGameObjects(); UpdateAllAnimations(); PostProcessJoints(); DetectCollisions(); RunPhysics(); GenerateFinalAnimationPoses(); UpdateCameras(); RenderScene(); UpdateAudio(); } We could profi le this game at a very coarse level by measuring the execution times of each major phase of the game loop: while (!quitGame) { 9.8. In-Game Profi ling 390 9. Tools for Debugging and Development { PROFILE("Poll Joypad"); PollJoypad(); } { PROFILE("Game Object Update"); UpdateGameObjects(); } { PROFILE("Animation"); UpdateAllAnimations(); } { PROFILE("Joint Post-Processing"); PostProcessJoints(); } { PROFILE("Collision"); DetectCollisions(); } { PROFILE("Physics"); RunPhysics(); } { PROFILE("Animation Finaling"); GenerateFinalAnimationPoses(); } { PROFILE("Cameras"); UpdateCameras(); } { PROFILE("Rendering"); RenderScene(); } { PROFILE("Audio"); UpdateAudio(); } } The PROFILE() macro shown above would probably be implemented as a class whose constructor starts the timer and whose destructor stops the timer and records the execution time under the given name. Thus it only times the code within its containing block, by nature of the way C++ automatically con- structs and destroys objects as they go in and out of scope. 391 struct AutoProfile { AutoProfile(const char* name) { m_name = name; m_startTime = QueryPerformanceCounter(); } ~AutoProfile() { __int64 endTime = QueryPerformanceCounter(); __int64 elapsedTime = endTime – m_startTime; g_profileManager.storeSample(m_name, elapsedTime); } const char* m_name; __int64 m_startTime; }; #define PROFILE(name) AutoProfile p(name) The problem with this simplistic approach is that it breaks down when used within deeper levels of function call nesting. 
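As an aside, here is a minimal, hypothetical sketch of the manager that the PROFILE() macro reports into via g_profileManager.storeSample(). The class name ProfileManager, its members, and the per-frame reset are assumptions made for illustration (per-frame call counting is discussed further below), not the actual implementation from any engine.

#include <map>
#include <string>

class ProfileManager
{
public:
    // Called by ~AutoProfile() each time an annotated block exits.
    void storeSample(const char* name, __int64 elapsedCycles)
    {
        Sample& s = m_samples[name];
        s.m_totalCycles += elapsedCycles;
        s.m_callCount++;
    }

    // Called once at the start of each frame, so that the HUD
    // always shows data for the most recent frame only.
    void resetForNewFrame()
    {
        for (std::map<std::string, Sample>::iterator it = m_samples.begin();
             it != m_samples.end();
             ++it)
        {
            it->second.m_totalCycles = 0;
            it->second.m_callCount = 0;
        }
    }

private:
    struct Sample
    {
        Sample() : m_totalCycles(0), m_callCount(0) { }
        __int64 m_totalCycles; // total time measured this frame
        int     m_callCount;   // number of times hit this frame
    };

    std::map<std::string, Sample> m_samples;
};

extern ProfileManager g_profileManager;

With the collection side sketched out, let's return to the problem of nested annotations.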
For example, if we embed additional PROFILE() annotations within the RenderScene() function, we need to understand the function call hierarchy in order to properly interpret those measurements. One solution to this problem is to allow the programmer who is an- notating the code to indicate the hierarchical interrelationships between profi ling samples. For example, any PROFILE(...) samples taken with- in the RenderScene() function could be declared to be children of the PROFILE("Rendering") sample. These relationships are usually set up sepa- rately from the annotations themselves, by predeclaring all of the sample bins. For example, we might set up the in-game profi ler during engine initialization as follows: // This code declares various profile sample "bins", // listing the name of the bin and the name of its // parent bin, if any. ProfilerDeclareSampleBin("Rendering", NULL); ProfilerDeclareSampleBin("Visibility", "Rendering"); ProfilerDeclareSampleBin("ShaderSetUp", "Rendering"); ProfilerDeclareSampleBin("Materials", "Shaders"); ProfilerDeclareSampleBin("SubmitGeo", "Rendering"); ProfilerDeclareSampleBin("Audio", NULL); ... 9.8. In-Game Profi ling 392 9. Tools for Debugging and Development This approach still has its problems. Specifi cally, it works well when every function in the call hierarchy has only one parent, but it breaks down when we try to profi le a function that is called by more than one parent function. The reason for this should be prett y obvious. We’re statically declaring our sample bins as if every function can only appear once in the function call hi- erarchy, but actually the same function can reappear many times in the tree, each time with a diff erent parent. The result can be misleading data, because a function’s time will be included in one of the parent bins, but really should be distributed across all of its parents’ bins. Most game engines don’t make an at- tempt to remedy this problem, since they are primarily interested in profi ling coarse-grained functions that are only called from one specifi c location in the function call hierarchy. But this limitation is something to be aware of when profi ling your code with a simple in-engine profi le of the sort found in most game engines. We would also like to account for how many times a given function is called. In the example above, we know that each of the functions we profi led are called exactly once per frame. But other functions, deeper in the func- tion call hierarchy, may be called more than once per frame. If we measure function x() to take 2 ms to execute, it’s important to know whether it takes 2 ms to execute on its own, or whether it executes in 2 μs but was called 1000 times during the frame. Keeping track of the number of times a function is called per frame is quite simple—the profi ling system can simply increment a counter each time a sample is received and reset the counters at the start of each frame. 9.8.2. Exporting to Excel Some game engines permit the data captured by the in-game profi ler to be dumped to a text fi le for subsequent analysis. I fi nd that a comma-separat- ed values (CSV ) format is best, because such fi les can be loaded easily into a Microsoft Excel spreadsheet, where the data can be manipulated and ana- lyzed in myriad ways. I wrote such an exporter for the Medal of Honor: Pacifi c Assault engine. The columns corresponded to the various annotated blocks, and each row represented the profi ling sample taken during one frame of the game’s execution. 
The fi rst column contained frame numbers and the sec- ond actual game time measured in seconds. This allowed the team to graph how the performance statistics varied over time and to determine how long each frame actually took to execute. By adding some simple formulae to the exported spreadsheet, we could calculate frame rates, execution time percent- ages, and so on. 393 9.9. In-Game Memory Stats and Leak Detection In addition to runtime performance (i.e., frame rate), most game engines are also constrained by the amount of memory available on the target hardware. PC games are least aff ected by such constraints, because modern PCs have sophisticated virtual memory managers. But even PC games are constrained by the memory limitations of their so-called “min spec” machine—the least- powerful machine on which the game is guaranteed to run, as promised by the publisher and stated on the game’s packaging. For this reason, most game engines implement custom memory-tracking tools. These tools allow the developers to see how much memory is being used by each engine subsystem and whether or not any memory is leaking (i.e., memory is allocated but never freed). It’s important to have this informa- tion, so that you can make informed decisions when trying to cut back the memory usage of your game so that it will fi t onto the console or type of PC you are targeting. Keeping track of how much memory a game actually uses can be a sur- prisingly tricky job. You’d think you could simply wrap malloc()/free() or new/delete in a pair of functions or macros that keep track of the amount of memory that is allocated and freed. However, it’s never that simple for a few reasons:  1. You oft en can’t control the allocation behavior of other people’s code. Unless you write the operating system, drivers, and the game engine entire- ly from scratch, there’s a good chance you’re going to end up linking your game with at least some third-party libraries. Most good libraries provide memory allocation hooks, so that you can replace their allocators with your own. But some do not. It’s oft en diffi cult to keep track of the memory allocated by each and every third-party library you use in your game engine—but it usually can be done if you’re thorough and selec- tive in your choice of third-party libraries.  2. Memory comes in diff erent fl avors. For example, a PC has two kinds of RAM: main RAM and video RAM (the memory residing on your graph- ics card, which is used primarily for geometry and texture data). Even if you manage to track all of the memory allocations and deallocations occurring within main RAM, it can be well neigh impossible to track video RAM usage. This is because graphics APIs like DirectX actually hide the details of how video RAM is being allocated and used from the developer. On a console, life is a bit easier, only because you oft en end up having to write a video RAM manager yourself. This is more diffi cult 9.9. In-Game Memory Stats and Leak Detection 394 9. Tools for Debugging and Development than using DirectX, but at least you have complete knowledge of what’s going on.  3. Allocators come in diff erent fl avors. Many games make use of specialized allocators for various purposes. 
For example, the Uncharted: Drake’s Fortune engine has a global heap for general-purpose allocations, a spe- cial heap for managing the memory created by game objects as they spawn into the game world and are destroyed, a level-loading heap for data that is streamed into memory during gameplay, a stack allocator for single-frame allocations (the stack is cleared automatically every frame), an allocator for video RAM, and a debug memory heap used only for allocations that will not be needed in the fi nal shipping game. Each of these allocators grabs a large hunk of memory when the game starts up and then manages that memory block itself. If we were to track all the calls to new and delete, we’d see one new for each of these six al- locators and that’s all. To get any useful information, we really need to track all of the allocations within each of these allocators’ memory blocks. Most professional game teams expend a signifi cant amount of eff ort on creating in-engine memory-tracking tools that provide accurate and detailed information. The resulting tools usually provide their output in a variety of forms. For example, the engine might produce a detailed dump of all memory allocations made by the game during a specifi c period of time. The data might include high water marks for each memory allocator or each game system, indicating the maximum amount of physical RAM required by each. Some engines also provide heads-up displays of memory usage while the game is Figure 9.15. Tabular memory statistics from the Uncharted 2: Among Thieves engine. 395 running. This data might be tabular, as shown in Figure 9.15, or graphical as shown in Figure 9.16. In addition, when low-memory or out-of-memory conditions arise, a good engine will provide this information in as helpful a way as possible. When PC games are developed, the game team usually works on high-powered PCs with more RAM than the min-spec machine being targeted. Likewise, console games are developed on special development kits which have more memory than a retail console. So in both cases, the game can continue to run even when it technically has run out of memory (i.e., would no longer fi t on a retail con- sole or min-spec PC). When this kind of out-of-memory condition arises, the game engine can display a message saying something like, “Out of memory— this level will not run on a retail system.” There are lots of other ways in which a game engine’s memory tracking system can aid developers in pinpointing problems as early and as conve- niently as possible. Here are just a few examples: If a model fails to load, a bright red text string could be displayed in 3D hovering in the game world where that object would have been. If a texture fails to load, the object could be drawn with an ugly pink texture that is very obviously not part of the fi nal game. If an animation fails to load, the character could assume a special (pos- sibly humorous) pose that indicates a missing animation, and the name of the missing asset could hover over the character’s head. The key to providing good memory analysis tools is (a) to provide accurate information, (b) to present the data in a way that is convenient and that makes problems obvious, and (c) to provide contextual information to aid the team in tracking down the root cause of problems when they occur. Figure 9.16. A graphical memory usage display, also from Uncharted 2. 9.9. 
In-Game Memory Stats and Leak Detection Part III Graphics and Motion 399 10 The Rendering Engine When most people think about computer and video games, the fi rst thing that comes to mind is the stunning three-dimensional graphics. Real- time 3D rendering is an exceptionally broad and profound topic, so there’s simply no way to cover all of the details in a single chapter. Thankfully there are a great many excellent books and other resources available on this topic. In fact, real-time 3D graphics is perhaps one of the best covered of all the tech- nologies that make up a game engine. The goal of this chapter, then, is to pro- vide you with a broad understanding of real-time rendering technology and to serve as a jumping-off point for further learning. Aft er you’ve read through these pages, you should fi nd that reading other books on 3D graphics seems like a journey through familiar territory. You might even be able to impress your friends at parties (… or alienate them…) We’ll begin by laying a solid foundation in the concepts, theory, and math- ematics that underlie any real-time 3D rendering engine. Next, we’ll have a look at the soft ware and hardware pipelines used to turn this theoretical framework into reality. We’ll discuss some of the most common optimization techniques and see how they drive the structure of the tools pipeline and the runtime rendering API in most engines. We’ll end with a survey of some of the advanced rendering techniques and lighting models in use by game engines today. Throughout this chapter, I’ll point you to some of my favorite books 400 10. The Rendering Engine and other resources that should help you to gain an even deeper understand- ing of the topics we’ll cover here. 10.1. Foundations of Depth-Buffered Triangle Rasterization When you boil it down to its essence, rendering a three-dimensional scene involves the following basic steps: A virtual scene is described, usually in terms of 3D surfaces represented in some mathematical form. A virtual camera is positioned and oriented to produce the desired view of the scene. Typically the camera is modeled as an idealized focal point, with an imaging surface hovering some small distance in front of it, composed of virtual light sensors corresponding to the picture elements (pixels ) of the target display device . Various light sources are defi ned. These sources provide all the light rays that will interact with and refl ect off the objects in the environment and eventually fi nd their way onto the image-sensing surface of the virtual camera. The visual properties of the surfaces in the scene are described. This de- fi nes how light should interact with each surface. For each pixel within the imaging rectangle, the rendering engine calcu- lates the color and intensity of the light ray(s) converging on the virtual camera’s focal point through that pixel. This is known as solving the ren- dering equation (also called the shading equation). This high-level rendering process is depicted in Figure 10.1. Many diff erent technologies can be used to perform the basic render- ing steps described above. The primary goal is usually photorealism , although some games aim for a more stylized look (e.g., cartoon, charcoal sketch, wa- tercolor, and so on). As such, rendering engineers and artists usually att empt to describe the properties of their scenes as realistically as possible and to use light transport models that match physical reality as closely as possible. 
Within this context, the gamut of rendering technologies ranges from tech- niques designed for real-time performance at the expense of visual fi delity, to those designed for photorealism but which are not intended to operate in real time. Real-time rendering engines perform the steps listed above repeatedly, displaying rendered images at a rate of 30, 50, or 60 frames per second to 401 10.1. Foundations of Depth-Buffered Triangle Rasterization provide the illusion of motion. This means a real-time rendering engine has at most 33.3 ms to generate each image (to achieve a frame rate of 30 FPS). Usu- ally much less time is available, because bandwidth is also consumed by other engine systems like animation, AI, collision detection, physics simulation, au- dio, player mechanics, and other gameplay logic. Considering that fi lm ren- dering engines oft en take anywhere from many minutes to many hours to render a single frame, the quality of real-time computer graphics these days is truly astounding. 10.1.1. Describing a Scene A real-world scene is composed of objects. Some objects are solid, like a brick, and some are amorphous, like a cloud of smoke, but every object occupies a volume of 3D space. An object might be opaque (in which case light cannot pass through its volume), transparent (in which case light passes through it without being scatt ered, so that we can see a reasonably clear image of what- ever is behind the object), or translucent (meaning that light can pass through the object but is scatt ered in all directions in the process, yielding only a blur of colors that hint at the objects behind it). Opaque objects can be rendered by considering only their surfaces . We don’t need to know what’s inside an opaque object in order to render it, be- cause light cannot penetrate its surface. When rendering a transparent or translucent object, we really should model how light is refl ected, refracted, scatt ered, and absorbed as it passes through the object’s volume. This requires knowledge of the interior structure and properties of the object. However, most game engines don’t go to all that trouble. They just render the surfaces Virtual Screen (Near Plane) xC zC yC Rendered ImageCamera Frustum Camera Figure 10.1. The high-level rendering approach used by virtually all 3D computer graphics technologies. 402 10. The Rendering Engine of transparent and translucent objects in almost the same way opaque objects are rendered. A simple numeric opacity measure known as alpha is used to describe how opaque or transparent a surface is. This approach can lead to various visual anomalies (for example, surface features on the far side of the object may be rendered incorrectly), but the approximation can be made to look reasonably realistic in many cases. Even amorphous objects like clouds of smoke are oft en represented using particle eff ects, which are typically com- posed of large numbers of semi-transparent rectangular cards. Therefore, it’s safe to say that most game rendering engines are primarily concerned with rendering surfaces. 10.1.1.1. Representations Used by High-End Rendering Packages Theoretically, a surface is a two-dimensional sheet comprised of an infi nite number of points in three-dimensional space. However, such a description is clearly not practical. In order for a computer to process and render arbitrary surfaces, we need a compact way to represent them numerically. Some surfaces can be described exactly in analytical form, using a para- metric surface equation . 
For example, a sphere centered at the origin can be rep- resented by the equation x2 + y2 + z2 = r2. However, parametric equations aren’t particularly useful for modeling arbitrary shapes. In the fi lm industry, surfaces are oft en represented by a collection of rect- angular patches each formed from a two-dimensional spline defi ned by a small number of control points. Various kinds of splines are used, including Bézi- er surfaces (e.g., bicubic patches , which are third-order Béziers —see htt p:// en.wikipedia.org/wiki/Bezier_surface for more information), nonuniform rational B-splines (NURBS—see htt p://en.wikipedia.org/wiki/Nurbs), Bézi- er triangles, and N-patches (also known as normal patches—see htt p://www. gamasutra.com/features/20020715/mollerhaines_01.htm for more details). Modeling with patches is a bit like covering a statue with litt le rectangles of cloth or paper maché. High-end fi lm rendering engines like Pixar’s RenderMan use subdivision surfaces to defi ne geometric shapes. Each surface is represented by a mesh of control polygons (much like a spline), but the polygons can be subdivided into smaller and smaller polygons using the Catmull-Clark algorithm. This subdivision typically proceeds until the individual polygons are smaller than a single pixel in size. The biggest benefi t of this approach is that no matt er how close the camera gets to the surface, it can always be subdivided further so that its silhouett e edges won’t look faceted. To learn more about subdivi- sion surfaces, check out the following great article on Gamasutra: htt p://www. gamasutra.com/features/20000411/sharp_pfv.htm. 403 10.1.1.2. Triangle Meshes Game developers have traditionally modeled their surfaces using triangle meshes. Triangles serve as a piece-wise linear approximation to a surface, much as a chain of connected line segments acts as a piece-wise approxima- tion to a function or curve (see Figure 10.2). Triangles are the polygon of choice for real-time rendering because they have the following desirable properties: The triangle is the simplest type of polygon. Any fewer than three vertices, and we wouldn’t have a surface at all. A triangle is always planar. Any polygon with four or more vertices need not have this property, because while the fi rst three vertices defi ne a plane, the fourth vertex might lie above or below that plane. Triangles remain triangles under most kinds of transformations, including affi ne transforms and perspective projections. At worst, a triangle viewed edge-on will degenerate into a line segment. At every other orientation, it remains triangular. Virtually all commercial graphics-acceleration hardware is designed around triangle rasterization. Starting with the earliest 3D graphics accelerators for the PC, rendering hardware has been designed almost exclusively around triangle rasterization. This decision can be traced all the way back to the fi rst soft ware rasterizers used in the earliest 3D games like Castle Wolfenstein 3D and Doom . Like it or not, triangle-based technolo- gies are entrenched in our industry and probably will be for years to come. Tessellation The term tessellation describes a process of dividing a surface up into a collec- tion of discrete polygons (which are usually either quadrilaterals, also known as quads, or triangles). Triangulation is tessellation of a surface into triangles. One problem with the kind of triangle mesh used in games is that its level of tessellation is fi xed by the artist when he or she creates it. 
Figure 10.2. A mesh of triangles is a linear approximation to a surface, just as a series of connected line segments can serve as a linear approximation to a function or curve. Figure 10.3. Fixed tessellation can cause an object's silhouette edges to look blocky, especially when the object is close to the camera. Fixed tessellation can cause an object's silhouette edges to look blocky, as shown in Figure 10.3; this is especially noticeable when the object is close to the camera. Ideally, we'd like a solution that can arbitrarily increase tessellation as an object gets closer to the virtual camera. In other words, we'd like to have a uniform triangle-to-pixel density, no matter how close or far away the object is. Subdivision surfaces can achieve this ideal—surfaces can be tessellated based on distance from the camera, so that every triangle is less than one pixel in size. Game developers often attempt to approximate this ideal of uniform triangle-to-pixel density by creating a chain of alternate versions of each triangle mesh, each known as a level of detail (LOD). The first LOD, often called LOD 0, represents the highest level of tessellation; it is used when the object is very close to the camera. Subsequent LODs are tessellated at lower and lower resolutions (see Figure 10.4). As the object moves farther away from the camera, the engine switches from LOD 0 to LOD 1 to LOD 2, and so on. This allows the rendering engine to spend the majority of its time transforming and lighting the vertices of the objects that are closest to the camera (and therefore occupy the largest number of pixels on-screen). Figure 10.4. A chain of LOD meshes, each with a fixed level of tessellation, can be used to approximate uniform triangle-to-pixel density. The leftmost torus is constructed from 5000 triangles, the center torus from 450 triangles, and the rightmost torus from 200 triangles. Some game engines apply dynamic tessellation techniques to expansive meshes like water or terrain. In this technique, the mesh is usually represented by a height field defined on some kind of regular grid pattern. The region of the mesh that is closest to the camera is tessellated to the full resolution of the grid. Regions that are farther away from the camera are tessellated using fewer and fewer grid points. Progressive meshes are another technique for dynamic tessellation and LODing. With this technique, a single high-resolution mesh is created for display when the object is very close to the camera. (This is essentially the LOD 0 mesh.) This mesh is automatically detessellated as the object gets farther away by collapsing certain edges. In effect, this process automatically generates a semi-continuous chain of LODs. See http://research.microsoft.com/en-us/um/people/hoppe/pm.pdf for a detailed discussion of progressive mesh technology. 10.1.1.3. Constructing a Triangle Mesh Now that we understand what triangle meshes are and why they're used, let's take a brief look at how they're constructed. Winding Order A triangle is defined by the position vectors of its three vertices, which we can denote p1, p2, and p3. The edges of a triangle can be found by simply subtracting the position vectors of adjacent vertices. For example, e12 = p2 – p1, e13 = p3 – p1, e23 = p3 – p2. The normalized cross product of any two edges defines a unit face normal N: N = (e12 × e13) / |e12 × e13|. These derivations are illustrated in Figure 10.5. Figure 10.5. Deriving the edges and plane of a triangle from its vertices. To know the direction of the face normal (i.e., the sense of the edge cross product), we need to define which side of the triangle should be considered the front (i.e., the outside surface of an object) and which should be the back (i.e., its inside surface). This can be defined easily by specifying a winding order—clockwise (CW) or counterclockwise (CCW). Most low-level graphics APIs give us a way to cull back-facing triangles based on winding order. For example, if we set the cull mode render state in Direct3D (D3DRS_CULLMODE) to D3DCULL_CW, then any triangle whose vertices wind in a clockwise fashion in screen space will be treated as a back-facing triangle and will not be drawn. Back-face culling is important because we generally don't want to waste time drawing triangles that aren't going to be visible anyway. Also, rendering the back faces of transparent objects can actually cause visual anomalies. The choice of winding order is an arbitrary one, but of course it must be consistent across all assets in the entire game. Inconsistent winding order is a common error among junior 3D modelers. Triangle Lists The easiest way to define a mesh is simply to list the vertices in groups of three, each triple corresponding to a single triangle. This data structure is known as a triangle list; it is illustrated in Figure 10.6. Figure 10.6. A triangle list. Indexed Triangle Lists You probably noticed that many of the vertices in the triangle list shown in Figure 10.6 were duplicated, often multiple times. As we'll see in Section 10.1.2.1, we often store quite a lot of metadata with each vertex, so repeating this data in a triangle list wastes memory. It also wastes GPU bandwidth, because a duplicated vertex will be transformed and lit multiple times. For these reasons, most rendering engines make use of a more efficient data structure known as an indexed triangle list. The basic idea is to list the vertices once with no duplication and then to use light-weight vertex indices (usually occupying only 16 bits each) to define the triples of vertices that constitute the triangles. The vertices are stored in an array known as a vertex buffer (DirectX) or vertex array (OpenGL). The indices are stored in a separate buffer known as an index buffer or index array. This technique is shown in Figure 10.7. Figure 10.7. An indexed triangle list. Strips and Fans Specialized mesh data structures known as triangle strips and triangle fans are sometimes used for game rendering. Both of these data structures eliminate the need for an index buffer, while still reducing vertex duplication to some degree. They accomplish this by predefining the order in which vertices must appear and how they are combined to form triangles. Figure 10.8. A triangle strip. In a strip, the first three vertices define the first triangle. Each subsequent vertex forms an entirely new triangle, along with its previous two neighbors.
To keep the winding order of a triangle strip consistent, the previous two neighbor vertices swap places aft er each new triangle. A triangle strip is shown in Figure 10.8. In a fan, the fi rst three vertices defi ne the fi rst triangle and each subse- quent vertex defi nes a new triangle with the previous vertex and the fi rst ver- tex in the fan. This is illustrated in Figure 10.9. Vertex Cache Optimization When a GPU processes an indexed triangle list, each triangle can refer to any vertex within the vertex buff er. The vertices must be processed in the order they appear within the triangles, because the integrity of each triangle must be maintained for the rasterization stage. As vertices are processed by the vertex shader , they are cached for reuse. If a subsequent primitive refers to a vertex that already resides in the cache, its processed att ributes are used instead of reprocessing the vertex. Strips and fans are used in part because they can potentially save memory (no index buff er required) and in part because they tend to improve the cache coherency of the memory accesses made by the GPU to video RAM. Even bett er, we can use an indexed strip or indexed fan to virtually eliminate vertex duplication (which can oft en save more memory than eliminating the index buff er), while still reaping the cache coherency benefi ts of the strip or fan ver- tex ordering. Indexed triangle lists can also be cache-optimized without restricting ourselves to strip or fan vertex ordering. A vertex cache optimizer is an offl ine geometry processing tool that att empts to list the triangles in an order that Figure 10.9. A triangle fan. Interpreted as triangles: 012 02 3 034 V0 V1 V2 V3 V4Vertices V0 V4 V3 V2 V1 409 optimizes vertex reuse within the cache. It generally takes into account factors such as the size of the vertex cache(s) present on a particular type of GPU and the algorithms used by the GPU to decide when to cache vertices and when to discard them. For example, the vertex cache optimizer included in Sony’s Edge geometry processing library can achieve rendering throughput that is up to 4% bett er than what is possible with triangle stripping. 10.1.1.4. Model Space The position vectors of a triangle mesh’s vertices are usually specifi ed relative to a convenient local coordinate system called model space , local space, or object space. The origin of model space is usually either in the center of the object or at some other convenient location, like on the fl oor between the feet of a char- acter or on the ground at the horizontal centroid of the wheels of a vehicle. As we learned in Section 4.3.9.1, the sense of the model space axes is ar- bitrary, but the axes typically align with the natural “front,” “left ” or “right,” and “up” directions on the model. For a litt le mathematical rigor, we can de- fi ne three unit vectors F, L (or R), and U and map them as desired onto the unit basis vectors i, j, and k (and hence to the x-, y-, and z-axes, respectively) in model space. For example, a common mapping is L = i, U = j, and F = k. The mapping is completely arbitrary, but it’s important to be consistent for all models across the entire engine. Figure 10.10 shows one possible mapping of the model space axes for an aircraft model. L = i F = k U = j Figure 10.10. One possible mapping of the model space axes. 10.1.1.5. 
World Space and Mesh Instancing Many individual meshes are composed into a complete scene by positioning and orienting them within a common coordinate system known as world space. Any one mesh might appear many times in a scene—examples include a street lined with identical lamp posts, a faceless mob of soldiers, or a swarm of spiders attacking the player. We call each such object a mesh instance. A mesh instance contains a reference to its shared mesh data and also includes a transformation matrix that converts the mesh's vertices from model space to world space, within the context of that particular instance. This matrix is called the model-to-world matrix, or sometimes just the world matrix. Using the notation from Section 4.3.10.2, this matrix can be written as follows:

$$
\mathbf{M}_{M \to W} =
\begin{bmatrix}
(\mathbf{R}\mathbf{S})_{M \to W} & \mathbf{0} \\
\mathbf{t}_M & 1
\end{bmatrix},
$$

where the upper 3 × 3 matrix (RS)_{M→W} rotates and scales model-space vertices into world space, and t_M is the translation of the model space axes expressed in world space. If we have the unit model space basis vectors i_M, j_M, and k_M expressed in world space coordinates, this matrix can also be written as follows:

$$
\mathbf{M}_{M \to W} =
\begin{bmatrix}
\mathbf{i}_M & 0 \\
\mathbf{j}_M & 0 \\
\mathbf{k}_M & 0 \\
\mathbf{t}_M & 1
\end{bmatrix}.
$$

Given a vertex expressed in model-space coordinates, the rendering engine calculates its world-space equivalent as follows:

$$
\mathbf{v}_W = \mathbf{v}_M \, \mathbf{M}_{M \to W}.
$$

We can think of the matrix M_{M→W} as a description of the position and orientation of the model space axes themselves, expressed in world space coordinates. Or we can think of it as a matrix that transforms vertices from model space to world space. When rendering a mesh, the model-to-world matrix is also applied to the surface normals of the mesh (see Section 10.1.2.1). Recall from Section 4.3.11, that in order to transform normal vectors properly, we must multiply them by the inverse transpose of the model-to-world matrix. If our matrix does not contain any scale or shear, we can transform our normal vectors correctly by simply setting their w components to zero prior to multiplication by the model-to-world matrix, as described in Section 4.3.6.1. Some meshes like buildings, terrain, and other background elements are entirely static and unique. The vertices of these meshes are often expressed in world space, so their model-to-world matrices are identity and can be ignored. 10.1.2. Describing the Visual Properties of a Surface In order to properly render and light a surface, we need a description of its visual properties. Surface properties include geometric information, such as the direction of the surface normal at various points on the surface. They also encompass a description of how light should interact with the surface. This includes diffuse color, shininess/reflectivity, roughness or texture, degree of opacity or transparency, index of refraction, and other optical properties. Surface properties might also include a specification of how the surface should change over time (e.g., how an animated character's skin should track the joints of its skeleton or how the surface of a body of water should move). The key to rendering photorealistic images is properly accounting for light's behavior as it interacts with the objects in the scene. Hence rendering engineers need to have a good understanding of how light works, how it is transported through an environment, and how the virtual camera "senses" it and translates it into the colors stored in the pixels on-screen.
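To tie the mesh instancing discussion from Section 10.1.1.5 to concrete code, here is a minimal sketch of a mesh instance transforming its vertex positions and normals into world space. The types and names (Vector3, Matrix44, Mesh, MeshInstance, TransformPoint, TransformDirection) are assumptions made for illustration, not any particular engine's math library, and the normal transform is only valid under the stated no-scale, no-shear assumption.

#include <vector>

struct Vector3 { float x, y, z; };

// A 4x4 matrix stored row-major, used with row vectors: vW = vM * M.
struct Matrix44 { float m[4][4]; };

// Transforms a model-space point into world space (w = 1, so the
// translation row is applied).
inline Vector3 TransformPoint(const Vector3& v, const Matrix44& M)
{
    Vector3 r;
    r.x = v.x*M.m[0][0] + v.y*M.m[1][0] + v.z*M.m[2][0] + M.m[3][0];
    r.y = v.x*M.m[0][1] + v.y*M.m[1][1] + v.z*M.m[2][1] + M.m[3][1];
    r.z = v.x*M.m[0][2] + v.y*M.m[1][2] + v.z*M.m[2][2] + M.m[3][2];
    return r;
}

// Transforms a model-space direction into world space (w = 0, so the
// translation row is ignored). This is only correct for normals when
// the matrix contains no scale or shear; otherwise the inverse
// transpose of the model-to-world matrix must be used instead.
inline Vector3 TransformDirection(const Vector3& v, const Matrix44& M)
{
    Vector3 r;
    r.x = v.x*M.m[0][0] + v.y*M.m[1][0] + v.z*M.m[2][0];
    r.y = v.x*M.m[0][1] + v.y*M.m[1][1] + v.z*M.m[2][1];
    r.z = v.x*M.m[0][2] + v.y*M.m[1][2] + v.z*M.m[2][2];
    return r;
}

struct Mesh
{
    std::vector<Vector3> positions; // model-space vertex positions
    std::vector<Vector3> normals;   // model-space vertex normals
};

// A mesh instance: a reference to shared mesh data plus a
// per-instance model-to-world matrix.
struct MeshInstance
{
    const Mesh* pMesh;
    Matrix44    modelToWorld;
};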
10.1.2.1. Introduction to Light and Color Light is electromagnetic radiation; it acts like both a wave and a particle in diff erent situations. The color of light is determined by its intensity I and its wavelength λ (or its frequency f, where f = 1/λ). The visible gamut ranges from a wavelength of 740 nm (or a frequency of 430 THz) to a wavelength of 380 nm (750 THz). A beam of light may contain a single pure wavelength (i.e., the colors of the rainbow, also known as the spectral colors ), or it may contain a mixture of various wavelengths. We can draw a graph showing how much of each frequency a given beam of light contains, called a spectral plot . White light contains a litt le bit of all wavelengths, so its spectral plot would look roughly like a box extending across the entire visible band. Pure green light contains only one wavelength, so its spectral plot would look like a single infi nitesi- mally narrow spike at about 570 THz. Light-Object Interactions Light can have many complex interactions with matt er . Its behavior is gov- erned in part by the medium through which it is traveling and in part by the shape and properties of the interfaces between diff erent types of media (air- solid, air-water, water-glass, etc.). Technically speaking, a surface is really just an interface between two diff erent types of media. Despite all of its complexity, light can really only do four things: It can be absorbed ; It can be refl ected ; It can be transmitt ed through an object, usually being refracted (bent) in the process; It can be diff racted when passing through very narrow openings. 10.1. Foundations of Depth-Buffered Triangle Rasterization 412 10. The Rendering Engine Most photorealistic rendering engines account for the fi rst three of these be- haviors; diff raction is not usually taken into account because its eff ects are rarely noticeable in most scenes. Only certain wavelengths may be absorbed by a surface, while others are refl ected. This is what gives rise to our perception of the color of an object. For example, when white light falls on a red object, all wavelengths except red are absorbed, hence the object appears red. The same perceptual eff ect is achieved when red light is cast onto a white object—our eyes don’t know the diff erence. Refl ections can be diff use , meaning that an incoming ray is scatt ered equal- ly in all directions. Refl ections can also be specular , meaning that an incident light ray will refl ect directly or be spread only into a narrow cone. Refl ections can also be anisotropic , meaning that the way in which light refl ects from a sur- face changes depending on the angle at which the surface is viewed. When light is transmitt ed through a volume, it can be scatt ered (as is the case for translucent objects), partially absorbed (as with colored glass), or re- fracted (as happens when light travels through a prism). The refraction an- gles can be diff erent for diff erent wavelengths, leading to spectral spreading. This is why we see rainbows when light passes through raindrops and glass prisms. Light can also enter a semi-solid surface, bounce around, and then exit the surface at a diff erent point from the one at which it entered the surface. We call this subsurface scatt ering , and it is one of the eff ects that gives skin, wax, and marble their characteristic warm appearance. Color Spaces and Color Models A color model is a three-dimensional coordinate system that measures colors. 
A color space is a specifi c standard for how numerical colors in a particular color model should be mapped onto the colors perceived by human beings in the real world. Color models are typically three-dimensional because of the three types of color sensors (cones) in our eyes, which are sensitive to diff erent wavelengths of light. The most commonly used color model in computer graphics is the RGB model. In this model, color space is represented by a unit cube, with the rela- tive intensities of red, green, and blue light measured along its axes. The red, green, and blue components are called color channels . In the canonical RGB color model, each channel ranges from zero to one. So the color (0, 0, 0) repre- sents black, while (1, 1, 1) represents white. When colors are stored in a bitmapped image , various color formats can be employed. A color format is defi ned in part by the number of bits per pixel it occupies and, more specifi cally, the number of bits used to represent each color channel. The RGB888 format uses eight bits per channel, for a total of 413 24 bits per pixel. In this format, each channel ranges from 0 to 255 rather than from zero to one. RGB565 uses fi ve bits for red and blue and six for green, for a total of 16 bits per pixel. A palett ed format might use eight bits per pixel to store indices into a 256-element color palett e, each entry of which might be stored in RGB888 or some other suitable format. A number of other color models are also used in 3D rendering. We’ll see how the log-LUV color model is used for high dynamic range (HDR) lighting in Section 10.3.1.5. Opacity and the Alpha Channel A fourth channel called alpha is oft en tacked on to RGB color vectors. As men- tioned in Section 10.1.1, alpha measures the opacity of an object. When stored in an image pixel, alpha represents the opacity of the pixel. RGB color formats can be extended to include an alpha channel, in which case they are referred to as RGBA or ARGB color formats. For example, RGBA8888 is a 32 bit-per-pixel format with eight bits each for red, green, blue, and alpha. RGBA5551 is a 16 bit-per-pixel format with one-bit alpha; in this format, colors can either be fully opaque or fully transparent. 10.1.2.2. Vertex Attributes The simplest way to describe the visual properties of a surface is to specify them at discrete points on the surface. The vertices of a mesh are a conve- nient place to store surface properties, in which case they are called vertex att ributes . A typical triangle mesh includes some or all of the following att ributes at each vertex. As rendering engineers, we are of course free to defi ne any ad- ditional att ributes that may be required in order to achieve a desired visual eff ect on-screen. Position vector (pi = [ pix piy piz ]). This is the 3D position of the ith vertex in the mesh. It is usually specifi ed in a coordinate space local to the object, known as model space. Vertex normal (ni = [ nix niy niz ]). This vector defi nes the unit surface nor- mal at the position of vertex i. It is used in per-vertex dynamic lighting calculations. Vertex tangent (ti = [ tix tiy tiz ]) and bitangent (bi = [ bix biy biz ]). These two unit vectors lie perpendicular to one another and to the vertex normal ni. Together, the three vectors ni , ti , and bi defi ne a set of coordinate axes known as tangent space . This space is used for various per-pixel lighting calculations, such as normal mapping and environment mapping. (The 10.1. 
Foundations of Depth-Buffered Triangle Rasterization 414 10. The Rendering Engine bitangent bi is sometimes confusingly called the binormal , even though it is not normal to the surface.) Diff use color (di = [ dRi dGi dBi dAi ]). This four-element vector describes the diff use color of the surface, expressed in the RGB color space. It typically also includes a specifi cation of the opacity or alpha (A) of the surface at the position of the vertex. This color may be calculated off -line (static lighting) or at runtime (dynamic lighting). Specular color (si = [ sRi sGi sBi sAi ]). This quantity describes the color of the specular highlight that should appear when light refl ects directly from a shiny surface onto the virtual camera’s imaging plane. Texture coordinates (uij = [ uij vij ]). Texture coordinates allow a two- (or sometimes three-) dimensional bitmap to be “shrink wrapped” onto the surface of a mesh—a process known as texture mapping. A texture co- ordinate (u, v) describes the location of a particular vertex within the two-dimensional normalized coordinate space of the texture. A triangle can be mapped with more than one texture; hence it can have more than one set of texture coordinates. We’ve denoted the distinct sets of texture coordinates via the subscript j above. Skinning weights (kij , wij ). In skeletal animation, the vertices of a mesh are att ached to individual joints in an articulated skeleton. In this case, each vertex must specify to which joint it is att ached via an index, k. A vertex can be infl uenced by multiple joints, in which case the fi nal vertex posi- tion becomes a weighted average of these infl uences. Thus, the weight of each joint’s infl uence is denoted by a weighting factor w. In general, a vertex i can have multiple joint infl uences j, each denoted by the pair of numbers [ kij wij ]. 10.1.2.3. Vertex Formats Vertex att ributes are typically stored within a data structure such as a C struct or a C++ class. The layout of such a data structure is known as a ver- tex format. Diff erent meshes require diff erent combinations of att ributes and hence need diff erent vertex formats. The following are some examples of com- mon vertex formats: // Simplest possible vertex – position only (useful for // shadow volume extrusion, silhouette edge detection // for cartoon rendering, z prepass, etc.) struct Vertex1P { Vector3 m_p; // position }; 415 // A typical vertex format with position, vertex normal // and one set of texture coordinates. struct Vertex1P1N1UV { Vector3 m_p; // position Vector3 m_n; // vertex normal F32 m_uv[2]; // (u, v) texture coordinate }; // A skinned vertex with position, diffuse and specular // colors and four weighted joint influences. struct Vertex1P1D1S2UV4J { Vector3 m_p; // position Color4 m_d; // diffuse color and translucency Color4 m_S; // specular color F32 m_uv0[2]; // first set of tex coords F32 m_uv1[2]; // second set of tex coords U8 m_k[4]; // four joint indices, and... F32 m_w[3]; // three joint weights, for // skinning // (fourth calc’d from other // three) }; Clearly the number of possible permutations of vertex att ributes—and hence the number of distinct vertex formats—can grow to be extremely large. (In fact the number of formats is theoretically unbounded, if one were to per- mit any number of texture coordinates and/or joint weights.) Management of all these vertex formats is a common source of headaches for any graphics programmer. 
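One way many engines tame this explosion is to describe vertex layouts with data rather than with a distinct C++ struct per format. The sketch below is illustrative only (the book does not prescribe this particular design, and the names VertexElement and VertexFormat are invented); it shows the kind of table-driven layout description that maps naturally onto the vertex declaration or input layout objects exposed by typical graphics APIs.

    #include <cstdint>
    #include <vector>

    // What an attribute means and how it is stored.
    enum class VertexSemantic { Position, Normal, Tangent, Color, TexCoord, JointIndices, JointWeights };
    enum class AttribFormat   { Float2, Float3, Float4, UByte4 };

    struct VertexElement
    {
        VertexSemantic semantic;    // e.g. Position
        AttribFormat   format;      // e.g. Float3
        std::uint32_t  offsetBytes; // offset of this attribute within one vertex
    };

    struct VertexFormat
    {
        std::vector<VertexElement> elements;
        std::uint32_t              strideBytes; // total size of one vertex
    };

    // Example: a layout equivalent to the Vertex1P1N1UV struct shown above.
    inline VertexFormat makeVertex1P1N1UVFormat()
    {
        VertexFormat fmt;
        fmt.elements = {
            { VertexSemantic::Position, AttribFormat::Float3,  0 },
            { VertexSemantic::Normal,   AttribFormat::Float3, 12 },
            { VertexSemantic::TexCoord, AttribFormat::Float2, 24 },
        };
        fmt.strideBytes = 32; // (3 + 3 + 2) floats = 32 bytes
        return fmt;
    }

A mesh can then reference one of these format descriptions, and generic streaming or rendering code can locate any attribute by semantic and byte offset instead of being compiled against a particular struct.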
Some steps can be taken to reduce the number of vertex formats that an engine has to support. In practical graphics applications, many of the theo- retically possible vertex formats are simply not useful, or they cannot be handled by the graphics hardware or the game’s shaders. Some game teams also limit themselves to a subset of the useful/feasible vertex formats in or- der to keep things more manageable. For example, they might only allow zero, two, or four joint weights per vertex, or they might decide to support no more than two sets of texture coordinates per vertex. Modern GPUs are capable of extracting a subset of att ributes from a vertex data structure, so game teams can also choose to use a single “überformat” for all meshes and let the hardware select the relevant att ributes based on the requirements of the shader. 10.1. Foundations of Depth-Buffered Triangle Rasterization 416 10. The Rendering Engine 10.1.2.4. Attribute Interpolation The att ributes at a triangle’s vertices are just a coarse, discretized approxima- tion to the visual properties of the surface as a whole. When rendering a tri- angle, what really matt ers are the visual properties at the interior points of the triangle as “seen” through each pixel on-screen. In other words, we need to know the values of the att ributes on a per-pixel basis, not a per-vertex basis. One simple way to determine the per-pixel values of a mesh’s surface at- tributes is to linearly interpolate the per-vertex att ribute data. When applied to vertex colors, att ribute interpolation is known as Gouraud shading . An example of Gouraud shading applied to a triangle is shown in Figure 10.11, and its ef- fects on a simple triangle mesh are illustrated in Figure 10.12. Interpolation is routinely applied to other kinds of vertex att ribute information as well, such as vertex normals, texture coordinates, and depth. Figure 10.11. A Gouraud-shaded triangle with different shades of gray at the vertices. Figure 10.12. Gouraud shading can make faceted objects appear to be smooth. Vertex Normals and Smoothing As we’ll see in Section 10.1.3, lighting is the process of calculating the color of an object at various points on its surface, based on the visual properties of the surface and the properties of the light impinging upon it. The simplest way to light a mesh is to calculate the color of the surface on a per-vertex basis. In other words, we use the properties of the surface and the incoming light to calculate the diff use color of each vertex (di). These vertex colors are then interpolated across the triangles of the mesh via Gouraud shading. 417 In order to determine how a ray of light will refl ect from a point on a sur- face, most lighting models make use of a vector that is normal to the surface at the point of the light ray’s impact. Since we’re performing lighting calculations on a per-vertex basis, we can use the vertex normal ni for this purpose. There- fore, the directions of a mesh’s vertex normals can have a signifi cant impact on the fi nal appearance of a mesh. As an example, consider a tall, thin, four-sided box. If we want the box to appear to be sharp-edged, we can specify the vertex normals to be perpen- dicular to the faces of the box. As we light each triangle, we will encounter the same normal vector at all three vertices, so the resulting lighting will appear fl at, and it will abruptly change at the corners of the box just as the vertex normals do. 
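In code, Gouraud-style interpolation is just a weighted sum. The following minimal sketch (illustrative only; the Color3 type is hypothetical, and the barycentric weights of the pixel within the triangle are assumed to be known already) blends three vertex colors as described in Section 10.1.2.4, ignoring the perspective correction discussed later in Section 10.1.4.4.

    struct Color3 { float r, g, b; };

    // Blends three per-vertex colors at a point inside a triangle, given the
    // barycentric weights (b0, b1, b2) of that point. For points inside the
    // triangle the weights are non-negative and sum to one.
    inline Color3 interpolateColor(const Color3& c0, const Color3& c1, const Color3& c2,
                                   float b0, float b1, float b2)
    {
        return { b0*c0.r + b1*c1.r + b2*c2.r,
                 b0*c0.g + b1*c1.g + b2*c2.g,
                 b0*c0.b + b1*c1.b + b2*c2.b };
    }

The same weighted sum applies to any other vertex attribute, such as normals, texture coordinates and depth; interpolated normals are typically renormalized before being used in lighting calculations.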
We can also make the same box mesh look a bit like a smooth cylinder by specifying vertex normals that point radially outward from the box’s center line. In this case, the vertices of each triangle will have diff erent vertex nor- mals, causing us to calculate diff erent colors at each vertex. Gouraud shading will smoothly interpolate these vertex colors, resulting in lighting that appears to vary smoothly across the surface. This eff ect is illustrated in Figure 10.13. Figure 10.13. The directions of a mesh’s vertex normals can have a profound effect on the colors calculated during per-vertex lighting calculations. 10.1.2.5. Textures When triangles are relatively large, specifying surface properties on a per-ver- tex basis can be too coarse-grained. Linear att ribute interpolation isn’t always what we want, and it can lead to undesirable visual anomalies. As an example, consider the problem of rendering the bright specular highlight that can occur when light shines on a glossy object. If the mesh is 10.1. Foundations of Depth-Buffered Triangle Rasterization 418 10. The Rendering Engine highly tessellated, per-vertex lighting combined with Gouraud shading can yield reasonably good results. However, when the triangles are too large, the errors that arise from linearly interpolating the specular highlight can become jarringly obvious, as shown in Figure 10.14. To overcome the limitations of per-vertex surface att ributes, rendering en- gineers use bitmapped images known as texture maps. A texture oft en contains color information and is usually projected onto the triangles of a mesh. In this case, it acts a bit like those silly fake tatt oos we used to apply to our arms when we were kids. But a texture can contain other kinds of visual surface proper- ties as well as colors. And a texture needn’t be projected onto a mesh—for example, a texture might be used as a stand-alone data table. The individual picture elements of a texture are called texels to diff erentiate them from the pixels on the screen. The dimensions of a texture bitmap are constrained to be powers of two on some graphics hardware. Typical texture dimensions include 256 × 256, 512 × 512, 1024 × 1024, and 2048 × 2048, although textures can be any size on most hardware, provided the texture fi ts into video memory. Some graph- ics hardware imposes additional restrictions, such as requiring textures to be square, or lift s some restrictions, such as not constraining texture dimensions to be powers of two. Types of Textures The most common type of texture is known as a diff use map , or albedo map . It describes the diff use surface color at each texel on a surface and acts like a de- cal or paint job on the surface. Other types of textures are used in computer graphics as well, including normal maps (which store unit normal vectors at each texel, encoded as RGB values), gloss maps (which encode how shiny a surface should be at each texel), Figure 10.14. Linear interpolation of vertex attributes does not always yield an adequate description of the visual properties of a surface, especially when tessellation is low. 419 environment maps (which contain a picture of the surrounding environment for rendering refl ections), and many others. See Section 10.3.1 for a discussion of how various types of textures can be used for image-based lighting and other eff ects. We can actually use texture maps to store any information that we happen to need in our lighting calculations. 
For example, a one-dimensional texture could be used to store sampled values of a complex math function, a color-to- color mapping table, or any other kind of look-up table (LUT) . Texture Coordinates Let’s consider how to project a two-dimensional texture onto a mesh. To do this, we defi ne a two-dimensional coordinate system known as texture space . A texture coordinate is usually represented by a normalized pair of numbers denoted (u, v). These coordinates always range from (0, 0) at the bott om left corner of the texture to (1, 1) at the top right. Using normalized coordinates like this allows the same coordinate system to be used regardless of the di- mensions of the texture. To map a triangle onto a 2D texture, we simply specify a pair of texture coordinates (ui, vi) at each vertex i. This eff ectively maps the triangle onto the image plane in texture space. An example of texture mapping is depicted in Figure 10.15. Figure 10.15. An example of texture mapping. The triangles are shown both in three-dimen- sional space and in texture space. Texture Addressing Modes Texture coordinates are permitt ed to extend beyond the [0, 1] range. The graphics hardware can handle out-of-range texture coordinates in any one of 10.1. Foundations of Depth-Buffered Triangle Rasterization 420 10. The Rendering Engine the following ways. These are known as texture addressing modes; which mode is used is under the control of the user. Wrap. In this mode, the texture is repeated over and over in every direc- tion. All texture coordinates of the form (ju, kv) are equivalent to the coordinate (u, v), where j and k are arbitrary integers. Mirror. This mode acts like wrap mode, except that the texture is mir- rored about the v-axis for odd integer multiples of u, and about the u- axis for odd integer multiples of v. Clamp. In this mode, the colors of the texels around the outer edge of the texture are simply extended when texture coordinates fall outside the normal range. Border color. In this mode, an arbitrary user-specifi ed color is used for the region outside the [0, 1] texture coordinate range. These texture addressing modes are depicted in Figure 10.16. Figure 10.16. Texture addressing modes. Texture Formats Texture bitmaps can be stored on disk in virtually any image format provided your game engine includes the code necessary to read it into memory. Com- mon formats include Targa (.tga), Portable Network Graphics (.png), Win- dows Bitmap (.bmp), and Tagged Image File Format (.tif). In memory, textures are usually represented as two-dimensional (strided) arrays of pixels using 421 various color formats, including RGB888, RGBA8888, RGB565, RGBA5551, and so on. Most modern graphics cards and graphics APIs support compressed tex- tures . DirectX supports a family of compressed formats known as DXT or S3 Texture Compression (S3TC). We won’t cover the details here, but the basic idea is to break the texture into 2 × 2 blocks of pixels and use a small color pal- ett e to store the colors for each block. You can read more about S3 compressed texture formats at htt p://en.wikipedia.org/wiki/S3_Texture_Compression. Compressed textures have the obvious benefi t of using less memory than their uncompressed counterparts. An additional unexpected plus is that they are faster to render with as well. 
S3 compressed textures achieve this speed-up because of more cache-friendly memory access patt erns—4 × 4 blocks of ad- jacent pixels are stored in a single 64- or 128-bit machine word—and because more of the texture can fi t into the cache at once. Compressed textures do suff er from compression artifacts. While the anomalies are usually not notice- able, there are situations in which uncompressed textures must be used. Texel Density and Mipmapping Imagine rendering a full-screen quad (a rectangle composed of two triangles) that has been mapped with a texture whose resolution exactly matches that of the screen. In this case, each texel maps exactly to a single pixel on-screen, and we say that the texel density (ratio of texels to pixels) is one. When this same quad is viewed at a distance, its on-screen area becomes smaller. The resolu- tion of the texture hasn’t changed, so the quad’s texel density is now greater than one (meaning that more than one texel is contributing to each pixel). Clearly texel density is not a fi xed quantity—it changes as a texture- mapped object moves relative to the camera. Texel density aff ects the memory consumption and the visual quality of a three-dimensional scene. When the texel density is much less than one, the texels become signifi cantly larger than a pixel on-screen, and you can start to see the edges of the texels. This destroys the illusion. When texel density is much greater than one, many texels contrib- ute to a single pixel on-screen. This can cause a moiré banding patt ern , as shown in Figure 10.17. Worse, a pixel’s color can appear to swim and fl icker as diff er- ent texels within the boundaries of the pixel dominate its color depending on subtle changes in camera angle or position. Rendering a distant object with a very high texel density can also be a waste of memory if the player can never get close to it. Aft er all, why keep such a high-res texture in memory if no one will ever see all that detail? Ideally we’d like to maintain a texel density that is close to one at all times, for both nearby and distant objects. This is impossible to achieve exactly, but it can be approximated via a technique called mipmapping . For each texture, 10.1. Foundations of Depth-Buffered Triangle Rasterization 422 10. The Rendering Engine we create a sequence of lower-resolution bitmaps, each of which is one-half the width and one-half the height of its predecessor. We call each of these images a mipmap, or mip level. For example, a 64 × 64 texture would have the following mip levels: 64 × 64, 32 × 32, 16 × 16, 8 × 8, 4 × 4, 2 × 2, and 1 × 1, as shown in Figure 10.18. Once we have mipmapped our textures, the graphics hardware selects the appropriate mip level based on a triangle’s distance away from the camera, in an att empt to maintain a texel density that is close to one. For example, if a texture takes up an area of 40 × 40 on-screen, the 64 × 64 mip level might be selected; if that same texture takes up only a 10 × 10 area, the 16 × 16 mip level might be used. As we’ll see below, trilinear fi ltering allows the hardware to sample two adjacent mip levels and blend the results. In this case, a 10 × 10 area might be mapped by blending the 16 × 16 and 8 × 8 mip levels together. Figure 10.17. A texel density greater than one can lead to a moiré pattern. Figure 10.18. Mip levels for a 64×64 texture. World Space Texel Density The term “texel density ” can also be used to describe the ratio of texels to world space area on a textured surface. 
For example, a two meter cube mapped with a 256 × 256 texture would have a texel density of 2562/22 = 16,384. I will call this world space texel density to diff erentiate it from the screen space texel density we’ve been discussing thus far. 423 World-space texel density need not be close to one, and in fact the specifi c value will usually be much greater than one and depends entirely upon your choice of world units. Nonetheless, it is important for objects to be texture mapped with a reasonably consistent world space texel density. For example, we would expect all six sides of a cube to occupy the same texture area. If this were not the case, the texture on one side of the cube would have a lower-res- olution appearance than another side, which can be noticeable to the player. Many game studios provide their art teams with guidelines and in-engine texel density visualization tools in an eff ort to ensure that all objects in the game have a reasonably consistent world space texel density. Texture Filtering When rendering a pixel of a textured triangle, the graphics hardware samples the texture map by considering where the pixel center falls in texture space. There is usually not a clean one-to-one mapping between texels and pixels, and pixel centers can fall at any place in texture space, including directly on the boundary between two or more texels. Therefore, the graphics hardware usually has to sample more than one texel and blend the resulting colors to arrive at the actual sampled texel color. We call this texture fi ltering. Most graphics cards support the following kinds of texture fi ltering: Nearest neighbor . In this crude approach, the texel whose center is closest to the pixel center is selected. When mipmapping is enabled, the mip level is selected whose resolution is nearest to but greater than the ideal theoreti- cal resolution needed to achieve a screen-space texel density of one. Bilinear . In this approach, the four texels surrounding the pixel center are sampled, and the resulting color is a weighted average of their col- ors (where the weights are based on the distances of the texel centers from the pixel center). When mipmapping is enabled, the nearest mip level is selected. Trilinear . In this approach, bilinear fi ltering is used on each of the two nearest mip levels (one higher-res than the ideal and the other lower- res), and these results are then linearly interpolated. This eliminates abrupt visual boundaries between mip levels on-screen. Anisotropic . Both bilinear and trilinear fi ltering sample 2 × 2 square blocks of texels. This is the right thing to do when the textured sur- face is being viewed head-on, but it’s incorrect when the surface is at an oblique angle relative to the virtual screen plane. Anisotropic fi ltering samples texels within a trapezoidal region corresponding to the view angle, thereby increasing the quality of textured surfaces when viewed at an angle. 10.1. Foundations of Depth-Buffered Triangle Rasterization 424 10. The Rendering Engine 10.1.2.6. Materials A material is a complete description of the visual properties of a mesh. This includes a specifi cation of the textures that are mapped to its surface and also various higher-level properties, such as which shader programs to use when rendering the mesh, the input parameters to those shaders, and other parame- ters that control the functionality of the graphics acceleration hardware itself. 
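In code, a material might boil down to something like the following sketch. This is only an illustration of the general idea (the field names, handle types and choice of parameters are hypothetical rather than any specific engine's design); real material systems usually also carry render state such as blend modes and depth-test settings, and store shader parameters in a more generic fashion.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct TextureHandle { std::uint32_t id; }; // opaque reference to a loaded texture
    struct ShaderHandle  { std::uint32_t id; }; // opaque reference to a compiled shader

    // A named shader input parameter (up to four components).
    struct ShaderParam
    {
        std::string name;     // e.g. "specularPower"
        float       value[4];
    };

    // A deliberately simplified material description.
    struct Material
    {
        ShaderHandle             vertexShader;
        ShaderHandle             pixelShader;
        TextureHandle            diffuseMap;   // albedo texture
        TextureHandle            normalMap;    // optional; id == 0 means "none"
        std::vector<ShaderParam> params;       // shader input parameters
        bool                     alphaBlended = false; // example of a hardware render state
    };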
While technically part of the surface properties description, vertex att ri- butes are not considered to be part of the material. However, they come along for the ride with the mesh, so a mesh-material pair contains all the informa- tion we need to render the object. Mesh-material pairs are sometimes called render packets, and the term “geometric primitive” is sometimes extended to encompass mesh-material pairs as well. A 3D model typically uses more than one material. For example, a mod- el of a human would have separate materials for the hair, skin, eyes, teeth, and various kinds of clothing. For this reason, a mesh is usually divided into submeshes , each mapped to a single material. The Ogre3D rendering engine implements this design via its Ogre::SubMesh class. 10.1.3. Lighting Basics Lighting is at the heart of all CG rendering. Without good lighting, an other- wise beautifully modeled scene will look fl at and artifi cial. Likewise, even the Figure 10.19. A variation on the classic “Cornell box” scene illustrating how realistic lighting can make even the simplest scene appear photorealistic. 425 simplest of scenes can be made to look extremely realistic when it is lit accu- rately. The classic “Cornell box” scene, shown in Figure 10.19, is an excellent example of this. The following sequence of screen shots from Naughty Dog’s Uncharted: Drake’s Fortune is another good illustration of the importance of lighting. In Figure 10.20, the scene is rendered without textures. Figure 10.21 shows the same scene with diff use textures applied. The fully lit scene is shown in Fig- ure 10.22. Notice the marked jump in realism when lighting is applied to the scene. Figure 10.20. A scene from Uncharted: Drake’s Fortune rendered without textures. Figure 10.21. The same UDF scene with only diffuse textures applied. 10.1. Foundations of Depth-Buffered Triangle Rasterization 426 10. The Rendering Engine The term shading is oft en used as a loose generalization of lighting plus other visual eff ects. As such, “shading” encompasses procedural deformation of vertices to simulate the motion of a water surface, generation of hair curves or fur shells, tessellation of high-order surfaces, and prett y much any other calculation that’s required to render a scene. In the following sections, we’ll lay the foundations of lighting that we’ll need in order to understand graphics hardware and the rendering pipeline. We’ll return to the topic of lighting in Section 10.3, where we’ll survey some advanced lighting and shading techniques. 10.1.3.1. Local and Global Illumination Models Rendering engines use various mathematical models of light-surface and light- volume interactions called light transport models . The simplest models only ac- count for direct lighting in which light is emitt ed, bounces off a single object in the scene, and then proceeds directly to the imaging plane of the virtual cam- era. Such simple models are called local illumination models , because only the local eff ects of light on a single object are considered; objects do not aff ect one another’s appearance in a local lighting model. Not surprisingly, local models were the fi rst to be used in games, and they are still in use today—local light- ing can produce surprisingly realistic results in some circumstances. True photorealism can only be achieved by accounting for indirect light- ing , where light bounces multiple times off many surfaces before reaching the virtual camera. 
Lighting models that account for indirect lighting are called global illumination models . Some global illumination models are targeted at simulating one specifi c visual phenomenon, such as producing realistic shad- Figure 10.22. The UDF scene with full lighting. 427 ows, modeling refl ective surfaces, accounting for interrefl ection between ob- jects (where the color of one object aff ects the colors of surrounding objects), and modeling caustic eff ects (the intense refl ections from water or a shiny metal surface). Other global illumination models att empt to provide a holis- tic account of a wide range of optical phenomena. Ray tracing and radiosity methods are examples of such technologies. Global illumination is described completely by a mathematical formula- tion known as the rendering equation or shading equation. It was introduced in 1986 by J. T. Kajiya as part of a seminal SIGGRAPH paper. In a sense, every rendering technique can be thought of as a full or partial solution to the ren- dering equation, although they diff er in their fundamental approach to solv- ing it and in the assumptions, simplifi cations, and approximations they make. See htt p://en.wikipedia.org/wiki/Rendering_equation, [8], [1], and virtually any other text on advanced rendering and lighting for more details on the rendering equation. 10.1.3.2. The Phong Lighting Model The most common local lighting model employed by game rendering engines is the Phong refl ection model . It models the light refl ected from a surface as a sum of three distinct terms: The ambient term models the overall lighting level of the scene. It is a gross approximation of the amount of indirect bounced light present in the scene. Indirect bounces are what cause regions in shadow not to appear totally black. The diff use term accounts for light that is refl ected uniformly in all direc- tions from each direct light source. This is a good approximation to the way in which real light bounces off a matt e surface, such as a block of wood or a piece of cloth. The specular term models the bright highlights we sometimes see when viewing a glossy surface. Specular highlights occur when the view- ing angle is closely aligned with a path of direct refl ection from a light source. Figure 10.23 shows how the ambient, diff use, and specular terms add together to produce the fi nal intensity and color of a surface. To calculate Phong refl ection at a specifi c point on a surface, we require a number of input parameters. The Phong model is normally applied to all three color channels (R, G and B) independently, so all of the color parameters in the following discussion are three-element vectors. The inputs to the Phong model are: 10.1. Foundations of Depth-Buffered Triangle Rasterization 428 10. The Rendering Engine the viewing direction vector V = [ Vx Vy Vz ], which extends from the refl ection point to the virtual camera’s focal point (i.e., the negation of the camera’s world-space “front” vector); the ambient light intensity for the three color channels, A = [ AR AG AB ]; the surface normal N = [ Nx Ny Nz ] at the point the light ray impinges on the surface; the surface refl ectance properties, which are the ambient refl ectivity □ kA, the diff use refl ectivity □ kD, the specular refl ectivity □ kS, and a specular “glossiness” exponent □ α; and, for each light source i, the light’s color and intensity □ Ci = [ CRi CGi CBi ], the direction vector □ Li from the refl ection point to the light source. 
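As a concrete preview of the equations presented next, the following C++ sketch evaluates the Phong model directly from these inputs. It is illustrative only: the Vector3, Color3 and Light types are hypothetical, all direction vectors are assumed to be unit length, and the reflectivities kA, kD and kS are treated as scalars applied uniformly to all three color channels.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Vector3 { float x, y, z; };
    struct Color3  { float r, g, b; };

    inline float   dot(const Vector3& a, const Vector3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
    inline Vector3 operator*(float s, const Vector3& v)    { return { s*v.x, s*v.y, s*v.z }; }
    inline Vector3 operator-(const Vector3& a, const Vector3& b) { return { a.x-b.x, a.y-b.y, a.z-b.z }; }

    struct Light
    {
        Color3  C; // light color/intensity
        Vector3 L; // unit vector from the reflection point toward the light
    };

    // Evaluates I = kA*A + sum_i [ kD (N.Li) + kS (Ri.V)^alpha ] Ci per channel.
    // N is the unit surface normal and V the unit vector toward the camera.
    Color3 phong(const Vector3& N, const Vector3& V, const Color3& A,
                 float kA, float kD, float kS, float alpha,
                 const std::vector<Light>& lights)
    {
        Color3 I = { kA*A.r, kA*A.g, kA*A.b };
        for (const Light& lt : lights)
        {
            const float nDotL = dot(N, lt.L);
            if (nDotL <= 0.0f)
                continue;                                // light is behind the surface

            const Vector3 R     = 2.0f*nDotL*N - lt.L;   // R = 2(N.L)N - L
            const float   rDotV = std::max(dot(R, V), 0.0f);
            const float   term  = kD*nDotL + kS*std::pow(rDotV, alpha);

            I.r += term * lt.C.r;
            I.g += term * lt.C.g;
            I.b += term * lt.C.b;
        }
        return I;
    }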
In the Phong model, the intensity I of light reflected from a point can be expressed with the following vector equation:

$$\mathbf{I} = k_A \mathbf{A} + \sum_i \left[ k_D (\mathbf{N} \cdot \mathbf{L}_i) + k_S (\mathbf{R}_i \cdot \mathbf{V})^{\alpha} \right] \mathbf{C}_i,$$

where the sum is taken over all lights i affecting the point in question. This can be broken into three scalar equations, one for each color channel:

$$\begin{aligned}
I_R &= k_A A_R + \sum_i \left[ k_D (\mathbf{N} \cdot \mathbf{L}_i) + k_S (\mathbf{R}_i \cdot \mathbf{V})^{\alpha} \right] C_{Ri}, \\
I_G &= k_A A_G + \sum_i \left[ k_D (\mathbf{N} \cdot \mathbf{L}_i) + k_S (\mathbf{R}_i \cdot \mathbf{V})^{\alpha} \right] C_{Gi}, \\
I_B &= k_A A_B + \sum_i \left[ k_D (\mathbf{N} \cdot \mathbf{L}_i) + k_S (\mathbf{R}_i \cdot \mathbf{V})^{\alpha} \right] C_{Bi}.
\end{aligned}$$

Figure 10.23. Ambient, diffuse and specular terms are summed to calculate Phong reflection.

In these equations, the vector $\mathbf{R}_i = [\,R_{xi}\ R_{yi}\ R_{zi}\,]$ is the reflection of the light ray's direction vector $\mathbf{L}_i$ about the surface normal N.

The vector $\mathbf{R}_i$ can be easily calculated via a bit of vector math. Any vector can be expressed as a sum of its tangential and normal components. For example, we can break up the light direction vector L as follows:

$$\mathbf{L} = \mathbf{L}_T + \mathbf{L}_N.$$

We know that the dot product (N · L) represents the projection of L normal to the surface (a scalar quantity). So the normal component $\mathbf{L}_N$ is just the unit normal vector N scaled by this dot product:

$$\mathbf{L}_N = (\mathbf{N} \cdot \mathbf{L})\,\mathbf{N}.$$

The reflected vector R has the same normal component as L but the opposite tangential component ($-\mathbf{L}_T$). So we can find R as follows:

$$\begin{aligned}
\mathbf{R} &= \mathbf{L}_N - \mathbf{L}_T \\
&= \mathbf{L}_N - (\mathbf{L} - \mathbf{L}_N) \\
&= 2\mathbf{L}_N - \mathbf{L} \\
&= 2(\mathbf{N} \cdot \mathbf{L})\,\mathbf{N} - \mathbf{L}.
\end{aligned}$$

This equation can be used to find all of the $\mathbf{R}_i$ values corresponding to the light directions $\mathbf{L}_i$.

Blinn-Phong

The Blinn-Phong lighting model is a variation on Phong shading that calculates specular reflection in a slightly different way. We define the vector H to be the vector that lies halfway between the view vector V and the light direction vector L. The Blinn-Phong specular component is then $(\mathbf{N} \cdot \mathbf{H})^a$, as opposed to Phong's $(\mathbf{R} \cdot \mathbf{V})^{\alpha}$. The exponent a is slightly different than the Phong exponent α, but its value is chosen in order to closely match the equivalent Phong specular term. The Blinn-Phong model offers increased runtime efficiency at the cost of some accuracy, although it actually matches empirical results more closely than Phong for some kinds of surfaces. The Blinn-Phong model was used almost exclusively in early computer games and was hard-wired into the fixed-function pipelines of early GPUs. See http://en.wikipedia.org/wiki/Blinn%E2%80%93Phong_shading_model for more details.

BRDF Plots

The three terms in the Phong lighting model are special cases of a general local reflection model known as a bidirectional reflection distribution function (BRDF). A BRDF calculates the ratio of the outgoing (reflected) radiance along a given viewing direction V to the incoming irradiance along the incident ray L. A BRDF can be visualized as a hemispherical plot, where the radial distance from the origin represents the intensity of the light that would be seen if the reflection point were viewed from that direction. The diffuse Phong reflection term is $k_D(\mathbf{N} \cdot \mathbf{L})$. This term only accounts for the incoming illumination ray L, not the viewing angle V. Hence the value of this term is the same for all viewing angles. If we were to plot this term as a function of the viewing angle in three dimensions, it would look like a hemisphere centered on the point at which we are calculating the Phong reflection. This is shown in two dimensions in Figure 10.24. The specular term of the Phong model is $k_S(\mathbf{R} \cdot \mathbf{V})^{\alpha}$.
This term is dependent on both the illumination direction L and the viewing direction V. It produces a specular “hot spot” when the viewing angle aligns closely with the refl ection R of the illumination direction L about the surface normal. However, its con- tribution falls off very quickly as the viewing angle diverges from the refl ected illumination direction. This is shown in two dimensions in Figure 10.25. 10.1.3.3. Modeling Light Sources In addition to modeling the light’s interactions with surfaces, we need to de- scribe the sources of light in the scene. As with all things in real-time rendering, we approximate real-world light sources using various simplifi ed models. Figure 10.24. The diffuse term of the Phong refl ection model is dependent upon N • L, but is independent of the viewing angle V. Figure 10.25. The specular term of the Phong refl ection model is at its maximum when the viewing angle V coincides with the refl ected light direction R and drops off quickly as V di- verges from R. 431 Static Lighting The fastest lighting calculation is the one you don’t do at all. Lighting is there- fore performed off -line whenever possible. We can precalculate Phong refl ec- tion at the vertices of a mesh and store the results as diff use vertex color at- tributes. We can also precalculate lighting on a per pixel basis and store the results in a kind of texture map known as a light map . At runtime, the light map texture is projected onto the objects in the scene in order to determine the light’s eff ects on them. You might wonder why we don’t just bake lighting information directly into the diff use textures in the scene. There are a few reasons for this. For one thing, diff use texture maps are oft en tiled and/or repeated throughout a scene, so baking lighting into them wouldn’t be practical. Instead, a single light map is usually generated per light source and applied to any objects that fall within that light’s area of infl uence. This approach permits dynamic objects to move past a light source and be properly illuminated by it. It also means that our light maps can be of a diff erent (oft en lower) resolution than our diff use tex- ture maps. Finally, a “pure” light map usually compresses bett er than one that includes diff use color information. Ambient Lights An ambient light corresponds to the ambient term in the Phong lighting model. This term is independent of the viewing angle and has no specifi c direction. An ambient light is therefore represented by a single color, corresponding to the A color term in the Phong equation (which is scaled by the surface’s ambi- ent refl ectivity kA at runtime). The intensity and color of ambient light may vary from region to region within the game world. Directional Lights A directional light models a light source that is eff ectively an infi nite distance away from the surface being illuminated—like the sun. The rays emanating from a directional light are parallel, and the light itself does not have any particular location in the game world. A directional light is therefore modeled as a light color C and a direction vector L. A directional light is depicted in Figure 10.26. Point (Omni-Directional) Lights A point light (omni-directional light) has a distinct position in the game world and radiates uniformly in all directions. The intensity of the light is usually considered to fall off with the square of the distance from the light source, and beyond a predefi ned maximum radius its eff ects are simply clamped to zero. 
A point light is modeled as a light position P, a source color/intensity C, Figure 10.26. Model of a directional light source. Figure 10.27. Mod- el of a point light source. 10.1. Foundations of Depth-Buffered Triangle Rasterization 432 10. The Rendering Engine and a maximum radius rmax. The rendering engine only applies the eff ects of a point light to those surfaces that fall within is sphere of infl uence (a signifi cant optimization). Figure 10.27 illustrates a point light. Spot Lights A spot light acts like a point light whose rays are restricted to a cone-shaped region, like a fl ashlight. Usually two cones are specifi ed with an inner and an outer angle. Within the inner cone, the light is considered to be at full inten- sity. The light intensity falls off as the angle increases from the inner to the outer angle, and beyond the outer cone it is considered to be zero. Within both cones, the light intensity also falls off with radial distance. A spot light is modeled as a position P, a source color C, a central direction vector L, a maxi- mum radius rmax , and inner and outer cone angles θmin and θmax. Figure 10.28 illustrates a spot light source. Area Lights All of the light sources we’ve discussed thus far radiate from an idealized point, either at infi nity or locally. A real light source almost always has a non- zero area—this is what gives rise to the umbra and penumbra in the shadows it casts. Rather than trying to model area lights explicitly, CG engineers oft en use various “tricks” to account for their behavior. For example to simulate a pen- umbra, we might cast multiple shadows and blend the results, or we might blur the edges of a sharp shadow in some manner. Emissive Objects Some surfaces in a scene are themselves light sources. Examples include fl ash- lights, glowing crystal balls, fl ames from a rocket engine, and so on. Glowing surfaces can be modeled using an emissive texture map —a texture whose colors are always at full intensity, independent of the surrounding lighting environ- ment. Such a texture could be used to defi ne a neon sign, a car’s headlights, and so on. Some kinds of emissive objects are rendered by combining multiple tech- niques. For example, a fl ashlight might be rendered using an emissive texture for when you’re looking head-on into the beam, a colocated spot light that casts light into the scene, a yellow translucent mesh to simulate the light cone, some camera-facing transparent cards to simulate lens fl are (or a bloom eff ect if high dynamic range lighting is supported by the engine), and a projected texture to produce the caustic eff ect that a fl ashlight has on the surfaces it il- luminates. The fl ashlight in Luigi’s Mansion is a great example of this kind of eff ect combination, as shown in Figure 10.29. Figure 10.28. Model of a spot light source. 433 10.1.4. The Virtual Camera In computer graphics, the virtual camera is much simpler than a real camera or the human eye. We treat the camera as an ideal focal point with a rectangu- lar virtual sensing surface called the imaging rectangle fl oating some small dis- tance in front of it. The imaging rectangle consists of a grid of square or rect- angular virtual light sensors, each corresponding to a single pixel on-screen. Rendering can be thought of as the process of determining what color and intensity of light would be recorded by each of these virtual sensors. 10.1.4.1. 
View Space

The focal point of the virtual camera is the origin of a 3D coordinate system known as view space or camera space. The camera usually "looks" down the positive or negative z-axis in view space, with y up and x to the left or right. Typical left- and right-handed view space axes are illustrated in Figure 10.30.

Figure 10.29. The flashlight in Luigi's Mansion is composed of numerous visual effects, including a cone of translucent geometry for the beam, a dynamic spot light to cast light into the scene, an emissive texture on the lens, and camera-facing cards for the lens flare.

Figure 10.30. Left- and right-handed camera space axes.

The camera's position and orientation can be specified using a view-to-world matrix, just as a mesh instance is located in the scene with its model-to-world matrix. If we know the position vector and three unit basis vectors of camera space, expressed in world-space coordinates, the view-to-world matrix can be written as follows, in a manner analogous to that used to construct a model-to-world matrix:

$$\mathbf{M}_{V \to W} = \begin{bmatrix} \mathbf{i}_V & 0 \\ \mathbf{j}_V & 0 \\ \mathbf{k}_V & 0 \\ \mathbf{t}_V & 1 \end{bmatrix}.$$

When rendering a triangle mesh, its vertices are transformed first from model space to world space, and then from world space to view space. To perform this latter transformation, we need the world-to-view matrix, which is the inverse of the view-to-world matrix. This matrix is sometimes called the view matrix:

$$\mathbf{M}_{W \to V} = \left(\mathbf{M}_{V \to W}\right)^{-1} = \mathbf{M}_{\text{view}}.$$

Be careful here. The fact that the camera's matrix is inverted relative to the matrices of the objects in the scene is a common point of confusion and bugs among new game developers.

The world-to-view matrix is often concatenated to the model-to-world matrix prior to rendering a particular mesh instance. This combined matrix is called the model-view matrix in OpenGL. We precalculate this matrix so that the rendering engine only needs to do a single matrix multiply when transforming vertices from model space into view space:

$$\mathbf{M}_{M \to V} = \mathbf{M}_{M \to W}\,\mathbf{M}_{W \to V} = \mathbf{M}_{\text{model-view}}.$$

10.1.4.2. Projections

In order to render a 3D scene onto a 2D image plane, we use a special kind of transformation known as a projection. The perspective projection is the most common projection in computer graphics, because it mimics the kinds of images produced by a typical camera. With this projection, objects appear smaller the farther away they are from the camera—an effect known as perspective foreshortening. The length-preserving orthographic projection is also used by some games, primarily for rendering plan views (e.g., front, side, and top) of 3D models or game levels for editing purposes, and for overlaying 2D graphics onto the screen for heads-up displays (HUDs) and the like. Figure 10.31 illustrates how a cube would look when rendered with these two types of projections.

10.1.4.3. The View Volume and the Frustum

The region of space that the camera can "see" is known as the view volume. A view volume is defined by six planes. The near plane corresponds to the virtual image-sensing surface. The four side planes correspond to the edges of the virtual screen. The far plane is used as a rendering optimization to ensure that extremely distant objects are not drawn. It also provides an upper limit for the depths that will be stored in the depth buffer (see Section 10.1.4.8).
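As a concrete illustration (not the book's implementation), the six bounding planes can be stored and tested as in the following sketch. The plane representation used here, discussed further just below, stores a unit normal n pointing toward the interior of the volume together with a scalar d such that n · P + d = 0 for points P on the plane, so a point is inside the volume when this signed distance is non-negative for all six planes; all type and function names are hypothetical.

    struct Vector3 { float x, y, z; };

    inline float dot(const Vector3& a, const Vector3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    struct Plane
    {
        Vector3 n; // unit normal, chosen to point toward the inside of the volume
        float   d; // plane equation: dot(n, P) + d == 0 for points P on the plane
    };

    struct ViewVolume
    {
        Plane planes[6]; // near, far, left, right, top, bottom
    };

    // Returns true if the point p lies inside (or on the boundary of) the volume.
    bool containsPoint(const ViewVolume& vol, const Vector3& p)
    {
        for (const Plane& pl : vol.planes)
        {
            if (dot(pl.n, p) + pl.d < 0.0f)
                return false; // behind one of the planes, hence outside
        }
        return true;
    }

The same test, applied to a mesh's bounding volume rather than to individual points, is the basis of frustum culling later in the pipeline.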
When rendering the scene with a perspective projection, the shape of the view volume is a truncated pyramid known as a frustum. When using an orthographic projection, the view volume is a rectangular prism. Perspective and orthographic view volumes are illustrated in Figure 10.32 and Figure 10.33, respectively.

The six planes of the view volume can be represented compactly using six four-element vectors (nxi, nyi, nzi, di), where n = (nx, ny, nz) is the plane normal and d is its perpendicular distance from the origin. If we prefer the point-normal plane representation, we can also describe the planes with six pairs of vectors (Qi, ni), where Q is an arbitrary point on the plane and n is the plane normal. (In both cases, i is the index of the plane.)

Figure 10.31. A cube rendered using a perspective projection (on the left) and an orthographic projection (on the right).

Figure 10.32. A perspective view volume (frustum).

10.1.4.4. Projection and Homogeneous Clip Space

Both perspective and orthographic projections transform points in view space into a coordinate space called homogeneous clip space. This three-dimensional space is really just a warped version of view space. The purpose of clip space is to convert the camera-space view volume into a canonical view volume that is independent both of the kind of projection used to convert the 3D scene into 2D screen space, and of the resolution and aspect ratio of the screen onto which the scene is going to be rendered.

In clip space, the canonical view volume is a rectangular prism extending from –1 to +1 along the x- and y-axes. Along the z-axis, the view volume extends either from –1 to +1 (OpenGL) or from 0 to 1 (DirectX). We call this coordinate system "clip space" because the view volume planes are axis-aligned, making it convenient to clip triangles to the view volume in this space (even when a perspective projection is being used). The canonical clip-space view volume for OpenGL is depicted in Figure 10.34. Notice that the z-axis of clip space goes into the screen, with y up and x to the right. In other words, homogeneous clip space is usually left-handed.

Figure 10.33. An orthographic view volume.

Figure 10.34. The canonical view volume in homogeneous clip space.

Perspective Projection

An excellent explanation of perspective projection is given in Section 4.5.1 of [28], so we won't repeat it here. Instead, we'll simply present the perspective projection matrix $\mathbf{M}_{V \to H}$ below. (The subscript V→H indicates that this matrix transforms vertices from view space into homogeneous clip space.) If we take view space to be right-handed, then the near plane intersects the z-axis at z = –n, and the far plane intersects it at z = –f. The virtual screen's left, right, bottom, and top edges lie at x = l, x = r, y = b, and y = t on the near plane, respectively. (Typically the virtual screen is centered on the camera-space z-axis, in which case l = –r and b = –t, but this isn't always the case.) Using these definitions, the perspective projection matrix for OpenGL is as follows:

$$\mathbf{M}_{V \to H} = \begin{bmatrix} \dfrac{2n}{r-l} & 0 & 0 & 0 \\[2mm] 0 & \dfrac{2n}{t-b} & 0 & 0 \\[2mm] \dfrac{r+l}{r-l} & \dfrac{t+b}{t-b} & -\dfrac{f+n}{f-n} & -1 \\[2mm] 0 & 0 & -\dfrac{2nf}{f-n} & 0 \end{bmatrix}.$$

DirectX defines the z-axis extents of the clip-space view volume to lie in the range [0, 1] rather than in the range [–1, 1] as OpenGL does. We can easily adjust the perspective projection matrix to account for DirectX's conventions as follows:

$$\left(\mathbf{M}_{V \to H}\right)_{\text{DirectX}} = \begin{bmatrix} \dfrac{2n}{r-l} & 0 & 0 & 0 \\[2mm] 0 & \dfrac{2n}{t-b} & 0 & 0 \\[2mm] \dfrac{r+l}{r-l} & \dfrac{t+b}{t-b} & -\dfrac{f}{f-n} & -1 \\[2mm] 0 & 0 & -\dfrac{nf}{f-n} & 0 \end{bmatrix}.$$

Division by Z

Perspective projection results in each vertex's x- and y-coordinates being divided by its z-coordinate. This is what produces perspective foreshortening. To understand why this happens, consider multiplying a view-space point $\mathbf{p}_V$, expressed in four-element homogeneous coordinates, by the OpenGL perspective projection matrix:

$$\mathbf{p}_H = \mathbf{p}_V \mathbf{M}_{V \to H} = [\,p_{Vx}\ \ p_{Vy}\ \ p_{Vz}\ \ 1\,] \begin{bmatrix} \dfrac{2n}{r-l} & 0 & 0 & 0 \\[2mm] 0 & \dfrac{2n}{t-b} & 0 & 0 \\[2mm] \dfrac{r+l}{r-l} & \dfrac{t+b}{t-b} & -\dfrac{f+n}{f-n} & -1 \\[2mm] 0 & 0 & -\dfrac{2nf}{f-n} & 0 \end{bmatrix}.$$

The result of this multiplication takes the form

$$\mathbf{p}_H = [\,a\ \ b\ \ c\ \ -p_{Vz}\,]. \qquad (10.1)$$

When we convert any homogeneous vector into three-dimensional coordinates, the x-, y-, and z-components are divided by the w-component:

$$[\,x\ \ y\ \ z\ \ w\,] \equiv \left[\,\frac{x}{w}\ \ \frac{y}{w}\ \ \frac{z}{w}\,\right].$$

So, after dividing Equation (10.1) by the homogeneous w-component, which is really just the negated view-space z-coordinate $-p_{Vz}$, we have:

$$\mathbf{p}_H = \left[\,\frac{a}{-p_{Vz}}\ \ \frac{b}{-p_{Vz}}\ \ \frac{c}{-p_{Vz}}\,\right] = [\,p_{Hx}\ \ p_{Hy}\ \ p_{Hz}\,].$$

Thus the homogeneous clip space coordinates have been divided by the view-space z-coordinate, which is what causes perspective foreshortening.

Perspective-Correct Vertex Attribute Interpolation

In Section 10.1.2.4, we learned that vertex attributes are interpolated in order to determine appropriate values for them within the interior of a triangle. Attribute interpolation is performed in screen space. We iterate over each pixel of the screen and attempt to determine the value of each attribute at the corresponding location on the surface of the triangle. When rendering a scene with a perspective projection, we must do this very carefully so as to account for perspective foreshortening. This is known as perspective-correct attribute interpolation.

A derivation of perspective-correct interpolation is beyond our scope, but suffice it to say that we must divide our interpolated attribute values by the corresponding z-coordinates (depths) at each vertex. For any pair of vertex attributes A1 and A2, we can write the interpolated attribute at a percentage t of the distance between them as follows:

$$\frac{A}{p_z} = (1-t)\,\frac{A_1}{p_{1z}} + t\,\frac{A_2}{p_{2z}} = \mathrm{LERP}\!\left(\frac{A_1}{p_{1z}},\ \frac{A_2}{p_{2z}},\ t\right).$$

Refer to [28] for an excellent derivation of the math behind perspective-correct attribute interpolation.

Orthographic Projection

An orthographic projection is performed by the following matrix:

$$\left(\mathbf{M}_{V \to H}\right)_{\text{ortho}} = \begin{bmatrix} \dfrac{2}{r-l} & 0 & 0 & 0 \\[2mm] 0 & \dfrac{2}{t-b} & 0 & 0 \\[2mm] 0 & 0 & -\dfrac{2}{f-n} & 0 \\[2mm] -\dfrac{r+l}{r-l} & -\dfrac{t+b}{t-b} & -\dfrac{f+n}{f-n} & 1 \end{bmatrix}.$$

This is just an everyday scale-and-translate matrix. (The upper-left 3 × 3 contains a diagonal nonuniform scaling matrix, and the lower row contains the translation.) Since the view volume is a rectangular prism in both view space and clip space, we need only scale and translate our vertices to convert from one space to the other.

10.1.4.5.
Screen Space and Aspect Ratios Screen space is a two-dimensional coordinate system whose axes are mea- sured in terms of screen pixels. The x-axis typically points to the right, with the origin at the top-left corner of the screen and y pointing down. (The reason for the inverted y-axis is that CRT monitors scan the screen from top to bot- tom.) The ratio of screen width to screen height is known as the aspect ratio. The most common aspect ratios are 4:3 (the aspect ratio of a traditional tele- vision screen) and 16:9 (the aspect ratio of a movie screen or HDTV). These aspect ratios are illustrated in Figure 10.35. 10.1. Foundations of Depth-Buffered Triangle Rasterization 440 10. The Rendering Engine We can render triangles expressed in homogeneous clip space by simply drawing their (x, y) coordinates and ignoring z. But before we do, we scale and shift the clip-space coordinates so that they lie in screen space rather than within the normalized unit square. This scale-and-shift operation is known as screen mapping . 10.1.4.6. The Frame Buffer The fi nal rendered image is stored in a bitmapped color buff er known as the frame buff er . Pixel colors are usually stored in RGBA8888 format, although other frame buff er formats are supported by most graphics cards as well. Some com- mon formats include RGB565, RGB5551, and one or more palett ed modes. The display hardware (CRT, fl at-screen monitor, HDTV, etc.) reads the contents of the frame buff er at a periodic rate of 60 Hz for NTSC televisions used in North America and Japan, or 50 Hz for PAL /SECAM televisions used in Europe and many other places in the world. Rendering engines typically maintain at least two frame buff ers. While one is being scanned by the dis- play hardware, the other one can be updated by the rendering engine. This is known as double buff ering . By swapping or “fl ipping” the two buff ers during the vertical blanking interval (the period during which the CRT’s electron gun is being reset to the top-left corner of the screen), double buff ering ensures that the display hardware always scans the complete frame buff er. This avoids a jarring eff ect known as tearing , in which the upper portion of the screen dis- plays the newly rendered image while the bott om shows the remnants of the previous frame’s image. Some engines make use of three frame buff ers—a technique aptly known as triple buff ering . This is done so that the rendering engine can start work on the next frame, even when the previous frame is still being scanned by the display hardware. For example, the hardware might still be scanning buff er A when the engine fi nishes drawing buff er B. With triple buff ering, it can pro- xS 4:3yS xS 16:9yS Figure 10.35. The two most prevalent screen space aspect ratios are 4:3 and 16:9. 441 ceed to render a new frame into buff er C, rather than idling while it waits for the display hardware to fi nish scanning buff er A. Render Targets Any buff er into which the rendering engine draws graphics is known as a ren- der target . As we’ll see later in this chapter, rendering engines make use of all sorts of other off -screen render targets, in addition to the frame buff ers. These include the depth buff er , the stencil buff er , and various other buff ers used for storing intermediate rendering results. 10.1.4.7. Triangle Rasterization and Fragments To produce an image of a triangle on-screen, we need to fi ll in the pixels it overlaps. This process is known as rasterization . 
During rasterization, the tri- angle’s surface is broken into pieces called fragments , each one representing a small region of the triangle’s surface that corresponds to a single pixel on the screen. (In the case of multisample antialiasing, a fragment corresponds to a portion of a pixel—see below.) A fragment is like a pixel in training. Before it is writt en into the frame buff er, it must pass a number of tests (described in more depth below). If it fails any of these tests, it will be discarded. Fragments that pass the tests are shaded (i.e., their colors are determined), and the fragment color is either writ- ten into the frame buff er or blended with the pixel color that’s already there. Figure 10.36 illustrates how a fragment becomes a pixel. Fragment Pixel Figure 10.36. A fragment is a small region of a triangle corresponding to a pixel on the screen. It passes through the rendering pipeline and is either discarded or its color is written into the frame buffer. 10.1. Foundations of Depth-Buffered Triangle Rasterization Antialiasing When a triangle is rasterized, its edges can look jagged—the familiar “stair step” eff ect we have all come to know and love (or hate). Technically speak- 442 10. The Rendering Engine ing, aliasing arises because we are using a discrete set of pixels to sample an image that is really a smooth, continuous two-dimensional signal. (In the fre- quency domain, sampling causes a signal to be shift ed and copied multiple times along the frequency axis. Aliasing literally means that these copies of the signal overlap and get confused with one another.) Antialiasing is a technique that reduces the visual artifacts caused by alias- ing. In eff ect, antialiasing causes the edges of the triangle to be blended with the surrounding colors in the frame buff er. There are a number of ways to antialias a 3D rendered image. In full-screen antialiasing (FSAA), the image is rendered into a frame buff er that is twice as wide and twice as tall as the actual screen. The resulting image is down- sampled to the desired resolution aft erwards. FSAA can be expensive because rendering a double-sized frame means fi lling four times the number of pixels. FSAA frame buff ers also consume four times the memory of a regular frame buff er. Modern graphics hardware can antialias a rendered image without ac- tually rendering a double-size image, via a technique called multisample an- tialiasing (MSAA). The basic idea is to break a triangle down into more than one fragment per pixel. These supersampled fragments are combined into a single pixel at the end of the pipeline. MSAA does not require a double-width frame buff er, and it can handle higher levels of supersampling as well. (4× and 8× supersampling are commonly supported by modern GPUs.) 10.1.4.8. Occlusion and the Depth Buffer When rendering two triangles that overlap each other in screen space, we need some way of ensuring that the triangle that is closer to the camera will appear on top. We could accomplish this by always rendering our triangles in Figure 10.37. The painter’s algorithm renders triangles in a back-to-front order to produce proper triangle occlusion. However, the algorithm breaks down when triangles intersect one another. 443 back-to-front order (the so-called painter’s algorithm ). However, as shown in Figure 10.37, this doesn’t work if the triangles are intersecting one another. 
To implement triangle occlusion properly, independent of the order in which the triangles are rendered, rendering engines use a technique known as depth buffering or z-buffering. The depth buffer is a full-screen buffer that typically contains 16- or 24-bit floating-point depth information for each pixel in the frame buffer. Every fragment has a z-coordinate that measures its depth "into" the screen. (The depth of a fragment is found by interpolating the depths of the triangle's vertices.) When a fragment's color is written into the frame buffer, its depth is stored into the corresponding pixel of the depth buffer. When another fragment (from another triangle) is drawn into the same pixel, the engine compares the new fragment's depth to the depth already present in the depth buffer. If the fragment is closer to the camera (i.e., if it has a smaller depth), it overwrites the pixel in the frame buffer. Otherwise the fragment is discarded.

Z-Fighting and the W-Buffer

When rendering parallel surfaces that are very close to one another, it's important that the rendering engine can distinguish between the depths of the two planes. If our depth buffer had infinite precision, this would never be a problem. Unfortunately, a real depth buffer only has limited precision, so the depth values of two planes can collapse into a single discrete value when the planes are close enough together. When this happens, the more-distant plane's pixels start to "poke through" the nearer plane, resulting in a noisy effect known as z-fighting.

To reduce z-fighting to a minimum across the entire scene, we would like to have equal precision whether we're rendering surfaces that are close to the camera or far away. However, with z-buffering this is not the case. The precision of clip-space z-depths (p_Hz) is not evenly distributed across the entire range from the near plane to the far plane, because of the division by the view-space z-coordinate. Because of the shape of the 1/z curve, most of the depth buffer's precision is concentrated near the camera. The plot of the function p_Hz = 1/p_Vz shown in Figure 10.38 demonstrates this effect. Near the camera, the distance between two planes in view space, Δp_Vz, gets transformed into a reasonably large delta in clip space, Δp_Hz. But far from the camera, this same separation gets transformed into a tiny delta in clip space. The result is z-fighting, and it becomes rapidly more prevalent as objects get farther away from the camera.

Figure 10.38. A plot of the function p_Hz = 1/p_Vz, showing how most of the precision lies close to the camera.

To circumvent this problem, we would like to store view-space z-coordinates (p_Vz) in the depth buffer instead of clip-space z-coordinates (p_Hz). View-space z-coordinates vary linearly with the distance from the camera, so using them as our depth measure achieves uniform precision across the entire depth range. This technique is called w-buffering, because the view-space z-coordinate conveniently appears in the w-component of our homogeneous clip-space coordinates. (Recall from Equation (10.1) that p_Hw = −p_Vz.)

The terminology can be very confusing here. The z- and w-buffers store coordinates that are expressed in clip space. But in terms of view-space coordinates, the z-buffer stores 1/z (i.e., 1/p_Vz), while the w-buffer stores z (i.e., p_Vz)!
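To make the depth test itself concrete, here is a minimal C++ sketch of the fragment-write logic described at the start of this section. It is illustrative only: the RenderTarget structure and writeFragment() function are hypothetical, and on real hardware this comparison is performed by the GPU's raster operations hardware rather than by application code.

    #include <cstdint>
    #include <limits>
    #include <vector>

    // Hypothetical color buffer + depth buffer pair, for illustration only.
    struct RenderTarget
    {
        int width  = 0;
        int height = 0;
        std::vector<std::uint32_t> color;  // RGBA8888 pixel colors
        std::vector<float>         depth;  // one depth value per pixel

        RenderTarget(int w, int h)
            : width(w), height(h),
              color(w * h, 0u),
              depth(w * h, std::numeric_limits<float>::max())  // cleared to "infinitely far"
        {
        }
    };

    // Write one fragment, applying the depth test: the fragment only lands in
    // the frame buffer if it is closer to the camera than what is already there.
    void writeFragment(RenderTarget& rt, int x, int y,
                       std::uint32_t fragColor, float fragDepth)
    {
        const int index = y * rt.width + x;

        if (fragDepth < rt.depth[index])   // closer than the existing pixel?
        {
            rt.color[index] = fragColor;   // overwrite the color...
            rt.depth[index] = fragDepth;   // ...and record the new depth
        }
        // otherwise the fragment is occluded and simply discarded
    }

Because the depth buffer is cleared to the farthest representable value each frame, the first fragment drawn to any pixel always passes the test.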
We should note here that the w-buffering approach is a bit more expensive than its z-based counterpart. This is because with w-buffering, we cannot linearly interpolate depths directly. Depths must be inverted prior to interpolation and then re-inverted prior to being stored in the w-buffer.

10.2. The Rendering Pipeline

Now that we've completed our whirlwind tour of the major theoretical and practical underpinnings of triangle rasterization, let's turn our attention to how it is typically implemented. In real-time game rendering engines, the high-level rendering steps described in Section 10.1 are implemented using a software/hardware architecture known as a pipeline. A pipeline is just an ordered chain of computational stages, each with a specific purpose, operating on a stream of input data items and producing a stream of output data.

Each stage of a pipeline can typically operate independently of the other stages. Hence, one of the biggest advantages of a pipelined architecture is that it lends itself extremely well to parallelization. While the first stage is chewing on one data element, the second stage can be processing the results previously produced by the first stage, and so on down the chain.

Parallelization can also be achieved within an individual stage of the pipeline. For example, if the computing hardware for a particular stage is duplicated N times on the die, N data elements can be processed in parallel by that stage. A parallelized pipeline is shown in Figure 10.39. Ideally the stages operate in parallel (most of the time), and certain stages are capable of operating on multiple data items simultaneously as well.

Figure 10.39. A parallelized pipeline. The stages all operate in parallel, and some stages are capable of operating on multiple data items simultaneously as well.

The throughput of a pipeline measures how many data items are processed per second overall. The pipeline's latency measures the amount of time it takes for a single data element to make it through the entire pipeline. The latency of an individual stage measures how long that stage takes to process a single item.

The slowest stage of a pipeline dictates the throughput of the entire pipeline. It also has an impact on the average latency of the pipeline as a whole. Therefore, when designing a rendering pipeline, we attempt to minimize and balance latency across the entire pipeline and eliminate bottlenecks. In a well-designed pipeline, all the stages operate simultaneously, and no stage is ever idle for very long waiting for another stage to become free.

10.2.1. Overview of the Rendering Pipeline

Some graphics texts divide the rendering pipeline into three coarse-grained stages. In this book, we'll extend this pipeline back even further, to encompass the offline tools used to create the scenes that are ultimately rendered by the game engine. The high-level stages in our pipeline are:

Tools stage (offline). Geometry and surface properties (materials) are defined.

Asset conditioning stage (offline). The geometry and material data are processed by the asset conditioning pipeline (ACP) into an engine-ready format.

Application stage (CPU). Potentially visible mesh instances are identified and submitted to the graphics hardware along with their materials for rendering.

Geometry processing stage (GPU).
Vertices are transformed and lit and projected into homogeneous clip space. Triangles are processed by the optional geometry shader and then clipped to the frustum. Rasterization stage (GPU). Triangles are converted into fragments that are shaded, passed through various tests (z test, alpha test, stencil test, etc.) and fi nally blended into the frame buff er. 10.2.1.1. How the Rendering Pipeline Transforms Data It’s interesting to note how the format of geometry data changes as it passes through the rendering pipeline. The tools and asset conditioning stages deal with meshes and materials. The application stage deals in terms of mesh in- stances and submeshes, each of which is associated with a single material. During the geometry stage, each submesh is broken down into individual ver- tices, which are processed largely in parallel. At the conclusion of this stage, the triangles are reconstructed from the fully transformed and shaded verti- ces. In the rasterization stage, each triangle is broken into fragments, and these fragments are either discarded, or they are eventually writt en into the frame buff er as colors. This process is illustrated in Figure 10.40. Tools ACP Application Geometry Processing VerticeVerticesMesh Instance Submeshes TexturesMaterials Textures Mesh Materials Materials Textures Rasterization VerticeFragments VerticePixels VerticeTriangles Figure 10.40. The format of geometric data changes radically as it passes through the vari- ous stages of the rendering pipeline. 447 10.2.1.2. Implementation of the Pipeline The fi rst two stages of the rendering pipeline are implemented offl ine, usually executed by a PC or Linux machine. The application stage is run either by the main CPU of the game console or PC, or by parallel processing units like the PS3’s SPUs. The geometry and rasterization stages are usually implemented on the graphics processing unit (GPU). In the following sections, we’ll explore some of the details of how each of these stages is implemented. 10.2.2. The Tools Stage In the tools stage, meshes are authored by 3D modelers in a digital content creation (DCC) application like Maya , 3ds Max , Lightwave , Soft image/XSI , SketchUp , etc. The models may be defi ned using any convenient surface de- scription—NURBS, quads, triangles, etc. However, they are invariably tessel- lated into triangles prior to rendering by the runtime portion of the pipeline. The vertices of a mesh may also be skinned . This involves associating each vertex with one or more joints in an articulated skeletal structure, along with weights describing each joint’s relative infl uence over the vertex. Skin- ning information and the skeleton are used by the animation system to drive the movements of a model—see Chapter 11 for more details. Materials are also defi ned by the artists during the tools stage. This in- volves selecting a shader for each material, selecting textures as required by the shader, and specifying the confi guration parameters and options of each shader. Textures are mapped onto the surfaces, and other vertex att ributes are also defi ned, oft en by “painting” them with some kind of intuitive tool within the DCC application. Materials are usually authored using a commercial or custom in-house material editor . The material editor is sometimes integrated directly into the DCC application as a plug-in, or it may be a stand-alone program. 
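Whatever form the material editor takes, the data it produces generally boils down to something like the following C++ sketch: a reference to a shader, plus the textures and parameter values that shader expects. The structure and field names here are hypothetical and do not reflect the format of any particular engine.

    #include <map>
    #include <string>
    #include <vector>

    // A hypothetical engine-side description of an authored material: which
    // shader to use, which textures it samples, and the values of its
    // configurable parameters. Real engines differ widely in the details.
    struct MaterialDesc
    {
        std::string shaderName;  // e.g., "phong_skinned"

        // Texture bindings, keyed by the sampler name the shader expects.
        std::map<std::string, std::string> textures;  // sampler name -> texture asset path

        // Scalar and vector shader parameters (packed as floats for simplicity).
        std::map<std::string, std::vector<float>> params;
    };

    MaterialDesc makeExampleMaterial()
    {
        MaterialDesc m;
        m.shaderName = "phong_skinned";
        m.textures["diffuseMap"]  = "textures/hero_diffuse.tga";
        m.textures["normalMap"]   = "textures/hero_normal.tga";
        m.params["specularColor"] = { 1.0f, 1.0f, 0.9f, 1.0f };
        m.params["specularPower"] = { 32.0f };
        return m;
    }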
Some mate- rial editors are live-linked to the game, so that material authors can see what the materials will look like in the real game. Other editors provide an offl ine 3D visualization view. Some editors even allow shader programs to be writt en and debugged by the artist or a shader engineer. NVIDIA’s Fx Composer is an example of such a tool; it is depicted in Figure 10.41. Both FxComposer and Unreal Engine 3 provide powerful graphical shad- ing languages . Such tools allow rapid prototyping of visual eff ects by con- necting various kinds of nodes together with a mouse. These tools generally provide a WYSIWYG display of the resulting material. The shaders created by a graphical language usually need to be hand-optimized by a rendering engineer, because a graphical language invariably trades some runtime per- 10.2. The Rendering Pipeline 448 10. The Rendering Engine Figure 10.42. The Unreal Engine 3 graphical shader language. Figure 10.41. Nvidia’s Fx Composer allows shader programs to be written, previsualized, and debugged easily. formance for its incredible fl exibility, generality, and ease of use. The Unreal graphical shader editor is shown in Figure 10.42. Materials may be stored and managed with the individual meshes. How- ever, this can lead to duplication of data—and eff ort. In many games, a rela- tively small number of materials can be used to defi ne a wide range of objects in the game. For example, we might defi ne some standard, reusable materials 449 like wood, rock, metal, plastic, cloth, skin, and so on. There’s no reason to du- plicate these materials inside every mesh. Instead, many game teams build up a library of materials from which to choose, and the individual meshes refer to the materials in a loosely-coupled manner. 10.2.3. The Asset Conditioning Stage The asset conditioning stage is itself a pipeline, sometimes called the asset conditioning pipeline or ACP. As we saw in Section 6.2.1.4, its job is to export, process, and link together multiple types of assets into a cohesive whole. For example, a 3D model is comprised of geometry (vertex and index buff ers), materials, textures, and an optional skeleton. The ACP ensures that all of the individual assets referenced by a 3D model are available and ready to be load- ed by the engine. Geometric and material data is extracted from the DCC application and is usually stored in a platform-independent intermediate format. The data is then further processed into one or more platform-specifi c formats, depend- ing on how many target platforms the engine supports. Ideally the platform- specifi c assets produced by this stage are ready to load into memory and use with litt le or no postprocessing at runtime. For example, mesh data targeted for the Xbox 360 might be output as index and vertex buff ers that are ready to be uploaded to video RAM; on the PS3, geometry might be produced in compressed data streams that are ready to be DMA’d to the SPUs for decom- pression. The ACP oft en takes the needs of the material/shader into account when building assets. For example, a particular shader might require tangent and bitangent vectors as well as a vertex normal; the ACP could generate these vectors automatically. High-level scene graph data structures may also be computed during the asset conditioning stage. For example, static level geometry may be processed in order to build a BSP tree. 
(As we’ll investigate in Section 10.2.7.4, scene graph data structures help the rendering engine to very quickly determine which objects should be rendered, given a particular camera position and ori- entation.) Expensive lighting calculations are oft en done offl ine as part of the as- set conditioning stage. This is called static lighting ; it may include calcula- tion of light colors at the vertices of a mesh (this is called “baked” vertex lighting), construction of texture maps that encode per-pixel lighting in- formation known as light maps , calculation of precomputed radiance transfer (PRT) coeffi cients (usually represented by spherical harmonic functions), and so on. 10.2. The Rendering Pipeline 450 10. The Rendering Engine 10.2.4. A Brief History of the GPU In the early days of game development, all rendering was done on the CPU. Games like Castle Wolfenstein 3D and Doom pushed the limits of what early PCs could do, rendering interactive 3D scenes without any help from special- ized graphics hardware (other than a standard VGA card). As the popularity of these and other PC games took off , graphics hard- ware was developed to offl oad work from the CPU. The earliest graphics ac- celerators, like 3Dfx’s Voodoo line of cards, handled only the most expensive stage in the pipeline—the rasterization stage. Subsequent graphics accelera- tors provided support for the geometry processing stage as well. At fi rst, graphics hardware provided only a hard-wired but confi gurable implementation known as the fi xed-function pipeline . This technology was known as hardware transformation and lighting , or hardware T&L for short. Later, certain substages of the pipeline were made programmable. Engineers could now write programs called shaders to control exactly how the pipeline pro- cessed vertices (vertex shaders) and fragments (fragment shaders, more common- ly known as pixel shaders). With the introduction of DirectX 10, a third type of shader known as a geometry shader was added. It permits rendering engineers to modify, cull, or create entire primitives (triangles, lines, and points). Graphics hardware has evolved around a specialized type of micropro- cessor known as the graphics processing unit or GPU. A GPU is designed to maximize throughput of the pipeline, which it achieves through massive par- allelization . For example, a modern GPU like the GeForce 8800 can process 128 vertices or fragments simultaneously. Even in its fully programmable form, a GPU is not a general-purpose microprocessor—nor should it be. A GPU achieves its high processing speeds (on the order of terafl ops on today’s GPUs) by carefully controlling the fl ow of data through the pipeline . Certain pipeline stages are either entirely fi xed in their function, or they are confi gurable but not programmable. Memory can only be accessed in controlled ways, and specialized data caches are used to minimize unnecessary duplication of computations. In the following sections, we’ll briefl y explore the architecture of a mod- ern GPU and see how the runtime portion of the rendering pipeline is typi- cally implemented. We’ll speak primarily about current GPU architectures, which are used on personal computers with the latest graphics cards and on console platforms like the Xbox 360 and the PS3. However, not all platforms support all of the features we’ll be discussing here. 
For example, the Wii does not support programmable shaders, and most PC games need to provide fallback rendering solutions to support older graphics cards with only limited programmable shader support.

10.2.5. The GPU Pipeline

Virtually all GPUs break the pipeline into the substages described below and depicted in Figure 10.43. Each stage is shaded to indicate whether its functionality is programmable, fixed but configurable, or fixed and non-configurable.

Figure 10.43. The geometry processing and rasterization stages of the rendering pipeline, as implemented by a typical GPU. The white stages are programmable, the light grey stages are configurable, and the dark grey boxes are fixed-function.

10.2.5.1. Vertex Shader

This stage is fully programmable. It is responsible for transformation and shading/lighting of individual vertices. The input to this stage is a single vertex (although in practice many vertices are processed in parallel). Its position and normal are typically expressed in model space or world space. The vertex shader handles transformation from model space to view space via the model-view transform. Perspective projection is also applied, as well as per-vertex lighting and texturing calculations, and skinning for animated characters. The vertex shader can also perform procedural animation by modifying the position of the vertex. Examples of this include foliage that sways in the breeze or an undulating water surface. The output of this stage is a fully transformed and lit vertex, whose position and normal are expressed in homogeneous clip space (see Section 10.1.4.4).

On modern GPUs, the vertex shader has full access to texture data, a capability that used to be available only to the pixel shader. This is particularly useful when textures are used as stand-alone data structures like height maps or look-up tables.

10.2.5.2. Geometry Shader

This optional stage is also fully programmable. The geometry shader operates on entire primitives (triangles, lines, and points) in homogeneous clip space. It is capable of culling or modifying input primitives, and it can also generate new primitives. Typical uses include shadow volume extrusion (see Section 10.3.3.1), rendering the six faces of a cube map (see Section 10.3.1.4), fur fin extrusion around silhouette edges of meshes, creation of particle quads from point data (see Section 10.4.1), dynamic tessellation, fractal subdivision of line segments for lightning effects, cloth simulations, and the list goes on.

10.2.5.3. Stream Output

Modern GPUs permit the data that has been processed up to this point in the pipeline to be written back to memory. From there, it can then be looped back to the top of the pipeline for further processing. This feature is called stream output.

Stream output permits a number of intriguing visual effects to be achieved without the aid of the CPU. An excellent example is hair rendering. Hair is often represented as a collection of cubic spline curves. It used to be that hair physics simulation would be done on the CPU. The CPU would also tessellate the splines into line segments. Finally the GPU would render the segments.
With stream output, the GPU can do the physics simulation on the control points of the hair splines within the vertex shader. The geometry shader tes- sellates the splines, and the stream output feature is used to write the tessel- lated vertex data to memory. The line segments are then piped back into the top of the pipeline so they can be rendered. 10.2.5.4. Clipping The clipping stage chops off those portions of the triangles that straddle the frustum . Clipping is done by identifying vertices that lie outside the frustum and then fi nding the intersection of the triangle’s edges with the planes of the frustum. These intersection points become new vertices that defi ne one or more clipped triangles. This stage is fi xed in function, but it is somewhat confi gurable. For ex- ample, user-defi ned clipping planes can be added in addition to the frustum planes. This stage can also be confi gured to cull triangles that lie entirely out- side the frustum. 10.2.5.5. Screen Mapping Screen mapping simply scales and shift s the vertices from homogeneous clip space into screen space. This stage is entirely fi xed and non-confi gurable. 10.2.5.6. Triangle Setup During triangle setup , the rasterization hardware is initialized for effi cient conversion of the triangle into fragments. This stage is not confi gurable. 453 10.2.5.7. Triangle Traversal Each triangle is broken into fragments (i.e., rasterized) by the triangle travers- al stage. Usually one fragment is generated for each pixel, although with mul- tisample antialiasing (MSAA), multiple fragments are created per pixel (see Section 10.1.4.7). The triangle traversal stage also interpolates vertex att ributes in order to generate per-fragment att ributes for processing by the pixel shader. Perspective-correct interpolation is used where appropriate. This stage’s func- tionality is fi xed and not confi gurable. 10.2.5.8. Early Z Test Many graphics cards are capable of checking the depth of the fragment at this point in the pipeline, discarding it if it is being occluded by the pixel already in the frame buff er. This allows the (potentially very expensive) pixel shader stage to be skipped entirely for occluded fragments. Surprisingly, not all graphics hardware supports depth testing at this stage of the pipeline. In older GPU designs, the z test was done along with al- pha testing, aft er the pixel shader had run. For this reason, this stage is called the early z test or early depth test stage. 10.2.5.9. Pixel Shader This stage is fully programmable. Its job is to shade (i.e., light and otherwise process) each fragment. The pixel shader can also discard fragments, for ex- ample because they are deemed to be entirely transparent. The pixel shader can address one or more texture maps, run per-pixel lighting calculations, and do whatever else is necessary to determine the fragment’s color. The input to this stage is a collection of per-fragment att ributes (which have been interpolated from the vertex att ributes by the triangle traversal stage). The output is a single color vector describing the desired color of the fragment. 10.2.5.10. Merging / Raster Operations Stage The fi nal stage of the pipeline is known as the merging stage or blending stage, also known as the raster operations stage or ROP in NVIDIA parlance. This stage is not programmable, but it is highly confi gurable. 
It is responsible for running various fragment tests, including the depth test (see Section 10.1.4.8), alpha test (in which the values of the fragment's and pixel's alpha channels can be used to reject certain fragments), and stencil test (see Section 10.3.3.1).

If the fragment passes all of the tests, its color is blended (merged) with the color that is already present in the frame buffer. The way in which blending occurs is controlled by the alpha blending function, a function whose basic structure is hard-wired but whose operators and parameters can be configured in order to produce a wide variety of blending operations.

Alpha blending is most commonly used to render semi-transparent geometry. In this case, the following blending function is used:

C'_D = A_S C_S + (1 − A_S) C_D.

The subscripts S and D stand for "source" (the incoming fragment) and "destination" (the pixel in the frame buffer), respectively. Therefore, the color that is written into the frame buffer (C'_D) is a weighted average of the existing frame buffer contents (C_D) and the color of the fragment being drawn (C_S). The blend weight (A_S) is just the source alpha of the incoming fragment.

For alpha blending to look right, the semi-transparent and translucent surfaces in the scene must be sorted and rendered in back-to-front order, after the opaque geometry has been rendered to the frame buffer. This is because after alpha blending has been performed, the depth of the new fragment overwrites the depth of the pixel with which it was blended. In other words, the depth buffer ignores transparency (unless depth writes have been turned off, of course). If we are rendering a stack of translucent objects on top of an opaque backdrop, the resulting pixel color should ideally be a blend between the opaque surface's color and the colors of all of the translucent surfaces in the stack. If we try to render the stack in any order other than back-to-front, depth test failures will cause some of the translucent fragments to be discarded, resulting in an incomplete blend (and a rather odd-looking image).

Other alpha blending functions can be defined as well, for purposes other than transparency blending. The general blending equation takes the form

C'_D = (w_S ⊗ C_S) + (w_D ⊗ C_D),

where the weighting factors w_S and w_D can be selected by the programmer from a predefined set of values including zero, one, source or destination color, source or destination alpha, and one minus the source or destination color or alpha. The operator ⊗ is either a regular scalar-vector multiplication or a component-wise vector-vector multiplication (a Hadamard product; see Section 4.2.4.1), depending on the data types of w_S and w_D.

10.2.6. Programmable Shaders

Now that we have an end-to-end picture of the GPU pipeline in mind, let's take a deeper look at the most interesting part of the pipeline: the programmable shaders. Shader architectures have evolved significantly since their introduction with DirectX 8. Early shader models supported only low-level assembly language programming, and the instruction set and register set of the pixel shader differed significantly from those of the vertex shader. DirectX 9 brought with it support for high-level C-like shader languages such as Cg (C for graphics), HLSL (High-Level Shading Language, Microsoft's implementation of the Cg language), and GLSL (the OpenGL shading language).
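Before we dig into shader programming, here is a small C++ sketch of the blending math described in the previous section. It is purely illustrative: the Color type, blend-factor enumeration, and function names are hypothetical, and in practice this arithmetic is performed by the GPU's blending hardware, configured through the graphics API rather than written by hand.

    struct Color { float r, g, b, a; };

    Color operator*(float s, const Color& c) { return { s * c.r, s * c.g, s * c.b, s * c.a }; }
    Color operator+(const Color& x, const Color& y) { return { x.r + y.r, x.g + y.g, x.b + y.b, x.a + y.a }; }

    // Standard transparency blend: C'_D = A_S * C_S + (1 - A_S) * C_D.
    Color alphaBlend(const Color& src, const Color& dst)
    {
        const float aS = src.a;
        return aS * src + (1.0f - aS) * dst;
    }

    // A generalized blend in the spirit of C'_D = (wS * C_S) + (wD * C_D),
    // where the weights are chosen from a small predefined set of factors.
    enum class BlendFactor { Zero, One, SrcAlpha, OneMinusSrcAlpha };

    float evalFactor(BlendFactor f, const Color& src)
    {
        switch (f)
        {
        case BlendFactor::Zero:             return 0.0f;
        case BlendFactor::One:              return 1.0f;
        case BlendFactor::SrcAlpha:         return src.a;
        case BlendFactor::OneMinusSrcAlpha: return 1.0f - src.a;
        }
        return 0.0f;
    }

    Color blend(const Color& src, const Color& dst, BlendFactor wS, BlendFactor wD)
    {
        return evalFactor(wS, src) * src + evalFactor(wD, src) * dst;
    }

Choosing wS = SrcAlpha and wD = OneMinusSrcAlpha reproduces the standard transparency blend shown above.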
With DirectX 10, the geometry shader was introduced, and with it came a unifi ed shader architecture called shader model 4.0 in DirectX parlance. In the unifi ed shader model, all three types of shaders support roughly the same instruction set and have roughly the same set of capabilities, including the ability to read texture memory. A shader takes a single element of input data and transforms it into zero or more elements of output data. In the case of the vertex shader, the input is a vertex whose position and normal are expressed in model space or world space. The output of the vertex shader is a fully transformed and lit vertex, expressed in homo- geneous clip space. The input to the geometry shader is a single n-vertex primitive—a point (n = 1), line segment (n = 2), or triangle (n = 3)—with up to n additional vertices that act as control points. The output is zero or more primitives, possibly of a diff erent type than the input. For example, the geometry shader could convert points into two-triangle quads, or it could trans- form triangles into triangles but optionally discard some triangles, and so on. The pixel shader’s input is a fragment whose att ributes have been in- terpolated from the three vertices of the triangle from which it came. The output of the pixel shader is the color that will be writt en into the frame buff er (presuming the fragment passes the depth test and other optional tests). The pixel shader is also capable of discarding fragments explicitly, in which case it produces no output. 10.2.6.1. Accessing Memory Because the GPU implements a data processing pipeline, access to RAM is very carefully controlled. A shader program cannot read from or write to memory directly. Instead, its memory accesses are limited to two methods: registers and texture maps. Shader Registers A shader can access RAM indirectly via registers. All GPU registers are in 128- bit SIMD format. Each register is capable of holding four 32-bit fl oating-point or integer values (represented by the float4 data type in the Cg language). Such a register can contain a four-element vector in homogeneous coordinates or a color in RGBA format, with each component in 32-bit fl oating-point for- 10.2. The Rendering Pipeline 456 10. The Rendering Engine mat. Matrices can be represented by groups of three or four registers (rep- resented by built-in matrix types like float4x4 in Cg). A GPU register can also be used to hold a single 32-bit scalar, in which case the value is usually replicated across all four 32-bit fi elds. Some GPUs can operate on 16-bit fi elds, known as halfs. (Cg provides various built-in types like half4 and half4x4 for this purpose.) Registers come in four fl avors, as follows: Input registers. These registers are the shader’s primary source of input data. In a vertex shader, the input registers contain att ribute data ob- tained directly from the vertices. In a pixel shader, the input registers contain interpolated vertex att ribute data corresponding to a single fragment. The values of all input registers are set automatically by the GPU prior to invoking the shader. Constant registers. The values of constant registers are set by the applica- tion and can change from primitive to primitive. Their values are con- stant only from the point of view of the shader program. They provide a secondary form of input to the shader. 
Typical contents include the model-view matrix, the projection matrix, light parameters, and any other parameters required by the shader that are not available as vertex att ributes. Temporary registers. These registers are for use by the shader program inter- nally and are typically used to store intermediate results of calculations. Output registers. The contents of these registers are fi lled in by the shader and serve as its only form of output. In a vertex shader, the output regis- ters contain vertex att ributes such as the transformed position and nor- mal vectors in homogeneous clip space, optional vertex colors, texture coordinates, and so on. In a pixel shader, the output register contains the fi nal color of the fragment being shaded. The application provides the values of the constant registers when it sub- mits primitives for rendering. The GPU automatically copies vertex or frag- ment att ribute data from video RAM into the appropriate input registers prior to calling the shader program, and it also writes the contents of the output registers back into RAM at the conclusion of the program’s execution so that the data can be passed to the next stage of the pipeline. GPUs typically cache output data so that it can be reused without be- ing recalculated. For example, the post-transform vertex cache stores the most- recently processed vertices emitt ed by the vertex shader. If a triangle is en- countered that refers to a previously-processed vertex, it will be read from the post-transform vertex cache if possible—the vertex shader need only be called 457 again if the vertex in question has since been ejected from the cache to make room for newly processed vertices. Textures A shader also has direct read-only access to texture maps. Texture data is ad- dressed via texture coordinates, rather than via absolute memory addresses. The GPU’s texture samplers automatically fi lter the texture data, blending val- ues between adjacent texels or adjacent mipmap levels as appropriate. Texture fi ltering can be disabled in order to gain direct access to the values of particu- lar texels. This can be useful when a texture map is used as a data table, for example. Shaders can only write to texture maps in an indirect manner—by render- ing the scene to an off -screen frame buff er that is interpreted as a texture map by subsequent rendering passes. This feature is known as render to texture . 10.2.6.2. Introduction to High-Level Shader Language Syntax High-level shader languages like Cg and GLSL are modeled aft er the C pro- gramming language. The programmer can declare functions, defi ne a simple struct, and perform arithmetic. However, as we said above, a shader pro- gram only has access to registers and textures. As such, the struct and vari- able we declare in Cg or GLSL is mapped directly onto registers by the shader compiler. We defi ne these mappings in the following ways: Semantics . Variables and struct members can be suffi xed with a co- lon followed by a keyword known as a semantic. The semantic tells the shader compiler to bind the variable or data member to a particular vertex or fragment att ribute. For example, in a vertex shader we might declare an input struct whose members map to the position and color att ributes of a vertex as follows: struct VtxOut { float4 pos : POSITION; // map to the position // attribute float4 color : COLOR; // map to the color attribute }; Input versus output. 
The compiler determines whether a particular variable or struct should map to input or output registers from the context in which it is used. If a variable is passed as an argument to the shader program's main function, it is assumed to be an input; if it is the return value of the main function, it is taken to be an output.

    VtxOut vshaderMain(VtxIn in) // in maps to input registers
    {
        VtxOut out;
        // ...
        return out; // out maps to output registers
    }

Uniform declaration. To gain access to the data supplied by the application via the constant registers, we can declare a variable with the keyword uniform. For example, the model-view matrix could be passed to a vertex shader as follows:

    VtxOut vshaderMain(VtxIn in,
                       uniform float4x4 modelViewMatrix)
    {
        VtxOut out;
        // ...
        return out;
    }

Arithmetic operations can be performed by invoking C-style operators, or by calling intrinsic functions as appropriate. For example, to multiply the input vertex position by the model-view matrix, we could write:

    VtxOut vshaderMain(VtxIn in,
                       uniform float4x4 modelViewMatrix)
    {
        VtxOut out;
        out.pos   = mul(modelViewMatrix, in.pos);
        out.color = float4(0, 1, 0, 1); // RGBA green
        return out;
    }

Data is obtained from textures by calling special intrinsic functions that read the value of the texels at a specified texture coordinate. A number of variants are available for reading one-, two- and three-dimensional textures in various formats, with and without filtering. Special texture addressing modes are also available for accessing cube maps and shadow maps. References to the texture maps themselves are declared using a special data type known as a texture sampler declaration. For example, the data type sampler2D represents a reference to a typical two-dimensional texture. The following simple Cg pixel shader applies a diffuse texture to a triangle:

    struct FragmentOut
    {
        float4 color : COLOR;
    };

    FragmentOut pshaderMain(float2 uv : TEXCOORD0,
                            uniform sampler2D texture)
    {
        FragmentOut out;
        out.color = tex2D(texture, uv); // look up texel at (u,v)
        return out;
    }

10.2.6.3. Effect Files

By itself, a shader program isn't particularly useful. Additional information is required by the GPU pipeline in order to call the shader program with meaningful inputs. For example, we need to specify how the application-specified parameters, like the model-view matrix, light parameters, and so on, map to the uniform variables declared in the shader program. In addition, some visual effects require two or more rendering passes, but a shader program only describes the operations to be applied during a single rendering pass. If we are writing a game for the PC platform, we will need to define "fallback" versions of some of our more-advanced rendering effects, so that they will work even on older graphics cards. To tie our shader program(s) together into a complete visual effect, we turn to a file format known as an effect file.

Different rendering engines implement effects in slightly different ways. In Cg, the effect file format is known as CgFX. Ogre3D uses a file format very similar to CgFX known as a material file. GLSL effects can be described using the COLLADA format, which is based on XML.
Despite the diff erences, eff ects generally take on the following hierarchical format: At global scope, structs, shader programs (implemented as various “main” functions), and global variables (which map to application- specifi ed constant parameters) are defi ned. One or more techniques are defi ned. A technique represents one way to render a particular visual eff ect. An eff ect typically provides a primary technique for its highest-quality implementation and possibly a number of fall back techniques for use on lower-powered graphics hardware. Within each technique, one or more passes are defi ned. A pass describes how a single full-frame image should be rendered. It typically includes a reference to a vertex, geometry and/or pixel shader program’s “main” function, various parameter bindings, and optional render state sett ings. 10.2.6.4. Further Reading In this section, we’ve only had a small taste of what high-level shader pro- gramming is like—a complete tutorial is beyond our scope here. For a much 10.2. The Rendering Pipeline 460 10. The Rendering Engine more-detailed introduction to Cg shader programming, refer to the Cg tu- torial available on NVIDIA’s website at htt p://developer.nvidia.com/object/ cg_tutorial_home.html. 10.2.7. The Application Stage Now that we understand how the GPU works, we can discuss the pipeline stage that is responsible for driving it—the application stage . This stage has three roles:  1. Visibility determination. Only objects that are visible (or at least poten- tially visible) should be submitt ed to the GPU, lest we waste valuable resources processing triangles that will never be seen.  2. Submitt ing geometry to the GPU for rendering. Submesh-material pairs are sent to the GPU via a rendering call like DrawIndexedPrimitive() (DirectX) or glDrawArrays() (OpenGL), or via direct construction of the GPU command list. The geometry may be sorted for optimal render- ing performance. Geometry might be submitt ed more than once if the scene needs to be rendered in multiple passes.  3. Controlling shader parameters and render state. The uniform parameters passed to the shader via constant registers are confi gured by the ap- plication stage on a per-primitive basis. In addition, the application stage must set all of the confi gurable parameters of the non-program- mable pipeline stages to ensure that each primitive is rendered ap- propriately. In the following sections, we’ll briefl y explore how the application stage per- forms these tasks. 10.2.7.1. Visibility Determination The cheapest triangles are the ones you never draw. So it’s incredibly impor- tant to cull objects from the scene that do not contribute to the fi nal rendered image prior to submitt ing them to the GPU. The process of constructing the list of visible mesh instances is known as visibility determination . Frustum Culling In frustum culling , all objects that lie entirely outside the frustum are exclud- ed from our render list. Given a candidate mesh instance, we can determine whether or not it lies inside the frustum by performing some simple tests be- tween the object’s bounding volume and the six frustum planes. The bounding volume is usually a sphere, because spheres are particularly easy to cull. For 461 each frustum plane, we move the plane inward a distance equal to the radius of the sphere, then we determine on which side of each modifi ed plane the center point of the sphere lies. 
If the sphere is found to be on the front side of all six modifi ed planes, the sphere is inside the frustum. A scene graph data structure, described in Section 10.2.7.4, can help opti- mize frustum culling by allowing us to ignore objects whose bounding spheres are nowhere close to being inside the frustum. Occlusion and Potentially Visible Sets Even when objects lie entirely within the frustum, they may occlude one an- other. Removing objects from the visible list that are entirely occluded by other objects is called occlusion culling . In crowded environments viewed from ground level, there can be a great deal of inter-object occlusion, making oc- clusion culling extremely important. In less crowded scenes, or when scenes are viewed from above, much less occlusion may be present and the cost of occlusion culling may outweigh its benefi ts. Gross occlusion culling of a large-scale environment can be done by pre- calculating a potentially visible set (PVS). For any given camera vantage point, a PVS lists those scene objects that might be visible. A PVS errs on the side of including objects that aren’t actually visible, rather than excluding objects that actually would have contributed to the rendered scene. One way to implement a PVS system is to chop the level up into regions of some kind. Each region can be provided with a list of the other regions that can be seen when the camera is inside it. These PVSs might be manu- ally specifi ed by the artists or game designers. More commonly, an automated offl ine tool generates the PVS based on user-specifi ed regions. Such a tool usually operates by rendering the scene from various randomly distributed vantage points within a region. Every region’s geometry is color coded, so the list of visible regions can be found by scanning the resulting frame buff er and tabulating the region colors that are found. Because automated PVS tools are imperfect, they typically provide the user with a mechanism for tweaking the results, either by manually placing vantage points for testing, or by manually specifying a list of regions that should be explicitly included or excluded from a particular region’s PVS. Portals Another way to determine what portions of a scene are visible is to use portals . In portal rendering, the game world is divided up into semiclosed regions that are connected to one another via holes, such as windows and doorways. These holes are called portals. They are usually represented by polygons that describe their boundaries. 10.2. The Rendering Pipeline 462 10. The Rendering Engine To render a scene with portals, we start by rendering the region that con- tains the camera. Then, for each portal in the region, we extend a frustum-like volume consisting of planes extending from the camera’s focal point through each edge of the portal’s bounding polygon. The contents of the neighboring region can be culled to this portal volume in exactly the same way geometry is culled against the camera frustum. This ensures that only the visible geometry in the adjacent regions will be rendered. Figure 10.44 provides an illustration of this technique. Occlusion Volumes (Antiportals) If we fl ip the portal concept on its head, pyramidal volumes can also be used to describe regions of the scene that cannot be seen because they are being occluded by an object. These volumes are known as occlusion volumes or anti- portals . To construct an occlusion volume, we fi nd the silhouett e edges of each Figure 10.44. 
Portals are used to defi ne frustum-like volumes which are used to cull the con- tents of neighboring regions. In this example, objects A, B, and D will be culled because they lie outside one of the portals; the other objects will be visible. A H E D F GB C Figure 10.45. As a result of the antiportals corresponding to objects A, B, and C, objects D, E, F, and G are culled. Therefore only A, B, C, and H are visible. 463 occluding object and extend planes outward from the camera’s focal point through each of these edges. We test more-distant objects against these oc- clusion volumes and cull them if they lie entirely within the occlusion region. This is illustrated in Figure 10.45. Portals are best used when rendering enclosed indoor environments with a relatively small number of windows and doorways between “rooms.” In this kind of scene, the portals occupy a relatively small percentage of the total volume of the camera frustum, resulting in a large number of objects outside the portals which can be culled. Antiportals are best applied to large outdoor environments, in which nearby objects oft en occlude large swaths of the cam- era frustum. In this case, the antiportals occupy a relatively large percent- age of the total camera frustum volume, resulting in large numbers of culled objects. 10.2.7.2. Primitive Submission Once a list of visible geometric primitives has been generated, the individual primitives must be submitt ed to the GPU pipeline for rendering. This can be accomplished by making calls to DrawIndexedPrimitive() in DirectX or glDrawArrays() in OpenGL. Render State As we learned in Section 10.2.5, the functionality of many of the GPU pipeline’s stages is fi xed but confi gurable. And even programmable stages are driven in part by confi gurable parameters. Some examples of these confi gurable param- eters are listed below (although this is by no means a complete list!) world-view matrix; light direction vectors; texture bindings (i.e., which textures to use for a given material/ shader); texture addressing and fi ltering modes; time base for scrolling textures and other animated eff ects; z test (enabled or disabled); alpha blending options. The set of all confi gurable parameters within the GPU pipeline is known as the hardware state or render state. It is the application stage’s responsibility to ensure that the hardware state is confi gured properly and completely for each submitt ed primitive. Ideally these state sett ings are described completely by the material associated with each submesh. So the application stage’s job boils 10.2. The Rendering Pipeline 464 10. The Rendering Engine down to iterating through the list of visible mesh instances, iterating over each submesh-material pair, sett ing the render state based on the material’s specifi - cations, and then calling the low level primitive submission functions (Draw- IndexedPrimitive(), glDrawArrays() or similar). State Leaks If we forget to set some aspect of the render state between submitt ed primi- tives, the sett ings used on the previous primitive will “leak” over onto the new primitive. A render state leak might manifest itself as an object with the wrong texture or an incorrect lighting eff ect, for example. Clearly it’s important that the application stage never allow state leaks to occur. The GPU Command List The application stage actually communicates with the GPU via a command list . These commands interleave render state sett ings with references to the geometry that should be drawn. 
For example, to render objects A and B with material 1, followed by objects C, D, and E using material 2, the command list might look like this: Set render state for material 1 (multiple commands, one per render state sett ing). Submit primitive A. Submit primitive B. Set render state for material 2 (multiple commands). Submit primitive C. Submit primitive D. Submit primitive E. Under the hood, API functions like DrawIndexedPrimitive() actu- ally just construct and submit GPU command lists. The cost of these API calls can themselves be too high for some applications. To maximize performance, some game engines build GPU command lists manually or by calling a low- level rendering API like the PS3’s libgcm library. 10.2.7.3. Geometry Sorting Render state sett ings are global—they apply to the entire GPU as a whole. So in order to change render state sett ings, the entire GPU pipeline must be fl ushed before the new sett ings can be applied. This can cause massive perfor- mance degradation if not managed carefully. Clearly we’d like to change render sett ings as infrequently as possible. The best way to accomplish this is to sort our geometry by material. That way, 465 we can install material A’s sett ings, render all geometry associated with mate- rial A, and then move on to material B. Unfortunately, sorting geometry by material can have a detrimental eff ect on rendering performance because it increases overdraw —a situation in which the same pixel is fi lled multiple times by multiple overlapping triangles. Cer- tainly some overdraw is necessary and desirable, as it is the only way to prop- erly alpha-blend transparent and translucent surfaces into a scene. However, overdraw of opaque pixels is always a waste of GPU bandwidth. The early z test is designed to discard occluded fragments before the ex- pensive pixel shader has a chance to execute. But to take maximum advantage of early z, we need to draw the triangles in front-to-back order. That way, the closest triangles will fi ll the z-buff er right off the bat, and all of the fragments coming from more-distant triangles behind them can be quickly discarded, with litt le or no overdraw. Z Prepass to the Rescue How can we reconcile the need to sort geometry by material with the confl ict- ing need to render opaque geometry in a front-to-back order? The answer lies in a GPU feature known as z prepass . The idea behind z prepass is to render the scene twice: the fi rst time to generate the contents of the z-buff er as effi ciently as possible and the second time to populate the frame buff er with full color information (but this time with no overdraw, thanks to the contents of the z-buff er). The GPU provides a special double-speed rendering mode in which the pixel shaders are disabled, and only the z-buff er is updated. Opaque geometry can be rendered in front- to-back order during this phase, to minimize the time required to generate the z-buff er contents. Then the geometry can be resorted into material order and rendered in full color with minimal stage changes for maximum pipeline throughput. Once the opaque geometry has been rendered, transparent surfaces can be drawn in back-to-front order. Unfortunately, there is no general solution to the material sorting problem for transparent geometry. We must render it in back-to-front order to achieve the proper alpha-blended result. 
Therefore we must accept the cost of frequent state changes when drawing transparent geometry (unless our particular game’s usage of transparent geometry is such that a specifi c optimization can be implemented). 10.2.7.4. Scene Graphs Modern game worlds can be very large. The majority of the geometry in most scenes does not lie within the camera frustum, so frustum culling all of these 10.2. The Rendering Pipeline 466 10. The Rendering Engine objects explicitly is usually incredibly wasteful. Instead, we would like to de- vise a data structure that manages all of the geometry in the scene and allows us to quickly discard large swaths of the world that are nowhere near the cam- era frustum prior to performing detailed frustum culling. Ideally, this data structure should also help us to sort the geometry in the scene, either in front- to-back order for the z prepass or in material order for full-color rendering. Such a data structure is oft en called a scene graph , in reference to the graph- like data structures oft en used by fi lm rendering engines and DCC tools like Maya. However, a game’s scene graph needn’t be a graph, and in fact the data structure of choice is usually some kind of tree. The basic idea behind most of these data structures is to partition three-dimensional space in a way that makes it easy to discard regions that do not intersect the frustum, without having to frustum cull all of the individual objects within them. Examples include quadtrees and octress, BSP trees, kd-trees, and spatial hashing tech- niques. Quadtrees and Octrees A quadtree divides space into quadrants recursively. Each level of recursion is represented by a node in the quadtree with four children, one for each quadrant. The quadrants are typically separated by vertically oriented, ax- is-aligned planes, so that the quadrants are square or rectangular. However, some quadtrees subdivide space using arbitrarily-shaped regions. Quadtrees can be used to store and organize virtually any kind of spa- tially-distributed data. In the context of rendering engines, quadtrees are of- ten used to store renderable primitives such as mesh instances, subregions of terrain geometry, or individual triangles of a large static mesh, for the pur- poses of effi cient frustum culling. The renderable primitives are stored at the Figure 10.46. A top-down view of a space divided recursively into quadrants for storage in a quadtree, based on the criterion of one point per region. 467 leaves of the tree, and we usually aim to achieve a roughly uniform number of primitives within each leaf region. This can be achieved by deciding whether to continue or terminate the subdivision based on the number of primitives within a region. To determine which primitives are visible within the camera frustum, we walk the tree from the root to the leaves, checking each region for inter- section with the frustum. If a given quadrant does not intersect the frustum, then we know that none of its child regions will do so either, and we can stop traversing that branch of the tree. This allows us to search for potentially visible primitives much more quickly than would be possible with a linear search (usually in O(log n) time). An example of a quadtree subdivision of space is shown in Figure 10.46. An octree is the three-dimensional equivalent of a quadtree, dividing space into eight subregions at each level of the recursive subdivision. 
The regions of an octree are oft en cubes or rectangular prisms but can be arbitrarily-shaped three-dimensional regions in general. Bounding Sphere Trees In the same way that a quadtree or octree subdivides space into (usually) rectangular regions, a bounding sphere tree divides space into spherical regions hierarchically. The leaves of the tree contain the bounding spheres of the ren- derable primitives in the scene. We collect these primitives into small logical groups and calculate the net bounding sphere of each group. The groups are themselves collected into larger groups, and this process continues until we have a single group with a bounding sphere that encompasses the entire vir- tual world. To generate a list of potentially visible primitives, we walk the tree from the root to the leaves, testing each bounding sphere against the frustum, and only recursing down branches that intersect it. BSP Trees A binary space partitioning (BSP) tree divides space in half recursively until the objects within each half-space meet some predefi ned criteria (much as a quadtree divides space into quadrants). BSP trees have numerous uses, in- cluding collision detection and constructive solid geometry, as well as its most well-known application as a method for increasing the performance of frus- tum culling and geometry sorting for 3D graphics. A kd-tree is a generaliza- tion of the BSP tree concept to k dimensions. In the context of rendering, a BSP tree divides space with a single plane at each level of the recursion. The dividing planes can be axis-aligned, but more commonly each subdivision corresponds to the plane of a single triangle in the scene. All of the other triangles are then categorized as being either on 10.2. The Rendering Pipeline 468 10. The Rendering Engine the front side or the back side of the plane. Any triangles that intersect the dividing plane are themselves divided into three new triangles, so that every triangle lies either entirely in front of or entirely behind the plane, or is copla- nar with it. The result is a binary tree with a dividing plane and one or more triangles at each interior node and triangles at the leaves. A BSP tree can be used for frustum culling in much the same way a quadtree, octree, or bounding sphere tree can. However, when generated with individual triangles as described above, a BSP tree can also be used to sort tri- angles into a strictly back-to-front or front-to-back order. This was particularly important for early 3D games like Doom, which did not have the benefi t of a z-buff er and so were forced to use the painter’s algorithm (i.e., to render the scene from back to front) to ensure proper inter-triangle occlusion. Given a camera view point in 3D space, a back-to-front sorting algorithm walks the tree from the root. At each node, we check whether the view point is in front of or behind that node’s dividing plane. If the camera is in front of a node’s plane, we visit the node’s back children fi rst, then draw any triangles that are coplanar with its dividing plane, and fi nally we visit its front chil- dren. Likewise, when the camera’s view point is found to be behind a node’s dividing plane, we visit the node’s front children fi rst, then draw the triangles coplanar with the node’s plane, and fi nally we visit its back children. This traversal scheme ensures that the triangles farthest from the camera will be visited before those that are closer to it, and hence it yields a back-to-front Figure 10.47. 
Figure 10.47. An example of back-to-front traversal of the triangles in a BSP tree. The triangles are shown edge-on in two dimensions for simplicity, but in a real BSP tree the triangles and dividing planes would be arbitrarily oriented in space.

Because this algorithm traverses all of the triangles in the scene, the order of the traversal is independent of the direction the camera is looking. A secondary frustum culling step would be required in order to traverse only visible triangles. A simple BSP tree is shown in Figure 10.47, along with the tree traversal that would be done for the camera position shown.

Full coverage of BSP tree generation and usage algorithms is beyond our scope here. See http://www.ccs.neu.edu/home/donghui/teaching/slides/geometry/BSP2D.ppt and http://www.gamedev.net/reference/articles/article657.asp for more details on BSP trees.

10.2.7.5. Choosing a Scene Graph

Clearly there are many different kinds of scene graphs. Which data structure to select for your game will depend upon the nature of the scenes you expect to be rendering. To make the choice wisely, you must have a clear understanding of what is required (and, more importantly, what is not required) when rendering scenes for your particular game.

For example, if you're implementing a fighting game, in which two characters battle it out in a ring surrounded by a mostly static environment, you may not need much of a scene graph at all. If your game takes place primarily in enclosed indoor environments, a BSP tree or portal system may serve you well. If the action takes place outdoors on relatively flat terrain, and the scene is viewed primarily from above (as might be the case in a real-time strategy game or god game), a simple quadtree might be all that's required to achieve high rendering speeds. On the other hand, if an outdoor scene is viewed primarily from the point of view of someone on the ground, we may need additional culling mechanisms. Densely populated scenes can benefit from an occlusion volume (antiportal) system, because there will be plenty of occluders. On the other hand, if your outdoor scene is very sparse, adding an antiportal system probably won't pay dividends (and might even hurt your frame rate).

Ultimately, your choice of scene graph should be based on hard data obtained by actually measuring the performance of your rendering engine. You may be surprised to learn where all your cycles are actually going! But once you know, you can select scene graph data structures and/or other optimizations to target the specific problems at hand.

10.3. Advanced Lighting and Global Illumination

In order to render photorealistic scenes, we need physically accurate global illumination algorithms. A complete coverage of these techniques is beyond our scope. In the following sections, we will briefly outline the most prevalent techniques in use within the game industry today. Our goal here is to provide you with an awareness of these techniques and a jumping-off point for further investigation. For an excellent in-depth coverage of this topic, see [8].

10.3.1. Image-Based Lighting

A number of advanced lighting and shading techniques make heavy use of image data, usually in the form of two-dimensional texture maps.
These are called image-based lighting algorithms.

10.3.1.1. Normal Mapping

A normal map specifies a surface normal direction vector at each texel. This allows a 3D modeler to provide the rendering engine with a highly detailed description of a surface's shape, without having to tessellate the model to a high degree (as would be required if this same information were to be provided via vertex normals). Using a normal map, a single flat triangle can be made to look as though it were constructed from millions of tiny triangles. An example of normal mapping is shown in Figure 10.48.

The normal vectors are typically encoded in the RGB color channels of the texture, with a suitable bias to overcome the fact that RGB channels are strictly positive while normal vector components can be negative. Sometimes only two coordinates are stored in the texture; the third can be easily calculated at runtime, given the assumption that the surface normals are unit vectors.

Figure 10.48. An example of a normal-mapped surface.

10.3.1.2. Height Maps: Parallax and Relief Mapping

As its name implies, a height map encodes the height of the ideal surface above or below the surface of the triangle. Height maps are typically encoded as grayscale images, since we only need a single height value per texel.

Height maps are often used for parallax mapping and relief mapping, two techniques that can make a planar surface appear to have rather extreme height variation that properly self-occludes and self-shadows. Figure 10.49 shows an example of parallax occlusion mapping implemented in DirectX 9. A height map can also be used as a cheap way to generate surface normals. This technique was used in the early days of bump mapping. Nowadays, most game engines store surface normal information explicitly in a normal map, rather than calculating the normals from a height map.

Figure 10.49. DirectX 9 parallax occlusion mapping. The surface is actually a flat disc; a height map texture is used to define the surface details.

10.3.1.3. Specular/Gloss Maps

When light reflects directly off a shiny surface, we call this specular reflection. The intensity of a specular reflection depends on the relative angles of the viewer, the light source, and the surface normal. As we saw in Section 10.1.3.2, the specular intensity takes the form kS (R · V)^α, where R is the reflection of the light's direction vector about the surface normal, V is the direction to the viewer, kS is the overall specular reflectivity of the surface, and α is called the specular power.

Many surfaces aren't uniformly glossy. For example, when a person's face is sweaty and dirty, wet regions appear shiny, while dry or dirty areas appear dull. We can encode high-detail specularity information in a special texture map known as a specular map. If we store the value of kS in the texels of a specular map, we can control how much specular reflection should be applied at each texel. This kind of specular map is sometimes called a gloss map. It is also called a specular mask, because zero-valued texels can be used to "mask off" regions of the surface where we do not want specular reflection applied. If we store the value of α in our specular map, we can control the amount of "focus" our specular highlights will have at each texel. This kind of texture is called a specular power map. An example of a gloss map is shown in Figure 10.50.
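To make the per-texel arithmetic concrete, here is a small sketch of decoding a normal map texel and evaluating the specular term with kS and α sampled from a gloss map and a specular power map. In practice this work lives in a pixel shader; the C++ form, the helper functions, and the parameter names are illustrative assumptions rather than the book's code.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  normalize(Vec3 v)
{
    float len = std::sqrt(dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}
static Vec3 reflectAbout(Vec3 l, Vec3 n)   // reflection of l about the normal n
{
    float d = 2.0f * dot(n, l);
    return { d * n.x - l.x, d * n.y - l.y, d * n.z - l.z };
}

// Undo the [0,1] bias applied when the normal was stored in RGB channels.
Vec3 decodeNormal(float r, float g, float b)
{
    return normalize({ 2.0f * r - 1.0f, 2.0f * g - 1.0f, 2.0f * b - 1.0f });
}

// Specular term kS * (R . V)^alpha, with kS taken from the gloss map texel
// and alpha taken from the specular power map texel.
float specularTerm(Vec3 N,        // decoded surface normal (unit length)
                   Vec3 L,        // unit vector from the surface toward the light
                   Vec3 V,        // unit vector from the surface toward the viewer
                   float kS,      // per-texel specular reflectivity
                   float alpha)   // per-texel specular power
{
    Vec3  R     = reflectAbout(L, N);
    float rDotV = std::max(dot(R, V), 0.0f);   // clamp back-facing cases to zero
    return kS * std::pow(rDotV, alpha);
}
```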
10.3.1.4. Environment Mapping

An environment map looks like a panoramic photograph of the environment taken from the point of view of an object in the scene, covering a full 360 degrees horizontally and either 180 degrees or 360 degrees vertically. An environment map acts like a description of the general lighting environment surrounding an object. It is generally used to inexpensively render reflections.

The two most common formats are spherical environment maps and cubic environment maps. A spherical map looks like a photograph taken through a fisheye lens, and it is treated as though it were mapped onto the inside of a sphere whose radius is infinite, centered about the object being rendered. The problem with sphere maps is that they are addressed using spherical coordinates. Around the equator, there is plenty of resolution both horizontally and vertically. However, as the vertical angle approaches the poles, the resolution of the texture along the horizontal (azimuthal) axis decreases until it reaches a single texel. Cube maps were devised to avoid this problem.

A cube map looks like a composite photograph pieced together from photos taken in the six primary directions (up, down, left, right, front, and back). During rendering, a cube map is treated as though it were mapped onto the six inner surfaces of a box at infinity, centered on the object being rendered.

Figure 10.50. This screen shot from EA's Fight Night Round 3 shows how a gloss map can be used to control the degree of specular reflection that should be applied to each texel of a surface.

To read the environment map texel corresponding to a point P on the surface of an object, we take the ray from the camera to the point P and reflect it about the surface normal at P. The reflected ray is followed until it intersects the sphere or cube of the environment map. The value of the texel at this intersection point is used when shading the point P.

10.3.1.5. Three-Dimensional Textures

Modern graphics hardware also includes support for three-dimensional textures. A 3D texture can be thought of as a stack of 2D textures. The GPU knows how to address and filter a 3D texture, given a three-dimensional texture coordinate (u, v, w). Three-dimensional textures can be useful for describing the appearance or volumetric properties of an object. For example, we could render a marble sphere and allow it to be cut by an arbitrary plane. The texture would look continuous and correct across the cut no matter where it was made, because the texture is well-defined and continuous throughout the entire volume of the sphere.

10.3.2. High Dynamic Range Lighting

A display device like a television set or CRT monitor can only produce a limited range of intensities. This is why the color channels in the frame buffer are limited to a zero-to-one range. But in the real world, light intensities can grow arbitrarily large. High dynamic range (HDR) lighting attempts to capture this wide range of light intensities.

HDR lighting performs lighting calculations without clamping the resulting intensities arbitrarily. The resulting image is stored in a format that permits intensities to grow beyond one. The net effect is an image in which extreme dark and light regions can be represented without loss of detail within either type of region. Prior to display on-screen, a process called tone mapping is used to shift and scale the image's intensity range into the range supported by the display device. Doing this permits the rendering engine to reproduce many real-world visual effects, like the temporary blindness that occurs when you walk from a dark room into a brightly lit area, or the way light seems to bleed out from behind a brightly back-lit object (an effect known as bloom).
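As a concrete illustration of tone mapping (one common operator, not something the text prescribes), the sketch below applies the simple Reinhard curve x/(1 + x), which maps any non-negative HDR intensity into the [0, 1) range. The exposure parameter and the pixel structures are hypothetical; production engines typically use more elaborate, exposure-driven curves.

```cpp
#include <cstddef>

struct HdrPixel { float r, g, b; };   // unclamped linear intensities
struct LdrPixel { float r, g, b; };   // values in [0, 1), ready for display encoding

// Map an HDR image into displayable range with the Reinhard operator.
void toneMap(const HdrPixel* src, LdrPixel* dst, std::size_t pixelCount, float exposure)
{
    for (std::size_t i = 0; i < pixelCount; ++i)
    {
        float r = src[i].r * exposure;   // scale by a scene- or user-chosen exposure
        float g = src[i].g * exposure;
        float b = src[i].b * exposure;
        dst[i] = { r / (1.0f + r), g / (1.0f + g), b / (1.0f + b) };
    }
}
```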
One way to represent an HDR image is to store the R, G, and B channels using 32-bit floating-point numbers, instead of 8-bit integers. Another alternative is to employ an entirely different color model altogether. The log-LUV color model is a popular choice for HDR lighting. In this model, color is represented as an intensity channel (L) and two chromaticity channels (U and V). Because the human eye is more sensitive to changes in intensity than it is to changes in chromaticity, the L channel is stored in 16 bits while U and V are given only eight bits each. In addition, L is represented using a logarithmic scale (base two) in order to capture a very wide range of light intensities.

10.3.3. Global Illumination

As we noted in Section 10.1.3.1, global illumination refers to a class of lighting algorithms that account for light's interactions with multiple objects in the scene, on its way from the light source to the virtual camera. Global illumination accounts for effects like the shadows that arise when one surface occludes another, reflections, caustics, and the way the color of one object can "bleed" onto the objects around it. In the following sections, we'll take a brief look at some of the most common global illumination techniques. Some of these methods aim to reproduce a single isolated effect, like shadows or reflections. Others, like radiosity and ray tracing methods, aim to provide a holistic model of global light transport.

10.3.3.1. Shadow Rendering

Shadows are created when a surface blocks light's path. The shadows caused by an ideal point light source would be sharp, but in the real world shadows have blurry edges; this is called the penumbra. A penumbra arises because real-world light sources cover some area and so produce light rays that graze the edges of an object at different angles.

The two most prevalent shadow rendering techniques are shadow volumes and shadow maps. We'll briefly describe each in the sections below. In both techniques, objects in the scene are generally divided into three categories: objects that cast shadows, objects that are to receive shadows, and objects that are entirely excluded from consideration when rendering shadows. Likewise, the lights are tagged to indicate whether or not they should generate shadows. This important optimization limits the number of light-object combinations that need to be processed in order to produce the shadows in a scene.

Shadow Volumes

In the shadow volume technique, each shadow caster is viewed from the vantage point of a shadow-generating light source, and the shadow caster's silhouette edges are identified. These edges are extruded in the direction of the light rays emanating from the light source. The result is a new piece of geometry that describes the volume of space in which the light is occluded by the shadow caster in question. This is shown in Figure 10.51.

A shadow volume is used to generate a shadow by making use of a special full-screen buffer known as the stencil buffer. This buffer stores a single integer value corresponding to each pixel of the screen.
Rendering can be masked by the values in the stencil buffer; for example, we could configure the GPU to only render fragments whose corresponding stencil values are non-zero. In addition, the GPU can be configured so that rendered geometry updates the values in the stencil buffer in various useful ways.

To render shadows, the scene is first drawn to generate an unshadowed image in the frame buffer, along with an accurate z-buffer. The stencil buffer is cleared so that it contains zeros at every pixel. Each shadow volume is then rendered from the point of view of the camera in such a way that front-facing triangles increase the values in the stencil buffer by one, while back-facing triangles decrease them by one. In areas of the screen where the shadow volume does not appear at all, of course, the stencil buffer's pixels will be left containing zero. The stencil buffer will also contain zeros where both the front and back faces of the shadow volume are visible, because the front face will increase the stencil value but the back face will decrease it again. In areas where the back face of the shadow volume has been occluded by "real" scene geometry, the stencil value will be one. This tells us which pixels of the screen are in shadow. So we can render shadows in a third pass, by simply darkening those regions of the screen that contain a non-zero stencil buffer value.

Figure 10.51. A shadow volume generated by extruding the silhouette edges of a shadow-casting object as seen from the point of view of the light source.

Shadow Maps

The shadow mapping technique is effectively a per-fragment depth test performed from the point of view of the light instead of from the point of view of the camera. The scene is rendered in two steps: First, a shadow map texture is generated by rendering the scene from the point of view of the light source and saving off the contents of the depth buffer. Second, the scene is rendered as usual, and the shadow map is used to determine whether or not each fragment is in shadow. At each fragment in the scene, the shadow map tells us whether or not the light is being occluded by some geometry that is closer to the light source, in just the same way that the z-buffer tells us whether a fragment is being occluded by a triangle that is closer to the camera.

A shadow map contains only depth information; each texel records how far away it is from the light source. Shadow maps are therefore typically rendered using the hardware's double-speed z-only mode (since all we care about is the depth information). For a point light source, a perspective projection is used when rendering the shadow map; for a directional light source, an orthographic projection is used instead.

To render a scene using a shadow map, we draw the scene as usual from the point of view of the camera. For each vertex of every triangle, we calculate its position in light space, i.e., in the same "view space" that was used when generating the shadow map in the first place. These light space coordinates can be interpolated across the triangle, just like any other vertex attribute. This gives us the position of each fragment in light space. To determine whether a given fragment is in shadow or not, we convert the fragment's light-space (x, y)-coordinates into texture coordinates (u, v) within the shadow map. We then compare the fragment's light-space z-coordinate with the depth stored at the corresponding texel in the shadow depth map. If the fragment's light-space z is farther away from the light than the texel in the shadow map, then it must be occluded by some other piece of geometry that is closer to the light source, and hence it is in shadow. Likewise, if the fragment's light-space z is closer to the light source than the texel in the shadow map, then it is not occluded and is not in shadow. Based on this information, the fragment's color can be adjusted accordingly. The shadow mapping process is illustrated in Figure 10.52.

Figure 10.52. The far left image is a shadow map: the contents of the z-buffer as rendered from the point of view of a particular light source. The pixels of the center image are black where the light-space depth test failed (fragment in shadow) and white where it succeeded (fragment not in shadow). The far right image shows the final scene rendered with shadows.
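The per-fragment comparison described above amounts to only a few operations. The sketch below shows the arithmetic in isolation; the ShadowMap type, the (x, y) range assumption, and the small depth bias (a standard trick to avoid self-shadowing artifacts, not something discussed in the text) are all assumptions, and in a real engine this test runs in a pixel shader with hardware support for the comparison and filtering.

```cpp
struct ShadowMap
{
    // Depth of the closest occluder, as seen from the light, at texture
    // coordinates (u, v). Placeholder body; a real version samples a texture.
    float depthAt(float /*u*/, float /*v*/) const { return 1.0f; }
};

struct LightSpacePos { float x, y, z; };   // fragment position in light space,
                                           // with z increasing away from the light

bool isInShadow(const ShadowMap& map, const LightSpacePos& frag, float bias)
{
    // Convert light-space (x, y), assumed already projected into [-1, 1],
    // into shadow map texture coordinates in [0, 1].
    float u = 0.5f * frag.x + 0.5f;
    float v = 0.5f * frag.y + 0.5f;

    float occluderDepth = map.depthAt(u, v);

    // Farther from the light than the stored occluder means the fragment
    // is blocked by closer geometry, and hence lies in shadow.
    return (frag.z - bias) > occluderDepth;
}
```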
10.3.3.2. Ambient Occlusion

Ambient occlusion is a technique for modeling contact shadows, the soft shadows that arise when a scene is illuminated by only ambient light. In effect, ambient occlusion describes how "accessible" each point on a surface is to light in general. For example, the interior of a section of pipe is less accessible to ambient light than its exterior. If the pipe were placed outside on an overcast day, its interior would generally appear darker than its exterior.

Figure 10.53 shows the level of ambient occlusion across an object's surface. Ambient occlusion is measured at a point on a surface by constructing a hemisphere with a very large radius centered on that point and determining what percentage of that hemisphere's area is visible from the point in question. It can be precomputed offline for static objects, because ambient occlusion is independent of view direction and the direction of incident light. It is typically stored in a texture map that records the level of ambient occlusion at each texel across the surface.

Figure 10.53. A dragon rendered with ambient occlusion.

10.3.3.3. Reflections

Reflections occur when light bounces off a highly specular (shiny) surface, producing an image of another portion of the scene in the surface. Reflections can be implemented in a number of ways. Environment maps are used to produce general reflections of the surrounding environment on the surfaces of shiny objects. Direct reflections in flat surfaces like mirrors can be produced by reflecting the camera's position about the plane of the reflective surface and then rendering the scene from that reflected point of view into a texture. The texture is then applied to the reflective surface in a second pass.

Figure 10.54. Mirror reflections in Luigi's Mansion implemented by rendering the scene to a texture that is subsequently applied to the mirror's surface.

10.3.3.4. Caustics

Caustics are the bright specular highlights arising from intense reflections or refractions from very shiny surfaces like water or polished metal. When the reflective surface moves, as is the case for water, the caustic effects glimmer and "swim" across the surfaces on which they fall. Caustic effects can be produced by projecting a (possibly animated) texture containing semi-random bright highlights onto the affected surfaces. An example of this technique is shown in Figure 10.55.

Figure 10.55. Water caustics produced by projecting an animated texture onto the affected surfaces.
10.3.3.5. Subsurface Scattering

When light enters a surface at one point, is scattered beneath the surface, and then reemerges at a different point on the surface, we call this subsurface scattering. This phenomenon is responsible for the "warm glow" of human skin, wax, and marble statues. Subsurface scattering is described by a more advanced variant of the BRDF (see Section 10.1.3.2) known as the BSSRDF (bidirectional surface scattering reflectance distribution function).

Subsurface scattering can be simulated in a number of ways. Depth-map-based subsurface scattering renders a shadow map (see Section 10.3.3.1), but instead of using it to determine which pixels are in shadow, it is used to measure how far a beam of light would have to travel in order to pass all the way through the occluding object. The shadowed side of the object is then given an artificial diffuse lighting term whose intensity is inversely proportional to the distance the light had to travel in order to emerge on the opposite side of the object. This causes objects to appear to be glowing slightly on the side opposite to the light source, but only where the object is relatively thin. For more information on subsurface scattering techniques, see http://http.developer.nvidia.com/GPUGems/gpugems_ch16.html.

Figure 10.56. On the left, a dragon rendered without subsurface scattering (i.e., using a BRDF lighting model). On the right, the same dragon rendered with subsurface scattering (i.e., using a BSSRDF model). Images rendered by Rui Wang at the University of Virginia.

10.3.3.6. Precomputed Radiance Transfer (PRT)

Precomputed radiance transfer (PRT) is a relatively new technique that attempts to simulate the effects of radiosity-based rendering methods in real time. It does so by precomputing and storing a complete description of how an incident light ray would interact with a surface (reflect, refract, scatter, etc.) when approaching from every possible direction. At runtime, the response to a particular incident light ray can be looked up and quickly converted into very accurate lighting results.

In general, the light's response at a point on the surface is a complex function defined on a hemisphere centered about the point. A compact representation of this function is required to make the PRT technique practical. A common approach is to approximate the function as a linear combination of spherical harmonic basis functions. This is essentially the three-dimensional equivalent of encoding a simple scalar function f(x) as a linear combination of shifted and scaled sine waves.

The details of PRT are far beyond our scope. For more information, see http://web4.cs.ucl.ac.uk/staff/j.kautz/publications/prtSIG02.pdf. PRT lighting techniques are demonstrated in a DirectX sample program available in the DirectX SDK; see http://msdn.microsoft.com/en-us/library/bb147287.aspx for more details.

10.3.4. Deferred Rendering

In traditional triangle-rasterization-based rendering, all lighting and shading calculations are performed on the triangle fragments in view space. The problem with this technique is that it is inherently inefficient. For one thing, we potentially do work that we don't need to do.
We shade the vertices of triangles, only to discover during the rasterization stage that the entire triangle is being depth-culled by the z test. Early z tests help eliminate unnecessary pixel shader evaluations, but even this isn't perfect. What's more, in order to handle a complex scene with lots of lights, we end up with a proliferation of different versions of our vertex and pixel shaders: versions that handle different numbers of lights, different types of lights, different numbers of skinning weights, etc.

Deferred rendering is an alternative way to shade a scene that addresses many of these problems. In deferred rendering, the majority of the lighting calculations are done in screen space, not view space. We efficiently render the scene without worrying about lighting. During this phase, we store all the information we're going to need to light the pixels in a "deep" frame buffer known as the G-buffer. Once the scene has been fully rendered, we use the information in the G-buffer to perform our lighting and shading calculations. This is usually much more efficient than view-space lighting, avoids the proliferation of shader variants, and permits some very pleasing effects to be rendered relatively easily.

The G-buffer may be physically implemented as a collection of buffers, but conceptually it is a single frame buffer containing a rich set of information about the lighting and surface properties of the objects in the scene at every pixel on the screen. A typical G-buffer might contain the following per-pixel attributes: depth, surface normal in clip space, diffuse color, specular power, even precomputed radiance transfer (PRT) coefficients. The following sequence of screen shots from Guerrilla Games' Killzone 2 shows some of the typical components of the G-buffer.

Figure 10.57. Screenshots from Killzone 2, showing some of the typical components of the G-buffer used in deferred rendering. The upper image shows the final rendered image. Below it, clockwise from the upper left, are the albedo (diffuse) color, depth, view-space normal, screen-space 2D motion vector (for motion blurring), specular power, and specular intensity.

An in-depth discussion of deferred rendering is beyond our scope, but the folks at Guerrilla Games have prepared an excellent presentation on the topic, which is available at http://www.guerrilla-games.com/publications/dr_kz2_rsx_dev07.pdf.
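Even so, to make the idea of a "deep" frame buffer more concrete, the sketch below lists one pixel's worth of data for a hypothetical G-buffer, based on the attributes mentioned above and in the Killzone 2 caption. The exact contents, packing, and precision are entirely engine-specific; a real implementation packs these attributes into a handful of render targets rather than a C++ struct.

```cpp
#include <cstdint>

// One pixel's worth of data in a hypothetical G-buffer layout.
struct GBufferPixel
{
    float        depth;            // scene depth at this pixel
    float        normal[3];        // surface normal (often stored in two
                                   // components and reconstructed in the shader)
    std::uint8_t albedo[3];        // diffuse (albedo) color
    std::uint8_t specIntensity;    // overall specular reflectivity (kS)
    float        specPower;        // specular exponent (alpha)
    float        motion[2];        // screen-space 2D motion vector for motion blur
};

// During the lighting pass, each screen pixel is shaded purely from its
// G-buffer contents plus the light parameters; the scene geometry itself
// is never revisited. (Hypothetical signature for illustration only.)
// Color shadePixel(const GBufferPixel& g, const Light& light);
```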
10.4. Visual Effects and Overlays

The rendering pipeline we've discussed to this point is responsible primarily for rendering three-dimensional solid objects. A number of specialized rendering systems are typically layered on top of this pipeline, responsible for rendering visual elements like particle effects, decals (small geometry overlays that represent bullet holes, cracks, scratches, and other surface details), hair and fur, rain or falling snow, water, and other specialized visual effects. Full-screen post effects may be applied, including vignette (slight blur around the edges of the screen), motion blur, depth-of-field blurring, artificial/enhanced colorization, and the list goes on. Finally, the game's menu system and heads-up display (HUD) are typically realized by rendering text and other two- or three-dimensional graphics in screen space overlaid on top of the three-dimensional scene. An in-depth coverage of these engine systems is beyond our scope. In the following sections, we'll provide a brief overview of these rendering systems and point you in the direction of additional information.

10.4.1. Particle Effects

A particle rendering system is concerned with rendering amorphous objects like clouds of smoke, sparks, flame, and so on. These are called particle effects. The key features that differentiate a particle effect from other kinds of renderable geometry are as follows:

- It is composed of a very large number of relatively simple pieces of geometry, most often simple cards called quads, composed of two triangles each.
- The geometry is often camera-facing (i.e., billboarded), meaning that the engine must take steps to ensure that the face normals of each quad always point directly at the camera's focal point.
- Its materials are almost always semi-transparent or translucent. As such, particle effects have some stringent rendering order constraints that do not apply to the majority of opaque objects in a scene.
- Particles animate in a rich variety of ways. Their positions, orientations, sizes (scales), texture coordinates, and many of their shader parameters vary from frame to frame. These changes are defined either by hand-authored animation curves or via procedural methods.
- Particles are typically spawned and killed continually. A particle emitter is a logical entity in the world that creates particles at some user-specified rate; particles are killed when they hit a predefined death plane, or when they have lived for a user-defined length of time, or as decided by some other user-specified criteria.

Particle effects could be rendered using regular triangle mesh geometry with appropriate shaders. However, because of the unique characteristics listed above, a specialized particle effect animation and rendering system is always used to implement them in a real production game engine. A few example particle effects are shown in Figure 10.58.

Figure 10.58. Some particle effects.

Particle system design and implementation is a rich topic that could occupy many chapters all on its own. For more information on particle systems, see [1] Section 10.7, [14] Section 20.5, [9] Section 13.7 and [10] Section 4.1.2.

10.4.2. Decals

A decal is a relatively small piece of geometry that is overlaid on top of the regular geometry in the scene, allowing the visual appearance of the surface to be modified dynamically. Examples include bullet holes, foot prints, scratches, cracks, etc.

The approach most often used by modern engines is to model a decal as a rectangular area that is to be projected along a ray into the scene. This gives rise to a rectangular prism in 3D space. Whatever surface the prism intersects first becomes the surface of the decal. The triangles of the intersected geometry are extracted and clipped against the four bounding planes of the decal's projected prism. The resulting triangles are texture-mapped with a desired decal texture by generating appropriate texture coordinates for each vertex. These texture-mapped triangles are then rendered over the top of the regular scene, often using parallax mapping to give them the illusion of depth and with a slight z-bias (usually implemented by shifting the near plane slightly) so they don't experience z-fighting with the geometry on which they are overlaid. The result is the appearance of a bullet hole, scratch, or other kind of surface modification. Some bullet-hole decals are depicted in Figure 10.59.

Figure 10.59. Parallax-mapped decals from Uncharted: Drake's Fortune.

For more information on creating and rendering decals, see [7] Section 4.8, and [28] Section 9.2.
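As a small illustration of the projection step, the sketch below builds the four side planes of a decal's prism from a hit point, the two axes of the decal rectangle, and its half-extents. All of the names are hypothetical; the intersected triangles would then be clipped against these planes and textured with the decal image.

```cpp
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };   // points p with dot(n, p) + d <= 0 are inside

static float dot(Vec3 a, Vec3 b)    { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  scale(Vec3 v, float s) { return { v.x*s, v.y*s, v.z*s }; }
static Vec3  add(Vec3 a, Vec3 b)    { return { a.x+b.x, a.y+b.y, a.z+b.z }; }

static Plane planeThrough(Vec3 outwardNormal, Vec3 pointOnPlane)
{
    return { outwardNormal, -dot(outwardNormal, pointOnPlane) };
}

// 'center' is where the projection ray hits the surface; 'right' and 'up'
// are unit vectors spanning the decal rectangle (perpendicular to the
// projection direction); halfWidth/halfHeight are the rectangle half-extents.
void buildDecalPrismSides(Vec3 center, Vec3 right, Vec3 up,
                          float halfWidth, float halfHeight, Plane outPlanes[4])
{
    outPlanes[0] = planeThrough(right,              add(center, scale(right,  halfWidth)));
    outPlanes[1] = planeThrough(scale(right, -1.f), add(center, scale(right, -halfWidth)));
    outPlanes[2] = planeThrough(up,                 add(center, scale(up,     halfHeight)));
    outPlanes[3] = planeThrough(scale(up, -1.f),    add(center, scale(up,    -halfHeight)));
}
```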
10.4.3. Environmental Effects

Any game that takes place in a somewhat natural or realistic environment requires some kind of environmental rendering effects. These effects are usually implemented via specialized rendering systems. We'll take a brief look at a few of the more common of these systems in the following sections.

10.4.3.1. Skies

The sky in a game world needs to contain vivid detail, yet technically speaking it lies an extremely long distance away from the camera. Therefore we cannot model it as it really is and must turn instead to various specialized rendering techniques.

One simple approach is to fill the frame buffer with the sky texture prior to rendering any 3D geometry. The sky texture should be rendered at an approximate 1:1 texel-to-pixel ratio, so that the texture is roughly or exactly the resolution of the screen. The sky texture can be rotated and scrolled to correspond to the motions of the camera in-game. During rendering of the sky, we make sure to set the depth of all pixels in the frame buffer to the maximum possible depth value. This ensures that the 3D scene elements will always sort on top of the sky. The arcade hit Hydro Thunder rendered its skies in exactly this manner.

For games in which the player can look in any direction, we can use a sky dome or sky box. The dome or box is rendered with its center always at the camera's current location, so that it appears to lie at infinity, no matter where the camera moves in the game world. As with the sky texture approach, the sky box or dome is rendered before any other 3D geometry, and all of the pixels in the frame buffer are set to the maximum z-value when the sky is rendered. This means that the dome or box can actually be tiny, relative to other objects in the scene. Its size is irrelevant, as long as it fills the entire frame buffer when it is drawn. For more information on sky rendering, see [1] Section 10.3 and [38] page 253.

Clouds are often implemented with a specialized rendering and animation system as well. In early games like Doom and Quake, the clouds were just planes with scrolling semi-transparent cloud textures on them. More recent cloud techniques include camera-facing cards (billboards), particle-effect-based clouds, and volumetric cloud effects.

10.4.3.2. Terrain

The goal of a terrain system is to model the surface of the earth and provide a canvas of sorts upon which other static and dynamic elements can be laid out. Terrain is sometimes modeled explicitly in a package like Maya. But if the player can see far into the distance, we usually want some kind of dynamic tessellation or other level of detail (LOD) system. We may also need to limit the amount of data required to represent very large outdoor areas.

Height field terrain is one popular choice for modeling large terrain areas. The data size can be kept relatively small because a height field is typically stored in a grayscale texture map. In most height-field-based terrain systems, the horizontal (y = 0) plane is tessellated in a regular grid pattern, and the heights of the terrain vertices are determined by sampling the height field texture.
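A minimal sketch of that sampling step is shown below, assuming an 8-bit grayscale height field and a regular grid in the horizontal plane. The names and scaling parameters are hypothetical; a real terrain system would also generate normals, LOD levels, and texture blend weights.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

std::vector<Vec3> buildTerrainVertices(const std::uint8_t* heightField,
                                       int width, int depth,   // height map dimensions in texels
                                       float cellSize,         // world-space spacing of grid vertices
                                       float maxHeight)        // world-space height at texel value 255
{
    std::vector<Vec3> vertices;
    vertices.reserve(static_cast<std::size_t>(width) * static_cast<std::size_t>(depth));

    for (int z = 0; z < depth; ++z)
    {
        for (int x = 0; x < width; ++x)
        {
            float h = heightField[z * width + x] / 255.0f;   // sample and normalize the height
            vertices.push_back({ x * cellSize,               // regular grid in the horizontal plane
                                 h * maxHeight,              // vertical position from the height field
                                 z * cellSize });
        }
    }
    return vertices;
}
```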
The number of triangles per unit area can be varied based on distance from the camera, thereby allowing large-scale features to be seen in the distance, while still permitting a good deal of detail to be represented for nearby terrain. An example of a terrain defined via a height field bitmap is shown in Figure 10.60. Terrain systems usually provide specialized tools for "painting" the height field itself, carving out terrain features like roads, rivers, and so on.

Figure 10.60. A grayscale height field bitmap (left) can be used to control the vertical positions of the vertices in a terrain grid mesh (right). In this example, a water plane intersects the terrain mesh to create islands.

Texture mapping in a terrain system is often a blend between four or more textures. This allows artists to "paint" in grass, dirt, gravel, and other terrain features by simply exposing one of the texture layers. The layers can be cross-blended from one to another to provide smooth textural transitions. Some terrain tools also permit sections of the terrain to be cut out to permit buildings, trenches, and other specialized terrain features to be inserted in the form of regular mesh geometry. Terrain authoring tools are sometimes integrated directly into the game world editor, while in other engines they may be stand-alone tools.

Of course, height field terrain is just one of many options for modeling the surface of the Earth in a game. For more information on terrain rendering, see [6] Sections 4.16 through 4.19 and [7] Section 4.2.

10.4.3.3. Water

Water renderers are commonplace in games nowadays. There are lots of different possible kinds of water, including oceans, pools, rivers, waterfalls, fountains, jets, puddles, and damp solid surfaces. Each type of water generally requires some specialized rendering technology. Some also require dynamic motion simulations. Large bodies of water may require dynamic tessellation or other LOD methodologies similar to those employed in a terrain system.

Water systems sometimes interact with a game's rigid body dynamics system (flotation, force from water jets, etc.) and with gameplay (slippery surfaces, swimming mechanics, diving mechanics, riding vertical jets of water, and so on). Water effects are often created by combining disparate rendering technologies and subsystems. For example, a waterfall might make use of specialized water shaders, scrolling textures, particle effects for mist at the base, a decal-like overlay for foam, and the list goes on. Today's games offer some pretty amazing water effects, and active research into technologies like real-time fluid dynamics promises to make water simulations even richer and more realistic in the years ahead. For more information on water rendering and simulation techniques, see [1] Sections 9.3, 9.5, and 9.6, [13], and [6] Sections 2.6 and 5.11.

10.4.4. Overlays

Most games have heads-up displays, in-game graphical user interfaces, and menu systems. These overlays are typically comprised of two- and three-dimensional graphics rendered directly in view space or screen space.

Overlays are generally rendered after the primary scene, with z testing disabled to ensure that they appear on top of the three-dimensional scene. Two-dimensional overlays are typically implemented by rendering quads (triangle pairs) in screen space using an orthographic projection. Three-dimensional overlays may be rendered using an orthographic projection or via the regular perspective projection with the geometry positioned in view space so that it follows the camera around.
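For reference, here is one way such an orthographic projection might be set up for screen-space quads. The matrix layout (column-major) and the OpenGL-style clip-space conventions are assumptions; the details differ between graphics APIs and engines.

```cpp
struct Mat4 { float m[16]; };   // column-major 4x4 matrix

// Maps screen-space coordinates, with (0,0) at the top-left corner and
// (width, height) at the bottom-right, into normalized device coordinates.
Mat4 makeOverlayProjection(float screenWidth, float screenHeight)
{
    Mat4 p = {};                      // start with all zeros
    p.m[0]  =  2.0f / screenWidth;    // x: [0, w] -> [-1, 1]
    p.m[5]  = -2.0f / screenHeight;   // y: [0, h] -> [1, -1] (flip so +y points down)
    p.m[10] = -1.0f;                  // z: simply negated; overlays are usually drawn
                                      //    with z testing disabled anyway
    p.m[12] = -1.0f;                  // translate x into the [-1, 1] range
    p.m[13] =  1.0f;                  // translate y into the [1, -1] range
    p.m[15] =  1.0f;
    return p;
}
```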
10.4.4.1. Normalized Screen Coordinates

The coordinates of two-dimensional overlays can be measured in terms of screen pixels. However, if your game is going to be expected to support multiple screen resolutions (which is very common in PC games), it's a far better idea to use normalized screen coordinates. Normalized coordinates range from zero to one along one of the two axes (but not both; see below), and they can easily be scaled into pixel-based measurements corresponding to an arbitrary screen resolution. This allows us to lay out our overlay elements without worrying about screen resolution at all (and only having to worry a little bit about aspect ratio).

It's easiest to define normalized coordinates so that they range from 0.0 to 1.0 along the y-axis. At a 4:3 aspect ratio, this means that the x-axis would range from 0.0 to 1.333 (= 4/3), while at 16:9 the x-axis' range would be from 0.0 to 1.777 (= 16/9). It's important not to define our coordinates so that they range from zero to one along both axes. Doing this would cause square visual elements to have unequal x and y dimensions; put another way, a visual element with seemingly square dimensions would not look like a square on-screen. Moreover, our "square" elements would stretch differently at different aspect ratios, which is definitely not an acceptable state of affairs.

10.4.4.2. Relative Screen Coordinates

To really make normalized coordinates work well, it should be possible to specify coordinates in absolute or relative terms. For example, positive coordinates might be interpreted as being relative to the top-left corner of the screen, while negative coordinates are relative to the bottom-right corner. That way, if I want a HUD element to be a certain distance from the right or bottom edges of the screen, I won't have to change its normalized coordinates when the aspect ratio changes. We might want to allow an even richer set of possible alignment choices, such as aligning to the center of the screen or aligning to another visual element.

That said, you'll probably have some overlay elements that simply cannot be laid out using normalized coordinates in such a way that they look right at both the 4:3 and 16:9 aspect ratios. You may want to consider having two distinct layouts, one for each aspect ratio, so you can fine-tune them independently.
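A small helper like the one sketched below (hypothetical names, written to the conventions just described) converts normalized coordinates into pixels for whatever resolution the game is currently running at, treating negative values as offsets from the right and bottom edges as suggested above.

```cpp
struct PixelPos { float x, y; };

// Convert a normalized overlay coordinate into pixels. Following the
// convention above, 1.0 in normalized units equals the screen height, so
// y spans [0, 1] and x spans [0, aspectRatio] (e.g. [0, 1.777] at 16:9).
// Negative inputs are interpreted relative to the right/bottom edges.
PixelPos normalizedToPixels(float nx, float ny, int screenWidth, int screenHeight)
{
    const float scale = static_cast<float>(screenHeight);

    float px = nx * scale;
    float py = ny * scale;

    if (nx < 0.0f) px += static_cast<float>(screenWidth);    // offset from the right edge
    if (ny < 0.0f) py += static_cast<float>(screenHeight);   // offset from the bottom edge

    return { px, py };
}
```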
10.4.4.3. Text and Fonts

A game engine's text/font system is typically implemented as a special kind of two-dimensional (or sometimes three-dimensional) overlay. At its core, a text rendering system needs to be capable of displaying a sequence of character glyphs corresponding to a text string, arranged in various orientations on the screen. A font is often implemented via a texture map containing the various required glyphs. A font description file provides information such as the bounding boxes of each glyph within the texture, and font layout information such as kerning, baseline offsets, and so on.

A good text/font system must account for the differences in character sets and reading directions inherent in various languages. Some text systems also provide various fun features like the ability to animate characters across the screen in various ways, the ability to animate individual characters, and so on. Some game engines even go so far as to implement a subset of the Adobe Flash standard in order to support a rich set of two-dimensional effects in their overlays and text. However, it's important to remember when implementing a game font system that only those features that are actually required by the game should be implemented. There's no point in furnishing your engine with an advanced text animation system if your game never needs to display animated text!

10.4.5. Gamma Correction

CRT monitors tend to have a nonlinear response to luminance values. That is, if a linearly-increasing ramp of R, G, or B values were to be sent to a CRT, the image that would result on-screen would be perceptually nonlinear to the human eye. Visually, the dark regions of the image would look darker than they should. This is illustrated in Figure 10.61.

Figure 10.61. The effect of a CRT's gamma response on image quality and how the effect can be corrected for. Image courtesy of www.wikipedia.org.

The gamma response curve of a typical CRT display can be modeled quite simply by the formula Vout = (Vin)^γCRT, where γCRT > 1. To correct for this effect, the colors sent to the CRT display are usually passed through an inverse transformation (i.e., using a gamma value γcorr < 1). The value of γCRT for a typical CRT monitor is 2.2, so the correction value is usually γcorr = 1/2.2 = 0.455. These gamma encoding and decoding curves are shown in Figure 10.62.

Figure 10.62. Gamma encoding and decoding curves. Image courtesy of www.wikipedia.org.

Gamma encoding can be performed by the 3D rendering engine to ensure that the values in the final image are properly gamma-corrected. One problem that is encountered, however, is that the bitmap images used to represent texture maps are often gamma-corrected themselves. A high-quality rendering engine takes this fact into account by gamma-decoding the textures prior to rendering and then re-encoding the gamma of the final rendered scene so that its colors can be reproduced properly on-screen.

10.4.6. Full-Screen Post Effects

Full-screen post effects are effects applied to a rendered three-dimensional scene that provide additional realism or a stylized look. These effects are often implemented by passing the entire contents of the screen through a pixel shader that applies the desired effect(s). This can be accomplished by rendering a full-screen quad that has been mapped with a texture containing the unfiltered scene. A few examples of full-screen post effects are given below:

- Motion blur. This is typically implemented by rendering a buffer of screen-space velocity vectors and using this vector field to selectively blur the rendered image. Blurring is accomplished by passing a convolution kernel over the image (see "Image Smoothing and Sharpening by Discrete Convolution" by Dale A. Schumacher, published in [4], for details).
- Depth-of-field blur. This blur effect can be produced by using the contents of the depth buffer to adjust the degree of blur applied at each pixel.
- Vignette. In this filmic effect, the brightness or saturation of the image is reduced at the corners of the screen for dramatic effect. It is sometimes implemented by literally rendering a texture overlay on top of the screen. A variation on this effect is used to produce the classic circular effect used to indicate that the player is looking through a pair of binoculars or a weapon scope.
- Colorization. The colors of screen pixels can be altered in arbitrary ways as a post-processing effect. For example, all colors except red could be desaturated to grey to produce a striking effect similar to the famous scene of the little girl in the red coat from Schindler's List.

10.5. Further Reading

We've covered a lot of material in a very short space in this chapter, but we've only just scratched the surface. No doubt you'll want to explore many of these topics in much greater detail. For an excellent overview of the entire process of creating three-dimensional computer graphics and animation for games and film, I highly recommend [23]. The technology that underlies modern real-time rendering is covered in excellent depth in [1], while [14] is well known as the definitive reference guide to all things related to computer graphics. Other great books on 3D rendering include [42], [9], and [10]. The mathematics of 3D rendering is covered very well in [28]. No graphics programmer's library would be complete without one or more books from the Graphics Gems series ([18], [4], [24], [19], and [36]) and/or the GPU Gems series ([13], [38], and [35]). Of course