GAME PROGRAMMING GEMS 8
Edited by Adam Lake

Course Technology PTR, a part of Cengage Learning
Australia, Brazil, Japan, Korea, Mexico, Singapore, Spain, United Kingdom, United States

© 2011 Course Technology, a part of Cengage Learning. ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.

For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1-800-354-9706. For permission to use material from this text or product, submit all requests online at Further permissions questions can be emailed to

All trademarks are the property of their respective owners. Cover image used courtesy of Valve Corporation. All other images © Cengage Learning unless otherwise noted.

Library of Congress Control Number: 2010920327
ISBN-13: 978-1-58450-702-4
ISBN-10: 1-58450-702-0

Course Technology, a part of Cengage Learning
20 Channel Center Street
Boston, MA 02210 USA

Cengage Learning is a leading provider of customized learning solutions with office locations around the globe, including Singapore, the United Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at: Cengage Learning products are represented in Canada by Nelson Education, Ltd. For your lifelong learning solutions, visit Visit our corporate website at

Game Programming Gems 8
Edited by Adam Lake

Publisher and General Manager, Course Technology PTR: Stacy L.
Hiquet
Associate Director of Marketing: Sarah Panella
Manager of Editorial Services: Heather Talbot
Marketing Manager: Jordan Castellani
Senior Acquisitions Editor: Emi Smith
Project and Copy Editor: Cathleen D. Small
Interior Layout: Shawn Morningstar
Cover Designer: Mike Tanamachi
CD-ROM Producer: Brandon Penticuff
Indexer: Katherine Stimson
Proofreader: Heather Urschel

Printed in the United States of America
1 2 3 4 5 6 7 12 11 10

eISBN-10: 1-43545-771-4

Contents

Preface
Contributors

Section 1: Graphics
Introduction (Jason Mitchell, Valve)
1.1 Fast Font Rendering with Instancing (Aurelio Reis, id Software)
1.2 Principles and Practice of Screen Space Ambient Occlusion (Dominic Filion, Blizzard Entertainment)
1.3 Multi-Resolution Deferred Shading (Hyunwoo Ki, INNOACE Co., Ltd)
1.4 View Frustum Culling of Catmull-Clark Patches in DirectX 11 (Rahul P. Sathe, Intel Advanced Visual Computing (AVC))
1.5 Ambient Occlusion Using DirectX Compute Shader (Jason Zink)
1.6 Eye-View Pixel Anti-Aliasing for Irregular Shadow Mapping (Nico Galoppo, Intel Advanced Visual Computing (AVC))
1.7 Overlapped Execution on Programmable Graphics Hardware (Allen Hux, Intel Advanced Visual Computing (AVC))
1.8 Techniques for Effective Vertex and Fragment Shading on the SPUs (Steven Tovey, Bizarre Creations Ltd.)

Section 2: Physics and Animation
Introduction (Jeff Lander, Darwin 3D, LLC)
2.1 A Versatile and Interactive Anatomical Human Face Model (Marco Fratarcangeli)
2.2 Curved Paths for Seamless Character Animation (Michael Lewin)
2.3 Non-Iterative, Closed-Form, Inverse Kinematic Chain Solver (NCF IK) (Philip Taylor)
2.4 Particle Swarm Optimization for Game Programming (Dario L. Sancho-Pradel)
2.5 Improved Numerical Integration with Analytical Techniques (Eric Brown)
2.6 What a Drag: Modeling Realistic Three-Dimensional Air and Fluid Resistance (B. Charles Rasco, Ph.D., President, Smarter Than You Software)
2.7 Application of Quasi-Fluid Dynamics for Arbitrary Closed Meshes (Krzysztof Mieloszyk, Gdansk University of Technology)
2.8 Approximate Convex Decomposition for Real-Time Collision Detection (Khaled Mamou)

Section 3: AI
Introduction (Borut Pfeifer)
3.1 AI Level of Detail for Really Large Worlds (Cyril Brom, Charles University in Prague; Tomáš Poch; Ondřej Šerý)
3.2 A Pattern-Based Approach to Modular AI for Games (Kevin Dill, Boston University)
3.3 Automated Navigation Mesh Generation Using Advanced Growth-Based Techniques (D. Hunter Hale)
3.4 A Practical Spatial Architecture for Animal and Agent Navigation (Michael Ramsey, Blue Fang Games, LLC)
3.5 Applying Control Theory to Game AI and Physics
(Brian Pickrell)
3.6 Adaptive Tactic Selection in First-Person Shooter (FPS) Games (Thomas Hartley and Quasim Mehdi, Institute of Gaming and Animation (IGA), University of Wolverhampton)
3.7 Embracing Chaos Theory: Generating Apparent Unpredictability through Deterministic Systems (Dave Mark, Intrinsic Algorithm LLC)
3.8 Needs-Based AI (Robert Zubek)
3.9 A Framework for Emotional Digital Actors (Phil Carlisle)
3.10 Scalable Dialog Authoring (Baylor Wetzel, Shikigami Games)
3.11 Graph-Based Data Mining for Player Trace Analysis in MMORPGs (Nikhil S. Ketkar and G. Michael Youngblood)

Section 4: General Programming
Introduction (Doug Binks, Intel Semiconductors AG)
4.1 Fast-IsA (Joshua Grass, PhD)
4.2 Registered Variables (Peter Dalton, Smart Bomb Interactive)
4.3 Efficient and Scalable Multi-Core Programming (Jean-François Dubé, Ubisoft Montreal)
4.4 Game Optimization through the Lens of Memory and Data Access (Steve Rabin, Nintendo of America Inc.)
4.5 Stack Allocation (Michael Dailly)
4.6 Design and Implementation of an In-Game Memory Profiler (Ricky Lung)
4.7 A More Informative Error Log Generator (J.L. Raza and Peter Iliev Jr.)
4.8 Code Coverage for QA
(Matthew Jack)
4.9 Domain-Specific Languages in Game Engines (Gabriel Ware)
4.10 A Flexible User Interface Layout System for Divergent Environments (Gero Gerber, Electronic Arts (EA Phenomic))
4.11 Road Creation for Projectable Terrain Meshes (Igor Borovikov, Aleksey Kadukin)
4.12 Developing for Digital Drawing Tablets (Neil Gower)
4.13 Creating a Multi-Threaded Actor-Based Architecture Using Intel® Threading Building Blocks (Robert Jay Gould, Square-Enix)

Section 5: Networking and Multiplayer
Introduction (Craig Tiller and Adam Lake)
5.1 Secure Channel Communication (Chris Lomont)
5.2 Social Networks in Games: Playing with Your Facebook Friends (Claus Höfele, Team Bondi)
5.3 Asynchronous I/O for Scalable Game Servers (Neil Gower)
5.4 Introduction to 3D Streaming Technology in Massively Multiplayer Online Games (Kevin Kaichuan He)

Section 6: Audio
Introduction (Brian Schmidt, Founder and Executive Director, GameSoundCon; President, Brian Schmidt Studios)
6.1 A Practical DSP Radio Effect (Ian Ni-Lewis)
6.2 Empowering Your Audio Team with a Great Engine (Mat Noguchi, Bungie)
6.3 Real-Time Sound Synthesis for Rigid Bodies
(Zhimin Ren and Ming Lin)

Section 7: General Purpose Computing on GPUs
Introduction (Adam Lake, Sr. Graphics Software Architect, Advanced Visual Computing, Intel)
7.1 Using Heterogeneous Parallel Architectures with OpenCL (Udeepta Bordoloi, Benedict R. Gaster, and Marc Romankewicz, Advanced Micro Devices)
7.2 PhysX GPU Rigid Bodies in Batman: Arkham Asylum (Richard Tonge, NVIDIA Corporation; Ben Wyatt and Ben Nicholson, Rocksteady Studios)
7.3 Fast GPU Fluid Simulation in PhysX (Simon Schirm and Mark Harris, NVIDIA Corporation)

Index

Preface

Welcome to the eighth edition of the Game Programming Gems series, started by Mark DeLoura in 2000. The first edition was inspired by Andrew Glassner's popular Graphics Gems series. Since then, other Gems series have started, including AI Gems and a new series focused on the capabilities of programmable graphics, the ShaderX series. These tomes serve as an opportunity to share our experience and best practices with the rest of the industry.

Many readers think of the Game Programming Gems series as a collection of articles with sections that target specialists. For me, I've read through them as a way to get exposure to the diverse subsystems used to create games and stay abreast of the latest techniques. For example, I may not be a specialist in networking, but reading this section will often enlighten and stimulate connections that I may not have made between areas in which I have expertise and ones in which I do not.
One statement I've heard recently regarding our industry is the idea that we now have all the horsepower we need to create games, so innovations by hardware companies are not needed. I believe this argument is flawed in many ways. First, there are continued advancements in graphical realism in academia, in R&D labs, and in the film industry that have yet to be incorporated into our real-time pipelines. As developers adopt these new features, the computational requirements of software will continue to increase. Second, and more importantly, this argument misunderstands what gaming serves from an anthropological perspective. Play is fundamental, not just to the human condition, but to the sentient condition. We invent interactive experiences on any platform, be it a deck of cards, a set of cardboard cutouts, or a next-gen PC platform with multi-terabyte data and multi-threaded, multi-gigahertz, multi-processor environments. It's as natural as the pursuit of food. This play inspires real-world applications and pushes the next generation of platform requirements. It enables affordability of ever-increased computational horsepower in our computing platforms.

The extension of gaming into other arenas, such as mobile and netbook platforms, serves only to prove the point. While the same ideas and themes may be used in these environments, the experience available to the player is different if the designer is to leverage the full capabilities and differentiating features of the platform.

There is an often-chanted claim that the cost of game development for console and PC platforms is ever increasing and, in the same breath, that this spiral of cost cannot continue. I believe these issues are of short-term concern. If there is a community willing to play, our economies will figure out a way to satisfy those needs.
This will open up new opportunities for venture capital and middleware to reduce those platform complexities and cross-industry development costs, fueling the next generation of interactive experiences. I do believe the process has changed and will continue to evolve, but game development will continue to thrive. Will there be 15 first-person military simulations on a single platform? Perhaps not, but will there continue to be compelling multiplayer and single-player experiences? I believe so. The ingenuity of the game developer, when brought to the task of leveraging new incarnations of silicon, will continue to create enriching interactive experiences for ever-increasing audiences.

Finally, I'd like to take a moment to address another issue often mentioned in the press. In November 2009, the Wall Street Journal ran an article by Jonathan V. Last from the Weekly Standard discussing the social implications of gaming. The majority of his article, "Videogames—Not Only for the Lonely," was making this observation in the context of a holiday gathering of family members of many generations sharing experiences with their Nintendo Wii. Near the end of the article, he writes that "the shift to videogames might be lamentable if it meant that people who would otherwise be playing mini-golf or Monopoly were sealing themselves off and playing Halo 3 death matches across the Internet." Much to the contrary, I have personally spent many quality multiplayer hours interacting socially with longtime friends when playing multiplayer games. A few days ago, I was having a conversation with an acquaintance who was thrilled that she could maintain her relationship with her brother on the East Coast by playing World of Warcraft with him.
Ultimately, whether we are discussing our individual game experiences with others or interacting directly while playing, games do what they have always done across generations and platforms: they bring us together with shared experiences, whether it be cardboard cutouts, a deck of cards, or multiplayer capture the flag. Despite the overall informed message of the article, the writer encouraged a myth I see repeated in the mainstream press by those out of touch with the multiplayer, socially interactive game experiences that are common today, including in Halo 3.

Overview of Content

The graphics section in this edition covers several topics of recent interest, leveraging new features of graphics APIs such as the Compute Shader and tessellation in DirectX 11, and includes two gems on the implementation details of Screen Space Ambient Occlusion (SSAO). In the physics and animation section, we have selected a number of gems that advance beyond the basics of topics such as IK solvers or fluid simulation in general. Instead, these gems go deeper, with improvements to existing published techniques based on real-world experience with the current state of the art: for example, a simple, fast, and accurate IK solver; leveraging swarm systems for animation; and modeling air and fluid resistance.

Artificial intelligence (AI) is one of the hottest areas in game development these days. Game players want worlds that don't just look real, but that also feel and act real. The acting part is the responsibility of the AI programmer. Gems in the AI section are diverse, covering areas such as decision making, detailed character simulation, and player modeling to solve the problem of gold-farm detection. The innovations discussed are sure to influence future gems.

In the general programming section, we have a number of tools to help with the development, performance, and testing of our game engines.
We include gems that deal with multi-threading using Intel's Threading Building Blocks (an open-source multi-threading library), memory allocation and profiling, as well as a useful code coverage system used by the developers at Crytek. The gems in the networking and multiplayer section cover architecture, security, scalability, and the leveraging of social networking applications to create multiplayer experiences.

The audio section had fewer submissions than in past years. Why is this? Is the area of audio lacking in innovation? Has it matured to the point where developers are buying off-the-shelf components? Regardless, we've assembled a collection of gems for audio that we think will be of interest. In one of the articles in the audio section, we discuss a relatively new idea: real-time calculation of the audio signal based on the actual physics instead of the traditional technique of playing a pre-recorded, processed sound. As games become more interactive and physics driven, there will be a corresponding demand for more realistic sound environments generated by such techniques, enabled by the increasing computational horsepower Moore's Law continues to deliver to game developers.

I'm excited to introduce a new section in this edition of Game Programming Gems 8 that I'm calling "General Purpose Computing on GPUs." This is a new area for the Gems series, and we wanted to have a real-world case study of a game developer using the GPU for non-graphics tasks. We've collected three gems for this section. The first is about OpenCL, a new open standard for programming the heterogeneous platforms of today, and we also have two gems that leverage PhysX for collision detection and fluid simulation. The PhysX components were used in Batman: Arkham Asylum by Rocksteady Studios Ltd.
As the computing capabilities of the platform evolve, I expect game developers will face the decision of what to compute, where to compute, and how to manage the data being operated upon. These articles serve as case studies of what others have done in their games. I expect this to be an exciting area of future development.

While we all have our areas of specialty, I think it's fair to say game developers are a hungry bunch, with a common desire to learn, develop, and challenge ourselves and our abilities. These gems are meant to inspire, enlighten, and evolve the industry. As always, we look forward to the contributions and feedback developers have when putting these gems into practice.

Adam Lake

About the Cover Image

© Valve Corporation

The cover of Game Programming Gems 8 features the Engineer from Valve's Team Fortress 2. With their follow-up to the original class-based multiplayer shooter Team Fortress, Valve chose to depart from the typical photorealistic military themes of the genre. Instead, they employed an "illustrative" non-photorealistic rendering style, reminiscent of American commercial illustrators of the 1920s. This was motivated by the need for players to be able to quickly visually identify each other's team, class, and weapon choices in the game. The novel art style and rendering techniques of Team Fortress 2 allowed Valve's designers to visually separate the character classes from each other and from the game's environments through the use of strong silhouettes and strategic distribution of color value.

CD-ROM Downloads

If you purchased an ebook version of this book, and the book had a companion CD-ROM, we will mail you a copy of the disc. Please send the title of the book, the ISBN, your name, address, and phone number. Thank you.

Acknowledgments

I'd like to take a moment to acknowledge the section editors that I worked with to create this tome. They are the best and brightest in the industry.
The quality of submissions and content in this book is a testament to this fact. They worked incredibly hard to bring this book together, and I thank them for their time and expertise. Also, I appreciate the time and patience that Emi Smith and Cathleen Small at Cengage Learning have put into this first-time book editor. They were essential in taking care of all the details necessary for publication. Finally, I'd like to acknowledge the artists at Valve who provided the cover image for this edition of Game Programming Gems.

I have been blessed to have had exposure to numerous inspirational individuals: friends who refused to accept norms, parents who satiated my educational desires, teachers willing to spend a few extra minutes on a random tangent, instructors who taught not just what we know about the world, but who also made me aware of the things we do not. Most importantly, I want to acknowledge my wife, Stacey Lake, who remained supportive while I toiled away in the evenings and weekends for the better part of a year on this book.

I dedicate these efforts to my mother, Amanda Lake. I thank her for teaching me that education is an enjoyable lifelong endeavor.

Contributors

Full bios for those contributors who submitted one can be found at downloads. Contributors to this book include:

Dr. Doug Binks, D.Phil.
Udeepta Bordoloi
Igor Borovikov
Cyril Brom
Eric Brown
Phil Carlisle
Michael Dailly
Peter Dalton
Kevin Dill
Jean-Francois Dube
Dominic Filion
Marco Fratarcangeli
Nico Galoppo
Benedict R. Gaster
Gero Gerber
Robert Jay Gould
Neil Gower
Joshua Grass, Ph.D.
Hunter Hale
Mark Harris
Thomas Hartley
Kevin He
Claus Höfele
Allen Hux
Peter Iliev
Matthew Jack
Aleksey Kadukin
Nikhil S. Ketkar
Hyunwoo Ki
Adam Lake
Michael Lewin
Chris Lomont, Ph.D.
Ricky Lung
Khaled Mamou
Dave Mark
Quasim Mehdi
Krzysztof Mieloszyk
Jason Mitchell
Ben Nicholson
Ian Ni-Lewis
Mat Noguchi
Borut Pfeifer
Brian Pickrell
Tomas Poch
Steve Rabin
Mike Ramsey
B. Charles Rasco, Ph.D.
João Lucas G.
Raza
Aurelio Reis
Zhimin Ren
Marc Romankewicz
Dario Sancho
Rahul Sathe
Simon Schirm
Brian Schmidt
Ondřej Šerý
Philip Taylor
Richard Tonge
Steven Tovey
Gabriel Ware
Ben Wyatt
G. Michael Youngblood
Jason Zink
Robert Zubek

SECTION 1: GRAPHICS

Introduction

Jason Mitchell, Valve

In this edition of the Game Programming Gems series, we explore a wide range of important real-time graphics topics, from lynchpin systems such as font rendering to cutting-edge hardware architectures such as Larrabee, the PlayStation 3, and the DirectX 11 compute shader. Developers in the trenches at top industry studios such as Blizzard, id, Bizarre Creations, Nexon, and Intel's Advanced Visual Computing group share their insights on optimally exploiting graphics hardware to create high-quality visuals for games.

To kick off this section, Aurelio Reis of id Software compares several methods for accelerating font rendering by exploiting GPU instancing, settling on a constant-buffer-based method that achieves the best performance. We then move on to two chapters discussing the popular image-space techniques of Screen Space Ambient Occlusion (SSAO) and deferred shading. Dominic Filion of Blizzard Entertainment discusses the SSAO algorithms used in StarCraft II, including novel controls that allowed Blizzard's artists to tune the look of the effect to suit their vision. Hyunwoo Ki of Nexon then describes a multi-resolution acceleration method for deferred shading that computes low-frequency lighting information at a lower spatial frequency and uses a novel method for handling high-frequency edge cases.

For the remainder of the section, we concentrate on techniques that take advantage of the very latest graphics hardware, from DirectX 11's tessellator and compute shader to Larrabee and the PlayStation 3. Rahul Sathe of Intel presents a method for culling Bezier patches in the context of the new DirectX 11 pipeline.
Jason Zink then describes the new DirectX 11 compute shader architecture, using Screen Space Ambient Occlusion as a case study to illustrate the novel aspects of this new hardware architecture. In a pair of articles from Intel, Nico Galoppo and Allen Hux describe a method for integrating anti-aliasing into the irregular shadow mapping algorithm, as well as a software task system that allows highly programmable systems such as Larrabee to achieve maximum throughput on this type of technique. We conclude the section with Steven Tovey's look at the SPU units on the PlayStation 3 and techniques for achieving maximum performance in the vehicle damage and light pre-pass rendering systems in the racing game Blur from Bizarre Creations.

1.1 Fast Font Rendering with Instancing

Aurelio Reis, id Software

Font rendering is an essential component of almost all interactive applications, and while techniques exist to allow for fully scalable vector-based font rendering using modern GPUs, the so-called "bitmap font" is still the most versatile, efficient, and easy-to-implement solution. When implemented on typical graphics APIs, however, this technique uses run-time updated vertex buffers to store per-glyph geometry, which can stall the graphics pipeline and result in inefficient rendering performance. By leveraging efficient particle system rendering techniques that were developed previously, it is possible to render thousands of glyphs in a single batch without ever touching the vertex buffer.

In this article, I propose a simple and efficient method to render fonts utilizing modern graphics hardware that compares favorably with other similar methods. This technique is also useful in that it can be generalized for use in rendering other 2D elements, such as sprites and graphical user interface (GUI) elements.

Text-Rendering Basics

The most common font format is the vector-based TrueType format.
This format represents font glyphs (in other words, alphabetic characters and other symbols) as vector data, specifically, quadratic Bezier curves and line segments. As a result, TrueType fonts are compact, easy to author, and scale well with different display resolutions. The downside of a vector font, however, is that it is not straightforward to directly render this type of data on graphics hardware.

There are, however, a few different ways to map the vector representation to a form that graphics hardware can render. One way is to generate geometry directly from the vector curves, as shown in Figure 1.1.1. However, while modern GPUs are quite efficient at rendering large numbers of triangles, the number of polygons generated from converting a large number of complex vector curves to a triangle mesh could number in the tens of thousands. This increase in triangle throughput can greatly decrease application performance. Some optimizations to this way of rendering fonts have been introduced, such as the technique described by Loop and Blinn, in which the polygonal mesh consists merely of the curve control points while the curve pixels are generated using a simple and efficient pixel shader [Loop05]. While this is a great improvement over the naive triangulation approach, the number of polygons generated in this approach is still prohibitively high on older graphics hardware (and that of the current console generation—the target of this article).

Because of these limitations, the most common approach relies on rasterizing vector graphics into a bitmap and displaying each glyph as a rectangle composed of two triangles (from here on referred to as a quad), as shown in Figure 1.1.2. A font texture page is generated with an additional UV offset table that maps glyphs to a location in that texture, very similar to how a texture atlas is used [NVIDIA04].
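To make the glyph-to-texture mapping concrete, here is a minimal CPU-side sketch (all names are hypothetical, and a uniform 16×16 glyph grid is assumed in place of a real per-glyph offset table) of how a glyph index can be mapped to a UV rectangle in the font page:

```cpp
#include <cassert>

// Hypothetical UV rectangle for one glyph inside the font page.
struct GlyphUV {
    float u0, v0;  // top-left corner in normalized texture coordinates
    float u1, v1;  // bottom-right corner
};

// Look up the UV rect for a glyph in a 16x16-cell atlas.
// A real font page would consult a per-glyph offset table generated
// alongside the rasterized page; a uniform grid keeps the sketch simple.
GlyphUV GlyphToUV(unsigned char glyph) {
    const int   cellsPerRow = 16;
    const float cellSize    = 1.0f / cellsPerRow;  // normalized cell extent
    int col = glyph % cellsPerRow;
    int row = glyph / cellsPerRow;
    GlyphUV uv;
    uv.u0 = col * cellSize;
    uv.v0 = row * cellSize;
    uv.u1 = uv.u0 + cellSize;
    uv.v1 = uv.v0 + cellSize;
    return uv;
}
```

A production font page would also store per-glyph metrics (advance, bearing) next to each UV rect; the uniform grid above merely illustrates the lookup itself.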
The most obvious drawback is the resolution dependence caused by the font page being rasterized at a predefined resolution, which leads to distortion when rendering a font at a non-native resolution. Additional techniques exist to supplement this approach with higher-quality results while mitigating the resolution dependence that leads to blurry and aliased textures, such as the approach described by [Green07]. Overall, the benefits of the raster approach outweigh the drawbacks, because rendering bitmap fonts is incredibly easy and efficient.

Figure 1.1.1 Vector curves converted into polygonal geometry.

Figure 1.1.2 A font page and a glyph rendered on a quad.

To draw glyphs for a bitmap font, the program must bind the texture page matching the intended glyph set and draw a quad for each glyph, taking into account spacing for kerning or other character-related offsets. While this technique yields very good performance, it can still be inefficient, as the buffers containing the geometry for each batch of glyphs must be continually updated. Constantly touching these buffers is a sure way to cause GPU stalls, resulting in decreased performance. For text- or GUI-heavy games, this can lead to an unacceptable overall performance hit.

Improving Performance

One way to draw the glyphs for the GUI is to create a GUI model that maintains buffers on the graphics card for drawing a predefined maximum number of indexed triangles as quads. Whenever a new glyph is to be drawn, its quad is inserted into a list, and the vertex buffer for the model is eventually updated with the needed geometry at a convenient point in the graphics pipeline. When the time comes to render the GUI model, assuming the same texture page is used, only a single draw call is required. As previously mentioned, this buffer must be updated each frame and for each draw batch that must be drawn.
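The batching scheme described above can be sketched on the CPU as follows (hypothetical names throughout; `Flush` stands in for the actual draw call and buffer update). Each glyph appends one quad—four vertices and six indices—and a batch is broken whenever the font texture page changes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One vertex of a glyph quad: position plus texture coordinates.
struct QuadVertex { float x, y, u, v; };

// A minimal CPU-side glyph batcher mirroring the "GUI model" in the text.
struct GlyphBatcher {
    std::vector<QuadVertex> vertices;
    std::vector<uint16_t>   indices;
    int currentPage = -1;   // font texture page bound for the current batch
    int drawCalls   = 0;    // counts how many times the batch was broken

    void AddGlyph(int page, float x, float y, float w, float h) {
        if (page != currentPage) {   // a texture page change breaks the batch
            Flush();
            currentPage = page;
        }
        uint16_t base = (uint16_t)vertices.size();
        vertices.push_back({x,     y,     0.0f, 0.0f});
        vertices.push_back({x + w, y,     1.0f, 0.0f});
        vertices.push_back({x + w, y + h, 1.0f, 1.0f});
        vertices.push_back({x,     y + h, 0.0f, 1.0f});
        uint16_t quad[6] = { base, (uint16_t)(base + 1), (uint16_t)(base + 2),
                             base, (uint16_t)(base + 2), (uint16_t)(base + 3) };
        indices.insert(indices.end(), quad, quad + 6);
    }

    void Flush() {   // stands in for the vertex buffer upload and draw call
        if (!vertices.empty()) ++drawCalls;
        vertices.clear();
        indices.clear();
    }
};
```

In a real implementation, `Flush` would lock the vertex buffer, copy the accumulated quads, and issue the indexed draw; the UV coordinates would come from the font page's glyph table rather than the 0..1 placeholders used here.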
Ideally, as few draw batches as possible are needed, as the font texture page should contain all the individual glyphs that would need to be rendered, but on occasion (such as for high-resolution fonts or Asian fonts with many glyphs), it's not possible to fit them all on one page. In the situation where a font glyph must be rendered from a different page, the batch is broken and must be presented immediately so that a new one can be started with the new texture. This holds true for any unique rendering states that a glyph may hold, such as blending modes or custom shaders.

Lock-Discard

The slowest part of the process is when the per-glyph geometry must be uploaded to the graphics card. Placing the buffer memory as close to AGP memory as possible (using API hints) helps, but locking and unlocking vertex buffers can still be quite expensive. To alleviate the expense, it is possible to use a buffer that is marked to "discard" its existing contents if the GPU is currently busy with it. By telling the API to discard the existing buffer, a new one is created, which can be written to immediately. Eventually, the old buffer is purged by the API under the covers. This use of lock-discard prevents the CPU from waiting on the GPU to finish consuming the buffer (for example, in the case where it was being rendered at the same time). You can specify this with the D3DLOCK_DISCARD flag in Direct3D or by passing a NULL pointer to glBufferDataARB and then calling glMapBufferARB(). Be aware that although this is quite an improvement, it is still not an ideal solution, as the entire buffer must be discarded. Essentially, this makes initiating a small update to the buffer impossible.

Vertex Compression

Another step in improving performance is reducing the amount of memory that needs to be sent to the video card.
The vertex structure for sending a quad looks something like this and takes 28 bytes per vertex (112 bytes for each quad):

struct GPU_QUAD_VERTEX_POS_TC_COLOR
{
    D3DXVECTOR4 Position;
    D3DXVECTOR2 Texcoord;
    D3DCOLOR    Color;
};

Since the bandwidth across the AGP bus to the video card is not infinite, it is important to be aware of how much memory is being pushed through it. One way to reduce the memory cost is to use an additional vertex stream to update only the information that has changed on a per-frame basis. Unfortunately, the three essential quad attributes (position, texture dimensions, and color) could be in a state of constant flux, so there is little frame-to-frame coherency we can exploit. There is one very easy way to reduce at least some of the data that must be sent to the video card, however. Traditionally, each vertex represents a corner of a quad. This is not ideal, because this data is relatively static. That is, the size and position of a quad change, but not the fact that it is a quad. Hicks describes a shader technique for aligning a billboarded quad toward the screen by storing a rightFactor and upFactor for each corner of the billboard and projecting those vertices along the camera axes [Hicks03]. This technique is attractive, as it moves the computation of offsetting the vertices onto the GPU and potentially limits the need for vertex buffer locks to update the quad positions. By using a separate vertex stream that contains unique data, it is possible to represent the width and height of the quad corners as a 4D unsigned byte vector. (Technically, you could go as small as a bool if that were supported on modern hardware.) In the vertex declaration, it is possible to map the position information to specific vertex semantics, which can then be accessed directly in the vertex shader.
The vertex structure would look something like this:

struct GPU_QUAD_VERTEX
{
    BYTE OffsetXY[ 4 ];
};

Although this may seem like an improvement, it really isn't, since the same amount of memory must be used to represent the quad attributes (more, in fact, since we're now also supplying a 4-byte offset). There is an easy way to supply this additional information without requiring the redundancy of all those additional vertices.

Instancing Quad Geometry

If you're lucky enough to support a Shader Model 3 profile, you have hardware support for some form of geometry instancing. OpenGL 2.0 has support for instancing using pseudo-instancing [GLSL04] and the EXT_draw_instanced [EXT06] extension, which uses the glDrawArraysInstancedEXT and glDrawElementsInstancedEXT routines to render up to 1,024 instanced primitives that are referenced via an instance identifier in shader code. As of DirectX 9, Direct3D also supports instancing, which can be utilized by creating a vertex buffer containing the instance geometry and an additional vertex buffer with the per-instance data. By using instancing, we're able to completely eliminate our redundant quad vertices (and index buffer) at the cost of an additional but smaller buffer that holds only the per-instance data. This buffer is directly hooked up to the vertex shader via input semantics and can be easily accessed with almost no additional work compared to the previous method. While this solution sounds ideal, we have found that instancing actually comes with quite a bit of per-batch overhead and requires quite a bit of instanced data to become a win. As a result, it should be noted that performance does not scale quite so well and in some situations can be as poor as that of the original buffer approach (or worse on certain hardware)!
This is likely attributable to the fact that the graphics hardware must still point to this data in some way or another, and while space is saved, additional logic is required to compute the proper vertex strides.

Constant Array Instancing

Another way to achieve similar results with better performance is to perform shader instancing using constant arrays. By creating a constant array for each of the separate quad attributes (in other words, position/size, texture coordinate position/size, and color), it is possible to represent all the necessary information without the need for a heavyweight vertex structure. See Figure 1.1.3.

Figure 1.1.3 A number of glyphs referencing their data from a constant array.

Similar to indexed vertex blending (a.k.a. matrix palette skinning), an index is assigned for each group of four vertices required to render a quad, as shown in Figure 1.1.4. To get the value for the current vertex, all that is needed is to index into the constant array using this value. Because the number of constants available is usually below 256 on pre–Shader Model 4 hardware, this index can be packed directly as an additional element in the vertex offset vector (thus requiring no additional storage space). It's also possible to use geometry instancing to just pass in the quad ID/index in order to bypass the need for a large buffer of four vertices per quad. However, as mentioned previously, we have found that instancing can be unreliable in practice.

Figure 1.1.4 A quad referencing an element within the attribute constant array.

This technique yields fantastic performance but has the downside of only allowing a certain number of constants, depending on your shader profile.
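The constant budget translates directly into a per-batch quad limit. Assuming one constant-register slot per attribute per quad (the three attributes named above: position/size, texture coordinate position/size, and color), the arithmetic is a simple integer division:

```cpp
#include <cassert>

// Quads that fit in one batch when each quad consumes one constant-register
// slot per attribute. With 256 registers and 3 attributes per quad, the
// integer division yields 85 quads, matching the figure quoted in the text.
int QuadsPerBatch(int constantRegisters, int attributesPerQuad) {
    return constantRegisters / attributesPerQuad;
}
```

A larger register file (as on Shader Model 4 hardware) raises this cap proportionally, which is why the technique's main limitation is tied to the shader profile.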
The vertex structure is incredibly compact, weighing in at a mere 4 bytes (16 bytes per quad), with an additional channel still available for use:

struct GPU_QUAD_VERTEX
{
    BYTE OffsetXY_IndexZ[ 4 ];
};

Given the three quad attributes presented above and a limit of 256 constants, up to 85 quads can be rendered per batch. Despite this limitation, performance can still be quite a bit better than that of the other approaches, especially as the number of state changes increases (driving up the number of batches and driving down the number of quads per batch).

Additional Considerations

I will now describe some small but important facets of font rendering, notably an efficient use of clip-space position and a cheap but effective sorting method. Also, in the sample code for this chapter on the book's CD, I have provided source code for a texture atlasing solution that readers may find useful in their font rendering systems.

Sorting

Fonts are typically drawn in a back-to-front fashion, relying on the painter's algorithm to achieve correct occlusion. Although this is suitable for most applications, certain situations may require that quads be layered in a different sort order than that in which they were drawn. This is easily implemented by using the remaining available value in the vertex structure offset/index vector as a z value for the quad, allowing for up to 256 layers.

Clip-Space Positions

To save a few instructions and the constant space for the world-view-projection matrix (the clip matrix), it's possible to specify the position directly in clip space and forego having to transform the vertices from perspective to orthographic space, as illustrated in Figure 1.1.5. Clip-space positions range from –1 to 1 in the X and Y directions.
To remap an absolute screen-space coordinate to clip space, we can use the equations cx = –1 + x * (2 / screen_width) and cy = 1 – y * (2 / screen_height), where x and y are the screen-space coordinates up to a maximum of screen_width and screen_height, respectively.

Texture Atlasing

On the book's CD, I have provided code for a simple virtual texture system that uses atlases to reduce batches. This system attempts to load an atlased version of a texture if possible and otherwise loads a texture directly from disk. There are some switches (documented in the code) for turning this system on and off, which demonstrate how important atlasing can be toward reducing the number of batches and maintaining a high level of performance.

Future Work

The techniques demonstrated in this chapter were tailored to work on current console technology, which is limited to Shader Model 3. In the future, I would like to extend these techniques to take advantage of new hardware features, such as Geometry Shaders and StreamOut, to further increase performance, image fidelity, and ease of use.

Figure 1.1.5 A quad/billboard being expanded.

Demo

On the accompanying disc, you'll find a Direct3D sample application that demonstrates each of the discussed techniques in a text- and GUI-rich presentation. Two scenes are presented: One displays a cityscape for a typical 2D tile-based game, and the other displays a Strange Attractor simulation. In addition, there is an option to go overboard with the text rendering. Feel free to play around with the code until you get a feel for the strengths and weaknesses of the different approaches. The main shader file (Font.fx) contains the shaders of interest as well as some additional functionality (such as font anti-aliasing/filtering). Please note that certain aspects (such as quad expansion) were made for optimum efficiency and not necessarily readability.
In general, most of the code was meant to be very accessible, and it will be helpful to periodically cross-reference the files GuiModel.cpp and Font.fx.

Conclusion

In this gem, I demonstrated a way to render font and GUI elements easily and efficiently by taking advantage of readily available hardware features, such as instancing, multiple stream support, and constant array indexing. As a takeaway item, you should be able to easily incorporate such a system into your technology base or improve an existing system with only minor changes.

References

[EXT06] "EXT_draw_instanced." 2006. OpenGL. n.d.
[GLSL04] "GLSL Pseudo-Instancing." 17 Nov. 2004. NVIDIA. n.d.
[Green07] Green, Chris. "Improved Alpha-Tested Magnification for Vector Textures and Special Effects." Course on Advanced Real-Time Rendering in 3D Graphics and Games. SIGGRAPH 2007. San Diego Convention Center, San Diego, CA. 8 August 2007.
[Hicks03] Hicks, O'Dell. "Screen-aligned Particles with Minimal VertexBuffer Locking." ShaderX2: Shader Programming Tips and Tricks with DirectX 9.0. Ed. Wolfgang F. Engel. Plano, TX: Wordware Publishing, Inc., 2004. 107–112.
[Loop05] Loop, Charles, and Jim Blinn. "Resolution Independent Curve Rendering Using Programmable Graphics Hardware." 2005. Microsoft. n.d.
[NVIDIA04] "Improve Batching Using Texture Atlases." 2004. NVIDIA. n.d.

1.2 Principles and Practice of Screen Space Ambient Occlusion
Dominic Filion, Blizzard Entertainment

Simulation of direct lighting in modern video games is a well-understood concept, as virtually all of real-time graphics has standardized on the Lambertian and Blinn models for simulating direct lighting. However, indirect lighting (also referred to as global illumination) is still an active area of research, with a variety of approaches being explored.
Moreover, although some simulation of indirect lighting is possible in real time, full simulation of all its effects in real time is very challenging, even on the latest hardware. Global illumination is based on simulating the effects of light bouncing around a scene multiple times as light is reflected off surfaces. Computational methods such as radiosity attempt to directly model this physical process by modeling the interactions of lights and surfaces in an environment, including the bouncing of light off of surfaces. Although highly realistic, sophisticated global illumination methods are typically too computationally intensive to perform in real time, especially for games, and thus to achieve complex shadowing and bounced lighting effects in games, one has to look for simplifications that achieve a comparable result.

One possible simplification is to focus on the visual effects of global illumination instead of the physical process, and furthermore to aim at a particular subset of the effects that global illumination achieves. Ambient occlusion is one such subset. Ambient occlusion simplifies the problem space by assuming all indirect light is equally distributed throughout the scene. With this assumption, the amount of indirect light hitting a point on a surface will be directly proportional to how much that point is exposed to the scene around it. A point on a plane surface can receive light from a full 180-degree hemisphere around that point and above the plane. In another example, a point in a room's corner, as shown in Figure 1.2.1, could receive a smaller amount of light than a point in the middle of the floor, since a greater amount of its "upper hemisphere" is occluded by the nearby walls. The resulting effect is a crude approximation of global illumination that enhances depth in the scene by shrouding corners, nooks, and crannies.
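The quantity being approximated can be stated compactly: the occlusion term for a point is the fraction of rays over its upper hemisphere that hit nearby geometry. As a minimal sketch (the function name is hypothetical, and the per-ray blocked/unblocked results would come from actual ray casts in a real implementation):

```cpp
#include <cassert>
#include <vector>

// Ambient occlusion term: fraction of hemisphere sample rays that were
// blocked by the environment. 0 = fully open (e.g., flat unobstructed
// plane), 1 = fully enclosed.
float OcclusionTerm(const std::vector<bool>& rayBlocked) {
    int blocked = 0;
    for (bool b : rayBlocked)
        if (b) ++blocked;
    return static_cast<float>(blocked) /
           static_cast<float>(rayBlocked.size());
}
```

A point on an open floor would report near-zero occlusion, while a corner point, with roughly half its hemisphere facing the walls, would report around one half; this is the ratio the SSAO technique later estimates from the depth buffer instead of with rays.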
Artistically, the effect can be controlled by varying the size of the hemisphere within which other objects are considered to occlude neighboring points; large hemisphere ranges will extend the shadow shroud outward from corners and recesses. Although the global illumination problem has been vastly simplified through this approach, it can still be prohibitively expensive to compute in real time. Every point on every scene surface needs to cast many rays around it to test whether an occluding object might be blocking the light, and an ambient occlusion term is computed based on how many rays were occluded out of the total number of rays emitted from that point. Performing arbitrary ray intersections with the full scene is also difficult to implement on graphics hardware. We need further simplification.

Figure 1.2.1 Ambient occlusion relies on finding how much of the hemisphere around the sampling point is blocked by the environment.

Screen Space Ambient Occlusion

What is needed is a way to structure the scene so that we can quickly and easily determine whether a given surface point is occluded by nearby geometry. It turns out that the standard depth buffer, which graphics engines already use to perform hidden surface removal, can be used to approximate local occlusion [Shanmugam07, Mittring07]. By definition, the depth buffer contains the depth of every visible point in the scene. From these depths, we can reconstruct the 3D positions of the visible surface points. Points that can potentially occlude other points are located close to each other in both screen space and world space, making the search for potential occluders straightforward. We need to align a sampling hemisphere with each point's upper hemisphere as defined by its normal. We will thus need a normal buffer that encodes the normal of every corresponding point in the depth buffer in screen space.
Rather than doing a full ray intersection, we can simply inspect the depths of neighboring points to establish the likelihood that each is occluding the current point. Any neighbor whose 2D position does not fall within the 2D coverage of the hemisphere could not possibly be an occluder. If it does lie within the hemisphere, then the closer the neighbor point's depth is to the target point, the higher the odds that it is an occluder. If the neighbor's depth is behind the point being tested for occlusion, then no occlusion is assumed to occur. All of these calculations can be performed using the screen space buffer of normals and depths, hence the name Screen Space Ambient Occlusion (SSAO).

Figure 1.2.2 SSAO samples neighbor points to discover the likelihood of occlusion. Lighter arrows are behind the center point and are considered occluded samples.

At first glance, this may seem like a gross oversimplification. After all, the depth buffer doesn't contain the whole scene, just the visible parts of it, and as such is only a partial reconstruction of the scene. For example, a point in the background could be occluded by an object that is hidden behind another object in the foreground, which a depth buffer would completely miss. Thus, there would be pixels in the image that should have some amount of occlusion but don't, due to the incomplete representation we have of the scene's geometry. It turns out that these kinds of artifacts are not especially objectionable in practice. The eye focuses first on cues from objects within the scene, and missing cues from objects hidden behind one another are not as disturbing. Furthermore, ambient occlusion is a low-frequency phenomenon; what matters more is the general effect rather than specific detailed cues, and taking shortcuts to achieve a similar yet incorrect effect is a fine tradeoff in this case.
Discovering where the artifacts lie should be more a process of rationalizing the errors than of simply catching them with the untrained eye. From this brief overview, we can outline the steps we will take to implement Screen Space Ambient Occlusion:

• We will first need a depth buffer and a normal buffer at our disposal from which we can extract information.
• From these screen space maps, we can derive our algorithm. Each pixel in screen space will generate a corresponding ambient occlusion value for that pixel and store that information in a separate render target. For each pixel in our depth buffer, we extract that point's position and sample n neighboring pixels within the hemisphere aligned around the point's normal.
• The ratio of occluding versus non-occluding points will be our ambient occlusion term result.
• The ambient occlusion render target can then be blended with the color output from the scene generated afterward.

I will now describe our Screen Space Ambient Occlusion algorithm in greater detail.

Generating the Source Data

The first step in setting up the SSAO algorithm is to prepare the necessary incoming data. Depending on how the final compositing is to be done, this can be accomplished in one of two ways. The first method requires that the scene be rendered twice. The first pass will render the depth and normal data only. The SSAO algorithm can then generate the ambient occlusion output in an intermediate step, and the scene can be rendered again in full color. With this approach, the ambient occlusion map (in screen space) can be sampled by direct lights from the scene to have their contribution modulated by the ambient occlusion term as well, which can help make the contributions from direct and indirect lighting more coherent with each other.
This approach is the most flexible but is somewhat less efficient, because the geometry has to be passed to the hardware twice, doubling the API batch count and, of course, the geometry processing load. A different approach is to render the scene only once, using multiple render targets bound as output to generate the depth and normal information as the scene is first rendered without an ambient lighting term. SSAO data is then generated as a post-step, and the ambient lighting term can simply be added. This is a faster approach, but in practice artists lose the flexibility to decide which individual lights in the scene may or may not be affected by the ambient occlusion term, should they want to do so. Using a fully deferred renderer and pushing the entire scene lighting stage to a post-processing step can get around this limitation, allowing the entire lighting setup to be configured to use ambient occlusion per light. Whether to use the single-pass or dual-pass method will depend on the constraints that are most important to a given graphics engine.

In all cases, a suitable format must be chosen to store the depth and normal information. When supported, a 16-bit floating-point format will be the easiest to work with, storing the normal components in the red, green, and blue channels and storing depth as the alpha component. Screen Space Ambient Occlusion is very bandwidth intensive, and minimizing sampling bandwidth is necessary to achieve optimal performance. Moreover, if using the single-pass multi-render-target approach, all bound render targets typically need to be of the same bit depth on the graphics hardware. If the main color output is 32-bit RGBA, then outputting to a 16-bit floating-point buffer at the same time won't be possible.
To minimize bandwidth and storage, the depth and normal can be encoded in as little as a single 32-bit RGBA color, storing the x and y components of the normal in the 8-bit red and green channels while storing a 16-bit depth value in the blue and alpha channels. The HLSL shader code for encoding and decoding the normal and depth values is shown in Listing 1.2.1.

LISTING 1.2.1 HLSL code to decode the normal on subsequent passes as well as HLSL code used to encode and decode the 16-bit depth value

// Normal encoding simply outputs x and y components in R and G in
// the range 0..1
float3 DecodeNormal( float2 cInput )
{
    float3 vNormal;
    vNormal.xy = 2.0f * cInput.rg - 1.0f;
    vNormal.z = sqrt( max( 0, 1 - dot( vNormal.xy, vNormal.xy ) ) );
    return vNormal;
}

// Encode depth to B and A
float2 DepthEncode( float fDepth )
{
    float2 vResult;
    // Input depth must be mapped to 0..1 range.
    fDepth = fDepth / p_fScalingFactor;
    // B = basis = 8 bits = 256 possible values
    // A = fractional part with each 1/256th slice
    vResult = frac( float2( fDepth, fDepth * 256.0f ) );
    return vResult;
}

float DecodeDepth( float4 cInput )
{
    return dot( cInput.ba, float2( 1.0f, 1.0f / 256.0f ) )
        * p_fScalingFactor;
}

Sampling Process

With the input data in hand, we can begin the ambient occlusion generation process itself. At any visible point on a surface on the screen, we need to explore neighboring points to determine whether they could occlude our current point. Multiple samples are thus taken from neighboring points in the scene using a filtering process described by the HLSL shader code in Listing 1.2.2.

LISTING 1.2.2 Screen Space Ambient Occlusion filter described in HLSL code

// i_VPOS is the screen pixel coordinate as given by the HLSL VPOS interpolant.
// p_vSSAOSamplePoints is a distribution of sample offsets for each sample.
float4 PostProcessSSAO( float3 i_VPOS )
{
    float2 vScreenUV; // This will become useful later.
    float3 vViewPos = 2DPosToViewPos( i_VPOS, vScreenUV );

    half fAccumBlock = 0.0f;
    for ( int i = 0; i < iSampleCount; i++ )
    {
        float3 vSamplePointDelta = p_vSSAOSamplePoints[i];
        float fBlock = TestOcclusion( vViewPos,
                                      vSamplePointDelta,
                                      p_fOcclusionRadius,
                                      p_fFullOcclusionThreshold,
                                      p_fNoOcclusionThreshold,
                                      p_fOcclusionPower );
        fAccumBlock += fBlock;
    }
    fAccumBlock /= iSampleCount;
    return 1.0f - fAccumBlock;
}

We start with the current point, p, whose occlusion we are computing. We have the point's 2D coordinate in screen space. Sampling the depth buffer at the corresponding UV coordinates, we can retrieve that point's depth. From these three pieces of information, the 3D position of the point can be reconstructed using the shader code shown in Listing 1.2.3.

LISTING 1.2.3 HLSL shader code used to map a pixel from screen space to view space

// p_vRecipDepthBufferSize = 1.0 / depth buffer width and height in pixels.
// p_vCameraFrustrumSize = Full width and height of the camera frustum at the
// camera's near plane in world space.
float2 p_vRecipDepthBufferSize;
float2 p_vCameraFrustrumSize;

float3 2DPosToViewPos( float3 i_VPOS, out float2 vScreenUV )
{
    float2 vViewSpaceUV = i_VPOS * p_vRecipDepthBufferSize;
    vScreenUV = vViewSpaceUV;
    // From 0..1 to 0..2
    vViewSpaceUV = vViewSpaceUV * float2( 2.0f, -2.0f );
    // From 0..2 to -1..1
    vViewSpaceUV = vViewSpaceUV + float2( -1.0f, 1.0f );
    vViewSpaceUV = vViewSpaceUV * p_vCameraFrustrumSize * 0.5f;

    return float3( vViewSpaceUV.x, vViewSpaceUV.y, 1.0f )
        * tex2D( p_sDepthBuffer, vScreenUV ).r;
}

We will need to sample the surrounding area of the point p along multiple offsets from its position, giving us n neighbor positions qi. Sampling the normal buffer will give us the normal around which we can align our set of offset vectors, ensuring that all sample offsets fall within point p's upper hemisphere.
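As a sanity check, the screen-to-view mapping of Listing 1.2.3 can be mirrored on the CPU. This is a sketch with illustrative buffer and frustum dimensions, not code from the chapter; it assumes the same convention as the listing (pixel origin at the top-left, view-space y pointing up, positions at unit depth scaled by the sampled depth):

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// CPU mirror of the Listing 1.2.3 reconstruction: pixel coordinate plus
// sampled view-space depth to a view-space position.
Vec3 PixelToViewPos(float px, float py, float depth,
                    float bufW, float bufH, float frustW, float frustH) {
    float u = px / bufW;             // 0..1 across the buffer
    float v = py / bufH;
    float nx = u * 2.0f - 1.0f;      // -1..1, +x to the right
    float ny = 1.0f - v * 2.0f;      // -1..1, +y up (screen y is flipped)
    // Scale to the frustum slice at unit depth, then push out by depth.
    return { nx * frustW * 0.5f * depth,
             ny * frustH * 0.5f * depth,
             depth };
}
```

The center pixel reconstructs to a point on the view axis, and corner pixels land on the frustum edges, which is exactly the behavior the shader relies on when it later re-projects neighbor samples back into the depth buffer.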
Transforming each offset vector by a matrix can be expensive, and one alternative is to perform a dot product between the offset vector and the normal vector at that point, flipping the offset vector if the dot product is negative, as shown in Figure 1.2.3. This is a cheaper way to constrain the offset vectors without doing a full matrix transform, but it has the drawback of using fewer effective samples when samples are rejected for falling behind the plane of the surface at point p.

Figure 1.2.3 Samples behind the hemisphere are flipped over to stay within the hemisphere.

Each neighbor's 3D position can then be transformed back to screen space in 2D, and the depth of the neighbor point can be sampled from the depth buffer. From this neighboring depth value, we can establish whether an object likely occupies that space at the neighbor point. Listing 1.2.4 shows shader code to test for this occlusion.

LISTING 1.2.4 HLSL code used to test occlusion by a neighboring pixel

float TestOcclusion( float3 vViewPos,
                     float3 vSamplePointDelta,
                     float fOcclusionRadius,
                     float fFullOcclusionThreshold,
                     float fNoOcclusionThreshold,
                     float fOcclusionPower )
{
    float3 vSamplePoint = vViewPos + fOcclusionRadius * vSamplePointDelta;
    float2 vSamplePointUV;
    vSamplePointUV = vSamplePoint.xy / vSamplePoint.z;
    vSamplePointUV = vSamplePointUV / p_vCameraFrustrumSize / 0.5f;
    vSamplePointUV = vSamplePointUV + float2( 1.0f, -1.0f );
    vSamplePointUV = vSamplePointUV * float2( 0.5f, -0.5f );

    float fSampleDepth = tex2D( p_sDepthBuffer, vSamplePointUV ).r;
    float fDistance = vSamplePoint.z - fSampleDepth;

    return OcclusionFunction( fDistance, fFullOcclusionThreshold,
                              fNoOcclusionThreshold, fOcclusionPower );
}

We now have the 3D positions of both our point p and the neighboring points qi. We also have the depth di of the frontmost object along the ray that connects the eye to each neighboring point.
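The normal-based flip described above is simple enough to sketch on the CPU. This is a schematic version (vector type and function names are illustrative), showing the one branch that replaces the full matrix transform:

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

float Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Cheap hemisphere alignment: if a sample offset points below the surface
// (negative dot product with the normal), negate it so it lands in the
// upper hemisphere, as in Figure 1.2.3. No basis transform required.
Vec3 FlipIntoHemisphere(Vec3 offset, const Vec3& normal) {
    if (Dot(offset, normal) < 0.0f) {
        offset.x = -offset.x;
        offset.y = -offset.y;
        offset.z = -offset.z;
    }
    return offset;
}
```

After the flip, every offset satisfies dot(offset, normal) >= 0, at the cost that pairs of mirrored offsets collapse onto the same direction, which is the sample-count drawback the text mentions.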
How do we determine ambient occlusion? The depth di gives us some hints as to whether a solid object occupies the space at each of the sampled neighboring points. Clearly, if the depth di is behind the sampled point's depth, it cannot occupy the space at the sampled point. The depth buffer does not give us the thickness of the object along the ray from the viewer; thus, if the depth of the object is anywhere in front of p, it may occupy the space, though without thickness information we can't know for sure. We can devise some reasonable heuristics with the information we do have and use a probabilistic method. The further in front of the sample point the depth is, the less likely it is to occupy that space. Also, the greater the distance between the point p and the neighbor point, the lesser the occlusion, as the object covers a smaller part of the hemisphere. Thus, we can derive occlusion heuristics based on:

• The difference between the sampled depth di and the depth of the point qi
• The distance between p and qi

For the first relationship, we can formulate an occlusion function to map the depth deltas to occlusion values. If the aim is to be physically correct, then the occlusion function should be quadratic. In our case we are more concerned about letting our artists adjust the occlusion function, and thus the occlusion function can be arbitrary. Really, the occlusion function can be any function that adheres to the following criteria:

• Negative depth deltas should give zero occlusion. (The occluding surface is behind the sample point.)
• Smaller depth deltas should give higher occlusion values.
• The occlusion value needs to fall to zero again beyond a certain depth delta value, as the object is too far away to occlude.

For our implementation, we simply chose a linearly stepped function that is entirely controlled by the artist. A graph of our occlusion function is shown in Figure 1.2.4.
There is a full-occlusion threshold, where every positive depth delta smaller than this value receives complete occlusion of one, and a no-occlusion threshold, beyond which no occlusion occurs. Depth deltas between these two extremes fall off linearly from one to zero, with the falloff raised to a specified occlusion power value. If a more complex occlusion function is required, it can be pre-computed in a small 1D texture to be looked up on demand.

Figure 1.2.4 SSAO blocker function.

LISTING 1.2.5 HLSL code used to implement the occlusion function

float OcclusionFunction( float fDistance,
                         float fFullOcclusionThreshold,
                         float fNoOcclusionThreshold,
                         float fOcclusionPower )
{
    const float c_occlusionEpsilon = 0.01f;

    if ( fDistance > c_occlusionEpsilon )
    {
        // Past the no-occlusion threshold there is no occlusion.
        float fNoOcclusionRange = fNoOcclusionThreshold -
                                  fFullOcclusionThreshold;
        if ( fDistance < fFullOcclusionThreshold )
            return 1.0f;
        else
            return max( 1.0f - pow( ( fDistance - fFullOcclusionThreshold ) /
                                    fNoOcclusionRange, fOcclusionPower ),
                        0.0f );
    }
    else
        return 0.0f;
}

Once we have gathered an occlusion value for each sample point, we can take the average of these, weighted by the distance of each sample point to p, and the average will be our ambient occlusion value for that pixel.

Sampling Randomization

Sampling neighboring pixels at regular vector offsets will produce glaring artifacts to the eye, as shown in Figure 1.2.5. To smooth out the results of the SSAO lookups, the offset vectors can be randomized. A good approach is to generate a 2D texture of random normal vectors and perform a lookup on this texture in screen space, thus fetching a unique random vector per pixel on the screen, as illustrated in Figure 1.2.6 [Mittring07]. We have n neighbors we must sample, and thus we will need to generate a set of n unique vectors per pixel on the screen.
These will be generated by passing a set of offset vectors in the pixel shader constant registers and reflecting these vectors through the sampled random vector, resulting in a semi-random set of vectors at each pixel, as illustrated by Listing 1.2.6. The set of vectors passed in as registers is not normalized; having varying lengths helps to smooth out the noise pattern and produces a more even distribution of the samples inside the occlusion hemisphere. The offset vectors must not be too short, to avoid clustering samples too close to the source point p. In general, varying the offset vectors from half to full length of the occlusion hemisphere radius produces good results. The size of the occlusion hemisphere becomes a parameter controllable by the artist that determines the size of the sampling area.

Figure 1.2.5 SSAO without random sampling.

Figure 1.2.6 Randomized sampling process.

LISTING 1.2.6 HLSL code used to generate a set of semi-random 3D vectors at each pixel

float3 reflect( float3 vSample, float3 vNormal )
{
    return normalize( vSample - 2.0f * dot( vSample, vNormal ) * vNormal );
}

float3x3 MakeRotation( float fAngle, float3 vAxis )
{
    float fS;
    float fC;
    sincos( fAngle, fS, fC );
    float fXX = vAxis.x * vAxis.x;
    float fYY = vAxis.y * vAxis.y;
    float fZZ = vAxis.z * vAxis.z;
    float fXY = vAxis.x * vAxis.y;
    float fYZ = vAxis.y * vAxis.z;
    float fZX = vAxis.z * vAxis.x;
    float fXS = vAxis.x * fS;
    float fYS = vAxis.y * fS;
    float fZS = vAxis.z * fS;
    float fOneC = 1.0f - fC;
    float3x3 result = float3x3(
        fOneC * fXX + fC,  fOneC * fXY + fZS, fOneC * fZX - fYS,
        fOneC * fXY - fZS, fOneC * fYY + fC,  fOneC * fYZ + fXS,
        fOneC * fZX + fYS, fOneC * fYZ - fXS, fOneC * fZZ + fC );
    return result;
}

float4 PostProcessSSAO( float3 i_VPOS )
{
    ...
    const float c_scalingConstant = 256.0f;
    float3 vRandomNormal = ( normalize( tex2D( p_sSSAONoise,
        vScreenUV * p_vSrcImageSize / c_scalingConstant ).xyz * 2.0f - 1.0f ) );
    float3x3 rotMatrix = MakeRotation( 1.0f, vNormal );
    half fAccumBlock = 0.0f;
    for ( int i = 0; i < iSampleCount; i++ )
    {
        float3 vSamplePointDelta = reflect( p_vSSAOSamplePoints[i], vRandomNormal );
        float fBlock = TestOcclusion( vViewPos, vSamplePointDelta,
                                      p_fOcclusionRadius,
                                      p_fFullOcclusionThreshold,
                                      p_fNoOcclusionThreshold,
                                      p_fOcclusionPower );
        fAccumBlock += fBlock;
    }
    ...
}

Ambient Occlusion Post-Processing

As shown in Figure 1.2.7, the previous step helps to break up the noise pattern, producing a finer-grained pattern that is less objectionable. With wider sampling areas, however, a further blurring of the ambient occlusion result becomes necessary. The ambient occlusion results are low frequency, and losing some of the high-frequency detail due to blurring is generally preferable to the noisy result obtained by the previous steps.

To smooth out the noise, a separable Gaussian blur can be applied to the ambient occlusion buffer. However, the ambient occlusion must not bleed through edges to objects that are physically separate within the scene. A form of bilateral filtering is used. This filter samples the nearby pixels as a regular Gaussian blur shader would, yet the normal and depth for each of the Gaussian samples are sampled as well. (Encoding the normal and depth in the same render targets presents significant advantages here.)

Figure 1.2.7 SSAO term after random sampling applied. Applying blur passes will further reduce the noise to achieve the final look.
If the depth from the Gaussian sample differs from the center tap by more than a certain threshold, or the dot product of the Gaussian sample normal and the center tap normal is less than a certain threshold value, then the Gaussian weight is reduced to zero. The sum of the Gaussian samples is then renormalized to account for the missing samples.

LISTING 1.2.7 HLSL code used to blur the ambient occlusion image

// i_UV:                 UV of center tap
// p_fBlurWeights:       Array of Gaussian weights
// i_GaussianBlurSample: Array of interpolants, with each interpolant
//                       packing 2 Gaussian sample positions.
float4 PostProcessGaussianBlur( VertexTransport vertOut )
{
    float2 vCenterTap = i_UV.xy;
    float4 cValue = tex2D( p_sSrcMap, vCenterTap.xy );
    float4 cResult = cValue * p_fBlurWeights[0];
    float fTotalWeight = p_fBlurWeights[0];

    // Sample normal & depth for center tap.
    float4 vNormalDepth = tex2D( p_sNormalDepthMap, vCenterTap.xy );

    for ( int i = 0; i < b_iSampleInterpolantCount; i++ )
    {
        half4 cValue = tex2D( p_sSrcMap, i_GaussianBlurSample[i].xy );
        half fWeight = p_fBlurWeights[i * 2 + 1];
        float4 vSampleNormalDepth = tex2D( p_sNormalDepthMap,
                                           i_GaussianBlurSample[i].xy );
        if ( dot( vSampleNormalDepth.rgb, vNormalDepth.rgb ) < 0.9f ||
             abs( vSampleNormalDepth.a - vNormalDepth.a ) > 0.01f )
            fWeight = 0.0f;
        cResult += cValue * fWeight;
        fTotalWeight += fWeight;

        cValue = tex2D( p_sSrcMap, i_GaussianBlurSample[i].zw );
        fWeight = p_fBlurWeights[i * 2 + 2];
        vSampleNormalDepth = tex2D( p_sNormalDepthMap,
                                    i_GaussianBlurSample[i].zw );
        if ( dot( vSampleNormalDepth.rgb, vNormalDepth.rgb ) < 0.9f ||
             abs( vSampleNormalDepth.a - vNormalDepth.a ) > 0.01f )
            fWeight = 0.0f;
        cResult += cValue * fWeight;
        fTotalWeight += fWeight;
    }

    // Rescale result according to number of discarded samples.
    cResult *= 1.0f / fTotalWeight;
    return cResult;
}

Several blur passes can thus be applied to the ambient occlusion output to completely eliminate the noisy pattern, trading off some higher-frequency detail in exchange.

Handling Edge Cases

The offset vectors are in view space, not screen space, and thus the length of the offset vectors will vary depending on how far away they are from the viewer. This can result in using an insufficient number of samples at close-up pixels, resulting in a noisier result for these pixels. Of course, samples can also go outside the 2D bounds of the screen. Naturally, depth information outside of the screen is not available. In our implementation, we ensure that samples outside the screen return a large depth value, ensuring they would never occlude any neighboring pixels. This can be achieved through the "border color" texture wrapping state, setting the border color to a suitably high depth value.

To prevent unacceptable breakdown of the SSAO quality in extreme close-ups, the number of samples can be increased dynamically in the shader based on the distance of the point p to the viewer. This can improve the quality of the visual results but can result in erratic performance. Alternatively, the 2D offset vector lengths can be artificially capped to some threshold value regardless of distance from the viewer. In effect, if the camera is very close to an object and the SSAO samples end up being too wide, the SSAO area consistency constraint is violated so that the noise pattern doesn't become too noticeable.

Figure 1.2.8 Result of Gaussian blur.

Optimizing Performance

Screen Space Ambient Occlusion can have a significant payoff in terms of mood and visual quality of the image, but it can be quite an expensive effect. The main bottleneck of the algorithm is the sampling itself.
The semi-random nature of the sampling, which is necessary to minimize banding, wreaks havoc with the GPU's texture cache system and can become a problem if not managed. The performance of the texture cache will also be very dependent on the sampling area size, with wider areas straining the cache more and yielding poorer performance. Our artists quickly got in the habit of using SSAO to achieve a faked global illumination look that suited their purposes. This required more samples and wider sampling areas, so extensive optimization became necessary for us.

One method to bring SSAO to an acceptable performance level relies on the fact that ambient occlusion is a low-frequency phenomenon. Thus, there is generally no need for the depth buffer sampled by the SSAO algorithm to be at full-screen resolution. The initial depth buffer can be generated at screen resolution, since the depth information is generally reused for other effects, and it potentially has to fit the size of other render targets, but it can thereafter be downsampled to a smaller depth buffer that is a quarter size of the original on each side. The downsampling itself does have some cost, but the payback in improved throughput is very significant. Downsampling the depth buffer also makes it possible to convert it from a wide 16-bit floating-point format to a more bandwidth-friendly 32-bit packed format.

Fake Global Illumination and Artistic Styling

If the ambient occlusion hemisphere is large enough, the SSAO algorithm eventually starts to mimic behavior seen from general global illumination; a character relatively far away from a wall could cause the wall to catch some of the subtle shadowing cues a global illumination algorithm would detect. If the sampling area of the SSAO is wide enough, the look of the scene changes from darkness in nooks and crannies to a softer, ambient feel.
This can pull the art direction in two somewhat conflicting directions: on the one hand, the need for tighter, high-contrast occluded zones in deeper recesses, and on the other hand, the desire for the larger, softer, ambient look of the wide-area sampling. One approach is to split the SSAO samples between two different sets of SSAO parameters: Some samples are concentrated in a small area with a rapidly increasing occlusion function (generally a quarter of all samples), while the remaining samples use a wide sampling area with a gentler function slope. The two sets are then averaged independently, and the final result uses the value from the set that produces the most (darkest) occlusion. This is the approach that was used in StarCraft II.

The edge-enhancing component of the ambient occlusion does not require as many samples as the global illumination one; thus, a quarter of the samples can be assigned to crease enhancement while the remainder are assigned to the larger area threshold.

Though SSAO provides important lighting cues to enhance the depth of the scene, there was still a demand from our artists for more accurate control that was only feasible through the use of some painted-in ambient occlusion. The creases from SSAO in particular cannot reach the accuracy that a simple texture can without using an enormous number of samples. Thus the usage of SSAO does not preclude the need for some static ambient occlusion maps to be blended in with the final ambient occlusion result, which we have done here.

Figure 1.2.9 SSAO with different sampling-area radii.

For our project, complaints about image noise, balanced with concerns about performance, were the main issues to deal with for the technique to gain acceptance among our artists.
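A minimal CPU-side sketch of this two-set split (the function name and the 0-to-1 occlusion convention are ours, not from the text):

```python
def combined_occlusion(small_area_samples, large_area_samples):
    # StarCraft II-style dual parameter sets: each list holds per-sample
    # occlusion values (1.0 = fully occluded) already run through that
    # set's own occlusion function. The two sets are averaged
    # independently, and the darkest (largest) average wins.
    avg_small = sum(small_area_samples) / len(small_area_samples)
    avg_large = sum(large_area_samples) / len(large_area_samples)
    return max(avg_small, avg_large)
```

For example, with a tight crease set returning [1.0, 1.0, 0.0, 0.0] and a wide set returning all zeros, the crease set's average of 0.5 is kept, preserving the crease darkening even where the wide-area term sees nothing.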
Increasing SSAO samples helps improve the noise, yet it takes an ever-increasing number of samples to get ever-smaller gains in image quality. Past 16 samples, we've found it's more effective to use additional blur passes to smooth away the noise pattern, at the expense of some loss of definition around depth discontinuities in the image.

Transparency

It should be noted that the depth buffer can only contain one depth value per pixel, and thus transparencies cannot be fully supported. This is generally a problem with all algorithms that rely on screen-space depth information. There is no easy solution to this, and the SSAO process itself is intensive enough that dealing with edge cases can push the algorithm outside of the real-time realm. In practice, for the vast majority of scenes, correct ambient occlusion for transparencies is a luxury that can be skimped on. Very transparent objects will typically be barely visible either way. For transparent objects that are nearly opaque, the choice can be given to the artist to allow some transparencies to write to the depth buffer input to the SSAO algorithm (not the z-buffer used for hidden surface removal), overriding opaque objects behind them.

Figure 1.2.10 Combined small- and large-area SSAO result.

Final Results

Color Plate 1 shows some results portraying what the algorithm contributes in its final form. The top-left pane shows lighting without the ambient occlusion, while the top-right pane shows lighting with the SSAO component mixed in. The final colored result is shown in the bottom pane. Here the SSAO samples are very wide, bathing the background area with an effect that would otherwise only be obtained with a full global illumination algorithm. The SSAO term adds depth to the scene and helps anchor the characters within the environment.
Color Plate 2 shows the contrast between the large-area, low-contrast SSAO sampling component on the bar surface and background and the tighter, higher-contrast SSAO samples apparent within the helmet, nooks, and crannies found on the character's spacesuit.

Conclusion

This gem has described the Screen Space Ambient Occlusion technique used at Blizzard and presented various problems and solutions that arise. Screen Space Ambient Occlusion offers a different perspective in achieving results that closely resemble what the eye expects from ambient occlusion. The technique is reasonably simple to implement and amenable to artistic tweaks in real time, making it ideal to fit an artistic vision.

References

[Bavoil] Bavoil, Louis and Miguel Sainz. "Image-Space Horizon-Based Ambient Occlusion." ShaderX7: Advanced Rendering Techniques. Ed. Wolfgang F. Engel. Boston: Charles River Media, 2009. Section 6.2.
[Bavoil09] Bavoil, Louis and Miguel Sainz. "Multi-Layer Dual-Resolution Screen-Space Ambient Occlusion." 2009. NVIDIA.
[Bavoil08] Bavoil, Louis and Miguel Sainz. "Screen Space Ambient Occlusion." Sept. 2008. NVIDIA.
[Fox08] Fox, Megan. "Ambient Occlusive Crease Shading." Game Developer. March 2008.
[Kajalin] Kajalin, Vladimir. "Screen Space Ambient Occlusion." ShaderX7: Advanced Rendering Techniques. Ed. Wolfgang F. Engel. Boston: Charles River Media, 2009. Section 6.1.
[Lajzer] Lajzer, Brett and Dan Nottingham. "Combining Screen-Space Ambient Occlusion and Cartoon Rendering on Graphics Hardware." n.d.
[Luft06] Luft, Thomas, Carsten Colditz, and Oliver Deussen. "Image Enhancement by Unsharp Masking the Depth Buffer." Course on Non-Photorealistic Rendering. SIGGRAPH 2006. Boston Convention and Exhibition Center, Boston, MA. 3 August 2006.
[Mittring07] Mittring, Martin.
"Finding Next Gen—CryEngine 2.0." Course on Advanced Real-Time Rendering in 3D Graphics and Games. SIGGRAPH 2007. San Diego Convention Center, San Diego, CA. 8 August 2007.
[Pesce] Pesce, Angelo. "Variance Methods for Screen-Space Ambient Occlusion." ShaderX7: Advanced Rendering Techniques. Ed. Wolfgang F. Engel. Boston: Charles River Media, 2009. Section 6.7.
[Ritschel09] Ritschel, Tobias, Thorsten Grosch, and Hans-Peter Seidel. "Approximating Dynamic Global Illumination in Image Space." 2009. Max Planck Institut Informatik.
[Sains08] Sainz, Miguel. "Real-Time Depth Buffer Based Ambient Occlusion." Game Developers Conference. Moscone Center, San Francisco, CA. 18–22 February 2008.
[Shamugan07] Shanmugam, Perumaal and Okan Arikan. "Hardware Accelerated Ambient Occlusion Techniques on GPUs." 2007.
[Sloan07] Sloan, Peter-Pike, Naga K. Govindaraju, Derek Nowrouzezahrai, and John Snyder. "Image-Based Proxy Accumulation for Real-Time Soft Global Illumination." Pacific Graphics Conference. The Royal Lahaina Resort, Maui, Hawaii. 29 October 2007.
[Tomasi98] Tomasi, Carlo and Roberto Manduchi. "Bilateral Filtering for Gray and Color Images." IEEE International Conference on Computer Vision. Homi Bhabha Auditorium, Bombay, India. 7 January 1998.

1.3 Multi-Resolution Deferred Shading
Hyunwoo Ki, INNOACE Co., Ltd

Recently, deferred shading has become a popular rendering technique for real-time games. Deferred shading enables game engines to handle many local lights without repeated geometry processing because it replaces geometry processing with pixel processing [Saito90, Shishkovtsov05, Valient07, Koonce07, Engel09, Kircher09]. In other words, shading costs are independent of geometric complexity, which is important as the CPU cost of scene-graph traversal and the GPU cost of geometry processing grows with scene complexity.
Despite this decoupling of shading cost from geometric complexity, we still seek to optimize the pixel processing necessary to handle many local lights, soft shadows, and other per-pixel effects. In this gem, we present a technique that we call multi-resolution deferred shading, which provides adaptive sub-sampling using a hierarchical approach to shading by exploiting spatial coherence of the scene. Multi-resolution deferred shading efficiently reduces pixel shading costs as compared to traditional deferred shading without noticeable aliasing. As shown in Figure 1.3.1, our technique allows us to achieve a significant improvement in performance with negligible visual degradation relative to a more expensive full-resolution deferred shading approach.

Deferred Shading

Unlike traditional forward rendering approaches, deferred shading costs are independent of scene complexity. This is because deferred shading techniques store geometry information in textures, often called G-buffers, replacing geometry processing with pixel processing [Saito90, Shishkovtsov05, Valient07, Koonce07]. Deferred shading techniques start by rendering the scene into a G-buffer, which is typically implemented using multiple render targets to store geometry information, such as positions, normals, and other quantities instead of final shading results. Next, deferred shading systems render a screen-aligned quad to invoke a pixel shader at all pixels in the output image. The pixel shader retrieves the geometry information from the G-buffer and performs shading operations as a post process. Naturally, one must carefully choose the data formats and precise quantities to store in a G-buffer in order to make the best possible use of both memory and memory bandwidth.
For example, the game Killzone 2 utilizes four buffers containing lighting accumulation and intensity, normal XY in 16-bit floating-point format, motion vector XY, specular and diffuse albedo, and sun occlusion [Valient07]. The Z component of the normal is computed from normal XY, and position is computed from depth and pixel coordinates. These types of encodings are a tradeoff between decode/encode cost and the memory and memory bandwidth consumed by the G-buffer. As shown in Color Plate 3, we simply use two four-channel buffers of 16-bit floating-point precision per channel without any advanced encoding schemes, for ease of description and implementation. The first of our buffers contains the view-space position in the RGB channels and a material ID in the alpha channel. The other buffer contains the view-space normal in the RGB channels and depth in the alpha channel.

Figure 1.3.1 Deferred shading (left: 20 fps), multi-resolution deferred shading (center: 38 fps), and their difference image (right). There are 40 spot lights, including fuzzy shadows (1024×1024 pixels with 24 shadow samples per pixel).

We could also use material buffers that store diffuse reflectance, specular reflectance, shininess, and so on. However, material buffers are not necessary if we separate the lighting and material phases from the shading phase using light pre-pass rendering [Engel09]. Unlike traditional deferred shading, light pre-pass rendering first computes lighting results instead of full shading. This method can then incorporate material properties in an additional material phase with forward rendering. Although this technique requires a second geometry rendering pass, such separation of lighting and material phases gives added flexibility during material shading and is compatible with hardware multi-sample anti-aliasing. A related technique, inferred lighting, stores lighting results in a single low-resolution buffer instead of the full-resolution buffer [Kircher09].
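As a concrete illustration of the decode side of such encodings, a Killzone-style layout that stores only normal XY must reconstruct Z at read time. A minimal CPU-side sketch (the function name is ours, and it assumes view-space normals face the camera so Z is non-negative):

```python
import math

def decode_normal(nx, ny):
    # Reconstruct the view-space normal Z from the stored XY components,
    # assuming a camera-facing normal (z >= 0). The clamp guards against
    # precision errors pushing 1 - x^2 - y^2 slightly below zero.
    nz = math.sqrt(max(0.0, 1.0 - nx * nx - ny * ny))
    return (nx, ny, nz)
```

The extra ALU work per fetch is the price paid for halving the normal's storage, the decode/encode-cost-versus-bandwidth tradeoff described above.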
To avoid discontinuity problems, inferred lighting filters edges using depth and object ID comparison in the material phase. As we will describe in the next section, our technique is similar to inferred lighting, but our method finds discontinuous areas based on spatial proximity and then solves the discontinuity problems using a multi-resolution approach during the lighting (or shading) phase.

Multi-Resolution Deferred Shading

Although deferred shading improves lighting efficiency, computing illumination for every pixel is still expensive, despite the fact that it is often fairly low frequency. We have developed a multi-resolution deferred shading approach to exploit the low-frequency nature of illumination. We perform lighting in a lower-resolution buffer for spatially coherent areas and then interpolate the results into a higher-resolution buffer. This key concept is based upon our prior work [Ki07a]. Here, we generalize this work and improve upon it to reduce aliasing.

The algorithm has three steps, as shown in Color Plate 4: a geometry pass, a multi-resolution rendering pass, and a composite pass. The geometry pass populates the G-buffers. Our technique is compatible with any sort of G-buffer organization, but for ease of explanation, we will stick with the 8-channel G-buffer layout described previously. The next step is multi-resolution rendering, which consists of resolution selection (non-edge detection), shading (lighting), and interpolation (up-sampling). We allocate buffers to store rendering results at various resolutions. We call these buffers R-buffers, where the "R" stands for "Result" or "Resolution." In this chapter, we will use three R-buffers: full resolution, quarter resolution, and 1/16th resolution (for example, 1280×1024, 640×512, and 320×256). If the full-resolution image is especially large, we could choose to decrease the resolutions of the R-buffers even more drastically than just one-quarter resolution in each step.
Multi-resolution rendering iterates from lower-resolution to higher-resolution R-buffers. We prevent repeated pixel processing by exploiting early-Z culling to skip pixels processed in earlier iterations at lower-resolution R-buffers [Mitchell04]. To start shading our R-buffers, we set the lowest-resolution R-buffer as the current render target and clear its depth buffer to one (farthest). Next, we determine the pixels to be rendered at this resolution by rendering a screen-aligned quad at depth Zi = 1.0 - i * 0.1, where i is the current iteration, writing only depth. During this pass, the pixel shader reads geometry information from mip-mapped versions of our G-buffers and estimates spatial proximity for non-edge detection. To estimate spatial proximity, we first compare the current pixel's material ID with the material IDs of neighboring pixels. Then, we compare the differences in normal and depth values using tunable thresholds. If spatial proximity is low for the current pixel, we should use a higher-resolution R-buffer for better quality, and thus we discard the current pixel in the shader to skip writing Z. After this pass, pixels whose spatial proximity is high (in other words, non-edge) in the current resolution contain meaningful Z values because they were not discarded. The pixels whose spatial proximity is low (in other words, edges) still have the farthest Z values left over from the initial clear.

We then perform shading (or lighting) by rendering a screen-aligned quad at Zi = 1.0 - i * 0.1 again, but with the Z function changed to Equal. This means that only spatially coherent pixels at this resolution will pass the Z-test, as illustrated in Color Plate 4. In the pixel shader, we read geometric data from the G-buffers and compute illumination as in light pre-pass rendering.
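The resolution-selection test above can be sketched on the CPU as follows; the threshold defaults stand in for the tunable thresholds the text mentions, and the tuple layout is our own:

```python
def is_non_edge(center, neighbors, depth_eps=0.01, normal_eps=0.1):
    # Spatial-proximity estimate used for resolution selection.
    # center and neighbors are (material_id, normal, depth) tuples read
    # from the (mip-mapped) G-buffer; threshold values are illustrative.
    mat0, n0, d0 = center
    for mat, n, d in neighbors:
        if mat != mat0:
            return False                        # different material: edge
        if abs(d - d0) > depth_eps:
            return False                        # depth discontinuity: edge
        if sum(a * b for a, b in zip(n, n0)) < 1.0 - normal_eps:
            return False                        # normals diverge: edge
    return True                                 # coherent: shade here
```

A pixel for which this returns True keeps its Z write and is shaded at the current (coarser) resolution; otherwise it is discarded, leaving it to a finer iteration.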
On a textured surface, such as a wall or floor, although the spatial proximity between neighboring pixels is high, the pixel colors are often different. Such cases can cause serious aliasing in the resulting images. To solve this problem, we store only lighting results instead of full shading results in the R-buffers, and we combine material properties with the stored illumination in the R-buffers during the composite pass.

After shading, we copy the current shading/lighting results and depth to the next higher-resolution R-buffer, allowing the hardware's bilinear units to do a simple interpolation as we up-sample. We have found that bilinear filtering is adequate, though we could use bi-cubic filtering or other higher-order filtering for better quality. We repeat the process described above at the next higher resolution, estimating spatial proximity, writing Z, and computing illumination until we reach the full-resolution R-buffer. A full-screen quad is drawn three times per iteration. If a given pixel was shaded on a prior iteration in a lower-resolution R-buffer, that pixel is not shaded again at the higher resolution due to early-Z culling. In this way, we are able to perform our screen-space shading operations at the appropriate resolution for different regions of the screen. In Figure 1.3.2, we visualize the distribution of pixels shaded at each level of our hierarchy.

Because this approach exploits image scaling from low resolution to high resolution with interpolation, discontinuity artifacts can appear at the boundaries of lighting or shadows. We address this issue during the multi-resolution rendering phase. We write 1.0 to the alpha channel of R-buffer pixels that are lit; otherwise, we write zero. If pixels are lit by the same lights (or the same number of lights), their neighbors' alpha values will be equal. Therefore, we interpolate these pixels to a higher-resolution buffer.
Otherwise, we consider these pixels to be on a boundary, and thus we discard them in the interpolation pass (see Figure 1.3.3). We can handle shadow boundaries similarly. If the shadow color is neither zero nor one (in other words, penumbra), we also set the pixel alpha to zero and thus discard it in the interpolation work.

Figure 1.3.2 Visualization of hierarchical pixel processing. Non-black pixels were shaded in the first pass at 1/16th resolution as in the image on the left. The middle image shows the pixels shaded in the second iteration at one-quarter resolution, and only the pixels in the image on the right were shaded at full image resolution.

Figure 1.3.3 A boundary-check algorithm. If a pixel is lit by a light, we add one to this pixel's alpha in the lighting phase. In the interpolation pass, we consider pixels whose neighbors' alpha values differ from their own to be on a boundary, and for those we use the higher-resolution buffer without interpolation.

In the composite pass, we render a screen-aligned quad, reading shading results from the full-resolution R-buffer and material properties such as albedo to compute the final shading result. We could draw scene geometry instead of drawing a screen quad for MSAA, similar to light pre-pass rendering.

In contrast to traditional deferred shading and light pre-pass rendering, multi-resolution deferred shading reduces rendering costs for low-frequency pixels. Our multi-resolution deferred shading is also more efficient than inferred lighting due to its hierarchical approach. Multi-resolution deferred shading can also be used with other rendering techniques, such as the GPU-based light clustering technique for diffuse interreflection and subsurface light diffusion called Light Pyramids [Ki08]. The Light Pyramids technique stores first-bounce lights in shadow maps and groups them by considering their angular and spatial similarity.
Although such light clustering dramatically reduces the number of lights, it still requires hundreds of lights for each pixel. Figure 1.3.4 shows an example of a combination of Light Pyramids and multi-resolution deferred shading. Thanks to our pixel clustering, we achieved a performance improvement of approximately 1.5 to 2.0 times without noticeable quality loss. As pixel processing increases in complexity (for example, using higher resolution or more lights), the relative performance improvement also increases.

Figure 1.3.4 Indirect illumination using Light Pyramids [Ki08] based on traditional deferred shading (left) and multi-resolution deferred shading (right: 1.7 times faster).

Conclusion and Future Work

We have presented a multi-resolution deferred shading technique that performs lighting and shading computations at the appropriate screen-space frequency in order to improve the efficiency of deferred shading without aliasing. In the future, we would like to develop even more efficient resolution-selection algorithms, and we also seek to handle a wider variety of surface reflection models. We also hope to integrate the transparent rendering of inferred lighting into our method. We believe that our method could be applied not only to lighting but also to other rendering operations with high per-pixel overhead, such as per-pixel displacement mapping [Ki07b].

References

[Engel09] Engel, Wolfgang. "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer." ShaderX7: Advanced Rendering Techniques. Ed. Wolfgang F. Engel. Boston: Charles River Media, 2009. 655–666.
[Ki08] Ki, Hyunwoo. "A GPU-Based Light Hierarchy for Real-Time Approximate Illumination." The Visual Computer 24.7–9 (July 2008): 649–658.
[Ki07a] Ki, Hyunwoo. "Hierarchical Rendering Techniques for Real-Time Approximate Illumination on Programmable Graphics Hardware." Master's Thesis. Soongsil University, 2007.
[Ki07b] Ki, Hyunwoo and Kyoungsu Oh.
"Accurate Per-Pixel Displacement Mapping Using a Pyramid Structure." 2007.
[Kircher09] Kircher, Scott and Alan Lawrance. "Inferred Lighting: Fast Dynamic Lighting and Shadows for Opaque and Translucent Objects." Course on 3D and the Cinematic in Games. SIGGRAPH 2009. Ernest N. Morial Convention Center, New Orleans, LA. 6 August 2009.
[Koonce07] Koonce, Rusty. "Deferred Shading in Tabula Rasa." GPU Gems 3. Ed. Hubert Nguyen. Kendallville, KY: Addison-Wesley, 2007. 429–458.
[Mitchell04] Mitchell, Jason and Pedro Sander. "Applications of Explicit Early-Z Culling." Course on Real-Time Shading. SIGGRAPH 2004. Los Angeles Convention Center, Los Angeles, CA. 8 August 2004.
[Saito90] Saito, Takafumi and Tokiichiro Takahashi. "Comprehensible Rendering of 3-D Shapes." ACM SIGGRAPH Computer Graphics 24.4 (August 1990): 197–206.
[Shishkovtsov05] Shishkovtsov, Oles. "Deferred Shading in S.T.A.L.K.E.R." GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Ed. Matt Pharr. Kendallville, KY: Addison-Wesley, 2005. 143–166.
[Valient07] Valient, Michal. "Deferred Rendering in Killzone 2." Develop Conference 2007. Brighton Hilton Metropole, Brighton, England, UK. 25 July 2007.

1.4 View Frustum Culling of Catmull-Clark Patches in DirectX 11
Rahul P. Sathe, Advanced Visual Computing, Intel Corp

DirectX 11 has introduced hardware tessellation in order to enable high geometric detail without increasing memory usage or memory bandwidth demands. Higher-order surface patches with displacements are of prime interest to game developers, and we would like to render them as efficiently as possible. For example, we would like to cull subdivision surface patches (instead of the resulting triangles) that will not affect the final image. Culling a given patch avoids higher-order surface evaluation of domain points in that patch as well as processing of the triangles generated for the patch.
The nature of higher-order surface patches coupled with displacements and animation makes the process of culling them non-trivial, since the exact geometric bounds are not known until well after the opportunity to cull a given patch. In this chapter, we will present an algorithm that evaluates conservative bounding boxes for displaced approximate Catmull-Clark subdivision surface patches at run time, allowing us to perform view frustum culling on the patches. With this method, we achieve a performance improvement with minimal overhead.

Background

Before describing our culling strategy, we must review the fundamentals of Catmull-Clark subdivision surfaces, displacement mapping, and the methods that are currently in use to approximate Catmull-Clark subdivision surfaces on DirectX 11.

Displaced Subdivision Surfaces and Catmull-Clark Surfaces

Catmull-Clark subdivision surfaces have become an increasingly popular modeling primitive and have been extensively used in offline rendering [DeRose98]. In general, subdivision surfaces can be described as recursive refinement of a polygonal mesh. Starting with a coarse polygonal mesh M0, one can introduce new vertices along the edges and faces and update the connectivity to get a mesh M1, and repeat this process to get meshes M2, M3, and so on. In the limit, this process approaches a smooth surface S. This smooth surface S is called the subdivision limit surface, and the original mesh M0 is often referred to as the control mesh.

The control mesh consists of vertices connected to each other to form edges and faces. The number of other vertices that a given vertex is connected to directly by shared edges is called the valence of a vertex. In the realm of Catmull-Clark subdivision surfaces, a vertex is called a regular or ordinary vertex if it has a valence of four. If the valences of all of the vertices of a given quad are four, then that quad is called an ordinary quad or an ordinary patch.
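Valence is straightforward to compute from the control mesh's connectivity; a small sketch (the quads-indexing-a-vertex-array representation is our own):

```python
from collections import defaultdict

def vertex_valences(quads):
    # Valence = number of distinct vertices connected to a vertex by a
    # shared edge. Each quad (a, b, c, d) contributes its four edges.
    neighbors = defaultdict(set)
    for quad in quads:
        for i in range(4):
            a, b = quad[i], quad[(i + 1) % 4]
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {v: len(adj) for v, adj in neighbors.items()}
```

For a 3×3 grid of vertices (indices 0 through 8) forming four quads, the interior vertex has valence four (ordinary), while corner vertices have valence two.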
The faces that have at least one vertex that is not valence four are called extraordinary faces (or patches).

Approximate Catmull-Clark Subdivision Surfaces

Recently, Loop and Schaefer introduced a hardware-friendly method of rendering Approximate Catmull-Clark (ACC) subdivision surfaces, which maps very naturally to the DirectX 11 pipeline [Loop08]. At its core, the ACC scheme maps each quadrilateral from the original control mesh to a bi-cubic Bezier patch. Loop and Schaefer show that, for ordinary patches, the bi-cubic Bezier corresponds exactly to the Catmull-Clark limit surface. Extraordinary patches do not correspond exactly to the limit surface, but Loop and Schaefer decouple the patch description for position attributes and normal attributes in order to reduce the visual impact of the resulting discontinuities. To do this, for extraordinary patches, ACC generates separate normal and bi-tangent patches in order to impose GN continuity at patch boundaries. The word "approximate" in ACC has its roots in the fact that these extraordinary patches are GN continuous, and this GN continuity only guarantees the same direction of partial derivatives, but not the magnitudes, across the patch boundaries. The ACC scheme describes the normals and bi-tangents using additional Bezier patches, which results in a continuous normal field even across edges of extraordinary patches.

Displacement

Although it is very empowering to be able to generate smooth surfaces from polygonal meshes procedurally, such smooth surfaces are rarely encountered in real life and lack realism without additional high-frequency geometric detail. This is where displacement maps come into the picture. Displacement maps are simply textures that can be used to store geometric perturbations from a smooth surface. Although normal maps and displacement maps have the similar effect of adding high-frequency detail, the difference is notable around the silhouettes of objects.
A normal mapped object's silhouette lacks geometric detail because only per-pixel normals are perturbed and not the underlying geometry, as illustrated in Figure 1.4.1. To add this high-frequency detail, displacement maps can be applied to subdivision surfaces.

DirectX 11 Pipeline

DirectX 11 has introduced three new stages to the graphics pipeline to enable dynamic on-chip tessellation, as shown in Figure 1.4.4. The two new programmable pipeline stages are the hull shader and the domain shader. Between these two programmable stages lies a new fixed-function stage, the tessellator. Fortunately for us, ACC and Direct3D 11 were designed with each other in mind, and there is a natural mapping of the ACC algorithm onto the Direct3D 11 pipeline.

Hull Shader

As illustrated in Figure 1.4.4, the new hull shader stage follows the traditional vertex shader. In a typical implementation of ACC on Direct3D 11, the vertex shader is responsible for performing animation of the control mesh vertices. In the hull shader, each quadrilateral's four vertices and its one-ring neighborhood are gathered from the output of the vertex shader. These vertices are used to define the control points of a bi-cubic Bezier patch. This basis conversion process that generates the Bezier patch control points is SIMD friendly, and every output control point can be calculated independently of the others. In order to exploit this opportunity for parallelism, this control point phase of the hull shader is invoked once per control point. In the case of ACC, the basis conversion process depends on the topology of the incoming patch, but the output control points are always a 4×4 Bezier control mesh. Please refer to the sample code on the CD.

Figure 1.4.1 Normal mapping versus displacement mapping.
In addition to the computation of the Bezier control points, the hull shader can optionally calculate edge tessellation factors in order to manage level of detail. One can assign arbitrary tessellation factors to the edges of a patch (within some constraints defined by the DirectX 11 tessellator specifications). Because the hull shader is programmable, one can choose any metric to calculate edge tessellation factors. Typical metrics may include screen space projection, proximity to silhouette, luminosity reaching the patch, and so on. The calculation of each edge tessellation factor is typically independent of the others, and hence the edge tessellation factors can also be computed in parallel in a separate phase of the hull shader called the fork phase. The final stage of the hull shader is called the join phase (or patch constant phase) and is a phase in which the shader can efficiently compute data that is constant for the entire patch. This stage is of most interest to us in this chapter.

Figure 1.4.2 Basis conversion for an irregular patch.

Tessellator

The tessellator accepts edge LODs of a patch and other tessellator-specific states that control how it generates domain locations and connectivity. Some of these states include patch topology (quad, tri, or isoline), inside reduction function (how to calculate inner tessellation factor(s) using outer tessellation factors), one-axis versus two-axis reduction (whether to reduce only one inner tessellation factor or two—one per domain axis), and scale (how much to scale inner LOD). The tessellator feeds domain values to the domain shader and connectivity information to the rest of the pipeline via the geometry shader.

Domain Shader

In the case of quadrilateral patch rendering, the domain shader is invoked at domain values (u,v) determined by the tessellator. (In the case of triangular patches, the barycentric coordinates (u,v,w), where w = 1 – u – v, are used.) Naturally, the domain shader
has access to output control points from the hull shader. Typically, the domain shader evaluates a higher-order surface at these domain locations using the control points provided by the hull shader as the basis. After evaluating the surface, the domain shader can perform arbitrary operations on the surface position, such as displacing the geometry using a displacement map. In ACC, we evaluate position using bi-cubic polynomials for a given (u,v). Our domain shader interpolates texture coordinates (s,t) from the four vertices using bilinear interpolation to generate the texture coordinates for the given (u,v). We also optionally sample a displacement map at these interpolated texture coordinates. As mentioned earlier, normal calculation is different for ordinary and extraordinary patches. For ordinary patches, we just calculate d/du and d/dv of the position and take the cross-product. For extraordinary patches, we evaluate tangent and bi-tangent patches separately and take their cross-product.

Culling

The mapping of ACC to the DirectX 11 pipeline that we have described allows us to render smooth surfaces with adaptive tessellation and displacement mapping, resulting in a compelling visual quality improvement while maintaining a modest memory footprint. At the end of the day, however, we are still rendering triangles, and the remaining stages of the graphics pipeline are largely unchanged, including the hardware stages that perform triangle setup and culling. This means that we perform vertex shading, hull shading, tessellation, and domain shading of all patches submitted to the graphics pipeline, including those patches that are completely outside of the view frustum. Clearly, this provides an opportunity for optimization. The main contribution of this chapter is a method for frustum culling patches early in the pipeline in order to avoid unnecessary computations.
Of course, we must account for mesh animation and displacement, both of which deform a given patch in a way that complicates culling. An elegant generalized solution to surface patch culling has been proposed by Hasselgren et al., who generate culling shaders by analyzing domain shaders using Taylor arithmetic [Hasselgren09]. This chapter presents a simplified version of the ideas discussed in their work to cull approximate Catmull-Clark patches against the view frustum.

Pre-Processing Step

We perform a pre-processing step on a given control mesh and displacement map in order to find the maximum displacement for each patch. Please note that although the positions are evaluated as bi-cubic polynomials using the new basis, the texture coordinates for those points are the result of bilinear interpolation of the texture coordinates of the corners. This is due to the fact that the local (per-patch) uv-parameterization used to describe the Catmull-Clark surface and the global uv-parameterization done while creating the displacement map are linearly dependent on each other. Figure 1.4.3 shows one such patch. This linear dependence means that the straight lines u=0, v=0, u=1, and v=1 in the patch parameterization are also straight lines in the global parameterization. Due to this linear relationship, we know the exact area in the displacement map from which the displacements will be sampled in the domain shader for that patch. The maximum displacement in the given patch can be found by calculating the maximum displacement in the region confined by the patch boundaries in the displacement map. Even if the displacement map stores vector-valued displacements, the mapping is still linear, so we can still find the magnitude of the maximum displacement for a given patch. Based on this, we can create a buffer for the entire mesh that stores this maximum displacement per patch.
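A minimal sketch of this pre-processing pass follows. For simplicity it assumes a scalar displacement map and that the patch's footprint in the map is an axis-aligned (s,t) rectangle; the function and type names are illustrative, not the chapter's actual tool code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Patch bounds in the global (s,t) displacement map parameterization.
struct PatchUVRect { float s0, t0, s1, t1; };

float maxDisplacementForPatch(const std::vector<float>& dispMap,  // scalar displacements
                              int width, int height,
                              const PatchUVRect& r) {
    // Conservatively expand to whole texels covering the patch footprint.
    int x0 = std::max(0, (int)std::floor(r.s0 * (width - 1)));
    int x1 = std::min(width - 1, (int)std::ceil(r.s1 * (width - 1)));
    int y0 = std::max(0, (int)std::floor(r.t0 * (height - 1)));
    int y1 = std::min(height - 1, (int)std::ceil(r.t1 * (height - 1)));
    float maxDisp = 0.0f;
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            maxDisp = std::max(maxDisp, std::fabs(dispMap[y * width + x]));
    return maxDisp;  // per-patch value written to the buffer read at run time
}
```

For vector-valued displacements, the inner loop would take the maximum Euclidean length of the displacement vectors instead of the absolute value.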
Run-Time Step

At run time, the patch vertices of the control mesh go through the vertex shader, which animates the control mesh. The hull shader then operates on each quad patch, performing the basis transformation to Bezier control points. One convenient property of Bezier patches is that they always stay within the convex hull of the control mesh defining the patch. Using the maximum displacement computed previously, we can move the convex hull planes of a given patch outward by the maximum displacement, resulting in conservative bounds suitable for culling a given patch. Although moving the convex hull planes out by the max displacement may give tighter bounds compared to an axis-aligned bounding box (AABB) for the control mesh, calculating the corner points can be tricky because it requires calculation of plane intersections. It is simpler and more efficient to compute an AABB of the control mesh and offset the AABB planes by the maximum displacement.

Figure 1.4.3 Mapping between the global (s,t) and local (u,v) parameterizations is linear. The figure on the left shows the (u,v) parameterization that is used for patch evaluation. The figure on the right shows the global parameterization (s,t) that was used while unwrapping the original mesh. Bold lines correspond to the u=0, v=0, u=1, and v=1 lines in the figure on the left.

In Figure 1.4.5, we show a 2D representation of this process for illustration. Dotted black lines represent the basis-converted Bezier control mesh. The actual Bezier curve is shown in bold black, displacements along the curve normal (scalar-valued displacements) are shown in solid gray, and the maximum displacement for this curve segment is denoted as d. An AABB for the Bezier curve is shown in dashed lines (the inner bounding box), and the conservative AABB that takes displacements into account is shown in dashed and dotted lines (the outer bounding box).
Figure 1.4.4 The DirectX 11 pipeline. Normally, triangles get culled after primitive assembly, just before rasterization. The proposed scheme culls the patches in the hull shader, and all the associated triangles from that patch get culled as a result, freeing up compute resources.

As you can see, the corners of the inner and outer enclosures are more than d distance apart, so we are being more conservative than we need to be for the ease and speed of computation. At this point, we have a conservative patch AABB that takes displacements into account. If the AABB for a patch is outside the view frustum, we know that the entire patch is outside the view frustum and can be safely culled. If we make the view frustum's plane equations available as shader constants, then our shader can test the AABB against the view frustum using in-out tests. Alternatively, one can transform the AABB into normalized device coordinates (NDC), and the in-out tests can be done in NDC space. In-out tests in NDC space are easier than world space tests because they involve comparing only with +1 or –1. If the AABB is outside the view frustum, we set the edge LODs for that patch to be negative, which indicates to the graphics hardware that the patch should be culled. We perform the culling test during the join phase (a.k.a. patch constant phase) of the hull shader because this operation only needs to be performed once per patch.

Performance

For each culled patch, we eliminate unnecessary tessellator and domain shader work for that patch. All patches, whether or not they're culled, take on the additional computational burden of computing the conservative AABB and testing against the view frustum. When most of the character is visible on the screen (for example, Figure 1.4.9 (a)), culling overhead is at its worst. Figure 1.4.6 shows that, even in this case, culling overhead is minimal and is seen only at very low levels of tessellation.
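The world-space in-out test described above can be sketched as an AABB-versus-plane rejection test. This C++ sketch assumes six frustum planes stored as (a,b,c,d) with normals pointing into the frustum; it mirrors the shader-constant variant, not the NDC variant:

```cpp
// Plane equation a*x + b*y + c*z + d >= 0 holds for points inside the frustum.
struct Plane { float a, b, c, d; };

// Returns true when the AABB lies entirely on the negative side of some
// plane, i.e. the patch can be safely culled.
bool aabbOutsideFrustum(const Plane planes[6],
                        float mnx, float mny, float mnz,
                        float mxx, float mxy, float mxz) {
    for (int i = 0; i < 6; ++i) {
        const Plane& p = planes[i];
        // Pick the AABB corner farthest along the plane normal (the "p-vertex");
        // if even that corner is outside this plane, the whole box is outside.
        float x = (p.a >= 0.0f) ? mxx : mnx;
        float y = (p.b >= 0.0f) ? mxy : mny;
        float z = (p.c >= 0.0f) ? mxz : mnz;
        if (p.a * x + p.b * y + p.c * z + p.d < 0.0f)
            return true;  // culled: the hull shader would emit negative edge LODs
    }
    return false;
}
```

In the hull shader this boolean decides whether the join phase writes negative edge tessellation factors, which tells the hardware to discard the patch.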
At LOD=3, the gains due to culling a very small number of patches (around the character's feet) start offsetting the cycles spent on culling tests.

Figure 1.4.5 Conservative AABB for a displaced Bezier curve. The Bezier curve is shown in bold black, the control mesh in dotted lines, and displacements in solid gray lines. The AABB for the Bezier curve without displacements is shown in dashed lines (inner bounding box), and the conservative AABB for the displaced Bezier curve is shown in dashed and dotted lines (outer bounding box).

When about half of the patches in our test model are outside of the view frustum (see Figure 1.4.9 (b)), the overhead of the AABB computations is offset by the gains from culling the offscreen patches. The gains from culling patches are more noticeable at higher levels of tessellation. This is shown graphically in Figures 1.4.7 and 1.4.8. Figure 1.4.7 shows how fps changes with the edge tessellation factor (edge LOD) when about half of the patches are culled. As you can see, at moderate levels of tessellation, we strike a balance between the benefits of the proposed algorithm and increased level of detail. Figure 1.4.8 shows the same data as a percentage speed-up. We performed all our tests on the ATI Radeon 5870 card with 1 GB of GDDR memory.

The benefits of this algorithm increase with domain shader complexity and tessellation level, whereas the per-patch overhead of the culling tests remains constant. It is easy to imagine an application strategy that first tests an object's bounding box against the frustum to determine whether patch culling should be performed at all for a given object, thus avoiding the culling overhead for objects that are known to be mostly onscreen.

Figure 1.4.6 Culling overhead is the worst when nothing gets culled. Culling overhead is minimal except at very low levels of tessellation.
"NO CULL" indicates the fps measured when no culling code was running. "CULL Overhead" shows the fps measured when the culling code was running in the patch constant phase of the hull shader.

Figure 1.4.7 Culling benefits go up with the level of tessellation, except at super-high levels of tessellation, where culling patches doesn't help. At moderate levels of tessellation, we get the benefits of the proposed algorithm and still see high geometric detail.

Figure 1.4.8 Culling benefits shown as percentage increase in fps against edge LODs (edge tessellation factor).

Conclusion

We have presented a method for culling Catmull-Clark patches against the view frustum using the DirectX 11 pipeline. Applications will benefit the most from this algorithm at moderate to high levels of tessellation. In the future, we would like to extend this technique to account for occluded and back-facing patches with displacements.

References

[DeRose98] DeRose, Tony, Michael Kass, and Tien Truong. "Subdivision Surfaces in Character Animation." Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1998).

[Hasselgren09] Hasselgren, Jon, Jacob Munkberg, and Tomas Akenine-Möller. "Automatic Pre-Tessellation Culling." ACM Transactions on Graphics 28.2 (April 2009).

[Loop08] Loop, Charles and Scott Schaefer. "Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches." ACM Transactions on Graphics 27.1 (March 2008).

[Microsoft09] Microsoft Corporation. DirectX SDK. August 2009.

[Reif95] Reif, Ulrich. "A Unified Approach to Subdivision Algorithms Near Extraordinary Vertices." Computer Aided Geometric Design 12.2 (March 1995): 153–174.

[Stam98] Stam, Jos. "Exact Evaluation of Catmull-Clark Subdivision Surfaces at Arbitrary Parameter Values." Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1998): 395–404.
[Zorin2000] Zorin, Denis and Peter Schröder. "Subdivision for Modeling and Animation." SIGGRAPH 2000 Course Notes. 85–94.

Figure 1.4.9 Screenshots showing our algorithm in action. We saw about 8.9 fps for the view on the left and 15.1 fps for the view on the right on the ATI Radeon 5870. The increase in frame rate was due to view frustum culling of patches.

1.5 Ambient Occlusion Using DirectX Compute Shader

Jason Zink

Microsoft has recently released DirectX 11, which brings with it significant changes in several of its APIs. Among these new and updated APIs is the latest version of Direct3D. Direct3D 11 provides the ability to perform multi-threaded rendering calls, a shader interface system for providing an abstraction layer to shader code, and the addition of several new programmable shader stages. One of these new shader stages is the compute shader, which provides a significantly more flexible processing paradigm than was available in previous iterations of the Direct3D API. Specifically, the compute shader allows for a controllable threading model, sharing of memory between processing threads, thread synchronization primitives, and several new resource types that allow read/write access to resources. This gem will provide an introduction to the compute shader and its new features. In addition, we will take an in-depth look at a Screen Space Ambient Occlusion (SSAO) algorithm implemented on the compute shader to show how to take advantage of this new processing paradigm. We will examine the SSAO algorithm in detail and provide a sample implementation to demonstrate how the compute shader can work together with the traditional rendering pipeline. Finally, we will wrap up with a discussion of our results and future work.
The Compute Shader

Before we begin to apply the compute shader's capabilities to a particular problem domain, let's take a closer look at the compute shader itself and the general concepts needed to program it.

Overview

The compute shader is a new programmable shader stage that is actually not simply inserted into the traditional rendering pipeline like some of the other new DirectX 11 pipeline stages discussed in Sathe's Gem 1.4. Rather, the compute shader is conceptually a standalone processing element that has access to the majority of the functionality available in the common shader core, but with some important additional functionality. The two most important new mechanics are fine-grained control over how each thread is used in a given shader invocation and new synchronization primitives that allow threads to synchronize. The threads also have read/write access to a common memory pool, which provides the opportunity for threads to share intermediate calculations with one another. These new capabilities are the basic building blocks for advanced algorithms that have yet to be developed, while at the same time allowing traditional algorithms to be implemented in different ways in order to achieve performance improvements.

Compute Shader Threading Model

To use the compute shader, we need to understand its threading model. The main concept is that of a Thread Group. A Thread Group defines the number of threads that will be executing in parallel that will have the ability to communicate with one another. The threads within the Thread Group are conceptually organized in a 3D grid layout, as shown in Figure 1.5.1, with the sizes along each axis of the grid determined by the developer. The choice of the layout provides a simple addressing scheme used in the compute shader code to have each thread perform an operation on a particular portion of the input resources.
When a particular thread is running, it executes the compute shader code and has access to several system value input attributes that uniquely identify the given thread. To actually execute the compute shader, we tell the API to execute a given number of Thread Groups via the Dispatch method, as illustrated in Figure 1.5.2. With these two layout definitions in mind, we can look at how they affect the addressing scheme of the compute shader. The following list of system values is available to the compute shader:

• SV_GroupID. This system value identifies the Thread Group that a thread belongs to with a 3-tuple of zero-based indices.
• SV_GroupThreadID. This system value identifies the thread index within the current Thread Group with a 3-tuple of zero-based indices.
• SV_DispatchThreadID. This system value identifies the current thread identifier over a complete Dispatch call with a 3-tuple of zero-based indices.
• SV_GroupIndex. This system value is a single integer value representing a flat index of the current thread within the group.

Figure 1.5.1 Thread Groups visualized as a 3D volume.

Figure 1.5.2 Visualization of the Dispatch method.

The individual threads running the compute shader have access to these system values and can use the values to determine, for example, which portions of input to use or which output resources to compute. For example, if we wanted a compute shader to perform an operation on each pixel of an input texture, we would define the thread group to be of size (x, y, 1) and call the Dispatch method with a size of (m, n, 1), where x*m is the width of the image and y*n is the height of the image. In this case, the shader code would use the SV_DispatchThreadID system value to determine the location in the input image from which to load data and where the result should be stored in the output image.
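The relationships among these system values can be reproduced on the CPU from the group and thread coordinates; the following C++ sketch illustrates them (the group dimensions correspond to the shader's numthreads declaration; all indices are zero-based):

```cpp
// SV_DispatchThreadID = SV_GroupID * groupDim + SV_GroupThreadID
// SV_GroupIndex       = tz * (dimX * dimY) + ty * dimX + tx
struct Int3 { int x, y, z; };

Int3 dispatchThreadID(Int3 groupID, Int3 groupThreadID, Int3 groupDim) {
    return { groupID.x * groupDim.x + groupThreadID.x,
             groupID.y * groupDim.y + groupThreadID.y,
             groupID.z * groupDim.z + groupThreadID.z };
}

int groupIndex(Int3 groupThreadID, Int3 groupDim) {
    return groupThreadID.z * groupDim.x * groupDim.y
         + groupThreadID.y * groupDim.x
         + groupThreadID.x;
}
```

For the 2D image example above, a thread in group (2,3,0) at local position (4,5,0) with 8×8×1 groups would process pixel (20,29) of the input texture.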
Figure 1.5.3 illustrates one way in which a 2D workload might be partitioned using this method. In this example, we have an image with a size of 32×32 pixels. If we wanted to process the image with a total of 4×4 (m = 4, n = 4) Thread Groups as shown, then we would need to define the Thread Groups to each have 8×8 (x = 8 and y = 8) threads. This gives us the total number of threads needed to process all 32×32 (x*m and y*n) pixels of the input image.

Figure 1.5.3 Visualization of Thread Group distribution for a 2D workload, where the number of Thread Groups (m = 4, n = 4) and the number of threads (x = 8, y = 8) are shown.

Compute Shader Thread Interactions

In addition to providing an easy-to-use thread addressing scheme, the compute shader also allows each Thread Group to declare a block of Group Shared Memory (GSM). This memory is basically defined as an array of variables that are accessible to all of the threads in the Thread Group. The array itself can be composed of any native data types as well as structures, allowing for flexible grouping of data. In practice, the Group Shared Memory is expected to be on-chip register-based memory that should be significantly faster to access than general texture memory, which can have unpredictable performance depending on access patterns. Similar to CPU-based multi-threaded programming, when you have multiple threads reading and writing to the same area of memory, there is the potential that the same memory can be accessed simultaneously by more than one thread. To provide some form of control over the sequence of accesses, the compute shader introduces several atomic functions for thread synchronization. For example, there is an atomic function for adding called InterlockedAdd. This can be used to have all threads perform a test sequence and then use the InterlockedAdd function to increment a variable in the Group Shared Memory to tabulate the overall number of test sequences that produce a particular result.
Another atomic function is the InterlockedCompareExchange function, which compares a shared variable with one argument and sets the variable to a second argument if the variable has the same value as the first argument. This provides the basic building blocks for creating a mutex system in the compute shader, where a shared variable serves as the mutex. Each thread can call this function on the mutex variable and only take action if it is able to update the variable to its own identifier. Since the compute shader is intended to provide massively parallel execution, a mutex is not really a preferred choice, but in some situations it may be a desirable avenue to follow, such as when a single resource must be shared across many threads. The Direct3D 11 documentation can be referenced for a complete list of these atomic functions and how they can be used. Also similar to CPU-based multi-threaded programming is the fact that it is more efficient to design your algorithms to operate in parallel while minimizing the number of times that they must synchronize data with one another. The fastest synchronization operation is the one that you don't have to perform!

Compute Shader Resources

New resource types introduced in Direct3D 11 include Structured Buffers, Byte Address Buffers, and Append/Consume Buffers. Structured Buffers provide what they sound like: 1D buffers of structures available in your shader code. The Byte Address Buffers are similar, except that they are a general block of 32-bit memory elements. The Append/Consume Buffers allow for stack/queue-like access to a resource, allowing the shader to consume the elements of a buffer one at a time and append results to an output buffer one at a time. This should also provide some simplified processing paradigms in which the absolute position of an element is less important than the relative order in which it was added to the buffer.
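The compare-exchange mutex pattern described above has a direct CPU analog, sketched here with std::atomic (the HLSL intrinsic compares a shared variable against a value and swaps in a new one; the sentinel value and function names are illustrative):

```cpp
#include <atomic>

constexpr int kUnlocked = -1;  // sentinel meaning "no thread owns the mutex"

// Try to take the "mutex" by writing our thread identifier into it.
// Succeeds only if the shared variable still holds kUnlocked, just as a
// thread only proceeds when InterlockedCompareExchange swaps in its own ID.
bool tryLock(std::atomic<int>& mutex, int threadId) {
    int expected = kUnlocked;
    return mutex.compare_exchange_strong(expected, threadId);
}

void unlock(std::atomic<int>& mutex) { mutex.store(kUnlocked); }
```

As the text notes, serializing many threads through such a lock defeats the point of a massively parallel machine, so this pattern should be reserved for rare, genuinely exclusive updates.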
To further facilitate the compute shader's parallel-processing capabilities, Direct3D 11 provides a new resource view called an Unordered Access View (UAV). This type of view allows the compute shader (as well as the pixel shader) to have read and write access to a resource, where any thread can access any portion of the resource. This is a big departure from the traditional shader resource access paradigm; typically, a shader can only read from or write to a given resource during a shader invocation, but not both. The UAV can be used to provide random access to both the new and existing resource types, which provides significant freedom in designing the input and output structure of compute shader–based algorithms. With a general understanding of the new capabilities of the compute shader, we can now take a look at a concrete example in order to better understand the details. We will discuss the general concepts of the SSAO algorithm and then describe how we can use the compute shader's features to build an efficient implementation of the technique.

Screen Space Ambient Occlusion

Screen Space Ambient Occlusion is a relatively recently developed technique for approximating global illumination in the ambient lighting term, based solely on the information present in a given frame's depth buffer [Mittring07]. As described in detail in Gem 1.2 by Filion, an approximate amount of ambient light that reaches a given pixel can be computed by sampling the area around the pixel in screen space. This technique provides a convincing approximation to global illumination and performs at a usable speed for high-end applications. The quality of the algorithm depends on the number of samples and subsequent calculations that are performed for each pixel.
In the past few years, a variety of techniques have been proposed to modify the general SSAO algorithm with varying levels of quality versus performance tradeoffs, such as HBAO [Bavoil09a] and SSDO [Ritschel09]. While these new variants of the original algorithm provide improvements in image quality or performance, the basic underlying concepts are shared across all implementations, and hence the compute shader should be applicable in general. We will now review some of these recent SSAO techniques and discuss several areas of the underlying algorithm that can benefit from the compute shader's new capabilities. Then we will look at an implementation that takes advantage of some of these possible improvements.

SSAO Algorithm

Ambient occlusion techniques have been around for some time and have found uses primarily in offline rendering applications [Landis02]. The concept behind these techniques is to utilize the geometric shape of a model to calculate which portions of the model would be more likely to be occluded than others. If a given point on a model is located on a flat surface, it will be less occluded than another point that is located at a fold in the surface. This relationship is based on the following integral for the reflected radiance:

L_r(p, n) = ∫_Ω L_in(ω) (n · ω) dω

In this integral, L_in is the incident radiance from direction ω, and the surface normal vector is n; the integration is performed over the hemisphere Ω above the surface point p. This integral indicates that the amount of light reflected at a given surface point is a function of the incident radiance and the angle at which it reaches that point. If there is nearby geometry blocking some portion of the surface surrounding the surface point, then we can generally conclude that less radiant energy will reach the surface. With this in mind, the ambient lighting term can be modulated by an occlusion factor to approximately represent this geometric relationship. One way to perform this geometric calculation is to project a series of rays from each surface point being tested.
The amount of occlusion is then calculated depending on the number of rays that intersect another part of the model within a given radius of the surface point. This effectively determines how much "background" light can reach that point by performing the inverse operation of the radiance integral described previously. Instead of integrating the incident radiance coming into that point over the surface of a hemisphere, we shoot rays out from the surface point over the hemisphere to test for occlusion within the immediate area. The overall occlusion factor is then calculated by accumulating the ray test results and finding the ratio of occluded rays versus non-occluded rays. Once it is calculated, this occlusion factor is stored either per vertex or per pixel in a texture map and is used to modulate the ambient lighting term of that object when rendered. This produces a rough approximation of global illumination. Figure 1.5.4 demonstrates this ray casting technique.

Figure 1.5.4 Side profile of a ray casting technique for approximating occlusion.

This technique works quite well for static scenes or individual static geometric models, but the pre-computation requirements are not practical for dynamic geometry, such as skinned meshes. Several alternative techniques have been suggested to allow for dynamic ambient occlusion calculations, such as [Bunnell05], which generalizes the geometric object into disks to reduce the computational complexity of the occlusion calculations. This allows real-time operation of the algorithm, but it still requires some pre-processing of the models being rendered to determine where to place the disks in the approximated models. In addition, the cost of performing the occlusion calculation scales with increased scene complexity.
The Screen Space Ambient Occlusion algorithm provides an interesting alternative technique for determining an approximate occlusion value. Instead of computing an occlusion value from the geometric representation of a scene by performing ray casting, the occlusion calculation is delayed until after the scene has been rasterized. Once the scene has been rasterized, an approximate amount of occlusion is determined by inspecting only the contents of the scene's depth buffer: the geometric queries are carried out on the depth buffer instead of on the geometric models. This effectively moves the operation from object space to screen space, which is one of the major benefits of this algorithm. Since it operates at the screen space level, the algorithm's performance is less sensitive to the amount of geometry being rendered and more sensitive to the resolution of the buffers being used. The scene's depth buffer can be obtained by utilizing the actual Z-buffer used during rendering, by performing a separate rendering pass that writes the linear depth to a render target, or by using the depth information from a deferred rendering G-buffer. Regardless of how the buffer is generated, the algorithm performs a processing pass that uses the depth buffer as an input and generates an output texture holding the occlusion values for the entire visible scene. Each pixel of the output is calculated using the depth information within a given radius of its local area, which can be considered an approximation to ambient occlusion. I will refer to this output in the remainder of this document as the occlusion buffer. When the final scene rendering is performed, the occlusion buffer is sampled based on screen space location and used to modulate the ambient term of each object in the final scene.

SSAO Algorithm Details

Screen Space Ambient Occlusion has provided a significant improvement over previous ambient occlusion algorithms.
Because the algorithm runs after a scene is rendered, it focuses the processing time on only the portion of the scene that is visible for the current frame, saving a significant amount of computation and allowing the algorithm to run in real-time applications without pre-computation. However, the use of the depth buffer also introduces a few obstacles to overcome. There is the potential that some occluders will not be visible in the depth buffer if another object is in front of them. Since the depth buffer only records one depth sample per pixel, there is no additional information about the occluders behind the foreground object. This is typically handled by defaulting to zero occlusion if the depth sample read from the depth buffer is too far away from the current pixel being processed. If a more accurate solution is needed, depth peeling can be used to perform multiple occlusion queries, as described in [Bavoil09b]. Additionally, if an object is offscreen but is still occluding an object that is visible onscreen, then that occlusion is not taken into account. This leads to some incorrect occlusion values around the outer edge of the image, but solutions have been proposed to minimize or eliminate these issues. One possibility is to render the depth buffer with a larger field of view than the final rendering so that objects around the perimeter of the viewport are visible to the algorithm [Bavoil09a]. Another issue with the algorithm is that a relatively large number of samples needs to be taken in order to generate a complete representation of the geometry around each pixel. If performance were not a concern, we could sample the entire area around the pixel P in a regular sampling pattern, but in real-time applications this quickly becomes impractical. Instead of a regular sampling pattern, a common solution is to use a sparse sampling kernel to choose sampling points around the current pixel.
This roughly approximates the surrounding area, but the decreased sampling rate may miss some detail. To compensate for the decreased sampling, it is common to use a stochastic sampling technique instead. By varying the sampling kernel shape and/or orientation for each pixel and then sharing the results between neighboring pixels, an approximation to the more expensive regular sampling pattern can be achieved. Since a typical 3D scene is composed of groups of connected triangles, the majority of the contents of the depth buffer will contain roughly similar depth values in neighborhoods of pixels, except at geometric silhouette edges. The variation of the sampling kernel between pixels, in combination with this spatial coherence of the depth buffer, allows us to share a larger combined number of sample results per pixel while reducing the overall number of calculations that need to be performed. This effectively widens the sampling kernel, but it also introduces some additional high-frequency noise into the occlusion buffer. To compensate for this effect, it is common to perform a filtering pass over the entire occlusion buffer that blurs the occlusion values without bleeding across object boundaries. This type of filter is referred to as a bilateral filter; it takes into account both the spatial distance between pixels and the intensity values stored in neighboring pixels when calculating the weights to apply to a sample [Tomasi98]. This allows the filter to remove high-frequency noise while preserving the edges that are present in the occlusion buffer. In addition, the randomization process can be repeated over a small range to facilitate easier filtering later on. Figures 1.5.5 and 1.5.6 show ambient occlusion results before and after bilateral filtering.
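The bilateral weight just described is the product of a spatial Gaussian (pixel distance) and a range Gaussian (difference in stored values), as in [Tomasi98]. This small CPU sketch, with illustrative sigma values rather than the chapter's constants, shows why edges survive the blur: across a strong edge the range term drives the weight toward zero.

```cpp
#include <cassert>
#include <cmath>

// Bilateral filter weight: spatial Gaussian on pixel distance times a
// range Gaussian on the occlusion-value difference. Sigma values are
// illustrative placeholders, not tuned constants.
float BilateralWeight(float pixelDistance, float valueDelta,
                      float sigmaSpatial, float sigmaRange)
{
    float ws = std::exp(-(pixelDistance * pixelDistance) /
                        (2.0f * sigmaSpatial * sigmaSpatial));
    float wr = std::exp(-(valueDelta * valueDelta) /
                        (2.0f * sigmaRange * sigmaRange));
    return ws * wr; // near-zero across a strong edge, so edges are preserved
}
```

At equal spatial distance, a neighbor whose occlusion value differs sharply from the center receives a much smaller weight than one with a similar value.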
Figure 1.5.5 A sample scene rendered without bilateral filtering.

Figure 1.5.6 A sample scene after bilateral filtering.

As mentioned before, the algorithm is performed after rasterization, meaning that its performance is directly related to the screen resolution being used. In fact, this dependency on screen resolution has been exploited to speed up the algorithm, as described in Gem 1.3. The depth buffer and/or the occlusion buffer can be generated at a decreased resolution. If the screen resolution is decreased by a factor of 2 in the x and y directions, there is an overall factor of 4 reduction in the number of occlusion pixels that need to be calculated. Then the occlusion buffer can either be upsampled with a bilateral filter or just used directly at the lower resolution. This strategy can still lead to fairly pleasing results, since the contents of the occlusion buffer are relatively low frequency.

SSAO Meets the Compute Shader

When looking at the block diagram of the SSAO algorithm in Figure 1.5.7, we can begin to compare these high-level operations with the new capabilities of the compute shader to see how we can build a more efficient implementation. We will now go over the steps of the algorithm and discuss potential strategies for mapping them to the compute shader.

Figure 1.5.7 Block diagram of the SSAO algorithm.

Calculation Setup

The first step shown in the block diagram is to initialize the computations for the current pixel. This entails sampling the depth buffer to obtain the pixel's depth. One of the benefits of having a Group Shared Memory that can be shared by all threads in a Thread Group is the possibility of sharing texture samples among the entire Thread Group.
Because the shared memory is supposed to be significantly faster than a direct texture sample, if each thread requests a depth sample to initialize its own calculations, then it can also write that depth value to the shared memory for use later on by other threads. The net effect of every thread in a Thread Group doing this is to have a copy of the complete local depth data in the Group Shared Memory. Later, as each thread begins calculating the relative occlusion against the local area, it can read the needed depth values from the Group Shared Memory instead of directly loading from texture memory. Figure 1.5.8 shows this process.

Figure 1.5.8 Comparison of directly sampling versus using the Group Shared Memory for cached sampling.

There are a few additional notes to consider on this topic, however. There is some overhead associated with reading the depth values and then storing them to the Group Shared Memory. In addition, the texture cache can often provide very fast results for memory sample requests if the result was in the cache. Thus, depending on the hardware being run and the patterns and frequency of memory access, using the Group Shared Memory may or may not provide a speed increase in practice.

Randomize Sampling Kernel

The next step in the SSAO block diagram is to somehow randomize the sampling kernel that will later be used to look up the surrounding area. This is typically done by acquiring a random vector and then performing a "reflect" operation on each of the sampling kernel vectors around the random vector. Probably the most common way to acquire this vector is to build a small texture with randomized normal vectors inside. The shader can load a single normalized reflection vector based on the screen space position of the pixel being processed [Kajalin09]. This makes removing the "salt-and-pepper" noise easier in the filtering stage of the algorithm.
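HLSL's reflect() intrinsic computes r = d - 2(d.n)n. A minimal CPU equivalent (the V3 type and names are illustrative) shows that reflecting a unit kernel vector about a unit reflection vector yields another unit vector, so the randomization reorients the kernel without changing its sampling radius.

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };

// CPU equivalent of HLSL reflect(): r = d - 2 * dot(d, n) * n.
V3 Reflect(const V3& d, const V3& n)
{
    float dn = d.x * n.x + d.y * n.y + d.z * n.z;
    return { d.x - 2.0f * dn * n.x,
             d.y - 2.0f * dn * n.y,
             d.z - 2.0f * dn * n.z };
}

float Length(const V3& v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }
```

Reflecting a vector about itself negates it, and reflecting about a perpendicular vector leaves it unchanged; in both cases the length is preserved.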
In the past, SSAO was performed in the pixel shader, which meant that the pixel shader required a screen space position as a fragment attribute passed by the vertex or geometry shader. The compute shader can help to simplify this operation somewhat. By utilizing the Dispatch ID system value, we can automatically receive the integer ID of each pixel being processed in our compute shader code. To create our repeating pattern of reflection vectors in screen space, we can simply perform a bitwise AND operation on the least significant bits of the dispatch ID; in other words, if we wanted to repeat every 4×4 block of pixels, we would mask off all but the two least significant bits of the ID. In fact, we can even store the randomized vectors as an array of constants in our shader. This eliminates the need for a texture sample and for the repeating texture of normalized reflection vectors altogether. Of course, this is predicated on not using too many vectors, but we could always fall back to the standard approach if that is needed.

Acquire Depth Data

Once the sampling kernel has been randomized, we can acquire each individual depth sample. In a traditional SSAO algorithm, this is done with a sampler that uses the x and y coordinates of the current sampling kernel vector to offset from the current pixel location. Since the sampling kernel has been pseudo-randomized, there is a potential for reduced texture cache efficiency if the sampling kernel width is large enough. If we utilize the Group Shared Memory as described previously, then the depth values that we need to acquire could already be available in the GSM. However, there are several points to consider before embarking on this strategy as well. Since the Thread Group will only be operating on one block of the depth data at a time (for example, a 16×16 block), we need to consider what happens at the edges of that block.
The pixels along the outer edges of the block will need access to the depth samples within our sampling radius, and those samples would not already be pre-loaded. This presents a choice: we could either pre-load a larger portion of the depth buffer to include the surrounding area, or we could dynamically check whether the data has been loaded into the GSM yet and, if not, fetch it directly from the depth buffer. Both options could have performance penalties. Pre-loading large bands of depth data around each block may increase the number of depth samples to the point that it would be just as efficient to perform the sampling in the traditional manner. If we dynamically decide whether or not to fetch data from the depth buffer, then we could perform a large number of dynamic branches in the shader, which could also be detrimental to performance. These factors need to be weighed against the increased access speed provided by using the GSM instead of direct sampling. With the texture cache providing similarly fast access for at least a portion of the texture samples, it is altogether possible that the standard approach would be faster. Of course, any discussion of texture cache performance depends on the hardware that the algorithm is running on, so this should be tested against your target platform to see which is the better choice.

The other point to consider when using the GSM is that there is no native support for bilinear filtering of the GSM data. If you wanted to filter the depth values for each depth sample based on the floating-point values of the kernel offset vector, then you would need to implement this functionality in the shader code itself. However, since the depth buffer contains relatively low-frequency data, this is not likely to affect image quality in this case.
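One way to reason about the pre-loading option is to verify that a chosen tiling actually covers the block plus its border exactly once. In the hypothetical sketch below, each thread of a square group writes a 2×2 quad of depths at twice its thread coordinate, so a 16×16 group fills a 32×32 shared region: the 16×16 block plus an 8-pixel band on each side. The sizes and names are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Simulate a thread group where each of groupSize x groupSize threads writes
// a 2x2 quad at (tx*2, ty*2), and check that every cell of the
// paddedSize x paddedSize shared region is written exactly once.
bool GroupFillsRegion(int groupSize, int paddedSize)
{
    std::vector<int> writes(paddedSize * paddedSize, 0);
    for (int ty = 0; ty < groupSize; ++ty)
        for (int tx = 0; tx < groupSize; ++tx)
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx)
                    ++writes[(ty * 2 + dy) * paddedSize + (tx * 2 + dx)];
    for (int w : writes)
        if (w != 1) return false; // gaps or double writes: tiling is wrong
    return true;
}
```

A 16×16 group covers a 32×32 region and a 32×32 group covers 64×64, but a 16×16 group cannot cover a 64×64 region, so border sizes must be chosen to match the group size.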
Perform Partial Occlusion Calculation (per Sample)

Once we have obtained a depth sample to compare to our current pixel depth, we can move to the partial occlusion calculations. In this step, we determine whether our sample depth causes any occlusion at the current pixel. There are many different varieties of calculations available to perform here, from a binary test of the sample point being above or below the kernel offset vector [Kajalin09] all the way up to a piecewise-defined function read from a texture [Filion08]. Regardless of how the calculation is performed, there is an interesting possibility that the compute shader introduces if the calculation is only a function of the depth delta: sharing occlusion calculations between pixels.

If we call our current pixel point P and our current sample point S, then the occlusion caused at point P by point S is inherently related to the inverse occlusion at point S caused by point P. Since the compute shader can perform scatter operations, a single thread can calculate the occlusion for one pair of locations and then write the result to point P and the inverse of the calculation to point S. This would reduce the number of required calculations by nearly a factor of 2, but it would also introduce the need for some type of communication mechanism to get the values into both locations in the occlusion buffer. Since multiple threads could be trying to write a result to the same pixel, we could attempt to use the atomic operations for updating the values, but this could lead to a large number of synchronization events between threads. At the same time, these occlusion values can be accumulated in the GSM for fast access by each thread. Again, the cost of the synchronization events will likely vary across hardware, so further testing would be needed to see how much of a benefit could come from this implementation.
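The pairing idea can be sketched on the CPU as follows. This is a hypothetical illustration with a simple linear falloff standing in for the chapter's occlusion term: one evaluation of the depth delta scatters a result to both P and S, since whichever point is closer occludes the other.

```cpp
#include <cassert>
#include <vector>

// One evaluation for the pair (p, s): if S is closer than P it occludes P,
// and if P is closer it occludes S. A linear falloff over the depth delta
// is used here purely for illustration.
void AccumulatePair(std::vector<float>& occlusion, int p, int s,
                    float depthP, float depthS, float radius)
{
    float delta = depthP - depthS; // positive: S is closer than P
    float occ = 0.0f;
    if (delta > 0.0f && delta < radius) occ = 1.0f - delta / radius;
    float inv = 0.0f;
    if (-delta > 0.0f && -delta < radius) inv = 1.0f + delta / radius;
    occlusion[p] += occ; // S occludes P when S is in front
    occlusion[s] += inv; // and vice versa when P is in front
}
```

In a compute shader, the two accumulations would be scatter writes, which is where the atomic-update or GSM-accumulation concerns discussed above come in.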
Perform Complete Occlusion Calculation

The final step in this process is to calculate the final occlusion value that will end up in the occlusion buffer for use in the final rendering. This is normally done by performing a simple average of all of the partial occlusion calculations. In this way, we can scale the number of samples used to calculate the occlusion according to the performance level of the target hardware.

As described earlier, there is typically some form of bilateral filter applied to the occlusion buffer after all pixels have a final occlusion value. In general, filtering is one area that could potentially see huge benefits from compute shader implementations. Since filtering generally has an exact, predetermined access pattern for the input image, the Group Shared Memory can be used directly to pre-load exactly the texture data needed. This is especially beneficial when implementing 2D separable filters, due to the ability to perform the filtering pass in one direction, store the result in the GSM, and then perform the second filtering pass in the other direction over the values in the GSM, without ever writing the results back to the output buffer in between steps. Even though the bilateral filter is non-separable, it has been shown that a decent approximation of it can be achieved with a separable implementation [Pham05].

Compute Shader Implementation Details

After reviewing some of the new features available in the compute shader and how they can be used with the SSAO algorithm, we can now look at a sample implementation. Since the compute shader techniques are relatively new, the focus of this implementation will be to demonstrate some of its new features and to draw some conclusions about appropriate use cases for them. These features are described briefly here, with additional detail provided in the following sections.
This implementation will utilize two different-size thread groups, 16×16 and 32×32, to generate the occlusion buffer. Using two different sizes will allow us to see whether the Thread Group size has any effect on the performance of the algorithm. We will also demonstrate the use of the GSM as a cache for the depth values and compare how well this tactic performs relative to directly loading samples from the depth buffer. In addition to using the GSM, we also utilize the Gather sampling function for filling the GSM with depth values to see whether there is any impact on overall performance. The randomization system will utilize one of the new thread addressing system values to select a reflection vector, eliminating the need for a randomization texture. After the occlusion buffer has been generated, we will utilize a separable version of the bilateral filter to demonstrate the ability of the compute shader to efficiently perform filtering operations.

Implementation Overview

The process starts by rendering a linear depth buffer at full-screen resolution with the traditional rendering pipeline. Stored along with the depth value is the view space normal vector, which will be used during the occlusion calculations. This depth/normal buffer serves as the primary input to the compute shader, which calculates a raw, unfiltered occlusion buffer. Finally, we use the depth/normal buffer and the raw occlusion buffer to perform separable bilateral filtering, producing a final occlusion buffer suitable for rendering the scene with the standard rendering pipeline.

Depth/Normal Buffer Generation

The depth/normal buffer will consist of a four-component floating-point texture, and each of the occlusion buffers will consist of a single floating-point component. The depth/normal vectors are generated by rendering the linear view space depth and view space normal vectors into the depth/normal buffer.
The depth value is calculated by simply scaling the view space depth by the distance to the far clipping plane. This ensures an output in the range of [0,1]. The normal vector is calculated by transforming the normal vector into view space and then scaling and biasing the vector components. Listing 1.5.1 shows the code for doing so.

LISTING 1.5.1 Generation of the view space depth and normal vector buffer

output.position = mul( float4( v.position, 1.0f ), WorldViewProjMatrix );
float3 ViewSpaceNormals = mul( float4( v.normal, 0.0f ), WorldViewMatrix ).xyz;
output.depth.xyz = ViewSpaceNormals * 0.5f + 0.5f;
output.depth.w = output.position.w / 50.0f;

Depending on the depth precision required for your scene, you can choose an appropriate image format, either 16 or 32 bits. This sample implementation utilizes 16-bit formats.

Raw Occlusion Buffer Generation

Next, we generate the raw occlusion buffer in the compute shader. This represents the heart of the SSAO algorithm. As mentioned earlier, we will utilize two different Thread Group sizes: the occlusion calculations will be performed in Thread Groups of size 16×16×1 and 32×32×1. Since we can adjust the number of Thread Groups executed in the application's Dispatch call, either Thread Group size can be used to generate the raw occlusion buffer. If there is any performance difference between the two Thread Group sizes, this will provide some insight into the proper usage of the compute shader. Regardless of the size of the Thread Groups, each one will generate one portion of the raw occlusion buffer equivalent to its size. Each thread will calculate a single pixel of the raw occlusion buffer that corresponds to the thread's Dispatch thread ID system value. This Dispatch thread ID is also used to determine the appropriate location in the depth/normal buffer to load. The depth value and normal vector are loaded from the texture and converted back into their original formats for use later.
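The encode/decode round trip for the depth/normal buffer can be sketched as follows: depth is divided by the far-plane distance (50.0 in Listing 1.5.1) to land in [0,1], and each normal component is scaled and biased from [-1,1] into [0,1]. The helper names are illustrative; only the arithmetic comes from the listing.

```cpp
#include <cassert>
#include <cmath>

const float kFarPlane = 50.0f; // far clipping plane distance, as in Listing 1.5.1

// View space depth -> [0,1] and back.
float EncodeDepth(float viewDepth) { return viewDepth / kFarPlane; }
float DecodeDepth(float stored)    { return stored * kFarPlane; }

// Normal component in [-1,1] -> [0,1] and back (scale and bias).
float EncodeNormalComponent(float n) { return n * 0.5f + 0.5f; }
float DecodeNormalComponent(float s) { return s * 2.0f - 1.0f; }
```

This is the "converted back into their original formats" step: the compute shader applies the decode half when it reads the depth/normal buffer.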
Depth Value Cache with the GSM

We will also set up the compute shader to cache the local depth values in the GSM. Once the depth values of the surrounding area are loaded into the GSM, all subsequent depth sampling can be performed from the GSM instead of loading directly from texture memory. Before we discuss how to set up and use the GSM, we need to consider the desired layout for the data. Since we are utilizing two different Thread Group sizes, we will specify a different layout for each. Each Thread Group requires the depth region that it represents to be present in the GSM. In addition, the area surrounding the Thread Group's boundary is also needed so that the occlusion calculations for the border pixels can be carried out correctly. This requires each thread to sample not only its own depth/normal vector, but also some additional depth values to properly load the GSM for use later.

If we stipulate that each thread will load four depth values into the GSM, then our 16×16 Thread Group will provide a 32×32 overall region in the GSM (the original 16×16 block with an 8-pixel boundary). The 32×32 Thread Group size will provide a 64×64 region (the original 32×32 block with a 16-pixel boundary). Fortunately, the Gather instruction can be utilized to increase the number of depth values sampled by each thread. The Gather instruction returns the four point-sampled, single-component texture samples that would normally have been used for bilinear interpolation, which is perfect for pre-loading the GSM since we are using only single-component depth values. This effectively increases the number of depth samples per texture instruction by a factor of 4. If we use each thread to perform a single Gather instruction, then we can easily fill the required 32×32 and 64×64 areas. The required samples are obtained by having each thread perform the Gather instruction and store the results in the GSM for all other threads within the group to utilize.
This is demonstrated in Listing 1.5.2.

LISTING 1.5.2 Declaring and populating the Group Shared Memory with depth data

#define USE_GSM
#ifdef USE_GSM
// Declare enough shared memory for the padded thread group size
groupshared float LoadedDepths[padded_x][padded_y];
#endif

int3 OffsetLocation = int3( GroupID.x * size_x - kernel_x,
                            GroupID.y * size_y - kernel_y, 0 );
int3 ThreadLocation = GroupThreadID * 2;

float2 fGatherSample;
fGatherSample.x = ( (float)GroupID.x * (float)size_x - (float)kernel_x
                  + (float)GroupThreadID.x * 2.0f ) / xres;
fGatherSample.y = ( (float)GroupID.y * (float)size_y - (float)kernel_y
                  + (float)GroupThreadID.y * 2.0f ) / yres;

float4 fDepths = DepthMap.GatherAlpha( DepthSampler,
    fGatherSample + float2( 0.5f / (float)xres, 0.5f / (float)yres ) ) * zf;

LoadedDepths[ThreadLocation.x][ThreadLocation.y]     = fDepths.w;
LoadedDepths[ThreadLocation.x+1][ThreadLocation.y]   = fDepths.z;
LoadedDepths[ThreadLocation.x+1][ThreadLocation.y+1] = fDepths.y;
LoadedDepths[ThreadLocation.x][ThreadLocation.y+1]   = fDepths.x;

GroupMemoryBarrierWithGroupSync();

The number of depth values loaded into the GSM can be increased as needed by having each thread perform additional Gather instructions. The Group Shared Memory is defined as a 2D array corresponding to the size of the area that will be loaded and cached. After all of the depth values have been loaded, we introduce a synchronization among the threads in the Thread Group with the GroupMemoryBarrierWithGroupSync() intrinsic function. This function ensures that all threads have finished writing to the GSM up to this point in the compute shader before execution continues. A compile-time switch is provided in the sample code to allow switching between filling the GSM to use the cached depth values and directly accessing the depth texture.
Since the GSM has the potential to improve the sampling performance depending on the access pattern, this will allow an easy switch between techniques for a clear efficiency comparison.

Next, we initialize the randomization of the sampling kernel with the lowest four bits of the Dispatch thread ID x and y coordinates, as shown in Listing 1.5.3. The lowest four bits in each direction are used to select a reflection vector from a 2D array of rotation vectors, which are predefined and stored in a constant array. This eliminates the need for a separate texture and range expansion calculations, but it requires a relatively large array to be loaded when the compute shader is loaded. After it is selected, the reflection vector is then used to modify the orientation of the sampling kernel by reflecting each of the kernel vectors about the reflection vector. This provides a different sampling kernel for each consecutive pixel in the occlusion buffer.

LISTING 1.5.3 Definition of the sampling kernel and selection of the randomization vector

const float3 kernel[8] =
{
    normalize( float3(  1,  1,  1 ) ),
    normalize( float3( -1, -1, -1 ) ),
    normalize( float3( -1, -1,  1 ) ),
    normalize( float3( -1,  1, -1 ) ),
    normalize( float3( -1,  1,  1 ) ),
    normalize( float3(  1, -1, -1 ) ),
    normalize( float3(  1, -1,  1 ) ),
    normalize( float3(  1,  1, -1 ) )
};

const float3 rotation[16][16] = { { {...},{...},{...},{...}, ... } };

int rotx = DispatchThreadID.x & 0xF;
int roty = DispatchThreadID.y & 0xF;
float3 reflection = rotation[rotx][roty];

With a random reflection vector selected, we can begin the iteration process by sampling a depth value at the location determined by the randomized sampling kernel offsets. The sample location is found by determining the current pixel's view space 3D position and then adding the reoriented sampling kernel vectors as offsets from the pixel's location.
This new view space position is then converted back to screen space, producing an (x, y) coordinate pair that can be used to select the depth sample from either the GSM or the depth/normal texture. This is shown in Listing 1.5.4.

LISTING 1.5.4 Sampling location flipping and re-projection from view space to screen space

float3 vRotatedOffset = reflect( kernel[y], rotation[rotx][roty] );
float3 vFlippedOffset = vRotatedOffset;

float fSign = dot( fPixelNormal, vRotatedOffset );
if ( fSign < 0.0f )
    vFlippedOffset = -vFlippedOffset;

float3 Sample3D = PixelPosVS + vFlippedOffset * scale;
int3 iNewOffset = ViewPosToScreenPos( Sample3D );

#ifndef USE_GSM
float fSample = DepthMap.Load( iNewOffset ).w * zf;
#else
float fSample = LoadDepth( iNewOffset - OffsetLocation );
#endif

The pixel's view space normal vector is used to determine whether the kernel offset vector points away from the current pixel. If so, the direction of the offset vector is negated to provide a sample that is more relevant for determining occlusion. This keeps the samples in the visible hemisphere of the pixel, which increases the usable sample density. The final screen space sample location is then used to look up the depth sample, either directly from the texture or from the GSM by calling the LoadDepth() function. After the depth has been loaded, the occlusion at the current pixel from this sample is calculated. The calculation used is similar to the one presented in [Filion08] and [Lake10]: a linear occlusion falloff function raised to a power. This produces a smooth, gradual falloff from full occlusion to zero occlusion and provides easy-to-use parameters for adjusting the occlusion values. The partial occlusion calculation is repeated for a given number of samples, implemented as a multiple of the number of elements in the sampling kernel. In this implementation, the number of samples can be chosen in multiples of eight.
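A falloff of this shape, a linear ramp over the depth delta raised to a power, can be sketched as follows. The parameter names and thresholds are assumptions for illustration, not the chapter's constants.

```cpp
#include <cassert>
#include <cmath>

// Illustrative occlusion falloff: fully occluded below one depth-delta
// threshold, zero beyond another, with a linear ramp raised to a power in
// between for a smooth, tunable transition.
float OcclusionFalloff(float depthDelta, float fullOcclusionDelta,
                       float noOcclusionDelta, float power)
{
    if (depthDelta <= fullOcclusionDelta) return 1.0f; // occluder is very close
    if (depthDelta >= noOcclusionDelta)   return 0.0f; // beyond the falloff range
    float t = (noOcclusionDelta - depthDelta) /
              (noOcclusionDelta - fullOcclusionDelta);
    return std::pow(t, power); // linear falloff raised to a power
}
```

Raising the ramp to a power lets the artist bias how quickly occlusion fades with distance without changing the overall range.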
All of these individual occlusion values are averaged and then stored in the raw occlusion buffer for further processing.

Separable Bilateral Filter

The final step in our occlusion value generation is to perform the bilateral blur. As described earlier, we are able to use a separable version of the filter, even though it is not perfectly accurate to do so. The bilateral filter passes are implemented in the compute shader, with each of the separable passes performed in an individual Dispatch call. Since we are only processing one direction at a time, we first use one Thread Group for each row of the image and then process the resulting image with one Thread Group for each column. In this arrangement, we can load the entire contents of a Thread Group's row or column into the GSM, and then each thread can read its neighbors' values directly from it. This should minimize the cost of sampling a texture for filtering and allow larger filter sizes to be used. This implementation uses 7×7 bilateral filters, but the size can easily be increased or decreased as needed. Listing 1.5.5 shows how the separable filter pass loads its data into the GSM.

LISTING 1.5.5 Loading and storing the depth and occlusion values into the GSM for the horizontal portion of a separable bilateral filter

// Declare enough shared memory for the padded group size
groupshared float2 horizontalpoints[totalsize_x];
...
int textureindex = DispatchThreadID.x + DispatchThreadID.y * totalsize_x;

// Each thread will load its own depth/occlusion values
float fCenterDepth = DepthMap.Load( DispatchThreadID ).w;
float fCenterOcclusion = AmbientOcclusionTarget[textureindex].x;

// Then store them in the GSM for everyone to use
horizontalpoints[GroupIndex].x = fCenterDepth;
horizontalpoints[GroupIndex].y = fCenterOcclusion;

// Synchronize all threads
GroupMemoryBarrierWithGroupSync();

One thread is declared for each pixel of the row/column, and each thread loads a single value out of the raw occlusion buffer and stores that value in the GSM. Once the value has been stored, a synchronization point is used to ensure that all of the memory accesses have completed and that the values that have been stored can be safely read by other threads.

The bilateral filter weights consist of two components: a spatially based weighting and a range-based weighting. The spatial weights utilize a fixed Gaussian kernel with a size of 7 taps in each direction. A separate Gaussian weighting value is calculated based on the difference between the center pixel and each of the samples to determine the weighting to apply to that sample. Modifying the sigma values used in the range-based Gaussian allows for easy adjustment of the range-filtering properties of the bilateral filter. Listing 1.5.6 shows how this calculation is performed.
LISTING 1.5.6 Horizontal portion of a separable bilateral filter in the compute shader

const float avKernel7[7] = { 0.004431f, 0.05402f, 0.2420f, 0.3990f,
                             0.2420f, 0.05402f, 0.004431f };
const float rsigma = 0.0051f;

float fBilateral = 0.0f;
float fWeight = 0.0f;

for ( int x = -3; x <= 3; x++ )
{
    int location = GroupIndex + x;
    float fSampleDepth = horizontalpoints[location].x;
    float fSampleOcclusion = horizontalpoints[location].y;

    float fDelta = fCenterDepth - fSampleDepth;
    float fRange = exp( ( -1.0f * fDelta * fDelta )
                      / ( 2.0f * rsigma * rsigma ) );

    fBilateral += fSampleOcclusion * fRange * avKernel7[x+3];
    fWeight += fRange * avKernel7[x+3];
}

AmbientOcclusionTarget[textureindex] = fBilateral / fWeight;

Finally, once both passes of the bilateral filter have been performed, the values written to the final output buffer can be used to supply the ambient lighting term of an output image. The value stored in the occlusion buffer represents visibility and thus can be used directly as the ambient lighting term without modification. The sample implementation provides its output using only the occlusion value; no other lighting is applied to the scene.

Results

Figure 1.5.9 shows the end result of our compute shader-based implementation of the algorithm. All images and performance numbers were generated on an AMD 57xx series GPU. This image was generated using 32 depth samples for each occlusion pixel and with the 7×7 separable bilateral filter applied twice. To gain some insight into how well the new implementation techniques perform, we can review the overall frame time for each of our optional configurations. Table 1.5.1 provides the performance metrics for the two Thread Group sizes with varying numbers of samples used in the occlusion calculation. These frames-per-second figures were generated with the direct sampling technique on a 640×480 render target with no bilateral filtering applied. The two Thread Group sizes produce nearly identical performance numbers.
This indicates that the Thread Group size does not have a significant impact on this type of iterative algorithm. When considering the actual frame time for each test (the inverse of the fps), we see a linear increase in frame time for each of the additional sets of samples.

TABLE 1.5.1 Frame rate (fps), direct sampling (no GSM)

Samples:    8    16   24   32   40   48   56   64
16×16      947  608  437  349  288  241  212  188
32×32      961  610  445  350  288  245  214  187

In comparison, Table 1.5.2 provides the same metrics generated with the GSM caching technique.

TABLE 1.5.2 Frame rate (fps), with GSM caching

Samples:    8    16   24   32   40   48   56   64
16×16      811  699  531  426  357  306  269  240
32×32      764  625  485  391  330  286  252  225

Table 1.5.2 shows a different performance characteristic. When compared to the direct loading technique, the GSM technique performs slower at the 8-sample level. However, for all sample levels above this, the GSM technique significantly outperforms the direct sampling method. For all of these higher sampling levels, we see a similar linear increase in frame time, but with a smaller slope than the direct sampling method. Figure 1.5.10 shows the frame times for the four different cases. The slower performance with the GSM at lower sampling rates can be attributed to the overhead of loading and storing all of the additional depth data. However, there is a clear performance gain for each additional sample used in the occlusion calculation.

Figure 1.5.9 Final results of the compute shader SSAO implementation.

With this performance advantage also come some limitations. In both Thread Group sizes, we defined a fixed border size. In some cases, when a pixel is close to the viewer, the offset vector can produce a screen-space offset much larger than this border size. This can be overcome either by scaling the size of the sampling kernel according to the distance from the camera or by dynamically determining whether the sample location is available in the GSM and directly loading the sample if needed.
Conclusion and Future Work

In this chapter, we have applied the compute shader to the Screen Space Ambient Occlusion algorithm and discussed the implications of various implementation choices. This implementation provides a basic framework upon which further proposed extensions can be implemented relatively easily. Additional research can be directed at sharing partial occlusion values between neighboring pixels for each occlusion calculation, which is now possible due to the scatter capabilities of the compute shader. In addition, the use of the GSM as a caching mechanism for regional depth averages could be explored further. Finally, there have been several recent findings using multi-resolution rendering solutions for SSAO, which should also benefit from compute shader implementations.

Figure 1.5.10 Comparison of frame times with and without the GSM as a cache.

1.6 Eye-View Pixel Anti-Aliasing for Irregular Shadow Mapping

Nico Galoppo, Intel Advanced Visual Computing (AVC)

The irregular shadows algorithm (also known as Irregular Z-Buffer shadows) combines the image quality and sampling characteristics of ray-traced shadows with the performance advantages of depth buffer–based hardware pipelines [Johnson04]. Irregular shadows are free from aliasing from the perspective of the light source because the occlusion of each eye-view sample is evaluated at sub-pixel precision in the light view. However, irregular shadow mapping suffers from pixel aliasing in the final shadowed image because shadow edges and high-frequency shadows are not correctly captured by the resolution of the eye-view image. Brute-force super-sampling of eye-view pixels decreases shadow aliasing overall but incurs impractical memory and computational requirements. In this gem, we present an efficient algorithm to compute anti-aliased occlusion values.
Rather than brute-force super-sampling all pixels, we propose adaptively adding shadow evaluation samples for a small fraction of potentially aliased pixels. We construct a conservative estimate of the eye-view pixels that are neither fully lit nor fully occluded. Multiple shadow samples are then inserted into the irregular Z-buffer based on the footprint of the light-view projection of potentially aliased pixels. Finally, the individual shadow sample occlusion values are combined into fractional, properly anti-aliased occlusion values. Our algorithm requires minimal additional storage and shadow evaluation cost but results in significantly better image quality of shadow edges and improved temporal behavior of high-frequency shadow content. Previously, architectural constraints of traditional GPUs have inhibited per-frame construction and traversal of irregular data structures in terms of both performance and programmer flexibility. Our implementation of anti-aliased irregular shadow mapping exploits many strengths of the Larrabee architecture, one of which is the ability to write to run-time computed addresses in global memory space. Additionally, we were able to do so using the conventional C programming model and incorporate the adaptive nature of our technique with little effort. In comparison, traditional GPU architectures do not offer programming semantics for such global scatter operations, or they do so at extremely low performance due to their highly specialized but constrained (localized) memory hierarchy (for example, the CUDA programming model), in the worst case falling back to main memory writes [Sintorn08, Baumann05].
Background and Problem: Shadow Edge Aliasing

In this section, we describe the characteristics of various popular shadow generation algorithms and how they cope with different forms of aliasing, and we introduce the problem of screen-space shadow edge aliasing, which affects many current algorithms.

Pixel-Perfect Shadows with the Irregular Z-Buffer

Conventional shadow mapping renders the scene from the eye and the light, and in the final compositing pass, the two views are compared to identify points that are in shadow [Williams78]. Light-view aliasing results from misalignment of these two views, as shown in Figure 1.6.1(a). There are several variants of shadow mapping that reduce but do not eliminate sampling and self-shadowing artifacts [Fernando01, Stamminger02, Sen03, Lloyd08, Lefohn07], because none of them resolves the fundamental mismatch in sampling patterns between the eye and light views, which is the root cause of most shadow mapping artifacts. Irregular shadow mapping addresses this root cause by basing the light-view sampling pattern on the positions of pixels in the eye-view raster and their corresponding depth values, thereby perfectly aligning the compared occluder surface point with the projection of the shadow sample, as illustrated in Figure 1.6.1(b) [Johnson04, Johnson05]. The density of shadow samples varies significantly across the image plane (as seen in Figure 1.6.2), which illustrates the need for an irregular data structure during the light pass. Irregular shadow mapping utilizes the irregular Z-buffer in this context. This data structure explicitly stores all of the sample locations in a two-dimensional spatial data structure rather than implicitly representing them with a regular pattern. The data structure can be any spatial data structure that supports efficient range queries, such as a k-d tree or a grid.
Just as in conventional shadow mapping, irregular shadow mapping projects triangles onto the light-view image plane one at a time and then determines which samples lie inside a triangle. Unlike conventional shadow mapping, this determination is made by querying the irregular Z-buffer. Finally, for each sample inside a triangle, irregular shadow mapping performs the standard depth comparison and updates the sample's occlusion value.

Figure 1.6.1 Conventional versus irregular shadow mapping. In conventional shadow mapping (left), both the eye-view and light-view images are rendered with the classic Z-buffer, leading to a mismatch between the desired and actual sample locations in the shadow map. Irregular shadow mapping (right) avoids this mismatch by rendering the light-view image with the irregular Z-buffer.

Figure 1.6.2 The classic Z-buffer (a) samples a scene at regularly spaced points on the light image plane. The irregular Z-buffer (b) samples a scene at arbitrary points on the light image plane. Irregular shadow mapping (d) eliminates aliasing artifacts typically associated with conventional shadow mapping (c).

Note that when a conventional rasterizer is used during light-view projection of occluder triangles, it is necessary to scan-convert expanded triangles to ensure fragments will be generated for any cell touched by the unexpanded triangle (also known as conservative rasterization [Akenine-Möller05]), since irregular Z-buffer samples may lie anywhere within the cell bounds, as illustrated in Figure 1.6.3. For reference, [Hasselgren05] describes a shader implementation with example code. On the other hand, the advantage of a software rasterizer (for example, on Larrabee) is that a special rasterization path can be implemented to apply custom rasterization rules that enable conservative rasterization directly, without triangle expansion.
Eye-View Aliasing

While irregular shadow mapping is free of light-view aliasing, it still suffers from eye-view aliasing of pixels, as illustrated in Figure 1.6.4. Such aliasing is a common problem in computer graphics. For example, it is also encountered in ray casting with a single eye ray per pixel: thin geometry (high-frequency screen content) cannot be captured by a single ray, because the rays of two neighboring pixels may miss some geometry even though the geometry projects to part of those pixels. Similarly, in the case of shadows in a rasterizer, it is possible that a surface point projected to the center of an eye-view pixel is lit, but the entire area of the pixel is not lit. This phenomenon, known as eye-view shadow aliasing, is caused by the fact that a single-bit occlusion value is not sufficient to represent the occlusion value of aliased pixels. Anti-aliased occlusion values are fractional values that represent the fraction of the total pixel area that is lit.

Figure 1.6.3 Conservative rasterization versus conventional rasterization. Scan-converted triangles have to be expanded during light-view projection to ensure fragments will be generated for any cell touched by the unexpanded triangle (shaded cells), since irregular Z-buffer samples (circles) may lie anywhere within the pixel bounds.

Recently, a few novel shadow mapping techniques [Brabec01, Lauritzen06, Salvi08] have addressed this problem and provide good solutions for eye-view aliasing but still expose light-view aliasing. The most obvious approach to produce anti-aliased shadows with irregular shadow mapping is super-sampling of the entire screen by generating and evaluating multiple shadow samples for each eye-view pixel. The anti-aliased occlusion value for a pixel is then simply the average of the individual sample occlusion values.
While this brute-force approach certainly works, as illustrated in Figure 1.6.5, the computational and storage costs quickly become impractical. Data structure construction, traversal times, and storage requirements of the irregular Z-buffer are proportional to the number of shadow samples, making real-time performance impossible on current hardware for even as few as four shadow samples per pixel. The recent method by [Robison09] provides a solution to compute anti-aliased shadows from the (aliased) output of irregular shadow mapping, but in essence it is also a brute-force approach in screen space that does not exploit the irregular Z-buffer acceleration structure and is therefore at a computational disadvantage compared to our approach.

Figure 1.6.4 The thin geometry in the tower causes eye-view aliasing of the projected shadow. Note that some of the tower's connected features are disconnected in the shadow.

Solution: Adaptive Multi-Sampling of Irregular Shadows

We observed in Figure 1.6.5 that accumulating shadow evaluation results of multiple samples per eye-view pixel provides nicely anti-aliased shadows, and that potentially shadow-aliased pixels are those pixels that lie on a projected shadow edge. Therefore, we propose an efficient algorithm for anti-aliased irregular shadow mapping by adaptive multi-sampling of only those pixels that potentially lie on a shadow edge. Since only a marginal fraction of all screen pixels are shadow-edge pixels, this approach results in substantial savings in computational and storage costs compared to the brute-force approach. Essentially, our method is an extension of the original irregular shadow mapping algorithm, where the irregular Z-buffer acceleration structure remains a light space–oriented acceleration structure for the projected eye-view shadow samples.
However, during irregular Z-buffer construction, potential shadow edge pixels are detected using a conservative shadow edge stencil buffer. Such pixels generate multiple shadow samples distributed over the pixel's extent, which are inserted into the irregular Z-buffer (shadow sample splatting). Non-shadow-edge pixels are treated just as in the original irregular shadow mapping algorithm—a single shadow sample is sufficient to determine the occlusion value of the entire pixel. In the final shadow evaluation step, shadow occlusion values are averaged over each eye-view pixel's samples, resulting in a properly anti-aliased fractional occlusion value. This value approximates the fraction of the pixel's area that is occluded, and it converges to the true value in the limit as the number of samples per pixel increases.

Figure 1.6.5 A four-times super-sampled irregular shadow mapping result image of the tower scene.

Algorithm: Anti-Aliased Irregular Shadow Mapping

We will now give an overview of the complete algorithm to provide structure to the remainder of the algorithm description in this section. Then we describe how to determine which pixels are potentially aliased by constructing a conservative shadow edge stencil buffer and how to splat multiple samples into the irregular Z-buffer efficiently. Finally, we put it all together and present the complete algorithm in practice. We can formulate our approach in the following top-level description of our algorithm:

1. Render the scene conservatively from the light's point of view to a variance shadow map.
2. Render the scene from the eye point to a conventional Z-buffer—depth values only (gives points P0).
3. Construct a conservative shadow edge stencil buffer using the variance shadow map and the light-space projection of P0.
4. Using the stencil from Step 3, generate N extra eye-view samples Pi for potential shadow edge pixels only.
5. Transform eye-view samples Pi to light space P′i (shadow sample splatting).
6. Insert all samples P′i into the irregular Z-buffer.
7. Render the scene from the light's point of view while testing against samples in the irregular Z-buffer, tagging occluded samples.
8. Render the scene from the eye point, using the result from Step 7 and the conservative shadow edge stencil buffer. Multi-sampled eye-view pixels accumulate shadow sample values into a fractional (anti-aliased) shadow value.

Conservative Shadow Edge Stencil Buffer

To adaptively add shadow samples at shadow-edge pixels, we construct a special stencil buffer that answers the following question: Is there any chance that this eye-space pixel is partially occluded by geometry in this light-space texel? We call this stencil buffer the conservative shadow edge stencil buffer. Giving an exact answer to this question is impossible, because doing so would essentially solve the shadowing problem itself. However, we can use a probabilistic technique to answer the question conservatively with sufficient confidence. A conservative answer is sufficient for our purpose, since multi-sampling of non-shadow-edge pixels does not alter the correctness of the result—it only adds some extra cost. Obviously, we strive to make the stencil buffer only as conservative as necessary. We employ a technique called Variance Shadow Mapping [Lauritzen06]. Variance shadow maps encode a distribution of depths at each light-space texel by determining the mean and variance of depth (the first two moments of the depth distribution). These moments are constructed through mip-mapping of the variance shadow map. When querying the variance shadow map, we use these moments to compute an upper bound on the fraction of the distribution that is more distant than the surface being shaded, and this bound can therefore be used to cull eye-view pixels that are almost certainly in shadow.
In particular, the cumulative distribution function F(t) = P(x ≥ t) can be used as a measure of the fraction of the eye-view fragment that is lit, where t is the distance of the eye-view sample to the light, x is the occluder depth distribution, and P stands for the probability function. While we cannot compute F(t) exactly, the one-tailed Chebyshev inequality gives an upper bound for t > μ:

P(x ≥ t) ≤ Pmax(t) = σ² / (σ² + (t − μ)²)

where μ = E(x) and σ² = E(x²) − E(x)² are the mean and variance of the occluder depth distribution. The upper bound Pmax(t) and the true probability Plit(t) are depicted in Figure 1.6.6. Thus, we can determine that it is almost certain that a projected eye-view sample with light depth t is in shadow (for example, with 99-percent certainty) by comparing Pmax(t) to 1 percent (Pmax(t) < 0.01 implies Plit(t) < 0.01). Conversely, we can use the same distribution to construct a symmetric bound P′max(t) to cull eye-view pixels that have a very high probability of being lit.

Figure 1.6.6 Pin-shadow(t) and Plit(t), in addition to their conservative upper bounds Pmax(t) and P′max(t).

In summary, the conservative shadow edge stencil buffer can be constructed in the following steps:

1. Render the scene from the light's point of view, writing out depth x and depth squared x² to a variance shadow map texture (VSM).
2. Mip-map the resulting texture, effectively computing E(x) and E(x²), the first two moments of the depth distribution.
3. Render the scene from the eye point, computing for each sample:
   a. The light depth t, by projection of the sample to light space.
   b. E(x) and E(x²), by texture-sampling the mip-mapped VSM with the appropriate filter width, determined by the extent of the light projection of the pixel area.
   c. μ = E(x), σ² = E(x²) − E(x)², and the bounds Pmax(t) and P′max(t).
4. Compare Pmax(t) and P′max(t) to a chosen threshold (for example, 1 percent). If either bound falls below the threshold, the pixel is almost certainly fully shadowed or fully lit, respectively; otherwise, set the stencil buffer bit to mark the pixel as a potential shadow edge.

These steps can be implemented in HLSL shader pseudocode, as shown in Listing 1.6.1.
LISTING 1.6.1 Conservative shadow edge stencil buffer construction HLSL shader

float2 ComputeMoments(float Depth)
{
    // Compute first two moments of depth
    float2 Moments;
    Moments.x = Depth;
    Moments.y = Depth * Depth;
    return Moments;
}

float ChebyshevUpperBound(float2 moments, float t, float minVariance)
{
    // Compute variance
    float variance = max(minVariance,
                         moments.y - (moments.x * moments.x));

    // One-tailed Chebyshev's Inequality: bounds the fraction of the
    // occluder distribution beyond depth t (the lit fraction).
    float d = t - moments.x;
    float pMax = variance / (variance + (d * d));
    return (t <= moments.x ? 1.0f : pMax);
}

float ChebyshevLowerBound(float2 moments, float t, float minVariance)
{
    // Symmetric variant: bounds the fraction of the distribution in
    // front of depth t (the shadowed fraction).
    float variance = max(minVariance,
                         moments.y - (moments.x * moments.x));
    float d = moments.x - t;
    float pMax = variance / (variance + (d * d));
    return (t >= moments.x ? 1.0f : pMax);
}

bool IsPotentialShadowEdge(float2 texCoord, float2 texCoordDX,
                           float2 texCoordDY, float depth)
{
    // Variance Shadow Map mip-mapped LOD tex lookup
    float4 occluderData = texShadowMap.SampleGrad(
        sampShadowMap, texCoord, texCoordDX, texCoordDY);
    float2 posMoments = occluderData.xy;

    // Minimum variance to account for variance across the entire pixel
    float gMinVariance = 0.000001f;

    float pMaxLit = ChebyshevUpperBound(posMoments, depth, gMinVariance);
    float pMaxShadow = ChebyshevLowerBound(posMoments, depth, gMinVariance);

    // If either bound is below the threshold, the pixel is almost
    // certainly fully shadowed or fully lit, respectively, and needs
    // no extra shadow samples.
    if (pMaxLit < 0.01f || pMaxShadow < 0.01f)
    {
        return false;
    }
    return true;
}

Note that while conventional rasterization of the scene from the light's point of view in Step 1 earlier is sufficient for generating a conventional variance shadow map, it is not sufficient for generating our conservative stencil buffer. Conventional rasterization does not guarantee that a primitive's depth contributes to the minimum depth of each light-view texel that it touches. Hence, the conservativeness of the stencil buffer would not be preserved, as illustrated in Figure 1.6.7, which depicts potential shadow-edge pixels in overlay but misses quite a few due to the low resolution of the variance shadow map.

Figure 1.6.7 Conservative shadow edge stencil map with regular rasterization. Many potential shadow-edge pixels (overlay) are missed due to the low resolution of the variance shadow map.

To preserve conservativeness, it is required to perform conservative rasterization in the light-view render of Step 1, just as we do during the light-view render of irregular shadow mapping, illustrated in Figure 1.6.3. Figure 1.6.8 depicts correctly detected potential shadow-edge pixels in overlay, regardless of the variance shadow map resolution.

Shadow Sample Splatting

In the irregular Z-buffer construction phase, when the time comes to generate additional samples for potentially aliased pixels, as defined by the conservative shadow edge stencil buffer, we insert the eye-view samples into each light-view grid cell that is touched by the pixel samples. We call this process shadow sample splatting, because we conceptually splat the projection of the pixel footprint into the light-space grid data structure. The process is as follows:

1. In addition to the light-view coordinates of the eye-view pixel center, generate multiple samples per eye-view pixel. We have achieved very good results with rotated-grid 4× multi-sampling, but higher sample rates and even jittered sampling strategies can be used to increase the quality of the anti-aliasing.
2. Project all samples into light space as in the original irregular shadow algorithm.
3. Insert all samples into the irregular Z-buffer as in the conventional irregular shadowing algorithm. Potentially multiple light grid cells are touched by the set of samples of a pixel.

Figure 1.6.8 Conservative shadow edge stencil map with conservative rasterization. All potential shadow-edge pixels are detected (overlay), regardless of the variance shadow map resolution.

Results and Discussion

We will now show the results of our algorithm for two different scenes.
The first scene consists of a tower construction with fine geometry casting high-frequency shadows onto the rest of the scene. The second scene is the view of a fan at the end of a tunnel, viewed from the inside. The fan geometry casts high-frequency shadows on the inside walls of the tunnel. The tunnel walls are almost parallel to the eye and light directions, a setup that is particularly hard for many shadow mapping algorithms. Irregular shadow mapping shows its strength in the tunnel scene because no shadow map resolution management is required to avoid light-view aliasing. However, severe eye-view aliasing artifacts are present for the single-sample irregular shadow algorithm (see Figures 1.6.10(a) and 1.6.11(a)). Figure 1.6.9 illustrates the result of computing the conservative shadow edge stencil buffer on both scenes: Potential shadow edge pixels are rendered with an overlay. Figure 1.6.10 compares single-sample irregular shadows with 4× rotated-grid multi-sampling on potential shadow edge pixels only. Note the significant improvement in the tower shadow, where many disconnected features in the shadow are now correctly connected in the improved algorithm. Figure 1.6.11 illustrates the same comparison for the tunnel scene. There is a great improvement in shadow quality toward the far end of the tunnel, where high-frequency shadows cause significant aliasing when using only a single eye-view shadow sample.

Figure 1.6.9 Result of the conservative shadow edge stencil buffer on the tower (a) and tunnel (b) scenes. Potential shadow edge pixels are rendered with an overlay.

Figure 1.6.10 Tower scene: (a) Single-sample irregular shadows and (b) 4× rotated-grid multi-sampling on potential shadow edge pixels only. Note the significant improvement in the tower shadow, where many disconnected features in the shadow are now correctly connected in the improved algorithm.
Figure 1.6.11 Tunnel scene: (a) Single-sample irregular shadows and (b) 4× rotated-grid multi-sampling on potential shadow edge pixels only. There is a great improvement in shadow quality toward the end of the tunnel, where high-frequency shadows caused significant aliasing when using only a single eye-view shadow sample.

Implementation Details

On Larrabee, we have implemented Steps 2 and 3 of our algorithm as an efficient post-process over all eye-view pixels in parallel. However, Step 3 is identical to the conventional irregular shadowing algorithm; therefore, it could be implemented as in [Arvo07] as well. Conceptually, we use a grid-of-lists representation for the irregular Z-buffer. This representation is well-suited to parallel and streaming computer architectures and produces high-quality shadows in real time in game scenes [Johnson05]. The following chapter of this book [Hux10], in particular Figure 1.7.1, explains our grid-of-lists representation and its construction in more detail. Finally, our solution was implemented in a deferred renderer, but it could also be implemented in a forward renderer with a few modifications.

Compute Requirements

Since only a marginal fraction of all screen pixels are shadow-edge pixels, this approach results in substantial savings in computational and storage costs compared to the brute-force approach. Compared to single-sample irregular shadow maps, the additional computational cost is relatively small. For example, let's assume the number of potential shadow-edge pixels is ~10 percent of all eye-view pixels and that we generate N additional samples per potential shadow-edge pixel. Since data structure construction and traversal times are proportional to the number of shadow map samples, this means an additional cost of 10N percent for anti-aliasing.
Additionally, there is an extra cost associated with creating the conservative shadow edge stencil buffer. In our implementation inside a deferred renderer, much of the required information was already computed; therefore, that extra cost is small. However, our algorithm does require an extra light-space pass, per light, to capture the depth distribution into the variance shadow map.

Storage Requirements

The storage cost is the same as for the standard irregular Z-buffer: proportional to the number of samples. Again, for our example of 10 percent shadow-edge pixels with N extra samples each, 10N percent extra storage is required, depending on the implementation. Storage of the stencil buffer requires only 1 bit per eye-view pixel and can easily be packed into one of the existing eye-view buffers of the irregular shadowing algorithm.

Future Work

Going forward, we would like to investigate the benefits of merging our algorithm, which adaptively samples potential shadow edge pixels multiple times, with conventional multi-sampling techniques that adaptively sample geometry silhouette pixels multiple times—for best performance, preferably through the use of common data structures and shared rendering passes. Additionally, it should be fairly straightforward to extend our approach to soft irregular shadow mapping, where the concept of anti-aliasing is implicit, as both the soft and hard irregular shadow mapping algorithms share the same algorithmic framework [Johnson09]. For soft shadows, one may envision extending the conservative stencil to shadow penumbra detection.

Conclusion

The main advantage of irregular shadow maps with respect to conventional shadow maps is that they bear no light-view aliasing. However, irregular shadow maps are affected by eye-view aliasing of the shadow result. Recent pre-filterable shadow mapping algorithms and brute-force eye-view techniques have provided solutions for anti-aliased shadows, but none of them exploits the irregular Z-buffer acceleration structure directly.
The method in this chapter is an extension of irregular shadow mapping, exploits the same irregular data structure, and is therefore the first algorithm to produce anti- aliased shadows by means of adaptive multi-sampling of irregular shadow maps, while keeping all its other positive characteristics, such as pixel-perfect ray-traced quality shadows and the complete lack of light-view aliasing. Acknowledgements We’d like to thank the people at the 3D Graphics and Advanced Rendering teams at the Intel Visual Computing group for their continued input while developing the methods described here. Many thanks go out in particular to Jeffery A. Williams for providing the art assets in the tower and tunnel scenes and to David Bookout for per- sistent support and feedback. References [Akenine-Möller05] Akenine-Möller, Tomas and Timo Aila. “Conservative and Tiled Rasterization Using a Modified Triangle Setup.” Journal of Graphics, GPU, and Game Tools 10.3 (2005): 1–8. [Arvo07] Arvo, Jukka. “Alias-Free Shadow Maps using Graphics Hardware.” Journal of Graphics, GPU, and Game Tools 12.1 (2007): 47–59. [Baumann05] Baumann, Dave. “ATI Xenos: Xbox 360 Graphics Demystified.” 13 June 2005. Beyond 3D. n.d.>. [Brabec01] Brabec, Stefan and Hans-Peter Seidel. “Hardware-Accelerated Rendering of Antialiased Shadows with Shadow Maps.” Proceedings of the International Conference on Computer Graphics (July 2001): 209. ACM Portal. [Fernando01] Fernando, Randima, Sebastian Fernandez, Kavita Bala, and Donald P. Greenberg. “Adaptive Shadow Maps.” Proceedings of the 28th Annual Conference on Computer Graphics and interactive Techniques (2001): 387–390. ACM Portal. [Hasselgren05] Hasselgren, Jon, Tomas Akenine-Möller, and Lennart Ohlsson. “Conservative Rasterization.” GPU Gems 2. 2005. NVIDIA. n.d. . 88 Section 1 Graphics 1.6 Eye-View Pixel Anti-Aliasing for Irregular Shadow Mapping 89 [Hux10] Hux, Allen. “Overlapped Execution on Programmable Graphics Hardware.” Game Programming Gems 8. Ed. 
Adam Lake. Boston: Charles River Media, 2010.

[Johnson04] Johnson, Gregory S., William R. Mark, and Christopher A. Burns. "The Irregular Z-Buffer and its Application to Shadow Mapping." April 2004. The University of Texas at Austin.

[Johnson05] Johnson, Gregory S., Juhyun Lee, Christopher A. Burns, and William R. Mark. "The Irregular Z-Buffer: Hardware Acceleration for Irregular Data Structures." ACM Transactions on Graphics 24.4 (October 2005): 1462–1482. ACM Portal.

[Johnson09] Johnson, Gregory S., Allen Hux, Christopher A. Burns, Warren A. Hunt, William R. Mark, and Stephen Junkins. "Soft Irregular Shadow Mapping: Fast, High-Quality, and Robust Soft Shadows." Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games (2009): 57–66. ACM Portal.

[Lauritzen06] Lauritzen, Andrew and William Donnelly. "Variance Shadow Maps." Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games (2006): 161–165. ACM Portal.

[Lefohn07] Lefohn, Aaron E., Shubhabrata Sengupta, and John D. Owens. "Resolution-Matched Shadow Maps." ACM Transactions on Graphics 26.4 (Oct. 2007): 20. ACM Portal.

[Lloyd08] Lloyd, D. Brandon, Naga K. Govindaraju, Cory Quammen, Steven E. Molnar, and Dinesh Manocha. "Logarithmic Perspective Shadow Maps." ACM Transactions on Graphics 27.4 (Oct. 2008): 1–32. ACM Portal.

[Robison09] Robison, Austin, and Peter Shirley. "Image Space Gathering." Proceedings of the Conference on High Performance Graphics 2009 (2009): 91–98. ACM Portal.

[Salvi08] Salvi, Marco. "Rendering Filtered Shadows with Exponential Shadow Maps." ShaderX6: Advanced Rendering Techniques. Ed. Wolfgang Engel. Boston: Charles River Media, 2008. 257–274.

[Sen03] Sen, Pradeep, Mike Cammarano, and Pat Hanrahan. "Shadow Silhouette Maps." ACM Transactions on Graphics 22.3 (July 2003): 521–526. ACM Portal.

[Sintorn08] Sintorn, Erik, Elmar Eisemann, and Ulf Assarsson.
"Sample Based Visibility for Soft Shadows using Alias-free Shadow Maps." Computer Graphics Forum: Proceedings of the Eurographics Symposium on Rendering 2008 27.4 (June 2008): 1285–1292.

[Stamminger02] Stamminger, Marc and George Drettakis. "Perspective Shadow Maps." Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (2002): 557–562. ACM Portal.

[Williams78] Williams, Lance. "Casting Curved Shadows on Curved Surfaces." ACM SIGGRAPH Computer Graphics 12.3 (Aug. 1978): 270–274. ACM Portal.

1.7 Overlapped Execution on Programmable Graphics Hardware
Allen Hux, Intel Advanced Visual Computing (AVC)

Some graphics algorithms require data structure construction and traversal steps that do not map well to constrained graphics pipelines. Additionally, because of the dependencies between rendering and non-rendering passes, much (or all) of the compute power of the device may go idle between steps of a given algorithm. In this gem, we examine techniques for executing non-rendering algorithms concurrently with traditional rendering on programmable graphics hardware, such as Larrabee. Such programmable graphics devices enable fine-grained signaling and event graphs, allowing algorithmic stages to "overlap." As a working model, we present an implementation of Irregular Z-Buffer (IZB) shadows, an algorithm requiring both standard rendering passes (for example, a depth-only pre-pass) and parallelized data structure construction. Identifying rendering and non-rendering work that is not interdependent reveals opportunities to remove the stalls that currently occur when switching between the two types of workloads. The APIs discussed in this article are examples and not necessarily representative of APIs provided with a particular product.

Introduction to Irregular Z-Buffer Shadows

The simplest shadow mapping algorithm requires two passes. First, render the scene from the light view to get a light-view depth buffer (at one resolution).
Second, render the scene from the eye view (at the same or a different resolution). For each point, test its visibility by projecting it into the light view and comparing its depth to the one found in the first pass. If the eye-view depth is greater, then something must be between the light and the point being tested, and therefore the point must be in shadow. This algorithm fits entirely within conventional rendering pipelines [Williams78]. The problem with this approach is aliasing: The resolution of the light-view plane will never precisely match the sampling frequency when projecting from the eye view, resulting in visual artifacts. As described in the previous gem, an irregular Z-buffer stores precise eye-view positions in the light-view grid, enabling accurate shadow determination from the eye view [Johnson04, Johnson09, Galoppo10]. Shadow mapping with an irregular Z-buffer is a multi-step process involving three rendering passes interleaved with two non-rendering steps.

1. Render the scene from the eye view, depth only.
2. Transform the eye-view points to light view.
   a. For each point, atomically increment the corresponding pixel in the light-view plane. (A bigger plane improves parallelism at the cost of memory.)
   b. Parallel prefix sum the indices, resulting in a mapping from each eye-view receiver to a light-view pixel.
   c. Scatter the light-view values into a light-view-friendly 1D structure. Indices into the 1D structure are stored in the light-view plane.
3. Render the scene from the light view. Instead of executing a traditional pixel shader, test the triangle bounds against the points in the data structure (referring to the indices from Step 2c). Points that are inside and behind the triangle are in shadow. Set a bit in the data structure marking this point as in shadow (occluded).
4.
Create a standard shadow map by traversing the data structure and scattering out the occlusion value to a traditional 2D image. We call this the "deswizzle" step.
5. Render the scene from the eye view again, using the shadow map.

For our purposes, this serves as an example of an algorithm that has some rasterization steps intermingled with algorithmic work one might normally implement in C++. The second step in particular can result in quite a bit of idle hardware, because it requires a dependency chain of parallel and serial workloads. Assuming we have a graphics system capable of performing the compute steps of IZB (shown in Figure 1.7.1) and executing a pixel shader capable of traversing the IZB data structure, we get the simple dependency graph of tasks shown in Figure 1.7.2. Each stage in the graph cannot start until the last thread completes the last task of the prior stage. Because there is almost always a "long pole" that determines the duration of a stage in the algorithm, nearly every thread in the system experiences some amount of idle time. This is suggested by the activity bars in Figure 1.7.2. (Imagine these correspond to the execution time on a system with four threads.) In the remainder of the gem, we will describe how we can reclaim some of those lost cycles on a programmable architecture, such as Larrabee.

Figure 1.7.1 Building the IZB data structure, effectively a grid of lists. Regular points from the eye view (top left) are transformed into the light view (top middle). A count of the number of pixels is kept in the light grid (bottom left). A prefix sum of the counters in the light grid results in offsets (bottom right) into the 1D data structure (top right). Point data is scattered to the 1D data structure, including position and occlusion state (initially 0). The number of points in a light-view pixel can be determined by subtracting the current pixel value from the value to the right.
The offset table combined with the 1D data structure forms the grid-of-lists IZB data structure.

Figure 1.7.2 IZB dependency graph. (Tasks are completed in the order they are submitted.)

Overview of Programmable Graphics Hardware

As graphics devices become more general, they can be viewed as many-core compute devices with threads that can communicate amongst themselves (for example, via global variables and atomic instructions). Consider a representative programmable graphics architecture, Larrabee. Larrabee consists of many in-order cores on a single chip, each executing four threads in round-robin fashion, with an extended instruction set supporting 16-element vectors [Seiler08]. The cores are connected by a very high-bandwidth ring that maintains cache coherency. Several hardware texture samplers are distributed around the ring, as well as connections to GDDR. Traditional graphics APIs (DirectX, OpenGL) could be implemented on Larrabee as a typical process running within a relatively conventional operating system [Abrash09a]. A Larrabee architecture device could be an add-in adapter discrete from the traditional host CPU, it could be on-chip or on-die with the host CPU, or it could be the only processor in the system. For the purposes of this gem, we ignore the transport mechanism that enables our programs to run on the device, but it is important to realize that the techniques and code that follow are designed to execute directly on an architecture such as Larrabee. Implementing an efficient rasterizer within a many-threaded, 16-wide SIMD platform is beyond the scope of this chapter, but we can summarize it here. Maximizing parallelism is the main design goal. (Keeping the cores busy will be a recurring theme.)
The approach described by Abrash [Abrash09b] is to use a binning architecture where the bin dimensions are chosen such that the data accessed by each core (the depth-buffer and pixel data for the bin) does not exceed the cache size of the core (256 KB L2, 32 KB L1 in the current Larrabee architecture). Done properly, operations such as depth tests would require no off-core bandwidth (neither ring nor GDDR accesses), unless, of course, something wants to access that depth buffer later, which merely requires a one-time write at the end. Remember that textures are accessed via hardware texture samplers, which have their own caches. Despite the attention to bandwidth, the primary motivation for binning is to produce a lot of independent work that can be executed in parallel. A programmable graphics device, such as a software rasterizer, would build a dependency graph of rendering tasks to complete (which we will call a render graph). We can expect rendering tasks to be roughly divided between front end (for example, vertex processing/binning) and back end (rasterization/pixel shading). A dependency graph enables independent tasks to run concurrently; for example, if a core completes pixel shading within a bin, it can work on a different bin or begin vertex shading for the next frame. Dependencies can be defined in terms of resources; for example, a render task may signal that it has completed writing to a render target resource (write dependency) or wait to start until a shadow map resource (texture) is ready for use (read dependency). The task affected (front end or back end) is a function of whether the resource is bound to the vertex shader or pixel shader. Commands can be inserted within the graph to be executed when a resource is ready, for example, a CopySubResource of a render target. We can also create nodes in the graph that are signaled by outside events, such as direct memory access (DMA) transfer completions.
We will discuss dependency graphs in more detail in the subsequent sections.

Overview of Task-Based Parallelism

Non-rendering algorithms hoping to effectively use many-core architectures require an efficient tasking system, such as Cilk [Blumofe95]. Such a task system leverages a thread pool (where the number of software threads is less than or equal to the number of hardware threads) to avoid operating-system overhead from switching between threads. A good tasking system also provides the following features:

• The ability to create individual tasks or sets of tasks (task sets). A task set calls the same task function a user-defined number of times (in parallel).
• Tasks and task sets may depend upon each other. Specifically, a task or task set will not start until the tasks or task sets it depends on have completed.
• Tasks and task sets may also depend upon user-defined events. Events can be signaled by simply calling a NotifyEvent API.
• An efficient work-stealing scheduler to automatically spawn and load-balance among independent tasks.

Following is an example of what such a task API might look like. Tasks call a user function with user data. Task sets call a user function with user data and a number indicating which instance this is (0..numTasks-1).
// Create an individual work item.
typedef void (*TaskFunction)(void* in_pTaskFunctionArg);

SyncObject CreateTask(
    TaskFunction in_pTaskFunc,
    void*        in_pData,
    SyncObject*  in_pDependencies,
    int          in_numDependencies);

// Create work items that each do a subset of the work.
typedef void (*TaskSetFunction)(void* in_pTaskSetFunctionArg,
                                int   in_taskIndex,
                                int   in_taskSetSize);

SyncObject CreateTaskSet(
    int             in_numTasksInSet,
    TaskSetFunction in_pTaskFunc,
    void*           in_pData,
    SyncObject*     in_pDependencies,
    int             in_numDependencies);

SyncObject CreateEvent(bool initialState);

Since the task system itself is software, common-sense performance heuristics apply: The amount of work done in a task should be sufficient to compensate for the overhead of the tasking system, which includes the function call into the task as well as some amount of communication to synchronize with internal task queues. For example, to perform an operation across a 1024×1024 image, don't create one million tasks. Instead, create a small multiple of the number of hardware threads in the system. For a system with 100 hardware threads, 400 or 500 equally sized tasks would give the task system some opportunity to greedily load balance, while each task of approximately 2,000 pixels would give good opportunities for prefetching and loop unrolling. (A Larrabee-optimized routine would operate on 16 pixels at a time, hence a loop of only about 100 iterations.) Our experiments on desktop x86 machines show that tasks of a few thousand clocks each achieve 90 percent or better overall efficiency. Efficient graphics processing requires thread affinity knowledge. That is, the rasterizer assigns threads to cores with the expectation that those threads will share render target data in their cache. Non-rendering tasks typically function more opportunistically, executing whenever a thread becomes idle.
Hence, we design our tasks to be independent of the threads they may execute on. Since the hardware is programmable, finer control is possible but adds complexity. Even with equal-sized tasks, in a machine with that many threads, contention for resources (caches and GDDR bandwidth) will cause tasks to have variable durations. For very irregular workloads, such as the irregular Z-buffer, optimal performance may require thousands of tasks; it all depends on the algorithm.

Combining Render Graphs and Task Graphs

The ability to create non-rendering task sets that depend on, or are dependencies of, rendering tasks is the key to achieving maximum hardware utilization for these mixed-usage algorithms. To do this, we need a way to interact with the rendering dependency graph from user code. Following is a method to inject a non-rendering task into the render graph, referring to the current render context (analogous to a DirectX or OpenGL context), a function to call when the dependencies are met, user data to pass to the function, and a list of all the resource dependencies.

void CreateRenderTask(
    in_pRenderContext,
    in_pUserFunc,
    in_pUserData,
    in_pReadDependencies,  in_numReadDependencies,
    in_pWriteDependencies, in_numWriteDependencies,
    out_pRenderTask);

We then need a way to notify the render system that the render task and its read and write dependencies, as declared above, are complete and available.

void NotifyRenderTaskComplete(in_pRenderTask);

Now we have a way for tasks created using our task system to define dependencies with rendering tasks, and for rendering tasks to interact very finely with our task system. For example, if a render pass has pixel shaders bound to a resource declared as a write dependency of a user task, the front end of the render pass can start (transform/bin), but the back end cannot (rasterize/pixel shade). We need a little helper glue to efficiently communicate between "render" work and the tasks created for our "client" work.
Following, we show how an event that waits on a task set can call NotifyRenderTaskComplete() to enable dependent render work. We also show how render work can cause a callback declared in CreateRenderTask() to signal an event, thereby starting dependent client work. On the left side of Figure 1.7.3, we show how a render pass can be made to wait on client work. First, create a task set that does some client work, such as building a data structure. Next, create a render task with the data structure (resource) written to by the task set as a write dependency, no read dependencies, and no callback. Then, create a render pass with the data structure resource as a read dependency. Finally, create a task that depends on the task set and that will call NotifyRenderTaskComplete(). The render pass cannot start until the task set is complete.

Figure 1.7.3 Detail of connecting client task sets to render passes via render tasks.

On the right side of Figure 1.7.3, we show how client work can be made dependent on a render pass. First, create an event that we will signal and a task set that depends on the event. (It would do work on the render target.) Next, create the render pass, which in this case writes to a render target (write dependency). Finally, create a render task with the render target as a read dependency and a callback that will set the event. The task set cannot start until the render pass is complete.

Combined Dependency Graph

To reduce the idle time, we need to build a complete dependency graph including both rendering and non-rendering tasks. Below, we work with the following constraints:

• Tasks or task sets must have their dependencies described at creation time.
• Tasks or task sets can depend on tasks, task sets, or events.

These constraints force us to work from the end of the algorithm backwards.
Since a task will start immediately if it has no dependencies, we create events in a not-signaled state to act as gates for the task sets.

1. Create a build-data-structure event (not signaled).
2. Create a build-data-structure task set that depends on event (1).
3. Create a deswizzle event (not signaled).
4. Create a deswizzle task set that depends on event (3).
5. Create a light-view render pass where:
   a. Rasterization depends on the resource from task set (2).
   b. Render target resource completion signals event (3).
6. Create a final eye-view render pass where rasterization depends on the shadow map resource from task set (4).
7. Create a depth-only eye-view render pass that signals event (1) when its render target resource is complete.

As soon as we complete the seventh step, creating the depth-only eye-view render pass with no dependencies, the whole algorithm falls into place. As shown in Figure 1.7.4, when the depth-only pass (7) completes, it signals the build event (1), which enables the build task set (2) to start. Completion of the build task set (2) enables the light-view rasterization (5) to start. (Transform and binning should already be complete.) When the light-view render (5) completes, it signals the deswizzle event (3), which enables the deswizzle task set (4) to start. When the deswizzle task set (4) completes, it enables the final eye-view rasterization (6) to start. (Transform and binning should already be complete.)

Idle-Free Irregular Z-Buffer Shadows

Figure 1.7.5 shows the naïve linear dependency graph of the IZB algorithm discussed earlier, with the render stages expanded into front end (transform + binning) and back end (rasterization + pixel shading) for a total of eight stages, or task sets. Figure 1.7.6 shows how threads that would otherwise have become idle can instead work on non-dependent front-end rendering tasks if we start all three render passes immediately.
This overlapped execution is especially helpful for improving the performance of our irregular shadow-mapping tasks, allowing us (in this example) to fully hide the cost of the front-end rendering tasks. Another way to interpret this is that the compute part of irregular shadow mapping is essentially free when overlapped with rendering. This demonstrates another advantage of programmable hardware: Maximum performance is achieved by enabling flexible hardware to execute whatever tasks are available, rather than by partitioning the hardware into dedicated islands of computation. Compare this to the early days of graphics devices with dedicated pixel shader and vertex shader hardware: When there was more vertex work than pixel work, the idle pixel shading hardware could not be reconfigured to help out. In modern architectures, graphics processors dynamically load balance across all execution units. On a programmable architecture such as Larrabee, this load balancing can be controlled by the programmer, enabling the overlap of rendering and non-rendering tasks.

Figure 1.7.4 Order of creation of events, task sets, and render graph nodes for overlapped execution of IZB.

Figure 1.7.5 IZB dependency graph with render stages expanded into front and back end.

Conclusion

Maximizing performance on modern, many-core platforms requires identifying independent work to be executed in parallel. Non-graphics workloads leverage a system of dependent tasks and task sets to manage parallel computation. On programmable graphics devices such as Larrabee, a similar system of task dependencies can be used to identify independent and dependent graphics work, for example, binning versus pixel shading. By connecting these graphs, we can further exploit available execution units by exposing more opportunities for independent tasks to run concurrently.
Figure 1.7.6 IZB with dependency graph integrated with flexible rendering pipeline. Xform/bin tasks can complete as threads become available from non-rendering tasks, filling in gaps in execution.

References

[Abrash09a] Abrash, Michael. "A First Look at the Larrabee New Instructions (LRBni)." Dr. Dobb's. 1 April 2009.

[Abrash09b] Abrash, Michael. "Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action." Intel. March 2009.

[Blumofe95] Blumofe, Robert D., Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. "Cilk: An Efficient Multithreaded Runtime System." Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (July 1995): 207–216.

[Galoppo10] Galoppo, Nico. "Eye-View Pixel Anti-Aliasing for Irregular Shadow Mapping." Game Programming Gems 8. Ed. Adam Lake. Boston: Charles River Media, 2010.

[Johnson04] Johnson, Gregory S., William R. Mark, and Christopher A. Burns. "The Irregular Z-Buffer and its Application to Shadow Mapping." April 2004. The University of Texas at Austin.

[Johnson09] Johnson, Gregory S., Allen Hux, Christopher A. Burns, Warren A. Hunt, William R. Mark, and Stephen Junkins. "Soft Irregular Shadow Mapping: Fast, High-Quality, and Robust Soft Shadows." Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games (2009): 57–66. ACM Portal.

[Seiler08] Seiler, Larry, et al. "Larrabee: A Many Core x86 Architecture for Visual Computing." ACM Transactions on Graphics 27.3 (Aug. 2008). ACM Portal.

[Williams78] Williams, Lance. "Casting Curved Shadows on Curved Surfaces." ACM SIGGRAPH Computer Graphics 12.3 (Aug. 1978): 270–274. ACM Portal.

1.8 Techniques for Effective Vertex and Fragment Shading on the SPUs
Steven Tovey, Bizarre Creations Ltd.
When the Cell Broadband Engine was designed, Sony and the other corporations in the STI coalition always had one eye on the Cell's ability to support a GPU in its processing activities [Shippy09]. The Cell has been with us for three years now, and, like any new piece of hardware, it has taken time for developers to understand the best ways of pushing it to its limits. The likes of Mike Acton and the Insomniac Games Technology Team have been instrumental in pushing general development and coding strategies for the Cell forward, but there has been little discussion about the specific ways in which the SPUs can support a GPU. This chapter aims to introduce fundamental techniques that can be employed when developing code for the CBE that will allow it to aid the GPU in performing rendering tasks.

The CBE as Part of a Real-World System

Understanding the Cell's place in a real-world system is useful to our discussion, and, as such, we will use Sony's PlayStation 3 as our case study. PlayStation 3 contains the Cell Broadband Engine, which was developed jointly by Sony Computer Entertainment, Toshiba Inc., and IBM Corp. [Shippy09, Möller08, IBM08]. The Cell forms part of the overall architecture of the console along with the Reality Synthesizer (RSX) and two types of memory. Figure 1.8.1 shows a high-level view of the architecture. The Cell contains two distinctly different types of processor: the PowerPC Processing Element (PPE) and the Synergistic Processing Element (SPE). The PPE is essentially the brains of the chip [Shippy09] and is capable of running an operating system in addition to coordinating the processing activities of its counterpart processing elements, the SPEs. Inside PlayStation 3, there are eight SPEs. However, to increase chip yield, one is locked out, and Sony reserves another for the operating system, leaving a total of six SPEs available for application programmers.
All processing elements in the Cell are connected by a token-ring bus, as shown in Figure 1.8.2. Because the SPEs are the main focus of this chapter, they are discussed in much greater detail in the forthcoming sections.

The SPEs

Each Synergistic Processing Element is composed of two major components: the Synergistic Processing Unit (SPU) and the Memory Flow Controller (MFC).

The SPU

Detailed knowledge of the SPU instruction set and internal execution model is critical to achieving peak performance on the PlayStation 3. In the following sections, we will highlight some important facets of this unique processor.

Figure 1.8.1 The PlayStation 3 architecture (illustration modeled after [Möller08, Perthuis06]).

Figure 1.8.2 The Cell Broadband Engine (modeled after [IBM08]).

The Synergistic Execution Unit and SPU ISA

The Synergistic Execution Unit (SXU), part of the SPU, is responsible for the execution of instructions. Inside the SXU are two pipelines: the odd pipeline and the even pipeline. Instructions are issued to exactly one of these pipelines, depending on the group the issued instruction falls into (see Table 1.8.1). The SXU supports the dual issue of instructions (one from each pipeline) if and only if a very strict set of requirements is met. We will discuss these requirements in detail later.
TABLE 1.8.1 A List of Instruction Groups Together with Their Associated Execution Pipes and Latencies

Instruction Group                                                                          | Pipeline | Latency (Cycles) | Issue (Cycles)
Single-precision floating-point operations                                                 | EVEN     | 6                | 1
Double-precision floating-point operations                                                 | EVEN     | 7                | 6
Integer multiplies, integer/float conversions, and interpolation                           | EVEN     | 7                | 1
Immediate loads, logical operations, integer addition/subtraction, carry/borrow generate   | EVEN     | 2                | 1
Element-wise rotates and shifts, special byte operations                                   | EVEN     | 4                | 1
Loads and stores, branch hints, channel operations                                         | ODD      | 6                | 1
Shuffle bytes, qword rotates and shifts, estimates, gather, selection mask formation, and branches | ODD | 4              | 1

The SPU has a particularly large register file to facilitate the execution of pipelined, unrolled code without the need for excessive register spilling. Unlike its counterpart, the PPE, the register file of the SPU is unified. That is, floating-point, integer, and vector operations act on the same registers without having to move data through memory. As the SPU is a vector processing unit at its heart, its Instruction Set Architecture (ISA) is designed specifically for vector processing [IBM08a]. All 128 of the SPU's registers are 16 bytes in size, allowing up to four 32-bit floating-point values or eight 16-bit integers to be processed with each instruction. While a full analysis of the SPU's ISA is beyond the scope of this gem, there are a number of instructions worth discussing in greater detail that are particularly important for efficient programming of the SPU. The first of these instructions is selb, or "select bits." The selb instruction performs branchless selection on a bitwise basis and takes the form selb rt, ra, rb, rm. For each bit of a quadword, this instruction uses the mask register (rm) to determine which bits of the source registers (ra and rb) should be placed in the corresponding bits of the target register (rt).
Comparison instructions all return a quadword selection mask that can be used with selb. (The fsmbi instruction is also very useful for efficiently constructing a selection mask for use with selb.) The shuffle bytes instruction, shufb, is the key instruction for data manipulation on the SPU. The shufb instruction takes four operands, all of which are registers. The first operand, rt, is the target register. The next two operands, ra and rb, are the two quadwords that will be manipulated by the quadword pattern in the fourth operand, rp. The manipulations controlled by this fourth operand, known as the shuffle pattern, are particularly interesting. A shuffle pattern is a quadword value that works on a byte level. Each of the 16 bytes in the quadword controls the contents of the corresponding byte in the target register. For example, the 0th byte of the pattern quadword controls the value that will ultimately be placed into the 0th byte of the target register, the 1st byte controls the value placed into the 1st byte of the target register, and so on, for all 16 bytes of the quadword. Listing 1.8.1 provides an example shuffle pattern.

LISTING 1.8.1 An example shuffle pattern

const vector unsigned char _example1 =
{
    0x00, 0x11, 0x02, 0x13, 0x04, 0x15, 0x06, 0x17,
    0x08, 0x19, 0x0a, 0x1b, 0x0c, 0x1d, 0x0e, 0x1f
};

The above pattern performs a perfect shuffle, but on a byte level. (The term "perfect shuffle" typically refers to the interleaving of bits from two words.) The lower 4 bits of each byte can essentially be thought of as an index into the bytes of the first or second operand quadword. Similarly, the upper 4 bits can be thought of as an index into the registers referred to in the instruction's operands.
Since there are only two source registers, we need only concern ourselves with the LSB of this 4-bit group; in other words, 0x0x (where x denotes some value of the lower 4 bits of the byte) indexes into the contents of the ra register, and 0x1x accesses the second. It is worth noting that there are special-case pattern values that can be used to load constants with shufb; an interested reader can refer to [IBM08a] for details. A further example, in Listing 1.8.2, will aid our understanding.

LISTING 1.8.2 An example of using shufb

const vector unsigned char _example2 =
{
    0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17,
    0x08, 0x09, 0x0a, 0x0b, 0x1c, 0x1d, 0x1e, 0x1f
};

qword pattern = (const qword)_example2;
qword ra = si_ilhu(0x3f80);  // ra contains: 1.0f, 1.0f, 1.0f, 1.0f
qword rb = si_ilhu(0x4000);  // rb contains: 2.0f, 2.0f, 2.0f, 2.0f

// result contains: 1.0f, 2.0f, 1.0f, 2.0f
qword result = si_shufb(ra, rb, pattern);

In many programs, simply inlining shuffle patterns will satisfy one's data manipulation requirements, but since the final operand to shufb is simply a register, there is nothing to stop you from computing patterns dynamically in your program or from forming them with the constant-formation instructions (which should be preferred when lower latency can be achieved than the 6-cycle load from the local store). As it turns out, dynamic shuffle-pattern computation is actually critical to performing unaligned loads from the local store in a vaguely efficient manner, as we shall see later. In-depth details of the SPU ISA can be found in [IBM08a].

Local Store and Memory Flow Controller

As previously mentioned, each of the SPUs in the Cell is individually endowed with its own memory, known as its local store. The local store is (at least on current implementations of the CBE) 256 KB in size and can essentially be thought of as an L1 cache for the Synergistic Execution Unit.
Data can be copied into and out of the local store by way of the DMA engine in the MFC, which resides on each SPE and acts asynchronously of the SXU. Loads and stores to and from the local store are always 16-byte aligned and sized. Hence, processing data smaller than 16 bytes requires use of a less-than-efficient load-modify-store pattern. Accesses to the local store are arbitrated by the SPU Store and Load unit (SLS) based on priority; the DMA engine always has priority over the SXU for local store accesses.

Each DMA is part of a programmer-specified tag group. This provides a mechanism for a programmer to poll the state of the MFC to find out whether a specific DMA has completed. A tag group is able to contain multiple DMAs. The tag group is denoted by a 5-bit value internally, and, as such, the MFC supports 32 distinct tag groups [Bader07]. The DMA queue (DMAQ) is 16 entries deep in current implementations of the CBE.

Data Management

In many ways, the choice of data structure is more important than the efficiency of the operations that must be performed on it. In the following sections, we will describe a variety of data management strategies and their tradeoffs in the context of the SPU.

Multi-Buffering

All graphics programmers will be familiar with the concept of a double buffer. The multi-buffer is simply a term that generalizes the concept to an arbitrary number of buffers. In many cases two buffers will be sufficient, but sometimes a third buffer will be required to effectively hide the latency of transfers to and from the effective address space. Figure 1.8.3 shows the concept of multi-buffering. Bader suggests that each buffer should use a separate tag group in order to prevent unnecessary stalling of the SPU waiting for data that will be processed sometime in the future. Barriers and fences should be used to order DMAs within a tag group and the DMA queue, respectively [Bader07].
Multi-buffering can yield significant performance increases, but it does have a downside. Because the buffers are resident in the local store, SPE programs must be careful not to exceed the 256-KB limit. Using a reasonable size for each of the buffers in your multi-buffer (about 16 KB), in order to allow the SPU to process several vertices or pixels before requiring more data from the main address space, is a fine strategy. However, the pointer wrangling can become a little complicated if one's goal is to support a list of arbitrarily sized (and hence arbitrarily aligned) vertex formats. Conversely, when processing pixels, alignments tend to be a little more favorable and can be easily controlled by carefully selecting a reasonably sized unit of work.

Figure 1.8.3 Multi-buffering data to hide latency (modeled after [Bader07]).

Structure-of-Arrays versus Array-of-Structures

The design of data is paramount when hoping to write performant software for the SPU. Since the SPU is a SIMD vector processor, concepts familiar to those who have programmed with other vector ISAs, such as SSE on Intel chips, AltiVec on PowerPC chips, or even the VU on the PlayStation 2, are immediately transferable to the SPU. One such concept is the parallel array data layout, better known as Structure-of-Arrays (SOA). By laying data out in a format that is the transpose of its natural layout (Array-of-Structures), as can be seen in Figure 1.8.4, a programmer is often able to produce much more efficient code (most notably in those cases where vectorized data interacts with scalar data). The benefits of using an SOA layout are substantial in a lot of common cases. Listing 1.8.3 illustrates this by way of computing the squared length of a vector.

LISTING 1.8.3 Two versions of a function to calculate the squared length of a vector.
The first assumes an Array-of-Structures data layout, and the second a Structure-of-Arrays layout.

// Version 1: AOS mode – 1 vector, ~18 cycles.
qword dot_xx = si_fm(v, v);
qword dot_xx_r4 = si_rotqbyi(dot_xx, 4);
dot_xx = si_fa(dot_xx, dot_xx_r4);
qword dot_xx_r8 = si_rotqbyi(dot_xx, 8);
dot_xx = si_fa(dot_xx, dot_xx_r8);
return si_to_float(dot_xx);

// Version 2: SOA mode – 4 vectors, ~8 cycles.
qword dot_x = si_fm(x, x);
qword dot_y = si_fma(y, y, dot_x);
qword dot_z = si_fma(z, z, dot_y);
return dot_z;

Branch-Free DMAs

The cost of badly predicted branches on the SPU is quite significant. Given that the SPU does not contain any dedicated branch prediction hardware,2 the burden of responsibility falls squarely on the shoulders of the programmer (or, in the majority of cases, the compiler). There are built-in language extensions available in most SPU compilers that allow the programmer to supply branch hints, but such things assume that you have sufficient time in which to make the prediction (that is, more than 11 cycles) and that the branch is intrinsically predictable, which may not be the case. It is therefore recommended that programmers avoid branches entirely [Acton08]. Others have discussed this topic at length [Acton08, Kapoulkine09], so I will refrain from doing so here; however, I do wish to touch upon one common case where branch avoidance is not entirely obvious but is entirely trivial. IBM's SDK provides several MFC functions to initiate DMA without resorting to the manual writing of registers.3 An unfortunate side effect of such functions is that they seem to actively encourage code such as that presented in Listing 1.8.4.

2 The SXU adopts the default prediction strategy that all branches are not taken.

3 SPU-initiated DMAs are performed by writing special-purpose registers in the MFC using the wrch instruction. There are six such registers that must be written in order to initiate a DMA.
These may be written in any order, as long as the command register is written last [IBM08].

Figure 1.8.4 An Array-of-Structures layout on the left is transposed into a Structure-of-Arrays layout (illustration modeled after [Tovey10]).

LISTING 1.8.4 All-too-often encountered code to avoid issuing unwanted DMAs

if(si_to_uint(counter) > 0)
    mfc_put(si_to_uint(lsa), si_to_uint(ea),
            si_to_uint(size), si_to_uint(tag));

However, a little knowledge of the MFC can help avoid the branch in this case. The MFC contains the DMA queue (DMAQ). This queue contains SPU-initiated commands to the MFC's DMA engine. Similar to a CPU or GPU, the MFC supports the concept of a NOP. A NOP is an operation that can be inserted into the DMAQ but doesn't result in any data being transferred. A NOP for the MFC is denoted by any DMA command written with zero size. The resulting code looks something like Listing 1.8.5.

LISTING 1.8.5 Branch-free issue of DMA

qword cmp_mask = si_cgti(counter, 0x0);
qword cmp = si_andi(cmp_mask, 0x1);   // bottom bit only.
qword dma_size = si_mpy(size, cmp);   // size < 2^16
mfc_put(si_to_uint(lsa), si_to_uint(ea),
        si_to_uint(dma_size), si_to_uint(tag));

Unfortunately, the hardware is not smart enough to discard zero-sized DMA commands immediately upon the command register being written, and these commands are inserted into the 16-entry DMAQ for processing. The entry is discarded only when the DMA engine attempts to process that element of the queue. This causes a subtle downside to the employment of this technique for branch avoidance: SPE programs that issue a lot of DMAs can quickly back up the DMAQ, and issuing a zero-sized DMA can stall the SPU while it flushes the entire DMAQ. Luckily, this state of affairs can be almost entirely mitigated by a well-designed SPE program, which issues fewer, but larger, DMAs.
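At its core, the trick in Listing 1.8.5 just multiplies the transfer size by the result of the comparison. A hedged plain-C rendering of that size computation follows; branch_free_dma_size is an illustrative name standing in for the si_cgti/si_andi/si_mpy sequence, and its result would feed the size operand of mfc_put.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free DMA size: zero when there is nothing to transfer, so the
 * queued command becomes an MFC NOP instead of being guarded by a branch.
 * The SPU sequence uses si_mpy, a 16x16-bit multiply, hence the
 * "size < 2^16" comment in Listing 1.8.5; plain C has no such restriction. */
static uint32_t branch_free_dma_size(uint32_t counter, uint32_t size)
{
    return size * (uint32_t)(counter > 0); /* (counter > 0) is exactly 0 or 1 */
}
```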
Vertex/Geometry Shading

The SPUs can also lend a hand in various vertex processing tasks and, because of their general-purpose nature, can help overcome some of the shortcomings of the GPU programming model. In Blur, we were able to use the SPU to deal with awkward vertex sizes and to optimize the vehicle damage system.

Handling Strange Alignments When Multi-Buffering

Vertex data comes in all shapes and sizes, and, as a result, multi-buffering this type of data presents some challenges. When vertex buffers are created, contiguous vertices are packed tightly together in the buffer, both to save memory and to improve the performance of the pre-transform cache on the GPU. This presents an SPU programmer with a challenge when attempting to process buffers whose per-vertex alignment may not be a multiple of 16 bytes. This is a problem for two reasons. First, the DMA engine in the MFC transfers 1, 2, 4, 8, or multiples of 16 bytes, meaning that we must be careful not to overwrite parts of the buffer that we do not mean to modify. Second, loads and stores performed by the SXU itself are always 16-byte aligned [IBM08]. There are a lot of cases where a single vertex will straddle the boundary of two multi-buffers, due to vertex structures whose alignments are sub-optimal from an SPU processing point of view. The best way of coding around this problem is to simply copy the end of a multi-buffer, back to its nearest 16-byte boundary, into the start of the second multi-buffer and offset the pointer to the element you are currently processing. This means that when the second multi-buffer is transferred back to the main address space, it will not corrupt the vertices you had previously processed and transferred out of the first multi-buffer, as shown in Figure 1.8.5. Listing 1.8.6 contains code demonstrating how to handle unaligned loads from the local store.
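The copy-back-to-a-16-byte-boundary fix-up just described can be sketched as follows. This is an illustrative host-side sketch, not the Blur implementation: boundary_overlap_bytes and carry_tail are hypothetical helpers, and real SPU code would apply the same arithmetic to local-store buffers around a DMA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bytes past the last 16-byte boundary of the first buffer; these belong
 * to a vertex that straddles into the next multi-buffer. */
static uint32_t boundary_overlap_bytes(uint32_t bytes_in_first)
{
    return bytes_in_first & 0x0f;
}

/* Replay the straddling tail at the head of the second buffer; processing
 * of the second buffer then starts at second + overlap, so transferring it
 * back cannot corrupt vertices already flushed from the first buffer. */
static void carry_tail(uint8_t *second, const uint8_t *first,
                       uint32_t bytes_in_first)
{
    uint32_t overlap = boundary_overlap_bytes(bytes_in_first);
    memcpy(second, first + (bytes_in_first - overlap), overlap);
}
```

With 36-byte vertices, for example, a buffer holding three of them (108 bytes) leaves a 12-byte tail to carry over into the next buffer.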
Case Study: Car Damage in Blur

The car damage system in Blur works by manipulating a lattice of nodes that roughly represents the volume of the car. The GPU implementation makes use of a volume texture containing vectors representing the offsets of these nodes' positions from their original positions. This texture is then sampled, based on the position of the vertex being processed relative to a volume that loosely represents the car, in order to calculate position and normal offsets (see Figure 1.8.6). The texture is updated each time impacts are applied to the lattice, or when the car is repaired.

Figure 1.8.5 Avoid buffer corruption by copying a small chunk from the end of one multi-buffer into the start of another.

The GPU performs the deformation every frame because the damage is stateless and a function of the volume texture and the undamaged vertex data. Given the amount of work involved and the additional performance hit from sampling textures in the vertex unit, the performance of rendering cars in Blur was heavily vertex limited. This was something we wanted to tackle, and the SPUs were useful in doing so. Porting the entire vertex shader to the SPU was not practical given the timeframe and memory budgets, so instead we focused on moving just the damage calculations to the SPUs. This meant that the car damage vertex processing would only occur when damage needed to be inflicted on the car (instead of every frame, as with the equivalent GPU implementation), and it would greatly reduce the complexity of the vertex shader running on the GPU. The damage offsets are a function of the vertex's position and the state of the node lattice. Given the need for the original position, we must transfer the vertex data for the cars to the local store via DMA and read the position data corresponding to each vertex. This is done using a multi-buffering strategy.
Because different components of the car utilize different materials (and hence have different vertex formats), we were also forced to handle a variety of vertex alignments, as described earlier. With the vertex data of the car in the SPU local store, we are able to calculate a position and normal offset for each vertex and write these out to a separate vertex buffer. Each of these values is stored as a float4, which means the additional vertex stream has a stride of 32 bytes per vertex. An astute GPU programmer will notice the potential to pack this data into fewer bits to improve cache utilization. This is undesirable, however. The data in its 32-bytes-per-vertex form is ideal for the DMA engine because the MFC natively works in 16-byte chunks, meaning that, from the point of view of other processing elements (in our case, the GPU), a given vertex is either deformed or it is not. This is one of the tradeoffs made to mitigate the use of a double buffer. Color Plate 5 has a screenshot of this technique.

Figure 1.8.6 Position and normal offsets are applied to each vertex based on deltas stored in a volume texture.

To GPU Types and Back Again

For the most part, GPUs do their best to support the common type formats found in CPUs. The IEEE 754 floating-point format is (for better or worse) the de facto floating-point standard on pretty much all modern hardware that supports floating point.4 However, in addition to the standard IEEE 754 32-bit floats and 64-bit doubles, most shading languages offer a 16-bit counterpart known as half. The format of the half is not defined by any standard, and, as such, chip designers are free to implement their own floating-point formats for this data type on their GPUs. Fortunately, almost all GPU vendors have adopted the half format formalized by Industrial Light & Magic for their OpenEXR HDR file format [ILM09].

4 Ironically, the SPUs do not offer full IEEE 754 support, but it's very close.
This format uses a single bit to denote the sign of the number, 5 bits for the exponent, and the remaining 10 bits for the mantissa, or significand. Since the half type is regrettably absent from the C99 and C++ standards, it falls to the programmer to write routines to convert it to other data types. Acton has made available an entirely branch-free version of these conversion functions at [Acton06]. For the general case, you would be hard-pressed to better Acton's code (assuming you don't have the memory for a lookup table, as in [ILM09]). However, in many constrained cases, we have knowledge about our data that allows us to omit support for the floating-point special cases that require heavyweight conversion logic (NaNs and denormalized numbers). Listing 1.8.6 contains code to convert between an unaligned half4 and a float4 but omits support for NaNs. This is an optimization that was employed in Blur's damage system. The inverse of this function is left as an exercise for the reader.

LISTING 1.8.6 Code to convert an unaligned half4 to a qword

static inline const qword ld_float16_4(void* __restrict__ addr)
{
    const vector unsigned char _loader =
    {
        0x80, 0x80, 0x00, 0x01, 0x80, 0x80, 0x02, 0x03,
        0x80, 0x80, 0x04, 0x05, 0x80, 0x80, 0x06, 0x07
    };
    const vector unsigned char _shft =
    {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
    };
    qword target        = si_from_ptr(addr);
    qword val_lo        = si_lqd(target, 0x00);
    qword val_hi        = si_lqd(target, 0x10);
    qword sign_bit_mask = si_ilhu(0x0);
    sign_bit_mask       = si_iohl(sign_bit_mask, 0x8000);
    qword mant_bit_mask = si_ilhu(0x0);
    mant_bit_mask       = si_iohl(mant_bit_mask, 0x7fff);
    qword expo_bias     = si_ilhu(0x3800);
    qword loader        = (const qword)_loader;
    qword shft          = (const qword)_shft;
    qword offset        = si_andi(target, 0x0f);
    qword lo_byte_pat   = si_ilh(0x0303);
    qword offset_pat    = si_shufb(offset, offset, lo_byte_pat);
    qword mod_shuf      =
        si_a(shft, offset_pat);
    qword val           = si_shufb(val_lo, val_hi, mod_shuf);
    qword result        = si_shufb(val, val, loader);   // aligned
    qword sign_bit      = si_and(result, sign_bit_mask);
    sign_bit            = si_shli(sign_bit, 0x10);
    qword significand   = si_and(result, mant_bit_mask);
    significand         = si_shli(significand, 0xd);
    qword is_zero_mask  = si_cgti(significand, 0x0);
    expo_bias           = si_and(is_zero_mask, expo_bias);
    qword exponent_bias = si_a(significand, expo_bias);
    qword final_result  = si_or(exponent_bias, sign_bit);
    return final_result;
}

Benefits versus Drawbacks

Processing vertex data on the SPUs has a number of advantages; one of the most significant is that the rigidity of the GPU's processing model is largely circumvented, as you are performing processing on a general-purpose CPU. Access to mesh topology is supported, but one must be careful that these accesses do not introduce unwanted stalls as the data is fetched from the main address space. In addition, since we are using a CPU capable of general-purpose program execution, we are able to employ higher-level optimization tactics, such as early-outs or faster code paths, which would be tricky or impossible under the rigid processing model adopted by GPUs. The ability to split workloads between the SPUs and the GPU is also useful in striking the ideal balance for a given application. As with most things in graphics programming, there are some tradeoffs to be made. Vertex processing on the SPU can in many cases require that vertex buffers be double buffered, meaning a significantly increased memory footprint. The situation is only aggravated if there is a requirement to support multiple instances of the same model. In this case, each instance of the base model may also require a double buffer.
This can be mitigated to some extent by carefully designing the vertex format to support atomic writes of individual elements by the DMA engine, but the practicality of this is highly application-specific and certainly doesn't work in the case of instances. Clever use of a ring buffer can also solve this problem to some extent, but it introduces additional problems with SPU/GPU inter-processor communication.

Fragment Shading

Fragment shading in the traditional sense is heavily tied to the output of the GPU's rasterizer. Arbitrarily "hooking into" the graphics pipeline to have the SPUs perform general-purpose fragment shading with current generations of graphics hardware is effectively impossible. However, performing the heavy lifting for certain types of fragment shading that do not necessarily require the use of the rasterizer, or even helping out the GPU with some pre-processing as in [Swoboda09], is certainly feasible and in our experience has yielded significant performance benefits in real-world applications [Tovey10]. This section discusses some of the techniques that will help you get the most out of the SPUs when shading fragments.

Batch! Batch! Batch!

It might be tempting with initial implementations of pixel processing code on the SPU to adopt the approach of video hardware such as the RSX, which processes pixels in groups of four, known as quads [Möller08]. For sufficiently interleavable program code—in other words, program code that contains little dependency between operations that follow one another—this may be a good approach. However, in our experience, larger batches can produce better results with respect to pixel throughput because there is a greater volume of interleavable operations. Too few pixels result in large stalls between dependent operations, time that could be better spent performing pixel shading, while overly large batches cause high register pressure and, ultimately, spilling.
Moreover, in many applications that have a fixed setup cost for each processing batch, a larger batch means you are doing more work for little to no extra setup overhead. So, what is the upper bound on the number of pixels to process in a single batch of work? Can we simply process the entire buffer at once? The answer is not obvious and depends on a number of factors, including the complexity of your fragment program and the number of intermediate values occupying registers at any one time. Typically, the two are inextricably linked. As mentioned earlier, the SXU contains 128 registers, each 16 bytes in size. It is the task of the compiler to multiplex all live variables in your program onto this limited register file.5 When there are more live variables than there are registers—in other words, when register pressure is high—the contents of some or all of the registers (depending on the size of the register file) have to be written back to memory and restored later. This is known as spilling registers. The more pixels one attempts to process in a batch, the higher the register pressure for that function will be, and the greater the likelihood that the compiler will have to spill registers back to the stack. Spilling registers can become very expensive if done to excess. The optimum batch size is hence the largest number of pixels that one can reasonably process without spilling any registers back to the local store and without adding expense to the setup code for the batch of pixels.

Pipeline Balance Is Key!

An efficient, well-written program will be limited by the number of instructions issued to the processor. Processors with dual-issue capabilities, such as the SPU, have the potential to dramatically decrease the number of cycles that a program consumes.

5 The process of mapping multiple live variables onto a limited register file is known as register coloring.
Register coloring is a topic in its own right, and we will not cover it in detail here.

Pipeline balance between the odd and even execution pipelines is critical to achieving good performance with SPU programs. We will now discuss the requirements for instruction dual-issue and touch briefly on techniques to maximize instruction issue (through dual-issue) for those programmers writing in assembly. The SPU can dual-issue instructions under a very specific set of circumstances. Instructions are fetched in pairs from two very small instruction buffers [Bader07], and the following must all be true if dual-issue is to occur:

• The instructions in the fetch group must be capable of dispatch to separate execution pipelines.
• The alignment of the instructions must be such that the even-pipeline instruction occupies an even-aligned address in the fetch group, and the odd-pipeline instruction the odd-aligned address.
• Finally, there must be no dependencies, either between the two instructions in the fetch group or between any one of the instructions in the fetch group and another instruction currently being executed in either of the pipelines.

Programmers writing code with intrinsics rarely need to worry about instruction alignment. The addition of nops and lnops in intrinsic form does not typically help the compiler to better align your code for dual-issue, and, in many cases, the compiler will do a reasonable job of instruction balancing. However, if you're programming in assembly language, the use of nop (and its odd-pipeline equivalent, lnop) will be useful in ensuring that code is correctly aligned for dual-issue. Of course, care must be taken not to overdo it and actually make the resulting code slower. A good rule of thumb is never to insert more than two nops/lnops.
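To illustrate the batching argument from the previous section, consider the toy host-side loop below (this is not an SPU shader, and BATCH = 4 is arbitrary): each iteration carries four independent dependency chains, giving the scheduler interleavable work at the cost of more live registers — exactly the pressure/throughput tradeoff described above.

```c
#include <assert.h>
#include <stddef.h>

enum { BATCH = 4 };

/* Toy "fragment" operation applied in batches; none of the four chains
 * in an iteration feeds another, so their latencies can overlap. A larger
 * BATCH raises throughput until register spilling sets in. */
static void shade_batched(float *out, const float *in, size_t count)
{
    size_t i = 0;
    for (; i + BATCH <= count; i += BATCH) {
        float a = in[i + 0] * 2.0f + 1.0f;
        float b = in[i + 1] * 2.0f + 1.0f;
        float c = in[i + 2] * 2.0f + 1.0f;
        float d = in[i + 3] * 2.0f + 1.0f;
        out[i + 0] = a; out[i + 1] = b;
        out[i + 2] = c; out[i + 3] = d;
    }
    for (; i < count; ++i)               /* scalar remainder */
        out[i] = in[i] * 2.0f + 1.0f;
}
```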
Case Study: Light Pre-Pass Rendering in Blur

Light pre-pass rendering is a variant of deferred shading first introduced by Wolfgang Engel on his blog [Engel08] and later in [Engel09, Engel09a]; at about the same time, it was derived independently by Balestra et al. for use in Uncharted: Drake's Fortune [Balestra08]. The techniques behind light pre-pass rendering are well understood and are discussed elsewhere [Engel08, Balestra08, Engel09, Engel09a, Tovey10], so a brief summary will suffice here. As with all deferred rendering, the shading of pixels is decoupled from scene complexity by rendering out "fat" frame buffers for use in an image-space pass [Deering88, Saito90]. Light pre-pass rendering differs slightly from traditional deferred shading in that only the data required for the lighting calculations is written to the frame buffer during an initial rendering pass of the scene. This has several advantages, including a warm Z-buffer and a reduced impact on bandwidth requirements, at the expense of rendering the scene geometry twice. Because one of the main requirements for the new engine written for Blur was that it be equipped to handle a large number of dynamic lights, the light pre-pass renderer was a very attractive option. After implementing a light pre-pass renderer for Blur (which ran on the RSX), it became apparent that we could get significant performance gains from offloading the screen-space lighting pass to the SPUs.6 The lighting calculations in Blur are performed on the SPU in parallel with other non-dependent parts of the frame. This means that as long as we have enough rendering work for the RSX, the lighting has no impact on the latency of a frame. Processing of the lighting buffer is done in tiles, the selection of which is managed through the use of the SPE's atomic unit. When the tiles have been processed, the RSX is free to access the lighting buffer during the rendering of the main pass.
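The tile-selection scheme can be sketched with C11 atomics standing in for the SPE's atomic unit. This is a hedged sketch, not Blur's code: claim_tile and next_tile are hypothetical names, and in the real system the shared counter would live in main memory and be updated with the SPE's atomic operations.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint next_tile; /* shared across all worker SPUs/threads */

/* Claim the next unprocessed tile of the lighting buffer; returns 0 once
 * every tile has been handed out. Each tile is claimed exactly once no
 * matter how many workers race on the counter. */
static int claim_tile(unsigned num_tiles, unsigned *tile_out)
{
    unsigned t = atomic_fetch_add_explicit(&next_tile, 1u,
                                           memory_order_relaxed);
    if (t >= num_tiles)
        return 0;
    *tile_out = t;
    return 1;
}
```

Because workers pull tiles rather than being assigned fixed ranges, a slow tile (say, one covered by many lights) does not leave other SPUs idle.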
The results of our technique are shown in Color Plate 6 and discussed in greater detail in [Swoboda09, Tovey10].

Benefits versus Drawbacks

The SPUs are powerful enough to perform fragment processing. This has been demonstrated by developers with deferred shading, post-processing, and so on [Swoboda09, van der Leeuw09, Tovey10]. While general-purpose fragment shading is not possible, it is possible to perform a plethora of image-space techniques on the SPUs, including motion blur, depth of field, shadowing, and lighting. Parallelization with other non-related rendering work on the GPU can provide an extra gain if one's goal is to minimize frame latency. Such gains can even be made without the expense of an increased memory footprint. Rasterization on the SPUs has been achieved by a number of studios with good results, but the use cases for this technique are somewhat restricted, usually being reserved for occlusion culling and the like rather than general-purpose rendering. Rasterization aside, the most serious drawback to performing fragment shading on the SPUs is the lack of dedicated texture-mapping hardware. Small textures may be feasible, as they will fit in the limited local store, but for larger textures, or multiple textures, software caching is currently considered to be the best approach [Swoboda09].

Further Work

Due to the highly flexible nature of the SPUs in augmenting the processing power of the GPU, it is hard to suggest avenues of further work with any certainty. However, there are a few significant challenges that warrant additional research in order to further improve the feasibility of some graphics techniques on the SPUs. Texture mapping is one such avenue of research. Currently, the best that has been done is the use of a good software cache [Swoboda09] to try to minimize the latency of texture accesses from the SPUs.
Taking inspiration from other convergent architectures, namely Intel's Larrabee [Seiler08], we believe that the employment of user-level threads on the SPUs as a mechanism for hiding latency could go some way toward alleviating the prohibitively slow texture access speeds currently endured by graphics programmers seeking to help the GPU along with the SPUs. Running two to four copies of the same SPU program (albeit with offline modifications to the program's byte code) could allow a programmer to trade space in the local store for processing speed. The idea is simple: Each time a DMA is initiated, the programmer performs a lightweight context switch to another version of the program residing in the local store, which can be done cheaply if the second copy does not make use of the same registers. The hope is that by the time we return to the original copy, the data we requested has arrived in the local store, allowing us to process it without delay. Such a scheme would impose some limitations but could be feasible for small stream kernels, such as shaders.

Conclusion

The SPUs are fast enough to perform high-end vertex and fragment processing. While they are almost certainly not going to beat the GPU in a like-for-like race (in other words, in the implementation of a full graphics pipeline), they can be used in synergy with the GPU to supplement processing activities traditionally associated with rendering. The option to split work between the two processing elements makes them great tools for optimizing the rendering of specific objects in a scene.

6 Coincidentally, it was around this time that Matt Swoboda presented his work in a similar area, in which he moved a fully deferred renderer to the SPUs [Swoboda09]; Matt's work and willingness to communicate with us was useful in laying the groundwork for our implementation in Blur.
The deferred lighting and car damage systems in Blur demonstrate the potential of the SPUs to work harmoniously with the GPU to produce impressive results. Looking to the future, the ever-growing popularity and prevalence of deferred rendering techniques on current generations of hardware further empowers the SPUs to deliver impressive improvements to the latency of a frame and allows game developers to get closer to synthesizing reality than ever before.

Acknowledgements

I would like to thank the supremely talented individuals of the Bizarre Creations Core Technologies Team for being such a great bunch to work with, with special thanks reserved for Steve McAuley for being my partner in crime in our SPU lighting implementation. Thanks also go to Andrew Newton and Neil Purvey at Juice Games for our numerous discussions about SPU coding, to Matt Swoboda of SCEE R&D for our useful discussions about SPU-based image processing, and to Wade Brainerd of Activision Central Technology for his helpful comments, corrections, and suggestions. Last but not least, thanks also to Jason Mitchell of Valve for being an understanding and knowledgeable section editor!

References

[Acton06] Acton, Mike. "Branch-Free Implementation of Half Precision Floating Point." CellPerformance. 17 July 2006. Mike Acton. 2 July 2009.
[Acton08] Acton, Mike and Eric Christensen. "Insomniac SPU Best Practices." Insomniac Games. 2008. Insomniac Games. 2 July 2009.
[Bader07] Bader, David A. "Cell Programming Tips & Techniques." One-Day IBM Cell Programming Workshop at Georgia Tech. 6 Feb. 2007. Georgia Tech College of Computing. 2 July 2009.
[Balestra08] Balestra, Christophe and Pal-Kristian Engstad. "The Technology of Uncharted: Drake's Fortune." Game Developers Conference. 2008. Naughty Dog Inc. n.d.
[Deering88] Deering, Michael, et al.
"The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics." Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques (1988): 21–30. ACM Portal.
[Engel08] Engel, Wolfgang. "Light Pre-Pass Renderer." Diary of a Graphics Programmer. 16 March 2008. 4 July 2009.
[Engel09] Engel, Wolfgang. "Designing a Renderer for Multiple Lights: The Light Pre-Pass Renderer." ShaderX7: Advanced Rendering Techniques. Ed. Wolfgang Engel. Boston: Charles River Media, 2009. 655–666.
[Engel09a] Engel, Wolfgang. "The Light Pre-Pass Renderer Mach III." To appear in proceedings of ACM SIGGRAPH 2009.
[IBM08] "Cell Broadband Engine Programming Handbook." IBM. 19 April 2006. IBM. n.d.
[IBM08a] "Synergistic Processing Unit Instruction Set Architecture." IBM. 27 Jan. 2007. IBM. n.d.
[IBM09] "The Cell Project at IBM Research." IBM. n.d. IBM. 4 July 2009.
[ILM09] "OpenEXR." OpenEXR. n.d. Lucas Digital Limited. 4 July 2009.
[Kapoulkine09] Kapoulkine, Arseny. "View frustum culling optimization—never let me branch." What Your Mother Never Told You About Graphics Development. 1 March 2009. 21 July 2009.
[Möller08] Akenine-Möller, Thomas, Eric Haines, and Naty Hoffman. Real-Time Rendering, 3rd Edition. Wellesley, MA: A K Peters, Ltd., 2008.
[Perthuis06] Perthuis, Cedric. "Introduction to the Graphics Pipeline of the PS3." Eurographics 2006. Austrian Academy of Sciences, Vienna, Austria. 6 Sept. 2006.
[Saito90] Saito, Takafumi and Tokiichiro Takahashi. "Comprehensible Rendering of 3-D Shapes." ACM SIGGRAPH Computer Graphics 24.4 (1990): 197–206. ACM Portal.
[Seiler08] Seiler, Larry, et al. "Larrabee: A Many-Core x86 Architecture for Visual Computing." ACM Transactions on Graphics 27.3 (Aug. 2008). ACM Portal.
[Shippy09] Shippy, David and Mickie Phipps.
The Race for a New Games Machine: Creating the Chips Inside The New Xbox360 & The Playstation 3. New York: Citadel Press, 2009. [Swoboda09] Swoboda, Matt. “Deferred Lighting and Post Processing on PLAYSTATION®3.” Game Developers Conference. 2009. Sony Computer Entertainment Eurpoe, Ltd. n.d. . [Tovey10] Tovey, Steven and Steven McAuley. “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine™.” GPU Pro: Advanced Rendering Techniques. Natick, MA: A K Peters Ltd., 2010. [van der Leeuw09] van der Leeuw, Michiel. “The PLAYSTATION3’s SPUs in the Real World—KILLZONE2 Case Study.” Game Developers Conference 2009. Moscone Center, San Francisco, CA. 25 March 2009. 118 Section 1 Graphics 119 SECTION 2 PHYSICS AND ANIMATION Introduction Jeff Lander, Darwin 3D, LLC Game creation as a business and an art has become much more mature, in years of experience, in complexity, and in controversy of the material covered. For the most part, long gone are the days when text adventures, simple frame flipping, and sprite- based animation ruled the top-ten lists of gamers’ hearts. Our games need to be much more real and complex to compete with the ever-increasing expectations of our audience. Nowhere is this more evident than in the character performances and physical simulations of real (or imaginary) worlds. Players exposed to the amazing worlds that television and filmmakers can create with visual effects rightly believe that their games should reflect these advances and expectations as well. We need to bring our charac- ters to life. We need to create worlds where the rules and systems that govern the real- ity have a basis in physical realism and have a consistency that is at once familiar and exciting for our players. We are past the point where players can be amazed by a sim- ple animation clip of a character running or watching a ball bounce on the ground. They have seen that all before. 
They now expect a game's characters to react to the worlds around them in an intelligent way, as well as interact with the world in a way that models the physical interactions in our real-life experiences. The gems in this section represent the intersection of the physical interactions and animated performances that we need in order to bring more of the illusion of life to our characters and worlds. In "A Versatile and Interactive Anatomical Human Face Model," Marco Fratarcangeli discusses how to bring more realistic movement to facial animation to directly attack the problems with facial performance. This gem models the underlying facial animation systems with physical simulation. In this same way, physical simulation is used in the gems "Application of Quasi-Fluid Dynamics for Arbitrary Closed Meshes" and "What a Drag: Modeling Realistic Three-Dimensional Air and Fluid Resistance" to improve the realism in our interactive worlds. "Particle Swarm Optimization for Game Programming" discusses applying easy-to-use particle simulation techniques to a variety of optimization problems. Much of this is pretty advanced stuff. We will not be discussing how to play back an animation on a hierarchical character or how to simulate and detect a collision between two objects. You are expected to be masters of that kind of low-level system by now. We are attacking larger problems now. For example, it is no longer sufficient for our characters to follow a simple piecewise linear path when moving across a space. That simply looks too mechanical and robotic. In "Curved Paths for Seamless Character Animation," Michael Lewin discusses how it is necessary to smooth the results from our AI pathfinding systems to create a movement path that follows a much more realistic curve, while still avoiding all the obstacles that may be in the way.
Adding to our character performance improvements, in Philip Taylor's "Non-Iterative, Closed-Form, Inverse Kinematic Chain Solver," the existing iterative IK techniques are improved with an easy-to-understand, closed-form solution. It is also not enough to take for granted little code snippets for numerical integration we have seen posted on the net. It is important now for us to have a deeper understanding of what is going on when we use something such as Euler integration for a physical simulation and why it is important that we understand the error inherent in these algorithms. "Improved Numerical Integration with Analytical Techniques" looks directly at these issues and proposes methods to increase the accuracy in our simulations. As we continue to create amazing new projects and push the envelope for what is possible to do in a game, I believe some of the most important steps forward will come at this intersection of animation and physics. As our animated characters become more physically aware and grounded in our game environments, and our simulated worlds become inhabited by these responsive and emotional characters, our games will make huge leaps forward in connecting to our audience. I hope these gems provoke some new ideas and encourage you all to inject just a little more life into your virtual creations. 2.1 A Versatile and Interactive Anatomical Human Face Model Marco Fratarcangeli In a compelling virtual world, virtual humans play an important role in improving the illusion of life and interacting with the user in a natural way. In particular, face motion is crucial to represent a talking person and convey emotive states. In a modern video game, approximately 5 to 10 percent of the frame cycle is devoted to the animation and rendering of virtual characters, including face, body, hair, cloth, and interaction with the surrounding virtual environment and with the final user.
Facial blend shapes are commonly adopted in the industry to efficiently animate virtual faces. Each blend shape represents a key pose of the face while performing an action (for example, a smiling face or raised eyebrows). By interpolating the blend shapes with each other, as depicted in Figure 2.1.1, the artist can achieve a great number of facial expressions in real time at a negligible computational cost. Figure 2.1.1 Blend shapes interpolation. However, creation of the facial blend shapes is a difficult and time-consuming task, and, more likely than not, it is still heavily dependent on the manual work of talented artists. In this gem, I share principles and ideas for a tool that assists the artist in authoring humanoid facial blend shapes. It is based on a physical simulation of the human head, which is able to mimic the dynamic behavior of the skull, passive tissues, muscles, and skin. The idea is to let the artist design blend shapes by simply adjusting the contraction of the virtual muscles and rotating the bony jaw. The anatomical simulation is fast enough to feed back the results in real time, allowing the artist to tune the anatomical parameters interactively. Overview Our objective is to build a virtual model of the human head that simulates its anatomy and can be adapted to simulate the dynamics of different facial meshes. The artist models a static face mesh, and then the anatomical simulation is used to generate its blend shapes. The goal of the simulation is to create shapes that generate realistic motion. The exact accuracy of the individual muscles in a real anatomical model is not the goal. The anatomical elements are simple yet expressive and able to capture the dynamics of a real head. The anatomical simulation must also be fast enough to be interactive (in other words, run at least at 30 fps) to allow the artist to quickly sketch, prototype, tune, and, where needed, discard facial poses.
The basic anatomical element is the skull. The anatomical model is not bound to the polygonal resolution or to the shape of the skull mesh; we require only that the skull mesh has a movable jaw. On top of the skull, the artist may design several layers of muscles and passive tissue (such as the fat under the cheeks), the so-called muscle map. The skull and the muscle map form the musculoskeletal structure that can be saved and reused for different faces. The musculoskeletal structure is morphed to fit the shape of the target face. Then, the face is bound to the muscles and to the skull, and thus it is animated through a simple and efficient numerical integration scheme. Numerical Simulation The anatomical model is composed of different parts, most of which are deformable bodies, such as muscles, fat, and the skin. The dynamics algorithm must be stable enough to allow interaction among these parts; it must be computationally cheap enough to carry out the computation at an interactive rate; and it must be controllable enough to allow precise tuning of the muscles' contractions. To meet these requirements, we will use Position Based Dynamics (PBD), a method introduced in [Müller06] and recently embedded in the PhysX and Bullet engines. A less formal (although limited) description was introduced in [Jakobsen03] and employed in the popular game Hitman: Codename 47. For an easy-to-understand and complete formulation of PBD, as well as other useful knowledge about physics-based animation, you can review the publicly available SIGGRAPH course [Müller08]. In this gem, I describe PBD from an implementation point of view and focus on the aspects needed for the anatomical simulation. In most of the numerical methods used in games, the position of particles is computed starting from the forces that are applied to the physical system.
For each integration step at a given time, we obtain velocities by integrating forces, and eventually we obtain the particles' positions by integrating velocities. This means that, in general, we can only influence a particle's position through forces. PBD works in a different way. It is based on a simple yet effective concept: The current position of particles can be directly set according to a customizable set of geometrical constraints defined over P, where P is the set of particles involved in the simulation. Because the constraints could cause the resulting position to change slightly, the velocity must be recalculated using the new position and the position at the previous time step. As the position is computed, the velocity is adjusted according to the current position and the position at the previous time step. The integration steps are:

(1) for each particle i do vi ← vi + Δt wi fext(xi)
(2) for each particle i do pi ← xi + Δt vi
(3) loop nbIterations times
(4)     solve the constraints C(p1, ..., pn)
(5) for each particle i do vi ← (pi − xi)/Δt
(6) for each particle i do xi ← pi

For example, let us consider the simple case of two particles traveling in space, which must stick to a fixed distance d from each other. In this case, there is only one constraint C, which is:

C(p1, p2) = |p1 − p2| − d = 0 (1)

Assuming particles of equal mass, Step (4) in the algorithm is solved by:

Δp1 = −(1/2) (|p1 − p2| − d) (p1 − p2)/|p1 − p2|
Δp2 = +(1/2) (|p1 − p2| − d) (p1 − p2)/|p1 − p2|

In this example, if a force is applied to p1, it will gain acceleration in the direction of the force, and it will move. Then, both p1 and p2 will be displaced in Steps (3) and (4) in order to maintain the given distance d. Given a constraint, finding the change in position for each particle is not difficult; however, it requires some notions of vector calculus, which are outside the scope of this gem. The process is explained in [Müller06, Müller08] and partly extended in [Müller08b]. Steps (3) and (4) solve the set of constraints in an iterative way. That is, each constraint is solved separately, one after the other.
When the last constraint is solved, the iteration starts again from the first one, and the loop is repeated nbIterations times, eventually converging to a solution. Then, the velocity is accommodated in order to compensate for the change in position Δp. In general, using a high value for nbIterations improves the precision of the solution and the stiffness of the system, but it slows the computation. Furthermore, an exact solution is not always guaranteed because the solution of a constraint may violate another constraint. This issue is partly solved by simply multiplying the change in position by a scalar constant k in [0, 1], the so-called constraint stiffness. For example, choosing k < 1 for a distance constraint leads to a dynamics behavior similar to that of a soft spring. Using soft constraints leads to soft dynamics and drastically improves the probability of finding an acceptable (and approximate) solution for the constraint set. For example, think about two rigid spheres that must be accommodated inside a cube with a diagonal length smaller than the sum of the diameters of the two spheres: The spheres simply will not fit. However, if the spheres were soft enough, they would change shape and eventually find a steady configuration to fit in the cube. This is exactly how soft constraints work: If they are soft enough, they will adapt and converge toward a steady state. In my experiments, I used a time step of 16.6 ms and found that one iteration is enough in most cases to solve the set of constraints. Rarely did I use more iterations, and never beyond four. Similar to the distance constraint, other constraints can be formulated considering further geometric entities, such as areas, angles, or volumes. The set of constraints defined over the particles defines the dynamics of a deformable body represented by the triangulated mesh. I provide the source code for the distance, bending, triangular area, and volume constraints on the companion CD-ROM.
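To make the projection step concrete, here is a minimal C++ sketch of the distance-constraint projection for two equal-mass particles, including the stiffness factor k discussed above. The vector type and function names are illustrative, not taken from the CD-ROM sources.

```cpp
#include <cassert>
#include <cmath>

// Minimal 3D vector with just the operations the projection needs.
struct Vec3 { double x, y, z; };
static Vec3 operator-(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator+(const Vec3& a, const Vec3& b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator*(double s, const Vec3& a)      { return {s * a.x, s * a.y, s * a.z}; }
static double length(const Vec3& a) { return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z); }

// Project one distance constraint C(p1, p2) = |p1 - p2| - d = 0 for two
// particles of equal mass, scaled by the stiffness k in [0, 1].
void projectDistance(Vec3& p1, Vec3& p2, double d, double k)
{
    Vec3 delta = p1 - p2;
    double len = length(delta);
    if (len < 1e-12) return;           // coincident particles: no defined direction
    Vec3 n = (1.0 / len) * delta;      // unit vector from p2 toward p1
    double c = len - d;                // constraint violation
    p1 = p1 + (-0.5 * k * c) * n;      // each particle takes half the correction
    p2 = p2 + ( 0.5 * k * c) * n;
}
```

Calling projectDistance repeatedly inside the solver loop is exactly step (4); with k = 1 a single call already restores the rest distance for an isolated pair.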
You are encouraged to experiment with PBD and build deformable bodies following the examples in the source code. PBD has several advantages:
• The overshooting problem typical of force-driven techniques, such as mass-spring networks, is avoided.
• You can control exactly the position of a subset of particles by applying proper constraints; thus, the remaining particles will displace accordingly.
• PBD scales up very well with spatial dimensions because the constraint stiffness parameter is a dimensionless number, without unit of measure.
Building the Anatomical Model We begin the process of building our virtual head by building up the low-level pieces that make up the foundation of the physical motion. To do this, we take an anatomical approach. The Skull The skull is the basis upon which the entire computational model of the face is built. It is represented by a triangulated mesh chosen by the artist. Figure 2.1.2 illustrates an example of the mesh used in our prototype. The skull mesh is divided in two parts: the fixed upper skull and the movable mandible. The latter will move by applying the rigid transformations depicted in Figure 2.1.2. Interactive Sketching of the Muscles In our model, muscles are represented by rectangular parallelepipeds, which are deformed to match the shape of the facial muscles. To define the shape of the muscles, we draw a closed contour directly on the skull surface and the already-made muscles. Figure 2.1.3 shows an example of the definition of the initial shape of a muscle in rest state. The closed contour is defined upon underlying anatomical structures—in this case, the skull. Then, a hexahedral mesh M is morphed to fit the contour. M is passively deformed as the skull moves. The closed contour is drawn through a simple ray-casting algorithm.
The position of the pointing device (I used a mouse) is projected into the 3D scene, and a ray is cast from the near plane of the frustum to the far plane, as shown in Figure 2.1.4. Figure 2.1.2 (1) Example skull mesh, (2) jaw lateral slide, (3) jaw opening, (4) jaw protruding. The intersection points form the basis of the muscle geometry, the so-called action lines. An action line is a piecewise linear curve lying on at least one mesh. The purpose of the action lines is twofold: (1) They define the bottom contour of the muscle geometry during the simulation, and (2) they provide a mechanism to control the active contraction of the muscle itself. A surface point is a point sp in S, where S is the surface represented by the mesh. A surface point sp is uniquely described by the homogeneous barycentric coordinates (t1, t2, t3) with respect to the vertices A1A2A3 of the triangular facet to which it belongs, as shown in Figure 2.1.5. The relevant attributes of a surface point are position and normal; both of them are obtained through the linear combination of the barycentric coordinates with the corresponding attributes of the triangle vertices. When the latter displace due to a deformation of the triangle, the new position and normal of sp are updated using the new attributes of the vertices (Figure 2.1.5 (b)). Each linear segment of the action line is defined by two surface points; thus, an action line is completely described by the ordered list of its surface points. Note that each single surface point may belong to a different surface S. So, for example, an action line may start on one surface, continue on another surface, and finish on a third surface. When the underlying surfaces deform, the surface points displace, and the action line deforms accordingly. Figure 2.1.3 Defining the shape of a muscle on the skull.
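Re-evaluating a surface point after a deformation is just a matter of recombining its fixed barycentric triple with the current vertex attributes. A minimal sketch (the names are mine, not from the gem's sources):

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };

// Attributes of a surface point sp = t1*A1 + t2*A2 + t3*A3, with t1+t2+t3 = 1.
// When the triangle deforms, re-evaluating with the new vertex positions (or
// normals) yields the updated sp', as in Figure 2.1.5 (b).
Vec3 surfacePoint(double t1, double t2, double t3,
                  const Vec3& a1, const Vec3& a2, const Vec3& a3)
{
    return { t1 * a1.x + t2 * a2.x + t3 * a3.x,
             t1 * a1.y + t2 * a2.y + t3 * a3.y,
             t1 * a1.z + t2 * a2.z + t3 * a3.z };
}
```

The same linear combination is applied to the vertex normals; since the triple (t1, t2, t3) never changes, a surface point costs only a few multiply-adds per frame.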
Figure 2.1.4 Casting a ray to determine the intersection point. Soft Model for a Facial Muscle Starting from a triangulated, rectangular parallelepiped (or hexahedron), each vertex is considered as a particle with mass m; particles are connected with each other to form a network of distance constraints, as shown in Figure 2.1.6. These constraints replicate some aspects of the dynamic behavior of a real face muscle, in particular resistance to in-plane compression, shearing, and tension stresses. Note that distance constraints are placed over the surface of the hexahedron, not internally. To complete the muscle model, we add bending constraints among the triangular faces of the mesh to conserve superficial tension. We also add a further volume constraint over all the particles, which makes the muscle thicker due to compression and thinner due to elongation. Figure 2.1.5 (a) A surface point sp in A1A2A3 is defined by the homogeneous barycentric coordinates (t1, t2, t3) with respect to the triangle vertices. (b) When the triangle deforms, the triple (t1, t2, t3) does not change, and sp is updated to sp'. Figure 2.1.6 Distance constraints used in the muscle model (longitudinal, latitudinal, and shear connections between the particles of the upper and bottom sides). The Muscle Map Using the interactive editor to sketch the shape of the soft tissues over the skull, we define the structure made up of intertwined muscles, cartilage, and facial tissue, mostly fat. The muscles are organized in layers. Each layer influences the deformation of the layers on top of it, but not those underlying it. See Figure 2.1.7. The muscle map comprises 25 linear muscles and one circular muscle. This map does not represent the real muscular structure of the human head; this is due to the simulated muscle model, which has simplified dynamics compared to the real musculoskeletal system.
However, even though there may not be a one-to-one mapping with the muscle map in a real head, this virtual muscle map has been devised to mimic all the main expressive functionalities of the real one. For instance, on the forehead area of a real head, there is a single large, flat sheet muscle, the frontalis belly, which causes almost all the motion of the eyebrows. In the virtual model, this has been represented by two separate groups of muscles, each one on a separate side of the forehead. Each group is formed by a flat linear muscle (the frontalis) and, on top of it, two additional muscles (named, for convenience, frontalis inner and frontalis outer). On top of them, there is the corrugator, which ends on the nasal region of the skull. Combining these muscles, the dynamics of the real frontalis belly are reproduced with a satisfying degree of visual realism, even though the single linear muscle models have simple dynamics compared to the corresponding real ones. Each simulated muscle is linked to the underlying structures through position constraints following the position of surface points. Thus, when an anatomical structure deforms, the entire set of surface points lying on it moves as well, which in turn influences the motion of the linked structures above. For instance, when the jaw, which is part of the deepest layer, rotates, all the deformable tissues that totally or partially lie on it will be deformed as well, and so on, in a sort of chain reaction that eventually arrives at the skin. Figure 2.1.7 Different layers forming the muscle map used in the experiments. Active contraction of a muscle is achieved by simply moving the surface points along the action lines. Given that the bottom surface of the muscles is anchored to the surface points through position constraints, when the latter move, the muscle contracts or elongates, depending on the direction of motion of the surface points.
Morphing the Muscle Map into the Target Face Mesh Once the skull and the muscle map are ready, they can be morphed to fit inside the target facial mesh, which represents the external skin. The morphing is done through an interpolation function, which relies on the so-called Radial Basis Functions [Fang96]. We define two sets of 3D points, P and Q. P is a set of points defined over the surface of the skull and the muscles, and Q is a set of points defined over the skin mesh. Each point of P corresponds to one, and only one, point of Q. The positions of the points in P and Q are illustrated in Figure 2.1.9 (a) and must be manually picked by the artist. These positions have been proven to be effective for describing the shape of the skull and of the human face in the context of image-based coding [Pandzic and Forchheimer02]. Given P and Q, we find the interpolation function G(p), which transforms each point pi in P into the corresponding point qi in Q. Once G(p) is defined, we apply it to all the vertices of the skull and muscle meshes, fitting them into the target mesh. Finding the interpolation function G(p) requires solving a system of linearly independent equations. For each couple (pi, qi), where pi is in P and qi is the corresponding point in Q, we set an equation like:

qi = Σj=1..n hj e^(−dij/rj)

where n is the number of points in each set, dij is the distance of pi from pj, rj is a positive number that controls the "stiffness" of the morphing, and hj is the unknown. Figure 2.1.8 The example muscle map is deformed by rotating the jaw and contracting the frontalis belly. Note how all the muscles above are properly deformed. Solving the system leads to the values of hj, j = 1, .., n, and thus G(p):

G(p) = Σj=1..n hj e^(−d(p, pj)/rj)

This particular form of Radial Basis Functions is demonstrated to always have a solution, it is computationally cheap, and it is particularly effective when the number of points in P and Q is scarce.
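As a concrete illustration, the sketch below builds and solves a small RBF system with plain Gaussian elimination. It assumes an exponential kernel e^(−d/r) with a single shared stiffness r, and uses 1D sample points for brevity; for a mesh, the same system is solved once per coordinate axis. The code is an assumption-laden sketch, not the gem's implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Solve A x = b with Gaussian elimination and partial pivoting.
std::vector<double> solveLinear(std::vector< std::vector<double> > A, std::vector<double> b)
{
    int n = (int)b.size();
    for (int col = 0; col < n; ++col) {
        int piv = col;                               // find the largest pivot
        for (int row = col + 1; row < n; ++row)
            if (std::fabs(A[row][col]) > std::fabs(A[piv][col])) piv = row;
        std::swap(A[piv], A[col]);
        std::swap(b[piv], b[col]);
        for (int row = col + 1; row < n; ++row) {    // eliminate below the pivot
            double f = A[row][col] / A[col][col];
            for (int k = col; k < n; ++k) A[row][k] -= f * A[col][k];
            b[row] -= f * b[col];
        }
    }
    std::vector<double> x(n);
    for (int row = n - 1; row >= 0; --row) {         // back substitution
        double s = b[row];
        for (int k = row + 1; k < n; ++k) s -= A[row][k] * x[k];
        x[row] = s / A[row][row];
    }
    return x;
}

// Fill A[i][j] = exp(-|p_i - p_j| / r) and solve A h = q for the weights h.
std::vector<double> solveRBFWeights(const std::vector<double>& p,
                                    const std::vector<double>& q, double r)
{
    int n = (int)p.size();
    std::vector< std::vector<double> > A(n, std::vector<double>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] = std::exp(-std::fabs(p[i] - p[j]) / r);
    return solveLinear(A, q);
}

// Evaluate G(x) = sum_j h_j exp(-|x - p_j| / r) at an arbitrary point x.
double evalRBF(const std::vector<double>& p, const std::vector<double>& h,
               double r, double x)
{
    double s = 0.0;
    for (size_t j = 0; j < p.size(); ++j)
        s += h[j] * std::exp(-std::fabs(x - p[j]) / r);
    return s;
}
```

By construction, G interpolates the picked points exactly: evaluating at any pi returns the corresponding qi, while points in between are deformed smoothly.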
Figure 2.1.9 (b) shows an example of fitting the skull into a target skin mesh. Skin Skin is modeled as a deformable body; its properties are defined by geometrical constraints in a similar way to the muscles. The skin is built starting from the target face mesh provided as input. Each vertex in the mesh is handled as a particle with a mass, which is set to 1.0. After the skull and the muscle map are fitted onto the skin mesh, further constraints are defined to bind the skin to the underlying musculoskeletal structure. For each particle p in the skin mesh, a ray is cast along the normal, toward the outer direction. In fact, after the fitting, portions of some muscles may stay outside the skin. By projecting in the outer direction, the skin vertices are first bound to these muscles. If no intersection is found, then another ray is cast in the opposite direction of the normal, toward the inner part of the head. Figure 2.1.9 (a) The sets of points P and Q picked on the skull and on the face mesh, respectively. (b) The outcome of the morphing technique. The ray is tested against the muscles from the most superficial to the deepest one. If the ray does not intersect any muscle, then the skull is tested. The normal of the skin particle is computed by averaging the normals of the star of faces to which the particle belongs. If an intersection is found, then a surface point sp is defined on the intersected triangular face at the position where the ray intersects the face. A particle q is added to the system, and it is bound through a position constraint to sp. A stretching constraint is placed between the particles p and q. When the skull and the muscles move, the positions of the surface points will change accordingly.
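The skin-binding step needs a ray/triangle query that also yields barycentric coordinates for the new surface point. One common choice, shown below as a sketch (the gem does not specify which test it uses), is the Möller–Trumbore algorithm:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}

// Moller-Trumbore ray/triangle test. On a hit, t is the ray parameter and
// (u, v) are barycentric coordinates of the hit point -- exactly what is
// needed to create a surface point on the intersected face.
bool rayTriangle(const Vec3& orig, const Vec3& dir,
                 const Vec3& a, const Vec3& b, const Vec3& c,
                 double& t, double& u, double& v)
{
    const double eps = 1e-9;
    Vec3 e1 = sub(b, a), e2 = sub(c, a);
    Vec3 p = cross(dir, e2);
    double det = dot(e1, p);
    if (std::fabs(det) < eps) return false;   // ray parallel to the triangle
    double inv = 1.0 / det;
    Vec3 s = sub(orig, a);
    u = dot(s, p) * inv;
    if (u < 0.0 || u > 1.0) return false;
    Vec3 q = cross(s, e1);
    v = dot(dir, q) * inv;
    if (v < 0.0 || u + v > 1.0) return false;
    t = dot(e2, q) * inv;
    return t > eps;                           // hit in front of the ray origin
}
```

The binding code would run this test per skin vertex, first along the outward normal, then along the inward one, stopping at the first (most superficial) hit.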
The set of added particles q is updated as well because it is bound to the surface points through the corresponding position constraints and will displace the skin particles, whose final motion will also depend on the other involved constraints. Conclusion Although not very accurate from the point of view of biomechanics, the presented anatomical model is able to simulate convincing facial poses, including macro-wrinkles, which can be used as blend shapes, as shown in Color Plate 7. The model is stable, robust, and controllable, which is critical in interactive tools for producing video game content. The anatomical model is adaptive enough to directly animate the face mesh provided by the artist; thus it does not lead to the artifacts associated with motion retargeting. It does not require expensive hardware, and it may run on a consumer-class PC while still providing interactive feedback to the artist. References [Fang96] Fang, Shiaofen, Raghu Raghavan, and Joan T. Richtsmeier. "Volume Morphing Methods for Landmark Based 3D Image Deformation." International Symposium on Medical Imaging. 2710 (1996): 404–415. [Fratarcangeli08] Fratarcangeli, Marco. "A Computational Musco-Skeletal Model for Animating Virtual Faces." Ph.D. thesis, Università degli Studi di Roma "La Sapienza." 2008. [Jakobsen03] Jakobsen, T. "Advanced Character Physics." 2003. Gamasutra. n.d. [Müller06] Müller, M., B. Heidelberger, M. Hennix, and J. Ratcliff. "Position Based Dynamics." J. Vis. Commun. 18.2 (2007): 109–118. [Müller08] Müller, M., D. James, J. Stam, and N. Thuerey. "Real Time Physics." SIGGRAPH 2008 Course Notes. 2008. [Müller08b] Müller, M. "Hierarchical Position Based Dynamics." Proceedings of Virtual Reality Interactions and Physical Simulations. Grenoble, 2008. [Pandzic and Forchheimer02] Pandzic, Igor S. and Robert Forchheimer. MPEG-4 Facial Animation—The Standard, Implementation and Applications, 1st ed. John Wiley & Sons, 2002.
2.2 Curved Paths for Seamless Character Animation Michael Lewin The latest generation of consoles brings with it the promise of almost lifelike graphics and animation. Yet for all the beautiful artwork, we are still seeing artifacts, such as character sliding, that break the illusion of the virtual world. Sliding occurs when we apply any extra rotation or translation to an animation that does not correspond to the way the character's limbs are moving. This causes movement in which the feet are not planted firmly on the ground, such as running on the spot, shifting left or right while walking, rotating unnaturally while walking or standing, or sliding upwards or downwards while walking on stairs. A tension exists between our desire for realistic human animation and our need to maintain precise control of the character's position, orientation, and velocity. In many projects, there is not sufficient synergy between the AI and animation layers to satisfy both requirements simultaneously. This gem presents a general way to adapt pathfinding techniques to better interact with animation selection. We developed a technique at Sony Computer Entertainment that uses cubic Bezier curves to allow the AI system to be more closely coupled with dynamics and animation, considering the constraints of the character's movement. Related Work [Johnson06] describes a method for fitting Bezier curves inside the cells of a navigation mesh to create a smooth path. This gem describes in a more general way how Bezier curves can be fitted to any piecewise linear path. Finding a Path All characters need to move around their environment, which usually involves fast-paced dynamic changes and contains unpredicted obstacles.
This problem is just as relevant to robot motion planning in the real world as it is to character movement in simulations such as games, and there is a wealth of literature on the subject applied to both domains. The vast majority of pathfinding solutions create a piecewise linear path first and then fit a curved path to it. There are a few examples where this intermediate step is not needed, such as using potential fields [Ratering95], but they do not give the same level of control as other methods. I will not consider them further in this article. There are many different choices of representation when constructing a piecewise linear path. The A* algorithm is usually applied because it is simple, fast, and guaranteed to be optimal. Where pathfinding techniques differ is in how they represent the physical space, which can be modeled as a set of connected cells or points. I briefly explain some of the main techniques; a thorough review can be found in [Tozour03] or [Latombe91], and an edifying demo can be downloaded at [Tozour05]. Figure 2.2.1 shows an example that compares the results of using the various methods. The simplest method is to divide up the space into a grid of regular rectangular cells with a resolution that is small compared to the size of obstacles. A slightly more sophisticated version of this is to use a quad tree instead of identical rectangles, so that areas of different resolution are defined as needed. Navigation meshes, as described in [Snook00, Tozour02], are an increasingly popular alternative to the grid solutions described previously. They partition the space using a variety of irregular polygons so that each cell is either entirely filled by an obstacle or entirely empty. The partitioning may be performed by hand when the environment is authored, or it can be generated automatically and then hand-edited later.
[Axelrod08, Marden08] describe algorithms for dynamically recalculating the mesh to account for moving obstacles. Voronoi diagrams are on first appearance similar to navigation meshes, but they yield quite different paths. Obstacles are represented as a finite set of points, and the space is partitioned such that each obstacle point is inside a region, where that region is defined as the set of points closer to that obstacle point than to any other. The edges of the resulting diagram can define paths around the space, provided that all edges that pass through an obstacle are removed first. Visibility graphs are different from the previously described methods because they generate a set of waypoints rather than partitioning the space into cells. The graph is defined by points in the environment, in which pairs of points are connected by an edge if they can be joined by a straight line without intersecting any obstacles. The points are selected to be close to the corners of obstacles (this is known as a corner graph), so that the character can pass close to obstacles without colliding with them. A good demonstration of the technique can be seen at [Paluszewski]. These methods are shown in Figure 2.2.1. Figure 2.2.1 The various methods of creating a piecewise linear path: (a) regular grid, (b) quad tree, (c) navigation mesh, (d) Voronoi diagram, and (e) visibility graph. Smoothing the Path The AI programmer's challenge is to create a piecewise linear path through the environment so that the character appears as lifelike and natural as possible, which means it cannot include jagged edges. Piecewise polynomial parametric curves, more commonly known as splines, are well suited to this task, and in particular cubic Bezier curves.
This is a parametric curve of the form:

B(t) = (1 − t)^3 P0 + 3(1 − t)^2 t P1 + 3(1 − t) t^2 P2 + t^3 P3, for 0 ≤ t ≤ 1    (1)

As Figure 2.2.2 illustrates, the control points P0 and P3 define the start and end points, respectively, while the control points P1 and P2 define the curvature at the end points. The vectors P1 − P0 and P3 − P2 are referred to as the start and end velocity, respectively, because they define the tangent and curvature at the two end points. In practical terms, this means we can adjust their direction to control the character's initial and final direction of motion, and we can adjust their magnitude to control the curvature of the curve. Another useful property is that the curve will always lie within the convex hull defined by the four control points. Having constructed a piecewise linear path, we can fit cubic Bezier curves to it for a smooth path. By choosing the control points so that for each consecutive pair of curves the output direction of the first matches the input direction of the second, we achieve continuity and smoothness at the endpoints. Our objective is to adhere to the linear path only as much as necessary. If the character's momentum is high and a large sweeping curve looks most natural, and provided there are no obstacles in the way of that path, we want to select such a path rather than sticking rigidly to the linear path. However, if the path is very close to obstacles and there is little room to maneuver, we want a path that is as curved as possible but still avoids those obstacles.

Figure 2.2.2 Bezier curves are well suited to path fitting. The four control points define the curve's position, shape, and curvature. The curve will always be contained inside the convex hull defined by the control points.

The solution to this problem comes from an unlikely source: The open-source graph plotting software Graphviz [Graphviz] makes use of the same technique for drawing edges between its nodes [Dobkin97].
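The cubic Bezier form of Equation (1) can be evaluated directly. The following Python sketch is illustrative (not from the gem's accompanying code); it computes a point on the curve and shows that the curve starts at P0 and ends at P3:

```python
def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve B(t) at parameter t in [0, 1].

    Control points are 2D (x, y) tuples; returns an (x, y) tuple.
    """
    s = 1.0 - t
    b0 = s * s * s
    b1 = 3.0 * s * s * t
    b2 = 3.0 * s * t * t
    b3 = t * t * t
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])

# The start velocity is P1 - P0 and the end velocity is P3 - P2.
p0, p1, p2, p3 = (0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)
start = bezier_point(p0, p1, p2, p3, 0.0)   # equals P0
end = bezier_point(p0, p1, p2, p3, 1.0)     # equals P3
```

Sampling t over [0, 1] traces the whole curve, which is how the intersection tests described next are typically approximated.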
The inspiration comes from a 1990 article in the first Graphics Gems book that was presented as a way to render fonts using vector graphics [Schneider90, Schneider90_2]. The basic principle is described by the following pseudocode:

Fitcurve(startPos, endPos, startVel, endVel)
    create a Bezier curve from startPos to endPos,
        using startVel and endVel to define the control points
    while the curve intersects an obstacle:
        reduce startVel and endVel
        recalculate the Bezier curve
        if startVel and endVel reach a user-defined minimum:
            break
    if the curve still intersects an obstacle:
        divide the path in two by choosing a point newPos along the path
        set newVel using the vector between two neighbors of newPos
        Fitcurve(startPos, newPos, startVel, newVel)
        Fitcurve(newPos, endPos, newVel, endVel)

The whole process of calculating the piecewise linear path and fitting the curved path to it is fast enough to run in a single frame. The path can therefore be recalculated whenever a new obstacle renders the current path infeasible. Note that the above recursive algorithm is guaranteed to terminate with a valid path because, in the limiting case, we are left with the same piecewise linear path that we began with. Figure 2.2.3 shows an example. A piecewise linear path is created that avoids an obstacle. Then a single cubic Bezier curve is fitted to it, which takes into account the character's starting momentum. But this curve intersects with another obstacle, so a new path is created, using two cubic Bezier curves, that passes through another point on the piecewise linear path.

Figure 2.2.3 Fitting a curve that accounts for character momentum. A single cubic Bezier curve A (light dashed line) is fitted to the piecewise linear curve (light solid line). But this passes through another obstacle, so a new path B (heavy dashed line) is constructed from two cubic Bezier curves that passes through an intermediate point on the piecewise linear path.
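The pseudocode above can be turned into a runnable sketch. This hypothetical Python rendering assumes circular obstacles given as (cx, cy, r) tuples, a sampled intersection test, and halving as the velocity-reduction step; a real implementation would use the engine's own collision queries:

```python
import math

def bezier(p0, p1, p2, p3, t):
    # Evaluate a cubic Bezier curve at parameter t.
    s = 1.0 - t
    return tuple(s*s*s*a + 3*s*s*t*b + 3*s*t*t*c + t*t*t*d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def intersects(p0, p1, p2, p3, obstacles, samples=32):
    # Sampled collision test against circular obstacles (cx, cy, r).
    for i in range(samples + 1):
        x, y = bezier(p0, p1, p2, p3, i / samples)
        if any(math.hypot(x - cx, y - cy) < r for cx, cy, r in obstacles):
            return True
    return False

def fit_curve(start, end, start_vel, end_vel, path, obstacles, min_vel=0.05):
    """Fit Bezier curves to the piecewise linear `path`: shrink the end
    velocities while the curve collides, then subdivide at a waypoint."""
    sv, ev = start_vel, end_vel
    while True:
        p1 = (start[0] + sv[0], start[1] + sv[1])
        p2 = (end[0] - ev[0], end[1] - ev[1])
        if not intersects(start, p1, p2, end, obstacles):
            return [(start, p1, p2, end)]   # curve is clear
        sv = (sv[0] * 0.5, sv[1] * 0.5)     # reduce startVel and endVel
        ev = (ev[0] * 0.5, ev[1] * 0.5)
        if math.hypot(*sv) < min_vel and math.hypot(*ev) < min_vel:
            break
    if len(path) < 3:
        return [(start, start, end, end)]   # limiting case: the linear segment
    mid = len(path) // 2                    # newPos: a waypoint of the path
    new_pos = path[mid]
    prev_p, next_p = path[mid - 1], path[mid + 1]
    new_vel = ((next_p[0] - prev_p[0]) * 0.25, (next_p[1] - prev_p[1]) * 0.25)
    return (fit_curve(start, new_pos, sv, new_vel, path[:mid + 1], obstacles, min_vel)
            + fit_curve(new_pos, end, new_vel, ev, path[mid:], obstacles, min_vel))
```

Each recursive call receives a strictly shorter slice of the waypoint list, which is what guarantees termination in the limiting case.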
In order for a character to walk along the Bezier path, we need to choose the maximum and minimum velocities for the curve. Consider the distance d from the start point to the end point of a single Bezier curve. If the sum of the magnitudes of the start and end velocities exceeds d, the resulting curve can contain a loop. Usually we do not want this, so we can impose d/2 as a maximum for each. A value of zero works perfectly well for a minimum, but we can also use this as an opportunity to impose constraints based on the character's initial momentum, so that sharp turns are not allowed if the character is moving fast. Unfortunately, the character's physical velocity is not identical to the concept of velocity in the Bezier curve. This is because the latter is related to the curve's internal parametrization and therefore varies with curvature and curve length. Nonetheless, good results can be achieved by imposing a minimum of kv/d, where v is the character's speed and k is a constant calculated empirically.

Animation Selection

Thus we have created a smooth path to the end point that avoids any obstacles in the environment. All that remains is to select the appropriate animation for the character that will take it along this path. There is a wealth of literature on this complicated topic, using techniques such as motion graphs and inverse kinematics, which are beyond the scope of this article. Instead, I will present a simple solution using animation blending that gives a satisfactory degree of accuracy. To test the system, I created a character with a family of hand-designed animations for each different gait (for example, walking, running, sprinting). Within each family, all the animations had the same duration and path length but different amounts of rotation (for example, 0, 22.5, 45, 90 degrees). In this way, we can generate motion with any amount of rotation by blending two of the animations together.
Linear blending of two or more animations simply means averaging the position and rotation of the skeleton joints. This is done separately for each frame of the animation. The resulting output is therefore a mixture of the inputs. This will only look natural when the animations are quite similar to begin with, as is the case here. Figure 2.2.4 shows an example. We can chain the animations together because each one begins and ends at the same point in the character's stride. We can transition seamlessly between different gaits using transition animations, authored in the same way as a family of animations with different degrees of rotation. As one animation ends, a new one must be selected. It is possible at this moment that the character's position and orientation do not perfectly match the curved path. We could choose an animation that will return the character to the path, but if his orientation is wrong, this will result in a zigzagging motion as the character veers too far one way with one step and then too far the other way with the next. It is better to ensure the character's final orientation matches the tangent to the path, even if this means there is some error in his position. The method described gave sufficiently accurate results with a wide range of walking and running gaits; Figure 2.2.5 shows an example. This technique is limited, however, as Figure 2.2.6 shows. The animations we can create by blending in this way always describe a circular arc. All we can do is set the curvature of this arc. Some portions of a Bezier curve, however, cannot be described by a circular arc, and this is what causes the character to deviate from the path. For this reason, more sophisticated animation selection techniques would work better.

Figure 2.2.4 A family of animations that can be blended together. Each has the same path length but a different amount of rotation (0, 22.5, 45, and 90 degrees).
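The selection and blending step might be sketched as follows. This is a hypothetical data layout (a pose is a list of per-joint channel tuples), and for brevity the rotations are blended as plain weighted averages rather than proper quaternion interpolation:

```python
def blend_weights(target_rot, family=(0.0, 22.5, 45.0, 90.0)):
    """Pick the two animations of the family that bracket target_rot and
    return ((index_a, weight_a), (index_b, weight_b)) for a linear blend."""
    rots = sorted(family)
    for a, b in zip(rots, rots[1:]):
        if a <= target_rot <= b:
            w = (target_rot - a) / (b - a)
            return (rots.index(a), 1.0 - w), (rots.index(b), w)
    raise ValueError("target rotation outside the family's range")

def blend_frame(pose_a, pose_b, w_b):
    """Per-frame linear blend: a weighted average of each joint channel.
    Poses are lists of per-joint channel tuples; production code would
    slerp quaternions instead of averaging raw channels."""
    return [tuple((1.0 - w_b) * ca + w_b * cb for ca, cb in zip(ja, jb))
            for ja, jb in zip(pose_a, pose_b)]
```

For example, a 67.5-degree turn falls halfway between the 45- and 90-degree animations, so each contributes a weight of 0.5, which matches the blended arc shown later in Figure 2.2.6(a).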
Figure 2.2.5 There is not much error between the dotted line, made of circular arcs, and the solid line, made of four cubic Bezier curves. Each dot represents the start of a new arc.

Figure 2.2.6 (a) A possible blended output: a circular arc with a rotation of 67.5 degrees. (b) An impossible output: No single circular arc can fit it.

Conclusion

In conclusion, this gem has presented a simple and effective way to generate a curved path from any piecewise linear path that facilitates animation selection and promotes interaction between the AI and animation layers of a character. Future work should look at applying this to common physical constraints, such as enforcing a run-up before jumping or picking up an object. An important future extension to this work is to address how best to manage features of a three-dimensional terrain, such as stairs, gaps, and ledges.

References

[Axelrod08] Axelrod, Ramon. “Navigation Graph Generation in Highly Dynamic Worlds.” AI Game Programming Wisdom 4. Boston: Charles River Media, 2008.
[Dobkin97] Dobkin, David P., et al. “Implementing a General-Purpose Edge Router.” Proceedings of Graph Drawing 1997: 262–271.
[Graphviz] Graphviz. “Spline-o-matic.” n.d.
[Johnson06] Johnson, Geraint. “Smoothing a Navigation Mesh Path.” AI Game Programming Wisdom 3. Boston: Charles River Media, 2006.
[Latombe91] Latombe, Jean-Claude. Robot Motion Planning. Kluwer Academic Publishers, 1991.
[Marden08] Marden, Paul. “Dynamically Updating a Navigation Mesh via Efficient Polygon Subdivision.” AI Game Programming Wisdom 4. Boston: Charles River Media, 2008.
[Paluszewski] Paluszewski, Martin. “Robot Motion Planning (applet).” n.d. University of Copenhagen.
[Ratering95] Ratering, Steven and Maria Gini. “Robot Navigation in a Known Environment with Unknown Moving Obstacles.” Autonomous Robots 1.1 (June 1995).
[Schneider90] Schneider, Philip J.
“An Algorithm for Automatically Fitting Digitized Curves.” Graphics Gems. Academic Press Professional, Inc., 1990.
[Schneider90_2] Schneider, Philip J. “A Bezier Curve-Based Root-Finder.” Graphics Gems. Academic Press Professional, Inc., 1990.
[Snook00] Snook, Greg. “Simplified 3D Movement and Pathfinding Using Navigation Meshes.” Game Programming Gems. Boston: Charles River Media, 2000.
[Tozour02] Tozour, Paul. “Building a Near-Optimal Navigation Mesh.” AI Game Programming Wisdom. Boston: Charles River Media, 2002.
[Tozour03] Tozour, Paul. “Search Space Representations.” AI Game Programming Wisdom 2. Boston: Charles River Media, 2003.
[Tozour05] Tozour, Paul. “Pathfinding Algorithms & Search Space Representations Demo.” 16 July 2005.

2.3 Non-Iterative, Closed-Form, Inverse Kinematic Chain Solver (NCF IK)

Philip Taylor

Inverse kinematics (IK) has many uses in games. A primary use is to control the limbs of characters—to fit the pose of the character to the terrain it is standing on or to pin a foot while walking to reduce foot sliding, as described in [Forsyth04]. For many characters, a simple two-bone solver is all that is required because, as in the case of human characters, there are only two major bones in a chain that need to be modified. The problem of solving a two-bone chain is often reduced to a two-dimensional problem by constraining the solution to lie on a plane, usually defined by the root of the chain, the IK goal, and an up-vector position. These two-bone solutions are considered “closed form” because the solution can be found using trigonometry [Lander98]. When it comes to chains with more than two bones, there are several well-known algorithms used to solve this problem. Cyclic Coordinate Descent (CCD), Jacobian Transpose, or Jacobian Pseudo-Inverse are commonly used. These algorithms suffer from performance issues because multiple iterations are required to converge on a solution.
Without sufficient iterations, the chain may not reach the goal within acceptable limits, and the chain may exhibit irregular movements between frames, causing visual artifacts. Additionally, they lack precise control over the resulting shape of the chain. In this gem, I present a new method for solving inverse kinematics on chains comprising any number of bones by solving each bone separately in two dimensions. This solution is non-iterative in that only a single evaluation per bone is required, while guaranteeing a correct solution if the goal is within reach of the chain. The algorithm does not require extra parameters, such as up-vector positions or preferred angles. Instead, the algorithm attempts to preserve the initial shape of the chain while reaching for the goal, maintaining the integrity of any pose or animation data that was present on the chain prior to the IK solver's evaluation.

Context

Consider the context of the character within a game. We are usually not building a set of rotations to define the chain pose; rather, we are modifying a set of orientations to more closely fit a world space constraint. During the evaluation of a game scene, the animation system is sampled to provide a local space transform per bone, and these local space pose transforms are concatenated together to build a global space pose. The pose at this point in the engine's evaluation is artist defined and part of what describes the style and personality of the character. Maintaining the integrity of the character's pose, or motion, while introducing new constraints, such as foot planting or lever pulling, is desirable because significant investment has been made in defining these motions. Any modifications made to the pose should be minimized to avoid breaking the original intent of the motion. In this gem, the pose defined by the animation system is referred to as the forward kinematic pose, or FK pose.
Forward kinematic refers to the way that the pose was built, by accumulating local space transforms down through the hierarchy to generate global space transforms.

Bone Definition

For the purpose of this gem, a bone is defined as containing a position and an orientation. The position is represented as a vector triple, and the orientation is represented as a quaternion. As illustrated in Figure 2.3.1, we refer to the vector that runs along a bone's length as the bone length vector, and the length of a bone is defined as the length of this vector. By convention, bones are aligned with their local X-axis. A chain is defined as a linear hierarchy of bones, usually lined up end to end.

Figure 2.3.1 Bone definition.

IK Goal Position

Without considering hands, feet, or any other child bones of the chain, a solution is defined as solving the chain such that the tip of the last bone touches, or comes as close to touching as possible, some predetermined IK goal position. The exact location of this IK goal position is defined by the engine and is not within the scope of this gem. The animation engine might be locking the foot to a plant position over the course of a character's step, or the hand of a character could be constrained to the steering wheel of a vehicle. In both cases, we have a bone chain with animation applied and a desired goal position, which is defined by the character's environment.

One-Bone IK

The simplest chain comprises only one bone, and the closest valid solution is to align the bone with the goal. While this example may appear trivial, the rest of this gem builds upon this concept. The vector between the bone and the goal is called the bone-to-goal vector, and we refer to the desired angle between the bone length vector and the bone-to-goal vector as the IK angle. When solving a chain comprising one bone, the best that can be achieved is to adjust the bone's pose such that the bone length vector is aimed at the IK goal, as shown in Figure 2.3.2.
This is done by incrementing the rotation using the angle between the bone length vector and the bone-to-goal vector. The axis of rotation is perpendicular to both the bone length vector and the bone-to-goal vector. This rotation is the shortest arc of rotation that will transform the bone length vector onto the bone-to-goal vector. Computing this quaternion directly is described in the article by [Melax06].

Figure 2.3.2 Aiming a bone at an IK goal position.

Two-Bone IK

When an additional bone is added into the system, the mathematics becomes more complex.

Aligning the Chain to the IK Goal

Maintaining the shape of the limb while solving IK requires that each bone retain its transform relative to the bones prior to it and after it in the chain. For example, if the bones in the chain all lie on one plane prior to solving, then after solving the bones should all still lie on a single plane. The first step in solving a chain is to offset the orientation of each bone by the shortest arc between the vector from the root of the chain to the chain tip and the vector from the root of the chain to the IK goal position, as shown in Figure 2.3.3. This step ensures that all further deformation will occur on the same plane with respect to the overall chain shape, as the original chain pose describes. This step also has the effect of biasing most of the deformation to the first joint in the chain. For most characters' limbs this is desirable, but joint limits can be imposed to restrict this deformation. See the “Future Work” section. We first apply the overall alignment to each bone and then solve using the appropriate method for that bone.

Calculating Bone 0 in a Two-Bone Chain

The vector from a bone to the tip of the chain prior to solving is called the bone-to-chain-tip vector, and the angle between the bone length vector and the bone-to-chain-tip vector is the FK angle, as shown in Figure 2.3.4.
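The shortest-arc quaternion used for aiming a bone can be computed directly from the two vectors, following the construction attributed to [Melax06]. This Python sketch assumes both vectors are normalized and not anti-parallel (the degenerate case needs special handling):

```python
import math

def shortest_arc(v0, v1):
    """Quaternion (w, x, y, z) rotating unit vector v0 onto unit vector v1
    by the shortest arc. Assumes v0 and v1 are normalized and not opposite."""
    # Cross product gives the rotation axis (scaled by sin of the angle).
    cx = v0[1] * v1[2] - v0[2] * v1[1]
    cy = v0[2] * v1[0] - v0[0] * v1[2]
    cz = v0[0] * v1[1] - v0[1] * v1[0]
    # Dot product gives the cosine of the angle.
    d = v0[0] * v1[0] + v0[1] * v1[1] + v0[2] * v1[2]
    s = math.sqrt((1.0 + d) * 2.0)
    return (s * 0.5, cx / s, cy / s, cz / s)
```

For example, rotating the X-axis onto the Y-axis yields a 90-degree rotation about Z, i.e. w = cos(45°) and z = sin(45°).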
To calculate the IK angle for Bone 0, we use the law of cosines with the parameters a, b, and c:

a = Bone 0 length
b = Bone 1 length
c = bone-to-IK-goal distance

IK angle = acos((a² + c² − b²) / 2ac)    (1)

Once we have calculated the IK angle, we subtract it from the current angle between the bone length vector and the bone-to-goal vector to get a delta angle, which we use to modify the bone's pose. The axis around which we rotate the bone is the cross product of the bone-to-goal vector and the bone-to-chain-tip vector. We then modify the bone's orientation by incrementing the rotation using this axis and the delta angle.

Figure 2.3.3 Aligning the chain to the IK goal.

Solving the Last Bone

Before any child bone can be solved, its new global position must be calculated by transforming the bone's length vector using the new orientation of its parent and adding it to the parent's position. Bone 1, or the last bone in any chain, is solved using the one-bone solution described previously in the section “One-Bone IK.”

Three-Bone IK Solver

I have shown that a two-bone problem can be decomposed into two types of bones that are each solved using an appropriate method. The first thing to consider about a three-bone chain is that it encapsulates the two-bone chain described previously, with one extra bone at the start of the chain, as shown in Figure 2.3.5. Therefore, once a solution can be found for Bone 0, Bones 1 and 2 can be derived using the previously described methods.

Maximum Bone Angles

Consider that the initial pose of the chain defines an ideal pose for the limb, and any modification of the limb pose must be minimized. By analyzing the FK pose of the chain, for example, in Figure 2.3.6, we can determine a value that describes how bent Bone 0 is with respect to the rest of the chain. We can use the law of cosines to calculate a value that defines the angle of Bone 0 if the remaining chain were to be laid out in a straight line.
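Both the Bone 0 IK angle and the max bone angle are law-of-cosines evaluations. A small Python sketch (function and parameter names are illustrative, and the cosine is clamped so out-of-reach goals degrade to a fully extended chain):

```python
import math

def law_of_cosines_angle(a, b, c):
    """Interior angle between sides a and c (opposite side b).
    Clamped so unreachable configurations yield 0 or pi instead of a
    math domain error."""
    cos_angle = (a * a + c * c - b * b) / (2.0 * a * c)
    return math.acos(max(-1.0, min(1.0, cos_angle)))

def ik_angle_bone0(bone0_len, bone1_len, dist_to_goal):
    # a = Bone 0 length, b = Bone 1 length, c = bone-to-IK-goal distance.
    return law_of_cosines_angle(bone0_len, bone1_len, dist_to_goal)
```

A 3-4-5 triangle gives acos(0.6) for the angle at the root, and a goal exactly at full reach gives an angle of zero (the chain lies straight).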
The remaining chain length, the distance to the FK chain tip, and the bone length are used to calculate the max bone angle. This angle is referred to as the max FK bone angle. Comparing this max FK bone angle to the actual FK bone angle gives us a value that defines our FK bone angle relative to the distance to the FK chain tip:

bone angle fraction = FK bone angle / maximum FK bone angle    (2)

Figure 2.3.4 Solving the first bone in a two-bone chain.

Figure 2.3.5 Three-bone chain.

Figure 2.3.6 Calculating the bone angle and the maximum FK and IK bone angles.

The bone angle fraction is defined relative to the initial shape of the chain and is a correlation between the bone's orientation and the rest of the chain pose. Conversely, the remaining bone length can also be used to calculate the maximum possible angle that the bone can assume in IK. This maximum IK bone angle is multiplied by the bone angle fraction to determine the new IK bone angle.

Four-Bone IK Solver

Figure 2.3.7 illustrates that with chains comprising three or more bones, the bones before the last two are all solved using the method described previously. Considering only the bone length, the remaining bone lengths, the distance to the FK chain tip, and the distance to the IK goal, a bone angle can be calculated. Once the IK bone angle has been calculated for Bone 0, we simply continue down the chain, and Bone 1 can be solved in exactly the same way. Bone 2 can then be solved using trigonometry, and Bone 3 is simply aligned with the target.

N-Bone IK Solver

An N-bone chain consists of three categories of problems. Bones 0 to N−3 are all solved using the method described previously of calculating maximum IK and FK bone angles to derive an IK bone angle. Bone N−2 can be solved using trigonometry, and Bone N−1 is simply aligned with the goal.
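The bone angle fraction of Equation (2) and its reprojection onto the IK configuration might look as follows in Python. This is a sketch under assumed parameter names; the max-angle helper is the same law-of-cosines evaluation used for the two-bone case:

```python
import math

def max_bone_angle(bone_len, remaining_len, dist_to_tip):
    # Angle of the bone if the remaining chain were laid out straight:
    # law of cosines over (bone, straightened remainder, distance to tip).
    cos_a = ((bone_len * bone_len + dist_to_tip * dist_to_tip
              - remaining_len * remaining_len)
             / (2.0 * bone_len * dist_to_tip))
    return math.acos(max(-1.0, min(1.0, cos_a)))

def ik_bone_angle(fk_angle, bone_len, remaining_len, dist_to_tip, dist_to_goal):
    """Derive a bone's IK angle from its FK angle via Equation (2)."""
    max_fk = max_bone_angle(bone_len, remaining_len, dist_to_tip)
    max_ik = max_bone_angle(bone_len, remaining_len, dist_to_goal)
    fraction = fk_angle / max_fk if max_fk > 0.0 else 0.0
    return fraction * max_ik
```

When the goal is exactly as far away as the FK chain tip, the bone keeps its FK angle; a nearer goal produces a proportionally larger bend, which is what preserves the chain's original shape.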
Figure 2.3.7 Applying the method to a four-bone chain.

The overall algorithm can be expressed in pseudocode as follows:

Function SolveIKChain( chain )
begin
    calculate chain target alignment
    for each bone in chain
    begin
        apply chain target alignment to bone
        if bone is last bone
            aim bone at target
        else if bone is second last
            use trigonometry to calculate bone angle
        else
        begin
            determine FK bone angle
            determine maximum FK bone angle
            determine maximum IK bone angle
            IK bone angle = ( FK bone angle / maximum FK bone angle )
                            * maximum IK bone angle
        end
    end
end

Handling Extreme Deformation

In some cases, the remaining chain length is greater than the bone length plus the distance to the chain tip or IK goal. In these cases, the technique described previously of using trigonometry to calculate the maximum FK and IK angles cannot generate a maximum angle greater than π. Furthermore, during animation there may be visual artifacts if the max IK bone angles hit this limit. As shown in Figure 2.3.8, if the distance between the bone and the chain tip is greater than the remaining chain length, then trigonometry is used as described previously. Once the distance to the IK goal is less than the remaining chain length, the maximum IK and FK bone angle values can be calculated differently. The following pseudocode describes how to calculate the maximum FK bone angle for a long chain.

if( distToFkChainTip > remainingChainLength )
{
    use trigonometry to calculate maximum bone angle
}
else
{
    maxFkBoneAngle = acos( ( boneLength/2 ) / remainingChainLength );
    maxFkBoneAngle += ( remainingChainLength - distToFkChainTip ) / boneLength;
}

Effectively, the remaining chain length is applied in an arc around the bone position, allowing us to define maximum angles greater than π. This technique gives a much greater range of angles for the maximum FK/IK bone angle values, while maintaining the important limits when the chain is extended.
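The long-chain branch can be rendered as runnable Python. This is a sketch: the normal case's "trigonometry" is taken to be the same law-of-cosines evaluation used for the max bone angle, and the clamping is an added safeguard:

```python
import math

def max_fk_bone_angle_long(bone_length, remaining_chain_length,
                           dist_to_fk_chain_tip):
    """Max FK bone angle, including the long-chain case where the remaining
    chain is longer than the distance to the chain tip."""
    if dist_to_fk_chain_tip > remaining_chain_length:
        # Normal case: law of cosines over the triangle formed by the bone,
        # the straightened remainder, and the distance to the tip.
        cos_a = ((bone_length ** 2 + dist_to_fk_chain_tip ** 2
                  - remaining_chain_length ** 2)
                 / (2.0 * bone_length * dist_to_fk_chain_tip))
        return math.acos(max(-1.0, min(1.0, cos_a)))
    # Long-chain case: wrap the slack chain length around the bone position
    # in an arc, which permits maximum angles greater than pi.
    angle = math.acos(max(-1.0, min(1.0,
                          (bone_length / 2.0) / remaining_chain_length)))
    angle += (remaining_chain_length - dist_to_fk_chain_tip) / bone_length
    return angle
```

With a bone of length 1, a remaining chain of length 3, and the tip only 1 unit away, the function returns an angle well beyond π, which the plain law-of-cosines path could never produce.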
Additional Details

We have only discussed the basic case in this gem. Here are some additional details that may be helpful for specific cases.

IK/FK Blending

In many cases, you will want to turn off IK because it may not be applicable. Simply disabling IK evaluation will cause a visual pop in the pose of your character. A better approach is to interpolate your IK pose back to your original FK pose before disabling IK. To do this, you need to keep an IK pose buffer separate from your FK pose buffer.

Animated Joint Offsets

So far in this gem, we have only considered chains with static local offsets along the local X-axis. This solver can readily be extended to support animated joint offsets. The local position offset of a joint can be used as the bone length vector of the parent bone, as shown in Figure 2.3.9. The FK bone angle and max FK/IK bone angles are all calculated using this vector and its length, as described in this gem. This ensures that the bones pivot in the same position relative to their parent bone and that local position offsets are still applied in parent space. Because we are required to know the total length of the chain before solving, the entire chain's length needs to be measured prior to solving for IK.

Figure 2.3.8 Handling chains with extreme curvature.

Extension Limit

One concern with applying inverse kinematics to a chain in a game is that often the IK solvers will generate artifacts in the motion of the limb when the limb reaches the limit of its extension. The term hyperextension refers to the visual pop that happens when a chain reaches the limits of its reach. Fixing this issue is a minor addition to the IK solver algorithm. In the code samples provided with the book, extension dampening has been implemented to show how this can be achieved.

Future Work

Our algorithm will not address all cases that you will come across.
In the future, we plan to extend the system in several ways.

Sliding Joints

This IK solver does not currently calculate new local joint position offsets; it only calculates orientation changes for each bone in the hierarchy. For example, it will not extend a hydraulic joint to reach the goal. Sliding joints could be implemented using a technique similar to that used to calculate angles. By defining the FK bone's local position relative to a slide vector, a new local offset position could be defined by comparing the distance to the FK chain tip and the actual distance to the IK goal.

Joint Limits

Aside from limiting the overall extension of the chain, it may be required that certain bones do not rotate beyond a certain limit or that they are not free to rotate on any plane. Because this chain is evaluated from top to bottom, it is possible to start imposing limits on angles. A simple way to do this is to limit the angle generated for each bone. In this way, we can limit the angle of a bone with respect to its parent.

Figure 2.3.9 Animated joint offsets.

DOF Constraints

Creature limbs are usually made up of a ball joint followed by a collection of hinge joints. In this gem, however, no twisting is ever applied to the chain, meaning that the solver only modifies the pose of the root bone on two axes, and every other joint rotates on only one axis. The axis used to modify the bone's orientation is assumed to be perpendicular to the bone vector and the bone-to-goal vector. Chains with limited degrees of freedom are not supported by the algorithm presented in this gem. If an axis of motion were defined for a particular bone, then solving would still be feasible by calculating the maximum bone angles projected onto the plane defined by this axis.

General-Purpose Programming on Graphics Processors

The algorithm presented in this gem does not employ any recursion or major branching, and it has a fixed cost per evaluation.
These features make it an ideal candidate for implementation on some of the newer SIMT architectures, such as CUDA or OpenCL. Perhaps, as in the case of shaders, specific versions of the solver would be generated for categories of chains and the loops unrolled. A two-, three-, and four-bone version could be generated and the loops unrolled at compile time. Depending on the bone count, a chain would be solved in a batch with all other limbs of the same structure. This would avoid costly synchronization points, which would slow all chain evaluations to the cost of the longest chain.

Conclusion

In this gem, I have presented a new method for solving inverse kinematics on chains of bones that is simple, fast, and accurate. This is achieved by calculating bone angles directly using values derived from the initial pose of the chain and the position of the IK goal. This method minimizes the modification of the original chain's pose, while ensuring the goal is always reached if it is within range. There are many possibilities for further development to expand this concept beyond a simple chain solver.

References

[Forsyth04] Forsyth, Tom. “How to Walk.” 2004. Game Tech.
[Lander98] Lander, Jeff. “Oh My God, I Inverted Kine!” Game Developer (September 1998): 9–14.
[Melax06] Melax, Stan. “The Shortest Arc Quaternion.” Game Programming Gems. Boston: Charles River Media, 2006. 214–218.

2.4 Particle Swarm Optimization for Game Programming

Dario L. Sancho-Pradel

This gem presents the foundations of Particle Swarm Optimization (PSO), a simple yet powerful meta-heuristic optimization technique that can be applied to complex non-linear problems, even in the absence of a precise analytical formulation of the system. As a result, PSO is used in a variety of engineering applications, such as artificial neural network training, mechanical design, and telecommunications.
I have also applied this technique to robotic systems. PSO is related to other population-based search strategies, such as Genetic Algorithms (GA), and can solve similar problems. However, PSO works natively with real numbers and tends to be more efficient than GA, often reaching a near-optimal solution in fewer function evaluations [Hassan05]. Next-generation consoles are pushing the boundaries of realism and complexity in game development. Often, games contain large sets of parameters that drive or influence the behavior of various systems, such as physics, animation, or AI. More often than not, the final values of many of these parameters are the result of a manual selection process based on experience and trial and error. Under certain basic conditions, PSO can be a valuable tool for tuning these parameters in an automated fashion.

A Few Words on Optimization

Optimization problems are typically modeled as finding the maximum or minimum value of a set of m-dimensional real functions, whose variables are normally subject to constraints. Maximizing a function f is equivalent to minimizing the function −f and vice versa. Using vector notation, the optimization problem is reduced to finding the roots (solutions) of the vector function

f(x) = 0    (1)

where, for the general case of p non-linear equations in m variables, f(x) = [f1(x1, …, xm), …, fp(x1, …, xm)]T and x = (x1, …, xm). In some cases, Equation (1) may be solved analytically, which means that the optimum point (or set of points) can be calculated exactly. In most real applications, however, this is not possible, and numerical methods are applied to find an approximate solution.

Numerical Methods

Numerical methods are algorithms that return a numerical value that in most cases represents an approximation to the exact solution of a problem. Classical optimization techniques have been successful at solving many optimization problems common in industry and science.
However, there are optimization scenarios that classical approaches cannot solve in a reasonable amount of time. Combinatorial optimization problems—in other words, problems with a discrete set of solutions from which we want to find the optimal one—and NP-hard problems in general are good examples. For those cases, approximate solutions can be obtained relatively quickly using meta-heuristic approaches (that is, high-level strategies used for guiding different heuristics in search problems).

No Free Lunch (NFL) Theorem for Optimization

Optimizing can be regarded as a search problem where the selected optimization method represents the particular style or mechanism of executing the search. The NFL theorem [Wolpert97] states that for finite spaces and algorithms that do not resample points, the performance of all search (optimization) algorithms averaged over all possible objective functions is the same. In practice, however, we do observe that some algorithms perform, on average, significantly better than others over certain objective functions. This is because the objective functions considered in most real optimization scenarios have a structure that is far from random. As a result, algorithms that exploit such a structure will perform, on average, better than “uninformed” search strategies (for example, a linear sequential search). The PSO search strategy is based on an exchange of information between members of a swarm of candidate solutions and operates under the assumption that “good solutions” are close together in the search space. If this is not the case, PSO will not perform better than a random search.

The PSO Paradigm and Its Canonical Formulation

PSO is a stochastic, population-based computer algorithm modeled on swarm intelligence. The PSO paradigm was originally developed by Dr. Eberhart and Dr.
Kennedy [Kennedy95], an electrical engineer and a social psychologist, respectively, who had the idea of applying social behavior to continuous non-linear optimization problems. They reasoned that social behavior is so ubiquitous in the animal kingdom because it optimizes results. Social behavior is the result of an exchange of information among the members of the society. The original PSO algorithm evolved from a bird flocking simulator into an optimization tool, motivated by the hypothesis that social sharing of information among members of the society offers an evolutionary advantage. In order to implement and exploit that flow of information, each member was provided with a small amount of memory and a mechanism to exchange some knowledge with its social network (in other words, its neighbors). Although different improvements and variations have been proposed since the initial PSO algorithm was presented, most of them closely resemble the original formulation. A significant exception is the Quantum PSO (QPSO) algorithm [Sun04], which for reasons of space will not be covered here.

Canonical Equations of Motion

The way PSO works is by "flying" a swarm of collision-free particles over the search space (also called the problem space). Each particle represents a candidate solution of the optimization problem, and each element (dimension) of the particle represents a parameter to be optimized. The movement of each particle in the swarm is dictated by the equations of motion that the particular flavor of the PSO algorithm defines. In its canonical form, these equations can be expressed as:

x(t + Δt) = x(t) + V(t + Δt)·Δt    (2)

where x(t) and V(t) represent, respectively, the particle's position and velocity at time t. By choosing Δt = 1, t becomes the iteration step. Note that x(t) and V(t) are m-dimensional vectors. At t = 0, the position of each particle is selected from a uniform random distribution, i.e.
x_j(0) = U(x_j_min, x_j_max), j = 1, 2, ..., m, where x_j_min and x_j_max are the search space boundaries for the j-th dimension. V(0) can either be randomly initialized or set equal to the zero vector.

Velocity Update

As Equation (2) shows, the search process is driven by the velocity update, which is based on:

• Cognitive information. Experience gained during the particle's search, expressed as the best location ever visited by the particle, x_Cognitive_Best.
• Socially exchanged information. The best location found so far by any member of the particle's social network, x_Social_Best. The calculation of x_Social_Best depends on the chosen network topology. Figure 2.4.1 shows some common topologies. In a fully connected network (for example, Star), x_Social_Best represents the best location found by any particle in the swarm, hence called global best (gBest). If the swarm is not fully connected, x_Social_Best becomes the best position found by any particle in its local neighborhood, hence called local best (lBest). Using lBest tends to provide more accurate results, whereas algorithms running gBest execute faster [Engelbrecht02]. The neighborhood is typically defined based on the particles' indices, although spatial information could also be used. Allowing neighborhood overlapping helps the information exchange across the swarm.
• Inertia. Provides certain continuity to the motion of the particle. Adding inertia to the velocity update helps to decrease the change in momentum between two consecutive iterations and provides a means of controlling the balance between the particle's exploration and exploitation. An exploratory strategy tends to direct the particle toward unexplored areas of the problem space. Exploitation refers to the movement of a particle around previously explored areas, resulting in a finer-grain search on these areas.
Clearly, low inertia will favor exploitation, whereas higher values will result in a more exploratory behavior.

In general, the PSO strategy is to use the cognitive and social best positions as attractors in the search space of the particle. It achieves that by defining the particle's velocity vector as:

V(t + Δt) = ω·V(t) + c_Cognitive·r1 ∘ (x_Cognitive_Best − x(t)) + c_Social·r2 ∘ (x_Social_Best − x(t))

where ω ∈ (0, 1). r1 and r2 are random m-dimensional vectors taken from a uniform random distribution, in other words, r_i = {r_ij ∈ U(0, 1)}, j = 1, 2, ..., m, i = 1, 2. The values of the two acceleration coefficients are normally chosen to be the same (typically c_Cognitive = c_Social ∈ (0, 2.05]). It is important to notice that here the operator ∘ denotes a per-component multiplication of two vectors, and therefore its result is another vector.

Figure 2.4.1 Three classical social network structures: (a) Star topology (fully connected), (b) Ring topology, (c) Cluster topology.

Finally, the equations of motion for the basic canonical PSO system can be expressed as:

V(t + 1) = ω·V(t) + c_Cognitive·r1 ∘ (x_Cognitive_Best − x(t)) + c_Social·r2 ∘ (x_Social_Best − x(t))
x(t + 1) = x(t) + V(t + 1)    (3)

Algorithm 1 details the basic canonical PSO formulation implementing the gBest strategy, and Figure 2.4.2 graphically illustrates the position and velocity update of one particle based on the equations of motion presented in Equation (3).
Algorithm 1: Canonical PSO (gBest version)

1:  // Create a swarm of N randomly distributed particles
2:  FOR EACH particle(P) i=1,…,N do
3:      FOR EACH dimension j=1,..,M do
4:          P[i].X[j] = Xmin[j] + rand(0,1)*(Xmax[j] - Xmin[j]);
5:          P[i].V[j] = rand(0,1)*Vmax[j]*sign(rand(0,1)-0.5); // or = 0;
6:      END
7:  END
8:  g_Best = P[0];
9:  numIterations = 1;
10: // Iterative optimization process
11: REPEAT
12:     FOR EACH particle(P) i=1,…,N do
13:         IF Eval(P[i].X) BETTER_THAN Eval(g_Best.X)
14:             g_Best = P[i];
15:         END
16:         IF Eval(P[i].X) BETTER_THAN Eval(P[i].x_best)
17:             P[i].x_best = P[i].X;
18:         END
19:     END
20:     // Apply equations of motion
21:     FOR EACH particle i=1,…,N do
22:         FOR EACH dimension j=1,..,M do
23:             V_Inertia   = w*P[i].V[j];
24:             V_social    = c1*rand(0,1)*(g_Best.X[j] - P[i].X[j]);
25:             V_cognitive = c2*rand(0,1)*(P[i].x_best[j] - P[i].X[j]);
26:
27:             P[i].V[j] = V_Inertia + V_social + V_cognitive;
28:             Clamp(P[i].V[j]); // optional
29:             P[i].X[j] = P[i].X[j] + P[i].V[j];
30:             Clamp(P[i].X[j]); // or any other strategy
31:         END
32:     END
33:     numIterations++;
34: UNTIL (numIterations > MAX_NUM_ITER || GOOD_ENOUGH(Eval(g_Best.X)))

The position update may take the particle outside of the predefined boundary. In fact, it has been proven that in high-dimensional swarms, most of the particles will leave the search space after the first iteration [Helwig08]. Some strategies to constrain the particles to the search space include:

• Moving the particle to its closest boundary and setting its velocity to zero.
• Assuming a cyclic problem space. For instance, for dimension j, if X[j] > Xmax[j], then it is recalculated as X[j] = Xmin[j] + (X[j] - Xmax[j]).
• Allowing the particle to leave the boundary, omitting its evaluation if the objective function is not defined at the new coordinates.
• Reinitializing all components whose values are outside of the search space.
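As a concrete sketch, the gBest loop of Algorithm 1 might be implemented as below. This is a minimal Python translation for a minimization problem; the sphere objective, the parameter values, and the clamp-to-boundary strategy (the first strategy listed above) are illustrative choices, not prescribed by the text.

```python
import random

def pso_gbest(eval_fn, x_min, x_max, n_particles=25, max_iter=200,
              w=0.7298, c1=1.49618, c2=1.49618, seed=1):
    """Minimal gBest PSO (minimization), following Algorithm 1."""
    rng = random.Random(seed)
    m = len(x_min)
    # Create a swarm of randomly distributed particles; zero initial velocity.
    X = [[x_min[j] + rng.random() * (x_max[j] - x_min[j]) for j in range(m)]
         for _ in range(n_particles)]
    V = [[0.0] * m for _ in range(n_particles)]
    x_best = [x[:] for x in X]                 # cognitive best per particle
    f_best = [eval_fn(x) for x in X]
    g = min(range(n_particles), key=lambda i: f_best[i])
    g_best, g_val = x_best[g][:], f_best[g]    # social (global) best

    for _ in range(max_iter):
        for i in range(n_particles):
            for j in range(m):
                v_inertia = w * V[i][j]
                v_cognitive = c1 * rng.random() * (x_best[i][j] - X[i][j])
                v_social = c2 * rng.random() * (g_best[j] - X[i][j])
                V[i][j] = v_inertia + v_cognitive + v_social
                X[i][j] += V[i][j]
                # Constrain to the search space: move to the closest
                # boundary and zero the velocity component.
                if X[i][j] < x_min[j]: X[i][j], V[i][j] = x_min[j], 0.0
                if X[i][j] > x_max[j]: X[i][j], V[i][j] = x_max[j], 0.0
            f = eval_fn(X[i])
            if f < f_best[i]:
                f_best[i], x_best[i] = f, X[i][:]
                if f < g_val:
                    g_val, g_best = f, X[i][:]
    return g_best, g_val

# Example: minimize the 2D sphere function f(x) = x0^2 + x1^2.
best, val = pso_gbest(lambda x: sum(v * v for v in x), [-5.0, -5.0], [5.0, 5.0])
```

With the settings shown, the swarm converges quickly to the neighborhood of the origin; swapping in a different `eval_fn` is all that is needed to reuse the loop.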
Evaluating the Particles: The Objective Function

The objective function, also referred to as the fitness function or cost function, does not need to be the actual mathematical representation of the system to be optimized, but rather a measure of the quality of any solution found. Therefore, each call to Eval(P[i].X) in Algorithm 1 requires the simulation of the system using the parameters encoded in P[i]. During the simulation, various indicators related to the performance of the solution are recorded and combined in a function to provide a numerical measure of performance. This function is the objective function.

Figure 2.4.2 Illustration of the velocity and position updates of a particle during the optimization of a simple 2D parabolic function f(x,y). (Left) Geometrical illustration showing contour lines of f(x,y). (Right) Sketch of f(x,y) illustrating the particle's relevant information and labels.

Imagine that we want to optimize the parameters that define the way an NPC performs a certain task, such as handling a car. Most likely, we do not have a dynamic model of the car and its controllers, so we cannot directly optimize the non-linear system of equations that defines the problem at hand. Nevertheless, by defining an objective function that penalizes, for instance, the time the NPC spends outside the training path and the time required to drive a predetermined distance, we are effectively giving the particles valid feedback to evaluate their quality, and therefore a means of improving in later iterations.

Add-Ons to the Classical Formulation

Beyond the basic formulation, we can add a few improvements to deal with specific cases that may come up in a particular use case.

Velocity Clamping

The formulation described so far may generate particles with increasingly large oscillations around a potential optimum due to an "explosion" in the velocity values.
This issue was addressed in [Kennedy01] by the introduction of a positive velocity clamping parameter Vmax, which modifies the velocity update as follows:

V_j(t + 1) = min(|V'_j(t + 1)|, Vmax_j) · sign(V'_j(t + 1))

where V'_j(t + 1) is the j-th component of the velocity vector calculated using Equation (3). The selection of Vmax is problem dependent. Typically, each of its components is initialized within the size of the search space boundaries, in other words, Vmax_j = δ·(x_j_max − x_j_min), with δ ∈ (0, 1]. Initially, we want δ to be close to 1 in order to favor exploration. During the optimization process, Vmax can be periodically adjusted by:

• Setting Vmax to the dynamic range of the particle.
• Using a linearly or exponentially decaying Vmax in order to progressively move toward exploitation (fine-grained searches) of the area around the best particle(s).
• Increasing or decreasing Vmax after α iterations without improvement in the best found solution.

Variable Inertia Weight ω(t)

The inertia coefficient has a clear influence on the particle's search strategy. Large values of ω favor exploring new areas, while small values favor the exploitation of known areas. Initially, we would like the algorithm to favor exploration, and as we approach the vicinity of the optimum, we could decrease the value of ω in order to perform finer-grain searches. There are many ways to define a decreasing ω(t). A common one is a linear interpolation between the desired values of ω_Initial and ω_Final:

ω(t) = ω_Initial + (ω_Final − ω_Initial)·(t / T)

where t is the current iteration and T represents the number of iterations required to reach ω_Final, keeping ω(t) = ω_Final for t > T. Another strategy is to select T = MAX_NUM_ITER / N, resetting ω(t) and applying the interpolation again every N iterations.

Constriction Coefficient

Based on the dynamic analysis of a simplified and deterministic PSO model, Clerc and Kennedy [Clerc02] reformulated the velocity update in terms of the constriction coefficient χ.
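These two mechanisms are small enough to sketch directly. The function and variable names below are mine, and the default ω values (0.9 decaying to 0.4) are a common choice rather than one mandated by the text:

```python
def clamp_velocity(v, v_max):
    """Clamp each velocity component to [-v_max_j, +v_max_j]."""
    return [max(-vm, min(vm, vj)) for vj, vm in zip(v, v_max)]

def inertia(t, T, w_initial=0.9, w_final=0.4):
    """Linearly interpolate the inertia weight from w_initial to w_final
    over T iterations, holding it at w_final for t > T."""
    if t >= T:
        return w_final
    return w_initial + (w_final - w_initial) * (t / T)
```

Both would be called inside the velocity-update loop of Algorithm 1: `inertia(numIterations, T)` replaces the constant `w`, and `clamp_velocity` implements the optional `Clamp` on Line 28.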
They also analyzed the convergence properties of the algorithm, obtaining a set of theoretical optimal values for the PSO parameters to control both convergence and the velocity explosion. A simple constriction model for the velocity update can be expressed as:

V(t + 1) = χ[V(t) + φ1·r1 ∘ (x_Cognitive_Best − x(t)) + φ2·r2 ∘ (x_Social_Best − x(t))]

where χ is a diagonal matrix whose elements are calculated as:

χ_jj = 2 / |2 − φ − sqrt(φ² − 4φ)|

and:

φ = φ1 + φ2,  φ > 4

Low values of χ_jj encourage exploitation, whereas high values result in a more exploratory system. Typical values are χ_jj = 0.7298 and φ1 = φ2 = 2.05. Notice that this model can be easily converted into the original inertia-based formulation.

Extended Full Model for Velocity Update

The velocity update could include both the global best and the local best particle:

V(t + 1) = χ[V(t) + φ1·r1 ∘ (x_Cognitive_Best − x(t)) + φ2·r2 ∘ (gBest − x(t)) + φ3·r3 ∘ (lBest − x(t))]

where the coefficients are typically chosen as in the constriction model.

Adding Constraints

In an optimization scenario, the constraints define feasible subspaces within the search space of the unconstrained problem. In the original PSO algorithm, each dimension is bounded individually. (See Line 4 of Algorithm 1.) This translates into an axis-aligned hyper-rectangular search space. Therefore, constraints that couple several dimensions, such as g(x1, ..., xm) ≤ 0, cannot generally be defined. As a result, neither a simple circular search space (for example, x1² + x2² ≤ r²) nor a combination of constraints (for example, a rectangular search space with a rectangular hole inside) is supported. Different approaches have been proposed to extend the constraint-handling capabilities of evolutionary algorithms. (For a review, see [Michalewicz96] and [Mezura09].) A simple approach presented in [Hu02] uses the canonical PSO algorithm and introduces two modifications:

• All particles are initialized inside the feasible region.
• After evaluating the population, only particles within the feasible region can update the values of x_Cognitive_Best and x_Social_Best.

Integer Optimization

Although PSO deals natively with optimization problems of continuous variables, it can also be modified to cope with discrete and mixed-parameter problems.
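The constriction coefficient follows directly from φ1 and φ2, so it is worth computing rather than hard-coding. A small helper (the function name is mine):

```python
import math

def constriction(phi1=2.05, phi2=2.05):
    """Clerc-Kennedy constriction coefficient.

    Requires phi = phi1 + phi2 > 4; with the typical phi1 = phi2 = 2.05
    this evaluates to roughly 0.7298.
    """
    phi = phi1 + phi2
    if phi <= 4.0:
        raise ValueError("phi1 + phi2 must exceed 4")
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))
```

Multiplying the whole bracketed velocity expression by this value reproduces the typical χ_jj = 0.7298 quoted above, which is why that constant appears so often as a default inertia weight in PSO implementations.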
The simplest way to do that is to use the same equations of motion, but round the results of the integer components (at least for the position update) to the nearest integer. In some cases this can result in invalid particles (for example, duplicated elements in a particle are generally not allowed in permutation problems), and particles need to be "fixed" after every position update. The fixing mechanism is problem dependent, but it may consist of making all the elements of a particle different, while keeping their values within a certain range.

Maintaining Diversity

A common problem in numerical optimization is premature convergence, in other words, getting trapped around a local optimum. In stochastic methods, the diversity of the population has a major effect on the convergence of the problem; hence, diversity loss is commonly regarded as a threat to the optimization process. Some ways of favoring diversity are:

• Periodically alternating PSO with other search methods.
• Every α iterations, reinitializing (randomizing) the location of a random number of particles in the swarm.
• Every β stalls, reinitializing the system.
• Including mutation and crossover operators and applying them to the positions of a small subset of particles in the swarm.
• Running multiple independent swarms, allowing an occasional exchange of information (for example, gBest) and/or particles between swarms. Furthermore, each swarm could use different search strategies (for example, some could use lBest, some others could use gBest, some could use parameters that favor exploration, some others could use more exploitative ones, and so on).

It is important to notice that the selected networking strategy also has an impact on the diversity of the population.
Distributed topologies, such as Ring or Cluster, generate a number of local attractors and tend to maintain higher levels of diversity, leading to good solutions more often than more centralized approaches.

A Note on Randomness

PSO is a stochastic optimization technique that initializes its particles by placing them randomly in the search space. In order to do so, pseudo-random values are drawn from a uniform distribution. The reason is that in the absence of any prior knowledge about the location of the optimum, it is best to distribute the particles randomly, covering as much of the problem space as possible. However, as Figure 2.4.3 (a) shows, the result may be less uniform than desired. The spatial uniformity of the randomized particles can be improved by using Halton point sets, which are uniformly distributed and stochastic-looking sampling patterns, generated by a deterministic formula at low computational cost [Wong97].

Figure 2.4.3 One instance of two sets of 25 particles initialized in a two-dimensional space using (a) a uniform random distribution and (b) a Halton sequence.

Case Study 1: Typical Benchmarking Functions

There are sets of functions that are commonly used for benchmarking optimization algorithms. These functions tend to have areas with a very low gradient (for example, Rosenbrock's saddle) and/or numerous local optima (for example, Rastrigin's function). Figure 2.4.4 illustrates the 2D versions of three of them. The accompanying CD includes the multidimensional version of a larger set of benchmarking functions together with a PSO algorithm implementation. You are encouraged to experiment with the different PSO parameters, observing the variation in convergence speed and accuracy. More information about different benchmarking functions can be found in [Onwubolu04].
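Halton-based initialization and one of the standard benchmarks can be sketched together. The scaling into the search box mirrors Line 4 of Algorithm 1, with the Halton value replacing `rand(0,1)`; the helper names are mine:

```python
import math

def halton(index, base):
    """Radical-inverse (van der Corput) value of `index` in the given base."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def halton_swarm(n, x_min, x_max, bases=(2, 3)):
    """n particles in a len(x_min)-dimensional box, Halton-distributed
    (one prime base per dimension)."""
    return [[x_min[j] + halton(i + 1, bases[j]) * (x_max[j] - x_min[j])
             for j in range(len(x_min))] for i in range(n)]

def rastrigin(x):
    """Rastrigin benchmark: many local optima, global minimum 0 at the origin."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)
```

Feeding `halton_swarm(25, [-5.0, -5.0], [5.0, 5.0])` to a PSO run instead of the uniform random initialization reproduces the kind of evenly spread starting layout shown in Figure 2.4.3 (b).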
Case Study 2: Optimization of Physical Parameters for In-Game Vehicle Simulation

To obtain realistic vehicle responses in a game, simplified physical models of their key components are generated. These models contain a number of physical parameters that define the way the vehicle reacts in many situations. Those parameters are often obtained based on experience and a trial-and-error process of selection. Such a manual process tends to be inevitably slow, and as a result, little of the search space is often covered. It is clear that there is room for improvement, and PSO is a good candidate for automating the search process. The only prerequisite is to be able to estimate numerically the quality of each candidate solution.

The following example illustrates the general principle. We want to optimize the performance of a car, whose simplified model is shown in Figure 2.4.5. The model includes friction between the tires and the ground, the suspension, and the car's mass distribution. Based on the model's parameters, the particle's components can be defined as:

P.X[0] = Forward Friction Coeff.;
P.X[1] = Side Friction Coeff.;
P.X[2] = Spring Constant (Rear);
P.X[3] = Damper Constant (Rear);
P.X[4] = Spring Constant (Front);
P.X[5] = Damper Constant (Front);
P.X[6] = Mass 1;
P.X[7] = Mass 1 location (along Z);
P.X[8] = Mass 2;

Figure 2.4.4 2D versions of some common benchmarking functions: inverse cosine wave, Rastrigin's function, Rosenbrock's saddle.

As an example of dimensionality reduction, the value and location of the mass M3 and the location of the mass M2 have been fixed. They can be included at any moment in the optimization by extending the dimensionality of the particle. Most importantly, we need to define an objective function capable of quantifying the response of the vehicle while following a path along a network of roads.
The objective function to be minimized could include the following variables: distance to goal on a crash, total distance driven without being on the marked spline, accumulated distance from the car's center of mass to the marked spline (shown as d in Figure 2.4.6 (b)), accumulated heading deviation with respect to the selected path, oscillations around the Z-axis, and tilt around the Z-axis on curves (cf. Figure 2.4.6 (a)). The function can also include big penalties for crashing and for cases where only two wheels are in contact with the road. The final objective function could be a quadratic combination of each of these variables r_i:

f = Σ_i k_i·(h_i·r_i)²

The normalization constant h_i ensures that all the variables have the same weight (typically between 0 and 1), regardless of their particular ranges. The relative importance of r_i with respect to the rest of the variables is represented by the coefficient k_i, which can also be used for scaling the objective function. Squaring r_i is useful in cases where only the magnitude of r_i is relevant, ensuring that the objective function does not decrease based on the sign of the variables.

Figure 2.4.5 A simplified physics model of a car.

As Algorithm 1 showed, each iteration requires the evaluation of the whole population (in other words, N particles). Therefore, it would be convenient to evaluate N vehicles simultaneously following identical paths.

Case Study 3: Physics-Based Animation of Mechanical Systems

Physics-based animation can improve the gaming experience by adding physical realism and believability to the game. Cloth and hair procedural animation, ragdolls, and general collisions between rigid bodies are some examples featured in many titles. Extending this concept to complex mechanical systems, such as robotic NPCs and vehicles, can result in more realistic behaviors that are in accordance with our experience of how mechanical systems move.
Figure 2.4.7 shows a hexapod NPC robot exhibiting a tripod gait, which is a gait pattern often used by insects and other animals due to its performance and stability. Gait patterns can be defined by a series of parameters (for example, maximum joint angles, feet trajectory, stance phase offset, stance duty factor, step duration) whose values will depend on the physical characteristics of the robot, the environment, the pattern configuration, the task (for example, turn, go forward, climb, and so on), and the task qualifiers (for example, fast, safe, and so on). PSO can be used for finding good values for these parameters. Each gait could be optimized individually by using specialized objective functions and running single-gait simulations. Next, gait transitions could be worked out. Once the optimization is finished, the result would be a hexapod robotic NPC with newly acquired locomotion skills.

The design of the robot can be taken one step further. The configuration of the legs could be optimized by adding the length of their segments and the number of joints into the optimization's parameter list. Ultimately, PSO could be used as a tool for evolving different types of robots by including the number of legs and their position in body coordinates as optimization parameters.

Figure 2.4.6 Illustration of some of the parameters included in the objective function. (a) Tilt around the Z-axis during a curve. (b) Distance to the path's splines and heading error.

Once the robot design is finished and the gait parameters optimized, the gait controller of the robot would send the appropriate sequence of actions to the different joints, and the physics engine would do the rest. Incidentally, the controller of each joint (typically some flavor of proportional-integral-derivative controller, or PID) could also be optimized using PSO. Let's see one final example.
We want to find the parameters that generate the fastest walking gait for the particular robotic NPC illustrated in Figure 2.4.8. Figure 2.4.8 (b) shows the feet trajectory selected for this gait. This half-ellipse trajectory is defined by two parameters, namely the step height (h) and the step size (s). Another important parameter is the duration of the step (Ts). To improve stability, the torso of the robot is initially tilted sideways to shift its center of mass toward the supporting leg (refer to Figure 2.4.8 (a)). This can be parameterized by a maximum tilt angle (a) and the percentage of step time that the body is tilted toward the leg in stance (Tt) before shifting the weight to the other side. Assuming a 50-percent duty cycle for each leg, the structure of the particles could be P.X[] = {h, s, Ts, a, Tt}. The objective function could be designed as in Case Study 2, taking into account the walking time, the distance to the goal before falling, the accumulated heading error, the accumulated position error with respect to the given path, and the stability of the gait measured in terms of oscillations around various axes.

It is worth noticing that the gait parameters obtained are subject to external factors, such as the friction between the robot's feet and the ground. In principle, it could be possible to find a continuous mapping between the walking parameters and different types of terrain by feeding the friction data and the PSO solutions to a neural network or a fuzzy inference system. In that way, the robotic NPC could try to adapt its gait to optimize walking performance when the type of terrain changes. Gaits are not the only movements that could be optimized. Learning other actions, such as crouching (refer to Figure 2.4.8 (c)) or climbing stairs, would follow the same principles.

Figure 2.4.7 Snapshot of a hexapod NPC robot using a tripod gait. The legs shown in white are lifted, while the darkened ones are in stance.
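The half-ellipse swing trajectory is easy to parameterize by the step size s and step height h. The parameterization below is my own reading of the curve described above (toe-off behind the hip, apex at height h, touch-down in front), not a formula given in the text:

```python
import math

def swing_foot(phase, s, h):
    """Foot position along a half-ellipse swing trajectory.

    phase runs from 0 (toe-off at (-s/2, 0)) to 1 (touch-down at (+s/2, 0)),
    passing through the apex (0, h) at mid-swing.
    Returns (forward_offset, height).
    """
    return (-(s / 2.0) * math.cos(math.pi * phase),
            h * math.sin(math.pi * phase))
```

A gait controller would sample this curve at the step rate implied by Ts and hand the resulting foot targets to the leg's inverse kinematics; the PSO particle {h, s, Ts, a, Tt} then shapes the curve and its timing.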
Conclusion

The aim of meta-heuristic optimization algorithms is to provide approximate solutions to complex non-linear optimization problems that more traditional approaches have difficulties addressing. These algorithms are not guaranteed to converge to a good solution, but they are designed to find good approximations to the global optimum with high probability. Population-based algorithms are sensitive to the diversity of their individuals and to their configuration parameters. PSO relies on a small set of intuitive parameters, such as the number of particles, the maximum number of iterations, and the topology of the social network. Often, the mapping between an optimization problem and the function that quantifies the quality of a solution is not unique. In these cases, the quality of the selected objective function has a significant impact on the convergence of the algorithm. This gem has shown a number of variations proposed around the canonical PSO algorithm. The simplicity of the PSO paradigm makes its extension a relatively simple task.

Games are becoming extremely complex and highly parameterized software products. In many cases, the response of different systems is driven, or at least influenced, by sets of parameters stored in configuration files. PSO could be a useful tool for optimizing some of these parameters in an automated fashion. AI, physics, and animation are examples of systems that could benefit from this optimization technique.

Figure 2.4.8 Illustration of some basic parameters in a humanoid-like robot. (a) Weight-shifting angle for improving stability, (b) half-ellipse gait parameters, (c) the different joint angles in a crouching action.

References

[Clerc02] Clerc, M. and J. Kennedy.
"The Particle Swarm-Explosion, Stability and Convergence in a Multidimensional Complex Space." IEEE Transactions on Evolutionary Computation 6 (2002): 58–73.
[Engelbrecht02] Engelbrecht, A.P. Computational Intelligence: An Introduction. Wiley, 2002.
[Hassan05] Hassan, R., B. Cohanim, and O. de Weck. "A Comparison of Particle Swarm Optimization and the Genetic Algorithm." Proceedings of the 46th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference. 2005.
[Helwig08] Helwig, S. and R. Wanka. "Theoretical Analysis of Initial Particle Swarm Behavior." Proceedings of the 10th International Conference on Parallel Problem Solving from Nature (Sept. 2008): 889–898.
[Hu02] Hu, X. and R. Eberhart. "Solving Constrained Nonlinear Optimization Problems with Particle Swarm Optimization." 6th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2002): 203–206.
[Kennedy95] Kennedy, J. and R. Eberhart. "Particle Swarm Optimization." Proceedings of the IEEE International Conference on Neural Networks 4 (Dec. 1995): 1942–1948.
[Kennedy01] Kennedy, J., R.C. Eberhart, and Y. Shi. Swarm Intelligence. Morgan Kaufmann Publishers, 2001.
[Mezura09] Mezura-Montes, E. Constraint-Handling in Evolutionary Optimization. Springer, 2009.
[Michalewicz96] Michalewicz, Z. and M. Schoenauer. "Evolutionary Algorithms for Constrained Parameter Optimization Problems." Evolutionary Computation 4 (1996): 1–32.
[Onwubolu04] Onwubolu, G. and B. Babu. New Optimization Techniques in Engineering. Springer, 2004.
[Sun04] Sun, J., B. Feng, and W. Xu. "Particle Swarm Optimization with Particles Having Quantum Behavior." Proceedings of the Congress on Evolutionary Computation 1 (June 2004): 325–331.
[Wolpert97] Wolpert, D.H. and W.G. Macready. "No Free Lunch Theorems for Optimization." IEEE Transactions on Evolutionary Computation 1.1 (April 1997): 67–82.
[Wong97] Wong, Tien-Tsin, Wai-Shing Luk, and Pheng-Ann Heng. "Sampling with Hammersley and Halton Points." Journal of Graphics Tools 2.2 (1997): 9–24.
2.5 Improved Numerical Integration with Analytical Techniques

Eric Brown

There is a fairly standard recipe for integrating the equation of motion in the context of a game physics engine. Usually the integration technique is based on the Symplectic Euler stepping equations. These equations are fed an acceleration, which is accumulated over the current time step. Such integration methods are useful when the exact nature of the forces acting on an object is unknown. In a video game, the forces that are acting on an object at any given moment are not known beforehand. Therefore, such a numerical technique is very appropriate.

However, though we may not know beforehand the exact nature of the forces that will act on an object, we usually know the exact nature of the forces that are currently acting on it. If this were not so, we would not be able to provide the stepping equations with the current acceleration of the body. If it were possible to leverage our knowledge of these current forces, then we might expect to decrease the error of the integration dramatically.

This gem proposes such a method. This method allows for the separation of numerical integration from analytic integration. The numerical integration steps the state of the body forward in time, based on the previous state. The analytic integration takes into account the effect of acceleration acting over the course of the current time step. This gem describes in detail the differences and implications of the integration techniques to aid the physics developer in understanding design choices for position, velocity, and acceleration updates in physics simulation. As we build up to the introduction of this method, we will first discuss a heuristic model for classifying errors of integration techniques.
Classifying Errors

Often, the method for classifying errors in integration techniques is to label them as first order, second order, and so on. Methods that are first order have an upper-bound error on the order of the time step taken to the first power. Methods that are second order have an upper-bound error on the order of the time step to the second power. Taking a small number to a large power makes the small number smaller. Thus, higher-order methods yield more accuracy.

Error can also be classified in terms of how well an integrator conserves energy. Integrators might add or remove energy from the system. Some integrators can conserve energy on average. For instance, the semi-implicit, or Symplectic, Euler method is a first-order method, but it conserves energy on average. If an integrator adds energy to the system, the system can become unstable and diverge, especially at larger time steps. The accuracy of a method can affect its stability, but it does not determine it, as shown by the Symplectic Euler method. More often than not, it is the stability of a method that we desire, more than the accuracy.

In this gem we will be taking a different approach to classifying error. This approach is based on the fact that the stepping equations usually assume that the acceleration is constant over the course of a time step. The kinematics of constant acceleration is a problem that can be solved easily and exactly. Comparing the kinematic equations of constant acceleration with the results of a numerical method provides qualitative insight into sources of error in the method.

When derivatives are discretized, it is done by means of a finite difference. Such a finite difference of positions implies that the velocity is constant over the course of the time step. In order to introduce a non-constant velocity, you must explicitly introduce an equation involving acceleration.
Similarly, the only way to deal with non-constant acceleration is to explicitly introduce an equation involving the derivative of acceleration. Since many numerical methods do not involve any such equation, we are safe in making the comparison with the kinematic equations for constant acceleration, at least over the course of a single time step.

Kinematics of Constant Acceleration

We know the exact form of the trajectory of particles that are subject to a constant acceleration:

x_n = x_n-1 + v_n-1·Δt + (1/2)·a·Δt²
v_n = v_n-1 + a·Δt

We can compare this set of equations to the results of common numerical methods in order to gain a qualitative idea about the error in the method. Consider the standard Euler method:

x_n = x_n-1 + v_n-1·Δt
v_n = v_n-1 + a·Δt

This set of equations can be transformed into a set that more closely resembles the kinematic equations by inserting the velocity equation into the position equation:

x_n = x_n-1 + v_n-2·Δt + a·Δt²
v_n = v_n-1 + a·Δt

The appearance of v_n-2 in the position equation is due to the fact that we must insert v_n-1 and must therefore re-index the velocity equation. The differences in form between this equation and the kinematic equations can be considered as qualitatively representative of the error of the method.

We may perform this same procedure with the Symplectic Euler method:

v_n = v_n-1 + a·Δt
x_n = x_n-1 + v_n·Δt

Insert the velocity equation into the position equation to transform to the kinematic form:

x_n = x_n-1 + v_n-1·Δt + a·Δt²
v_n = v_n-1 + a·Δt

This equation is a much closer match in form to the kinematic equations. We can use this resemblance to justify the fact that the Symplectic Euler method must in some way be better than the standard Euler method.

The Kinematic Integrator

If we are trying to find an integration method that, when converted to a kinematic form, is identical to the kinematic equations, why not just use the kinematic equations as the integration method? If we do this, then we are guaranteed to get trajectories that are exact, within the assumption that acceleration is constant over the course of the time step. However, the accelerations we are usually interested in modeling are not constant over the course of the time step.
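The qualitative difference between the two Euler variants shows up immediately on a simple test problem. The sketch below (the test setup is mine, not from the text) integrates a unit harmonic oscillator, where a = -x, with both methods and tracks the energy E = (v² + x²)/2. Standard Euler gains energy every step and diverges, while Symplectic Euler keeps the energy bounded near its initial value:

```python
def step_standard_euler(x, v, dt):
    # Both updates use the start-of-step state; position uses the old velocity.
    a = -x
    return x + v * dt, v + a * dt

def step_symplectic_euler(x, v, dt):
    # Velocity is updated first; position then uses the new velocity.
    v = v + (-x) * dt
    return x + v * dt, v

def energy(x, v):
    return 0.5 * (v * v + x * x)

x_std, v_std = 1.0, 0.0   # standard Euler state
x_sym, v_sym = 1.0, 0.0   # symplectic Euler state
dt = 0.1
for _ in range(1000):
    x_std, v_std = step_standard_euler(x_std, v_std, dt)
    x_sym, v_sym = step_symplectic_euler(x_sym, v_sym, dt)
# energy(x_std, v_std) has grown enormously; energy(x_sym, v_sym) stays near 0.5.
```

For standard Euler the energy grows by the factor (1 + dt²) every step, which is exactly the instability described above; the symplectic variant only makes the energy oscillate within a narrow band around the true value.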
We must use a value for the acceleration that encapsulates the fact that the acceleration is changing. We could use the acceleration averaged over the time step, ā, as the constant acceleration value. By inserting the average acceleration into the kinematic equations, we achieve a method that we will refer to as the average acceleration method. In order to calculate this average exactly, we must analytically integrate the acceleration over the time step, which in many instances can be done easily. The average acceleration method therefore represents a blend between numerical and analytic integration. We are numerically integrating the current position and velocity from the previous position and velocity, but we are analytically integrating the accelerations that are acting during the current time step.

Of course, calculating the average acceleration exactly requires that we know how to integrate the particular force in question. Luckily, most forces that are applied in game physics are analytic models that are easily integrated. Calculating the average acceleration from an analytic model of a force is usually just as easy as calculating the acceleration at an instant of time. If the average acceleration is calculated analytically, then the velocity portion of the kinematic equations produces exact results. However, the position portion would require a double integral in order to achieve an exact result. If the forces that we are dealing with follow simple analytic models, then calculating a double integral is usually just as easy as calculating a single integral.

We will generalize the idea of the average acceleration method in order to introduce the kinematic integrator. The kinematic integrator is a set of stepping equations that allow for exact analytic calculation of both the velocity integral and the position integral:

    v_{n+1} = v_n + dv
    x_{n+1} = x_n + v_n*Δt + dx
The exact method uses the following definitions for dv and dx:

    dv = ∫ a(t) dt
    dx = ∫ ( ∫ a(t) dt ) dt

where the integrals run over the time step, the inner integral of dx being indefinite. If we are using the average acceleration method, then we define dv and dx as:

    dv = ā*Δt
    dx = (1/2)*ā*Δt^2

In the case of constant acceleration, it is very easy to perform both the single and the double integral. The integral contributions of a constant acceleration a are given as:

    dv = a*Δt
    dx = (1/2)*a*Δt^2

If there are multiple forces acting on a body, we can express the integral contributions as a sum of contributions that are due to each force. Thus for the kinematic integrator we accumulate dv's and dx's rather than accelerations. All forces acting on a body provide contributions to dv and dx. The amount that is contributed is dependent on the nature of the force and can usually be calculated exactly. If all forces acting on a body are integrable, then every contribution is exact.

The kinematic integrator can be used to perform the Symplectic Euler method with the following integral contributions:

    dv = a_n*Δt
    dx = a_n*Δt^2

This method is useful if the acceleration is not integrable (or if we are too lazy to calculate the integrals). These contributions are not going to be exact, but they will at least conserve energy, which will maintain stability.

The kinematic integrator does not represent a specific integration method, but rather the ability to isolate the portions of the stepping equations that actually require integration. Since the method used to evaluate the integral contributions is not explicitly specified, we have a degree of freedom in choosing which method might be best for a particular force. For instance, if there is a contribution from a constant gravitational force, then we can easily use the exact method. We will see that for some forces we will want to use the average acceleration method. Or we could use the Symplectic Euler method if the force in question is too complicated to integrate or if we are performing a first pass on the implementation of a particular force.
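The bookkeeping described here can be sketched in a few lines of Python. This is my own illustration of the idea, not code from the gem; the force interface (each force returns its own (dv, dx) pair) and the function names are assumptions.

```python
def constant_gravity(g):
    # Exact contributions of a constant acceleration g.
    def contrib(x, v, dt):
        return g * dt, 0.5 * g * dt * dt
    return contrib

def symplectic_contrib(accel):
    # Symplectic-Euler-style contributions for a force we cannot
    # (or do not want to) integrate: dv = a_n*dt, dx = a_n*dt^2.
    def contrib(x, v, dt):
        a = accel(x, v)
        return a * dt, a * dt * dt
    return contrib

def kinematic_step(x, v, dt, forces):
    # Accumulate per-force contributions, then apply the stepping
    # equations v' = v + dv, x' = x + v*dt + dx.
    dv = dx = 0.0
    for f in forces:
        fdv, fdx = f(x, v, dt)
        dv += fdv
        dx += fdx
    return x + v * dt + dx, v + dv

x, v = 0.0, 10.0
x, v = kinematic_step(x, v, 0.1, [constant_gravity(-9.8)])
```

Because the contribution method is chosen per force, a constant gravity can use the exact contributions while a complicated force in the same list falls back to the Symplectic Euler form.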
Integral Contributions Due to a Spring Force

Probably one of the most common forces, next to a constant uniform force, is a spring force. The spring force is proportional to the displacement of the spring from equilibrium. This results in the following equation of motion:

    a(t) = -(k/m)*(x(t) - l)

The solution of this equation of motion is analytically solvable and is given by:

    x(t) = l + (x_n - l)*cos(ω*t) + (v_n/ω)*sin(ω*t),  where ω = sqrt(k/m)

Using this exact trajectory, we can determine the integral contributions:

    dv = -ω*s*(x_n - l) + (c - 1)*v_n
    dx = (c - 1)*(x_n - l) + (s/ω - Δt)*v_n

where c and s represent the cosine and sine, respectively, of ω*Δt. We could represent this calculation as a matrix operation. The components of this matrix depend on the size of the time step Δt, the mass of the body m, the strength of the spring k, and the equilibrium position of the spring l. The components of the matrix can be cached and reused until any of these parameters change. In many instances, these parameters do not change for the lifetime of the spring. Thus, the calculation of the integral contributions of a spring is relatively trivial: six multiplies and four adds.

We can consider that the spring we have been discussing is anchored to an infinitely rigid object at the origin, since we have only taken into account the action of the spring on a single body, and the spring coordinates are the same as the body coordinates. It is only slightly more complicated to calculate the integral contributions due to a spring that connects two movable bodies.

Multiple Forces

Before discussing the integral contributions of other possible forces, we need to discuss what happens when multiple forces act on a body. If all of the forces acting on a body depend only on time, then the result of accumulating exact integral contributions will be exact. But consider the case where at least one of the forces depends on the position of the body, such as the spring force.
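The exact spring contributions can be sanity-checked in code. The Python sketch below is my own (the dv/dx formulas are derived from the closed-form harmonic-oscillator trajectory, and the parameter values are arbitrary): repeated kinematic steps should land on the analytic solution to floating-point precision.

```python
import math

def spring_contrib(xn, vn, dt, k, m, l):
    # dv and dx for an anchored spring; c and s are the cosine and
    # sine of w*dt, with w = sqrt(k/m).
    w = math.sqrt(k / m)
    c, s = math.cos(w * dt), math.sin(w * dt)
    dv = -w * s * (xn - l) + (c - 1.0) * vn
    dx = (c - 1.0) * (xn - l) + (s / w - dt) * vn
    return dv, dx

def analytic(x0, v0, t, k, m, l):
    # Closed-form trajectory of the harmonic oscillator about l.
    w = math.sqrt(k / m)
    x = l + (x0 - l) * math.cos(w * t) + (v0 / w) * math.sin(w * t)
    v = -w * (x0 - l) * math.sin(w * t) + v0 * math.cos(w * t)
    return x, v

k, m, l, dt = 4.0, 1.0, 0.5, 0.1
x, v = 2.0, -1.0
for _ in range(100):
    dv, dx = spring_contrib(x, v, dt, k, m, l)
    x, v = x + v * dt + dx, v + dv

xa, va = analytic(2.0, -1.0, 100 * dt, k, m, l)
```

Since each step reproduces the exact flow of the oscillator, composing one hundred steps still matches the analytic trajectory; only floating-point rounding separates the two results.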
The integral contribution of the spring takes into account the position of the body at intermediate values as the spring acts over the course of the time interval. However, the calculation is not aware of intermediate changes in position that are due to other forces. The result is a very slight numerical error in the resulting trajectory of the particle.

As an example, consider two springs that are acting on the same body. For simplicity, both springs are attached to an infinitely rigid body at the origin, and the rest length of both springs is zero. The springs have spring constants k1 and k2. The acceleration becomes:

    a(t) = -((k1 + k2)/m)*x(t)

If we are to handle these forces separately, then we would exactly calculate the integral contributions of two springs, with frequencies ω1 and ω2, and accumulate the results. The springs can also be combined into a single force. In this case, only a single integral contribution would be calculated. It might be surprising to discover that these two methods produce different results. The second method is exact, while the first contains a slight amount of numerical error. This seems to imply that:

    ∫ (a1 + a2) dt ≠ ∫ a1 dt + ∫ a2 dt

which is usually not true. However, in the current circumstance, the acceleration is a function of the position, which in turn is a function of the acceleration, so the integrals have feedback. The effect of this feedback is that we cannot separate the sum into independent integrals, since the independent integrals will not receive feedback from each other. If the acceleration depends only on time, there is no feedback, and the integrals can safely be separated. Because of this, you may want to separate out the forces that depend only on time and accumulate their integral contributions as a group. The integral contributions of these forces can be safely added together without introducing numerical error. Unfortunately, most of the forces that are applicable to game physics depend on the position of the body.
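The two-spring example can be demonstrated numerically. In this Python sketch (my own construction; both springs are anchored at the origin with zero rest length, as in the example above), accumulating the exact contributions of each spring separately drifts away from the single combined spring, whose contribution is exact.

```python
import math

def spring_contrib(xn, vn, dt, w):
    # Exact contributions of a zero-rest-length anchored spring with
    # angular frequency w.
    c, s = math.cos(w * dt), math.sin(w * dt)
    dv = -w * s * xn + (c - 1.0) * vn
    dx = (c - 1.0) * xn + (s / w - dt) * vn
    return dv, dx

def step(x, v, dt, freqs):
    # Accumulate per-spring contributions, then apply the kinematic step.
    dv_sum = dx_sum = 0.0
    for w in freqs:
        dv, dx = spring_contrib(x, v, dt, w)
        dv_sum += dv
        dx_sum += dx
    return x + v * dt + dx_sum, v + dv_sum

k1, k2, m, dt = 3.0, 5.0, 1.0, 0.05
w1, w2 = math.sqrt(k1 / m), math.sqrt(k2 / m)
wc = math.sqrt((k1 + k2) / m)   # the combined spring's frequency

xs, vs = 1.0, 0.0               # two springs accumulated separately
xc, vc = 1.0, 0.0               # single combined spring (exact)
for _ in range(200):
    xs, vs = step(xs, vs, dt, [w1, w2])
    xc, vc = step(xc, vc, dt, [wc])

err = abs(xs - xc)
```

The combined run reproduces cos(wc*t) to rounding error, while the separately accumulated run shows the small but nonzero drift the text attributes to missing feedback between the two integrals.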
Though the error due to multiple forces is incredibly small, it represents a tiny violation of energy conservation. If this violation persists for long enough, then the trajectory can eventually diverge. The error due to multiple forces is initially much less than the error in the Symplectic Euler method, but it can eventually grow until it is out of control.

Since we can integrate the system with two springs exactly, we can use this example to gauge the error present in the different methods for calculating the integral contributions. Results of this error calculation are contained in Figure 2.5.1. The first plot contains the results of integrating one spring; the second plot shows the errors due to integrating two springs. The integration takes place over one half of the period of the two-spring system. The Symplectic Euler method, as well as the pulse method (which will be introduced later), both conserve energy on average. Neither the average acceleration method nor the exact method conserves energy over long intervals of time. It takes quite a while, but the exact method will eventually diverge. The average acceleration method will eventually dampen out. For this reason, the average acceleration method is preferred if the integration is going to take place over a long interval.

Integral Contributions of a Pulse

One solution to the problem due to multiple forces is to approximate the action of the force as a pulse. A pulse is a force that acts very briefly over the course of the time step. A perfect pulse acts instantaneously. Pulses acting at the beginning of the time step do not depend on intermediate values of position or velocity; therefore, pulses are immune to the multiple-force problems. Consider a rectangular pulse. The area under the curve of a rectangular pulse is dv = a*dt, where dt is the width of the pulse and a is the magnitude of the pulse.
It is possible to shrink the width of the pulse while maintaining the same area. To do this, we must, of course, increase the magnitude of the pulse. If we allow the pulse width to go to zero, the height diverges to infinity. However, the product of the width and the height is still equal to dv. In the case where dv = 1, there is a special name for this infinite pulse. It is called a delta function and is denoted as δ(t). The delta function is zero at all values of t, except at t = 0. At t = 0, the delta function is equal to infinity. Integrating the delta function is very easy to do:

    ∫_a^b δ(t) dt = 1

for all values of a and b where a < 0 < b. If the interval between a and b does not contain zero, then the result of the integral is zero.

Figure 2.5.1 The relative errors in the case of one spring and two springs.

To represent the force as a pulse, we define the acceleration as:

    a(t) = dv*δ(t - t_n)

Here, dv is the area under the curve of this instantaneous pulse, as previously described. Integrating this acceleration is easy to do: the velocity contribution is simply dv. Integrating the position contribution is now just integrating a constant, which gives dx = dv*Δt.

We can choose dv as the amount of acceleration we want to pack into the pulse. If we want to pack all of the acceleration due to an arbitrary force into the pulse, then we use the average value of the acceleration, ā. This produces stepping equations of the form:

    v_{n+1} = v_n + ā*Δt
    x_{n+1} = x_n + v_n*Δt + ā*Δt^2

In a more familiar form, the stepping equations for the pulse would appear as:

    v_{n+1} = v_n + ā*Δt
    x_{n+1} = x_n + v_{n+1}*Δt

If you make the assumption that the acceleration is constant over the time interval, then you can replace ā with a_n. Doing this, you will arrive back at the stepping equations for the standard Symplectic Euler method. This gives meaningful insight into the Symplectic Euler method. It is a method that assumes that the acceleration is constant over the course of the time step and delivers all of the acceleration in an instantaneous pulse at the beginning of the time step.
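This equivalence can be made concrete in a few lines of Python (my own sketch; the numeric values are arbitrary): a pulse that delivers a*Δt of velocity change at the start of the step and then coasts produces exactly the Symplectic Euler update.

```python
def pulse_step(x, v, dt, abar):
    # Pulse at the start of the step: dv = abar*dt delivered
    # instantly, after which the body coasts, giving dx = dv*dt.
    dv = abar * dt
    dx = dv * dt
    return x + v * dt + dx, v + dv

def symplectic_euler_step(x, v, dt, a_n):
    # Symplectic Euler: velocity first, then position with it.
    v_new = v + a_n * dt
    return x + v_new * dt, v_new

x1, v1 = pulse_step(1.0, 2.0, 0.1, -9.8)
x2, v2 = symplectic_euler_step(1.0, 2.0, 0.1, -9.8)
```

The two velocity updates are the same expression, and the positions agree up to rounding, which is the sense in which Symplectic Euler "is" a pulse method.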
Since pulses can be accumulated without introducing error, the Symplectic Euler method can be considered to be an exact solution to forces of an approximated form. Since the method is exact, it intrinsically conserves energy. The error in the method is entirely due to the differences between the actual form of the force and the approximate form of the pulse.

Integral Contributions of Collision Forces

One very common thing that needs to be done in a game physics engine is to resolve collisions. To calculate our collision, we will assume that the forces due to the collision act within some subinterval that fits entirely within our given time interval. The nature of the force does not matter as much as the fact that it operates on a very small timescale and achieves the desired result.

For simplicity, we will assume that the object has normal incidence with a plane boundary. We will also assume that the collision with the boundary is going to take place within the time interval. To handle the collision correctly, we should sweep the trajectory of the body between x_n and x_{n+1} and find the exact position and the exact time where the collision begins to take place. If we want to find the projected position at the next time step, x_{n+1}, we should wait to apply the collision force until all other forces have been applied, so that the projected next position is as accurate as possible. For the purpose of simplification, we will consider that the trajectory is a straight line that connects the position x_n with x_{n+1}. The position of the body at some intermediate value of the time step is given by:

    x(f) = x_n + f*(x_{n+1} - x_n)

where f is the fraction of the time step that represents the moment of impact. The intermediate velocity is simplified as well: along this straight-line trajectory it is constant, v_f = (x_{n+1} - x_n)/Δt. With this physically inaccurate yet simplified trajectory, we can perform a sweep of the body against the collision boundary.
The result of this sweep will determine the time fraction f. This will tell us the position and velocity (almost) of the body at the moment of impact. To determine the force that is applied in order to resolve the collision, we need to apply a pulse force that reflects the component of the velocity v_f about the contact plane. This reflection incorporates the interaction of the body with the contact surface; thus, the result should incorporate surface friction. For now, we will assume that there is no friction.

To apply the pulse force, we will begin by applying a constant force over a small subinterval of the time step. The subinterval begins at time f*Δt and lasts for a short duration δt, so the velocity of the body at the end of the subinterval is given by:

    v = v_f + a_v*δt

According to our initial simplification, the velocity is aligned with the normal of the contact plane. Thus, all of the velocity is reflected, and we choose a_v to enact this reflection. In the limit that δt goes to zero, the integral contributions of this pulse are:

    dv = -2*v_f
    dx = 0

Since this force pulse does not contribute anything to the position integral, the application of this force does not change the value of x_{n+1} from what it would have been. Thus, the value of x_{n+1} will still violate the collision constraint. We need to apply a second force pulse, which moves the position of the body out of the collision surface. Assuming that the body has normal incidence to the plane, the amount the body needs to be pushed is given by Δx = x_f - x_{n+1}. Again, we are going to apply a constant force on a subinterval. The acceleration of this force, a_x, is determined by considering the kinematic equations for constant acceleration, chosen so that its position contribution equals Δx. Adding these two sets of contributions together gives the final result for the collision force.
This set of contributions can be expressed in terms of the position at the beginning and end of the interval, as well as the parameter f, which represents the moment when the collision first begins to happen.

Integral Contributions of Viscous Forces

A viscous force is a force that is related to, and usually opposing, the velocity of an object:

    a(t) = -(k/m)*v(t)

The equation can be solved for the velocity in terms of the constant k and the mass m. This is done by expressing the acceleration as the derivative of velocity. This equation can be integrated to get:

    v(t) = v_n*e^(-k*t/m)

The integral contributions of this force can be determined exactly and are given by:

    dv = v_n*(e^(-k*Δt/m) - 1)
    dx = (m*v_n/k)*(1 - e^(-k*Δt/m)) - v_n*Δt

It may very well be sufficient to approximate the exponential function with the first-order Taylor series expansion. Since the force dissipates energy, exactness for the sake of accuracy is not required.

Viscosity can be applied in an anisotropic fashion, meaning that the viscosity constant k can actually be a vector. The viscous force then contains the dot product of this vector with v_n. A simple method of introducing surface contact friction is to generate an anisotropic viscous force in the event that a collision is determined. The viscosity in the direction of the collision surface normal can be tuned separately from the components in the surface plane in order to decouple the restitution of reflected velocity from sliding friction. Of course, this general viscosity force does not accurately model dry friction, where there is a transition from static to kinetic friction. But it is a good place to start.

Integral Contributions of Constraint Forces

Many physics engines offer a variety of constraints on objects. We will not calculate the integral contributions of every possible constraint, but rather suggest a general mechanism for determining the resolution forces that enforce that constraints are satisfied.
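Before moving on to constraints, the viscous contributions are easy to verify in code. In this Python sketch (my own harness; the dv/dx formulas follow from the exponential decay of the velocity, and the parameter values are arbitrary), stepping with the exact contributions matches the closed-form solution over the whole run.

```python
import math

def viscous_contrib(vn, dt, k, m):
    # Exact contributions of a = -(k/m)*v, taken from the
    # closed-form exponential decay v(t) = vn*exp(-k*t/m).
    e = math.exp(-k * dt / m)
    dv = vn * (e - 1.0)
    dx = (m * vn / k) * (1.0 - e) - vn * dt
    return dv, dx

k, m, dt, v0 = 0.8, 2.0, 0.1, 5.0
x, v = 0.0, v0
for _ in range(50):
    dv, dx = viscous_contrib(v, dt, k, m)
    x, v = x + v * dt + dx, v + dv

# Closed form after the total time T, for comparison.
T = 50 * dt
v_exact = v0 * math.exp(-k * T / m)
x_exact = (m * v0 / k) * (1.0 - math.exp(-k * T / m))
```

Replacing the exponential with its first-order Taylor expansion, as the text suggests, would trade a little accuracy for speed while still dissipating energy.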
In many physics engines, collisions are handled as constraints. The pattern for evaluating the resolution forces for a collision can apply to all constraints. This general pattern is as follows.

1. Approximate the trajectory and find the value of f, which represents the moment when the constraint violation began.
2. Forces are applied as pulses. Using the Jacobian of the constraint equations to determine the direction of the force, the magnitude is calculated in order to bring the velocity v_{n+1} into compliance with the constraint.
3. If needed, another force pulse is applied in the same direction in order to bring the position x_{n+1} into compliance.
4. The contributions for both pulses are added together.

Summary

When using the kinematic integrator, the state of a body is defined by the position x_n and velocity v_n at the beginning of the current time step t_n. The state of a body is stepped forward in time using the kinematic integration equations:

    v_{n+1} = v_n + dv
    x_{n+1} = x_n + v_n*Δt + dx

The quantities dv and dx represent the portions of the stepping equations that can be analytically integrated per force and accumulated. The integral contributions can be calculated using the exact method as:

    dv = ∫ a(t) dt
    dx = ∫ ( ∫ a(t) dt ) dt

where the inner integral of dx is indefinite, and the outer integral is a definite integral. The integral contributions for the average acceleration method are:

    dv = ā*Δt
    dx = (1/2)*ā*Δt^2

The integral contributions for a pulse are:

    dv = ā*Δt
    dx = ā*Δt^2

If desired, the integral contributions of the Symplectic Euler method are defined as:

    dv = a_n*Δt
    dx = a_n*Δt^2

These different methods can be mixed, depending on the desired results. When there is more than one force applied that depends on position, then the average acceleration method can be much more accurate than the exact method.
The acceleration contributions for the spring with a spring constant k and a natural length l are:

    dv = -ω*s*(x_n - l) + (c - 1)*v_n
    dx = (c - 1)*(x_n - l) + (s/ω - Δt)*v_n

where ω = sqrt(k/m), and c and s are the cosine and sine of ω*Δt. The contribution of a viscous force, with a viscosity constant k, is:

    dv = v_n*(e^(-k*Δt/m) - 1)
    dx = (m*v_n/k)*(1 - e^(-k*Δt/m)) - v_n*Δt

For a collision response force, the integral contributions are expressed in terms of the positions at the beginning and end of the interval and the parameter f, where f is the fraction of the time step when the collision begins. For a general constraint, the generic framework is as follows:

1. Approximate the trajectory and find the value of f, which represents the moment when the constraint violation began.
2. Forces are applied as pulses. Using the Jacobian of the constraint equations to determine the direction of the force, the magnitude is calculated in order to bring the velocity v_{n+1} into compliance with the constraint.
3. If needed, another force pulse is applied in the same direction in order to bring the position x_{n+1} into compliance.
4. The contributions for both pulses are added together.

Conclusion

Using a well-chosen mix of numerical integration and analytic integration, it is possible to achieve exact trajectories for some force models. If there are multiple forces applied, error may accumulate, since the analytic integration of individual forces cannot take other forces into account. Using the average acceleration method for calculating the integral contributions can result in relative errors that are millions of times smaller than the errors from the Symplectic Euler method alone. We have seen that the Symplectic Euler method can be thought of as an exact method that approximates the force as pulses. The errors in this method are due to this misrepresentation of the force. This gem has demonstrated that the kinematics of simple physical models, which are prevalent in game-related physics, can be leveraged to dramatically reduce and sometimes eliminate the error of integration methods.
Going forward, work on this topic might include determining the integral contributions due to the different flavors of translational and rotational constraints. Also, determining whether there is a way to pre-accumulate elements of the acceleration integrands prior to integration would provide a very natural solution to the problems that arise because of multiple forces that depend on position.

2.6 What a Drag: Modeling Realistic Three-Dimensional Air and Fluid Resistance

B. Charles Rasco, Ph.D., President, Smarter Than You Software

Basic physics simulations have been around in games for quite a while. But beyond simple gravity, simple 2D collisions, and not-so-simple 3D collisions, there is a lack of further refinement of object motion. Drag physics is one area that has not been adequately addressed. Most games simulate drag physics with a simple linear model, if they simulate it at all. This gem demonstrates and contrasts two different mathematical models that simulate drag: a linear drag model and a quadratic drag model. The quadratic drag model is more realistic, but both models have applicability to different situations in physical simulations for many different types of games. The gem also defines and explains relevant parameters of drag physics, such as the terminal velocity and the characteristic decay time.

The linear three-dimensional drag problem, the one-dimensional linear problem, and parts of the quadratic drag problem are discussed in [Davis86]. This gem adds a stable implementation of quadratic drag in three dimensions. The integrals used in this article may be found in any calculus book or on the web [Salas82, MathWorld]. Games could use improved drag physics for many things: artillery shells, bullets from smaller guns, golf ball trajectories, car racing games (air resistance is the main force that limits how fast a car can go), boats in water, and space games that involve landing on alien planets.
Even casual games could use improved drag physics. The quadratic drag model in this gem was developed for the iPhone and iPod Touch app Golf Globe by ProGyr. In this game, the user tries to get a golf ball onto a tee inside a snow globe. Originally the game was designed with linear water resistance, but it did not feel quite right. This was the motivation to improve the water resistance to the more realistic, and more complicated, quadratic model. I have a real golf ball in a snow globe and have never been able to get it onto the tee. To be honest, I have never gotten the ball on the tee in the Golf Globe game on the 3D eagle level either, but the easy levels are much simpler because they are two-dimensional.

Physics

This section provides the general mathematics and physics background for both the linear and the quadratic drag models.

Three-Dimensional Physics

The forces for the linear model are:

    F = m*g - α*v

And the forces for the quadratic model are given by:

    F = m*g - β*v²*v̂

where v² is the velocity squared, and v̂ is the unit vector that points in the direction of the velocity. The first term of both equations is the force due to gravity, and in my representation down is the negative z direction. The direction of gravity depends on your coordinate system. The second term of both equations is the drag term. I chose alpha and beta as different labels for the drag coefficient in order to keep clear which version of air resistance I am referencing. For both models, if this value is big, there is a lot of fluid resistance, and if it is small, there is little fluid resistance. This parameter should always have a positive value and does not actually need to remain constant. It can change if an object changes shape or angle relative to the direction of the fluid motion. A sail in the wind does not have a constant value; its value changes depending on how taut the sail is and how perpendicular it is relative to the wind, among other factors.
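As a minimal sketch (Python; the function names and sample values are mine, and gravity is along -z as in the text), the two force models can be written directly from the vector equations above:

```python
import math

def linear_drag_force(v, m, g, alpha):
    # F = m*g_vec - alpha*v, with g_vec = (0, 0, -g).
    return (-alpha * v[0],
            -alpha * v[1],
            -m * g - alpha * v[2])

def quadratic_drag_force(v, m, g, beta):
    # F = m*g_vec - beta*|v|^2*v_hat, and |v|^2*v_hat == |v|*v.
    speed = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (-beta * speed * v[0],
            -beta * speed * v[1],
            -m * g - beta * speed * v[2])

f_lin = linear_drag_force((3.0, 0.0, -4.0), 1.0, 9.8, 0.2)
f_quad = quadratic_drag_force((3.0, 0.0, -4.0), 1.0, 9.8, 0.2)
```

Note how the quadratic drag couples the components: each component's drag depends on the full speed, which is exactly why the quadratic model cannot be split into three independent one-dimensional problems later in the gem.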
The velocity in these equations is the velocity of the object in the fluid, so if there is wind, then it would be the velocity of the object minus the velocity of the wind. The linear equation is solvable in a complete analytical closed-form solution. Since the quadratic equation mixes all of the components of the velocity vector together, it is not solvable in closed form, but we can solve it numerically. In addition, the quadratic drag model has the advantage of being a better approximation of realistic air and fluid resistance. Interestingly, both are exactly solvable when considered as one-dimensional problems. This will be discussed in more detail later in this gem.

Conversion from Three-Dimensional Physics to One-Dimensional Physics

To convert the drag problem from three dimensions into one dimension, we need to break the object's motion into two components: motion along the velocity (one dimension) and motion perpendicular to the current velocity (two dimensions). The drag force only acts along the current direction of velocity; thus, we can treat it as a one-dimensional problem. The unit vector in the direction of the velocity is given by v̂ = v/|v|, or if the velocity is zero, it is valid to set v̂ = 0. Setting the unit vector to zero works because if the object is not moving, there will be no fluid resistance. The components of the forces along the direction of motion and perpendicular to the direction of motion are shown in Figure 2.6.1. We can update the velocity along the current direction of motion as well as perpendicular to it once it is decomposed this way.

One-Dimensional Solutions

For a one-dimensional solution, we start from the one-dimensional version of Newton's Second Law, F = m*a, and from the definition of acceleration, a = dv/dt.
If the force is a function of the velocity alone, then these two equations can be combined to get time as a function of velocity, as shown here:

    t = m * ∫ dv / F(v)

The force equations that are relevant to the present matter are F = -m*g*cos(θ) - α*v, the linear fluid resistance equation, and F = -m*g*cos(θ) ∓ β*v², the quadratic fluid resistance equation. The cosine in these equations is the cosine of the angle between the current velocity and the vertical. For the linear equation, the negative sign in front of the velocity term means the force always opposes the direction of motion. If the velocity is positive, then the force is in the negative direction. If the velocity is negative, then the force is in the positive direction, which is taken care of automatically by the linear equation. In the quadratic equation, the correct sign must be applied based on which way the object is traveling, since squaring the velocity loses the directional information.

Figure 2.6.1 Breaking three-dimensional space into components along the direction of motion and perpendicular to the current direction of motion.

For the linear equation, we have:

    t = m * ∫ dv / ( -m*g*cos(θ) - α*v )

where v_t1 = -(m*g/α)*cos(θ) is the terminal velocity. There is one other variable that describes the motion, and that is the characteristic time, τ1 = m/α. The characteristic time is a measure of how quickly (or slowly) the object speeds up or slows down to the terminal velocity, and it shows up in the equations after we integrate. The formulas for the quadratic drag model's terminal velocity and characteristic time are different from the linear drag model's; the linear drag quantities are labeled with subscript 1s. Another way to calculate the terminal velocity is to set the force equation to zero and solve for the velocity. This integral is evaluated with a change of variables u = v - v_t1, so that du = dv.
Lastly, we solve this somewhat intimidating equation for v_f, which turns out nicely as:

    v_f = v_t1 + (v_0 - v_t1)*e^(-Δt/τ1)    (1)

If you feel like practicing taking limits, see what happens when the air resistance coefficient goes to zero: the traditional constant-acceleration update pops up. The first few terms in this expansion are useful if you want to approximate the exact solution in order to avoid the expensive exponential function calls.

For the quadratic case, we solve the equations in the same way. We start with the general equation:

    m * dv/dt = -m*g*cos(θ) ∓ β*v²

From this equation we define the terminal velocity, v_t2 = sqrt(m*g*|cos(θ)|/β), and the characteristic time, which is (not obvious from the equation) τ2 = m/(β*v_t2); it shows up after we integrate. Integrating this equation is straightforward. Although related, the plus version and the minus version are different integrals, and we need to evaluate both.

First, let us do the integral for the case where the terminal velocity term (labeled with subscript 2s for the quadratic model) and the velocity squared term have the same sign. This is the case where the force of gravity and the fluid resistance are in the same direction, which happens when the particle is traveling up and both forces are pointing down. To keep track of the overall sign relative to the velocity direction, we introduce the variable sign, which is either +1 or -1, on the outside of the integral. This is important in order to get the correct overall sign compared to the direction of the velocity unit vector, v̂. We continue to solve for the time as a function of the initial and final velocity to obtain:

    Δt = τ2 * ( arctan(v_0/v_t2) - arctan(v_f/v_t2) )

Now we solve this equation for the final velocity and obtain:

    v_f = v_t2 * tan( arctan(v_0/v_t2) - Δt/τ2 )    (2)

Secondly, we solve for the case where the terminal velocity and velocity squared terms have opposite signs. There is a relevant connection between the inverse hyperbolic tangent and the natural log of the form:

    artanh(x) = (1/2) * ln( (1 + x)/(1 - x) )    for |x| < 1

The absolute value bars from the integral are important in order to get the equation to involve the inverse hyperbolic tangent correctly.
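Equation (2) can be cross-checked numerically. The Python sketch below is my own test harness (parameter values arbitrary, vertical motion so cos(θ) = 1): the closed-form upward-motion update is compared against a finely sub-stepped explicit integration of the same differential equation.

```python
import math

def quad_drag_up(v0, dt, m, g, beta):
    # Equation (2): closed-form update while the object moves upward,
    # so gravity and quadratic drag both act against the motion.
    vt2 = math.sqrt(m * g / beta)
    tau2 = m / (beta * vt2)
    return vt2 * math.tan(math.atan(v0 / vt2) - dt / tau2)

m, g, beta = 1.0, 9.8, 0.1
v0, dt = 20.0, 0.01
v_closed = quad_drag_up(v0, dt, m, g, beta)

# Cross-check with many tiny explicit Euler sub-steps.
v = v0
n = 10000
h = dt / n
for _ in range(n):
    v += (-g - (beta / m) * v * v) * h
```

The closed form takes one tan/atan pair per step regardless of how stiff the drag is, which is the stability advantage over brute-force sub-stepping.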
We must answer whether the initial and final velocities are bigger than the terminal velocity. In addition, given that the initial velocity is less than the terminal velocity, is the final velocity necessarily less than the terminal velocity? It turns out that it is (see the solution below), and it is acceptable to convert both natural logarithms to inverse hyperbolic tangents. A similar question can be asked if the initial velocity is greater than the terminal velocity. So there are four different cases with which we need to deal: v_0 < v_t2, v_0 > v_t2, v_0 = v_t2, and v_t2 = 0.

For the first case, v_0 < v_t2:

    Δt = τ2 * ( artanh(v_f/v_t2) - artanh(v_0/v_t2) )

and solving for the final velocity:

    v_f = v_t2 * tanh( artanh(v_0/v_t2) + Δt/τ2 )    (3a)

For the second case, v_0 > v_t2:

    Δt = τ2 * ( arcoth(v_f/v_t2) - arcoth(v_0/v_t2) )

and solving for the final velocity:

    v_f = v_t2 * coth( arcoth(v_0/v_t2) + Δt/τ2 )    (3b)

For the third case, v_0 = v_t2, where the object begins traveling at the terminal velocity in the same direction as the force of gravity, we get:

    v_f = v_t2    (3c)

There are several ways to arrive at this result. We can take a limit of both of the first two cases, Equations (3a) and (3b), or we can note that the force in the quadratic force equation is zero, hence the object must travel at a constant velocity.

And lastly there is the case when the terminal velocity is identically zero, v_t2 = 0. It is easiest to start from the original equation, m*dv/dt = -β*v², and solve as in the previous cases. As previously, we solve this for the final velocity:

    v_f = v_0 / (1 + v_0*β*Δt/m)    (3d)

All right, this is all the information we need to solve the linear and quadratic three-dimensional cases, and we now transition to solving the three-dimensional linear case.

Solution of the Three-Dimensional Linear Case

We could solve the linear three-dimensional case in a similar manner as we solve the quadratic version. However, we choose to treat it as three separate one-dimensional problems, because the results can be used by the AI in your game. The previous equations are modified to suit each of the Cartesian directions x, y, and z. Again, gravity is in the negative z direction.
The solutions of the three different velocity equations are:

and

with τ and vT as defined previously. The terminal velocities in the horizontal directions are zero. If we integrate these equations with respect to time, then we get the position as a function of time:

and

The complete distance equations are useful for AI targeting, AI pathfinding, and calculating targeting info for players, without actually propagating the solutions forward in time step by step. In general, the distance equations are well approximated for most video games with the first couple of terms of the exponential's series expansion, since the time step for most games is very small, and this is much faster than updating with the complete position equations.

Solution of the Three-Dimensional Quadratic Case

The three-dimensional quadratic equations cannot be solved as three independent equations, since all three components are coupled by the quadratic term of the fluid resistance. One way to solve this is to break the motion into two directions, along the current velocity and perpendicular to the current velocity, and then update the velocity based on the force calculated. Along the direction of motion, the force is the sum of the gravity component and the drag term, where nz is the z component of the velocity unit vector and is equal to the negative of the cosine of the angle between the direction of motion and the gravity vector. Notice that the drag term is always negative, since it is in the opposite direction as the current velocity. Perpendicular to the direction of motion, the force is the remaining component of gravity. If the velocity is either straight up or straight down (nz = ±1), there is no component of force perpendicular to the current velocity. And if the velocity is completely horizontal, the perpendicular force is completely in the negative z direction. If you have a different coordinate system, this equation will look slightly different. For the motion along the velocity normal, there are three different cases to consider: nz > 0, nz < 0, and nz = 0.
If nz > 0, the object is moving upward, the opposite direction from gravity, and the fluid resistance is in the same direction as gravity. If nz < 0, then the object is moving in the same direction as gravity, and the fluid resistance force is in the opposite direction as gravity. Lastly, if nz = 0, then the object is moving horizontally, and gravity has no effect on horizontal motion.

For nz > 0, the relevant solution is Equation (2), with sign, vT, and τ chosen for this case. The velocity is:

For nz < 0, the relevant solution is Equation (3a), (3b), (3c), or (3d), depending on the current velocity relative to the current terminal velocity, again with sign, vT, and τ chosen for this case. The velocity update is: Equation (3a) for v0 < vT, Equation (3b) for v0 > vT, Equation (3c) for v0 = vT, or Equation (3d) for vT = 0. Notice that the characteristic time for the last instance is undefined, but it is never used in the solution, so there should not be a problem.

For the force perpendicular to the initial velocity, the updated velocity is the same as Equation (3) with v0 = 0 and its own distinct terminal velocity and characteristic time. This leads to:

For most cases this can be approximated (since the characteristic time is usually large and the update time is usually small for games) as:

(4)

This last equation should look familiar. So usually there is no need to calculate the separate terminal velocity or the characteristic time for the perpendicular motion if the mentioned approximations are valid.

There is one issue that occurs when a slow-moving object changes from traveling upward to going downward, as shown in Figure 2.6.2. If this case is not handled, then the object appears to go back in the direction it came from, when in fact it would stop traveling in the horizontal direction and travel down along the direction of gravity. There are several ways to handle this.
The most straightforward approach is to propagate the object for the time it takes to get to the top with Equation (2) and then for the rest of the time propagate with Equation (3a). From Equation (2), we can calculate the time it takes to stop moving in the positive velocity direction. The result is:

(5)

If the time to the top is less than the total time to propagate, then subtract the time to the top from the total time to propagate. Then use the downward velocity update for the time remaining and use zero as the initial velocity. There might be a slight discontinuity in the object's smooth motion on the screen, but usually there is not. If there is too much, then update the position in a more rigorous manner.

The last thing to consider is how to update the position. All of these equations are integrable, which is more accurate, but they provide little extra accuracy noticeable in a gaming environment and are computationally expensive to evaluate. For most game purposes, it is accurate enough to use Euler integration to calculate the position, where the time, t, in the equation is the time since the last update and is not the global time. For the Golf Globe game on an iPod Touch, the game runs smoothly with no visible numerical issues at 60 frames per second by updating the position using Euler integration.

Figure 2.6.2 A bad thing that can happen for slow-moving particles and/or large update times.

Pseudocode

This is the pseudocode for a function that calculates the final velocity, velocityFinal, given the initial velocity, velocityInitial, of an object after the time deltaTime. This does not update the position of the object, which should be updated separately but with the same deltaTime.

void updateVelocityWithFluidResistance( const float deltaTime,
    const ThreeVector& velocityInitial, ThreeVector& velocityFinal )
{
    // Calculate the velocity normal, or zero it.
    // Calculate the terminal velocity and characteristic time along the velocity normal.
    // For objects going up, use Equation (2), but first check with Equation (5) to
    // make sure the velocity does not reach the top.
    // For objects going down, use Equation (3a), (3b), (3c), or (3d), as appropriate
    // to the initial velocity at the beginning of the frame.
    // Update the velocity perpendicular to the initial velocity with Equation (4).
}

Comparison of Linear versus Quadratic Solutions

There are benefits to using either the linear version or the quadratic version. The computational advantages lie with the linear solution, though the quadratic computational cost is hardly prohibitive, especially if there are not too many objects updated with the quadratic model. And the quadratic model looks and feels much nicer, especially in a game where the user is getting feedback from movement due to handheld motion, such as on the iPod Touch or another handheld device.

There are several advantages to the linear solution: There are exact velocity and position equations for all time, which is useful for AI and targeting; it is faster computationally; and the exponential in the solution is easily expanded, making it only minutely more computationally intense than updates with gravity only. One disadvantage of the linear solution is that it is not as realistic, especially when compared directly to the quadratic case.

There are some advantages to using the quadratic solution: It is more realistic, and it is not hugely more computationally intense (as long as there are not too many objects). There are several disadvantages to the quadratic solution: It is computationally more intense, especially if there are many objects; there is no analytic position solution for all time, so it is not as easy to use for AI and player targeting; and its approximations are more difficult and are more numerically sensitive to the input drag parameters.
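The pseudocode above can be made concrete for the purely vertical case. The sketch below assumes the standard closed forms for quadratic drag (a tangent while rising, a hyperbolic tangent or cotangent while falling) corresponding to Equations (2), (3a)–(3c), and (5); the vT = 0 case (3d) is omitted, and the names and sign conventions (velocity positive upward) are illustrative, not the gem's:

```cpp
#include <cmath>

// Along-velocity quadratic-drag update for vertical motion, velocity v
// measured positive upward. vT is the terminal speed, tau = vT / g.
float quadraticDragVertical(float v0, float dt, float vT, float g)
{
    const float tau = vT / g;
    if (v0 > 0.0f) {                                   // rising: Equation (2)
        const float tTop = tau * std::atan(v0 / vT);   // Equation (5): time to apex
        if (dt < tTop)
            return vT * std::tan(std::atan(v0 / vT) - dt / tau);
        dt -= tTop;                                    // past the apex: fall from rest
        v0 = 0.0f;
    }
    const float s0 = -v0;                              // downward speed
    if (s0 < vT)                                       // Equation (3a)
        return -vT * std::tanh(std::atanh(s0 / vT) + dt / tau);
    if (s0 == vT)                                      // Equation (3c): stays at terminal
        return -vT;
    // Equation (3b): starting faster than terminal, speed decays toward vT.
    return -vT / std::tanh(std::atanh(vT / s0) + dt / tau);
}
```

Note how the apex split handles the Figure 2.6.2 issue directly: past the apex, the object falls from rest rather than retracing its path.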
There is a possible way to balance the two approaches if there are too many objects to update with the quadratic drag model: Use the linear model for the vast majority of the objects, and then use the quadratic model for the more important, or highly visible, objects. For both the linear and the quadratic solutions, it is a good approximation to use Euler integration to update the object position. It will save big on computational cost if there are many objects, especially in the quadratic case.

Conclusion

The concept of drag is simple: Slow things down in the opposite direction they are moving. However, it is difficult to implement in a numerically accurate and stable manner. The simple and mathematically accurate linear drag model as presented here would be an improvement to many current games. The mathematically and physically accurate quadratic model is an additional improvement. There are many areas of simulation in games that would visibly benefit from using the quadratic drag model instead of the linear drag model: boats, sailboats (both with respect to the water and the wind in the sails), parachutes, car racing, snow-globe games, golf ball trajectories, craft flying through atmospheres, and artillery shells.

Along the lines of this gem, there are several areas that would be interesting to research further. Physically based drag coefficients depend on all sorts of factors, including temperature, density, type of fluid, and the shape of the object. For spacecraft entering an atmosphere, several of these things are relevant and can make for some interesting gameplay while flying in to land or chasing AIs or other players in a multiplayer game.

References

[Davis86] Davis, A. Douglas. Classical Mechanics. Academic Press, 1986.

[MathWorld] "Wolfram MathWorld." n.d.

[Salas82] Salas, S. L. and Einar Hille. Calculus: One and Several Variables with Analytic Geometry: Part 1.
John Wiley and Sons, 1982.

2.7 Application of Quasi-Fluid Dynamics for Arbitrary Closed Meshes

Krzysztof Mieloszyk, Gdansk University of Technology

Physics in real-time simulations, as well as in computer games, has been gaining importance. Game producers have observed that advanced graphical realism alone is not enough to keep players content and that the virtual world created in a game should follow the rules of physics in order to behave more realistically. Until now, such realism has mainly been reflected in the interactions between material objects. However, the phenomenon of reciprocity between media such as liquids or gases and more complex objects is also of high importance. With increasing processor power, a growing number of physics simulation methods that were initially developed purely for scientific reasons are being utilized in the entertainment market. However, simulation methods commonly used in games tend to be relatively simple in the computational sense or very focused on a specific case. An attempt to adapt them to other game variants is often impossible or requires complicated parameters with values that are difficult to acquire. Hence, more universal methods that allow game physics to be modeled more easily are valuable and attractive tools for hobbyists and creators of later game extensions.

Requirements for the Mesh Representing the Object

Commonly, the medium is a physical factor that in various ways affects the complete surface of the object. Hence, for the purpose of fluid dynamics simulation, every shape needs to be represented as a closed body with all its faces pointing outward and fully covering its outside surface.
The inside should be looked upon as full; it must not have any empty spaces or "bubbles." Every edge should directly contact exactly two surfaces, while the entire object mesh must fulfill the following condition:

V + F − E = 2 (1)

where V = the number of vertices, F = faces, and E = edges.

Because the preliminary interaction value is calculated on points, it is convenient to organize the mesh as a list of points in the form of coordinates and a list of triangles represented as three sequential indices of its vertices. Additionally, every triangle should have a normal vector of its surface, as well as the surface area value. In some cases, depending on the computational method used, the position of the triangle center can also be useful. In this article, we will sometimes refer to vertices as points and triangle meshes as faces.

Physical Rudiments

Using some physical basics of fluid mechanics (referring, among others, to Pascal's Law), pressure is defined as the quotient of the force F, which is exerted perpendicularly to the surface sector limiting the given object, and the surface area S of this sector [Bilkowski83].

P = F/S, [Pa] = [N/m²] (2)

The total pressure affecting the body surface at a given point in the medium can be represented as a sum of two components: static and dynamic pressure.

Ptotal = Pstat + Pdyn [Pa] (3)

In order to define the static pressure that is needed to create buoyancy in a selected medium, we need the medium density ρ [kg/m³], the gravitational field intensity g [m/s²], and the distance h to the medium surface. (In the case of air, the altitude is needed.)

Pstat = ρ g h [Pa] (4)

Dynamic pressure originates from the kinetic energy of the medium, which results from its movement, and can be determined by knowing the medium density ρ as well as the fluid velocity V [m/s] at a selected position (applicable for subsonic speed).
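The closed-mesh condition of Equation (1) can be checked mechanically by counting unique undirected edges in the triangle list. A minimal sketch, with illustrative names:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Verify V + F - E = 2 (Equation (1)) for a triangle list indexing into a
// vertex array of the given size.
bool satisfiesEulerFormula(std::size_t vertexCount,
                           const std::vector<std::array<std::uint32_t, 3>>& tris)
{
    std::set<std::pair<std::uint32_t, std::uint32_t>> edges;
    for (const auto& t : tris)
        for (int i = 0; i < 3; ++i) {
            const std::uint32_t a = t[i], b = t[(i + 1) % 3];
            edges.insert({std::min(a, b), std::max(a, b)});  // undirected edge
        }
    return vertexCount + tris.size() - edges.size() == 2;
}
```

A tetrahedron (V = 4, F = 4, E = 6) passes; an open surface such as a lone triangle fails, flagging a mesh unsuitable for this simulation.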
Pdyn = ρV²/2 [Pa] (5)

Equation (5) can be extended with a nondimensional pressure coefficient cp, whose value can be calculated as the cosine of the angle between the normal vector of the surface sector and the vector of the actual velocity at a selected point (see Figure 2.7.1). This can also be represented as a scalar product of the surface normal and the normalized velocity vector.

cp = cos(α) = n · V/|V| (6)

As an effect, we obtain a general equation for the dynamic pressure exerted on a segment of the surface, shown in Equation (7). The method presented in this gem solves the computation of dynamic pressure based on velocity and assumes that we use a rigid body. The use of physical models based on joints or springs is also possible.

Pdyn = cp ρV²/2 [Pa] (7)

Pre-Computation Based on a Triangle Mesh

The whole issue of calculating the pressure distribution on a mesh comes down to computing the values on the mesh vertices (or centers of triangles), taking into consideration the immersion, velocity, surface area, and normal of each of the triangles that form the object. This allows us to calculate the pressure force that influences the physical model. Additional problems arise in the case of the object being placed in two different media simultaneously. Here the division plane separates the object faces into those faces that are totally immersed in only one medium and faces that are located in both of them. The factors requiring serious consideration are characterized in the following section.

Calculating the Distance to the Medium Boundary

For calculations of the static pressure affecting the object, it is necessary to define the distance h between the selected point and the medium boundary (for example, a water surface). For this, the scalar product can be applied, giving us the following Equation (8), schematically illustrated in Figure 2.7.2.

h = nsurface · (P − Psurface) [m] (8)

Figure 2.7.1 Normal vector of the surface and normal vector of velocity and the angle between them.
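The pressure-coefficient scaling of Equations (5) through (7) can be sketched as follows; this assumes the standard ½ρV² form of dynamic pressure, and the function and type names are illustrative, not the gem's:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Dynamic pressure on a surface sector with unit normal n, scaled by the
// pressure coefficient cp = cos(alpha) = n . V/|V| (Equation (6)).
float dynamicPressure(const Vec3& surfaceNormal, const Vec3& velocity, float rho)
{
    const float speed = std::sqrt(dot(velocity, velocity));
    if (speed == 0.0f) return 0.0f;                          // still fluid: no Pdyn
    const float cp = dot(surfaceNormal, velocity) / speed;   // Equation (6)
    return cp * 0.5f * rho * speed * speed;                  // Equation (7)
}
```

Flow head-on into a face gives cp = 1 (full dynamic pressure); flow tangent to the face gives cp = 0.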
Assuming that the normal vector of the surface is pointing upward, a negative value of the obtained distance means that the point is located under the "water" surface, and a positive one means it is above the surface. In cases when the boundary is located at the virtual world's "sea level," we know in which of the two media the point is located. To reduce the possibility of a calculation error, the point Psurface (through which the surface dividing the media is crossing) should correspond to the projection of the object's center onto this surface, in accordance with its normal vector.

Calculating the Point Velocity

Because the objects simulated in the virtual world are frequently moving, one of the most important factors to include in their dynamics is the influence of the surrounding medium. Therefore, it is required to compute the velocity of each of the object mesh points (Equation (9)) as a sum of the mass center's linear velocity Vobj and the contribution of the object's angular velocity ω (Figure 2.7.3). The velocity Vp obtained this way represents the global velocity for a selected point P. To calculate the dynamic pressure, the local velocity vector, which is equal to the negative global point velocity, is required [Padfield96].

Figure 2.7.2 Defining the distance of the point from the surface.

Figure 2.7.3 Velocity of the arbitrary point belonging to the moving object.

We represent this mathematically as follows:

Vp = Vobj + ω × r [m/s] (9)

Morphing Shape Objects

The simulated object is not always a rigid body, because some objects move by changing their body shape. A fish or a bird can be used as an example here, because various body fragments of those animals move with different velocities in relation to other parts. For such objects, when calculating the point velocity, it is necessary to take the so-called morphing effect into consideration.
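The rigid-body point velocity of Equation (9) can be sketched directly; r is the offset of point P from the center of mass, and the names are illustrative, not the gem's:

```cpp
struct Vec3 { float x, y, z; };

static Vec3 add(const Vec3& a, const Vec3& b)
{
    return {a.x + b.x, a.y + b.y, a.z + b.z};
}

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Global velocity of a mesh point: Vp = Vobj + omega x r (Equation (9)).
// The local flow velocity used for dynamic pressure is the negation of this.
Vec3 pointVelocity(const Vec3& vObj, const Vec3& omega, const Vec3& r)
{
    return add(vObj, cross(omega, r));
}
```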
The simplest method here is to find the change between the old and new local positions of the mesh points (see Figure 2.7.4), followed by the velocity computation, keeping in mind the time difference Δt between the morphing animation frames. As a result, and as an extension to Equation (9), we obtain Equation (10), which is presented below.

Vp = Vobj + ω × r + ΔPlocal/Δt [m/s] (10)

Triangle Clipping

Because there are situations in which an object triangle is located in both media simultaneously, it is necessary to determine the borderline and find those parts that are completely submerged in only one of the media. To achieve this, the triangle should be divided into three triangles, calculated from the cutting points of the edges that cross the medium border. By applying Equation (11), we obtain PAB and PAC, and by dividing the triangle ABC, we create one triangle, A PAB PAC, located in one medium, and two arbitrary triangles located in the other medium (for example, B C PAB and C PAC PAB), as presented in Figure 2.7.5.

Figure 2.7.4 Moving vertex position in morphing mesh.

It is important to keep in mind that the newly created triangles must have the same normal vectors as the primary triangle. Even though each of the new triangles has the same normal, it is still necessary to recalculate the current surface area of each of those triangles. It might be very useful to apply the cross-product property, described in Equation (12), for the exemplary triangle ABC. This surface area can then be used to calculate the pressure force.

(11)

S(ABC) = |AB × AC| / 2 (12)

Calculation of the Pressure on the Triangle Surface

To calculate the pressure force exerted on an arbitrary triangle of the body, the pressure at each of the triangle vertices should be taken into consideration.
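The clipping and area computations just described (Equations (11) and (12)) can be sketched as follows; the edge cut point is found by linearly interpolating the signed distances to the medium boundary, and the names are illustrative, not the gem's:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b)
{
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Point on edge AB where the signed distances hA, hB to the medium boundary
// interpolate to zero (valid when hA and hB have opposite signs).
Vec3 edgeCutPoint(const Vec3& A, const Vec3& B, float hA, float hB)
{
    const float t = hA / (hA - hB);
    return {A.x + t * (B.x - A.x), A.y + t * (B.y - A.y), A.z + t * (B.z - A.z)};
}

// Triangle area via the cross-product property (Equation (12)).
float triangleArea(const Vec3& A, const Vec3& B, const Vec3& C)
{
    const Vec3 n = cross(sub(B, A), sub(C, A));
    return 0.5f * std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
}
```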
Hence, we need to calculate the distance of each vertex from the medium surface for the static pressure (Equation (6)) and the vertex local velocity, taking into account the angle between the velocity and the triangle surface normal, for the dynamic pressure (Equation (7)). In the next step, the average complete pressure over the triangle vertices is calculated. However, it is useful to know that some simplifications are possible for cases when any mesh vertex has a velocity contribution from ω × r that is considerably smaller than the object's linear velocity Vobj, or when ω is close to zero. As a result, the pressure gradient distribution is almost linear, and the calculation of the complete pressure exerted on the triangle surface is reduced to obtaining the pressure at the triangle center (based on h and Vp). Thus, the pressure force is a product of the triangle normal, its surface area, and the pressure (Equation (2)). For models based on a rigid body, calculation of the complete force and the torque comes down to summing the pressure forces from each surface fragment, as well as summing all torques, computed as cross products of the pressure force vectors and the lever arms to the triangle centers.

Figure 2.7.5 Assigning the cutting points on the surface by edge AB.

Simple Correction of Dynamic Pressure

The method for computing the static pressure presented in this gem is based solely on simple physical rules. However, the effect of dynamic pressure is one of the most complicated problems of mechanics, for which no simple modeling method exists. The computational method used here is highly simplified and based on a heuristic parameter cp. This is sufficient for the purpose of a physical simulation that visually resembles the correct one; however, the obtained values strongly diverge from the real values, resulting in too little lift and too much drag.
Hence, when calculating the force vector obtained from dynamic pressure, it is beneficial to introduce a correction in accordance with the velocity vector at the selected point. As in Equation (13), the factors used here are determined empirically and should be interpreted as recommended, not required. The corrected dynamic pressure force vector of the selected triangle has to be summed with the static pressure force vector occurring on its surface. The results obtained in this way approximate the results for a real object (for example, an airplane) for which the dynamic pressure is most important.

(13)

Conclusion

The simplified method of computing the flow around an arbitrary closed body presented here allows us to easily obtain approximate values of the forces exerted by a surrounding medium. This is possible thanks to simulating some fundamental physical phenomena, such as static and dynamic pressure. Figure 2.7.6 shows screenshots from an application that interactively simulates objects in real time using the algorithm presented here. This clearly shows the potential for use in simulating the behavior of objects with fairly complicated shapes. Control is carried out by changing the interaction between the medium and the object, by morphing the shape of the object's surface, or by moving or rotating the object's elements. This allows us to create controllable planes, ships, airships, submarines, hydrofoils, seaplanes, swimming surfers, or drifting objects, as well as objects that undergo morphing, such as birds or fish. All of this is simulated based on the object mesh only.

The simplifications used in the algorithm bring some inaccuracies. The main problem here is the lack of consideration for the reciprocal interferential influence of each face on the final pressure values. This is a consequence of analyzing each face separately, as if it were moving in a flow consistent with its own velocity.
In reality, the velocity of the flow at a given fragment of the body surface depends on the velocity over the adjacent surface fragments [Wendt96, Ferziger96]. The dependence of the pressure on the coefficient parameter cp (as an angle cosine) is a generalization; in fact, this issue would require some more specialized characteristics. Another drawback of the material presented here is that the medium viscosity has not been taken into consideration, which also adds to inaccuracies in the results. These computation versus performance tradeoffs need to be evaluated when comparing real-time methods and those that may be used in fluid dynamics engineering and design.

Implementations of the equations in this article, motivated by specific examples, are available on the CD-ROM.

References

[Bilkowski83] Bilkowski, J. and J. Trylski. Physics. Warsaw: PWN, 1983.

[Ferziger96] Ferziger, Joel H. and Milovan Perić. Computational Methods for Fluid Dynamics. Berlin: Springer-Verlag, 1996.

[Padfield96] Padfield, Gareth D. Helicopter Flight Dynamics. Oxford: Blackwell Science, 1996.

[Wendt96] Wendt, John F. Computational Fluid Dynamics. Berlin: Springer-Verlag, 1996.

Figure 2.7.6 Examples of applications: drifting object (top), static object (middle), morphing object (bottom). (Grayscale marks the pressure distribution in the middle and bottom images.)

2.8 Approximate Convex Decomposition for Real-Time Collision Detection

Khaled Mamou

Collision detection is essential for realistic physical interactions in video games and physically based modeling. To ensure real-time interactivity with the player, video game developers usually approximate the 3D models, such as game characters and static objects, with a set of simple convex shapes, such as ellipsoids, capsules, or convex hulls.
While adequate for some simple interactions, these basic shapes provide poor approximations for concave surfaces and generate false collision detections (see Figure 2.8.1). In this gem, we present a simple and efficient approach to decomposing a 3D mesh into a set of nearly convex surfaces. This decomposition is used to compute a faithful approximation of the original 3D mesh, particularly adapted to collision detection. First, we introduce the approximate convex decomposition problem. Next, our proposed segmentation technique and its performance characteristics are evaluated. Finally, we conclude with some areas of future work.

Figure 2.8.1 Convex hull versus approximate convex decomposition. (a) Original mesh (b) Convex hull of (a) (c) Approximate convex decomposition of (a)

Approximate Convex Decomposition

Let S be a triangular mesh with connectivity κ and geometry γ. Intuitively, S is a piecewise linear surface composed of a set of triangles stitched along their edges. The connectivity κ is represented as a simplicial complex describing the topology of the mesh. κ is composed of a set of vertices χ = {v1, v2, ..., vV} ⊂ ℕ (ℕ denotes the set of positive integers and V the number of vertices of κ), together with a set of non-empty subsets of χ, called simplices, verifying the following conditions:

• Each vertex v ∈ χ is a simplex of κ, and
• Each subset of a simplex of κ is also a simplex of κ.

A d-simplex is defined as a simplex composed of d + 1 elements of χ. The 0-simplices correspond to the set of vertices of κ, the 1-simplices to the set of its edges, and the 2-simplices to the set of its triangles, denoted θ = {t1, t2, ..., tT}. (T represents the number of triangles.) The geometry γ specifies the shape of the surface by associating 3D positions, and usually surface normals, with the vertices of κ.
A surface S is convex if it is a subset of the boundary of its convex hull (in other words, of the minimal convex volume containing S). Computing an exact convex decomposition of an arbitrary surface S consists of partitioning it into a minimal set of convex sub-surfaces. Chazelle et al. prove that computing such a decomposition is an NP-hard problem and evaluate different heuristics to resolve it [Chazelle95]. Lien et al. claim that exact convex decomposition algorithms are impractical, since they produce a high number of clusters, as shown in Figure 2.8.2 [Lien04]. To provide a tractable solution, they propose to relax the exact convexity constraint and consider instead the problem of computing an approximate convex decomposition (ACD) of S. Here, for a fixed parameter ε > 0, the goal is to determine a partition Π = {π1, π2, ..., πK} of θ with a minimal number of clusters K, verifying that each cluster has concavity lower than ε.

(a) Exact convex decomposition [Chazelle95] generates 7,611 clusters (b) Approximate convex decomposition generates 11 clusters

Figure 2.8.2 Exact convex decomposition versus approximate convex decomposition.

The ACD problem has been addressed in a number of recent publications [Lien04, Lien08, Kraevoy07, Attene08]. Attene et al. apply a hierarchical segmentation approach to a tetrahedral mesh generated from S [Attene08]. The tetrahedralization process exploited by Attene et al. is, in practice, hard to compute and introduces extra computational complexity. Other methods avoid this limitation by considering the original 3D mesh directly [Lien04, Lien08, Kraevoy07]. Kraevoy et al. introduce an incremental Lloyd-type segmentation technique exploiting a concavity-based seed placement strategy [Kraevoy07]. Here, the concavity measure is defined as the area-weighted average of the distances from the clusters to their corresponding convex hulls. Lien et al.
claim that such a concavity measure does not efficiently capture the important features of the surface [Lien04, Lien08]. They propose instead to compute the maximal distance between the mesh vertices and the clusters' convex hulls. Their divide-and-conquer approach iteratively divides the mesh until the concavity of each sub-part is lower than the threshold ε, as shown in Figure 2.8.3. Here, at each step i, the vertex with the highest concavity is selected, and the cluster to which it belongs is divided into two sub-clusters by considering a bisection plane incident to that vertex. The main limitation of this approach is related to the choice of the "best" cut plane, which requires a sophisticated analysis of the model's features. A public implementation of [Lien04] is provided in [Ratcliff06]. Here, the sophisticated feature analysis procedure is replaced by a simple cut plane selection strategy that splits each cluster along its longest direction. This suboptimal choice generates over-segmentations, which are optimized by applying a post-processing procedure aiming at aggregating the maximal number of clusters while ensuring the maximal concavity constraint. As illustrated in Figure 2.8.7, the ACDs produced by Ratcliff are, in practice, suboptimal, since the aggregation procedure is applied only to clusters generated by plane-based bisections. To overcome the aforementioned limitations, we introduce, in the next section, a simple and efficient hierarchical ACD approach for 3D meshes.

Figure 2.8.3 The divide-and-conquer ACD approach introduced in [Lien04] and [Lien08].

Hierarchical Approximate Convex Decomposition

Our proposed hierarchical approximate convex decomposition (HACD) proceeds as follows. First, the dual graph of the mesh is computed. (See the upcoming "Dual Graph" section.)
Then, its vertices are iteratively clustered by successively applying topological decimation operations, while minimizing a cost function related to the concavity and the aspect ratio of the produced segmentation clusters. Finally, by approximating each cluster with the boundary of its convex hull [Preparata77], a faithful approximation of the original mesh is computed. This surface approximation is piecewise convex and has a low number of triangles (when compared to T), which makes it particularly well adapted for collision detection. Let's first recall the definition of the dual graph associated with a 3D mesh.

Dual Graph

The dual graph S* associated with the mesh S is defined as follows. Each vertex of S* corresponds to a triangle of S. Two vertices of S* are neighbors (in other words, connected by an edge of the dual graph) if and only if their corresponding triangles in S share an edge. Figure 2.8.4 illustrates an example of a dual graph for a simple 3D mesh.

Decimation Operator

Once the dual graph is computed, the algorithm starts the decimation stage, which consists of successively applying half-edge collapse operations to S*. Each half-edge collapse operation applied to an edge (v,w), denoted hecol(v,w), merges the two vertices v and w, as illustrated in Figure 2.8.5. The vertex w is deleted, and all its incident edges are connected to v.

(a) Simple mesh (b) Dual graph of (a)

Figure 2.8.4 Example of a dual graph.

Let A(v) be the list of the ancestors of the vertex v. Initially, A(v) is empty.
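The dual-graph construction defined above can be sketched by hashing each mesh edge to the first triangle that uses it; the second triangle sharing that edge yields a dual edge. Names are illustrative, not the gem's:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Build the dual-graph edges: one dual vertex per triangle, one dual edge per
// pair of triangles sharing a mesh edge. Returns pairs of triangle indices.
std::vector<std::pair<std::size_t, std::size_t>>
dualGraphEdges(const std::vector<std::array<std::uint32_t, 3>>& tris)
{
    std::map<std::pair<std::uint32_t, std::uint32_t>, std::size_t> edgeOwner;
    std::vector<std::pair<std::size_t, std::size_t>> dual;
    for (std::size_t t = 0; t < tris.size(); ++t)
        for (int i = 0; i < 3; ++i) {
            const std::uint32_t a = tris[t][i], b = tris[t][(i + 1) % 3];
            const auto key = std::make_pair(std::min(a, b), std::max(a, b));
            const auto it = edgeOwner.find(key);
            if (it == edgeOwner.end())
                edgeOwner[key] = t;                 // first triangle on this edge
            else
                dual.push_back({it->second, t});    // shared edge -> dual edge
        }
    return dual;
}
```

For a closed mesh, every mesh edge produces exactly one dual edge, so a tetrahedron's dual graph has six edges.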
At each operation hecol(v,w) applied to the vertex v, the list A(v) is updated as follows:

A(v) ← A(v) ∪ {w} ∪ A(w)    (1)

Simplification Strategy

The decimation process described in the previous section is guided by a cost function describing the concavity and the aspect ratio of the surface S(v,w) resulting from the unification of the vertices v and w and their ancestors [Garland01]:

S(v,w) = ∪ { τ(x) : x ∈ {v,w} ∪ A(v) ∪ A(w) }    (2)

where τ(x) denotes the triangle of S dual to the vertex x of S*. As in [Garland01], we define the aspect ratio of the surface S(v,w) as follows:

EShape(v,w) = ρ²(S(v,w)) / (4π · σ(S(v,w)))    (3)

where ρ(S(v,w)) and σ(S(v,w)) represent the perimeter and the area of S(v,w), respectively. The cost EShape(v,w) was introduced in order to favor the generation of compact clusters. In the case of a disk, the cost EShape equals one. The more irregular a surface, the higher its aspect ratio cost. Inspired by [Lien08], we define the concavity C(v,w) of S(v,w) as follows (see Figure 2.8.6):

C(v,w) = max over M ∈ S(v,w) of ‖M − P(M)‖    (4)

where P(M) represents the projection of the point M on the convex hull CH(v,w) of S(v,w), with respect to the half ray with origin M and direction normal to the surface S(v,w) at M.

Figure 2.8.5 Half-edge collapse decimation operation.

The global decimation cost E(v,w) associated with the edge (v,w) is given by:

E(v,w) = C(v,w)/D + α · EShape(v,w)    (5)

where
• D is a normalization factor equal to the diagonal of the bounding box of S, and
• α is a parameter controlling the contribution of the shape factor EShape(v,w) with respect to the concavity cost. (See the upcoming "Choice of the Parameter α" section.)

At each step of the decimation process, the hecol(v,w) operation with the lowest decimation cost is applied, and a new partition P(i) = {π1(i), π2(i), ..., πK(i)} is computed as follows:

πk(i) = {vk} ∪ A(vk)    (6)

where {v1, v2, ..., vK} represents the vertices of the dual graph S* obtained after i half-edge collapse operations. This process is iterated until all the edges of S* generating clusters with concavities lower than ε are decimated.
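One plausible reading of these costs, sketched in code: the aspect-ratio cost equals one for a disk and grows with irregularity, and the global cost combines normalized concavity with the shape term. The function names are ours, and the concavity C(v,w) is assumed to be computed elsewhere, since it requires convex hulls and ray casts.

```python
import math

def shape_cost(perimeter, area):
    """Aspect-ratio cost EShape = rho^2 / (4*pi*sigma).
    For a disk of radius r: (2*pi*r)^2 / (4*pi * pi*r^2) = 1."""
    return perimeter ** 2 / (4.0 * math.pi * area)

def decimation_cost(concavity, perimeter, area, bbox_diagonal, alpha):
    """Global cost E(v,w) = C(v,w)/D + alpha * EShape(v,w), used to order
    half-edge collapses (lowest cost first)."""
    return concavity / bbox_diagonal + alpha * shape_cost(perimeter, area)
```

In a full implementation, edges would sit in a priority queue keyed by this cost, re-evaluated as collapses change the clusters.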
Choice of the Parameter α

The clusters detected during the early stages of the algorithm are composed of a low number of adjacent triangles with a concavity almost equal to zero. Therefore, the decimation cost E is dominated by the aspect-ratio-related cost EShape, which favors the generation of compact surfaces. This behavior is progressively inverted during the decimation process, since the clusters become more and more concave. To ensure that the cost (α·EShape) has no influence on the choice of the later decimation operations, we have set the parameter α as follows:

α = ε / (10 · D)    (7)

Figure 2.8.6 Concavity measure for a 3D mesh.

This choice guarantees, for disk-shaped clusters, that the cost (α·EShape) is 10 times lower than the concavity-related cost ε/D.

Experimental Results

To validate our approach, we have first compared its segmentation results to those of [Ratcliff06], which provides a simplified version of the original algorithm described in [Lien04] and [Lien08]. In Figure 2.8.7, we compare the ACDs generated by our approach to those obtained by using Ratcliff's method. Here, the accuracy of the generated piecewise convex approximations is objectively evaluated by using the root mean square (RMS) and Hausdorff errors [Aspert02]. Let's recall that the RMS error measures the mean distance from the original mesh S to its piecewise convex approximation S′. It is defined as follows:

RMS(S,S′) = (1/D) · sqrt( (1/a(S)) ∫∫ over p ∈ S of d(p,S′)² dS )    (8)

where D is the diagonal of the bounding box of S, a(S) is its area, and d(p,S′) is the distance from a point p ∈ S to S′. The distance d(p,S′) is given by:

d(p,S′) = min over p′ ∈ S′ of ‖p − p′‖    (9)

The Hausdorff error, denoted H, measures the maximal distance from S to S′ and is defined as follows:

H(S,S′) = (1/D) · max over p ∈ S of d(p,S′)    (10)

The reported results show that the proposed HACD technique provides significantly (that is, from 20 percent to 80 percent) lower RMS and H errors, while detecting a lower number of clusters.
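A discrete version of these error measures can be sketched by sampling both surfaces into point sets. This is a hedged simplification: a real implementation samples the surfaces densely and measures point-to-triangle distances [Aspert02], whereas here both S and S′ are approximated by point samples, and the names are ours.

```python
import math

def d_to_surface(p, samples_s_prime):
    """Equation (9): distance from p to the closest sample of S'."""
    return min(math.dist(p, q) for q in samples_s_prime)

def rms_error(samples_s, samples_s_prime, bbox_diagonal):
    """Equation (8), with the area integral replaced by a sample mean
    over (assumed uniform) surface samples of S."""
    mean_sq = sum(d_to_surface(p, samples_s_prime) ** 2
                  for p in samples_s) / len(samples_s)
    return math.sqrt(mean_sq) / bbox_diagonal

def hausdorff_error(samples_s, samples_s_prime, bbox_diagonal):
    """Equation (10): maximal distance from S to S', normalized by D."""
    return max(d_to_surface(p, samples_s_prime)
               for p in samples_s) / bbox_diagonal
```

Note that both measures are one-sided (from S to S′), matching the definitions above.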
Figures 2.8.7(j) through 2.8.7(r) clearly show the limitations of the plane-based bisection strategy and the aggregation post-processing procedure of Ratcliff, which generate over-segmentations and poor ACDs. Color Plate 8 presents the segmentation results and the approximate convex decompositions generated by our approach for different 3D meshes. For all the models, the generated segmentations ensure a concavity lower than ε and guarantee that the maximal distance from S to S′ is lower than 3 percent of D. Therefore, the generated piecewise convex approximations provide faithful approximations of the original meshes with a small number of clusters. Moreover, our technique successfully detects the convex parts and the anatomical structure of the analyzed 3D models. For all of the models shown in Color Plate 8, the piecewise convex approximations were computed by considering an approximation of the clusters' convex hulls with a maximum of 32 vertices for each. The number of triangles composing the obtained convex surfaces is lower than 8 percent of T. Furthermore, the piecewise convexity property makes the generated approximations particularly well suited to collision detection.

Figure 2.8.7 Comparative evaluation: (a,d,g,j,m,p) original meshes, (b,e,h,k,n,q) piecewise convex approximations generated by [Ratcliff06], and (c,f,i,l,o,r) piecewise convex approximations generated by the proposed HACD technique:
(a) T = 5804;  (b) K = 28, RMS = 0.8, H = 5.3;  (c) K = 23, RMS = 0.7, H = 2.6
(d) T = 19563; (e) K = 26, RMS = 0.9, H = 5.2;  (f) K = 22, RMS = 0.5, H = 2.5
(g) T = 39698; (h) K = 11, RMS = 1.3, H = 1.7;  (i) K = 11, RMS = 0.7, H = 1.4
(j) T = 54344; (k) K = 19, RMS = 0.8, H = 4.2;  (l) K = 14, RMS = 0.27, H = 1.5
(m) T = 27150; (n) K = 6, RMS = 0.9, H = 4.3;   (o) K = 4, RMS = 0.37, H = 1.9
(p) T = 15241; (q) K = 16, RMS = 1.2, H = 6.5;  (r) K = 14, RMS = 0.4, H = 2.4
Conclusion

We have presented a hierarchical segmentation approach for approximate convex decomposition of 3D meshes. The generated segmentations are exploited to construct faithful approximations of the original mesh by a set of convex surfaces. This new representation is particularly well suited for collision detection. We have shown that our proposed technique efficiently decomposes a concave 3D model into a small set of nearly convex surfaces while automatically detecting its anatomical structure. This property makes the proposed HACD technique an ideal candidate for skeleton extraction and pattern recognition applications.

References

[Aspert02] Aspert, N., et al. "MESH: Measuring Error Between Surfaces Using the Hausdorff Distance." IEEE International Conference on Multimedia and Expo 1 (2002): 705–708.
[Attene08] Attene, M., et al. "Hierarchical Convex Approximation of 3D Shapes for Fast Region Selection." Computer Graphics Forum 27.5 (2008): 503–522.
[Chazelle95] Chazelle, B., et al. "Strategies for Polyhedral Surface Decomposition: An Experimental Study." Symposium on Computational Geometry (1995): 297–305.
[Garland01] Garland, M., et al. "Hierarchical Face Clustering on Polygonal Surfaces." Symposium on Interactive 3D Graphics (2001): 49–58.
[Hoppe96] Hoppe, H. "Progressive Meshes." International Conference on Computer Graphics and Interactive Techniques (1996): 99–108.
[Kraevoy07] Kraevoy, V., et al. "Model Composition from Interchangeable Components." Pacific Conference on Computer Graphics and Applications (2007): 129–138.
[Lien04] Lien, J.M., et al. "Approximate Convex Decomposition." Symposium on Computational Geometry (2004): 457–458.
[Lien08] Lien, J.M., et al. "Approximate Convex Decomposition of Polyhedra and Its Applications." Computer Aided Geometric Design (2008): 503–522.
[Preparata77] Preparata, F. P., et al. "Convex Hulls of Finite Sets of Points in Two and Three Dimensions." Communications of the ACM 20.2 (1977): 87–93.
[Ratcliff06] Ratcliff, J.
"Approximate Convex Decomposition." John Ratcliff's Code Suppository. April 2006.

SECTION 3: AI

Introduction
Borut Pfeifer

With recent advances in graphics and animation, AI is one of the areas of game programming with the most potential for growth. The techniques in this section address some of the most troublesome, and most promising, areas for future game AI development. Advances in AI architecture, believable decision-making, more detailed character simulation, and player modeling all offer the possibility of creating new or improved aspects of gameplay, as well as enhancing our ability to actually build that gameplay more quickly.

AI architecture is crucial for effective development. A poor architecture can bring the development of new AI gameplay on a project to a halt. A good one can empower new features and new experiences. Cyril Brom, Tomáš Poch, and Ondřej Šerý discuss creating worlds with high numbers of NPCs in their gem, "AI Level of Detail for Really Large Worlds," managing multiple simulation levels for each area of the game world. Kevin Dill writes about using patterns in AI decision-making code in his gem "A Pattern-Based Approach to Modular AI for Games." The more effective an AI programmer is in creating reusable and scalable code, the more time there is to iterate on gameplay.

AI movement and pathfinding are always among the earliest difficult problems an AI programmer has to solve on a project. "Automated Navigation Mesh Generation Using Advanced Growth-Based Techniques," by D. Hunter Hale and G. Michael Youngblood, details their research on new methods of creating navigation meshes using space-filling algorithms.
Michael Ramsey covers the pathfinding used for a wide variety of animals and movement types in World of Zoo in his gem "A Practical Spatial Architecture for Animal and Agent Navigation." Brian Pickrell sheds some light on the often-overlooked area of control theory and how it can be applied to an agent's steering in "Applying Control Theory to Game AI and Physics."

Decision-making is the aspect of AI that affects players most directly. The challenge is finding meaningful decisions characters can make to appear intelligent, while also being understandable to the player. It is much easier to create decision-making that allows for an NPC's success than it is to balance understandable behavior with the complexity and depth required to appear intelligent. Thomas Hartley and Quasim Mehdi describe a method to allow NPCs to adapt better to players' combat behavior over time with their gem "Adaptive Tactic Selection in First-Person Shooter (FPS) Games." Dave Mark discusses how to use complexity in building AI decision-making models in "Embracing Chaos Theory: Generating Apparent Unpredictability through Deterministic Systems."

More detailed character simulation can also help create that depth and believability. Rob Zubek breaks down an AI decision-making approach used in games such as the Sims series in his gem, "Needs-Based AI." Phil Carlisle takes a look at how we can easily add emotional modeling to a behavior tree–based architecture in his gem, "A Framework for Emotional Digital Actors." Baylor Wetzel's gem, "Scalable Dialog Authoring," confronts the difficult problem of creating dialog for many NPCs by abstracting concepts, such as a cultural group an NPC belongs to, and describes how to empower designers to author more variety in dialog with these abstractions.

Player modeling is a burgeoning area of game AI. It can be used to help improve a player's experience, such as in Left 4 Dead.
In the gem "Graph-Based Data Mining for Player Trace Analysis in MMORPGs," Nikhil Ketkar and G. Michael Youngblood write about their work modeling players in massively multiplayer games (MMOs). They've used MMO player data to determine models to detect gold farmers and bots, as well as to determine effective locations for in-game advertising. Such models and processes can be used to improve a wide variety of aspects of the player experience.

As game AI programmers, we are now in the spotlight to create the next level of innovative game experiences. People are no longer impressed by the same fancy normal maps and motion-captured animation; they want their characters to be more believable and more entertaining in how they act. It's up to us to make this a reality, and I hope the gems in this section will help give you ideas to solve some of these problems.

3.1 AI Level of Detail for Really Large Worlds
Cyril Brom, Charles University in Prague
Tomáš Poch
Ondřej Šerý

One challenge for games featuring large worlds with many non-player characters (NPCs) is to find a good balance between the consumption of computational resources and simulation believability. On the one hand, the cost of simulating the whole world, including all the NPCs, is enormous. On the other hand, if one simulates just the proximity of the player, one invites plausibility problems. For instance, when a player returns to a previously visited area, NPCs and objects in this area may not be in a believable state—the area has not changed since it was left, people have not moved, the ice cream left on the table has not melted, and so on. Additionally, this approach cannot handle NPCs that can move freely around the world. To compromise between these two extremes, a number of level-of-detail AI techniques (LOD AI) have been invented.
While LOD AI for a traffic simulation [Chenney01], for enemies in an RPG [Brockington02], and for an action game [Grinke04] has already been written about, less is known about how to vary simulation detail for general NPCs. This gem presents a LOD AI technique tailored for simulations of large worlds featuring hundreds of commonplace NPCs with relatively complex behavior. These NPCs can perform tasks that include manipulating several objects and require a variety of everyday behaviors, such as harvesting, merchandising, or watering a garden (as opposed to pure walking and fighting). The NPCs are interactive, and they can move around the world as the story dictates. The technique is gradual, which means that it allows for several levels of detail (LOD) based on the distance from the player or important places (Figure 3.1.1). The technique also considers the fact that whole locations containing objects and NPCs may cease to exist when the LOD decreases and need to be re-created when it increases again.

Gradual LOD Example

Think of a classic fantasy RPG setting: a medieval county with many villages. There is a pub in one of them, where miners go to have some fun after their shift. A player can go there as well; she can leave and return any time. You want to avoid the situation in which she would realize that the miners were not being simulated properly when she was at the other end of the village, but you want to save the resources when she is out. You may define the following LODs, ranging from full simulation to almost no simulation.

• Detail 5: Full simulation—used when the player is in the pub or in its proximity.
• Detail 4: Every room of the pub is abstracted to a single point. The tables in the saloon are organized just in a list or an abstract graph; exact positions become unimportant.
Miners (and other NPCs) are still sitting at the tables, but they are not drinking properly; they just empty the whole glass of beer in one go, say, every 20 to 40 minutes. The barman brings new beer, but now he is not walking as he would at LOD 5. Instead, he "jumps" from table to table. The beer still has to be paid for. A miner can jump to another table as well, or to the next room, or he can leave the pub. LOD 4 will be used typically if the player is, say, 100 to 300 meters from the pub.

Figure 3.1.1 Three types of level-of-detail AI techniques. Places are projected on the X-axis.

• Detail 3: The glasses, the tables, and the pub's rooms cease to exist. The whole pub is abstracted to a single point. Yet the miners can still be there (we don't know where exactly, but we need not care, as will be detailed later). The beer level in the barrel will decrease a bit every half an hour or so, based on the number of miners in the pub. The miners can leave the pub or enter. LOD 3 will be used if the player is at the other end of the village. Notice that when she is approaching the pub, the detail first jumps to 4 and only then to 5. Additionally, when it jumps to 4 from 3, the miners must be assigned to tables, and glasses must be generated properly (for example, at the tables).
• Detail 2: The whole village is abstracted to a single point. The barman is not simulated, because his story, as specified by designers, never leads him out of the bar, while the miners still are; they may go from the village (meaning the pub or home) to the mine (in other words, to work). This detail is used when the player is not in the village but is in its proximity.
• Detail 1: The village and its surroundings are abstracted to a single point; none of the village's inhabitants are simulated, but a foreigner may pay a visit. You can imagine that the foreigner is a story-important persona. His importance can demand that the area he is located in has an LOD of at least 4.
Thus, as he moves, he redefines the LOD similarly to the player. Alternatively, some events may be considered important. When miners—in the pub, at LOD 3—start a brawl, the brawl can level up the detail to 5 no matter where the user is. Throughout, we will assume that the LOD increases or decreases just by 1; larger changes can be done by repetition. We will assume one player in the simulation, but the technique can be used for multiplayer games as well. The LOD technique introduced in this gem has been implemented as a part of a general simulator of 2D virtual worlds, including the aforementioned example (with some modifications). The simulator and the example are included on the CD.

Graphical LOD versus LOD AI

Conceptually, it is often useful to think about a game's control mechanisms in terms of a two-layered architecture. While AI is the higher layer, physics and graphics are the lower layer. (See [Chenney01] for more on this point.) This is actually a simplified view, nevertheless useful for explanatory purposes. According to this metaphor, the player's direct experience is provided by the lower animation layer, which is only influenced by the higher layer. The LOD AI technique is only related to the higher layer. Since the position of the player's avatar forces the maximum LOD AI in correctly designed worlds, only the areas (or their parts) simulated at the maximum detail may be visualized. The animation layer takes the outcome of the AI layer as an abstract prescription of what to show. The animation layer operates with several graphical levels of detail, and it may add additional complexity beyond the finest LOD AI detail. In our example, LOD AI Detail 5 takes care of whether a miner will drink a beer or whether he will go to the waypoint on the right or on the left, but the movement of his hand or walking smoothly will be dealt with by the animation engine.
Simulation at the Full Detail

This gem concerns itself only with the AI layer. Assume to start that our goal is to simulate the whole world at the full AI LOD and that we need not care about animation. A good way to think about what happens in the AI layer is in terms of a discrete-event simulation paradigm. According to this view, time is represented by a chronological sequence of simulation events, which are ordered in an event list. These simulation events are abstract entities that mark time instants in which the state of the game is modified. Additionally, every simulation event can add a new simulation event to the event list or remove an existing event. Technically, every simulation event is associated with a piece of code. In a nutshell, after initialization, the whole system works in the following cycle:

1. Take the first simulation event from the event list and remove it from the list.
2. Process this event; that is, run the code associated with the event. As a part of this:
   a. Change the state variables of some entities.
   b. Insert new simulation events into the event list at appropriate places.
   c. Remove some simulation events from the event list.

When this paradigm is used, it is important to distinguish between real time, which is the time the user experiences, and simulation time, which is the time of the simulated world as represented by the event list. One processing cycle happens, by definition, in zero simulation time (though it cannot happen in zero real time). For real-time games, these two times must of course be synchronized. For more on discrete-event simulations and event handling, see [Chenney01, Harvey02, Wiki09].
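The cycle above can be sketched with a priority queue as the event list. This is a minimal sketch, not the gem's actual code: the class name, the lazy-deletion scheme for removed events, and the callback signature are our assumptions.

```python
import heapq
import itertools

class EventList:
    """Discrete-event simulation core: events are (time, callback) pairs,
    processed in chronological order; a callback may schedule or remove
    further events (steps 2a-2c of the cycle)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # stable order for equal times
        self.now = 0.0                     # simulation time

    def schedule(self, time, callback):
        entry = [time, next(self._counter), callback, False]
        heapq.heappush(self._heap, entry)
        return entry                       # handle, so the event can be removed

    def remove(self, entry):
        entry[3] = True                    # lazy deletion: mark as cancelled

    def run(self):
        while self._heap:
            time, _, callback, cancelled = heapq.heappop(self._heap)
            if cancelled:
                continue
            self.now = time                # advance simulation time
            callback(self)                 # process the event
```

Note that one processing cycle advances `now` to the event's time but otherwise takes zero simulation time, matching the paradigm; synchronizing `now` with real time is left out of this sketch.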
The designers specify how the simulation events relate to what really happens in the game; they must create simulation events around those changes in the state of the game that the AI layer should be aware of. In a sense, the simulation events present the AI layer's window into the virtual world. One class of simulation events is story-important events (for example, the dungeon's gate will open every day at midnight). These events typically will be pre-scheduled in the event list from the beginning or hooked into the event list by a trigger at run time (for example, from the moment the player enters the village, the neighboring dungeon will open every day at midnight). A different class of events is due to slow changes of objects' states (for example, increasing the rust level of a sword every week by one). But the most important class is due to changes caused by atomic actions of NPCs or the player; these actions must be represented by simulation events in the event list. Because these actions are indivisible from the standpoint of the AI layer, it suffices to represent each of them by a start event and an end event. Typically, these two events will not be present in the event list simultaneously. During processing of a start event, first, other parts of the game are notified that the action starts, and second, it is estimated how long the action will last, and its end event is hooked into the event list at the appropriate place (Figure 3.1.2). When this time comes and the end event is processed, the states of some objects are changed, and the NPC's AI controller is invoked to decide what the next action of this NPC is, hooking the start event of that action into the event list. In fact, because start events tend to be hooked at the beginning of the event list, it is often possible to skip their generation and to hook the respective end events directly. Note that this mechanism allows you to have atomic actions with various durations.
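The start-event/end-event pattern for one atomic action might look like the following. This is a hedged sketch: `schedule(delay, callback)` stands in for hooking an event into the event list, and all class, attribute, and action names are illustrative assumptions, not the gem's implementation.

```python
import random

class NPC:
    def __init__(self, controller):
        self.beer_level = 3
        self.busy = False
        self.controller = controller     # decides the next action at action end

class SipBeerAction:
    def estimate_duration(self):
        # The estimate may be drawn from a designer-given distribution.
        return random.uniform(4.0, 8.0)  # seconds

    def on_start(self, npc, schedule):
        npc.busy = True                  # notify the rest of the game
        schedule(self.estimate_duration(),
                 lambda: self.on_end(npc, schedule))

    def on_end(self, npc, schedule):
        npc.beer_level -= 1              # apply the action's effect
        npc.busy = False
        next_action = npc.controller(npc)
        if next_action is not None:      # chain the next action's start
            next_action.on_start(npc, schedule)
```

Skipping the start event and hooking `on_end` directly, as the text suggests, would merge `on_start`'s body into the caller.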
Classic discrete-event simulations typically work only with simulation events hooked into the event list. However, for games, asynchronous events are also needed. That is, sometimes another part of the game can generate an event that has to be propagated to the AI layer and processed by it immediately. For instance, the collision detection may recognize that someone has nudged the sipping person with an elbow. This means that the atomic action of sipping a beer cannot be finished. Thus, the AI layer must delete its end event and generate the start event of a spilling action instead. Another issue is that the time of actions' ends (and thus the time of end events) is only estimated by the AI layer. This is fine when we simulate a part of the world that is not visualized: An estimate becomes the actual duration, provided the action is not interrupted by an asynchronous event. However, for visualized parts, the duration of some atomic actions will be determined by the animation engine. End events of these actions need to be synchronized with the actual end of the action.

Figure 3.1.2 The atomic action of sipping a beer is represented in the event list by a start event and an end event.

Toward LOD AI: Hierarchical Behavior

The question of this gem is how to make the mechanism of start events and end events cheaper when the full detail is not required. Our solution will capitalize on the fact that the behavior of NPCs can be represented hierarchically. This means, more or less, that the NPCs are conceived as having a list of behaviors that are decomposed into sub-behaviors, which are further refined until some atomic actions are reached. Figure 3.1.3 shows how this applies to a drinking miner. Conceptually, this follows the Belief-Desire-Intention architecture [Bratman87]. This will most likely be implemented with behavior trees or hierarchical FSMs—for example [Fu&Houlette04, Isla05, Champandard08].
In the academic literature, the decomposition of high-level behaviors into sub-behaviors is often more complicated. There, a distinction between goals and tasks is often made. While goals represent what shall be achieved, tasks represent how to achieve it. Every goal can be accomplished by several tasks, and every task can be achieved by adopting some sub-goals. Importantly, an NPC needs to perform only one task to achieve a goal, provided there is no failure, but it must fulfill all sub-goals, or most of them, to solve a task. Consequently, the behavioral decomposition is given by an AND-OR tree (AND levels for goals, OR levels for tasks). The CD examples feature such AND-OR trees; however, the LOD technique works well with any behavioral decomposition; the distinction between goals and tasks is unimportant for it. Thus, for explanatory purposes, we assume throughout that we have just ordinary sub-behaviors, and we call them tasks.

Whatever the exact representation is, the key is that a) the hierarchy can be constructed so that its highest level represents abstract behavior to which lower levels add more detail, and b) each level can be made executable. Thus:

1. In the design phase, construct the hierarchy in this way and manually assign LODs to its levels (refer to Figure 3.1.3).
2. During execution, determine the lowest level of the behavioral hierarchy that should be executed. This is the level of the LOD that corresponds to the detail of the simulation at the place where the NPCs are located (see Figure 3.1.4).
3. Execute this task as atomic, approximating what would happen when the simulation is run at a finer detail.

Figure 3.1.3 Hierarchical representation of a drinking miner's behavior. Atomic actions are in gray. Other nodes represent tasks. Note that one detail can be assigned to more levels.

Unfortunately, these three points bring many problems. We will address them in turn.
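The LOD-based selection of Step 2 can be sketched as follows. This is a minimal illustration under our own assumptions: the node names loosely follow the drinking-miner hierarchy of Figure 3.1.3, the tree is a simple chain, and choosing among several eligible children (the AI controller's job) is left out.

```python
class TaskNode:
    def __init__(self, name, lod, children=()):
        self.name = name
        self.lod = lod           # LOD assigned to this level at design time
        self.children = list(children)

def lowest_executable(node, current_lod):
    """Descend while a child's assigned LOD is still covered by the
    current simulation detail; the returned node is executed as atomic."""
    for child in node.children:
        if child.lod <= current_lod:
            return lowest_executable(child, current_lod)
    return node

miner = TaskNode("have-fun-in-pub", 3, [
    TaskNode("drink-at-table", 4, [
        TaskNode("sip-beer", 5),
    ]),
])
```

At full detail the atomic action is reached; at LOD 4 the intermediate task runs as atomic, matching Figure 3.1.4.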
What Does It Mean to Execute a Task as Atomic?

Atomic execution means two things. First, every task to be executed atomically must be represented in the event list by a start event and an end event, similar to atomic actions. For instance, assume that there is LOD 4 in the pub. The corresponding behavioral level contains the task of drinking at a table. The event list will contain the end event of this task but not of the atomic actions of sipping a beer or chatting. Second, for every task, the designers must specify a) its result, and b) its estimated duration. When a task is simulated in full detail, its particular runs can have different durations. Often, the exact duration of a run will be influenced by the animation layer (for example, an NPC walks from one room to another in a different amount of time based on the positions of objects the NPC has to avoid). Fortunately, with lower LODs, you need not worry about exactly how long the task would have lasted had the simulation run in the full detail. All you need is a plausible estimate of that time. Based on this estimate, you place the task's end event in the event list. Note that you may generate this estimate from a probabilistic distribution given by the designers.

Figure 3.1.4 "Drinking at a table" is executed as atomic. Atomic tasks are in gray.

When should the result (in other words, Point A) be revealed? There are three possibilities: 1) at the beginning of the task (that is, when the start event is processed); 2) at the end (in other words, when the end event is processed); or 3) during the task. For the drinking miner, (1) would mean drinking the glass at once and then doing nothing for 20 to 40 minutes, while (2) would mean doing nothing for 20 to 40 minutes and then drinking the glass at once. Variant (3) is not consistent, and (1) changes the world sooner than it is known that the task has finished successfully. Thus, we recommend using (2).
However, designing the scenario at different levels presents extra work, and you have to consider in which situations these shortcuts are really needed. In the extreme, you program the whole simulation from scratch at each LOD, though programming the more abstract levels is much simpler than the full detail.

What to Do When the Detail Increases

Sadly, the detail can increase in the middle of a task's atomic execution—in other words, between its start event and its end event. For instance, the player may enter the pub while the miners are drinking at a table atomically. We call such a situation a task expansion, and we call the part of the task that has already been performed at the lower detail a task stub (see Figure 3.1.5). When this happens, you need to:

1. Remove the end event of the expanded task from the event list.
2. Compute the partial effect of the task stub, if needed.
3. Start the simulation at the higher detail.

Figure 3.1.5 The LOD has increased from 4 (top) to 5 (bottom), creating a stub from drinking atomically. The original end event must be removed.

The partial effect of a task stub should approximate the outcome of the stub's detailed simulation (in other words, as if the task had been running from the beginning until the moment of expansion at the higher detail). You again need the designer's specification and the code for determining the partial effect, which is extra work. Actually, it is necessary to perform Point 2 only if the user may notice the discrepancy; otherwise, it is sufficient to restart execution of the expanded task from its beginning, pretending that the task stub never happened. In other words, you specify that the partial effect is nothing. In our drinking example, this would mean that all the drinkers would start with a full glass at the time of the LOD increase, but this may be fine if you do not visualize how much beer is in the glasses.
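The three steps of a task expansion can be sketched as follows. This is a hedged illustration under our own assumptions: the event list is assumed to expose a `remove` operation, and the linear-progress partial effect (for example, the fraction of the glass already drunk) is just one plausible designer-specified formula.

```python
# Sketch of handling a LOD increase in the middle of a task's atomic
# execution (a "task expansion"); names are illustrative assumptions.

def expand_task(task, now, event_list):
    event_list.remove(task.end_event)      # 1. drop the scheduled end event
    apply_partial_effect(task, now)        # 2. approximate the stub's outcome
    task.resume_at_higher_detail(now)      # 3. continue at the finer LOD

def apply_partial_effect(task, now):
    """Simple case: the task's observable progress grows linearly with
    elapsed simulation time, clamped to completion."""
    elapsed = now - task.start_time
    task.progress = min(1.0, elapsed / task.estimated_duration)
```

Specifying "the partial effect is nothing" would amount to skipping `apply_partial_effect` and restarting the task from scratch.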
However, consider watering a garden: Here, the player may find it strange that the gardener is always just starting with the first garden bed at the time the player enters the garden. It may be necessary to decide that some fraction of the garden has already been watered at the time of expansion. When you need to compute the partial effect of a task stub, what are the options? Often, it is desirable to avoid simulating the task stub at the higher detail. That would give you an exact result, but at the cost of too much overhead. Instead, you need to consider which state variables the task changes predictably (see Figure 3.1.6, left and middle) and which it does not (see Figure 3.1.6, right). In the former case, the partial effect can be determined by a simple formula. In the latter case, a more sophisticated ad hoc mechanism has to be created. For instance, think of the LOD increase from Detail 3 to 4 in the pub. On one hand, you can easily figure out how much beer has been drunk based on the number of miners in the pub. On the other hand, you also need to assign the miners to the tables. To do the latter, the designer has to come up with an ad hoc mechanism.

Figure 3.1.6 Although the final outcome of three tasks being executed atomically is the same (Point A), their details differ. At the finer LOD, the state variable may change in a predictable way (left, middle) or in an unpredictable way (right). The arrows denote the LOD increase.

What to Do When the Detail Decreases

Let us have a location in which detail n should be decreased to n-1. We say that tasks (or atomic actions) corresponding to detail n are being pruned away, whereas tasks corresponding to detail n-1 are being shrunk—that is, starting to be executed as atomic.
Since the LOD decrease must apply to a whole area, as detailed later, more tasks may need to be pruned away—for example, the drinking tasks of all the individual miners in the pub. These tasks may not end at the same moment. Thus, they should be stopped using the mechanism of partial effect described previously. Only then can you execute the shrinking tasks as atomic. It is important to stop the tasks being pruned away at one instant; otherwise, the area will end up in an inconsistent state, with some tasks being simulated at detail n and others at n-1.

An important point of any LOD AI is that, by definition, some information is lost during a LOD decrease. This brings two problems. First, the partial effect of tasks and actions being pruned away should be partly forgotten but also partly exploited by the subsequent simulation at LOD n-1. Second, when the LOD increases again, the missing information should be reconstructed. This is similar to lossy compression. The second problem was partly treated earlier (assigning miners to tables), and we will also return to it later. The first problem is exemplified now in Figure 3.1.7, showing a barman walking to a table while the LOD goes from 5 to 4. In this example, we are concerned with the barman’s position, but similar reasoning applies to any state of an object.

Figure 3.1.7 LOD decrease. “Steps” are being pruned away at t2, while the “go-to-next-object” task is being shrunk into the atomic “jump.” Recall that the saloon is a single point at LOD 4.

Figure 3.1.7 shows that when the simulation runs on LOD 4, the walking barman is engaged in the task “go-to-next-object,” which makes the barman “jump” from one object to another. These objects are tables or the bar. If the saloon is not too oblong, we may further assume unitary distances (at LOD 4) between all pairs of these objects. At LOD 5, “go-to-next-object” breaks down into a sequence of “step-to” atomic actions.
Now, consider the LOD decrease from 5 to 4 at time t2—that is, around the middle of the “go-to-next-object” from Object X to the table. Should the barman start his “jump” from Object X or Place Y, and how long should it take? Because Place Y does not exist at LOD 4, it seems that the best option is to let the barman start his “jump” from X. However, since you cannot roll back what has already happened, the barman would walk from X to Y twice, and a clever user may notice this should the detail increase soon (say, at time t3, when the barman is expected to be around Z).

The second option is to say that the barman is somewhere on the way and compute how long the rest of the “jump” (in other words, from Y to the table) will last. This is more accurate, but it creates extra work. Additionally, this works well only when the object’s state is updated in a predictable way at the finer detail (refer to Figure 3.1.6, left and middle).

The first method actually works acceptably if you manage to avoid an early LOD increase. The longer the period between a LOD decrease and the subsequent LOD increase, the more the inconsistencies due to a simplified determination of the initial state of the simulation at LOD n-1 are disguised. A simple mechanism for avoiding an early LOD increase will be shown later.

Space Representation

So far, we have spoken about how to simplify behavior given a LOD, but we also need to know the value of the LOD and its radius, and we need these values for all parts of the virtual world. Perhaps the simplest way to get this information is to represent the world hierarchically, as you can examine in the CD example. The spatial hierarchy keeps information about the children, parents, and neighbors of every location; the exceptions are the leaves, which lack children, and the root, which has only children. For simplification, assume now that the number of LODs equals the number of spatial levels. (This is not a strict requirement.)
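Such a hierarchy can be sketched as follows. The names are illustrative assumptions (the CD implementation is in Java and differs):

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of the spatial hierarchy described above. Every
// location knows its parent, children, and neighbors; for simplicity,
// the number of LODs here equals the number of spatial levels.
struct Location {
    Location* parent = nullptr;
    std::vector<Location*> children;   // empty for atomic places (leaves)
    std::vector<Location*> neighbors;  // adjacent locations on the same level
    int level = 0;                     // root = 1; deeper levels mean higher LOD
};

// Attach a child one level below its parent in the hierarchy.
Location* AddChild(Location& parent, Location& child) {
    child.parent = &parent;
    child.level = parent.level + 1;
    parent.children.push_back(&child);
    return &child;
}

bool IsAtomicPlace(const Location& loc) { return loc.children.empty(); }
```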
Now, a membrane metaphor can be used to describe which LOD is where. Imagine an elastic membrane cutting through the spatial hierarchy (see Figure 3.1.8), touching some locations. We say that every location or atomic place that is at the membrane at a particular instant is simulated as an abstract point. No location “below” the membrane exists. Every NPC is simulated at the LOD equal to the level on which the membrane is touching the area in which that NPC is located; spatial LOD determines behavioral LOD.

Formally, the membrane is nothing more than a list of locations that are simulated as abstract points at a particular moment. The membrane can be reshaped in every time step. For the purposes of coherence, we enforce the following shaping rules:

1. If a location or an atomic place X is at the membrane, every location or atomic place with the same parent as X is at the membrane, too.
2. If a location Y is above the membrane, every location with the same parent as Y is above or at least at the membrane, too.

For example, if an atomic place from the saloon is at the membrane, all the saloon’s atomic places will be at the membrane, too (Rule 1). This ensures that when the simulation runs in full detail somewhere in the saloon, it will run in full detail everywhere there. Because the saloon itself is above the membrane in this case, all rooms of the pub must be simulated at least as abstract points (Rule 2). This ensures that almost always at least something happens in the locations near the center of attention. Because the LODs of two adjacent locations can differ (that is, when they do not have the same parent), the visibility of a player should be limited to the location she is in. If this is not possible, such as in open spaces, another shaping rule must be settled:

3. The detail must be maximum in all locations the player can see at a given instant.

Figure 3.1.8 LOD membrane.
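One consistent reading of Rules 1 and 2 is that whenever a location exists (is at or above the membrane), every location with the same parent must exist as well. A hypothetical validity check along these lines, with assumed names, might look like this:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical sketch: the membrane is literally a set of locations
// currently simulated as abstract points. Names are assumptions, not
// the chapter's CD implementation.
struct Loc {
    Loc* parent = nullptr;
    std::vector<Loc*> children;
};

// A location exists if it is at the membrane, or above it (i.e., one
// of its descendants is at the membrane).
bool Exists(const Loc* x, const std::set<const Loc*>& membrane) {
    if (membrane.count(x)) return true;
    for (const Loc* c : x->children)
        if (Exists(c, membrane)) return true;
    return false;
}

// Rules 1 and 2 then boil down to: whenever a location exists, every
// location with the same parent exists too (none has fallen through).
bool MembraneIsWellShaped(const std::vector<Loc*>& all,
                          const std::set<const Loc*>& membrane) {
    for (const Loc* x : all) {
        if (!x->parent || !Exists(x, membrane)) continue;
        for (const Loc* sib : x->parent->children)
            if (!Exists(sib, membrane)) return false;
    }
    return true;
}
```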
Limitations of Rules 1 through 3

Rules 1 through 3 work fine, but you should know that there is an exception in which they do not guarantee that something happens near the center of attention. The problem is that two neighboring locations may have different parents, and thus their LODs can differ by more than 1. Think of a world with two kingdoms. There are two houses next to each other, but the border goes right between them. Even though the first house is simulated in full detail, the second kingdom may still have just LOD 1; the houses do not have a common parent. To deal with this, a mechanism of influences is needed. This would allow you to specify another shaping rule based on the influence a location has on all its neighboring locations. The trouble with this mechanism is that it may cause cascade effects during membrane reshaping, increasing the overhead. You can read more about this topic in the documentation on the CD.

Pathfinding

So far, we have been silent on pathfinding. With the spatial hierarchy, it is apparent that a hierarchical refinement of A* can be used easily to simplify pathfinding at a lower LOD.

Positions of Objects and NPCs

We know that the pub is abstracted to a single point at LOD 3. Assume there is one miner and one barman there. What happens with their positions during LOD changes? Because the pub’s rooms do not exist at LOD 3, both the barman and the miner are, so to speak, spread out in the whole pub as two uncollapsed quantum particles; they are at the membrane. Assume further that the LOD decreases to 2. While this lifts the miner a level up, spreading him out in the village, the barman becomes non-simulated because his story never leads him out of the pub (see Figure 3.1.9). This means that the barman “fell through” the membrane; his traces have been lost for the rest of the simulation. The problem is that we need to narrow down object and NPC positions after the detail elevates.
When the LOD goes from 2 to 3, we know that the barman has to be collapsed into the pub, but this is not the case for the miner. We do not know where to generate him; he can be collapsed into the pub or into his house or on the street, and so on. A similar problem arises for the LOD increase from Detail 3 to 4. We now introduce a basic mechanism dealing with the generation of positions of objects. The real case is actually more complex than this mechanism. Following, we will extend it and comment on how to use it for NPCs.

1. Objects in every location should be divided into two groups—those that are location-native, for objects owned by the location, and those that are location-foreign, for objects owned by other locations. (A glass is location-native in the pub, as opposed to a mine truck.) An object can be location-native in more than one location. (A glass is location-native in the bar as well as in the kitchen.)
2. Every location on the nth level of the spatial hierarchy should have a method implementing where to generate location-native objects if the detail elevates from n to n+1 (for example, during the Detail 3 to 4 transition, the pub “knows” where to generate the tables and glasses).
3. When the LOD decreases in a particular location, the detailed positional information for all the location-foreign objects having been “lifted up” is memorized, but not for the location-native objects having been “lifted up” (due to Point 2).
4. When location-native objects “fall through” the membrane, only their total number in the area where they “fall through” is remembered (for example, after the 4-to-3 transition, it will be remembered that there are, say, 27 beer glasses in the pub). For location-foreign objects, the exact positions are remembered. (After the same transition, the watering can will become non-simulated, but two memory records will be kept: for the 5-to-4 and 4-to-3 transitions.)
5.
When the LOD increases, location-foreign objects are generated based on their stored positional information. In some rare situations, this information may not be available; this will be discussed in a moment.

The idea behind this is that during the design phase, one can specify relatively easily which objects are native in which locations and implement the placing algorithms appropriately.

Figure 3.1.9 LOD increase. Some objects are “lifted up,” while others are not.

Storage of Positional Information

When a location-foreign object is simulated, it can hold its own positional information. The records about location-foreign non-simulated objects’ positions and about the numbers of location-native non-simulated objects can be kept by parent locations. (For example, the pub “knows” that it contains 27 glasses.) Often, when the parent location ceases to exist, the player is so far away that the records held by this location can be happily forgotten. Nevertheless, should a piece of information survive the destruction of the location to which it is attached, you can do the following. In the design phase, specify the information level, denoting the LOD at which this particular kind of information can be forgotten. At run time, after the location “falls through” the membrane, attach your record to the upper location (in our case, the village) provided the new LOD is still higher than or equal to the information level of this record. (Mind the direction: A higher LOD is lower in the hierarchy!)

General Information and Expirations

The aforementioned mechanism can be generalized in two ways. First, one can store not only positional information, but any state information. Assume a player has broken a window in the pub. A clever player would not return soon, but if she does, she may expect the window to still be broken. The record about the broken window can be stored by the pub.
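A minimal sketch of such records, assuming a simple information-level field and re-attachment to the parent location (the names are illustrative, not the CD code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hedged sketch of records surviving their location's destruction.
// When a location falls through the membrane, each record is either
// forgotten or re-attached to the parent location, depending on its
// information level (the LOD at which it may be forgotten).
struct Record {
    std::string what;  // e.g., "broken window", "27 beer glasses"
    int infoLevel;     // forget once the surrounding LOD drops below this
};

struct Area {
    Area* parent = nullptr;
    std::vector<Record> records;
};

// Called when 'loc' ceases to exist and its surroundings drop to newLod.
void LiftRecords(Area& loc, int newLod) {
    for (const Record& r : loc.records) {
        // Keep the record only while the new LOD is still higher than
        // or equal to its information level (mind the direction:
        // a higher LOD is lower in the hierarchy).
        if (loc.parent && newLod >= r.infoLevel)
            loc.parent->records.push_back(r);
    }
    loc.records.clear();
}
```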
Similarly, sometimes it may be beneficial to record the exact positional information for a location-native object. If the player moves a table a bit, and the pub’s LOD goes to and fro, the basic mechanism described previously would generate the table incorrectly, in its original place. The extra record will remedy this problem. After a while, this record may become useless; sooner or later, someone may repair the window and move the table back. If you need this feature, you can simply delete a record after a specified amount of time using a mechanism of expirations.

Initialization and Traveling Objects

There are two situations in which it is necessary to initialize objects’ positions. The first one occurs after the simulation starts. Additionally, sometimes it is necessary to initialize objects’ positions when the objects move between locations. The first situation is trivial: You must have an implicit initialization procedure. Let us now elaborate on the second case. Assume the pub is simulated at LOD 4, meaning all the pub’s rooms are single points. A miner puts down a watering can in the saloon, where the can is location-foreign. Now, the LOD elevates to 5. Where do you place the can? Sometimes, the procedure initializing positions at the beginning of the simulation can help. More often, you would find useful a general placing mechanism that groups objects based on their typical places of occurrence into floor-objects, table-objects, and so on, and that generates these objects randomly within the constraints of the objects’ categories. Consider now that the LOD decreases later from 5 to 4, and the miner picks up the can in the pub and takes it out. If this happens, do not forget to delete the record about the can’s detailed position.

NPCs

When the detail increases, we have to refine (a) the positions of NPCs and (b) the tasks they are engaged in. For (a), we can treat NPCs as objects.
For (b), the task expansion mechanism described earlier should be used.

Reshaping the Membrane

The final question is how to shape the membrane. A simple solution, adopted by the example on the CD, is to assign an existence level and a view level to every object. If the detail decreases below the existence level, the object ceases to exist; otherwise, it is to be simulated. The view level then determines the detail required by the object after it starts to be simulated. Hence, the view level is always equal to or higher than the existence level. For many common objects, these levels will be equal, but not for story-important objects or NPCs. Every user’s avatar will have the existence level equal to 1 and the view level equal to the maximum.

Assume now a traveler with his view level set to 4 and his existence level to 2. He will not exist until the detail is at least 2, but when it elevates to this value, the traveler will demand that in the location to which he has been generated, the detail goes further to 4. Note that according to the shaping rules, this further determines increasing detail in some of the neighboring locations. Because this may create new objects demanding additional LOD increases, a cascade of changes may be triggered. The algorithm for doing this consistently is detailed in [Šerý06].

Sometimes, you may want to increase the detail even if no important object is around; for instance, think of a brawl being simulated in a village far away that is expected to grow into a local uprising. You may need to show the behavior of those NPCs to the player in a cut scene. To do this, you can simply put an invisible object into the simulation with the appropriate view level to ensure the player will see the AI behavior.

What Is the Radius of LODs?

Using this technique, all users and important objects/NPCs tend to automatically create a “simulation crater” around them due to the spatial hierarchy and the shaping rules.
As they move, the detail elevates (typically) by one in locations at the edge of the crater, avoiding abrupt jumps of LOD. LODs are not specified in terms of metrical distances, but in terms of the number of locations between the object/NPC and the edge of the crater. This has two advantages. First, the LOD does not change all the time, as would be the case with pure metrical distances, helping to reduce the overhead. Further, the overhead is spread in time: If you go from 1 to 5, you do this in more steps. Figure 3.1.10 demonstrates that there is indeed a qualitative difference in overhead between the 4-to-5 increase (left) and the 3-to-5 increase (right) in the pub in our CD example. Second, by the time a person arrives at a location, that location typically has been simulated for a while, disguising inconsistencies caused by lower LODs.

When to Decrease the Detail

There is a problem with reshaping the LOD membrane when an object moves between two locations repeatedly. The LOD may start to oscillate, increasing the consumption of resources. You can do either of the following two things. First, you can have a larger crater determining when to decrease the LOD. This larger crater would embrace the smaller crater that enforces the LOD increase. Second, you can use, as we did, a garbage mechanism. With the garbage mechanism, you do not decrease the LOD until the resources are needed for something else. This mimics the larger crater automatically. However, because the cleanup causes overhead, it is better to use the garbage mechanism a bit sooner than the resources are actually needed, letting the mechanism work over several time steps.

Creating the Structure of the World

In practice, it rarely makes sense to have more LODs than levels of the spatial hierarchy. In the spatial hierarchy, you should keep a reasonable number of sublocations for every parent—neither 1 nor 50.
This is not a strict rule, but if you violate this principle too often, the hierarchy should be reconstructed. Even though higher numbers may make sense logically (a building with many small rooms), with respect to the technique presented here, this would increase the overhead of a LOD change. Recall that LODs also have to be assigned to the behavioral hierarchy. You should do this consistently, meaning the degree of abstraction of two tasks from a particular level should be similar. For instance, if LOD 3 is assigned to watering a garden, it should also be assigned to cooking, but not to peeling potatoes. For technical purposes, you may need more levels in the behavioral hierarchy than LODs, as demonstrated in the CD example.

Figure 3.1.10 Processor consumption during the LOD increase. The X-axis represents the time in the game. The LOD increases at 10:15 p.m.; this time is arbitrary. This figure shows data from three particular runs. Note that the data for the 3-to-4 increase resembles the data for the 4-to-5 increase.

Another aspect to keep in mind is that after assigning LODs to tasks, the objects required for these tasks must have their existence levels set accordingly. Assume watering a garden has LOD 3, and the garden is an abstract point at LOD 3. If the designers require that watering a garden has to run with a watering can at this LOD, meaning the gardener has to pick up the can before he comes to the garden, the existence level of the can must be 3 or less. Watering a garden would then mean something like, “Stay there holding the can for a specified amount of time, after which the garden will become watered at one instant.” Of course, designers might also specify that the gardener is able to water the garden without the can at LOD 3; the task’s outcome would be the same, only the can would not be required. In the latter case, you would need to generate the can after the LOD increase.
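A hedged sketch of checking this design constraint, namely that each of a task's required object kinds already exists at the task's LOD (the types and names here are assumptions):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: each object kind has a single existence level,
// and a task lists the kinds it requires. Names are assumptions, not
// the chapter's CD implementation.
struct ObjectKind {
    std::string name;
    int existenceLevel;  // the object exists at this LOD and higher
};

struct TaskSpec {
    std::string name;
    int lod;                                // the LOD this task is assigned to
    std::vector<std::string> requiredKinds;
};

// A required object can be demanded by the task only if it already
// exists at the task's LOD, i.e., existenceLevel <= task.lod.
bool RequirementsConsistent(const TaskSpec& task,
                            const std::map<std::string, ObjectKind>& kinds) {
    for (const std::string& k : task.requiredKinds) {
        auto it = kinds.find(k);
        if (it == kinds.end() || it->second.existenceLevel > task.lod)
            return false;
    }
    return true;
}
```

Such a check can run at design or load time to catch a task that demands an object before that object exists.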
Note that the technique does not allow for two existence levels for one object kind—for example, it is not possible that a knife for fighting has existence level 3, while a knife for cooking has 5. You must define either one object kind with the lower number or two different object kinds with two different existence levels.

Source Code Summary

The application included on the CD is a simulator of virtual worlds with LOD AI (in other words, it is not a game). The examples discussed in this chapter are based on the implemented demo world, which features five LODs and three kinds of NPCs: miners, barmen, and singers. The following behaviors have been implemented: walking home from a pub or a mine and vice versa, leisure-time behavior (for pubs), and working behavior (for mines). The code is in Java, and the documentation that is included details the LOD technique further.

Conclusion

Level-of-detail techniques allow for compromising between the consumption of computational resources and simulation plausibility. However, every LOD technique presents extra work for designers and programmers. Thus, you should consider carefully which LOD approach is needed and whether it is needed at all. The technique introduced in this gem fits well for large worlds with many NPCs with everyday behavior. It capitalizes on the fact that both the space and the behavior of NPCs can be, and often already are, represented hierarchically. The variable number of LODs helps not only with simulation plausibility, but also with keeping the overhead under control during LOD changes. The technique is generic, which means that for specific domains, special-purpose mechanisms can outperform it—for example, for fighting behavior [Brockington02, Grinke04]. Even in complex worlds, any special-purpose mechanism can be augmented with some or all of this technique if needed.
Acknowledgements

This technique and the simulator on the CD were developed as a part of research projects 1ET100300517 of the Program Information Society and MSM0021620838 of the Ministry of Education of the Czech Republic. We want to thank several students who participated in the simulator development: Martin Juhász, Jan Kubr, Jiří Kulhánek, Pavel Šafrata, Zdeněk Šulc, Jiří Vorba, and Petr Zíta.

References

[Bratman87] Bratman, Michael E. Intention, Plans, and Practical Reason. Harvard University Press, 1987.
[Brockington02] Brockington, Mark. “Level-Of-Detail AI for a Large Role-Playing Game.” AI Game Programming Wisdom. Boston: Charles River Media, 2002. 419–425.
[Champandard08] Champandard, Alex J. “Getting Started with Decision Making and Control Systems.” AI Game Programming Wisdom 4. Boston: Charles River Media, 2008. 257–264.
[Chenney01] Chenney, Stephen. “Simulation Level-Of-Detail.” University of Wisconsin, 2001.
[Fu04] Fu, Dan, and Ryan Houlette. “The Ultimate Guide to FSMs in Games.” AI Game Programming Wisdom 2. Boston: Charles River Media, 2004. 283–302.
[Grinke04] Grinke, Sebastian. “Minimizing Agent Processing in ‘Conflict: Desert Storm.’” AI Game Programming Wisdom 2. Boston: Charles River Media, 2004. 373–378.
[Harvey02] Harvey, Michael, and Carl Marshall. “Scheduling Game Events.” Game Programming Gems 3. Boston: Charles River Media, 2002. 5–14.
[Isla05] Isla, Damian. “Handling Complexity in Halo 2.” Gamasutra, 3 Nov. 2005.
[Šerý06] Šerý, Ondřej, et al. “Level-Of-Detail in Behaviour of Virtual Humans.” Proceedings of SOFSEM 2006: Theory and Practice of Computer Science 3831 (2006): 565–574.
[Wiki09] Wikipedia, The Free Encyclopedia. “Discrete Event Simulation.” 2009.

3.2 A Pattern-Based Approach to Modular AI for Games

Kevin Dill, Boston University

A great deal of time and effort is spent developing the AI for the average modern game.
Years ago, AI was often an afterthought for a single gameplay programmer, but these days most game projects employ at least one dedicated AI specialist, and entire AI teams are becoming increasingly common. At the same time, more and more developers are coming to realize that, even in multiplayer games, AI is not only a critical component for providing fun gameplay, but it is also essential if we are going to continue to increase the sense of realism and believability that were previously the domain of physics and rendering. It does no good to have a brilliantly rendered game with true-to-life physics if your characters feel like cardboard cutouts or zombie robots.

Given this increase in team size, the increasing prominence of AI in the success or failure of a game, and the inevitable balancing and feature creep that occur toward the end of every project, it behooves us to search for AI techniques that enable fast implementation, shared conventions between team members, and easy modification. Toward that end, this gem describes methods for applying patterns to our AI in such a way as to allow it to be built, tuned, and extended in a modular fashion.

The key insight that drives the entirety of this work is that the decisions made by the AI can typically be broken down into much smaller considerations: individual tests used in combination to make a single decision. The same considerations can apply to many different decisions. Furthermore, the evaluation of those considerations can be performed independent of the larger decision, and then the results can be combined as necessary into a final decision. Thus, we can implement the logic for each consideration once, test it extensively, and then reuse that logic throughout our AI.

None of the core principles described here are new. They can be found throughout software engineering and academic AI in a variety of forms.
However, all too often we game programmers rush into building code, trying to solve our specific problem of the day and get the product out the door, without taking a step back and thinking about how to improve those systems. With some thought and organization, it could be easier to change, extend, and even reuse bits and pieces in future projects. Taking the time to do that would pay off both in the short run, making life easier (and thus improving the final result) for the current title, and in the long run, as more AI code is carried forward from game to game.

A Real-World Example: Apartment Shopping

Let’s begin with a real-world example that illustrates the general ideas behind this work. Imagine that you have just taken a new job as an AI engineer at Middle of Nowhere Games, and you are in the process of searching for an apartment in some faraway city. You might visit a variety of candidates, write down a list of the advantages and disadvantages of each, and then use that list to guide your final decision. For an apartment in a large complex near a busy shopping area, for example, your list might look something like this:

606 Automobile Way, Apt 316
Pros: Close to work; easy highway access; convenient shopping district
Cons: Great view…of a used car lot; no off-street parking; highway noise

Another apartment, located in the attic of a kindly old lady’s country house, might have a wholly different list:

10-B Placid Avenue
Pros: Low rent; nearby woods, biking trails; electricity and water included
Cons: 45-minute commute; no shopping nearby; thin walls, landlady downstairs

These lists clearly reflect the decision-maker’s personal taste—in fact, it seems likely that if two people were to make lists for the same apartment, their lists would have little in common. It is not the actual decision being made that’s important, but rather the process being used to arrive at that decision.
That is to say, given a large decision (Where should I live for the next several years of my life?), this process breaks that decision down into a number of independent considerations, each of which can be evaluated in isolation. Only after each consideration has been properly evaluated do we tackle the larger decision.

There is a reasonably finite set of common considerations (such as the rent, ease of commute, size of the apartment, aesthetics of the apartment and surrounding environment, and so forth) that would commonly be taken into account by apartment shoppers. From an AI point of view, if we can encode those considerations, we can then share the logic for them from actor to actor, and in some cases even from decision to decision. As an example of the latter advantage, imagine that we wanted to select a location for a picnic. Several of the considerations used for apartments—such as the overall cost, the aesthetics of the surrounding environment, and the length of the drive to get there—would also be used when picking a picnic spot. Again, if it were an NPC making this decision, then cleverly designed code could be shared in a modular way and could perhaps even be configured by a designer once the initial work of implementing the overall architecture and individual considerations was complete.

Another advantage of this type of approach is that it supports extensibility. For example, imagine that after visiting several apartments, we came across one that had a hot tub and pool or a tennis court. Previously, we had not even considered the availability of these features. However, we can now add this consideration to our list of pros and cons without disturbing the remainder of our logic.

At this point we have made a number of grandiose claims—hopefully sufficient to pique the reader’s interest—but we clearly have some practical problems as well. What is described above is a wholly human approach to decision-making, not easily replicated in code.
Our next step, then, should be to see whether we can apply a similar approach to making relatively simple decisions, such as those that rely on a single yes or no answer.

Boolean Decisions

Many common architectures rely on simple Boolean logic at their core. For example, from a functional point of view, finite state machines have a Boolean decision-maker attached to each transition, determining whether to take that transition given the current situation. Behavior trees navigate the tree through a series of Boolean decisions (take this branch or don’t take this branch) until they arrive at an action they want to take. Rule-based architectures consist of a series of “rules,” each of which is a Boolean decision stating whether or not to execute the associated action. And so forth.

Constructing Decisions

To build a pattern-based architecture for our AI, we first need to define a shared interface for our considerations and then decide how to combine the results of their evaluation into a final decision. For Boolean decisions, consider the following interface:

class IConsiderationBoolean
{
public:
    IConsiderationBoolean() {}
    virtual ~IConsiderationBoolean() {}

    // Evaluate this consideration
    virtual bool Evaluate(const DecisionContext& context) = 0;

    // Load the data that controls our decisions
    virtual void LoadData(const DataNode& node) = 0;
};

Every consideration will inherit from this interface and therefore will be required to implement the Evaluate() and LoadData() methods. Evaluate() takes a DecisionContext as its only argument. The context contains whatever information might be needed to make a decision. For example, it might include the current game time, a pointer to the actor being controlled, a pointer to the game world, and so on. Alternatively, it might simply be a pointer to the actor’s knowledge base, where beliefs about the world are stored.
Regardless, Evaluate() processes the state of the world (as contained in the context) and then returns true if execution is approved or false otherwise. The other mandatory function is LoadData(). The data being loaded specifies how the decision should be made. We will go into more detail on this issue later in this gem.

Games are rife with examples of considerations that can be encoded in this way. For example, we might have a health consideration that will only allow an action to be taken if a character’s hit points are in a specified range. A time-of-day consideration might only allow an action to take place during the day (or at night or between noon and 1:00 p.m.). A cool-down consideration could prevent an action from being taken if it has been executed in the recent past.

The simplest approach for combining Boolean considerations into a final decision is to give each consideration veto power over the overall decision. That is, take the action being gated by the decision if and only if every consideration returns true. Obviously, more robust techniques can be implemented, up to and including full predicate logic, but this simple approach works well for a surprisingly large number of cases and has the advantage of being extremely simple to implement.

A Simple Example: First-Person Shooter AI

As an example of this approach in a game environment, consider the simple state machine in Figure 3.2.1, which is a simplified version of what might be found in a typical first-person shooter’s combat AI.

Figure 3.2.1 A simplified state machine for FPS combat AI.

In this AI, Chase is our default state. We exit it briefly to dodge or take a shot, but then return to it as soon as we’re done with that action. Here are the considerations we might attach to each transition:

Chase → Shoot:
• We have a line of sight to the player.
• It has been at least two seconds since our last shot.
• It has been at least one second since we last dodged.
• The player's health is over 0 percent (that is, he's not dead yet).

Chase → Dodge:
• We have a line of sight to the player.
• The player is aiming at us.
• It has been at least one second since our last shot.
• It has been at least five seconds since we last dodged.
• Our health is below 60 percent. (As we take more damage, we become more cautious.)
• Our health is less than 1.2 times the player's health. (If we're winning, we become more aggressive.)

Shoot → Chase:
• We've completed the Shoot action.

Dodge → Chase:
• We've completed the Dodge action.

One advantage of breaking down our logic in this way is that we can encapsulate the shared logic from each decision in a single place. Here are the considerations used above:

• Line of Sight Consideration. Checks the line of sight from our actor to the player (or, more generally, from our actor to an arbitrary target that can be specified in the data or the DecisionContext).
• Aiming At Consideration. Checks whether the player's weapon is pointed toward our actor.
• Cool-Down Consideration. Checks elapsed time since a specified type of action was last taken.
• Absolute Health Consideration. Checks whether the current health of our actor (or the player) is over (or under) a specified cutoff.
• Health Comparison Consideration. Checks the ratio between our actor's health and the player's health.
• Completion Consideration. Checks whether the current action is complete.

Each of those considerations represents a common pattern that can be found not only in this specific example, but also in a great many decisions in a great many games. These patterns are not unique to these particular decisions or even this particular genre. In fact, many of them can be seen in one form or another in virtually every game that has ever been written.
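To make the wiring concrete, here is one way the Chase → Shoot transition might bundle its considerations. The Context fields and the lambdas are illustrative stand-ins for the consideration classes listed above:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical snapshot of the facts the considerations need.
struct Context {
    bool  lineOfSight;
    float secondsSinceShot;
    float secondsSinceDodge;
    float playerHealth;
};

// A transition carries a target state and the considerations gating it.
struct Transition {
    std::string target;
    std::vector<std::function<bool(const Context&)>> considerations;

    bool Ready(const Context& c) const {
        for (const auto& consider : considerations)
            if (!consider(c))
                return false;
        return true;
    }
};

// Chase -> Shoot, built from the four considerations listed above.
Transition MakeChaseToShoot() {
    return Transition{"Shoot", {
        [](const Context& c) { return c.lineOfSight; },              // line of sight
        [](const Context& c) { return c.secondsSinceShot >= 2.0f; }, // shot cool-down
        [](const Context& c) { return c.secondsSinceDodge >= 1.0f; },// dodge cool-down
        [](const Context& c) { return c.playerHealth > 0.0f; },      // player alive
    }};
}
```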
Line-of-sight checks, for example, or cool-downs to prevent abilities from being used too frequently…these are basic to game AI.

AI Specification

At this point we have all the pieces we need. We know what behaviors we plan to support (in this case, one behavior per state), we know all of the decisions that need to be made (represented by the transitions), and we know the considerations that go into each decision. However, there is still some work to do to put it all together. Most of what needs to be done is straightforward. We can implement a character class that contains an AI. The AI contains a set of states. Each state contains a list of transitions, and each transition contains a list of considerations.

Keep in mind that the considerations don't always do the same thing. For example, both Chase → Shoot and Chase → Dodge contain an Absolute Health consideration, but those considerations are expected to return true under very different conditions. For Chase → Shoot, we return true if the health of the player is above 0 percent. For Chase → Dodge, on the other hand, we return true if the health of our actor is below 60 percent. More generally, each decision that includes this consideration also needs to specify whether it should examine the health of the player or our actor, whether it should return true when that value is above or below the cutoff, and what cutoff it should use. The information used to specify how each instance of a consideration should evaluate the world is obtained through the LoadData() function:

void CsdrHealth::LoadData(const DataNode& node)
{
    // If true, we check the player's health; otherwise
    // we check our actor's health.
    m_CheckPlayer = node.GetBoolean("CheckPlayer");

    // The cutoff for our health check - may be the
    // upper or lower limit, depending on the value
    // of m_HighIsGood.
    m_Cutoff = node.GetFloat("Cutoff");

    // If true, then we return true when our health is
    // above the cutoff; otherwise we return true when
    // our health is below the cutoff.
    m_HighIsGood = node.GetBoolean("HighIsGood");
}

As you can see, this is fairly straightforward. We simply acquire the three values the Evaluate() function will need from our data node. With that in mind, here is the Evaluate() function itself:

bool CsdrHealth::Evaluate(const DecisionContext& ctxt)
{
    // Get the health that we're checking - either ours
    // or the player's.
    float health;
    if (m_CheckPlayer)
        health = ctxt.GetHealth(ctxt.GetPlayer());
    else
        health = ctxt.GetHealth(ctxt.GetMyActor());

    // Do the check.
    if (m_HighIsGood)
        return health >= m_Cutoff;
    else
        return health <= m_Cutoff;
}

Again, there's nothing complicated here. We get the appropriate health value (either ours or the player's, depending on what LoadData() told us) from the context and compare it to the cutoff. The simplicity of this code is in many ways the entire point. Each consideration is simple and easy to test in its own right, but combined they become powerfully expressive.

Extending the AI

Now that we have built a functional core AI for our shiny new FPS game, it's time to start iterating on that AI, finding ways in which it's less than perfect, and fixing them. As a first step, let's imagine that we wanted to add two new states: Flee and Search (as seen in Figure 3.2.2). Like all existing states, these new states transition to and from the Chase state. Here are the considerations for our new transitions:

Chase → Flee:
• Our health is below 15 percent.
• The player's health is above 10 percent. (If he's almost dead, finish him.)

Figure 3.2.2 Our state machine with two new states.

Flee → Chase:
• The player's health is below 10 percent.
• We have a line of sight to the player.
Chase → Search:
• We don't have a line of sight to the player.

Search → Chase:
• We have a line of sight to the player.

As you can see, adding these states should be fairly simple. We will need to implement the new behaviors, but all of the considerations needed to decide whether to execute those behaviors already exist. Of course, not all changes can be made using existing considerations. For example, imagine that we want to make two further changes to our AI:

• QA complains that dodging is too predictable. Instead of always using a five-second cool-down, they want us to use a cool-down that varies randomly between three and seven seconds.
• The artists would like to add a really cool-looking animation for drawing your weapon. In order to support this, the designers have asked us to have the characters draw their weapons when they enter Chase and then holster them again if they go into Search.

The key is to find a way to modify our existing code so that we can support these new specifications without affecting any other portion of the AI. We certainly don't want to have to go through the entire AI for every character, find every place that these considerations are used, and change the data for all of them.

Implementing the variable cool-down is fairly straightforward. Previously the Cool-Down consideration took a single argument to specify the length of the cool-down. We'll modify it to optionally take minimum and maximum values instead. Thus, all the existing cases will continue to work (with an exact value specified), but our Cool-Down consideration will now have improved functionality that can be used to fix this bug and can also be used in the future as we continue to build the AI. We'll have to take some care in making sure that exactly one type of cool-down is specified. In other words, the user needs to specify either an exact cool-down or a variable cool-down; he or she can't specify both. An assert in LoadData() should be sufficient.
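A minimal sketch of that extended Cool-Down consideration follows. The field and method names are illustrative, not from the gem, and plain floats stand in for the DataNode accessors:

```cpp
#include <cassert>
#include <cstdlib>

// Sketch: either an exact duration or a [min, max] range is specified
// (negative values mean "unset"); the assert enforces that exactly one
// form is provided, as suggested above.
class CsdrCoolDown {
public:
    void LoadData(float exact, float minVal, float maxVal) {
        bool hasExact = exact >= 0.0f;
        bool hasRange = minVal >= 0.0f && maxVal >= minVal;
        assert(hasExact != hasRange); // exactly one form allowed
        if (hasExact) { m_Min = exact;  m_Max = exact;  }
        else          { m_Min = minVal; m_Max = maxVal; }
        Reroll();
    }

    // True once enough time has passed; the threshold is re-rolled after
    // each success, so the next cool-down may differ (the QA fix).
    bool Evaluate(float secondsSinceAction) {
        if (secondsSinceAction < m_Current)
            return false;
        Reroll();
        return true;
    }

private:
    void Reroll() {
        float t = static_cast<float>(std::rand()) /
                  static_cast<float>(RAND_MAX);
        m_Current = m_Min + t * (m_Max - m_Min);
    }
    float m_Min = 0.0f, m_Max = 0.0f, m_Current = 0.0f;
};
```

Existing data that supplies an exact value keeps its old behavior, while new data can supply the three-to-seven-second range.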
For the second change, we can make the Chase behavior automatically draw the weapon (if it's not drawn already), so that doesn't require a change to our decision logic. We do need to ensure that we don't start shooting until the weapon is fully drawn, however. In order to do that, we simply implement an Is Weapon Drawn consideration and add it to the transition from Chase to Shoot.

Data-Driven AI

One thing to notice is that all of the values required to specify our AI are selected at game design time. That is, we determine up front what decisions our AI needs to make, what considerations are necessary to support those decisions, and what tuning values we should specify for each consideration. Once the game is running, they are always the same. For example, our actor's hit points might go up or down, but if a decision uses an Absolute Health consideration, then the threshold at which we switch between true and false never changes during gameplay.

Since none of this changes during gameplay, nearly all of the decision-making logic can be specified in data. We can specify the AI for each character, where the AI contains a set of states, each state contains a list of transitions, each transition contains a list of considerations, and each consideration contains the tuning values used for that portion of the AI. This sort of hierarchical structure is something that XML does well, making it an excellent choice for our data specification.

As with all data-driven architectures, the big advantage is that if we want to change the way the decisions are made, we only have to change data, not code. No recompile is required. If you've implemented the ability to reload the data while the game is live, you don't even need to restart the game, which can save a lot of time when testing a situation that is tricky to re-create in-game.
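As a concrete illustration, the hierarchical data for one transition might look something like the following. The element and attribute names here are invented for illustration; the gem does not prescribe a schema:

```xml
<!-- Hypothetical schema: the Chase -> Dodge transition from the FPS example. -->
<Transition target="Dodge">
  <Consideration type="LineOfSight"/>
  <Consideration type="AimingAt"/>
  <Consideration type="CoolDown" action="Shoot" exact="1.0"/>
  <Consideration type="CoolDown" action="Dodge" min="3.0" max="7.0"/>
  <Consideration type="AbsoluteHealth" checkPlayer="false"
                 cutoff="0.6" highIsGood="false"/>
</Transition>
```

Retuning the dodge behavior is then purely a data change to this file, with no recompile.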
Of course, some changes, such as implementing an entirely new consideration, will require code changes, but much of the tuning and tweaking—and sometimes even more sizeable adjustments—will not.

One thing to consider when taking this approach is whether it's worth investing some time into the tools you use for data specification. Our experience has been that the time spent specifying and tuning behavior is significantly greater than the time spent writing the core reasoner. Good tools can not only make those adjustments quicker and easier (especially if they're integrated into the game so that adjustments can be made in real time), but they can also include error checking for common mistakes and help to avoid subtle behavioral bugs that might otherwise be hard to catch. Further, since many of these considerations are broadly applicable to a variety of games, not only the considerations but also the tools for specifying them can be carried from game to game as part of your engine (or even integrated into a new engine). Finally, it is often possible to allow designers and artists to specify AI logic if you create good tools for doing so, giving them more direct control over the look and feel of the game and freeing you up to focus on issues that require your technical expertise.

Meta-Considerations

One of the benefits of this approach is that it reduces the amount of duplicate code in your AI. For example, if there are seven different decisions that evaluate the player's hit points, instead of writing that evaluation code in seven places, we write it once, in the form of the Absolute Health consideration. As our work on the AI progresses, we might quickly find that we also have an Absolute Mana consideration, an Absolute Stamina consideration, a Time of Day consideration, a Cool-Down consideration, and a Distance to Player consideration.
Although each of these considers a different aspect of the in-game situation, under the covers they each do exactly the same thing. That is, they compare a floating-point value from the game (such as the player's hit points, the current time of day, the distance to the player, and so forth) to one or two static values that are specified in data.

With that in mind, it's worth looking for opportunities to further reduce duplicate code by building meta-considerations, which is to say high-level considerations that handle the implementation of more specific, low-level considerations, such as the ones given above. This has the advantage of not only further reducing the duplication of code, but also enforcing a uniform set of conventions for specifying data. In other words, if all of those considerations inherit from a Float Comparison consideration base class, then the data for them is likely to look remarkably similar, and a designer specifying data for one that he hasn't used before is likely to get the result he expects on his first try, because it works the same way as every other Float Comparison consideration that he's used before.

Float-Based Decisions

While nearly all decisions are ultimately Boolean (that is, an AI either takes an action or it doesn't), it is often useful to evaluate the suitability of a variety of options and then allow that evaluation to guide our decisions. As with Boolean decisions, there are a variety of approaches for doing this, and discussion of them can be found throughout the AI literature. A few game-specific examples include [Dill06], [Dill08], and [Mark09]. For the purposes of this gem, however, the interesting question is not how and when to use a float-based approach, but rather how to build modular, pattern-based evaluation functions when we do.
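A minimal sketch of such a Float Comparison base class follows. The class, struct, and field names are invented to mirror the CsdrHealth example, not taken from the gem:

```cpp
#include <cassert>

// Hypothetical bundle of the float values considerations might read.
struct GameState {
    float timeOfDay;        // hours, 0-24
    float distanceToPlayer; // world units
};

// Meta-consideration: the comparison logic and tuning data live here once.
class CsdrFloatComparison {
public:
    virtual ~CsdrFloatComparison() {}
    void LoadData(float cutoff, bool highIsGood) {
        m_Cutoff = cutoff;
        m_HighIsGood = highIsGood;
    }
    bool Evaluate(const GameState& state) const {
        float value = GetValue(state);
        return m_HighIsGood ? value >= m_Cutoff : value <= m_Cutoff;
    }
protected:
    // Derived classes only say which value to compare.
    virtual float GetValue(const GameState& state) const = 0;
private:
    float m_Cutoff = 0.0f;
    bool m_HighIsGood = true;
};

class CsdrTimeOfDay : public CsdrFloatComparison {
protected:
    float GetValue(const GameState& s) const override { return s.timeOfDay; }
};

class CsdrDistanceToPlayer : public CsdrFloatComparison {
protected:
    float GetValue(const GameState& s) const override { return s.distanceToPlayer; }
};
```

Each new derived consideration is a handful of lines, and all of them share one data convention (a cutoff plus a direction flag).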
An Example: Attack Goals

Imagine that we are responsible for building the opposing-player AI for a real-time strategy game. Such an AI would need to decide when and where to attack. In order to do this, we might periodically score several prospective targets. For each target we would consider a number of factors, including its economic value (that is, whether it generates revenue or costs money to maintain), its impact on the strategic situation (for example, would it allow you access to the enemy's territory, consolidate your defenses, or protect your lines of supply), and the overall military situation (in other words, whether you can win this fight).

Our next step should be to find a way for each of these considerations to evaluate the situation independently, in a form that lets us easily combine the results of all of those evaluations into a single score, which can be used to make our final decision. Creating an evaluation function is as much art as science, and in fact there is an entire book dedicated to this subject [Mark09]. However, just as with Boolean decisions, there are simple tricks that can be used to handle the vast majority of situations. Specifically, we can modify our Evaluate() function so that it returns two values: a base priority and a final multiplier. When every consideration has been evaluated, we first add all of the base priorities together and then multiply that total by the product of the final multipliers. This allows us to create considerations that are either additive or multiplicative in nature, which are two of the most common techniques for creating priority values.

Coming back to our example, the considerations for economic value and strategic value might both return base priorities between –500 and 500, generating an overall base priority between –1,000 and 1,000 for each target. They would return a negative value if, from their point of view, taking this action would be a bad idea.
For example, capturing a building that has an ongoing upkeep cost might receive a negative value from the economic consideration (unless it had some economic benefit to offset that cost), because once you own it, you'll have to start paying its upkeep. Similarly, attacking a position that, if obtained, would leave you overextended and exposed would receive a negative value from the strategic consideration. These considerations could return a final multiplier of 1, leaving the overall score unchanged.

The consideration for the military situation, however, could be multiplicative in nature. That is, it would return a base priority of 0 but a final multiplier between 0 and 3. (For a better idea of how to generate that multiplier, see our previous work [Dill06].) Thus, if the military situation is completely untenable (in other words, the defensive forces are much stronger than the units we would use to attack), then we could return a very small multiplier (such as 0.000001), making it unlikely that this target would be chosen no matter how attractive it is from an economic or strategic standpoint. On the other hand, if the military situation is very favorable, then we would strongly consider this target (by specifying a multiplier of 3), even if it is not tremendously important in an economic or strategic sense. If we have no military units to attack with, then this consideration might even return a multiplier of 0.

We do not execute an action whose overall priority is less than or equal to zero. Thus, if all of the base priorities add up to a negative value, or if any consideration returns a final multiplier of zero, then the action will not be executed. Just as in the Boolean logic case, a single consideration can effectively veto an action by returning a final multiplier of zero. As with the previous examples, the greatest value to be found is that these same considerations can be reused elsewhere in the AI.
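The scoring rule just described, i.e. sum the base priorities, then scale by the product of the final multipliers, can be sketched as follows (the struct and function names are placeholders):

```cpp
#include <cassert>
#include <vector>

// Each consideration's Evaluate() returns these two values.
struct Evaluation {
    float basePriority;
    float finalMultiplier;
};

// Overall score = (sum of base priorities) * (product of final multipliers).
float CombineEvaluations(const std::vector<Evaluation>& evals) {
    float baseSum = 0.0f;
    float multiplier = 1.0f;
    for (const Evaluation& e : evals) {
        baseSum    += e.basePriority;
        multiplier *= e.finalMultiplier;
    }
    return baseSum * multiplier;
}

// Actions scoring <= 0 are never executed, so a zero multiplier is a veto.
bool IsViable(float score) { return score > 0.0f; }
```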
For example, the economic value might be used when selecting buildings to build or technologies to research. The strategic value might be used when selecting locations for military forts and other defensive structures. The military situation would be considered not only for attacks, but also when deciding where to defend, and perhaps even when deciding whether to build structures in contested areas of the map.

Alternate Approaches

One weakness of the aforementioned approach is that it only allows considerations to have additive or multiplicative effects on one another. Certainly there are many other ways to combine techniques—in fact, much of the field of mathematics addresses this topic! One common trick, for example, is to use an exponent to change the shape of the curve when comparing two values. Certainly we can extend our architecture to include this, perhaps by adding an exponent to the set of values returned by our Evaluate() function and applying all of the exponents after the final multipliers. Doing so significantly increases the complexity of our AI, however, because this new value needs to be taken into account with every consideration we implement and every decision we make. This may not seem like a big deal, but it can make the task of specifying, tuning, and debugging the AI significantly harder than it would otherwise be—especially if it is to be done by non-technical folks who may not have the same intuitive sense of how the numbers combine as an experienced AI engineer would.

Along the same lines, in many cases even the final multiplier is overkill. For simpler decisions (such as those in an FPS game or an action game), we can often have the Evaluate() function return a base priority as before, but instead of returning a multiplier, it can simply return a Boolean value that specifies whether it wants to veto this action.
If any consideration returns false, then the action is not taken (the score is set to 0); otherwise, the score is the sum of the base priorities.

Conclusion

This gem has presented a set of approaches for building modular, pattern-based architectures for game AI. All of these approaches function by breaking a decision into separate considerations and then encoding each consideration independently. There are several advantages to techniques of this type:

• Code duplication between decisions is dramatically reduced, because the code for each consideration goes in a single place.
• Considerations are reusable not only within a single project, but in some cases also between multiple projects. As a result, AI implementation will become easier as the library of considerations grows larger and more robust.
• Much of the AI can be specified in data, with all the advantages that implies.
• With proper tools and a good library of considerations, designers and artists can be enabled to specify AI logic directly, both getting them more directly involved in the game AI and freeing up the programmer for other tasks.

References

[Dill06] Dill, Kevin. "Prioritizing Actions in a Goal-Based AI." AI Game Programming Wisdom 3. Boston: Charles River Media Inc., 2006. 321–330.
[Dill08] Dill, Kevin. "Embracing Declarative AI with a Goal-Based Approach." AI Game Programming Wisdom 4. Boston: Charles River Media Inc., 2008. 229–238.
[Mark09] Mark, Dave. Behavioral Mathematics for Game AI. Course Technology PTR, 2009.

3.3 Automated Navigation Mesh Generation Using Advanced Growth-Based Techniques

D. Hunter Hale and G. Michael Youngblood

When implementing a navigation system for intelligent agents in a virtual environment, the agent's world representation is one of the most important decisions of the development process [Tozour04]. A good world representation provides the agent with a wealth of information about its environment and how to navigate through it.
Conversely, a bad representation of the world can confuse or mislead an agent and become more of a hindrance than an aid. Currently, the most common type of world representation is the navigation mesh [McAnils08]. This mesh contains a complete listing of all of the navigable areas (negative space) and occupied areas (positive space) present in a level or area of a game.

Traditional methods of generating the navigation mesh focus on using the vertices of objects to generate a series of triangles, which then become the navigation mesh [Tozour02]. This does generate high-coverage navigation meshes, but the meshes tend to have areas that can cause problems for agents navigating through the world. These problem areas take the form of many separate triangular negative space regions coming together at a single point. Agents or other objects that are standing on or near this point are simultaneously in more than one region. When objects can occupy more than one region at once, every one of those regions has to be evaluated for events involving the overlapping objects, instead of just a single region if objects were well localized.

Instead of using a triangulation-based navigation mesh generation technique, we approached the problem using the Space Filling Volume (SFV) algorithm [Tozour04] as a base. SFV is a growth-based technique that first seeds the empty areas of a game world with quads or cubes and then expands those objects in every direction until they hit an obstruction. The quads and the connections between them define the navigation mesh. By using quads as the core shape in the algorithm, the problem of many regions coming together at a single point is dramatically reduced, since at most four quad corners can meet at any one point.
This basic approach works well for worlds composed of axis-aligned obstructions but produces low-coverage navigation meshes when applied to non-axis-aligned or highly complex worlds. Our improved algorithms address these limitations and provide several other benefits over traditional navigation mesh generation techniques.

The two algorithms described here are enhancements to the traditional implementations of 2D and 3D Space Filling Volumes. The first is a 2D algorithm called Planar Adaptive Space Filling Volumes (PASFV). PASFV consumes a representation of an arbitrary non-axis-aligned 3D environment, similar to a blueprint of a building, and then generates a high-quality decomposition. Our new decomposition algorithms seed the world with growing quads which, when a collision with geometry occurs, dynamically increase their number of sides to better approximate the shape the growing region intersected. Using this ability to dynamically create higher-order polygons from quads, along with a few other features not present in classic SFV, PASFV generates almost 100-percent-coverage navigation meshes for levels where it is possible to generate planar slices of the obstructing geometry.

The second algorithm, Volumetric Adaptive Space Filling Volumes (VASFV), is, as the name implies, a native 3D implementation of Adaptive Space Filling Volumes with several enhancements. These enhancements allow VASFV to grow cuboids that morph into complex shapes to better adapt to the geometry of the level they are decomposing, similar to the 2D version. Like its 2D cousin, this algorithm generates a high-coverage decomposition of the environment; however, the world does not need to be projected into a planar representation, and the native geometry can be decomposed without simplification. In addition, this algorithm has a speed advantage over PASFV in post-processing because it can consume complex levels in a single run.
PASFV generally has to decompose a level one floor at a time, and the generated navigation meshes must then be reconnected to be useful. VASFV generates a single navigation mesh per level, removing a potentially expensive step from the process of generating a spatial decomposition.

The Algorithms

Both the PASFV and VASFV algorithms work off the common principle of expanding a grid of pre-seeded regions in a game environment to fill all of the available negative space. In practice, this filling effect looks somewhat similar to a marshmallow heated in a microwave. Both algorithms use a similar approach, but the implementations are sufficiently different that each deserves a full explanation.

PASFV

The PASFV algorithm [Hale08] is an iterative algorithm that can be broken down into a series of simple steps. First, a set of invariants and starting conditions must be established and maintained. For our implementation of PASFV, all of the input geometry must be convex. This allows the use of the point-in-convex-object collision test [Schneider03] when determining whether a growing region has intruded into a positive space area. If the input geometry is not natively convex, it can be converted by using a triangular subdivision algorithm. While the algorithm is running, the following two conditions must be maintained to achieve a successful decomposition. First, at the end of every growth cycle, all the negative space regions in the world must be convex; otherwise, the collision detection tests, which are based on an assumption of convexity, will return invalid data. Second, once a region has ended a growth cycle covering an area of the level, it must continue to cover that area. If this restriction is not maintained, there will be gaps in the final decomposition.
Our algorithm begins in a state we refer to as the initial seeding state, where the world is "seeded" at user-defined grid intervals with negative space regions. These regions will grow and decompose the world. If a proposed seed placement falls within a positive space obstruction, it is discarded. Initial regions are unit squares (a unit square is a square with an edge length equal to one of the base units of the world) with four edges arranged in a counterclockwise direction starting from the point closest to the origin. The initial placement of these regions in the world is such that they are axis-aligned. After being seeded in the world, each of the placed regions is iteratively provided a chance to grow. Growth is defined for a region as a chance to move each edge outward individually in the direction of that edge's normal.

The decomposition of a level may take two general cases. The first case occurs when all of the positive space regions are axis-aligned. The more advanced case occurs if there is non-axis-aligned geometry. First, we will examine the base, axis-aligned case for a spatial decomposition in PASFV. Growth occurs in the direction of the normal for each of the edges of a region and is a single unit in length. After an edge has advanced, we verify the new regional coverage with three collision detection tests. We want to guarantee that no points from our newly expanded region have intruded into any of the other regions or any positive space obstructions. We also want to verify that no points from other regions or obstructions are contained within the growing region. Finally, the region performs a self-test to ensure that it is still convex. This final check is not necessary for the base case of the axis-aligned world and can be omitted if there are no non-axis-aligned collision objects.
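For the axis-aligned base case, a single region's growth might look like the following sketch. It uses simple box overlap tests in place of the point-in-convex-object tests the algorithm actually relies on, and the Box/Region types are invented for illustration:

```cpp
#include <cassert>
#include <vector>

struct Box { float minX, minY, maxX, maxY; };

// Strict overlap: boxes that merely share an edge do not collide.
bool Overlaps(const Box& a, const Box& b) {
    return a.minX < b.maxX && b.minX < a.maxX &&
           a.minY < b.maxY && b.minY < a.maxY;
}

// One region: an axis-aligned quad plus per-edge "stop growing" flags.
struct Region {
    Box box;
    bool blocked[4] = {false, false, false, false}; // -x, +x, -y, +y

    // Push each unblocked edge out one unit; revert and flag it on collision.
    // Returns true if any edge moved, so the caller loops until nothing grows.
    bool GrowOnce(const std::vector<Box>& obstacles) {
        bool grew = false;
        for (int e = 0; e < 4; ++e) {
            if (blocked[e]) continue;
            Box trial = box;
            if      (e == 0) trial.minX -= 1.0f;
            else if (e == 1) trial.maxX += 1.0f;
            else if (e == 2) trial.minY -= 1.0f;
            else             trial.maxY += 1.0f;
            bool hit = false;
            for (const Box& obs : obstacles)
                if (Overlaps(trial, obs)) { hit = true; break; }
            if (hit) blocked[e] = true;       // now adjacent: stop this edge
            else     { box = trial; grew = true; }
        }
        return grew;
    }
};
```

A full implementation would also test against the other growing regions and run the convexity self-test described above.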
Assuming all tests show there are no collisions or violations of convexity, the region finalizes its current shape, and the next region grows. If a collision is detected, several things must be done to correct it. First, the growing region must return to its previous shape. At this point, since both the region and the obstruction it collided with are axis-aligned, we know the region is parallel and adjacent to the object, as shown in Figure 3.3.1(a). Stopping growth here will provide an excellent representation of free space near the collided object. Finally, we set a flag on the edge of the region where the collision occurred. This flag indicates the edge should not attempt to grow again. The iterative growth proceeds until no region is able to grow. This method of growth is sufficient to deal with axis-aligned worlds and produces results similar to traditional SFV.

The advanced case algorithm for PASFV is able to deal with a much wider variety of world environments. This case builds on the base case algorithm; however, because it needs to deal with non-axis-aligned positive space regions, it incorporates several new ways of dealing with potential collisions.

Figure 3.3.1 In this illustration we see all of the potential collision cases in PASFV. The growing negative space regions are shown as white boxes, and the direction of growth is marked with an arrow. Positive space regions are drawn in gray. In (a) we see the most basic axis-aligned collision case. Then (b) shows the collision case that occurs when a vertex of a positive space object intersects a growing negative space region. Finally, in (c) we illustrate the most complex case, where a negative space region is subdivided into a higher-order polygon to better adapt to world geometry.

When a collision
with a positive space object occurs, one of three cases handles it. The first case occurs when the colliding edge of the positive space object is parallel to the edge of the growing negative space region. The obstruction is effectively axis-aligned relative to the region, so we revert to the base case. The second case occurs when a single vertex of an obstruction collides with a growing edge, as shown in Figure 3.3.1(b). In this case there is, unfortunately, nothing we can do to grow further in this direction, because doing so would require either losing the convex property of the region or relinquishing some of the area it already covers, and either would violate our invariants. This case therefore also reverts to the base case, and the negative space around the object will have to be covered by additional seeding passes, which are described at the end of this section.

The final, most complex, collision case occurs when a vertex of the growing region collides with a non-axis-aligned edge of a positive space obstruction, as shown in Figure 3.3.1(c). In this case, the colliding vertex must be split into a pair of new vertices with a new edge inserted between them. This increases the order of the polygon that collided with the obstruction. The directions of growth for these two newly created vertices are modified so that they follow the line equation of the edge they collided with instead of the normal of the edge they were on. In addition, potential expansions of these new vertices are limited to the extent of the positive space edge they collided with. In this manner, the original edges, which were adjacent to the collision point, grow outward. This outward movement expands the newly created edge so that it spreads out along the obstruction. Limiting the growth of these vertices is important because they are creating a non-axis-aligned edge as they expand.
As long as this newly generated non-axis-aligned edge is adjacent to positive space, no other region can interact with it, and we limit region-to-region collisions to the base case. By using these three advanced case collision solutions, we are able to generate high-quality decompositions for non-axis-aligned worlds. As in the simple case with an axis-aligned world, the algorithm stops once all of the regions present in the world are unable to grow any further.

The aforementioned growth methods are not, by themselves, enough to ensure that the entirety of the world is covered by the resulting navigation mesh. In particular, the decompositions resulting from the second collision case are suboptimal. In order to deal with these issues, the second half of the PASFV algorithm comes into play. After all growth has terminated, the algorithm enters a new seeding phase. In this phase, each region places new regions (seeds) in any adjacent unclaimed free space, and we then flag the region as seeded so it will not be considered in any later seeding passes. If any seeds are placed, they are provided the opportunity for growth like the originally placed regions. This cycle of growth and seeding repeats until there are no new seeds placed in the world, as shown in Figure 3.3.2. At this point, the algorithm has fully decomposed the world and terminates.

VASFV

The Volumetric Adaptive Space Filling Volume (VASFV) algorithm [Hale09a] is a natural extension of the PASFV algorithm. Unlike PASFV, the VASFV algorithm works in native 3D and grows cubes and higher-order polyhedrons instead of quads. The initial constraint of convex input shapes is the same for both algorithms. In addition, the requirements that decomposed areas remain decomposed and that all regions always end a growth step in a convex state are still necessary.
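Both algorithms rest on the same grow-test-revert loop: expand one edge a unit, test the invariants, and revert and flag the edge on failure. The following is a minimal 2D sketch of that loop for the axis-aligned base case, using rectangles for both regions and obstructions; the names (`overlaps`, `grow_regions`, `EDGE_DELTAS`) are illustrative, not taken from the authors' implementation.

```python
def overlaps(a, b):
    """True if two axis-aligned rects (x0, y0, x1, y1) intersect with area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# One growth offset per edge: left, bottom, right, top.
EDGE_DELTAS = [(-1, 0, 0, 0), (0, -1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]

def grow_regions(regions, obstacles, world):
    """Iteratively grow each region one unit per edge; on a collision or
    on leaving the world bounds, revert and stop growing that edge."""
    blocked = [set() for _ in regions]          # edges flagged as done
    grew = True
    while grew:
        grew = False
        for i, r in enumerate(regions):
            for e, d in enumerate(EDGE_DELTAS):
                if e in blocked[i]:
                    continue
                cand = tuple(r[k] + d[k] for k in range(4))
                others = [q for j, q in enumerate(regions) if j != i]
                inside = (cand[0] >= world[0] and cand[1] >= world[1] and
                          cand[2] <= world[2] and cand[3] <= world[3])
                if inside and not any(overlaps(cand, o)
                                      for o in obstacles + others):
                    regions[i] = r = cand       # keep the grown shape
                    grew = True
                else:
                    blocked[i].add(e)           # revert: stop this edge
    return regions
```

Because adjacency uses strict inequalities, regions may grow flush against obstructions and against each other, which is exactly the "parallel and adjacent" stopping state described above. Convexity checks are omitted here because axis-aligned rectangles remain convex by construction.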
The initial setup of VASFV is very similar to its predecessor. Both algorithms begin by seeding a grid of initial regions throughout the world. In VASFV, the grid extends upward along the Z-axis of the world as well as along the X- and Y-axes. Then, in the first of many departures from the previous algorithm, the seeds fall down in the direction of gravity until they come to rest on an object. Seeds that end up in the same location are removed. This helps to prevent the formation of large, relatively useless regions that float above the level and allows the ground-based regions to grow up to the maximum allowable height. These seeds are initially spawned as unit cubes, represented as regions with four side faces listed in counterclockwise order, starting with the face closest to the origin, followed by the bottom and top faces. At this point, the regions may grow and expand.

Figure 3.3.2 In this illustration we see several growth and seeding cycles in the PASFV algorithm to decompose an area. The growing negative space regions are shown as white boxes, and the direction of growth is marked with an arrow. Positive space regions are drawn in gray.

As with the planar version of this algorithm, there are two main cases to deal with: axis-aligned and non-axis-aligned worlds. The base case for the axis-aligned world proceeds in a manner almost identical to the planar algorithm. Each region is iteratively provided the opportunity to grow out each face one unit in the direction of the normal of each face. We then run the same three tests for error conditions as performed in the planar version of the algorithm. As a reminder, these three tests check that the growing region has not intersected an existing region or obstruction with one of its vertices, that no other region's or object's vertices have intersected the newly expanded region, and that the region is still convex.
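Stepping back to the setup phase, the drop-and-deduplicate behavior of VASFV seeds can be sketched on a discrete cell grid as follows. This is a simplified model under stated assumptions (unit cells, gravity along -Z, a floor at `floor_z`); the function and parameter names are hypothetical.

```python
def drop_seeds(seeds, solid, floor_z=0):
    """Drop each (x, y, z) seed along -Z until the cell below is solid or
    the floor is reached; then remove seeds resting in the same location.
    `solid` is the set of (x, y, z) cells occupied by positive space."""
    rested = []
    for x, y, z in seeds:
        while z > floor_z and (x, y, z - 1) not in solid:
            z -= 1                      # simulated gravity
        rested.append((x, y, z))
    # Seeds that end up in the same location are removed.
    seen, unique = set(), []
    for s in rested:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique
```

For example, two seeds spawned at different heights over the same floor cell collapse to a single seed, while a seed over an obstruction comes to rest on top of it.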
If any of these checks fail, the region will revert to its previous size and not attempt to grow again in that direction. These steps are identical to the planar case; however, the secondary algorithms required to implement them are more complex in 3D. As with the planar version of this algorithm, this base case will produce a very good decomposition of an axis-aligned world.

Like the planar version of the algorithm, the advanced growth case for dealing with collisions with non-axis-aligned geometry can be broken down into four cases (shown in Figure 3.3.3). Which case the algorithm enters is determined by how many vertices are in collision and by whether negative space vertices collide with positive space, or vice versa. The simplest case occurs when the growing cubic region has intersected one or more vertices of a positive space obstruction. Just as in 2D, since there is nothing the growth algorithm can do to better approximate the free space around the object it has collided with, it returns to its previous valid shape, halting further growth in that direction.

The next three collision cases occur when vertices from a single face of the growing negative space region intersect positive space. The first and simplest of these cases occurs when three or more vertices of a negative space region intersect the same face of a positive space object. When this happens, it means that the growing face of the negative space object is parallel to and coplanar with the face of the positive space obstruction it intersected. Therefore, these two faces are both axis-aligned. We know this because three points define a plane, so by sharing these three points, both of these faces are on the same plane. This tells us that since the negative space face is axis-aligned, the positive space face we collided with must be as well. We can thus revert to the base case for this collision.
The final two collision cases require the insertion of a new face into the negative space region so that it adapts to the face of the collided object. The first case occurs when a single vertex of the region intersects an obstruction. In this case, the vertex will be subdivided into three new vertices, and a new triangular face is inserted (which has the same plane equation as the face of the object it collided with). The normal of this new face will be the inverse of the normal of the face of the intersected obstruction. These new points are restricted to prevent them from growing beyond the face of the obstruction they collided with in order to not create more non-axis-aligned geometry.

The final collision case occurs when exactly two vertices of a negative space region intersect another object. This means that a single edge of the region is in contact with the shape it collided with, and that edge needs to be split. We split the edge by adding a new rectangular face to the region and by subdividing each colliding vertex into two vertices. This new face is once again created using the negation of the normal of the face it would intersect. The points involved in the collision are restricted to growing just along the collided obstruction. With the last two special collision cases, it is possible to generate navigation meshes with a degree of accuracy and fidelity to the underlying level that is not possible using previous growth-based techniques.

Figure 3.3.3 This illustration shows the possible collision cases for the VASFV algorithm. The growing negative space regions are shown in white. The positive space objects are shown in gray. Section (a) shows the base axis-aligned collision case from above. Section (b) shows the more complex positive space vertex collision case, again from above. Sections (c) and (d) are illustrated slightly differently. In order to clearly show how the negative space region reacts to a positive space collision, the positive space object is not drawn, and the colliding vertex is marked with a circle. Section (c) shows the single vertex collision case where a new triangular face is inserted into the negative space region. Section (d) shows the more complex two-vertex collision where a quadrilateral face is inserted into a negative space region to better approximate the object it collided with.

Like the planar version of this algorithm, VASFV also uses a seeding algorithm to ensure full coverage of a level. However, the volumetric seeding approach is slightly different from the planar version. Once each region has reached its maximum possible extents, the seeding algorithm iteratively provides each region a chance to create new seeds in adjacent negative space. However, instead of immediately growing the new regions, the newly placed seeds are subjected to a simulated gravity and projected downward until they hit something, and duplicate seeds that end up occupying the same space as already placed seeds are removed. At this point, the algorithm allows the newly placed seeds a chance to grow. This cycle of growth and seeding repeats until no new seeds are successfully placed, at which point the algorithm terminates.

The application of gravity to seeds might not be the most obvious approach to seeding in 3D, but it serves an important purpose in orienting region growth to better accommodate agent movement through the world. A good example of this occurs on staircases. First, consider the case where seeds do not drop due to gravity. As shown in Figure 3.3.4, a region that grows up adjacent to the bottom of the stairs will generate a single seed midway up the stairs, which will contain air space above the first several stairs and only land on one of the middle stairs.
Then later seeding passes will result in a confusing mess of regions, none of which accurately models stair usage. Some of these regions would require the agent to crawl through, while other regions would require the agent to fly.

Figure 3.3.4 This figure serves to illustrate how gravity affects the seeding process for VASFV. In this figure, we see a staircase viewed from the side. Positive space regions are marked in white. Negative space regions in this illustration are marked with the light gray gradient. The upper set of four time steps shows what happens if the generated seeds are allowed to float freely and grow in midair. The lower six time steps show how the decomposition changes to better model the stair steps after the application of gravity to each generated seed.

Now consider the same staircase decomposed using a gravity-based seeding method. With gravity-assisted seeding, when a seed is generated from the initial region at the bottom of the stairs, it falls into the floor space of the first stair. The seed then grows outward to fully decompose the floor of that single stair and grows up into the airspace over the stair. In this manner, the seeds gradually climb the stairs, and each stair becomes its own logical region, which makes sense given how stairs are typically traversed. This gravity-based seeding is also applicable to other methods of world space traversal, such as flight, because biasing the decomposition with respect to the features present on the ground still makes sense.

Post-Processing

Decompositions generated using either of the algorithms presented here can be improved by the application of several simple post-processing steps. Most importantly, if any two regions are positioned such that they could be combined into a single region while maintaining convexity, they should be combined.
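In 2D, the two-region merge test admits a compact formulation: two convex regions can be combined exactly when the convex hull of their combined vertices has the same area as the two regions together. The sketch below uses that observation; the helper names (`hull`, `area`, `try_merge`) are hypothetical, and the hull is Andrew's monotone chain.

```python
def hull(points):
    """Andrew's monotone chain convex hull, counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]
    return half(pts) + half(reversed(pts))

def area(poly):
    """Shoelace area of a simple polygon."""
    n = len(poly)
    s = sum(poly[i][0]*poly[(i+1) % n][1] - poly[(i+1) % n][0]*poly[i][1]
            for i in range(n))
    return abs(s) / 2

def try_merge(a, b, eps=1e-9):
    """Return the merged convex polygon, or None if the union is not convex."""
    h = hull(a + b)
    if abs(area(h) - (area(a) + area(b))) < eps:
        return h
    return None
```

Two adjacent unit squares merge into one rectangle, while a square next to a taller neighbor returns None, since their hull would cover space neither region owns.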
The same technique can be applied to compress three input regions into two convex regions, and it can be extended to higher numbers of regions, though the implementation becomes harder and the returns diminish. All of the negative space regions should also be examined for zero-length edges or collinear vertices, which, if detected, should be removed.

A full navigation mesh can be constructed from the decomposition by linking adjacent negative space regions. Beyond which negative space regions connect to each other, each region can also store least-cost paths to every other region. Other navigation mesh quality metrics can be applied to the generated mesh to determine whether it is good enough for its intended purpose or whether some of the input parameters for initial seeding should be adjusted and a better navigation mesh generated.

Conclusion

In this gem, we have presented two new growth-based methods of generating navigation meshes. Both are derived from the classic Space Filling Volumes algorithm. These two methods each have areas where they specialize, and both generate excellent decompositions, as shown in Figure 3.3.5.

The PASFV technique works very well for levels that can be projected to a single 2D plane. It can also deal with levels that have more than one 2D representation, though these will require a touch more post-processing to combine all of the navigation meshes into a single mesh. The navigation meshes produced by this algorithm are very clean, with few sharp points or narrow regions that taper off to a point. Such regions are problematic because they cannot contain all of an agent moving through the world, and an agent can end up in many different regions at the same time.
This is not a problem for PASFV because at most four negative space regions can come together at a point, since all negative-space-to-negative-space collisions are axis-aligned, and the angles involved in these collisions tend to be around 90 degrees. PASFV is also highly comparable to other navigation mesh generation algorithms in terms of speed, as it can be shown to run in O(n^(1/x)) with an upper bound of O(n), where n is the number of square units of space to decompose and x is a function of how many seeds are placed in the world [Hale09b].

For more complex world environments that do not approximate to 2D very well, or that would require multiple 2D approximations, Volumetric Adaptive Space Filling Volumes is a solution for navigation mesh generation, even though it is slightly harder to implement and takes longer to run due to the more complex collision calculations. VASFV will also provide good decompositions that have few narrow corners or poorly accessible regions. In addition, because of its gravity-based seeding, VASFV will better model how agents move, resulting in decompositions superior to those of traditional triangulation-based methods [Hale09a]. The run time for VASFV is algorithmically the same as PASFV; however, instead of n being the number of square units in the world, it is the number of cubic units.

By presenting both 2D and 3D algorithms for generating spatial decompositions, we are offering multiple options to move beyond traditional triangulation-based methods of producing a navigation mesh and into advanced growth-based techniques. The most current implementations of both of these algorithms can be found at, along with some other interesting tools and techniques for navigation mesh generation and evaluation.

Figure 3.3.5 The first image shows the results of PASFV. Black areas indicate obstructions, while the variously colored regions show negative space.
The second image shows a navigation mesh generated by VASFV on a non-axis-aligned staircase, viewed from the side.

References

[Hale08] Hale, D. H., G. M. Youngblood, and P. Dixit. "Automatically-Generated Convex Region Decomposition for Real-time Spatial Agent Navigation in Virtual Worlds." Artificial Intelligence and Interactive Digital Entertainment (AIIDE). Stanford University, Stanford, CA. 2008.
[Hale09a] Hale, D. H., and G. M. Youngblood. "Full 3D Spatial Decomposition for the Generation of Navigation Meshes." Artificial Intelligence and Interactive Digital Entertainment (AIIDE). Stanford University, Stanford, CA. 2009.
[Hale09b] Hale, D. H., and G. M. Youngblood. "Dynamic Updating of Navigation Meshes in Response to Changes in a Game World." Florida Artificial Intelligence Research Society (FLAIRS). The Shores Resort and Spa, Daytona Beach, FL. 2009.
[McAnlis08] McAnlis, C., and J. Stewart. "Intrinsic Detail in Navigation Mesh Generation." AI Game Programming Wisdom 4. Boston: Charles River Media, 2008. 95–112.
[Tozour02] Tozour, P. "Building a Near-Optimal Navigation Mesh." AI Game Programming Wisdom. Boston: Charles River Media, 2002. 171–185.
[Tozour04] Tozour, P. "Search Space Representations." AI Game Programming Wisdom 2. Boston: Charles River Media, 2004. 85–102.
[Schneider03] Schneider, P., and D. Eberly. "Point in Polygon/Polyhedron." Geometric Tools for Computer Graphics. Morgan Kaufmann Publishers, 2003. 695–713.

3.4 A Practical Spatial Architecture for Animal and Agent Navigation

Michael Ramsey—Blue Fang Games, LLC.

"Not so many years ago, the word 'space' had a strictly geometrical meaning: the idea evoked was simply that of an empty area." —Henri Lefebvre

Game literature is inundated with various techniques to facilitate navigation in an environment.
However, many of them fail to take into account the primary unifying medium that animals and agents use for locomotion in the real world. And that unifying medium is space [Lefebvre97]. The architectonics¹ of space relative to an animal's or agent's motion in a game environment is the motivation for this gem. Traditional game development focuses on modeling what is physically in the environment, so it may seem counterintuitive to model what is not there, but one of the primary reasons for modeling the empty space of an environment is that it is this spatial vacuum that frames our interactions (be they locomotion or a simple idle animation) within that environment. Space is the associative system between objects in our environments. This article will discuss this spatial paradigm and the techniques that we used during the development of a multi-platform game entitled World of Zoo (WOZ).

WOZ was a challenging project not only by any standard definition of game development, but also because we desired our animals' motion to be credible. An important aspect of any animal's believability is that they are not only aware of their surroundings, but that they also move through a dynamic environment (Color Plates 1 and 2 contain examples of WOZ's environment) in a spatially appropriate and consistent manner. This maxim had to hold true whether the animal was locomoting over land, over water, or even through air! To help facilitate the representation of our spatial environments, we used several old tools in new ways, and in conjunction with a few inventions of our own, we believe we accomplished our goals.

¹ A unifying structure is commonly referred to as an architectonic, as it is used to describe and associate elements that are separated into a perceived whole.
Fundamental Components of the Spatial Representation System

The primary element for constructing a spatial representation in WOZ was the sphere, termed a navsphere. Figure 3.4.1 shows a subset of a navsphere layout from an exhibit. The navsphere is fundamentally important because not only is it used to generate the navigable representation of the world (see below), but more importantly, it defines the interactable spatial dimensions of the world. What this means is that we define in 3D space where an animal can go, not just where it cannot. To keep animals from going places, we rely not only upon the tried-and-true techniques of collision detection [Bergen04], but also on collision determination. Collision determination is a technique of knowing ahead of time that a collision may occur (similar to how the majority of physics packages handle contact points). This determination of a potential collision is implicit when using navspheres, because we are able to determine that an animal is nearing the edge of the navigable representation. These edges are termed spatial boundaries, and in this specific example, they define the implicit relationship between the navspheres and the outlying geometry [Gibson86]. Because of this knowledge, we can augment the animal's behavior to slow down, start to turn away, or skid to a stop.

Figure 3.4.1 Two navspheres in a level. Connectivity information between neighboring navspheres is accomplished by having a slight overlap.

It should be noted that the navsphere is an approximation of space; it is not an exact spatial inverse of the placed geometry. A spatial system modeled from constructive solid geometry (CSG) principles would definitely be ideal, but the dynamic nature of WOZ's environments made this unfeasible. However, we did invest some initial time into investigating and utilizing various CSG techniques.
The primary aspect of CSG that we discovered to be applicable for a spatially accurate representation of an environment was the Boolean intersection operator, which merges the overlap of two objects into a convex component. The cumulative Boolean intersections would form the space through which our animals would move. The complexity and cost would come from the determination of an agent's occupancy within that environment's CSG representation, as the components are not primitives but complex convex objects. This approach would definitely be more accurate than spheres, but at the cost of not being usable on current-generation consoles or the typical desktop PC (mainly due to the arbitrary manner in which WOZ's environments were modeled).

As an animal moves through an environment, there need to be mechanisms in place to help control the animal's interaction with navspheres—we do this by assigning properties to the navsphere that define how certain interactions will occur. Some of these properties include the type of animal locomotion allowed in that navshape (whether it be land, water, or air) and spatial parameters for certain animal sizes.

Having a system that is spatially centric requires a single enabling component that allows it to be easily accessed by other game systems. By allowing navspheres to overlap, we are capable of generating a navigable representation—a navrep—of the environment. A similar system is the circle-based waypoint graph [Tozour03]. For a visual example of this process, compare Figure 3.4.2 and Figure 3.4.3. Figure 3.4.2 shows the overlapping navspheres for one of WOZ's nursery levels. Figure 3.4.3 shows the type of connectivity information generated from that navsphere layout. You can think of the navrep generated as the walkable surface for the world; however, note that the navrep is in 3D space and can wrap around other geometry as well as other navreps.

Figure 3.4.2 A navsphere layout for the bear nursery.
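The overlap test that turns a navsphere layout into a navrep reduces to comparing center distance against the sum of radii. A minimal sketch, assuming spheres as (center, radius) pairs and an adjacency-set graph; the names (`build_navrep`) are illustrative, not from the WOZ codebase.

```python
import math

def build_navrep(spheres):
    """spheres: list of ((x, y, z), radius). Returns an adjacency list
    linking every pair of navspheres that overlap."""
    links = {i: set() for i in range(len(spheres))}
    for i, (ci, ri) in enumerate(spheres):
        for j, (cj, rj) in enumerate(spheres):
            if j <= i:
                continue
            if math.dist(ci, cj) < ri + rj:    # slight overlap => connected
                links[i].add(j)                # bidirectional link
                links[j].add(i)
    return links
```

Three collinear unit spheres at x = 0, 1.5, and 5 yield a graph where only the first two are connected, since 1.5 < 2 but 3.5 is not.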
Figure 3.4.3 This figure shows how the connectivity information is generated from the overlapping navspheres in Figure 3.4.2.

The primary reason to construct a connectivity graph from the spatial representation is that we need to execute potentially expensive operations on the game world, such as pathing queries, reachability tests, and visibility queries. The basic algorithm for generating the navrep is to iterate over the navspheres, searching for overlap with any other navspheres. If we find any overlap, we establish a bidirectional link between the navspheres. Later in the development of WOZ, we also found that we could use the same mechanism for one-way links by embedding directed connectivity information in the navsphere itself; this manifested itself in game objects such as one-way teleport doors.

Navigation System Architecture

As we move on to discussing the spatial aspects of the WOZ navigation system, it will help to understand the basic structure and components of the system as a whole (see Figure 3.4.4). The primary interface between the navigation system and the other components of the WOZ game is the navigation manager. The navigation manager facilitates access to the planner. The planner contains the navigable representation of the environment—both the spatial vacuum and the generated connectivity graph. The pathfinder uses the A* algorithm, which provides support for both spatial and routing biases [Hart68, Stout00]. Also provided is a general utilities object that contains general navigation code.

When an animal (noted as an entity in Figure 3.4.4) needs to route through an environment, it will issue a call through the animal planner into the navigation manager. The navigation manager will then access its own world planner, and using the navrep it generates a coarse route through the environment. This coarse route is then returned to the animal's planner for use.
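The coarse route query can be sketched as standard A* over the navsphere connectivity graph, with straight-line distance between sphere centers as the heuristic [Hart68]. This is a generic A* under stated assumptions, not the WOZ pathfinder itself (which also supports spatial and routing biases); all names here are illustrative.

```python
import heapq
import math

def a_star(centers, links, start, goal):
    """centers: node -> (x, y, z); links: node -> set of neighbor nodes.
    Returns the list of nodes on the cheapest route, or None."""
    h = lambda n: math.dist(centers[n], centers[goal])  # admissible heuristic
    open_set = [(h(start), start)]
    g = {start: 0.0}
    came = {}
    while open_set:
        _, n = heapq.heappop(open_set)
        if n == goal:                     # reconstruct the route
            path = [n]
            while n in came:
                n = came[n]
                path.append(n)
            return path[::-1]
        for m in links[n]:
            cand = g[n] + math.dist(centers[n], centers[m])
            if cand < g.get(m, math.inf):
                g[m] = cand
                came[m] = n
                heapq.heappush(open_set, (cand + h(m), m))
    return None
```

A routing bias could be layered on by scaling the per-edge cost (for example, penalizing water edges for a land animal) before it is added to `g[n]`.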
As you'll notice in Figure 3.4.4, WOZ has two planners: the navigation planner and the animal planner. The animal planner handles any immediate spatial or interanimal tasks, while the navigation planner handles the more rudimentary routing operations (for example, path biasing) as well as the interactions with the navsphere reservation system.

Figure 3.4.4 The navigation system. Its components are the navigation manager; the navigation planner, holding the navigable representation (navspheres and connectivity graph), the pathfinder, and the navigation utilities; and, per entity, the animal planner, animal movement controller, and behavioral component.

Navrep Continuity and Stitching

Navreps are not necessarily continuous. What this means is that the level designers can author disparate navreps based upon differing navrep types (for example, land or water), as well as navreps that represent differing levels of elevation in a zoo exhibit, such as the ledges on a cliff face. Linking multiple, non-overlapping navreps in the navigation system requires the creation of navfeelers. Figure 3.4.5 (left) contains an example of two navreps that were authored as disconnected: there is a navrep on the ledge and also a navrep on the base of the exhibit. The navfeeler is the fishing pole–like extrusion from the top navsphere to one of the bottom navspheres. Navfeelers allow the level authors to link these navreps together. The navfeeler attempts to find a navsphere below itself; if it finds one, we establish a bidirectional link between the two navspheres. The analogy that we used during development to help explain this was to think of a fisherman at the end of a pier, with his fishing pole sticking out over the water at roughly 45 degrees. His line would dangle into the water, which effectively links land and water for the navigation system.
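The downward probe of a navfeeler can be sketched as follows: from the feeler's tip, look straight down and link to the nearest navsphere whose horizontal cross-section contains the probe point. This is a simplified model with hypothetical names, not the authoring-tool implementation.

```python
def link_navfeeler(feeler_tip, spheres, links, src):
    """Probe downward from the tip of the navfeeler attached to sphere
    `src`; bidirectionally link `src` to the nearest sphere below."""
    x, y, z = feeler_tip
    best, best_dz = None, None
    for j, ((cx, cy, cz), r) in enumerate(spheres):
        if cz >= z:
            continue                     # only consider spheres below
        if (cx - x) ** 2 + (cy - y) ** 2 <= r ** 2:
            dz = z - cz                  # vertical drop to this sphere
            if best is None or dz < best_dz:
                best, best_dz = j, dz
    if best is not None:
        links[src].add(best)             # bidirectional link
        links[best].add(src)
    return best
```

The same probe would serve land-to-water bridges unchanged, since it only cares about what lies beneath the dangling line.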
Once these navreps are linked together (refer to Figure 3.4.5, right), the navigation system can then pathfind over the multiple navreps. Although this example is shown for a land-to-land navrep, the same mechanism is used for land-to-water bridges, which allows the penguins to dive into and jump out of water.

Figure 3.4.5 On the left is an example of a navfeeler authored in a level to connect two disparate navreps, with the final result, on the right, of a navigable navrep.

Handling Locomotion and Turning

Locomotion in WOZ is executed by root accumulation of multiple animations that form a final pose. By root accumulation, we mean that the animators author with full displacement; any movement of the root of the animal's skeleton is contained in the animation. This allows the animations to retain all the inherent tweaks, such as deceleration, that might otherwise be lost in the typical engineer-centric approach, where the animators are required to author animations on the spot. While root motion is very necessary for exhibiting animator-envisaged motion, it does not mesh well with traditional navigation paradigms.

Turning was handled independently of locomotion, which was advantageous because it allowed the animators to avoid generating a host of different turn animations. To generate a turn angle, an animal selects a target point, such as the next point along a route or a game object. This target point is then turned into a turn angle (by taking the dot product between the direction to the target point and the heading of the animal and then solving for theta), which is then used to twist the spine of the animal in the desired direction. If we had an animal with a four-bone spine and a turn angle of 40 degrees, we would simply apply 10 degrees of twist to each of the bones.
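The turn-angle computation just described can be sketched in 2D as follows. Note that the dot product alone yields only the magnitude of theta; using atan2 with the 2D cross product also recovers the sign, so the animal knows which way to twist. Function names are illustrative.

```python
import math

def turn_angle(heading, position, target):
    """Signed angle in degrees from the heading to (target - position)."""
    to_target = (target[0] - position[0], target[1] - position[1])
    dot = heading[0]*to_target[0] + heading[1]*to_target[1]
    det = heading[0]*to_target[1] - heading[1]*to_target[0]
    return math.degrees(math.atan2(det, dot))

def spine_twist(angle, num_bones):
    """Distribute the turn evenly across the spine bones,
    e.g. 40 degrees over a four-bone spine is 10 degrees per bone."""
    return [angle / num_bones] * num_bones
```

An animal heading along +X with a target directly to its left gets a +90 degree turn, split evenly along the spine.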
Locomotion and turning, then, are the two central components of progression through our navigation system. While the navigation system plans through the world using the connectivity graph (which was generated from the navspheres), we turn to the spatial representation of the world in order to validate locomotion and turning. Each animal has an occupancy shape associated with it (see Figure 3.4.6). This occupancy shape is used to control progression through the suggested route. The occupancy shape is not static; it can change in size, as well as placement, according to the current animation. A fast-galloping zebra will have its occupancy shape projected out in front, whereas a slow-moving zebra will have the occupancy shape centered more on itself.

One key differentiation of the navigation system implemented for WOZ versus other games is that during normal locomotion an animation could, and generally would, deviate from the proposed path through the environment. A combination of root motion, behavioral prodding, and turn speeds that vary based upon the state in the behavior graph makes following an exact route impossible without adversely affecting the quality of an animal's movement. This is not dissimilar to how real animals or humans move through the world. Humans don't plan exact motions; we don't plan our exact muscle contractions—we move in accordance with our understanding of the space made available to us. It's this space that allows us to identify boundaries and make use of the relationships between the space and objects we are afforded [Gibson86]. So it made sense to ensure that WOZ's animals respect the spatial relationships of the environment accordingly.

To help influence the turning of an animal, we implemented a system that uses a series of spheres projected around the animal in order to determine the suggested turn angles.
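One way to sketch this projected-sphere test: a probe sphere counts as inside the navrep when it is fully contained in at least one navsphere, and a probe that falls outside biases the animal to turn the other way. This is a simplified model with hypothetical names and sign conventions, not the shipped WOZ logic.

```python
import math

def probe_inside(probe_center, probe_radius, navspheres):
    """True if the probe sphere is fully contained in some navsphere."""
    return any(math.dist(probe_center, c) + probe_radius <= r
               for c, r in navspheres)

def turn_bias(left_probe, right_probe, probe_radius, navspheres):
    """+1 = steer left (right probe exited the navrep),
       -1 = steer right (left probe exited), 0 = no correction."""
    left_ok = probe_inside(left_probe, probe_radius, navspheres)
    right_ok = probe_inside(right_probe, probe_radius, navspheres)
    return (1 if not right_ok else 0) - (1 if not left_ok else 0)
```

The returned bias could then drive the animation blend weights, nudging the animal away from the navrep boundary without any raycasts.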
We accomplished this by determining whether the projected spheres were inside the navrep. If one of the projected spheres was completely outside the navrep, we would execute a turn in the opposite direction. For example, if an animal's motion wanted to move it into a wall, we could execute a tight turn in the opposite direction simply by altering the blend weights of the current animation. The blend weights would effectively bias the animal away from obstructions. It is important to remember that we modeled not only the world's geometry, but also its spatial representation, so this type of inside-or-outside test was possible. By projecting these spheres around the animal, we could make turns that would take the animal away from objects according to their spatial proximity to the navrep boundaries. Another positive side effect of this approach is that we avoided doing costly and potentially numerous raycasts for this operation.

Figure 3.4.6 Each animal has an occupancy shape. This variable occupancy shape is used to denote the rough spatial representation relative to the navigation system.

Conclusion

While there are many navigable representations that can be used in a game, very few of them concern themselves with the spatial vacuum that exists in between the static geometry. This gem has shown you how to represent this space using several common tools, in conjunction with some new approaches to modeling locomotion. While modeling locomotion based on a desired sense of progression through the environment is different from most locomotion models, it also opens the door to modeling the actual motion of an animal or agent as it occurs in the real world. Animals and agents don't follow exact routes; they alter their movement according to an understanding of not only the static objects in the world, but also the available space in which to exhibit their behaviors.
Hopefully, the presentation of a few old ideas intermixed with a few new ones will help you rethink an animal’s or agent’s interactions inside an environment. By considering the space in between the physical aspects of an environment, we now have the capability to make decisions about our environment that are not purely reactive in nature. We are afforded the mechanisms to forecast environmental interactions, which is perhaps one of the first steps toward a credible motion management system.

Acknowledgements

I wish to thank the following team members for their contributions to the WOZ AI system: Bruce Blumberg, Steve Gargolinksi, Ralph Hebb, and Natalia Murray.

References

[Bergen04] Bergen, Gino van den. Collision Detection in Interactive 3D Environments. Morgan Kaufmann, 2004.
[Ericson05] Ericson, Christer. Real-Time Collision Detection. Morgan Kaufmann, 2005.
[Gibson86] Gibson, James J. The Ecological Approach to Visual Perception. LEA, 1986.
[Hart68] Hart, P. E., N. J. Nilsson, and B. Raphael. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” IEEE Transactions on Systems Science and Cybernetics 4.2 (1968): 100–107.
[Lefebvre97] Lefebvre, Henri. The Production of Space. Blackwell Publishing, 1997.
[Reichenbach58] Reichenbach, Hans. The Philosophy of Space and Time. Dover, 1958.
[Stout00] Stout, Bryan. “The Basics of A* for Path Planning.” Game Programming Gems. Boston: Charles River Media, 2000. 254–263.
[Tozour03] Tozour, P. “Search Space Representations.” AI Game Programming Wisdom 2. Boston: Charles River Media, 2003. 85–102.
[Week01] Week, Jeffrey. The Shaping of Space. Marcel Dekker, Inc., 2001.

3.5 Applying Control Theory to Game AI and Physics

Brian Pickrell

Control theory is the engineering study of dynamic systems, such as airplanes and other machines. The name is a bit misleading, since designing controls (in other words, airplane autopilots or missile guidance systems) is only one application of the theory.
Control theory is actually the analysis of equations to extract some fundamental information about how entire classes of systems act in all possible circumstances. Control theory is something of a sister science to simulation. A simulation looks at a specific situation and predicts what the system will do in great detail, but it doesn’t explain how the system behaves in general. Control theory, on the other hand, doesn’t predict anything at all, but it gives general information in the form of quantifiable measurements that are true for any situation or inputs. In other words, it tells you what the system can and cannot do. These measures are mathematical and quite abstract, but they are important, and for the most part, they are things that simulation engines just do not provide. Some of the ideas described in this gem may seem like shortcuts to avoid doing proper simulation, but in fact they have just as much scientific validity. It is better to think of controls analysis as complementary to simulation, and as a way to do some things that your physics engine was not designed for. One task for which control theory is better suited is designing the steering and motion of physics objects in games. To give a specific example, how would you depict the motion of an in-game car making a sudden turn? We all know that a simple, abrupt change of direction is not realistic and doesn’t look believable; the vehicle should swerve and sway a little bit while turning. If you just program a motion curve by guessing, the results will not be much more believable. If you use a high-fidelity physics engine to simulate the turn, you will likely have to work out the forces and other parameters required to make it sway convincingly by trial and error, and the results of this tweaking cannot be reused for other cars in other turns. Our first formula describes the family of functions that you can use for this situation.
An object such as a moving vehicle must obey these functions in its motions. If your game object doesn’t move like this, it’s not physically realistic. Conversely, you can create a plausible motion curve at very little cost in physics analysis by following this formula and using some common-sense rules of thumb. Here is that result: Any unforced dynamic system moving under a set of linear differential equations has an output motion of the form:

$y(t) = \sum_{n=1}^{N} A_n e^{-\zeta_n t} \cos(\omega_n t)$   (1)

This is a set of harmonic oscillators, each of the form:

$y(t) = A e^{-\zeta t} \cos(\omega t)$   (2)

ω = natural frequency
ζ = damping coefficient

This is a sinusoidal wave where A is the amplitude and ω (omega) is the frequency, and the oscillations die out (or grow) exponentially according to a damping coefficient ζ (zeta). A single harmonic oscillator looks something like Figure 3.5.1. Several harmonic modes added together, each with its own amplitude, frequency, and damping coefficient, as in Equation (1), look like Figure 3.5.2. We will spend most of the rest of this gem defining the concepts behind Equation (1) and showing how they were derived.

Figure 3.5.1 Simple damped harmonic oscillator.

Dynamic Systems

The concept of a dynamic system in control theory is the same as that used in simulation; that is, a system is any set of parts that act on each other in some quantifiable way and change over time. The underlying mathematics applies equally to all different sorts of systems, using any units of measurement that are appropriate and any scientific laws that produce linear equations. The “parts” themselves may be conceptual rather than physical, as when measuring the concentrations of different reagents in a chemical reaction. The “output” can be any measurable quantity at some point in the system, measured in whatever units are appropriate. In this gem, we will mostly refer to objects moving under the physical laws of mass, force, and so on, but with the understanding that other dynamic systems work the same way.
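Equation (1) is cheap to evaluate directly, which is ultimately how the demo at the end of this gem animates its cars. A minimal sketch (the Mode struct and function names are my own):

```cpp
#include <cmath>
#include <vector>

// One term of Equation (1): A * exp(-zeta * t) * cos(omega * t).
struct Mode { float A; float zeta; float omega; };

// Sum all harmonic modes at time t -- the unforced output of Equation (1).
float outputAt(const std::vector<Mode>& modes, float t) {
    float y = 0.0f;
    for (const Mode& m : modes)
        y += m.A * std::exp(-m.zeta * t) * std::cos(m.omega * t);
    return y;
}
```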
We did not say what the units of Equation (1) are or how you should implement the result. This is up to the game programmer to decide. Your output y(t) can be distance measurements (xyz coordinates) or velocities or angular units, such as steering heading, as long as your system is linear in the units you choose. This explains how an airplane flying in a circle (which doesn’t look like Equation (1) at all) is consistent with linear control theory: because in angular units, its heading changes at a constant rate, which does fit the template of Equation (1). This should be good news to programmers of cockpit view–style games. Everything we tell you here can be implemented directly in angular units, without requiring messy polar coordinate conversions.

Figure 3.5.2 Oscillator with two harmonic modes.

Linear Systems

We have already stated the requirement that a system be linear. What does this mean? A linear dynamic system is one whose defining equations of motion follow the form:

$A_n \frac{d^n y}{dt^n} + A_{n-1} \frac{d^{n-1} y}{dt^{n-1}} + \cdots + A_1 \frac{dy}{dt} + A_0 y = 0$   (3)

where all of the coefficients $A_n$ are constants. The reason that we could say with such certainty that all linear systems follow the form of Equation (1) is that the result is a necessary mathematical consequence of Equation (3). If you don’t make that assumption, then the results don’t look like those sinusoidal functions. Engineers talk all the time about non-linearities in their systems; this usually means that the coefficients A in the equations of motion are not constant. They are functions of some other value, or they change over time. (It can also mean that something else is being done with the derivatives; a squared derivative term, for example, is not linear.)
When engineers try to apply linear control theory to systems with non-linearities (and they always do; a good part of controls engineering consists of finding ways to approximate non-linear reality with linear equations), the results often look similar to Equation (1), but the ω, ζ, and A values of the different modes keep changing, or the poles and frequency responses in the charts we’ll soon see keep moving around. All that said, it is safe in an in-game world to define all physical responses as being always linear. Then we can go ahead and use our linear results without worry.

Feedback, Damping, and Stability

Systems oscillate the way they do because of internal feedback among components. Sometimes unexpected feedback effects can cause a system to oscillate more rather than less; in fact, the 19th-century origins of control theory lie in explanations of why a steam engine’s mechanical governor couldn’t keep it running at a steady speed. One of the primary questions that controls engineers analyze is whether the system is stable. If any of the damping coefficients ζ in Equation (1) is negative, then the entire system is unstable (in other words, the equation does not converge as $t \to \infty$). Notice that in such a case, the exponent of the exponential part of the equation is positive, so the sinusoidal oscillations expand correspondingly. Fortunately, instability is never a surprise in the game world, since we have the luxury of declaring positive ζ’s for all of our systems. However, you will see this effect in physics simulations that drift out of limits and in game object movements that go wild because of player overcontrolling. Most programmers know a little bit about feedback and use it occasionally but do not have a way to quantify the system stability or instability that results.
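Once a system is described as a list of modes, that stability test is one line of code. A minimal sketch (the Mode struct is my own, mirroring Equation (1)):

```cpp
#include <vector>

// One harmonic mode of Equation (1); only the damping matters here.
struct Mode { float zeta; float omega; };

// A system of modes diverges as t grows if any damping coefficient is
// negative, because that mode's exponential envelope grows without bound.
bool isStable(const std::vector<Mode>& modes) {
    for (const Mode& m : modes)
        if (m.zeta < 0.0f) return false;
    return true;
}
```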
Here are some different types of damping:

ζ > 2ω — Overdamped (stable, no oscillations)
ζ = 2ω — Critical damping (stable)
0 < ζ < 2ω — Underdamped (stable)
ζ = 0 — Undamped (oscillates)
ζ < 0 — Unstable

Second-Order Linear Differential Equations

Let’s look at how Equation (2) is derived for a couple of simple harmonic oscillators. Equation (2) is the general solution of the second-order linear differential equation (LDE); that is, one where the differential terms go as high as the second derivative but no higher. Each of these examples, then, can form a single component of a more complicated system that has multiple oscillatory modes. Both of these examples are basic textbook cases. You can see from them how the same control theory applies in so many different scientific fields; it is because many natural laws (in these cases, Coulomb’s law and Newton’s second law) express themselves as similar first- and second-order differential equations.

Example 1: Mass and Spring

Figure 3.5.3 shows a heavy object M bouncing up and down on a spring K, obeying Newton’s laws of motion. The “output” value being measured is y, the position of the object measured in units of distance. We’ll pretend that all of the friction in the system occurs at dashpot B. (A dashpot is the working part of a shock absorber.) The equations show that the weight moves under the forces of the spring and friction. The spring force changes as the object moves, a built-in feedback. The friction force depends on velocity (which is the derivative of the position). Newton’s second law converts force to acceleration, which is the second derivative of position. Outside forces pushing on the system are represented by f(t). (We’ll get to that later; for now, assume it is zero.)

Figure 3.5.3 Mass and spring.

Example 2: RCL Circuit

Figure 3.5.4 shows a very simple electronic circuit with a resistor R, a capacitor C, an inductor L, and a voltage input E.
The inductor tends to keep current i flowing once started, while the capacitor builds up a charge that fights against the current. The interaction of these two causes the current (and voltage) in the circuit to oscillate back and forth. Either voltage or current can be read as the output value; in this case, we’re using current.

Figure 3.5.4 RCL circuit.

Some Math Formulae

We are almost ready to give the solution to the general second-order LDE, but let’s review two fundamental mathematical formulae we will use.

Euler’s Formula

$e^{i\theta} = \cos\theta + i\sin\theta$   (4)

This is a fundamental identity from complex mathematics that relates exponentials to trigonometric functions. It says that e to any power that is a pure imaginary number lies on the unit circle in the complex plane, and that for an exponential with a complex exponent, the imaginary part of the result oscillates in a sine-like way while the real part either grows or diminishes exponentially (depending on whether the real part is positive or negative), as shown in Figure 3.5.5.

Figure 3.5.5 Roots and exponentials in the complex plane.

Fundamental Theorem of Algebra

A real polynomial of degree n:

$a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0 = 0$   (5)

has n roots (including complex and multiple roots). Complex roots always come in conjugate pairs a+bi, a−bi. Between them, these two formulae mean that, in the complex number domain, sine and exponential functions are the same thing!

Harmonic Oscillator—Derivation

Now for the solution to the equation resulting from Examples 1 and 2: Find all equations y(t) that obey the differential relation

$A\ddot{y} + B\dot{y} + Cy = 0$   (6)

The derivation of the result is interesting because it shows how imaginary numbers keep coming up in control theory. Solving a differential equation, solving an exponential equation, and finding roots of a polynomial are tantamount to the same thing. Also, we’ll see how a complex exponential can be separated into two parts: an exponential part (ζ) and a sinusoidal component (ω).
Control engineers tend to talk lightly of complex “roots” as damping ratios and frequencies, an abstraction that can be weird and confusing for outsiders. This is where that equivalency comes from. The solution starts by assuming that all solutions are variations on the basic form

$y(t) = e^{rt}$   (7)

where r is a complex number. The general solution, then, is:

$y(t) = e^{-\zeta t}\,(c_1 \cos \omega t + c_2 \sin \omega t)$   (8)

General solution of second-order LDE (unforced)¹

where ζ and ω can be derived directly from the coefficients A, B, and C once you know the solution. c1 and c2 are arbitrary constants; any values are valid, and the values for a particular case depend on the initial conditions. You can find the system’s frequency and damping ratio directly from the system constants (such as inductance and capacitance) without going back through the differential equation. Here is another formulation that is handy to have:

$\zeta = \frac{B}{2A}, \qquad \omega = \frac{\sqrt{4AC - B^2}}{2A}$   (9)

Block Diagrams

What about a system that has more parts than our examples? When an analysis extends to multiple equations of motion, the interrelationships quickly become much harder to sort out. A block diagram analysis is one way to manage this. Making a suitable block diagram to represent a complex system is a matter of engineering judgment, much like designing a system model for a simulation. We do not want to explain all the ins and outs of block diagramming, but a block diagram can help explain how control inputs apply to the equations we’ve done so far. We said that Equations (1) and (2) applied to unforced systems—that is, ones with no inputs. In the case of modeling a car’s steering, that would mean that there was no one at the wheel. What use is a model like that? One answer is that we can include the steering inside the block being modeled. Equations (1) and (2) are what are called transfer functions.
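To make this concrete, here is a small helper (my own sketch) that recovers a damping coefficient and frequency from the coefficients of the unforced second-order LDE $Ay'' + By' + Cy = 0$, using the fact that the quadratic $Ar^2 + Br + C = 0$ has roots $r = -\zeta \pm i\omega$ in the oscillating case:

```cpp
#include <cmath>
#include <utility>

// For A*y'' + B*y' + C*y = 0, the roots of A*r^2 + B*r + C = 0 are
// r = -zeta +/- i*omega when the system oscillates (4AC > B^2).
// Returns {zeta, omega}; omega is 0 for non-oscillating (overdamped) systems.
std::pair<double, double> modeFromCoefficients(double A, double B, double C) {
    double zeta = B / (2.0 * A);
    double disc = 4.0 * A * C - B * B;   // > 0 means a complex-root pair
    double omega = (disc > 0.0) ? std::sqrt(disc) / (2.0 * A) : 0.0;
    return {zeta, omega};
}
```

For the mass-and-spring of Example 1, A plays the role of the mass M, B the dashpot friction, and C the spring constant K.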
In a block diagram, you can represent an entire subsystem as a single box with an input and an output, and the transfer function is the conversion between input and output. Figure 3.5.6 shows a diagram that demonstrates “closing the loop”—that is, converting an open-loop (uncontrolled) equation into one that accounts for the effects of steering or other control inputs.

¹ It would be a fair question to ask what the meaning is of the imaginary part of the equation, since you can’t have imaginary distances or voltages. An (evasive) answer is that the imaginary part is required to be zero as a constraint condition in an analysis, which we’re skipping over. The reader may just ignore the imaginary part and read only the real part of the equation.

G(s) is our original transfer function. R(s) is an external input (steering), and G(s) converts the input to the final output value C(s). H(s) is a feedback control system that reads the output, transforms it in a certain way, and uses it to modify the command input. The following is important if you want to model autonomously steered (NPC) vehicles in a game: H(s), which represents the internal dynamics of the steering system, is modeled as a linear system of components exactly like the main system G(s). H(s) could be a collection of levers, springs, hydraulics, electronics, and so forth—it is just another linear system and is conceptually no different than G(s). Furthermore, the entire diagram that results, including controls, is also linear and can be redrawn as a single “black box” transfer function. Therefore, you can legitimately represent a vehicle and its driver as a single, complicated transfer function where the input is the desired course and the output is the actual motions of the vehicle. This does not work for a real person but can represent any linear controls, including a simple autopilot, a smart or dumb robot driver, or a non-player character (NPC) whose driving style is a linear function.
The caveat to this is that you would be simulating the steering behavior of a control system without actually doing the steering. Only in extremely simple cases is it possible to extract the implied control responses from a system transfer function. The demo on the CD shows an example of this.

Laplace Transform

The block diagram of Figure 3.5.6 replaces differential equations with Laplace transforms. For our examples, G(s) is the Laplacian L of Equation (6), which is $As^2 + Bs + C$. The Laplace transform is a clever mathematical trick that converts differential equations to more manageable polynomial functions.

$F(s) = \mathcal{L}\{f(t)\} = \int_0^{\infty} f(t)\,e^{-st}\,dt$   (10)

Definition of Laplace transform²

Figure 3.5.6 Closed-loop block diagram.

Laplace transforms are written as capital letters in the block diagrams, and the original function parameter—time t—becomes s, which is not a measurable value but an abstract complex number. Laplace transforms have several useful properties. First of all, the transforms of the most common functions are all polynomials. Also, they are linear and can be added and multiplied by constants. They support integration and differentiation; they convert step inputs and other discontinuous functions into continuous functions. And, connecting two systems in series (function composition in the block diagram sense) is the same as multiplying in the Laplacian domain—in other words, if system F feeds system G, the combined transfer function is:

$G(s)\,F(s)$   (11)

Yet another handy property of the Laplacian is that it is units-independent. Since all units of measurement transform to the same s, it is possible to mix transfer functions whose outputs are in different units in the same block diagram. Using the properties of the Laplace transform, Figure 3.5.6 leads to the following relationships:

$\frac{C(s)}{R(s)} = \frac{G(s)}{1 + G(s)H(s)}$   (12)

control ratio

The denominator of the last function is called the characteristic equation³ of the system.

$1 + G(s)H(s) = 0$   (13)

characteristic equation

From Characteristic Equation to Equations of Motion

The next step is simple but powerful.
The Laplace transform implicitly does the same conversion to exponentials that was done in the derivation of Equation (8). Therefore, the roots of the characteristic equation of the system correspond to the n harmonic modes in Equation (1), with −ζ as the real part and ω as the imaginary part.⁴ If you can factor a system’s characteristic equation into the form:

$(s + \zeta_1 - i\omega_1)(s + \zeta_1 + i\omega_1)\cdots(s + \zeta_N - i\omega_N)(s + \zeta_N + i\omega_N) = 0$   (14)

roots of characteristic equation

then any function matching Equation (1) with any constants An is a possible motion of this system.⁵ If you’re not allowed to make up an answer, characteristic equations are hard to solve! The transfer function from our examples was easy, but in general both G(s) and H(s) may be polynomial fractions (one polynomial divided by another) that require numerical methods to factor. But for game simulations, you may invent a characteristic equation in already-factored form! If you do so, you are implicitly stating that the physics of the system are unknown, the internal logic of the controls is unknown, the roots were found in an unknown way, but the overall response of the entire system including controls is just what the designer specified.

² By convention, a lowercase letter denotes an original function and a capital letter denotes a Laplace transform with the parameter s. It is important to notice which is which throughout this article.
³ This is the same as the characteristic equations found in linear algebra, if you represent the underlying system of differential equations as a matrix.
⁴ Note the − sign in front of ζ. A positive real part of a root means a negative damping coefficient.

Root-Locus Plots

Root-locus plots are a bit of a digression, but they give a measure of insight into how one would choose the invented roots mentioned earlier.
Aeronautical engineers found that in many cases, control systems included an electronic amplifier or the equivalent; the amplification, or gain, came out as a constant multiplier K in H(s) in Figure 3.5.6, and furthermore, the stability or instability of the entire system depended in hard-to-anticipate ways on what gain K was chosen. The original root-locus method⁶ used some extremely clever pencil-and-ruler methods to replace the computation-intensive factoring of Equation (13) into Equation (14); the plots are still useful even though we’re doing them by computer now.⁷ A root-locus plot is a plot of all possible roots of the characteristic equation for values of K from zero to infinity. Figure 3.5.7 shows some examples. In a root-locus plot, circles represent roots of the numerator of the characteristic equation, called zeroes. X’s represent roots of the denominator, called poles because the function is infinite at the poles. The actual roots begin at the poles when gain K = 0 and move toward the zeroes as K increases. The graph is drawn in complex s-space, where the real (horizontal axis) component is the damping coefficient −ζ and the imaginary (vertical axis) component is the frequency ω. Again, each root corresponds to one of the modes of Equation (1), so you can draw an output function directly if you know what the roots are. As mentioned, if any mode of a system has a negative damping ratio, the entire system is unstable. In a root-locus plot, this means that if any root lies to the right of the vertical axis, the system is unstable. This happens in Figure 3.5.7(c)—the system is stable at low gain, but if you turn up the gain past a certain point, it becomes unstable.

⁵ Phase offsets of the various modes, which we have neglected in this gem, are also allowable and in fact are very important in matching Equation (1) to real motion curves.
⁶ The root-locus method is also known as the Evans method. It was developed in 1948 by Walter R. Evans, a graduate student at UCLA.
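For a system simple enough to factor by hand, you can trace a root locus directly. Here is a sketch (my own, assuming unity feedback H(s) = 1 around G(s) = K/(s² + Bs + C), so that Equation (13) reduces to s² + Bs + (C + K) = 0):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Closed-loop poles of 1 + K/(s^2 + B*s + C) = 0, i.e. the roots of
// s^2 + B*s + (C + K) = 0, via the quadratic formula.  Sweeping K from
// 0 upward and plotting the results in the complex plane gives the locus.
std::vector<std::complex<double>> polesForGain(double B, double C, double K) {
    std::complex<double> disc =
        std::sqrt(std::complex<double>(B * B - 4.0 * (C + K), 0.0));
    return { (-B + disc) / 2.0, (-B - disc) / 2.0 };
}
```

With B = 2 and C = 0, the two poles meet on the real axis and then split into a conjugate pair whose real part stays at −1 for every gain, so this particular loop never goes unstable; adding extra poles or zeroes, as in Figure 3.5.7(c), is what lets a locus cross into the right half-plane.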
⁷ For instance, Matlab’s rlocus() function, or [El-Dawy03], a freeware root-locus plotting program.

In this example, notice that (b) and (c) are modifications of (a) made by adding a zero and a pole, respectively, to the characteristic equation. This represents adding components (such as resistors and capacitors) to the control system. The shape of the plot and the stability of the system changed dramatically each time. Controls engineers make it their business to know what to add to a circuit to shape the root locus the way they want. Fortunately for the game designer, we can usually just place roots where we like and pretend that a root-locus analysis has been done. A root close to the vertical axis represents underdamping and an oscillation that takes a long time to die out; roots far to the left represent great stability and rapid response.

Figure 3.5.7 Root loci of characteristic equations.

Frequency Analysis⁸

This gem has completely skipped over a branch of control theory that is actually of equal importance. This is frequency analysis. It is based on the idea that a system’s behavior can be broken down into its responses to sine-wave inputs at different frequencies. If the input to a linear system is a pure sinusoidal function—in other words, sin(ft), where f is any frequency (not necessarily ω)—the output will always be a sinusoidal wave at the same frequency f, but with a different amplitude A. It will also be delayed from the input wave by a time offset that is expressed as a phase angle θ. A and θ are different for every input frequency f, and a graph of frequency response A(f) and phase delay θ(f) is a frequency analysis. The best-known plot types are the Bode plot and the Nyquist plot.⁹ Any input function can be broken down into a sum of a (possibly infinite) set of sinusoidal functions and can usually be approximated by just a few. (This is the basis of Fourier theory.)
Therefore, a frequency analysis can be used to break down a system’s motions into a few harmonic modes based on A and θ rather than ζ and ω. The demo on the book’s CD has chosen not to follow this route, for simplicity.

Code Example: Control Law Racers

The demo program on the book’s CD shows one way to apply the ideas behind a root-locus analysis in a game. Rather than modeling a real system and then analyzing it, the demo goes the other way by asking what behavior a system ought to have and then displaying it without actually doing a model. The demo shows several slot cars moving down a track. The simulation uses a side-scrolling format, so there is just one degree of freedom: y position. When the player tells them to change lanes, the cars swerve abruptly and then maneuver into the new lane. Their steering is a bit weak compared to the weight of the cars, so they weave back and forth before finding their target. What is interesting is that every car is a little bit different and steers differently. The steering behavior of the cars is encapsulated in the mode structure, which contains the frequencies and damping rates of their motions.

```cpp
struct mode {
    float zeta;   // damping coefficient
    float omega;  // frequency
    float A;      // Initial amplitude of wave
};
```

⁸ See [D’Azzo60], chapters 8–10, and [Thaler60], §4.8, 7.5.
⁹ Developed by Hendrik Bode and Harry Nyquist, respectively, who were both engineers at Bell Laboratories in the 1930s.

Each mode supplies one of the terms in Equation (1). Each car can have any number of modes. There is no steering or simulation loop in this demo; the control law models both the physics of the car itself and the actions of its driver. Since a car with only one mode is no more complex than the weight and spring of Example 1, this may be a simple autopilot indeed. To design your car’s control law, add modes. Imagine a root-locus plot like Figure 3.5.7.
But instead of plotting a real function, just draw one or more X’s for the roots. Think about how well you want your car to handle and how quickly you want it to sway. A real car weaving back and forth on the road may move with a period of one to five seconds or so; that is, an omega of 0.2–1.0. As for damping, if you want your car to keep shimmying for a long time, then give it a zeta close to 0; if you want it to handle well and accurately, then give it a large zeta. A critical damping ratio of zeta = 2 * omega will produce the fastest settling. The in-code comments explain how to add modes to your car. The step after that, giving an actual steering command by setting the A for each mode, is handled arbitrarily by the demo. The next step of interest is the navigation step. This is where the values in the car’s modes are applied to Equation (1).

```cpp
vector<driver>::iterator drivIt;  // car
for (drivIt = driverList.begin(); drivIt != driverList.end(); drivIt++) {
    // Sum of all the oscillatory modes.
    // Measuring distance from the target lane implicitly commands the
    // car to go there when all oscillations have settled.
    drivIt->driverPos.y = targetLane + Y_BASE;

    modeList::iterator theModeIter;  // mode
    for (theModeIter = drivIt->modes.begin();
         theModeIter != drivIt->modes.end(); theModeIter++) {
        // Add this mode's contribution to position.
        float zeta  = theModeIter->zeta;
        float omega = theModeIter->omega;
        float A     = theModeIter->A;

        // Time conversion from dimensionless units (radians and
        // rad/sec) to full cycles per second.
        float t = 2.0 * PI * resetTime;

        drivIt->driverPos.y += A * exp(-zeta * t) * cos(omega * t);
    }
}
```

The last statement computes the output position driverPos.y directly, without going through a simulation step. You will have to go into the source code to change the cars. Try adding more cars and changing modes. See what happens when a car’s modes include both large and small damping coefficients.
Conclusion

The control theory presented in this gem is the so-called classical control theory, based on linear systems, that was current practice in the aerospace and electronics industries from roughly the 1930s through the 1960s. It has been eclipsed by the rise of computers but has certainly not gone away. It’s a bit surprising that such a well-established body of knowledge is so forgotten in the game industry. This gem is meant to introduce programmers and designers to the concepts of control theory as well as to present one way to simulate and direct physical objects. Other applications of these ideas provide a virtually unexplored field for the industry.

References

[D’Azzo60] D’Azzo, John J., and Constantine H. Houpis. Feedback Control System Analysis and Synthesis. McGraw-Hill Book Company, Inc., 1960.
[El-Dawy03] El-Dawy, Ahmed Saad. “RootLocus.” Geocities.
[Graham61] Graham, Dunstan, and Duane McRuer. Analysis of Nonlinear Control Systems. John Wiley & Sons, Inc., 1960.
[Thaler60] Thaler, George J., and Robert G. Brown. Analysis and Design of Feedback Control Systems. McGraw-Hill Book Company, Inc., 1960.

3.6 Adaptive Tactic Selection in First-Person Shooter (FPS) Games

Thomas Hartley, Institute of Gaming and Animation (IGA), University of Wolverhampton, and Quasim Mehdi, Institute of Gaming and Animation (IGA), University of Wolverhampton

One of the key capabilities of human game players that is not typically employed by non-player characters (NPCs) in commercial first-person shooter (FPS) games is in-game learning and adaptation. The ability of a human player to adapt to opponents’ tactics is an important skill and one that separates an expert game player from a novice. NPCs that incorporate in-game learning and adaptation are more responsive to human players’ actions and are therefore more capable opponents. This gem presents a practical in-game approach to adapting an NPC’s selection of combat tactics.
A Dynamic Approach to Adaptive Tactic Selection

To achieve successful in-game (that is, run-time) tactic selection in FPS games, we propose to adapt the online adaptation algorithm dynamic scripting [Spronck06, Spronck05]. The reinforcement learning–inspired dynamic scripting algorithm has been designed specifically for use in online learning scenarios and has previously shown significant promise in other genres of scripted games, such as role-playing and strategy games. Consequently, the approach offers an interesting augmentation to the traditionally scripted tactic selection techniques used in commercial FPS games. However, the approach does need to be adapted to make it suitable for use in an FPS environment. The FPS version of the dynamic scripting algorithm presented in this gem has been adapted to the selection of tactics rather than rules for scripts, so it has a number of differences from previously implemented versions. First, the library of tactics in the developed system is organized in groups according to a dual-layered state representation. Previous implementations organize the dynamic scripting rulebases according to NPC type and high-level game states [Ponsen and Spronck04]. However, this limited state representation is not suited to FPS games, as the selection of player-versus-player tactics greatly depends on the current state of the NPC [Thurau06]. The second contribution is the development of fitness functions to evaluate the success of a tactic and the encounter with an opponent. The third contribution is the development of a tactic selection process that makes use of a prioritized-list approach in combination with the K-Nearest Neighbor (K-NN) algorithm [Mitchell97]. The goal of this approach is to organize the tactics so that the most successful ones are more likely to be selected first.
Overview of the Adaptive Tactic Selection Architecture

Tactics in the system are organized into tactic libraries, which in turn are organized according to a dual-layered representation of the game environment. As illustrated in Figure 3.6.1, the upper layer determines which tactic library should be selected according to an abstract state representation and/or rules, which are typically described in terms of a behavior type or goal. For example, an "engage enemy" library of tactics could be defined by health above 50 percent and the possession of a high-powered weapon. This process can be performed manually using a game developer's domain knowledge and traditional game development techniques, such as finite state machines (FSMs), decision trees, or rules. For example, the abstract state space in Figure 3.6.2 is based on the hierarchical FSM/prioritized list approach used to control NPC behavior in Halo 2 [Isla05]. The highest-priority state (arranged right to left) that matches the current game state is selected. This approach should allow the tactic selection architecture to be easily integrated with existing game AI techniques.

As illustrated in Figure 3.6.2, the lower state space contains instances of a library's tactics at points in an n-dimensional feature space. The feature space and the number of tactic instances within a tactic library are kept relatively small in order to maintain performance. Once the upper layer has determined the current tactic library, the K closest library instance(s) to the current game state are selected and used to determine the NPC's current list of tactics. The following procedure outlines how an instance of the lower state space is selected. If the distance from the query state (in other words, the NPC's current state) to the closest library instance is at or below a predefined threshold, that instance is used to determine a prioritized tactic list.
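The upper-layer selection described above can be sketched as an ordered rule list evaluated in priority order. This is only an illustrative sketch: the state fields, library names, and thresholds below (health, hasPowerWeapon, "engage_enemy") are assumptions for the example, not definitions from the gem.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical snapshot of the NPC's high-level state.
struct NpcState {
    float health;        // 0..100
    bool hasPowerWeapon;
};

// One upper-layer rule: the first matching entry selects the tactic library.
struct LibraryRule {
    std::string library;
    std::function<bool(const NpcState&)> matches;
};

// The highest-priority matching rule wins, mirroring the Halo 2-style
// prioritized-list selection the text describes.
std::string selectLibrary(const std::vector<LibraryRule>& rules,
                          const NpcState& s) {
    for (const auto& r : rules)
        if (r.matches(s)) return r.library;
    return "default";   // fallback library when no rule matches
}
```

In practice the rule list would be authored from designer domain knowledge, for example `{"engage_enemy", health > 50 && power weapon}` ahead of `{"retreat", health <= 25}`.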
If the closest library instance is above the predefined query threshold, the K closest tactic instances to the current environment state are used for tactic selection. This multi-layered approach to selection allows tactics to be associated with detailed game states, while also being efficient in retrieving from the library.

Figure 3.6.1 High-level overview of the tactic selection architecture.

Figure 3.6.2 Overview of the tactic selection process and the state organization. The K closest library instances are selected. The weights of each tactic are combined using distance weighting and are used to determine a prioritized list.

All tactics have an associated weight that influences the order in which they are selected in a given state. The weights reflect the success or failure of a tactic during previous encounters with an opponent. Learning only occurs between the tactics in each library instance, and one library does not affect the learning in another, even if the same tactic is found in both libraries. Tactic selection is achieved through a prioritized list [Isla05], which is adapted to support online list generation. First, the procedure outlined above is used to determine the K library instance(s) closest to the current game state. If K > 1 and the closest library instance is above the query threshold, the tactics' weights are combined using the distance-weighted K-NN algorithm. The individual or combined tactic library instance is used to create the prioritized list. The K-NN algorithm classifies an unlabeled instance (for example, s in Figure 3.6.2) by comparing its n-dimensional feature vector to stored instances. The K nearest neighbors to the query instance are found and used to determine a combined weight for each tactic. This is achieved through a weighted average that reflects a stored instance's distance to the query instance.
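The distance-weighted combination of tactic weights across the K nearest library instances can be sketched as follows. The gem does not spell out the exact weighting formula, so this sketch assumes the standard inverse-distance form of distance-weighted K-NN, and it uses a 1-D feature space (relative distance to the opponent, as in the gem's later example) for brevity.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A library instance: its position in the (1-D, for this sketch) feature
// space and one weight per tactic in the library.
struct LibraryInstance {
    float position;              // e.g., distance to the opponent
    std::vector<float> weights;  // one entry per tactic
};

// Distance-weighted K-NN combination: each instance contributes in
// proportion to 1/d, where d is its distance to the query state.
std::vector<float> combineWeights(const std::vector<LibraryInstance>& instances,
                                  float query) {
    std::vector<float> combined(instances.front().weights.size(), 0.0f);
    float totalInv = 0.0f;
    for (const auto& inst : instances) {
        float d = std::fabs(inst.position - query);
        float inv = 1.0f / (d + 1e-6f);   // guard against d == 0
        totalInv += inv;
        for (std::size_t t = 0; t < combined.size(); ++t)
            combined[t] += inv * inst.weights[t];
    }
    for (float& w : combined) w /= totalInv;  // normalize to a weighted average
    return combined;
}
```

A query state equidistant from two instances blends their tactic weights evenly; as the query moves toward one instance, that instance's weights dominate.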
The position of a tactic within the prioritized list is determined by ordering tactics according to their weights. (For example, the tactics with the largest weights have the highest priority of selection.) A new prioritized list of tactics is generated whenever an upper-layer state change occurs or when the library instance changes. If a distance-weighted prioritized list has been created, a predefined distance threshold is used to determine whether a new list should be generated. The distance threshold is based on the distance from the current list's query state to the NPC's present world state.

Once a list of tactics is generated, the tactic that has the highest priority and is capable of running is selected as the current behavior. If higher-priority tactics become available, they can interrupt the current tactic on the next game loop. To avoid repetitive behavior, the tactic selection process can also include a check to prevent similar variants of a tactic from following one another. (For example, "strafe a short distance" would not follow "strafe a long distance.") Therefore, if the next tactic is a similar variant of the current tactic, it is skipped, and the subsequent tactic is evaluated. This rule doesn't apply to tactic interrupts, which occur regardless of tactic variation. When a tactic is complete, is no longer capable of running, or has run for a fixed amount of time, the next tactic in the list that is capable of running is selected.

When a tactic is interrupted, is completed, or times out; a library change occurs; or a library instance change occurs, the tactic has its weight updated according to a weight-update and fitness function (outlined in the next section). If the query instance was above the distance threshold, the weight update is applied to the tactics in each of the K closest library instances according to each instance's distance from the initial query point.
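The list generation and anti-repetition selection described above might look like the following sketch. The tactic names, the variant tags, and the canRun flags are hypothetical; in a real agent, canRun would query the world state (for example, corridor width for a circle strafe).

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Tactic {
    std::string name;
    std::string variant;   // e.g., "strafe" for both short and long strafes
    float weight;
    bool canRun;
};

// Order tactics by weight, largest first, to form the prioritized list.
std::vector<Tactic> buildPrioritizedList(std::vector<Tactic> tactics) {
    std::sort(tactics.begin(), tactics.end(),
              [](const Tactic& a, const Tactic& b) { return a.weight > b.weight; });
    return tactics;
}

// Select the next tactic: the first runnable entry whose variant differs
// from the tactic just performed (the anti-repetition rule from the text).
const Tactic* selectNext(const std::vector<Tactic>& list,
                         const std::string& lastVariant) {
    for (const auto& t : list)
        if (t.canRun && t.variant != lastVariant)
            return &t;
    return nullptr;  // nothing capable of running
}
```

Note that interrupts bypass selectNext's variant check, as the text specifies; this sketch only covers the normal selection path.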
Fitness and Weight-Update Functions

The weight-update function alters the weights associated with tactics in response to a reinforcement signal that is supplied by tactic and encounter fitness functions. The aim of the fitness functions is to reward tactics that defeated an opponent or improved the NPC's chance of defeating the opponent. The fitness functions for the tactic and the encounter are defined below. Each function returns a value in the range [–1, 1].

f(t) = w_A · A(t) + w_S · S(t)   (1)

The equation above evaluates the fitness of a tactic and contains two components. In the equation, f refers to the fitness of the tactic t that is being evaluated, and w_A and w_S are the weights of the two components. The first component, A(t) ∈ [–1, 1], determines the difference in life caused by the tactic [Andrade04]; that is, the life lost by the opponent minus the life lost by the NPC. The difference in life lost is a key metric in measuring the performance of a tactic, as removing all of an opponent's life is the ultimate goal of FPS combat. The second component of the fitness function, S(t) ∈ [–1, 1], evaluates the surprise of the tactic. A surprise is the anticipation of an experience that the actual experience does not fulfill [Saunders02]. In the tactic selection architecture, the anticipated experience is the average difference in life lost from previous encounters, and the actual experience is the current difference in life lost.

Each component of the tactic fitness function is weighted according to its contribution, as in the standard dynamic scripting approach [Spronck05]. The selection of each component's contribution is based on game goals. (For example, an NPC places high value on its health.) The two components of the tactic fitness function are determined using Equations (2) and (3).
The function hl(NPC) in Equations (2) and (3) refers to the health (in other words, life) lost by the adapting NPC, and hl(opp) refers to the health lost by the opponent (in other words, the damage caused by the adapting NPC). AvgLifeLost_n in Equation (3) is the average difference in life caused by the tactic over the previous n weight updates.

A(t) = (hl(opp) – hl(NPC)) / 100   (2)

S(t) = ((hl(opp) – hl(NPC)) – AvgLifeLost_n) / 200   (3)

Equation (4) evaluates the fitness of an encounter, that is, the tactics used by the adapting NPC during a combat to reach a winning or losing state. When an NPC reaches a terminal state, the tactics used during the encounter are updated according to whether the NPC won or lost. In Equation (4), f refers to the fitness of the encounter e that is being evaluated. hl_t(NPC) refers to the health (in other words, life) lost by the adapting NPC during the performance of tactic t, and hl_t(opp) refers to the health lost by the opponent during the performance of tactic t.

f(e, t) = 1 – hl_t(NPC) / 125, if the encounter was won
f(e, t) = –1 + hl_t(opp) / 125, if the encounter was lost   (4)

Equations (2), (3), and (4) assume that the life lost or damage is in the range of 0 to 100. Equation (3) divides the surprise by 200 so that the returned value will be in the range of –1 to 1. Equation (4) divides the health lost by 125 in order to give the tactic a guaranteed reward or punishment depending on whether the encounter was won or lost. The scale of the reward or punishment can be set according to game requirements.

The weight-update function uses the tactic and encounter fitness functions to generate a weight adaptation for the library tactics (in other words, an amount to adjust the current weight of a tactic). The function is based on the standard dynamic scripting technique, where weights are bound by a range [W_min, W_max] and surplus weights (that is, weight adjustments that result in a remainder outside the minimum and maximum range) are shared between all other weights [Spronck05].
However, the fitness functions outlined in Equations (1) and (4) return a value in [–1, 1], where a negative fitness value indicates a losing tactic and a positive fitness value indicates a winning tactic. Therefore, a simplified weight-update function is used in this approach. In Equation (5), P_max is the maximum penalty, R_max is the maximum reward, F is the fitness of the tactic, and b, set to 0, is the break-even point for the tactic's fitness. Above this point the tactic is considered to be winning, while below it the tactic is considered to be losing. In this simplified weight-update function, the break-even point is built into the function itself. A fitness of 0 means the weights are unchanged.

Δw = P_max · F, if F < b
Δw = R_max · F, if F ≥ b, with b = 0   (5)

When multiple tactic library instances are used to create a combined prioritized list, the weight update is shared between the K instances according to their distance from the initial query state (in other words, the point that resulted in the library instances being retrieved). The sharing of the weight update is performed using Equation (6), where Δw_i is the weight update for library instance i, d_i is the distance from the query state to that instance, and Δw is the total weight update:

Δw_i = Δw · (1 / d_i) / D, where D = Σ_{k=1..K} (1 / d_k)   (6)

Adapting Tactic Selection

To illustrate the tactic selection architecture, we describe the example of an NPC learning to adapt its selection of combat tactics. In this example, the adapting NPC is endowed with an "Attack" library of tactics that could include melee attack, charge attack, strafe charge attack, circle strafe attack, and attack from cover. Tactics are manually coded and comprise multiple lines of logic that define the behavior to be performed (for example, movement and weapon-handling code). In addition to behavior logic, there should also be code to determine whether a tactic is capable of running. For example, the circle strafe tactic may not be capable of running in a narrow corridor.
For clarity, only one tactic library is used in this example. The upper state space controls the selection of tactic libraries using conventional techniques, and adaptation occurs independently within each tactic library instance; therefore, additional tactic libraries can easily be added. After a library has been defined, the default weights of the tactics need to be set. Tactics can be given the same initial weight, or weights can be selected based on domain knowledge. In this example, all the tactic weights are initially set to the same default weight; therefore, the position of tactics within the initial prioritized list is randomly determined.

Once a tactic library has been defined, the next step is to determine the instances of the library. This involves defining the state space, the number of instances, the position of the instances within the state space, and how the state space is organized. In this example, the lower state space is defined as the relative distance between the adapting NPC and the targeted opponent. This simple feature vector was selected because distance is a prominent factor in the selection of FPS tactics. In most cases, the requirements of the game can be used to determine the most appropriate state space representation. The number of library instances is determined by the size of the state space. In this example, the state space is compact; therefore, only a few instances are needed. The requisite number of instances is also affected by their position in the state space and how the state space is organized. The position of tactic instances in the state space can be manually managed using domain knowledge, or instances can be automatically added using a distance metric such as: if s > r, then create a new instance of the tactic library, where s is the distance between the current state and the closest tactic instance and r is a predefined distance threshold.
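The instance-creation rule above (if s > r, create a new instance) can be sketched for the 1-D example state space as follows. Representing an instance as just its position is a simplification; a full implementation would also copy or initialize the instance's tactic weights.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Positions of the existing tactic-library instances in a 1-D state space
// (relative distance to the opponent, as in the example).
// Returns true if a new instance was created at `state`.
bool maybeCreateInstance(std::vector<float>& instances,
                         float state, float threshold) {
    float closest = std::numeric_limits<float>::max();
    for (float p : instances)
        closest = std::min(closest, std::fabs(p - state));
    if (closest > threshold) {   // s > r: no existing instance is close enough
        instances.push_back(state);
        return true;
    }
    return false;                // reuse the nearest existing instance
}
```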
If the distance is below the threshold, then the closest tactic instances are used. Figure 3.6.3 illustrates an example of tactic library instances and state space organization.

Figure 3.6.3 Example arrangement of the lower state space.

At the start of a combat encounter, the tactic selection architecture generates a list of tactics ordered by their associated weights. The NPC attempts to defeat the opponent by executing the first tactic in the list. If the tactic is not capable of running, the next tactic in the list is selected. This is repeated until a tactic capable of running is found. The list of tactics is generated by determining the K library instances closest to the NPC's position in the lower state space. Once a tactic and an encounter are complete, the weights of the tactics used are updated using Equations (1) and (4). In this example, K is set to 1, and the initial weights of the tactics are the same; therefore, only the closest library instance is used, and the order of the tactics is initially random. The process of an NPC selecting and adjusting its weights is summarized below:

• A prioritized list of tactics is generated, and the melee attack tactic is selected for execution. The tactic performs poorly during the encounter and results in 100-percent health loss for the adapting NPC and 5-percent health loss for the opponent.

• The encounter finishes with the death of the adapting NPC; therefore, the tactic and encounter fitness are determined. The tactic's weight is updated based on its success. The first step in the process is to determine the tactic's fitness; in this example, hl_npc equals 100 and hl_opp equals 5.
int hl_npc = botDamage();    // health lost by the adapting NPC: 100
int hl_opp = playerDamage(); // health lost by the opponent: 5

// Equations (2) and (3); note the floating-point literals, since
// integer division here would truncate the results to zero.
float a = (hl_opp - hl_npc) / 100.0f;
float s = ((hl_opp - hl_npc) - avgDiffInLifeLost) / 200.0f;

// Equation (1): weighted sum of the two components.
float fitness = (7.0f * a + 3.0f * s) / 10.0f;

• Once the fitness of the tactic has been determined, the weights of all the tactics in the library instance are updated. If the number of library instances equals 1, as in this example, the weight adjustment for the performed tactic is determined as follows (Equation (5)):

if (fitness < 0)
    weightAdjustment = tacticPenaltyMax * fitness;
else
    weightAdjustment = tacticRewardMax * fitness;

• Next, the encounter weight update is performed. This process involves the algorithm looping through the tactics used during the encounter, determining each one's encounter fitness, weight adjustment, and weight update. The calculation of the encounter fitness (Equation (4)) is shown below. The tacticResults array contains information on each tactic used during the encounter and its performance.

if (hasWon)
    fitness = 1.0f - (tacticResults[i].getHl_npc() / 125.0f);
else
    fitness = -1.0f + (tacticResults[i].getHl_opp() / 125.0f);

• After the tactic weights have been updated, the adapting NPC can generate a new prioritized list. In this example, the melee attack tactic will be at the end of the list due to its poor performance.

Conclusion

This gem has outlined an approach to the in-game adaptation of scripted tactical behavior in FPS computer games. The technique enables NPCs in these games to adapt their selection of tactics in a given state based on their experience of the tactics' success. Less successful tactics are not selected, or are selected only when more preferred tactics are not available.

References

[Andrade04] Andrade, G. D., H. P. Santana, A. W. B. Furtado, A. R. G. A. Leitão, and G. L. Ramalho. "Online Adaptation of Computer Games Agents: A Reinforcement Learning Approach." 1st Brazilian Symposium on Computer Games and Digital Entertainment (SBGames2004).

[Isla05] Isla, D.
"Handling Complexity in the Halo 2 AI." GDC 2005 Proceedings. Gamasutra.

[Mitchell97] Mitchell, T. Machine Learning. McGraw-Hill, 1997.

[Ponsen and Spronck04] Ponsen, M., and P. Spronck. "Improving Adaptive Game AI with Evolutionary Learning." Computer Games: Artificial Intelligence, Design and Education (CGAIDE 2004): 389–396.

[Saunders02] Saunders, R. "Curious Design Agents and Artificial Creativity: A Synthetic Approach to the Study of Creative Behaviour." Ph.D. thesis. University of Sydney, Sydney. 2002.

[Spronck05] Spronck, P. "Adaptive Game AI." Ph.D. thesis. Universiteit Maastricht, Maastricht. 2005.

[Spronck06] Spronck, P. "Dynamic Scripting." AI Game Programming Wisdom 3. Ed. S. Rabin. Boston: Charles River Media, 2006. 661–675.

[Thurau06] Thurau, C. "Behavior Acquisition in Artificial Agents." Ph.D. thesis. Bielefeld University, Bielefeld. 2006.

3.7 Embracing Chaos Theory: Generating Apparent Unpredictability through Deterministic Systems

Dave Mark, Intrinsic Algorithm LLC

One of the challenges of creating deep, interesting behaviors in games and simulations is to enable our agents to select from a wide variety of actions without abandoning completely deterministic systems. On the one hand, we want to step away from having very obvious if/then triggers or monotonous sequences of actions for our agents. On the other hand, the need for simple testing and debugging necessitates avoiding the introduction of random selection into our algorithms. This gem shows how, by embracing the concepts and deterministic techniques of chaos theory, we can achieve complex-looking behaviors that are reasonable yet not immediately predictable by the viewer. By citing examples from nature and science (such as weather) as well as the simple artificial simulations of cellular automata, the gem explains how purely deterministic rules can give rise to chaotic-looking systems.
The gem then presents some sample, purely deterministic behavior systems that exhibit complex, observably unpredictable sequences of behavior. It concludes by explaining how these sorts of algorithms can be easily integrated into game AI and simulation models to generate deeper, more immersive behavior.

The Need for Predictability

The game development industry often finds itself in a curious predicament with regard to randomness in games. Game developers rely heavily on deterministic systems. Programming is inherently a deterministic environment. Even looking only at the lowly if/then statement, it is obvious that computers themselves are most "comfortable" in a realm where there is a hard-coded relationship between cause and effect. Even non-binary systems, such as fuzzy state machines and response curves, could theoretically be reduced to a potentially infinite sequence of statements that state, "Given the value x, the one and only result is y."

Game designers, programmers, and testers also feel comfortable with this technological bedrock. After all, in the course of designing, developing, and observing complex algorithms and behaviors, they often need to be able to say, with certainty, "Given this set of parameters, this is what should happen." Often the only metric of the success or failure of the development process is the question, "Is this action what we expected? If not, what went wrong?"

Shaking Things Up

Game players, on the other hand, have a different perspective on the situation. The very factor that comforts the programmer—the knowledge that his program is doing exactly what he predicts—is the factor that can annoy the player…the program is doing exactly what he predicts. From the standpoint of the player, predictability in game characters can lead to repetitive, and therefore monotonous, gameplay.
The inclusion of randomness can be a powerful and effective tool to simulate the wide variety of choices that intelligent agents are inclined to make [Mark09]. Used correctly, and limited to the selection of behaviors that are reasonable for the NPC's archetype, randomness can create deeper, more believable characters [Ellinger08]. While this approach provides a realistic depth of behavior that can be attractive to the game player, it is this same abandonment of a predictable environment that makes the analysis, testing, and debugging of behaviors more complicated. Complete unpredictability can also lead to player frustration, as players are unable to progress in the game by learning complex agent behavior. The only recourse a programming staff has in controlling random behaviors is through tight selection of random seeds. In a dynamic environment, however, the juxtaposition of randomly selected AI behavior with the unpredictable nature of the player's actions can lead to a combinatorial explosion of possible scenario-to-reaction mappings. This is often an unwanted—or even unacceptable—risk or burden for a development staff.

The solution lies in questioning one of the premises in the above analysis—that is, that the experience of the players is improved through the inclusion of randomness in the decision models of the agent AI. While that statement may well be true, the premise in question is that the player's experience is based on the actual random number call in the code. The random number generation in the agent AI is merely a tool in a greater process. What is important is that the player cannot perceive excessive predictable regularity in the actions of the agent. As we shall discuss in this article, the goal of unpredictability can be accomplished without sacrificing the moment-by-moment logical determinism that developers need in order to confidently craft their agent code.
A Brief History of Chaos

The central point of how and why this approach is viable can be illustrated simply by analyzing the term chaos theory. The word chaos is defined as "a state of utter confusion or disorder; a total lack of organization or order." This is also how we tend to use it in general speech. By stating that there is no organization or order to a system, we imply randomness. However, chaos theory deals entirely within the realm of purely deterministic systems; there is no randomness involved. In this sense, the idea of chaos is more aligned with the idea of "extremely complex information" than with the absence of order. To the point of this article: because the information is so complex, we observers are unable to adequately perceive the complexity of the interactions. Given a momentary initial state (the input), we fail to determine the rule set that was in effect that led to the next momentary state (the output).

Our inability to perceive order falls into two general categories. First, we are often limited by flawed perception of information. This occurs when we do not perceive the existence of relevant information, or do not perceive relevant information with great enough accuracy to determine its ultimate effect on the system. The second failure is to adequately perceive and understand the relationships that define the system. Even with perfect perception of information, if we are not aware of how that information interacts, we will not be able to understand the dynamics of the system. We may not perceive a relationship in its entirety, or we may not be clear on the exact magnitude of a relationship. For example, while we may realize that A and B are related in some way, we may not know exactly what the details of that relationship are.

Perceiving Error

Chaos theory is based largely on the first of these two categories—the inability to perceive the accuracy of the information.
In 1873, the Scottish theoretical physicist and mathematician James Clerk Maxwell hypothesized that there are classes of phenomena affected by "influences whose physical magnitude is too small to be taken account of by a finite being, [but which] may produce results of the highest importance." As prophetic as this speculation was, it was the French mathematician Henri Poincaré, considered by some to be the father of chaos theory, who put it to more formal study in his examination of the "three-body problem" in 1887. Despite inventing an entirely new branch of mathematics, algebraic topology, to tackle the problem, he never completely succeeded. What he found in the process, however, was profound in its own right. He summed up his findings as follows:

If we knew exactly the laws of nature and the situation of the universe at the initial moment, we could predict exactly the situation of the same universe at a succeeding moment. But even if it were the case that the natural laws had no longer any secret for us, we could still know the situation approximately. If that enabled us to predict the succeeding situation with the same approximation, that is all we require, and we should say that the phenomenon had been predicted, that it is governed by the laws. But [it] is not always so; it may happen that small differences in the initial conditions produce very great ones in the final phenomena. A small error in the former will produce an enormous error in the latter. Prediction becomes impossible…. [Wikipedia09]

This concept eventually led to what is popularly referred to as the butterfly effect. The origin of the term is somewhat nebulous, but it is most often linked to the work of Edward Lorenz. In 1961, Lorenz was working on the issue of weather prediction using a large computer. Due to logistics, he had to terminate a particularly long run of processing midway through.
In order to resume the calculations at a later time, he made a note of all the relevant variables in the registers. When it was time to continue the process, he re-entered the values that he had recorded previously. Rather than re-entering one value as 0.506127, he simply entered 0.506. Eventually, the complex simulation diverged significantly from what he had predicted. He later determined that the removal of 0.000127 from the data was what had dramatically changed the course of the dynamic system—in this case, resulting in a dramatically different weather system. In 1963, he wrote of his findings in a paper for the New York Academy of Sciences, noting that, "One meteorologist remarked that if the theory were correct, one flap of a seagull's wings could change the course of weather forever." (He later substituted "butterfly" for "seagull" for poetic effect.)

Despite weather being an inherently deterministic environment, much of the problem with predicting it lies in the size of the scope. Certainly, it is too much to ask scientists to predict on which city blocks rain will fall and on which it will not during an isolated shower. However, even predicting the single broad path of a large, seemingly well-organized storm system, such as a hurricane, baffles current technology. Even without accounting for the intensity of the storm as a whole, much less the individual bands of rain and wind, the various forecasts of simply the path of the eye of the hurricane that the different prediction algorithms churn out fan out like the strings of a discarded tassel. That these mathematical models all process the same information in such widely divergent ways speaks to the complexity of the problem.

Thankfully, the mathematical error issue is not much of a factor in the closed system of a computer game. We do not have to worry about errors in initial observations of the world, because our modeling system is actually a part of the world.
If we restart the model from the same initial point, we can guarantee that, unlike Lorenz's misfortune, we won't have an error of 0.000127 to send our calculations spinning off wildly into the solar system. (Interestingly, in our quest for randomness, we can build a system that relies on a truly random seed to provide interesting variation—the player.) Additionally, we don't have to worry about differences in mathematical calculation on any given run. All other things being equal (for example, processor type), a given combination of formula and input will always yield the same output. These two factors are important in constructing a reliable deterministic system that is entirely under our control.

Brownian Motion

As mentioned earlier, the second reason people mistake deterministic chaos for randomness is that we often lack the ability to perceive or realize the relationships between entities in a system. In fact, we often are not aware of some of the entities at all. This was the case with the discovery of a phenomenon that eventually became known as Brownian motion. Although there had been observations of the seemingly random movement of particles before, the accepted genesis of this idea is the work of botanist Robert Brown in 1827. As he watched the microscopic inner workings of pollen grains, he observed minute "jittery" movement by vacuoles. Over time, the vacuoles would even seem to travel around their neighborhood in an "alive-looking" manner. Not having a convenient explanation for this motion, he assumed that pollen was "alive" and was, after the way of living things, moving of its own accord. He later repeated the experiments with dust, which ruled out the "alive" theory but did nothing to explain the motion of the particles. The real reason for the motion of the vacuoles was molecular- and atomic-level vibration caused by heat.
Each atom in the neighborhood of the target vibrates with its own pattern and schedule, with each vibration nudging both the target and other adjacent atoms slightly. The combination of many atoms doing so in myriad directions and amounts provides a staggering level of complexity. While completely deterministic from one moment to the next—that is, "A will nudge B n distance in d direction"—the combinatorial explosion of interconnected factors goes well beyond the paltry scope of Poincaré's three-body problem. The problem Brown had was that he could not perceive the existence of the atoms buffeting the visible grains. What's more, even when the existence of those atoms is known (and more to the point, once the heat-induced vibration of molecules is understood), there is no way anyone can know what the relationship between cause and effect is from moment to moment. We only know that there will be an effect. This speaks to the second of the reasons listed earlier—that we often lack the ability to perceive or realize the relationships between entities in a system. This effect is easier for us to take advantage of in order to accomplish our goal. By incorporating connections between agents and world data that are beyond the ability of the player to adequately perceive, we can generate purely deterministic cause/effect chains that look either random or at least reasonably unpredictable.

Exploring Cellular Automata

One well-known example bed for purely deterministic environments is the world of cellular automata. Accordingly, one of the most well-known examples of cellular automata is Conway's Game of Life. Conway's creation (because the term "game" is probably pushing things a little) started as an attempt to boil down John von Neumann's theories of self-replicating Turing machines.
What spilled out of his project was an interesting vantage point on emergent behavior and, more to the point of this gem, the appearance of seemingly coordinated, logical behavior. Using Conway's Life as an example, we will show how applying simple, deterministic rules produces this seemingly random behavior. The environment for Life is a square grid of cells. A cell can be either on or off. Its state in any given time slice is based on the states of the eight cells in its immediate neighborhood. The number of possible combinations of the cells in the local neighborhood is 2⁸, or 256. (If you account for mirroring or rotation of the state space, the actual number of unique arrangements is somewhat smaller.) The Game of Life is easy for us to digest, however, because of its brevity and simplicity. We do not care about the orientation of the live neighbors, but only a sum of how many are alive at that moment. The only rules that are in effect are: 1. Any live cell with two or three live neighbors lives on to the next generation. 2. Any live cell with fewer than two live neighbors dies (loneliness/starvation). 3. Any live cell with more than three live neighbors dies (overcrowding). 4. Any dead cell with exactly three live neighbors becomes a live cell (birth). Figure 3.7.1 shows a very simple example of these rules in action. In the initial grid, there are three "live" cells shown in black. Additionally, each cell contains a number showing how many neighbors that cell currently has. Note that two of the "dead" cells (shown in gray) have three neighbors, which, according to Rule 4 above, means they will become alive in the next iteration. The other dead cells have zero, one, or two neighbors, meaning they will remain at a status quo (in other words, dead) for the next round. The center "live" cell has two neighbors, which, according to Rule 1 above, allows it to continue living.
On the other hand, the two end cells have only a single live neighbor (the center cell) and will therefore die of starvation the next round. The results are shown on the right of Figure 3.7.1. Two of the prior cells are now dead (shown in gray), and two new cells have been born to join the single surviving cell. Interestingly, this pattern repeats such that the next iteration will be identical to the first (a horizontal line), and so on. This is one of the many stable or tightly repeating patterns that can be found in Life. Specifically, this one is commonly called a blinker. Figure 3.7.2 shows another, slightly more involved example. The numbers in the initial frame make it easier to understand why the results are there. Even without the numbers, however, the relationships between the initial state and the subsequent one are relatively easy to discern on this small scale. While the rule set seems simple and intuitive enough, when placed on a larger scale and run over time, the "behavior" of the entire "colony" of cells starts to look random. This is due to the overwhelming number of interactions that we perceive at any one moment. Looking at the four panels of Figure 3.7.3, it is difficult for us to intuitively predict what would happen next, except either in the most general sense (for example, that solid blob in the middle is too crowded and will likely collapse) or in regard to very specific subsets of the whole (for example, that is a blinker in the lower-left corner). Interestingly, it is this very combination of "reasonable" expectations with not knowing exactly what is going to appear next that gives depth to Conway's simulation. Over time, one develops a sense of familiarity with the generalized feel of the simulation. Figure 3.7.1 A simple example of the rule set in Conway's Game of Life. In this case, a two-step repeating figure called a blinker is generated by the three boxes.
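The four rules above are compact enough to implement directly. The following is an illustrative Python sketch (not from the original text) that advances a set of live cells by one generation and reproduces the blinker described above:

```python
from collections import Counter

def life_step(live):
    """Advance one generation. `live` is a set of (x, y) live-cell coords."""
    # Count live neighbors for every cell adjacent to at least one live cell.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Rule 4 (birth: exactly 3) and Rule 1 (survival: 2 or 3 live neighbors);
    # Rules 2 and 3 fall out implicitly, since every other cell dies.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

# The blinker from Figure 3.7.1: a horizontal line of three live cells.
blinker = {(0, 1), (1, 1), (2, 1)}
gen1 = life_step(blinker)   # flips to a vertical line
gen2 = life_step(gen1)      # flips back to the horizontal line
assert gen1 == {(1, 0), (1, 1), (1, 2)}
assert gen2 == blinker
```

Note that the entire update is a pure function of the previous state, which is exactly the determinism this gem is concerned with.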
Figure 3.7.2 The dynamic nature of the cells acting together. Figure 3.7.3 When taken as a whole, the simple cells in Life take on a variety of complex-looking behaviors. While each cell's state is purely deterministic, it is difficult for the human mind to quickly predict what the next overall image will look like. For example, we can expect that overcrowded sections will collapse under their own weight, and isolated pockets will die off or stagnate. We also recognize how certain static or repeating features will persist until they are interfered with—even slightly—by an outside influence. Still, the casual observer will perceive the unfolding (and seemingly complex) action as being somewhat random…or at least undirected. Short of pausing to analyze every cell on every frame, the underlying strictly rule-based engine goes unnoticed. On the other hand, from the standpoint of the designer and tester, this simulation model is elegantly simplistic. The very fact that you can pause the simulation and confirm that each cell is behaving properly is a boon. For any given cell at any stage, it is a trivial problem to confirm that the resulting state change is performing exactly as designed. Leveraging Chaos Theory in Games What sets Conway's Life apart from many game scenarios is not the complexity of the rule set, but rather the depth of it. On its own, the process of passing the sum of eight binary inputs through four rules to receive a new binary state does not seem terribly complex. When we compare it to what a typical AI entity in a game may use as its decision model, we realize that it is actually a relatively robust model. For instance, imagine a very simple AI agent in a first-person shooter game (see Figure 3.7.4). It may take into account the distance to the player and the direction in which the player lies. When the player enters a specified range, the NPC "wakes up," turns, and moves toward the player.
There is one input state—distance—and two output states: "idle" and "move toward player." While this seems extraordinarily simple, as recently as 10 to 15 years ago, this was still common for enemy AI. Needless to say, the threshold and resultant behavior were easy to discern over time. Players could perceive both the cause and the effect with very little difficulty. Likewise, designers and programmers could test this behavior with something as simple as an onscreen distance counter. At this point, there is very little divergence between the simplicity for the player and the simplicity for the programmer. Adding a Second Factor If we were to add a second criterion to the decision, the situation does not necessarily become much more complicated. For example, we could add a criterion stating that the agent will only attack the player when he is in range and carrying a weapon. This is an intuitively sound addition and is likely something that the player will quickly understand. On the other hand, this also means that the enemy is again rigidly predictable. Other factors can be added to a decision model, however, which could obscure the point at which a behavior change should occur. Even the addition of other binary factors (such as the states of the cells in Life) can complicate things quickly for the observer if they aren't intuitively obvious. For instance, imagine that the rule for attacking the player was no longer "if the distance from player to enemy < n" but rather "if the player's distance to two enemies < n" (see Figure 3.7.5). As the player approaches the first enemy, there would be no reaction from that first enemy until a second enemy is within range as well. While this may seem like a contrived rule, it stresses an important point. The player will most certainly be interested in the actions of the first enemy and will not easily recognize that its reaction was ultimately based on the distance to the second enemy.
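The two-enemy rule can be stated in a few lines. This is a hypothetical Python sketch (names and the range value are invented for illustration):

```python
import math

def dist(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def should_attack(player_pos, enemy_positions, n=10.0):
    """Attack only once the player is within range n of at least two
    enemies; the second enemy is the criterion hidden from the player."""
    in_range = sum(1 for e in enemy_positions if dist(player_pos, e) <= n)
    return in_range >= 2

# Player stands next to one enemy, but the other is far away: no attack yet.
assert should_attack((5.0, 0.0), [(0.0, 0.0), (50.0, 0.0)]) is False
# Both enemies are in range: the first enemy now attacks.
assert should_attack((5.0, 0.0), [(0.0, 0.0), (8.0, 0.0)]) is True
```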
Figure 3.7.4 If there is only one criterion in a decision model, it is relatively simple for the player to determine not only what the criterion is, but what the critical threshold value is for that criterion to trigger the behavior. Figure 3.7.5 The inclusion of a second criterion can obscure the threshold value for—or even the existence of—the first criterion. In this case, because the second enemy is included, the player is not attacked as he enters the normal decision radius of the first enemy. The player may not be able to adequately determine what the trigger conditions are because we have masked the existence of the second criterion from the player. People assume that causes and effects are linked in some fashion. In this example, because there is no intuitive link between the player's distance to the second enemy and the actions of the first, the player will be at a loss to determine when the first enemy will attack. The benefit of this approach is that the enemy no longer seems like it is acting strictly on the whims of the player. That is, the player is no longer deciding when he wants the enemy to attack—it is seemingly attacking on its own. This imparts an aura of unpredictability on the enemy, which, in essence, makes it seem more autonomous. Of course, we programmers know that the agent is not truly autonomous but merely acting as a result of the second criterion. In fact, this new rule set is almost as simple to monitor and test as it is to program in the first place. We have the benefit of knowing what the two rules are and how they interact—something that is somewhat opaque to the player. Selecting the Right Factors As mentioned, the inclusion of the player's distance to the second agent as a criterion for the decisions of the first agent is a little contrived. In fact, it has the potential for embarrassing error.
If the second agent was far away, it is possible that the player could walk right up to the first one and not be inside the threshold radius of the second. In this case, the first agent would check his decision criteria and determine that the player was not in range of two people and, as a result, would not attack the player. This does not mean that there is a flaw in the use of more than one criterion for the decision—simply that there is a flaw in which criteria are being used in conjunction. In this case, the criterion that was based on the position of the second agent—and, more specifically, the player's proximity to the second agent—was arbitrary. It was not directly related to the decision that the agent is making. The solution to this is to include factors on which it is reasonable for an agent to base his decisions. In this example, the factors may include items such as: • Distance to player • Perceived threat (for example, visible weapon) • Agent's health • Agent's ammo • Agent's alertness level • Player's distance to sensitive location We already covered the first two. The others are simply examples of information that could be considered relevant. For the last one, we could use the distance measurement from the player to a location such as a bomb detonator. If the player is too close to it, the agent will attack. This is different from the example with two agents above in that the distance to a detonator is presumably more relevant to the agent's job than the player's proximity to two separate agents. While a number of the criteria listed previously could be expressed as continuous values, such as the agent's health ranging from 0 to 100, for the sake of simplicity, they can also be reduced to Boolean values. We could rephrase "agent's health" as "agent has low health," for instance. If we define "low health" as a value below 25, we are now able to reduce that criterion to a Boolean value.
The same could be done with "agent's ammo." This, of course, is very similar to what we did with the distance. We could assert that "if agent has less than 10 shots remaining," then "agent has low ammo." What we have achieved with the above list could be summed up with the following pseudo-code query: If (PlayerTooCloseToMe() or PlayerCloseToTarget()) and WeaponVisible() and IAmHealthy() and IHaveEnoughAmmo() and IAmAlert() then Attack() Even with these criteria, the number of possible configurations is 2⁷, or 128. (Incidentally, the number of configurations of cells in Life is 2⁸, or 256.) In our initial example, it would take only a short amount of time to determine the distance threshold at which the agent attacks. With the inclusion of so many relevant factors in the agent's behavior model, the only way that any one threshold can be ascertained is with the inclusion of the caveat "all other things being equal." Certainly in a dynamic environment, it is a difficult prospect to control for all variables simultaneously. While having 128 possible configurations seems like a lot, it is not necessarily the number of possible configurations of data that will obscure the agent's possible selection from the player. Much of the difficulty that a player would have in knowing exactly what reaction the agent will have is due to the fact that the player cannot perceive all the data. This is similar to the impasse at which Robert Brown found himself. He could not detect the actual underlying cause of the jitteriness of the pollen grains and dust particles. His observation, therefore, was that the motion was random yet reasonable; he perceived lifelike motion where there was no life. A good way of illustrating this point is by working backward—that is, looking at the situation from the point of view of the player.
If the agent's behavior changes from one moment to the next, the player may not be able to determine which of the aforementioned factors crossed one of the defined thresholds to trigger the change. In some cases, this would be easy. For example, if the player draws his weapon and the agent attacks, the player can make the assertion that the weapon was the deciding factor. However, if the agent does not attack and, for instance, runs away instead, the player may not be able to determine whether it was due to the agent having low health or low ammo. Similarly, if the player is moving near the agent with his weapon drawn, and the agent begins to attack, the player may not be able to ascertain whether it was his proximity to the agent, a secondary point (for example, a detonator), or a change in the agent's alertness status that caused the transition to occur. Once combat is engaged and values such as health and ammo are changing regularly, the number of possible reasons for a change in behavior increases significantly. Of course, this is the reason why it is important to use relevant information as part of your decision. If you use rational bases for your decisions, it makes it more likely that the decision can at least be understood after it happens. There is a big difference between predictability and understandability. The player may not know exactly when the agent is going to change behaviors, but he should be able to offer a reasonable guess as to why it happened. From a development and testing standpoint, the important issue to note here is that this is still purely deterministic. There is no randomness included at all. A simple code trace or onscreen debug information would confirm the status of all seven of these criteria.
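A minimal runnable version of the pseudo-code query might look like the following Python sketch. The predicate names echo the ones in the text, but the concrete threshold values (10 units, 25 health, 10 shots) and the dictionary layout are invented for illustration:

```python
def decide_attack(agent, player):
    """All criteria must line up before the agent attacks; the decision
    is purely deterministic and trivial to trace in a debugger."""
    player_too_close    = agent["dist_to_player"] < 10
    player_close_target = agent["player_dist_to_detonator"] < 10
    weapon_visible      = player["weapon_drawn"]
    healthy             = agent["health"] >= 25
    enough_ammo         = agent["ammo"] >= 10
    alert               = agent["alert"]
    return ((player_too_close or player_close_target)
            and weapon_visible and healthy and enough_ammo and alert)

agent  = {"dist_to_player": 8, "player_dist_to_detonator": 40,
          "health": 80, "ammo": 30, "alert": True}
player = {"weapon_drawn": True}
assert decide_attack(agent, player) is True
# Same situation, but the agent is low on ammo: no attack, and the player
# has no easy way to know which criterion changed the behavior.
assert decide_attack({**agent, "ammo": 3}, player) is False
```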
When compared against the decision code, the developer can confirm whether the agents are operating as planned or, in the case that they are not, determine which of the criteria needs to be adjusted. Beyond Booleans We can extend the aforementioned ideas to go beyond purely Boolean flags, however. By incorporating fuzzy values and appropriate systems to handle them, we could have more than one threshold value on any of the previous criteria. For instance, we could use the seven aforementioned factors to select from a variety of behaviors. Rather than simply determining whether the agent will attack the player, for example, we could include actions such as finding cover, running away, reloading, or calling for help. In order to do this, we could partition one or many of the factors into multiple zones. For example, if we were to arrange two factors on two axes and determine a threshold across each, we would arrive at four distinct "zones" (see Figure 3.7.6, Panel 1). Each of these zones can be assigned to a behavior. In this case, using only two factors and two threshold values, we can arrive at four distinct behaviors. The more thresholds we insert, the more zones are created. Using our two-axis example, by increasing the number of threshold values from 1 to 3 in each direction, we increase the number of result spaces from 4 to 16 (see Figure 3.7.6, Panel 2). We can visualize how this would affect behaviors if we imagine the values of our two factors moving independently along their respective axes. For example, imagine a data point located in Behavior G in Panel 2. If Factor 1's value were to change, we could expect to see Behaviors F and H—and even E—depending on the amount of change in Factor 1. If Factor 2 were the only one changing, we could expect to see changes to Behaviors C, K, and O. If changes were occurring in only Factor 1 or 2, by observing the changes in behavior that occurred, we could eventually determine where those thresholds between behaviors are.
However, if both factors were continually changing independent of each other, we now could possibly see any one of the 16 behaviors. This would make it significantly more difficult to exactly predict the cause-and-effect chain between factors and behaviors. For instance, assume once again that we start with the data point in G. If we were witnessing a reduction in Factor 1, we might see a state change to B, C, F, J, or K. All of those states can be reached from G if Factor 1 is decreasing. What we would have to realize is that the ultimate state is decided by Factor 2 as well—for example, G→C could happen if Factor 1 was decreasing slowly and Factor 2 was increasing. Correspondingly, G→K could happen if Factor 1 was decreasing slowly and Factor 2 was decreasing. Naturally, similar extensions of logic apply to result states of B and J. The conclusion is, while we can make a general statement about where our data point may end up while Factor 1 is decreasing, we can't know the eventual result state without also knowing the behavior of Factor 2. The examples shown in Figure 3.7.6 show only two dimensions due to the limitations of what can be shown on paper. Our decision model does not share those limitations, however. By using each potential factor in our decision model and defining one or more thresholds of meaning, we can create a complex state space of potential results. As a measurement of how this expands the potential number of actions, we can compare the number of possible outcomes. When the system was composed of 7 Boolean values, we had 128 possible combinations of data. Even by simply partitioning each of the 7 inputs into 3 ranges rather than 2, the number of combinations becomes 3⁷, or 2,187. These results could then be mapped to a startling array of behaviors. Of course, not all the result zones need to represent individual behaviors. Imagine mapping the 2,187 possible result spaces onto 30 different behaviors, for instance.
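Mechanically, the zone lookup described above is just a pair of threshold searches into a behavior grid. The following is an illustrative Python sketch (threshold values and behavior labels are invented):

```python
import bisect

def zone(value, thresholds):
    """Index of the partition `value` falls into; thresholds must be sorted."""
    return bisect.bisect_left(thresholds, value)

def behavior(f1, f2, t1, t2, table):
    """Look up a behavior from the zones of two independent factors."""
    return table[zone(f1, t1)][zone(f2, t2)]

# Three thresholds per axis partition each factor into 4 ranges,
# giving the 4 x 4 = 16 result zones of Figure 3.7.6, Panel 2.
thresholds = [25, 50, 75]
table = [[f"B{r}{c}" for c in range(4)] for r in range(4)]
assert behavior(60, 30, thresholds, thresholds, table) == "B21"
assert behavior(10, 90, thresholds, thresholds, table) == "B03"
```

Extending this to n factors simply means an n-dimensional lookup table, with the number of zones growing as the product of the per-axis partition counts.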
To further leverage the power of this system, we can step beyond the mentality of mapping a single value onto a series of states. Instead, we can use systems that take multiple inputs in conjunction and select a state that is dependent on the values of those multiple inputs. For example, we may want our agent to consider a combination of its distance to the player and its own health. That is, the state of its own health becomes more important if the player is closer. The two endpoints of the "relevancy vector" of this decision would be between ("player far" + "perfect health") and ("player near" + "zero health"). As either of those factors moves from one end to the other, it has less effect on the outcome than if both of these factors were moving simultaneously. Figure 3.7.6 As the number of logical partitions through axes is increased, the number of potential results increases exponentially as a factor of the number of axes involved. By combining the factors prior to partitioning (Panel 3), thresholds can be made more expressive. Figure 3.7.6, Panel 3 shows a two-dimensional visualization of this effect. In this case, the factors themselves are not partitioned. Instead, they remain continuous variables. When combined, however, they create a new directed axis in the state space—in this case shown by the combined shading. We can set a threshold across that new axis (dotted line) that can be used to determine where the behavior changes. Now, from the point of view of the player, he cannot determine at what point on Factor 1 the behavior changes without taking into account changes in Factor 2 as well. Just as we did with single-axis-based thresholds, by determining one or more multi-axis threshold lines in an n-dimensional space, we partition our space into a variety of zones.
By analyzing in which sections of the resulting partitioned space our current input data falls, we can select from multiple potential outputs. By combining values in this way, we can potentially arrive at more expressive outputs at any stage of our decision model. Specific techniques for accomplishing and managing this sort of decision complexity can be found elsewhere [Mark08]. The methods that we use to arrive at the resulting values are not the focus here, however. The important part is that we are doing all of this in a purely deterministic fashion—that is, we could verify that any given combination of factors is mapped to the appropriate action. While there is still no random factor being included in these calculations, the dizzying number of potential combinations provides for reasonable-looking, yet not inherently predictable results. Conclusion To sum up, while our desire as game developers may be to express a variety of reasonable-looking but slightly unpredictable behaviors, we do not have to resort to randomness in order to generate that effect. By including more than one or two simple, easily perceivable criteria in our decision models, we can begin to obscure the workings of that model from the player, yet leave it perfectly exposed and understandable to the programmer and even the design team. However, in order to avoid the potential for arbitrary-looking decisions by our agents, we must be careful to select criteria that are relevant to the decision being made. In this way we are also providing deeper, more realistic-looking, and potentially more immersive behaviors for our agents. References [Ellinger08] Ellinger, Benjamin. "Artificial Personality: A Personal Approach to AI." AI Game Programming Wisdom 4. Boston: Charles River Media, 2008. [Mark08] Mark, Dave. "Multi-Axial, Dynamic Threshold Fuzzy State Machine." AI Game Programming Wisdom 4. Boston: Charles River Media, 2008. [Mark09] Mark, Dave. Behavioral Mathematics for Game AI.
Boston: Charles River Media, 2009. [Wikipedia09] "Henri Poincaré." Wikipedia. n.d. 3.8 Needs-Based AI Robert Zubek Needs-based AI is a general term for action selection based on attempting to fulfill a set of mutually competing needs. An artificial agent is modeled as having a set of conflicting motivations, such as the need to eat or sleep, and the world is modeled as full of objects that can satisfy those needs at some cost. The core of the AI is an action selection algorithm that weighs those various possible actions against each other, trying to find the best one given the agent's needs at the given moment. The result is a very flexible and intuitive method for building moderately complex autonomous agents, which are nevertheless efficient and easy to understand. This gem presents some technical details of needs-based AI. We begin with a general overview and then dive directly into technical implementation, presenting both general information and some specific hints born out of experience implementing this AI. Finally, we finish with an overview of some design consequences of using this style of AI. Background In terms of its historical context, the needs-based AI approach is related to the family of behavior-with-activation-level action selection methods common in autonomous robotics. (For an overview, see [Arkin98], page 141.) In game development, it was also independently rediscovered by The Sims, where it has been enjoyed by millions of game players. The Sims also contributed a very useful innovation on knowledge representation, where behaviors and their advertisements are literally distributed "in the world" in the game, and therefore are very easily configurable. My own interest in this style of AI was mainly driven by working with The Sims (at Northwestern University and later at EA/Maxis).
I have since reimplemented variations on this approach in two other games: Roller Coaster Kingdom, a web business simulation game, and an unpublished RPG from the Lord of the Rings franchise. I found the technique to be extremely useful, even across such a range of genres and platforms. Unfortunately, very few resources about the original Sims AI remain available; of those, only a set of course notes by Ken Forbus and Will Wright, plus a Sims 2 presentation by Jake Simpson, are freely downloadable on the web at this point. (Links can be found in the "References" section at the end of this gem.) My goal here is to present some of this knowledge to a wider audience, based on what I've gained from robotics and The Sims, as well as personal experience building such agents in other games. Needs-Based AI Overview There are many ways to drive an artificial agent; some games use finite state machines, others use behavior trees, and so on. Needs-based AI is an alternative with an exciting benefit: The smarts for picking the next action configure themselves automatically, based on the agent's situation as well as internal state; yet the entire algorithm remains easy to understand and implement. Each agent has some set of ever-changing needs that demand to be satisfied. When deciding what to do, the agent looks around the world and figures out what can be done based on what's in the area. Then it scores all those possibilities, based on how beneficial they are in satisfying its internal needs. Finally, it picks an appropriate one based on the score, finds what concrete sequence of actions it requires, and pushes those onto its action queue. The highest-level AI loop looks like this: • While there are actions in the queue, pop the next one off, perform it, and maybe get a reward. • If you run out of actions, perform action selection, based on current needs, to find more actions. • If you still have nothing to do, do some fallback actions.
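The loop above can be sketched in a few lines. This is an illustrative Python skeleton (class and method names are invented, not from the original text); the interesting work happens inside the action-selection step, covered next:

```python
from collections import deque

class Agent:
    """Minimal sketch of the highest-level loop: drain the action queue,
    refill it via action selection, and fall back to idling."""
    def __init__(self, select_actions):
        self.queue = deque()
        self.select_actions = select_actions  # the action-selection step
        self.performed = []

    def tick(self):
        if not self.queue:
            # Out of actions: run action selection based on current needs.
            self.queue.extend(self.select_actions())
        if self.queue:
            self.performed.append(self.queue.popleft())
        else:
            self.performed.append("idle")  # fallback action

a = Agent(select_actions=lambda: ["walk_to_fridge", "prepare_food", "eat"])
for _ in range(3):
    a.tick()
assert a.performed == ["walk_to_fridge", "prepare_food", "eat"]
```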
That second step, the action selection point, is where the actual choice happens. It decomposes as follows: 1. Examine objects around you and find out what they advertise. 2. Score each advertisement based on your current needs. 3. Pick the best advertisement and get its action sequence. 4. Push the action sequence on your queue. The next sections will delve more deeply into each of these steps. Needs Needs correspond to individual motivations—for example, the need to eat, drink, or rest. The choice of needs depends very much on the game. The Sims, being a simulator of everyday people, borrowed heavily from Maslow's hierarchy (a theory of human behavior based on increasingly important psychological needs) and ended up with a mix of basic biological and emotional drivers. A different game should include a more specific set of motivations, based on what the agents should care about in their context. Inside the engine, needs are routinely represented as an array of numeric values, which decay over time. In this discussion we use the range of [0, 100]. Depending on the context, we use the term "need" to describe both the motivation itself (written in boldface—for example, hunger) and its numeric value (for example, 50). Needs routinely have the semantics of "lower is worse and more urgent," so that hunger=30 means "I'm pretty hungry," while hunger=90 means "I'm satiated." Need values should decay over time to simulate unattended needs getting increasingly worse and more urgent. Performing an appropriate action then refills the need, raising it back to a higher value. For example, we simulate agents getting hungry if they don't eat by decaying the hunger value over time. Performing the "eat" action would then refill it, causing it to become less urgent (for a while). Advertisements and Action Selection When the time comes to pick a new set of actions, the agent looks at what can be done in the environment around them and evaluates the effect of the available actions.
Each object in the world advertises a set of action/reward tuples—some actions to be taken, with a promise that they will refill some needs by some amount. For example, a fridge might advertise a "prepare food" action with a reward of +30 hunger and "clean" with a reward of +10 environment. To pick an action, the agent examines the various objects around them and finds out what they advertise. Once we know what advertisements are available, each of them gets scored, as described in the next section. The agent then picks the best advertisement using the score and adds its actions to their pending action queue. Advertisement Decoupling Please notice that the discovery of what actions are available is decoupled from choosing among them: The agent "asks" each object what it advertises, and only then scores what's available. The object completely controls what it advertises, so it's easy to enable or disable actions based on object state. This provides great flexibility. For example, a working fridge might advertise "prepare food" by default; once it's been used several times, it also starts advertising "clean me"; finally, once it breaks, it stops advertising anything other than "fix me" until it's repaired. Without this decoupling, imagine coding all those choices and possibilities into the agent itself, not just for the fridge but also for all the possible objects in the world—it would be a disaster and impossible to maintain. On the other side of the responsibility divide, the agent can also be selective about what kinds of advertisements it accepts. We can use this to build different agent subtypes or personalities. For example, in a later section we will describe how to use advertisement filtering to implement child agents with different abilities and opportunities than adults.
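The fridge example can be sketched directly; the object owns its advertisement logic, and the agent never needs to know why an action appeared or disappeared. In this illustrative Python sketch, the +30 hunger and +10 environment rewards come from the text, while the "fix me" reward, the use-count threshold, and the data layout are invented:

```python
class Fridge:
    """The object alone decides what it advertises, based on its state."""
    def __init__(self):
        self.uses = 0
        self.broken = False

    def advertisements(self):
        if self.broken:
            # Reward value for "fix me" is invented for illustration.
            return [{"action": "fix me", "reward": {"environment": +15}}]
        ads = [{"action": "prepare food", "reward": {"hunger": +30}}]
        if self.uses >= 5:  # dirty after several uses
            ads.append({"action": "clean", "reward": {"environment": +10}})
        return ads

fridge = Fridge()
assert [ad["action"] for ad in fridge.advertisements()] == ["prepare food"]
fridge.uses = 6
assert [ad["action"] for ad in fridge.advertisements()] == ["prepare food", "clean"]
fridge.broken = True
assert [ad["action"] for ad in fridge.advertisements()] == ["fix me"]
```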
Advertisement Scoring Once we have an object's advertisements, we need to score them and stack them against all the other advertisements from other objects. We score each advertisement separately, based on the reward it promises (for example, +10 environment) and the agent's current needs. Of course it's not strictly necessary that those rewards actually be granted as promised; this is known as false advertising, and it can be used with some interesting effects, as described later. Here are some common scoring functions, from the simplest to the more sophisticated: A. Trivial scoring future_value(need) = current_value(need) + advertised_delta(need) score = ∑ over all needs: future_value(need) Under this model, we go through each need, look up the promised future need value, and add them up. For example, if the agent's hunger is at 70, an advertisement of +20 hunger means the future value of hunger will be 90; the final score is the sum of all future values. This model is trivially easy and has significant drawbacks: It's only sensitive to the magnitude of changes, and it doesn't differentiate between urgent and non-urgent needs. So increasing hunger from 70 to 90 has the same score as increasing thirst from 10 to 30—but the latter should be much more important, considering the agent is very thirsty! B. Attenuated need scoring Needs at low levels should be much more urgent than those at high levels. To model this, we introduce a non-linear attenuation function for each need. So the score becomes: score = ∑ over all needs: A_need(future_value(need)) where A_need is the attenuation function for that need, mapping from a need value to some numeric value. The attenuation function is commonly non-linear and non-increasing: It starts out high when the need level is low and then drops quickly as the need level increases. For example, consider the attenuation function A(x) = 10/x.
An action that increases hunger to 90 will have a score of 1/9, while an action that increases thirst to 30 will have a score of 1/3, so three times higher, because low thirst is much more important to fulfill. These attenuation functions are a major tuning knob in needs-based AI. You might also notice one drawback: Under this scheme, improving hunger from 30 to 90 would have the same score as improving it from 50 to 90. Worse yet, worsening hunger from 100 to 90 would have the same score as well! This detail may not be noticeable in a running system, but it’s easy to fix by examining the need delta as well.

C. Attenuated need-delta scoring

It’s better to eat a filling meal than a snack, especially when you’re hungry, and it’s worse to eat something that leaves you hungrier than before. To model this, we can score based on the need level difference:

score = Σ over all needs of ( A[need](current_value[need]) − A[need](future_value[need]) )

For example, let’s consider our attenuation function A(x) = 10/x again. Increasing hunger from 30 to 90 will now score 1/3 – 1/9 = 2/9, while increasing it from 60 to 90 will score 1/6 – 1/9 = 1/18, so only a quarter as high. Also, decreasing hunger from 100 to 90 will have a negative score, so it will not be selected unless there is nothing else to do.

Action Selection

Once we know the scores, it’s easy to pick the best one. Several approaches for arbitration are standard:
• Winner-takes-all: The highest-scoring action always gets picked.
• Weighted-random: Do a random selection from the top n (for example, top three) high-scoring advertisements, with probability proportional to score.
• Other approaches are easy to imagine, such as a priority-based behavior stack.

In everyday implementation, weighted-random is a good compromise between having some predictability about what will happen and not having the agent look unpleasantly deterministic.
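The three scoring models can be sketched in a few lines each. This is a minimal sketch with illustrative names (`Needs`, `ScoreTrivial`, and so on are not from the text), using the running example attenuation A(x) = 10/x:

```cpp
#include <map>
#include <string>

// Need levels and advertised deltas, keyed by need name (illustrative layout).
using Needs = std::map<std::string, float>;

float Attenuate(float x) { return 10.f / x; }  // the chapter's example A(x) = 10/x

float DeltaFor(const Needs& delta, const std::string& need) {
    auto it = delta.find(need);
    return it == delta.end() ? 0.f : it->second;
}

// Model A: sum of promised future need values.
float ScoreTrivial(const Needs& current, const Needs& delta) {
    float score = 0.f;
    for (const auto& [need, value] : current)
        score += value + DeltaFor(delta, need);
    return score;
}

// Model B: attenuate each future value so low (urgent) needs dominate.
float ScoreAttenuated(const Needs& current, const Needs& delta) {
    float score = 0.f;
    for (const auto& [need, value] : current)
        score += Attenuate(value + DeltaFor(delta, need));
    return score;
}

// Model C: score the change in attenuated level; worsening a need scores negative.
float ScoreNeedDelta(const Needs& current, const Needs& delta) {
    float score = 0.f;
    for (const auto& [need, value] : current)
        score += Attenuate(value) - Attenuate(value + DeltaFor(delta, need));
    return score;
}
```

With the chapter’s numbers, model A rewards raw magnitude only, model B favors the urgent +20 thirst over the comfortable +20 hunger, and model C gives negative scores to actions that worsen a need.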
Action Selection Additions

The model described earlier can be extended in many directions to add more flexibility or nuance. Here are a few additions, along with their advantages and disadvantages:

A. Attenuating score based on distance

Given two objects with identical advertisements, an agent should tend to pick the one closer to them. We can do this by attenuating each object’s score based on distance or containment:

score = D( Σ over all needs of ( … ) )

where D is some distance-based attenuation function, commonly a non-increasing one, such as the physically inspired D(x) = x / distance². However, distance attenuation can be difficult to tune, because a distant object’s advertisement will be lowered not just compared to other objects of this type, but also compared to all other advertisements. This may lead to a “bird in hand” kind of behavior, where the agent always prefers a much worse action nearby rather than a better one farther away.

B. Filtering advertisements before scoring

It’s useful to add prerequisites to advertisements. For example, kids should not be able to operate stoves, so the stove should not advertise the “cook” action to them. This can be implemented in several ways, from simple attribute tests to a full language for expressing predicates. It’s often best to start with a simple filter mechanism, because complex prerequisites are more difficult to debug when there are many agents running around. An easy prerequisites system could be as simple as setting Boolean attributes on characters (for example, is-adult, and so on) and adding an attribute mask on each advertisement; action selection would only consider advertisements whose mask matches up against the agent’s attributes.

C. Tuning need decay

Agents’ need levels should decay over time. This causes agents to change their priorities as they go through the game. We can tune this system by modifying need decay rates individually.
For example, if an agent’s hunger doesn’t decay as quickly, they will not need to eat as often and will have more time for other pursuits. We can use this to model a bare-bones personality profile—for example, whether someone needs to eat/drink/entertain themselves more or less often. It can also be used for difficulty tuning—agents whose needs decay more quickly are harder to please.

D. Tuning advertisement scores

The scoring function can also simulate simple personality types directly, by tuning down particular advertisement scores. To do this, we would have each agent contain a set of tuning parameters, one for each need, that modify that need’s score:

new_score[agent, need] = old_score[agent, need] × tuning[agent, need]

For example, by tuning down the +hunger advertisement’s score, we’ll get an agent that has a stronger preference for highly fulfilling food; tuning up a +thirst advertisement will produce an agent that will happily opt for less satisfying drinks, and so on.

E. Attenuation function tuning

Attenuation functions map from low need levels to high scores. Each need can be attenuated differently, since some needs are more urgent than others. As such, they are a major tuning knob in games, but a delicate one, because their effects are global, affecting all agents. This requires good design iterations, but analytic functions (for example, A(x) = 10/x) are not easy for designers to tweak or reason about. A happy medium can be found by defining attenuation functions using piecewise-linear functions (in other words, point pairs that define individual straight-line segments, rather than continuous, analytic formulas). These can be stored and graphed in a spreadsheet file and loaded during the game.

Action Performance

Having chosen something to do, we push the advertisement’s actions on the agent’s action queue, to be performed in order. Each action would routinely be a complete mini-script.
For example, the stove’s “clean” action might be a small script that:
• Animates the agent getting out a sponge and scrubbing the stove
• Runs the animation loop and updates an animated stove condition meter
• Grants the promised reward

It’s important that the actual reward be granted manually as part of the action, and not be awarded automatically. This gives us two benefits:
• Interrupted actions will not be rewarded.
• Objects can falsely advertise and not actually grant the rewards they promised.

False advertisement is an especially powerful but dangerous option. For example, suppose that we have a food item that advertises a hunger reward but doesn’t actually award it. A hungry agent would be likely to pick that action—but since they got no reward, at the next selection point they would again likely pick it, and then again, and again. This quickly leads to very intriguing “addictive” behaviors. This may seem like a useful way to force agents to perform an action. But it’s just as hard to make them stop once they’ve started. False advertisements create action loops that are very difficult to tune. In practice, forcing an action is more easily done by just pushing the desired action on the agent’s action queue.

Action Chaining

Performing a complex action, such as cooking a meal, usually involves several steps (such as preparing and cooking) and several objects (a fridge, a cutting board, a stove). This sequence must not be atomic—steps can be interrupted, or they can fail due to some external factors. Complex sequences are implemented by chaining multiple actions together. For example, eating dinner might decompose into several separate actions:
• Take a food item from the fridge.
• Prepare the food item on a counter.
• Cook the food item on the stove.
• Sit down and eat, thereby getting a hunger reward.
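The manual reward grant can be sketched as an action whose payout runs only as its final, completed step. This is a hedged sketch: `QueuedAction`, `Step`, and the interruption flag are illustrative, not the chapter’s actual implementation.

```cpp
#include <deque>
#include <functional>

// Sketch: the reward is a separate closure, invoked only when the action
// runs to completion, so interrupted actions are never rewarded.
struct QueuedAction {
    std::function<void()> perform;      // animation / script body
    std::function<void()> grantReward;  // final step: apply the promised deltas
};

struct ActionQueueAgent {
    std::deque<QueuedAction> queue;
    float hunger = 50.f;  // illustrative need value

    // Runs the front action; 'interrupted' models an external abort.
    void Step(bool interrupted) {
        if (queue.empty()) return;
        QueuedAction a = queue.front();
        queue.pop_front();
        a.perform();
        if (!interrupted)   // reward granted manually, on completion only
            a.grantReward();
    }
};
```

A falsely advertising object would simply supply an empty `grantReward`, which is exactly why the resulting loops are hard to tune.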
It would be suboptimal to implement this as a single action; there is too much variability in the world for it to always work out perfectly. We can create action sequences in two ways. The simpler way is to just manufacture the entire sequence of actions right away and push the whole thing on the agent’s queue. Of course, these steps can fail, in which case the remaining actions should also be aborted. For some interesting side effects, aborting an action chain could create new actions in its place. For example, a failed “cook food” action sequence could create a new “burned food” object that needs to be cleaned up. The second method, more powerful but more difficult, is to implement action chaining by “lazy evaluation.” In this approach, only one action step is created and run at a time, and when it ends, it knows how to create the next action and front-loads it on the queue. For an example of how that might look, consider eating dinner again. The refrigerator’s advertisement would specify only one action: “take food.” That action, toward the end, would then find the nearest kitchen counter object, ask it for the “prepare food” action, and load that on the queue. Once “prepare food” was done, it would find the nearest stove, ask it for a new “cook food” action, and so on. Lazy action chaining makes it possible to modify the chain based on what objects are available to the agent. For example, a microwave oven might create a different “cook food” action than a stove would, providing more variety and surprise for the player. Second, it makes interesting failures easier. For example, the stove can look up some internal variable (for example, repair level) to determine failure and randomly push a “create a kitchen fire” action instead. In either case, using an action queue provides nice modularity. Sequences of smaller action components are more loosely coupled and arguably more maintainable than standard state machines.
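The lazy-evaluation scheme can be sketched with a step factory that creates the follow-up action only when the current one completes. The step names follow the dinner example; the code structure itself is an illustrative sketch, not the chapter’s implementation (a real version would query nearby objects rather than a fixed table).

```cpp
#include <deque>
#include <string>
#include <vector>

// Sketch: each completed step decides, lazily, what the next step is.
// In a full system this lookup would ask the nearest counter/stove/etc.
std::string NextStepFor(const std::string& step) {
    if (step == "take food")    return "prepare food";  // ask nearest counter
    if (step == "prepare food") return "cook food";     // ask nearest stove
    if (step == "cook food")    return "eat";           // grants the hunger reward
    return "";                                          // chain finished
}

std::vector<std::string> RunChain(const std::string& firstStep) {
    std::deque<std::string> queue{ firstStep };
    std::vector<std::string> performed;
    while (!queue.empty()) {
        std::string step = queue.front();
        queue.pop_front();
        performed.push_back(step);
        std::string next = NextStepFor(step);  // created only on completion
        if (!next.empty())
            queue.push_front(next);            // front-load onto the queue
    }
    return performed;
}
```

Because each step chooses its successor at completion time, swapping the stove for a microwave (or injecting a “create a kitchen fire” failure step) changes the chain without touching the earlier steps.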
Action Chain State Saving

When an action chain is interrupted, we might want to be able to save its state somehow so that it gets picked up later. Since all actions are done on objects, one way to do this is to mutate the state of the object in question. For example, the progress of “cleaning” can be stored as a separate numeric cleanness value on an object, which gets continuously increased while the action is running. But sometimes actions involve multiple objects, or the state is more complicated. Another way to implement this is by creating new state objects. An intuitive example is food from the original Sims: The action of prepping food creates a “prepped food” object, cooking then turns it into a pot of “cooked food,” which can be plated and turned into a “dinner plate.” The state of preparation is then embedded right in the world; if the agent is interrupted while prepping, the cut-up food will just sit there until the agent picks it up later and puts it on the stove.

Design Consequences of Needs-Based AI

With the technical details of needs-based AI behind us, let’s also consider some of the design implications of this style of development, since it’s different from more traditional techniques. First of all, the player’s experience with this AI really benefits from adding some feedback to the agents. Developers can just look at the internal variables and immediately see “the agent is doing this because it’s hungry, or sleepy, or other such.” But the player will have no such access and is likely to build an entirely different mental model of what the agent is doing. Little bits of feedback, like thought bubbles about what needs are being fulfilled, are easy to implement and go a long way toward making the system comprehensible to the player. A second point is about tuning. Some of the tunable parameters have global effect and are therefore very difficult to tune after the game has grown past a certain size.
The set of needs, their decay rates, score attenuation functions, and other such elements will apply to all characters in the game equally, so tuning them globally requires a lot of testing and a delicate touch. If a lot of variety is desired between different parts of the game, it might be a good idea to split the game into a number of smaller logical partitions (levels, and so on) and have a different set of those tunable parameters, one for each partition. Ideally, there would be a set of global tuning defaults, which work for the entire game, and each partition could specifically override some of them as needed. Partitioning and overriding tuning values buys us greater flexibility, although at the cost of having to tune each partition separately. Third, this AI approach tends heavily toward simulation and makes it hard to do scripted scenes or other triggered actions. Imagine implementing some actions on a trigger, such as having the agent approach the player when he comes into view. One might be tempted to try to implement that using just needs and advertisements, but the result will be brittle. If particular one-off scripted behaviors are desired, it would be better to just manually manufacture appropriate action sequences and forcibly push them on the agent’s action queue. But in general, this overall approach is not very good for games that need a lot of triggered, scripted sequences (for example, shooter level designs). Needs-based AI works better for simulated worlds than for scripted ones.

Conclusion

Needs-based AI is computationally very efficient; only a trivial amount of the CPU is required to pick what to do and to handle the resulting action sequence. The system’s internals are very easy to understand; by just inspecting the agent’s internal needs values, you can get a good idea of why it does what it does.
And by externalizing the set of possible actions into the world, the system also achieves great modularity—the AI can be “reconfigured” literally by adding or removing objects around the agent. In spite of unusual design consequences, the needs-based approach is very capable, easy to implement, and effective at creating good characters. It’s a powerful tool for many situations.

Acknowledgements

Thanks to Ken Forbus and Richard Evans, from whom I’ve learned most of what I know about this style of AI.

References

[Arkin98] Arkin, R. Behavior-Based Robotics. MIT Press, 1998.
[Forbus02] Forbus, Ken, and Will Wright. “Simulation and Modeling: Under the Hood of The Sims.” Northwestern University, 2002.
[Simpson] Simpson, Jake. “Making The Sims the Sims.” n.d.

3.9 A Framework for Emotional Digital Actors

Phil Carlisle

In this gem, we will describe a framework that may be used to endow game characters with some level of emotional behavior. Using a simple behavior tree implementation augmented with supporting appraisal and blackboard implementations, we will demonstrate that emotions can be easily implemented and can enhance the behavior and expression of game characters. We use the term “digital actor” because although the emotional system we present is based on sound academic research in psychology and cognitive science, the intention is to produce the “illusion” of emotion, rather than specifically trying to model emotion itself. However, the reader is advised to read [Oatley06] as an accessible introduction to emotional psychology, which is a useful starting point for anyone trying to add more emotion to their games. The first section of this gem will describe models of personality, mood, and emotion derived from academic literature in the area. The second section will describe how these models are incorporated into an emotional framework.
The third section will describe how the framework is used in a number of game scenarios. In conclusion, we will offer some ideas for further work.

Models of Emotion, Mood, and Personality

In addition to emotion, we need to represent both personality and mood in order to have a complete emotional framework. It is useful to consider these elements in terms of the timescale required for change. Typically, our personality changes very slowly, if at all. It can take many years for our personalities to alter. Mood, on the other hand, tends to change in a relatively shorter period of days or weeks. Finally, we have emotions, which are frequently expressed minute to minute and can often only be portrayed for fleeting seconds. For instance, it is not unusual to hear someone described as having a “quick temper.”

Emotion

There are many theories of emotion and models of personality associated with them. For example, [Eyesenck65] describes models of personality and emotion often cited in academic literature. One model of emotion used in a large volume of academic literature is the OCC model [Ortony88] created by Ortony, Clore, and Collins. This model is discussed very well in [Bartneck02], and, as is the case with many academic implementations, we are going to use a subset of the OCC model for our emotional representation. The OCC model typically represents 22 discrete emotional categories. In practice, this is probably too complex a model for most video games; thus, the model should be simplified based on the requirements for a particular game. The OCC model, broadly speaking, breaks down emotions into three categories: emotional reactions to objects, events, and agents. Each category is then broken down further depending on whether the emotion is related to self or other and whether the emotion is considered a good thing or a bad thing.
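A simplified, OCC-style reaction record along those three axes might be sketched as follows. This is a hedged sketch of the simplification idea only, not the full 22-category model; the enumerator and function names are hypothetical.

```cpp
// Sketch: classify an emotional reaction along the axes described above:
// what caused it, whom it concerns, and whether it is good or bad.
enum class Cause  { Object, Event, Agent };
enum class Target { Self, Other };

struct EmotionalReaction {
    Cause cause;
    Target target;
    float valence;  // > 0 for a good thing, < 0 for a bad thing
};

// Example: a reaction to a disliked agent gaining something desirable
// (the -0.5 intensity is an illustrative value).
EmotionalReaction ReactionToRivalGain() {
    return { Cause::Agent, Target::Other, -0.5f };
}
```

A game-specific subset would add only the category distinctions its design actually surfaces to the player.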
The complexity arises because we have to account for the consequences for others and our relationship with them. Consider the following case:
• Agent A likes apples.
• Agent B likes apples and Agent A.
• Agent C likes apples but not Agent A.

In a simulation with all three agents and one apple, assuming that Agent A acquires the apple, we then have the following emotional reactions:
• Agent A is happy (due to acquiring the apple).
• Agent B is unhappy at not acquiring the apple but happy at Agent A acquiring the apple.
• Agent C is unhappy at not acquiring the apple and is even less happy at seeing Agent A acquire the apple.

These are relatively simple direct relationships between goal (acquire apple, self/friend) and emotion (happy, sad). But human emotion is a little more complicated. In this case, Agent C may immediately feel unhappy, but that feeling may then cause him to feel ashamed that he was made to feel unhappy by the actions of A. Clearly, we are going to have to simplify the model somewhat to be useful in a video-game context, but be aware that sometimes the most interesting emotional expression comes from the feelings created by social situations exactly like this.

Mood

Mood is generally more changeable than personality. However, it is often represented very simply. A useful paper that represents mood simply is [Egges04], which maps mood to a single floating-point number in the range of –1..1 to represent negative or positive moods, respectively. Typically, mood changes over weeks or months and acts as an overall bias that changes as different emotional events occur. Interestingly, mood can color our perception of emotional events. For instance, people in a negative mood may view all emotional events as negative, even if the event is generally to their advantage. In the case of the earlier example, a negative mood might be used within the behavior tree to disable certain actions.
For instance, if Agent A is in a highly negative mood, even the perception of an apple may not trigger the behavior tree action required to seek attainment of the apple. This can be accomplished by adding knowledge of the available apple to the agent’s blackboard only when its mood is higher than a given threshold, although care must be taken to allow the desire to maintain the agent’s health (by acquiring food, for example) to override this behavior. See the description of the appraisal class later in this gem for further information.

Personality

Personality is often represented in the academic literature via the OCEAN model of [McRae96]. This personality model represents a person’s personality across five dimensions:
• Openness
• Conscientiousness
• Agreeability
• Extroversion
• Neuroticism

The values portrayed for each of the five dimensions are typically in the range of 0..1, where, for example, 0 for openness means that the person is fully closed off and 1 means the person is fully open. Essentially, the model of personality provides a default biasing mechanism for further formulas and allows us to represent different personality types by simply configuring each dimension with different default floating-point values. Personality is the least changeable part of our emotional makeup, so it is reasonable to simply store personality as a series of fixed values. Personality provides bias toward a particular set of possible actions within the behavior tree. Referring to the earlier example scenario for Agent B, given the choice of two potential actions of either trying to obtain the apple for oneself or allowing Agent A to obtain the apple, the personality model can be used to select from the two choices. An agent with high agreeability would choose to allow Agent A to obtain the apple, whereas the converse would simply try to obtain the apple for itself.
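As a sketch, the five dimensions can be stored as fixed floating-point values and consulted when choosing between actions. The field defaults and the 0.5 threshold below are illustrative assumptions, not values from the text.

```cpp
// Sketch of an OCEAN personality record: five fixed values in 0..1.
struct Personality {
    float openness = 0.5f;
    float conscientiousness = 0.5f;
    float agreeability = 0.5f;
    float extroversion = 0.5f;
    float neuroticism = 0.5f;
};

// Example bias from the apple scenario: a highly agreeable agent
// lets the liked agent take the apple instead of grabbing it.
bool YieldsAppleToFriend(const Personality& p) {
    return p.agreeability > 0.5f;  // illustrative threshold
}
```

Because the record is just a handful of constants per agent, different personality types are authored by changing the defaults, with no per-agent code.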
Thus, the personality model allows us to create unique behavior for each agent without the need for per-agent behavior trees.

The Emotional Framework

Given the emotional model described in the previous section, we need to be able to incorporate code that represents the model within an architecture that enables it to affect our characters’ behavior. In this example, we will incorporate the emotional model by implementing an appraisal class, which modifies values within a character’s blackboard. (See [Isla02] for information concerning blackboards.) The blackboard will be inspected by a simple behavior tree in order to incorporate the emotional values within the characters’ update logic, and the same mechanism can also be used to incorporate the emotional values with the characters’ movement logic—for example, animating with a sullen walk cycle if the emotional mood is negative. A very good reference concerning the use of behavior trees and emotional models is the work of the Oz project group at Carnegie Mellon University [Reilly96], which went on to be used in the game Façade.

Appraisal/Arousal

Typically, emotion is broken up into appraisal, where goals are created and events and objects are classified, and arousal, where the magnitude of the reaction to the sensory input is processed. The purpose of the appraisal class is to map any sensory input to changes in the agent’s emotional variables and blackboard data. In our framework, for convenience, this class handles both appraisal and arousal. During appraisal, sensory input and event data are fed to the appraisal class, which then determines the appropriate changes in blackboard data. During the appraisal, data may be added to the knowledge representation, changes may be made to the emotional variables, and in certain cases input may be ignored.
Similarly, the appraisal class is responsible for determining the changes occurring from the success or failure of the agent’s goals, which may again alter the agent’s blackboard or emotional variables.

Figure 3.9.1 The emotional framework architecture.

Knowledge Representation

There is a choice to be made with regard to the appraisal class relating to its usage of memory and dynamic data structures. In a human context, we are capable of learning about new objects we have not encountered before, new events that happen, or new agents (human or otherwise) that we meet. This implies that we are able to store information as we build up an emotional picture relating to these objects, events, and agents. In a game context, we may or may not be able to spare the memory to process previously unknown information. The easiest case for implementation, and perhaps the more robust case for design, is that we explicitly determine all possible known objects/events/agents a priori and simply load that information at run time. This is the method chosen for the example implementation. An easy method of expanding on this simple implementation is to incorporate a classifier system such that instead of storing a reaction to an individual object/event/agent, we classify them and store a reaction to the “class” rather than a specific instance of the class. An alternative method is to simply place an upper limit on the number of memories stored dynamically, allowing new memories to be created. These memories would then expire over time to allow further memories to be formed. In addition, this scheme may be extended with different limits on the number of memories stored for each type of event, thus allowing for more “important” memories to be stored. The use of a blackboard allows us to store all per-agent data in a generic structure that allows access from all of our actor’s systems. Typically, this is used to store agent goals or attributes, such as the currently selected target enemy.
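A minimal per-agent blackboard along these lines might look like the following. This is a hypothetical sketch; the entry layout, names, and linear lookup are illustrative assumptions rather than the chapter’s actual implementation.

```cpp
#include <string>
#include <vector>

// Sketch of a minimal per-agent blackboard: one list per category, each
// entry a name/value pair.
struct BlackboardEntry {
    std::string name;
    float value;  // e.g. valence toward an agent, threat level of an event
};

struct Blackboard {
    std::vector<BlackboardEntry> objects;
    std::vector<BlackboardEntry> agents;
    std::vector<BlackboardEntry> events;
};

// The common retrieval path: look up the stored value for a sensed name.
// This must stay cheap, since it runs for every sensed object/agent/event.
const BlackboardEntry* Find(const std::vector<BlackboardEntry>& list,
                            const std::string& name) {
    for (const BlackboardEntry& e : list)
        if (e.name == name) return &e;
    return nullptr;
}
```

For small per-agent lists a linear scan is fine; a larger knowledge base would want a hashed container or the semantic-network alternative mentioned below.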
However, the blackboard is a useful storage mechanism to employ for the creation of emotional agents. It is worth bearing in mind that the most common case of retrieval from the blackboard—to retrieve a particular value associated with a new sensed object/agent/event—must be as efficient as possible. An alternative structure, such as a semantic network [Sowa92], may prove to be a far better solution, if slightly more complex in terms of code. For convenience, each agent’s blackboard and behavior tree configurations are parsed from XML data. This allows for run-time configuration of each agent, using a unique blackboard, a unique behavior tree, or both. In an example blackboard specification for an agent, each sub-element (with a tag such as <object>) defines a unique structure that is stored within the blackboard. Each element is stored internally as a simple std::vector of the appropriate type.

The Appraisal Process

To modify our behavior based on the emotional framework, we need to consider the steps that occur when new emotional responses are required.

Sensing a New Object

An agent’s perception system typically responds to queries instigated by its behavior tree (which we will refer to as a BT for brevity’s sake). For example, the BT may have executed a sequence of nodes that resulted in a query for the availability of nearby food. Our goal in this case is to determine the agent’s emotional reaction to each sensed object—specifically, the more useful case of allowing the agent to determine the selection of which food object to try to obtain based on the emotional reward associated with those available. Consider the case where a query returns three different food items within the query radius.
In the simplest case, we can simply determine our like/dislike of the available food items based on a simple classification, such as whether the item is fruit or vegetable or whether the item is sweet or sour. In the ideal case, we need to consider our past experiences either with the unique object or with objects with a similar classification. It is beyond the scope of this gem to discuss the intricacies of human memory and its method of classification. Another aspect of this pattern of memory is that the relative novelty of an object can greatly alter the intensity of the reaction to the object. An agent who is unused to seeing guns may react significantly to the sight of an armed friend, whereas a gangster would be less likely to have a similar reaction. The final aspect we should consider when dealing with objects is the penalty or reward associated with interactions. For example, an agent may have a strong liking for apples, but if the agent consumes an apple that is sour, it should have some effect on the subsequent desire for more apples. In practical terms, in the case of our object queries, we first use the appraisal class to return preference values for each object in turn. We then simply choose the object with the highest appraisal value for interaction. Once an object has been selected, we store that object as a goal within the blackboard as an object to obtain. It is important to constrain our memory usage at this point, as new objects may be perceived quite often and marked for attainment. We can achieve this by attaching an expiration value to each new object attainment goal. The blackboard then removes all expired goals within its update loop. Once an object is attained (typically via another node in the behavior tree), we then consider any attainment goals relating to the object. If a specific attainment goal is found within the blackboard, we then consider the arousal value of achieving the specific goal.
This arousal processing takes into account the goals for attainment of the object. There are two major reactions to consider here. The first is that the agent must consider his own reaction to the object. Typically, we would try to obtain objects we like, but if the game allows attaining of objects by other means—for example, by allowing agents to simply give objects to each other—there may be negative consequences. An agent who obtains a ticking time bomb should definitely not be happy about its attainment. The second reaction to the attainment of an object is with respect to other agents. An agent who obtains an object that is highly desired by another agent, depending on whether the other agent is liked or disliked, may feel guilty or happy for acquiring the item, respectively. Alternatively, if no specific attainment goals are stored in the blackboard, we can simply consider the attainment goals of other agents for the object, or we can consider the general valence of the object and react based on our positive or negative feelings about it. For example, an object that is attained may allow us to accomplish a goal for another agent if we give it to him. In this case, we may simply decide to create a goal to pass on the object if it achieves a goal of an agent we have positive affect towards, or we may decide to keep or dispose of the object if it denies a goal of an agent we have negative affect towards.

Sensing a New Agent

The term “agent” in the OCC model does not describe an AI agent, but instead describes an agent of change. Typically, these are other characters in a game context. However, this is not always the case, and an agent in OCC terms may be some other external force that has an effect on the world. In the most common case, the agents in our emotional framework will actually describe characters within the game world. With this in mind, we will use the term “agent” to mean both the OCC model of an agent and the AI game character agent.
Modeling of inter-agent affect affords us some unique social interactions, such as the seemingly altruistic act of passing on an object. The most obvious use of agent knowledge when considering emotion is for a like/dislike evaluation. This can be simply stored as a positive or negative value associated with a unique identifier. In the example framework, agents are stored by name with a valence value associated with them within the blackboard. This value is useful for when any opportunities are presented to the agent, such as denying another agent a resource or being able to give another agent something they require. This brings up an important point about agent-to-agent interactions. Typically, when humans interact, they create mental models of the motivations of the interacting agent in order to determine how to proceed with the interaction. For instance, when a human considers giving a gift to another, they try to imagine the reaction to the gift the other person will have, using this as a method of deciding whether to proceed with the interaction. This agent mental modeling is important for social interactions; however, it is problematic for games because of the amount of memory and processing time required for implementation. The problem is compounded by the notion that humans often model other humans’ models of themselves (in other words, how does this person feel about us?). In the example framework, we have decided to leave this mental modeling unimplemented for the sake of brevity and for practical purposes. However, for a truly deep social simulation, this modeling is a highly desirable feature.

Sensing an Event

A great deal of an agent’s behavior will generally stem from sensing some event that occurs within the world. This event could be an object attainment event, for self or for other.
It might also be an important event that requires immediate action, such as hearing a grenade drop at the agent's feet. In this situation, the appraisal class uses its understanding of available events to create the associated knowledge within the agent's blackboard. In the case of the grenade event, the appraisal class simply adds a threat object with a high reaction value to the blackboard. This allows a behavior tree with a "respond to threat" branch to take the appropriate actions. The reason it is useful to pass events via the appraisal class is that the importance of an event can change over time as an agent responds to more of the same event. Consider the grenade event. There is a radius within which we can expect to take damage, but outside of that radius, the reaction to the grenade can change depending on how many times we have seen grenades explode. If we know from experience that outside of a certain radius we may sustain injury but that the injury is entirely random, then we may, over time, be conditioned to simply block the grenade from thought. This effect can happen in any stressful situation where our emotions allow us to regulate our reactions; we essentially become "numb" over time to what would normally be highly stressful situations. This change in attitude toward events is due to the appraisal/arousal process. Essentially, this is a feedback loop that changes an agent's response over time as the agent adds positive or negative arousal to the event depending on the event's outcome; in addition, the dulling of the arousal is itself subject to some decay. In essence, this means that events that occur frequently become less arousing emotionally, but if the event has not occurred for some time, the arousal may once again be relatively high. In the example framework, this dulling of arousal values for events is implemented by applying a simple scaling factor to the arousal value for the event.
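A minimal sketch of this dulling feedback, with invented names and guessed tuning constants (the chapter does not give specific values, and it models recovery as a gradual drift back toward full arousal):

```cpp
#include <algorithm>
#include <map>
#include <string>

// Per-event-type arousal dulling. Each occurrence of an event type reduces
// its scale factor (habituation); while the event stays absent, the scale
// drifts back toward 1.0 so the event becomes arousing again.
class EventArousal {
public:
    // Called when an event of this type is perceived; returns dulled arousal.
    float OnEvent(const std::string& type, float baseArousal) {
        float& scale = Scale(type);
        float arousal = baseArousal * scale;
        scale = std::max(0.0f, scale - kDullingStep);   // become "numb"
        return std::clamp(arousal, 0.0f, 1.0f);         // arousal stays in 0..1
    }

    // Called once per update tick; arousal recovers while events are absent.
    void Update(float dt) {
        for (auto& entry : scales_)
            entry.second = std::min(1.0f, entry.second + kRecoveryRate * dt);
    }

private:
    float& Scale(const std::string& type) {
        return scales_.try_emplace(type, 1.0f).first->second;  // default 1.0
    }
    static constexpr float kDullingStep  = 0.2f;   // tuning values are guesses
    static constexpr float kRecoveryRate = 0.05f;  // recovery per second
    std::map<std::string, float> scales_;
};
```

The first grenade is fully arousing; repeated grenades are progressively less so, and a long quiet period restores the original response.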
Over time, the scaling factor is reset to 1.0, with any events of that type causing the scale factor to reduce slightly. Thus, the arousal associated with the event can move between 0 and 1 depending on the frequency of the event. Another aspect of the appraisal class is the incorporation of personality and mood into the emotional outcomes expressed in the blackboard. In effect, personality and mood modulate the intensity of the emotional reaction to any given stimulus. For a good introduction to why this is important, see [Eckman04]. To correctly simulate the effect of personality, mood, varying arousal, and decay, we apply different functions for calculating the effect of any given emotional stimulus and then apply the results of these functions to the agent's blackboard. For event stimuli, the first role of the appraisal is to determine whether to respond to the event at all. Some events can simply be ignored, especially when in a state of high overall arousal. For instance, if a grenade event is perceived, any subsequent event is blocked from being perceived until the agent has dealt with the response. This simulates the effect of our emotions, which act as regulatory systems allowing for rapid response to dangerous threats. Given a relatively low state of overall arousal, we can often respond to relatively minor events. The personality model acts to bias available choices within the behavior tree. Given two possible outcomes for any sensory input, we can use a scale factor based on the personality variable associated with a given choice to determine which outcome has a higher priority. For example, when given the choice to interact in a conversation with another agent or to obtain a required item, an agent with an introverted personality would choose the latter.
The mechanism for this choice involves classifying each behavior tree selection with respect to personality and then using this classification as a scaling value when doing priority selection, each choice essentially scaling its priority up or down based on the value of its personality trait variable. The model of mood, although simplistic in nature, allows us to apply some further filtering to the selection of available behavior tree choices. The mood value is initially used in the input phase of the appraisal class. It simply scales the emotional valence of input senses, which may cause some sensory input to be ignored when it otherwise would have been acted upon. For example, if we perceive an object (say, an apple) that is beneficial to another agent, we calculate a valence for the goal of attaining the apple for that agent. Normally, we would then create the goal for the attainment of the apple by adding the apple to the blackboard. However, when the mood is negative, the positive affect generated by attaining the apple for the other agent is cancelled out, and we simply never add the apple to the blackboard. Given the framework described thus far, what does a typical update loop look like for the agent? See Figure 3.9.2.

Figure 3.9.2 Agent update loop.

The update loop begins with sensory input. In practical usage, event input can arrive at any point in time, but it follows a cyclic pattern very similar to the sensory update loop shown in Figure 3.9.2. Input is fed into the appraisal class for processing, which maps the input into changes in the agent's blackboard. The appraisal may add or remove goals within the blackboard depending on the input. It may also affect the emotional values associated with agents, objects, or events stored within the blackboard, which in turn may affect the processing within the appraisal class during the next update cycle.
Then the behavior tree responds to the new knowledge within its blackboard and in turn effects changes in the game world. The cycle then repeats continually. This reactive framework represents a sense > think > act cycle that is enhanced with emotion to become a sense > feel > think > act cycle. The blackboard may also be interrogated by animation and locomotion systems, for example, to alter walk-cycle blending to allow for a display of mood. An agent with a negative mood value may blend in more of a labored walk cycle, while an agent with a positive mood may blend in more of a bouncy walk cycle.

Conclusion

Emotion is a very complex subject, and there are many academic theories that attempt to describe how emotions work and how they might be classified. Video games are unlikely to ever completely simulate the entire spectrum of emotional responses, even if it were desirable to do so. As game designers and developers, however, we can incorporate simple models of emotion in order to add some personality to individual agents. Non-verbal communication is an important part of human social interaction [Mehrabian72], and emotional values modify this communication. For instance, we tend to hold gaze much longer on agents we are attracted to. A large part of the motivation for incorporating emotion into video games is that we can start to add non-verbal communication signals to our agents. This helps to create agents that feel more realistic and alive. Emotions can be used to modify things such as posture, gesture, gaze, gait, behavior, and memory. Imagine a world where agents remember your actions and are positively brimming with joy when you come to visit them! Imagine a world where you can understand simply from the posture of an agent whether it is happy to see you or intends to do you harm. To learn more about non-verbal communication and how it works in humans, see [Argyle75].
As more memory and processing time become available on newer platforms, we may begin to consider deeper models of agent emotion and memory, which in turn should lead to a more effective display of an agent's emotional state. This emotional display should lead to more engaging and believable characters—agents that can express themselves and their emotions non-verbally and engage the emotions of players at a deeper level.

References

[Argyle75] Argyle, M. Bodily Communication. Methuen & Co Ltd., 1975.
[Bartneck02] Bartneck, C. "Integrating the OCC Model of Emotions in Embodied Characters." Workshop on Virtual Conversational Characters, 2002.
[Eckman04] Ekman, P. Emotions Revealed. UK: Phoenix Books, 2004.
[Egges2004] Egges, Arjan, Sumedha Kshirsagar, and Nadia Magnenat-Thalmann. "Generic Personality and Emotion Simulation for Conversational Agents." Computer Animation and Virtual Worlds 15.1 (2004): 1–13.
[Eysenck65] Eysenck, H. J. Fact and Fiction in Psychology. Harmondsworth: Penguin, 1965.
[Isla02] Isla, D., and B. Blumberg. "Blackboard Architecture." AI Game Programming Wisdom. Boston: Charles River Media, 2002. 333–344.
[McRae96] McCrae, R., and P. T. Costa Jr. "Toward a New Generation of Personality Theories: Theoretical Contexts for the Five-Factor Model." The Five-Factor Model of Personality: Theoretical Perspectives. Guilford Press, 1996. 51–87.
[Mehrabian72] Mehrabian, A. Non-Verbal Communication. Transaction Publishing, 1972.
[Oatley06] Oatley, K., D. Keltner, and J. Jenkins. Understanding Emotions (2nd edition). Blackwell Publishers Inc., 2006.
[Ortony88] Ortony, A., G. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press, 1988.
[Reilly96] Reilly, W. S. "Believable Social and Emotional Agents." PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1996.
[Sowa92] Sowa, J. "Semantic Networks." n.d. Web. 15 Sept. 2009.
3.10 Scalable Dialog Authoring

Baylor Wetzel, Shikigami Games

It has been a goal of many a game to create a large city filled with people you can talk to. Not an inn or castle or a small town, but a city. A big city filled with hundreds (or thousands, or more!) of agents, each of which acts like an individual. But there's a reason we fill shopping malls with zombies and countrysides with monsters but not cities with people—creating hundreds of people, each with their own personality, takes a lot of time…a cost-prohibitively long time. There won't be games with large spaces truly filled with intelligent, conversational non-player characters (NPCs) until we find a way to create these agents more efficiently. Although the techniques in this gem don't try to tackle every problem and bottleneck that you'll encounter in building a dialog system, hopefully they will help you create large groups of agents much faster.

Conversational Agents Today

Games are filled with characters that talk. Although not every game and every character needs to be able to answer questions or carry on a conversation, conversational ability is important to a wide variety of games. Dialog varies from the "select a topic" approach used in the Elder Scrolls series (where one scenario involves convincing a love-addled stalker to give back the item he stole from a woman who would not go out with him), to the deep conversational trees of the original Fallout, to the multi-way conversations of Planescape: Torment, to the jury trials in Jade Empire. NPCs in these games are normally more complex than NPCs in other games. They might refuse to discuss a given topic with someone they don't know, ignore someone they previously argued with, insult someone from a rival group, or yell at someone trying to strike up a conversation in the ladies' bathroom.
Conversations can lead characters to give up their evil plans, join the player's team, or reveal the secret of their miniature giant space hamster. The work described here is part of research done at Alelo, which makes "serious games" designed to teach foreign languages and cultures. Many (though not all) of these games are used to train soldiers how to perform tasks overseas. These tasks range from manning checkpoints and conducting house-to-house searches to negotiating with local leaders and helping set up clinics. Success often depends on showing the proper level of politeness and professionalism, earning trust through culturally appropriate small talk, and asking the correct questions in the correct way. In this gem, we'll use the example of a U.S. soldier in Iraq to explain the techniques.

Typical Methods for Building Conversational Agents

There are a few ways to make conversational agents, one of the more common (and painful) ways being to build them manually in script (if (1==option) bobDialog42() else…). An easier approach is to use a dialog editor to build a tree, where one node is what the NPC says, the nodes under that are things the player can say in response, the nodes under those are the NPC's responses, and so on. Each node typically contains the exact text the agent will say. You say, "Do you like football?" and the NPC will reply, "Sure, who doesn't?" An NPC might have several responses based on whether they like you, whether you have fulfilled a quest for them, whether you are both at a bar, and so on. In Neverwinter Nights, this is done by calling the TextAppearsWhen script, and in the Elder Scrolls editors (including Fallout 3's G.E.C.K.
editor), it's done by checking the Conditions field, but the idea is the same—for every possible dialog option, the designer writes the input text (the player's choice) and the output text (from the NPC), and for each possible output writes a script (either by hand in NWN or using a spreadsheet-like tool in the Elder Scrolls) to determine whether that particular output should be used. If not, the game checks the next output in the list. The results can be very good, but it takes a lot of time, thought, and planning to build.

The Scalability Problem

Let's start with a positive—the range of dialog that can be created by current techniques is essentially perfect. If you want an NPC's response to change based on the player's shoes, intelligence, the last enemy they fought, the health of the NPC's dog, and the phase of the moon, you can do that. The problem is that it's going to take you a long time. This isn't the only problem. Because of the sheer volume of data, designers face the problem of covering the whole possibility space—you handled a lot of the possible variable combinations, but did you get all of them? In Fallout, NPCs asked about quests that had long since been completed. In Mass Effect, the person sitting next to you will calmly inform you that they've picked up a communications signal rather than ask why you've just driven off a bridge into a bottomless chasm. In Neverwinter Nights, you can rescue a girl from a giant, go to the girl's farm, kill her family, then talk to her, and she'll thank you and ask you to visit again soon. Finding and correcting problems like this isn't hard, it just takes time. How much time?
To build a professional dialog tree, you not only have to decide which topics an agent can discuss, the words to use, and the flow of the conversation, you also need to think about all the factors (NPC personality, NPC culture, NPC state, world state, player state, conversational history, history with the player, and so on) that should affect a conversation and make sure the NPC reacts appropriately. For one of the games I worked on in 2009, creating the full dialog tree for an important NPC took two to three weeks. Even then, the agent had the standard lapses in awareness and limited conversational ability that you find in any game. That particular game was a bit more complex than most, but the time to create a decent conversational agent in any game is still measured in days and weeks, not minutes or hours. The required effort influences how games are made. It would take a small team of designers working on nothing but conversations a full year to make 100 individual (non-cloned) NPCs, and those NPCs would still suffer from limited conversational ability and situational awareness. Assuming one wishes to at least break even on their game, it is essentially impossible to make large worlds filled with unique, believable agents using current techniques. As a result, time and money force those worlds to be filled with a handful of high-quality NPCs (those that drive the plot) and dozens to hundreds of generic NPCs with little or no conversational ability at all.

Unique Personalities and Other Things We Might Want

Our primary goal is to reduce the time it takes to create a conversational agent. The associated goal is to reduce the cost to create a single agent, allowing us to reduce the cost of making the game or to create significantly more agents for the same cost. (This gem focuses on the latter.) We have already said we want agents with better conversational breadth and situational awareness.
(In other words, responses are based on the NPC's personality, state, job, culture, feelings toward the player, and so on.) Another desirable trait is realistic uniqueness—characters in the game are roughly as diverse as people in the real world. This highlights the problem with a common technique—making a few high-quality NPCs (or dialog trees) and cloning them. Using templates ("Hi, my name is; I live here in"), you could fill a world with hundreds of agents who knew some basic information about themselves but who all acted the same (or behaved like one of a handful of personality types). What we want are people who are realistically unique—based on who they are, two agents will give different answers when it makes sense and the same answer when it makes sense. Another desirable feature (already present in some games) is for the player to be able to change an agent's attitude and behavior toward them. Scenarios often require the player to earn the trust of an NPC. Likewise, bad behavior on the player's part should have consequences. Being able to win an agent's trust is often the key to a mission, and being able to make someone hopping mad is simply fun. Culture describes how a group of people behave in certain circumstances. For example, it might be considered rude to ask an Afghani man about his wife, refuse a cup of tea in Iraq, or ask a first-level character about their flying mount. If you're dealing with a large number of cultures (groups, roles, character types, and so on), the sheer volume of dialog data makes it hard to verify that agents behave consistently or behave the way the lead designer requested. For serious games, where the behavior often has to be evaluated by an educational expert and/or people from the culture being modeled, unless those people are also game programmers, this is a serious problem. Format is also a problem.
All of the behavioral information can be captured in the standard script and tree structure of most dialog systems, but if the knowledge is explicit (say, a spreadsheet that focuses on behavior rather than wording), it's easier for an expert to review (and author) the information. It's much harder to bring in a group of people from a culture and ask them to review the information if that information is scattered across hundreds of script files. So another desirable feature is the ability to explicitly describe a culture or group of people. A benefit of an explicit cultural representation is that it enables another desirable feature—plug-and-play cultures. If the culture of the NPCs can be swapped out for other cultures, making a new city filled with conversational agents becomes as easy as cloning an existing city and swapping the culture, which meets both the goal of being fast (and cheap) and the goal of the NPCs being realistically different. Editing dialog must entail an easy-to-understand workflow (one that doesn't require a Ph.D. to use). It needs to be data driven in a way that makes it quick to create easy-to-use tools, as well as quick to write unit tests for the system. A final thing is something we don't want—the tool should not preclude designers from doing things they can do now. An example is a tool that uses psychological data to generate realistic behavior but doesn't allow the designer to override that behavior. While realism is often nice, it is more important that designers be able to achieve the behavior they want. In entertainment games, realism must sometimes be sacrificed for fun or for moving the plot along. In educational games, characters must sometimes do things to further the educational goal, such as correcting rather than overlooking an error, or leading the player to the correct behavior rather than harshly punishing them.
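As an illustration of what an explicit, reviewable culture specification might look like (this format is invented for this sketch, not Alelo's actual data format), a culture can be reduced to sparse, ordered rows that a non-programmer can audit and that can be swapped wholesale to re-skin a city:

```cpp
#include <cstring>

// One row of a culture specification: "for this topic, at this trust level
// or better, respond this way." Rows are evaluated in order, first match wins.
struct CultureRule {
    const char* topic;
    int         minTrust;   // rule applies when trust >= minTrust
    const char* response;
};

// Sparse by design: anything not listed falls through to a broader group.
// The specific rows here are invented examples, not shipped game data.
static const CultureRule kExampleIraqiCulture[] = {
    { "Spouse",   8, "Answer" },  // only discussed with trusted friends
    { "Spouse", -10, "Refuse" },
    { "Tea",    -10, "Answer" },  // tea is always an acceptable topic
};

// First-match lookup over an ordered rule table.
inline const char* FindResponse(const CultureRule* rules, int n,
                                const char* topic, int trust) {
    for (int i = 0; i < n; ++i)
        if (std::strcmp(rules[i].topic, topic) == 0 && trust >= rules[i].minTrust)
            return rules[i].response;
    return nullptr;  // no entry: caller falls back to a more general group
}
```

Because the table is plain data, it can live in a spreadsheet, be reviewed by cultural experts, and be unit tested without touching any dialog scripts.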
What We Won't Cover

It likely comes as no surprise that a gem of this length will not cover every aspect of conversational agents. The focus of this gem is intention planning, which means deciding how you want to respond to a topic. We'll use topics, concept trees, response types, trust levels, rapport modifiers, temperament stats, explicitly modeled cultural groups, sets of sparse culture wrappers, and a bit of memory to help decide when we should answer a question, feign ignorance, or insult the speaker's mother. What this gem doesn't cover is realization, the actual words that come out of the NPC's mouth. In the old days, this was a simple problem—if the designer decided the NPC should insult the player, the response type Insult would map to one or more insults. If the NPC's intent is Answer, the response type and topic can be used to look up a specific answer, which might be specific to that character or used by the entire world. Using the techniques presented here, if the designer decides halfway through the project that all 300 guards in the game need to be able to discuss bunnies or Pre-Raphaelite poetry with complete strangers (but still act stuffy around people they actively dislike), the change can be made in a few minutes or less. Being able to add entirely new responses or whole topics to hundreds of NPCs in a matter of minutes is a nice feature and offers all sorts of dreams of large, expansive, dialog-filled worlds. Unfortunately, these days, things are a little more difficult. Most high-end games now use voice actors, meaning each statement a designer adds to an NPC must be recorded, a slow and expensive process. In these situations, voice recording, rather than designer creativity or predicting variable combinations, becomes the bottleneck.
The topic of faster, cheaper speech won't be covered in this gem, but it is worth noting that even when the set of lines an NPC can say is fixed, considerable time must still be spent mapping the player's input to the NPC's output. The techniques presented here can help you more intelligently (and quickly) build those mappings.

Overview

The goal of this gem is to describe a way to scale how one authors conversational agents. Current systems typically use hard-coded input-output mappings annotated with gateway scripts that decide which of the hard-coded responses to use. The system described here uses a variety of techniques, but at its core, it tries to break the hard-coded links and replace them with abstractions. Rather than linking the player's input directly to the NPC's output, we use the player's input to determine the NPC's intention and then use the intention to select the NPC's behavior. All inputs are mapped to a Topic. The Topic is checked against the NPC's CulturalGroup and current level of Trust toward the player to determine a ResponseType. Topics belong to a topic hierarchy, so if there is no match on the Topic in the NPC's dialog specification, the system moves up a level and checks for a ResponseType on the parent Topic. A CulturalGroup is a sparse set of {Trust-Topic-ResponseType} mappings. Culture represents not just nationality, but any group membership that affects how the agent will respond to a topic. An agent can (and almost certainly will) belong to multiple groups. Groups are prioritized, and conflicts are resolved by selecting the highest-priority matching group. Throughout this gem, we'll use the example of a U.S. soldier (the player) in an Iraqi city. Table 3.10.1 lists seven NPCs the player might interact with—a typical civilian, a policeman, a policeman secretly working for the insurgents, another U.S. soldier, an insurgent (although the player doesn't know this), and two doctors.
TABLE 3.10.1 NPCs in an Iraqi City

Name    Bio           Roles                                           Trust
Nori    Iraqi Man     SoccerFan, Iraqi, Person                        0
Anwar   Policeman     Policeman, GovRep, Iraqi, Person                0
Zuhair  Policeman     Policeman, GovRep, Insurgent, Iraqi, Person     –2
Scott   U.S. Soldier  Soldier, USCitizen, Person                      2
Shakir  Insurgent     Insurgent, SoccerFan, Iraqi, Person             –6
Suha    Doctor        AidWorker, GovRep, IraqiFemale, Iraqi, Person   0
Halema  Doctor        AidWorker, GovRep, IraqiFemale, Iraqi, Person   0

Intention Modeling

Imagine a game in which there are 100 NPCs, and the player can insult any of them. How will they react? Most will return the insult in a dozen different ways, based on their personality, intelligence, culture, and preferred insult. Many will ignore the player. A few will attack. There are potentially 100 actual actions or phrases that might be used, but (in our example) there are only three intentions (insult, ignore, attack). In most games, the player's action (the input) is hard-coded directly to the NPC's behavior (the output). The result is that the designer must write thousands of pairings, such as {input:Hear("Do you like films about gladiators?"), output:Say("Get away from me, weirdo.")}. Much of this work is redundant—the same output is used for multiple inputs, and the same pairings are used in dozens of NPCs. If it is decided that this behavior is no longer desired (say, if designers decide late in the process that elves, unlike dwarves and humans, never insult others), the dialog pairings must be tracked down across dozens or hundreds of scripts or dialog files and changed. Doing duplicate work is not only inefficient and hard to maintain, it's not fun. To make the designer's life easier, we'll have them map inputs to intentions (a much easier task) and separately map intentions to behaviors.

Topics, Response Types, and Trust

In our approach, for a given culture (which we'll discuss a little later), a Topic and Trust level are used to select a ResponseType.
Topic is the subject the player is asking about (swords, rumors, dragons, and so on). In this gem, we'll assume that Topic is the only input. While this is sufficient for most current video games, more demanding games will likely use a more complex input consisting of an action (asking a question, demanding, greeting, complimenting, insulting, and so on), a topic, and possibly some metadata (vigor, politeness, and so on). The type of input is irrelevant to the rest of the system, so we'll keep things simple and assume Topic is the sole input. The ResponseType represents the responding NPC's intention. Possible values include Answer (give the player the information they're looking for, if possible), Refuse, Evade, ChangeTopic, VagueAnswer, Lie, Ignore, Insult, Threaten, Correct (if the player has made a mistake in what they asked for; this is more useful in educational games), PositiveLie (say something is great, regardless of whether it is), NegativeLie, and Custom (e.g., attack). More complicated (and academically respectable) schemes exist, but this works well for our purposes. The actual words spoken by the NPC are based on the ResponseType. Suppose the player has been told that a bomb has been placed in the market. The player stops a person on the street and asks for directions. Table 3.10.2 shows how the NPCs defined in Table 3.10.1 might react. Five of the seven will attempt to answer the question (although one, Scott, does not know the answer). Shakir, an insurgent, will insult the player, while Zuhair, a corrupt cop, will lie to the player, sending him away from the market. (Note that the policeman does not necessarily know there is a bomb and want it to explode; he merely dislikes the player. If the policeman knew about the bomb, he could be made to respond differently using a context modifier, but that is outside the scope of this gem.)

TABLE 3.10.2 What an NPC Says Depends on Their Intention

Player asks: Where is the market?
Name:    Nori    Anwar   Zuhair  Scott      Shakir     Suha    Halema
Role:    Iraqi   Iraqi   Iraqi   USCitizen  Insurgent  Iraqi   Iraqi
Intent:  Answer  Answer  Lie     Answer     Insult     Answer  Answer
Say:     Left    Left    Right   Not sure   Pig        Left    Left

Trust is the amount of trust the NPC has in the player. It is what ties the Topic to the ResponseType for the NPC's cultural group. This attribute does not have to be trust—it could be rapport or some combination of other attributes, although trust as a single value works well in most instances. The important thing is that there is an attitudinal value that unambiguously ties an input to a ResponseType. Using the example in Table 3.10.1, let's assume that the player has insulted Anwar the policeman. Let's measure trust from –10 (distrust) to 10 (full trust). Topic=Insult, Culture=Police, and Trust=0. We have the rules (evaluated in order):

{Trust >= 5, Ignore}
{Trust >= 0, Insult}
{Trust >= -7, Threaten}
{Trust < -7, Custom:Attack}

When the player insults Anwar, Anwar will decide to insult the player back. Assuming an insult lowers Trust by 1, if the player insults Anwar again, Anwar will threaten him. If the player keeps it up, Anwar will eventually attack. Knowing that an NPC will insult the player does not automatically determine what the NPC will do. The behavior generation system might be as simple as mapping ResponseType=Insult to Say("Oh yeah, your momma."). The intent is mapped at the group level (all policemen), but the behavior could be different for each individual policeman. Each NPC could have his own favorite insult. Insults could be chosen based on that NPC's intelligence. They could be based on the player's class, how they're dressed, or their location (sports arena, store, and so on). This decision is made independently of the intention system. Separating intention from behavior has several important implications.
First, separate designers can be assigned to intent (say, someone familiar with personality or social psychology) and realization (for example, a writer). Second, because it's a smaller set of data and explicit in its goals, it is easier for one person to view and correct the data (important when striving for consistency across agents and designers). Third, the smaller set of options (which presumably will be chosen from a list rather than entered as free text) means the intent portion can be built faster and more easily. By removing duplicate data (in the behavior system, you only have to map behaviors to a small set of intents, not the much larger set of inputs), the overall amount of work should be reduced. Fourth, it makes it easier for designers to tweak dialog later in the development process without editing (and possibly adding bugs to) individual NPC dialog trees. Fifth, having a separate intention layer makes it easier to write unit tests, in part because there's less data to test and in part because the tests aren't dependent on free text (important both because text is often changed and because of internationalization). Sixth, the explicit ResponseType and Trust levels help designers remember which conditions they need to handle. (Note that this is not required: One can create a group Person that returns Answer for all topics at all Trust levels.) A final, and important, reason why separating intention from behavior matters is that it allows for design by composition, as seen in the next section.

Concept Hierarchies

To enhance conversational breadth, topics belong to a topic hierarchy. If the player asks about murders and the NPC's group doesn't have an entry for Murder, the system checks the parent topic (say, Problems). If that is missing, it checks the next level up, until it reaches the root topic. There are several advantages to this, but two are worth mentioning.
First, by placing a ResponseType on the root node, the agent has a default answer to anything the player asks. This helps cover errors when a mapping has been forgotten.

Second, it allows a designer to add new topics through extension, which is normally lower risk than edits. If the designer decides that some characters need special behaviors when discussing Pre-Raphaelite poetry, they can add it to the topic hierarchy. The groups configured to discuss obscure Victorian poetry will do so, and everyone else will respond to the general topic of poetry, writing, or something more general.

It should also be noted that the {Topic-Trust-ResponseType} mappings do not have to be completely specified. An NPC can be set up to talk about local crime when Trust is eight or better and have no other mapping for that topic. If Trust was below that value, the system would then use the parent topic. This comes in handy when a person belongs to multiple groups, as described in the next section.

Cultural Wrappers

When designers specify intent, they do so at the group level, not for individual NPCs. We will refer to these groups as CulturalGroups. A CulturalGroup can represent race, nationality, occupation, political affiliation, or any group membership that affects how one reacts to something. Examples include Thai, Rural, Soldier, FootballFan, PunkRocker, AngryLoner, and Parent. Designers can also use the cultural group concept to model personality traits, such as Paranoid and Bully.

TABLE 3.10.3 Questions about Football Can Be Answered as Questions about Either Football or Sports

Do you like football? (rapport building, concept abstraction)

Name:    Nori       Anwar   Zuhair  Scott      Shakir     Suha        Halema
Topic:   Football   Sports  Sports  Sports     Football   Sports      Sports
Role:    SoccerFan  Iraqi   Iraqi   USCitizen  SoccerFan  Iraqi       Iraqi
Intent:  Answer     Answer  Answer  Answer     Evasive    Answer      Answer
Say:     Yes        Yes     Yes     No         Maybe      Not really  Yes
Trust:   +1         +1      +1      +1         +1         +1          +1

TABLE 3.10.4 U.S.
and Iraqi Cultures Differ in Their Willingness to Discuss Their Spouse with a Stranger

Tell me about your spouse.

Name:    Nori       Anwar     Zuhair  Scott         Shakir         Suha    Halema
Topic:   Spouse     Spouse    Spouse  Spouse        Spouse         Spouse  Spouse
Role:    Iraqi      Iraqi     Iraqi   USCitizen     Iraqi          Iraqi   Iraqi
Intent:  Refuse     Refuse    Refuse  Answer        Insult         Refuse  Refuse
Say:     How rude!  I refuse  No      Chevy’s nice  You’re a pig!  No      No
Trust:   –1         –1        –1      +1            –1             –1      –1

A cultural group contains one or more {Topic-Trust-ResponseType} mappings. These mappings are typically sparse—most doctors have a predictable reaction to medical questions but not to questions about books, movies, or enchanted swords. Agents belong to one or more groups. Typically, one of those groups will be Person, which will contain default mappings. Other groups specialize the agent. Some, such as Iraqi, will be fairly broad and dense, containing a lot of mappings, while others, such as FootballFan, will be small and focused. The agent can have an unlimited number of groups.

When designing agents, two design principles are used: design by composition and design by exception. Design by composition says that the designer should build the agent by selecting pieces (cultural groups) rather than writing the agent from scratch. The process is simple and fast—Agent A is a Doctor, GovernmentRepresentative, Iraqi, IraqiFemale, and Person, and Agent B is a GovernmentRepresentative, Insurgent, Policeman, Iraqi, and Person. It takes only a few seconds to select the groups from a list, and assigning the groups fully specifies how the agents will react to any dialog option in the game. (Note that if per-agent behavior is used, that work will still need to be done, although it should still be less work than in a traditional system.)

Design by composition speeds up the authoring process for a single agent. Design by exception speeds up the group authoring process.
Following the principle of design by exception, default values should be set up in the base group (in our example, Person), and only those values that differ from the default should be placed in new groups. For example, you could have the group LittleGirl love to talk to complete strangers about any type of animal, the group DemonEnthusiast love to discuss demons, and the group RabbitPhobe be too terrified to discuss rabbits with anyone but their closest friends. Assigning those three groups to an agent produces an agent that will gladly talk about animals and demons yet refuse to discuss bunnies. The group RabbitPhobe does not contain mappings for any Topic other than Rabbit, making it fast to create, and the designer is not forced to create hundreds of combination groups, such as PeopleWhoLoveAnimalsAndDemonsButNotRabbits.

Although it won’t be frequent, conflicts between CulturalGroups can occur. Consider a doctor who runs a government clinic in a war zone. Although the clinic is trying its best, there are still rampant health problems in the area. If you ask the agent about the problems, the Doctor in her wants to complain that they aren’t doing enough, while the GovernmentRepresentative in her wants to say that the clinic is doing just fine.

The dialog system we describe here works more generally for any problem of determining an agent’s reaction to an event. How one handles the cultural group conflicts depends on the domain. For the dialog system, we decided to use a first-chance event handler with a prioritized group list (referred to as Cultural Wrappers). When setting up the agent, the designer must select the order of the groups. In Table 3.10.1, Suha has Doctor prioritized over GovernmentRepresentative, while Halema has the same groups but in a different order. When asked about problems (Table 3.10.5), Suha complains about healthcare, while Halema says there are no problems.
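The first-chance, prioritized-group lookup just described might be sketched as follows. This is an illustrative simplification: each group is a sparse {Topic: ResponseType} dictionary, Trust is omitted for brevity, and the group data is invented from the text's examples.

```python
# Sketch of design by composition/exception: an agent is an ordered list of
# cultural groups, each a sparse mapping. The first group (in priority order)
# that defines the topic wins; Person supplies the default.

GROUPS = {
    "LittleGirl":      {"Animal": "Answer"},
    "DemonEnthusiast": {"Demon": "Answer"},
    "RabbitPhobe":     {"Rabbit": "Refuse"},   # the only exception it defines
    "Person":          {"Any": "Answer"},      # base-group default
}

def respond(agent_groups, topic):
    # First-chance handling: walk the groups in priority order.
    for group in agent_groups:
        if topic in GROUPS[group]:
            return GROUPS[group][topic]
    return GROUPS["Person"]["Any"]             # fall back to the default

agent = ["RabbitPhobe", "DemonEnthusiast", "LittleGirl", "Person"]
```

With those three groups plus Person, the agent happily discusses animals and demons yet refuses to discuss rabbits, with no combination group required.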
TABLE 3.10.5 An NPC’s Answer Is Based on the Order (Prioritization) of Their Groups

Are there any problems here?

Name:    Nori    Anwar   Zuhair          Scott         Shakir     Suha        Halema
Role:    Iraqi   Iraqi   GovRep          USCitizen     Insurgent  Iraqi       GovRep
Intent:  Answer  Answer  Deny            Answer        Insult     Answer      Deny
Say:     Yes     Crime   It’s very safe  I don’t know  You!       Healthcare  No

In a very small number of cases, there is no acceptable ordering of groups—sometimes Group A supersedes Group B, and other times B supersedes A. As an example, in Table 3.10.1, Zuhair is both a policeman and (secretly) an insurgent. He wants to help the terrorists, but not at the risk of blowing his cover. He might voice support for the terrorists around people he trusts (as a terrorist) but be polite to the player (as a policeman). In these instances, a simple solution is to create a new group that contains only those Topics and Trust levels needed to resolve conflicts. This new group might be an actual general-purpose group, such as UndercoverInsurgent, but it can also represent that specific individual (in this case, Zuhair). Individuals have their personal quirks that can’t be captured by any group, so modeling the things that are truly specific to an individual is okay, but the “individual group” should only contain the exceptions. Unless the agent is truly, eccentrically unique, most of his responses should be specified in the more general groups.

Earlier it was mentioned that Topics belong to a topic hierarchy. When determining the agent’s intent, if a match on the Topic isn’t found, the parent Topic is used, moving up the tree until a match is found. When multiple groups are used, preference is given to Topic specificity. Consider the Topic TableTennis, child of Sports, and the ordered group list [USCitizen, PingPongFan]. USCitizen does not have a mapping for TableTennis but does for Sports, while PingPongFan matches on TableTennis.
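This topic-specificity-first lookup might be sketched as follows. The hierarchy, the group data, and the ResponseType "Enthuse" are assumptions made for the example; Trust checks are omitted for brevity.

```python
# Sketch of intent lookup that prefers Topic specificity over group order:
# all groups are probed at one hierarchy level before moving up the tree.

PARENT = {"TableTennis": "Sports", "Sports": "Topic"}  # "Topic" is the root

GROUP_MAPPINGS = {
    "USCitizen":   {"Sports": "Answer"},        # no TableTennis entry
    "PingPongFan": {"TableTennis": "Enthuse"},  # matches the specific topic
}

def find_intent(ordered_groups, topic):
    checks = []                                 # record the probe order
    t = topic
    while t is not None:
        for group in ordered_groups:            # every group at this level
            checks.append((group, t))
            if t in GROUP_MAPPINGS.get(group, {}):
                return GROUP_MAPPINGS[group][t], checks
        t = PARENT.get(t)                       # then move up the topic tree
    return None, checks

intent, checks = find_intent(["USCitizen", "PingPongFan"], "TableTennis")
```

For the ordered list [USCitizen, PingPongFan] and Topic TableTennis, the probe order is {USCitizen, TableTennis}, then {PingPongFan, TableTennis}, which matches, mirroring the walkthrough in the text.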
Assuming a Trust level of 0, the program first checks for a match on {USCitizen, TableTennis, 0} and fails to find a match. It then checks {PingPongFan, TableTennis, 0}, where it finds a match. Had it failed, it would have then checked {USCitizen, Sports, 0} and then {PingPongFan, Sports, 0}. This gives greater freedom in arranging groups and allows for a greater number of groups to be used. If the system moved up the topic hierarchy before checking the next group, any group with a value high in the tree (for example, at the root node, which matches everything) would prevent any other group from having an influence.

Creating Unique Individuals

Each combination (and ordering) of groups results in an individual who is unique. It does not mean that he will behave differently than every other agent under all circumstances at all times. We could design them to do so, but the agents wouldn’t appear realistic; they’d appear insane. It doesn’t matter whether someone loves kittens, is a doctor, or grew up in a small town; if you ask him whether a particular neighborhood is dangerous or whether a given restaurant is good, there are only a limited number of responses you should get. Responding to the question “Where is the train station?” by juggling cats might be unique, but it isn’t helpful. That said, there’s nothing preventing the designer from adding that reaction.

The number of unique individuals you can create by combining groups grows quickly with the number of groups. With three groups, 15 unique individuals can be made. With five groups, the number is 325. With 10 groups, the number is 9,864,100—larger than the population of most cities in the world. Add one more group, and you can cover all cities and most countries.

The process of assigning and ordering groups is not the only way to create unique NPCs. Intent can be tweaked at the NPC level using the TrustModifier property. This represents how trusting someone is and how quick they are to change their trust level.
It is multiplied against Trust to produce a modified trust score used in Trust checks. The default TrustModifier is 1. An agent with a TrustModifier of 1.5 is 50 percent more trusting than normal—the agent needs only a Trust of four to trigger responses that other agents with the same group list require a six for.

Circumstantial modifiers can be used to modify Trust scores. For example, if the agent has been arrested or has a gun pointed at him, the threshold for giving an answer could be lowered. It seems likely that this would need to vary by NPC somehow (perhaps by a willpower property).

Although beyond the scope of this article, uniqueness can also be created in the behavior generation system. A single {Topic, ResponseType} pairing (where Topic can be a wildcard when topic is irrelevant, such as when insulting or ignoring the player) can map to a set of realizations, one of which is selected at random (preferably using an intelligent random system that filters out long repeated sequences). Behaviors can also be chosen based on attributes of the agent. For example, if the ResponseType was Compliment, an agent with a high intelligence or charisma might say something clever, while someone with low intelligence might stumble badly and say something offensive.

Conclusion

One of the biggest obstacles to creating games filled with hundreds of intelligent, conversational agents is the sheer amount of work (and therefore cost) required to create them. It’s not that it’s hard work (although designing interesting characters and dialog can certainly be difficult); it’s that any kind of work done several hundred times is a lot of work. And sometimes, quantity is as important as quality—you can’t make a living, breathing, realistic city with just three characters. One of the keys to improving AI is to improve its authoring scalability; there need to be processes and tools that make it easier to populate virtual worlds.
Hopefully, the ideas presented in this gem will be a big step toward helping you fill your own worlds with intelligent, interesting, unique characters.

3.11 Graph-Based Data Mining for Player Trace Analysis in MMORPGs

Nikhil S. Ketkar and G. Michael Youngblood

In this gem we will present techniques for analyzing player trace data in massively multiplayer role-playing games (MMORPGs). As MMORPGs become increasingly popular, with the number of subscribers going into the millions, an MMORPG provider is faced with a number of technical and business questions. For instance, how do you place an advertisement in MMORPGs, or how do you detect cheating in the form of bots and gold farmers? We observe that these and a lot of other such questions can be answered by analyzing player traces (graph representations of the player’s movement in the world), but most traditional approaches in machine learning and data mining fall short when applied to the task due to the inherent structural nature of player trace data. We claim that techniques from the area of graph-based data mining, which are designed to work with structured data, are the most suitable for the analysis of player traces.

Data Logging and Preprocessing

Typically, player movements in the world can be logged as the location of the player in 3D space at discrete intervals in time. Thus, the logged data for a single player is a sequence of the form {(x0, y0, z0, t0), (x1, y1, z1, t1), ...}. The player spawns in the world at time t0 (which is 0), and (x0, y0, z0) refers to the position where the player spawns. Subsequently, the player moves in the world, and assuming that we are logging data at every second, (x1, y1, z1, t1) refers to the position the player is at t1 = 1. Similarly, we have a number of positions with the corresponding time for the player movements until the player exits the world. We refer to this sequence as a walk.
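As a rough sketch, this logging format and the grid preprocessing step (described next in this section) might look like the following. The cube size and the sample walk are made up for illustration.

```python
# Sketch: snap raw logged positions (x, y, z, t) to discrete grid cells by
# floor division with the chosen cube size, collapsing consecutive samples
# that land in the same grid cube.

def discretize(walk, cube_size):
    """Map a raw walk [(x, y, z, t), ...] to [(cell, t), ...]."""
    out = []
    for x, y, z, t in walk:
        cell = (int(x // cube_size), int(y // cube_size), int(z // cube_size))
        if not out or out[-1][0] != cell:
            out.append((cell, t))   # keep the first time we enter the cell
    return out

raw = [(0.5, 0.5, 0.0, 0), (1.4, 0.9, 0.0, 1), (2.6, 0.9, 0.0, 2)]
walk = discretize(raw, cube_size=2.0)
```

With a 2.0-unit cube, the first two samples fall into the same cell and are collapsed, leaving a two-cell discrete walk.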
Our overall dataset consists of a set of such walks, where each walk corresponds to one session of a single player in the game. Figure 3.11.1 presents a visual representation of three walks. This data was collected in the world shown in Figures 3.11.2 and 3.11.3. This world is a part of the Urban MMO Testbed (UMMOT). UMMOT is an experimental environment designed to study human interactions in virtual worlds and is an extension of the Urban Combat Testbed [Cook07, Youngblood08].

Figure 3.11.1 A visual representation of three walks in a world.
Figure 3.11.2 Urban MMO Testbed: Bird’s-eye view.

While data in such raw form is easy to log, for the purposes of analysis it needs to be preprocessed into a more suitable form. Dealing with data at this level of granularity (locations in 3D coordinate space) is computationally intensive and leads to poor results. Hence, we convert the 3D locations to a discrete form by superimposing a grid on the world. Figure 3.11.4 illustrates this process. Once such a grid is superimposed, all points inside one grid cube are assigned to the same discrete location. Once data is preprocessed in this manner, it consists of a set of walks of the form W = {(l1, t1), (l2, t2), ...}, which is a sequence of discrete grid locations with the corresponding time instances.

Figure 3.11.3 Urban MMO Testbed: Screenshot.
Figure 3.11.4 Superimposing a grid to get discrete locations.

Selecting a proper granularity for the grid (the size of a single cube) is important. Too small a cube size will lead to too many locations, and too large will lead to too few locations. It is recommended that the cube size be equal to, or a small multiple of, the bounding box of the character model.

Advertisement Placement in MMORPGs

Placing advertisements in MMORPGs can provide host companies with an additional source of revenue.
However, in order to capitalize on this business opportunity, companies hosting the MMORPGs need to provide coverage guarantees to their clients who would pay for placing advertisements. The notion of advertisement coverage is central to all marketing and is loosely defined as the estimated number of prospects reached by an advertisement. A lot of value is placed on advertisement coverage, as the clients placing the advertisement are solely interested in reaching as many prospects as possible and would be willing to pay a higher amount of money for higher coverage. An advertisement in the New York Times costs significantly more than an advertisement in the Charlotte Observer precisely because an advertisement in the New York Times will achieve higher coverage.

In the case of MMORPGs, estimating and providing guarantees on coverage is challenging for a number of reasons. Players typically spawn in different locations, moving around the world performing tasks, and this is quite different from a reader reading a newspaper or visiting a webpage. Advertisements can be placed in different locations in the world, but where should they be placed? Furthermore, can some guarantees on coverage be provided?

Given a set of preprocessed walks, as described in the previous section, advertisements could be placed at any of the locations in the walks. We define coverage for a set of locations as the number of walks that contain at least one of the specified locations divided by the total number of walks. Intuitively, coverage captures the number of players that will get quite close to and most likely see an advertisement. Note that it is quite possible, although unlikely, that a player will get close to a location but not see the advertisement because he or she is looking in a different direction. In our setting, we do not explicitly model where the player is looking. Such an approach has been tried before and has been found to be quite computationally expensive [Dixit08].
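This coverage definition can be stated in a few lines of code. The four walks below are made up for illustration; they are not the walks of Figure 3.11.5.

```python
# Sketch of the coverage definition: the fraction of walks that pass through
# at least one of the selected advertisement locations.

def coverage(walks, locations):
    hit = sum(1 for walk in walks if any(loc in walk for loc in locations))
    return hit / len(walks)

walks = [
    ["A", "B", "C"],   # Walk 1
    ["B", "D"],        # Walk 2
    ["D", "E"],        # Walk 3
    ["C", "E"],        # Walk 4
]
```

Selecting {B, E} covers all four walks (100 percent), {D} covers two of four (50 percent), and {A} covers one of four (25 percent), matching the style of the subfigure walkthrough that follows.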
Figure 3.11.5 illustrates examples of how the coverage is computed. Note that there are four distinct walks, as illustrated in the large graph on the left. Subfigures (a), (b), (c), and (d) illustrate four different location selections for this graph. For the location selection illustrated in Subfigure (a), two locations are selected, which cover Walks 2, 3, and 4. Because there are a total of four walks, this amounts to coverage of 75 percent. For (b), two locations are selected, which cover all four walks, amounting to coverage of 100 percent. Similarly, for (c) and (d) we have 50-percent and 100-percent coverage, respectively.

In such a setting, our task is to maximize the coverage while minimizing the number of advertisements placed. This task is equivalent to the set-cover problem, which is NP-Complete. Thus, optimal solutions are not feasible, and it is necessary to develop approaches that can produce near-optimal solutions at moderate computation cost.

An important, additional dimension in the task of advertisement placement is that of generalization to future player behavior (in terms of walks). Given a certain amount of training data, suppose that we select a set of locations that maximize coverage on this data. Then, our selected set of locations should achieve the required level of coverage on future walks. Assuming that training data is a good reflection of the entire population, maximizing coverage on training data will most likely achieve high coverage on unseen data. The important question here is how large a sample is required to get good generalization on unseen data.

Another factor to consider is the cost of collecting and logging data. Clearly, in order to have training sets on which to base advertisement placement, some data needs to be collected. The most expensive case is where actual positions (3D locations) are logged.
The space required for such logging grows linearly with the number of walks and may not be feasible. Another approach is to simply log the number of players that visit a particular location, which is constant in the number of walks. There is an inherent tradeoff between the amount of logged data and the quality of the solution. We now present a set of approaches for these tasks and discuss their strengths and weaknesses.

Frequency Maximizing Approach

Frequency-based placement is a relatively simple approach that selects locations based on the number of walks passing through a particular location. Walks in the training data are processed sequentially to count the number of walks passing through each location, and these counts are used to select the topmost locations. Note that this approach only requires the logging of frequencies of visits to each position, which requires space constant with respect to the number of walks. Another important thing to note is that this approach might produce suboptimal solutions in many cases, because it does not consider the overlap between walks. Figure 3.11.6 illustrates an example of such a case. Assuming that Position B has already been selected, the frequency maximizing approach will select Location A over Location C. This is because, individually, A covers four walks and C covers two. This is clearly suboptimal, as there is an overlap of three walks between A and B (Walks 1, 2, and 3).

Figure 3.11.5 Coverage computation.

Markov Steady-State Probability-Based Approach

The Markov steady-state probability-based placement approach is based on computing the probability of a player visiting a particular location based on the transition probabilities. The first step in this approach is to process the walks to generate a transition probability matrix.
The transition probability matrix stores the probabilities of transitions from a location to any other location and is computed by counting the transitions and dividing by the total number of outgoing transitions. Once this matrix is generated, which we will refer to as M, the steady-state probabilities can be computed by solving xM = x. There are several exact and iterative approaches to solving such a system of linear equations. Based on the steady-state probabilities, advertisements can be placed by selecting the locations with the highest probabilities. Figure 3.11.7 illustrates an example of this. Note that this approach requires the logging of transitions, which requires space constant with respect to the number of walks.

Greedy, Marginal Gain Maximizing Approach

This approach is based on selecting locations in a greedy manner, maximizing the marginal gain with each added location. Initially, the location with the highest frequency (or coverage) is selected. This is followed by considering each location (not already selected) and evaluating the coverage of the newly formed set of locations (the previously selected locations plus the current location). After evaluating each location in this manner, the location that maximizes the coverage is added to the set. This is illustrated in Figure 3.11.8. Note that this approach considers the marginal coverage when considering a new location to add and therefore produces solutions closer to optimal than those of the frequency-based approach. Another important point is that this approach requires logging the actual walks—that is, the space required to log the necessary data grows linearly with the length of the walks as well as the number of walks.

Figure 3.11.6 Suboptimal selection by frequency maximizing approach.

Experimental Comparison

We experimentally compared the three approaches to advertisement placement on a dataset of 2,436 walks.
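To make the contrast between the frequency-based and greedy marginal-gain approaches concrete, here is a small sketch. The walks are contrived (in the spirit of the Figure 3.11.6 discussion) so that the two approaches disagree: A overlaps heavily with the other high-frequency location, so greedy selection ends up preferring the low-frequency but non-overlapping location.

```python
# Sketch: frequency-based vs. greedy marginal-gain placement on toy walks.
from collections import Counter

def coverage(walks, locations):
    locs = set(locations)
    return sum(1 for w in walks if locs & set(w)) / len(walks)

def frequency_placement(walks, budget):
    # Count walks (not visits) passing through each location.
    counts = Counter(loc for walk in walks for loc in set(walk))
    return [loc for loc, _ in counts.most_common(budget)]

def greedy_placement(walks, budget):
    chosen = []
    candidates = {loc for walk in walks for loc in walk}
    for _ in range(budget):
        # Add the location with the best marginal coverage gain.
        best = max(candidates - set(chosen),
                   key=lambda loc: coverage(walks, chosen + [loc]))
        chosen.append(best)
    return chosen

walks = [["B", "A"], ["B", "A"], ["B", "A"], ["A"], ["C"], ["C"]]
```

With a budget of two, frequency-based placement picks the two most-visited locations (A and B), whose walks overlap, while greedy placement picks A and then C and covers every walk.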
More details on the dataset can be found in [Cook07] and [Youngblood08]. Experiments were conducted for various sizes of training sets, ranging from 0.25 percent to 50 percent of the entire dataset, while the remaining data was used for testing. (Advertisement placements were selected based on the walks in the training set, and these placements were evaluated on the test set.)

Figure 3.11.7 Computation of steady-state probabilities.
Figure 3.11.8 Computing the marginal coverage.

Figure 3.11.9 shows the coverage achieved by each of the three approaches for different numbers of advertisements, with 5 percent of the data used for training, on the training set. Figure 3.11.10 shows coverage on the test set. As a baseline, we also include a random placement approach in the experimentation. Each result is an average over five runs on different samples of training and test sets. Results indicate that the greedy marginal gain maximizing approach significantly outperforms both the frequency-based and the Markov steady-state probability-based approaches. The frequency-based approach is comparable to the Markov steady-state probability-based approach. Similar results are observed for various other training sizes greater than 5 percent. An interesting observation is that we get diminishing returns with an increased number of advertisements placed. That is, a lot of coverage is achieved by the initial advertisements, but after about seven or eight advertisements are placed, there is very little improvement in coverage. The closeness between the coverage on the training sets and the coverage on the test sets implies that our advertisement placement generalizes well to unseen data. However, this is not the case for very small training sizes.
Figure 3.11.11 shows the coverage achieved by each of the three approaches for various budgets on the number of advertisements placed, with 0.25 percent of the data used for training, on the training set. Figure 3.11.12 shows coverage on the test set. For such a small training set, 100-percent coverage is achieved with very few advertisements, but these results do not carry over to the unseen data. For very small training sizes, we see that the performance of each of the three approaches is no better than random. The important point to take home is to have a sufficiently large sample size. Unfortunately, our present work does not include a theoretical bound on the sample size, and we advise users to partition their datasets (into training and testing sets) to determine the appropriate size experimentally. In general, the size of the training set depends on the size of the world and the variability in the walks, and we are working toward proving an upper bound on the training set size.

Figure 3.11.9 Comparison of approaches to advertisement placement. Five percent of the data used for training, coverage on the training set.
Figure 3.11.10 Comparison of approaches to advertisement placement. Five percent of the data used for training, coverage on the test set.
Figure 3.11.11 Comparison of approaches to advertisement placement. 0.25 percent of the data used for training, coverage on the training set.

Each of the three approaches generalizes well to future data given a sufficient amount of training data. In our experimentation, the greedy marginal gain maximizing approach significantly outperforms the frequency-based and Markov steady-state probability-based approaches.
However, it should be noted that for a large-scale implementation, the greedy marginal gain maximizing approach requires the logging of the actual walks, which can consume space that grows linearly with the number of walks as well as the length of walks. The other approaches require the logging of visits to particular locations or transition probabilities, which is constant with respect to the number of walks.

Building Player Profiles with Clustering

The idea behind clustering is partitioning a given set of examples into subsets (referred to as clusters) such that examples in each subset are similar to other examples in the subset by some measure. Cluster analysis is an unsupervised learning technique that allows us to categorize data such that trends in the data are identified. A good introduction to cluster analysis can be found in [Jain99]. In the case of player trace analysis, we are interested in building player profiles that group players into categories such that players in a group have similar behaviors. Figure 3.11.13 shows player traces for six different players. Although visualizing such information can lead to important insight, this is only possible for small datasets.

Figure 3.11.12 Comparison of approaches to advertisement placement. 0.25 percent of the data used for training, coverage on the test set.

When player traces become long or there are too many player traces, analyzing them visually becomes a tedious process. Clustering player traces serves as an important step in analyzing traces because it reduces the data from individual player traces to groups of player traces. Since there are far fewer groups than individual traces, it becomes possible to perform a visual analysis of these groups instead of individual traces.
The key challenge in applying clustering algorithms to player trace data is that clustering algorithms are typically designed for attribute-valued data (data represented as a single table), and player trace data is structured and cannot be represented as a single table without losing important information. A simple example of such data can be a table representing information about customers, where each row represents a customer and each column represents a specific attribute, such as age or yearly income. To produce such a grouping, a similarity measure between two examples (in this case, customers) is required. There are several distance measures for attribute-valued data—for instance, Euclidean distance—which can be used to achieve good results. Contrast this customer data (represented as a table) with the player trace data introduced earlier, which cannot be represented as a table. The notion of Euclidean distance cannot be used to measure the similarity between two player traces. This is because Euclidean distance can only be used on data points represented as n-dimensional vectors of equal length, and walks in the world are sequences of points with variable length.

Figure 3.11.13 Visualizing player traces in 3D space.

To address this difficulty, we introduce a similarity measure between two player walks in the world. Using this similarity measure, any of the standard clustering algorithms can be applied to clustering player trace data.
However, these repeating behaviors are interlaced with other actions that are not common to all the walks. Such a case is illustrated in Figure 3.11.15. Here we have four walks, where Walks 1 and 2 have two behaviors in common. Walks 3 and 4 also have two behaviors in common. The uncommon actions represent the variability or the noise in the data and should be ignored. What should be considered are the sequentially repeating, common aspects of the walks, which are in fact captured by the LCS. If we group the walks in Figure 3.11.15 based on LCS, we will have two groups—the first with Walks 1 and 2 and the second with Walks 3 and 4. However, the LCS by itself does not take into account what fraction of two traces is similar. For example, in Figure 3.11.15, Walks 3 and 4 have much longer chunks of portions in common (with respect to the length of the entire walk) as compared to Walks 1 and 2. Hence, Walks 3 and 4 are much more similar to each other than Walks 1 and 2. Figure 3.11.14 Longest common subsequence. To take this into consideration, we define the similarity measure as: where A and B are the two walks under consideration. Note that this similarity measure is typically a number between 0 and 1. Identical walks will have a similarity measure of 1, while completely dissimilar walks will have a similarity measure of 0. The LCS problem is quite well studied in literature, with numerous applications in bioinformatics. An O(mn) time algorithm (where m and n are the lengths of the input sequences) for LCS can be found in [Wagner74]. Applying Standard Clustering Algorithms Using the distance measure for player trace walks specified earlier, it is now possible to extend any of the standard clustering algorithms for the task of clustering player traces. The general idea is to replace the distance computation (which is typically Euclidian distance) by the distance measure based on LCS. 
Alternatively, for a given set of walks, we can precompute a similarity matrix, which is basically a triangular matrix where each entry indicates the similarity between the example in the row and the example in the column (illustrated in Figure 3.11.16). Many clustering algorithms can operate on such a similarity matrix to produce clusters that can be used for subsequent analysis.

Figure 3.11.15 Longest common subsequence captures common behavior.

A common way to visualize clustering results is a dendrogram, a treelike structure that depicts the similarity between the examples. We illustrate a dendrogram on the first five walks (due to space limitations) in our dataset in Figure 3.11.17. The overall procedure for cluster analysis is to first generate a dendrogram for the entire dataset, as depicted in Figure 3.11.17 (source code for generating a dendrogram has been provided on the CD, and more details can be found in [Jain99]), and then focus on specific clusters to understand their common elements. The dendrogram is an extremely effective tool for visual data analysis because it allows the user to focus on specific samples in the data rather than the entire dataset.

Figure 3.11.16 Similarity matrix.
Figure 3.11.17 Results of clustering (on a very small subset).

The LCS procedure can also be used to produce the common subsequence itself, thereby identifying such common elements. For example, in Figure 3.11.17, the common element between Walks 1 and 3 is the walk up the stairs. Following such a procedure (looking at clusters and identifying common elements) allows the identification of several important behaviors.
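To make the "any standard clustering algorithm" step concrete, here is a hypothetical sketch of a simple agglomerative (single-link) pass driven purely by a precomputed similarity matrix, so that the LCS-based measure (or any other) can be plugged in. The function name and the threshold-based stopping rule are illustrative assumptions, not the book's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-link agglomerative clustering over a precomputed similarity matrix.
// Clusters are merged as long as some cross-cluster pair has similarity at
// least `threshold`. Returns one cluster label per example.
std::vector<int> ClusterBySimilarity(
    const std::vector<std::vector<double>>& sim, double threshold)
{
    const std::size_t n = sim.size();
    std::vector<int> label(n);
    for (std::size_t i = 0; i < n; ++i) label[i] = static_cast<int>(i);

    bool merged = true;
    while (merged) {
        merged = false;
        for (std::size_t i = 0; i < n && !merged; ++i)
            for (std::size_t j = i + 1; j < n && !merged; ++j)
                if (label[i] != label[j] && sim[i][j] >= threshold) {
                    const int from = label[j], to = label[i];
                    for (int& l : label)      // merge j's cluster into i's
                        if (l == from) l = to;
                    merged = true;
                }
    }
    return label;  // label[k] identifies the cluster of walk k
}
```

Recording the merge order (and the similarity at which each merge happens) instead of just the final labels is what yields the dendrogram described above.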
Detecting Bots and Gold Farmers with Classification Models

The basic task underlying the detection of bots and gold farmers is that of learning a binary classification model. This task is quite well studied in the field of machine learning and is commonly referred to as supervised learning. In supervised learning, we are given a set of examples labeled as positive or negative (positive and negative being the two classes, or categories), and a supervised learning algorithm induces a function that can classify unseen examples into these two categories. A good introduction to supervised learning can be found in [Mitchell97]. While the supervised learning problem is quite well studied, most algorithms for supervised learning deal only with attribute-valued data. As mentioned earlier, player traces cannot be represented as attribute-valued data, and hence applying existing algorithms to the task can be quite challenging. An important class of supervised learning algorithms is support vector machines (SVMs), which have been successfully applied in many application domains. A good introduction to SVMs can be found in [Cristianini00]. While SVMs also typically deal with attribute-valued data, they can be extended to operate on structured data by specifying a kernel function. A kernel function computes a similarity measure between two examples. The LCS-based similarity measure used for clustering can also be used as a kernel function, allowing us to apply SVMs to classify player traces.

Using an LCS-Based Similarity Measure with K-NN

We begin by discussing the use of the LCS-based similarity measure with the K-Nearest Neighbor (K-NN) algorithm, a relatively simple classification algorithm that allows the reader to develop an intuition for the task of player trace classification.
The K-NN algorithm is a simple lazy algorithm that stores all the input examples; when a prediction is to be made on an unseen example, it first computes the K nearest neighbors using some measure of similarity and predicts the class of the unseen example as the majority class of its neighbors. Typically, in the case of attribute-valued data, Euclidean distance is used to measure the similarity between two examples. Although simple, the K-NN algorithm can produce good classification models. To extend the K-NN algorithm to operate on player trace data, we use the LCS-based measure to calculate neighbor distance. To predict whether a given player trace is a bot trace or a human trace, we find the K nearest neighbors of the trace under consideration and predict its class (bot or human) based on the majority of the neighbors. To see why the LCS similarity measure serves the purpose of distinguishing between bots and humans, consider the following observations. First, bots (or gold farmers) constantly repeat a set of actions. While these actions may be interlaced with random movements, in order to achieve their objective (for example, killing boars in World of Warcraft (WoW) to gain experience points), they have to repeat some sequence of actions. Second, the areas of the world where these actions can be performed are specific. (For example, there are particular locations in WoW that are intended for neophyte players to kill boars and gain experience points.) A set of player traces that represent bots (or gold farmers) will therefore have specific repeating locations that are easily captured by the LCS. While K-NN is conceptually simple, it is computationally infeasible for the task of player trace analysis on large datasets. This is because, in order to identify the K nearest neighbors, we have to compute the LCS similarity measure of the unseen example against all the other examples in the training set.
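The K-NN prediction step can be sketched as follows, with the similarity measure passed in as a parameter so the LCS-based measure can be substituted. All names here are illustrative, and the bot/human labeling is an assumption for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// K-NN majority vote over a user-supplied similarity measure
// (higher similarity = closer neighbor). true = bot, false = human.
template <class Example>
bool ClassifyKnn(const Example& query,
                 const std::vector<Example>& train,
                 const std::vector<bool>& labels,
                 std::size_t k,
                 const std::function<double(const Example&, const Example&)>& similarity)
{
    // Rank training examples by similarity to the query, most similar first.
    std::vector<std::size_t> order(train.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return similarity(query, train[a]) > similarity(query, train[b]);
    });

    // Majority vote among the K most similar neighbors.
    k = std::min(k, order.size());
    std::size_t botVotes = 0;
    for (std::size_t i = 0; i < k; ++i)
        if (labels[order[i]]) ++botVotes;
    return botVotes * 2 > k;
}
```

Note that, as the text points out, every prediction pays one similarity evaluation per training example, which is exactly why this approach does not scale to real-time detection.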
The LCS similarity measure can be computed in O(mn) time (where m and n are the lengths of the input sequences). This is sufficiently fast for batch processing (offline classification and analysis of player traces); however, when bot detection needs to be performed in real time, it is too slow. To address this issue, we need a more sophisticated technique, namely SVMs.

Using an LCS-Based Similarity Measure with SVMs

The LCS-based similarity measure can also be used in conjunction with SVMs. SVMs typically operate on attribute-valued data and, given a set of training examples (categorized into two categories), produce a hyperplane (a higher-dimensional plane) that separates the examples into the two categories. To classify an unseen example, its distance from and orientation with respect to the hyperplane are computed, and based on this, we can make a prediction about its category. Figure 3.11.18 illustrates this process.

Figure 3.11.18 Support vector machines.

The hyperplane, more correctly referred to as the maximum margin hyperplane, is the plane that puts the maximum distance between the positive and negative examples. The maximum margin hyperplane is defined by the examples that, in a sense, lie on the boundary of the positive and negative regions; these are referred to as the support vectors. The key point to note here is that in order to classify an unseen example, the LCS measure only needs to be computed against the support vectors, and not the entire set of examples. This significantly speeds up prediction.

Conclusion

We presented a number of techniques for analyzing data in MMORPGs, dealing with specific problems such as advertisement placement, profile building, and bot detection. In conclusion, the most important point we would like to convey to the community is the added value of logging player data.
Understanding interaction and behavior in virtual worlds can help us design better virtual worlds, and this is only possible through the collection and analysis of such data. In most cases, the cost of collecting such data is a small price to pay for the insight gained by analyzing it.

References

[Cook07] Cook, D. J., L. B. Holder, and G. M. Youngblood. "Graph-Based Analysis of Human Transfer Learning Using a Game Testbed." IEEE Transactions on Knowledge and Data Engineering 19.11 (2007): 1465–1478.
[Cristianini00] Cristianini, N. and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[Dixit08] Dixit, Priyesh N., and G. Michael Youngblood. "Understanding Information Observation in Interactive 3D Environments." Sandbox '08: Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games. 2008. 163–170.
[Jain99] Jain, A. K., M. N. Murty, and P. J. Flynn. "Data Clustering: A Review." ACM Computing Surveys 31.3 (1999): 264–323.
[Mitchell97] Mitchell, T. Machine Learning. WCB McGraw-Hill, 1997.
[Wagner74] Wagner, R. A., and M. J. Fischer. "The String-to-String Correction Problem." Journal of the ACM (JACM) 21.1 (1974): 168–173.
[Youngblood08] Youngblood, G. M. and P. N. Dixit. "Understanding Intelligence in Games Using Player Traces and Interactive Player Graphs." Game Programming Gems 7. Boston: Charles River Media, 2008. 265–280.

SECTION 4
GENERAL PROGRAMMING

Introduction
Doug Binks, Intel Semiconductors AG

Game programming, like many disciplines, is becoming increasingly specialized. The steady trend of technical innovation, along with the broadening requirements of game development, forces us to focus our finite mental resources on an ever-narrowing section of the field.
Yet the very basis of the programming endeavor is the ability to coerce the computational architecture into performing to our will. Most of the gems in this section deal with this—the fundamental art of game programming. Performance is a critical aspect of most game software, and so several gems deal with this issue, either directly, by showcasing solutions for common tasks with enhanced performance, or indirectly, by providing a better understanding of some aspect of performance programming. Multi-threading is an increasingly important area for programmers looking for more cycles in which to execute their instructions, and suitably, a pair of articles targets this. Several articles deal with memory issues, from allocation to optimization and profiling. A good part of getting a system to do what we want is ensuring that it actually does. In this vein, a couple of articles deal with error logging and enabling the QA process. The solutions presented require minimal effort to implement, so they stand a good chance of being widely used if included in a code base. Other articles deal with functionality. It's here that the gems cover the widest area, partly by addressing general approaches to adding functionality and partly by describing specific but different types of functionality. Continuing the trend of many previous editions, there's a bias toward tools—rightly so, as tools play an ever more important role in game development. Whether you're a jack of all trades, a master of one, or new to game programming, you'll find a good deal of useful innovation, information, and experience within this section.

4.1 Fast-IsA
Joshua Grass, PhD

Many advanced scripting languages have notions of class hierarchies similar to those in programming languages such as C++, C#, or Java. Scripts written in these languages often need to perform safe casts or IsA checks on objects.
In our game, This Is Vegas, we found that the amount of time spent performing the IsA check was not insignificant. This gem describes a method for processing class hierarchy data to change the IsA operation from O(N) to O(1). In our case, this resulted in a performance improvement of more than one percent for the cost of adding one DWORD per class (depending on your platform; if the platform does not have the BitScanReverse operator, you will need to store the location of the most significant bit along with the index). The algorithm is also an interesting study in combining several well-known data structures that we often see in school but rarely get to use to achieve tangible results.

Problem Definition

Given a class hierarchy, we need to be able to determine whether Class A is a subclass of Class B. A typical class hierarchy might look like Figure 4.1.1.

Figure 4.1.1 An example class hierarchy.

Our first implementation of an IsA() function would be as follows:

bool IsA(Class *pA, Class *pB)
{
    while (pA != NULL)
    {
        if (pA == pB)
        {
            return true;
        }
        pA = pA->GetParentClass();
    }
    return false;
}

This algorithm has two major problems. First, the worst-case scenario requires a traversal from leaf to root of the class tree, which can be very expensive if you are frequently doing IsA tests on leaf nodes (for example, we have an array of humans, and we want to process only thieves). The second problem is one of cache behavior. The class metadata may be loaded anywhere in memory, and if one of the classes we traverse is not currently in the cache, this operation can result in cache thrashing and low performance.

Balanced Class Hierarchies

While a graph with a variable number of branches at each node gives the system a huge amount of flexibility, it also means that there is no regular way in which we can store or access the hierarchy.
Let's imagine that we were incredibly lucky in our class hierarchy, and at the very end of the project, we had a uniform graph in which each node branched exactly twice, such as the class hierarchy displayed in Figure 4.1.2. This graph has many useful properties, the main advantage being that there is a simple way of laying out the classes such that they can fit in one contiguous array of memory. Programmers writing A* algorithms use this structure (a heap) all the time because it eliminates the need for storing pointers and makes memory management of an open list extremely easy.

Figure 4.1.2 A balanced binary class hierarchy.

Class:        NULL  Object  Human  Weapon  Warrior  Mage  Sword  Bow
Class index:  0     1       2      3       4        5     6      7
Level:        0     1       2      2       3        3     3      3

Each level we add to the tree adds 2^(N-1) new nodes to the storage array, where N is the new level. So if we were to add Level 4 in our example, we would need to add eight additional items to our array. Adding Level 5 would add 16 new nodes, and so on. We start our table at Entry 1 instead of Entry 0 for reasons that will be discussed later in the gem. What we're really interested in here is the index of the nodes and their relationship to their parents. I will refer to this index as the class index for the rest of the gem. In the above table, the class index is the second row. It is important to note that we do not actually create a heap or store the classes in it. We use the heap structure purely to create a useful ordering for the classes. The function for getting the parent's class index of a node is trivial:

int parentIndex(int nIndex)
{
    return nIndex >> 1;
}

If we take the class index of Sword (6) and right-shift it by 1, we get the result 3, which is the class index of the parent node, Weapon. We can do this again and determine that the parent node of Weapon (3) is, in fact, Object (1).
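The parent-index relationship can be exercised in isolation. This small sketch repeats parentIndex() and adds an illustrative helper (not from the original text) that derives a node's level by counting shifts back to the root at Index 1.

```cpp
#include <cassert>

// Heap-style parent lookup: right-shifting a class index walks up the
// (purely virtual) tree one level at a time.
int parentIndex(int nIndex) { return nIndex >> 1; }

// Illustrative helper: number of shifts needed to reach the root (index 1).
int levelOf(int nIndex)
{
    int level = 1;  // the root, Object, sits at Level 1 in the table above
    while (nIndex > 1) { nIndex = parentIndex(nIndex); ++level; }
    return level;
}
```

Using the indices from Figure 4.1.2, Sword (6) walks up to Weapon (3) and then Object (1), so it sits at Level 3.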
Given this perfectly balanced tree, we can easily rewrite our IsA function to use the class indices in the storage array to determine whether a class is a subclass of another.

bool IsA_Balanced2Tree(Class *pA, Class *pB)
{
    int nAIndex = pA->GetClassIndex();
    int nBIndex = pB->GetClassIndex();
    while (nAIndex != 0)
    {
        if (nAIndex == nBIndex)
        {
            return true;
        }
        nAIndex = nAIndex >> 1;
    }
    return false;
}

While this function doesn't look much better initially, it does have one huge advantage over our previous algorithm: It doesn't depend on any information from the parent classes of A. It only retrieves information from Class A and from Class B, which are very likely to already be in the cache. So we have eliminated the possibility of any unnecessary cache misses for the intermediary classes between A and B or, in the worst-case scenario where A is not a child of B, for all of the parent classes of A. We can further improve this algorithm by realizing that once the index for a parent of A is less than the index for B, there is no way that they are ever going to be equal.

bool IsA_Balanced2Tree_V2(Class *pA, Class *pB)
{
    int nAIndex = pA->GetClassIndex();
    int nBIndex = pB->GetClassIndex();
    while (nAIndex >= nBIndex)
    {
        if (nAIndex == nBIndex)
        {
            return true;
        }
        nAIndex = nAIndex >> 1;
    }
    return false;
}

This drastically reduces our worst-case scenario. In the case where we are testing a list of Humans to see whether they are Mages, we can halt immediately if they are Warriors, because the index of Warrior (4) is less than the index of Mage (5). Even in the case where we were searching for Warriors, we would only need to do one right shift before we could halt the function and return false.

Eliminating the Tree Traversal

The class indices have an additional property that allows us to remove the while loop from our function.
Here is the child function for our nodes in our class tree:

int childIndex(int nIndex, bool bRight)
{
    if (bRight)
        return (nIndex << 1) + 1;
    else
        return (nIndex << 1);
}

Any child of Node A has an index equal to the index of Node A left-shifted a number of times, plus a number defining the child's position in the sub-tree. The usefulness of this observation becomes much more apparent if we write out the indices in binary (see Figure 4.1.3). This observation allows us to make the following rule: If Class A is a child of Class B, then the leftmost N bits of A will match B, where N is the number of significant bits in B. This works because we started our class hierarchy at Index 1, so we know that all indices are 1 followed by an arbitrary number of bits. Using this rule, we can write our IsA function one more time without the use of the while loop. (Non-constant bit-shift operators are emulated on the PS3, but this is implemented in microcode, so it is still much faster than a while loop.)

bool IsA_Balanced2Tree_V3(Class *pA, Class *pB)
{
    int nAIndex = pA->GetClassIndex();
    int nBIndex = pB->GetClassIndex();
    if (nAIndex <= nBIndex)
        return nAIndex == nBIndex;
    nAIndex = nAIndex >> (BSR(nAIndex) - BSR(nBIndex));
    return nAIndex == nBIndex;
}

The BSR function in our case is a wrapper for an inline assembly function that uses the BSR assembly instruction (BitScanReverse). This instruction returns the index of the leftmost set bit/most significant bit, which is exactly what we need for this algorithm. If a platform does not have the BSR assembly instruction, we can easily pre-calculate this value and store it in the Class object along with the array index (GetArrayIndexMSB()). This was our first implementation of the function before we found out about the BSR instruction. Finally, we can take advantage of one further property of the right-shift operator on our target platforms: If the amount to shift is negative, the result is 0. (Note that this is platform-specific behavior; a negative shift count is undefined in standard C++.)
And since we start our class hierarchy with an index of 1, no class will match 0. This leads to our final implementation of Fast-IsA.

bool FastIsA(Class *pA, Class *pB)
{
    int nAIndex = pA->GetClassIndex();
    int nBIndex = pB->GetClassIndex();
    return nBIndex == (nAIndex >> (BSR(nAIndex) - BSR(nBIndex)));
}

Figure 4.1.3 Binary representation of the indices of a node and its children.

Building a Balanced Tree

All of the previous work has been built upon the notion that our class hierarchy is a perfectly balanced binary tree. In practice, this is rarely the case. Luckily, what we want out of the IsA function isn't any notion of depth between nodes, but only whether they are in fact ancestors. Because of this, there is no reason why we cannot insert phantom classes to balance our tree. Figure 4.1.4 displays an example of the transformation from an unbalanced to a balanced hierarchy. In the case of these two class trees, every possible IsA relationship is maintained. In any situation where we have more than two direct children of a class, we can insert a number of phantom class nodes between the parent and the children to ensure that the IsA tree is a balanced binary tree. It is important to realize that while we are using the notion of a heap to generate the indices, we are not actually storing anything in this structure. It is purely virtual, so adding large numbers of phantom nodes to balance the tree does nothing except use up our index space. For most games, a 32-bit DWORD will contain more than enough space for the class hierarchy. The simplest implementation for building the class tree is the following algorithm. I recommend implementing this and determining whether you are close to running out of index space before moving to a more complicated algorithm.
void BuildTree(Class *pA)
{
    int nCurrentClassIndex = pA->GetClassIndex();
    int nNumChildClasses = pA->GetNumberOfChildClasses();
    int nNumLevels = BSR(nNumChildClasses) + 1;
    int nChildIndexStart = nCurrentClassIndex << nNumLevels;
    for (int i = 0; i < nNumChildClasses; ++i)
    {
        Class *pChild = pA->GetChildClass(i);
        pChild->SetClassIndex(nChildIndexStart + i);
        BuildTree(pChild);
    }
}

Figure 4.1.4 Converting a three-child node into a binary hierarchy.

The heart of implementing a more complicated class tree construction algorithm is realizing that in most cases we have a fair amount of play in how we actually lay out the phantom nodes. Take the four-child case (see Figure 4.1.5). Both of these class trees have exactly the same IsA relationships, so to the Fast-IsA algorithm there is no distinction. If our class tree were very complex (or deep), we could balance the tree based on the number of subclasses or the maximum depth of any subclass of a class. This leads to an algorithm similar to Huffman encoding to optimize the class tree such that the maximum depth of any leaf is minimized. With more than 3,700 script classes in This Is Vegas, we never encountered a problem with our hierarchy depth. If your class tree is greater than 32 levels deep, there is no reason why you cannot simply change your class index from a DWORD to a QWORD. The memory costs are minimal (since we only pay per class and not per instance), and 64-bit processors will be able to perform these operations at the same speed.

Figure 4.1.5 Alternate ways of decomposing a four-child node into a binary hierarchy.

Conclusion

Moving to the Fast-IsA algorithm had a large impact on the performance of our game for the cost of only one additional DWORD per class. In our case, it was a performance improvement of more than one percent for the entire game and far greater in specific portions of the code.
Given how simple the implementation is and how little it costs, this was an easy optimization win for us. We also found that the Fast-IsA function improved performance in our pipeline's baking process, which uses a large number of IsA checks.

4.2 Registered Variables
Peter Dalton, Smart Bomb Interactive

Inter-system communication is a critical consideration in a game engine, often dictating the broad architecture of the code base. In practice, sacrifices are often made to allow one system to know about the internal workings of another in order to accommodate better communication. Although sometimes this can be appropriate, it often results in a loss of modularity. These compromised systems lose their "black box" characteristics and become harder to maintain and replace. This gem presents a solution to this problem by demonstrating a technique for linking up shared variables across disparate systems. This allows systems to define a set of inputs or control variables that can be seamlessly linked to variables in other systems. It is important to recognize that this is not a messaging system, nor is it meant to replace one. Rather, it is a system that allows a programmer to control communication across various systems without requiring blind casts, global variables, or a flat class hierarchy. This technique allows for basic variable types, such as integers and floats, as well as complex data types, such as arrays, classes, and other user-defined types. It has been successfully utilized to facilitate the communication required to control animation systems, user interface parameters, display shader parameters, and various other systems where variable manipulation is required.

Getting Started

The basic idea is to create a wrapper for a variable and then allow these wrapped variables to be linked together.
The code that depends upon the variables can be implemented without any special considerations: rather than just accessing the value that has been wrapped, the registered variable will walk the chain of linked variables and provide access to the appropriate variable. When building registered variables, there are several key goals to keep in mind.

• Keep the registered variable seamless. To make a registered variable truly useful, it needs to be easy to work with. The goal is to make it transparent to programmers whether they are using a regular integer or our newly created integer registered variable. Operator overloading will be the key here.
• Allow one registered variable to be linked to another. We are going to allow registered variables to set redirectors, or in other words, allow registered variables to be chained together.
• Track a "dirty state." To enhance the usefulness of a registered variable, we will include a dirty state in the variable. This provides users with knowledge of when the variable has actually changed, which is useful for run-time optimizations.
• Custom run-time type information. This will become necessary when we start registering variables together. It allows us to confidently cast to a specific type without the need for blind casts.
• Provide a way to link registered variables directly. We will provide an initial, explicit method for linking variables together. This method is important when dealing with specific situations where the control variables are well defined.
• Provide a way to link registered variables indirectly. As our systems grow in complexity, we want to allow variables to be generically linked together without either system knowing the internal details of the other. This indirect method will become the key to dealing with complex situations where the control variables are not all well defined or are ambiguous.
Assumptions

The code we present is taken from a commercial Xbox 360 engine. It utilizes several routines and data structures provided by the base engine that are beyond the scope of this gem. These dependencies are minimal; however, we need to mention them explicitly in order to avoid confusion. A basic implementation of these data structures has been included on the accompanying CD-ROM.

TArrays
The TArray class is a templated array holder. It is used to hold links to other registered variables.

FNames
This class implements a string token system. It is a holder for all of the strings that exist within the game engine. Each string is assigned a unique identifier by which it can then be referenced and compared against other FNames at a constant cost of O(1). This functionality is the key to implementing the required run-time type information and the means by which registered variables are given unique names for linking.

The Base Class: RegisteredVar

The base class from which all registered variables are derived is the RegisteredVar class. This class provides all of the support for linking registered variables together and tracking the dirty state. Only key portions of the RegisteredVar class are shown here; a complete implementation can be found on the accompanying CD-ROM.

class RegisteredVar
{
public:
    // Provides IsA<>() and GetClassType() routines, described later.
    DECLARE_BASEREGISTERED_VARIABLE( RegisteredVar );

    RegisteredVar() : m_bDirty(false), m_pRedirector(null) {}

    virtual ~RegisteredVar()
    {
        if (m_pRedirector)
            m_pRedirector->m_References.RemoveItem( this );
        while (m_References.Num())
            m_References[0]->SetRedirector( null );
    }

    void SetRedirector( RegisteredVar* InRedir )
    {
        if (InRedir != this &&
            (!InRedir || (InRedir->IsA( GetClassType() ) &&
                          !InRedir->IsRegistered( this ))))
        {
            if (m_pRedirector)
                m_pRedirector->m_References.RemoveItem( this );
            m_pRedirector = InRedir;
            if (m_pRedirector)
                m_pRedirector->m_References.AddItem( this );
        }
    }

    void SetDirty( bool InDirty, bool InRecurse=false );
    bool IsDirty() const;

    void SetFName( FName InName ) { m_Name = InName; }
    FName GetFName() const { return m_Name; }

protected:
    template <class T> T* GetBaseVariable() const
    {
        return m_pRedirector ? m_pRedirector->GetBaseVariable<T>() : (T*)this;
    }

    FName m_Name;
    bool m_bDirty;
    RegisteredVar* m_pRedirector;
    TArray<RegisteredVar*> m_References;
};

There are two key elements to getting this class correct. The first is preventing dangling pointers in the destructor. The important consideration here is that since we are going to be linking up registered variables blindly between systems, we do not want to end up pointing to a registered variable that has been deleted. That scenario would result in dangling pointers to invalid memory addresses and severe headaches. To prevent this, we create a link back to the referencing registered variable so that we can clean it up when the referenced registered variable is deleted. This would normally create a doubly linked list; however, in our case it is common for multiple registered variables to redirect to a single registered variable, creating the need for an array of pointers, as illustrated in Figure 4.2.1.
Figure 4.2.1 This diagram illustrates how registered variables are linked together and how we track referencing registered variables to avoid dangling pointers.

The second key point to keep in mind is that any time you access a registered variable, you need to ask yourself an important question: Should I be working with "this" copy of the registered variable, or should I forward the request to the redirected registered variable? If you decide that the correct answer is to work on the redirected registered variable, the GetBaseVariable() routine will retrieve the base registered variable that should be used.

Single Variable Values versus Array Variable Values

The next step is to divide all the variables into two distinct classifications: single value types and array types. The first classification, single value types, encompasses integers, floats, user-defined types, and so on and will be the focus of the examples provided. The second, array types, encompasses arrays of integers, floats, user-defined types, and so on. The implementation of array types is very similar to that of single value types, with just a few minor alterations; it has been provided on the accompanying CD-ROM. Having made this distinction, we will now create a templated base class that provides 99 percent of the functionality required by any variable type.

template <class T, class RegVar>
class RegisteredVarType : public RegisteredVar
{
public:
    RegisteredVarType();

    T Get() { return GetBaseVariable<RegVar>()->m_Value; }
    const T& Get() const { return GetBaseVariable<RegVar>()->m_Value; }
    void Set( const T& InV ) { GetBaseVariable<RegVar>()->SetDirectly( InV ); }

    operator T() const { return GetBaseVariable<RegVar>()->m_Value; }
    operator T&() { return GetBaseVariable<RegVar>()->m_Value; }
    void operator=( const RegVar& InV ) { Set( InV.Get() ); }

    // Implement comparison operators >, <, >=, <=, ==, !=; see CD-ROM.
    bool operator>( const T& InV ) { return Get() > InV; }

    // Implement mathematical operators /, *, +, -; see CD-ROM.
    T operator/( const T& InV ) const { return Get() / InV; }

    // Implement assignment operators /=, *=, +=, -=; see CD-ROM.
    RegVar& operator/=( const T& InV ) { Set( Get() / InV ); return *(RegVar*)this; }

protected:
    void SetDirectly( const T& InValue )
    {
        if (m_Value != InValue)
        {
            m_Value = InValue;
            SetDirty( true );
        }
        for (int ii = 0; ii < m_References.Num(); ++ii)
            ((RegVar*)m_References[ii])->SetDirectly( InValue );
    }

    T m_Value;
};

Examining the code illustrates the emphasis placed on providing the appropriate overloaded operators so the programmer can use registered variables seamlessly. The programmer should not have to change code depending on whether a standard variable or a registered variable is in use. This ensures that the registered variable is used correctly and seamlessly. It also makes it easy to add and remove registered variables from a system, since only the variable definition and linking code needs to be updated. Another important consideration is the SetDirectly() routine used by the Set() method. The SetDirectly() routine first determines whether the value is actually different from the current value and sets the dirty flag if appropriate. This dirty flag allows the owner of the variable to effectively track when the state of the variable has truly changed, thus allowing for run-time optimizations. A common optimization when dealing with shader parameter blocks within DirectX is to prevent the blocks from being invalidated and rebuilt unless absolutely necessary. Thus, if you have a variable controlling the state of a shader, you will want to make sure that the variable has actually changed before processing it. You should also notice that there is no automatic means by which the dirty flag is cleared.
To clear the flag, the owner of the variable will need to explicitly call the SetDirty( false ) routine when the owner is done dealing with the change. Since the dirty flag is stored in each variable, the owner of the variable can deal with the flag in its own way. In the case of a variable controlling a shader parameter, we would not want to handle the variable change and rebuild the state block until the material is required by the renderer.

4.2 Registered Variables 367

However, another variable might also be linked to this state and want to handle the change immediately. It is also safe for the owner to choose to ignore the dirty flag if it isn’t required.

The SetDirectly() routine also has the task of copying the value to the entire chain of linked registered variables. This feature is important to retain the most recent value in the event that a registered variable clears its redirector, either explicitly or because the redirector is deleted. If the value was not copied, we would see a pop from the old value to whatever value is currently stored. While this might not be a critical issue, it can cause undesired behavior, as the variable might appear un-initialized. Copying the value is also useful when debugging, allowing the value to be easily shown in the watch window without digging through a list of linked variables.

Type-Specific Registered Variables

At this point we have built all the base classes required, and creating registered variables is now straightforward.

class RegisteredVarBOOL : public RegisteredVarType<bool>
{
    DECLARE_REGISTERED_VARIABLE( RegisteredVarBOOL, RegisteredVarType<bool> );
    RegisteredVarBOOL& operator=( const bool& InValue ) { Set( InValue ); return *this; }
};

class RegisteredVarFLOAT : public RegisteredVarType<float>
{
    DECLARE_REGISTERED_VARIABLE( RegisteredVarFLOAT, RegisteredVarType<float> );
    RegisteredVarFLOAT& operator=( const float& InValue ) { Set( InValue ); return *this; }
};

The listings above add support for both the standard Boolean and float types.
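Because classes like RegisteredVarFLOAT forward every operator to the stored value, call sites cannot tell them apart from a plain float. The operator-forwarding idea can be illustrated with a tiny self-contained stand-in (WrappedFloat is hypothetical and omits the redirection and dirty-flag machinery of the chapter’s real classes):

```cpp
#include <cassert>

// Toy wrapper mimicking the operator surface of RegisteredVarType<float>:
// conversion, assignment, and arithmetic all forward to the stored value,
// so existing client code compiles unchanged against either type.
class WrappedFloat
{
public:
    WrappedFloat(float v = 0.0f) : m_Value(v) {}
    WrappedFloat& operator=(float v) { m_Value = v; return *this; }
    operator float() const { return m_Value; } // implicit conversion
    WrappedFloat& operator/=(float v) { m_Value /= v; return *this; }

private:
    float m_Value;
};

// Existing code that expects a plain float, unmodified.
float Halve(float v) { return v * 0.5f; }
```

Swapping a plain float member for the wrapper therefore requires no changes at the call sites, which is exactly the seamlessness argued for above.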
Implementing additional types is as simple as duplicating the provided code and updating the names and types appropriately. Note that operator=() was not specified within the templated base class RegisteredVarType in order to resolve conflicts when using the Visual Studio 2008 C++ compiler.

Setting a Registered Variable Directly

We’ll now look at a simple example to illustrate what registered variables can do. We have a weapon class attached to a vehicle class, and the vehicle needs to tell the weapon when to fire. If we create a Boolean registered variable within the vehicle and link it to the weapon, we can then just manipulate the variable within the vehicle and control the state of the weapon. Also, if we have multiple components that need to know about the weapon firing, such as AI logic, user interfaces, or game code, we now only have one variable that needs to be updated to keep everyone in sync. In contrast, without using registered variables we would need to create a Fire() function within the weapon and call it to start and stop firing. We would also need to manually notify all other systems that the weapon is firing. The registered variable approach has the advantage that once the variables are correctly registered, it is much easier to control communication.

class Weapon
{
    void SetFireRegVar( RegisteredVar* InVar ) { m_Fire.SetRedirector( InVar ); }
    void HeartBeat( float InDeltaTime ) { if (m_Fire) FireWeapon(); }

    RegisteredVarBOOL m_Fire;
};

class Vehicle
{
    void Initialize() { m_MyWeapon.SetFireRegVar( &m_FireWeapon ); }
    void HeartBeat( float InDeltaTime ) { m_FireWeapon = DoWeWantToFire(); }
    bool DoWeWantToFire();

    RegisteredVarBOOL m_FireWeapon;
    Weapon m_MyWeapon;
};

IsA Functionality

The DECLARE_REGISTERED_VARIABLE macro requires further explanation to assist in understanding the implementation. The purpose of this macro is to provide type information for the registered variable.
It ensures that we do not link two registered variables together that are not of the same basic type. It also allows us to determine the type of registered variable that we have, given only a pointer to the base class RegisteredVar.

#define DECLARE_REGISTERED_VARIABLE( InClass, InBaseClass ) \
protected: \
    typedef InClass ThisClass; \
    typedef InBaseClass Super; \
public: \
    virtual FName GetSuperClassType( FName InComponentType ) const \
    { \
        FName SuperType = NAME_None; \
        if (InClass::StaticGetClassType() == InComponentType) \
            SuperType = Super::StaticGetClassType(); \
        else if (InComponentType != NAME_None) \
            SuperType = Super::GetSuperClassType( InComponentType ); \
        return SuperType == InComponentType ? NAME_None : SuperType; \
    } \
    virtual FName GetClassType() const \
    { \
        return InClass::StaticGetClassType(); \
    } \
    static FName StaticGetClassType() \
    { \
        static FName TypeName = FName( STRING( InClass ) ); \
        return TypeName; \
    }

#define DECLARE_BASEREGISTERED_VARIABLE( InClass ) \
    DECLARE_REGISTERED_VARIABLE( InClass, InClass ) \
    template<class T> bool IsA() const \
    { \
        return IsA( T::StaticGetClassType() ); \
    } \
    bool IsA( const FName& InTypeName ) const \
    { \
        for (FName Type = GetClassType(); Type != NAME_None; \
             Type = GetSuperClassType( Type )) \
        { \
            if (Type == InTypeName) \
                return true; \
        } \
        return false; \
    }

While this code is being utilized here to provide IsA functionality for registered variables, it is generic in nature and can be used to provide RTTI functionality to any class or structure. The code is written in the form of a macro to prevent it from being duplicated, since it is required at every level of the inheritance chain. An important consideration is to recognize that this implementation does not support multiple inheritance, but it could be extended to do so.
Setting a Registered Variable Indirectly

Now that we have basic RTTI information, we can safely link registered variables together without knowing the internals of other systems. Let’s examine another example. Suppose we have a material used for rendering that has a parameter we can adjust to change its damage state. The damage state is defined within the material and used to determine how the material is rendered. In this example, we would like to create a generic system in which a high-level object can register a variable with another object and have it correctly link to a control variable. We want a vehicle class to provide a variable to control the damage state of the material, and then the vehicle can drive the material’s control variable by simply modifying its own variable. In this example, adding a function or parameter to the Material class would not be desirable, because it would lead to bloat and would not be applicable to all materials.

class RegisterVariableHolder
{
    virtual void RegisterVariable( RegisteredVar& InVar ) {}
};

class BaseClass : public RegisterVariableHolder
{
    virtual void RegisterVariables( RegisterVariableHolder& InHolder ) {}
};

class Material : public BaseClass {};

class DamageStateMaterial : public Material
{
    void Initialize() { m_DamageState.SetFName( "DamageState" ); }

    virtual void RegisterVariable( RegisteredVar& InVar )
    {
        if (InVar.IsA<RegisteredVarFLOAT>() && InVar.GetFName() == m_DamageState.GetFName())
        {
            m_DamageState.SetRedirector( &InVar );
        }
    }

    RegisteredVarFLOAT m_DamageState;
};

class Vehicle : public BaseClass
{
    void Initialize()
    {
        m_VehicleDamageState.SetFName( "DamageState" );
        m_Fire.SetFName( "Fire" );
        RegisterVariables( *m_pMaterial );
    }

    virtual void RegisterVariables( RegisterVariableHolder& InHolder )
    {
        InHolder.RegisterVariable( m_VehicleDamageState );
        InHolder.RegisterVariable( m_Fire );
    }

    RegisteredVarFLOAT m_VehicleDamageState;
    RegisteredVarBOOL m_Fire;
    Material* m_pMaterial;
};

Now, whenever the vehicle
changes m_VehicleDamageState, the material class’s m_DamageState variable will be automatically updated without the material being required to provide accessor routines or the vehicle knowing the type of material it has been assigned. The vehicle can also ignore the material, since the only thing it needs to do is update its own registered variable. While the example is fairly simple, the principle can be applied to solve many more problems.

Conclusion

Within our game engine, we have found registered variables to be an essential part of inter-system communication because they abstract the communication layer and minimize system dependencies. Registered variables are utilized to control the state of animation flow systems, expose data to user interfaces (such as hit points and ammo counts), and control material parameters (such as damage states and special rendering stages). We provide tools within the game’s editor to allow artists and level designers to specify exactly which registered variables should be linked together within the game. Systems have been designed to allow users to dynamically create new registered variables within the game editor and link them to any other appropriate registered variable. For us, this has opened the door for content builders to access any set of data within the game engine and gives them the necessary controls to manipulate gameplay.

We hope that you will have fun experimenting with the concept of registered variables and that you will find them useful in improving your code. You will find an implementation of the techniques presented on the CD-ROM.

4.3 Efficient and Scalable Multi-Core Programming
Jean-François Dubé, Ubisoft Montreal

Nowadays, multi-core computers (and game consoles such as the Microsoft Xbox 360 and Sony PlayStation 3) are very common.
Programmers are faced with the challenges of writing multi-threaded code: data sharing, synchronization, deadlocks, efficiency, and scalability. Adding multi-threading to an existing game engine is an enormous effort and might give an initial speed gain, but will it run twice as fast if you double the number of cores? What if you run it on a system with 16 or more cores, such as Intel’s Larrabee architecture? Will it be able to run on platforms with co-processors, such as Sony’s PlayStation 3? In this gem, we’ll see how to write efficient and scalable multi-threaded code. The first section will deal with the “efficiency” part of the problem, while the second will deal with the “scalability” part.

Efficient Multi-Threaded Programming

Multi-threaded programming introduces a variety of issues the programmer must be aware of in order to produce code that performs the required operations correctly. Additionally, certain operations can lead to additional overhead in a multi-threaded program. In this section, we’ll look at high-performance methods for resolving these issues.

Shared Data

The main problem with multi-threaded programming is concurrent access to the same memory locations. Consider this simple example:

uint32 IncrementCount()
{
    static uint32 Count = 0;
    return Count++;
}

This is commonly translated into three operations: a load, an addition, and a store. Now, if two threads execute this function at almost the same time, what will happen? Here’s an example:

Thread 1 reads Count and stores it into register R1
Thread 1 increments R1
Thread 2 reads Count and stores it into register R1
Thread 2 increments R1
Thread 1 stores the value of R1 into Count
Thread 2 stores the value of R1 into Count

If Count was originally 5 before this sequence of events, what will be the result afterward? While the expected value is 7, the resulting value would be 6, because each thread incremented its own copy of Count in its register R1 before the first increment was actually written back to memory.
This example is very simple, but with more complex interactions, this could lead to data corruption or invalid object states. It can be fixed by using atomic operations or by using synchronization primitives.

Atomic Operations

Atomic operations are special instructions that perform operations on a memory location atomically; that is, even when executed by more than one core on the same memory location, the operation is guaranteed to complete as a single, indivisible step. For example, the InterlockedIncrement function could be used in the previous example to make it thread-safe and lock-free.

A very useful atomic operation is the Compare And Swap (CAS) function, implemented as InterlockedCompareExchange on Windows. Essentially, it compares a value with another and exchanges it with a third value based on the outcome of the comparison, atomically. It then returns the original value before the swap. Here’s how it can be represented in pseudocode:

uint32 CAS(uint32* Ptr, uint32 Value, uint32 Comperand)
{
    if(*Ptr == Comperand)
    {
        *Ptr = Value;
        return Comperand;
    }
    return *Ptr;
}

This atomic operation is very powerful when used correctly and can be used to perform almost any type of operation atomically. Here’s an example of its usage:

uint32 AtomicAND(volatile uint32* Value, uint32 Op)
{
    while(1)
    {
        uint32 CurValue = *Value;
        uint32 NewValue = (CurValue & Op);

        if(CAS(Value, NewValue, CurValue) == CurValue)
        {
            return NewValue;
        }
    }
}

In this example, we read the current value and try to exchange it with the new value. If the result of the CAS returns the old value, we know that it wasn’t changed during the operation and that the operation succeeded. (It was swapped with the new value.) On the other hand, if the result is not equal to the old value, we must retry, since it was changed by another thread during the operation. This is the basic operation on which almost all lock-free algorithms are based.
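The same retry loop generalizes to any pure function of the old value. As a sketch, here is an atomic maximum built exactly like the AtomicAND above, using C++11’s std::atomic and compare_exchange_strong as a portable stand-in for InterlockedCompareExchange (an assumption of a C++11 toolchain, not something the original text requires):

```cpp
#include <atomic>
#include <cstdint>

// CAS retry loop, as in AtomicAND: read the current value, compute the
// desired result, and publish it only if no other thread interfered.
uint32_t AtomicMax(std::atomic<uint32_t>& value, uint32_t candidate)
{
    uint32_t cur = value.load();
    while (candidate > cur)
    {
        // compare_exchange_strong stores 'candidate' only if 'value' still
        // equals 'cur'; on failure it reloads 'cur' and we decide again.
        if (value.compare_exchange_strong(cur, candidate))
            return candidate;
    }
    return cur; // another thread already stored something >= candidate
}
```

Note that the loop exits without writing at all when the stored value is already at least as large, which is exactly the “retry only if another thread changed it” discipline described above.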
Synchronization Primitives

Sometimes, atomic operations are not enough, and we need to be able to ensure that only a single thread is executing a certain piece of code. The most common synchronization primitives are mutexes, semaphores, and critical sections. Although in essence they all do the same thing—prevent execution of code from multiple threads—their performance varies significantly. Mutexes and semaphores are kernel objects, which means that the operating system is aware when they are locked; therefore, they will generate a costly context switch if already locked by another thread. On the other hand, most operating systems will make sure the thread that has a critical section locked will not be preempted by another thread while it holds the lock. So the choice among these primitives depends on a lot of factors: the time span of the lock, the frequency of locking, and so on.

When locking is required very frequently and for a very short amount of time, we want to avoid the overhead of operating system rescheduling or context switching. This can be achieved by using a spin lock. A spin lock is simply a lock that actively waits for a resource to be freed, as seen in this simplified implementation:

while(CAS(&Lock, 1, 0)) {}

What it does is simple: It uses the CAS function to try to gain access to the lock variable. When the function returns 0, it means we acquired the lock. Releasing the lock is simply a matter of assigning 0 to the Lock variable. A real implementation of a spin lock usually contains another waiting loop that doesn’t use atomic functions, in order to reduce inter-CPU bus traffic. Also, on some architectures, memory barriers are required to make sure that accesses to the Lock variable are not reordered, as seen in the next section. The complete implementation is available on the CD.
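The non-atomic inner waiting loop mentioned above can be sketched as follows, written here with C++11 std::atomic and acquire/release memory orderings rather than the Windows interlocked intrinsics (an illustrative assumption; the CD version differs in detail):

```cpp
#include <atomic>
#include <thread>

// Test-and-test-and-set spin lock: the inner read-only loop spins on the
// cache-resident value, so the expensive atomic exchange is retried only
// when the lock looks free, reducing inter-CPU bus traffic.
class SpinLock
{
public:
    void Lock()
    {
        // Atomic attempt: exchange returns the previous value, so a
        // nonzero result means someone else already holds the lock.
        while (m_Lock.exchange(1, std::memory_order_acquire) != 0)
        {
            // Plain (relaxed) reads while the lock is held by another
            // thread; no atomic read-modify-write, no bus locking.
            while (m_Lock.load(std::memory_order_relaxed) != 0)
            {
            }
        }
    }

    void Unlock()
    {
        // Release ordering makes all writes done inside the critical
        // section visible before the lock is observed as free.
        m_Lock.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> m_Lock{0};
};
```

The acquire/release pair plays the role of the memory barriers mentioned above: it prevents reads and writes inside the critical section from being reordered past the lock operations.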
On architectures that don’t change a thread’s affinity (in other words, that don’t reschedule threads on different processors, such as the Xbox 360), running several threads on the same core competing for a shared resource using a spin lock is a very bad idea. If the lock is held by a thread when it gets interrupted by the operating system scheduler, other threads will be left spinning trying to acquire the lock, while the thread holding it is not making progress toward releasing it. This results in worse performance than using a critical section, as shown in Figures 4.3.1 and 4.3.2.

Figure 4.3.1 Wasting cycles when using spin locks for a long time.
Figure 4.3.2 Efficient locking when using critical sections.

Memory Ordering

Memory can be read and written in a different order than written in your code for two main reasons: compiler optimizations and hardware CPU reordering. The latter differs a lot depending on the hardware, so knowing your architecture is important. See [McKenney07] for a detailed look at this problem. Consider the following pieces of code running simultaneously:

Thread 1                      Thread 2
GlobalValue = 50;             while(!ValueIsReady) {}
ValueIsReady = true;          LocalValue = GlobalValue;

While this looks completely fine and will almost always work when running on a single core (depending on the compiler optimizations), it will probably fail when running on different cores. Why? First, the compiler will most likely optimize the while loop and keep ValueIsReady in a register; declaring it as volatile should fix that problem on most compilers. Second, due to out-of-order memory accesses, ValueIsReady might get written first; therefore, Thread 2 can read GlobalValue before it is actually written by Thread 1. Debugging this kind of bug without knowing that memory reordering exists can be long and painful. Three types of memory barriers exist: read, write, and full barriers.
A read memory barrier forces pending reads from memory to complete, while a write memory barrier forces pending writes to memory to complete, so other threads can access the memory safely. A full memory barrier simply forces both reads and writes to complete. Some compilers will also correctly handle reads and writes of volatile variables by treating them as memory barriers (Visual Studio 2005, for instance, as described in [MSDN]). Also, some of the Interlocked functions come with Acquire and Release semantics, which behave like read and write barriers, respectively. The previous example can be solved by using a write memory barrier before setting ValueIsReady, ensuring that the write to GlobalValue actually completes before ValueIsReady is written, and therefore that other threads will see the new GlobalValue once ValueIsReady is set.

False Sharing

Most multi-core architectures have a per-core cache, which is generally organized as an array of memory blocks, each with a power-of-two size (128 bytes, for example), called cache lines. When a core performs a memory access, the whole line is copied into the cache to hide the memory latency and maximize speed. False sharing occurs when two cores operate on different data that resides in the same cache line. In order to keep memory coherent, the system has to transfer the whole cache line across the bus for every write, wasting bandwidth and memory cycles. The solution is to make sure that the data is structured in a way that avoids this sharing.

Memory Allocator Contention

Memory allocators can rapidly become a bottleneck in multi-threaded applications. Standard memory allocator implementations provided in the C run time (malloc/free), or even optimized and widely used allocators such as Doug Lea’s dlmalloc [Lea00], aren’t designed to be used concurrently; they need a global lock to protect all calls, which inherently leads to contention.
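Returning to false sharing for a moment: a common structural fix is to align and pad each thread’s hot data out to a full cache line so that no two cores ever write to the same line. A minimal sketch, assuming a 64-byte line and the C++11 alignas specifier (neither of which the original text prescribes; on the PS3 or Xbox 360 the constant would be 128):

```cpp
#include <cstddef>

// Assumed cache line size for this sketch; check your target hardware.
constexpr std::size_t kCacheLineSize = 64;

// Each counter is aligned to its own cache line, so an array of these
// never places two elements in the same line: two cores incrementing
// neighboring slots no longer ping-pong a shared line across the bus.
struct alignas(kCacheLineSize) PaddedCounter
{
    unsigned long long value = 0;
    // alignas rounds sizeof(PaddedCounter) up to kCacheLineSize, so the
    // remaining bytes act as implicit padding.
};

// Hypothetical per-thread counters, one slot per worker thread.
PaddedCounter g_PerThreadCounters[8];
```

Without the alignas, eight 8-byte counters would all share one 64-byte line, and every increment on any core would invalidate the line in every other core’s cache.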
The design of a multi-processor-optimized memory allocator is beyond the scope of this gem, but a good comparison of available multi-processor allocators can be found in [Intel07].

Idle Thread State

When a thread is idle, it is important that it doesn’t waste CPU cycles. Here’s some code we often see from a person writing multi-threaded code for the first time:

while(!HasWork())
{
    Sleep(0);
}

The Sleep(0) will make the thread give up the remainder of its time slice to other threads, which is fine. But when it gets rescheduled, it will loop back, unnecessarily wasting CPU time (and multiple context switches) until it has work to do. A better solution is to put the thread to sleep, waiting for an event. This is achieved by using the CreateEvent and WaitForSingleObject functions. Waiting for an event essentially tells the operating system scheduler that the thread is waiting and shouldn’t get any CPU time until the event is triggered.

Thread Local Storage

It is possible to declare per-thread global variables using Thread Local Storage (TLS). The declaration differs from compiler to compiler, but here’s how it works under Visual Studio:

__declspec(thread) int GlobalVar;

Each thread now has its own copy of the GlobalVar variable. This can be especially useful for per-thread debugging information (such as the thread name, its profiling stats, and so on) or for custom memory allocators that can operate on a per-thread basis, effectively removing the need for locking.

Lock-Free Algorithms

Lock-free algorithms are discussed in detail in [Jones05] and [Herlihy08]. Essentially, these algorithms are pieces of code that can be executed by multiple threads without locking. This can lead to enormous speed gains for some algorithms, such as memory allocators and data containers. For example, a lock-free queue could
need to be safe when multiple threads push data into it, while multiple threads also pop data at the same time. This is normally implemented using CAS functions. A complete implementation is available on the CD.

Scalable Multi-Threaded Programming

The most common way to rapidly thread an existing application is to take large and independent parts of the code and run them in their own threads (for example, rendering or artificial intelligence). While this leads to an immediate speed gain (and lots of synchronization problems), it is not scalable. For example, if we run an application using three threads on an eight-core system, then five cores will sit idle. On the other hand, if the application is designed from the start to use small and independent tasks, then perfect scalability can be achieved. Several options to accomplish this already exist, such as the Cilk language [CILK], which is a multi-threaded parallel programming language based on ANSI C, or Intel’s Threading Building Blocks [TBB]. An implementation of a simple task scheduler is presented next.

Task Scheduler Requirements

The required properties of our scheduler are:

1. Handle task dependencies.
2. Keep worker threads’ idle time at a minimum.
3. Keep CPU usage low for internal task scheduling.
4. Be extensible enough to allow executing tasks remotely and on co-processors.

The scheduler is lock-free, which means that it will never block the worker threads or the threads that push tasks to it. This is achieved by using fixed-size lock-free queues and a custom spin lock, and by never allocating memory for its internal execution.

Tasks

A task is the base unit of the scheduler; this is what gets scheduled and executed. In order to achieve good performance and scalability, tasks need to be small and independent.
The (simplified) interface looks like this:

class Task
{
    volatile sint* ExecCounter;
    volatile sint SyncCounter;

public:
    virtual void Execute() = 0;
    virtual sint GetDependencies( Task**& Deps );
    virtual void OnExecuted() {}
};

A task needs to implement the Execute function, which is what gets called when the task is ready to be executed (in other words, when it has no outstanding dependencies). To expose dependencies to the scheduler, the GetDependencies() function can be overloaded to return the addresses and the number of dependent tasks. A base implementation that returns no dependencies is provided by default. A task is considered fully executed when its Execute function has been called and its internal SyncCounter has become zero. The ExecCounter is an optional pointer to a counter that gets atomically decremented when the Execute function is called. For tasks that spawn sub-tasks, setting each sub-task’s ExecCounter pointer to the parent’s SyncCounter ensures that the parent’s OnExecuted function will only be called once all sub-tasks have been executed.

Worker Threads

The scheduler automatically creates one worker thread per logical core. Figure 4.3.3 illustrates how the worker threads behave: Each worker thread is initially pushed into a lock-free queue of idle threads, where it waits for a wakeup event. When a new task is assigned to a worker thread by the scheduler, the thread wakes up and executes the task. Once the task is done, the worker thread does several things. First, it tries to execute a scheduling slice (which will be explained in the next section) by checking whether the scheduler lock is already acquired by another thread. Then, it tries to pop a waiting task from the scheduler. If a task is available, it executes it, and the cycle restarts. If no tasks are available, the thread pushes itself into the lock-free queue of idle threads and waits for the wakeup event again.
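The ExecCounter/SyncCounter contract can be sketched in isolation. The following is a hypothetical simplification (CounterTask, Run, and TryFinish are illustrative names, and std::atomic stands in for the volatile sint plus interlocked operations of the real code; the actual lock-free scheduler ships on the CD):

```cpp
#include <atomic>

// Sketch of the sub-task completion counters described above.
struct CounterTask
{
    std::atomic<int>* ExecCounter = nullptr; // points at a parent's SyncCounter
    std::atomic<int>  SyncCounter{0};        // outstanding sub-task count

    virtual ~CounterTask() = default;
    virtual void Execute() = 0;
    virtual void OnExecuted() {}

    // Called by a worker thread: run the task, then atomically notify
    // whatever counter this task's ExecCounter points at.
    void Run()
    {
        Execute();
        if (ExecCounter)
            ExecCounter->fetch_sub(1);
    }

    // Called by the scheduling slice: the task is fully executed only once
    // every sub-task has decremented SyncCounter down to zero.
    bool TryFinish()
    {
        if (SyncCounter.load() != 0)
            return false;
        OnExecuted();
        return true;
    }
};

// Example: a parent that spawns two sub-tasks. Each child's ExecCounter
// is pointed at the parent's SyncCounter, exactly as described above.
struct ParentTask : CounterTask
{
    bool Finished = false;
    void Execute() override {}
    void OnExecuted() override { Finished = true; }
};

struct ChildTask : CounterTask
{
    void Execute() override {}
};
```

In the real scheduler, the "delete executed tasks" phase of the scheduling slice plays the role of TryFinish, repeatedly checking each pending task’s SyncCounter until it reaches zero.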
Figure 4.3.3 Worker thread logic.

Scheduler

The scheduler is responsible for assigning tasks to the worker threads and for tidying its internal task queues. To schedule a task, any thread simply calls a function that pushes the task pointer into the lock-free queue of unscheduled tasks. This triggers a scheduling slice, which is explained next.

Scheduling Slice

At this point, pending tasks are simply queued in a lock-free queue, waiting to be scheduled. This is done in the scheduling slice, which does the following, as described in Figure 4.3.4:

1. Register pending tasks.
2. Schedule ready tasks.
3. Delete executed tasks.

Figure 4.3.4 Scheduler slice logic.

Registering Pending Tasks

During this phase, the scheduler pops the pending tasks and registers them internally. Then, it needs to handle dependencies. If a task is dependent on previously scheduled tasks, the scheduler goes through the dependencies and checks whether they have been executed. If they have, the task is marked as ready to execute. Otherwise, the task is kept as pending until the next scheduling slice.

Scheduling Ready Tasks

The second phase of the scheduling slice is to assign tasks that are ready to be executed to worker threads. The scheduler first tries to pop an idle thread from the lock-free queue of idle threads. If it succeeds, the task is directly assigned to that thread, and the thread’s waiting event is signaled. If all threads are working, the task is queued in the lock-free queue of waiting tasks. The scheduler then repeats the process until there are no more tasks to assign.
Deleting Executed Tasks

The last phase of the scheduling slice handles the tasks that are considered fully executed by calling the OnExecuted function on them and by deleting them if they are marked as auto-destroy; some tasks may instead need to be manually deleted by their owner, as they might contain results, and so on.

Future Work

Since tasks are usually independent from each other, the scheduler can be extended to support execution of tasks on remote computers or on architectures with co-processors. To achieve this, each task would have a way to package all the data it needs to execute. The scheduler would then dispatch the task to other computers or co-processors through a simple protocol. Compilation of the Execute function might be required to target the co-processor’s architecture. When a completed-task message arrives, the task would receive and unpackage the resulting data. This could be used as a distributed system (for example, static lighting/normal map computations, or distributing the load of a game server to clients) or as a way to distribute work to the PlayStation 3 SPUs automatically.

Optimizations

The single task queue could become a bottleneck on systems with a large number of cores or on systems with no shared cache between all cores. Here, using per-worker-thread task queues along with task stealing could prove advantageous. Furthermore, the scheduling could be made more cache friendly through a policy of inserting newly created tasks at the front of the queue of the worker thread that generated them, as these tasks are likely to consume the data generated by the task that created them. All of these optimizations depend on the target architecture details.

Conclusion

As we have just seen, the scheduler is completely lock-free; its internal CPU usage is kept to a minimum; the worker threads are either executing a task or waiting for one; and, most of all, it scales with the number of cores.
The complete source code on the CD comes with several sample tasks that have been tested on an Intel Core 2 Quad CPU running at 2.83 GHz, with the following results (also see Figure 4.3.5):

            Fibonacci Sequence   Perlin Noise   QuickSort
1 thread    1.05 ms              9.06 s         9.30 s
2 threads   0.29 ms              4.55 s         4.84 s
3 threads   0.22 ms              3.03 s         3.44 s
4 threads   0.15 ms              2.29 s         2.74 s
5 threads   0.14 ms              2.21 s         2.81 s
6 threads   0.12 ms              2.31 s         2.86 s

The Fibonacci sequence test was created to see how dependencies are handled. The Perlin noise test, on the other hand, was implemented to see how performance and scalability can be achieved; the test consists of computing a 2048×2048 Perlin noise grayscale image with 16 octaves. The QuickSort test consists of sorting 65 million random integers. As we can see, the expected scalability is achieved for all of these tests.

Figure 4.3.5 Test results.

References

[CILK] “The Cilk Project.” Massachusetts Institute of Technology. n.d.
[Lea00] Lea, Doug. “A Memory Allocator.” State University of New York, Oswego, 2000.
[Herlihy08] Herlihy, Maurice, and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers, 2008.
[Intel07] “The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks.” Intel, 2007.
[Jones05] Jones, Toby. “Lock-Free Algorithms.” Game Programming Gems 6. Ed. Michael Dickheiser. Boston: Charles River Media, 2006.
[McKenney07] McKenney, Paul E. “Memory Ordering in Modern Microprocessors.” Raindrop Laboratories, 2007.
[MSDN] “Synchronization and Multiprocessor Issues.” Microsoft. n.d.
[TBB] “Intel Threading Building Blocks.” Intel. n.d.

4.4 Game Optimization through the Lens of Memory and Data Access
Steve Rabin, Nintendo of America Inc.

As modern processors have become faster and faster, memory access has failed to keep pace.
This is such a pressing issue on current console hardware that experts advise developers to treat memory as if it were as slow as hard drive access [Isensee06]. As a result of this vast disparity between CPU and memory speed, a great deal of horsepower goes to waste as CPUs wait for data to work on. Thus, a key aspect of optimizing games is keeping the CPU well fed with data.

But isn’t the cache supposed to alleviate this problem? While the cache is indispensable to any complex computer architecture, it isn’t a panacea. To its benefit, the cache is brilliantly transparent to the code, but if we want to truly optimize our game, we’ll have to pull back the curtains and understand how the cache works in order to help it out. Once we better understand the cache and the memory architecture, it will become apparent that we need to respect the cache by shrinking the size of data and keeping it better organized. These will be our two guiding principles for most of our optimizations.

Understand the Cache

Main memory is extremely slow relative to the CPU. It’s slow because it’s typically far away from the CPU (on another chip) and made with fewer transistors per bit (which requires the bits to be refreshed so they don’t fade away—which takes extra time). However, it’s possible to make more expensive memory that is closer to the CPU (directly on the same chip) and faster by using more chip real estate per bit (so it doesn’t need to be refreshed). This more expensive memory can’t be as large as main memory, though, because it simply won’t fit on the silicon die next to the CPU. So what is a computer architect to do?

The solution is to keep copies of the most recently used main memory contents in the ultra-fast cache memory that sits on the CPU die. In fact, this strategy works so well that modern computer architectures have multiple levels of cache, usually referred to as level 1, level 2, and level 3 (L1, L2, and L3, respectively). L1 cache is the smallest and fastest.
Because of different access patterns for data and instructions, it's commonly split into an L1 data cache and an L1 instruction cache (typically 32 KB each). L2 cache is slightly slower and is usually shared between data and instructions (typically 256 KB to 6 MB, larger if shared among cores). L3 cache is a relatively new development in consumer hardware and appears on Intel's i7. On the i7, each core has a dedicated L1 and L2, but the 8-MB L3 is shared among the four cores. Table 4.4.1 shows cache sizes for various platforms [AMD09, Bell08, Lanterman07, Shimpi09, Wikipedia09].

TABLE 4.4.1 Cache Sizes for Various Platforms

  Platform        L1 Instruction/Data Cache   L2 Cache         L3 Cache   Cache Line Size
  iPhone 3GS      32 KB/32 KB                 256 KB           N/A        64 bytes
  Wii             32 KB/32 KB                 256 KB           N/A        32 bytes
  PS3             32 KB/32 KB                 512 KB           N/A        128 bytes
  Xbox 360        32 KB/32 KB                 1 MB             N/A        128 bytes
  AMD Athlon X2   64 KB/64 KB                 512 KB to 1 MB   N/A        64 bytes
  Intel Core 2    32 KB/32 KB                 1 MB to 6 MB     N/A        64 bytes
  Intel i7        32 KB/32 KB                 256 KB           8 MB       64 bytes

But how do these multiple levels of cache work? If the CPU needs to load a word from memory, it will first ask the L1 cache. If the data is there, it's a hit in the L1 cache, and the data is delivered very quickly to the CPU. If the data isn't in the L1 cache, it's a miss in the L1 cache, and the L2 cache is checked. If the data is in the L2 cache, it's a hit, and the data is delivered to the L1 cache and the CPU. If it's a miss in the L2 cache, then the next level must be checked. If the next level is main memory, then the requested data is delivered to the L2, the L1, and the CPU.

However, there is one more twist to how the cache works. Memory isn't copied to the different levels of cache in bytes or words. It's copied in chunks known as cache lines, which are aligned sequential pieces of memory (for example, 128 bytes on the PS3 and Xbox 360, which is 32 words of 4 bytes each).
So when the CPU asks for a 4-byte word and it misses in each cache, the entire cache line is copied from main memory into each cache level. This cache line copy is the reason why spatial coherency is so important for the working data set. Table 4.4.1 shows cache line sizes for various platforms.

Knowing that main memory is slow and the cache is fast, our goal will be to keep data and instructions in the cache as long as possible. The worst thing that can happen is cache thrashing, where the working set of data and instructions can't fit inside the cache at the same time. In this case, memory is brought into the cache only to be thrown out because there isn't enough room for the next needed piece of data. Imagine a tight loop that continually operates on 512 KB of data when the L2 cache size is only 256 KB. With the limited size of the cache and the importance of our data being in the cache, it becomes clear why we must keep data small and well organized.

Pinpoint Problem Areas

If we were to optimize every system in our game, we would waste a lot of time and effort. For example, if we double the speed of a function that only takes 0.01 percent of the frame time, then we have effectively done nothing toward speeding up our game. It's a hard truth to face, since doubling the speed of a function sounds so satisfying, but the overall improvement is too small to make a difference.

Profiling our game is the only true way to pinpoint which areas are worth spending time to improve. Unless you can prove that a system or section of code is a bottleneck, you shouldn't waste your effort even thinking about optimizing it. Once you identify code that takes a significant amount of time, how can you tell whether the CPU is waiting for data? The answer is performance counters. Performance counters are measurement tools built directly into the CPU that can count events and help identify problems during code execution.
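Even without hardware counters, a coarse timer read around a suspect loop can expose memory-bound behavior. The following is a minimal sketch, not from the original text, using std::chrono as a hypothetical stand-in for real performance counters: it sums the same buffer sequentially and then with a stride of one int per 64-byte cache line. Both walks do identical arithmetic, but on most hardware the strided walk runs far slower per element because it wastes most of every cache line it pulls in.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// 4M ints (~16 MB): deliberately far larger than a typical L2 cache.
const std::size_t N = 1 << 22;

// Sum every element of 'data', visiting it in 'stride'-sized steps.
// Returns elapsed microseconds; the sum itself comes back through sumOut.
long long TimedSum(const std::vector<int>& data, std::size_t stride,
                   long long& sumOut)
{
    typedef std::chrono::steady_clock Clock;
    long long sum = 0;
    Clock::time_point t0 = Clock::now();
    for (std::size_t start = 0; start < stride; ++start)   // cover all elements
        for (std::size_t i = start; i < data.size(); i += stride)
            sum += data[i];
    Clock::time_point t1 = Clock::now();
    sumOut = sum;
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

With `stride == 1` the loop streams through whole cache lines; with `stride == 16` (one int per 64-byte line) every access touches a new line, so comparing the two timings gives a rough, counter-free hint that a loop is memory-bound rather than compute-bound.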
Many profilers can record performance counters, or you can usually turn them on and record the results directly in your code. Using performance counters, there are two key metrics to measure in order to identify when the CPU is spinning, waiting for data. The first is instructions per cycle (IPC), which gives a rough measure of how much work gets done per CPU cycle. If you measure an IPC of 0.8, then on average 0.8 instructions are completing per CPU cycle. The second key performance counter is the percentage of loads or stores that result in an L2 cache miss (which means a load from main memory). If the IPC is low and the percentage of L2 cache misses is high, then the cache isn't working well for this piece of code, perhaps because of cache thrashing. Once you know which code or data needs to be optimized, you can go to work with the following suggestions.

Avoid Waste

The key to being efficient is avoiding waste. One source of wasted CPU cycles comes from reading memory that you don't use. Since the cache copies memory in cache line chunks, it's critical to use everything you read in order to be efficient. For example, if you have an array of structs and only need to operate on one element in each struct, then you might be wasting a majority of the data you're bringing in from main memory: if you only operate on 30 percent of the struct, then as much as 70 percent of the data you're bringing in is wasted. Solutions to this problem are presented in the "Organize the Data" section.

Another source of waste comes from reading or writing memory that is not contiguous. Again, since memory is copied in cache line chunks, it is important to read and write sequentially to avoid waste. Lastly, waste can arise from redundancy. It would be wasteful to read the same data multiple times a frame.
Instead, batch operations together and try to optimize for a single pass through the data per frame.

Shrink the Data

Our first main strategy is simple: If your data is smaller, more of it will fit in the cache. This is as easy as carefully managing your data type sizes. If you're working with integers that won't get very large, consider using a short (2-byte integer) or just a single byte to represent the value. Instead of using 64-bit doubles, consider using 32-bit floats. One of the biggest wins, however, is with Booleans. Typically, a Boolean takes up 4 bytes, but this is a huge waste of space, since ideally it should be represented as a single bit.

Packed Structures

An easy way to manage your data type sizes is by defining them inside a structure. The following code is an example of an inefficient structure for a billboard particle, followed by an efficiently packed one. Notice the use of reduced ranges, bitfields, and indices, as well as how types are reordered by size to reduce padding. The result is a one-third savings in space, or about 7 KB saved for 500 particles.
struct InefficientParticle    //total size 44 bytes
{
    bool visible;             //31 bits of padding
    Texture *texture;         //pointer to texture
    int alpha;                //only needs 0 to 255
    float rotation;           //too much precision
    int type;                 //enumeration - 4 possible types
    Vec3 position;
    Vec3 velocity;
};

struct EfficientParticle      //total size 30 bytes
{
    Vec3 position;
    Vec3 velocity;
    unsigned char alpha;      //saved 3 bytes (0-255)
    unsigned char rotation;   //saved 3 bytes (0-255 degrees)
    unsigned texture:4;       //saved 28 bits (texture index)
    unsigned type:2;          //saved 29 bits (enumeration)
    unsigned visible:1;       //saved 31 bits (single bit)
};

To emphasize the importance of ordering struct variables from largest to smallest (which reduces padding and results in smaller structs), consider the following two examples of the same data:

struct WastedPadding    //20 bytes total
{
    char var1;          //1 byte
    float var2;         //4 bytes
    char var3;          //1 byte
    int var4;           //4 bytes
    char var5;          //1 byte
};

struct OptimalPadding   //12 bytes total
{
    float var2;         //4 bytes
    int var4;           //4 bytes
    char var1;          //1 byte
    char var3;          //1 byte
    char var5;          //1 byte
};

The following is a struct packing checklist for quick reference:

• Are numbers represented as the smallest reasonable data type?
• Can you use a float instead of a double?
• Can you use an 8-bit or 16-bit integer instead of an int?
• Can you use an 8-bit or 16-bit integer instead of a float (losing some precision)?
• Are you using the minimal bits necessary to represent the highest number?
• Can a char array be converted to a pointer, with the string stored elsewhere?
• Can a pointer be converted to an index?
• Are all Booleans converted to a single bit?
• Are all enumerations converted to the range of numbers needed?
• Are the data types ordered from largest to smallest to reduce padding?

Compile for Size

Code is data! The smaller your code is, the more code will persist in the L1 instruction cache and in the shared instruction/data L2 cache.
Since both code and data compete for the L2 cache, smaller code also helps keep data in the cache. All compilers offer the ability to optimize the compiled code for either speed or size. By compiling for speed, small functions will be inlined and loops will be unrolled (which bloats the size of the code). If you compile for size, then compact code will be favored. Since it's not clear which option will make your game run the fastest, you'll need to profile each option. It also may be the case that you want to compile some parts for speed and others for size.

Organize the Data

The second main strategy is to better organize your data to be cache-conscious. This means creating contiguous data structures (rather than structures scattered around memory) and grouping frequently used data together, away from infrequently used data.

Prefer Compact Contiguous Containers

Node-based containers, such as linked lists, hurt cache performance in two ways. The first is that they waste space by storing pointers to the next node, thus bloating the data structure. The second is that they allow the container nodes to be scattered around memory, hurting spatial locality [Isensee06]. A much more efficient data structure is an array, an STL vector, or an STL deque, where the data is stored sequentially without the use of pointers. If you are using STL node-based containers, be sure to allocate new nodes from a dedicated heap to maintain spatial locality.

Separate Hot and Cold Data

While object-oriented programming and encapsulation lead to effective organization and comprehension, they can also lead to inefficient cache use. Within each struct or class, some members are referenced often by the code, and others are referenced infrequently. Ideally, all of the hot data (data that is used often) is placed together and is separate from the cold data (data that is seldom used) [Ericson03, Kaeli01, Franz98].
This results in better cache utilization, since the hot data is adjacent and more likely to be in the cache. However, it's not enough to just separate hot and cold data within a struct or class. Consider what happens with an array of structs that have both hot and cold data. The problem that arises is that the cold data causes gaps between the groups of hot data, as in the left side of Figure 4.4.1.

Figure 4.4.1 On the left, an array of structs containing a mix of hot and cold data. The middle image shows splitting the hot and cold data for each struct, with a pointer linking data from the same struct. The right image is slightly more efficient, since the pointers are eliminated and struct correspondence is implicit in the array index.

Clearly, we need to further distill the hot and cold data even between structs or classes. In order to preserve encapsulation, one solution is to keep the hot data within the data structure but reference the cold data with a pointer (with the cold data living in some other part of memory). This is shown in the middle image of Figure 4.4.1. The most cache-efficient solution is to further weaken encapsulation and eliminate the extra pointer. In this scheme, there are two corresponding arrays: one holding the hot data and another holding the cold data. The link between the hot and cold data is maintained implicitly by the array index. For example, if a particular struct was originally stored in buffer[2], then the hot data is now stored in hot[2], and the cold data is stored in cold[2]. This is shown in the right image of Figure 4.4.1.

Manipulate the Cache

So far, the optimizations have centered around being cache-conscious and playing nice with the cache. The following two optimizations attempt to manipulate the cache more directly to our benefit.
Prefetch Data

Since the CPU will spin while waiting for data, it can be advantageous to prefetch data so that it's in the cache when the CPU is ready to use it. Some CPUs have specific prefetch instructions, but if these aren't available, you might have to cleverly prefetch the data yourself. The following code performs software prefetching for the next four array elements in a loop [Ericson03].

for (int i = 0; i < 4 * n; i += 4)
{
    Touch(array[i + 4]);    //Forces prefetch of memory
    Process(array[i + 0]);
    Process(array[i + 1]);
    Process(array[i + 2]);
    Process(array[i + 3]);
}

When prefetching data, timing is of the essence. You must not fetch the data too early, since it could be evicted from the cache before it's ever used. Conversely, you must not fetch it too late, since it might not be ready in time. Verifying with a profiler is the only surefire way to confirm that prefetching is having an effect.

Lock the Cache

To make maximal use of the cache, some consoles, such as the Wii, allow a portion of the cache to be locked and directly managed by the game. (Data must be manually moved in and out.) This can be extremely effective, since you are guaranteed to have particular data in the cache. However, this is console dependent, so check whether your target platform supports it. General-purpose computers, such as PCs, do not allow the cache to be locked, because this would interfere with other processes.

Conclusion

Optimization in games is about eliminating wasted cycles. With CPUs chronically waiting on memory access, waste comes in several forms: waiting for bloated data structures to arrive from main memory, not using everything that is read into the cache, and redundantly pulling the same data into the cache. These problems can be avoided by shrinking your data structures and organizing them better to be cache-conscious.
Lastly, you can manipulate the cache through prefetching and, on some platforms, locking the cache. However, as with any optimization work, ensure that you're spending time improving actual bottlenecks. Only a profiler and performance counters can give you a good idea of where to concentrate your efforts. Finally, measure and compare improvements with a profiler to ensure that you're making a tangible difference, resulting in the entire game running faster and not just the code you modified.

References

[AMD09] AMD. "Key Architectural Features AMD Athlon™ X2 Dual-Core Processors." 2009. Advanced Micro Devices.
[Bell08] Bell, Brandon. "Intel Core i7 (Nehalem) Performance Preview." 2008. FiringSquad.
[Ericson03] Ericson, Christer. "Memory Optimization." Game Developers Conference. 2003. Sony Computer Entertainment.
[Franz98] Franz, Michael and Thomas Kistler. "Splitting Data Objects to Increase Cache Utilization." 1998. Technical Report, University of California, Irvine, Department of Information and Computer Science.
[Isensee06] Isensee, Pete. "C++ on Next-Gen Consoles: Effective Code for New Architectures." 2006. Game Developers Conference.
[Kaeli01] Kaeli, David. "Profile-Guided Instruction and Data Memory Layout." 2001. Northeastern University Computer Architecture Research Laboratory.
[Lanterman07] Lanterman, Aaron. "Architectural Comparison: Xbox 360 vs. PlayStation 3." 2007. Georgia Institute of Technology.
[Shimpi09] Shimpi, Anand. "The iPhone 3GS Hardware Exposed & Analyzed." 2009. AnandTech.
[Wikipedia09] Wikipedia. "Broadway (microprocessor)." 2009. Wikipedia.

4.5 Stack Allocation

Michael Dailly

Performance in games has always been important—and fun! Indeed, it's what brings many people into our profession in the first place, and it's always fun looking for simple new ways of speeding up our code.
One of the best places to optimize has always been the center of a tight loop, where every cycle counts. Although this doesn't happen as much as it used to, if you're dealing with anything that allocates, speed is almost always important. This gem describes a method of allocation that allows you to shave valuable cycles off of your allocator, yet is frighteningly simple to follow and implement.

Overview

The standard way of doing rapid allocation is to use a linked list of some sort, either singly or doubly linked. This allows you to get the next free element quickly, and all you need to do is maintain the links. However, when dealing with linked lists (particularly doubly linked lists), you can end up touching more memory than you want and thus incur cache misses and delays. The code is also longer than we'd all like, and in these days of 64-bit pointers, you can end up burning more memory than you really need to. So what's the answer? Well, you could use indices instead of pointers for your linked list, but they can be even slower if you're not careful.

So, in steps the stack allocator. This system uses a pre-allocated list of items, with arbitrarily sized indices: pointers, INTs, WORDs, BYTEs, or even BITs! But unlike a normal list, we use a simple stack concept and position the stack pointer (SP) at the end. We can then pop the next free item off the stack and return it with minimal fuss or code, while we push items onto the stack to free them. Figure 4.5.1 illustrates a typical sequence where several objects are allocated and freed (where SP is the current stack pointer). The ability to easily use varying sizes of indices, bits, pointers, or even POD types directly, without having to really change the implementation, is very powerful. But before looking at some examples, let's look at how we would implement the code.

Example Implementation

We'll start with a simple example implementation using a stack.
But unlike a normal stack, you pre-fill it with your objects' pointers, indexes (INTs, SHORTs, BYTEs, or even bits), or any other kind of object handle. First, we need to create and initialize the array. In this example, we'll use basic pointer allocation.

#define MAX_OBJECT 10000

Particle* Stack[MAX_OBJECT];          // Our object stack
int SP;                               // The Stack Pointer
Particle ParticleArray[MAX_OBJECT];   // The object pool

// #####################################################################
// Initialise the stack with object pointers from 0 to MAX_OBJECT-1
// #####################################################################
void InitStack( void )
{
    // Pre-fill the stack with pointers (or indexes)
    for( int i = 0; i < MAX_OBJECT; i++ )
    {
        Stack[i] = &ParticleArray[i];
    }
    SP = MAX_OBJECT;
}

// #####################################################################
// Pop the next free particle off the stack
// #####################################################################
Particle* Pop( void )
{
    ASSERT( SP >= 0, "Error: Stack pointer has gone negative!" );
    if( SP == 0 ) return NULL;
    return Stack[--SP];
}

// #####################################################################
// Push a used particle back onto the free pool
// #####################################################################
void Push( Particle* _pParticle )
{
    ASSERT( SP >= 0, "Error: Stack pointer has gone negative" );
    ASSERT( SP != MAX_OBJECT, "Error: Not enough space in stack. Object freed twice?" );
    Stack[SP++] = _pParticle;
}

Allocating a particle is simply a matter of calling Pop(), while releasing it only requires you to call Push(). Again, not rocket science. Of course, these functions could just as well be called Alloc() and Free(), but for the sake of clarity when dealing with a stack method, we'll stick with Push() and Pop(). You can find code for a template based on this method on the accompanying CD (TemplateExample_V1).

Index-Based Implementation

Let's now look at another couple of possible uses. The most obvious one is to allocate an index as a handle to an object rather than a pointer to the object.
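The pointer-based functions above generalize naturally over the handle type. The following is a hypothetical sketch in the spirit of the CD's TemplateExample_V1 (which is not reproduced here, so all names are assumptions): HANDLE can be an object pointer, an integer index of any width, or a POD handle type.

```cpp
#include <cassert>
#include <cstddef>

// Generic stack allocator: pre-fill with Push(), allocate with Pop().
// HANDLE is whatever you hand out: Particle*, int, unsigned char, a POD, ...
template <typename HANDLE, std::size_t MAX>
class StackAllocator
{
public:
    StackAllocator() : m_sp(0) {}

    // Free a handle (also used to pre-fill the pool at startup).
    void Push(HANDLE h)
    {
        assert(m_sp != MAX && "stack overflow: object freed twice?");
        m_stack[m_sp++] = h;
    }

    // Allocate the next free handle; returns false when the pool is empty.
    bool Pop(HANDLE& out)
    {
        if (m_sp == 0) return false;
        out = m_stack[--m_sp];
        return true;
    }

    std::size_t FreeCount() const { return m_sp; }

private:
    HANDLE      m_stack[MAX];   // free handles, packed contiguously
    std::size_t m_sp;           // the stack pointer
};
```

A pool of byte indices, for example, would be pre-filled with `Push(i)` for each index in a loop, exactly like the InitStack() functions in the text.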
In this example, we have 256 sprites that we wish to allocate from, and rather than allocating an object and returning a whole pointer, we'll simply allocate and return the BYTE index, thereby saving memory.

#define MAX_SPRITES 256

unsigned char Stack[MAX_SPRITES];
Sprite SpritePool[MAX_SPRITES];

// #####################################################################
// Initialise the stack with object indexes from 0 to MAX_SPRITES-1
// #####################################################################
void InitStack( void )
{
    // Pre-fill the stack with ready-to-use sprite indexes.
    for( int i = 0; i < MAX_SPRITES; i++ )
    {
        Stack[i] = (unsigned char)i;
    }
    SP = MAX_SPRITES;
}

The same idea stretches to index sizes that would be awkward any other way. Suppose we need 18-bit indices (0 to 0x3FFFF): we can keep the low 16 bits of each free index in an array of SHORTs (Stack_short) and pack the top 2 bits of each index, 16 to a 32-bit word, into a separate bit table (Stack_bits).

#define MAX_OBJECT 0x40000                   // 262,144 objects

unsigned short Stack_short[MAX_OBJECT];      // low 16 bits of each index
unsigned int   Stack_bits[MAX_OBJECT / 16];  // top 2 bits, 16 per word

// ######################################################################
// Initialise the free list
// ######################################################################
void Init( void )
{
    // Create and fill our 18bit table.
    // Use Push to build the table as it's easier in this case.
    SP = 0;
    for( int i = 0; i < MAX_OBJECT; i++ )
    {
        Push( i );
    }
}

// #######################################################################
// Push an 18bit index onto the object stack
// In: 18bit number to store
// #######################################################################
void Push( int _value )
{
    assert( _value >= 0 );
    assert( _value < 0x40000 );
    assert( SP != MAX_OBJECT );

    // Store the low 16 bits.
    Stack_short[SP] = (unsigned short)( _value & 0xffff );

    // Store the top 2 bits, packed 16 to a 32-bit word.
    int index = SP >> 4;
    int shift = ( SP & 0xf ) << 1;
    Stack_bits[index] &= ~( 3 << shift );
    Stack_bits[index] |= ( ( _value >> 16 ) & 0x3 ) << shift;
    SP++;
}

While much slower than a linked list or using the stack with a straight array of pointers, this is obviously written for storage efficiency, in this case saving around 458 K compared to storing whole 32-bit pointers. Next, allocation...

// #######################################################################
// Return the next free 18bit index, or -1 for an error.
// #######################################################################
int Pop( void )
{
    // If none left, then return an error
    if( SP == 0 ) return -1;

    // Get the main 16 bits.
    int val = Stack_short[--SP];

    // Now OR in the extra bits we need to make up the 18bit index.
    val |= ( ( Stack_bits[SP >> 4] >> ( ( SP & 0xf ) << 1 ) ) & 0x3 ) << 16;
    return val;
}

The ease and speed with which you can adapt the stack allocation system to your needs is a real strength, and the ability to rapidly allocate not only pointers, BYTEs, SHORTs, and INTs, but also groups of bits efficiently is always going to be a winner. Using this system, you can compress your data right down into as few bits as required without sacrificing all your speed. There are always better ways to compress data down, but this method often allows you to strike a middle ground in a matter of minutes, not days. These are obviously extreme cases, but when trying to save memory, these types of options aren't usually open to you, so you usually end up either not bothering or wasting valuable time on far more complex methods; after all, how else could you easily allocate using 18-bit indices? It's normally not such a straightforward problem. The example 18BitAlloc can also be found on the CD.

POD Types

Lastly, we'll show how to use POD types for more complex indexing requirements. For those who don't know, POD stands for Plain Old Data. POD types can store whole structures as long as they fit inside a standard native type, such as an INT, SHORT, or BYTE. Here's an example of a simple POD type.

struct STileCoordinate
{
    unsigned char X;    // 64x64 X tile coordinate
    unsigned char Y;    // 64x64 Y tile coordinate
    short page;         // Texture page to use. (also padding)
};

This POD stores several pieces of pre-generated information—in this case, a 64×64 tile coordinate inside multiple 4096×4096 texture pages. First, let's see how you would normally deal with them.

class TileCoordinate
{
    TileCoordinate* m_pNext;    // Next in the list
    unsigned char m_X;          // 64x64 X tile coordinate
    unsigned char m_Y;          // 64x64 Y tile coordinate
    short m_page;               // Texture page to use.
};

If we had 10 texture pages, this would mean 40,960 tiles we need to allocate from.
At 8 bytes per tile (the m_page member also pads the struct for better alignment), this makes 327,680 bytes. Calling a standard allocator would then return a pointer to this structure, allowing us to use it as an origin for copying tile data into the free space.

So how can we do better? First, let's remove the pointer, as we know the stack system doesn't need it (or rather stores it outside the object itself). This leaves us with a 4-byte structure, and while we could use the very first example to allocate using a pointer to the structure (saving no extra memory), we can do better. Since this data fits inside 4 bytes, we can actually allocate and return the whole structure's data directly without the need for pointers. Taking the STileCoordinate type, we can make an array of these that takes the same amount of memory as an array of INTs (since the whole structure is 4 bytes long). Modern compilers will recognize this and allow us to pass the data around inside a register; that is, the whole structure is copied around without the need for pointers or expensive memcpy() operations. In fact, it is copied around at the same speed as a standard integer. Now our Pop() command doesn't return a pointer, but a structure, like so:

// #######################################################################
// Return the next free tile structure, or an error tile.
// #######################################################################
STileCoordinate Pop( void )
{
    // If none left, then return an error
    if( SP == 0 ) return EmptyTile;

    // Return the whole structure by value.
    return TileStack[--SP];
}

We need to pre-define what an empty tile is (probably just -1 for X, Y, and page), but aside from that, we get the data passed to us directly. You can then store this as an STileCoordinate (without the need for a pointer) and save yourself 163,840 bytes of memory in the process.
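To make the scheme concrete, here is a compilable sketch of the whole POD round trip. The text only shows Pop(), so the Push() side, the pool size, and the exact EmptyTile value are assumptions for illustration.

```cpp
// 4-byte POD handle: fits in a register, so Pop() returns it by value.
struct STileCoordinate
{
    unsigned char X;     // 64x64 X tile coordinate
    unsigned char Y;     // 64x64 Y tile coordinate
    short         page;  // texture page (also pads the struct to 4 bytes)
};

static_assert(sizeof(STileCoordinate) == 4, "must stay register-sized");

const int MAX_TILES = 4096;                     // assumed pool size
STileCoordinate TileStack[MAX_TILES];           // the free stack itself
int SP = 0;

// Assumed error value: -1 in every field, as the text suggests.
const STileCoordinate EmptyTile = { 0xFF, 0xFF, -1 };

// Free (or pre-fill) a tile: the whole POD is copied onto the stack.
void Push( STileCoordinate t )
{
    TileStack[SP++] = t;
}

// Allocate a tile: the whole POD comes back by value, no pointer needed.
STileCoordinate Pop( void )
{
    if( SP == 0 ) return EmptyTile;
    return TileStack[--SP];
}
```

Because STileCoordinate is the same size as an int, the compiler can pass it in a register; callers simply hold the returned value instead of dereferencing a handle.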
The beauty of the stack allocator is that it doesn't care what it returns; it will return pretty much anything to you, and it's up to you to decide what that anything is. You could just as easily define the type as a plain UNSIGNED INT and mask out bits of information as you need them, rather than relying on bytes or shorts to make things fit. For example, we could have assigned 12 bits each for the X and Y coordinates and 8 for the page number, which would allow us to specify an actual pixel coordinate inside the 4096×4096 texture if we really wanted to. The important thing to remember is that the stack system returns data quickly, and that data doesn't have to be a pointer to some data or even a handle to data! In fact, it can be an entire structure if you want it to be. POD types are best because there's no special copying (they fit inside a register), but any type could be used. You can find the POD example on the CD in the PODAlloc folder.

Known Issues

One of the few drawbacks to the stack allocation method is that you currently can't do it atomically, so it's not automatically thread-safe. This means you must manage multiple-thread access yourself, using all the standard lock methods normally at your disposal, or use per-thread stacks.

You also can't put any limits on the allocation; that is, you can't search through the whole list and allocate an element that better suits your needs—not without manually compacting the remaining elements in the list, that is. If this need is exceptional, then it might be worth occasionally compressing the list, but chances are that if you need to do this, you shouldn't use this method.

The only other real disadvantage is that there's no easy way to keep a used list. In a doubly linked list, you can move an item from the free to the used list, and this can help you debug things because it's easy to see what's in use. In this system, it's very much a free-only list.
This doesn't mean you can't maintain debug tables or lists, but it does mean they're not a natural part of the system.

Advantages and Disadvantages

First, the advantages:

1. The code is microscopic, so small it can easily be placed inline. The compiler will often do this anyway.
2. Since all the objects' pointers or indices are gathered next to each other, they are very cache-friendly. In fact, you'll get multiple items per cache line, reducing cache misses to a bare minimum, particularly if you're allocating one after the other.
3. You can allocate a number of any type and of any bit size, from only a few bits to longer, more complex streams, and you can do so almost as easily as any other value or item.
4. You can allocate POD types (or bit-packed data) directly and rapidly. This saves having to wrap simple types or thread pointers through them, thereby increasing their size.
5. You don't have to modify any object to put links through it or have wrapper objects or templates to link things together. This simplifies your code.
6. It requires either the same or less memory than a singly linked list system.
7. Allocation time is constant no matter how many elements you're allocating from.
8. Block allocation is not only possible, but very simple and quick. Depending on use, you could even return a pointer into the stack that holds the allocated block, so you don't even have to allocate an array to return.
9. It's fast and easy to implement, even on severely restricted CPUs, and in any language. It works just as well on an old 6502 as it does on a modern PC.
10. It's very simple to debug, since you don't have to follow links all over the place.

Next, the disadvantages:

1. You can't easily remove items from the middle of the stack. This means it's hard to keep an allocated list as well as a free list.
2. It's not automatically thread-safe.
3. You can't easily place any restrictions on allocation without severely affecting performance.
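Disadvantage 2 can be addressed with the coarse locking mentioned under Known Issues. A minimal sketch, assuming C++11 and that a single mutex around Push/Pop is acceptable (per-thread stacks avoid the lock entirely); all names here are hypothetical:

```cpp
#include <mutex>
#include <vector>

// Index-based stack allocator with a coarse lock around both operations.
class LockedIndexStack
{
public:
    explicit LockedIndexStack(int max) : m_stack(max)
    {
        for (int i = 0; i < max; ++i) m_stack[i] = i;   // pre-fill indices
        m_sp = max;
    }

    // Allocate the next free index, or -1 when the pool is exhausted.
    int Pop()
    {
        std::lock_guard<std::mutex> lock(m_lock);
        return (m_sp == 0) ? -1 : m_stack[--m_sp];
    }

    // Return an index to the free pool.
    void Push(int index)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        m_stack[m_sp++] = index;
    }

private:
    std::vector<int> m_stack;   // free indices, contiguous
    int              m_sp;      // stack pointer
    std::mutex       m_lock;    // guards m_stack and m_sp
};
```

The lock costs far more than the one increment/decrement it protects, which is why per-thread stacks are usually the better answer when allocation rates are high.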
Conclusion

This system has been used to allocate a single bullet from 20, sprites for rendering, scripting slots, blocks of memory, particles—you name it. The ease of setting it up and using it, not to mention the fact that debugging is simply a matter of viewing the stack, makes it simple to implement at any level of ability. Also, on modern hardware, with its relatively slow memory and cache lines, the cache coherency is an added bonus that's particularly useful when dealing with thousands (or even tens of thousands) of allocations in a single game cycle. Memory bandwidth is the ultimate enemy when dealing with large numbers of objects in a game, and being able to reduce the size of an object handle to an index and group handles into the same cache line can be a big win. Finally, it's fast, simple, easy to implement, usually uses less memory than most other methods, and is very easy to adapt as your needs change!

4.6 Design and Implementation of an In-Game Memory Profiler

Ricky Lung

This gem introduces an architecture and implementation for a low-overhead in-game memory profiler with multi-threading support. With this profiler, one can examine the memory allocation distribution of a running game in a call-stack-style table in real time. These statistics can be extremely valuable for performing memory usage optimization and memory leak tracking.

Introduction

Memory consumption management is one of the cornerstones of technical quality for games. Developers try their best to fit as much content as possible into a finite memory resource. We all know that choosing the right tool for any kind of performance tuning is crucial, as it provides solid data instead of guesswork. Strangely, there are only a very limited number of memory profiling tools available for C++ in both the open-source and commercial worlds. This gem tries to narrow this gap by providing a lightweight memory profiling library.
Now, let us have a look at what it will provide. The profiler can generate a call graph at a given time interval, as shown in Figure 4.6.1. Each column of the table reveals useful information:

• Name. Mimics the program structure as a call stack, where the name of the function comes from the user-supplied string literal.
• TCount. Total number of allocations currently made in the calling function and its child calls.
• SCount. Number of allocations currently made in the calling function, without counting its child calls.
• TkBytes. Total amount of memory currently allocated in the calling function and its child calls, in units of kilobytes.
• SkBytes. Amount of memory currently allocated in the calling function. If you are looking for the memory eater, this column might be the first place to look.
• SCount / F. Number of allocations performed by this function per frame. You might want to reduce this number so that less of your game’s run-time overhead is spent on memory allocation/deallocation.
• Call / F. Tells you how many times a function is invoked in a single rendering frame.

Memory Profiling Basics

The memory profiling mechanism can be divided into three main parts: collecting memory allocation information, relating the collected information to the program structure, and, finally, presenting the results. At first glance, the technique we use in CPU profiling seems applicable: take a measurement at the beginning of a code block of interest and again at the end, and compute their difference. But a challenge remains, because memory allocated within a code block can be deallocated somewhere else. Therefore, instead of collecting information for a single code block at a time, we turn the problem around and ask which code block the current memory operation is related to, by intercepting every memory allocation operation.
The remaining sections of this gem will discuss how to utilize function hooking to intercept allocation operations and how to get their corresponding call-stack information and statistics. Of course, in this multi-core era, we will add support for multi-threaded applications.

Figure 4.6.1 A view of the memory profiling remote client.

Function Hooking

One simple way of intercepting memory allocation is defining your own new/delete operators in C++, but this has the limitation of not being able to intercept code that you don’t have source code access to. Instead, we use a more low-level approach called function hooking, which is extremely powerful. The term “hooking” in computer programming covers a range of techniques used to alter or augment the behavior of an application. Some applications, such as DirectX/OpenGL debuggers and profilers, use this technique. Although it can be very complicated to hook other processes’ functions as those standalone profiling applications do, in this memory profiler we only need to do some patching on the memory that a function resides in. After the patching is done on a function—for example, the malloc function—whenever anything else in the program tries to invoke it, a proxy function myHookedMalloc is invoked instead of the original one. The patching process needs a few assembly tricks, so let’s get our hands dirty and examine what the assembly of the malloc function looks like under x86 first.

void * __cdecl malloc(size_t size) {
78583D3F  mov edi,edi
78583D41  push ebp
78583D42  mov ebp,esp
78583D44  push esi
...

The above shows the first four assembly instructions of the malloc function, which prepare the stack and registers for use within the function, commonly called the function prologue. We will replace the first instruction with a jump instruction.

void * __cdecl malloc(size_t size) {
78583D3F  jmp myHookedMalloc
78583D41  push ebp
78583D42  mov ebp,esp
78583D44  push esi
...
With the unconditional jump instruction added to the very beginning of the malloc function, all control flow will be transferred directly to our proxy function.

void* myHookedMalloc(size_t size) {
    void* p = (*originalMalloc)(size);
    logMallocUsage(p, size);
    return p;
}

What the proxy function does is simple: It invokes the original malloc function to perform the actual allocation and logs the memory size along with the allocated memory pointer for further processing. We then need to resolve the problem of how we can invoke the original malloc function. To get back the original function, we back up the first assembly instruction before we perform the patching and store it in an executable memory location. (Note that you can make a block of memory executable on the Windows platform using VirtualProtect.) Following that memory location, we put a jump instruction pointing to the second instruction of the original function.

mov edi,edi
jmp &malloc + sizeof(mov edi,edi) = 78583D41

We give this block of executable memory a meaningful name, originalMalloc, with the same function signature as malloc. These backup instructions can also be utilized for restoring the patched function when the memory profiler gets shut down. One particular issue that we haven’t addressed is how to get the size of a binary assembly instruction, which is necessary to perform the replace and copy operations correctly. An easy way to tackle this problem is to simply hardcode the instruction size, but this creates maintenance problems because different C run times may have different compiled binary code, even for the same function. A better solution is to calculate it at run time. Luckily, there are free libraries that can do the task, such as libdasm.
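Stripped of the platform-specific patching, the proxy's control flow is easy to demonstrate. The sketch below is a portable simplification under names of my own choosing (not the gem's CD code): a function pointer stands in for the trampoline that the jmp patch produces, so the forward-and-log behavior can be shown without touching any binary code.

```cpp
#include <cassert>
#include <cstdlib>

// Portable sketch of the proxy idea (illustrative names). In the real
// hook, control reaches the proxy via the patched jmp and returns to the
// allocator through the backed-up prologue; here a plain function pointer
// plays the role of "originalMalloc".
static std::size_t g_loggedBytes = 0;

static void* (*originalMallocPtr)(std::size_t) = std::malloc;

void* myHookedMallocSketch(std::size_t size) {
    void* p = (*originalMallocPtr)(size);  // perform the actual allocation
    g_loggedBytes += size;                 // then log it for the profiler
    return p;
}
```

The essential property is unchanged: callers are unaware of the indirection, and every allocation passes through a single choke point where statistics can be gathered.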
All the aforementioned details of function hooking are encapsulated in a single class with only three functions, which can be found on the CD-ROM.

class FunctionPatcher {
public:
    void* copyPrologue(void* func, int givenPrologueSize);
    void patch(void* func, void* replacement);
    void unPatchAll();
};

Again, using malloc as the example, the usage of the class is as follows.

void* (*originalMalloc)(size_t);
FunctionPatcher functionPatcher;

originalMalloc = (void* (*)(size_t))functionPatcher.copyPrologue((void*)&::malloc, 5);
functionPatcher.patch((void*)&::malloc, (void*)&myHookedMalloc);

Although all the source code and examples here only work under x86, the same concept should be applicable to other platforms, including game consoles.

Call-Stack Generation

In the current generation of software architecture, most programs are built on a command-and-control scheme: One method calls another method and instructs it to perform some action. The place that stores the nested method call information is the call stack. From a low-level point of view, it is just a simple memory block with some integer offsets giving the stack size; at a higher level, we can use a graph structure to represent the call stack. (The terms call graph and call stack can be used interchangeably here.) To collect the full call stack of a running program, a stack walker [Gaurav08] with debug symbol information can be used. However, generating a full call stack is not only time consuming, but it also clutters the profiling results. Thus, a custom function call annotation scheme inspired by the previous Gems articles [Rabin00] and [Hjeistrom02] is used instead, which gives users a more flexible way to collect statistics for their most important functions. Applications that already use similar profiling tools can easily integrate with this memory profiler by combining the scope variable/macro. We use a tree-like structure in C++ to mimic the actual call-graph structure.
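One node of such a tree might look like the sketch below (illustrative, not the CD code). The point to notice is the child lookup: because scope names are required to be string literals, the search can compare the char pointers themselves rather than the characters.

```cpp
#include <cassert>
#include <vector>

// Sketch of one call-tree node. The child search is linear, but it uses
// pointer equality on string literals instead of strcmp, so with roughly
// ten children per node it stays cheap.
struct CallNode {
    const char* name;
    std::vector<CallNode*> children;
    int allocCount = 0;
    explicit CallNode(const char* n) : name(n) {}

    // Find the child named 'n', creating it on first use. This mirrors
    // what declaring a scope variable does on the user's behalf.
    CallNode* getChild(const char* n) {
        for (CallNode* c : children)
            if (c->name == n)          // pointer comparison, not strcmp
                return c;
        children.push_back(new CallNode(n));
        return children.back();
    }
};
```

The pointer comparison is only valid because every lookup for the same scope passes the same literal, and therefore the same address.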
A tree is used instead of a graph because we will accumulate the profiling data recursively instead of making more general node connections. This tree structure contains a number of nodes, where each node contains the necessary memory profiling statistics and links to its sub-nodes. Obviously, the root node of the tree represents the entry point of the program, usually the main function. More nodes can be added to the tree at the user’s request, by declaring a scope variable.

void createMesh() {
    ProfilingScope scope("createMesh");
    ...
}

Suppose Function A calls Function B, and both functions have a scope variable declared. Then a node named B will be inserted as a child of the node named A. The next time Function A invokes B, a linear search over all the child nodes of A with the name B is performed. This implementation assumes that the name argument of the scope variable is a string literal; therefore, we can use a simple pointer comparison instead of a full-blown string comparison during searching. Provided that the average number of child nodes is on the order of 10, the call-stack generation process is reasonably fast.

For the hooked memory allocation function to make use of the call-stack information, the call tree will store a variable that points to the current call-stack node. Thus, a connection between the current allocation and the current call stack is made.

Collecting Statistics

Once we have the call-stack data structure on hand, we can start to collect statistics. Only a handful of statistics need to be stored in each node:

• Invocation count
• Exclusive allocation count
• Exclusive allocation size

The exclusive allocation count and size will be updated whenever the hooked functions are invoked. Apart from simply updating the statistics, the hooked functions need to perform some bookkeeping in order to retrieve the corresponding call-stack node for a particular allocated memory block during the deallocation operation.
The mapping between the two could be established using an STL map, but this results in high run-time and memory overhead; therefore, we take the approach of embedding the call-stack node pointer within the allocated memory block.

void* myHookedMalloc(size_t size) {
    void* p = (*originalMalloc)(sizeof(Node*) + sizeof(int) + size);
    *(Node**)p = currentCallStackNode;
    *(int*)((char*)p + sizeof(Node*)) = size;
    currentCallStackNode->allocCount++;
    currentCallStackNode->byte += size;
    return (char*)p + sizeof(Node*) + sizeof(int);
}

void myHookedFree(void* p) {
    int allocSize = *(int*)((char*)p - sizeof(int));
    p = ((char*)p - sizeof(Node*) - sizeof(int));
    Node* n = *(Node**)p;
    n->allocCount--;
    n->byte -= allocSize;
    (*originalFree)(p);
}

Other statistics that take into account the child function calls can be calculated on the fly during profile report generation, because this step needs to traverse the call tree anyway.

Multi-Thread Issues

By definition, each executing thread has its own call stack. This means every call to myHookedMalloc can access the per-thread call-stack data structure freely without any problem. Thread Local Storage (TLS) comes to our assistance here. For each thread, we store the pointer to its corresponding call-stack root node and another pointer to the current node using the Win32 API TlsSetValue. This data will be retrieved by myHookedMalloc using TlsGetValue. Care has to be taken in implementing myHookedFree because it may operate on a call-stack node that doesn’t belong to the current thread. To protect the deallocation routine from race conditions, each thread’s root node is assigned a critical section, and all of that root’s descendant nodes keep a reference to it. When myHookedMalloc or myHookedFree is invoked, the critical section for that node is locked before any statistics are updated.
Such a design keeps the number of critical sections to a minimum while preserving a very low level of lock contention; moreover, it does not create a locking hierarchy, so it should be free from deadlock.

The Source Code

The source code on the CD comes with Microsoft Visual Studio solutions that allow you to compile and execute several demo programs related to the memory profiler. Although the code snippets in this gem are simplified for illustration purposes, the code on the CD is carefully crafted for maximum correctness and robustness. A client-to-server architecture is also employed in one of the demos, which shows you how to make an external tool to monitor the memory usage of your application. A CPU profiler based on the same code base as the memory profiler is also supplied as a reference.

Conclusion

By moving a step forward, our in-game CPU profiling techniques can be extended to measure memory usage as well. This gem also demonstrated how to modify the profiler to cope with the multi-core era. We hope you will utilize this profiler to develop a greater understanding of your program, locate memory hotspots, and improve your memory consumption and performance.

References

[Gaurav08] Kumar, Gaurav. “Authoring Stack Walker for X86.” 6 Jan. 2008. WinToolZone. n.d.
[Rabin00] Rabin, Steve. Game Programming Gems. Boston: Charles River Media, 2000.
[Hjeistrom02] Hjeistrom, Greg, and Byon Garrabrant. Game Programming Gems. Boston: Charles River Media, 2002.

4.7 A More Informative Error Log Generator

J.L. Raza and Peter Iliev Jr.

Programmers commonly create error logging functions to aid in debugging during development. Usually, these functions print error messages onscreen or to a log file. Although this is functional, the error message is only as good as the programmer’s insight into how the problem could have occurred. In other words, a one-line message is not always enough to help fix the bug.
You may need to know which function passed in bad data or how the program got into this bad state, and that is what this gem seeks to help with. In this gem, we’ll present a new error logging function that automatically generates useful information for the programmer and/or tester by retrieving what is known as the Run-Time Stack Information (RTSI).

Definition of RTSI

Retrieving the RTSI is a debugging technique conventionally used to aid in software development. During debugging, stack walking is the act of pausing the program and taking a glimpse into the program’s function stack at run time. For example, given the following code:

void C() {
    int i = 0; // set a breakpoint here
}
void B() { C(); }
void A() { B(); }
int main() {
    A();
    ...
    return 1;
}

When the debugger reaches the breakpoint in function C() and the programmer requests the program’s run-time stack information, it would return something like this:

C
B
A
main
...

The bottom of the stack shows the program’s entry point. For simplicity, the operating system functions called have been omitted from this example. In this case, the entry point is main(). The stack then details what function main() called, A(), and recursively what function A() called, and so on until it reaches the current function where the breakpoint was set. As the flow of execution continues, retrieving the RTSI again will return a different result based on where the program’s execution is paused.

Potential Uses

RTSI is a normal component in many commercial and open-source debuggers. It can be useful for tracking stack-overflow bugs, as well as for taking a snapshot of the program’s current execution status once it hits a breakpoint. Most often, programmers will only use the RTSI inside their IDE. But when working on game development teams, there are more people around than just programmers. There are artists, level designers, and testers. All of these people will be playing the game.
Yet not everyone may have access to a compiler or a debug development kit, and anyone can run into a crash bug. Having the program dump out as much info as possible in these situations is vital, and dumping out the RTSI can be a big help. The RTSI’s usefulness extends beyond the work environment, too. Online games, such as MMORPGs or RTS games, do beta testing, where the game is released to a select few people to play in their homes before it’s out. The main intent is balancing and getting feedback. Beta testing is also valuable time that can be used for bulletproofing the game. So as not to waste the testing time, whenever bad data is encountered or a crash occurs, the RTSI can be sent from the tester’s machine to a developer-owned server for later use.

C/C++ does not have a standard function for printing the program’s RTSI, so it’s up to the operating system and hardware vendors to distribute a set of functions in an API to accomplish this job. In this gem, we’ll explore the API functions on the Windows XP platform for gathering the RTSI on an x86 processor. The example will also work on x64 processors, but some of the assembly code will need to change to reference the correct registers.

Setting Up the Code

So you have decided that the RTSI is a resource that’s going to be used in your game. Henceforth, it needs to go through the three basic steps any resource goes through: loading, usage, and unloading.

Loading

Since the RTSI resource will probably be used in several different spots in the game code, we have the issues of when to load it, when to unload it, and how to handle access to it. All of these issues can be solved by setting it up as a singleton resource. For reference on singletons, see [Bilas00].
To load our RTSI API function calls, we use the following code:

// Load dbghelp.dll, which is the Windows library used
// for crawling up the stack.
m_dllHandle = LoadLibrary( "dbghelp.dll" );

// Sets up which process's symbols we are going to look at.
m_pSymInitialize = (SymInitialize)GetProcAddress( m_dllHandle, "SymInitialize" );

// Used to crawl up the stack and look at each function call.
m_pStackWalk64 = (StackWalk64)GetProcAddress( m_dllHandle, "StackWalk64" );

// Used to retrieve the line number in each file the
// function was called from.
m_pSymGetLineFromAddr64 = (SymGetLineFromAddr64)GetProcAddress( m_dllHandle, "SymGetLineFromAddr64" );

// Used to get the name of each function in string form.
m_pSymGetSymFromAddr64 = (SymGetSymFromAddr64)GetProcAddress( m_dllHandle, "SymGetSymFromAddr64" );

// Get and set the options for what information we want to retrieve.
// This example only cares about line numbers and function names.
m_pSymGetOptions = (SymGetOptions)GetProcAddress( m_dllHandle, "SymGetOptions" );
m_pSymSetOptions = (SymSetOptions)GetProcAddress( m_dllHandle, "SymSetOptions" );

Usage

We need to set up an access point to the RTSI with the required function pointers. We only need to activate it when something is about to go wrong; this is what the ASSERT() call is for. For reference on how to use that and a few handy tricks, see [Rabin00]. This access point can be created as such:

#if DEBUG
#define ASSERT( expression ) \
    if ( !(expression) ) \
    { \
        printf( "ASSERT[" #expression "]\n" ); \
        AssertWrapper::a.PrintStack(); \
        printf( "\n\nPress enter to continue:\n" ); \
        getchar(); \
    }
#else
// If we aren't in a debug configuration, then just compile out asserts.
#define ASSERT( expression )
#endif

Now, when we call the ASSERT() macro and the expression evaluates to false, we’ll print out the RTSI. Notice that it’s the PrintStack function inside the Assert class that does the interesting work.
There are a couple of important pieces of code in that function. The first is how we fill the CONTEXT data structure:

CONTEXT context;
memset( &context, 0, sizeof(CONTEXT) );
context.ContextFlags = CONTEXT_FULL;

__asm call x
__asm x: pop eax
__asm mov context.Eip, eax
__asm mov context.Ebp, ebp
__asm mov context.Esp, esp

Here we are copying the instruction, frame, and stack pointers into context. This lets the StackWalk function know where to look for all the debugging info we need. Normally, you would copy the instruction pointer (EIP) directly, but on x86 processors there is no direct way to access that register, so we use a trick: the call instruction pushes the address of the next instruction onto the stack, where we can simply pop it into EAX. This section of code is processor family–specific, so the registers will be named differently on other architectures. For example, x64 (AMD) uses Rip, Rbp, and Rsp.

The other important aspect of the PrintStack function is the loop that displays the functions on the run-time stack.

unsigned int i = 0;
do
{
    // Get the next stack frame info.
    this->m_pStackWalk64( IMAGE_FILE_MACHINE_I386, process, thread,
                          &frame, (void*)&context, NULL,
                          SymFunctionTableAccess64, SymGetModuleBase64, NULL );

    // Now we try to print the current frame's info.
    if ( frame.AddrReturn.Offset != 0 && i != 0 )
    {
        char line1[32];
        char line2[32];
        memset( line1, '\0', 32 );
        memset( line2, '\0', 32 );
        PrintLineNumber( process, frame.AddrPC.Offset, line2 );
        PrintFuncName( process, frame.AddrPC.Offset, line1 );
        printf( "%-32s || %s\n", line1, line2 );
    }
    i++;
} while( frame.AddrReturn.Offset != 0 );

Thus, if we want to integrate the RTSI into a game GUI, this is where we would make our changes.

Unloading

To unload it, we perform the reverse of the loading steps.
FreeLibrary( m_dllHandle );
m_pSymInitialize = NULL;
m_pStackWalk64 = NULL;
m_pSymGetLineFromAddr64 = NULL;
m_pSymGetSymFromAddr64 = NULL;
m_pSymGetOptions = NULL;
m_pSymSetOptions = NULL;

The Error Log Redundancy Problem versus Using RTSI

Having explained the RTSI and how to implement it in a non-compiler environment, we can now describe scenarios in which this feature can be quite useful. It is common for programmers to create a set of functions that display an error message when something goes wrong during testing. The problem with a generic error log function is that the programmer has to keep track of where he is in the code in order for the error message to produce any useful information. Following is an example of this common scenario.

int A() {
    if ( did_an_error_occur() ) {
        assert( ! "Problem with function A()" );
        return 0;
    } else {
        ...
        return 1; // All went well
    }
}

int B() {
    if ( did_an_error_occur() ) {
        assert( ! "Problem with function B()" );
        return 0;
    } else {
        ...
        return 1; // All went well
    }
}

int C() {
    if ( A() && B() )
        return 1;
    return 0;
}
...

The problem with this design is that as the software evolves, so will its function calls and the contexts in which these functions are called. Keeping track of those error messages adds extra overhead to the development of the game. This could be solved if the error log function could “know” its current context and report accordingly, which is exactly what RTSI can do. With the given code sample, the programmer could then bind the error log to the in-game GUI. This means that once an error log function gets triggered, a tester can report a wider span of technical information to the programmer (since the log is intrinsically related to the game’s code), without having to resort to an error check table.
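On platforms without a stack-walking API such as dbghelp, a similar "context-aware" error log can be approximated with manual scope markers. The sketch below is a portable illustration under hypothetical names, not the gem's code: each annotated function pushes its name onto a per-program scope stack, and the logger dumps that stack when an error fires.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Portable illustration (hypothetical names): a manual scope stack that
// an error logger can dump, giving RTSI-like output without any
// stack-walking API. Not thread-safe as written.
static std::vector<const char*> g_scopeStack;

struct ScopeMarker {
    explicit ScopeMarker(const char* name) { g_scopeStack.push_back(name); }
    ~ScopeMarker() { g_scopeStack.pop_back(); }
};

// Render the stack top-down, like the C/B/A/main listing shown earlier.
std::string dumpScopeStack() {
    std::string out;
    for (auto it = g_scopeStack.rbegin(); it != g_scopeStack.rend(); ++it) {
        out += *it;
        out += '\n';
    }
    return out;
}

std::string funcC() { ScopeMarker s("C"); return dumpScopeStack(); }
std::string funcB() { ScopeMarker s("B"); return funcC(); }
std::string funcA() { ScopeMarker s("A"); return funcB(); }
```

The trade-off against the dbghelp approach is the usual one: markers must be placed by hand, but the output survives on any compiler and costs only a push and pop per annotated call.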
Conclusion

It’s important for there to be a continuous flow of information among the staff on a game development team. Unfortunately, quantitative data about when a game hits an error in a non-compiler environment is not an easy thing to capture. With tools like the RTSI, these potential problems can be smoothed out. One could even bind the RTSI entry point to a script that fills in a form in a bug database with this kind of information so that it could later be analyzed by the programming team.

References

[Bilas00] Bilas, Scott. “An Automatic Singleton Utility.” Game Programming Gems. Boston: Charles River Media, 2000. 36–40.
[Rabin00] Rabin, Steve. “Squeezing More Out of Assert.” Game Programming Gems. Boston: Charles River Media, 2000. 109–114.

4.8 Code Coverage for QA

Matthew Jack

When working on a rapidly developing code base, a good QA team is a highly versatile resource that no automated testing can replace. However, actually playing the game is such a high-level form of testing that it can be hard to relate it to low-level changes in the code. First, changes that a programmer has made may only be executed in certain situations depending on the player’s actions, which can be hard to re-create with a written procedure. For a given piece of low-level code, a programmer may have little idea where and when in the game it is actually used! Further, even when we can give clear instructions for testing, we are rarely absolutely sure whether they worked and the code was actually executed; in some cases there will be a clear visible indication, but in many others the effect of the code is subtle. Often, it is simplest in the end to throw up our hands, give no details, and just ask QA to “be thorough”—which is time-consuming for them and still provides no guarantees about the result.
This gem describes a framework that combines code coverage analysis and conventional QA by using real-time feedback. It addresses these issues and improves the testing process for both QA and programmers. It can bring rigor and repeatability to testing and yield valuable insight for the low-level programmer about where the code is actually used.

Common Approaches

Before examining the approach of this gem, let’s consider some of the established testing methodologies.

• Unit testing. Especially for localized, pure refactoring, unit testing can be rigorous and fast. However, to be pragmatic, most games are developed without comprehensive unit testing. One reason for this, at least for rapidly changing game code, may be that unit testing is considered to add too much development overhead. If unit tests are not present in the original code, then restructuring to add them is likely to be an enormous task that could itself introduce bugs.
• Automated functional testing. For instance, a repeatable play-through using an input recorder. This is very convenient for detecting crashes; however, an experienced eye is still required to watch for changes in behavior, and it in no way addresses the thoroughness of the original play-through. Furthermore, if the game you are testing is currently under development, it may change so quickly that a recording has a very short lifespan.
• QA testing. Plain old QA is essentially a manual form of functional testing. A good QA team can adapt rapidly to a changing specification, report changes in behavior, and benefit from human intuition when something looks wrong. These are valuable qualities to build upon.

With all of these approaches, we might ask: How complete is their testing? One way of assessing this is through code coverage analysis, which measures how rigorously a given test exercises the source code, allowing thoroughness to be quantified and improved. As it is fundamental to this gem, we will examine it in detail.
An Analogy: Breakpoint Testing

There is a rough-and-ready version of this gem that I have often employed, and it turns out it is already available in your IDE. We might call it breakpoint testing. When refactoring a function/class/file/blob, one systematic approach to testing on the developer’s machine is to place a breakpoint on every statement that represents a relevant code path and then run the game, removing each breakpoint as it is hit. When all breakpoints have been removed, without any crashes or unexpected behavior, testing is complete. This can be a good minimal test of changes, as it ensures the developer has executed all the code he changed and has observed the overall results. However, playing through takes time, and it must be done on that computer, so this cannot easily be delegated. Furthermore, it is not repeatable: Future regression tests would require re-creating all of the breakpoints. Note, however, that breakpoint testing works for new code as well as refactored code and that it directly couples high-level functional tests to low-level details in the source. This gem is a framework that formalizes and expands that ad hoc hybrid of code coverage and manual testing.

Code Coverage

Conventional code coverage analysis is used to quantify how completely a testing procedure (usually automated) exercises source code [Cornett96]. It works by monitoring at run time the number of unique lines (or functions, branches, code paths, and so on) that have been executed by the test, compared to the number theoretically possible. This basic principle of discovering which code we have actually run is attractive, but there are several questions we must address when applying it to games. The first is what kind of testing procedure we should use, since code coverage is only useful for analyzing tests. Unit testing is the standard in the wider software industry, but for games we propose QA testing of real levels.
This is simple and flexible, and chances are your company is already doing it every day. It also better represents the end product, where code and assets intertwine to determine final quality.

Next, we have to consider granularity. Instrumenting every line will cause a significant slowdown that may make the game unplayable—especially on consoles—and the increased executable size may prevent it from loading at all. Furthermore, this is sheer data overload: Such a glut of information would take further processing to make any sense to us. Monitoring every function is too arbitrary—some functions are trivial and/or speed-critical, while others may, for instance, contain large switch statements. Tracking all branches may again generate too much data.

Also, we must consider the final output, which is related to the granularity. A list of the numbers of all the lines executed would be meaningless in itself—and rendered useless as soon as the source code changes. Function names would be instantly recognizable, but what if we change them? The result most commonly quoted from code coverage is a single figure: a percentage representing the completeness of testing, where targets are commonly 90- to 100-percent coverage [Cornett96, Obermeit06]. Note that if we use real levels as tests, in most games we will see much lower coverage, as each level usually employs only a subset of the features. If we accept that we will only test a fraction of our code, we have an ambiguity: Was a given feature left unexercised because testing was insufficient or because it is not used in this level? It may also beg the question: Much of the code may have been tested, but was this piece of code tested? Finally, if we use human testers to perform the tests, our results may vary wildly between play-throughs. Can we guide testers toward comprehensive, repeatable testing?
There are various code coverage packages already available, but without addressing these issues, their value in practical games development is limited.

Implementation

Key to our approach is that, rather than use any automatic instrumentation, the programmer places his own markers into the code using a simple macro. This allows us to solve the problems of granularity and meaningful output and leads us to solutions for the remaining issues. By using manual markers, we can keep them to a manageable number, applied only where required and omitted from performance-critical or trivial code. This—and careful implementation of the marker code—avoids slowdown and code bloat. We use a unique string to label each marker. You could base these on the class and method name—for example, CCoverBehaviour_ThrowGrenade_A—or use a pure feature descriptor—for example, ThrowGrenadeFromCover_A. The labels make the output meaningful, while still allowing the programmer to add markers to individual if/else branches, cases of a switch statement, loop bodies, or anywhere else he thinks they may be useful.

The Markers

The following macro expands to code that constructs a static object with a trivial constructor and calls its Hit() function:

#define CCMARKER( label ) \
{ \
    static CCodeCoverageMarker ccMarker_##label( #label ); \
    ccMarker_##label.Hit(); \
}

where Hit() registers the marker with a singleton [Alexandrescu01] for tracking the first time it is executed. Keeping this method separate allows for a reset feature.

inline void Hit()
{
    if (!m_bHit)
    {
        m_bHit = true;
        CCodeCoverageTracker::GetInstance().Register(this);
    }
}

Some of the advantages of this approach include:

• Macros are easily compiled out, ensuring zero performance impact in the release build.
• Using a static object allows a fast check for whether this is the first hit.
• Functions can be renamed or moved into a different class or file, but as long as marker labels are preserved, code coverage results remain undisrupted. This is helpful as we build up a history. (See the upcoming “Collecting Results” section.)
• Descriptive labels immediately provide hints for real-time feedback to QA. When wondering how to hit that last one percent of markers, a tester will really appreciate the onscreen text “ThrowGrenadeFromCover_A”.

C++ compilers include predefined macros that could generate marker labels automatically, based on function names, line numbers, and so on. Note, however, that for this saving of a few moments, we would lose much of the advantage of a manually defined label.

If labels are not unique, our results will of course be misleading. This would usually occur if someone copied and pasted code somewhere else, but it could also occur if a single marker declaration was instantiated multiple times in the binary, through use within macros or templates. Functions inlined as an optimization are not a problem, as the C++ standard defines that all inlined copies of a function must share one copy of any local static variables [ISO/IEC14882-98]. Note that some older compilers did not handle this properly. The best way to avoid the more common copy/paste problem would be to write a simple script to scan the source code for duplicate labels. All types of duplicates can also be detected by the framework itself at run time—the example implementation includes such a check. In my own experience, there have been no problems with duplicates.
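Pulling the pieces above together, the following is a minimal, self-contained sketch of how the marker, tracker, and run-time duplicate check might fit together. The tracker's internals are not shown in the text, so the container choice, the GetHitCount() query, and the linear duplicate scan here are illustrative assumptions, not the book's implementation.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <vector>

class CCodeCoverageMarker;

// Singleton that records each marker the first time it is hit.
// The vector and the duplicate scan are assumptions for illustration.
class CCodeCoverageTracker
{
public:
    static CCodeCoverageTracker& GetInstance()
    {
        static CCodeCoverageTracker instance;
        return instance;
    }

    void Register(CCodeCoverageMarker* pMarker);
    size_t GetHitCount() const { return m_markers.size(); }

private:
    std::vector<CCodeCoverageMarker*> m_markers;
};

class CCodeCoverageMarker
{
public:
    explicit CCodeCoverageMarker(const char* szLabel)
        : m_szLabel(szLabel), m_bHit(false) {}

    const char* GetLabel() const { return m_szLabel; }

    // Registers with the tracker on the first hit only; after that,
    // hitting the marker is just a boolean test.
    inline void Hit()
    {
        if (!m_bHit)
        {
            m_bHit = true;
            CCodeCoverageTracker::GetInstance().Register(this);
        }
    }

private:
    const char* m_szLabel;
    bool m_bHit;
};

void CCodeCoverageTracker::Register(CCodeCoverageMarker* pMarker)
{
    // Run-time duplicate-label check, as suggested in the text.
    for (const CCodeCoverageMarker* p : m_markers)
        if (std::strcmp(p->GetLabel(), pMarker->GetLabel()) == 0)
            std::printf("WARNING: duplicate marker %s\n", pMarker->GetLabel());
    m_markers.push_back(pMarker);
}

#define CCMARKER(label)                                        \
    {                                                          \
        static CCodeCoverageMarker ccMarker_##label(#label);   \
        ccMarker_##label.Hit();                                \
    }

// Hypothetical instrumented function for demonstration.
void UpdateBrain()
{
    CCMARKER(Brain_Update);
}
```

Calling UpdateBrain() any number of times registers the Brain_Update marker exactly once, so the tracker's hit count directly yields the coverage percentage when divided by the total marker count.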
Examples: Applying Markers

Markers are commonly embedded at the start of methods:

    void Brain::Update()
    {
        CCMARKER(Brain_Update);
        UpdateVision();
        UpdateBlackboard();
        ...
    }

while, if the method includes various early-out checks, it is most effective to place the marker at the end, where the “meat” of the code has already been executed:

    void Brain::SendSignal( const char* signal, int targetID )
    {
        if (targetID == m_myTargetID)                // Am I the target?
            return;
        if (GetDistanceToTarget(targetID) > 20.0f)   // Is it too far?
            return;
        ...
        CCMARKER(Brain_SendSignal);
    }

(Note that the macro token-pastes and stringizes its argument, so labels are passed as bare identifiers, not quoted strings.) Here I’ve used a structured name, confident that sending signals will always be a responsibility of my Brain class, or perhaps distinguishing it from other places where signals are sent—but in many cases you may prefer a totally pure name, as below. Where you only describe the feature, you can freely move the code and preserve the label without it becoming misleading. You can also include multiple markers at different points in the same scope if desired, or in loops (if not speed-critical):

    for (int i = 0; i < count; ++i)
    {
        ...
        CCMARKER(Loop_Body_Example);
    }

References

[ISO/IEC14882-98] “Standard for Programming Language C++.” dcl.fct.spec/4. 1998.
[Obermeit06] Obermeit, Tony. “Code Coverage Lessons.” 2006. Mobile Me. n.d.

4.9 Domain-Specific Languages in Game Engines
Gabriel Ware

Domain-specific languages, DSLs for short, are computer languages used to solve problems within the explicit boundaries of a problem domain. The benefits of DSLs are multiple: They help by separating domain-related code from application code; they let domain experts solve problems using a language they understand; they can have multiple outputs, and users can easily shift from one to another; and last but not least, designing domain-specific languages usually tightens relations between programmers and experts. This gem will dig into domain-specific languages, answering the following questions: What is a DSL? When should I use a DSL? How do I build a DSL?
Domain-Specific Languages in Depth

In this section we’ll explore DSLs and their uses, concluding with some guidance on when to use them.

Domain-Specific Languages: Definitions and Examples

Several definitions have been proposed for domain-specific languages. DSLs can be defined as artificial languages expressing instructions to a machine while working on a narrow field of expertise, a specific domain. These computer languages are sometimes referred to as little languages or micro-languages because of the limited expressivity of their syntaxes. Their syntaxes are restricted to the problem domain they are modeling, including only what is relevant to the problems. Languages such as C, C++, or Java, which are labeled general programming languages, or GPLs, provide generic solutions to a broad range of problems and, as such, can be contrasted with DSLs, which provide more tailored solutions to a restricted set of problems.

Domain-specific languages have existed for a long time, and their use in computer science is widespread. Successful examples include Lex and Yacc, programming languages intended to create lexers and parsers to help in building compilers; SQL, a computer language targeted at relational databases; and LaTeX, a document markup language providing a high-level abstraction of TeX.

DSLs have several characteristics emerging from their form and the process used to build them. The main characteristic of DSLs is their syntax, which provides notation appropriate to the domain model and a very limited set of instructions. This limits what problems users can solve but at the same time allows the language to be learned quickly. DSLs are also usually declarative. They can sometimes be viewed as specification languages, providing domain experts with the capability of writing specifications that will become new tools, solve problems, and encode domain knowledge.
Because they encode domain knowledge as perceived by domain experts, DSLs are usually built from a user perspective. Such a user-centric process tries not to take into account external factors, such as compiler capabilities, and prefers focusing on user experience.

Data mining is a domain that illustrates how domain-specific languages help. Code to search for persons matching certain criteria in tables can be written in any GPL, but it is much easier to write with a language that is appropriate to the domain. Listings 4.9.1 and 4.9.2 show SQL and C++ code snippets that provide the same feature to an application. Even if C++ is more adequate on a general basis, it does not focus on the domain and thus is harder to use in this specific case than SQL.

LISTING 4.9.1 Mining a table for persons named Paul in SQL

    Select users from table where name='Paul'

LISTING 4.9.2 Mining a table for persons named Paul in C++—relying on the STL to handle allocations and string manipulation (the element type, here User, was lost in typesetting and is reconstructed)

    std::list<User>::const_iterator const table_end = table.end();
    std::string SearchedName("Paul");
    for (std::list<User>::iterator it = table.begin(); it != table_end; ++it)
    {
        if (it->Name == SearchedName)
        {
            users.push_back(*it);
        }
    }

Another interesting feature these two listings exhibit is the difference in terms of interface. While GPLs provide a satisfying interface for programmers, some DSLs provide very light syntaxes nearly free of notation that does not translate into everyday language, dropping parentheses, braces, and any other artifact as much as possible. These notations are referred to as language noise, and programming interfaces that minimize this noise are called fluent interfaces. Domain-specific languages do not always provide a fluent interface to their users, but this can be a useful feature to provide when end users do not have a programming background.
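To make the contrast between the two listings concrete, here is a small sketch of how a fluent wrapper in C++ can hide the Listing 4.9.2 loop behind a more domain-flavored interface. The class and method names (Query, WhereNameIs, Results) are invented for illustration; they are not part of any listing in this gem.

```cpp
#include <cassert>
#include <list>
#include <string>

// Element type mirroring the User records of Listing 4.9.2.
struct User { std::string Name; };

// Minimal fluent "query" over a std::list<User>.
class Query
{
public:
    explicit Query(const std::list<User>& table) : m_results(table) {}

    // Each call filters the working set and returns *this,
    // so calls chain fluently, reading closer to the SQL version.
    Query& WhereNameIs(const std::string& name)
    {
        m_results.remove_if([&](const User& u) { return u.Name != name; });
        return *this;
    }

    const std::list<User>& Results() const { return m_results; }

private:
    std::list<User> m_results;
};
```

With this wrapper, the search reads as Query(table).WhereNameIs("Paul").Results(), trading the explicit iterator loop for a noise-reduced interface while staying inside plain C++.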
The Different Types of Domain-Specific Languages

While GPLs are usually classified by their programming paradigm and the type of output produced by their compilers, domain-specific languages are distinguished by the methods used to build them. As such, two main categories emerge: internal DSLs, sometimes referred to as embedded DSLs, and external DSLs.

When a DSL provides a custom syntax and relies on a custom-made lexer, parser, and compiler, it is categorized as an external DSL. Building an external DSL is the same process as building a new general-purpose language: Programmers have to design the language and implement it as well as any needed tools, such as editors, parsers, compilers, and debuggers. On the other hand, internal DSLs are built from general programming languages that offer syntaxes malleable enough to build a new language from them. This greatly reduces the amount of work needed to implement the language, as programmers rely on the existing tool chain and the host language’s features. On the downside, internal DSLs’ syntaxes usually include language noise from their host language.

In addition to these two categories, DSLs are also often classified by the type of interface they provide to end users. While some DSLs are created by programmers to be used by other programmers, others are designed to be used by domain experts who do not have programming experience. Thus, some DSLs provide textual interfaces, while other DSLs adopt graphical front ends in order to ease programming. An example of a successful visual domain-specific language is Unreal Engine’s Kismet, which allows designers to control actions and handle events by connecting boxes using the graphical user interface provided by Unreal Editor.

Advantages of DSLs

Domain experts may not be able to write code but are usually able to review code written using domain-focused syntaxes.
DSL code can sometimes even be written directly by domain experts, achieving end-user programming. DSLs concentrate domain knowledge, and thus it is important that coders creating new DSLs deeply understand the problem domain. As a side effect, this usually tightens relations between programmers and domain experts, resulting in more accurate solutions. The limited expressiveness of DSLs constrains user input and, as such, can help to reduce user errors. DSLs are easier to master, and as interfaces become more fluent, the code starts to be self-documenting.

Through these characteristics, DSLs are able to express important information while hiding implementation details. Just like good APIs, DSLs give users the ability to program at a higher level of abstraction. This leads to a clear separation between domain knowledge and implementation that allows for better conservation and reuse of this knowledge. Finally, DSLs provide new opportunities for error checking, statistical analysis, or any other transformation of the domain knowledge.

Disadvantages of DSLs

There are also several drawbacks to using domain-specific languages. The most problematic is the cost of building and maintaining a new language. While building external DSLs still requires quite a bit of effort, new tools and techniques have been used to reduce these costs. Another alternative is to embed the DSL in a host language. As the language evolves and requirements change, language maintenance can become a burden. It can be very tempting to grow the problem domain by adding new keywords and notations, but this usually leads to building general-purpose languages with some domain-specific keywords. This is a very costly approach and should be avoided unless it is truly desired.
Another drawback of using multiple languages to build an application is that programmers need to learn several languages to control the whole pipeline, and thus they need to learn and adapt quickly. One last problematic aspect of using domain-specific languages is that they introduce an extra layer of complexity, which can slow the debugging process.

Relations between DSLs and Game Development

Game development provides a wide variety of challenges in many different domains. To take up these challenges, programmers usually use a few general-purpose languages and build frameworks that help resolve domain-related issues. DSLs appear to be a good fit for this environment. Typical examples of problem domains related to video games are game logic, navigation, animation, locomotion, data modeling, serialization, and transport. Thinking in terms of modeling problem domains and user experience helps to define what solutions are needed.

When to Create a New Language

Creating new languages is a difficult and time-consuming task, so deciding when to use a DSL is a very important process. The need for a domain-specific language usually arises when a common pattern is detected across several problems. Those patterns can occur at the code level, in programs, subroutines, or data, as well as at the application level, when similar tools or architectures are built several times. The problem domain can usually be identified from these patterns, and the boundaries of the domain can then be determined. Domain-specific languages stress staying domain-focused, so it is important to deeply understand the domain’s definition. If the boundaries are blurry and users can’t anticipate requirements, then it may be impossible to design the language. It is important to choose the bounds carefully: If they are too narrow, the DSL won’t be able to encode enough domain knowledge, whereas if they are too broad, the language may lose its focus.
Boundaries also influence the language interface by defining which variants are to be exposed to the user. Exposing too many variants will slow language learning, while not exposing enough will render the language less usable. Domain experts, documentation, and specifications can help determine such boundaries.

Lastly, because creating a new language is a difficult task, it is important to know whether such a language will be reused. If the problem domain is too narrow and domain knowledge need be encoded only once, it may be better to build a framework over a general-purpose language; but if domain knowledge needs to be encoded multiple times, solves multiple issues, or requires a lot of effort to encode using a GPL, then creating a new domain-specific language may be a good option. Figure 4.9.1 depicts part of the decision process.

Creating a DSL

The process applied when creating a domain-specific language can be summed up in six steps, as illustrated by Figure 4.9.2. We start with the problems that must be solved to meet our goals. As stressed earlier, it is important to be able to detect recurrent patterns coming from different problems, because this leads to the identification of a problem domain. If patterns are detected soon enough, the domain can be examined, and new problems may be anticipated. The second step is to acquire as much knowledge about the domain as possible. Documentation and domain experts are the best sources of domain knowledge and will be able to explain what users expect from the domain. This leads to a user-centric approach and designing the language from a user perspective. The last step before designing the language is to choose between internal and external DSLs and the type of interface the language will provide to the end user. Interfaces are usually driven by the domain model to match end users’ ability to use graphical and textual interfaces. In the language design phase, the specifications of the language are laid down.
Notations and keywords needed to model the domain are chosen, and variants—what the interface will show—and invariants—what assumptions about the model will be hidden in the implementation—are identified. Lastly, all tools required to implement the language are created.

Figure 4.9.1 DSL decision process.

Choosing between Types of DSLs

Determining the user interface of a new language is a decisive factor in its adoption as a new tool. By understanding what users expect from the tool, programmers will be able to refine the required features. If the domain can be modeled using text, then both internal and external DSLs can satisfy the needs, and choosing between them is a matter of understanding their constraints along with the programming proficiency of the end users.

Internal DSLs rely on the availability of a host language that is malleable enough to let a DSL emerge from its own syntax. When such a language is available, its syntax and tool chain will influence the look and feel of the DSL. If those constraints are acceptable, then building an internal DSL is the quickest way to create a new textual DSL, as the host will provide the needed tools for the language. On the other hand, external DSLs do not rely on another language, allowing for better customization of their syntaxes. But they also require that programmers build tools such as parsers, interpreters, or compilers to support the new language. If no host language that satisfies syntax needs is available, then external DSLs are the way to go.

While some domains are very easy to model using words, other domains cannot be modeled, or at least are difficult to model, using text. In this case, the domain-specific language can rely on a graphical interface to help end users encode their knowledge. Another factor for building a graphical front end to a DSL is the programming background of its users.
Some DSLs are intended for non-programmers, and if the domain is complex and thus requires exposing many variants and keywords, it will probably be easier for the user to represent the domain knowledge using graphical tools. Although graphical DSLs are usually built from scratch, some tools do exist that help in creating them.

Figure 4.9.2 DSL creation process.

Figure 4.9.3 synthesizes the whole process.

Common Programming Techniques for Building Internal DSLs

As domain-specific languages become more and more popular, several programming patterns used to build them have emerged. Luckily, most programming techniques used to build domain-specific languages are easy to understand and use, but some may not be available in all host languages.

The first, and probably oldest, technique is the use of macros. It has been widely used by C and C++ developers to pre-process source code. It uses the preprocessor’s capabilities to build fluent interfaces that generate complex code at pre-processing time.

Another old and widely available technique is called function sequencing. Domain knowledge is encoded using sequences of function calls. The implementation of this method relies heavily on the side effects of each function to affect the execution context of each subsequent call. While this method provides a potentially acceptable solution in terms of interface, relying on side effects can be dangerous and hard to debug as sequences become more and more complex. Listing 4.9.3 shows an example of function sequencing.

LISTING 4.9.3 Function sequencing

    animation_engine();
    character_controller();
    playanimation("run_fast");
    easing_in_using(LINEAR_EASE_FUNCTION);
    during(10_MSEC);

An evolution of function sequencing is called method chaining. It uses objects to pass the context between calls without adding noise to the language. With this technique, each method call returns an object that provides a part of the language interface.
This helps to fragment the interface across multiple types of objects. Rewriting the previous example using method chaining leads to Listing 4.9.4.

Figure 4.9.3 DSL type decision process.

LISTING 4.9.4 Method chaining

    animation_engine().
        character_controller().
        playanimation("run_fast").
        easing_in_using(LINEAR_EASE_FUNCTION).
        during(10_MSEC);

Nested functions are another way to call functions while removing language noise as much as possible. When using this method, all calls are nested, as presented in Listing 4.9.5. The main characteristic of nested functions is the order in which functions get called. This can be very useful when domain knowledge can be expressed as a sum of properties and containers.

LISTING 4.9.5 Nested functions

    character_controller(
        playanimation(
            "run_fast",
            easing_in_using(LINEAR_EASE_FUNCTION),
            during(10_MSEC)
        )
    )

Another frequently used technique for building domain-specific languages on top of an existing framework is to separate the fluent interface from the existing API. Fluent interfaces are usually created using assumptions about the calling context of their routines. Although this helps in naming methods that chain efficiently and produce nearly fluent code, this way of writing an API violates what are considered good programming practices. Thus, it may be interesting to get the best of both worlds by adding a fluent interface on top of a more standard framework.

Lambda functions, sometimes named blocks, anonymous methods, or closures, are a feature that only recently became widespread in mainstream languages. They have been successfully applied to creating DSLs because they offer the key characteristic of evaluating code, with minimal language noise, in a predetermined context.
Lambda functions are very similar to a standard method definition but do not require the same textual overhead as functions: They do not need names, complete parameter lists, or return types, and they are simply associated with standard variables that can be passed across functions.

The dynamic handling of missing methods is another widespread technique to create domain-specific languages. It is a popular feature of languages such as Smalltalk and Ruby, where you can override doesNotUnderstand and method_missing, respectively. Other languages, such as Python, can provide similar features using other internal mechanisms. Handling missing keywords can be very convenient when the language has to deal with unknown keywords and unknown function names. This technique allows the user to create keywords as needed, making it very easy to create new modeling languages with very little noise.

LISTING 4.9.6 Animation DSL relying on method_missing, nested functions, and closures

    1: animset = define_animation_set( :terrestrial_locomotion) {
    2:   idle(from_file("tm_idle"))
    3:   run_forward from_file "tm_run_fwrd"
    4:   walk_forward from_file "tm_wlk_fwrd"
    5:   turn_90degsLeft from_file "tm_trn_90deg"
    6:
    7:   jump_forward from_file "tm_jmp_fwrd"
    8:   jump_forward { can_blend_with all_from("terrestrial_locomotion") }
    9:   jump_forward { can_blend_with transitions_from("aerial_locomotion") }
    10: }

Listing 4.9.6 shows a domain-specific language where animation identifiers are keywords chosen by the user and thus impossible to predict. It demonstrates the usage of blocks and nested functions using the Ruby programming language. The code block is written between braces and, in our example, is given to the define_animation_set function. In this domain-specific language, define_animation_set creates an animation set object and asks Ruby to evaluate the given block in the context of this object.
The animation set object’s interface provides functions such as from_file, which is used to load an animation from a given filename. In order to reduce language noise, the language relies on Ruby’s ability to infer parenthesis placement. Lines 2 and 3 are similar and are interpreted the same way by Ruby, the only difference being the presence of parentheses. Lastly, Ruby treats calls such as idle or run_forward as missing methods, which our implementation handles in order to identify new animations. A sample demonstrating how Listing 4.9.6 is implemented using the Ruby programming language is provided on the accompanying CD-ROM.

Another easy and very powerful technique to add meaning to code is called literal extension and is usually available in object-oriented languages. Extending literals helps readers by allowing modifiers, which may or may not do anything useful, to be used on literals in order to add fluency to the code. It requires the language to handle everything, including literals, as objects that can call methods. Literal extension also relies on the ability of the language to reopen and extend class definitions.

LISTING 4.9.7 Literal extension

    run if distance_to(nearest_enemy) < 10.meters

One last technique worth mentioning is abstract syntax tree and parse tree manipulation. It is a rare feature allowing programmers to access the parse tree or the abstract syntax tree after the code has been parsed by the host-language parser. Ruby’s ParseTree and C# 3.0 both work in a similar fashion: a library call is used to parse a code fragment and return a data structure representing the code expressions. This feature is useful when translating code from one language to another or when the DSL needs to rely on a wider range of expressions than happens to be available in the host language and thus needs to be transformed before use.
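C++ cannot reopen classes the way Ruby can, but C++11 user-defined literals offer a comparable effect to the literal extension of Listing 4.9.7. The sketch below is an illustrative assumption, not part of the gem: a hypothetical Meters unit type with a _meters suffix, so that a distance check reads almost like the Ruby one-liner.

```cpp
#include <cassert>

// Simple strongly typed distance unit; invented for illustration.
struct Meters
{
    double value;
};

// C++11 user-defined literal: allows writing 10.0_meters,
// the closest C++ analog to Ruby's 10.meters.
constexpr Meters operator""_meters(long double v)
{
    return Meters{static_cast<double>(v)};
}

constexpr bool operator<(Meters a, Meters b)
{
    return a.value < b.value;
}
```

With hypothetical distance_to() and run() helpers, the guard becomes if (distance_to(nearest_enemy) < 10.0_meters) run();, keeping the domain reading of the Ruby version while remaining ordinary, type-checked C++.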
Tools Easing Language Construction

External domain-specific languages have fewer constraints than internal ones but need slightly more effort to create because of the time required to build parsers and compilers. Luckily, tools have been created to ease this process and reduce this overhead. Lexical analyzers and parser generators, such as Lex and Yacc, have been around for a long time and are still a great help in building languages, but they tend to be replaced by newer tools, such as ANTLRWorks or Microsoft’s DSL Tools, which both provide powerful development environments focused on creating domain-specific languages.

ANTLRWorks is an integrated development environment for creating languages using ANTLR v3 grammars. It offers rapid iteration cycles by providing a full-featured editor, embedding an interpreter, and providing a debugger and many other tools to ease the development process. While ANTLRWorks uses textual grammars to create external textual DSLs, Microsoft’s DSL Tools provide a way to create visual domain-specific languages to be integrated into Microsoft Visual Studio. The Microsoft DSL Tools help design the language and its graphical interface by providing wizards and tools that ease domain modeling, specifying classes and relationships, and binding designers’ shapes to the model concepts. Although Microsoft’s tools for domain-specific languages can’t be used for building run-time DSLs, they offer opportunities for integrating a custom visual domain-specific language inside Visual Studio.

Multi-Language Game Engine Development

This section presents domain-specific languages for two domains related to low-level engine programming. Other examples of DSLs for game engines include shading and rendering passes; sound logic encoding emitters, occluders, and propagation logic; behaviors of artificial agents; and locomotion rules for animation systems.
The first example of a domain-specific language in a game engine relates to data structure modeling. Data management issues occur in many places, such as intercommunication between a pipeline’s applications, the engine working set, multi-threading and performance issues, and network replication. A common problem is the need to write data manipulation and serialization code multiple times. Thus, it may be interesting to encode as much knowledge about data structures as possible using a domain-specific language that handles tasks such as generating code for serializing and accessing data in all languages used in the pipeline. Such a DSL could also allow for statistical analysis of working sets, which would help profile the engine’s needs in terms of data. Acquiring knowledge about the domain of data management is easy because programmers are the domain experts. This type of DSL should solve data-related problems by providing a simple syntax to encode the data structures’ layout. The language will encode structures and should provide end users with a way to control fields’ identification, alignment properties, and serialization requirements. Lastly, this domain-specific language will be used by programmers, and thus a textual interface with low language noise is acceptable. Listing 4.9.8 shows a sample of such a data management DSL using Ruby as its host language. You can find an implementation of this DSL on the CD-ROM.

LISTING 4.9.8 Simple structure layout using a domain-specific language

    struct (:PlayerInfos) {
        required string     :name, replicate_over_network!
        required key        :race
        required boolean    :is_male
        required int32      :level
        required int32      :exp_points
        required vector3f   :position, replicate_over_network!, 16.bytes.alignment
        required quaternion :orientation, replicate_over_network!
        optional int32      :money
        optional float      :reputation
    }

The second use case for domain-specific languages targets the engine’s threading model.
Scalability of the engine’s performance across machine generations has become a very active field of research. Console engines usually take advantage of running on fixed hardware with known specifications, but good engines must allow for hardware evolution. A recurring pattern when changing hardware is rewriting the threading model to reflect the new hardware and get better performance. Another pattern related to threading models happens during development, when programmers try to offload heavy tasks from one processor to another, thus changing how tasks update. Again, a language focused on task dependencies and hardware specifications can help in handling modifications of the threading model. Such a language has to expose to the user variants such as the number of cores, the number of threads per core, or the preferred number of software threads. It can also expose tasks that are run by the engine and their dependencies in order to help scheduling given the hardware constraints. The output of such a language can be either code or data that would drive the current engine’s threading framework. Like the previous domain-specific language presented, this language is targeted at programmers, and an internal DSL’s properties satisfy our requirements. Listing 4.9.9 shows what such a DSL could look like, and its implementation is provided on the accompanying CD-ROM.

LISTING 4.9.9 Threading domain-specific language

    hardware {
        has 3.cores.each { |core| core.have 2.hardware_threads }
    }

    software do
        instanciate 6.software_threads
        instanciate :camera.module
        instanciate :player.module, :bots.module, :sound.module
        instanciate :physics.module, :graphics.module

        camera.depends_on(:player)
        bots.depends_on(:player)

        graphics.is_bound_to(thread(0))
    end

Integrating DSLs into the Pipeline

We will now focus on how DSLs integrate into the production pipeline.
Engine Integration through Embedding

The quickest and easiest way to integrate a domain-specific language into an engine is to directly embed it. As such, creating an internal DSL that relies on the engine’s main language seems an obvious way to provide domain-specific languages from the game engine. But with C++ being the preferred language for building game engines, it is difficult to provide a domain-specific language that allows for rapid iteration. C++ has a very strict syntax, and most of the advanced features used to build DSLs are difficult, if not impossible, to use. Another problem with relying on C++ to build an internal DSL is the compilation process, which may disturb domain experts without any programming background. However, C++ provides macros, nested functions, method chaining, and templates, which are powerful tools for building fluent interfaces. Developers who create DSLs using C++ as a host language must be careful about build times, ease of debugging, and code bloat, as many of the aforementioned techniques can lead to these problems if not used properly.

Engine Integration through Code Generation

Integrating a DSL that relies on a language other than the one used by the engine is made possible by using DSLs as application generators. In this case, the domain-specific language is used to input domain knowledge and transform these high-level specifications into low-level code that will be included in the engine. This approach provides the same advantages as any other code generation technique: End users can easily input data without worrying about the implementation, programmers can modify an implementation without the user noticing, and code need only be optimized once per code generator. Although this technique has the advantage of separating the domain-specific language from the language used to implement the engine, it has the major drawback of increasing the complexity of the build process.
An example of a DSL relying on code generation is UnrealScript, as it binds scripts to native classes by generating C++ headers. Although this is very convenient for debugging and for very tight integration into the engine, it requires script programmers to be very careful when modifying scripts, because a change may trigger a full rebuild of the engine. When using this code generation technique, developers must try to reduce compilation and link time as much as possible.

Engine Integration through Interpretation

DSLs can also be integrated into the engine by using a virtual machine that reads and executes domain-specific code at run time. Embedding virtual machines for languages such as Lua or Python has been a popular method for years, and it is possible to build internal DSLs on top of such languages. Another path is to create an external DSL and embed its virtual machine inside the engine, as in [Gregory09] and [Sweeney06]. This integration method has the advantage of removing any constraint previously imposed by the engine’s language and also helps reduce iteration cycles, but it sacrifices run-time performance. Whether you build your own virtual machine or use a preexisting one, it is crucial to provide tools that assist the debugging phase, since this new language adds an extra layer of complexity.

Engine Integration through a Hybrid Approach

An interesting way to integrate DSLs in an engine is a hybrid approach, where DSL code can be either compiled to machine code or interpreted by a virtual machine. Although such an approach requires substantial effort to write compilers and interpreters, it provides the best of both worlds, allowing fast iteration during development and maximizing release build performance. Tools are crucial to overcome debugging issues, but developers also need to consider the execution environment of scripts, as it changes when going from interpreted to compiled.
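The code-generation approach described above can be illustrated with a toy translator that turns one statement of a hypothetical widget DSL into a C++ call to be compiled into the engine. Both the DSL syntax and the generated `CreateWidget()` call are invented for this sketch:

```cpp
#include <sstream>
#include <string>

// Translate a hypothetical DSL line such as
//   "widget hp_bar 10 20"
// into a C++ statement. A real generator would emit whole headers,
// as UnrealScript does; this only shows the shape of the idea.
std::string generateCode(const std::string& dslLine) {
    std::istringstream in(dslLine);
    std::string keyword, name;
    int x = 0, y = 0;
    in >> keyword >> name >> x >> y;
    if (keyword != "widget") return "// unrecognized DSL statement\n";
    std::ostringstream out;
    out << "CreateWidget(\"" << name << "\", " << x << ", " << y << ");\n";
    return out.str();
}
```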
The quickest way to set up this hybrid approach for game engines is to create an internal DSL using an already available interpreter, such as Lua, Python, Lisp, or Ruby, and bind it to the native engine’s framework. Code generation routines should be written in the host language and used to translate DSL code to native code relying on the engine’s framework.

Pipeline Integration through Data Generation

Tools that help domain experts input their knowledge into the pipeline usually provide an interface relying on domain-specific languages. Tools providing DSLs integrated into game pipelines are very common: Unreal Engine provides Kismet for scripting game events [Unreal05], and Crytek’s CryENGINE offers Flow Graph, a visual editing system that allows designers to script game logic [Crytek09]. Other examples exist in the field of artificial intelligence [Borovikov08].

Pipeline Integration through Centralization

DSLs let users encode domain knowledge using custom syntax and usually help centralize this knowledge. As a side effect, it can be very interesting to use DSLs not only to encode domain knowledge, but also to distribute it to any application of the production pipeline, easing knowledge transfer across multiple languages and applications. For example, tools such as Google’s Protocol Buffers or Facebook’s Thrift provide domain-specific languages that ease data transfer across complex application architectures, which are very similar to game pipelines.

Conclusion

Domain-specific languages have been around for a long time and are successfully employed to solve a wide variety of problems throughout the software industry. They offer tailored solutions, are easy to learn and manipulate, enable various opportunities to mine the knowledge they encode, and focus on end-user experience.
It is still difficult to reduce the costs associated with creating and learning several languages, but because video game development addresses such a wide range of problem domains, it seems to be a perfect fit for domain-specific languages.

References

[Borovikov08] Borovikov, Igor, and Aleksey Kadukin. “Building a Behavior Editor for Abstract State Machines.” AI Game Programming Wisdom 4. Boston: Charles River Media, 2008.
[Crytek09] CryEngine Team, Crytek. “CryEngine3 Specifications.” 11 March 2009.
[Gregory09] Gregory, Jason. “State-Based Scripting in Uncharted 2.” Game Developers Conference. 2009.
[Sweeney06] Sweeney, Tim. “The Next Mainstream Programming Language: A Game Developer’s Perspective.” Symposium on Principles of Programming Languages. 2006.
[Unreal05] Unreal Engine Team, Epic. “Unreal Kismet, the Visual Scripting System.” 2005–2008.

4.10 A Flexible User Interface Layout System for Divergent Environments
Gero Gerber, Electronic Arts (EA Phenomic)

The more people you want to address with your game, the more divergent system environments you have to support and take into account. The differences lie not only in CPUs and GPUs, but also in displays. So what you need to develop is software that scales with its environment and makes efficient use of it in all respects. In this gem, we highlight an approach to efficient resource usage, especially of the available screen size, from the perspective of the user interface (UI) layout. We show how you can keep the UI layout system sufficiently flexible so that, once in place, you can achieve optimal layout results without the need to handle special cases in the source code.

The Problem

The more UI elements (widgets) you have in your game, and the smaller the screen size of your minimum supported system requirements, the more important it is to have efficient and flexible UI layouts.
In addition to the wide range of screen resolutions out there, you also have to take into account different aspect ratios. For example, many current laptops use previously unusual aspect ratios. So when designing the UI layout, you may need to consider more than the widget’s size, position, and font size. In most cases, where the difference between the minimum supported system requirements and the high end is large, you also have to use different assets, for example different textures on the widgets, so that each widget renders crisply. Finding an algorithmic solution that makes optimal use of screen space and produces a useful layout is difficult and may, if done in code, not always result in the layouts UI designers or artists want.

Some Cheap Solutions

There are some solutions that solve this problem in a cheap and sometimes acceptable way. The first solution makes use of virtual screen coordinates (see Figure 4.10.1). Here the space the UI works with is always the same (for example, 1024×768). All widgets are positioned in this virtual space. When the physical screen resolution differs from the virtual space, all widgets are scaled automatically, effectively projecting the virtual screen coordinates onto the physical screen resolution. As long as the two spaces do not differ much in size or aspect ratio, everything can look acceptable. But if the difference in size or aspect ratio becomes too large, you get non-uniformly scaled widgets with blurred textures. The second solution works for designs with only a few widgets. Here, all widgets are bound to a combination of screen borders or are centered. So when, for example, you define a widget that is bound to the lower and right screen borders, the widget will stick to the lower-right part of the screen when the screen resolution increases in width or height.
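The first solution, virtual screen coordinates, can be sketched in a few lines. The type and function names here are illustrative:

```cpp
// Widgets are authored in a fixed 1024x768 virtual space and mapped to
// the physical resolution at load time.
struct Vec2 { float x, y; };

Vec2 virtualToPhysical(Vec2 virtualPos, float physicalW, float physicalH) {
    const float kVirtualW = 1024.0f;
    const float kVirtualH = 768.0f;
    // Note the non-uniform scale: if the aspect ratios differ, widgets
    // (and their textures) are stretched -- the artifact discussed above.
    return { virtualPos.x * (physicalW / kVirtualW),
             virtualPos.y * (physicalH / kVirtualH) };
}
```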
You can see border linking as a form of alignment (see Figure 4.10.2). (.NET users may know border linking as anchoring.) The difference is that with border linking, in contrast to simple alignment, you can define multiple borders to link to, and you can define a specific number of pixels the widget has to stay away from the borders. This way, you can support a large number of screen resolutions without non-uniformly scaled widgets or blurred textures. The drawback is that with increasing screen resolutions, the widgets occupy a smaller part of the screen and so become harder to read. At the other end, below a given screen resolution, the widgets may start overlapping each other, which may not be the desired behavior. But for some designs, this may be the way to go.

Figure 4.10.1 Virtual screen coordinates.

However, both solutions may not work optimally when additional constraints exist. For example, when you develop for consoles such as the Xbox 360, you have to take into account that on some TVs, only 80 to 90 percent of the picture may be visible. This area is called the title-safe region. Parts of the display outside this region may not be visible and should not be occupied by any widget or other important game elements. This case has to be considered in the UI layout when running in this special environment. Another solution would be to create multiple UI layouts and check inside the source code which layout to use (see Figure 4.10.3). This results in significant extra work for the UI artist, and you have many duplicate layout definitions that need to be kept in sync. Adding, removing, or editing existing widgets in multiple layouts can become complicated and error prone.
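The border linking described above reduces to a small position fix-up at layout time. This is a hedged sketch; the field names are illustrative:

```cpp
// A widget may be linked to any combination of screen borders, each
// with a pixel margin it must keep from that border.
struct Widget {
    float x = 0, y = 0, width = 0, height = 0;
    bool linkRight = false, linkBottom = false;
    float marginRight = 0, marginBottom = 0;
};

void applyBorderLinking(Widget& w, float screenW, float screenH) {
    // A right/bottom link pins the widget a fixed distance from that
    // border, so it stays put when the resolution grows.
    if (w.linkRight)  w.x = screenW - w.width  - w.marginRight;
    if (w.linkBottom) w.y = screenH - w.height - w.marginBottom;
}
```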
Things become worse for the software engineer in a multi-layout scenario when, at run time, widgets have to be created (for example, adding new widgets to a list box), widget textures have to be exchanged, or UI effects have to be started (for example, you want to use differently scaled UI effects for different screen resolutions). In these cases, there has to be a check in the source code that decides which asset to load in order to fit the current UI layout. When, later during development, layouts are added or removed, all these code sections have to be adjusted to make use of the new layouts.

Figure 4.10.2 Border linking.

A More Flexible Solution

Widgets are defined by a set of properties (position, size, texture, font style, and so on) that can be seen as key-value pairs. Many applications make use of XML to define widgets. Using a binary format would result in faster loading times, but for the sake of readability, we will stick to plain-text XML in the examples. A sample XML definition for a widget defines the widget’s name, position, and size. Such a definition is static and may not suit all environments; it could be better to have a larger widget at larger screen resolutions. A more flexible solution, which we already used in a simpler form in our last project, is the use of conditional modifiers (CM). A conditional modifier is a node in a widget’s XML definition that contains an additional set of widget properties. You can attach a CM to any widget. A CM always contains a set of conditions connected with condition operators, which, when they evaluate to true, enable the properties defined inside the CM. In the example discussed next, the widget is created with the properties defined in the Widget node, and as a child node of the widget, we add a CM node.
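The original XML listing is not reproduced here; based on the surrounding description, a widget definition with an attached CM might look like this (the element and attribute names are illustrative, only the values come from the text):

```xml
<Widget name="MyWidget" position="15,15" size="20,30">
  <ConditionalModifier cm_name="CM_Xbox360" position="30,30" size="40,50">
    <Condition type="Platform" value="xbox360" />
  </ConditionalModifier>
</Widget>
```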
As a child node of the CM, there is a condition that evaluates to true when the current platform is equal to xbox360. The CM is loaded into memory when the corresponding condition is fulfilled. After the widget is fully created, we iterate over all loaded CMs attached to the widget (in the order defined in the XML). For each CM whose conditions evaluate to true, we apply the specified properties to the outer widget. The property cm_name inside the CM node is purely cosmetic and helps give the CM construct a meaning. So in the example above, on a PC the widget would be placed at position (15, 15) with a size of (20, 30), and on the Xbox 360 its position would be (30, 30) with a size of (40, 50). Of course, you do not have to re-specify all widget properties inside a CM; only those you specify will be applied when the CM evaluates to true. So in this example, you could, for instance, change only the size of the widget if needed. It is also possible to combine the previously discussed border-linking solution with the CM system.

Concatenating Conditional Modifiers and Conditions

Using only a single CM or a single condition inside a CM is not that useful. To further improve the flexibility of CMs, you can chain together multiple conditions and CMs to form a more complex set of conditions. For example, we can apply the properties inside one CM only when the screen width is greater than 1024, the screen height is greater than 768, and the current platform is not Xbox 360; in that case we change the widget’s position. A second CM can change the widget’s texture to a low-resolution variant when Condition_MinSpec evaluates to true. Both CMs are independent from each other and change different properties of the widget in different environments.
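The chained-conditions example described above is not reproduced in the original listing; a hypothetical reconstruction could look like this (element, attribute, operator, and texture names are all illustrative):

```xml
<Widget name="MyWidget" position="15,15" size="20,30" texture="widget_high.tga">
  <ConditionalModifier cm_name="CM_LargeScreenNotXbox" position="50,50">
    <Condition type="ScreenWidth" operator="greater" value="1024" />
    <ConditionOperator type="and" />
    <Condition type="ScreenHeight" operator="greater" value="768" />
    <ConditionOperator type="and" />
    <Condition type="Platform" operator="not_equal" value="xbox360" />
  </ConditionalModifier>
  <ConditionalModifier cm_name="CM_LowResTexture" texture="widget_low.tga">
    <Condition name="Condition_MinSpec" type="ScreenWidth"
               operator="less_equal" value="1024" />
  </ConditionalModifier>
</Widget>
```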
It is also possible for two different CMs to change the same widget property in different cases. The number of different conditions depends only on what you need to obtain the desired UI layout for a specific environment.

Implementation Details of the CM System

Implementing the CM system is straightforward. In order to keep the creation of CMs and their corresponding conditions in one place in the code, you can use the well-known Factory Method pattern [Gamma94] for creating conditions. Figure 4.10.4 shows the class diagram for the CM system. As the UML diagram shows, a widget can contain an arbitrary number of CMs, and each CM contains at least one condition. Two conditions with a condition operator in between form a condition pair that can be evaluated. Of course, you have to consider operator precedence in these cases. For the sake of simplicity, we show only two custom conditions in the diagram (ConditionScreenWidth and ConditionScreenHeight). Additional condition types can be added, depending on your needs. You can also add custom condition operators derived from ConditionOperatorBase.

Conditional Modifier Condition References

As you may have noticed, a non-trivial CM definition adds a good deal of potentially redundant data to each widget definition. This redundancy can be resolved by adding a CM-specific property (cm_conditions_reference) to the CM definition that references an XML file containing all the relevant conditions and condition operators. The advantage is that when you have to modify this set of conditions, you only need to touch one single file, and all referring CMs will work with the new definition. The corresponding definitions reside in a shared file such as conditions_xbox360.xml.

Widget Templates

We can further improve the way we define widgets, especially if we make use of many similar widget definitions. A good example of this is the definition of a default button style in a game.
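The Factory Method approach for creating conditions mentioned above might be sketched as follows. The class names ConditionScreenWidth and ConditionOperatorBase come from the text; everything else is illustrative:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Base class for all conditions, as in the text's class diagram.
class ConditionBase {
public:
    virtual ~ConditionBase() = default;
    virtual bool evaluate() const = 0;
};

class ConditionScreenWidth : public ConditionBase {
public:
    bool evaluate() const override { return true; } // stub for the sketch
};

// One registry maps condition type names (as read from XML) to creators,
// keeping condition creation in a single place in the code.
class ConditionFactory {
public:
    using Creator = std::function<std::unique_ptr<ConditionBase>()>;
    void registerType(const std::string& name, Creator c) {
        creators[name] = std::move(c);
    }
    std::unique_ptr<ConditionBase> create(const std::string& name) const {
        auto it = creators.find(name);
        return it != creators.end() ? it->second() : nullptr;
    }
private:
    std::map<std::string, Creator> creators;
};
```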
Default buttons may share many properties that are equal in all instances; for example, the default button’s size, texture states, sounds, and so on would be the same for each instance. It would not make much sense if, at each place where you define some sort of default button, you had to specify all the properties that define a default button. For these cases you can make use of widget templates (see Figure 4.10.6).

Figure 4.10.5 Conditional modifier condition references.

A widget template defines a widget with a default set of properties. A widget template is an additional XML widget definition file. A concrete widget definition can refer to a widget template via the additional property template_filename. The widget template can contain CMs and conditions. From the implementation point of view, you create a widget instance from the specified template XML definition, apply all properties found in the widget template, then apply all CMs defined in the template, and at the end you set the instance-specific properties for the concrete widget instance. Staying with the default button example, instance-specific properties would be the text shown on the button or the event that gets fired when clicking the button. This way you save a lot of data inside the widget definitions, and you can make changes quickly because you only have to change the widget template itself. This feature is particularly useful for global styles. In a typical use of widget templates, the property template_filename refers to the widget template, which contains all default properties and CMs, including conditions; the only custom properties for the widget instance are then its position and its name.

Proxy Assets

Many situations require assets such as textures or UI effects to be loaded at run time. One example would be that you want to exchange a texture or show some UI effects to highlight some widget.
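The widget-template reference described above might look like this. The property template_filename comes from the text; the filenames and other attributes are illustrative:

```xml
<!-- default_button.xml: shared default properties for all buttons -->
<Widget size="120,40" texture="button_default.tga" />

<!-- A concrete instance: only name and position are instance-specific -->
<Widget name="OkButton" template_filename="default_button.xml" position="200,300" />
```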
This requires the specific UI effects that fit the current widget shape and size to be loaded. At this point, when using CMs, you don’t know which asset to load, because for different widget sizes you may want to apply textures with different resolutions, and for different widget shapes and sizes you have to use different UI effects. In order to decouple these problems from the source code, you can make use of proxy assets. A proxy asset is an XML file that contains the asset information itself (for example, the path to some texture or some UI effects file) and some CMs that control which asset to use in which environment. This is similar to the use of CMs in conjunction with widgets. As an example, a proxy asset for a texture can use high_res_texture.tga by default; only when Condition_MinSpec evaluates to true do we actually load low_res_texture.tga. So in the source code you would reference the proxy asset’s XML definition instead of a concrete texture filename. This way, we have decoupled the asset from the source code. Because we use the same CM system here as described in conjunction with widgets, we can use the same factory as well. So when adding new conditions, you can use them automatically for proxy assets, too. See Figure 4.10.7.

Performance

Of course, all this parsing of CMs takes time. The conditions need only be evaluated once, so it makes sense to keep the conditions in memory and query the result when needed. This is, of course, only valid in cases where you do not create conditions that can change during run time. The same is true for widget templates. If your engine supports the cloning of widgets, you can keep a copy of the widget template in memory and clone it when you create a new widget instance from it.
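Reconstructing the proxy asset example described above (the texture filenames and the condition name come from the text; the element names are illustrative):

```xml
<ProxyAsset default="high_res_texture.tga">
  <ConditionalModifier asset="low_res_texture.tga">
    <ConditionReference name="Condition_MinSpec" />
  </ConditionalModifier>
</ProxyAsset>
```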
Proxy assets can be optimized by loading and evaluating them only once and reusing the results from memory.

Figure 4.10.7 Proxy assets decouple assets from the source code.

Problems

When using CMs, the definition of widgets becomes a more complex task, and additional work is required to configure all CMs and their corresponding conditions. We recommend tool support for adding, removing, and modifying CMs. In our last project, we effectively had three different UI layouts that covered all screen resolutions. We defined a minimum screen resolution of 1024×768 (4:3), a medium one of 1280×1024 (5:4), and a large one of 1680×1050 (16:10). Screen resolutions greater in width and height than the ones defined use the large-screen-resolution layout. Screen resolutions in between use the next smaller layout. UI elements positioned relative to a screen border keep the correct position via their border-linking property. Because of this, it is possible to scale the UI layout to higher screen resolutions. This way, we covered the most common screen resolutions used by our customers. We also disabled some laptop screen resolutions in order to save time. Of course, when running the game at extremely large screen resolutions, you run into the problem described earlier (widgets become small in relation to screen size). Fortunately, screen sizes do not grow at such a pace that this should become a problem. The UI Editor we used in house was capable of generating CMs recursively for the position and size of the widgets. Therefore, we created our base UI layout for the smallest supported screen resolution of 1024×768 and let the UI Editor create CMs with a corresponding scale factor for the higher resolutions. Another aspect you have to consider is the additional amount of testing needed when you have different UI layouts. Therefore, you should plan early which UI layouts you need and by which conditions these layouts are controlled.
This way, your QA can check the different UI layouts against a defined checklist.

Conclusion

The CM system described in this gem gives you a good deal of flexibility in defining your UI layout. It offers UI designers and UI artists many options to define the UI layout without requiring corresponding source code changes. On the other hand, you have to invest some work to integrate this system into your engine, and testing the UI layouts is more work compared to testing a single simple UI layout. But in our experience with this system, the advantages outweigh the additional work.

References

[Gamma94] Gamma, E., et al. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1994.

4.11 Road Creation for Projectable Terrain Meshes
Igor Borovikov, Aleksey Kadukin

Roads are becoming an increasingly important part of large environments in modern games. They play a significant role in both the aesthetics and the functionality of the environment. In this gem, we explore several techniques that are useful for modeling roads in a large game environment like that of The Sims 3 by Electronic Arts [EA:Sims3]. The techniques discussed are general, and for simplicity we limit our discussion to the case of projectable terrain meshes. Some of our inspiration was drawn from earlier works, such as Digital Element World Builder [DEWB], a package for modeling and rendering natural scenes. We start by building a trajectory for a road on a projectable mesh using elements of variational calculus and describe an algorithm for creating optimal paths on the terrain surface. Next, we move to building a mesh for the road on the terrain while satisfying a number of natural design requirements for the local road geometry.

Roads as Geodesic Curves

Many roads in the real world exhibit certain optimal properties in the sense that they were built to minimize the cost of connecting one location to another.
As a trivial example, consider an ideal road between two points A and B on a horizontal plane: it is a straight line, the shortest curve connecting the points. When placing roads between two locations on a terrain model, we can use the same principle and require that a certain cost function is minimized along the curve representing the road. Variational calculus provides a framework for describing such curves. An introduction to variational calculus can be found in any good book on differential geometry, for example [DNF91]. We will limit the formal part of our presentation here to the bare minimum, skipping details that are not important for the algorithm we propose.

Variation Problem for Roads

We will represent a road between points A and B on a terrain with a smooth curve γ(t) parameterized with the natural parameter (in other words, the curve is parameterized by arc length), where γ(0) = A, γ(L) = B, and L is the length of the curve. Then we can define the cost of moving from A to B along such a curve as the length of the curve:

\( \text{cost}(\gamma) = \int_0^L ds \)    (1)

where \( ds \) is the cost function:

\( ds = \sqrt{dx^2 + dy^2 + dz^2} \)    (2)

This cost function is calculated based on a Euclidean metric in 3D space. The cost function (2), restricted to the terrain surface, induces the so-called Riemannian metric on the terrain (which still directly corresponds to our intuitive geometric notions of distance). Shortest-length curves in such a Riemannian space are called geodesics. For terrain modeled as a height map z = f(x,y), we have:

\( ds = \sqrt{dx^2 + dy^2 + \left(\tfrac{\partial f}{\partial x}\,dx + \tfrac{\partial f}{\partial y}\,dy\right)^2} \)    (3)

Geodesics for such a cost function are similar to a rubber band stretched across the terrain along the shortest trajectory between the destination points while keeping contact with the surface across its entire length. Such curves provide the shortest path but do not correspond to our desire for natural-looking roads; in particular, they ignore the role of gravity. An important observation is that actual roads also minimize the variation of altitude along the trajectory.
This is true for both foot trails and automobile roads. Instead of going along the shortest path across a hill, roads tend to bend around it while trying to maintain the same altitude. The altitude variation element along a path is:

\( dz = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy \)    (4)

We can define a new length element as follows:

\( ds_\lambda = \sqrt{dx^2 + dy^2 + (1+\lambda)\,dz^2} \)    (5)

with λ >= 0. For larger values of λ, we expect the geodesics to be “flatter” in the z direction. Using the new cost function (5) in (1) gives us:

\( \text{cost}_\lambda(\gamma) = \int_\gamma \sqrt{dx^2 + dy^2 + (1+\lambda)\,dz^2} \)    (6)

which provides a better approximation to the behavior of actual roads. The parameter λ corresponds to the tradeoff between the desire to reach the destination point as soon as possible and the desire to save on climbing up and down any slopes along the pathway.

Solutions of the variational problem (1) can be found among the solutions of the Euler-Lagrange equations:

\( \frac{d}{dt}\,\frac{\partial \mathcal{L}}{\partial \dot{\gamma}_k} - \frac{\partial \mathcal{L}}{\partial \gamma_k} = 0 \)    (7)

where \( \mathcal{L} \) denotes the integrand of (6). When solving the equations (7), we also need to satisfy the boundary conditions γ(0) = A and γ(L) = B. However, solving the boundary condition problem for the equations (7) is not a simple task. Instead, in the next section, we will obtain a numeric solution by directly optimizing a discrete version of (6).

Numerical Solution for Geodesics

Here we will look for the minimum of the cost functional (6) directly, by approximating the smooth curve γ with a piecewise linear curve on the surface. The following procedure converges to the solution of the original continuous problem under a wide range of conditions; however, in the interest of brevity, we will leave out the proof. A piecewise linear representation of the curve γ(t) requires n+1 nodes: {γi, i = 0, …, n} or, in coordinates, γi = (xi, yi, f(xi, yi)). For such a curve, there will be n line cuts between nodes: ci = [γi, γi+1]. The length of the line cut ci is denoted |ci|.
Obviously,

\( |c_i| = \sqrt{(x_{i+1}-x_i)^2 + (y_{i+1}-y_i)^2 + (z_{i+1}-z_i)^2} \), where \( z_i = f(x_i, y_i) \)    (8)

The discrete approximation of the cost along the curve is the following sum instead of an integral:

\( G = \sum_{i=0}^{n-1} \sqrt{|c_i|^2 + \lambda\,(z_{i+1}-z_i)^2} \)    (9)

For a fixed number of nodes, the following iterative optimization process allows us to find a local minimum of (9):

    G := calculate the cost using (9)
    CostChange := G
    While cost change is greater than threshold do:
        For each node 0

References

[DNF91] Dubrovin, B. A., A. T. Fomenko, and S. P. Novikov. Modern Geometry—Methods and Applications: Part I: The Geometry of Surfaces, Transformation Groups, and Fields. Springer, 1991.
[Wiest98] Wiest, Richard L. “A Landowner’s Guide to Building Forest Access Roads.” National Forest Service, July 1998.

4.12 Developing for Digital Drawing Tablets
Neil Gower

A walk through the art department in almost any game studio will reveal a wide selection of digital drawing tablets. In their most basic form, tablets replace the mouse with a stylus for pointer input. However, treating a tablet as merely a mouse substitute greatly underutilizes its potential. Most tablets offer a rich set of inputs, including pressure and tilt sensitivity as well as a variety of buttons. In this gem, we look at ways to harness the full potential of the drawing tablet as an input device, first by surveying the research on pen-based interfaces and then by developing an interface layer in C++ to conveniently access tablet functionality. Equipped with this knowledge, you will be able to make full use of tablet features directly in your tools. Along the way, we also note best practices to follow when developing for tablets.

Background

The pen-based interface devices we find in game development are typically digitizer tablets, tablet PCs, or displays with built-in tablet functionality. Standard tablets provide a form of indirect interaction—the user moves the pen on a tablet on their desk to move the pointer on the screen.
Tablet PCs and tablet displays provide direct interaction by allowing the user to place the pen directly on the display.