Operating System Concepts with Java


The integrated, online text offers students a low-cost alternative to the printed text, enhanced with direct links to specific portions of the text, plus interactive animations and videos to help students visualize concepts. Select WileyPLUS courses include LabRat, the programming assignment tool that allows you to choose from hundreds of programming exercises to automate the process of assigning, completing, compiling, testing, submitting, and evaluating programming assignments. When students spend more time practicing, they come to class better prepared.

See and try WileyPLUS in action! Details and demo: www.wileyplus.com. Try WileyPLUS with LabRat: www.wileyplus.com/tours

Why WileyPLUS for Computer Science? WileyPLUS for Computer Science is a dynamic online environment that motivates students to spend more time practicing the problems and exploring the material they need to master in order to succeed.

"I used the homework problems to practice before quizzes and tests. I re-did the problems on my own and then checked WileyPLUS to compare answers." — Student Alexandra Leifer, Ithaca College

Students easily access source code for example problems, as well as self-study materials and all the exercises and readings you assign. All instructional materials, including PowerPoints, illustrations and visual tools, source code, exercises, solutions, and the gradebook, are organized in one easy-to-use system. WileyPLUS combines robust course-management tools with the complete online text and all of the interactive teaching and learning resources you and your students need in one easy-to-use system.

"Overall the WileyPLUS package made the material more interesting than if it was just a book with no additional material other than what you need to know." — Edin Alic, North Dakota State University

"WileyPLUS made it a lot easier to study. I got an A!" — Student Jeremiah Ellis Mattson, North Dakota State University

Operating System Concepts with Java
Eighth Edition

ABRAHAM SILBERSCHATZ, Yale University
PETER BAER GALVIN, Corporate Technologies, Inc.
GREG GAGNE, Westminster College

JOHN WILEY & SONS, INC.
Vice President and Executive Publisher: Don Fowley
Executive Editor: Beth Lang Golub
Editorial Assistant: Mike Berlin
Executive Marketing Manager: Christopher Ruel
Marketing Assistant: Diana Smith
Senior Media Editor: Lauren Sapira
Senior Production Editor: Ken Santor
Cover Illustrations: Susan Cyr
Cover Designer: Howard Grossman
Text Designer: Judy Allan

This book was set in Palatino by the author using LaTeX and printed and bound by R. R. Donnelley Jefferson City. The cover was printed by R. R. Donnelley Jefferson City.

Copyright © 2010 John Wiley & Sons, Inc. All rights reserved.

Credits: Figure 1.11: From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Third Edition, 2002, © Morgan Kaufmann Publishers, Figure 5.3, p. 394. Reprinted with permission of the publisher. Figure 5.13: Adapted with permission from Sun Microsystems, Inc. Figure 9.18: From IBM Systems Journal, Vol. 10, No. 3, © 1971, International Business Machines Corporation. Reprinted by permission of IBM Corporation. Figure 11.9: From Leffler/McKusick/Karels/Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, © 1989 by Addison-Wesley Publishing Co., Inc., Reading, MA. Figure 7.6, p. 196. Reprinted with permission of the publisher. Figure 13.4: From Pentium Processor User's Manual: Architecture and Programming Manual, Volume 3, © 1993. Reprinted by permission of Intel Corporation. Figures 16.6, 16.7, and 16.9: From Halsall, Data Communications, Computer Networks, and Open Systems, Third Edition, © 1992, Addison-Wesley Publishing Co., Inc., Reading, MA. Figure 1.9, p. 14, Figure 1.10, p. 15, and Figure 1.11, p. 18. Reprinted with permission of the publisher. Figure 19.5: From Khanna/Sebree/Zolnowsky, "Realtime Scheduling in SunOS 5.0," Proceedings of Winter USENIX, January 1992, San Francisco, CA. Derived with permission of the authors. Figure 23.6: Dan Murphy (http://tenex.opost.com/kapix.html). Sections of Chapter 6 and Chapter 18: From Silberschatz/Korth, Database System Concepts, Third Edition, © 1997, McGraw-Hill, Inc., New York, NY. Section 13.5, p. 451-454, 14.1.1, p. 471-742, 14.1.3, p. 476-479, 14.2, p. 482-485, 15.2.1, p. 512-513, 15.4, p. 517-518, 15.4.3, p. 523-524, 18.7, p. 613-617, 18.8, p. 617-622. Reprinted with permission of the publisher.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, E-Mail: PERMREQ@WILEY.COM.

Evaluation copies are provided to qualified academics and professionals for review purposes only, for use in their courses during the next academic year. These copies are licensed and may not be sold or transferred to a third party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instructions
and a free-of-charge return shipping label are available at www.wiley.com/go/evalreturn. Outside of the United States, please contact your local representative.

ISBN: 978-0-470-50949-4

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To my Valerie
Avi Silberschatz

To my parents, Brendan and Ellen Galvin
Peter Baer Galvin

To Steve, Ray, and Bobbie
Greg Gagne

Abraham Silberschatz is the Sidney J. Weinberg Professor & Chair of Computer Science at Yale University. Prior to joining Yale, he was the Vice President of the Information Sciences Research Center at Bell Laboratories. Prior to that, he held a chaired professorship in the Department of Computer Sciences at the University of Texas at Austin. Professor Silberschatz is an ACM Fellow, an IEEE Fellow, and a member of the Connecticut Academy of Science and Engineering. He received the 2002 IEEE Taylor L. Booth Education Award, the 1998 ACM Karl V. Karlstrom Outstanding Educator Award, and the 1997 ACM SIGMOD Contribution Award. In recognition of his outstanding level of innovation and technical excellence, he was awarded the Bell Laboratories President's Award for three different projects—the QTM Project (1998), the DataBlitz Project (1999), and the NetInventory Project (2004). Professor Silberschatz's writings have appeared in numerous ACM and IEEE publications and other professional conferences and journals. He is a coauthor of the textbook Database System Concepts. He has also written Op-Ed articles for the New York Times, the Boston Globe, and the Hartford Courant, among others.

Peter Baer Galvin is the CTO for Corporate Technologies (www.cptech.com), a computer facility reseller and integrator. Before that, Mr. Galvin was the systems manager for Brown University's Computer Science Department. He is also a Sun columnist for ;login: magazine. Mr. Galvin has written articles for Byte and other magazines and has written columns for SunWorld and SysAdmin magazines. As a consultant and trainer, he has given talks and taught tutorials on security and system administration worldwide.

Greg Gagne is chair of the Computer Science department at Westminster College in Salt Lake City, where he has been teaching since 1990. In addition to teaching operating systems, he also teaches computer networks, distributed systems, and software engineering. He also provides workshops to computer science educators and industry professionals.

Preface

Operating systems are an essential part of any computer system. Similarly, a course on operating systems is an essential part of any computer-science education. This field is undergoing rapid change, as computers are now prevalent in virtually every application, from games for children through the most sophisticated planning tools for governments and multinational firms. Yet the fundamental concepts remain fairly clear, and it is on these that we base this book.

We wrote this book as a text for an introductory course in operating systems at the junior or senior undergraduate level or at the first-year graduate level. We hope that practitioners will also find it useful. It provides a clear description of the concepts that underlie operating systems. As prerequisites, we assume that the reader is familiar with basic data structures, computer organization, and a high-level language, preferably Java. The hardware topics required for an understanding of operating systems are included in Chapter 1.
For code examples, we use predominantly Java, with some C, but the reader can still understand the algorithms without a thorough knowledge of these languages.

Concepts are presented using intuitive descriptions. Important theoretical results are covered, but formal proofs are omitted. The bibliographical notes at the end of each chapter contain pointers to research papers in which results were first presented and proved, as well as references to material for further reading. In place of proofs, figures and examples are used to suggest why we should expect the result in question to be true.

The fundamental concepts and algorithms covered in the book are often based on those used in existing commercial operating systems. Our aim is to present these concepts and algorithms in a general setting that is not tied to one particular operating system. We present a large number of examples that pertain to the most popular and the most innovative operating systems, including Sun Microsystems' Solaris; Linux; Microsoft Windows Vista, Windows 2000, and Windows XP; and Apple Mac OS X. When we refer to Windows XP as an example operating system, we are implying Windows Vista, Windows XP, and Windows 2000. If a feature exists in a specific release, we state this explicitly.

Organization of This Book

The organization of this text reflects our many years of teaching courses on operating systems. Consideration was also given to the feedback provided by the reviewers of the text, as well as to comments submitted by readers of earlier editions. In addition, the content of the text corresponds to the suggestions from Computing Curricula 2005 for teaching operating systems, published by the Joint Task Force of the IEEE Computer Society and the Association for Computing Machinery (ACM).

On the supporting Web site for this text, we provide several sample syllabi that suggest various approaches for using the text in both introductory and advanced courses. As a general rule, we encourage readers to progress sequentially through the chapters, as this strategy provides the most thorough study of operating systems. However, by using the sample syllabi, a reader can select a different ordering of chapters (or sections of chapters).

Online support for the text is provided by WileyPLUS. On this site, students can find sample exercises and programming problems, and instructors can assign and grade problems. In addition, on WileyPLUS, students can access new operating-system simulators, which they can use to work through exercises and hands-on lab activities. References to the simulators and associated activities appear at the ends of several chapters in the text.

Content of This Book

The text is organized in eight major parts:

• Overview. Chapters 1 and 2 explain what operating systems are, what they do, and how they are designed and constructed. These chapters discuss what the common features of an operating system are, what an operating system does for the user, and what it does for the computer-system operator. The presentation is motivational and explanatory in nature. We have avoided a discussion of how things are done internally in these chapters. Therefore, they are suitable for individual readers or for students in lower-level classes who want to learn what an operating system is without getting into the details of the internal algorithms.

• Process management. Chapters 3 through 7 describe the process concept and concurrency as the heart of modern operating systems.
A process is the unit of work in a system. Such a system consists of a collection of concurrently executing processes, some of which are operating-system processes (those that execute system code) and the rest of which are user processes (those that execute user code). These chapters cover methods for process scheduling, interprocess communication, process synchronization, and deadlock handling. Also included is a discussion of threads, as well as an examination of issues related to multicore systems.

• Memory management. Chapters 8 and 9 deal with the management of main memory during the execution of a process. To improve both the utilization of the CPU and the speed of its response to its users, the computer must keep several processes in memory. There are many different memory-management schemes reflecting various approaches to memory management, and the effectiveness of a particular algorithm depends on the situation.

• Storage management. Chapters 10 through 13 describe how the file system, mass storage, and I/O are handled in a modern computer system. The file system provides the mechanism for on-line storage of and access to both data and programs. We describe the classic internal algorithms and structures of storage management and provide a firm practical understanding of the algorithms used—their properties, advantages, and disadvantages. Our discussion of storage also includes matters related to secondary and tertiary storage. Since the I/O devices that attach to a computer vary widely, the operating system needs to provide a wide range of functionality to applications to allow them to control all aspects of these devices. We discuss system I/O in depth, including I/O system design, interfaces, and internal system structures and functions. In many ways, I/O devices are the slowest major components of the computer. Because they represent a performance bottleneck, we also examine performance issues associated with I/O devices.

• Protection and security. Chapters 14 and 15 discuss the mechanisms necessary for the protection and security of computer systems. The processes in an operating system must be protected from one another's activities, and to provide such protection, we must ensure that only processes that have gained proper authorization from the operating system can operate on the files, memory, CPU, and other resources of the system. Protection is a mechanism for controlling the access of programs, processes, or users to the resources defined by a computer system. This mechanism must provide a means of specifying the controls to be imposed, as well as a means of enforcement. Security protects the integrity of the information stored in the system (both data and code), as well as the physical resources of the system, from unauthorized access, malicious destruction or alteration, and accidental introduction of inconsistency.

• Distributed systems. Chapters 16 through 18 deal with a collection of processors that do not share memory or a clock—a distributed system. By providing the user with access to the various resources that it maintains, a distributed system can improve computation speed and data availability and reliability. Such a system also provides the user with a distributed file system, which is a file-service system whose users, servers, and storage devices are dispersed among the sites of a distributed system.
A distributed system must provide various mechanisms for process synchronization and communication, as well as for dealing with deadlock problems and a variety of failures that are not encountered in a centralized system.

• Special-purpose systems. Chapters 19 and 20 deal with systems used for specific purposes, including real-time systems and multimedia systems. These systems have specific requirements that differ from those of the general-purpose systems that are the focus of the remainder of the text. Real-time systems may require not only that computed results be "correct" but also that the results be produced within a specified deadline period. Multimedia systems require quality-of-service guarantees ensuring that the multimedia data are delivered to clients within a specific time frame.

• Case studies. Chapters 21 through 23 in the book, and Appendices A through C (which are available on www.wiley.com/college/silberschatz and WileyPLUS), integrate the concepts described in the earlier chapters by describing real operating systems. These systems include Linux, Windows XP, FreeBSD, Mach, and Windows 2000. We chose Linux and FreeBSD because UNIX—at one time—was almost small enough to understand yet was not a "toy" operating system. Most of its internal algorithms were selected for simplicity, rather than for speed or sophistication. Both Linux and FreeBSD are readily available to computer-science departments, so many students have access to these systems. We chose Windows XP and Windows 2000 because they provide an opportunity for us to study a modern operating system with a design and implementation drastically different from those of UNIX. Chapter 23 briefly describes a few other influential operating systems.

Operating-System Environments

This book uses examples of many real-world operating systems to illustrate fundamental operating-system concepts. However, particular attention is paid to the Microsoft family of operating systems (including Windows Vista, Windows 2000, and Windows XP) and various versions of UNIX (including Solaris, BSD, and Mac OS X). We also provide a significant amount of coverage of the Linux operating system, reflecting the most recent version of the kernel—Version 2.6—at the time this book was written.

The text uses Java to illustrate many operating-system concepts, such as multithreading, CPU scheduling, process synchronization, deadlock, memory and file management, security, networking, and distributed systems. Java is more a technology than a programming language, so it is an excellent vehicle for demonstrations. Much of the Java-related material included has been developed and class-tested in undergraduate operating-systems classes. From our experience, students entering these classes lacking knowledge of Java—but with experience using C++ and basic object-oriented principles—generally have little trouble with Java. Rather, most difficulties lie in understanding such concepts as multithreading and data sharing by multiple, concurrently running threads. These concepts are systemic rather than being specific to Java; even students with a sound knowledge of Java are likely to have difficulty with them. We thus emphasize concepts of operating systems rather than concentrating on Java syntax. All the Java programs in this text compile with versions 1.5 and 1.6 of the Java Software Development Kit (SDK).
Java 1.5 provides several new features—both at the language level and at the API level—that enhance Java's usefulness for studying operating systems. We include several new additions to the Java 1.5 API throughout this text. Many programs provided in this text will not compile with earlier releases of the Java SDK, and we encourage all readers to use Java 1.5 as a minimum Java configuration.

The text also provides a few example programs written in C that are intended to run in the Windows and POSIX programming environments. POSIX (which stands for Portable Operating System Interface) represents a set of standards implemented primarily for UNIX-based operating systems. Although Windows Vista, Windows XP, and Windows 2000 systems can also run certain POSIX programs, our coverage of POSIX focuses primarily on UNIX and Linux systems.

The Eighth Edition

As we wrote the Eighth Edition of Operating System Concepts with Java, we were guided by the many comments and suggestions we received from readers of our previous editions, as well as by our own observations about the rapidly changing fields of operating systems and networking. We have rewritten material in most of the chapters by bringing older material up to date and removing material that was no longer of interest or relevance. We have made substantive revisions and organizational changes in many of the chapters. Most importantly, we have added coverage of open-source operating systems in Chapter 1. We have also added more practice exercises for students and included solutions in WileyPLUS, which also includes new simulators to provide demonstrations of operating-system operation. Below, we provide a brief outline of the major changes to the various chapters:

• Chapter 1, Introduction, has been expanded to include multicore CPUs, clustered computers, and open-source operating systems.

• Chapter 2, Operating-System Structures, provides significantly updated coverage of virtual machines, as well as multicore CPUs, the GRUB boot loader, and operating-system debugging.

• Chapter 4, Threads, adds new coverage of programming for multicore systems and updates the coverage of Java thread states.

• Chapter 5, CPU Scheduling, adds coverage of virtual machine scheduling and multithreaded, multicore architectures. It also includes new scheduling features in Java 1.5.

• Chapter 6, Process Synchronization, adds a discussion of mutual exclusion locks, priority inversion, and transactional memory.

• Chapter 8, Main Memory, includes a discussion of NUMA.

• Chapter 9, Virtual Memory, updates the Solaris example to include Solaris 10 memory management.

• Chapter 10, File-System Interface, is updated with current technologies and capacities.

• Chapter 11, File-System Implementation, includes a full description of Sun's ZFS file system and expands the coverage of volumes and directories.

• Chapter 12, Mass-Storage Structure, adds coverage of iSCSI, volumes, and ZFS pools.

• Chapter 13, I/O Systems, adds coverage of PCIX, PCI Express, and HyperTransport.

• Chapter 16, Distributed System Structures, adds coverage of 802.11 wireless networks.

• Chapter 21, The Linux System, has been updated to cover the latest version of the Linux kernel.

• Chapter 23, Influential Operating Systems, increases coverage of very early computers as well as TOPS-20, CP/M, MS-DOS, Windows, and the original Mac OS.
Programming Problems and Projects

To emphasize the concepts presented in the text, we have added or modified 12 programming problems and projects that use Java. In general, programming projects are more detailed and require a greater time commitment than programming problems. These problems and projects emphasize processes and interprocess communication, threads, process synchronization, virtual memory, file systems, and networking. New programming problems and projects include implementing socket programming, using Java's remote method invocation (RMI), working with multithreaded sorting programs, listing threads in the Java virtual machine (JVM), designing a process-identifier management system, and managing virtual memory.

The Eighth Edition also incorporates a set of operating-system simulators designed by Steven Robbins of the University of Texas at San Antonio. The simulators are intended to model the behavior of an operating system as it performs various tasks, such as CPU and disk-head scheduling, process creation and interprocess communication, starvation, and address translation. These simulators are written in Java and will run on any computer system with Java 1.4. Students can download the simulators from WileyPLUS and observe the behavior of several operating-system concepts in various scenarios. In addition, each simulator includes several exercises that ask students to set certain parameters of the simulator, observe how the system behaves, and then explain this behavior. These exercises can be assigned through WileyPLUS. The WileyPLUS course also includes algorithmic problems and tutorials.

Teaching Supplements

The following teaching supplements are available on WileyPLUS and www.wiley.com/college/silberschatz: a set of slides to accompany the book, model course syllabi, all Java and C source code, up-to-date errata, three case-study appendices, and the Distributed Communication appendix. The WileyPLUS course also contains simulators and associated exercises, additional practice exercises (with solutions) not found in the text, and a test bank of additional problems. Students are encouraged to solve the practice exercises on their own and then use the solutions provided to check their own answers.

To obtain restricted supplements, such as the solution guide to the exercises in the text, contact your local John Wiley & Sons sales representative. Note that these supplements are available only to faculty who use this text. You can find your Wiley representative by going to www.wiley.com/college and clicking "Who's my rep?"

Contacting Us

We have attempted to clean up every error in this new edition, but—as happens with operating systems—a few obscure bugs may remain; an up-to-date errata list is accessible from the book's home page. We would appreciate hearing from you about any textual errors or omissions in the book that are not on the current list of errata.

We would be glad to receive suggestions on improvements to the book. We also welcome any contributions to the book's Web page that could be of use to other readers, such as programming exercises, project suggestions, on-line labs and tutorials, and teaching tips. E-mail should be addressed to os-book-authors@cs.yale.edu. Any other correspondence should be sent to Avi Silberschatz, Department of Computer Science, Yale University, 51 Prospect Street, P.O. Box 208285, New Haven, CT 06520-8285 USA.
Acknowledgments

This book is derived from the previous editions, the first three of which were coauthored by James Peterson. Others who helped us with previous editions include Hamid Arabnia, Rida Bazzi, Randy Bentson, David Black, Joseph Boykin, Jeff Brumfield, Gael Buckley, Roy Campbell, P. C. Capon, John Carpenter, Gil Carrick, Thomas Casavant, Bart Childs, Ajoy Kumar Datta, Joe Deck, Sudarshan K. Dhall, Thomas Doeppner, Caleb Drake, M. Racsit Eskicioğlu, Hans Flack, Robert Fowler, G. Scott Graham, Richard Guy, Max Hailperin, Rebecca Hartman, Wayne Hathaway, Christopher Haynes, Don Heller, Bruce Hillyer, Mark Holliday, Dean Hougen, Michael Huangs, Ahmed Kamel, Richard Kieburtz, Carol Kroll, Morty Kwestel, Thomas LeBlanc, John Leggett, Jerrold Leichter, Ted Leung, Gary Lippman, Carolyn Miller, Michael Molloy, Euripides Montagne, Yoichi Muraoka, Jim M. Ng, Banu Özden, Ed Posnak, Boris Putanec, Charles Qualline, John Quarterman, Mike Reiter, Gustavo Rodriguez-Rivera, Carolyn J. C. Schauble, Thomas P. Skinner, Yannis Smaragdakis, Jesse St. Laurent, John Stankovic, Adam Stauffer, Steven Stepanek, John Sterling, Hal Stern, Louis Stevens, Pete Thomas, David Umbaugh, Steve Vinoski, Tommy Wagner, Larry L. Wear, John Werth, James M. Westall, J. S. Weston, and Yang Xiang.

Parts of Chapter 12 were derived from a paper by Hillyer and Silberschatz [1996]. Parts of Chapter 17 were derived from a paper by Levy and Silberschatz [1990]. Chapter 21 was derived from an unpublished manuscript by Stephen Tweedie. Chapter 22 was derived from an unpublished manuscript by Dave Probert, Cliff Martin, and Avi Silberschatz. Appendix C was derived from an unpublished manuscript by Cliff Martin. Cliff Martin also helped with updating the UNIX appendix to cover FreeBSD. Some of the exercises and accompanying solutions were supplied by Arvind Krishnamurthy. Marilyn Turnamian helped generate figures and presentation slides.

Mike Shapiro, Bryan Cantrill, and Jim Mauro answered several Solaris-related questions. Bryan Cantrill from Sun Microsystems helped with the ZFS coverage. Steve Robbins of the University of Texas at San Antonio designed the set of simulators that we incorporate in WileyPLUS. Reece Newman of Westminster College initially explored this set of simulators and their appropriateness for this text. Jason Belcher provided assistance with Java generics. Josh Dees and Rob Reynolds contributed coverage of Microsoft's .NET. Scott M. Pike of Texas A&M University contributed several algorithmic problems and tutorials for WileyPLUS. Judi Paige helped generate figures and presentation slides. Mark Wogahn has made sure that the software to produce the book (such as LaTeX macros and fonts) works properly.

Our executive editor, Beth Golub, provided expert guidance as we prepared this edition. She was assisted by Mike Berlin, who managed many details of this project smoothly. The Senior Production Editor, Ken Santor, was instrumental in handling all the production details. Lauren Sapira has been very helpful with getting material ready and available for WileyPLUS. The cover illustrator was Susan Cyr, and the cover designer was Howard Grossman. Beverly Peavler copy-edited the manuscript. The freelance proofreader was Katrina Avery; the freelance indexer was WordCo, Inc.

Finally, we would like to add some personal notes. Avi would like to thank Valerie for her support during the preparation of this new edition. Peter would like to thank his colleagues at Corporate Technologies.
Greg would like to acknowledge three colleagues from Westminster College who were instrumental in his hiring in 1990: his dean, Ray Ownbey; program chair, Bobbie Fredsall; and academic vice president, Stephen Baar.

Abraham Silberschatz, New Haven, CT, 2009
Peter Baer Galvin, Burlington, MA, 2009
Greg Gagne, Salt Lake City, UT, 2009

Contents

PART ONE: OVERVIEW

Chapter 1, Introduction: 1.1 What Operating Systems Do; 1.2 Computer-System Organization; 1.3 Computer-System Architecture; 1.4 Operating-System Structure; 1.5 Operating-System Operations; 1.6 Process Management; 1.7 Memory Management; 1.8 Storage Management; 1.9 Protection and Security; 1.10 Distributed Systems; 1.11 Special-Purpose Systems; 1.12 Computing Environments; 1.13 Open-Source Operating Systems; 1.14 Summary; Exercises; Bibliographical Notes

Chapter 2, Operating-System Structures: 2.1 Operating-System Services; 2.2 User Operating-System Interface; 2.3 System Calls; 2.4 Types of System Calls; 2.5 System Programs; 2.6 Operating-System Design and Implementation; 2.7 Operating-System Structure; 2.8 Virtual Machines; 2.9 Java; 2.10 Operating-System Debugging; 2.11 Operating-System Generation; 2.12 System Boot; 2.13 Summary; Exercises; Bibliographical Notes

PART TWO: PROCESS MANAGEMENT

Chapter 3, Processes: 3.1 Process Concept; 3.2 Process Scheduling; 3.3 Operations on Processes; 3.4 Interprocess Communication; 3.5 Examples of IPC Systems; 3.6 Communication in Client–Server Systems; 3.7 Summary; Exercises; Bibliographical Notes

Chapter 4, Threads: 4.1 Overview; 4.2 Multithreading Models; 4.3 Thread Libraries; 4.4 Java Threads; 4.5 Threading Issues; 4.6 Operating-System Examples; 4.7 Summary; Exercises; Bibliographical Notes

Chapter 5, CPU Scheduling: 5.1 Basic Concepts; 5.2 Scheduling Criteria; 5.3 Scheduling Algorithms; 5.4 Thread Scheduling; 5.5 Multiple-Processor Scheduling; 5.6 Operating System Examples; 5.7 Java Scheduling; 5.8 Algorithm Evaluation; 5.9 Summary; Exercises; Bibliographical Notes

Chapter 6, Process Synchronization: 6.1 Background; 6.2 The Critical-Section Problem; 6.3 Peterson's Solution; 6.4 Synchronization Hardware; 6.5 Semaphores; 6.6 Classic Problems of Synchronization; 6.7 Monitors; 6.8 Java Synchronization; 6.9 Synchronization Examples; 6.10 Atomic Transactions; 6.11 Summary; Exercises; Bibliographical Notes

Chapter 7, Deadlocks: 7.1 System Model; 7.2 Deadlock Characterization; 7.3 Methods for Handling Deadlocks; 7.4 Deadlock Prevention; 7.5 Deadlock Avoidance; 7.6 Deadlock Detection; 7.7 Recovery from Deadlock; 7.8 Summary; Exercises; Bibliographical Notes

PART THREE: MEMORY MANAGEMENT

Chapter 8, Main Memory: 8.1 Background; 8.2 Swapping; 8.3 Contiguous Memory Allocation; 8.4 Paging; 8.5 Structure of the Page Table; 8.6 Segmentation; 8.7 Example: The Intel Pentium; 8.8 Summary; Exercises; Bibliographical Notes

Chapter 9, Virtual Memory: 9.1 Background; 9.2 Demand Paging; 9.3 Copy-on-Write; 9.4 Page Replacement; 9.5 Allocation of Frames; 9.6 Thrashing; 9.7 Memory-Mapped Files; 9.8 Allocating Kernel Memory; 9.9 Other Considerations for Paging Systems; 9.10 Operating-System Examples; 9.11 Summary; Exercises; Bibliographical Notes

PART FOUR: STORAGE MANAGEMENT

Chapter 10, File-System Interface: 10.1 File Concept; 10.2 Access Methods; 10.3 Directory and Disk Structure; 10.4 File-System Mounting; 10.5 File Sharing; 10.6 Protection; 10.7 Summary; Exercises; Bibliographical Notes

Chapter 11, File-System Implementation: 11.1 File-System Structure; 11.2 File-System Implementation; 11.3 Directory Implementation; 11.4 Allocation Methods; 11.5 Free-Space Management; 11.6 Efficiency and Performance; 11.7 Recovery; 11.8 NFS; 11.9 Example: The WAFL File System; 11.10 Summary; Exercises; Bibliographical Notes

Chapter 12, Mass-Storage Structure: 12.1 Overview of Mass-Storage Structure; 12.2 Disk Structure; 12.3 Disk Attachment; 12.4 Disk Scheduling; 12.5 Disk Management; 12.6 Swap-Space Management; 12.7 RAID Structure; 12.8 Stable-Storage Implementation; 12.9 Tertiary-Storage Structure; 12.10 Summary; Exercises; Bibliographical Notes

Chapter 13, I/O Systems: 13.1 Overview; 13.2 I/O Hardware; 13.3 Application I/O Interface; 13.4 Kernel I/O Subsystem; 13.5 Transforming I/O Requests to Hardware Operations; 13.6 STREAMS; 13.7 Performance; 13.8 Summary; Exercises; Bibliographical Notes

PART FIVE: PROTECTION AND SECURITY

Chapter 14, Protection: 14.1 Goals of Protection; 14.2 Principles of Protection; 14.3 Domain of Protection; 14.4 Access Matrix; 14.5 Implementation of Access Matrix; 14.6 Access Control; 14.7 Revocation of Access Rights; 14.8 Capability-Based Systems; 14.9 Language-Based Protection; 14.10 Summary; Exercises; Bibliographical Notes

Chapter 15, Security: 15.1 The Security Problem; 15.2 Program Threats; 15.3 System and Network Threats; 15.4 Cryptography as a Security Tool; 15.5 User Authentication; 15.6 Implementing Security Defenses; 15.7 Firewalling to Protect Systems and Networks; 15.8 Computer-Security Classifications; 15.9 An Example: Windows XP; 15.10 Summary; Exercises; Bibliographical Notes

PART SIX: DISTRIBUTED SYSTEMS

Chapter 16, Distributed System Structures: 16.1 Motivation; 16.2 Types of Network-based Operating Systems; 16.3 Network Structure; 16.4 Network Topology; 16.5 Communication Structure; 16.6 Communication Protocols; 16.7 Robustness; 16.8 Design Issues; 16.9 An Example: Networking; 16.10 Summary; Exercises; Bibliographical Notes

Chapter 17, Distributed File Systems: 17.1 Background; 17.2 Naming and Transparency; 17.3 Remote File Access; 17.4 Stateful Versus Stateless Service; 17.5 File Replication; 17.6 An Example: AFS; 17.7 Summary; Exercises; Bibliographical Notes

Chapter 18, Distributed Coordination: 18.1 Event Ordering; 18.2 Mutual Exclusion; 18.3 Atomicity; 18.4 Concurrency Control; 18.5 Deadlock Handling; 18.6 Election Algorithms; 18.7 Reaching Agreement; 18.8 Summary; Exercises; Bibliographical Notes

PART SEVEN: SPECIAL-PURPOSE SYSTEMS

Chapter 19, Real-Time Systems: 19.1 Overview; 19.2 System Characteristics; 19.3 Features of Real-Time Kernels; 19.4 Implementing Real-Time Operating Systems; 19.5 Real-Time CPU Scheduling; 19.6 An Example: VxWorks 5.x; 19.7 Summary; Exercises; Bibliographical Notes

Chapter 20, Multimedia Systems: 20.1 What Is Multimedia?; 20.2 Compression; 20.3 Requirements of Multimedia Kernels; 20.4 CPU Scheduling; 20.5 Disk Scheduling; 20.6 Network Management; 20.7 An Example: CineBlitz; 20.8 Summary; Exercises; Bibliographical Notes

PART EIGHT: CASE STUDIES

Chapter 21, The Linux System: 21.1 Linux History; 21.2 Design Principles; 21.3 Kernel Modules; 21.4 Process Management; 21.5 Scheduling; 21.6 Memory Management; 21.7 File Systems; 21.8 Input and Output; 21.9 Interprocess Communication; 21.10 Network Structure; 21.11 Security; 21.12 Summary; Exercises; Bibliographical Notes

Chapter 22, Windows XP: 22.1 History; 22.2 Design Principles; 22.3 System Components; 22.4 Environmental Subsystems; 22.5 File System; 22.6 Networking; 22.7 Programmer Interface; 22.8 Summary; Exercises; Bibliographical Notes

Chapter 23, Influential Operating Systems: 23.1 Feature Migration; 23.2 Early Systems; 23.3 Atlas; 23.4 XDS-940; 23.5 THE; 23.6 RC 4000; 23.7 CTSS; 23.8 MULTICS; 23.9 IBM OS/360; 23.10 TOPS-20; 23.11 CP/M and MS-DOS; 23.12 Macintosh Operating System and Windows; 23.13 Mach; 23.14 Other Systems; Exercises

PART NINE: APPENDICES

Appendix A, BSD UNIX (contents online): A.1 UNIX History; A.2 Design Principles; A.3 Programmer Interface; A.4 User Interface; A.5 Process Management; A.6 Memory Management; A.7 File System; A.8 I/O System; A.9 Interprocess Communication; A.10 Summary; Exercises; Bibliographical Notes

Appendix B, The Mach System (contents online): B.1 History of the Mach System; B.2 Design Principles; B.3 System Components; B.4 Process Management; B.5 Interprocess Communication; B.6 Memory Management; B.7 Programmer Interface; B.8 Summary; Exercises; Bibliographical Notes

Appendix C, Windows 2000 (contents online): C.1 History; C.2 Design Principles; C.3 System Components; C.4 Environmental Subsystems; C.5 File System; C.6 Networking; C.7 Programmer Interface; C.8 Summary; Exercises; Bibliographical Notes

Appendix D, Distributed Communication (contents online): D.1 Sockets; D.2 UDP Sockets; D.3 Remote Method Invocation; D.4 Other Aspects of Distributed Communication; D.5 Web Services; D.6 Summary; Exercises; Bibliographical Notes

Appendix E, Java Primer (contents online): E.1 Basics; E.2 Inheritance; E.3 Interfaces and Abstract Classes; E.4 Exception Handling; E.5 Applications and Applets; E.6 Summary; Bibliographical Notes

Bibliography
Index

Part One: Overview

An operating system acts as an intermediary between the user of a computer and the computer hardware. The purpose of an operating system is to provide an environment in which a user can execute programs in a convenient and efficient manner.

An operating system is software that manages the computer hardware. The hardware must provide appropriate mechanisms to ensure the correct operation of the computer system and to prevent user programs from interfering with the proper operation of the system.

Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. These goals form the basis for choices among various algorithms and strategies.
Because an operating system is large and complex, it must be created piece by piece. Each of these pieces should be a well-delineated portion of the system, with carefully defined inputs, outputs, and functions.

Chapter 1 Introduction

An operating system is a program that manages the computer hardware. It also provides a basis for application programs and acts as an intermediary between the computer user and the computer hardware. An amazing aspect of operating systems is how varied they are in accomplishing these tasks. Mainframe operating systems are designed primarily to optimize utilization of hardware. Personal computer (PC) operating systems support complex games, business applications, and everything in between. Operating systems for handheld computers are designed to provide an environment in which a user can easily interface with the computer to execute programs. Thus, some operating systems are designed to be convenient, others to be efficient, and others some combination of the two.

Before we can explore the details of computer system operation, we need to know something about system structure. We begin by discussing the basic functions of system startup, I/O, and storage. We also describe the basic computer architecture that makes it possible to write a functional operating system. Because an operating system is large and complex, it must be created piece by piece. Each of these pieces should be a well-delineated portion of the system, with carefully defined inputs, outputs, and functions. In this chapter, we provide a general overview of the major components of an operating system.

CHAPTER OBJECTIVES

• To provide a grand tour of the major components of operating systems.
• To describe the basic organization of computer systems.

1.1 What Operating Systems Do

We begin our discussion by looking at the operating system's role in the overall computer system. A computer system can be divided roughly into four components: the hardware, the operating system, the application programs, and the users (Figure 1.1).

[Figure 1.1 Abstract view of the components of a computer system: users at the top; system and application programs (compilers, assemblers, text editors, database systems) beneath them; the operating system; and the computer hardware at the base.]

The hardware—the central processing unit (CPU), the memory, and the input/output (I/O) devices—provides the basic computing resources for the system. The application programs—such as word processors, spreadsheets, compilers, and Web browsers—define the ways in which these resources are used to solve users' computing problems. The operating system controls the hardware and coordinates its use among the various application programs for the various users.

We can also view a computer system as consisting of hardware, software, and data. The operating system provides the means for proper use of these resources in the operation of the computer system. An operating system is similar to a government. Like a government, it performs no useful function by itself. It simply provides an environment within which other programs can do useful work.

To understand more fully the operating system's role, we next explore operating systems from two viewpoints: that of the user and that of the system.

1.1.1 User View

The user's view of the computer varies according to the interface being used. Most computer users sit in front of a PC, consisting of a monitor, keyboard, mouse, and system unit.
Such a system is designed for one user to monopolize its resources. The goal is to maximize the work (or play) that the user is performing. In this case, the operating system is designed mostly for ease of use, with some attention paid to performance and none paid to resource utilization—how various hardware and software resources are shared. Performance is, of course, important to the user; but such systems are optimized for the single-user experience rather than the requirements of multiple users.

In other cases, a user sits at a terminal connected to a mainframe or a minicomputer. Other users are accessing the same computer through other terminals. These users share resources and may exchange information. The operating system in such cases is designed to maximize resource utilization—to assure that all available CPU time, memory, and I/O are used efficiently and that no individual user takes more than her fair share.

In still other cases, users sit at workstations connected to networks of other workstations and servers. These users have dedicated resources at their disposal, but they also share resources such as networking and servers—file, compute, and print servers. Therefore, their operating system is designed to compromise between individual usability and resource utilization.

Recently, many varieties of handheld computers have come into fashion. Most of these devices are standalone units for individual users. Some are connected to networks, either directly by wire or (more often) through wireless modems and networking. Because of power, speed, and interface limitations, they perform relatively few remote operations. Their operating systems are designed mostly for individual usability, but performance per unit of battery life is important as well.

Some computers have little or no user view. For example, embedded computers in home devices and automobiles may have numeric keypads and may turn indicator lights on or off to show status, but they and their operating systems are designed primarily to run without user intervention.

1.1.2 System View

From the computer's point of view, the operating system is the program most intimately involved with the hardware. In this context, we can view an operating system as a resource allocator. A computer system has many resources that may be required to solve a problem: CPU time, memory space, file-storage space, I/O devices, and so on. The operating system acts as the manager of these resources. Facing numerous and possibly conflicting requests for resources, the operating system must decide how to allocate them to specific programs and users so that it can operate the computer system efficiently and fairly. As we have seen, resource allocation is especially important where many users access the same mainframe or minicomputer.

A slightly different view of an operating system emphasizes the need to control the various I/O devices and user programs. An operating system is a control program. A control program manages the execution of user programs to prevent errors and improper use of the computer. It is especially concerned with the operation and control of I/O devices.

1.1.3 Defining Operating Systems

We have looked at the operating system's role from the views of the user and of the system. How, though, can we define what an operating system is? In general, we have no completely adequate definition of an operating system.
Operating systems exist because they offer a reasonable way to solve the problem of creating a usable computing system. The fundamental goal of computer systems is to execute user programs and to make solving user problems easier. Computer hardware is constructed toward this goal. Since bare hardware alone is not particularly easy to use, application programs are developed. These programs require certain common operations, such as those controlling the I/O devices. The common functions of controlling and allocating resources are then brought together into one piece of software: the operating system.

In addition, we have no universally accepted definition of what is part of the operating system. A simple viewpoint is that it includes everything a vendor ships when you order "the operating system." The features included, however, vary greatly across systems. Some systems take up less than 1 megabyte of space and lack even a full-screen editor, whereas others require gigabytes of space and are entirely based on graphical windowing systems. A more common definition, and the one that we usually follow, is that the operating system is the one program running at all times on the computer—usually called the kernel. (Along with the kernel, there are two other types of programs: systems programs, which are associated with the operating system but are not part of the kernel, and application programs, which include all programs not associated with the operation of the system.)

The matter of what constitutes an operating system has become increasingly important. In 1998, the United States Department of Justice filed suit against Microsoft, in essence claiming that Microsoft included too much functionality in its operating systems and thus prevented application vendors from competing. For example, a Web browser was an integral part of the operating systems. As a result, Microsoft was found guilty of using its operating-system monopoly to limit competition.

STORAGE DEFINITIONS AND NOTATION

A bit is the basic unit of computer storage. It can contain one of two values, 0 and 1. All other storage in a computer is based on collections of bits. Given enough bits, it is amazing how many things a computer can represent: numbers, letters, images, movies, sounds, documents, and programs, to name a few. A byte is 8 bits, and on most computers it is the smallest convenient chunk of storage. For example, most computers don't have an instruction to move a bit but do have one to move a byte. A less common term is word, which is a given computer architecture's native storage unit. A word is generally made up of one or more bytes. For example, a computer may have instructions to move 64-bit (8-byte) words. A kilobyte, or KB, is 1,024 bytes; a megabyte, or MB, is 1,024² bytes; and a gigabyte, or GB, is 1,024³ bytes. Computer manufacturers often round off these numbers and say that a megabyte is 1 million bytes and a gigabyte is 1 billion bytes.
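These units are easy to check in code. The following is a minimal Java sketch of our own (not a program from the text), using bit shifts to compute the powers of 1,024:

public class StorageUnits {
    public static void main(String[] args) {
        long kb = 1L << 10;  // kilobyte: 1,024 bytes
        long mb = 1L << 20;  // megabyte: 1,024^2 = 1,048,576 bytes
        long gb = 1L << 30;  // gigabyte: 1,024^3 = 1,073,741,824 bytes
        // The manufacturers' rounded figures understate the true sizes:
        System.out.println(mb - 1000000);    // prints 48576
        System.out.println(gb - 1000000000); // prints 73741824
    }
}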
1.2 Computer-System Organization

Before we can explore the details of how computer systems operate, we need general knowledge of the structure of a computer system. In this section, we look at several parts of this structure. The section is mostly concerned with computer-system organization, so you can skim or skip it if you already understand the concepts.

THE STUDY OF OPERATING SYSTEMS

There has never been a more interesting time to study operating systems, and it has never been easier. The open-source movement has overtaken operating systems, causing many of them to be made available in both source and binary (executable) format. This list includes Linux, BSD UNIX, Solaris, and part of Mac OS X. The availability of source code allows us to study operating systems from the inside out. Questions that previously could only be answered by looking at documentation or the behavior of an operating system can now be answered by examining the code itself.

In addition, the rise of virtualization as a mainstream (and frequently free) computer function makes it possible to run many operating systems on top of one core system. For example, VMware (http://www.vmware.com) provides a free "player" on which hundreds of free "virtual appliances" can run. Using this method, students can try out hundreds of operating systems within their existing operating systems at no cost.

Operating systems that are no longer commercially viable have been open-sourced as well, enabling us to study how systems operated in a time of fewer CPU, memory, and storage resources. An extensive but not complete list of open-source operating-system projects is available from http://dmoz.org/Computers/Software/Operating_Systems/Open_Source/. Simulators of specific hardware are also available in some cases, allowing the operating system to run on "native" hardware, all within the confines of a modern computer and modern operating system. For example, a DECSYSTEM-20 simulator running on Mac OS X can boot TOPS-20, load the source tapes, and modify and compile a new TOPS-20 kernel. An interested student can search the Internet to find the original papers that describe the operating system and the original manuals.

The advent of open-source operating systems also makes it easy to move from student to operating-system developer. With some knowledge, some effort, and an Internet connection, a student can even create a new operating-system distribution! Just a few years ago, it was difficult or impossible to get access to source code. Now, access is limited only by how much time and disk space a student has.

1.2.1 Computer-System Operation

A modern general-purpose computer system consists of one or more CPUs and a number of device controllers connected through a common bus that provides access to shared memory (Figure 1.2). Each device controller is in charge of a specific type of device (for example, disk drives, audio devices, and video displays). The CPU and the device controllers can execute concurrently, competing for memory cycles. To ensure orderly access to the shared memory, a memory controller is provided whose function is to synchronize access to the memory.

[Figure 1.2 A modern computer system: a CPU, memory, a disk controller with disks, a USB controller with a mouse, keyboard, and printer, and a graphics adapter with a monitor, all attached to a common bus.]

For a computer to start running—for instance, when it is powered up or rebooted—it needs to have an initial program to run. This initial program, or bootstrap program, tends to be simple. Typically, it is stored in read-only memory (ROM) or electrically erasable programmable read-only memory (EEPROM), known by the general term firmware, within the computer hardware. It initializes all aspects of the system, from CPU registers to device controllers to memory contents. The bootstrap program must know how to load the operating system and how to start executing that system.
To accomplish this goal, the bootstrap program must locate and load into memory the operating-system kernel. The operating system then starts executing the first process, such as "init," and waits for some event to occur.

The occurrence of an event is usually signaled by an interrupt from either the hardware or the software. Hardware may trigger an interrupt at any time by sending a signal to the CPU, usually by way of the system bus. Software may trigger an interrupt by executing a special operation called a system call (also called a monitor call). When the CPU is interrupted, it stops what it is doing and immediately transfers execution to a fixed location. The fixed location usually contains the starting address where the service routine for the interrupt is located. The interrupt service routine executes; on completion, the CPU resumes the interrupted computation. A time line of this operation is shown in Figure 1.3.

[Figure 1.3 Interrupt time line for a single process doing output: the CPU alternates between executing the user process and processing I/O interrupts, while the I/O device alternates between idle and transferring as requests are issued and transfers complete.]

Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism, but several functions are common. The interrupt must transfer control to the appropriate interrupt service routine. The straightforward method for handling this transfer would be to invoke a generic routine to examine the interrupt information; the routine, in turn, would call the interrupt-specific handler. However, interrupts must be handled quickly. Since only a predefined number of interrupts is possible, a table of pointers to interrupt routines can be used instead to provide the necessary speed. The interrupt routine is called indirectly through the table, with no intermediate routine needed. Generally, the table of pointers is stored in low memory (the first hundred or so locations). These locations hold the addresses of the interrupt service routines for the various devices. This array, or interrupt vector, of addresses is then indexed by a unique device number, given with the interrupt request, to provide the address of the interrupt service routine for the interrupting device. Operating systems as different as Windows and UNIX dispatch interrupts in this manner.

The interrupt architecture must also save the address of the interrupted instruction. Many old designs simply stored the interrupt address in a fixed location or in a location indexed by the device number. More recent architectures store the return address on the system stack. If the interrupt routine needs to modify the processor state—for instance, by modifying register values—it must explicitly save the current state and then restore that state before returning. After the interrupt is serviced, the saved return address is loaded into the program counter, and the interrupted computation resumes as though the interrupt had not occurred.
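The interrupt-vector idea can be modeled in a few lines of Java. This is our own illustrative sketch, not code from the text; the class and method names (InterruptVector, register, dispatch) and the device numbers are invented. An array of handlers indexed by device number plays the role of the table of pointers in low memory, so dispatch reaches the correct service routine without an intermediate routine:

public class InterruptVector {
    // One entry per device number, analogous to the table of
    // pointers kept in low memory.
    private final Runnable[] handlers = new Runnable[256];

    // Install the interrupt service routine for a device number.
    public void register(int deviceNumber, Runnable serviceRoutine) {
        handlers[deviceNumber] = serviceRoutine;
    }

    // The interrupt request supplies the device number; the table is
    // indexed directly, with no intermediate routine examining the request.
    public void dispatch(int deviceNumber) {
        Runnable routine = handlers[deviceNumber];
        if (routine != null)
            routine.run(); // the interrupted computation would resume on return
    }

    public static void main(String[] args) {
        InterruptVector vector = new InterruptVector();
        vector.register(14, new Runnable() { // hypothetical disk controller
            public void run() { System.out.println("disk: transfer done"); }
        });
        vector.dispatch(14); // simulate a disk interrupt
    }
}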
1.2.2 Storage Structure

The CPU can load instructions only from memory, so any programs to run must be stored there. General-purpose computers run most of their programs from rewritable memory, called main memory (also called random-access memory, or RAM). Main memory commonly is implemented in a semiconductor technology called dynamic random-access memory (DRAM). Computers use other forms of memory as well. Recall that the bootstrap program is typically stored in read-only memory (ROM) or electrically erasable programmable read-only memory (EEPROM). Because ROM cannot be changed, only static programs are stored there. The unchangeable nature of ROM is of use in game cartridges, so manufacturers can distribute games that cannot be modified. EEPROM cannot be changed frequently and so contains mostly static programs. For example, smartphones use EEPROM to store their factory-installed programs.

All forms of memory provide an array of words, or storage units. Each word has its own address. Interaction is achieved through a sequence of load or store instructions to specific memory addresses. The load instruction moves a word from main memory to an internal register within the CPU, whereas the store instruction moves the content of a register to main memory. Aside from explicit loads and stores, the CPU automatically loads instructions from main memory for execution.

Most modern computer systems are based on the von Neumann architecture. In such an architecture, both programs and data are stored in main memory, which is managed by a CPU. A typical instruction–execution cycle, as executed on such a system, first fetches an instruction from memory and stores that instruction in the instruction register. The instruction is then decoded and may cause operands to be fetched from memory and stored in some internal register. After the instruction on the operands has been executed, the result may be stored back in memory. Notice that the memory unit sees only a stream of memory addresses; it does not know how they are generated (by the instruction counter, indexing, indirection, literal addresses, or some other means) or what they are for (instructions or data). Accordingly, we can ignore how a memory address is generated by a program. We are interested only in the sequence of memory addresses generated by the running program.
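The instruction–execution cycle can be made concrete with a toy machine written in Java. In keeping with the von Neumann design, one memory array below holds both the program and its data; the opcodes and the instruction layout (opcode in the high bits, address in the low byte) are invented purely for illustration:

    // A toy von Neumann machine: programs and data share one memory array.
    public class TinyMachine {
        static final int LOAD = 0, ADD = 1, STORE = 2, HALT = 3;
        int[] memory = new int[64];   // holds both instructions and data
        int pc = 0;                   // instruction counter
        int ir;                       // instruction register
        int acc;                      // accumulator (internal register)

        void run() {
            while (true) {
                ir = memory[pc++];                  // fetch
                int op = ir >> 8, addr = ir & 0xFF; // decode
                switch (op) {                       // execute
                    case LOAD:  acc = memory[addr];  break;
                    case ADD:   acc += memory[addr]; break;
                    case STORE: memory[addr] = acc;  break;
                    case HALT:  return;
                }
            }
        }

        public static void main(String[] args) {
            TinyMachine m = new TinyMachine();
            m.memory[60] = 2; m.memory[61] = 3;     // data
            m.memory[0] = (LOAD << 8) | 60;         // program
            m.memory[1] = (ADD << 8) | 61;
            m.memory[2] = (STORE << 8) | 62;
            m.memory[3] = (HALT << 8);
            m.run();
            System.out.println(m.memory[62]);       // prints 5
        }
    }

Running the program loads 2, adds 3, and stores 5—while the memory array itself sees nothing but a stream of addresses.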
Ideally, we want the programs and data to reside in main memory permanently. This arrangement usually is not possible, for two reasons:

1. Main memory is usually too small to store all needed programs and data permanently.

2. Main memory is a volatile storage device that loses its contents when power is turned off or otherwise lost.

Thus, most computer systems provide secondary storage as an extension of main memory. The main requirement for secondary storage is that it be able to hold large quantities of data permanently. The most common secondary-storage device is a magnetic disk, which provides storage for both programs and data. Most programs (system and application) are stored on a disk until they are loaded into memory. Many programs then use the disk as both the source and the destination of their processing. Hence, the proper management of disk storage is of central importance to a computer system, as we discuss in Chapter 12.

In a larger sense, however, the storage structure that we have described—consisting of registers, main memory, and magnetic disks—is only one of many possible storage systems. Others include cache memory, CD-ROM, magnetic tapes, and so on. Each storage system provides the basic functions of storing a datum and holding that datum until it is retrieved at a later time. The main differences among the various storage systems lie in speed, cost, size, and volatility. The wide variety of storage systems in a computer system can be organized in a hierarchy (Figure 1.4) according to speed and cost. The higher levels are expensive, but they are fast. As we move down the hierarchy, the cost per bit generally decreases, whereas the access time generally increases. This trade-off is reasonable; if a given storage system were both faster and less expensive than another—other properties being the same—then there would be no reason to use the slower, more expensive memory. In fact, many early storage devices, including paper tape and core memories, are relegated to museums now that magnetic tape and semiconductor memory have become faster and cheaper. The top four levels of memory in Figure 1.4 may be constructed using semiconductor memory.

[Figure 1.4 Storage-device hierarchy: registers, cache, main memory, electronic disk, magnetic disk, optical disk, magnetic tapes.]

In addition to differing in speed and cost, the various storage systems are either volatile or nonvolatile. As mentioned earlier, volatile storage loses its contents when the power to the device is removed. In the absence of expensive battery and generator backup systems, data must be written to nonvolatile storage for safekeeping. In the hierarchy shown in Figure 1.4, the storage systems above the electronic disk are volatile, whereas those below are nonvolatile.

An electronic disk can be designed to be either volatile or nonvolatile. During normal operation, the electronic disk stores data in a large DRAM array, which is volatile. But many electronic-disk devices contain a hidden magnetic hard disk and a battery for backup power. If external power is interrupted, the electronic-disk controller copies the data from RAM to the magnetic disk. When external power is restored, the controller copies the data back into RAM. Another form of electronic disk is flash memory, which is popular in cameras and personal digital assistants (PDAs), in robots, and increasingly as removable storage on general-purpose computers. Flash memory is slower than DRAM but needs no power to retain its contents. Another form of nonvolatile storage is NVRAM, which is DRAM with battery backup power. This memory can be as fast as DRAM and (as long as the battery lasts) is nonvolatile.

The design of a complete memory system must balance all the factors just discussed: it must use only as much expensive memory as necessary while providing as much inexpensive, nonvolatile memory as possible. Caches can be installed to improve performance where a large access-time or transfer-rate disparity exists between two components.

1.2.3 I/O Structure

Storage is only one of many types of I/O devices within a computer. A large portion of operating-system code is dedicated to managing I/O, both because of its importance to the reliability and performance of a system and because of the varying nature of the devices. Next, we provide an overview of I/O.

A general-purpose computer system consists of CPUs and multiple device controllers that are connected through a common bus. Each device controller is in charge of a specific type of device. Depending on the controller, more than one device may be attached. For instance, seven or more devices can be attached to the small computer-systems interface (SCSI) controller. A device controller maintains some local buffer storage and a set of special-purpose registers. The device controller is responsible for moving the data between the peripheral devices that it controls and its local buffer storage. Typically, operating systems have a device driver for each device controller.
This device driver understands the device controller and presents a uniform interface to the device to the rest of the operating system.

To start an I/O operation, the device driver loads the appropriate registers within the device controller. The device controller, in turn, examines the contents of these registers to determine what action to take (such as “read a character from the keyboard”). The controller starts the transfer of data from the device to its local buffer. Once the transfer of data is complete, the device controller informs the device driver via an interrupt that it has finished its operation. The device driver then returns control to the operating system, possibly returning the data or a pointer to the data if the operation was a read. For other operations, the device driver returns status information.

This form of interrupt-driven I/O is fine for moving small amounts of data but can produce high overhead when used for bulk data movement such as disk I/O. To solve this problem, direct memory access (DMA) is used. After setting up buffers, pointers, and counters for the I/O device, the device controller transfers an entire block of data directly to or from its own buffer storage to memory, with no intervention by the CPU. Only one interrupt is generated per block, to tell the device driver that the operation has completed, rather than the one interrupt per byte generated for low-speed devices. While the device controller is performing these operations, the CPU is available to accomplish other work. (The sketch at the end of this section contrasts the two approaches.)

Some high-end systems use a switch rather than a bus architecture. On these systems, multiple components can talk to other components concurrently, rather than competing for cycles on a shared bus. In this case, DMA is even more effective. Figure 1.5 shows the interplay of all components of a computer system.

[Figure 1.5 How a modern computer system works.]
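As promised above, the following Java sketch contrasts the two approaches by simply counting interrupts; it models nothing else about a real controller, and the 4,096-byte block size is arbitrary:

    // Contrast: one interrupt per byte vs. one DMA interrupt per block.
    public class IoOverheadDemo {
        static int interrupts = 0;

        // Interrupt-driven I/O: the CPU is interrupted for every byte moved.
        static void interruptDriven(byte[] block) {
            for (byte b : block)
                interrupts++;              // one interrupt per byte transferred
        }

        // DMA: the controller moves the whole block; one interrupt at the end.
        static void dma(byte[] block, byte[] memory) {
            System.arraycopy(block, 0, memory, 0, block.length); // controller's work
            interrupts++;                  // single completion interrupt
        }

        public static void main(String[] args) {
            byte[] block = new byte[4096], memory = new byte[4096];
            interruptDriven(block);
            System.out.println("interrupt-driven: " + interrupts); // 4096
            interrupts = 0;
            dma(block, memory);
            System.out.println("DMA: " + interrupts);              // 1
        }
    }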
1.3 Computer-System Architecture

In Section 1.2, we introduced the general structure of a typical computer system. A computer system may be organized in a number of different ways, which we can categorize roughly according to the number of general-purpose processors used.

1.3.1 Single-Processor Systems

Most systems use a single processor. The variety of single-processor systems may be surprising, however, since these systems range from PDAs through mainframes. On a single-processor system, there is one main CPU capable of executing a general-purpose instruction set, including instructions from user processes. Almost all systems have other special-purpose processors as well. They may come in the form of device-specific processors, such as disk, keyboard, and graphics controllers; or, on mainframes, they may come in the form of more general-purpose processors, such as I/O processors that move data rapidly among the components of the system.

All of these special-purpose processors run a limited instruction set and do not run user processes. Sometimes they are managed by the operating system, in that the operating system sends them information about their next task and monitors their status. For example, a disk-controller microprocessor receives a sequence of requests from the main CPU and implements its own disk queue and scheduling algorithm. This arrangement relieves the main CPU of the overhead of disk scheduling. PCs contain a microprocessor in the keyboard to convert the keystrokes into codes to be sent to the CPU. In other systems or circumstances, special-purpose processors are low-level components built into the hardware. The operating system cannot communicate with these processors; they do their jobs autonomously. The use of special-purpose microprocessors is common and does not turn a single-processor system into a multiprocessor. If there is only one general-purpose CPU, then the system is a single-processor system.

1.3.2 Multiprocessor Systems

Although single-processor systems are most common, multiprocessor systems (also known as parallel systems or tightly coupled systems) are growing in importance. Such systems have two or more processors in close communication, sharing the computer bus and sometimes the clock, memory, and peripheral devices.

Multiprocessor systems have three main advantages:

1. Increased throughput. By increasing the number of processors, we expect to get more work done in less time. The speed-up ratio with N processors is not N, however; rather, it is less than N. When multiple processors cooperate on a task, a certain amount of overhead is incurred in keeping all the parts working correctly. This overhead, plus contention for shared resources, lowers the expected gain from additional processors. Similarly, N programmers working closely together do not produce N times the amount of work a single programmer would produce.

2. Economy of scale. Multiprocessor systems can cost less than equivalent multiple single-processor systems, because they can share peripherals, mass storage, and power supplies. If several programs operate on the same set of data, it is cheaper to store those data on one disk and to have all the processors share them than to have many computers with local disks and many copies of the data.

3. Increased reliability. If functions can be distributed properly among several processors, then the failure of one processor will not halt the system, only slow it down. If we have ten processors and one fails, then each of the remaining nine processors can pick up a share of the work of the failed processor. Thus, the entire system runs only 10 percent slower, rather than failing altogether.

Increased reliability of a computer system is crucial in many applications. The ability to continue providing service proportional to the level of surviving hardware is called graceful degradation. Some systems go beyond graceful degradation and are called fault tolerant, because they can suffer a failure of any single component and still continue operation. Note that fault tolerance requires a mechanism to allow the failure to be detected, diagnosed, and, if possible, corrected. The HP NonStop (formerly Tandem) system uses both hardware and software duplication to ensure continued operation despite faults. The system consists of multiple pairs of CPUs, working in lockstep. Both processors in the pair execute each instruction and compare the results. If the results differ, then one CPU of the pair is at fault, and both are halted. The process that was being executed is then moved to another pair of CPUs, and the instruction that failed is restarted. This solution is expensive, since it involves special hardware and considerable hardware duplication.

The multiple-processor systems in use today are of two types. Some systems use asymmetric multiprocessing, in which each processor is assigned a specific task.
A master processor controls the system; the other processors either look to the master for instruction or have predefined tasks. This scheme defines a master–slave relationship. The master processor schedules and allocates work to the slave processors.

The most common systems use symmetric multiprocessing (SMP), in which each processor performs all tasks within the operating system. SMP means that all processors are peers; no master–slave relationship exists between processors. Figure 1.6 illustrates a typical SMP architecture. Notice that each processor has its own set of registers, as well as a private—or local—cache; however, all processors share physical memory.

[Figure 1.6 Symmetric multiprocessing architecture.]

An example of an SMP system is Solaris, a commercial version of UNIX designed by Sun Microsystems. A Solaris system can be configured to employ dozens of processors, all running Solaris. The benefit of this model is that many processes can run simultaneously—N processes can run if there are N CPUs—without causing a significant deterioration of performance. However, we must carefully control I/O to ensure that the data reach the appropriate processor. Also, since the CPUs are separate, one may be sitting idle while another is overloaded, resulting in inefficiencies. These inefficiencies can be avoided if the processors share certain data structures. A multiprocessor system of this form will allow processes and resources—such as memory—to be shared dynamically among the various processors and can lower the variance among the processors. Such a system must be written carefully, as we shall see in Chapter 6. Virtually all modern operating systems—including Windows, Mac OS X, and Linux—now provide support for SMP.

The difference between symmetric and asymmetric multiprocessing may result from either hardware or software. Special hardware can differentiate the multiple processors, or the software can be written to allow only one master and multiple slaves. For instance, Sun's operating system SunOS Version 4 provided asymmetric multiprocessing, whereas Version 5 (Solaris) is symmetric on the same hardware.

Multiprocessing adds CPUs to increase computing power. If the CPU has an integrated memory controller, then adding CPUs can also increase the amount of memory addressable in the system. Either way, multiprocessing can cause a system to change its memory access model from uniform memory access (UMA) to non-uniform memory access (NUMA). UMA is defined as the situation in which access to any RAM from any CPU takes the same amount of time. With NUMA, some parts of memory may take longer to access than other parts, creating a performance penalty. Operating systems can minimize the NUMA penalty through resource management, as discussed in Section 9.5.4.

A recent trend in CPU design is to include multiple computing cores on a single chip. In essence, these are multiprocessor chips. They can be more efficient than multiple chips with single cores because on-chip communication is faster than between-chip communication. In addition, one chip with multiple cores uses significantly less power than multiple single-core chips. As a result, multicore systems are especially well suited for server systems such as database and Web servers.
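From a programmer's point of view, an SMP or multicore machine simply presents N processors. The Java fragment below—a minimal sketch, not a tuned server—asks the runtime how many processors are available and starts one thread per processor; whether each thread actually lands on its own CPU is up to the operating-system scheduler:

    // On an SMP or multicore machine, a Java program sees N processors
    // and can create one worker thread per processor.
    public class UseAllCpus {
        public static void main(String[] args) throws InterruptedException {
            int n = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[n];
            for (int i = 0; i < n; i++) {
                final int id = i;
                workers[i] = new Thread(() ->
                    System.out.println("worker " + id + " running"));
                workers[i].start();       // the OS may schedule one per CPU
            }
            for (Thread t : workers)
                t.join();                 // wait for all workers to finish
        }
    }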
In Figure 1.7, we show a dual-core design with two cores on the same chip. In this design, each core has its own register set as well as its own local cache; other designs might use a shared cache or a combination of local and shared caches. Aside from architectural considerations, such as cache, memory, and bus contention, these multicore CPUs appear to the operating system as N standard processors. This characteristic puts pressure on operating-system designers—and application programmers—to make use of those CPUs.

[Figure 1.7 A dual-core design with two cores placed on the same chip.]

Finally, blade servers are a recent development in which multiple processor boards, I/O boards, and networking boards are placed in the same chassis. The difference between these and traditional multiprocessor systems is that each blade-processor board boots independently and runs its own operating system. Some blade-server boards are multiprocessor as well, which blurs the lines between types of computers. In essence, these servers consist of multiple independent multiprocessor systems.

1.3.3 Clustered Systems

Another type of multiple-CPU system is the clustered system. Like multiprocessor systems, clustered systems gather together multiple CPUs to accomplish computational work. Clustered systems differ from multiprocessor systems, however, in that they are composed of two or more individual systems—or nodes—joined together. The definition of the term clustered is not concrete; many commercial packages wrestle with what a clustered system is and why one form is better than another. The generally accepted definition is that clustered computers share storage and are closely linked via a local-area network (LAN) (as described in Section 1.10) or a faster interconnect, such as InfiniBand.

Clustering is usually used to provide high-availability service; that is, service will continue even if one or more systems in the cluster fail. High availability is generally obtained by adding a level of redundancy in the system. A layer of cluster software runs on the cluster nodes. Each node can monitor one or more of the others (over the LAN). If the monitored machine fails, the monitoring machine can take ownership of its storage and restart the applications that were running on the failed machine. The users and clients of the applications see only a brief interruption of service.

BEOWULF CLUSTERS

Beowulf clusters are designed for solving high-performance computing tasks. These clusters are built with commodity hardware—such as low-cost personal computers—that are connected via a simple local-area network. Interestingly, a Beowulf cluster uses no one specific software package but rather consists of a set of open-source software libraries that allow the computing nodes in the cluster to communicate with one another. Thus, there are a variety of approaches for constructing a Beowulf cluster, although Beowulf computing nodes typically run the Linux operating system. Since Beowulf clusters require no special hardware and operate using open-source software that is freely available, they offer a low-cost strategy for building a high-performance computing cluster. In fact, some Beowulf clusters built from collections of discarded personal computers are using hundreds of computing nodes to solve computationally expensive problems in scientific computing.

Clustering can be structured asymmetrically or symmetrically. In asymmetric clustering, one machine is in hot-standby mode while the other is running the applications.
The hot-standby host machine does nothing but monitor the active server. If that server fails, the hot-standby host becomes the active server. In symmetric mode, two or more hosts are running applications and are monitoring each other. This mode is obviously more efficient, as it uses all of the available hardware. It does require that more than one application be available to run.

As a cluster consists of several computer systems connected via a network, clusters may also be used to provide high-performance computing environments. Such systems can supply significantly greater computational power than single-processor or even SMP systems because they are capable of running an application concurrently on all computers in the cluster. However, applications must be written specifically to take advantage of the cluster by using a technique known as parallelization, which consists of dividing a program into separate components that run in parallel on individual computers in the cluster. Typically, these applications are designed so that once each computing node in the cluster has solved its portion of the problem, the results from all the nodes are combined into a final solution. (A small-scale sketch of this divide-and-combine pattern appears at the end of this section.)

Other forms of clusters include parallel clusters and clustering over a wide-area network (WAN) (as described in Section 1.10). Parallel clusters allow multiple hosts to access the same data on the shared storage. Because most operating systems lack support for simultaneous data access by multiple hosts, parallel clusters are usually accomplished by use of special versions of software and special releases of applications. For example, Oracle Real Application Cluster is a version of Oracle's database that has been designed to run on a parallel cluster. Each machine runs Oracle, and a layer of software tracks access to the shared disk. Each machine has full access to all data in the database. To provide this shared access to data, the system must also supply access control and locking to ensure that no conflicting operations occur. This function, commonly known as a distributed lock manager (DLM), is included in some cluster technology.

Cluster technology is changing rapidly. Some cluster products support dozens of systems in a cluster, as well as clustered nodes that are separated by miles. Many of these improvements are made possible by storage-area networks (SANs), as described in Section 12.3.3, which allow many systems to attach to a pool of storage. If the applications and their data are stored on the SAN, then the cluster software can assign the application to run on any host that is attached to the SAN. If the host fails, then any other host can take over. In a database cluster, dozens of hosts can share the same database, greatly increasing performance and reliability. Figure 1.8 depicts the general structure of a clustered system.

[Figure 1.8 General structure of a clustered system.]
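The divide-and-combine pattern of parallelization can be sketched on a single machine, with Java threads standing in for cluster nodes (in a real cluster, the pieces would run on separate computers and exchange results over the network):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Parallelization in miniature: split a sum into pieces, solve the
    // pieces concurrently, then combine the partial results.
    public class ParallelSum {
        public static void main(String[] args) throws Exception {
            long[] data = new long[1_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i;

            int pieces = 4;                        // one piece per "node"
            ExecutorService pool = Executors.newFixedThreadPool(pieces);
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = data.length / pieces;

            for (int p = 0; p < pieces; p++) {     // divide
                final int lo = p * chunk;
                final int hi = (p == pieces - 1) ? data.length : lo + chunk;
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += data[i];
                    return s;                      // partial result
                }));
            }

            long total = 0;
            for (Future<Long> f : parts) total += f.get();  // combine
            pool.shutdown();
            System.out.println(total);             // 499999500000
        }
    }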
1.4 Operating-System Structure

Now that we have discussed basic information about computer-system organization and architecture, we are ready to talk about operating systems. An operating system provides the environment within which programs are executed. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. There are, however, many commonalities, which we consider in this section.

One of the most important aspects of operating systems is the ability to multiprogram. A single program cannot, in general, keep either the CPU or the I/O devices busy at all times. Single users frequently have multiple programs running, however. Multiprogramming increases CPU utilization by organizing jobs (code and data) so that the CPU always has one to execute.

The idea is as follows: The operating system keeps several jobs in memory simultaneously (Figure 1.9). Since, in general, main memory is too small to accommodate all jobs, the jobs are kept initially on the disk in the job pool. This pool consists of all processes residing on disk awaiting allocation of main memory.

[Figure 1.9 Memory layout for a multiprogramming system.]

The set of jobs in memory can be a subset of the jobs kept in the job pool. The operating system picks and begins to execute one of the jobs in memory. Eventually, the job may have to wait for some task, such as an I/O operation, to complete. In a non-multiprogrammed system, the CPU would sit idle. In a multiprogrammed system, the operating system simply switches to, and executes, another job. When that job needs to wait, the CPU is switched to another job, and so on. Eventually, the first job finishes waiting and gets the CPU back. As long as at least one job needs to execute, the CPU is never idle.

This idea is common in other life situations. A lawyer does not work for only one client at a time, for example. While one case is waiting to go to trial or have papers typed, the lawyer can work on another case. If he has enough clients, the lawyer will never be idle for lack of work. (Idle lawyers tend to become politicians, so there is a certain social value in keeping lawyers busy.)

Multiprogrammed systems provide an environment in which the various system resources (for example, CPU, memory, and peripheral devices) are utilized effectively, but they do not provide for user interaction with the computer system. Time sharing (or multitasking) is a logical extension of multiprogramming. In time-sharing systems, the CPU executes multiple jobs by switching among them, but the switches occur so frequently that the users can interact with each program while it is running.

Time sharing requires an interactive (or hands-on) computer system, which provides direct communication between the user and the system. The user gives instructions to the operating system or to a program directly, using an input device such as a keyboard or a mouse, and waits for immediate results on an output device. Accordingly, the response time should be short—typically less than one second.

A time-shared operating system allows many users to share the computer simultaneously. Since each action or command in a time-shared system tends to be short, only a little CPU time is needed for each user. As the system switches rapidly from one user to the next, each user is given the impression that the entire computer system is dedicated to his use, even though it is being shared among many users.

A time-shared operating system uses CPU scheduling and multiprogramming to provide each user with a small portion of a time-shared computer. Each user has at least one separate program in memory. A program loaded into memory and executing is called a process. When a process executes, it typically executes for only a short time before it either finishes or needs to perform I/O.
I/O may be interactive; that is, output goes to a display for the user, and input comes from a user keyboard, mouse, or other device. Since interactive I/O typically runs at “people speeds,” it may take a long time to complete. Input, for example, may be bounded by the user's typing speed; seven characters per second is fast for people but incredibly slow for computers. Rather than let the CPU sit idle as this interactive input takes place, the operating system will rapidly switch the CPU to the program of some other user.

Time sharing and multiprogramming require that several jobs be kept simultaneously in memory. If several jobs are ready to be brought into memory, and if there is not enough room for all of them, then the system must choose among them. Making this decision is job scheduling, which is discussed in Chapter 5. When the operating system selects a job from the job pool, it loads that job into memory for execution. Having several programs in memory at the same time requires some form of memory management, which is covered in Chapters 8 and 9. In addition, if several jobs are ready to run at the same time, the system must choose among them. Making this decision is CPU scheduling, which is also discussed in Chapter 5. Finally, running multiple jobs concurrently requires that their ability to affect one another be limited in all phases of the operating system, including process scheduling, disk storage, and memory management. These considerations are discussed throughout the text.

In a time-sharing system, the operating system must ensure reasonable response time. This is sometimes accomplished through swapping, in which processes are swapped in and out of main memory to the disk. A more common method for achieving this goal is virtual memory, a technique that allows the execution of a process that is not completely in memory (Chapter 9). The main advantage of the virtual-memory scheme is that it enables users to run programs that are larger than actual physical memory. Further, it abstracts main memory into a large, uniform array of storage, separating logical memory as viewed by the user from physical memory. This arrangement frees programmers from concern over memory-storage limitations.

Time-sharing systems must also provide a file system (Chapters 10 and 11). The file system resides on a collection of disks; hence, disk management must be provided (Chapter 12). Also, time-sharing systems provide a mechanism for protecting resources from inappropriate use (Chapter 14). To ensure orderly execution, the system must provide mechanisms for job synchronization and communication (Chapter 6), and it may ensure that jobs do not get stuck in a deadlock, forever waiting for one another (Chapter 7).

1.5 Operating-System Operations

As mentioned earlier, modern operating systems are interrupt driven. If there are no processes to execute, no I/O devices to service, and no users to whom to respond, an operating system will sit quietly, waiting for something to happen. Events are almost always signaled by the occurrence of an interrupt or a trap. A trap (or an exception) is a software-generated interrupt caused either by an error (for example, division by zero or invalid memory access) or by a specific request from a user program that an operating-system service be performed. The interrupt-driven nature of an operating system defines that system's general structure.
For each type of interrupt, separate segments of code in the operating system determine what action should be taken. An interrupt service routine is provided that is responsible for dealing with the interrupt.

Since the operating system and the users share the hardware and software resources of the computer system, we need to make sure that an error in a user program can cause problems only for the one program running. With sharing, many processes could be adversely affected by a bug in one program. For example, if a process gets stuck in an infinite loop, this loop could prevent the correct operation of many other processes. More subtle errors can occur in a multiprogramming system, where one erroneous program might modify another program, the data of another program, or even the operating system itself. Without protection against these sorts of errors, either the computer must execute only one process at a time or all output must be suspect. A properly designed operating system must ensure that an incorrect (or malicious) program cannot cause other programs to execute incorrectly.

1.5.1 Dual-Mode Operation

In order to ensure the proper execution of the operating system, we must be able to distinguish between the execution of operating-system code and user-defined code. The approach taken by most computer systems is to provide hardware support that allows us to differentiate among various modes of execution.

At the very least, we need two separate modes of operation: user mode and kernel mode (also called supervisor mode, system mode, or privileged mode). A bit, called the mode bit, is added to the hardware of the computer to indicate the current mode: kernel (0) or user (1). With the mode bit, we are able to distinguish between a task that is executed on behalf of the operating system and one that is executed on behalf of the user. When the computer system is executing on behalf of a user application, the system is in user mode. However, when a user application requests a service from the operating system (via a system call), it must transition from user to kernel mode to fulfill the request. This is shown in Figure 1.10. As we shall see, this architectural enhancement is useful for many other aspects of system operation as well.

[Figure 1.10 Transition from user to kernel mode.]

At system boot time, the hardware starts in kernel mode. The operating system is then loaded and starts user applications in user mode. Whenever a trap or interrupt occurs, the hardware switches from user mode to kernel mode (that is, changes the state of the mode bit to 0). Thus, whenever the operating system gains control of the computer, it is in kernel mode. The system always switches to user mode (by setting the mode bit to 1) before passing control to a user program.

The dual mode of operation provides us with the means for protecting the operating system from errant users—and errant users from one another. We accomplish this protection by designating some of the machine instructions that may cause harm as privileged instructions. The hardware allows privileged instructions to be executed only in kernel mode.
If an attempt is made to execute a privileged instruction in user mode, the hardware does not execute the instruction but rather treats it as illegal and traps it to the operating system. The instruction to switch to kernel mode is an example of a privileged instruction. Some other examples include I/O control, timer management, and interrupt management. As we shall see throughout the text, there are many additional privileged instructions.

We can now see the life cycle of instruction execution in a computer system. Initial control resides in the operating system, where instructions are executed in kernel mode. When control is given to a user application, the mode is set to user mode. Eventually, control is switched back to the operating system via an interrupt, a trap, or a system call.

System calls provide the means for a user program to ask the operating system to perform tasks reserved for the operating system on the user program's behalf. A system call is invoked in a variety of ways, depending on the functionality provided by the underlying processor. In all forms, it is the method used by a process to request action by the operating system. A system call usually takes the form of a trap to a specific location in the interrupt vector. This trap can be executed by a generic trap instruction, although some systems (such as the MIPS R2000 family) have a specific syscall instruction.

When a system call is executed, it is treated by the hardware as a software interrupt. Control passes through the interrupt vector to a service routine in the operating system, and the mode bit is set to kernel mode. The system-call service routine is a part of the operating system. The kernel examines the interrupting instruction to determine what system call has occurred; a parameter indicates what type of service the user program is requesting. Additional information needed for the request may be passed in registers, on the stack, or in memory (with pointers to the memory locations passed in registers). The kernel verifies that the parameters are correct and legal, executes the request, and returns control to the instruction following the system call. We describe system calls more fully in Section 2.3.
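The interplay of the mode bit, privileged instructions, and the system-call trap can be mimicked in a few lines of Java. This is a simulation only—real hardware enforces the mode bit itself—and the exception below plays the role of the trap to the operating system:

    // A simulation of dual-mode operation: a mode bit, privileged
    // instructions, and a trap into the kernel for system calls.
    public class DualModeSim {
        enum Mode { USER, KERNEL }
        static Mode mode = Mode.USER;      // stands in for the mode bit

        static void privileged(String what) {
            if (mode != Mode.KERNEL)
                throw new IllegalStateException("trap: " + what + " in user mode");
            System.out.println("kernel executes: " + what);
        }

        // A system call traps to the kernel, runs the service, and returns.
        static void systemCall(String request) {
            Mode saved = mode;
            mode = Mode.KERNEL;            // trap: mode bit set to kernel (0)
            privileged(request);           // service routine runs privileged
            mode = saved;                  // return to user mode (1)
        }

        public static void main(String[] args) {
            systemCall("read from disk");  // legal: requested via system call
            privileged("read from disk");  // illegal: "traps" in user mode
        }
    }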
The lack of a hardware-supported dual mode can cause serious shortcomings in an operating system. For instance, MS-DOS was written for the Intel 8088 architecture, which has no mode bit and therefore no dual mode. A user program running awry can wipe out the operating system by writing over it with data, and multiple programs are able to write to a device at the same time, with potentially disastrous results. Recent versions of the Intel CPU do provide dual-mode operation. Accordingly, most contemporary operating systems—such as Microsoft Windows Vista and Windows XP, as well as UNIX, Linux, and Solaris—take advantage of this dual-mode feature and provide greater protection for the operating system.

Once hardware protection is in place, it detects errors that violate modes. These errors are normally handled by the operating system. If a user program fails in some way—such as by making an attempt either to execute an illegal instruction or to access memory that is not in the user's address space—then the hardware traps to the operating system. The trap transfers control through the interrupt vector to the operating system, just as an interrupt does. When a program error occurs, the operating system must terminate the program abnormally. This situation is handled by the same code as a user-requested abnormal termination. An appropriate error message is given, and the memory of the program may be dumped. The memory dump is usually written to a file so that the user or programmer can examine it and perhaps correct and restart the program.

1.5.2 Timer

We must ensure that the operating system maintains control over the CPU. We cannot allow a user program to get stuck in an infinite loop or to fail to call system services and never return control to the operating system. To accomplish this goal, we can use a timer. A timer can be set to interrupt the computer after a specified period. The period may be fixed (for example, 1/60 second) or variable (for example, from 1 millisecond to 1 second). A variable timer is generally implemented by a fixed-rate clock and a counter. The operating system sets the counter. Every time the clock ticks, the counter is decremented. When the counter reaches 0, an interrupt occurs. For instance, a 10-bit counter with a 1-millisecond clock allows interrupts at intervals from 1 millisecond to 1,024 milliseconds, in steps of 1 millisecond.

Before turning over control to the user, the operating system ensures that the timer is set to interrupt. If the timer interrupts, control transfers automatically to the operating system, which may treat the interrupt as a fatal error or may give the program more time. Clearly, instructions that modify the content of the timer are privileged.

Thus, we can use the timer to prevent a user program from running too long. A simple technique is to initialize a counter with the amount of time that a program is allowed to run. A program with a 7-minute time limit, for example, would have its counter initialized to 420. Every second, the timer interrupts, and the counter is decremented by 1. As long as the counter is positive, control is returned to the user program. When the counter becomes negative, the operating system terminates the program for exceeding the assigned time limit.
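The counter-based timer described above is easy to model. The following Java sketch decrements a counter on each simulated clock tick and reports an “interrupt” when the counter reaches zero; the 420-tick value mirrors the 7-minute example in the text:

    // A variable timer built from a fixed-rate clock and a counter: the
    // counter is decremented on every tick; an "interrupt" fires at zero.
    public class TimerSim {
        private int counter;

        TimerSim(int ticks) { counter = ticks; }    // the OS sets the counter

        // Called once per clock tick (e.g., every millisecond or second).
        boolean tick() {
            return --counter == 0;                  // true means: raise interrupt
        }

        public static void main(String[] args) {
            TimerSim timer = new TimerSim(420);     // 7-minute limit at 1 tick/s
            int ticks = 0;
            while (!timer.tick())
                ticks++;
            System.out.println("interrupt after " + (ticks + 1) + " ticks");
        }
    }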
1.6 Process Management

A program does nothing unless its instructions are executed by a CPU. A program in execution, as mentioned, is a process. A time-shared user program such as a compiler is a process. A word-processing program being run by an individual user on a PC is a process. A system task, such as sending output to a printer, can also be a process (or at least part of one). For now, you can consider a process to be a job or a time-shared program, but later you will learn that the concept is more general. At this point, it is important to remember that a program by itself is not a process; a program is a passive entity, like the contents of a file stored on disk, whereas a process is an active entity.

A process needs certain resources—including CPU time, memory, files, and I/O devices—to accomplish its task. These resources are either given to the process when it is created or allocated to it while it is running. In addition to the various physical and logical resources that a process obtains when it is created, various initialization data (input) may be passed along. For example, consider a process whose function is to display the status of a file on the screen of a terminal. The process will be given the name of the file as an input and will execute the appropriate instructions and system calls to obtain and display the desired information on the terminal. When the process terminates, the operating system will reclaim any reusable resources.

A single-threaded process has one program counter specifying the next instruction to execute. (Threads are covered in Chapter 4.) The execution of such a process must be sequential. The CPU executes one instruction of the process after another, until the process completes. Further, at any time, at most one instruction is executed on behalf of the process. Thus, although two processes may be associated with the same program, they are nevertheless considered two separate execution sequences. A multithreaded process has multiple program counters, each pointing to the next instruction to execute for a given thread.

A process is the unit of work in a system. Such a system consists of a collection of processes, some of which are operating-system processes (those that execute system code) and the rest of which are user processes (those that execute user code). All these processes can potentially execute concurrently—by multiplexing on a single CPU, for example.

The operating system is responsible for the following activities in connection with process management:

• Scheduling processes and threads on the CPUs
• Creating and deleting both user and system processes
• Suspending and resuming processes
• Providing mechanisms for process synchronization
• Providing mechanisms for process communication

We discuss process-management techniques in Chapters 3 through 6.
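In Java, a program can ask the operating system to create a new process with the ProcessBuilder class. The sketch below launches the UNIX ls command to display the status of a file, passing the file name as initialization data, and then waits for the child to terminate; the file name notes.txt is hypothetical, and on Windows a different command would be needed:

    import java.io.IOException;

    // Creating a process from Java: the program name and its input (here
    // a file name) are passed to the new process when it is created.
    public class CreateProcess {
        public static void main(String[] args)
                throws IOException, InterruptedException {
            // "ls -l" is a UNIX example; on Windows one might use "cmd /c dir".
            ProcessBuilder pb = new ProcessBuilder("ls", "-l", "notes.txt");
            pb.inheritIO();                 // share the parent's terminal
            Process p = pb.start();         // the OS creates the new process
            int status = p.waitFor();       // parent waits for termination
            System.out.println("child exited with status " + status);
            // On exit, the operating system reclaims the child's resources.
        }
    }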
1.7 Memory Management

As we discussed in Section 1.2.2, main memory is central to the operation of a modern computer system. Main memory is a large array of words or bytes, ranging in size from hundreds of thousands to billions. Each word or byte has its own address. Main memory is a repository of quickly accessible data shared by the CPU and I/O devices. The central processor reads instructions from main memory during the instruction-fetch cycle and both reads and writes data from main memory during the data-fetch cycle (on a von Neumann architecture). As noted earlier, main memory is generally the only large storage device that the CPU is able to address and access directly. For example, for the CPU to process data from disk, those data must first be transferred to main memory by CPU-generated I/O calls. In the same way, instructions must be in memory for the CPU to execute them.

For a program to be executed, it must be mapped to absolute addresses and loaded into memory. As the program executes, it accesses program instructions and data from memory by generating these absolute addresses. Eventually, the program terminates, its memory space is declared available, and the next program can be loaded and executed.

To improve both the utilization of the CPU and the speed of the computer's response to its users, general-purpose computers must keep several programs in memory, creating a need for memory management. Many different memory-management schemes are used. These schemes reflect various approaches, and the effectiveness of any given algorithm depends on the situation. In selecting a memory-management scheme for a specific system, we must take into account many factors—especially the hardware design of the system. Each algorithm requires its own hardware support.

The operating system is responsible for the following activities in connection with memory management:

• Keeping track of which parts of memory are currently being used and by whom
• Deciding which processes (or parts thereof) and data to move into and out of memory
• Allocating and deallocating memory space as needed

Memory-management techniques are discussed in Chapters 8 and 9.

1.8 Storage Management

To make the computer system convenient for users, the operating system provides a uniform, logical view of information storage. The operating system abstracts from the physical properties of its storage devices to define a logical storage unit, the file. The operating system maps files onto physical media and accesses these files via the storage devices.

1.8.1 File-System Management

File management is one of the most visible components of an operating system. Computers can store information on several different types of physical media. Magnetic disk, optical disk, and magnetic tape are the most common. Each of these media has its own characteristics and physical organization. Each medium is controlled by a device, such as a disk drive or tape drive, that also has its own unique characteristics. These properties include access speed, capacity, data-transfer rate, and access method (sequential or random).

A file is a collection of related information defined by its creator. Commonly, files represent programs (both source and object forms) and data. Data files may be numeric, alphabetic, alphanumeric, or binary. Files may be free-form (for example, text files), or they may be formatted rigidly (for example, fixed fields). Clearly, the concept of a file is an extremely general one.

The operating system implements the abstract concept of a file by managing mass-storage media, such as tapes and disks, and the devices that control them. Also, files are normally organized into directories to make them easier to use. Finally, when multiple users have access to files, it may be desirable to control by whom and in what ways (for example, read, write, append) files may be accessed.

The operating system is responsible for the following activities in connection with file management:

• Creating and deleting files
• Creating and deleting directories to organize files
• Supporting primitives for manipulating files and directories
• Mapping files onto secondary storage
• Backing up files on stable (nonvolatile) storage media

File-management techniques are discussed in Chapters 10 and 11.

1.8.2 Mass-Storage Management

As we have already seen, because main memory is too small to accommodate all data and programs, and because the data that it holds are lost when power is lost, the computer system must provide secondary storage to back up main memory. Most modern computer systems use disks as the principal on-line storage medium for both programs and data. Most programs—including compilers, assemblers, word processors, editors, and formatters—are stored on a disk until loaded into memory and then use the disk as both the source and destination of their processing. Hence, the proper management of disk storage is of central importance to a computer system.

The operating system is responsible for the following activities in connection with disk management:

• Free-space management
• Storage allocation
• Disk scheduling

Because secondary storage is used frequently, it must be used efficiently.
The entire speed of operation of a computer may hinge on the speeds of the disk subsystem and the algorithms that manipulate that subsystem.

There are, however, many uses for storage that is slower and lower in cost (and sometimes of higher capacity) than secondary storage. Backups of disk data, seldom-used data, and long-term archival storage are some examples. Magnetic tape drives and their tapes and CD and DVD drives and platters are typical tertiary storage devices. The media (tapes and optical platters) vary between WORM (write-once, read-many-times) and RW (read–write) formats.

Tertiary storage is not crucial to system performance, but it still must be managed. Some operating systems take on this task, while others leave tertiary-storage management to application programs. Some of the functions that operating systems can provide include mounting and unmounting media in devices, allocating and freeing the devices for exclusive use by processes, and migrating data from secondary to tertiary storage. Techniques for secondary and tertiary storage management are discussed in Chapter 12.

1.8.3 Caching

Caching is an important principle of computer systems. Information is normally kept in some storage system (such as main memory). As it is used, it is copied into a faster storage system—the cache—on a temporary basis. When we need a particular piece of information, we first check whether it is in the cache. If it is, we use the information directly from the cache; if it is not, we use the information from the source, putting a copy in the cache under the assumption that we will need it again soon.

In addition, internal programmable registers, such as index registers, provide a high-speed cache for main memory. The programmer (or compiler) implements the register-allocation and register-replacement algorithms to decide which information to keep in registers and which to keep in main memory. There are also caches that are implemented totally in hardware. For instance, most systems have an instruction cache to hold the instructions expected to be executed next. Without this cache, the CPU would have to wait several cycles while an instruction was fetched from main memory. For similar reasons, most systems have one or more high-speed data caches in the memory hierarchy. We are not concerned with these hardware-only caches in this text, since they are outside the control of the operating system.

Because caches have limited size, cache management is an important design problem. Careful selection of the cache size and of a replacement policy can result in greatly increased performance. Figure 1.11 compares storage performance in large workstations and small servers. Various replacement algorithms for software-controlled caches are discussed in Chapter 9.
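The check-the-cache-first policy described at the start of this subsection can be written down in a few lines. The Java sketch below caches values fetched from a slower “source” (a stand-in for any lower level of the hierarchy); unlike a real cache, it is unbounded, so it sidesteps the replacement-policy problem discussed above:

    import java.util.HashMap;
    import java.util.Map;

    // The caching principle: look in the fast copy first; on a miss,
    // fetch from the slower source and keep a copy for next time.
    public class SimpleCache {
        private final Map<String, String> cache = new HashMap<>();

        // Stand-in for the slower storage level (disk, main memory, ...).
        private String readFromSource(String key) {
            System.out.println("miss: fetching " + key + " from source");
            return "data-for-" + key;
        }

        public String read(String key) {
            String value = cache.get(key);
            if (value == null) {           // not in the cache
                value = readFromSource(key);
                cache.put(key, value);     // assume we will need it again soon
            }
            return value;
        }

        public static void main(String[] args) {
            SimpleCache c = new SimpleCache();
            c.read("blockA");              // miss: goes to the source
            c.read("blockA");              // hit: served from the cache
        }
    }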
Main memory can be viewed as a fast cache for secondary storage, since data in secondary storage must be copied into main memory for use, and data must be in main memory before being moved to secondary storage for safekeeping. The file-system data, which reside permanently on secondary storage, may appear on several levels in the storage hierarchy. At the highest level, the operating system may maintain a cache of file-system data in main memory. In addition, electronic RAM disks (also known as solid-state disks) may be used for high-speed storage that is accessed through the file-system interface. The bulk of secondary storage is on magnetic disks. The magnetic-disk storage, in turn, is often backed up onto magnetic tapes or removable disks to protect against data loss in case of a hard-disk failure. Some systems automatically archive old file data from secondary storage to tertiary storage, such as tape jukeboxes, to lower the storage cost (see Chapter 12).

Figure 1.11 Performance of various levels of storage.

Level  Name          Typical size  Implementation technology                 Access time (ns)  Bandwidth (MB/sec)  Managed by        Backed by
1      registers     < 1 KB        custom memory with multiple ports, CMOS   0.25 – 0.5        20,000 – 100,000    compiler          cache
2      cache         < 16 MB       on-chip or off-chip CMOS SRAM             0.5 – 25          5,000 – 10,000      hardware          main memory
3      main memory   < 64 GB       CMOS DRAM                                 80 – 250          1,000 – 5,000       operating system  disk
4      disk storage  > 100 GB      magnetic disk                             5,000,000         20 – 150            operating system  CD or tape

The movement of information between levels of a storage hierarchy may be either explicit or implicit, depending on the hardware design and the controlling operating-system software. For instance, data transfer from cache to CPU and registers is usually a hardware function, with no operating-system intervention. In contrast, transfer of data from disk to memory is usually controlled by the operating system.

In a hierarchical storage structure, the same data may appear in different levels of the storage system. For example, suppose that an integer A that is to be incremented by 1 is located in file B, and file B resides on a magnetic disk. The increment operation proceeds by first issuing an I/O operation to copy the disk block on which A resides to main memory. This operation is followed by copying A to the cache and to an internal register. Thus, the copy of A appears in several places: on the magnetic disk, in main memory, in the cache, and in an internal register (see Figure 1.12). Once the increment takes place in the internal register, the value of A differs in the various storage systems. The value of A becomes the same only after the new value of A is written from the internal register back to the magnetic disk.

[Figure 1.12 Migration of integer A from disk to register.]

In a computing environment where only one process executes at a time, this arrangement poses no difficulties, since an access to integer A will always be to the copy at the highest level of the hierarchy. However, in a multitasking environment, where the CPU is switched back and forth among various processes, extreme care must be taken to ensure that, if several processes wish to access A, then each of these processes will obtain the most recently updated value of A.

The situation becomes more complicated in a multiprocessor environment (Figure 1.6), where, in addition to maintaining internal registers, each of the CPUs also contains a local cache. In such an environment, a copy of A may exist simultaneously in several caches. Since the various CPUs can all execute concurrently, we must make sure that an update to the value of A in one cache is immediately reflected in all other caches where A resides. This situation is called cache coherency, and it is usually a hardware problem (handled below the operating-system level).
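The multitasking version of the problem—several threads of control updating their own copies of A—can be demonstrated directly in Java. In the sketch below, the unprotected increments on the plain field can lose updates, while AtomicInteger forces every update to be made visible to all threads (the 100,000-iteration count is arbitrary):

    import java.util.concurrent.atomic.AtomicInteger;

    // Two threads increment shared counters. The plain int can lose
    // updates because each thread works on its own copy of the value;
    // AtomicInteger makes each update visible to all threads.
    public class SharedCounter {
        static int plain = 0;
        static AtomicInteger atomic = new AtomicInteger(0);

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) {
                    plain++;                     // read-modify-write, unprotected
                    atomic.incrementAndGet();    // coordinated update
                }
            };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join();  t2.join();
            System.out.println("plain:  " + plain);        // often < 200000
            System.out.println("atomic: " + atomic.get()); // always 200000
        }
    }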
In a distributed environment, the situation becomes even more complex. In such an environment, several copies (or replicas) of the same file can be kept on different computers that are distributed in space. Since the various replicas may be accessed and updated concurrently, some distributed systems ensure that, when a replica is updated in one place, all other replicas are brought up to date as soon as possible. There are various ways to achieve this guarantee, as we discuss in Chapter 17.

1.8.4 I/O Systems

One of the purposes of an operating system is to hide the peculiarities of specific hardware devices from the user. For example, in UNIX, the peculiarities of I/O devices are hidden from the bulk of the operating system itself by the I/O subsystem. The I/O subsystem consists of several components:

• A memory-management component that includes buffering, caching, and spooling
• A general device-driver interface
• Drivers for specific hardware devices

Only the device driver knows the peculiarities of the specific device to which it is assigned. We discussed in Section 1.2.3 how interrupt handlers and device drivers are used in the construction of efficient I/O subsystems. In Chapter 13, we discuss how the I/O subsystem interfaces to the other system components, manages devices, transfers data, and detects I/O completion.

1.9 Protection and Security

If a computer system has multiple users and allows the concurrent execution of multiple processes, then access to data must be regulated. For that purpose, mechanisms ensure that files, memory segments, CPU, and other resources can be operated on by only those processes that have gained proper authorization from the operating system. For example, memory-addressing hardware ensures that a process can execute only within its own address space. The timer ensures that no process can gain control of the CPU without eventually relinquishing control. Device-control registers are not accessible to users, so the integrity of the various peripheral devices is protected.

Protection, then, is any mechanism for controlling the access of processes or users to the resources defined by a computer system. This mechanism must provide a means to specify the controls to be imposed and a means to enforce the controls. Protection can improve reliability by detecting latent errors at the interfaces between component subsystems. Early detection of interface errors can often prevent contamination of a healthy subsystem by another subsystem that is malfunctioning. Furthermore, an unprotected resource cannot defend against use (or misuse) by an unauthorized or incompetent user. A protection-oriented system provides a means to distinguish between authorized and unauthorized usage, as we discuss in Chapter 14.

A system can have adequate protection but still be prone to failure and allow inappropriate access. Consider a user whose authentication information (her means of identifying herself to the system) is stolen. Her data could be copied or deleted, even though file and memory protection are working. It is the job of security to defend a system from external and internal attacks. Such attacks spread across a huge range and include viruses and worms, denial-of-service attacks (which use all of a system's resources and so keep legitimate users out of the system), identity theft, and theft of service (unauthorized use of a system). Prevention of some of these attacks is considered an operating-system function on some systems, while other systems leave the prevention to policy or additional software. Due to the alarming rise in security incidents, operating-system security features represent a fast-growing area of research and implementation.
Security is discussed in Chapter 15.

Protection and security require the system to be able to distinguish among all its users. Most operating systems maintain a list of user names and associated user identifiers (user IDs); in Windows Vista parlance, these are security IDs (SIDs). These numerical IDs are unique, one per user. When a user logs in to the system, the authentication stage determines the appropriate user ID for the user. That user ID is associated with all of the user's processes and threads. When an ID needs to be user readable, it is translated back to the user name via the user-name list.

In some circumstances, we wish to distinguish among sets of users rather than individual users. For example, the owner of a file on a UNIX system may be allowed to issue all operations on that file, whereas a selected set of users may only be allowed to read the file. To accomplish this, we need to define a group name and the set of users belonging to that group. Group functionality can be implemented as a system-wide list of group names and group identifiers. A user can be in one or more groups, depending on operating-system design decisions. The user's group IDs are also included in every associated process and thread.

In the course of normal use of a system, the user ID and group ID for a user are sufficient. However, a user sometimes needs to escalate privileges to gain extra permissions for an activity. The user may need access to a device that is restricted, for example. Operating systems provide various methods to allow privilege escalation. On UNIX, for example, the setuid attribute on a program causes that program to run with the user ID of the owner of the file, rather than the current user's ID. The process runs with this effective UID until it turns off the extra privileges or terminates.

1.10 Distributed Systems

A distributed system is a collection of physically separate, possibly heterogeneous computer systems that are networked to provide the users with access to the various resources that the system maintains. Access to a shared resource increases computation speed, functionality, data availability, and reliability. Some operating systems generalize network access as a form of file access, with the details of networking contained in the network interface's device driver. Others make users specifically invoke network functions. Generally, systems contain a mix of the two modes—for example, FTP and NFS. FTP is usually an interactive, command-line tool for copying files between networked systems. NFS is a protocol that allows storage on a remote server to appear and behave exactly as if the storage is attached to the local computer. The protocols that create a distributed system can greatly affect that system's utility and popularity.

A network, in the simplest terms, is a communication path between two or more systems. Distributed systems depend on networking for their functionality. Networks vary by the protocols used, the distances between nodes, and the transport media. They also vary in their performance and reliability. TCP/IP is the most common network protocol, although ATM and other protocols are in widespread use. Likewise, operating-system support of protocols varies. Most operating systems support TCP/IP, including the Windows and UNIX operating systems. Some systems support proprietary protocols to suit their needs.
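From an application's perspective, the operating system's TCP/IP support surfaces through the socket interface. The following minimal Java sketch shows that view: the kernel's protocol stack handles connection setup, routing, and reliable delivery, while the program sees only a byte stream. The host name and port here are placeholders for a server you would actually run:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class EchoClient {
    public static void main(String[] args) throws Exception {
        // "example.com" and port 7 are placeholders; substitute a real server.
        try (Socket s = new Socket("example.com", 7)) {
            PrintWriter out = new PrintWriter(s.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream()));
            out.println("hello");   // bytes are handed to the kernel's TCP stack
            System.out.println("reply: " + in.readLine());
        }                            // closing the socket tears down the connection
    }
}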
To an operating system, a network protocol simply needs an interface device—a network adapter, for example—with a device driver to manage it, as well as software to handle data. These concepts are discussed throughout this book.

Networks are characterized based on the distances between their nodes. A local-area network (LAN) connects computers within a room, a floor, or a building. A wide-area network (WAN) usually links buildings, cities, or countries. A global company may have a WAN to connect its offices worldwide. These networks may run one protocol or several protocols. The continuing advent of new technologies brings about new forms of networks. For example, a metropolitan-area network (MAN) could link buildings within a city. Bluetooth and 802.11 devices use wireless technology to communicate over a distance of several feet, in essence creating a small-area network such as might be found in a home.

The media to carry networks are equally varied. They include copper wires, fiber strands, and wireless transmissions between satellites, microwave dishes, and radios. When computing devices are connected to cellular phones, they create a network. Even very short-range infrared communication can be used for networking. At a rudimentary level, whenever computers communicate, they use or create a network.

Some operating systems have taken the concept of networks and distributed systems further than the notion of providing network connectivity. A network operating system is an operating system that provides features such as file sharing across the network and that includes a communication scheme that allows different processes on different computers to exchange messages. A computer running a network operating system acts autonomously from all other computers on the network, although it is aware of the network and is able to communicate with other networked computers. A distributed operating system provides a less autonomous environment: the different operating systems communicate closely enough to provide the illusion that only a single operating system controls the network. We cover computer networks and distributed systems in Chapters 16 through 18.

1.11 Special-Purpose Systems

The discussion thus far has focused on the general-purpose computer systems that we are all familiar with. There are, however, other classes of computer systems whose functions are more limited and whose objective is to deal with limited computation domains.

1.11.1 Real-Time Embedded Systems

Embedded computers are the most prevalent form of computers in existence. These devices are found everywhere, from car engines and manufacturing robots to DVDs and microwave ovens. They tend to have very specific tasks. The systems they run on are usually primitive, and so the operating systems provide limited features. Usually, they have little or no user interface, preferring to spend their time monitoring and managing hardware devices, such as automobile engines and robotic arms.

These embedded systems vary considerably. Some are general-purpose computers, running standard operating systems—such as UNIX—with special-purpose applications to implement the functionality. Others are hardware devices with a special-purpose embedded operating system providing just the functionality desired. Yet others are hardware devices with application-specific integrated circuits (ASICs) that perform their tasks without an operating system. The use of embedded systems continues to expand.
The power of these devices, both as standalone units and as elements of networks and the Web, is sure to increase as well. Even now, entire houses can be computerized, so that a central computer—either a general-purpose computer or an embedded system—can control heating and lighting, alarm systems, and even coffee makers. Web access can enable a home owner to tell the house to heat up before she arrives home. Someday, the refrigerator may call the grocery store when it notices the milk is gone.

Embedded systems almost always run real-time operating systems. A real-time system is used when rigid time requirements have been placed on the operation of a processor or the flow of data; thus, it is often used as a control device in a dedicated application. Sensors bring data to the computer. The computer must analyze the data and possibly adjust controls to modify the sensor inputs. Systems that control scientific experiments, medical imaging systems, industrial control systems, and certain display systems are real-time systems. Some automobile-engine fuel-injection systems, home-appliance controllers, and weapon systems are also real-time systems.

A real-time system has well-defined, fixed time constraints. Processing must be done within the defined constraints, or the system will fail. For instance, it would not do for a robot arm to be instructed to halt after it had smashed into the car it was building. A real-time system functions correctly only if it returns the correct result within its time constraints. Contrast this system with a time-sharing system, where it is desirable (but not mandatory) to respond quickly, or a batch system. In a batch system, long-running jobs are submitted to the computer and run without human interaction until completed. These systems may have no time constraints at all.

Although Java (Section 2.9.2) is typically not associated with providing real-time characteristics, there is a specification for real-time Java that extends the specifications for the Java language as well as the Java virtual machine. The real-time specification for Java (RTSJ) identifies a programming interface for creating Java programs that must run within real-time demands. To address the timing constraints of real-time systems, the RTSJ addresses several issues that may affect the execution time of a Java program, including CPU scheduling and memory management.
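As a taste of what the RTSJ interface looks like, the sketch below creates a periodic real-time thread, a common idiom for polling a sensor under a deadline. It assumes an RTSJ-compliant virtual machine (the javax.realtime package is not part of standard Java), and the priority and period values are placeholders chosen for illustration:

import javax.realtime.PeriodicParameters;
import javax.realtime.PriorityParameters;
import javax.realtime.RealtimeThread;
import javax.realtime.RelativeTime;

public class PeriodicSensorTask {
    public static void main(String[] args) {
        PriorityParameters priority = new PriorityParameters(30); // placeholder priority
        PeriodicParameters release = new PeriodicParameters(
                null,                      // release when start() is called
                new RelativeTime(10, 0),   // period: 10 ms, 0 ns
                null, null, null, null);   // default cost, deadline, handlers
        RealtimeThread task = new RealtimeThread(priority, release) {
            public void run() {
                while (true) {
                    // read sensor, adjust control outputs ...
                    waitForNextPeriod();   // block until the next 10-ms release
                }
            }
        };
        task.start();
    }
}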
In Chapter 19, we cover real-time embedded systems in great detail. In Chapter 5, we consider the scheduling facility needed to implement real-time functionality in an operating system. In Chapter 9, we describe the design of memory management for real-time computing. Finally, in Chapter 22, we describe the real-time components of the Windows XP operating system.

1.11.2 Multimedia Systems

Most operating systems are designed to handle conventional data, such as text files, programs, word-processing documents, and spreadsheets. However, a recent trend in technology is the incorporation of multimedia data into computer systems. Multimedia data consist of audio and video files as well as conventional files. These data differ from conventional data in that multimedia data—such as frames of video—must be delivered (streamed) according to certain time restrictions (for example, 30 frames per second).

Multimedia describes a wide range of applications in popular use today. These include audio files such as MP3, DVD movies, video conferencing, and short video clips of movie previews or news stories downloaded over the Internet. Multimedia applications may also include live webcasts (broadcasting over the World Wide Web) of speeches or sporting events and even live webcams that allow a viewer in Manhattan to observe customers at a cafe in Paris. Multimedia applications need not be either audio or video; rather, a multimedia application often includes a combination of both. For example, a movie may consist of separate audio and video tracks. Nor must multimedia applications be delivered only to desktop personal computers. Increasingly, they are being directed toward smaller devices, including PDAs and cellular telephones. For example, a stock trader may have stock quotes delivered wirelessly and in real time to his PDA.

In Chapter 20, we explore the demands of multimedia applications, describe how multimedia data differ from conventional data, and explain how the nature of these data affects the design of operating systems that support the requirements of multimedia systems.

1.11.3 Handheld Systems

Handheld systems include personal digital assistants (PDAs), such as Palm and Pocket-PCs, and cellular telephones, many of which use special-purpose embedded operating systems. Developers of handheld systems and applications face many challenges, most of which are due to the limited size of such devices. For example, a PDA is typically about 5 inches in height and 3 inches in width, and it weighs less than one-half pound. Because of their size, most handheld devices have small amounts of memory, slow processors, and small display screens. We take a look now at each of these limitations.

The amount of physical memory in a handheld depends on the device, but typically it is somewhere between 1 MB and 1 GB. (Contrast this with a typical PC or workstation, which may have several gigabytes of memory.) As a result, the operating system and applications must manage memory efficiently. This includes returning all allocated memory to the memory manager when the memory is not being used. In Chapter 9, we explore virtual memory, which allows developers to write programs that behave as if the system has more memory than is physically available. Currently, not many handheld devices use virtual memory techniques, so program developers must work within the confines of limited physical memory.
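A Java application cannot control the platform's memory manager, but it can at least observe the budget the virtual machine gives it, which is one starting point for writing memory-frugal code on a constrained device. A minimal sketch using the standard Runtime API:

public class MemoryBudget {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // All figures are in bytes and describe this JVM instance only.
        System.out.println("max heap:   " + rt.maxMemory());   // upper bound the JVM will try to use
        System.out.println("total heap: " + rt.totalMemory()); // memory currently reserved
        System.out.println("free heap:  " + rt.freeMemory());  // unused portion of the total
    }
}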
A second issue of concern to developers of handheld devices is the speed of the processor used in the devices. Processors for most handheld devices run at a fraction of the speed of a processor in a PC. Faster processors require more power. To include a faster processor in a handheld device would require a larger battery, which would take up more space and would have to be replaced (or recharged) more frequently. Most handheld devices use smaller, slower processors that consume less power. Therefore, the operating system and applications must be designed not to tax the processor.

The last issue confronting program designers for handheld devices is I/O. A lack of physical space limits input methods to small keyboards, handwriting recognition, or small screen-based keyboards. The small display screens limit output options. Whereas a monitor for a home computer may measure up to 30 inches, the display for a handheld device is often no more than 3 inches square. Familiar tasks, such as reading e-mail and browsing Web pages, must be condensed into smaller displays. One approach for displaying the content in Web pages is Web clipping, where only a small subset of a Web page is delivered and displayed on the handheld device.

Some handheld devices use wireless technology, such as Bluetooth or 802.11, allowing remote access to e-mail and Web browsing. Cellular telephones with connectivity to the Internet fall into this category. However, for PDAs that do not provide wireless access, downloading data typically requires the user first to download the data to a PC or workstation and then to download the data to the PDA. Some PDAs allow data to be directly copied from one device to another using an infrared link.

Generally, the limitations in the functionality of PDAs are balanced by their convenience and portability. Their use continues to expand as network connections become more available and other options, such as digital cameras and MP3 players, expand their utility.

1.12 Computing Environments

So far, we have provided an overview of computer-system organization and major operating-system components. We now offer a brief overview of how these are used in a variety of computing environments.

1.12.1 Traditional Computing

As computing matures, the lines separating many of the traditional computing environments are blurring. Consider the "typical office environment." Just a few years ago, this environment consisted of PCs connected to a network, with servers providing file and print services. Remote access was awkward, and portability was achieved by use of laptop computers. Terminals attached to mainframes were prevalent at many companies as well, with even fewer remote access and portability options.

The current trend is toward providing more ways to access these computing environments. Web technologies are stretching the boundaries of traditional computing. Companies establish portals, which provide Web accessibility to their internal servers. Network computers are essentially terminals that understand Web-based computing. Handheld computers can synchronize with PCs to allow very portable use of company information. Handheld PDAs can also connect to wireless networks to use the company's Web portal (as well as the myriad other Web resources).

At home, most users had a single computer with a slow modem connection to the office, the Internet, or both. Today, network-connection speeds once available only at great cost are relatively inexpensive, giving home users more access to more data. These fast data connections are allowing home computers to serve up Web pages and to run networks that include printers, client PCs, and servers. Some homes even have firewalls to protect their networks from security breaches. Those firewalls cost thousands of dollars a few years ago and did not even exist until the 1980s.

In the latter half of the previous century, computing resources were scarce. (Before that, they were nonexistent!) For a period of time, systems were either batch or interactive. Batch systems processed jobs in bulk, with predetermined input (from files or other sources of data). Interactive systems waited for input from users. To optimize the use of the computing resources, multiple users shared time on these systems. Time-sharing systems used a timer and scheduling algorithms to rapidly cycle processes through the CPU, giving each user a share of the resources. Today, traditional time-sharing systems are uncommon.
The same scheduling technique is still in use on workstations and servers, but frequently the processes are all owned by the same user (or a single user and the operating system). User processes, and system processes that provide services to the user, are managed so that each frequently gets a slice of computer time. Consider the windows created while a user is working on a PC, for example, and the fact that they may be performing different tasks at the same time.

1.12.2 Client–Server Computing

As PCs have become faster, more powerful, and cheaper, designers have shifted away from centralized system architecture. Terminals connected to centralized systems are now being supplanted by PCs. Correspondingly, user-interface functionality once handled directly by centralized systems is increasingly being handled by PCs as well. As a result, many of today's systems act as server systems to satisfy requests generated by client systems. This form of specialized distributed system, called a client–server system, has the general structure depicted in Figure 1.13.

Figure 1.13 General structure of a client–server system.

Server systems can be broadly categorized as compute servers and file servers:

• The compute-server system provides an interface to which a client can send a request to perform an action (for example, read data); in response, the server executes the action and sends back results to the client. A server running a database that responds to client requests for data is an example of such a system.

• The file-server system provides a file-system interface where clients can create, update, read, and delete files. An example of such a system is a Web server that delivers files to clients running Web browsers.

1.12.3 Peer-to-Peer Computing

Another structure for a distributed system is the peer-to-peer (P2P) system model. In this model, clients and servers are not distinguished from one another; instead, all nodes within the system are considered peers, and each may act as either a client or a server, depending on whether it is requesting or providing a service. Peer-to-peer systems offer an advantage over traditional client-server systems. In a client-server system, the server is a bottleneck; but in a peer-to-peer system, services can be provided by several nodes distributed throughout the network.

To participate in a peer-to-peer system, a node must first join the network of peers. Once a node has joined the network, it can begin providing services to—and requesting services from—other nodes in the network. Determining what services are available is accomplished in one of two general ways:

• When a node joins a network, it registers its service with a centralized lookup service on the network. Any node desiring a specific service first contacts this centralized lookup service to determine which node provides the service. The remainder of the communication takes place between the client and the service provider.

• A peer acting as a client must first discover what node provides a desired service by broadcasting a request for the service to all other nodes in the network. The node (or nodes) providing that service responds to the peer making the request. To support this approach, a discovery protocol must be provided that allows peers to discover services provided by other peers in the network. (A small sketch of this broadcast style of discovery follows the list.)
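The following Java sketch conveys the flavor of the second, broadcast-based approach on a single LAN. The one-line message format, the service name, and the port number are all invented for the example; a real discovery protocol would define these precisely:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class DiscoverPeer {
    public static void main(String[] args) throws Exception {
        byte[] query = "WHO-PROVIDES time-service".getBytes(); // made-up request format
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setBroadcast(true);
            // Broadcast the request to every node on the local network.
            sock.send(new DatagramPacket(query, query.length,
                    InetAddress.getByName("255.255.255.255"), 9876)); // placeholder port
            byte[] buf = new byte[256];
            DatagramPacket reply = new DatagramPacket(buf, buf.length);
            sock.setSoTimeout(2000);   // give up if no peer answers within 2 seconds
            sock.receive(reply);       // a peer providing the service responds directly
            System.out.println("service provided by " + reply.getAddress());
        }
    }
}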
Peer-to-peer networks gained widespread popularity in the late 1990s with several file-sharing services, such as Napster and Gnutella, that enable peers to exchange files with one another. The Napster system used an approach similar to the first type described above: a centralized server maintains an index of all files stored on peer nodes in the Napster network, and the actual exchanging of files takes place between the peer nodes. The Gnutella system uses a technique similar to the second type: a client broadcasts file requests to other nodes in the system, and nodes that can service the request respond directly to the client. The future of exchanging files remains uncertain because many of the files are copyrighted (music, for example), and there are laws governing the distribution of copyrighted material. In any case, though, peer-to-peer technology undoubtedly will play a role in the future of many services, such as searching, file exchange, and e-mail.

1.12.4 Web-Based Computing

The Web has become ubiquitous, leading to more access by a wider variety of devices than was dreamt of a few years ago. PCs are still the most prevalent access devices, with workstations, handheld PDAs, and even cell phones also providing access.

Web computing has increased the emphasis on networking. Devices that were not previously networked now include wired or wireless access. Devices that were networked now have faster network connectivity, provided by either improved networking technology, optimized network implementation code, or both.

The implementation of Web-based computing has given rise to new categories of devices, such as load balancers, which distribute network connections among a pool of similar servers. Operating systems like Windows 95, which acted as Web clients, have evolved into Linux and Windows 7, which can act as Web servers as well as clients. Generally, the Web has increased the complexity of devices because their users require them to be Web-enabled.

1.13 Open-Source Operating Systems

The study of operating systems, as noted earlier, is made easier by the availability of a vast number of open-source releases. Open-source operating systems are those made available in source-code format rather than as compiled binary code. Linux is the most famous open-source operating system, while Microsoft Windows is a well-known example of the opposite closed-source approach. In this section, we discuss several aspects of open-source systems in general and then describe how students can access source code for the Linux, BSD UNIX, and Solaris operating systems.

1.13.1 Benefits of Open-Source Systems

Working with open-source systems offers many benefits to programmers and students of programming. Starting with the source code allows the programmer to produce binary code that can be executed on a system. Doing the opposite—reverse engineering the source code from the binaries—is quite a lot of work, and useful items such as comments are never recovered. In addition, learning operating systems by examining the actual source code, rather than reading summaries of that code, can be extremely useful. With the source code in hand, a student can modify the operating system and then compile and run the code to try out those changes, which is an excellent learning tool.
This text includes projects that involve modifying operating-system source code, while also describing algorithms at a high level to be sure all important operating-system topics are covered. Throughout the text, we provide pointers to examples of open-source code for deeper study.

Another benefit of open-source operating systems is the large community of interested (and usually unpaid) programmers who contribute to the code by helping to debug it, analyze it, provide support, and suggest changes. Arguably, open-source code is more secure than closed-source code because many more eyes are viewing the code. Certainly open-source code has bugs, but open-source advocates argue that bugs tend to be found and fixed faster owing to the number of people using and viewing the code. Companies that earn revenue from selling their programs tend to be hesitant to open-source their code, but Red Hat, SUSE, Sun, and a myriad of other companies are doing just that and showing that commercial companies benefit, rather than suffer, when they open-source their code. Revenue can be generated through support contracts and the sale of hardware on which the software runs, for example.

1.13.2 History of Open-Source Systems

In the early days of modern computing (that is, the 1950s), a great deal of software was available in open-source format. The original hackers (computer enthusiasts) at MIT's Tech Model Railroad Club left their programs in drawers for others to work on. "Homebrew" user groups exchanged code during their meetings. Later, company-specific user groups, such as Digital Equipment Corporation's DECUS, accepted contributions of source-code programs, collected them onto tapes, and distributed the tapes to interested members.

Computer and software companies eventually sought to limit the use of their software to authorized computers and paying customers. Releasing only the binary files compiled from the source code, rather than the source code itself, helped them to achieve this goal, as well as protecting their code and their ideas from their competitors. Another issue involved copyrighted material. Operating systems and other programs can limit the ability to play back movies and music or display electronic books to authorized computers. Such copy protection, or digital rights management (DRM), would not be effective if the source code that implemented these limits were published. Laws in many countries, including the U.S. Digital Millennium Copyright Act (DMCA), make it illegal to reverse-engineer DRM code or otherwise try to circumvent copy protection.

To counter the move to limit software use and redistribution, Richard Stallman in 1983 started the GNU project to create a free, open-source, UNIX-compatible operating system. In 1985, he published the GNU Manifesto, which argues that all software should be free and open-sourced. He also formed the Free Software Foundation (FSF) with the goal of encouraging the free exchange of software source code and the free use of that software. Rather than copyright its software, the FSF "copylefts" the software to encourage sharing and improvement. The GNU General Public License (GPL) codifies copylefting and is a common license under which free software is released. Fundamentally, the GPL requires that the source code be distributed with any binaries and that any changes made to the source code be released under the same GPL license.

1.13.3 Linux

As an example of an open-source operating system, consider GNU/Linux.
The GNU project produced many UNIX-compatible tools, including compilers, editors, and utilities, but never released a kernel. In 1991, a student in Finland, Linus Torvalds, released a rudimentary UNIX-like kernel using the GNU compilers and tools and invited contributions worldwide. The advent of the Internet meant that anyone interested could download the source code, modify it, and submit changes to Torvalds. Releasing updates once a week allowed this so-called Linux operating system to grow rapidly, enhanced by several thousand programmers.

The resulting GNU/Linux operating system has spawned hundreds of unique distributions, or custom builds, of the system. Major distributions include Red Hat, SUSE, Fedora, Debian, Slackware, and Ubuntu. Distributions vary in function, utility, installed applications, hardware support, user interface, and purpose. For example, Red Hat Enterprise Linux is geared to large commercial use. PCLinuxOS is a LiveCD—an operating system that can be booted and run from a CD-ROM without being installed on a system's hard disk. One variant of PCLinuxOS, "PCLinuxOS Supergamer DVD," is a LiveDVD that includes graphics drivers and games. A gamer can run it on any compatible system simply by booting from the DVD. When the gamer is finished, a reboot of the system resets it to its installed operating system.

Access to the Linux source code varies by release. Here, we consider Ubuntu Linux. Ubuntu is a popular Linux distribution that comes in a variety of types, including those tuned for desktops, servers, and students. Its founder pays for the printing and mailing of DVDs containing the binary and source code (which helps to make it popular). The following steps outline a way to explore the Ubuntu kernel source code on systems that support the free "VMware Player" tool:

• Download the player from http://www.vmware.com/download/player/ and install it on your system.
• Download a virtual machine containing Ubuntu. Hundreds of "appliances", or virtual-machine images, preinstalled with operating systems and applications, are available from VMware at http://www.vmware.com/appliances/.
• Boot the virtual machine within VMware Player.
• Get the source code of the kernel release of interest, such as 2.6, by executing wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.18.1.tar.bz2 within the Ubuntu virtual machine.
• Uncompress and untar the downloaded file via tar xjf linux-2.6.18.1.tar.bz2.
• Explore the source code of the Ubuntu kernel, which is now in ./linux-2.6.18.1.

For more about Linux, see Chapter 21. For more about virtual machines, see Section 2.8.

1.13.4 BSD UNIX

BSD UNIX has a longer and more complicated history than Linux. It started in 1978 as a derivative of AT&T's UNIX. Releases from the University of California at Berkeley came in source and binary form, but they were not open-source because a license from AT&T was required. BSD UNIX's development was slowed by a lawsuit by AT&T, but eventually a fully functional, open-source version, 4.4BSD-lite, was released in 1994.

Just as with Linux, there are many distributions of BSD UNIX, including FreeBSD, NetBSD, OpenBSD, and DragonFly BSD. To explore the source code of FreeBSD, simply download the virtual-machine image of the version of interest and boot it within VMware, as described above for Ubuntu Linux. The source code comes with the distribution and is stored in /usr/src/. The kernel source code is in /usr/src/sys.
For example, to examine the virtual memory implementation code in the FreeBSD kernel, see the files in /usr/src/sys/vm.

Darwin, the core kernel component of Mac OS X, is based on BSD UNIX and is open-sourced as well. That source code is available from http://www.opensource.apple.com/darwinsource/. Every Mac OS X release has its open-source components posted at that site. The name of the package that contains the kernel is "xnu." The source code for Mac OS X kernel revision 1228 (the source code for Mac OS X Leopard) can be found at www.opensource.apple.com/darwinsource/tarballs/apsl/xnu-1228.tar.gz. Apple also provides extensive developer tools, documentation, and support at http://connect.apple.com. For more information on BSD UNIX, see Appendix A.

1.13.5 Solaris

Solaris is the commercial UNIX-based operating system of Sun Microsystems. Originally, Sun's SunOS operating system was based on BSD UNIX. Sun moved to AT&T's System V UNIX as its base in 1991. In 2005, Sun open-sourced some of the Solaris code, and over time, the company has added more and more to that open-source code base. Unfortunately, not all of Solaris is open-sourced, because some of the code is still owned by AT&T and other companies. However, Solaris can be compiled from the open source and linked with binaries of the closed-source components, so it can still be explored, modified, compiled, and tested.

The source code is available from http://opensolaris.org/os/downloads/. Also available there are precompiled distributions based on the source code, documentation, and discussion groups. It is not necessary to download the entire source-code bundle from the site, because Sun allows visitors to explore the source code on-line via a source-code browser.

1.13.6 Conclusion

The free software movement is driving legions of programmers to create thousands of open-source projects, including operating systems. Sites like http://freshmeat.net/ and http://distrowatch.com/ provide portals to many of these projects. As we pointed out earlier, open-source projects enable students to use source code as a learning tool. Students can modify programs and test them, help find and fix bugs, and otherwise explore mature, full-featured operating systems, compilers, tools, user interfaces, and other types of programs. The availability of source code for historic projects, such as Multics, can help students to understand those projects and to build knowledge that will help in the implementation of new projects.

GNU/Linux, BSD UNIX, and Solaris are all open-source operating systems, but each has its own goals, utility, licensing, and purpose. Sometimes licenses are not mutually exclusive, and cross-pollination occurs, allowing rapid improvements in operating-system projects. For example, several major components of Solaris have been ported to BSD UNIX. The advantages of free software and open sourcing are likely to increase the number and quality of open-source projects, leading to an increase in the number of individuals and companies that use these projects.

1.14 Summary

An operating system is software that manages the computer hardware, as well as providing an environment for application programs to run. Perhaps the most visible aspect of an operating system is the interface it provides to the human user of the computer system.

For a computer to do its job of executing programs, the programs must be in main memory. Main memory is the only large storage area that the processor can access directly.
It is an array of words or bytes, ranging in size from hundreds of thousands to billions. Each word in memory has its own address. The main memory is usually a volatile storage device that loses its contents when power is turned off or lost. Most computer systems provide secondary storage as an extension of main memory. Secondary storage provides a form of nonvolatile storage that can hold large quantities of data permanently. The most common secondary-storage device is a magnetic disk, which provides storage for both programs and data.

The wide variety of storage systems in a computer system can be organized in a hierarchy according to speed and cost. The higher levels are expensive, but they are fast. As we move down the hierarchy, the cost per bit generally decreases, whereas the access time generally increases.

There are several different strategies for designing a computer system. Single-processor systems have only one processor, while multiprocessor systems contain two or more processors that share physical memory and peripheral devices. The most common multiprocessor design is symmetric multiprocessing (or SMP), where all processors are considered peers and run independently of one another. Clustered systems are a specialized form of multiprocessor systems and consist of multiple computer systems connected by a local-area network.

To best utilize the CPU, modern operating systems employ multiprogramming, which allows several jobs to be in memory at the same time, thus ensuring that the CPU always has a job to execute. Time-sharing systems are an extension of multiprogramming wherein CPU scheduling algorithms rapidly switch between jobs, thus providing the illusion that all jobs are running concurrently.

The operating system must ensure correct operation of the computer system. To prevent user programs from interfering with the proper operation of the system, the hardware has two modes: user mode and kernel mode. Certain instructions are privileged and can be executed only in kernel mode. The memory in which the operating system resides must also be protected from modification by the user. A timer prevents infinite loops. These facilities (dual mode, privileged instructions, memory protection, and timer interrupt) are basic building blocks used by operating systems to achieve correct operation.

A process (or job) is the fundamental unit of work in an operating system. Process management includes creating and deleting processes and providing mechanisms for processes to communicate and synchronize with each other. An operating system manages memory by keeping track of what parts of memory are being used and by whom. The operating system is also responsible for dynamically allocating and freeing memory space. Storage space is also managed by the operating system; this includes providing file systems for representing files and directories and managing space on mass-storage devices.

Operating systems must also be concerned with protecting and securing the operating system and users. Protection measures are mechanisms that control the access of processes or users to the resources made available by the computer system. Security measures are responsible for defending a computer system from external or internal attacks.

Distributed systems allow users to share resources on geographically dispersed hosts connected via a computer network. Services may be provided through either the client–server model or the peer-to-peer model.
In a clustered system, multiple machines can perform computations on data residing on shared storage, and computing can continue even when some subset of cluster members fails.

LANs and WANs are the two basic types of networks. LANs enable processors distributed over a small geographical area to communicate, whereas WANs allow processors distributed over a larger area to communicate. LANs typically are faster than WANs.

There are several computer systems that serve specific purposes. These include real-time operating systems designed for embedded environments such as consumer devices, automobiles, and robotics. Real-time operating systems have well-defined, fixed time constraints. Processing must be done within the defined constraints, or the system will fail. Multimedia systems involve the delivery of multimedia data and often have special requirements of displaying or playing audio, video, or synchronized audio and video streams.

Recently, the influence of the Internet and the World Wide Web has encouraged the development of operating systems that include Web browsers and networking and communication software as integral features.

The free software movement has created thousands of open-source projects, including operating systems. Because of these projects, students are able to use source code as a learning tool. They can modify programs and test them, help find and fix bugs, and otherwise explore mature, full-featured operating systems, compilers, tools, user interfaces, and other types of programs. GNU/Linux, BSD UNIX, and Solaris are all open-source operating systems. The advantages of free software and open sourcing are likely to increase the number and quality of open-source projects, leading to an increase in the number of individuals and companies that use these projects.

Practice Exercises

1.1 What are the three main purposes of an operating system?
1.2 What are the main differences between operating systems for mainframe computers and personal computers?
1.3 List the four steps that are necessary to run a program on a completely dedicated machine—a computer that is running only that program.
1.4 We have stressed the need for an operating system to make efficient use of the computing hardware. When is it appropriate for the operating system to forsake this principle and to "waste" resources? Why is such a system not really wasteful?
1.5 What is the main difficulty that a programmer must overcome in writing an operating system for a real-time environment?
1.6 Consider the various definitions of operating system. Next, consider whether the operating system should include applications such as Web browsers and mail programs. Argue both that it should and that it should not, and support your answers.
1.7 How does the distinction between kernel mode and user mode function as a rudimentary form of protection (security) system?
1.8 Which of the following instructions should be privileged?
a. Set value of timer.
b. Read the clock.
c. Clear memory.
d. Issue a trap instruction.
e. Turn off interrupts.
f. Modify entries in device-status table.
g. Switch from user to kernel mode.
h. Access I/O device.
1.9 Some early computers protected the operating system by placing it in a memory partition that could not be modified by either the user job or the operating system itself. Describe two difficulties that you think could arise with such a scheme.
1.10 Some CPUs provide for more than two modes of operation. What are two possible uses of these multiple modes?
1.11 Timers could be used to compute the current time. Provide a short description of how this could be accomplished.
1.12 Is the Internet a LAN or a WAN?

Exercises

1.13 In a multiprogramming and time-sharing environment, several users share the system simultaneously. This situation can result in various security problems.
a. What are two such problems?
b. Can we ensure the same degree of security in a time-shared machine as in a dedicated machine? Explain your answer.
1.14 The issue of resource utilization shows up in different forms in different types of operating systems. List what resources must be managed carefully in the following settings:
a. Mainframe or minicomputer systems
b. Workstations connected to servers
c. Handheld computers
1.15 Under what circumstances would a user be better off using a time-sharing system rather than a PC or a single-user workstation?
1.16 Identify which of the functionalities listed below need to be supported by the operating system for (a) handheld devices and (b) real-time systems.
a. Batch programming
b. Virtual memory
c. Time sharing
1.17 Describe the differences between symmetric and asymmetric multiprocessing. What are three advantages and one disadvantage of multiprocessor systems?
1.18 How do clustered systems differ from multiprocessor systems? What is required for two machines belonging to a cluster to cooperate to provide a highly available service?
1.19 Distinguish between the client–server and peer-to-peer models of distributed systems.
1.20 Consider a computing cluster consisting of two nodes running a database. Describe two ways in which the cluster software can manage access to the data on the disk. Discuss the benefits and disadvantages of each.
1.21 How are network computers different from traditional personal computers? Describe some usage scenarios in which it is advantageous to use network computers.
1.22 What is the purpose of interrupts? What are the differences between a trap and an interrupt? Can traps be generated intentionally by a user program? If so, for what purpose?
1.23 Direct memory access is used for high-speed I/O devices in order to avoid increasing the CPU's execution load.
a. How does the CPU interface with the device to coordinate the transfer?
b. How does the CPU know when the memory operations are complete?
c. The CPU is allowed to execute other programs while the DMA controller is transferring data. Does this process interfere with the execution of the user programs? If so, describe what forms of interference are caused.
1.24 Some computer systems do not provide a privileged mode of operation in hardware. Is it possible to construct a secure operating system for these computer systems? Give arguments both that it is and that it is not possible.
1.25 Give two reasons why caches are useful. What problems do they solve? What problems do they cause? If a cache can be made as large as the device for which it is caching (for instance, a cache as large as a disk), why not make it that large and eliminate the device?
1.26 Consider an SMP system similar to what is shown in Figure 1.6. Illustrate with an example how data residing in memory could in fact have two different values in each of the local caches.
1.27 Discuss, with examples, how the problem of maintaining coherence of cached data manifests itself in the following processing environments:
a. Single-processor systems
b. Multiprocessor systems
c. Distributed systems
1.28 Describe a mechanism for enforcing memory protection in order to prevent a program from modifying the memory associated with other programs.
1.29 What network configuration would best suit the following environments?
a. A dormitory floor
b. A university campus
c. A state
d. A nation
1.30 Define the essential properties of the following types of operating systems:
a. Batch
b. Interactive
c. Time sharing
d. Real time
e. Network
f. Parallel
g. Distributed
h. Clustered
i. Handheld
1.31 What are the tradeoffs inherent in handheld computers?
1.32 Identify several advantages and several disadvantages of open-source operating systems. Include the types of people who would find each aspect to be an advantage or a disadvantage.

Wiley Plus

Visit Wiley Plus for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Brookshear [2003] provides an overview of computer science in general. An overview of the Linux operating system is presented in Bovet and Cesati [2006]. Solomon and Russinovich [2000] give an overview of Microsoft Windows and considerable technical detail about the system internals and components. Russinovich and Solomon [2005] update this information to Windows Server 2003 and Windows XP. McDougall and Mauro [2007] cover the internals of the Solaris operating system. Mac OS X is presented at http://www.apple.com/macosx. Mac OS X internals are discussed in Singh [2007].

Coverage of peer-to-peer systems includes Parameswaran et al. [2001], Gong [2002], Ripeanu et al. [2002], Agre [2003], Balakrishnan et al. [2003], and Loo [2003]. A discussion of peer-to-peer file-sharing systems can be found in Lee [2003]. Good coverage of cluster computing is provided by Buyya [1999]. Recent advances in cluster computing are described by Ahmed [2000]. A survey of issues relating to operating-system support for distributed systems can be found in Tanenbaum and Van Renesse [1985].

Many general textbooks cover operating systems, including Stallings [2000b], Nutt [2004], and Tanenbaum [2001]. Hamacher et al. [2002] describe computer organization, and McDougall and Laudon [2006] discuss multicore processors. Hennessy and Patterson [2007] provide coverage of I/O systems and buses, and of system architecture in general. Blaauw and Brooks [1997] describe details of the architecture of many computer systems, including several from IBM. Stokes [2007] provides an illustrated introduction to microprocessors and computer architecture.

Cache memories, including associative memory, are described and analyzed by Smith [1982]. That paper also includes an extensive bibliography on the subject. Discussions concerning magnetic-disk technology are presented by Freedman [1983] and by Harker et al. [1981]. Optical disks are covered by Kenville [1982], Fujitani [1984], O'Leary and Kitts [1985], Gait [1988], and Olsen and Kenley [1989]. Discussions of floppy disks are offered by Pechura and Schoeffler [1983] and by Sarisky [1983]. General discussions concerning mass-storage technology are offered by Chi [1982] and by Hoagland [1985].

Kurose and Ross [2005] and Tanenbaum [2003] provide general overviews of computer networks. Fortier [1989] presents a detailed discussion of networking hardware and software. Kozierok [2005] discusses TCP in detail. Mullender [1993] provides an overview of distributed systems.
Wolf [2003] discusses recent developments in the design of embedded systems. Issues related to handheld devices can be found in Myers and Beigl [2003] and Di Pietro and Mancini [2003].

A full discussion of the history of open sourcing and its benefits and challenges is found in Raymond [1999]. The history of hacking is discussed in Levy [1994]. The Free Software Foundation has published its philosophy on its Web site: http://www.gnu.org/philosophy/free-software-for-freedom.html. Detailed instructions on how to build the Ubuntu Linux kernel are at http://www.howtoforge.com/kernel_compilation_ubuntu. The open-source components of Mac OS X are available from http://developer.apple.com/open-source/index.html. Wikipedia (http://en.wikipedia.org/wiki/Richard_Stallman) has an informative entry about Richard Stallman. The source code of Multics is available at http://web.mit.edu/multics-history/source/Multics_Internet_Server/Multics_sources.html.

Chapter 2 Operating-System Structures

An operating system provides the environment within which programs are executed. Internally, operating systems vary greatly in their makeup, since they are organized along many different lines. The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. These goals form the basis for choices among various algorithms and strategies.

We can view an operating system from several vantage points. One view focuses on the services that the system provides; another, on the interface that it makes available to users and programmers; a third, on its components and their interconnections. In this chapter, we explore all three aspects of operating systems, showing the viewpoints of users, programmers, and operating-system designers. We consider what services an operating system provides, how they are provided, how they are debugged, and what the various methodologies are for designing such systems. Finally, we describe how operating systems are created and how a computer starts its operating system.

CHAPTER OBJECTIVES

• To describe the services an operating system provides to users, processes, and other systems.
• To discuss the various ways of structuring an operating system.
• To explain how operating systems are installed and customized and how they boot.

2.1 Operating-System Services

An operating system provides an environment for the execution of programs. It provides certain services to programs and to the users of those programs. The specific services provided, of course, differ from one operating system to another, but we can identify common classes. These operating-system services are provided for the convenience of the programmer, to make the programming task easier. Figure 2.1 shows one view of the various operating-system services and how they interrelate.

Figure 2.1 A view of operating-system services.

One set of operating-system services provides functions that are helpful to the user.

• User interface. Almost all operating systems have a user interface (UI). This interface can take several forms.
One is a command-line interface (CLI), which uses text commands and a method for entering them (say, a program to allow entering and editing of commands). Another is a batch interface, in which commands and directives to control those commands are entered into files, and those files are executed. Most commonly, a graphical user interface (GUI) is used. Here, the interface is a window system with a pointing device to direct I/O, choose from menus, and make selections and a keyboard to enter text. Some systems provide two or all three of these variations.

• Program execution. The system must be able to load a program into memory and to run that program. The program must be able to end its execution, either normally or abnormally (indicating error).

• I/O operations. A running program may require I/O, which may involve a file or an I/O device. For specific devices, special functions may be desired (such as recording to a CD or DVD drive or blanking a display screen). For efficiency and protection, users usually cannot control I/O devices directly. Therefore, the operating system must provide a means to do I/O.

• File-system manipulation. The file system is of particular interest. Obviously, programs need to read and write files and directories. They also need to create and delete them by name, search for a given file, and list file information. Finally, some programs include permissions management to allow or deny access to files or directories based on file ownership. Many operating systems provide a variety of file systems, sometimes to allow personal choice and sometimes to provide specific features or performance characteristics.

• Communications. There are many circumstances in which one process needs to exchange information with another process. Such communication may occur between processes that are executing on the same computer or between processes that are executing on different computer systems tied together by a computer network. Communications may be implemented via shared memory or through message passing, in which packets of information are moved between processes by the operating system.

• Error detection. The operating system needs to be constantly aware of possible errors. Errors may occur in the CPU and memory hardware (such as a memory error or a power failure), in I/O devices (such as a parity error on tape, a connection failure on a network, or lack of paper in the printer), and in the user program (such as an arithmetic overflow, an attempt to access an illegal memory location, or a too-great use of CPU time). For each type of error, the operating system should take the appropriate action to ensure correct and consistent computing. Of course, there is variation in how operating systems react to and correct errors. Debugging facilities can greatly enhance the user's and programmer's abilities to use the system efficiently.

Another set of operating-system functions exists not for helping the user but rather for ensuring the efficient operation of the system itself. Systems with multiple users can gain efficiency by sharing the computer resources among the users.

• Resource allocation. When there are multiple users or multiple jobs running at the same time, resources must be allocated to each of them. Many different types of resources are managed by the operating system.
Some (such as CPU cycles, main memory, and file storage) may have special allocation code, whereas others (such as I/O devices) may have much more general request and release code. For instance, in determining how best to use the CPU, operating systems have CPU-scheduling routines that take into account the speed of the CPU, the jobs that must be executed, the number of registers available, and other factors. There may also be routines to allocate printers, modems, USB storage drives, and other peripheral devices.

• Accounting. We want to keep track of which users use how much and what kinds of computer resources. This record keeping may be used for accounting (so that users can be billed) or simply for accumulating usage statistics. Usage statistics may be a valuable tool for researchers who wish to reconfigure the system to improve computing services.

• Protection and security. The owners of information stored in a multiuser or networked computer system may want to control use of that information. When several separate processes execute concurrently, it should not be possible for one process to interfere with the others or with the operating system itself. Protection involves ensuring that all access to system resources is controlled. Security of the system from outsiders is also important. Such security starts with requiring each user to authenticate himself or herself to the system, usually by means of a password, to gain access to system resources. It extends to defending external I/O devices, including modems and network adapters, from invalid access attempts and to recording all such connections for detection of break-ins. If a system is to be protected and secure, precautions must be instituted throughout it. A chain is only as strong as its weakest link.

2.2 User Operating-System Interface

We mentioned earlier that there are several ways for users to interface with the operating system. Here, we discuss two fundamental approaches. One provides a command-line interface, or command interpreter, that allows users to directly enter commands to be performed by the operating system. The other allows users to interface with the operating system via a graphical user interface, or GUI.

2.2.1 Command Interpreter

Some operating systems include the command interpreter in the kernel. Others, such as Windows XP and UNIX, treat the command interpreter as a special program that is running when a job is initiated or when a user first logs on (on interactive systems). On systems with multiple command interpreters to choose from, the interpreters are known as shells. For example, on UNIX and Linux systems, a user may choose among several different shells, including the Bourne shell, C shell, Bourne-Again shell, Korn shell, and others. Third-party shells and free user-written shells are also available. Most shells provide similar functionality, and a user's choice of which shell to use is generally based on personal preference. Figure 2.2 shows the Bourne shell command interpreter being used on Solaris 10.

Figure 2.2 The Bourne shell command interpreter in Solaris 10.

The main function of the command interpreter is to get and execute the next user-specified command. Many of the commands given at this level manipulate files: create, delete, list, print, copy, execute, and so on. The MS-DOS and UNIX shells operate in this way. These commands can be implemented in two general ways.

In one approach, the command interpreter itself contains the code to execute the command.
2.2.2 Graphical User Interfaces

A second strategy for interfacing with the operating system is through a user-friendly graphical user interface, or GUI. Here, rather than entering commands directly via a command-line interface, users employ a mouse-based window-and-menu system characterized by a desktop metaphor. The user moves the mouse to position its pointer on images, or icons, on the screen (the desktop) that represent programs, files, directories, and system functions. Depending on the mouse pointer's location, clicking a button on the mouse can invoke a program, select a file or directory—known as a folder—or pull down a menu that contains commands.

Graphical user interfaces first appeared due in part to research taking place in the early 1970s at the Xerox PARC research facility. The first GUI appeared on the Xerox Alto computer in 1973. However, graphical interfaces became more widespread with the advent of Apple Macintosh computers in the 1980s. The user interface for the Macintosh operating system (Mac OS) has undergone various changes over the years, the most significant being the adoption of the Aqua interface that appeared with Mac OS X. Microsoft's first version of Windows—Version 1.0—was based on the addition of a GUI to the MS-DOS operating system. Later versions of Windows have made cosmetic changes in the appearance of the GUI along with several enhancements in its functionality, including Windows Explorer.

Traditionally, UNIX systems have been dominated by command-line interfaces. Various GUI interfaces are available, however, including the Common Desktop Environment (CDE) and X-Windows systems, which are common on commercial versions of UNIX, such as Solaris and IBM's AIX system. In addition, there has been significant development in GUI designs from various open-source projects, such as the K Desktop Environment (or KDE) and the GNOME desktop by the GNU project. Both the KDE and GNOME desktops run on Linux and various UNIX systems and are available under open-source licenses, which means their source code is readily available for reading and for modification under specific license terms.

The choice of whether to use a command-line or GUI interface is mostly one of personal preference.
As a very general rule, many UNIX users prefer command-line interfaces, as they often provide powerful shell interfaces. In contrast, most Windows users are pleased to use the Windows GUI environment and almost never use the MS-DOS shell interface. The various changes undergone by the Macintosh operating systems provide a nice study in contrast. Historically, Mac OS has not provided a command-line interface, always requiring its users to interface with the operating system using its GUI. However, with the release of Mac OS X (which is in part implemented using a UNIX kernel), the operating system now provides both a new Aqua interface and a command-line interface. Figure 2.3 is a screen shot of the Mac OS X GUI.

Figure 2.3 The Mac OS X GUI.

The user interface can vary from system to system and even from user to user within a system. It typically is substantially removed from the actual system structure. The design of a useful and friendly user interface is therefore not a direct function of the operating system. In this book, we concentrate on the fundamental problems of providing adequate service to user programs. From the point of view of the operating system, we do not distinguish between user programs and system programs.

2.3 System Calls

System calls provide an interface to the services made available by an operating system. These calls are generally available as routines written in C and C++, although certain low-level tasks (for example, tasks where hardware must be accessed directly) may need to be written using assembly-language instructions.

Before we discuss how an operating system makes system calls available, let's first use an example to illustrate how system calls are used: writing a simple program to read data from one file and copy them to another file. The first input that the program will need is the names of the two files: the input file and the output file. These names can be specified in many ways, depending on the operating-system design. One approach is for the program to ask the user for the names of the two files. In an interactive system, this approach will require a sequence of system calls, first to write a prompting message on the screen and then to read from the keyboard the characters that define the two files. On mouse-based and icon-based systems, a menu of file names is usually displayed in a window. The user can then use the mouse to select the source name, and a window can be opened for the destination name to be specified. This sequence requires many I/O system calls.

Once the two file names are obtained, the program must open the input file and create the output file. Each of these operations requires another system call. There are also possible error conditions for each operation. When the program tries to open the input file, it may find that there is no file of that name or that the file is protected against access. In these cases, the program should print a message on the console (another sequence of system calls) and then terminate abnormally (another system call). If the input file exists, then we must create a new output file. We may find that there is already an output file with the same name. This situation may cause the program to abort (a system call), or we may delete the existing file (another system call) and create a new one (another system call).
Another option, in an interactive system, is to ask the user (via a sequence of system calls to output the prompting message and to read the response from the terminal) whether to replace the existing file or to abort the program.

Now that both files are set up, we enter a loop that reads from the input file (a system call) and writes to the output file (another system call). Each read and write must return status information regarding various possible error conditions. On input, the program may find that the end of the file has been reached or that there was a hardware failure in the read (such as a parity error). The write operation may encounter various errors, depending on the output device (no more disk space, printer out of paper, and so on).

Finally, after the entire file is copied, the program may close both files (another system call), write a message to the console or window (more system calls), and finally terminate normally (the final system call). This system-call sequence is shown in Figure 2.4. As we can see, even simple programs may make heavy use of the operating system. Frequently, systems execute thousands of system calls per second.

Figure 2.4 Example of using system calls (copying data from a source file to a destination file):

    Acquire input file name (write prompt to screen; accept input)
    Acquire output file name (write prompt to screen; accept input)
    Open the input file (if the file doesn't exist, abort)
    Create the output file (if the file exists, abort)
    Loop: read from the input file and write to the output file, until a read fails
    Close the output file
    Write completion message to screen
    Terminate normally

Most programmers never see this level of detail, however. Typically, application developers design programs according to an application programming interface (API). The API specifies a set of functions that are available to an application programmer, including the parameters that are passed to each function and the return values the programmer can expect. Three of the most common APIs available to application programmers are the Win32 API for Windows systems, the POSIX API for POSIX-based systems (which include virtually all versions of UNIX, Linux, and Mac OS X), and the Java API for designing programs that run on the Java virtual machine. Note that—unless specified otherwise—the system-call names used throughout this text are generic examples. Each operating system has its own name for each system call.

Behind the scenes, the functions that make up an API typically invoke the actual system calls on behalf of the application programmer. For example, the Win32 function CreateProcess() (which unsurprisingly is used to create a new process) actually calls the NTCreateProcess() system call in the Windows kernel. Why would an application programmer prefer programming according to an API rather than invoking actual system calls? There are several reasons for doing so. One benefit of programming according to an API concerns program portability. An application programmer designing a program using an API can expect her program to compile and run on any system that supports the same API (although in reality, architectural differences often make this more difficult than it may appear). Furthermore, actual system calls can often be more detailed and difficult to work with than the API available to an application programmer. Regardless, there often exists a strong correlation between a function in the API and its associated system call within the kernel.
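To make the sequence in Figure 2.4 concrete, here is a sketch of such a copy program written in Java against the Java API rather than against raw system calls. Each Java I/O call below is ultimately serviced by system calls (on a UNIX-like system, calls such as open(), read(), write(), and close()); the class name and messages are our own.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class Copy {
        public static void main(String[] args) {
            if (args.length != 2) {
                System.err.println("usage: java Copy <input> <output>");
                System.exit(1);                                // terminate abnormally
            }
            try (FileInputStream inFile = new FileInputStream(args[0]);      // open input file
                 FileOutputStream outFile = new FileOutputStream(args[1])) { // create output file
                byte[] buffer = new byte[4096];
                int count;
                while ((count = inFile.read(buffer)) != -1) {  // loop: read until EOF
                    outFile.write(buffer, 0, count);           // write to output file
                }
            } catch (IOException e) {                          // e.g., no such file, protection
                System.err.println("copy failed: " + e.getMessage());
                System.exit(1);
            }
            System.out.println("copy complete");               // write message to screen
        }   // both files are closed automatically by try-with-resources
    }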
EXAMPLE OF STANDARD API

As an example of a standard API, consider the write() method in the java.io.OutputStream class in the Java API, which allows writing to a file or network connection. The API for this method appears in Figure 2.5:

    public void write(byte[] b, int off, int len) throws IOException

Figure 2.5 The API for the write() method, labeling the visibility modifier, return value, method name, parameters, and exception.

This method returns a void—that is, no return value. An IOException is thrown if an I/O error occurs. The parameters passed to write() can be described as follows:

• byte[] b—the data to be written.
• int off—the starting offset in the array b to be written.
• int len—the number of bytes to be written.

In fact, many of the POSIX and Win32 APIs are similar to the native system calls provided by the UNIX, Linux, and Windows operating systems.

The run-time support system (a set of functions built into libraries included with a compiler) for most programming languages provides a system-call interface that serves as the link to system calls made available by the operating system. The system-call interface intercepts function calls in the API and invokes the necessary system calls within the operating system. Typically, a number is associated with each system call, and the system-call interface maintains a table indexed according to these numbers. The system-call interface then invokes the intended system call in the operating-system kernel and returns the status of the system call and any return values.

The caller need know nothing about how the system call is implemented or what it does during execution. Rather, it need only obey the API and understand what the operating system will do as a result of the execution of that system call. Thus, most of the details of the operating-system interface are hidden from the programmer by the API and are managed by the run-time support library. The relationship among an API, the system-call interface, and the operating system is shown in Figure 2.6, which illustrates how the operating system handles a user application invoking the open() system call.

Figure 2.6 The handling of a user application invoking the open() system call.

System calls occur in different ways, depending on the computer in use. Often, more information is required than simply the identity of the desired system call. The exact type and amount of information vary according to the particular operating system and call. For example, to get input, we may need to specify the file or device to use as the source, as well as the address and length of the memory buffer into which the input should be read. Of course, the device or file and length may be implicit in the call.

Three general methods are used to pass parameters to the operating system. The simplest approach is to pass the parameters in registers. In some cases, however, there may be more parameters than registers. In these cases, the parameters are generally stored in a block, or table, in memory, and the address of the block is passed as a parameter in a register (Figure 2.7). This is the approach taken by Linux and Solaris. Parameters also can be placed, or pushed, onto the stack by the program and popped off the stack by the operating system. Some operating systems prefer the block or stack method because those approaches do not limit the number or length of parameters being passed.

Figure 2.7 Passing of parameters as a table.
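Returning to the write() method documented in the example box above, a short usage sketch follows; the file name and message are invented for illustration.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    // Write a slice of a byte array: off = 7, len = 12 selects "system calls".
    public class WriteDemo {
        public static void main(String[] args) throws IOException {
            byte[] b = "Hello, system calls".getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = new FileOutputStream("demo.txt")) {
                out.write(b, 7, 12);   // ultimately serviced by a write() system call
            }                          // closing the stream issues another system call
        }
    }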
Because Java is intended to run on platform-neutral systems, it is not possible to make system calls directly from a Java program. However, it is possible for a Java method to invoke C or C++ code that is native to the underlying platform on which the program is running (for example, Windows Vista or Linux). The C/C++ code can invoke a system call on the host system, thus allowing a Java program to make the system call indirectly. This is accomplished through the Java Native Interface (JNI), which allows a Java method to be declared native. This native Java method is used as a placeholder for the actual C/C++ function. Thus, calling the native Java method actually invokes the function written in C or C++. Obviously, a Java program that uses native methods is not considered portable from one system to another.
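A minimal sketch of the Java side of this mechanism appears below. The method name getpid and the library name syscalls are hypothetical; the matching C function would be compiled separately into a native library for each platform.

    // Declares a native placeholder for a C function that would invoke
    // the getpid() system call on a UNIX-like host.
    public class NativePid {
        public static native int getpid();   // body supplied by native code

        static {
            System.loadLibrary("syscalls");  // loads libsyscalls.so or syscalls.dll
        }

        public static void main(String[] args) {
            System.out.println("pid = " + getpid());
        }
    }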
2.4 Types of System Calls

System calls can be grouped roughly into six major categories: process control, file manipulation, device manipulation, information maintenance, communications, and protection. In Sections 2.4.1 through 2.4.6, we discuss briefly the types of system calls that may be provided by an operating system. Most of these system calls support, or are supported by, concepts and functions that are discussed in later chapters. Figure 2.8 summarizes the types of system calls normally provided by an operating system.

• Process control
  ◦ end, abort
  ◦ load, execute
  ◦ create process, terminate process
  ◦ get process attributes, set process attributes
  ◦ wait for time
  ◦ wait event, signal event
  ◦ allocate and free memory
• File management
  ◦ create file, delete file
  ◦ open, close
  ◦ read, write, reposition
  ◦ get file attributes, set file attributes
• Device management
  ◦ request device, release device
  ◦ read, write, reposition
  ◦ get device attributes, set device attributes
  ◦ logically attach or detach devices
• Information maintenance
  ◦ get time or date, set time or date
  ◦ get system data, set system data
  ◦ get process, file, or device attributes
  ◦ set process, file, or device attributes
• Communications
  ◦ create, delete communication connection
  ◦ send, receive messages
  ◦ transfer status information
  ◦ attach or detach remote devices

Figure 2.8 Types of system calls.

2.4.1 Process Control

A running program needs to be able to halt its execution either normally (end) or abnormally (abort). If a system call is made to terminate the currently running program abnormally, or if the program runs into a problem and causes an error trap, a dump of memory is sometimes taken and an error message generated. The dump is written to disk and may be examined by a debugger—a system program designed to aid the programmer in finding and correcting bugs—to determine the cause of the problem. Under either normal or abnormal circumstances, the operating system must transfer control to the invoking command interpreter. The command interpreter then reads the next command. In an interactive system, the command interpreter simply continues with the next command; it is assumed that the user will issue an appropriate command to respond to any error. In a GUI system, a pop-up window might alert the user to the error and ask for guidance. In a batch system, the command interpreter usually terminates the entire job and continues with the next job. Some systems allow control cards to indicate special recovery actions in case an error occurs. A control card is a batch-system concept. It is a command to manage the execution of a process. If the program discovers an error in its input and wants to terminate abnormally, it may also want to define an error level. More severe errors can be indicated by a higher-level error parameter. It is then possible to combine normal and abnormal termination by defining a normal termination as an error at level 0. The command interpreter or a following program can use this error level to determine the next action automatically.

A process or job executing one program may want to load and execute another program. This feature allows the command interpreter to execute a program as directed by, for example, a user command, the click of a mouse, or a batch command. An interesting question is where to return control when the loaded program terminates. This question is related to the problem of whether the existing program is lost, saved, or allowed to continue execution concurrently with the new program.

EXAMPLES OF WINDOWS AND UNIX SYSTEM CALLS

                             Windows                           UNIX
  Process control            CreateProcess()                   fork()
                             ExitProcess()                     exit()
                             WaitForSingleObject()             wait()
  File manipulation          CreateFile()                      open()
                             ReadFile()                        read()
                             WriteFile()                       write()
                             CloseHandle()                     close()
  Device manipulation        SetConsoleMode()                  ioctl()
                             ReadConsole()                     read()
                             WriteConsole()                    write()
  Information maintenance    GetCurrentProcessID()             getpid()
                             SetTimer()                        alarm()
                             Sleep()                           sleep()
  Communication              CreatePipe()                      pipe()
                             CreateFileMapping()               shmget()
                             MapViewOfFile()                   mmap()
  Protection                 SetFileSecurity()                 chmod()
                             InitializeSecurityDescriptor()    umask()
                             SetSecurityDescriptorGroup()      chown()

If control returns to the existing program when the new program terminates, we must save the memory image of the existing program; thus, we have effectively created a mechanism for one program to call another program. If both programs continue concurrently, we have created a new job or process to be multiprogrammed. Often, there is a system call specifically for this purpose (create process or submit job).

If we create a new job or process, or perhaps even a set of jobs or processes, we should be able to control its execution. This control requires the ability to determine and reset the attributes of a job or process, including the job's priority, its maximum allowable execution time, and so on (get process attributes and set process attributes). We may also want to terminate a job or process that we created (terminate process) if we find that it is incorrect or is no longer needed.

Having created new jobs or processes, we may need to wait for them to finish their execution. We may want to wait for a certain amount of time to pass
(wait time); more probably, we will want to wait for a specific event to occur (wait event). The jobs or processes should then signal when that event has occurred (signal event).

EXAMPLE OF STANDARD C LIBRARY

The standard C library provides a portion of the system-call interface for many versions of UNIX and Linux. As an example, let's assume a C program invokes the printf() statement. The C library intercepts this call and invokes the necessary system call (or calls) in the operating system—in this instance, the write() system call. The C library takes the value returned by write() and passes it back to the user program. This is shown in Figure 2.9.

Figure 2.9 Standard C library handling of write().

Quite often, two or more processes share data. To ensure the integrity of the data being shared, operating systems often provide system calls allowing a process to lock shared data, thus preventing another process from accessing the data until the lock is removed. Typically such system calls include acquire lock and release lock. System calls of these types, dealing with the coordination of concurrent processes, are discussed in great detail in Chapter 6.

There are so many facets of and variations in process and job control that we next use two examples—one involving a single-tasking system and the other a multitasking system—to clarify these concepts. The MS-DOS operating system is an example of a single-tasking system. It has a command interpreter that is invoked when the computer is started (Figure 2.10a). Because MS-DOS is single-tasking, it uses a simple method to run a program and does not create a new process. It loads the program into memory, writing over most of itself to give the program as much memory as possible (Figure 2.10b). Next, it sets the instruction pointer to the first instruction of the program. The program then runs, and either an error causes a trap, or the program executes a system call to terminate. In either case, the error code is saved in the system memory for later use. Following this action, the small portion of the command interpreter that was not overwritten resumes execution. Its first task is to reload the rest of the command interpreter from disk. Then the command interpreter makes the previous error code available to the user or to the next program.

Figure 2.10 MS-DOS execution. (a) At system startup. (b) Running a program.

FreeBSD (derived from Berkeley UNIX) is an example of a multitasking system. When a user logs on to the system, the shell of the user's choice is run. This shell is similar to the MS-DOS shell in that it accepts commands and executes programs that the user requests. However, since FreeBSD is a multitasking system, the command interpreter may continue running while another program is executed (Figure 2.11).

Figure 2.11 FreeBSD running multiple programs.

To start a new process, the shell executes a fork() system call. Then, the selected program is loaded into memory via an exec() system call, and the program is executed. Depending on the way the command was issued, the shell then either waits for the process to finish or runs the process "in the background." In the latter case, the shell immediately requests another command. When a process is running in the background, it cannot receive input directly from the keyboard, because the shell is using this resource. I/O is therefore done through files or through a GUI interface. Meanwhile, the user is free to ask the shell to run other programs, to monitor the progress of the running process, to change that program's priority, and so on. When the process is done, it executes an exit() system call to terminate, returning to the invoking process a status code of 0 or a nonzero error code. This status or error code is then available to the shell or other programs. Processes are discussed in Chapter 3, which includes a program example using the fork() and exec() system calls.
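Java has no direct fork() or exec(), but a rough user-level analogue of the foreground/background behavior just described can be sketched with ProcessBuilder. The command run here is arbitrary, and Process.pid() requires Java 9 or later.

    import java.io.IOException;

    // Run a program either in the foreground (waiting for it and collecting
    // its exit status, as the shell's wait does) or in the background.
    public class Launch {
        public static void main(String[] args) throws IOException, InterruptedException {
            boolean background = args.length > 0 && args[0].equals("&");

            ProcessBuilder pb = new ProcessBuilder("ls", "-l");
            pb.inheritIO();
            Process p = pb.start();              // the OS creates and runs the process

            if (background) {
                System.out.println("running in background, pid " + p.pid());
            } else {
                int status = p.waitFor();        // like wait(): 0 means normal termination
                System.out.println("exit status: " + status);
            }
        }
    }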
2.4.2 File Management

The file system is discussed in more detail in Chapters 10 and 11. We can, however, identify several common system calls dealing with files.

We first need to be able to create and delete files. Either system call requires the name of the file and perhaps some of the file's attributes. Once the file is created, we need to open it and to use it. We may also read, write, or reposition (rewinding or skipping to the end of the file, for example). Finally, we need to close the file, indicating that we are no longer using it. We may need these same sets of operations for directories if we have a directory structure for organizing files in the file system. In addition, for either files or directories, we need to be able to determine the values of various attributes and perhaps reset them if necessary. File attributes include the file name, file type, protection codes, accounting information, and so on. At least two system calls, get file attribute and set file attribute, are required for this function. Some operating systems provide many more calls, such as calls for file move and copy. Others might provide an API that performs those operations using code and other system calls, and others might just provide system programs to perform those tasks. If the system programs are callable by other programs, then each can be considered an API by other system programs.
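As an illustration, the following sketch maps several of the file-management calls just named onto the Java API: create/open, write, reposition, read, and close. The file name is invented.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FileOps {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile("data.bin", "rw")) { // create/open
                f.write(new byte[] {1, 2, 3, 4});   // write
                f.seek(2);                          // reposition
                int b = f.read();                   // read: returns the byte at offset 2
                System.out.println("byte at offset 2: " + b);
            }                                       // close
        }
    }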
2.4.3 Device Management

A process may need several resources to execute—main memory, disk drives, access to files, and so on. If the resources are available, they can be granted, and control can be returned to the user process. Otherwise, the process will have to wait until sufficient resources are available.

The various resources controlled by the operating system can be thought of as devices. Some of these devices are physical devices (for example, disk drives), while others can be thought of as abstract or virtual devices (for example, files). A system with multiple users may require us first to request the device, to ensure exclusive use of it. After we are finished with the device, we release it. These functions are similar to the open and close system calls for files. Other operating systems allow unmanaged access to devices. The hazard then is the potential for device contention and perhaps deadlock, which is described in Chapter 7.

Once the device has been requested (and allocated to us), we can read, write, and (possibly) reposition the device, just as we can with files. In fact, the similarity between I/O devices and files is so great that many operating systems, including UNIX, merge the two into a combined file–device structure. In this case, a set of system calls is used on both files and devices. Sometimes, I/O devices are identified by special file names, directory placement, or file attributes. The user interface can also make files and devices appear to be similar, even though the underlying system calls are dissimilar. This is another example of the many design decisions that go into building an operating system and user interface.

2.4.4 Information Maintenance

Many system calls exist simply for the purpose of transferring information between the user program and the operating system. For example, most systems have a system call to return the current time and date. Other system calls may return information about the system, such as the number of current users, the version number of the operating system, the amount of free memory or disk space, and so on.

Another set of system calls is helpful in debugging a program. Many systems provide system calls to dump memory. This provision is useful for debugging. A program trace lists each system call as it is executed. Even microprocessors provide a CPU mode known as single step, in which a trap is executed by the CPU after every instruction. The trap is usually caught by a debugger. Many operating systems provide a time profile of a program to indicate the amount of time that the program executes at a particular location or set of locations. A time profile requires either a tracing facility or regular timer interrupts. At every occurrence of the timer interrupt, the value of the program counter is recorded. With sufficiently frequent timer interrupts, a statistical picture of the time spent on various parts of the program can be obtained.

In addition, the operating system keeps information about all its processes, and system calls are used to access this information. Generally, calls are also used to reset the process information (get process attributes and set process attributes). In Section 3.1.3, we discuss what information is normally kept.

2.4.5 Communication

There are two common models of interprocess communication: the message-passing model and the shared-memory model. In the message-passing model, the communicating processes exchange messages with one another to transfer information. Messages can be exchanged between the processes either directly or indirectly through a common mailbox. Before communication can take place, a connection must be opened. The name of the other communicator must be known, be it another process on the same system or a process on another computer connected by a communications network. Each computer in a network has a host name by which it is commonly known. A host also has a network identifier, such as an IP address. Similarly, each process has a process name, and this name is translated into an identifier by which the operating system can refer to the process. The get hostid and get processid system calls do this translation. The identifiers are then passed to the general-purpose open and close calls provided by the file system or to specific open connection and close connection system calls, depending on the system's model of communication. The recipient process usually must give its permission for communication to take place with an accept connection call. Most processes that will be receiving connections are special-purpose daemons, which are system programs provided for that purpose. They execute a wait for connection call and are awakened when a connection is made. The source of the communication, known as the client, and the receiving daemon, known as a server, then exchange messages by using read message and write message system calls. The close connection call terminates the communication.
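The daemon side of this exchange can be sketched with Java sockets, where accept() plays the role of the wait-for-connection call and the stream reads and writes stand in for read message and write message. The port number is arbitrary.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    // A one-shot echo daemon: wait for a connection, read a message,
    // write a reply, and close the connection.
    public class EchoServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(5555)) {
                try (Socket client = server.accept();       // wait for connection
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String msg = in.readLine();              // read message
                    out.println("echo: " + msg);             // write message
                }                                            // close connection
            }
        }
    }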
In the shared-memory model, processes use shared memory create and shared memory attach system calls to create and gain access to regions of memory owned by other processes. Recall that, normally, the operating system tries to prevent one process from accessing another process's memory. Shared memory requires that two or more processes agree to remove this restriction. They can then exchange information by reading and writing data in the shared areas. The form of the data is determined by the processes and is not under the operating system's control. The processes are also responsible for ensuring that they are not writing to the same location simultaneously. Such mechanisms are discussed in Chapter 6. In Chapter 4, we look at a variation of the process scheme—threads—in which memory is shared by default.

Both of the models just discussed are common in operating systems, and most systems implement both. Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It is also easier to implement than is shared memory for intercomputer communication. Shared memory allows maximum speed and convenience of communication, since it can be done at memory transfer speeds when it takes place within a computer. Problems exist, however, in the areas of protection and synchronization among the processes sharing memory.

2.4.6 Protection

Protection provides a mechanism for controlling access to a computer system's resources. Historically, protection was a concern only on multiprogrammed computer systems with several users. However, with the advent of networking and the Internet, all computer systems, from servers to PDAs, must be concerned with protection.

Typically, system calls providing protection include set permission and get permission, which manipulate the permission settings of resources such as files and disks. The allow user and deny user system calls specify whether particular users can—or cannot—be allowed access to certain resources. We cover protection in Chapter 14 and the much larger issue of security in Chapter 15.

2.5 System Programs

Another aspect of a modern system is the collection of system programs. Recall Figure 1.1, which depicts the logical computer hierarchy. At the lowest level is hardware. Next is the operating system, then the system programs, and finally the application programs. System programs, also known as system utilities, provide a convenient environment for program development and execution. Some of them are simply user interfaces to system calls; others are considerably more complex. They can be divided into these categories:

• File management. These programs create, delete, copy, rename, print, dump, list, and generally manipulate files and directories.

• Status information. Some programs simply ask the system for the date, time, amount of available memory or disk space, number of users, or similar status information. Others are more complex, providing detailed performance, logging, and debugging information. Typically, these programs format and print the output to the terminal or other output devices or files or display it in a window of the GUI. Some systems also support a registry, which is used to store and retrieve configuration information.

• File modification. Several text editors may be available to create and modify the content of files stored on disk or other storage devices. There may also be special commands to search contents of files or perform transformations of the text.
• Programming-language support. Compilers, assemblers, debuggers, and interpreters for common programming languages (such as C, C++, Java, Visual Basic, and PERL) are often provided to the user with the operating system.

• Program loading and execution. Once a program is assembled or compiled, it must be loaded into memory to be executed. The system may provide absolute loaders, relocatable loaders, linkage editors, and overlay loaders. Debugging systems for either higher-level languages or machine language are needed as well.

• Communications. These programs provide the mechanism for creating virtual connections among processes, users, and computer systems. They allow users to send messages to one another's screens, to browse Web pages, to send e-mail messages, to log in remotely, or to transfer files from one machine to another.

In addition to system programs, most operating systems are supplied with programs that are useful in solving common problems or performing common operations. Such application programs include Web browsers, word processors and text formatters, spreadsheets, database systems, compilers, plotting and statistical-analysis packages, and games.

The view of the operating system seen by most users is defined by the application and system programs, rather than by the actual system calls. Consider a user's PC. When a user's computer is running the Mac OS X operating system, the user might see the GUI, featuring a mouse-and-windows interface. Alternatively, or even in one of the windows, the user might have a command-line UNIX shell. Both use the same set of system calls, but the system calls look different and act in different ways. Further confusing the user view, consider the user dual-booting Mac OS X and Windows Vista. Now the same user on the same hardware has two entirely different interfaces and two sets of applications using the same physical resources. On the same hardware, then, a user can be exposed to multiple user interfaces sequentially or concurrently.

2.6 Operating-System Design and Implementation

In this section, we discuss problems we face in designing and implementing an operating system. There are, of course, no complete solutions to such problems, but there are approaches that have proved successful.

2.6.1 Design Goals

The first problem in designing a system is to define goals and specifications. At the highest level, the design of the system will be affected by the choice of hardware and the type of system: batch, time shared, single user, multiuser, distributed, real time, or general purpose. Beyond this highest design level, the requirements may be much harder to specify. The requirements can, however, be divided into two basic groups: user goals and system goals.

Users want certain obvious properties in a system. The system should be convenient to use, easy to learn and to use, reliable, safe, and fast. Of course, these specifications are not particularly useful in the system design, since there is no general agreement on how to achieve them.

A similar set of requirements can be defined by those people who must design, create, maintain, and operate the system. The system should be easy to design, implement, and maintain; and it should be flexible, reliable, error free, and efficient. Again, these requirements are vague and may be interpreted in various ways. There is, in short, no unique solution to the problem of defining the requirements for an operating system.
The wide range of systems in existence shows that different requirements can result in a large variety of solutions for different environments. For example, the requirements for VxWorks, a real-time operating system for embedded systems, must have been substantially different from those for MVS, a large multiuser, multiaccess operating system for IBM mainframes.

Specifying and designing an operating system is a highly creative task. Although no textbook can tell you how to do it, general principles have been developed in the field of software engineering, and we turn now to a discussion of some of these principles.

2.6.2 Mechanisms and Policies

One important principle is the separation of policy from mechanism. Mechanisms determine how to do something; policies determine what will be done. For example, the timer construct (see Section 1.5.2) is a mechanism for ensuring CPU protection, but deciding how long the timer is to be set for a particular user is a policy decision.

The separation of policy and mechanism is important for flexibility. Policies are likely to change across places or over time. In the worst case, each change in policy would require a change in the underlying mechanism. A general mechanism insensitive to changes in policy would be more desirable. A change in policy would then require redefinition of only certain parameters of the system. For instance, consider a mechanism for giving priority to certain types of programs over others. If the mechanism is properly separated from policy, it can be used either to support a policy decision that I/O-intensive programs should have priority over CPU-intensive ones or to support the opposite policy.

Microkernel-based operating systems (Section 2.7.3) take the separation of mechanism and policy to one extreme by implementing a basic set of primitive building blocks. These blocks are almost policy free, allowing more advanced mechanisms and policies to be added via user-created kernel modules or via user programs themselves. As an example, consider the history of UNIX. At first, it had a time-sharing scheduler. In the latest version of Solaris, scheduling is controlled by loadable tables. Depending on the table currently loaded, the system can be time shared, batch processing, real time, fair share, or any combination. Making the scheduling mechanism general purpose allows vast policy changes to be made with a single load-new-table command. At the other extreme is a system such as Windows, in which both mechanism and policy are encoded in the system to enforce a global look and feel. All applications have similar interfaces, because the interface itself is built into the kernel and system libraries. The Mac OS X operating system has similar functionality.

Policy decisions are important for all resource allocation. Whenever it is necessary to decide whether or not to allocate a resource, a policy decision must be made. Whenever the question is how rather than what, it is a mechanism that must be determined.
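The idea can be sketched in a few lines of Java. In this invented example, the dispatcher is the mechanism (it only knows how to pick the next task given some ordering), while the Comparator is the policy, which can be swapped without touching the mechanism.

    import java.util.Comparator;
    import java.util.List;

    class Task {
        final String name; final int priority; final boolean ioBound;
        Task(String name, int priority, boolean ioBound) {
            this.name = name; this.priority = priority; this.ioBound = ioBound;
        }
    }

    public class Dispatcher {
        // Mechanism: how to choose the next task, given some ordering.
        static Task pickNext(List<Task> ready, Comparator<Task> policy) {
            return ready.stream().min(policy).orElseThrow();
        }

        public static void main(String[] args) {
            List<Task> ready = List.of(new Task("editor", 3, true),
                                       new Task("compile", 5, false));

            // Two interchangeable policies: favor I/O-bound tasks, or high priority.
            Comparator<Task> favorIoBound  = Comparator.comparing(t -> !t.ioBound);
            Comparator<Task> favorPriority = Comparator.comparingInt(t -> -t.priority);

            System.out.println(pickNext(ready, favorIoBound).name);   // editor
            System.out.println(pickNext(ready, favorPriority).name);  // compile
        }
    }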
2.6.3 Implementation

Once an operating system is designed, it must be implemented. Traditionally, operating systems have been written in assembly language. Now, however, they are most commonly written in a higher-level language such as C or C++.

The first system that was not written in assembly language was probably the Master Control Program (MCP) for Burroughs computers. MCP was written in a variant of ALGOL. MULTICS, developed at MIT, was written mainly in PL/1. The Linux and Windows XP operating systems are written mostly in C, although there are some small sections of assembly code for device drivers and for saving and restoring the state of registers.

The advantages of using a higher-level language, or at least a systems-implementation language, for implementing operating systems are the same as those accrued when the language is used for application programs: the code can be written faster, is more compact, and is easier to understand and debug. In addition, improvements in compiler technology will improve the generated code for the entire operating system by simple recompilation. Finally, an operating system is far easier to port—to move to some other hardware—if it is written in a higher-level language. For example, MS-DOS was written in Intel 8088 assembly language. Consequently, it runs natively only on the Intel X86 family of CPUs. (Although MS-DOS runs natively only on Intel X86, emulators of the X86 instruction set allow the operating system to run non-natively—more slowly, with more resource use—on other CPUs. Emulators are programs that duplicate the functionality of one system in another system.) The Linux operating system, in contrast, is written mostly in C and is available natively on a number of different CPUs, including Intel X86, Sun SPARC, and IBM PowerPC.

The only possible disadvantages of implementing an operating system in a higher-level language are reduced speed and increased storage requirements. This, however, is no longer a major issue in today's systems. Although an expert assembly-language programmer can produce efficient small routines, for large programs a modern compiler can perform complex analysis and apply sophisticated optimizations that produce excellent code. Modern processors have deep pipelining and multiple functional units that can handle the details of complex dependencies much more easily than can the human mind.

As is true in other systems, major performance improvements in operating systems are more likely to be the result of better data structures and algorithms than of excellent assembly-language code. In addition, although operating systems are large, only a small amount of the code is critical to high performance; the memory manager and the CPU scheduler are probably the most critical routines. After the system is written and is working correctly, bottleneck routines can be identified and can be replaced with assembly-language equivalents. (Bottlenecks are discussed further later in this chapter.)

2.7 Operating-System Structure

A system as large and complex as a modern operating system must be engineered carefully if it is to function properly and be modified easily. A common approach is to partition the task into small components rather than have one monolithic system. Each of these modules should be a well-defined portion of the system, with carefully defined inputs, outputs, and functions. We have already discussed briefly in Chapter 1 the common components of operating systems. In this section, we discuss how these components are interconnected and melded into a kernel.

2.7.1 Simple Structure

Many commercial operating systems do not have well-defined structures. Frequently, such systems started as small, simple, and limited systems and then grew beyond their original scope. MS-DOS is an example of such a system. It was originally designed and implemented by a few people who had no idea that it would become so popular.
It was written to provide the most functionality in the least space, so it was not divided into modules carefully. Figure 2.12 shows its structure.

Figure 2.12 MS-DOS layer structure.

In MS-DOS, the interfaces and levels of functionality are not well separated. For instance, application programs are able to access the basic I/O routines to write directly to the display and disk drives. Such freedom leaves MS-DOS vulnerable to errant (or malicious) programs, causing entire systems to crash when user programs fail. Of course, MS-DOS was also limited by the hardware of its era. Because the Intel 8088 for which it was written provides no dual mode and no hardware protection, the designers of MS-DOS had no choice but to leave the base hardware accessible.

Another example of limited structuring is the original UNIX operating system. Like MS-DOS, UNIX initially was limited by hardware functionality. It consists of two separable parts: the kernel and the system programs. The kernel is further separated into a series of interfaces and device drivers, which have been added and expanded over the years as UNIX has evolved. We can view the traditional UNIX operating system as being layered, as shown in Figure 2.13. Everything below the system-call interface and above the physical hardware is the kernel. The kernel provides the file system, CPU scheduling, memory management, and other operating-system functions through system calls. Taken in sum, that is an enormous amount of functionality to be combined into one level. This monolithic structure was difficult to implement and maintain.

Figure 2.13 Traditional UNIX system structure.

2.7.2 Layered Approach

With proper hardware support, operating systems can be broken into pieces that are smaller and more appropriate than those allowed by the original MS-DOS and UNIX systems. The operating system can then retain much greater control over the computer and over the applications that make use of that computer. Implementers have more freedom in changing the inner workings of the system and in creating modular operating systems. Under a top-down approach, the overall functionality and features are determined and are separated into components. Information hiding is also important, because it leaves programmers free to implement the low-level routines as they see fit, provided that the external interface of the routine stays unchanged and that the routine itself performs the advertised task.

A system can be made modular in many ways. One method is the layered approach, in which the operating system is broken into a number of layers (levels). The bottom layer (layer 0) is the hardware; the highest (layer N) is the user interface. This layering structure is depicted in Figure 2.14.

Figure 2.14 A layered operating system.

An operating-system layer is an implementation of an abstract object made up of data and the operations that can manipulate those data. A typical operating-system layer—say, layer M—consists of data structures and a set of routines that can be invoked by higher-level layers. Layer M, in turn, can invoke operations on lower-level layers.
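An illustrative sketch of this rule in Java follows (all names are invented): layer M exposes operations to the layer above and implements them using only the layer below.

    // Layer M-1: the backing store, visible to layer M only through this interface.
    interface BlockDevice {
        byte[] readBlock(int blockNumber);
    }

    // Layer M: memory management, built strictly on layer M-1.
    class MemoryManager {
        private final BlockDevice backingStore;

        MemoryManager(BlockDevice backingStore) {
            this.backingStore = backingStore;
        }

        // Higher-level layers call this; how the page is fetched is hidden here.
        byte[] pageIn(int pageNumber) {
            return backingStore.readBlock(pageNumber);
        }
    }

    public class LayerDemo {
        public static void main(String[] args) {
            BlockDevice disk = block -> new byte[] {(byte) block};  // stub lower layer
            MemoryManager mm = new MemoryManager(disk);
            System.out.println(mm.pageIn(7)[0]);                    // prints 7
        }
    }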
The main advantage of the layered approach is simplicity of construction and debugging. The layers are selected so that each uses functions (operations) and services of only lower-level layers. This approach simplifies debugging and system verification. The first layer can be debugged without any concern for the rest of the system, because, by definition, it uses only the basic hardware (which is assumed correct) to implement its functions. Once the first layer is debugged, its correct functioning can be assumed while the second layer is debugged, and so on. If an error is found during the debugging of a particular layer, the error must be on that layer, because the layers below it are already debugged. Thus, the design and implementation of the system are simplified.

Each layer is implemented with only those operations provided by lower-level layers. A layer does not need to know how these operations are implemented; it needs to know only what these operations do. Hence, each layer hides the existence of certain data structures, operations, and hardware from higher-level layers.

The major difficulty with the layered approach involves appropriately defining the various layers. Because a layer can use only lower-level layers, careful planning is necessary. For example, the device driver for the backing store (disk space used by virtual-memory algorithms) must be at a lower level than the memory-management routines, because memory management requires the ability to use the backing store. Other requirements may not be so obvious. The backing-store driver would normally be above the CPU scheduler, because the driver may need to wait for I/O and the CPU can be rescheduled during this time. However, on a large system, the CPU scheduler may have more information about all the active processes than can fit in memory. Therefore, this information may need to be swapped in and out of memory, requiring the backing-store driver routine to be below the CPU scheduler.

A final problem with layered implementations is that they tend to be less efficient than other types. For instance, when a user program executes an I/O operation, it executes a system call that is trapped to the I/O layer, which calls the memory-management layer, which in turn calls the CPU-scheduling layer, which is then passed to the hardware. At each layer, the parameters may be modified, data may need to be passed, and so on. Each layer adds overhead to the system call; the net result is a system call that takes longer than does one on a nonlayered system.

These limitations have caused a small backlash against layering in recent years. Fewer layers with more functionality are being designed, providing most of the advantages of modularized code while avoiding the difficult problems of layer definition and interaction.

2.7.3 Microkernels

We have already seen that as UNIX expanded, the kernel became large and difficult to manage. In the mid-1980s, researchers at Carnegie Mellon University developed an operating system called Mach that modularized the kernel using the microkernel approach.
This method structures the operating system by removing all nonessential components from the kernel and implementing them as system and user-level programs. The result is a smaller kernel. There is little consensus regarding which services should remain in the kernel and which should be implemented in user space. Typically, however, microkernels provide minimal process and memory management, in addition to a communication facility.

The main function of the microkernel is to provide a communication facility between the client program and the various services that are also running in user space. Communication is provided by message passing, which was described in Section 2.4.5. For example, if the client program wishes to access a file, it must interact with the file server. The client program and service never interact directly. Rather, they communicate indirectly by exchanging messages with the microkernel.

One benefit of the microkernel approach is ease of extending the operating system. All new services are added to user space and consequently do not require modification of the kernel. When the kernel does have to be modified, the changes tend to be fewer, because the microkernel is a smaller kernel. The resulting operating system is easier to port from one hardware design to another. The microkernel also provides more security and reliability, since most services are running as user—rather than kernel—processes. If a service fails, the rest of the operating system remains untouched.

Several contemporary operating systems have used the microkernel approach. Tru64 UNIX (formerly Digital UNIX) provides a UNIX interface to the user, but it is implemented with a Mach kernel. The Mach kernel maps UNIX system calls into messages to the appropriate user-level services. The Mac OS X kernel (also known as Darwin) is also based on the Mach microkernel.

Another example is QNX, a real-time operating system. The QNX microkernel provides services for message passing and process scheduling. It also handles low-level network communication and hardware interrupts. All other services in QNX are provided by standard processes that run outside the kernel in user mode.

Unfortunately, microkernels can suffer from performance decreases due to increased system function overhead. Consider the history of Windows NT. The first release had a layered microkernel organization. However, this version delivered low performance compared with that of Windows 95. Windows NT 4.0 partially redressed the performance problem by moving layers from user space to kernel space and integrating them more closely. By the time Windows XP was designed, its architecture was more monolithic than microkernel.

2.7.4 Modules

Perhaps the best current methodology for operating-system design involves using object-oriented programming techniques to create a modular kernel. Here, the kernel has a set of core components and links in additional services either during boot time or during run time. Such a strategy uses dynamically loadable modules and is common in modern implementations of UNIX, such as Solaris, Linux, and Mac OS X. For example, the Solaris operating system structure, shown in Figure 2.15, is organized around a core kernel with seven types of loadable kernel modules:

1. Scheduling classes
2. File systems
3. Loadable system calls
4. Executable formats
5. STREAMS modules
6. Miscellaneous
7. Device and bus drivers

Figure 2.15 Solaris loadable modules.

Such a design allows the kernel to provide core services yet also allows certain features to be implemented dynamically. For example, device and bus drivers for specific hardware can be added to the kernel, and support for different file systems can be added as loadable modules. The overall result resembles a layered system in that each kernel section has defined, protected interfaces; but it is more flexible than a layered system in that any module can call any other module. Furthermore, the approach is like the microkernel approach in that the primary module has only core functions and knowledge of how to load and communicate with other modules; but it is more efficient, because modules do not need to invoke message passing in order to communicate.

The Apple Mac OS X operating system uses a hybrid structure. It is a layered system in which one layer consists of the Mach microkernel. The structure of Mac OS X appears in Figure 2.16. The top layers include application environments and a set of services providing a graphical interface to applications. Below these layers is the kernel environment, which consists primarily of the Mach microkernel and the BSD kernel. Mach provides memory management; support for remote procedure calls (RPCs) and interprocess communication (IPC) facilities, including message passing; and thread scheduling. The BSD component provides a BSD command-line interface, support for networking and file systems, and an implementation of POSIX APIs, including Pthreads. In addition to Mach and BSD, the kernel environment provides an I/O kit for development of device drivers and dynamically loadable modules (which Mac OS X refers to as kernel extensions). As shown in the figure, applications and common services can make use of either the Mach or BSD facilities directly.

Figure 2.16 The Mac OS X structure.
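A user-level analogue of such loadable modules in Java is the ServiceLoader facility: a core component discovers and links in implementations of a known interface at run time. The FileSystemModule interface and the device name below are invented for illustration; in Java, implementations are advertised in META-INF/services files on the class path.

    import java.util.ServiceLoader;

    interface FileSystemModule {
        String name();
        void mount(String device);
    }

    public class CoreKernel {
        public static void main(String[] args) {
            // Discover and load every FileSystemModule available at run time;
            // the core needs no modification when new modules are added.
            for (FileSystemModule fs : ServiceLoader.load(FileSystemModule.class)) {
                System.out.println("loaded module: " + fs.name());
                fs.mount("/dev/disk0");
            }
        }
    }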
The Apple Mac OS X operating system uses a hybrid structure. It is a layered system in which one layer consists of the Mach microkernel. The structure of Mac OS X appears in Figure 2.16. The top layers include application environments and a set of services providing a graphical interface to applications.

Figure 2.16 The Mac OS X structure.

Below these layers is the kernel environment, which consists primarily of the Mach microkernel and the BSD kernel. Mach provides memory management; support for remote procedure calls (RPCs) and interprocess communication (IPC) facilities, including message passing; and thread scheduling. The BSD component provides a BSD command-line interface, support for networking and file systems, and an implementation of POSIX APIs, including Pthreads. In addition to Mach and BSD, the kernel environment provides an I/O kit for development of device drivers and dynamically loadable modules (which Mac OS X refers to as kernel extensions). As shown in the figure, applications and common services can make use of either the Mach or BSD facilities directly.

2.8 Virtual Machines

The layered approach described in Section 2.7.2 is taken to its logical conclusion in the concept of a virtual machine. The fundamental idea behind a virtual machine is to abstract the hardware of a single computer (the CPU, memory, disk drives, network interface cards, and so forth) into several different execution environments, thereby creating the illusion that each separate execution environment is running its own private computer. By using CPU scheduling (Chapter 5) and virtual-memory techniques (Chapter 9), an operating-system host can create the illusion that a process has its own processor with its own (virtual) memory. The virtual machine provides an interface that is identical to the underlying bare hardware. Each guest process is provided with a (virtual) copy of the underlying computer (Figure 2.17). Usually, the guest process is in fact an operating system, and that is how a single physical machine can run multiple operating systems concurrently, each in its own virtual machine.

Figure 2.17 System models. (a) Nonvirtual machine. (b) Virtual machine.

2.8.1 Benefits

There are several reasons for creating a virtual machine. Most of them are fundamentally related to being able to share the same hardware yet run several different execution environments (that is, different operating systems) concurrently. One important advantage is that the host system is protected from the virtual machines, just as the virtual machines are protected from each other. A virus inside a guest operating system might damage that operating system but is unlikely to affect the host or the other guests. Because each virtual machine is completely isolated from all other virtual machines, there are no protection problems.

At the same time, there is no direct sharing of resources. Two approaches to provide sharing have been implemented, however. First, it is possible to share a file-system volume and thus to share files. Second, it is possible to define a network of virtual machines and enable each machine to send information over the virtual communication network. The network is modeled after physical communication networks but is implemented in software.

An advantage for developers is that a virtual-machine system is a perfect vehicle for operating-system research and development. Normally, changing an operating system is a difficult task. Operating systems are large and complex programs, and it is difficult to be sure that a change in one part will not cause obscure bugs to appear in some other part. The power of the operating system makes changing it particularly dangerous. Because the operating system executes in kernel mode, a wrong change in a pointer could cause an error that would destroy the entire file system. Thus, it is necessary to test all changes to the operating system carefully.

The operating system, however, runs on and controls the entire machine. Therefore, the current system must be stopped and taken out of use while changes are made and tested. This period is commonly called system-development time. Since it makes the system unavailable to users, system-development time is often scheduled late at night or on weekends, when system load is low. A virtual-machine system can eliminate much of this problem. System programmers are given their own virtual machine, and system development is done on the virtual machine instead of on a physical machine. Normal system operation seldom needs to be disrupted for system development.

Another advantage of virtual machines for developers is that multiple operating systems can be running on the developer's workstation concurrently. This virtualized workstation allows for rapid porting and testing of programs in varying environments. Similarly, quality-assurance engineers can test their applications in multiple environments without buying, powering, and maintaining a computer for each environment.

A major advantage of virtual machines in production data-center use is system consolidation, which involves taking two or more separate systems and running them in virtual machines on one system. Such physical-to-virtual conversions result in resource optimization, as many lightly used systems can be combined to create one more heavily used system.

If the use of virtual machines continues to spread, application deployment will evolve accordingly, creating additional advantages. If a system can easily add, remove, and move a virtual machine, then why install applications on that system directly?
Instead, application developers would pre-install the application on a tuned and customized operating system in a virtual machine. That virtual environment would be the release mechanism for the application. This method would be an improvement for application developers; application management would become easier, less tuning would be required, and technical support of the application would be more straightforward. System administrators would find the environment easier to manage as well. Installation would be simple, and redeploying the application to another system would be much easier than the usual steps of uninstalling and reinstalling. For widespread adoption of this methodology to occur, though, the format of virtual machines must be standardized so that any virtual machine will run on any virtualization platform. The "Open Virtual Machine Format" is an attempt to do just that, with major vendors agreeing to support it, and it could succeed in unifying virtual-machine formats.

2.8.2 Implementation

Although the virtual-machine concept is useful, it is difficult to implement. Much work is required to provide an exact duplicate of the underlying machine. Remember that the underlying machine typically has two modes: user mode and kernel mode. The virtual-machine software can run in kernel mode, since it is the operating system. The virtual machine itself can execute only in user mode. Just as the physical machine has two modes, however, so must the virtual machine. Consequently, we must have a virtual user mode and a virtual kernel mode, both of which run in a physical user mode. Those actions that cause a transfer from user mode to kernel mode on a real machine (such as a system call or an attempt to execute a privileged instruction) must also cause a transfer from virtual user mode to virtual kernel mode on a virtual machine.

Such a transfer can be accomplished as follows. When a system call, for example, is made by a program running on a virtual machine in virtual user mode, it will cause a transfer to the virtual-machine monitor in the real machine. When the virtual-machine monitor gains control, it can change the register contents and program counter for the virtual machine to simulate the effect of the system call. It can then restart the virtual machine, noting that it is now in virtual kernel mode.
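The following toy Java model illustrates the transfer just described. It is only a sketch of the trap-and-emulate idea, not how any particular monitor is implemented; the Mode enum, the Monitor class, and the "ioctl" instruction name are all illustrative.

// A toy model of trap-and-emulate. The guest always runs in physical
// user mode; when it attempts a privileged operation, control traps to
// the virtual-machine monitor, which simulates the effect and resumes
// the guest.
public class TrapAndEmulateDemo {
    enum Mode { VIRTUAL_USER, VIRTUAL_KERNEL }

    static class VirtualMachine {
        Mode mode = Mode.VIRTUAL_USER;
        long programCounter = 0;
    }

    static class Monitor {
        // Invoked when the guest executes a system call or other
        // privileged instruction while in virtual user mode.
        void trap(VirtualMachine vm, String instruction) {
            vm.mode = Mode.VIRTUAL_KERNEL;  // simulate the mode switch
            emulate(vm, instruction);       // do the work on the guest's behalf
            vm.mode = Mode.VIRTUAL_USER;    // "return" from the system call
            vm.programCounter++;            // resume at the next instruction
        }

        private void emulate(VirtualMachine vm, String instruction) {
            // A real monitor would update the virtual machine's registers
            // and memory to mimic the hardware's response; we only log it.
            System.out.println("monitor emulating " + instruction);
        }
    }

    public static void main(String[] args) {
        VirtualMachine vm = new VirtualMachine();
        new Monitor().trap(vm, "ioctl");   // guest issues a system call
        System.out.println("guest resumed in mode " + vm.mode);
    }
}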
The major difference, of course, is time. Whereas the real I/O might have taken 100 milliseconds, the virtual I/O might take less time (because it is spooled) or more time (because it is interpreted). In addition, the CPU is being multiprogrammed among many virtual machines, further slowing down the virtual machines in unpredictable ways. In the extreme case, it may be necessary to simulate all instructions to provide a true virtual machine. VM, mentioned earlier, works for IBM machines because normal instructions for the virtual machines can execute directly on the hardware. Only the privileged instructions (needed mainly for I/O) must be simulated and hence execute more slowly.

Without some level of hardware support, virtualization would be impossible. The more hardware support available within a system, the more feature-rich, stable, and well-performing the virtual machines can be. All major general-purpose CPUs provide some amount of hardware support for virtualization. For example, AMD virtualization technology is found in several AMD processors. It defines two new modes of operation—host and guest. Virtual-machine software can enable host mode, define the characteristics of each guest virtual machine, and then switch the system to guest mode, passing control of the system to the guest operating system that is running in the virtual machine. In guest mode, the virtualized operating system thinks it is running on native hardware and sees certain devices (those included in the host's definition of the guest). If the guest tries to access a virtualized resource, then control is passed to the host to manage that interaction.

2.8.3 VMware

Despite the advantages of virtual machines, they received little attention for a number of years after they were first developed. Today, however, virtual machines are coming into fashion as a means of solving system compatibility problems. In this section, we explore the VMware Workstation virtual machine. As you will see from this example, virtual machines can typically run on top of operating systems of any of the design types discussed earlier. Thus, operating-system design methods—simple layers, microkernels, modules, and virtual machines—are not mutually exclusive.

Most of the virtualization techniques discussed in this section require virtualization to be supported by the kernel. Another method involves writing the virtualization tool to run in user mode as an application on top of the operating system. Virtual machines running within this tool believe they are running on bare hardware but in fact are running inside a user-level application.

VMware Workstation is a popular commercial application that abstracts Intel X86 and compatible hardware into isolated virtual machines. VMware Workstation runs as an application on a host operating system such as Windows or Linux and allows this host system to concurrently run several different guest operating systems as independent virtual machines. The architecture of such a system is shown in Figure 2.18. In this scenario, Linux is running as the host operating system, and FreeBSD, Windows NT, and Windows XP are running as guest operating systems. The virtualization layer is the heart of VMware, as it abstracts the physical hardware into isolated virtual machines running as guest operating systems. Each virtual machine has its own virtual CPU, memory, disk drives, network interfaces, and so forth.

The physical disk the guest owns and manages is really just a file within the file system of the host operating system. To create an identical guest instance, we can simply copy the file. Copying the file to another location protects the guest instance against a disaster at the original site. Moving the file to another location moves the guest system. By providing these capabilities, virtualization can improve the efficiency of system administration as well as system resource use.

2.8.4 Alternatives to System Virtualization

System virtualization as discussed so far is just one of many system-emulation methodologies. Virtualization is the most common because it makes guest operating systems and applications "believe" they are running on native hardware.
Because only the system's resources need to be virtualized, these guests run at almost full speed.

Figure 2.18 VMware architecture.

We now discuss two other options: simulation and para-virtualization.

2.8.4.1 Simulation

Another methodology is simulation, in which the host system has one system architecture and the guest system was compiled for a different architecture. For example, suppose a company has replaced its outdated computer system with a new system but would like to continue to run certain important programs that were compiled for the old system. The programs could be run in an emulator that translates each of the outdated system's instructions into the native instruction set of the new system. Emulation can increase the life of programs and allow us to explore old architectures without having an actual old machine, but its major challenge is performance. Instruction-set emulation can run an order of magnitude slower than native instructions. Thus, unless the new machine is ten times faster than the old, the program will run slower on the new machine than it did on its native hardware. Another challenge is that it is difficult to create an emulator that works correctly because, in essence, this involves writing an entire CPU in software.

2.8.4.2 Para-virtualization

Para-virtualization is another variation on the system-emulation theme. Rather than try to trick a guest operating system into believing it has a system to itself, para-virtualization presents the guest with a system that is similar but not identical to the guest's preferred system. The guest must be modified to run on the para-virtualized hardware. The gain for this extra work is more efficient use of resources and a smaller virtualization layer.

Solaris 10 includes containers, or zones, that create a virtual layer between the operating system and the applications. In this system, only one kernel is installed, and the hardware is not virtualized. Rather, the operating system and its devices are virtualized, giving processes within a container the impression that they are the only processes on the system. One or more containers can be created, and each can have its own applications, network stacks, network address and ports, user accounts, and so on. CPU resources can be divided up among the containers and the system-wide processes. Figure 2.19 shows a Solaris 10 system with two containers and the standard "global" user space.

Figure 2.19 Solaris 10 with two containers.

2.9 Java

Java is a technology introduced by Sun Microsystems in the mid-1990s. We refer to it as a technology rather than just a programming language because it provides more than a conventional programming language. Java technology consists of two essential components:
1. Programming-language specification
2. Virtual-machine specification

We provide an overview of these two components in this section.

2.9.1 The Java Programming Language

Java is a general-purpose, object-oriented programming language with support for distributed programming. Java was originally favored by the Internet programming community because of its support for applets, which are programs with limited resource access that run within a Web browser. Now, Java is a popular language for designing desktop applications, client–server Web applications, and applications that run within embedded systems, such as smartphones.

As mentioned, Java is an object-oriented language, which means that it offers support for the kind of object-oriented programming discussed earlier. Java objects are specified with the class construct; a Java program consists of one or more classes. For each Java class, the Java compiler produces an architecture-neutral bytecode output (.class) file that will run on any implementation of the Java virtual machine. Java also provides high-level support for networking and distributed objects. It is a multithreaded language as well, meaning that a Java program may have several different threads, or flows, of control, thus allowing the development of concurrent applications that take advantage of modern processors with multiple processing cores. We cover distributed objects using Java's remote method invocation (RMI) in Chapter 3, and we discuss multithreaded Java programs in Chapter 4.

Java is also considered a secure language. This feature is especially important considering that a Java program may be executing across a distributed network. We look at Java security in Chapter 15. Java programs are written using the Java Standard Edition API. This is a standard API for designing desktop applications and applets with basic language support for graphics, I/O, security, database connectivity, and networking.

2.9.2 The Java Virtual Machine

The Java virtual machine (JVM) is a specification for an abstract computer. It consists of a class loader and a Java interpreter that executes architecture-neutral bytecodes. The class loader loads the compiled .class files from both the Java program and the Java API for execution by the Java interpreter, as diagrammed in Figure 2.20. After a class is loaded, the verifier checks that the .class file contains valid Java bytecode and does not overflow or underflow the stack. It also ensures that the bytecode does not perform pointer arithmetic, which could provide illegal memory access. If the class passes verification, it is run by the Java interpreter. The JVM also automatically manages memory by performing garbage collection—the practice of reclaiming memory from objects no longer in use and returning it to the system. Much research focuses on garbage-collection algorithms for increasing the performance of Java programs in the virtual machine.

Figure 2.20 The Java virtual machine.

An instance of the JVM is created whenever a Java application or applet is run. This instance of the JVM starts running when the main() method of a program is invoked. In the case of applets, the programmer does not define a main() method. Rather, the browser executes the main() method before creating the applet.
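A minimal application makes these points concrete. Compiling the class below (javac JvmDemo.java) produces an architecture-neutral JvmDemo.class file, and each time it is started with "java JvmDemo", a fresh JVM instance is created and begins execution at main(). The extra thread illustrates that even a small Java program may contain several flows of control. The class name is our own example, not from the text.

// A minimal, portable Java application.
public class JvmDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(
            () -> System.out.println("worker thread running"));
        worker.start();   // a second flow of control in the same program
        worker.join();    // wait for the worker to finish
        System.out.println("main thread exiting; this JVM instance ends");
    }
}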
If we simultaneously run two Java programs and a Java applet on the same computer, we will have three instances of the JVM.

The JVM may be implemented in software on top of a host operating system, such as Windows, Linux, or Mac OS X, or as part of a Web browser. Alternatively, the JVM may be implemented in hardware on a chip specifically designed to run Java programs. If the JVM is implemented in software, the Java interpreter interprets the bytecode operations one at a time. A faster software technique is to use a just-in-time (JIT) compiler. Here, the first time a Java method is invoked, the bytecodes for the method are turned into native machine language for the host system. These operations are then cached, so that subsequent invocations of a method are performed using the native machine instructions and the bytecode operations need not be interpreted all over again. A technique that is potentially even faster is to run the JVM in hardware on a special Java chip that executes the Java bytecode operations as native code, thus bypassing the need for either a software interpreter or a just-in-time compiler.

It is the JVM that makes it possible to develop programs that are architecture-neutral and portable. An implementation of the JVM is system-specific, and it abstracts the system in a standard way to the Java program, providing a clean, architecture-neutral interface. This interface allows a .class file to run on any system that has implemented the JVM according to its specification. Java virtual machines have been designed for most operating systems, including Windows, Linux, Mac OS X, and Solaris. When we use the JVM in this text to illustrate operating-system concepts, we refer to the specification of the JVM, rather than to any particular implementation.

2.9.3 The Java Development Kit

The Java development kit, or JDK, consists of (1) development tools, such as a compiler and debugger, and (2) a run-time environment, or JRE. The compiler turns a Java source file into a bytecode (.class) file. The run-time environment provides the JVM as well as the Java API for the host system. The JDK is portrayed in Figure 2.21.

Figure 2.21 The Java development kit.

2.9.4 Java Operating Systems

Most operating systems are written in a combination of C and assembly-language code, primarily because of the performance benefits of these languages and the ease of interfacing with hardware. However, recent efforts have been made to write operating systems in Java. Such a system, known as a language-based extensible system, runs in a single address space. One of the difficulties in designing language-based systems concerns memory protection—protecting the operating system from malicious user programs as well as protecting user programs from one another.

THE .NET FRAMEWORK

The .NET Framework is a collection of technologies, including a set of class libraries and an execution environment, that come together to provide a platform for developing software. This platform allows programs to be written to target the .NET Framework instead of a specific architecture. A program written for the .NET Framework need not worry about the specifics of the hardware or the operating system on which it will run. Thus, any architecture implementing .NET will be able to successfully execute the program. This is because the execution environment abstracts these details and provides a virtual machine as an intermediary between the executing program and the underlying architecture.
At the core of the .NET Framework is the Common Language Runtime (CLR). The CLR is the implementation of the .NET virtual machine. It provides an environment for execution of programs written in any of the languages targeted at the .NET Framework. Programs written in languages such as C# (pronounced C-sharp) and VB.NET are compiled into an intermediate, architecture-independent language called Microsoft Intermediate Language (MS-IL). These compiled files, called assemblies, include MS-IL instructions and metadata. They have file extensions of either .EXE or .DLL. Upon execution of a program, the CLR loads assemblies into what is known as the Application Domain. As instructions are requested by the executing program, the CLR converts the MS-IL instructions inside the assemblies into native code that is specific to the underlying architecture, using just-in-time compilation. Once instructions have been converted to native code, they are kept and will continue to run as native code for the CPU. The architecture of the CLR for the .NET Framework is shown below.

(Figure: architecture of the CLR—source programs such as C++ and VB.NET are compiled into MS-IL assemblies, which the just-in-time compiler translates into native code on the host system.)

Traditional operating systems rely on hardware features to provide memory protection (Section 8.1). Language-based systems instead rely on type-safety features of the language. As a result, language-based systems are desirable on small hardware devices, which may lack hardware features that provide memory protection.

The JX operating system is written almost entirely in Java and provides a run-time system for Java applications as well. JX organizes its system according to domains. Each domain represents an independent JVM. Additionally, each domain maintains a heap used for allocating memory during object creation and threads within itself, as well as for garbage collection. Domain zero is a microkernel (Section 2.7.3) responsible for low-level details, such as system initialization and saving and restoring the state of the CPU. Domain zero is written in C and assembly language; all other domains are written entirely in Java. Communication between domains occurs through portals, communication mechanisms similar to the remote procedure calls (RPCs) used by the Mach microkernel. Protection within and between domains relies on the type safety of the Java language. Since domain zero is not written in Java, it must be considered trusted. The architecture of the JX system is illustrated in Figure 2.22.

Figure 2.22 The JX operating system.

2.10 Operating-System Debugging

Broadly, debugging is the activity of finding and fixing errors, or bugs, in a system. Debugging seeks to find and fix errors in both hardware and software. Performance problems are considered bugs, so debugging can also include performance tuning, which improves performance by removing bottlenecks in the processing taking place within a system. A discussion of hardware debugging is outside the scope of this text. In this section, we explore debugging kernel and process errors and performance problems.
2.10.1 Failure Analysis

If a process fails, most operating systems write the error information to a log file to alert system operators or users that the problem occurred. The operating system can also take a core dump—a capture of the memory of the process (memory was referred to as the "core" in the early days of computing). This core image is stored in a file for later analysis. Running programs and core dumps can be probed by a debugger, a tool designed to allow a programmer to explore the code and memory of a process.

Debugging user-level process code is a challenge. Operating-system kernel debugging is even more challenging because of the size and complexity of the kernel, its control of the hardware, and the lack of user-level debugging tools. A kernel failure is called a crash. As with a process failure, error information is saved to a log file, and the memory state is saved to a crash dump.

Operating-system debugging frequently uses different tools and techniques from process debugging due to the very different nature of these two tasks. Consider that a kernel failure in the file-system code would make it risky for the kernel to try to save its state to a file on the file system before rebooting. A common technique is therefore to save the kernel's memory state to a section of disk set aside for this purpose that contains no file system. If the kernel detects an unrecoverable error, it writes the entire contents of memory, or at least the kernel-owned parts of the system memory, to that disk area. When the system reboots, a process runs to gather the data from the area and write it to a crash dump file within a file system for analysis.

Kernighan's Law
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

2.10.2 Performance Tuning

To identify bottlenecks, we must be able to monitor system performance. Code must be added to compute and display measures of system behavior. In a number of systems, the operating system does this task by producing trace listings of system behavior. All interesting events are logged with their time and important parameters and are written to a file. Later, an analysis program can process the log file to determine system performance and to identify bottlenecks and inefficiencies. These same traces can be run as input for a simulation of a suggested improved system. Traces also can help people find errors in operating-system behavior.

Another approach to performance tuning is to include interactive tools with the system that allow users and administrators to question the state of various system components to look for bottlenecks. The UNIX command top displays resources used on the system, as well as a sorted list of the "top" resource-using processes. Other tools display the state of disk I/O, memory allocation, and network traffic. The authors of these single-purpose tools try to guess what a user would want to see while analyzing a system and to provide that information.
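The trace-listing idea just described is easy to sketch in Java. The event names, parameters, and file name below are made up; the point is only the mechanism: each interesting event is appended to a log with a timestamp, and the resulting file can be processed offline to look for bottlenecks.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// A minimal trace facility: timestamped events appended to a log file.
public class TraceLog implements AutoCloseable {
    private final PrintWriter out;

    public TraceLog(String path) throws IOException {
        out = new PrintWriter(new FileWriter(path, true)); // append mode
    }

    public void event(String name, String parameters) {
        // System.nanoTime() is a monotonic clock, suitable for measuring
        // the intervals between logged events during later analysis.
        out.printf("%d %s %s%n", System.nanoTime(), name, parameters);
    }

    @Override public void close() { out.close(); }

    public static void main(String[] args) throws IOException {
        try (TraceLog log = new TraceLog("trace.log")) {
            log.event("disk-read-start", "block=42");
            // ... the operation being measured would run here ...
            log.event("disk-read-end", "block=42");
        }
    }
}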
Making running operating systems easier to understand, debug, and tune is an active area of operating-system research and implementation. The cycle of enabling tracing as system problems occur and analyzing the traces later is being broken by a new generation of kernel-enabled performance analysis tools. Further, these tools are not limited to a single purpose or to sections of code that were written to produce debugging data. The Solaris 10 DTrace dynamic tracing facility is a leading example of such a tool.

2.10.3 DTrace

DTrace is a facility that dynamically adds probes to a running system, both in user processes and in the kernel. These probes can be queried via the D programming language to determine an astonishing amount about the kernel, the system state, and process activities. For example, Figure 2.23 follows an application as it executes a system call (ioctl) and further shows the function calls within the kernel as they execute to perform the system call. Lines ending with "U" are executed in user mode, and lines ending in "K" in kernel mode.

Debugging the interactions between user-level and kernel code is nearly impossible without a toolset that understands both sets of code and can instrument the interactions. For that toolset to be truly useful, it must be able to debug any area of a system, including areas that were not written with debugging in mind, and do so without affecting system reliability. The toolset must also have a minimum performance impact—ideally, it should have no impact when not in use and a proportional impact during use. The DTrace tool meets these requirements and provides a dynamic, safe, low-impact debugging environment.

# ./all.d 'pgrep xclock' XEventsQueued
dtrace: script './all.d' matched 52377 probes
CPU FUNCTION
  0 -> XEventsQueued                   U
  0 -> _XEventsQueued                  U
  0 -> _X11TransBytesReadable          U
  0 <- _X11TransBytesReadable          U
  0 -> _X11TransSocketBytesReadable    U
  0 <- _X11TransSocketBytesReadable    U
  0 -> ioctl                           U
  0 -> ioctl                           K
  0 -> getf                            K
  0 -> set_active_fd                   K
  0 <- set_active_fd                   K
  0 <- getf                            K
  0 -> get_udatamodel                  K
  0 <- get_udatamodel                  K
  ...
  0 -> releasef                        K
  0 -> clear_active_fd                 K
  0 <- clear_active_fd                 K
  0 -> cv_broadcast                    K
  0 <- cv_broadcast                    K
  0 <- releasef                        K
  0 <- ioctl                           K
  0 <- ioctl                           U
  0 <- _XEventsQueued                  U
  0 <- XEventsQueued                   U

Figure 2.23 Solaris 10 DTrace follows a system call within the kernel.

Until the DTrace framework and tools became available with Solaris 10, kernel debugging was usually shrouded in mystery and accomplished via happenstance and archaic code and tools. For example, CPUs have a breakpoint feature that will halt execution and allow a debugger to examine the state of the system. Then execution can continue until the next breakpoint or termination. This method cannot be used in a multiuser operating-system kernel without negatively affecting all of the users on the system. Profiling, which periodically samples the instruction pointer to determine which code is being executed, can show statistical trends but not individual activities. Code can be included in the kernel to emit specific data under specific circumstances, but that code slows down the kernel and tends not to be included in the part of the kernel where the specific problem being debugged is occurring.

In contrast, DTrace runs on production systems—systems that are running important or critical applications—and causes no harm to the system. It slows activities while enabled, but after execution it resets the system to its pre-debugging state. It is also a broad and deep tool. It can broadly debug everything happening in the system (both at the user and kernel levels and between the user and kernel layers). DTrace can also delve deeply into code, showing individual CPU instructions or kernel subroutine activities.
The DTrace tool is composed of a compiler, a framework, providers of probes written within that framework, and consumers of those probes. DTrace providers create probes. Kernel structures keep track of all probes that the providers have created. The probes are stored in a hash-table data structure that is hashed by name and indexed according to unique probe identifiers. When a probe is enabled, a bit of code in the area to be probed is rewritten to call dtrace_probe() with the probe identifier and then continue with the code's original operation. Different providers create different kinds of probes. For example, a kernel system-call probe works differently from a user-process probe, and that is different from an I/O probe.

DTrace features a compiler that generates byte code that is run in the kernel. This code is assured to be "safe" by the compiler. For example, no loops are allowed, and only specific kernel-state modifications may be made. Only users with DTrace "privileges" (or "root" users) are allowed to use DTrace, as it can retrieve private kernel data (and modify data if requested). The generated code runs in the kernel and enables probes. It also enables consumers in user mode and enables communications between the two.

A DTrace consumer is code that is interested in a probe and its results. A consumer requests that the provider create one or more probes. When a probe fires, it emits data that are managed by the kernel. Within the kernel, actions called enabling control blocks, or ECBs, are performed when probes fire. One probe can cause multiple ECBs to execute if more than one consumer is interested in that probe. Each ECB contains a predicate ("if statement") that can filter out that ECB. Otherwise, the list of actions in the ECB is executed. The most usual action is to capture some bit of data, such as a variable's value at that point of the probe execution. By gathering such data, DTrace can build a complete picture of a user or kernel action. Further, probes firing from both user space and the kernel can show how a user-level action caused kernel-level reactions. Such data are invaluable for performance monitoring and code optimization.

Once the probe consumer terminates, its ECBs are removed. If there are no ECBs consuming a probe, the probe is removed. That involves rewriting the code to remove the dtrace_probe() call and put back the original code. Thus, after a probe is destroyed, the system is exactly the same as before the probe was created, as if no probing had occurred.

sched:::on-cpu
/uid == 101/
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @time[execname] = sum(timestamp - self->ts);
        self->ts = 0;
}

Figure 2.24 DTrace code.

# dtrace -s sched.d
dtrace: script 'sched.d' matched 6 probes
^C
gnome-settings-d     142354
gnome-vfs-daemon     158243
dsdm                 189804
wnck-applet          200030
gnome-panel          277864
clock-applet         374916
mapping-daemon       385475
xscreensaver         514177
metacity             539281
Xorg                2579646
gnome-terminal      5007269
mixer_applet2       7388447
java               10769137

Figure 2.25 Output of the D code.

DTrace takes care to assure that probes do not use too much memory or CPU capacity, which could harm the running system. The buffers used to hold the probe results are monitored for exceeding default and maximum limits. CPU time for probe execution is monitored as well. If limits are exceeded, the consumer is terminated, along with the offending probes. Buffers are allocated per CPU to avoid contention and data loss.
An example of D code and its output shows some of its utility. The program in Figure 2.24 shows the DTrace code to enable scheduler probes and record the amount of CPU time used by each process running with user ID 101 while those probes are enabled (that is, while the program runs). The output of the program, showing the processes and how much time (in nanoseconds) they spend running on the CPUs, is shown in Figure 2.25.

Because DTrace is part of the open-source Solaris 10 operating system, it is being added to other operating systems when those systems do not have conflicting license agreements. For example, DTrace has been added to Mac OS X 10.5 and FreeBSD and will likely spread further due to its unique capabilities. Other operating systems, especially the Linux derivatives, are adding kernel-tracing functionality as well. Still other operating systems are beginning to include performance and tracing tools fostered by research at various institutions, including the Paradyn project (http://www.cs.wisc.edu/paradyn/).

2.11 Operating-System Generation

It is possible to design, code, and implement an operating system specifically for one machine at one site. More commonly, however, operating systems are designed to run on any of a class of machines at a variety of sites with a variety of peripheral configurations. The system must then be configured or generated for each specific computer site, a process sometimes known as system generation (SYSGEN).

The operating system is normally distributed on disk, on CD-ROM or DVD-ROM, or as an "ISO" image, which is a file in the format of a CD-ROM or DVD-ROM. To generate a system, we use a special program. This SYSGEN program reads from a given file, or asks the operator of the system for information concerning the specific configuration of the hardware system, or probes the hardware directly to determine what components are there. The following kinds of information must be determined (a user-level sketch of such probing appears after the list):

• What CPU is to be used? What options (extended instruction sets, floating-point arithmetic, and so on) are installed? For multiple-CPU systems, each CPU may be described.

• How will the boot disk be formatted? How many sections, or "partitions," will it be separated into, and what will go into each partition?

• How much memory is available? Some systems will determine this value themselves by referencing memory location after memory location until an "illegal address" fault is generated. This procedure defines the final legal address and hence the amount of available memory.

• What devices are available? The system will need to know how to address each device (the device number), the device interrupt number, the device's type and model, and any special device characteristics.

• What operating-system options are desired, or what parameter values are to be used? These options or values might include how many buffers of which sizes should be used, what type of CPU-scheduling algorithm is desired, what the maximum number of processes to be supported is, and so on.
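A real SYSGEN program queries the hardware directly; a user-level Java program can only ask the JVM and operating system, but the following sketch gives the flavor of the inventory. It is an approximation for illustration only, not a system-generation tool.

// A user-level approximation of the probing a SYSGEN program performs.
public class SysgenProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("CPUs available  : " + rt.availableProcessors());
        System.out.println("max memory (MB) : " + rt.maxMemory() / (1024 * 1024));
        System.out.println("operating system: " + System.getProperty("os.name"));
        System.out.println("architecture    : " + System.getProperty("os.arch"));
    }
}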
Once this information is determined, it can be used in several ways. At one extreme, a system administrator can use it to modify a copy of the source code of the operating system. The operating system then is completely compiled. Data declarations, initializations, and constants, along with conditional compilation, produce an output object version of the operating system that is tailored to the system described.

At a slightly less tailored level, the system description can lead to the creation of tables and the selection of modules from a precompiled library. These modules are linked together to form the generated operating system. Selection allows the library to contain the device drivers for all supported I/O devices, but only those needed are linked into the operating system. Because the system is not recompiled, system generation is faster, but the resulting system may be overly general.

At the other extreme, it is possible to construct a system that is completely table driven. All the code is always part of the system, and selection occurs at execution time, rather than at compile or link time. System generation involves simply creating the appropriate tables to describe the system.

The major differences among these approaches are the size and generality of the generated system and the ease of modifying it as the hardware configuration changes. Consider the cost of modifying the system to support a newly acquired graphics terminal or another disk drive. Balanced against that cost, of course, is the frequency (or infrequency) of such changes.

2.12 System Boot

After an operating system is generated, it must be made available for use by the hardware. But how does the hardware know where the kernel is or how to load that kernel? The procedure of starting a computer by loading the kernel is known as booting the system. On most computer systems, a small piece of code known as the bootstrap program or bootstrap loader locates the kernel, loads it into main memory, and starts its execution. Some computer systems, such as PCs, use a two-step process in which a simple bootstrap loader fetches a more complex boot program from disk, which in turn loads the kernel.

When a CPU receives a reset event—for instance, when it is powered up or rebooted—the instruction register is loaded with a predefined memory location, and execution starts there. At that location is the initial bootstrap program. This program is in the form of read-only memory (ROM), because the RAM is in an unknown state at system startup. ROM is convenient because it needs no initialization and cannot easily be infected by a computer virus.

The bootstrap program can perform a variety of tasks. Usually, one task is to run diagnostics to determine the state of the machine. If the diagnostics pass, the program can continue with the booting steps. It can also initialize all aspects of the system, from CPU registers to device controllers and the contents of main memory. Sooner or later, it starts the operating system.

Some systems—such as cellular phones, PDAs, and game consoles—store the entire operating system in ROM. Storing the operating system in ROM is suitable for small operating systems, simple supporting hardware, and rugged operation. A problem with this approach is that changing the bootstrap code requires changing the ROM hardware chips. Some systems resolve this problem by using erasable programmable read-only memory (EPROM), which is read-only except when explicitly given a command to become writable. All forms of ROM are also known as firmware, since their characteristics fall somewhere between those of hardware and those of software. A problem with firmware in general is that executing code there is slower than executing code in RAM. Some systems store the operating system in firmware and copy it to RAM for fast execution.
A final issue with firmware is that it is relatively expensive, so usually only small amounts are available. For large operating systems (including most general-purpose operating systems such as Windows, Mac OS X, and UNIX) and for systems that change frequently, the bootstrap loader is stored in firmware, and the operating system is on disk. In this case, the bootstrap runs diagnostics and has a bit of code that can read a single block at a fixed location (say, block zero) from disk into memory and execute the code from that boot block. The program stored in the boot block may be sophisticated enough to load the entire operating system into memory and begin its execution. More typically, it is simple code (as it must fit in a single disk block) and knows only the address on disk and the length of the remainder of the bootstrap program. GRUB is an example of an open-source bootstrap program for Linux systems. All of the disk-bound bootstrap, and the operating system itself, can be easily changed by writing new versions to disk. A disk that has a boot partition (more on that in Section 12.5.1) is called a boot disk or system disk.

Now that the full bootstrap program has been loaded, it can traverse the file system to find the operating-system kernel, load it into memory, and start its execution. It is only at this point that the system is said to be running.

2.13 Summary

Operating systems provide a number of services. At the lowest level, system calls allow a running program to make requests from the operating system directly. At a higher level, the command interpreter or shell provides a mechanism for a user to issue a request without writing a program. Commands may come from files during batch-mode execution or directly from a terminal when in an interactive or time-shared mode. System programs are provided to satisfy many common user requests.

The types of requests vary according to level. The system-call level must provide the basic functions, such as process control and file and device manipulation. Higher-level requests, satisfied by the command interpreter or system programs, are translated into a sequence of system calls. System services can be classified into several categories: program control, status requests, and I/O requests. Program errors can be considered implicit requests for service.

Once the system services are defined, the structure of the operating system can be developed. Various tables are needed to record the information that defines the state of the computer system and the status of the system's jobs.

The design of a new operating system is a major task. It is important that the goals of the system be well defined before the design begins. The type of system desired is the foundation for choices among various algorithms and strategies that will be needed.

Since an operating system is large, modularity is important. Designing a system as a sequence of layers or using a microkernel is considered a good technique. The virtual-machine concept takes the layered approach and treats both the kernel of the operating system and the hardware as though they were hardware. Even other operating systems may be loaded on top of this virtual machine.

Throughout the entire operating-system design cycle, we must be careful to separate policy decisions from implementation details (mechanisms). This separation allows maximum flexibility if policy decisions are to be changed later.
Operating systems are now almost always written in a systems-implementation language or in a higher-level language. This feature improves their implementation, maintenance, and portability. To create an operating system for a particular machine configuration, we must perform system generation.

Debugging process and kernel failures can be accomplished through the use of debuggers and other tools that analyze core dumps. Tools such as DTrace analyze production systems to find bottlenecks and understand other system behavior.

For a computer system to begin running, the CPU must initialize and start executing the bootstrap program in firmware. The bootstrap can execute the operating system directly if the operating system is also in the firmware, or it can complete a sequence in which it loads progressively smarter programs from firmware and disk until the operating system itself is loaded into memory and executed.

Practice Exercises

2.1 What is the purpose of system calls?
2.2 What are the five major activities of an operating system with regard to process management?
2.3 What are the three major activities of an operating system with regard to memory management?
2.4 What are the three major activities of an operating system with regard to secondary-storage management?
2.5 What is the purpose of the command interpreter? Why is it usually separate from the kernel?
2.6 What system calls have to be executed by a command interpreter or shell in order to start a new process?
2.7 What is the purpose of system programs?
2.8 What is the main advantage of the layered approach to system design? What are the disadvantages of using the layered approach?
2.9 List five services provided by an operating system, and explain how each creates convenience for users. In which cases would it be impossible for user-level programs to provide these services? Explain your answer.
2.10 Why do some systems store the operating system in firmware, while others store it on disk?
2.11 How could a system be designed to allow a choice of operating systems from which to boot? What would the bootstrap program need to do?

Exercises

2.12 The services and functions provided by an operating system can be divided into two main categories. Briefly describe the two categories, and discuss how they differ.
2.13 Describe three general methods for passing parameters to the operating system.
2.14 Describe how you could obtain a statistical profile of the amount of time spent by a program executing different sections of its code. Discuss the importance of obtaining such a statistical profile.
2.15 What are the five major activities of an operating system with regard to file management?
2.16 What are the advantages and disadvantages of using the same system-call interface for manipulating both files and devices?
2.17 Would it be possible for the user to develop a new command interpreter using the system-call interface provided by the operating system?
2.18 What are the two models of interprocess communication? What are the strengths and weaknesses of the two approaches?
2.19 Why is the separation of mechanism and policy desirable?
2.20 It is sometimes difficult to achieve a layered approach if two components of the operating system are dependent on each other. Identify a scenario in which it is unclear how to layer two system components that require tight coupling of their functionalities.
2.21 What is the main advantage of the microkernel approach to system design?
How do user programs and system services interact in a microkernel architecture? What are the disadvantages of using the microkernel approach?
2.22 In what ways is the modular kernel approach similar to the layered approach? In what ways does it differ from the layered approach?
2.23 What is the main advantage for an operating-system designer of using a virtual-machine architecture? What is the main advantage for a user?
2.24 Why is a just-in-time compiler useful for executing Java programs?
2.25 What is the relationship between a guest operating system and a host operating system in a system like VMware? What factors need to be considered in choosing the host operating system?
2.26 The experimental Synthesis operating system has an assembler incorporated in the kernel. To optimize system-call performance, the kernel assembles routines within kernel space to minimize the path that the system call must take through the kernel. This approach is the antithesis of the layered approach, in which the path through the kernel is extended to make building the operating system easier. Discuss the pros and cons of the Synthesis approach to kernel design and optimization of system performance.

Programming Problems

2.27 In Section 2.3, we described a program that copies the contents of one file to a destination file. This program works by first prompting the user for the name of the source and destination files. Write this program in Java. Be sure to include all necessary error checking, including ensuring that the source file exists. Once you have correctly designed and tested the program, run it using a utility that traces system calls, if possible. Although Java does not support making system calls directly, it will be possible to observe which Java methods correspond to actual system calls. Some Linux systems provide the strace utility. On Mac OS X and Solaris systems, use dtrace. As Windows systems do not provide such features, you will have to trace through the Win32 version of this program using a debugger.

Programming Projects

Adding a System Call to the Linux Kernel. In this project, you will study the system-call interface provided by the Linux operating system and learn how user programs communicate with the operating-system kernel via this interface. Your task is to incorporate a new system call into the kernel, thereby expanding the functionality of the operating system.

Part 1: Getting Started

A user-mode procedure call is performed by passing arguments to the called procedure either on the stack or through registers, saving the current state and the value of the program counter, and jumping to the beginning of the code corresponding to the called procedure. The process continues to have the same privileges as before.

System calls appear as procedure calls to user programs but result in a change in execution context and privileges. In Linux on the Intel 386 architecture, a system call is accomplished by storing the system-call number in the EAX register, storing arguments to the system call in other hardware registers, and executing a trap instruction (which is the INT 0x80 assembly instruction). After the trap is executed, the system-call number is used to index into a table of code pointers to obtain the starting address for the handler code implementing the system call. The process then jumps to this address, and the privileges of the process are switched from user to kernel mode.
With the expanded privileges, the process can now execute kernel code, which may include privileged instructions that cannot be executed in user mode. The kernel code can then carry out the requested services, such as interacting with I/O devices, and can perform process management and other activities that cannot be performed in user mode.

The system-call numbers for recent versions of the Linux kernel are listed in /usr/src/linux-2.x/include/asm-i386/unistd.h. (For instance, __NR_close corresponds to the system call close(), which is invoked for closing a file descriptor, and is defined as value 6.) The list of pointers to system-call handlers is typically stored in the file /usr/src/linux-2.x/arch/i386/kernel/entry.S under the heading ENTRY(sys_call_table). Notice that sys_close is stored at entry number 6 in the table to be consistent with the system-call number defined in the unistd.h file. (The keyword .long denotes that the entry will occupy the same number of bytes as a data value of type long.)

Part 2: Building a New Kernel

Before adding a system call to the kernel, you must familiarize yourself with the task of building the binary for a kernel from its source code and booting the machine with the newly built kernel. This activity comprises the following tasks, some of which depend on the particular installation of the Linux operating system in use.

• Obtain the kernel source code for the Linux distribution. If the source-code package has already been installed on your machine, the corresponding files might be available under /usr/src/linux or /usr/src/linux-2.x (where the suffix corresponds to the kernel version number). If the package has not yet been installed, it can be downloaded from the provider of your Linux distribution or from http://www.kernel.org.

• Learn how to configure, compile, and install the kernel binary. These tasks will vary among the different kernel distributions, but some typical commands for building the kernel (after entering the directory where the kernel source code is stored) include:
◦ make xconfig
◦ make dep
◦ make bzImage

• Add a new entry to the set of bootable kernels supported by the system. The Linux operating system typically uses utilities such as lilo and grub to maintain a list of bootable kernels from which the user can choose during machine boot-up. If your system supports lilo, add an entry to lilo.conf, such as:

image=/boot/bzImage.mykernel
label=mykernel
root=/dev/hda5
read-only

where /boot/bzImage.mykernel is the kernel image and mykernel is the label associated with the new kernel. This step will allow you to choose the new kernel during the boot-up process. You will then have the option of either booting the new kernel or booting the unmodified kernel if the newly built kernel does not function properly.

Part 3: Extending the Kernel Source

You can now experiment with adding a new file to the set of source files used for compiling the kernel. Typically, the source code is stored in the /usr/src/linux-2.x/kernel directory, although that location may differ in your Linux distribution. There are two options for adding the system call. The first is to add the system call to an existing source file in this directory. The second is to create a new file in the source directory and modify /usr/src/linux-2.x/kernel/Makefile to include the newly created file in the compilation process. The advantage of the first approach is that when you modify an existing file that is already part of the compilation process, the Makefile need not be modified.
Part 4: Adding a System Call to the Kernel

Now that you are familiar with the various background tasks corresponding to building and booting Linux kernels, you can begin the process of adding a new system call to the Linux kernel. In this project, the system call will have limited functionality; it will simply transition from user mode to kernel mode, print a message that is logged with the kernel messages, and transition back to user mode. We will call this the helloworld system call. While it has only limited functionality, it illustrates the system-call mechanism and sheds light on the interaction between user programs and the kernel.

• Create a new file called helloworld.c to define your system call. Include the header files linux/linkage.h and linux/kernel.h. Add the following code to this file:

#include <linux/linkage.h>
#include <linux/kernel.h>

asmlinkage int sys_helloworld()
{
        printk(KERN_EMERG "hello world!");
        return 1;
}

This creates a system call with the name sys_helloworld(). If you choose to add this system call to an existing file in the source directory, all that is necessary is to add the sys_helloworld() function to the file you choose. In the code, asmlinkage is a remnant from the days when Linux used both C++ and C code and is used to indicate that the code is written in C. The printk() function is used to print messages to a kernel log file and therefore may be called only from the kernel. The kernel messages specified in the parameter to printk() are logged in the file /var/log/kernel/warnings. The function prototype for the printk() call is defined in /usr/include/linux/kernel.h.

• Define a new system-call number for __NR_helloworld in /usr/src/linux-2.x/include/asm-i386/unistd.h. A user program can use this number to identify the newly added system call. Also be sure to increment the value for NR_syscalls, which is stored in the same file. This constant tracks the number of system calls currently defined in the kernel.

• Add an entry .long sys_helloworld to the sys_call_table defined in the /usr/src/linux-2.x/arch/i386/kernel/entry.S file. As discussed earlier, the system-call number is used to index into this table to find the position of the handler code for the invoked system call.

• Add your file helloworld.c to the Makefile (if you created a new file for your system call). Save a copy of your old kernel binary image (in case there are problems with your newly created kernel). You can now build the new kernel, rename it to distinguish it from the unmodified kernel, and add an entry to the loader configuration files (such as lilo.conf). After completing these steps, you can boot either the old kernel or the new kernel that contains your system call.

Part 5: Using the System Call from a User Program

When you boot with the new kernel, it will support the newly defined system call; you now simply need to invoke this system call from a user program. Ordinarily, the standard C library supports an interface for system calls defined for the Linux operating system. As your new system call is not linked into the standard C library, however, invoking your system call will require manual intervention. As noted earlier, a system call is invoked by storing the appropriate value in a hardware register and performing a trap instruction. Unfortunately, these low-level operations cannot be performed using C language statements and instead require assembly instructions.
Fortunately, Linux provides macros for instantiating wrapper functions that contain the appropriate assembly instructions. For instance, the following C program uses the _syscall0() macro to invoke the newly defined system call:

    #include <linux/errno.h>
    #include <sys/syscall.h>
    #include <linux/unistd.h>

    _syscall0(int, helloworld);

    int main()
    {
        helloworld();
        return 0;
    }

• The _syscall0 macro takes two arguments. The first specifies the type of the value returned by the system call; the second is the name of the system call. The name is used to identify the system-call number that is stored in the hardware register before the trap instruction is executed. If your system call requires arguments, then a different macro (such as _syscall1, where the suffix indicates the number of arguments) could be used to instantiate the assembly code required for performing the system call.

• Compile and execute the program with the newly built kernel. There should be a message “hello world!” in the kernel log file /var/log/kernel/warnings to indicate that the system call has executed.

As a next step, consider expanding the functionality of your system call. How would you pass an integer value or a character string to the system call and have it printed into the kernel log file? What are the implications of passing pointers to data stored in the user program’s address space as opposed to simply passing an integer value from the user program to the kernel using hardware registers?

Wiley Plus

Visit Wiley Plus for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Dijkstra [1968] advocated the layered approach to operating-system design. Brinch-Hansen [1970] was an early proponent of constructing an operating system as a kernel (or nucleus) on which more complete systems can be built. System instrumentation and dynamic tracing are described in Tamches and Miller [1999]. DTrace is discussed in Cantrill et al. [2004]. The DTrace source code is available at http://src.opensolaris.org/source/. Cheung and Loong [1995] explore issues of operating-system structure from microkernel to extensible systems. MS-DOS, Version 3.1, is described in Microsoft [1986]. Windows NT and Windows 2000 are described by Solomon [1998] and Solomon and Russinovich [2000]. Windows 2003 and Windows XP internals are described in Russinovich and Solomon [2005]. Hart [2005] covers Windows systems programming in detail. BSD UNIX is described in McKusick et al. [1996]. Bovet and Cesati [2006] thoroughly discuss the Linux kernel. Several UNIX systems—including Mach—are treated in detail in Vahalia [1996]. Mac OS X is presented at http://www.apple.com/macosx and in Singh [2007]. Solaris is fully described in McDougall and Mauro [2007].

The first operating system to provide a virtual machine was the CP/67 on an IBM 360/67. The commercially available IBM VM/370 operating system was derived from CP/67. Details regarding Mach, a microkernel-based operating system, can be found in Young et al. [1987]. Kaashoek et al. [1997] present details regarding exokernel operating systems, wherein the architecture separates management issues from protection, thereby giving untrusted software the ability to exercise control over hardware and software resources.

The specifications for the Java language and the Java virtual machine are presented by Gosling et al. [2005] and by Lindholm and Yellin [1999], respectively.
The internal workings of the Java virtual machine are fully described by Venners [1998]. Golm et al. [2002] highlight the JX operating system; Back et al. [2000] cover several issues in the design of Java operating systems. More information on Java is available on the Web at http://java.sun.com. Details about the implementation of VMware can be found in Sugerman et al. [2001]. Information about the Open Virtual Machine Format can be found at http://www.vmware.com/appliances/learn/ovf.html.

Part Two: Process Management

A process can be thought of as a program in execution. A process will need certain resources—such as CPU time, memory, files, and I/O devices—to accomplish its task. These resources are allocated to the process either when it is created or while it is executing.

A process is the unit of work in most systems. Systems consist of a collection of processes: Operating-system processes execute system code, and user processes execute user code. All these processes may execute concurrently.

Although traditionally a process contained only a single thread of control as it ran, most modern operating systems now support processes that have multiple threads.

The operating system is responsible for the following activities in connection with process and thread management: the creation and deletion of both user and system processes; the scheduling of processes; and the provision of mechanisms for synchronization, communication, and deadlock handling for processes.

Chapter 3: Processes

Early computer systems allowed only one program to be executed at a time. This program had complete control of the system and had access to all the system's resources. In contrast, current-day computer systems allow multiple programs to be loaded into memory and executed concurrently. This evolution required firmer control and more compartmentalization of the various programs, and these needs resulted in the notion of a process, which is a program in execution. A process is the unit of work in a modern time-sharing system.

The more complex the operating system is, the more it is expected to do on behalf of its users. Although its main concern is the execution of user programs, it also needs to take care of various system tasks that are better left outside the kernel itself. A system therefore consists of a collection of processes: operating-system processes executing system code and user processes executing user code. Potentially, all these processes can execute concurrently, with the CPU (or CPUs) multiplexed among them. By switching the CPU between processes, the operating system can make the computer more productive. In this chapter, you will read about what processes are and how they work.

CHAPTER OBJECTIVES

• To introduce the notion of a process—a program in execution that forms the basis of all computation.
• To describe the various features of processes, including scheduling, creation and termination, and communication.
• To describe communication in client–server systems.

3.1 Process Concept

A question that arises in discussing operating systems involves what to call all the CPU activities. A batch system executes jobs, whereas a time-shared system has user programs, or tasks. Even on a single-user system such as the original Microsoft Windows, a user may be able to run several programs at one time: a word processor, a Web browser, and an e-mail package.
And even if the user can execute only one program at a time, the operating system may need to support its own internal programmed activities, such as memory management. In many respects, all these activities are similar, so we call all of them processes. The terms job and process are used almost interchangeably in this text. Although we personally prefer the term process, much of operating-system theory and terminology was developed during a time when the major activity of operating systems was job processing. It would be misleading to avoid the use of commonly accepted terms that include the word job (such as job scheduling) simply because process has superseded job.

3.1.1 The Process

Informally, as mentioned earlier, a process is a program in execution. A process is more than the program code, which is sometimes known as the text section. It also includes the current activity, as represented by the value of the program counter and the contents of the processor's registers. A process generally also includes the process stack, which contains temporary data (such as function parameters, return addresses, and local variables), and a data section, which contains global variables. A process may also include a heap, which is memory that is dynamically allocated during process run time. The structure of a process in memory is shown in Figure 3.1.

Figure 3.1 Process in memory.

We emphasize that a program by itself is not a process; a program is a passive entity, such as a file containing a list of instructions stored on disk (often called an executable file), whereas a process is an active entity, with a program counter specifying the next instruction to execute and a set of associated resources. A program becomes a process when an executable file is loaded into memory. Two common techniques for loading executable files are double-clicking an icon representing the executable file and entering the name of the executable file on the command line (as in prog.exe or a.out).

Figure 3.2 Diagram of process state.

Although two processes may be associated with the same program, they are nevertheless considered two separate execution sequences. For instance, several users may be running different copies of the mail program, or the same user may invoke many copies of the Web browser program. Each of these is a separate process, and although the text sections are equivalent, the data, heap, and stack sections vary. It is also common to have a process that spawns many processes as it runs. We discuss such matters in Section 3.4.

3.1.2 Process State

As a process executes, it changes state. The state of a process is defined in part by the current activity of that process. Each process may be in one of the following states:

• New. The process is being created.
• Running. Instructions are being executed.
• Waiting. The process is waiting for some event to occur (such as an I/O completion or reception of a signal).
• Ready. The process is waiting to be assigned to a processor.
• Terminated. The process has finished execution.

These names are arbitrary, and they vary across operating systems. The states that they represent are found on all systems, however. Certain operating systems also delineate process states more finely. It is important to realize that only one process can be running on any processor at any instant.
Many processes may be ready and waiting, however. The state diagram corresponding to these states is presented in Figure 3.2.

3.1.3 Process Control Block

Each process is represented in the operating system by a process control block (PCB)—also called a task control block. A PCB is shown in Figure 3.3. It contains many pieces of information associated with a specific process, including these:

Figure 3.3 Process control block (PCB).

• Process state. The state may be new, ready, running, waiting, halted, and so on.
• Program counter. The counter indicates the address of the next instruction to be executed for this process.
• CPU registers. The registers vary in number and type, depending on the computer architecture. They include accumulators, index registers, stack pointers, and general-purpose registers, plus any condition-code information. Along with the program counter, this state information must be saved when an interrupt occurs, to allow the process to be continued correctly afterward (Figure 3.4).
• CPU-scheduling information. This information includes a process priority, pointers to scheduling queues, and any other scheduling parameters. (Chapter 5 describes process scheduling.)
• Memory-management information. This information may include such items as the value of the base and limit registers, the page tables, or the segment tables, depending on the memory system used by the operating system (Chapter 8).
• Accounting information. This information includes the amount of CPU and real time used, time limits, account numbers, job or process numbers, and so on.
• I/O status information. This information includes the list of I/O devices allocated to the process, a list of open files, and so on.

In brief, the PCB simply serves as the repository for any information that may vary from process to process.

3.1.4 Threads

The process model discussed so far has implied that a process is a program that performs a single thread of execution. For example, when a process is running a word-processing program, a single thread of instructions is being executed. This single thread of control allows the process to perform only one task at a time. The user cannot simultaneously type in characters and run the spell checker within the same process, for example. Many modern operating systems have extended the process concept to allow a process to have multiple threads of execution and thus to perform more than one task at a time. On a system that supports threads, the PCB is expanded to include information for each thread. Other changes throughout the system are also needed to support threads. Chapter 4 explores multithreaded processes in detail.

Figure 3.4 CPU switch from process to process.

3.2 Process Scheduling

The objective of multiprogramming is to have some process running at all times, to maximize CPU utilization. The objective of time sharing is to switch the CPU among processes so frequently that users can interact with each program while it is running. To meet these objectives, the process scheduler selects an available process (possibly from a set of several available processes) for program execution on the CPU.
For a single-processor system, there will never be more than one running process. If there are more processes, the rest will have to wait until the CPU is free and can be rescheduled.

3.2.1 Scheduling Queues

As processes enter the system, they are put into a job queue that consists of all processes in the system. The processes that are residing in main memory and are ready and waiting to execute are kept on a list called the ready queue. This queue is generally stored as a linked list. A ready-queue header contains pointers to the first and final PCBs in the list. Each PCB includes a pointer field that points to the next PCB in the ready queue.

PROCESS REPRESENTATION IN LINUX

The process control block in the Linux operating system is represented by the C structure task_struct. This structure contains all the necessary information for representing a process, including the state of the process, scheduling and memory-management information, list of open files, and pointers to the process's parent and any of its children. (A process's parent is the process that created it; its children are any processes that it creates.) Some of these fields include:

    pid_t pid;                   /* process identifier */
    long state;                  /* state of the process */
    unsigned int time_slice;     /* scheduling information */
    struct task_struct *parent;  /* this process's parent */
    struct list_head children;   /* this process's children */
    struct files_struct *files;  /* list of open files */
    struct mm_struct *mm;        /* address space of this process */

For example, the state of a process is represented by the field long state in this structure. Within the Linux kernel, all active processes are represented using a doubly linked list of task_struct, and the kernel maintains a pointer—current—to the process currently executing on the system. This is shown in Figure 3.5.

Figure 3.5 Active processes in Linux.

As an illustration of how the kernel might manipulate one of the fields in the task_struct for a specified process, let's assume the system would like to change the state of the process currently running to the value new_state. If current is a pointer to the process currently executing, its state is changed with the following:

    current->state = new_state;

The system also includes other queues. When a process is allocated the CPU, it executes for a while and eventually quits, is interrupted, or waits for the occurrence of a particular event, such as the completion of an I/O request. Suppose the process makes an I/O request to a shared device, such as a disk. Since there are many processes in the system, the disk may be busy with the I/O request of some other process. The process therefore may have to wait for the disk. The list of processes waiting for a particular I/O device is called a device queue. Each device has its own device queue (Figure 3.6).

Figure 3.6 The ready queue and various I/O device queues.

A common representation of process scheduling is a queueing diagram, such as that in Figure 3.7. Each rectangular box represents a queue. Two types of queues are present: the ready queue and a set of device queues.
The circles represent the resources that serve the queues, and the arrows indicate the flow of processes in the system.

Figure 3.7 Queueing-diagram representation of process scheduling.

A new process is initially put in the ready queue. It waits there until it is selected for execution, or is dispatched. Once the process is allocated the CPU and is executing, one of several events could occur:

• The process could issue an I/O request and then be placed in an I/O queue.
• The process could create a new subprocess and wait for the subprocess's termination.
• The process could be removed forcibly from the CPU, as a result of an interrupt, and be put back in the ready queue.

In the first two cases, the process eventually switches from the waiting state to the ready state and is then put back in the ready queue. A process continues this cycle until it terminates, at which time it is removed from all queues and has its PCB and resources deallocated.

3.2.2 Schedulers

A process migrates among the various scheduling queues throughout its lifetime. The operating system must select, for scheduling purposes, processes from these queues in some fashion. The selection process is carried out by the appropriate scheduler. Often, in a batch system, more processes are submitted than can be executed immediately. These processes are spooled to a mass-storage device (typically a disk), where they are kept for later execution. The long-term scheduler, or job scheduler, selects processes from this pool and loads them into memory for execution. The short-term scheduler, or CPU scheduler, selects from among the processes that are ready to execute and allocates the CPU to one of them.

The primary distinction between these two schedulers lies in frequency of execution. The short-term scheduler must select a new process for the CPU frequently. A process may execute for only a few milliseconds before waiting for an I/O request. Often, the short-term scheduler executes at least once every 100 milliseconds. Because of the short time between executions, the short-term scheduler must be fast. If it takes 10 milliseconds to decide to execute a process for 100 milliseconds, then 10/(100 + 10) ≈ 9 percent of the CPU is being used (wasted) simply for scheduling the work.

The long-term scheduler executes much less frequently; minutes may separate the creation of one new process and the next. The long-term scheduler controls the degree of multiprogramming (the number of processes in memory). If the degree of multiprogramming is stable, then the average rate of process creation must be equal to the average departure rate of processes leaving the system. Thus, the long-term scheduler may need to be invoked only when a process leaves the system. Because of the longer interval between executions, the long-term scheduler can afford to take more time to decide which process should be selected for execution.

It is important that the long-term scheduler make a careful selection. In general, most processes can be described as either I/O bound or CPU bound. An I/O-bound process is one that spends more of its time doing I/O than it spends doing computations. A CPU-bound process, in contrast, generates I/O requests infrequently, using more of its time doing computations. The long-term scheduler needs to select a good process mix of I/O-bound and CPU-bound processes.
If all processes are I/O bound, the ready queue will almost always be empty, and the short-term scheduler will have little to do. If all processes are CPU bound, the I/O waiting queue will almost always be empty, devices will go unused, and again the system will be unbalanced. The system with the best performance will thus have a combination of CPU-bound and I/O-bound processes.

On some systems, the long-term scheduler may be absent or minimal. For example, time-sharing systems such as UNIX and Microsoft Windows systems often have no long-term scheduler but simply put every new process in memory for the short-term scheduler. The stability of these systems depends either on a physical limitation (such as the number of available terminals) or on the self-adjusting nature of human users. If performance declines to unacceptable levels on a multiuser system, some users will simply quit.

Some operating systems, such as time-sharing systems, may introduce an additional, intermediate level of scheduling. This medium-term scheduler is diagrammed in Figure 3.8. The key idea behind a medium-term scheduler is that sometimes it can be advantageous to remove processes from memory (and from active contention for the CPU) and thus reduce the degree of multiprogramming. Later, the process can be reintroduced into memory, and its execution can be continued where it left off. This scheme is called swapping. The process is swapped out, and is later swapped in, by the medium-term scheduler. Swapping may be necessary to improve the process mix or because a change in memory requirements has overcommitted available memory, requiring memory to be freed up. Swapping is discussed in Chapter 8.

Figure 3.8 Addition of medium-term scheduling to the queueing diagram.

3.2.3 Context Switch

As mentioned in Section 1.2.1, interrupts cause the operating system to change a CPU from its current task and to run a kernel routine. Such operations happen frequently on general-purpose systems. When an interrupt occurs, the system needs to save the current context of the process running on the CPU so that it can restore that context when its processing is done, essentially suspending the process and then resuming it. The context is represented in the PCB of the process; it includes the value of the CPU registers, the process state (see Figure 3.2), and memory-management information. Generically, we perform a state save of the current state of the CPU, be it in kernel or user mode, and then a state restore to resume operations.

Switching the CPU to another process requires performing a state save of the current process and a state restore of a different process. This task is known as a context switch. When a context switch occurs, the kernel saves the context of the old process in its PCB and loads the saved context of the new process scheduled to run. Context-switch time is pure overhead, because the system does no useful work while switching. Context-switching speed varies from machine to machine, depending on the memory speed, the number of registers that must be copied, and the existence of special instructions (such as a single instruction to load or store all registers). Typical speeds are a few milliseconds. Context-switch times are highly dependent on hardware support. For instance, some processors (such as the Sun UltraSPARC) provide multiple sets of registers.
A context switch here simply requires changing the pointer to the current register set. Of course, if there are more active processes than there are register sets, the system resorts to copying register data to and from memory, as before. Also, the more complex the operating system, the more work must be done during a context switch. As we will see in Chapter 8, advanced memory-management techniques may require extra data to be switched with each context. For instance, the address space of the current process must be preserved as the space of the next task is prepared for use. How the address space is preserved, and what amount of work is needed to preserve it, depend on the memory-management method of the operating system.

3.3 Operations on Processes

The processes in most systems can execute concurrently, and they may be created and deleted dynamically. Thus, these systems must provide a mechanism for process creation and termination. In this section, we explore the mechanisms involved in creating processes and illustrate process creation on UNIX and Windows systems, as well as the creation of processes using Java.

3.3.1 Process Creation

A process may create several new processes, via a create-process system call, during the course of execution. The creating process is called a parent process, and the new processes are called the children of that process. Each of these new processes may in turn create other processes, forming a tree of processes.

Most operating systems (including UNIX and the Windows family of operating systems) identify processes according to a unique process identifier (or pid), which is typically an integer number. Figure 3.9 illustrates a typical process tree for the Solaris operating system, showing the name of each process and its pid. In Solaris, the process at the top of the tree is the sched process, with pid of 0. The sched process creates several children processes—including pageout and fsflush. These processes are responsible for managing memory and file systems. The sched process also creates the init process, which serves as the root parent process for all user processes. In Figure 3.9, we see two children of init—inetd and dtlogin. inetd is responsible for networking services, such as telnet and ftp; dtlogin is the process representing a user login screen. When a user logs in, dtlogin creates an X-windows session (Xsession), which in turn creates the sdt_shel process. Below sdt_shel, a user's command-line shell—the C-shell or csh—is created. In this command-line interface, the user can then invoke various child processes, such as the ls and cat commands. We also see a csh process with pid of 7778, representing a user who has logged onto the system using telnet. This user has started the Netscape browser (pid of 7785) and the emacs editor (pid of 8105).

On UNIX, we can obtain a listing of processes by using the ps command. For example, the command ps -el will list complete information for all processes currently active in the system. It is easy to construct a process tree similar to what is shown in Figure 3.9 by recursively tracing parent processes all the way to the init process.

In general, a process will need certain resources (CPU time, memory, files, I/O devices) to accomplish its task.
Figure 3.9 A tree of processes on a typical Solaris system.

When a process creates a subprocess, that subprocess may be able to obtain its resources directly from the operating system, or it may be constrained to a subset of the resources of the parent process. The parent may have to partition its resources among its children, or it may be able to share some resources (such as memory or files) among several of its children. Restricting a child process to a subset of the parent's resources prevents any process from overloading the system by creating too many subprocesses.

In addition to the various physical and logical resources that a process obtains when it is created, initialization data (input) may be passed along by the parent process to the child process. For example, consider a process whose function is to display the contents of a file—say, img.jpg—on the screen of a terminal. When it is created, it will get, as an input from its parent process, the name of the file img.jpg, and it will use that file name, open the file, and write the contents out. It may also get the name of the output device. Some operating systems pass resources to child processes. On such a system, the new process may get two open files, img.jpg and the terminal device, and may simply transfer the datum between the two.

When a process creates a new process, two possibilities exist in terms of execution:

1. The parent continues to execute concurrently with its children.
2. The parent waits until some or all of its children have terminated.

There are also two possibilities in terms of the address space of the new process:

1. The child process is a duplicate of the parent process (it has the same program and data as the parent).
2. The child process has a new program loaded into it.

Next, we illustrate these differences on UNIX and Windows systems.

3.3.1.1 Process Creation in UNIX

In UNIX, as we've seen, each process is identified by its process identifier, which is a unique integer. A new process is created by the fork() system call. The new process consists of a copy of the address space of the original process. This mechanism allows the parent process to communicate easily with its child process. Both processes (the parent and the child) continue execution at the instruction after the fork(), with one difference: the return code for the fork() is zero for the new (child) process, whereas the (nonzero) process identifier of the child is returned to the parent.

Typically, the exec() system call is used after a fork() system call by one of the two processes to replace the process's memory space with a new program. The exec() system call loads a binary file into memory (destroying the memory image of the program containing the exec() system call) and starts its execution. In this manner, the two processes are able to communicate and then go their separate ways. The parent can then create more children; or, if it has nothing else to do while the child runs, it can issue a wait() system call to move itself off the ready queue until the termination of the child.
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main()
    {
        pid_t pid;

        /* fork a child process */
        pid = fork();

        if (pid < 0) { /* error occurred */
            fprintf(stderr, "Fork Failed");
            return 1;
        }
        else if (pid == 0) { /* child process */
            execlp("/bin/ls", "ls", NULL);
        }
        else { /* parent process */
            /* parent will wait for the child to complete */
            wait(NULL);
            printf("Child Complete");
        }
        return 0;
    }

Figure 3.10 Creating a separate process using the UNIX fork() system call.

The C program shown in Figure 3.10 illustrates the UNIX system calls previously described. We now have two different processes running copies of the same program. The only difference is that the value of pid (the process identifier) for the child process is zero, while that for the parent is an integer value greater than zero (in fact, it is the actual pid of the child process). The child process inherits privileges and scheduling attributes from the parent, as well as certain resources, such as open files. The child process then overlays its address space with the UNIX command /bin/ls (used to get a directory listing) using the execlp() system call (execlp() is a version of the exec() system call). The parent waits for the child process to complete with the wait() system call. When the child process completes (by either implicitly or explicitly invoking exit()), the parent process resumes from the call to wait(), where it completes using the exit() system call. This is illustrated in Figure 3.11.

Figure 3.11 Process creation using the fork() system call.

3.3.1.2 Process Creation in Windows

As an alternative example, consider process creation in Windows. Processes are created in the Win32 API using the CreateProcess() function, which is similar to fork() in that a parent creates a new child process. However, whereas fork() has the child process inheriting the address space of its parent, CreateProcess() requires loading a specified program into the address space of the child process at process creation. Furthermore, whereas fork() is passed no parameters, CreateProcess() expects no fewer than ten parameters.

The C program shown in Figure 3.12 illustrates the CreateProcess() function, which creates a child process that loads the application mspaint.exe. In the program, we opt for many of the default values of the ten parameters passed to CreateProcess(). Readers interested in pursuing the details of process creation and management in the Win32 API are encouraged to consult the bibliographical notes at the end of this chapter.

Two parameters passed to CreateProcess() are instances of the STARTUPINFO and PROCESS_INFORMATION structures. STARTUPINFO specifies many properties of the new process, such as window size and appearance and handles to standard input and output files. The PROCESS_INFORMATION structure contains a handle and the identifiers to the newly created process and its thread. We invoke the ZeroMemory() function to zero out the memory of each of these structures before proceeding with CreateProcess().

The first two parameters passed to CreateProcess() are the application name and command-line parameters. If the application name is NULL (as it is in this case), the command-line parameter specifies the application to load. In this instance, we are loading the Microsoft Windows mspaint.exe application. Beyond these two initial parameters, we use the default parameters for inheriting process and thread handles as well as specifying no creation flags.
We also use the parent's existing environment block and starting directory. Last, we provide two pointers to the STARTUPINFO and PROCESS_INFORMATION structures created at the beginning of the program. In Figure 3.10, the parent process waits for the child to complete by invoking the wait() system call. The equivalent of this in Win32 is WaitForSingleObject(), which is passed a handle of the child process—pi.hProcess—and waits for this process to complete. Once the child process exits, control returns from the WaitForSingleObject() function in the parent process.

    #include <stdio.h>
    #include <windows.h>

    int main(VOID)
    {
        STARTUPINFO si;
        PROCESS_INFORMATION pi;

        // allocate memory
        ZeroMemory(&si, sizeof(si));
        si.cb = sizeof(si);
        ZeroMemory(&pi, sizeof(pi));

        // create child process
        if (!CreateProcess(NULL, // use command line
            "C:\\WINDOWS\\system32\\mspaint.exe", // command line
            NULL,  // don't inherit process handle
            NULL,  // don't inherit thread handle
            FALSE, // disable handle inheritance
            0,     // no creation flags
            NULL,  // use parent's environment block
            NULL,  // use parent's existing directory
            &si,
            &pi))
        {
            fprintf(stderr, "Create Process Failed");
            return -1;
        }
        // parent will wait for the child to complete
        WaitForSingleObject(pi.hProcess, INFINITE);
        printf("Child Complete");

        // close handles
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }

Figure 3.12 Creating a separate process using the Win32 API.

3.3.1.3 Process Creation in Java

When a Java program begins execution, an instance of the Java virtual machine is created. On most systems, the JVM appears as an ordinary application running as a separate process on the host operating system. Each instance of the JVM provides support for multiple threads of control; but Java does not support a process model, which would allow the JVM to create several processes within the same virtual machine. Although there is considerable ongoing research in this area, the primary reason why Java currently does not support a process model is that it is difficult to isolate one process's memory from that of another within the same virtual machine.

It is possible to create a process external to the JVM, however, by using the ProcessBuilder class, which allows a Java program to specify a process that is native to the operating system (such as /usr/bin/ls or C:\\WINDOWS\\system32\\mspaint.exe). This is illustrated in Figure 3.13.

    import java.io.*;

    public class OSProcess
    {
        public static void main(String[] args) throws IOException {
            if (args.length != 1) {
                System.err.println("Usage: java OSProcess <command>");
                System.exit(0);
            }

            // args[0] is the command that is run in a separate process
            ProcessBuilder pb = new ProcessBuilder(args[0]);
            Process process = pb.start();

            // obtain the input stream
            InputStream is = process.getInputStream();
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);

            // read the output of the process
            String line;
            while ((line = br.readLine()) != null)
                System.out.println(line);
            br.close();
        }
    }

Figure 3.13 Creating an external process using the Java API.

Running this program involves passing the name of the program that is to run as an external process on the command line. We create the new process by invoking the start() method of the ProcessBuilder class, which returns an instance of a Process object. This process will run external to the virtual machine and cannot affect the virtual machine—and vice versa.
Communication between the virtual machine and the external process occurs through the InputStream and OutputStream of the external process.

3.3.2 Process Termination

A process terminates when it finishes executing its final statement and asks the operating system to delete it by using the exit() system call. At that point, the process may return a status value (typically an integer) to its parent process (via the wait() system call). All the resources of the process—including physical and virtual memory, open files, and I/O buffers—are deallocated by the operating system.

Termination can occur in other circumstances as well. A process can cause the termination of another process via an appropriate system call (for example, TerminateProcess() in Win32). Usually, such a system call can be invoked only by the parent of the process that is to be terminated. Otherwise, users could arbitrarily kill each other's jobs. Note that a parent needs to know the identities of its children. Thus, when one process creates a new process, the identity of the newly created process is passed to the parent.

A parent may terminate the execution of one of its children for a variety of reasons, such as these:

• The child has exceeded its usage of some of the resources that it has been allocated. (To determine whether this has occurred, the parent must have a mechanism to inspect the state of its children.)
• The task assigned to the child is no longer required.
• The parent is exiting, and the operating system does not allow a child to continue if its parent terminates.

Some systems, including VMS, do not allow a child to exist if its parent has terminated. In such systems, if a process terminates (either normally or abnormally), then all its children must also be terminated. This phenomenon, referred to as cascading termination, is normally initiated by the operating system.

In UNIX, we can terminate a process by using the exit() system call; its parent process may wait for the termination of a child process by using the wait() system call. The wait() system call returns the process identifier of a terminated child so that the parent can tell which of its children has terminated. If the parent terminates, however, all its children have assigned as their new parent the init process. Thus, the children still have a parent to collect their status and execution statistics.

3.4 Interprocess Communication

Processes executing concurrently in the operating system may be either independent processes or cooperating processes. A process is independent if it cannot affect or be affected by the other processes executing in the system. Any process that does not share data with any other process is independent. A process is cooperating if it can affect or be affected by the other processes executing in the system. Clearly, any process that shares data with other processes is a cooperating process.

There are several reasons for providing an environment that allows process cooperation:

• Information sharing. Since several users may be interested in the same piece of information (for instance, a shared file), we must provide an environment to allow concurrent access to such information.
• Computation speedup. If we want a particular task to run faster, we must break it into subtasks, each of which will execute in parallel with the others. Notice that such a speedup can be achieved only if the computer has multiple processing elements (such as CPUs or I/O channels).
• Modularity.
We may want to construct the system in a modular fashion, dividing the system functions into separate processes or threads, as we discussed in Chapter 2.

• Convenience. Even an individual user may work on many tasks at the same time. For instance, a user may be editing, printing, and compiling in parallel.

Cooperating processes require an interprocess communication (IPC) mechanism that will allow them to exchange data and information. There are two fundamental models of interprocess communication: (1) shared memory and (2) message passing. In the shared-memory model, a region of memory that is shared by cooperating processes is established. Processes can then exchange information by reading and writing data to the shared region. In the message-passing model, communication takes place by means of messages exchanged between the cooperating processes. The two communications models are contrasted in Figure 3.14.

Figure 3.14 Communications models. (a) Message passing. (b) Shared memory.

Both models are common in operating systems, and many systems implement both. Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. Message passing is also easier to implement than is shared memory for intercomputer communication. Shared memory allows maximum speed and convenience of communication. Shared memory is faster than message passing because message-passing systems are typically implemented using system calls and thus require the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish shared-memory regions. Once shared memory is established, all accesses are treated as routine memory accesses, and no assistance from the kernel is required. In the remainder of this section, we explore these IPC models in more detail.

3.4.1 Shared-Memory Systems

Interprocess communication using shared memory requires communicating processes to establish a region of shared memory. Typically, a shared-memory region resides in the address space of the process creating the shared-memory segment. Other processes that wish to communicate using this shared-memory segment must attach it to their address space. Recall that, normally, the operating system tries to prevent one process from accessing another process's memory. Shared memory requires that two or more processes agree to remove this restriction. They can then exchange information by reading and writing data in the shared areas. The form of the data and the location are determined by these processes and are not under the operating system's control. The processes are also responsible for ensuring that they are not writing to the same location simultaneously.

    public interface Buffer<E>
    {
        // Producers call this method
        public void insert(E item);

        // Consumers call this method
        public E remove();
    }

Figure 3.15 Interface for buffer implementations.

To illustrate the concept of cooperating processes, let's consider the producer–consumer problem, which is a common paradigm for cooperating processes. A producer process produces information that is consumed by a consumer process. For example, a compiler may produce assembly code that is consumed by an assembler. The assembler, in turn, may produce object modules that are consumed by the loader.
The producer–consumer problem also provides a useful metaphor for the client–server paradigm. We generally think of a server as a producer and a client as a consumer. For example, a web server produces (that is, provides) HTML files and images, which are consumed (that is, read) by the client web browser requesting the resource.

One solution to the producer–consumer problem uses shared memory. To allow producer and consumer processes to run concurrently, we must have available a buffer of items that can be filled by the producer and emptied by the consumer. This buffer will reside in a region of memory that is shared by the producer and consumer processes. A producer can produce one item while the consumer is consuming another item. The producer and consumer must be synchronized, so that the consumer does not try to consume an item that has not yet been produced.

Two types of buffers can be used. The unbounded buffer places no practical limit on the size of the buffer. The consumer may have to wait for new items, but the producer can always produce new items. The bounded buffer assumes a fixed buffer size. In this case, the consumer must wait if the buffer is empty, and the producer must wait if the buffer is full.

We now illustrate a solution to the producer–consumer problem using shared memory. Such solutions may implement the Buffer interface shown in Figure 3.15. The producer process invokes the insert() method (Figure 3.16) when it wishes to enter an item in the buffer, and the consumer calls the remove() method (Figure 3.17) when it wants to consume an item from the buffer. Although Java does not provide support for shared memory, we can design a solution to the producer–consumer problem in Java that emulates shared memory by allowing the producer and consumer processes to share an instance of the BoundedBuffer class (Figure 3.18), which implements the Buffer interface. Such sharing involves passing a reference to an instance of the BoundedBuffer class to the producer and consumer processes. This is illustrated in Figure 3.19.

    // Producers call this method
    public void insert(E item) {
        while (count == BUFFER_SIZE)
            ; // do nothing -- no free space

        // add an item to the buffer
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
        ++count;
    }

Figure 3.16 The insert() method.

The shared buffer is implemented as a circular array with two logical pointers: in and out. The variable in points to the next free position in the buffer; out points to the first full position in the buffer. The variable count is the number of items currently in the buffer. The buffer is empty when count == 0 and is full when count == BUFFER_SIZE. Note that both the producer and the consumer will block in the while loop if the buffer is not usable to them. In Chapter 6, we discuss how synchronization among cooperating processes can be implemented effectively in a shared-memory environment.

3.4.2 Message-Passing Systems

In Section 3.4.1, we showed how cooperating processes can communicate in a shared-memory environment. The scheme requires that these processes share a region of memory and that the code for accessing and manipulating the shared memory be written explicitly by the application programmer.
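Before turning to message passing, it is worth seeing what that explicit code looks like when the operating system, rather than a Java object, supplies the shared region. The following sketch uses the POSIX shared-memory calls shm_open(), ftruncate(), and mmap(); the object name "/osc-demo" and the message written are illustrative only, error handling is minimal, and a real producer–consumer pair would still need the synchronization discussed in Chapter 6 (on older Linux systems, link with -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *name = "/osc-demo";   /* illustrative object name */
        const size_t size = 4096;

        /* create (or open) a named shared-memory object and set its size */
        int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
        if (fd < 0) {
            perror("shm_open");
            return 1;
        }
        ftruncate(fd, size);

        /* attach the region to this process's address space */
        char *region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* any cooperating process that maps the same name sees this data */
        strcpy(region, "hello from the producer");
        printf("wrote: %s\n", region);

        munmap(region, size);
        close(fd);
        shm_unlink(name);   /* remove the object when no longer needed */
        return 0;
    }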
    // Consumers call this method
    public E remove() {
        E item;

        while (count == 0)
            ; // do nothing -- nothing to consume

        // remove an item from the buffer
        item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        --count;

        return item;
    }

Figure 3.17 The remove() method.

    import java.util.*;

    public class BoundedBuffer<E> implements Buffer<E>
    {
        private static final int BUFFER_SIZE = 5;

        private int count;  // number of items in the buffer
        private int in;     // points to the next free position
        private int out;    // points to the next full position
        private E[] buffer;

        public BoundedBuffer() {
            // buffer is initially empty
            count = 0;
            in = 0;
            out = 0;
            buffer = (E[]) new Object[BUFFER_SIZE];
        }

        // Producers call this method
        public void insert(E item) {
            // Figure 3.16
        }

        // Consumers call this method
        public E remove() {
            // Figure 3.17
        }
    }

Figure 3.18 Shared-memory solution to the producer–consumer problem.

Figure 3.19 Simulating shared memory in Java.

Another way to achieve the same effect is for the operating system to provide the means for cooperating processes to communicate with each other via a message-passing facility. Message passing provides a mechanism to allow processes to communicate and to synchronize their actions without sharing the same address space and is particularly useful in a distributed environment, where the communicating
• A link is associated with exactly two processes. • Between each pair of processes, there exists exactly one link. This scheme exhibits symmetry in addressing; that is, both the sender process and the receiver process must name the other to communicate. A variant of this scheme employs asymmetry in addressing. Here, only the sender 3.4 Interprocess Communication 125 names the recipient; the recipient is not required to name the sender. In this scheme, the send() and receive() primitives are defined as follows: • send(P, message)—Senda message to process P. • receive(id, message)—Receive a message from any process; the vari- able id is set to the name of the process with which communication has taken place. The disadvantage in both of these schemes (symmetric and asymmetric) is the limited modularity of the resulting process definitions. Changing the identifier of a process may necessitate examining all other process definitions. All references to the old identifier must be found, so that they can be modified to the new identifier. In general, any such hard-coding techniques, where identifiers must be explicitly stated, are less desirable than techniques involving indirection, as described next. With indirect communication, the messages are sent to and received from mailboxes,orports. A mailbox can be viewed abstractly as an object into which messages can be placed by processes and from which messages can be removed. Each mailbox has a unique identification. For example, POSIX message queues use an integer value to identify a mailbox. In this scheme, a process can communicate with some other process via a number of different mailboxes. Two processes can communicate only if the processes have a shared mailbox, however. The send() and receive() primitives are defined as follows: • send(A, message)—Senda message to mailbox A. • receive(A, message)—Receive a message from mailbox A. In this scheme, a communication link has the following properties: • A link is established between a pair of processes only if both members of the pair have a shared mailbox. • A link may be associated with more than two processes. • Between each pair of communicating processes, there may be a number of different links, with each link corresponding to one mailbox. Now suppose that processes P1, P2,andP3 all share mailbox A.Process P1 sends a message to A, while both P2 and P3 execute a receive() from A. Which process will receive the message sent by P1? The answer depends on which of the following methods we choose: • Allow a link to be associated with at most two processes. • Allowatmostoneprocessatatimetoexecuteareceive() operation. • Allow the system to select arbitrarily which process will receive the message (that is, either P2 or P3, but not both, will receive the message). The system also may define an algorithm for selecting which process will receive the message (that is, round robin, where processes take turns receiving messages). The system may identify the receiver to the sender. 126 Chapter 3 Processes A mailbox may be owned either by a process or by the operating system. If the mailbox is owned by a process (that is, the mailbox is part of the address space of the process), then we distinguish between the owner (who can only receive messages through this mailbox) and the user (who can only send messages to the mailbox). Since each mailbox has a unique owner, there can be no confusion about who should receive a message sent to this mailbox. 
When a process that owns a mailbox terminates, the mailbox disappears. Any process that subsequently sends a message to this mailbox must be notified that the mailbox no longer exists.

In contrast, a mailbox that is owned by the operating system has an existence of its own. It is independent and is not attached to any particular process. The operating system then must provide a mechanism that allows a process to do the following:

• Create a new mailbox.
• Send and receive messages through the mailbox.
• Delete a mailbox.

The process that creates a new mailbox is that mailbox's owner by default. Initially, the owner is the only process that can receive messages through this mailbox. However, the ownership and receiving privilege may be passed to other processes through appropriate system calls. Of course, this provision could result in multiple receivers for each mailbox.

3.4.2.2 Synchronization

Communication between processes takes place through calls to send() and receive() primitives. There are different design options for implementing each primitive. Message passing may be either blocking or nonblocking—also known as synchronous and asynchronous.

• Blocking send. The sending process is blocked until the message is received by the receiving process or by the mailbox.
• Nonblocking send. The sending process sends the message and resumes operation.
• Blocking receive. The receiver blocks until a message is available.
• Nonblocking receive. The receiver retrieves either a valid message or a null.

Different combinations of send() and receive() are possible. When both send() and receive() are blocking, we have a rendezvous between the sender and the receiver. The solution to the producer–consumer problem becomes trivial when we use blocking send() and receive() statements. The producer merely invokes the blocking send() call and waits until the message is delivered to either the receiver or the mailbox. Likewise, when the consumer invokes receive(), it blocks until a message is available. Note that the concepts of synchronous and asynchronous occur frequently in operating-system I/O algorithms, as you will see throughout this text.

    public interface Channel<E>
    {
        // Send a message to the channel
        public void send(E item);

        // Receive a message from the channel
        public E receive();
    }

Figure 3.20 Interface for message passing.

    import java.util.Vector;

    public class MessageQueue<E> implements Channel<E>
    {
        private Vector<E> queue;

        public MessageQueue() {
            queue = new Vector<E>();
        }

        // This implements a nonblocking send
        public void send(E item) {
            queue.addElement(item);
        }

        // This implements a nonblocking receive
        public E receive() {
            if (queue.size() == 0)
                return null;
            else
                return queue.remove(0);
        }
    }

Figure 3.21 Mailbox for message passing.

3.4.2.3 Buffering

Whether communication is direct or indirect, messages exchanged by communicating processes reside in a temporary queue. Basically, such queues can be implemented in three ways:

• Zero capacity. The queue has a maximum length of zero; thus, the link cannot have any messages waiting in it. In this case, the sender must block until the recipient receives the message.
• Bounded capacity. The queue has finite length n; thus, at most n messages can reside in it. If the queue is not full when a new message is sent, the message is placed in the queue (either the message is copied or a pointer to the message is kept), and the sender can continue execution without
waiting. Recall that the link's capacity is finite, however. If the link is full, the sender must block until space is available in the queue.
• Unbounded capacity. The queue's length is potentially infinite; thus, any number of messages can wait in it. The sender never blocks.

The zero-capacity case is sometimes referred to as a message system with no buffering; the other cases are referred to as systems with automatic buffering.

3.4.2.4 An Example: Message Passing with Java

Now, we'll examine a solution to the producer–consumer problem that uses message passing. Our solution will implement the Channel interface shown in Figure 3.20. The producer and consumer will communicate indirectly using the shared mailbox illustrated in Figure 3.21. The buffer is implemented using the java.util.Vector class, meaning that it will be a buffer of unbounded capacity. Also note that both the send() and receive() methods are nonblocking.

When the producer generates an item, it places that item in the mailbox via the send() method. The code for the producer is shown in Figure 3.22.

    Channel<Date> mailBox;

    while (true) {
        Date message = new Date();
        mailBox.send(message);
    }

Figure 3.22 The producer process.

The consumer obtains an item from the mailbox using the receive() method. Because receive() is nonblocking, the consumer must evaluate the value of the Object returned from receive(). If it is null, the mailbox is empty. The code for the consumer is shown in Figure 3.23.

    Channel<Date> mailBox;

    while (true) {
        Date message = mailBox.receive();
        if (message != null) {
            // consume the message
        }
    }

Figure 3.23 The consumer process.

Chapter 4 shows how to implement the producer and consumer as separate threads of control and how to allow the mailbox to be shared between the threads.

3.5 Examples of IPC Systems

In this section, we explore two different IPC systems. We first describe message passing in the Mach operating system. Then, we discuss Windows XP, which uses shared memory as a mechanism for providing certain types of message passing.

3.5.1 An Example: Mach

The Mach operating system, developed at Carnegie Mellon University, is an example of a message-based operating system. You may recall that we introduced Mach in Chapter 2 as part of the Mac OS X operating system. The Mach kernel supports the creation and destruction of multiple tasks, which are similar to processes but have multiple threads of control. Most communication in Mach—including most system calls and all intertask information—is carried out by messages. Messages are sent to and received from mailboxes, called ports in Mach. As just noted, even system calls are made by messages.

When a task is created, two special mailboxes—the Kernel mailbox and the Notify mailbox—are also created. The Kernel mailbox is used by the kernel to communicate with the task. The kernel sends notification of event occurrences to the Notify port. Only three system calls are needed for message transfer. The msg_send() call sends a message to a mailbox. A message is received via msg_receive(). Remote procedure calls (RPCs) are executed via msg_rpc(), which sends a message and waits for exactly one return message from the sender. In this way, the RPC models a typical subroutine procedure call but can work between systems—hence the term remote.

The port_allocate() system call creates a new mailbox and allocates space for its queue of messages. The maximum size of the message queue defaults to eight messages.
The task that creates the mailbox is that mailbox's owner. The owner is also allowed to receive from the mailbox. Only one task at a time can either own or receive from a mailbox, but these rights can be sent to other tasks if desired.

The mailbox's message queue is initially empty. As messages are sent to the mailbox, the messages are copied into the mailbox. All messages have the same priority. Mach guarantees that multiple messages from the same sender are queued in first-in, first-out (FIFO) order but does not guarantee an absolute ordering. For instance, messages from two senders may be queued in any order.

The messages themselves consist of a fixed-length header followed by a variable-length data portion. The header indicates the length of the message and includes two mailbox names. One mailbox name is the mailbox to which the message is being sent. Commonly, the sending thread expects a reply, so the mailbox name of the sender is passed on to the receiving task, which can use it as a "return address."

The variable part of a message is a list of typed data items. Each entry in the list has a type, size, and value. The type of the objects specified in the message is important, since objects defined by the operating system—such as ownership or receive access rights, task states, and memory segments—may be sent in messages.

The send and receive operations themselves are flexible. For instance, when a message is sent to a mailbox, the mailbox may be full. If the mailbox is not full, the message is copied to the mailbox, and the sending thread continues. If the mailbox is full, the sending thread has four options:

1. Wait indefinitely until there is room in the mailbox.
2. Wait at most n milliseconds.
3. Do not wait at all but rather return immediately.
4. Temporarily cache a message. One message can be given to the operating system to keep, even though the mailbox to which that message is being sent is full. When the message can be put in the mailbox, a message is sent back to the sender; only one such message to a full mailbox can be pending at any time for a given sending thread.

The final option is meant for server tasks, such as a line-printer driver. After finishing a request, such tasks may need to send a one-time reply to the task that had requested service, but they must also continue with other service requests, even if the reply mailbox for a client is full.

The receive operation must specify the mailbox or mailbox set from which a message is to be received. A mailbox set is a collection of mailboxes, as declared by the task, which can be grouped together and treated as one mailbox for the purposes of the task. Threads in a task can receive only from a mailbox or mailbox set for which the task has receive access. A port_status() system call returns the number of messages in a given mailbox. The receive operation attempts to receive from (1) any mailbox in a mailbox set or (2) a specific (named) mailbox. If no message is waiting to be received, the receiving thread can either wait at most n milliseconds or not wait at all.

The Mach system was especially designed for distributed systems, which we discuss in Chapters 16 through 18, but Mach is also suitable for single-processor systems, as evidenced by its inclusion in the Mac OS X system. The major problem with message systems has generally been poor performance caused by double copying of messages; the message is copied first from the sender to the mailbox and then from the mailbox to the receiver.
The Mach message system attempts to avoid double-copy operations by using virtual-memory-management techniques (discussed in Chapter 9). Essentially, Mach maps the address space containing the sender's message into the receiver's address space. The message itself is never actually copied. This message-management technique provides a large performance boost but works for only intrasystem messages. The Mach operating system is discussed further in an extra chapter posted on our website.

3.5.2 An Example: Windows XP

The Windows XP operating system (described further in Chapter 22) is an example of modern design that employs modularity to increase functionality and decrease the time needed to implement new features. Windows XP provides support for multiple operating environments, or subsystems, with which application programs communicate via a message-passing mechanism. The application programs can be considered clients of the Windows XP subsystem server.

The message-passing facility in Windows XP is called the local procedure-call (LPC) facility. The LPC in Windows XP communicates between two processes on the same machine. It is similar to the standard, widely used RPC mechanism, but it is optimized for and specific to Windows XP. Like Mach, Windows XP uses a port object to establish and maintain a connection between two processes. Every client that calls a subsystem needs a communication channel, which is provided by a port object and is never inherited. Windows XP uses two types of ports: connection ports and communication ports. They are really the same but are given different names according to how they are used. Connection ports are named objects and are visible to all processes; they give applications a way to set up communication channels. The communication works as follows:

• The client opens a handle to the subsystem's connection port object.
• The client sends a connection request.
• The server creates two private communication ports and returns the handle to one of them to the client.
• The client and server use the corresponding port handle to send messages or callbacks and to listen for replies.

Windows XP uses two types of message-passing techniques over a port, which the client specifies when it establishes the channel. The simplest, which is used for small messages, uses the port's message queue as intermediate storage and copies the message from one process to the other. Under this method, messages of up to 256 bytes can be sent.

If a client needs to send a larger message, it passes the message through a section object, which sets up a region of shared memory. The client has to decide when it sets up the channel whether or not it will need to send a large message. If the client determines that it does want to send large messages, it asks for a section object to be created. Similarly, if the server decides that replies will be large, it creates a section object. So that the section object can be used, a small message is sent that contains a pointer and size information about the section object. This method is more complicated than the first method, but it avoids data copying. In both cases, a callback mechanism can be used when either the client or the server cannot respond immediately to a request. The callback mechanism allows them to perform asynchronous message handling. The structure of local procedure calls in Windows XP is shown in Figure 3.24.
It is important to note that the LPC facility in Windows XP is not part of the Win32 API and hence is not visible to the application programmer. Rather, applications using the Win32 API invoke standard remote procedure calls. When the RPC is being invoked on a process on the same system, the RPC is indirectly handled through a local procedure call. LPCs are also used in a few other functions that are part of the Win32 API.

[Figure 3.24 Local procedure calls in Windows XP: the client and server each hold a handle to a private communication port, connected through a named connection port, with an optional shared section object for messages larger than 256 bytes.]

3.6 Communication in Client–Server Systems

In Section 3.4, we described how processes can communicate using shared memory and message passing. These techniques can be used for communication in client–server systems (Section 1.12.2) as well. In this section, we explore three other strategies for communication in client–server systems: sockets, remote procedure calls (RPCs), and Java's remote method invocation (RMI).

3.6.1 Sockets

A socket is defined as an endpoint for communication. A pair of processes communicating over a network employ a pair of sockets—one for each process. A socket is identified by an IP address concatenated with a port number. In general, sockets use a client–server architecture. The server waits for incoming client requests by listening to a specified port. Once a request is received, the server accepts a connection from the client socket to complete the connection. Servers implementing specific services (such as telnet, FTP, and HTTP) listen to well-known ports (a telnet server listens to port 23, an FTP server listens to port 21, and a Web, or HTTP, server listens to port 80). All ports below 1024 are considered well known; we can use them to implement standard services.

When a client process initiates a request for a connection, it is assigned a port by its host computer. This port is some arbitrary number greater than 1024. For example, if a client on host X with IP address 146.86.5.20 wishes to establish a connection with a Web server (which is listening on port 80) at address 161.25.19.8, host X may be assigned port 1625. The connection will consist of a pair of sockets: (146.86.5.20:1625) on host X and (161.25.19.8:80) on the Web server. This situation is illustrated in Figure 3.25. The packets traveling between the hosts are delivered to the appropriate process based on the destination port number.

[Figure 3.25 Communication using sockets: socket (146.86.5.20:1625) on host X connected to socket (161.25.19.8:80) on the Web server.]

All connections must be unique. Therefore, if another process also on host X wished to establish another connection with the same Web server, it would be assigned a port number greater than 1024 and not equal to 1625. This ensures that all connections consist of a unique pair of sockets.

To explore socket programming further, we turn next to an illustration using Java. Java provides an easy interface for socket programming and has a rich library for additional networking utilities. Java provides three different types of sockets. Connection-oriented (TCP) sockets are implemented with the Socket class. Connectionless (UDP) sockets use the DatagramSocket class. Finally, the MulticastSocket class is a subclass of the DatagramSocket class. A multicast socket allows data to be sent to multiple recipients.
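The running example that follows uses connection-oriented TCP sockets. For contrast, a connectionless variant of the same date service can be sketched with the DatagramSocket class. This sketch is ours rather than the book's, and the port number 6014 is an arbitrary choice above 1024.

    import java.net.*;

    public class UDPDateServer {
        public static void main(String[] args) throws Exception {
            // no connection is ever established; the server simply waits
            // for individual datagrams on an arbitrary unprivileged port
            DatagramSocket sock = new DatagramSocket(6014);
            byte[] buf = new byte[512];

            while (true) {
                // block until a datagram arrives
                DatagramPacket request = new DatagramPacket(buf, buf.length);
                sock.receive(request);

                // reply to whatever address and port the datagram came from
                byte[] date = new java.util.Date().toString().getBytes();
                DatagramPacket reply = new DatagramPacket(date, date.length,
                    request.getAddress(), request.getPort());
                sock.send(reply);
            }
        }
    }

Because UDP is connectionless, there is no accept() step: each request and reply is a self-contained datagram, and delivery is not guaranteed.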
Our example describes a date server that uses connection-oriented TCP sockets. The operation allows clients to request the current date and time from the server. The server listens to port 6013, although the port could have any arbitrary number greater than 1024. When a connection is received, the server returns the date and time to the client.

The date server is shown in Figure 3.26. The server creates a ServerSocket that specifies it will listen to port 6013. The server then begins listening to the port with the accept() method. The server blocks on the accept() method waiting for a client to request a connection. When a connection request is received, accept() returns a socket that the server can use to communicate with the client.

The details of how the server communicates with the socket are as follows. The server first establishes a PrintWriter object that it will use to communicate with the client. A PrintWriter object allows the server to write to the socket using the routine print() and println() methods for output. The server process sends the date to the client, calling the method println(). Once it has written the date to the socket, the server closes the socket to the client and resumes listening for more requests.

    import java.net.*;
    import java.io.*;

    public class DateServer {
        public static void main(String[] args) {
            try {
                ServerSocket sock = new ServerSocket(6013);

                // now listen for connections
                while (true) {
                    Socket client = sock.accept();
                    PrintWriter pout = new PrintWriter(client.getOutputStream(), true);

                    // write the Date to the socket
                    pout.println(new java.util.Date().toString());

                    // close the socket and resume
                    // listening for connections
                    client.close();
                }
            }
            catch (IOException ioe) {
                System.err.println(ioe);
            }
        }
    }

Figure 3.26 Date server.

A client communicates with the server by creating a socket and connecting to the port on which the server is listening. We implement such a client in the Java program shown in Figure 3.27. The client creates a Socket and requests a connection with the server at IP address 127.0.0.1 on port 6013. Once the connection is made, the client can read from the socket using normal stream I/O statements. After it has received the date from the server, the client closes the socket and exits.

The IP address 127.0.0.1 is a special IP address known as the loopback. When a computer refers to IP address 127.0.0.1, it is referring to itself. This mechanism allows a client and server on the same host to communicate using the TCP/IP protocol. The IP address 127.0.0.1 could be replaced with the IP address of another host running the date server. In addition to an IP address, an actual host name, such as www.westminstercollege.edu, can be used as well.

Communication using sockets—although common and efficient—is generally considered a low-level form of communication between distributed processes. One reason is that sockets allow only an unstructured stream of bytes to be exchanged between the communicating threads. It is the responsibility of the client or server application to impose a structure on the data. In the next two subsections, we look at two higher-level methods of communication: remote procedure calls (RPCs) and Java's remote method invocation (RMI).

3.6.2 Remote Procedure Calls

One of the most common forms of remote service is the RPC paradigm, which we discussed briefly in Section 3.5.1. The RPC was designed as a way to abstract the procedure-call mechanism for use between systems with network connections.
It is similar in many respects to the IPC mechanism described in Section 3.4, and it is usually built on top of such a system. Here, however, because we are dealing with an environment in which the processes are executing on separate systems, we must use a message-based communication scheme to provide remote service.

    import java.net.*;
    import java.io.*;

    public class DateClient {
        public static void main(String[] args) {
            try {
                // make connection to server socket
                Socket sock = new Socket("127.0.0.1", 6013);
                InputStream in = sock.getInputStream();
                BufferedReader bin = new BufferedReader(new InputStreamReader(in));

                // read the date from the socket
                String line;
                while ((line = bin.readLine()) != null)
                    System.out.println(line);

                // close the socket connection
                sock.close();
            }
            catch (IOException ioe) {
                System.err.println(ioe);
            }
        }
    }

Figure 3.27 Date client.

In contrast to the IPC facility, the messages exchanged in RPC communication are well structured and are thus no longer just packets of data. Each message is addressed to an RPC daemon listening to a port on the remote system, and each contains an identifier of the function to execute and the parameters to pass to that function. The function is then executed as requested, and any output is sent back to the requester in a separate message.

A port is simply a number included at the start of a message packet. Whereas a system normally has one network address, it can have many ports within that address to differentiate the many network services it supports. If a remote process needs a service, it addresses a message to the proper port. For instance, if a system wished to allow other systems to be able to list its current users, it would have a daemon supporting such an RPC attached to a port—say, port 3027. Any remote system could obtain the needed information (that is, the list of current users) by sending an RPC message to port 3027 on the server; the data would be received in a reply message.

The semantics of RPCs allow a client to invoke a procedure on a remote host as it would invoke a procedure locally. The RPC system hides the details that allow communication to take place by providing a stub on the client side. Typically, a separate stub exists for each separate remote procedure. When the client invokes a remote procedure, the RPC system calls the appropriate stub, passing it the parameters provided to the remote procedure. This stub locates the port on the server and marshals the parameters. Parameter marshalling involves packaging the parameters into a form that can be transmitted over a network. The stub then transmits a message to the server using message passing. A similar stub on the server side receives this message and invokes the procedure on the server. If necessary, return values are passed back to the client using the same technique.

One issue that must be dealt with concerns differences in data representation on the client and server machines. Consider the representation of 32-bit integers. Some systems (known as big-endian) store the most significant byte first, while other systems (known as little-endian) store the least significant byte first. Neither order is "better" per se; rather, the choice is arbitrary within a computer architecture. To resolve differences like this, many RPC systems define a machine-independent representation of data. One such representation is known as external data representation (XDR).
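To see the representation problem concretely, the short Java sketch below (ours, not part of the original example) renders the same 32-bit integer in both byte orders using java.nio.ByteBuffer:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.Arrays;

    public class Endianness {
        public static void main(String[] args) {
            int value = 1;

            // big-endian: most significant byte first -> [0, 0, 0, 1]
            byte[] big = ByteBuffer.allocate(4)
                .order(ByteOrder.BIG_ENDIAN).putInt(value).array();

            // little-endian: least significant byte first -> [1, 0, 0, 0]
            byte[] little = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();

            System.out.println(Arrays.toString(big));
            System.out.println(Arrays.toString(little));
        }
    }

A little-endian receiver that naively interpreted the big-endian bytes would read the value 16,777,216 rather than 1, which is exactly the confusion a common format such as XDR prevents.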
On the client side, parameter marshalling involves converting the machine-dependent data into XDR before they are sent to the server. On the server side, the XDR data are unmarshalled and converted to the machine-dependent representation for the server.

Another important issue involves the semantics of a call. Whereas local procedure calls fail only under extreme circumstances, RPCs can fail, or be duplicated and executed more than once, as a result of common network errors. One way to address this problem is for the operating system to ensure that messages are acted on exactly once, and another is to ensure that they are acted on at most once. Most local procedure calls have the "exactly once" functionality, but it is more difficult to implement.

First, consider "at most once." We can implement this semantic by attaching a timestamp to each message. The server must keep a history of all the timestamps of messages it has already processed, or a history large enough to ensure that repeated messages are detected. Incoming messages that have timestamps already in the history are ignored. The client can then send a message one or more times and be assured that it executes only once. (Generation of these timestamps is discussed in Section 18.1.)

For "exactly once," we need to remove the risk that the server will never receive the request. To accomplish this, the server must implement the "at most once" protocol described above but must also acknowledge to the client that the RPC call was received and executed. These ACK messages are common throughout networking. The client must resend each RPC call periodically until it receives the ACK for that call.

Another important issue concerns the communication between a server and a client. With standard procedure calls, some form of binding takes place during link, load, or execution time (Chapter 8) so that a procedure call's name is replaced by the memory address of the procedure call. The RPC scheme requires a similar binding of the client and the server port, but how does a client know the port numbers on the server? Neither system has full information about the other, because they do not share memory.

Two approaches to this issue are common. First, the binding information may be predetermined, in the form of fixed port addresses. At compile time, an RPC call has a fixed port number associated with it. Once a program is compiled, the server cannot change the port number of the requested service. Second, binding can be done dynamically by a rendezvous mechanism. Typically, an operating system provides a rendezvous (also called a matchmaker) daemon on a fixed RPC port. A client then sends a message containing the name of the RPC to the rendezvous daemon, requesting the port address of the RPC it needs to execute.

[Figure 3.28 Execution of a remote procedure call (RPC): the client's kernel sends a message to the matchmaker to find the port number for procedure X; the matchmaker looks up the answer and replies with port P; the kernel places port P in the user's RPC message and sends the RPC; the daemon listening to port P receives and processes the request and sends the output back; the kernel receives the reply and passes it to the user.]
The port number is returned, and the RPC calls can be sent to that port until the process terminates (or the server crashes). This method requires the extra overhead of the initial request but is more flexible than the first approach. Figure 3.28 shows a sample interaction.

The RPC scheme is useful in implementing a distributed file system (Chapter 17). Such a system can be implemented as a set of RPC daemons and clients. The messages are addressed to the distributed-file-system port on a server on which a file operation is to take place. The message contains the disk operation to be performed. The disk operation might be read, write, rename, delete, or status, corresponding to the usual file-related system calls. The return message contains any data resulting from that call, which is executed by the DFS daemon on behalf of the client. For instance, a message might contain a request to transfer a whole file to a client or be limited to a simple block request. In the latter case, several such requests may be needed if a whole file is to be transferred.

3.6.3 Remote Method Invocation

Remote method invocation (RMI) is a Java feature similar to RPCs. RMI allows a thread to invoke a method on a remote object. Objects are considered remote if they reside in a different Java virtual machine (JVM). Therefore, the remote object may be in a different JVM on the same computer or on a remote host connected by a network. This situation is illustrated in Figure 3.29.

[Figure 3.29 Remote method invocation: a Java program in one JVM invokes a method on a remote object residing in another JVM.]

RMI and RPCs differ in two fundamental ways. First, RPCs support procedural programming, whereby only remote procedures or functions may be called. In contrast, RMI is object-based: it supports invocation of methods on remote objects. Second, the parameters to remote procedures are ordinary data structures in RPC. With RMI, it is possible to pass primitive data types (for example, int, boolean), as well as objects, as parameters to remote methods. By allowing a Java program to invoke methods on remote objects, RMI makes it possible for users to develop Java applications that are distributed across a network.

To make remote methods transparent to both the client and the server, RMI implements the remote object using stubs and skeletons. A stub is a proxy for the remote object; it resides with the client. When a client invokes a remote method, the stub for the remote object is called. This client-side stub is responsible for creating a parcel consisting of the name of the method to be invoked on the server and the marshalled parameters for the method. The stub then sends this parcel to the server, where the skeleton for the remote object receives it. The skeleton is responsible for unmarshalling the parameters and invoking the desired method on the server. The skeleton then marshals the return value (or exception, if any) into a parcel and returns this parcel to the client. The stub unmarshals the return value and passes it to the client.

Let's look more closely at how this process works. Assume that a client wishes to invoke a method on a remote object server with a signature remoteMethod(Object, Object) that returns a boolean value. The client executes the statement

    boolean val = server.remoteMethod(A, B);

The call to remoteMethod() with the parameters A and B invokes the stub for the remote object.
The stub marshals into a parcel the parameters A and B and the name of the method that is to be invoked on the server, then sends this parcel to the server. The skeleton on the server unmarshals the parameters and invokes the method remoteMethod(). The actual implementation of remoteMethod() resides on the server. Once the method is completed, the skeleton marshals the boolean value returned from remoteMethod() and sends this value back to the client. The stub unmarshals this return value and passes it to the client. The process is shown using the UML (Unified Modeling Language) sequence diagram in Figure 3.30.

[Figure 3.30 Marshalling parameters: the client's call val = server.remoteMethod(A,B) invokes the stub, which sends a parcel containing A, B, and the method name to the skeleton; the skeleton invokes remoteMethod(A,B) on the remote object and returns a parcel containing the boolean return value.]

Fortunately, the level of abstraction that RMI provides makes the stubs and skeletons transparent, allowing Java developers to write programs that invoke distributed methods just as they would invoke local methods. It is crucial, however, to understand a few rules about the behavior of parameter passing and return values:

• Local (or nonremote) objects are passed by copy using a technique known as object serialization, which allows the state of an object to be written to a byte stream. The only requirement for object serialization is that an object must implement the java.io.Serializable interface. Most objects in the core Java API implement this interface, allowing them to be used with RMI.
• Remote objects are passed by reference. Passing an object by reference allows the receiver to alter the state of the remote object as well as invoke its remote methods.

In our example, if A is a local object and B a remote object, A is serialized and passed by copy, and B is passed by reference. This will allow the server to invoke methods on B remotely.

Next, using RMI, we'll build an application that returns the current date and time, similar to the socket-based program shown in Section 3.6.1. We cover RMI in further detail in Appendix D, where we implement a message-passing solution to the producer–consumer problem using RMI.

    import java.rmi.*;
    import java.util.Date;

    public interface RemoteDate extends Remote {
        public Date getDate() throws RemoteException;
    }

Figure 3.31 The RemoteDate interface.

3.6.3.1 Remote Objects

Before building a distributed application, we must first define the necessary remote objects. We begin by declaring an interface that specifies the methods that can be invoked remotely. In our example of a date server, the remote method will be named getDate() and will return a java.util.Date containing the current date. To provide for remote objects, this interface must also extend the java.rmi.Remote interface, which identifies objects implementing the interface as being remote. Further, each method declared in the interface must throw the exception java.rmi.RemoteException. For remote objects, we provide the RemoteDate interface shown in Figure 3.31.

The class that defines the remote object must implement the RemoteDate interface (Figure 3.32).
In addition to defining the getDate() method, the class must also extend java.rmi.server.UnicastRemoteObject. Extending UnicastRemoteObject allows the creation of a single remote object that listens for network requests using RMI's default scheme of sockets for network communication. This class also includes a main() method. The main() method creates an instance of the object and registers it with the RMI registry running on the server via the rebind() method. In this case, the object instance registers itself with the name "RMIDateObject." Also note that we must create a default constructor for the RemoteDateImpl class, and it must throw a RemoteException if a communication or network failure prevents RMI from exporting the remote object.

    import java.rmi.*;
    import java.rmi.server.UnicastRemoteObject;
    import java.util.Date;

    public class RemoteDateImpl extends UnicastRemoteObject
        implements RemoteDate
    {
        public RemoteDateImpl() throws RemoteException {}

        public Date getDate() throws RemoteException {
            return new Date();
        }

        public static void main(String[] args) {
            try {
                RemoteDate dateServer = new RemoteDateImpl();
                // Bind this object instance to the name "RMIDateObject"
                Naming.rebind("RMIDateObject", dateServer);
            }
            catch (Exception e) {
                System.err.println(e);
            }
        }
    }

Figure 3.32 Implementation of the RemoteDate interface.

3.6.3.2 Access to the Remote Object

Once the remote object is registered on the server, a client (as shown in Figure 3.33) can get a proxy reference to the object from the RMI registry running on the server by using the static method lookup() in the Naming class. RMI provides a URL-based lookup scheme using the form rmi://server/objectName, where server is the IP name (or address) of the server on which the remote object objectName resides and objectName is the name of the remote object specified by the server in the rebind() method (in this case, RMIDateObject). Once the client has the proxy reference to the remote object, it invokes the remote method getDate(), which returns the current date and time. Because remote methods—as well as the Naming.lookup() method—can throw exceptions, they must be placed in try-catch blocks.

    import java.rmi.*;

    public class RMIClient {
        static final String server = "127.0.0.1";

        public static void main(String args[]) {
            try {
                String host = "rmi://" + server + "/RMIDateObject";
                RemoteDate dateServer = (RemoteDate)Naming.lookup(host);
                System.out.println(dateServer.getDate());
            }
            catch (Exception e) {
                System.err.println(e);
            }
        }
    }

Figure 3.33 The RMI client.

3.6.3.3 Running the Programs

We now demonstrate the steps necessary to run the example programs. For simplicity, we are assuming that all programs are running on the local host—that is, IP address 127.0.0.1. However, communication is still considered remote, because the client and server programs are running in their own separate Java virtual machines.

1. Compile all source files. Make sure that the file RemoteDate.class is in the same directory as RMIClient.

2. Start the registry and create the remote object. To start the registry on UNIX platforms, the user can type

       rmiregistry &

   For Windows, the user can type

       start rmiregistry

   This command starts the registry with which the remote object will register. Next, create an instance of the remote object with

       java RemoteDateImpl

   This remote object will register using the name RMIDateObject.

3. Reference the remote object.
   Enter the statement

       java RMIClient

   on the command line to start the client. This program will get a proxy reference to the remote object named RMIDateObject and invoke the remote method getDate().

3.6.3.4 RMI versus Sockets

Contrast the socket-based client program shown in Figure 3.27 with the RMI client shown in Figure 3.33. The socket-based client must manage the socket connection, including opening and closing the socket and establishing an InputStream to read from the socket. The design of the client using RMI is much simpler. All it must do is get a proxy for the remote object, which allows it to invoke the remote method getDate() as it would invoke an ordinary local method. This example illustrates the appeal of techniques such as RPCs and RMI. They provide developers of distributed systems with a communication mechanism that allows them to design distributed programs without incurring the overhead of socket management.

3.7 Summary

A process is a program in execution. As a process executes, it changes state. The state of a process is defined by that process's current activity. Each process may be in one of the following states: new, ready, running, waiting, or terminated. Each process is represented in the operating system by its own process control block (PCB).

A process, when it is not executing, is placed in some waiting queue. There are two major classes of queues in an operating system: I/O request queues and the ready queue. The ready queue contains all the processes that are ready to execute and are waiting for the CPU. Each process is represented by a PCB, and the PCBs can be linked together to form a ready queue. Long-term (job) scheduling is the selection of processes that will be allowed to contend for the CPU. Normally, long-term scheduling is heavily influenced by resource-allocation considerations, especially memory management. Short-term (CPU) scheduling is the selection of one process from the ready queue.

Operating systems must provide a mechanism for parent processes to create new child processes. The parent may wait for its children to terminate before proceeding, or the parent and children may execute concurrently. There are several reasons for allowing concurrent execution: information sharing, computation speedup, modularity, and convenience.

The processes executing in the operating system may be either independent processes or cooperating processes. Cooperating processes require an interprocess communication mechanism to communicate with each other. Principally, communication is achieved through two schemes: shared memory and message passing. The shared-memory method requires communicating processes to share some variables. The processes are expected to exchange information through the use of these shared variables. In a shared-memory system, the responsibility for providing communication rests with the application programmers; the operating system needs to provide only the shared memory. The message-passing method allows the processes to exchange messages. The responsibility for providing communication may rest with the operating system itself. These two schemes are not mutually exclusive and can be used simultaneously within a single operating system.

Three different techniques for communication in client–server systems are (1) sockets, (2) remote procedure calls (RPCs), and (3) Java's remote method invocation (RMI). A socket is defined as an endpoint for communication.
A connection between a pair of applications consists of a pair of sockets, one at each end of the communication channel. RPCs are another form of distributed communication. An RPC occurs when a process (or thread) calls a procedure on a remote application. RMI is the Java version of RPCs. RMI allows a thread to invoke a method on a remote object just as it would on a local object. The primary distinction between RPCs and RMI is that in RPCs data are passed to a remote procedure in the form of an ordinary data structure, whereas RMI allows objects to be passed in remote method calls.

Practice Exercises

3.1 Palm OS provides no means of concurrent processing. Discuss three major complications that concurrent processing adds to an operating system.

3.2 The Sun UltraSPARC processor has multiple register sets. Describe what happens when a context switch occurs if the new context is already loaded into one of the register sets. What happens if the new context is in memory rather than in a register set and all the register sets are in use?

3.3 When a process creates a new process using the fork() operation, which of the following states is shared between the parent process and the child process?

   a. Stack
   b. Heap
   c. Shared memory segments

3.4 With respect to the RPC mechanism, consider the "exactly once" semantic. Does the algorithm for implementing this semantic execute correctly even if the ACK message back to the client is lost due to a network problem? Describe the sequence of messages, and discuss whether "exactly once" is still preserved in this situation.

3.5 Assume that a distributed system is susceptible to server failure. What mechanisms would be required to guarantee the "exactly once" semantic for execution of RPCs?

Exercises

3.6 Describe the differences among short-term, medium-term, and long-term scheduling.

3.7 Describe the actions taken by a kernel to context-switch between processes.

3.8 Construct a process tree similar to Figure 3.9. To obtain process information for the UNIX or Linux system, use the command ps -ael. Use the command man ps to get more information about the ps command. On Windows systems, you will have to use the task manager.

3.9 Including the initial parent process, how many processes are created by the program shown in Figure 3.34?

    #include <stdio.h>
    #include <unistd.h>

    int main()
    {
        /* fork a child process */
        fork();

        /* fork another child process */
        fork();

        /* and fork another */
        fork();

        return 0;
    }

Figure 3.34 How many processes are created?

3.10 Using the program in Figure 3.35, identify the values of pid at lines A, B, C, and D. (Assume that the actual pids of the parent and child are 2600 and 2603, respectively.)

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <stdio.h>
    #include <unistd.h>

    int main()
    {
        pid_t pid, pid1;

        /* fork a child process */
        pid = fork();

        if (pid < 0) { /* error occurred */
            fprintf(stderr, "Fork Failed");
            return 1;
        }
        else if (pid == 0) { /* child process */
            pid1 = getpid();
            printf("child: pid = %d", pid);   /* A */
            printf("child: pid1 = %d", pid1); /* B */
        }
        else { /* parent process */
            pid1 = getpid();
            printf("parent: pid = %d", pid);   /* C */
            printf("parent: pid1 = %d", pid1); /* D */
            wait(NULL);
        }
        return 0;
    }

Figure 3.35 What are the pid values?

3.11 Consider the RPC mechanism. Describe the undesirable consequences that could arise from not enforcing either the "at most once" or "exactly once" semantic. Describe possible uses for a mechanism that has neither of these guarantees.

3.12 Explain the fundamental differences between RMI and RPCs.
3.13 Using the program shown in Figure 3.36, explain what the output will be at Line A.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <stdio.h>
    #include <unistd.h>

    int value = 5;

    int main()
    {
        pid_t pid;

        pid = fork();

        if (pid == 0) { /* child process */
            value += 15;
            return 0;
        }
        else if (pid > 0) { /* parent process */
            wait(NULL);
            printf("PARENT: value = %d", value); /* LINE A */
            return 0;
        }
    }

Figure 3.36 What will the output be at Line A?

3.14 What are the benefits and the disadvantages of each of the following? Consider both the system level and the programmer level.

   a. Synchronous and asynchronous communication
   b. Automatic and explicit buffering
   c. Send by copy and send by reference
   d. Fixed-sized and variable-sized messages

Programming Problems

3.15 Section 3.6.1 describes port numbers below 1024 as being well known; that is, they provide standard services. Port 17 is known as the quote-of-the-day service. When a client connects to port 17 on a server, the server responds with a quote for that day. Modify the date server shown in Figure 3.26 so that it delivers a quote of the day rather than the current date. The quotes should be printable ASCII characters and should contain fewer than 512 characters, although multiple lines are allowed. Since port 17 is considered well known and therefore unavailable, have your server listen to port 6017. The date client shown in Figure 3.27 may be used to read the quotes returned by your server.

3.16 A haiku is a three-line poem in which the first line contains five syllables, the second line contains seven syllables, and the third line contains five syllables. Write a haiku server that listens to port 5575. When a client connects to this port, the server responds with a haiku. The date client shown in Figure 3.27 may be used to read the quotes returned by your haiku server.

3.17 Write a client–server application using Java sockets that allows a client to write a message (as a String) to a socket. A server will read this message, count the number of characters and digits in the message, and send these two counts back to the client. The server will listen to port 6100. The client can obtain the String message that it is to pass to the server either from the command line or by using a prompt to the user. One strategy for sending the two counts back to the client is for the server to construct an object containing

   a. The message it receives from the client
   b. A count of the number of characters in the message
   c. A count of the number of digits in the message

Such an object can be modeled using the following interface:

    public interface Message {
        // set the counts for characters and digits
        public void setCounts();

        // return the number of characters
        public int getCharacterCount();

        // return the number of digits
        public int getDigitCount();
    }

The server will read the String from the socket, construct a new object that implements the Message interface, count the number of characters and digits in the String, and write the contents of the Message object to the socket. The client will send a String to the server and will wait for the server to respond with a Message containing the count of the number of characters and digits in the message.

Communication over the socket connection will require obtaining the InputStream and OutputStream for the socket. Objects that are written to or read from an OutputStream or InputStream must be serialized and therefore must implement the java.io.Serializable interface.
This interface is known as a marker interface, meaning that it actually has no methods that must be implemented; basically, any object that implements this interface can be used with either an InputStream or an OutputStream. For this assignment, you will design an object named MessageImpl that implements both java.io.Serializable and the Message interface shown above.

Serializing an object requires obtaining a java.io.ObjectOutputStream and then writing the object using the writeObject() method in the ObjectOutputStream class. Thus, the server's activity will be organized roughly as follows:

   a. Reading the string from the socket
   b. Constructing a new MessageImpl object and counting the number of characters and digits in the message
   c. Obtaining the ObjectOutputStream for the socket and writing the MessageImpl object to this output stream

Reading a serialized object requires obtaining a java.io.ObjectInputStream and then reading the serialized object using the readObject() method in the java.io.ObjectInputStream class. Therefore, the client's activity will be arranged approximately as follows:

   a. Writing the string to the socket
   b. Obtaining the ObjectInputStream from the socket and reading the MessageImpl

Consult the Java API for further details.

3.18 Write an RMI application that allows a client to open and read a file residing on a remote server. The interface for accessing the remote file object appears as

    import java.rmi.*;

    public interface RemoteFileObject extends Remote {
        public abstract void open(String fileName) throws RemoteException;
        public abstract String readLine() throws RemoteException;
        public abstract void close() throws RemoteException;
    }

That is, the client will open the remote file using the open() method, where the name of the file being opened is provided as a parameter. The file will be accessed via the readLine() method. This method is implemented similarly to the readLine() method in the java.io.BufferedReader class in the Java API. That is, it will read and return a line of text that is terminated by a line feed (\n), a carriage return (\r), or a carriage return followed immediately by a line feed. Once the end of the file has been reached, readLine() will return null. Once the file has been read, the client will close the file using the close() method. For simplicity, we assume the file being read is a character (text) stream. The client program need only display the file to the console (System.out).

One issue to be addressed concerns handling exceptions. The server will have to implement the methods outlined in the RemoteFileObject interface using standard I/O methods provided in the Java API, most of which throw a java.io.IOException. However, the methods to be implemented in the RemoteFileObject interface are declared to throw a RemoteException. Perhaps the easiest way of handling this situation in the server is to place the appropriate calls to standard I/O methods in a try-catch for java.io.IOException. If such an exception occurs, catch it, and then re-throw it as a RemoteException. The code might look like this:

    try {
        . . .
    }
    catch (java.io.IOException ioe) {
        throw new RemoteException("IO Exception", ioe);
    }

You can handle a java.io.FileNotFoundException similarly.
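As a minimal illustration of the serialization mechanics that Problems 3.17 and 3.18 rely on, consider the sketch below. It is ours, not part of the assignment: the class name CountMessage and its fields are hypothetical stand-ins, and a byte array substitutes for the socket streams used in the actual project.

    import java.io.*;

    // a serializable message; the name and fields are hypothetical
    class CountMessage implements Serializable {
        String text;
        int characters;
        int digits;
    }

    public class SerializationDemo {
        public static void main(String[] args) throws Exception {
            CountMessage m = new CountMessage();
            m.text = "abc123";
            m.characters = 6;
            m.digits = 3;

            // in the project, these streams would wrap the socket's
            // OutputStream; a byte array stands in for the network here
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(m);
            out.flush();

            // the receiving side reads the object back with readObject()
            ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            CountMessage copy = (CountMessage)in.readObject();
            System.out.println(copy.text + " contains " + copy.digits + " digits");
        }
    }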
Programming Projects

Creating a Shell Interface Using Java

This project consists of modifying a Java program so that it serves as a shell interface that accepts user commands and then executes each command in a separate process external to the Java virtual machine.

Overview

A shell interface provides the user with a prompt, after which the user enters the next command. The example below illustrates the prompt jsh> and the user's next command: cat Prog.java. This command displays the file Prog.java on the terminal using the UNIX cat command.

    jsh> cat Prog.java

Perhaps the easiest technique for implementing a shell interface is to have the program first read what the user enters on the command line (here, cat Prog.java) and then create a separate external process that performs the command. We create the separate process using the ProcessBuilder object, as illustrated in Figure 3.13. In our example, this separate process is external to the JVM and begins execution when its start() method is invoked.

A Java program that provides the basic operations of a command-line shell is supplied in Figure 3.37. The main() method presents the prompt jsh> (for java shell) and waits to read input from the user. The program is terminated when the user enters <Control><C>.

    import java.io.*;

    public class SimpleShell {
        public static void main(String[] args) throws java.io.IOException {
            String commandLine;
            BufferedReader console = new BufferedReader
                (new InputStreamReader(System.in));

            // we break out with <Control><C>
            while (true) {
                // read what the user entered
                System.out.print("jsh>");
                commandLine = console.readLine();

                // if the user entered a return, just loop again
                if (commandLine.equals(""))
                    continue;

                /** The steps are:
                    (1) parse the input to obtain the command and any parameters
                    (2) create a ProcessBuilder object
                    (3) start the process
                    (4) obtain the output stream
                    (5) output the contents returned by the command
                */
            }
        }
    }

Figure 3.37 Outline of simple shell.

This project is organized into three parts: (1) creating the external process and executing the command in that process, (2) modifying the shell to allow changing directories, and (3) adding a history feature.

Part 1: Creating an External Process

The first part of this project is to modify the main() method in Figure 3.37 so that an external process is created and executes the command specified by the user. Initially, the command must be parsed into separate parameters and passed to the constructor for the ProcessBuilder object. For example, if the user enters the command

    jsh> cat Prog.java

the parameters are (1) cat and (2) Prog.java, and these parameters must be passed to the ProcessBuilder constructor. Perhaps the easiest strategy for doing this is to use the constructor with the following signature:

    public ProcessBuilder(List<String> command)

A java.util.ArrayList<String>—which implements the java.util.List interface—can be used in this instance, where the first element of the list is cat and the second element is Prog.java. This is an especially useful strategy because the number of arguments passed to UNIX commands may vary (the cat command accepts one argument, the cp command accepts two, and so forth).

If the user enters an invalid command, the start() method in the ProcessBuilder class throws a java.io.IOException. If this occurs, your program should output an appropriate error message and resume waiting for further commands from the user.
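The fragment below is a rough sketch of the mechanics described above, not a complete solution to Part 1: it runs a single hard-coded command (cat Prog.java, which assumes a UNIX-like system and a file of that name in the working directory) and echoes the command's output.

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;

    public class RunCommand {
        public static void main(String[] args) throws IOException {
            // build the argument list: the command followed by its parameters
            List<String> command = new ArrayList<String>();
            command.add("cat");
            command.add("Prog.java");

            // create and start the external process
            ProcessBuilder pb = new ProcessBuilder(command);
            Process process = pb.start();

            // obtain the output of the process and echo it to the console
            BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream()));
            String line;
            while ((line = br.readLine()) != null)
                System.out.println(line);
        }
    }

A full solution would parse the user's command line into the list, catch the IOException raised by start() for invalid commands, and loop back to the prompt.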
Part 2: Changing Directories

The next task is to modify the program in Figure 3.37 so that it changes directories. In UNIX systems, we encounter the concept of the current working directory, which is simply the directory you are currently in. The cd command allows a user to change current directories. Your shell interface must support this command. For example, if the current directory is /usr/tom and the user enters cd music, the current directory becomes /usr/tom/music. Subsequent commands relate to this current directory. For example, entering ls will output all the files in /usr/tom/music.

The ProcessBuilder class provides the following method for changing the working directory:

    public ProcessBuilder directory(File directory)

When the start() method of a subsequent process is invoked, the new process will use this as the current working directory. For example, if one process with a current working directory of /usr/tom invokes the command cd music, subsequent processes must set their working directories to /usr/tom/music before beginning execution. It is important to note that your program must first make sure the new path being specified is a valid directory. If not, your program should output an appropriate error message.

If the user enters the command cd, change the current working directory to the user's home directory. The home directory for the current user can be obtained by invoking the static getProperty() method in the System class as follows:

    System.getProperty("user.home");

Part 3: Adding a History Feature

Many UNIX shells provide a history feature that allows users to see the history of commands they have entered and to rerun a command from that history. The history includes all commands that have been entered by the user since the shell was invoked. For example, if the user entered the history command and saw as output:

    0 pwd
    1 ls -l
    2 cat Prog.java

the history would list pwd as the first command entered, ls -l as the second command, and so on.

Modify your shell program so that commands are entered into a history. (Hint: The java.util.ArrayList provides a useful data structure for storing these commands.) Your program must allow users to rerun commands from their history by supporting the following three techniques:

1. When the user enters the command history, you will print out the contents of the history of commands that have been entered into the shell, along with the command numbers.

2. When the user enters !!, run the previous command in the history. If there is no previous command, output an appropriate error message.

3. When the user enters ! followed by an integer value, run the ith command in the history. For example, entering !4 would run the fourth command in the command history. Make sure you perform proper error checking to ensure that the integer value is a valid number in the command history.

Wiley Plus

Visit Wiley Plus for

• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Interprocess communication in the RC 4000 system is discussed by Brinch Hansen [1970]. Schlichting and Schneider [1982] discuss asynchronous message-passing primitives. The IPC facility implemented at the user level is described by Bershad et al. [1990].

Details of interprocess communication in UNIX systems are presented by Gray [1997]. Barrera [1991] and Vahalia [1996] describe interprocess communication in the Mach system.
Russinovich and Solomon [2005], Solomon and Russinovich [2000], and Stevens [1999] outline interprocess communication in Windows 2003, Windows 2000, and UNIX, respectively. Hart [2005] covers Windows systems programming in detail.

The implementation of RPCs is discussed by Birrell and Nelson [1984]. Shrivastava and Panzieri [1982] describe the design of a reliable RPC mechanism, and Tay and Ananda [1990] present a survey of RPCs. Stankovic [1982] and Staunstrup [1982] discuss procedure calls versus message-passing communication. Harold [2005] provides coverage of socket programming in Java, and Grosso [2002] covers Java's RMI.

CHAPTER 4 Threads

The process model introduced in Chapter 3 assumed that a process was an executing program with a single thread of control. Most modern operating systems now provide features enabling a process to contain multiple threads of control. This chapter introduces many concepts associated with multithreaded computer systems, including a discussion of the APIs for the Pthreads, Win32, and Java thread libraries. We look at many issues related to multithreaded programming and its effect on the design of operating systems. Finally, we explore how the Windows XP and Linux operating systems support threads at the kernel level.

CHAPTER OBJECTIVES

• To introduce the notion of a thread—a fundamental unit of CPU utilization that forms the basis of multithreaded computer systems.
• To discuss the APIs for the Pthreads, Win32, and Java thread libraries.
• To examine issues related to multithreaded programming.

4.1 Overview

A thread is a basic unit of CPU utilization; it comprises a thread ID, a program counter, a register set, and a stack. It shares with other threads belonging to the same process its code section, data section, and other operating-system resources, such as open files and signals. A traditional (or heavyweight) process has a single thread of control. If a process has multiple threads of control, it can perform more than one task at a time. Figure 4.1 illustrates the difference between a traditional single-threaded process and a multithreaded process.

[Figure 4.1 Single-threaded and multithreaded processes: both share code, data, and files, but a single-threaded process has one set of registers and one stack, whereas a multithreaded process has a separate set of registers and a separate stack per thread.]

4.1.1 Uses

Many software packages that run on modern desktop PCs are multithreaded. An application typically is implemented as a separate process with several threads of control. A Web browser might have one thread display images or text while another thread retrieves data from the network, for example. A word processor may have a thread for displaying graphics, another thread for responding to keystrokes from the user, and a third thread for performing spelling and grammar checking in the background.

In certain situations, a single application may be required to perform several similar tasks. For example, a Web server accepts client requests for Web pages, images, sound, and so forth. A busy Web server may have several (perhaps thousands of) clients concurrently accessing it. If the Web server ran as a traditional single-threaded process, it would be able to service only one client at a time, and a client might have to wait a very long time for its request to be serviced.

One solution is to have the server run as a single process that accepts requests. When the server receives a request, it creates a separate process to service that request.
In fact, this process-creation method was in common use before threads became popular. Process creation is time consuming and resource intensive, however. If the new process will perform the same tasks as the existing process, why incur all that overhead? It is generally more efficient to use one process that contains multiple threads. If the Web-server process is multithreaded, the server will create a separate thread that listens for client requests. When a request is made, rather than creating another process, the server will create a new thread to service the request and resume listening for additional requests. This is illustrated in Figure 4.2.

Figure 4.2 Multithreaded server architecture: (1) the client sends a request; (2) the server creates a new thread to service the request; (3) the server resumes listening for additional client requests.

Threads also play a vital role in remote procedure call (RPC) systems. Recall from Chapter 3 that RPCs allow interprocess communication by providing a communication mechanism similar to ordinary function or procedure calls. Typically, RPC servers are multithreaded. When a server receives a message, it services the message using a separate thread. This allows the server to handle several concurrent requests.

Finally, most operating system kernels are now multithreaded; several threads operate in the kernel, and each thread performs a specific task, such as managing devices or interrupt handling. For example, Solaris creates a set of threads in the kernel specifically for interrupt handling; Linux uses a kernel thread for managing the amount of free memory in the system.

4.1.2 Benefits

The benefits of multithreaded programming can be broken down into four major categories:

1. Responsiveness. Multithreading an interactive application may allow a program to continue running even if part of it is blocked or is performing a lengthy operation, thereby increasing responsiveness to the user. For instance, a multithreaded Web browser can allow user interaction in one thread while an image is being loaded in another thread.

2. Resource sharing. Processes can share resources only through techniques such as shared memory or message passing. Such techniques must be explicitly arranged by the programmer. However, threads share the memory and the resources of the process to which they belong by default. The benefit of sharing code and data is that it allows an application to have several different threads of activity within the same address space.

3. Economy. Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general it is much more time consuming to create and manage processes than threads. In Solaris, for example, creating a process is about thirty times slower than creating a thread, and context switching is about five times slower. (A rough measurement sketch follows Figure 4.3.)

4. Scalability. The benefits of multithreading can be greatly increased in a multiprocessor architecture, where threads may be running in parallel on different processors. A single-threaded process can run on only one processor, regardless of how many are available. Multithreading on a multi-CPU machine increases parallelism. We explore this issue further in the following section.

Figure 4.3 Concurrent execution on a single-core system.
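As a rough illustration of the economy benefit described above, the following sketch times the creation of a number of short-lived threads against the creation of the same number of short-lived processes. This sketch is ours, not from the original text; the class name EconomyDemo is illustrative, and the process side assumes a UNIX-like system that provides the true command. Absolute numbers and ratios vary widely by platform; the sketch only makes the qualitative point.

public class EconomyDemo
{
    public static void main(String[] args) throws Exception {
        final int N = 100;

        // Time N trivial threads, each created, started, and joined.
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() { /* trivial task */ }
            });
            t.start();
            t.join();
        }
        long threadTime = System.nanoTime() - t0;

        // Time N trivial processes (assumes a UNIX-like "true" command).
        t0 = System.nanoTime();
        for (int i = 0; i < N; i++) {
            Process p = new ProcessBuilder("true").start();
            p.waitFor();
        }
        long processTime = System.nanoTime() - t0;

        System.out.println("threads:   " + threadTime / 1000000 + " ms");
        System.out.println("processes: " + processTime / 1000000 + " ms");
    }
}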
4.1.3 Multicore Programming

A recent trend in system design has been to place multiple computing cores on a single chip. Each of these cores appears as a separate processor to the operating system (Section 1.3.2). Multithreaded programming provides a mechanism for more efficient use of multiple cores and improved concurrency. Consider an application with four threads. On a system with a single computing core, concurrency merely means that the execution of the threads will be interleaved over time (Figure 4.3), since the processing core can execute only one thread at a time. On a system with multiple cores, however, concurrency means that the threads can run in parallel, because the system can assign a separate thread to each core (Figure 4.4).

Figure 4.4 Parallel execution on a multicore system.

The trend toward multicore systems has placed pressure on system designers as well as application programmers to make better use of the multiple computing cores. Designers of operating systems must write scheduling algorithms that use multiple processing cores to allow the parallel execution shown in Figure 4.4. For application programmers, the challenge is to modify existing programs as well as design new programs that are multithreaded to take advantage of multicore systems. In general, five areas present challenges in programming for multicore systems:

1. Dividing activities. This involves examining applications to find areas that can be divided into separate, concurrent tasks and thus can run in parallel on individual cores.

2. Balance. While identifying tasks that can run in parallel, programmers must also ensure that the tasks perform equal work of equal value. In some instances, a certain task may not contribute as much value to the overall process as other tasks; using a separate execution core to run that task may not be worth the cost.

3. Data splitting. Just as applications are divided into separate tasks, the data accessed and manipulated by the tasks must be divided to run on separate cores.

4. Data dependency. The data accessed by the tasks must be examined for dependencies between two or more tasks. In instances where one task depends on data from another, programmers must ensure that the execution of the tasks is synchronized to accommodate the data dependency. We examine such strategies in Chapter 6.

5. Testing and debugging. When a program is running in parallel on multiple cores, there are many different execution paths. Testing and debugging such concurrent programs is inherently more difficult than testing and debugging single-threaded applications.

Because of these challenges, many software developers argue that the advent of multicore systems will require an entirely new approach to designing software systems in the future.

4.2 Multithreading Models

Our discussion so far has treated threads in a generic sense. However, support for threads may be provided either at the user level, for user threads, or by the kernel, for kernel threads. User threads are supported above the kernel and are managed without kernel support, whereas kernel threads are supported and managed directly by the operating system. Virtually all contemporary operating systems—including Windows XP, Windows Vista, Linux, Mac OS X, Solaris, and Tru64 UNIX (formerly Digital UNIX)—support kernel threads. Ultimately, a relationship must exist between user threads and kernel threads.
In this section, we look at three common ways of establishing such a relationship.

4.2.1 Many-to-One Model

The many-to-one model (Figure 4.5) maps many user-level threads to one kernel thread. Thread management is done by the thread library in user space, so it is efficient; but the entire process will block if a thread makes a blocking system call. Also, because only one thread can access the kernel at a time, multiple threads are unable to run in parallel on multiprocessors. Green threads—a thread library available for Solaris—uses this model, as does GNU Portable Threads.

Figure 4.5 Many-to-one model.

4.2.2 One-to-One Model

The one-to-one model (Figure 4.6) maps each user thread to a kernel thread. It provides more concurrency than the many-to-one model by allowing another thread to run when a thread makes a blocking system call; it also allows multiple threads to run in parallel on multiprocessors. The only drawback to this model is that creating a user thread requires creating the corresponding kernel thread. Because the overhead of creating kernel threads can burden the performance of an application, most implementations of this model restrict the number of threads supported by the system. Linux, along with the family of Windows operating systems, implements the one-to-one model.

Figure 4.6 One-to-one model.

4.2.3 Many-to-Many Model

The many-to-many model (Figure 4.7) multiplexes many user-level threads to a smaller or equal number of kernel threads. The number of kernel threads may be specific to either a particular application or a particular machine (an application may be allocated more kernel threads on a multiprocessor than on a uniprocessor). Whereas the many-to-one model allows the developer to create as many user threads as she wishes, true concurrency is not gained because the kernel can schedule only one thread at a time. The one-to-one model allows greater concurrency, but the developer has to be careful not to create too many threads within an application (and in some instances may be limited in the number of threads she can create). The many-to-many model suffers from neither of these shortcomings: developers can create as many user threads as necessary, and the corresponding kernel threads can run in parallel on a multiprocessor. Also, when a thread performs a blocking system call, the kernel can schedule another thread for execution.

Figure 4.7 Many-to-many model.

One popular variation on the many-to-many model still multiplexes many user-level threads to a smaller or equal number of kernel threads but also allows a user-level thread to be bound to a kernel thread. This variation, sometimes referred to as the two-level model (Figure 4.8), is supported by operating systems such as IRIX, HP-UX, and Tru64 UNIX. The Solaris operating system supported the two-level model in versions older than Solaris 9. However, beginning with Solaris 9, this system uses the one-to-one model.

Figure 4.8 Two-level model.

4.3 Thread Libraries

A thread library provides the programmer with an API for creating and managing threads. There are two primary ways of implementing a thread library. The first approach is to provide a library entirely in user space with no kernel support. All code and data structures for the library exist in user space.
This means that invoking a function in the library results in a local function call in user space and not a system call. The second approach is to implement a kernel-level library supported directly by the operating system. In this case, code and data structures for the library exist in kernel space. Invoking a function in the API for the library typically results in a system call to the kernel.

Three main thread libraries are in use today: (1) POSIX Pthreads, (2) Win32, and (3) Java. Pthreads, the threads extension of the POSIX standard, may be provided as either a user- or kernel-level library. The Win32 thread library is a kernel-level library available on Windows systems. The Java thread API allows threads to be created and managed directly in Java programs. However, because in most instances the Java virtual machine (JVM) is running on top of a host operating system, the Java thread API is generally implemented using a thread library available on the host system. This means that on Windows systems, Java threads are typically implemented using the Win32 API; UNIX and Linux systems often use Pthreads.

In the remainder of this section, we describe basic thread creation using the Pthreads and Win32 thread libraries. We cover Java threads in more detail in Section 4.4. As an illustrative example, we design a multithreaded program that performs the summation of a non-negative integer in a separate thread using the well-known summation function:

    sum = Σ_{i=0}^{N} i

For example, if N were 5, this function would represent the summation of integers from 0 to 5, which is 15. Each of the three programs will be run with the upper bound of the summation entered on the command line; thus, if the user enters 8, the summation of the integer values from 0 to 8 will be output.

4.3.1 Pthreads

Pthreads refers to the POSIX standard (IEEE 1003.1c) defining an API for thread creation and synchronization. This is a specification for thread behavior, not an implementation. Operating system designers may implement the specification in any way they wish. Numerous systems implement the Pthreads specification, including Solaris, Linux, Mac OS X, and Tru64 UNIX. Shareware implementations are available in the public domain for the various Windows operating systems as well.

The C program shown in Figure 4.9 demonstrates the basic Pthreads API for constructing a multithreaded program that calculates the summation of a non-negative integer in a separate thread. In a Pthreads program, separate threads begin execution in a specified function. In Figure 4.9, this is the runner() function. When this program begins, a single thread of control begins in main(). After some initialization, main() creates a second thread that begins control in the runner() function. Both threads share the global data sum.

Let's look more closely at this program. All Pthreads programs must include the pthread.h header file. The statement pthread_t tid declares the identifier for the thread we will create. Each thread has a set of attributes, including stack size and scheduling information. The pthread_attr_t attr declaration represents the attributes for the thread. We set the attributes in the function call pthread_attr_init(&attr). Because we did not explicitly set any attributes, we use the default attributes provided. (In Section 5.4.2, we discuss some of the scheduling attributes provided by the Pthreads API.) A separate thread is created with the pthread_create() function call.
In addition to passing the thread identifier and the attributes for the thread, we also pass the name of the function where the new thread will begin execution—in this case, the runner() function. Last, we pass the integer parameter that was provided on the command line, argv[1].

At this point, the program has two threads: the initial (or parent) thread in main() and the summation (or child) thread performing the summation operation in the runner() function. After creating the summation thread, the parent thread will wait for it to complete by calling the pthread_join() function. The summation thread will complete when it calls the function pthread_exit(). Once the summation thread has returned, the parent thread will output the value of the shared data sum.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum; /* this data is shared by the thread(s) */
void *runner(void *param); /* the thread */

int main(int argc, char *argv[])
{
    pthread_t tid;       /* the thread identifier */
    pthread_attr_t attr; /* set of thread attributes */

    if (argc != 2) {
        fprintf(stderr, "usage: a.out <integer value>\n");
        return -1;
    }
    if (atoi(argv[1]) < 0) {
        fprintf(stderr, "%d must be >= 0\n", atoi(argv[1]));
        return -1;
    }

    /* get the default attributes */
    pthread_attr_init(&attr);
    /* create the thread */
    pthread_create(&tid, &attr, runner, argv[1]);
    /* wait for the thread to exit */
    pthread_join(tid, NULL);

    printf("sum = %d\n", sum);
}

/* The thread will begin control in this function */
void *runner(void *param)
{
    int i, upper = atoi(param);
    sum = 0;

    for (i = 1; i <= upper; i++)
        sum += i;

    pthread_exit(0);
}

Figure 4.9 Multithreaded C program using the Pthreads API.

4.3.2 Win32 Threads

The technique for creating threads using the Win32 thread library is similar to the Pthreads technique in several ways. We illustrate the Win32 thread API in the C program shown in Figure 4.10. Notice that we must include the windows.h header file when using the Win32 API. Just as in the Pthreads version shown in Figure 4.9, data shared by the separate threads—in this case, Sum—are declared globally (the DWORD data type is an unsigned 32-bit integer). We also define the Summation() function that is to be performed in a separate thread. This function is passed a pointer to a void, which Win32 defines as LPVOID. The thread performing this function sets the global data Sum to the value of the summation from 0 to the parameter passed to Summation().

Threads are created in the Win32 API using the CreateThread() function, and—just as in Pthreads—a set of attributes for the thread is passed to this function. These attributes include security information, the size of the stack, and a flag that can be set to indicate if the thread is to start in a suspended state. In this program, we use the default values for these attributes (which do not initially set the thread to a suspended state and instead make it eligible to be run by the CPU scheduler). Once the summation thread is created, the parent must wait for it to complete before outputting the value of Sum, since the value is set by the summation thread. Recall that the Pthreads program (Figure 4.9) has the parent thread wait for the summation thread using the pthread_join() statement. We perform the equivalent of this in the Win32 API using the WaitForSingleObject() function, which causes the creating thread to block until the summation thread has exited. (We cover synchronization objects in more detail in Chapter 6.)
4.4 Java Threads

Threads are the fundamental model of program execution in a Java program, and the Java language and its API provide a rich set of features for the creation and management of threads. All Java programs comprise at least a single thread of control that begins execution in the program's main() method.

4.4.1 Creating Java Threads

There are two techniques for creating threads in a Java program. One approach is to create a new class that is derived from the Thread class and to override its run() method. However, the most common technique is to define a class that implements the Runnable interface. The Runnable interface is defined as follows:

    public interface Runnable
    {
        public abstract void run();
    }

When a class implements Runnable, it must define a run() method. The code implementing the run() method is what runs as a separate thread.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

DWORD Sum; /* data is shared by the thread(s) */

/* the thread runs in this separate function */
DWORD WINAPI Summation(LPVOID Param)
{
    DWORD Upper = *(DWORD*)Param;
    for (DWORD i = 0; i <= Upper; i++)
        Sum += i;
    return 0;
}

int main(int argc, char *argv[])
{
    DWORD ThreadId;
    HANDLE ThreadHandle;
    int Param;

    /* perform some basic error checking */
    if (argc != 2) {
        fprintf(stderr, "An integer parameter is required\n");
        return -1;
    }
    Param = atoi(argv[1]);
    if (Param < 0) {
        fprintf(stderr, "An integer >= 0 is required\n");
        return -1;
    }

    // create the thread
    ThreadHandle = CreateThread(
        NULL,       // default security attributes
        0,          // default stack size
        Summation,  // thread function
        &Param,     // parameter to thread function
        0,          // default creation flags
        &ThreadId); // returns the thread identifier

    if (ThreadHandle != NULL) {
        // now wait for the thread to finish
        WaitForSingleObject(ThreadHandle, INFINITE);

        // close the thread handle
        CloseHandle(ThreadHandle);

        printf("sum = %d\n", Sum);
    }
}

Figure 4.10 Multithreaded C program using the Win32 API.

Figure 4.11 shows the Java version of a multithreaded program that determines the summation of a non-negative integer. The Summation class implements the Runnable interface. Thread creation is performed by creating an object instance of the Thread class and passing the constructor a Runnable object. Creating a Thread object does not specifically create the new thread; rather, it is the start() method that actually creates the new thread. Calling the start() method for the new object does two things:

1. It allocates memory and initializes a new thread in the JVM.

2. It calls the run() method, making the thread eligible to be run by the JVM. (Note that we never call the run() method directly. Rather, we call the start() method, and it calls the run() method on our behalf.)

When the summation program runs, two threads are created by the JVM. The first is the parent thread, which starts execution in the main() method. The second thread is created when the start() method on the Thread object is invoked. This child thread begins execution in the run() method of the Summation class. After outputting the value of the summation, this thread terminates when it exits from its run() method.

Sharing of data between threads occurs easily in Win32 and Pthreads, as shared data are simply declared globally. As a pure object-oriented language, Java has no such notion of global data; if two or more threads are to share data in a Java program, the sharing occurs by passing references to the shared object to the appropriate threads.
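Before examining Figure 4.11 in detail, the following minimal sketch (ours, not from the original text; the class names Worker1, Worker2, and CreationDemo are illustrative) contrasts the two thread-creation techniques described at the start of this section:

// Technique 1: derive from Thread and override run().
class Worker1 extends Thread
{
    public void run() {
        System.out.println("Worker1 is running.");
    }
}

// Technique 2 (more common): implement the Runnable interface.
class Worker2 implements Runnable
{
    public void run() {
        System.out.println("Worker2 is running.");
    }
}

public class CreationDemo
{
    public static void main(String[] args) {
        new Worker1().start();             // a Thread subclass is started directly
        new Thread(new Worker2()).start(); // a Runnable must be wrapped in a Thread
    }
}

Either technique yields a thread that becomes eligible to run once start() is invoked; Figure 4.11 uses the Runnable approach.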
In the Java program shown in Figure 4.11, the main thread and the summation thread share the object instance of the Sum class. This shared object is referenced through the appropriate getSum() and setSum() methods. (You might wonder why we don't use a java.lang.Integer object rather than designing a new Sum class. The reason is that the java.lang.Integer class is immutable—that is, once its integer value is set, it cannot change.)

Recall that the parent threads in the Pthreads and Win32 libraries use pthread_join() and WaitForSingleObject() (respectively) to wait for the summation threads to finish before proceeding. The join() method in Java provides similar functionality. Notice that join() can throw an InterruptedException, which we choose to ignore for now. We discuss handling this exception in Chapter 6.

Java actually identifies two different types of threads: (1) daemon (pronounced "demon") and (2) non-daemon threads. The fundamental difference between the two types is the simple rule that the JVM shuts down when all non-daemon threads have exited. Otherwise, the two thread types are identical. When the JVM starts up, it creates several internal daemon threads for performing tasks such as garbage collection. A daemon thread is created by invoking the setDaemon() method of the Thread class and passing the method the value true. For example, we could have set the thread in the program shown in Figure 4.11 as a daemon by adding the following line after creating—but before starting—the thread:

    thrd.setDaemon(true);

class Sum
{
    private int sum;

    public int getSum() {
        return sum;
    }

    public void setSum(int sum) {
        this.sum = sum;
    }
}

class Summation implements Runnable
{
    private int upper;
    private Sum sumValue;

    public Summation(int upper, Sum sumValue) {
        this.upper = upper;
        this.sumValue = sumValue;
    }

    public void run() {
        int sum = 0;
        for (int i = 0; i <= upper; i++)
            sum += i;
        sumValue.setSum(sum);
    }
}

public class Driver
{
    public static void main(String[] args) {
        if (args.length > 0) {
            if (Integer.parseInt(args[0]) < 0)
                System.err.println(args[0] + " must be >= 0.");
            else {
                // create the object to be shared
                Sum sumObject = new Sum();
                int upper = Integer.parseInt(args[0]);
                Thread thrd = new Thread(new Summation(upper, sumObject));
                thrd.start();
                try {
                    thrd.join();
                    System.out.println
                        ("The sum of " + upper + " is " + sumObject.getSum());
                } catch (InterruptedException ie) { }
            }
        }
        else
            System.err.println("Usage: Summation <integer value>");
    }
}

Figure 4.11 Java program for the summation of a non-negative integer.

For the remainder of this text, we will refer only to non-daemon threads unless otherwise specified.

4.4.2 The JVM and the Host Operating System

As we have discussed, the JVM is typically implemented on top of a host operating system (see Figure 2.20). This setup allows the JVM to hide the implementation details of the underlying operating system and to provide a consistent, abstract environment that allows Java programs to operate on any platform that supports a JVM. The specification for the JVM does not indicate how Java threads are to be mapped to the underlying operating system, instead leaving that decision to the particular implementation of the JVM. For example, the Windows XP operating system uses the one-to-one model; therefore, each Java thread for a JVM running on such a system maps to a kernel thread. On operating systems that use the many-to-many model (such as Tru64 UNIX), a Java thread is mapped according to the many-to-many model.
Solaris initially implemented the JVM using the many-to-one model (the green threads library mentioned earlier). Later releases of the JVM used the many-to-many model. Beginning with Solaris 9, Java threads were mapped using the one-to-one model. In addition, there may be a relationship between the Java thread library and the thread library on the host operating system. For example, implementations of a JVM for the Windows family of operating systems might use the Win32 API when creating Java threads; Linux and Solaris systems might use the Pthreads API.

4.4.3 Java Thread States

A Java thread may be in one of six possible states in the JVM:

1. New. A thread is in this state when an object for the thread is created with the new command but the thread has not yet started.

2. Runnable. Calling the start() method allocates memory for the new thread in the JVM and calls the run() method for the thread object. When a thread's run() method is invoked, the thread moves from the new to the runnable state. A thread in the runnable state is eligible to be run by the JVM. Note that Java does not distinguish between a thread that is eligible to run and a thread that is currently running. A running thread is still in the runnable state.

3. Blocked. A thread is in this state as it waits to acquire a lock—a tool used for thread synchronization. We cover such tools in Chapter 6.

4. Waiting. A thread in this state is waiting for an action by another thread. For example, a thread invoking the join() method enters this state as it waits for the thread it is joining on to terminate.

5. Timed waiting. This state is similar to waiting, except a thread specifies the maximum amount of time it will wait. For example, the join() method has an optional parameter that the waiting thread can use to specify how long it will wait until the other thread terminates. Timed waiting prevents a thread from remaining in the waiting state indefinitely.

6. Terminated. A thread moves to this state when its run() method terminates.

Figure 4.12 Java thread states.

Figure 4.12 illustrates these different thread states and labels several possible transitions between states. It is important to note that these states relate to the Java virtual machine and are not necessarily associated with the state of the thread running on the host operating system. The Java API for the Thread class provides several methods to determine the state of a thread. The isAlive() method returns true if a thread has been started but has not yet reached the Terminated state; otherwise, it returns false. The getState() method returns the state of a thread as an enumerated data type, as one of the values listed above. The source code available with this text provides an example program using the getState() method.

import java.util.Date;

public class Factory
{
    public static void main(String args[]) {
        // create the message queue
        Channel queue = new MessageQueue();

        // Create the producer and consumer threads and pass
        // each thread a reference to the MessageQueue object.
        Thread producer = new Thread(new Producer(queue));
        Thread consumer = new Thread(new Consumer(queue));

        // start the threads
        producer.start();
        consumer.start();
    }
}

Figure 4.13 The Factory class.
import java.util.Date;

class Producer implements Runnable
{
    private Channel queue;

    public Producer(Channel queue) {
        this.queue = queue;
    }

    public void run() {
        Date message;

        while (true) {
            // nap for awhile
            SleepUtilities.nap();

            // produce an item and enter it into the buffer
            message = new Date();
            System.out.println("Producer produced " + message);
            queue.send(message);
        }
    }
}

Figure 4.14 Producer thread.

4.4.4 A Multithreaded Solution to the Producer–Consumer Problem

We conclude our discussion of Java threads with a complete multithreaded solution to the producer–consumer problem that uses message passing. The class Factory in Figure 4.13 first creates a message queue for buffering messages, using the MessageQueue class developed in Chapter 3. It then creates separate producer and consumer threads (Figures 4.14 and 4.15, respectively) and passes each thread a reference to the shared queue. The producer thread alternates among sleeping for a while, producing an item, and entering that item into the queue. The consumer alternates between sleeping and then retrieving an item from the queue and consuming it. Because the receive() method of the MessageQueue class is nonblocking, the consumer must check to see whether the message that it retrieved is null.

4.5 Threading Issues

In this section, we discuss some of the issues to consider with multithreaded programs. In several instances, we highlight these issues with Java programs.

import java.util.Date;

class Consumer implements Runnable
{
    private Channel queue;

    public Consumer(Channel queue) {
        this.queue = queue;
    }

    public void run() {
        Date message;

        while (true) {
            // nap for awhile
            SleepUtilities.nap();

            // consume an item from the buffer
            message = queue.receive();
            if (message != null)
                System.out.println("Consumer consumed " + message);
        }
    }
}

Figure 4.15 Consumer thread.

4.5.1 The fork() and exec() System Calls

In Chapter 3, we described how the fork() system call is used to create a separate, duplicate process. The semantics of the fork() and exec() system calls change in a multithreaded program. If one thread in a program calls fork(), does the new process duplicate all threads, or is the new process single-threaded? Some UNIX systems have chosen to have two versions of fork(), one that duplicates all threads and another that duplicates only the thread that invoked the fork() system call.

The exec() system call typically works in the same way as described in Chapter 3. That is, if a thread invokes the exec() system call, the program specified in the parameter to exec() will replace the entire process—including all threads. Which of the two versions of fork() to use depends on the application. If exec() is called immediately after forking, then duplicating all threads is unnecessary, as the program specified in the parameters to exec() will replace the process. In this instance, duplicating only the calling thread is appropriate. If, however, the separate process does not call exec() after forking, the separate process should duplicate all threads.

4.5.2 Cancellation

Thread cancellation is the task of terminating a thread before it has completed. For example, if multiple threads are concurrently searching through a database and one thread returns the result, the remaining threads might be canceled.
Another situation might occur when a user presses a button on a Web browser that stops a Web page from loading any further. Often, a Web page is loaded using several threads—each image is loaded in a separate thread. When a user presses the stop button on the browser, all threads loading the page are canceled.

A thread that is to be canceled is often referred to as the target thread. Cancellation of a target thread may occur in two different scenarios:

1. Asynchronous cancellation. One thread immediately terminates the target thread.

2. Deferred cancellation. The target thread periodically checks whether it should terminate, allowing it an opportunity to terminate itself in an orderly fashion.

The difficulty with cancellation occurs in situations where resources have been allocated to a canceled thread or where a thread is canceled while in the midst of updating data it is sharing with other threads. This becomes especially troublesome with asynchronous cancellation. Often, the operating system will reclaim system resources from a canceled thread but will not reclaim all resources. Therefore, canceling a thread asynchronously may not free a necessary system-wide resource. With deferred cancellation, in contrast, one thread indicates that a target thread is to be canceled, but cancellation occurs only after the target thread has checked a flag to determine if it should be canceled or not. This allows the thread to be canceled at a point when it can be canceled safely. Pthreads refers to such points as cancellation points.

Java threads can be asynchronously terminated using the stop() method of the Thread class. However, this method has been deprecated. Deprecated methods are still implemented in the current API, but their use is discouraged. We discuss why stop() was deprecated in Section 7.3.2.

It is also possible to cancel a Java thread using deferred cancellation. As just described, deferred cancellation works by having the target thread periodically check whether it should terminate. In Java, checking involves use of the interrupt() method. The Java API defines the interrupt() method for the Thread class. When the interrupt() method is invoked, the interruption status of the target thread is set. A thread can periodically check its interruption status by invoking either the interrupted() method or the isInterrupted() method, both of which return true if the interruption status of the target thread is set. (It is important to note that the interrupted() method clears the interruption status of the target thread, whereas the isInterrupted() method preserves the interruption status.) Figure 4.16 illustrates how deferred cancellation works when the isInterrupted() method is used. An instance of an InterruptibleThread can be interrupted using the following code:

    Thread thrd = new Thread(new InterruptibleThread());
    thrd.start();
    ...
    thrd.interrupt();

In this example, the target thread can periodically check its interruption status via the isInterrupted() method and—if it is set—clean up before terminating. Because the InterruptibleThread class does not extend Thread, it cannot directly invoke instance methods in the Thread class. To invoke instance methods in Thread, a program must first invoke the static method currentThread(), which returns a Thread object representing the thread that is currently running. This return value can then be used to access instance methods in the Thread class.
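The distinction between interrupted() and isInterrupted() noted above is easy to demonstrate. The following short sketch (our own, not from the original text) interrupts the current thread and then queries the status both ways:

public class StatusDemo
{
    public static void main(String[] args) {
        // set the interruption status of the current thread
        Thread.currentThread().interrupt();

        // isInterrupted() reports the status without clearing it
        System.out.println(Thread.currentThread().isInterrupted()); // true
        System.out.println(Thread.currentThread().isInterrupted()); // still true

        // the static interrupted() method reports the status of the
        // current thread and then clears it
        System.out.println(Thread.interrupted()); // true
        System.out.println(Thread.interrupted()); // false - status was cleared
    }
}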
It is important to recognize that interrupting a thread via the interrupt() method only sets the interruption status of a thread; it is up to the target thread to check this interruption status periodically. Traditionally, Java does not wake a thread that is blocked in an I/O operation using the java.io package. Any thread blocked doing I/O in this package will not be able to check its interruption status until the call to I/O is completed. However, the java.nio package introduced in Java 1.4 provides facilities for interrupting a thread that is blocked performing I/O.

class InterruptibleThread implements Runnable
{
    /**
     * This thread will continue to run as long
     * as it is not interrupted.
     */
    public void run() {
        while (true) {
            /**
             * do some work for awhile
             * ...
             */

            if (Thread.currentThread().isInterrupted()) {
                System.out.println("I'm interrupted!");
                break;
            }
        }
        // clean up and terminate
    }
}

Figure 4.16 Deferred cancellation using the isInterrupted() method.

4.5.3 Signal Handling

A signal is used in UNIX systems to notify a process that a particular event has occurred. A signal may be received either synchronously or asynchronously, depending on the source of and the reason for the event being signaled. All signals, whether synchronous or asynchronous, follow the same pattern:

1. A signal is generated by the occurrence of a particular event.

2. A generated signal is delivered to a process.

3. Once delivered, the signal must be handled.

Examples of synchronous signals include illegal memory access and division by 0. If a running program performs either of these actions, a signal is generated. Synchronous signals are delivered to the same process that performed the operation that caused the signal (that is the reason they are considered synchronous).

When a signal is generated by an event external to a running process, that process receives the signal asynchronously. Examples of such signals include terminating a process with specific keystrokes (such as <control><C>) and having a timer expire. Typically, an asynchronous signal is sent to another process.

Every signal may be handled by one of two possible handlers:

1. A default signal handler

2. A user-defined signal handler

Every signal has a default signal handler that the kernel runs when handling that signal. This default action can be overridden by a user-defined signal handler that is called to handle the signal. Signals may be handled in different ways. Some signals (such as changing the size of a window) may simply be ignored; others (such as an illegal memory access) may be handled by terminating the program.

Handling signals in single-threaded programs is straightforward; signals are always delivered to a process. However, delivering signals is more complicated in multithreaded programs, where a process may have several threads. Where, then, should a signal be delivered? In general, the following options exist:

1. Deliver the signal to the thread to which the signal applies.

2. Deliver the signal to every thread in the process.

3. Deliver the signal to certain threads in the process.

4. Assign a specific thread to receive all signals for the process.

The method for delivering a signal depends on the type of signal generated. For example, synchronous signals need to be delivered to the thread causing the signal and not to other threads in the process. However, the situation with asynchronous signals is not as clear.
Some asynchronous signals—such as a signal that terminates a process (<control><C>, for example)—should be sent to all threads.

Most multithreaded versions of UNIX allow a thread to specify which signals it will accept and which it will block. Therefore, in some cases, an asynchronous signal may be delivered only to those threads that are not blocking it. However, because signals need to be handled only once, a signal is typically delivered only to the first thread found that is not blocking it. The standard UNIX function for delivering a signal is

    kill(pid_t pid, int signal);

here, we specify the process (pid) to which a particular signal is to be delivered. However, POSIX Pthreads also provides the

    pthread_kill(pthread_t tid, int signal)

function, which allows a signal to be delivered to a specified thread (tid).

Although Windows does not explicitly provide support for signals, they can be emulated using asynchronous procedure calls (APCs). The APC facility allows a user thread to specify a function that is to be called when the user thread receives notification of a particular event. As indicated by its name, an APC is roughly equivalent to an asynchronous signal in UNIX. However, whereas UNIX must contend with how to deal with signals in a multithreaded environment, the APC facility is more straightforward, because an APC is delivered to a particular thread rather than a process.

4.5.4 Thread Pools

In Section 4.1, we mentioned multithreading in a Web server. In this situation, whenever the server receives a request, it creates a separate thread to service the request. Whereas creating a separate thread is certainly superior to creating a separate process, a multithreaded server nonetheless has potential problems. The first concerns the amount of time required to create the thread prior to servicing the request, together with the fact that this thread will be discarded once it has completed its work. The second issue is more troublesome. If we allow all concurrent requests to be serviced in a new thread, we have not placed a bound on the number of threads concurrently active in the system. Unlimited threads could exhaust system resources, such as CPU time or memory. One solution to this issue is to use a thread pool.

The general idea behind a thread pool is to create a number of threads at process startup and place them into a pool, where they sit and wait for work. When a server receives a request, it awakens a thread from this pool—if one is available—and passes it the request to service. Once the thread completes its service, it returns to the pool and awaits more work. If the pool contains no available thread, the server waits until one becomes free. Thread pools offer these primary benefits:

1. Servicing a request with an existing thread is usually faster than waiting to create a thread.

2. A thread pool limits the number of threads that exist at any one point. This is particularly important on systems that cannot support a large number of concurrent threads.

The number of threads in the pool can be set heuristically based on factors such as the number of CPUs in the system, the amount of physical memory, and the expected number of concurrent client requests. More sophisticated thread-pool architectures can dynamically adjust the number of threads in the pool according to usage patterns. Such architectures provide the further benefit of having a smaller pool—thereby consuming less memory—when the load on the system is low.
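Before turning to Java's built-in support, it may help to see how little machinery the basic idea requires. The following minimal sketch is ours, not from the original text (the class name SimplePool is illustrative): a fixed number of worker threads is created once at startup, and each worker blocks on a shared queue, running whatever task arrives.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SimplePool
{
    private final BlockingQueue<Runnable> tasks =
        new LinkedBlockingQueue<Runnable>();

    public SimplePool(int size) {
        // create the worker threads once, at startup
        for (int i = 0; i < size; i++) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true)
                            tasks.take().run(); // block until work arrives
                    } catch (InterruptedException ie) {
                        // interrupted: this worker shuts down
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    public void submit(Runnable task) {
        tasks.add(task);
    }
}

A production pool must also handle orderly shutdown, task rejection, and exceptions thrown by tasks; these are exactly the concerns the java.util.concurrent API described next takes care of.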
The java.util.concurrent package includes an API for thread pools, along with other tools for concurrent programming. The Java API provides several varieties of thread-pool architectures; we focus on the following three models, which are available as static methods in the java.util.concurrent.Executors class:

1. Single thread executor—newSingleThreadExecutor()—creates a pool of size 1.

2. Fixed thread executor—newFixedThreadPool(int size)—creates a thread pool with a specified number of threads.

3. Cached thread executor—newCachedThreadPool()—creates an unbounded thread pool, reusing threads in many instances.

In the Java API, thread pools are structured around the Executor interface, which appears as follows:

    public interface Executor
    {
        void execute(Runnable command);
    }

Classes implementing this interface must define the execute() method, which is passed a Runnable object such as the following:

    public class Task implements Runnable
    {
        public void run() {
            System.out.println("I am working on a task.");
            ...
        }
    }

For Java developers, this means that code that runs as a separate thread using the following approach, as first illustrated in Section 4.4:

    Thread worker = new Thread(new Task());
    worker.start();

can also run under an Executor. (Note that Executor is an interface and cannot be instantiated directly; an implementation must be obtained, for instance from one of the factory methods listed below.)

    Executor service = Executors.newSingleThreadExecutor();
    service.execute(new Task());

A thread pool is created using one of the factory methods in the Executors class:

• static ExecutorService newSingleThreadExecutor()
• static ExecutorService newFixedThreadPool(int nThreads)
• static ExecutorService newCachedThreadPool()

Each of these factory methods creates and returns an object instance that implements the ExecutorService interface. ExecutorService extends the Executor interface, allowing us to invoke the execute() method on this object. However, ExecutorService also provides additional methods for managing termination of the thread pool.

The example shown in Figure 4.17 creates a cached thread pool and submits tasks to be executed by a thread in the pool. When the shutdown() method is invoked, the thread pool rejects additional tasks and shuts down once all existing tasks have completed execution. In Chapter 6, we provide a programming exercise to design and implement a thread pool.

import java.util.concurrent.*;

public class TPExample
{
    public static void main(String[] args) {
        int numTasks = Integer.parseInt(args[0].trim());

        // Create the thread pool
        ExecutorService pool = Executors.newCachedThreadPool();

        // Run each task using a thread in the pool
        for (int i = 0; i < numTasks; i++)
            pool.execute(new Task());

        // Shut down the pool. This shuts down the pool only
        // after all threads have completed.
        pool.shutdown();
    }
}

Figure 4.17 Creating a thread pool in Java.

4.5.5 Thread-Specific Data

Threads belonging to the same process share the data of the process. Indeed, this sharing of data provides one of the benefits of multithreaded programming. However, in some circumstances, each thread might need its own copy of certain data. We will call such data thread-specific data. For example, in a transaction-processing system, we might service each transaction in a separate thread. Furthermore, each transaction may be assigned a unique identifier. To associate each thread with its unique identifier, we could use thread-specific data. Most thread libraries—including Win32 and Pthreads—provide some form of support for thread-specific data. Java provides support as well.
At first glance, it may appear that Java has no need for thread-specific data, since all that is required to give each thread its own private data is to create threads by subclassing the Thread class and to declare instance data in this class. Indeed, as long as threads are constructed in that way, this approach works fine. However, when the developer has no control over the thread-creation process—for example, when a thread pool is being used—then an alternative approach is necessary.

The Java API provides the ThreadLocal class for declaring thread-specific data. ThreadLocal data can be initialized with either the initialValue() method or the set() method, and a thread can inquire as to the value of ThreadLocal data using the get() method. Typically, ThreadLocal data are declared as static. Consider the Service class shown in Figure 4.18, which declares errorCode as ThreadLocal data. The transaction() method in this class can be invoked by any number of threads. If an exception occurs, we assign the exception to errorCode using the set() method of the ThreadLocal class. Now consider a scenario in which two threads—say, thread 1 and thread 2—invoke transaction(). Assume that thread 1 generates exception A and thread 2 generates exception B. The value of errorCode for thread 1 and thread 2 will be A and B, respectively. Figure 4.19 illustrates how a thread can inquire as to the value of errorCode after invoking transaction().

class Service
{
    private static ThreadLocal errorCode = new ThreadLocal();

    public static void transaction() {
        try {
            /**
             * some operation where an error may occur
             * ...
             */
        }
        catch (Exception e) {
            errorCode.set(e);
        }
    }

    /**
     * Get the error code for this transaction
     */
    public static Object getErrorCode() {
        return errorCode.get();
    }
}

Figure 4.18 Using the ThreadLocal class.

class Worker implements Runnable
{
    private static Service provider;

    public void run() {
        provider.transaction();
        System.out.println(provider.getErrorCode());
    }
}

Figure 4.19 Inquiring the value of ThreadLocal data.

4.5.6 Scheduler Activations

A final issue to be considered with multithreaded programs concerns communication between the kernel and the thread library, which may be required by the many-to-many and two-level models discussed in Section 4.2.3. Such coordination allows the number of kernel threads to be dynamically adjusted to help ensure the best performance.

Many systems implementing either the many-to-many or two-level model place an intermediate data structure between the user and kernel threads. This data structure—typically known as a lightweight process, or LWP—is shown in Figure 4.20. To the user-thread library, the LWP appears to be a virtual processor on which the application can schedule a user thread to run. Each LWP is attached to a kernel thread, and it is kernel threads that the operating system schedules to run on physical processors. If a kernel thread blocks (such as while waiting for an I/O operation to complete), the LWP blocks as well. Up the chain, the user-level thread attached to the LWP also blocks.

Figure 4.20 Lightweight process (LWP).

An application may require any number of LWPs to run efficiently. Consider a CPU-bound application running on a single processor. In this scenario, only one thread can run at once, so one LWP is sufficient. An application that is I/O-intensive may require multiple LWPs to execute, however. Typically, an LWP is required for each concurrent blocking system call.
Suppose, for example, that five different file-read requests occur simultaneously. Five LWPs are needed, because all could be waiting for I/O completion in the kernel. If a process has only four LWPs, then the fifth request must wait for one of the LWPs to return from the kernel.

One scheme for communication between the user-thread library and the kernel is known as scheduler activation. It works as follows: The kernel provides an application with a set of virtual processors (LWPs), and the application can schedule user threads onto an available virtual processor. Furthermore, the kernel must inform an application about certain events. This procedure is known as an upcall. Upcalls are handled by the thread library with an upcall handler, and upcall handlers must run on a virtual processor. One event that triggers an upcall occurs when an application thread is about to block. In this scenario, the kernel makes an upcall to the application informing it that a thread is about to block and identifying the specific thread. The kernel then allocates a new virtual processor to the application. The application runs an upcall handler on this new virtual processor, which saves the state of the blocking thread and relinquishes the virtual processor on which the blocking thread is running. The upcall handler then schedules another thread that is eligible to run on the new virtual processor. When the event that the blocking thread was waiting for occurs, the kernel makes another upcall to the thread library informing it that the previously blocked thread is now eligible to run. The upcall handler for this event also requires a virtual processor, and the kernel may allocate a new virtual processor or preempt one of the user threads and run the upcall handler on its virtual processor. After marking the unblocked thread as eligible to run, the application schedules an eligible thread to run on an available virtual processor.

4.6 Operating-System Examples

In this section, we explore how threads are implemented in Windows XP and Linux systems.

4.6.1 Windows XP Threads

Windows XP implements the Win32 API, which is the primary API for the family of Microsoft operating systems (Windows 95, 98, NT, 2000, and XP). Indeed, much of what is mentioned in this section applies to this entire family of operating systems. A Windows XP application runs as a separate process, and each process may contain one or more threads. The Win32 API for creating threads is covered in Section 4.3.2. Windows XP uses the one-to-one mapping described in Section 4.2.2, where each user-level thread maps to an associated kernel thread. However, Windows XP also provides support for a fiber library, which provides the functionality of the many-to-many model (Section 4.2.3). By using the thread library, any thread belonging to a process can access the address space of the process.

The general components of a thread include:

• A thread ID uniquely identifying the thread
• A register set representing the status of the processor
• A user stack, employed when the thread is running in user mode, and a kernel stack, employed when the thread is running in kernel mode
• A private storage area used by various run-time libraries and dynamic link libraries (DLLs)

The register set, stacks, and private storage area are known as the context of the thread.
The primary data structures of a thread include:

• ETHREAD—executive thread block
• KTHREAD—kernel thread block
• TEB—thread environment block

The key components of the ETHREAD include a pointer to the process to which the thread belongs and the address of the routine in which the thread starts control. The ETHREAD also contains a pointer to the corresponding KTHREAD. The KTHREAD includes scheduling and synchronization information for the thread. In addition, the KTHREAD includes the kernel stack (used when the thread is running in kernel mode) and a pointer to the TEB. The ETHREAD and the KTHREAD exist entirely in kernel space; this means that only the kernel can access them. The TEB is a user-space data structure that is accessed when the thread is running in user mode. Among other fields, the TEB contains the thread identifier, a user-mode stack, and an array for thread-specific data (which Windows XP terms thread-local storage). The structure of a Windows XP thread is illustrated in Figure 4.21.

Figure 4.21 Data structures of a Windows XP thread.

4.6.2 Linux Threads

Linux provides the fork() system call with the traditional functionality of duplicating a process, as described in Chapter 3. Linux also provides the ability to create threads using the clone() system call. However, Linux does not distinguish between processes and threads. In fact, Linux generally uses the term task—rather than process or thread—when referring to a flow of control within a program. When clone() is invoked, it is passed a set of flags, which determine how much sharing is to take place between the parent and child tasks. Some of these flags are listed below:

    flag             meaning
    CLONE_FS         File-system information is shared.
    CLONE_VM         The same memory space is shared.
    CLONE_SIGHAND    Signal handlers are shared.
    CLONE_FILES      The set of open files is shared.

For example, if clone() is passed the flags CLONE_FS, CLONE_VM, CLONE_SIGHAND, and CLONE_FILES, the parent and child tasks will share the same file-system information (such as the current working directory), the same memory space, the same signal handlers, and the same set of open files. Using clone() in this fashion is equivalent to creating a thread as described in this chapter, since the parent task shares most of its resources with its child task. However, if none of these flags is set when clone() is invoked, no sharing takes place, resulting in functionality similar to that provided by the fork() system call.

The varying level of sharing is possible because of the way a task is represented in the Linux kernel. A unique kernel data structure (specifically, struct task_struct) exists for each task in the system. This data structure, instead of storing data for the task, contains pointers to other data structures where these data are stored—for example, data structures that represent the list of open files, signal-handling information, and virtual memory. When fork() is invoked, a new task is created, along with a copy of all the associated data structures of the parent process. A new task is also created when the clone() system call is made. However, rather than copying all data structures, the new task points to the data structures of the parent task, depending on the set of flags passed to clone().
Several distributions of the Linux kernel now include the NPTL thread library. NPTL (which stands for Native POSIX Thread Library) provides a POSIX-compliant thread model for Linux systems, along with several other features, such as better support for SMP systems and the ability to take advantage of NUMA support. In addition, the start-up cost for creating a thread is lower with NPTL than with traditional Linux threads. Finally, with NPTL, the system has the potential to support hundreds of thousands of threads. Such support becomes more important with the growth of multicore and other SMP systems.

4.7 Summary

A thread is a flow of control within a process. A multithreaded process contains several different flows of control within the same address space. The benefits of multithreading include increased responsiveness to the user, resource sharing within the process, economy, and scalability issues such as more efficient use of multiple cores.

User-level threads are threads that are visible to the programmer and are unknown to the kernel. The operating-system kernel supports and manages kernel-level threads. In general, user-level threads are faster to create and manage than are kernel threads, as no intervention from the kernel is required. Three different types of models relate user and kernel threads: The many-to-one model maps many user threads to a single kernel thread. The one-to-one model maps each user thread to a corresponding kernel thread. The many-to-many model multiplexes many user threads to a smaller or equal number of kernel threads. Most modern operating systems provide kernel support for threads; among these are Windows 98, XP, and Vista, as well as Solaris and Linux.

Thread libraries provide the application programmer with an API for creating and managing threads. Three primary thread libraries are in common use: POSIX Pthreads, Win32 threads for Windows systems, and Java threads. Multithreaded programs introduce many challenges for the programmer, including the semantics of the fork() and exec() system calls. Other issues include thread cancellation, signal handling, and thread-specific data. The Java API addresses many of these threading issues.

Practice Exercises

4.1 Provide two programming examples in which multithreading provides better performance than a single-threaded solution.

4.2 What are two differences between user-level threads and kernel-level threads? Under what circumstances is one type better than the other?

4.3 Describe the actions taken by a kernel to context-switch between kernel-level threads.

4.4 What resources are used when a thread is created? How do they differ from those used when a process is created?

4.5 Assume that an operating system maps user-level threads to the kernel using the many-to-many model and that the mapping is done through LWPs. Furthermore, the system allows developers to create real-time threads for use in real-time systems. Is it necessary to bind a real-time thread to an LWP? Explain.

4.6 A Pthread program that performs the summation function was provided in Section 4.3.1. Rewrite this program in Java.

Exercises

4.7 Provide two programming examples in which multithreading does not provide better performance than a single-threaded solution.

4.8 Describe the actions taken by a thread library to context-switch between user-level threads.
4.9 Under what circumstances does a multithreaded solution using multiple kernel threads provide better performance than a single-threaded solution on a single-processor system?
4.10 Which of the following components of program state are shared across threads in a multithreaded process?
a. Register values
b. Heap memory
c. Global variables
d. Stack memory
4.11 Can a multithreaded solution using multiple user-level threads achieve better performance on a multiprocessor system than on a single-processor system? Explain.
4.12 As described in Section 4.6.2, Linux does not distinguish between processes and threads. Instead, Linux treats both in the same way, allowing a task to be more akin to a process or a thread depending on the set of flags passed to the clone() system call. However, many operating systems—such as Windows XP and Solaris—treat processes and threads differently. Typically, such systems use a notation wherein the data structure for a process contains pointers to the separate threads belonging to the process. Contrast these two approaches for modeling processes and threads within the kernel.
4.13 The program shown in Figure 4.22 uses the Pthreads API. What would be the output from the program at LINE C and LINE P?
4.14 Consider a multiprocessor system and a multithreaded program written using the many-to-many threading model. Let the number of user-level threads in the program be more than the number of processors in the system. Discuss the performance implications of the following scenarios.
a. The number of kernel threads allocated to the program is less than the number of processors.
b. The number of kernel threads allocated to the program is equal to the number of processors.
c. The number of kernel threads allocated to the program is greater than the number of processors but less than the number of user-level threads.

Programming Problems

4.15 Exercise 3.17 in Chapter 3 involves designing a client–server program where the client sends a message to the server and the server responds with a count containing the number of characters and digits in the message. However, this server is single-threaded, meaning that the server cannot respond to other clients until the current client closes its connection. Modify your solution to Exercise 3.17 so that the server services each client in a separate thread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int value = 0;
void *runner(void *param); /* the thread */

int main(int argc, char *argv[])
{
  pid_t pid;
  pthread_t tid;
  pthread_attr_t attr;

  pid = fork();
  if (pid == 0) { /* child process */
    pthread_attr_init(&attr);
    pthread_create(&tid, &attr, runner, NULL);
    pthread_join(tid, NULL);
    printf("CHILD: value = %d", value); /* LINE C */
  }
  else if (pid > 0) { /* parent process */
    wait(NULL);
    printf("PARENT: value = %d", value); /* LINE P */
  }
}

void *runner(void *param) {
  value = 5;
  pthread_exit(0);
}

Figure 4.22 C program for Exercise 4.13.

4.16 Write a multithreaded Java, Pthreads, or Win32 program that outputs prime numbers. This program should work as follows: The user will run the program and will enter a number on the command line. The program will then create a separate thread that outputs all the prime numbers less than or equal to the number entered by the user.
4.17 Modify the socket-based date server (Figure 3.26) in Chapter 3 so that the server services each client request in a separate thread.
4.18 Modify the socket-based date server (Figure 3.26) in Chapter 3 so that the server services each client request using a thread pool.
4.19 The Fibonacci sequence is the series of numbers 0, 1, 1, 2, 3, 5, 8, .... Formally, it can be expressed as:

fib_0 = 0
fib_1 = 1
fib_n = fib_{n-1} + fib_{n-2}

Write a multithreaded program that generates the Fibonacci sequence using either the Java, Pthreads, or Win32 thread library. This program should work as follows: The user will enter on the command line the number of Fibonacci numbers that the program is to generate. The program will then create a separate thread that will generate the Fibonacci numbers, placing the sequence in data that can be shared by the threads (an array is probably the most convenient data structure). When the thread finishes execution, the parent thread will output the sequence generated by the child thread. Because the parent thread cannot begin outputting the Fibonacci sequence until the child thread finishes, this will require having the parent thread wait for the child thread to finish, using the techniques described in Section 4.3.
4.20 Write a multithreaded sorting program in Java that works as follows: A collection of items is divided into two lists of equal size. Each list is then passed to a separate thread (a sorting thread), which sorts the list using any sorting algorithm (or algorithms) of your choice. The two sorted lists are passed to a third thread (a merge thread), which merges the two separate lists into a single sorted list. Once the two lists have been merged, the complete sorted list is output. If we were sorting integer values, this program should be structured as depicted in Figure 4.23.

Figure 4.23 Sorting: the original list 7, 12, 19, 3, 18, 4, 2, 6, 15, 8 is split between Sorting Thread0 (7, 12, 19, 3, 18) and Sorting Thread1 (4, 2, 6, 15, 8), and a Merge Thread produces the sorted list 2, 3, 4, 6, 7, 8, 12, 15, 18, 19.

Perhaps the easiest way of designing a sorting thread is to pass the constructor an array containing java.lang.Object, where each Object must implement the java.lang.Comparable interface. Many objects in the Java API implement the Comparable interface. For the purposes of this project, we recommend using Integer objects. To ensure that the two sorting threads have completed execution, the main thread will need to use the join() method on the two sorting threads before passing the two sorted lists to the merge thread. Similarly, the main thread will need to use join() on the merge thread before it outputs the complete sorted list. For a discussion of sorting algorithms, consult the bibliography.
4.21 Write a Java program that lists all threads in the Java virtual machine. Before proceeding, you will need some background information. All threads in the JVM belong to a thread group, and a thread group is identified in the Java API by the ThreadGroup class. Thread groups are organized as a tree structure, where the root of the tree is the system thread group. The system thread group contains threads that are automatically created by the JVM, mostly for managing object references. Below the system thread group is the main thread group. The main thread group contains the initial thread in a Java program that begins execution in the main() method. It is also the default thread group, meaning that—unless otherwise specified—all threads you create belong to this group. It is possible to create additional thread groups and assign newly created threads to these groups.
Furthermore, when creating a thread group, you may specify its parent. For example, the following statements create three new thread groups: alpha, beta, and theta:

ThreadGroup alpha = new ThreadGroup("alpha");
ThreadGroup beta = new ThreadGroup("beta");
ThreadGroup theta = new ThreadGroup(alpha, "theta");

The alpha and beta groups belong to the default—or main—thread group. However, the constructor for the theta thread group indicates that its parent thread group is alpha. Thus, we have the thread-group hierarchy depicted in Figure 4.24.

Figure 4.24 Thread-group hierarchy: system is the root, with main below it; alpha and beta lie below main, and theta lies below alpha.

Notice that all thread groups have a parent, with the obvious exception of the system thread group, whose parent is null. Writing a program that lists all threads in the Java virtual machine will involve a careful reading of the Java API—in particular, the java.lang.ThreadGroup and java.lang.Thread classes. A strategy for constructing this program is to first identify all thread groups in the JVM and then identify all threads within each group. To determine all thread groups, first obtain the ThreadGroup of the current thread, and then ascend the thread-group tree hierarchy to its root. Next, get all thread groups below the root thread group; you should find the overloaded enumerate() method in the ThreadGroup class especially helpful. Next, identify the threads belonging to each group. Again, the enumerate() method should prove helpful. One thing to be careful of when using the enumerate() method is that it expects an array as a parameter. This means that you will need to determine the size of the array before calling enumerate(). Additional methods in the ThreadGroup API should be useful for determining how to size arrays.

Have your output list each thread group and all threads within each group. For example, based on Figure 4.24, your output would appear as follows:

• system: all threads in the system group
• main: all threads in the main group
• alpha: all threads in the alpha group
• beta: all threads in the beta group
• theta: all threads in the theta group

When outputting each thread, list the following fields:
a. The thread name
b. The thread identifier
c. The state of the thread
d. Whether or not the thread is a daemon

In the source code download for this chapter on WileyPLUS, we provide an example program (CreateThreadGroups.java) that creates the alpha, beta, and theta thread groups as well as several threads within each group. To use this program, enter the statement

new CreateThreadGroups();

at the beginning of your thread-listing program.
4.22 Modify the preceding problem so that the output appears in a tabular format using a graphical interface. Have the left-most column represent the name of the thread group and successive columns represent the four fields of each thread. In addition, whereas the program in the preceding problem lists all threads in the JVM only once, allow this program to periodically refresh the listing of threads and thread groups by specifying a refresh parameter on the command line when invoking the program. Represent this parameter in milliseconds. For example, if your program is named ThreadLister, to invoke the program so that it refreshes the list ten times per second, enter the following:

java ThreadLister 100

Programming Projects

The projects below deal with two distinct topics—naming services and matrix multiplication.
Project 1: Naming Service Project

A naming service such as DNS (domain name system) can be used to resolve IP names to IP addresses. For example, when someone accesses the host www.westminstercollege.edu, a naming service is used to determine the IP address that is mapped to the IP name www.westminstercollege.edu. This assignment consists of writing a multithreaded naming service in Java using sockets (see Section 3.6.1). The java.net API provides the following mechanism for resolving IP names:

InetAddress hostAddress = InetAddress.getByName("www.westminstercollege.edu");
String IPaddress = hostAddress.getHostAddress();

where getByName() throws an UnknownHostException if it is unable to resolve the host name.

The Server

The server will listen to port 6052 waiting for client connections. When a client connection is made, the server will service the connection in a separate thread and will resume listening for additional client connections. Once a client makes a connection to the server, the client will write the IP name it wishes the server to resolve—such as www.westminstercollege.edu—to the socket. The server thread will read this IP name from the socket and either resolve its IP address or, if it cannot locate the host address, catch an UnknownHostException. The server will write the IP address back to the client or, in the case of an UnknownHostException, will write the message "Unable to resolve host." Once the server has written to the client, it will close its socket connection.

The Client

Initially, write just the server application and connect to it via telnet. For example, assuming the server is running on the local host, a telnet session will appear as follows. (Client responses appear in blue.)

telnet localhost 6052
Connected to localhost.
Escape character is '^]'.
www.westminstercollege.edu
146.86.1.17
Connection closed by foreign host.

By initially having telnet act as a client, you can more easily debug any problems you may have with your server. Once you are convinced your server is working properly, you can write a client application. The client will be passed the IP name that is to be resolved as a parameter. The client will open a socket connection to the server and then write the IP name that is to be resolved. It will then read the response sent back by the server. As an example, if the client is named NSClient, it is invoked as follows:

java NSClient www.westminstercollege.edu

The server will respond with the corresponding IP address or "unknown host" message. Once the client has output the IP address, it will close its socket connection.

Project 2: Matrix Multiplication Project

Given two matrices, A and B, where matrix A contains M rows and K columns and matrix B contains K rows and N columns, the matrix product of A and B is matrix C, where C contains M rows and N columns. The entry in matrix C for row i, column j ($C_{i,j}$) is the sum of the products of the elements for row i in matrix A and column j in matrix B. That is,

$$C_{i,j} = \sum_{n=1}^{K} A_{i,n} \times B_{n,j}$$

For example, if A is a 3-by-2 matrix and B is a 2-by-3 matrix, element $C_{3,1}$ is the sum of $A_{3,1} \times B_{1,1}$ and $A_{3,2} \times B_{2,1}$. For this project, you need to calculate each element $C_{i,j}$ in a separate worker thread. This will involve creating M × N worker threads. The main—or parent—thread will initialize the matrices A and B and allocate sufficient memory for matrix C, which will hold the product of matrices A and B. These matrices will be declared as global data so that each worker thread has access to A, B, and C.
Matrices A and B can be initialized statically, as shown below:

#define M 3
#define K 2
#define N 3

int A[M][K] = { {1,4}, {2,5}, {3,6} };
int B[K][N] = { {8,7,6}, {5,4,3} };
int C[M][N];

Alternatively, they can be populated by reading in values from a file.

Passing Parameters to Each Thread

The parent thread will create M × N worker threads, passing each worker the values of row i and column j that it is to use in calculating the matrix product. This requires passing two parameters to each thread. The easiest approach with Pthreads and Win32 is to create a data structure using a struct. The members of this structure are i and j, and the structure appears as follows:

/* structure for passing data to threads */
struct v {
  int i; /* row */
  int j; /* column */
};

Both the Pthreads and Win32 programs will create the worker threads using a strategy similar to that shown below:

/* We have to create M * N worker threads */
for (i = 0; i < M; i++) {
  for (j = 0; j < N; j++) {
    struct v *data = (struct v *) malloc(sizeof(struct v));
    data->i = i;
    data->j = j;
    /* Now create the thread passing it data as a parameter */
  }
}

The data pointer will be passed to either the pthread_create() (Pthreads) function or the CreateThread() (Win32) function, which in turn will pass it as a parameter to the function that is to run as a separate thread.

Sharing of data between Java threads is different from sharing between threads in Pthreads or Win32. One approach is for the main thread to create and initialize the matrices A, B, and C. This main thread will then create the worker threads, passing the three matrices—along with row i and column j—to the constructor for each worker. Thus, the outline of a worker thread appears in Figure 4.25.

public class WorkerThread implements Runnable
{
  private int row;
  private int col;
  private int[][] A;
  private int[][] B;
  private int[][] C;

  public WorkerThread(int row, int col, int[][] A, int[][] B, int[][] C) {
    this.row = row;
    this.col = col;
    this.A = A;
    this.B = B;
    this.C = C;
  }

  public void run() {
    /* calculate the matrix product in C[row][col] */
  }
}

Figure 4.25 Worker thread in Java.

Waiting for Threads to Complete

Once all worker threads have completed, the main thread will output the product contained in matrix C. This requires the main thread to wait for all worker threads to finish before it can output the value of the matrix product. Several different strategies can be used to enable a thread to wait for other threads to finish. Section 4.3 describes how to wait for a child thread to complete using the Win32, Pthreads, and Java thread libraries. Win32 provides the WaitForSingleObject() function, whereas Pthreads and Java use pthread_join() and join(), respectively. However, in these programming examples, the parent thread waits for a single child thread to finish; completing this exercise will require waiting for multiple threads.

In Section 4.3.2, we describe the WaitForSingleObject() function, which is used to wait for a single thread to finish. However, the Win32 API also provides the WaitForMultipleObjects() function, which is used when waiting for multiple threads to complete. WaitForMultipleObjects() is passed four parameters:

1. The number of objects to wait for
2. A pointer to the array of objects
3. A flag indicating if all objects have been signaled
4. A timeout duration (or INFINITE)

For example, if THandles is an array of thread HANDLE objects of size N, the parent thread can wait for all its child threads to complete with the statement:

WaitForMultipleObjects(N, THandles, TRUE, INFINITE);

A simple strategy for waiting on several threads using the Pthreads pthread_join() or Java's join() is to enclose the join operation within a simple for loop. For example, you could join on ten threads using the Pthread code depicted in Figure 4.26. The equivalent code using Java threads is shown in Figure 4.27.

#define NUM_THREADS 10

/* an array of threads to be joined upon */
pthread_t workers[NUM_THREADS];

for (int i = 0; i < NUM_THREADS; i++)
  pthread_join(workers[i], NULL);

Figure 4.26 Pthread code for joining ten threads.

final static int NUM_THREADS = 10;

/* an array of threads to be joined upon */
Thread[] workers = new Thread[NUM_THREADS];

for (int i = 0; i < NUM_THREADS; i++) {
  try {
    workers[i].join();
  } catch (InterruptedException ie) { }
}

Figure 4.27 Java code for joining ten threads.

Wiley Plus

Visit Wiley Plus for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Threads have had a long evolution, starting as "cheap concurrency" in programming languages and moving to "lightweight processes", with early examples that included the Thoth system (Cheriton et al. [1979]) and the Pilot system (Redell et al. [1980]). Binding [1985] described moving threads into the UNIX kernel. Mach (Accetta et al. [1986], Tevanian et al. [1987a]) and V (Cheriton [1988]) made extensive use of threads, and eventually almost all major operating systems implemented them in some form or another. Thread performance issues were discussed by Anderson et al. [1989], who continued their work in Anderson et al. [1991] by evaluating the performance of user-level threads with kernel support. Bershad et al. [1990] describe combining threads with RPC. Engelschall [2000] discusses a technique for supporting user-level threads. An analysis of an optimal thread-pool size can be found in Ling et al. [2000]. Scheduler activations were first presented in Anderson et al. [1991], and Williams [2002] discusses scheduler activations in the NetBSD system. Other mechanisms by which the user-level thread library and the kernel cooperate with each other are discussed in Marsh et al. [1991], Govindan and Anderson [1991], Draves et al. [1991], and Black [1990]. Zabatta and Young [1998] compare Windows NT and Solaris threads on a symmetric multiprocessor. Pinilla and Gill [2003] compare Java thread performance on Linux, Windows, and Solaris. Vahalia [1996] covers threading in several versions of UNIX. McDougall and Mauro [2007] describe recent developments in threading the Solaris kernel. Russinovich and Solomon [2005] discuss threading in the Windows operating system family. Bovet and Cesati [2006] and Love [2004] explain how Linux handles threading, and Singh [2007] covers threads in Mac OS X. Information on Pthreads programming is given in Lewis and Berg [1998] and Butenhof [1997]. Oaks and Wong [2004], Lewis and Berg [2000], and Holub [2000] discuss multithreading in Java. Goetz et al. [2006] present a detailed discussion of concurrent programming in Java. Beveridge and Wiener [1997] and Cohen and Woodring [1997] describe multithreading using Win32.
Data structures and sorting algorithms presented in Java can be found in Goodrich and Tamassia [2006].

CHAPTER 5 CPU Scheduling

CPU scheduling is the basis of multiprogrammed operating systems. By switching the CPU among processes, the operating system can make the computer more productive. In this chapter, we introduce basic CPU-scheduling concepts and present several CPU-scheduling algorithms. We also consider the problem of selecting an algorithm for a particular system. In Chapter 4, we introduced threads to the process model. On operating systems that support them, it is kernel-level threads—not processes—that are in fact being scheduled by the operating system. However, the terms process scheduling and thread scheduling are often used interchangeably. In this chapter, we use process scheduling when discussing general scheduling concepts and thread scheduling to refer to thread-specific ideas.

CHAPTER OBJECTIVES

• To introduce CPU scheduling, which is the basis for multiprogrammed operating systems.
• To describe various CPU-scheduling algorithms.
• To discuss evaluation criteria for selecting a CPU-scheduling algorithm for a particular system.

5.1 Basic Concepts

In a single-processor system, only one process can run at a time; any others must wait until the CPU is free and can be rescheduled. The objective of multiprogramming is to have some process running at all times, to maximize CPU utilization. The idea is relatively simple. A process is executed until it must wait, typically for the completion of some I/O request. In a simple computer system, the CPU then just sits idle. All this waiting time is wasted; no useful work is accomplished. With multiprogramming, we try to use this time productively. Several processes are kept in memory at one time. When one process has to wait, the operating system takes the CPU away from that process and gives the CPU to another process. This pattern continues. Every time one process has to wait, another process can take over use of the CPU.

Figure 5.1 Alternating sequence of CPU and I/O bursts: a process cycles between CPU bursts (for example, load, store, add, and read operations) and I/O bursts spent waiting for I/O.

Scheduling of this kind is a fundamental operating-system function. Almost all computer resources are scheduled before use. The CPU is, of course, one of the primary computer resources. Thus, its scheduling is central to operating-system design.

5.1.1 CPU–I/O Burst Cycle

The success of CPU scheduling depends on an observed property of processes: process execution consists of a cycle of CPU execution and I/O wait. Processes alternate between these two states. Process execution begins with a CPU burst. That is followed by an I/O burst, which is followed by another CPU burst, then another I/O burst, and so on. Eventually, the final CPU burst ends with a system request to terminate execution (Figure 5.1). The durations of CPU bursts have been measured extensively. Although they vary greatly from process to process and from computer to computer, they tend to have a frequency curve similar to that shown in Figure 5.2. The curve is generally characterized as exponential or hyperexponential, with a large number of short CPU bursts and a small number of long CPU bursts.

Figure 5.2 Histogram of CPU-burst durations (frequency versus burst duration in milliseconds).
An I/O-bound program typically has many short CPU bursts. A CPU-bound program might have a few long CPU bursts. This distribution can be important in the selection of an appropriate CPU-scheduling algorithm.

5.1.2 CPU Scheduler

Whenever the CPU becomes idle, the operating system must select one of the processes in the ready queue to be executed. The selection process is carried out by the short-term scheduler (or CPU scheduler). The scheduler selects a process from the processes in memory that are ready to execute and allocates the CPU to that process. Note that the ready queue is not necessarily a first-in, first-out (FIFO) queue. As we shall see when we consider the various scheduling algorithms, a ready queue can be implemented as a FIFO queue, a priority queue, a tree, or simply an unordered linked list. Conceptually, however, all the processes in the ready queue are lined up waiting for a chance to run on the CPU. The records in the queues are generally process control blocks (PCBs) of the processes.

5.1.3 Preemptive Scheduling

CPU-scheduling decisions may take place under the following four circumstances:

1. When a process switches from the running state to the waiting state (for example, as the result of an I/O request or an invocation of wait() for the termination of one of the child processes)
2. When a process switches from the running state to the ready state (for example, when an interrupt occurs)
3. When a process switches from the waiting state to the ready state (for example, at completion of I/O)
4. When a process terminates

For situations 1 and 4, there is no choice in terms of scheduling. A new process (if one exists in the ready queue) must be selected for execution. There is a choice, however, for situations 2 and 3. When scheduling takes place only under circumstances 1 and 4, we say that the scheduling scheme is nonpreemptive or cooperative; otherwise, it is preemptive. Under nonpreemptive scheduling, once the CPU has been allocated to a process, the process keeps the CPU until it releases the CPU, either by terminating or by switching to the waiting state. This scheduling method was used by Microsoft Windows 3.x; Windows 95 introduced preemptive scheduling, and all subsequent versions of Windows operating systems have used preemptive scheduling. The Mac OS X operating system for the Macintosh also uses preemptive scheduling; previous versions of the Macintosh operating system relied on cooperative scheduling. Cooperative scheduling is the only method that can be used on certain hardware platforms, because it does not require the special hardware (for example, a timer) needed for preemptive scheduling.

Unfortunately, preemptive scheduling incurs a cost associated with access to shared data. Consider the case of two processes that share data. While one is updating the data, it is preempted so that the second process can run. The second process then tries to read the data, which are in an inconsistent state. In such situations, we need new mechanisms to coordinate access to shared data; we discuss this topic in Chapter 6. Preemption also affects the design of the operating-system kernel. During the processing of a system call, the kernel may be busy with an activity on behalf of a process. Such activities may involve changing important kernel data (for instance, I/O queues). What happens if the process is preempted in the middle of these changes and the kernel (or the device driver) needs to read or modify the same structure? Chaos ensues.
Certain operating systems, including most versions of UNIX, deal with this problem by waiting either for a system call to complete or for an I/O block to take place before doing a context switch. This scheme ensures that the kernel structure is simple, since the kernel will not preempt a process while the kernel data structures are in an inconsistent state. Unfortunately, this kernel-execution model is a poor one for supporting real-time computing and multiprocessing. These problems, and their solutions, are described in Sections 5.5 and 19.5. Because interrupts can, by definition, occur at any time, and because they cannot always be ignored by the kernel, the sections of code affected by interrupts must be guarded from simultaneous use. The operating system needs to accept interrupts at almost all times; otherwise, input might be lost or output overwritten. So that these sections of code are not accessed concurrently by several processes, they disable interrupts at entry and reenable interrupts at exit. It is important to note that sections of code that disable interrupts do not occur very often and typically contain few instructions.

5.1.4 Dispatcher

Another component involved in the CPU-scheduling function is the dispatcher. The dispatcher is the module that gives control of the CPU to the process selected by the short-term scheduler. This function involves the following:

• Switching context
• Switching to user mode
• Jumping to the proper location in the user program to restart that program

The dispatcher should be as fast as possible, since it is invoked during every process switch. The time it takes for the dispatcher to stop one process and start another running is known as the dispatch latency.

5.2 Scheduling Criteria

Different CPU-scheduling algorithms have different properties, and the choice of a particular algorithm may favor one class of processes over another. In choosing which algorithm to use in a particular situation, we must consider the properties of the various algorithms. Many criteria have been suggested for comparing CPU-scheduling algorithms. Which characteristics are used for comparison can make a substantial difference in which algorithm is judged to be best. The criteria include the following:

• CPU utilization. We want to keep the CPU as busy as possible. Conceptually, CPU utilization can range from 0 to 100 percent. In a real system, it should range from 40 percent (for a lightly loaded system) to 90 percent (for a heavily used system).
• Throughput. If the CPU is busy executing processes, then work is being done. One measure of work is the number of processes that are completed per time unit, called throughput. For long processes, this rate may be one process per hour; for short transactions, it may be ten processes per second.
• Turnaround time. From the point of view of a particular process, the important criterion is how long it takes to execute that process. The interval from the time of submission of a process to the time of completion is the turnaround time. Turnaround time is the sum of the periods spent waiting to get into memory, waiting in the ready queue, executing on the CPU, and doing I/O.
• Waiting time. The CPU-scheduling algorithm does not affect the amount of time during which a process executes or does I/O; it affects only the amount of time that a process spends waiting in the ready queue. Waiting time is the sum of the periods spent waiting in the ready queue.
• Response time.
In an interactive system, turnaround time may not be the best criterion. Often, a process can produce some output fairly early and can continue computing new results while previous results are being output to the user. Thus, another measure is the time from the submission of a request until the first response is produced. This measure, called response time, is the time it takes to start responding, not the time it takes to output the response. The turnaround time is generally limited by the speed of the output device.

It is desirable to maximize CPU utilization and throughput and to minimize turnaround time, waiting time, and response time. In most cases, we optimize the average measure. However, under some circumstances, it is desirable to optimize the minimum or maximum values rather than the average. For example, to guarantee that all users get good service, we may want to minimize the maximum response time. Investigators have suggested that, for interactive systems (such as time-sharing systems), it is more important to minimize the variance in the response time than to minimize the average response time. A system with reasonable and predictable response time may be considered more desirable than a system that is faster on the average but is highly variable. However, little work has been done on CPU-scheduling algorithms that minimize variance.

As we discuss various CPU-scheduling algorithms in the following section, we illustrate their operation. An accurate illustration should involve many processes, each a sequence of several hundred CPU bursts and I/O bursts. For simplicity, though, we consider only one CPU burst (in milliseconds) per process in our examples. Our measure of comparison is the average waiting time. More elaborate evaluation mechanisms are discussed in Section 5.8.

5.3 Scheduling Algorithms

CPU scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated the CPU. There are many different CPU-scheduling algorithms. In this section, we describe several of them.

5.3.1 First-Come, First-Served Scheduling

By far the simplest CPU-scheduling algorithm is the first-come, first-served (FCFS) scheduling algorithm. With this scheme, the process that requests the CPU first is allocated the CPU first. The implementation of the FCFS policy is easily managed with a FIFO queue. When a process enters the ready queue, its process control block is linked onto the tail of the queue. When the CPU is free, it is allocated to the process at the head of the queue. The running process is then removed from the queue. The code for FCFS scheduling is simple to write and understand. On the negative side, the average waiting time under the FCFS policy is often quite long. Consider the following set of processes that arrive at time 0, with the length of the CPU burst given in milliseconds:

Process   Burst Time
P1        24
P2        3
P3        3

If the processes arrive in the order P1, P2, P3, and are served in FCFS order, we get the result shown in the following Gantt chart, which is a bar chart that illustrates a particular schedule, including the start and finish times of each of the participating processes:

| P1                                | P2   | P3   |
0                                   24     27     30

The waiting time is 0 milliseconds for process P1, 24 milliseconds for process P2, and 27 milliseconds for process P3. Thus, the average waiting time is (0 + 24 + 27)/3 = 17 milliseconds.
If the processes arrive in the order P2, P3, P1, however, the results will be as shown in the following Gantt chart:

| P2   | P3   | P1                                |
0      3      6                                   30

The average waiting time is now (6 + 0 + 3)/3 = 3 milliseconds. This reduction is substantial. Thus, the average waiting time under an FCFS policy is generally not minimal and may vary substantially if the processes' CPU burst times vary greatly.

In addition, consider the performance of FCFS scheduling in a dynamic situation. Assume we have one CPU-bound process and many I/O-bound processes. As the processes flow around the system, the following scenario may result. The CPU-bound process will get and hold the CPU. During this time, all the other processes will finish their I/O and will move into the ready queue, waiting for the CPU. While the processes wait in the ready queue, the I/O devices are idle. Eventually, the CPU-bound process finishes its CPU burst and moves to an I/O device. All the I/O-bound processes, which have short CPU bursts, execute quickly and move back to the I/O queues. At this point, the CPU sits idle. The CPU-bound process will then move back to the ready queue and be allocated the CPU. Again, all the I/O processes end up waiting in the ready queue until the CPU-bound process is done. There is a convoy effect as all the other processes wait for the one big process to get off the CPU. This effect results in lower CPU and device utilization than might be possible if the shorter processes were allowed to go first.

Note also that the FCFS scheduling algorithm is nonpreemptive. Once the CPU has been allocated to a process, that process keeps the CPU until it releases the CPU, either by terminating or by requesting I/O. The FCFS algorithm is thus particularly troublesome for time-sharing systems, where it is important that each user get a share of the CPU at regular intervals. It would be disastrous to allow one process to keep the CPU for an extended period.

5.3.2 Shortest-Job-First Scheduling

A different approach to CPU scheduling is the shortest-job-first (SJF) scheduling algorithm. This algorithm associates with each process the length of the process's next CPU burst. When the CPU is available, it is assigned to the process that has the smallest next CPU burst. If the next CPU bursts of two processes are the same, FCFS scheduling is used to break the tie. Note that a more appropriate term for this scheduling method would be the shortest-next-CPU-burst algorithm, because scheduling depends on the length of the next CPU burst of a process, rather than its total length. We use the term SJF because it is the term most commonly used to refer to this type of scheduling. As an example of SJF scheduling, consider the following set of processes, with the length of the CPU burst given in milliseconds:

Process   Burst Time
P1        6
P2        8
P3        7
P4        3

Using SJF scheduling, we would schedule these processes according to the following Gantt chart:

| P4   | P1        | P3          | P2           |
0      3           9             16             24

The waiting time is 3 milliseconds for process P1, 16 milliseconds for process P2, 9 milliseconds for process P3, and 0 milliseconds for process P4. Thus, the average waiting time is (3 + 16 + 9 + 0)/4 = 7 milliseconds. By comparison, if we were using the FCFS scheduling scheme, the average waiting time would be 10.25 milliseconds. The SJF scheduling algorithm is provably optimal, in that it gives the minimum average waiting time for a given set of processes.
Moving a short process before a long one decreases the waiting time of the short process more than it increases the waiting time of the long process. Consequently, the average waiting time decreases. The real difficulty with the SJF algorithm is knowing the length of the next CPU request. For long-term (job) scheduling in a batch system, we can use as the length the process time limit that a user specifies when he submits the job. Thus, users are motivated to estimate the process time limit accurately, since a lower value may mean faster response. (Too low a value will cause a time-limit-exceeded error and require resubmission.) SJF scheduling is used frequently in long-term scheduling.

Unfortunately, the SJF algorithm cannot be implemented at the level of short-term CPU scheduling. With short-term scheduling, there is no way to know the length of the next CPU burst. One approach is to try to approximate SJF scheduling. We may not know the length of the next CPU burst, but we may be able to predict its value. We expect that the next CPU burst will be similar in length to the previous ones. By computing an approximation of the length of the next CPU burst, we can pick the process with the shortest predicted CPU burst.

The next CPU burst is generally predicted as an exponential average of the measured lengths of previous CPU bursts. We can define the exponential average with the following formula. Let $t_n$ be the length of the $n$th CPU burst, and let $\tau_{n+1}$ be our predicted value for the next CPU burst. Then, for $\alpha$, $0 \le \alpha \le 1$, define

$$\tau_{n+1} = \alpha t_n + (1 - \alpha)\tau_n.$$

The value of $t_n$ contains our most recent information; $\tau_n$ stores the past history. The parameter $\alpha$ controls the relative weight of recent and past history in our prediction. If $\alpha = 0$, then $\tau_{n+1} = \tau_n$, and recent history has no effect (current conditions are assumed to be transient). If $\alpha = 1$, then $\tau_{n+1} = t_n$, and only the most recent CPU burst matters (history is assumed to be old and irrelevant). More commonly, $\alpha = 1/2$, so recent history and past history are equally weighted. The initial $\tau_0$ can be defined as a constant or as an overall system average. Figure 5.3 shows an exponential average with $\alpha = 1/2$ and $\tau_0 = 10$.

Figure 5.3 Predicting the length of the next CPU burst: measured CPU bursts ($t_i$) of 6, 4, 6, 4, 13, 13, 13 yield successive predictions ($\tau_i$) of 10, 8, 6, 6, 5, 9, 11, 12.

To understand the behavior of the exponential average, we can expand the formula for $\tau_{n+1}$ by substituting for $\tau_n$, to find

$$\tau_{n+1} = \alpha t_n + (1 - \alpha)\alpha t_{n-1} + \cdots + (1 - \alpha)^j \alpha t_{n-j} + \cdots + (1 - \alpha)^{n+1}\tau_0.$$

Since both $\alpha$ and $(1 - \alpha)$ are less than or equal to 1, each successive term has less weight than its predecessor.

The SJF algorithm can be either preemptive or nonpreemptive. The choice arises when a new process arrives at the ready queue while a previous process is still executing. The next CPU burst of the newly arrived process may be shorter than what is left of the currently executing process. A preemptive SJF algorithm will preempt the currently executing process, whereas a nonpreemptive SJF algorithm will allow the currently running process to finish its CPU burst. Preemptive SJF scheduling is sometimes called shortest-remaining-time-first scheduling.
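Returning briefly to the exponential average above: the prediction is cheap to maintain, one multiply-add per completed burst. Here is a minimal sketch in C that replays the values from Figure 5.3; the function name and driver are illustrative, not from the text:

#include <stdio.h>

/* tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
double predict_next_burst(double alpha, double tau, double t) {
    return alpha * t + (1.0 - alpha) * tau;
}

int main(void) {
    double tau = 10.0;                           /* initial guess, tau_0 */
    double bursts[] = {6, 4, 6, 4, 13, 13, 13};  /* measured bursts t_i  */

    for (int i = 0; i < 7; i++) {
        tau = predict_next_burst(0.5, tau, bursts[i]);
        printf("tau_%d = %.0f\n", i + 1, tau);   /* 8, 6, 6, 5, 9, 11, 12 */
    }
    return 0;
}

With alpha = 1/2 and tau_0 = 10, the program reproduces exactly the prediction sequence plotted in Figure 5.3.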
As an example, consider the following four processes, with the length of the CPU burst given in milliseconds:

Process   Arrival Time   Burst Time
P1        0              8
P2        1              4
P3        2              9
P4        3              5

If the processes arrive at the ready queue at the times shown and need the indicated burst times, then the resulting preemptive SJF schedule is as depicted in the following Gantt chart:

| P1 | P2     | P4      | P1         | P3            |
0    1        5         10           17              26

Process P1 is started at time 0, since it is the only process in the queue. Process P2 arrives at time 1. The remaining time for process P1 (7 milliseconds) is larger than the time required by process P2 (4 milliseconds), so process P1 is preempted, and process P2 is scheduled. The average waiting time for this example is [(10 − 1) + (1 − 1) + (17 − 2) + (5 − 3)]/4 = 26/4 = 6.5 milliseconds. Nonpreemptive SJF scheduling would result in an average waiting time of 7.75 milliseconds.

5.3.3 Priority Scheduling

The SJF algorithm is a special case of the general priority scheduling algorithm. A priority is associated with each process, and the CPU is allocated to the process with the highest priority. Equal-priority processes are scheduled in FCFS order. An SJF algorithm is simply a priority algorithm where the priority (p) is the inverse of the next CPU burst. The larger the CPU burst, the lower the priority, and vice versa. Note that we discuss scheduling in terms of high priority and low priority. Priorities are generally indicated by some fixed range of numbers, such as 0 to 7 or 0 to 4,095. However, there is no general agreement on whether 0 is the highest or lowest priority. Some systems use low numbers to represent low priority; others use low numbers for high priority. This difference can lead to confusion. In this text, we assume that low numbers represent high priority. As an example, consider the following set of processes, assumed to have arrived at time 0 in the order P1, P2, ···, P5, with the length of the CPU burst given in milliseconds:

Process   Burst Time   Priority
P1        10           3
P2        1            1
P3        2            4
P4        1            5
P5        5            2

Using priority scheduling, we would schedule these processes according to the following Gantt chart:

| P2 | P5      | P1              | P3   | P4 |
0    1         6                 16     18   19

The average waiting time is 8.2 milliseconds. Priorities can be defined either internally or externally. Internally defined priorities use some measurable quantity or quantities to compute the priority of a process. For example, time limits, memory requirements, the number of open files, and the ratio of average I/O burst to average CPU burst have been used in computing priorities. External priorities are set by criteria outside the operating system, such as the importance of the process, the type and amount of funds being paid for computer use, the department sponsoring the work, and other, often political, factors.

Priority scheduling can be either preemptive or nonpreemptive. When a process arrives at the ready queue, its priority is compared with the priority of the currently running process. A preemptive priority scheduling algorithm will preempt the CPU if the priority of the newly arrived process is higher than the priority of the currently running process. A nonpreemptive priority scheduling algorithm will simply put the new process at the head of the ready queue. A major problem with priority scheduling algorithms is indefinite blocking, or starvation. A process that is ready to run but waiting for the CPU can be considered blocked. A priority scheduling algorithm can leave some low-priority processes waiting indefinitely.
In a heavily loaded computer system, a steady stream of higher-priority processes can prevent a low-priority process from ever getting the CPU. Generally, one of two things will happen. Either the process will eventually be run (at 2 A.M. Sunday, when the system is finally lightly loaded), or the computer system will eventually crash and lose all unfinished low-priority processes. (Rumor has it that, when they shut down the IBM 7094 at MIT in 1973, they found a low-priority process that had been submitted in 1967 and had not yet been run.)

A solution to the problem of indefinite blockage of low-priority processes is aging. Aging is a technique of gradually increasing the priority of processes that wait in the system for a long time. For example, if priorities range from 127 (low) to 0 (high), we could increase the priority of a waiting process by 1 every 15 minutes. Eventually, even a process with an initial priority of 127 would have the highest priority in the system and would be executed. In fact, it would take no more than 32 hours for a priority-127 process to age to a priority-0 process.

5.3.4 Round-Robin Scheduling

The round-robin (RR) scheduling algorithm is designed especially for time-sharing systems. It is similar to FCFS scheduling, but preemption is added to enable the system to switch between processes. A small unit of time, called a time quantum or time slice, is defined. A time quantum is generally from 10 to 100 milliseconds in length. The ready queue is treated as a circular queue. The CPU scheduler goes around the ready queue, allocating the CPU to each process for a time interval of up to 1 time quantum.

To implement RR scheduling, we keep the ready queue as a FIFO queue of processes. New processes are added to the tail of the ready queue. The CPU scheduler picks the first process from the ready queue, sets a timer to interrupt after 1 time quantum, and dispatches the process. One of two things will then happen. The process may have a CPU burst of less than 1 time quantum. In this case, the process itself will release the CPU voluntarily. The scheduler will then proceed to the next process in the ready queue. Otherwise, if the CPU burst of the currently running process is longer than 1 time quantum, the timer will go off and will cause an interrupt to the operating system. A context switch will be executed, and the process will be put at the tail of the ready queue. The CPU scheduler will then select the next process in the ready queue.

The average waiting time under the RR policy is often long. Consider the following set of processes that arrive at time 0, with the length of the CPU burst given in milliseconds:

Process   Burst Time
P1        24
P2        3
P3        3

If we use a time quantum of 4 milliseconds, then process P1 gets the first 4 milliseconds. Since it requires another 20 milliseconds, it is preempted after the first time quantum, and the CPU is given to the next process in the queue, process P2. Process P2 does not need 4 milliseconds, so it quits before its time quantum expires. The CPU is then given to the next process, process P3. Once each process has received 1 time quantum, the CPU is returned to process P1 for an additional time quantum. The resulting RR schedule is as follows:

| P1   | P2  | P3  | P1   | P1   | P1   | P1   | P1   |
0      4     7     10     14     18     22     26     30

Let's calculate the average waiting time for the above schedule. P1 waits for 6 milliseconds (10 − 4), P2 waits for 4 milliseconds, and P3 waits for 7 milliseconds. Thus, the average waiting time is 17/3 = 5.66 milliseconds.
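That arithmetic is easy to check mechanically. The following is a minimal C sketch, assuming all three processes arrive at time 0 with the burst times and 4-millisecond quantum of the example above; under that assumption, the cyclic scan over the array visits processes in the same order as the FIFO ready queue:

#include <stdio.h>

#define N 3
#define QUANTUM 4

int main(void) {
    int burst[N] = {24, 3, 3};   /* remaining CPU bursts of P1, P2, P3 */
    int wait[N]  = {0, 0, 0};    /* waiting time accumulated per process */
    int remaining = N;

    /* Cycle through the processes, giving each up to one quantum. */
    while (remaining > 0) {
        for (int i = 0; i < N; i++) {
            if (burst[i] == 0)
                continue;
            int slice = (burst[i] < QUANTUM) ? burst[i] : QUANTUM;
            for (int j = 0; j < N; j++)        /* every other unfinished  */
                if (j != i && burst[j] > 0)    /* process waits the slice */
                    wait[j] += slice;
            burst[i] -= slice;
            if (burst[i] == 0)
                remaining--;
        }
    }

    int total = 0;
    for (int i = 0; i < N; i++) {
        printf("P%d waits %d ms\n", i + 1, wait[i]);
        total += wait[i];
    }
    printf("average waiting time = %.2f ms\n", (double) total / N);
    return 0;
}

Running it prints waiting times of 6, 4, and 7 milliseconds and an average of 5.67, matching the Gantt chart.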
In the RR scheduling algorithm, no process is allocated the CPU for more than 1 time quantum in a row (unless it is the only runnable process). If a process's CPU burst exceeds 1 time quantum, that process is preempted and is put back in the ready queue. The RR scheduling algorithm is thus preemptive. If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units. Each process must wait no longer than (n − 1) × q time units until its next time quantum. For example, with five processes and a time quantum of 20 milliseconds, each process will get up to 20 milliseconds every 100 milliseconds.

The performance of the RR algorithm depends heavily on the size of the time quantum. At one extreme, if the time quantum is extremely large, the RR policy is the same as the FCFS policy. In contrast, if the time quantum is extremely small (say, 1 millisecond), the RR approach is called processor sharing and (in theory) creates the appearance that each of n processes has its own processor running at 1/n the speed of the real processor. This approach was used in Control Data Corporation (CDC) hardware to implement ten peripheral processors with only one set of hardware and ten sets of registers. The hardware executes one instruction for one set of registers, then goes on to the next. This cycle continues, resulting in ten slow processors rather than one fast one. (Actually, since the processor was much faster than memory and each instruction referenced memory, the processors were not much slower than ten real processors would have been.)

In software, we also need to consider the effect of context switching on the performance of RR scheduling. Assume, for example, that we have only one process of 10 time units. If the quantum is 12 time units, the process finishes in less than 1 time quantum, with no overhead. If the quantum is 6 time units, however, the process requires 2 quanta, resulting in a context switch. If the time quantum is 1 time unit, then nine context switches will occur, slowing the execution of the process accordingly (Figure 5.4).

Figure 5.4 How a smaller time quantum increases context switches: a process of 10 time units incurs 0 context switches with a quantum of 12, 1 context switch with a quantum of 6, and 9 context switches with a quantum of 1.

Thus, we want the time quantum to be large with respect to the context-switch time. If the context-switch time is approximately 10 percent of the time quantum, then about 10 percent of the CPU time will be spent in context switching. In practice, most modern systems have time quanta ranging from 10 to 100 milliseconds. The time required for a context switch is typically less than 10 microseconds; thus, the context-switch time is a small fraction of the time quantum.

Figure 5.5 How turnaround time varies with the time quantum: average turnaround time for four processes with burst times of 6, 3, 1, and 7 time units, plotted for quanta from 1 to 7.

Turnaround time also depends on the size of the time quantum. As we can see from Figure 5.5, the average turnaround time of a set of processes does not necessarily improve as the time-quantum size increases. In general, the average turnaround time can be improved if most processes finish their next CPU burst in a single time quantum. For example, given three processes of 10 time units each and a quantum of 1 time unit, the average turnaround time is 29. If the time quantum is 10, however, the average turnaround time drops to 20.
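Those two averages are quick to verify by hand, assuming all three processes arrive at time 0 and context-switch time is ignored. With a quantum of 10, the processes run back to back and finish at times 10, 20, and 30; with a quantum of 1, they interleave one unit at a time and finish at times 28, 29, and 30:

$$q = 10:\ \frac{10 + 20 + 30}{3} = 20, \qquad q = 1:\ \frac{28 + 29 + 30}{3} = 29.$$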
If context-switch time is added in, the average turnaround time increases even more for a smaller time quantum, since more context switches are required. Although the time quantum should be large compared with the context-switch time, it should not be too large. If the time quantum is too large, as mentioned earlier, RR scheduling degenerates to an FCFS policy. A rule of thumb is that 80 percent of the CPU bursts should be shorter than the time quantum.

5.3.5 Multilevel Queue Scheduling

Another class of scheduling algorithms has been created for situations in which processes are easily classified into different groups. For example, a common division is made between foreground (interactive) processes and background (batch) processes. These two types of processes have different response-time requirements and so may have different scheduling needs. In addition, foreground processes may have priority (externally defined) over background processes. A multilevel queue scheduling algorithm partitions the ready queue into several separate queues (Figure 5.6). The processes are permanently assigned to one queue, generally based on some property of the process, such as memory size, process priority, or process type. Each queue has its own scheduling algorithm. For example, separate queues might be used for foreground and background processes. The foreground queue might be scheduled by an RR algorithm, while the background queue is scheduled by an FCFS algorithm.

Figure 5.6 Multilevel queue scheduling: from highest to lowest priority, the queues hold system processes, interactive processes, interactive editing processes, batch processes, and student processes.

In addition, there must be scheduling among the queues, which is commonly implemented as fixed-priority preemptive scheduling. For example, the foreground queue may have absolute priority over the background queue. Let's look at an example of a multilevel queue scheduling algorithm with five queues, listed below in order of priority:

1. System processes
2. Interactive processes
3. Interactive editing processes
4. Batch processes
5. Student processes

Each queue has absolute priority over lower-priority queues. No process in the batch queue, for example, could run unless the queues for system processes, interactive processes, and interactive editing processes were all empty. If an interactive editing process entered the ready queue while a batch process was running, the batch process would be preempted. Another possibility is to time-slice among the queues. Here, each queue gets a certain portion of the CPU time, which it can then schedule among its various processes. For instance, in the foreground–background queue example, the foreground queue can be given 80 percent of the CPU time for RR scheduling among its processes, whereas the background queue receives 20 percent of the CPU to give to its processes on an FCFS basis.

5.3.6 Multilevel Feedback Queue Scheduling

Normally, when the multilevel queue scheduling algorithm is used, processes are permanently assigned to a queue when they enter the system. If there are separate queues for foreground and background processes, for example, processes do not move from one queue to the other, since processes do not change their foreground or background nature. This setup has the advantage of low scheduling overhead, but it is inflexible. The multilevel feedback queue scheduling algorithm, in contrast, allows a process to move between queues.
The idea is to separate processes according to the characteristics of their CPU bursts. If a process uses too much CPU time, it will be moved to a lower-priority queue. This scheme leaves I/O-bound and interactive processes in the higher-priority queues. In addition, a process that waits too long in a lower-priority queue may be moved to a higher-priority queue. This form of aging prevents starvation. For example, consider a multilevel feedback queue scheduler with three queues, numbered from 0 to 2 (Figure 5.7). The scheduler first executes all processes in queue 0. Only when queue 0 is empty will it execute processes in queue 1. Similarly, processes in queue 2 will be executed only if queues 0 and 1 are empty. A process that arrives for queue 1 will preempt a process in queue 2. A process in queue 1 will in turn be preempted by a process arriving for queue 0.

Figure 5.7 Multilevel feedback queues: queue 0 (quantum = 8), queue 1 (quantum = 16), and queue 2 (FCFS).

A process entering the ready queue is put in queue 0. A process in queue 0 is given a time quantum of 8 milliseconds. If it does not finish within this time, it is moved to the tail of queue 1. If queue 0 is empty, the process at the head of queue 1 is given a quantum of 16 milliseconds. If it does not complete, it is preempted and is put into queue 2. Processes in queue 2 are run on an FCFS basis but are run only when queues 0 and 1 are empty. This scheduling algorithm gives highest priority to any process with a CPU burst of 8 milliseconds or less. Such a process will quickly get the CPU, finish its CPU burst, and go off to its next I/O burst. Processes that need more than 8 but less than 24 milliseconds are also served quickly, although with lower priority than shorter processes. Long processes automatically sink to queue 2 and are served in FCFS order with any CPU cycles left over from queues 0 and 1.

In general, a multilevel feedback queue scheduler is defined by the following parameters:

• The number of queues
• The scheduling algorithm for each queue
• The method used to determine when to upgrade a process to a higher-priority queue
• The method used to determine when to demote a process to a lower-priority queue
• The method used to determine which queue a process will enter when that process needs service

The definition of a multilevel feedback queue scheduler makes it the most general CPU-scheduling algorithm. It can be configured to match a specific system under design. Unfortunately, it is also the most complex algorithm, since defining the best scheduler requires some means by which to select values for all the parameters.

5.4 Thread Scheduling

In Chapter 4, we introduced threads to the process model, distinguishing between user-level and kernel-level threads. On operating systems that support them, it is kernel-level threads—not processes—that are being scheduled by the operating system. User-level threads are managed by a thread library, and the kernel is unaware of them. To run on a CPU, user-level threads must ultimately be mapped to an associated kernel-level thread, although this mapping may be indirect and may use a lightweight process (LWP). In this section, we explore scheduling issues involving user-level and kernel-level threads and offer specific examples of scheduling for Pthreads.

5.4.1 Contention Scope

One distinction between user-level and kernel-level threads lies in how they are scheduled.
5.4 Thread Scheduling

In Chapter 4, we introduced threads to the process model, distinguishing between user-level and kernel-level threads. On operating systems that support them, it is kernel-level threads—not processes—that are being scheduled by the operating system. User-level threads are managed by a thread library, and the kernel is unaware of them. To run on a CPU, user-level threads must ultimately be mapped to an associated kernel-level thread, although this mapping may be indirect and may use a lightweight process (LWP). In this section, we explore scheduling issues involving user-level and kernel-level threads and offer specific examples of scheduling for Pthreads.

5.4.1 Contention Scope

One distinction between user-level and kernel-level threads lies in how they are scheduled. On systems implementing the many-to-one (Section 4.2.1) and many-to-many (Section 4.2.3) models, the thread library schedules user-level threads to run on an available LWP, a scheme known as process-contention scope (PCS), since competition for the CPU takes place among threads belonging to the same process. When we say the thread library schedules user threads onto available LWPs, we do not mean that the thread is actually running on a CPU; this would require the operating system to schedule the kernel thread onto a physical CPU. To decide which kernel thread to schedule onto a CPU, the kernel uses system-contention scope (SCS). Competition for the CPU with SCS scheduling takes place among all threads in the system. Systems using the one-to-one model (Section 4.2.2), such as Windows XP, Solaris, and Linux, schedule threads using only SCS.

Typically, PCS is done according to priority—the scheduler selects the runnable thread with the highest priority to run. User-level thread priorities are set by the programmer and are not adjusted by the thread library, although some thread libraries may allow the programmer to change the priority of a thread. It is important to note that PCS will typically preempt the thread currently running in favor of a higher-priority thread; however, there is no guarantee of time slicing (Section 5.3.4) among threads of equal priority.

5.4.2 Pthread Scheduling

We provided a sample POSIX Pthread program in Section 4.3.1, along with an introduction to thread creation with Pthreads. Now, we highlight the POSIX Pthread API that allows specifying either PCS or SCS during thread creation. Pthreads identifies the following contention scope values:

• PTHREAD_SCOPE_PROCESS schedules threads using PCS scheduling.
• PTHREAD_SCOPE_SYSTEM schedules threads using SCS scheduling.

On systems implementing the many-to-many model, the PTHREAD_SCOPE_PROCESS policy schedules user-level threads onto available LWPs. The number of LWPs is maintained by the thread library, perhaps using scheduler activations (Section 4.5.6). The PTHREAD_SCOPE_SYSTEM scheduling policy will create and bind an LWP for each user-level thread on many-to-many systems, effectively mapping threads using the one-to-one policy.

The Pthread API provides two functions for getting—and setting—the contention scope policy:

• pthread_attr_setscope(pthread_attr_t *attr, int scope)
• pthread_attr_getscope(pthread_attr_t *attr, int *scope)

The first parameter for both functions contains a pointer to the attribute set for the thread. The second parameter for the pthread_attr_setscope() function is passed either the PTHREAD_SCOPE_SYSTEM or the PTHREAD_SCOPE_PROCESS value, indicating how the contention scope is to be set. In the case of pthread_attr_getscope(), this second parameter contains a pointer to an int value that is set to the current value of the contention scope. If an error occurs, each of these functions returns a non-zero value.

In Figure 5.8, we illustrate a Pthread scheduling API. The program first determines the existing contention scope and then sets the scope to PTHREAD_SCOPE_SYSTEM. It then creates five separate threads that will run using the SCS scheduling policy. Note that on some systems, only certain contention scope values are allowed. For example, Linux and Mac OS X systems allow only PTHREAD_SCOPE_SYSTEM.
#include <pthread.h>
#include <stdio.h>
#define NUM_THREADS 5

void *runner(void *param); /* prototype for the thread function */

int main(int argc, char *argv[])
{
   int i, scope;
   pthread_t tid[NUM_THREADS];
   pthread_attr_t attr;

   /* get the default attributes */
   pthread_attr_init(&attr);

   /* first inquire on the current scope */
   if (pthread_attr_getscope(&attr, &scope) != 0)
      fprintf(stderr, "Unable to get scheduling scope\n");
   else {
      if (scope == PTHREAD_SCOPE_PROCESS)
         printf("PTHREAD_SCOPE_PROCESS");
      else if (scope == PTHREAD_SCOPE_SYSTEM)
         printf("PTHREAD_SCOPE_SYSTEM");
      else
         fprintf(stderr, "Illegal scope value.\n");
   }

   /* set the scheduling algorithm to PCS or SCS */
   pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

   /* create the threads */
   for (i = 0; i < NUM_THREADS; i++)
      pthread_create(&tid[i], &attr, runner, NULL);

   /* now join on each thread */
   for (i = 0; i < NUM_THREADS; i++)
      pthread_join(tid[i], NULL);
}

/* Each thread will begin control in this function */
void *runner(void *param)
{
   /* do some work ... */
   pthread_exit(0);
}

Figure 5.8 Pthread scheduling API.

5.5 Multiple-Processor Scheduling

Our discussion thus far has focused on the problems of scheduling the CPU in a system with a single processor. If multiple CPUs are available, load sharing becomes possible; however, the scheduling problem becomes correspondingly more complex. Many possibilities have been tried; and as we saw with single-processor CPU scheduling, there is no one best solution. Here, we discuss several concerns in multiprocessor scheduling. We concentrate on systems in which the processors are identical—homogeneous—in terms of their functionality; we can then use any available processor to run any process in the queue. (Note, however, that even with homogeneous multiprocessors, there are sometimes limitations on scheduling. Consider a system with an I/O device attached to a private bus of one processor. Processes that wish to use that device must be scheduled to run on that processor.)

5.5.1 Approaches to Multiple-Processor Scheduling

One approach to CPU scheduling in a multiprocessor system has all scheduling decisions, I/O processing, and other system activities handled by a single processor—the master server. The other processors execute only user code. This asymmetric multiprocessing is simple because only one processor accesses the system data structures, reducing the need for data sharing.

A second approach uses symmetric multiprocessing (SMP), where each processor is self-scheduling. All processes may be in a common ready queue, or each processor may have its own private queue of ready processes. Regardless, scheduling proceeds by having the scheduler for each processor examine the ready queue and select a process to execute. As we shall see in Chapter 6, if we have multiple processors trying to access and update a common data structure, the scheduler must be programmed carefully. We must ensure that two processors do not choose the same process and that processes are not lost from the queue. Virtually all modern operating systems support SMP, including Windows XP, Windows 2000, Solaris, Linux, and Mac OS X. In the remainder of this section, we discuss issues concerning SMP systems.

5.5.2 Processor Affinity

Consider what happens to cache memory when a process has been running on a specific processor. The data most recently accessed by the process populate the cache for the processor; as a result, successive memory accesses by the process are often satisfied in cache memory.
Now consider what happens if the process migrates to another processor. The contents of cache memory must be invalidated for the first processor, and the cache for the second processor must be repopulated. Because of the high cost of invalidating and repopulating caches, most SMP systems try to avoid migration of processes from one processor to another and instead attempt to keep a process running on the same processor. This is known as processor affinity—that is, a process has an affinity for the processor on which it is currently running.

Processor affinity takes several forms. When an operating system has a policy of attempting to keep a process running on the same processor—but not guaranteeing that it will do so—we have a situation known as soft affinity. Here, it is possible for a process to migrate between processors. Some systems—such as Linux—also provide system calls that support hard affinity, thereby allowing a process to specify that it is not to migrate to other processors. Solaris allows processes to be assigned to processor sets, limiting which processes can run on which CPUs. It also implements soft affinity.

The main-memory architecture of a system can affect processor affinity issues. Figure 5.9 illustrates an architecture featuring non-uniform memory access (NUMA), in which a CPU has faster access to some parts of main memory than to other parts. Typically, this occurs in systems containing combined CPU and memory boards. The CPUs on a board can access the memory on that board with less delay than they can access memory on other boards in the system.

Figure 5.9 NUMA and CPU scheduling: each CPU has fast access to its own local memory and slower access to memory elsewhere in the computer.

If the operating system's CPU scheduler and memory-placement algorithms work together, then a process that is assigned affinity to a particular CPU can be allocated memory on the board where that CPU resides. This example also shows that operating systems are frequently not as cleanly defined and implemented as described in operating-system textbooks. Rather, the "solid lines" between sections of an operating system are frequently only "dotted lines," with algorithms creating connections in ways aimed at optimizing performance and reliability.

5.5.3 Load Balancing

On SMP systems, it is important to keep the workload balanced among all processors to fully utilize the benefits of having more than one processor. Otherwise, one or more processors may sit idle while other processors have high workloads, along with lists of processes awaiting the CPU. Load balancing attempts to keep the workload evenly distributed across all processors in an SMP system. It is important to note that load balancing is typically only necessary on systems where each processor has its own private queue of eligible processes to execute. On systems with a common run queue, load balancing is often unnecessary, because once a processor becomes idle, it immediately extracts a runnable process from the common run queue. It is also important to note, however, that in most contemporary operating systems supporting SMP, each processor does have a private queue of eligible processes.

There are two general approaches to load balancing: push migration and pull migration. With push migration, a specific task periodically checks the load on each processor and—if it finds an imbalance—evenly distributes the load by moving (or pushing) processes from overloaded to idle or less-busy processors. Pull migration occurs when an idle processor pulls a waiting task from a busy processor. Push migration and pull migration need not be mutually exclusive and are in fact often implemented in parallel on load-balancing systems. For example, the Linux scheduler (described in Section 5.6.3) and the ULE scheduler available for FreeBSD systems implement both techniques. Linux runs its load-balancing algorithm every 200 milliseconds (push migration) or whenever the run queue for a processor is empty (pull migration).
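A push-migration pass can be sketched in Java as a periodic check that moves work from the busiest per-processor queue to the least busy one. This is an illustrative sketch only; real schedulers balance weighted loads rather than raw queue lengths, and all class and method names here are hypothetical.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch of push migration: periodically inspect the per-processor
// ready queues and push tasks from the most loaded queue to the
// least loaded one.
public class PushMigration {
   private final List<Deque<Runnable>> readyQueues;

   PushMigration(List<Deque<Runnable>> readyQueues) {
      this.readyQueues = readyQueues;
   }

   // One balancing pass, analogous to the check that Linux runs
   // every 200 milliseconds.
   synchronized void balanceOnce() {
      Deque<Runnable> busiest = readyQueues.get(0);
      Deque<Runnable> idlest = readyQueues.get(0);
      for (Deque<Runnable> q : readyQueues) {
         if (q.size() > busiest.size()) busiest = q;
         if (q.size() < idlest.size()) idlest = q;
      }
      // Push tasks until the two queues are roughly even.
      while (busiest.size() > idlest.size() + 1)
         idlest.addLast(busiest.pollFirst());
   }

   public static void main(String[] args) {
      List<Deque<Runnable>> queues =
         List.of(new ArrayDeque<>(), new ArrayDeque<>());
      for (int i = 0; i < 6; i++)     // all work lands on processor 0
         queues.get(0).addLast(() -> { });
      PushMigration balancer = new PushMigration(queues);
      balancer.balanceOnce();
      System.out.println(queues.get(0).size() + " / "
                         + queues.get(1).size()); // prints 3 / 3
   }
}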
Interestingly, load balancing often counteracts the benefits of processor affinity, discussed in Section 5.5.2. That is, the benefit of keeping a process running on the same processor is that the process can take advantage of its data being in that processor's cache memory. Either pulling or pushing a process from one processor to another invalidates this benefit. As is often the case in systems engineering, there is no absolute rule concerning what policy is best. Thus, in some systems, an idle processor always pulls a process from a non-idle processor; and in other systems, processes are moved only if the imbalance exceeds a certain threshold.

5.5.4 Multicore Processors

Traditionally, SMP systems have allowed several threads to run concurrently by providing multiple physical processors. However, a recent trend in computer hardware has been to place multiple processor cores on the same physical chip, resulting in a multicore processor. Each core has a register set to maintain its architectural state and thus appears to the operating system to be a separate physical processor. SMP systems that use multicore processors are faster and consume less power than systems in which each processor has its own physical chip.

Multicore processors may complicate scheduling issues, however. Let's consider how this can happen. Researchers have discovered that when a processor accesses memory, it spends a significant amount of time waiting for the data to become available. This situation, known as a memory stall, may occur for various reasons, such as a cache miss (accessing data that are not in cache memory). Figure 5.10 illustrates a memory stall. In this scenario, the processor can spend up to 50 percent of its time waiting for data to become available from memory.

Figure 5.10 Memory stall: a single thread alternates compute cycles (C) with memory stall cycles (M), leaving the core idle during each stall.

To remedy this situation, many recent hardware designs have implemented multithreaded processor cores in which two (or more) hardware threads are assigned to each core. That way, if one thread stalls while waiting for memory, the core can switch to another thread.

Figure 5.11 Multithreaded multicore system: the compute cycles (C) of thread0 overlap the memory stall cycles (M) of thread1, and vice versa, so the core stays busy.

Figure 5.11 illustrates a dual-threaded processor core on which the execution of thread0 and the execution of thread1 are interleaved. From an operating-system perspective, each hardware thread appears as a logical processor that is available to run a software thread. Thus, on a dual-threaded, dual-core system, four logical processors are presented to the operating system. The UltraSPARC T1 CPU has eight cores per chip and four hardware threads per core; from the perspective of the operating system, there appear to be 32 logical processors.
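In Java, the number of logical processors the operating system presents can be queried directly; a brief illustration (the printed value will of course vary with the hardware):

public class LogicalProcessors {
   public static void main(String[] args) {
      // On a dual-threaded, dual-core system this typically prints 4;
      // on an UltraSPARC T1 (eight cores, four hardware threads per
      // core) it would report 32 logical processors.
      int n = Runtime.getRuntime().availableProcessors();
      System.out.println("Logical processors: " + n);
   }
}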
In general, there are two ways to multithread a processor: coarse-grained and fine-grained multithreading. With coarse-grained multithreading, a thread executes on a processor until a long-latency event such as a memory stall occurs. Because of the delay caused by this event, the processor must switch to another thread to begin execution. However, the cost of switching between threads is high, as the instruction pipeline must be flushed before the other thread can begin execution on the processor core. Once this new thread begins execution, it begins filling the pipeline with its instructions. Fine-grained (or interleaved) multithreading switches between threads at a much finer level of granularity—typically at the boundary of an instruction cycle. However, the architectural design of fine-grained systems includes logic for thread switching. As a result, the cost of switching between threads is small.

Notice that a multithreaded multicore processor actually requires two different levels of scheduling. On one level are the scheduling decisions that must be made by the operating system as it chooses which software thread to run on each hardware thread (logical processor). For this level of scheduling, the operating system may choose any scheduling algorithm, such as one of those described in Section 5.3. A second level of scheduling specifies how each core decides which hardware thread to run. There are several strategies to adopt in this situation. The UltraSPARC T1, mentioned earlier, uses a simple round-robin algorithm to schedule the four hardware threads to each core. Another example, the Intel Itanium, is a dual-core processor with two hardware-managed threads per core. Assigned to each hardware thread is a dynamic urgency value ranging from 0 to 7, with 0 representing the lowest urgency and 7 the highest. The Itanium identifies five different events that may trigger a thread switch. When one of these events occurs, the thread-switching logic compares the urgency of the two threads and selects the thread with the higher urgency value to execute on the processor core.

5.5.5 Virtualization and Scheduling

A system with virtualization, even a single-CPU system, frequently acts like a multiprocessor system. The virtualization software presents one or more virtual CPUs to each of the virtual machines running on the system and then schedules the use of the physical CPUs among the virtual machines. The significant variations between virtualization technologies make it difficult to summarize the effect of virtualization on scheduling (see Section 2.8). In general, though, most virtualized environments have one host operating system and many guest operating systems. The host operating system creates and manages the virtual machines, and each virtual machine has a guest operating system installed and applications running within that guest. Each guest operating system may be fine-tuned for specific use cases, applications, and users, including time sharing or even real-time operation.

Any guest operating-system scheduling algorithm that assumes a certain amount of progress in a given amount of time will be negatively affected by virtualization. Consider a time-sharing operating system that tries to allot 100 milliseconds to each time slice to give users a reasonable response time. In a virtual machine, this operating system is at the mercy of the virtualization system as to what CPU resources it actually receives. A given 100-millisecond time slice may take much more than 100 milliseconds of virtual CPU time.
Depending on how busy the system is, the time slice may take a second or more, resulting in very poor response times for users logged into that virtual machine. The effect on a real-time operating system would be even more catastrophic. The net effect of such scheduling layering is that individual virtualized operating systems receive only a portion of the available CPU cycles, even though they believe they are receiving all of the cycles and indeed that they are scheduling all of those cycles. Commonly, the time-of-day clocks in virtual machines are incorrect because timers take longer to trigger than they would on dedicated CPUs. Virtualization can thus undo the good scheduling-algorithm efforts of the operating systems in virtual machines.

5.6 Operating System Examples

We turn next to a description of the scheduling policies of the Solaris, Windows XP, and Linux operating systems. It is important to remember that we are describing the scheduling of kernel threads with Solaris and Windows XP. Recall that Linux does not distinguish between processes and threads; thus, we use the term task when discussing the Linux scheduler.

5.6.1 Example: Solaris Scheduling

Solaris uses priority-based thread scheduling where each thread belongs to one of six classes:

1. Time sharing (TS)
2. Interactive (IA)
3. Real time (RT)
4. System (SYS)
5. Fair share (FSS)
6. Fixed priority (FP)

Within each class, there are different priorities and different scheduling algorithms. The default scheduling class for a process is time sharing. The scheduling policy for the time-sharing class dynamically alters priorities and assigns time slices of different lengths using a multilevel feedback queue. By default, there is an inverse relationship between priorities and time slices. The higher the priority, the smaller the time slice; the lower the priority, the larger the time slice. Interactive processes typically have a higher priority; CPU-bound processes, a lower priority. This scheduling policy gives good response time for interactive processes and good throughput for CPU-bound processes. The interactive class uses the same scheduling policy as the time-sharing class, but it gives windowing applications—such as those created by the KDE or GNOME window managers—a higher priority for better performance.

Figure 5.12 shows the dispatch table for scheduling time-sharing and interactive threads. These two scheduling classes include 60 priority levels, but for brevity, we display only a handful:

priority   time quantum   time quantum expired   return from sleep
   0           200                  0                   50
   5           200                  0                   50
  10           160                  0                   51
  15           160                  5                   51
  20           120                 10                   52
  25           120                 15                   52
  30            80                 20                   53
  35            80                 25                   54
  40            40                 30                   55
  45            40                 35                   56
  50            40                 40                   58
  55            40                 45                   58
  59            20                 49                   59

Figure 5.12 Solaris dispatch table for time-sharing and interactive threads.

The dispatch table shown in Figure 5.12 contains the following fields:

• Priority. The class-dependent priority for the time-sharing and interactive classes. A higher number indicates a higher priority.

• Time quantum. The time quantum for the associated priority. This illustrates the inverse relationship between priorities and time quanta: the lowest priority (priority 0) has the highest time quantum (200 milliseconds), and the highest priority (priority 59) has the lowest time quantum (20 milliseconds).

• Time quantum expired. The new priority of a thread that has used its entire time quantum without blocking. Such threads are considered CPU-intensive. As shown in the table, these threads have their priorities lowered.

• Return from sleep. The priority of a thread that is returning from sleeping (such as waiting for I/O). As the table illustrates, when I/O is available for a waiting thread, its priority is boosted to between 50 and 59. This practice supports the scheduling policy of providing good response time for interactive processes.
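The dispatch-table mechanics can be made concrete with a small lookup sketch in Java. The arrays transcribe the rows of Figure 5.12; the class and method names are hypothetical, and a real table would carry entries for all 60 priorities.

// Sketch of dispatch-table lookup for the Solaris time-sharing and
// interactive classes, using the rows shown in Figure 5.12.
public class DispatchTable {
   static final int[] PRIORITY =
      { 0,   5,  10,  15,  20,  25,  30,  35,  40,  45,  50,  55,  59};
   static final int[] QUANTUM_MS =
      {200, 200, 160, 160, 120, 120,  80,  80,  40,  40,  40,  40,  20};
   static final int[] EXPIRED_PRIORITY =
      { 0,   0,   0,   5,  10,  15,  20,  25,  30,  35,  40,  45,  49};
   static final int[] SLEEP_RETURN_PRIORITY =
      {50,  50,  51,  51,  52,  52,  53,  54,  55,  56,  58,  58,  59};

   // Find the table row governing a given priority.
   static int row(int priority) {
      for (int i = PRIORITY.length - 1; i >= 0; i--)
         if (PRIORITY[i] <= priority)
            return i;
      return 0;
   }

   // New (lower) priority after a thread uses its entire quantum.
   static int onQuantumExpired(int priority) {
      return EXPIRED_PRIORITY[row(priority)];
   }

   // New (boosted) priority after a thread returns from sleep.
   static int onReturnFromSleep(int priority) {
      return SLEEP_RETURN_PRIORITY[row(priority)];
   }

   public static void main(String[] args) {
      // A priority-35 thread that consumes its whole 80-ms quantum
      // drops to priority 25; if it instead blocks and later wakes,
      // it is boosted to priority 54.
      System.out.println(onQuantumExpired(35));  // 25
      System.out.println(onReturnFromSleep(35)); // 54
   }
}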
Threads in the real-time class are given the highest priority. This assignment allows a real-time process to have a guaranteed response from the system within a bounded period of time. A real-time process will run before a process in any other class. In general, however, few processes belong to the real-time class.

Solaris uses the system class to run kernel threads, such as the scheduler and paging daemon. Once established, the priority of a system thread does not change. The system class is reserved for kernel use (user processes running in kernel mode are not in the system class).

The fixed-priority and fair-share classes were introduced with Solaris 9. Threads in the fixed-priority class have the same priority range as those in the time-sharing class; however, their priorities are not dynamically adjusted. The fair-share scheduling class uses CPU shares instead of priorities to make scheduling decisions. CPU shares indicate entitlement to available CPU resources and are allocated to a set of processes (known as a project).

Each scheduling class includes a set of priorities. However, the scheduler converts the class-specific priorities into global priorities and selects the thread with the highest global priority to run. The selected thread runs on the CPU until it (1) blocks, (2) uses its time slice, or (3) is preempted by a higher-priority thread. If there are multiple threads with the same priority, the scheduler uses a round-robin queue. Figure 5.13 illustrates how the six scheduling classes relate to one another and how they map to global priorities. Notice that the kernel maintains 10 threads for servicing interrupts. These threads do not belong to any scheduling class and execute at the highest priority (160–169). As mentioned, Solaris has traditionally used the many-to-many model (Section 4.2.3) but switched to the one-to-one model (Section 4.2.2) beginning with Solaris 9.

Figure 5.13 Solaris scheduling. Global priorities run from 0 (lowest, scheduled last) to 169 (highest, scheduled first): interrupt threads occupy 160–169, realtime (RT) threads 100–159, system (SYS) threads 60–99, and the fair share (FSS), fixed priority (FX), timeshare (TS), and interactive (IA) classes share 0–59.

5.6.2 Example: Windows XP Scheduling

Windows XP schedules threads using a priority-based, preemptive scheduling algorithm. The Windows XP scheduler ensures that the highest-priority thread will always run. The portion of the Windows XP kernel that handles scheduling is called the dispatcher. A thread selected to run by the dispatcher will run until it is preempted by a higher-priority thread, until it terminates, until its time quantum ends, or until it calls a blocking system call, such as for I/O. If a higher-priority real-time thread becomes ready while a lower-priority thread is running, the lower-priority thread will be preempted. This preemption gives a real-time thread preferential access to the CPU when the thread needs such access.

The dispatcher uses a 32-level priority scheme to determine the order of thread execution. Priorities are divided into two classes.
The variable class contains threads having priorities from 1 to 15, and the real-time class contains threads with priorities ranging from 16 to 31. (There is also a thread running at priority 0 that is used for memory management.) The dispatcher uses a queue for each scheduling priority and traverses the set of queues from highest to lowest until it finds a thread that is ready to run. If no ready thread is found, the dispatcher will execute a special thread called the idle thread.

There is a relationship between the numeric priorities of the Windows XP kernel and the Win32 API. The Win32 API identifies several priority classes to which a process can belong. These include:

• REALTIME_PRIORITY_CLASS
• HIGH_PRIORITY_CLASS
• ABOVE_NORMAL_PRIORITY_CLASS
• NORMAL_PRIORITY_CLASS
• BELOW_NORMAL_PRIORITY_CLASS
• IDLE_PRIORITY_CLASS

Priorities in all classes except the REALTIME_PRIORITY_CLASS are variable, meaning that the priority of a thread belonging to one of these classes can change.

A thread within a given priority class also has a relative priority. The values for relative priorities include:

• TIME_CRITICAL
• HIGHEST
• ABOVE_NORMAL
• NORMAL
• BELOW_NORMAL
• LOWEST
• IDLE

The priority of each thread is based on both the priority class it belongs to and its relative priority within that class. This relationship is shown in Figure 5.14. The values of the priority classes appear in the top row; the left column contains the values for the relative priorities.

                 real-   high   above    normal   below   idle
                 time           normal            normal
time-critical     31      15      15       15       15     15
highest           26      15      12       10        8      6
above normal      25      14      11        9        7      5
normal            24      13      10        8        6      4
below normal      23      12       9        7        5      3
lowest            22      11       8        6        4      2
idle              16       1       1        1        1      1

Figure 5.14 Windows XP priorities.

For example, if the relative priority of a thread in the ABOVE_NORMAL_PRIORITY_CLASS is NORMAL, the numeric priority of that thread is 10.
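The mapping of Figure 5.14 is a simple two-dimensional lookup, sketched below in Java. The enum values are illustrative stand-ins for the Win32 constants, not the Win32 API itself.

// Sketch of the Figure 5.14 lookup: a thread's numeric priority is
// determined by its priority class and its relative priority.
public class XPPriority {
   enum PriorityClass {
      REALTIME, HIGH, ABOVE_NORMAL, NORMAL, BELOW_NORMAL, IDLE
   }
   enum RelativePriority {
      TIME_CRITICAL, HIGHEST, ABOVE_NORMAL, NORMAL,
      BELOW_NORMAL, LOWEST, IDLE
   }

   // Rows follow the relative priorities, columns the priority
   // classes, exactly as in Figure 5.14.
   static final int[][] TABLE = {
      {31, 15, 15, 15, 15, 15},  // time-critical
      {26, 15, 12, 10,  8,  6},  // highest
      {25, 14, 11,  9,  7,  5},  // above normal
      {24, 13, 10,  8,  6,  4},  // normal
      {23, 12,  9,  7,  5,  3},  // below normal
      {22, 11,  8,  6,  4,  2},  // lowest
      {16,  1,  1,  1,  1,  1},  // idle
   };

   static int numericPriority(PriorityClass c, RelativePriority r) {
      return TABLE[r.ordinal()][c.ordinal()];
   }

   public static void main(String[] args) {
      // The example from the text: a NORMAL relative priority in the
      // ABOVE_NORMAL_PRIORITY_CLASS yields numeric priority 10.
      System.out.println(numericPriority(PriorityClass.ABOVE_NORMAL,
                                         RelativePriority.NORMAL));
   }
}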
Furthermore, each thread has a base priority representing a value in the priority range for the class the thread belongs to. By default, the base priority is the value of the NORMAL relative priority for that class. The base priorities for each priority class are:

• REALTIME_PRIORITY_CLASS—24
• HIGH_PRIORITY_CLASS—13
• ABOVE_NORMAL_PRIORITY_CLASS—10
• NORMAL_PRIORITY_CLASS—8
• BELOW_NORMAL_PRIORITY_CLASS—6
• IDLE_PRIORITY_CLASS—4

Processes are typically members of the NORMAL_PRIORITY_CLASS. A process belongs to this class unless the parent of the process was of the IDLE_PRIORITY_CLASS or unless another class was specified when the process was created. The initial priority of a thread is typically the base priority of the process the thread belongs to.

When a thread's time quantum runs out, that thread is interrupted; if the thread is in the variable-priority class, its priority is lowered. The priority is never lowered below the base priority, however. Lowering the priority tends to limit the CPU consumption of compute-bound threads. When a variable-priority thread is released from a wait operation, the dispatcher boosts the priority. The amount of the boost depends on what the thread was waiting for; for example, a thread that was waiting for keyboard I/O would get a large increase, whereas a thread waiting for a disk operation would get a moderate one. This strategy tends to give good response times to interactive threads that are using the mouse and windows. It also enables I/O-bound threads to keep the I/O devices busy while permitting compute-bound threads to use spare CPU cycles in the background. This strategy is used by several time-sharing operating systems, including UNIX. In addition, the window with which the user is currently interacting receives a priority boost to enhance its response time.

When a user is running an interactive program, the system needs to provide especially good performance. For this reason, Windows XP has a special scheduling rule for processes in the NORMAL_PRIORITY_CLASS. Windows XP distinguishes between the foreground process that is currently selected on the screen and the background processes that are not currently selected. When a process moves into the foreground, Windows XP increases the scheduling quantum by some factor—typically by 3. This increase gives the foreground process three times longer to run before a time-sharing preemption occurs.

5.6.3 Example: Linux Scheduling

Prior to Version 2.5, the Linux kernel ran a variation of the traditional UNIX scheduling algorithm. Two problems with the traditional UNIX scheduler are that it does not provide adequate support for SMP systems and that it does not scale well as the number of tasks on the system grows. With Version 2.5, the scheduler was overhauled, and the kernel now provides a scheduling algorithm that runs in constant time—known as O(1)—regardless of the number of tasks on the system. The newer scheduler also provides increased support for SMP, including processor affinity and load balancing, as well as providing fairness and support for interactive tasks.

The Linux scheduler is a preemptive, priority-based algorithm with two separate priority ranges: a real-time range from 0 to 99 and a nice value ranging from 100 to 140. These two ranges map into a global priority scheme wherein numerically lower values indicate higher priorities. Unlike schedulers for many other systems, including Solaris (Section 5.6.1) and Windows XP (Section 5.6.2), Linux assigns higher-priority tasks longer time quanta and lower-priority tasks shorter time quanta. The relationship between priorities and time-slice length is shown in Figure 5.15.

Figure 5.15 The relationship between priorities and time-slice length: real-time tasks occupy numeric priorities 0 through 99 and other tasks 100 through 140; the highest priority receives the longest time quantum (200 ms) and the lowest priority the shortest (10 ms).

A runnable task is considered eligible for execution on the CPU as long as it has time remaining in its time slice. When a task has exhausted its time slice, it is considered expired and is not eligible for execution again until all other tasks have also exhausted their time quanta. The kernel maintains a list of all runnable tasks in a runqueue data structure. Because of its support for SMP, each processor maintains its own runqueue and schedules itself independently. Each runqueue contains two priority arrays: active and expired. The active array contains all tasks with time remaining in their time slices, and the expired array contains all expired tasks. Each of these priority arrays contains a list of tasks indexed according to priority (Figure 5.16). The scheduler chooses the task with the highest priority from the active array for execution on the CPU. On multiprocessor machines, this means that each processor is scheduling the highest-priority task from its own runqueue structure. When all tasks have exhausted their time slices (that is, the active array is empty), the two priority arrays are exchanged; the expired array becomes the active array, and vice versa.

Figure 5.16 List of tasks indexed according to priority: both the active and expired arrays hold a task list for each priority level, [0] through [140].
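The active/expired exchange can be sketched compactly in Java. This is an illustrative model only; the real O(1) scheduler keeps a bitmap of occupied priorities so that the highest-priority task is found in constant time, and all names here are hypothetical.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of an O(1)-style runqueue with active and expired priority
// arrays. Index 0 is the highest priority; index 140 is the lowest.
public class Runqueue {
   private static final int LEVELS = 141;

   private Deque<Runnable>[] active = newPriorityArray();
   private Deque<Runnable>[] expired = newPriorityArray();

   @SuppressWarnings("unchecked")
   private static Deque<Runnable>[] newPriorityArray() {
      Deque<Runnable>[] a = new Deque[LEVELS];
      for (int i = 0; i < LEVELS; i++)
         a[i] = new ArrayDeque<>();
      return a;
   }

   // A task that has exhausted its time slice moves to the expired
   // array (with a newly computed priority and time slice).
   public void expire(int priority, Runnable task) {
      expired[priority].addLast(task);
   }

   // Choose the highest-priority task in the active array; when the
   // active array is empty, exchange the two arrays and look again.
   public Runnable next() {
      Runnable r = highestActive();
      if (r == null) {
         Deque<Runnable>[] tmp = active; // active array exhausted:
         active = expired;               // swap active and expired
         expired = tmp;
         r = highestActive();
      }
      return r; // null if nothing is runnable
   }

   private Runnable highestActive() {
      for (int i = 0; i < LEVELS; i++)
         if (!active[i].isEmpty())
            return active[i].pollFirst();
      return null;
   }
}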
Linux implements real-time scheduling as defined by POSIX.1b, which is described in Section 5.4.2. Real-time tasks are assigned static priorities. All other tasks have dynamic priorities that are based on their nice values plus or minus the value 5. The interactivity of a task determines whether the value 5 will be added to or subtracted from the nice value. A task's interactivity is determined by how long it has been sleeping while waiting for I/O. Tasks that are more interactive typically have longer sleep times and therefore are more likely to have adjustments closer to −5, as the scheduler favors interactive tasks. The result of such adjustments will be higher priorities for these tasks. Conversely, tasks with shorter sleep times are often more CPU-bound and thus will have their priorities lowered. A task's dynamic priority is recalculated when the task has exhausted its time quantum and is to be moved to the expired array. Thus, when the two arrays are exchanged, all tasks in the new active array have been assigned new priorities and corresponding time slices.

5.7 Java Scheduling

We now consider scheduling in Java. The specification for the JVM has a loosely defined scheduling policy simply stating that each thread has a priority and that higher-priority threads will run in preference to threads with lower priorities. However, unlike the case with strict priority-based scheduling, it is possible that a lower-priority thread may get an opportunity to run at the expense of a higher-priority thread. Since the specification does not say that a scheduling policy must be preemptive, it is possible that a thread with a lower priority may continue to run even as a higher-priority thread becomes runnable. Furthermore, the specification for the JVM does not indicate whether or not threads are time-sliced using a round-robin scheduler (Section 5.3.4); that decision is up to the particular implementation of the JVM. If threads are time-sliced, then a runnable thread executes until one of the following events occurs:

1. Its time quantum expires.
2. It blocks for I/O.
3. It exits its run() method.

On systems that support preemption, a thread running on a CPU may also be preempted by a higher-priority thread.

So that all threads have an equal amount of CPU time on a system that does not perform time slicing, a thread may yield control of the CPU with the yield() method. By invoking the yield() method, a thread suggests that it is willing to relinquish control of the CPU, allowing another thread an opportunity to run. This yielding of control is called cooperative multitasking. The use of the yield() method appears as

public void run() {
   while (true) {
      // perform a CPU-intensive task
      ...

      // now yield control of the CPU
      Thread.yield();
   }
}

5.7.1 Thread Priorities

Each Java thread is assigned a priority that is a positive integer within a given range. A thread is given a default priority when it is created. Unless it is changed explicitly by the program, it maintains the same priority throughout its lifetime; the JVM does not dynamically alter priorities.
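A two-line check confirms the default-priority behavior just described; the output shown assumes the creating thread's priority has not been changed.

public class DefaultPriority {
   public static void main(String[] args) {
      // main runs at NORM_PRIORITY (5) by default, and a thread it
      // creates inherits that same priority.
      System.out.println(Thread.currentThread().getPriority()); // 5
      Thread child = new Thread(() -> { });
      System.out.println(child.getPriority()); // 5
   }
}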
The Java Thread class identifies the following thread priorities:

Priority                Comment
Thread.MIN_PRIORITY     The minimum thread priority
Thread.MAX_PRIORITY     The maximum thread priority
Thread.NORM_PRIORITY    The default thread priority

MIN_PRIORITY has a value of 1; MAX_PRIORITY, a value of 10; and NORM_PRIORITY, a value of 5. Every Java thread has a priority that falls somewhere within this range. (There is in fact a priority 0, but it is reserved for threads created by the JVM; developers cannot assign a thread a priority of 0.) The default priority is NORM_PRIORITY. When a thread is created, it is given the same priority as the thread that created it. The priority of a thread can also be set explicitly with the setPriority() method. The priority can be set either before the thread is started or while the thread is active. The class HighThread (Figure 5.17) illustrates using the setPriority() method to increase the priority of the thread by 3.

public class HighThread implements Runnable
{
   public void run() {
      Thread.currentThread().setPriority(Thread.NORM_PRIORITY + 3);

      // remainder of run() method
      ...
   }
}

Figure 5.17 Setting a priority using setPriority().

Because the JVM is typically implemented on top of a host operating system, the priority of a Java thread is related to the priority of the kernel thread to which it is mapped. As might be expected, this relationship varies from system to system. Thus, changing the priority of a Java thread through the setPriority() method can have different effects depending on the host operating system. For example, consider Solaris systems. As described in Section 5.6.1, Solaris threads are assigned a priority between 0 and 59. On these systems, the ten priorities assigned to Java threads must somehow be associated with the sixty possible priorities of the kernel threads. The priority of the kernel thread to which a Java thread is mapped is based on a combination of the Java-level priority and the dispatch table shown in Figure 5.12.

Win32 systems identify seven priority levels. As a result, different Java thread priorities may map to the same priority of a kernel thread. For example, Thread.NORM_PRIORITY + 1 and Thread.NORM_PRIORITY + 2 map to the same kernel priority; altering the priority of Java threads on Win32 systems may have no effect on how such threads are scheduled. The relationship between the priorities of Java and Win32 threads is shown in Figure 5.18:

Java priority         Win32 priority
 1 (MIN_PRIORITY)     LOWEST
 2                    LOWEST
 3                    BELOW_NORMAL
 4                    BELOW_NORMAL
 5 (NORM_PRIORITY)    NORMAL
 6                    ABOVE_NORMAL
 7                    ABOVE_NORMAL
 8                    HIGHEST
 9                    HIGHEST
10 (MAX_PRIORITY)     TIME_CRITICAL

Figure 5.18 The relationship between the priorities of Java and Win32 threads.

It is important to note that the specification for the JVM does not identify how a JVM must implement thread priorities. JVM designers can implement prioritization however they choose. In fact, a JVM implementation may choose to ignore calls to setPriority() entirely.

5.7.2 Java Thread Scheduling on Solaris

Let's look more closely at Java thread scheduling on Solaris systems. The relationship between the priorities of a Java thread and its associated kernel thread has an interesting history. On Solaris, each Java thread is assigned a unique user thread. Furthermore, user threads are mapped to kernel threads through a lightweight process, or LWP, as illustrated in Figure 5.19. Solaris uses the priocntl() and thr_setprio() system calls to change the priorities of LWPs and user threads, respectively.
The kernel is aware of priority changes made to an LWP with priocntl(), but changes made to the priority of a user thread with thr_setprio() are not reflected in the kernel.

Figure 5.19 User threads are mapped to kernel threads through a lightweight process: setPriority(a) sets the Java thread priority, thr_setprio(b) sets the Solaris native thread priority, and priocntl(c) sets the LWP priority, which is shared by the Solaris kernel thread.

Prior to Solaris 9, Solaris used the many-to-many model of mapping user and kernel threads (Section 4.2.3). The difficulty with this model was that, although it was possible to change the priority of the LWP to which a Java thread was mapped, this mapping was dynamic, and the Java thread could switch to another LWP without the JVM's awareness. This issue was addressed when Solaris 9 adopted the one-to-one model (Section 4.2.2), thereby ensuring that a Java thread is assigned to the same LWP for its lifetime.

When Java's setPriority() method was called prior to Version 1.4.2 of the JVM on Solaris, the JVM would change only the priority of the user thread, using the thr_setprio() system call; the priority of the underlying LWP was not changed, as the JVM did not invoke priocntl(). (This was because Solaris used the many-to-many threading model at that time. Changing the priority of the LWP made little sense, as a Java thread could migrate between different LWPs without the kernel's knowledge.) However, by the time Version 1.4.2 of the JVM appeared, Solaris had adopted the one-to-one model. As a result, when a Java thread invoked setPriority(), the JVM would call both thr_setprio() and priocntl() to alter the priority of the user thread and the LWP, respectively.

Version 1.5 of the JVM on Solaris addressed issues concerning the relative priorities of Java threads and threads running in native C and C++ programs. On Solaris, by default, a C or C++ program initially runs at the highest priority in its scheduling class. However, the default priority given to a Java thread at Thread.NORM_PRIORITY is in the middle of the priority range for its scheduling class. As a result, when Solaris concurrently ran both a C and a Java program with default scheduling priorities, the operating system typically favored the C program. Version 1.5 of the JVM on Solaris assigns Java priorities from Thread.NORM_PRIORITY to Thread.MAX_PRIORITY the highest priority in the scheduling class; Java priorities from Thread.MIN_PRIORITY to Thread.NORM_PRIORITY − 1 are assigned correspondingly lower priorities. The advantage of this scheme is that Java programs now run with priorities equal to those of C and C++ programs. The disadvantage is that changing the priority of a Java thread from setPriority(Thread.NORM_PRIORITY) to setPriority(Thread.MAX_PRIORITY) has no effect.

5.7.3 Scheduling Features in Java 1.5

With such a loosely defined scheduling policy, Java developers have typically been discouraged from using API features related to scheduling, since the behavior of the method calls can vary so much from one operating system to another. Beginning in Java 1.5, however, features have been added to the Java API to support a more deterministic scheduling policy. The API additions are centered around the use of a thread pool. Thread pools in Java were covered in Section 4.5.4. In that section, we discussed three different thread-pool structures: (1) a single-threaded pool, (2) a thread pool with a fixed number of threads, and (3) a cached thread pool.
A fourth structure, which executes threads after a certain delay or periodically, is also available. This structure is similar to that shown in Figure 4.17. It differs, however, in that we use the factory method newScheduledThreadPool() in the Executors class, which returns a ScheduledExecutorService object. Using this object, we invoke one of four possible methods to schedule a Runnable task either after a fixed delay, at a fixed (periodic) rate, or periodically with an initial delay.

The program shown in Figure 5.20 creates a Runnable task that checks once per second to see if there are entries written to a log. If so, the entries are written to a database. The program uses a scheduled thread pool of size 1 (we require only one thread) and then schedules the task using the scheduleAtFixedRate() method so that the task begins running immediately (delay of 0) and then runs once per second.

import java.util.concurrent.*;

class Task implements Runnable
{
   public void run() {
      /**
       * check if there are any entries that
       * need to be written from a log to
       * a database
       */
      System.out.println("Checking for log entries ...");
   }
}

public class SPExample
{
   public static void main(String[] args) {
      // create the scheduled thread pool
      ScheduledExecutorService scheduler =
         Executors.newScheduledThreadPool(1);

      // create the task
      Runnable task = new Task();

      // schedule the task so that it runs once every second
      scheduler.scheduleAtFixedRate(task, 0, 1, TimeUnit.SECONDS);
   }
}

Figure 5.20 Creating a scheduled thread pool in Java.
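For comparison, the other scheduling styles mentioned above look like this. A brief sketch; the task body is arbitrary and the class name is hypothetical:

import java.util.concurrent.*;

public class DelayExample {
   public static void main(String[] args) {
      ScheduledExecutorService scheduler =
         Executors.newScheduledThreadPool(1);
      Runnable task =
         () -> System.out.println("Checking for log entries ...");

      // one-shot execution after a fixed 5-second delay
      scheduler.schedule(task, 5, TimeUnit.SECONDS);

      // periodic execution in which each run begins 1 second after
      // the previous run finishes, rather than at a fixed rate
      scheduler.scheduleWithFixedDelay(task, 0, 1, TimeUnit.SECONDS);
   }
}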
5.8 Algorithm Evaluation

How do we select a CPU-scheduling algorithm for a particular system? As we saw in Section 5.3, there are many scheduling algorithms, each with its own parameters. As a result, selecting an algorithm can be difficult.

The first problem is defining the criteria to be used in selecting an algorithm. As we saw in Section 5.2, criteria are often defined in terms of CPU utilization, response time, or throughput. To select an algorithm, we must first define the relative importance of these elements. Our criteria may include several measures, such as:

• Maximizing CPU utilization under the constraint that the maximum response time is 1 second

• Maximizing throughput so that turnaround time is (on average) linearly proportional to total execution time

Once the selection criteria have been defined, we want to evaluate the algorithms under consideration. We next describe the various evaluation methods we can use.

5.8.1 Deterministic Modeling

One major class of evaluation methods is analytic evaluation. Analytic evaluation uses the given algorithm and the system workload to produce a formula or number that evaluates the performance of the algorithm for that workload. Deterministic modeling is one type of analytic evaluation. This method takes a particular predetermined workload and defines the performance of each algorithm for that workload. For example, assume that we have the workload shown below. All five processes arrive at time 0, in the order given, with the length of the CPU burst given in milliseconds:

Process   Burst Time
  P1          10
  P2          29
  P3           3
  P4           7
  P5          12

Consider the FCFS, SJF, and RR (quantum = 10 milliseconds) scheduling algorithms for this set of processes. Which algorithm would give the minimum average waiting time?

For the FCFS algorithm, we would execute the processes as

| P1 | P2 | P3 | P4 | P5 |
0    10   39   42   49   61

The waiting time is 0 milliseconds for process P1, 10 milliseconds for process P2, 39 milliseconds for process P3, 42 milliseconds for process P4, and 49 milliseconds for process P5. Thus, the average waiting time is (0 + 10 + 39 + 42 + 49)/5 = 28 milliseconds.

With nonpreemptive SJF scheduling, we execute the processes as

| P3 | P4 | P1 | P5 | P2 |
0    3    10   20   32   61

The waiting time is 10 milliseconds for process P1, 32 milliseconds for process P2, 0 milliseconds for process P3, 3 milliseconds for process P4, and 20 milliseconds for process P5. Thus, the average waiting time is (10 + 32 + 0 + 3 + 20)/5 = 13 milliseconds.

With the RR algorithm, we execute the processes as

| P1 | P2 | P3 | P4 | P5 | P2 | P5 | P2 |
0    10   20   23   30   40   50   52   61

The waiting time is 0 milliseconds for process P1, 32 milliseconds for process P2, 20 milliseconds for process P3, 23 milliseconds for process P4, and 40 milliseconds for process P5. Thus, the average waiting time is (0 + 32 + 20 + 23 + 40)/5 = 23 milliseconds.

We see that, in this case, the average waiting time obtained with the SJF policy is less than half that obtained with FCFS scheduling; the RR algorithm gives us an intermediate value.

Deterministic modeling is simple and fast. It gives us exact numbers, allowing us to compare the algorithms. However, it requires exact numbers for input, and its answers apply only to those cases. The main uses of deterministic modeling are in describing scheduling algorithms and providing examples. In cases where we are running the same program over and over again and can measure the program's processing requirements exactly, we may be able to use deterministic modeling to select a scheduling algorithm. Furthermore, over a set of examples, deterministic modeling may indicate trends that can then be analyzed and proved separately. For example, it can be shown that, for the environment described (all processes and their times available at time 0), the SJF policy will always result in the minimum waiting time.
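These three averages can also be reproduced mechanically. The following Java sketch assumes, as in the example, that all processes arrive at time 0; the class and method names are hypothetical.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Deterministic modeling of the example workload: average waiting
// times under FCFS, nonpreemptive SJF, and RR with a 10-ms quantum.
public class DeterministicModel {
   static double fcfs(int[] bursts) {
      int time = 0, totalWait = 0;
      for (int b : bursts) {
         totalWait += time; // each process waits for its predecessors
         time += b;
      }
      return (double) totalWait / bursts.length;
   }

   static double sjf(int[] bursts) {
      int[] sorted = bursts.clone();
      Arrays.sort(sorted); // shortest burst first, then run in order
      return fcfs(sorted);
   }

   static double rr(int[] bursts, int quantum) {
      int n = bursts.length, time = 0, totalWait = 0;
      int[] remaining = bursts.clone();
      Deque<Integer> queue = new ArrayDeque<>();
      for (int i = 0; i < n; i++)
         queue.addLast(i);
      while (!queue.isEmpty()) {
         int i = queue.pollFirst();
         int slice = Math.min(quantum, remaining[i]);
         time += slice;
         remaining[i] -= slice;
         if (remaining[i] > 0)
            queue.addLast(i); // back to the tail of the ready queue
         else
            totalWait += time - bursts[i]; // waiting = finish - burst
      }
      return (double) totalWait / n;
   }

   public static void main(String[] args) {
      int[] bursts = {10, 29, 3, 7, 12};
      System.out.println(fcfs(bursts));   // 28.0
      System.out.println(sjf(bursts));    // 13.0
      System.out.println(rr(bursts, 10)); // 23.0
   }
}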
5.8.2 Queueing Models

On many systems, the processes that are run vary from day to day, so there is no static set of processes (or times) to use for deterministic modeling. What can be determined, however, is the distribution of CPU and I/O bursts. These distributions can be measured and then approximated or simply estimated. The result is a mathematical formula describing the probability of a particular CPU burst. Commonly, this distribution is exponential and is described by its mean. Similarly, we can describe the distribution of times when processes arrive in the system (the arrival-time distribution). From these two distributions, it is possible to compute the average throughput, utilization, waiting time, and so on for most algorithms.

The computer system is described as a network of servers. Each server has a queue of waiting processes. The CPU is a server with its ready queue, as is the I/O system with its device queues. Knowing arrival rates and service rates, we can compute utilization, average queue length, average wait time, and so on. This area of study is called queueing-network analysis.

As an example, let n be the average queue length (excluding the process being serviced), let W be the average waiting time in the queue, and let λ be the average arrival rate for new processes in the queue (such as three processes per second). We expect that during the time W that a process waits, λ × W new processes will arrive in the queue. If the system is in a steady state, then the number of processes leaving the queue must be equal to the number of processes that arrive. Thus,

n = λ × W.

This equation, known as Little's formula, is particularly useful because it is valid for any scheduling algorithm and arrival distribution.

We can use Little's formula to compute one of the three variables if we know the other two. For example, if we know that 7 processes arrive every second (on average) and that there are normally 14 processes in the queue, then we can compute the average waiting time per process as 2 seconds.

Queueing analysis can be useful in comparing scheduling algorithms, but it also has limitations. At the moment, the classes of algorithms and distributions that can be handled are fairly limited. The mathematics of complicated algorithms and distributions can be difficult to work with. Thus, arrival and service distributions are often defined in mathematically tractable—but unrealistic—ways. It is also generally necessary to make a number of independent assumptions, which may not be accurate. As a result of these difficulties, queueing models are often only approximations of real systems, and the accuracy of the computed results may be questionable.

5.8.3 Simulations

To get a more accurate evaluation of scheduling algorithms, we can use simulations. Running simulations involves programming a model of the computer system. Software data structures represent the major components of the system. The simulator has a variable representing a clock; as this variable's value is increased, the simulator modifies the system state to reflect the activities of the devices, the processes, and the scheduler. As the simulation executes, statistics that indicate algorithm performance are gathered and printed.

The data to drive the simulation can be generated in several ways. The most common method uses a random-number generator that is programmed to generate processes, CPU burst times, arrivals, departures, and so on, according to probability distributions. The distributions can be defined mathematically (uniform, exponential, Poisson) or empirically. If a distribution is to be defined empirically, measurements of the actual system under study are taken. The results define the distribution of events in the real system; this distribution can then be used to drive the simulation.

A distribution-driven simulation may be inaccurate, however, because of relationships between successive events in the real system. The frequency distribution indicates only how many instances of each event occur; it does not indicate anything about the order of their occurrence. To correct this problem, we can use trace tapes. We create a trace tape by monitoring the real system and recording the sequence of actual events (Figure 5.21). We then use this sequence to drive the simulation. Trace tapes provide an excellent way to compare two algorithms on exactly the same set of real inputs. This method can produce accurate results for its inputs.

Figure 5.21 Evaluation of CPU schedulers by simulation: a trace tape recording actual process execution (a sequence of CPU and I/O burst lengths) drives separate FCFS, SJF, and RR (q = 14) simulations, each of which produces performance statistics for its algorithm.

Simulations can be expensive, often requiring hours of computer time.
A more detailed simulation provides more accurate results, but it also takes more computer time. In addition, trace tapes can require large amounts of storage space. Finally, the design, coding, and debugging of the simulator can be a major task.

5.8.4 Implementation

Even a simulation is of limited accuracy. The only completely accurate way to evaluate a scheduling algorithm is to code it up, put it in the operating system, and see how it works. This approach puts the actual algorithm in the real system for evaluation under real operating conditions.

The major difficulty with this approach is the high cost. The expense is incurred not only in coding the algorithm and modifying the operating system to support it (along with its required data structures) but also in the reaction of the users to a constantly changing operating system. Most users are not interested in building a better operating system; they merely want to get their processes executed and use their results. A constantly changing operating system does not help the users to get their work done.

Another difficulty is that the environment in which the algorithm is used will change. The environment will change not only in the usual way, as new programs are written and the types of problems change, but also as a result of the performance of the scheduler. If short processes are given priority, then users may break larger processes into sets of smaller processes. If interactive processes are given priority over noninteractive processes, then users may switch to interactive use. For example, researchers designed one system that classified interactive and noninteractive processes automatically by looking at the amount of terminal I/O. If a process did not input or output to the terminal in a 1-second interval, the process was classified as noninteractive and was moved to a lower-priority queue. In response to this policy, one programmer modified his programs to write an arbitrary character to the terminal at regular intervals of less than 1 second. The system gave his programs a high priority, even though the terminal output was completely meaningless.

The most flexible scheduling algorithms are those that can be altered by the system managers or by the users so that they can be tuned for a specific application or set of applications. A workstation that performs high-end graphical applications, for instance, may have scheduling needs different from those of a Web server or file server. Some operating systems—particularly several versions of UNIX—allow the system manager to fine-tune the scheduling parameters for a particular system configuration. For example, Solaris provides the dispadmin command to allow the system administrator to modify the parameters of the scheduling classes described in Section 5.6.1.

Another approach is to use APIs that modify the priority of a process or thread. The Java, POSIX, and Win32 APIs provide such functions. The downfall of this approach is that performance-tuning a system or application most often does not result in improved performance in more general situations.

5.9 Summary

CPU scheduling is the task of selecting a waiting process from the ready queue and allocating the CPU to it. The CPU is allocated to the selected process by the dispatcher.

First-come, first-served (FCFS) scheduling is the simplest scheduling algorithm, but it can cause short processes to wait for very long processes.
Shortest-job-first (SJF) scheduling is provably optimal, providing the shortest average waiting time. Implementing SJF scheduling is difficult, however, because predicting the length of the next CPU burst is difficult. The SJF algorithm is a special case of the general priority scheduling algorithm, which simply allocates the CPU to the highest-priority process. Both priority and SJF scheduling may suffer from starvation. Aging is a technique to prevent starvation.

Round-robin (RR) scheduling is more appropriate for a time-shared (interactive) system. RR scheduling allocates the CPU to the first process in the ready queue for q time units, where q is the time quantum. After q time units, if the process has not relinquished the CPU, it is preempted, and the process is put at the tail of the ready queue. The major problem is the selection of the time quantum. If the quantum is too large, RR scheduling degenerates to FCFS scheduling; if the quantum is too small, scheduling overhead in the form of context-switch time becomes excessive.

The FCFS algorithm is nonpreemptive; the RR algorithm is preemptive. The SJF and priority algorithms may be either preemptive or nonpreemptive.

Multilevel queue algorithms allow different algorithms to be used for different classes of processes. The most common model includes a foreground interactive queue that uses RR scheduling and a background batch queue that uses FCFS scheduling. Multilevel feedback queues allow processes to move from one queue to another.

Many contemporary computer systems support multiple processors and allow each processor to schedule itself independently. Typically, each processor maintains its own private queue of processes (or threads), all of which are available to run. Additional issues related to multiprocessor scheduling include processor affinity, load balancing, and multicore processing as well as scheduling on virtualization systems.

Operating systems supporting threads at the kernel level must schedule threads—not processes—for execution. This is the case with Solaris and Windows XP. Both of these systems schedule threads using preemptive, priority-based scheduling algorithms, including support for real-time threads. The Linux process scheduler uses a priority-based algorithm with real-time support as well. The scheduling algorithms for these three operating systems typically favor interactive over batch and CPU-bound processes. The specification for scheduling in Java is loosely defined: higher-priority threads will be given preference over threads with lower priority.

The wide variety of scheduling algorithms demands that we have methods to select among algorithms. Analytic methods use mathematical analysis to determine the performance of an algorithm. Simulation methods determine performance by imitating the scheduling algorithm on a "representative" sample of processes and computing the resulting performance. However, simulation can at best provide an approximation of actual system performance; the only reliable technique for evaluating a scheduling algorithm is to implement the algorithm on an actual system and monitor its performance in a "real-world" environment.

Practice Exercises

5.1 A CPU-scheduling algorithm determines an order for the execution of its scheduled processes. Given n processes to be scheduled on one processor, how many different schedules are possible? Give a formula in terms of n.

5.2 Explain the difference between preemptive and nonpreemptive scheduling.
5.3 Suppose that the following processes arrive for execution at the times indicated. Each process will run for the amount of time listed. In answering the questions, use nonpreemptive scheduling, and base all decisions on the information you have at the time the decision must be made.

      Process   Arrival Time   Burst Time
      P1        0.0            8
      P2        0.4            4
      P3        1.0            1

a. What is the average turnaround time for these processes with the FCFS scheduling algorithm?

b. What is the average turnaround time for these processes with the SJF scheduling algorithm?

c. The SJF algorithm is supposed to improve performance, but notice that we chose to run process P1 at time 0 because we did not know that two shorter processes would arrive soon. Compute what the average turnaround time will be if the CPU is left idle for the first 1 unit and then SJF scheduling is used. Remember that processes P1 and P2 are waiting during this idle time, so their waiting time may increase. This algorithm could be known as future-knowledge scheduling.

5.4 What advantage is there in having different time-quantum sizes at different levels of a multilevel queueing system?

5.5 Many CPU-scheduling algorithms are parameterized. For example, the RR algorithm requires a parameter to indicate the time slice. Multilevel feedback queues require parameters to define the number of queues, the scheduling algorithm for each queue, the criteria used to move processes between queues, and so on. These algorithms are thus really sets of algorithms (for example, the set of RR algorithms for all time slices, and so on). One set of algorithms may include another (for example, the FCFS algorithm is the RR algorithm with an infinite time quantum). What (if any) relation holds between the following pairs of algorithm sets?

a. Priority and SJF
b. Multilevel feedback queues and FCFS
c. Priority and FCFS
d. RR and SJF

5.6 Suppose that a scheduling algorithm (at the level of short-term CPU scheduling) favors those processes that have used the least processor time in the recent past. Why will this algorithm favor I/O-bound programs and yet not permanently starve CPU-bound programs?

5.7 Distinguish between PCS and SCS scheduling.

5.8 Assume that an operating system maps user-level threads to the kernel using the many-to-many model and that the mapping is done through the use of LWPs. Furthermore, the system allows program developers to create real-time threads. Is it necessary to bind a real-time thread to an LWP?

Exercises

5.9 Why is it important for the scheduler to distinguish I/O-bound programs from CPU-bound programs?

5.10 Discuss how the following pairs of scheduling criteria conflict in certain settings.

a. CPU utilization and response time
b. Average turnaround time and maximum waiting time
c. I/O device utilization and CPU utilization

5.11 Consider the exponential average formula used to predict the length of the next CPU burst (Section 5.3.2). What are the implications of assigning the following values to the parameters used by the algorithm?

a. α = 0 and τ0 = 100 milliseconds
b. α = 0.99 and τ0 = 10 milliseconds

5.12 Consider the following set of processes, with the length of the CPU burst given in milliseconds:

      Process   Burst Time   Priority
      P1        10           3
      P2        1            1
      P3        2            3
      P4        1            4
      P5        5            2

The processes are assumed to have arrived in the order P1, P2, P3, P4, P5, all at time 0.

a.
Draw four Gantt charts that illustrate the execution of these processes using the following scheduling algorithms: FCFS, SJF, nonpreemptive priority (a smaller priority number implies a higher priority), and RR (quantum = 1).

b. What is the turnaround time of each process for each of the scheduling algorithms in part a?

c. What is the waiting time of each process for each of these scheduling algorithms?

d. Which of the algorithms results in the minimum average waiting time (over all processes)?

5.13 Which of the following scheduling algorithms could result in starvation?

a. First-come, first-served
b. Shortest job first
c. Round robin
d. Priority

5.14 Consider a variant of the RR scheduling algorithm in which the entries in the ready queue are pointers to the PCBs.

a. What would be the effect of putting two pointers to the same process in the ready queue?
b. What would be two major advantages and two disadvantages of this scheme?
c. How would you modify the basic RR algorithm to achieve the same effect without the duplicate pointers?

5.15 Consider a system running ten I/O-bound tasks and one CPU-bound task. Assume that the I/O-bound tasks issue an I/O operation once for every millisecond of CPU computing and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is 0.1 millisecond and that all processes are long-running tasks. Describe the CPU utilization for a round-robin scheduler when:

a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds

5.16 Consider a system implementing multilevel queue scheduling. What strategy can a computer user employ to maximize the amount of CPU time allocated to the user's process?

5.17 Consider a preemptive priority scheduling algorithm based on dynamically changing priorities. Larger priority numbers imply higher priority. When a process is waiting for the CPU (in the ready queue, but not running), its priority changes at a rate α; when it is running, its priority changes at a rate β. All processes are given a priority of 0 when they enter the ready queue. The parameters α and β can be set to give many different scheduling algorithms.

a. What is the algorithm that results from β > α > 0?
b. What is the algorithm that results from α < β < 0?

5.18 Explain the differences in how much the following scheduling algorithms discriminate in favor of short processes:

a. FCFS
b. RR
c. Multilevel feedback queues

5.19 Using the Windows XP scheduling algorithm, determine the numeric priority of each of the following threads.

a. A thread in the REALTIME_PRIORITY_CLASS with a relative priority of HIGHEST
b. A thread in the NORMAL_PRIORITY_CLASS with a relative priority of NORMAL
c. A thread in the HIGH_PRIORITY_CLASS with a relative priority of ABOVE NORMAL

5.20 Consider the scheduling algorithm in the Solaris operating system for time-sharing threads.

a. What is the time quantum (in milliseconds) for a thread with priority 10? With priority 55?
b. Assume that a thread with priority 35 has used its entire time quantum without blocking. What new priority will the scheduler assign this thread?
c. Assume that a thread with priority 35 blocks for I/O before its time quantum has expired. What new priority will the scheduler assign this thread?

5.21 The traditional UNIX scheduler enforces an inverse relationship between priority numbers and priorities: the higher the number, the lower the priority.
The scheduler recalculates process priorities once per second using the following function:

      Priority = (recent CPU usage / 2) + base

where base = 60 and recent CPU usage refers to a value indicating how often a process has used the CPU since priorities were last recalculated. Assume that recent CPU usage for process P1 is 40, for process P2 is 18, and for process P3 is 10. What will be the new priorities for these three processes when priorities are recalculated? Based on this information, does the traditional UNIX scheduler raise or lower the relative priority of a CPU-bound process?

5.22 As discussed in Section 5.7, the specification for the JVM may allow implementations to ignore calls to setPriority(). An argument in favor of ignoring setPriority() is that modifying the priority of a Java thread has little effect once the thread begins running on a native operating-system thread, since the operating-system scheduler modifies the priority of the kernel thread to which the Java thread is mapped based on how CPU- or I/O-intensive the thread is. Discuss the pros and cons of this argument.

Programming Problems

5.23 In programming problem 4.21, you wrote a program that listed each thread in the JVM. Modify this program so that you also list the priority of each thread.

WileyPLUS

Visit WileyPLUS for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Feedback queues were originally implemented on the CTSS system described in Corbato et al. [1962]. This feedback queue scheduling system was analyzed by Schrage [1967]. The preemptive priority scheduling algorithm of Exercise 5.17 was suggested by Kleinrock [1975]. Anderson et al. [1989], Lewis and Berg [1998], and Philbin et al. [1996] discuss thread scheduling. Multicore scheduling is examined in McNairy and Bhatia [2005] and Kongetira et al. [2005].

Scheduling techniques that take into account information regarding process execution times from previous runs are described in Fisher [1981], Hall et al. [1996], and Lowney et al. [1993]. Fair-share schedulers are covered by Henry [1984], Woodside [1986], and Kay and Lauder [1988].

Scheduling policies used in the UNIX V operating system are described by Bach [1987]; those for UNIX FreeBSD 5.2 are presented by McKusick and Neville-Neil [2005]; and those for the Mach operating system are discussed by Black [1990]. Love [2005] covers scheduling in Linux. Details of the ULE scheduler can be found in Roberson [2003]. Solaris scheduling is described by Mauro and McDougall [2007]. Solomon [1998], Solomon and Russinovich [2000], and Russinovich and Solomon [2005] discuss scheduling in Windows internals. Butenhof [1997] and Lewis and Berg [1998] describe scheduling in Pthreads systems. Siddha et al. [2007] discuss scheduling challenges on multicore systems. Java thread scheduling is covered in Oaks and Wong [2004] and Goetz et al. [2006].

Chapter 6  Process Synchronization

A cooperating process is one that can affect or be affected by other processes executing in the system. Cooperating processes can either directly share a logical address space (that is, both code and data) or be allowed to share data only through files or messages. The former case is achieved through the use of threads, discussed in Chapter 4. Concurrent access to shared data may result in data inconsistency, however.
In this chapter, we discuss various mechanisms to ensure the orderly execution of cooperating processes that share a logical address space, so that data consistency is maintained.

CHAPTER OBJECTIVES

• To introduce the critical-section problem, whose solutions can be used to ensure the consistency of shared data.
• To present both software and hardware solutions to the critical-section problem.
• To introduce the concept of an atomic transaction and describe mechanisms to ensure atomicity.

6.1 Background

In Chapter 3, we developed a model of a system consisting of cooperating sequential processes or threads, all running asynchronously and possibly sharing data. We illustrated this model with the producer–consumer problem, which is representative of operating systems. Specifically, in Section 3.4.1, we described how a bounded buffer could be used to enable processes to share memory.

Let's return to our consideration of the bounded buffer. Here, we assume that the code for the producer is as follows:

      while (count == BUFFER_SIZE)
         ; // do nothing

      // add an item to the buffer
      buffer[in] = item;
      in = (in + 1) % BUFFER_SIZE;
      ++count;

The code for the consumer is

      while (count == 0)
         ; // do nothing

      // remove an item from the buffer
      item = buffer[out];
      out = (out + 1) % BUFFER_SIZE;
      --count;

Although both the producer and consumer routines are correct separately, they may not function correctly when executed concurrently. As an illustration, suppose that the value of the variable count is currently 5 and that the producer and consumer processes execute the statements “++count” and “--count” concurrently. Following the execution of these two statements, the value of the variable count may be 4, 5, or 6! The only correct result, though, is count == 5, which is generated correctly if the producer and consumer execute separately.

We can show that the value of count may be incorrect as follows. Note that the statement “++count” may be implemented in machine language (on a typical machine) as

      register1 = count
      register1 = register1 + 1
      count = register1

where register1 is a local CPU register. Similarly, the statement “--count” is implemented as follows:

      register2 = count
      register2 = register2 − 1
      count = register2

where again register2 is a local CPU register. Even though register1 and register2 may be the same physical register (an accumulator, say), remember that the contents of this register will be saved and restored by the interrupt handler (Section 1.2.3).

The concurrent execution of “++count” and “--count” is equivalent to a sequential execution in which the lower-level statements presented previously are interleaved in some arbitrary order (but the order within each high-level statement is preserved). One such interleaving is

      T0: producer execute register1 = count         {register1 = 5}
      T1: producer execute register1 = register1 + 1 {register1 = 6}
      T2: consumer execute register2 = count         {register2 = 5}
      T3: consumer execute register2 = register2 − 1 {register2 = 4}
      T4: producer execute count = register1         {count = 6}
      T5: consumer execute count = register2         {count = 4}

Notice that we have arrived at the incorrect state “count == 4”, indicating that four buffers are full, when, in fact, five buffers are full. If we reversed the order of the statements at T4 and T5, we would arrive at the incorrect state “count == 6”.

We would arrive at this incorrect state because we allowed both processes to manipulate the variable count concurrently.
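This nondeterminism is easy to observe in practice. The following minimal, self-contained Java sketch (our own illustration; the class name, loop counts, and output format are hypothetical, not from the text) races one million unsynchronized increments against one million decrements. Because ++count and --count are not atomic, the printed result is frequently nonzero:

      public class RaceDemo {
         // shared counter, deliberately unsynchronized
         private static int count = 0;

         public static void main(String[] args) throws InterruptedException {
            // producer-style thread: one million increments
            Thread p = new Thread(() -> {
               for (int i = 0; i < 1000000; i++)
                  ++count;
            });
            // consumer-style thread: one million decrements
            Thread c = new Thread(() -> {
               for (int i = 0; i < 1000000; i++)
                  --count;
            });

            p.start(); c.start();
            p.join(); c.join();

            // with correct synchronization this would always print 0
            System.out.println("count = " + count);
         }
      }

Wrapping each update in a synchronized block (or using java.util.concurrent.atomic.AtomicInteger) makes the result deterministic again, for exactly the reasons developed in the rest of this chapter.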
A situation like this, where several processes access and manipulate the same data concurrently and the outcome of the execution depends on the particular order in which the access takes place, is called a race condition. To guard against the race condition above, we need to ensure that only one process at a time can be manipulating the variable count. To make such a guarantee, we require that the processes be synchronized in some way.

Situations such as the one just described occur frequently in operating systems as different parts of the system manipulate resources. Furthermore, with the growth of multicore systems, there is an increased emphasis on developing multithreaded applications wherein several threads—which are quite possibly sharing data—are running in parallel on different processing cores. Clearly, we want any changes that result from such activities not to interfere with one another. Because of the importance of this issue, a major portion of this chapter is concerned with process synchronization and coordination among cooperating processes.

6.2 The Critical-Section Problem

Consider a system consisting of n processes {P0, P1, ..., Pn−1}. Each process has a segment of code, called a critical section, in which the process may be changing common variables, updating a table, writing a file, and so on. The important feature of the system is that, when one process is executing in its critical section, no other process is to be allowed to execute in its critical section. That is, no two processes are executing in their critical sections at the same time. The critical-section problem is to design a protocol that the processes can use to cooperate. Each process must request permission to enter its critical section. The section of code implementing this request is the entry section. The critical section may be followed by an exit section. The remaining code is the remainder section. The general structure of a typical process Pi is shown in Figure 6.1. The entry section and exit section are enclosed in boxes to highlight these important segments of code.

      while (true) {
         entry section
            critical section
         exit section
            remainder section
      }

      Figure 6.1 General structure of a typical process Pi.

A solution to the critical-section problem must satisfy the following three requirements:

1. Mutual exclusion. If process Pi is executing in its critical section, then no other processes can be executing in their critical sections.

2. Progress. If no process is executing in its critical section and some processes wish to enter their critical sections, then only those processes that are not executing in their remainder sections can participate in deciding which will enter its critical section next, and this selection cannot be postponed indefinitely.

3. Bounded waiting. There exists a bound, or limit, on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted.

We assume that each process is executing at a nonzero speed. However, we can make no assumption concerning the relative speed of the n processes.

At a given point in time, many kernel-mode processes may be active in the operating system. As a result, the code implementing an operating system (kernel code) is subject to several possible race conditions. Consider as an example a kernel data structure that maintains a list of all open files in the system.
This list must be modified when a new file is opened or closed (adding the file to the list or removing it from the list). If two processes were to open files simultaneously, the separate updates to this list could result in a race condition. Other kernel data structures that are prone to possible race conditions include structures for maintaining memory allocation, for maintaining process lists, and for interrupt handling. It is up to kernel developers to ensure that the operating system is free from such race conditions.

Two general approaches are used to handle critical sections in operating systems: (1) preemptive kernels and (2) nonpreemptive kernels. A preemptive kernel allows a process to be preempted while it is running in kernel mode. A nonpreemptive kernel does not allow a process running in kernel mode to be preempted; a kernel-mode process will run until it exits kernel mode, blocks, or voluntarily yields control of the CPU. Obviously, a nonpreemptive kernel is essentially free from race conditions on kernel data structures, as only one process is active in the kernel at a time. We cannot say the same about preemptive kernels, so they must be carefully designed to ensure that shared kernel data are free from race conditions. Preemptive kernels are especially difficult to design for SMP architectures, since in these environments it is possible for two kernel-mode processes to run simultaneously on different processors.

Why, then, would anyone favor a preemptive kernel over a nonpreemptive one? A preemptive kernel is more suitable for real-time programming, as it will allow a real-time process to preempt a process currently running in the kernel. Furthermore, a preemptive kernel may be more responsive, since there is less risk that a kernel-mode process will run for an arbitrarily long period before relinquishing the processor to waiting processes. Of course, this effect can be minimized by designing kernel code that does not behave in this way. Later in this chapter, we explore how various operating systems manage preemption within the kernel.

6.3 Peterson's Solution

Next, we illustrate a classic software-based solution to the critical-section problem known as Peterson's solution. Because of the way modern computer architectures perform basic machine-language instructions, such as load and store, there are no guarantees that Peterson's solution will work correctly on such architectures. However, we present the solution because it provides a good algorithmic description of solving the critical-section problem and illustrates some of the complexities involved in designing software that addresses the requirements of mutual exclusion, progress, and bounded waiting.

Peterson's solution is restricted to two processes that alternate execution between their critical sections and remainder sections. The processes are numbered P0 and P1. For convenience, when presenting Pi, we use Pj to denote the other process; that is, j equals 1 − i.

Peterson's solution requires the two processes to share two data items:

      int turn;
      boolean flag[2];

The variable turn indicates whose turn it is to enter its critical section. That is, if turn == i, then process Pi is allowed to execute in its critical section. The flag array is used to indicate if a process is ready to enter its critical section. For example, if flag[i] is true, this value indicates that Pi is ready to enter its critical section.
With an explanation of these data structures complete, we are now ready to describe the algorithm shown in Figure 6.2. To enter the critical section, process Pi first sets flag[i] to be true and then sets turn to the value j, thereby asserting that if the other process wishes to enter the critical section, it can do so. If both processes try to enter at the same time, turn will be set to both i and j at roughly the same time. Only one of these assignments will last; the other will occur but will be overwritten immediately. The eventual value of turn determines which of the two processes is allowed to enter its critical section first.

We now prove that this solution is correct. We need to show that:

1. Mutual exclusion is preserved.
2. The progress requirement is satisfied.
3. The bounded-waiting requirement is met.

      while (true) {
         flag[i] = TRUE;
         turn = j;
         while (flag[j] && turn == j)
            ;

         critical section

         flag[i] = FALSE;

         remainder section
      }

      Figure 6.2 The structure of process Pi in Peterson's solution.

To prove property 1, we note that each Pi enters its critical section only if either flag[j] == false or turn == i. Also note that, if both processes can be executing in their critical sections at the same time, then flag[0] == flag[1] == true. These two observations imply that P0 and P1 could not have successfully executed their while statements at about the same time, since the value of turn can be either 0 or 1 but cannot be both. Hence, one of the processes—say, Pj—must have successfully executed the while statement, whereas Pi had to execute at least one additional statement (“turn == j”). However, at that time, flag[j] == true and turn == j, and this condition persists as long as Pj is in its critical section; as a result, mutual exclusion is preserved.

To prove properties 2 and 3, we note that a process Pi can be prevented from entering the critical section only if it is stuck in the while loop with the condition flag[j] == true and turn == j; this loop is the only one possible. If Pj is not ready to enter the critical section, then flag[j] == false, and Pi can enter its critical section. If Pj has set flag[j] to true and is also executing in its while statement, then either turn == i or turn == j. If turn == i, then Pi will enter the critical section. If turn == j, then Pj will enter the critical section. However, once Pj exits its critical section, it will reset flag[j] to false, allowing Pi to enter its critical section. If Pj resets flag[j] to true, it must also set turn to i. Thus, since Pi does not change the value of the variable turn while executing the while statement, Pi will enter the critical section (progress) after at most one entry by Pj (bounded waiting).

6.4 Synchronization Hardware

We have just described one software-based solution to the critical-section problem. However, as mentioned, software-based solutions such as Peterson's are not guaranteed to work on modern computer architectures. Instead, we can generally state that any solution to the critical-section problem requires a simple tool—a lock. Race conditions are prevented by requiring that critical regions be protected by locks. A process must acquire a lock before entering a critical section; it releases the lock when it exits the critical section. This is illustrated in Figure 6.3.

      while (true) {
         acquire lock

            critical section

         release lock

            remainder section
      }

      Figure 6.3 Solution to the critical-section problem using locks.
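In Java, the lock structure of Figure 6.3 maps directly onto the standard java.util.concurrent.locks package. The sketch below is our own illustration (not code from the text) using ReentrantLock; the try/finally idiom guarantees that the lock is released even if the critical section throws an exception:

      import java.util.concurrent.locks.Lock;
      import java.util.concurrent.locks.ReentrantLock;

      public class LockedCounter {
         private final Lock lock = new ReentrantLock();
         private int count = 0;

         public void increment() {
            lock.lock();          // acquire lock
            try {
               ++count;           // critical section
            } finally {
               lock.unlock();     // release lock
            }
            // remainder section would follow here
         }
      }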
In the following discussions, we explore several more solutions to the critical-section problem using techniques ranging from hardware to software-based APIs available to application programmers. All these solutions are based on the premise of locking; however, as we shall see, the designs of such locks can be quite sophisticated.

We start by presenting some simple hardware instructions that are available on many systems and showing how they can be used effectively in solving the critical-section problem. Hardware features can make any programming task easier and improve system efficiency.

The critical-section problem could be solved simply in a single-processor environment if we could prevent interrupts from occurring while a shared variable was being modified. In this manner, we could be sure that the current sequence of instructions would be allowed to execute in order without preemption. No other instructions would be run, so no unexpected modifications could be made to the shared variable. This is often the approach taken by nonpreemptive kernels.

Unfortunately, this solution is not as feasible in a multiprocessor environment. Disabling interrupts on a multiprocessor can be time consuming, as the message is passed to all the processors. This message passing delays entry into each critical section, and system efficiency decreases. Also consider the effect on a system's clock if the clock is kept updated by interrupts.

Many modern computer systems therefore provide special hardware instructions that allow us either to test and modify the content of a word or to swap the contents of two words atomically—that is, as one uninterruptible unit. We can use these special instructions to solve the critical-section problem in a relatively simple manner. Rather than discussing one specific instruction for one specific machine, we abstract the main concepts behind these types of instructions. The HardwareData class shown in Figure 6.4 illustrates the instructions.

The getAndSet() method implementing the get-and-set instruction is shown in Figure 6.4. The important characteristic of this instruction is that it is executed atomically. Thus, if two get-and-set instructions are executed simultaneously (each on a different CPU), they will be executed sequentially in some arbitrary order.

      public class HardwareData {
         private boolean value = false;

         public HardwareData(boolean value) {
            this.value = value;
         }

         public boolean get() {
            return value;
         }

         public void set(boolean newValue) {
            value = newValue;
         }

         public boolean getAndSet(boolean newValue) {
            boolean oldValue = this.get();
            this.set(newValue);
            return oldValue;
         }

         public void swap(HardwareData other) {
            boolean temp = this.get();
            this.set(other.get());
            other.set(temp);
         }
      }

      Figure 6.4 Data structure for hardware solutions.

If the machine supports the get-and-set instruction, then we can implement mutual exclusion by declaring lock to be an object of class HardwareData and initializing it to false. All threads will share access to the lock object. Figure 6.5 illustrates the structure of an arbitrary thread. Notice that this thread uses the yield() method introduced in Section 5.7. Invoking yield() keeps the thread in the runnable state but also allows the JVM to select another runnable thread to run.

The swap instruction, defined in the swap() method in Figure 6.4, operates on the contents of two words; like the get-and-set instruction, it is executed atomically.
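An aside before we turn to the swap-based lock: real Java programs need not hand-build HardwareData, because the standard class java.util.concurrent.atomic.AtomicBoolean already provides an atomic getAndSet() method. A brief spinlock sketch built on it (our illustration, not the text's) mirrors the structure of Figure 6.5:

      import java.util.concurrent.atomic.AtomicBoolean;

      public class SpinLock {
         private final AtomicBoolean locked = new AtomicBoolean(false);

         public void lock() {
            // spin until we are the thread that flips false -> true
            while (locked.getAndSet(true))
               Thread.yield();   // let another runnable thread execute
         }

         public void unlock() {
            locked.set(false);
         }
      }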
If the machine supports the swap instruction, then mutual exclusion can be provided as follows. All threads share an object lock of class HardwareData that is initialized to false. In addition, each thread has a local HardwareData object key. The structure of an arbitrary thread is shown in Figure 6.6.

Unfortunately for hardware designers, implementing atomic get-and-set instructions on multiprocessors is not a trivial task. Such implementations are discussed in books on computer architecture.

      // lock is shared by all threads
      HardwareData lock = new HardwareData(false);

      while (true) {
         while (lock.getAndSet(true))
            Thread.yield();

         // critical section

         lock.set(false);

         // remainder section
      }

      Figure 6.5 Thread using get-and-set lock.

6.5 Semaphores

The hardware-based solutions to the critical-section problem presented in Section 6.4 are complicated for application programmers to use. To overcome this difficulty, we can use a synchronization tool called a semaphore. A semaphore S contains an integer variable that, apart from initialization, is accessed only through two standard operations: acquire() and release(). These operations were originally termed P (from the Dutch proberen, meaning “to test”) and V (from verhogen, meaning “to increment”).

Assuming value represents the integer value of the semaphore, the definitions of acquire() and release() are shown in Figure 6.7. Modifications to the integer value of the semaphore in the acquire() and release() operations must be executed indivisibly. That is, when one thread modifies the semaphore value, no other thread can simultaneously modify that same semaphore value.

      // lock is shared by all threads
      HardwareData lock = new HardwareData(false);

      // each thread has a local copy of key
      HardwareData key = new HardwareData(true);

      while (true) {
         key.set(true);
         do {
            lock.swap(key);
         } while (key.get() == true);

         // critical section

         lock.set(false);

         // remainder section
      }

      Figure 6.6 Thread using swap instruction.

      acquire() {
         while (value <= 0)
            ; // no-op
         value--;
      }

      release() {
         value++;
      }

      Figure 6.7 The definitions of acquire() and release().
The semaphore is initialized to the number of resources available. Each thread that wishes to use a resource performs an acquire() operation on the semaphore (thereby decrementing the count). When a thread releases a resource, it performs a release() operation (incrementing the count). When the count for the semaphore goes to 0, all resources are being used. After that, threads that wish to use a resource will block until the count becomes greater than 0. 6.5 Semaphores 251 public class Worker implements Runnable { private Semaphore sem; public Worker(Semaphore sem) { this.sem = sem; } public void run() { while (true) { sem.acquire(); criticalSection(); sem.release(); remainderSection(); } } } public class SemaphoreFactory { public static void main(String args[]) { Semaphore sem = new Semaphore(1); Thread[] bees = new Thread[5]; for (int i = 0; i < 5; i++) bees[i] = new Thread(new Worker(sem)); for (int i = 0; i < 5; i++) bees[i].start(); } } Figure 6.8 Synchronization using semaphores. We can also use semaphores to solve various synchronization problems. For example, consider two concurrently running processes: P1 with a statement S1 and P2 with a statement S2. Suppose we require that S2 be executed only after S1 has completed. We can implement this scheme readily by letting P1 and P2 share a common semaphore synch, initialized to 0, and by inserting the statements S1; synch.release(); in process P1 and the statements synch.acquire(); S2; in process P2.Becausesynch is initialized to 0, P2 will execute S2 only after P1 has invoked synch.release(), which is after statement S1 has been executed. 252 Chapter 6 Process Synchronization 6.5.2 Implementation The main disadvantage of the semaphore definition just described is that it requires busy waiting. While a process is in its critical section, any other process that tries to enter its critical section must loop continuously in the entry code. This continual looping is clearly a problem in a multiprogramming system, where a single CPU is shared among many processes. Busy waiting wastes CPU cycles that some other process might be able to use productively. A semaphore that produces this result is also called a spinlock,becausethe process “spins” while waiting for the lock. (Spinlocks have the advantage that no context switch is required when a process must wait on a lock, and a context switchmaytakeconsiderabletime.Thus,whenlocksareexpectedtobeheldfor short times, spinlocks are useful. They are often employed on multiprocessor systems where one thread can “spin” on one processor while another thread performs its critical section on another processor.) To overcome the need for busy waiting, we can modify the definitions of the acquire() and release() semaphore operations. When a process executes the acquire() operation and finds that the semaphore value is not positive, it must wait. However, rather than using busy waiting, the process can block itself. The block operation places a process into a waiting queue associated with the semaphore, and the state of the process is switched to the waiting state. Then, control is transferred to the CPU scheduler, which selects another process to execute. A process that is blocked, waiting on a semaphore S, should be restarted when some other process executes a release() operation. The process is restarted by a wakeup() operation, which changes the process from the waiting state to the ready state. The process is then placed in the ready queue. 
(The CPU may or may not be switched from the running process to the newly ready process, depending on the CPU-scheduling algorithm.)

To implement semaphores under this definition, we define a semaphore as (1) an integer value and (2) a list of processes. When a process must wait on a semaphore, it is added to the list of processes for that semaphore. The release() operation removes one process from the list of waiting processes and awakens that process. The semaphore operations can now be defined as

      acquire() {
         value--;
         if (value < 0) {
            add this process to list
            block();
         }
      }

      release() {
         value++;
         if (value <= 0) {
            remove a process P from list
            wakeup(P);
         }
      }

The block() operation suspends the process that invokes it. The wakeup(P) operation resumes the execution of a blocked process P. These two operations are provided by the operating system as basic system calls.

Note that in this implementation, semaphore values may be negative, although the semaphore value is never negative under the classical definition of semaphores with busy waiting. If the semaphore value is negative, its magnitude is the number of processes waiting on that semaphore. This fact is a result of switching the order of the decrement and the test in the implementation of the acquire() operation.

The list of waiting processes can be easily implemented by a link field in each process control block (PCB). Each semaphore contains an integer value and a pointer to a list of PCBs. One way to add and remove processes from the list, which ensures bounded waiting, is to use a FIFO queue, where the semaphore contains both head and tail pointers to the queue. In general, however, the list may use any queueing strategy. Correct use of semaphores does not depend on a particular queueing strategy for the semaphore lists.

The critical aspect of semaphores is that they be executed atomically. We must guarantee that no two processes can execute acquire() and release() operations on the same semaphore at the same time. This situation creates a critical-section problem, which can be solved in one of two ways.

In a single-processor environment, we can simply inhibit interrupts during the time the acquire() and release() operations are executing. Once interrupts are inhibited, instructions from different processes cannot be interleaved. Only the currently running process executes until interrupts are reenabled and the scheduler can regain control.

In a multiprocessor environment, however, inhibiting interrupts does not work. Instructions from different processes (running on different processors) may be interleaved in some arbitrary way. If the hardware does not provide any special instructions, we can employ any of the correct software solutions for the critical-section problem (Section 6.2), where the critical sections consist of the acquire() and release() operations.

It is important to admit we have not completely eliminated busy waiting with this definition of the acquire() and release() operations. Rather, we have moved busy waiting to the critical sections of application programs. Furthermore, we have limited busy waiting to the critical sections of the acquire() and release() operations. These sections are short (if properly coded, they should be no more than about ten instructions). Thus, the critical section is almost never occupied; busy waiting occurs rarely, and then for only a short time.
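A compact Java rendering of this blocking definition may be helpful. The sketch below is our own, not the text's implementation: instead of an explicit PCB list, it leans on the JVM's built-in per-object wait set, so value never goes negative and each awakened thread re-tests the condition before proceeding:

      public class BlockingSemaphore {
         private int value;

         public BlockingSemaphore(int value) {
            this.value = value;
         }

         public synchronized void acquire() throws InterruptedException {
            while (value <= 0)
               wait();        // plays the role of block(): join the waiting queue
            value--;
         }

         public synchronized void release() {
            value++;
            notify();         // plays the role of wakeup(P): ready one waiter
         }
      }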
An entirely different situation exists with application programs, whose critical sections may be long (minutes or even hours) or may almost always be occupied. In this case, busy waiting is extremely inefficient. Throughout this chapter, we address issues of performance and show techniques to avoid busy waiting. In Section 6.8.7.2, we show how semaphores are provided in the Java API.

6.5.3 Deadlocks and Starvation

The implementation of a semaphore with a waiting queue may result in a situation where two or more processes are waiting indefinitely for an event that can be caused only by one of the waiting processes. The event in question is the execution of a release() operation. When such a state is reached, these processes are said to be deadlocked.

As an illustration, consider a system consisting of two processes, P0 and P1, each accessing two semaphores, S and Q, set to the value 1:

      P0                   P1
      S.acquire();         Q.acquire();
      Q.acquire();         S.acquire();
        ...                  ...
      S.release();         Q.release();
      Q.release();         S.release();

Suppose that P0 executes S.acquire(), and then P1 executes Q.acquire(). When P0 executes Q.acquire(), it must wait until P1 executes Q.release(). Similarly, when P1 executes S.acquire(), it must wait until P0 executes S.release(). Since these signal operations cannot be executed, P0 and P1 are deadlocked.

We say that a set of processes is in a deadlock state when every process in the set is waiting for an event that can be caused only by another process in the set. The events with which we are mainly concerned here are resource acquisition and release; however, other types of events may result in deadlocks, as we show in Chapter 7. In that chapter, we describe various mechanisms for dealing with the deadlock problem.

Another problem related to deadlocks is indefinite blocking, or starvation—a situation in which processes wait indefinitely within the semaphore. Indefinite blocking may occur if we add and remove processes from the list associated with a semaphore in last-in, first-out (LIFO) order.

6.5.4 Priority Inversion

A scheduling challenge arises when a higher-priority process needs to read or modify kernel data that are currently being accessed by a lower-priority process—or a chain of lower-priority processes. Since kernel data are typically protected with a lock, the higher-priority process will have to wait for a lower-priority one to finish with the resource. The situation becomes more complicated if the lower-priority process is preempted in favor of another process with a higher priority.

As an example, assume we have three processes, L, M, and H, whose priorities follow the order L < M < H. Assume that process H requires resource R, which is currently being accessed by process L. Ordinarily, process H would wait for L to finish using resource R. However, now suppose that process M becomes runnable, thereby preempting process L. Indirectly, a process with a lower priority—process M—has affected how long process H must wait for L to relinquish resource R.

This problem is known as priority inversion. It occurs only in systems with more than two priorities, so one solution is to have only two priorities. That is insufficient for most general-purpose operating systems, however.

PRIORITY INVERSION AND THE MARS PATHFINDER

Priority inversion can be more than a scheduling inconvenience.
On systems with tight time constraints (such as real-time systems—see Chapter 19), priority inversion can cause a process to take longer than it should to accomplish a task. When that happens, other failures can cascade, resulting in system failure.

Consider the Mars Pathfinder, a NASA space probe that landed a robot, the Sojourner rover, on Mars in 1997 to conduct experiments. Shortly after the Sojourner began operating, it started to experience frequent computer resets. Each reset reinitialized all hardware and software, including communications. If the problem had not been solved, the Sojourner would have failed in its mission.

The problem was caused by the fact that one high-priority task, “bc dist,” was taking longer than expected to complete its work. This task was being forced to wait for a shared resource that was held by the lower-priority “ASI/MET” task, which in turn was preempted by multiple medium-priority tasks. The “bc dist” task would stall waiting for the shared resource, and ultimately the “bc sched” task would discover the problem and perform the reset. The Sojourner was suffering from a typical case of priority inversion.

The operating system on the Sojourner was VxWorks (see Section 19.6), which had a global variable to enable priority inheritance on all semaphores. After testing, the variable was set on the Sojourner (on Mars!), and the problem was solved. A full description of the problem, its detection, and its solution was written by the software team lead and is available at research.microsoft.com/mbj/Mars Pathfinder/Authoritative Account.html.

Typically, these systems solve the problem by implementing a priority-inheritance protocol. According to this protocol, all processes that are accessing resources needed by a higher-priority process inherit the higher priority until they are finished with the resources in question. When they are finished, their priorities revert to their original values. In the example above, a priority-inheritance protocol would allow process L to temporarily inherit the priority of process H, thereby preventing process M from preempting its execution. When process L had finished using resource R, it would relinquish its inherited priority from H and assume its original priority. Because resource R would now be available, process H—not M—would run next.

6.6 Classic Problems of Synchronization

In this section, we present a number of synchronization problems as examples of a large class of concurrency-control problems. These problems are used for testing nearly every newly proposed synchronization scheme. In our solutions to the problems, we use semaphores for synchronization.

6.6.1 The Bounded-Buffer Problem

The bounded-buffer problem was introduced in Section 6.1; it is commonly used to illustrate the power of synchronization primitives. A solution is shown in Figure 6.9. A producer places an item in the buffer by calling the insert() method (Figure 6.10), and consumers remove items by invoking remove() (Figure 6.11). The mutex semaphore provides mutual exclusion for accesses to the buffer pool and is initialized to 1. The empty and full semaphores count the number of empty and full buffers. The semaphore empty is initialized to the capacity of the buffer—BUFFER_SIZE. The semaphore full is initialized to 0.

The producer thread is shown in Figure 6.12.
The producer alternates between sleeping for a while (the SleepUtilities class is available in WileyPLUS), producing a message, and attempting to place that message into the buffer via the insert() method. The consumer thread is shown in Figure 6.13. The consumer alternates between sleeping and consuming an item using the remove() method. The Factory class (Figure 6.14) creates the producer and consumer threads, passing each a reference to the BoundedBuffer object.

      public class BoundedBuffer<E> implements Buffer<E> {
         private static final int BUFFER_SIZE = 5;

         private E[] buffer;
         private int in, out;
         private Semaphore mutex;
         private Semaphore empty;
         private Semaphore full;

         public BoundedBuffer() {
            // buffer is initially empty
            in = 0;
            out = 0;

            mutex = new Semaphore(1);
            empty = new Semaphore(BUFFER_SIZE);
            full = new Semaphore(0);

            buffer = (E[]) new Object[BUFFER_SIZE];
         }

         public void insert(E item) {
            // Figure 6.10
         }

         public E remove() {
            // Figure 6.11
         }
      }

      Figure 6.9 Solution to the bounded-buffer problem using semaphores.

      // Producers call this method
      public void insert(E item) {
         empty.acquire();
         mutex.acquire();

         // add an item to the buffer
         buffer[in] = item;
         in = (in + 1) % BUFFER_SIZE;

         mutex.release();
         full.release();
      }

      Figure 6.10 The insert() method.

      // Consumers call this method
      public E remove() {
         E item;

         full.acquire();
         mutex.acquire();

         // remove an item from the buffer
         item = buffer[out];
         out = (out + 1) % BUFFER_SIZE;

         mutex.release();
         empty.release();

         return item;
      }

      Figure 6.11 The remove() method.

      import java.util.Date;

      public class Producer implements Runnable {
         private Buffer buffer;

         public Producer(Buffer buffer) {
            this.buffer = buffer;
         }

         public void run() {
            Date message;

            while (true) {
               // nap for awhile
               SleepUtilities.nap();

               // produce an item & enter it into the buffer
               message = new Date();
               buffer.insert(message);
            }
         }
      }

      Figure 6.12 The producer.

      import java.util.Date;

      public class Consumer implements Runnable {
         private Buffer buffer;

         public Consumer(Buffer buffer) {
            this.buffer = buffer;
         }

         public void run() {
            Date message;

            while (true) {
               // nap for awhile
               SleepUtilities.nap();

               // consume an item from the buffer
               message = (Date)buffer.remove();
            }
         }
      }

      Figure 6.13 The consumer.

      import java.util.Date;

      public class Factory {
         public static void main(String args[]) {
            Buffer buffer = new BoundedBuffer();

            // Create the producer and consumer threads
            Thread producer = new Thread(new Producer(buffer));
            Thread consumer = new Thread(new Consumer(buffer));

            producer.start();
            consumer.start();
         }
      }

      Figure 6.14 The Factory class.

6.6.2 The Readers–Writers Problem

Suppose that a database is to be shared among several concurrent processes. Some of these processes may want only to read the database, whereas others may want to update (that is, to read and write) the database. We distinguish between these two types of processes by referring to the former as readers and to the latter as writers. Obviously, if two readers access the shared data simultaneously, no adverse effects will result. However, if a writer and some other process (either a reader or a writer) access the database simultaneously, chaos may ensue.

To ensure that these difficulties do not arise, we require that the writers have exclusive access to the shared database. This requirement leads to the readers–writers problem. Since it was originally stated, this problem has been
used to test nearly every new synchronization primitive. The problem has several variations, all involving priorities. The simplest one, referred to as the first readers–writers problem, requires that no reader be kept waiting unless a writer has already obtained permission to use the shared database. In other words, no reader should wait for other readers to finish simply because a writer is waiting. The second readers–writers problem requires that, once a writer is ready, that writer perform its write as soon as possible. In other words, if a writer is waiting to access the object, no new readers can start reading.

A solution to either problem may result in starvation. In the first case, writers may starve; in the second case, readers may starve. For this reason, other variants of the problem have been proposed. Here, we present the Java class files for a solution to the first readers–writers problem. It does not address starvation. (In the exercises at the end of the chapter, you are asked to modify the solution to make it starvation-free.)

Each reader thread alternates between sleeping and reading, as shown in Figure 6.15. When a reader wishes to read the database, it invokes the acquireReadLock() method; when it has finished reading, it calls releaseReadLock(). Each writer thread (Figure 6.16) performs similarly.

      public class Reader implements Runnable {
         private ReadWriteLock db;

         public Reader(ReadWriteLock db) {
            this.db = db;
         }

         public void run() {
            while (true) {
               // nap for awhile
               SleepUtilities.nap();

               db.acquireReadLock();

               // now read from the database
               SleepUtilities.nap();

               db.releaseReadLock();
            }
         }
      }

      Figure 6.15 A reader.

      public class Writer implements Runnable {
         private ReadWriteLock db;

         public Writer(ReadWriteLock db) {
            this.db = db;
         }

         public void run() {
            while (true) {
               // nap for awhile
               SleepUtilities.nap();

               db.acquireWriteLock();

               // now write to the database
               SleepUtilities.nap();

               db.releaseWriteLock();
            }
         }
      }

      Figure 6.16 A writer.

The methods called by each reader and writer thread are defined in the ReadWriteLock interface in Figure 6.17.

      public interface ReadWriteLock {
         public void acquireReadLock();
         public void acquireWriteLock();
         public void releaseReadLock();
         public void releaseWriteLock();
      }

      Figure 6.17 The interface for the readers–writers problem.

      public class Database implements ReadWriteLock {
         private int readerCount;
         private Semaphore mutex;
         private Semaphore db;

         public Database() {
            readerCount = 0;
            mutex = new Semaphore(1);
            db = new Semaphore(1);
         }

         public void acquireReadLock() {
            // Figure 6.19
         }

         public void releaseReadLock() {
            // Figure 6.19
         }

         public void acquireWriteLock() {
            // Figure 6.20
         }

         public void releaseWriteLock() {
            // Figure 6.20
         }
      }

      Figure 6.18 The database for the readers–writers problem.

The Database class in Figure 6.18
Note that, if a writer is active in the database and n readers are waiting, then one reader is queued on db and n − 1readersarequeuedonmutex. Also observe that, when a writer executes db.release(),wemayresumetheexecutionof either the waiting readers or a single waiting writer. The selection is made by the scheduler. Read–write locks are most useful in the following situations: • In applications where it is easy to identify which threads only read shared data and which threads only write shared data. • In applications that have more readers than writers. This is because read– write locks generally require more overhead to establish than semaphores 262 Chapter 6 Process Synchronization public void acquireReadLock() { mutex.acquire(); /** * The first reader indicates that * the database is being read. */ ++readerCount; if (readerCount == 1) db.acquire(); mutex.release(); } public void releaseReadLock() { mutex.acquire(); /** * The last reader indicates that * the database is no longer being read. */ --readerCount; if (readerCount == 0) db.release(); mutex.release(); } Figure 6.19 Methods called by readers. or mutual exclusion locks, and the overhead for setting up a read–write lock is balanced by the increased concurrency of allowing multiple readers. 6.6.3 The Dining-Philosophers Problem Consider five philosophers who spend their lives thinking and eating. The philosophers share a circular table surrounded by five chairs, each belonging to one philosopher. In the center of the table is a bowl of rice, and the table is laid with five single chopsticks (Figure 6.21). When a philosopher thinks, she does public void acquireWriteLock() { db.acquire(); } public void releaseWriteLock() { db.release(); } Figure 6.20 Methods called by writers. 6.6 Classic Problems of Synchronization 263 RICE Figure 6.21 The situation of the dining philosophers. not interact with her colleagues. From time to time, a philosopher gets hungry and tries to pick up the two chopsticks that are closest to her (the chopsticks that are between her and her left and right neighbors). A philosopher may pick up only one chopstick at a time. Obviously, she cannot pick up a chopstick that is already in the hand of a neighbor. When a hungry philosopher has both her chopsticks at the same time, she eats without releasing her chopsticks. When she is finished eating, she puts down both of her chopsticks and starts thinking again. The dining-philosophers problem is considered a classic synchronization problem neither because of its practical importance nor because computer scientists dislike philosophers but because it is an example of a large class of concurrency-control problems. It is a simple representation of the need to allocate several resources among several processes in a deadlock-free and starvation-free manner. One simple solution is to represent each chopstick with a semaphore. A philosopher tries to grab the chopstick by executing an acquire() operation on that semaphore; she releases a chopstick by executing the release() operation on the appropriate semaphores. Thus, the shared data are Semaphore chopStick[] = new Semaphore[5]; for(int i = 0; i < 5; i++) chopStick[i] = new Semaphore(1); where all the elements of chopstick are initialized to 1. The structure of philosopher i is shown in Figure 6.22. Although this solution guarantees that no two neighboring philosophers are eating simultaneously, it nevertheless must be rejected because it has the possibility of creating a deadlock. 
Suppose that all five philosophers become hungry simultaneously and each grabs her left chopstick. All the elements of chopstick will now be equal to 0. When each philosopher tries to grab her right chopstick, she will be delayed forever. Several possible remedies to the deadlock problem are listed next. These remedies prevent deadlock by placing restrictions on the philosophers: 264 Chapter 6 Process Synchronization while (true) { // get left chopstick chopStick[i].acquire(); // get right chopstick chopStick[(i + 1) % 5].acquire(); eating(); // return left chopstick chopStick[i].release(); // return right chopstick chopStick[(i + 1) % 5].release(); thinking(); } Figure 6.22 The structure of philosopher i. • Allow at most four philosophers to be sitting simultaneously at the table. • Allow a philosopher to pick up her chopsticks only if both chopsticks are available (note that she must pick them up in a critical section). • Use an asymmetric solution; for example, an odd philosopher picks up first her left chopstick and then her right chopstick, whereas an even philosopher picks up her right chopstick and then her left chopstick. In Section 6.7, we present a solution to the dining-philosophers problem that ensures freedom from deadlocks. Note, however, that any satisfactory solution to the dining-philosophers problem must guard against the possibility that one of the philosophers will starve to death. A deadlock-free solution does not necessarily eliminate the possibility of starvation. 6.7 Monitors Although semaphores provide a convenient and effective mechanism for process synchronization, using them incorrectly can result in timing errors that are difficult to detect, since these errors happen only if some particular execution sequences take place and these sequences do not always occur. We have seen an example of such errors in the use of counters in our solution to the producer–consumer problem (Section 6.1). In that example, the timing problem happened only rarely, and even then the counter value appeared to be reasonable—off by only 1. Nevertheless, the solution is obviously not an acceptable one. It is for this reason that semaphores were introduced in the first place. Unfortunately, such timing errors can still occur when semaphores are used. To illustrate how, we review the semaphore solution to the critical-section problem. All processes share a semaphore variable mutex, which is initialized to 1. Each process must execute mutex.acquire() before entering the critical section and mutex.release() afterward. If this sequence is not observed, two 6.7 Monitors 265 processes may be in their critical sections simultaneously. Let us examine the various difficulties that may result. Note that these difficulties will arise even if a single process is not well behaved. This situation may be caused by an honest programming error or an uncooperative programmer. • Suppose that a process interchanges the order in which the acquire() and release() operations on the semaphore mutex are executed, resulting in the following execution: mutex.release(); ... critical section ... mutex.acquire(); In this situation, several processes may be executing in their critical sections simultaneously, violating the mutual-exclusion requirement. This error may be discovered only if several processes are simultaneously active in their critical sections. Note that this situation may not always be reproducible. • Suppose that a process replaces the mutex.release() operation with the mutex.acquire() operation. 
These examples illustrate that various types of errors can be generated easily when programmers use semaphores incorrectly to solve the critical-section problem. Similar problems may arise in the other synchronization models that we discussed in Section 6.6. To deal with such errors, researchers have developed high-level language constructs. In this section, we describe one fundamental high-level synchronization construct—the monitor type.

6.7.1 Usage

A type, or abstract data type, encapsulates private data with public methods to operate on that data. A monitor type presents a set of programmer-defined operations that are provided mutual exclusion within the monitor. The monitor type also contains the declaration of variables whose values define the state of an instance of that type, along with the bodies of methods or functions that operate on those variables. The syntax of a monitor is shown in Figure 6.23.

monitor monitor-name
{
   // shared variable declarations

   procedure P1 ( . . . ) {
      . . .
   }

   procedure P2 ( . . . ) {
      . . .
   }

   . . .

   procedure Pn ( . . . ) {
      . . .
   }

   initialization code ( . . . ) {
      . . .
   }
}

Figure 6.23 Syntax of a monitor.

The representation of a monitor type cannot be used directly by the various processes. Thus, a procedure defined within a monitor can access only those variables declared locally within the monitor and its formal parameters. Similarly, the local variables of a monitor can be accessed only by the local procedures.

The monitor construct ensures that only one process at a time can be active within the monitor. Consequently, the programmer does not need to code this synchronization constraint explicitly (Figure 6.24).

Figure 6.24 Schematic view of a monitor.

However, the monitor construct, as defined so far, is not sufficiently powerful for modeling some synchronization schemes. For this purpose, we need to define additional synchronization mechanisms. These mechanisms are provided by the condition variable construct. A programmer who needs to write a tailor-made synchronization scheme can define one or more variables of type Condition:

Condition x, y;

The only operations that can be invoked on a condition variable are wait() and signal(). The operation

x.wait();

means that the process invoking this operation is suspended until another process invokes

x.signal();

The x.signal() operation resumes exactly one suspended process. If no process is suspended, then the signal() operation has no effect; that is, the state of x is the same as if the operation had never been executed (Figure 6.25). Contrast this operation with the signal() operation associated with semaphores, which always affects the state of the semaphore.

Figure 6.25 Monitor with condition variables.

Now suppose that, when the x.signal() operation is invoked by a process P, there is a suspended process Q associated with condition x. Clearly, if the suspended process Q is allowed to resume its execution, the signaling process P must wait. Otherwise, both P and Q would be active simultaneously within the monitor. Note, however, that both processes can conceptually continue with their execution. Two possibilities exist:

1. Signal and wait. P either waits until Q leaves the monitor or waits for another condition.

2. Signal and continue. Q either waits until P leaves the monitor or waits for another condition.
There are reasonable arguments in favor of adopting either option. On the one hand, since P was already executing in the monitor, the signal-and-continue method seems more reasonable. On the other hand, if we allow thread P to continue, then by the time Q is resumed, the logical condition for which Q was waiting may no longer hold. A compromise between these two choices was adopted in the language Concurrent Pascal: when thread P executes the signal() operation, it immediately leaves the monitor, and Q is immediately resumed.

Many programming languages have incorporated the idea of the monitor as described in this section, including Concurrent Pascal, Mesa, C# (pronounced C-sharp), and Java. Other languages—such as Erlang—provide some type of concurrency support using a similar mechanism. We discuss Java's synchronization mechanism fully in Section 6.8.

6.7.2 Dining-Philosophers Solution Using Monitors

Next, we illustrate monitor concepts by presenting a deadlock-free solution to the dining-philosophers problem. This solution imposes the restriction that a philosopher may pick up her chopsticks only if both of them are available. To code this solution, we need to distinguish among three states in which we may find a philosopher. For this purpose, we introduce the following data structures:

enum State {THINKING, HUNGRY, EATING};
State[] state = new State[5];

Philosopher i can set the variable state[i] = State.EATING only if her two neighbors are not eating. That is, the conditions (state[(i + 4) % 5] != State.EATING) and (state[(i + 1) % 5] != State.EATING) must hold. We also need to declare

Condition[] self = new Condition[5];

where philosopher i can delay herself when she is hungry but is unable to obtain the chopsticks she needs.

We can now describe our solution to the dining-philosophers problem. The distribution of the chopsticks is controlled by the monitor dp, which is an instance of the monitor type DiningPhilosophers. In Figure 6.26, we show the definition of this monitor type using a Java-like pseudocode.

monitor DiningPhilosophers
{
   enum State {THINKING, HUNGRY, EATING};
   State[] state = new State[5];
   Condition[] self = new Condition[5];

   public DiningPhilosophers() {
      for (int i = 0; i < 5; i++)
         state[i] = State.THINKING;
   }

   public void takeForks(int i) {
      state[i] = State.HUNGRY;
      test(i);
      if (state[i] != State.EATING)
         self[i].wait();
   }

   public void returnForks(int i) {
      state[i] = State.THINKING;
      // test left and right neighbors
      test((i + 4) % 5);
      test((i + 1) % 5);
   }

   private void test(int i) {
      if ((state[(i + 4) % 5] != State.EATING) &&
          (state[i] == State.HUNGRY) &&
          (state[(i + 1) % 5] != State.EATING)) {
         state[i] = State.EATING;
         self[i].signal();
      }
   }
}

Figure 6.26 A monitor solution to the dining-philosophers problem.

Each philosopher, before starting to eat, must invoke the operation takeForks(). This act may result in the suspension of the philosopher thread. After the successful completion of the operation, the philosopher may eat. After she eats, the philosopher invokes the returnForks() operation and starts to think. Thus, philosopher i must invoke the operations takeForks() and returnForks() in the following sequence:

dp.takeForks(i);
eat();
dp.returnForks(i);
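Since Java is the language of this text, it is worth seeing how Figure 6.26 might be rendered with the mechanisms introduced in Section 6.8. The sketch below is our own translation, not part of the figure: because a Java object has only a single unnamed condition, the per-philosopher condition variables self[i] are replaced by a while loop around wait() together with notifyAll().

public class DiningPhilosophersMonitor {
   private enum State { THINKING, HUNGRY, EATING }
   private final State[] state = new State[5];

   public DiningPhilosophersMonitor() {
      for (int i = 0; i < 5; i++)
         state[i] = State.THINKING;
   }

   public synchronized void takeForks(int i) throws InterruptedException {
      state[i] = State.HUNGRY;
      test(i);
      while (state[i] != State.EATING)
         wait();   // stands in for self[i].wait()
   }

   public synchronized void returnForks(int i) {
      state[i] = State.THINKING;
      test((i + 4) % 5);   // test left neighbor
      test((i + 1) % 5);   // test right neighbor
      notifyAll();         // wake all waiters; each rechecks its own state
   }

   // Set philosopher i to EATING if she is hungry and neither neighbor is eating.
   private void test(int i) {
      if (state[(i + 4) % 5] != State.EATING &&
          state[i] == State.HUNGRY &&
          state[(i + 1) % 5] != State.EATING)
         state[i] = State.EATING;
   }
}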
It is easy to show that this monitor solution ensures that no two neighboring philosophers are eating simultaneously and that no deadlocks will occur. However, it is possible for a philosopher to starve to death. We do not present a solution to that problem but rather ask you in the chapter-ending exercises to develop one.

6.8 Java Synchronization

Now that we have provided a grounding in synchronization theory, we can describe how Java synchronizes the activity of threads, allowing the programmer to develop generalized solutions to enforce mutual exclusion between threads. When an application ensures that data remain consistent even when accessed concurrently by multiple threads, the application is said to be thread-safe.

6.8.1 Bounded Buffer

In Chapter 3, we described a shared-memory solution to the bounded-buffer problem. This solution suffers from two disadvantages. First, both the producer and the consumer use busy-waiting loops if the buffer is either full or empty. Second, the variable count, which is shared by the producer and the consumer, may develop a race condition, as described in Section 6.1. This section addresses these and other problems while developing a solution using Java synchronization mechanisms.

6.8.1.1 Busy Waiting and Livelock

Busy waiting was introduced in Section 6.5.2, where we examined an implementation of the acquire() and release() semaphore operations. In that section, we described how a process could block itself as an alternative to busy waiting. One way to accomplish such blocking in Java is to have a thread call the Thread.yield() method. Recall from Section 5.7 that, when a thread invokes the yield() method, the thread stays in the runnable state but allows the JVM to select another runnable thread to run. The yield() method makes more effective use of the CPU than busy waiting does.

In this instance, however, using either busy waiting or yielding may lead to another problem, known as livelock. Livelock is similar to deadlock; both prevent two or more threads from proceeding, but the threads are unable to proceed for different reasons. Deadlock occurs when every thread in a set is blocked waiting for an event that can be caused only by another blocked thread in the set. Livelock occurs when a thread continuously attempts an action that fails.

Here is one scenario that could cause livelock. Recall that the JVM schedules threads using a priority-based algorithm, favoring high-priority threads over threads with lower priority. If the producer has a priority higher than that of the consumer and the buffer is full, the producer will enter the while loop shown in Figure 6.27 and either busy-wait or yield() to another runnable thread while waiting for count to be decremented to less than BUFFER_SIZE.

// Producers call this method
public synchronized void insert(E item) {
   while (count == BUFFER_SIZE)
      Thread.yield();

   buffer[in] = item;
   in = (in + 1) % BUFFER_SIZE;
   ++count;
}

// Consumers call this method
public synchronized E remove() {
   E item;

   while (count == 0)
      Thread.yield();

   item = buffer[out];
   out = (out + 1) % BUFFER_SIZE;
   --count;

   return item;
}

Figure 6.27 Synchronized insert() and remove() methods.
As long as the consumer has a priority lower than that of the producer, it may never be scheduled by the JVM to run and therefore may never be able to consume an item and free up buffer space for the producer. In this situation, the producer is livelocked waiting for the consumer to free buffer space. We will see shortly that there is a better alternative than busy waiting or yielding while waiting for a desired event to occur.

6.8.1.2 Race Condition

In Section 6.1, we saw an example of the consequences of a race condition on the shared variable count. Figure 6.27 illustrates how Java's handling of concurrent access to shared data prevents race conditions. In describing this situation, we introduce a new keyword: synchronized.

Every object in Java has associated with it a single lock. An object's lock may be owned by a single thread. Ordinarily, when an object is being referenced (that is, when its methods are being invoked), the lock is ignored. When a method is declared to be synchronized, however, calling the method requires owning the lock for the object. If the lock is already owned by another thread, the thread calling the synchronized method blocks and is placed in the entry set for the object's lock. The entry set represents the set of threads waiting for the lock to become available. If the lock is available when a synchronized method is called, the calling thread becomes the owner of the object's lock and can enter the method. The lock is released when the thread exits the method. If the entry set for the lock is not empty when the lock is released, the JVM arbitrarily selects a thread from this set to be the owner of the lock. (When we say "arbitrarily," we mean that the specification does not require that threads in this set be organized in any particular order. However, in practice, most virtual machines order threads in the wait set according to a FIFO policy.) Figure 6.28 illustrates how the entry set operates.

Figure 6.28 Entry set.

If the producer calls the insert() method, as shown in Figure 6.27, and the lock for the object is available, the producer becomes the owner of the lock; it can then enter the method, where it can alter the value of count and other shared data. If the consumer attempts to call the synchronized remove() method while the producer owns the lock, the consumer will block because the lock is unavailable. When the producer exits the insert() method, it releases the lock. The consumer can now acquire the lock and enter the remove() method.

6.8.1.3 Deadlock

At first glance, this approach appears at least to solve the problem of having a race condition on the variable count. Because both the insert() method and the remove() method are declared synchronized, we have ensured that only one thread can be active in either of these methods at a time. However, lock ownership has led to another problem. Assume that the buffer is full and the consumer is sleeping. If the producer calls the insert() method, it will be allowed to continue, because the lock is available. When the producer invokes the insert() method, it sees that the buffer is full and performs the yield() method. All the while, the producer still owns the lock for the object.
When the consumer awakens and tries to call the remove() method (which would ultimately free up buffer space for the producer), it will block because it does not own the lock for the object. Thus, both the producer and the consumer are unable to proceed because (1) the producer is blocked waiting for the consumer to free space in the buffer and (2) the consumer is blocked waiting for the producer to release the lock. By declaring each method as synchronized, we have prevented the race condition on the shared variables. However, the presence of the yield() loop has led to a possible deadlock.

6.8.1.4 Wait and Notify

Figure 6.29 addresses the yield() loop by introducing two new Java methods: wait() and notify().

// Producers call this method
public synchronized void insert(E item) {
   while (count == BUFFER_SIZE) {
      try {
         wait();
      }
      catch (InterruptedException e) {}
   }

   buffer[in] = item;
   in = (in + 1) % BUFFER_SIZE;
   ++count;

   notify();
}

// Consumers call this method
public synchronized E remove() {
   E item;

   while (count == 0) {
      try {
         wait();
      }
      catch (InterruptedException e) {}
   }

   item = buffer[out];
   out = (out + 1) % BUFFER_SIZE;
   --count;

   notify();

   return item;
}

Figure 6.29 insert() and remove() methods using wait() and notify().

In addition to having a lock, every object also has associated with it a wait set consisting of a set of threads. This wait set is initially empty. When a thread enters a synchronized method, it owns the lock for the object. However, this thread may determine that it is unable to continue because a certain condition has not been met. That will happen, for example, if the producer calls the insert() method and the buffer is full. The thread then will release the lock and wait until the condition that will allow it to continue is met, thus avoiding the previous deadlock situation.

When a thread calls the wait() method, the following happens:

1. The thread releases the lock for the object.

2. The state of the thread is set to blocked.

3. The thread is placed in the wait set for the object.

Consider the example in Figure 6.29. If the producer calls the insert() method and sees that the buffer is full, it calls the wait() method. This call releases the lock, blocks the producer, and puts the producer in the wait set for the object. Because the producer has released the lock, the consumer ultimately enters the remove() method, where it frees space in the buffer for the producer. Figure 6.30 illustrates the entry and wait sets for a lock. (Note that wait() can result in an InterruptedException being thrown. We will cover this in Section 6.8.6.)

Figure 6.30 Entry and wait sets.

How does the consumer thread signal that the producer may now proceed? Ordinarily, when a thread exits a synchronized method, the departing thread releases only the lock associated with the object, possibly removing a thread from the entry set and giving it ownership of the lock. However, at the end of the synchronized insert() and remove() methods, we have a call to the method notify(). The call to notify():

1. Picks an arbitrary thread T from the list of threads in the wait set

2. Moves T from the wait set to the entry set

3. Sets the state of T from blocked to runnable

T is now eligible to compete for the lock with the other threads. Once T has regained control of the lock, it returns from calling wait(), where it may check the value of count again.
Next, we describe the wait() and notify() methods in terms of the program shown in Figure 6.29. We assume that the buffer is full and the lock for the object is available.

• The producer calls the insert() method, sees that the lock is available, and enters the method. Once in the method, the producer determines that the buffer is full and calls wait(). The call to wait() releases the lock for the object, sets the state of the producer to blocked, and puts the producer in the wait set for the object.

• The consumer ultimately calls and enters the remove() method, as the lock for the object is now available. The consumer removes an item from the buffer and calls notify(). Note that the consumer still owns the lock for the object.

• The call to notify() removes the producer from the wait set for the object, moves the producer to the entry set, and sets the producer's state to runnable.

• The consumer exits the remove() method. Exiting this method releases the lock for the object.

• The producer tries to reacquire the lock and is successful. It resumes execution from the call to wait(). The producer tests the while loop, determines that room is available in the buffer, and proceeds with the remainder of the insert() method.

If no thread is in the wait set for the object, the call to notify() is ignored. When the producer exits the method, it releases the lock for the object.

The BoundedBuffer class shown in Figure 6.31 represents the complete solution to the bounded-buffer problem using Java synchronization. This class may be substituted for the BoundedBuffer class used in the semaphore-based solution to this problem in Section 6.6.1.

public class BoundedBuffer<E> implements Buffer<E> {
   private static final int BUFFER_SIZE = 5;
   private int count, in, out;
   private E[] buffer;

   public BoundedBuffer() {
      // buffer is initially empty
      count = 0;
      in = 0;
      out = 0;
      buffer = (E[]) new Object[BUFFER_SIZE];
   }

   public synchronized void insert(E item) {
      // Figure 6.29
   }

   public synchronized E remove() {
      // Figure 6.29
   }
}

Figure 6.31 Bounded buffer.

6.8.2 Multiple Notifications

As described in Section 6.8.1.4, the call to notify() arbitrarily selects a thread from the list of threads in the wait set for an object. This approach works fine when only one thread is in the wait set, but consider what can happen when there are multiple threads in the wait set and more than one condition for which to wait. It is possible that a thread whose condition has not yet been met will be the thread that receives the notification.

Suppose, for example, that there are five threads {T0, T1, T2, T3, T4} and a shared variable turn indicating which thread's turn it is. When a thread wishes to do work, it calls the doWork() method in Figure 6.32. Only the thread whose number matches the value of turn can proceed; all other threads must wait their turn.

/**
 * myNumber is the number of the thread
 * that wishes to do some work.
 */
public synchronized void doWork(int myNumber) {
   while (turn != myNumber) {
      try {
         wait();
      }
      catch (InterruptedException e) {}
   }

   // Do some work for awhile . . .

   /**
    * Finished working. Now indicate to the
    * next waiting thread that it is their
    * turn to do some work.
    */
   turn = (turn + 1) % 5;

   notify();
}

Figure 6.32 doWork() method.

Assume the following:

• turn = 3.

• T1, T2, and T4 are in the wait set for the object.

• T3 is currently in the doWork() method.
When thread T3 is done, it sets turn to 4 (indicating that it is T4's turn) and calls notify(). The call to notify() arbitrarily picks a thread from the wait set. If T2 receives the notification, it resumes execution from the call to wait() and tests the condition in the while loop. T2 sees that this is not its turn, so it calls wait() again. Ultimately, T3 and T0 will call doWork() and will also invoke the wait() method, since it is the turn for neither T3 nor T0. Now, all five threads are blocked in the wait set for the object. Thus, we have another deadlock to handle.

Because the call to notify() arbitrarily picks a single thread from the wait set, the developer has no control over which thread is chosen. Fortunately, Java provides a mechanism that allows all threads in the wait set to be notified. The notifyAll() method is similar to notify(), except that every waiting thread is removed from the wait set and placed in the entry set. If the call to notify() in doWork() is replaced with a call to notifyAll(), when T3 finishes and sets turn to 4, it calls notifyAll(). This call has the effect of removing T1, T2, and T4 from the wait set. The three threads then compete for the object's lock once again. Ultimately, T1 and T2 call wait(), and only T4 proceeds with the doWork() method.

In sum, the notifyAll() method is a mechanism that wakes up all waiting threads and lets the threads decide among themselves which of them should run next. In general, notifyAll() is a more expensive operation than notify() because it wakes up all threads, but it is regarded as a more conservative strategy appropriate for situations in which multiple threads may be in the wait set for an object. In the following section, we look at a Java-based solution to the readers–writers problem that requires the use of both notify() and notifyAll().

6.8.3 A Java-Based Solution to the Readers–Writers Problem

We can now provide a solution to the first readers–writers problem by using Java synchronization. The methods called by each reader and writer thread are defined in the Database class in Figure 6.33, which implements the ReadWriteLock interface shown in Figure 6.17.

public class Database implements ReadWriteLock {
   private int readerCount;
   private boolean dbWriting;

   public Database() {
      readerCount = 0;
      dbWriting = false;
   }

   public synchronized void acquireReadLock() {
      // Figure 6.34
   }

   public synchronized void releaseReadLock() {
      // Figure 6.34
   }

   public synchronized void acquireWriteLock() {
      // Figure 6.35
   }

   public synchronized void releaseWriteLock() {
      // Figure 6.35
   }
}

Figure 6.33 Solution to the readers–writers problem using Java synchronization.

The variable readerCount keeps track of the number of readers; a value > 0 indicates that the database is currently being read. dbWriting is a boolean variable indicating whether the database is currently being accessed by a writer. acquireReadLock(), releaseReadLock(), acquireWriteLock(), and releaseWriteLock() are all declared as synchronized to ensure mutual exclusion to the shared variables.

public synchronized void acquireReadLock() {
   while (dbWriting == true) {
      try {
         wait();
      }
      catch (InterruptedException e) {}
   }

   ++readerCount;
}

public synchronized void releaseReadLock() {
   --readerCount;

   /**
    * The last reader indicates that
    * the database is no longer being read.
    */
   if (readerCount == 0)
      notify();
}

Figure 6.34 Methods called by readers.
When a writer wishes to begin writing, it first checks whether the database is currently being either read or written. If the database is being read or written, the writer enters the wait set for the object. Otherwise, it sets dbWriting to true. When a writer is finished, it sets dbWriting to false. When a reader invokes acquireReadLock(), it first checks whether the database is currently being written. If the database is unavailable, the reader enters the wait set for the object; otherwise, it increments readerCount. The final reader calling releaseReadLock() invokes notify(), thereby notifying a waiting writer.

public synchronized void acquireWriteLock() {
   while (readerCount > 0 || dbWriting == true) {
      try {
         wait();
      }
      catch (InterruptedException e) {}
   }

   /**
    * Once there are no readers or a writer,
    * indicate that the database is being written.
    */
   dbWriting = true;
}

public synchronized void releaseWriteLock() {
   dbWriting = false;
   notifyAll();
}

Figure 6.35 Methods called by writers.

When a writer invokes releaseWriteLock(), however, it calls the notifyAll() method rather than notify(). Consider the effect on readers. If several readers wish to read the database while it is being written, and the writer invokes notify() once it has finished writing, only one reader will receive the notification. Other readers will remain in the wait set even though the database is available for reading. By invoking notifyAll(), a departing writer is ensured of notifying all waiting readers.

6.8.4 Block Synchronization

The amount of time between when a lock is acquired and when it is released is defined as the scope of the lock. A synchronized method that has only a small percentage of its code manipulating shared data may yield a scope that is too large. In such an instance, it may be better to synchronize only the block of code that manipulates shared data than to synchronize the entire method. Such a design results in a smaller lock scope. Thus, in addition to declaring synchronized methods, Java also allows blocks of code to be declared as synchronized, as illustrated in Figure 6.36.

Object mutexLock = new Object();
. . .
public void someMethod() {
   nonCriticalSection();

   synchronized(mutexLock) {
      criticalSection();
   }

   remainderSection();
}

Figure 6.36 Block synchronization.

Access to the criticalSection() method in Figure 6.36 requires ownership of the lock for the mutexLock object.

We can also use the wait() and notify() methods in a synchronized block. The only difference is that they must be invoked with the same object that is being used for synchronization. This approach is shown in Figure 6.37.

Object mutexLock = new Object();
. . .
synchronized(mutexLock) {
   try {
      mutexLock.wait();
   }
   catch (InterruptedException ie) {}
}
. . .
synchronized(mutexLock) {
   mutexLock.notify();
}

Figure 6.37 Block synchronization using wait() and notify().

6.8.5 Synchronization Rules

The synchronized keyword is a straightforward construct, but it is important to know a few rules about its behavior.

1. A thread that owns the lock for an object can enter another synchronized method (or block) for the same object. This is known as a recursive or reentrant lock; a short sketch follows this list.

2. A thread can nest synchronized method invocations for different objects. Thus, a thread can simultaneously own the lock for several different objects.

3. If a method is not declared synchronized, then it can be invoked regardless of lock ownership, even while another synchronized method for the same object is executing.

4. If the wait set for an object is empty, then a call to notify() or notifyAll() has no effect.

5. wait(), notify(), and notifyAll() may only be invoked from synchronized methods or blocks; otherwise, an IllegalMonitorStateException is thrown.
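The following minimal sketch, with class and method names of our own choosing, illustrates rule 1: deposit() calls log(), another synchronized method on the same object, without deadlocking on its own lock.

public class ReentrantDemo {
   private int balance = 0;

   public synchronized void deposit(int amount) {
      balance += amount;
      log();   // already own this object's lock; reacquiring it is legal (rule 1)
   }

   public synchronized void log() {
      System.out.println("balance = " + balance);
   }

   public static void main(String[] args) {
      new ReentrantDemo().deposit(10);   // prints "balance = 10"; no self-deadlock
   }
}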
It is also possible to declare static methods as synchronized. This is because, along with the locks that are associated with object instances, there is a single class lock associated with each class. Thus, for a given class, there can be several object locks, one per object instance. However, there is only one class lock.

In addition to using the class lock to declare static methods as synchronized, we can use it in a synchronized block by placing "class name.class" within the synchronized statement. For example, if we wished to use a synchronized block with the class lock for the SomeObject class, we would use the following:

synchronized(SomeObject.class) {
   /**
    * synchronized block of code
    */
}

6.8.6 Handling InterruptedException

Note that invoking the wait() method requires placing it in a try-catch block, as wait() may throw an InterruptedException. Recall from Chapter 4 that the interrupt() method is the preferred technique for interrupting a thread in Java. When interrupt() is invoked on a thread, the interruption status of that thread is set. A thread can check its interruption status using the isInterrupted() method, which returns true if its interruption status is set.

The wait() method also checks the interruption status of a thread. If it is set, wait() will throw an InterruptedException. This allows interruption of a thread that is blocked in the wait set. (It should also be noted that once an InterruptedException is thrown, the interrupted status of the thread is cleared.) For code clarity and simplicity, we choose to ignore this exception in our code examples. That is, all calls to wait() appear as:

try {
   wait();
}
catch (InterruptedException ie) { /* ignore */ }

However, if we choose to handle InterruptedException, we permit the interruption of a thread blocked in a wait set. Doing so allows more robust multithreaded applications, as it provides a mechanism for interrupting a thread that is blocked trying to acquire a mutual exclusion lock. One strategy is to allow the InterruptedException to propagate. That is, in methods where wait() is invoked, we first remove the try-catch blocks when calling wait() and declare such methods as throwing InterruptedException. By doing this, we are allowing the InterruptedException to propagate from the method where wait() is being invoked. However, allowing this exception to propagate requires placing calls to such methods within try-catch (InterruptedException) blocks.
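As a concrete illustration of the propagation strategy, here is a brief sketch of ours (the class and method names are hypothetical): awaitReady() declares InterruptedException instead of catching it, so the decision about how to respond to interruption is pushed to the caller.

public class ReadyFlag {
   private boolean ready = false;

   // Propagates InterruptedException rather than swallowing it.
   public synchronized void awaitReady() throws InterruptedException {
      while (!ready)
         wait();
   }

   public synchronized void setReady() {
      ready = true;
      notifyAll();
   }
}

A caller must then either wrap the call to awaitReady() in its own try-catch (InterruptedException) block or propagate the exception further.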
6.8.7 Concurrency Features in Java

Prior to Java 1.5, the only concurrency features available in Java were the synchronized, wait(), and notify() commands, which are based on single locks for each object. Java 1.5 introduced a rich API consisting of several concurrency features, including various mechanisms for synchronizing concurrent threads. In this section, we cover (1) reentrant locks, (2) semaphores, and (3) condition variables available in the java.util.concurrent and java.util.concurrent.locks packages. Readers interested in the additional features of these packages are encouraged to consult the Java API.

6.8.7.1 Reentrant Locks

Perhaps the simplest locking mechanism available in the API is the ReentrantLock. In many ways, a ReentrantLock acts like the synchronized statement described in Section 6.8.1.2: a ReentrantLock is owned by a single thread and is used to provide mutually exclusive access to a shared resource. However, the ReentrantLock provides several additional features, such as setting a fairness parameter, which favors granting the lock to the longest-waiting thread. (Recall from Section 6.8.1.2 that the specification for the JVM does not indicate that threads in the wait set for an object lock are to be ordered in any specific fashion.)

A thread acquires a ReentrantLock lock by invoking its lock() method. If the lock is available—or if the thread invoking lock() already owns it, which is why it is termed reentrant—lock() assigns the invoking thread lock ownership and returns control. If the lock is unavailable, the invoking thread blocks until it is ultimately assigned the lock when its owner invokes unlock(). ReentrantLock implements the Lock interface; its usage is as follows:

Lock key = new ReentrantLock();

key.lock();
try {
   // critical section
}
finally {
   key.unlock();
}

The programming idiom of using try and finally requires a bit of explanation. If the lock is acquired via the lock() method, it is important that the lock be similarly released. By enclosing unlock() in a finally clause, we ensure that the lock is released once the critical section completes or if an exception occurs within the try block. Notice that we do not place the call to lock() within the try clause, as lock() does not throw any checked exceptions. Consider what happens if we place lock() within the try clause and an unchecked exception occurs when lock() is invoked (such as OutOfMemoryError): the finally clause triggers the call to unlock(), which then throws the unchecked IllegalMonitorStateException, as the lock was never acquired. This IllegalMonitorStateException replaces the unchecked exception that occurred when lock() was invoked, thereby obscuring the reason why the program initially failed.

6.8.7.2 Semaphores

The Java 5 API also provides a counting semaphore, as described in Section 6.5. The constructor for the semaphore appears as

Semaphore(int value);

where value specifies the initial value of the semaphore (a negative value is allowed). The acquire() method throws an InterruptedException if the acquiring thread is interrupted (Section 6.8.6). The following example illustrates using a semaphore for mutual exclusion:

Semaphore sem = new Semaphore(1);

try {
   sem.acquire();
   // critical section
}
catch (InterruptedException ie) {}
finally {
   sem.release();
}

Notice that we place the call to release() in the finally clause to ensure that the semaphore is released.
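The constructor argument need not be 1; a counting semaphore initialized to n admits up to n threads into a section concurrently. The following sketch is our own illustration, and the pooled resource it guards is hypothetical:

import java.util.concurrent.Semaphore;

public class PoolGuard {
   // At most three threads may use the pool at once.
   private final Semaphore available = new Semaphore(3);

   public void usePool() throws InterruptedException {
      available.acquire();
      try {
         // use one of the three pooled resources
      }
      finally {
         available.release();   // always return the permit
      }
   }
}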
6.8.7.3 Condition Variables

The last utility we cover in the Java API is the condition variable. Just as the ReentrantLock (Section 6.8.7.1) is similar to Java's synchronized statement, condition variables provide functionality similar to the wait(), notify(), and notifyAll() methods. Therefore, to provide mutual exclusion, a condition variable must be associated with a reentrant lock.

We create a condition variable by first creating a ReentrantLock and invoking its newCondition() method, which returns a Condition object representing the condition variable for the associated ReentrantLock. This is illustrated in the following statements:

Lock key = new ReentrantLock();
Condition condVar = key.newCondition();

Once the condition variable has been obtained, we can invoke its await() and signal() methods, which function in the same way as the wait() and signal() commands described in Section 6.7.

As mentioned, reentrant locks and condition variables in the Java API function similarly to the synchronized, wait(), and notify() statements. However, one advantage to using the features available in the API is that they often provide more flexibility and control than their synchronized/wait()/notify() counterparts.

Another distinction concerns Java's locking mechanism, in which each object has its own lock. In many ways, this lock acts as a monitor. Every Java object thus has an associated monitor, and a thread can acquire an object's monitor by entering a synchronized method or block. Let's look more closely at this distinction. Recall that, with monitors as described in Section 6.7, the wait() and signal() operations can be applied to named condition variables, allowing a thread to wait for a specific condition or to be notified when a specific condition has been met. At the language level, Java does not provide support for named condition variables. Each Java monitor is associated with just one unnamed condition variable, and the wait(), notify(), and notifyAll() operations apply only to this single condition variable. When a Java thread is awakened via notify() or notifyAll(), it receives no information as to why it was awakened. It is up to the reactivated thread to check for itself whether the condition for which it was waiting has been met. The doWork() method shown in Figure 6.32 highlights this issue; notifyAll() must be invoked to awaken all waiting threads, and—once awake—each thread must check for itself whether the condition it has been waiting for has been met (that is, whether it is that thread's turn).

We further illustrate this distinction by rewriting the doWork() method in Figure 6.32 using condition variables. We first create a ReentrantLock and five condition variables (representing the conditions the threads are waiting for) to signal the thread whose turn is next. This is shown below:

Lock lock = new ReentrantLock();
Condition[] condVars = new Condition[5];

for (int i = 0; i < 5; i++)
   condVars[i] = lock.newCondition();

The modified doWork() method is shown in Figure 6.38.

/**
 * myNumber is the number of the thread
 * that wishes to do some work.
 */
public void doWork(int myNumber) {
   lock.lock();

   try {
      /**
       * If it's not my turn, then wait
       * until I'm signaled.
       */
      if (myNumber != turn)
         condVars[myNumber].await();

      // Do some work for awhile . . .

      /**
       * Finished working. Now indicate to the
       * next waiting thread that it is their
       * turn to do some work.
       */
      turn = (turn + 1) % 5;
      condVars[turn].signal();
   }
   catch (InterruptedException ie) {}
   finally {
      lock.unlock();
   }
}

Figure 6.38 doWork() method with condition variables.

Notice that doWork() is no longer declared as synchronized, since the ReentrantLock provides mutual exclusion. When a thread invokes await() on the condition variable, it releases the associated ReentrantLock, allowing another thread to acquire the mutual exclusion lock. Similarly, when signal() is invoked, only the condition variable is signaled; the lock is released by invoking unlock().
6.9 Synchronization Examples

We next describe the synchronization mechanisms provided by the Solaris, Windows XP, and Linux operating systems, as well as the Pthreads API. We have chosen these three operating systems because they provide good examples of different approaches for synchronizing the kernel, and we have included the Pthreads API because it is widely used for thread creation and synchronization by developers on UNIX and Linux systems. As you will see in this section, the synchronization methods available in these differing systems vary in subtle and significant ways.

6.9.1 Synchronization in Solaris

To control access to critical sections, Solaris provides adaptive mutexes, condition variables, semaphores, reader–writer locks, and turnstiles. Solaris implements semaphores and condition variables essentially as they are presented in Sections 6.5 and 6.7. In this section, we describe adaptive mutexes, reader–writer locks, and turnstiles.

An adaptive mutex protects access to every critical data item. On a multiprocessor system, an adaptive mutex starts as a standard semaphore implemented as a spinlock. If the data are locked and therefore already in use, the adaptive mutex does one of two things. If the lock is held by a thread that is currently running on another CPU, the thread spins while waiting for the lock to become available, because the thread holding the lock is likely to finish soon. If the thread holding the lock is not currently in run state, the thread blocks, going to sleep until it is awakened by the release of the lock. It is put to sleep so that it will not spin while waiting, since the lock will not be freed very soon. A lock held by a sleeping thread is likely to be in this category. On a single-processor system, the thread holding the lock is never running if the lock is being tested by another thread, because only one thread can run at a time. Therefore, on this type of system, threads always sleep rather than spin if they encounter a lock.

Solaris uses the adaptive-mutex method to protect only data that are accessed by short code segments. That is, a mutex is used if a lock will be held for less than a few hundred instructions. If the code segment is longer than that, the spin-waiting method is exceedingly inefficient. For these longer code segments, condition variables and semaphores are used. If the desired lock is already held, the thread issues a wait and sleeps. When a thread frees the lock, it issues a signal to the next sleeping thread in the queue. The extra cost of putting a thread to sleep and waking it, and of the associated context switches, is less than the cost of wasting several hundred instructions waiting in a spinlock.

Reader–writer locks are used to protect data that are accessed frequently but are usually accessed in a read-only manner. In these circumstances, reader–writer locks are more efficient than semaphores, because multiple threads can read data concurrently, whereas semaphores always serialize access to the data. Reader–writer locks are relatively expensive to implement, so again they are used only on long sections of code.

Solaris uses turnstiles to order the list of threads waiting to acquire either an adaptive mutex or a reader–writer lock. A turnstile is a queue structure containing threads blocked on a lock.
For example, if one thread currently owns the lock for a synchronized object, all other threads trying to acquire the lock will block and enter the turnstile for that lock. When the lock is released, the kernel selects a thread from the turnstile as the next owner of the lock. Each synchronized object with at least one thread blocked on the object's lock requires a separate turnstile. However, rather than associating a turnstile with each synchronized object, Solaris gives each kernel thread its own turnstile. Because a thread can be blocked only on one object at a time, this is more efficient than having a turnstile for each object.

The turnstile for the first thread to block on a synchronized object becomes the turnstile for the object itself. Threads subsequently blocking on the lock will be added to this turnstile. When the initial thread ultimately releases the lock, it gains a new turnstile from a list of free turnstiles maintained by the kernel. To prevent a priority inversion, turnstiles are organized according to a priority-inheritance protocol. If a lower-priority thread currently holds a lock on which a higher-priority thread is blocked, the thread with the lower priority will temporarily inherit the priority of the higher-priority thread. Upon releasing the lock, the thread will revert to its original priority.

Note that the locking mechanisms used by the kernel are implemented for user-level threads as well, so the same types of locks are available inside and outside the kernel. A crucial implementation difference is the priority-inheritance protocol. Kernel-locking routines adhere to the kernel priority-inheritance methods used by the scheduler, as described in Section 19.4; user-level thread-locking mechanisms do not provide this functionality.

To optimize Solaris performance, developers have refined and fine-tuned the locking methods. Because locks are used frequently and typically are used for crucial kernel functions, tuning their implementation and use can produce great performance gains.

6.9.2 Synchronization in Windows XP

The Windows XP operating system is a multithreaded kernel that provides support for real-time applications and multiple processors. When the Windows XP kernel accesses a global resource on a single-processor system, it temporarily masks interrupts for all interrupt handlers that may also access the global resource. On a multiprocessor system, Windows XP protects access to global resources by using spinlocks. Just as in Solaris, the kernel uses spinlocks only to protect short code segments. Furthermore, for reasons of efficiency, the kernel ensures that a thread will never be preempted while holding a spinlock.

For thread synchronization outside the kernel, Windows XP provides dispatcher objects. Using a dispatcher object, threads synchronize according to several different mechanisms, including mutexes, semaphores, events, and timers. The system protects shared data by requiring a thread to gain ownership of a mutex to access the data and to release ownership when it is finished. Semaphores behave as described in Section 6.5. Events are similar to condition variables; that is, they may notify a waiting thread when a desired condition occurs. Finally, timers are used to notify one (or more than one) thread that a specified amount of time has expired.
Dispatcher objects may be in either a signaled state or a nonsignaled state. A signaled state indicates that an object is available and a thread will not block when acquiring the object. A nonsignaled state indicates that an object is not available and a thread will block when attempting to acquire the object. We illustrate the state transitions of a mutex lock dispatcher object in Figure 6.39.

Figure 6.39 Mutex dispatcher object.

A relationship exists between the state of a dispatcher object and the state of a thread. When a thread blocks on a nonsignaled dispatcher object, its state changes from ready to waiting, and the thread is placed in a waiting queue for that object. When the state for the dispatcher object moves to signaled, the kernel checks whether any threads are waiting on the object. If so, the kernel moves one thread—or possibly more threads—from the waiting state to the ready state, where they can resume executing. The number of threads the kernel selects from the waiting queue depends on the type of dispatcher object for which they are waiting. The kernel will select only one thread from the waiting queue for a mutex, since a mutex object may be "owned" by only a single thread. For an event object, the kernel will select all threads that are waiting for the event.

We can use a mutex lock as an illustration of dispatcher objects and thread states. If a thread tries to acquire a mutex dispatcher object that is in a nonsignaled state, that thread will be suspended and placed in a waiting queue for the mutex object. When the mutex moves to the signaled state (because another thread has released the lock on the mutex), the thread waiting at the front of the queue will be moved from the waiting state to the ready state and will acquire the mutex lock. We provide a programming project at the end of this chapter that uses mutex locks and semaphores in the Win32 API.

6.9.3 Synchronization in Linux

Prior to Version 2.6, Linux was a nonpreemptive kernel, meaning that a process running in kernel mode could not be preempted—even if a higher-priority process became available to run. Now, however, the Linux kernel is fully preemptive, so a task can be preempted when it is running in the kernel.

The Linux kernel provides spinlocks and semaphores (as well as reader–writer versions of these two locks) for locking in the kernel. On SMP machines, the fundamental locking mechanism is a spinlock, and the kernel is designed so that the spinlock is held only for short durations. On single-processor machines, spinlocks are inappropriate for use and are replaced by enabling and disabling kernel preemption. That is, on single-processor machines, rather than holding a spinlock, the kernel disables kernel preemption; and rather than releasing the spinlock, it enables kernel preemption. This is summarized below:

single processor                    multiple processors

Disable kernel preemption.          Acquire spin lock.
Enable kernel preemption.           Release spin lock.

Linux uses an interesting approach to disable and enable kernel preemption. It provides two simple system calls—preempt_disable() and preempt_enable()—to perform these tasks. In addition, the kernel is not preemptible if a kernel-mode task is holding a lock. To enforce this rule, each task in the system has a thread-info structure containing a counter, preempt_count, to indicate the number of locks being held by the task. When a lock is acquired, preempt_count is incremented. It is decremented when a lock is released.
If the value of preempt_count for the task currently running is greater than zero, it is not safe to preempt the kernel, as this task currently holds a lock. If the count is zero, the kernel can safely be interrupted (assuming there are no outstanding calls to preempt_disable()).

Spinlocks—along with the enabling and disabling of kernel preemption—are used in the kernel only when a lock (or disabled kernel preemption) is held for a short duration. When a lock must be held for a longer period, semaphores are appropriate for use.

6.9.4 Synchronization in Pthreads

The Pthreads API provides mutex locks, condition variables, and read–write locks for thread synchronization. This API is available for programmers and is not part of any particular kernel. Mutex locks represent the fundamental synchronization technique used with Pthreads. A mutex lock is used to protect critical sections of code—that is, a thread acquires the lock before entering a critical section and releases it upon exiting the critical section. Condition variables in Pthreads behave much as described in Section 6.7. Read–write locks behave similarly to the locking mechanism described in Section 6.6.2. Many systems that implement Pthreads also provide semaphores, although they are not part of the Pthreads standard and instead belong to the POSIX SEM extension. Other extensions to the Pthreads API include spinlocks, but not all extensions are considered portable from one implementation to another. We provide a programming project at the end of this chapter that uses Pthreads mutex locks and semaphores.

6.10 Atomic Transactions

The mutual exclusion of critical sections ensures that the critical sections are executed atomically—that is, as one uninterruptible unit. If two critical sections are instead executed concurrently, the result is equivalent to their sequential execution in some unknown order. Although this property is useful in many application domains, in many cases we would like to make sure that a critical section forms a single logical unit of work that either is performed in its entirety or is not performed at all. An example is a funds transfer, in which one account is debited and another is credited. Clearly, it is essential for data consistency either that both the credit and debit occur or that neither occurs.

Consistency of data, along with storage and retrieval of data, is a concern often associated with database systems. Recently, there has been an upsurge of interest in using database-system techniques in operating systems. Operating systems can be viewed as manipulators of data; as such, they can benefit from the advanced techniques and models available from database research. For instance, many of the ad hoc techniques used in operating systems to manage files could be more flexible and powerful if more formal database methods were used in their place. In Sections 6.10.2 to 6.10.4, we describe some of these database techniques and explain how they can be used by operating systems. First, however, we deal with the general issue of transaction atomicity. It is this property that the database techniques are meant to address.

6.10.1 System Model

A collection of instructions (or operations) that performs a single logical function is called a transaction. A major issue in processing transactions is preserving atomicity despite the possibility of failures within the computer system.
We can think of a transaction as a program unit that accesses and perhaps updates various data items that reside on a disk within some files. From our point of view, such a transaction is simply a sequence of read and write operations terminated by either a commit operation or an abort operation. A commit operation signifies that the transaction has terminated its execution successfully, whereas an abort operation signifies that the transaction has ended its normal execution due to a logical error or system failure. If a terminated transaction has completed its execution successfully, it is committed; otherwise, it is aborted.

Since an aborted transaction may already have modified the data that it has accessed, the state of these data may not be the same as it would have been if the transaction had executed atomically. But if atomicity is to be ensured, an aborted transaction must have no effect on the state of the data. Thus, the state of the data accessed by an aborted transaction must be restored to what it was just before the transaction started executing. We say that such a transaction has been rolled back. It is part of the responsibility of the system to ensure this property.

To determine how the system should ensure atomicity, we need first to identify the properties of devices used for storing the various data accessed by the transactions. Various types of storage media are distinguished by their relative speed, capacity, and resilience to failure.

• Volatile storage. Information residing in volatile storage does not usually survive system crashes. Examples of such storage are main and cache memory. Access to volatile storage is extremely fast, both because of the speed of the memory access itself and because it is possible to access directly any data item in volatile storage.

• Nonvolatile storage. Information residing in nonvolatile storage usually survives system crashes. Examples of media for such storage are disks and magnetic tapes. Disks are more reliable than main memory but less reliable than magnetic tapes. Both disks and tapes, however, are subject to failure, which may result in loss of information. Currently, nonvolatile storage is slower than volatile storage by several orders of magnitude, because disk and tape devices are electromechanical and require physical motion to access data.

TRANSACTIONAL MEMORY

With the emergence of multicore systems has come increased pressure to develop multithreaded applications that take advantage of multiple processing cores. However, multithreaded applications present an increased risk of race conditions and deadlocks. Traditionally, techniques such as locks, semaphores, and monitors have been used to address these issues. However, transactional memory provides an alternative strategy for developing thread-safe concurrent applications.

A memory transaction is a sequence of memory read–write operations that are atomic. If all operations in a transaction are completed, the memory transaction is committed; otherwise, the operations must be aborted and rolled back. The benefits of transactional memory can be obtained through features added to a programming language.

Consider an example. Suppose we have a function update() that modifies shared data. Traditionally, this function would be written using locks such as the following:

update() {
   acquire();
   /* modify shared data */
   release();
}

However, using synchronization mechanisms such as locks and semaphores involves many potential problems, including deadlocks. Additionally, as the number of threads increases, traditional locking does not scale well. As an alternative to traditional methods, new features that take advantage of transactional memory can be added to a programming language.
In our example, suppose we add the construct atomic{S}, which ensures that the operations in S execute as a transaction. This allows us to rewrite the update() method as follows:

update() {
   atomic {
      /* modify shared data */
   }
}

The advantage of using such a mechanism rather than locks is that the transactional memory system—not the developer—is responsible for guaranteeing atomicity. Additionally, the system can identify which statements in atomic blocks can be executed concurrently, such as concurrent read access to a shared variable. It is, of course, possible for a programmer to identify these situations and use reader–writer locks, but the task becomes increasingly difficult as the number of threads within an application grows.

Transactional memory can be implemented in either software or hardware. Software transactional memory (STM), as the name suggests, implements transactional memory exclusively in software—no special hardware is needed. STM works by inserting instrumentation code inside transaction blocks. The code is inserted by a compiler and manages each transaction by examining where statements may run concurrently and where specific low-level locking is required. Hardware transactional memory (HTM) uses hardware cache hierarchies and cache coherency protocols to manage and resolve conflicts involving shared data residing in separate processors' caches. HTM requires no special code instrumentation and thus has less overhead than STM. However, HTM does require that existing cache hierarchies and cache coherency protocols be modified to support transactional memory.

Transactional memory has existed for several years without widespread implementation. However, the growth of multicore systems and the associated emphasis on concurrent programming have prompted a significant amount of research in this area on the part of both academics and hardware vendors, including Intel and Sun Microsystems.

• Stable storage. Information residing in stable storage is never lost (never should be taken with a grain of salt, since theoretically such absolutes cannot be guaranteed). To implement an approximation of such storage, we need to replicate information in several nonvolatile storage caches (usually disks) with independent failure modes and to update the information in a controlled manner (Section 12.8).

Here, we are concerned only with ensuring transaction atomicity in an environment where failures result in the loss of information on volatile storage.

6.10.2 Log-Based Recovery

One way to ensure atomicity is to record, on stable storage, information describing all the modifications made by the transaction to the various data it accesses. The most widely used method for achieving this form of recording is write-ahead logging. Here, the system maintains, on stable storage, a data structure called the log. Each log record describes a single write operation of a transaction and has the following fields:

• Transaction name. The unique name of the transaction that performed the write operation.
• Transaction name. The unique name of the transaction that performed the write operation

• Data item name. The unique name of the data item written

• Old value. The value of the data item prior to the write operation

• New value. The value that the data item will have after the write

Other special log records exist to record significant events during transaction processing, such as the start of a transaction and the commit or abort of a transaction. Before a transaction Ti starts its execution, the record <Ti starts> is written to the log. During its execution, any write operation by Ti is preceded by the writing of the appropriate new record to the log. When Ti commits, the record <Ti commits> is written to the log.

Because the information in the log is used in reconstructing the state of the data items accessed by the various transactions, we cannot allow the actual update to a data item to take place before the corresponding log record is written out to stable storage. We therefore require that, prior to execution of a write(X) operation, the log records corresponding to X be written onto stable storage.

Note the performance penalty inherent in this system. Two physical writes are required for every logical write requested. Also, more storage is needed, both for the data themselves and for the log recording the changes. In cases where the data are extremely important and fast failure recovery is necessary, however, the functionality is worth the price.

Using the log, the system can handle any failure that does not result in the loss of information on nonvolatile storage. The recovery algorithm uses two procedures:

• undo(Ti), which restores the value of all data updated by transaction Ti to the old values

• redo(Ti), which sets the value of all data updated by transaction Ti to the new values

The set of data updated by Ti and the appropriate old and new values can be found in the log. Note that the undo and redo operations must be idempotent (that is, multiple executions must have the same result as does one execution) to guarantee correct behavior even if a failure occurs during the recovery process.

If a transaction Ti aborts, then we can restore the state of the data that it has updated by simply executing undo(Ti). If a system failure occurs, we restore the state of all updated data by consulting the log to determine which transactions need to be redone and which need to be undone. This classification of transactions is accomplished as follows:

• Transaction Ti needs to be undone if the log contains the <Ti starts> record but does not contain the <Ti commits> record.

• Transaction Ti needs to be redone if the log contains both the <Ti starts> and the <Ti commits> records.
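As an illustrative aside (not part of the original text), the classification rule can be expressed over an in-memory list of log records; the class and method names here are hypothetical:

    import java.util.*;

    public class RecoveryClassifier {
       // Classify transactions from an in-memory log. Each log entry is
       // modeled as a "Ti starts" or "Ti commits" string for brevity.
       // Returns "redo" for transactions with both <Ti starts> and
       // <Ti commits> records, and "undo" for those with only <Ti starts>.
       public static Map<String, String> classify(List<String> log) {
          Set<String> started = new HashSet<String>();
          Set<String> committed = new HashSet<String>();
          for (String record : log) {
             String[] parts = record.split(" ");
             if (parts[1].equals("starts")) started.add(parts[0]);
             else if (parts[1].equals("commits")) committed.add(parts[0]);
          }
          Map<String, String> actions = new HashMap<String, String>();
          for (String t : started)
             actions.put(t, committed.contains(t) ? "redo" : "undo");
          return actions;
       }

       public static void main(String[] args) {
          List<String> log = Arrays.asList("T0 starts", "T0 commits", "T1 starts");
          System.out.println(classify(log)); // {T0=redo, T1=undo}
       }
    }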
6.10.3 Checkpoints

When a system failure occurs, we must consult the log to determine which transactions need to be redone and which need to be undone. In principle, we need to search the entire log to make these determinations. There are two major drawbacks to this approach:

1. The searching process is time consuming.

2. Most of the transactions that, according to our algorithm, need to be redone have already actually updated the data that the log says they need to modify. Although redoing the data modifications will cause no harm (due to idempotency), it will nevertheless cause recovery to take longer.

To reduce these types of overhead, we introduce the concept of checkpoints. During execution, the system maintains the write-ahead log. In addition, the system periodically performs checkpoints that require the following sequence of actions to take place:

1. Output all log records currently residing in volatile storage (usually main memory) onto stable storage.

2. Output all modified data residing in volatile storage to the stable storage.

3. Output a log record <checkpoint> onto stable storage.

The presence of a <checkpoint> record in the log allows the system to streamline its recovery procedure. Consider a transaction Ti that committed prior to the checkpoint. The <Ti commits> record appears in the log before the <checkpoint> record. Any modifications made by Ti must have been written to stable storage either prior to the checkpoint or as part of the checkpoint itself. Thus, at recovery time, there is no need to perform a redo operation on Ti.

This observation allows us to refine our previous recovery algorithm. After a failure has occurred, the recovery routine examines the log to determine the most recent transaction Ti that started executing before the most recent checkpoint. It finds such a transaction by searching the log backward to find the first <checkpoint> record and then finding the subsequent <Ti starts> record. Once transaction Ti has been identified, the redo and undo operations need be applied only to transaction Ti and all transactions Tj that started executing after transaction Ti. We'll call these transactions set T. The remainder of the log can be ignored. The recovery operations that are required are as follows:

• For all transactions Tk in T for which the record <Tk commits> appears in the log, execute redo(Tk).

• For all transactions Tk in T that have no <Tk commits> record in the log, execute undo(Tk).

6.10.4 Concurrent Atomic Transactions

We have been considering an environment in which only one transaction can be executing at a time. We now turn to the case where multiple transactions are active simultaneously. Because each transaction is atomic, the concurrent execution of transactions must be equivalent to executing the transactions serially in some arbitrary order. This property, called serializability, can be maintained by simply executing each transaction within a critical section. That is, all transactions share a common semaphore mutex, which is initialized to 1. When a transaction starts executing, its first action is to execute wait(mutex). After the transaction either commits or aborts, it executes signal(mutex).
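A direct Java rendering of this restrictive scheme, offered as an illustrative sketch rather than book code, uses java.util.concurrent.Semaphore in place of the wait() and signal() operations:

    import java.util.concurrent.Semaphore;

    public class SerialTransactions {
       // One permit: at most one transaction runs at a time.
       private static final Semaphore mutex = new Semaphore(1);

       // Runs the given transaction body inside the critical section.
       static void runTransaction(Runnable body) throws InterruptedException {
          mutex.acquire();          // wait(mutex)
          try {
             body.run();            // the transaction's reads and writes
          } finally {
             mutex.release();       // signal(mutex) on commit or abort
          }
       }

       public static void main(String[] args) throws InterruptedException {
          runTransaction(() -> System.out.println("T0 executes atomically"));
          runTransaction(() -> System.out.println("T1 executes atomically"));
       }
    }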
Although this scheme ensures the atomicity of all concurrently executing transactions, it is too restrictive. As we shall see, in many cases we can allow transactions to overlap their execution while maintaining serializability. A number of different concurrency-control algorithms ensure serializability, and we describe these algorithms next.

6.10.4.1 Serializability

Consider a system with two data items, A and B, that are both read and written by two transactions, T0 and T1. Suppose that these transactions are executed atomically in the order T0 followed by T1. This execution sequence, which is called a schedule, is represented in Figure 6.40. In this schedule, labeled schedule 1, the sequence of instruction steps is in chronological order from top to bottom, with instructions of T0 appearing in the left column and instructions of T1 appearing in the right column.

       T0           T1
     read(A)
     write(A)
     read(B)
     write(B)
                  read(A)
                  write(A)
                  read(B)
                  write(B)

    Figure 6.40  Schedule 1: A serial schedule in which T0 is followed by T1.

A schedule in which each transaction is executed atomically is called a serial schedule. A serial schedule consists of a sequence of instructions from various transactions wherein the instructions belonging to a particular transaction appear together. Thus, for a set of n transactions, there exist n! different valid serial schedules. Each serial schedule is correct, because it is equivalent to the atomic execution of the participating transactions in some arbitrary order.

If we allow the two transactions to overlap their execution, then the resulting schedule is no longer serial. A nonserial schedule does not necessarily imply an incorrect execution (that is, an execution that is not equivalent to one represented by a serial schedule). To see that this is the case, we need to define the notion of conflicting operations. Consider a schedule S in which there are two consecutive operations Oi and Oj of transactions Ti and Tj, respectively. We say that Oi and Oj conflict if they access the same data item and at least one of them is a write operation. To illustrate the concept of conflicting operations, we consider the nonserial schedule 2, shown in Figure 6.41. The write(A) operation of T0 conflicts with the read(A) operation of T1. However, the write(A) operation of T1 does not conflict with the read(B) operation of T0, because the two operations access different data items.

       T0           T1
     read(A)
     write(A)
                  read(A)
                  write(A)
     read(B)
     write(B)
                  read(B)
                  write(B)

    Figure 6.41  Schedule 2: A concurrent serializable schedule.

We can take advantage of this situation through swapping. Let Oi and Oj be consecutive operations of a schedule S. If Oi and Oj are operations of different transactions and Oi and Oj do not conflict, then we can swap the order of Oi and Oj to produce a new schedule S'. We expect S to be equivalent to S', as all operations appear in the same order in both schedules, except for Oi and Oj, whose order does not matter.

To illustrate the swapping idea, we consider again schedule 2 (Figure 6.41). Because the write(A) operation of T1 does not conflict with the read(B) operation of T0, we can swap these operations to generate an equivalent schedule, as follows:

• Swap the read(B) operation of T0 with the read(A) operation of T1.

• Swap the write(B) operation of T0 with the write(A) operation of T1.

• Swap the write(B) operation of T0 with the read(A) operation of T1.

The final result of these swaps is schedule 1 in Figure 6.40, which is a serial schedule. Thus, we have shown that schedule 2 is equivalent to a serial schedule. If a schedule S can be transformed into a serial schedule S' by a series of swaps of nonconflicting operations, we say that schedule S is conflict serializable. Thus, schedule 2 is conflict serializable, because it can be transformed into the serial schedule 1.
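As a small illustrative sketch (not from the text; the Op class is hypothetical), the conflict and swappability tests can be stated directly in Java:

    public class ConflictCheck {
       static class Op {
          String txn; String item; boolean isWrite;
          Op(String txn, String item, boolean isWrite) {
             this.txn = txn; this.item = item; this.isWrite = isWrite;
          }
       }

       // Oi and Oj conflict if they access the same data item and at
       // least one of them is a write operation.
       static boolean conflict(Op oi, Op oj) {
          return oi.item.equals(oj.item) && (oi.isWrite || oj.isWrite);
       }

       // Consecutive operations of different transactions may be swapped
       // to produce an equivalent schedule when they do not conflict.
       static boolean swappable(Op oi, Op oj) {
          return !oi.txn.equals(oj.txn) && !conflict(oi, oj);
       }

       public static void main(String[] args) {
          Op w0A = new Op("T0", "A", true);
          Op r1A = new Op("T1", "A", false);
          Op w1A = new Op("T1", "A", true);
          Op r0B = new Op("T0", "B", false);
          System.out.println(conflict(w0A, r1A));  // true: write(A) vs read(A)
          System.out.println(swappable(w1A, r0B)); // true: different items
       }
    }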
6.10.4.2 Locking Protocol

One way to ensure serializability is to associate a lock with each data item and to require that each transaction follow a locking protocol that governs how locks are acquired and released. There are various modes in which a data item can be locked. In this section, we restrict our attention to two modes:

• Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on data item Q, then Ti can read Q but cannot write it.

• Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on data item Q, then Ti can both read and write Q.

We require that every transaction request a lock in an appropriate mode on data item Q, depending on the type of operations it will perform on Q.

To access data item Q, transaction Ti must first lock Q in the appropriate mode. If Q is not currently locked, then the lock is granted, and Ti can access it. However, if Q is currently locked by some other transaction, then Ti may have to wait. More specifically, suppose that Ti requests an exclusive lock on Q. In this case, Ti must wait until the lock on Q is released. If Ti requests a shared lock on Q, then Ti must wait if Q is locked in exclusive mode. Otherwise, it can obtain the lock and access Q. Notice that this scheme is quite similar to the readers–writers algorithm discussed in Section 6.6.2.

A transaction may unlock a data item that it locked at an earlier point. It must, however, hold a lock on a data item as long as it accesses that item. Moreover, it is not always desirable for a transaction to unlock a data item immediately after its last access of that data item, because serializability may not be ensured.

One protocol that ensures serializability is the two-phase locking protocol. This protocol requires that each transaction issue lock and unlock requests in two phases:

• Growing phase. A transaction may obtain locks but may not release any locks.

• Shrinking phase. A transaction may release locks but may not obtain any new locks.

Initially, a transaction is in the growing phase. The transaction acquires locks as needed. Once the transaction releases a lock, it enters the shrinking phase, and no more lock requests can be issued.

The two-phase locking protocol ensures conflict serializability (Exercise 6.34). It does not, however, ensure freedom from deadlock. In addition, it is possible that, for a given set of transactions, there are conflict-serializable schedules that cannot be obtained by use of the two-phase locking protocol. To improve performance over two-phase locking, we need either to have additional information about the transactions or to impose some structure or ordering on the set of data.
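The phase discipline itself is easy to make concrete. The following is an illustrative sketch only (the class name and structure are hypothetical), using a ReentrantReadWriteLock for the lock objects:

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // A minimal two-phase locking wrapper: lock requests are legal only
    // until the first unlock moves the transaction into its shrinking phase.
    public class TwoPhaseTransaction {
       private boolean shrinking = false;

       // Growing phase: acquire a lock (exclusive mode only, for brevity).
       public void lock(ReentrantReadWriteLock l) {
          if (shrinking)
             throw new IllegalStateException("lock request in shrinking phase");
          l.writeLock().lock();
       }

       // The first unlock switches the transaction to the shrinking phase;
       // from here on, no new locks may be requested.
       public void unlock(ReentrantReadWriteLock l) {
          shrinking = true;
          l.writeLock().unlock();
       }
    }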
6.10.4.3 Timestamp-Based Protocols

In the locking protocols described above, the order followed by pairs of conflicting transactions is determined at execution time. Another method for determining the order is to select an order in advance. The most common method for doing so is to use a timestamp-ordering scheme.

With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti). This timestamp is assigned by the system before the transaction Ti starts execution. If a transaction Ti has been assigned timestamp TS(Ti), and later a new transaction Tj enters the system, then TS(Ti) < TS(Tj). There are two simple methods for implementing this scheme:

• Use the value of the system clock as the timestamp; that is, a transaction's timestamp is equal to the value of the clock when the transaction enters the system. This method will not work for transactions that occur on separate systems or for processors that do not share a clock.

• Use a logical counter as the timestamp; that is, a transaction's timestamp is equal to the value of the counter when the transaction enters the system. The counter is incremented after a new timestamp is assigned.

The timestamps of the transactions determine the serializability order. Thus, if TS(Ti) < TS(Tj), then the system must ensure that the schedule produced is equivalent to a serial schedule in which transaction Ti appears before transaction Tj.

To implement this scheme, we associate with each data item Q two timestamp values:

• W-timestamp(Q) denotes the largest timestamp of any transaction that successfully executed write(Q).

• R-timestamp(Q) denotes the largest timestamp of any transaction that successfully executed read(Q).

These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed. The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order. This protocol operates as follows:

• Suppose that transaction Ti issues read(Q):

  ◦ If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.

  ◦ If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).

• Suppose that transaction Ti issues write(Q):

  ◦ If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously and Ti assumed that this value would never be produced. Hence, the write operation is rejected, and Ti is rolled back.

  ◦ If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, this write operation is rejected, and Ti is rolled back.

  ◦ Otherwise, the write operation is executed.

A transaction Ti that is rolled back as a result of either a read or a write operation is assigned a new timestamp and is restarted.

To illustrate this protocol, consider schedule 3 in Figure 6.42, which includes transactions T2 and T3.

       T2           T3
     read(B)
                  read(B)
                  write(B)
     read(A)
                  read(A)
                  write(A)

    Figure 6.42  Schedule 3: A schedule possible under the timestamp protocol.

We assume that a transaction is assigned a timestamp immediately before its first instruction. Thus, in schedule 3, TS(T2) < TS(T3), and the schedule is possible under the timestamp protocol. This execution can also be produced by the two-phase locking protocol. However, some schedules are possible under the two-phase locking protocol but not under the timestamp protocol, and vice versa (Exercise 6.7).

The timestamp protocol ensures conflict serializability. This capability follows from the fact that conflicting operations are processed in timestamp order. The protocol also ensures freedom from deadlock, because no transaction ever waits.
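These rules translate almost line for line into code. The sketch below is illustrative (the class name is hypothetical) and guards a single data item Q; a rejected operation signals that the issuing transaction must be rolled back and restarted:

    // An illustrative sketch of the timestamp-ordering rules for one
    // data item Q.
    public class TimestampedItem {
       private long rTimestamp = 0;   // largest TS of a successful read(Q)
       private long wTimestamp = 0;   // largest TS of a successful write(Q)
       private int value;

       // read(Q) by a transaction with timestamp ts
       public synchronized int read(long ts) {
          if (ts < wTimestamp)                   // Q was already overwritten
             throw new IllegalStateException("roll back transaction " + ts);
          rTimestamp = Math.max(rTimestamp, ts);
          return value;
       }

       // write(Q) by a transaction with timestamp ts
       public synchronized void write(long ts, int newValue) {
          if (ts < rTimestamp || ts < wTimestamp) // late or obsolete write
             throw new IllegalStateException("roll back transaction " + ts);
          wTimestamp = ts;
          value = newValue;
       }
    }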
6.11 Summary

Given a collection of cooperating sequential processes that share data, mutual exclusion must be provided to ensure that a critical section of code is used by only one process or thread at a time. Typically, computer hardware provides several operations that ensure mutual exclusion. However, such hardware-based solutions are too complicated for most developers to use. Semaphores overcome this obstacle. Semaphores can be used to solve various synchronization problems and can be implemented efficiently, especially if hardware support for atomic operations is available.

Various synchronization problems (such as the bounded-buffer problem, the readers–writers problem, and the dining-philosophers problem) are important mainly because they are examples of a large class of concurrency-control problems. These problems are used to test nearly every newly proposed synchronization scheme. The operating system must provide the means to guard against timing errors.

Several language constructs have been proposed to deal with these problems. Monitors provide the synchronization mechanism for sharing abstract data types. A condition variable provides a method by which a monitor procedure can block its execution until it is signaled to continue.

Java provides various tools to coordinate the activities of multiple threads accessing shared data through the synchronized, wait(), notify(), and notifyAll() mechanisms. In addition, the Java API provides support for mutual exclusion locks, semaphores, and condition variables.

Operating systems also provide support for synchronization. For example, Solaris, Windows XP, and Linux provide mechanisms such as semaphores, mutexes, spinlocks, and condition variables to control access to shared data. The Pthreads API provides support for mutexes and condition variables.

A transaction is a program unit that must be executed atomically; that is, either all the operations associated with it are executed to completion, or none are performed. To ensure atomicity despite system failure, we can use a write-ahead log. All updates are recorded on the log, which is kept in stable storage. If a system crash occurs, the information in the log is used in restoring the state of the updated data items, which is accomplished by use of the undo and redo operations. To reduce the overhead in searching the log after a system failure has occurred, we can use a checkpoint scheme.

To ensure serializability when the execution of several transactions overlaps, we must use a concurrency-control scheme. Various concurrency-control schemes ensure serializability by delaying an operation or aborting the transaction that issued the operation. The most common ones are locking protocols and timestamp-ordering schemes.

Practice Exercises

6.1 In Section 6.4, we mentioned that disabling interrupts frequently can affect the system's clock. Explain why this can occur and how such effects can be minimized.

6.2 The Cigarette-Smokers Problem. Consider a system with three smoker processes and one agent process. Each smoker continuously rolls a cigarette and then smokes it. But to roll and smoke a cigarette, the smoker needs three ingredients: tobacco, paper, and matches. One of the smoker processes has paper, another has tobacco, and the third has matches. The agent has an infinite supply of all three materials. The agent places two of the ingredients on the table. The smoker who has the remaining ingredient then makes and smokes a cigarette, signaling the agent on completion. The agent then puts out another two of the three ingredients, and the cycle repeats. Write a program to synchronize the agent and the smokers using Java synchronization.

6.3 Explain why Solaris, Windows XP, and Linux implement multiple locking mechanisms. Describe the circumstances under which they use spinlocks, mutexes, semaphores, adaptive mutexes, and condition variables. In each case, explain why the mechanism is needed.

6.4 Describe how volatile, nonvolatile, and stable storage differ in cost.

6.5 Explain the purpose of the checkpoint mechanism. How often should checkpoints be performed? Describe how the frequency of checkpoints affects:

• System performance when no failure occurs

• The time it takes to recover from a system crash

• The time it takes to recover from a disk crash

6.6 Explain the concept of transaction atomicity.

6.7 Show that some schedules are possible under the two-phase locking protocol but not possible under the timestamp protocol, and vice versa.

Exercises

6.8 Race conditions are possible in many computer systems. Consider a banking system with two methods: deposit(amount) and withdraw(amount). These two methods are passed the amount that is to be deposited or withdrawn from a bank account.
Assume that a husband and wife share a bank account and that concurrently the husband calls the withdraw() method and the wife calls deposit(). Describe how a race condition is possible and what might be done to prevent the race condition from occurring.

6.9 The first known correct software solution to the critical-section problem for two processes was developed by Dekker. The two processes, P0 and P1, share the following variables:

    boolean flag[2]; /* initially false */
    int turn;

The structure of process Pi (i == 0 or 1) is shown in Figure 6.43; the other process is Pj (j == 1 or 0). Prove that the algorithm satisfies all three requirements for the critical-section problem.

    do {
       flag[i] = true;

       while (flag[j]) {
          if (turn == j) {
             flag[i] = false;
             while (turn == j)
                ; // do nothing
             flag[i] = true;
          }
       }

       // critical section

       turn = j;
       flag[i] = false;

       // remainder section
    } while (true);

    Figure 6.43  The structure of process Pi in Dekker's algorithm.

6.10 The first known correct software solution to the critical-section problem for n processes with a lower bound on waiting of n − 1 turns was presented by Eisenberg and McGuire. The processes share the following variables:

    enum pstate {idle, want_in, in_cs};
    pstate flag[n];
    int turn;

All the elements of flag are initially idle; the initial value of turn is immaterial (between 0 and n-1). The structure of process Pi is shown in Figure 6.44. Prove that the algorithm satisfies all three requirements for the critical-section problem.

6.11 What is the meaning of the term busy waiting? What other kinds of waiting are there in an operating system? Can busy waiting be avoided altogether? Explain your answer.

6.12 Explain why spinlocks are not appropriate for single-processor systems yet are often used in multiprocessor systems.

6.13 Explain why implementing synchronization primitives by disabling interrupts is not appropriate in a single-processor system if the synchronization primitives are to be used in user-level programs.

6.14 Explain why interrupts are not appropriate for implementing synchronization primitives in multiprocessor systems.

6.15 Describe two kernel data structures in which race conditions are possible. Be sure to include a description of how a race condition can occur.

6.16 Describe how the swap() instruction can be used to provide mutual exclusion that satisfies the bounded-waiting requirement.

6.17 Show that, if the acquire() and release() semaphore operations are not executed atomically, then mutual exclusion may be violated.

6.18 Windows Vista provides a new lightweight synchronization tool called a slim reader–writer lock. Whereas most implementations of reader–writer locks favor either readers or writers, or perhaps order waiting threads using a FIFO policy, slim reader–writer locks favor neither readers nor writers and do not order waiting threads in a FIFO queue. Explain the benefits of providing such a synchronization tool.

6.19 Show how to implement the acquire() and release() semaphore operations in multiprocessor environments using the getAndSet() instruction. The solution should exhibit minimal busy waiting.
    do {
       while (true) {
          flag[i] = want_in;
          j = turn;

          while (j != i) {
             if (flag[j] != idle)
                j = turn;
             else
                j = (j + 1) % n;
          }

          flag[i] = in_cs;
          j = 0;

          while ((j < n) && (j == i || flag[j] != in_cs))
             j++;

          if ((j >= n) && (turn == i || flag[turn] == idle))
             break;
       }

       // critical section

       j = (turn + 1) % n;
       while (flag[j] == idle)
          j = (j + 1) % n;

       turn = j;
       flag[i] = idle;

       // remainder section
    } while (true);

    Figure 6.44  The structure of process Pi in Eisenberg and McGuire's algorithm.

6.20 Demonstrate that monitors and semaphores are equivalent insofar as they can be used to implement the same types of synchronization problems.

6.21 Write a bounded-buffer monitor in which the buffers (portions) are embedded within the monitor itself.

6.22 The strict mutual exclusion within a monitor makes the bounded-buffer monitor of Exercise 6.21 mainly suitable for small portions.

a. Explain why this is true.

b. Design a new scheme that is suitable for larger portions.

6.23 Discuss the tradeoff between fairness and throughput of operations in the readers–writers problem. Propose a method for solving the readers–writers problem without causing starvation.

6.24 How does the signal() operation associated with monitors differ from the corresponding release() operation defined for semaphores?

6.25 Suppose the signal() statement can appear only as the last statement in a monitor procedure. Suggest how the implementation described in Section 6.7 can be simplified in this situation.

6.26 Consider a system consisting of processes P1, P2, ..., Pn, each of which has a unique priority number. Write a monitor that allocates three identical printers to these processes, using the priority numbers for deciding the order of allocation.

6.27 A file is to be shared among different processes, each of which has a unique number. The file can be accessed simultaneously by several processes, subject to the following constraint: the sum of all unique numbers associated with all the processes currently accessing the file must be less than n. Write a monitor to coordinate access to the file.

6.28 When a signal is performed on a condition inside a monitor, the signaling process can either continue its execution or transfer control to the process that is signaled. How would the solution to the preceding exercise differ with these two different ways of performing signaling?

6.29 Suppose we replace the wait() and signal() operations of monitors with a single construct await(B), where B is a general boolean expression that causes the process executing it to wait until B becomes true.

a. Write a monitor using this scheme to implement the readers–writers problem.

b. Explain why, in general, this construct cannot be implemented efficiently.

c. What restrictions need to be put on the await() statement so that it can be implemented efficiently? (Hint: Restrict the generality of B; see Kessels [1977].)

6.30 Write a monitor that implements an alarm clock that enables a calling program to delay itself for a specified number of time units (ticks). You may assume the existence of a real hardware clock that invokes a procedure tick in your monitor at regular intervals.

6.31 The Singleton design pattern ensures that only one instance of an object is created. For example, assume we have a class called Singleton and we wish to allow only one instance of it.
Rather than creating a Singleton object using its constructor, we instead declare the constructor as private and provide a public static method—such as getInstance()—for object creation:

    Singleton sole = Singleton.getInstance();

Figure 6.45 provides one strategy for implementing the Singleton pattern. The idea behind this approach is to use lazy initialization, whereby we create an instance of the object only when it is needed—that is, when getInstance() is first called. However, Figure 6.45 suffers from a race condition. Identify the race condition.

    public class Singleton {
       private static Singleton instance = null;

       private Singleton() {}

       public static Singleton getInstance() {
          if (instance == null)
             instance = new Singleton();

          return instance;
       }
    }

    Figure 6.45  First attempt at Singleton design pattern.

Figure 6.46 shows an alternative strategy that addresses the race condition by using the double-checked locking idiom. Using this strategy, we first check whether instance is null. If it is, we next obtain the lock for the Singleton class and then double-check whether instance is still null before creating the object. Does this strategy result in any race conditions? If so, identify and fix them. Otherwise, illustrate why this code example is thread-safe.

    public class Singleton {
       private static Singleton instance = null;

       private Singleton() {}

       public static Singleton getInstance() {
          if (instance == null) {
             synchronized(Singleton.class) {
                if (instance == null)
                   instance = new Singleton();
             }
          }

          return instance;
       }
    }

    Figure 6.46  Singleton design pattern using double-checked locking.

6.32 Why do Solaris, Linux, and Windows XP use spinlocks as a synchronization mechanism only on multiprocessor systems and not on single-processor systems?

6.33 In log-based systems that provide support for transactions, updates to data items cannot be performed before the corresponding entries are logged. Why is this restriction necessary?

6.34 Show that the two-phase locking protocol ensures conflict serializability.

6.35 What are the implications of assigning a new timestamp to a transaction that is rolled back? How does the system process transactions that were issued after the rolled-back transaction but that have timestamps smaller than the new timestamp of the rolled-back transaction?

Programming Problems

6.36 Exercise 4.20 requires the main thread to wait for the sorting and merge threads by using the join() method. Modify your solution to this exercise so that it uses semaphores rather than the join() method. (Hint: We recommend carefully reading through the Java API on the constructor for Semaphore objects.)

6.37 The HardwareData class in Figure 6.4 abstracts the idea of the get-and-set and swap instructions. However, this class is not considered thread-safe, because multiple threads may concurrently access its methods and thread safety requires that each method be performed atomically. Rewrite the HardwareData class using Java synchronization so that it is thread-safe.

6.38 Servers can be designed to limit the number of open connections. For example, a server may wish to have only N socket connections open at any point in time. After N connections have been made, the server will not accept another incoming connection until an existing connection is released. In the source code available on WileyPLUS, there is a program named TimedServer.java that listens to port 2500.
When a connection is made (via telnet or the supplied client program TimedClient.java), the server creates a new thread that maintains the connection for 10 seconds (writing the number of seconds remaining while the connection remains open). At the end of 10 seconds, the thread closes the connection. Currently, TimedServer.java will accept an unlimited number of connections. Using semaphores, modify this program so that it limits the number of concurrent connections.

6.39 Assume that a finite number of resources of a single resource type must be managed. Processes may ask for a number of these resources and—once finished—will return them. As an example, many commercial software packages provide a given number of licenses, indicating the number of applications that may run concurrently. When the application is started, the license count is decremented. When the application is terminated, the license count is incremented. If all licenses are in use, requests to start the application are denied. Such requests will only be granted when an existing license holder terminates the application and a license is returned.

The following Java class is used to manage a finite number of instances of an available resource. Note that when a process wishes to obtain a number of resources, it invokes the decreaseCount() method. Similarly, when a process wants to return a number of resources, it calls increaseCount().

    public class Manager {
       public static final int MAX_RESOURCES = 5;
       private int availableResources = MAX_RESOURCES;

       /**
        * Decrease availableResources by count resources.
        * Return 0 if sufficient resources are available;
        * otherwise return -1.
        */
       public int decreaseCount(int count) {
          if (availableResources < count)
             return -1;
          else {
             availableResources -= count;
             return 0;
          }
       }

       /* Increase availableResources by count resources. */
       public void increaseCount(int count) {
          availableResources += count;
       }
    }

However, the preceding program segment produces a race condition. Do the following:

a. Identify the data involved in the race condition.

b. Identify the location (or locations) in the code where the race condition occurs.

c. Using Java synchronization, fix the race condition. Also modify decreaseCount() so that a thread blocks if there aren't sufficient resources available.

6.40 Implement the Channel interface (Figure 3.20) so that the send() and receive() methods are blocking. That is, a thread invoking send() will block if the channel is full. If the channel is empty, a thread invoking receive() will block. Doing this will require storing the messages in a fixed-length array. Ensure that your implementation is thread-safe (using Java synchronization) and that the messages are stored in FIFO order.

6.41 A barrier is a thread-synchronization mechanism that allows several threads to run for a period and then forces all threads to wait until all have reached a certain point. Once all threads have reached this point (the barrier), they may all continue. An interface for a barrier appears as follows:
    public interface Barrier {
       /**
        * Each thread calls this method when it reaches
        * the barrier. All threads are released to continue
        * processing when the last thread calls this method.
        */
       public void waitForOthers();

       /**
        * Release all threads from waiting for the barrier.
        * Any future calls to waitForOthers() will not wait
        * until the Barrier is set again with a call
        * to the constructor.
        */
       public void freeAll();
    }

The following code segment establishes a barrier and creates 10 Worker threads that will synchronize according to the barrier:

    public static final int THREAD_COUNT = 10;
    Barrier jersey = new BarrierImpl(THREAD_COUNT);

    for (int i = 0; i < THREAD_COUNT; i++)
       (new Worker(jersey)).start();

Note that the barrier must be initialized to the number of threads that are being synchronized and that each thread has a reference to the same barrier object—jersey. Each Worker will run as follows:

    // All threads have access to this barrier
    Barrier jersey;

    // do some work for a while ...

    // now wait for the others
    jersey.waitForOthers();

    // now do more work ...

When a thread invokes the method waitForOthers(), it will block until all threads have reached this method (the barrier). Once all threads have reached the method, they may all proceed with the remainder of their code. The freeAll() method bypasses the need to wait for threads to reach the barrier; as soon as freeAll() is invoked, all threads waiting for the barrier are released. Implement the Barrier interface using Java synchronization.

6.42 Implement the Buffer interface (Figure 3.15) as a bounded buffer using Java's condition variables. Test your solution using Figure 6.14.

6.43 Implement the ReadWriteLock interface (Figure 6.17) using Java's condition variables. You may find it necessary to examine the signalAll() method in the Condition API.

6.44 The Sleeping-Barber Problem. A barbershop consists of a waiting room with n chairs and a barber room with one barber chair. If there are no customers to be served, the barber goes to sleep. If a customer enters the barbershop and all chairs are occupied, then the customer leaves the shop. If the barber is busy but chairs are available, then the customer sits in one of the free chairs. If the barber is asleep, the customer wakes up the barber. Write a program to coordinate the barber and the customers using Java synchronization.

Programming Projects

The projects below deal with three distinct topics—designing a pid manager, designing a thread pool, and implementing a solution to the dining-philosophers problem using Java's condition variables.

Project 1: Designing a pid Manager

A pid manager is responsible for managing process identifiers (pids). When a process is first created, it is assigned a unique pid by the pid manager. The pid is returned to the pid manager when the process completes execution, and the pid manager may later reassign it. Process identifiers are discussed more fully in Section 3.3.1. What is most important here is to recognize that process identifiers must be unique; no two active processes can have the same pid.

The Java interface shown in Figure 6.47 identifies the basic methods for obtaining and releasing a pid. Process identifiers are assigned within the range MIN_PID to MAX_PID (inclusive). The fundamental difference between getPID() and getPIDWait() is that if no pids are available, getPID() returns -1, whereas getPIDWait() blocks the calling process until a pid becomes available. As with most kernel data, the data structure for maintaining a set of pids must be free from race conditions and deadlock. One possible result from a race condition is that the same pid will be concurrently assigned to more than one process. (However, a pid can be reused once it has been returned via the call to releasePID().)
To achieve blocking behavior in getPIDWait(), you may use any of the Java-based synchronization mechanisms discussed in Section 6.8.

    /**
     * An interface for a PID manager.
     *
     * The range of allowable PIDs is
     * MIN_PID .. MAX_PID (inclusive).
     *
     * An implementation of this interface
     * must ensure thread safety.
     */
    public interface PIDManager {
       /** The range of allowable PIDs (inclusive) */
       public static final int MIN_PID = 4;
       public static final int MAX_PID = 127;

       /**
        * Return a valid PID or -1 if
        * none are available.
        */
       public int getPID();

       /**
        * Return a valid PID, possibly blocking the
        * calling process until one is available.
        */
       public int getPIDWait();

       /**
        * Release the pid.
        * Throw an IllegalArgumentException if the pid
        * is outside of the range of PID values.
        */
       public void releasePID(int pid);
    }

    Figure 6.47  Java interface for obtaining and releasing a pid.

Project 2: Designing a Thread Pool

Create a thread pool (see Chapter 4) using Java synchronization. Your thread pool will implement the following API:

    ThreadPool()              Create a default-sized thread pool
    ThreadPool(int size)      Create a thread pool of size size
    void add(Runnable task)   Add a task to be performed by a thread in the pool
    void stopPool()           Stop all threads in the pool

Your pool will first create a number of idle threads that await work. Work will be submitted to the pool via the add() method, which adds a task implementing the Runnable interface. The add() method will place the Runnable task into a queue. Once a thread in the pool becomes available for work, it will check the queue for any Runnable tasks. If there are such tasks, the idle thread will remove the task from the queue and invoke its run() method. If the queue is empty, the idle thread will wait to be notified when work becomes available. (The add() method will perform a notify() when it places a Runnable task into the queue to possibly awaken an idle thread awaiting work.) The stopPool() method will stop all threads in the pool by invoking their interrupt() method (Section 4.5.2). This, of course, requires that Runnable tasks being executed by the thread pool check their interruption status.

There are many different ways to test your solution to this problem. One suggestion is to modify your answer to Exercise 3.17 so that the server can respond to each client request by using a thread pool.

Project 3: Dining Philosophers

In Section 6.7.2, we provide an outline of a solution to the dining-philosophers problem using monitors. This exercise will require implementing this solution using Java's condition variables.

Begin by creating five philosophers, each identified by a number 0 ... 4. Each philosopher runs as a separate thread. Philosophers alternate between thinking and eating. When a philosopher wishes to eat, it invokes the method takeForks(philNumber), where philNumber identifies the number of the philosopher wishing to eat. When a philosopher finishes eating, it invokes returnForks(philNumber). Your solution will implement the following interface:

    public interface DiningServer {
       /* Called by a philosopher when it wishes to eat */
       public void takeForks(int philNumber);

       /* Called by a philosopher when it is finished eating */
       public void returnForks(int philNumber);
    }

The implementation of the interface follows the outline of the solution provided in Figure 6.26. Use Java's condition variables to synchronize the activity of the philosophers and prevent deadlock.
Bibliographical Notes

The mutual-exclusion problem was first discussed in a classic paper by Dijkstra [1965a]. Dekker's algorithm (Exercise 6.9)—the first correct software solution to the two-process mutual-exclusion problem—was developed by the Dutch mathematician T. Dekker. This algorithm also was discussed by Dijkstra [1965a]. A simpler solution to the two-process mutual-exclusion problem has since been presented by Peterson [1981] (Figure 6.2).

Dijkstra [1965b] presented the first solution to the mutual-exclusion problem for n processes. This solution, however, does not place an upper bound on the amount of time a process must wait before it is allowed to enter the critical section. Knuth [1966] presented the first algorithm with a bound; his bound was 2^n turns. A refinement of Knuth's algorithm by deBruijn [1967] reduced the waiting time to n² turns, after which Eisenberg and McGuire [1972] succeeded in reducing the time to the lower bound of n − 1 turns. Another algorithm that also requires n − 1 turns but is easier to program and to understand is the bakery algorithm, which was developed by Lamport [1974]. Burns [1978] developed the hardware-solution algorithm that satisfies the bounded-waiting requirement.

General discussions concerning the mutual-exclusion problem were offered by Lamport [1986] and Lamport [1991]. Raynal [1986] offered a collection of algorithms for mutual exclusion.

The semaphore concept was suggested by Dijkstra [1965a]. Patil [1971] examined the question of whether semaphores can solve all possible synchronization problems. Parnas [1975] discussed some of the flaws in Patil's arguments. Kosaraju [1973] followed up on Patil's work to produce a problem that cannot be solved by wait() and signal() operations. Lipton [1974] discussed the limitations of various synchronization primitives.

The classic process-coordination problems that we have described are paradigms for a large class of concurrency-control problems. The bounded-buffer problem, the dining-philosophers problem, and the sleeping-barber problem (Exercise 6.44) were suggested by Dijkstra [1965a] and Dijkstra [1971]. The cigarette-smokers problem (Exercise 6.2) was developed by Patil [1971]. The readers–writers problem was suggested by Courtois et al. [1971]. The issue of concurrent reading and writing was discussed by Lamport [1977]. The problem of synchronization of independent processes was discussed by Lamport [1976].

The critical-region concept was suggested by Hoare [1972] and by Brinch-Hansen [1972]. Brinch-Hansen [1973] developed the monitor concept. A complete description of the monitor was given by Hoare [1974]. Kessels [1977] proposed an extension to the monitor to allow automatic signaling. Experience obtained from the use of monitors in concurrent programs was discussed by Lampson and Redell [1979]. They also examined the priority inversion problem. General discussions concerning concurrent programming were offered by Ben-Ari [1990] and Birrell [1989].

Optimizing the performance of locking primitives has been examined in many works, such as Lamport [1987], Mellor-Crummey and Scott [1991], and Anderson [1990]. The use of shared objects that do not require the use of critical sections was discussed in Herlihy [1993], Bershad [1993], and Kopetz and Reisinger [1993].
Novel hardware instructions and their utility in implementing synchronization primitives have been described in works such as Culler et al. [1998], Goodman et al. [1989], Barnes [1993], and Herlihy and Moss [1993].

Some details of the locking mechanisms used in Solaris were presented in Mauro and McDougall [2007]. Note that the locking mechanisms used by the kernel are implemented for user-level threads as well, so the same types of locks are available inside and outside the kernel. Details of Windows 2000 synchronization can be found in Solomon and Russinovich [2000]. Java thread synchronization is covered in Oaks and Wong [2004]; Goetz et al. [2006] present a detailed discussion of concurrent programming in Java as well as the java.util.concurrent package.

The write-ahead log scheme was first introduced in System R by Gray et al. [1981]. The concept of serializability was formulated by Eswaran et al. [1976] in connection with their work on concurrency control for System R. The two-phase locking protocol was introduced by Eswaran et al. [1976]. The timestamp-based concurrency-control scheme was provided by Reed [1983]. Bernstein and Goodman [1980] explain various timestamp-based concurrency-control algorithms. Adl-Tabatabai et al. [2007] discuss transactional memory.

Chapter 7 Deadlocks

In a multiprogramming environment, several processes may compete for a finite number of resources. A process requests resources; if the resources are not available at that time, the process enters a waiting state. Sometimes, a waiting process is never again able to change state, because the resources it has requested are held by other waiting processes. This situation is called a deadlock. We discussed this issue briefly in Chapter 6 in connection with semaphores.

Perhaps the best illustration of a deadlock can be drawn from a law passed by the Kansas legislature early in the 20th century. It said, in part: "When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone."

In this chapter, we describe methods that an operating system can use to prevent or deal with deadlocks. Although some applications can identify programs that may deadlock, operating systems typically do not provide deadlock-prevention facilities, and it remains the responsibility of programmers to ensure that they design deadlock-free programs. Deadlock problems can only become more common, given current trends, including larger numbers of processes, multithreaded programs, many more resources within a system, and an emphasis on long-lived file and database servers rather than batch systems.

CHAPTER OBJECTIVES

• To develop a description of deadlocks, which prevent sets of concurrent processes from completing their tasks.

• To present a number of different methods for preventing or avoiding deadlocks in a computer system.

7.1 System Model

A system consists of a finite number of resources to be distributed among a number of competing processes. The resources are partitioned into several types, each consisting of some number of identical instances. Memory space, CPU cycles, files, and I/O devices (such as printers and DVD drives) are examples of resource types. If a system has two CPUs, then the resource type CPU has two instances. Similarly, the resource type printer may have five instances.

If a process requests an instance of a resource type, the allocation of any instance of the type will satisfy the request.
If it will not, then the instances are not identical, and the resource type classes have not been defined properly. For example, a system may have two printers. These two printers may be defined to be in the same resource class if no one cares which printer prints which output. However, if one printer is on the ninth floor and the other is in the basement, then people on the ninth floor may not see both printers as equivalent, and a separate resource class may need to be defined for each printer.

A process must request a resource before using it and must release the resource after using it. A process may request as many resources as it requires to carry out its designated task. Obviously, the number of resources requested may not exceed the total number of resources available in the system. In other words, a process cannot request three printers if the system has only two.

Under the normal mode of operation, a process may utilize a resource in only the following sequence:

1. Request. The process requests the resource. If the request cannot be granted immediately (for example, if the resource is being used by another process), then the requesting process must wait until it can acquire the resource.

2. Use. The process can operate on the resource (for example, if the resource is a printer, the process can print on the printer).

3. Release. The process releases the resource.

The request and release of resources are system calls, as explained in Chapter 2. Examples are the request() and release() device, open() and close() file, and allocate() and free() memory system calls. Request and release of resources that are not managed by the operating system can be accomplished through the acquire() and release() operations on semaphores or through acquisition and release of an object's lock via Java's synchronized keyword.

For each use of a kernel-managed resource by a process or thread, the operating system checks to make sure that the process has requested and has been allocated the resource. A system table records whether each resource is free or allocated; for each resource that is allocated, the table also records the process to which it is allocated. If a process requests a resource that is currently allocated to another process, it can be added to a queue of processes waiting for this resource.
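For user-level resources, this request–use–release sequence maps directly onto semaphore operations. The following illustrative Java sketch (not from the text) models a pool of two identical printers as a Semaphore with two permits:

    import java.util.concurrent.Semaphore;

    public class PrinterPool {
       // Two identical printer instances, modeled as two permits.
       private final Semaphore printers = new Semaphore(2);

       public void printJob(String document) throws InterruptedException {
          printers.acquire();        // request: wait until an instance is free
          try {
             System.out.println("printing " + document);  // use the resource
          } finally {
             printers.release();     // release the instance
          }
       }
    }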
A set of processes is in a deadlocked state when every process in the set is waiting for an event that can be caused only by another process in the set. The events with which we are mainly concerned here are resource acquisition and release. The resources may be either physical resources (for example, printers, tape drives, memory space, and CPU cycles) or logical resources (for example, files, semaphores, and monitors). However, other types of events may result in deadlocks (for example, the IPC facilities discussed in Chapter 3).

To illustrate a deadlocked state, consider a system with three CD read-write (RW) drives. Suppose each of three processes holds one of these drives. If each process now requests another drive, the three processes will be in a deadlocked state. Each is waiting for the event "CD RW is released," which can be caused only by one of the other waiting processes. This example illustrates a deadlock involving the same resource type.

Deadlocks may also involve different resource types. For example, consider a system with one printer and one DVD drive. Suppose that process Pi is holding the DVD and process Pj is holding the printer. If Pi requests the printer and Pj requests the DVD drive, a deadlock occurs.

A programmer who is developing multithreaded applications must pay particular attention to this problem. Multithreaded programs are good candidates for deadlock because multiple threads can compete for shared resources.

7.2 Deadlock Characterization

In a deadlock, processes never finish executing, and system resources are tied up, preventing other jobs from starting. Before we discuss the various methods for dealing with the deadlock problem, we look more closely at features that characterize deadlocks.

7.2.1 Necessary Conditions

A deadlock situation can arise if the following four conditions hold simultaneously in a system:

1. Mutual exclusion. At least one resource must be held in a nonsharable mode; that is, only one process at a time can use the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.

2. Hold and wait. A process must be holding at least one resource and waiting to acquire additional resources that are currently being held by other processes.

3. No preemption. Resources cannot be preempted; that is, a resource can be released only voluntarily by the process holding it, after that process has completed its task.

4. Circular wait. A set {P0, P1, ..., Pn} of waiting processes must exist such that P0 is waiting for a resource held by P1, P1 is waiting for a resource held by P2, ..., Pn−1 is waiting for a resource held by Pn, and Pn is waiting for a resource held by P0.

We emphasize that all four conditions must hold for a deadlock to occur. The circular-wait condition implies the hold-and-wait condition, so the four conditions are not completely independent. We shall see in Section 7.4, however, that it is useful to consider each condition separately.

7.2.2 Resource-Allocation Graph

Deadlocks can be described more precisely in terms of a directed graph called a system resource-allocation graph. This graph consists of a set of vertices V and a set of edges E. The set of vertices V is partitioned into two different types of nodes: P = {P1, P2, ..., Pn}, the set consisting of all the active processes in the system, and R = {R1, R2, ..., Rm}, the set consisting of all resource types in the system.

A directed edge from process Pi to resource type Rj is denoted by Pi → Rj; it signifies that process Pi has requested an instance of resource type Rj and is currently waiting for that resource. A directed edge from resource type Rj to process Pi is denoted by Rj → Pi; it signifies that an instance of resource type Rj has been allocated to process Pi. A directed edge Pi → Rj is called a request edge; a directed edge Rj → Pi is called an assignment edge.

Pictorially, we represent each process Pi as a circle and each resource type Rj as a rectangle. Since resource type Rj may have more than one instance, we represent each such instance as a dot within the rectangle. Note that a request edge points to only the rectangle Rj, whereas an assignment edge must also designate one of the dots in the rectangle.

When process Pi requests an instance of resource type Rj, a request edge is inserted in the resource-allocation graph. When this request can be fulfilled, the request edge is instantaneously transformed to an assignment edge. When the process no longer needs access to the resource, it releases the resource; as a result, the assignment edge is deleted.
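Anticipating the discussion that follows, where a cycle in this graph signals possible deadlock, the graph itself is easy to represent in code. The sketch below is illustrative only (all names are hypothetical) and assumes single-instance resource types, for which a cycle implies deadlock:

    import java.util.*;

    public class AllocationGraph {
       // Adjacency list over node names: request edges (Pi -> Rj) and
       // assignment edges (Rj -> Pi).
       private Map<String, List<String>> edges = new HashMap<String, List<String>>();

       public void addEdge(String from, String to) {
          if (!edges.containsKey(from))
             edges.put(from, new ArrayList<String>());
          edges.get(from).add(to);
       }

       // Depth-first search for a cycle.
       public boolean hasCycle() {
          Set<String> visited = new HashSet<String>();
          Set<String> onPath = new HashSet<String>();
          for (String node : edges.keySet())
             if (dfs(node, visited, onPath)) return true;
          return false;
       }

       private boolean dfs(String n, Set<String> visited, Set<String> onPath) {
          if (onPath.contains(n)) return true;  // back edge: a cycle exists
          if (!visited.add(n)) return false;    // already fully explored
          onPath.add(n);
          if (edges.containsKey(n))
             for (String next : edges.get(n))
                if (dfs(next, visited, onPath)) return true;
          onPath.remove(n);
          return false;
       }

       public static void main(String[] args) {
          AllocationGraph g = new AllocationGraph();
          g.addEdge("P1", "R1"); g.addEdge("R1", "P2"); // P1 requests R1, held by P2
          g.addEdge("P2", "R2"); g.addEdge("R2", "P1"); // P2 requests R2, held by P1
          System.out.println(g.hasCycle());             // true: deadlock
       }
    }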
The resource-allocation graph shown in Figure 7.1 depicts the following situation.

• The sets P, R, and E:

  ◦ P = {P1, P2, P3}

  ◦ R = {R1, R2, R3, R4}

  ◦ E = {P1 → R1, P2 → R3, R1 → P2, R2 → P2, R2 → P1, R3 → P3}

• Resource instances:

  ◦ One instance of resource type R1

  ◦ Two instances of resource type R2

  ◦ One instance of resource type R3

  ◦ Three instances of resource type R4

• Process states:

  ◦ Process P1 is holding an instance of resource type R2 and is waiting for an instance of resource type R1.

  ◦ Process P2 is holding an instance of R1 and an instance of R2 and is waiting for an instance of R3.

  ◦ Process P3 is holding an instance of R3.

    Figure 7.1  Resource-allocation graph.

Given the definition of a resource-allocation graph, it can be shown that, if the graph contains no cycles, then no process in the system is deadlocked. If the graph does contain a cycle, then a deadlock may exist.

If each resource type has exactly one instance, then a cycle implies that a deadlock has occurred. If the cycle involves only a set of resource types, each of which has only a single instance, then a deadlock has occurred. Each process involved in the cycle is deadlocked. In this case, a cycle in the graph is both a necessary and a sufficient condition for the existence of deadlock.

If each resource type has several instances, then a cycle does not necessarily imply that a deadlock has occurred. In this case, a cycle in the graph is a necessary but not a sufficient condition for the existence of deadlock.

To illustrate this concept, we return to the resource-allocation graph depicted in Figure 7.1. Suppose that process P3 requests an instance of resource type R2. Since no resource instance is currently available, a request edge P3 → R2 is added to the graph (Figure 7.2). At this point, two minimal cycles exist in the system:

    P1 → R1 → P2 → R3 → P3 → R2 → P1
    P2 → R3 → P3 → R2 → P2

    Figure 7.2  Resource-allocation graph with a deadlock.

Processes P1, P2, and P3 are deadlocked. Process P2 is waiting for the resource R3, which is held by process P3. Process P3 is waiting for either process P1 or process P2 to release resource R2. In addition, process P1 is waiting for process P2 to release resource R1.

Now consider the resource-allocation graph in Figure 7.3. In this example, we also have a cycle:

    P1 → R1 → P3 → R2 → P1

    Figure 7.3  Resource-allocation graph with a cycle but no deadlock.

However, there is no deadlock. Observe that process P4 may release its instance of resource type R2. That resource can then be allocated to P3, breaking the cycle.

In summary, if a resource-allocation graph does not have a cycle, then the system is not in a deadlocked state. If there is a cycle, then the system may or may not be in a deadlocked state. This observation is important when we deal with the deadlock problem.

7.2.2.1 Deadlock in a Multithreaded Java Program

Before we proceed to a discussion of handling deadlocks, let's see how deadlock can occur in a multithreaded Java program, as shown in Figure 7.4. In this example, we have two threads—threadA and threadB—as well as two reentrant locks—first and second. (Recall from Chapter 6 that a reentrant lock acts as a simple mutual exclusion lock.) threadA attempts to acquire the locks in the order (1) first, (2) second, while threadB attempts to acquire them in the order (1) second, (2) first.

    class A implements Runnable {
       private Lock first, second;

       public A(Lock first, Lock second) {
          this.first = first;
          this.second = second;
       }

       public void run() {
          try {
             first.lock();
             // do something
             second.lock();
             // do something else
          }
          finally {
             first.unlock();
             second.unlock();
          }
       }
    }

    class B implements Runnable {
       private Lock first, second;

       public B(Lock first, Lock second) {
          this.first = first;
          this.second = second;
       }

       public void run() {
          try {
             second.lock();
             // do something
             first.lock();
             // do something else
          }
          finally {
             second.unlock();
             first.unlock();
          }
       }
    }

    public class DeadlockExample {
       // Figure 7.5
    }

    Figure 7.4  Deadlock example.
7.2.2.1 Deadlock in a Multithreaded Java Program

Before we proceed to a discussion of handling deadlocks, let's see how deadlock can occur in a multithreaded Java program, as shown in Figure 7.4. In this example, we have two threads—threadA and threadB—as well as two reentrant locks—first and second. (Recall from Chapter 6 that a reentrant lock acts as a simple mutual exclusion lock.) threadA attempts to acquire the locks in the order (1) first, (2) second; threadB attempts to acquire them in the order (1) second, (2) first. Deadlock is possible in the following scenario:

threadA → second → threadB → first → threadA

import java.util.concurrent.locks.*;

class A implements Runnable {
    private Lock first, second;

    public A(Lock first, Lock second) {
        this.first = first;
        this.second = second;
    }

    public void run() {
        try {
            first.lock();
            // do something
            second.lock();
            // do something else
        } finally {
            first.unlock();
            second.unlock();
        }
    }
}

class B implements Runnable {
    private Lock first, second;

    public B(Lock first, Lock second) {
        this.first = first;
        this.second = second;
    }

    public void run() {
        try {
            second.lock();
            // do something
            first.lock();
            // do something else
        } finally {
            second.unlock();
            first.unlock();
        }
    }
}

public class DeadlockExample {
    // Figure 7.5
}

Figure 7.4 Deadlock example.

public static void main(String[] args) {
    Lock lockX = new ReentrantLock();
    Lock lockY = new ReentrantLock();

    Thread threadA = new Thread(new A(lockX, lockY));
    Thread threadB = new Thread(new B(lockX, lockY));

    threadA.start();
    threadB.start();
}

Figure 7.5 Creating the threads (continuation of Figure 7.4).

Note that, even though deadlock is possible, it will not occur if threadA is able to acquire and release the locks for first and second before threadB attempts to acquire the locks. This example illustrates a problem with handling deadlocks: it is difficult to identify and test for deadlocks that may occur only under certain circumstances.
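One defensive idiom, which the figure does not use, is to acquire the second lock with a timed tryLock() and back off on failure, so that a thread never blocks indefinitely while holding a lock. The following is a minimal sketch of that idea (the method name and timeout values are our own; tryLock(long, TimeUnit) is part of the java.util.concurrent.locks.Lock interface):

// requires: import java.util.concurrent.TimeUnit;
//           import java.util.concurrent.locks.Lock;
static void doWorkWithBothLocks(Lock first, Lock second)
        throws InterruptedException {
    while (true) {
        first.lock();
        try {
            // wait up to 50 ms for the second lock instead of blocking forever
            if (second.tryLock(50, TimeUnit.MILLISECONDS)) {
                try {
                    // do the work that needs both locks
                    return;
                } finally {
                    second.unlock();
                }
            }
        } finally {
            first.unlock();
        }
        // could not get both locks: release everything, back off briefly, retry
        Thread.sleep((long) (Math.random() * 10));
    }
}

The random backoff reduces (but does not eliminate) the chance that two threads repeatedly fail and retry in lockstep, a livelock rather than a deadlock.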
7.3 Methods for Handling Deadlocks

Generally speaking, we can deal with the deadlock problem in one of three ways:

• We can use a protocol to prevent or avoid deadlocks, ensuring that the system will never enter a deadlocked state.
• We can allow the system to enter a deadlocked state, detect it, and recover.
• We can ignore the problem altogether and pretend that deadlocks never occur in the system.

The third solution is the one used by most operating systems, including UNIX and Windows. The JVM also does nothing to manage deadlocks. It is up to the application developer to write programs that handle deadlocks on these systems.

Next, we elaborate briefly on each of the three methods for handling deadlocks. Then, in Sections 7.4 through 7.7, we present detailed algorithms. Before proceeding, we should mention that some researchers have argued that none of the basic approaches alone is appropriate for the entire spectrum of resource-allocation problems in operating systems. The basic approaches can be combined, however, allowing us to select an optimal approach for each class of resources in a system.

7.3.1 Three Major Methods

To ensure that deadlocks never occur, the system can use either a deadlock-prevention or a deadlock-avoidance scheme. Deadlock prevention provides a set of methods for ensuring that at least one of the necessary conditions (Section 7.2.1) cannot hold. These methods prevent deadlocks by constraining how requests for resources can be made. We discuss these methods in Section 7.4.

Deadlock avoidance requires that the operating system be given in advance additional information concerning which resources a process will request and use during its lifetime. With this additional knowledge, it can decide for each request whether or not the process should wait. To decide whether the current request can be satisfied or must be delayed, the system must consider the resources currently available, the resources currently allocated to each process, and the future requests and releases of each process. We discuss these schemes in Section 7.5.

If a system does not employ either a deadlock-prevention or a deadlock-avoidance algorithm, then a deadlock situation may arise. In this environment, the system can provide an algorithm that examines the state of the system to determine whether a deadlock has occurred and an algorithm to recover from the deadlock (if a deadlock has indeed occurred). We discuss these issues in Sections 7.6 and 7.7.

In the absence of algorithms to detect and recover from deadlocks, we may arrive at a situation in which the system is in a deadlocked state yet has no way of recognizing what has happened. In this case, the undetected deadlock will result in deterioration of the system's performance, because resources are being held by processes that cannot run and because more and more processes, as they make requests for resources, will enter a deadlocked state. Eventually, the system will stop functioning and will need to be restarted manually.

Although this method may not seem to be a viable approach to the deadlock problem, it is nevertheless used in most operating systems, as mentioned earlier. In many systems, deadlocks occur infrequently (say, once per year); thus, this method is cheaper than the prevention, avoidance, or detection-and-recovery methods, which must be used constantly. Also, in some circumstances, a system is in a frozen state but not in a deadlocked state. We see this situation, for example, with a real-time process running at the highest priority (or any process running on a nonpreemptive scheduler) and never returning control to the operating system. The system must have manual recovery methods for such conditions and may simply use those techniques for deadlock recovery.

7.3.2 Handling Deadlocks in Java

As noted earlier, the JVM does nothing to manage deadlocks; it is up to the application developer to write programs that are deadlock-free. In the remainder of this section, we illustrate how deadlock is possible when selected methods of the core Java API are used and how the programmer can develop programs that appropriately handle deadlock.

In Chapter 4, we introduced Java threads and some of the API that allows users to create and manipulate threads. Two additional methods of the Thread class are the suspend() and resume() methods, which were deprecated in later versions of the Java API because they could lead to deadlock. (A deprecated method is still part of the Java API, but its use is discouraged.)

The suspend() method suspends execution of the currently running thread. The resume() method resumes execution of a suspended thread. Once a thread has been suspended, it can continue only if another thread resumes it. Furthermore, a suspended thread continues to hold all locks while it is blocked. Deadlock is possible if a suspended thread holds a lock on an object and the thread that can resume it must own this lock before it can resume the suspended thread.
import java.applet.*;
import java.awt.*;

public class ClockApplet extends Applet implements Runnable {
    private Thread clockThread;
    private boolean ok = false;
    private Object mutex = new Object();

    public void run() {
        while (true) {
            try {
                // sleep for 1 second
                Thread.sleep(1000);
                // repaint the date and time
                repaint();
                // see if we need to suspend ourself
                synchronized (mutex) {
                    while (ok == false)
                        mutex.wait();
                }
            } catch (InterruptedException e) { }
        }
    }

    public void start() {
        // Figure 7.7
    }

    public void stop() {
        // Figure 7.7
    }

    public void paint(Graphics g) {
        g.drawString(new java.util.Date().toString(), 10, 30);
    }
}

Figure 7.6 Applet that displays the date and time of day.

Another method, stop(), has been deprecated as well, but not because it can lead to deadlock. Unlike the situation in which a thread has been suspended, when a thread has been stopped, it releases all the locks that it owns. However, locks are generally used in the following progression: (1) acquire the lock, (2) access a shared data structure, and (3) release the lock. If a thread is in the middle of step 2 when it is stopped, it will release the lock; but it may leave the shared data structure in an inconsistent state. In Section 4.5.2, we discussed how to terminate a thread using deferred cancellation rather than asynchronously canceling a thread using the stop() method.

Here, we present a strategy for suspending and resuming a thread without using the deprecated suspend() and resume() methods. The program shown in Figure 7.6 is a multithreaded applet that displays the time of day. When this applet starts, it creates a second thread (which we will call the clock thread) that outputs the time of day. The run() method of the clock thread alternates between sleeping for one second and then calling the repaint() method. The repaint() method ultimately calls the paint() method, which draws the current date and time in the browser's window.

This applet is designed so that the clock thread is running while the applet is visible; if the applet is not being displayed (as when the browser window has been minimized), the clock thread is suspended from execution. This is accomplished by overriding the start() and stop() methods of the Applet class. (Be careful not to confuse these with the start() and stop() methods of the Thread class.) The start() method of an applet is called when an applet is first created. If the user leaves the web page, if the applet scrolls off the screen, or if the browser window is minimized, the applet's stop() method is called. If the user returns to the applet's web page, the applet's start() method is called again.

The applet uses the Boolean variable ok to indicate whether the clock thread can run. This variable will be set to true in the start() method of the applet, indicating that the clock thread can run. The stop() method of the applet will set it to false. The clock thread will check the value of this Boolean variable in its run() method and will only proceed if it is true.

/**
 * This method is called when the applet is
 * started or we return to the applet.
 */
public void start() {
    ok = true;
    if (clockThread == null) {
        clockThread = new Thread(this);
        clockThread.start();
    }
    else {
        synchronized (mutex) {
            mutex.notify();
        }
    }
}

/**
 * This method is called when we
 * leave the page the applet is on.
 */
public void stop() {
    synchronized (mutex) {
        ok = false;
    }
}

Figure 7.7 start() and stop() methods for the applet (continuation of Figure 7.6).
Because the thread for the applet and the clock thread share this variable, access to it is controlled through a synchronized block. This program is shown in Figure 7.7. If the clock thread sees that the Boolean value is false, it suspends itself by calling the wait() method for the object mutex. When the applet wishes to resume the clock thread, it sets the Boolean variable to true and calls notify() for the mutex object. This call to notify() awakens the clock thread. It checks the value of the Boolean variable and, seeing that it is now true, proceeds in its run() method, displaying the date and time.
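The same wait()/notify() idiom is not tied to applets. As a minimal standalone sketch (the class and method names are ours), a worker thread can be paused and resumed safely like this:

// A worker that can be paused and resumed, using the same
// ok-flag-plus-monitor idiom as Figures 7.6 and 7.7.
public class PausableWorker implements Runnable {
    private boolean ok = true;
    private final Object mutex = new Object();

    public void run() {
        while (true) {
            try {
                Thread.sleep(1000);
                System.out.println(new java.util.Date());
                synchronized (mutex) {
                    while (!ok)
                        mutex.wait();   // releases the monitor while blocked
                }
            } catch (InterruptedException e) {
                return;                 // allow orderly shutdown
            }
        }
    }

    public void pause() {
        synchronized (mutex) { ok = false; }
    }

    public void resumeWork() {
        synchronized (mutex) {
            ok = true;
            mutex.notify();             // wake the worker if it is waiting
        }
    }
}

Unlike the deprecated suspend(), the worker pauses itself only at a point of its own choosing, after it has released the monitor via wait(), so it cannot be frozen while holding a lock.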
7.4 Deadlock Prevention

As we noted in Section 7.2.1, for a deadlock to occur, each of the four necessary conditions must hold. By ensuring that at least one of these conditions cannot hold, we can prevent the occurrence of a deadlock. We elaborate on this approach by examining each of the four necessary conditions separately.

7.4.1 Mutual Exclusion

The mutual-exclusion condition must hold for nonsharable resources. For example, a printer cannot be simultaneously shared by several processes. Sharable resources, in contrast, do not require mutually exclusive access and thus cannot be involved in a deadlock. Read-only files are a good example of a sharable resource. If several processes attempt to open a read-only file at the same time, they can be granted simultaneous access to the file. A process never needs to wait for a sharable resource. In general, however, we cannot prevent deadlocks by denying the mutual-exclusion condition, because some resources are intrinsically nonsharable.

7.4.2 Hold and Wait

To ensure that the hold-and-wait condition never occurs in the system, we must guarantee that, whenever a process requests a resource, it does not hold any other resources. One protocol that can be used requires each process to request and be allocated all its resources before it begins execution. We can implement this provision by requiring that system calls requesting resources for a process precede all other system calls.

An alternative protocol allows a process to request resources only when it has none. A process may request some resources and use them. Before it can request any additional resources, however, it must release all the resources that it is currently allocated.

To illustrate the difference between these two protocols, we consider a process that copies data from a DVD drive to a file on disk, sorts the file, and then prints the results to a printer. If all resources must be requested at the beginning of the process, then the process must initially request the DVD drive, disk file, and printer. It will hold the printer for its entire execution, even though it needs the printer only at the end.

The second method allows the process to request initially only the DVD drive and disk file. It copies from the DVD drive to the disk and then releases both the DVD drive and the disk file. The process must then again request the disk file and the printer. After copying the disk file to the printer, it releases these two resources and terminates.

Both these protocols have two main disadvantages. First, resource utilization may be low, since resources may be allocated but unused for a long period. In our example, for instance, we can release the DVD drive and disk file, and then again request the disk file and printer, only if we can be sure that our data will remain on the disk file. Otherwise, we must request all resources at the beginning for both protocols. Second, starvation is possible. A process that needs several popular resources may have to wait indefinitely, because at least one of the resources that it needs is always allocated to some other process.

7.4.3 No Preemption

The third necessary condition for deadlocks is that there be no preemption of resources that have already been allocated. To ensure that this condition does not hold, we can use the following protocol. If a process is holding some resources and requests another resource that cannot be immediately allocated to it (that is, the process must wait), then all resources the process is currently holding are preempted. In other words, these resources are implicitly released. The preempted resources are added to the list of resources for which the process is waiting. The process will be restarted only when it can regain its old resources, as well as the new ones that it is requesting.

Alternatively, if a process requests some resources, we first check whether they are available. If they are, we allocate them. If they are not, we check whether they are allocated to some other process that is waiting for additional resources. If so, we preempt the desired resources from the waiting process and allocate them to the requesting process. If the resources are neither available nor held by a waiting process, the requesting process must wait. While it is waiting, some of its resources may be preempted, but only if another process requests them. A process can be restarted only when it is allocated the new resources it is requesting and recovers any resources that were preempted while it was waiting.

This protocol is often applied to resources whose state can be easily saved and restored later, such as CPU registers and memory space. It cannot generally be applied to such resources as printers and tape drives.

7.4.4 Circular Wait

The fourth and final condition for deadlocks is the circular-wait condition. One way to ensure that this condition never holds is to impose a total ordering of all resource types and to require that each process requests resources in an increasing order of enumeration.

To illustrate, we let R = {R1, R2, ..., Rm} be the set of resource types. We assign to each resource type a unique integer number, which allows us to compare two resources and to determine whether one precedes another in our ordering. Formally, we define a one-to-one function F: R → N, where N is the set of natural numbers. For example, if the set of resource types R includes tape drives, disk drives, and printers, then the function F might be defined as follows:

F(tape drive) = 1
F(disk drive) = 5
F(printer) = 12

We can now consider the following protocol to prevent deadlocks: Each process can request resources only in an increasing order of enumeration. That is, a process can initially request any number of instances of a resource type—say, Ri. After that, the process can request instances of resource type Rj if and only if F(Rj) > F(Ri). For example, using the function defined previously, a process that wants to use the tape drive and printer at the same time must first request the tape drive and then request the printer. Alternatively, we can require that a process requesting an instance of resource type Rj must have released any resources Ri such that F(Ri) ≥ F(Rj).
Note also that if several instances of the same resource type are needed, a single request for all of them must be issued.

If these two protocols are used, then the circular-wait condition cannot hold. We can demonstrate this fact by assuming that a circular wait exists (proof by contradiction). Let the set of processes involved in the circular wait be {P0, P1, ..., Pn}, where Pi is waiting for a resource Ri, which is held by process Pi+1. (Modulo arithmetic is used on the indexes, so that Pn is waiting for a resource Rn held by P0.) Then, since process Pi+1 is holding resource Ri while requesting resource Ri+1, we must have F(Ri) < F(Ri+1) for all i. But this condition means that F(R0) < F(R1) < ... < F(Rn) < F(R0). By transitivity, F(R0) < F(R0), which is impossible. Therefore, there can be no circular wait.

We can accomplish this scheme in an application program by developing an ordering among all synchronization objects in the system. All requests for synchronization objects must be made in increasing order. For example, if the lock ordering in the Java program shown in Figure 7.4 were

F(first) = 1
F(second) = 5

then threadB could not request the locks out of order.

Keep in mind that developing an ordering, or hierarchy, does not in itself prevent deadlock. It is up to application developers to write programs that follow the ordering. Also note that the function F should be defined according to the normal order of usage of the resources in a system. For example, because the tape drive is usually needed before the printer, it would be reasonable to specify that F(tape drive) < F(printer).

To develop a lock ordering, Java programmers are encouraged to use the method System.identityHashCode(), which returns the value that the object's default hashCode() method would return. For example, to obtain the identityHashCode() values for the first and second locks in the Java program shown in Figure 7.4, you would use the following statements:

int firstOrderingValue = System.identityHashCode(first);
int secondOrderingValue = System.identityHashCode(second);

Although ensuring that resources are acquired in the proper order is the responsibility of application developers, certain software can be used to verify that locks are acquired in the proper order and to give appropriate warnings when locks are acquired out of order and deadlock is possible. One lock-order verifier, which works on BSD versions of UNIX such as FreeBSD, is known as witness. Witness uses mutual-exclusion locks to protect critical sections, as described in Chapter 6; it works by dynamically maintaining the relationship of lock orders in a system. Let's use the program shown in Figure 7.4 as an example. Assume that threadA is the first to acquire the locks and does so in the order (1) first, (2) second. Witness records the relationship that first must be acquired before second. If threadB later attempts to acquire the locks out of order, witness generates a warning message on the system console.

Finally, it is important to note that imposing a lock ordering does not guarantee deadlock prevention if locks can be acquired dynamically. For example, assume we have a method that transfers funds between two accounts. To prevent a race condition, we use the object lock associated with each Account object in a synchronized block. The code appears as follows:

void transaction(Account from, Account to, double amount) {
    synchronized (from) {
        synchronized (to) {
            from.withdraw(amount);
            to.deposit(amount);
        }
    }
}

Deadlock is possible if two threads simultaneously invoke the transaction() method, transposing different accounts. That is, one thread might invoke

transaction(checkingAccount, savingsAccount, 25);

and another might invoke

transaction(savingsAccount, checkingAccount, 50);

We leave it as an exercise for the reader to figure out a solution to fix this situation.
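As a hint toward one possible fix (this sketch is ours, not the text's solution, and it is also the idea behind Programming Problem 7.22): order the two monitors by System.identityHashCode(), so that every caller acquires them in the same global order regardless of argument order. A tie between the two hash codes, although rare, would require a separate tie-breaking lock, which we omit here:

void transaction(Account from, Account to, double amount) {
    // Pick a global acquisition order based on identity hash codes,
    // so transposed arguments still lock in the same order.
    Object first = System.identityHashCode(from) < System.identityHashCode(to)
                   ? from : to;
    Object second = (first == from) ? to : from;
    synchronized (first) {
        synchronized (second) {
            from.withdraw(amount);
            to.deposit(amount);
        }
    }
}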
7.5 Deadlock Avoidance

Deadlock-prevention algorithms, as discussed in Section 7.4, prevent deadlocks by limiting how requests can be made. The limits ensure that at least one of the necessary conditions for deadlock cannot occur and, hence, that deadlocks cannot occur. Possible side effects of preventing deadlocks by this method, however, are low device utilization and reduced system throughput.

An alternative method for avoiding deadlocks is to require additional information about how resources are to be requested. For example, in a system with one tape drive and one printer, the system might need to know that process P will request first the tape drive and then the printer before releasing both resources, whereas process Q will request first the printer and then the tape drive. With this knowledge of the complete sequence of requests and releases for each process, the system can decide for each request whether or not the process should wait in order to avoid a possible future deadlock. Each request requires that, in making this decision, the system consider the resources currently available, the resources currently allocated to each process, and the future requests and releases of each process.

The various algorithms that use this approach differ in the amount and type of information required. The simplest and most useful model requires that each process declare the maximum number of resources of each type that it may need. Given this a priori information, it is possible to construct an algorithm that ensures that the system will never enter a deadlocked state. Such an algorithm defines the deadlock-avoidance approach. A deadlock-avoidance algorithm dynamically examines the resource-allocation state to ensure that a circular-wait condition can never exist. The resource-allocation state is defined by the number of available and allocated resources and the maximum demands of the processes. In the following sections, we explore two deadlock-avoidance algorithms.

7.5.1 Safe State

A state is safe if the system can allocate resources to each process (up to its maximum) in some order and still avoid a deadlock. More formally, a system is in a safe state only if there exists a safe sequence. A sequence of processes <P1, P2, ..., Pn> is a safe sequence for the current allocation state if, for each Pi, the resource requests that Pi can make can be satisfied by the currently available resources plus the resources held by all Pj, with j < i. In this situation, if the resources that Pi needs are not immediately available, then Pi can wait until all Pj have finished. When they have finished, Pi can obtain all of its needed resources, complete its designated task, return its allocated resources, and terminate. When Pi terminates, Pi+1 can obtain its needed resources, and so on. If no such sequence exists, then the system state is said to be unsafe.

A safe state is not a deadlocked state. Conversely, a deadlocked state is an unsafe state.
Not all unsafe states are deadlocks, however (Figure 7.8). An unsafe state may lead to a deadlock. As long as the state is safe, the operating system can avoid unsafe (and deadlocked) states. In an unsafe state, the operating system cannot prevent processes from requesting resources in such a way that a deadlock occurs. The behavior of the processes controls unsafe states.

Figure 7.8 Safe, unsafe, and deadlocked state spaces.

To illustrate, we consider a system with twelve magnetic tape drives and three processes: P0, P1, and P2. Process P0 requires ten tape drives, process P1 may need as many as four tape drives, and process P2 may need up to nine tape drives. Suppose that, at time t0, process P0 is holding five tape drives, process P1 is holding two tape drives, and process P2 is holding two tape drives. (Thus, there are three free tape drives.)

      Maximum Needs   Current Needs
P0         10               5
P1          4               2
P2          9               2

At time t0, the system is in a safe state. The sequence <P1, P0, P2> satisfies the safety condition. Process P1 can immediately be allocated all its tape drives and then return them (the system will then have five available tape drives); then process P0 can get all its tape drives and return them (the system will then have ten available tape drives); and finally process P2 can get all its tape drives and return them (the system will then have all twelve tape drives available).

A system can go from a safe state to an unsafe state. Suppose that, at time t1, process P2 requests and is allocated one more tape drive. The system is no longer in a safe state. At this point, only process P1 can be allocated all its tape drives. When it returns them, the system will have only four available tape drives. Since process P0 is allocated five tape drives but has a maximum of ten, it may request five more tape drives. If it does so, it will have to wait, because they are unavailable. Similarly, process P2 may request six additional tape drives and, if it does, will have to wait, resulting in a deadlock. Our mistake was in granting the request from process P2 for one more tape drive. If we had made P2 wait until either of the other processes had finished and released its resources, then we could have avoided the deadlock.

Given the concept of a safe state, we can define avoidance algorithms that ensure that the system will never deadlock. The idea is simply to ensure that the system will always remain in a safe state. Initially, the system is in a safe state. Whenever a process requests a resource that is currently available, the system must decide whether the resource can be allocated immediately or whether the process must wait. The request is granted only if the allocation leaves the system in a safe state.

In this scheme, if a process requests a resource that is currently available, it may still have to wait. Thus, resource utilization may be lower than it would otherwise be.

7.5.2 Resource-Allocation-Graph Algorithm

If we have a resource-allocation system with only one instance of each resource type, we can use a variant of the resource-allocation graph defined in Section 7.2.2 for deadlock avoidance. In addition to the request and assignment edges already described, we introduce a new type of edge, called a claim edge. A claim edge Pi → Rj indicates that process Pi may request resource Rj at some time in the future. This edge resembles a request edge in direction but is represented in the graph by a dashed line.
When process Pi requests resource Rj, the claim edge Pi → Rj is converted to a request edge. Similarly, when a resource Rj is released by Pi, the assignment edge Rj → Pi is reconverted to a claim edge Pi → Rj.

We note that the resources must be claimed a priori in the system. That is, before process Pi starts executing, all its claim edges must already appear in the resource-allocation graph. We can relax this condition by allowing a claim edge Pi → Rj to be added to the graph only if all the edges associated with process Pi are claim edges.

Now suppose that process Pi requests resource Rj. The request can be granted only if converting the request edge Pi → Rj to an assignment edge Rj → Pi does not result in the formation of a cycle in the resource-allocation graph. We check for safety by using a cycle-detection algorithm. An algorithm for detecting a cycle in this graph requires an order of n² operations, where n is the number of processes in the system. If no cycle exists, then the allocation of the resource will leave the system in a safe state. If a cycle is found, then the allocation will put the system in an unsafe state. In that case, process Pi will have to wait for its requests to be satisfied.

To illustrate this algorithm, we consider the resource-allocation graph of Figure 7.9. Suppose that P2 requests R2. Although R2 is currently free, we cannot allocate it to P2, since this action will create a cycle in the graph (Figure 7.10). A cycle, as mentioned, indicates that the system is in an unsafe state. If P1 requests R2, and P2 requests R1, then a deadlock will occur.

Figure 7.9 Resource-allocation graph for deadlock avoidance.

Figure 7.10 An unsafe state in a resource-allocation graph.

7.5.3 Banker's Algorithm

The resource-allocation-graph algorithm is not applicable to a resource-allocation system with multiple instances of each resource type. The deadlock-avoidance algorithm that we describe next is applicable to such a system but is less efficient than the resource-allocation-graph scheme. This algorithm is commonly known as the banker's algorithm. The name was chosen because the algorithm could be used in a banking system to ensure that the bank never allocated its available cash in such a way that it could no longer satisfy the needs of all its customers.

When a new process enters the system, it must declare the maximum number of instances of each resource type that it may need. This number may not exceed the total number of resources in the system. When a user requests a set of resources, the system must determine whether the allocation of these resources will leave the system in a safe state. If it will, the resources are allocated; otherwise, the process must wait until some other process releases enough resources.

Several data structures must be maintained to implement the banker's algorithm. These data structures encode the state of the resource-allocation system. We need the following data structures, where n is the number of processes in the system and m is the number of resource types:

• Available. A vector of length m indicates the number of available resources of each type. If Available[j] equals k, then k instances of resource type Rj are available.
• Max. An n × m matrix defines the maximum demand of each process. If Max[i][j] equals k, then process Pi may request at most k instances of resource type Rj.
• Allocation. An n × m matrix defines the number of resources of each type currently allocated to each process.
If Allocation[i][j] equals k, then process Pi is currently allocated k instances of resource type Rj.

• Need. An n × m matrix indicates the remaining resource need of each process. If Need[i][j] equals k, then process Pi may need k more instances of resource type Rj to complete its task. Note that Need[i][j] equals Max[i][j] − Allocation[i][j].

These data structures vary over time in both size and value.

To simplify the presentation of the banker's algorithm, we next establish some notation. Let X and Y be vectors of length n. We say that X ≤ Y if and only if X[i] ≤ Y[i] for all i = 1, 2, ..., n. For example, if X = (1,7,3,2) and Y = (0,3,2,1), then Y ≤ X. In addition, Y < X if Y ≤ X and Y ≠ X.

We can treat each row in the matrices Allocation and Need as vectors and refer to them as Allocation_i and Need_i. The vector Allocation_i specifies the resources currently allocated to process Pi; the vector Need_i specifies the additional resources that process Pi may still request to complete its task.

7.5.3.1 Safety Algorithm

We can now present the algorithm for finding out whether or not a system is in a safe state. This algorithm can be described as follows:

1. Let Work and Finish be vectors of length m and n, respectively. Initialize Work = Available and Finish[i] = false for i = 0, 1, ..., n − 1.
2. Find an index i such that both
   a. Finish[i] == false
   b. Need_i ≤ Work
   If no such i exists, go to step 4.
3. Work = Work + Allocation_i
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == true for all i, then the system is in a safe state.

This algorithm may require an order of m × n² operations to determine whether a state is safe.
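Translated directly into Java, the safety algorithm might look as follows. This is a minimal sketch with field and method names of our own; the arrays hold the data structures defined above:

// Returns true if the state described by the three structures is safe,
// following the four steps of the safety algorithm literally.
static boolean isSafe(int[] available, int[][] allocation, int[][] need) {
    int n = allocation.length;       // number of processes
    int m = available.length;        // number of resource types
    int[] work = available.clone();  // step 1
    boolean[] finish = new boolean[n];

    boolean progress = true;
    while (progress) {               // steps 2 and 3
        progress = false;
        for (int i = 0; i < n; i++) {
            if (!finish[i] && lessOrEqual(need[i], work)) {
                for (int j = 0; j < m; j++)
                    work[j] += allocation[i][j];  // reclaim Pi's resources
                finish[i] = true;
                progress = true;
            }
        }
    }
    for (boolean f : finish)         // step 4
        if (!f) return false;
    return true;
}

// The vector relation X <= Y used throughout this section.
static boolean lessOrEqual(int[] x, int[] y) {
    for (int j = 0; j < x.length; j++)
        if (x[j] > y[j]) return false;
    return true;
}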
7.5.3.2 Resource-Request Algorithm

Next, we describe the algorithm for determining whether requests can be safely granted. Let Request_i be the request vector for process Pi. If Request_i[j] == k, then process Pi wants k instances of resource type Rj. When a request for resources is made by process Pi, the following actions are taken:

1. If Request_i ≤ Need_i, go to step 2. Otherwise, raise an error condition, since the process has exceeded its maximum claim.
2. If Request_i ≤ Available, go to step 3. Otherwise, Pi must wait, since the resources are not available.
3. Have the system pretend to have allocated the requested resources to process Pi by modifying the state as follows:
   Available = Available − Request_i;
   Allocation_i = Allocation_i + Request_i;
   Need_i = Need_i − Request_i;
   If the resulting resource-allocation state is safe, the transaction is completed, and process Pi is allocated its resources. However, if the new state is unsafe, then Pi must wait for Request_i, and the old resource-allocation state is restored.

7.5.3.3 An Illustrative Example

To illustrate the use of the banker's algorithm, consider a system with five processes P0 through P4 and three resource types A, B, and C. Resource type A has ten instances, resource type B has five instances, and resource type C has seven instances. Suppose that, at time T0, the following snapshot of the system has been taken:

      Allocation   Max      Available
      A B C        A B C    A B C
P0    0 1 0        7 5 3    3 3 2
P1    2 0 0        3 2 2
P2    3 0 2        9 0 2
P3    2 1 1        2 2 2
P4    0 0 2        4 3 3

The content of the matrix Need is defined to be Max − Allocation and is as follows:

      Need
      A B C
P0    7 4 3
P1    1 2 2
P2    6 0 0
P3    0 1 1
P4    4 3 1

We claim that the system is currently in a safe state. Indeed, the sequence <P1, P3, P4, P2, P0> satisfies the safety criteria. Suppose now that process P1 requests one additional instance of resource type A and two instances of resource type C, so Request_1 = (1,0,2). To decide whether this request can be immediately granted, we first check that Request_1 ≤ Available—that is, that (1,0,2) ≤ (3,3,2), which is true. We then pretend that this request has been fulfilled, and we arrive at the following new state:

      Allocation   Need     Available
      A B C        A B C    A B C
P0    0 1 0        7 4 3    2 3 0
P1    3 0 2        0 2 0
P2    3 0 2        6 0 0
P3    2 1 1        0 1 1
P4    0 0 2        4 3 1

We must determine whether this new system state is safe. To do so, we execute our safety algorithm and find that the sequence <P1, P3, P4, P0, P2> satisfies the safety requirement. Hence, we can immediately grant the request of process P1. You should be able to see, however, that when the system is in this state, a request for (3,3,0) by P4 cannot be granted, since the resources are not available. Furthermore, a request for (0,2,0) by P0 cannot be granted, even though the resources are available, since the resulting state is unsafe. We leave it as a programming exercise for students to implement the banker's algorithm.
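As a starting point for such an implementation, the resource-request algorithm of Section 7.5.3.2 translates almost line for line into Java. In this sketch, available, allocation, and need are the data structures defined earlier, and isSafe() and lessOrEqual() are the helpers from the safety-algorithm sketch above:

// Sketch of the resource-request algorithm (our method name).
// Commits the allocation and returns true only if the resulting
// state is safe; otherwise the state is left unchanged.
boolean requestResources(int i, int[] request) {
    if (!lessOrEqual(request, need[i]))              // step 1
        throw new IllegalArgumentException("exceeded maximum claim");
    if (!lessOrEqual(request, available))            // step 2
        return false;                                // process must wait

    for (int j = 0; j < available.length; j++) {     // step 3: pretend
        available[j] -= request[j];
        allocation[i][j] += request[j];
        need[i][j] -= request[j];
    }
    if (isSafe(available, allocation, need))
        return true;                                 // grant the request

    for (int j = 0; j < available.length; j++) {     // unsafe: roll back
        available[j] += request[j];
        allocation[i][j] -= request[j];
        need[i][j] += request[j];
    }
    return false;                                    // process must wait
}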
7.6 Deadlock Detection

As we have noted, if a system does not employ either a deadlock-prevention or a deadlock-avoidance algorithm, then a deadlock situation may occur. In this environment, the system may provide:

• An algorithm that examines the state of the system to determine whether a deadlock has occurred
• An algorithm to recover from the deadlock

In the following discussion, we elaborate on these two requirements as they pertain to systems with only a single instance of each resource type, as well as to systems with several instances of each resource type. At this point, however, we note that a detection-and-recovery scheme requires overhead that includes not only the run-time costs of maintaining the necessary information and executing the detection algorithm but also the potential losses inherent in recovering from a deadlock.

7.6.1 Single Instance of Each Resource Type

If all resources have only a single instance, then we can define a deadlock-detection algorithm that uses a variant of the resource-allocation graph, called a wait-for graph. We obtain this graph from the resource-allocation graph by removing the resource nodes and collapsing the appropriate edges.

More precisely, an edge from Pi to Pj in a wait-for graph implies that process Pi is waiting for process Pj to release a resource that Pi needs. An edge Pi → Pj exists in a wait-for graph if and only if the corresponding resource-allocation graph contains two edges Pi → Rq and Rq → Pj for some resource Rq. For example, in Figure 7.11, we present a resource-allocation graph and the corresponding wait-for graph.

Figure 7.11 (a) Resource-allocation graph. (b) Corresponding wait-for graph.

As before, a deadlock exists in the system if and only if the wait-for graph contains a cycle. To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an algorithm that searches for a cycle in the graph. An algorithm to detect a cycle in a graph requires an order of n² operations, where n is the number of vertices in the graph.

7.6.2 Several Instances of a Resource Type

The wait-for graph scheme is not applicable to a resource-allocation system with multiple instances of each resource type. We turn now to a deadlock-detection algorithm that is applicable to such a system. The algorithm employs several time-varying data structures that are similar to those used in the banker's algorithm (Section 7.5.3):

• Available. A vector of length m indicates the number of available resources of each type.
• Allocation. An n × m matrix defines the number of resources of each type currently allocated to each process.
• Request. An n × m matrix indicates the current request of each process. If Request[i][j] equals k, then process Pi is requesting k more instances of resource type Rj.

The ≤ relation between two vectors is defined as in Section 7.5.3. To simplify notation, we again treat the rows in the matrices Allocation and Request as vectors; we refer to them as Allocation_i and Request_i. The detection algorithm described here simply investigates every possible allocation sequence for the processes that remain to be completed. Compare this algorithm with the banker's algorithm of Section 7.5.3.

1. Let Work and Finish be vectors of length m and n, respectively. Initialize Work = Available. For i = 0, 1, ..., n − 1, if Allocation_i ≠ 0, then Finish[i] = false; otherwise, Finish[i] = true.
2. Find an index i such that both
   a. Finish[i] == false
   b. Request_i ≤ Work
   If no such i exists, go to step 4.
3. Work = Work + Allocation_i
   Finish[i] = true
   Go to step 2.
4. If Finish[i] == false for some i, 0 ≤ i < n, then the system is in a deadlocked state. Moreover, if Finish[i] == false, then process Pi is deadlocked.

This algorithm requires an order of m × n² operations to detect whether the system is in a deadlocked state.

You may wonder why we reclaim the resources of process Pi (in step 3) as soon as we determine that Request_i ≤ Work (in step 2b). We know that Pi is currently not involved in a deadlock (since Request_i ≤ Work). Thus, we take an optimistic attitude and assume that Pi will require no more resources to complete its task; it will thus soon return all currently allocated resources to the system. If our assumption is incorrect, a deadlock may occur later. That deadlock will be detected the next time the deadlock-detection algorithm is invoked.

To illustrate this algorithm, we consider a system with five processes P0 through P4 and three resource types A, B, and C. Resource type A has seven instances, resource type B has two instances, and resource type C has six instances. Suppose that, at time T0, we have the following resource-allocation state:

      Allocation   Request   Available
      A B C        A B C     A B C
P0    0 1 0        0 0 0     0 0 0
P1    2 0 0        2 0 2
P2    3 0 3        0 0 0
P3    2 1 1        1 0 0
P4    0 0 2        0 0 2

We claim that the system is not in a deadlocked state. Indeed, if we execute our algorithm, we will find that the sequence <P0, P2, P3, P1, P4> results in Finish[i] == true for all i.

Suppose now that process P2 makes one additional request for an instance of type C. The Request matrix is modified as follows:

      Request
      A B C
P0    0 0 0
P1    2 0 2
P2    0 0 1
P3    1 0 0
P4    0 0 2

We claim that the system is now deadlocked. Although we can reclaim the resources held by process P0, the number of available resources is not sufficient to fulfill the requests of the other processes. Thus, a deadlock exists, consisting of processes P1, P2, P3, and P4.
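In code, the detection algorithm differs from the earlier safety-algorithm sketch only in its initialization of Finish and in testing Request rather than Need. A minimal sketch of our own, reusing the lessOrEqual() helper from above:

// Returns the indexes of deadlocked processes (empty list if none).
static java.util.List<Integer> findDeadlocked(int[] available,
                                              int[][] allocation,
                                              int[][] request) {
    int n = allocation.length, m = available.length;
    int[] work = available.clone();
    boolean[] finish = new boolean[n];
    for (int i = 0; i < n; i++) {
        // A process holding no resources cannot be part of a deadlock.
        boolean holdsNothing = true;
        for (int j = 0; j < m; j++)
            if (allocation[i][j] != 0) holdsNothing = false;
        finish[i] = holdsNothing;
    }
    boolean progress = true;
    while (progress) {                       // steps 2 and 3
        progress = false;
        for (int i = 0; i < n; i++) {
            if (!finish[i] && lessOrEqual(request[i], work)) {
                for (int j = 0; j < m; j++)
                    work[j] += allocation[i][j];
                finish[i] = true;
                progress = true;
            }
        }
    }
    java.util.List<Integer> deadlocked = new java.util.ArrayList<>();
    for (int i = 0; i < n; i++)              // step 4
        if (!finish[i]) deadlocked.add(i);
    return deadlocked;
}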
7.6.3 Detection-Algorithm Usage

When should we invoke the detection algorithm? The answer depends on two factors:

1. How often is a deadlock likely to occur?
2. How many processes will be affected by deadlock when it happens?

If deadlocks occur frequently, then the detection algorithm should be invoked frequently. Resources allocated to deadlocked processes will be idle until the deadlock can be broken. In addition, the number of processes involved in the deadlock cycle may grow.

Deadlocks occur only when some process makes a request that cannot be granted immediately. This request may be the final request that completes a chain of waiting processes. In the extreme, then, we can invoke the deadlock-detection algorithm every time a request for allocation cannot be granted immediately. In this case, we can identify not only the deadlocked set of processes but also the specific process that "caused" the deadlock. (In reality, each of the deadlocked processes is a link in the cycle in the resource graph, so all of them, jointly, caused the deadlock.) If there are many different resource types, one request may create many cycles in the resource graph, each cycle completed by the most recent request and "caused" by the one identifiable process.

Of course, invoking the deadlock-detection algorithm for every resource request will incur considerable overhead in computation time. A less expensive alternative is simply to invoke the algorithm at defined intervals—for example, once per hour or whenever CPU utilization drops below 40 percent. (A deadlock eventually cripples system throughput and causes CPU utilization to drop.) If the detection algorithm is invoked at arbitrary points in time, the resource graph may contain many cycles. In this case, we generally cannot tell which of the many deadlocked processes "caused" the deadlock.

7.7 Recovery from Deadlock

When a detection algorithm determines that a deadlock exists, several alternatives are available. One possibility is to inform the operator that a deadlock has occurred and to let the operator deal with the deadlock manually. Another possibility is to let the system recover from the deadlock automatically. There are two options for breaking a deadlock. One is simply to abort one or more processes to break the circular wait. The other is to preempt some resources from one or more of the deadlocked processes.

7.7.1 Process Termination

To eliminate deadlocks by aborting a process, we use one of two methods. In both methods, the system reclaims all resources allocated to the terminated processes.

• Abort all deadlocked processes. This method clearly will break the deadlock cycle, but at great expense; the deadlocked processes may have computed for a long time, and the results of these partial computations must be discarded and probably will have to be recomputed later.
• Abort one process at a time until the deadlock cycle is eliminated. This method incurs considerable overhead, since after each process is aborted, a deadlock-detection algorithm must be invoked to determine whether any processes are still deadlocked.

Aborting a process may not be easy. If the process was in the midst of updating a file, terminating it will leave that file in an incorrect state. Similarly, if the process was in the midst of printing data on a printer, the system must reset the printer to a correct state before printing the next job.

If the partial termination method is used, then we must determine which deadlocked process (or processes) should be terminated. This determination is a policy decision, similar to CPU-scheduling decisions. The question is basically an economic one; we should abort those processes whose termination will incur the minimum cost. Unfortunately, the term minimum cost is not a precise one. Many factors may affect which process is chosen, including:

1. What the priority of the process is
2. How long the process has computed and how much longer the process will compute before completing its designated task
3. How many and what types of resources the process has used (for example, whether the resources are simple to preempt)
4. How many more resources the process needs in order to complete
5. How many processes will need to be terminated
6. Whether the process is interactive or batch

7.7.2 Resource Preemption

To eliminate deadlocks using resource preemption, we successively preempt some resources from processes and give these resources to other processes until the deadlock cycle is broken.

If preemption is required to deal with deadlocks, then three issues need to be addressed:

1. Selecting a victim. Which resources and which processes are to be preempted? As in process termination, we must determine the order of preemption to minimize cost. Cost factors may include such parameters as the number of resources a deadlocked process is holding and the amount of time the process has thus far consumed during its execution.
2. Rollback. If we preempt a resource from a process, what should be done with that process? Clearly, it cannot continue with its normal execution; it is missing some needed resource. We must roll back the process to some safe state and restart it from that state. Since, in general, it is difficult to determine what a safe state is, the simplest solution is a total rollback: abort the process and then restart it. Although it is more effective to roll back the process only as far as necessary to break the deadlock, this method requires the system to keep more information about the state of all running processes.
3. Starvation. How do we ensure that starvation will not occur? That is, how can we guarantee that resources will not always be preempted from the same process? In a system where victim selection is based primarily on cost factors, it may happen that the same process is always picked as a victim. As a result, this process never completes its designated task—a starvation situation that must be dealt with in any practical system. Clearly, we must ensure that a process can be picked as a victim only a (small) finite number of times. The most common solution is to include the number of rollbacks in the cost factor.

7.8 Summary

A deadlocked state occurs when two or more processes are waiting indefinitely for an event that can be caused only by one of the waiting processes. There are three principal methods for dealing with deadlocks:

• Use some protocol to prevent or avoid deadlocks, ensuring that the system will never enter a deadlocked state.
• Allow the system to enter a deadlocked state, detect it, and then recover.
• Ignore the problem altogether and pretend that deadlocks never occur in the system.

The third solution is the one used by most operating systems, including UNIX and Windows.

A deadlock can occur only if four necessary conditions hold simultaneously in the system: mutual exclusion, hold and wait, no preemption, and circular wait. To prevent deadlocks, we can ensure that at least one of the necessary conditions never holds.

A method for avoiding deadlocks, rather than preventing them, requires that the operating system have a priori information about how each process will utilize system resources. The banker's algorithm, for example, requires a priori information about the maximum number of each resource class that each process may request. Using this information, we can define a deadlock-avoidance algorithm.
If a system does not employ a protocol to ensure that deadlocks will never occur, then a detection-and-recovery scheme may be employed. A deadlock-detection algorithm must be invoked to determine whether a deadlock has occurred. If a deadlock is detected, the system must recover either by terminating some of the deadlocked processes or by preempting resources from some of the deadlocked processes.

Where preemption is used to deal with deadlocks, three issues must be addressed: selecting a victim, rollback, and starvation. In a system that selects victims for rollback primarily on the basis of cost factors, starvation may occur, and some selected processes may never complete their designated tasks.

Researchers have argued that none of the basic approaches alone is appropriate for the entire spectrum of resource-allocation problems in operating systems. The basic approaches can be combined, however, allowing us to select an optimal approach for each class of resources in a system.

Practice Exercises

7.1 List three examples of deadlocks that are not related to a computer-system environment.

7.2 Suppose that a system is in an unsafe state. Show that it is possible for the processes to complete their execution without entering a deadlocked state.

7.3 A possible method for preventing deadlocks is to have a single, higher-order resource that must be requested before any other resource. For example, if multiple threads attempt to access the synchronization objects A···E, deadlock is possible. (Such synchronization objects may include mutexes, semaphores, condition variables, and the like.) We can prevent the deadlock by adding a sixth object F. Whenever a thread wants to acquire the synchronization lock for any object A···E, it must first acquire the lock for object F. This solution is known as containment: the locks for objects A···E are contained within the lock for object F. Compare this scheme with the circular-wait scheme of Section 7.4.4.

7.4 Prove that the safety algorithm presented in Section 7.5.3 requires an order of m × n² operations.

7.5 Consider a computer system that runs 5,000 jobs per month and has no deadlock-prevention or deadlock-avoidance scheme. Deadlocks occur about twice per month, and the operator must terminate and rerun about 10 jobs per deadlock. Each job is worth about $2 (in CPU time), and the jobs terminated tend to be about half-done when they are aborted.

A systems programmer has estimated that a deadlock-avoidance algorithm (like the banker's algorithm) could be installed in the system with an increase in the average execution time per job of about 10 percent. Since the machine currently has 30 percent idle time, all 5,000 jobs per month could still be run, although turnaround time would increase by about 20 percent on average.

a. What are the arguments for installing the deadlock-avoidance algorithm?
b. What are the arguments against installing the deadlock-avoidance algorithm?

7.6 Can a system detect that some of its processes are starving? If you answer "yes," explain how it can. If you answer "no," explain how the system can deal with the starvation problem.

7.7 Consider the following resource-allocation policy. Requests for and releases of resources are allowed at any time. If a request for resources cannot be satisfied because the resources are not available, then we check any processes that are blocked waiting for resources.
If a blocked process has the desired resources, then these resources are taken away from it and are given to the requesting process. The vector of resources for which the blocked process is waiting is increased to include the resources that were taken away.

For example, consider a system with three resource types and the vector Available initialized to (4,2,2). If process P0 asks for (2,2,1), it gets them. If P1 asks for (1,0,1), it gets them. Then, if P0 asks for (0,0,1), it is blocked (resource not available). If P2 now asks for (2,0,0), it gets the available one (1,0,0) and one that was allocated to P0 (since P0 is blocked). P0's Allocation vector goes down to (1,2,1), and its Need vector goes up to (1,0,1).

a. Can deadlock occur? If you answer "yes," give an example. If you answer "no," specify which necessary condition cannot occur.
b. Can indefinite blocking occur? Explain your answer.

7.8 Suppose that you have coded the deadlock-avoidance safety algorithm and now have been asked to implement the deadlock-detection algorithm. Can you do so by simply using the safety algorithm code and redefining Max[i] = Waiting[i] + Allocation[i], where Waiting[i] is a vector specifying the resources for which process i is waiting and Allocation[i] is as defined in Section 7.5? Explain your answer.

7.9 Is it possible to have a deadlock involving only a single process? Explain your answer.

Exercises

7.10 Consider the traffic deadlock depicted in Figure 7.12.

a. Show that the four necessary conditions for deadlock hold in this example.
b. State a simple rule for avoiding deadlocks in this system.

Figure 7.12 Traffic deadlock for Exercise 7.10.

7.11 Consider the deadlock situation that can occur in the dining-philosophers problem when the philosophers obtain the chopsticks one at a time. Discuss how the four necessary conditions for deadlock hold in this setting. Discuss how deadlocks could be avoided by eliminating any one of the four necessary conditions.

7.12 Compare the circular-wait scheme with the various deadlock-avoidance schemes (like the banker's algorithm) with respect to the following issues:

a. Runtime overheads
b. System throughput

7.13 In a real computer system, neither the resources available nor the demands of processes for resources are consistent over long periods (months). Resources break or are replaced, new processes come and go, and new resources are bought and added to the system. If deadlock is controlled by the banker's algorithm, which of the following changes can be made safely (without introducing the possibility of deadlock), and under what circumstances?

a. Increase Available (new resources added).
b. Decrease Available (resource permanently removed from system).
c. Increase Max for one process (the process needs or wants more resources than allowed).
d. Decrease Max for one process (the process decides it does not need that many resources).
e. Increase the number of processes.
f. Decrease the number of processes.

7.14 Consider a system consisting of four resources of the same type that are shared by three processes, each of which needs at most two resources. Show that the system is deadlock free.

7.15 Consider a system consisting of m resources of the same type being shared by n processes. A process can request or release only one resource at a time. Show that the system is deadlock free if the following two conditions hold:

a. The maximum need of each process is between one resource and m resources.
b. The sum of all maximum needs is less than m + n.

7.16 The Java API for the Thread class contains a method destroy() that has been deprecated. Consulting the API, explain why destroy() was deprecated.

7.17 Consider the following snapshot of a system:

      Allocation   Max        Available
      A B C D      A B C D    A B C D
P0    0 0 1 2      0 0 1 2    1 5 2 0
P1    1 0 0 0      1 7 5 0
P2    1 3 5 4      2 3 5 6
P3    0 6 3 2      0 6 5 2
P4    0 0 1 4      0 6 5 6

Answer the following questions using the banker's algorithm:

a. What is the content of the matrix Need?
b. Is the system in a safe state?
c. If a request from process P1 arrives for (0,4,2,0), can the request be granted immediately?

7.18 We can obtain the banker's algorithm for a single resource type from the general banker's algorithm simply by reducing the dimensionality of the various arrays by 1. Show through an example that we cannot implement the multiple-resource-type banker's scheme by applying the single-resource-type scheme to each resource type individually.

7.19 Consider the version of the dining-philosophers problem in which the chopsticks are placed at the center of the table and any two of them can be used by a philosopher. Assume that requests for chopsticks are made one at a time. Describe a simple rule for determining whether a particular request can be satisfied without causing deadlock given the current allocation of chopsticks to philosophers.

7.20 Consider again the setting in the preceding question. Assume now that each philosopher requires three chopsticks to eat. Resource requests are still issued one at a time. Describe some simple rules for determining whether a particular request can be satisfied without causing deadlock given the current allocation of chopsticks to philosophers.

7.21 What is the optimistic assumption made in the deadlock-detection algorithm? How can this assumption be violated?

Programming Problems

7.22 In Section 7.4.4, we describe a situation in which we prevent deadlock by ensuring that all locks are acquired in a certain order. However, we also point out that deadlock is possible in this situation if locks are acquired dynamically and two threads simultaneously invoke the transaction() method. Fix the transaction() method to prevent deadlock from occurring. (Hint: consider using System.identityHashCode().)

7.23 A single-lane bridge connects the two Vermont villages of North Tunbridge and South Tunbridge. Farmers in the two villages use this bridge to deliver their produce to the neighboring town. The bridge can become deadlocked if both a northbound and a southbound farmer get on the bridge at the same time. (Vermont farmers are stubborn and are unable to back up.) Using either Java semaphores or Java synchronization, design an algorithm that prevents deadlock. To test your implementation, design two threads, one representing a northbound farmer and the other representing a farmer traveling southbound. Once both are on the bridge, each will sleep for a random period of time to simulate traveling across the bridge. Initially, do not be concerned about starvation (the situation in which northbound farmers prevent southbound farmers from using the bridge, or vice versa).

7.24 Modify your solution to Exercise 7.23 so that it is starvation-free.

Programming Projects

In this project you will write a Java program that implements the banker's algorithm discussed in Section 7.5.3. Several customers request and release resources from the bank. The banker will grant a request only if it leaves the system in a safe state.
A request is denied if it leaves the system in an unsafe state.

The Bank

The bank will employ the strategy outlined in Section 7.5.3, whereby it will consider requests from n customers for m resources. The bank will keep track of the resources using the following data structures:

int numberOfCustomers;  // the number of customers
int numberOfResources;  // the number of resources
int[] available;        // the available amount of each resource
int[][] maximum;        // the maximum demand of each customer
int[][] allocation;     // the amount currently allocated to each customer
int[][] need;           // the remaining needs of each customer

The functionality of the bank appears in the interface shown in Figure 7.13. The implementation of this interface will require adding a constructor that is passed the number of resources initially available. For example, suppose we have three resource types with 10, 5, and 7 resources initially available. In this case, we can create an implementation of the interface using the following technique:

Bank theBank = new BankImpl(10,5,7);

The bank will grant a request if the request satisfies the safety algorithm outlined in Section 7.5.3.1. If granting the request does not leave the system in a safe state, the request is denied.

public interface Bank {
    /**
     * Add a customer.
     * customerNumber - the number of the customer
     * maximumDemand - the maximum demand for this customer
     */
    public void addCustomer(int customerNumber, int[] maximumDemand);

    /**
     * Output the value of available, maximum,
     * allocation, and need.
     */
    public void getState();

    /**
     * Request resources.
     * customerNumber - the customer requesting resources
     * request - the resources being requested
     */
    public boolean requestResources(int customerNumber, int[] request);

    /**
     * Release resources.
     * customerNumber - the customer releasing resources
     * release - the resources being released
     */
    public void releaseResources(int customerNumber, int[] release);
}

Figure 7.13 Interface showing the functionality of the bank.

Testing Your Implementation

You can use the file TestHarness.java, which is available on WileyPLUS, to test your implementation of the Bank interface. This program expects the implementation of the Bank interface to be named BankImpl and requires an input file containing the maximum demand of each resource type for each customer. For example, if there are five customers and three resource types, the input file might appear as follows:

7,5,3
3,2,2
9,0,2
2,2,2
4,3,3

This indicates that the maximum demand for customer 0 is 7, 5, 3; for customer 1, 3, 2, 2; and so forth. Since each line of the input file represents a separate customer, the addCustomer() method is to be invoked as each line is read in, initializing the value of maximum for each customer. (In the above example, the value of maximum[0][] is initialized to 7, 5, 3 for customer 0; maximum[1][] is initialized to 3, 2, 2; and so forth.)

Furthermore, TestHarness.java also requires the initial number of resources available in the bank. For example, if there are initially 10, 5, and 7 resources available, we invoke TestHarness.java as follows:

java TestHarness infile.txt 10 5 7

where infile.txt refers to a file containing the maximum demand for each customer followed by the number of resources initially available. The available array will be initialized to the values passed on the command line.
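As a starting point for the project, the following is a minimal, self-contained sketch of the safety algorithm at the core of requestResources(). The class and method names here are illustrative assumptions, not part of the required Bank interface; a real BankImpl would keep these arrays as fields rather than parameters.

// SafetyCheck.java -- a sketch of the safety algorithm of Section 7.5.3.1.
public class SafetyCheck {

    // Returns true if the state described by (available, allocation, need)
    // is safe: every customer can run to completion in some order.
    public static boolean isSafe(int[] available, int[][] allocation, int[][] need) {
        int n = allocation.length;        // number of customers
        int m = available.length;         // number of resource types
        int[] work = available.clone();   // resources currently free
        boolean[] finish = new boolean[n];

        boolean progress = true;
        while (progress) {
            progress = false;
            for (int i = 0; i < n; i++) {
                if (!finish[i] && leq(need[i], work)) {
                    // Pretend customer i runs to completion and
                    // releases everything it currently holds.
                    for (int j = 0; j < m; j++)
                        work[j] += allocation[i][j];
                    finish[i] = true;
                    progress = true;
                }
            }
        }
        for (boolean f : finish)
            if (!f) return false;         // some customer can never finish
        return true;                      // all customers can finish: safe
    }

    // true if a[j] <= b[j] for every resource type j
    private static boolean leq(int[] a, int[] b) {
        for (int j = 0; j < a.length; j++)
            if (a[j] > b[j]) return false;
        return true;
    }
}

Inside requestResources(), one would tentatively add the request to allocation (and subtract it from need and available), run this check, and roll the changes back if the resulting state is unsafe.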
WileyPLUS

Visit WileyPLUS for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Dijkstra [1965a] was one of the first and most influential contributors in the deadlock area. Holt [1972] was the first person to formalize the notion of deadlocks in terms of an allocation-graph model similar to the one presented in this chapter. Starvation was also covered by Holt [1972]. Hyman [1985] provided the deadlock example from the Kansas legislature. A recent study of deadlock handling is provided in Levine [2003].

The various prevention algorithms were suggested by Havender [1968], who devised the resource-ordering scheme for the IBM OS/360 system. The banker's algorithm for avoiding deadlocks was developed for a single resource type by Dijkstra [1965a] and was extended to multiple resource types by Habermann [1969]. Exercises 7.14 and 7.15 are from Holt [1971].

The deadlock-detection algorithm for multiple instances of a resource type, which is described in Section 7.6.2, was presented by Coffman et al. [1971]. Bach [1987] describes how many of the algorithms in the traditional UNIX kernel handle deadlock. Solutions to deadlock problems in networks are discussed in works such as Culler et al. [1998] and Rodeheffer and Schroeder [1991]. The witness lock-order verifier is presented in Baldwin [2002].

Part Three: Memory Management

The main purpose of a computer system is to execute programs. These programs, together with the data they access, must be at least partially in main memory during execution. To improve both the utilization of the CPU and the speed of its response to users, a general-purpose computer must keep several processes in memory. Many memory-management schemes exist, reflecting various approaches, and the effectiveness of each algorithm depends on the situation. Selection of a memory-management scheme for a system depends on many factors, especially on the hardware design of the system. Most algorithms require hardware support.

Chapter 8: Main Memory

In Chapter 5, we showed how the CPU can be shared by a set of processes. As a result of CPU scheduling, we can improve both the utilization of the CPU and the speed of the computer's response to its users. To realize this increase in performance, however, we must keep several processes in memory; that is, we must share memory.

In this chapter, we discuss various ways to manage memory. The memory-management algorithms vary from a primitive bare-machine approach to paging and segmentation strategies. Each approach has its own advantages and disadvantages. Selection of a memory-management method for a specific system depends on many factors, especially on the hardware design of the system. As we shall see, many algorithms require hardware support, although recent designs have closely integrated the hardware and operating system.

Chapter Objectives
• To provide a detailed description of various ways of organizing memory hardware.
• To discuss various memory-management techniques, including paging and segmentation.
• To provide a detailed description of the Intel Pentium, which supports both pure segmentation and segmentation with paging.

8.1 Background

As we saw in Chapter 1, memory is central to the operation of a modern computer system. Memory consists of a large array of words or bytes, each with its own address.
The CPU fetches instructions from memory according to the value of the program counter. These instructions may cause additional loading from and storing to specific memory addresses.

A typical instruction-execution cycle, for example, first fetches an instruction from memory. The instruction is then decoded and may cause operands to be fetched from memory. After the instruction has been executed on the operands, results may be stored back in memory. The memory unit sees only a stream of memory addresses; it does not know how they are generated (by the instruction counter, indexing, indirection, or literal addresses, for example) or what they are for (instructions or data). Accordingly, we can ignore how a program generates a memory address. We are interested only in the sequence of memory addresses generated by the running program.

We begin our discussion by covering several issues that are pertinent to the various techniques for managing memory. This coverage includes an overview of basic hardware issues, the binding of symbolic memory addresses to actual physical addresses, and the distinction between logical and physical addresses. We conclude the section with a discussion of dynamic loading and linking of code and shared libraries.

8.1.1 Basic Hardware

Main memory and the registers built into the processor itself are the only storage that the CPU can access directly. There are machine instructions that take memory addresses as arguments, but none that take disk addresses. Therefore, any instructions in execution, and any data being used by the instructions, must be in one of these direct-access storage devices. If the data are not in memory, they must be moved there before the CPU can operate on them.

Registers that are built into the CPU are generally accessible within one cycle of the CPU clock. Most CPUs can decode instructions and perform simple operations on register contents at the rate of one or more operations per clock tick. The same cannot be said of main memory, which is accessed via a transaction on the memory bus. Completing a memory access may take many cycles of the CPU clock. In such cases, the processor normally needs to stall, since it does not have the data required to complete the instruction that it is executing. This situation is intolerable because memory accesses are so frequent. The remedy is to add fast memory between the CPU and main memory. A memory buffer used to accommodate a speed differential, called a cache, is described in Section 1.8.3.

Not only are we concerned with the relative speed of accessing physical memory, but we also must ensure correct operation to protect the operating system from access by user processes and, in addition, to protect user processes from one another. This protection must be provided by the hardware. It can be implemented in several ways, as we shall see throughout the chapter. In this section, we outline one possible implementation.

We first need to make sure that each process has a separate memory space. To do this, we need the ability to determine the range of legal addresses that the process may access and to ensure that the process can access only these legal addresses. We can provide this protection by using two registers, usually a base and a limit, as illustrated in Figure 8.1. The base register holds the smallest legal physical memory address; the limit register specifies the size of the range.
For example, if the base register holds 300040 and the limit register is 120900, then the program can legally access all addresses from 300040 through 420939 (inclusive).

Figure 8.1 A base and a limit register define a logical address space.

To protect memory space, the CPU hardware compares every address generated in user mode with the registers. Any attempt by a program executing in user mode to access operating-system memory or other users' memory results in a trap to the operating system, which treats the attempt as a fatal error (Figure 8.2). This scheme prevents a user program from (accidentally or deliberately) modifying the code or data structures of either the operating system or other users.

Figure 8.2 Hardware address protection with base and limit registers.

The base and limit registers can be loaded only by the operating system, which uses a special privileged instruction. Since privileged instructions can be executed only in kernel mode, and since only the operating system executes in kernel mode, only the operating system can load the base and limit registers. This scheme allows the operating system to change the value of the registers but prevents user programs from changing the registers' contents.

The operating system, executing in kernel mode, is given unrestricted access to both operating-system memory and users' memory. This provision allows the operating system to load users' programs into users' memory, to dump out those programs in case of errors, to access and modify parameters of system calls, and so on.

8.1.2 Address Binding

Usually, a program resides on a disk as a binary executable file. To be executed, the program must be brought into memory and placed within a process. Depending on the memory management in use, the process may be moved between disk and memory during its execution. The processes on the disk that are waiting to be brought into memory for execution form the input queue.

The normal procedure is to select one of the processes in the input queue and to load that process into memory. As the process is executed, it accesses instructions and data from memory. Eventually, the process terminates, and its memory space is declared available.

Most systems allow a user process to reside in any part of the physical memory. Thus, although the address space of the computer starts at 00000, the first address of the user process need not be 00000. This approach affects the addresses that the user program can use. In most cases, a user program will go through several steps, some of which may be optional, before being executed (Figure 8.3). Addresses may be represented in different ways during these steps. Addresses in the source program are generally symbolic (such as count). A compiler will typically bind these symbolic addresses to relocatable addresses (such as "14 bytes from the beginning of this module"). The linkage editor or loader will in turn bind the relocatable addresses to absolute addresses (such as 74014). Each binding is a mapping from one address space to another.

Classically, the binding of instructions and data to memory addresses can be done at any step along the way:

• Compile time. If you know at compile time where the process will reside in memory, then absolute code can be generated.
For example, if you know that a user process will reside starting at location R, then the generated compiler code will start at that location and extend up from there. If, at some later time, the starting location changes, then it will be necessary to recompile this code. The MS-DOS .COM-format programs are bound at compile time.

• Load time. If it is not known at compile time where the process will reside in memory, then the compiler must generate relocatable code. In this case, final binding is delayed until load time. If the starting address changes, we need only reload the user code to incorporate this changed value.

• Execution time. If the process can be moved during its execution from one memory segment to another, then binding must be delayed until run time. Special hardware must be available for this scheme to work, as will be discussed in Section 8.1.3. Most general-purpose operating systems use this method.

A major portion of this chapter is devoted to showing how these various bindings can be implemented effectively in a computer system and to discussing appropriate hardware support.

Figure 8.3 Multistep processing of a user program.

8.1.3 Logical versus Physical Address Space

An address generated by the CPU is commonly referred to as a logical address, whereas an address seen by the memory unit, that is, the one loaded into the memory-address register of the memory, is commonly referred to as a physical address.

The compile-time and load-time address-binding methods generate identical logical and physical addresses. However, the execution-time address-binding scheme results in differing logical and physical addresses. In this case, we usually refer to the logical address as a virtual address. We use logical address and virtual address interchangeably in this text. The set of all logical addresses generated by a program is a logical address space; the set of all physical addresses corresponding to these logical addresses is a physical address space. Thus, in the execution-time address-binding scheme, the logical and physical address spaces differ.

The run-time mapping from virtual to physical addresses is done by a hardware device called the memory-management unit (MMU). We can choose from many different methods to accomplish such mapping, as we discuss in Sections 8.3 through 8.7. For the time being, we illustrate this mapping with a simple MMU scheme that is a generalization of the base-register scheme described in Section 8.1.1. The base register is now called a relocation register. The value in the relocation register is added to every address generated by a user process at the time the address is sent to memory (see Figure 8.4). For example, if the base is at 14000, then an attempt by the user to address location 0 is dynamically relocated to location 14000; an access to location 346 is mapped to location 14346. The MS-DOS operating system running on the Intel 80x86 family of processors used four relocation registers when loading and running processes.

Figure 8.4 Dynamic relocation using a relocation register.

The user program never sees the real physical addresses.
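To make the mapping concrete, here is a minimal sketch, in Java, of what the relocation hardware does on every reference. The class is an illustrative toy (real MMUs do this in logic, not software); the limit check from Section 8.1.1 is included, reusing the 120900 limit from that example.

// A toy model of relocation-register mapping with a limit check.
// Real hardware performs these two steps on every memory reference.
public class RelocationMmu {
    private final int relocation;   // base: smallest legal physical address
    private final int limit;        // size of the logical address range

    public RelocationMmu(int relocation, int limit) {
        this.relocation = relocation;
        this.limit = limit;
    }

    // Map a logical address to a physical address, trapping on violation.
    public int map(int logical) {
        if (logical < 0 || logical >= limit)
            throw new IllegalStateException("trap: addressing error at " + logical);
        return relocation + logical;   // dynamic relocation
    }

    public static void main(String[] args) {
        RelocationMmu mmu = new RelocationMmu(14000, 120900);
        System.out.println(mmu.map(0));     // prints 14000
        System.out.println(mmu.map(346));   // prints 14346
    }
}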
The program can create a pointer to location 346, store it in memory, manipulate it, and compare it with other addresses, all as the number 346. Only when it is used as a memory address (in an indirect load or store, perhaps) is it relocated relative to the base register. The user program deals with logical addresses. The memory-mapping hardware converts logical addresses into physical addresses. This form of execution-time binding was discussed in Section 8.1.2. The final location of a referenced memory address is not determined until the reference is made.

We now have two different types of addresses: logical addresses (in the range 0 to max) and physical addresses (in the range R + 0 to R + max for a base value R). The user program generates only logical addresses and thinks that the process runs in locations 0 to max. However, these logical addresses must be mapped to physical addresses before they are used. The concept of a logical address space that is bound to a separate physical address space is central to proper memory management.

8.1.4 Dynamic Loading

In our discussion so far, it has been necessary for the entire program and all data of a process to be in physical memory for the process to execute. The size of a process has thus been limited to the size of physical memory. To obtain better memory-space utilization, we can use dynamic loading. With dynamic loading, a routine is not loaded until it is called. All routines are kept on disk in a relocatable load format. The main program is loaded into memory and is executed. When a routine needs to call another routine, the calling routine first checks to see whether the other routine has been loaded. If it has not, the relocatable linking loader is called to load the desired routine into memory and to update the program's address tables to reflect this change. Then control is passed to the newly loaded routine.

The advantage of dynamic loading is that an unused routine is never loaded. This method is particularly useful when large amounts of code are needed to handle infrequently occurring cases, such as error routines. In this case, although the total program size may be large, the portion that is used (and hence loaded) may be much smaller.

Dynamic loading does not require special support from the operating system. It is the responsibility of the users to design their programs to take advantage of such a method. Operating systems may help the programmer, however, by providing library routines to implement dynamic loading.

8.1.5 Dynamic Linking and Shared Libraries

Figure 8.3 also shows dynamically linked libraries. Some operating systems support only static linking, in which system language libraries are treated like any other object module and are combined by the loader into the binary program image. Dynamic linking, in contrast, is similar to dynamic loading. Here, though, linking, rather than loading, is postponed until execution time. This feature is usually used with system libraries, such as language subroutine libraries. Without this facility, each program on a system must include a copy of its language library (or at least the routines referenced by the program) in the executable image. This requirement wastes both disk space and main memory.

With dynamic linking, a stub is included in the image for each library-routine reference. The stub is a small piece of code that indicates how to locate the appropriate memory-resident library routine or how to load the library if the routine is not already present. When the stub is executed, it checks to see whether the needed routine is already in memory. If it is not, the program loads the routine into memory. Either way, the stub replaces itself with the address of the routine and executes the routine. Thus, the next time that particular code segment is reached, the library routine is executed directly, incurring no cost for dynamic linking. Under this scheme, all processes that use a language library execute only one copy of the library code.
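Java programs see a form of dynamic loading routinely, since the JVM loads classes on first use. The sketch below makes the check-then-load-then-invoke pattern explicit with reflection; the class name ErrorRoutines and its report() method are hypothetical stand-ins for an infrequently used routine.

import java.lang.reflect.Method;

// A sketch of the stub idea: check whether the routine is loaded,
// load it on first use, then invoke it directly thereafter.
// "ErrorRoutines" and its static report() method are hypothetical.
public class LazyRoutine {
    private static Method handler;   // null until first use

    public static void handleError(String message) throws Exception {
        if (handler == null) {
            Class<?> c = Class.forName("ErrorRoutines");   // load on demand
            handler = c.getMethod("report", String.class);
        }
        handler.invoke(null, message);   // later calls skip the lookup
    }
}

The first call pays the cost of locating and loading the routine; subsequent calls go straight to the resolved method, much as a replaced stub does.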
This feature can be extended to library updates (such as bug fixes). A library may be replaced by a new version, and all programs that reference the library will automatically use the new version. Without dynamic linking, all such programs would need to be relinked to gain access to the new library. So that programs will not accidentally execute new, incompatible versions of libraries, version information is included in both the program and the library. More than one version of a library may be loaded into memory, and each program uses its version information to decide which copy of the library to use. Versions with minor changes retain the same version number, whereas versions with major changes increment the number. Thus, only programs that are compiled with the new library version are affected by any incompatible changes incorporated in it. Other programs linked before the new library was installed will continue using the older library. This system is also known as shared libraries.

Unlike dynamic loading, dynamic linking generally requires help from the operating system. If the processes in memory are protected from one another, then the operating system is the only entity that can check to see whether the needed routine is in another process's memory space or that can allow multiple processes to access the same memory addresses. We elaborate on this concept when we discuss paging in Section 8.4.4.

8.2 Swapping

A process must be in memory to be executed. A process, however, can be swapped temporarily out of memory to a backing store and then brought back into memory for continued execution. For example, assume a multiprogramming environment with a round-robin CPU-scheduling algorithm. When a quantum expires, the memory manager will start to swap out the process that just finished and to swap another process into the memory space that has been freed (Figure 8.5). In the meantime, the CPU scheduler will allocate a time slice to some other process in memory. When each process finishes its quantum, it will be swapped with another process. Ideally, the memory manager can swap processes fast enough that some processes will be in memory, ready to execute, when the CPU scheduler wants to reschedule the CPU. In addition, the quantum must be large enough to allow reasonable amounts of computing to be done between swaps.

Figure 8.5 Swapping of two processes using a disk as a backing store.

A variant of this swapping policy is used for priority-based scheduling algorithms. If a higher-priority process arrives and wants service, the memory manager can swap out the lower-priority process and then load and execute the higher-priority process. When the higher-priority process finishes, the lower-priority process can be swapped back in and continued. This variant of swapping is sometimes called roll out, roll in.
Normally, a process that is swapped out will be swapped back into the same memory space it occupied previously. This restriction is dictated by the method of address binding. If binding is done at assembly or load time, then the process cannot be easily moved to a different location. If execution-time binding is being used, however, then a process can be swapped into a different memory space, because the physical addresses are computed during execution time.

Swapping requires a backing store. The backing store is commonly a fast disk. It must be large enough to accommodate copies of all memory images for all users, and it must provide direct access to these memory images. The system maintains a ready queue consisting of all processes whose memory images are on the backing store or in memory and are ready to run. Whenever the CPU scheduler decides to execute a process, it calls the dispatcher. The dispatcher checks to see whether the next process in the queue is in memory. If it is not, and if there is no free memory region, the dispatcher swaps out a process currently in memory and swaps in the desired process. It then reloads registers and transfers control to the selected process.

The context-switch time in such a swapping system is fairly high. To get an idea of the context-switch time, let's assume that the user process is 100 MB in size and the backing store is a standard hard disk with a transfer rate of 50 MB per second. The actual transfer of the 100-MB process to or from main memory takes 100 MB / 50 MB per second = 2 seconds. Assuming an average latency of 8 milliseconds, the swap time is 2,008 milliseconds. Since we must both swap out and swap in, the total swap time is about 4,016 milliseconds.

Notice that the major part of the swap time is transfer time. The total transfer time is directly proportional to the amount of memory swapped. If we have a computer system with 4 GB of main memory and a resident operating system taking 1 GB, the maximum size of the user process is 3 GB. However, many user processes may be much smaller than this, say, 100 MB. A 100-MB process could be swapped out in 2 seconds, compared with the 60 seconds required for swapping 3 GB. Clearly, it would be useful to know exactly how much memory a user process is using, not simply how much it might be using. Then we would need to swap only what is actually used, reducing swap time. For this method to be effective, the user must keep the system informed of any changes in memory requirements. Thus, a process with dynamic memory requirements will need to issue system calls (request memory and release memory) to inform the operating system of its changing memory needs.

Swapping is constrained by other factors as well. If we want to swap a process, we must be sure that it is completely idle. Of particular concern is any pending I/O. A process may be waiting for an I/O operation when we want to swap that process to free up memory. However, if the I/O is asynchronously accessing the user memory for I/O buffers, then the process cannot be swapped. Assume that the I/O operation is queued because the device is busy. If we were to swap out process P1 and swap in process P2, the I/O operation might then attempt to use memory that now belongs to process P2. There are two main solutions to this problem: never swap a process with pending I/O, or execute I/O operations only into operating-system buffers.
Transfers between operating-system buffers and process memory then occur only when the process is swapped in.

Currently, standard swapping is used in few systems. It requires too much swapping time and provides too little execution time to be a reasonable memory-management solution. Modified versions of swapping, however, are found on many systems.

A modification of swapping is used in many versions of UNIX. Swapping is normally disabled but will start if many processes are running and are using a threshold amount of memory. Swapping is again halted when the load on the system is reduced. Memory management in UNIX is described fully in Sections 21.7 and A.6.

Early PCs, which lacked the sophistication to implement more advanced memory-management methods, ran multiple large processes by using a modified version of swapping. A prime example is the Microsoft Windows 3.1 operating system, which supports concurrent execution of processes in memory. If a new process is loaded and there is insufficient main memory, an old process is swapped to disk. This operating system does not provide full swapping, however, because the user, rather than the scheduler, decides when it is time to preempt one process for another. Any swapped-out process remains swapped out (and not executing) until the user selects that process to run. Subsequent versions of Microsoft operating systems take advantage of the advanced MMU features now found in PCs. We explore such features in Section 8.4 and in Chapter 9, where we cover virtual memory.

8.3 Contiguous Memory Allocation

The main memory must accommodate both the operating system and the various user processes. We therefore need to allocate main memory in the most efficient way possible. This section explains one common method, contiguous memory allocation.

The memory is usually divided into two partitions: one for the resident operating system and one for the user processes. We can place the operating system in either low memory or high memory. The major factor affecting this decision is the location of the interrupt vector. Since the interrupt vector is often in low memory, programmers usually place the operating system in low memory as well. Thus, in this text, we discuss only the situation in which the operating system resides in low memory. The development of the other situation is similar.

We usually want several user processes to reside in memory at the same time. We therefore need to consider how to allocate available memory to the processes that are in the input queue waiting to be brought into memory. In contiguous memory allocation, each process is contained in a single contiguous section of memory.

8.3.1 Memory Mapping and Protection

Before discussing memory allocation further, we must discuss the issue of memory mapping and protection. We can provide these features by using a relocation register, as discussed in Section 8.1.3, together with a limit register, as discussed in Section 8.1.1. The relocation register contains the value of the smallest physical address; the limit register contains the range of logical addresses (for example, relocation = 100040 and limit = 74600). With relocation and limit registers, each logical address must be less than the limit register; the MMU maps the logical address dynamically by adding the value in the relocation register. This mapped address is sent to memory (Figure 8.6).
Figure 8.6 Hardware support for relocation and limit registers.

When the CPU scheduler selects a process for execution, the dispatcher loads the relocation and limit registers with the correct values as part of the context switch. Because every address generated by a CPU is checked against these registers, we can protect both the operating system and the other users' programs and data from being modified by this running process.

The relocation-register scheme provides an effective way to allow the operating system's size to change dynamically. This flexibility is desirable in many situations. For example, the operating system contains code and buffer space for device drivers. If a device driver (or other operating-system service) is not commonly used, we do not want to keep the code and data in memory, as we might be able to use that space for other purposes. Such code is sometimes called transient operating-system code; it comes and goes as needed. Thus, using this code changes the size of the operating system during program execution.

8.3.2 Memory Allocation

Now we are ready to turn to memory allocation. One of the simplest methods for allocating memory is to divide memory into several fixed-sized partitions. Each partition may contain exactly one process. Thus, the degree of multiprogramming is bound by the number of partitions. In this multiple-partition method, when a partition is free, a process is selected from the input queue and is loaded into the free partition. When the process terminates, the partition becomes available for another process. This method was originally used by the IBM OS/360 operating system (called MFT); it is no longer in use. The method described next is a generalization of the fixed-partition scheme (called MVT); it is used primarily in batch environments. Many of the ideas presented here are also applicable to a time-sharing environment in which pure segmentation is used for memory management (Section 8.6).

In the variable-partition scheme, the operating system keeps a table indicating which parts of memory are available and which are occupied. Initially, all memory is available for user processes and is considered one large block of available memory, a hole. Eventually, as you will see, memory contains a set of holes of various sizes.

As processes enter the system, they are put into an input queue. The operating system takes into account the memory requirements of each process and the amount of available memory space in determining which processes are allocated memory. When a process is allocated space, it is loaded into memory, and it can then compete for CPU time. When a process terminates, it releases its memory, which the operating system may then fill with another process from the input queue.

At any given time, then, we have a list of available block sizes and an input queue. The operating system can order the input queue according to a scheduling algorithm. Memory is allocated to processes until, finally, the memory requirements of the next process cannot be satisfied; that is, no available block of memory (or hole) is large enough to hold that process. The operating system can then wait until a large enough block is available, or it can skip down the input queue to see whether the smaller memory requirements of some other process can be met.
In general, as mentioned, the memory blocks available comprise a set of holes of various sizes scattered throughout memory. When a process arrives and needs memory, the system searches the set for a hole that is large enough for this process. If the hole is too large, it is split into two parts. One part is allocated to the arriving process; the other is returned to the set of holes. When a process terminates, it releases its block of memory, which is then placed back in the set of holes. If the new hole is adjacent to other holes, these adjacent holes are merged to form one larger hole. At this point, the system may need to check whether there are processes waiting for memory and whether this newly freed and recombined memory could satisfy the demands of any of these waiting processes.

This procedure is a particular instance of the general dynamic storage-allocation problem, which concerns how to satisfy a request of size n from a list of free holes. There are many solutions to this problem. The first-fit, best-fit, and worst-fit strategies are the ones most commonly used to select a free hole from the set of available holes.

• First fit. Allocate the first hole that is big enough. Searching can start either at the beginning of the set of holes or at the location where the previous first-fit search ended. We can stop searching as soon as we find a free hole that is large enough.

• Best fit. Allocate the smallest hole that is big enough. We must search the entire list, unless the list is ordered by size. This strategy produces the smallest leftover hole.

• Worst fit. Allocate the largest hole. Again, we must search the entire list, unless it is sorted by size. This strategy produces the largest leftover hole, which may be more useful than the smaller leftover hole from a best-fit approach.

Simulations have shown that both first fit and best fit are better than worst fit in terms of speed and storage utilization. Neither first fit nor best fit is clearly better than the other in terms of storage utilization, but first fit is generally faster.
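To make the hole-selection strategies concrete, here is a minimal first-fit sketch in Java. The hole-list representation is an illustrative assumption; a real allocator would also merge adjacent holes on release, as described above.

import java.util.ArrayList;
import java.util.List;

// A toy first-fit allocator over a list of holes. Coalescing of
// adjacent holes on release is omitted to keep the sketch short.
public class FirstFitAllocator {
    // Each hole is a (start, size) region of an imaginary memory.
    static class Hole {
        long start, size;
        Hole(long start, long size) { this.start = start; this.size = size; }
    }

    private final List<Hole> holes = new ArrayList<>();

    public FirstFitAllocator(long memorySize) {
        holes.add(new Hole(0, memorySize));   // initially one big hole
    }

    // Returns the start address of the allocated block, or -1 on failure.
    public long allocate(long request) {
        for (int i = 0; i < holes.size(); i++) {
            Hole h = holes.get(i);
            if (h.size >= request) {          // first hole big enough wins
                long addr = h.start;
                h.start += request;           // split: remainder stays a hole
                h.size -= request;
                if (h.size == 0) holes.remove(i);
                return addr;
            }
        }
        return -1;   // may fail from external fragmentation even when
                     // total free memory would satisfy the request
    }
}

Best fit differs only in the scan: instead of taking the first hole that fits, it remembers the smallest one that does; worst fit remembers the largest.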
8.3.3 Fragmentation

Both the first-fit and best-fit strategies for memory allocation suffer from external fragmentation. As processes are loaded and removed from memory, the free memory space is broken into little pieces. External fragmentation exists when there is enough total memory space to satisfy a request but the available spaces are not contiguous; storage is fragmented into a large number of small holes. This fragmentation problem can be severe. In the worst case, we could have a block of free (or wasted) memory between every two processes. If all these small pieces of memory were in one big free block instead, we might be able to run several more processes.

Whether we are using the first-fit or best-fit strategy can affect the amount of fragmentation. (First fit is better for some systems, whereas best fit is better for others.) Another factor is which end of a free block is allocated. (Which is the leftover piece, the one on the top or the one on the bottom?) No matter which algorithm is used, however, external fragmentation will be a problem.

Depending on the total amount of memory storage and the average process size, external fragmentation may be a minor or a major problem. Statistical analysis of first fit, for instance, reveals that, even with some optimization, given N allocated blocks, another 0.5 N blocks will be lost to fragmentation. That is, one-third of memory may be unusable! This property is known as the 50-percent rule.

Memory fragmentation can be internal as well as external. Consider a multiple-partition allocation scheme with a hole of 18,464 bytes. Suppose that the next process requests 18,462 bytes. If we allocate exactly the requested block, we are left with a hole of 2 bytes. The overhead to keep track of this hole will be substantially larger than the hole itself. The general approach to avoiding this problem is to break the physical memory into fixed-sized blocks and allocate memory in units based on block size. With this approach, the memory allocated to a process may be slightly larger than the requested memory. The difference between these two numbers is internal fragmentation: unused memory that is internal to a partition.

One solution to the problem of external fragmentation is compaction. The goal is to shuffle the memory contents so as to place all free memory together in one large block. Compaction is not always possible, however. If relocation is static and is done at assembly or load time, compaction cannot be done; compaction is possible only if relocation is dynamic and is done at execution time. If addresses are relocated dynamically, relocation requires only moving the program and data and then changing the base register to reflect the new base address. When compaction is possible, we must determine its cost. The simplest compaction algorithm is to move all processes toward one end of memory; all holes move in the other direction, producing one large hole of available memory. This scheme can be expensive.

Another possible solution to the external-fragmentation problem is to permit the logical address space of the processes to be noncontiguous, thus allowing a process to be allocated physical memory wherever such memory is available. Two complementary techniques achieve this solution: paging (Section 8.4) and segmentation (Section 8.6). These techniques can also be combined (Section 8.7).

8.4 Paging

Paging is a memory-management scheme that permits the physical address space of a process to be noncontiguous. Paging avoids external fragmentation and the need for compaction. It also solves the considerable problem of fitting memory chunks of varying sizes onto the backing store; most memory-management schemes used before the introduction of paging suffered from this problem. The problem arises because, when some code fragments or data residing in main memory need to be swapped out, space must be found on the backing store. The backing store has the same fragmentation problems discussed in connection with main memory, but access is much slower, so compaction is impossible. Because of its advantages over earlier methods, paging in its various forms is used in most operating systems.

Traditionally, support for paging has been handled by hardware. However, recent designs have implemented paging by closely integrating the hardware and operating system, especially on 64-bit microprocessors.

8.4.1 Basic Method

The basic method for implementing paging involves breaking physical memory into fixed-sized blocks called frames and breaking logical memory into blocks of the same size called pages. When a process is to be executed, its pages are loaded into any available memory frames from their source (a file system or the backing store). The backing store is divided into fixed-sized blocks that are of the same size as the memory frames.

The hardware support for paging is illustrated in Figure 8.7.
Every address generated by the CPU is divided into two parts: a page number (p) and a page offset (d). The page number is used as an index into a page table. The page table contains the base address of each page in physical memory. This base address is combined with the page offset to define the physical memory address that is sent to the memory unit.

Figure 8.7 Paging hardware.

The paging model of memory is shown in Figure 8.8. The page size (like the frame size) is defined by the hardware. The size of a page is typically a power of 2, varying between 512 bytes and 16 MB per page, depending on the computer architecture. The selection of a power of 2 as a page size makes the translation of a logical address into a page number and page offset particularly easy. If the size of the logical address space is 2^m, and a page size is 2^n addressing units (bytes or words), then the high-order m − n bits of a logical address designate the page number, and the n low-order bits designate the page offset. Thus, the logical address is as follows:

    page number   page offset
    p             d
    m − n bits    n bits

where p is an index into the page table and d is the displacement within the page.

Figure 8.8 Paging model of logical and physical memory.

As a concrete (although minuscule) example, consider the memory in Figure 8.9. Here, in the logical address, n = 2 and m = 4. Using a page size of 4 bytes and a physical memory of 32 bytes (8 pages), we show how the user's view of memory can be mapped into physical memory. Logical address 0 is page 0, offset 0. Indexing into the page table, we find that page 0 is in frame 5. Thus, logical address 0 maps to physical address 20 [= (5 × 4) + 0]. Logical address 3 (page 0, offset 3) maps to physical address 23 [= (5 × 4) + 3]. Logical address 4 is page 1, offset 0; according to the page table, page 1 is mapped to frame 6. Thus, logical address 4 maps to physical address 24 [= (6 × 4) + 0]. Logical address 13 maps to physical address 9.

Figure 8.9 Paging example for a 32-byte memory with 4-byte pages.
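The following sketch carries out exactly this translation for the tiny system of Figure 8.9. The page table {5, 6, 1, 2} and the 2-bit offset come from the example above; the class itself is an illustrative toy.

// Translating the logical addresses of Figure 8.9: page size 4 bytes
// (n = 2), 16-byte logical space (m = 4), page table {5, 6, 1, 2}.
public class TinyPaging {
    static final int OFFSET_BITS = 2;              // n: page size = 2^2 = 4 bytes
    static final int[] PAGE_TABLE = {5, 6, 1, 2};  // frame for each page

    static int translate(int logical) {
        int page   = logical >>> OFFSET_BITS;               // high-order m - n bits
        int offset = logical & ((1 << OFFSET_BITS) - 1);    // low-order n bits
        return (PAGE_TABLE[page] << OFFSET_BITS) | offset;  // frame base + offset
    }

    public static void main(String[] args) {
        System.out.println(translate(0));   // 20: page 0 -> frame 5
        System.out.println(translate(3));   // 23: page 0, offset 3
        System.out.println(translate(4));   // 24: page 1 -> frame 6
        System.out.println(translate(13));  // 9:  page 3 -> frame 2
    }
}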
You may have noticed that paging itself is a form of dynamic relocation. Every logical address is bound by the paging hardware to some physical address. Using paging is similar to using a table of base (or relocation) registers, one for each frame of memory.

When we use a paging scheme, we have no external fragmentation: any free frame can be allocated to a process that needs it. However, we may have some internal fragmentation. Notice that frames are allocated as units. If the memory requirements of a process do not happen to coincide with page boundaries, the last frame allocated may not be completely full. For example, if page size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It will be allocated 36 frames, resulting in internal fragmentation of 2,048 − 1,086 = 962 bytes. In the worst case, a process would need n pages plus 1 byte. It would be allocated n + 1 frames, resulting in internal fragmentation of almost an entire frame.

If process size is independent of page size, we expect internal fragmentation to average one-half page per process. This consideration suggests that small page sizes are desirable. However, overhead is involved in each page-table entry, and this overhead is reduced as the size of the pages increases. Also, disk I/O is more efficient when the amount of data being transferred is larger (Chapter 12). Generally, page sizes have grown over time as processes, data sets, and main memory have become larger. Today, pages typically are between 4 KB and 8 KB in size, and some systems support even larger page sizes. Some CPUs and kernels even support multiple page sizes. For instance, Solaris uses page sizes of 8 KB and 4 MB, depending on the data stored by the pages. Researchers are now developing support for variable on-the-fly page sizes.

Usually, each page-table entry is 4 bytes long, but that size can vary as well. A 32-bit entry can point to one of 2^32 physical page frames. If frame size is 4 KB, then a system with 4-byte entries can address 2^44 bytes (or 16 TB) of physical memory.

When a process arrives in the system to be executed, its size, expressed in pages, is examined. Each page of the process needs one frame. Thus, if the process requires n pages, at least n frames must be available in memory. If n frames are available, they are allocated to this arriving process. The first page of the process is loaded into one of the allocated frames, and the frame number is put in the page table for this process. The next page is loaded into another frame, its frame number is put into the page table, and so on (Figure 8.10).

Figure 8.10 Free frames (a) before allocation and (b) after allocation.

An important aspect of paging is the clear separation between the user's view of memory and the actual physical memory. The user program views memory as one single space, containing only this one program. In fact, the user program is scattered throughout physical memory, which also holds other programs. The difference between the user's view of memory and the actual physical memory is reconciled by the address-translation hardware. The logical addresses are translated into physical addresses. This mapping is hidden from the user and is controlled by the operating system. Notice that the user process by definition is unable to access memory it does not own. It has no way of addressing memory outside of its page table, and the table includes only those pages that the process owns.

Since the operating system is managing physical memory, it must be aware of the allocation details of physical memory: which frames are allocated, which frames are available, how many total frames there are, and so on. This information is generally kept in a data structure called a frame table. The frame table has one entry for each physical page frame, indicating whether the latter is free or allocated and, if it is allocated, to which page of which process or processes.

In addition, the operating system must be aware that user processes operate in user space, and all logical addresses must be mapped to produce physical addresses.
If a user makes a system call (to do I/O, for example) and provides an address as a parameter (a buffer, for instance), that address must be mapped to produce the correct physical address. The operating system maintains a copy of the page table for each process, just as it maintains a copy of the instruction counter and register contents. This copy is used to translate logical addresses to physical addresses whenever the operating system must map a logical address to a physical address manually. It is also used by the CPU dispatcher to define the hardware page table when a process is to be allocated the CPU. Paging therefore increases the context-switch time.

8.4.2 Hardware Support

Each operating system has its own methods for storing page tables. Most allocate a page table for each process. A pointer to the page table is stored with the other register values (like the instruction counter) in the process control block. When the dispatcher is told to start a process, it must reload the user registers and define the correct hardware page-table values from the stored user page table.

The hardware implementation of the page table can be done in several ways. In the simplest case, the page table is implemented as a set of dedicated registers. These registers should be built with very high-speed logic to make the paging-address translation efficient. Every access to memory must go through the paging map, so efficiency is a major consideration. The CPU dispatcher reloads these registers, just as it reloads the other registers. Instructions to load or modify the page-table registers are, of course, privileged, so that only the operating system can change the memory map. The DEC PDP-11 is an example of such an architecture. The address consists of 16 bits, and the page size is 8 KB. The page table thus consists of eight entries that are kept in fast registers.

The use of registers for the page table is satisfactory if the page table is reasonably small (for example, 256 entries). Most contemporary computers, however, allow the page table to be very large (for example, 1 million entries). For these machines, the use of fast registers to implement the page table is not feasible. Rather, the page table is kept in main memory, and a page-table base register (PTBR) points to the page table. Changing page tables requires changing only this one register, substantially reducing context-switch time.

The problem with this approach is the time required to access a user memory location. If we want to access location i, we must first index into the page table, using the value in the PTBR offset by the page number for i. This task requires a memory access. It provides us with the frame number, which is combined with the page offset to produce the actual address. We can then access the desired place in memory. With this scheme, two memory accesses are needed to access a byte (one for the page-table entry, one for the byte). Thus, memory access is slowed by a factor of 2. This delay would be intolerable under most circumstances. We might as well resort to swapping!

The standard solution to this problem is to use a special, small, fast-lookup hardware cache, called a translation look-aside buffer (TLB). The TLB is associative, high-speed memory. Each entry in the TLB consists of two parts: a key (or tag) and a value. When the associative memory is presented with an item, the item is compared with all keys simultaneously. If the item is found, the corresponding value field is returned.
The search is fast; the hardware, however, is expensive. Typically, the number of entries in a TLB is small, often between 64 and 1,024.

The TLB is used with page tables in the following way. The TLB contains only a few of the page-table entries. When a logical address is generated by the CPU, its page number is presented to the TLB. If the page number is found, its frame number is immediately available and is used to access memory. The whole task may take less than 10 percent longer than it would if an unmapped memory reference were used.

If the page number is not in the TLB (known as a TLB miss), a memory reference to the page table must be made. When the frame number is obtained, we can use it to access memory (Figure 8.11). In addition, we add the page number and frame number to the TLB, so that they will be found quickly on the next reference. If the TLB is already full of entries, the operating system must select one for replacement. Replacement policies range from least recently used (LRU) to random. Furthermore, some TLBs allow certain entries to be wired down, meaning that they cannot be removed from the TLB. Typically, TLB entries for kernel code are wired down.

Figure 8.11 Paging hardware with TLB.

Some TLBs store address-space identifiers (ASIDs) in each TLB entry. An ASID uniquely identifies each process and is used to provide address-space protection for that process. When the TLB attempts to resolve virtual page numbers, it ensures that the ASID for the currently running process matches the ASID associated with the virtual page. If the ASIDs do not match, the attempt is treated as a TLB miss. In addition to providing address-space protection, an ASID allows the TLB to contain entries for several different processes simultaneously. If the TLB does not support separate ASIDs, then every time a new page table is selected (for instance, with each context switch), the TLB must be flushed (or erased) to ensure that the next executing process does not use the wrong translation information. Otherwise, the TLB could include old entries that contain valid virtual addresses but have incorrect or invalid physical addresses left over from the previous process.

The percentage of times that a particular page number is found in the TLB is called the hit ratio. An 80-percent hit ratio, for example, means that we find the desired page number in the TLB 80 percent of the time. If it takes 20 nanoseconds to search the TLB and 100 nanoseconds to access memory, then a mapped-memory access takes 120 nanoseconds when the page number is in the TLB. If we fail to find the page number in the TLB (20 nanoseconds), then we must first access memory for the page table and frame number (100 nanoseconds) and then access the desired byte in memory (100 nanoseconds), for a total of 220 nanoseconds. To find the effective memory-access time, we weight each case by its probability:

    effective access time = 0.80 × 120 + 0.20 × 220 = 140 nanoseconds.

In this example, we suffer a 40-percent slowdown in memory-access time (from 100 to 140 nanoseconds). For a 98-percent hit ratio, we have

    effective access time = 0.98 × 120 + 0.02 × 220 = 122 nanoseconds.

This increased hit ratio produces only a 22-percent slowdown in access time. We further explore the impact of the hit ratio on the TLB in Chapter 9.
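A quick way to experiment with these numbers is to compute the effective access time directly. The sketch below reproduces the two calculations above; the method name and parameters are illustrative.

// Effective access time for a TLB, as in the two examples above:
// a hit costs (tlbSearch + memory); a miss costs (tlbSearch + 2 * memory).
public class EffectiveAccessTime {
    static double eat(double hitRatio, double tlbSearchNs, double memoryNs) {
        double hitCost  = tlbSearchNs + memoryNs;       // 120 ns here
        double missCost = tlbSearchNs + 2 * memoryNs;   // 220 ns here
        return hitRatio * hitCost + (1 - hitRatio) * missCost;
    }

    public static void main(String[] args) {
        System.out.println(eat(0.80, 20, 100));  // ~140 ns
        System.out.println(eat(0.98, 20, 100));  // ~122 ns
    }
}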
8.4.3 Protection

Memory protection in a paged environment is accomplished by protection bits associated with each frame. Normally, these bits are kept in the page table. Every reference to memory goes through the page table to find the correct frame number. At the same time that the physical address is being computed, the protection bits can be checked. One protection bit can define a page to be read-write or read-only. Any attempt to write to a read-only page will cause a hardware trap to the operating system (or memory-protection violation).

We can easily expand this approach to provide a finer level of protection. We can create hardware to provide read-only, read-write, or execute-only protection; or, by providing separate protection bits for each kind of access, we can allow any combination of these accesses. Illegal attempts will in each case be trapped to the operating system.

One additional bit is generally attached to each entry in the page table: a valid-invalid bit. When this bit is set to "valid," the associated page is in the process's logical address space and is thus a legal (or valid) page. When the bit is set to "invalid," the page is not in the process's logical address space. Illegal addresses are trapped by use of the valid-invalid bit. The operating system sets this bit for each page to allow or disallow access to the page.

Suppose, for example, that in a system with a 14-bit address space (0 to 16383), we have a program that should use only addresses 0 to 10468. Given a page size of 2 KB, we have the situation shown in Figure 8.12. Addresses in pages 0, 1, 2, 3, 4, and 5 are mapped normally through the page table. Any attempt to generate an address in pages 6 or 7, however, will find that the valid-invalid bit is set to invalid, and the computer will trap to the operating system (invalid page reference).

Figure 8.12 Valid (v) or invalid (i) bit in a page table.

Notice that this scheme has created a problem. Because the program extends only to address 10468, any reference beyond that address is illegal. However, references to page 5 are classified as valid, so accesses to addresses up to 12287 are valid. Only the addresses from 12288 to 16383 are invalid. This problem is a result of the 2-KB page size and reflects the internal fragmentation of paging.

Rarely does a process use all its address range. In fact, many processes use only a small fraction of the address space available to them. It would be wasteful in these cases to create a page table with entries for every page in the address range. Most of this table would be unused but would take up valuable memory space. Some systems provide hardware, in the form of a page-table length register (PTLR), to indicate the size of the page table. This value is checked against every logical address to verify that the address is in the valid range for the process. Failure of this test causes an error trap to the operating system.

8.4.4 Shared Pages

An advantage of paging is the possibility of sharing common code. This consideration is particularly important in a time-sharing environment. Consider a system that supports 40 users, each of whom executes a text editor. If the text editor consists of 150 KB of code and 50 KB of data space, we need 8,000 KB to support the 40 users.
8.4.4 Shared Pages

An advantage of paging is the possibility of sharing common code. This consideration is particularly important in a time-sharing environment. Consider a system that supports 40 users, each of whom executes a text editor. If the text editor consists of 150 KB of code and 50 KB of data space, we need 8,000 KB to support the 40 users. If the code is reentrant code (or pure code), however, it can be shared, as shown in Figure 8.13. Here we see a three-page editor—each page 50 KB in size—being shared among three processes. (The large page size is used to simplify the figure.) Each process has its own data page.

Reentrant code is non-self-modifying code: it never changes during execution. Thus, two or more processes can execute the same code at the same time. Each process has its own copy of registers and data storage to hold the data for the process’s execution. The data for two different processes will, of course, be different.

Only one copy of the editor need be kept in physical memory. Each user’s page table maps onto the same physical copy of the editor, but data pages are mapped onto different frames. Thus, to support 40 users, we need only one copy of the editor (150 KB), plus 40 copies of the 50 KB of data space per user. The total space required is now 2,150 KB instead of 8,000 KB—a significant savings.

Figure 8.13 Sharing of code in a paging environment.

Other heavily used programs can also be shared—compilers, window systems, run-time libraries, database systems, and so on. To be sharable, the code must be reentrant. The read-only nature of shared code should not be left to the correctness of the code; the operating system should enforce this property.

The sharing of memory among processes on a system is similar to the sharing of the address space of a task by threads, described in Chapter 4. Furthermore, recall that in Chapter 3 we described shared memory as a method of interprocess communication. Some operating systems implement shared memory using shared pages.

Organizing memory according to pages provides numerous benefits in addition to allowing several processes to share the same physical pages. We cover several other benefits in Chapter 9.

8.5 Structure of the Page Table

In this section, we explore some of the most common techniques for structuring the page table: hierarchical paging, hashed page tables, and inverted page tables.

8.5.1 Hierarchical Paging

Most modern computer systems support a large logical address space (2^32 to 2^64). In such an environment, the page table itself becomes excessively large. For example, consider a system with a 32-bit logical address space. If the page size in such a system is 4 KB (2^12), then a page table may consist of up to 1 million entries (2^32/2^12). Assuming that each entry consists of 4 bytes, each process may need up to 4 MB of physical address space for the page table alone. Clearly, we would not want to allocate the page table contiguously in main memory. One simple solution to this problem is to divide the page table into smaller pieces. We can accomplish this division in several ways.

One way is to use a two-level paging algorithm, in which the page table itself is also paged (Figure 8.14).

Figure 8.14 A two-level page-table scheme.

For example, consider again the system with a 32-bit logical address space and a page size of 4 KB.
A logical address is divided into a page number consisting of 20 bits and a page offset consisting of 12 bits. Because we page the page table, the page number is further divided into a 10-bit page number and a 10-bit page offset. Thus, a logical address is as follows:

    | p1 (10 bits) | p2 (10 bits) | d (12 bits) |

where p1 is an index into the outer page table and p2 is the displacement within the page of the outer page table. The address-translation method for this architecture is shown in Figure 8.15. Because address translation works from the outer page table inward, this scheme is also known as a forward-mapped page table.

Figure 8.15 Address translation for a two-level 32-bit paging architecture.

The VAX architecture supports a variation of two-level paging. The VAX is a 32-bit machine with a page size of 512 bytes. The logical address space of a process is divided into four equal sections, each of which consists of 2^30 bytes. Each section represents a different part of the logical address space of a process. The first 2 high-order bits of the logical address designate the appropriate section. The next 21 bits represent the logical page number of that section, and the final 9 bits represent an offset in the desired page. By partitioning the page table in this manner, the operating system can leave partitions unused until a process needs them. An address on the VAX architecture is as follows:

    | s (2 bits) | p (21 bits) | d (9 bits) |

where s designates the section number, p is an index into the page table, and d is the displacement within the page. Even when this scheme is used, the size of a one-level page table for a VAX process using one section is 2^21 entries × 4 bytes per entry = 8 MB. To further reduce main-memory use, the VAX pages the user-process page tables.

For a system with a 64-bit logical address space, a two-level paging scheme is no longer appropriate. To illustrate this point, let’s suppose that the page size in such a system is 4 KB (2^12). In this case, the page table consists of up to 2^52 entries. If we use a two-level paging scheme, then the inner page tables can conveniently be one page long, or contain 2^10 4-byte entries. The addresses look like this:

    | p1: outer page (42 bits) | p2: inner page (10 bits) | d: offset (12 bits) |

The outer page table consists of 2^42 entries, or 2^44 bytes. The obvious way to avoid such a large table is to divide the outer page table into smaller pieces. (This approach is also used on some 32-bit processors for added flexibility and efficiency.)

We can divide the outer page table in various ways. We can page the outer page table, giving us a three-level paging scheme. Suppose that the outer page table is made up of standard-size pages (2^10 entries, or 2^12 bytes). In this case, a 64-bit address space is still daunting:

    | p1: 2nd outer page (32 bits) | p2: outer page (10 bits) | p3: inner page (10 bits) | d: offset (12 bits) |

The outer page table is still 2^34 bytes in size.

The next step would be a four-level paging scheme, where the second-level outer page table itself is also paged, and so forth. The 64-bit UltraSPARC would require seven levels of paging—a prohibitive number of memory accesses—to translate each logical address. You can see from this example why, for 64-bit architectures, hierarchical page tables are generally considered inappropriate.
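The 10-10-12 division of a 32-bit logical address can be expressed with a few shifts and masks. Here is a minimal Java sketch (the class name and sample address are ours):

public class TwoLevelSplit {
    static void split(int logicalAddress) {
        int d  = logicalAddress & 0xFFF;           // low 12 bits: page offset
        int p2 = (logicalAddress >>> 12) & 0x3FF;  // next 10 bits: inner page-table index
        int p1 = (logicalAddress >>> 22) & 0x3FF;  // high 10 bits: outer page-table index
        System.out.printf("p1=%d p2=%d d=%d%n", p1, p2, d);
    }

    public static void main(String[] args) {
        split(0x12345678);  // prints p1=72 p2=837 d=1656
    }
}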
8.5.2 Hashed Page Tables

A common approach for handling address spaces larger than 32 bits is to use a hashed page table, with the hash value being the virtual page number. Each entry in the hash table contains a linked list of elements that hash to the same location (to handle collisions). Each element consists of three fields: (1) the virtual page number, (2) the value of the mapped page frame, and (3) a pointer to the next element in the linked list.

The algorithm works as follows: The virtual page number in the virtual address is hashed into the hash table. The virtual page number is compared with field 1 in the first element in the linked list. If there is a match, the corresponding page frame (field 2) is used to form the desired physical address. If there is no match, subsequent entries in the linked list are searched for a matching virtual page number. This scheme is shown in Figure 8.16.

Figure 8.16 Hashed page table.

A variation of this scheme that is favorable for 64-bit address spaces has been proposed. This variation uses clustered page tables, which are similar to hashed page tables except that each entry in the hash table refers to several pages (such as 16) rather than a single page. Therefore, a single page-table entry can store the mappings for multiple physical page frames. Clustered page tables are particularly useful for sparse address spaces, where memory references are noncontiguous and scattered throughout the address space.
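As a rough illustration, the following Java sketch models such a lookup. One simplification to note: Java's HashMap stands in for the hash table and its collision chains, so the linked-list structure described above is hidden inside the library; the class name, page size, and sample mapping are ours:

import java.util.HashMap;
import java.util.Map;

public class HashedPageTable {
    private final Map<Long, Long> table = new HashMap<>();  // virtual page -> frame
    private static final int OFFSET_BITS = 12;               // assume 4-KB pages

    void map(long virtualPage, long frame) { table.put(virtualPage, frame); }

    long translate(long virtualAddress) {
        long page   = virtualAddress >>> OFFSET_BITS;
        long offset = virtualAddress & ((1 << OFFSET_BITS) - 1);
        Long frame  = table.get(page);
        if (frame == null)
            throw new IllegalStateException("page fault: page " + page);
        return (frame << OFFSET_BITS) | offset;              // frame number plus offset
    }

    public static void main(String[] args) {
        HashedPageTable pt = new HashedPageTable();
        pt.map(5, 9);                                        // virtual page 5 -> frame 9
        System.out.println(pt.translate(5 * 4096 + 100));    // 9 * 4096 + 100 = 36964
    }
}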
8.5.3 Inverted Page Tables

Usually, each process has an associated page table. The page table has one entry for each page that the process is using (or one slot for each virtual address, regardless of the latter’s validity). This table representation is a natural one, since processes reference pages through the pages’ virtual addresses. The operating system must then translate this reference into a physical memory address. Since the table is sorted by virtual address, the operating system is able to calculate where in the table the associated physical address entry is located and to use that value directly. One of the drawbacks of this method is that each page table may consist of millions of entries. These tables may consume large amounts of physical memory just to keep track of how other physical memory is being used.

To solve this problem, we can use an inverted page table. An inverted page table has one entry for each real page (or frame) of memory. Each entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns the page. Thus, only one page table is in the system, and it has only one entry for each page of physical memory. Figure 8.17 shows the operation of an inverted page table. Compare it with Figure 8.7, which depicts a standard page table in operation.

Figure 8.17 Inverted page table.

Inverted page tables often require that an address-space identifier (Section 8.4.2) be stored in each entry of the page table, since the table usually contains several different address spaces mapping physical memory. Storing the address-space identifier ensures that a logical page for a particular process is mapped to the corresponding physical page frame. Examples of systems using inverted page tables include the 64-bit UltraSPARC and PowerPC.

To illustrate this method, we describe a simplified version of the inverted page table used in the IBM RT. Each virtual address in the system consists of a triple:

    <process-id, page-number, offset>.

Each inverted page-table entry is a pair <process-id, page-number>, where the process-id assumes the role of the address-space identifier. When a memory reference occurs, part of the virtual address, consisting of <process-id, page-number>, is presented to the memory subsystem. The inverted page table is then searched for a match. If a match is found—say, at entry i—then the physical address <i, offset> is generated. If no match is found, then an illegal address access has been attempted.

Although this scheme decreases the amount of memory needed to store each page table, it increases the amount of time needed to search the table when a page reference occurs. Because the inverted page table is sorted by physical address, but lookups occur on virtual addresses, the whole table might need to be searched for a match. This search would take far too long. To alleviate this problem, we use a hash table, as described in Section 8.5.2, to limit the search to one—or at most a few—page-table entries. Of course, each access to the hash table adds a memory reference to the procedure, so one virtual memory reference requires at least two real memory reads—one for the hash-table entry and one for the page table. (Recall that the TLB is searched first, before the hash table is consulted, offering some performance improvement.)

Systems that use inverted page tables have difficulty implementing shared memory. Shared memory is usually implemented as multiple virtual addresses (one for each process sharing the memory) that are mapped to one physical address. This standard method cannot be used with inverted page tables; because there is only one virtual page entry for every physical page, one physical page cannot have two (or more) shared virtual addresses. A simple technique for addressing this issue is to allow the page table to contain only one mapping of a virtual address to the shared physical address. This means that references to virtual addresses that are not mapped result in page faults.
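The search described above can be sketched as follows. This is a minimal Java illustration of an inverted page table with a linear scan (all names are ours); a real system would use the hash table of Section 8.5.2 rather than scanning every entry:

public class InvertedPageTable {
    private final long[] pids;   // pids[i] and pages[i] record the owner and
    private final long[] pages;  // virtual page number stored in frame i

    InvertedPageTable(int frames) {
        pids  = new long[frames];
        pages = new long[frames];
        java.util.Arrays.fill(pids, -1);  // -1 marks a free frame
    }

    void map(int frame, long pid, long page) { pids[frame] = pid; pages[frame] = page; }

    int findFrame(long pid, long page) {
        for (int i = 0; i < pids.length; i++)       // one entry per physical frame
            if (pids[i] == pid && pages[i] == page)
                return i;                            // a match at entry i yields frame i
        throw new IllegalStateException("illegal access or page fault");
    }

    public static void main(String[] args) {
        InvertedPageTable ipt = new InvertedPageTable(16);
        ipt.map(7, 42, 3);                           // process 42, virtual page 3 -> frame 7
        System.out.println(ipt.findFrame(42, 3));    // prints 7
    }
}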
8.6 Segmentation

An important aspect of memory management that became unavoidable with paging is the separation of the user’s view of memory from the actual physical memory. As we have already seen, the user’s view of memory is not the same as the actual physical memory. The user’s view is mapped onto physical memory. This mapping allows differentiation between logical memory and physical memory.

8.6.1 Basic Method

Do users think of memory as a linear array of bytes, some containing instructions and others containing data? Most people would say no. Rather, users prefer to view memory as a collection of variable-sized segments, with no necessary ordering among segments (Figure 8.18).

Consider how you think of a program when you are writing it. You think of it as a main program with a set of methods, procedures, or functions. It may also include various data structures: objects, arrays, stacks, variables, and so on. Each of these modules or data elements is referred to by name. You talk about “the stack,” “the math library,” “the main program,” without caring what addresses in memory these elements occupy. You are not concerned with whether the stack is stored before or after the Sqrt() function. Each of these segments is of variable length; the length is intrinsically defined by the purpose of the segment in the program. Elements within a segment are identified by their offset from the beginning of the segment: the first statement of the program, the seventh stack frame entry in the stack, the fifth instruction of Sqrt(), and so on.

Figure 8.18 User’s view of a program.

Segmentation is a memory-management scheme that supports this user view of memory. A logical address space is a collection of segments. Each segment has a name and a length. The addresses specify both the segment name and the offset within the segment. The user therefore specifies each address by two quantities: a segment name and an offset. (Contrast this scheme with the paging scheme, in which the user specifies only a single address, which is partitioned by the hardware into a page number and an offset, all invisible to the programmer.)

For simplicity of implementation, segments are numbered and are referred to by a segment number, rather than by a segment name. Thus, a logical address consists of a two-tuple:

    <segment-number, offset>.

Normally, the user program is compiled, and the compiler automatically constructs segments reflecting the input program. A C compiler might create separate segments for the following:

1. The code
2. Global variables
3. The heap, from which memory is allocated
4. The stacks used by each thread
5. The standard C library

Libraries that are linked in during compile time might be assigned separate segments. The loader would take all these segments and assign them segment numbers.

8.6.2 Hardware

Although the user can now refer to objects in the program by a two-dimensional address, the actual physical memory is still, of course, a one-dimensional sequence of bytes. Thus, we must define an implementation to map two-dimensional user-defined addresses into one-dimensional physical addresses. This mapping is effected by a segment table. Each entry in the segment table has a segment base and a segment limit. The segment base contains the starting physical address where the segment resides in memory, and the segment limit specifies the length of the segment.

Figure 8.19 Segmentation hardware.

The use of a segment table is illustrated in Figure 8.19. A logical address consists of two parts: a segment number, s, and an offset into that segment, d. The segment number is used as an index to the segment table. The offset d of the logical address must be between 0 and the segment limit. If it is not, we trap to the operating system (logical addressing attempt beyond end of segment). When an offset is legal, it is added to the segment base to produce the address in physical memory of the desired byte. The segment table is thus essentially an array of base–limit register pairs.
As an example, consider the situation shown in Figure 8.21. We have five segments numbered from 0 through 4. The segments are stored in physical memory as shown. The segment table has a separate entry for each segment, giving the beginning address of the segment in physical memory (or base) and the length of that segment (or limit). For example, segment 2 is 400 bytes long and begins at location 4300. Thus, a reference to byte 53 of segment 2 is mapped onto location 4300 + 53 = 4353. A reference to segment 3, byte 852, is mapped to 3200 (the base of segment 3) + 852 = 4052. A reference to byte 1222 of segment 0 would result in a trap to the operating system, as this segment is only 1,000 bytes long.

Figure 8.21 Example of segmentation. The segment table:

    segment   limit   base
    0          1000   1400
    1           400   6300
    2           400   4300
    3          1100   3200
    4          1000   4700
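The lookup of Figure 8.19, applied to the segment table of Figure 8.21, can be sketched in a few lines of Java. This is a minimal illustration (the class and method names are ours, and an exception stands in for the hardware trap), not how any particular hardware implements it:

public class SegmentTable {
    static final int[] LIMIT = {1000, 400, 400, 1100, 1000};
    static final int[] BASE  = {1400, 6300, 4300, 3200, 4700};

    static int translate(int s, int d) {
        if (d < 0 || d >= LIMIT[s])  // the limit check comes first
            throw new IllegalStateException("trap: addressing error in segment " + s);
        return BASE[s] + d;          // then the offset is added to the base
    }

    public static void main(String[] args) {
        System.out.println(translate(2, 53));   // 4353
        System.out.println(translate(3, 852));  // 4052
        try {
            translate(0, 1222);                 // traps: segment 0 is only 1,000 bytes
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}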
8.7 Example: The Intel Pentium

Both paging and segmentation have advantages and disadvantages. In fact, some architectures provide both. In this section, we discuss the Intel Pentium architecture, which supports both pure segmentation and segmentation with paging. We do not give a complete description of the memory-management structure of the Pentium in this text. Rather, we present the major ideas on which it is based. We conclude our discussion with an overview of Linux address translation on Pentium systems.

In Pentium systems, the CPU generates logical addresses, which are given to the segmentation unit. The segmentation unit produces a linear address for each logical address. The linear address is then given to the paging unit, which in turn generates the physical address in main memory. Thus, the segmentation and paging units form the equivalent of the memory-management unit (MMU). This scheme is shown in Figure 8.20.

Figure 8.20 Logical to physical address translation in the Pentium.

8.7.1 Pentium Segmentation

The Pentium architecture allows a segment to be as large as 4 GB, and the maximum number of segments per process is 16 K. The logical-address space of a process is divided into two partitions. The first partition consists of up to 8 K segments that are private to that process. The second partition consists of up to 8 K segments that are shared among all the processes. Information about the first partition is kept in the local descriptor table (LDT); information about the second partition is kept in the global descriptor table (GDT). Each entry in the LDT and GDT consists of an 8-byte segment descriptor with detailed information about a particular segment, including the base location and limit of that segment.

The logical address is a pair (selector, offset), in which the selector is a 16-bit number:

    | s (13 bits) | g (1 bit) | p (2 bits) |

where s designates the segment number, g indicates whether the segment is in the GDT or LDT, and p deals with protection. The offset is a 32-bit number specifying the location of the byte (or word) within the segment in question.

The machine has six segment registers, allowing six segments to be addressed at any one time by a process. It also has six 8-byte microprogram registers to hold the corresponding descriptors from either the LDT or the GDT. This cache lets the Pentium avoid having to read the descriptor from memory for every memory reference.

The linear address on the Pentium is 32 bits long and is formed as follows. The segment register points to the appropriate entry in the LDT or GDT. The base and limit information about the segment in question is used to generate a linear address. First, the limit is used to check for address validity. If the address is not valid, a memory fault is generated, resulting in a trap to the operating system. If it is valid, then the value of the offset is added to the value of the base, resulting in a 32-bit linear address. This is shown in Figure 8.22. In the following section, we discuss how the paging unit turns this linear address into a physical address.

Figure 8.22 Intel Pentium segmentation.

8.7.2 Pentium Paging

The Pentium architecture allows a page size of either 4 KB or 4 MB. For 4-KB pages, the Pentium uses a two-level paging scheme in which the division of the 32-bit linear address is as follows:

    | p1 (10 bits) | p2 (10 bits) | d (12 bits) |

The address-translation scheme for this architecture is similar to the scheme shown in Figure 8.15. The Intel Pentium address translation is shown in more detail in Figure 8.23. The 10 high-order bits reference an entry in the outermost page table, which the Pentium terms the page directory. (The CR3 register points to the page directory for the current process.) The page directory entry points to an inner page table that is indexed by the contents of the innermost 10 bits in the linear address. Finally, the low-order bits 0–11 refer to the offset in the 4-KB page pointed to in the page table.

Figure 8.23 Paging in the Pentium architecture.

One entry in the page directory is the Page Size flag, which—if set—indicates that the size of the page frame is 4 MB and not the standard 4 KB. If this flag is set, the page directory points directly to the 4-MB page frame, bypassing the inner page table; and the 22 low-order bits in the linear address refer to the offset in the 4-MB page frame.

To improve the efficiency of physical memory use, Intel Pentium page tables can be swapped to disk. In this case, an invalid bit is used in the page directory entry to indicate whether the table to which the entry is pointing is in memory or on disk. If the table is on disk, the operating system can use the other 31 bits to specify the disk location of the table; the table then can be brought into memory on demand.

8.7.3 Linux on Pentium Systems

As an illustration, consider the Linux operating system running on the Intel Pentium architecture. Because Linux is designed to run on a variety of processors—many of which may provide only limited support for segmentation—Linux does not rely on segmentation and uses it minimally. On the Pentium, Linux uses only six segments:

1. A segment for kernel code
2. A segment for kernel data
3. A segment for user code
4. A segment for user data
5. A task-state segment (TSS)
6. A default LDT segment

The segments for user code and user data are shared by all processes running in user mode. This is possible because all processes use the same logical address space and all segment descriptors are stored in the global descriptor table (GDT). Furthermore, each process has its own task-state segment (TSS), and the descriptor for this segment is stored in the GDT. The TSS is used to store the hardware context of each process during context switches. The default LDT segment is normally shared by all processes and is usually not used. However, if a process requires its own LDT, it can create one and use that instead of the default LDT.

As noted, each segment selector includes a 2-bit field for protection. Thus, the Pentium allows four levels of protection. Of these four levels, Linux recognizes only two: user mode and kernel mode.
Although the Pentium uses a two-level paging model, Linux is designed to run on a variety of hardware platforms, many of which are 64-bit platforms where two-level paging is not plausible. Therefore, Linux has adopted a three-level paging strategy that works well for both 32-bit and 64-bit architectures. The linear address in Linux is broken into the following four parts:

    | global directory | middle directory | page table | offset |

Figure 8.24 highlights the three-level paging model in Linux.

Figure 8.24 Three-level paging in Linux.

The number of bits in each part of the linear address varies according to architecture. However, as described earlier in this section, the Pentium architecture uses only a two-level paging model. How, then, does Linux apply its three-level model on the Pentium? In this situation, the size of the middle directory is zero bits, effectively bypassing the middle directory.

Each task in Linux has its own set of page tables and—just as in Figure 8.23—the CR3 register points to the global directory for the task currently executing. During a context switch, the value of the CR3 register is saved and restored in the TSS segments of the tasks involved in the context switch.

8.8 Summary

Memory-management algorithms for multiprogrammed operating systems range from the simple single-user system approach to paged segmentation. The most important determinant of the method used in a particular system is the hardware provided. Every memory address generated by the CPU must be checked for legality and possibly mapped to a physical address. The checking cannot be implemented (efficiently) in software. Hence, we are constrained by the hardware available.

The various memory-management algorithms (contiguous allocation, paging, segmentation, and combinations of paging and segmentation) differ in many aspects. In comparing different memory-management strategies, we use the following considerations:

• Hardware support. A simple base register or a base–limit register pair is sufficient for the single- and multiple-partition schemes, whereas paging and segmentation need mapping tables to define the address map.

• Performance. As the memory-management algorithm becomes more complex, the time required to map a logical address to a physical address increases. For the simple systems, we need only compare or add to the logical address—operations that are fast. Paging and segmentation can be as fast if the mapping table is implemented in fast registers. If the table is in memory, however, user memory accesses can be degraded substantially. A TLB can reduce the performance degradation to an acceptable level.

• Fragmentation. A multiprogrammed system will generally perform more efficiently if it has a higher level of multiprogramming. For a given set of processes, we can increase the multiprogramming level only by packing more processes into memory. To accomplish this task, we must reduce memory waste, or fragmentation. Systems with fixed-sized allocation units, such as the single-partition scheme and paging, suffer from internal fragmentation. Systems with variable-sized allocation units, such as the multiple-partition scheme and segmentation, suffer from external fragmentation.

• Relocation. One solution to the external-fragmentation problem is compaction.
Compaction involves shifting a program in memory in such a way that the program does not notice the change. This consideration requires that logical addresses be relocated dynamically, at execution time. If addresses are relocated only at load time, we cannot compact storage.

• Swapping. Swapping can be added to any algorithm. At intervals determined by the operating system, usually dictated by CPU-scheduling policies, processes are copied from main memory to a backing store and later are copied back to main memory. This scheme allows more processes to be run than can be fit into memory at one time.

• Sharing. Another means of increasing the multiprogramming level is to share code and data among different users. Sharing generally requires that either paging or segmentation be used to provide small packets of information (pages or segments) that can be shared. Sharing is a means of running many processes with a limited amount of memory, but shared programs and data must be designed carefully.

• Protection. If paging or segmentation is provided, different sections of a user program can be declared execute-only, read-only, or read–write. This restriction is necessary with shared code or data and is generally useful in any case to provide simple run-time checks for common programming errors.

Practice Exercises

8.1 Name two differences between logical and physical addresses.

8.2 Consider a system in which a program can be separated into two parts: code and data. The CPU knows whether it wants an instruction (instruction fetch) or data (data fetch or store). Therefore, two base–limit register pairs are provided: one for instructions and one for data. The instruction base–limit register pair is automatically read-only, so programs can be shared among different users. Discuss the advantages and disadvantages of this scheme.

8.3 Why are page sizes always powers of 2?

8.4 Consider a logical address space of 64 pages of 1,024 words each, mapped onto a physical memory of 32 frames.
a. How many bits are there in the logical address?
b. How many bits are there in the physical address?

8.5 What is the effect of allowing two entries in a page table to point to the same page frame in memory? Explain how this effect could be used to decrease the amount of time needed to copy a large amount of memory from one place to another. What effect would updating some byte on the one page have on the other page?

8.6 Describe a mechanism by which one segment could belong to the address space of two different processes.

8.7 Sharing segments among processes without requiring that they have the same segment number is possible in a dynamically linked segmentation system.
a. Define a system that allows static linking and sharing of segments without requiring that the segment numbers be the same.
b. Describe a paging scheme that allows pages to be shared without requiring that the page numbers be the same.

8.8 In the IBM/370, memory protection is provided through the use of keys. A key is a 4-bit quantity. Each 2-K block of memory has a key (the storage key) associated with it. The CPU also has a key (the protection key) associated with it. A store operation is allowed only if both keys are equal or if either is zero. Which of the following memory-management schemes could be used successfully with this hardware?
a. Bare machine
b. Single-user system
c. Multiprogramming with a fixed number of processes
d. Multiprogramming with a variable number of processes
e. Paging
f. Segmentation
Exercises

8.9 Explain the difference between internal and external fragmentation.

8.10 Consider the following process for generating binaries. A compiler is used to generate the object code for individual modules, and a linkage editor is used to combine multiple object modules into a single program binary. How does the linkage editor change the binding of instructions and data to memory addresses? What information needs to be passed from the compiler to the linkage editor to facilitate the memory-binding tasks of the linkage editor?

8.11 Given five memory partitions of 100 KB, 500 KB, 200 KB, 300 KB, and 600 KB (in that order), how would the first-fit, best-fit, and worst-fit algorithms place processes of 212 KB, 417 KB, 112 KB, and 426 KB (in that order)? Which algorithm makes the most efficient use of memory?

8.12 Most systems allow a program to allocate more memory to its address space during execution. Allocation of data in the heap segments of programs is an example of such allocated memory. What is required to support dynamic memory allocation in the following schemes?
a. Contiguous memory allocation
b. Pure segmentation
c. Pure paging

8.13 Compare the memory organization schemes of contiguous memory allocation, pure segmentation, and pure paging with respect to the following issues:
a. External fragmentation
b. Internal fragmentation
c. Ability to share code across processes

8.14 On a system with paging, a process cannot access memory that it does not own. Why? How could the operating system allow access to other memory? Why should it or should it not?

8.15 Compare paging with segmentation with respect to the amount of memory required by the address-translation structures in order to convert virtual addresses to physical addresses.

8.16 Program binaries in many systems are typically structured as follows. Code is stored starting with a small, fixed virtual address, such as 0. The code segment is followed by the data segment that is used for storing the program variables. When the program starts executing, the stack is allocated at the other end of the virtual address space and is allowed to grow toward lower virtual addresses. What is the significance of this structure for the following schemes?
a. Contiguous memory allocation
b. Pure segmentation
c. Pure paging

8.17 Assuming a 1-KB page size, what are the page numbers and offsets for the following address references (provided as decimal numbers)?
a. 2375
b. 19366
c. 30000
d. 256
e. 16385

8.18 Consider a logical address space of 32 pages with 1,024 words per page, mapped onto a physical memory of 16 frames.
a. How many bits are required in the logical address?
b. How many bits are required in the physical address?

8.19 Consider a computer system with a 32-bit logical address and 4-KB page size. The system supports up to 512 MB of physical memory. How many entries are there in each of the following?
a. A conventional single-level page table
b. An inverted page table

8.20 Consider a paging system with the page table stored in memory.
a. If a memory reference takes 200 nanoseconds, how long does a paged memory reference take?
b. If we add TLBs, and 75 percent of all page-table references are found in the TLBs, what is the effective memory reference time? (Assume that finding a page-table entry in the TLBs takes zero time, if the entry is there.)

8.21 Why are segmentation and paging sometimes combined into one scheme?
8.22 Explain why sharing a reentrant module is easier when segmentation is used than when pure paging is used.

8.23 Consider the following segment table:

    Segment   Base   Length
    0          219      600
    1         2300       14
    2           90      100
    3         1327      580
    4         1952       96

What are the physical addresses for the following logical addresses?
a. 0,430
b. 1,10
c. 2,500
d. 3,400
e. 4,112

8.24 What is the purpose of paging the page tables?

8.25 Consider the hierarchical paging scheme used by the VAX architecture. How many memory operations are performed when a user program executes a memory-load operation?

8.26 Compare the segmented paging scheme with the hashed page table scheme for handling large address spaces. Under what circumstances is one scheme preferable to the other?

8.27 Consider the Intel address-translation scheme shown in Figure 8.22.
a. Describe all the steps taken by the Intel Pentium in translating a logical address into a physical address.
b. What are the advantages to the operating system of hardware that provides such complicated memory translation?
c. Are there any disadvantages to this address-translation system? If so, what are they? If not, why is this scheme not used by every manufacturer?

Programming Problems

8.28 Assuming that a system has a 32-bit virtual address, write a Java program that is passed (1) the size of a page and (2) the virtual address. Your program will report the page number and offset of the given virtual address with the specified page size. Page sizes must be specified as a power of 2 and within the range 1024–16384 (inclusive). Assuming such a program is named Address, it would run as follows:

    java Address 4096 19986

and the correct output would appear as:

    The address 19986 contains:
    page number = 4
    offset = 3602

Wiley Plus

Visit Wiley Plus for
• Source code
• Solutions to practice exercises
• Additional programming problems and exercises
• Labs using an operating-system simulator

Bibliographical Notes

Dynamic storage allocation was discussed by Knuth [1973] (Section 2.5), who found through simulation results that first fit is generally superior to best fit. Knuth [1973] also discussed the 50-percent rule.

The concept of paging can be credited to the designers of the Atlas system, which has been described by Kilburn et al. [1961] and by Howarth et al. [1961]. The concept of segmentation was first discussed by Dennis [1965]. Paged segmentation was first supported in the GE 645, on which MULTICS was originally implemented (Organick [1972] and Daley and Dennis [1967]).

Inverted page tables are discussed in an article about the IBM RT storage manager by Chang and Mergen [1988]. Address translation in software is covered in Jacob and Mudge [1997].

Hennessy and Patterson [2002] explain the hardware aspects of TLBs, caches, and MMUs. Talluri et al. [1995] discuss page tables for 64-bit address spaces. Alternative approaches to enforcing memory protection are proposed and studied in Wahbe et al. [1993a], Chase et al. [1994], Bershad et al. [1995], and Thorn [1997]. Dougan et al. [1999] and Jacob and Mudge [2001] discuss techniques for managing the TLB. Fang et al. [2001] evaluate support for large pages.

Tanenbaum [2001] discusses Intel 80386 paging. Memory management for several architectures—such as the Pentium II, PowerPC, and UltraSPARC—is described by Jacob and Mudge [1998a]. Segmentation on Linux systems is presented in Bovet and Cesati [2002].
CHAPTER 9 Virtual Memory

In Chapter 8, we discussed various memory-management strategies used in computer systems. All these strategies have the same goal: to keep many processes in memory simultaneously to allow multiprogramming. However, they tend to require that an entire process be in memory before it can execute.

Virtual memory is a technique that allows the execution of processes that are not completely in memory. One major advantage of this scheme is that programs can be larger than physical memory. Further, virtual memory abstracts main memory into an extremely large, uniform array of storage, separating logical memory as viewed by the user from physical memory. This technique frees programmers from the concerns of memory-storage limitations. Virtual memory also allows processes to share files easily and to implement shared memory. In addition, it provides an efficient mechanism for process creation. Virtual memory is not easy to implement, however, and may substantially decrease performance if it is used carelessly. In this chapter, we discuss virtual memory in the form of demand paging and examine its complexity and cost.

CHAPTER OBJECTIVES

• To describe the benefits of a virtual memory system.
• To explain the concepts of demand paging, page-replacement algorithms, and allocation of page frames.
• To discuss the principles of the working-set model.

9.1 Background

The memory-management algorithms outlined in Chapter 8 are necessary because of one basic requirement: the instructions being executed must be in physical memory. The first approach to meeting this requirement is to place the entire logical address space in physical memory. Dynamic loading can help to ease this restriction, but it generally requires special precautions and extra work by the programmer.

The requirement that instructions must be in physical memory to be executed seems both necessary and reasonable; but it is also unfortunate, since it limits the size of a program to the size of physical memory. In fact, an examination of real programs shows us that, in many cases, the entire program is not needed. For instance, consider the following:

• Programs often have code to handle unusual error conditions. Since these errors seldom, if ever, occur in practice, this code is almost never executed.

• Arrays, lists, and tables are often allocated more memory than they actually need. An array may be declared 100 by 100 elements, even though it is seldom larger than 10 by 10 elements. An assembler symbol table may have room for 3,000 symbols, although the average program has fewer than 200 symbols.

• Certain options and features of a program may be used rarely. For instance, the routines on U.S. government computers that balance the budget have not been used in many years.

Even in those cases where the entire program is needed, it may not all be needed at the same time. The ability to execute a program that is only partially in memory would confer many benefits:

Figure 9.1 Diagram showing virtual memory that is larger than physical memory.

• A program would no longer be constrained by the amount of physical memory that is available. Users would be able to write programs for an extremely large virtual address space, simplifying the programming task.
• Because each user program could take less physical memory, more programs could be run at the same time, with a corresponding increase in CPU utilization and throughput but with no increase in response time or turnaround time.

• Less I/O would be needed to load or swap user programs into memory, so each user program would run faster.

Thus, running a program that is not entirely in memory would benefit both the system and the user.

Virtual memory involves the separation of logical memory as perceived by users from physical memory. This separation allows an extremely large virtual memory to be provided for programmers when only a smaller physical memory is available (Figure 9.1). Virtual memory makes the task of programming much easier, because the programmer no longer needs to worry about the amount of physical memory available; she can concentrate instead on the problem to be programmed.

The virtual address space of a process refers to the logical (or virtual) view of how a process is stored in memory. Typically, this view is that a process begins at a certain logical address—say, address 0—and exists in contiguous memory, as shown in Figure 9.2. Recall from Chapter 8, though, that in fact physical memory may be organized in page frames and that the physical page frames assigned to a process may not be contiguous. It is up to the memory-management unit (MMU) to map logical pages to physical page frames in memory.

Figure 9.2 Virtual address space.

Note in Figure 9.2 that we allow for the heap to grow upward in memory as it is used for dynamic memory allocation. Similarly, we allow for the stack to grow downward in memory through successive function calls. The large blank space (or hole) between the heap and the stack is part of the virtual address space but will require actual physical pages only if the heap or stack grows. Virtual address spaces that include holes are known as sparse address spaces. Using a sparse address space is beneficial because the holes can be filled as the stack or heap segments grow or if we wish to dynamically link libraries (or possibly other shared objects) during program execution.

In addition to separating logical memory from physical memory, virtual memory allows files and memory to be shared by two or more processes through page sharing (Section 8.4.4). This leads to the following benefits:

• System libraries can be shared by several processes through mapping of the shared object into a virtual address space. Although each process considers the shared libraries to be part of its virtual address space, the actual pages where the libraries reside in physical memory are shared by all the processes (Figure 9.3). Typically, a library is mapped read-only into the space of each process that is linked with it.

Figure 9.3 Shared library using virtual memory.

• Similarly, virtual memory enables processes to share memory. Recall from Chapter 3 that two or more processes can communicate through the use of shared memory. Virtual memory allows one process to create a region of memory that it can share with another process. Processes sharing this region consider it part of their virtual address space, yet the actual physical pages of memory are shared, much as is illustrated in Figure 9.3.
• Virtual memory can allow pages to be shared during process creation with the fork() system call, thus speeding up process creation.

We further explore these—and other—benefits of virtual memory later in this chapter. First, though, we discuss implementing virtual memory through demand paging.

9.2 Demand Paging

Consider how an executable program might be loaded from disk into memory. One option is to load the entire program in physical memory at program execution time. However, a problem with this approach is that we may not initially need the entire program in memory. Suppose a program starts with a list of available options from which the user is to select. Loading the entire program into memory results in loading the executable code for all options, regardless of whether an option is ultimately selected by the user or not. An alternative strategy is to load pages only as they are needed. This technique is known as demand paging and is commonly used in virtual memory systems. With demand-paged virtual memory, pages are loaded only when they are demanded during program execution; pages that are never accessed are thus never loaded into physical memory.

A demand-paging system is similar to a paging system with swapping (Figure 9.4) where processes reside in secondary memory (usually a disk). When we want to execute a process, we swap it into memory. Rather than swapping the entire process into memory, however, we use a lazy swapper. A lazy swapper never swaps a page into memory unless that page will be needed. Since we are now viewing a process as a sequence of pages, rather than as one large contiguous address space, use of the term swapper is technically incorrect. A swapper manipulates entire processes, whereas a pager is concerned with the individual pages of a process. We thus use pager, rather than swapper, in connection with demand paging.

Figure 9.4 Transfer of a paged memory to contiguous disk space.

9.2.1 Basic Concepts

When a process is to be swapped in, the pager guesses which pages will be used before the process is swapped out again. Instead of swapping in a whole process, the pager brings only those pages into memory. Thus, it avoids reading into memory pages that will not be used anyway, decreasing the swap time and the amount of physical memory needed.

With this scheme, we need some form of hardware support to distinguish between the pages that are in memory and the pages that are on the disk. The valid–invalid bit scheme described in Section 8.4.3 can be used for this purpose. This time, however, when this bit is set to “valid,” the associated page is both legal and in memory. If the bit is set to “invalid,” the page either is not valid (that is, not in the logical address space of the process) or is valid but is currently on the disk. The page-table entry for a page that is brought into memory is set as usual, but the page-table entry for a page that is not currently in memory is either simply marked invalid or contains the address of the page on disk. This situation is depicted in Figure 9.5.

Figure 9.5 Page table when some pages are not in main memory.
Notice that marking a page invalid will have no effect if the process never attempts to access that page. Hence, if we guess right and page in all and only those pages that are actually needed, the process will run exactly as though we had brought in all pages. While the process executes and accesses pages that are memory resident, execution proceeds normally. But what happens if the process tries to access a page that was not brought into memory? Access to a page marked invalid causes a page fault. The paging hardware, in translating the address through the page table, will notice that the invalid bit is set, causing a trap to the operating system. This trap is the result of the operating system’s failure to bring the desired page into memory. The procedure for handling this page fault is straightforward (Figure 9.6):

1. We check an internal table (usually kept with the process control block) for this process to determine whether the reference was a valid or an invalid memory access.
2. If the reference was invalid, we terminate the process. If it was valid, but we have not yet brought in that page, we now page it in.
3. We find a free frame (by taking one from the free-frame list, for example).
4. We schedule a disk operation to read the desired page into the newly allocated frame.
5. When the disk read is complete, we modify the internal table kept with the process and the page table to indicate that the page is now in memory.
6. We restart the instruction that was interrupted by the trap. The process can now access the page as though it had always been in memory.

Figure 9.6 Steps in handling a page fault.

In the extreme case, we can start executing a process with no pages in memory. When the operating system sets the instruction pointer to the first instruction of the process, which is on a non-memory-resident page, the process immediately faults for the page. After this page is brought into memory, the process continues to execute, faulting as necessary until every page that it needs is in memory. At that point, it can execute with no more faults. This scheme is pure demand paging: never bring a page into memory until it is required.

Theoretically, some programs could access several new pages of memory with each instruction execution (one page for the instruction and many for data), possibly causing multiple page faults per instruction. This situation would result in unacceptable system performance. Fortunately, analysis of running processes shows that this behavior is exceedingly unlikely. Programs tend to have locality of reference, described in Section 9.6.1, which results in reasonable performance from demand paging.
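As a rough sketch, the fault-handling sequence above might be modeled as follows in Java. The page table, free-frame list, and disk read are simplified stand-ins, and all names are ours rather than any real kernel's interface:

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class DemandPager {
    static final int INVALID = -1;
    int[] frameOf = new int[8];  // page -> frame, or INVALID (page not resident)
    boolean[] legal = {true, true, true, false, false, false, false, false};
    Deque<Integer> freeFrames = new ArrayDeque<>(Arrays.asList(5, 6, 7));

    DemandPager() { Arrays.fill(frameOf, INVALID); }

    int access(int page) {
        if (frameOf[page] != INVALID)
            return frameOf[page];         // resident: no fault
        // Page fault: trap to the operating system (steps 1 and 2).
        if (!legal[page])
            throw new IllegalStateException("invalid reference: terminate process");
        int frame = freeFrames.remove();  // step 3: take a free frame
        readPageFromDisk(page, frame);    // step 4: read the desired page in
        frameOf[page] = frame;            // step 5: update the tables
        return access(page);              // step 6: restart the access
    }

    void readPageFromDisk(int page, int frame) { /* placeholder for the disk I/O */ }

    public static void main(String[] args) {
        DemandPager pager = new DemandPager();
        System.out.println(pager.access(1));  // faults, loads into frame 5
        System.out.println(pager.access(1));  // now resident: no fault
    }
}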
The hardware to support demand paging is the same as the hardware for paging and swapping:

• Page table. This table has the ability to mark an entry invalid through a valid–invalid bit or a special value of protection bits.

• Secondary memory. This memory holds those pages that are not present in main memory. The secondary memory is usually a high-speed disk. It is known as the swap device, and the section of disk used for this purpose is known as swap space. Swap-space allocation is discussed in Chapter 12.

A crucial requirement for demand paging is the ability to restart any instruction after a page fault. Because we save the state (registers, condition code, instruction counter) of the interrupted process when the page fault occurs, we must be able to restart the process in exactly the same place and state, except that the desired page is now in memory and is accessible. In most cases, this requirement is easy to meet. A page fault may occur at any memory reference. If the page fault occurs on the instruction fetch, we can restart by fetching the instruction again. If a page fault occurs while we are fetching an operand, we must fetch and decode the instruction again and then fetch the operand.

As a worst-case example, consider a three-address instruction such as ADD the content of A to B, placing the result in C. These are the steps to execute this instruction:

1. Fetch and decode the instruction (ADD).
2. Fetch A.
3. Fetch B.
4. Add A and B.
5. Store the sum in C.

If we fault when we try to store in C (because C is in a page not currently in memory), we will have to get the desired page, bring it in, correct the page table, and restart the instruction. The restart will require fetching the instruction again, decoding it again, fetching the two operands again, and then adding again. However, there is not much repeated work (less than one complete instruction), and the repetition is necessary only when a page fault occurs.

The major difficulty arises when one instruction may modify several different locations. For example, consider the IBM System 360/370 MVC (move character) instruction, which can move up to 256 bytes from one location to another (possibly overlapping) location. If either block (source or destination) straddles a page boundary, a page fault might occur after the move is partially done. In addition, if the source and destination blocks overlap, the source block may have been modified, in which case we cannot simply restart the instruction.

This problem can be solved in two different ways. In one solution, the microcode computes and attempts to access both ends of both blocks. If a page fault is going to occur, it will happen at this step, before anything is modified. The move can then take place; we know that no page fault can occur, since all the relevant pages are in memory. The other solution uses temporary registers to hold the values of overwritten locations. If there is a page fault, all the old values are written back into memory before the trap occurs. This action restores memory to its state before the instruction was started, so that the instruction can be repeated.

This is by no means the only architectural problem resulting from adding paging to an existing architecture to allow demand paging, but it illustrates some of the difficulties involved. Paging is added between the CPU and the memory in a computer system. It should be entirely transparent to the user process. Thus, people often assume that paging can be added to any system. Although this assumption is true for a non-demand-paging environment, where a page fault represents a fatal error, it is not true where a page fault means only that an additional page must be brought into memory and the process restarted.

9.2.2 Performance of Demand Paging

Demand paging can significantly affect the performance of a computer system. To see why, let’s compute the effective access time for a demand-paged memory. For most computer systems, the memory-access time, denoted ma, ranges from 10 to 200 nanoseconds. As long as we have no page faults, the effective access time is equal to the memory-access time.
If, however, a page fault occurs, we must first read the relevant page from disk and then access the desired word.

Let p be the probability of a page fault (0 ≤ p ≤ 1). We would expect p to be close to zero—that is, we would expect to have only a few page faults. The effective access time is then

    effective access time = (1 − p) × ma + p × page-fault time.

To compute the effective access time, we must know how much time is needed to service a page fault. A page fault causes the following sequence to occur:

1. Trap to the operating system.
2. Save the user registers and process state.
3. Determine that the interrupt was a page fault.
4. Check that the page reference was legal and determine the location of the page on the disk.
5. Issue a read from the disk to a free frame:
   a. Wait in a queue for this device until the read request is serviced.
   b. Wait for the device seek and/or latency time.
   c. Begin the transfer of the page to a free frame.
6. While waiting, allocate the CPU to some other user (CPU scheduling, optional).
7. Receive an interrupt from the disk I/O subsystem (I/O completed).
8. Save the registers and process state for the other user (if step 6 is executed).
9. Determine that the interrupt was from the disk.
10. Correct the page table and other tables to show that the desired page is now in memory.
11. Wait for the CPU to be allocated to this process again.
12. Restore the user registers, process state, and new page table, and then resume the interrupted instruction.

Not all of these steps are necessary in every case. For example, we are assuming that, in step 6, the CPU is allocated to another process while the I/O occurs. This arrangement allows multiprogramming to maintain CPU utilization but requires additional time to resume the page-fault service routine when the I/O transfer is complete.

In any case, we are faced with three major components of the page-fault service time:

1. Service the page-fault interrupt.
2. Read in the page.
3. Restart the process.

The first and third tasks can be reduced, with careful coding, to several hundred instructions. These tasks may take from 1 to 100 microseconds each. The page-switch time, however, will probably be close to 8 milliseconds. (A typical hard disk has an average latency of 3 milliseconds, a seek of 5 milliseconds, and a transfer time of 0.05 milliseconds. Thus, the total paging time is about 8 milliseconds, including hardware and software time.) Remember also that we are looking at only the device-service time. If a queue of processes is waiting for the device, we have to add device-queueing time as we wait for the paging device to be free to service our request, increasing even more the time to swap.

With an average page-fault service time of 8 milliseconds and a memory-access time of 200 nanoseconds, the effective access time in nanoseconds is

    effective access time = (1 − p) × 200 + p × (8 milliseconds)
                          = (1 − p) × 200 + p × 8,000,000
                          = 200 + 7,999,800 × p.

We see, then, that the effective access time is directly proportional to the page-fault rate. If one access out of 1,000 causes a page fault, the effective access time is 8.2 microseconds. The computer will be slowed down by a factor of 40 because of demand paging! If we want performance degradation to be less than 10 percent, we need

    220 > 200 + 7,999,800 × p,
    20 > 7,999,800 × p,
    p < 0.0000025.

That is, to keep the slowdown due to paging at a reasonable level, we can allow fewer than one memory access out of 399,990 to page-fault.
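The following minimal Java sketch (names are ours) evaluates this formula, assuming the 200-nanosecond memory access and 8-millisecond page-fault service time from the text:

public class DemandPagingEAT {
    static double eat(double p, double maNs, double faultNs) {
        return (1 - p) * maNs + p * faultNs;  // effective access time in nanoseconds
    }

    public static void main(String[] args) {
        System.out.println(eat(0.001, 200, 8_000_000));      // approximately 8200 ns = 8.2 microseconds
        System.out.println(eat(0.0000025, 200, 8_000_000));  // approximately 220 ns: the 10-percent bound
    }
}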
In sum, it is important to keep the page-fault rate low in a demand-paging system. Otherwise, the effective access time increases, slowing process execution dramatically.

An additional aspect of demand paging is the handling and overall use of swap space. Disk I/O to swap space is generally faster than that to the file system. It is faster because swap space is allocated in much larger blocks, and file lookups and indirect allocation methods are not used (Chapter 12). The system can therefore gain better paging throughput by copying an entire file image into the swap space at process startup and then performing demand paging from the swap space. Another option is to demand pages from the file system initially but to write the pages to swap space as they are replaced. This approach will ensure that only needed pages are read from the file system but that all subsequent paging is done from swap space.

Some systems attempt to limit the amount of swap space used through demand paging of binary files. Demand pages for such files are brought directly from the file system. However, when page replacement is called for, these frames can simply be overwritten (because they are never modified), and the pages can be read in from the file system again if needed. Using this approach, the file system itself serves as the backing store. However, swap space must still be used for pages not associated with a file; these pages include the stack and heap for a process. This method appears to be a good compromise and is used in several systems, including Solaris and BSD UNIX.

9.3 Copy-on-Write

In Section 9.2, we illustrated how a process can start quickly by merely demand-paging in the page containing the first instruction. However, process creation using the fork() system call may initially bypass the need for demand paging by using a technique similar to page sharing (covered in Section 8.4.4). This technique provides for rapid process creation and minimizes the number of new pages that must be allocated to the newly created process.

Recall that the fork() system call creates a child process that is a duplicate of its parent. Traditionally, fork() worked by creating a copy of the parent's address space for the child, duplicating the pages belonging to the parent. However, considering that many child processes invoke the exec() system call immediately after creation, the copying of the parent's address space may be unnecessary. Instead, we can use a technique known as copy-on-write, which works by allowing the parent and child processes initially to share the same pages. These shared pages are marked as copy-on-write pages, meaning that if either process writes to a shared page, a copy of the shared page is created. Copy-on-write is illustrated in Figures 9.7 and 9.8, which show the contents of the physical memory before and after process 1 modifies page C.

For example, assume that the child process attempts to modify a page containing portions of the stack, with the pages set to be copy-on-write. The operating system will create a copy of this page, mapping it to the address space of the child process. The child process will then modify its copied page and not the page belonging to the parent process. Obviously, when the copy-on-write technique is used, only the pages that are modified by either process are copied; all unmodified pages can be shared by the parent and child processes. Note, too, that only pages that can be modified need be marked as copy-on-write. Pages that cannot be modified (pages containing executable code) can be shared by the parent and child.
Figure 9.7 Before process 1 modifies page C.

Figure 9.8 After process 1 modifies page C.

Copy-on-write is a common technique used by several operating systems, including Windows XP, Linux, and Solaris.

When it is determined that a page is going to be duplicated using copy-on-write, it is important to note the location from which the free page will be allocated. Many operating systems provide a pool of free pages for such requests. These free pages are typically allocated when the stack or heap for a process must expand or when there are copy-on-write pages to be managed. Operating systems typically allocate these pages using a technique known as zero-fill-on-demand. Zero-fill-on-demand pages have been zeroed-out before being allocated, thus erasing the previous contents.

Several versions of UNIX (including Solaris and Linux) provide a variation of the fork() system call—vfork() (for virtual memory fork)—that operates differently from fork() with copy-on-write. With vfork(), the parent process is suspended, and the child process uses the address space of the parent. Because vfork() does not use copy-on-write, if the child process changed any pages of the parent's address space, the altered pages would be visible to the parent once it resumed. Therefore, vfork() must be used with caution to ensure that the child process does not modify the address space of the parent. vfork() is intended to be used when the child process calls exec() immediately after creation. Because no copying of pages takes place, vfork() is an extremely efficient method of process creation and is sometimes used to implement UNIX command-line shell interfaces.

9.4 Page Replacement

In our earlier discussion of the page-fault rate, we assumed that each page faults at most once, when it is first referenced. This representation is not strictly accurate, however. If a process of ten pages actually uses only half of them, then demand paging saves the I/O necessary to load the five pages that are never used. We could also increase our degree of multiprogramming by running twice as many processes. Thus, if we had forty frames, we could run eight processes, rather than the four that could run if each required ten frames (five of which were never used).

Figure 9.9 Need for page replacement.

If we increase our degree of multiprogramming, we are over-allocating memory. If we run six processes, each of which is ten pages in size but actually uses only five pages, we have higher CPU utilization and throughput, with ten frames to spare. It is possible, however, that each of these processes, for a particular data set, may suddenly try to use all ten of its pages, resulting in a need for sixty frames when only forty are available.

Further, consider that system memory is not used only for holding program pages. Buffers for I/O also consume a considerable amount of memory. This use can increase the strain on memory-placement algorithms.
Deciding how much memory to allocate to I/O and how much to program pages is a significant challenge. Some systems allocate a fixed percentage of memory for I/O buffers, whereas others allow both user processes and the I/O subsystem to compete for all system memory.

Over-allocation of memory manifests itself as follows. While a user process is executing, a page fault occurs. The operating system determines where the desired page is residing on the disk but then finds that there are no free frames on the free-frame list; all memory is in use (Figure 9.9).

The operating system has several options at this point. It could terminate the user process. However, demand paging is the operating system's attempt to improve the computer system's utilization and throughput. Users should not be aware that their processes are running on a paged system—paging should be logically transparent to the user. So this option is not the best choice. The operating system could instead swap out a process, freeing all its frames and reducing the level of multiprogramming. This option is a good one in certain circumstances, and we consider it further in Section 9.6. Here, we discuss the most common solution: page replacement.

9.4.1 Basic Page Replacement

Page replacement takes the following approach. If no frame is free, we find one that is not currently being used and free it. We can free a frame by writing its contents to swap space and changing the page table (and all other tables) to indicate that the page is no longer in memory (Figure 9.10). We can now use the freed frame to hold the page for which the process faulted. We modify the page-fault service routine to include page replacement:

1. Find the location of the desired page on the disk.
2. Find a free frame:
   a. If there is a free frame, use it.
   b. If there is no free frame, use a page-replacement algorithm to select a victim frame.
   c. Write the victim frame to the disk; change the page and frame tables accordingly.
3. Read the desired page into the newly freed frame; change the page and frame tables.
4. Restart the user process.

Notice that, if no frames are free, two page transfers (one out and one in) are required. This situation effectively doubles the page-fault service time and increases the effective access time accordingly.

Figure 9.10 Page replacement.

We can reduce this overhead by using a modify bit (or dirty bit). When this scheme is used, each page or frame has a modify bit associated with it in the hardware. The modify bit for a page is set by the hardware whenever any word or byte in the page is written into, indicating that the page has been modified. When we select a page for replacement, we examine its modify bit. If the bit is set, we know that the page has been modified since it was read in from the disk. In this case, we must write the page to the disk. If the modify bit is not set, however, the page has not been modified since it was read into memory. In this case, we need not write the memory page to the disk: it is already there. This technique also applies to read-only pages (for example, pages of binary code). Such pages cannot be modified; thus, they may be discarded when desired.
This scheme can significantly reduce the time required to service a page fault, since it reduces I/O time by one-half if the page has not been modified.

Page replacement is basic to demand paging. It completes the separation between logical memory and physical memory. With this mechanism, an enormous virtual memory can be provided for programmers on a smaller physical memory. With no demand paging, user addresses are mapped into physical addresses, so the two sets of addresses can be different. All the pages of a process still must be in physical memory, however. With demand paging, the size of the logical address space is no longer constrained by physical memory. If we have a user process of twenty pages, we can execute it in ten frames simply by using demand paging and using a replacement algorithm to find a free frame whenever necessary. If a page that has been modified is to be replaced, its contents are copied to the disk. A later reference to that page will cause a page fault. At that time, the page will be brought back into memory, perhaps replacing some other page in the process.

We must solve two major problems to implement demand paging: we must develop a frame-allocation algorithm and a page-replacement algorithm. That is, if we have multiple processes in memory, we must decide how many frames to allocate to each process; and when page replacement is required, we must select the frames that are to be replaced. Designing appropriate algorithms to solve these problems is an important task, because disk I/O is so expensive. Even slight improvements in demand-paging methods yield large gains in system performance.

There are many different page-replacement algorithms. Every operating system probably has its own replacement scheme. How do we select a particular replacement algorithm? In general, we want the one with the lowest page-fault rate.

We evaluate an algorithm by running it on a particular string of memory references and computing the number of page faults. The string of memory references is called a reference string. We can generate reference strings artificially (by using a random-number generator, for example), or we can trace a given system and record the address of each memory reference. The latter choice produces a large amount of data (on the order of 1 million addresses per second). To reduce the volume of data, we use two facts.

First, for a given page size (and the page size is generally fixed by the hardware or system), we need to consider only the page number, rather than the entire address. Second, if we have a reference to a page p, then any references to page p that immediately follow will never cause a page fault. Page p will be in memory after the first reference, so the immediately following references will not fault.

For example, if we trace a particular process, we might record the following address sequence:

0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105

At 100 bytes per page, this sequence is reduced to the following reference string:

1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1

To determine the number of page faults for a particular reference string and page-replacement algorithm, we also need to know the number of page frames available. Obviously, as the number of frames available increases, the number of page faults should decrease.
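The reduction just described is mechanical: divide each address by the page size and drop consecutive duplicates. A minimal sketch (ours, not from the text; the 100-byte page size matches the example above):

import java.util.ArrayList;
import java.util.List;

public class ReferenceString {
    // Reduce a traced address sequence to a reference string:
    // keep only page numbers, dropping immediate repeats.
    static List<Integer> reduce(int[] addresses, int pageSize) {
        List<Integer> refs = new ArrayList<>();
        for (int addr : addresses) {
            int page = addr / pageSize;
            if (refs.isEmpty() || refs.get(refs.size() - 1) != page)
                refs.add(page);
        }
        return refs;
    }

    public static void main(String[] args) {
        int[] trace = {100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
                       104, 101, 610, 102, 103, 104, 101, 609, 102, 105};
        System.out.println(reduce(trace, 100)); // [1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1]
    }
}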
For the reference string considered previously, for example, if we had three or more frames, we would have only three faults—one fault for the first reference to each page. In contrast, with only one frame available, we would have a replacement with every reference, resulting in eleven faults. In general, we expect a curve such as that in Figure 9.11. As the number of frames increases, the number of page faults drops to some minimal level. Of course, adding physical memory increases the number of frames.

Figure 9.11 Graph of page faults versus number of frames.

We next illustrate several page-replacement algorithms. In doing so, we use the reference string

7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1

for a memory with three frames.

9.4.2 FIFO Page Replacement

The simplest page-replacement algorithm is a first-in, first-out (FIFO) algorithm. A FIFO replacement algorithm associates with each page the time when that page was brought into memory. When a page must be replaced, the oldest page is chosen. Notice that it is not strictly necessary to record the time when a page is brought in. We can create a FIFO queue to hold all pages in memory. We replace the page at the head of the queue. When a page is brought into memory, we insert it at the tail of the queue.

For our example reference string, our three frames are initially empty. The first three references (7, 0, 1) cause page faults and are brought into these empty frames. The next reference (2) replaces page 7, because page 7 was brought in first. Since 0 is the next reference and 0 is already in memory, we have no fault for this reference. The first reference to 3 results in replacement of page 0, since it is now first in line. Because of this replacement, the next reference, to 0, will fault. Page 1 is then replaced by page 0. This process continues as shown in Figure 9.12. Every time a fault occurs, we show which pages are in our three frames. There are fifteen faults altogether.

Figure 9.12 FIFO page-replacement algorithm.

The FIFO page-replacement algorithm is easy to understand and program. However, its performance is not always good. On the one hand, the page replaced may be an initialization module that was used a long time ago and is no longer needed. On the other hand, it could contain a heavily used variable that was initialized early and is in constant use.

Notice that, even if we select for replacement a page that is in active use, everything still works correctly. After we replace an active page with a new one, a fault occurs almost immediately to retrieve the active page. Some other page must be replaced to bring the active page back into memory. Thus, a bad replacement choice increases the page-fault rate and slows process execution. It does not, however, cause incorrect execution.

To illustrate the problems that are possible with a FIFO page-replacement algorithm, we consider the following reference string:

1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

Figure 9.13 shows the curve of page faults for this reference string versus the number of available frames. Notice that the number of faults for four frames (ten) is greater than the number of faults for three frames (nine)! This most unexpected result is known as Belady's anomaly: for some page-replacement algorithms, the page-fault rate may increase as the number of allocated frames increases.
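A FIFO simulator makes the anomaly easy to reproduce. The sketch below (ours, not from the text) counts faults for the reference string above with three and then four frames:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class FifoReplacement {
    // Count page faults for a reference string with a given number of frames.
    static int faults(int[] refs, int frames) {
        Queue<Integer> fifo = new ArrayDeque<>(); // pages in arrival order
        Set<Integer> inMemory = new HashSet<>();
        int faults = 0;
        for (int page : refs) {
            if (inMemory.contains(page))
                continue;                        // hit: FIFO order is unchanged
            faults++;
            if (fifo.size() == frames)
                inMemory.remove(fifo.remove());  // evict the oldest page
            fifo.add(page);
            inMemory.add(page);
        }
        return faults;
    }

    public static void main(String[] args) {
        int[] refs = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
        System.out.println("3 frames: " + faults(refs, 3)); // 9 faults
        System.out.println("4 frames: " + faults(refs, 4)); // 10 faults (Belady's anomaly)
    }
}

Running the same method on the twenty-reference sample string with three frames yields the fifteen faults shown in Figure 9.12.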
Figure 9.13 Page-fault curve for FIFO replacement on a reference string.

We would expect that giving more memory to a process would improve its performance. In some early research, investigators noticed that this assumption was not always true. Belady's anomaly was discovered as a result.

9.4.3 Optimal Page Replacement

One result of the discovery of Belady's anomaly was the search for an optimal page-replacement algorithm, which has the lowest page-fault rate of all algorithms and will never suffer from Belady's anomaly. Such an algorithm does exist and has been called OPT or MIN. It is simply this: Replace the page that will not be used for the longest period of time. Use of this page-replacement algorithm guarantees the lowest possible page-fault rate for a fixed number of frames.

For example, on our sample reference string, the optimal page-replacement algorithm would yield nine page faults, as shown in Figure 9.14. The first three references cause faults that fill the three empty frames. The reference to page 2 replaces page 7, because page 7 will not be used until reference 18, whereas page 0 will be used at 5, and page 1 at 14. The reference to page 3 replaces page 1, as page 1 will be the last of the three pages in memory to be referenced again. With only nine page faults, optimal replacement is much better than a FIFO algorithm, which results in fifteen faults. (If we ignore the first three, which all algorithms must suffer, then optimal replacement is twice as good as FIFO replacement.) In fact, no replacement algorithm can process this reference string in three frames with fewer than nine faults.

Figure 9.14 Optimal page-replacement algorithm.

Unfortunately, the optimal page-replacement algorithm is difficult to implement, because it requires future knowledge of the reference string. (We encountered a similar situation with the SJF CPU-scheduling algorithm in Section 5.3.2.) As a result, the optimal algorithm is used mainly for comparison studies. For instance, it may be useful to know that, although a new algorithm is not optimal, it is within 12.3 percent of optimal at worst and within 4.7 percent on average.

9.4.4 LRU Page Replacement

If the optimal algorithm is not feasible, perhaps an approximation of the optimal algorithm is possible. The key distinction between the FIFO and OPT algorithms (other than looking backward versus forward in time) is that the FIFO algorithm uses the time when a page was brought into memory, whereas the OPT algorithm uses the time when a page is to be used. If we use the recent past as an approximation of the near future, then we can replace the page that has not been used for the longest period of time. This approach is the least-recently-used (LRU) algorithm.

LRU replacement associates with each page the time of that page's last use. When a page must be replaced, LRU chooses the page that has not been used for the longest period of time. We can think of this strategy as the optimal page-replacement algorithm looking backward in time, rather than forward.
(Strangely, if we let S^R be the reverse of a reference string S, then the page-fault rate for the OPT algorithm on S is the same as the page-fault rate for the OPT algorithm on S^R. Similarly, the page-fault rate for the LRU algorithm on S is the same as the page-fault rate for the LRU algorithm on S^R.)

The result of applying LRU replacement to our example reference string is shown in Figure 9.15. The LRU algorithm produces twelve faults. Notice that the first five faults are the same as those for optimal replacement. When the reference to page 4 occurs, however, LRU replacement sees that, of the three frames in memory, page 2 was used least recently. Thus, the LRU algorithm replaces page 2, not knowing that page 2 is about to be used. When it then faults for page 2, the LRU algorithm replaces page 3, since it is now the least recently used of the three pages in memory. Despite these problems, LRU replacement with twelve faults is much better than FIFO replacement with fifteen.

Figure 9.15 LRU page-replacement algorithm.

The LRU policy is often used as a page-replacement algorithm and is considered to be good. The major problem is how to implement LRU replacement. An LRU page-replacement algorithm may require substantial hardware assistance. The problem is to determine an order for the frames defined by the time of last use. Two implementations are feasible:

• Counters. In the simplest case, we associate with each page-table entry a time-of-use field and add to the CPU a logical clock or counter. The clock is incremented for every memory reference. Whenever a reference to a page is made, the contents of the clock register are copied to the time-of-use field in the page-table entry for that page. In this way, we always have the "time" of the last reference to each page. We replace the page with the smallest time value. This scheme requires a search of the page table to find the LRU page and a write to memory (to the time-of-use field in the page table) for each memory access. The times must also be maintained when page tables are changed (due to CPU scheduling). Overflow of the clock must be considered.

• Stack. Another approach to implementing LRU replacement is to keep a stack of page numbers. Whenever a page is referenced, it is removed from the stack and put on the top. In this way, the most recently used page is always at the top of the stack and the least recently used page is always at the bottom (Figure 9.16). Because entries must be removed from the middle of the stack, it is best to implement this approach by using a doubly linked list with a head pointer and a tail pointer. Removing a page and putting it on the top of the stack then requires changing six pointers at most. Each update is a little more expensive, but there is no search for a replacement; the tail pointer points to the bottom of the stack, which is the LRU page. This approach is particularly appropriate for software or microcode implementations of LRU replacement.

Like optimal replacement, LRU replacement does not suffer from Belady's anomaly. Both belong to a class of page-replacement algorithms, called stack algorithms, that can never exhibit Belady's anomaly. A stack algorithm is an algorithm for which it can be shown that the set of pages in memory for n frames is always a subset of the set of pages that would be in memory with n + 1 frames. For LRU replacement, the set of pages in memory would be the n most recently referenced pages. If the number of frames is increased, these n pages will still be the most recently referenced and so will still be in memory.
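In software, the stack scheme maps naturally onto Java's LinkedHashMap, whose access-order mode keeps the least recently used entry eldest. The following is a minimal simulation sketch (ours, not from the text), not an in-kernel implementation:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruReplacement {
    // Count LRU page faults using an access-ordered map:
    // the eldest entry is always the least recently used page.
    static int faults(int[] refs, int frames) {
        Map<Integer, Boolean> lru =
            new LinkedHashMap<Integer, Boolean>(frames, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > frames; // evict the LRU page beyond capacity
                }
            };
        int faults = 0;
        for (int page : refs) {
            if (lru.get(page) == null) { // miss: fault and load (get() reorders on a hit)
                faults++;
                lru.put(page, Boolean.TRUE);
            }
        }
        return faults;
    }

    public static void main(String[] args) {
        int[] refs = {7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1};
        System.out.println("LRU faults: " + faults(refs, 3)); // 12, matching Figure 9.15
    }
}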
Note that neither implementation of LRU would be conceivable without hardware assistance beyond the standard TLB registers. The updating of the clock fields or stack must be done for every memory reference. If we were to use an interrupt for every reference to allow software to update such data structures, it would slow every memory reference by a factor of at least ten, hence slowing every user process by a factor of ten. Few systems could tolerate that level of overhead for memory management.

Figure 9.16 Use of a stack to record the most recent page references.

9.4.5 LRU-Approximation Page Replacement

Few computer systems provide sufficient hardware support for true LRU page replacement. Some systems provide no hardware support, and other page-replacement algorithms (such as a FIFO algorithm) must be used. Many systems provide some help, however, in the form of a reference bit. The reference bit for a page is set by the hardware whenever that page is referenced (either a read or a write to any byte in the page). Reference bits are associated with each entry in the page table.

Initially, all bits are cleared (to 0) by the operating system. As a user process executes, the bit associated with each page referenced is set (to 1) by the hardware. After some time, we can determine which pages have been used and which have not been used by examining the reference bits, although we do not know the order of use. This information is the basis for many page-replacement algorithms that approximate LRU replacement.

9.4.5.1 Additional-Reference-Bits Algorithm

We can gain additional ordering information by recording the reference bits at regular intervals. We can keep an 8-bit byte for each page in a table in memory. At regular intervals (say, every 100 milliseconds), a timer interrupt transfers control to the operating system. The operating system shifts the reference bit for each page into the high-order bit of its 8-bit byte, shifting the other bits right by 1 bit and discarding the low-order bit. These 8-bit shift registers contain the history of page use for the last eight time periods. If the shift register contains 00000000, for example, then the page has not been used for eight time periods; a page that is used at least once in each period has a shift register value of 11111111. A page with a history register value of 11000100 has been used more recently than one with a value of 01110111. If we interpret these 8-bit bytes as unsigned integers, the page with the lowest number is the LRU page, and it can be replaced. Notice that the numbers are not guaranteed to be unique, however. We can either replace (swap out) all pages with the smallest value or use the FIFO method to choose among them.

The number of bits of history included in the shift register can be varied, of course, and is selected (depending on the hardware available) to make the updating as fast as possible. In the extreme case, the number can be reduced to zero, leaving only the reference bit itself. This algorithm is called the second-chance page-replacement algorithm.
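The shift-register bookkeeping is compact enough to sketch directly. The following fragment (ours, not from the text) ages one history value per page at each timer interrupt and picks the victim with the smallest unsigned value:

public class AgingHistory {
    // One 8-bit history value per page; bit 7 records the most recent period.
    static void age(int[] history, boolean[] referenced) {
        for (int p = 0; p < history.length; p++) {
            history[p] = (history[p] >>> 1) | (referenced[p] ? 0x80 : 0x00);
            referenced[p] = false; // reference bit is cleared for the next period
        }
    }

    // Approximate LRU victim: the page with the smallest history value.
    static int victim(int[] history) {
        int best = 0;
        for (int p = 1; p < history.length; p++)
            if (history[p] < history[best])
                best = p;
        return best;
    }

    public static void main(String[] args) {
        int[] history = {0b11000100, 0b01110111, 0b00000000};
        System.out.println("victim = page " + victim(history)); // page 2, unused for 8 periods
    }
}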
9.4.5.2 Second-Chance Algorithm

The basic algorithm of second-chance replacement is a FIFO replacement algorithm. When a page has been selected, however, we inspect its reference bit. If the value is 0, we proceed to replace this page; but if the reference bit is set to 1, we give the page a second chance and move on to select the next FIFO page. When a page gets a second chance, its reference bit is cleared, and its arrival time is reset to the current time. Thus, a page that is given a second chance will not be replaced until all other pages have been replaced (or given second chances). In addition, if a page is used often enough to keep its reference bit set, it will never be replaced.

One way to implement the second-chance algorithm (sometimes referred to as the clock algorithm) is as a circular queue. A pointer (that is, a hand on the clock) indicates which page is to be replaced next. When a frame is needed, the pointer advances until it finds a page with a 0 reference bit. As it advances, it clears the reference bits (Figure 9.17). Once a victim page is found, the page is replaced, and the new page is inserted in the circular queue in that position. Notice that, in the worst case, when all bits are set, the pointer cycles through the whole queue, giving each page a second chance. It clears all the reference bits before selecting the next page for replacement. Second-chance replacement degenerates to FIFO replacement if all bits are set.

Figure 9.17 Second-chance (clock) page-replacement algorithm.

9.4.5.3 Enhanced Second-Chance Algorithm

We can enhance the second-chance algorithm by considering the reference bit and the modify bit (described in Section 9.4.1) as an ordered pair. With these two bits, we have the following four possible classes:

1. (0, 0) neither recently used nor modified—best page to replace
2. (0, 1) not recently used but modified—not quite as good, because the page will need to be written out before replacement
3. (1, 0) recently used but clean—probably will be used again soon
4. (1, 1) recently used and modified—probably will be used again soon, and the page will need to be written out to disk before it can be replaced

Each page is in one of these four classes. When page replacement is called for, we use the same scheme as in the clock algorithm; but instead of examining whether the page to which we are pointing has the reference bit set to 1, we examine the class to which that page belongs. We replace the first page encountered in the lowest nonempty class. Notice that we may have to scan the circular queue several times before we find a page to be replaced.

The major difference between this algorithm and the simpler clock algorithm is that here we give preference to those pages that have been modified, to reduce the number of I/Os required.
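The clock formulation is straightforward to simulate. Below is a minimal sketch (ours, not from the text), with one reference bit per frame and a hand that sweeps the circular queue; following one common variant, it sets the reference bit when a page is inserted:

public class ClockReplacement {
    final int[] frames;      // page held by each frame (-1 = empty)
    final boolean[] refBit;  // reference bit per frame
    int hand = 0;            // clock hand position

    ClockReplacement(int n) {
        frames = new int[n];
        refBit = new boolean[n];
        java.util.Arrays.fill(frames, -1);
    }

    // Returns true if the reference caused a page fault.
    boolean reference(int page) {
        for (int i = 0; i < frames.length; i++)
            if (frames[i] == page) {
                refBit[i] = true;      // hit: hardware would set the reference bit
                return false;
            }
        while (refBit[hand]) {         // sweep, giving second chances
            refBit[hand] = false;
            hand = (hand + 1) % frames.length;
        }
        frames[hand] = page;           // victim found: replace it
        refBit[hand] = true;
        hand = (hand + 1) % frames.length;
        return true;
    }

    public static void main(String[] args) {
        ClockReplacement clock = new ClockReplacement(3);
        int faults = 0;
        for (int p : new int[]{7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1})
            if (clock.reference(p)) faults++;
        System.out.println("second-chance faults: " + faults);
    }
}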
9.4.6 Counting-Based Page Replacement

There are many other algorithms that can be used for page replacement. For example, we can keep a counter of the number of references that have been made to each page and develop the following two schemes.

• The least-frequently-used (LFU) page-replacement algorithm requires that the page with the smallest count be replaced. The reason for this selection is that an actively used page should have a large reference count. A problem arises, however, when a page is used heavily during the initial phase of a process but then is never used again. Since it was used heavily, it has a large count and remains in memory even though it is no longer needed. One solution is to shift the counts right by 1 bit at regular intervals, forming an exponentially decaying average usage count.

• The most-frequently-used (MFU) page-replacement algorithm is based on the argument that the page with the smallest count was probably just brought in and has yet to be used.

As you might expect, neither MFU nor LFU replacement is common. The implementation of these algorithms is expensive, and they do not approximate OPT replacement well.

9.4.7 Page-Buffering Algorithms

Other procedures are often used along with a specific page-replacement algorithm. For example, systems commonly keep a pool of free frames. When a page fault occurs, a victim frame is chosen as before. However, the desired page is read into a free frame from the pool before the victim is written out. This procedure allows the process to restart as soon as possible, without waiting for the victim page to be written out. When the victim is later written out, its frame is added to the free-frame pool.

An expansion of this idea is to maintain a list of modified pages. Whenever the paging device is idle, a modified page is selected and is written to the disk. Its modify bit is then reset. This scheme increases the probability that a page will be clean when it is selected for replacement and will not need to be written out.

Another modification is to keep a pool of free frames but to remember which page was in each frame. Since the frame contents are not modified when a frame is written to the disk, the old page can be reused directly from the free-frame pool if it is needed before that frame is reused. No I/O is needed in this case. When a page fault occurs, we first check whether the desired page is in the free-frame pool. If it is not, we must select a free frame and read the desired page into it.

This technique is used in the VAX/VMS system along with a FIFO replacement algorithm. When the FIFO replacement algorithm mistakenly replaces a page that is still in active use, that page is quickly retrieved from the free-frame pool, and no I/O is necessary. The free-frame buffer provides protection against the relatively poor, but simple, FIFO replacement algorithm. This method was necessary because the early versions of VAX did not implement the reference bit correctly.

Some versions of the UNIX system use this method in conjunction with the second-chance algorithm. It can be a useful augmentation to any page-replacement algorithm, to reduce the penalty incurred if the wrong victim page is selected.

9.4.8 Applications and Page Replacement

In certain cases, applications accessing data through the operating system's virtual memory perform worse than if the operating system provided no buffering at all. A typical example is a database, which provides its own memory management and I/O buffering. Applications like this understand their memory use and disk use better than does an operating system that is implementing algorithms for general-purpose use. If the operating system is buffering I/O, and the application is doing so as well, then twice the memory is being used for a set of I/O operations.
In another example, data warehouses frequently perform massive sequential disk reads, followed by computations and writes. The LRU algorithm would be removing old pages and preserving new ones, while the application would more likely be reading older pages than newer ones (as it starts its sequential reads again). Here, MFU would actually be more efficient than LRU.

Because of such problems, some operating systems give special programs the ability to use a disk partition as a large sequential array of logical blocks, without any file-system data structures. This array is sometimes called the raw disk, and I/O to this array is termed raw I/O. Raw I/O bypasses all the file-system services, such as file I/O, demand paging, file locking, prefetching, space allocation, file names, and directories. Note that although certain applications are more efficient when implementing their own special-purpose storage services on a raw partition, most applications perform better when they use the regular file-system services.

9.5 Allocation of Frames

We turn next to the issue of allocation. How do we allocate the fixed amount of free memory among the various processes? If we have 93 free frames and two processes, how many frames does each process get?

The simplest case is the single-user system. Consider a single-user system with 128 KB of memory composed of pages 1 KB in size. This system has 128 frames. The operating system may take 35 KB, leaving 93 frames for the user process. Under pure demand paging, all 93 frames would initially be put on the free-frame list. When a user process started execution, it would generate a sequence of page faults. The first 93 page faults would all get free frames from the free-frame list. When the free-frame list was exhausted, a page-replacement algorithm would be used to select one of the 93 in-memory pages to be replaced with the 94th, and so on. When the process terminated, the 93 frames would once again be placed on the free-frame list.

There are many variations on this simple strategy. We can require that the operating system allocate all its buffer and table space from the free-frame list. When this space is not in use by the operating system, it can be used to support user paging. We can try to keep three free frames reserved on the free-frame list at all times. Thus, when a page fault occurs, there is a free frame available to page into. While the page swap is taking place, a replacement can be selected, which is then written to the disk as the user process continues to execute. Other variants are also possible, but the basic strategy is clear: the user process is allocated any free frame.

9.5.1 Minimum Number of Frames

Our strategies for the allocation of frames are constrained in various ways. We cannot, for example, allocate more than the total number of available frames (unless there is page sharing). We must also allocate at least a minimum number of frames. Here, we look more closely at the latter requirement.

One reason for allocating at least a minimum number of frames involves performance. Obviously, as the number of frames allocated to each process decreases, the page-fault rate increases, slowing process execution. In addition, remember that, when a page fault occurs before an executing instruction is complete, the instruction must be restarted. Consequently, we must have enough frames to hold all the different pages that any single instruction can reference.
For example, consider a machine in which all memory-reference instructions may reference only one memory address. In this case, we need at least one frame for the instruction and one frame for the memory reference. In addition, if one-level indirect addressing is allowed (for example, a load instruction on page 16 can refer to an address on page 0, which is an indirect reference to page 23), then paging requires at least three frames per process. Think about what might happen if a process had only two frames.

The minimum number of frames is defined by the computer architecture. For example, the move instruction for the PDP-11 includes more than one word for some addressing modes, and thus the instruction itself may straddle two pages. In addition, each of its two operands may be indirect references, for a total of six frames. Another example is the IBM 370 MVC instruction. Since the instruction is from storage location to storage location, it takes 6 bytes and can straddle two pages. The block of characters to move and the area to which it is to be moved can each also straddle two pages. This situation would require six frames. The worst case occurs when the MVC instruction is the operand of an EXECUTE instruction that straddles a page boundary; in this case, we need eight frames.

The worst-case scenario occurs in computer architectures that allow multiple levels of indirection (for example, each 16-bit word could contain a 15-bit address plus a 1-bit indirect indicator). Theoretically, a simple load instruction could reference an indirect address that could reference an indirect address (on another page) that could also reference an indirect address (on yet another page), and so on, until every page in virtual memory had been touched. Thus, in the worst case, the entire virtual memory must be in physical memory. To overcome this difficulty, we must place a limit on the levels of indirection (for example, limit an instruction to at most 16 levels of indirection). When the first indirection occurs, a counter is set to 16; the counter is then decremented for each successive indirection for this instruction. If the counter is decremented to 0, a trap occurs (excessive indirection). This limitation reduces the maximum number of memory references per instruction to 17, requiring the same number of frames.

Whereas the minimum number of frames per process is defined by the architecture, the maximum number is defined by the amount of available physical memory. In between, we are still left with significant choice in frame allocation.

9.5.2 Allocation Algorithms

The easiest way to split m frames among n processes is to give everyone an equal share, m/n frames. For instance, if there are 93 frames and five processes, each process will get 18 frames. The three leftover frames can be used as a free-frame buffer pool. This scheme is called equal allocation.

An alternative is to recognize that different processes will need different amounts of memory. Consider a system with a 1-KB frame size. If a small student process of 10 KB and an interactive database of 127 KB are the only two processes running in a system with 62 free frames, it does not make much sense to give each process 31 frames. The student process does not need more than 10 frames, so the other 21 are, strictly speaking, wasted.

To solve this problem, we can use proportional allocation, in which we allocate available memory to each process according to its size. Let the size of the virtual memory for process pi be si, and define

S = Σ si.

Then, if the total number of available frames is m, we allocate ai frames to process pi, where ai is approximately

ai = (si / S) × m.

Of course, we must adjust each ai to be an integer that is greater than the minimum number of frames required by the instruction set, with a sum not exceeding m. With proportional allocation, we would split 62 frames between two processes, one of 10 pages and one of 127 pages, by allocating 4 frames and 57 frames, respectively, since 10/137 × 62 ≈ 4 and 127/137 × 62 ≈ 57. In this way, both processes share the available frames according to their "needs," rather than equally.
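A sketch of this computation (ours, not from the text), using the floor of each proportional share and the 62-frame example above; a real allocator would also enforce the per-process minimum and distribute any leftover frames:

import java.util.Arrays;

public class ProportionalAllocation {
    // a_i = floor(s_i / S * m): each process's share of m frames,
    // proportional to its virtual-memory size s_i.
    static int[] allocate(int[] sizes, int m) {
        long S = Arrays.stream(sizes).sum();
        int[] frames = new int[sizes.length];
        for (int i = 0; i < sizes.length; i++)
            frames[i] = (int) ((long) sizes[i] * m / S);
        return frames;
    }

    public static void main(String[] args) {
        // A 10-page process and a 127-page process sharing 62 frames.
        System.out.println(Arrays.toString(allocate(new int[]{10, 127}, 62)));
        // prints [4, 57]
    }
}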
In both equal and proportional allocation, of course, the allocation may vary according to the multiprogramming level. If the multiprogramming level is increased, each process will lose some frames to provide the memory needed for the new process. Conversely, if the multiprogramming level decreases, the frames that were allocated to the departed process can be spread over the remaining processes.

Notice that, with either equal or proportional allocation, a high-priority process is treated the same as a low-priority process. By its definition, however, we may want to give the high-priority process more memory to speed its execution, to the detriment of low-priority processes. One solution is to use a proportional allocation scheme wherein the ratio of frames depends not on the relative sizes of processes but rather on the priorities of processes or on a combination of size and priority.

9.5.3 Global versus Local Allocation

Another important factor in the way frames are allocated to the various processes is page replacement. With multiple processes competing for frames, we can classify page-replacement algorithms into two broad categories: global replacement and local replacement. Global replacement allows a process to select a replacement frame from the set of all frames, even if that frame is currently allocated to some other process; that is, one process can take a frame from another. Local replacement requires that each process select from only its own set of allocated frames.

For example, consider an allocation scheme wherein we allow high-priority processes to select frames from low-priority processes for replacement. A process can select a replacement from among its own frames or the frames of any lower-priority process. This approach allows a high-priority process to increase its frame allocation at the expense of a low-priority process.

With a local replacement strategy, the number of frames allocated to a process does not change. With global replacement, a process may happen to select only frames allocated to other processes, thus increasing the number of frames allocated to it (assuming that other processes do not choose its frames for replacement).

One problem with a global replacement algorithm is that a process cannot control its own page-fault rate. The set of pages in memory for a process depends not only on the paging behavior of that process but also on the paging behavior of other processes. Therefore, the same process may perform quite differently (for example, taking 0.5 seconds for one execution and 10.3 seconds for the next execution) because of totally external circumstances. Such is not the case with a local replacement algorithm. Under local replacement, the set of pages in memory for a process is affected by the paging behavior of only that process. Local replacement might hinder a process, however, by not making available to it other, less used pages of memory. Thus, global replacement generally results in greater system throughput and is therefore the more common method.
9.5.4 Non-Uniform Memory Access

Thus far in our coverage of virtual memory, we have assumed that all main memory is created equal—or at least that it is accessed equally. On many computer systems, that is not the case. Often, in systems with multiple CPUs (Section 1.3.2), a given CPU can access some sections of main memory faster than it can access others. These performance differences are caused by how CPUs and memory are interconnected in the system. Frequently, such a system is made up of several system boards, each containing multiple CPUs and some memory. The system boards are interconnected in various ways, ranging from system buses to high-speed network connections like InfiniBand. As you might expect, the CPUs on a particular board can access the memory on that board with less delay than they can access memory on other boards in the system. Systems in which memory access times vary significantly are known collectively as non-uniform memory access (NUMA) systems, and without exception, they are slower than systems in which memory and CPUs are located on the same motherboard.

Managing which page frames are stored at which locations can significantly affect performance in NUMA systems. If we treat memory as uniform in such a system, CPUs may wait significantly longer for memory access than if we modify memory-allocation algorithms to take NUMA into account. Similar changes must be made to the scheduling system. The goal of these changes is to have memory frames allocated "as close as possible" to the CPU on which the process is running. The definition of "close" is "with minimum latency," which typically means on the same system board as the CPU. The algorithmic changes consist of having the scheduler track the last CPU on which each process ran. If the scheduler tries to schedule each process onto its previous CPU, and the memory-management system tries to allocate frames for the process close to the CPU on which it is being scheduled, then improved cache hits and decreased memory access times will result.

The picture is more complicated once threads are added. For example, a process with many running threads may end up with those threads scheduled on many different system boards. How is the memory to be allocated in this case? Solaris solves the problem by creating an lgroup entity in the kernel. Each lgroup gathers together close CPUs and memory. In fact, there is a hierarchy of lgroups based on the amount of latency between the groups. Solaris tries to schedule all threads of a process and allocate all memory of a process within an lgroup. If that is not possible, it picks nearby lgroups for the rest of the resources needed. In this manner, overall memory latency is minimized, and CPU cache hit rates are maximized.

9.6 Thrashing

If the number of frames allocated to a low-priority process falls below the minimum number required by the computer architecture, we must suspend that process's execution. We should then page out its remaining pages, freeing all its allocated frames. This provision introduces a swap-in, swap-out level of intermediate CPU scheduling.

In fact, look at any process that does not have "enough" frames. If the process does not have the number of frames it needs to support pages in active use, it will quickly page-fault.
At this point, it must replace some page. However, since all its pages are in active use, it must replace a page that will be needed again right away. Consequently, it quickly faults again, and again, and again, replacing pages that it must bring back in immediately. This high paging activity is called thrashing. A process is thrashing if it is spending more time paging than executing.

9.6.1 Cause of Thrashing

Thrashing results in severe performance problems. Consider the following scenario, which is based on the actual behavior of early paging systems. The operating system monitors CPU utilization. If CPU utilization is too low, we increase the degree of multiprogramming by introducing a new process to the system. A global page-replacement algorithm is used; it replaces pages without regard to the process to which they belong. Now suppose that a process enters a new phase in its execution and needs more frames. It starts faulting and taking frames away from other processes. These processes need those pages, however, and so they also fault, taking frames from other processes. These faulting processes must use the paging device to swap pages in and out. As they queue up for the paging device, the ready queue empties. As processes wait for the paging device, CPU utilization decreases.

The CPU scheduler sees the decreasing CPU utilization and increases the degree of multiprogramming as a result. The new process tries to get started by taking frames from running processes, causing more page faults and a longer queue for the paging device. As a result, CPU utilization drops even further, and the CPU scheduler tries to increase the degree of multiprogramming even more. Thrashing has occurred, and system throughput plunges. The page-fault rate increases tremendously. As a result, the effective memory-access time increases. No work is getting done, because the processes are spending all their time paging.

Figure 9.18 Thrashing.

This phenomenon is illustrated in Figure 9.18, in which CPU utilization is plotted against the degree of multiprogramming. As the degree of multiprogramming increases, CPU utilization also increases, although more slowly, until a maximum is reached. If the degree of multiprogramming is increased even further, thrashing sets in, and CPU utilization drops sharply. At this point, to increase CPU utilization and stop thrashing, we must decrease the degree of multiprogramming.

We can limit the effects of thrashing by using a local replacement algorithm (or priority replacement algorithm). With local replacement, if one process starts thrashing, it cannot steal frames from another process and cause the latter to thrash as well. However, the problem is not entirely solved. If processes are thrashing, they will be in the queue for the paging device most of the time. The average service time for a page fault will increase because of the longer average queue for the paging device. Thus, the effective access time will increase even for a process that is not thrashing.

To prevent thrashing, we must provide a process with as many frames as it needs. But how do we know how many frames it "needs"? There are several techniques. The working-set strategy (Section 9.6.2) starts by looking at how many frames a process is actually using. This approach defines the locality model of process execution. The locality model states that, as a process executes, it moves from locality to locality.
A locality is a set of pages that are actively used together (Figure 9.19). A program is generally composed of several different localities, which may overlap. For example, when a function is called, it defines a new locality. In this locality, memory references are made to the instructions of the function call, its local variables, and a subset of the global variables. When we exit the function, the process leaves this locality, since the local variables and instructions of the function are no longer in active use. We may return to this locality later.

Figure 9.19 Locality in a memory-reference pattern.

Thus, we see that localities are defined by the program structure and its data structures. The locality model states that all programs will exhibit this basic memory reference structure. Note that the locality model is the unstated principle behind the caching discussions so far in this book. If accesses to any types of data were random rather than patterned, caching would be useless.

Suppose we allocate enough frames to a process to accommodate its current locality. It will fault for the pages in its locality until all these pages are in memory; then, it will not fault again until it changes localities. If we do not allocate enough frames to accommodate the size of the current locality, the process will thrash, since it cannot keep in memory all the pages that it is actively using.

9.6.2 Working-Set Model

As mentioned, the working-set model is based on the assumption of locality. This model uses a parameter, Δ, to define the working-set window. The idea is to examine the most recent Δ page references. The set of pages in the most recent Δ page references is the working set (Figure 9.20). If a page is in active use, it will be in the working set. If it is no longer being used, it will drop from the working set Δ time units after its last reference. Thus, the working set is an approximation of the program's locality.

For example, given the sequence of memory references shown in Figure 9.20, if Δ = 10 memory references, then the working set at time t1 is {1, 2, 5, 6, 7}. By time t2, the working set has changed to {3, 4}.

Figure 9.20 Working-set model.

The accuracy of the working set depends on the selection of Δ. If Δ is too small, it will not encompass the entire locality; if Δ is too large, it may overlap several localities. In the extreme, if Δ is infinite, the working set is the set of pages touched during the process execution.

The most important property of the working set, then, is its size. If we compute the working-set size, WSSi, for each process in the system, we can then consider that

D = Σ WSSi,

where D is the total demand for frames. Each process is actively using the pages in its working set. Thus, process i needs WSSi frames. If the total demand is greater than the total number of available frames (D > m), thrashing will occur, because some processes will not have enough frames.

Once Δ has been selected, use of the working-set model is simple. The operating system monitors the working set of each process and allocates to that working set enough frames to provide it with its working-set size. If there are enough extra frames, another process can be initiated. If the sum of the working-set sizes increases, exceeding the total number of available frames, the operating system selects a process to suspend. The process's pages are written out (swapped), and its frames are reallocated to other processes. The suspended process can be restarted later.

This working-set strategy prevents thrashing while keeping the degree of multiprogramming as high as possible. Thus, it optimizes CPU utilization.
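The working-set definition translates directly into code. Here is a minimal sketch (ours, not from the text) that computes the working set at a given time as the set of distinct pages among the last Δ references; the reference sequence is the one from Figure 9.20:

import java.util.HashSet;
import java.util.Set;

public class WorkingSet {
    // Working set at time t: distinct pages in the last delta references,
    // refs[t - delta + 1 .. t]. (A real OS approximates this with reference bits.)
    static Set<Integer> at(int[] refs, int t, int delta) {
        Set<Integer> ws = new HashSet<>();
        for (int i = Math.max(0, t - delta + 1); i <= t; i++)
            ws.add(refs[i]);
        return ws;
    }

    public static void main(String[] args) {
        int[] refs = {2, 6, 1, 5, 7, 7, 7, 7, 5, 1, 6, 2, 3, 4, 1, 2, 3, 4, 4,
                      4, 3, 4, 3, 4, 4, 4, 1, 3, 2, 3, 4, 4, 4, 3, 4, 4, 4};
        System.out.println(at(refs, 9, 10));  // near t1: [1, 2, 5, 6, 7]
        System.out.println(at(refs, 25, 10)); // near t2: [3, 4]
    }
}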
The difficulty with the working-set model is keeping track of the working set. The working-set window is a moving window. At each memory reference, a new reference appears at one end and the oldest reference drops off the other end. A page is in the working set if it is referenced anywhere in the working-set window.

We can approximate the working-set model with a fixed-interval timer interrupt and a reference bit. For example, assume that Δ equals 10,000 references and that we can cause a timer interrupt every 5,000 references. When we get a timer interrupt, we copy and clear the reference-bit values for each page. Thus, if a page fault occurs, we can examine the current reference bit and two in-memory bits to determine whether a page was used within the last 10,000 to 15,000 references. If it was used, at least one of these bits will be on. If it has not been used, these bits will be off. Those pages with at least one bit on will be considered to be in the working set. Note that this arrangement is not entirely accurate, because we cannot tell where, within an interval of 5,000, a reference occurred. We can reduce the uncertainty by increasing the number of history bits and the frequency of interrupts (for example, 10 bits and interrupts every 1,000 references). However, the cost to service these more frequent interrupts will be correspondingly higher.

9.6.3 Page-Fault Frequency

The working-set model is successful, and knowledge of the working set can be useful for prepaging (Section 9.9.1), but it seems a clumsy way to control thrashing. A strategy that uses the page-fault frequency (PFF) takes a more direct approach.

The specific problem is how to prevent thrashing. Thrashing has a high page-fault rate. Thus, we want to control the page-fault rate. When it is too high, we know that the process needs more frames. Conversely, if the page-fault rate is too low, then the process may have too many frames. We can establish upper and lower bounds on the desired page-fault rate (Figure 9.21). If the actual page-fault rate exceeds the upper limit, we allocate the process another frame; if the page-fault rate falls below the lower limit, we remove a frame from the process. Thus, we can directly measure and control the page-fault rate to prevent thrashing.

Figure 9.21 Page-fault frequency.

WORKING SETS AND PAGE FAULT RATES

There is a direct relationship between the working set of a process and its page-fault rate. Typically, as shown in Figure 9.20, the working set of a process changes over time as references to data and code sections move from one locality to another. Assuming there is sufficient memory to store the working set of a process (that is, the process is not thrashing), the page-fault rate of the process will transition between peaks and valleys over time. This general behavior is shown in Figure 9.22.

Figure 9.22 Page-fault rate over time.

A peak in the page-fault rate occurs when we begin demand-paging a new locality. However, once the working set of this new locality is in memory, the page-fault rate falls. When the process moves to a new working set, the page-fault rate rises toward a peak once again, returning to a lower rate once the new working set is loaded into memory. The span of time between the start of one peak and the start of the next peak represents the transition from one working set to another.

As with the working-set strategy, we may have to suspend a process. If the page-fault rate increases and no free frames are available, we must select some process and suspend it. The freed frames are then distributed to processes with high page-fault rates.
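The PFF policy is essentially a feedback controller on the measured fault rate. A minimal sketch (ours, not from the text; the bounds and the measurement interval are arbitrary example values):

public class PffController {
    static final double UPPER = 0.05;  // faults per reference: too high
    static final double LOWER = 0.005; // faults per reference: too low

    // Called at the end of each measurement interval; returns the
    // adjustment to the process's frame allocation (+1, 0, or -1).
    static int adjust(int faults, int references) {
        double rate = (double) faults / references;
        if (rate > UPPER) return +1;   // thrashing risk: give another frame
        if (rate < LOWER) return -1;   // too many frames: take one away
        return 0;                      // within bounds: leave allocation alone
    }

    public static void main(String[] args) {
        System.out.println(adjust(600, 10_000)); // +1
        System.out.println(adjust(20, 10_000));  // -1
    }
}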
WORKING SETS AND PAGE FAULT RATES

There is a direct relationship between the working set of a process and its page-fault rate. Typically, as shown in Figure 9.20, the working set of a process changes over time as references to data and code sections move from one locality to another. Assuming there is sufficient memory to store the working set of a process (that is, the process is not thrashing), the page-fault rate of the process will transition between peaks and valleys over time. This general behavior is shown in Figure 9.22.

Figure 9.22 Page-fault rate over time. (Working set and page-fault rate plotted against time; the fault rate rises and falls between 0 and 1 as localities change.)

A peak in the page-fault rate occurs when we begin demand-paging a new locality. However, once the working set of this new locality is in memory, the page-fault rate falls. When the process moves to a new working set, the page-fault rate rises toward a peak once again, returning to a lower rate once the new working set is loaded into memory. The span of time between the start of one peak and the start of the next peak represents the transition from one working set to another.

As with the working-set strategy, we may have to suspend a process. If the page-fault rate increases and no free frames are available, we must select some process and suspend it. The freed frames are then distributed to processes with high page-fault rates.

9.7 Memory-Mapped Files

Consider a sequential read of a file on disk using the standard system calls open(), read(), and write(). Each file access requires a system call and a disk access. Alternatively, we can use the virtual memory techniques discussed so far to treat file I/O as routine memory accesses. This approach, known as memory-mapping a file, allows a part of the virtual address space to be logically associated with the file.

9.7.1 Basic Mechanism

Memory-mapping a file is accomplished by mapping a disk block to a page (or pages) in memory. Initial access to the file proceeds through ordinary demand paging, resulting in a page fault. A page-sized portion of the file is then read from the file system into a physical page (some systems may opt to read in more than a page-sized chunk of memory at a time). Subsequent reads and writes to the file are handled as routine memory accesses. This practice simplifies file access and usage by allowing the system to manipulate files through memory rather than incurring the overhead of the read() and write() system calls.

Note that writes to the file mapped in memory are not necessarily immediate (synchronous) writes to the file on disk. Some systems may choose to update the physical file when the operating system periodically checks whether the page in memory has been modified. When the file is closed, all the memory-mapped data are written back to disk and removed from the virtual memory of the process.

Some operating systems provide memory mapping only through a specific system call and use the standard system calls to perform all other file I/O. However, some systems choose to memory-map a file regardless of whether the file was specified as memory-mapped. Let's take Solaris as an example. If a file is specified as memory-mapped (using the mmap() system call), Solaris maps the file into the address space of the process. If a file is opened and accessed using ordinary system calls, such as open(), read(), and write(), Solaris still memory-maps the file; however, the file is mapped to the kernel address space. Regardless of how the file is opened, then, Solaris treats all file I/O as memory-mapped, allowing file access to take place via the efficient memory subsystem.

Multiple processes may be allowed to map the same file concurrently, to permit sharing of data. Writes by any of the processes modify the data in virtual memory, and the changes can be seen by all other processes that map the same section of the file. Given our earlier discussions of virtual memory, it should be clear how the sharing of memory-mapped sections of memory is implemented: the virtual memory map of each sharing process points to the same page of physical memory, the page that holds a copy of the disk block.
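The following minimal Java sketch hints at this sharing within a single program. It creates two READ_WRITE mappings of the same file region; on most systems, both mappings resolve to the same physical page, so a write through one is visible through the other. (The Java specification leaves this visibility system-dependent, and the file name shared.dat is hypothetical.)

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedMappingDemo {
    public static void main(String[] args) throws Exception {
        RandomAccessFile f1 = new RandomAccessFile("shared.dat", "rw");
        RandomAccessFile f2 = new RandomAccessFile("shared.dat", "rw");
        f1.setLength(4096);  // one page, assuming a 4-KB page size

        MappedByteBuffer a =
            f1.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        MappedByteBuffer b =
            f2.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);

        a.put(0, (byte) 42);           // write through the first mapping
        System.out.println(b.get(0));  // typically prints 42
        f1.close();
        f2.close();
    }
}

Two separate processes mapping the same file behave analogously; that is the case illustrated in Figure 9.23.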
This memory sharing is illustrated in Figure 9.23.

Figure 9.23 Memory-mapped files. (The virtual memories of process A and process B map the pages of a disk file, numbered 1 through 6, to shared pages of physical memory.)

The memory-mapping system calls can also support copy-on-write functionality, allowing processes to share a file in read-only mode but to have their own copies of any data they modify. So that access to the shared data is coordinated, the processes involved might use one of the mechanisms for achieving mutual exclusion described in Chapter 6.

In many ways, the sharing of memory-mapped files is similar to shared memory as described in Section 3.4.1. Not all systems use the same mechanism for both; UNIX and Linux systems, for example, use the mmap() system call for memory mapping and the POSIX-compliant shmget() and shmat() system calls for shared memory. On Windows NT, 2000, and XP systems, however, shared memory is accomplished by memory-mapping files. On these systems, processes can communicate using shared memory by having the communicating processes memory-map the same file into their virtual address spaces. The memory-mapped file serves as the region of shared memory between the communicating processes (Figure 9.24).

Figure 9.24 Shared memory in Windows using memory-mapped I/O. (Process 1 and process 2 each map the same memory-mapped file into a shared-memory region of their address spaces.)

9.7.2 Memory-Mapped Files in Java

Next, we present the facilities in the Java API for memory-mapping files. Memory-mapping a file requires first opening the file and then obtaining the FileChannel for the opened file. Once the FileChannel is obtained, we invoke the map() method of this channel, which maps the file into memory. The map() method returns a MappedByteBuffer, which is a buffer of bytes that is mapped in memory. This is shown in Figure 9.25.

import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class MemoryMapReadOnly {
    // Assume the page size is 4 KB
    public static final int PAGE_SIZE = 4096;

    public static void main(String args[]) throws IOException {
        RandomAccessFile inFile = new RandomAccessFile(args[0], "r");
        FileChannel in = inFile.getChannel();
        MappedByteBuffer mappedBuffer =
            in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());

        long numPages = in.size() / (long) PAGE_SIZE;
        if (in.size() % PAGE_SIZE > 0)
            ++numPages;

        // we will "touch" the first byte of every page
        int position = 0;
        for (long i = 0; i < numPages; i++) {
            byte item = mappedBuffer.get(position);
            position += PAGE_SIZE;
        }

        in.close();
        inFile.close();
    }
}

Figure 9.25 Memory-mapping a file in Java.

The API for the map() method is as follows:

map(mode, position, size)

The mode refers to how the mapping occurs. Figure 9.25 maps the file as READ_ONLY. Files can also be mapped as READ_WRITE and PRIVATE. If the file is mapped as PRIVATE, memory mapping takes place via the copy-on-write technique described in Section 9.3: any changes to the mapped data are made only to the MappedByteBuffer instance performing the mapping, not to the underlying file. The position refers to the byte position in the file where the mapping is to begin, and the size indicates how many bytes are to be mapped from the starting position. Figure 9.25 maps the entire file, from position 0 to the size of the FileChannel, in.size(). It is also possible to map only a portion of a file or to obtain several different mappings of the same file.
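As a minimal sketch of the position and size parameters, the following fragment maps only the second page of a file. The file name data.bin is hypothetical, and a 4-KB page size and a file at least two pages long are assumed.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PartialMapping {
    public static final int PAGE_SIZE = 4096;  // assumed page size

    public static void main(String[] args) throws Exception {
        RandomAccessFile inFile = new RandomAccessFile("data.bin", "r");
        FileChannel in = inFile.getChannel();

        // Map PAGE_SIZE bytes starting at byte offset PAGE_SIZE, that is,
        // only the second page of the file.
        MappedByteBuffer secondPage =
            in.map(FileChannel.MapMode.READ_ONLY, PAGE_SIZE, PAGE_SIZE);

        // Offset 0 of the buffer corresponds to byte PAGE_SIZE of the file.
        byte first = secondPage.get(0);
        System.out.println(first);

        in.close();
        inFile.close();
    }
}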
In Figure 9.25, we assume that the page size used to map the file is 4,096 bytes. (Many operating systems provide a system call to determine the page size; however, this is not a feature of the Java API.) We then determine the number of pages necessary to map the file in memory and access the first byte of every page using the get() method of the MappedByteBuffer class. This has the effect of demand-paging every page of the file into memory (on systems supporting that memory model). The API also provides the load() method of the MappedByteBuffer class, which loads the entire file into memory using demand paging.

Many operating systems provide a system call that releases (or unmaps) the mapping of a file. Such an action releases the physical pages that mapped the file in memory. The Java API provides no such feature; a mapping exists until the MappedByteBuffer object is garbage-collected.

9.7.3 Memory-Mapped I/O

In the case of I/O, as mentioned in Section 1.2.1, each I/O controller includes registers to hold commands and the data being transferred. Usually, special I/O instructions allow data transfers between these registers and system memory. To allow more convenient access to I/O devices, many computer architectures provide memory-mapped I/O. In this case, ranges of memory addresses are set aside and are mapped to the device registers. Reads and writes to these memory addresses cause the data to be transferred to and from the device registers. This method is appropriate for devices that have fast response times, such as video controllers. In the IBM PC, each location on the screen is mapped to a memory location. Displaying text on the screen is almost as easy as writing the text into the appropriate memory-mapped locations.

Memory-mapped I/O is also convenient for other devices, such as the serial and parallel ports used to connect modems and printers to a computer. The CPU transfers data through these kinds of devices by reading and writing a few device registers, called an I/O port. To send out a long string of bytes through a memory-mapped serial port, the CPU writes one data byte to the data register and sets a bit in the control register to signal that the byte is available. The device takes the data byte and then clears the bit in the control register to signal that it is ready for the next byte. Then the CPU can transfer the next byte. If the CPU uses polling to watch the control bit, constantly looping to see whether the device is ready, this method of operation is called programmed I/O (PIO). If the CPU does not poll the control bit but instead receives an interrupt when the device is ready for the next byte, the data transfer is said to be interrupt driven.

9.8 Allocating Kernel Memory

When a process running in user mode requests additional memory, pages are allocated from the list of free page frames maintained by the kernel. This list is typically populated using a page-replacement algorithm such as those discussed in Section 9.4 and most likely contains free pages scattered throughout physical memory, as explained earlier. Remem