MySQL High Availability


MySQL High Availability, Second Edition
by Charles Bell, Mats Kindahl, and Lars Thalmann

Copyright © 2014 Charles Bell, Mats Kindahl, Lars Thalmann. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Andy Oram
Production Editor: Nicole Shelby
Copyeditor: Jasmine Kwityn
Proofreader: Linley Dolby
Indexer: Lucie Haskins
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest

June 2010: First Edition
April 2014: Second Edition

Revision History for the Second Edition:
2014-04-09: First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449339586 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. MySQL High Availability, the image of an American robin, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-33958-6
[LSI]

Table of Contents

Foreword for the Second Edition
Foreword for the First Edition
Preface

Part I. High Availability and Scalability

1. Introduction
    What’s This Replication Stuff, Anyway?
    So, Backups Are Not Needed Then?
    What’s With All the Monitoring?
    Is There Anything Else I Can Read?
    Conclusion

2. MySQL Replicant Library
    Basic Classes and Functions
    Supporting Different Operating Systems
    Servers
    Server Roles
    Conclusion

3. MySQL Replication Fundamentals
    Basic Steps in Replication
    Configuring the Master
    Configuring the Slave
    Connecting the Master and Slave
    A Brief Introduction to the Binary Log
    What’s Recorded in the Binary Log
    Watching Replication in Action
    The Binary Log’s Structure and Content
    Adding Slaves
    Cloning the Master
    Cloning a Slave
    Scripting the Clone Operation
    Performing Common Tasks with Replication
    Reporting
    Conclusion

4. The Binary Log
    Structure of the Binary Log
    Binlog Event Structure
    Event Checksums
    Logging Statements
    Logging Data Manipulation Language Statements
    Logging Data Definition Language Statements
    Logging Queries
    LOAD DATA INFILE Statements
    Binary Log Filters
    Triggers, Events, and Stored Routines
    Stored Procedures
    Stored Functions
    Events
    Special Constructions
    Nontransactional Changes and Error Handling
    Logging Transactions
    Transaction Cache
    Distributed Transaction Processing Using XA
    Binary Log Group Commit
    Row-Based Replication
    Enabling Row-Based Replication
    Using Mixed Mode
    Binary Log Management
    The Binary Log and Crash Safety
    Binlog File Rotation
    Incidents
    Purging the Binlog File
    The mysqlbinlog Utility
    Basic Usage
    Interpreting Events
    Binary Log Options and Variables
    Options for Row-Based Replication
    Conclusion

5. Replication for High Availability
    Redundancy
    Planning
    Slave Failures
    Master Failures
    Relay Failures
    Disaster Recovery
    Procedures
    Hot Standby
    Dual Masters
    Slave Promotion
    Circular Replication
    Conclusion

6. MySQL Replication for Scale-Out
    Scaling Out Reads, Not Writes
    The Value of Asynchronous Replication
    Managing the Replication Topology
    Application-Level Load Balancing
    Hierarchical Replication
    Setting Up a Relay Server
    Adding a Relay in Python
    Specialized Slaves
    Filtering Replication Events
    Using Filtering to Partition Events to Slaves
    Managing Consistency of Data
    Consistency in a Nonhierarchical Deployment
    Consistency in a Hierarchical Deployment
    Conclusion

7. Data Sharding
    What Is Sharding?
    Why Should You Shard?
    Limitations of Sharding
    Elements of a Sharding Solution
    High-Level Sharding Architecture
    Partitioning the Data
    Shard Allocation
    Mapping the Sharding Key
    Sharding Scheme
    Shard Mapping Functions
    Processing Queries and Dispatching Transactions
    Handling Transactions
    Dispatching Queries
    Shard Management
    Moving a Shard to a Different Node
    Splitting Shards
    Conclusion

8. Replication Deep Dive
    Replication Architecture Basics
    The Structure of the Relay Log
    The Replication Threads
    Starting and Stopping the Slave Threads
    Running Replication over the Internet
    Setting Up Secure Replication Using Built-in Support
    Setting Up Secure Replication Using Stunnel
    Finer-Grained Control Over Replication
    Information About Replication Status
    Options for Handling Broken Connections
    How the Slave Processes Events
    Housekeeping in the I/O Thread
    SQL Thread Processing
    Semisynchronous Replication
    Configuring Semisynchronous Replication
    Monitoring Semisynchronous Replication
    Global Transaction Identifiers
    Setting Up Replication Using GTIDs
    Failover Using GTIDs
    Slave Promotion Using GTIDs
    Replication of GTIDs
    Slave Safety and Recovery
    Syncing, Transactions, and Problems with Database Crashes
    Transactional Replication
    Rules for Protecting Nontransactional Statements
    Multisource Replication
    Details of Row-Based Replication
    Table_map Events
    The Structure of Row Events
    Execution of Row Event
    Events and Triggers
    Filtering in Row-Based Replication
    Partial Row Replication
    Conclusion

9. MySQL Cluster
    What Is MySQL Cluster?
    Terminology and Components
    How Does MySQL Cluster Differ from MySQL?
    Typical Configuration
    Features of MySQL Cluster
    Local and Global Redundancy
    Log Handling
    Redundancy and Distributed Data
    Architecture of MySQL Cluster
    How Data Is Stored
    Partitioning
    Transaction Management
    Online Operations
    Example Configuration
    Getting Started
    Starting a MySQL Cluster
    Testing the Cluster
    Shutting Down the Cluster
    Achieving High Availability
    System Recovery
    Node Recovery
    Replication
    Achieving High Performance
    Considerations for High Performance
    High Performance Best Practices
    Conclusion

Part II. Monitoring and Managing

10. Getting Started with Monitoring
    Ways of Monitoring
    Benefits of Monitoring
    System Components to Monitor
    Processor
    Memory
    Disk
    Network Subsystem
    Monitoring Solutions
    Linux and Unix Monitoring
    Process Activity
    Memory Usage
    Disk Usage
    Network Activity
    General System Statistics
    Automated Monitoring with cron
    Mac OS X Monitoring
    System Profiler
    Console
    Activity Monitor
    Microsoft Windows Monitoring
    The Windows Experience
    The System Health Report
    The Event Viewer
    The Reliability Monitor
    The Task Manager
    The Performance Monitor
    Monitoring as Preventive Maintenance
    Conclusion

11. Monitoring MySQL
    What Is Performance?
    MySQL Server Monitoring
    How MySQL Communicates Performance
    Performance Monitoring
    SQL Commands
    The mysqladmin Utility
    MySQL Workbench
    Third-Party Tools
    The MySQL Benchmark Suite
    Server Logs
    Performance Schema
    Concepts
    Getting Started
    Using Performance Schema to Diagnose Performance Problems
    MySQL Monitoring Taxonomy
    Database Performance
    Measuring Database Performance
    Best Practices for Database Optimization
    Best Practices for Improving Performance
    Everything Is Slow
    Slow Queries
    Slow Applications
    Slow Replication
    Conclusion

12. Storage Engine Monitoring
    InnoDB
    Using the SHOW ENGINE Command
    Using InnoDB Monitors
    Monitoring Logfiles
    Monitoring the Buffer Pool
    Monitoring Tablespaces
    Using INFORMATION_SCHEMA Tables
    Using PERFORMANCE_SCHEMA Tables
    Other Parameters to Consider
    Troubleshooting Tips for InnoDB
    MyISAM
    Optimizing Disk Storage
    Repairing Your Tables
    Using the MyISAM Utilities
    Storing a Table in Index Order
    Compressing Tables
    Defragmenting Tables
    Monitoring the Key Cache
    Preloading Key Caches
    Using Multiple Key Caches
    Other Parameters to Consider
    Conclusion

13. Replication Monitoring
    Getting Started
    Server Setup
    Inclusive and Exclusive Replication
    Replication Threads
    Monitoring the Master
    Monitoring Commands for the Master
    Master Status Variables
    Monitoring Slaves
    Monitoring Commands for the Slave
    Slave Status Variables
    Replication Monitoring with MySQL Workbench
    Other Items to Consider
    Networking
    Monitor and Manage Slave Lag
    Causes and Cures for Slave Lag
    Working with GTIDs
    Conclusion

14. Replication Troubleshooting
    What Can Go Wrong
    Problems on the Master
    Master Crashed and Memory Tables Are in Use
    Master Crashed and Binary Log Events Are Missing
    Query Runs Fine on the Master but Not on the Slave
    Table Corruption After a Crash
    Binary Log Is Corrupt on the Master
    Killing Long-Running Queries for Nontransactional Tables
    Unsafe Statements
    Problems on the Slave
    Slave Server Crashed and Replication Won’t Start
    Slave Connection Times Out and Reconnects Frequently
    Query Results Are Different on the Slave than on the Master
    Slave Issues Errors when Attempting to Restart with SSL
    Memory Table Data Goes Missing
    Temporary Tables Are Missing After a Slave Crash
    Slave Is Slow and Is Not Synced with the Master
    Data Loss After a Slave Crash
    Table Corruption After a Crash
    Relay Log Is Corrupt on the Slave
    Multiple Errors During Slave Restart
    Consequences of a Failed Transaction on the Slave
    I/O Thread Problems
    SQL Thread Problems: Inconsistencies
    Different Errors on the Slave
    Advanced Replication Problems
    A Change Is Not Replicated Among the Topology
    Circular Replication Issues
    Multimaster Issues
    The HA_ERR_KEY_NOT_FOUND Error
    GTID Problems
    Tools for Troubleshooting Replication
    Best Practices
    Know Your Topology
    Check the Status of All of Your Servers
    Check Your Logs
    Check Your Configuration
    Conduct Orderly Shutdowns
    Conduct Orderly Restarts After a Failure
    Manually Execute Failed Queries
    Don’t Mix Transactional and Nontransactional Tables
    Common Procedures
    Reporting Replication Bugs
    Conclusion

15. Protecting Your Investment
    What Is Information Assurance?
    The Three Practices of Information Assurance
    Why Is Information Assurance Important?
    Information Integrity, Disaster Recovery, and the Role of Backups
    High Availability Versus Disaster Recovery
    Disaster Recovery
    The Importance of Data Recovery
    Backup and Restore
    Backup Tools and OS-Level Solutions
    MySQL Enterprise Backup
    MySQL Utilities Database Export and Import
    The mysqldump Utility
    Physical File Copy
    Logical Volume Manager Snapshots
    XtraBackup
    Comparison of Backup Methods
    Backup and MySQL Replication
    Backup and Recovery with Replication
    PITR
    Automating Backups
    Conclusion

16. MySQL Enterprise Monitor
    Getting Started with MySQL Enterprise Monitor
    Commercial Offerings
    Anatomy of MySQL Enterprise Monitor
    Installation Overview
    MySQL Enterprise Monitor Components
    Dashboard
    Monitoring Agent
    Advisors
    Query Analyzer
    MySQL Production Support
    Using MySQL Enterprise Monitor
    Monitoring
    Query Analyzer
    Further Information
    Conclusion

17. Managing MySQL Replication with MySQL Utilities
    Common MySQL Replication Tasks
    Checking Status
    Stopping Replication
    Adding Slaves
    MySQL Utilities
    Getting Started
    Using the Utilities Without Workbench
    Using the Utilities via Workbench
    General Utilities
    Comparing Databases for Consistency: mysqldbcompare
    Copying Databases: mysqldbcopy
    Exporting Databases: mysqldbexport
    Importing Databases: mysqldbimport
    Discovering Differences: mysqldiff
    Showing Disk Usage: mysqldiskusage
    Checking Table Indexes: mysqlindexcheck
    Searching Metadata: mysqlmetagrep
    Searching for Processes: mysqlprocgrep
    Cloning Servers: mysqlserverclone
    Showing Server Information: mysqlserverinfo
    Cloning Users: mysqluserclone
    Utilities Client: mysqluc
    Replication Utilities
    Setting Up Replication: mysqlreplicate
    Checking Replication Setup: mysqlrplcheck
    Showing Topologies: mysqlrplshow
    High Availability Utilities
    Concepts
    mysqlrpladmin
    mysqlfailover
    Creating Your Own Utilities
    Architecture of MySQL Utilities
    Custom Utility Example
    Conclusion

A. Replication Tips and Tricks
B. A GTID Implementation
Index

Foreword for the Second Edition

In 2011, Pinterest started growing. Some say we grew faster than any other startup to date. In the earliest days, we were up against a new scalability bottleneck every day that could slow down the site or bring it down altogether. We remember having our laptops with us everywhere. We slept with them, we ate with them, we went on vacation with them. We even named them.
We have the sound of the SMS outage alerts imprinted in our brains. When the infrastructure is constantly being pushed to its limits, you can’t help but wish for an easy way out. During our growth, we tried no less than five well-known database technologies that claimed to solve all our problems, but each failed catastrophically. Except MySQL.

The time came around September 2011 to throw all the cards in the air and let them resettle. We re-architected everything around MySQL, Memcache, and Redis with just three engineers. MySQL? Why MySQL? We laid out our biggest concerns with any technology and started asking the same questions for each. Here’s how MySQL shaped up:

• Does it address our storage needs? Yes, we needed mappings, indexes, sorting, and blob storage, all available in MySQL.
• Is it commonly used? Can you hire somebody for it? MySQL is one of the most common database choices in production today. It’s so easy to hire people who have used MySQL that we could walk outside in Palo Alto and yell out for a MySQL engineer and a few would come up. Not kidding.
• Is the community active? Very active. There are great books available and a strong online community.
• How robust is it to failure? Very robust! We’ve never lost any data even in the most dire of situations.
• How well does it scale? By itself, it does not scale beyond a single box. We’d need a sharding solution layered on top. (That’s a whole other discussion!)
• Will you be the biggest user? Nope, not by far. Bigger users included Facebook, Twitter, and Google. You don’t want to be the biggest user of a technology if you can help it. If you are, you’ll trip over new scalability problems that nobody has had a chance to debug yet.
• How mature is it? Maturity became the real differentiator. Maturity to us is a measure of the blood, sweat, and tears that have gone into a program divided by its complexity. MySQL is reasonably complex, but not nearly so compared to some of the magic autoclustering NoSQL solutions available. Additionally, MySQL has had 28 years of the best and the brightest contributing back to it from such companies as Facebook and Google, who use it at massive scale. Of all the technologies we looked at, by our definition of maturity, MySQL was a clear choice.
• Does it have good debugging tools? As a product matures, you naturally get great debugging and profiling tools since people are more likely to have been in a similar sticky situation. You’ll find yourself in trouble at 3 A.M. (multiple times). Being able to root cause an issue and get back to bed is better than rewriting for another technology by 6 A.M.

Based on our survey of 10 or so database technologies, MySQL was the clear choice. MySQL is great, but it kinda drops you off at your destination with no baggage and you have to fend for yourself. It works very well and you can connect to it, but as soon as you start using it and scaling, the questions start flying:

• My query is running slow, now what?
• Should I enable compression? How do I do it?
• What are ways of scaling beyond one box?
• How do I get replication working? How about master-master replication?
• REPLICATION STOPPED! NOW WHAT?!
• What are options for durability (fsync speeds)?
• How big should my buffers be?
• There are a billion fields in my.cnf. What are they? What should they be set to?
• I just accidentally wrote to my slave! How do I prevent that from happening again?
• How do I prevent running an UPDATE with no WHERE clause?
• What debugging and profiling tools should I be using?
• Should I use InnoDB, MyISAM, or one of several other flavors of storage engine?

The online community is helpful for answering specific questions, finding examples, bug fixes, and workarounds, but often lacks a strong cohesive story, and deeper discussions about architecture are few and far between.
We knew how to use MySQL at small scale, but this scale and pace were insane. MySQL High Availability provided insights that allowed us to squeeze more out of MySQL.

One new feature in MySQL 5.6, Global Transaction Identifiers (GTIDs), adds a unique identifier to every transaction in a replication tree. This new feature makes failover and slave promotion far easier. We’ve been waiting for this for a long time and it’s well covered in this new edition.

During our grand re-architecture to a sharded solution, we referred to this book for architectural decisions, such as replication techniques and topologies, data sharding alternatives, monitoring options, tuning, and concerns in the cloud. It gave us a deeper understanding of how MySQL works underneath the hood, which allowed us to make better informed choices around the high level queries, access patterns, and structures we’d be using, as well as iterate on our design afterward. The resulting MySQL architecture still serves Pinterest’s core data needs today.

—Yashwanth Nelapati and Marty Weiner
Pinterest
February 2014

Foreword for the First Edition

A lot of research has been done on replication, but most of the resulting concepts are never put into production. In contrast, MySQL replication is widely deployed but has never been adequately explained. This book changes that. Things are explained here that were previously limited to people willing to read a lot of source code and spend a lot of time—including a few late-night sessions—debugging it in production.

Replication enables you to provide highly available data services while enduring the inevitable failures. There are an amazing number of ways for things to fail, including the loss of a disk, server, or data center. Even when hardware is perfect or fully redundant, people are not. Database tables will be dropped by mistake. Applications will write incorrect data. Occasional failure is assured.
But with reasonable preparation, recovery from failure can also be assured. The keys to survival are redundancy and backups. Replication in MySQL supports both.

But MySQL replication is not limited to supporting failure recovery. It is frequently used to support read scale-out. MySQL can efficiently replicate to a large number of servers. For applications that are read-mostly, this is a cost-effective strategy for supporting a large number of queries on commodity hardware.

And there are other interesting uses for MySQL replication. Online data definition language (DDL) is a very complex feature to implement in a relational database management system. MySQL does not support online DDL, but through the use of replication, you can implement something that is frequently good enough. You can get a lot done with replication if you are willing to be creative.

Replication is one of the features that made MySQL wildly popular. It is also the feature that allows you to convert a popular MySQL prototype into a successful business-critical deployment. Like most of MySQL, replication favors simplicity and ease of use. As a consequence, it is occasionally less than perfect when running in production. This book explains what you need to know to successfully use MySQL replication. It will help you to understand how replication has been implemented, what can go wrong, how to prevent problems, and how to fix them when—despite your best attempts at prevention—they crop up.

MySQL replication is also a work in progress. Change, like failure, is also assured. MySQL is responding to that change, and replication continues to get more efficient, more robust, and more interesting. For instance, row-based replication is new in MySQL 5.1.

While MySQL deployments come in all shapes and sizes, I care most about data services for Internet applications and am excited about the potential to replicate from MySQL to distributed storage systems like HBase and Hadoop.
This will make MySQL better at sharing the data center.

I have been on teams that support important MySQL deployments at Facebook and Google. I’ve encountered many of the problems covered in this book and have had the opportunity and time to learn solutions. The authors of this book are also experts on MySQL replication, and by reading this book you can share their expertise.

—Mark Callaghan

Preface

The authors of this book have been creating parts of MySQL and working with it for many years. Dr. Charles Bell is a senior developer leading the MySQL Utilities team. He has also worked on replication and backup. His interests include all things MySQL, database theory, software engineering, microcontrollers, and three-dimensional printing. Dr. Mats Kindahl is a principal senior software developer currently leading the MySQL High Availability and Scalability team. He is architect and implementor of several MySQL features. Dr. Lars Thalmann is the development director and technical lead of the MySQL Replication, Backup, Connectors, and Utilities teams, and has designed many of the replication and backup features. He has worked on the development of MySQL clustering, replication, and backup technologies.

We wrote this book to fill a gap we noticed among the many books on MySQL. There are many excellent books on MySQL, but few that concentrate on its advanced features and applications, such as high availability, reliability, and maintainability. In this book, you will find all of these topics and more.

We also wanted to make the reading a bit more interesting by including a running narrative about a MySQL professional who encounters common requests made by his boss. In the narrative, you will meet Joel Thomas, who recently decided to take a job working for a company that has just started using MySQL. You will observe Joel as he learns his way around MySQL and tackles some of the toughest problems facing MySQL professionals.
We hope you find this aspect of the book entertaining.

Who This Book Is For

This book is for MySQL professionals. We expect you to have basic knowledge of SQL, MySQL administration, and the operating system you are running. We provide introductory information about replication, disaster recovery, system monitoring, and other key topics of high availability. See Chapter 1 for other books that offer useful background information.

How This Book Is Organized

This book is divided into two parts. Part I encompasses MySQL high availability and scale-out. Because these depend a great deal on replication, a lot of this part focuses on that topic. Part II examines monitoring and performance concerns for building robust data centers.

Part I, High Availability and Scalability

Chapter 1, Introduction, explains how this book can help you and gives you a context for reading it.

Chapter 2, MySQL Replicant Library, introduces a Python library for working with sets of servers that is used throughout the book.

Chapter 3, MySQL Replication Fundamentals, discusses both manual and automated procedures for setting up basic replication.

Chapter 4, The Binary Log, explains the critical file that ties together replication and helps in disaster recovery, troubleshooting, and other administrative tasks.

Chapter 5, Replication for High Availability, shows a number of ways to recover from server failure, including the use of automated scripts.

Chapter 6, MySQL Replication for Scale-Out, shows a number of techniques and topologies for improving the read scalability of large data sets.

Chapter 7, Data Sharding, shows techniques for handling very large databases and/or improving the write scalability of a database through sharding.

Chapter 8, Replication Deep Dive, addresses a number of topics, such as secure data transfer and row-based replication.

Chapter 9, MySQL Cluster, shows how to use this tool to achieve high availability.

Part II, Monitoring and Managing

Chapter 10, Getting Started with Monitoring, presents the main operating system parameters you have to be aware of, and tools for monitoring them.

Chapter 11, Monitoring MySQL, presents several tools for monitoring database activity and performance.

Chapter 12, Storage Engine Monitoring, explains some of the parameters you need to monitor on a more detailed level, focusing on issues specific to MyISAM or InnoDB.

Chapter 13, Replication Monitoring, offers details about how to keep track of what masters and slaves are doing.

Chapter 14, Replication Troubleshooting, shows how to deal with failures and restarts, corruption, and other incidents.

Chapter 15, Protecting Your Investment, explains the use of backups and disaster recovery techniques.

Chapter 16, MySQL Enterprise Monitor, introduces a suite of tools that simplifies many of the tasks presented in earlier chapters.

Chapter 17, Managing MySQL Replication with MySQL Utilities, introduces the MySQL Utilities, which are a new set of tools for managing MySQL Servers.

Appendixes

Appendix A, Replication Tips and Tricks, offers a grab bag of procedures that are useful in certain situations.

Appendix B, A GTID Implementation, shows an implementation for handling failovers with transactions if you are using servers that don’t support GTIDs.

Conventions Used in This Book

The following typographical conventions are used in this book:

Plain text
    Indicates menu titles, table names, options, and buttons.
Italic
    Indicates new terms, database names, URLs, email addresses, filenames, and Unix utilities.
Constant width
    Indicates command-line options, variables and other code elements, the contents of files, and the output from commands.
Constant width bold
    Shows commands or other text that should be typed literally by the user.
Constant width italic
    Shows text that should be replaced with user-supplied values.

This element signifies a tip or suggestion.
Preface | xxiii This element signifies a general note. This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at at http://bit.ly/mysqllaunch. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of ex‐ ample code from this book into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “MySQL High Availability, by Charles Bell, Mats Kindahl, and Lars Thalmann. Copyright 2014 Charles Bell, Mats Kindahl, and Lars Thalmann, 978-1-44933-958-6.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. Safari® Books Online Safari Books Online (www.safaribooksonline.com) is an on- demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business. Technology professionals, software developers, web designers, and business and crea‐ tive professionals use Safari Books Online as their primary resource for research, prob‐ lem solving, learning, and certification training. xxiv | Preface Safari Books Online offers a range of product mixes and pricing programs for organi‐ zations, government agencies, and individuals. 
Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O'Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/mysql_high_availability.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Acknowledgments

The authors would like to thank our technical reviewers of this and the previous edition: Mark Callahan, Morgan Tocker, Sveta Smirnova, Luis Soares, Sheeri Kritzer Cabral, Alfie John, and Colin Charles. Your attention to detail and insightful suggestions were invaluable. We could not have delivered a quality book without your help. We also want to thank our extremely talented colleagues on the MySQL team and in the MySQL community who have provided comments, including Alfranio Correia, Andrei Elkin, Zhen-Xing He, Serge Kozlov, Sven Sandberg, Luis Soares, Rafal Somla, Li-Bing Song, Ingo Strüwing, Dao-Gang Qu, Giuseppe Maxia, and Narayanan Venkateswaran, for their tireless dedication to making MySQL the robust and powerful tool it is today.
We especially would like to thank our MySQL customer support professionals, who help us bridge the gap between our customers' needs and our own desires to improve the product. We would also like to thank the many community members who so selflessly devote time and effort to improve MySQL for everyone.

Finally, and most important, we would like to thank our editor, Andy Oram, who helped us shape this work, for putting up with our sometimes cerebral and sometimes over-the-top enthusiasm for all things MySQL. A most sincere thanks goes out to the entire O'Reilly team and especially our editor for their patience as we struggled to fit so many new topics into what was already a very large book.

Charles would like to thank his loving wife, Annette, for her patience and understanding when he was spending time away from family priorities to work on this book. Charles would also like to thank his many colleagues on the MySQL team at Oracle who contribute their wisdom freely to everyone on a daily basis. Finally, Charles would like to thank all of his brothers and sisters in Christ who both challenge and support him daily.

Mats would like to thank his wife, Lill, and two sons, Jon and Hannes, for their unconditional love and understanding in difficult times. You are the loves of his life and he cannot imagine a life without you. Mats would also like to thank his MySQL colleagues inside and outside Oracle for all the interesting, amusing, and inspiring times together—you are truly some of the sharpest minds in the trade.

Lars would like to thank his amazing girlfriend Claudia; he loves her beyond words. He would also like to thank all of his colleagues, current and past, who have made MySQL such an interesting place to work. In fact, it is not even a place. The distributed nature of the MySQL development team and the open-mindedness of its many dedicated developers are truly extraordinary.
The MySQL community has a special spirit that makes working with MySQL an honorable task. What we have created together is remarkable. It is amazing that it started with such a small group of people and managed to build a product that services so many of the Fortune 500 companies today.

PART I
High Availability and Scalability

One of the key database features that supports both high availability and scalability in an application is replication. Replication is used to create redundancy in the database layer as well as to make copies of the database available for scaling the reads. Part I covers how you can use replication to ensure high availability and how you can scale your system.

CHAPTER 1
Introduction

Joel looked through the classified ads for a new job. His current job was a good one, and the company had been very accommodating to him while he attended college. But it had been several years since he graduated, and he wanted to do more with his career.

"This looks promising," he said, circling an advertisement for a computer science specialist working with MySQL. He had experience with MySQL and certainly met the academic requirements for the job. After reading through several other ads, he decided to call about the MySQL job. After a brief set of cursory questions, the human resources manager granted him an interview in two days' time.

Two days and three interviews later, he was introduced to the company's president and chief executive officer, Robert Summerson, for his final technical interview. He waited while Mr. Summerson paused during the questions and referred to his notes. So far, they were mostly mundane questions about information technology, but Joel knew the hard questions about MySQL were coming next.

Finally, the interviewer said, "I am impressed with your answers, Mr. Thomas. May I call you Joel?"

"Yes, sir," Joel said as he endured another uncomfortable period while the interviewer read over his notes for the third time.
"Tell me what you know about MySQL," Mr. Summerson said before placing his hands on his desk and giving Joel a very penetrating stare. Joel began explaining what he knew about MySQL, tossing in a generous amount of the material he had read the night before. After about 10 minutes, he ran out of things to talk about.

Mr. Summerson waited a couple of minutes, then stood and offered Joel his hand. As Joel rose and shook Mr. Summerson's hand, Summerson said, "That's all I need to hear, Joel. The job is yours."

"Thank you, sir."

Mr. Summerson motioned for Joel to follow him out of his office. "I'll take you back to the HR people so we can get you on the payroll. Can you start two weeks from Monday?"

Joel was elated and couldn't help but smile. "Yes, sir."

"Excellent." Mr. Summerson shook Joel's hand again and said, "I want you to come prepared to evaluate the configuration of our MySQL servers. I want a complete report on their configuration and health."

Joel's elation waned as he drove out of the parking lot. He didn't go home right away. Instead, he drove to the nearest bookstore. "I'm going to need a good book on MySQL," he thought.

So, you have decided to take on a large installation and take care of its operation. Well, you are up for some very interesting—as well as rewarding—times.

Compared to running a small site, supporting a large venture requires planning, foresight, experience, and even more planning. As a database administrator for a large venture, you are required to—or will be required to—do things like the following:

• Provide plans for recovery of business-essential data in the event of a disaster. It is also likely that you will have to execute the procedure at least once.

• Provide plans for handling a large customer/user base and monitoring the load of each node in the site in order to optimize it.

• Plan for rapid scale-out in the event the user base grows rapidly.
For all these cases, it is critical to plan for the events in advance and be prepared to act quickly when necessary.

Because not all applications using big sets of servers are websites, we prefer to use the term deployment—rather than the term site or website—to refer to the servers that you are using to support some kind of application. This could be a website, but could just as well be a customer relationship management (CRM) system or an online game. The book focuses on the database layer of such a system, but there are some examples that demonstrate how the application layer and the database layer integrate.

You need three things to keep a site responsive and available: backups of data, redundancy in the system, and responsiveness. The backups can restore a node to the state it was in before a crash, redundancy allows the site to continue to operate even if one or more of the nodes stops functioning, and the responsiveness makes the system usable in practice.

There are many ways to perform backups, and the method you choose will depend on your needs.[1] Do you need to recover to an exact point in time? In that case, you have to ensure that you have all that is necessary for performing a point-in-time recovery (PITR). Do you want to keep the servers up while making a backup? If so, you need to ensure that you are using some form of backup that does not disturb the running server, such as an online backup.

[1] You are not restricted to using a single backup method; you can just as well use a mix of different methods depending on your needs. For each case, however, you have to make a choice of the most appropriate method to do the backup.

Redundancy is handled by duplicating hardware, keeping several instances running in parallel, and using replication to keep multiple copies of the same data available on several machines. If one of the machines fails, it is possible to switch over to another machine that has a copy of the same data.
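To make the backup-selection step of point-in-time recovery concrete, here is a small sketch (not from this book's library; the catalog and names are purely illustrative): to restore to a target time, you pick the newest backup taken at or before that time, and then replay the binary logs from the backup up to the target.

```python
from datetime import datetime

# Hypothetical backup catalog: backup label -> time the backup was taken.
BACKUPS = {
    "backup-1": datetime(2014, 4, 1, 2, 0),
    "backup-2": datetime(2014, 4, 2, 2, 0),
    "backup-3": datetime(2014, 4, 3, 2, 0),
}

def backup_for_pitr(target):
    """Return the newest backup taken at or before the target time.

    Binary logs from the backup time up to the target are then replayed
    to complete the recovery."""
    candidates = [(taken, name) for name, taken in BACKUPS.items()
                  if taken <= target]
    if not candidates:
        raise ValueError("no backup taken before %s" % target)
    # max() picks the latest backup time among the candidates.
    return max(candidates)[1]
```

A restore for a target of, say, noon on April 2 would start from backup-2 and replay the binary logs from 02:00 to 12:00 that day.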
Together with replication, backup also plays an important role in scaling your system and adding new nodes when needed. If done right, it is even possible to automatically add new slaves at the press of a button, at least figuratively.

What's This Replication Stuff, Anyway?

If you're reading this book, you probably have a pretty good idea of what replication is about. It is nevertheless a good idea to review the concepts and ideas.

Replication is used to clone all changes made on a server—called the master server or just master—to another server, which is called the slave server or just slave. This is normally used to create a faithful copy of the master server, but replication can be used for other purposes as well. The two most common uses of replication are to create a backup of the main server to avoid losing any data if the master crashes and to have a copy of the main server to perform reporting and analysis work without disturbing the rest of the business.

For a small business, this makes a lot of things simpler, but it is possible to do a lot more with replication, including the following:

Support several offices
It is possible to maintain servers at each location and replicate changes to the other offices so that the information is available everywhere. This may be necessary to protect data and also to satisfy legal requirements to keep information about the business available for auditing purposes.

Ensure the business stays operational even if one of the servers goes down
An extra server can be used to handle all the traffic if the original server goes down.

Ensure the business can operate even in the presence of a disaster
Replication can be used to send changes to an alternative data center at a different geographic location.

Protect against mistakes ("oopses")
It is possible to create a delayed slave by connecting a slave to a master such that the slave is always a fixed period—for example, an hour—behind the master. If a mistake is made on the master, it is possible to find the offending statement and remove it before it is executed by the slave.

One of the two most important uses of replication in many modern applications is that of scaling out. Modern applications are typically very read-intensive; they have a high proportion of reads compared to writes. To reduce the load on the master, you can set up a slave with the sole purpose of answering read queries. By connecting a load balancer, it is possible to direct read queries to a suitable slave, while write queries go to the master.

When using replication in a scale-out scenario, it is important to understand that MySQL replication traditionally has been asynchronous[2] in the sense that transactions are committed at the master server first, then replicated to the slave and applied there. This means that the master and slave may not be consistent, and if replication is running continuously, the slave will lag behind the master. The advantage of using asynchronous replication is that it is faster and scales better than synchronous replication, but in cases where it is important to have current data, the asynchrony must be handled to ensure the information is actually up-to-date.

[2] There is an extension called semisynchronous replication as well (see "Semisynchronous Replication" on page 257), but that is a relatively new addition. Until MySQL 5.7.2 DMR, it externalized the transaction before it was replicated, allowing it to be read before it had been replicated and acknowledged; this requires some care when it is used for high availability.

Scaling out reads is, however, not sufficient to scale all applications.
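The read/write split performed by such a load balancer can be sketched in a few lines of Python. This is an illustrative stand-in, not a real load balancer: the host names are made up, and a production router would also have to account for reads that need current data (which, given asynchronous replication, must still go to the master).

```python
import itertools

# Hypothetical host names; in a real deployment these would be
# connections to the master and the read slaves.
MASTER = "master.example.com"
SLAVES = ["slave1.example.com", "slave2.example.com"]

# Cycle through the slaves so read load is spread evenly.
_next_slave = itertools.cycle(SLAVES)

def route(statement):
    """Send read statements to the slaves (round-robin) and
    everything else, including all writes, to the master."""
    keyword = statement.lstrip().split(None, 1)[0].upper()
    if keyword in ("SELECT", "SHOW"):
        return next(_next_slave)
    return MASTER
```

For example, route("SELECT * FROM titles") picks one of the slaves, while route("INSERT INTO titles VALUES (1)") always returns the master. A statement such as SELECT ... FOR UPDATE would be misrouted by this naive keyword check, which is one reason real proxies inspect queries more carefully.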
With growing demands on larger databases and higher write load, it is necessary to scale more than just reads. Managing larger databases and improving performance of large database systems can be accomplished using techniques such as sharding. With sharding, the database is split into manageable chunks, allowing you to increase the size of the database by distributing it over as many servers as you need as well as scaling writes efficiently.

Another important application of replication is ensuring high availability by adding redundancy. The most common technique is to use a dual-master setup (i.e., using replication to keep a pair of masters available all the time, where each master mirrors the other). If one of the masters goes down, the other one is ready to take over immediately.

In addition to the dual-master setup, there are other techniques for achieving high availability that do not involve replication, such as using shared or replicated disks. Although they are not specifically tied to MySQL, these techniques are important tools for ensuring high availability.

So, Backups Are Not Needed Then?

A backup strategy is a critical component of keeping a system available. Regular backups of the servers provide safety against crashes and disasters, which, to some extent, can be handled by replication. Even when replication is used correctly and efficiently, however, there are some things that it cannot handle. You'll need to have a working backup strategy for the following cases:

Protection against mistakes
If a mistake is discovered, potentially a long time after it actually occurred, replication will not help. In this case, it is necessary to roll back the system to a time before the mistake was introduced and fix the problem. This requires a working backup schedule.
Replication provides some protection against mistakes if you are using a time-delayed slave, but if the mistake is discovered after the delay period, the change will have already taken effect on the slave as well. So, in general, it is not possible to protect against mistakes using replication only—backups are required as well.

Creating new servers
When creating new servers—either slaves for scale-out purposes or new masters to act as standbys—it is necessary to make a backup of an existing server and restore that backup image on the new server. This requires a quick and efficient backup method to minimize the downtime and keep the load on the system at an acceptable level.

Legal reasons
In addition to pure business reasons for data preservation, you may have legal requirements to keep data safe, even in the event of a disaster. Not complying with these requirements can pose significant problems to operating the business.

In short, a backup strategy is necessary for operating the business, regardless of any other precautions you have in place to ensure that the data is safe.

What's With All the Monitoring?

Even if you have replication set up correctly, it is necessary to understand the load on your system and to keep a keen eye on any problems that surface. As business requirements shift due to changing customer usage patterns, it is necessary to balance the system to use resources as efficiently as possible and to reduce the risk of losing availability due to sudden changes in resource utilization.

There are a number of different things that you can monitor, measure, and plan for to handle these types of changes. Here are some examples:

• You can add indexes to tables that are frequently read.

• You can rewrite queries or change the structure of databases to speed up execution time.

• If locks are held for a long time, it is an indication that several connections are using the same table.
It might pay off to switch storage engines.

• If some of your scale-out slaves are hot (processing a disproportionate number of queries), the system might require some rebalancing to ensure that all the scale-out slaves are hit evenly.

• To handle sudden changes in resource usage, it is necessary to determine the normal load of each server and understand when the system will start to respond slowly because of a sudden increase in load.

Without monitoring, you have no way of spotting problematic queries, hot slaves, or improperly utilized tables.

Is There Anything Else I Can Read?

There is plenty of literature on using MySQL for various jobs, and also a lot of literature about high-availability systems. Here is a list of books that we strongly recommend if you are going to work with MySQL:

MySQL by Paul DuBois (Addison-Wesley)
This is the reference to MySQL and consists of 1,200 pages (really!) packed with everything you want to know about MySQL (and probably a lot that you don't want to know).

High Performance MySQL, Third Edition by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko (O'Reilly)
This is one of the best books on using MySQL in an enterprise setting. It covers optimizing queries and ensuring your system is responsive and available.

Scalable Internet Architectures by Theo Schlossnagle (Sams Publishing)
Written by one of the most prominent thinkers in the industry, this is a must for anybody working with systems of scale.

The book uses a Python library developed by the authors (called the MySQL Python Replicant) for many of the administrative tasks. MySQL Python Replicant is available on Launchpad.

Conclusion

In the next chapter, we will start with the basics of setting up replication, so get a comfortable chair, open your computer, and we'll get started.

Joel was adjusting his chair when a knock sounded from his door.

"Settling in, Joel?" Mr. Summerson asked.

Joel didn't know what to say.
He had been tasked to set up a replication slave on his first day on the job, and while it took him longer than he had expected, he had yet to hear his boss's feedback about the job. Joel spoke the first thing on his mind: "Yes, sir, I'm still trying to figure out this chair."

"Nice job with the documentation, Joel. I'd like you to write a report explaining what you think we should do to improve our management of the database server."

Joel nodded. "I can do that."

"Good. I'll give you another day to get your office in order. I expect the report by Wednesday, close of business." Before Joel could reply, Mr. Summerson walked away.

Joel sat down and flipped another lever on his chair. He heard a distinct click as the back gave way, forcing him to fling his arms wide. "Whoa!" He looked toward his door as he clumsily picked up his chair, thankful no one saw his impromptu gymnastics. "OK, that lever is now off limits," he said.

CHAPTER 2
MySQL Replicant Library

Joel opened his handy text file full of common commands and tasks and copied them into another editor, changing the values for his current need. It was a series of commands involving a number of tools and utilities. "Ah, this is for the birds!" he thought. "There has got to be a better way."

Frustrated, he flipped open his handy MySQL High Availability tome and examined the table of contents. "Aha! A chapter on a library of replication procedures. Now, this is what I need!"

Automating administrative procedures is critical to handling large deployments, so you might be asking, "Wouldn't it be neat if we could automate the procedures in this book?" In many cases, you'll be happy to hear that you can. This chapter introduces the MySQL Replicant library, a simple library written by the authors for managing replication. We describe the basic principles and classes, and will extend the library with new functionality in the coming chapters. The code is available at Launchpad, where you can find more information and download the source code and documentation.

The Replicant library is based around the idea of creating a model of the connections between servers on a computer (any computer, such as your laptop), like the model in Figure 2-1. The library is designed so you can manage the connections by changing the model. For example, to reconnect a slave to another master, just reconnect the slave in the model, and the library will send the appropriate commands for doing the job.

Figure 2-1. A replication topology reflected in a model

Besides the simple replication topology shown in Figure 2-1, two other basic topologies include tree topologies and dual masters (used for providing high availability). Topologies will be covered in more depth in Chapter 6.

To make the library useful on a wide variety of platforms and for a wide variety of deployments, it has been constructed with the following in mind:

• The servers are likely to run on a variety of operating systems, such as Windows, Linux, and flavors of Unix such as Solaris or Mac OS X. Procedures for starting and stopping servers, as well as the names of configuration files, differ depending on the operating system. The library should therefore support different operating systems and it should be possible to extend it with new operating systems that are not in the library.

• The deployment is likely to consist of servers running different versions of MySQL. For example, while you are upgrading a deployment to use new versions of the server, it will consist of a mixture of old and new versions. The library should be able to handle such a deployment.

• A deployment consists of servers with many different roles, so it should be possible to specify different roles for the servers. In addition, it should be possible to create new roles that weren't anticipated at the beginning. Also, servers should be able to change roles.
• It is necessary to be able to execute SQL queries on each server. This functionality is needed for configuration as well as for extracting information necessary to manage the deployment. This support is also used by other parts of the system to implement their jobs—for example, to implement a slave promotion.

• It is necessary to be able to execute shell commands on each machine. This is needed to perform some administrative tasks that cannot be done using the SQL interface. This support is also used by, for example, the operating system part of the library to manage the servers.

• It should be possible to add and remove options from the server's configuration file.

• The library should support a deployment with multiple servers on a machine. This requires the ability to recognize different configuration files and database files used by different MySQL servers on a single machine.

• There should be a set of utilities for performing common tasks such as setting up replication, but it should also be possible to extend the library with new utility functions that were not anticipated at the beginning.

The interface hides these complexities as much as possible and presents a simple interface in Python. Python was chosen by the authors because it is concise, easy to read, available on all operating systems that run MySQL, and increasingly popular for general-purpose scripting. You can see an example of how to define a topology in Example 2-1.

Example 2-1.
Using the library to construct a topology

from mysql.replicant.server import Server, User
from mysql.replicant.machine import Linux
from mysql.replicant.roles import Master, Final

# The master that we use
MASTER = Server('master', server_id=1,
                sql_user=User("mysql_replicant"),
                ssh_user=User("mats"),
                machine=Linux(),
                host="master.example.com", port=3307,
                socket='/var/run/mysqld/mysqld.sock')

# Slaves that we keep available
SLAVES = [
    Server('slave1', server_id=2,
           sql_user=User("mysql_replicant"),
           ssh_user=User("mats"),
           machine=Linux(),
           host="slave1.example.com", port=3308),
    Server('slave2', server_id=3,
           sql_user=User("mysql_replicant"),
           ssh_user=User("mats"),
           machine=Linux(),
           host="slave2.example.com", port=3309),
    Server('slave3', server_id=4,
           sql_user=User("mysql_replicant"),
           ssh_user=User("mats"),
           machine=Linux(),
           host="slave3.example.com", port=3310),
]

# Create the roles for these servers
master_role = Master(User("repl_user", "xyzzy"))
slave_role = Final(MASTER)

# Imbue the servers with the roles
master_role.imbue(MASTER)
for slave in SLAVES:
    slave_role.imbue(slave)

# Convenience variable of all servers
SERVERS = [MASTER] + SLAVES

The first step is to create a server object containing all the information about how to access the server. This server will be used as master, but this statement does nothing specific to configure it as a master. That is done later when imbuing the server with a role.

When configuring the server, you need to include information on how to connect to the server to send SQL commands to it. For this example, we have a dedicated replicant user that is used to access the server. In this case, there is no password, but one could be set up when constructing the User instance.

You will also need access to the machine where the server is running, to do such things as shut it down or access the configuration file. This line grants access to the user who will connect to the machine.
Because servers are started and stopped in different ways on different kinds of operating systems, you must indicate the operating system the server is running on. In this case, Linux is used for all servers.

This is where information about the host the server is running on goes. host and port are used when connecting to a remote machine, and socket is used when connecting on the same machine. If you will connect only remotely, you can omit socket.

This constructs a list of servers that will be slaves. As for the master, it gives basic connection information but has nothing specific about slaves.

To configure servers, the servers are imbued with roles. This statement constructs a Master role containing the replication user that will be used by slaves to connect to the server.

This specifies the Final slave role, which does not enable a binary log on the server, so the server cannot be promoted to a master later.

These statements imbue all the servers with their roles. The effect is to update the configuration of each server so that it can be used in that role. If necessary (e.g., if the configuration file has to be changed), the statements will also restart the servers.

The previous example imbued all the servers with their roles, and after you have started all the servers, Example 2-2 shows how you can use the library to redirect all slaves to use a new master.

Example 2-2. Using the library to redirect slaves

import my_deployment
from mysql.replicant.commands import change_master

for slave in my_deployment.slaves:
    slave.stop()
    change_master(slave, my_deployment.master)
    slave.start()

We have deliberately kept this example simple, and therefore have omitted some important steps. As the code stands, it stops replication in its tracks and is likely to lose transactions if executed on an active server. You will see how to change masters properly in Chapter 5.

The following sections show the code that makes such applications possible.
To avoid cluttering the code more than necessary, we have removed some error checking and other defensive measures needed to have a stable and safe library. You will find the complete code for the library at Launchpad.

Basic Classes and Functions

The first things you need in order to use the library are some basic definitions for frequently used concepts. We need exceptions to be able to report errors, and we need some simple objects for representing positions and user information.

The complete list of exceptions can be found in the library. All the exceptions inherit from a common base class Error defined in the library, as is customary. The exceptions that you will see in later chapters include the following:

EmptyRowError
This exception is thrown when an attempt is made to select a field from a query that did not return any rows.

NoOptionError
This exception is raised when ConfigManager does not find the option.

SlaveNotRunningError
This exception is raised when the slave is not running but was expected to run.

NotMasterError
This exception is raised when the server is not a master and the operation is therefore illegal.

NotSlaveError
This exception is raised when the server is not a slave and the operation is therefore illegal.

There is also a set of classes for representing some common concepts that will be used later in the book:

Position and GTID
These classes represent a binlog position consisting of a filename and a byte offset within the file, or a global transaction identifier (introduced in MySQL 5.6). A representation method prints out a parsable representation of the binlog positions so that they can be put in secondary storage or if you just want to look at them. To compare and order the positions, the class defines a comparison operator that allows the positions to be ordered.
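The ordering and parsable-representation behavior just described can be sketched as follows. This is an illustrative stand-in, not the library's actual Position class: it orders file-based positions by filename and then byte offset, and refuses to compare a position with anything of a different kind.

```python
from functools import total_ordering

@total_ordering
class Position:
    """Sketch of a file-based binlog position: a filename plus a byte
    offset within the file. Illustrative only, not the library's code."""

    def __init__(self, file, pos):
        self.file, self.pos = file, pos

    def __repr__(self):
        # Parsable representation, so positions can be written to
        # secondary storage and read back.
        return "Position(%r, %r)" % (self.file, self.pos)

    def _key(self, other):
        # Comparing different kinds of positions (e.g., a file-based
        # position with a GTID) raises an exception.
        if not isinstance(other, Position):
            raise TypeError("cannot compare %r with %r" % (self, other))
        return (other.file, other.pos)

    def __eq__(self, other):
        return (self.file, self.pos) == self._key(other)

    def __lt__(self, other):
        return (self.file, self.pos) < self._key(other)
```

With this, Position('master-bin.000001', 4711) sorts before Position('master-bin.000002', 120), because the binlog file number is compared before the offset.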
Note that when global transaction identifiers are not used, positions can be different on different servers, so it is not useful to compare positions from different servers. For that reason, an exception will be thrown if an attempt is made to compare different kinds of positions.

User
This class represents a user with a name and a password. It is used for many types of accounts: a MySQL user account, a shell user account, and the replication user (which we will introduce later).

Supporting Different Operating Systems

To work with different operating systems, you can use a set of classes that abstract away the differences. The idea is to give each class methods for each of the required tasks that are implemented differently by different operating systems. At this time, all we need are methods to stop and start the server:

Machine
This class is the base class for a machine and holds all the information that is common to this kind of machine. It is expected that a machine instance has at least the following members:

Machine.defaults_file
The default location of the my.cnf file on this machine

Machine.start_server(server)
Method to start the server

Machine.stop_server(server)
Method to stop the server

Linux
This class handles a server running on a Linux machine. It uses the init(8) scripts stored under /etc/init.d to start and stop the server.

Solaris
This class handles servers running on a Solaris machine and uses the svcadm(1M) command to start and stop the server.

Servers

The Server class defines all the primitive functions that implement the higher-level functions we want to expose in the interface:

Server(name, ...)
The Server class represents a server in the system; there is one object for each running server in the entire system.
Here are the most important parameters (for a full list, consult the project page on Launchpad):

name
    This is the name of the server, and is used to create values for the pid-file, log-bin, and log-bin-index options. If no name parameter is provided, it will be deduced from the pid-file option, the log-bin option, the log-bin-index option, or, as a last resort, using the default.

host, port, and socket
    The host where the server resides, the port for connecting to the server as a MySQL client, and the socket through which to connect if on the same host.

ssh_user and sql_user
    A combination of user and password that can be used for connecting to the machine or the server. These users are used to execute administrative commands, such as starting and stopping the server and reading and writing the configuration file, or for executing SQL commands on the server.

machine
    An object that holds operating system–specific primitives. We chose the name "machine" instead of "os" to avoid a name conflict with the Python standard library os module. This parameter lets you use different techniques for starting and stopping the server as well as other tasks and operating system–specific parameters. The parameters will be covered later.

server_id
    An optional parameter to hold the server's identifier, as defined in each server's configuration file. If this option is omitted, the server identifier will be read from the configuration file of the server. If there is no server identifier in the configuration file either, the server is a vagabond and does not participate in replication as master or slave.

config_manager
    An optional parameter to hold a reference to a configuration manager that can be queried for information about the configuration for the server.
Server.connect() and Server.disconnect()
    Use the connect and disconnect methods to establish a connection to the server before executing commands in a session, and to disconnect from the server after finishing the session, respectively.

    These methods are useful because in some situations it is critical to keep the connection to the server open even after an SQL command has been executed. Otherwise, for example, when doing a FLUSH TABLES WITH READ LOCK, the lock will automatically be released when the connection is dropped.

Server.ssh(command, args...) and Server.sql(command, args...)
    Use these to execute a shell command or an SQL command on the server. The ssh and sql methods both return an iterable. ssh returns a list of the lines of output from the executed command, whereas sql returns a list of objects of an internal class named Row. The Row class defines the __iter__ and next methods so that you can iterate over the returned lines or rows, for example:

        for row in server.sql("SHOW DATABASES"):
            print row["Database"]

    To handle statements that return a single row, the class also defines a __getitem__ method, which will fetch a field from the single row or raise an exception if there is no row. This means that when you know your return value has only one row (which is guaranteed for many SQL statements), you can avoid the loop shown in the previous example and write something like:

        print server.sql("SHOW MASTER STATUS")["Position"]

Server.fetch_config() and Server.replace_config()
    The methods fetch_config and replace_config fetch the configuration file into memory from the remote server to allow the user to add or remove options as well as change the values of some options.
For example, to add a value to the log-bin and log-bin-index options, you can use the module as follows:

    from my_deployment import master

    config = master.fetch_config()
    config.set('log-bin', 'capulet-bin')
    config.set('log-bin-index', 'capulet-bin.index')
    master.replace_config(config)

Server.start() and Server.stop()
    The methods start and stop forward information to the machine object to do their jobs, which depend on the operating system the server is using. The methods will either start the server or shut it down, respectively.

Server Roles

Servers work slightly differently depending on their roles. For example, masters require a replication user for slaves to use when connecting, but slaves don't require that user account unless they act as a master and have other slaves connecting.

To capture the configuration of the servers in a flexible manner, classes are introduced for representing different roles. When you use the imbue method on a server, the appropriate commands are sent to the server to configure it correctly for that role. Note that a server might change roles in the lifetime of a deployment, so the roles given here just serve to configure the initial deployment. However, a server always has a designated role in the deployment and therefore also has an associated role. When a server changes roles, it might be necessary to remove some of the configuration information from the server, so an unimbue method is also defined for a role and used when switching roles for a server.

In this example, only three roles are defined. Later in the book, you will see more roles defined. For example, you will later see how to create nonfinal slaves that can be used as secondaries or relay servers. The following three roles can be found in the MySQL Replicant library:

Role
    This is the base class of all the roles.
Each derived class needs to define the methods imbue and (optionally) unimbue, each accepting a single server to imbue with the role. To aid derived classes with some common tasks, the Role class defines a number of helper functions, including the following:

Role.imbue(server)
    This method imbues the server with the new role by executing the appropriate code.

Role.unimbue(server)
    This method allows a role to perform cleanup actions before another role is imbued.

Role._set_server_id(server, config)
    If there is no server identifier in the configuration, this method sets it to server.server_id. If the configuration has a server identifier, it will be used to set the value of server.server_id.

Role._create_repl_user(server, user)
    This method creates a replication user on the server and grants it the necessary rights to act as a replication slave.

Role._enable_binlog(server, config)
    This method enables the binary log on the server by setting the log-bin and log-bin-index options to appropriate values. If the server already has a value for log-bin, this method does nothing.

Role._disable_binlog(server, config)
    This method disables the binary log by clearing the log-bin and log-bin-index options in the configuration file.

Vagabond
    This is the default role assigned to any server that does not participate in the replication deployment. As such, the server is a "vagabond" and does not have any responsibilities whatsoever.

Master
    This role is for a server that acts as a master. The role will set the server identifier, enable the binary log, and create a replication user for the slaves. The name and password of the replication user will be stored in the server so that when slaves are connected, the class can look up the replication username.

Final
    This is the role for a (final) slave (i.e., a slave that does not have a binary log of its own).
When a server is imbued with this role, it will be given a server identifier, the binary log will be disabled, and a CHANGE MASTER command will be issued to connect the slave to a master.

Note that we stop the server before we write the configuration file back to it, and restart the server after we have written the configuration file. The configuration file is read only when starting the server and closed after the reading is done, but we play it safe and stop the server before modifying the file.

One of the critical design decisions here is not to store any state information about the servers that roles apply to. It might be tempting to keep a list of all the masters by adding them to the role object, but because the roles of the servers change over the lifetime of the deployment, the roles are used only to set up the system. Because we allow a role to contain parameters, you can use them to configure several servers with the same information:

    import my_deployment
    from mysql.replicant.roles import Final

    slave_role = Final(master=my_deployment.master)
    for slave in my_deployment.slaves:
        slave_role.imbue(slave)

Conclusion

In this chapter you have seen how to build a library for making administration of your servers easy. You have also seen the beginning of the MySQL Replicant library that we will be developing throughout this book.

Joel finished testing his script. He was pretty confident he had all of the parts in place and that the resulting command would save him a lot of time in the future. He clicked Enter. A few moments later, his script returned the data he expected. He checked his servers thinking this was too easy, but he found everything he wanted to do had been done. "Cool, that was easy!" he said, and locked his screen before heading to lunch.

CHAPTER 3
MySQL Replication Fundamentals

Joel jumped as a sharp rap on his door announced his boss's unapologetic interruption.
Before Joel could say "come in," the boss stepped into his doorway and said, "Joel, we're getting complaints that our response time is getting slow. See what you can do to speed things up. The administrator told me there are too many read operations from the applications. See what you can do to offload some of that." Before Joel could respond, Mr. Summerson was out the door and on his way elsewhere.

"I suppose he means we need a bigger server," Joel thought.

As if he had read Joel's mind, Mr. Summerson stuck his head back in the doorway and said, "Oh, and by the way, the startup we bought all the equipment from had a bunch of servers we haven't found any use for yet. Can you take a look at those and see what you can do with them? OK, Joel?" Then he was gone again.

"I wonder if I'll ever get used to this," Joel thought as he pulled his favorite MySQL book off the shelf and glanced at the table of contents. He found the chapter on replication and decided that might fit the bill.

MySQL replication is a very useful tool when used correctly, but it can also be a source of considerable headaches when it experiences a failure or when it is configured or used incorrectly. This chapter will cover the fundamentals of using MySQL replication, beginning with a simple setup to get you started and then introducing some basic techniques to store in your "replication toolkit." This chapter covers the following replication use cases:

High availability through hot standby
    If a server goes down, everything will stop; it will not be possible to execute (perhaps critical) transactions, get information about customers, or retrieve other important data. This is something that you want to avoid at (almost) any cost, because it can severely disrupt your business. The easiest solution is to configure an extra server with the sole purpose of acting as a hot standby, ready to take over the job of the main server if it fails.
Report generation
    Creating reports from data on a server will degrade the server's performance, in some cases significantly. If you're running lots of background jobs to generate reports, it's worth creating an extra server just for this purpose. You can get a snapshot of the database at a certain time by stopping replication on the report server and then running large queries on it without disturbing the main business server. For example, if you stop replication after the last transaction of the day, you can extract your daily reports while the rest of the business is humming along at its normal pace.

Debugging and auditing
    You can also investigate queries that have been executed on the server—for example, to see whether particular queries were executed on servers with performance problems, or whether a server has gone out of sync because of a bad query.

Basic Steps in Replication

This chapter will introduce several sophisticated techniques for maximizing the efficiency and value of replication, but as a first step, we will set up the simple replication shown in Figure 3-1—a single instance of replication from a master to a slave. This does not require any knowledge of the internal architecture or execution details of the replication process (we'll explore these before we take on more complicated scenarios).

Figure 3-1. Simple replication

Setting up basic replication can be summarized in three easy steps:

1. Configure one server to be a master.
2. Configure one server to be a slave.
3. Connect the slave to the master.

Unless you plan replication from the start and include the right configuration options in the my.cnf files, you will have to restart each server to carry out steps 1 and 2. (On Windows, the command-line prompt (CMD) or PowerShell can be used in place of the Unix shell referred to in this chapter.)
To follow the procedures in this section, it is easiest if you have a shell account on the machine with privileges to change the my.cnf file, as well as an account on the server with ALL privileges granted. You should be very restrictive in granting privileges in a production environment. For precise guidelines, consult "Privileges for the User Configuring Replication" on page 27.

Configuring the Master

To configure a server so that it can act as master, ensure the server has an active binary log and a unique server ID. We will examine the binary log in greater detail later, but for now it is sufficient to say that it keeps a record of all the changes the master has made so that they can be repeated on the slave. The server ID is used to distinguish two servers from each other. To set up the binary log and server ID, you have to take the server down and add the log-bin, log-bin-index, and server-id options to the my.cnf configuration file as shown in Example 3-1. The added options are the last three lines.

Example 3-1. Options added to my.cnf to configure a master

    [mysqld]
    user            = mysql
    pid-file        = /var/run/mysqld/mysqld.pid
    socket          = /var/run/mysqld/mysqld.sock
    port            = 3306
    basedir         = /usr
    datadir         = /var/lib/mysql
    tmpdir          = /tmp
    log-bin         = master-bin
    log-bin-index   = master-bin.index
    server-id       = 1

The log-bin option gives the base name for all the files created by the binary log (as you will see later, the binary log consists of several files). If you create a filename with an extension to log-bin, the extension will be ignored and only the file's base name will be used (i.e., the name without the extension).

The log-bin-index option gives the name of the binary log index file, which keeps a list of all binlog files. Strictly speaking, it is not necessary to give a name in the log-bin option. The default value is hostname-bin.
The value for hostname is taken from the pid-file option, which by default is the name of the host (as given by the gethostname(2) system call). If an administrator later changes the machine's hostname, the binlog files will change names as well, but they will be tracked correctly in the index file. However, it is a good idea to create a name that is unique for the MySQL server and not tied to the machine the server is running on, because it can be confusing to work with a series of binlog files that suddenly change name midstream.

If no value is provided for log-bin-index, the default value will be the same base name as for the binlog files (hostname-bin if you don't give a value for log-bin). This means that if you do not provide a value for log-bin-index, the index file will change its name when you change the name of the host. So if you change the name of the host and start the server, it will not find the index file and will therefore assume that it does not exist, and this will give you an empty binary log.

Each server is identified by a unique server ID, so if a slave connects to the master and has the same server-id as the master, an error will be generated indicating that the master and the slave have the same server ID.

Once you have added the options to the configuration file, start the server again and finish its configuration by adding a replication user. After you make the change to the master's configuration file, restart the master for the changes to take effect.

The slave initiates a normal client connection to the master and requests the master to send all changes to it. For the slave to connect, a user with special replication privileges is required on the master. Example 3-2 shows a standard mysql client session on the master server, with commands that add a new user account and give it the proper privilege.

Example 3-2.
Creating a replication user on the master

    master> CREATE USER repl_user;
    Query OK, 0 rows affected (0.00 sec)

    master> GRANT REPLICATION SLAVE ON *.*
        ->  TO repl_user IDENTIFIED BY 'xyzzy';
    Query OK, 0 rows affected (0.00 sec)

There is nothing special about the REPLICATION SLAVE privilege except that the user can retrieve the binary log from the master. It is perfectly viable to have a normal user account and grant that user the REPLICATION SLAVE privilege. It is, however, a good idea to keep the replication slave user separate from the other users. If you do that, you can remove the user if you need to disallow certain slaves from connecting later.

Configuring the Slave

After configuring the master, you must configure the slave. As with the master server, you need to assign each slave a unique server ID. You may also want to consider adding the names of the relay log and the relay log index files to the my.cnf file using the options relay-log and relay-log-index (we will discuss the relay log in more detail in "Replication Architecture Basics" on page 228). The recommended configuration options are given in Example 3-3; the added options are the last three lines.

Example 3-3. Options added to my.cnf to configure a slave

    [mysqld]
    user            = mysql
    pid-file        = /var/run/mysqld/mysqld.pid
    socket          = /var/run/mysqld/mysqld.sock
    port            = 3306
    basedir         = /usr
    datadir         = /var/lib/mysql
    tmpdir          = /tmp
    server-id       = 2
    relay-log-index = slave-relay-bin.index
    relay-log       = slave-relay-bin

Like the log-bin and log-bin-index options, the defaults for the relay-log and relay-log-index options depend on the hostname. The default for relay-log is hostname-relay-bin and the default for relay-log-index is hostname-relay-bin.index. Using the defaults introduces a problem: if the hostname of the server changes, the server will not find the relay log index file and will assume there is nothing in the relay log files.
After editing the my.cnf file, restart the slave server for the changes to take effect.

Privileges for the User Configuring Replication

To configure the connection of the slave to the master for replication, it is necessary to have an account with certain privileges, in addition to a shell account with access to critical files. For security reasons, it is usually a good idea to restrict the account used for configuring the master and slave to just the necessary privileges. To create and drop users, the account needs to have the CREATE USER privilege. To grant REPLICATION SLAVE to the replication account, it is necessary to have the REPLICATION SLAVE privilege with the GRANT OPTION.

To perform further replication-related procedures (shown later in this chapter), you need a few more privileges:

• To execute the FLUSH LOGS command (or any FLUSH command), you need the RELOAD privilege.
• To execute SHOW MASTER STATUS and SHOW SLAVE STATUS, you need either the SUPER or REPLICATION CLIENT privilege.
• To execute CHANGE MASTER TO, you need the SUPER privilege.

For example, to give mats sufficient privileges for all the procedures in this chapter, issue the following:

    server> GRANT REPLICATION SLAVE, RELOAD, CREATE USER, SUPER
        ->     ON *.*
        ->     TO mats@'192.168.2.%'
        ->     WITH GRANT OPTION;

Connecting the Master and Slave

Now you can perform the final step in setting up basic replication: directing the slave to the master so that it knows where to replicate from. To do this, you need four pieces of information about the master:

• A hostname
• A port number
• A user account on the master with replication slave privileges
• A password for the user account

You already created a user account with the right privileges and a password when configuring the master.
The hostname is given by the operating system and can't be configured in the my.cnf file, but the port number can be assigned in my.cnf (if you do not supply a port number, the default value of 3306 will be used). The final two steps necessary to get replication up and running are to direct the slave to the master using the CHANGE MASTER TO command and then start replication using START SLAVE:

    slave> CHANGE MASTER TO
        ->     MASTER_HOST = 'master-1',
        ->     MASTER_PORT = 3306,
        ->     MASTER_USER = 'repl_user',
        ->     MASTER_PASSWORD = 'xyzzy';
    Query OK, 0 rows affected (0.00 sec)

    slave> START SLAVE;
    Query OK, 0 rows affected (0.15 sec)

Congratulations! You have now set up your first replication between a master and a slave! If you make some changes to the database on the master, such as adding new tables and filling them in, you will find that they are replicated to the slave. Try it out! Create a test database (if you do not already have one), create some tables, and add some data to the tables to see that the changes replicate over to the slave.

Observe that either a hostname or an IP address can be given to the MASTER_HOST parameter. If a hostname is given, the IP address for the hostname is retrieved by calling gethostbyname(3), which, depending on your configuration, could mean resolving the hostname using a DNS lookup. The steps for configuring such lookups are beyond the scope of this book.

A Brief Introduction to the Binary Log

What makes replication work is the binary log (or just binlog), which is a record of all changes made to the database on a server. You need to understand how the binary log works in order to have control over replication or to fix any problems that arise, so we'll give you a bit of background in this section.

Figure 3-2 shows a schematic view of the replication architecture, containing a master with a binary log and a slave that receives changes from the master via the binary log.
We will cover the replication architecture in detail in Chapter 8. When a statement is about to finish executing, the server writes an entry to the end of the binary log and sends the statement parser a notification that it has completed the statement.

Figure 3-2. Role of the binary log in replication

Usually only the statement that is about to finish executing is written to the binary log, but there are some special cases where other information is written—either in addition to the statement or instead of the statement. It will soon be clear why this is so, but for the time being, you can pretend that only the statements that are being executed are being written to the binary log.

What's Recorded in the Binary Log

The purpose of the binary log is to record changes made to the tables in the database. The binary log can then be used for replication, as well as for point-in-time recovery (PITR, discussed in Chapter 15) and, in some limited cases, for auditing. Note that the binary log contains only changes made to the database, so for statements that do not change any data in the database, no entry is written to the binary log.

Traditionally, MySQL replication records changes by preserving the SQL statement that made the change. This is called statement-based replication. Because statement-based replication re-executes the statements on the slave, the result on the slave can differ from the master if the contexts of the master and slave are not exactly the same. This is the reason why, as of version 5.1, MySQL also offers row-based replication. In contrast to statement-based replication, row-based replication individually records each change to a row in the binary log. In addition to being more convenient, row-based replication can offer some speed advantages in certain situations.

To imagine the difference, consider a complex update that uses a lot of joins or WHERE clauses.
Instead of re-executing all the logic on the slave, as statement-based replication does, all you really need to know is the state of the row after the change. On the other hand, if a single update changes 10,000 rows, you'd rather record just the statement instead of the 10,000 separate changes that row-based replication would record. We will cover row-based replication in Chapter 8, explaining its implementation and its use. In the examples that follow, we'll focus on statement-based replication because it's easier to understand with respect to database management activities.

Watching Replication in Action

Using the replication example from the previous section, let's take a look at the binlog events for some simple statements. Let's start by connecting a command-line client to the master and executing a few commands to get a binary log:

    master> CREATE TABLE tbl (text TEXT);
    Query OK, 0 rows affected (0.04 sec)

    master> INSERT INTO tbl VALUES ("Yeah! Replication!");
    Query OK, 1 row affected (0.00 sec)

    master> SELECT * FROM tbl;
    +--------------------+
    | text               |
    +--------------------+
    | Yeah! Replication! |
    +--------------------+
    1 row in set (0.00 sec)

    master> FLUSH LOGS;
    Query OK, 0 rows affected (0.28 sec)

The FLUSH LOGS command forces the binary log to rotate, which will allow us to see a "complete" binlog file in all its glory. To take a closer look at this file, use the SHOW BINLOG EVENTS command, as shown in Example 3-4.

Example 3-4. Checking what events are in the binary log

    master> SHOW BINLOG EVENTS\G
    *************************** 1. row ***************************
       Log_name: mysql-bin.000001
            Pos: 4
     Event_type: Format_desc
      Server_id: 1
    End_log_pos: 107
           Info: Server ver: 5.5.34-0ubuntu0.12.04.1-log, Binlog ver: 4
    *************************** 2. row ***************************
       Log_name: mysql-bin.000001
            Pos: 107
     Event_type: Query
      Server_id: 1
    End_log_pos: 198
           Info: use `test`; CREATE TABLE tbl (text TEXT)
    *************************** 3.
    row ***************************
       Log_name: mysql-bin.000001
            Pos: 198
     Event_type: Query
      Server_id: 1
    End_log_pos: 266
           Info: BEGIN
    *************************** 4. row ***************************
       Log_name: mysql-bin.000001
            Pos: 266
     Event_type: Query
      Server_id: 1
    End_log_pos: 374
           Info: use `test`; INSERT INTO tbl VALUES ("Yeah! Replication!")
    *************************** 5. row ***************************
       Log_name: mysql-bin.000001
            Pos: 374
     Event_type: Xid
      Server_id: 1
    End_log_pos: 401
           Info: COMMIT /* xid=188 */
    *************************** 6. row ***************************
       Log_name: mysql-bin.000001
            Pos: 401
     Event_type: Rotate
      Server_id: 1
    End_log_pos: 444
           Info: mysql-bin.000002;pos=4
    6 rows in set (0.00 sec)

In this binary log, we can now see six events: a format description event, three query events, one XID event, and a rotate event. The query event is how statements executed against the database are normally written to the binary log, the XID event is used for transaction management, and the format description and rotate events are used by the server internally to manage the binary log. We will discuss these events in more detail in Chapter 8, but for now, let's take a closer look at the columns given for each event:

Event_type
    This is the type of the event. We have seen four different types here, but there are many more. The type of the event denotes what information is transported to the slave. Currently—in MySQL 5.1.18 to 5.5.33—there are 27 events (several of them are not used, but they are retained for backward compatibility), and in 5.6.12 there are 35, but this is an extensible range and new events are added as required.

Server_id
    This is the server ID of the server that created the event.

Log_name
    This is the name of the file that stores the event. An event is always contained in a single file and will never span two files.
Pos
    This is the position of the file where the event starts (i.e., the first byte of the event).

End_log_pos
    This gives the position in the file where the event ends and the next event starts. This is one higher than the last byte of the event, so the bytes in the range Pos to End_log_pos − 1 are the bytes containing the event, and the length of the event can be computed as End_log_pos − Pos. For example, the INSERT event in Example 3-4 starts at position 266 and ends at position 374, so it is 108 bytes long.

Info
    This is human-readable text with information about the event. Different information is printed for different events, but you can at least count on the query event to print the statement that it contains.

The first two columns, Log_name and Pos, make up the binlog position of the event and will be used to indicate the location or position of an event. In addition to what is shown here, each event contains a lot of other information—for example, a timestamp, which is the number of seconds since the Epoch (a classic Unix moment in time, 1970-01-01 00:00:00 UTC).

The Binary Log's Structure and Content

As we explained, the binary log is not actually a single file, but a set of files that allow for easier management (such as removing old logs without disturbing recent ones). The binary log consists of a set of binlog files with the real contents, as well as a binlog index file, which keeps track of which binlog files exist. Figure 3-3 shows how a binary log is organized.

Figure 3-3. Structure of the binary log

One binlog file is the active binlog file. This is the file that is currently being written to (and usually read from as well). Each binlog file starts with a format description event and ends with a rotate event. The format description log event contains, among other things, the version of the server that produced the file and general information about the server and binary log. The rotate event tells where the binary log continues by giving the filename of the next file in the sequence.
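The bookkeeping just described is simple enough to express in a few lines. The sketch below assumes that the index file is a plain-text list of binlog filenames, one per line, and that binlog files carry zero-padded sequence numbers as in the examples; these helpers are illustrative, not part of the Replicant library.

```python
# Illustrative helpers for the binary log's file organization; these
# are sketches, not part of the Replicant library.

def parse_binlog_index(text):
    # The binlog index file lists the binlog filenames one per line;
    # the last entry is the most recently created (normally the
    # active) binlog file.
    return [line.strip() for line in text.splitlines() if line.strip()]

def next_binlog(name):
    # Binlog files are numbered sequentially, so the filename
    # announced by a rotate event can be derived from the current one.
    base, seq = name.rsplit('.', 1)
    return '%s.%06d' % (base, int(seq) + 1)
```

For example, next_binlog('mysql-bin.000001') yields 'mysql-bin.000002', matching the rotate event's Info field in Example 3-4.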
Each file is organized into binary log events, where each event makes a standalone, atomic piece of the binary log. The format description log event contains a flag that marks the file as properly closed. While a binlog file is being written, the flag is set, and when the file is closed, the flag is cleared. This way, it is possible to detect corrupt binlog files in the event of a crash and allow replication to recover.

If you try to execute additional statements at the master, you will observe something strange—no changes are seen in the binary log:

    master> INSERT INTO tbl VALUES ("What's up?");
    Query OK, 1 row affected (0.00 sec)

    master> SELECT * FROM tbl;
    +--------------------+
    | text               |
    +--------------------+
    | Yeah! Replication! |
    | What's up?         |
    +--------------------+
    1 row in set (0.00 sec)

    master> SHOW BINLOG EVENTS\G
    same as before

What happened to the new event? Well, as you already know, the binary log consists of several files, and the SHOW BINLOG EVENTS statement shows only the contents of the first binlog file. This is contrary to what most users expect, which is to see the contents of the active binlog file. If the name of the first binlog file is master-bin.000001 (containing the events shown previously), you can take a look at the events in the next binlog file, in this case named master-bin.000002, using the following:

    master> SHOW BINLOG EVENTS IN 'master-bin.000002'\G
    *************************** 1. row ***************************
       Log_name: mysql-bin.000002
            Pos: 4
     Event_type: Format_desc
      Server_id: 1
    End_log_pos: 107
           Info: Server ver: 5.5.34-0ubuntu0.12.04.1-log, Binlog ver: 4
    *************************** 2. row ***************************
       Log_name: mysql-bin.000002
            Pos: 107
     Event_type: Query
      Server_id: 1
    End_log_pos: 175
           Info: BEGIN
    *************************** 3.
row *************************** Log_name: mysql-bin.000002 Pos: 175 Event_type: Query Server_id: 1 End_log_pos: 275 Info: use `test`; INSERT INTO tbl VALUES ("What's up?") *************************** 4. row *************************** Log_name: mysql-bin.000002 Pos: 275 Event_type: Xid Server_id: 1 End_log_pos: 302 Info: COMMIT /* xid=196 */ 4 rows in set (0.00 sec) You might have noticed in Example 3-4 that the binary log ends with a rotate event and that the Info field contains the name of the next binlog file and position where the events start. To see which binlog file is currently being written, you can use the SHOW MASTER STATUS command: 34 | Chapter 3: MySQL Replication Fundamentals master> SHOW MASTER STATUS\G *************************** 1. row *************************** File: master-bin.000002 Position: 205 Binlog_Do_DB: Binlog_Ignore_DB: 1 row in set (0.00 sec) Now that you’ve finished taking a look at the binary log, stop and reset the slave and drop the table: master> DROP TABLE tbl; Query OK, 0 rows affected (0.00 sec) slave> STOP SLAVE; Query OK, 0 rows affected (0.08 sec) slave> RESET SLAVE; Query OK, 0 rows affected (0.00 sec) After that, you can drop the table and reset the master to start fresh: master> DROP TABLE tbl; Query OK, 0 rows affected (0.00 sec) master> RESET MASTER; Query OK, 0 rows affected (0.04 sec) The RESET MASTER command removes all the binlog files and clears the binlog index file. The RESET SLAVE statement removes all files used by replication on the slave to get a clean start. Neither the RESET MASTER nor the RESET SLAVE command is de‐ signed to work when replication is active, so: • When executing the RESET MASTER command (on the master), make sure that no slaves are attached. • When executing the RESET SLAVE command (on the slave), make sure that the slave does not have replication active by issuing a STOP SLAVE command. 
We will cover the most basic events in this chapter, but for the complete list with all its gory details, refer to the MySQL Internals Manual.

Adding Slaves

Now that you know a little about the binary log, we are ready to tackle one of the basic problems with the way we created a slave earlier. When we configured the slave, we provided no information about where to start replication, so the slave will start reading the binary logs on the master from the beginning. That's clearly not a very good idea if the master has been running for some time: in addition to making the slave replay quite a lot of events just to ramp up, you might not be able to obtain the necessary logs, because they might have been stored somewhere else for safekeeping and removed from the master (we'll discuss that more in Chapter 15 when we talk about backups and PITR).

We need another way to create new slaves—called bootstrapping a slave—without starting replication from the beginning. The CHANGE MASTER TO command has two parameters that will help us here: MASTER_LOG_FILE and MASTER_LOG_POS. (Starting with MySQL 5.6, there is another, even easier way to specify positions: Global Transaction Identifiers, or GTIDs. Read more about them in Chapter 8.) You can use these to specify the binlog position at which the master should start sending events instead of starting from the beginning.

Using these parameters to CHANGE MASTER TO, we can bootstrap a slave using the following steps:

1. Configure the new slave.

2. Make a backup of the master (or of a slave that has been replicating the master). See Chapter 15 for common backup techniques.

3. Write down the binlog position that corresponds to this backup (in other words, the position following the last event leading up to the master's current state).

4. Restore the backup on the new slave. See Chapter 15 for common restore techniques.

5. Configure the slave to start replication from this position.
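Step 5 boils down to issuing a CHANGE MASTER TO statement with the recorded coordinates. As a sketch, the statement can be built from the coordinates like this (the helper name and quoting are ours; the book's library instead uses a parameterized SQL template, shown later in Example 3-5):

```python
def change_master_stmt(host, port, user, password, log_file, log_pos):
    # Build the CHANGE MASTER TO statement used in step 5 from the
    # binlog coordinates recorded in step 3.
    return ("CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=%d, "
            "MASTER_USER='%s', MASTER_PASSWORD='%s', "
            "MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d"
            % (host, port, user, password, log_file, log_pos))

print(change_master_stmt('master-1', 3306, 'slave-1', 'xyzzy',
                         'master-bin.000042', 456552))
```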
Depending on whether you use the master or a slave as a baseline in step 2, the procedure differs slightly, so we will start by describing how to bootstrap a new slave when you only have a single server running that you want to use as master—this is called cloning the master.

Cloning the master means taking a snapshot of the server, which is usually accomplished by creating a backup. There are various techniques for backing up the server, but in this chapter, we have decided to use one of the simpler techniques: running mysqldump to create a logical backup. Other options are to create a physical backup by copying the database files, online backup techniques such as MySQL Enterprise Backup, or even volume snapshots using Linux LVM (Logical Volume Manager). The various techniques will be described fully in Chapter 15, along with a discussion of their relative merits.

Cloning the Master

The mysqldump utility has options that allow you to perform all the steps in this section in a single step, but to explain the necessary operations, we will perform all the steps here individually. You will see a more compact version later in this section.

To clone the master, as shown in Figure 3-4, start by creating a backup of the master. Because the master is probably running and has a lot of tables in the cache, it is necessary to flush all tables and lock the database to prevent changes before checking the binlog position. You can do this using the FLUSH TABLES WITH READ LOCK command:

    master> FLUSH TABLES WITH READ LOCK;
    Query OK, 0 rows affected (0.02 sec)

Figure 3-4. Cloning a master to create a new slave

Once the database is locked, you are ready to create a backup and note the binlog position. Note that at this point you should not disconnect mysql from the server, as that would release the lock that you just took.
Because no changes are occurring on the master, the SHOW MASTER STATUS command will correctly reveal the current file and position in the binary log. We will go through the details of the SHOW MASTER STATUS and SHOW MASTER LOGS commands in Chapter 8.

    master> SHOW MASTER STATUS\G
    *************************** 1. row ***************************
                File: master-bin.000042
            Position: 456552
        Binlog_Do_DB:
    Binlog_Ignore_DB:
    1 row in set (0.00 sec)

The position of the next event to write is master-bin.000042, 456552, which is where replication should start, given that everything before this point will be in the backup. Once you have jotted down the binlog position, you can create your backup. The easiest way to create a backup of the database is to use mysqldump:

    $ mysqldump --all-databases --host=master-1 >backup.sql

Because you now have a faithful copy of the master, you can unlock the tables of the database on the master and allow it to continue processing queries:

    master> UNLOCK TABLES;
    Query OK, 0 rows affected (0.23 sec)

Next, restore the backup on the slave using the mysql utility:

    $ mysql --host=slave-1 <backup.sql

Finally, configure the slave to start replication from the recorded position and start it:

    slave> CHANGE MASTER TO
        ->   MASTER_HOST = 'master-1',
        ->   MASTER_PORT = 3306,
        ->   MASTER_USER = 'slave-1',
        ->   MASTER_PASSWORD = 'xyzzy',
        ->   MASTER_LOG_FILE = 'master-bin.000042',
        ->   MASTER_LOG_POS = 456552;
    Query OK, 0 rows affected (0.00 sec)

    slave> START SLAVE;
    Query OK, 0 rows affected (0.25 sec)

It is possible to have mysqldump perform many of the previous steps automatically. To make a logical backup of all databases on a server called master, enter:

    $ mysqldump --host=master --all-databases \
    >   --master-data=1 >backup-source.sql

The --master-data=1 option makes mysqldump write a CHANGE MASTER TO statement with the file and position in the binary log, as given by SHOW MASTER STATUS.
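Because --master-data embeds the coordinates in the dump itself, a script can recover them without a separate SHOW MASTER STATUS call. A sketch of doing so (the regex is ours and assumes the statement layout mysqldump writes near the top of the dump):

```python
import re

_MASTER_DATA = re.compile(
    r"CHANGE MASTER TO\s+MASTER_LOG_FILE='([^']+)',\s*MASTER_LOG_POS=(\d+)")

def position_from_dump(dump_text):
    """Extract (file, pos) from the CHANGE MASTER TO statement that
    the --master-data option writes into the dump (with --master-data=2
    the statement is commented out, but the regex matches either way)."""
    m = _MASTER_DATA.search(dump_text)
    if not m:
        raise ValueError("no CHANGE MASTER TO statement found in dump")
    return m.group(1), int(m.group(2))

sample = ("CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000042', "
          "MASTER_LOG_POS=456552;")
print(position_from_dump(sample))  # ('master-bin.000042', 456552)
```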
You can then restore the backup on a slave using:

    $ mysql --host=slave-1 <backup-source.sql

To clone a slave instead of the master, start by stopping the slave so that it is no longer applying changes:

    original-slave> STOP SLAVE;
    Query OK, 0 rows affected (0.20 sec)

After the slave is stopped, you can flush the tables as before and create the backup. Because you created a backup of the slave (not the master), use the SHOW SLAVE STATUS command instead of SHOW MASTER STATUS to determine where to start replication. The output from this command is considerable, and it will be covered in detail in Chapter 8, but to get the position of the next event in the binary log of the master that the slave will execute, note the value of the fields Relay_Master_Log_File and Exec_Master_Log_Pos:

    original-slave> SHOW SLAVE STATUS\G
        ...
        Relay_Master_Log_File: master-bin.000042
        ...
          Exec_Master_Log_Pos: 546632

After creating the backup and restoring it on the new slave, configure replication to start from this position and start the new slave:

    new-slave> CHANGE MASTER TO
        ->   MASTER_HOST = 'master-1',
        ->   MASTER_PORT = 3306,
        ->   MASTER_USER = 'slave-1',
        ->   MASTER_PASSWORD = 'xyzzy',
        ->   MASTER_LOG_FILE = 'master-bin.000042',
        ->   MASTER_LOG_POS = 546632;
    Query OK, 0 rows affected (0.19 sec)

    new-slave> START SLAVE;
    Query OK, 0 rows affected (0.24 sec)

Cloning the master and cloning the slave differ only on some minor points, which means that our Python library will be able to combine the two into a single procedure for creating new slaves by creating the backup at a source server and connecting the new slave to a master.

A common technique for making backups is to call FLUSH TABLES WITH READ LOCK and then to create a copy of the database files while the MySQL server is locked with the read lock. This is usually much faster than using mysqldump, but FLUSH TABLES WITH READ LOCK is not safe for use with InnoDB! It does lock the tables, preventing any new transactions from starting, but there are several activities going on in the background that it does not prevent.
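For InnoDB tables, one safe alternative is mysqldump's --single-transaction option, which takes a consistent snapshot inside a transaction instead of holding a global read lock. Assembling such a command for the subprocess module might look like this sketch (ours; verify the option set against your mysqldump version):

```python
def innodb_dump_command(host):
    # --single-transaction: consistent InnoDB snapshot, no global lock.
    # --master-data=1: record the binlog position inside the dump so a
    # new slave can be pointed at it later.
    return ["mysqldump",
            "--host=%s" % host,
            "--all-databases",
            "--single-transaction",
            "--master-data=1"]

cmd = innodb_dump_command("master-1")
print(" ".join(cmd))
# To run it for real (requires mysqldump on the path):
#   from subprocess import call
#   with open("backup-source.sql", "w") as backup:
#       call(cmd, stdout=backup)
```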
Use one of the following techniques to create a backup of InnoDB tables safely:

• Shut down the server and copy the files. This can be an advantage if the database is big, as restoring data with mysqldump can be slow.

• Use mysqldump after performing FLUSH TABLES WITH READ LOCK (as we did earlier). The read lock prevents changes while the data is read, but the database may be locked for a long time if there is a lot of data to be read. Note, however, that it is possible to take a consistent snapshot using the --single-transaction option, but this is only possible when using InnoDB tables. For more information, see "The mysqldump Utility" on page 560.

• Use a snapshot solution such as LVM (on Linux) or ZFS (on Solaris) while locking the database with FLUSH TABLES WITH READ LOCK.

• Use MySQL Enterprise Backup (or XtraBackup) to do an online backup of MySQL.

Scripting the Clone Operation

The Python library clones a master simply by copying the database from the master using the Server object that represents the master. To do this, it uses a clone function, which you will see in Example 3-6. Cloning a slave is similar, but the backup is taken from one server, while the new slave connects to another server to perform replication.

It is easy to support cloning both a master and a slave by using two different parameters: a source parameter that specifies where the backup should be created and a use_master parameter that indicates where the slave should connect after the backup is restored. A call to the clone function looks like the following:

    clone(slave = slave[1], source = slave[0], use_master = master)

The next step is to write some utility functions to implement the cloning function, which will also come in handy for other activities. Example 3-5 shows the following functions:

fetch_master_pos
    Fetches the binlog position from a master (i.e., the position of the next event the master will write to the binary log).
fetch_slave_pos
    Fetches the binlog position from a slave (i.e., the position of the next event to read from the master).

replicate_from
    Accepts as arguments a slave, a master, and a binlog position, and directs the slave to replicate from the master starting with the given position.

The replicate_from function reads the field repl_user from the master to get the name and password of the replication user. If you look at the definition of the Server class, you'll find that there is no such field. It is added by the Master role when the server is imbued.

Example 3-5. Utility functions to fetch the master and slave positions of a server

    _CHANGE_MASTER_TO = """CHANGE MASTER TO
        MASTER_HOST=%s, MASTER_PORT=%s,
        MASTER_USER=%s, MASTER_PASSWORD=%s,
        MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s"""

    def replicate_from(slave, master, position):
        slave.sql(_CHANGE_MASTER_TO,
                  (master.host, master.port,
                   master.repl_user.name,
                   master.repl_user.passwd,
                   position.file, position.pos))

    def fetch_master_pos(server):
        result = server.sql("SHOW MASTER STATUS")
        return Position(server.server_id,
                        result["File"], result["Position"])

    def fetch_slave_pos(server):
        result = server.sql("SHOW SLAVE STATUS")
        return Position(server.server_id,
                        result["Relay_Master_Log_File"],
                        result["Exec_Master_Log_Pos"])

These are all the functions needed to create the clone function. To clone a slave, the calling application passes a separate use_master argument, causing clone to direct the new slave to that master for replication. To clone a master, the calling application omits the separate use_master argument, causing the function to use the "source" server as a master.

Because there are many ways to create a backup of a server, Example 3-6 restricts the method to one choice, using mysqldump to create a logical backup of the server. Later, we will demonstrate how to generalize the backup procedure so that you can use the same basic code to bootstrap new slaves using arbitrary backup methods.
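Both fetch functions return a Position object from mysql.replicant, which is not shown in this excerpt. A minimal stand-in sufficient for these examples might look like the following (our sketch; the real library class may differ, for instance in how it handles positions from different servers):

```python
from functools import total_ordering

@total_ordering
class Position:
    """Minimal stand-in for the library's Position class: a binlog
    coordinate consisting of a server ID, a binlog file name, and a
    byte offset within that file."""
    def __init__(self, server_id, file, pos):
        self.server_id = server_id
        self.file = file
        self.pos = int(pos)

    def _key(self):
        # Binlog files sort correctly as strings because of their
        # zero-padded numeric suffix; within a file, by byte offset.
        return (self.file, self.pos)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    def __repr__(self):
        return "Position(%r, %r, %d)" % (
            self.server_id, self.file, self.pos)

a = Position(1, "master-bin.000042", 456552)
b = Position(1, "master-bin.000042", 546632)
print(a < b)   # True
```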
Example 3-6. Function to clone either the master or the slave

    def clone(slave, source, use_master = None):
        from subprocess import call
        backup_file = open(source.host + "-backup.sql", "w+")
        if use_master is not None:
            source.sql("STOP SLAVE")
        lock_database(source)
        if use_master is None:
            position = fetch_master_pos(source)
        else:
            position = fetch_slave_pos(source)
        call(["mysqldump", "--all-databases",
              "--host=%s" % source.host], stdout=backup_file)
        if use_master is not None:
            start_slave(source)
        backup_file.seek(0)      # Rewind to beginning
        call(["mysql", "--host=%s" % slave.host], stdin=backup_file)
        if use_master is None:
            replicate_from(slave, source, position)
        else:
            replicate_from(slave, use_master, position)
        start_slave(slave)

Performing Common Tasks with Replication

Each of the common use cases for replication—scale-out, hot standbys, and so forth—involves its own implementation details and possible pitfalls. We'll show you how to perform some of these tasks and how to enhance the Python library to support them.

Passwords are omitted from the examples in this section. When configuring the accounts to control the servers, you can either allow access only from certain hosts that control the deployment (by creating accounts such as mats@'192.168.2.136'), or you can supply passwords to the commands.

Reporting

Most businesses need a lot of routine reports: weekly reports on the items sold, monthly reports on expenses and revenues, and various kinds of heavy data mining to spot trends or identify focus groups for the marketing department.

Running these queries on the master can prove to be troublesome. Data-mining queries can require a lot of computing resources and can slow down normal operations, only to find out that, say, a focus group for left-handed scissors might not be worthwhile to conduct.
In addition, these reports are typically not very urgent (compared to processing normal transactions), so there is no need to create them as quickly as possible. In other words, because these reports are not time-critical, it does not matter much if they take two hours to complete instead of one.

A better idea is to dust off a spare server (or two, if you have enough reporting requirements) and set it up to replicate from the master. When you need to do the reporting, you can stop replication, run your reporting applications, then start replication again, all without disturbing the master.

Reporting often needs to cover a precise interval, such as a summary of all sales for the day, so it is necessary to stop replication at the right moment so you don't get any sales for the following day in the report. Because there is no way to stop the slave when it sees an event with a certain date or time, it has to be done some other way.

Let's pretend that reports are needed once each day, and that all transactions from midnight to midnight shall be included. It is necessary to stop the reporting slave at midnight so that no events from after midnight are executed on the slave and all events from before midnight are executed on the slave. The intention is not to do this manually, so let's consider how we can automate the procedure. The following steps will accomplish what we want:

1. Just before midnight, perhaps five minutes before midnight, stop the reporting slave so that no events come from the master.

2. After midnight, check the binary log on the master and find the last event that was recorded before midnight. Obviously, if you do this before midnight, you might not have seen all events for the day yet.

3. Record the binlog position of this event and start the slave to run until this position.

4. Wait until the slave has reached this position and stopped.

The first issue is how to schedule the jobs correctly.
There are different ways to do this, depending on the operating system. Although we won't go into all the details here, you can see how to schedule tasks for Unix-like operating systems, such as Linux, in "Scheduling tasks on Unix" on page 48.

Stopping the slave is as simple as executing STOP SLAVE and noting the binlog position after the slave is stopped:

    slave> STOP SLAVE;
    Query OK, 0 rows affected (0.25 sec)

    slave> SHOW SLAVE STATUS\G
        ...
        Relay_Master_Log_File: capulet-bin.000004
        ...
          Exec_Master_Log_Pos: 2456
    1 row in set (0.00 sec)

The remaining three steps are executed before the actual reporting starts, usually as part of the script that does the actual reporting. Before outlining the script, let's consider how to perform each step.

To read the contents of the binary log, invoke a utility called mysqlbinlog. This will be introduced in detail later, but this utility is used in the second step. The mysqlbinlog utility has two handy options, --start-datetime and --stop-datetime, which you can use to read only a portion of the binary log. So to get all events from the time that you stopped the slave to just before midnight, use the following command:

    $ mysqlbinlog --force --read-from-remote-server --host=reporting.bigcorp.com \
    >   --start-datetime='2009-09-25 23:55:00' \
    >   --stop-datetime='2009-09-25 23:59:59' \
    >   binlog files

The timestamp stored in each event is the timestamp when the statement started executing, not the timestamp when it was written to the binary log. The --stop-datetime option will stop emitting events on the first timestamp after the date/time supplied, so it is possible that there is an event that started executing before the date/time but was written to the binary log after the date/time. Such an event is not included in the range given.

Because the master is writing to the binary logs at this time, it is necessary to supply the --force option. Otherwise, mysqlbinlog will refuse to read the open binary log.
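In an automated script, the --start-datetime and --stop-datetime values for "from when the slave stopped until just before midnight" can be computed rather than hard-coded. A small sketch (the helper name is ours):

```python
from datetime import datetime, date, time, timedelta

def report_window(for_day=None):
    """Return (--start-datetime, --stop-datetime) values covering the
    last five minutes of `for_day` (default: yesterday), matching the
    schedule where the slave is stopped at five minutes to midnight."""
    day = for_day or (date.today() - timedelta(days=1))
    start = datetime.combine(day, time(23, 55, 0))
    stop = datetime.combine(day, time(23, 59, 59))
    fmt = "%Y-%m-%d %H:%M:%S"
    return start.strftime(fmt), stop.strftime(fmt)

print(report_window(date(2009, 9, 25)))
# ('2009-09-25 23:55:00', '2009-09-25 23:59:59')
```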
To execute this command, it is necessary to supply a set of binlog files to read. Because the names of these files depend on configuration options, they have to be fetched from the server. After that, it is necessary to figure out the range of binlog files that needs to be supplied to the mysqlbinlog command. Getting the list of binlog filenames is easy to do with the SHOW BINARY LOGS command:

    master> SHOW BINARY LOGS;
    +--------------------+-----------+
    | Log_name           | File_size |
    +--------------------+-----------+
    | capulet-bin.000001 |     24316 |
    | capulet-bin.000002 |      1565 |
    | capulet-bin.000003 |       125 |
    | capulet-bin.000004 |      2749 |
    +--------------------+-----------+
    4 rows in set (0.00 sec)

In this case, there are only four files, but there could potentially be quite a lot more. Scanning a large list of files that were written before the slave was stopped is just a waste of time, so it is a good idea to reduce the number of files to read in order to find the correct position to stop at. Because you recorded the binlog position in the first step, when the slave was stopped, it is an easy matter to find the name of the file where the slave stopped, and then take that name and all the following names as input to the mysqlbinlog utility. Typically, this will only be one file (or two, in the event that the binary log was rotated between stopping the slave and starting the reporting).
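Trimming the file list down to "the file where the slave stopped, plus everything after it" is a one-liner with itertools.dropwhile, the same trick that Example 3-7 uses later (the wrapper function is our own):

```python
from itertools import dropwhile

def files_from(stop_file, all_files):
    """Keep the binlog file where the slave stopped and every file
    after it, discarding the files written earlier."""
    return list(dropwhile(lambda f: f != stop_file, all_files))

logs = ["capulet-bin.000001", "capulet-bin.000002",
        "capulet-bin.000003", "capulet-bin.000004"]
print(files_from("capulet-bin.000003", logs))
# ['capulet-bin.000003', 'capulet-bin.000004']
```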
When executing the mysqlbinlog command with just a few binlog files, you will get a textual output for each event with some information about it:

    $ mysqlbinlog --force --read-from-remote-server --host=reporting.bigcorp.com \
    >   --start-datetime='2009-09-25 23:55:00' \
    >   --stop-datetime='2009-09-25 23:59:59' \
    >   capulet-bin.000004
    /*!40019 SET @@session.max_insert_delayed_threads=0*/;
    /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
    DELIMITER /*!*/;
    # at 4
    #090909 22:16:25 server id 1  end_log_pos 106  Start: binlog v 4, server v...
    ROLLBACK/*!*/;
    .
    .
    .
    # at 2495
    #090929 23:58:36 server id 1  end_log_pos 2650  Query  thread_id=27  exe...
    SET TIMESTAMP=1254213690/*!*/;
    SET /*!*/;
    INSERT INTO message_board(user, message)
      VALUES ('mats@sun.com', 'Midnight, and I'm bored')
    /*!*/;

The interesting part here is the end_log_pos of the last event in the sequence (in this case, 2650), because this is where the next event after midnight will be written.

If you were paying attention to the output from the previous command, you saw that there is no information about which binlog file this byte position is referring to, and it is necessary to have a file to find the event. If a single file is supplied to the mysqlbinlog command, the filename is obvious, but if two files are supplied, it is necessary to figure out if the last event for the day is in the first or the second file.

If you look at the line containing the end_log_pos, you will also see that the event type is there. Because every binlog file starts with a format description event—a line for such an event appears in the previous output—you can check these events to determine the location of the event you want. If there are two format description events in the output, the event is in the second file, and if there is just one, it is in the first file.
The final step before starting the reporting work is to start replication and stop it at exactly the position where the event after midnight will be written (or has already been written, should that be the case). To do this, you can use the lesser-known syntax START SLAVE UNTIL. This command accepts a master logfile and a master log position where the slave should stop, and then starts the slave. When the slave reaches the given position, it will automatically stop:

    report> START SLAVE UNTIL
         ->     MASTER_LOG_FILE='capulet-bin.000004',
         ->     MASTER_LOG_POS=2650;
    Query OK, 0 rows affected (0.18 sec)

Like the STOP SLAVE command (without the UNTIL), the START SLAVE UNTIL command will return immediately—not, as could be expected, when the slave has reached the position where it should stop. So commands issued after START SLAVE UNTIL continue to be executed as long as the slave is running. To wait for the slave to reach the position you want it to stop at, use the MASTER_POS_WAIT function, which will block while waiting for the slave to reach the given position:

    report> SELECT MASTER_POS_WAIT('capulet-bin.000004', 2650);
    1 row in set (231.15 sec)

At this point, the slave has stopped at the last event for the day, and the reporting process can start analyzing the data and generating reports.

Handling reporting in Python

Automating this in Python is quite straightforward; Example 3-7 shows the code for stopping reporting at the right time. The fetch_remote_binlog function reads a binary log from a remote server using the mysqlbinlog command. The contents of the file(s) will be returned as an iterator over the lines of the file. To optimize the fetches, you can optionally provide a list of files to scan. You can also pass a start date/time and a stop date/time to limit the date/time range of the result. These will be passed to the mysqlbinlog program.
The find_datetime_position function does the work of scanning the binlog lines to find the last end_log_pos as well as keeping track of how many format description ("Start") events have been observed. It also contacts the reporting server to find out where it stopped reading the binlog file, and then contacts the master to get the list of binlog files and find the right one to start the scan from.

Example 3-7. Python code for running replication to a datetime

    def fetch_remote_binlog(server, binlog_files=None,
                            start_datetime=None, stop_datetime=None):
        from subprocess import Popen, PIPE
        if not binlog_files:
            binlog_files = [
                row["Log_name"] for row in server.sql("SHOW BINARY LOGS")]
        command = ["mysqlbinlog",
                   "--read-from-remote-server",
                   "--force",
                   "--host=%s" % (server.host),
                   "--user=%s" % (server.sql_user.name)]
        if server.sql_user.passwd:
            command.append("--password=%s" % (server.sql_user.passwd))
        if start_datetime:
            command.append("--start-datetime=%s" % (start_datetime))
        if stop_datetime:
            command.append("--stop-datetime=%s" % (stop_datetime))
        return iter(Popen(command + binlog_files, stdout=PIPE).stdout)

    def find_datetime_position(master, report,
                               start_datetime, stop_datetime):
        from itertools import dropwhile
        from mysql.replicant import Position
        import re
        all_files = [row["Log_name"]
                     for row in master.sql("SHOW BINARY LOGS")]
        stop_file = report.sql("SHOW SLAVE STATUS")["Relay_Master_Log_File"]
        files = list(dropwhile(lambda file: file != stop_file, all_files))
        lines = fetch_remote_binlog(master, binlog_files=files,
                                    start_datetime=start_datetime,
                                    stop_datetime=stop_datetime)
        binlog_files = 0
        last_epos = None
        for line in lines:
            m = re.match(r"#\d{6}\s+\d?\d:\d\d:\d\d\s+"
                         r"server id\s+(?P<sid>\d+)\s+"
                         r"end_log_pos\s+(?P<epos>\d+)\s+"
                         r"(?P<type>\w+)", line)
            if m:
                if m.group("type") == "Start":
                    binlog_files += 1
                if m.group("type") == "Query":
                    last_epos = m.group("epos")
        return Position(master.server_id,
                        files[binlog_files - 1], last_epos)

You can now use these functions to synchronize the
reporting server before the actual reporting job:

    master.connect()
    report.connect()
    pos = find_datetime_position(master, report,
                                 start_datetime="2009-09-14 23:55:00",
                                 stop_datetime="2009-09-14 23:59:59")
    report.sql("START SLAVE UNTIL MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
               (pos.file, pos.pos))
    report.sql("DO MASTER_POS_WAIT(%s,%s)", (pos.file, pos.pos))
    .
    .  code for reporting
    .

As you can see, working with replication is pretty straightforward. This particular example introduces several of the critical concepts that we will be using later when talking about scale-out: how to start and stop the slave at the right time, how to get information about binlog positions or figure it out using the standard tools, and how to integrate it all into an automated solution for your particular needs.

Scheduling tasks on Unix

The easiest way to ensure the slave is stopped just before midnight and the reporting is started after midnight is to set up a job for cron(8) that sends a stop slave command to the slave and starts the reporting script. For example, the following crontab(5) entries would ensure that the slave is stopped before midnight, and that the reporting script to roll the slave forward is executed, say, five minutes after midnight. Here we assume that the stop_slave script will stop the slave, and daily_report will run the daily report (starting with the synchronization described earlier):

    # stop reporting slave five minutes before midnight, every day
    55 23 * * * $HOME/mysql_control/stop_slave
    # Run reporting script five minutes after midnight, every day
    5  0 * * * $HOME/mysql_control/daily_report

Assuming that you put this in a crontab file, reporttab, you can install the crontab file using the crontab reporttab command.

Scheduling tasks on Windows

To start the Task Scheduler in Windows, open the Run dialog (Windows key+R) and enter taskschd.msc.
Depending on your security settings and version of Windows, you may need to respond to the User Account Control (UAC) dialog box to continue.

To create a new task triggered by time, choose Create Basic Task from the Action pane. This opens the Create Basic Task Wizard, which will guide you through the steps to create a simple task. On the first pane of the wizard, name the task and provide an optional description, then click Next.

The second pane allows you to specify the frequency of the firing of the task. There are many options here for controlling when the task runs: a single run, daily, weekly, and even when you log on or when a specific event occurs. Click Next once you've made your choice.

Depending on the frequency you chose, the third pane will allow you to specify the details (e.g., date and time) of when the task fires. Click Next once you have configured the trigger timing options.

The fourth pane is where you specify the task or action to occur when the task event occurs (when the task fires). You can choose to start a program, send an email message, or display a message to the user. Make your selection and click Next to move to the next pane.

Depending on the action you chose on the previous pane, here you can specify what happens when the task fires. For example, if you chose to run an application, you enter the name of the application or script, any arguments, and which folder the task starts in. Once you have entered all of this information, click Next to review the task on the final pane.

If you're satisfied all is set correctly, click Finish to schedule the task. You can click Back to return to any of the previous screens and make changes. Finally, you have the option to open the Properties page after you click Finish if you want to make additional changes to the task.
Conclusion

In this chapter, we have presented an introduction to MySQL replication, including a look at why replication is used and how to set it up. We also took a quick look into the binary log. In the next chapter, we examine the binary log in greater detail.

Joel finished giving Mr. Summerson his report on how he was going to balance the load across four new slaves, along with plans for how the topology could be expanded to handle future needs.

"That's fine work, Joel. Now explain to me again what this slave thing is."

Joel suppressed a sigh and said, "A slave is a copy of the data on the database server that gets its changes from the original database server called the master..."

CHAPTER 4
The Binary Log

"Joel?"

Joel jumped, nearly banging his head as he crawled out from under his desk. "I was just rerouting a few cables," he said by way of an explanation.

Mr. Summerson merely nodded and said in a very authoritative manner, "I need you to look into a problem the marketing people are having with the new server. They need to roll back the data to a certain point."

"Well, that depends..." Joel started, worried about whether he had snapshots of old states of the system.

"I told them you'd be right down." With that, Mr. Summerson turned and walked away.

A moment later a woman stopped in front of his door and said, "He's always like that. Don't take it personally. Most of us call it a drive-by tasking." She laughed and introduced herself. "My name's Amy. I'm one of the developers here."

Joel walked around his desk and met her at the door. "I'm Joel." After a moment of awkward silence Joel said, "I, er, better get on that thing."

Amy smiled and said, "See you around."

"Just focus on what you have to do to succeed," Joel thought as he returned to his desk to search for that MySQL book he bought last week.

The previous chapter included a very brief introduction to the binary log.
In this chapter, we will fill in more details and give a more thorough description of the binary log structure, the replication event format, and how to use the mysqlbinlog tool to investigate and work with the contents of binary logs.

The binary log records changes made to the database. It is usually used for replication, and the binary log then allows the same changes to be made on the slaves as well. Because the binary log normally keeps a record of all changes, you can also use it for auditing purposes to see what happened in the database, and for point-in-time recovery (PITR) by playing back the binary log to a server, repeating changes that were recorded in the binary log. (This is what we did in "Reporting" on page 43, where we played back all changes done between 23:55:00 and 23:59:59.)

The binary log contains information that could change the database. Note that statements that could potentially change the database are also logged, even if they don't actually change the database. The most notable cases are those statements that optionally make a change, such as DROP TABLE IF EXISTS or CREATE TABLE IF NOT EXISTS, along with statements such as DELETE and UPDATE that have WHERE conditions that don't happen to match any rows on the master.

SELECT statements are not normally logged because they do not make any changes to any database. There are, however, exceptions.

The binary log records each transaction in the order that the commit took place on the master. Although transactions may be interleaved on the master, each appears as an uninterrupted sequence in the binary log, the order determined by the time of the commit.

Structure of the Binary Log

Conceptually, the binary log is a sequence of binary log events (also called binlog events or even just events when there is no risk of confusion). As you saw in Chapter 3, the binary log actually consists of several files, as shown in Figure 4-1, that together form the binary log.

Figure 4-1.
The structure of the binary log

The actual events are stored in a series of files called binlog files with names in the form host-bin.000001, accompanied by a binlog index file that is usually named host-bin.index and keeps track of the existing binlog files. The binlog file that is currently being written to by the server is called the active binlog file. If no slaves are lagging, this is also the file that is being read by the slaves. The names of the binlog files and the binlog index file can be controlled using the log-bin and log-bin-index options, which you are familiar with from "Configuring the Master" on page 25. The options are covered in more detail later in this chapter.

The index file keeps track of all the binlog files used by the server so that the server can correctly create new binlog files when necessary, even after server restarts. Each line in the index file contains the name of a binlog file that is part of the binary log. Depending on the MySQL version, it can either be the full name or a name relative to the data directory. Commands that affect the binlog files, such as PURGE BINARY LOGS, RESET MASTER, and FLUSH LOGS, also affect the index file by adding or removing lines to match the files that were added or removed by the command.

As shown in Figure 4-1, each binlog file is made up of binlog events, with the Format_description event serving as the file's header and the Rotate event as its footer. Note that a binlog file might not end with a Rotate event if the server was interrupted or crashed.

The Format_description event contains information about the server that wrote the binlog file as well as some critical information about the file's status. If the server is stopped and restarted, it creates a new binlog file and writes a new Format_description event to it.
This is necessary because changes can potentially occur between bringing a server down and bringing it up again. For example, the server could be upgraded, in which case a new Format_description event would have to be written.

When the server has finished writing a binlog file, a Rotate event is added to end the file. The event points to the next binlog file in sequence by giving the name of the file as well as the position to start reading from. The Format_description event and the Rotate event will be described in detail in the next section.

With the exception of control events (e.g., Format_description, Rotate, and Incident), events of a binlog file are grouped into units called groups, as seen in Figure 4-2. In transactional storage engines, each group is roughly equivalent to a transaction, but for nontransactional storage engines or statements that cannot be part of a transaction, such as CREATE or ALTER, each statement is a group by itself. (In some special cases, covered in "How nontransactional statements are logged" on page 88, nontransactional statements can be part of a group.) In short, each group of events in the binlog file contains either a single statement not in a transaction or a transaction consisting of several statements.

Figure 4-2. A single binlog file with groups of events

Each group is executed entirely or not at all (with the exception of a few well-defined cases). If, for some reason, the slave stops in the middle of a group, replication will start from the beginning of the group and not from the last statement executed. Chapter 8 describes in detail how the slave executes events.

Binlog Event Structure

MySQL 5.0 introduced a new binlog format: binlog format 4. The preceding formats were not easy to extend with additional fields if the need were to arise, so binlog format 4 was designed specifically to be extensible. This is still the event format used in every server version since 5.0, even though each version of the server has extended the binlog format with new events and some events with new fields.
Binlog format 4 is the event format described in this chapter.

Each binlog event consists of four parts:

Common header
    The common header is, as the name suggests, common to all events in the binlog file. The common header contains basic information about the event, the most important fields being the event type and the size of the event.

Post header
    The post header is specific to each event type; in other words, each event type stores different information in this field. But the size of this header, just as with the common header, is the same throughout a given binlog file. The size of each event type is given by the Format_description event.

Event body
    After the headers comes the event body, which is the variable-sized part of the event. The size and the end position are listed in the common header for the event. The event body stores the main data of the event, which is different for different event types. For the Query event, for instance, the body stores the query, and for the User_var event, the body stores the name and value of a user variable that was just set by a statement.

Checksum
    Starting with MySQL 5.6, there is a checksum at the end of the event, if the server is configured to generate one. The checksum is a 32-bit integer that is used to check that the event has not been corrupted since it was written.

A complete listing of the formats of all events is beyond the scope of this book, but because the Format_description and Rotate events are critical to how the other events are interpreted, we will briefly cover them here. If you are interested in the details of the events, you can find them in the MySQL Internals Manual.

As already noted, the Format_description event starts every binlog file and contains common information about the events in the file. The result is that the Format_description event can be different between different files; this typically occurs when a server is upgraded and restarted.
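Because the common header has a fixed layout, decoding it by hand is straightforward. The following sketch (in Python, which the book itself does not use) assumes the 19-byte version 4 common header as documented in the MySQL Internals Manual: little-endian timestamp, event type, server ID, event size, end-of-event position, and flags.

```python
import struct

# Binlog format 4 common header: 19 bytes, little-endian integers.
# Layout assumed from the MySQL Internals Manual:
#   timestamp (4), type_code (1), server_id (4),
#   event_size (4), end_log_pos (4), flags (2)
COMMON_HEADER = struct.Struct("<IBIIIH")

def parse_common_header(event: bytes) -> dict:
    """Decode the fixed common header at the start of a binlog event."""
    timestamp, type_code, server_id, event_size, end_log_pos, flags = \
        COMMON_HEADER.unpack_from(event)
    return {
        "timestamp": timestamp,      # when the statement started executing
        "type_code": type_code,      # numeric event type
        "server_id": server_id,      # ID of the server that wrote the event
        "event_size": event_size,    # size of the whole event, header included
        "end_log_pos": end_log_pos,  # position of the next event in the file
        "flags": flags,
    }
```

The post header and event body follow these 19 bytes; their sizes come from the Format_description event, whose fields are described next.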
The Format_description_log_event contains the following fields:

Binlog file format version
    This is the version of the binlog file, which should not be confused with the version of the server. MySQL versions 3.23, 4.0, and 4.1 use version 3 of the binary log, while MySQL versions 5.0 and later use version 4 of the binary log. The binlog file format version changes when developers make significant changes in the overall structure of the file or the events. In MySQL version 5.0, the start event for a binlog file was changed to use a different format and the common headers for all events were also changed, which prompted the change in the binlog file format version.

Server version
    This is a version string denoting the server that created the file. This includes the version of the server as well as additional information if special builds are made. The format is normally the three-position version number, followed by a hyphen and any additional build options. For example, "5.5.10-debug-log" means debug build version 5.5.10 of the server.

Common header length
    This field stores the length of the common header. Because it's here in the Format_description, this length can be different for different binlog files. This holds for all events except the Format_description and Rotate events, which cannot vary.

Post-header lengths
    The post-header length for each event is fixed within a binlog file, and this field stores an array of the post-header length for each event that can occur in the binlog file. Because the number of event types can vary between servers, the number of different event types that the server can produce is stored before this field.

The Rotate and Format_description log events have a fixed length because the server needs them before it knows the size of the common header length. When a slave connects, the master first sends a Format_description event.
Because the length of the common header is stored in the Format_description event, there is no way for the server to know what the size of the common header is for the Rotate event unless it has a fixed size. So for these two events, the size of the common header is fixed and will never change between server versions, even if the size of the common header changes for other events.

Because both the size of the common header and the size of the post header for each event type are given in the Format_description event, extending the format with new events or even increasing the size of the post headers by adding new fields is supported by this format and will therefore not require a change in the binlog file format. With each extension, particular care is taken to ensure that the extension does not affect interpretation of events that were already in earlier versions. For example, the common header can be extended with an additional field to indicate that the event is compressed and the type of compression used, but if this field is missing (which would be the case if a slave is reading events from an old master), the server should still be able to fall back on its default behavior.

Event Checksums

Because hardware can fail and software can contain bugs, it is necessary to have some way to ensure that data corrupted by such events is not applied on the slave. Random failures can occur anywhere, and if they occur inside a statement, they often lead to a syntax error causing the slave to stop. However, relying on this to prevent corrupt events from being replicated is a poor way to ensure integrity of events in the binary log. This policy would not catch many types of corruptions, such as in timestamps, nor would it work for row-based events where the data is encoded in binary form and random corruptions are more likely to lead to incorrect data.
To ensure the integrity of each event, MySQL 5.6 introduced replication event checksums. When events are written, a checksum is added, and when the events are read, the checksum is computed for the event and compared against the checksum written with the event. If the checksums do not match, execution can be aborted before any attempt is made to apply the event on the slave. The computation of checksums can potentially impact performance, but benchmarking has demonstrated no noticeable performance degradation from the addition and checking of checksums, so they are enabled by default in MySQL 5.6. They can, however, be turned off if necessary.

In MySQL 5.6, checksums can be generated when changes are written either to the binary log or to the relay log, and verified when reading events back from one of these logs. Replication event checksums are controlled using three options:

binlog-checksum=type
    This option enables checksums and tells the server what checksum computation to use. Currently there are only two choices: CRC32 uses ISO-3309 CRC-32 checksums, whereas NONE turns off checksumming. The default is CRC32, meaning that checksums are generated.

master-verify-checksum=boolean
    This option controls whether the master verifies the checksum when reading it from the binary log. This means that the event checksum is verified when it is read from the binary log by the dump thread (see "Replication Architecture Basics" on page 228), but before it is sent out, and also when using SHOW BINLOG EVENTS. If any of the events shown is corrupt, the command will throw an error. This option is off by default.

slave-sql-verify-checksum=boolean
    This option controls whether the slave verifies the event checksum after reading it from the relay log and before applying it to the slave database. This option is off by default.

If you get a corrupt binary log or relay log, mysqlbinlog can be used to find the bad checksum using the --verify-binlog-checksum option.
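With binlog-checksum=CRC32, each event carries a trailing 32-bit ISO-3309 CRC computed over all preceding bytes of the event. As a rough sketch of what verification amounts to (illustrative Python, not the server's actual code):

```python
import zlib

def append_checksum(payload: bytes) -> bytes:
    """Build an event with a valid trailing checksum, as the writer would:
    CRC-32 over the event bytes, stored little-endian in the last 4 bytes."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "little")

def verify_event_checksum(event: bytes) -> bool:
    """Recompute the CRC-32 over the event minus its last four bytes and
    compare it with the stored checksum."""
    payload, stored = event[:-4], int.from_bytes(event[-4:], "little")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored
```

Flipping any byte makes verification fail, which is what allows execution to stop before a corrupt event is applied on the slave.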
This option causes mysqlbinlog to verify the checksum of each event read and stop when a corrupt event is found, which the following example demonstrates:

$ client/mysqlbinlog --verify-binlog-checksum master-bin.000001
.
.
.
# at 261
#110406  8:35:28 server id 1  end_log_pos 333 CRC32 0xed927ef2...
SET TIMESTAMP=1302071728/*!*/;
BEGIN
/*!*/;
# at 333
#110406  8:35:28 server id 1  end_log_pos 365 CRC32 0x01ed254d  Intvar
SET INSERT_ID=1/*!*/;
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed!...
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Logging Statements

Starting with MySQL 5.1, row-based replication is also available, which will be covered in "Row-Based Replication" on page 97. In statement-based replication, the actual executed statement is written to the binary log together with some execution information, and the statement is re-executed on the slave. Because not all events can be logged as statements, there are some exceptions that you should be aware of. This section will describe the process of logging individual statements as well as the important caveats.

Because the binary log is a common resource (all threads write to it), it is critical to prevent two threads from updating the binary log at the same time. To handle this, a lock for the binary log, called the LOCK_log mutex, is acquired just before each group is written to the binary log and released just after the group has been written. Because all session threads for the server can potentially log transactions to the binary log, it is quite common for several session threads to block on this lock.

Logging Data Manipulation Language Statements

Data manipulation language (DML) statements are usually DELETE, INSERT, and UPDATE statements.
To support logging changes in a consistent manner, MySQL writes the binary log while transaction-level locks are held, and releases them after the binary log has been written. To ensure that the binary log is updated consistently with the tables that the statement modifies, each statement is logged to the binary log during statement commit, just before the table locks are released. If the logging were not made as part of the statement, another statement could be "injected" between the changes that the statement introduces to the database and the logging of the statement to the binary log. This would mean that the statements would be logged in a different order than the order in which they took effect in the database, which could lead to inconsistencies between master and slave. For instance, an UPDATE statement with a WHERE clause could update different rows on the slave because the values in those rows might be different if the statement order changed.

Logging Data Definition Language Statements

Data definition language (DDL) statements affect a schema, such as CREATE TABLE and ALTER TABLE statements. These create or change objects in the filesystem (for example, table definitions are stored in .frm files and databases are represented as filesystem directories), so the server keeps information about these available in internal data structures. To protect the update of the internal data structure, it is necessary to acquire an internal lock (called LOCK_open) before altering the table definition.

Because a single lock is used to protect these data structures, the creation, alteration, and destruction of database objects can be a considerable source of performance problems. This includes the creation and destruction of temporary tables, which is quite common as a technique to create an intermediate result set to perform computations on.
If you are creating and destroying a lot of temporary tables, it is often possible to boost performance by reducing the creation (and subsequent destruction) of temporary tables.

Logging Queries

For statement-based replication, the most common binlog event is the Query event, which is used to write a statement executed on the master to the binary log. In addition to the actual statement executed, the event contains some additional information necessary to execute the statement.

Recall that the binary log can be used for many purposes and contains statements in a potentially different order from that in which they were executed on the master. In some cases, part of the binary log may be played back to a server to perform PITR, and in some cases, replication may start in the middle of a sequence of events because a backup has been restored on a slave before starting replication. In all these cases, the events are executing in different contexts (i.e., there is information that is implicit when the server executes the statement but that has to be known to execute the statement correctly). Examples include:

Current database
    If the statement refers to a table, function, or procedure without qualifying it with the database, the current database is implicit for the statement.

Value of user-defined variable
    If a statement refers to a user-defined variable, the value of the variable is implicit for the statement.

Seed for the RAND function
    The RAND function is based on a pseudorandom number function, meaning that it can generate a sequence of numbers that are reproducible but appear random in the sense that they are evenly distributed. The function is not really random, but starts from a seed number and applies a pseudorandom function to generate a deterministic sequence of numbers. This means that given the same seed, the RAND function will always return the same number. However, this makes the seed implicit for the statement.
The current time
    Obviously, the time the statement started executing is implicit. Having a correct time is important when calling functions that are dependent on the current time, such as NOW and UNIX_TIMESTAMP, because otherwise they will return different results if there is a delay between the statement execution on the master and on the slave.

Value used when inserting into an AUTO_INCREMENT column
    If a statement inserts a row into a table with a column defined with the AUTO_INCREMENT attribute, the value used for that row is implicit for the statement because it depends on the rows inserted before it.

Value returned by a call to LAST_INSERT_ID
    If the LAST_INSERT_ID function is used in a statement, it depends on the value inserted by a previous statement, which makes this value implicit for the statement.

Thread ID
    For some statements, the thread ID is implicit. For example, if the statement refers to a temporary table or uses the CONNECTION_ID function, the thread ID is implicit for the statement.

Because the context for executing the statements cannot be known when they're replayed (either on a slave or on the master after a crash and restart), it is necessary to make the implicit information explicit by adding it to the binary log. This is done in slightly different ways for different kinds of information.

In addition to the previous list, some information is implicit to the execution of triggers and stored routines, but we will cover that separately in "Triggers, Events, and Stored Routines" on page 70.

Let's consider each of the cases of implicit information individually, demonstrate the problem with each one, and examine how the server handles it.

Current database

The log records the current database by adding it to a special field of the Query event.
This field also exists for the events used to handle the LOAD DATA INFILE statement, discussed in "LOAD DATA INFILE Statements" on page 65, so the description here applies to that statement as well. The current database also plays an important role in filtering on the database and is described later in this chapter.

Current time

Five functions use the current time to compute their values: NOW, CURDATE, CURTIME, UNIX_TIMESTAMP, and SYSDATE. The first four functions return a value based on the time when the statement started to execute. In contrast, SYSDATE returns the value when the function is executed. The difference can best be demonstrated by comparing the execution of NOW and SYSDATE with an intermediate sleep:

mysql> SELECT SYSDATE(), NOW(), SLEEP(2), SYSDATE(), NOW()\G
*************************** 1. row ***************************
SYSDATE(): 2013-06-08 23:24:08
    NOW(): 2013-06-08 23:24:08
 SLEEP(2): 0
SYSDATE(): 2013-06-08 23:24:10
    NOW(): 2013-06-08 23:24:08
1 row in set (2.00 sec)

Both functions are evaluated when they are encountered, but NOW returns the time that the statement started executing, whereas SYSDATE returns the time when the function was executed. To handle these time functions correctly, the timestamp indicating when the event started executing is stored in the event. This value is then copied from the event to the slave execution thread and used as if it were the time the event started executing when computing the value of the time functions.

Because SYSDATE gets the time from the operating system directly, it is not safe for statement-based replication and will return different values on the master and slave when executed. So unless you really want to have the actual time inserted into your tables, it is prudent to stay away from this function.
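The handling just described can be summarized in a toy model (hypothetical Python, not server code): on replay, the timestamp recorded in the event stands in for the statement-start time, while SYSDATE bypasses it and asks the operating system.

```python
import time

def replay_now(event_timestamp: int) -> int:
    # NOW, CURDATE, CURTIME, and UNIX_TIMESTAMP derive their values from the
    # timestamp recorded in the event, so replay is deterministic.
    return event_timestamp

def replay_sysdate() -> int:
    # SYSDATE reads the operating system clock at execution time, so the
    # slave generally gets a different value than the master did.
    return int(time.time())
```

This is why SYSDATE is unsafe for statement-based replication while the other four functions are not.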
Context events

Some implicit information is associated with statements that meet certain conditions:

• If the statement contains a reference to a user-defined variable (as in Example 4-1), it is necessary to add the value of the user-defined variable to the binary log.
• If the statement contains a call to the RAND function, it is necessary to add the pseudorandom seed to the binary log.
• If the statement contains a call to the LAST_INSERT_ID function, it is necessary to add the last inserted ID to the binary log.
• If the statement performs an insert into a table with an AUTO_INCREMENT column, it is necessary to add the value that was used for the column (or columns) to the binary log.

Example 4-1. Statements with user-defined variables

SET @value = 45;
INSERT INTO t1 VALUES (@value);

In each of these cases, one or more context events are added to the binary log before the event containing the query is written. Because there can be several context events preceding a Query event, the binary log can handle multiple user-defined variables together with the RAND function, or (almost) any combination of the previously listed conditions. The binary log stores the necessary context information through the following events:

User_var
    Each such event records the name and value of a single user-defined variable.

Rand
    Records the random number seed used by the RAND function. The seed is fetched internally from the session's state.

Intvar
    If the statement is inserting into an autoincrement column, this event records the value of the internal autoincrement counter for the table before the statement starts. If the statement contains a call to LAST_INSERT_ID, this event records the value that this function returned in the statement.

Example 4-2 shows some statements that generate all of the context events and how the events appear when displayed using SHOW BINLOG EVENTS. Note that there can be several context events before each statement.

Example 4-2.
Query events with context events

master> SET @foo = 12;
Query OK, 0 rows affected (0.00 sec)

master> SET @bar = 'Smoothnoodlemaps';
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO t1(b,c) VALUES
     ->   (@foo,@bar), (RAND(), 'random');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

master> INSERT INTO t1(b) VALUES (LAST_INSERT_ID());
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 238\G
*************************** 1. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 238
 Event_type: Query
  Server_id: 1
End_log_pos: 306
       Info: BEGIN
*************************** 2. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 306
 Event_type: Intvar
  Server_id: 1
End_log_pos: 334
       Info: INSERT_ID=1
*************************** 3. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 334
 Event_type: RAND
  Server_id: 1
End_log_pos: 369
       Info: rand_seed1=952494611,rand_seed2=949641547
*************************** 4. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 369
 Event_type: User var
  Server_id: 1
End_log_pos: 413
       Info: @`foo`=12
*************************** 5. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 413
 Event_type: User var
  Server_id: 1
End_log_pos: 465
       Info: @`bar`=_utf8 0x536D6F6F74686E6F6F6...
*************************** 6. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 465
 Event_type: Query
  Server_id: 1
End_log_pos: 586
       Info: use `test`; INSERT INTO t1(b,c) VALUES (@foo,@bar)...
*************************** 7. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 586
 Event_type: Xid
  Server_id: 1
End_log_pos: 613
       Info: COMMIT /* xid=44 */
*************************** 8. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 613
 Event_type: Query
  Server_id: 1
End_log_pos: 681
       Info: BEGIN
*************************** 9. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 681
 Event_type: Intvar
  Server_id: 1
End_log_pos: 709
       Info: LAST_INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 709
 Event_type: Intvar
  Server_id: 1
End_log_pos: 737
       Info: INSERT_ID=3
*************************** 11. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 737
 Event_type: Query
  Server_id: 1
End_log_pos: 843
       Info: use `test`; INSERT INTO t1(b) VALUES (LAST_INSERT_ID())
*************************** 12. row ***************************
   Log_name: mysqld1-bin.000001
        Pos: 843
 Event_type: Xid
  Server_id: 1
End_log_pos: 870
       Info: COMMIT /* xid=45 */
12 rows in set (0.00 sec)

Thread ID

The last implicit piece of information that the binary log sometimes needs is the thread ID of the MySQL session handling the statement. The thread ID is necessary when a function is dependent on the thread ID (such as when it refers to CONNECTION_ID), but most importantly for handling temporary tables.

Temporary tables are specific to each thread, meaning that two temporary tables with the same name are allowed to coexist, provided they are defined in different sessions. Temporary tables can provide an effective means to improve the performance of certain operations, but they require special handling to work with the binary log.

Internally in the server, temporary tables are handled by creating obscure names for storing the table definitions. The names are based on the process ID of the server, the thread ID that creates the table, and a thread-specific counter to distinguish between different instances of the table from the same thread. This naming scheme allows tables from different threads to be distinguished from each other, but each statement can access its proper table only if the thread ID is stored in the binary log.
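To illustrate the naming idea (the exact format below is made up; the server's real internal names differ), one could generate per-session names like this:

```python
class TempTableNamer:
    """Illustrative only: combine the server process ID, the session's thread
    ID, and a per-thread counter so that same-named temporary tables from
    different sessions map to distinct internal names."""

    def __init__(self, server_pid: int):
        self.server_pid = server_pid
        self.counters = {}  # thread ID -> next instance number

    def internal_name(self, thread_id: int) -> str:
        n = self.counters.get(thread_id, 0)
        self.counters[thread_id] = n + 1
        # Hypothetical format: pid, thread ID, and per-thread counter.
        return "#sql_{0:x}_{1}_{2}".format(self.server_pid, thread_id, n)
```

Two sessions that each issue CREATE TEMPORARY TABLE t1 get distinct internal names, but replaying the statements requires knowing which thread each belonged to, which is exactly why the thread ID must be recorded in the binary log.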
Similar to how the current database is handled in the binary log, the thread ID is stored as a separate field in every Query event and can therefore be used to compute thread-specific data and handle temporary tables correctly.

When writing the Query event, the thread ID to store in the event is read from the pseudo_thread_id server variable. This means that it can be set before executing a statement, but only if you have SUPER privileges. This server variable is intended to be used by mysqlbinlog to emit statements correctly and should not normally be used.

For a statement that contains a call to the CONNECTION_ID function or that uses or creates a temporary table, the Query event is marked as thread-specific in the binary log. Because the thread ID is always present in the Query event, this flag is not necessary but is mainly used to allow mysqlbinlog to avoid printing unnecessary assignments to the pseudo_thread_id variable.

LOAD DATA INFILE Statements

The LOAD DATA INFILE statement makes it easy to fill tables quickly from a file. Unfortunately, it is dependent on a certain kind of context that cannot be covered by the context events we have discussed: files that need to be read from the filesystem. To handle LOAD DATA INFILE, the MySQL server uses a special set of events to handle the transfer of the file using the binary log. In addition to solving the problem for LOAD DATA INFILE, this makes the statement a very convenient tool for transferring large amounts of data from the master to the slave, as you will see soon.

To correctly transfer and execute a LOAD DATA INFILE statement, several new events are introduced into the binary log:

Begin_load_query
    This event signals the start of data transfer in the file.

Append_block
    A sequence of one or more of these events follows the Begin_load_query event to contain the rest of the file's data, if the file was larger than the maximum allowed packet size on the connection.
Execute_load_query
    This event is a specialized variant of the Query event that contains the LOAD DATA INFILE statement executed on the master. Even though the statement contained in this event contains the name of the file that was used on the master, this file will not be sought by the slave. Instead, the contents provided by the preceding Begin_load_query and Append_block events will be used.

For each LOAD DATA INFILE statement executed on the master, the file to read is mapped to an internal file-backed buffer, which is used in the following processing. In addition, a unique file ID is assigned to the execution of the statement and is used to refer to the file read by the statement.

While the statement is executing, the file contents are written to the binary log as a sequence of events starting with a Begin_load_query event, which indicates the beginning of a new file, followed by zero or more Append_block events. Each event written to the binary log is no larger than the maximum allowed packet size, as specified by the max-allowed-packet option.

After the entire file is read and applied to the table, the execution of the statement terminates by writing the Execute_load_query event to the binary log. This event contains the statement executed together with the file ID assigned to the execution of the statement. Note that the statement is not the original statement as the user wrote it, but rather a recreated version of the statement.

If you are reading an old binary log, you might instead find Load_log_event, Execute_log_event, and Create_file_log_event. These were the events used to replicate LOAD DATA INFILE prior to MySQL version 5.0.3 and were replaced by the implementation just described.

Example 4-3 shows the events written to the binary log by a successful execution of a LOAD DATA INFILE statement.
In the Info field, you can see the assigned file ID (1, in this case) and see that it is used for all the events that are part of the execution of the statement. You can also see that the file foo.dat used by the statement is larger than the maximum allowed packet size of 16,384 bytes, so it is split into three events.

Example 4-3. Successful execution of LOAD DATA INFILE

master> SHOW BINLOG EVENTS IN 'master-bin.000042' FROM 269\G
*************************** 1. row ***************************
   Log_name: master-bin.000042
        Pos: 269
 Event_type: Begin_load_query
  Server_id: 1
End_log_pos: 16676
       Info: ;file_id=1;block_len=16384
*************************** 2. row ***************************
   Log_name: master-bin.000042
        Pos: 16676
 Event_type: Append_block
  Server_id: 1
End_log_pos: 33083
       Info: ;file_id=1;block_len=16384
*************************** 3. row ***************************
   Log_name: master-bin.000042
        Pos: 33083
 Event_type: Append_block
  Server_id: 1
End_log_pos: 33633
       Info: ;file_id=1;block_len=527
*************************** 4. row ***************************
   Log_name: master-bin.000042
        Pos: 33633
 Event_type: Execute_load_query
  Server_id: 1
End_log_pos: 33756
       Info: use `test`; LOAD DATA INFILE 'foo.dat' INTO...;file_id=1
4 rows in set (0.00 sec)

Binary Log Filters

It is possible to filter out statements from the binary log using two options: binlog-do-db and binlog-ignore-db (which we will call binlog-*-db, collectively). The binlog-do-db option is used when you want to log only statements belonging to a certain database, and binlog-ignore-db is used when you want to ignore a certain database but replicate all other databases. These options can be given multiple times, so to filter out both the database one_db and the database two_db, you must give the option twice in the my.cnf file.
For example:

[mysqld]
binlog-ignore-db=one_db
binlog-ignore-db=two_db

The way MySQL filters events can be quite a surprise to unfamiliar users, so we'll explain how filtering works and make some recommendations on how to avoid some of the major headaches. Figure 4-3 shows how MySQL determines whether the statement is filtered. The filtering is done at the statement level (either the entire statement is filtered out or the entire statement is written to the binary log), and the binlog-*-db options use the current database to decide whether the statement should be filtered, not the database of the tables affected by the statement. To help you understand the behavior, consider the statements in Example 4-4. Each line uses bad as the current database and changes tables in different databases.

Figure 4-3. Logic for binlog-*-db filters

Example 4-4. Statements using different databases

USE bad; INSERT INTO t1 VALUES (1),(2);
USE bad; INSERT INTO good.t2 VALUES (1),(2);
USE bad; UPDATE good.t1, ugly.t2 SET a = b;

The first line changes a table in the database named bad, since it does not qualify the table name with a database name. The second line changes a table in a different database than the current database. The third line changes two tables in two different databases, neither of which is the current database.

Now, given these statements, consider what happens if the bad database is filtered using binlog-ignore-db=bad. None of the three statements in Example 4-4 will be written to the binary log, even though the second and third statements change tables in the good and ugly databases and make no reference to the bad database. This might seem strange at first: why not filter the statement based on the database of the table changed? But consider what would happen with the third statement if the ugly database were filtered instead of the bad database. Now one database in the UPDATE is filtered out and the other isn't.
This puts the server in a catch-22 situation, so the problem is solved by just filtering on the current database, and this rule is used for all statements (with a few exceptions).

To avoid mistakes when executing statements that can potentially be filtered out, make it a habit not to qualify table, function, or procedure names with the database name. Instead, whenever you want to access a table in a different database, issue a USE statement to make that database the current database. In other words, instead of writing:

INSERT INTO other.book VALUES ('MySQL', 'Paul DuBois');

write:

USE other;
INSERT INTO book VALUES ('MySQL', 'Paul DuBois');

Using this practice, it is easy to see by inspection that the statement does not update multiple databases, simply because no tables are qualified with a database name.

This behavior does not apply when row-based replication is used. The filtering employed in row-based replication will be discussed in "Filtering in Row-Based Replication" on page 286, but since row-based replication works with each individual row change, it is able to filter on the actual table that the row is targeted for and does not use the current database.

So, what happens when both binlog-do-db and binlog-ignore-db are used at the same time? For example, consider a configuration file containing the following two rules:

[mysqld]
binlog-do-db=good
binlog-ignore-db=bad

In this case, will the following statement be filtered out or not?

USE ugly;
INSERT INTO t1 VALUES (1);

Following the diagram in Figure 4-3, you can see that if there is at least one binlog-do-db rule, all binlog-ignore-db rules are ignored completely, and since only the good database is included, the previous statement will be filtered out. Because of the way the binlog-*-db rules are evaluated, it is pointless to have both binlog-do-db and binlog-ignore-db rules at the same time.
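The decision procedure of Figure 4-3 can be summarized in a few lines of Python. This is a simplified sketch of the documented behavior (it ignores the few exceptions mentioned above and the case of no current database), not server code.

```python
# Sketch of the binlog-*-db decision from Figure 4-3, not server code.
# Filtering is per statement and considers only the *current* database,
# never the databases of the tables the statement actually changes.
def statement_is_logged(current_db, do_db=(), ignore_db=()):
    if do_db:
        # Any binlog-do-db rule makes all binlog-ignore-db rules moot:
        # only statements issued with a listed current database are logged.
        return current_db in do_db
    return current_db not in ignore_db
```

With binlog-do-db=good and binlog-ignore-db=bad configured together, a statement issued with ugly as the current database is filtered out, exactly as described above.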
Since the binary log can be used for recovery as well as replication, the recommendation is not to use the binlog-*-db options and instead to filter out the events on the slave by using the replicate-* options (these are described in "Filtering and skipping events" on page 255). Using the binlog-*-db options would filter statements out of the binary log, and you would not be able to restore the database from the binary log in the event of a crash.

Triggers, Events, and Stored Routines

A few other constructions that are treated specially when logged are stored programs, that is, triggers, events, and stored routines (the last is a collective name for stored procedures and stored functions). Their treatment with respect to the binary log contains some elements in common, so they will be covered together in this section. The explanation distinguishes statements of two types: statements that define, destroy, or alter stored programs, and statements that invoke them.

Statements that define or destroy stored programs

The following discussion shows triggers in the examples, but the same principles apply to the definition of events and stored routines.

To understand why the server needs to handle these features specially when writing them to the binary log, consider the code in Example 4-5. In the example, a table named employee keeps information about all employees of an imagined system and a table named log keeps a log of interesting information. Note that the log table has a timestamp column that notes the time of a change and that the name column in the employee table is the primary key for the table. There is also a status column to tell whether the addition succeeded or failed.

To track changes to employee information (for example, for auditing purposes), three triggers are created so that whenever an employee is added, removed, or changed, a log entry of the change is added to the log table.
Notice that the triggers are after triggers, which means that entries are added only if the executed statement is successful. Failed statements will not be logged. We will later extend the example so that unsuccessful attempts are also logged.

Example 4-5. Definitions of tables and triggers for employee administration

CREATE TABLE employee (
   name CHAR(64) NOT NULL,
   email CHAR(64),
   password CHAR(64),
   PRIMARY KEY (name)
);

CREATE TABLE log (
   id INT AUTO_INCREMENT,
   email CHAR(64),
   status CHAR(10),
   message TEXT,
   ts TIMESTAMP,
   PRIMARY KEY (id)
);

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee
FOR EACH ROW
  INSERT INTO log(email, status, message)
    VALUES (NEW.email, 'OK', CONCAT('Adding employee ', NEW.name));

CREATE TRIGGER tr_employee_delete_after AFTER DELETE ON employee
FOR EACH ROW
  INSERT INTO log(email, status, message)
    VALUES (OLD.email, 'OK', 'Removing employee');

delimiter $$
CREATE TRIGGER tr_employee_update_after AFTER UPDATE ON employee
FOR EACH ROW
BEGIN
  IF OLD.name != NEW.name THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK',
              CONCAT('Name change from ', OLD.name, ' to ', NEW.name));
  END IF;
  IF OLD.password != NEW.password THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', 'Password change');
  END IF;
  IF OLD.email != NEW.email THEN
    INSERT INTO log(email, status, message)
      VALUES (OLD.email, 'OK', CONCAT('E-mail change to ', NEW.email));
  END IF;
END $$
delimiter ;

With these trigger definitions, it is now possible to add and remove employees as shown in Example 4-6. Here an employee is added, modified, and removed, and as you can see, each of the operations is logged to the log table.

The operations of adding, removing, and modifying employees may be done by a user who has access to the employee table, but what about access to the log table? In this case, a user who can manipulate the employee table should not be able to make changes to the log table.
There are many reasons for this, but they all boil down to trusting the contents of the log table for purposes of maintenance, auditing, disclosure to legal authorities, and so on. So the DBA may choose to make access to the employee table available to many users while keeping access to the log table very restricted.

Example 4-6. Adding, removing, and modifying users

master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)

master> INSERT INTO employee VALUES ('mats', 'mats@example.com', @pass);
Query OK, 1 row affected (0.00 sec)

master> UPDATE employee SET name = 'matz'
    ->  WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> SET @pass = PASSWORD('foobar');
Query OK, 0 rows affected (0.00 sec)

master> UPDATE employee SET password = @pass
    ->  WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> DELETE FROM employee WHERE email = 'mats@example.com';
Query OK, 1 row affected (0.00 sec)

master> SELECT * FROM log;
+----+------------------+-------------------------------+---------------------+
| id | email            | message                       | ts                  |
+----+------------------+-------------------------------+---------------------+
|  1 | mats@example.com | Adding employee mats          | 2012-11-14 18:56:08 |
|  2 | mats@example.com | Name change from mats to matz | 2012-11-14 18:56:11 |
|  3 | mats@example.com | Password change               | 2012-11-14 18:56:41 |
|  4 | mats@example.com | Removing employee             | 2012-11-14 18:57:11 |
+----+------------------+-------------------------------+---------------------+
4 rows in set (0.00 sec)

The INSERT, UPDATE, and DELETE statements in the example can generate a warning that they are unsafe to log when statement mode is used, because they invoke a trigger that inserts into an autoincrement column. In general, such warnings should be investigated.
To make sure the triggers can execute successfully against a highly protected table, they are executed as the user who defined the trigger, not as the user who changed the contents of the employee table. So the CREATE TRIGGER statements in Example 4-5 are executed by the DBA, who has privileges to make additions to the log table, whereas the statements altering employee information in Example 4-6 are executed through a user management account that only has privileges to change the employee table.

When the statements in Example 4-6 are executed, the employee management account is used for updating entries in the employee table, but the DBA privileges are used to make additions to the log table. The employee management account cannot be used to add or remove entries from the log table. As an aside, Example 4-6 assigns passwords to a user variable before using them in the statement. This is done to avoid sending sensitive data in plain text to another server.

Security and the Binary Log

In general, a user with REPLICATION SLAVE privileges can read everything that occurs on the master, so the account should be secured against compromise. Details are beyond the scope of this book, but here are some examples of precautions you can take:

• Make it impossible to log in to the account from outside the firewall.
• Track all login attempts for accounts with REPLICATION SLAVE privileges, and place the log on a separate secure server.
• Encrypt the connection between the master and the slave using, for example, MySQL's built-in Secure Sockets Layer (SSL) support.

Even if the account has been secured, there is information that does not have to be in the binary log, so it makes sense not to store it there in the first place. One of the more common types of sensitive information is passwords.
Events containing passwords can be written to the binary log when executing statements that change tables on the server and that include the password required for access to the tables. A typical example is:

UPDATE employee SET pass = PASSWORD('foobar')
 WHERE email = 'chuck@example.com';

If replication is in place, it is better to rewrite this statement without the password. This is done by computing and storing the hashed password into a user-defined variable and then using that in the expression:

SET @password = PASSWORD('foobar');
UPDATE employee SET pass = @password
 WHERE email = 'chuck@example.com';

Since the SET statement is not replicated, the original password will not be stored in the binary log, only in the memory of the server while executing the statement.

As long as the password hash, rather than the plain-text password, is stored in the table, this technique works. If the raw password is stored directly in the table, there is no way to prevent the password from ending up in the binary log. But storing hashes for passwords is a standard good practice in any case, to prevent someone who gets his hands on the raw data from learning the passwords. Encrypting the connection between the master and the slave offers some protection, but if the binary log itself is compromised, encrypting the connection doesn't help.

If you recall the earlier discussion about implicit information, you may already have noticed that both the user executing a line of code and the user who defines a trigger are implicit. As you will see in Chapter 8, neither the definer nor the invoker of the trigger is critical to executing the trigger on the slave, and the user information is effectively ignored when the slave executes the statement. However, the information is important when the binary log is played back to a server, for instance, when doing a PITR.
To play back a binary log to the server without problems in handling privileges on all the various tables, it is necessary to execute all the statements as a user with SUPER privileges. But the triggers may not have been defined using SUPER privileges, so it is important to re-create the triggers with the correct user as the trigger's definer. If a trigger were instead re-created with the SUPER user as its definer, rather than the user who defined the trigger originally, it could lead to a privilege escalation.

To permit a DBA to specify the user under which to execute a trigger, the CREATE TRIGGER syntax includes an optional DEFINER clause. If a DEFINER is not given in the statement, as is the case in Example 4-7, the statement will be rewritten for the binary log to add a DEFINER clause using the current user as the definer. This means that the definition of the insert trigger appears in the binary log as shown in Example 4-7. It lists the account that created the trigger (root@localhost) as the definer, which is what we want in this case.

Example 4-7. A CREATE TRIGGER statement in the binary log

master> SHOW BINLOG EVENTS FROM 92236 LIMIT 1\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 92236
 Event_type: Query
  Server_id: 1
End_log_pos: 92491
       Info: use `test`; CREATE DEFINER=`root`@`localhost` TRIGGER ...
1 row in set (0.00 sec)

Statements that invoke triggers and stored routines

Moving over from definitions to invocations, we can ask how the master's triggers are handled during replication. Well, actually, they're not handled at all. The statement that invokes the trigger is logged to the binary log, but it is not linked to the particular trigger. Instead, when the slave executes the statement, it automatically executes any triggers associated with the tables affected by the statement.
This means that there can be different triggers on the master and the slave: the triggers on the master will be invoked on the master, while the triggers on the slave will be invoked on the slave. For example, if the trigger that adds entries to the log table is not necessary on the slave, performance can be improved by eliminating the trigger from the slave.

Still, any context events necessary for replicating correctly will be written to the binary log before the statement that invokes the trigger, even if it is just the statements in the trigger that require the context events. Thus, Example 4-8 shows the binary log after executing the INSERT statement in Example 4-6. Note that the first event writes the INSERT ID for the log table's primary key. This reflects the use of the log table in the trigger, but it might appear to be redundant because the slave will not use the trigger. You should, however, note that using different triggers on the master and slave (or no trigger at all on either the master or slave) is the exception, and that the INSERT ID is necessary for replicating the INSERT statement correctly when the trigger exists on both the master and slave.

Example 4-8. Contents of the binary log after executing INSERT

master> SHOW BINLOG EVENTS FROM 93340\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 93340
 Event_type: Intvar
  Server_id: 1
End_log_pos: 93368
       Info: INSERT_ID=1
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 93368
 Event_type: User var
  Server_id: 1
End_log_pos: 93396
       Info: @`pass`=_utf8 0x2A3942353030333433424335324... utf8_general_ci
*************************** 3. row ***************************
   Log_name: master-bin.000038
        Pos: 93396
 Event_type: Query
  Server_id: 1
End_log_pos: 93537
       Info: use `test`; INSERT INTO employee VALUES ...
3 rows in set (0.00 sec)

Stored Procedures

Stored functions and stored procedures are known by the common name stored routines. Since the server treats stored procedures and stored functions very differently, stored procedures will be covered in this section and stored functions in the next.

The situation for stored routines is similar to triggers in some aspects, but very different in others. Like triggers, stored routines have a DEFINER clause, and it is explicitly added to the binary log whether or not the original statement includes it. But the invocation of stored routines is handled differently from triggers.

To begin, let's extend Example 4-5, which defines tables for employees and logs, with some utility routines to work with the employees. Even though this can be handled with standard INSERT, DELETE, and UPDATE statements, we'll use stored procedures to demonstrate some issues involved in writing them to the binary log. For these purposes, let's extend the example with the procedures in Example 4-9 for adding and removing employees.

Example 4-9.
Stored procedure definitions for managing employees

delimiter $$
CREATE PROCEDURE employee_add(p_name CHAR(64), p_email CHAR(64),
                              p_password CHAR(64))
MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   INSERT INTO employee(name, email, password)
     VALUES (p_name, p_email, l_pass);
END $$

CREATE PROCEDURE employee_passwd(p_email CHAR(64), p_password CHAR(64))
MODIFIES SQL DATA
BEGIN
   DECLARE l_pass CHAR(64);
   SET l_pass = PASSWORD(p_password);
   UPDATE employee SET password = l_pass WHERE email = p_email;
END $$

CREATE PROCEDURE employee_del(p_name CHAR(64))
MODIFIES SQL DATA
BEGIN
   DELETE FROM employee WHERE name = p_name;
END $$
delimiter ;

For the employee_add and employee_passwd procedures, we have extracted the encrypted password into a separate variable for the reasons already explained, but the employee_del procedure just contains a DELETE statement, since nothing else is needed. A binlog entry corresponding to one procedure is:

master> SHOW BINLOG EVENTS FROM 97911 LIMIT 1\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 97911
 Event_type: Query
  Server_id: 1
End_log_pos: 98275
       Info: use `test`; CREATE DEFINER=`root`@`localhost` PROCEDURE ...
1 row in set (0.00 sec)

As expected, the definition of this procedure is extended with the DEFINER clause before the definition is written to the binary log, but apart from that, the body of the procedure is left intact. Notice that the CREATE PROCEDURE statement is replicated as a Query event, as are all DDL statements. In this regard, stored routines are similar to triggers in the way they are treated by the binary log. But invocation differs significantly from triggers. Example 4-10 calls the procedure that adds an employee and shows the resulting contents of the binary log.

Example 4-10.
Calling a stored procedure

master> CALL employee_add('chuck', 'chuck@example.com', 'abrakadabra');
Query OK, 1 row affected (0.00 sec)

master> SHOW BINLOG EVENTS FROM 104033\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 104033
 Event_type: Intvar
  Server_id: 1
End_log_pos: 104061
       Info: INSERT_ID=1
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 104061
 Event_type: Query
  Server_id: 1
End_log_pos: 104416
       Info: use `test`; INSERT INTO employee(name, email, password) VALUES (
             NAME_CONST('p_name',_utf8'chuck' COLLATE ...),
             NAME_CONST('p_email',_utf8'chuck@example.com' COLLATE ...),
             NAME_CONST('pass',_utf8'*FEB349C4FDAA307A...' COLLATE ...))
2 rows in set (0.00 sec)

In Example 4-10, there are four things that you should note:

• The CALL statement is not written to the binary log. Instead, the statements executed as a result of the call are written to the binary log. In other words, the body of the stored procedure is unrolled into the binary log.
• The statement is rewritten to not contain any references to the parameters of the stored procedure, that is, p_name, p_email, and p_password. Instead, the NAME_CONST function is used for each parameter to create a result set with a single value.
• The locally declared variable pass is also replaced with a NAME_CONST expression, where the second parameter contains the encrypted password.
• Just as when a statement that invokes a trigger is written to the binary log, the statement that calls the stored procedure is preceded by an Intvar event holding the insert ID used when adding the employee to the log table.

Since neither the parameter names nor the locally declared names are available outside the stored routine, NAME_CONST is used to associate the name of the parameter or local variable with the constant value used when executing the function.
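The unrolling rewrite can be pictured with a small Python sketch. This is a naive, purely illustrative text substitution; the server performs the rewrite on its internal representation of the statement, not on strings, and the function name here is invented.

```python
# Naive illustration of the NAME_CONST rewrite, not how the server
# actually does it.  Each parameter or local-variable name appearing in
# the statement is wrapped in a NAME_CONST(name, value) expression so
# that the logged statement carries the values used at execution time.
def rewrite_for_binlog(statement, bindings):
    """bindings maps parameter/local names to their SQL literal values."""
    for name, value in bindings.items():
        statement = statement.replace(
            name, "NAME_CONST('%s',%s)" % (name, value))
    return statement
```

Applied to a statement referencing p_name with the bound literal _utf8'chuck', this produces an INSERT of the same shape as the Query event in Example 4-10.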
This guarantees that the value can be used in the same way as the parameter or local variable. However, this change is not significant here; currently it offers no advantages over using the constant value directly.

Stored Functions

Stored functions share many similarities with stored procedures and some similarities with triggers. Like both stored procedures and triggers, stored functions have a DEFINER clause that is normally (but not always) used when the CREATE FUNCTION statement is written to the binary log. In contrast to stored procedures, stored functions return scalar values, and you can therefore embed them in various places in SQL statements. For example, consider the definition of the stored function in Example 4-11, which extracts the email address of an employee given the employee's name. The function is a little contrived (it is significantly more efficient to just execute the statement directly), but it suits our purposes well.

Example 4-11. A stored function to fetch the email address of an employee

delimiter $$
CREATE FUNCTION employee_email(p_name CHAR(64))
RETURNS CHAR(64)
DETERMINISTIC
BEGIN
   DECLARE l_email CHAR(64);
   SELECT email INTO l_email FROM employee WHERE name = p_name;
   RETURN l_email;
END $$
delimiter ;

This stored function can be used conveniently in other statements, as shown in Example 4-12. In contrast to stored procedures, stored functions have to specify a characteristic, such as DETERMINISTIC, NO SQL, or READS SQL DATA, if they are to be written to the binary log.

Example 4-12.
Examples of using the stored function

master> CREATE TABLE collected (
    ->    name CHAR(32),
    ->    email CHAR(64)
    -> );
Query OK, 0 rows affected (0.09 sec)

master> INSERT INTO collected(name, email)
    -> VALUES ('chuck', employee_email('chuck'));
Query OK, 1 row affected (0.01 sec)

master> SELECT employee_email('chuck');
+-------------------------+
| employee_email('chuck') |
+-------------------------+
| chuck@example.com       |
+-------------------------+
1 row in set (0.00 sec)

When it comes to calls, stored functions are replicated in the same manner as triggers: as part of the statement that executes the function. For instance, the binary log doesn't need any events preceding the INSERT statement in Example 4-12, but it will contain the context events necessary to replicate the stored function inside the INSERT.

What about SELECT? Normally, SELECT statements are not written to the binary log, since they don't change any data, but a SELECT containing a stored function that does change data, as in Example 4-13, is an exception.

Example 4-13. Example of a stored function that updates a table

CREATE TABLE log(log_id INT AUTO_INCREMENT PRIMARY KEY, msg TEXT);

delimiter $$
CREATE FUNCTION log_message(msg TEXT) RETURNS INT
DETERMINISTIC
BEGIN
  INSERT INTO log(msg) VALUES(msg);
  RETURN LAST_INSERT_ID();
END $$
delimiter ;

SELECT log_message('Just a test');

When executing the stored function, the server notices that it adds a row to the log table and marks the statement as an "updating" statement, which means that it will be written to the binary log. So, for the slightly artificial example in Example 4-13, the binary log will contain the events:

*************************** 7. row ***************************
   Log_name: mysql-bin.000001
        Pos: 845
 Event_type: Query
  Server_id: 1
End_log_pos: 913
       Info: BEGIN
*************************** 8.
row ***************************
   Log_name: mysql-bin.000001
        Pos: 913
 Event_type: Intvar
  Server_id: 1
End_log_pos: 941
       Info: LAST_INSERT_ID=1
*************************** 9. row ***************************
   Log_name: mysql-bin.000001
        Pos: 941
 Event_type: Intvar
  Server_id: 1
End_log_pos: 969
       Info: INSERT_ID=1
*************************** 10. row ***************************
   Log_name: mysql-bin.000001
        Pos: 969
 Event_type: Query
  Server_id: 1
End_log_pos: 1109
       Info: use `test`; SELECT `test`.`log_message`(_utf8'Just a test' COLLATE...
*************************** 11. row ***************************
   Log_name: mysql-bin.000001
        Pos: 1109
 Event_type: Xid
  Server_id: 1
End_log_pos: 1132
       Info: COMMIT /* xid=237 */

Stored Functions and Privileges

The CREATE ROUTINE privilege is required to define a stored procedure or stored function. Strictly speaking, no other privileges are needed to create a stored routine, but since it normally executes under the privileges of the definer, defining a stored routine would not make much sense if the definer didn't have the privileges needed to read from or write to the tables referenced by the routine.

But replication threads on the slave execute without privilege checks. This leaves a serious security hole allowing any user with the CREATE ROUTINE privilege to elevate her privileges and execute any statement on the slave.

In MySQL versions earlier than 5.0, this does not cause problems, because all paths of a statement are explored when the statement is executed on the master. A privilege violation on the master will prevent a statement from being written to the binary log, so users cannot access objects on the slave that were out of bounds on the master. However, with the introduction of stored routines, it is possible to create conditional execution paths, and the server does not explore all paths when executing a stored routine.
Since stored procedures are unrolled, the exact statements executed on the master are also executed on the slave, and since a statement is logged only if it was successfully executed on the master, it is not possible to get access to other objects. Not so with stored functions.

If a stored function is defined with SQL SECURITY INVOKER, a malicious user can craft a function that will execute differently on the master and the slave. The security breach can then be buried in the branch executed on the slave. This is demonstrated in the following example:

delimiter $$
CREATE FUNCTION magic()
RETURNS CHAR(64)
SQL SECURITY INVOKER
BEGIN
  DECLARE result CHAR(64);
  IF @@server_id <> 1 THEN
    SELECT what INTO result FROM secret.agents LIMIT 1;
    RETURN result;
  ELSE
    RETURN 'I am magic!';
  END IF;
END $$
delimiter ;

One piece of code executes on the master (the ELSE branch), whereas a separate piece of code (the IF branch) executes on the slave, where the privilege checks are disabled. The effect is to elevate the user's privileges from CREATE ROUTINE to the equivalent of SUPER.

Notice that this problem doesn't occur if the function is defined with SQL SECURITY DEFINER, because the function then executes with the definer's privileges and will be blocked on the slave.

To prevent privilege escalation on a slave, MySQL by default requires SUPER privileges to define stored functions. But because stored functions are very useful, and some database administrators trust their users to create proper functions, this check can be disabled with the log-bin-trust-function-creators option.

Events

The events feature is a MySQL extension, not part of standard SQL. Events, which should not be confused with binlog events, are stored programs that are executed regularly by a special event scheduler. Like all other stored programs, definitions of events are logged with a DEFINER clause. Since events are invoked by the event scheduler, they are always executed as the definer and do not pose a security risk in the way that stored functions do.

When events are executed, the statements they execute are written to the binary log directly. Since the events will be executed on the master, they are automatically disabled on the slave and will therefore not be executed there. If the events were not disabled, they would be executed twice on the slave: once by the master executing the event and replicating the changes to the slave, and once by the slave executing the event directly.

Because the events are disabled on the slave, it is necessary to enable them if the slave, for some reason, should lose the master. So, for example, when promoting a slave as described in Chapter 5, don't forget to enable the events that were replicated from the master. This is easiest to do using the following statement:

UPDATE mysql.event
   SET Status = 'ENABLED'
 WHERE Status = 'SLAVESIDE_DISABLED';

The purpose of the WHERE clause is to enable only the events that were disabled when being replicated from the master. There might be events that are disabled for other reasons.

Special Constructions

Even though statement-based replication is normally straightforward, some special constructions have to be handled with care. Recall that for the statement to be executed correctly on the slave, the context has to be correct for the statement. Even though the context events discussed earlier handle part of the context, some constructions have additional context that is not transferred as part of the replication process.

The LOAD_FILE function

The LOAD_FILE function allows you to fetch a file and use it as part of an expression. Although quite convenient at times, the file has to exist on the slave server to replicate correctly, since the file is not transferred during replication the way the file given to LOAD DATA INFILE is.
With some ingenuity, you can rewrite a statement involving the LOAD_FILE function either to use the LOAD DATA INFILE statement or to define a user-defined variable to hold the contents of the file. For example, take the following statement that inserts a document into a table:

master> INSERT INTO document(author, body)
     -> VALUES ('Mats Kindahl', LOAD_FILE('go_intro.xml'));

You can rewrite this statement to use LOAD DATA INFILE instead. In this case, you have to take care to specify character strings that cannot exist in the document as field and line delimiters, since you are going to read the entire file contents as a single column:

master> LOAD DATA INFILE 'go_intro.xml' INTO TABLE document
     -> FIELDS TERMINATED BY '@*@' LINES TERMINATED BY '&%&'
     -> (author, body) SET author = 'Mats Kindahl';

An alternative is to store the file contents in a user-defined variable and then use it in the statement:

master> SET @document = LOAD_FILE('go_intro.xml');
master> INSERT INTO document(author, body) VALUES
     -> ('Mats Kindahl', @document);
This can offer some speed advantages, because the storage engine does not have to administer the transactional log that the transactional engines use, and it allows some optimizations of disk access. From a replication perspective, however, nontransactional engines require special considerations.

The most important aspect to note is that replication cannot handle arbitrary nontransactional engines, but has to make some assumptions about how they behave. Some of those limitations are lifted with the introduction of row-based replication in version 5.1 (a subject that will be covered in "Row-Based Replication" on page 97), but even in that case, it cannot handle arbitrary storage engines.

One of the features that complicates the issue further, from a replication perspective, is that it is possible to mix transactional and nontransactional engines in the same transaction, and even in the same statement. To continue with the example used earlier, consider Example 4-14, where the log table from Example 4-5 is given a nontransactional storage engine while the employee table is given a transactional one. We use the nontransactional MyISAM storage engine for the log table to improve its speed, while keeping the transactional behavior for the employee table. We can further extend the example to track unsuccessful attempts to add employees by creating a pair of insert triggers: a before trigger and an after trigger. If an administrator sees an entry in the log with a status field of FAIL, it means the before trigger ran but the after trigger did not, and therefore an attempt to add an employee failed. Example 4-14.
Definition of log and employee tables with storage engines

CREATE TABLE employee (
    name CHAR(64) NOT NULL,
    email CHAR(64),
    password CHAR(64),
    PRIMARY KEY (email)
) ENGINE = InnoDB;

CREATE TABLE log (
    id INT AUTO_INCREMENT,
    email CHAR(64),
    message TEXT,
    status ENUM('FAIL', 'OK') DEFAULT 'FAIL',
    ts TIMESTAMP,
    PRIMARY KEY (id)
) ENGINE = MyISAM;

delimiter $$
CREATE TRIGGER tr_employee_insert_before BEFORE INSERT ON employee FOR EACH ROW
BEGIN
    INSERT INTO log(email, message)
        VALUES (NEW.email, CONCAT('Adding employee ', NEW.name));
    SET @LAST_INSERT_ID = LAST_INSERT_ID();
END $$
delimiter ;

CREATE TRIGGER tr_employee_insert_after AFTER INSERT ON employee FOR EACH ROW
    UPDATE log SET status = 'OK' WHERE id = @LAST_INSERT_ID;

What are the effects of this change on the binary log? To begin, let's consider the INSERT statement from Example 4-6. Assuming the statement is not inside a transaction and AUTOCOMMIT is 1, the statement will be a transaction by itself. If the statement executes without errors, everything will proceed as planned and the statement will be written to the binary log as a Query event. Now, consider what happens if the INSERT is repeated with the same employee. Since the email column is the primary key, this will generate a duplicate key error when the insertion is attempted, but what will happen with the statement? Is it written to the binary log or not?
Let's have a look:

master> SET @pass = PASSWORD('xyzzy');
Query OK, 0 rows affected (0.00 sec)
master> INSERT INTO employee(name,email,password)
     -> VALUES ('chuck','chuck@example.com',@pass);
ERROR 1062 (23000): Duplicate entry 'chuck@example.com' for key 'PRIMARY'
master> SELECT * FROM employee;
+-------+-------------------+-------------------------------------------+
| name  | email             | password                                  |
+-------+-------------------+-------------------------------------------+
| chuck | chuck@example.com | *151AF6B8C3A6AA09CFCCBD34601F2D309ED54888 |
+-------+-------------------+-------------------------------------------+
1 row in set (0.00 sec)

master> SHOW BINLOG EVENTS FROM 38493\G
*************************** 1. row ***************************
   Log_name: master-bin.000038
        Pos: 38493
 Event_type: User var
  Server_id: 1
End_log_pos: 38571
       Info: @`pass`=_utf8 0x2A31353141463642384333413641413...
*************************** 2. row ***************************
   Log_name: master-bin.000038
        Pos: 38571
 Event_type: Query
  Server_id: 1
End_log_pos: 38689
       Info: use `test`; INSERT INTO employee(name,email,password)...
2 rows in set (0.00 sec)

As you can see, the statement is written to the binary log even though the employee table is transactional and the statement failed. The SELECT reveals that there is still a single employee, proving that the statement was rolled back, so why is the statement written to the binary log? Looking into the log table will reveal the reason.

master> SELECT * FROM log;
+----+------------------+----------------------+--------+---------------------+
| id | email            | message              | status | ts                  |
+----+------------------+----------------------+--------+---------------------+
|  1 | mats@example.com | Adding employee mats | OK     | 2010-01-13 15:50:45 |
|  2 | mats@example.com | Name change from ... | OK     | 2010-01-13 15:50:48 |
|  3 | mats@example.com | Password change      | OK     | 2010-01-13 15:50:50 |
|  4 | mats@example.com | Removing employee    | OK     | 2010-01-13 15:50:52 |
|  5 | mats@example.com | Adding employee mats | OK     | 2010-01-13 16:11:45 |
|  6 | mats@example.com | Adding employee mats | FAIL   | 2010-01-13 16:12:00 |
+----+------------------+----------------------+--------+---------------------+
6 rows in set (0.00 sec)

Look at the last line, where the status is FAIL. This line was added to the table by the before trigger tr_employee_insert_before. For the binary log to faithfully represent the changes made to the database on the master, it is necessary to write the statement to the binary log if there are any nontransactional changes present in the statement or in triggers that are executed as a result of executing the statement. Since the statement failed, the after trigger tr_employee_insert_after was not executed, and therefore the status is still FAIL from the execution of the before trigger.

Since the statement failed on the master, information about the failure needs to be written to the binary log as well. The MySQL server handles this by using an error code field in the Query event to register the exact error code that caused the statement to fail. This field is written to the binary log together with the event. The error code is not visible in the output of SHOW BINLOG EVENTS, but you can view it using the mysqlbinlog tool, which we will cover later in the chapter.

Logging Transactions

You have now seen how individual statements are written to the binary log, along with context information, but we have not yet covered how transactions are logged. In this section, we will briefly cover how transactions are logged. A transaction can start under a few different circumstances:

• When the user issues START TRANSACTION (or BEGIN).
• When AUTOCOMMIT=1 and a statement accessing a transactional table starts to execute. Note that a statement that writes only to nontransactional tables (for example, only to MyISAM tables) does not start a transaction.
• When AUTOCOMMIT=0 and the previous transaction was committed or aborted, either implicitly (by executing a statement that does an implicit commit) or explicitly by using COMMIT or ROLLBACK.

Not every statement that is executed after the transaction has started is part of that transaction. The exceptions require special care from the binary log. Nontransactional statements are by their very definition not part of the transaction. When they are executed, they take effect immediately and do not wait for the transaction to commit. This also means that it is not possible to roll them back. They do not affect an open transaction: any transactional statement executed after the nontransactional statement is still added to the currently open transaction.

In addition, several statements do an implicit commit. These can be separated into three groups based on the reason they do an implicit commit.

Statements that write files
Most DDL statements (CREATE, ALTER, etc.), with some exceptions, do an implicit commit of any outstanding transaction before starting to execute and an implicit commit after they have finished. These statements modify files in the filesystem and are for that reason not transactional.

Statements that modify tables in the mysql database
All statements that create, drop, or modify user accounts or privileges for users do an implicit commit and cannot be part of a transaction. Internally, these statements modify tables in the mysql database, which are all nontransactional. In MySQL versions earlier than 5.1.3, these statements did not cause an implicit commit, but because they were writing to nontransactional tables, they were treated as nontransactional statements.
As you will soon see, this caused some inconsistencies, so implicit commits were added for these statements over the course of several versions.

Statements that require implicit commits for pragmatic reasons
Statements that lock tables, statements that are used for administrative purposes, and LOAD DATA INFILE cause implicit commits in various situations because the implementation requires this to make them work correctly.

Statements that cause an implicit commit are clearly not part of any transaction, because any open transaction is committed before execution starts. You can find a complete list of statements that cause an implicit commit in the online MySQL Reference Manual.

Transaction Cache

The binary log can show statements in a different order from their actual execution, because it combines all the statements in each transaction to keep them together. Multiple sessions can execute simultaneous transactions on a server, and the transactional storage engines maintain their own transactional logs to make sure each transaction executes correctly. These logs are not visible to the user. In contrast, the binary log shows all transactions from all sessions in the order in which they were committed, as if each executed sequentially.

To ensure that each transaction is written as a unit to the binary log, the server has to separate statements that are executing in different threads. When committing a transaction, the server writes all the statements that are part of the transaction to the binary log as a single unit. For this purpose, the server keeps a transaction cache for each thread, as illustrated in Figure 4-4. Each statement executed for a transaction is placed in the transaction cache, and when the transaction commits, the contents of the cache are copied to the binary log and the cache is emptied. Figure 4-4.
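To illustrate the first group, the following session sketch shows a DDL statement implicitly committing an open transaction (the audit table name is just an example):

```sql
START TRANSACTION;
INSERT INTO employee(name, email, password)
    VALUES ('mats', 'mats@example.com', 'xyzzy');
-- CREATE TABLE is DDL: it implicitly commits the INSERT above
-- before it starts executing.
CREATE TABLE audit (id INT PRIMARY KEY);
ROLLBACK;   -- too late: the INSERT was already committed
```

Because the open transaction is committed before the DDL executes, the final ROLLBACK here has nothing to undo.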
Threads with transaction caches and a binary log

Statements that contain nontransactional changes require special attention. Recall from our previous discussion that nontransactional statements do not cause the current transaction to terminate, so the changes introduced by the execution of a nontransactional statement have to be recorded somewhere without closing the currently open transaction. The situation is further complicated by statements that simultaneously affect transactional and nontransactional tables. Such statements are considered transactional but include changes that are not part of the transaction. Statement-based replication cannot handle this correctly in all situations, so a best-effort approach has been taken. We'll describe the measures taken by the server, followed by the issues you have to be aware of in order to avoid the replication problems that remain.

How nontransactional statements are logged

When no transaction is open, nontransactional statements are written to the binary log at the end of their execution and do not "transit" in the transaction cache before ending up in the binary log. If, however, a transaction is open, the rules for handling the statement are as follows:

1. If the statement is marked as transactional, it is written to the transaction cache.
2. If the statement is not marked as transactional and there are no statements in the transaction cache, the statement is written directly to the binary log.
3. If the statement is not marked as transactional, but there are statements in the transaction cache, the statement is written to the transaction cache.

The third rule might seem strange, but you can understand the reasoning if you look at Example 4-15.
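The three rules can be captured in a few lines of Python. This is a simplified model for illustration only, not server code:

```python
def log_statement(stmt, is_transactional, txn_cache, binlog):
    """Decide where a statement goes while a transaction is open.

    Models rules 1-3: transactional statements always go to the
    transaction cache; nontransactional statements go directly to
    the binary log only if the cache is still empty.
    """
    if is_transactional:                 # rule 1
        txn_cache.append(stmt)
    elif not txn_cache:                  # rule 2
        binlog.append(stmt)
    else:                                # rule 3
        txn_cache.append(stmt)

cache, binlog = [], []
log_statement("INSERT INTO log ...", False, cache, binlog)       # rule 2
log_statement("INSERT INTO employee ...", True, cache, binlog)   # rule 1
log_statement("INSERT INTO log ...", False, cache, binlog)       # rule 3
# On COMMIT, the transaction cache is copied to the binary log
# and emptied.
binlog.extend(cache)
cache.clear()
print(binlog)
```

Note how the first nontransactional statement lands in the binary log immediately (rule 2), while the second is held in the cache (rule 3) so it cannot overtake the transactional statement it follows.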
Returning to our employee and log tables, consider the statements in Example 4-15, where a modification of a transactional table comes before a modification of a nontransactional table in the transaction.

Example 4-15. Transaction with nontransactional statement
1 START TRANSACTION;
2 SET @pass = PASSWORD('xyzzy');
3 INSERT INTO employee(name,email,password)
      VALUES ('mats','mats@example.com', @pass);
4 INSERT INTO log(email, message)
      VALUES ('root@example.com', 'This employee was bad');
5 COMMIT;

Following rule 3, the statement on line 4 is written to the transaction cache even though the table is nontransactional. If the statement were written directly to the binary log, it would end up before the statement on line 3, because the statement on line 3 would not reach the binary log until the successful commit on line 5. In short, the slave's log would end up containing the comment added by the DBA on line 4 before the actual change to the employee on line 3, which is clearly inconsistent with the master. Rule 3 avoids such situations. The left side of Figure 4-5 shows the undesired effects if rule 3 did not apply, whereas the right side shows what actually happens thanks to rule 3.

Figure 4-5. Alternative binary logs depending on rule 3

Rule 3 involves a trade-off. Because the nontransactional statement is cached while the transaction executes, there is a risk that two transactions will update a nontransactional table on the master in a different order from that in which they are written to the binary log. This situation can arise when there is a dependency between the first, transactional statement and the second, nontransactional statement of the transaction, but this cannot generally be handled by the server, because doing so would require completely parsing each statement, including the code in all invoked triggers, and performing a dependency analysis.
Although technically possible, this would add extra processing to all statements during an open transaction and would therefore affect performance, perhaps significantly. Because the problem can almost always be avoided by designing transactions properly and ensuring that there are no dependencies of this kind in the transaction, the overhead was not added to MySQL.

How to avoid replication problems with nontransactional statements

The best strategy for avoiding problems is not to use nontransactional tables. However, if they are required by the application, a strategy for avoiding the dependencies discussed in the previous section is to ensure that statements affecting nontransactional tables are written first in the transaction. In this case, the statements will be written directly to the binary log, because the transaction cache is empty (refer to rule 2 in the preceding section), and they are known to have no dependencies. If you need any values from these statements later in the transaction, you can assign them to temporary tables or variables. After that, the real contents of the transaction can be executed, referencing the temporary tables or variables.

Writing Non-Transactional Statements Directly

Starting with MySQL 5.6, it is possible to force nontransactional statements to be written directly to the binary log by using the option binlog_direct_non_transactional_updates. When this option is enabled, all nontransactional statements are written to the binary log before the transaction in which they appear instead of inside it, even when the nontransactional statement appears after a transactional statement. This option changes the behavior only in statement-based replication. In row-based replication, the rows for the nontransactional statement are always written before the transaction. This works because row-based replication does not execute the statements, but just changes the data in the table.
Hence, the nontransactional "statement" will not have any dependencies at all and can safely be written before the transaction. As an example, if the transaction in Example 4-15 is executed with binlog_direct_non_transactional_updates disabled (the default), statements are written to the binary log in the order shown in Figure 4-6.

Figure 4-6. Order of statements from Example 4-15 in natural order

If, instead, binlog_direct_non_transactional_updates is enabled, the sequence of events will be in the order shown in Figure 4-7. Here, a separate transaction is created for the write to the log table, and it is written before the transaction containing the transactional statements.

Figure 4-7. Order of statements from Example 4-15 in safe order

Because the nontransactional statement is written before the transaction, it is critical that it not depend on any statements executed earlier in the transaction. If you need such dependencies, you should consider using row-based replication instead.

Distributed Transaction Processing Using XA

MySQL version 5.0 lets you coordinate transactions involving different resources by using the X/Open Distributed Transaction Processing model, XA. Although currently not very widely used, XA offers attractive opportunities for coordinating all kinds of resources with transactions. In version 5.0, the server uses XA internally to coordinate the binary log and the storage engines. A set of commands allows the client to take advantage of XA synchronization as well. XA allows different statements entered by different users to be treated as a single transaction. On the other hand, it imposes some overhead, so some administrators turn it off globally. Instructions for working with the XA protocol are beyond the scope of this book, but we will give a brief introduction to XA here before describing how it affects the binary log.
XA includes a transaction manager that coordinates a set of resource managers so that they commit a global transaction as an atomic unit. Each transaction is assigned a unique XID, which is used by both the transaction manager and the resource managers. When XA is used internally in the MySQL server, the transaction manager is usually the binary log and the resource managers are the storage engines. The process of committing an XA transaction, shown in Figure 4-8, consists of two phases.

In phase 1, each storage engine is asked to prepare for a commit. When preparing, the storage engine writes any information it needs to commit correctly to safe storage and then returns an OK message. If any storage engine replies negatively, meaning that it cannot commit the transaction, the commit is aborted and all engines are instructed to roll back the transaction. After all storage engines have reported that they have prepared without error, and before phase 2 begins, the transaction cache is written to the binary log. In contrast to normal transactions, which are terminated with a normal Query event containing a COMMIT, an XA transaction is terminated with an Xid event containing the XID.

Figure 4-8. Distributed transaction commit using XA

In phase 2, all the storage engines that were prepared in phase 1 are asked to commit the transaction. When committing, each storage engine reports that it has committed the transaction in stable storage. It is important to understand that the commit cannot fail: once phase 1 has passed, the storage engine has guaranteed that the transaction can be committed and is therefore not allowed to report failure in phase 2. A hardware failure can, of course, cause a crash, but since the storage engines have stored the information in durable storage, they will be able to recover properly when the server restarts.
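For reference, the client-side XA commands mirror this two-phase structure. A minimal sketch (the xid value 'trx1' is an arbitrary example):

```sql
XA START 'trx1';
INSERT INTO employee(name, email, password)
    VALUES ('mats', 'mats@example.com', 'xyzzy');
XA END 'trx1';
XA PREPARE 'trx1';   -- phase 1: write prepare information to safe storage
XA COMMIT 'trx1';    -- phase 2: cannot fail after a successful prepare

-- XA RECOVER lists transactions that are prepared but not yet
-- committed, which is useful after a crash.
```

A transaction manager driving several servers would issue XA PREPARE on all of them before issuing any XA COMMIT.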
The restart procedure is discussed in the section "The Binary Log and Crash Safety" on page 100. After phase 2, the transaction manager is given a chance to discard any shared resources, should it choose to. The binary log does not need any such cleanup actions, so it does nothing special with regard to XA at this step.

In the event that a crash occurs while committing an XA transaction, the recovery procedure in Figure 4-9 will take place when the server is restarted. At startup, the server opens the last binary log and checks the Format_description event. If the binlog-in-use flag is set (described in "Binlog File Rotation" on page 101), it indicates that the server crashed and XA recovery has to be executed.

Figure 4-9. Procedure for XA recovery

The server starts by walking through the binary log that was just opened and finding the XIDs of all transactions in the binary log by reading the Xid events. Each storage engine loaded into the server is then asked to commit the transactions in this list. For each XID in the list, the storage engine determines whether a transaction with that XID is prepared but not committed, and commits it if that is the case. If the storage engine has prepared a transaction with an XID that is not in this list, the XID obviously did not make it to the binary log before the server crashed, so the transaction is rolled back.

Binary Log Group Commit

Writing data to disk takes time; quite a lot, actually. Disk performance is orders of magnitude slower than main memory: disk access times are counted in milliseconds, while main memory access is counted in nanoseconds. For that reason, operating systems have elaborate memory management systems that keep parts of files in memory and avoid performing more writes to disk than necessary. Because database systems have to be safe from crashes, however, it is necessary to force the data to be written to disk when a transaction is committed.
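The recovery decision for each prepared transaction reduces to a membership test against the XIDs found in the binary log. A sketch of the logic in Python (not the actual server implementation):

```python
def xa_recover(binlog_xids, prepared_xids):
    """Decide the fate of prepared transactions after a crash.

    binlog_xids: XIDs read from Xid events in the last binary log.
    prepared_xids: XIDs the storage engine has prepared but not
    yet committed. Returns (to_commit, to_rollback).
    """
    # An XID in the binary log means the transaction was fully
    # logged, so it must be committed on recovery.
    to_commit = sorted(x for x in prepared_xids if x in binlog_xids)
    # A prepared XID missing from the binary log never made it to
    # the log before the crash, so the transaction is rolled back.
    to_rollback = sorted(x for x in prepared_xids if x not in binlog_xids)
    return to_commit, to_rollback

commit, rollback = xa_recover({101, 102, 103}, {102, 103, 104})
print(commit, rollback)   # XID 104 is rolled back
```

XID 101 appears in the binary log but not among the prepared transactions, meaning the engine already committed it before the crash, so nothing needs to be done for it.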
To avoid the performance impact that would result if each transaction had to be written to disk separately, multiple independent transactions can be grouped together and written to disk as a single unit, which is called group commit. Because the disk write time is mainly a consequence of the time it takes to move the disk head to the correct position on the disk, and not of the amount of data written, this improves performance significantly.

It is, however, not sufficient to commit the transaction data efficiently in the storage engine; the binary log also has to be written efficiently. For that reason, binary log group commit was added in MySQL 5.6. With this, multiple independent transactions can be committed as a group to the binary log, improving performance significantly. In order for online backup tools like MySQL Enterprise Backup to work correctly, it is important that transactions are written to the binary log in the same order as they are prepared in the storage engine.

The implementation is built around a set of stages that each transaction has to pass through before being fully committed; you can see an illustration in Figure 4-10. Each stage is protected by a mutex, so there can never be more than one thread in each stage. Each stage handles one part of the commit procedure: the first stage flushes the transaction caches of the threads to the file pages, the second stage executes a sync to get the file pages to disk, and the last stage commits all transactions.

Figure 4-10. Binary log group commit architecture

To move sessions between stages in an ordered manner, each stage has an associated queue where sessions can queue up for processing. Each stage queue is protected by a mutex that is held briefly while manipulating the queue. You can see the names of the mutexes for each stage in Table 4-1, in case you want to use the performance schema to monitor performance of the commit phase. Table 4-1.
Binlog group commit stages and their mutexes

Stage    Stage mutex    Stage queue mutex
Flush    LOCK_log       LOCK_flush_queue
Sync     LOCK_sync      LOCK_sync_queue
Commit   LOCK_commit    LOCK_commit_queue

Sessions normally run in separate threads, so any session thread that wants to commit a transaction enqueues the session in the flush stage queue. A session thread then proceeds through each stage in the following way:

1. If the session thread enqueued to a non-empty stage queue, it is a follower and will wait for its transaction to be committed by some other session thread.
2. If the session thread enqueued to an empty stage queue, it is a (stage) leader and will bring all the sessions registered for the stage through the stage.
3. The leader empties all the sessions of the queue in a single step. The order of the sessions is maintained, but after this step, new sessions can enqueue to the stage queue.
4. The stage processing is done as follows:
   • For the flush stage, the transaction of each session is flushed to the binary log in the same order as the sessions enqueued to the flush queue.
   • For the sync stage, an fsync call is executed.
   • For the commit stage, the transactions are committed in the storage engines in the same order as they registered.
5. The sessions are enqueued to the queue for the next stage in the same order as they were registered for this stage. Note that the queue might not be empty, which means that a session thread has "caught up" with a preceding leader thread. In this case, the session thread that was the leader and did the processing becomes a follower for the next stage and lets the leader already queued up handle the processing for the combined set of sessions. Note that a leader can become a follower, but a follower can never become a leader.

As explained, new leader threads merge their session queues with those of older threads that are in the middle of commit procedures.
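The leader/follower queueing for a single stage can be sketched as follows. This is a deliberately simplified single-threaded model for illustration; the real server coordinates concurrent threads with the mutexes listed in Table 4-1:

```python
from collections import deque

class Stage:
    """Simplified model of one binlog group commit stage."""
    def __init__(self, process):
        self.queue = deque()
        self.process = process          # per-session work, e.g. a flush

    def enter(self, session):
        """Return 'leader' if this session must do the stage work
        for everyone queued so far, 'follower' if it just waits."""
        was_empty = not self.queue
        self.queue.append(session)
        return "leader" if was_empty else "follower"

    def run_as_leader(self):
        """The leader takes the whole queue in one step, preserving
        the enqueue order, and processes each session in turn."""
        batch = list(self.queue)
        self.queue.clear()              # new sessions can enqueue again
        for session in batch:
            self.process(session)
        return batch

flushed = []
flush = Stage(flushed.append)
roles = [flush.enter(s) for s in ("s1", "s2", "s3")]
batch = flush.run_as_leader()
print(roles, batch)
```

The first session to enqueue becomes the leader and processes all three sessions in enqueue order; the other two simply wait, which is exactly the merging behavior described above.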
This lets the system adapt dynamically to changing situations. Normally, the sync stage is the most expensive stage, so many batches of sessions will pass through the flush stage and "pile up" in the sync stage queue, after which a single leader thread brings all the sessions through that stage. However, if battery-backed caches are used, fsync calls are very cheap, so in that case sessions might not pile up at this stage.

Transactions have to be committed in order for online backup methods (such as XtraBackup or MySQL Enterprise Backup) to work correctly. For this reason, the commit stage commits the transactions in the same order as they are written to the binary log. However, it is also possible to commit transactions in parallel in the commit stage, which means that the commits are done in an arbitrary order. This does not affect correctness in normal operations, but it means that you should not take an online backup while doing parallel commits. Benchmarks did not show any measurable improvement in performance when committing transactions in parallel, so the default is to always commit in order. This is also the recommended setting, unless you have special needs (or just want to test it). You can control whether transactions should be ordered using binlog_order_commits. If this is set to OFF, transactions are committed in parallel and the last stage is skipped (the threads commit themselves in an arbitrary order instead of waiting for the leader).

The flush stage includes an optimization when reading sessions from the queue. Instead of entering the flush stage and taking the entire queue for processing, the leader "skims" one session at a time from the queue, flushes it, and then repeats the procedure.
The idea behind this optimization is that as long as sessions keep enqueueing, there is no point in advancing to the sync stage, and since flushing a transaction to the binary log takes time, flushing one session at a time gives more sessions that need committing an opportunity to enqueue themselves. This optimization showed significant improvements in throughput. But because it can affect latency, you can control how long the leader thread keeps "skimming" sessions from the flush queue using binlog_max_flush_queue_time, which takes the number of microseconds that the leader should skim the queue. Once the timer expires, the entire queue is grabbed and the leader reverts to executing the procedure described in the preceding list for the stage. This means that the time given is not the maximum latency of a transaction: the leader still has to flush all the transactions that are registered for the queue before moving to the sync stage.

Row-Based Replication

The primary goal of replication is to keep the master and the slave synchronized so they have the same data. As you saw earlier, replication offers a number of special features to ensure the results are as close as possible to being identical on master and slave: context events, session-specific IDs, and so on. Despite this, there are still some situations that statement-based replication cannot currently handle correctly:

• If an UPDATE, DELETE, or INSERT statement contains a LIMIT clause, it may cause problems if a database crashes during execution.
• If there is an error during execution of a nontransactional statement, there is no guarantee that the effects are the same on the master and the slave.
• If a statement contains a call to a UDF, there is no way to ensure the same value is used on the slave.
• If the statement contains any nondeterministic function, such as USER, CURRENT_USER, or CONNECTION_ID, results may differ between master and slave.
• If a statement updates two tables with autoincrement columns, it will not work correctly, because only a single last insert ID can be replicated. That ID will then be used for both tables on the slave, while on the master the insert ID for each table is used individually.

In these cases, it is better to replicate the actual data being inserted into the tables, which is what row-based replication does. Instead of replicating the statement that performs the changes, row-based replication replicates each row being inserted, deleted, or updated separately, with the values that were used for the operation. Since the row that is sent to the slave is the same row that is sent to the storage engine, it contains the actual data being inserted into the table. Hence there are no UDFs to consider, no autoincrement counters to keep track of, and no partial execution of statements to take into consideration: just data, plain and simple.

2. MySQL 5.6 supports logging the statement that generated the rows to the binary log together with the rows. See the note in "Options for Row-Based Replication" on page 120.

Row-based replication opens up an entirely new set of scenarios that you just cannot accomplish with statement-based replication. However, you must also be aware of some differences in behavior. When choosing between statement-based and row-based replication, consider the following:

• Do you have statements that update a lot of rows, or do the statements usually change or insert only a few rows? If the statement changes a lot of rows, statement-based replication will have more compact statements and may execute faster. But since the statement is executed on the slave as well, this is not always true.
If the statement has a complex optimization and execution plan, it might be faster to use row-based replication, because the logic for finding rows is much faster. If the statement changes or inserts only a few rows, row-based replication is potentially faster because there is no parsing involved and all processing goes directly to the storage engine.
• Do you need to see which statements are executed? The events for handling row-based replication are hard to decode, to say the least. In statement-based replication, the statements are written into the binary log and hence can be read directly.2
• Statement-based replication has a simple replication model: just execute the same statement on the slave. This has existed for a long time and is familiar to many DBAs. Row-based replication, on the other hand, is less familiar to many DBAs and can therefore potentially be harder to fix when replication fails.
• If data is different on master and slave, executing statements can yield different results on master and slave. Sometimes this is intentional—in this case, statement-based replication can and should be used—but sometimes it is not intentional and can be prevented through row-based replication.

Enabling Row-based Replication

You can control the format to use when writing to the binary log using the option binlog-format. This option can take the values STATEMENT, MIXED, or ROW (see "Options for Row-Based Replication" on page 120 for more information), and it comes both as a global variable and as a session variable. This makes it possible for a session to temporarily switch replication format (but you need SUPER privileges to change this variable, even the session version). However, to ensure that row-based replication is used when starting the server, you need to stop the master and update the configuration file. Example 4-16 shows the addition needed to enable row-based replication.

Example 4-16.
Options to configure row-based replication

    [mysqld]
    user          = mysql
    pid-file      = /var/run/mysqld/mysqld.pid
    socket        = /var/run/mysqld/mysqld.sock
    port          = 3306
    basedir       = /usr
    datadir       = /var/lib/mysql
    tmpdir        = /tmp
    log-bin       = master-bin
    log-bin-index = master-bin.index
    server-id     = 1
    binlog-format = ROW

Using Mixed Mode

Mixed-mode replication is recommended for MySQL version 5.1 and later, but the default value for the binlog-format option is STATEMENT. This might seem odd, but that decision was made to avoid problems for users who upgrade from versions 5.0 or earlier. Because those versions had no row-based replication and users had to use statement-based replication, the MySQL developers did not want servers to make a sudden switch. If the servers suddenly started sending out row-based replication events when they were upgraded, the deployment would likely be a mess. To reduce the number of factors that an upgrading DBA has to consider, the default for this option remains STATEMENT. However, if you use one of the template files distributed with MySQL version 5.1, you will notice the binlog-format option has the value MIXED, per the recommendation.

The principles behind mixed-mode replication are simple: use statement-based replication normally and switch to row-based replication for unsafe statements. We have already examined the kinds of statements that can lead to problems and why. To summarize, mixed-mode currently switches to row-based replication if:

• The statement calls any of the following:
  — The UUID function
  — A user-defined function
  — The CURRENT_USER or USER function
  — The LOAD_FILE function
• Two or more tables with an AUTO_INCREMENT column are updated in the same statement.
• A server variable is used in the statement.
• The storage engine does not allow statement-based replication (e.g., the MySQL Cluster engine).

This list is, by necessity, incomplete: it is being extended as new constructions are discovered to be unsafe.
For a complete and accurate list, refer to the online MySQL Reference Manual.

Binary Log Management

The events mentioned thus far represent some real change of data that occurred on the master. However, things can happen that do not represent any change of data on the master but can affect replication. For example, if the server is stopped, it can potentially affect replication. No events can, of course, be written to the binary log when the server is not running, but this means that if anything is changed in the datafile, it will not be represented in the binary log. A typical example of this is restoring a backup, or otherwise manipulating the datafiles. Such changes are not replicated simply because the server is not running. However, the fact that the server stopped is sometimes represented in the binary log using a binary log event precisely to be able to recognize that there can be a "gap" in the binary log where things could have happened.

Events are needed for other purposes as well. Because the binary logs consist of multiple files, it is necessary to split the groups at convenient places to form the sequence of binlog files. To handle this safely, special events (rotate events) are added to the binary log.

The Binary Log and Crash Safety

As you have seen, changes to the binary log do not correspond to changes to the master databases on a one-to-one basis. It is important to keep the databases and the binary log mutually consistent in case of a crash. In other words, there should be no changes committed to the storage engine that are not written to the binary log, and vice versa.

Nontransactional engines introduce problems right away. For example, it is not possible to guarantee consistency between the binary log and a MyISAM table because MyISAM is nontransactional and the storage engine will carry through any requested change long before any attempt to log the statement.
But for transactional storage engines, MySQL includes measures to make sure that a crash does not cause the binary log to lose too much information.

As we described in "Logging Statements" on page 58, events are written to the binary log before releasing the locks on the table, but after all the changes have been given to the storage engine. So if there is a crash before the storage engine releases the locks, the server has to ensure that any changes recorded to the binary log are actually in the table on disk before allowing the statement (or transaction) to commit. This requires coordination with standard filesystem synchronization.

Because disk accesses are very expensive compared to memory accesses, operating systems are designed to cache parts of the file in a dedicated part of the main memory—usually called the page cache—and wait to write file data to disk until necessary. Writing to disk becomes necessary when another page must be loaded from disk and the page cache is full, but it can also be requested by an application by doing an explicit call to write the pages of a file to disk.

Recall from the earlier description of XA that when the first phase is complete, all data has to be written to durable storage—that is, to disk—for the protocol to handle crashes correctly. This means that every time a transaction is committed, the page cache has to be written to disk. This can be very expensive and, depending on the application, not always necessary. To control how often the data is written to disk, you can set the sync-binlog option. This option takes an integer specifying how often to write the binary log to disk. If the option is set to 5, for instance, the binary log will be written to disk on every fifth commit group of statements or transactions. The default value is 0, which means that the binary log is not explicitly written to disk by the server, but happens at the discretion of the operating system.
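For example, a deployment that prioritizes durability over raw throughput might add the following to its configuration file (an illustrative fragment; the value 1 is a common durability-first choice, not a requirement):

```ini
[mysqld]
# Sync the binary log to disk after every commit group (safest, slowest).
sync-binlog = 1
```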
Note that since the introduction of binary log group commit in MySQL version 5.6, there is a sync with each commit group and not with each transaction or statement. This means that with sync-binlog=1 several transactions will be written to disk in a batch. You can read more about binary log group commit in "Binary Log Group Commit" on page 94.

For storage engines that support XA, such as InnoDB, setting the sync-binlog option to 1 means that you will not lose any transactions under normal crashes. For engines that do not support XA, you might lose at most one transaction.

If, however, every group is written to disk, it means that performance suffers, usually a lot. Disk accesses are notoriously slow and caches are used for precisely the purpose of improving performance because they remove the need to always write data to disk. If you are prepared to risk losing a few transactions or statements—either because you can handle the work it takes to recover them manually or because it is not important for the application—you can set sync-binlog to a higher value or leave it at the default.

Binlog File Rotation

MySQL starts a new file to hold binary log events at regular intervals. For practical and administrative reasons, it wouldn't work to keep writing to a single file—operating systems have limits on file sizes. As mentioned earlier, the file to which the server is currently writing is called the active binlog file. Switching to a new file is called binary log rotation or binlog file rotation, depending on the context.

Four main activities cause a rotation:

The server stops
    Each time the server starts, it begins a new binary log. We'll discuss why shortly.

The binlog file reaches a maximum size
    If the binlog file grows too large, it will be automatically rotated. You can control the size of the binlog files using the max-binlog-size server variable.
The binary log is explicitly flushed
    The FLUSH LOGS command writes all logs to disk and creates a new file to continue writing the binary log. This can be useful when administering recovery images for PITR. Reading from an open binlog file can have unexpected results, so it is advisable to force an explicit flush before trying to use binlog files for recovery.

An incident occurred on the server
    In addition to stopping altogether, the server can encounter incidents that cause the binary log to be rotated. These incidents sometimes require special manual intervention from the administrator, because they can leave a "gap" in the replication stream. It is easier for the DBA to handle the incident if the server starts on a fresh binlog file after an incident.

The first event of every binlog file is the Format_description event, which describes the server that wrote the file along with information about the contents and status of the file. Three items are of particular interest here:

The binlog-in-use flag
    Because a crash can occur while the server is writing to a binlog file, it is critical to indicate when a file was closed properly. Otherwise, a DBA could replay a corrupted file on the master or slave and cause more problems. To provide assurance about the file's integrity, the binlog-in-use flag is set when the file is created and cleared after the final event (Rotate) has been written to the file. Thus, any program can see whether the binlog file was properly closed.

Binlog file format version
    Over the course of MySQL development, the format for the binary log has changed several times, and it will certainly change again. Developers increment the version number for the format when significant changes—notably changes to the common headers—render new files unreadable to previous versions of the server. (The current format, starting with MySQL version 5.0, is version 4.)
The binlog file format version field lists its version number; if a different server cannot handle a file with that version, it simply refuses to read the file.

Server version
    This is a string denoting the version of the server that wrote the file. The server version used to run the examples in this chapter was "5.5.31-0ubuntu0.12.04.1," for instance. As you can see, the string is guaranteed to include the MySQL server version, but it also contains additional information related to the specific build. In some situations, this information can help you or the developers figure out and resolve subtle bugs that can occur when replicating between different versions of the server.

To rotate the binary log safely even in the presence of crashes, the server uses a write-ahead strategy and records its intention in a temporary file called the purge index file (this name was chosen because the file is used while purging binlog files as well, as you will see). Its name is based on that of the index file, so for instance if the name of the index file is master-bin.index, the name of the purge index file is master-bin.~rec~. After creating the new binlog file and updating the index file to point to it, the server removes the purge index file.

    In versions of MySQL earlier than 5.1.43, rotation or binlog file purging could leave orphaned files; that is, the files might exist in the filesystem without being mentioned in the index file. Because of this, old files might not be purged correctly, leaving them around and requiring manual cleaning of the files from the directory. The orphaned files do not cause a problem for replication, but can be considered an annoyance. The procedure shown in this section ensures that no files are orphaned in the event of a crash.
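This write-ahead bookkeeping can be sketched as follows (a simplified, hypothetical Python illustration, not the server's actual implementation): anything recorded in the purge index file but no longer referenced by the binlog index is a leftover from an interrupted rotation or purge and can be cleaned up on restart.

```python
def leftover_files(purge_index, binlog_index):
    """Files the server intended to remove (listed in the purge index file)
    that are not referenced by the binlog index; on restart after a crash
    mid-rotation or mid-purge, these can be safely deleted."""
    referenced = set(binlog_index)
    return [name for name in purge_index if name not in referenced]

# A crash left 'master-bin.000001' scheduled for removal but still on disk:
print(leftover_files(
    purge_index=["master-bin.000001"],
    binlog_index=["master-bin.000002", "master-bin.000003"],
))
# → ['master-bin.000001']
```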
In the event of a crash, if a purge index file is present on the server, the server can compare the purge index file and the index file when it restarts and see what was actually accomplished compared to what was intended.

Before MySQL version 5.6 it was possible that the binary log could be left partially written. This could occur if only a part of the cache was written to the binary log before the server crashed. With MySQL version 5.6, the binary log will be trimmed and the partially written transaction removed from the binary log. This is safe because the transaction is not committed in the storage engine until it has been written completely to the binary log.

Incidents

The term "incidents" refers to events that don't change data on a server but must be written to the binary log because they have the potential to affect replication. Most incidents don't require special intervention from the DBA—for instance, servers can stop and restart without changes to database files—but there will inevitably be some incidents that call for special action.

Currently, there are two incident events that you might discover in a binary log:

Stop
    Indicates that the server was stopped through normal means. If the server crashed, no stop event will be written, even when the server is brought up again. This event is written in the old binlog file (restarting the server rotates to a new file) and contains only a common header; no other information is provided in the event.

    When the binary log is replayed on the slave, it ignores any Stop events. Normally, the fact that the server stopped does not require special attention and replication can proceed as usual. If the server was switched to a new version while it was stopped, this will be indicated in the next binlog file, and the server reading the binlog file will then stop if it cannot handle the new version of the binlog format.
In this sense, the Stop event does not represent a "gap" in the replication stream. However, the event is worth recording because someone might manually restore a backup or make other changes to files before restarting replication, and the DBA replaying the file could find this event in order to start or stop the replay at the right time.

Incident
    An event type introduced in version 5.1 as a generic incident event. In contrast with the Stop event, this event contains an identifier to specify what kind of incident occurred. It is used to indicate that the server was forced to perform actions that almost guarantee that changes are missing from the binary log.

    For example, incident events in version 5.1 are written if the database was reloaded or if a nontransactional event was too big to fit in the binlog file. MySQL Cluster generates this event when one of the nodes had to reload the database and could therefore be out of sync.

    When the binary log is replayed on the slave, it stops with an error if it encounters an Incident event. In the case of the MySQL Cluster reload event, it indicates a need to resynchronize the cluster and probably to search for events that are missing from the binary log.

Purging the Binlog File

Over time, the server will accumulate binlog files unless old ones are purged from the filesystem. The server can automatically purge old binary logs from the filesystem, or you can explicitly tell the server to purge the files.

To make the server automatically purge old binlog files, set the expire-logs-days option—which is available as a server variable as well—to the number of days that you want to keep binlog files. Remember that as with all server variables, this setting is not preserved between restarts of the server. So if you want the automatic purging to keep going across restarts, you have to add the setting to the my.cnf file for the server.
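For example, to keep one week of binlog files across restarts, the setting would go in the configuration file (an illustrative fragment; seven days is an arbitrary retention choice):

```ini
[mysqld]
# Automatically purge binlog files older than seven days
expire-logs-days = 7
```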
To purge the binlog files manually, use the PURGE BINARY LOGS command, which comes in two forms:

PURGE BINARY LOGS BEFORE datetime
    This form of the command will purge all files that are before the given date. If datetime is in the middle of a logfile (which it usually is), all files before the one holding datetime will be purged.

PURGE BINARY LOGS TO 'filename'
    This form of the command will purge all files that precede the given file. In other words, all files before filename in the output from SHOW MASTER LOGS will be removed, leaving filename as the first binlog file.

Binlog files are purged when the server starts or when a binary log rotation is done. If the server discovers files that require purging, either because a file is older than expire-logs-days or because a PURGE BINARY LOGS command was executed, it will start by writing the files that the server has decided are ripe for purging to the purge index file (for example, master-bin.~rec~). After that, the files are removed from the filesystem, and finally the purge index file is removed.

In the event of a crash, the server can continue removing files by comparing the contents of the purge index file and the index file and removing all files that were not removed because of a crash. As you saw earlier, the purge index file is used when rotating as well, so if a crash occurs before the index file can be properly updated, the new binlog file will be removed and then re-created when the rotate is repeated.

The mysqlbinlog Utility

One of the more useful tools available to an administrator is the client program mysqlbinlog. This is a small program that can investigate the contents of binlog files as well as relay logfiles (we will cover the relay logs in Chapter 8). In addition to reading binlog files locally, mysqlbinlog can also fetch binlog files remotely from other servers.
mysqlbinlog is a very useful tool when investigating problems with replication, but you can also use it to implement PITR, as demonstrated in Chapter 3.

The mysqlbinlog tool normally outputs the contents of the binary log in a form that can be executed by sending them to a running server. When statement-based replication is employed, the statements executed are emitted as SQL statements. For row-based replication, which will be introduced in Chapter 8, mysqlbinlog generates some additional data necessary to handle row-based replication. This chapter focuses entirely on statement-based replication, so we will use the command with options to suppress output needed to handle row-based replication.

Some options to mysqlbinlog will be explained in this section, but for a complete list, consult the online MySQL Reference Manual.

Basic Usage

Let's start with a simple example where we create a binlog file and then look at it using mysqlbinlog. We will start up a client connected to the master and execute the following commands to see how they end up in the binary log:

    mysqld1> RESET MASTER;
    Query OK, 0 rows affected (0.01 sec)

    mysqld1> CREATE TABLE employee (
        ->   id INT AUTO_INCREMENT,
        ->   name CHAR(64) NOT NULL,
        ->   email CHAR(64),
        ->   password CHAR(64),
        ->   PRIMARY KEY (id)
        -> );
    Query OK, 0 rows affected (0.00 sec)

    mysqld1> SET @password = PASSWORD('xyzzy');
    Query OK, 0 rows affected (0.00 sec)

    mysqld1> INSERT INTO employee(name,email,password)
        ->   VALUES ('mats','mats@example.com',@password);
    Query OK, 1 row affected (0.01 sec)

    mysqld1> SHOW BINARY LOGS;
    +--------------------+-----------+
    | Log_name           | File_size |
    +--------------------+-----------+
    | mysqld1-bin.000038 |       670 |
    +--------------------+-----------+
    1 row in set (0.00 sec)

Let's now use mysqlbinlog to dump the contents of the binlog file mysqld1-bin.000038, which is where all the commands ended up. The output shown in Example 4-17 has been edited slightly to fit the page.
Example 4-17. Output from execution of mysqlbinlog

    $ sudo mysqlbinlog \
    >   --short-form \
    >   --force-if-open \
    >   --base64-output=never \
    >   /var/lib/mysql1/mysqld1-bin.000038
    1  /*!40019 SET @@session.max_insert_delayed_threads=0*/;
    2  /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
    3  DELIMITER /*!*/;
    4  ROLLBACK/*!*/;
    5  use test/*!*/;
    6  SET TIMESTAMP=1264227693/*!*/;
    7  SET @@session.pseudo_thread_id=999999999/*!*/;
    8  SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1,
       @@session.unique_checks=1, @@session.autocommit=1/*!*/;
    9  SET @@session.sql_mode=0/*!*/;
    10 SET @@session.auto_increment_increment=1,
       @@session.auto_increment_offset=1/*!*/;
    11 /*!\C utf8 *//*!*/;
    12 SET @@session.character_set_client=8,@@session.collation_connection=8,
       @@session.collation_server=8/*!*/;
    13 SET @@session.lc_time_names=0/*!*/;
    14 SET @@session.collation_database=DEFAULT/*!*/;
    15 CREATE TABLE employee (
    16   id INT AUTO_INCREMENT,
    17   name CHAR(64) NOT NULL,
    18   email CHAR(64),
    19   password CHAR(64),
    20   PRIMARY KEY (id)
    21 ) ENGINE=InnoDB
    22 /*!*/;
    23 SET TIMESTAMP=1264227693/*!*/;
    24 BEGIN
    25 /*!*/;
    26 SET INSERT_ID=1/*!*/;
    27 SET @`password`:=_utf8 0x2A31353141463... COLLATE `utf8_general_ci`/*!*/;
    28 SET TIMESTAMP=1264227693/*!*/;
    29 INSERT INTO employee(name,email,password)
    30   VALUES ('mats','mats@example.com',@password)
    31 /*!*/;
    32 COMMIT/*!*/;
    33 DELIMITER ;
    34 # End of log file
    35 ROLLBACK /* added by mysqlbinlog */;
    36 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

To get this output, we use three options:

short-form
    With this option, mysqlbinlog prints only information about the SQL statements issued, and leaves out comments with information about the events in the binary log. This option is useful when mysqlbinlog is used only to play back the events to a server. If you want to investigate the binary log for problems, you will need these comments and should not use this option.
force-if-open
    If the binlog file is not closed properly, either because the binlog file is still being written to or because the server crashed, mysqlbinlog will print a warning that this binlog file was not closed properly. This option prevents the printing of that warning.

base64-output=never
    This prevents mysqlbinlog from printing base64-encoded events. If mysqlbinlog has to print base64-encoded events, it will also print the Format_description event of the binary log to show the encoding used. For statement-based replication, this is not necessary, so this option is used to suppress that event.

In Example 4-17, lines 1–4 contain the preamble printed in every output. Line 3 sets a delimiter that is unlikely to occur elsewhere in the file. The delimiter is also designed to appear as a comment in processing languages that do not recognize the setting of the delimiter. The rollback on line 4 is issued to ensure the output is not accidentally put inside a transaction because a transaction was started on the client before the output was fed into the client.

We can skip momentarily to the end of the output—lines 33–35—to see the counterpart to lines 1–4. They restore the values set in the preamble and roll back any open transaction. This is necessary in case the binlog file was truncated in the middle of a transaction, to prevent any SQL code following this output from being included in a transaction.

The USE statement on line 5 is printed whenever the database is changed. Even though the binary log specifies the current database before each SQL statement, mysqlbinlog shows only the changes to the current database. When a USE statement appears, it is the first line of a new event.

The first line that is guaranteed to be in the output for each event is SET TIMESTAMP, as shown on lines 6 and 23. This statement gives the timestamp when the event started executing in seconds since the epoch.
Lines 7–14 contain general settings, but like USE on line 5, they are printed only for the first event and whenever their values change.

Because the INSERT statement on lines 29–30 is inserting into a table with an autoincrement column using a user-defined variable, the INSERT_ID session variable on line 26 and the user-defined variable on line 27 are set before the statement. This is the result of the Intvar and User_var events in the binary log.

If you omit the short-form option, each event in the output will be preceded by some comments about the event that generated the lines. You can see these comments, which start with hash marks (#), in Example 4-18.

Example 4-18. Interpreting the comments in mysqlbinlog output

    $ sudo mysqlbinlog \
    >   --force-if-open \
    >   --base64-output=never \
    >   /var/lib/mysql1/mysqld1-bin.000038
    ...
    1  # at 386
    2  #100123 7:21:33 server id 1 end_log_pos 414 Intvar
    3  SET INSERT_ID=1/*!*/;
    4  # at 414
    5  #100123 7:21:33 server id 1 end_log_pos 496 User_var
    6  SET @`password`:=_utf8 0x2A313531...838 COLLATE `utf8_general_ci`/*!*/;
    7  # at 496
    8  #100123 7:21:33 server id 1 end_log_pos 643 Query thread_id=6
       exec_time=0 error_code=0
    9  SET TIMESTAMP=1264227693/*!*/;
    10 INSERT INTO employee(name,email,password)
    11   VALUES ('mats','mats@example.com',@password)
    12 /*!*/;
    13 # at 643
    14 #100123 7:21:33 server id 1 end_log_pos 670 Xid = 218
    15 COMMIT/*!*/;
    16 DELIMITER ;
    17 # End of log file
    18 ROLLBACK /* added by mysqlbinlog */;
    19 /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

The first line of the comment gives the byte position of the event, and the second line contains other information about the event. Consider, for example, the INSERT statement line:

    # at 496
    #100123 7:21:33 server id 1 end_log_pos 643 Query thread_id=6
    exec_time=0 error_code=0

The various parts of the comments have the following meanings:

at 496
    The byte position where the event starts (i.e., the first byte of the event).
100123 7:21:33
    The timestamp of the event as a datetime (date plus time). This is the time when the query started executing or when the events were written to the binary log.

server id 1
    The server ID of the server that generated the event. This server ID is used to set the pseudo_thread_id session variable, and a line setting this variable is printed if the event is thread-specific and the server ID is different from the previously printed ID.

end_log_pos 643
    The byte position of the event that follows this event. By taking the difference between this value and the position where the event starts, you can get the length of the event.

Query
    The type of event. In Example 4-18, you can see several different types of events, such as User_var, Intvar, and Xid.

The fields after these are event-specific, and hence different for each event. For the Query event, we can see two additional fields:

thread_id=6
    The ID of the thread that executed the event. This is used to handle thread-specific queries, such as queries that access temporary tables.

exec_time=0
    The execution time of the query in seconds.

Example 4-17 and Example 4-18 dump the output of a single file, but mysqlbinlog accepts multiple files as well. If several binlog files are given, they are processed in order. The files are printed in the order you request them, and there is no checking that the Rotate event ending each file refers to the next file in sequence. The responsibility for ensuring that these binlog files make up part of a real binary log lies on the user.

Thanks to the way the binlog files are named, submitting multiple files to mysqlbinlog—such as by using * as a file-globbing wildcard—is usually not a problem.
However, let's look at what happens when the binlog file counter, which is used as an extension to the filename, goes from 999999 to 1000000:

    $ ls mysqld1-bin.[0-9]*
    mysqld1-bin.000007  mysqld1-bin.000011  mysqld1-bin.000039
    mysqld1-bin.000008  mysqld1-bin.000035  mysqld1-bin.1000000
    mysqld1-bin.000009  mysqld1-bin.000037  mysqld1-bin.999998
    mysqld1-bin.000010  mysqld1-bin.000038  mysqld1-bin.999999

As you can see, the last binlog file to be created is listed before the two binlog files that are earlier in binary log order. So it is worth checking the names of the files before you use wildcards.

Since your binlog files are usually pretty large, you won't want to print the entire contents of the binlog files and browse them. Instead, there are a few options you can use to limit the output so that only a range of the events is printed.

start-position=bytepos
    The byte position of the first event to dump. Note that if several binlog files are supplied to mysqlbinlog, this position will be interpreted as the position in the first file in the sequence.

    If an event does not start at the position given, mysqlbinlog will still try to interpret the bytes starting at that position as an event, which usually leads to garbage output.

stop-position=bytepos
    The byte position of the last event to print. If no event ends at that position, the last event printed will be the event with a position that precedes bytepos. If multiple binlog files are given, the position will be the position of the last file in the sequence.

start-datetime=datetime
    Prints only events that have a timestamp at or after datetime. This will work correctly when multiple files are given—if all events of a file are before the datetime, all events will be skipped—but there is no checking that the events are printed in order according to their timestamps.

stop-datetime=datetime
    Prints only events that have a timestamp before datetime.
This is an exclusive range, meaning that if an event is marked 2010-01-24 07:58:32 and that exact datetime is given, the event will not be printed.

    Note that since the timestamp of the event uses the start time of the statement but events are ordered in the binary log based on the commit time, it is possible to have events with a timestamp that comes before the timestamp of the preceding event. Since mysqlbinlog stops at the first event with a timestamp outside the range, there might be events that aren't displayed because they have timestamps before datetime.

Reading remote files

In addition to reading files on a local filesystem, the mysqlbinlog utility can read binlog files from a remote server. It does this by using the same mechanism that the slaves use to connect to a master and ask for events. This can be practical in some cases, since it does not require a shell account on the machine to read the binlog files, just a user on the server with REPLICATION SLAVE privileges.

To handle remote reading of binlog files, include the read-from-remote-server option along with a host and user for connecting to the server, and optionally a port (if different from the default) and a password. When reading from a remote server, give just the name of the binlog file, not the full path.
So to read the Query event from Example 4-18 remotely, the command would look something like the following (the server prompts for a password, but it is not echoed when you enter it):

    $ sudo mysqlbinlog
    >   --read-from-remote-server
    >   --host=master.example.com
    >   --base64-output=never
    >   --user=repl_user --password
    >   --start-position=294
    >   mysqld1-bin.000038
    Enter password:
    /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
    /*!40019 SET @@session.max_insert_delayed_threads=0*/;
    /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
    DELIMITER /*!*/;
    # at 294
    #130608 22:09:19 server id 1  end_log_pos 0  Start: binlog v 4, server v 5.5.31-0ubuntu0.12.04.1-log created 130608 22:09:19
    # at 294
    #130608 22:13:08 server id 1  end_log_pos 362  Query  thread_id=53  exec_time=0  error_code=0
    SET TIMESTAMP=1370722388/*!*/;
    SET @@session.pseudo_thread_id=53/*!*/;
    SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0...
    SET @@session.sql_mode=0/*!*/;
    SET @@session.auto_increment_increment=1, @@session.auto_increment_offset...
    /*!\C utf8 *//*!*/;
    SET @@session.character_set_client=33,@@session.collation_connection=33...
    SET @@session.lc_time_names=0/*!*/;
    SET @@session.collation_database=DEFAULT/*!*/;
    BEGIN
    /*!*/;
    # at 362
    #130608 22:13:08 server id 1  end_log_pos 390  Intvar
    SET INSERT_ID=1/*!*/;
    # at 390
    #130608 22:13:08 server id 1  end_log_pos 472  User_var
    SET @`password`:=_utf8 0x2A31353141463642384333413641413039434643434244333...
    # at 472
    #130608 22:13:08 server id 1  end_log_pos 627  Query  thread_id=53  exec_time=0
    use `test`/*!*/;
    SET TIMESTAMP=1370722388/*!*/;
    INSERT INTO employee(name, email, password)
      VALUES ('mats', 'mats@example.com', @password)
    /*!*/;
    # at 627
    #130608 22:13:08 server id 1  end_log_pos 654  Xid = 175
    COMMIT/*!*/;
    DELIMITER ;
    # End of log file
    ROLLBACK /* added by mysqlbinlog */;
    /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
    /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Reading raw binary logfiles

The mysqlbinlog utility is very useful for investigating binary logs, but it can also be used for taking backups of binlog files. This is very useful when you do not have normal shell access to the machine but do have REPLICATION SLAVE privileges. In this case, you can read the binary logfiles from the server, but they should not be parsed, just saved to files. The normal way to use mysqlbinlog for taking a backup of binlog files remotely is:

    mysqlbinlog --raw --read-from-remote-server \
        --host=master.example.com --user=repl_user \
        master-bin.000012 master-bin.000013 ...

In this example, the binlog files master-bin.000012 and master-bin.000013 will be read and saved in the current directory. Note that you have to use --read-from-remote-server together with --raw. Using --raw without --read-from-remote-server is pointless, since that would be the same as using a plain file copy. The most interesting options to control the behavior of mysqlbinlog are described below. You can find a detailed description of how to use mysqlbinlog for backups in the MySQL Reference Manual.

--result-file=prefix
    This option gives the prefix to use when constructing the files to write to. The prefix can be a directory name (with trailing slash), or any other prefix. It defaults to the empty string, so if this option is not used, files will be written using the same names as they have on the master.
--to-last-log
    Normally, only the files given on the command line are transferred, but if this option is provided, only the starting binary log file has to be given. After that, mysqlbinlog will transfer all files after the first one.

--stop-never
    Do not stop reading after reaching the end of the last log; keep waiting for more input. This option is useful when taking a backup that is to be used for point-in-time recovery. See “Backup and MySQL Replication” on page 570 for a detailed treatment of backing up for point-in-time recovery.

Interpreting Events

Sometimes, the standard information printed by mysqlbinlog is not sufficient for spotting a problem, so it is necessary to go into the details of the event and investigate its content. To handle such situations, you can pass the --hexdump option to tell mysqlbinlog to write the actual bytes of the events.

Before going into the details of the events, here are some general rules about the format of the data in the binary log:

Integer data
    Integer fields in the binary log are printed in little-endian order, so you have to read integer fields backward. This means that, for example, the 32-bit block 03 01 00 00 represents the hexadecimal number 103 (259 decimal).

String data
    String data is usually stored both with length data and null-terminated. Sometimes the length data appears just before the string, and sometimes it is stored in the post header.

This section will cover the most common events, but an exhaustive reference concerning the format of all the events is beyond the scope of this book. Check the Binary Log section in the MySQL Internals Manual for an exhaustive list of all the events available and their fields.

The most common of all the events is the Query event, so let’s concentrate on it first. Example 4-19 shows the output for such an event.

Example 4-19.
Output when using the --hexdump option

    $ sudo mysqlbinlog \
    >   --force-if-open \
    >   --hexdump \
    >   --base64-output=never \
    >   /var/lib/mysql1/mysqld1-bin.000038
    .
    .
    .
    1  # at 496
    2  #100123  7:21:33 server id 1  end_log_pos 643
    3  # Position  Timestamp    Type  Master ID     Size         Master Pos   Flags
    4  #      1f0  6d 95 5a 4b  02    01 00 00 00   93 00 00 00  83 02 00 00  10 00
    5  #      203  06 00 00 00 00 00 00 00 04 00 00 1a 00 00 00 40 |................|
    6  #      213  00 00 01 00 00 00 00 00 00 00 00 06 03 73 74 64 |.............std|
    7  #      223  04 08 00 08 00 08 00 74 65 73 74 00 49 4e 53 45 |.......test.INSE|
    8  #      233  52 54 20 49 4e 54 4f 20 75 73 65 72 28 6e 61 6d |RT.INTO.employee|
    9  #      243  65 2c 65 6d 61 69 6c 2c 70 61 73 73 77 6f 72 64 |.name.email.pass|
    10 #      253  29 0a 20 20 56 41 4c 55 45 53 20 28 27 6d 61 74 |word....VALUES..|
    11 #      263  73 27 2c 27 6d 61 74 73 40 65 78 61 6d 70 6c 65 |.mats...mats.exa|
    12 #      273  2e 63 6f 6d 27 2c 40 70 61 73 73 77 6f 72 64 29 |mple.com...passw|
    13 #      283  6f 72 64 29                                     |ord.|
    14 # Query  thread_id=6  exec_time=0  error_code=0
    SET TIMESTAMP=1264227693/*!*/;
    INSERT INTO employee(name,email,password)
      VALUES ('mats','mats@example.com',@password)

The first two lines and line 14 are comments listing basic information that we discussed earlier. Notice that when you use the --hexdump option, the general information and the event-specific information are split into two lines, whereas they are merged in the normal output.

Lines 3 and 4 list the common header:

Timestamp
    The timestamp of the event as an integer, stored in little-endian format.

Type
    A single byte representing the type of the event. Some event types are given in the MySQL Internals Manual, but to get the values for your specific server you need to look in the source code (in the file sql/log_event.h).

Master ID
    The server ID of the server that wrote the event, written as an integer. For the event shown in Example 4-19, the server ID is 1.

Size
    The size of the event in bytes, written as an integer.
Master Pos
    The same as end_log_pos (i.e., the start of the event following this event).

Flags
    This field has 16 bits reserved for general flags concerning the event. The field is mostly unused, but it stores the binlog-in-use flag. As you can see in Example 4-19, the binlog-in-use flag is set, meaning that the binary log was not closed properly (in this case, because we didn’t flush the logs before calling mysqlbinlog).

After the common header come the post header and body for the event. As already mentioned, an exhaustive coverage of all the events is beyond the scope of this book, but we will cover the most important and commonly used events: the Query and Format_description log events.

Query event post header and body

The Query event is by far the most used and also the most complicated event issued by the server. Part of the reason is that it has to carry a lot of information about the context of the statement when it was executed. As already demonstrated, integer variables, user variables, and random seeds are covered using specific events, but it is also necessary to provide other information, which is part of this event.

The post header for the Query event consists of five fields. Recall that these fields are of fixed size and that the length of the post header is given in the Format_description event for the binlog file, meaning that later MySQL versions may add additional fields if the need should arise.

Thread ID
    A four-byte unsigned integer representing the ID of the thread that executed the statement. Even though the thread ID is not always necessary to execute the statement correctly, it is always written into the event.

Execution time
    The number of seconds from the start of execution of the query to when it was written to the binary log, expressed as a four-byte unsigned integer.

Database name length
    The length of the database name, stored as an unsigned one-byte integer.
    The database name is stored in the event body, but the length is given here.

Error code
    The error code resulting from execution of the statement, stored as a two-byte unsigned integer. This field is included because, in some cases, statements have to be logged to the binary log even when they fail.

Status variables length
    The length of the block in the event body storing the status variables, stored as a two-byte unsigned integer. This status block is sometimes used with a Query event to store various status variables, such as SQL_MODE.

The event body consists of the following fields, which are all of variable length.

Status variables
    A sequence of status variables. Each status variable is represented by a single integer followed by the value of the status variable. The interpretation and length of each status variable value depend on which status variable it concerns. Status variables are not always present; they are added only when necessary. Some examples of status variables follow:

    Q_SQL_MODE_CODE
        The value of SQL_MODE used when executing the statement.

    Q_AUTO_INCREMENT
        This status variable contains the values of auto_increment_increment and auto_increment_offset used for the statement, assuming that they are not the default of 1.

    Q_CHARSET
        This status variable contains the character set code and collation used by the connection and the server when the statement was executed.

Current database
    The name of the current database, stored as a null-terminated string. Notice that the length of the database name is given in the post header.

Statement text
    The statement that was executed. The length of the statement can be computed from the information in the common header and the post header. This statement is normally identical to the original statement written, but in some cases, the statement is rewritten before it is stored in the binary log.
    For instance, as you saw earlier in this chapter, triggers and stored procedures are stored with DEFINER clauses specified.

Format description event post header and body

The Format_description event records important information about the binlog file format, the event format, and the server. Because it has to remain robust between versions—it should still be possible to interpret it even if the binlog format changes—there are some restrictions on which changes are allowed.

One of the more important restrictions is that the common header of both the Format_description event and the Rotate event is fixed at 19 bytes. This means that it is not possible to extend the event with new fields in the common header.

The post header and event body for the Format_description event contain the following fields:

Binlog file version
    The version of the binlog file format used by this file. For MySQL versions 5.0 and later, this is 4.

Server version string
    A 50-byte string storing server version information. This is usually the three-part version number followed by information about the options used for the build—“5.5.31-0ubuntu0.12.04.1,” for instance.

Creation time
    A four-byte integer holding the creation time—the number of seconds since the epoch—of the first binlog file written by the server since startup. For later binlog files written by the server, this field will be zero.

    This scheme allows a slave to determine that the server was restarted and that the slave should reset state and temporary data—for example, close any open transactions and drop any temporary tables it has created.

Common header length
    The length of the common header for all events in the binlog file except the Format_description and Rotate events. As described earlier, the length of the common header for the Format_description and Rotate events is fixed at 19 bytes.

Post-header lengths
    This is the only variable-length field of the Format_description log event.
    It holds an array containing the size of the post header for each event in the binlog file as a one-byte integer. The value 255 is reserved as the length for the field, so the maximum length of a post header is 254 bytes.

Binary Log Options and Variables

A set of options and variables allow you to configure a vast number of aspects of binary logging. Several options control such properties as the names of the binlog files and the index file. Most of these options can be manipulated as server variables as well. Some have already been mentioned earlier in the chapter, but here you will find more details on each:

expire-logs-days=days
    The number of days that binlog files should be kept. Files that are older than the specified number will be purged from the filesystem when the binary log is rotated or the server restarts. By default this option is 0, meaning that binlog files are never removed.

log-bin[=basename]
    The binary log is turned on by adding the log-bin option in the my.cnf file, as explained in Chapter 3. In addition to turning on the binary log, this option gives a base name for the binlog files; that is, the portion of the filename before the dot. If an extension is provided, it is removed when forming the base name of the binlog files.

    If the option is specified without a basename, the base name defaults to host-bin, where host is the base name—that is, the filename without directory or extension—of the file given by the pid-file option, which is usually the hostname as given by gethostname(2). For example, if pid-file is /usr/run/mysql/master.pid, the default names of the binlog files will be master-bin.000001, master-bin.000002, etc.

    Since the default value for the pid-file option includes the hostname, it is strongly recommended that you give a value to the log-bin option. Otherwise, the binlog files will change names when the hostname changes (unless pid-file is given an explicit value).
log-bin-index[=filename]
    Gives a name to the index file. This can be useful if you want to place the index file in a different place from the default.

    The default is the same as the base name used for log-bin. For example, if the base name used to create binlog files is master-bin, the index file will be named master-bin.index.

    Similar to the situation for the log-bin option, the hostname will be used for constructing the index filename, meaning that if the hostname changes, replication will break. For this reason, it is strongly recommended that you provide a value for this option.

log-bin-trust-function-creators
    When creating stored functions, it is possible to create specially crafted functions that allow arbitrary data to be read and manipulated on the slave. For this reason, creating stored functions requires the SUPER privilege. However, since stored functions are very useful in many circumstances, it might be that the DBA trusts anyone with CREATE ROUTINE privileges not to write malicious stored functions. For this reason, this option makes it possible to disable the SUPER privilege requirement for creating stored functions (but CREATE ROUTINE is still required).

binlog-cache-size=bytes
    The size of the in-memory part of the transaction cache in bytes. The transaction cache is backed by disk, so whenever the size of the transaction cache exceeds this value, the remaining data will go to disk. This can potentially create a performance problem, so increasing the value of this option can improve performance if you use many large transactions.

    Note that just allocating a very large buffer might not be a good idea, because that means that other parts of the server get less memory, which might cause performance degradation.

max-binlog-cache-size=bytes
    Use this option to restrict the size of each transaction in the binary log.
    Since large transactions can potentially block the binary log for a long time, they will cause other threads to convoy on the binary log and can therefore create a significant performance problem. If the size of a transaction exceeds bytes, the statement will be aborted with an error.

max-binlog-size=bytes
    Specifies the size of each binlog file. When writing a statement or transaction would exceed this value, the binlog file is rotated and writing proceeds in a new, empty binlog file.

    Notice that if the transaction or statement exceeds max-binlog-size, the binary log will be rotated, but the transaction will be written to the new file in its entirety, exceeding the specified maximum. This is because transactions are never split between binlog files.

sync-binlog=period
    Specifies how often to write the binary log to disk using fdatasync(2). The value given is the number of transaction commits for each real call to fdatasync(2). For instance, if a value of 1 is given, fdatasync(2) will be called for each transaction commit, and if a value of 10 is given, fdatasync(2) will be called after every 10 transaction commits.

    A value of zero means that there will be no calls to fdatasync(2) at all and that the server trusts the operating system to write the binary log to disk as part of the normal file handling.

read-only
    Prevents any client threads—except the slave thread and users with SUPER privileges—from updating any data on the server (this does not include temporary tables, which can still be updated). This is useful on slave servers to allow replication to proceed without data being corrupted by clients that connect to the slave.

Options for Row-Based Replication

Use the following options to configure row-based replication:

binlog-format
    The binlog-format option can be set to one of the following modes:

    STATEMENT
        This will use the traditional statement-based replication for all statements.
    ROW
        This will use the shiny new row-based replication for all statements that insert or change data (data manipulation language, or DML, statements). However, statement-based replication must still be used for statements that create tables or otherwise alter the schema (data definition language, or DDL, statements).

    MIXED
        This is intended to be a safe version of statement-based replication and is the recommended mode to use with MySQL version 5.1 and later. In mixed-mode replication, the server writes the statements to the binary log as statements, but switches to row-based replication if a statement is considered unsafe through one of the criteria we have discussed in this chapter.

    The variable also exists as a global server variable and as a session variable. When a new session is started, the global value is copied to the session variable, and then the session variable is used to decide how to write statements to the binary log.

binlog-max-row-event-size
    Use this option to specify when to start a new event for holding rows. Because the events are read fully into memory when being processed, this option is a rough way of controlling the size of row-holding events so that not too much memory is used when processing the rows.

binlog-rows-query-log-events (new in MySQL 5.6)
    This option causes the server to add an informational event to the binary log before the row events. The informational event contains the original query that generated the rows.

    Note that, because unrecognized events cause the slave to stop, any slave server before MySQL 5.6.2 will not recognize the event and hence will stop replicating from the master. This means that if you intend to use informational events, you need to upgrade all slaves to MySQL 5.6.2 (or later) before enabling binlog-rows-query-log-events.

    Starting with MySQL 5.6.2, informational events can be added for other purposes as well, but they will not cause the slave to stop.
    They are intended to allow information to be added to the binary log for those readers (including slaves) that can have use for it, but they should not in any way change the semantics of execution and can therefore be safely ignored by slaves that do not recognize them.

Conclusion

Clearly, there is much to the binary log—including its use, composition, and techniques. We presented these concepts and more in this chapter, including how to control the binary log behavior. The material in this chapter builds a foundation for a greater understanding of the mechanics of the binary log and its importance in logging changes to data.

Joel opened an email message from his boss that didn’t have a subject. “I hate it when people do that,” he thought. Mr. Summerson’s email messages were like his taskings—straight and to the point. The message read, “Thanks for recovering that data for the marketing people. I’ll expect a report by tomorrow morning. You can send it via email.”

Joel shrugged and opened a new email message, careful to include a meaningful subject. He wondered what level of detail to include and whether he should explain what he learned about the binary log and the mysqlbinlog utility. After a moment of contemplation, he included as many details as he could. “He’ll probably tell me to cut it back to a bulleted list,” thought Joel. That seemed like a good idea, so he wrote a two-sentence summary and a few bullet points and moved them to the top of the message. When he was finished, he sent it on its way to his boss. “Maybe I should start saving these somewhere in case I have to recount something,” he mused.

CHAPTER 5
Replication for High Availability

Joel was listening to his iPod when he noticed his boss standing directly in front of his desk. He took off his headphones and said, “Sorry, sir.” Mr. Summerson smiled and said, “No problem, Joel.
I need you to figure out some way to ensure we can keep our replicated servers monitored so that we don’t lose data and can minimize downtime. We’re starting to get some complaints from the developers that the system is too inflexible. I can deal with the developers, but the support people tell me that when we have a failure it takes too long to recover. I’d like you to make that your top priority.”

Joel nodded. “Sure, I’ll look at load balancing and improving our recovery efforts in replication.”

“Excellent. Give me a report on what you think we need to do to solve this problem.”

Joel watched his boss leave his office. “OK, let’s find out what this high availability chapter has to say,” he thought, as he opened his favorite MySQL book.

Buying expensive machines known for their reliability and ensuring that you have a really good UPS in case of power failures should give you a highly available system. Right? Well, high availability is actually not that easy to achieve. To have a system that is truly available all the time, you have to plan carefully for any contingency and ensure that you have redundancy to handle failing components. True high availability—a system that does not go down even in the most unexpected circumstances—is hard to achieve and very costly.

The basic principles for achieving high availability are simple enough; implementing the measures is the tricky part. You need to have three things in place to ensure high availability:

Redundancy
    If a component fails, you have to have a replacement for it. The replacement can be either idly standing by or part of the existing deployment.

Contingency plans
    If a component fails, you have to know what to do. This depends on which component failed and how it failed.

Procedure
    If a component fails, you have to be able to detect it and then execute your plans swiftly and efficiently.
If the system has a single component whose failure will cause the entire system to fail, the system has a single point of failure. A single point of failure puts a severe limit on your ability to achieve high availability, which means that one of your first goals is to locate these single points of failure and ensure you have redundancy for them.

Redundancy

To understand where redundancy might be needed, you have to identify every potential point of failure in the deployment. Even though it sounds easy—not to mention a tad tedious and boring—it requires some imagination to ensure that you really have found them all.

Switches, routers, network cards, and even network cables are single points of failure. Outside of your architecture, but no less important, are power sources and physical facilities. But what about services needed to keep the deployment up? Suppose all network management is consolidated in a web-based interface? Or what if you have only one staff person who knows how to handle some types of failure?

Identifying the points of failure does not necessarily mean that you have to eliminate them all. Sometimes it is just not possible for economic, technical, or geographic reasons, but being aware of them helps you with planning.

Some things that you should consider, or at least make a conscious decision about whether to consider, are the cost of duplicating components, the probability of failure for different components, the time to replace a component, and the risk exposure while repairing a component. If repairing a component takes a week and you are running with the spare as the single point of failure during this time, you are taking a certain risk that the spare could be lost as well, which may or may not be acceptable.
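That kind of repair-window risk is easy to put a number on. The sketch below computes the chance that the remaining spare also fails while the repair is under way; the per-day failure probability is an assumed, illustrative figure, and the one-week window comes from the example above:

```python
def risk_during_repair(p_daily: float, repair_days: int) -> float:
    """Probability that the remaining spare also fails at least once
    during the repair window, assuming independent failures with
    probability p_daily on each day."""
    return 1 - (1 - p_daily) ** repair_days

# Illustrative numbers: a 1% chance of failure on any given day,
# and a one-week repair window for the broken component.
print(f"{risk_during_repair(0.01, 7):.1%}")
```

Even a modest daily failure rate adds up over a week, which is why running on the spare alone for that long is a risk you should make a conscious decision about.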
Once you have identified where you need redundancy, you have to choose between two fundamental alternatives: you can either keep duplicates around for each component—ready to take over immediately if the original component should fail—or you can ensure you have extra capacity in the system so that if a component fails, you can still handle the load. This choice does not have to be made in an all-or-nothing fashion: you can combine the two techniques, duplicating some components and using extra capacity for other parts of the system.

On the surface, the easiest approach is to duplicate components, but duplication is expensive. You have to leave a standby around and keep it up-to-date with the main component all the time. The advantages of duplicating components are that you do not lose performance when switching and that switching to the standby is usually faster than restructuring the system, which you would have to do if you approached the problem by creating spare capacity.

Creating spare capacity lets you use all the components for running the business, possibly allowing you to handle higher peaks in your load. When a component breaks, you restructure the system so that all remaining components are in use. It is, however, important to have more capacity than you normally need.

To understand why, consider a simple case where you have a master that handles the writes—actually, you should have two, because you need redundancy—with a set of slaves connected to the master whose only purpose is to serve read requests.

Should one of the slaves fail, the system will still be responding, but the capacity of the system will be reduced. If you have 10 slaves, each running at 50% capacity, the failure of one slave will increase the load on each remaining slave to 55%, which is easy to handle.
However, if the slaves are running at 95% capacity and one of the slaves fails, each remaining server would have to handle 105% of the original load, which is clearly not possible. In this case, the read capacity of the system will be reduced and the response time will be longer.

And planning for the loss of one server is not sufficient: you have to consider the probability of losing more than one server and prepare for that situation as well. Continuing with our previous example, even if each server is running at 80% capacity, the system will be able to handle the loss of one server. However, the loss of two servers means that the load on each remaining server will increase to 100%, leaving you with no room for unexpected bursts in traffic. If this occurs once a year, it might be manageable, but you have to know how often it is likely to happen.

Table 5-1 gives example probabilities for losing 1, 2, or 3 servers in a setup of 100 servers, given different probabilities of losing a single server. As you can see, with a 1% probability of losing a server, you have a 16% risk of losing three or more servers. If you are not prepared to handle that, you’re in for some problems if it actually happens.

For a stochastic variable X representing the number of servers lost, the probabilities are calculated using the binomial tail distribution, where C(n, k) is the binomial coefficient:

    P(X ≥ k) = C(n, k) p^k

Table 5-1. Probabilities of losing servers

    Probability of losing    1 or more    2 or more    3 or more
    a single server
    1.00%                    100.00%      49.50%       16.17%
    0.50%                     50.00%      12.38%        2.02%
    0.10%                     10.00%       0.50%        0.02%

To avoid such a situation, you have to monitor the deployment closely to know what the load is, figure out the capacity of your system through measurements, and do the math to see where the response times will start to suffer.

Planning

Having redundancy is not sufficient; you also need to have plans for what to do when the components fail.
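As part of that planning math, the figures in Table 5-1 can be reproduced with a few lines of Python (a sketch; it uses the single-term binomial formula behind the table and the 100-server deployment from the example):

```python
from math import comb

def p_at_least(n: int, k: int, p: float) -> float:
    """Probability of losing k or more servers out of n, using the
    leading binomial term C(n, k) * p^k, as in Table 5-1."""
    return comb(n, k) * p ** k

# Reproduce Table 5-1 for a deployment of 100 servers.
for p in (0.01, 0.005, 0.001):
    row = "  ".join(f"{p_at_least(100, k, p):7.2%}" for k in (1, 2, 3))
    print(f"{p:.2%}  {row}")
```

Plugging in your own deployment size and measured failure rate tells you how many simultaneous failures your capacity planning needs to absorb.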
In the previous example, it is easy to handle a failing slave, because new connections will be redirected to the working slaves, but consider the following:

• What happens with the existing connections? Just aborting and returning an error message to the user is probably not a good idea. Typically, there is an application layer between the user and the database, so in this case the application layer has to retry the query with another server.

• What happens if the master fails? In the previous example, only the slaves failed, but the master can also fail. Assuming you have added redundancy by keeping an extra master around (we will cover how to do that later in the chapter), you must also have plans for moving all the slaves over to the new master.

This chapter will cover some of the techniques and topologies that you can use to handle various situations for failing MySQL servers. There are basically three types of server failure to consider: master failures, slave failures, and relay failures. Slave failures are just failures of slaves that are used for read scale-out. Slaves that also act as masters are relay slaves and need special care. Master failures are the most important ones to handle quickly, because the deployment will be unavailable until the master is restored.

Slave Failures

By far, the easiest failures to handle are slave failures. Because the slaves are used only for read queries, it is sufficient to inform the load balancer that the slave is missing, and the load balancer will direct new queries to the functioning slaves. There have to be enough slaves to handle the reduced capacity of the system, but apart from that, a failing slave does not normally affect the replication topology, and there are no specific topologies that you need to consider to make slave failure easier to manage.

When a slave has failed, there are inevitably some queries that have been sent to the slave that are waiting for a reply.
Once these connections report an error resulting from a lost server, the queries have to be repeated with a functioning slave.

Master Failures

If the master fails, it has to be replaced to keep the deployment up, and it has to be replaced quickly. The moment the master fails, all write queries will be aborted, so the first thing to do is to get a new master available and direct all clients to it.

Because the main master failed, all the slaves are now without a master as well, meaning that all the slaves have stale data, but they are still up and can reply to read queries. However, some queries may block if they are waiting for changes to arrive at the slave. Some queries may make it into the relay log of the slave and therefore will eventually be executed by the slave; no special consideration has to be taken on behalf of these queries. For queries that are waiting for events that did not leave the master before it crashed, the situation is bleaker. In this case, it is necessary to ensure they are handled. This usually means they are reported as failures, so the user will have to reissue the query.

Relay Failures

For servers acting as relay servers, the situation has to be handled specially. If they fail, the remaining slaves have to be redirected to use some other relay or the master itself. Because the relay has been added to relieve the master of some load, it is likely that the master will not be able to handle the load of a batch of slaves connected to one of its relays.

Disaster Recovery

In the world of high availability, "disaster" does not have to mean earthquakes or floods; it just means that something went very bad for the computer and it is not local to the machine that failed. Typical examples are lost power in the data center: not necessarily because the power was lost in the city; just losing power in the building is sufficient.
The nature of a disaster is that many things fail at once, making it impossible to handle redundancy by duplicating servers at a single data center. Instead, it is necessary to ensure data is kept safe at another geographic location, and it is quite common for companies to ensure high availability by having different components at different offices, even when the company is relatively small.

Procedures

After you have eliminated all single points of failure, ensured you have sufficient redundancy for the system, and made plans for every contingency, you should be ready for the last step. All your resources and careful planning are of no use unless you can wield them properly. You can usually manage a small site with a few servers manually with very little planning, but as the number of servers increases, automation becomes a necessity; and if you run a successful business, the number of servers might have to increase quickly. You're likely better off if you plan for automation from day one: if you have to grow, you will be busy handling other matters and will probably not have time to create the necessary automation support.

Some of the basic procedures have already been discussed, but you need to consider having ready-made procedures for at least the following tasks:

Adding new slaves
Creating new slaves when you need to scale is the basis for running a big site. There are several options for creating new slaves. They all center around methods for taking a snapshot of an existing server (usually a slave), restoring the snapshot on a new server, and then starting replication from the correct position. The time it takes to create the snapshot will, of course, affect how quickly you can bring the new slave up; if the backup time is too long, the master may have issued a lot of changes in the meantime, which means that the new slave will take longer to catch up. For this reason, the snapshot time is important. Figure 5-1 illustrates how outstanding changes build up while the snapshot is being taken and are worked off once the slave has caught up.
You can see that when the slave is stopped to take a snapshot, changes begin to accumulate, causing the number of outstanding changes to increase. Once the slave is restarted, it starts to apply the outstanding changes and their number decreases.

Figure 5-1. Outstanding changes when taking a snapshot

Some different methods of taking a snapshot include the following:

Using mysqldump
Using mysqldump is safe but slow. If you use InnoDB tables, mysqldump has options that allow you to take a consistent snapshot, meaning you do not have to bring the server offline. There are also options that allow you to record the master and slave positions for the snapshot so that replication will start at the right position.

Copying the database files
This is relatively fast, but requires you to bring the server offline before copying the files. It also requires you to manage the positions for starting replication at the right place, something that mysqldump does for you.

Using an online backup method
There are different methods available, such as MySQL Enterprise Backup and XtraBackup.

Using LVM to get a snapshot
On Linux, it is possible to take a snapshot of a volume using Logical Volume Manager (LVM). It does require that you prepare beforehand, because a special LVM volume has to be created. Just like copying the database files, this method requires you to manage the replication positions yourself.

Using filesystem snapshot methods
Solaris ZFS, for example, has built-in support for taking snapshots. This is a very fast technique for creating backups, but like the other techniques (except for mysqldump), it requires you to manage the replication positions yourself.

If it should be necessary to use a different engine when restoring, you have to use mysqldump: all the other methods have to restore to the same engine that was used for taking the backup.
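As a concrete illustration of the mysqldump approach, the following sketch builds the command line for a consistent InnoDB snapshot that also records the master's binlog position. The function name and defaults are mine, not part of the book's library:

```python
def mysqldump_snapshot_command(host, user, databases=None):
    """Build a mysqldump command for a consistent InnoDB snapshot.

    --single-transaction takes the dump inside a single transaction,
    so the server stays online; --master-data=2 embeds the binlog file
    and position as a comment in the dump, so replication can later be
    started at the right place.
    """
    cmd = [
        "mysqldump",
        "--host=" + host,
        "--user=" + user,
        "--single-transaction",  # consistent snapshot (InnoDB only)
        "--master-data=2",       # record binlog file and position
    ]
    if databases:
        cmd.append("--databases")
        cmd.extend(databases)
    else:
        cmd.append("--all-databases")
    return cmd

cmd = mysqldump_snapshot_command("master.example.com", "backup_user")
```

The resulting list can be handed to subprocess.run; remember that --single-transaction only gives a consistent view for transactional tables.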
Techniques for creating new slaves are covered in Chapter 3, and the different backup methods are covered in Chapter 15.

Removing slaves from the topology
Removing slaves from the setup only requires notifying the load balancer that the slave is absent. An example load balancer, with methods for adding and removing servers, can be found in Chapter 6.

Switching the master
For routine maintenance, it is common to have to switch all the slaves of a master over to a secondary master, as well as notify load balancers of the master's absence. This procedure can and should be handled with no downtime at all, so it should not affect normal operations. Using slave promotion (described later in this chapter) is one way to handle this, but it might be easier to use a hot standby instead (also covered later in this chapter).

Handling slave failures
Your slaves will fail; it is just a matter of how often. Handling slave failures must be a routine event in any deployment. It is only necessary to detect that the slave is absent and remove it from the load balancer's pool, as described in Chapter 6.

Handling master failures
When the master goes down suddenly, you have to detect the failure and move all the slaves over to a standby, or promote one of the slaves to be the new master. Techniques for this are described later in this chapter.

Upgrading slaves
Upgrading slaves to new versions of the server should usually not be a problem. However, bringing the slave out of the system for the upgrade requires removing it from the load balancer and maybe notifying other systems of the slave's absence.

Upgrading masters
To upgrade the master, you first need to upgrade all the slaves to ensure that they can read all the replication events of the master. To upgrade the master itself, it is usually necessary to either use a standby as a master while you are performing the upgrade or promote one of the slaves to be the master for the duration of the upgrade.
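Several of these procedures, most notably switching the master, depend on deciding which of two servers has replicated further. Because both positions refer to the same master's binlog, comparing the filename and byte position lexicographically (in that order) is enough. A minimal sketch; the class is my own illustration, not part of the book's library:

```python
class BinlogPosition:
    """A binlog position (file, pos), comparable only between servers
    that replicate from the same master."""
    def __init__(self, file, pos):
        self.file, self.pos = file, pos

    def _key(self):
        # Binlog filenames like 'master-bin.000096' have fixed-width
        # numeric suffixes, so plain string comparison orders them.
        return (self.file, self.pos)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

standby = BinlogPosition("master-bin.000096", 756648)
slave = BinlogPosition("master-bin.000096", 743456)
assert slave < standby  # the slave must catch up before a switchover
```

The file component is compared first, so a position in a later binlog file is always "ahead" regardless of the byte offset.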
Hot Standby

The easiest topology for duplicating servers is hot standby, shown in Figure 5-2. The hot standby is a dedicated server that just duplicates the main master. The hot standby server is connected to the master as a slave, so that it reads and applies all changes. This setup is often called a primary-backup configuration, where the primary is the master and the "backup" is the secondary. There can be multiple hot standbys.

Figure 5-2. Master with a hot standby

Failure is inevitable, at least when you run a large deployment. It is not a question of if servers fail, but when and how often they fail. The idea in this topology is that when the main master fails, the hot standby provides a faithful replica of the master, and all the clients and slaves can therefore be switched over to the hot standby and continue operating. Operations can proceed with hardly a lapse, and the hot standby gives you a chance to fix or replace the main master. After you have repaired the master, you have to bring it back on track and either set it to be the hot standby, or redirect the slaves to the original master again.

As with many ideas, the reality is not always that rosy, but for starters, let's just consider the first case: that of switching over to a hot standby while the primary is still running, as illustrated in Figure 5-3.

MySQL 5.6 introduced the concept of global transaction identifiers, which significantly simplifies the problem of handling failover. However, because MySQL 5.6 is relatively new, this section demonstrates how to perform failover before MySQL 5.6. For a description of how to handle failover using global transaction identifiers, have a look at "Global Transaction Identifiers" on page 260, where you will also see how to set the server up to use global transaction identifiers.
If you are using a pre-MySQL 5.6 server and want to use global transaction identifiers, you have to roll them yourself. An example of how to do this can be found in Appendix B.

Figure 5-3. Switching over from a running master to a standby

Handling a switchover

The main challenge with switching over to a standby before MySQL 5.6 is to perform the switchover in such a way that the slave starts replicating from the standby at precisely the position where it stopped replicating from the original master. If the positions were easy to translate (for example, if the positions were the same on both the master and the standby), we would not have a problem. Unfortunately, the positions may be different on the master and the standby for a number of reasons. The most common cause is that the standby was not attached to the master when the master was started, but even if they were attached from the start, events cannot be guaranteed to be written to the binary log on the standby in the same way as they were written to the binary log on the master.

The basic idea for performing the switchover is to stop the slave and the standby at exactly the same position and then just redirect the slave to the standby. Because the standby hasn't made any changes after the position where it stopped, you can just check the binlog position on the standby and direct the slave to start at that position. This task has to be performed manually, because just stopping the slave and the standby will not guarantee that they are synchronized.

To do this, stop both the slave and the standby and compare their binlog positions. Because both positions refer to positions on the same master (the slave and standby are both connected to the same master), you can check the positions just by comparing the filename and the byte position lexicographically (in that order):

standby> SHOW SLAVE STATUS\G
...
Relay_Master_Log_File: master-bin.000096
...
          Exec_Master_Log_Pos: 756648
1 row in set (0.00 sec)

slave> SHOW SLAVE STATUS\G
...
Relay_Master_Log_File: master-bin.000096
...
          Exec_Master_Log_Pos: 743456
1 row in set (0.00 sec)

In this case, the standby is ahead of the slave: they are in the same file, but the standby is at position 756648, whereas the slave is at 743456. So just write down the position of the standby (756648) and run the slave until it has caught up with the standby.

To have the slave catch up with the standby and stop at the right position, use the START SLAVE UNTIL command, as we did when stopping the reporting slave earlier in this chapter:

slave> START SLAVE UNTIL
    ->     MASTER_LOG_FILE = 'master-bin.000096',
    ->     MASTER_LOG_POS = 756648;
Query OK, 0 rows affected (0.18 sec)

slave> SELECT MASTER_POS_WAIT('master-bin.000096', 756648);
Query OK, 0 rows affected (1.12 sec)

The slave and standby have now stopped at exactly the same position, and everything is ready to do the switchover to the standby using CHANGE MASTER to direct the slave to the standby and start it. But what position should you specify? Because the file and position that the master recorded for its stopping point are different from the file and position recorded by the standby for the same point, it is necessary to fetch the position that the standby recorded while writing the changes as a master. To do this, execute SHOW MASTER STATUS on the standby:

standby> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: standby-bin.000019
        Position: 56447
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

Now you can redirect the slave to the standby using the correct position:

slave> CHANGE MASTER TO
    ->     MASTER_HOST = 'standby.example.com',
    ->     MASTER_PORT = 3306,
    ->     MASTER_USER = 'repl_user',
    ->     MASTER_PASSWORD = 'xyzzy',
    ->     MASTER_LOG_FILE = 'standby-bin.000019',
    ->     MASTER_LOG_POS = 56447;
Query OK, 0 rows affected (0.18 sec)

slave> START SLAVE;
Query OK, 0 rows affected (0.25 sec)

If the opposite is true, that is, if the slave is ahead of the standby, you can just switch the roles of the standby and the slave in the previous steps. This is possible because the master is running and can provide either the slave or the standby with the missing changes. In the next section, we will consider how to handle the situation in which the master has stopped unexpectedly and hence cannot provide either the slave or the standby with the missing changes.

Handling a switchover in Python

Example 5-1 shows the Python code for switching a slave over to another master. The replicate_to_position function instructs a server to read from the master only up to the given position. When the function returns, the slave will have stopped at exactly this position. The switch_to_master function directs a slave to a new master. The procedure assumes that both the server on which it executes and the new master are connected to the same original master. If they are not, the positions are not comparable and the procedure will raise an exception. The procedure allows the position on the master to be given explicitly instead of computed, which we will use later in the chapter when implementing failover.

Example 5-1. Procedure for switching to a new master

from mysql.replicant.commands import (
    fetch_slave_position,
    fetch_master_position,
    change_master,
)

def replicate_to_position(server, pos):
    server.sql("START SLAVE UNTIL MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
               (pos.file, pos.pos))
    server.sql("SELECT MASTER_POS_WAIT(%s,%s)", (pos.file, pos.pos))

def switch_to_master(server, standby, master_pos=None):
    server.sql("STOP SLAVE")
    standby.sql("STOP SLAVE")
    if master_pos is None:
        server_pos = fetch_slave_position(server)
        standby_pos = fetch_slave_position(standby)
        if server_pos < standby_pos:
            replicate_to_position(server, standby_pos)
        elif server_pos > standby_pos:
            replicate_to_position(standby, server_pos)
        master_pos = fetch_master_position(standby)
    change_master(server, standby, master_pos)
    standby.sql("START SLAVE")
    server.sql("START SLAVE")

Dual Masters

One frequently mentioned setup for high availability is the dual masters topology. In this setup, two masters replicate each other to keep both current. This setup is very simple to use because it is symmetric: failing over to the standby master does not require any reconfiguration of the main master, and failing back to the main master again when the standby master fails in turn is very easy.

Servers can be either active or passive. If a server is active, it accepts writes, which are likely to be propagated elsewhere using replication. If a server is passive, it does not accept writes and just follows the active master, usually to be ready to take over when the active master fails.

When using dual masters, there are two different setups, each serving a different purpose:

Active-active
In an active-active setup, writes go to both servers, which then transfer changes to the other master.
Active-passive
In this setup, one of the masters, called the active master, handles writes, while the other, called the passive master, just keeps current with the active master. This is almost identical to the hot standby setup, but because it is symmetric, it is easy to switch back and forth between the masters, each taking turns being the active master. Note that this setup does not necessarily let the passive master answer queries. For some of the solutions that you'll see in this section, the passive master is a cold standby.

These setups do not necessarily mean that replication is used to keep the servers synchronized; there are other techniques that can serve that purpose. Some techniques can support active-active masters, while others can only support active-passive masters.

The most common use of an active-active dual masters setup is to have the servers geographically close to different sets of users, for example, in branch offices at different places in the world. The users can then work with the local server, and the changes will be replicated over to the other master so that both masters are kept in sync. Because the transactions are committed locally, the system will be perceived as more responsive.

It is important to understand that because the transactions are committed locally, the two masters are not consistent (i.e., they might not have the same information). The changes committed to one master will be propagated to the other master eventually, but until that has happened, the masters have inconsistent data. This has two main consequences that you need to be aware of:

• If the same information is updated on the two masters (for example, a user is accidentally added to both masters), there will be a conflict between the two updates and it is likely that replication will stop.

• If a crash occurs while the two masters are inconsistent, some transactions will be lost.
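Transaction loss can be bounded with semisynchronous replication, which is discussed next. For reference, enabling it on MySQL 5.5 and later is a short configuration step; this is a sketch that assumes the semisync plugin libraries ship with your server build:

```sql
-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = ON;

-- On each slave (restart the I/O thread so it registers as semisync):
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```

On Windows builds, the plugin filenames end in .dll rather than .so.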
To some extent, you can avoid the problem of conflicting changes by allowing writes to only one of the servers, thereby making the other master a passive master. This is called an active-passive setup. The active server is called the primary and the passive server is called the secondary.

Losing transactions when the server crashes is an inevitable result of using asynchronous replication, but depending on the application, it does not necessarily have to be a serious problem. You can limit the number of transactions that are lost when the server crashes by using a feature introduced in MySQL 5.5 called semisynchronous replication. The idea behind semisynchronous replication is that the thread committing a transaction blocks until at least one slave acknowledges that it has received the transaction. Because the events for the transaction are sent to the slave after the transaction has been committed to the storage engine, the number of lost transactions can be kept down to at most one per thread.

Similar to the active-active approach, the active-passive setup is symmetric and therefore allows you to switch easily from the main master to the standby and back. Depending on the way you handle the mirroring, it may also be possible to use the passive master for administrative tasks, such as upgrading the server, and then use the upgraded server as the active master once the upgrade is finished, without any downtime at all.

One fundamental problem that has to be resolved when using an active-passive setup is the risk of both servers deciding that they are the primary master. This is called split-brain syndrome. It can occur if network connectivity is lost for a period long enough for the secondary to promote itself to primary, after which the original primary is brought online again. If changes have been made to both servers while they were both in the role of primary, there may be a conflict.
In the case of using a shared disk, simultaneous writes to the disks by two servers are likely to cause "interesting" problems with the database (i.e., problems that are probably disastrous and difficult to pinpoint). In other words, two running MySQL servers are not allowed to share the same data directory, so it is necessary to ensure that at most one MySQL server using the data directory is active at any time (you can find a more elaborate discussion of this in "Shared disks" on page 137).

The easiest and most common way to prevent such a situation is to ensure that the server that was deemed "dead" really is not active. This is done using a technique called, somewhat morbidly, STONITH (Shoot The Other Node In The Head). It can be accomplished in several different ways, such as connecting to the server and executing a kill -9 (if the server can be reached), turning off the network card to isolate the server, or flipping the power switch on the machine. If the server is truly unreachable (e.g., it ended up on a different network partition), you have to use a "poison pill" so that when the server becomes accessible again, it will "commit suicide."

Shared disks

A straightforward dual masters approach is shown in Figure 5-4, where a pair of masters is connected using a shared disk architecture such as a storage area network (SAN). In this approach, both servers are connected to the same SAN and are configured to use the same files. Because one of the masters is passive, it will not write anything to the files while the active master is running as usual. If the main server fails, the standby will be ready to take over.

The advantage of this approach is that, because the binlog files are on a shared disk, there is no need for translating binlog positions. The two servers are truly mirror images of each other, but they are running on two different machines.
This means that switching over from the main master to the standby is very fast. There is no need for the slaves to translate positions to the new master; all that is necessary is to note the position where the slave stopped, issue a CHANGE MASTER command, and start replication again.

When you fail over using this technique, you have to perform recovery on the tables, because it is very likely that updates were stopped midstream. Each storage engine behaves differently in this situation. For example, InnoDB has to perform a normal recovery from the transaction log, as it would in the event of a crash, whereas if you use MyISAM you probably have to repair the tables before being able to continue operation. Of these two choices, InnoDB is preferred because recovery is significantly faster than repairing a MyISAM table. You should also consider the time it takes to warm up the caches, which can be lengthy.

Notice that the position uses the server ID of the main server, but it represents the same position on the standby because the standby uses the same files and is a mirror image of the main server. Because the position contains the server ID as well, this will also catch any mistakes made by the user, such as passing a master that is not a mirror image of the main master.

Setting up dual masters using shared disks depends on the shared storage solution used, a discussion that is beyond the scope of this book.

Figure 5-4. Dual masters using a shared disk

The problem with using shared storage is that the two masters are using the same files for storing data, so you have to be very careful when doing any administrative tasks on the passive master. Overwriting the configuration files, even by mistake, can be fatal. It is also not sufficient to force one server to be read-only, because some files are written (e.g., by InnoDB) even when the server is in read-only mode.
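Even so, making the passive master read-only is still a worthwhile guard against accidental client writes; a minimal sketch (with the caveat above: this does not stop the server itself from writing to the shared files):

```sql
-- On the passive master. Note that this does not restrict users
-- holding the SUPER privilege, nor replication threads:
SET GLOBAL read_only = ON;
```

The setting is lost on restart, so it is usually also placed in the server's configuration file.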
The handling of split-brain syndrome depends on which shared disk solution is used and is beyond the scope of this book. One example, however, is SCSI, which has support for letting servers reserve disks. This allows a server to detect that it is no longer the primary by noticing that the disks are reserved by another server, and to bring itself offline.

Replicated disks using DRBD

The Linux High Availability project contains a lot of useful tools for maintaining high-availability systems. Most of these tools are beyond the scope of this book, but there is one that is interesting for our purposes: Distributed Replicated Block Device (DRBD), which is software for replicating block devices over the network.

Figure 5-5 shows a typical setup of two nodes where DRBD is used to replicate a disk to a secondary server. The setup creates two DRBD block devices, one on each node, which in turn write the data to the real disks. The two DRBD processes communicate over the network to ensure any changes made to the primary are replicated over to the secondary. To the MySQL server, the device replication is transparent: the DRBD devices look and behave like normal disks, so no special configuration is needed for the servers.

You can only use DRBD in an active-passive setup. In contrast with the shared disk solution outlined earlier and the bidirectional replication implementation described later in this chapter, the passive disk, and hence the passive master, cannot be accessed at all.

Similar to the shared disk solution, DRBD has the advantage of not needing to translate positions between the two masters, because they share the same files. However, failing over to the standby master takes longer than in the shared disk setup described earlier.

Figure 5-5. Using DRBD to replicate disks

For both the shared disk and the DRBD setup, it is necessary to perform recovery of the database files before bringing the servers online. Because recovery of MyISAM tables is quite expensive, it is recommended that you use a transactional engine with good recovery performance for the database tables. InnoDB is the proven solution in this case, but investigating alternative transactional engines might prove to be well-invested time.

The mysql database contains strictly MyISAM tables, so you should, as a general principle, avoid unnecessary changes to these tables during normal operations. This is, of course, impossible to avoid when you need to perform administrative tasks.

One advantage of DRBD over shared disks is that for the shared disk solution, the disks actually provide a single point of failure: should the network to the shared disk array go down, it is possible that the server will not work at all. In contrast, replicating the disks means that the data is available on both servers, which reduces the risk of a total failure. DRBD also has built-in support for handling split-brain syndrome and can be configured to recover from it automatically.

Bidirectional replication

When using dual masters in an active-passive setup, there are no significant differences compared to the hot standby solution outlined earlier. In contrast to the other dual-masters solutions outlined earlier, however, it is possible to have an active-active setup (shown in Figure 5-6).

Figure 5-6. Bidirectional replication

Although controversial in some circles, an active-active setup does have its uses. A typical case is when there are two (or more) offices working with local information in the same database (e.g., sales data or employee data), where each office wants low response times when working with the database while ensuring the data is available in both places.
In this case, the data is naturally local to each office: each salesperson normally works with his own sales and rarely, if ever, makes changes to another salesperson's data.

Use the following steps to set up bidirectional replication:

1. Ensure both servers have different server IDs.
2. Ensure both servers have the same data (and that no changes are made to either system until replication has been activated).
3. Create a replication user and prepare replication (using the information in Chapter 1) on both servers.
4. Start replication on both servers.

When using bidirectional replication, be forewarned that replication includes no concept of conflict resolution. If both servers update the same piece of data, you will have a conflict that may or may not be noticed. If you are lucky, replication will stop at the offending statement, but you shouldn't count on it. If you intend to have a high availability system, you should ensure, at the application level, that two servers do not try to update the same data.

Even if data is naturally partitioned, as in the example given previously with two offices in separate locations, it is critical to put provisions in place to ensure data is not accidentally updated at the wrong server. In this case, the application has to connect to the server responsible for the employee and update the information there, not just update the information locally and hope for the best.

If you want to connect slaves to either of the servers, you have to ensure the log-slave-updates option is enabled. The other master is also connected as a slave, so an obvious question is this: what happens to events that the server sends out when they return to the server?

When replication is running, the server ID of the server that created the event is attached to each event. This server ID is then propagated further when the slave writes the event to its binary log.
When a server sees an event with the same server ID as its own, that event is simply skipped and replication proceeds with the next event.

Sometimes, you want to process the event anyway. This might be the case if you have removed the old server and created a new one with the same server ID, and you are in the process of performing point-in-time recovery (PITR). In those cases, it is possible to disable this check using the replicate-same-server-id configuration variable. However, to prevent you from shooting yourself in the foot, you cannot set this option at the same time that log-slave-updates is set. Otherwise, it would be possible to send events in a circle and quickly thrash all the servers; to prevent that from happening, MySQL refuses to forward events if replicate-same-server-id is set.

When using an active-active setup, there is a need to handle conflicts in a safe way, and by far the easiest way, and indeed the only recommended way to handle an active-active setup, is to ensure that the different active servers write to different areas.

One possible solution is to assign different databases, or different tables, to different masters. Example 5-2 shows a setup that uses two different tables, each updated by a different master. To make it easy to view the split data, a view is created that combines the two tables.

Example 5-2. Different tables for different offices

CREATE TABLE Employee_Sweden (
    uid INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20)
);

CREATE TABLE Employee_USA (
    uid INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20)
);

-- This view is used when reading from the two tables simultaneously.
CREATE VIEW Employee AS
    SELECT 'SWE', uid, name FROM Employee_Sweden
    UNION
    SELECT 'USA', uid, name FROM Employee_USA;

This approach is best used if the split is natural, in that, for example, different offices have different tables for their local data and the data only needs to be combined for reporting purposes.
This might seem easy enough, but the following issues can complicate usage and administration of the tables:

Reads and writes to different tables
    Because of the way the view is defined, you cannot update it. Writes have to be directed at the real tables, while reads can either use the view or read directly from the tables. It might therefore be necessary to introduce application logic to handle the split into reads and writes that go to different tables.

Accurate and current data
    Because the two tables are managed by different sites, simultaneous updates to the two tables will cause the system to temporarily enter a state where both servers have information that is not available on the other server. If a snapshot of the information is taken at this time, it will not be accurate. If accurate information is required, you will have to devise methods for ensuring the information is accurate. Because such methods are highly application-dependent, they will not be covered here.

Optimization of views
    When using views, two techniques are available to construct a result set. In the first method—called MERGE—the view is expanded in place, optimized, and executed as if it were a SELECT query. In the second method—called TEMPTABLE—a temporary table is constructed and populated with the data. If the server uses a TEMPTABLE view, it performs very poorly, whereas the MERGE view is close to the corresponding SELECT. MySQL uses TEMPTABLE whenever the view definition does not have a simple one-to-one mapping between the rows of the view and the rows of the underlying table—for example, if the view definition contains UNION, GROUP BY, subqueries, or aggregate functions—so careful design of the views is paramount for getting good performance. In either case, you have to consider the implications of using a view for reporting, as it might affect performance.
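The read/write split described above can be made mechanical: reads go through the Employee view, while each write is directed at the base table owned by the writer's office. Here is a minimal Python sketch; the table names mirror Example 5-2, but the routing helper itself is ours, not part of the book's library:

```python
# Map each office code to the base table its master owns (from Example 5-2).
OFFICE_TABLES = {
    "SWE": "Employee_Sweden",
    "USA": "Employee_USA",
}

def table_for_statement(office, is_write):
    """Pick the table (or view) a statement should run against.

    Writes must target the office's own base table, because the
    Employee view (a UNION) is not updatable; reads can use the view.
    """
    if is_write:
        try:
            return OFFICE_TABLES[office]
        except KeyError:
            raise ValueError("no base table for office %r" % (office,))
    return "Employee"
```

For example, `table_for_statement('SWE', is_write=True)` yields `Employee_Sweden`, while any read yields the combined `Employee` view. Refusing unknown offices is deliberate: silently picking a default table is exactly how data ends up written at the wrong server.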
If each server is assigned separate tables, there will be no risk of conflict at all, given that updates are completely separated. However, if all the sites have to update the same tables, you will have to use some other scheme. The MySQL server has special support for handling this situation in the form of two server variables:

auto_increment_offset
    This variable controls the starting value for any AUTO_INCREMENT column in a table (i.e., the value that the first row inserted into the table gets for the AUTO_INCREMENT column). For subsequent rows, the value is calculated using auto_increment_increment.

auto_increment_increment
    This is the increment used to compute the next value of an AUTO_INCREMENT column.

There are session and global versions of these two variables, and they affect all tables on the server, not just specific tables. Whenever a new row is inserted into a table with an AUTO_INCREMENT column, the next value available in the following sequence is used:

    valueN = auto_increment_offset + N * auto_increment_increment

Notice that the next value is not computed by adding auto_increment_increment to the last value in the table.

Building on the previous example, you can use auto_increment_offset and auto_increment_increment to ensure new rows added to a table are assigned numbers from different sequences of numbers depending on which server is used. The idea is that the first server uses the sequence 1, 3, 5… (odd numbers), while the second server uses the sequence 2, 4, 6… (even numbers). Continuing with Example 5-2, Example 5-3 uses these two variables to ensure the two servers use different IDs when inserting new employees into the Employee table.

Example 5-3.
Two servers writing to the same table

-- The common table can be created on either server
CREATE TABLE Employee (
    uid INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20),
    office VARCHAR(20)
);

-- Setting for first master
SET GLOBAL AUTO_INCREMENT_INCREMENT = 2;
SET GLOBAL AUTO_INCREMENT_OFFSET = 1;

-- Setting for second master
SET GLOBAL AUTO_INCREMENT_INCREMENT = 2;
SET GLOBAL AUTO_INCREMENT_OFFSET = 2;

This scheme handles the insertion of new items in the tables, but when entries are being updated, it is still critical to ensure the update statements are sent to the correct server (i.e., the server responsible for the employee). Otherwise, data is likely to be inconsistent. If updates are not done correctly, the slaves will normally not stop—they will just replicate the information, which leads to inconsistent values on the two servers. For example, if the first master executes the statement:

master-1> UPDATE Employee SET office = 'Vancouver' WHERE uid = 3;
Query OK, 1 row affected (0.00 sec)

and at the same time, the same row is updated at the second server using the statement:

master-2> UPDATE Employee SET office = 'Paris' WHERE uid = 3;
Query OK, 1 row affected (0.00 sec)

the result will be that the first master will place the employee in Paris while the second master will place the employee in Vancouver (note that the values end up swapped because each server executes the other server's statement after its own). Detecting and preventing such inconsistencies is important because they will cascade and create more inconsistency over time. Statement-based replication executes statements based on the data in the two servers, so one inconsistency can lead to others. If you take care to separate the changes made by the two servers as outlined previously, the row changes will be replicated and the two masters will therefore be consistent.
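The effect of the settings in Example 5-3 can be checked with a short simulation of the formula given earlier (valueN = auto_increment_offset + N * auto_increment_increment). This sketch is purely illustrative; a real server also never hands out a value lower than the current maximum already in the table:

```python
def auto_increment_sequence(offset, increment, count):
    """First `count` AUTO_INCREMENT values for a server configured with
    the given auto_increment_offset and auto_increment_increment."""
    return [offset + n * increment for n in range(count)]

# First master: offset 1, increment 2 -> odd IDs.
master1_ids = auto_increment_sequence(1, 2, 4)   # [1, 3, 5, 7]
# Second master: offset 2, increment 2 -> even IDs.
master2_ids = auto_increment_sequence(2, 2, 4)   # [2, 4, 6, 8]

# The two masters can insert concurrently without ever colliding.
assert not set(master1_ids) & set(master2_ids)
```

The sequences are disjoint by construction, which is exactly why the scheme avoids INSERT conflicts while doing nothing for conflicting UPDATEs.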
If users use different tables on the different servers, the easiest way to prevent such mistakes is to assign privileges so that a user cannot accidentally change tables on the wrong server. This is, however, not always possible and cannot prevent the case just shown.

Slave Promotion

The procedures described so far work well when you have a master running that you can use to synchronize the standby and the slave before the switchover, but what happens when the master dies all of a sudden? Because replication has stopped in its tracks with all slaves (including the standby), there is no way to know what is on each slave.

If the standby is ahead of all the slaves that need to be reassigned, there is no problem, because you can run replication on each slave to the place where the standby stopped. You will lose any changes that were made on the master but not yet sent to the standby. We will cover how to handle the recovery of the master in this case separately.

If the standby is behind one of the slaves, you shouldn't use the standby as the new master, because the slave knows more than the standby. As a matter of fact, it would be better if the slave that has replicated most events from the common master (which is now "more knowledgeable") were the master instead!

This is exactly the approach taken to handle master failures using slave promotion: instead of trying to keep a dedicated standby around (which then might not be the best candidate), ensure that any one of the slaves connected to the master can be promoted to master and take over at the point where the master was lost. By selecting the "most knowledgeable" slave as the new master, you guarantee that none of the other slaves will be more knowledgeable than the new master, so they can connect to the new master and read events from it.
There is, however, a critical issue that needs to be resolved: synchronizing all slaves with the new master so that no events are lost or repeated. The problem in this situation is that all of the slaves need to read events from the new master.

The traditional method for promoting a slave

Before delving into the final solution, let us first take a look at the traditionally recommended way of handling slave promotion. This will serve as a good introduction to the problem. Figure 5-7 shows a typical setup with a master and several slaves. For the traditional method of slave promotion, the following are required:

• Each promotable slave must have a user account for the replication user.
• Each promotable slave should run with the binary log enabled (the log-bin option).
• Each promotable slave should run without the log-slave-updates option (the reason will become obvious shortly).

Figure 5-7. Promoting a slave to replace a failed master

Assume you are starting with the original setup shown in Figure 5-7 and that the master fails. You can promote a slave to be the new master by doing the following:

1. Stop the slave using STOP SLAVE.
2. Reset the slave that is going to be the new master using RESET MASTER. This will ensure the slave starts as the new master and that any connecting slave will start reading events from the time the slave was promoted.
3. Connect the other slaves to the new master using CHANGE MASTER. Because you reset the new master, you can start replication from the beginning of the binary log, so it is not necessary to provide any position to CHANGE MASTER.

Unfortunately, this approach is based on an assumption that is not generally true—that the slaves have received all changes that the master has made. In a typical setup, the slaves will lag behind the master to various degrees. It might be just a few transactions, but nevertheless, they lag behind.
This means that each slave needs to fetch the missing transactions somehow, and if none of the other slaves has the binary log enabled, there is no way to provide these changes to the slaves. The situation can be handled by figuring out which slave has seen the most of the master's changes (the most knowledgeable slave) and then synchronizing all the other slaves with it, either by copying the entire database or by using a tool such as mysqldbcompare to transfer the changes to the slaves. Still, this approach is simple enough to be useful if you can tolerate losing a few transactions or if you are operating under a low load.

A revised method for promoting a slave

The traditional approach to promoting a slave is inadequate in most cases because slaves usually lag behind the master. Figure 5-8 illustrates the typical situation when the master disappears unexpectedly. The box labeled "binary log" in the center is the master's binary log and each arrow represents how much of the binary log the slave has executed.

Figure 5-8. Binary log positions of the master and the connected slaves

In the figure, each slave has stopped at a different binlog position, and even the most knowledgeable slave has not received all the transactions from the now defunct master. The transactions that have not been replicated to the new master are lost forever (it will become clear why in "Consistency in a Hierarchical Deployment" on page 180), and the transactions missing from the lagging slaves have to be transferred from the most knowledgeable slave. The situation is resolved by promoting one slave (the most knowledgeable one) to be the new master and then synchronizing the other slaves to it. In Example 5-4, you can find code to order the slaves based on their master positions. This works if the slaves are connected to the same master, which means that the coordinates are comparable.

Example 5-4.
Python code to find the best slave

from mysql.replicant.commands import (
    fetch_slave_position,
)

def fetch_gtid_executed(server):
    server.connect()
    result = server.sql(
        "SELECT server_id, trans_id FROM Last_Exec_Trans"
    )
    server.disconnect()
    return result

def order_slaves_on_position(slaves):
    entries = []
    for slave in slaves:
        pos = fetch_slave_position(slave)
        gtid = fetch_gtid_executed(slave)
        entries.append((pos, gtid, slave))
    entries.sort(key=lambda x: x[0])
    # Keep only the (gtid, slave) pairs; the position was needed
    # just for sorting.
    return [entry[1:] for entry in entries]

With the introduction of native support for GTIDs in MySQL version 5.6, this problem was eliminated. A complete description of GTIDs can be found in "Global Transaction Identifiers" on page 260.

The critical problem lies in translating the positions for each slave (which are the positions in the now-defunct master) to positions on the promoted slave. In versions prior to 5.6, the history of events executed and the binlog positions they correspond to on the slaves are lost in the replication process. Each time the slave executes an event that has arrived from the master, it writes a new event to its binary log, with a new binary log position. The slave's position has no relation to the master's binlog position of the same event. The only option that remains is to implement an alternative version of GTIDs and scan the binary log of the promoted slave. The alternative implementation of GTIDs is described in Appendix B. You can see a Python implementation of slave promotion in Example 5-5.

Example 5-5. Slave promotion in Python

def promote_best_slave(slaves):
    entries = order_slaves_on_position(slaves)
    _, master = entries.pop()
    for gtid, slave in entries:
        pos_on_master = find_position_from_gtid(master, gtid)
        switch_to_master(master, slave, pos_on_master)

Here the positions of each slave are fetched using the function introduced in Appendix B, which uses SHOW SLAVE STATUS to fetch the position of the last executed event.
Pick the slave with the highest position to promote to master. If there are several that share the highest position, pick any one of them.

This will connect to the promoted slave and scan its binary log to find the GTID of the last executed transaction for each slave. This step will give you a binlog position on the promoted slave for each GTID that you collected.

Reconnect each slave to the promoted slave, starting at the position retrieved from the new master's binary log.

Circular Replication

After reading about dual masters, you might wonder whether it is possible to set up a multimaster with more than two masters replicating to each other. Because each slave can only have a single master, it is possible to get this configuration only by setting up replication in a circular fashion. Before MySQL 5.6, this was not a recommended setup, but it is certainly possible. With the introduction of global transaction IDs in MySQL 5.6, many of the reasons for rejecting circular replication are no longer valid, because the main problem—getting it to work correctly in the presence of failure—is now manageable.

Using a circular replication setup with three or more servers can be quite practical for reasons of locality. As a real-life example, consider the case of a mobile phone operator with subscribers all over Europe. Because mobile phone users travel around quite a lot, it is convenient to have the registry for the customers close to the actual phone, so by placing the data centers at some strategic places in Europe, it is possible to quickly verify call data and also register new calls locally. The changes can then be replicated to all the servers in the ring, and eventually all servers will have accurate billing information. In this case, circular replication is a perfect setup: all subscriber data is replicated to all sites, and updates of data are allowed in all data centers.
Setting up circular replication (as shown in Figure 5-9) is quite easy. Example 5-6 provides a script that sets up circular replication automatically, so where are the complications? As in every setup, you should ask yourself, "What happens when something goes wrong?"

Example 5-6. Setting up circular replication

def circular_replication(server_list):
    from mysql.replicant.commands import change_master
    for source, target in zip(server_list,
                              server_list[1:] + [server_list[0]]):
        change_master(target, source)

Figure 5-9. Circular replication setup

In Figure 5-9, there are four servers named for the cities in which they are located (the names are arbitrarily picked and do not reflect a real setup). Replication goes in a circle: Stockholm to Moscow to Paris to London and back to Stockholm. This means that Moscow is upstream of Paris, but downstream of Stockholm. Suppose that Moscow goes down suddenly and unexpectedly. To allow replication to continue, it is necessary to reconnect the "downstream" server Paris to the "upstream" server Stockholm to ensure the continuing operation of the system. Figure 5-10 shows a scenario in which a single server fails and the servers reconnect to allow replication to continue. Sounds simple enough, doesn't it? Well, it's not really as simple as it looks. There are basically three potential problems:

• The downstream server—the server that was slave to the failed master—needs to connect to the upstream server and start replication from what it last saw. How is that position decided?
• Suppose that the crashed server has managed to send out some events before crashing. What happens with those events?
• We need to consider how we should bring the failed server into the topology again. What if the server applied some transactions of its own that were written to the binary log but not yet sent out? It is clear that these transactions are lost, so we need to handle this.
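The reconnection step just described can be sketched as list surgery on the ring: drop the failed server, and its downstream neighbor then follows its upstream neighbor when the ring is rebuilt. This toy helper is ours, not part of the book's Replicant library:

```python
def reconnect_ring(ring, failed):
    """Return the replication ring with the failed server removed.

    In the new ring, the failed server's downstream neighbor replicates
    directly from its upstream neighbor (e.g., Paris follows Stockholm
    once Moscow is gone).
    """
    if failed not in ring:
        raise ValueError("%r is not part of the ring" % (failed,))
    return [server for server in ring if server != failed]

ring = ["stockholm", "moscow", "paris", "london"]
new_ring = reconnect_ring(ring, "moscow")
```

Passing new_ring to circular_replication from Example 5-6 would then issue the CHANGE MASTER commands that close the smaller circle again.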
As you will see, all of these issues are easy to solve with the global transaction identifiers introduced in MySQL 5.6. When you detect that one of the servers has failed, just use the CHANGE MASTER command to connect the downstream server to the upstream server using the MASTER_AUTO_POSITION=1 option:

paris> CHANGE MASTER TO
    ->     MASTER_HOST='stockholm.example.com',
    ->     MASTER_AUTO_POSITION = 1;

Because each server remembers which transactions it has seen, any transactions that were sent out by the failing server will be applied on each remaining server in the ring exactly once. This means that GTIDs automatically handle the second and third issues from our list.

Figure 5-10. Changing topology in response to a failing server

Because the failed server can be in an alternative future (see Figure 6-8) compared to the other servers, bringing the server back into the ring will cause the missing transactions to "suddenly appear" (from the perspective of the application using the database), and this might not be what you want. The safest way to bring it into the circle again is to restore the server from one of the servers in the ring and reconnect the circle so that the new server is in the ring again.

Conclusion

High availability is a nontrivial concept to implement in practice. In this chapter, we presented a look into high availability and how you can achieve it with MySQL. In the next chapter, we will look more at high availability as we examine a companion topic: scaling out.

Joel's email notification chime sounded. He clicked on his email and opened the latest message. It was from Mr. Summerson, who made comments about his report. He read it through and at the bottom found what he expected. It read, "I like the redundancy ideas and especially the hot standby strategy. Make this happen."

Joel sighed as he realized his plans for getting to know some of his coworkers were going to have to wait. He had a lot of work to do.
CHAPTER 6
MySQL Replication for Scale-Out

Joel stood and stretched and figured it was time for a soda. As he walked around his desk, headed to the break room, his boss met him at his door.

"Good afternoon, sir."

"Hey, Joel. We just sold a bunch of licenses for our new applications. The marketing people tell me we can expect to see an increase in load of at least tenfold on the database server."

Joel raised his eyebrows. He had added a single slave just last week and that had improved the load problem, but not entirely.

"We need to scale out, Joel."

"Yes, sir. I'll get right on it."

Mr. Summerson tapped Joel on the shoulder and smiled, then walked down the hall to his office. Joel stood still for a few moments as he pondered what "scale out" meant and formed a plan. "I've got to do a little more reading," he mumbled as he headed to the break room.

When the load starts to increase—and if you are running a successful deployment, it is just a matter of when it will start to increase—you can handle it in two ways. The first is to buy larger and more powerful servers to handle the increased load, which is called scaling up, whereas the second is to add more servers to handle the increased load, which is called scaling out. Of these two, scaling out is by far the more popular solution because it usually involves buying a batch of low-cost standard servers and is much more cost-effective.

In addition to handling an increased load, additional servers can support high availability and other business requirements. When used effectively, scaling out puts the combined resources—such as computing power—of all the servers to best use.

This chapter doesn't go into all the hardware, network, and other considerations involved in scaling out—those are beyond the scope of this book and are covered to some degree in High Performance MySQL—but we will talk about how to set up replication in MySQL to make the best use of scale-out.
After some basic instructions for replication, we'll start to develop a Python library that makes it easy to administer replication over large sets of servers, and we'll examine how replication fits into your organization's business needs. All the code in this chapter (as well as the other chapters of the book) can be found in the MySQL Replicant source code repository at Launchpad.

The most common uses for scaling out and replication are:

Load balancing for reads
    The master is occupied with updating data, so it can be wise to have separate servers to answer queries. Because queries only need to read data, you can use replication to send changes on the master to slaves—as many as you feel you need—so that they have current data and can process queries.

Load balancing for writes
    High-traffic deployments distribute processing over many computers, sometimes several thousand. Here, replication plays a critical role in distributing the information to be processed. The information can be distributed in many different ways based on the business use of your data and the nature of the use:

    • Distributed based on the information's role. Rarely updated tables can be kept on a single server, while frequently updated tables are partitioned over several servers.
    • Partitioned by geographic region so that traffic can be directed to the closest server.

Disaster avoidance through hot standby
    If the master goes down, everything will stop—it will not be possible to execute (perhaps critical) transactions, get information about customers, or retrieve other critical data. This is something that you want to avoid at (almost) any cost, as it can severely disrupt your business. The easiest solution is to configure a slave with the sole purpose of acting as a hot standby, ready to take over the job of the master if it fails.
Disaster avoidance through remote replication
    Every deployment runs the risk of having a data center go down due to a disaster, be it a power failure, an earthquake, or a flood. To mitigate this, use replication to transport information between geographically remote sites.

Making backups
    Keeping an extra server around for making backups is very common. This allows you to make your backups without having to disturb the master at all, because you can take the backup server offline and do whatever you like with it.

Report generation
    Creating reports from data on a server will degrade the server's performance, in some cases significantly. If you're running lots of background jobs to generate reports, it's worth creating a slave just for this purpose. You can get a snapshot of the database at a certain time by stopping replication on the slave and then running large queries on it without disturbing the main business server. For example, if you stop replication after the last transaction of the day, you can extract your daily reports while the rest of the business is humming along at its normal pace.

Filtering or partitioning data
    If the network connection is slow, or if some data should not be available to certain clients, you can add a server to handle data filtering. This is also useful when the data needs to be partitioned and reside on separate servers.

Scaling Out Reads, Not Writes

It is important to understand that scaling out in this manner scales out reads, not writes. Each new slave has to handle the same write load as the master.
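Because every slave must repeat all of the master's writes, adding slaves multiplies the total write work while only spreading out the reads. The following Python sketch makes the arithmetic concrete; the function and parameter names are ours, not from the book's library:

```python
def average_load(read_load, write_load, capacity_per_server, num_servers):
    """Average load for a master plus (num_servers - 1) identical slaves.

    The read load is shared across all servers, but the write load is
    executed once per server, so it is multiplied by num_servers.
    """
    total_load = read_load + num_servers * write_load
    total_capacity = num_servers * capacity_per_server
    return total_load / total_capacity
```

With a 6,000 tps read load, a 4,000 tps write load, and 10,000 tps of capacity per server, a single server runs at 100% load, while a master plus three slaves runs at 55%, matching the calculations worked through in the text.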
The average load of the system can be described as follows:

    AverageLoad = (Σ ReadLoad + Σ WriteLoad) / Σ Capacity

So if you have a single server with a total capacity of 10,000 transactions per second, and there is a write load of 4,000 transactions per second on the master, while there is a read load of 6,000 transactions per second, the result will be:

    AverageLoad = (6,000 + 4,000) / 10,000 = 100%

Now, if you add three slaves to the master, the total capacity increases to 40,000 transactions per second. Because the write queries are replicated as well, each query is executed a total of four times—once on the master and once on each of the three slaves—which means that each slave has to handle 4,000 transactions per second in write load. The total read load does not increase because it is distributed over the slaves. This means that the average load now is:

    AverageLoad = (6,000 + 4 × 4,000) / (4 × 10,000) = 55%

Notice that in the formula, the capacity is increased by a factor of 4, as we now have a total of four servers, and replication causes the write load to increase by a factor of 4 as well. It is quite common to forget that replication forwards to each slave all the write queries that the master handles. So you cannot use this simple approach to scale writes, only reads. In the next chapter, you will see how to scale writes using a technique called sharding.

The Value of Asynchronous Replication

MySQL replication is asynchronous, a type of replication particularly suitable for modern applications such as websites. To handle a large number of reads, sites use replication to create copies of the master and then let the slaves handle all read requests while the master handles the write requests. This replication is considered asynchronous because the master does not wait for the slaves to apply the changes, but instead just dispatches each change request to the slaves and assumes they will catch up eventually and replicate all the changes. This technique for improving performance is usually a good idea when you are scaling out.

In contrast, synchronous replication keeps the master and slaves in sync and does not allow a transaction to be committed on the master unless the slave agrees to commit it as well (i.e., synchronous replication makes the master wait for all the slaves to keep up with the writes).

Asynchronous replication is a lot faster than synchronous replication, for reasons our description should make obvious. Compared to asynchronous replication, synchronous replication requires extra synchronizations to guarantee consistency. It is usually implemented through a protocol called two-phase commit, which guarantees consistency between the master and slaves, but requires extra messages to ping-pong between them. Typically, it works like this:

1. When a commit statement is executed, the transaction is sent to the slaves and each slave is asked to prepare for a commit.
2. Each slave prepares the transaction so that it can be committed, and then sends an OK (or ABORT) message to the master, indicating that the transaction is prepared (or that it could not be prepared).
3. The master waits for all slaves to send either an OK or an ABORT message:
    a. If the master receives an OK message from all slaves, it sends a commit message to all slaves asking them to commit the transaction.
    b. If the master receives an ABORT message from any of the slaves, it sends an ABORT message to all slaves asking them to abort the transaction.
4. Each slave then waits for either a commit or an abort request from the master:
    a.
If the slaves receive the commit request, they commit the transaction and send an acknowledgment to the master that the transaction is committed.
    b. If the slaves receive an abort request, they abort the transaction by undoing any changes and releasing any resources they held, then send an acknowledgment to the master that the transaction was aborted.
5. When the master has received acknowledgments from all slaves, it reports the transaction as committed (or aborted) and continues with processing the next transaction.

What makes this protocol slow is that it requires a total of four messages, including the messages with the transaction and the prepare request. The major problem is not the amount of network traffic required to handle the synchronization, but the latency introduced by the network and by processing the commit on the slave, together with the fact that the commit is blocked on the master until all the slaves have acknowledged the transaction. In contrast, asynchronous replication requires only a single message to be sent with the transaction. As a bonus, the master does not have to wait for the slave, but can report the transaction as committed immediately, which improves performance significantly.

So why is it a problem that synchronous replication blocks each commit while the slaves process it? If the slaves are close to the master on the network, the extra messages needed by synchronous replication make little difference, but if the slaves are not nearby—maybe in another town or even on another continent—it makes a big difference.

Table 6-1 shows some examples for a server that can commit 10,000 transactions per second. This translates to a commit time of 0.1 ms (but note that some implementations, such as MySQL Cluster, are able to process several commits in parallel if they are independent). If the network latency is 0.01 ms (a number we've chosen as a baseline by pinging one of our own computers), the transaction commit time increases to 0.14 ms, which translates to approximately 7,100 transactions per second. If the network latency is 10 ms (which we found by pinging a server in a nearby city), the transaction commit time increases to 40.1 ms, which translates to about 25 transactions per second! In contrast, asynchronous replication introduces no delay at all, because the transactions are reported as committed immediately, so the transaction rate stays at the original 10,000 per second, just as if there were no slaves.

Table 6-1. Typical slowdowns caused by synchronous replication

    Latency (ms)   Commit time (ms)   Equivalent transactions per second   Example case
    0.01           0.14               ~7,100                               Same computer
    0.1            0.5                ~2,000                               Small LAN
    1              4.1                ~240                                 Bigger LAN
    10             40.1               ~25                                  Metropolitan network
    100            400.1              ~2                                   Satellite

The performance of asynchronous replication comes at the price of consistency. Recall that in asynchronous replication the transaction is reported as committed immediately, without waiting for any acknowledgment from the slave. This means the master may consider the transaction committed when the slave does not. As a matter of fact, the transaction might not even have left the master, but may still be waiting to be sent to the slave. There are two problems with this that you need to be aware of:

• In the event of a crash on the master, transactions can "disappear."
• A query executed on the slaves might return old data.

Later in this chapter, we will talk about how to ensure you are reading current data, but for now, just remember that asynchronous replication comes with its own set of caveats that you have to handle.

Managing the Replication Topology

A deployment is scaled by creating new slaves and adding them to the collection of computers you have.
The term replication topology refers to the ways you connect servers using replication. Figure 6-1 shows some examples of replication topologies: a simple topology, a tree topology, a dual-master topology, and a circular topology.

These topologies are used for different purposes: the dual-master topology handles failovers elegantly, for example, and circular replication and dual masters allow different sites to work locally while still replicating changes over to the other sites.

Figure 6-1. Simple, tree, dual-master, and circular replication topologies

The simple and tree topologies are used for scale-out, where the number of reads greatly exceeds the number of writes. This places special demands on the deployment in two ways:

It requires load balancing
    We're using the term load balancing here to describe any way of dividing queries among servers. Replication creates both reasons for load balancing and methods for doing so. First, replication imposes a basic division of the load by specifying that writes be directed to the masters while reads go to the slaves. Furthermore, you sometimes have to send a particular query to a particular slave.

It requires you to manage the topology
    Servers crash sooner or later, which makes it necessary to replace them. Replacing a crashed slave might not be urgent, but you'll have to replace a crashed master quickly. In addition, if a master crashes, clients have to be redirected to the new master. If a slave crashes, it has to be taken out of the pool of load balancers so no queries are directed to it.

To handle load balancing and management, you should put tools in place to manage the replication topology, specifically tools that monitor the status and performance of servers and tools to handle the distribution of queries. For load balancing to be effective, it is necessary to have spare capacity on the servers.
There are a few reasons for ensuring you have spare capacity:

Peak load handling
    You need to have margins to be able to handle peak loads. The load on a system is never even, but fluctuates up and down. The spare capacity necessary to handle a large deployment depends a lot on the application, so you need to monitor it closely to know when the response times start to suffer.

Distribution cost
    You need to have spare capacity for running the replication setup. Replication always causes some capacity to be "wasted" on the overhead of running a distributed system. It involves extra queries to manage the distributed system, such as the extra queries necessary to figure out where to execute a read query. One item that is easily forgotten is that each slave has to perform the same writes as the master. The queries from the master are executed in an orderly manner (i.e., serially), with no risk of conflicting updates, but the slave needs extra capacity for running replication.

Administrative tasks
    Restructuring the replication setup requires spare capacity so you can support temporary dual use, such as when moving data between servers.

Load balancing works in two basic ways: either the application asks for a server based on the type of query, or an intermediate layer—usually referred to as a proxy—analyzes the query and sends it to the correct server. Using an intermediate layer to analyze and distribute the queries (as shown in Figure 6-2) is by far the most flexible approach, but it has two disadvantages:

• There is a performance degradation when using a proxy, for two reasons: processing resources have to be spent on analyzing queries, and an extra hop is introduced because the queries now have to go through the proxy.
Processing the query may delay it—because it has to be parsed and analyzed twice (once by the proxy and again by the MySQL server)—but the latency introduced by the extra hop is likely to exceed the time for analyzing the query. Depending on the application, this may or may not be a problem.

• Correct query analysis can be hard to implement, sometimes even impossible. A proxy will often hide the internal structure of the deployment from the application programmer so that it should not be necessary to make the hard choices. For this reason, the client may send a query that can be very hard to analyze properly and might require a significant rewrite before being sent to the servers.

One of the tools that you can use for proxy load balancing is MySQL Proxy. It contains a full implementation of the MySQL client protocol, and therefore can act as a server for the real client connecting to it and as a client when connecting to the MySQL server. This means that it can be fully transparent: a client can't distinguish between the proxy and a real server.

Figure 6-2. Using a proxy to distribute queries

The MySQL Proxy is controlled using the Lua programming language. It has a built-in Lua engine that executes small—and sometimes not so small—programs to intercept and manipulate both the queries and the result sets. Because the proxy is controlled using a real programming language, it can carry out a variety of sophisticated tasks, including query analysis, query filtering, query manipulation, and query distribution.

Configuration and programming of the MySQL Proxy are beyond the scope of this book, but there are extensive publications about it online. Some of the ones we find useful are:

Jan Kneschke
    Jan Kneschke is the original author of the MySQL Proxy and has several good presentations and posts about the Proxy.
The MySQL Reference Manual
    The MySQL Proxy section of the MySQL Reference Manual contains the details of the implementation as well as an introduction to writing scripts for the MySQL Proxy.

The precise methods for using a proxy depend entirely on the type of proxy you use, so we will not cover that information here. Instead, we'll focus on using a load balancer in the application layer. There are a number of load balancers available, including:

• Hardware
• Simple software load balancers, such as Balance
• Peer-based systems, such as Wackamole
• Full-blown clustering solutions, such as the Linux Virtual Server

It is also possible to distribute the load on the DNS level and to handle the distribution directly in the application.

Application-Level Load Balancing

The most straightforward approach to load balancing at the application level is to have the application ask the load balancer for a connection based on the type of query it is going to send. In most cases, the application already knows whether the query is going to be a read or write query, and which tables will be affected. In fact, forcing the application developer to consider these issues when designing the queries may produce other benefits for the application, usually in the form of improved overall performance of the system. Based on this information, a load balancer can provide a connection to the right server, which the application can then use to execute the query.

A load balancer on the application layer needs to have a store with information about the servers and what queries they should handle. Functions in the application layer send queries to this store, which returns the name or IP address of the MySQL server to query. The lookup procedure can either be placed in the application or inside the connector, if it supports it.
Many connectors support ways of providing information about servers without a central store, but then you need some means of seeding the connectors with this information, or you must provide it through the application.

Example of an application-level load balancer

Let's develop a simple load balancer like the one shown in Figure 6-3 for use by the application layer. PHP is used for the presentation logic because it's so popular on web servers. It is necessary to write functions for updating the server pool information and functions to fetch servers from the pool.

The pool is implemented by creating a table with all the servers in the deployment in a common database that is shared by all nodes. In this case, we just use the host and port as the primary key for the table (instead of creating a host ID) and create a common database to contain the tables of the shared data.

Figure 6-3. Load balancing on the application level

You should duplicate the central store so that it doesn't create a single point of failure. In addition, because the list of available servers does not change often, load balancing information is a perfect candidate for caching.

For the sake of simplicity—and to avoid introducing dependencies on other systems—we demonstrate the application-level load balancer using a pure MySQL implementation. There are many other techniques that you can use that do not involve MySQL. The most common technique is to use round-robin DNS; another alternative is Memcached, which is a distributed in-memory key/value store.

Also note that the addition of an extra query (to query the common database) is a significant overhead for high-performing systems and should be avoided. This is traditionally done by caching the information, but we start with an implementation without caches. You will see how to add caches in Example 6-5.
The load balancer lists servers in the load balancer pool, separated into categories based on what kind of queries they can handle. Information about the servers in the pool is stored in a central repository. The implementation consists of a table in the common database given in Example 6-1, the load balancer in Example 6-2 for querying the load balancer from the application, and the Python functions in Example 6-3 for updating information about the servers.

Example 6-1. Database tables for the load balancer

CREATE TABLE nodes (
    host VARCHAR(28) NOT NULL,
    port INT UNSIGNED NOT NULL,
    sock VARCHAR(80) NOT NULL,
    type ENUM('OL','RO','RW') NOT NULL,
    PRIMARY KEY (host, port)
);

The store contains the host and the port of the server, as well as whether it is an offline (OL), read-only (RO), or read-write (RW) server. The offline setting can be used for maintenance.

Example 6-2 shows code for implementing a load balancer. It consists of a dictionary class responsible for dealing out connections to servers.

Example 6-2. Load balancer in PHP

define('DB_OFFLINE', 'OL');
define('DB_RW', 'RW');
define('DB_RO', 'RO');

$FETCH_QUERY = <<<END_OF_QUERY
SELECT host, port FROM nodes
 WHERE type = ?
 ORDER BY RAND() LIMIT 1
END_OF_QUERY;

class Dictionary {
    private $server;

    public function __construct($host, $user, $pass, $port = 3306) {
        $this->server = new mysqli($host, $user, $pass, 'metainfo', $port);
    }

    public function get_connection($user, $pass, $db, $hint) {
        global $FETCH_QUERY;
        $type = $hint['type'];
        if ($stmt = $this->server->prepare($FETCH_QUERY)) {
            $stmt->bind_param('s', $type);
            $stmt->execute();
            $stmt->bind_result($host, $port);
            if ($stmt->fetch())
                return new mysqli($host, $user, $pass, $db, $port);
        }
        return null;
    }
}

A simple SELECT will suffice to find all the servers that can accept the query. Because we want just a single server, we limit the output to a single line using the LIMIT modifier to the SELECT query, and to distribute queries evenly among available servers, we use the ORDER BY RAND() modifier.
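The effect of this query is easy to try without a full MySQL setup. The sketch below is our own: it uses Python's built-in sqlite3 module with a trimmed-down nodes table standing in for the central repository (SQLite spells the function RANDOM() rather than RAND(); the hostnames are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes ("
             " host TEXT, port INT, type TEXT,"
             " PRIMARY KEY (host, port))")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                 [("master1.example.com", 3306, "RW"),
                  ("slave1.example.com", 3306, "RO"),
                  ("slave2.example.com", 3306, "RO")])

def get_server(type_wanted):
    # Pick one random server of the requested type, as in Example 6-2.
    return conn.execute("SELECT host, port FROM nodes"
                        " WHERE type = ? ORDER BY RANDOM() LIMIT 1",
                        (type_wanted,)).fetchone()

print(get_server("RO"))  # one of the two read-only slaves
```

Repeated calls with "RO" will bounce between the two slaves, which is exactly the even distribution the ORDER BY RAND() trick buys in the PHP version.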
A dictionary class is introduced (and will be used in the remainder of the book) whose responsibility is to deal out connections to MySQL instances. When a dictionary class instance is constructed, information about the MySQL server that stores the information needs to be provided. This server stores information about each server in the deployment, in order to manage the connections to them.

The get_connection function is used to request a connection to a server in the deployment. Which server to connect to is decided based on a hint passed to get_connection. The hint is an associative array with information about what sort of connection is requested, and the function will deliver a connection to a server matching the criteria. In this case, the hint conveys only whether a read-only or read-write server is requested.

The final task is to provide utility functions for adding and removing servers and for updating the capabilities of a server. Because these are mainly to be used from the administration logic, we've implemented these functions in Python using the Replicant library. The utility consists of three functions, demonstrated in Example 6-3:

pool_add(common, server, types)
    Adds a server to the pool. The pool is stored at the server denoted by common, and types is a list—or other iterable—of the type values to set.

pool_del(common, server)
    Deletes a server from the pool.

pool_set(common, server, types)
    Changes the types of the server.

Example 6-3.
Administrative functions for the load balancer

from mysql.replicant.errors import (
    Error,
)

from MySQLdb import IntegrityError

class AlreadyInPoolError(Error):
    pass

_INSERT_SERVER = ("INSERT INTO nodes(host, port, sock, type)"
                  " VALUES (%s, %s, %s, %s)")
_DELETE_SERVER = ("DELETE FROM nodes"
                  " WHERE host = %s AND port = %s")
_UPDATE_SERVER = ("UPDATE nodes SET type = %s"
                  " WHERE host = %s AND port = %s")

def pool_add(common, server, types=None):
    if types is None:
        types = []
    common.use("common")
    try:
        common.sql(_INSERT_SERVER,
                   (server.host, server.port, server.socket, ','.join(types)))
    except IntegrityError:
        raise AlreadyInPoolError

def pool_del(common, server):
    common.use("common")
    common.sql(_DELETE_SERVER, (server.host, server.port))

def pool_set(common, server, types=None):
    if types is None:
        types = []
    common.use("common")
    common.sql(_UPDATE_SERVER, (','.join(types), server.host, server.port))

These functions can be used as shown in the following examples:

pool_add(common, master, ['READ', 'WRITE'])
for slave in slaves:
    pool_add(common, slave, ['READ'])

With everything in place, the load balancer can be used as in Example 6-4, where the dictionary is set up to use central.example.com for the central repository. After that, get_connection can be used to get connections to the server based on the hint provided.

Example 6-4. PHP code using the load balancer

$DICT = new Dictionary("central.example.com", "mats", "");

$QUERY = <<<END_OF_QUERY
SELECT first_name, last_name, dept_name
  FROM employees JOIN dept_emp USING (emp_no)
       JOIN departments USING (dept_no)
 WHERE emp_no = ?
END_OF_QUERY;

$mysql = $DICT->get_connection('mats', 'xyzzy', 'employees',
                               array('type' => DB_RO));
$stmt = $mysql->prepare($QUERY);
if ($stmt) {
    $stmt->bind_param("d", $emp_no);
    $stmt->execute();
    $stmt->bind_result($first_name, $last_name, $dept_name);
    while ($stmt->fetch())
        print "$first_name $last_name $dept_name\n";
    $stmt->close();
}
else {
    echo "Error: " . $mysql->error;
}

In Example 6-2, a query is sent to the central repository for each query dispatched.
This doubles the number of queries sent out by the application, and can lead to performance degradation. To solve this, you should cache the data from the central repository and fetch the information from the cache instead, as shown in Example 6-5.

Caches require a strategy for deciding when to invalidate their contents. In this case, a simple time-to-live caching strategy is employed, where the cache is reloaded if it is too old. This is a very simple implementation, but it means that any changes to the topology are not recognized immediately. If you change the information in the centralized store, you have to keep the old servers available until the timer expires, at which point the information is guaranteed to be reloaded from the centralized store.

Example 6-5. Caching a load balancer in PHP

define('DB_RW', 'RW');
define('DB_RO', 'RO');
define('TTL', 60);

$FETCH_QUERY = <<<END_OF_QUERY
SELECT host, port, type FROM nodes
END_OF_QUERY;

class Dictionary {
    private $server;
    private $cache = array();
    private $last_update = 0;

    public function __construct($host, $user, $pass, $port = 3306) {
        $this->server = new mysqli($host, $user, $pass, 'metainfo', $port);
    }

    public function get_connection($user, $pass, $db, $hint) {
        if (time() > $this->last_update + TTL)
            $this->update_cache();
        $type = $hint['type'];
        if (array_key_exists($type, $this->cache)) {
            $servers = $this->cache[$type];
            $no = rand(0, count($servers) - 1);
            list($host, $port) = $servers[$no];
            return new mysqli($host, $user, $pass, $db, $port);
        }
        else
            return null;
    }

    private function update_cache() {
        global $FETCH_QUERY;
        if ($stmt = $this->server->prepare($FETCH_QUERY)) {
            $cache = array();
            $stmt->execute();
            $stmt->bind_result($host, $port, $type);
            while ($stmt->fetch())
                $cache[$type][] = array($host, $port);
            $this->cache = $cache;
            $this->last_update = time();
        }
    }
}

This constant is used for the "time to live" of the cache. A long time means that the centralized store is not queried as often, but it also means that changes in the topology are not recognized as fast.

In contrast to Example 6-2, the entire contents of the centralized store are loaded with the query.
In this case, it is assumed that the entire contents can be loaded, but for really large data sets, it might be more sensible to create a query that does not load the parts of the dictionary table that are not going to be used.

Check the last time the cache was updated. If it was more than TTL seconds ago, the cache will be updated. After this if statement has executed, it is guaranteed that the cache is up-to-date (or at least as up-to-date as it can be).

Fetch the host and the port from the cache instead of from the server, as done in Example 6-2. Here, a random server is picked, but other policies are possible.

Here we only update the cache if it is possible to prepare the query on the server. If the server cannot be contacted for some reason, you still have to be able to execute queries. In this code, it is assumed that the current contents of the cache can be used, at least for a while longer, while the database with the information restarts.

Here the cache is filled based on the type of the server. Each entry in the cache contains a list of candidate servers for that type.

MySQL native driver replication and load balancing plug-in

The PHP team at MySQL has created several plug-ins for the MySQL Native Driver (mysqlnd). One of these plug-ins can be used for handling read-write splitting, load balancing using a few different strategies, and failover handling. You can find more information on PHP.net.

In contrast to the example implementation used earlier, mysqlnd_ms uses a configuration file containing information about where to fail over. This means that it is very efficient (all the info is in memory), but also that it is static.

The information about the masters and the slaves is stored in a configuration file in JSON format similar to the file in Example 6-6. Here, the master is assumed to be read-write and the slaves are all read-only servers.

Example 6-6.
Example of mysqlnd_ms configuration file

{
  "myapp": {
    "master": [
      { "host": "master1.example.com" }
    ],
    "slave": [
      { "host": "slave1.example.com", "port": "3306" },
      { "host": "slave2.example.com", "port": "3307" },
      { "host": "slave3.example.com", "port": "3308" }
    ]
  }
}

When a connection is established, the hostname is used as a key into the structure in Example 6-6, and if a match is found, the connection information in one of the entries under the key is used instead. Which connection information is used depends on the policy set for the load balancer.

The load balancer investigates each statement to decide where to send it. Any statement starting with SELECT is considered a read-only statement and will be sent to the slave, while any other statement is sent to the master. You can see example code for using mysqlnd_ms in Example 6-7.

Example 6-7. PHP code for using mysqlnd_ms

$mysql = new mysqli("myapp", 'mats', 'xyzzy', 'employees');

$QUERY = <<<END_OF_QUERY
SELECT first_name, last_name, dept_name
  FROM employees JOIN dept_emp USING (emp_no)
       JOIN departments USING (dept_no)
 WHERE emp_no = ?
END_OF_QUERY;

$stmt = $mysql->prepare($QUERY);
if ($stmt) {
    $stmt->bind_param("d", $emp_no);
    $stmt->execute();
    $stmt->bind_result($first_name, $last_name, $dept_name);
    while ($stmt->fetch())
        print "$first_name $last_name $dept_name\n";
    $stmt->close();
}
else {
    echo "Error: " . $mysql->error;
}

The query contains SELECT first, so the plug-in will assume that this is a read query and should be sent to a read slave. Note that the hostname given is not a real hostname, but rather a reference to the myapp key in the configuration file. The plug-in will use this information to dispatch the query to the correct server.

Hierarchical Replication

Although the master is quite good at handling a large number of slaves, there is a limit to how many slaves it can handle before the load becomes too high for comfort (roughly 70 slaves for each master seems to be a practical limit, but as you probably realize, this depends a lot on the application), and an unresponsive master is always a problem.
In those cases, you can add an extra slave (or several) as a relay slave (or simply relay), whose only purpose is to lighten the load of replication on the master by taking care of a bunch of slaves. Using a relay in this manner is called hierarchical replication. Figure 6-4 illustrates a typical setup with a master, a relay, and several slaves connected to the relay.

Figure 6-4. Hierarchical topology with master, relay, and slaves

By default, the changes the slave receives from its master are not written to the binary log of the slave, so if SHOW BINLOG EVENTS is executed on the slave in the previous setup, you will not see any events in the binlog. The reason for this is that there is no point in wasting disk space by recording the changes; if there is a problem and, say, the slave crashes, you can always recover by cloning the master or another slave. On the other hand, the relay server needs to keep a binary log to record all the changes, because the relay passes them on to other slaves. Unlike typical slaves, however, the relay doesn't need to actually apply changes to a database of its own, because it doesn't answer queries. In short, a typical slave needs to apply changes to a database but does not need to keep a binary log, whereas a relay server needs to keep a binary log but does not need to apply changes to a database.

To avoid writing changes to the database, it is necessary to keep the tables around (so the statements can be executed), but the changes should just be thrown away. A storage engine named Blackhole was created for purposes just like this one. The Blackhole engine accepts all statements and always reports success in executing them, but any changes are just thrown away.

A relay introduces an extra delay that can cause its slaves to lag further behind the master than slaves that are directly connected to the master.
This lag should be balanced against the benefits of removing some load from the master, because managing a hierarchical setup is significantly more difficult than managing a simple setup.

Setting Up a Relay Server

Setting up a relay server is quite easy, but we have to consider what to do with tables that are being created on the relay as well as what to do with tables that already exist on the relay when we change its role. Not keeping data in the databases will make processing events faster and reduce the lag for the slaves at the end of the replication process, because there is no data to be updated.

To set up a relay server, we thus have to:

1. Configure the slave to forward any events executed by the slave thread by writing them to the binlog of the relay slave.
2. Change the storage engine for all tables on the relay server to use the BLACKHOLE storage engine, to preserve space and improve performance.
3. Ensure that any new tables added to the relay also use the BLACKHOLE engine.

Configuring the relay server to forward events executed by the slave thread is done by adding the log-slave-updates option to my.cnf, as demonstrated earlier. In addition to setting log-slave-updates, it is necessary to change the default storage engine using the default-storage-engine option in the my.cnf file. You can temporarily change the storage engine on the relay by issuing the statement SET STORAGE_ENGINE = 'BLACKHOLE', but that setting will not persist if the server is restarted.

The final task is to change the storage engine for all tables already on the relay server to use BLACKHOLE. Do this using the ALTER TABLE statement to change the storage engine for each table on the server. Because the ALTER TABLE statements shouldn't be written to the binary log (the last thing we want is for the slaves to discard the changes they receive!), turn off binary logging temporarily while executing the ALTER TABLE statements. This is shown in Example 6-8.

Example 6-8.
Changing the engine for all tables in database windy

relay> SHOW TABLES FROM windy;
+-----------------+
| Tables_in_windy |
+-----------------+
| user_data       |
.
.
.
| profile         |
+-----------------+
45 rows in set (0.15 sec)

relay> SET SQL_LOG_BIN = 0;
relay> ALTER TABLE user_data ENGINE = 'BLACKHOLE';
.
.
.
relay> ALTER TABLE profile ENGINE = 'BLACKHOLE';
relay> SET SQL_LOG_BIN = 1;

This is all you need to turn a server into a relay server. The usual way you come to employ a relay is to start with a setup in which all slaves attach directly to a master and discover after some time that it is necessary to introduce a relay server. The reason is usually that the master has become too loaded, but there could be architectural reasons for making the change as well. So how do you handle that?

You can use what you learned in the previous sections and modify the existing deployment to introduce the new relay server by:

1. Connecting the relay server to the master and configuring it to act as a relay server.
2. Switching over the slaves one by one to the relay server.

Adding a Relay in Python

Let's turn to the task of developing support for administering relays by extending our library. Because we have a system for creating new roles and imbuing servers with those roles, let's use that by defining a special role for the relay server. This is shown in Example 6-9.

Example 6-9.
Role definition for relay

from mysql.replicant import roles

class Relay(roles.Role):
    def __init__(self, master):
        super(Relay, self).__init__()
        self.__master = master

    def imbue(self, server):
        config = server.get_config()
        self._set_server_id(server, config)
        self._enable_binlog(server)
        config.set('mysqld', 'log-slave-updates', '1')
        server.put_config(config)
        server.sql("SET SQL_LOG_BIN = 0")
        # Iterate over the databases on the server (e.g., the result of
        # a SHOW DATABASES query) and turn their tables into black holes.
        for db in list_of_databases:
            for table in server.sql("SHOW TABLES FROM %s", (db)):
                server.sql("ALTER TABLE %s.%s ENGINE=BLACKHOLE", (db, table))
        server.sql("SET SQL_LOG_BIN = 1")

Specialized Slaves

In the simple scale-out deployment—like the one described thus far—all slaves receive all data and can therefore handle any kind of query. It is, however, not very common for requests to be distributed evenly over the different parts of the data. Instead, there is usually some data that needs to be accessed very frequently and some that is rarely accessed. For example, consider the needs of an ecommerce site:

• The product catalog is browsed almost all the time.
• Data about items in stock may not be requested very often.
• User data is not requested very often, because most of the critical information is recorded using session-specific information stored in the browser as cookies.
• On the other hand, if cookies are disabled, the session data will be requested from the server with almost every page request.
• Newly added items are usually accessed more frequently than old items (e.g., "special offers" might be accessed more frequently than other items).

It would clearly be a waste of resources to keep the rarely accessed data on each and every slave just in case it is requested. It would be much better to use the deployment shown in Figure 6-5, where a few servers are dedicated to keeping rarely accessed data, while a different set of servers are dedicated to keeping data that is accessed frequently.

Figure 6-5.
Replication topology with master and specialized slaves

To do this, it is necessary to separate tables when replicating. MySQL can do this by filtering the events that leave the master or, alternatively, filtering the events when they arrive at the slave.

Filtering Replication Events

The two different ways of filtering events are called master filters when the events are filtered on the master and slave filters when the events are filtered on the slave. The master filters control what goes into the binary log and therefore what is sent to the slaves, while slave filters control what is executed on the slave.

For the master filters, events for filtered-out tables are not stored in the binary log at all, while for slave filters, the events are stored in the binary log and also sent to the slave, and are not filtered out until just before they are going to be executed. This means that it is not possible to use PITR to recover these databases properly—if the databases are stored in the backup image, they will still be restored when restoring the backup, but any changes made to tables in the database since that moment will not be recovered, because the changes are not in the binary log.

If slave filters are used, all changes are sent over the network. This clearly wastes network bandwidth, especially over long-haul network connections. Later in this chapter, you will see a detailed discussion of the relative merits of master and slave filtering and an approach that allows the binary log to remain intact while still saving network bandwidth.

Master filters

There are two configuration options for creating master filters:

binlog-do-db=db
    If the current database of the statement is db, the statement will be written to the binary log; otherwise, the statement will be discarded.
binlog-ignore-db=db
    If the current database of the statement is db, the statement will be discarded; otherwise, the statement will be written to the binary log.

If you want to replicate everything except a few databases, use binlog-ignore-db. If you want to replicate just a few databases, use binlog-do-db. Combining them is not recommended, because the logic for deciding whether a database should be replicated or not is complicated (see Figure 4-3). The options do not accept lists of databases, so if you want to list several databases, you have to repeat an option multiple times. As an example, to replicate everything except the top and secret databases, add the following options to the configuration file:

[mysqld]
...
binlog-ignore-db = top
binlog-ignore-db = secret

Using the binlog-*-db options to filter events means that the two databases will not be stored in the binary log at all, and hence cannot be recovered using PITR in the event of a crash. For that reason, it is strongly recommended that you use slave filters, not master filters, when you want to filter the replication stream. You should use master filters only for data that can be considered volatile and that you can afford to lose.

Slave filters

Slave filtering offers a longer list of options. In addition to being able to filter the events based on the database, slave filters can filter individual tables and even groups of table names by using wildcards.

In the following list of rules, the replicate-wild rules look at the full name of the table, including both the database and table name. The pattern supplied to the option uses the same patterns as the LIKE string comparison function—that is, an underscore (_) matches a single character, whereas a percent sign (%) matches a string of any length. Note, however, that the pattern must contain a period to be legitimate.
This means that the database name and table name are matched individually, so each wildcard applies only to the database name or the table name.

replicate-do-db=db
    If the current database of the statement is db, execute the statement.

replicate-ignore-db=db
    If the current database of the statement is db, discard the statement.

replicate-do-table=db_name.tbl_name
replicate-wild-do-table=db_pattern.tbl_pattern
    If the name of the table being updated is db_name.tbl_name or matches the pattern, execute updates to the table.

replicate-ignore-table=db_name.tbl_name
replicate-wild-ignore-table=db_pattern.tbl_pattern
    If the name of the table being updated is db_name.tbl_name or matches the pattern, discard updates to the table.

These filtering rules are evaluated just before the server decides whether to execute them, so all events are sent to the slave before being filtered.

Using Filtering to Partition Events to Slaves

So what are the benefits and drawbacks of filtering on the master versus filtering on the slave? At a brief glance, it might seem like a good idea to structure the databases so that it is possible to filter events on the master using the binlog-*-db options instead of using the replicate-*-db options. That way, the network is not laden with a lot of useless events that will be removed by the slave anyway. However, as mentioned earlier in the chapter, there are problems associated with filtering on the master:

• Because the events are filtered from the binary log and there is only a single binary log, it is not possible to "split" the changes and send different parts of the database to different servers.
• The binary log is also used for PITR, so if there are any problems with the server, it will not be possible to restore everything.
• If, for some reason, it becomes necessary to split the data differently, it will no longer be possible, because the binary log has already been filtered and cannot be “unfiltered.”

It would be ideal if the filtering could be on the events sent from the master and not on the events written to the binary log. It would also be good if the filtering could be controlled by the slave so that the slave could decide which data to replicate. For MySQL version 5.1 and later, this is not possible, and instead, it is necessary to filter events using the replicate-* options—that is, to filter the events on the slave. As an example, to dedicate a slave to the user data stored in the two tables users and profiles in the app database, shut down the server and add the following filtering options to the my.cnf file:

    [mysqld]
    ...
    replicate-wild-do-table=app.users
    replicate-wild-do-table=app.profiles

If you are concerned about network traffic—which could be significant if you replicate over long-haul networks—you can set up a relay server on the same machine as the master, as shown in Figure 6-6 (or on the same network segment as the master), whose only purpose is to produce a filtered version of the master’s binary log.

Figure 6-6. Filtering by putting master and relay on the same machine

Managing Consistency of Data

As discussed earlier in the chapter, one of the problems with asynchronous replication is managing consistency. To illustrate the problem, let’s imagine you have an ecommerce site where customers can browse products and put items they want to purchase in a cart. You’ve set up your servers so that when a user adds an item to the cart, the change request goes to the master, but when the web server requests information about the contents of the cart, the query goes to one of the slaves tasked with answering such queries.
Because the master is ahead of the slave, it is possible that the change has not reached the slave yet, so a query to the slave will find the cart empty. This will, of course, come as a big surprise to the customer, who will then promptly add the item to the cart again, only to discover that the cart now contains two items, because this time the slave managed to catch up and replicate both changes to the cart. This situation clearly needs to be avoided or you will risk a bunch of irritated customers.

To avoid getting data that is too old, it is necessary to somehow ensure that the data provided by the slave is recent enough to be useful. As you will see, the problem becomes even trickier when a relay server is added to the mix. The basic idea of handling this is to somehow mark each transaction committed on the master, and then wait for the slave to reach that transaction (or a later one) before trying to execute a query on the slave.

With the introduction of global transaction identifiers (GTIDs) in MySQL 5.6, failing over slaves and clients has become significantly simpler because most of the techniques described here are handled automatically. The global transaction IDs are described in detail in “Global Transaction Identifiers” on page 260, but this chapter will to a large extent focus on the old solution for the benefit of users who have not transitioned to MySQL 5.6 yet. The differences between the pre-5.6 and 5.6-based solutions are highlighted in the remainder of the chapter. Prior to MySQL 5.6, the problem needed to be handled in different ways depending on whether there are any relay servers between the master and the slave.

Consistency in a Nonhierarchical Deployment

When all the slaves are connected directly to the master, it is very easy to check for consistency.
In this case, it is sufficient to record the binlog position after the transaction has been committed and then wait for the slave to reach this position using the previously introduced MASTER_POS_WAIT function. But it is not possible to get the exact position where a transaction was written in the binlog. Why? Because in the time between the commit of a transaction and the execution of SHOW MASTER STATUS, several events can be written to the binlog. This does not matter, because in this case, it is not necessary to get the exact binlog position where the transaction was written; it is sufficient to get a position that is at or later than the position of the transaction. Because the SHOW MASTER STATUS command will show the position where the server is currently writing events, executing this after the transaction has committed will be sufficient for getting a binlog position that can be used for checking consistency. Example 6-10 shows the PHP code for processing an update to guarantee that the data presented is not stale.

Example 6-10. PHP code for avoiding read of stale data

    function fetch_master_pos($server) {
        $result = $server->query('SHOW MASTER STATUS');
        if ($result == NULL)
            return NULL;        // Execution failed
        $row = $result->fetch_assoc();
        if ($row == NULL)
            return NULL;        // No binlog enabled
        $pos = array($row['File'], $row['Position']);
        $result->close();
        return $pos;
    }

    function sync_with_master($master, $slave) {
        $pos = fetch_master_pos($master);
        if ($pos == NULL)
            return FALSE;
        if (!wait_for_pos($slave, $pos[0], $pos[1]))
            return FALSE;
        return TRUE;
    }

    function wait_for_pos($server, $file, $pos) {
        $result = $server->query(
            "SELECT MASTER_POS_WAIT('$file', $pos)");
        if ($result == NULL)
            return FALSE;       // Execution failed
        $row = $result->fetch_row();
        if ($row == NULL)
            return FALSE;       // Empty result set ?!
        if ($row[0] == NULL || $row[0] < 0)
            return FALSE;       // Sync failed
        $result->close();
        return TRUE;
    }

    function commit_and_sync($master, $slave) {
        if ($master->commit()) {
            if (!sync_with_master($master, $slave))
                return NULL;    // Synchronization failed
            return TRUE;        // Commit and sync succeeded
        }
        return FALSE;           // Commit failed (no sync done)
    }

    function start_trans($server) {
        $server->autocommit(FALSE);
    }

Example 6-10 contains the functions commit_and_sync and start_trans together with three support functions: fetch_master_pos, wait_for_pos, and sync_with_master. The commit_and_sync function commits a transaction and waits for it to reach a designated slave. It accepts two arguments, a connection object to a master and a connection object to the slave. The function will return TRUE if the commit and the sync succeeded, FALSE if the commit failed, and NULL if the commit succeeded but the synchronization failed (either because there was an error in the slave or because the slave lost the master).

The function works by committing the current transaction and then, if that succeeds, fetching the current master binlog position through SHOW MASTER STATUS. Because other threads may have executed updates to the database between the commit and the call to SHOW MASTER STATUS, it is possible (even likely) that the position returned is not at the end of the transaction, but rather somewhere after where the transaction was written in the binlog. As mentioned earlier, this does not matter from an accuracy perspective, because the transaction will have been executed anyway when we reach this later position. After fetching the binlog position from the master, the function proceeds by connecting to the slave and executing a wait for the master position using the MASTER_POS_WAIT function. If the slave is running, a call to this function will block and wait for the position to be reached, but if the slave is not running, NULL will be returned immediately.
This is also what will happen if the slave stops while the function is waiting (such as when an error occurs while the slave thread executes a statement). In either case, NULL indicates the transaction has not reached the slave, so it’s important to check the result of the call. If MASTER_POS_WAIT returns 0, it means that the slave had already seen the transaction, so synchronization succeeds trivially.

To use these functions, it is sufficient to connect to the server as usual, but then use the functions to start, commit, and abort transactions. Example 6-11 shows their use in context; the error checking has been omitted because it depends on how errors are handled.

Example 6-11. Using the start_trans and commit_and_sync functions

    require_once './database.inc';

    start_trans($master);
    $master->query('INSERT INTO t1 SELECT 2*a FROM t1');
    commit_and_sync($master, $slave);

PHP scripts have a maximum execution time, which defaults to 30 seconds. If the script exceeds this time, execution will be terminated. You need to keep that in mind when using the code in Example 6-10, either by running in safe mode or by changing the maximum execution time.

Consistency in a Hierarchical Deployment

Thanks to the global transaction identifiers introduced in MySQL 5.6, managing consistency in a MySQL 5.6 server is just as easy as in “Consistency in a Nonhierarchical Deployment” on page 178. Because the transaction identifier does not change between machines, it does not matter how many relay servers there are between the origin of the transaction and the server you connect to. Managing consistency in a hierarchical deployment before MySQL 5.6 is significantly different from managing consistency in a simple replication topology where each slave is connected directly to the master.
Because the positions are changed by every intermediate relay server, it is not possible to wait for a master position at the ultimate slave (the slave at the bottom of the hierarchy). Instead, it is necessary to figure out another way to wait for the transactions to reach the ultimate slave. There are basically two alternatives that you can use to ensure you are not reading stale data.

The first solution is to use the global transaction identifiers shown in Appendix B to handle slave promotions and to poll the slave repeatedly until it has processed the transaction. In contrast with the global transaction identifiers in MySQL 5.6, there is no wait function that we can use for these, so it is necessary to poll repeatedly.

The MASTER_POS_WAIT function is quite handy when it comes to handling the wait, so if it were possible to use that function, it would solve a lot of problems. The second solution, illustrated in Figure 6-7, uses this function to connect to each of the relay servers in the path from the master to the final slave to ensure the change propagates to the slave. It is necessary to connect to each relay slave between the master and the slave, because it is not possible to know which binlog position will be used on each of the relay servers.

Figure 6-7. Synchronizing with all servers in a relay chain

Both solutions have their merits, so let’s consider the advantages and disadvantages of each of them. If the slaves are normally up to date with respect to the master, the first solution will perform a simple check of the final slave only and will usually show that the transaction has been replicated to the slave and that processing can proceed. If the transaction has not been processed yet, it is likely that it will be processed before the next check, so the second time the final slave is checked, it will show that the transaction has reached the slave.
If the checking period is small enough, the delay will not be noticeable to the user, so a typical consistency check will require one or two extra messages when polling the final slave. This approach requires only the final slave to be polled, not any of the intermediate slaves. This can be an advantage from an administrative point of view as well, because it does not require keeping track of the intermediate slaves and how they are connected.

On the other hand, if the slaves normally lag behind, or if the replication lag varies a lot, the second approach is probably better. The first solution will repeatedly poll the slave, and most of the time will report that the transaction has not been committed on the slave. You can handle this by increasing the polling period, but if the polling period has to be so large that the response time is unacceptable, the first solution will not work well. In this case, it is better to use the second solution and wait for the changes to ripple down the replication tree and then execute the query. For a tree of N servers, the number of extra requests will then be proportional to log N. For instance, if you have 50 relay servers and each relay server handles 50 final slaves, you can handle all 2,500 slaves with exactly two extra requests: one to the relay server and then one to the final slave.

The disadvantages of the second approach are:

• It requires the application code to have access to the relay slaves so that it can connect to each relay server in turn and wait for the position to be reached.
• It requires the application code to keep track of the architecture of your replication so that the relay servers can be queried.

Querying the relay servers will slow them down, because they have to handle more work, but in practice, this might turn out not to be a problem. By introducing a caching database connection layer, you can avoid some of the traffic.
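The caching idea can be made concrete with a minimal sketch. This is an illustration only (written in Python, with invented class and server names, not part of the book's PHP examples): the cache keeps the highest binlog position each server is known to have passed, and a wait round-trip is needed only when the cache cannot already prove the position has been reached.

```python
# Sketch of a binlog-position cache (hypothetical names, for illustration).
class PositionCache:
    def __init__(self):
        self._seen = {}  # server name -> (binlog file, offset)

    def already_passed(self, server, pos):
        """True if `server` is cached at or beyond binlog position `pos`.
        Positions compare as (file, offset) tuples, since binlog file
        names increase together with the offsets they contain."""
        cached = self._seen.get(server)
        return cached is not None and cached >= pos

    def record(self, server, pos):
        # Binlog positions only ever move forward, so keep the maximum.
        cached = self._seen.get(server)
        if cached is None or pos > cached:
            self._seen[server] = pos

cache = PositionCache()
cache.record("relay-1", ("master-bin.000003", 1020))
print(cache.already_passed("relay-1", ("master-bin.000003", 512)))   # True: no query needed
print(cache.already_passed("relay-1", ("master-bin.000004", 107)))   # False: must wait, then record
```

Because a position, once passed, stays passed, a cache hit is always safe; only a miss costs an extra round-trip to the relay.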
The caching layer will remember the binlog position each time a request is made and query the relay only if the binlog position is greater than the cached one. The following is a rough stub for the caching function:

    function wait_for_pos($server, $wait_for_pos) {
        if (cached position for $server > $wait_for_pos)
            return TRUE;
        else {
            code to wait for position and update cache
        }
    }

Because the binlog positions are always increasing—once a binlog position is passed, it remains passed—there is no risk of returning an incorrect result. The only way to know for sure which technique is more efficient is to monitor and profile the deployment to make sure queries are executed fast enough for the application.

Example 6-12 shows sample code to handle the first solution; it queries the slave repeatedly to see whether the transaction has been executed. This code uses the Last_Exec_Trans table introduced in Chapter 5 by checking it on the master, and then repeatedly reading the table on the slave until it finds the correct transaction.

Example 6-12. PHP code for avoiding read of stale data using polling

    function fetch_trans_id($server) {
        $result = $server->query(
            "SELECT server_id, trans_id FROM Last_Exec_Trans");
        if ($result == NULL)
            return NULL;        // Execution failed
        $row = $result->fetch_assoc();
        if ($row == NULL)
            return NULL;        // Empty table !?
        $gid = array($row['server_id'], $row['trans_id']);
        $result->close();
        return $gid;
    }

    function wait_for_trans_id($server, $server_id, $trans_id) {
        if ($server_id == NULL || $trans_id == NULL)
            return TRUE;        // No transactions executed, in sync
        $server->autocommit(TRUE);
        $gid = fetch_trans_id($server);
        if ($gid == NULL)
            return FALSE;
        list($current_server_id, $current_trans_id) = $gid;
        while ($current_server_id != $server_id
               || $current_trans_id < $trans_id) {
            usleep(500000);     // Wait half a second
            $gid = fetch_trans_id($server);
            if ($gid == NULL)
                return FALSE;
            list($current_server_id, $current_trans_id) = $gid;
        }
        return TRUE;
    }

    function commit_and_sync($master, $slave) {
        if ($master->commit()) {
            $gid = fetch_trans_id($master);
            if ($gid == NULL)
                return NULL;
            if (!wait_for_trans_id($slave, $gid[0], $gid[1]))
                return NULL;
            return TRUE;
        }
        return FALSE;
    }

    function start_trans($server) {
        $server->autocommit(FALSE);
    }

The two functions commit_and_sync and start_trans behave the same way as in Example 6-10, and can therefore be used in the same way as in Example 6-11. The difference is that the functions in Example 6-12 internally call fetch_trans_id and wait_for_trans_id instead of fetch_master_pos and wait_for_pos. Some points worth noting in the code:

• Autocommit is turned on in wait_for_trans_id before starting to query the slave. This is necessary because if the isolation level is REPEATABLE READ or stricter, the select will find the same global transaction identifier every time. To prevent this, each SELECT is committed as a separate transaction by turning on autocommit. An alternative is to use the READ COMMITTED isolation level.
• To avoid unnecessary sleeps in wait_for_trans_id, the global transaction identifier is fetched and checked once before entering the loop.
• This code requires access only to the master and slave, not to the intermediate relay servers.

Example 6-13 includes code for ensuring you do not read stale data.
It uses the technique of querying all servers between the master and the final slave. This method proceeds by first finding the entire chain of servers between the final slave and the master, and then synchronizing each in turn all the way down the chain until the transaction reaches the final slave. The code reuses fetch_master_pos and wait_for_pos from Example 6-10, so they are not repeated here. The code does not implement any caching layer.

Example 6-13. PHP code for avoiding reading stale data using waiting

    function fetch_relay_chain($master, $final) {
        $servers = array();
        $server = $final;
        while ($server !== $master) {
            $servers[] = $server;
            $server = get_master_for($server);
        }
        $servers[] = $master;
        return $servers;
    }

    function commit_and_sync($master, $slave) {
        if ($master->commit()) {
            $server = fetch_relay_chain($master, $slave);
            for ($i = sizeof($server) - 1; $i > 0; --$i) {
                if (!sync_with_master($server[$i], $server[$i-1]))
                    return NULL;    // Synchronization failed
            }
            return TRUE;
        }
        return FALSE;
    }

    function start_trans($server) {
        $server->autocommit(FALSE);
    }

To find all the servers between the master and the slave, we use the function fetch_relay_chain. It starts from the slave and uses the function get_master_for to get the master for a slave. We have deliberately not included the code for this function, as it does not add anything to our current discussion. However, this function has to be defined for the code to work. After the relay chain is fetched, the code synchronizes the master with its slave all the way down the chain. This is done with the sync_with_master function, which was introduced in Example 6-10.

One way to fetch the master for a server is to use SHOW SLAVE STATUS and read the Master_Host and Master_Port fields. If you do this for each transaction you are about to commit, however, the system will be very slow.
Because the topology rarely changes, it is better to cache the information on the application servers, or somewhere else, to avoid excessive traffic to the database servers.

In Chapter 5, you saw how to handle the failure of a master by, for example, failing over to another master or promoting a slave to be a master. We also mentioned that once the master is repaired, you need to bring it back to the deployment. The master is a critical component of a deployment and is likely to be a more powerful machine than the slaves, so you should restore it to the master position when bringing it back. Because the master stopped unexpectedly, it is very likely to be out of sync with the rest of the deployment. This can happen in two ways:

• If the master has been offline for more than just a short time, the rest of the system will have committed many transactions that the master is not aware of. In a sense, the master is in an alternative future compared to the rest of the system. An illustration of this situation is shown in Figure 6-8.
• If the master committed a transaction and wrote it to the binary log, then crashed just after it acknowledged the transaction, the transaction may not have made it to the slaves. This means the master has one or more transactions that have not been seen by the slaves, nor by any other part of the system.

If the original master is not too far behind the current master, the easiest solution to the first problem is to connect the original master as a slave to the current master, and then switch over all slaves to the original master once it has caught up. If, however, the original master has been offline for a significant period, it is likely to be faster to clone one of the slaves and then switch over all the slaves to the master. If the master is in an alternative future, it is not likely that its extra transactions should be brought into the deployment. Why?
Because the sudden appearance of a new transaction is likely to conflict with existing transactions in subtle ways. For example, if the transaction is a message in a message board, it is likely that a user has already recommitted the message. If a message written earlier but reported as missing—because the master crashed before the message was sent to a slave—suddenly reappears, it will befuddle the users and definitely be considered an annoyance. In a similar manner, users will not look kindly on shopping carts suddenly having items added because the master was brought back into the system.

Figure 6-8. Original master in an alternative future

In short, you can solve both of the out-of-sync problems—the master in an alternative future and the master that needs to catch up—by simply cloning a slave to the original master and then switching over each of the current slaves in turn to the original master. These problems, however, highlight how important it is to ensure consistency by checking that changes to a master are available on some other system before reporting the transaction as complete, in the event that the master should crash.

The code that we have discussed in this chapter assumes that a user will try to read the data immediately, and therefore checks that the change has reached the slave before a read query is carried out on the server. From a recovery perspective, this is excessive; it is sufficient to ensure the transaction is available on at least one other machine (e.g., on one of the slaves or relay servers connected to the master). In general, you can tolerate n−1 failures if you have the change available on n servers. As of MySQL 5.6, you can use global transaction identifiers to handle this.
Simply use the WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS function instead of MASTER_POS_WAIT in Example 6-10, leading to this definition of wait_for_pos:

    function wait_for_pos($server, $gtids) {
        $result = $server->query(
            "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS($gtids)");
        if ($result == NULL)
            return FALSE;       // Execution failed
        $row = $result->fetch_row();
        if ($row == NULL)
            return FALSE;       // Empty result set ?!
        if ($row[0] == NULL || $row[0] < 0)
            return FALSE;       // Sync failed
        $result->close();
        return TRUE;
    }

You can find a full description of global transaction identifiers in “Global Transaction Identifiers” on page 260.

Conclusion

In this chapter, we looked at techniques to increase the throughput of your applications by scaling out, whereby we introduced more servers to handle more requests for data. We presented ways to set up MySQL for scaling out using replication and gave practical examples of some of the concepts. In the next chapter, we will look at some more advanced replication concepts.

A rap on Joel’s door drew his attention to Mr. Summerson standing in the doorway. “I like your report on scaling out our servers, Joel. I want you to get started on that right away. Use some of those surplus servers we have down in the computer room.”

Joel was happy he had decided to send his boss a proposal first. “Yes, sir. When do we need these online?”

Mr. Summerson smiled and glanced at his watch. “It’s not quitting time yet,” he said and walked away.

Joel wasn’t sure whether he was joking or not, so he decided to get started right away. He picked up his now-well-thumbed copy of MySQL High Availability and his notes and headed to the computer room. “I hope I set the TiVo,” he muttered, knowing this was going to be a late night.

CHAPTER 7
Data Sharding

Joel finished reviewing his server logs and noted a few issues with a couple of queries.
He made notes in his engineering notebook to watch these queries so he could learn whether they are simply long running or queries that need to be refactored to be more efficient. He was just noting the username for each query when his boss pinged him on the company’s Jabber channel. “I wish I’d never suggested we adopt instant messaging,” he thought. Joel typed the normal response, “pong,” and waited for his boss to fire off another microtask. While it was nice not to get ambushed in his own office, Joel knew his boss well enough that sooner or later Mr. Summerson would give up using Jabber and return to his usual drive-by tasking routine.

“J, I need u to wrt a WP on shrdin. Do we need? Hlp new acct?”

Joel took a moment to decipher his boss’s propensity to think instant messaging was the same as texting, where typing is more difficult and some services charge by the letter. “OK, so he’s discovered sharding. Now, what is a WP?” Joel typed his response, “Sure, I’ll get right on it. Do you mean white paper?” “Ack” appeared a moment later.

Joel minimized his Jabber window, opened his browser, entered “sharding mysql white paper pdf” in the search box, and pressed Enter. “If it’s out there, I’ll find it.”

In the previous chapter, you learned how to scale reads by attaching slaves to a master and directing reads to the slaves while writes go to the master. As the load increases, it is easy to add more slaves to the master and serve more read queries. This allows you to easily scale when the read load increases, but what about the write load? All writes still go to the master, so if the number of writes increases enough, the master will become the bottleneck preventing the system from scaling.

[1] The MySQL Reference Manual refers to a schema as a database, making that term ambiguous. The SQL standard actually uses the name schema, and in the SQL standard the syntax for the create statement is CREATE SCHEMA schema.

At this point, you will probably ask
whether there is some way to scale writes as well as reads. We’ll present sharding as a solution in this chapter, but let’s start with a look at some background.

In previous chapters, the data in the database was fully stored on a single server, but in this chapter, you will see the data in the database distributed over several servers. To avoid confusion, we’ll use the term schema to denote the name that you use with the statement USE schema or CREATE DATABASE schema.[1] We’ll reserve the term database for the collection of all the data that you have stored, regardless of how many machines it is distributed over. For example, you can choose to break up a database by placing some tables on different machines (also known as functional partitioning), or by splitting some tables and placing some of the rows on different machines (called horizontal partitioning, which is what we are talking about in this chapter).

What Is Sharding?

Most attempts to scale writes start with the setup in Figure 7-1, consisting of two masters using bidirectional replication and a set of clients that update different masters depending on which data they need to change. Although the architecture appears to double the capacity for handling writes (because there are two masters), it actually doesn’t. Writes are just as expensive as before because each statement has to be executed twice: once when it is received from the client and once when it is received from the other master. All the writes done by the A clients, as well as the B clients, are replicated and get executed twice, which leaves you in no better position than before. In short, a dual-master setup doesn’t help you scale writes, so it is necessary to find some other means to scale the system. The only way forward is to remove replication between the servers so that they are completely separate.
With this architecture, it is possible to scale writes by partitioning the data into two completely independent sets and directing the clients to the partition that is responsible for each item of data the clients attempt to update. This way, no resources need to be devoted to processing updates for other partitions. Partitioning the data in this manner is usually referred to as sharding (other common names are splintering or horizontal partitioning) and each partition in this setup is referred to as a shard.

Figure 7-1. Pair of masters with bidirectional replication

Why Should You Shard?

Depending on where your application is experiencing strain, you have different reasons to shard. The biggest advantages of sharding, and the most common reasons to shard, are:

Placing data geographically close to the user
    By placing bulky data such as pictures or videos close to the user, it is possible to reduce latency. This will improve the perceived performance of the system.

Reducing the size of the working set
    If the table is smaller, it is possible that a larger part of the table, maybe even the entire table, can fit into main memory. Searching through a table that is entirely in main memory is very efficient, so splitting a large table into many small tables may improve performance drastically. This means that performance can be improved by sharding tables, even if multiple shards are stored on a single server. Another aspect that affects performance is that the algorithms that search the tables are more efficient if the table is smaller. This can give a performance boost even when multiple shards are stored on the same machine. There are, however, technical limitations and overheads associated with storing multiple shards on a machine, so it is necessary to strike a balance between the number of shards and the size of the shards.
    Deciding the optimal size of the tables requires monitoring the performance of the MySQL server and also monitoring InnoDB (or any other storage engine you use) to learn the average number of I/O operations required for each row scanned and to see if you need to make the shards even smaller. You will learn more about monitoring the server using the performance schema in Chapter 11 and monitoring InnoDB in Chapter 12 (especially getting statistics on the buffer pool, as it is important to optimize the size of the shards).

Distributing the work
    If the data is sharded, it is possible to parallelize the work, provided that it is simple enough. This approach is most efficient when the shards are approximately the same size. So if you shard your database for this reason, you must find a way to balance the shards as they grow or shrink over time.

It’s worth noting that you do not have to shard all the data in the database. You can shard some of the big tables, and duplicate the smaller tables on each shard (these are usually called global tables). You can also combine sharding and functional partitioning and shard bulky data such as posts, comments, pictures, and videos, while keeping directories and user data in an unsharded central store, similar to the deployment shown in Figure 7-2.

Figure 7-2. Shards with a centralized database

Limitations of Sharding

Sharding can improve performance, but it is not a panacea and comes with its own set of limitations that may or may not affect you. Many of these can be handled, and in this section you will learn about the limitations and how to handle them. The challenge is to ensure that all queries give the same result when executed against the unsharded database and the sharded database. If your queries access multiple tables (which is usually the case), you have to be careful to ensure that you get the same result for the unsharded database and the sharded database.
This means that you have to pick a sharding index that ensures that the queries get the same result on a sharded or unsharded database. In some cases, it is not practical or possible to solve the problem using sharding indexes, and it is necessary to rewrite the query (or eliminate it entirely, if possible). Two common problems you need to handle are cross-shard joins and AUTO_INCREMENT columns. We’ll briefly cover them in the following sections.

Cross-shard joins

One of the most critical limitations that might affect you is cross-shard joins. Because the tables are partitioned, it is not possible to join two tables that belong to different shards and get the same result as if you executed the query in an unsharded database. The most common reason for using cross-shard joins is to create reports. This usually requires collecting information from the entire database, so two approaches are generally used:

• Execute the query in a map-reduce fashion (i.e., send the query to all shards and collect the result into a single result set).
• Replicate all the shards to a separate reporting server and run the query there.

The advantage of executing the query in a map-reduce fashion is that you can get a snapshot of the live database, but it means that you take resources from the business application that is using the database. If your query is short and you really need a result reflecting the current state of the application database, this might be a useful approach. It is probably wise to monitor these queries, though, to make sure that they are not taking up too many resources and impacting application performance.

The second approach, replication, is easier. It’s usually feasible, as well, because most reporting is done at specific times, is long-running, and does not depend on the current state of the database.
Later, in "Mapping the Sharding Key" on page 206, you will see a technique to automatically detect cross-shard joins and raise an error when attempts are made to execute such queries.

Using AUTO_INCREMENT

It is quite common to use AUTO_INCREMENT to create a unique identifier for a column. However, this fails in a sharded environment because the shards do not synchronize their AUTO_INCREMENT identifiers. This means that if you insert a row in one shard, it might well happen that the same identifier is used on another shard. If you truly want to generate a unique identifier, there are basically two approaches:

• Generate a unique UUID. The drawback is that the identifier takes 128 bits (16 bytes). There is also a slight possibility that the same UUID is picked independently, but it is so small that you can ignore it.
• Use a composite identifier, as in Figure 7-3, where the first part is the shard identifier (see "Mapping the Sharding Key" on page 206) and the second part is a locally generated identifier (which can be generated using AUTO_INCREMENT). Note that the shard identifier is used when generating the key, so if a row with this identifier is moved, the original shard identifier has to move with it. You can solve this by maintaining, in addition to the column with the AUTO_INCREMENT, an extra column containing the shard identifier for the shard where the row was created.

Figure 7-3. A composite key

In case you are interested, the probability of a collision can be computed using the equation that solves the Birthday Problem, where d is the number of "days" and n is the number of "people":

    p(n, d) = 1 - d! / ((d - n)! * d^n)

Elements of a Sharding Solution

The way you shard your database is ultimately determined by the queries that users intend to execute.
For instance, it may make sense to shard sales data by year (2012 in one shard, 2013 in another, etc.), but if users run a lot of queries comparing sales in one December to another December, you will force the queries to cross shards. As we noted before, cross-shard joins are notoriously difficult to handle, so this would hamper performance and even force users to rewrite their queries.

In this section, we will cover the issues you need to handle in order to build a good sharding solution. These decide how you can distribute the data, as well as how you can reshard the data in an efficient manner:

• You have to decide how to partition the application data. What tables should be split? What tables should be available on all shards? What columns are the tables going to be sharded on?
• You have to decide what sharding metadata (information about the shards) you need and how to manage it. This covers such issues as how to allocate shards to MySQL servers, how to map sharding keys to shards, and what you need to store in the "sharding database."
• You have to decide how to handle the query dispatch. This covers such issues as how to get the sharding key necessary to direct queries and transactions to the right shard.
• You have to create a scheme for shard management. This covers issues such as how to monitor the load on the shards, how to move shards, and how to rebalance the system by splitting and merging shards.

In this chapter, you will become familiar with each of these areas and understand what decisions you have to make to develop a working sharding solution.

Applications are usually not designed originally to handle shards. Often, such an extensive redesign does not emerge as a requirement until the database has started to grow enough to impact performance. So normally you start off with an unsharded database and discover that you need to start sharding it.
To describe the elements of sharding, we use the example employee schema in Figure 7-4. The entities in that figure represent a schema of employees, one of the standard example schemas available on the MySQL site. To get an idea of how big the database is, you can see a row count in Table 7-1.

Table 7-1. Row count of the tables in the employees schema

Table             Rows
departments          9
dept_emp        331603
dept_manager        24
employees       300024
salaries       2844047
titles          443308

Figure 7-4. Employee schema

High-Level Sharding Architecture

Figure 7-5 shows the high-level architecture of a sharding solution. Queries come from an application and are received by a broker. The broker decides where to send the query, possibly with the help of a sharding database that keeps track of sharding information. The query is then sent to one or more shards of the application database and executed. The result sets from the executions are collected by the broker, possibly post-processed, and then sent back to the application.

Figure 7-5. High-level sharding architecture

Partitioning the Data

Writing each item of data to a particular server allows you to scale writes efficiently. But that's not sufficient for scalability: efficient data retrieval is also important, and to achieve that, it is necessary to keep associated data together. For this reason, the biggest challenge in efficient sharding is to have a good sharding index so that data commonly requested together is on the same shard. As you will see, a sharding index is defined over columns in multiple tables; typically you use only a single column from each table, but multiple columns are also possible. The sharding index will decide what tables will be sharded and how they will be sharded. After having picked a sharding index, you will end up with something similar to Figure 7-6.

Figure 7-6.
Schema with sharded and global tables

Here you can see several tables that have been sharded and where the rows are distributed over the shards (employees, salaries, and titles). Identical copies of the global tables (companies and departments) are present on each shard. We'll cover how to select columns for the sharding index and show how we came up with this particular solution for our sample schema.

To split the database over shards, you need to pick one or more of the columns for the sharding index and use them to distribute the rows over the shards. Using multiple sharding columns from a single table for the index can be hard to maintain unless they are used correctly. For this reason, it is usually best to pick a single column in a table and use that for the sharding index.

Sharding on a column that is a primary key offers significant advantages. The reason for this is that the column is guaranteed to have a unique index, so each value in the column uniquely identifies the row.

To illustrate the problem of picking a sharding column that does not contain unique values, suppose that you picked the country of an employee as the sharding key. In this case, all rows that belong to Sweden (for example) will go to one shard, and all that belong to China (for example) will go to another shard. This can be an appealing choice for sharding the database if reports or updates are often done on a per-country basis. But even though this might work as long as the size of each shard is relatively small, it will break down once the shards start to grow and need to be split further. Because all rows in a shard will have the same sharding key, it won't be possible to split the shards further once they have grown to the point where another split is needed.
In the end, the shard for Sweden can contain a maximum of 9 million entries, while the shard for China can contain a maximum of 1.3 billion entries, and these shards cannot be split further. This is quite an unfair distribution, and the server managing China has to perform more than 100 times better than the server managing Sweden to achieve the same performance.

If you instead pick the primary key of the table (in this case, the column with the employee number), you can group the employees any way you like and create partitions of arbitrary sizes. This will allow you to distribute the rows into shards of roughly the same size, hence distributing the workload evenly over the servers.

So how should you pick columns for the sharding index for the schema in Figure 7-4? Well, the first question is what tables need to be sharded. A good starting point for deciding that is to look at the number of rows in the tables as well as the dependencies between the tables. Table 7-1 shows the number of rows in each table in the employees schema. Now, the numbers are nothing like what you would see in a real database in need of sharding (the contents of this database easily fit onto a single server), but it serves as a good example to demonstrate how to construct a sharded database from an unsharded one.

A good candidate table for sharding is employees. Not only is it a big table, but several other tables are dependent on it, and as you will see later, if there are dependencies between tables, there are opportunities for sharding them as well.

The primary key of the employees table is the emp_no column. Because this is the primary key, sharding on this column will allow you to distribute the rows of the employees table on the shards evenly and split the tables as you need. So if we shard on the emp_no column in the employees table, how does that affect the tables that are dependent on employees?
Because there is a foreign key reference to the employees table, this suggests that the intention is to support joining the tables on that column. Take a look at this query, which could be a typical query for fetching the title and salary of an employee:

SELECT first_name, last_name, title
FROM titles JOIN employees USING (emp_no)
WHERE emp_no = employee number
  AND CURRENT_DATE BETWEEN from_date AND to_date

As previously mentioned, the goal is to make sure the query returns the same result in both the sharded and unsharded databases. Because employees is sharded on column emp_no, this query can never reference rows in titles and employees that are on different shards. So after sharding employees, all rows in titles that have an emp_no that is not in the employees shard you use will never be referenced. To fix this, titles should be sharded on column emp_no as well.

The same reasoning holds for all tables that have a foreign key reference into the employees table, so employees, titles, salaries, dept_emp, and dept_manager need to be sharded. In short, even though you picked a single column to start with, you will shard several tables, each on a column that is related to your original sharding of the employees table.

Now that we have sharded almost all the tables in the schema, only the departments table remains. Can this also be sharded? The table is so small that it is not necessary to shard it, but what would be the consequence if it was sharded? As noted before, it depends on the queries used to retrieve information from the database, but because dept_manager and dept_emp are used to connect departments and employees, it is a strong hint that the schema is designed to execute queries joining these tables.
For example, consider this query to get the name and department for an employee:

SELECT first_name, last_name, dept_name
FROM employees
  JOIN dept_emp USING (emp_no)
  JOIN departments USING (dept_no)
WHERE emp_no = employee number

This query puts more stress on your sharding choices than the previous SELECT, because it is not dealing with a single column shared by two tables (the primary key of the employees table) but with two columns that can range anywhere throughout the three tables involved. So how can you ensure that this query returns all the results from a sharded database as it would in the unsharded one? Because the employees table is sharded on the emp_no column, every row where dept_emp.emp_no = employees.emp_no and dept_emp.dept_no = departments.dept_no has to be in the same shard. If they are in different shards, no rows will match, and the query will return an empty result.

Because employees in the same department can reside on different shards, it is better not to shard the departments table, but instead to keep it available on all shards as a global table. Duplicating a table on multiple shards makes updating the tables a little more complicated (this will be covered later), but because the departments table is not expected to change frequently, this is likely to be a good trade-off.

Automatically Computing Possible Sharding Indexes

As this section has shown, even when you pick a single sharding key, you need to shard all tables that depend on the table that you picked for sharding. Sharding a single table on one of its columns can force you to shard other tables on the related column. For example, because the emp_no column in the employees table is related to the emp_no column in the salaries table, sharding the employees table will allow the salaries table to also be sharded on the same column.
For small schemas, it is easy to follow the foreign key relations, but if you have a schema with many tables and many relations, it is not as easy to find all the dependencies. If you are careful about using foreign keys to define all your dependencies, you can compute all the possible sharding indexes of dependent columns by using the information schema available in MySQL. The query to find all sets of dependent columns is:

USE information_schema;
SELECT GROUP_CONCAT(
    CONCAT_WS('.', table_schema, table_name, column_name)
  ) AS indexes
FROM key_column_usage
  JOIN table_constraints
    USING (table_schema, table_name, constraint_name)
WHERE constraint_type = 'FOREIGN KEY'
GROUP BY referenced_table_schema, referenced_table_name,
         referenced_column_name
ORDER BY table_schema, table_name, column_name;

If you run this query on the employees schema, the result is two possible sharding indexes:

Candidate #1          Candidate #2
salaries.emp_no       dept_manager.dept_no
dept_manager.emp_no   dept_emp.dept_no
dept_emp.emp_no
titles.emp_no

A single query can compute this because each foreign key has to reference a primary key in the target table. This means that there will be no further references that have to be followed, as would be the case if a foreign key could refer to another foreign key. By counting the number of rows in the tables, you can get an idea of which index would give the best sharding. You can also see that dept_manager and dept_emp are in both alternatives, so these are conflicting and you can use only one of them.

Usually, you have only one big set of tables that need to be sharded, as in the example schema in Figure 7-4. In other cases, however, you have several "sets" of tables that you want to shard independently. For example, assume that, in addition to the employees of the company, you want to keep track of all the publications of each department.
The relation between employees and publications is a many-to-many relationship, which in classic database schema design is created by a junction table with foreign keys pointing to the employees and publications tables. So tracking publications requires, in addition to the schema shown earlier, a publications table and a dept_pub table to be added to the schema, as in Figure 7-7.

Figure 7-7. Publication schema added to employees schema

If the publications table is so large that it needs to be sharded as well, you can do so. If you look carefully in Figure 7-7, you'll see that the departments table is still available on all nodes, and there are foreign key references from dept_pub to publications and departments. This means that you can shard the publications and dept_pub tables, leading to a system where you have multiple independent sharding indexes.

What are the consequences of multiple independent sharding indexes? A single query can contain references to tables in one of the sets in Table 7-2 plus global tables, but it must never reference tables from different sets at the same time. In other words, you can query employees together with their titles, or query employees together with their publications, but you must not write a query that asks for information on titles and publications. An example of a query that cannot be executed with this sharding in place is a query that joins a table in the "employee" part of the schema with the "publications" part of the schema:

SELECT first_name, last_name, dept_name, COUNT(pub_id)
FROM employees
  JOIN dept_manager USING (emp_no)
  JOIN departments USING (dept_no)
  JOIN dept_pub USING (dept_no)
  JOIN publications USING (pub_id)
WHERE emp_no = 110386;

Table 7-2.
Sharding index with columns

Index name   Sharding column set
si_emps      employees.emp_no, dept_emp.emp_no, salaries.emp_no, dept_manager.emp_no, titles.emp_no
si_pubs      publications.pub_no, dept_pub.pub_no

Shard Allocation

To work efficiently with shards, you need to store them in a way that speeds up physical access. The most straightforward approach is to keep one shard per server, but it is also possible to keep multiple virtual shards per server. To decide how shards should be allocated for your solution, ask the following questions:

Do your applications use cross-schema queries?
If each of your queries always uses a single schema (e.g., the employees schema), sharding becomes a lot easier. In that case, you can keep multiple shards on a server by using one schema per shard, and there is no need to rewrite the queries because they will always go to a single schema.

Can queries be tailored to the sharding solution?
If your queries are cross-schema but you can request the application developers to write queries with the sharding solution in mind, you can still keep multiple shards per server. This will allow you to rewrite queries in a controlled manner, which means that you can have, for example, the shard number as a suffix on the names of all databases.

Do you need to re-shard frequently?
If you cannot rewrite the queries easily, or if you cannot require the application programmer to write queries a specific way, you have to use a single shard per server because the queries can potentially be cross-schema. If, however, you are required to re-shard frequently (to reflect changes in the application or other reasons), a single shard per server can be a performance bottleneck, so there is always a trade-off between having to adapt the application and getting the performance you need. If you need to re-shard frequently, having multiple shards on each server can be part of a solution.
This allows you to move shards between servers to balance the load. However, you might still have to split shards if a single shard grows too hot.

How can you back up a shard?
Apart from being able to easily back up a single shard at a time, you also need to be able to easily create backups to move shards between servers. Most backup methods can create a backup of an entire server, or one or more schemas. For that reason, it is prudent to ensure that a schema is entirely in a shard (but there can be multiple schemas in each shard).

Single shard per server

The most straightforward approach is to keep a single shard on each server. This allows cross-schema queries, so it is not necessary to rewrite queries. There are two drawbacks to this approach: multiple tables may exceed the size of main memory on the server, which affects performance, and balancing the load between servers becomes more expensive in case you need to re-shard the tables.

As mentioned earlier, one of the goals of sharding a database is to reduce the size of the tables so that they fit into memory. Smaller tables take less time to search, both because they contain fewer rows and because more of each table can fit in memory. If the server becomes overloaded and it is necessary to reduce the load, this principle suggests the solution: split the shard and either create a new shard using a spare server, or move the now extraneous rows to another shard and merge them with the rows there. If the rows are moved to an existing shard, and there is just one shard per server, the rows have to be merged with the rows already on that shard. Because merging is very difficult to do as an online operation, splitting and remerging is expensive when only one shard is allowed per server. In the next section, we will consider how to avoid having to merge shards when moving them.
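The split step described above can be sketched as follows. This is an illustrative Python fragment, not code from the book: the row format and the function name `split_rows` are invented for the example, and a real split would stream rows between servers rather than sort them in memory.

```python
def split_rows(rows, key="emp_no"):
    """Split one shard's rows into two halves on the sharding key.
    The first half stays in place; the second half seeds a new shard
    on a spare server (or must be merged into an existing shard)."""
    rows = sorted(rows, key=lambda r: r[key])
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]

keep, move = split_rows(
    [{"emp_no": n} for n in (10003, 10001, 10002, 10005, 10004)])
```

The split itself is cheap; it is the merge of `move` into a shard that already holds rows which is hard to do online, which is why the next section keeps shards small enough to move whole.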
Multiple shards per server (virtual shards)

As we've explained, if you can keep multiple shards on a single machine, the data can be moved between machines in a more efficient manner because the data is already sharded. This offers some flexibility to move shards around to balance the load on the machines, but if you do that, you need to be able to distinguish between the shards that coexist on the same server. For example, you need to be able to distinguish table employees.dept_emp in shard 123 from employees.dept_emp in shard 234 even if they are on the same machine.

A common approach is to attach the shard identifier to the name of the schema. For example, the schema employees in shard 123 would then be named employees_123 and a partition of each table is placed in each schema (e.g., the dept_emp table consists of employees_1.dept_emp, employees_2.dept_emp, … employees_N.dept_emp).

Because the MySQL server stores each schema in its own directory, most backup methods can make backups of schemas but have problems backing up individual tables. (Many backup techniques can handle individual tables as well, but it is more complicated to manage backup and restore of individual tables. Using databases to structure the database makes the job of managing backups easier.) The approach just shown separates the tables for different shards into different directories, making it easy to take backups of shards (something that you will need later). Because you can limit replicate-do-db to specific schemas on a server, you can replicate changes to the individual shards as well, which will prove useful when you move shards between servers.

Keeping multiple shards on each server makes it comparably easy to move one of the shards to another server to reduce the load on the server. Because you can have multiple shards on each server, you can even move the shard to a server that already has other shards, without having to merge the rows of the shards.
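The schema-naming convention just described can be sketched as a pair of helper functions. This is illustrative Python, not code from the book; the function names are invented for the example.

```python
def shard_schema(schema, shard_id):
    """Per-shard schema name: ('employees', 123) -> 'employees_123'."""
    return f"{schema}_{shard_id}"

def shard_table(schema, table, shard_id):
    """Fully qualified per-shard table name, e.g. employees_123.dept_emp,
    so two copies of the same table can coexist on one server."""
    return f"{shard_schema(schema, shard_id)}.{table}"
```

Keeping the suffix logic in one place like this ensures that every component (query rewriter, backup scripts, replication filters) derives identical names.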
Note that this approach is not a replacement for re-sharding, because you need to have techniques in place to split a shard anyway.

In addition to adding the schema names with the shard identifier, you can add the shard identifier to the name of the table. So, with this approach, the names would be employees_123.dept_emp_123, employees_124.dept_emp_124, and so on. Although the shard number on the table seems redundant, it can be useful for catching problems where the application code mistakenly queries the wrong shard.

The drawback of adding the shard number to the schema names and/or the tables is that users need to rewrite their queries. If all your queries always go to a single schema, never executing cross-schema queries, it is easy to issue USE employee_identifier before sending the query to the server and keep the old table names. But if cross-schema queries are allowed, it is necessary to rewrite the query to locate all the schema names and append the shard identifier to each.

Inserting specific table numbers into queries can be quite error-prone, so if you can, generalize the query and automate the insertion of the right table number. For example, you can use braces to wrap the number in the schema name, and then use a regular expression to match and replace the schema and table name with the schema and table name for the shard in question. Example PHP code is shown in Example 7-1.

Example 7-1.
Replacing table references in queries

class my_mysqli extends mysqli {
    public $shard_id;

    private function do_replace($query) {
        return preg_replace(array('/\{(\w+)\.(\w+)\}/', '/\{(\w+)\}/'),
                            array("$1_{$this->shard_id}.$2", "$1"),
                            $query);
    }

    public function __construct($shard_id, $host, $user,
                                $pass, $db, $port) {
        parent::__construct($host, $user, $pass,
                            "{$db}_{$shard_id}", $port);
        $this->shard_id = $shard_id;
    }

    public function prepare($query) {
        return parent::prepare($this->do_replace($query));
    }

    public function query($query, $resultmode = MYSQLI_STORE_RESULT) {
        return parent::query($this->do_replace($query), $resultmode);
    }
}

The code creates a subclass of mysqli, overriding the prepare and query functions with specialized versions that rewrite the names of the databases. Then the original function is called, passing the correct database name to connect to. Because there are no changes to the mysqli interface, no changes are normally necessary in the application code. An example using the class is:

if ($result = $mysqli->query("SELECT * FROM {test.t1}")) {
    while ($row = $result->fetch_object())
        print_r($row);
    $result->close();
} else {
    echo "Error: " . $mysqli->error;
}

However, this works only if the application writers are willing (and able) to add this markup to the queries. It is also error-prone because application writers can forget to add the markup.

Mapping the Sharding Key

In the previous section, you saw how the choice of sharding column decides what tables need to be sharded. You also saw how to partition a table by range. In this section, partition functions will be discussed in more depth: you will see what sharding metadata is needed to compute the right shards as well as how to map the rows of a sharded table to actual shards.
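The same brace-replacement rewrite can be sketched in Python (an illustrative fragment, not from the book; the function name `rewrite_query` is invented for the example):

```python
import re

def rewrite_query(query, shard_id):
    """Rewrite {schema.table} references to point at the shard's schema.
    Mirrors the PHP do_replace above: a {schema.table} reference gets
    the shard id appended to the schema name, while a bare {table}
    reference just loses its braces, because the connection's default
    schema already carries the shard suffix."""
    query = re.sub(r'\{(\w+)\.(\w+)\}',
                   lambda m: f"{m.group(1)}_{shard_id}.{m.group(2)}",
                   query)
    query = re.sub(r'\{(\w+)\}', r'\1', query)
    return query
```

Note that the two-part pattern must be applied first, exactly as in the PHP array ordering; otherwise the single-word pattern would never see a qualified reference.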
As explained earlier in the chapter, the goal of mapping the sharding key is to create a partition function that accepts a sharding key value and outputs a shard identifier for the shard where the row exists. As also noted earlier, there can be several sharding keys, but in that case, we create a separate partition function for each sharding key. For the discussions in this section, we assume that each shard has a unique shard identifier, which is just an integer and can be used to identify each database or table as shown in the previous section.

You saw in "Partitioning the Data" on page 197 that each partition function is associated with several columns if there are foreign key relationships between the tables. So when you have a sharding key value you want to map (e.g., "20156"), it does not matter whether it was the employees.emp_no column or the dept_emp.emp_no column: both tables are sharded the same way. This means that when talking about mapping a sharding key value to a shard, the columns are implicitly given by the partition function and it is sufficient to provide the key.

Sharding Scheme

The partition function can be implemented using either a static sharding scheme or a dynamic sharding scheme (as the names suggest, the schemes just tell whether the sharding can change or is fixed):

Static sharding
In a static sharding scheme, the sharding key is mapped to a shard identifier using a fixed assignment that never changes. The computation of the shard identifier is usually done in the connector or in the application, which means that it can be done very efficiently. For example, you could use range-based assignment, such as making the first shard responsible for users 0 through 9,999, the second shard responsible for users 10,000 through 19,999, and so on. Or you could scatter users semirandomly through a hash based on the value of the last four digits of the identifier.
Dynamic sharding schemes
In a dynamic sharding scheme, the sharding key is looked up in a dictionary that indicates which shard contains the data. This scheme is more flexible than a static scheme, but requires a centralized store, called the sharding database in this chapter.

Static sharding schemes

As you might have realized, static sharding schemes run into problems when the distribution of the queries is not even. For example, if you distribute the rows to different shards based on country, you can expect the load on the China shard to be about 140 times that of the Sweden shard. Swedes would love this, because assuming that the servers have the same capacity, they will experience very short response times. Chinese visitors may suffer, however, because their shard has to take 140 times that load. The skewed distribution can also occur if the hash function does not offer a good distribution. For this reason, picking a good partition key and a good partition function is of paramount importance. An example partition function for a static schema appears in Example 7-2.

Example 7-2. Example PHP implementation of a dictionary for static sharding

class Dictionary {
    public $shards;   /* Our shards */

    public function __construct() {
        $this->shards = array(array('127.0.0.1', 3307),
                              array('127.0.0.1', 3308),
                              array('127.0.0.1', 3309),
                              array('127.0.0.1', 3310));
    }

    public function get_connection($key, $user, $pass, $db) {
        $no = $key % count($this->shards);
        list($host, $port) = $this->shards[$no];
        $link = new my_mysqli($no, $host, $user, $pass, $db, $port);
        $link->shard_id = $no;
        return $link;
    }
}

$DICT = new Dictionary();

We define a Dictionary class to be responsible for managing the connections to the sharded system. All logic for deciding what host to use is made inside this class. The get_connection method is a factory method that provides a new connection when given a sharding key.
Because each sharding key potentially can go to a different server, a new connection is established each time this function is called. The connection is created using the my_mysqli class that we defined in Example 7-1, which connects to the shard's schema based on the shard identifier. It is also possible to fetch a connection from a connection pool here, if you decide to implement one. However, for the sake of simplicity, no such pooling mechanism was implemented here.

The partition function that we use here computes a shard based on the modulo of the employee number (which is the sharding key).

In Example 7-2, you can see an example of how to create a dictionary for static sharding using PHP. The Dictionary class is used to manage connections to the sharded system and will return a connection to the correct shard given the sharding key. In this case, assume that the sharding key is the employee number, but the same technique can be generalized to handle any sharding key. In Example 7-3, you can see an example usage where a connection is fetched and a query executed on the shard.

Example 7-3. Example of using the dictionary

$mysql = $DICT->get_connection($key, 'mats', 'xyzzy', 'employees');
$stmt = $mysql->prepare(
    "SELECT first_name, last_name FROM {employees} WHERE emp_no = ?");
if ($stmt) {
    $stmt->bind_param("d", $key);
    $stmt->execute();
    $stmt->bind_result($first_name, $last_name);
    while ($stmt->fetch())
        print "$first_name $last_name\n";
    $stmt->close();
} else {
    echo "Error: " . $mysql->error;
}

Dynamic sharding schemes

Dynamic sharding schemes are distinguished from static ones by their flexibility. Not only do they allow you to change the location of the shards, but it is also easy to move data between shards if you have to. As always, the flexibility comes at the price of a more complex implementation, and potentially also impacts performance. Dynamic schemes require extra queries to find the correct shard to retrieve data from, which adds complexity as well as a performance cost.
A caching policy will allow information to be cached instead of sending a query each time, helping you reduce the performance impact. Ultimately, good performance requires a careful design that matches the patterns of user queries. Because the dynamic sharding scheme is the most flexible, we will concentrate on that for the rest of the chapter.

The simplest and most natural way to preserve the data you need for dynamic sharding is to store the sharding database as a set of tables in a MySQL database on a sharding server, which you query to retrieve the information. Example 7-4 shows a sample locations table containing information for each shard, and a partition_functions table containing one row for each partition function. Given a sharding identifier, you can figure out what service instance to contact by joining with the locations table. We'll look at the sharding types later.

Example 7-4. Tables used for dynamic sharding

CREATE TABLE locations (
    shard_id INT AUTO_INCREMENT,
    host VARCHAR(64),
    port INT UNSIGNED DEFAULT 3306,
    PRIMARY KEY (shard_id)
);

CREATE TABLE partition_functions (
    func_id INT AUTO_INCREMENT,
    sharding_type ENUM('RANGE','HASH','LIST'),
    PRIMARY KEY (func_id)
);

Now we'll change the static implementation of the Dictionary class from Example 7-2 to use the tables in Example 7-4. In Example 7-5, the class now fetches the shard information from a sharding database instead of looking it up statically. It uses the information returned to create a connection as before. As you can see, the query for fetching the shard information is not filled in. This is dependent on how the mapping is designed and is what we'll discuss next.

Example 7-5.
Implementation of dictionary for dynamic sharding

```php
$FETCH_SHARD = <<<'END_OF_QUERY'
   -- The query to fetch the shard information is deliberately not
   -- filled in here; it depends on the shard mapping, discussed next.
END_OF_QUERY;

class Dictionary {
  private $server;

  public function __construct($host, $user, $pass, $port = 3306) {
    $mysqli = new mysqli($host, $user, $pass, 'sharding', $port);
    $this->server = $mysqli;
  }

  public function get_connection($key, $user, $pass, $db, $tables) {
    global $FETCH_SHARD;
    if ($stmt = $this->server->prepare($FETCH_SHARD)) {
      $stmt->bind_param('i', $key);
      $stmt->execute();
      $stmt->bind_result($no, $host, $port);
      if ($stmt->fetch()) {
        $link = new my_mysqli($no, $host, $user, $pass, $db, $port);
        $link->shard_id = $no;
        return $link;
      }
    }
    return null;
  }
}
```

Shard Mapping Functions

Our sharding database in Example 7-4 showed three different sharding types in the partition_functions table. Each partition type, described in the online MySQL documentation, uses a different kind of mapping between the data in the sharded column and the shards themselves. Our table includes the three most interesting ones:

List mapping
    Rows are distributed over the shards based on a set of distinct values in the sharding column. For example, the list could be a list of countries.

Range mapping
    Rows are distributed over the shards based on where the sharding column falls within a range. This can be convenient when you shard on an ID column, dates, or other information that falls conveniently into ranges.

Hash mapping
    Rows are distributed over the shards based on a hash value of the sharding key value. This theoretically provides the most even distribution of data over shards.

Of these mappings, the list mapping is the easiest to implement, but it is the most difficult to use when you want to distribute the load efficiently. It can be useful when you shard for locality, because it can ensure that each shard is located close to its users. Range partitioning is also easy to implement and eliminates some of the problems with distributing the load, but it can still be difficult to distribute the load evenly over the shards.
Of the three, the hash mapping distributes the load best over the shards, but it is also the most complicated to implement in an efficient manner, as you will see in the following sections. The most important mappings are the range mapping and the hash mapping, so let's concentrate on those. For each shard mapping, we will consider both how to add a new shard and how to select the correct shard based on the sharding key chosen.

Range mapping

The most straightforward approach to range mapping is to separate the rows of a table into ranges based on the sharding column and to assign one shard to each range. Even though ranges are easy to implement, they have the problem of potentially becoming very fragmented. This solution also calls for a data type that supports ranges efficiently, which you are not always lucky enough to have. For example, if you are using URIs as keys, "hot" sites will be clustered together when you actually want the opposite: to spread them out. To get a good distribution in that case, you should use a hash mapping, which we cover in "Hash mapping and consistent hashing" on page 212.

Creating the index table. To implement a range mapping, create a table containing the ranges and map them to the shard identifiers:

```sql
CREATE TABLE ranges (
    shard_id INT,
    func_id INT,
    lower_bound INT,
    UNIQUE INDEX (lower_bound),
    FOREIGN KEY (shard_id) REFERENCES locations(shard_id),
    FOREIGN KEY (func_id) REFERENCES partition_functions(func_id)
);
```

Table 7-3 shows the typical types of information contained in such a table, which also includes the function identifier from the partition_functions table (you will see what the function identifier is used for momentarily). Only the lower bound is kept for each shard, because the upper bound is implicitly given by the lower bound of the next shard in the range.
Also, the shards do not have to be the same size, and having to maintain both an upper and a lower bound when splitting shards is an unnecessary complication. Table 7-3 shows the definition of the table.

Table 7-3. Range mapping table ranges

    Lower bound   Key ID   Shard ID
    0             0        1
    1000          0        2
    5500          0        4
    7000          0        3

Adding new shards. To add new shards when using range-based sharding, you insert a row in the ranges table as well as a row in the locations table. So, assuming that you want to add a shard shard-1.example.com with the range 1000−2000 for the partition function given by @func_id, you would first insert a row into the locations table, to get a new shard identifier, and then use the new shard identifier to add a row to the ranges table:

```sql
INSERT INTO locations(host) VALUES ('shard-1.example.com');
SET @shard_id = LAST_INSERT_ID();
INSERT INTO ranges VALUES (@shard_id, @func_id, 1000);
```

Note that the upper bound is implicit and given by the next row in the ranges table, so you do not need to provide it when adding a new shard.

Fetching the shard. After defining and populating this table, you can fetch the shard number, hostname, and port for the shard using the following query, to be used in Example 7-5:

```sql
SELECT shard_id, host, port
  FROM ranges JOIN locations USING (shard_id)
 WHERE func_id = 0 AND ? >= ranges.lower_bound
 ORDER BY ranges.lower_bound DESC LIMIT 1;
```

The query fetches all rows that have a lower bound at or below the key provided, orders them by lower bound, and takes the row with the highest lower bound. Note that the code in Example 7-5 prepares the query before executing it, so the question mark in the query will be replaced with the sharding key in use. Another option would be to store both the lower and upper bound, but that makes it more complicated to update the sharding database if the number of shards or the ranges for the shards change.
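The greatest-lower-bound lookup that the SQL query performs can also be mirrored client-side. The following is a minimal Python sketch, not part of the book's implementation: the bounds and shard IDs are the illustrative values from Table 7-3, the helper name shard_for_key is ours, and the lookup uses the standard bisect module.

```python
import bisect

# Hypothetical range mapping mirroring Table 7-3: parallel lists of
# sorted lower bounds and the shard responsible for each range.
lower_bounds = [0, 1000, 5500, 7000]
shard_ids = [1, 2, 4, 3]

def shard_for_key(key):
    # bisect_right finds the insertion point for the key, so the entry
    # just before it holds the greatest lower bound that is <= key.
    idx = bisect.bisect_right(lower_bounds, key) - 1
    if idx < 0:
        raise ValueError("key is below the lowest range bound")
    return shard_ids[idx]
```

For example, key 6000 falls in the range starting at 5500 and therefore maps to shard 4, matching the table.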
Hash mapping and consistent hashing

One of the issues you might run into when using a range mapping is that you do not get a good distribution of the "hot" clusters of data, which means that one shard can become overloaded and you have to split it repeatedly to cope with the increase in load. If you instead use a function that distributes the data points evenly over the range, the load will also be distributed evenly over the shards.

A hash function takes some input and computes a number from it called the hash. A good hash function distributes the input as evenly as possible, so that even a small change in the input string generates a very different output number. You saw one very common hash function in Example 7-2, where modulo arithmetic was used to get the number of the shard. The naïve hash function in common use computes a hash of the input in some manner (e.g., using MD5 or SHA-1, or even some simpler function) and then uses modulo arithmetic to get a number between 1 and the number of shards. This approach does not work well when you need to re-shard (for example, to add a new shard): taking the modulo of the hashed string with a new shard count can potentially move all the rows to different shards.

To avoid this problem, you can instead use consistent hashing, which is guaranteed to move rows from just one old shard to the new shard. To understand how this is possible, look at Figure 7-8. The entire hash range (the output of the hash function) is shown as a ring. On the hash ring, the shards are assigned to points on the ring using the hash function (we'll show you how to do this later). In a similar manner, the rows (here represented as the red dots) are distributed over the ring using the same hash function. Each shard is now responsible for the region of the ring that starts at the shard's point on the ring and continues to the next shard point.
Because a region may start at the end of the hash range and wrap around to its beginning, a ring is used here instead of a flat line. This wraparound cannot happen with the regular hash function shown earlier, where each shard has a fixed slot on a line and no slot wraps from the end of the range back to the beginning.

Figure 7-8. Hash ring used for consistent hashing

Now suppose that a new shard is added to the ring, say shard-5 in the figure. It will be assigned to a position on the ring. Here it happens to split shard-2, but it could have been any of the existing shards. Because it splits shard-2, only the circled rows from the old shard-2 will have to be moved to shard-5. In other words, the new shard takes over rows from just one existing shard, which keeps the cost of resharding low.

So, how do you implement this consistent hashing? Well, the first thing you need is a good hash function that will generate values on the hash ring. It must have a very big range, and hence a lot of "points" on the hash ring where rows can be assigned. A good set of hash functions with the needed properties comes from cryptography, which uses hash functions to create "signatures" of messages in order to detect tampering. These functions take an arbitrary string as input and produce a number as output. Cryptography requires a number of complex mathematical properties from its hash functions, but for our purpose, the two most important are that they produce a hash value containing a large number of bits and that they distribute the input strings evenly over the output range. Cryptographic hash functions have these properties, so they are a good choice for us. The most commonly used functions are MD5 and the SHA family of hash functions (i.e., SHA-1, SHA-256/224, and SHA-512/384). Table 7-4 shows the most common hash functions and the number of bits in the numbers they produce.
These functions are designed to be fast and accept any string as input, which makes them perfect for computing a hash of arbitrary values.

Table 7-4. Common cryptographic hash functions

    Hash function   Output size (bits)
    MD5             128
    SHA-1           160
    SHA-256         256
    SHA-512         512

Creating the index table. To define a hash mapping, define a table containing the hash values of the servers holding the shards (as usual, we store the location of the shards in a separate table, so only the shard identifier needs to be stored here):

```sql
CREATE TABLE hashes (
    shard_id INT,
    func_id INT,
    hash BINARY(32),
    UNIQUE INDEX (hash),
    FOREIGN KEY (shard_id) REFERENCES locations(shard_id),
    FOREIGN KEY (func_id) REFERENCES partition_functions(func_id)
);
```

An index is added to allow fast searching on the hash value. Table 7-5 shows typical contents of such a table.

Table 7-5. Hash mapping table hashes

    Shard ID   Key ID   Hash
    1          0        dfd59508d347f5e4ba41defcb973d9de
    2          0        2e7d453c8d2f9d2b75a421569f758da0
    3          0        468934ac4c69302a77cbe5e7fa7dcb13
    4          0        47a9ae8f8b8d5127fc6cc46b730f4f22

Adding new shards. To add a new shard, you need to insert an entry both in the locations table and in the hashes table. To compute the row for the hashes table, you build a string representing the server and compute the hash value of that string. The string could, for example, be the fully qualified domain name, but any representation will do; you might, for instance, need to add the port to the string if you want to distinguish servers. The hash values are stored in the hashes table, and assuming that the function identifier is in @func_id, the following statements will do the job:

```sql
INSERT INTO locations(host) VALUES ('shard-1.example.com');
SET @shard_id = LAST_INSERT_ID();
INSERT INTO hashes VALUES (@shard_id, @func_id, MD5('shard-1.example.com'));
```

Fetching the shard. You have now prepared the table containing information about the shards.
When you need to look up the location of a shard from the sharding key, you compute the hash value of the sharding key and locate the shard identifier with the largest hash value that is smaller than the hash value of the sharding key. If no hash value is smaller than the key's hash, pick the largest hash value:

```sql
(SELECT shard_id FROM hashes
  WHERE MD5(sharding key) > hash
  ORDER BY hash DESC)
UNION ALL
(SELECT shard_id FROM hashes
  WHERE hash = (SELECT MAX(hash) FROM hashes))
LIMIT 1;
```

The first SELECT picks all shards that have a hash value smaller than the hash of the sharding key; note that this result might be empty. The second SELECT provides a default value in case the first one does not match anything. Because you need only one row, and the union of the SELECT statements can potentially match multiple shards, just pick the first row: it will be either a shard from the first SELECT or, if that one matched no shards, the shard from the second.

Processing Queries and Dispatching Transactions

By now you have decided how to partition your data by selecting an appropriate sharding column, how to handle the sharding data (the data about the sharding setup, such as where the shards are located), and how to map your sharding keys to shards. The next steps are to work out:

• How to dispatch transactions to the right shard
• How to get the sharding key for the transaction
• How to use caching to improve performance

If you recall the high-level architecture in "High-Level Sharding Architecture" on page 196, it includes a broker that has the responsibility of dispatching queries to the right shards. This broker can either be implemented as an intermediate proxy or be part of the connector. To implement the broker as a proxy, you usually send all queries to a dedicated host that implements the MySQL protocol. The proxy extracts the sharding key from each query somehow and dispatches the query to the correct shard.
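Before looking at the broker in more detail, it may help to see the hash-ring lookup from the previous section as runnable code. The following Python sketch is an in-memory illustration only: the HashRing class and the shard names are hypothetical, MD5 comes from the standard hashlib module, and it is not the SQL-based dictionary described above.

```python
import bisect
import hashlib

def ring_hash(value):
    # A point on the hash ring: the MD5 digest read as a big integer.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, shards):
        # Hash each shard name onto the ring and keep the points sorted
        # so that lookups can use binary search.
        self._points = sorted((ring_hash(name), name) for name in shards)
        self._hashes = [h for h, _ in self._points]

    def shard_for_key(self, key):
        # Following the text: pick the shard with the largest hash
        # smaller than the key's hash; if there is none, wrap around
        # to the shard with the largest hash.
        idx = bisect.bisect_left(self._hashes, ring_hash(key)) - 1
        return self._points[idx][1]  # idx == -1 wraps to the last point
```

Adding a shard inserts just one more point on the ring, so only keys owned by a single existing shard move to the new one; every other key keeps its shard.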
The advantage of using a proxy as broker is that the connectors can be unaware that they are connecting to a proxy: it behaves just as if they connected to a server. This looks like a very transparent solution, but in reality, it is not. For simple applications, a proxy works well, but as you will see in "Handling Transactions" on page 216, using a proxy requires you to extend the protocol and/or limit what the application can do.

Handling Transactions

To dispatch transactions correctly through a broker, you need to know the parameters of the transactions it has to handle. From the application's side, each transaction consists of a sequence of queries or statements, where the last statement is a commit or an abort. To get an understanding of how transaction processing needs to be handled, take a look at the following transaction and consider what problems you need to solve for each line of it:

```sql
START TRANSACTION;
SELECT salary INTO @s FROM salaries WHERE emp_no = 20101;
SET @s = 1.1 * @s;
INSERT INTO salaries(emp_no, salary) VALUES (20101, @s);
COMMIT;
START TRANSACTION;
INSERT INTO …;
COMMIT;
```

At the start of a transaction, there is no way to know the sharding key of the tables or databases it will affect: the key cannot be deduced from the statement, because it is not present at all. The START TRANSACTION could be deferred until a real statement is seen, which then hopefully would contain the sharding key. However, a broker needs to know when a new transaction starts, because that may cause a switch to a different server, and it's important to know this for load balancing.

The first statement of this transaction makes it look like a read transaction. If it is a read transaction, it can be sent to a slave to balance the load. Here you can also find the sharding key, so at this point, you can figure out what shard the transaction should go to.
Setting a user-defined variable creates session-specific state. The user variable is scoped to the session rather than to the transaction, so all following transactions on the same connection can refer to (and hence depend on) the user-defined variable.

The INSERT makes it clear that this is a read-write transaction, so if you assumed after the first SELECT that the transaction was read-only and sent it to a slave, you will now start updating the slave instead of the master. If you can generate an error here, it is possible to abort the transaction to indicate that there was a user error, but in that case, you still have to be able to indicate that this is a read-write transaction and that it should go to the master despite the initial SELECT.

The COMMIT is guaranteed to end the transaction, but what do you do with the session state? In this example, a few user-defined variables were set: do they persist to the next transaction?

From this example, you can see that your proxy needs to handle several issues:

• To be able to send the transaction to the right shard, the sharding key has to be available to the broker when it sees the first statement in the transaction.
• You have to know whether the transaction is a read-only or read-write transaction before sending the first statement to a server.
• You need to be able to deduce that you are inside a transaction and that the next statement should go to the same connection.
• You need to be able to see whether the previous statement committed a transaction, so that you can switch to another connection.
• You need to decide how to handle session-specific state information such as user-defined variables, temporary tables, and session-specific settings of server variables.

It is theoretically possible to solve the first issue by holding back the START TRANSACTION and then extracting the sharding key from the first statement of the transaction by parsing the query.
This is, however, very error-prone, and it still requires the application writer to know that the sharding key must be clearly visible in the first statement. A better solution is for the application to provide the sharding key explicitly with the first statement of the transaction, either through special comments or by allowing the broker to accept the sharding key out of band (i.e., not as part of the query).

To solve the second issue, you can use the same technique just described and mark a transaction as read-write or read-only, either through a special comment in the query or by providing the broker with this information out of band. A transaction marked as read-only will then be sent to a slave and executed there.

For the first and second issues, you need to be able to detect when the user makes an error by either issuing update statements in a read-only transaction or sending a transaction to the wrong shard. Fortunately, MySQL 5.6 has added START TRANSACTION READ ONLY, so you can easily make sure that the application does not succeed in issuing an update statement. Detecting whether the statement is sent to the right shard can be more tricky. If you rewrite your queries as shown in Example 7-2, you will automatically get an error when you access the wrong shard, because the schema name will be wrong. If you do not rewrite the queries, you have to tailor some assert-like functionality to ensure that the query is executed on the correct shard.

To detect whether a transaction is in progress, the response packet of the MySQL Protocol contains two flags: SERVER_STATUS_IN_TRANS and SERVER_STATUS_AUTOCOMMIT. The first flag is true if a transaction has been started explicitly using START TRANSACTION, but will not be set when AUTOCOMMIT=0. The flag SERVER_STATUS_AUTOCOMMIT is set if autocommit is on, and is clear otherwise.
By combining these two flags, it is possible to see whether a statement is part of a transaction and whether the next statement should be sent to the same connection. There is currently no support in the MySQL connectors for checking these flags, so for now you have to track transaction-starting statements and the autocommit flag in the broker.

Handling the fourth issue (detecting whether a new transaction has started) would be easy if there were a server flag in the response packet that told you when a new transaction had started. Unfortunately, this is currently not available in the server, so you just have to monitor queries as they come in and recognize those that start a new transaction. Remember that some statements cause an implicit commit, so make sure to include them if you want a generic solution.

Dispatching Queries

As explained in "Handling Transactions" on page 216, handling transactions in a sharded environment is far from transparent, and applications have to take into account that they are working with a sharded database. For this reason, the goal of the sample implementation demonstrated in this chapter is not to make query dispatch transparent, but rather to make it easy to use for the application developer. Most of the discussion in this section applies both when using a proxy as a broker and when placing the broker close to the application, such as when you implement the broker in a PHP program on the application server. For the purposes of illustration, we'll assume a PHP implementation in this section.

Let's continue with the model introduced in Example 7-2 through Example 7-5, and implement a dynamic sharding scheme for range sharding. So what kind of information can you reasonably ask the application developer to provide? As you saw previously, it is necessary to provide the sharding key one way or the other.
A typical range mapping, such as the one shown in Table 7-3, allows you to fetch the shard identifier only if you also provide the identifier of the partition function used to shard the tables in the query. It is unreasonable to expect the application developer to know this function identifier, and relying on it would not be very robust either, because the partition functions might change. However, because each table in a query is sharded according to one partition function, it is possible to deduce the function identifier if all the tables accessed in the query are provided. It is then also possible to check that the query truly accesses only tables sharded by a single partition function, along with optional global tables.

To figure out the partition function from the tables, we need to add a table that maps tables to partition functions. Such a table is shown in Example 7-6, where each fully qualified table name is mapped to the partition function used for that table. If a table is not present here, it is a global table and exists on all shards.

Example 7-6. Table for tracking the partition function used for tables

```sql
CREATE TABLE columns (
    schema_name VARCHAR(64),
    table_name VARCHAR(64),
    func_id INT,
    PRIMARY KEY (schema_name, table_name),
    FOREIGN KEY (func_id) REFERENCES partition_functions(func_id)
);
```

Given a set of tables, we can compute both the partition function identifier and the shard identifier at the same time (as in Example 7-7).

Example 7-7.
Full PHP code to fetch shard information from a sharding database

```php
$FETCH_SHARD = <<<'END_OF_QUERY'
SELECT shard_id, host, port
  FROM ranges JOIN locations USING (shard_id)
 WHERE func_id = (SELECT DISTINCT func_id FROM columns
                  WHERE CONCAT(schema_name, '.', table_name) IN (%s))
   AND '%s' >= ranges.lower_bound
 ORDER BY ranges.lower_bound DESC LIMIT 1
END_OF_QUERY;

class Dictionary {
  private $server;

  public function __construct($host, $user, $pass, $port = 3306) {
    $mysqli = new mysqli($host, $user, $pass, 'sharding', $port);
    $this->server = $mysqli;
  }

  public function get_connection($key, $user, $pass, $db, $tables) {
    global $FETCH_SHARD;
    $quoted_tables = array_map(function($table) {
      return "'$table'";
    }, $tables);
    $fetch_query = sprintf($FETCH_SHARD,
                           implode(', ', $quoted_tables),
                           $this->server->escape_string($key));
    if ($res = $this->server->query($fetch_query)) {
      list($shard_id, $host, $port) = $res->fetch_row();
      $link = new my_mysqli($shard_id, $host, $user, $pass, $db, $port);
      return $link;
    }
    return null;
  }
}
```

This query fetches the shard identifier, the host, and the port of the shard from the tables accessed in a query and the sharding key. The subselect returns one row for each partition function used by the tables, so if the tables belong to more than one partition function, it will return more than one row. Because the subselect is not allowed to return more than one row, an error will then be raised and the entire query will fail with "subselect returned more than one row." The WHERE condition can match more than one row (i.e., all rows that have a lower bound smaller than the key); because only the row with the highest lower bound is needed, the result set is ordered in descending order (placing the highest lower bound first) and only the first row is picked using a LIMIT clause.

The constructor establishes a connection to the sharding database so that we can fetch information about the shards; for this we use a "plain" connector. The get_connection method constructs the list of tables to look up, inserts them into the statement, and then establishes a connection to the shard by passing the necessary information.
For the connection to the shard, we use the specialized connector that can handle schema name replacement in the queries.

Shard Management

To keep the system responsive as the load on it changes, or just for administrative reasons, you will sometimes have to move data around, either by moving entire shards to different nodes or by moving data between shards. Each of these two procedures presents its own challenges in rebalancing the load with a minimum of downtime (preferably no downtime at all). Automated solutions should be preferred.

Moving a Shard to a Different Node

The easiest solution is to move an entire shard to a different node. If you have followed our earlier advice and placed each shard in a separate schema, moving the schema is as easy as moving the directory. However, doing this while continuing to allow writes to the node is a different story.

Moving a shard from one node (the source node) to another node (the target node) without any downtime at all is not possible, but it is possible to keep the downtime to a minimum. The technique is similar to the description in Chapter 3 of creating a slave: make a backup of the shard, restore it on the target node, and use replication to re-execute any changes that happened in between. This is what you'll need to do:

1. Create a backup of the schemas on the source node that you want to move. Both online and offline backup methods can be used.

2. Each backup, as you saw in earlier chapters, backs up data to a particular point in the binary log. Write this log position down.

3. Bring the target node down by stopping the server.

4. While the server is down:

   a. Set the option replicate-do-db in the configuration file to replicate only the shard that you want to move:

      [mysqld]
      replicate-do-db=shard_123

   b. If you have to restore the backup from the source node while the server is down, do that at this time.

5. Bring the server up again.

6. Configure replication to start reading from the position that you noted in step 2 and start replication on the target server. This will read events from the source server and apply any changes to the shard that you are moving. Plan to have excess capacity on the target node so that you can temporarily handle an increase in the number of writes on it.

7. When the target node is sufficiently close to the source node, lock the shard's schema on the source node in order to stop changes. It is not necessary to stop changes to the shard on the target node, because no writes go there yet. The easiest way to handle this is to issue LOCK TABLES and lock all the tables in the shard, but other schemes are possible, including just removing the tables (e.g., if the application can handle a table that disappears, as outlined next, this is a possible alternative).

8. Check the log position on the source server. Because the shard is no longer being updated, this is the highest log position you need to restore.

9. Wait for the target server to catch up to this position, such as by using START SLAVE UNTIL and MASTER_POS_WAIT.

10. Turn off replication on the target server by issuing RESET SLAVE. This removes all replication information, including master.info, relay-log.info, and all relay logfiles. If you added any replication options to the my.cnf file to configure replication, you have to remove them, preferably in the next step.

11. Optionally bring the target server down, remove replicate-do-db from the my.cnf file for the target server, and bring the server up again. This step is not strictly necessary, because the replicate-do-db option is used only while moving shards and does not affect the functioning of the shard after it has been moved. When the time comes to move a shard here again, you have to change the option at that time anyway.

12. Update the shard information so that update requests are directed to the new location of the shard.

13. Unlock the schema to restart writes to the shard.

14. Drop the shard schema from the source server. Depending on how the shard is locked, there might still be readers of the shard at this point, so you have to take that into consideration.

Whew! That took quite a few steps. Fortunately, they can be automated using the MySQL Replicant library. The details of each individual step vary depending on how the application is implemented. Various backup techniques are covered in Chapter 15, so we won't list them here. Note that when designing a solution, you don't want to tie the procedure to any specific backup method, because it might later turn out that other ways of creating the backup are more suitable.

To implement the backup procedure just described, it is necessary to take the shard offline, which means preventing updates to it. You can do this either by locking the shard in the application or by locking tables in the schema. Implementing locking in the application requires coordination of all requests so that there are no conflicts, and because web applications are inherently distributed, lock management can become quite complicated very quickly.

In our case, we simplify the situation by locking a single table, the locations table, instead of spreading out the locks among the various tables accessed by many clients. Basically, all lookups for shard locations go through the locations table, so a single lock on this table ensures that no new updates to any shard will be started while we perform the move and remap the shards. It is still possible that there are updates in progress that either have started to update the shard or are just about to, so you should also lock the entire server using READ_ONLY. Any updates about to start will be locked out and given an error message.
Updates in progress will be allowed to finish (or might be killed after a timeout). When the lock on the shard is released, the shard will be gone, so the statements doing the update will fail and will have to be redone on the new shard. 222 | Chapter 7: Data Sharding Example 7-8 automates the procedures just described. You can also use the Replicant library to do it. Example 7-8. Procedure for moving a shard between nodes _UPDATE_SHARD_MAP = """ UPDATE locations SET host = %s, port = %d WHERE shard_id = %d """ def lock_shard(server, shard): server.use("common") server.sql("BEGIN") server.sql(("SELECT host, port, sock" " FROM locations" " WHERE shard_id = %d FOR UPDATE"), (shard,)) def unlock_shard(server): server.sql("COMMIT") def move_shard(common, shard, source, target, backup_method): backup_pos = backup_method.backup_to() config = target.fetch_config() config.set('replicate-do-db', shard) target.stop().replace_config(config).start() replicant.change_master(target, source, backup_pos) replicant.slave_start(target) # Wait until slave is at most 10 seconds behind master replicant.slave_status_wait_until(target, 'Seconds_Behind_Master', lambda x: x < 10) lock_shard(common, shard) pos = replicant.fetch_master_pos(source) replicant.slave_wait_for_pos(target, pos) source.sql("SET GLOBAL READ_ONLY = 1") kill_connections(source) common.sql(_UPDATE_SHARD_MAP, (target.host, target.port, target.socket, shard)) unlock_shard(common, shard) source.sql("DROP DATABASE shard_%s", (shard)) source.sql("SET GLOBAL READ_ONLY = 1") As described earlier, you have to keep in mind that even though the table is locked, some client sessions may be using the table because they have retrieved the node location but are not yet connected to it, or alternatively may have started updating the shard. The application code has to take this into account. The easiest solution is to have the application recompute the node if the query to the shard fails. 
Example 7-9 shows the changes that are necessary to fix Example 7-3 so that it re-executes the lookup if certain errors occur.

Example 7-9. Changes to application code to handle shard moving

    do {
        $error = 0;
        $mysql = $DICT->get_connection($key, 'mats', 'xyzzy', 'employees',
                                       array('employees.employees',
                                             'employees.dept_emp',
                                             'employees.departments'));
        if ($stmt = $mysql->prepare($QUERY)) {
            $stmt->bind_param("d", $key);
            if ($stmt->execute()) {
                $stmt->bind_result($first_name, $last_name, $dept_name);
                while ($stmt->fetch())
                    print "$first_name $last_name $dept_name @{$mysql->shard_id}\n";
            } else
                $error = $stmt->errno;
            $stmt->close();
        } else
            $error = $mysql->errno;

        /* Handle the error */
        switch ($error) {
        case 1290:
        case 1146:
        case 2006:
            continue;
        }
    } while (0);

Each error code handled in the switch corresponds to one way the move procedure can interfere with the query:

Error 1290
    Execution failed because the server was set in read-only mode. The application looked up the shard, but the move procedure started before the application had a chance to start executing the query.

Error 1146
    Execution failed because the schema disappeared. The connection looked up the shard location before it was moved, and tried to execute the query after it was moved. Recall from "Multiple shards per server (virtual shards)" on page 203 that the shard identifier is part of each schema name. This is how you can detect that the shard is gone; if you did not have a unique name for each schema, you would not be able to distinguish the shards.

Error 2006
    Execution failed because the connection was killed. The connection looked up the shard location before it was moved and started to execute the query, but the server decided that it took too long to execute.

Splitting Shards

When a host becomes too loaded, you can move one of the shards on the host to another server, but what do you do when the shard itself becomes too hot? The answer is: you split it.
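Which rows have to move when a shard is split depends entirely on the sharding function. As an illustration only (this is not the mapping used elsewhere in the chapter), a modulo-based scheme has a convenient property: when the shard count doubles, each key either stays on its old shard or moves to exactly one new shard, which is what makes the final cleanup a pair of straightforward DELETEs:

```python
# Sketch: under key % N sharding, splitting shard `s` while going from
# N to 2*N shards moves exactly the keys where key % (2*N) == s + N.
# The modulo scheme itself is an assumption made for illustration.

def shard_of(key, num_shards):
    return key % num_shards

def keys_that_move(keys, shard, old_num_shards):
    """Keys that leave `shard` when the shard count doubles."""
    new_num = 2 * old_num_shards
    return [k for k in keys
            if shard_of(k, old_num_shards) == shard
            and shard_of(k, new_num) == shard + old_num_shards]
```

For example, going from 4 to 8 shards, the keys on shard 1 split cleanly between shard 1 and shard 5, so the source deletes the keys destined for shard 5 and the destination deletes the rest.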
Splitting a shard into multiple smaller shards can be very expensive, but the downtime can be kept to a minimum if it is done carefully. Assume that you need to split a shard and move half of its contents to a new node. Here's a step-by-step explanation:

1. Take a backup of all the schemas in the shard. If you use an online backup method, such as MySQL Enterprise Backup (MEB), XtraBackup, or filesystem snapshots, the shard can be kept online while the backup is taken.

2. Write down the binary log position that this backup corresponds to.

3. Restore the backup from step 1 on the destination node.

4. Start replication from the source node to the destination node. If you want to avoid copying more changes than necessary, you can use binlog-do-db or replicate-do-db to replicate only the changes for the schemas that you moved. At this point, all requests still go to the original shard; the new shard is "dark" and not visible.

5. Wait until replication has caught up and the destination is close enough to the source. Then lock the source shard so that neither reads nor writes are possible.

6. Wait until the destination host is fully up to date with the source. During this step, all data in the shard will be unavailable.

7. Update the sharding database so that all requests for data in the new shard go to the new shard.

8. Unlock the source shard. At this point, all data is available again, but there is superfluous data on both the source and destination shards. This extra data is, however, not part of either shard as far as the sharding database is concerned, so queries sent to the servers will not access it.

9. Start two jobs in parallel that remove the superfluous rows on each shard using a normal DELETE. To avoid a large impact on performance, you can remove just a few rows at a time by adding a LIMIT clause.

Conclusion

This chapter presented techniques for increasing the throughput of your applications by scaling out, whereby we introduced more servers to handle more requests for data.
We presented ways to set up MySQL for scaling out using replication and gave practical examples of some of the concepts. In the next chapter, we will look at some more advanced replication concepts.

Joel felt pretty good. He had delivered his first company white paper to Mr. Summerson earlier in the day. He knew the response would come soon. While his boss was a bit on the "hyper alpha boss" high end, he could count on his work being reviewed promptly. A little while later, on his way to the break room, Joel met his boss in the hall.

"I liked your paper on scaling, Joel. You can get started on that right away; we've got some extra servers lying around downstairs."

"Right away," Joel said with a smile as his boss moved on to deliver another drive-by tasking.

CHAPTER 8
Replication Deep Dive

A knock on his door drew Joel's attention away from reading his email. He wasn't surprised to see Mr. Summerson standing in his doorway.

"Yes, sir?"

"I am getting a little concerned about all this replication stuff we've got now. I'd like you to do some research into what we need to do to improve our knowledge of how it all works. I want you to put together a document explaining not only the current configuration, but also troubleshooting ideas with specific details on what to do when things go wrong and what makes it tick."

Joel was expecting such a task. He, too, was starting to be concerned that he needed to know more about replication. "I'll get right on it, sir."

"Great. Take your time on this one. I want to get it right."

Joel nodded as his boss walked away. He sighed and gathered his favorite MySQL books together. He needed to do some reading on the finer points of replication.
Previous chapters introduced the basics of configuring and deploying replication to keep your site up and available, but to understand replication's potential pitfalls and how to use it effectively, you should know something about its operation and the kinds of information it uses to accomplish its tasks. This is the goal of this chapter. We will cover a lot of ground, including:

• How to promote slaves to masters more robustly
• Tips for avoiding corrupted databases after a crash
• Multisource replication
• Row-based replication
• Global transaction identifiers
• Multithreaded replication

Replication Architecture Basics

Chapter 4 discussed the binary log along with some of the tools that are available to investigate the events it records. But we didn't describe how events make it over to the slave and get re-executed there. Once you understand these details, you can exert more control over replication, prevent it from causing corruption after a crash, and investigate problems by examining the logs.

Figure 8-1 shows a schematic illustration of the internal replication architecture, consisting of the clients connected to the master, the master itself, and several slaves. For each client that connects to the master, the server runs a session that is responsible for executing all SQL statements and sending results back to the client. The events flow through the replication system from the master to the slaves in the following manner:

1. The session accepts a statement from the client, executes the statement, and synchronizes with other sessions to ensure each transaction is executed without conflicting with other changes made by other sessions.

2. Just before the statement finishes execution, an entry consisting of one or more events is written to the binary log. This process is covered in Chapter 3 and will not be described again in this chapter.

3.
After the events have been written to the binary log, a dump thread in the master takes over, reads the events from the binary log, and sends them over to the slave's I/O thread.

4. When the slave I/O thread receives the event, it writes it to the end of the relay log.

5. Once the event is in the relay log, the slave SQL thread reads it from the relay log and executes it to apply the changes to the database on the slave.

If the connection to the master is lost, the slave I/O thread will try to reconnect to the server in the same way that any MySQL client thread does. Some of the options that we'll see in this chapter deal with reconnection attempts.

Figure 8-1. Master and several slaves with internal architecture

The Structure of the Relay Log

As the previous section showed, the relay log is the information that ties the master and slave together—the heart of replication. It's important to be aware of how it is used and how the slave threads coordinate through it. Therefore, we'll go through the details here of how the relay log is structured and how the slave threads use it to handle replication.

As described in the previous section, the events sent from the master are stored in the relay log by the I/O thread. The relay log serves as a buffer so that the master does not have to wait for the slave execution to finish before sending the next event.

Figure 8-2 shows a schematic view of the relay log. It's similar in structure to the binlog on the master but has some extra files.

Figure 8-2. Structure of the relay log

In addition to content files and an index file like those of the binary log, the relay log maintains two files to keep track of replication progress: the relay log information file and the master log information file.
The names of these two files are controlled by two options in the my.cnf file:

relay-log-info-file=filename
    This option sets the name of the relay log information file. It is also available as the read-only server variable relay_log_info_file. Unless an absolute filename is given, the filename is relative to the data directory of the server. The default filename is relay-log.info.

master-info-file=filename
    This option sets the name of the master log information file. The default filename is master.info.

The information in the master.info file takes precedence over information in the my.cnf file. This means that if you change information in the my.cnf file and restart the server, the information will still be read from the master.info file instead of from the my.cnf file. For this reason, we recommend not putting any of the options that can be specified with the CHANGE MASTER TO command in the my.cnf file, but instead using the CHANGE MASTER TO command to configure replication. If, for some reason, you want to put any of the replication options in the my.cnf file and you want to make sure that the options are read from it when starting the slave, you have to issue RESET SLAVE before editing the my.cnf file.

Beware when executing RESET SLAVE! It will delete the master.info file, the relay-log.info file, and all the relay logfiles!

For convenience, we will use the default names of the information files in the discussion that follows.

The master.info file contains the master read position as well as all the information necessary to connect to the master and start replication. When the slave I/O thread starts up, it reads the information from this file, if it is available. Example 8-1 shows a short example of a master.info file. We've added a line number before each line and an annotation at the end of each line (the file itself cannot contain comments).
If the server is not compiled with SSL support, lines 9 through 15—which contain all the SSL options—will be missing. Example 8-1 shows what these options look like when SSL support is compiled in. The SSL fields are covered later in the chapter.

The password is written unencrypted in the master.info file. For that reason, it is critical to protect the file so it can be read only by the MySQL server. The standard way to ensure this is to define a dedicated user to run the server, assign all the files responsible for replication and database maintenance to this user, and remove all permissions from the files except read and write by this user.

Example 8-1. Contents of the master.info file (MySQL version 5.6.12)

     1  23                                     Number of lines in the file
     2  master-bin.000001                      Current binlog file being read (Master_Log_File)
     3  151                                    Last binlog position read (Read_Master_Log_Pos)
     4  localhost                              Master host connected to (Master_Host)
     5  root                                   Replication user (Master_User)
     6                                         Replication password
     7  13000                                  Master port used (Master_Port)
     8  60                                     Number of times slave will try to reconnect (Connect_Retry)
     9  0                                      1 if SSL is enabled, otherwise 0
    10                                         SSL Certification Authority (CA)
    11                                         SSL CA path
    12                                         SSL certificate
    13                                         SSL cipher
    14                                         SSL key
    15  0                                      SSL verify server certificate
    16  60.000                                 Heartbeat
    17                                         Bind address
    18  0                                      Ignore server IDs
    19  8c6d027e-cf38-11e2-84c7-0021cc6850ca   Master UUID
    20  10                                     Retry count
    21                                         SSL CRL
    22                                         SSL CRL path
    23  0                                      Auto position

If you have an old server, the format can be slightly different. In MySQL versions earlier than 4.1, the first line did not appear. Developers added a line count to the file in version 4.1.1 so they could extend the file with new fields and detect which fields are supported by just checking the line count. Version 5.1.16 introduced line 15, SSL Verify Server Certificate, and the lines after that were introduced in different versions of 5.6.
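The line-count convention on line 1 makes the file straightforward to parse defensively. A minimal sketch, assuming the 5.6-era layout shown in Example 8-1 (the dictionary keys are names of our own choosing, not anything the server defines):

```python
# Sketch: parse a master.info file using the layout of Example 8-1.
# Field positions vary between server versions, so a real parser should
# branch on the line count given on line 1; here we only name the first
# eight fields, which have been stable since 4.1.1.

FIELDS_56 = ["line_count", "master_log_file", "read_master_log_pos",
             "master_host", "master_user", "master_password",
             "master_port", "connect_retry"]

def parse_master_info(text):
    lines = text.splitlines()
    declared = int(lines[0])          # line 1 counts itself as well
    if declared != len(lines):
        raise ValueError("line count %d does not match %d lines"
                         % (declared, len(lines)))
    return dict(zip(FIELDS_56, lines[:len(FIELDS_56)]))
```

A tool built on this could, for instance, verify that the host and port in master.info match what the administrator expects before restarting a slave.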
The relay-log.info file tracks the progress of replication and is updated by the SQL thread. Example 8-2 shows a sample excerpt of a relay-log.info file. These positions correspond to the beginning of the next event to execute.

Example 8-2. Contents of the relay-log.info file

    ./slave-relay-bin.000003   Relay log file (Relay_Log_File)
    380                        Relay log position (Relay_Log_Pos)
    master1-bin.000001         Master log file (Relay_Master_Log_File)
    234                        Master log position (Exec_Master_Log_Pos)

If any of the files are not available, they will be created from the information in the my.cnf file and the options given to the CHANGE MASTER TO command when the slave is started.

It is not enough to just configure a slave using my.cnf and execute a CHANGE MASTER TO statement. The relay logfiles, the master.info file, and the relay-log.info file are not created until you issue START SLAVE.

The Replication Threads

As you saw earlier in the chapter, replication requires several specialized threads on both the master and the slave. The dump thread on the master handles the master's end of replication. Two slave threads—the I/O thread and the SQL thread—handle replication on the slave.

Master dump thread
    This thread is created on the master when a slave I/O thread connects. The dump thread is responsible for reading entries from the binlog on the master and sending them to the slave. There is one dump thread per connected slave.

Slave I/O thread
    This thread connects to the master to request a dump of all the changes that occur and writes them to the relay log for further processing by the SQL thread. There is one I/O thread on each slave. Once the connection is established, it is kept open so that any changes on the master are immediately received by the slave.

Slave SQL thread
    This thread reads changes from the relay log and applies them to the slave database.
The thread is responsible for coordinating with other MySQL threads to ensure changes do not interfere with the other activities going on in the MySQL server.

From the perspective of the master, the I/O thread is just another client thread and can execute both dump requests and SQL statements on the master. This means a client can connect to a server and pretend to be a slave to get the master to dump changes from the binary log. This is how the mysqlbinlog program (covered in detail in Chapter 4) operates.

The SQL thread acts as a session when working with the database. This means it maintains state information similar to that of a session, but with some differences. Because the SQL thread has to process changes from several different threads on the master—the events from all threads on the master are written in commit order to the binary log—the SQL thread keeps some extra information to distinguish events properly. For example, temporary tables are session-specific, so to keep temporary tables from different sessions separated, the session ID is added to the events. The SQL thread then refers to the session ID to keep actions for different sessions on the master separate. The details of how the SQL thread executes events are covered later in the chapter.

The I/O thread is significantly faster than the SQL thread because the I/O thread merely writes events to a log, whereas the SQL thread has to figure out how to execute changes against the databases. Therefore, during replication, several events are usually buffered in the relay log. If the master crashes, you have to handle these events before connecting to a new master. To avoid losing them, wait for the SQL thread to catch up before trying to reconnect the slave to another master. Later in the chapter, you will see several ways of detecting whether the relay log is empty or has events left to execute.
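One such check works directly from the replication coordinates reported by SHOW SLAVE STATUS: the relay log is fully applied when the SQL thread's execute position in the master's binlog has caught up with the I/O thread's read position. A sketch, assuming the status row has already been fetched into a dict keyed by column name:

```python
# Sketch: decide whether the SQL thread has applied everything the
# I/O thread has downloaded, using SHOW SLAVE STATUS fields.

def relay_log_drained(status):
    """True when Exec_Master_Log_Pos has caught up with Read_Master_Log_Pos
    in the same master binlog file."""
    same_file = status["Relay_Master_Log_File"] == status["Master_Log_File"]
    caught_up = (int(status["Exec_Master_Log_Pos"])
                 >= int(status["Read_Master_Log_Pos"]))
    return same_file and caught_up
```

A failover script would poll this until it returns True before pointing the slave at a new master.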
Starting and Stopping the Slave Threads

In Chapter 3, you saw how to start the slave using the START SLAVE command, but a lot of details were glossed over. We're now ready for a more thorough description of starting and stopping the slave threads.

When the server starts, it will also start the slave threads if there is a master.info file. As mentioned earlier in this chapter, the master.info file is created if the server was configured for replication and START SLAVE commands were issued on the slaves to start their I/O and SQL threads. So if the previous session had been used for replication, replication will be resumed from the last position stored in the master.info and relay-log.info files, with slightly different behavior for the two slave threads:

Slave I/O thread
    The slave I/O thread will resume reading from the last read position according to the master.info file. For writing the events, the I/O thread will rotate the relay logfile and start writing to a new file, updating the positions accordingly.

Slave SQL thread
    The slave SQL thread will resume reading from the relay log position given in relay-log.info.

You can start the slave threads explicitly using the START SLAVE command and stop them explicitly with the STOP SLAVE command. These commands control the slave threads and can be used to start and stop the I/O thread or the SQL thread separately:

START SLAVE and STOP SLAVE
    These will start or stop both the I/O thread and the SQL thread.

START SLAVE IO_THREAD and STOP SLAVE IO_THREAD
    These will start or stop only the I/O thread.

START SLAVE SQL_THREAD and STOP SLAVE SQL_THREAD
    These will start or stop only the SQL thread.

When you stop the slave threads, the current state of replication is saved to the master.info and relay-log.info files. This information is then picked up when the slave threads are started again.
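Controlling the two threads separately makes it possible to stop a slave without abandoning events already downloaded: stop the I/O thread first, let the SQL thread finish applying the relay log, then stop the SQL thread too. A sketch against a hypothetical server wrapper (the `sql()` and `relay_log_drained()` methods are assumptions standing in for your own connector code and a SHOW SLAVE STATUS check):

```python
import time

# Sketch: orderly slave shutdown. `server` is a hypothetical wrapper
# exposing sql() for executing statements and relay_log_drained() for
# checking, via SHOW SLAVE STATUS, that the relay log is fully applied.

def stop_slave_cleanly(server, poll_interval=1.0, timeout=60.0):
    server.sql("STOP SLAVE IO_THREAD")     # no new events will arrive
    deadline = time.time() + timeout
    while not server.relay_log_drained():  # SQL thread keeps applying
        if time.time() > deadline:
            raise TimeoutError("relay log not drained in time")
        time.sleep(poll_interval)
    server.sql("STOP SLAVE SQL_THREAD")    # nothing buffered is lost
```

This is the same ordering recommended earlier for switching a slave to a new master after a crash.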
If you specify a master host using the master-host option (either in the my.cnf file or passed as an option when starting mysqld), the slave will also start. Because the recommendation is not to use this option, but instead to use the MASTER_HOST option of the CHANGE MASTER TO command, the master-host option will not be covered here.

Running Replication over the Internet

There are many reasons to replicate between two geographically separated data centers. One reason is to ensure you can recover from a disaster such as an earthquake or a power outage. You can also locate a site strategically close to some of your users, as content delivery networks do, to offer them faster response times. Although organizations with enough resources can lease dedicated fiber, we will assume you use the open Internet to connect.

The events sent from the master to the slave should never be considered secure in any way; as a matter of fact, it is easy to decode them to see the information that is replicated. As long as you are behind a firewall and do not replicate over the Internet, this is probably secure enough, but as soon as you need to replicate to another data center in another town or on another continent, it is important to protect the information from prying eyes by encrypting it.

The standard method for encrypting data for transfer over the Internet is to use SSL. There are several options for protecting your data, all of which involve SSL in some way:

• Use the support that is built into the server to encrypt the replication from master to slave.
• Use Stunnel, a program that establishes an SSL tunnel (essentially a virtual private network) to a program that lacks SSL support.
• Use SSH in tunnel mode.
This last alternative does not appear to offer any significant advantages over using Stunnel, but it can be useful if you are not allowed to install any new programs on a machine and can enable SSH on your servers. In that case, you can use SSH to set up a tunnel. We will not cover this option further.

When using either the built-in SSL support or Stunnel for creating a secure connection, you need:

• A certificate from a certification authority (CA)
• A (public) certificate for the server
• A (private) key for the server

The details of generating, managing, and using SSL certificates are beyond the scope of this book, but for demonstration purposes, Example 8-3 shows how to generate a self-signed public certificate and an associated private key. This example assumes you use the configuration file for OpenSSL in /etc/ssl/openssl.cnf.

Example 8-3. Generating a self-signed public certificate with a private key

    $ sudo openssl req -new -x509 -days 365 -nodes \
    >     -config /etc/ssl/openssl.cnf \
    >     -out /etc/ssl/certs/master.pem -keyout /etc/ssl/private/master.key
    Generating a 1024 bit RSA private key
    .....++++++
    .++++++
    writing new private key to '/etc/ssl/private/master.key'
    -----
    You are about to be asked to enter information that will be incorporated
    into your certificate request.
    What you are about to enter is what is called a Distinguished Name or a DN.
    There are quite a few fields but you can leave some blank
    For some fields there will be a default value,
    If you enter '.', the field will be left blank.
    -----
    Country Name (2 letter code) [AU]:SE
    State or Province Name (full name) [Some-State]:Uppland
    Locality Name (eg, city) []:Storvreta
    Organization Name (eg, company) [Internet Widgits Pty Ltd]:Big Inc.
    Organizational Unit Name (eg, section) []:Database Management
    Common Name (eg, YOUR name) []:master-1.example.com
    Email Address []:mats@example.com

The certificate signing procedure puts a self-signed public certificate in /etc/ssl/certs/master.pem and the private key in /etc/ssl/private/master.key (the key is also used to sign the public certificate). On the slave, you have to create a server key and a server certificate in a similar manner. For the sake of discussion, we'll use /etc/ssl/certs/slave.pem as the name of the slave server's public certificate and /etc/ssl/private/slave.key as the name of the slave server's private key.

Setting Up Secure Replication Using Built-in Support

The simplest way to encrypt the connection between the master and slave is to use a server with SSL support. Methods for compiling a server with SSL support are beyond the scope of this book; if you are interested, consult the online reference manual.

To use the built-in SSL support, it is necessary to do the following:

• Configure the master by making the master keys available.
• Configure the slave to encrypt the replication channel.

To configure the master to use SSL support, add the following options to the my.cnf file:

    [mysqld]
    ssl-capath=/etc/ssl/certs
    ssl-cert=/etc/ssl/certs/master.pem
    ssl-key=/etc/ssl/private/master.key

The ssl-capath option contains the name of a directory that holds the certificates of trusted CAs, the ssl-cert option contains the name of the file that holds the server certificate, and the ssl-key option contains the name of the file that holds the private key for the server. As always, you need to restart the server after you have updated the my.cnf file.

The master is now configured to provide SSL support to any client, and because a slave uses the normal client protocol, it will allow a slave to use SSL as well.
To configure the slave to use an SSL connection, issue CHANGE MASTER TO with the MASTER_SSL option to turn on SSL for the connection, along with MASTER_SSL_CAPATH, MASTER_SSL_CERT, and MASTER_SSL_KEY, which function like the ssl-capath, ssl-cert, and ssl-key configuration options just mentioned but specify the slave's side of the connection to the master:

    slave> CHANGE MASTER TO
        ->    MASTER_HOST = 'master-1',
        ->    MASTER_USER = 'repl_user',
        ->    MASTER_PASSWORD = 'xyzzy',
        ->    MASTER_SSL = 1,
        ->    MASTER_SSL_CAPATH = '/etc/ssl/certs',
        ->    MASTER_SSL_CERT = '/etc/ssl/certs/slave.pem',
        ->    MASTER_SSL_KEY = '/etc/ssl/private/slave.key';
    Query OK, 0 rows affected (0.00 sec)

    slave> START SLAVE;
    Query OK, 0 rows affected (0.15 sec)

Now you have a slave running with a secure channel to the master.

Setting Up Secure Replication Using Stunnel

Stunnel is an easy-to-use SSL tunneling application that you can set up either as an SSL server or as an SSL client. Using Stunnel to set up a secure connection is almost as easy as setting up an SSL connection using the built-in support, but it requires some additional configuration. This approach can be useful if the server is not compiled with SSL support, or if for some reason you want to offload from the MySQL server the extra processing required to encrypt and decrypt data (which makes sense only if you have a multicore CPU).

As with the built-in support, you need to have a certificate from a CA as well as a public certificate and a private key for each server. These are then used for the stunnel command instead of for the server.

Figure 8-3 shows a master, a slave, and two Stunnel instances that communicate over an insecure network. One Stunnel instance on the slave server accepts data over a standard MySQL client connection from the slave server, encrypts it, and sends it over to the Stunnel instance on the master server.
The Stunnel instance on the master server, in turn, listens on a dedicated SSL port to receive the encrypted data, decrypts it, and sends it over a client connection to the non-SSL port on the master server.

Figure 8-3. Replication over an insecure channel using Stunnel

Example 8-4 shows a configuration file that sets up Stunnel to listen on port 3508 for an SSL connection, with the master server listening on the default MySQL port 3306. The example refers to the certificate and key files by the names we used earlier.

Example 8-4. Master server configuration file /etc/stunnel/master.conf

    cert=/etc/ssl/certs/master.pem
    key=/etc/ssl/private/master.key
    CApath=/etc/ssl/certs

    [mysqlrepl]
    accept = 3508
    connect = 3306

Example 8-5 shows the configuration file that sets up Stunnel on the client side. The example assigns port 3408 as the intermediate port—the non-SSL port that the slave will connect to locally—and Stunnel connects to the SSL port 3508 on the master server, as shown in Example 8-4.

Example 8-5. Slave server configuration file /etc/stunnel/slave.conf

    cert=/etc/ssl/certs/slave.pem
    key=/etc/ssl/private/slave.key
    CApath=/etc/ssl/certs

    [mysqlrepl]
    accept = 3408
    connect = master-1:3508

You can now start the Stunnel program on each server and configure the slave to connect to the Stunnel instance on the slave server. Because the Stunnel instance is on the same server as the slave, you should give localhost as the master host to connect to, along with the port on which the Stunnel instance accepts connections (3408). Stunnel will then take care of tunneling the connection over to the master server:

    slave> CHANGE MASTER TO
        ->    MASTER_HOST = 'localhost',
        ->    MASTER_PORT = 3408,
        ->    MASTER_USER = 'repl_user',
        ->    MASTER_PASSWORD = 'xyzzy';
    Query OK, 0 rows affected (0.00 sec)

    slave> START SLAVE;
    Query OK, 0 rows affected (0.15 sec)

You now have a secure connection set up over an insecure network.
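If you manage many slaves, generating the paired configuration files from one place helps keep the ports on the two sides consistent. A minimal sketch that renders configs in the format of Examples 8-4 and 8-5 (the paths and port numbers are the example values from this section):

```python
# Sketch: render matching Stunnel configuration files for the two ends
# of the tunnel shown in Examples 8-4 and 8-5.

def stunnel_conf(cert, key, accept, connect, capath="/etc/ssl/certs"):
    return ("cert=%s\nkey=%s\nCApath=%s\n\n"
            "[mysqlrepl]\naccept = %s\nconnect = %s\n"
            % (cert, key, capath, accept, connect))

# Master side: accept SSL on 3508, forward to the local server on 3306.
master_conf = stunnel_conf("/etc/ssl/certs/master.pem",
                           "/etc/ssl/private/master.key",
                           3508, 3306)

# Slave side: accept plain connections on 3408, tunnel to master-1:3508.
slave_conf = stunnel_conf("/etc/ssl/certs/slave.pem",
                          "/etc/ssl/private/slave.key",
                          3408, "master-1:3508")
```

Writing these strings to /etc/stunnel/master.conf and /etc/stunnel/slave.conf reproduces the examples above; the one invariant to preserve is that the slave config's connect target names the master config's accept port.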
If you are using a Debian-based Linux distribution (e.g., Debian or Ubuntu), you can start one Stunnel instance for each configuration file in the /etc/stunnel directory by setting ENABLED=1 in /etc/default/stunnel4. So if you create the Stunnel configuration files as given in this section, one slave Stunnel instance and one master Stunnel instance will be started automatically whenever you start the machine.

Finer-Grained Control Over Replication

With an understanding of replication internals and the information replication uses, you can control it more expertly and learn how to avoid some problems that can occur. We'll give you some useful background in this section.

Information About Replication Status

You can find most of the information about replication status on the slave, but some information is available on the master as well. Most of the information on the master relates to the binlog (covered in Chapter 4), but information relating to the connected slaves is also available.

The SHOW SLAVE HOSTS command shows information only about slaves that use the report-host option, which the slave uses to give the master information about the server that is connected. The master cannot trust the connection information it sees for connected slaves because, for example, there may be routers with NAT between the master and the slave. In addition to the hostname, there are some other options that you can use to provide information about the connecting slave:

report-host
    The name of the connecting slave. This is typically the domain name of the slave, or some other similar identifier, but it can in reality be any string. In Example 8-6, we use the name "Magic Slave."

report-port
    The port on which the slave listens for connections. The default is 3306.

report-user
    The user for connecting to the master. The value given does not have to match the value used in CHANGE MASTER TO.
This option is shown only when the show-slave-auth-info option is given to the server.

report-password
    The password used when connecting to the master. The password given does not have to match the password given to CHANGE MASTER TO.

show-slave-auth-info
    If this option is enabled, the master will show the additional information about the reported user and password in the output from SHOW SLAVE HOSTS.

Example 8-6 shows sample output from SHOW SLAVE HOSTS where three slaves are connected to the master.

Example 8-6. Sample output from SHOW SLAVE HOSTS

    master> SHOW SLAVE HOSTS;
    +-----------+-------------+------+-------------------+-----------+
    | Server_id | Host        | Port | Rpl_recovery_rank | Master_id |
    +-----------+-------------+------+-------------------+-----------+
    |         2 | slave-1     | 3306 |                 0 |         1 |
    |         3 | slave-2     | 3306 |                 0 |         1 |
    |         4 | Magic Slave | 3306 |                 0 |         1 |
    +-----------+-------------+------+-------------------+-----------+
    3 rows in set (0.00 sec)

The output shows the slaves that are connected to the master and some information about them. Notice that this display also shows slaves that are indirectly connected to the master via relays. Two additional fields are shown when show-slave-auth-info is enabled (we do not show them here). The following fields are purely informational and do not necessarily show the real slave host or port, nor the user and password used when configuring the slave in CHANGE MASTER TO:

Server_id
    The server ID of the connected slave.

Host
    The name of the host, as given by report-host.

User
    The username reported by the slave using report-user.

Password
    The password reported by the slave using report-password.

Port
    The port reported by the slave using report-port.

Master_id
    The server ID that the slave is replicating from.

Rpl_recovery_rank
    This field has never been used and is removed in MySQL version 5.5.
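If you collect this output from shell scripts rather than through a client API, the ASCII table the mysql client prints is easy to turn into dictionaries. A small sketch, written against the table format shown in Example 8-6:

```python
# Sketch: parse the ASCII-table output of the mysql command-line client
# (as in Example 8-6) into a list of row dicts keyed by column name.

def parse_mysql_table(text):
    def cells(line):
        return [c.strip() for c in line.strip().strip("|").split("|")]
    rows = [line for line in text.splitlines()
            if line.lstrip().startswith("|")]   # skip +---+ border lines
    header = cells(rows[0])
    return [dict(zip(header, cells(r))) for r in rows[1:]]
```

Note that this simple splitter assumes no cell contains a literal "|" character; values with spaces, such as "Magic Slave", are handled fine.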
The information about indirectly connected slaves cannot be entirely trusted, because it can become inaccurate in certain situations while slaves are being added. For this reason, there is an effort underway to remove this information and show only directly connected slaves, whose information can be trusted.

You can use the SHOW MASTER LOGS command to see which binary log files the master is keeping track of. Typical output from this command is shown in Example 8-7.

The SHOW MASTER STATUS command (shown in Example 8-8) shows where the next event will be written in the binary log. Because the master writes to only one binlog file at a time, the output always contains a single row, and that row matches the last line of the output of SHOW MASTER LOGS, only with different headers. This means that if you need to execute SHOW MASTER LOGS to implement some feature, you do not need to execute SHOW MASTER STATUS as well, but can instead use the last line of SHOW MASTER LOGS.

Example 8-7. Typical output from SHOW MASTER LOGS

master> SHOW MASTER LOGS;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000011 |    469768 |
| master-bin.000012 |   1254768 |
| master-bin.000013 |    474768 |
| master-bin.000014 |      4768 |
+-------------------+-----------+
4 rows in set (0.00 sec)

Example 8-8. Typical output from SHOW MASTER STATUS

master> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000014 |     4768 |              |                  |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

To determine the status of the slave threads, use the SHOW SLAVE STATUS command.
The output of SHOW SLAVE STATUS contains almost everything you need to know about the replication status. Let's go through it in more detail. Typical output is given in Example 8-9.

Example 8-9. Sample output from SHOW SLAVE STATUS

               Slave_IO_State: Waiting for master to send event
                  Master_Host: master1.example.com
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 1
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 192
               Relay_Log_File: slave-relay-bin.000006
                Relay_Log_Pos: 252
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 192
              Relay_Log_Space: 553
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:

The state of the I/O and SQL threads

The two fields Slave_IO_Running and Slave_SQL_Running indicate whether the slave I/O thread or the SQL thread, respectively, is running. If the slave threads are not running, it could be either because they have been stopped or because of an error in the replication. If the I/O thread is not running, the fields Last_IO_Errno and Last_IO_Error will show the reason it stopped. Similarly, Last_SQL_Errno and Last_SQL_Error will show the reason the SQL thread stopped. If either of the threads stopped without error (for example, because it was explicitly stopped or reached the until condition), there will be no error message and the errno field will be 0, as in the output in Example 8-9. The fields Last_Errno and Last_Error are synonyms for Last_SQL_Errno and Last_SQL_Error, respectively.
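The logic for interpreting these fields can be sketched in a few lines of Python (an illustration, not part of the book's replicant library), taking the SHOW SLAVE STATUS fields as a dict:

```python
def slave_thread_errors(status):
    """Return a list of (thread, errno, message) for stopped slave threads.
    status: dict of SHOW SLAVE STATUS fields, named as in Example 8-9.
    An errno of 0 means the thread stopped without error."""
    errors = []
    if status["Slave_IO_Running"] != "Yes":
        errors.append(("I/O", status["Last_IO_Errno"], status["Last_IO_Error"]))
    if status["Slave_SQL_Running"] != "Yes":
        errors.append(("SQL", status["Last_SQL_Errno"], status["Last_SQL_Error"]))
    return errors

# A slave whose SQL thread stopped on a duplicate-key error (the message
# text here is a made-up example):
status = {
    "Slave_IO_Running": "Yes", "Last_IO_Errno": 0, "Last_IO_Error": "",
    "Slave_SQL_Running": "No", "Last_SQL_Errno": 1062,
    "Last_SQL_Error": "Error 'Duplicate entry ...' on query",
}
print(slave_thread_errors(status))
```

A healthy slave, or one stopped deliberately, yields an empty list; anything else pinpoints which thread failed and why.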
The Slave_IO_State field shows a description of what the I/O thread is currently doing. Figure 8-4 shows a state diagram of how the message changes depending on the state of the I/O thread.

Figure 8-4. Slave I/O thread states

The messages have the following meanings:

Waiting for master update
    This message is shown briefly when the I/O thread is initialized, before it tries to establish a connection with the master.

Connecting to master
    This message is shown while the slave is trying to establish a connection with the master, but has not yet made the connection.

Checking master version
    This message is shown when the slave has managed to connect to the master and is performing a handshake with the master.

Registering slave on master
    This message is shown while the slave is trying to register itself with the master. When registering, it sends the value of the report-host option described earlier to the master. This usually contains the hostname or the IP address of the slave, but can contain any string. The master cannot depend simply on checking the IP address of the TCP connection, because there might be routers performing network address translation (NAT) between the master and slave.

Requesting binlog dump
    This message is shown when the slave starts to request a binlog dump by sending the binlog file, binlog position, and server ID to the master.

Waiting for master to send event
    This message is printed when the slave has established a connection with the master and is waiting for the master to send an event.

Queueing master event to the relay log
    This message is shown when the master has sent an event and the slave I/O thread is about to write it to the relay log. This message is displayed regardless of whether the event is actually written to the relay log or skipped because of the rules outlined in "Filtering Replication Events" on page 174.
Note the spelling in the previous message ("Queueing" instead of "Queuing"). When checking for messages using scripts or other tools, it is very important to check what the message really says, not just what you think it should say.

Waiting to reconnect after action
    This message is shown when a previous action failed with a transient error and the slave will try to reconnect. Possible values for action are:

    registration on master
        When attempting to register with the master

    binlog dump request
        When requesting a binlog dump from the master

    master event read
        When waiting for or reading an event from the master

Reconnecting after failed action
    This message is shown when the slave is trying to reconnect to the master after a failed action but has not yet managed to establish a connection. The possible values for action are the same as for the "Waiting to reconnect after action" message.

Waiting for slave mutex on exit
    This message is shown while the I/O thread is shutting down.

Waiting for the slave SQL thread to free enough relay log space
    This message is shown if the relay log space limit (as set by the relay-log-space-limit option) has been reached and the SQL thread needs to process some of the relay log before new events can be written.

The binary log and relay log positions

As replication processes events on the slave, it maintains three positions in parallel. These positions are shown in the output from SHOW SLAVE STATUS in Example 8-9, as the following pairs of fields:

Master_Log_File, Read_Master_Log_Pos
    The master read position: the position in the master's binary log of the next event to be read by the I/O thread. The values of these fields are taken from lines 2 and 3 of master.info, as shown in Example 8-1.

Relay_Master_Log_File, Exec_Master_Log_Pos
    The master execute position: the position in the master's binlog of the next event to be executed by the SQL thread.
The values of these fields are taken from lines 3 and 4 of relay-log.info, as shown in Example 8-2.

Relay_Log_File, Relay_Log_Pos
    The relay log execute position: the position in the slave's relay log of the next event to be executed by the SQL thread. The values of these fields are taken from lines 1 and 2 of relay-log.info, as shown in Example 8-2.

You can use these positions to gain information about replication progress or to optimize some of the algorithms developed in Chapter 5.

For example, by comparing the master read position and the master execute position, you can determine whether there are any events waiting to be executed. This is particularly interesting if the I/O thread has stopped, because it offers an easy way to wait for the relay log to become empty: once the positions are equal, there is nothing waiting in the relay log, and the slave can be safely stopped and redirected to another master. Example 8-10 shows sample code that waits for an empty relay log on a slave. MySQL provides the convenient MASTER_POS_WAIT function to wait until a slave's relay log has processed all waiting events. If the slave SQL thread is not running, MASTER_POS_WAIT returns NULL, which is caught and turned into an exception.

Example 8-10. Python script to wait for an empty relay log

from mysql.replicant.errors import Error

class SlaveNotRunning(Error):
    pass

def slave_wait_for_empty_relay_log(server):
    result = server.sql("SHOW SLAVE STATUS")
    log_file = result["Master_Log_File"]
    log_pos = result["Read_Master_Log_Pos"]
    running = server.sql(
        "SELECT MASTER_POS_WAIT(%s,%s)",
        (log_file, log_pos))
    if running is None:
        raise SlaveNotRunning

Using these positions, you can also optimize the scenarios in Chapter 5.
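The position comparison underlying this technique can be sketched independently of a server connection (field names as in Example 8-9):

```python
def relay_log_is_empty(status):
    """True when the SQL thread has executed everything the I/O thread has
    read, i.e., the master read and master execute positions agree.
    status: dict of SHOW SLAVE STATUS fields."""
    read_pos = (status["Master_Log_File"], status["Read_Master_Log_Pos"])
    exec_pos = (status["Relay_Master_Log_File"], status["Exec_Master_Log_Pos"])
    return read_pos == exec_pos

# With the values from Example 8-9, both positions are master-bin.000001:192,
# so nothing is waiting in the relay log:
status = {
    "Master_Log_File": "master-bin.000001", "Read_Master_Log_Pos": 192,
    "Relay_Master_Log_File": "master-bin.000001", "Exec_Master_Log_Pos": 192,
}
assert relay_log_is_empty(status)
```

Note that both the file name and the position must be compared: the same byte offset in two different binlog files means events are still pending.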
For instance, after running Example 8-21, which promotes a slave to master, you will probably have to process a number of events in each of the other slaves' relay logs before switching them to the new master. In addition, ensuring that the promoted slave has executed all its events before allowing any slaves to connect minimizes the amount of data lost. By modifying the function order_slaves_on_position from Example 5-5 to create Example 8-11, you can make the former slaves execute all the events in their relay logs before performing the switch. The code uses the slave_wait_for_empty_relay_log function from Example 8-10 to wait for the relay log to become empty before reading the slave position.

Example 8-11. Minimizing the number of lost events when promoting a slave

from mysql.replicant.commands import (
    fetch_gtid_executed,
    fetch_slave_position,
    slave_wait_for_empty_relay_log,
)

def order_slaves_on_position(slaves):
    entries = []
    for slave in slaves:
        slave_wait_for_empty_relay_log(slave)
        pos = fetch_slave_position(slave)
        gtid = fetch_gtid_executed(slave)
        entries.append((pos, gtid, slave))
    entries.sort(key=lambda x: x[0])
    return [entry[1:] for entry in entries]

In addition to the technique demonstrated here, another technique mentioned in some of the literature is to check the status of the SQL thread in the SHOW PROCESSLIST output. If the State field is "Has read all relay log; waiting for the slave I/O thread to update it," the SQL thread has read the entire relay log. This State message is generated only by the SQL thread, so you can safely search for it in all threads.

Options for Handling Broken Connections

The I/O thread is responsible for maintaining the connection with the master and, as you have seen in Figure 8-4, includes quite a complicated bit of logic to do so. If the I/O thread loses the connection with the master, it will attempt to reconnect to the master a limited number of times.
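To get a feel for what "a limited number of times" means with the default settings of the options described next, here is a back-of-the-envelope calculation (plain arithmetic, not server code) using the default values quoted in the text:

```python
# Defaults quoted in the text (seconds, except the retry count):
slave_net_timeout = 3600      # inactivity before the slave suspects a dead link
master_connect_retry = 60     # pause between reconnection attempts
master_retry_count = 86400    # attempts before the I/O thread gives up

# Worst case from a silent link failure to the I/O thread giving up:
worst_case = slave_net_timeout + master_connect_retry * master_retry_count
print(worst_case)           # 5187600 seconds
print(worst_case // 86400)  # 60 -- roughly two months of retrying
```

In other words, with the defaults a slave can keep retrying for about 60 days before giving up, which is rarely the failover behavior you actually want.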
The period of inactivity after which the I/O thread reacts, the retry period, and the number of retries attempted are controlled by three options:

--slave-net-timeout
    The number of seconds of inactivity accepted before the slave decides that the connection with the master is lost and tries to reconnect. This does not apply to situations in which a broken connection can be detected explicitly; in those cases, the slave reacts immediately, moves the I/O thread into the reconnection phase, and attempts a reconnect (possibly waiting according to the value of master-connect-retry, and only if the number of retries done so far does not exceed master-retry-count). The default is 3,600 seconds.

--master-connect-retry
    The number of seconds between retries. You can specify this value as the CONNECT_RETRY parameter to the CHANGE MASTER TO command; use of the option in my.cnf is deprecated. The default is 60 seconds.

--master-retry-count
    The number of retries before finally giving up. The default is 86,400.

These defaults are probably not what you want, so you're better off supplying your own values.

How the Slave Processes Events

Central to replication are the log events: they are the information carriers of the replication system and contain all the metadata necessary to ensure that the changes made on the master can be executed to produce a replica of the master. Because the binary log on the master is in commit order for all the transactions executed on the master, each transaction can be executed in the same order in which it appears in the binary log to produce the same result on the slave as on the master. The slave SQL thread executes events from all the sessions on the master in sequence.
This has some consequences for how the slave executes the events:

The slave is single-threaded, whereas the master is multithreaded
    The log events are executed in a single thread on the slave, but in multiple threads on the master. This can make it difficult for the slave to keep up with the master if the master commits a lot of transactions.

Some statements are session-specific
    Some statements on the master are session-specific and will produce different results when executed from the single session on the slave:

    • Every user variable is session-specific.
    • Temporary tables are session-specific.
    • Some functions are session-specific (e.g., CONNECTION_ID).

The binary log decides execution order
    Even though two transactions in the binary log appear to be independent (and in theory could be executed in parallel), they may in reality not be independent. This means the slave is forced to execute the transactions in sequence to guarantee that the master and the slave stay consistent.

Housekeeping in the I/O Thread

Although the SQL thread does most of the event processing, the I/O thread does some housekeeping before the events even come into the SQL thread's view, so we'll look at I/O thread processing before discussing the "real execution" in the SQL thread. To keep up processing speed, the I/O thread inspects only certain bytes to determine the type of each event, then takes the necessary action on the relay log:

Stop events
    These events indicate that a slave further up in the chain has been stopped in an orderly manner. This event is ignored by the I/O thread and is not even written to the relay log.

Rotate events
    If the master's binary log is rotated, so is the relay log. The relay log might be rotated more often than the master's binary log, but it is rotated at least each time the master's binary log is rotated.

Format description events
    These events are saved to be written when the relay log is rotated.
Recall that the format between two consecutive binlog files might change, so the I/O thread needs to remember this event to process the files correctly.

If replication is set up to replicate in a circle, or through a dual-master setup (which is just circular replication with only two servers), events are forwarded around the circle until they arrive back at the server that originally sent them. To keep events from circulating indefinitely, it is necessary to remove events that have already been executed. To implement this check, each server determines whether an event carries the server's own server ID. If it does, the event was sent from this server previously and replication has come full circle. To prevent the event from circulating (and hence being applied) infinitely, it is not written to the relay log, but simply ignored.

You can turn this behavior off using the replicate-same-server-id option. If you set this option, the server will not check for an identical server ID, and the event will be written to the relay log regardless of which server ID it carries.

SQL Thread Processing

The slave SQL thread reads the relay log and re-executes the master's database statements on the slave. Some of these events require special information that is not part of the SQL statement itself. The special handling includes:

Passing master context to the slave server
    Sometimes state information needs to be passed to the slave for a statement to execute correctly. As mentioned in Chapter 4, the master writes one or more context events to pass this extra information. Some of this information is thread-specific, but different from the information in the next item.

Handling events from different threads
    The master executes transactions from several sessions, so the slave SQL thread has to decide which thread generated each event.
Because the master has the best knowledge about the statement, it marks any event that it considers thread-specific. For instance, the master will usually mark events that operate on temporary tables as thread-specific.

Filtering events and tables
    The SQL thread is responsible for doing filtering on the slave. MySQL provides both database filters, which are set up with replicate-do-db and replicate-ignore-db, and table filters, which are set up with replicate-do-table, replicate-ignore-table, replicate-wild-do-table, and replicate-wild-ignore-table.

Skipping events
    To recover replication after it has stopped, there are features for skipping events when restarting replication. The SQL thread handles this skipping.

Context events

On the master, some events require a context to execute correctly. The context usually consists of thread-specific features such as user-defined variables, but can also include state information required to execute correctly, such as autoincrement values for tables with autoincrement columns. To pass this context from the master to the slave, the master has a set of context events that it can write to the binary log. The master writes each context event before the event that contains the actual change. Currently, context events are associated only with Query events and are added to the binary log before the Query events. Context events fall into the following categories:

User variable event
    This event holds the name and value of a user-defined variable. It is generated whenever the statement contains a reference to a user-defined variable:

    SET @foo = 'SmoothNoodleMaps';
    INSERT INTO my_albums(artist, album) VALUES ('Devo', @foo);

Integer variable event
    This event holds an integer value for either the INSERT_ID session variable or the LAST_INSERT_ID session variable.
The INSERT_ID integer variable event is used for statements that insert into tables with an AUTO_INCREMENT column, to transfer the next value to use for the autoincrement column. This information is required, for example, by this table definition and statement:

    CREATE TABLE Artist (id INT AUTO_INCREMENT PRIMARY KEY, artist TEXT);
    INSERT INTO Artist VALUES (DEFAULT, 'The The');

The LAST_INSERT_ID integer variable event is generated when a statement uses the LAST_INSERT_ID function, as in this statement:

    INSERT INTO Album VALUES (LAST_INSERT_ID(), 'Mind Bomb');

Rand event
    If the statement contains a call to the RAND function, this event will contain the random seeds, which allow the slave to reproduce the "random" value generated on the master:

    INSERT INTO my_table VALUES (RAND());

These context events are necessary to produce correct behavior in the situations just described, but there are other situations that cannot be handled using context events. For example, the replication system cannot handle a user-defined function (UDF) unless the UDF is deterministic and also exists on the slave. In such cases, the user variable event can solve the problem.

User variable events can be very useful for avoiding problems with replicating nondeterministic functions, for improving performance, and for integrity checks. As an example, suppose that you enter documents into a database table and each document is automatically assigned a number using the AUTO_INCREMENT feature. To maintain the integrity of the documents, you also store an MD5 checksum of each document in the same table. A definition of such a table is shown in Example 8-12.
Example 8-12. Definition of document table with MD5 checksum

CREATE TABLE document(
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    doc BLOB,
    checksum CHAR(32)
);

Using this table, you can add documents together with their checksums and also verify the integrity of each document, as shown in Example 8-13, to ensure it has not been corrupted. Although the MD5 checksum is no longer considered cryptographically secure, it still offers some protection against random errors such as disk and memory problems.

Example 8-13. Inserting into the table and checking document integrity

master> INSERT INTO document(doc) VALUES (document);
Query OK, 1 row affected (0.02 sec)

master> UPDATE document SET checksum = MD5(doc) WHERE id = LAST_INSERT_ID();
Query OK, 1 row affected (0.04 sec)

master> SELECT id,
     ->        IF(MD5(doc) = checksum, 'OK', 'CORRUPT!') AS Status
     ->   FROM document;
+-----+----------+
| id  | Status   |
+-----+----------+
|   1 | OK       |
|   2 | OK       |
|   3 | OK       |
|   4 | OK       |
|   5 | OK       |
|   6 | OK       |
|   7 | CORRUPT! |
|   8 | OK       |
|   9 | OK       |
|  10 | OK       |
|  11 | OK       |
+-----+----------+
11 rows in set (5.75 sec)

But how well does this idea play with replication? Well, it depends on how you use it. When the INSERT statement in Example 8-13 is executed, it is written to the binary log as is, which means the MD5 checksum is recalculated on the slave. So what happens if the document is corrupted on the way to the slave? In that case, the MD5 checksum will be recalculated from the corrupt document, and the corruption will not be detected. The statement given in Example 8-13 is therefore not replication-safe.

We can, however, do better than this. Instead of following Example 8-13, write your code to look like Example 8-14, which stores the checksum in a user-defined variable and uses that variable in the UPDATE statement.
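The difference between the two schemes can be demonstrated outside the server with Python's hashlib (an illustration of the logic, not MySQL code):

```python
import hashlib

def md5_hex(data):
    """Hex MD5 digest, like MySQL's MD5() function."""
    return hashlib.md5(data).hexdigest()

document = b"original document"
received = b"corrupted document"   # what a corrupted transfer delivers

# Unsafe scheme (Example 8-13): the slave recomputes MD5 over the bytes
# it received, so the stored checksum always matches the (possibly
# corrupt) document, and verification reports "OK" regardless.
slave_stored_checksum = md5_hex(received)
assert slave_stored_checksum == md5_hex(received)   # corruption missed

# Safe scheme (Example 8-14): the checksum computed on the master travels
# with the statement via a user-defined variable, so comparing it on the
# slave exposes the corruption.
master_checksum = md5_hex(document)
assert master_checksum != md5_hex(received)         # corruption detected
```

The key point is that a checksum only detects corruption if it is computed once, on the original data, and carried along; recomputing it at the destination merely restates whatever arrived.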
The user-defined variable contains the actual value computed by the MD5 function, so it will be identical on the master and the slave even if the document is corrupted in the transfer (but, of course, not if the checksum itself is corrupted in the transfer). Either way, a corruption occurring when the document is replicated will be noticed.

Example 8-14. Replication-safe method of inserting a document in the table

master> INSERT INTO document(doc) VALUES (document);
Query OK, 1 row affected (0.02 sec)

master> SELECT MD5(doc) INTO @checksum FROM document WHERE id = LAST_INSERT_ID();
Query OK, 0 rows affected (0.00 sec)

master> UPDATE document SET checksum = @checksum WHERE id = LAST_INSERT_ID();
Query OK, 1 row affected (0.04 sec)

Thread-specific events

As mentioned earlier, some statements are thread-specific and will yield a different result when executed in another thread. There are several reasons for this:

Reading and writing thread-local objects
    A thread-local object can potentially clash with an identically named object in another thread. Typical examples of such objects are temporary tables and user-defined variables. We have already examined how replication handles user-defined variables, so this section concentrates on how replication handles temporary tables.

Using variables or functions that have thread-specific results
    Some variables and functions have different values depending on which thread they run in. A typical example is the server variable connection_id.

The server handles these two cases slightly differently. In addition, there are a few cases in which replication does not try to account for differences between the server and client, so results can differ in subtle ways.

To handle thread-local objects, some form of thread-local store (TLS) is required, but because the slave executes everything from a single thread, it has to manage this storage itself and keep the TLSes separate.
To handle temporary tables, the slave creates a unique (mangled) filename for each table, based on the server process ID, the thread ID, and a thread-specific sequence number. This means that the two statements in Example 8-15, each run from a different client on the master, create two different filenames on the slave to represent the temporary tables.

Example 8-15. Two threads, each creating a temporary table

master-1> CREATE TEMPORARY TABLE cache (a INT, b INT);
Query OK, 0 rows affected (0.01 sec)

master-2> CREATE TEMPORARY TABLE cache (a INT, b INT);
Query OK, 0 rows affected (0.01 sec)

All the statements from all threads on the master are stored in sequence in the binary log, so it is necessary to distinguish the two statements; otherwise, they will cause an error when executed on the slave. To distinguish the statements in the binary log so that they do not conflict, the server tags the Query events containing them as thread-specific and also adds the thread ID to the events. (Actually, the thread ID is added to all Query events, but it is not really necessary except for thread-specific statements.)

When the slave receives a thread-specific event, it sets a variable special to the replication slave thread, called the pseudothread ID, to the thread ID passed with the event. The pseudothread ID is then used when constructing the temporary tables. The process ID of the slave server, which is the same for all master threads, is used when constructing the filename, but that does not matter as long as there is a distinction among tables from different threads.

We also mentioned that thread-specific functions and variables require special treatment to work correctly when replicated. This is not, however, handled by the server: when a server variable is referenced in a statement, the value of the server variable is retrieved on the slave.
If, for some reason, you want to replicate exactly the same value, you have to store the value in a user-defined variable as shown in Example 8-14, or use row-based replication, which we cover later in this chapter.

Filtering and skipping events

In some cases, events may be skipped either because they are filtered out using replication filters or because the slave has been specifically instructed to skip a number of events.

The SQL_SLAVE_SKIP_COUNTER variable instructs the slave server to skip a specified number of events. The SQL thread should not be running when you set the variable. This condition is typically easy to satisfy, because the variable is usually used to skip events that have already caused replication to stop. An error that stops replication should, of course, be investigated and handled, but if you fix the problem manually, it is necessary to ignore the event that stopped replication and force replication to continue past the offending event. This variable is provided as a convenience, to keep you from having to use CHANGE MASTER TO. Example 8-16 shows the feature in use after a bad statement has caused replication to stop.

Example 8-16. Using SQL_SLAVE_SKIP_COUNTER

slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 3;
Query OK, 0 rows affected (0.02 sec)

slave> START SLAVE;
Query OK, 0 rows affected (0.02 sec)

When you start the slave, three events will be skipped before replication resumes. If skipping three events would leave the slave in the middle of a transaction, the slave will continue skipping events until it finds the end of that transaction.

Events can also be filtered by the slave if replication filters are set up. As we discussed in Chapter 4, the master can handle filtering, but if there are slave filters, the events are filtered in the SQL thread. This means that the events are still sent from the master and stored in the relay log.
Filtering is done differently depending on whether database filters or table filters are set up. The logic for deciding whether a statement for a certain database should be filtered out of the binary log was detailed in Chapter 4, and the same logic applies to slave filters, with the addition that a set of table filters has to be handled as well. One important aspect of filtering is that a filter matching even a single table causes the entire statement referring to that table to be left out of replication. The logic for filtering statements on the slave is shown in Figure 8-5.

Filtering that involves tables can easily become difficult to understand, so we advise the following rules to avoid unwanted results:

• Do not qualify table names with the database they're a part of. Instead, precede the statement with a USE statement to set a new default database.
• Do not update tables in different databases using a single statement.
• Avoid updating multiple tables in a statement, unless you know that all of the tables are filtered or none of them are. Notice that by the logic in Figure 8-5, the whole statement will be filtered if even one of the tables is filtered.

Figure 8-5. Replication filtering rules

Semisynchronous Replication

Google has an extensive set of patches for MySQL and InnoDB that tailor the server and the storage engine. One of the patches, available for MySQL version 5.0, is the semisynchronous replication patch. MySQL has since reworked the patch and released it with MySQL 5.5.

The idea behind semisynchronous replication is to ensure that the changes are written to disk on at least one slave before allowing execution to continue. This means that for each connection, at most one transaction can be lost due to a master crash.
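The ordering of operations behind this guarantee can be sketched in Python (hypothetical step names; the real logic lives inside the server and its plug-ins):

```python
def semisync_commit(slave_acknowledges=True):
    """Sketch of the call order for one transaction under semisynchronous
    replication: the storage-engine commit is never delayed; only the
    reply to the client waits for the slave's acknowledgment.
    Returns the ordered list of completed steps."""
    steps = ["engine commit"]        # transaction durable on the master
    steps.append("send to slave")    # binlog events shipped to the slave
    if slave_acknowledges:
        steps.append("slave ack")    # transaction in the slave's relay log
    steps.append("reply to client")  # only now does COMMIT return
    return steps

steps = semisync_commit()
# A crash between "engine commit" and "slave ack" loses at most the one
# transaction that was in flight on this connection:
assert steps.index("engine commit") < steps.index("slave ack")
assert steps[-1] == "reply to client"
```

Because the engine commit always precedes the slave acknowledgment, the window in which a transaction exists only on the master is exactly one transaction wide per connection.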
It is important to understand that semisynchronous replication does not hold off the commit of the transaction; it only delays the reply sent back to the client until the transaction has been written to the relay log of at least one slave. Figure 8-6 shows the order of the calls when committing a transaction. As you can see, the transaction is committed to the storage engine before it is sent to the slave, but the client's commit call does not return until the slave has acknowledged that the transaction is in durable storage.

Figure 8-6. Transaction commit with semisynchronous replication

For each connection, one transaction can be lost if a crash occurs after the transaction has been committed to the storage engine but before it has been sent to the slave. However, because the acknowledgment goes to the client only after the slave has confirmed that it has the transaction, at most one transaction can be lost. This usually means at most one transaction can be lost per client.

Configuring Semisynchronous Replication

To use semisynchronous replication, both the master and the slave need to support it, so both have to be running MySQL version 5.5 or later with semisynchronous replication enabled. If either the master or the slave does not support it, semisynchronous replication will not be used, but replication works as usual, meaning that more than one transaction can be lost unless special precautions are taken to ensure each transaction reaches the slave before a new transaction is started.

Use the following steps to enable semisynchronous replication:

1. Install the master plug-in on the master:

   master> INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';

2. Install the slave plug-in on each slave:

   slave> INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';

3.
Once you have installed the plug-ins, enable them on the master and the slave. This is controlled through two server variables that are also available as options, so to ensure that the settings take effect even after a restart, it is best to bring down the server and add the options to the my.cnf file of the master:

   [mysqld]
   rpl-semi-sync-master-enabled = 1

and to the slave:

   [mysqld]
   rpl-semi-sync-slave-enabled = 1

4. Restart the servers.

If you followed the instructions just given, you now have a semisynchronous replication setup and can test it, but consider these cases:

• What happens if all slaves crash and therefore no slave acknowledges that it has stored the transaction to the relay log? This is not unlikely if you have only a single server attached to the master.
• What happens if all slaves disconnect gracefully? In this case, the master has no slave to which the transaction can be sent for safekeeping.

In addition to rpl-semi-sync-master-enabled and rpl-semi-sync-slave-enabled, there are two options that you can use to handle the situations we just laid out:

rpl-semi-sync-master-timeout=milliseconds
   To prevent semisynchronous replication from blocking if it does not receive an acknowledgment, it is possible to set a timeout using this option. If the master does not receive any acknowledgment before the timeout expires, it will revert to normal asynchronous replication and continue operating without semisynchronous replication. This option is also available as a server variable and can be set without bringing the server down. Note, however, that as with every server variable, the value will not be saved between restarts.

rpl-semi-sync-master-wait-no-slave={ON|OFF}
   If a transaction is committed but the master does not have any slaves connected, it is not possible for the master to send the transaction anywhere for safekeeping.
By default, the master will then wait for a slave to connect (as long as it is within the timeout limit) and acknowledge that the transaction has been properly written to disk. You can use the rpl-semi-sync-master-wait-no-slave={ON|OFF} option to turn off this behavior, in which case the master reverts to asynchronous replication as soon as there are no connected slaves.

Note that if the master does not receive any acknowledgment before the timeout given by rpl-semi-sync-master-timeout expires, or if rpl-semi-sync-master-wait-no-slave=OFF and no slaves are connected, semisynchronous replication will silently revert to normal asynchronous replication and continue operating without it.

Monitoring Semisynchronous Replication

Both plug-ins install a number of status variables that allow you to monitor semisynchronous replication. We will cover the most interesting ones here (for a complete list, consult the online reference manual for semisynchronous replication):

Rpl_semi_sync_master_clients
   This status variable reports the number of connected slaves that support and have been registered for semisynchronous replication.

Rpl_semi_sync_master_status
   The status of semisynchronous replication on the master is 1 if it is active, and 0 if it is inactive, either because it has not been enabled or because it was enabled but has reverted to asynchronous replication.

Rpl_semi_sync_slave_status
   The status of semisynchronous replication on the slave is 1 if active (i.e., if it has been enabled and the I/O thread is running) and 0 if it is inactive.

You can read the values of these variables either using the SHOW STATUS command or through the information schema table GLOBAL_STATUS. If you want to use the values for other purposes, the SHOW STATUS command is hard to use; Example 8-17 instead uses a SELECT on the information schema to extract the value and store it in a user-defined variable.

Example 8-17.
Retrieving values using the information schema

   master> SELECT Variable_value INTO @value
       ->   FROM INFORMATION_SCHEMA.GLOBAL_STATUS
       ->  WHERE Variable_name = 'Rpl_semi_sync_master_status';
   Query OK, 1 row affected (0.00 sec)

Global Transaction Identifiers

Starting with MySQL 5.6, the concept of global transaction identifiers (GTIDs) was added, which means that each transaction is assigned a unique identifier. This section introduces GTIDs and demonstrates how they can be used. For a detailed description of GTIDs, see "Replication with Global Transaction Identifiers" in the MySQL 5.6 Reference Manual.

In MySQL 5.6, each transaction on a server is assigned a transaction identifier, which is a nonzero 64-bit value assigned based on the order in which transactions are committed. This number is local to the server (i.e., some other server might assign the same number to some other transaction). To make this transaction identifier global, the server UUID is added to form a pair. For example, if the server has the server UUID (as given by the server variable @@server_uuid) 2298677f-c24b-11e2-a68b-0021cc6850ca, the 1477th transaction committed on the server will have GTID 2298677f-c24b-11e2-a68b-0021cc6850ca:1477.

When a transaction is replicated from a master to a slave, the binary log position of the transaction changes, because the slave has to write it to its own binary logfile. Because a slave might be configured differently, the positions can be vastly different from the position on the master, but the global transaction identifier will be the same. When transactions are replicated and global transaction identifiers are enabled, the GTID of the transaction is retained regardless of the number of times the transaction is propagated. This simple idea makes GTIDs a very powerful concept, as you will soon see.
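The pairing of server UUID and transaction number can be illustrated with a few lines of Python. The helper below is a toy of our own for illustration, not part of any MySQL library:

```python
def make_gtid(server_uuid, trx_no):
    """Form a global transaction identifier from a server UUID and the
    (1-based) order in which the transaction was committed on it."""
    if trx_no <= 0:
        raise ValueError("transaction identifiers are nonzero and positive")
    return "%s:%d" % (server_uuid, trx_no)

uuid = "2298677f-c24b-11e2-a68b-0021cc6850ca"
# The 1477th transaction committed on this server:
print(make_gtid(uuid, 1477))  # → 2298677f-c24b-11e2-a68b-0021cc6850ca:1477
```

Because the UUID part is unique to the originating server, the pair stays globally unique no matter how many servers the transaction passes through.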
While the notation just shown indicates an individual transaction, it is also necessary to have a notation for a global transaction identifier set (or GTID set). This helps, for example, when talking about transactions that have been logged on a server. A GTID set is written by giving a range, or list of ranges, of transaction identifiers. So the set of transactions 911-1066 and 1477-1593 is written as 2298677f-c24b-11e2-a68b-0021cc6850ca:911-1066:1477-1593.

GTIDs are written to the binary log and assigned only to transactions that are written to the binary log. This means that if you turn off the binary log, transactions will not be assigned GTIDs. This applies to the slave as well as the master. The consequence is that if you want to use a slave for failover, you need to have the binary log enabled on it. If you do not have the binary log enabled, the slave will not remember the GTIDs of the transactions it has executed.

Setting Up Replication Using GTIDs

To set up replication using global transaction identifiers, you must enable them when configuring the servers. We'll go through what you need to do to enable global transaction identifiers here. To configure a standby for using global transaction identifiers, you need to update my.cnf as follows:

   [mysqld]
   user = mysql
   pid-file = /var/run/mysqld/mysqld.pid
   socket = /var/run/mysqld/mysqld.sock
   port = 3306
   basedir = /usr
   datadir = /var/lib/mysql
   tmpdir = /tmp
   log-bin = master-bin
   log-bin-index = master-bin.index
   server-id = 1
   gtid-mode = ON
   log-slave-updates
   enforce-gtid-consistency

It is necessary to have the binary log (log-bin) enabled on the standby. This ensures that all changes are logged to the binary log when the standby becomes the primary, and it is also a requirement for log-slave-updates. The gtid-mode = ON option enables the generation of global transaction identifiers.
The log-slave-updates option ensures that events received from the master and executed are also written to the standby's binary log. If this is not enabled, the standby cannot pass on changes it received from the master to slaves connected to the standby. Note that this option is not enabled by default.

The enforce-gtid-consistency option ensures that statements throw an error if they cannot be logged consistently with global transaction identifiers enabled. This is recommended to ensure that failover happens correctly.

After updating the options file, you need to restart the server for the changes to take effect. Once you've done this for all servers that are going to be used in the setup, you're set up for a failover. Using the GTID support in MySQL 5.6, switching masters just requires you to issue the command:

   CHANGE MASTER TO
       MASTER_HOST = host_of_new_master,
       MASTER_PORT = port_of_new_master,
       MASTER_USER = replication_user_name,
       MASTER_PASSWORD = replication_user_password,
       MASTER_AUTO_POSITION = 1

The MASTER_AUTO_POSITION option causes the slave to automatically negotiate which transactions should be sent over when connecting to the master. To see the status of replication in terms of GTID positions, SHOW SLAVE STATUS has been extended with a few new columns. You can see an example of them in Example 8-18.

Example 8-18. Output of SHOW SLAVE STATUS with GTID enabled

   Slave_IO_State: Waiting for master to send event
   ...
   Slave_IO_Running: Yes
   Slave_SQL_Running: Yes
   ...
   Master_UUID: 4e2018fc-c691-11e2-8c5a-0021cc6850ca
   ...
   Retrieved_Gtid_Set: 4e2018fc-c691-11e2-8c5a-0021cc6850ca:1-1477
   Executed_Gtid_Set: 4e2018fc-c691-11e2-8c5a-0021cc6850ca:1-1593
   Auto_Position: 1

Master_UUID
   This is the server UUID of the master. The field is not strictly tied to the GTID implementation (it was added before GTIDs were introduced), but it is useful when debugging problems.

Retrieved_Gtid_Set
   This is the set of GTIDs that have been fetched from the master and stored in the relay log.
Executed_Gtid_Set
   This is the set of GTIDs that have been executed on the slave and written to the slave's binary log.

Failover Using GTIDs

"Hot Standby" on page 130 described how to switch to a hot standby without using global transaction identifiers. That process used binary log positions, but with global transaction identifiers, there is no longer a need to check the positions. Switching over to a hot standby with global transaction identifiers is very easy; it is sufficient to just redirect the slave to the new master using CHANGE MASTER:

   CHANGE MASTER TO MASTER_HOST = 'standby.example.com';

As usual, if no other parameters change, it is not necessary to repeat them. When you enable MASTER_AUTO_POSITION, the master will figure out which transactions need to be sent over. The failover procedure is therefore easily defined using the Replicant library:

   _CHANGE_MASTER = (
       "CHANGE MASTER TO "
       "MASTER_HOST = %s, MASTER_PORT = %d, "
       "MASTER_USER = %s, MASTER_PASSWORD = %s, "
       "MASTER_AUTO_POSITION = 1"
   )

   def change_master(server, master):
       server.sql(_CHANGE_MASTER,
                  master.host, master.port,
                  master.user, master.password)

   def switch_to_master(server, standby):
       change_master(server, standby)
       server.sql("START SLAVE")

By comparing this procedure with the one in Example 5-1, you can see that a few things have been improved by using GTIDs:

• Because you do not need to check the position of the master, it is not necessary to stop it to ensure that it is not changing.
• Because the GTIDs are global (i.e., they never change when replicated), there is no need for the slave to "align" with the master or the standby to get a good switchover position.
• It is not necessary to fetch the position on the standby (which is a slave of the current primary) because everything is replicated to the slave.
• It is not necessary to provide a position when changing the master, because the servers automatically negotiate positions.
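To see what the procedure above actually sends to a server, you can exercise it against a stub. The definitions of _CHANGE_MASTER, change_master, and switch_to_master are repeated from the text; StubServer is a hypothetical test double of our own that just records statements instead of talking to a real server:

```python
_CHANGE_MASTER = (
    "CHANGE MASTER TO "
    "MASTER_HOST = %s, MASTER_PORT = %d, "
    "MASTER_USER = %s, MASTER_PASSWORD = %s, "
    "MASTER_AUTO_POSITION = 1"
)

def change_master(server, master):
    server.sql(_CHANGE_MASTER,
               master.host, master.port,
               master.user, master.password)

def switch_to_master(server, standby):
    change_master(server, standby)
    server.sql("START SLAVE")

class StubServer:
    """Records the statements that would be sent to a real server."""
    def __init__(self, host="", port=3306, user="", password=""):
        self.host, self.port = host, port
        self.user, self.password = user, password
        self.statements = []

    def sql(self, stmt, *args):
        self.statements.append(stmt % args if args else stmt)

slave = StubServer()
standby = StubServer("standby.example.com", 3306, "repl", "secret")
switch_to_master(slave, standby)
for stmt in slave.statements:
    print(stmt)
```

Note that no binary log position appears anywhere in the recorded statements: MASTER_AUTO_POSITION = 1 leaves the position negotiation entirely to the servers.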
Because the GTIDs are global (i.e., it is not necessary to do any sort of translation of the positions), the preceding procedure works just as well for switchover and failover, even when hierarchical replication is used. This was not the case in "Hot Standby" on page 130, where different procedures had to be employed for switchover, non-hierarchical failover, and failover in a hierarchy.

In order to avoid losing transactions when the master fails, it is a good habit to empty the relay log before actually executing the failover. This avoids re-fetching transactions that have already been transferred from the master to the slave. The best approach would be to redirect only the I/O thread to the new master, but unfortunately, this is not (yet) possible. To wait for the relay log to become empty, the handy WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS function will block until all the GTIDs in a GTID set have been processed by the SQL thread. Using this function, the switchover procedure can be rewritten as shown in Example 8-19.

Example 8-19. Python code for failover to a standby using GTIDs

   def switch_to_master(server, standby):
       fields = server.sql("SHOW SLAVE STATUS")
       server.sql("SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(%s)",
                  fields['Retrieved_Gtid_Set'])
       server.sql("STOP SLAVE")
       change_master(server, standby)
       server.sql("START SLAVE")

Slave Promotion Using GTIDs

The procedure shown in the previous section works fine when the slave is actually behind the standby. But, as mentioned in "Slave Promotion" on page 144, if the slave knows more transactions than the standby, failing over to the standby does not put you in a better situation. It would actually be better if the slave were the new master. So how can this be implemented using global transaction identifiers?
The actual failover using the procedure in Example 8-19 can still be used, but if there are multiple slaves of a master, and the master fails, it is necessary to compare the slaves to see which slave is most knowledgeable. To help with this, MySQL 5.6 introduced the variable GTID_EXECUTED. This global variable contains a GTID set consisting of all transactions that have been written to the binary log on the server. Note that no GTID is generated unless the transaction is written to the binary log, so only transactions that were written to the binary log are represented in this set.

There is also a global variable GTID_PURGED that contains the set of all transactions that have been purged (i.e., removed) from the binary log and are no longer available to replicate. This set is always a subset of (or equal to) GTID_EXECUTED. This variable can be used to check that a candidate master has enough events in the binary log to act as master to some slave. If there are any events in GTID_PURGED on the master that are not in GTID_EXECUTED on the slave, the master will not be able to replicate some events that the slave needs, because they are not in the binary log. The relation between these two variables can be seen in Figure 8-7, where each variable represents a "wavefront" through the space of all GTIDs.

Using GTID_EXECUTED, it is easy to compare the slaves and decide which one knows the most transactions. The code in Example 8-20 orders the slaves based on GTID_EXECUTED and picks the "best" one as the new master. Note that GTID sets are not normally totally ordered (i.e., two GTID sets can differ but have the same size). In this particular case, however, the GTID sets will be totally ordered, because they were ordered in the binary log of the master.

Example 8-20.
Python code to find the best slave

   from mysql.replicant.server import GTIDSet

   def fetch_gtid_executed(server):
       return GTIDSet(server.sql("SELECT @@GLOBAL.GTID_EXECUTED"))

   def fetch_gtid_purged(server):
       return GTIDSet(server.sql("SELECT @@GLOBAL.GTID_PURGED"))

   def order_slaves_on_gtid(slaves):
       entries = []
       for slave in slaves:
           pos = fetch_gtid_executed(slave)
           entries.append((pos, slave))
       entries.sort(key=lambda x: x[0], reverse=True)  # most knowledgeable first
       return entries

Figure 8-7. GTID_EXECUTED and GTID_PURGED

Combining the examples in Example 8-19 and Example 8-20 allows the function that promotes the best slave to be written as simply as what is shown in Example 8-21.

Example 8-21. Slave promotion with MySQL 5.6 GTIDs

   def promote_best_slave_gtid(slaves):
       entries = order_slaves_on_gtid(slaves)
       _, master = entries.pop(0)    # "Best" slave will be new master
       for _, slave in entries:
           switch_to_master(slave, master)

Replication of GTIDs

The previous sections showed how to set up the MySQL server to use global transaction identifiers and how to handle failover and slave promotion, but one piece of the puzzle is still missing: how are GTIDs propagated between the servers?

A GTID is assigned to every group in the binary log, that is, to each transaction, single-statement DML (whether transactional or nontransactional), and DDL statement. A special GTID event is written before the group and contains the full GTID for the transaction, as illustrated in Figure 8-8.

Figure 8-8. A binary logfile with GTIDs

To handle the replication of transactions with a GTID assigned, the SQL thread processes the GTID event in the following manner:

1. If the GTID is already present in GTID_EXECUTED, the transaction is skipped entirely and not even written to the binary log. (Recall that GTID_EXECUTED contains all transactions already in the binary log, so there is no need to write it again.)
2.
Otherwise, the GTID is assigned to the transaction that follows, and the transaction is executed as normal.
3. When the transaction commits, the GTID assigned to the transaction is used to generate a new GTID event, which is then written to the binary log before the transaction.
4. The contents of the transaction cache are then written to the binary log after the GTID event.

Note that with GTIDs assigned to every transaction, it is possible in the first step to filter out transactions that have already been executed, which was not possible before MySQL 5.6.

You can control what GTID is assigned to a transaction through a new variable named GTID_NEXT. This variable can either contain a GTID or have the value AUTOMATIC. (It can also take the value ANONYMOUS, but only when GTID_MODE = OFF, so we disregard this case.) When committing a transaction, different actions are taken depending on the value of GTID_NEXT:

• If GTID_NEXT has the value AUTOMATIC, a new GTID is created and assigned to the transaction.
• If GTID_NEXT has a GTID as a value, that GTID will be used when the transaction is written to the binary log.

The GTID assigned to GTID_NEXT is not changed after the transaction commits. This means you have to set it either to a new GTID or to AUTOMATIC after you have committed the transaction. If you do not change the value of GTID_NEXT, you will get an error when you try to start a new transaction, regardless of whether it is started explicitly or implicitly. Observe that GTID_NEXT has to be set before the transaction starts; if you try to set the variable after starting a transaction, you will get an error. Once you set GTID_NEXT and start a transaction, the GTID is owned by the transaction.
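The decision taken at commit time can be summarized in a few lines of Python. This is a toy model of our own for illustration (the real logic is internal to the server, and a real server picks the next free transaction number rather than keeping a simple counter):

```python
class GtidState:
    """Toy model of how a server assigns GTIDs at commit time."""

    def __init__(self, server_uuid):
        self.server_uuid = server_uuid
        self.next_trx_no = 1   # next automatically generated number

    def commit(self, gtid_next):
        # AUTOMATIC: create and assign a brand-new GTID.
        if gtid_next == "AUTOMATIC":
            gtid = "%s:%d" % (self.server_uuid, self.next_trx_no)
            self.next_trx_no += 1
            return gtid
        # A concrete GTID: use it as-is when writing the binary log.
        return gtid_next

state = GtidState("4e2018fc-c691-11e2-8c5a-0021cc6850ca")
print(state.commit("AUTOMATIC"))
# → 4e2018fc-c691-11e2-8c5a-0021cc6850ca:1
print(state.commit("01010101-0101-0101-0101-010101010101:3"))
# → 01010101-0101-0101-0101-010101010101:3
```

The second case is exactly what mysqlbinlog relies on when it replays a binary log: by setting GTID_NEXT explicitly, the replayed transaction keeps its original identifier.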
Ownership is reflected in the variable GTID_OWNED:

   mysql> SELECT @@GLOBAL.GTID_OWNED;
   +-------------------------------------------+
   | @@GLOBAL.GTID_OWNED                       |
   +-------------------------------------------+
   | 02020202-0202-0202-0202-020202020202:4#42 |
   +-------------------------------------------+
   1 row in set (0.00 sec)

In this case, the only owned GTID is 02020202-0202-0202-0202-020202020202:4, which is owned by the session with ID 42. GTID_OWNED should be considered internal and is intended for testing and debugging.

Replicating from a master to a slave directly is not the only way changes can be replicated. MySQL replication is also designed to work with mysqlbinlog, so that SQL statements can be generated, saved to a file, and applied to a server. To handle propagation of GTIDs even when it is done indirectly through mysqlbinlog, GTID_NEXT is used: whenever mysqlbinlog encounters a GTID event, it generates a statement to set GTID_NEXT. You can see an example of the output in Example 8-22.

Example 8-22. Example output from mysqlbinlog with GTID events

   # at 410
   #130603 20:57:54 server id 1  end_log_pos 458 CRC32 0xc6f8a5eb  GTID [commit=yes]
   SET @@SESSION.GTID_NEXT= '01010101-0101-0101-0101-010101010101:3'/*!*/;
   # at 458
   #130603 20:57:54 server id 1  end_log_pos 537 CRC32 0x1e2e40d0  Query  thread_id=4  exec_time=0  error_code=0
   SET TIMESTAMP=1370285874/*!*/;
   BEGIN
   /*!*/;
   # at 537
   #130603 20:57:54 server id 1  end_log_pos 638 CRC32 0xc16f211d  Query  thread_id=4  exec_time=0  error_code=0
   SET TIMESTAMP=1370285874/*!*/;
   INSERT INTO t VALUES (1004)
   /*!*/;
   # at 638
   #130603 20:57:54 server id 1  end_log_pos 669 CRC32 0x91980f0b
   COMMIT/*!*/;

Slave Safety and Recovery

Slave servers can crash too, and when they do, you need to recover them. The first step in handling a crashed slave is always to investigate why it crashed.
This cannot be automated, because there are so many hard-to-anticipate reasons for crashes. A slave might be out of disk space, it may have read a corrupt event, or it might have re-executed a statement that resulted in a duplicate key error for some reason. However, it is possible to automate some recovery procedures and use this automation to help diagnose a problem.

Syncing, Transactions, and Problems with Database Crashes

To ensure that slaves pick up replication safely after a crash on the master or slave, you need to consider two different aspects:

• Ensuring the slave stores all the data needed for recovery in the event of a crash
• Executing the recovery of a slave

Slaves do their best to meet the first condition by syncing to disk. To provide acceptable performance, operating systems keep files in memory while working with them, and write them to disk only periodically or when forced to. This means data written to a file is not necessarily in safe storage; if there is a crash, data left only in memory will be lost. To force a slave to write files to disk, the database server issues an fsync call, which writes all data stored in memory to disk. To protect replication data, the MySQL server normally executes fsync calls for the relay log, the master.info file, and the relay-log.info file at regular intervals.

I/O thread syncing

For the I/O thread, two fsync calls are made whenever an event has been processed: one to flush the relay log to disk and one to flush the master.info file to disk. Doing the flushes in this order ensures that no events will be lost if the slave crashes between flushing the relay log and flushing the master.info file. This, however, means that an event can be duplicated if a crash occurs, as the following scenario shows:

• The server flushes the relay log and is about to update the master read position in master.info.
• The server crashes, which means that the master read position still refers to the position before the event that was flushed to the relay log.
• The server restarts and gets the master read position from master.info, that is, the position before the last event written to the relay log.
• Replication resumes from this position, and the event is duplicated.

If the files were flushed in the opposite order (the master.info file first and the relay log second), there would be potential for losing an event in the same scenario, because the slave would pick up replication after the event that it was about to write to the relay log. Losing an event is deemed worse than duplicating one, hence the relay log is flushed first.

SQL thread syncing

The SQL thread processes the groups in the relay log by processing each event in turn. When all the events in the group are processed, the SQL thread commits the transaction using the following process:

1. It commits the transaction to the storage engine (assuming the storage engine supports commit).
2. It updates the relay-log.info file with the position of the next event to process, which is also the beginning of the next group to process.
3. It writes relay-log.info to disk by issuing an fsync call.

While executing inside a group, the thread increments the event position to keep track of where the SQL thread is reading in the relay log, but if there is a crash, execution will resume from the last recorded position in the relay-log.info file. This behavior leaves the SQL thread with its own version of the atomic update problem mentioned for the I/O thread, so the slave database and the relay-log.info file can get out of sync in the following scenario:

1. The event is applied to the database and the transaction is committed. The next step is to update the relay-log.info file.
2. The slave crashes, which means relay-log.info now points to the beginning of the just-completed transaction.
3.
On recovery, the SQL thread reads the information from the relay-log.info file and starts replication from the saved position.
4. The last executed transaction is repeated.

What all this boils down to is that committing a transaction on the slave and updating the replication information is not atomic: it is possible that relay-log.info does not accurately reflect what has been committed to the database. The next section describes how transactional replication is implemented in MySQL 5.6 to solve this problem.

Transactional Replication

As noted in the previous section, replication is not crash-safe, because the information about the progress of replication is not always in sync with what has actually been applied to the database. Although transactions are not lost if the server crashes, it can require some tweaking to bring the slaves up again. MySQL 5.6 has increased crash safety for the slave by committing the replication information together with the transaction, as shown in Figure 8-9. This means that the replication information will always be consistent with what has been applied to the database, even in the event of a server crash. Also, some fixes were done on the master to ensure that it recovers correctly.

Figure 8-9. Position information updated after the transaction and inside the transaction

Recall that the replication information is stored in two files: master.info and relay-log.info. The files are arranged so that they are updated after the transaction has been applied. This means that if you have a crash between the transaction commit and the update of the files, as on the left in Figure 8-9, the position information will be wrong. In other words, a transaction cannot be lost this way, but there is a risk that a transaction could be applied again when the slave recovers. The usual way to avoid this is to have a primary key on all your tables.
In that case, a repeated update of the table would cause the slave to stop, and you would have to use SQL_SLAVE_SKIP_COUNTER to skip the transaction and get the slave up and running again (or use GTID_NEXT to commit a dummy transaction). This is better than losing a transaction, but it is nevertheless a nuisance. Removing the primary key to prevent the slave from stopping only solves the problem partially: it means that the transaction would be applied twice, which would both place a burden on the application to handle dual entries and require that the tables be cleaned regularly. Both of these approaches require either manual intervention or scripting support. Reliability is not affected, but crashes are much easier to handle if the replication information is committed in the same transaction as the data being updated.

To implement transactional replication in MySQL 5.6, the replication information can be stored either in files (as before) or in tables. Even when storing the replication information in tables, it is necessary either to store the data and the replication information in the same storage engine (which must be transactional) or to support XA on both storage engines. If neither of these steps is taken, the replication information and the data cannot be committed as a single transaction.

Setting up transactional replication

The default in MySQL 5.6 is to use files for the replication information, so to use transactional replication, it is necessary to reconfigure the server to use tables for the replication information. To control where the replication information is placed, two new options have been added: master_info_repository and relay_log_info_repository. These options take the value FILE or TABLE to use either the file or the table for the respective piece of information. Thus, to use transactional replication, edit your configuration file, add the options as shown in Example 8-23, and restart the server.

Example 8-23.
Adding options to turn on transactional replication

   [mysqld]
   ...
   master_info_repository = TABLE
   relay_log_info_repository = TABLE
   ...

Before MySQL 5.6.6, the default engine for slave_master_info and slave_relay_log_info was MyISAM. For replication to be transactional, you need to change the engine to a transactional one, typically InnoDB, using ALTER TABLE:

   slave> ALTER TABLE mysql.slave_master_info ENGINE = InnoDB;
   slave> ALTER TABLE mysql.slave_relay_log_info ENGINE = InnoDB;

Details of transactional replication

Two tables in the mysql database preserve the information needed for transactional replication: slave_master_info, corresponding to the file master.info, and slave_relay_log_info, corresponding to the file relay-log.info. Just like the master.info file, the slave_master_info table stores information about the connection to the master. Table 8-1 shows each field of the table, along with the line in the master.info file and the column in the SHOW SLAVE STATUS output it corresponds to.

Table 8-1. Fields for slave_master_info

   Field                    Line in file   Slave status column
   Number_of_lines          1
   Master_log_name          2              Master_Log_File
   Master_log_pos           3              Read_Master_Log_Pos
   Host                     4              Master_Host
   User_name                5              Master_User
   User_password            6
   Port                     7              Master_Port
   Connect_retry            8              Connect_Retry
   Enabled_ssl              9              Master_SSL_Allowed
   Ssl_ca                   10             Master_SSL_CA_File
   Ssl_capath               11             Master_SSL_CA_Path
   Ssl_cert                 12             Master_SSL_Cert
   Ssl_cipher               13             Master_SSL_Cipher
   Ssl_key                  14             Master_SSL_Key
   Ssl_verify_server_cert   15             Master_SSL_Verify_Server_Cert
   Heartbeat                16
   Bind                     17             Master_Bind
   Ignored_server_ids       18             Replicate_Ignore_Server_Ids
   Uuid                     19             Master_UUID
   Retry_count              20             Master_Retry_Count
   Ssl_crl                  21             Master_SSL_Crl
   Ssl_crlpath              22             Master_SSL_Crlpath
   Enabled_auto_position    23             Auto_Position

Similarly, Table 8-2 shows the definition of the slave_relay_log_info table, corresponding to the relay-log.info file.

Table 8-2.
Fields of slave_relay_log_info

   Field               Line in file   Slave status column
   Number_of_lines     1
   Relay_log_name      2              Relay_Log_File
   Relay_log_pos       3              Relay_Log_Pos
   Master_log_name     4              Relay_Master_Log_File
   Master_log_pos      5              Exec_Master_Log_Pos
   Sql_delay           6              SQL_Delay
   Number_of_workers   7
   Id                  8

Now, suppose that the following transaction was executed on the master:

   START TRANSACTION;
   UPDATE titles, employees
      SET titles.title = 'Dictator-for-Life'
    WHERE first_name = 'Calvin' AND last_name IS NULL;
   UPDATE salaries
      SET salaries.salary = 1000000
    WHERE first_name = 'Calvin' AND last_name IS NULL;
   COMMIT;

When the transaction reaches the slave and is executed there, it behaves as if it were executed the following way (where Exec_Master_Log_Pos, Relay_Master_Log_File, Relay_Log_File, and Relay_Log_Pos are taken from the SHOW SLAVE STATUS output):

   START TRANSACTION;
   UPDATE titles, employees
      SET titles.title = 'Dictator-for-Life'
    WHERE first_name = 'Calvin' AND last_name IS NULL;
   UPDATE salaries
      SET salaries.salary = 1000000
    WHERE first_name = 'Calvin' AND last_name IS NULL;
   SET @@SESSION.SQL_LOG_BIN = 0;
   UPDATE mysql.slave_relay_log_info
      SET Master_log_pos = Exec_Master_Log_Pos,
          Master_log_name = Relay_Master_Log_File,
          Relay_log_name = Relay_Log_File,
          Relay_log_pos = Relay_Log_Pos;
   SET @@SESSION.SQL_LOG_BIN = 1;
   COMMIT;

Note that the added "statement" is not logged to the binary log on the slave, because binary logging is temporarily disabled while the "statement" is executed. If both slave_relay_log_info and the tables being updated are placed in the same engine, this will be committed as a unit.

The result is that slave_relay_log_info is updated with each transaction executed on the slave. Note, however, that slave_master_info does not contain information that is critical for ensuring that transactional replication works: the only fields that are updated are the positions of events fetched from the master.
On a crash, the slave will pick up from the last executed position, not from the last fetched position, so this information is interesting only in the event that the master crashes. In that case, the events already in the relay log can be executed to avoid losing more events than necessary.

Similar to flushing to disk, committing to tables is expensive. Because the slave_master_info table does not contain any information that is critical for ensuring transactional replication, avoiding unnecessary commits to this table improves performance. For this reason, the sync_master_info option was introduced. The option holds an integer telling how often the replication information should be committed to slave_master_info (or flushed to disk, in the event that the information is stored in the traditional files). If it is nonzero, the replication information is flushed each time the slave fetches the number of events indicated by the variable's value. If it is zero, no explicit flushing is done at all, and the operating system decides when to flush the information to disk. Note, however, that the information is also flushed to disk or committed to the table when the binary log is rotated or when the slave starts or stops. If you are using tables for storing the replication information, this means that with sync_master_info = 0, the slave_master_info table is updated only when the slave starts or stops or the binary log is rotated, so changes to the fetched position are not visible to other threads. If it is critical for your application to be able to view this information, you need to set sync_master_info to a nonzero value.

Rules for Protecting Nontransactional Statements

Statements executed outside of transactions cannot be tracked and protected from re-execution after a crash. The problem is comparable on masters and slaves.
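The re-execution hazard can be sketched in a few lines. The simulation below is purely illustrative (the "table" is just a Python list, and insert_rows and DuplicateKeyError are made-up names): a crash interrupts a multirow INSERT against a nontransactional table, leaving a partial result, and on restart the whole statement is re-executed. Without a primary key the rows are silently duplicated; with one, the re-execution raises a duplicate-key error that stops the slave so the inconsistency can be inspected:

```python
class DuplicateKeyError(Exception):
    pass

def insert_rows(table, rows, primary_key=None, crash_after=None):
    """Apply a multirow insert; optionally 'crash' partway through."""
    for i, row in enumerate(rows):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash mid-statement")
        if primary_key is not None and any(
                r[primary_key] == row[primary_key] for r in table):
            raise DuplicateKeyError(row[primary_key])
        table.append(row)

rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Without a primary key: partial apply, then silent duplication on re-execution.
no_pk = []
try:
    insert_rows(no_pk, rows, crash_after=2)   # crash after two rows applied
except RuntimeError:
    pass
insert_rows(no_pk, rows)                      # re-execution duplicates rows

# With a primary key: the re-execution is detected and replication stops.
with_pk = [{"id": 1}, {"id": 2}]              # partial result after the crash
try:
    insert_rows(with_pk, rows, primary_key="id")
    stopped = False
except DuplicateKeyError:
    stopped = True                            # slave stops with an error
```

This is the motivation for the rules that follow: you cannot prevent the partial apply, but you can make the subsequent re-execution fail loudly instead of corrupting data quietly.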
If a statement against a MyISAM table is interrupted by a crash on the master, the statement is not logged at all, because logging is done after the statement has completed. Upon restart (and successful repair), the MyISAM table will contain a partial update, but the binary log will not have logged the statement at all. The situation is similar on the slave: if a crash occurs in the middle of the execution of a statement (or of a transaction that modifies a nontransactional table), the changes might remain in the table, but the group position will not be advanced. The nontransactional statement will be re-executed when the slave starts up replication again.

It is not possible to automatically catch problems with crashes in the middle of updating a nontransactional table, but by obeying a few rules, you can at least ensure you receive an error when this situation occurs.

INSERT statements
    To handle these statements, you need a primary key on the tables that you replicate. That way, an INSERT that is re-executed will generate a duplicate key error and stop the slave so that you can check why the master and the slave are not consistent.

DELETE statements
    To handle these, stay away from LIMIT clauses. Without a LIMIT clause, the statement will just delete the same rows again (i.e., the rows that match the WHERE clause), which is fine because it will either pick up where the previous statement left off or do nothing if all the specified rows are already deleted. If the statement has a LIMIT clause, however, only a subset of the rows matching the WHERE condition will be deleted, so when the statement is executed again, a different set of rows will be deleted.

UPDATE statements
    These are the most problematic statements.
To be safe, either the statement has to be idempotent—executing it twice should lead to the same result—or the occasional double execution of the statement should be acceptable, which could be the case if the UPDATE statement is just maintaining statistics over, say, page accesses.

Multisource Replication

As you may have noticed, it is not possible to have a slave connect to multiple masters and receive changes from all of them. This topology is called multisource, and it should not be confused with the multimaster topology introduced in Chapter 6. In a multisource topology, changes are received from several masters, whereas in a multimaster topology, the servers form a group that acts as a single master by replicating changes from each master to all the other masters.

There have been plans for introducing multisource replication into MySQL for a long time, but one issue stands in the way of the design: what to do with conflicting updates. These can occur either because different sources make truly conflicting changes, or because two intermediate relays forward a change made at a common master. Figure 8-10 illustrates both types of conflicts. In the first, two masters (sources) make changes to the same data and the slave cannot tell which is the final change. In the second, only a single change is made, but the slave cannot distinguish between events coming from the two relays, so an event sent from the master will be seen as two different events when it arrives at the slave.

The diamond configuration does not have to be set up explicitly: it can occur inadvertently as a result of switching from one relay to another if the replication stream overlaps during a switchover.
For this reason, it is important to ensure that all events in queue—on the slave and on all the relays between the master and the slave—have been replicated to the slave before switching over to another master.

You can avoid conflicts by making sure you handle switchovers correctly and—in the case of multiple data sources—by ensuring that updates never have a chance of conflicting. The typical way to accomplish this is to have each source update a different database, but it is also possible to assign updates of different rows in the same table to different servers.

Although MySQL does not currently let you replicate from several sources simultaneously, you can come close by switching a slave among several masters, replicating periodically from each of them in turn. This is called round-robin multisource replication. It can be useful for certain types of applications, such as when you're aggregating data from different sources for reporting purposes. In these cases, you can separate the data naturally by storing the writes from each master in its own database, table, or partition. There is no risk of conflict, so multisource replication is workable.

Figure 8-10. True multisource and a diamond configuration

Figure 8-11 shows a slave that replicates from three masters in a round-robin fashion, with a dedicated client handling the switches between the masters. The process for round-robin multisource replication is as follows:

1. Set the slave up to replicate from one master. We'll call this the current master.
2. Let the slave replicate for a fixed period of time. The slave will read changes from the current master and apply them while the client responsible for handling the switching just sleeps.
3. Stop the I/O thread of the slave using STOP SLAVE IO_THREAD.
4. Wait until the relay log is empty.
5. Stop the SQL thread using STOP SLAVE SQL_THREAD. CHANGE MASTER requires that you stop both threads.
6.
Save the slave position for the current master by saving the values of the Exec_Master_Log_Pos and Relay_Master_Log_File columns from the SHOW SLAVE STATUS output.
7. Change the slave to replicate from the next master in sequence by taking the previously saved position for that master and using CHANGE MASTER to set up replication.
8. Restart the slave threads using START SLAVE.
9. Repeat the sequence starting from step 2.

Figure 8-11. Round-robin multisource replication using a client to switch

Note that in steps 3 through 5, we stop first the I/O thread and then the SQL thread. The reason for doing this, instead of just stopping replication on the slave, is that the SQL thread can be (and usually is) lagging behind, so if we simply stopped both threads, there would be a bunch of outstanding events in the relay log that would just be thrown away. If you are more concerned about executing only, say, one minute's worth of transactions from each master and don't care about throwing away those additional events, you can simply stop replication instead of performing steps 3 through 5. The procedure will still work correctly, because the events that were thrown away will be refetched from the master in the next round.

This can, of course, be automated using a separate client connection and the MySQL Replicant library, as shown in Example 8-24. By using the cycle function from the itertools module, you can repeatedly read from a list of masters in turn.

Example 8-24.
Round-robin multisource replication in Python

    import itertools
    from time import sleep

    position = {}

    def round_robin_multi_master(slave, masters):
        current = masters[0]
        for master in itertools.cycle(masters):
            slave.sql("STOP SLAVE IO_THREAD")
            slave_wait_for_empty_relay_log(slave)
            slave.sql("STOP SLAVE SQL_THREAD")
            position[current.name] = fetch_slave_position(slave)
            # Use the next master's previously saved position
            # (None the first time that master is visited).
            slave.change_master(position.get(master.name))
            slave.sql("START SLAVE")   # restart the slave threads
            current = master
            sleep(60)                  # replicate for 1 minute

Details of Row-Based Replication

"Row-Based Replication" on page 97 left out one major subject concerning row-based replication: how the rows are executed on the slave. In this section, you will see the details of how row-based replication is implemented on the slave side.

In statement-based replication, each statement is handled by writing it to the binary log in a single Query event. Because a significant number of rows can be changed in each statement, however, row-based replication handles this differently and therefore requires multiple events for each statement. To handle row-based replication, four new events have been introduced:

Table_map
    This maps a table ID to a table name (including the database name) and provides some basic information about the columns of the table on the master. The table information does not include the names of the columns, just their types. This is because row-based replication is positional: each column on the master goes into the same position in the table on the slave.

Write_rows, Delete_rows, and Update_rows
    These events are generated whenever rows are inserted, deleted, or updated, respectively. This means that a single statement can generate multiple events. In addition to the rows, each event contains a table ID that refers to a table ID introduced by a preceding Table_map event, plus one or two column bitmaps specifying the columns of the table affected by the event.
This allows the log to save space by including only those columns that have changed or that are necessary to locate the correct row to insert, delete, or update.

Whenever a statement is executed, it is written into the binary log as a sequence of Table_map events, followed by a sequence of row events. The last row event of the statement is marked with a special flag indicating that it is the last event of the statement.

Example 8-25 shows the execution of a statement and the resulting events. We have skipped the format description event here, because you have already seen it.

Example 8-25. Execution of an INSERT statement and the resulting events

    master> START TRANSACTION;
    Query OK, 0 rows affected (0.00 sec)

    master> INSERT INTO t1 VALUES (1),(2),(3),(4);
    Query OK, 4 rows affected (0.01 sec)
    Records: 4  Duplicates: 0  Warnings: 0

    master> INSERT INTO t1 VALUES (5),(6),(7),(8);
    Query OK, 4 rows affected (0.01 sec)
    Records: 4  Duplicates: 0  Warnings: 0

    master> COMMIT;
    Query OK, 0 rows affected (0.00 sec)

    master> SHOW BINLOG EVENTS IN 'master-bin.000054' FROM 106\G
    *************************** 1. row ***************************
       Log_name: master-bin.000054
            Pos: 106
     Event_type: Query
      Server_id: 1
    End_log_pos: 174
           Info: BEGIN
    *************************** 2. row ***************************
       Log_name: master-bin.000054
            Pos: 174
     Event_type: Table_map
      Server_id: 1
    End_log_pos: 215
           Info: table_id: 18 (test.t1)
    *************************** 3. row ***************************
       Log_name: master-bin.000054
            Pos: 215
     Event_type: Write_rows
      Server_id: 1
    End_log_pos: 264
           Info: table_id: 18 flags: STMT_END_F
    *************************** 4. row ***************************
       Log_name: master-bin.000054
            Pos: 264
     Event_type: Table_map
      Server_id: 1
    End_log_pos: 305
           Info: table_id: 18 (test.t1)
    *************************** 5.
row ***************************
       Log_name: master-bin.000054
            Pos: 305
     Event_type: Write_rows
      Server_id: 1
    End_log_pos: 354
           Info: table_id: 18 flags: STMT_END_F
    *************************** 6. row ***************************
       Log_name: master-bin.000054
            Pos: 354
     Event_type: Xid
      Server_id: 1
    End_log_pos: 381
           Info: COMMIT /* xid=23 */
    6 rows in set (0.00 sec)

This example adds two statements to the binary log. Each statement starts with a Table_map event, followed by a single Write_rows event holding the four rows of that statement. You can see that each statement is terminated by setting the statement-end flag of the row event. Because the statements are inside a transaction, they are also wrapped with Query events containing BEGIN and COMMIT statements.

The size of the row events is controlled by the option binlog-row-event-max-size, which gives a threshold for the number of bytes in the binary log. The option does not give a maximum size for a row event: it is possible to have a binlog row event that is larger if a single row contains more bytes than binlog-row-event-max-size.

Table_map Events

As already mentioned, the Table_map event maps a table name to an identifier so that the identifier can be used in the row events, but that is not its only role. In addition, it contains some basic information about the fields of the table on the master. This allows the slave to check the basic structure of the table on the slave and compare it to the structure on the master to make sure they match well enough for replication to proceed.

The basic structure of the table map event is shown in Figure 8-12. The common header—the header that all replication events have—contains the basic information about the event. After the common header, the post header gives information that is special to the table map event. Most of the fields in Figure 8-12 are self-explanatory, but the representation of the field types deserves a closer look.

Figure 8-12.
Table map event structure

The following fields together represent the column types:

Column type array
    An array listing the base type for each column. It indicates whether the column is an integer, a string type, a decimal type, or any of the other available types, but it does not give the parameters for the column type. For example, if the type of a column is CHAR(5), this array will contain 254 (the constant representing a string), but the length of the string (in this case, 5) is stored in the column metadata mentioned next.

Null bit array
    An array of bits indicating whether each field can be NULL.

Column metadata
    An array of metadata for the fields, fleshing out the details left out of the column type array. The metadata available for each field depends on the type of the field. For example, a DECIMAL field stores its precision and decimals in the metadata, whereas a VARCHAR type stores the maximum length of the field.

By combining the data in these three arrays, it is possible to deduce the type of each field. Not all type information is stored in the arrays, however, so in two particular cases it is not possible for the master and the slave to distinguish between two types:

• There is no information about whether an integer field is signed or unsigned. This means the slave will be unable to distinguish between a signed and an unsigned field when checking the tables.

• The character sets of string types are not part of the information. This means that replicating between different character sets is not supported and may lead to strange results, because the bytes will just be inserted into the column with no checking or conversion.

The Structure of Row Events

Figure 8-13 shows the structure of a row event. The structure can vary a little depending on the type of event (write, delete, or update).

Figure 8-13.
Row event header

In addition to the table identifier, which refers to the table ID of a previous table map event, the row event header contains the following fields:

Table width
    The width of the table on the master. This width is length-encoded in the same way as in the client protocol, which is why it can be either one or two bytes. Most of the time, it will be one byte.

Columns bitmap
    The columns that are sent as part of the payload of the event. This information allows the master to send a selected set of fields with each row. There are two types of column bitmaps: one for the before image and one for the after image. The before image is needed for deletions and updates, whereas the after image is needed for writes (inserts) and updates. See Table 8-3 for more information.

Table 8-3. Row events and their images

    Event         Before image                  After image
    Write rows    None                          Row to insert
    Delete rows   Row to delete                 None
    Update rows   Column values before update   Column values after update

Execution of Row Events

Because multiple events can represent a single statement executed by the master, the slave has to keep state information to execute the row events correctly in the presence of concurrent threads that update the same tables. Recall that each statement in the binary log starts with one or more table map events, followed by one or more row events, each of the same type. The slave uses the following procedure to process a statement from the binary log:

1. Each event is read from the relay log.
2. If the event is a table map event, the SQL thread extracts the information about the table and saves a representation of how the master defines the table.
3. When the first row event is seen, all tables in the list are locked.
4. For each table in the list, the thread checks that the definition on the master is compatible with the definition on the slave.
5. If the tables are not compatible, the thread reports an error and stops replication on the slave.
6.
Row events are processed according to the procedure shown later in this section, until the thread reads the last event of the statement (i.e., an event with the statement-end flag set).

This procedure is required to lock the tables the correct way on the slave, and it is similar to how the statement was executed on the master. All tables are locked in step 3 and then checked in step 4. If the tables were not locked before their definitions are checked, a thread on the slave could come in between the steps and change a definition, causing the application of the row events to fail later.

Each row event consists of a set of rows that are used differently depending on the event type. For Delete_rows and Write_rows events, each row represents a change. For the Update_rows event, it is necessary to have two rows—one to locate the correct row to update and one with the values to use for the update—so the event consists of an even number of rows, where each pair represents an update.

Events that have a before image require a search to locate the correct row to operate on: for a Delete_rows event, the row will be removed, whereas for an Update_rows event, it will be changed. In descending order of preference, the search methods are:

Primary key lookup
    If the table on the slave has a primary key, it is used to perform a primary key lookup. This is the fastest of all the methods.

Index scan
    If there is no primary key defined for the table but an index is defined, the index will be used to locate the correct row to change. All rows in the index will be scanned and their columns compared with the row received from the master. If a row matches, it will be used for the Delete_rows or Update_rows operation. If no rows match, the slave will stop replication with an error indicating that it could not locate the correct row.

Table scan
    If there is no primary key or index on the table, a full table scan is used to locate the correct row to delete or update.
In the same way as for the index scan, each row in the scan is compared with the row received from the master, and if they match, that row is used for the delete or update operation.

Because it is the index or primary key on the slave, rather than on the master, that is used to locate the correct row to delete or update, you should keep a couple of things in mind:

• If the table has a primary key on the slave, the lookup will be fast. If the table does not have a primary key, the slave has to do either a full table scan or an index scan to find the correct row to update, which is slower.

• You can have different indexes on the master and the slave.

When replicating a table, it is always wise to have a primary key on the table, regardless of whether row-based or statement-based replication is used. Because statement-based replication actually executes each statement, a primary key speeds up updates and deletes significantly for statement-based replication as well.

Events and Triggers

The execution of events and triggers differs between statement-based replication and row-based replication. For events, the only difference is that row-based replication generates row events instead of query events. Triggers, on the other hand, reveal a different and more interesting story.

As discussed in Chapter 4, for statement-based replication, trigger definitions are replicated to the slave so that when a statement is executed that affects a table with a trigger, the trigger will be executed on the slave as well.

For row-based replication, it doesn't matter how the rows change—whether the changes come from a trigger, a stored procedure, an event, or directly from the statement. Because the rows updated by the trigger are replicated to the slave, the trigger does not need to be executed on the slave. As a matter of fact, executing it on the slave would lead to incorrect results. Consider Example 8-26, which defines a table with a trigger.
Example 8-26. Definition of a table and triggers

    CREATE TABLE log (
        number INT AUTO_INCREMENT PRIMARY KEY,
        user CHAR(64),
        brief TEXT
    );

    CREATE TABLE user (
        id INT AUTO_INCREMENT PRIMARY KEY,
        email CHAR(64),
        password CHAR(64)
    );

    CREATE TRIGGER tr_update_user AFTER UPDATE ON user FOR EACH ROW
        INSERT INTO log SET
            user = NEW.email,
            brief = CONCAT("Changed password from '", OLD.password,
                           "' to '", NEW.password, "'");

    CREATE TRIGGER tr_insert_user AFTER INSERT ON user FOR EACH ROW
        INSERT INTO log SET
            user = NEW.email,
            brief = CONCAT("User '", NEW.email, "' added");

Given these table and trigger definitions, this sequence of statements can be executed:

    master> INSERT INTO user(email,password) VALUES ('mats@example.com', 'xyzzy');
    Query OK, 1 row affected (0.05 sec)

    master> UPDATE user SET password = 'secret' WHERE email = 'mats@example.com';
    Query OK, 1 row affected (0.01 sec)
    Rows matched: 1  Changed: 1  Warnings: 0

    master> SELECT * FROM log;
    +--------+------------------+-------------------------------------------+
    | number | user             | brief                                     |
    +--------+------------------+-------------------------------------------+
    |      1 | mats@example.com | User 'mats@example.com' added             |
    |      2 | mats@example.com | Changed password from 'xyzzy' to 'secret' |
    +--------+------------------+-------------------------------------------+
    2 rows in set (0.00 sec)

This is, of course, not very secure, but at least it illustrates the situation. So, how do these changes appear in the binary log when using row-based replication?
    master> SHOW BINLOG EVENTS IN 'mysqld1-bin.000054' FROM 2180;
    +---------------+------+-------------+-----------+-------------+--------------------------------+
    | Log_name      | Pos  | Event_type  | Server_id | End_log_pos | Info                           |
    +---------------+------+-------------+-----------+-------------+--------------------------------+
    | master-bin…54 | 2180 | Query       |         1 |        2248 | BEGIN                          |
    | master-bin…54 | 2248 | Table_map   |         1 |        2297 | table_id: 24 (test.user)       |
    | master-bin…54 | 2297 | Table_map   |         1 |        2344 | table_id: 26 (test.log)        |
    | master-bin…54 | 2344 | Write_rows  |         1 |        2397 | table_id: 24                   |
    | master-bin…54 | 2397 | Write_rows  |         1 |        2471 | table_id: 26 flags: STMT_END_F |
    | master-bin…54 | 2471 | Query       |         1 |        2540 | COMMIT                         |
    | master-bin…54 | 2540 | Query       |         1 |        2608 | BEGIN                          |
    | master-bin…54 | 2608 | Table_map   |         1 |        2657 | table_id: 24 (test.user)       |
    | master-bin…54 | 2657 | Table_map   |         1 |        2704 | table_id: 26 (test.log)        |
    | master-bin…54 | 2704 | Update_rows |         1 |        2783 | table_id: 24                   |
    | master-bin…54 | 2783 | Write_rows  |         1 |        2873 | table_id: 26 flags: STMT_END_F |
    | master-bin…54 | 2873 | Query       |         1 |        2942 | COMMIT                         |
    +---------------+------+-------------+-----------+-------------+--------------------------------+
    12 rows in set (0.00 sec)

As you can see, each statement is treated as a separate transaction containing only a single statement. Each statement changes two tables—the test.user and test.log tables—so there are two table maps at the beginning of each statement in the binary log. When the events are replicated to the slave, they are executed directly, and the execution goes "below the trigger radar," thereby avoiding execution of the triggers for those tables on the slave.

Filtering in Row-Based Replication

Filtering also works differently in statement-based and row-based replication. Recall from Chapter 4 that statement-based replication filtering is done on the entire statement (i.e., either all of the statement is executed or none of it is), because it is not possible to execute just part of a statement.
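The difference in how the two formats decide what to filter can be sketched in a few lines. The helper functions below are hypothetical names, not MySQL code; they only illustrate the rule elaborated next, namely that statement-based filtering consults the session's current database while row-based filtering consults the database of the table actually being changed:

```python
IGNORE_DBS = {"ignore_me"}

def sbr_filtered(current_db, ignore_dbs=IGNORE_DBS):
    # Statement-based: the whole statement is kept or dropped based on
    # the *current* database, not the tables it touches.
    return current_db in ignore_dbs

def rbr_filtered(table_db, ignore_dbs=IGNORE_DBS):
    # Row-based: each changed table is checked against its *own* database.
    return table_db in ignore_dbs

# USE test; INSERT INTO ignore_me.t1 VALUES (1),(2);
sbr_drops = sbr_filtered("test")         # the statement is replicated
rbr_drops = rbr_filtered("ignore_me")    # the rows are filtered out
```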
For the database filtering options, the current database is used, not the database of the table being changed. Row-based replication offers more choice: because each row for a specific table is caught and replicated, it is possible to filter on the actual table being updated, and even to filter out some rows based on arbitrary conditions. For this reason, row-based replication filters changes based on the actual table updated, not on the current database of the statement.

Consider what happens with filtering on a slave set up to ignore the ignore_me database. What will be the result of executing the following statement under statement-based and row-based replication?

    USE test;
    INSERT INTO ignore_me.t1 VALUES (1),(2);

With statement-based replication, the statement will be executed, but with row-based replication, the changes to table t1 will be ignored, because the ignore_me database is on the ignore list. Continuing on this path, what will happen with the following multitable update statement?

    USE test;
    UPDATE ignore_me.t1, test.t2 SET t1.a = 3, t2.a = 4 WHERE t1.a = t2.a;

With statement-based replication, the statement will be executed, expecting the table ignore_me.t1 to exist—which it might not, because the database is ignored—and will update both the ignore_me.t1 and test.t2 tables. Row-based replication, on the other hand, will update only the test.t2 table.

Partial Execution of Statements

As already noted, statement-based replication works pretty well unless you have to account for failures, crashes, and nondeterministic behavior. Because you can count on a failure or crash to occur at the worst possible moment, this will almost always lead to partially executed statements. The same situation occurs when the number of rows affected by an UPDATE, DELETE, or INSERT statement is artificially limited.
This may happen explicitly through a LIMIT clause, or because the table is nontransactional and, say, a duplicate key error aborts execution and causes the statement to be only partially applied to the table. In such cases, the changes the statement describes are applied only to an initial set of rows. The master and the slave can have different opinions of how the rows are ordered, which can result in the statement being applied to different sets of rows on the master and the slave.

MyISAM maintains all the rows in the order in which they were inserted. That may give you confidence that the same rows will be affected in the case of partial changes. Unfortunately, however, that is not the case: if the slave has been cloned from the master using a logical backup, or restored from a backup, it is possible that the insertion order changed. Normally, you can solve this problem by adding an ORDER BY clause, but even that does not leave you entirely safe, because you are still in danger of having the statement partially executed because of a crash.

Partial Row Replication

As mentioned earlier, the Write_rows, Delete_rows, and Update_rows events each contain a column bitmap that tells which columns are present in the rows in the body of the event (one bitmap for the before image and one for the after image). Prior to MySQL 5.6.2, only the MySQL Cluster engine used the option of limiting the columns written to the log, but starting with MySQL 5.6.2, it is possible to control which columns are written to the log using the option binlog-row-image. The option accepts three different values: full, noblob, and minimal.

full
    This is the default for binlog-row-image and replicates all columns. Prior to MySQL 5.6.2, this is how rows were always logged.

noblob
    With this setting, blobs are omitted from the row unless they change as part of the update.
minimal
    With this setting, only the primary key (in the before image) and the columns that change values (in the after image) are written to the binary log.

The reason for having full as the default is that there might be different indexes on the master and the slave, and columns that are not part of the primary key on the master might be needed to find the correct row on the slave. In Example 8-27, the tables on the master and slave are defined differently, but the only difference is that they have different indexes. The rationale for this difference could be that on the master the id column needs to be a primary key for autoincrement to work, but on the slave all selects are done using the email column. In this case, setting binlog-row-image to minimal will store the values of the id column in the binary log, but this column cannot be used to find the correct row on the slave, which will cause replication to fail. Because replication is expected to work even when this mistake is made, the default for binlog-row-image is full.

If you are using identical indexes on the master and slave (or at least have indexes on the slave on the columns that are indexed on the master), you can set binlog-row-image to minimal and save space by reducing the size of the binary log.

So what is the role of the noblob value, then? It acts as a middle ground. Even though it is possible to have different indexes on the master and slave, it is very rare for blobs to be part of an index. Because blobs usually take up a lot of space, using noblob is almost as safe as full, under the assumption that blobs are never indexed.

Example 8-27.
Table with different indexes on master and slave

    /* Table definition on the master */
    CREATE TABLE user (
        id INT AUTO_INCREMENT PRIMARY KEY,
        email CHAR(64),
        password CHAR(64)
    );

    /* Table definition on the slave */
    CREATE TABLE user (
        id INT,
        email CHAR(64) PRIMARY KEY,
        password CHAR(64)
    );

Conclusion

This chapter concludes a series of chapters about MySQL replication. We discussed advanced replication topics such as how to promote slaves to masters more robustly, looked at tips and techniques for avoiding corrupted databases after a crash, examined multisource replication configurations and considerations, and finally looked at row-based replication in detail.

In the next chapters, we examine another set of topics for building robust data centers, including monitoring, performance tuning of storage engines, and replication.

Joel met his boss in the hallway on his way back from lunch.

"Hello, Mr. Summerson."

"Hello, Joel."

"Have you read my report?"

"Yes, I have, Joel. Good work. I've passed it around to some of the other departments for comment. I want to add it to our SOP manual." Joel imagined SOP meant standard operating procedures. "I've asked the reviewers to send you their comments. It might need some wordsmithing to fit into an SOP, but I know you're up to the task."

"Thank you, sir."

Mr. Summerson nodded, patted Joel on the shoulder, and continued on his way down the hall.

CHAPTER 9
MySQL Cluster

A subdued knock on his door alerted Joel to his visitor. He looked up to see a worried-looking Mr. Summerson.

"I've got to dump on you this time, Joel. We're in a real bind here." Joel remained silent, wondering what his definition of "dump on you" meant. So far, he had tasked Joel with some pretty intense work. "We've just learned of a new customer who wants to use our latest database application in a real-time, five-nines environment."

"Always up and no downtime?"

"That's right.
Now, I know MySQL is very reliable, but there’s no time to change the application to use a fault-tolerant database server.”
Joel remembered skimming a chapter on a special version of MySQL and wondered if that would work. He decided to take a chance: “We could use the cluster technology.”
“Cluster?”
“Yes, MySQL has a cluster version that is a fault-tolerant database system. It has worked in some pretty demanding environments, like telecom, as I recall....”
Mr. Summerson’s eyes brightened and he appeared to stand a little straighter as he delivered his coup de grâce. “Perfect. Give me a report by tomorrow morning. I want cost, hardware requirements, limitations—the works. Don’t pull any punches. If we can get this to work, I want to do it, but I don’t want to risk our reputation on a hunch.”
“I’ll get right on it,” Joel said, wondering what he had gotten himself into this time.
After Mr. Summerson left, he sighed and opened his favorite MySQL book. “This may be my greatest challenge yet,” he said.

When high performance, high availability, redundancy, and scalability are paramount concerns for database planners, they often seek to improve their replication topologies with commodity high-availability hardware and load-balancing solutions. Although this approach meets the needs of most organizations, if you need a solution with no single points of failure and extremely high throughput with 99.999% uptime, chances are the MySQL Cluster technology will meet your needs.

This chapter introduces the concepts of the MySQL Cluster technology, provides an example of starting and stopping a simple cluster, and discusses the key points of using MySQL Cluster, including high availability, distributed data, and data replication. We begin by describing what MySQL Cluster is and how it differs from a normal MySQL server.

What Is MySQL Cluster?
MySQL Cluster is a shared-nothing storage solution with a distributed node architecture designed for fault tolerance and high performance. Data is stored and replicated on individual data nodes, where each data node runs on a separate server and maintains a copy of the data. Each cluster also contains management nodes. Updates use read-committed isolation to ensure all nodes have consistent data and a two-phase commit to ensure the nodes have the same data (if any one write fails, the update fails).

The original implementation of MySQL Cluster stored all information in main memory with no persistent storage. Later releases of MySQL Cluster permit storage of the data on disk. Perhaps the best quality of MySQL Cluster is that it uses the MySQL server as the query engine via the storage engine layer. Thus, you can migrate applications designed to interact with MySQL to MySQL Cluster transparently.

The shared-nothing, peer node concept permits an update executed on one server to become visible immediately on the other servers. The transmission of the updates uses a sophisticated communication mechanism designed for very high throughput across networks. The goal is to have the highest performance possible by using multiple MySQL servers to distribute the load, and high availability and redundancy by storing data in different locations.

Terminology and Components

Typical installations of MySQL Cluster involve installing the components of the cluster on different machines on a network. Hence, MySQL Cluster is also known as a network database (NDB). When we use the term “MySQL Cluster,” we refer to the MySQL server plus the NDB components. However, when we use “NDB” or “NDB Cluster” we refer specifically to the cluster components. MySQL Cluster is a database system that uses the MySQL server as the frontend to support standard SQL queries.
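The shared-nothing placement of rows and replicas described above can be pictured with a small sketch. This is purely illustrative: the node names, the CRC-based hash, and the consecutive-replica placement are assumptions invented for the example, not MySQL Cluster's actual distribution algorithm (which hashes on the primary key and works in terms of node groups, as discussed later in this chapter).

```python
# Illustrative sketch of shared-nothing row distribution: each row is
# assigned to a data node by hashing its primary key, and every row is
# stored on NoOfReplicas nodes. Not MySQL Cluster internals.
import zlib

def nodes_for_row(primary_key, data_nodes, no_of_replicas=2):
    h = zlib.crc32(repr(primary_key).encode())
    first = h % len(data_nodes)
    # place the remaining replicas on the following nodes (wrapping around)
    return [data_nodes[(first + i) % len(data_nodes)]
            for i in range(no_of_replicas)]

nodes = ["ndbd1", "ndbd2", "ndbd3", "ndbd4"]   # hypothetical data nodes
print(nodes_for_row(42, nodes))                # two nodes hold a copy of row 42
```

Because the hash is deterministic, every SQL node computes the same placement for a given key, which is what lets any server in a shared-nothing cluster locate a row without a central directory.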
A storage engine named NDBCluster is the interface that links the MySQL server with the cluster technology. This relationship is often confused. You cannot use the NDBCluster storage engine without the NDBCluster components. However, it is possible to use the NDB Cluster technologies without the MySQL server, but this requires lower-level programming with the NDB API.

The NDB API is object-oriented and implements indexes, scans, transactions, and event handling. This allows you to write applications that retrieve, store, and manipulate data in the cluster. The NDB API also provides object-oriented error-handling facilities to allow orderly shutdown or recovery during failures. If you are a developer and want to learn more about the NDB API, see the MySQL NDB API online documentation.

How Does MySQL Cluster Differ from MySQL?

You may be wondering, “What is the difference between a cluster and replication?” There are several definitions of clustering, but it can generally be viewed as something that has membership, messaging, redundancy, and automatic failover capabilities. Replication, in contrast, is simply a way to send messages (data) from one server to another. We discuss replication within a cluster (also called local replication) and MySQL replication in more detail later in this chapter.

Typical Configuration

You can view the MySQL Cluster as having three layers:

• Applications that communicate with the MySQL server
• The MySQL server that processes the SQL commands and communicates to the NDB storage engine
• The NDB Cluster components (i.e., the data nodes) that process the queries and return the results to the MySQL server

You can scale up each layer independently with more server processes to increase performance. Figure 9-1 shows a conceptual drawing of a typical cluster installation.
The applications connect to the MySQL server, which accesses the NDB Cluster components via the storage engine layer (specifically, the NDB storage engine). We will discuss the NDB Cluster components in more detail momentarily.

Figure 9-1. MySQL Cluster

There are many possible configurations. You can use multiple MySQL servers to connect to a single NDB Cluster and even connect multiple NDB Clusters via MySQL replication. We will discuss more of these configurations in later sections.

Features of MySQL Cluster

To satisfy the goals of having the highest achievable performance, high availability, and redundancy, data is replicated inside the cluster among the peer data nodes. The data is stored on multiple data nodes and is replicated synchronously, each data node connecting to every other data node.

It is also possible to replicate data between clusters, but in this case, you use MySQL replication, which is asynchronous rather than synchronous. As we’ve discussed in previous chapters, asynchronous replication means you must expect a delay in updating the slaves; slaves do not report back the progress in committing changes, and you cannot expect a consistent view across all servers in the replicated architecture like you can expect within a single MySQL cluster.

MySQL Cluster has several specialized features for creating a highly available system. The most significant ones are:

Node recovery
Data node failures can be detected via either communication loss or heartbeat failure, and you can configure the nodes to restart automatically using copies of the data from the remaining nodes. Failure and recovery can comprise single or multiple storage nodes. This is also called local recovery.

Logging
During normal data updates, copies of the data change events are written to a log stored on each data node. You can use the logs to restore the data to a point in time.
Checkpointing
The cluster supports two forms of checkpoints, local and global. Local checkpoints remove the tail of the log. Global checkpoints are created when the logs of all data nodes are flushed to disk, creating a transaction-consistent snapshot of all node data. In this way, checkpointing permits a complete system restore of all nodes from a known good synchronization point.

System recovery
In the event the whole system is shut down unexpectedly, you can restore it using checkpoints and change logs. Typically, the data is copied from disk into memory from known good synchronization points.

Hot backup and restore
You can create simultaneous backups of each data node without disturbing executing transactions. The backup includes the metadata about the objects in the database, the data itself, and the current transaction log.

No single point of failure
The architecture is designed so that any node can fail without bringing down the database system.

Failover
To ensure node recovery is possible, all transactions are committed using read-committed isolation and two-phase commits. Transactions are then doubly safe (i.e., they are stored in two separate locations before the user gets acceptance of the transaction).

Partitioning
Data is automatically partitioned across the data nodes. Starting with MySQL version 5.1, MySQL Cluster supports user-defined partitioning.

Online operations
You can perform many of the maintenance operations online, without the interruptions normally caused by operations that require stopping a server or placing locks on data. For example, it is possible to add new data nodes online, alter table structures, and even reorganize the data in the cluster.

For more information about MySQL Cluster, see the MySQL Cluster Documentation, which contains reference guides for the different versions of MySQL Cluster.
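The log-and-checkpoint recovery scheme from the feature list above can be sketched in a few lines of Python. This is a conceptual illustration only (the class and method names are invented for the example, not MySQL Cluster internals); it shows why a local checkpoint lets a node discard the tail of its log, and how recovery replays logged changes on top of the last checkpoint.

```python
# Conceptual sketch of log-based recovery, not MySQL Cluster code.
class DataNodeSketch:
    def __init__(self):
        self.data = {}          # in-memory table data
        self.checkpoint = {}    # last snapshot "flushed to disk"
        self.redo_log = []      # change events since the last checkpoint

    def update(self, key, value):
        self.redo_log.append((key, value))  # log the change event
        self.data[key] = value

    def local_checkpoint(self):
        # snapshot the data; the log tail before this point is now redundant
        self.checkpoint = dict(self.data)
        self.redo_log.clear()

    def recover(self):
        # restore the snapshot, then replay the log to reach a point in time
        self.data = dict(self.checkpoint)
        for key, value in self.redo_log:
            self.data[key] = value

node = DataNodeSketch()
node.update("row1", "a")
node.local_checkpoint()
node.update("row1", "b")     # logged but not yet checkpointed
node.data.clear()            # simulate losing the in-memory state
node.recover()
print(node.data)             # {'row1': 'b'}: the logged change survives
```

The same two structures explain system recovery: after a full shutdown, every node restores its last checkpoint and replays its change log, which is exactly the procedure described for the real cluster.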
Local and Global Redundancy

You can create local redundancy (inside a particular cluster) using a two-phase commit protocol. In principle, each node goes through a round in which it agrees to make a change, then undergoes a round in which it commits the transaction. During the agreement phase, each node ensures that there are enough resources to commit the change in the second round. In NDB Cluster, the MySQL server commit protocol changes to allow updates to multiple nodes. NDB Cluster also has an optimized version of two-phase commit that reduces the number of messages sent using synchronous replication. The two-phase protocol ensures the data is redundantly stored on multiple data nodes, a state known as local redundancy.

Global redundancy uses MySQL replication between clusters. This establishes two nodes in a replication topology. As discussed previously, MySQL replication is asynchronous because it does not include an acknowledgment or receipt for arrival or execution of the events replicated. Figure 9-2 illustrates the differences.

Figure 9-2. Local and global redundancy

Log Handling

MySQL Cluster implements two types of checkpoints: local checkpoints to purge part of the redo log, and a global checkpoint that is mainly for synchronizing between the different data nodes. The global checkpoint becomes important for replication because it forms the boundary between sets of transactions known as epochs. Each epoch is replicated between clusters as a single unit. In fact, MySQL replication treats the set of transactions between two consecutive global checkpoints as a single transaction.

Redundancy and Distributed Data

Data redundancy is based on replicas, where each replica has a copy of the data. This allows a cluster to be fault tolerant. If any data node fails, you can still access the data. Naturally, the more replicas you allow in a cluster, the more fault tolerant the cluster will be.

Split-Brain Syndrome

If one or more data nodes fail, it is possible that the remaining data nodes will be unable to communicate. When this happens, the two sets of data nodes are in a split-brain scenario. This type of situation is undesirable, because each set of data nodes could theoretically perform as a separate cluster. To overcome this, you need a network partitioning algorithm to decide between the competing sets of data nodes.

The decision is made in each set independently. The set with the minority of nodes will be restarted, and each node of that set will need to join the majority set individually. If the two sets of nodes are exactly the same size, a theoretical problem still exists. If you split four nodes into two sets with two nodes in each, how do you know which set is a minority? For this purpose, you can define an arbitrator. In the case that the sets are exactly the same size, the set that first succeeds in contacting the arbitrator wins. You can designate the arbitrator as either a MySQL server (SQL node) or a management node. For best availability, you should locate the arbitrator on a system that does not host a data node.

The network partitioning algorithm with arbitration is fully automatic in MySQL Cluster, and the minority is defined with respect to node groups, which makes the system more available than simply counting nodes would.

You can specify how many copies of the data (NoOfReplicas) exist in the cluster. You need to set up as many data nodes as you want replicas. You can also distribute the data across the data nodes using partitioning. In this case, each data node has only a portion of the data, making queries faster. But because you have multiple copies of the data, you can still query the data in the event that a node fails, and the recovery of the missing node is assured (because the data exists in the other replicas). To achieve this, you need multiple data nodes for each replica.
For example, if you want two replicas and partitioning, you need to have at least four data nodes (two data nodes for each replica).

Architecture of MySQL Cluster

MySQL Cluster is composed of one or more MySQL servers communicating via the NDB storage engine to an NDB cluster. An NDB cluster itself is composed of several components: data or storage nodes that store and retrieve the data, and one or more management nodes that coordinate startup, shutdown, and recovery of data nodes. Most of the NDB components are implemented as daemon processes, while MySQL Cluster also offers client utilities to manipulate the daemons’ features. Here is a list of the daemons and utilities (Figure 9-3 depicts how each of these components communicates):

mysqld
The MySQL server

ndbd
A data node

ndbmtd
A multithreaded data node

ndb_mgmd
The cluster’s management server

ndb_mgm
The cluster’s management client

Each MySQL server with the executable name mysqld typically supports one or more applications that issue SQL queries and receive results from the data nodes. When discussing MySQL Cluster, the MySQL servers are sometimes called SQL nodes.

The data nodes are NDB daemon processes that store and retrieve the data either in memory or on disk depending on their configuration. Data nodes are installed on each server participating in the cluster. There is also a multithreaded data node daemon named ndbmtd that works on platforms that support multiple CPU cores. You can see improved data node performance if you use the multithreaded data node on dedicated servers with modern multiple-core CPUs.

Figure 9-3. The MySQL Cluster components

The management daemon, ndb_mgmd, runs on a server and is responsible for reading a configuration file and distributing the information to all of the nodes in the cluster.
ndb_mgm, the NDB management client utility, can check the cluster’s status, start backups, and perform other administrative functions. This client runs on a host convenient to the administrator and communicates with the daemon.

There are also a number of utilities that make maintenance easier. Here are a few of the more popular ones (consult the NDB Cluster documentation for a complete list):

ndb_config
Extracts configuration information from existing nodes.

ndb_delete_all
Deletes all rows from an NDB table.

ndb_desc
Describes NDB tables (like SHOW CREATE TABLE).

ndb_drop_index
Drops an index from an NDB table.

ndb_drop_table
Drops an NDB table.

ndb_error_reporter
Diagnoses errors and problems in a cluster.

ndb_redo_log_reader
Checks and prints out a cluster redo log.

ndb_restore
Performs a restore of a cluster. Backups are made using the NDB management client.

How Data Is Stored

MySQL Cluster keeps all indexed columns in main memory. You can store the remaining nonindexed columns either in memory or on disk with an in-memory page cache. Storing nonindexed columns on disk allows you to store more data than the size of available memory.

When data is changed (via INSERT, UPDATE, DELETE, etc.), MySQL Cluster writes a record of the change to a redo log, checkpointing data to disk regularly. As described previously, the log and the checkpoints permit recovery from disk after a failure. However, because the redo logs are written asynchronously with the commit, it is possible that a limited number of transactions can be lost during a failure. To mitigate this risk, MySQL Cluster implements a write delay (with a default of two seconds, but this is configurable). This allows the checkpoint write to complete so that if a failure occurs, the last checkpoint is not lost as a result of the failure. Normal failures of individual data nodes do not result in any data loss due to the synchronous data replication within the cluster.
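The two-second delay mentioned above corresponds to a configurable cluster parameter. As a hedged sketch only (the parameter name and default are taken from the MySQL Cluster reference manual; verify them against your cluster version before relying on this), it could appear in the [ndbd default] section of config.ini like so:

```ini
[ndbd default]
NoOfReplicas=2
DataDir=/var/lib/mysql-cluster
# Milliseconds between global checkpoints (redo log flushes to disk).
# 2000 corresponds to the two-second default discussed in the text.
TimeBetweenGlobalCheckpoints=2000
```

Lowering this value narrows the window of transactions that could be lost in a whole-cluster failure, at the cost of more frequent disk writes.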
When a MySQL Cluster table is maintained in memory, the cluster accesses disk storage only to write records of the changes to the redo log and to execute the requisite checkpoints. Because writing the logs and checkpoints is sequential and few random access patterns are involved, MySQL Cluster can achieve higher write throughput rates with limited disk hardware than the traditional disk caching used in relational database systems.

You can calculate the size of memory you need for a data node using the following formula. The size of the database is the sum of the size of the rows times the number of rows for each table. Keep in mind that if you use disk storage for nonindexed columns, you should count only the indexed columns in calculating the necessary memory.

(SizeofDatabase × NumberOfReplicas × 1.1) / NumberOfDataNodes

This is a simplified formula for rough calculation. When planning the memory of your cluster, you should consult the online MySQL Cluster Reference Manual for additional details to consider.

You can also use the Perl script ndb_size.pl found in most distributions. This script connects to a running MySQL server, traverses all the existing tables in a set of databases, and calculates the memory they would require in a MySQL cluster. This is convenient, because it permits you to create and populate the tables on a normal MySQL server first, then check your memory configuration before you set up, configure, and load data into your cluster. It is also useful to run periodically to keep ahead of schema changes that can result in memory issues and to give you an idea of your memory usage. Example 9-1 depicts a sample report for a simple database with a single table. To find the total size of the database, multiply the size of the data row from the summary by the number of rows. In Example 9-1, we have (for MySQL version 5.1) 84 bytes per row for data and index.
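As a quick sanity check, the formula can be evaluated in a few lines of Python. The figures here are illustrative assumptions: 84 bytes per row (the 5.1 value from Example 9-1), 64,000 rows, two replicas, and four data nodes.

```python
# Rough DataMemory estimate per data node, using the simplified formula
# from the text: (SizeofDatabase * NumberOfReplicas * 1.1) / NumberOfDataNodes.
# Row size, row count, replica count, and node count are illustrative.
def datamemory_per_node(size_of_database, number_of_replicas, number_of_data_nodes):
    return size_of_database * number_of_replicas * 1.1 / number_of_data_nodes

row_size = 84                        # bytes per row, data plus index (5.1 figure)
rows = 64_000
size_of_database = row_size * rows   # 5,376,000 bytes for the table

per_node = datamemory_per_node(size_of_database, 2, 4)
print(round(per_node))               # roughly 2.96 million bytes per data node
```

The 1.1 factor adds roughly 10% of overhead on top of the raw row data, which is why the per-node figure is a bit more than (database size × replicas) divided evenly across the nodes.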
If we had 64,000 rows, we would need to have 5,376,000 bytes of memory to store the table.

If the script generates an error about a missing Class/MethodMaker.pm module, you need to install this class on your system. For example, on Ubuntu you can install it with the following command:

sudo apt-get install libclass-methodmaker-perl

Example 9-1. Checking the size of a database with ndb_size.pl

$ ./ndb_size.pl \
>   --database=cluster_test --user=root
ndb_size.pl report for database: 'cluster_test' (1 tables)
----------------------------------------------------------
Connected to: DBI:mysql:host=localhost
Including information for versions: 4.1, 5.0, 5.1

cluster_test.City
-----------------
DataMemory for Columns (* means varsized DataMemory):
  Column Name      Type      Varsized  Key   4.1   5.0   5.1
  district         char(20)                   20    20    20
  population       int(11)                     4     4     4
  ccode            char(3)                     4     4     4
  name             char(35)                   36    36    36
  id               int(11)             PRI     4     4     4
                                              --    --    --
  Fixed Size Columns DM/Row                   68    68    68
  Varsize Columns DM/Row                       0     0     0

DataMemory for Indexes:
  Index Name       Type      4.1   5.0   5.1
  PRIMARY          BTREE     N/A   N/A   N/A
                              --    --    --
  Total Index DM/Row           0     0     0

IndexMemory for Indexes:
  Index Name                 4.1   5.0   5.1
  PRIMARY                     29    16    16
                              --    --    --
  Indexes IM/Row              29    16    16

Summary (for THIS table):
                             4.1   5.0   5.1
  Fixed Overhead DM/Row       12    12    16
  NULL Bytes/Row               0     0     0
  DataMemory/Row              80    80    84
      (Includes overhead, bitmap and indexes)
  Varsize Overhead DM/Row      0     0     8
  Varsize NULL Bytes/Row       0     0     0
  Avg Varside DM/Row           0     0     0
  No. Rows                     3     3     3
  Rows/32kb DM Page          408   408   388
  Fixedsize DataMemory (KB)   32    32    32
  Rows/32kb Varsize DM Page    0     0     0
  Varsize DataMemory (KB)      0     0     0
  Rows/8kb IM Page           282   512   512
  IndexMemory (KB)             8     8     8

Parameter Minimum Requirements
------------------------------
* indicates greater than default
  Parameter                Default   4.1   5.0   5.1
  DataMemory (KB)            81920    32    32    32
  NoOfOrderedIndexes           128     1     1     1
  NoOfTables                   128     1     1     1
  IndexMemory (KB)           18432     8     8     8
  NoOfUniqueHashIndexes         64     0     0     0
  NoOfAttributes              1000     5     5     5
  NoOfTriggers                 768     5     5     5

Although Example 9-1 uses a very simple table, the output shows not only the row size, but also a host of statistics for the tables in the database. The report also shows the indexing statistics, which are the key mechanism the cluster uses for high performance. The script displays the different memory requirements across MySQL versions. This allows you to see any differences if you are working with older versions of MySQL Cluster.

Partitioning

One of the most important aspects of MySQL Cluster is data partitioning. MySQL Cluster partitions data horizontally (i.e., the rows are automatically divided among the data nodes using a function to distribute the rows). This is based on a hashing algorithm that uses the primary key for the table. In early versions of MySQL, the software uses an internal mechanism for partitioning, but MySQL versions 5.1 and later allow you to provide your own function for partitioning data. If you use your own function for partitioning, you should create a function that ensures the data is distributed evenly among the data nodes.

If a table does not have a primary key, MySQL Cluster adds a surrogate primary key.

Partitioning allows MySQL Cluster to achieve higher performance for queries because it supports distribution of queries among the data nodes. Thus, a query will return results much faster when gathering data across several nodes than from a single node.
For example, you can execute the following query on each data node, getting the sum of the column on each one and summing those results:

SELECT SUM(population) FROM cluster_db.city;

Data distributed across the data nodes is protected from failure if you have more than one replica (copy) of the data. If you want to use partitioning to distribute your data across multiple data nodes to achieve parallel queries, you should also ensure you have at least two replicas of each row so that your cluster is fault tolerant.

Transaction Management

Another aspect of MySQL Cluster’s behavior that differs from the MySQL server concerns transactional data operations. As mentioned previously, MySQL Cluster coordinates transactional changes across the data nodes. This uses two subprocesses called the transaction coordinator and the local query handler. The transaction coordinator handles distributed transactions and other data operations on a global level. The local query handler manages data and transactions local to the cluster’s data nodes and acts as a coordinator of two-phase commits at the data node.

Each data node can be a transaction coordinator (you can tune this behavior). When an application executes a transaction, the cluster connects to a transaction coordinator on one of the data nodes. The default behavior is to select the closest data node as defined by the networking layer of the cluster. If there are several connections available within the same distance, a round-robin algorithm selects the transaction coordinator. The selected transaction coordinator then sends the query to each data node, and the local query handler executes the query, coordinating the two-phase commit with the transaction coordinator. Once all data nodes verify the transaction, the transaction coordinator validates (commits) the transaction.

MySQL Cluster supports the read-committed transaction isolation level.
This means that when there are changes during the execution of the transaction, only committed changes can be read while the transaction is underway. In this way, MySQL Cluster ensures data consistency while transactions are running. For more information about how transactions work in MySQL Cluster and a list of important limitations on transactions, see the MySQL Cluster chapter in the online MySQL Reference Manual.

Online Operations

In MySQL versions 5.1 and later, you can perform certain operations while a cluster is online, meaning that you do not have to either take the server down or lock portions of the system or database. The following list briefly discusses a few of the online operations available in MySQL Cluster and lists the versions that include each feature:

Backup (versions 5.0 and later)
You can use the NDB management console to perform a snapshot backup (a nonblocking operation) to create a backup of your data in the cluster. This operation includes a copy of the metadata (names and definitions of all tables), the table data, and the transaction log (a historical record of changes). It differs from a mysqldump backup in that it does not use a table scan to read the records. You can restore the data using the special ndb_restore utility.

Adding and dropping indexes (MySQL Cluster version 5.1 and later)
You can use the ONLINE keyword to perform the CREATE INDEX or DROP INDEX command online. When online operation is requested, the operation is noncopying (it does not make a copy of the data in order to index it), so indexes do not have to be recreated afterward. One advantage of this is that transactions can continue during alter table operations, and tables being altered are not locked against access by other SQL nodes. However, the table is locked against other queries on the SQL node performing the alter operation.
In MySQL Cluster version 5.1.7 and later, add and drop index operations are performed online when the indexes are on variable-width columns only.

Alter table (MySQL Cluster version 6.2 and later)
You can use the ONLINE keyword to execute an ALTER TABLE statement online. It is also noncopying and has the same advantages as adding indexes online. Additionally, in MySQL Cluster version 7.0 and later, you can reorganize the data across partitions online using the REORGANIZE PARTITION command as long as you don’t use the INTO (partition_definitions) option. Changing default column values or data types online is currently not supported.

Add data nodes and node groups (MySQL Cluster version 7.0 and later)
You can manage the expansion of your data nodes online, either for scale-out or for node replacement after a failure. The process is described in great detail in the reference manual. Briefly, it involves changing the configuration file, performing a rolling restart of the NDB management daemon, performing a rolling restart of the existing data nodes, starting the new data nodes, and then reorganizing the partitions.

For more information about MySQL Cluster’s architecture and features, see the MySQL white papers, which cover MySQL Cluster as well as many other MySQL-related topics.

Example Configuration

In this section, we present a sample configuration of a MySQL Cluster running two data nodes on two systems, with the MySQL server and NDB management node on a third system. We present examples of simplified data node setup. Our example system is shown in Figure 9-4.

Figure 9-4. Sample cluster configuration

You can see one node that contains both the NDB management daemon and the SQL node (the MySQL server). There are also two data nodes, each on its own system. You need a minimum of three computers to form a basic MySQL Cluster configuration with either increased availability or performance.
This is a minimal configuration for MySQL Cluster and, if the number of replicas is set to two, the minimal configuration for fault tolerance. If the number of replicas is set to one, the configuration will support partitioning for better performance but will not be fault tolerant. It is generally permissible to run the NDB management daemon on the same node as a MySQL server, but you may want to move this daemon to another system if you are likely to have a high number of data nodes or want to ensure the greatest level of fault tolerance.

Getting Started

You can obtain MySQL Cluster from the MySQL downloads page. It is open source, like the MySQL server. You can download either a binary distribution or an installation file for some of the top platforms. You can also download the source code and build the cluster on your own platform. Be sure to check the platform notes for specific issues for your host operating system.

You should follow the normal installation procedures outlined in the MySQL Reference Manual. Aside from one special directory, the NDB tools are installed in the same location as the MySQL server binaries.

Before we dive into our example, let us first review some general concepts concerning configuring a MySQL cluster. The cluster configuration is maintained by the NDB management daemon and is read (initially) from a configuration file. There are many parameters that you can use to tune the various parts of the cluster, but we will concentrate on a minimal configuration for now.

There are several sections in the configuration file. At a minimum, you need to include each of the following sections:

[mysqld]
The familiar section of the configuration file that applies to the MySQL server, the SQL node.

[ndbd default]
A default section for global settings. Use this section to specify all of the settings you want applied to every node, both data and management.
Note that the name of the section contains a space, not an underscore.

[ndb_mgmd]
A section for the NDB management daemon.

[ndbd]
You must add one section with this name for each data node.

Example 9-2 shows a minimal configuration file that matches the configuration in Figure 9-4.

Example 9-2. Minimal configuration file

[ndbd default]
NoOfReplicas= 2
DataDir= /var/lib/mysql-cluster

[ndb_mgmd]
hostname=192.168.0.183
datadir= /var/lib/mysql-cluster

[ndbd]
hostname=192.168.0.12

[ndbd]
hostname=192.168.0.188

[mysqld]
hostname=192.168.0.183

This example includes the minimal variables for a simple two data-node cluster with replication. Thus, the NoOfReplicas option is set to 2. Notice we have set the datadir variable to /var/lib/mysql-cluster. You can set it to whatever you want, but most installations of MySQL Cluster use this directory. Finally, notice we have specified the hostname of each node. This is important, because the NDB management daemon needs to know the location of all of the nodes in the cluster. If you have downloaded and installed MySQL Cluster and want to follow along, make the necessary changes to the hostnames so they match our example.

The MySQL Cluster configuration file is by default placed in /var/lib/mysql-cluster and is named config.ini.

It is not necessary to install the complete MySQL Cluster binary package on the data nodes. As you will see later, you need only the ndbd daemon on the data nodes.

Starting a MySQL Cluster

Starting MySQL Cluster requires a specific order of commands. We will step through the procedures for this example, but it is good to briefly examine the general process:

1. Start the management node(s).
2. Start the data nodes.
3. Start the MySQL servers (SQL nodes).

For our example, we first start the NDB management node on 192.168.0.183. Then we start each of the data nodes (192.168.0.12 and 192.168.0.188, in either order).
Once the data nodes are running, we can start the MySQL server on 192.168.0.183 and, after a brief startup delay, the cluster is ready to use.

Starting the management node

The first node to start is the NDB management daemon, named ndb_mgmd. It is located in the libexec folder of the MySQL installation. For example, on Ubuntu it is located in /usr/local/mysql/libexec.

Start the NDB management daemon as the superuser, specifying the --initial and --config-file options. The --initial option tells the cluster that this is our first time starting and we want to erase any configurations stored from previous launches. The --config-file option tells the daemon where to find the configuration file. Example 9-3 shows how to start the NDB management daemon for our example.

Example 9-3. Starting the NDB management daemon

$ sudo ../libexec/ndb_mgmd --initial \
    --config-file /var/lib/mysql-cluster/config.ini
MySQL Cluster Management Server mysql-5.6.11 ndb-7.3.2

It is always a good idea to provide the --config-file option when you start, because some installations have different default search paths for the configuration file. You can discover this search pattern by issuing the command ndb_mgmd --help and searching for the phrase "Default options are read from." It is not necessary to specify the --config-file option on subsequent starts of the daemon.

Starting the management console

While not absolutely necessary at this point, it is a good idea to now launch the NDB management console and check that the NDB management daemon has correctly read the configuration. The NDB management console is named ndb_mgm and is located in the bin directory of the MySQL installation. We can view the configuration by issuing the SHOW command, as shown in Example 9-4.

Example 9-4.
Initial start of the NDB management console

$ ./ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> SHOW
Connected to Management Server at: 192.168.0.183:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from 192.168.0.188)
id=3 (not connected, accepting connect from 192.168.0.12)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.0.183 (mysql-5.5.31 ndb-7.2.13)

[mysqld(API)]   1 node(s)
id=4 (not connected, accepting connect from 192.168.0.183)

ndb_mgm>

This command displays the data nodes and their IP addresses as well as the NDB management daemon and the SQL node. This is a good time to check that all of our nodes are configured with the right IP addresses and that all of the appropriate data nodes are loaded. If you have changed your cluster configuration but see the old values here, it is likely the NDB management daemon has not read the new configuration file.

This output tells us that the NDB management daemon is loaded and ready. If it were not, the SHOW command would fail with a communication error. If you see that error, be sure to check that you are running the NDB management client on the same server as the NDB management daemon. If you are not, use the --ndb-connectstring option and provide the IP address or hostname of the machine hosting the NDB management daemon.

Finally, notice the node IDs of your nodes. You will need this information to issue commands to a specific node in the cluster from the NDB management console. Issue the HELP command at any time to see the other commands available. You will also need to know the node ID for your SQL nodes so that they start up correctly. You can specify the node IDs for each node in your cluster using the NodeId parameter in the config.ini file.

You can also use the STATUS command to see the status of your nodes. Issue ALL STATUS to see the status of all nodes, or node-id STATUS to see the status of a specific node.
This command is handy for watching the cluster start up, because the output reports which startup phase the data node is in. For details about the phases of data node startup, see a version of the online MySQL Reference Manual that covers MySQL Cluster.

Starting data nodes

Now that we have started our NDB management daemon, it is time to start the data nodes. Before we do that, however, let's examine the minimal setup needed for an NDB data node. To set up an NDB data node, all you need is the NDB data node daemon (ndbd) compiled for the targeted host operating system. First, create the folder /var/lib/mysql-cluster, then copy in the ndbd executable, and you're done! Clearly, this makes it very easy to script the creation of data nodes (and many have).

You can start the data nodes (ndbd) using the --initial-start option, which signals that this is the first time the cluster has been started. You must also provide the --ndb-connectstring option, giving the IP address of the NDB management daemon. Example 9-5 shows starting a data node for the first time. Do this on each data node.

Example 9-5. Starting the data node

$ sudo ./ndbd --initial-start --ndb-connectstring=192.168.0.183
2013-02-11 06:22:52 [ndbd] INFO -- Angel connected to '192.168.0.183:1186'
2013-02-11 06:22:52 [ndbd] INFO -- Angel allocated nodeid: 2

If you are starting a new data node, have reset a data node, or are recovering from a failure, you can specify the --initial option to force the data node to erase any existing configuration and cached data and request a new copy from the NDB management daemon.

Be careful when using the --initial options. They really do delete your data!

Return to the management console and check the status (Example 9-6).

Example 9-6.
Status of data nodes

ndb_mgm> SHOW
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.0.188 (mysql-5.5.31 ndb-7.2.13, Nodegroup: 0, Master)
id=3    @192.168.0.12 (mysql-5.5.31 ndb-7.2.13, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.0.183 (mysql-5.5.31 ndb-7.2.13)

[mysqld(API)]   1 node(s)
id=4 (not connected, accepting connect from 192.168.0.183)

You can see that the data nodes started successfully, because information about their daemons is shown. You can also see that one of the nodes has been selected as the master for cluster replication. Because we set the number of replicas to 2 in our configuration file, we have two copies of the data. Don't confuse this notion of master with a master in MySQL replication. We discuss the differences in more detail later in the chapter.

Starting the SQL nodes

Once the data nodes are running, we can connect our SQL node. There are several options we must specify to enable a MySQL server to connect to an NDB cluster. Most people specify these in the my.cnf file, but you can also specify them on the command line if you start the server in that manner:

ndbcluster
Tells the server that you want to include the NDB Cluster storage engine.

ndb_connectstring
Tells the server the location of the NDB management daemon.

ndb_nodeid and server_id
Normally set to the node ID. You can find the node ID in the output of the SHOW command in the NDB management console.

Example 9-7 shows a correct startup sequence for the SQL node in our cluster example.

Example 9-7. Starting the SQL node

$ sudo ../libexec/mysqld --ndbcluster \
    --console -umysql
130211  9:14:21 [Note] Plugin 'FEDERATED' is disabled.
130211  9:14:21 InnoDB: Started; log sequence number 0 1112278176
130211  9:14:21 [Note] NDB: NodeID is 4, management server '192.168.0.183:1186'
130211  9:14:22 [Note] NDB[0]: NodeID: 4, all storage nodes connected
130211  9:14:22 [Note] Starting Cluster Binlog Thread
130211  9:14:22 [Note] Event Scheduler: Loaded 0 events
130211  9:14:23 [Note] NDB: Creating mysql.NDB_schema
130211  9:14:23 [Note] NDB: Flushing mysql.NDB_schema
130211  9:14:23 [Note] NDB Binlog: CREATE TABLE Event: REPL$mysql/NDB_schema
130211  9:14:23 [Note] NDB Binlog: logging ./mysql/NDB_schema (UPDATED,USE_WRITE)
130211  9:14:23 [Note] NDB: Creating mysql.NDB_apply_status
130211  9:14:23 [Note] NDB: Flushing mysql.NDB_apply_status
130211  9:14:23 [Note] NDB Binlog: CREATE TABLE Event: REPL$mysql/NDB_apply_status
130211  9:14:23 [Note] NDB Binlog: logging ./mysql/NDB_apply_status (UPDATED,USE_WRITE)
2013-02-11 09:14:23 [NdbApi] INFO -- Flushing incomplete GCI:s < 65/17
2013-02-11 09:14:23 [NdbApi] INFO -- Flushing incomplete GCI:s < 65/17
130211  9:14:23 [Note] NDB Binlog: starting log at epoch 65/17
130211  9:14:23 [Note] NDB Binlog: NDB tables writable
130211  9:14:23 [Note] ../libexec/mysqld: ready for connections.
Version: '5.5.31-ndb-7.2.13-cluster-gpl-log'  socket: '/var/lib/mysql/mysqld.sock'  port: 3306  Source distribution

The output includes extra comments about the NDB Cluster connection, logs, and status. If you do not see these, or if you see errors, be sure that you started your SQL node with the proper options. Of particular importance is the message stating the node ID and the management server. If you have multiple management servers running, be sure your SQL node is communicating with the correct one.

Once the SQL node starts correctly, return to the management console and check the status of all of your nodes (Example 9-8).

Example 9-8.
Example status of a running cluster

ndb_mgm> SHOW
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.0.188 (mysql-5.5.31 ndb-7.2.13, Nodegroup: 0, Master)
id=3    @192.168.0.12 (mysql-5.5.31 ndb-7.2.13, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.0.183 (mysql-5.5.31 ndb-7.2.13)

[mysqld(API)]   1 node(s)
id=4    @192.168.0.183 (mysql-5.5.31 ndb-7.2.13)

As you can see, all of our nodes are now connected and running. If you see any details other than what is shown here, you have a failure in the startup sequence of your nodes. Be sure to check the logs for each node to determine what went wrong. The most common cause is network connectivity (e.g., firewall issues). The NDB nodes use port 1186 by default.

The logfiles for the data nodes and the NDB management daemon are located in the data directory. The SQL node logs are located in the usual location for a MySQL server.

Testing the Cluster

Now that our example cluster is running, let's perform a simple test (shown in Example 9-9) to ensure we can create a database and tables using the NDB Cluster storage engine.

Example 9-9. Testing the cluster

mysql> create database cluster_db;
Query OK, 1 row affected (0.06 sec)

mysql> create table cluster_db.t1 (a int) engine=NDBCLUSTER;
Query OK, 0 rows affected (0.31 sec)

mysql> show create table cluster_db.t1 \G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` int(11) DEFAULT NULL
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> insert into cluster_db.t1 VALUES (1), (100), (1000);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from cluster_db.t1 \G
*************************** 1. row ***************************
a: 1
*************************** 2. row ***************************
a: 1000
*************************** 3.
row ***************************
a: 100
3 rows in set (0.00 sec)

Now that you have a running cluster, you can experiment by loading data and running sample queries. We invite you to "fail" one of the data nodes during data updates and restart it to see that the loss of a single data node does not affect accessibility.

Shutting Down the Cluster

Just as there is a specific order for startup, there is a specific order for shutting down your cluster:

1. If you have replication running between clusters, allow the slaves to catch up, then stop replication.
2. Shut down your SQL nodes (mysqld).
3. Issue SHUTDOWN in the NDB management console.
4. Exit the NDB management console.

If you have MySQL replication running among two or more clusters, the first step ensures the replication slaves catch up (synchronize) with the master before you shut the SQL nodes down. When you issue the SHUTDOWN command in the NDB management console, it shuts down all of your data nodes and the NDB management daemon.

Achieving High Availability

The main motivation for high availability is to keep a service accessible. For database systems, this means we must always be able to access the data. MySQL Cluster is designed to meet this need. MySQL Cluster supports high availability through distribution of data across data nodes (which reduces the risk of data loss from a single node), replication among replicas in the cluster, automatic recovery (failover) of lost data nodes, detection of data node failures using heartbeats, and data consistency using local and global checkpointing.

Let's examine some of the qualities of a high-availability database system. To be considered highly available, a database system (or any system) must meet the following requirements:

• 99.999% uptime
• No single point of failure
• Failover
• Fault tolerance

A 99.999% uptime means the data is, for practical purposes, always available.
In other words, the database server is considered a nonstop, continuous service. The assumption is that the server is never offline due to a component failure or maintenance. All operations such as maintenance and recovery are expected to work online, without interrupting access, to complete the procedure. This ideal situation is rarely required, and only the most critical industries have a real need for this quality. Additionally, a small period of routine, preventive maintenance is expected (hence the asymptotic percentage rating).

Interestingly, there is an accepted granularity of uptime related to the number of nines in the rating. Table 9-1 shows the acceptable downtime (offline time) per calendar year for each level of the rating.

Table 9-1. Acceptable downtime chart

Uptime     Acceptable downtime
99.000%    3.65 days
99.900%    8.76 hours
99.990%    52.56 minutes
99.999%    5.26 minutes

Notice in this chart that the more nines there are in the rating, the lower the acceptable downtime. For a 99.999% uptime rating, it must be possible to perform all maintenance online, without interruption, except for a very short period of time in a single year. MySQL Cluster meets this need in a variety of ways, including the capability to perform rolling restarts of data nodes, several online database maintenance operations, and multiple access channels (SQL nodes and applications connecting via the NDB API) to the data.

Having no single point of failure means that no single component of the system should determine the accessibility of the service. You can accomplish this with MySQL Cluster by configuring every type of node in the cluster with redundancy. In the small example in the previous section, we had two data nodes. Thus, the data was protected against one data node failing. However, we had only one management node and one SQL node. Ideally, you would also add extra nodes for these functions.
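The downtime figures in Table 9-1 follow directly from the uptime percentage: the acceptable downtime is simply the offline fraction of a 365-day calendar year. The arithmetic can be sketched as follows (an illustration of the table's math, not part of MySQL Cluster):

```python
# Acceptable downtime per calendar year for a given uptime rating.
# Reproduces the figures in Table 9-1; assumes a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(uptime_percent):
    """Minutes of allowed downtime per year for the given uptime rating."""
    return (1 - uptime_percent / 100.0) * MINUTES_PER_YEAR

for rating in (99.0, 99.9, 99.99, 99.999):
    print(f"{rating}% uptime -> {downtime_minutes(rating):.2f} minutes/year")
```

For example, 99.999% uptime allows roughly 5.26 minutes of downtime per year (525,600 x 0.00001), matching the last row of the table.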
MySQL Cluster supports multiple SQL nodes, and the cluster can continue to operate even if the management node fails.

Failover means that if a component fails, another can replace its functionality. In the case of a MySQL data node, failover occurs automatically if the cluster is configured to contain multiple replicas of the data. If a data node fails for one replica, access to the data is not interrupted. When you restart the missing data node, it copies its data back from the other replica. In the case of SQL nodes, because the data is actually stored in the data nodes, any SQL node can substitute for another. In the case of a failed NDB management node, the cluster can continue to operate without it, and you can start a new management node at any time (provided the configuration has not changed).

You can also employ the normal high availability solutions discussed in previous chapters, including replication and automated failover between whole clusters. We discuss cluster replication in more detail later in this chapter.

Fault tolerance is normally associated with hardware such as backup power supplies and redundant network channels. For software systems, fault tolerance is a by-product of how well failover is handled. For MySQL Cluster, this means it can tolerate a certain number of failures and continue to provide access to the data. Much like a hardware RAID system that loses two drives in the same RAID array, loss of multiple data nodes across replicas can result in an unrecoverable failure. However, with careful planning, you can configure MySQL Cluster to reduce this risk. A healthy dose of monitoring and active maintenance can also reduce risk.

MySQL Cluster achieves fault tolerance by actively managing the nodes in the cluster. MySQL Cluster uses a heartbeat to check that services are alive, and when it detects a failed node, it takes action to perform a recovery.
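The heartbeat-based failure detection just described can be sketched in a few lines. This is a deliberately simplified model (the node names and the miss threshold are invented for illustration; they are not NDB internals):

```python
# Simplified sketch of heartbeat-based failure detection: a node that
# misses several consecutive heartbeats is marked down so the cluster
# can fail over to its replica. The threshold here is illustrative only.

MISSED_HEARTBEAT_LIMIT = 3  # consecutive misses before a node is marked down

def detect_failures(heartbeats, limit=MISSED_HEARTBEAT_LIMIT):
    """heartbeats maps node name -> list of booleans (True = beat received).

    Returns the set of nodes considered failed: any node that missed
    `limit` heartbeats in a row at some point in its history.
    """
    failed = set()
    for node, beats in heartbeats.items():
        consecutive_missed = 0
        for beat in beats:
            consecutive_missed = 0 if beat else consecutive_missed + 1
            if consecutive_missed >= limit:
                failed.add(node)
                break
    return failed

history = {
    "data-node-2": [True, True, True, True],
    "data-node-3": [True, False, False, False],  # lost contact
}
print(detect_failures(history))  # {'data-node-3'}
```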
The logging mechanisms in MySQL Cluster also provide a level of recovery for failover and fault tolerance. Local and global checkpointing ensures data is consistent across the cluster. This information is critical for rapid recovery from data node failures. Not only does it allow you to recover the data, but the unique properties of the checkpointing also allow for rapid recovery of nodes. We discuss this feature in more detail later.

Figure 9-5 depicts a MySQL cluster configured for high availability in a web service scenario. The dotted boxes in the figure denote system boundaries. These components should reside on separate hardware to ensure redundancy. Also, you should configure the four data nodes as two replicas. Not shown in this drawing are additional components that interact with the application, such as a load balancer to divide the load across the web and MySQL servers.

When configuring a MySQL cluster for high availability, you should consider employing all of the following best practices (we discuss these in more detail later in this chapter when we examine high performance MySQL Cluster techniques):

• Use multiple replicas with data nodes on different hardware.
• Use redundant network links to guard against network failures.
• Use multiple SQL nodes.
• Use multiple data nodes to improve performance and decentralize the data.

Figure 9-5. A highly available MySQL cluster

System Recovery

There are two types of system recovery. In one type, you shut down the server for maintenance or similar planned events. The other is an unanticipated loss of system capability. Fortunately, MySQL Cluster provides a mechanism to recover functionality even if the worst should occur.

When MySQL Cluster is shut down properly, it restarts from the checkpoints in the logs. This is largely automatic and a normal phase of the startup sequence.
The system loads the most recent data from the local checkpoints for each data node, thereby recovering the data to the latest snapshot on restart. Once the data nodes have loaded the data from their local checkpoints, the system executes the redo log up to the most recent global checkpoint, thereby synchronizing the data to the last change made prior to the shutdown. The process is the same for either a restart following an intentional shutdown or a full system restart after a failure.

You may not think a startup is something that would "recover," but remember that MySQL Cluster is an in-memory database and, as such, the data must be reloaded from disk on startup. Loading the data up to the most recent checkpoint accomplishes this.

When recovering a system from a catastrophic failure or as a corrective measure, you can also recover from a backup of the data. As mentioned previously, you can restore data by invoking the ndb_restore utility from the NDB management console and using the output of a recent online backup. To perform a complete system restore from backup, you should first place the cluster in single-user mode using the following command in the NDB management console:

ENTER SINGLE USER MODE node-id

The node-id is the node ID of the data node you want to use for the ndb_restore utility. See the online MySQL Reference Manual for more details about single-user mode and connecting API-based utilities. You then run restore on each data node in the cluster. Once you have restored the data on each data node, you can exit single-user mode and the cluster will be ready for use. To exit single-user mode, issue the following command in the NDB management console:

EXIT SINGLE USER MODE

For more information about MySQL Cluster backup and restore, see the "Using the MySQL Cluster Management Client to Create a Backup" and "Restore a MySQL Cluster Backup" sections of the online MySQL Reference Manual.
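The restart sequence described earlier (load the latest local checkpoint, then replay the redo log up to the most recent global checkpoint) can be modeled with a toy key/value store. Everything here, from the data layout to the GCI numbers, is invented for illustration and is not the NDB on-disk format:

```python
# Toy model of the MySQL Cluster restart sequence: restore the last
# local checkpoint (a snapshot), then replay redo-log entries up to
# the most recent global checkpoint (GCI).

def recover(local_checkpoint, redo_log, last_global_checkpoint):
    """Return the recovered dataset.

    local_checkpoint       -- dict snapshot taken at an earlier point
    redo_log               -- list of (gci, key, value) changes since then
    last_global_checkpoint -- replay stops after this GCI
    """
    data = dict(local_checkpoint)          # 1. load the snapshot
    for gci, key, value in redo_log:       # 2. replay the redo log
        if gci <= last_global_checkpoint:  # only up to the last GCI
            data[key] = value
    return data

snapshot = {"t1:1": 1, "t1:2": 100}
redo = [(64, "t1:3", 1000), (65, "t1:1", 2), (66, "t1:4", 9)]
print(recover(snapshot, redo, last_global_checkpoint=65))
# {'t1:1': 2, 't1:2': 100, 't1:3': 1000}  (GCI 66 never completed, so it is dropped)
```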
Do not use the --initial option when restarting your server after a failure or a scheduled takedown.

Node Recovery

A node can fail for several reasons, including network, hardware, memory, or operating system issues. Here, we discuss the most common causes of these failures and how MySQL Cluster handles node recovery. In this section, we concentrate on data nodes, as they are the most important nodes with respect to data accessibility:

Hardware
In the event the host computer hardware fails, clearly the data node running on that system will fail. In this case, MySQL Cluster will fail over to the other replicas. To recover from this failure, replace the failed hardware and restart the data node.

Network
If the data node becomes isolated from the network due to some form of network hardware or software failure, the node may continue to execute, but because it cannot contact the other nodes (via heartbeating), MySQL Cluster will mark the node as "down" and fail over to another replica until the node returns and can be recovered. To recover from this failure, replace the failed network hardware and restart the data node.

Memory
If there is insufficient memory on the host system, the cluster can essentially run out of space for data. This will cause that data node to fail. To solve the problem, add more memory or increase the values of the configuration parameters for memory allocation, and perform a rolling restart of the data node.

Operating system
If the operating system configuration interferes with the execution of the data node, resolve the problems and restart the data node.

For more information about database high availability and MySQL high availability using MySQL Cluster, see the white papers on the MySQL website.

Replication

We have already briefly discussed how MySQL replication and replication inside the cluster differ.
MySQL Cluster replication is sometimes called internal cluster replication, or simply internal replication, to clarify that it is not MySQL replication. MySQL replication is sometimes called external replication.

In this section, we discuss MySQL Cluster internal replication. We will also look at how MySQL replication (external replication) replicates data between MySQL clusters instead of between individual MySQL servers.

Replication inside the cluster versus MySQL replication

We mentioned earlier that MySQL Cluster uses synchronous replication inside the cluster. This is done to support the two-phase commit protocol for data integrity. Conversely, MySQL replication uses asynchronous replication, which is a one-way transfer of data that relies on the stable delivery and execution of events without verification that the data has been received before the commit.

Replicating inside the cluster

Internal MySQL Cluster replication provides redundancy by storing multiple copies of the data (called replicas). The process ensures data is written to multiple nodes before the query is acknowledged as complete (committed). This is done using a two-phase commit. This form of replication is synchronous in that the data is guaranteed to be consistent at the point at which the query is acknowledged or the commit has completed.

Data is replicated as fragments, where a fragment is defined as a subset of rows in a table. Fragments are distributed across the data nodes as a result of partitioning, and a copy of each fragment exists on another data node in each replica. One of the fragments is designated as the primary and is used for query execution. All other copies of the same data are considered secondary fragments. During an update, the primary fragment is updated first.

MySQL replication between clusters

Replication between clusters is very easy to do.
If you can set up replication between two MySQL servers, you can set up replication between two MySQL clusters, because no special configuration steps, extra commands, or parameters are needed to start replication between clusters. MySQL replication works just as it does between individual servers; it just so happens that in this case, the data is stored in NDB clusters.

However, there are some limitations to external replication. We list a few here for your consideration when planning external replication (consult the "MySQL Cluster Replication" section of the online MySQL Reference Manual for the latest details concerning external replication):

• External replication must be row-based.
• External replication cannot be circular.
• External replication does not support the auto_increment_* options.
• The size of the binary log may be larger than for normal MySQL replication.

Using MySQL replication to replicate data from one cluster to another permits you to leverage the advantages of MySQL Cluster at each site and still replicate the data to other sites.

Can MySQL Replication Be Used with MySQL Cluster?

You can replicate from a MySQL Cluster server to a non-MySQL Cluster server (or vice versa). No special configuration is necessary other than to accommodate some potential storage engine conflicts, which is similar to replicating among MySQL servers with different storage engines. In this case, use default storage engine assignment and forgo specifying the storage engine in your CREATE statements.

Replicating from a MySQL cluster to a non-MySQL cluster requires creating the special table called ndb_apply_status to replicate the epochs committed. If this table is missing on the slave, replication will stop with an error reporting that ndb_apply_status does not exist.
You can create the table with the following command:

CREATE TABLE `mysql`.`ndb_apply_status` (
  `server_id` INT(10) UNSIGNED NOT NULL,
  `epoch` BIGINT(20) UNSIGNED NOT NULL,
  `log_name` VARCHAR(255) CHARACTER SET latin1
             COLLATE latin1_bin NOT NULL,
  `start_pos` BIGINT(20) UNSIGNED NOT NULL,
  `end_pos` BIGINT(20) UNSIGNED NOT NULL,
  PRIMARY KEY (`server_id`) USING HASH
) ENGINE=NDBCLUSTER DEFAULT CHARSET=latin1;

Replication of a MySQL cluster using external replication requires row-based MySQL replication, and the master SQL node must be started with --binlog-format=ROW or --binlog-format=MIXED. All other requirements for MySQL replication also apply (e.g., unique server IDs for all SQL nodes).

External replication also requires some special additions to the replication process, including use of the cluster binary log, the binlog injector thread, and special system tables to support updates between clusters. External replication also handles transactional changes a bit differently. We discuss these concepts in more detail in the next section.

Architecture of MySQL Cluster (external) replication

You can consider the basic concepts of the operations of external replication to be the same as in MySQL replication. Specifically, we define the roles of master and slave for certain cluster installations. As such, the master contains the original copy of the data, and the slaves receive copies of the data in increments based on the incoming flow of changes to the data.

Replication in MySQL Cluster makes use of a number of dedicated tables in the mysql database on each SQL node on the master and the slave (whether the slave is a single server or a cluster). These tables are created during the MySQL installation process. The two tables are ndb_binlog_index, which stores index data for the binary log (local to the SQL node), and ndb_apply_status, which stores a record of the operations that have been replicated to the slave.
The ndb_apply_status table is maintained on all SQL nodes and kept in sync so that it is the same throughout the cluster. You can use it to execute point-in-time recovery (PITR) of a failed replicated slave that is part of a MySQL cluster.

These tables are updated by a new thread called the binlog injector thread. This thread keeps the master updated with any changes performed in the NDB Cluster storage engine by recording the changes made in the cluster. The binlog injector thread is responsible for capturing all the data events within the cluster as recorded in the binary log and ensures all events that change, insert, or delete data are recorded in the ndb_binlog_index table. The master's dump thread sends the events to the slave I/O thread using MySQL replication.

One important difference in external replication involving MySQL Cluster is that each epoch is treated as a transaction. Because an epoch is a span of time between checkpoints, and MySQL Cluster ensures consistency at each checkpoint, epochs are considered atomic and are replicated using the same mechanism as a transaction in MySQL replication. The information about the last applied epoch is stored in the NDB system tables that support external replication between MySQL clusters.

Single-channel and multichannel replication

The MySQL replication connection between a master and slave is called a channel. A channel is, in effect, the networking protocol and medium used to connect the master to its slaves. Normally, there is only a single channel, but to ensure maximum availability, you can set up a secondary channel for fault tolerance. This is called multichannel replication. Figure 9-6 shows multichannel external replication.

Figure 9-6. Multichannel external replication

Multichannel replication dramatically enhances recovery from a network link failure. Ideally, you would use active monitoring to detect a failure of the network link and signal when the link is down.
This can be accomplished in a variety of ways, from scripts that use simple heartbeat mechanisms to alerts and advisors such as those available in the MySQL Enterprise Monitor.

Notice that the setup in Figure 9-6 has a total of four SQL nodes (i.e., MySQL servers). The cluster acting as the master cluster has two SQL nodes acting as masters, one primary and one secondary. Likewise, the cluster acting as a slave cluster has two SQL nodes acting as primary and secondary slaves. The primary master/slave pair communicates over one network connection, and the secondary master/slave pair communicates over a different network connection.

Don't take your networking components for granted. Even a switch can fail. Using different cabling on the same switched network gains very little. It is best to use a completely separate set of redundant connections and intermediary networking components to achieve true network redundancy.

Setup of multichannel replication does not differ much from single-channel (normal) MySQL replication. However, the replication failover is a little different. The idea is that you do not start the slave on the secondary channel; failover to the secondary channel requires some special steps. Use the following procedure to start multichannel external replication with the primary channel active and the secondary channel in standby mode (we assume the redundant networking communication and hardware is in place and working properly):

1. Start the primary master.
2. Start the secondary master.
3. Connect the primary slave to the primary master.
4. Connect the secondary slave to the secondary master.
5. Start the primary slave.

Do not start the secondary slave (using START SLAVE). If you do, you risk primary key conflicts and duplicate data issues.
You should, however, configure the secondary slave with information about the secondary master (using CHANGE MASTER) so the secondary channel can be started quickly if the primary channel fails.

Failover to the secondary channel requires a different procedure. It is not enough to just start the secondary slave. To avoid having the same data replicated twice, you must first establish the last replicated epoch and use it to start replication. The procedure is as follows (notice that we use variables to store intermediate results):

1. Find the time of the most recent global checkpoint the slave received. This requires finding the most recent epoch from the ndb_apply_status table on the primary slave:

   SELECT @latest := MAX(epoch) FROM mysql.ndb_apply_status;

2. Get the rows that appear in the ndb_binlog_index table on the primary master following the failure. You can find these rows from the primary master with the following query:

   SELECT @file := SUBSTRING_INDEX(File, '/', -1), @pos := Position
   FROM mysql.ndb_binlog_index
   WHERE epoch > @latest
   ORDER BY epoch ASC LIMIT 1;

3. Synchronize the secondary channel. Run this command on the secondary slave, where file is the actual filename and pos is the position:

   CHANGE MASTER TO MASTER_LOG_FILE = 'file', MASTER_LOG_POS = pos;

4. Start replication on the secondary channel by running this command on the secondary slave:

   START SLAVE;

This failover procedure will switch the replication channel. If you have failures of any of the SQL nodes, you must deal with those issues and repair them before executing this procedure. It is a good idea to ensure the primary channel is indeed offline. You may want to consider stopping the primary slave just in case.

Achieving High Performance

MySQL Cluster is designed not only for high availability, but also for high performance. We have already reviewed many of these features, as they are often beneficial for high availability.
In this section, we examine a few features that provide high performance. We conclude with a list of best practices for tuning your system for high performance.

The following features support high performance in MySQL Cluster; we have examined many of these in previous sections:

Replication between clusters (global redundancy)
All data is replicated to a remote site that can be used to offload the primary site.

Replication inside a cluster (local redundancy)
Multiple data nodes can be used to read data in parallel.

Main memory storage
Not needing to wait for disk writes ensures quick processing of updates to data.

Considerations for High Performance

There are three main considerations when tuning your system to support high performance:

• Ensure your applications are as efficient as they can be. Sometimes this requires modification of your database servers (e.g., optimizing the configuration of the servers or modifying the database schema), but often the application itself can be designed or refactored for higher performance.

• Maximize access to your databases. This includes having enough MySQL servers for the number of connections (scale-out) and distributing the data for availability, such as through replication.

• Consider making performance enhancements to your MySQL Cluster, for instance, by adding more data nodes.

Queries with joins can often be very time-consuming. The main cause is the distributed nature of MySQL Cluster, combined with the fact that the MySQL server historically did not have good support for handling MySQL Cluster joins. Before MySQL Cluster 7.2, a JOIN operation was executed by fetching data from the data nodes and performing the join inside the SQL node, requiring the data to be transferred several times over the network. With MySQL Cluster 7.2, MySQL Server 5.5 added support for "pushing down" the join into the engine, which then performs the actual join.
This reduces the amount of data that needs to be sent over the network and also allows increased parallelism by executing the join on multiple data nodes.

You may need to make certain trade-offs between the level of high availability you desire and high performance. For example, adding more replicas increases availability. However, while more replicas protect against loss of data nodes, they require more processing power and you may see lower performance during updates. The reads are still quick, because multiple replicas do not need to be read for the same data. Having a greater number of data nodes (scale-out) while keeping the number of replicas low leads to higher write performance.

Another primary consideration is the distributed nature of MySQL Cluster. Because each node performs best when run on a separate server, the performance of each server is critical, but so are the networking components. Coordination commands and data are transported from node to node, so the networking interconnect must be tuned for high performance. You should also consider parameters such as selection of transport (e.g., TCP/IP, SHM, and SCI), latency, bandwidth, and geographic proximity.

You can set up and run MySQL Cluster in a cloud environment. One advantage of doing so is that the network interconnections are very fast and optimized. Because the data nodes require mainly a fast processor, adequate memory, and a fast network, virtual server technology is more than adequate for running MySQL Cluster in the cloud. Note, however, that MySQL Cluster is not officially supported in virtual server environments.

You can find a complete list of all of the considerations for high performance in the "MySQL Cluster" section of the online MySQL Reference Manual. For general MySQL performance improvements, see High Performance MySQL.
High Performance Best Practices

There are a number of things you can do to ensure your MySQL Cluster is running at peak performance. We list a few of the top performance enhancement practices here, along with a brief discussion of each. Some of these are more general in nature, but we do not want to overlook them in our quest for the highest performance possible.

Tune for access patterns
Consider the methods your applications use to access data. Because MySQL Cluster stores indexed columns in memory, accesses referencing these columns show an even greater speed-up over nonindexed columns than you get on single MySQL servers. MySQL Cluster requires a primary key in every table, so applications that retrieve data by primary key are almost guaranteed to be fast.

Make your applications distribution-aware
The best-case scenario for accessing data on a partitioned data store is to isolate a query to a single node in the cluster. By default, MySQL Cluster uses the primary key for hashing the rows across the partitions. Unfortunately, this isn't always optimal if you consider the behavior of master/detail queries (common in applications that consult a master table followed by details in other tables that refer back to the master table). In this case, you should alter the hashing function to ensure the master row and the detail rows are on the same node. One way to accomplish this is partition pruning, whereby you drop the secondary field used in the detail table partition hash and partition the detail rows with only the master's primary key (which is the foreign key in the detail table). This allows both the master and detail rows to be allocated to the same node in the partition tree.

Use batch operations
Each round trip of a query has significant overhead. For certain operations like inserts, you can save some of that overhead by using a multiple-row insert query (an INSERT statement that inserts multiple rows).
You can also batch operations by turning on the transaction_allow_batching parameter and including multiple operations within a transaction (within the BEGIN and END blocks). This lets you list multiple data manipulation queries (INSERT, UPDATE, etc.) and reduce overhead. Note that the transaction_allow_batching option does not work with SELECT statements or with UPDATE statements that include variables.

Optimize schemas
Optimizing your database schemas has the same effect for MySQL Cluster as it does for normal database systems. For MySQL Cluster, consider using efficient data types (e.g., the minimum size needed; saving 30 bytes per row in a million-row table saves a significant amount of memory). You should also consider denormalization for certain schemas to take advantage of MySQL Cluster's parallel data access methods (partitioning).

Optimize queries
Clearly, the more optimized the query, the faster the query performance. This is a practice common to all databases and should be one of the first things you do to improve the performance of your application. For MySQL Cluster, consider query optimization from the standpoint of how the data is retrieved. Specifically, joins are particularly performance-sensitive in MySQL Cluster. Poorly performing queries can sometimes cause anomalies that are easily mistaken for inefficiencies in other parts of your system.

Optimize server parameters
Optimize your cluster configuration to ensure it is running as efficiently as possible. This may mean spending some time to understand the many configuration options as well as securing the correct hardware to exploit them. There is no magic potion for this task; each installation becomes more unique as you change more parameters. Use this practice with care, tune one parameter at a time, and always compare the results to known baselines before instituting a change.
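The batch-operations practice described above can be sketched as a SQL session. This is an illustrative example only: the table and column names are invented, and it assumes NDB tables on a running MySQL Cluster (transaction_allow_batching is a per-session NDB setting).

```sql
-- One round trip inserts three rows instead of three round trips.
INSERT INTO orders (order_id, customer_id, total)
VALUES (1, 101, 19.95), (2, 102, 4.50), (3, 103, 7.25);

-- Batch several modifications inside a single transaction,
-- with batching enabled for this session.
SET SESSION transaction_allow_batching = 1;
BEGIN;
INSERT INTO orders (order_id, customer_id, total) VALUES (4, 104, 12.00);
UPDATE customers SET last_order_total = 12.00 WHERE customer_id = 104;
COMMIT;
```

Remember the restriction noted above: the batching does not apply to SELECT statements or to UPDATE statements that use variables.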
Use connection pools
By default, SQL nodes use only a single thread to connect to the NDB cluster. With more threads to connect, the SQL nodes can execute several queries at once. To use connection pooling for your SQL nodes, add the ndb-cluster-connection-pool option in your configuration file. Set the value to be greater than 1 (say, 4) and place it in the [mysqld] section. You should experiment with this setting, because it is easy to set the value too high for your application or hardware.

Use multithreaded data nodes
If your data node host has multiple CPU cores or multiple CPUs, you will gain additional performance by running the multithreaded data node daemon named ndbmtd. This daemon can make use of up to eight CPU cores or threads. Using multiple threads allows the data node to run many operations in parallel, such as the local query handler (LQH) and communication processes, to achieve even higher throughput.

Use the NDB API for custom applications
While the MySQL server (the SQL node) offers a fast query processor frontend, MySQL has built a direct-access C++ mechanism called the NDB API. For some operations, such as interfacing MySQL Cluster with LDAP, this may be the only way to connect the MySQL cluster (in this case just the NDB cluster) to your application. If performance is critical to your application and you have the necessary development resources to devote to a custom NDB API solution, you can see significant improvements in performance.

Use the right hardware
Naturally, faster hardware results in faster performance (generally speaking). However, you should consider every aspect of the cluster configuration. Consider not only faster CPUs and more and faster memory, but also high-speed interconnect solutions such as SCI and high-speed, hardware-redundant network connections. In many cases, these hardware solutions are built as turnkey commodities and do not require reconfiguring the cluster.
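The connection-pool setting just described lives in the MySQL configuration file on each SQL node. A minimal sketch follows; the pool size of 4 is an arbitrary starting point, not a recommendation.

```ini
# my.cnf on an SQL node (values are illustrative)
[mysqld]
ndbcluster
# Open four connections to the NDB cluster instead of the default one.
ndb-cluster-connection-pool = 4
```

Each connection in the pool occupies its own [mysqld]/[api] node ID slot in the cluster configuration, so make sure enough free API slots exist before raising the value.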
Do not use swap space
Make sure your data nodes are using real memory and not swap space. You will notice a dramatic performance drop when the data nodes start using swap space. This affects not only performance, but also possibly the stability of the cluster.

Use processor affinity for data nodes
On multiple-CPU machines, lock your data node processes to CPUs that are not involved in network communications. You can do this on some platforms (e.g., Sun CMT processor systems) using the LockExecuteThreadToCPU and LockMaintThreadsToCPU parameters in the [ndbd] section of the configuration file.

If you follow these best practices, you will be well on your way to making MySQL Cluster the best high-performance, high-availability solution for your organization. For more information about optimizing MySQL Cluster, see the white paper "Optimizing Performance of the MySQL Cluster Database."

Conclusion

In this chapter, we discussed the unique high availability solution for MySQL using MySQL Cluster. The strengths of MySQL Cluster include partitioning tables and distributing them across separate nodes and the parallel architecture of MySQL Cluster as a multimaster database. This allows the system to execute high volumes of both read and write operations concurrently. All updates are instantly available to all application nodes (via SQL commands or the NDB API) accessing data stored in the data nodes. Because write loads are distributed across all of the data nodes, you can achieve very high levels of write throughput and scalability for transactional workloads. Finally, with the implementation of multiple MySQL server nodes (SQL nodes) running in parallel, where each server shares the load with multiple connections, and the use of MySQL replication to ensure data shipping among geographic sites, you can build highly efficient, high-concurrency transactional applications.
Although few applications may have such stringent needs, MySQL Cluster is a great solution for those applications that demand the ultimate form of MySQL high availability.

“Joel!” Joel smiled as his boss backtracked to stand in his doorway. “Yes, Bob?” Joel asked.

Mr. Summerson stepped into the office and closed the door, then pulled up a chair and sat down directly across from Joel. Momentarily taken off-guard, Joel merely smiled and said, “What can I do for you, Bob?”

“It’s what you have done for me, Joel. You’ve come up to speed on this MySQL stuff very quickly. You have kept pace with our ramp-up and recent acquisitions. And now you’ve helped make us a lot of money on this last deal. I know I’ve thrown a lot at you and you deserve something in return.” After an uncomfortable pause, he asked, “Do you play golf, Joel?”

Joel shrugged. “I haven’t played since college, and I was never very good at it.”

“That won’t be a problem. I love the game, but the feeling isn’t mutual. I lose half a box of balls every time I play. Are you free Saturday for a game of nine holes?”

Joel wasn’t sure where this was going, but something told him he should accept. “Sure, I’m free.”

“Good. Meet me at the Fair Oaks course at 1000 hours. We’ll play nine holes, then discuss your future over lunch.”

“OK. See you there, Bob.”

Mr. Summerson stood, opened the door, and paused. “I’ve told the accounting office to create a budget for you to manage, including enough to cover the cost of the MySQL Enterprise subscription and funding for two full-time assistants.”

“Thanks,” Joel said, stunned. He wasn’t prepared for outright acceptance of his proposal, much less more responsibility.

As Mr. Summerson disappeared down the hall, Joel’s friend Amy came in and stood next to him. “Are you OK?” she asked with concern.

“Yeah, why?”

“I’ve never seen him close the door to talk to someone.
If you don’t mind me asking, what was that all about?”

With a wave of his hand over the documentation on his desk, Joel said, “He asked me to play golf and then said I had my own budget and could buy the MySQL Enterprise subscription.”

Amy smiled and touched his arm. “That’s good, Joel—really good.”

Joel was confused. He didn’t think the responsibility of managing money or the approval for a purchase order was worthy of such a reaction. “What?”

“The last person who played golf with Mr. Summerson got a promotion and a raise. Mr. Summerson may be tough on the outside, but he rewards loyalty and determination.”

“Really?” Joel stared at the papers on his desk. He told himself not to get his hopes up.

“Are you free for lunch?” Amy asked with a light squeeze of his arm.

Joel looked at her hand on his arm and smiled. “Sure. Let’s go somewhere nice.” But in accepting her offer, Joel knew he would be up late working on a plan for their next date.

PART II
Monitoring and Managing

Now that you have a sophisticated, multiserver system that hopefully meets your site’s needs, you must keep on top of it. This part of the book explains monitoring, with some topics in performance, and covers backups and other aspects of handling the inevitable failures that sometimes occur.

CHAPTER 10
Getting Started with Monitoring

Joel placed his nonfat half-caf latte, fruit cup, and cheese pastry on his desk and smiled at the parody of nutrition awaiting him. Ever since he found the upscale shopping center on his walk to work, his breakfasts had gotten rather creative. He turned on his monitor and waited for his email application to retrieve his messages while he opened the top of his latte. Scanning the message subjects and hoping there wasn’t yet another message from his boss, he noticed several messages from users with subjects that hinted at performance issues. Joel clicked through them, scanning the text.
“Well, I guess something must be wrong,” he mumbled, as he read complaints about how applications that queried the database system were taking too long to respond. He unwrapped his pastry and pondered what could be causing the problems. “Things were just fine yesterday,” he reasoned. After a few sips of his latte he remembered something he read about performance monitoring while working on the lab machines at college. Joel finished his pastry and reached for his MySQL High Availability book. “There has got to be something in here,” he said.

How do you know when your servers are performing poorly? If you wait for your users to tell you something is wrong, chances are there has been something wrong for some time. Leaving problems unaddressed for an extended period complicates the diagnosis and repair process.

In this chapter, we will begin our examination of monitoring MySQL at the operating system level, using the basic tools available on various systems. We look here first because a system service or application always relies on the performance of the operating system and its hardware. If the operating system is performing poorly, so will the database system or application.

We will first examine the reasons for monitoring systems, then we’ll look at basic monitoring tasks for popular operating systems and discuss how monitoring can make your preventive maintenance tasks easier. Once you’ve mastered these skills, you can begin to look more closely at your database system. In the next chapter, we will look in greater detail at monitoring a MySQL server, along with some practical guides to solving common performance problems.

Ways of Monitoring

When we think of monitoring, we normally think about some form of early warning system that detects problems.
However, the definition of monitor (as a verb) is “to observe, record, or detect an operation or condition with instruments that do not affect the operation or condition.” This early warning system uses a combination of automated sampling and an alert system.

The Linux and Unix operating systems are very complex and have many parameters that affect all manner of minor and major system activities. Tuning these systems for performance can be more art than science. Unlike some desktop operating systems, Linux and Unix (and their variants) do not hide the tuning tools, nor do they restrict what you can tune. Some systems, such as Mac OS X and Windows, hide many of the underlying mechanics of the system behind a very user-friendly visual interface.

The Mac OS X operating system, for example, is a very elegant and smoothly running operating system that needs little or no attention from the user under normal conditions. However, as you will see in the following sections, the Mac OS X system provides a plethora of advanced monitoring tools that can help you tune your system if you know where to look for them.

The Windows operating system has many variants, the newest at the time of this writing being Windows 8. Fortunately, most of these variants include the same set of monitoring tools, which allow the user to tune the system to meet specific needs. While not considered as suave as Mac OS X, Windows offers a greater range of user-accessible tuning options.

There are three primary categories of system monitoring: system performance, application performance, and security. You may commence monitoring for more specific reasons, but in general, the task falls into one of these categories. Each category uses a different set of tools (with some overlap) and has a different objective. For instance, you should monitor system performance to ensure the system is operating at peak efficiency.
Application performance monitoring ensures a single application is performing at peak efficiency, and security monitoring helps you ensure the systems are protected in the most secure manner.

Monitoring a MySQL server is akin to monitoring an application. This is because MySQL, like most database systems, lets you measure a number of variables and status indicators that have little or nothing to do with the operating system. However, a database system is very susceptible to the performance of the host operating system, so it is important to ensure your operating system is performing well before trying to diagnose problems with the database system.

Because the goal is to monitor a MySQL system to ensure the database system is performing at peak efficiency, the following sections discuss monitoring the operating system for performance. We leave monitoring for security to other texts that specialize in the details and nuances of security monitoring.

Benefits of Monitoring

There are two approaches to monitoring. You may want to ensure nothing has changed (no degradation of performance and no security breaches) or to investigate what has changed or gone wrong. Monitoring the system to ensure nothing has changed is called proactive monitoring, whereas monitoring to see what went wrong is called reactive monitoring. Sadly, most monitoring occurs in a reactive manner. Very few IT professionals have the time or resources to conduct proactive monitoring. Reactive monitoring is therefore the only form of monitoring some professionals understand.

However, if you take the time to monitor your system proactively, you can eliminate a lot of reactive work. For example, if your users complain about poor performance (the number one trigger for reactive monitoring), you have no way of knowing how much the system has degraded unless you have previous monitoring results with which to compare.
Recording such results is called forming a baseline of your system (i.e., you monitor the performance of your system under low, normal, and high loads over a period of time). If you do the sampling frequently and consistently, you can determine the typical performance of the system under various loads. Thus, when users report performance problems, you can sample the system and compare the results to your baseline. If you include enough detail in your historical data, you can normally see, at a glance, which part of the system has changed.

System Components to Monitor

You should examine four basic parts of the system when monitoring performance:

Processor
Check to see how much of it is utilized and what peaks are reached by utilization.

Memory
Check to see how much is being used and how much is still available to run programs.

Disk
Check to see how much disk space is available, how disk space is used, what demand there is for it, and how fast it delivers content (response time).

Network
Check throughput, latency, and error rates when communicating with other systems on the network.

Processor

Monitor the system’s CPU to ensure there are no runaway processes and that the CPU cycles are being shared equally among the running programs. One way to do this is to call up a list of the programs running and determine what percentage of the CPU each is using. Another method is to examine the load average of the system processes. Most operating systems provide several views of the performance of the CPU.

A process is a unit of work in a Linux or Unix system. A program may have one or more processes running at a time. Multithreaded applications, such as MySQL, generally appear on the system as multiple processes.

When a CPU is under a performance load and contention is high, the system can exhibit very slow performance and even periods of seeming inactivity.
When this occurs, you must either reduce the number of processes or reduce the CPU usage of processes that seem to be consuming more CPU time. You can find which processes are consuming more CPU by using the top utility for Linux and Unix systems, Activity Monitor on Mac OS X, or the Task Manager Performance tab on Windows. But be sure to monitor the CPUs to make sure that high CPU utilization is really the cause of the problem—slowness is even more likely to occur because of memory contention, discussed in the next section.

Some of the common solutions to CPU overloading include:

Provision a new server to run some processes
This is, of course, the best method, but it requires money for new systems. Experienced system administrators can often find other ways to reduce CPU usage, especially when the organization is more willing to spend your time than to spend money.

Remove unnecessary processes
An enormous number of systems run background processes that may be useful for certain occasions but just bog down the system most of the time. However, an administrator must know the system very well to identify which processes are nonessential.

Kill runaway processes
These probably stem from buggy applications, and they are often the culprit when performance problems are intermittent or rare. In the event that you cannot stop a runaway process using a controlled or orderly method, you may need to terminate the process abruptly using a force quit dialog or the command line.

Optimize applications
Some applications routinely take up more CPU time or other resources than they really need. Poorly designed SQL statements are often a drag on the database system.

Lower process priorities
Some processes run as background jobs, such as report generators, and can be run more slowly to make room for interactive processes.

Reschedule processes
Maybe some of those report generators can run at night when system load is lower.
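On Linux and Unix systems, a first look at CPU load and the heaviest consumers along the lines described above might use commands like the following (a sketch; the option syntax shown is the GNU/Linux form and varies on other Unix variants):

```shell
# Load averages over 1, 5, and 15 minutes; sustained values above the
# number of CPU cores suggest contention.
uptime

# The heaviest CPU consumers right now (header plus five processes).
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6
```

The same information is available interactively through top on Linux and Unix, Activity Monitor on Mac OS X, or the Task Manager Performance tab on Windows.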
Processes that consume too much CPU time are called CPU-bound or processor-bound, meaning they do not suspend themselves for I/O and cannot be swapped out of memory.

If you find the CPU is not under contention and there are either few processes running or no processes consuming large amounts of CPU time, the problem with performance is likely to be elsewhere (waiting on disk I/O, insufficient memory, excessive page swapping, etc.).

Memory

Monitor memory to ensure your applications are not requesting so much memory that they waste system time on memory management. From the very first days of limited random access memory (RAM, or main memory), operating systems have evolved to employ a sophisticated method of using disk storage to hold unused portions or pages of main memory. This technique, called paging or swapping, allows a system to run more processes than main memory can load at one time, by storing the memory for suspended processes and later retrieving the memory when the process is reactivated. While the cost of moving a page of memory from memory to disk and back again is relatively high (it is time-consuming compared to accessing main memory directly), modern operating systems can do it so quickly that the penalty isn’t normally an issue unless it reaches such a high level that the processor and disk cannot keep up with the demands.

However, the operating system may periodically perform swapping at a high level to reclaim memory. Be sure to measure memory usage over a period of time to ensure you are not observing a normal cleanup operation.

When periods of high paging occur, it is likely that low memory availability is the result of a runaway process consuming too much memory or too many processes requesting too much memory. This kind of high paging, called thrashing, can be treated the same way as a CPU under contention. Processes that consume too much memory are called memory-bound.
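To see whether paging is actually occurring, you can sample the kernel's memory counters. A minimal Linux sketch (field names come from /proc/meminfo; what counts as "low" depends entirely on your workload):

```shell
# Total vs. available memory, and swap usage; low MemAvailable together
# with a shrinking SwapFree indicates active paging.
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo
```

Sampling these values repeatedly over time (for example, with vmstat) is what lets you distinguish a periodic cleanup operation from sustained thrashing.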
When treating memory performance problems, the natural tendency is to add more memory. While that may indeed solve the problem, it is also possible that the memory is not allocated correctly among the various subsystems. There are several things you can do in this situation. You can allocate different amounts of memory to parts of the system—such as the kernel or filesystem—or to various applications that permit such tweaking, including MySQL. You can also change the priority of the paging subsystem so the operating system begins paging earlier.

Be very careful when tweaking memory subsystems on your server. Be sure to consult your documentation or a book dedicated to improving performance for your specific operating system.

If you monitor memory and find that the system is not paging too frequently, but performance is still an issue, the problem is likely related to one of the other subsystems.

Disk

Monitor disk usage to ensure there is enough free disk space available, as well as sufficient I/O bandwidth to allow processes to execute without significant delay. You can measure this using either a per-process or overall transfer rate to and from disk. The per-process rate is the amount of data a single process can read or write. The overall transfer rate is the maximum bandwidth available for reading and writing data on disk. Some systems have multiple disk controllers; in these cases, the overall transfer rate may be measured per disk controller.

Performance issues can arise if one or more processes are consuming too much of the maximum disk transfer rate. This can have very detrimental effects on the rest of the system in much the same way as a process that consumes too many CPU cycles: it “starves” other processes, forcing them to wait longer for disk access.
Processes that consume too much of the disk transfer rate are called disk-bound, meaning they are trying to access the disk at a frequency greater than their available share of the disk transfer rate. If you can reduce the pressure placed on your I/O system by a disk-bound process, you’ll free up more bandwidth for other processes. You may also hear the terms I/O-bound or I/O-starved when referring to processes; this normally means the process is consuming too much disk.

One way to meet the needs of a process performing a lot of I/O to disk is to increase the block size of the filesystem, thus making large transfers more efficient and reducing the overhead imposed by a disk-bound process. However, this may cause other processes to run more slowly.

Be careful when tuning filesystems on servers that have only a single controller or disk. Be sure to consult your documentation or a book dedicated to improving performance for your specific operating system.

If you have the resources, one strategy for dealing with disk contention is to add another disk controller and disk array and move the data for one of the disk-bound processes to the new disk controller. Another strategy is to move a disk-bound process to another, less utilized server. Finally, in some cases it may be possible to increase the bandwidth of the disk by upgrading the disk system to a faster technology.

There are differing opinions as to where to optimize first or even which is the best choice. We believe:

• If you need to run a lot of processes, maximize the disk transfer rate or split the processes among different disk arrays or systems.

• If you need to run a few processes that access large amounts of data, maximize the per-process transfer rate by increasing the block size of the filesystem.

You may also need to strike a balance between the two solutions to meet your unique mix of processes by moving some of the processes to other systems.
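A quick check of free space per mounted filesystem can be done as follows (POSIX-portable output; measuring actual transfer rates requires a tool such as iostat from the sysstat package, where installed):

```shell
# Capacity and usage for every mounted filesystem, in portable format.
df -P
```

Watching the "Capacity" column over time alerts you to filesystems filling up before processes start failing on writes.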
Network Subsystem

Monitor network interfaces to ensure there is enough bandwidth and that the data being sent or received is of sufficient quality.

Processes that consume too much network bandwidth, because they are attempting to read or write more data than the network configuration or hardware make possible, are called network-bound. These processes keep other processes from accessing sufficient network bandwidth to avoid delays.

Network bandwidth issues are normally indicated by utilization of a high percentage of the maximum bandwidth of the network interface. You can solve these issues by assigning the processes to specific ports on a network interface.

Network data quality issues are normally indicated by a high number of errors encountered on the network interface. Luckily, the operating system and data transfer applications usually employ checksumming or some other algorithm to detect errors, but the resulting retransmissions place a heavy load on the network and operating system. Solving the problem may require moving some applications to other systems on the network or installing additional network cards, and normally involves a diagnosis followed by changing the network hardware, reconfiguring the network protocols, or moving the system to a different subnet on the network.

When referring to a process that is taking too much time accessing networking subsystems, we say it is network-bound.

Monitoring Solutions

For each of the four subsystems just discussed, a modern operating system offers its own specific tools that you can use to get information about the subsystem's status. These tools are largely standalone applications that do not correlate (at least directly) with the other tools. As you will see in the next sections, the tools are powerful in their own right, but it requires a fair amount of effort to record and analyze all of the data they produce.
Fortunately, a number of third-party monitoring solutions are available for most operating and database systems. It is often best to contact your systems providers for recommendations on the best solution to meet your needs and maintain compatibility with your infrastructure. Most vendors offer system monitoring tools as an option. The following are a few of the more notable offerings:

up.time
An integrated system for monitoring and reporting performance for servers. It supports multiple platforms.

Cacti
A graphical reporting solution for graphing data from RRDtool. RRDtool is an open source data logging system and can be tailored using Perl, Python, Ruby, Lua, or Tcl.

KDE System Guard (KSysGuard)
Permits users to track and control processes on their system. Designed to be easy to set up.

Gnome System Monitor
A graphical tool to monitor CPU, network, memory, and processes on a system.

Nagios
A complete solution for monitoring all of your servers, network switches, applications, and services.

MySQL Enterprise Monitor
Provides real-time visibility into the performance and availability of all your MySQL databases. We will discuss the MySQL Enterprise Monitor and automated monitoring and reporting in greater detail in Chapter 16.

The following sections describe the built-in monitoring tools for some of the major operating systems. We will study the Linux and Unix commands in a little more detail, as they are particularly suited to investigating the performance issues and strategies we've discussed. However, we will also include an examination of the monitoring tools for Mac OS X and Microsoft Windows.

Linux and Unix Monitoring

Database monitoring on Linux or Unix can involve tools for monitoring the CPU, memory, disk, network, and even security and users. In classic Unix fashion, all of the core tools run from the command line and most are located in the bin or sbin folders.
Table 10-1 includes the list of tools we've found useful, with a brief description of each.

Table 10-1. System monitoring tools for Linux and Unix

Utility   Description
ps        Shows the list of processes running on the system.
top       Displays process activity sorted by CPU utilization.
vmstat    Displays information about memory, paging, block transfers, and CPU activity.
uptime    Displays how long the system has been running. It also tells you how many users are logged on and the system load average over 1, 5, and 15 minutes.
free      Displays memory usage.
iostat    Displays average disk activity and processor load.
sar       System activity report. Allows you to collect and report a wide variety of system activity.
pmap      Displays a map of how a process is using memory.
mpstat    Displays CPU usage for multiprocessor systems.
netstat   Displays information about network activity.
cron      A subsystem that allows you to schedule the execution of a process.

You can schedule execution of these utilities so you can collect regular statistics over time or check statistics at specific times, such as during peak or minimal loads.

Some operating systems provide additional or alternative tools. Consult your operating system documentation for additional tools for monitoring your system performance.

As you can see from Table 10-1, a rich variety of tools is available with a host of potentially useful information. The following sections discuss some of the more popular tools and explain briefly how you can use them to identify the problems described in the previous sections.

Process Activity

Several commands provide information about processes running on your system—notably top, iostat, mpstat, and ps.

The top command

The top command provides a summary of system information and a dynamic view of the processes on your system, ranked by the most CPU-intensive tasks.
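As an example of scheduling these utilities for regular collection, the following sketch appends a timestamped snapshot of load and memory figures to a logfile. The paths are illustrative, and the /proc files used here are Linux-specific; on other systems you would substitute the equivalent commands (e.g., uptime and free):

```shell
# Append one timestamped snapshot of basic statistics to a log.
logfile=/tmp/sysstats.log
{
  echo "=== $(date) ==="
  cat /proc/loadavg        # 1-, 5-, and 15-minute load averages
  head -3 /proc/meminfo    # MemTotal, MemFree, MemAvailable
} >> "$logfile"
echo "snapshot written to $logfile"
```

Saved as a script, it could be run by cron every 15 minutes with a crontab entry such as `*/15 * * * * /usr/local/bin/sys_snapshot.sh` (script path hypothetical).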
The display typically contains information about each process, including the process ID, the user who started the process, its priority, the percentage of CPU it is using, how much CPU time it has consumed, and, of course, the command used to start the process. However, some operating systems format the report slightly differently. This is probably the most popular utility in the set because it presents a snapshot of your system every few seconds. Figure 10-1 shows the output when running top on a Linux (Ubuntu) system under moderate load.

The system summary is located at the top of the listing and has some interesting data. It shows the percentages of CPU time for user (%us); system (%sy); nice (%ni), which is the time spent running users' processes that have had their priorities changed; I/O wait (%wa); and even the percentage of time spent handling hardware and software interrupts. Also included are the amount of memory and swap space available, how much is being used, how much is free, and the size of the buffers.

Figure 10-1. The top command

Below the summary comes the list of processes, in descending order (which is where the command gets its name) based on how much CPU time is being used. In this example, a Bash shell is currently the task leader, followed by one or several instances of MySQL.

Niceness

You can change the priority of a process on a Linux or Unix system. You may want to do this to lower the priorities of processes that require too much CPU power, are of lower urgency, or could run for an extended period but that you do not want to cancel or reschedule. You can use the commands nice, ionice, and renice to alter the priority of a process.

Most distributions of Linux and Unix now group processes that have had their priorities changed into a group called nice. This allows you to get statistics about these modified processes without having to remember or collate the information yourself.
Having commands that report the CPU time for nice processes gives you the opportunity to see how much CPU these processes are consuming with respect to the rest of the system. For example, a high value for this parameter may indicate there is at least one process with too high a priority.

Perhaps the best use of the top command is to allow it to run and refresh every three seconds. If you check the display at intervals over time, you will begin to see which processes are consuming the most CPU time. This can help you determine at a glance whether there is a runaway process.

You can change the refresh rate by specifying the delay on the command line. For example, top -d 3 sets the delay to three seconds.

Most Linux and Unix variants have a top command that works as we have described. Some have interesting interactive hot keys that allow you to toggle information on or off, sort the list, and even change to a colored display. You should consult the manual page for the top command specific to your operating system, because the special hot keys and interactive features differ among operating systems.

The iostat command

The iostat command gives you different sets of information about your system, including statistics about CPU time, device I/O, and even partitions and network filesystems (NFS). The command is useful for monitoring processes because it gives you a picture of how the system is doing overall with respect to processes and the amount of time the system is waiting for I/O. Figure 10-2 shows an example of running the iostat command on a system with moderate load.

Figure 10-2. The iostat command

The iostat, mpstat, and sar commands might not be installed on your system by default, but they can be installed as an option. For example, they are part of the sysstat package in Ubuntu distributions. Consult your operating system documentation for information about installation and setup.
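If iostat is not installed, the raw per-device counters it summarizes are still available directly from the Linux kernel; a sketch reading /proc/diskstats (field positions per the kernel's documented layout):

```shell
# /proc/diskstats fields: 3 = device name, 4 = reads completed,
# 6 = sectors read, 8 = writes completed, 10 = sectors written
# (all cumulative since boot).
awk '{printf "%-12s reads=%s sectors_read=%s writes=%s sectors_written=%s\n",
            $3, $4, $6, $8, $10}' /proc/diskstats
```

Sampling this file twice and differencing the counters yields per-interval transfer rates, which is essentially what iostat does for you.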
Figure 10-2 shows the percentages for CPU usage from the time the system was started. These are calculated as averages among all processors. As you can see, the system is running on a dual-core CPU, but only one row of values is given. The data includes the percentage of CPU utilization:

• Executing at the user level (running applications)
• Executing at the user level with nice priority
• Executing at the system level (kernel processes)
• Waiting on I/O
• Waiting for virtual processes
• Idle time

A report like this can give you an idea of how your system has been performing since it was started. While this means that you might not notice periods of poor performance (because they are averaged over time), it does offer a unique perspective on how the processes have been consuming available processing time or waiting on I/O. For example, if %idle is very low, you can determine that the system was kept very busy. Similarly, a high value for %iowait can indicate a problem with the disk. If %system or %nice is much higher than %user, it can indicate an imbalance of system and prioritized processes that are keeping normal processes from running.

The mpstat command

The mpstat command presents much of the same information as iostat for processor time, but splits the information out by processor. If you run this command on a multiprocessor system, you will see the percentages per processor as well as the totals for all processors. Figure 10-3 shows an example of the mpstat command.

Figure 10-3. The mpstat command

You can tell the mpstat command to refresh the information at a given interval. This can be helpful if you want to watch how your processors are performing with respect to the processes over a period of time. For instance, you can see whether your processor affinity is unbalanced (too many processes are assigned to one specific processor).
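The per-processor figures mpstat reports are derived from the per-CPU lines of /proc/stat on Linux; a sketch showing the raw counters (in jiffies, cumulative since boot — mpstat converts these deltas into percentages):

```shell
# Each cpuN line lists time spent in user, nice, system, idle, and
# iowait states (further fields follow on newer kernels).
grep '^cpu[0-9]' /proc/stat |
  awk '{print $1 ": user=" $2 " nice=" $3 " system=" $4 \
             " idle=" $5 " iowait=" $6}'
```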
Some implementations of mpstat provide an option to display a more comprehensive report, including statistics for all processors. This may be -A or -P ALL, depending on your operating system.

To find out more about the mpstat command, consult the manual page for your operating system.

The ps command

The ps command is one of those commands we use on a daily basis but rarely take the time to consider its power and utility. This command gives you a snapshot of the processes running on your system. It displays the process ID, the terminal the process is running from, the amount of time it has been running, and the command used to start the process. It can be used to find out how much memory a process uses, how much CPU a process uses, and more. You can also pipe the output to grep to more easily find processes. For example, the command ps -A | grep mysqld is a popular way to find information about all of the MySQL processes running on your system. It sends the list of all processes to the grep command, which in turn shows only those rows containing "mysqld". You can use this technique to find a process ID so you can get detailed information about that process using other commands.

What makes the ps command so versatile is the number of options available for displaying data. You can display the processes for a specific user, get related processes for a specific process by showing its process tree, and even change the format of the output. Consult your documentation for information about the options available on your operating system.

One of the ways you can use this output to diagnose problems is to look for processes that have been running for a long time or to check process status (e.g., check those that are stuck in a suspicious state or sleeping). Unless they are known applications like MySQL, you might want to investigate why they have been running for so long.
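On Linux, ps itself gathers its data from the /proc filesystem; a minimal sketch of the same idea, listing each process ID with its command name using nothing but /proc:

```shell
# Walk /proc: each numeric directory is a process; its comm file
# holds the command name.
for p in /proc/[0-9]*; do
  pid=${p#/proc/}
  comm=$(cat "$p/comm" 2>/dev/null) || continue  # process may have exited
  echo "$pid $comm"
done | head -10
```

This is only the skeleton of what ps does; the real tool also reads stat, status, and cmdline files under each process directory for timing, memory, and argument details.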
Figure 10-4 shows an abbreviated example of the ps command run on a system under moderate load.

Figure 10-4. The ps command

Another use for the output is to see whether there are processes that you do not recognize or a lot of processes run by a single user. Many times this indicates a script that is spawning processes, perhaps because it has been set up improperly, and can even indicate a dangerous security practice.

There are many other utilities built into operating systems to display information about processes. As always, a good reference on performance tuning for your specific operating system will be the best source for more in-depth information about monitoring processes.

Memory Usage

Several commands provide information about memory usage on your system. The most popular ones include free and pmap.

The free command

The free command shows you the amount of physical memory available. It displays the total amount of memory, the amount used, and the amount free for physical memory, and it displays the same statistics for your swap space. It also shows the memory buffers used by the kernel and the size of the cache. Figure 10-5 shows an example of free run on a system with a moderate load.

Figure 10-5. The free command

In the output from an Ubuntu system, shown in Figure 10-5, the shared column is obsolete.

There is a switch that puts the command into a polling mode where the statistics are updated every given number of seconds. For example, to poll memory every five seconds, issue free -t -s 5.

The pmap command

The pmap command gives you a detailed map of the memory used by a process. To use this command, you must first find the process ID of the process you want to explore. You can get this information using the ps command, or even the top command if you are looking at a process that is consuming lots of CPU time.
You can also get the memory maps of multiple processes by listing the process IDs on the command line. For example, pmap 12578 12579 will show the memory maps for process IDs 12578 and 12579.

The output shows a detailed map of all of the memory addresses and the sizes of the portions of memory used by the process at the instant the report was created. It displays the command used to launch the process, including the full path and parameters, which can be very useful for determining where the process was started and what options it is using. You'd be amazed how handy that is when trying to figure out why a process is behaving abnormally. The display also shows the mode (access rights) for each memory block. This can be useful in diagnosing interprocess issues.

Figures 10-6 and 10-7 show an example of a mysqld process map when running on a system with moderate load.

Figure 10-6. The pmap command—part 1

Figure 10-7. The pmap command—part 2

Notice that the listing shown uses the device output format (selected by passing the -d parameter on startup), which also shows where the memory is being mapped or used. This can be handy in diagnosing why a particular process is consuming lots of memory and which part (e.g., a library) is consuming the most.

Figure 10-7 shows the final line of the pmap output, which displays some useful summary information: how much memory is mapped to files, the amount of private memory space, and the amount shared with other processes. This information may be a key piece of data needed to solve memory allocation and sharing issues.

There are several other commands and utilities that display information about memory usage (e.g., dmesg, which can display messages from bootup); consult a good reference on performance tuning for your operating system.

Disk Usage

A number of commands can reveal the disk usage statistics on your system.
This section describes and demonstrates the iostat and sar commands.

The iostat command

As you have already seen in "Process Activity" on page 342, the iostat command shows the CPU time used and a list of all of the disks and their statistics. Specifically, iostat lists each device, its transfer speed, the number of blocks read and written per second, and the total number of blocks read and written. For easy consultation, Figure 10-8 repeats Figure 10-2, which is an example of the iostat command run on a system with a moderate load.

Figure 10-8. The iostat command

This report can be very important when diagnosing disk problems. At a glance, it can tell you whether some devices are being used more than others. If this is the case, you can move some processes to other devices to reduce demand for a single disk. The output can also tell you which disk is experiencing the most reads or writes—this can help you determine whether a particular device needs to be upgraded to a faster one. Conversely, you can learn which devices are underutilized. For example, if you see that your shiny new super-fast disk is not being accessed much, it is likely that you have not configured the high-volume processes to use the new disk. Alternatively, it could be that your application caches data in memory, so that little I/O is actually performed.

The sar command

The sar command is a very powerful utility that displays all sorts of information about your system. It records data over time and can be configured in a variety of ways, so it can be a little tricky to set up. Consult your operating system's documentation to ensure you have it set up correctly. Like most of the system utilization commands we show, you can also configure sar to generate reports at regular intervals.

The sar command can also display CPU usage, memory, cache, and a host of other data similar to that shown by the other commands.
Some administrators set up sar to run periodically to cull the data and form a benchmark for their system.

A complete tutorial on sar is beyond the scope of this book. For a more detailed examination, see System Performance Tuning by Gian-Paolo D. Musumeci and Mike Loukides (O'Reilly).

In this section, we will look at how to use the sar command to display information about disk usage. We do this by combining displays of the I/O transfer rates, swap space and paging statistics, and block device usage. Figure 10-9 shows an example of the sar command used to display disk usage statistics.

Figure 10-9. The sar command for disk usage

The report displays so much information that it seems overwhelming at first glance. Notice the first section after the header. This is the paging information that displays the performance of the paging subsystem. Below that is a report of the I/O transfer rates, followed by the swap space report and then a list of the devices with their statistics. The last portion of the report displays averages calculated for all parameters sampled.

The paging report shows the rate of pages paged in or out of memory, the number of page faults per second that did not require disk access, the number of major faults requiring disk access, and additional statistics about the performance of the paging system. This information can be helpful if you are seeing a high number of page faults (major page faults are more costly), which could indicate too many processes running. Large numbers of major page faults can also mislead a disk investigation: if this value is very high while disk usage is also high, the root cause of poor performance may not lie in the disk subsystems at all. The observation may just be a symptom of something going wrong in the application or operating system.

The I/O transfer report shows the number of transactions per second (tps), the read and write requests, and the totals for blocks read and written.
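The fault counts behind sar's paging report come from kernel counters; on Linux, the cumulative totals since boot can be read directly, which is handy when sar is not installed:

```shell
# pgfault    = minor faults (resolved without disk access)
# pgmajfault = major faults (page had to be read from disk)
grep -E '^(pgfault|pgmajfault) ' /proc/vmstat
```

Sampling these counters at intervals and differencing them gives the per-second fault rates sar reports.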
In this example, the system is performing little I/O but is under heavy CPU load. This is a sign of a healthy system. If the I/O values were very high, we would suspect one or more processes of being stuck in an I/O-bound state. For MySQL, a query generating a lot of random disk accesses or tables that reside on a fragmented disk could cause such a problem.

The swap space report shows the amount of swap space available, how much is used, the percentage used, and how much cache memory is used. This can be helpful in indicating a problem with swapping out too many processes and, like the other reports, can help you determine whether the problem lies in your disks and other devices, with memory, or with too many processes.

The block device (any area of the system that moves data in blocks, like disk or memory) report shows the transfer rate (tps), the reads and writes per second, and average wait times. This information can be helpful in diagnosing problems with your block devices. If these values are all very high (unlike this example, which shows almost no device activity), it could mean you have reached the maximum bandwidth of your devices. However, this information should be weighed against the other reports on this page to rule out a thrashing system, a system with too many processes, or a system without enough memory (or a combination of such problems).

This composite report can be helpful in determining where your disk usage problems lie. If the paging report shows an unusually high rate of faults, it's an indication you may have too many applications running or not enough memory. However, if these values are low or average, you need to look to the swap space; if that is normal, you can examine the device usage report for anomalies.

Disk usage analyzer

In addition to operating system utilities, the GNOME desktop project has created a graphical application called the Disk Usage Analyzer.
This tool gives you an in-depth look at how your storage devices are being used. It also gives you a graphic that depicts disk usage. The utility is available in most distributions of Linux. Figure 10-10 shows a sample report from the Disk Usage Analyzer.

Figure 10-10. Disk Usage Analyzer

Basically, this report gives you a look at how the devices are performing alongside the paging and swap systems. Naturally, if a system is swapping a lot of processes in and out of memory, the disk usage will be unusual. This is why it is valuable to look at these items together on the same report.

Diagnosing disk problems can be challenging, and only a few commands give you the kind of detailed statistics about disk usage we've described. However, some operating systems provide more detailed and specific tools for examining disk usage. Don't forget that you can also determine available space, what is mounted, which filesystems each disk has, and much more from more general commands such as ls, df, and fdisk. Consult your operating system documentation for a list and description of all disk-related commands, as well as for disk usage and monitoring commands.

The vmstat command, shown later in this chapter, can also show this data. Use the vmstat -d command to get a text-based representation of the data.

Network Activity

Diagnosing network activity problems may require specialized knowledge of hardware and networking protocols. Detailed diagnostics are normally left to the networking specialists, but there are two commands you, as a MySQL administrator, can use to get an initial picture of the problem.

The netstat command

The netstat command allows you to see network connections, routing tables, interface statistics, and additional networking-related information.
The command provides a lot of the information that a network specialist would use to diagnose and configure complex networking problems. However, it can also be helpful simply to see how much traffic is passing through your network interfaces and which interfaces are being accessed the most. Figure 10-11 shows a sample report of all of the network interfaces and how much data has been transmitted over each one.

Figure 10-11. The netstat command

On systems that have multiple network interfaces, it may be helpful to determine whether any interface is being overutilized or whether the wrong interfaces are active.

The ifconfig command

The ifconfig command, an essential tool for any network diagnostics, displays a list of the network interfaces on your system, including the status and settings for each. Figure 10-12 shows an example of the ifconfig command.

Figure 10-12. The ifconfig command

The output lists each interface, whether it is up or down, along with its configuration information. This can be very helpful in determining how an interface is configured and can tell you, for example, that instead of communicating over your super-fast Ethernet adapter, your network has failed over to a much slower interface. The root of networking problems is often not the traffic on the network, but rather the network interface choice or setup.

If you produce the reports shown here for your system and still need help diagnosing the problem, having this data ahead of time can help your networking specialist zero in on the problem more quickly. Once you have eliminated any processes consuming too much network bandwidth and determined that you have a viable network interface, the networking specialist can configure the interface for optimal performance.
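On Linux, the per-interface totals shown by netstat and ifconfig are kept in /proc/net/dev; a sketch pulling out the byte and error counters (the field positions assume the kernel's usual receive/transmit column layout, with interface names indented):

```shell
# Skip the two header lines; with ':' and spaces as separators the
# fields are: interface ($2), rx bytes ($3), rx errors ($5),
# tx bytes ($11), tx errors ($13).
awk -F'[: ]+' 'NR > 2 {printf "%-8s rx_bytes=%s rx_errs=%s tx_bytes=%s tx_errs=%s\n",
                       $2, $3, $5, $11, $13}' /proc/net/dev
```

A nonzero and growing error count on an interface is the "data quality" symptom described earlier in this chapter.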
General System Statistics

Along with the subsystem-specific and grouped statistical reporting commands we've discussed, Linux and Unix offer additional commands that give you more general information about your system. These include commands such as uptime and vmstat.

The uptime command

The uptime command displays how long a system has been running. It displays the current time; how long the system has been running; how many users are logged on; and load averages for the past 1, 5, and 15 minutes. Figure 10-13 shows an example of the command.

Figure 10-13. The uptime command

This information can be helpful if you want to see how the system has been performing on average in the recent past. The load averages given are for processes in an active state (not waiting on I/O or the CPU). Therefore, this information has limited use for determining performance issues, but can give you a general sense of the health of the system.

The vmstat command

The vmstat command is a general reporting tool that gives you information about processes, memory, the paging system, block I/O, disk, and CPU activity. It is sometimes used as a first stop on a quest for locating performance issues. High values in some fields may lead you to examine those areas more closely using other commands discussed in this chapter. Figure 10-14 shows an example of the vmstat command run on a system with low load.

The data shown includes the number of processes, where r indicates those waiting to run and b indicates those in an uninterruptible state. The next set of columns shows swap activity, including the amount of memory swapped in (si) or out (so). The next area shows the I/O reports for blocks received (bi) or sent (bo).
The next area shows the number of interrupts per second (in), the number of context switches per second (cs), time spent running processes in user space (us), time spent running processes in kernel space (sy), idle time (id), and time waiting for I/O (wa). The CPU values are expressed as percentages of total CPU time. There are more parameters and options for the vmstat command; check your operating system manual for details on the options available for your operating system.

Figure 10-14. The vmstat command

Automated Monitoring with cron

Perhaps the most important tool to consider is the cron facility. You can use cron to schedule a process to run at a specific time. This allows you to run commands and save the output for later analysis. It can be a very powerful strategy, allowing you to take snapshots of the system over time. You can then use the data to form averages of the system parameters, which you can use as a benchmark to compare against when the system performs poorly in the future. This is important because it allows you to see at a glance what has changed, saving you considerable time when diagnosing performance problems.

If you run your performance monitoring tools daily, and then examine the results and compare them to your benchmark, you may be able to detect problems before your users start complaining. Indeed, this is the basic premise behind the active monitoring tools we've described.

Mac OS X Monitoring

Because the Mac OS X operating system is built on a Unix kernel, you can use most of the tools described earlier to monitor your operating system. However, there are other tools specific to the Mac. These include the following graphical administration tools:

• System Profiler
• Console
• Activity Monitor

This section presents an overview of each of these tools for the purposes of monitoring a Mac OS X system. These tools form the core monitoring and reporting facilities for Mac OS X.
In good Mac fashion, they are all well-written and well-behaved graphical user interfaces (GUIs). The GUIs even show the portions of the tools that report information from files. As you will see, each has a very important use and can be very helpful in diagnosing performance issues on a Mac.

System Profiler

The System Profiler gives you a snapshot of the status of your system. It provides an incredible amount of detail about just about everything in your system, including all of the hardware, the network, and the software installed. Figure 10-15 shows an example of the System Profiler.

Figure 10-15. The System Profiler

You can find the System Profiler in the Applications/Utilities folder on your hard drive. You can also launch the System Profiler via Spotlight. As Figure 10-15 shows, the tool offers a tree pane on the left and a detail pane on the right. You can use the tree pane to dive into the various components of your system.

If you would prefer a console-based report, the System Profiler has a command-line equivalent application in /usr/sbin/system_profiler. There are many parameters and options that allow you to restrict the view to certain reports. To find out more, open a terminal and type man system_profiler.

If you open the Hardware tree, you will see a listing of all of the hardware on your system. For example, if you want to see what type of memory is installed on your system, you can click the Memory item in the Hardware tree.

System Profiler provides a network report, which we have seen in another form on Linux. Click the Network tree to get a basic report of all of the network interfaces on your system. Select one of the network interfaces in the tree or in the detail pane to see all of the same (and more) information that the network information commands in Linux and Unix generate. You can also find out information about firewalls, locations you've defined, and even which volumes are shared on the network.
Another very useful report displays the applications installed on your system. Click the Software→Applications report to see a list of all of the software on your system, including the name, version, when it was updated, whether it is a 64-bit application, and what kind of application it is—for instance, whether it's a universal or a native Intel binary. This last detail can be very important. For example, you can expect a universal binary to run slower than a native Intel binary. It is good to know these things in advance, as they can set certain expectations for performance. Figure 10-16 shows an example of the memory report mentioned earlier.

Figure 10-16. Memory report from System Profiler

As you can see, this is a lot of detail. You can see how many memory cards are installed, their speed, and even the manufacturer code and part number. Wow!

We call each detail pane a report because it's essentially a detailed report for a given category. Some people may refer to all of the data as a report, which is not incorrect, but we think it's better to consider the whole thing a collection of reports.

If you are intrigued with the power of this tool, feel free to experiment and dig around in the tree for more information about your system. You will find just about any fact about it here.

The System Profiler can be very valuable during diagnostics of system problems. Many times AppleCare representatives and Apple-trained technicians will ask for a report of your system. Generate the report from the System Profiler by using the File→Save command. This saves an XML file that Apple professionals can use. You can also export the report to RTF using the File→Export command. Finally, you can print the report after saving it as a PDF file.

You can also change the level of detail reported using the View menu. It has options for Mini, Basic, and Full, which change the level of detail from very minimal to a complete report.
Apple professionals usually ask for the full report. A System Profiler report is the best way to determine what is on your system without opening the box. It should be your first source to determine your system configuration.

Console

The Console application displays the logfiles on your system, and is located in the /Applications/Utilities folder or via Spotlight. Unlike the System Profiler, this tool provides you not only a data dump, but also the ability to search the logs for vital information. When diagnosing problems, it is sometimes helpful to see whether there are any messages in the logs that give more information about an event. Figure 10-17 shows an example of the Console application.

When you launch the Console application, it reads all of the system logs and categorizes them into console diagnostic messages. As you can see in Figure 10-17, the display features a log search pane on the left and a log view on the right. You can also click the individual logfiles in the Files tree to see the contents of each log. The logfiles include the following:

~/Library/Logs
Stores all messages related to user applications. Check here for messages about applications that crash while logged in, information about iDisk activity, and other user-related tasks.

/Library/Logs
Stores all system messages. Check here for information generated at the system level for crashes and other unusual events.

/private/var/log
Stores all Unix BSD process-related messages. Check here for information about the system daemon or BSD utility.

Logs are sequential text files where data is always appended, never updated in the middle, and rarely deleted.

The most powerful feature of Console is its search capability. You can create reports containing messages for a given phrase or keyword and view them later. To create a new search, select File→New Database Search in the menu.
You will be presented with a generalized search builder that you can use to create your query. When you are finished, you can name and save the report for later processing. This can be a very handy way to keep an eye on troublesome applications.

Another really cool feature is the capability to mark a spot in a log that indicates the current date and time—you can use this to determine the last time you looked at the log. If your experience is like ours, you often find interesting messages in several places in the logs and need to review them later, but don't know where you found them or where you left off reviewing the log. Having the ability to mark a log is a real help in this case. To mark a log, highlight a location in the file and click the Mark button on the toolbar.

Although the data reported is a static snapshot of the logs upon launch and any reports you run are limited to this snapshot, you can also set up alerts for new messages in the logs. Use Console→Preferences to turn on notifications, which are delivered to you either via a bouncing icon on the Dock or by bringing the Console application to the forefront after a delay.

The Console application can be very helpful for seeing how various aspects of your system work by monitoring the events that occur and for finding errors from applications or hardware. When you are faced with a performance issue or another troublesome event, be sure to search the logs for information about the application or event. Sometimes the cure for the problem is presented to you in the form of a message generated by the application itself.

Figure 10-17. The Console application

Activity Monitor

Unlike the static nature of the previously described tools, the Activity Monitor is a dynamic tool that gives you information about the system as it is running. The bulk of the data you will need to treat performance issues can be found in the Activity Monitor.
Indeed, you will see information comparable to every tool presented in the Linux and Unix section as you explore the Activity Monitor: information about the CPU, system memory, disk activity, disk usage, and network interfaces.

With the Activity Monitor, for example, you can find out which processes are running and how much memory they are using as well as the percentage of CPU time each is consuming. In this case, the use is analogous to the top command from Linux.

The CPU display shows useful data such as the percentage of time spent executing in user space (user time), the percentage spent in system space (system time), and the percentage of time spent idle. This screen also displays the number of threads and processes running, along with a color-coded graph displaying an aggregate of the user and system time. Combined with the top-like display, this can be an excellent tool if you are investigating problems related to CPU-bound processes. Figure 10-18 shows the Activity Monitor displaying a CPU report.

Figure 10-18. The Activity Monitor's CPU display

Notice that there is a Python script that, at the time of the sampling, was consuming a considerable portion of the CPU time. In this case, the system was running a Bazaar branch in a terminal window. The Activity Monitor shows why my system gets sluggish when branching a code tree. You can double-click a process to get more information about it. You can also cancel a process either in a controlled manner or by forcing it to quit. Figure 10-19 shows an example of the process inspection dialog.

You can export the list of processes by selecting File→Save. You can save the list of processes either as a text file or as an XML file. Some Apple professionals may ask for the process list in addition to the System Profiler report when diagnosing problems.

Figure 10-19.
The Activity Monitor's process inspection dialog

The System Memory display (Figure 10-20) shows information about the distribution of memory. It shows how much memory is free, how much memory cannot be cached and must stay in RAM (in other words, the wired memory), how much is being used, and how much is inactive. With this report, you can see at a glance whether you have a memory issue.

Figure 10-20. The Activity Monitor's System Memory display

The Disk Activity display (Figure 10-21) shows the disk activity for all of your disks. Shown in the first column are the total number of data transfers from (reads in) and to (writes out) disk, along with disk performance for reads and writes per second. The next column shows the total size of the data read from and written to disk, along with the throughput for each. Included is a graph that displays reads and writes over time in a color-coded graph.

Figure 10-21. The Activity Monitor's Disk Activity display

The Disk Activity data can tell you whether you invoke a lot of disk accesses and whether the number of reads and writes (and total amount of data) is unusually high. An unusually high value could indicate you may have to run processes at different times so they do not compete for the disk, or you may have to add another disk to balance the load.

The Disk Usage display (Figure 10-22) shows the used and free space for each of your drives. It also shows a color-coded pie chart to give you a quick view of the disk utilization. You can view another disk by selecting the disk in the drop-down list.

Figure 10-22. The Activity Monitor's Disk Usage display

This display allows you to monitor the free space on your disk so you know when to add more disks and/or extend partitions to add more space when you run low.

The Network display (Figure 10-23) shows a lot of information about how your system is communicating with the network.
Shown in the first column is how many packets were read or received (packets in) and written or sent (packets out) over the network. There are also performance statistics measured in packets per second for reads and writes. The next column shows the size of the data read and written on the network, along with the transfer rate for each direction. A color-coded chart shows the relative performance of the network. Note the peak value over the chart. You can use the data on this display to determine whether a process is consuming the maximum bandwidth of your system's network interfaces.

Figure 10-23. The Activity Monitor's Network display

This section has given you a window into the powerful monitoring tools available on Mac OS X. It's not a complete tutorial, but it will get you started with monitoring a Mac OS X system. For complete details about each of the applications shown, be sure to consult the documentation provided by Apple on the Help menu of each application.

Microsoft Windows Monitoring

Windows is saddled with the reputation of lacking tools; some have called its monitoring counterintuitive. The good news is the barriers to monitoring a Windows system are a myth. In fact, Windows comes with some very powerful tools, including a scheduler for running tasks. You can take performance snapshots, examine errors in the Event Viewer (the Windows equivalent of logs), and monitor performance in real time.

The images shown in this section were taken from several Windows machines. The tools do not differ much in Windows XP or newer versions, including Windows Server 2008 and Windows 8. However, there are differences in accessing the tools in Windows 7 and later, and these differences are noted for each tool.

Indeed, there are a great many tools available to the Windows administrator. We won't try to cover them all here, but instead we'll focus on tools that let you monitor a Windows system in real time.
Let's examine some of the basic reporting tools first. The following are the most popular tools you can use to diagnose and monitor performance issues in Windows:

• Windows Experience Index
• System Health Report
• Event Viewer
• Task Manager
• Reliability Monitor
• Performance Monitor

An excellent source for information about Microsoft Windows performance, tools, techniques, and documentation can be found at the Microsoft Technet website.

The Windows Experience

If you want a quick glance at how your system is performing compared to the expectations of Microsoft's hardware performance indexes, you can run the Windows Experience report. To launch the report, click Start, then select Control Panel→System and Maintenance→Performance Information and Tools. You will have to acknowledge the User Account Control (UAC) to continue.

Microsoft has changed the Windows Experience in Windows 7. The report is very similar to that of earlier Windows versions, but it supplies more information that you can use to judge the performance of your system.

The report is run once after installation, but you can regenerate the report by clicking Update My Score. This report rates five areas of your system's performance: processor (CPU), memory, video controller (graphics), video graphics accelerator (gaming graphics), and the primary hard drive. Figure 10-24 shows an example of the Windows Experience report.
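The scores behind the Windows Experience report come from the Windows System Assessment Tool, which can also be rerun from an elevated command prompt. The sketch below is hedged: the guard lets it degrade gracefully on non-Windows shells, and on a real system `winsat` must be run with administrator rights.

```shell
# winsat is the command-line engine behind the Windows Experience scores.
# "formal" runs the full assessment and refreshes the stored results.
if command -v winsat >/dev/null 2>&1; then
  winsat formal
else
  echo "winsat not available (Windows-only tool)"
fi
```

Rerunning the assessment after a hardware or driver change is the scripted equivalent of clicking Update My Score.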
There is a little-known feature of this report you may find valuable—click on the link "Learn how you can improve your computer's performance" to get a list of best practices for improving each of these scores.

You should run this report and regenerate the metrics every time you change the configuration of your system. This will help you identify situations where configuration changes affect the performance of your server.

Figure 10-24. The Windows Experience report

The best use for this tool is to get a general impression of how your system is performing without analyzing a ton of metrics. A low score in any of the categories can indicate a performance issue. If you examine the report in Figure 10-24, for instance, you will see that the system has a very low graphics and gaming graphics score. This is not unexpected for a Windows system running as a virtual machine or a headless server, but it might be alarming to someone who just shelled out several thousand dollars for a high-end gaming system.

The System Health Report

One of the unique features and diagnostic improvements in Windows Vista and later is the ability to generate a report that takes a snapshot of all of the software, hardware, and performance metrics for your system. It is analogous to the System Profiler of Mac OS X, but also contains performance counters.

To launch the System Health Report, click Start, then select Control Panel→System and Maintenance→Performance Information and Tools. Next, select Advanced Tools, then click the link "Generate a system health report" at the bottom of the dialog. You will have to acknowledge the UAC to continue.

You can also access the System Health Report using the search feature on the Start menu. Click Start and enter "performance" in the search box, then click Performance Information and Tools. Click Advanced Tools and select the link "Generate a system health report" at the bottom of the dialog.
Another way to access the System Health Report is to use the search feature on the Start menu. Click Start and enter "system health report" in the search box, then click the link in the Start menu. You will have to acknowledge the UAC to continue. Figure 10-25 shows an example of the System Health Report.

Figure 10-25. The System Health Report

This report has everything—all of the hardware, software, and many other aspects of your system are documented here. Notice the report is divided into sections that you can expand or collapse for easier viewing. The following list briefly describes the information displayed by each section:

System Diagnostics Report
The system name and the date the report was generated.

Diagnostic Results
Warning messages generated while the report was being run, identifying potential problem areas on your computer. Also included is a brief overview of the performance of your system at the time the report was run.

Software Configuration
A list of all of the software installed on your system, including system security settings, system services, and startup programs.

Hardware Configuration
A list of the important metadata for disk, CPU performance counters, BIOS information, and devices.

CPU
A list of the processes running at report time and metadata about system components and services.

Network
Metadata about the network interfaces and protocols on your system.

Disk
Performance counters and metadata about all of the disk devices.

Memory
Performance counters for memory, including the process list and memory usage.

Report Statistics
General information about the system when the report was run, such as processor speed and the amount of memory installed.

The System Health Report is your key to understanding how your system is configured and is performing at a glance. It is a static report, representing a snapshot of the system.
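If you prefer a command prompt to the Control Panel, the same report can be kicked off directly. A minimal, hedged sketch assuming a Windows host; the guard keeps it from failing on other systems.

```shell
# perfmon /report runs the System Diagnostics data collector (roughly a
# minute of sampling) and then opens the generated system health report.
if command -v perfmon >/dev/null 2>&1; then
  perfmon /report
else
  echo "perfmon not available (Windows-only tool)"
fi
```

This is a convenient way to regenerate the snapshot on a schedule so you have baseline reports to compare against later.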
There is a lot of detailed information in the Hardware Configuration, CPU, Network, Disk, and Memory sections. Feel free to explore those areas for greater details about your system.

The best use of this tool, beyond examining the performance counters, is to save the report for later comparison to other reports when your system is performing poorly. You can save an HTML version of the report by selecting File→Save As.

You can use the saved report as a baseline for performance of your system. If you generate the report several times over the course of low, medium, and high usage, you should be able to put together a general expectation for performance. These expectations are important because you can use them to determine whether your performance issues are within the bounds of expectations. When a system enters a period of unusually high load during a time when it is expected to have a low load, the users' experience may generate complaints. If you have these reports to compare to, you can save yourself a lot of time investigating the exact source of the slowdown.

The Event Viewer

The Windows Event Viewer shows all the messages logged for application, security, and system events. It is a great source of information about events that have occurred (or continue to occur) and should be one of the primary tools you use to diagnose and monitor your system.

You can accomplish a great deal with the Event Viewer. For example, you can generate custom views of any of the logs, save the logs for later diagnosis, and set up alerts for specific events in the future. We will concentrate on viewing the logs. For more information about the Event Viewer and how you can set up custom reports and subscribe to events, consult your Windows help files.

To launch the Event Viewer, click the Start button, then right-click Computer and choose Manage. You will have to acknowledge the UAC to continue.
You can then click the Event Viewer link in the left panel. You can also launch the Event Viewer by clicking Start, typing "event viewer," and pressing Enter.

The dialog has three panes by default. The left pane is a tree view of the custom views, logfiles, and applications and services logs. The logs are displayed in the center pane, and the right pane contains the Action menu items. The log entries are sorted, by default, in descending order by date and time. This allows you to see the most recent messages first. You can customize the Event Viewer views however you like. You can even group and sort events by clicking on the columns in the log header.

Open the tree for the Windows logs to see the base logfiles for the applications, security, and system (among others). Figure 10-26 shows the Event Viewer open and the log tree expanded. The logs available to view and search include:

Application
All messages generated from user applications as well as operating system services. This is a good place to look when diagnosing problems with applications.

Security
Messages related to access and privileges exercised, as well as failed attempts to access any secure object. This can be a good place to look for application failures related to username and password issues.

Setup
Messages related to application installation. This is the best place to look for information about failures to install or remove software.

System
Messages about device drivers and Windows components. This can be the most useful set of logs for diagnosing problems with devices or the system as a whole. It contains information about all manner of devices running at the system level.

Forwarded Events
Messages forwarded from other computers. Consult the Windows documentation about working with remote event logging.

Figure 10-26.
The Windows Event Viewer

Digging through these logs can be challenging, because many of them display information that is interesting to developers and not readable by mere mortals. To make things easier, you can search any of the logs by clicking the Find operation in the Actions pane and entering a text string. For example, if you are concerned about memory issues, you can enter "memory" to filter all of the log entries for ones containing the string "memory," which will then be shown in the center pane. You can also click the Details tab to make things easier to read.

Each log message falls into one of the following three categories (these apply to user processes, system components, and applications alike):

Error
Indicates a failure of some magnitude, such as a failed process, out-of-memory problem, or system fault.

Warning
Indicates a less serious condition or event of note, such as low memory or low disk space.

Information
Conveys data about an event. This is generally not a problem, but it could provide additional information when diagnosing problems, such as when a USB drive was removed.

To view a log, open the corresponding tree in the left pane. To view the details about any message, click on the message. The message will be displayed below the log entries, as shown in Figure 10-26. In the lower part of the center pane, you can click the General tab to see general information about the message, such as the statement logged, when it occurred, what log it is in, and the user who was running the process or application. You can click the Details tab to see a report of the data logged. You can view the information as text (Friendly View) or XML (XML View). You can also save the information for later review; the XML View is useful to pass the report to tools that recognize the format.

The Reliability Monitor

The most interesting monitoring tool in Windows is the Reliability Monitor.
This is a specialized tool that plots the significant performance and error events that have occurred over time in a graph.

A vertical bar represents each day over a period of time. The horizontal bar is an aggregate of the performance index for that day. If there are errors or other significant events, you will see a red X on the graph. Below the bar is a set of drop-down lists that contain the software installations and removals, any application failures, hardware failures, Windows failures, and any additional failures.

This tool is great for checking the performance of the system over a period of time. It can help diagnose situations when an application or system service has performed correctly in the past but has started performing poorly, or when a system starts generating error messages. The tool can help locate the day the event first turned up, as well as give you an idea of how the system was performing when it was running well.

Another advantage of this tool is that it gives you a set of daily baselines of your system over time. This can help you diagnose problems related to changing device drivers (one of the banes of Windows administration), which could go unnoticed until the system degrades significantly.

In short, the Reliability Monitor gives you the opportunity to go back in time and see how your system was performing. The best part of all? You don't have to turn it on—it runs automatically, gleaning much of its data from the logs, and therefore automatically knowing your system's history.

One big source of problems on Windows is connecting and configuring hardware. We will not discuss this subject here, as it can easily fill a book in its own right. The good news is there is a plethora of information about Windows on the Internet. Try googling for your specific driver or hardware to see the most popular hits. You can also check out the Microsoft support forums.
Another excellent resource and host of some popular Windows tools is Sysinternals.

You can access the Reliability Monitor by clicking Start, typing "reliability," and pressing Enter or clicking on Reliability and Performance Monitor. You will have to acknowledge the UAC. Click Reliability Monitor in the tree pane on the left. Figure 10-27 shows an example of the Reliability Monitor.

Figure 10-27. The Reliability Monitor

In Windows 7, you can launch the Reliability Monitor by clicking Start, typing "action center" in the search box, and pressing Enter. You can then select Maintenance→View reliability report. The report differs from previous versions of Windows, but offers the same information in a tidier package. For example, instead of the drop-down lists, the new Reliability Monitor report lists known incidents in a single list.

The Task Manager

The Windows Task Manager (shown in Figure 10-28) displays a dynamic list of running processes. It has been around for a long time and has been improved over various versions of Windows.

The Task Manager offers a tabbed dialog with displays for running applications, processes (this is most similar to the Linux top command), services active on the system, a CPU performance meter, a network performance meter, and a list of users. Unlike some other reports, this tool generates its data dynamically, refreshing periodically. This makes the tool a bit more useful in observing the system during periods of low performance.

The reports display the same information as the System Health Report, but in a much more compact form, and are updated continuously. You can find all of the critical metrics needed to diagnose performance issues with CPU, resource-hogging processes, memory, and the network. Conspicuously missing is a report on disk performance.

Figure 10-28.
The Task Manager

One of the interesting features of the Task Manager is that it shows a miniature performance meter in the notification area on the Start bar that gives you a chance to watch for peaks in usage. You can launch the Task Manager any time by pressing Ctrl+Alt+Del and choosing Task Manager from the menu.

Running a dynamic performance monitoring tool consumes resources and can affect a system that already suffers poor performance.

The Performance Monitor

The Performance Monitor is the premier tool for tracking performance in a Windows system. It allows you to select key metrics and plot their values over time. It can also store the session so you can later review it and create a baseline for your system.

The Performance Monitor has metrics for just about everything in your system. There are counters for many of the smaller details having to do with the basic areas of performance: CPU, memory, disk, and network. There are a great many other categories as well.

To launch the Performance Monitor, click Start, then select Control Panel→System and Maintenance→Performance Information and Tools. Click Advanced Tools and then click the link Open Reliability and Performance Monitor near the middle of the dialog. You will have to acknowledge the UAC to continue. Click Performance Monitor in the tree pane on the left to access the Performance Monitor feature.

You can also launch the Performance Monitor by clicking Start, typing "reliability," and pressing Enter or clicking on Reliability and Performance Monitor. You will have to acknowledge the UAC. Click Performance Monitor in the tree pane on the left to access the Performance Monitor feature.

Figure 10-29 shows an example of the Performance Monitor.

Microsoft has two levels of metrics: objects that offer a high-level view of an area such as the processor or memory, and counters that represent a specific detail of the system.
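A counter is addressed by a path naming the object, an optional instance, and the counter itself, and these paths can also be sampled from a script with typeperf, the command-line companion to the Performance Monitor. A hedged sketch follows, assuming a Windows host; the counter path and sample count are illustrative, and the guard lets the block degrade gracefully elsewhere.

```shell
# Sample total CPU time five times at the default one-second interval.
# The path format is \Object(Instance)\Counter.
if command -v typeperf >/dev/null 2>&1; then
  typeperf "\Processor(_Total)\% Processor Time" -sc 5
else
  echo "typeperf not available (Windows-only tool)"
fi
```

Redirecting typeperf output to a CSV file on a schedule is one way to build the performance baselines discussed throughout this chapter.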
Thus, you can monitor the CPU's performance as a whole or watch the finer details, such as percentage of time idle or the number of user processes running.

Add these objects or counters to the main chart by clicking the green plus sign on the toolbar. This opens a dialog that allows you to choose from a long list of items to add to the chart. Adding the items is a simple matter of selecting the object and expanding the drop-down list on the left, then dragging the desired object to the list on the right.

Figure 10-29. The Performance Monitor

You can add as many items as you want; the chart will change its axis accordingly. If you add too many items to track or the values are too diverse, however, the chart may become unreadable. It is best to stick to a few related items at a time (such as only memory counters) to give you the best and most meaningful chart.

A full description of the features of the Performance Monitor is well beyond the scope of this chapter. We encourage you to investigate additional features such as Data Collector Sets and changing the chart's display characteristics. There are many excellent texts that describe these features and more in great detail.

The versatility of the Performance Monitor makes it the best choice for forming baselines and recording the behavior of the system over time. You can use it as a real-time diagnostic tool.

If you have used the Reliability or Performance Monitor, you may have noticed a seldom-commented-on feature called the Resource Overview. This is the default view of the Reliability and Performance Monitor. It provides four dynamic performance graphs for CPU, disk, network, and memory. Below the graphs are drop-down detail panes containing information about these areas.
This report is an expanded form of the Task Manager performance graphs and provides yet another point of reference for performance monitoring and diagnosis on Microsoft Windows.

This brief introduction to monitoring performance on Microsoft Windows should persuade you that the belief that Microsoft's Windows platform is difficult to monitor and lacks sophisticated tools is a myth. The tools are very extensive (some could argue too much so) and provide a variety of views of your system's data.

Monitoring as Preventive Maintenance

The techniques discussed so far give you a snapshot of the status of the system. However, most would agree that monitoring is normally an automated task that samples the available statistics for anomalies. When an anomaly is found, an alert is sent to an administrator (or group of administrators) to let someone know there may be a problem. This turns the reactive task of checking the system status into a proactive task.

A number of third-party utilities combine monitoring, reporting, and alerts into easy-to-use interfaces. There are even monitoring and alert systems for an entire infrastructure. For example, Nagios can monitor an entire IT infrastructure and set up alerts for anomalies. There are also monitoring and alert systems available either as part of or as an add-on for operating systems and database systems. We will examine the Enterprise Monitor for MySQL in Chapter 16.

Conclusion

There are a great many references on both performance tuning and security monitoring. This chapter provides a general introduction to system monitoring. While it is not comprehensive, the material presented is an introduction to the tools, techniques, and concepts of monitoring your operating system and server performance. In the next chapter, we will take on the task of monitoring a MySQL system and discuss some common practices to help keep your MySQL system running at peak performance.
“Joel!” He knew that voice and that tone. Joel’s boss was headed his way and about to conduct another drive-by tasking. He turned to face his office door as his boss stepped through it.

“Did you read Sally’s email about the slowdown?”

Joel recalled that Sally was one of the employees who sent him a message asking why her application was running slowly. He had just finished checking the low-hanging fruit—there was plenty of memory, and disk space wasn’t an issue. “Yes, I was just looking into the problem now.”

“Make it your top priority. Marketing has a deadline to produce their quarterly sales projections. Let me know what you find.” His boss nodded once and stepped away.

Joel sighed and returned to examining the reports on CPU usage while wondering how to describe technology to the nontechnical.

CHAPTER 11
Monitoring MySQL

Joel had a feeling today was going to be a better day. Everything was going well: the performance measurements for the servers were looking good and the user complaints were down. He had successfully reconfigured the server and improved performance greatly. There was only one application still performing poorly, but he was sure it wasn’t a hardware- or operating-system-related problem; more likely, it was an issue with a poorly written query. Nevertheless, he had sent his boss an email message explaining his findings and noting that he was working on the remaining problems.

Joel heard quickened footsteps approaching his office. He instinctively looked toward his door, awaiting the now-routine appearance of his boss. He was shocked as Mr. Summerson zipped by without so much as a nod in his direction. He shrugged his shoulders and returned to reading his email messages. Just then a new message appeared with “HIGH PRIORITY” in the subject line in capital letters. It was from his boss. Chiding himself for holding his breath, Joel relaxed and opened the message.
He could hear his boss’s voice in his mind as he read through the message.

“Joel, good work on those reports. I especially like the details you included about memory and disk performance. I’d like you to generate a similar report about the database server. I’d also like you to look into a problem one of the developers is having with a query. Susan will send you the details.”

With a deep sigh, Joel once again opened his favorite MySQL book to learn more about monitoring the database system. “I hope it has something about drilling down into individual components,” he mumbled, knowing he needed to get up to speed quickly on an advanced feature of MySQL.

Now that you understand how monitoring works and how to keep your host’s operating systems at peak efficiency, how do you know whether your MySQL servers are performing at their peak efficiency? Better still, how do you know when they aren’t?

In this chapter, we begin with a look at monitoring MySQL, covering monitoring techniques and a taxonomy of what you can monitor, and then move on to monitoring and improving performance in your databases. We conclude with a look into best practices for improving database performance.

What Is Performance?

Before we begin discussions about database performance and general best practices for monitoring and tuning a MySQL server, it is important to define what we mean by performance. For the purposes of this chapter, good performance is defined as meeting the needs of the user such that the system performs as expediently as the user expects, whereas poor performance is defined as anything less. Typically, good performance means that response time and throughput meet the users’ expectations. While this may not seem very scientific, savvy administrators know the best gauge of how well things are going is the happiness of the users. That doesn’t mean we don’t measure performance.
On the contrary, we can and must measure performance in order to know what to fix, when, and how. Furthermore, if you measure performance regularly, you can even predict when your users will begin to be unhappy. Your users won’t care if you improve your cache hit rate by 3%, beating your best score to date. You may take pride in such things, but metrics and numbers are meaningless when compared to the user’s experience at the keyboard.

There is a very important philosophy that you should adopt when dealing with performance: never adjust the parameters of your server, database, or storage engine unless you have a deliberate plan and a full understanding of the expectations of the change as well as its consequences. More important, never adjust without measuring the effects of the change over time. It is entirely possible to improve the performance of the server in the short run but negatively impact performance in the long run. Finally, you should always consult references from several sources, including the reference manuals.

Now that we’ve issued that stern warning, let’s turn our attention to monitoring and improving performance of the MySQL server and databases.

Administrators monitoring MySQL almost always focus on improving performance. Certainly performance is important, in terms of how long the user must wait for a query to execute. But monitoring can also check for the exhaustion of resources, or a high demand for those resources, which can cause timeouts or other failures to get access to your server.

MySQL Server Monitoring

Managing the MySQL server falls into the category of application monitoring, because most of the performance parameters are generated by the MySQL software and are not part of the host operating system.
As mentioned previously, you should always monitor your base operating system in tandem with monitoring MySQL, because MySQL is very sensitive to performance issues in the host operating system.

There is an entire chapter in the online MySQL Reference Manual that covers all aspects of monitoring and performance improvement, intriguingly titled “Optimization.” Rather than repeat the facts and rhetoric of that excellent reference, we will discuss a general approach to monitoring the MySQL server and examine the various tools available. This section is an introduction to the finer details of monitoring the MySQL server. We’ll start with a short discussion of how to change and monitor the behavior of the system, then discuss monitoring primarily for the purposes of diagnosing performance issues and forming a performance benchmark. We will also discuss best practices for diagnosing performance issues and take a look at monitoring the storage engine sublayer in MySQL—an area not well understood or covered by other reference sources.

How MySQL Communicates Performance

There are two mechanisms you can use to govern and monitor behavior in the MySQL server. You use server variables to control behavior, and status variables to read configuration and statistical information regarding features and performance. There are many variables you can use to configure the server. Some can be set only at startup (called startup options, which can also be set in option files). Others can be set at the global level (across all connections), the session level (for a single connection), or both the global and session levels.

You can read server variables using the following command:

SHOW [GLOBAL | SESSION] VARIABLES;

You can change those variables that are not static (read-only) using the following commands (you can include multiple settings on a single line using a comma separator):

SET [GLOBAL | SESSION] variable_name = value;
SET [@@global. | @@session. | @@]variable_name = value;

Session variable settings are not persistent beyond the current connection and are reset when the connection is closed.

You can read status variables using the following commands. The first two commands display the value of all local or session scope variables (the default is session), and the third command displays those variables that are global in scope:

SHOW STATUS;
SHOW SESSION STATUS;
SHOW GLOBAL STATUS;

We discuss how and when to use these commands in the next section.

Two of the most important commands for discovering information about the server and how it is performing are SHOW VARIABLES and SHOW STATUS. There are a great many variables (over 290 status variables alone). The variable lists are generally in alphabetical order and are often grouped by feature. However, sometimes the variables are not neatly arranged. Filtering the command by a keyword through the LIKE clause can produce information about the specific aspects of the system you want to monitor. For example, SHOW STATUS LIKE '%thread%' shows all of the status variables related to thread execution.

Performance Monitoring

Performance monitoring in MySQL is the application of the previous commands: specifically, setting and reading system variables and reading status variables. The SHOW and SET commands are only two of the possible tools you can use to accomplish the task of monitoring the MySQL server. Indeed, there are several tools you can use to monitor your MySQL server. The tools available in the standard distributions are somewhat limited in that they are console tools and include special commands you can execute from a MySQL client (e.g., SHOW STATUS) and utilities you can run from a command line (e.g., mysqladmin).

The MySQL client tool is sometimes called the MySQL monitor, but should not be confused with a monitoring tool.
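When you post-process SHOW STATUS or SHOW VARIABLES output in a script rather than in the client, the LIKE filter can be reproduced by translating the SQL wildcards (% matches any run of characters, _ matches a single character) into a regular expression. The sketch below is a minimal illustration over a hard-coded sample of variables (the variable names are real, the values invented), not a complete client:

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regular expression.
    '%' matches any run of characters, '_' matches exactly one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

def filter_variables(variables, pattern):
    """Return only the variables whose names match the LIKE pattern."""
    rx = like_to_regex(pattern)
    return {name: value for name, value in variables.items() if rx.match(name)}

# Hypothetical sample of SHOW GLOBAL STATUS output.
status = {
    "Threads_connected": "3",
    "Threads_running": "1",
    "Qcache_hits": "1500",
    "Uptime": "86400",
}

print(filter_variables(status, "%thread%"))
```

Like the server’s LIKE matching on variable names, the match here is case-insensitive, so '%thread%' finds Threads_connected and Threads_running.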
There are also GUI tools available that make things a little easier if you prefer or require such options. In particular, you can download the MySQL GUI tools, which include advanced tools that you can use to monitor your system, manage queries, and migrate your data from other database systems. We begin by examining how to use the SQL commands and then discuss the MySQL Workbench tool. We also take a look at one of the most overlooked tools available to the administrator: the server logs.

Some savvy administrators may consider the server logs the first and primary tool for administering the server. Although they are not nearly as vital for performance monitoring, they can be an important asset in diagnosing performance issues.

SQL Commands

All of the SQL monitoring commands could be considered variants of the SHOW command, which displays internal information about the system and its subsystems. For example, one pair of commands that can be very useful in monitoring replication is SHOW MASTER STATUS and SHOW SLAVE STATUS. We will examine these in more detail later in this chapter.

Many of these commands can be achieved by querying the INFORMATION_SCHEMA tables directly. See the online MySQL Reference Manual for more details about the INFORMATION_SCHEMA database and its features.

While there are many forms of the SHOW command, the following are the most common SQL commands you can use to monitor the MySQL server:

SHOW INDEX FROM table
Describes the indexes in the table. This can let you know whether you have the right indexes for the way your data is used.

SHOW PLUGINS
Displays the list of all known plug-ins. It shows the name of the plug-in and its current status. The storage engines in newer releases of MySQL are implemented as plug-ins. Use this command to get a snapshot of the currently available plug-ins and their status. While not directly related to monitoring performance, some plug-ins supply system variables.
Knowing which plug-ins are installed can help determine whether you can access plug-in-specific variables.

SHOW [FULL] PROCESSLIST
Displays data for all threads (including those handling connections to clients) running on the system. This command resembles the process commands of the host operating system. The information displayed includes connection data along with the command executing, how long it has been executing, and its current state. Like the operating system command it resembles, it can diagnose poor response (too many threads), a zombie process (long running or nonresponding), or even connection issues. When dealing with poor performance or unresponsive threads, use the KILL command to terminate them. By default, only the first part of each statement is shown; the FULL keyword displays the complete statement text for each thread.

You must have the PROCESS privilege to see threads belonging to other users.

SHOW [GLOBAL | SESSION] STATUS
Displays the values of all of the status variables. You will probably use this command more frequently than any other. Use this command to read all of the statistical information available on the server. Combined with the GLOBAL or SESSION keyword, you can limit the display to those statistics that are global- or session-only.

SHOW TABLE STATUS [FROM db]
Displays detailed information about the tables in a given database. This includes the storage engine, collation, creation date, index data, and row statistics. You can use this command along with the SHOW INDEX command to examine tables when diagnosing poorly performing queries.

SHOW [GLOBAL | SESSION] VARIABLES
Displays the system variables. These are typically configuration options for the server. Although they do not display statistical information, viewing the variables can be very important when determining whether the current configuration has changed or if certain options are set.
Some variables are read-only and can be changed only via the configuration file or the command line on startup, while others can be changed globally or set locally. You can combine this command with the GLOBAL or SESSION keyword to limit the display to those variables that are global- or session-only.

Limiting the Output of SHOW Commands

The SHOW commands in MySQL are very powerful. However, they often display too much information. This is especially true for the SHOW STATUS and SHOW VARIABLES commands. To see less information, you can use the LIKE pattern clause, which limits the output to rows matching the pattern specified. The most common example is using the LIKE clause to see only variables for a certain subset, such as replication or logging. You can use the standard MySQL pattern symbols and controls in the LIKE clause in the same manner as a SELECT query. For example, the following displays the status variables that include the name “log”:

mysql> SHOW SESSION STATUS LIKE '%log%';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| Binlog_cache_disk_use    | 0     |
| Binlog_cache_use         | 0     |
| Com_binlog               | 0     |
| Com_purge_bup_log        | 0     |
| Com_show_binlog_events   | 0     |
| Com_show_binlogs         | 0     |
| Com_show_engine_logs     | 0     |
| Com_show_relaylog_events | 0     |
| Tc_log_max_pages_used    | 0     |
| Tc_log_page_size         | 0     |
| Tc_log_page_waits        | 0     |
+--------------------------+-------+
11 rows in set (0.11 sec)

The commands specifically related to storage engines include the following:

SHOW ENGINE engine_name LOGS
Displays the log information for the specified storage engine. The information displayed is dependent on the storage engine. This can be very helpful in tuning storage engines. Some storage engines do not provide this information.

SHOW ENGINE engine_name STATUS
Displays the status information for the specified storage engine. The information displayed depends on the storage engine.
Some storage engines display more information than others. For example, the InnoDB storage engine displays dozens of status variables, while the NDB storage engine shows a few, and the MyISAM storage engine displays no information. This command is the primary mechanism for viewing statistical information about a given storage engine and can be vital for tuning certain storage engines (e.g., InnoDB).

Older synonyms for the SHOW ENGINE commands (SHOW engine LOGS and SHOW engine STATUS) have been deprecated. Also, these commands can display information only for certain engines, including InnoDB and PERFORMANCE_SCHEMA.

SHOW ENGINES
Displays a list of all known storage engines for the MySQL release and their status (i.e., whether the storage engine is enabled). This can be helpful when deciding which storage engine to use for a given database or in replication to determine if the same storage engines exist on both the master and the slave.

The commands specifically related to MySQL replication include:

SHOW BINLOG EVENTS [IN log_file] [FROM pos] [LIMIT [offset,] row_count]
Displays the events as they were recorded to the binary log. You can specify a logfile to examine (omitting the IN clause tells the system to use the current logfile), and limit output to events from a particular position or to a number of rows after an offset into the file. This command is the primary command used in diagnosing replication problems. It comes in very handy when an event occurs that disrupts replication or causes an error during replication.

If you do not use a LIMIT clause and your server has been running and logging events for some time, you could get a very lengthy output. If you need to examine a large number of events, you should consider using the mysqlbinlog utility instead.

SHOW BINARY LOGS
Displays the list of the binary logs on the server. Use this command to get information about past and current binlog filenames.
The size of each file is also displayed. This is another useful command for diagnosing replication problems because it will permit you to specify the binlog file for the SHOW BINLOG EVENTS command, thereby reducing the amount of data you must explore to determine the problem. The SHOW MASTER LOGS command is a synonym.

SHOW RELAYLOG EVENTS [IN log_file] [FROM pos] [LIMIT [offset,] row_count]
Available as of MySQL version 5.5.0, this command does the same thing as SHOW BINLOG EVENTS, only with relay logs on the slave. If you do not supply a filename for the log, events from the first relay log are shown. This command has no effect when run on the master.

SHOW MASTER STATUS
Displays the current configuration of the master. It shows the current binlog file, the current position in the file, and all inclusive or exclusive replication settings. Use this command when connecting or reconnecting slaves.

SHOW SLAVE HOSTS
Displays the list of slaves connected to the master that used the --report-host option. Use this information to determine which slaves are connected to your master.

SHOW SLAVE STATUS
Displays the status information for the system acting as a slave in replication. This is the primary command for tracking the performance and status of your slaves. A considerable amount of information is displayed that is vital to maintaining a healthy slave. See Chapter 3 for more information about this command.

Example 11-1 shows the SHOW VARIABLES command and its output from a recent beta release of MySQL.

Example 11-1. Showing thread status variables

mysql> SHOW VARIABLES LIKE '%thread%';
+----------------------------+---------------------------+
| Variable_name              | Value                     |
+----------------------------+---------------------------+
| innodb_file_io_threads     | 4                         |
| innodb_read_io_threads     | 4                         |
| innodb_thread_concurrency  | 0                         |
| innodb_thread_sleep_delay  | 10000                     |
| innodb_write_io_threads    | 4                         |
| max_delayed_threads        | 20                        |
| max_insert_delayed_threads | 20                        |
| myisam_repair_threads      | 1                         |
| pseudo_thread_id           | 1                         |
| thread_cache_size          | 0                         |
| thread_handling            | one-thread-per-connection |
| thread_stack               | 262144                    |
+----------------------------+---------------------------+
12 rows in set (0.00 sec)

This example shows not only the system variables for thread management, but also the thread control for the InnoDB storage engine. Although you sometimes get more information than you expected, a keyword-based LIKE clause is sure to help you find the specific variable you need.

Knowing which variables to change and which variables to monitor can be the most challenging part of monitoring a MySQL server. As mentioned, a great deal of valuable information on this topic is included in the online MySQL Reference Manual.

To illustrate the kinds of features you can monitor in a MySQL server, let us examine the variables that control the query cache. The query cache is one of the most important performance features in MySQL if you use the MyISAM storage engine for your application data. It allows the server to buffer frequently used queries and their results in memory. Thus, the more often a query is run, the more likely it is that the results can be read from the cache rather than reexamining the index structures and tables to retrieve the data. Clearly, reading the results from memory is much faster than reading them from disk every time. This can be a performance improvement if your data is read much more frequently than it is written (updated).
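Whether a workload is in fact read-mostly can itself be estimated from status counters. The sketch below is one rough way to do it: it computes the share of reads from sampled Com_select and data-change counters. The counter names are real MySQL status variables, but the numbers are invented sample data, and this is only an approximation (it ignores other statement types).

```python
# Rough read/write ratio estimate from sampled MySQL status counters.
# The counter values below are invented sample numbers.
status = {
    "Com_select": 9200,
    "Com_insert": 300,
    "Com_update": 400,
    "Com_delete": 100,
}

def read_fraction(status):
    """Fraction of the sampled statements that were reads (SELECTs)."""
    reads = status["Com_select"]
    writes = status["Com_insert"] + status["Com_update"] + status["Com_delete"]
    total = reads + writes
    return reads / total if total else 0.0

print(f"read fraction: {read_fraction(status):.0%}")
```

A read fraction well above 90%, as in this invented sample, is the kind of workload where caching query results tends to pay off.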
Each time you run a query, it is entered into the cache and has a lifetime governed by how recently it was used (old queries are dumped first) and how much memory is available for the query cache. Additionally, there are a number of events that can invalidate (remove) queries from the cache. We include a partial list of these events here:

• Changes to data or indexes.
• Subtle differences of the same query that have a different result set, which can cause missed cache hits. Thus, it is important to use standardized queries for commonly accessed data. You will see later in this chapter how views can help in this area.
• When a query derives data from temporary tables (not cached).
• Transaction events that can invalidate queries in the cache (e.g., COMMIT).

You can determine whether the query cache is configured and available in your MySQL installation by examining the have_query_cache variable. This is a system variable with global scope, but it is read-only. You control the query cache using one of several variables. Example 11-2 shows the server variables for the query cache.

Example 11-2. Query cache server variables

mysql> SHOW VARIABLES LIKE '%query_cache%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| have_query_cache             | YES      |
| query_cache_limit            | 1048576  |
| query_cache_min_res_unit     | 4096     |
| query_cache_size             | 33554432 |
| query_cache_type             | ON       |
| query_cache_wlock_invalidate | OFF      |
+------------------------------+----------+
6 rows in set (0.00 sec)

As you can see, there are several things you can change to affect the query cache. Most notable is the ability to temporarily turn off the query cache by setting the query_cache_size variable, which sets the amount of memory available for the query cache. If you set this variable to 0, it effectively turns off the query cache and removes all queries from the cache.
This is not related to the have_query_cache variable, which merely indicates that the feature is available. Furthermore, it is not sufficient to set query_cache_type = OFF, because that does not deallocate the query cache buffer. You must also set the size to completely turn off the query cache. For more information about configuring the query cache, see the section titled “Query Cache Configuration” in the online MySQL Reference Manual.

You can observe the performance of the query cache by examining several status variables, as shown in Example 11-3.

Example 11-3. Query cache status variables

mysql> SHOW STATUS LIKE '%Qcache%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Qcache_free_blocks      | 0     |
| Qcache_free_memory      | 0     |
| Qcache_hits             | 0     |
| Qcache_inserts          | 0     |
| Qcache_lowmem_prunes    | 0     |
| Qcache_not_cached       | 0     |
| Qcache_queries_in_cache | 0     |
| Qcache_total_blocks     | 0     |
+-------------------------+-------+
8 rows in set (0.00 sec)

Here we see one of the more subtle inconsistencies in the MySQL server. You can control the query cache using variables that start with query_cache, but the status variables start with Qcache. While the inconsistency was intentional (to help distinguish a server variable from a status variable), oddities like this can make searching for the right items a challenge.

There are many nuances to the query cache that allow you to manage and configure it and monitor its performance. This makes the query cache an excellent example of the complexity of monitoring the MySQL server. For example, you can and should periodically defragment the query cache with the FLUSH QUERY CACHE command. This does not remove results from the cache, but instead allows for internal reorganization to better utilize memory.
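The Qcache counters become most useful when combined. A common back-of-the-envelope metric is the cache hit ratio, sketched below. The counter names are real status variables, but the sample values are invented, and the formula (hits versus hits plus cache inserts plus uncacheable queries) is just one reasonable way to define the ratio:

```python
# Sample Qcache counters (invented values; the names are real MySQL status variables).
qcache = {
    "Qcache_hits": 7500,
    "Qcache_inserts": 2000,
    "Qcache_not_cached": 500,
    "Qcache_lowmem_prunes": 120,
}

def hit_ratio(q):
    """Approximate share of cacheable lookups answered from the query cache.
    Misses are approximated by cache inserts plus uncacheable queries."""
    attempts = q["Qcache_hits"] + q["Qcache_inserts"] + q["Qcache_not_cached"]
    return q["Qcache_hits"] / attempts if attempts else 0.0

print(f"hit ratio: {hit_ratio(qcache):.0%}")
if qcache["Qcache_lowmem_prunes"] > 0:
    print("queries were evicted for lack of memory; "
          "consider a larger query_cache_size")
```

A nonzero Qcache_lowmem_prunes value, as checked at the end, indicates that results are being evicted to make room, which is one signal that the cache may be undersized for the workload.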
No single volume (or chapter in a broader work) can cover all such topics and nuances of the query cache. The practices described in this chapter are therefore general and are designed to be used with any feature in the MySQL server; the specific details may require additional research and a good read through the online MySQL Reference Manual.

The mysqladmin Utility

The mysqladmin command-line utility is the workhorse of the suite of command-line tools. There are many options and tools (called “commands”) this utility can perform. The online MySQL Reference Manual discusses the mysqladmin utility briefly. In this section, we examine the options and tools for monitoring a MySQL server.

The utility runs from the command line, so it allows administrators to script sets of operations much more easily than they can process SQL commands. Indeed, some of the third-party monitoring tools use a combination of mysqladmin and SQL commands to gather information for display in other forms. You must specify connection information (user, password, host, etc.) to connect to a running server.

The following is a list of commonly used commands (as you will see, most of these have equivalent SQL commands that produce the same information):

status
Displays a concise description of the status of the server, including uptime, number of threads (connections), number of queries, and general statistical data. This command provides a quick snapshot of the server’s health.

extended-status
Displays the entire list of system statistics and is similar to the SQL SHOW STATUS command.

processlist
Displays the list of current processes and works the same way as the SQL SHOW PROCESSLIST command.

kill thread_id
Allows you to kill a specified thread. Use this in conjunction with processlist to help manage runaway or hung processes.

variables
Displays the system server variables and values. This is equivalent to the SQL SHOW VARIABLES command.
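A recurring monitoring pattern with extended-status output is comparing two samples taken some interval apart, which is also what mysqladmin's --relative option automates. A minimal hand-rolled version of that delta computation might look like the following sketch; the two snapshots are invented sample data rather than real server output:

```python
def status_delta(previous, current):
    """Per-counter change between two status snapshots."""
    return {name: current[name] - previous[name]
            for name in current if name in previous}

# Two hypothetical extended-status samples taken a few seconds apart.
sample_1 = {"Questions": 10000, "Slow_queries": 2, "Threads_connected": 5}
sample_2 = {"Questions": 10450, "Slow_queries": 3, "Threads_connected": 7}

for name, change in status_delta(sample_1, sample_2).items():
    print(f"{name}: {change:+d}")
```

The deltas, not the absolute counter values, are what reveal the current rate of activity: a server with billions of total Questions may be idle now, while a jump in Slow_queries between samples points at a problem happening right now.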
There are many options and other commands not listed here, including commands to start and stop a slave during replication and manage the various system logs.

One of the best features of the mysqladmin utility is its comparison of information over time. The --sleep n option tells the utility to execute the specified command once every n seconds. For example, to see the process list refreshed every three seconds on the local host, use the following command:

mysqladmin -uroot --password processlist --sleep 3

This command will execute until you cancel the utility using Ctrl-C. Perhaps the most powerful option is the comparative results for the extended-status command. Use the --relative option to compare the previous execution values with the current values. For example, to see the previous and current values for the system status variables, use this command:

mysqladmin -uroot --password extended-status --relative --sleep 3

You can also combine commands to get several reports at the same time. For example, to see the process list and status information together, issue the following command:

mysqladmin -uroot … processlist status

mysqladmin has many other uses. You can use it to shut down the server, flush the logs, ping a server, start and stop slaves in replication, and refresh the privilege tables. For more information about the mysqladmin tool, see the section titled “mysqladmin—Client for Administering a MySQL Server” in the online MySQL Reference Manual. Figure 11-1 shows sample output for a system with no load.

Figure 11-1. Sample mysqladmin process and status report

MySQL Workbench

The MySQL Workbench application is a GUI tool designed as a workstation-based administration tool. MySQL Workbench, which we’ll just call Workbench henceforth, is available for download on the MySQL website and is offered as a community edition (GPL) and a commercial version called the Standard Edition.
The Standard Edition is bundled with the MySQL Enterprise offerings. The major features of Workbench include:

• Server administration
• SQL development
• Data modeling
• Database Migration Wizard

We will discuss server administration in more detail and briefly introduce SQL development in the following sections. Data modeling is beyond the scope of this chapter, but if you want to implement configuration management for your database schemas, we encourage you to explore the feature presented in the Workbench documentation.

The database migration wizard is designed to automate the migration of database schema and data from other database systems. These include Microsoft SQL Server 2000, 2005, 2008, and 2012; PostgreSQL 8.0 and later; and Sybase Adaptive Server Enterprise 15.x and greater. It can be a really handy tool for making adoption of MySQL easier and faster.

MySQL Workbench replaces the older MySQL GUI Tools, including MySQL Administrator, MySQL Query Browser, and MySQL Migration Toolkit.

When you launch Workbench, the main screen displays three distinct sections representing SQL development, data modeling, and server administration (Figure 11-2). The links below each section permit you to start working with each of these features. The database migration feature is accessed via the “Database Migrate…” menu option.

Figure 11-2. MySQL Workbench home window

You can use Workbench on any platform and can access one or more servers connected to the client. This makes the tool much more convenient when monitoring several servers on the network. For more information and details about installation and setup, refer to the online MySQL Workbench documentation.
MySQL server administration

The server administration feature provides facilities for viewing and changing system variables, managing configuration files, examining the server logs, monitoring status variables, and even viewing graphical representations of performance for some of the more important features. It also has a full set of administration options that allow you to manage users and view database configurations. While it was originally intended to replace the mysqladmin tool, popular demand ensures we will have both for the foreseeable future.

To use the server administration feature, you must first define an instance of a MySQL server to administer. Click the New Server Instance link and follow the steps to create a new instance (connection) to your server. The process will connect to and validate the parameters you entered to ensure it has a valid instance. Once the instance is created, it will be displayed in the box under the Server Administration section of the home window. To administer your server, choose the server instance from the list, then click Server Administration. You will see a new window like Figure 11-3.

Figure 11-3. Server Administration window

Notice on the left side of the window there are four sections: management, configuration, security, and data export/restore. We discuss each of these briefly.

Management. The management group of tools permits you to see an overview of the server status, start and stop the server, view system and status variables, and view the server logs.

In the first edition of this book, we presented the MySQL Administrator application, which contained a feature to produce detailed graphs of memory usage, connections, and more. This feature is not present in MySQL Workbench but is included in the MySQL Enterprise Monitor application, which contains advanced monitoring tools for enterprises.
The graphing feature is vastly superior to the features in the deprecated MySQL Administrator tool.

We see an example of the server status in Figure 11-3. Notice we see a small graph of the server load and its memory usage. To the right of that, we see graphs for connection usage, network traffic, query cache hits, and key efficiency. You can use these graphs as a quick look at your server status. If any of the graphs show unusually high (or, in rare cases, unusually low) values, you can use that as a clue to start looking for performance problems before they become critical. If you would like a tool that offers finer granularity in graphing system status, health, and so on, you may want to explore the MySQL Enterprise Monitor application. We discuss the MySQL Enterprise Monitor in Chapter 16.

The startup and shutdown tool lets you start or stop the server instance. It also shows the most recent messages from the server, should you start or stop the server with the tool.

The status and system variable tool is one of the handiest of the management group. Figure 11-4 shows an example screenshot of this tool. You can choose to explore status variables by category or search for any status variable matching a phrase (similar to LIKE '%test%'). The system variable tab has the same search feature. Figure 11-5 is an example screenshot of the system variables tool. As you can see, a lot of categories are defined. The categories allow you to quickly zoom to the area you are most interested in viewing.

Any variable prefixed by [rw] is read/write and therefore can be changed by the administrator at runtime. The administrator account must have the SUPER privilege.

Figure 11-4. Status variables

Figure 11-5. System variables

Once you start using Workbench, you should find yourself using these tools frequently.
The ability to search and quickly navigate to a status or system variable will save you a lot of typing or reentering SQL SHOW commands. If that isn't enough to convince you, the tools also allow you to copy the variables to the clipboard for use in reports and similar efforts. You can copy all of the global variables or just those shown in the result list.

The last tool in the management group allows you to explore the server logs. Figure 11-6 shows an example screenshot of the server logs tool. It displays a tab for each type of log that is enabled. In the example, we have the slow query, general, and error logs enabled. You can view each log in turn, paging through the log entries. You can also select portions of the logs and copy them to the clipboard for reporting and similar efforts.

You may be prompted to enter elevated privileges for reading the log files. Also, if you are connected to a remote server (other than localhost), you must use an SSH instance connection with appropriate credentials.

Figure 11-6. Server logs

As you can see, the graphical tools for managing MySQL servers are designed to make rudimentary and repetitive tasks easier.

Configuration. The next group includes a powerful tool for managing your configuration file. Figure 11-7 shows a sample screenshot of the options file tool. Not only can you view which options are set in your configuration file, but you can also change their values and save the new values. More on that in a moment.

The user account used must have OS-level write privileges for this file.

Figure 11-7. Options file

There are several categories listed in tabs across the top. These include general, advanced, MyISAM, performance, logfiles, security, InnoDB, NDB, transactions, networking, replication, and miscellaneous. The tool includes all of the server options known for the version of your server.
The use of categories makes finding and setting configuration file entries easier. A short help text is provided to the right of each option.

Setting an option requires first checking the tick box to indicate that the option should appear in the file. In addition, if the option takes a value, enter or change the value in the provided text box. Once you have all of the options set the way you want, you can make them take effect by clicking Apply. When you click Apply, a dialog opens that displays a summary of the changes to the file. You can cancel or apply the changes, or you can see the complete contents of the file from this dialog. When you confirm, the tool saves the changes to the file, which will take effect on the next start of the server.

There is one other powerful feature of this tool. Notice the drop-down box labeled "mysqld" near the bottom. This allows you to set the section of the configuration file you are editing, and thereby use the tool to modify options for certain applications. Combined with the ability to restart the server, you can use this tool to help tune your server. You may find this easier and faster to use than traditional command-line tools.

Security. The next group contains a permissions tool that allows you to quickly see the permissions for any user from a list of all users defined on the server. Figure 11-8 shows a sample screenshot of the tool.

Figure 11-8. Privileges

You can use this tool to help diagnose access issues and to help prune your permission sets to minimal access for your users. The tool also permits you to change permissions for a given user by clicking the tick box to toggle access (no checkmark means the user does not have the permission). Once you've made changes to one or more users, you can click Apply to issue the appropriate changes on the server.

Data export/restore.
The last group of tools encapsulates the basic data export and import features of mysqldump. While not strictly devoted to monitoring, you would do well to include such features in your collection of tools. For example, it may be necessary to make copies of data or export it from one server to another for further analysis of a performance-related query issue. You can select entire databases or any combination of objects to export. Figure 11-9 shows a sample screenshot of the export feature.

Figure 11-9. Data export

You can dump all the objects and data to a single file, or specify a project folder where each table is saved as a separate .sql file that contains the table creation and data insert SQL statements. After you select either option, along with the databases and tables you want to export, and then click Start Export, the associated mysqldump commands will run. A summary dialog is opened to display the progress of the operation and the exact commands used to issue the export. You can save these commands for use in scripts.

You can also choose to export procedures and functions, events, or no data at all (exporting only the table structure). If your database uses InnoDB, you can also tell the tool to use a single transaction to avoid prolonged locking of the tables. In this case, the tool tells mysqldump to use the consistent snapshot feature of InnoDB instead of locking the tables.

Importing data is done via the data import/restore tool (Figure 11-10). It allows you to select an export folder or file to import, as well as a target default database (schema).

Figure 11-10. Data import

If you elected to export to a project folder, you can also select which files (tables) you want to import, allowing you to perform a selective restore. Like the export tool, executing the import will open a dialog that shows you the progress of the import as well as the mysqldump commands.
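Behind the scenes, the export tool issues ordinary mysqldump commands. The sketch below assembles a roughly equivalent invocation; the database name, option set, and output file here are placeholders for illustration, not the exact command Workbench generates.

```shell
# Build (but do not run) a mysqldump command mirroring the export tool's
# "single transaction" option; review the echoed command, then execute it
# against your own server with appropriate credentials.
DB="world"
OUT="${DB}_export.sql"
CMD="mysqldump --single-transaction --routines --events ${DB}"
echo "${CMD} > ${OUT}"
```

Running the echoed command produces a single .sql file containing the table creation and data insert statements, as described above.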
SQL development

The SQL Editor is another of the GUI tools available in Workbench. It also is not a monitoring tool in and of itself but, as you shall see, provides a robust environment for constructing complex SQL statements.

You can access the tool from the home window. Here you select an instance, then click Open Connection to Start Querying. Figure 11-11 shows a sample screenshot. You can use the SQL Editor to build queries and execute them in a graphical form. Result sets are returned and displayed in a spreadsheet-like display. The SQL Editor allows for vertical scrolling through all of the results, as well as changing the size of the columns and horizontally scrolling to better view the data. Many users find this tool more convenient and easier to use than the traditional mysql command-line client.

Figure 11-11. SQL Editor

The performance-related functionality, and the value added for administrators, is the graphical display of the results of the EXPLAIN command for any given query. Figure 11-12 shows a sample explanation of a query from the world (InnoDB) database. We will discuss this in greater detail later in the chapter. The SQL Editor example shown here should give you an indication of the utilitarian value of the GUI. You can enter any query and see the explanation of the query execution by first executing the query, then selecting Explain Query from the Query menu.

Figure 11-12. SQL Editor: Results view

Notice there are two parts to the results. The bottom part shows the results of the EXPLAIN command as well as the actual rows returned. You can use the scroll bars to view more data without having to reissue the query. This is a valuable performance tuning tool because you can write the query once, use the Explain Query feature, observe the results, either rewrite the query or adjust the indexes, then reissue the query and observe the changes in the GUI.
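The same explanation is available from the mysql command-line client. For example, a query against the world sample database (similar to the one shown in Figure 11-12) might be explained like this; the query itself is illustrative:

```sql
mysql> EXPLAIN SELECT Name FROM City WHERE CountryCode = 'USA' \G
```

The Workbench display presents the same EXPLAIN rows graphically, which many find easier to scan than the tabular or \G output of the client.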
And you thought query tools were for users only—not so with this tool. But wait, there's more. The SQL Editor has enhanced editing tools, such as color coding. To see all of the advanced features and uses for the SQL Editor, check the online MySQL Workbench documentation.

Third-Party Tools

Some third-party tools are also useful. Some of the more popular are MySAR, mytop, InnoTop, and MONyog. Except for MONyog, they are all text-based (command-line) tools that you can run in any console window and connect to any MySQL server reachable across the network. We discuss each of these briefly in the following sections.

MySAR

MySAR is a system activity report tool that resembles the output of the Linux sar command. MySAR accumulates the output of the SHOW STATUS, SHOW VARIABLES, and SHOW FULL PROCESSLIST commands and stores it in a database on the server named mysar. You can configure the data collection in a variety of ways, including limiting the data collected. You can delete older data in order to continue to run MySAR indefinitely and not worry about filling up your disk with status dumps. MySAR is open source and licensed under the GNU General Public License version 2 (GPL v2).

If you use sar to gather a lot of data, you may want to check out the ksar tool. The ksar tool is a graphical presentation tool that operates on the output of sar.

mytop

The mytop utility monitors the thread statistics and general performance statistics of MySQL. It lists common statistics like hostname, version of the server, how many queries have run, the average times of queries, total threads, and other key statistics. It runs the SHOW PROCESSLIST and SHOW STATUS commands periodically and displays the information in a listing like the top command found on Linux. Figure 11-13 shows an example.

Figure 11-13. The mytop utility

Jeremy D. Zawodny wrote mytop, and it is still maintained by him along with the MySQL community.
It is open source and licensed under the GNU General Public License version 2 (GPL v2).

InnoTop

InnoTop is another system activity report tool that resembles the top command. Inspired by the mytop utility, InnoTop has many of the same tools as mytop, but is specifically designed to monitor InnoDB performance as well as the MySQL server. You can monitor critical statistics concerning transactions, deadlocks, foreign keys, query activity, replication activity, system variables, and a host of other details. InnoTop is widely used and considered by some to be a general performance monitoring tool. It has many features that allow you to monitor the system dynamically. If you are using InnoDB primarily as your default (or standard) storage engine and want a well-rounded monitoring tool you can run in text mode, look no further than InnoTop. Figure 11-14 shows an example of the InnoTop utility.

Figure 11-14. The InnoTop utility

The InnoTop utility is licensed under the GNU General Public License version 2 (GPL v2).

MONyog

MySQL Monitor and Advisor (MONyog) is another good MySQL monitoring tool. It allows you to set parameters for key components for security and performance, and includes tools to help tune your servers for maximum performance. You can set events to monitor specific parameters and get alerts when the system reaches the specified thresholds. The major features of MONyog are:

• Server resource monitoring
• Identification of poorly executing SQL statements
• Server log monitoring (e.g., the error log)
• Real-time query performance monitoring and identification of long-running queries
• Alerting for significant events

MONyog also provides a GUI component if you prefer to graph the output.

The MySQL Benchmark Suite

Benchmarking is the process of determining how a system performs under certain loads. The act of benchmarking varies greatly and is somewhat of an art form.
The goal is to measure and record statistics about the system while running a well-defined set of tests, whereby the statistics are recorded under light, medium, and heavy load on the server. In effect, benchmarking sets the expectations for the performance of the system. This is important because it gives you a hint if your server isn't performing as well as expected.

For example, if you encounter a period during which users are reporting slower performance on the server, how do you know the server is performing poorly? Let's say you've checked all of the usual things—memory, disk, etc.—and all are performing within tolerance and without error or other anomalies. How, then, do you know if things are running more slowly? Enter the benchmarks. You can rerun the benchmark test, and if the values produced are much larger (or smaller, depending on what you are measuring), you know the system is performing below expectations.

You can use the MySQL benchmark suite to establish your own benchmarks. The benchmark tool is located in the sql-bench folder and is normally included in the source code distribution. The benchmarks are written in Perl and use the Perl DBI module for access to the server. If you do not have Perl or the Perl DBI module, see the section titled "Installing Perl on Unix" in the online MySQL Reference Manual. Use the following command to run the benchmark suite:

./run-all-tests --server=mysql --cmp=mysql --user=root

This command will run the entire set of standard benchmark tests, recording the current results and comparing them with known results of running the tests on a MySQL server. Example 11-4 shows an excerpt of the results of running this command on a system with limited resources.

Example 11-4.
The MySQL benchmark suite results

cbell@cbell-mini:~/source/bzr/mysql-6.0-review/sql-bench$
Benchmark DBD suite: 2.15
Date of test:        2009-12-01 19:54:19
Running tests on:    Linux 2.6.28-16-generic i686
Comments:
Limits from:         mysql
Server version:      MySQL 6.0.14 alpha debug log
Optimization:        None
Hardware:

alter-table: Total time: 77 wallclock secs
  ( 0.12 usr 0.05 sys + 0.00 cusr 0.00 csys = 0.17 CPU)
ATIS: Total time: 150 wallclock secs
  (20.22 usr 0.56 sys + 0.00 cusr 0.00 csys = 20.78 CPU)
big-tables: Total time: 135 wallclock secs
  (45.73 usr 1.16 sys + 0.00 cusr 0.00 csys = 46.89 CPU)
connect: Total time: 1359 wallclock secs
  (200.70 usr 30.51 sys + 0.00 cusr 0.00 csys = 231.21 CPU)
…

Although the output of this command isn't immediately valuable, recall that benchmarking is used to track changes in performance over time. Whenever you run the benchmark suite, you should compare it to your known baseline and your last several benchmark checks. Because load can influence the benchmarks, taking the benchmark data over several increments can help mitigate the influence of load for systems that run 24/7.

For example, if you see the wallclock times jump considerably from one run to another, this may not be an indication of a performance slowdown. You should also compare the detailed values, such as user and system time. Of course, an increase in the majority of these values can be an indication that the system is experiencing a heavy load. In this case, you should check the process list to see whether there are indeed a lot of users and queries running. If that is the case, run the benchmark suite again when the load is less and compare the values. If they decrease, you can deduce it was due to sporadic load. On the other hand, if the values remain larger (hence, the system is slower), you should begin investigating why the system is taking longer to execute the benchmark tests.
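Because the suite's value comes from comparing runs over time, it helps to reduce each saved run to its per-test totals. A minimal sketch, assuming you save each run's output in the format shown in Example 11-4:

```shell
# Print "test-name seconds" for each "Total time" line in a saved
# sql-bench report, so two runs can be compared with diff(1).
extract_totals() {
  awk -F': *' '/Total time:/ { split($3, t, " "); print $1, t[1] }' "$1"
}
```

For example, `extract_totals baseline.txt > baseline-totals.txt`, then diff that against the totals of your latest run to spot tests whose wallclock time has drifted.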
The Benchmark Function

MySQL contains a built-in function called BENCHMARK() that you can use to execute a simple expression repeatedly and obtain a benchmark result. It is best used when testing other functions or expressions to determine whether they are causing delays. The function takes two parameters: a counter for looping and the expression you want to test. The following example shows the results of running 10,000,000 iterations of the CONCAT function:

mysql> SELECT BENCHMARK(10000000, "SELECT CONCAT('te','s',' t')");
+-----------------------------------------------------+
| BENCHMARK(10000000, "SELECT CONCAT('te','s',' t')") |
+-----------------------------------------------------+
|                                                   0 |
+-----------------------------------------------------+
1 row in set (0.06 sec)

The diagnostic output of this function is the time it takes to run the benchmark: the function always returns 0, and the elapsed time reported by the client is the measurement. In this example, it took 0.06 seconds to run the iterations. If you are exploring a complex query, consider testing portions of it using this function. You may find the problem is related to only one part of the query. For more information about the benchmark suite, see the online MySQL Reference Manual.

Now that we have discussed the various tools available for monitoring MySQL and have looked at some best practices, we turn our attention to capturing and preserving operational and diagnostic information using logfiles.

Server Logs

If you are a seasoned Linux or Unix administrator, you are familiar with the concepts and importance of logging. The MySQL server was born of this same environment. Consequently, MySQL creates several logs that contain vital information about errors, events, and data changes. This section examines the various logs in a MySQL server, including the role each log plays in monitoring and performance improvement. Logfiles can provide a lot of information about past events.
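One common use is comparing two candidate expressions under the same iteration count. A sketch of such a session (the expressions are illustrative, and your timings will differ); note that passing the expression directly, rather than as a string literal, makes BENCHMARK() evaluate it on every iteration:

```sql
mysql> SELECT BENCHMARK(10000000, CONCAT('te', 's', ' t'));
mysql> SELECT BENCHMARK(10000000, CONCAT_WS('', 'te', 's', ' t'));
```

Each statement returns 0; the lower elapsed time reported by the client identifies the cheaper expression.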
There are several types of logs in the MySQL server:

• General query log
• Slow query log
• Error log
• Binary log

You can turn any of the logs on or off using startup options. Most installations have at least the error log enabled.

The general query log, as its name implies, contains information about what the server is doing, such as connections from clients, as well as a copy of the commands sent to the server. As you can imagine, this log grows very quickly. Examine the general query log whenever you are trying to diagnose client-related errors or to determine which clients are issuing certain types of commands.

Commands in the general query log appear in the same order in which they were received from the clients and may not reflect the actual order of execution.

Turn on the general query log by specifying the --general-log startup option. You can specify the name of the logfile using the --general-log-file startup option, and the log destination (file, table, or both) using the --log-output startup option. These options have dynamic variable equivalents. For example, SET GLOBAL log_output = 'FILE'; sets the log output for a running server to write to a file. Finally, you can read the values of either of these variables using the SHOW VARIABLES command.

The slow query log stores a copy of long-running queries. It is in the same format as the general log, and you can control it in the same manner with the --slow-query-log startup option (formerly --log-slow-queries). The server variable that controls which queries go into the slow query log is long_query_time (in seconds). You should tune this variable to meet the expectations for your server and applications to help track times when queries perform slower than desired. You can send log entries to a file, a table, or both by setting log_output to FILE, TABLE, or 'FILE,TABLE', respectively. The slow query log can be a very effective tool for tracking problems with queries before the users complain. The goal, of course, is to keep this log small or, better still, empty at all times.
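The query log options above can be collected in the server configuration file. A sketch of such a fragment — the paths are examples and the 2-second threshold is an illustration, not a recommendation:

```ini
[mysqld]
# General query log: grows quickly, so enable only while diagnosing
general_log      = ON
general_log_file = /var/log/mysql/general.log
log_output       = FILE

# Slow query log: record statements taking longer than 2 seconds
slow_query_log   = ON
long_query_time  = 2
```

After editing the file, restart the server (or use the SET GLOBAL equivalents shown above for a running server).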
That is not to say you should set the variable very high; on the contrary, you should set it to your expectations and adjust the value as your expectations or circumstances change.

A replication slave does not log slow queries by default. However, if you use the --log-slow-slave-statements option, it will write slow-running events to its slow query log.

The error log contains information gathered when the MySQL server starts or stops. It also contains the errors generated while the server is running. The error log is your first stop when analyzing a failed or impaired MySQL server. On some operating systems, the error log can also contain a stack trace (or core dump).

You can turn the error log on or off using the --log-error startup option. The default name for the error log is the hostname with the extension .err appended, and it is saved in the data directory by default, but you can override the name and location by supplying a path to the --log-error option. If you start your server with --console, errors are written to standard error output as well as to the error log.

The binary log stores all of the changes made to the data on the server, as well as statistical information about the execution of the original command on the server. The online MySQL Reference Manual states that the binary logs are used for backup; however, practice shows that replication is a more popular use of the binary log. The unique format of the binary log allows you to use the log for incremental backups, where you store the binlog files created between each backup. You do this by flushing and rotating the binary logs (closing the current log and opening a new one); this allows you to save the set of changes made since your last backup. This same technique lets you perform PITR, where you restore data from a backup and apply the binary log up to a specific point or date. For more information about the binary log, see Chapter 4.
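The flush-and-rotate technique for incremental backups can be sketched as a short session; the binlog filename shown is an example:

```sql
mysql> FLUSH LOGS;        -- close the current binary log and open a new one
mysql> SHOW BINARY LOGS;  -- identify the file that was just closed
-- copy the closed file (e.g., mysql-bin.000012) to your backup location
```

Repeating this at each backup interval gives you a sequence of binlog files, each holding the changes made since the previous flush.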
For more information about PITR, see Chapter 15.

Because the binary log makes copies of every data change, it does add a small amount of overhead to the server, but the performance penalty is well worth the benefits. However, system configuration, such as disk setup and storage engine choice, can greatly affect the overhead of the binary log. For example, there is no concurrent commit when using the binary log with the InnoDB storage engine, which may be a concern in high-write scenarios.

Turn on the binary log using the --log-bin startup option, specifying the root filename of the binary log. The server appends a numeric sequence to the end of the filename, allowing for automatic and manual rotations of the log. While not normally necessary, you can also change the name of the index file for the binary logs by specifying the --log-bin-index startup option. Perform log rotations using the FLUSH LOGS command. You can also control what is logged (inclusive logging) or what is excluded (exclusive logging) using the --binlog-do-db and --binlog-ignore-db options, respectively.

Performance Schema

In this section, we present the Performance Schema feature as a technique for measuring the internal execution of the server, which can help you diagnose performance problems. While this section introduces the feature and contains a brief startup guide, it does not contain all of the possible configuration and setup parameters and options, nor does it contain a complete guide to using the Performance Schema views. For a complete, detailed explanation of how to set up and use the Performance Schema tables, see the online reference manual under the heading "MySQL Performance Schema."

A recent addition to the MySQL server, the Performance Schema feature is presented as a database named performance_schema (sometimes shown in all capitals). It contains a set of dynamic tables (stored in memory) that enable you to see very low-level metrics of server execution.
This feature was added to the server in version 5.5.3. The Performance Schema feature provides metadata on the execution of the server, right down to the line of code being executed. Indeed, it is possible to monitor precisely a mechanism in a particular source file.

For this reason, the Performance Schema is often considered a developer tool for diagnosing execution of the server code itself. This is because it is most often used to diagnose deadlocks, mutex, and thread problems. However, it is much more than that! You can get metrics for stages of query optimization, file I/O, connections, and much more. Yes, it is very low level and will indeed show you references to source code. But although most metrics target specific code components, the tool also provides historical data as well as current values for a metric. This can be particularly useful if you are diagnosing a difficult performance problem that you can isolate to a specific use case.

You may be thinking that this would create a tremendous load on the server and incur a severe penalty on performance. For some external monitoring solutions, this is true, but the Performance Schema is designed to have little or no measurable impact on server performance. This is possible because of the way the feature is intertwined with the server: it takes advantage of many of the optimizations in the server that external tools simply cannot.

The following section presents a terse introduction to the terms and concepts used in the Performance Schema. Later sections will show you how to use the feature to diagnose performance problems.

Concepts

This section presents the basic concepts of the Performance Schema in an effort to make it easier for you to get started using it to gather metrics. The Performance Schema appears in your list of databases (SHOW DATABASES) as performance_schema and contains a number of dynamic tables that you can see with SHOW TABLES.
Example 11-5 lists the available tables in an early release candidate of the MySQL 5.6 server. The number of tables is likely to expand with future releases of the server.

Example 11-5. performance_schema tables

mysql> SHOW TABLES;
+----------------------------------------------------+
| Tables_in_performance_schema                       |
+----------------------------------------------------+
| accounts                                           |
| cond_instances                                     |
| events_stages_current                              |
| events_stages_history                              |
| events_stages_history_long                         |
| events_stages_summary_by_account_by_event_name     |
| events_stages_summary_by_host_by_event_name        |
| events_stages_summary_by_thread_by_event_name      |
| events_stages_summary_by_user_by_event_name        |
| events_stages_summary_global_by_event_name         |
| events_statements_current                          |
| events_statements_history                          |
| events_statements_history_long                     |
| events_statements_summary_by_account_by_event_name |
| events_statements_summary_by_digest                |
| events_statements_summary_by_host_by_event_name    |
| events_statements_summary_by_thread_by_event_name  |
| events_statements_summary_by_user_by_event_name    |
| events_statements_summary_global_by_event_name     |
| events_waits_current                               |
| events_waits_history                               |
| events_waits_history_long                          |
| events_waits_summary_by_account_by_event_name      |
| events_waits_summary_by_host_by_event_name         |
| events_waits_summary_by_instance                   |
| events_waits_summary_by_thread_by_event_name       |
| events_waits_summary_by_user_by_event_name         |
| events_waits_summary_global_by_event_name          |
| file_instances                                     |
| file_summary_by_event_name                         |
| file_summary_by_instance                           |
| host_cache                                         |
| hosts                                              |
| mutex_instances                                    |
| objects_summary_global_by_type                     |
| performance_timers                                 |
| rwlock_instances                                   |
| session_account_connect_attrs                      |
| session_connect_attrs                              |
| setup_actors                                       |
| setup_consumers                                    |
| setup_instruments                                  |
| setup_objects                                      |
| setup_timers                                       |
| socket_instances                                   |
| socket_summary_by_event_name                       |
| socket_summary_by_instance                         |
| table_io_waits_summary_by_index_usage              |
| table_io_waits_summary_by_table                    |
| table_lock_waits_summary_by_table                  |
| threads                                            |
| users                                              |
+----------------------------------------------------+
52 rows in set (0.01 sec)

The Performance Schema monitors events, where an event is any discrete execution that has been instrumented in the code (at what are called "instrument points") and has a measurable duration. For example, an event could be a method call, a mutex lock/unlock, or a file I/O. Events are stored as a current event (the most recent value), historical values, and summaries (aggregates).

Performance Schema events are not the same as binary log events.

An instrument, therefore, consists of the instrument points in the server (source) that produce events when they execute. An instrument must be enabled in order to fire an event. You can monitor specific users (threads) using the setup_actors table. You can monitor specific tables, or all tables in certain databases, using the setup_objects table. Currently, only table objects are supported.

A timer is a type of execution that is measured by a time duration. Timers include idle, wait, stage, and statement. You can change the duration of timers to change the frequency of the measurement. Values include CYCLE, NANOSECOND, MICROSECOND, MILLISECOND, and TICK. You can see the available timers by examining the rows in the performance_timers table.

Setup tables are used to enable or disable actors, instruments, objects (tables), and timers.

Getting Started

The Performance Schema is enabled at startup, and its instrumentation is configured at startup or at runtime. You can check whether your server supports the Performance Schema and whether it is turned on by examining the performance_schema variable. A value of ON indicates the feature is enabled.
To enable the Performance Schema at startup, use the performance_schema option in your configuration file:

[mysqld]
…
performance_schema=ON
…

Enabling the Performance Schema and configuring events to monitor at startup requires the use of several startup options. Depending on the level of detail you want to collect, configuring everything at startup can become complicated. Fortunately, all of the required and voluntary options and their values can be stored in your configuration file. If you want to collect all available events for a specific server under controlled conditions, it may be easier to configure everything at startup.

However, most administrators will want to configure the Performance Schema at runtime. You must enable the feature itself via the --performance-schema startup option or your configuration file; once it is enabled, you can configure the events you want to record at runtime. This involves modifying the rows in the setup and configuration tables. This section will demonstrate the process you use to enable events and instruments in preparation for collecting data to diagnose performance problems.

To enable monitoring with the Performance Schema, begin by setting the timers you want to use, setting the events you want to enable, and enabling the instruments you want to monitor. For example, if you want to monitor all SHOW GRANTS events, begin by setting the timer for the statement object. In this case, we will use the standard NANOSECOND timing. You can check the current setting by examining the setup_timers table:

mysql> select * from setup_timers;
+-----------+-------------+
| NAME      | TIMER_NAME  |
+-----------+-------------+
| idle      | MICROSECOND |
| wait      | CYCLE       |
| stage     | NANOSECOND  |
| statement | NANOSECOND  |
+-----------+-------------+
4 rows in set (0.01 sec)

Next, enable the instrument for the SQL statement as follows. In this case, we set some columns in the setup_instruments table to YES for the specific command (SHOW GRANTS).
More specifically, we enable the instrumentation of the metric and enable its timer property:

mysql> UPDATE setup_instruments SET enabled='YES', timed='YES' WHERE name = 'statement/sql/show_grants'; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0

Next, enable the consumers for the events_statements_current and events_statements_history tables:

mysql> UPDATE setup_consumers SET enabled='YES' WHERE name = 'events_statements_current'; Query OK, 0 rows affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0

mysql> UPDATE setup_consumers SET enabled='YES' WHERE name = 'events_statements_history'; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0

Now execute the SHOW GRANTS command and examine the events_statements_current and events_statements_history tables:

mysql> show grants \G *************************** 1. row *************************** Grants for root@localhost: GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY PASSWORD '*81F5E21E35407D884A6CD4A731AEBFB6AF209E1B' WITH GRANT OPTION *************************** 2. row *************************** Grants for root@localhost: GRANT PROXY ON ''@'' TO 'root'@'localhost' WITH GRANT OPTION 2 rows in set (0.01 sec)

mysql> select * from events_statements_current \G *************************** 1.
row *************************** THREAD_ID: 22 EVENT_ID: 80 END_EVENT_ID: NULL EVENT_NAME: statement/sql/select SOURCE: mysqld.cc:903 TIMER_START: 13104624563678000 TIMER_END: NULL TIMER_WAIT: NULL LOCK_TIME: 136000000 SQL_TEXT: select * from events_statements_current DIGEST: NULL DIGEST_TEXT: NULL CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 0 ROWS_EXAMINED: 0 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 1 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 1 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL 1 row in set (0.00 sec)

mysql> select * from events_statements_history \G *************************** 1. row *************************** THREAD_ID: 22 EVENT_ID: 77 END_EVENT_ID: 77 EVENT_NAME: statement/sql/select SOURCE: mysqld.cc:903 TIMER_START: 12919040536455000 TIMER_END: 12919040870255000 TIMER_WAIT: 333800000 LOCK_TIME: 143000000 SQL_TEXT: select * from events_statements_history DIGEST: 77d3399ea8360ffc7b8d584c0fac948a DIGEST_TEXT: SELECT * FROM `events_statements_history` CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 1 ROWS_EXAMINED: 1 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 1 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 1 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL *************************** 2.
row *************************** THREAD_ID: 22 EVENT_ID: 78 END_EVENT_ID: 78 EVENT_NAME: statement/sql/show_grants SOURCE: mysqld.cc:903 TIMER_START: 12922392541028000 TIMER_END: 12922392657515000 TIMER_WAIT: 116487000 LOCK_TIME: 0 SQL_TEXT: show grants DIGEST: 63ca75101f4bfc9925082c9a8b06503b DIGEST_TEXT: SHOW GRANTS CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 0 ROWS_EXAMINED: 0 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 0 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 0 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL *************************** 3. row *************************** THREAD_ID: 22 EVENT_ID: 74 END_EVENT_ID: 74 EVENT_NAME: statement/sql/show_grants SOURCE: mysqld.cc:903 TIMER_START: 12887992696398000 TIMER_END: 12887992796352000 TIMER_WAIT: 99954000 LOCK_TIME: 0 SQL_TEXT: show grants DIGEST: 63ca75101f4bfc9925082c9a8b06503b DIGEST_TEXT: SHOW GRANTS CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 0 ROWS_EXAMINED: 0 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 0 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 0 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL *************************** 4.
row *************************** THREAD_ID: 22 EVENT_ID: 75 END_EVENT_ID: 75 EVENT_NAME: statement/sql/select SOURCE: mysqld.cc:903 TIMER_START: 12890520653158000 TIMER_END: 12890521011318000 TIMER_WAIT: 358160000 LOCK_TIME: 148000000 SQL_TEXT: select * from events_statements_current DIGEST: f06ce227c4519dd9d9604a3f1cfe3ad9 DIGEST_TEXT: SELECT * FROM `events_statements_current` CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 1 ROWS_EXAMINED: 1 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 1 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 1 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL *************************** 5. row *************************** THREAD_ID: 22 EVENT_ID: 76 END_EVENT_ID: 76 EVENT_NAME: statement/sql/select SOURCE: mysqld.cc:903 TIMER_START: 12895480384972000 TIMER_END: 12895480736605000 TIMER_WAIT: 351633000 LOCK_TIME: 144000000 SQL_TEXT: select * from events_statements_history DIGEST: 77d3399ea8360ffc7b8d584c0fac948a DIGEST_TEXT: SELECT * FROM `events_statements_history` CURRENT_SCHEMA: performance_schema OBJECT_TYPE: NULL OBJECT_SCHEMA: NULL OBJECT_NAME: NULL OBJECT_INSTANCE_BEGIN: NULL MYSQL_ERRNO: 0 RETURNED_SQLSTATE: NULL MESSAGE_TEXT: NULL ERRORS: 0 WARNINGS: 0 ROWS_AFFECTED: 0 ROWS_SENT: 1 ROWS_EXAMINED: 1 CREATED_TMP_DISK_TABLES: 0 CREATED_TMP_TABLES: 0 SELECT_FULL_JOIN: 0 SELECT_FULL_RANGE_JOIN: 0 SELECT_RANGE: 0 SELECT_RANGE_CHECK: 0 SELECT_SCAN: 1 SORT_MERGE_PASSES: 0 SORT_RANGE: 0 SORT_ROWS: 0 SORT_SCAN: 0 NO_INDEX_USED: 1 NO_GOOD_INDEX_USED: 0 NESTING_EVENT_ID: NULL NESTING_EVENT_TYPE: NULL 5 rows in set (0.00 sec)

Notice that the output for the
events_statements_current table shows only the most recently recorded statement, whereas the output for events_statements_history shows recent queries for the events that were enabled. We enabled both the statement/sql/select and statement/sql/show_grants instruments in this example, so events of both types are shown.

While the example is rather simplistic, there is a wealth of information to be gained from this technique. For example, the output includes timing information, such as when the query started and ended, as well as lock time. We also see warning and error counts, information about how the query was optimized, and indications of whether indexes were used. The steps in this example are representative of the steps you would use to enable other instruments and events. In summary, to enable monitoring using Performance Schema you should:

1. Set the timer (applies to instruments with a timing element).
2. Enable the instrument.
3. Enable the consumer.

Filtering Events

There are two techniques for filtering events: prefiltering and postfiltering. Prefiltering is accomplished by modifying the Performance Schema setup configuration to turn on only those events you want collected from certain producers by certain consumers. Prefiltering reduces overhead and avoids filling the history tables with metrics you don't need (by not maintaining unneeded consumers). The drawback of prefiltering is that it requires you to predict which events you want to examine before you run your test.

Postfiltering is typically done by enabling a host of producers and consumers to collect as much information as possible, then filtering after the data is collected by using WHERE clauses on the Performance Schema tables. Postfiltering is done on a per-user basis (in the WHERE clause).
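Whichever filtering approach you use, the TIMER_START, TIMER_END, and TIMER_WAIT columns you query are normalized to picoseconds regardless of the timer unit selected. A small conversion helper (our own sketch, not part of MySQL) makes the raw values readable:

```python
# Convert Performance Schema TIMER_* values into milliseconds.
# The event tables normalize all timings to picoseconds,
# whatever unit the underlying timer uses.
PICOSECONDS_PER_MS = 10**9  # 1 millisecond = 10^9 picoseconds

def ps_to_ms(picoseconds):
    """Return a TIMER_WAIT-style picosecond value in milliseconds."""
    return picoseconds / PICOSECONDS_PER_MS

# TIMER_WAIT of the first events_statements_history row shown earlier
print(ps_to_ms(333800000))  # about 0.33 ms for that SELECT
```

This kind of conversion is handy when comparing TIMER_WAIT values across rows collected with different timers.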
You would use postfiltering in cases where you are not certain which events you need to collect: for example, when there is no repeatable use case. Which technique to use depends on how much data you want to collect. If you know what you are looking for (the metrics to measure) and want to record only those events, prefiltering is the technique to use. On the other hand, if you are unsure of what you are looking for, or you need to gather metrics over time, you may want to consider postfiltering and explore the Performance Schema tables using SELECT statements to narrow the scope of your search.

Using Performance Schema to Diagnose Performance Problems

This section presents an alternative to the methodology listed in the online reference manual for diagnosing problems using Performance Schema. It includes a much improved process that ensures your server is returned to its original state. Like the example in the reference manual, the methodology assumes you have a set of operations that exhibits a repeatable problem over several databases. One word of caution: it is likely your use case will not be so cut-and-dried, and you may need to reproduce more than just the data and the queries. For example, if your diagnosis involves problems associated with load or certain other conditions (a number of connections, a certain application, etc.), you may need to reproduce that load and those conditions as well.

Another matter you should sort out before using Performance Schema is which parameters, variables, and options you will use to tune your server. It does little good to tinker with your server if you are not certain what you need to tune. You may not know precisely what to tune, but you should have a good idea at this point. Also, be sure to record the current value before you change it.
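The record-change-compare discipline can be sketched as a loop. Here, measure and apply_setting are hypothetical stand-ins for running your repeatable workload and changing one server option; they are not MySQL APIs:

```python
# Sketch of a one-variable-at-a-time tuning loop. `measure` runs the
# repeatable workload and returns elapsed time (lower is better);
# `apply_setting` changes one server option. Both are hypothetical
# callbacks supplied by the tester.

def tune(settings, measure, apply_setting):
    """Try each candidate setting; keep it only if the workload improves."""
    best = measure()  # baseline with the original configuration
    kept = {}
    for name, (old_value, new_value) in settings.items():
        apply_setting(name, new_value)   # change one and only one option
        score = measure()                # reproduce the problem and observe
        if score < best:
            best, kept[name] = score, new_value
        else:
            apply_setting(name, old_value)  # restore the recorded original
    return kept, best
```

The loop makes the restore step explicit, mirroring the advice to return every unhelpful change to its recorded original value.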
The normal course of tuning is to change one and only one thing at a time, and to compare the performance before and after the change. If no positive change occurs, restore the original value before moving on to another parameter or option.

The following steps can be used to diagnose a performance problem with Performance Schema:

1. Query the setup_instruments table to identify all related instruments and enable them.
2. Set up the timers for the frequency you need to record. Most of the time, the defaults are the correct timer values. If you change the timers, record their original values.
3. Identify the consumers (event tables) associated with the instruments and enable them. Be sure to enable the current, history, and history_long variants.
4. Truncate the *_history and *_history_long tables to ensure you start with a "clean" state.
5. Reproduce the problem.
6. Query the Performance Schema tables. If your server has multiple clients running, you can isolate the rows by thread ID.
7. Observe the values and record them.
8. Tune one option/parameter/variable.
9. Return to step 5. Repeat until performance is improved.
10. Truncate the *_history and *_history_long tables to ensure you end with a "clean" state.
11. Disable the events you enabled.
12. Disable the instruments you enabled.
13. Return the timers to their original state.
14. Truncate the *_history and *_history_long tables once more to ensure you end with a "clean" state.

MySQL Monitoring Taxonomy

The previous sections have demonstrated a number of devices you can use to monitor MySQL. Some devices, such as system and status variables, expose many metrics you can inspect for clues to the cause of a performance, accessibility, or resource issue. Learning which of them can or should be used is crucial to solving the problem, and can save days of research. What is needed is a map of the various devices, tools, and metrics for monitoring MySQL.
The following table presents a classification of monitoring devices you can use to effectively monitor your MySQL servers. Table 11-1 organizes tasks by focus area, device, and metric. Examples are shown to give context for the metrics.

Table 11-1. MySQL monitoring taxonomy

Focus | Device | Metric | Example
Performance | System Variables | Query Cache | SHOW VARIABLES LIKE '%query_cache%'
Performance | Status Variables | Number of Inserts | SHOW STATUS LIKE 'com_insert'
Performance | Status Variables | Number of Deletes | SHOW STATUS LIKE 'com_delete'
Performance | Status Variables | Table Lock Collisions | SHOW STATUS LIKE 'table_locks_waited'
Performance | Logging | Slow Queries | SELECT * FROM slow_log ORDER BY query_time DESC
Performance | Logging | General | SELECT * FROM general_log
Performance | Logging | Errors | --log-error=file_name (startup variable)
Performance | Performance Schema | Thread Information | SELECT * FROM threads
Performance | Performance Schema | Mutex Information | SELECT * FROM events_waits_current
Performance | Performance Schema | Mutex Information | SELECT * FROM mutex_instances
Performance | Performance Schema | File Use Summary | SELECT * FROM file_summary_by_instance
Performance | Storage Engine Features | InnoDB Status | SHOW ENGINE innodb STATUS
Performance | Storage Engine Features | InnoDB Statistics | SHOW STATUS LIKE '%Innodb%'
Performance | External Tools | Processlist | mysqladmin -uroot --password processlist --sleep 3
Performance | External Tools | Connection Health (graph) | MySQL Workbench
Performance | External Tools | Memory Health (graph) | MySQL Workbench
Performance | External Tools | InnoDB Rows Read | MySQL Workbench
Performance | External Tools | Logs | MySQL Workbench
Performance | External Tools | All Variables | MySQL Workbench
Performance | External Tools | Query Plan/Execution [a] | MySQL Workbench
Performance | External Tools | Benchmarking | MySQL Benchmark Suite
Availability | Status Variables | Connected Threads | SHOW STATUS LIKE 'threads_connected'
Availability | Operating System Tools | Accessibility | ping
Availability | External Tools | Accessibility | mysqladmin -uroot --password extended-status --relative --sleep 3
Resources | Status Variables | Storage Engines Supported | SHOW ENGINES
Resources | Operating System Tools | CPU Usage | top -n 1 -pid mysqld_pid
Resources | Operating System Tools | RAM Usage | top -n 1 -pid mysqld_pid
Resources | MySQL Utilities | Disk Usage | mysqldiskusage
Resources | MySQL Utilities | Server Information | mysqlserverinfo
Resources | MySQL Utilities | Replication Health | mysqlrpladmin

[a] You can also use the EXPLAIN SQL command.

As you can see, the bulk of monitoring techniques are geared toward performance monitoring. This is no surprise, given that the database server is often the focus of many applications and potentially thousands of users. You can also see from this table that there are several devices you can use to help investigate performance problems. Often, several of these devices and the metrics they expose will lead you to the solution to your performance issue. Now that you have a road map for approaching MySQL monitoring, you can use it to focus your efforts on the appropriate devices.

It is often the case that you need to investigate performance problems for a particular database (or several databases), or must improve the performance of a set of queries that are causing bottlenecks in your applications. We study the techniques and best practices for improving database and query performance in the following sections.

Database Performance

Monitoring the performance of an individual database is one of the few areas in the MySQL feature set where community and third-party developers have improved the MySQL experience. MySQL includes some basic tools you can use to improve performance, but they do not have the sophistication of some other system-tuning tools.
Due to this limitation, most MySQL DBAs earn their pay through experience in relational query optimization techniques. We recognize there are several excellent references that cover database performance in great detail, and many readers are likely to be well versed in basic database optimization. Here are a few resources you can turn to:

• Refactoring SQL Applications by Stéphane Faroult and Pascal L'Hermite (O'Reilly)
• SQL and Relational Theory: How to Write Accurate SQL Code by C.J. Date (O'Reilly)
• SQL Cookbook by Anthony Molinaro (O'Reilly)

Rather than reintroducing query optimization techniques, we will concentrate on how you can work with the tools available in MySQL to assist in optimizing databases. We will use a simple example and a known sample database to illustrate the query performance commands in MySQL. In the next section, we list best practices for improving database performance.

Measuring Database Performance

Traditionally, database management systems have provided profiling and indexing tools that report statistics you can use to fine-tune indexes. Although there are some basic elements that can help you improve database performance in MySQL, there is no advanced profiling tool available as open source. Although the basic MySQL installation does not include formal tools for monitoring database improvement, the MySQL Enterprise Monitor suite offers a host of performance monitoring features. We will discuss this tool in more detail in Chapter 16.

Fortunately, MySQL provides a few simple tools to help you determine whether your tables and queries are optimal. They are all SQL commands: EXPLAIN, ANALYZE TABLE, and OPTIMIZE TABLE. The following sections describe each of these commands in greater detail.

Using EXPLAIN

The EXPLAIN command gives information about how a SELECT statement (EXPLAIN works only for SELECT statements) can be executed.
Here is the syntax for EXPLAIN (note that EXPLAIN is a synonym for the DESCRIBE command found in other database systems):

[EXPLAIN | DESCRIBE] [EXTENDED] SELECT select_options

You can also use the EXPLAIN and DESCRIBE commands to view details about the columns or partitions of a table. The syntax for this version of the command is:

[EXPLAIN | DESCRIBE] [PARTITIONS SELECT * FROM] table_name

A synonym for EXPLAIN table_name is SHOW COLUMNS FROM table_name.

We will discuss the first use of the EXPLAIN command: examining a SELECT statement to see how the MySQL optimizer executes it. The result is a step-by-step list of the join operations the optimizer predicts it would require to execute the statement.

The best use of this command is to determine whether you have the best indexes on your tables to allow for more precise targeting of candidate rows. You can also use the results to test the various optimizer override options. While this is an advanced technique and generally discouraged, under the right circumstances you may encounter a query that runs faster with certain optimizer options. We will see an example of this later in this section.

Now let's look at some examples of the EXPLAIN command in action. The following examples are queries executed on the sakila sample database provided for MySQL development and experimentation. Let's begin with a simple and seemingly harmless query: say we want to see all of the films rated higher than PG.
The result set contains one row per execution step, with the following columns:

id — sequence number of the statement in order of execution
select_type — the type of statement executed
table — the table operated on for this step
type — the type of join to be used
possible_keys — the indexes available for the optimizer to choose from
key — the key (index) selected by the optimizer
key_len — the length of the key or portion of the key used
ref — constraints or columns to be compared
rows — an estimate of the number of rows to process
Extra — additional information from the optimizer

If the type column shows ALL, you are doing a full table scan. You should strive to avoid that by adding indexes or rewriting your query. Similarly, if this column shows INDEX, you are doing a full index scan, which is very inefficient. See the online MySQL Reference Manual for more details on the types of joins and their consequences. Example 11-6 shows how the MySQL optimizer executes this statement. We use \G to request a vertical display format for clarity.

The table we are using in the example contains a field (column) that is defined as an enumerated type. Enumerated types permit you to provide a list of possible values. If you did not use an enumerated type and instead defined a lookup table, you would have to perform a join to select results by the value of the field. Thus, enumerated values can replace small lookup tables and can therefore be used to improve performance. This is because the text for the enumerated values is stored only once, in the table header structures; what is saved in each row is a numeric reference that forms an index (array index) into the list of enumerated values. Enumerated value lists can save space and can make traversing the data a bit more efficient. An enumerated field allows one and only one value.
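As a rough guide to reading the type column described above, the common access methods can be ranked from worst to best. This ordering is a rule of thumb we use here for illustration, not an official MySQL metric:

```python
# Rough cost ordering of common EXPLAIN join/access types, worst to
# best. This ranking is a rule of thumb, not an official MySQL scale.
ACCESS_TYPES = ["ALL", "index", "range", "ref", "eq_ref", "const", "system"]

def better_access(a, b):
    """Return whichever of two access types is generally cheaper."""
    return max((a, b), key=ACCESS_TYPES.index)

print(better_access("ALL", "ref"))  # "ref": an index lookup beats a full scan
```

Comparing the type reported before and after adding an index gives a quick sanity check that a change actually improved the access method.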
In the following example, the film table in the sakila database has an enumerated field named rating taking the values G, PG, PG-13, R, and NC-17. In the examples that follow, we will see how this enumerated field can be used (and misused) in queries.

Example 11-6. A simple SELECT statement

mysql> EXPLAIN SELECT * FROM film WHERE rating > 'PG' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 892 Extra: Using where 1 row in set (0.01 sec)

You can see from this output that the optimizer has only one step to execute and that it is not using any indexes. This makes sense because we are not using any indexed columns. Furthermore, even though there is a WHERE clause, the optimizer will still have to do a full table scan. This may be the right choice when you consider the columns used and the lack of indexes. However, if we ran this query hundreds of thousands of times, the full table scan would be a very poor use of time. In this case, we know from looking at the results that adding an index should improve execution (Example 11-7).

Example 11-7. Adding an index to improve query performance

mysql> ALTER TABLE film ADD INDEX film_rating (rating); Query OK, 0 rows affected (0.42 sec) Records: 0 Duplicates: 0 Warnings: 0

Let's add an index to the table and try again. Example 11-8 shows the improved query plan.

Example 11-8. Improved query plan

mysql> EXPLAIN SELECT * FROM film WHERE rating > 'PG' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: film_rating key: NULL key_len: NULL ref: NULL rows: 892 Extra: Using where 1 row in set (0.00 sec)

For those of you with sharp eyes who have already spotted the problem, bear with us as we work through it.
Here we see that the query has now identified an index (possible_keys) but is still not using one, because the key field is NULL. So what can we do? For this simple example, you may note that only 892 rows are expected to be read. The actual row count is 1,000 rows, and the result set would contain only 418 rows. Clearly, it would be a much faster query if it read only 42% of the rows!

Now let's see whether we can get any additional information from the optimizer by using the EXTENDED keyword. This keyword allows us to see extra information via the SHOW WARNINGS command; issue that command immediately after the EXPLAIN command. The warning text describes how the optimizer identifies table and column names in the statement, the internal rewrite of the query, any optimizer rules applied, and any additional notes about the execution. Example 11-9 shows the results of using the EXTENDED keyword.

Example 11-9. Using the EXTENDED keyword for more information

mysql> EXPLAIN EXTENDED SELECT * FROM film WHERE rating > 'PG' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: film_rating key: NULL key_len: NULL ref: NULL rows: 892 filtered: 100.00 Extra: Using where 1 row in set, 1 warning (0.00 sec)

mysql> SHOW WARNINGS \G *************************** 1.
row *************************** Level: Note Code: 1003 Message: select `sakila`.`film`.`film_id` AS `film_id`, `sakila`.`film`.`title` AS `title`, `sakila`.`film`.`description` AS `description`, `sakila`.`film`.`release_year` AS `release_year`, `sakila`.`film`.`language_id` AS `language_id`, `sakila`.`film`.`original_language_id` AS `original_language_id`, `sakila`.`film`.`rental_duration` AS `rental_duration`, `sakila`.`film`.`rental_rate` AS `rental_rate`, `sakila`.`film`.`length` AS `length`, `sakila`.`film`.`replacement_cost` AS `replacement_cost`, `sakila`.`film`.`rating` AS `rating`, `sakila`.`film`.`special_features` AS `special_features`, `sakila`.`film`.`last_update` AS `last_update` from `sakila`.`film` where (`sakila`.`film`.`rating` > 'PG') 1 row in set (0.00 sec)

This time, there is one warning containing information from the optimizer: a rewritten form of the query that includes all columns and explicitly references the column in the WHERE clause. While this tells us the query can be written more explicitly, it doesn't suggest any performance improvements. Fortunately, we can make the query more efficient. Let's see what happens when we issue a query for a specific rating rather than a range, with and without the index. Example 11-10 shows the results.

Example 11-10. Removing the range query

mysql> EXPLAIN SELECT * FROM film WHERE rating = 'R' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film type: ref possible_keys: film_rating key: film_rating key_len: 2 ref: const rows: 195 Extra: Using where 1 row in set (0.00 sec)

mysql> ALTER TABLE film DROP INDEX film_rating; Query OK, 0 rows affected (0.37 sec) Records: 0 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT * FROM film WHERE rating = 'R' \G *************************** 1.
row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 892 Extra: Using where 1 row in set (0.00 sec)

Now we see a little improvement. Notice that the first query plan does indeed use the index and results in a much improved plan. The question remains: why doesn't the optimizer use the index for the range query? In this case, we've used a nonunique index on an enumerated field. What sounded like a really good idea is actually not much help at all for a range query of enumerated values. However, we could rewrite the query differently (in several ways, actually) to produce better performance.

Let's look at the query again. We know we want all films rated higher than PG. We assumed that the rating is ordered and that the enumerated field reflects the order. Thus, it appears the order is maintained if we accept that the enumeration index for each value corresponds to the order (e.g., G = 1, PG = 2, etc.). But what if the order is incorrect, or if (as in this example) the list of values in the query is incomplete?

In the example we've chosen, where we want all of the films with a rating higher than PG, we know from our list of ratings that this includes films rated R or NC-17. Rather than using a range query, let's examine what the optimizer would do if we listed these values. Recall that we removed the index, so we will try the query first without the index, then add the index and see whether there is an improvement. Example 11-11 shows the improved query.

Example 11-11. Improved query without range

mysql> EXPLAIN SELECT * FROM film WHERE rating = 'R' OR rating = 'NC-17' \G *************************** 1.
row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 892 Extra: Using where 1 row in set (0.00 sec)

mysql> ALTER TABLE film ADD INDEX film_rating (rating); Query OK, 0 rows affected (0.40 sec) Records: 0 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT * FROM film WHERE rating = 'R' OR rating = 'NC-17' \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: film type: ALL possible_keys: film_rating key: NULL key_len: NULL ref: NULL rows: 892 Extra: Using where 1 row in set (0.00 sec)

Alas, that didn't work either. Again, we have chosen to query on a column that has an index, but not one the optimizer will use for this predicate. However, the optimizer can use the index for a simple equality comparison, because the values being compared are stored in the index. We can exploit this by rewriting the query as the union of two queries. Example 11-12 shows the rewritten query.

Example 11-12. Query rewritten using UNION

mysql> EXPLAIN SELECT * FROM film WHERE rating = 'R' UNION SELECT * FROM film WHERE rating = 'NC-17' \G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: film type: ref possible_keys: film_rating key: film_rating key_len: 2 ref: const rows: 195 Extra: Using where *************************** 2. row *************************** id: 2 select_type: UNION table: film type: ref possible_keys: film_rating key: film_rating key_len: 2 ref: const rows: 210 Extra: Using where *************************** 3. row *************************** id: NULL select_type: UNION RESULT table: <union1,2> type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: 3 rows in set (0.00 sec)

Success! Now we have a query plan that uses the index and processes far fewer rows.
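One way to quantify the gain is to sum the optimizer's row estimates across the plan's steps and compare them with the single-step full scan. A back-of-the-envelope check using the estimates above:

```python
# Compare the optimizer's row estimates for the two plans shown above:
# the full table scan versus the two indexed branches of the UNION.
full_scan_rows = 892                 # single-step plan, type ALL
union_plan_rows = sum([195, 210])    # the two ref lookups in the UNION

print(union_plan_rows)                              # 405 rows estimated
print(round(union_plan_rows / full_scan_rows, 2))   # about 0.45 of the scan
```

Row estimates are only the optimizer's guesses, but comparing them this way gives a quick sense of how much work a rewrite saves.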
We can see from the result of the EXPLAIN command that the optimizer runs each query individually (steps execute from row 1 down to row n) and combines the results in the last step.

MySQL has a session status variable named last_query_cost that stores the cost of the most recent query executed. Use this variable to compare two query plans for the same query: after each EXPLAIN, check the value of the variable. The query with the lowest cost value is considered the more efficient (less time-consuming) query. A value of 0 indicates that no query has been submitted for compilation.

While this exercise may seem like a lot of work for a little gain, consider that there are many such queries being executed in applications without anyone noticing the inefficiency. Normally we encounter these types of queries only when the row count gets large enough to notice. In the sakila database, there are only 1,000 rows, but what if there were a million, or tens of millions, of rows?

EXPLAIN is the only tool in a standard MySQL distribution that you can use by itself to profile a query. The "Optimization" chapter in the online MySQL Reference Manual has a host of tips and tricks to help an experienced DBA improve the performance of various query forms.

Using ANALYZE TABLE

The MySQL optimizer, like most traditional optimizers, uses statistical information about tables to perform its analysis of the optimal query execution plan. These statistics include information about indexes, distribution of values, and table structure, among many other items. The ANALYZE TABLE command recalculates the key distribution for one or more tables. This information determines the table order for a join operation. The syntax for the ANALYZE TABLE command is:

ANALYZE [LOCAL | NO_WRITE_TO_BINLOG] TABLE table_list

You should run this command whenever there have been significant updates to the table (e.g., bulk-loaded data).
The system holds a read lock on the table for the duration of the operation. You can update the key distribution only for MyISAM and InnoDB tables. Other storage engines don't support this tool, but all storage engines must report index cardinality statistics to the optimizer if they support indexes. Some storage engines, particularly third-party engines, have their own specific built-in statistics. A typical execution of the command is shown in Example 11-13. Running the command on a table with no indexes has no effect, but will not result in an error.

Example 11-13. Analyzing a table to update key distribution
mysql> ANALYZE TABLE film;
+-------------+---------+----------+----------+
| Table       | Op      | Msg_type | Msg_text |
+-------------+---------+----------+----------+
| sakila.film | analyze | status   | OK       |
+-------------+---------+----------+----------+
1 row in set (0.00 sec)

If you are using InnoDB, there are some cases when you should not use this command. See innodb_stats_persistent in the online reference manual for more details.

In this example, we see that the analysis completed and there were no unusual conditions. Should any unusual events occur during the execution of the command, the Msg_type field can indicate info, Error, or warning. In these cases, the Msg_text field gives you additional information about the event. You should always investigate the situation if you get any result other than status and OK. For example, if the .frm file for your table is corrupt or missing, you could see messages like those in Example 11-14. In other cases, the output may indicate the table is unreadable (e.g., permission/access issues). The command also performs checks specific to the storage engine; in the case of InnoDB, the checks are more thorough, and when there are errors you are likely to see InnoDB-specific errors.

Example 11-14.
Analyze table errors
mysql> ANALYZE TABLE test.t1;
+---------+---------+----------+-------------------------------+
| Table   | Op      | Msg_type | Msg_text                      |
+---------+---------+----------+-------------------------------+
| test.t1 | analyze | Error    | Table 'test.t1' doesn't exist |
| test.t1 | analyze | status   | Operation failed              |
+---------+---------+----------+-------------------------------+
2 rows in set (0.00 sec)

You can see the status of your indexes using the SHOW INDEX command. A sample of the output for the film table is shown in Example 11-15. In this case, we're interested in the cardinality of each index, which is an estimate of the number of unique values in it. We omit the other columns from the display for brevity. For more information about SHOW INDEX, see the online MySQL Reference Manual.

Example 11-15. The indexes for the film table
mysql> SHOW INDEX FROM film \G
*************************** 1. row ***************************
       Table: film
  Non_unique: 0
    Key_name: PRIMARY
Seq_in_index: 1
 Column_name: film_id
   Collation: A
 Cardinality: 1028
...
*************************** 2. row ***************************
       Table: film
  Non_unique: 1
    Key_name: idx_title
Seq_in_index: 1
 Column_name: title
   Collation: A
 Cardinality: 1028
...
*************************** 3. row ***************************
       Table: film
  Non_unique: 1
    Key_name: idx_fk_language_id
Seq_in_index: 1
 Column_name: language_id
   Collation: A
 Cardinality: 2
...
*************************** 4. row ***************************
       Table: film
  Non_unique: 1
    Key_name: idx_fk_original_language_id
Seq_in_index: 1
 Column_name: original_language_id
   Collation: A
 Cardinality: 2
...
*************************** 5.
row ***************************
       Table: film
  Non_unique: 1
    Key_name: film_rating
Seq_in_index: 1
 Column_name: rating
   Collation: A
 Cardinality: 11
    Sub_part: NULL
      Packed: NULL
        Null: YES
  Index_type: BTREE
     Comment:
5 rows in set (0.00 sec)

Using OPTIMIZE TABLE

Tables that are updated frequently with new data and deletions can become fragmented quickly and, depending on the storage engine, can develop gaps of unused space or suboptimal storage structures. A badly fragmented table can result in slower performance, especially during table scans. The OPTIMIZE TABLE command restructures the data for one or more tables. This is especially beneficial for row formats with variable-length fields. It can be used only for MyISAM and InnoDB tables. The syntax is:

OPTIMIZE [LOCAL | NO_WRITE_TO_BINLOG] TABLE table_list

The LOCAL or NO_WRITE_TO_BINLOG keyword prevents the command from being written to the binary log (and thereby from being replicated in a replication topology). This can be very useful if you want to experiment or tune while replicating data, or if you want to omit this step from your binary log and not replay it during PITR.

You should run this command whenever there have been significant updates to the table (e.g., a large number of deletes and inserts). This operation is designed to rearrange data elements into a more optimal structure and could run for quite a long time (holding write locks on the table), so it is best run during times of low load.

If the table cannot be reorganized (perhaps because there are no variable-length records or there is no fragmentation), the command will recreate the table and update the statistics. A sample output from this operation is shown in Example 11-16.

Example 11-16. The optimize table command
mysql> OPTIMIZE TABLE film \G
*************************** 1.
row ***************************
   Table: sakila.film
      Op: optimize
Msg_type: note
Msg_text: Table does not support optimize, doing recreate + analyze instead
*************************** 2. row ***************************
   Table: sakila.film
      Op: optimize
Msg_type: status
Msg_text: OK
2 rows in set (0.44 sec)

Here we see two rows in the result set. The first row tells us that the OPTIMIZE TABLE command could not be run as such and that the command will instead recreate the table and run the ANALYZE TABLE command. The second row is the result of the ANALYZE TABLE step.

As with the ANALYZE TABLE command, any unusual events during execution are indicated in the Msg_type field by info, Error, or warning. In these cases, the Msg_text field gives you additional information about the event. You should always investigate the situation if you get any result other than status and OK.

When using InnoDB, especially when there are secondary indexes (which usually get fragmented), you may not see any improvement, or you may encounter long processing times for the operation, unless you use the InnoDB "fast index create" option. This depends on how the index was constructed and may not apply to all indexes.

Best Practices for Database Optimization

As mentioned previously, there are many great examples, techniques, and practices concerning optimization that come highly recommended by the world's best database performance experts. Because monitoring is used to detect and diagnose performance issues, we include these best practices as a summary of the lessons learned about monitoring MySQL. For brevity, and to avoid controversial techniques, we discuss a few commonly agreed-upon best practices for improving database performance. We encourage you to examine some of the texts referenced earlier for more detail on each of these practices.
Use indexes sparingly but effectively

Most database professionals understand the importance of indexes and how they improve performance. Using the EXPLAIN command is often the best way to determine which indexes are needed. While the problem of not having enough indexes is well understood, having too much of a good thing can also cause performance issues.

As you saw when exploring the EXPLAIN command, it is possible to create too many indexes or indexes that are of little or no use. Each index adds overhead to every insertion and deletion against the table. In some cases, having too many indexes with wide (as in many values) distributions can slow insert and delete performance considerably. It can also lead to slower replication and restore operations.

You should periodically check your indexes to ensure they are all meaningful and utilized. Remove any indexes that are not used, have limited use, or have wide distributions. You can often use normalization to overcome some of the problems with wide distributions.

Use normalization, but don't overdo it

Many database experts who studied computer science or a related discipline may have fond memories (or nightmares) of learning the normal forms as described by C.J. Date and others. We won't revisit the material here; rather, we will discuss the impacts of taking those lessons too far.
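One way to put the index-pruning practice to work is to inspect index cardinality with SHOW INDEX and drop indexes you have confirmed are unused. This is a sketch against the film table from the earlier examples; the index name in the DROP is hypothetical, and you should verify that an index is truly unused by your queries before removing it:

```sql
-- Cardinality far below the table's row count suggests an index
-- with limited selectivity (compare against the row count).
SHOW INDEX FROM film;

-- Remove a confirmed-unused index to save insert/delete overhead.
ALTER TABLE film DROP INDEX idx_unused;  -- hypothetical index name
```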
In this case, every time your users query information, they must use a join to get the complete data. Joins are expensive, and the cost of frequently accessed data adds up over time. To mitigate this potential performance problem, you can use an enumerated field to store the data rather than a lookup table. For example, rather than creating a table for hair color (despite what some subcultures may insist upon, there really are only a limited number of hair color types), you can use an enumerated field and avoid the join altogether.

To see why, consider the alternative: if you created a child table to contain the possible values of hair color, the master table would contain a field whose value is an index into the hair color table. When you execute a query to get results from the master table, you would have to perform a join to get the values for the hair color field. If you use an enumerated field instead, you eliminate the need for the join and thus improve performance.

Another potential issue concerns calculated fields. Typically, we do not store data that is derived from other data (such as sales tax or the sum of several columns). Rather, the calculation is performed either during data retrieval via a view or in the application. This may not be a problem if the calculations are simple or seldom performed, but what if the calculations are complex and performed many times? In this case, you are potentially wasting a lot of time performing these calculations. One way to mitigate this problem is to use a trigger to calculate the value and store it in the table. While this technically duplicates data (a big no-no for normalization theorists), it can improve performance when a lot of calculations are being performed.

Use the right storage engine for the task

One of the most powerful features of MySQL is its support for different storage engines. Storage engines govern how data is stored and retrieved. MySQL supports a number of them, each with unique features and uses.
This allows database designers to tune their database performance by selecting the storage engine that best meets their application needs. For example, if you have an environment that requires transaction control for highly active databases, choose a storage engine best suited for this task. You may also have identified a view or table that is often queried but almost never updated (e.g., a lookup table). In this case, you may want to use a storage engine that keeps the data in memory for faster access.

Recent changes to MySQL have permitted some storage engines to become plug-ins, and some distributions of MySQL have only certain storage engines enabled by default. To find out which storage engines are enabled, issue the SHOW ENGINES command. Example 11-17 shows the storage engines on a typical installation.

Example 11-17. Storage engines
mysql> SHOW ENGINES \G
*************************** 1. row ***************************
      Engine: InnoDB
     Support: YES
     Comment: Supports transactions, row-level locking, and foreign keys
Transactions: YES
          XA: YES
  Savepoints: YES
*************************** 2. row ***************************
      Engine: MyISAM
     Support: DEFAULT
     Comment: Default engine as of MySQL 3.23 with great performance
Transactions: NO
          XA: NO
  Savepoints: NO
*************************** 3. row ***************************
      Engine: BLACKHOLE
     Support: YES
     Comment: /dev/null storage engine (anything you write to it disappears)
Transactions: NO
          XA: NO
  Savepoints: NO
*************************** 4. row ***************************
      Engine: CSV
     Support: YES
     Comment: CSV storage engine
Transactions: NO
          XA: NO
  Savepoints: NO
*************************** 5. row ***************************
      Engine: MEMORY
     Support: YES
     Comment: Hash based, stored in memory, useful for temporary tables
Transactions: NO
          XA: NO
  Savepoints: NO
*************************** 6.
row ***************************
      Engine: FEDERATED
     Support: NO
     Comment: Federated MySQL storage engine
Transactions: NULL
          XA: NULL
  Savepoints: NULL
*************************** 7. row ***************************
      Engine: ARCHIVE
     Support: YES
     Comment: Archive storage engine
Transactions: NO
          XA: NO
  Savepoints: NO
*************************** 8. row ***************************
      Engine: MRG_MYISAM
     Support: YES
     Comment: Collection of identical MyISAM tables
Transactions: NO
          XA: NO
  Savepoints: NO
8 rows in set (0.00 sec)

The result set includes all of the known storage engines; whether they are installed and configured (where Support = YES); a note about the engine's features; and whether each supports transactions, distributed transactions (XA), or savepoints.

A savepoint is a named event that you can use like a transaction. You can establish a savepoint and either release (delete) the savepoint or roll back the changes since the savepoint. See the online MySQL Reference Manual for more details about savepoints.

With so many storage engines to choose from, it can be confusing when designing your database for performance. You can choose the storage engine for a table using the ENGINE parameter on the CREATE statement, and you can change the storage engine by issuing an ALTER TABLE command:

CREATE TABLE t1 (a int) ENGINE=InnoDB;
ALTER TABLE t1 ENGINE=MEMORY;

The following describes each of the storage engines briefly, including some of the uses for which they are best suited:

InnoDB
The premier transactional storage engine, InnoDB is also the default engine (it became the default storage engine in version 5.5). This engine will be used if you omit the ENGINE option on the CREATE statement. You should always choose this storage engine when requiring transactional support; InnoDB and NDB are currently the only transactional engines in MySQL.
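A quick way to audit which engine each of your existing tables uses is to query the INFORMATION_SCHEMA (shown here against the sakila database used in the earlier examples):

```sql
SELECT TABLE_NAME, ENGINE
  FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_SCHEMA = 'sakila'
 ORDER BY TABLE_NAME;
```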
There are third-party storage engines in various states of production that can support transactions, but the only "out-of-the-box" option is InnoDB. InnoDB is the storage engine of choice for high-reliability and transaction-processing environments.

MyISAM
MyISAM is often used for data warehousing, ecommerce, and enterprise applications where most operations are reads (called read-mostly). MyISAM uses advanced caching and indexing mechanisms to improve data retrieval and indexing. MyISAM is an excellent choice when you need storage for a wide variety of applications requiring fast retrieval of data without the need for transactions.

Blackhole
This storage engine is very interesting. It doesn't store anything at all. In fact, it is what its name suggests—data goes in but never returns. All jocularity aside, the Blackhole storage engine fills a very special need. If binary logging is enabled, SQL statements are written to the logs, so Blackhole can be used as a relay agent (or proxy) in a replication topology. In this case, the relay agent processes data from the master and passes it on to its slaves but does not actually store any data. The Blackhole storage engine can also be handy in situations where you want to test an application to ensure it is writing data, but you don't want to store anything on disk.

CSV
This storage engine can create, read, and write comma-separated value (CSV) files as tables. The CSV storage engine is best used to rapidly export structured business data to spreadsheets. It does not provide any indexing mechanisms and has certain issues in storing and converting date/time values (they do not obey locality during queries). The CSV storage engine is best used when you want to permit other applications to share or exchange data in a common format. Given that it is not as efficient for storing data, you should use the CSV storage engine sparingly.
The CSV storage engine is also used for writing logfiles. For example, the backup logs are CSV files and can be opened by other applications that read the CSV format (but not while the server is running).

Memory
This storage engine (sometimes called HEAP) stores data in memory and uses a hashing mechanism to retrieve frequently used data, which allows for much faster retrieval. Data is accessed in the same manner as with the other storage engines, but the data is stored in memory and is valid only during the MySQL session—the data is flushed and deleted on shutdown. Memory tables are typically good for situations in which static data is accessed frequently and rarely altered (e.g., lookup tables). Examples include zip code listings, state and county names, category listings, and other data that is accessed frequently and seldom updated. You can also use the Memory storage engine for databases that utilize snapshot techniques for distributed or historical data access.

Federated
Creates a single table reference from multiple database systems. The Federated storage engine allows you to link tables together across database servers. This mechanism is similar in purpose to the linked data tables available in other database systems. The Federated storage engine is best suited for distributed or data mart environments. The most interesting feature of the Federated storage engine is that it does not move data, nor does it require the remote tables to use the same storage engine. The Federated storage engine is currently disabled in most distributions of MySQL. Consult the online MySQL Reference Manual for more details.

Archive
This storage engine can store large amounts of data in a compressed format. The Archive storage engine is best suited for storing and retrieving large amounts of seldom-accessed archival or historical data. Indexes are not supported and the only access method is via a table scan.
Thus, you should not use the Archive storage engine for normal database storage and retrieval.

Merge
This storage engine (MRG_MYISAM) can encapsulate a set of MyISAM tables with the same structure (table layout or schema) referenced as a single table. Thus, the tables are partitioned by the location of the individual tables, but no additional partitioning mechanisms are used. All tables must reside on the same server (but not necessarily in the same database). When a DROP command is issued on a merged table, only the Merge specification is removed; the original tables are not altered.

The best attribute of the Merge storage engine is speed. It permits you to split a large table into several smaller tables on different disks, combine them using a merge table specification, and access them simultaneously. Searches and sorts will execute more quickly, because there is less data in each table to manipulate. Also, repairs on tables are more efficient, because it is faster and easier to repair several smaller individual tables than a single large table. Unfortunately, this configuration has several disadvantages:

• You must use identical MyISAM tables to form a single merge table.
• The REPLACE operation is not allowed.
• Indexes are less efficient than for a single table.

The Merge storage engine is best suited for very large database (VLDB) applications, like data warehousing, where data resides in more than one table in one or more databases. You can also use it to help solve some partitioning problems where you want to partition horizontally but do not want to add the complexity of setting up the partition table options.

Clearly, with so many storage engines to choose from, it is possible to choose engines that hamper performance or, in some cases, prohibit certain solutions. For example, if you never specify a storage engine when the table is created, MySQL uses the default storage engine.
If not set manually, the default storage engine reverts to the platform-specific default, which may be MyISAM on some platforms. This may mean you are missing out on optimizing lookup tables or are limiting features of your application by not having transactional support. It is well worth the extra time to include an analysis of storage engine choices when designing or tuning your databases.

Use views for faster results via the query cache

Views are a very handy way to encapsulate complex queries and make it easier to work with the data. You can use views to limit data both vertically (fewer columns) and horizontally (a WHERE clause on the underlying SELECT statement). Both uses are very handy, and the more complex views use both practices to limit the result set returned to the user, to hide certain base tables, or to ensure an efficient join is executed.

Using views to limit the columns returned can help you in ways you may not have considered. It not only reduces the amount of data processed, but can also help you avoid the costly SELECT * operations that users tend to run without much thought. When many such operations run, your applications process far too much data; this affects the performance of not only the application but also the server and, more important, can decrease available bandwidth on your network. It's always a good idea to use views to limit data in this manner and to hide access to the base table(s), removing any temptation users may have to access the base tables directly.

Views that limit the number of rows returned also help reduce network bandwidth and can improve the performance of your applications. These views likewise protect against proliferation of SELECT * queries. Using views in this manner requires a bit more planning, because your goal is to create meaningful subsets of the data.
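For example, a view that restricts the film table both vertically and horizontally might look like the following (a sketch; the view name is hypothetical):

```sql
-- Fewer columns (vertical) and fewer rows (horizontal) than the base table.
CREATE VIEW restricted_films AS
  SELECT film_id, title, rating
    FROM film
   WHERE rating IN ('R', 'NC-17');

-- Applications query the view rather than running SELECT * on film.
SELECT title FROM restricted_films;
```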
You will have to examine the requirements for your database and understand the queries issued in order to form the correct WHERE clauses for these views. With a little effort, you may find you can create combinations of vertically and horizontally restrictive views, thereby ensuring your applications operate only on the data that is needed. The less data moving around, the more data your applications can process in the same amount of time.

Perhaps the best way to use views is to eliminate poorly formed joins. This is especially true when you have a complex normalized schema; it may not be obvious to users how to combine the tables to form a meaningful result set. Indeed, most of the work done by DBAs when striving for better performance is focused on correcting poorly formed joins. Sometimes the gain is trivial—for example, a few fewer rows processed during the join operation—but most of the time the improvement in response time is significant.

Views can also be helpful when using the query cache in MySQL. The query cache stores the results of frequently accessed queries. Using views that provide a standardized result set can improve the likelihood that the results will be cached and, therefore, retrieved more efficiently.

You can improve performance with a little design work and the judicious use of views in your databases. Take the time to examine how much data is being moved around (both the number of columns and rows) and examine your application for any query that uses joins. Spend some time forming views that limit the data, and identify the most efficient joins and wrap them in views as well. Imagine how much easier you'll rest knowing your users are executing efficient joins.

Use constraints

The use of constraints provides another tool in your arsenal for combating performance problems. Rather than proselytizing about limitations on using constraints, we encourage you to consider constraints a standard practice and not an afterthought.
There are several types of constraints available in MySQL, including the following:

• Unique indexes
• Primary keys
• Enumerated values
• Sets
• Default values
• The NOT NULL option

We've already discussed using (and overusing) indexes. Indexes help improve data retrieval by allowing the system to store and find data more quickly.

Foreign keys are another form of constraint, but they are not directly related to performance. Rather, foreign keys can be used to protect referential integrity. However, it should be noted that updating tables with a lot of foreign keys, or executing cascade operations, can have some effect on performance. Currently, only InnoDB supports foreign keys. For more information about foreign keys, see the online MySQL Reference Manual.

Sets in MySQL are similar to enumerated values, allowing you to constrain the values in a field. You can use sets to store information that represents attributes of the data, instead of using a master/detail relationship. This not only saves space in the table (set values are bitwise combinations), but also eliminates the need to access another table for the values.

The use of the DEFAULT option to supply default values is an excellent way to prevent problems associated with poorly constructed data. For example, if you have a numeric field that represents values used for calculations, you may want to ensure that when the field is unknown, a default value is stored for it. You can set defaults on most data types. You can also use defaults for date and time fields to avoid problems processing invalid date and time values. More important, default values can save your application from having to supply the values (or using the less reliable method of asking the user to provide them), thereby reducing the amount of data sent to the server during data entry.

You should also consider using the NOT NULL option when specifying fields that must have a value.
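Several of these constraint types can appear together in a single table definition, as in this sketch (the table and column names are hypothetical):

```sql
CREATE TABLE product (
  product_id INT UNSIGNED PRIMARY KEY,         -- primary key
  sku        VARCHAR(20) NOT NULL UNIQUE,      -- unique index, NOT NULL
  status     ENUM('active','retired')          -- enumerated values
             NOT NULL DEFAULT 'active',        -- default value
  features   SET('fragile','oversize','bulk'), -- set of attributes
  price      DECIMAL(8,2) NOT NULL DEFAULT 0.00
);
```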
If an insert is attempted in which NOT NULL columns are given no values, the INSERT statement will fail. This prevents data integrity issues by ensuring all important fields have values. Null values can also make certain queries on these fields take longer.

Use EXPLAIN, ANALYZE, and OPTIMIZE

We have already discussed the benefits of these commands. We list them here as a best practice to remind you that these tools are vital for diagnostic and tuning efforts. Use them often, but apply them carefully. Specifically, use ANALYZE and OPTIMIZE when it makes sense, not as a regular scheduled event. We have encountered administrators who run these commands nightly; in some cases that may be warranted, but in the general case it is not, and it can lead to unnecessary table copies (like we saw in the earlier examples). Forcing the system to copy data regularly can be a waste of time and could lead to limited access during the operation.

Now that we've discussed how to monitor and improve MySQL query performance, let us look at some best practices that you can use to help focus your investigation of performance.

Best Practices for Improving Performance

The details of diagnosing and improving the performance of databases are covered by entire works devoted to the subject. For completeness and as a general reference, we include in this section a set of best practices for combating performance anomalies; it is meant to be a checklist for you to use as a guide. We have grouped the practices by common problems.

Everything Is Slow

When the system as a whole is performing poorly, you must focus your efforts on how the system is running, starting with the operating system. You can use one or more of the following techniques to identify and improve the performance of your system:

• Check hardware for problems.
• Improve hardware (e.g., add memory).
• Consider moving data to isolated disks.
• Check the operating system for proper configuration.
• Consider moving some applications to other servers.
• Consider replication for scale-out.
• Tune the server for performance.

Slow Queries

Any query that appears in the slow query log, or that is identified as problematic by your users or developers, can be improved using one or more of the following techniques:

• Normalize your database schema.
• Use EXPLAIN to identify missing or incorrect indexes.
• Use the BENCHMARK() function to test parts of queries.
• Consider rewriting the query.
• Use views to standardize queries.
• Test using the query cache (this may not work for all queries or for the frequency of your access patterns).

A replication slave does not write replicated queries to the slow query log, regardless of whether the query was written to the slow query log on the master.

Slow Applications

If an application is showing signs of performance issues, you should examine the application components to determine where the problem is located. Perhaps you will find that only one module is causing the problem, but sometimes it may be more serious. The following techniques can help you identify and solve application performance problems:

• Turn on the query cache.
• In cases where the query cache is already enabled, turning it off may improve some queries. Consider using the query cache on demand, using DEMAND mode and SELECT SQL_CACHE.
• Consider and optimize your storage engine choices.
• Verify the problem isn't in the server or operating system.
• Define benchmarks for your applications and compare them to known baselines.
• Examine internalized (written into the application) queries and maximize their performance.
• Divide and conquer—examine one part at a time.
• Use partitioning to spread out the data.
• Examine your indexes for fragmentation.
Slow Replication

The performance problems related to replication, as discussed earlier, are normally isolated to problems with database and server performance. Use the following techniques when diagnosing performance issues for replication:

• Ensure your network is operating at peak performance.
• Ensure your servers are configured correctly.
• Optimize your databases.
• Limit updates to the master.
• Divide reads across several slaves.
• Check the slaves for replication lag.
• Perform regular maintenance on your logs (binary and relay logs).
• Use compression over your network if bandwidth is limited.
• Use inclusive and exclusive logging options to minimize what is replicated.

Conclusion

There are a lot of things to monitor on a MySQL server. We've discussed the basic SQL commands available for monitoring the server, the mysqladmin command-line utility, the benchmark suite, and MySQL Workbench. We have also examined some best practices for improving database performance.

Now that you know the basics of operating system monitoring, database performance, MySQL monitoring, and benchmarking, you have the tools and knowledge to successfully tune your server for optimal performance.

Joel smiled as he compiled his report about Susan's nested query problem. It had taken a few hours of digging through logfiles to find the problem, but after he explained the overhead of the query to the developers, they agreed to change the query to use a lookup table stored in a memory table. Joel felt his boss was going to be excited to learn about his ingenuity. He clicked Send just as his boss appeared in his door frame.

“Joel!”

Joel jumped, despite knowing Mr. Summerson was there. “I’ve got the marketing application problem solved, sir,” he said quickly.

“Great!
I look forward to reading about your solution.” Joel wasn’t sure his boss would understand the technical parts of his message, but he also knew his boss would keep asking if he didn’t explain everything. Mr. Summerson nodded once and went on his way.

Joel opened an email message from Phil in Seattle complaining about replication problems and soon realized the problems extended much further than the server he had been working with.

CHAPTER 12
Storage Engine Monitoring

Joel was enjoying his morning latte when his office phone rang. It startled him because until now he had never heard it ring. He lifted the receiver and heard engine noises. Expecting the call was a wrong number, he said hesitantly, “Hello?”

“Joel! Glad I caught you.” It was Mr. Summerson, calling from his car.

“Yes, sir.”

“I’m on my way to the airport to meet with the sales staff in the Seattle office. I wanted to ask you to look into the new application database. The developers in Seattle tell me they think we need to figure out a better configuration for performance.”

Joel had expected something like this. He knew a little about InnoDB and MyISAM, but he wasn’t familiar with monitoring, much less tuning their performance. “I can look into it, sir.”

“Great. Thanks, Joel. I’ll email you.”

The connection was severed before Joel could reply. Joel finished the last of his latte and started reading about storage engines in MySQL.

Now that you know when your servers are performing well (and when they aren’t), how do you know how well your storage engines are performing? If you are hosting one or more transactional databases or need your storage engine to perform at its peak for fast queries, you will need to monitor your storage engines. In this chapter, we discuss advanced storage engine monitoring, focusing on improving storage engine performance by examining the two most popular storage engines: InnoDB and MyISAM.
We will discuss how to monitor each and offer some practical advice on how to improve performance.

InnoDB

The InnoDB storage engine is the default storage engine for MySQL (as of version 5.5). InnoDB provides high-reliability, high-performance transactional operations that support full ACID compliance. InnoDB has proven to be very reliable and continues to be improved. The latest improvements include multicore processor support, improved memory allocation, and finer-grained performance tuning capabilities. The online reference manual contains a detailed explanation of all of the features of the InnoDB storage engine.

There are many tuning options for the InnoDB storage engine, and a thorough examination of all of them, and the techniques that go along with each, can fill an entire volume. For example, there are 50 variables that control the behavior of InnoDB and over 40 status variables that communicate metadata about performance and status. In this section, we discuss how to monitor the InnoDB storage engine and focus on some key areas for improving performance.

Rather than discussing the broader aspects of these areas, we provide a strategy organized into the following areas of performance improvement:

• Using the SHOW ENGINE command
• Using InnoDB monitors
• Monitoring logfiles
• Monitoring the buffer pool
• Monitoring tablespaces
• Using INFORMATION_SCHEMA tables
• Using PERFORMANCE_SCHEMA tables
• Other parameters to consider
• Troubleshooting InnoDB

We will discuss each of these briefly in the sections that follow. However, before we get started, let’s take a brief look at the InnoDB architectural features.

The InnoDB storage engine uses a very sophisticated architecture that is designed for high concurrency and heavy transactional activity. It has a number of advanced features that you should consider prior to attempting to improve performance. We focus on the features we can monitor and improve.
These include indexes, the buffer pool, logfiles, and tablespaces.

InnoDB tables use clustered indexes. Even if no index is specified, InnoDB assigns an internal value to each row so that it can use a clustered index. A clustered index is a data structure that stores not only the index, but also the data itself. This means that once you’ve located the value in the index, you can retrieve the data without additional disk seeks. Naturally, the primary key index or first unique index on a table is built as a clustered index.

When you create a secondary index, the key from the clustered index (primary key, unique key, or row ID) is stored along with the value for the secondary index. This allows very fast searches by key and fast retrieval of the original data in the clustered index. It also means you can use the primary key columns when scanning the secondary index to allow the query to use only the secondary index to retrieve data.

The buffer pool is a caching mechanism for managing transactions and writing and reading data to or from disk; properly configured, it can reduce disk access. The buffer pool is also a vital component for crash recovery, as the buffer pool state is written to disk periodically (e.g., during shutdown). By default, the buffer pool state is saved in a file named ib_buffer_pool in the same directory as the InnoDB datafiles. Because the buffer pool is an in-memory component, you must monitor its effectiveness to ensure it is configured correctly.

InnoDB also uses the buffer pool to store data changes and transactions. InnoDB caches changes by saving them to a page (block) of data in the buffer pool. Each time a page is referenced, it is placed in the buffer pool; when changed, it is marked as “dirty.” The changes are then written to disk to update the data, and a copy is written into a redo log. These logfiles are stored as files named ib_logfile0 and ib_logfile1.
You can see these files in the data directory of the MySQL server.

For more information about configuring and controlling the flushing of the buffer pool, see the section entitled “Improvements to Buffer Pool Flushing” in the online MySQL Reference Manual.

The InnoDB storage engine uses two disk-based mechanisms for storing data: logfiles and tablespaces. InnoDB also uses the logs to rebuild (or redo) data changes made prior to a shutdown or crash. On startup, InnoDB reads the logs and automatically writes the dirty pages to disk, thereby recovering buffered changes made before the crash.

Separate Tablespaces

One of the newest performance features permits the storage of undo logs as separate tablespaces. Because the undo logs can consume a lot of space during long-running transactions, placing the undo logs in separate or even multiple tablespaces reduces the size of the system tablespace. To place the undo logs in separate tablespaces, set the --innodb_undo_tablespaces configuration option to a value greater than zero. You can also specify the location of the undo log tablespaces by using the --innodb_undo_directory option. MySQL assigns names to the undo log tablespace files using the form undoN, where N is a sequential integer with leading zeros (e.g., undo001).

Tablespaces are an organizational tool InnoDB uses as machine-independent files that contain both data and indexes as well as a rollback mechanism (to roll back transactions). By default, all tables share one tablespace (called a shared tablespace). Shared tablespaces do not automatically extend across multiple files. By default, a tablespace takes up a single file that grows as the data grows. You can specify the autoextend option to allow the last file in the tablespace to grow as needed.

You can also store tables in their own tablespaces (called file-per-table). File-per-table tablespaces contain both the data and the indexes for your tables.
While there is still a central InnoDB file that is maintained, file-per-table permits you to segregate the data into different files (tablespaces). These tablespaces automatically extend across multiple files, thereby allowing you to store more data in your tables than the operating system can handle in a single file. You can divide your tablespace into multiple files to place on different disks.

Use innodb_file_per_table to create a separate tablespace for each table. Any tables created prior to setting this option will remain in the shared tablespace. Using this option affects only new tables; it does not reduce or otherwise save space already allocated in the shared tablespace. To apply the change to existing tables, use the ALTER TABLE ... ENGINE=INNODB command after you have turned on the innodb_file_per_table feature.

Using the SHOW ENGINE Command

The SHOW ENGINE INNODB STATUS command (also known as the InnoDB monitor) displays statistical and configuration information concerning the state of the InnoDB storage engine. This is the standard way to see information about InnoDB. The list of statistical data displayed is long and very comprehensive. Example 12-1 shows an excerpt of the command run on a standard installation of MySQL.

Example 12-1. The SHOW ENGINE INNODB STATUS command

mysql> SHOW ENGINE INNODB STATUS \G
*************************** 1.
row ***************************
  Type: InnoDB
  Name:
Status:
=====================================
2013-01-08 20:50:16 11abaa000 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 3 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 1 srv_active, 0 srv_shutdown, 733 srv_idle
srv_master_thread log flush and writes: 734
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 2
OS WAIT ARRAY INFO: signal count 2
Mutex spin waits 1, rounds 19, OS waits 0
RW-shared spins 2, rounds 60, OS waits 2
RW-excl spins 0, rounds 0, OS waits 0
Spin rounds per wait: 19.00 mutex, 30.00 RW-shared, 0.00 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 1285
Purge done for trx's n:o < 0 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 3, OS thread handle 0x11abaa000, query id 32 localhost 127.0.0.1 root init
SHOW ENGINE INNODB STATUS
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (read thread)
I/O thread 4 state: waiting for i/o request (read thread)
I/O thread 5 state: waiting for i/o request (read thread)
I/O thread 6 state: waiting for i/o request (write thread)
I/O thread 7 state: waiting for i/o request (write thread)
I/O thread 8 state: waiting for i/o request (write thread)
I/O thread 9 state: waiting for i/o request (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
171 OS file reads, 5 OS file writes, 5 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 276671, node heap has 0 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 1625987
Log flushed up to   1625987
Pages flushed up to 1625987
Last checkpoint at  1625987
0 pending log writes, 0 pending chkp writes
8 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 55491
Buffer pool size   8191
Free buffers       8034
Database pages     157
Old database pages 0
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0 single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 157, created 0, written 1
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 157, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Main thread id 4718366720, state: sleeping
Number of rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
1 row in set (0.00 sec)

The SHOW ENGINE INNODB MUTEX command displays mutex information about InnoDB and can be very helpful when tuning threading in the storage engine. Example 12-2 shows an excerpt of the command run on a standard installation of MySQL.

Example 12-2.
The SHOW ENGINE INNODB MUTEX command

mysql> SHOW ENGINE INNODB MUTEX;
+--------+--------------------+---------------+
| Type   | Name               | Status        |
+--------+--------------------+---------------+
| InnoDB | trx/trx0rseg.c:167 | os_waits=1    |
| InnoDB | trx/trx0sys.c:181  | os_waits=7    |
| InnoDB | log/log0log.c:777  | os_waits=1003 |
| InnoDB | buf/buf0buf.c:936  | os_waits=8    |
| InnoDB | fil/fil0fil.c:1487 | os_waits=2    |
| InnoDB | srv/srv0srv.c:953  | os_waits=101  |
| InnoDB | log/log0log.c:833  | os_waits=323  |
+--------+--------------------+---------------+
7 rows in set (0.00 sec)

The Name column displays the source file and line number where the mutex was created. The Status column displays the number of times the mutex waited for the operating system (e.g., os_waits=5). If the source code was compiled with the UNIV_DEBUG directive, the column can display one of the following values:

count
    The number of times the mutex was requested
spin_waits
    The number of times a spinlock operation was run
os_waits
    The number of times the mutex waited on the operating system
os_yields
    The number of times a thread abandoned its time slice and returned to the operating system
os_wait_times
    The amount of time the mutex waited for the operating system

The SHOW ENGINE INNODB STATUS command displays a lot of information directly from the InnoDB storage engine. While it is unformatted (it isn’t displayed in neat rows and columns), there are several tools that use this information and redisplay it. For example, the InnoTop command (see “InnoTop” on page 404) communicates data this way.

Using InnoDB Monitors

The InnoDB storage engine is the only native storage engine that supports monitoring directly. Under the hood of InnoDB is a special mechanism called a monitor that gathers and reports statistical information to the server and client utilities.
All of the monitors described here (and most third-party tools) interact with this monitoring facility; through it, InnoDB reports the following items via the MySQL server:

• Table and record locks
• Lock waits
• Semaphore waits
• File I/O requests
• Buffer pool
• Purge and insert buffer merge activity

The InnoDB monitors are engaged automatically via the SHOW ENGINE INNODB STATUS command, and the information displayed is generated by the monitors. However, you can also get this information directly from the InnoDB monitors by creating a special set of tables in MySQL. The actual schema of the tables and where they reside are not important (provided you use the ENGINE = INNODB clause). Once they are created, each of the tables tells InnoDB to dump its data to stderr. You can see this information via the MySQL error log. For example, a default install of MySQL on Mac OS X has an error log named /usr/local/mysql/data/localhost.err. On Windows, you can also display the output in a console by starting MySQL with the --console option.

To turn on the InnoDB monitors, create tables with the following names in a database of your choice:

mysql> SHOW TABLES LIKE 'innodb%';
+---------------------------+
| Tables_in_test (innodb%)  |
+---------------------------+
| innodb_lock_monitor       |
| innodb_monitor            |
| innodb_table_monitor      |
| innodb_tablespace_monitor |
+---------------------------+
4 rows in set (0.00 sec)

To turn off a monitor, simply drop the corresponding table. The monitors automatically regenerate data every 15 seconds.

The tables are deleted on reboot. To continue monitoring after a reboot, you must recreate the tables.

Each monitor presents the following data:

innodb_monitor
    The standard monitor that prints the same information as the status SQL command. See Example 12-1 for an example of the output of this monitor.
The only difference between the SQL command and the output of the innodb_monitor is that the output to stderr is formatted the same way as if you used the vertical display option in the MySQL client.

innodb_lock_monitor
    The lock monitor also displays the same information as the SQL command, but includes additional information about locks. Use this report to detect deadlocks and explore concurrency issues.

Example 12-3. The InnoDB lock monitor report

------------
TRANSACTIONS
------------
Trx id counter 2E07
Purge done for trx's n:o < 2C02 undo n:o < 0
History list length 36
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 2E06, not started
mysql tables in use 1, locked 1
MySQL thread id 3, OS thread handle 0x10b2f3000, query id 30 localhost root
show engine innodb status

innodb_table_monitor
    The table monitor produces a detailed report of the internal data dictionary. Example 12-4 shows an excerpt of the report generated (formatted for readability). Notice the extensive data provided about each table, including the column definitions, indexes, approximate number of rows, foreign keys, and more. Use this report when diagnosing problems with tables or if you want to know the details of indexes.

Example 12-4. The InnoDB table monitor report

===========================================
2013-01-08 21:11:00 11dc5f000 INNODB TABLE MONITOR OUTPUT
===========================================
--------------------------------------
TABLE: name SYS_DATAFILES, id 14, flags 0, columns 5, indexes 1, appr.rows 9
COLUMNS: SPACE: DATA_INT len 4;
         PATH: DATA_VARCHAR prtype 524292 len 0;
         DB_ROW_ID: DATA_SYS prtype 256 len 6;
         DB_TRX_ID: DATA_SYS prtype 257 len 6;
         DB_ROLL_PTR: DATA_SYS prtype 258 len 7;
INDEX: name SYS_DATAFILES_SPACE, id 16, fields 1/4, uniq 1, type 3
       root page 308, appr.key vals 9, leaf pages 1, size pages 1
       FIELDS: SPACE DB_TRX_ID DB_ROLL_PTR PATH
--------------------------------------
...
-----------------------------------
END OF INNODB TABLE MONITOR OUTPUT
==================================

innodb_tablespace_monitor
    Displays extended information about the shared tablespace, including a list of file segments. It also validates the tablespace allocation data structures. The report can be quite detailed and very long, as it lists all of the details about your tablespace. Example 12-5 shows an excerpt of this report.

Example 12-5. The InnoDB tablespace monitor report

================================================
2013-01-08 21:11:00 11dc5f000 INNODB TABLESPACE MONITOR OUTPUT
================================================
FILE SPACE INFO: id 0
size 768, free limit 576, free extents 3
not full frag extents 1: used pages 13, full frag extents 3
first seg id not used 180
SEGMENT id 1 space 0; page 2; res 2 used 2; full ext 0
fragm pages 2; free extents 0; not full extents 0: pages 0
SEGMENT id 2 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 3 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 4 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 5 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 6 space 0; page 2; res 0 used 0; full ext 0
fragm pages 0; free extents 0; not full extents 0: pages 0
SEGMENT id 7 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 8 space 0; page 2; res 0 used 0; full ext 0
fragm pages 0; free extents 0; not full extents 0: pages 0
SEGMENT id 9 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 10 space 0; page 2; res 0 used 0; full ext 0
fragm pages 0; free extents 0; not full extents 0: pages 0
SEGMENT id 11 space 0; page 2; res 1 used 1; full
ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 12 space 0; page 2; res 0 used 0; full ext 0
fragm pages 0; free extents 0; not full extents 0: pages 0
SEGMENT id 13 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 14 space 0; page 2; res 0 used 0; full ext 0
fragm pages 0; free extents 0; not full extents 0: pages 0
SEGMENT id 15 space 0; page 2; res 160 used 160; full ext 2
fragm pages 32; free extents 0; not full extents 0: pages 0
SEGMENT id 16 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 17 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 18 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 19 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
SEGMENT id 20 space 0; page 2; res 1 used 1; full ext 0
fragm pages 1; free extents 0; not full extents 0: pages 0
...
NUMBER of file segments: 179
Validating tablespace
Validation ok
---------------------------------------
END OF INNODB TABLESPACE MONITOR OUTPUT
=======================================

As you can see, the InnoDB monitors report quite a lot of detail. Keeping them turned on for extended periods could add a substantial amount of data to your logfiles.

Monitoring Logfiles

Because the InnoDB logfiles buffer data between your data and the operating system, keeping these files running well will ensure good performance.
You can monitor the logfiles directly by watching the following system status variables:

mysql> SHOW STATUS LIKE 'InnoDB%log%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| InnoDB_log_waits             | 0     |
| InnoDB_log_write_requests    | 0     |
| InnoDB_log_writes            | 2     |
| InnoDB_os_log_fsyncs         | 5     |
| InnoDB_os_log_pending_fsyncs | 0     |
| InnoDB_os_log_pending_writes | 0     |
| InnoDB_os_log_written        | 1024  |
+------------------------------+-------+

We saw some of this information presented by the InnoDB monitors, but you can also get detailed information about the logfiles using the following status variables:

InnoDB_log_waits
    A count of the number of times the log was too small (i.e., did not have enough room for all of the data) and the operation had to wait for the log to be flushed. If this value begins to increase and remains higher than zero for long periods (except perhaps during bulk operations), you may want to increase the size of the logfiles.
InnoDB_log_write_requests
    The number of log write requests.
InnoDB_log_writes
    The number of times data was written to the log.
InnoDB_os_log_fsyncs
    The number of operating system file syncs (i.e., fsync() method calls).
InnoDB_os_log_pending_fsyncs
    The number of pending file sync requests. If this value begins to increase and stays above zero for an extended period of time, you may want to investigate possible disk access issues.
InnoDB_os_log_pending_writes
    The number of pending log write requests. If this value begins to increase and stays higher than zero for an extended period of time, you may want to investigate possible disk access issues.
InnoDB_os_log_written
    The total number of bytes written to the log.

Because all of these variables present numerical information, you can build your own custom graphs in MySQL Workbench to display the information in graphical form.
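As a sketch of how you might act on InnoDB_log_waits, you could check the counter and, if it climbs steadily, configure larger logfiles. The 256M value below is only an illustration, not a recommendation:

```sql
-- Watch for operations waiting on the redo log:
SHOW GLOBAL STATUS LIKE 'InnoDB_log_waits';

-- If the counter climbs steadily over time, consider larger logfiles by
-- setting innodb_log_file_size in the [mysqld] section of my.cnf, e.g.:
--   innodb_log_file_size = 256M
-- Note: on older MySQL versions, changing the redo log size requires a
-- clean shutdown (and, before 5.6.8, removing the old ib_logfile files).
```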
Monitoring the Buffer Pool

The buffer pool is where InnoDB caches frequently accessed data. Any changes you make to the data in the buffer pool are also cached. The buffer pool also stores information about current transactions. Thus, the buffer pool is a critical mechanism used for performance.

You can view information about the behavior of the buffer pool using the SHOW ENGINE INNODB STATUS command, as shown in Example 12-1. We repeat the buffer pool and memory section here for your convenience:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 138805248; in additional pool allocated 0
Dictionary memory allocated 70560
Buffer pool size   8192
Free buffers       760
Database pages     6988
Modified db pages  113
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 21, created 6968, written 10043
0.00 reads/s, 89.91 creates/s, 125.87 writes/s
Buffer pool hit rate 1000 / 1000
LRU len: 6988, unzip_LRU len: 0
I/O sum[9786]:cur[259], unzip sum[0]:cur[0]

The critical items to watch for in this report are listed here (we discuss more specific status variables later):

Free buffers
    The number of buffer segments that are empty and available for buffering data.
Modified pages
    The number of pages that have changes (dirty pages).
Pending reads
    The number of reads waiting. This value should remain low.
Pending writes
    The number of writes waiting. This value should remain low.
Hit rate
    A ratio of the number of successful buffer hits to the number of all requests. You want this value to remain as close as possible to 1:1.

There are a number of status variables you can use to see this information in greater detail.
The following shows the InnoDB buffer pool status variables:

mysql> SHOW STATUS LIKE 'InnoDB%buf%';
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| InnoDB_buffer_pool_pages_data     | 21    |
| InnoDB_buffer_pool_pages_dirty    | 0     |
| InnoDB_buffer_pool_pages_flushed  | 1     |
| InnoDB_buffer_pool_pages_free     | 8171  |
| InnoDB_buffer_pool_pages_misc     | 0     |
| InnoDB_buffer_pool_pages_total    | 8192  |
| InnoDB_buffer_pool_read_ahead_rnd | 0     |
| InnoDB_buffer_pool_read_ahead_seq | 0     |
| InnoDB_buffer_pool_read_requests  | 558   |
| InnoDB_buffer_pool_reads          | 22    |
| InnoDB_buffer_pool_wait_free      | 0     |
| InnoDB_buffer_pool_write_requests | 1     |
+-----------------------------------+-------+

These status variables display key statistical information about the performance of the buffer pool. You can monitor such detailed information as the status of the pages in the buffer pool, the reads and writes from and to the buffer pool, and how often the buffer pool causes a wait for reads or writes. The following explains each status variable in more detail:

InnoDB_buffer_pool_pages_data
    The number of pages containing data, including both unchanged and changed (dirty) pages.
InnoDB_buffer_pool_pages_dirty
    The number of pages that have changes (dirty pages).
InnoDB_buffer_pool_pages_flushed
    The number of times buffer pool pages have been flushed.
InnoDB_buffer_pool_pages_free
    The number of empty (free) pages.
InnoDB_buffer_pool_pages_misc
    The number of pages that are being used for administrative work by the InnoDB engine itself. This is calculated as InnoDB_buffer_pool_pages_total - InnoDB_buffer_pool_pages_free - InnoDB_buffer_pool_pages_data.
InnoDB_buffer_pool_pages_total
    The total number of pages in the buffer pool.
InnoDB_buffer_pool_read_ahead_rnd
    The number of random read-aheads that have occurred as a result of InnoDB scanning for a large block of data.
InnoDB_buffer_pool_read_ahead_seq
    The number of sequential read-aheads that have occurred as a result of a sequential full table scan.
InnoDB_buffer_pool_read_requests
    The number of logical read requests.
InnoDB_buffer_pool_reads
    The number of logical reads that were not found in the buffer pool and were read directly from the disk.
InnoDB_buffer_pool_wait_free
    If the buffer pool is busy or there are no free pages, InnoDB may need to wait for pages to be flushed. This value is the number of times the wait occurred. If this value grows and stays higher than zero, you may have either an issue with the size of the buffer pool or a disk access issue.
InnoDB_buffer_pool_write_requests
    The number of writes to the InnoDB buffer pool.

Because all of these variables present numerical data, you can build your own custom graphs in MySQL Workbench to display the information in graphical form.

Monitoring Tablespaces

InnoDB tablespaces are basically self-sufficient, provided you have allowed InnoDB to extend them when they run low on space. You can configure tablespaces to grow automatically using the autoextend option of the innodb_data_file_path variable. For example, the default configuration of a MySQL installation sets the shared tablespace to 10 megabytes and allows it to extend automatically:

--innodb_data_file_path=ibdata1:10M:autoextend

See the “InnoDB Configuration” section in the online MySQL Reference Manual for more details. You can see the current configuration of your tablespaces using the SHOW ENGINE INNODB STATUS command, and you can see the details of the tablespaces by turning on the InnoDB tablespace monitor (see the “Using Tablespace Monitors” section in the online MySQL Reference Manual for more details).

Using INFORMATION_SCHEMA Tables

The INFORMATION_SCHEMA database includes a number of tables devoted to InnoDB.
These tables are technically views, in the sense that the data they present is not stored on disk; rather, the data is generated when the table is queried. The tables provide another way to monitor InnoDB and provide performance information to administrators. These tables are present by default in version 5.5 and later. There are tables for monitoring compression, transactions, locks, and more. Here we describe some of the available tables briefly:

INNODB_CMP
    Displays details and statistics for compressed tables.
INNODB_CMP_RESET
    Displays the same information as INNODB_CMP, but has the side effect that querying the table resets the statistics. This allows you to track statistics periodically (e.g., hourly, daily, etc.).
INNODB_CMPMEM
    Displays details and statistics about compression use in the buffer pool.
INNODB_CMPMEM_RESET
    Displays the same information as INNODB_CMPMEM, but has the side effect that querying the table resets the statistics. This allows you to track statistics periodically (e.g., hourly, daily, etc.).
INNODB_TRX
    Displays details and statistics about all transactions, including the state and the query currently being processed.
INNODB_LOCKS
    Displays details and statistics about all locks requested by transactions. It describes each lock, including the state, mode, type, and more.
INNODB_LOCK_WAITS
    Displays details and statistics about all locks requested by transactions that are being blocked. It describes each lock, including the state, mode, type, and the blocking transaction.

A complete description of each table, including the columns and examples of how to use each, is presented in the online reference manual.

You can use the compression tables to monitor the compression of your tables, including such details as the page size, pages used, time spent in compression and decompression, and much more.
This can be important information to monitor if you are using compression and want to ensure the overhead is not affecting the performance of your database server.

You can use the transaction and locking tables to monitor your transactions. This is a very valuable tool for keeping your transactional databases running smoothly. Most important, you can determine precisely which state each transaction is in, as well as which transactions are blocked and which are in a locked state. This information can also be critical in diagnosing complex transaction problems such as deadlocks or poor performance.

Older versions of InnoDB, specifically during the MySQL version 5.1 era, were built as a plug-in storage engine. If you are using the older InnoDB storage engine plug-in, you also have access to seven special tables in the INFORMATION_SCHEMA database. You must install those INFORMATION_SCHEMA tables separately. For more details, see the InnoDB plug-in documentation.

Using PERFORMANCE_SCHEMA Tables

As of MySQL version 5.5, InnoDB supports the PERFORMANCE_SCHEMA feature of the MySQL server, which was introduced in Chapter 11. For InnoDB, this means you can monitor the internal behavior of the InnoDB subsystems. This enables you to tune InnoDB using a general knowledge of the source code. While it is not strictly necessary to read the InnoDB source code to use the PERFORMANCE_SCHEMA tables to tune InnoDB, expert users with this knowledge can obtain more precise performance tuning. But that comes at a price: it is possible to overtune such that the system performs well for certain complex queries at the expense of other queries that may not see the same performance improvements.

To use PERFORMANCE_SCHEMA with InnoDB, you must have MySQL version 5.5 or later, InnoDB 1.1 or later, and PERFORMANCE_SCHEMA enabled in the server.
All InnoDB-specific instances, objects, consumers, and so on are prefixed with "innodb" in the name. For example, the following shows a list of the active InnoDB threads (you can use this to help isolate and monitor how InnoDB threads are performing):

mysql> SELECT thread_id, name, type FROM threads WHERE NAME LIKE '%innodb%';
+-----------+----------------------------------------+------------+
| thread_id | name                                   | type       |
+-----------+----------------------------------------+------------+
|         2 | thread/innodb/io_handler_thread        | BACKGROUND |
|         3 | thread/innodb/io_handler_thread        | BACKGROUND |
|         4 | thread/innodb/io_handler_thread        | BACKGROUND |
|         5 | thread/innodb/io_handler_thread        | BACKGROUND |
|         6 | thread/innodb/io_handler_thread        | BACKGROUND |
|         7 | thread/innodb/io_handler_thread        | BACKGROUND |
|         8 | thread/innodb/io_handler_thread        | BACKGROUND |
|         9 | thread/innodb/io_handler_thread        | BACKGROUND |
|        10 | thread/innodb/io_handler_thread        | BACKGROUND |
|        11 | thread/innodb/io_handler_thread        | BACKGROUND |
|        13 | thread/innodb/srv_lock_timeout_thread  | BACKGROUND |
|        14 | thread/innodb/srv_error_monitor_thread | BACKGROUND |
|        15 | thread/innodb/srv_monitor_thread       | BACKGROUND |
|        16 | thread/innodb/srv_purge_thread         | BACKGROUND |
|        17 | thread/innodb/srv_master_thread        | BACKGROUND |
|        18 | thread/innodb/page_cleaner_thread      | BACKGROUND |
+-----------+----------------------------------------+------------+
16 rows in set (0.00 sec)

You can find InnoDB-specific items in the rwlock_instances, mutex_instances, file_instances, file_summary_by_event_name, and file_summary_by_instances tables as well.

Other Parameters to Consider

There are many things to monitor and tune in the InnoDB storage engine. We have discussed only a portion of them, focusing mainly on monitoring the various subsystems and improving performance. However, there are a few other items you may want to consider.
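Before adjusting any of the options that follow, it is a good idea to record their current values so that you can revert a change that does not help:

mysql> SHOW GLOBAL VARIABLES LIKE 'innodb%';

Saving this output alongside your measurements makes it easy to correlate a performance change with the variable you altered.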
Thread performance can be improved under certain circumstances by adjusting the innodb_thread_concurrency option. The default value is zero in MySQL version 5.5 and later (8 in prior versions), meaning unlimited concurrency: any number of threads may execute in the storage engine. This is usually sufficient, but if you are running MySQL on a server with many processors and many independent disks (and heavy use of InnoDB), you may see a performance increase by setting this value equal to the number of processors plus independent disks. This ensures InnoDB will use enough threads to allow maximum concurrent operations. Setting this value greater than what your server can support has little or no effect; if there are never that many threads available, the limit is simply never reached.

If your MySQL server is part of a system that is shut down frequently or even periodically (e.g., you run MySQL at startup on your Linux laptop), you may notice when using InnoDB that shutdown can take a long time to complete. Fortunately, InnoDB can be shut down more quickly by setting innodb_fast_shutdown to 1. This does not affect data integrity, nor will it result in a loss of memory (buffer) management. It simply skips the potentially expensive operations of purging the internal caches and merging insert buffers; it still performs a controlled shutdown, storing the buffer pools on disk.

By setting the innodb_lock_wait_timeout variable, you can control how InnoDB deals with lock waits. This variable exists at both the global and session levels, and controls how long InnoDB will allow a transaction to wait for a row lock before aborting. The default value is 50 seconds. If you are seeing a lot of lock-wait timeouts, you can decrease the value so that locks wait less time. This may help diagnose some of your concurrency problems, or at least allow your queries to time out sooner.
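Both innodb_thread_concurrency and innodb_lock_wait_timeout are dynamic, so you can experiment without restarting the server. A sketch (the values shown are illustrative, not recommendations):

mysql> SET GLOBAL innodb_thread_concurrency = 16;
mysql> SET GLOBAL innodb_lock_wait_timeout = 25;
mysql> SET SESSION innodb_lock_wait_timeout = 5;

Note that a SET GLOBAL change to a session-scoped variable such as innodb_lock_wait_timeout affects only connections made after the change; existing sessions keep their current session values.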
If you are importing lots of data, you can improve load time by making sure your incoming datafiles are sorted in primary key order. In addition, you can turn off autocommit by setting AUTOCOMMIT to 0, which ensures the entire load is committed only once. You can also improve bulk loads by turning off foreign key and unique constraint checking.

Remember, you should approach tuning InnoDB with great care. With so many things to tweak and adjust, it can be very easy for things to go wrong quickly. Be sure to follow the practice of changing one variable at a time (and only with a purpose), and measure, measure, measure.

Troubleshooting Tips for InnoDB

The best tools to use are those listed earlier, including the InnoDB monitors, the SHOW ENGINE INNODB STATUS command (another way to display the data from the InnoDB monitors), and the PERFORMANCE_SCHEMA features. However, there are additional strategies that you may find helpful in dealing with InnoDB errors. This section provides some general best practices for troubleshooting InnoDB problems. Use these practices when faced with errors, warnings, and data corruption issues.

Errors

When encountering errors related to InnoDB, find the error information in the error log. To turn on the error log, use the --log-error startup option.

Deadlocks

If you encounter multiple deadlock failures, you can use the --innodb_print_all_deadlocks option (available in version 5.5 and later) to write all deadlock messages to the error log. This allows you to see more than just the last deadlock shown in the output of SHOW ENGINE INNODB STATUS, and may be informative if your application does not have its own error handlers to deal with deadlocks.

Data dictionary problems

Table definitions are stored in .frm files that have the same name as the table and reside under folders named for each database. Definitions are also stored in the InnoDB data dictionary.
If storage becomes corrupted or a file is damaged, you can encounter errors related to a mismatched data dictionary. Here are some of the more common symptoms and solutions:

Orphaned temporary table
If an ALTER TABLE operation fails, the server may not clean up correctly, leaving a temporary table in the InnoDB tablespace. When this occurs, you can use the table monitor to identify the table name (temporary table names start with #sql). You can then issue a DROP TABLE statement to eliminate the orphaned table.

Cannot open a table
If you see an error like Can't open file: 'somename.innodb' along with an error message in the error log like Cannot find table somedb/somename..., it means there is an orphaned file named somename.frm inside the database folder. In this case, deleting the orphaned .frm file will correct the problem.

Tablespace does not exist
If you are using the --innodb_file_per_table option and encounter an error similar to InnoDB data dictionary has tablespace id N, but tablespace with the id or name does not exist..., you must drop the table and re-create it. However, it is not that simple: you must first re-create the table in another database, locate the .frm file there, copy it to the original database, and then drop the table. This may generate a warning about a missing .ibd file, but it will correct the data dictionary. From there, you can re-create the table and restore the data from backup.

Cannot create a table
If you encounter an error in the error log telling you the table already exists in the data dictionary, you may have a case where there is no corresponding .frm file for the table in question. When this occurs, follow the directions in the error log.

Observe console messages
Some errors and warnings are printed only to the console (i.e., stdout or stderr). When troubleshooting errors or warnings, it is sometimes best to launch MySQL from the command line instead of using the mysqld_safe script.
On Windows, you can use the --console option to prevent suppression of console messages.

I/O problems

I/O problems are generally encountered at startup or when new objects are created or dropped. These types of errors are associated with the InnoDB files and vary in severity. Unfortunately, these problems are normally very specific to the platform or operating system, and therefore may require platform-specific steps to correct. As a general strategy, always check your error log or console for errors, looking particularly for OS-specific errors because these can indicate why the I/O errors are occurring. Also check for missing or corrupt folders in the data directory, along with properly named InnoDB files.

You can also experience I/O problems when there are problems with the data disks. These normally appear at startup, but can occur any time there are disk read or write errors. Hardware errors can sometimes be mistaken for performance problems, so be sure to include your operating system's disk diagnostics in your troubleshooting routine.

Sometimes the problem is a configuration issue. In this case, double-check your configuration file to ensure InnoDB is properly configured. For example, check that the innodb_data_* options are set correctly.

Corrupted databases

If you encounter severe or critical errors in your databases that cause InnoDB to crash or keep the server from starting, you can launch the server with the innodb_force_recovery option in your configuration file, assigning it an integer value from 1 to 6 that causes InnoDB to skip certain operations during startup. This is considered a last-resort option and should be used only in the most extreme cases, when all other attempts to start the server have failed. You should also export the data prior to attempting the procedure. Here is a brief description of each value (more information can be found in the online reference manual):

1. Skip corrupt pages when SELECT statements are issued.
Allows partial data recovery.
2. Do not start the master or purge thread.
3. Do not execute rollbacks after crash recovery.
4. Do not execute insert buffer operations. Do not calculate table statistics.
5. Ignore undo logs on startup.
6. Do not execute the redo log when running recovery.

MyISAM

There are very few things to monitor on the MyISAM storage engine. This is because MyISAM was built for web applications with a focus on fast queries and, as such, has only one feature in the server that you can tune: the key cache. That doesn't mean there is nothing else you can do to improve performance. On the contrary, there are many things you can do, including using options such as low-priority and concurrent inserts. Most fall into one of three areas: optimizing storage on disk, using memory efficiently by monitoring and tuning the key cache, and tuning your tables for maximum performance.

Rather than discussing the broader aspects of these areas, we provide a strategy organized into the following areas of performance improvement:

• Optimizing disk storage
• Tuning your tables for performance
• Using the MyISAM utilities
• Storing a table in index order
• Compressing tables
• Defragmenting tables
• Monitoring the key cache
• Preloading key caches
• Using multiple key caches
• Other parameters to consider

We will discuss each of these briefly in the sections that follow.

Optimizing Disk Storage

Optimizing disk space for MyISAM is more a matter of system configuration than of MyISAM-specific tuning parameters. MyISAM stores each table as its own .myd (data) file and one or more .myi (index) files. They are stored with the .frm file in a folder named for the database, under the data directory specified by the --datadir startup option.
Thus, optimizing disk space for MyISAM is the same as optimizing disk space for the server (i.e., you can see performance improvements by moving the data directory to its own disk, and you can further improve disk performance with RAID or other high availability storage options).

The latest release of MySQL Utilities includes a new utility named the .frm reader (mysqlfrm), which allows you to read .frm files and produce the CREATE statement for the table. You can use this utility whenever you need to diagnose problems with .frm files. See the MySQL Utilities documentation for more information about the .frm reader.

Repairing Your Tables

There are a few SQL commands you can use to keep your tables in optimal condition: ANALYZE TABLE, OPTIMIZE TABLE, and REPAIR TABLE.

The ANALYZE TABLE command examines and reorganizes the key distribution for a table. The MySQL server uses the key distribution to determine the join order when joining on a field other than a constant, as well as which indexes to use for a query. We discuss this command in more detail in "Using ANALYZE TABLE" on page 431.

The REPAIR TABLE command is not really a performance tool; you can use it to fix a corrupted table for the MyISAM, Archive, and CSV storage engines. Use this command to try to recover tables that have become corrupt or are performing very poorly (which is usually a sign that a table has degraded and needs reorganizing or repair).

Use the OPTIMIZE TABLE command to recover deleted blocks and reorganize the table for better performance. You can use this command for MyISAM and InnoDB tables.

While these commands are useful, there are a number of more advanced tools you can use to further manage your MyISAM tables.
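All three commands can be run directly from the mysql client. Here is a sketch using a hypothetical table mydb.parts (note that OPTIMIZE and REPAIR lock the table while they run, so schedule them for quiet periods):

mysql> ANALYZE TABLE mydb.parts;
mysql> OPTIMIZE TABLE mydb.parts;
mysql> REPAIR TABLE mydb.parts;

Each command returns a result set whose Msg_text column reports success or describes any problems found, so check that output before assuming the table is healthy.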
Using the MyISAM Utilities

A number of special utilities included in the MySQL distribution are designed for use with MyISAM tables:

• myisam_ftdump allows you to display information about full-text indexes.
• myisamchk allows you to perform analysis on a MyISAM table.
• myisamlog allows you to view the change logs of a MyISAM table.
• myisampack allows you to compress a table to minimize storage.

myisamchk is the workhorse utility for MyISAM. It can display information about your MyISAM tables or analyze, repair, and optimize them. You can run the command for one or more tables, but you can only use it while the server is running if you flush the tables and lock them; alternatively, you can shut down the server.

Be sure to make a backup of your tables before running this utility, in case the repair or optimization steps fail. In rare cases, myisamchk has been known to leave tables corrupted and irreparable.

The following list describes options related to performance improvement, recovery, and status reporting (see the online MySQL Reference Manual for a complete description of the available options):

analyze
Analyzes the key distribution of indexes to improve query performance.

backup
Makes a copy of the tables (the .myd files) prior to altering them.

check
Checks the table for errors (report only).

extended-check
Does a thorough check of the table for errors, including all indexes (report only).

force
Performs a repair if any errors are found.

information
Shows statistical information about the table. Use this option first to see the condition of your table before running recover.

medium-check
Performs a more thorough check of the table (r