Perl Cookbook, 2nd Edition

Perl Cookbook, 2nd Edition
By Tom Christiansen and Nathan Torkington
Publisher: O'Reilly · Pub Date: August 2003 · ISBN: 0-596-00313-7 · Pages: 1000

Find a Perl programmer, and you'll find a copy of Perl Cookbook nearby. Perl Cookbook is a comprehensive collection of problems, solutions, and practical examples for anyone programming in Perl. The book contains hundreds of rigorously reviewed Perl "recipes" and thousands of examples ranging from brief one-liners to complete applications. The second edition of Perl Cookbook has been fully updated for Perl 5.8, with extensive changes for Unicode support, I/O layers, mod_perl, and new technologies that have emerged since the previous edition of the book. Recipes have been updated to include the latest modules. New recipes have been added to every chapter of the book, and some chapters have almost doubled in size.

Table of Contents

Copyright
Foreword
Preface: What's in This Book; What's New in This Edition; Platform Notes; Other Books; Conventions Used in This Book; We'd Like to Hear from You; Acknowledgments for the First Edition; Acknowledgments for the Second Edition

Chapter 1. Strings: Introduction; 1.1. Accessing Substrings; 1.2. Establishing a Default Value; 1.3. Exchanging Values Without Using Temporary Variables; 1.4. Converting Between Characters and Values; 1.5. Using Named Unicode Characters; 1.6. Processing a String One Character at a Time; 1.7. Reversing a String by Word or Character; 1.8. Treating Unicode Combined Characters as Single Characters; 1.9. Canonicalizing Strings with Unicode Combined Characters; 1.10. Treating a Unicode String as Octets; 1.11. Expanding and Compressing Tabs; 1.12. Expanding Variables in User Input; 1.13. Controlling Case; 1.14. Properly Capitalizing a Title or Headline; 1.15. Interpolating Functions and Expressions Within Strings; 1.16. Indenting Here Documents; 1.17. Reformatting Paragraphs; 1.18. Escaping Characters; 1.19. Trimming Blanks from the Ends of a String; 1.20. Parsing Comma-Separated Data; 1.21. Constant Variables; 1.22. Soundex Matching; 1.23. Program: fixstyle; 1.24. Program: psgrep

Chapter 2. Numbers: Introduction; 2.1. Checking Whether a String Is a Valid Number; 2.2. Rounding Floating-Point Numbers; 2.3. Comparing Floating-Point Numbers; 2.4. Operating on a Series of Integers; 2.5. Working with Roman Numerals; 2.6. Generating Random Numbers; 2.7. Generating Repeatable Random Number Sequences; 2.8. Making Numbers Even More Random; 2.9. Generating Biased Random Numbers; 2.10. Doing Trigonometry in Degrees, Not Radians; 2.11. Calculating More Trigonometric Functions; 2.12. Taking Logarithms; 2.13. Multiplying Matrices; 2.14. Using Complex Numbers; 2.15. Converting Binary, Octal, and Hexadecimal Numbers; 2.16. Putting Commas in Numbers; 2.17. Printing Correct Plurals; 2.18. Program: Calculating Prime Factors

Chapter 3. Dates and Times: Introduction; 3.1. Finding Today's Date; 3.2. Converting DMYHMS to Epoch Seconds; 3.3. Converting Epoch Seconds to DMYHMS; 3.4. Adding to or Subtracting from a Date; 3.5. Difference of Two Dates; 3.6. Day in a Week/Month/Year or Week Number; 3.7. Parsing Dates and Times from Strings; 3.8. Printing a Date; 3.9. High-Resolution Timers; 3.10. Short Sleeps; 3.11. Program: hopdelta

Chapter 4. Arrays: Introduction; 4.1. Specifying a List in Your Program; 4.2. Printing a List with Commas; 4.3. Changing Array Size; 4.4. Implementing a Sparse Array; 4.5. Iterating Over an Array; 4.6. Iterating Over an Array by Reference; 4.7. Extracting Unique Elements from a List; 4.8. Finding Elements in One Array but Not Another; 4.9. Computing Union, Intersection, or Difference of Unique Lists; 4.10. Appending One Array to Another; 4.11. Reversing an Array; 4.12. Processing Multiple Elements of an Array; 4.13. Finding the First List Element That Passes a Test; 4.14. Finding All Elements in an Array Matching Certain Criteria; 4.15. Sorting an Array Numerically; 4.16. Sorting a List by Computable Field; 4.17. Implementing a Circular List; 4.18. Randomizing an Array; 4.19. Program: words; 4.20. Program: permute

Chapter 5. Hashes: Introduction; 5.1. Adding an Element to a Hash; 5.2. Testing for the Presence of a Key in a Hash; 5.3. Creating a Hash with Immutable Keys or Values; 5.4. Deleting from a Hash; 5.5. Traversing a Hash; 5.6. Printing a Hash; 5.7. Retrieving from a Hash in Insertion Order; 5.8. Hashes with Multiple Values per Key; 5.9. Inverting a Hash; 5.10. Sorting a Hash; 5.11. Merging Hashes; 5.12. Finding Common or Different Keys in Two Hashes; 5.13. Hashing References; 5.14. Presizing a Hash; 5.15. Finding the Most Common Anything; 5.16. Representing Relationships Between Data; 5.17. Program: dutree

Chapter 6. Pattern Matching: Introduction; 6.1. Copying and Substituting Simultaneously; 6.2. Matching Letters; 6.3. Matching Words; 6.4. Commenting Regular Expressions; 6.5. Finding the Nth Occurrence of a Match; 6.6. Matching Within Multiple Lines; 6.7. Reading Records with a Separator; 6.8. Extracting a Range of Lines; 6.9. Matching Shell Globs as Regular Expressions; 6.10. Speeding Up Interpolated Matches; 6.11. Testing for a Valid Pattern; 6.12. Honoring Locale Settings in Regular Expressions; 6.13. Approximate Matching; 6.14. Matching from Where the Last Pattern Left Off; 6.15. Greedy and Non-Greedy Matches; 6.16. Detecting Doubled Words; 6.17. Matching Nested Patterns; 6.18. Expressing AND, OR, and NOT in a Single Pattern; 6.19. Matching a Valid Mail Address; 6.20. Matching Abbreviations; 6.21. Program: urlify; 6.22. Program: tcgrep; 6.23. Regular Expression Grab Bag

Chapter 7. File Access: Introduction; 7.1. Opening a File; 7.2. Opening Files with Unusual Filenames; 7.3. Expanding Tildes in Filenames; 7.4. Making Perl Report Filenames in Error Messages; 7.5. Storing Filehandles into Variables; 7.6. Writing a Subroutine That Takes Filehandles as Built-ins Do; 7.7. Caching Open Output Filehandles; 7.8. Printing to Many Filehandles Simultaneously; 7.9. Opening and Closing File Descriptors by Number; 7.10. Copying Filehandles; 7.11. Creating Temporary Files; 7.12. Storing a File Inside Your Program Text; 7.13. Storing Multiple Files in the DATA Area; 7.14. Writing a Unix-Style Filter Program; 7.15. Modifying a File in Place with a Temporary File; 7.16. Modifying a File in Place with the -i Switch; 7.17. Modifying a File in Place Without a Temporary File; 7.18. Locking a File; 7.19. Flushing Output; 7.20. Doing Non-Blocking I/O; 7.21. Determining the Number of Unread Bytes; 7.22. Reading from Many Filehandles Without Blocking; 7.23. Reading an Entire Line Without Blocking; 7.24. Program: netlock; 7.25. Program: lockarea

Chapter 8. File Contents: Introduction; 8.1. Reading Lines with Continuation Characters; 8.2. Counting Lines (or Paragraphs or Records) in a File; 8.3. Processing Every Word in a File; 8.4. Reading a File Backward by Line or Paragraph; 8.5. Trailing a Growing File; 8.6. Picking a Random Line from a File; 8.7. Randomizing All Lines; 8.8. Reading a Particular Line in a File; 8.9. Processing Variable-Length Text Fields; 8.10. Removing the Last Line of a File; 8.11. Processing Binary Files; 8.12. Using Random-Access I/O; 8.13. Updating a Random-Access File; 8.14. Reading a String from a Binary File; 8.15. Reading Fixed-Length Records; 8.16. Reading Configuration Files; 8.17. Testing a File for Trustworthiness; 8.18. Treating a File as an Array; 8.19. Setting the Default I/O Layers; 8.20. Reading or Writing Unicode from a Filehandle; 8.21. Converting Microsoft Text Files into Unicode; 8.22. Comparing the Contents of Two Files; 8.23. Pretending a String Is a File; 8.24. Program: tailwtmp; 8.25. Program: tctee; 8.26. Program: laston; 8.27. Program: Flat File Indexes

Chapter 9. Directories: Introduction; 9.1. Getting and Setting Timestamps; 9.2. Deleting a File; 9.3. Copying or Moving a File; 9.4. Recognizing Two Names for the Same File; 9.5. Processing All Files in a Directory; 9.6. Globbing, or Getting a List of Filenames Matching a Pattern; 9.7. Processing All Files in a Directory Recursively; 9.8. Removing a Directory and Its Contents; 9.9. Renaming Files; 9.10. Splitting a Filename into Its Component Parts; 9.11. Working with Symbolic File Permissions Instead of Octal Values; 9.12. Program: symirror; 9.13. Program: lst

Chapter 10. Subroutines: Introduction; 10.1. Accessing Subroutine Arguments; 10.2. Making Variables Private to a Function; 10.3. Creating Persistent Private Variables; 10.4. Determining Current Function Name; 10.5. Passing Arrays and Hashes by Reference; 10.6. Detecting Return Context; 10.7. Passing by Named Parameter; 10.8. Skipping Selected Return Values; 10.9. Returning More Than One Array or Hash; 10.10. Returning Failure; 10.11. Prototyping Functions; 10.12. Handling Exceptions; 10.13. Saving Global Values; 10.14. Redefining a Function; 10.15. Trapping Undefined Function Calls with AUTOLOAD; 10.16. Nesting Subroutines; 10.17. Writing a Switch Statement; 10.18. Program: Sorting Your Mail

Chapter 11. References and Records: Introduction; 11.1. Taking References to Arrays; 11.2. Making Hashes of Arrays; 11.3. Taking References to Hashes; 11.4. Taking References to Functions; 11.5. Taking References to Scalars; 11.6. Creating Arrays of Scalar References; 11.7. Using Closures Instead of Objects; 11.8. Creating References to Methods; 11.9. Constructing Records; 11.10. Reading and Writing Hash Records to Text Files; 11.11. Printing Data Structures; 11.12. Copying Data Structures; 11.13. Storing Data Structures to Disk; 11.14. Transparently Persistent Data Structures; 11.15. Coping with Circular Data Structures Using Weak References; 11.16. Program: Outlines; 11.17. Program: Binary Trees

Chapter 12. Packages, Libraries, and Modules: Introduction; 12.1. Defining a Module's Interface; 12.2. Trapping Errors in require or use; 12.3. Delaying use Until Runtime; 12.4. Making Variables Private to a Module; 12.5. Making Functions Private to a Module; 12.6. Determining the Caller's Package; 12.7. Automating Module Cleanup; 12.8. Keeping Your Own Module Directory; 12.9. Preparing a Module for Distribution; 12.10. Speeding Module Loading with SelfLoader; 12.11. Speeding Up Module Loading with Autoloader; 12.12. Overriding Built-in Functions; 12.13. Overriding a Built-in Function in All Packages; 12.14. Reporting Errors and Warnings Like Built-ins; 12.15. Customizing Warnings; 12.16. Referring to Packages Indirectly; 12.17. Using h2ph to Translate C #include Files; 12.18. Using h2xs to Make a Module with C Code; 12.19. Writing Extensions in C with Inline::C; 12.20. Documenting Your Module with Pod; 12.21. Building and Installing a CPAN Module; 12.22. Example: Module Template; 12.23. Program: Finding Versions and Descriptions of Installed Modules

Chapter 13. Classes, Objects, and Ties: Introduction; 13.1. Constructing an Object; 13.2. Destroying an Object; 13.3. Managing Instance Data; 13.4. Managing Class Data; 13.5. Using Classes as Structs; 13.6. Cloning Constructors; 13.7. Copy Constructors; 13.8. Invoking Methods Indirectly; 13.9. Determining Subclass Membership; 13.10. Writing an Inheritable Class; 13.11. Accessing Overridden Methods; 13.12. Generating Attribute Methods Using AUTOLOAD; 13.13. Coping with Circular Data Structures Using Objects; 13.14. Overloading Operators; 13.15. Creating Magic Variables with tie

Chapter 14. Database Access: Introduction; 14.1. Making and Using a DBM File; 14.2. Emptying a DBM File; 14.3. Converting Between DBM Files; 14.4. Merging DBM Files; 14.5. Sorting Large DBM Files; 14.6. Storing Complex Data in a DBM File; 14.7. Persistent Data; 14.8. Saving Query Results to Excel or CSV; 14.9. Executing an SQL Command Using DBI; 14.10. Escaping Quotes; 14.11. Dealing with Database Errors; 14.12. Repeating Queries Efficiently; 14.13. Building Queries Programmatically; 14.14. Finding the Number of Rows Returned by a Query; 14.15. Using Transactions; 14.16. Viewing Data One Page at a Time; 14.17. Querying a CSV File with SQL; 14.18. Using SQL Without a Database Server; 14.19. Program: ggh—Grep Netscape Global History

Chapter 15. Interactivity: Introduction; 15.1. Parsing Program Arguments; 15.2. Testing Whether a Program Is Running Interactively; 15.3. Clearing the Screen; 15.4. Determining Terminal or Window Size; 15.5. Changing Text Color; 15.6. Reading Single Characters from the Keyboard; 15.7. Ringing the Terminal Bell; 15.8. Using POSIX termios; 15.9. Checking for Waiting Input; 15.10. Reading Passwords; 15.11. Editing Input; 15.12. Managing the Screen; 15.13. Controlling Another Program with Expect; 15.14. Creating Menus with Tk; 15.15. Creating Dialog Boxes with Tk; 15.16. Responding to Tk Resize Events; 15.17. Removing the DOS Shell Window with Windows Perl/Tk; 15.18. Graphing Data; 15.19. Thumbnailing Images; 15.20. Adding Text to an Image; 15.21. Program: Small termcap Program; 15.22. Program: tkshufflepod; 15.23. Program: graphbox

Chapter 16. Process Management and Communication: Introduction; 16.1. Gathering Output from a Program; 16.2. Running Another Program; 16.3. Replacing the Current Program with a Different One; 16.4. Reading or Writing to Another Program; 16.5. Filtering Your Own Output; 16.6. Preprocessing Input; 16.7. Reading STDERR from a Program; 16.8. Controlling Input and Output of Another Program; 16.9. Controlling the Input, Output, and Error of Another Program; 16.10. Communicating Between Related Processes; 16.11. Making a Process Look Like a File with Named Pipes; 16.12. Sharing Variables in Different Processes; 16.13. Listing Available Signals; 16.14. Sending a Signal; 16.15. Installing a Signal Handler; 16.16. Temporarily Overriding a Signal Handler; 16.17. Writing a Signal Handler; 16.18. Catching Ctrl-C; 16.19. Avoiding Zombie Processes; 16.20. Blocking Signals; 16.21. Timing Out an Operation; 16.22. Turning Signals into Fatal Errors; 16.23. Program: sigrand

Chapter 17. Sockets: Introduction; 17.1. Writing a TCP Client; 17.2. Writing a TCP Server; 17.3. Communicating over TCP; 17.4. Setting Up a UDP Client; 17.5. Setting Up a UDP Server; 17.6. Using Unix Domain Sockets; 17.7. Identifying the Other End of a Socket; 17.8. Finding Your Own Name and Address; 17.9. Closing a Socket After Forking; 17.10. Writing Bidirectional Clients; 17.11. Forking Servers; 17.12. Pre-Forking Servers; 17.13. Non-Forking Servers; 17.14. Multitasking Server with Threads; 17.15. Writing a Multitasking Server with POE; 17.16. Writing a Multihomed Server; 17.17. Making a Daemon Server; 17.18. Restarting a Server on Demand; 17.19. Managing Multiple Streams of Input; 17.20. Program: backsniff; 17.21. Program: fwdport

Chapter 18. Internet Services: Introduction; 18.1. Simple DNS Lookups; 18.2. Being an FTP Client; 18.3. Sending Mail; 18.4. Reading and Posting Usenet News Messages; 18.5. Reading Mail with POP3; 18.6. Simulating Telnet from a Program; 18.7. Pinging a Machine; 18.8. Accessing an LDAP Server; 18.9. Sending Attachments in Mail; 18.10. Extracting Attachments from Mail; 18.11. Writing an XML-RPC Server; 18.12. Writing an XML-RPC Client; 18.13. Writing a SOAP Server; 18.14. Writing a SOAP Client; 18.15. Program: rfrm; 18.16. Program: expn and vrfy

Chapter 19. CGI Programming: Introduction; 19.1. Writing a CGI Script; 19.2. Redirecting Error Messages; 19.3. Fixing a 500 Server Error; 19.4. Writing a Safe CGI Program; 19.5. Executing Commands Without Shell Escapes; 19.6. Formatting Lists and Tables with HTML Shortcuts; 19.7. Redirecting to a Different Location; 19.8. Debugging the Raw HTTP Exchange; 19.9. Managing Cookies; 19.10. Creating Sticky Widgets; 19.11. Writing a Multiscreen CGI Script; 19.12. Saving a Form to a File or Mail Pipe; 19.13. Program: chemiserie

Chapter 20. Web Automation: Introduction; 20.1. Fetching a URL from a Perl Script; 20.2. Automating Form Submission; 20.3. Extracting URLs; 20.4. Converting ASCII to HTML; 20.5. Converting HTML to ASCII; 20.6. Extracting or Removing HTML Tags; 20.7. Finding Stale Links; 20.8. Finding Fresh Links; 20.9. Using Templates to Generate HTML; 20.10. Mirroring Web Pages; 20.11. Creating a Robot; 20.12. Parsing a Web Server Log File; 20.13. Processing Server Logs; 20.14. Using Cookies; 20.15. Fetching Password-Protected Pages; 20.16. Fetching https:// Web Pages; 20.17. Resuming an HTTP GET; 20.18. Parsing HTML; 20.19. Extracting Table Data; 20.20. Program: htmlsub; 20.21. Program: hrefsub

Chapter 21. mod_perl: Introduction; 21.1. Authenticating; 21.2. Setting Cookies; 21.3. Accessing Cookie Values; 21.4. Redirecting the Browser; 21.5. Interrogating Headers; 21.6. Accessing Form Parameters; 21.7. Receiving Uploaded Files; 21.8. Speeding Up Database Access; 21.9. Customizing Apache's Logging; 21.10. Transparently Storing Information in URLs; 21.11. Communicating Between mod_perl and PHP; 21.12. Migrating from CGI to mod_perl; 21.13. Sharing Information Between Handlers; 21.14. Reloading Changed Modules; 21.15. Benchmarking a mod_perl Application; 21.16. Templating with HTML::Mason; 21.17. Templating with Template Toolkit

Chapter 22. XML: Introduction; 22.1. Parsing XML into Data Structures; 22.2. Parsing XML into a DOM Tree; 22.3. Parsing XML into SAX Events; 22.4. Making Simple Changes to Elements or Text; 22.5. Validating XML; 22.6. Finding Elements and Text Within an XML Document; 22.7. Processing XML Stylesheet Transformations; 22.8. Processing Files Larger Than Available Memory; 22.9. Reading and Writing RSS Files; 22.10. Writing XML

Colophon
Index

Copyright

Copyright © 2003, 1998 O'Reilly & Associates, Inc. Printed in the United States of America.

Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: (800) 998-9938.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a bighorn sheep and the topic of Perl is a trademark of O'Reilly & Associates, Inc.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Foreword

They say that it's easy to get trapped by a metaphor. But some metaphors are so magnificent that you don't mind getting trapped in them. Perhaps the cooking metaphor is one such, at least in this case.
The only problem I have with it is a personal one—I feel a bit like Betty Crocker's mother. The work in question is so monumental that anything I could say here would be either redundant or irrelevant. However, that never stopped me before.

Cooking is perhaps the humblest of the arts; but to me humility is a strength, not a weakness. Great artists have always had to serve their artistic medium—great cooks just do so literally. And the more humble the medium, the more humble the artist must be in order to lift the medium beyond the mundane. Food and language are both humble media, consisting as they do of an overwhelming profusion of seemingly unrelated and unruly ingredients. And yet, in the hands of someone with a bit of creativity and discipline, things like potatoes, pasta, and Perl are the basis of works of art that "hit the spot" in a most satisfying way, not merely getting the job done, but doing so in a way that makes your journey through life a little more pleasant.

Cooking is also one of the oldest of the arts. Some modern artists would have you believe that so-called ephemeral art is a recent invention, but cooking has always been an ephemeral art. We can try to preserve our art, make it last a little longer, but even the food we bury with our pharaohs gets dug up eventually. So too, much of our Perl programming is ephemeral. This aspect of Perl cuisine has been much maligned. You can call it quick-and-dirty if you like, but there are billions of dollars out there riding on the supposition that fast food is not necessarily dirty food. (We hope.)

Easy things should be easy, and hard things should be possible. For every fast-food recipe, there are countless slow-food recipes. One of the advantages of living in California is that I have ready access to almost every national cuisine ever invented. But even within a given culture, There's More Than One Way To Do It.
It's said in Russia that there are more recipes for borscht than there are cooks, and I believe it. My mom's recipe doesn't even have any beets in it! But that's okay, and it's more than okay. Borscht is a cultural differentiator, and different cultures are interesting, and educational, and useful, and exciting.

So you won't always find Tom and Nat doing things in this book the way I would do them. Sometimes they don't even do things the same way as each other. That's okay—again, this is a strength, not a weakness. I have to confess that I learned quite a few things I didn't know before I read this book. What's more, I'm quite confident that I still don't know it all. And I hope I don't any time soon.

I often talk about Perl culture as if it were a single, static entity, but there are in fact many healthy Perl subcultures, not to mention sub-subcultures and supercultures and circumcultures in every conceivable combination, all inheriting attributes and methods from each other. It can get confusing. Hey, I'm confused most of the time.

So the essence of a cookbook like this is not to cook for you (it can't), or even to teach you how to cook (though it helps), but rather to pass on various bits of culture that have been found useful, and perhaps to filter out other bits of "culture" that grew in the refrigerator when no one was looking. You in turn will pass on some of these ideas to other people, filtering them through your own experiences and tastes, your creativity and discipline. You'll come up with your own recipes to pass to your children. Just don't be surprised when they in turn cook up some recipes of their own, and ask you what you think. Try not to make a face.

I commend to you these recipes, over which I've made very few faces.

—Larry Wall, June, 1998

Preface

The investment group eyed the entrepreneur with caution, their expressions flickering from skepticism to intrigue and back again.
"Your bold plan holds promise," their spokesman conceded. "But it is costly and entirely speculative. Our mathematicians mistrust your figures. Why should we entrust our money into your hands? What do you know that we do not?"

"For one thing," he replied, "I know how to balance an egg on its point without outside support. Do you?"

And with that, the entrepreneur reached into his satchel and delicately withdrew a fresh hen's egg. He handed over the egg to the financial tycoons, who passed it amongst themselves trying to carry out the simple task. At last they gave up. In exasperation they declared, "What you ask is impossible! No man can balance an egg on its point."

So the entrepreneur took back the egg from the annoyed businessmen and placed it upon the fine oak table, holding it so that its point faced down. Lightly but firmly, he pushed down on the egg with just enough force to crush in its bottom about half an inch. When he took his hand away, the egg stood there on its own, somewhat messy, but definitely balanced.

"Was that impossible?" he asked.

"It's just a trick," cried the businessmen. "Once you know how, anyone can do it."

"True enough," came the retort. "But the same can be said for anything. Before you know how, it seems an impossibility. Once the way is revealed, it's so simple that you wonder why you never thought of it that way before. Let me show you that easy way, so others may easily follow. Will you trust me?"

Eventually convinced that this entrepreneur might possibly have something to show them, the skeptical venture capitalists funded his project. From the tiny Andalusian port of Palos de Moguer set forth the Niña, the Pinta, and the Santa María, led by an entrepreneur with a slightly broken egg and his own ideas: Christopher Columbus. Many have since followed.

Approaching a programming problem can be like balancing Columbus's egg.
If no one shows you how, you may sit forever perplexed, watching the egg—and your program—fall over again and again, no closer to the Indies than when you began. This is especially true in a language as idiomatic as Perl.

This book isn't meant to be a complete reference book for Perl. Keeping a copy of Programming Perl handy will let you look up exact definitions of operators, keywords, functions, pragmata, or modules. Alternatively, every Perl installation comes with a voluminous collection of searchable, online reference materials. If those aren't where you can easily get at them, see your system administrator if you have one, or consult the online documentation.

Neither is this book meant to be a bare-bones introduction for programmers who have never seen Perl before. That's what Learning Perl, a kinder and gentler introduction to Perl, is designed for. (If you're on a Microsoft system, you might prefer the Learning Perl for Win32 Systems version.)

Instead, this is a book for learning more Perl. Neither a reference book nor a tutorial book, Perl Cookbook serves as a companion book to both. It's for people who already know the basics but are wondering how to mix all those ingredients together into a complete program. Spread across 22 chapters and more than 400 focused topic areas affectionately called recipes, this task-oriented book contains thousands of solutions to everyday challenges encountered by novice and journeyman alike.

We tried hard to make this book useful for both random and sequential access. Each recipe is self-contained, but has a list of references at the end should you need further information on the topic. We've tried to put the simpler, more common recipes toward the front of each chapter and the simpler chapters toward the front of the book. Perl novices should find that these recipes about Perl's basic data types and operators are just what they're looking for.
We gradually work our way through topic areas and solutions more geared toward the journeyman Perl programmer. Now and then we include material that should inspire even the master Perl programmer.

Each chapter begins with an overview of that chapter's topic. This introduction is followed by the main body of each chapter, its recipes. In the spirit of the Perl slogan of TMTOWTDI, "There's more than one way to do it," most recipes show several different techniques for solving the same or closely related problems. These recipes range from short-but-sweet solutions to in-depth mini-tutorials. Where more than one technique is given, we often show costs and benefits of each approach.

As with a traditional cookbook, we expect you to access this book more or less at random. When you want to learn how to do something, you'll look up its recipe. Even if the solutions presented don't fit your problem exactly, they'll give you ideas about possible approaches.

Each chapter concludes with one or more complete programs. Although some recipes already include small programs, these longer applications highlight the chapter's principal focus and combine techniques from other chapters, just as any real-world program would. All are useful, and many are used on a daily basis. Some even helped us put this book together.

What's in This Book

Spread over five chapters, the first portion of the book addresses Perl's basic data types. Chapter 1, Strings, covers matters like accessing substrings, expanding function calls in strings, and parsing comma-separated data; it also covers Unicode strings. Chapter 2, Numbers, tackles oddities of floating-point representation, placing commas in numbers, and pseudo-random numbers. Chapter 3, Dates and Times, demonstrates conversions between numeric and string date formats and using timers. Chapter 4, Arrays, covers everything relating to list and array manipulation, including finding unique elements in a list, efficiently sorting lists, and randomizing them.
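As a taste of the array techniques just mentioned, here is a minimal sketch of one classic way to extract unique elements from a list, using a hash to remember what has already been seen (the variable names are illustrative, not from any particular recipe):

```perl
use strict;
use warnings;

# Keep the first occurrence of each element, preserving order:
# %seen counts how often each element has appeared so far, and
# grep keeps an element only if its count was zero.
my @list = qw(perl tcl perl python tcl);
my %seen;
my @uniq = grep { !$seen{$_}++ } @list;

print "@uniq\n";   # perl tcl python
```

The same %seen idiom generalizes to finding elements in one array but not another, another task this chapter covers.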
Chapter 5, Hashes, concludes the basics with a demonstration of the most useful data type, the associative array. The chapter shows how to access a hash in insertion order, how to sort a hash by value, how to have multiple values per key, and how to have an immutable hash.

Chapter 6, Pattern Matching, includes recipes for converting a shell wildcard into a pattern, matching letters or words, matching multiple lines, avoiding greediness, matching nested or recursive patterns, and matching strings that are close to but not exactly what you're looking for. Although this chapter is one of the longest in the book, it could easily have been longer still—every chapter contains uses of regular expressions. It's part of what makes Perl Perl.

The next three chapters cover the filesystem. Chapter 7, File Access, shows opening files, locking them for concurrent access, modifying them in place, and storing filehandles in variables. Chapter 8, File Contents, discusses managing temporary files, watching the end of a growing file, reading a particular line from a file, handling alternative character encodings like Unicode and Microsoft character sets, and random-access binary I/O. Finally, in Chapter 9, Directories, we show techniques to copy, move, or delete a file, manipulate a file's timestamps, and recursively process all files in a directory.

Chapter 10 through Chapter 13 focus on making your program flexible and powerful. Chapter 10, Subroutines, includes recipes on creating persistent local variables, passing parameters by reference, calling functions indirectly, crafting a switch statement, and handling exceptions. Chapter 11, References and Records, is about data structures; basic manipulation of references to data and functions is demonstrated. Later recipes show how to create elaborate data structures and how to save and restore these structures from permanent storage.
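The reference techniques covered in Chapter 11 underlie nested structures such as hashes of arrays; a minimal sketch (with made-up data, purely for illustration):

```perl
use strict;
use warnings;

# A hash of arrays: each hash value is a reference to an anonymous array.
my %lines_by_file = (
    'recipes.txt' => [ 'borscht', 'latkes' ],
);

# Append through the reference; @{ ... } dereferences it as an array.
push @{ $lines_by_file{'recipes.txt'} }, 'pierogi';

my $count = scalar @{ $lines_by_file{'recipes.txt'} };
print "$count lines\n";   # 3 lines
```

Because the hash stores references rather than copies, the same structure can be passed to subroutines or written to disk without flattening.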
Chapter 12 concerns breaking up your program into separate files; we discuss how to make variables and functions private to a module, customize warnings for modules, replace built-ins, trap errors loading missing modules, and use the h2ph and h2xs tools to interact with C and C++ code. Lastly, Chapter 13 covers the fundamentals of building your own object-based module to create user-defined types, complete with constructors, destructors, and inheritance. Other recipes show examples of circular data structures, operator overloading, and tied data types.

The next two chapters are about interfaces: one to databases, the other to users. Chapter 14 includes techniques for manipulating DBM files and querying and updating databases with SQL and the DBI module. Chapter 15 covers topics such as clearing the screen, processing command-line switches, single-character input, moving the cursor using termcap and curses, thumbnailing images, and graphing data.

The last portion of the book is devoted to interacting with other programs and services. Chapter 16 is about running other programs and collecting their output, handling zombie processes, named pipes, signal management, and sharing variables between running programs. Chapter 17 shows how to establish stream connections or use datagrams to create low-level networking applications for client-server programming. Chapter 18 is about higher-level protocols such as mail, FTP, Usenet news, XML-RPC, and SOAP. Chapter 19 contains recipes for processing web forms, trapping their errors, avoiding shell escapes for security, managing cookies, shopping cart techniques, and saving forms to files or pipes. Chapter 20 covers non-interactive uses of the Web, such as fetching web pages, automating form submissions in a script, extracting URLs from a web page, removing HTML tags, finding fresh or stale links, and parsing HTML. Chapter 21 introduces mod_perl, the Perl interpreter embedded in Apache.
It covers fetching form parameters, issuing redirections, customizing Apache's logging, handling authentication, and advanced templating with Mason and the Template Toolkit. Finally, Chapter 22 is about the ubiquitous data format XML and includes recipes such as validating XML, parsing XML into events and trees, and transforming XML into other formats.

What's New in This Edition

The book you're holding is thicker than its previous edition of five years ago—about 200 pages thicker. New material is spread across more than 80 entirely new recipes plus over 100 existing recipes that were substantially updated since the first edition. You'll also find two new chapters: one on mod_perl, Perl's interface to the popular Apache web server; the other on XML, an increasingly important standard for exchanging structured data.

Growth in this book reflects growth in Perl itself, from Version 5.004 in the first edition to v5.8.1 in this one. Syntactic changes to the core language are nevertheless comparatively few. These include the spiffy our keyword to replace the crufty use vars construct for declaring global variables, fancier forms of open to disambiguate filenames with strange characters in them, and automatic allocation of anonymous filehandles into undefined scalar variables. We've updated our solutions and code examples to reflect these changes where it made sense to make use of the new features.

Several of Perl's major subsystems have been completely overhauled for improved functionality, stability, and portability. Some of these are relatively isolated, like the subsystems for threading (see Recipe 17.14) and for safe signals (see Recipe 16.17). Their applications are usually confined to systems programming. More sweeping are the changes to Perl and to this book that stem from integrated support for Unicode characters.
The areas most profoundly affected are strings (now with multibyte characters) and I/O (now with stackable encoding layers), so Chapter 1 and Chapter 8 include new introductory material to orient you to these sometimes confusing topics. These chapters also provide the bulk of recipes dealing with those specific topics, but this fundamental shift touches many more recipes throughout the book.

Another growth area for this book and Perl has been the welcome proliferation of many highly used and highly useful modules now released standard with the Perl core. Previously, these modules had to be separately located, downloaded, configured, built, tested, and installed. Now that they're included in the standard distribution, that's all taken care of when installing Perl itself. Some new core modules are really pragmas that alter Perl's compilation or runtime environment, as demonstrated in Recipe 1.21, Recipe 12.3, and Recipe 12.15. Some are programmer tools to aid code development and debugging, like the modules shown in Recipe 11.11, Recipe 11.13, Recipe 11.15, and Recipe 22.2. Others augment basic operations available on core data types, like those shown in Recipe 2.1, Recipe 4.13, Recipe 4.18, Recipe 5.3, Recipe 8.7, and Recipe 11.15. Finally, the networking modules have at last made their way into the core distribution, as seen throughout Chapter 18. We've probably not seen the last of this inward migration of modules.

Platform Notes

This book was developed using Perl release v5.8.1. That means major release 5, minor release 8, and patch level 1. We tested most programs and examples under BSD, Linux, and SunOS, but that doesn't mean they'll work only on those systems. Perl was designed for platform independence. When you use Perl as a general-purpose programming language, employing basic operations like variables, patterns, subroutines, and high-level I/O, your program should work the same everywhere that Perl runs—which is just about everywhere.
The first two-thirds of this book uses Perl for general-purpose programming.

Perl was originally conceived as a high-level, cross-platform language for systems programming. Although it has long since expanded beyond its original domain, Perl continues to be heavily used for systems programming, both on its native Unix systems and elsewhere. Most recipes in Chapter 14 through Chapter 18 deal with classic systems programming. For maximum portability in this area, we've mainly focused on open systems as defined by the Portable Operating System Interface (POSIX), which includes nearly every form of Unix and numerous other systems as well. Most recipes should run with little or no modification on any POSIX system.

You can still use Perl for systems programming work even on non-POSIX systems by using vendor-specific modules, but these are not covered in this book. That's because they're not portable—and, to be perfectly frank, because we have no such systems at our disposal. Consult the documentation that came with your port of Perl for any proprietary modules that may have been included. The perlport(1) manpage is a good start; its SEE ALSO section points to per-platform documentation, such as perlmacos(1) and perlvms(1).

But don't worry. Many recipes for systems programming should work on non-POSIX systems as well, especially those dealing with databases, networking, and web interaction. That's because the modules used for those areas hide platform dependencies. The principal exception is those few recipes and programs that rely upon multitasking constructs, notably the powerful fork function, standard on POSIX systems but seldom found on others. Mac OS X now supports fork natively, however, and even on Windows systems Perl now emulates that syscall remarkably well.

When we needed structured files, we picked the convenient Unix /etc/passwd database; when we needed a text file to read, we picked /etc/motd; and when we needed a program to produce output, we picked who(1).
These were merely chosen to illustrate the principles—the principles work whether or not your system has these files and programs.

Other Books

If you'd like to learn more about Perl, here are some related publications that we (somewhat sheepishly) recommend:

Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant; O'Reilly & Associates (Third Edition, 2000). This book is indispensable for every Perl programmer. Coauthored by Perl's creator, this classic reference is the authoritative guide to Perl's syntax, functions, modules, references, invocation options, and much more.

Mastering Algorithms with Perl, by Jon Orwant, Jarkko Hietaniemi, and John Macdonald; O'Reilly & Associates (2000). All the useful techniques from a CS algorithms course, but without the painful proofs. This book covers fundamental and useful algorithms in the fields of graphs, text, sets, and more.

Mastering Regular Expressions, by Jeffrey Friedl; O'Reilly & Associates (Second Edition, 2002). This book is dedicated to explaining regular expressions from a practical perspective. It not only covers general regular expressions and Perl patterns well, it also compares and contrasts these with those used in other popular languages.

Object Oriented Perl, by Damian Conway; Manning (1999). For beginning as well as advanced OO programmers, this book explains common and esoteric techniques for writing powerful object systems in Perl.

Learning Perl, by Randal Schwartz and Tom Phoenix; O'Reilly & Associates (Third Edition, 2001). A tutorial introduction to Perl for folks who are already programmers and who are interested in learning Perl from scratch. It's a good starting point if this book is over your head. Erik Olson refurbished this book for Windows systems as Learning Perl on Win32 Systems.

Programming the Perl DBI, by Tim Bunce and Alligator Descartes; O'Reilly & Associates (2000).
The only book on Perl's relational database interface, by the author of the DBI module.

CGI Programming with Perl, by Scott Guelich, Shishir Gundavaram, and Gunther Birznieks; O'Reilly & Associates (Second Edition, 2000). This is a solid introduction to the world of CGI programming.

Writing Apache Modules with Perl and C, by Lincoln Stein and Doug MacEachern; O'Reilly & Associates (1999). This guide to web programming teaches you how to extend the capabilities of the Apache web server, especially using the turbo-charged mod_perl for fast CGI scripts and via the Perl-accessible Apache API.

Practical mod_perl, by Stas Bekman and Eric Cholet; O'Reilly & Associates (2003). A comprehensive guide to installing, configuring, and developing with mod_perl. This book goes into corners of mod_perl programming that no other book dares to touch.

The mod_perl Developer's Cookbook, by Geoff Young, Paul Lindner, and Randy Kobes; SAMS (2002). Written in a similar style to the Cookbook you hold in your hand, this book belongs on every mod_perl developer's desk. It covers almost every task a mod_perl developer might want to perform.

Beyond the Perl-related publications listed here, the following books came in handy when writing this book. They were used for reference, consultation, and inspiration.

The Art of Computer Programming, by Donald Knuth, Volumes I-III: "Fundamental Algorithms," "Seminumerical Algorithms," and "Sorting and Searching"; Addison-Wesley (Third Edition, 1998).

Introduction to Algorithms, by Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest; MIT Press and McGraw-Hill (1990).

Algorithms in C, by Robert Sedgewick; Addison-Wesley (1992).

The Art of Mathematics, by Jerry P. King; Plenum (1992).

The Elements of Programming Style, by Brian W. Kernighan and P.J. Plauger; McGraw-Hill (1988).

The UNIX Programming Environment, by Brian W. Kernighan and Rob Pike; Prentice-Hall (1984).

POSIX Programmer's Guide, by Donald Lewine; O'Reilly & Associates (1991).
Advanced Programming in the UNIX Environment, by W. Richard Stevens; Addison-Wesley (1992).

TCP/IP Illustrated, by W. Richard Stevens, et al., Volumes I-III; Addison-Wesley (1992-1996).

HTML: The Definitive Guide, by Chuck Musciano and Bill Kennedy; O'Reilly & Associates (Third Edition, 1998).

Official Guide to Programming with CGI.pm, by Lincoln Stein; John Wiley & Sons (1997).

Web Client Programming with Perl, by Clinton Wong; O'Reilly & Associates (1997).

The New Fowler's Modern English Usage, edited by R.W. Burchfield; Oxford (Third Edition, 1996).

Conventions Used in This Book

Programming Conventions

We give lots of examples, most of which are pieces of code that should go into a larger program. Some examples are complete programs, which you can recognize because they begin with a #! line. We start nearly all of our longer programs with:

    #!/usr/bin/perl -w
    use strict;

or else the newer:

    #!/usr/bin/perl
    use strict;
    use warnings;

Still other examples are things to be typed on a command line. We've used % to show the shell prompt:

    % perl -e 'print "Hello, world.\n"'
    Hello, world.

This style represents a standard Unix command line, where single quotes represent the "most quoted" form. Quoting and wildcard conventions on other systems vary. For example, many command-line interpreters under MS-DOS and VMS require double quotes instead of single ones to group arguments with spaces or wildcards in them.

Typesetting Conventions

The following typographic conventions are used in this book:

Bold is used exclusively for command-line switches. This allows one to distinguish, for example, between the -w warnings switch and the -w filetest operator.

Italic is used for URLs, manpages, pathnames, and programs. New terms are also italicized when they first appear in the text.

Constant Width is used for function and method names and their arguments; in examples to show text that you enter verbatim; and in regular text to show literal code.
Constant Width Bold Italic is used in examples to show output produced.

Indicates a warning or caution.

Documentation Conventions

The most up-to-date and complete documentation about Perl is included with Perl itself. If typeset and printed, this massive anthology would use more than a thousand pages of printed paper, greatly contributing to global deforestation. Fortunately, you don't have to print it out, because it's available in a convenient and searchable electronic form.

When we refer to a "manpage" in this book, we're talking about this set of online manuals. The name is purely a convention; you don't need a Unix-style man program to read them. The perldoc command distributed with Perl also works, and you may even have the manpages installed as HTML pages, especially on non-Unix systems. Plus, once you know where they're installed, you can grep them directly.[1] The HTML version of the manpages is available on the Web at:

[1] If your system doesn't have grep, use the tcgrep program supplied at the end of Chapter 6.

When we refer to non-Perl documentation, as in "See kill(2) in your system manual," this refers to the kill manpage from section 2 of the Unix Programmer's Manual (system calls). These won't be available on non-Unix systems, but that's probably okay, because you couldn't use them there anyway. If you really do need the documentation for a system call or library function, many organizations have put their manpages on the Web; a quick search of Google for crypt(3) manual will find many copies.

We'd Like to Hear from You

We have tested and verified the information in this book to the best of our ability, but you may find that features have changed (which may in fact resemble bugs). Please let us know about any errors you find, as well as your suggestions for future editions, by writing to:

O'Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the U.S.
or Canada)
(707) 829-0515 (international or local)
(707) 829-0104 (FAX)

You can also send us messages electronically. To be put on the mailing list or request a catalog, send email to:

To ask technical questions or comment on the book, send email to:

There is a web site for the book, where we'll list errata and plans for future editions. Here you'll also find source code for the book's examples available for download so you don't have to type them in yourself. You can access this page at:

For more information about this book and others, see the O'Reilly web site:

Acknowledgments for the First Edition

This book wouldn't exist but for a legion of people standing, knowing and unknowing, behind the authors. At the head of this legion would have to be our editor, Linda Mui, carrot on a stick in one hand and a hot poker in the other. She was great.

As the author of Perl, Larry Wall was our ultimate reality check. He made sure we weren't documenting things he was planning to change and helped out on wording and style.[2] If now and then you think you're hearing Larry's voice in this book, you probably are.

[2] And footnotes.

Larry's wife, Gloria, a literary critic by trade, shocked us by reading through every single word—and actually liking most of them. Together with Sharon Hopkins, resident Perl Poetess, she helped us rein in our admittedly nearly insatiable tendency to produce pretty prose sentences that could only be charitably described as lying somewhere between the inscrutably complex and the hopelessly arcane, eventually rendering the meandering muddle into something legible even to those whose native tongues were neither PDP-11 assembler nor Mediæval Spanish.

Our three most assiduous reviewers, Mark-Jason Dominus, Jon Orwant, and Abigail, have worked with us on this book nearly as long as we've been writing it. Their rigorous standards, fearsome intellects, and practical experience in Perl applications have been of invaluable assistance.
Doug Edwards methodically stress-tested every piece of code from the first seven chapters of the book, finding subtle border cases no one else ever thought about. Other major reviewers include Andy Dougherty, Andy Oram, Brent Halsey, Bryan Buus, Gisle Aas, Graham Barr, Jeff Haemer, Jeffrey Friedl, Lincoln Stein, Mark Mielke, Martin Brech, Matthias Neeracher, Mike Stok, Nate Patwardhan, Paul Grassie, Peter Prymmer, Raphaël Manfredi, and Rod Whitby.

And this is just the beginning. Part of what makes Perl fun is the sense of community and sharing it seems to engender. Many selfless individuals lent us their technical expertise. Some read through complete chapters in formal review. Others provided insightful answers to brief technical questions when we were stuck on something outside our own domain. A few even sent us code. Here's a partial list of these helpful people: Aaron Harsh, Ali Rayl, Alligator Descartes, Andrew Hume, Andrew Strebkov, Andy Wardley, Ashton MacAndrews, Ben Gertzfield, Benjamin Holzman, Brad Hughes, Chaim Frenkel, Charles Bailey, Chris Nandor, Clinton Wong, Dan Klein, Dan Sugalski, Daniel Grisinger, Dennis Taylor, Doug MacEachern, Douglas Davenport, Drew Eckhardt, Dylan Northrup, Eric Eisenhart, Eric Watt Forste, Greg Bacon, Gurusamy Sarathy, Henry Spencer, Jason Ornstein, Jason Stewart, Joel Noble, Jonathan Cohen, Jonathan Scott Duff, Josh Purinton, Julian Anderson, Keith Winstein, Ken Lunde, Kirby Hughes, Larry Rosler, Les Peters, Mark Hess, Mark James, Martin Brech, Mary Koutsky, Michael Parker, Nick Ing-Simmons, Paul Marquess, Peter Collinson, Peter Osel, Phil Beauchamp, Piers Cawley, Randal Schwartz, Rich Rauenzahn, Richard Allan, Rocco Caputo, Roderick Schertler, Roland Walker, Ronan Waide, Stephen Lidie, Steven Owens, Sullivan Beck, Tim Bunce, Todd Miller, Troy Denkinger, and Willy Grimm.

And let's not forget Perl itself, without which this book could never have been written.
Appropriately enough, we used Perl to build endless small tools to help produce this book. Perl tools converted our text in pod format into troff for displaying and review and into FrameMaker for production. Another Perl program ran syntax checks on every piece of code in the book. The Tk extension to Perl was used to build a graphical tool to shuffle around recipes using drag-and-drop. Beyond these, we also built innumerable smaller tools for tasks like checking RCS locks, finding duplicate words, detecting certain kinds of grammatical errors, managing mail folders with feedback from reviewers, creating program indices and tables of contents, and running text searches that crossed line boundaries or were restricted to certain sections—just to name a few. Some of these tools found their way into the same book they were used on.

Tom

Thanks first of all to Larry and Gloria for sacrificing some of their European vacation to groom the many nits out of this manuscript, and to my other friends and family—Bryan, Sharon, Brent, Todd, and Drew—for putting up with me over the last couple of years and being subjected to incessant proofreadings.

I'd like to thank Nathan for holding up despite the stress of his weekly drives, my piquant vegetarian cooking and wit, and his getting stuck researching the topics I so diligently avoided.

I'd like to thank those largely unsung titans in our field—Dennis, Linus, Kirk, Eric, and Rich—who were all willing to take the time to answer my niggling operating system and troff questions. Their wonderful advice and anecdotes aside, without their tremendous work in the field, this book could never have been written.

Thanks also to my instructors who sacrificed themselves to travel to perilous places like New Jersey to teach Perl in my stead.

I'd like to thank Tim O'Reilly and Frank Willison first for being talked into publishing this book, and second for letting time-to-market take a back seat to time-to-quality.
Thanks also to Linda, our shamelessly honest editor, for shepherding dangerously rabid sheep through the eye of a release needle.

Most of all, I want to thank my mother, Mary, for tearing herself away from her work in prairie restoration and teaching high school computer and biological sciences to keep both my business and domestic life in smooth working order long enough for me to research and write this book.

Finally, I'd like to thank Johann Sebastian Bach, who was for me a boundless font of perspective, poise, and inspiration—a therapy both mental and physical. I am certain that forevermore the Cookbook will evoke for me the sounds of BWV 849, now indelibly etched into the wetware of head and hand.

Nat

Without my family's love and patience, I'd be baiting hooks in a 10-foot swell instead of mowing my lawn in suburban America. Thank you! My friends have taught me much: Jules, Amy, Raj, Mike, Kef, Sai, Robert, Ewan, Pondy, Mark, and Andy. I owe a debt of gratitude to the denizens of Nerdsholm, who gave sound technical advice and introduced me to my wife (they didn't give me sound technical advice on her, though). Thanks also to my employer, Front Range Internet, for a day job I don't want to quit.

Tom was a great co-author. Without him, this book would be nasty, brutish, and short.

Finally, I have to thank Jenine. We'd been married a year when I accepted the offer to write, and we've barely seen each other since then. Nobody will savour the final full-stop in this sentence more than she.

Acknowledgments for the Second Edition

We would like to thank our many tech reviewers, who gave generously of their time and knowledge so that we might look better. Some were formal reviewers who painstakingly plodded through endless drafts and revisions, while others were casual comrades roped into reading small excerpts related to their own particular expertise or interest. The bugs you don't find in this book are thanks to them.
Those you do find were probably introduced after they reviewed it. Just a few of these selfless people were Adam Maccabee Trachtenberg, Rafael Garcia-Suarez, Ask Björn Hansen, Mark-Jason Dominus, Abhijit Menon-Sen, Jarkko Hietaniemi, Benjamin Goldberg, Aaron Straup Cope, Tony Stubblebine, Michel Rodriguez, Nick Ing-Simmons, Geoffrey Young, Douglas Wilson, Paul Kulchenko, Jeffrey Friedl, Arthur Bergman, Autrijus Tang, Matt Sergeant, Steve Marvell, Damian Conway, Sean M. Burke, Elaine Ashton, Steve Lidie, Ken Williams, Robert Spier, Chris Nandor, Brent Halsey, Matthew Free, Rocco Caputo, Robin Berjon, Adam Turoff, Chip Turner, David Sklar, Mike Sierra, Dave Rolsky, Kip Hampton, Chris Fedde, Graham Barr, Jon Orwant, Rich Bowen, Mike Stok, Tim Bunce, Rob Brown, Dan Brian, Gisle Aas, and Abigail.

We'd also like to thank our patient and persistent editor, Linda Mui, who ran serious risk of getting herself committed as she tried to wrestle "the final edits" from us.

Tom

I would like to thank Larry Wall for making the programming world (and several others) a better place for all of us, Nathan for documenting the undocumented, and our editor, Linda Mui, for her indefatigable patience at herding her author cats of the Schrödinger clan ever onward. This book would not exist but for all three of them.

I would especially like to thank someone who is no longer here to read these words in print, words he would otherwise himself have shepherded: O'Reilly's longtime editor-in-chief and my friend, Frank Willison, gone from us two years now. His many erudite epistles are a thing of legend, carefully crafted treasures more dear to any writer than finest gold. Over our years of working together, Frank was a constant source of personal inspiration and encouragement.
His easygoing cheer and charm, his broad learning and interests, and his sparkling wit—sometimes subtle, sometimes hilarious, and often both—made him more deserving of being called avuncular than anyone else I have ever known, and as such do I miss him. Thank you, Frank, wherever you are.

Nat

Henry David Thoreau wrote, "What is commonly called friendship is only a little more honor among rogues." If that be true, I have two honorable rogues to thank: Jon Orwant, who engineered my job at O'Reilly & Associates; and Linda Mui, who helped me keep it.

As with the first edition, the book in your hands wouldn't be there without Tom's drive, attention to detail, and willingness to tackle the hard stuff. Thanks for taking the Unicode bullet, Tom.

And finally, my family. Jenine nearly became a solo parent while I worked on this book. My heart broke when William sadly told a friend, "My daddy works and works—all day and all night," and again when one of Raley's first two-word sentences was "Daddy work." Thank you all.

Chapter 1. Strings

He multiplieth words without knowledge.
—Job 35:16

Introduction

Many programming languages force you to work at an uncomfortably low level. You think in lines, but your language wants you to deal with pointers. You think in strings, but it wants you to deal with bytes. Such a language can drive you to distraction. Don't despair; Perl isn't a low-level language, so lines and strings are easy to handle.

Perl was designed for easy but powerful text manipulation. In fact, Perl can manipulate text in so many ways that they can't all be described in one chapter. Check out other chapters for recipes on text processing. In particular, see Chapter 6 and Chapter 8, which discuss interesting techniques not covered here.

Perl's fundamental unit for working with data is the scalar, that is, a single value stored in a single (scalar) variable. Scalar variables hold strings, numbers, and references.
Array and hash variables hold lists or associations of scalars, respectively. References are used for referring to values indirectly, not unlike pointers in low-level languages. Numbers are usually stored in your machine's double-precision floating-point notation.

Strings in Perl may be of any length, within the limits of your machine's virtual memory, and can hold any arbitrary data you care to put there—even binary data containing null bytes. A string in Perl is not an array of characters—nor of bytes, for that matter. You cannot use array subscripting on a string to address one of its characters; use substr for that. Like all data types in Perl, strings grow on demand. Space is reclaimed by Perl's garbage collection system when no longer used, typically when the variables have gone out of scope or when the expression in which they were used has been evaluated. In other words, memory management is already taken care of, so you don't have to worry about it.

A scalar value is either defined or undefined. If defined, it may hold a string, number, or reference. The only undefined value is undef. All other values are defined, even numeric 0 and the empty string. Definedness is not the same as Boolean truth, though; to check whether a value is defined, use the defined function. Boolean truth has a specialized meaning, tested with operators such as && and || or in an if or while block's test condition.

Two defined strings are false: the empty string ("") and a string of length one containing the digit zero ("0"). All other defined values (e.g., "false", 15, and \$x) are true. You might be surprised to learn that "0" is false, but this is due to Perl's on-demand conversion between strings and numbers. The values 0., 0.00, and 0.0000000 are all numbers and are therefore false when unquoted, since the number zero in any of its guises is always false.
However, those three values ("0.", "0.00", and "0.0000000") are true when used as literal quoted strings in your program code or when they're read from the command line, an environment variable, or an input file. This is seldom an issue, since conversion is automatic when the value is used numerically. If it has never been used numerically, though, and you just test whether it's true or false, you might get an unexpected answer—Boolean tests never force any sort of conversion. Adding 0 to the variable makes Perl explicitly convert the string to a number:

    print "Gimme a number: ";
    0.00000
    chomp($n = <STDIN>);     # $n now holds "0.00000"
    print "The value $n is ", $n ? "TRUE" : "FALSE", "\n";
    The value 0.00000 is TRUE

    $n += 0;
    print "The value $n is now ", $n ? "TRUE" : "FALSE", "\n";
    The value 0 is now FALSE

(Here the 0.00000 on the second line is what the user typed at the prompt, and the TRUE/FALSE lines are the program's output.)

The undef value behaves like the empty string ("") when used as a string, 0 when used as a number, and the null reference when used as a reference. But in all three possible cases, it's false. Using an undefined value where Perl expects a defined value will trigger a runtime warning message on STDERR if you've enabled warnings. Merely asking whether something is true or false demands no particular value, so this is exempt from warnings. Some operations do not trigger warnings when used on variables holding undefined values. These include the autoincrement and autodecrement operators, ++ and --, and the addition and concatenation assignment operators, += and .= ("plus-equals" and "dot-equals").

Specify strings in your program using single quotes, double quotes, the quoting operators q// and qq//, or here documents. No matter which notation you use, string literals are one of two possible flavors: interpolated or uninterpolated. Interpolation governs whether variable references and special sequences are expanded. Most are interpolated by default, such as in patterns (/regex/) and running commands ($x = `cmd`).
Where special characters are recognized, preceding any special character with a backslash renders that character mundane; that is, it becomes a literal. This is often referred to as "escaping" or "backslash escaping."

Using single quotes is the canonical way to get an uninterpolated string literal. Three special sequences are still recognized: ' to terminate the string, \' to represent a single quote, and \\ to represent a backslash in the string.

    $string = '\n';                     # two characters, \ and an n
    $string = 'Jon \'Maddog\' Orwant';  # literal single quotes

Double quotes interpolate variables (but not function calls—see Recipe 1.15 to find how to do this) and expand backslash escapes. These include "\n" (newline), "\033" (the character with octal value 33), "\cJ" (Ctrl-J), "\x1B" (the character with hex value 0x1B), and so on. The full list of these is given in the perlop(1) manpage and the section on "Specific Characters" in Chapter 5 of Programming Perl.

    $string = "\n";                     # a "newline" character
    $string = "Jon \"Maddog\" Orwant";  # literal double quotes

If there are no backslash escapes or variables to expand within the string, it makes no difference which flavor of quotes you use. When choosing between writing 'this' and writing "this", some Perl programmers prefer to use double quotes so that the strings stand out. This also avoids the slight risk of having single quotes mistaken for backquotes by readers of your code. It makes no difference to Perl, and it might help readers.

The q// and qq// quoting operators allow arbitrary delimiters on uninterpolated and interpolated literals, respectively, corresponding to single- and double-quoted strings.
For an uninterpolated string literal that contains single quotes, it's easier to use q// than to escape all single quotes with backslashes:

$string = 'Jon \'Maddog\' Orwant';  # embedded single quotes
$string = q/Jon 'Maddog' Orwant/;   # same thing, but more legible

Choose the same character for both delimiters, as we just did with /, or pair any of the following four sets of bracketing characters:

$string = q[Jon 'Maddog' Orwant];   # literal single quotes
$string = q{Jon 'Maddog' Orwant};   # literal single quotes
$string = q(Jon 'Maddog' Orwant);   # literal single quotes
$string = q<Jon 'Maddog' Orwant>;   # literal single quotes

Here documents are a notation borrowed from the shell used to quote a large chunk of text. The text can be interpreted as single-quoted, double-quoted, or even as commands to be executed, depending on how you quote the terminating identifier. Uninterpolated here documents do not expand the three backslash sequences the way single-quoted literals normally do. Here we double-quote two lines with a here document:

$a = <<"EOF";
This is a multiline here document
terminated by EOF on a line by itself
EOF

Notice there's no semicolon after the terminating EOF. Here documents are covered in more detail in Recipe 1.16.

The Universal Character Code

As far as the computer is concerned, all data is just a series of individual numbers, each a string of bits. Even text strings are just sequences of numeric codes interpreted as characters by programs like web browsers, mailers, printing programs, and editors.

Back when memory sizes were far smaller and memory prices far more dear, programmers would go to great lengths to save memory. Strategies such as stuffing six characters into one 36-bit word or jamming three characters into one 16-bit word were common. Even today, the numeric codes used for characters usually aren't longer than 7 or 8 bits, which are the lengths you find in ASCII and Latin1, respectively.
That doesn't leave many bits per character, and thus, not many characters. Consider an image file with 8-bit color. You're limited to 256 different colors in your palette. Similarly, with characters stored as individual octets (an octet is an 8-bit byte), a document can usually have no more than 256 different letters, punctuation marks, and symbols in it.

ASCII, being the American Standard Code for Information Interchange, was of limited utility outside the United States, since it covered only the characters needed for a slightly stripped-down dialect of American English. Consequently, many countries invented their own incompatible 8-bit encodings built upon 7-bit ASCII. Conflicting schemes for assigning numeric codes to characters sprang up, all reusing the same limited range. That meant the same number could mean a different character in different systems and that the same character could have been assigned a different number in different systems.

Locales were an early attempt to address this and other language- and country-specific issues, but they didn't work out so well for character set selection. They're still reasonable for purposes unrelated to character sets, such as local preferences for monetary units, date and time formatting, and even collating sequences. But they are of far less utility for reusing the same 8-bit namespace for different character sets. If you wanted to produce a document that used Latin, Greek, and Cyrillic characters, you were in for big trouble, since the same numeric code would be a different character under each system. For example, character number 196 is a Latin capital A with a diaeresis above it in ISO 8859-1 (Latin1); under ISO 8859-7, that same numeric code represents a Greek capital delta. So a program interpreting numeric character codes in the ISO 8859-1 locale would see one character, but under the ISO 8859-7 locale, it would see something totally different.
This makes it hard to combine different character sets in the same document. Even if you did cobble something together, few programs could work with that document's text. To know what characters you had, you'd have to know what system they were in, and you couldn't easily mix systems. If you guessed wrong, you'd get a jumbled mess on your screen, or worse.

Unicode Support in Perl

Enter Unicode. Unicode attempts to unify all character sets in the entire world, including many symbols and even fictional character sets. Under Unicode, different characters have different numeric codes, called code points. Mixed-language documents are now easy, whereas before they weren't even possible. You no longer have just 128 or 256 possible characters per document. With Unicode you can have tens of thousands (and more) of different characters all jumbled together in the same document without confusion. The problem of mixing, say, an Ä with a Δ evaporates. The first character, formally named "LATIN CAPITAL LETTER A WITH DIAERESIS" under Unicode, is assigned the code point U+00C4 (that's the Unicode preferred notation). The second, a "GREEK CAPITAL LETTER DELTA", is now at code point U+0394. With different characters always assigned different code points, there's no longer any conflict.

Perl has supported Unicode since v5.6 or so, but it wasn't until the v5.8 release that Unicode support was generally considered robust and usable. By no coincidence, this corresponded to the introduction of I/O layers and their support for encodings into Perl. These are discussed in more detail in Chapter 8.

All Perl's string functions and operators, including those used for pattern matching, now operate on characters instead of octets. If you ask for a string's length, Perl reports how many characters are in that string, not how many bytes are in it. If you extract the first three characters of a string using substr, the result may or may not be three bytes.
You don't know, and you shouldn't care, either. One reason not to care about the particular underlying bytewise representation is that if you have to pay attention to it, you're probably looking too closely. It shouldn't matter, really. But if it does, this might mean that Perl's implementation still has a few bumps in it. We're working on that.

Because characters with code points above 256 are supported, the chr function is no longer restricted to arguments under 256, nor is ord restricted to returning an integer smaller than that. Ask for chr(0x394), for example, and you'll get a Greek capital delta: Δ.

$char = chr(0x394);
$code = ord($char);
printf "char %s is code %d, %#04x\n", $char, $code, $code;
char Δ is code 916, 0x394

If you test the length of that string, it will say 1, because it's just one character. Notice how we said character; we didn't say anything about its length in bytes. Certainly the internal representation requires more than just 8 bits for a numeric code that big. But you the programmer are dealing with characters as abstractions, not as physical octets. Low-level details like that are best left up to Perl.

You shouldn't think of characters and bytes as the same. Programmers who interchange bytes and characters are guilty of the same class of sin as C programmers who blithely interchange integers and pointers. Even though the underlying representations may happen to coincide on some platforms, this is just a coincidence, and conflating abstract interfaces with physical implementations will always come back to haunt you, eventually.

You have several ways to put Unicode characters into Perl literals. If you're lucky enough to have a text editor that lets you enter Unicode directly into your Perl program, you can inform Perl you've done this via the use utf8 pragma. Another way is to use \x escapes in Perl interpolated strings to indicate a character by its code point in hex, as in \xC4.
Characters with code points above 0xFF require more than two hex digits, so these must be enclosed in braces.

print "\xC4 and \x{0394} look different\n";
Ä and Δ look different

Recipe 1.5 describes how to use charnames to put \N{NAME} escapes in string literals, such as \N{GREEK CAPITAL LETTER DELTA}, \N{greek:Delta}, or even just \N{Delta} to indicate a Δ character.

That's enough to get started using Unicode in Perl alone, but getting Perl to interact properly with other programs requires a bit more. Using the old single-byte encodings like ASCII or ISO 8859-n, when you wrote out a character whose numeric code was NN, a single byte with numeric code NN would appear. What actually appeared depended on which fonts were available, your current locale setting, and quite a few other factors. But under Unicode, this exact duplication of logical character numbers (code points) into physical bytes emitted no longer applies. Instead, characters must be encoded in any of several available output formats. Internally, Perl uses a format called UTF-8, but many other encoding formats for Unicode exist, and Perl can work with those, too. The use encoding pragma tells Perl in which encoding your script itself has been written, or which encoding the standard filehandles should use. The use open pragma can set encoding defaults for all handles. Special arguments to open or to binmode specify the encoding format for that particular handle. The -C command-line flag is a shortcut to set the encoding on all (or just standard) handles, plus the program arguments themselves. The environment variables PERLIO, PERL_ENCODING, and PERL_UNICODE all give Perl various sorts of hints related to these matters.

Recipe 1.1 Accessing Substrings

1.1.1 Problem

You want to access or modify just a portion of a string, not the whole thing. For instance, you've read a fixed-width record and want to extract individual fields.
1.1.2 Solution

The substr function lets you read from and write to specific portions of the string.

$value = substr($string, $offset, $count);
$value = substr($string, $offset);

substr($string, $offset, $count) = $newstring;
substr($string, $offset, $count, $newstring);  # same as previous
substr($string, $offset) = $newtail;

The unpack function gives only read access, but is faster when you have many substrings to extract.

# get a 5-byte string, skip 3 bytes,
# then grab two 8-byte strings, then the rest;
# (NB: only works on ASCII data, not Unicode)
($leading, $s1, $s2, $trailing) = unpack("A5 x3 A8 A8 A*", $data);

# split at 5-byte boundaries
@fivers = unpack("A5" x (length($string)/5), $string);

# chop string into individual single-byte characters
@chars = unpack("A1" x length($string), $string);

1.1.3 Discussion

Strings are a basic data type; they aren't arrays of a basic data type. Instead of using array subscripting to access individual characters, as you sometimes do in other programming languages, in Perl you use functions like unpack or substr to access individual characters or a portion of the string.

The offset argument to substr indicates the start of the substring you're interested in, counting from the front if positive and from the end if negative. If the offset is 0, the substring starts at the beginning. The count argument is the length of the substring.

$string = "This is what you have";
#          +012345678901234567890  Indexing forwards  (left to right)
#           109876543210987654321- Indexing backwards (right to left)
#            note that 0 means 10 or 20, etc. above

$first = substr($string, 0, 1);   # "T"
$start = substr($string, 5, 2);   # "is"
$rest  = substr($string, 13);     # "you have"
$last  = substr($string, -1);     # "e"
$end   = substr($string, -4);     # "have"
$piece = substr($string, -8, 3);  # "you"

You can do more than just look at parts of the string with substr; you can actually change them.
That's because substr is a particularly odd kind of function: an lvaluable one, that is, a function whose return value may itself be assigned a value. (For the record, the others are vec, pos, and keys. If you squint, local, my, and our can also be viewed as lvaluable functions.)

$string = "This is what you have";
print $string;
This is what you have

substr($string, 5, 2) = "wasn't";   # change "is" to "wasn't"
This wasn't what you have

substr($string, -12) = "ondrous";   # "This wasn't wondrous"
This wasn't wondrous

substr($string, 0, 1) = "";         # delete first character
his wasn't wondrous

substr($string, -10) = "";          # delete last 10 characters
his wasn'

Use the =~ operator and the s///, m//, or tr/// operators in conjunction with substr to make them affect only that portion of the string.

# you can test substrings with =~
if (substr($string, -10) =~ /pattern/) {
    print "Pattern matches in last 10 characters\n";
}

# substitute "at" for "is", restricted to first five characters
substr($string, 0, 5) =~ s/is/at/g;

You can even swap values by using several substrs on each side of an assignment:

# exchange the first and last letters in a string
$a = "make a hat";
(substr($a,0,1), substr($a,-1)) = (substr($a,-1), substr($a,0,1));
print $a;
take a ham

Although unpack is not lvaluable, it is considerably faster than substr when you extract numerous values all at once. Specify a format describing the layout of the record to unpack. For positioning, use a lowercase "x" with a count to skip forward some number of bytes, an uppercase "X" with a count to skip backward some number of bytes, and "@" to skip to an absolute byte offset within the record. (If the data contains Unicode strings, be careful with those three: they're strictly byte-oriented, and moving around by bytes within multibyte data is perilous at best.)
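The examples that follow demonstrate "x" and "X"; here is a small sketch of "@" as well, with a record layout we've invented purely for illustration:

```perl
use strict;
use warnings;

my $record = "HDR12345BODY";            # invented fixed-width record
my ($id)   = unpack('@3 A5', $record);  # jump to absolute byte offset 3, grab 5 bytes
# $id is now "12345"
```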
# extract column with unpack
$a = "To be or not to be";

$b = unpack("x6 A6", $a);   # skip 6, grab 6
print $b;
or not

($b, $c) = unpack("x6 A2 X5 A2", $a);   # forward 6, grab 2; backward 5, grab 2
print "$b\n$c\n";
or
be

Sometimes you prefer to think of your data as being cut up at specific columns. For example, you might want to place cuts right before positions 8, 14, 20, 26, and 30. Those are the column numbers where each field begins. Although you could calculate that the proper unpack format is "A7 A6 A6 A6 A4 A*", this is too much mental strain for the virtuously lazy Perl programmer. Let Perl figure it out for you. Use the cut2fmt function:

sub cut2fmt {
    my(@positions) = @_;
    my $template = '';
    my $lastpos  = 1;
    foreach $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}

$fmt = cut2fmt(8, 14, 20, 26, 30);
print "$fmt\n";
A7 A6 A6 A6 A4 A*

The powerful unpack function goes far beyond mere text processing. It's the gateway between text and binary data. In this recipe, we've assumed that all character data is 7- or 8-bit data so that pack's byte operations work as expected.

1.1.4 See Also

The pack, unpack, and substr functions in perlfunc(1) and in Chapter 29 of Programming Perl; use of the cut2fmt subroutine in Recipe 1.24; the binary use of unpack in Recipe 8.24

Recipe 1.2 Establishing a Default Value

1.2.1 Problem

You would like to supply a default value to a scalar variable, but only if it doesn't already have one. It often happens that you want a hardcoded default value for a variable that can be overridden from the command line or through an environment variable.
1.2.2 Solution

Use the || or ||= operator, which work on both strings and numbers:

# use $b if $b is true, else $c
$a = $b || $c;

# set $x to $y unless $x is already true
$x ||= $y;

If 0, "0", and "" are valid values for your variables, use defined instead:

# use $b if $b is defined, else $c
$a = defined($b) ? $b : $c;

# the "new" defined-or operator from future perl
use v5.9;
$a = $b // $c;

1.2.3 Discussion

The big difference between the two techniques (defined and ||) is what they test: definedness versus truth. Three defined values are still false in the world of Perl: 0, "0", and "". If your variable already held one of those, and you wanted to keep that value, a || wouldn't work. You'd have to use the more elaborate three-way test with defined instead.

It's often convenient to arrange for your program to care about only true or false values, not defined or undefined ones. Rather than being restricted in its return values to a mere 1 or 0 as in most other languages, Perl's || operator has a much more interesting property: it returns its first operand (the lefthand side) if that operand is true; otherwise it returns its second operand. The && operator also returns the last evaluated expression, but is less often used for this property. These operators don't care whether their operands are strings, numbers, or references; any scalar will do. They just return the first one that makes the whole expression true or false. This doesn't affect the Boolean sense of the return value, but it does make the operators' return values more useful.

This property lets you provide a default value to a variable, function, or longer expression in case the first part doesn't pan out. Here's an example of ||, which would set $foo to be the contents of either $bar or, if $bar were false, "DEFAULT VALUE":

$foo = $bar || "DEFAULT VALUE";

Here's another example, which sets $dir to be either the first argument to the program or "/tmp" if no argument were given.
$dir = shift(@ARGV) || "/tmp";

We can do this without altering @ARGV:

$dir = $ARGV[0] || "/tmp";

If 0 is a valid value for $ARGV[0], we can't use ||, because it evaluates as false even though it's a value we want to accept. We must resort to Perl's only ternary operator, the ?: ("hook colon," or just "hook"):

$dir = defined($ARGV[0]) ? shift(@ARGV) : "/tmp";

We can also write this as follows, although with slightly different semantics:

$dir = @ARGV ? $ARGV[0] : "/tmp";

This checks the number of elements in @ARGV, because the first operand (here, @ARGV) is evaluated in scalar context. It's only false when there are 0 elements, in which case we use "/tmp". In all other cases (when the user gives an argument), we use the first argument.

The following line increments a value in %count, using as the key either $shell or, if $shell is false, "/bin/sh".

$count{ $shell || "/bin/sh" }++;

You may chain several alternatives together as we have in the following example. The first expression that returns a true value will be used.

# find the user name on Unix systems
$user = $ENV{USER}
     || $ENV{LOGNAME}
     || getlogin( )
     || (getpwuid($<))[0]
     || "Unknown uid number $<";

The && operator works analogously: it returns its first operand if that operand is false; otherwise, it returns the second one. Because there aren't as many interesting false values as there are true ones, this property isn't used much. One use is demonstrated in Recipe 13.12 and Recipe 14.19.

The ||= assignment operator looks odd, but it works exactly like the other binary assignment operators. For nearly all of Perl's binary operators, $VAR OP= VALUE means $VAR = $VAR OP VALUE; for example, $a += $b is the same as $a = $a + $b. So ||= is used to set a variable when that variable is itself still false. Since the || check is a simple Boolean one, testing for truth, it doesn't care about undefined values, even when warnings are enabled.
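Because ||= never complains about an undefined left-hand side, it's handy for filling in defaults in an options hash. A minimal sketch (the keys here are invented for illustration):

```perl
use strict;
use warnings;

my %opt = (level => 2);   # pretend this came from command-line parsing
$opt{verbose} ||= 0;      # key doesn't exist yet: set it, with no "uninitialized" warning
$opt{level}   ||= 5;      # already true, so it keeps its value of 2
```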
Here's an example of ||= that sets $starting_point to "Greenwich" unless it is already set. Again, we assume $starting_point won't have the value 0 or "0", or that if it does, it's okay to change it.

$starting_point ||= "Greenwich";

You can't use or in place of || in assignments, because or's precedence is too low. $a = $b or $c is equivalent to ($a = $b) or $c. This will always assign $b to $a, which is not the behavior you want.

Don't extend this curious use of || and ||= from scalars to arrays and hashes. It doesn't work, because the operators put their left operand into scalar context. Instead, you must do something like this:

@a = @b unless @a;      # copy only if empty
@a = @b ? @b : @c;      # assign @b if nonempty, else @c

Perl is someday expected to support the new operators //, //=, and err. It may already do so by the time you read this text. These defined-or operators will work just like the logical-or operators, ||, except that they will test definedness, not mere truth. That will make the following pairs equivalent:

$a = defined($b) ? $b : $c;
$a = $b // $c;

$x = defined($x) ? $x : $y;
$x //= $y;

defined(read(FH, $buf, $count)) or die "read failed: $!";
read(FH, $buf, $count) err die "read failed: $!";

These three operators are already present in Perl release v5.9, which, being an odd-numbered release, is an experimental version and not what you want in a production environment. They are expected to be in v5.10, which will be a stable release, and will most certainly be in Perl v6, whose release date has not yet been determined.

1.2.4 See Also

The || operator in perlop(1) and Chapter 3 of Programming Perl; the defined and exists functions in perlfunc(1) and Chapter 29 of Programming Perl

Recipe 1.3 Exchanging Values Without Using Temporary Variables

1.3.1 Problem

You want to exchange the values of two scalar variables, but don't want to use a temporary variable.

1.3.2 Solution

Use list assignment to reorder the variables.
($VAR1, $VAR2) = ($VAR2, $VAR1);

1.3.3 Discussion

Most programming languages require an intermediate step when swapping two variables' values:

$temp = $a;
$a    = $b;
$b    = $temp;

Not so in Perl. It tracks both sides of the assignment, guaranteeing that you don't accidentally clobber any of your values. This eliminates the temporary variable:

$a = "alpha";
$b = "omega";
($a, $b) = ($b, $a);    # the first shall be last -- and versa vice

You can even exchange more than two variables at once:

($alpha, $beta, $production) = qw(January March August);

# move beta       to alpha,
# move production to beta,
# move alpha      to production
($alpha, $beta, $production) = ($beta, $production, $alpha);

When this code finishes, $alpha, $beta, and $production have the values "March", "August", and "January".

1.3.4 See Also

The section on "List value constructors" in perldata(1) and on "List Values and Arrays" in Chapter 2 of Programming Perl

Recipe 1.4 Converting Between Characters and Values

1.4.1 Problem

You want to print the number represented by a given character, or you want to print a character given a number.

1.4.2 Solution

Use ord to convert a character to a number, or use chr to convert a number to its corresponding character:

$num  = ord($char);
$char = chr($num);

The %c format used in printf and sprintf also converts a number to a character:

$char = sprintf("%c", $num);    # slower than chr($num)
printf("Number %d is character %c\n", $num, $num);
Number 101 is character e

A C* template used with pack and unpack can quickly convert many 8-bit bytes; similarly, use U* for Unicode characters.

@bytes  = unpack("C*", $string);
$string = pack("C*", @bytes);

$unistr   = pack("U4", 0x24b6, 0x24b7, 0x24b8, 0x24b9);
@unichars = unpack("U*", $unistr);

1.4.3 Discussion

Unlike low-level, typeless languages such as assembler, Perl doesn't treat characters and numbers interchangeably; it treats strings and numbers interchangeably.
That means you can't just assign characters and numbers back and forth. Perl provides Pascal's chr and ord to convert between a character and its corresponding ordinal value:

$value     = ord("e");    # now 101
$character = chr(101);    # now "e"

If you already have a character, it's really represented as a string of length one, so just print it out directly using print or the %s format in printf and sprintf. The %c format forces printf or sprintf to convert a number into a character; it's not used for printing a character that's already in character format (that is, a string).

printf("Number %d is character %c\n", 101, 101);

The pack, unpack, chr, and ord functions are all faster than sprintf. Here are pack and unpack in action:

@ascii_character_numbers = unpack("C*", "sample");
print "@ascii_character_numbers\n";
115 97 109 112 108 101

$word = pack("C*", @ascii_character_numbers);
$word = pack("C*", 115, 97, 109, 112, 108, 101);   # same
print "$word\n";
sample

Here's how to convert from HAL to IBM:

$hal = "HAL";
@byte = unpack("C*", $hal);
foreach $val (@byte) {
    $val++;                 # add one to each byte value
}
$ibm = pack("C*", @byte);
print "$ibm\n";             # prints "IBM"

On single-byte character data, such as plain old ASCII or any of the various ISO 8859 charsets, the ord function returns numbers from 0 to 255. These correspond to C's unsigned char data type. However, Perl understands more than that: it also has integrated support for Unicode, the universal character encoding. If you pass chr, sprintf "%c", or pack "U*" numeric values greater than 255, the return result will be a Unicode string. Here are similar operations with Unicode:

@unicode_points = unpack("U*", "fac\x{0327}ade");
print "@unicode_points\n";
102 97 99 807 97 100 101

$word = pack("U*", @unicode_points);
print "$word\n";
façade

If all you're doing is printing out the characters' values, you probably don't even need to use unpack.
Perl's printf and sprintf functions understand a v modifier that works like this:

printf "%vd\n", "fac\x{0327}ade";
102.97.99.807.97.100.101

printf "%vx\n", "fac\x{0327}ade";
66.61.63.327.61.64.65

The numeric value of each character (that is, its "code point" in Unicode parlance) in the string is emitted with a dot separator.

1.4.4 See Also

The chr, ord, printf, sprintf, pack, and unpack functions in perlfunc(1) and Chapter 29 of Programming Perl

Recipe 1.5 Using Named Unicode Characters

1.5.1 Problem

You want to use Unicode names for fancy characters in your code without worrying about their code points.

1.5.2 Solution

Place a use charnames at the top of your file, then freely insert "\N{CHARSPEC}" escapes into your string literals.

1.5.3 Discussion

The use charnames pragma lets you use symbolic names for Unicode characters. These are compile-time constants that you access with the \N{CHARSPEC} double-quoted string sequence. Several subpragmas are supported. The :full subpragma grants access to the full range of character names, but you have to write them out in full, exactly as they occur in the Unicode character database, including the loud, all-capitals notation. The :short subpragma gives convenient shortcuts. Any import without a colon tag is taken to be a script name, giving case-sensitive shortcuts for those scripts.

use charnames ':full';
print "\N{GREEK CAPITAL LETTER DELTA} is called delta.\n";
Δ is called delta.

use charnames ':short';
print "\N{greek:Delta} is an upper-case delta.\n";
Δ is an upper-case delta.

use charnames qw(cyrillic greek);
print "\N{Sigma} and \N{sigma} are Greek sigmas.\n";
print "\N{Be} and \N{be} are Cyrillic bes.\n";
Σ and σ are Greek sigmas.
Б and б are Cyrillic bes.

Two functions, charnames::viacode and charnames::vianame, can translate between numeric code points and the long names. The Unicode documents use the notation U+XXXX to indicate the Unicode character whose code point is XXXX, so we'll use that here in our output.
use charnames qw(:full);
for $code (0xC4, 0x394) {
    printf "Character U+%04X (%s) is named %s\n",
        $code, chr($code), charnames::viacode($code);
}
Character U+00C4 (Ä) is named LATIN CAPITAL LETTER A WITH DIAERESIS
Character U+0394 (Δ) is named GREEK CAPITAL LETTER DELTA

use charnames qw(:full);
$name = "MUSIC SHARP SIGN";
$code = charnames::vianame($name);
printf "%s is character U+%04X (%s)\n", $name, $code, chr($code);
MUSIC SHARP SIGN is character U+266F (♯)

Here's how to find the path to Perl's copy of the Unicode character database:

% perl -MConfig -le 'print "$Config{privlib}/unicore/NamesList.txt"'
/usr/local/lib/perl5/5.8.1/unicore/NamesList.txt

Read this file to learn the character names available to you.

1.5.4 See Also

The charnames(3) manpage and Chapter 31 of Programming Perl; the Unicode Character Database

Recipe 1.6 Processing a String One Character at a Time

1.6.1 Problem

You want to process a string one character at a time.

1.6.2 Solution

Use split with a null pattern to break up the string into individual characters, or use unpack if you just want the characters' values:

@array = split(//, $string);     # each element a single character
@array = unpack("U*", $string);  # each element a code point (number)

Or extract each character in turn with a loop:

while (/(.)/g) {   # . is never a newline here
    # $1 has character, ord($1) its number
}

1.6.3 Discussion

As we said before, Perl's fundamental unit is the string, not the character. Needing to process anything a character at a time is rare. Usually some kind of higher-level Perl operation, like pattern matching, solves the problem more handily. See, for example, Recipe 7.14, where a set of substitutions is used to find command-line arguments.

Splitting on a pattern that matches the empty string returns a list of individual characters in the string. This is a convenient feature when done intentionally, but it's easy to do unintentionally.
For instance, /X*/ matches all possible strings, including the empty string. Odds are you will find others when you don't mean to. Here's an example that prints the characters used in the string "an apple a day", sorted in ascending order:

%seen = ( );
$string = "an apple a day";
foreach $char (split //, $string) {
    $seen{$char}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are: adelnpy

These split and unpack solutions give an array of characters to work with. If you don't want an array, use a pattern match with the /g flag in a while loop, extracting one character at a time:

%seen = ( );
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are: adelnpy

In general, whenever you find yourself doing character-by-character processing, there's probably a better way to go about it. Instead of using index and substr or split and unpack, it might be easier to use a pattern. Instead of computing a 32-bit checksum by hand, as in the next example, the unpack function can compute it far more efficiently.

The following example calculates the checksum of $string with a foreach loop. There are better checksums; this just happens to be the basis of a traditional and computationally easy checksum. You can use the standard[1] Digest::MD5 module if you want a more robust checksum.

[1] It's standard as of the v5.8 release of Perl; otherwise, grab it from CPAN.
$sum = 0;
foreach $byteval (unpack("C*", $string)) {
    $sum += $byteval;
}
print "sum is $sum\n";
# prints "1248" if $string was "an apple a day"

This does the same thing, but much faster:

$sum = unpack("%32C*", $string);

This emulates the SysV checksum program:

#!/usr/bin/perl
# sum - compute 16-bit checksum of all input files
$checksum = 0;
while (<>) { $checksum += unpack("%16C*", $_) }
$checksum %= (2 ** 16) - 1;
print "$checksum\n";

Here's an example of its use:

% perl sum /etc/termcap
1510

If you have the GNU version of sum, you'll need to call it with the --sysv option to get the same answer on the same file.

% sum --sysv /etc/termcap
1510 851 /etc/termcap

Another tiny program that processes its input one character at a time is slowcat, shown in Example 1-1. The idea here is to pause after each character is printed so you can scroll text before an audience slowly enough that they can read it.

Example 1-1. slowcat

#!/usr/bin/perl
# slowcat - emulate a  s l o w  line printer
# usage: slowcat [-DELAY] [files ...]
$DELAY = ($ARGV[0] =~ /^-([.\d]+)/) ? (shift, $1) : 1;
$| = 1;
while (<>) {
    for (split(//)) {
        print;
        select(undef, undef, undef, 0.005 * $DELAY);
    }
}

1.6.4 See Also

The split and unpack functions in perlfunc(1) and Chapter 29 of Programming Perl; the use of select for timing is explained in Recipe 3.10

Recipe 1.7 Reversing a String by Word or Character

1.7.1 Problem

You want to reverse the words or characters of a string.

1.7.2 Solution

Use the reverse function in scalar context for flipping characters:

$revchars = reverse($string);

To flip words, use reverse in list context with split and join:

$revwords = join(" ", reverse split(" ", $string));

1.7.3 Discussion

The reverse function is two different functions in one. Called in scalar context, it joins together its arguments and returns that string in reverse order. Called in list context, it returns its arguments in the opposite order.
When using reverse for its character-flipping behavior, use scalar to force scalar context unless it's entirely obvious. $gnirts = reverse($string); # reverse letters in $string @sdrow = reverse(@words); # reverse elements in @words $confused = reverse(@words); # reverse letters in join("", @words) Here's an example of reversing words in a string. Using a single space, " ", as the pattern to split is a special case. It causes split to use contiguous whitespace as the separator and also discard leading null fields, just like awk. Normally, split discards only trailing null fields. # reverse word order $string = 'Yoda said, "can you see this?"'; @allwords = split(" ", $string); $revwords = join(" ", reverse @allwords); print $revwords, "\n"; this?" see you "can said, Yoda We could remove the temporary array @allwords and do it on one line: $revwords = join(" ", reverse split(" ", $string)); Multiple whitespace in $string becomes a single space in $revwords. If you want to preserve whitespace, use this: $revwords = join("", reverse split(/(\s+)/, $string)); One use of reverse is to test whether a word is a palindrome (a word that reads the same backward or forward): $word = "reviver"; $is_palindrome = ($word eq reverse($word)); We can turn this into a one-liner that finds big palindromes in /usr/dict/words: % perl -nle 'print if $_ eq reverse && length > 5' /usr/dict/words deedeed degged deified denned hallah kakkak murdrum redder repaper retter reviver rotator sooloos tebbet terret tut-tut 1.7.4 See Also The split, reverse, and scalar functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 1.8 [ Team LiB ] [ Team LiB ] Recipe 1.8 Treating Unicode Combined Characters as Single Characters 1.8.1 Problem You have a Unicode string that contains combining characters, and you'd like to treat each of these sequences as a single logical character. 1.8.2 Solution Process them using \X in a regular expression. 
$string = "fac\x{0327}ade";       # "façade"
$string =~ /fa.ade/;              # fails
$string =~ /fa\Xade/;             # succeeds

@chars = split(//, $string);      # 7 letters in @chars
@chars = $string =~ /(.)/g;       # same thing
@chars = $string =~ /(\X)/g;      # 6 "letters" in @chars

1.8.3 Discussion

In Unicode, you can combine a base character with one or more non-spacing characters following it; these are usually diacritics, such as accent marks, cedillas, and tildes. Due to the presence of precombined characters, mostly there to accommodate legacy character systems, there can be two or more ways of writing the same thing.

For example, the word "façade" can be written with one character between the two a's, "\x{E7}", a character right out of Latin1 (ISO 8859-1). Such a character might be encoded into a two-byte sequence under the UTF-8 encoding that Perl uses internally, but those two bytes still count as one single character. That works just fine.

There's a thornier issue. Another way to write U+00E7 is with two different code points: a regular "c" followed by "\x{0327}". Code point U+0327 is a non-spacing combining character that means to go back and put a cedilla underneath the preceding base character.

There are times when you want Perl to treat each combined character sequence as one logical character. But because they're distinct code points, Perl's character-related operations treat non-spacing combining characters as separate characters, including substr, length, and regular expression metacharacters, such as in /./ or /[^abc]/.
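The gap between code points and logical characters can be checked with a short script. This is our demonstration, assuming Perl 5.8 or later for \X support:

```perl
use strict;
use warnings;

my $string = "fac\x{0327}ade";       # "c" followed by COMBINING CEDILLA

my $codepoints = length($string);     # 7: the cedilla counts separately
my @graphemes  = $string =~ /(\X)/g;  # 6: the cedilla stays with its "c"

printf "%d code points, %d logical characters\n",
    $codepoints, scalar @graphemes;
```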
In a regular expression, the \X metacharacter matches an extended Unicode combining character sequence, and is exactly equivalent to (?:\PM\pM*) or, in long-hand:

(?x:        # begin non-capturing group
    \PM     # one character without the M (mark) property,
            #   such as a letter
    \pM     # one character that does have the M (mark) property,
            #   such as an accent mark
    *       # and you can have as many marks as you want
)

Otherwise simple operations become tricky if these beasties are in your string. Consider the approaches for reversing a word by character from the previous recipe. Written with combining characters, "année" and "niño" can be expressed in Perl as "anne\x{301}e" and "nin\x{303}o".

for $word ("anne\x{301}e", "nin\x{303}o") {
    printf "%s simple reversed to %s\n", $word,
        scalar reverse $word;
    printf "%s better reversed to %s\n", $word,
        join("", reverse $word =~ /\X/g);
}

That produces:

année simple reversed to éenna
année better reversed to eénna
niño simple reversed to õnin
niño better reversed to oñin

In the reversals marked as simply reversed, the diacritical marking jumped from one base character to the other. That's because a combining character always follows its base character, and you've reversed the whole string. By grabbing entire sequences of a base character plus any combining characters that follow, then reversing that list, this problem is avoided.

1.8.4 See Also

The perlre(1) and perluniintro(1) manpages; Chapter 15 of Programming Perl; Recipe 1.9

Recipe 1.9 Canonicalizing Strings with Unicode Combined Characters

1.9.1 Problem

You have two strings that look the same when you print them out, but they don't test as string equal and sometimes even have different lengths. How can you get Perl to consider them the same strings?
1.9.2 Solution

When you have otherwise equivalent strings, at least some of which contain Unicode combining character sequences, instead of comparing them directly, compare the results of running them through the NFD( ) function from the Unicode::Normalize module.

use Unicode::Normalize;
$s1 = "fa\x{E7}ade";
$s2 = "fac\x{0327}ade";
if (NFD($s1) eq NFD($s2)) { print "Yup!\n" }

1.9.3 Discussion

The same character sequence can sometimes be specified in multiple ways. Sometimes this is because of legacy encodings, such as the letters from Latin1 that contain diacritical marks. These can be specified directly with a single character (like U+00E7, LATIN SMALL LETTER C WITH CEDILLA) or indirectly via the base character (like U+0063, LATIN SMALL LETTER C) followed by a combining character (U+0327, COMBINING CEDILLA).

Another possibility is that you have two or more marks following a base character, but the order of those marks varies in your data. Imagine you wanted the letter "c" to have both a cedilla underneath and a caron on top, producing a single accented character. That could be specified in any of these ways:

$string = v231.780;     # LATIN SMALL LETTER C WITH CEDILLA
                        # COMBINING CARON

$string = v99.807.780;  # LATIN SMALL LETTER C
                        # COMBINING CEDILLA
                        # COMBINING CARON

$string = v99.780.807;  # LATIN SMALL LETTER C
                        # COMBINING CARON
                        # COMBINING CEDILLA

The normalization functions rearrange those into a reliable ordering. Several are provided, including NFD( ) for canonical decomposition and NFC( ) for canonical decomposition followed by canonical composition. No matter which of these three ways you used to specify that accented character, its NFD version is v99.807.780, whereas its NFC version is v231.780.

Sometimes you may prefer NFKD( ) and NFKC( ), which are like the previous two functions except that they perform compatibility decomposition, which for NFKC( ) is then followed by canonical composition. For example, \x{FB00} is the double-f ligature.
Its NFD and NFC forms are the same thing, "\x{FB00}", but its NFKD and NFKC forms return a two-character string, "\x{66}\x{66}". 1.9.4 See Also The Universal Character Code section at the beginning of this chapter; the documentation for the Unicode::Normalize module; Recipe 8.20 [ Team LiB ] [ Team LiB ] Recipe 1.10 Treating a Unicode String as Octets 1.10.1 Problem You have a Unicode string but want Perl to treat it as octets (e.g., to calculate its length or for purposes of I/O). 1.10.2 Solution The use bytes pragma makes all Perl operations in its lexical scope treat the string as a group of octets. Use it when your code is calling Perl's character-aware functions directly: $ff = "\x{FB00}"; # ff ligature $chars = length($ff); # length is one character { use bytes; # force byte semantics $octets = length($ff); # length is two octets } $chars = length($ff); # back to character semantics Alternatively, the Encode module lets you convert a Unicode string to a string of octets, and back again. Use it when the character-aware code isn't in your lexical scope: use Encode qw(encode_utf8); sub somefunc; # defined elsewhere $ff = "\x{FB00}"; # ff ligature $ff_oct = encode_utf8($ff); # convert to octets $chars = somefunc($ff); # work with character string $octets = somefunc($ff_oct); # work with octet string 1.10.3 Discussion As explained in this chapter's Introduction, Perl knows about two types of string: those made of simple uninterpreted octets, and those made of Unicode characters whose UTF-8 representation may require more than one octet. Each individual string has a flag associated with it, identifying the string as either UTF-8 or octets. Perl's I/O and string operations (such as length) check this flag and give character or octet semantics accordingly. Sometimes you need to work with bytes and not characters. For example, many protocols have a Content-Length header that specifies the size of the body of a message in octets. 
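As a sketch of the Content-Length situation just described, here is our example of counting the same string both ways; the specific byte counts assume the UTF-8 encoding:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $body = "caf\x{E9} \x{FB00}";            # "café" plus the ff ligature

my $chars  = length($body);                  # 6 characters
my $octets = length(encode_utf8($body));     # 9 octets in UTF-8

# a protocol header needs the byte count, not the character count
print "Content-Length: $octets\n";
```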
You can't simply use Perl's length function to calculate the size, because if the string you're calling length on is marked as UTF-8, you'll get the size in characters. The use bytes pragma makes all Perl functions in its lexical scope use octet semantics for strings instead of character semantics. Under this pragma, length always returns the number of octets, and read always reports the number of octets read. However, because the use bytes pragma is lexically scoped, you can't use it to change the behavior of code in another scope (e.g., someone else's function). For this you need to create an octet-encoded copy of the UTF-8 string. In memory, of course, the same byte sequence is used for both strings. The difference is that the copy of your UTF-8 string has the UTF-8 flag cleared. Functions acting on the octet copy will give octet semantics, regardless of the scope they're in. There is also a no bytes pragma, which forces character semantics, and a decode_utf8 function, which turns octet-encoded strings into UTF-8 encoded strings. However, these functions are less useful because not all octet strings are valid UTF-8 strings, whereas all UTF-8 strings are valid octet strings. 1.10.4 See Also The documentation for the bytes pragma; the documentation for the standard Encode module [ Team LiB ] [ Team LiB ] Recipe 1.11 Expanding and Compressing Tabs 1.11.1 Problem You want to convert tabs in a string to the appropriate number of spaces, or vice versa. Converting spaces into tabs can be used to reduce file size when the file has many consecutive spaces. Converting tabs into spaces may be required when producing output for devices that don't understand tabs or think them at different positions than you do. 
1.11.2 Solution Either use a rather funny looking substitution: while ($string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) { # spin in empty loop until substitution finally fails } or use the standard Text::Tabs module: use Text::Tabs; @expanded_lines = expand(@lines_with_tabs); @tabulated_lines = unexpand(@lines_without_tabs); 1.11.3 Discussion Assuming tab stops are set every N positions (where N is customarily eight), it's easy to convert them into spaces. The standard textbook method does not use the Text::Tabs module but suffers slightly from being difficult to understand. Also, it uses the $` variable, whose very mention currently slows down every pattern match in the program. This is explained in Special Variables in Chapter 6. You could use this algorithm to make a filter to expand its input's tabstops to eight spaces each: while (<>) { 1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e; print; } To avoid $`, you could use a slightly more complicated alternative that uses the numbered variables for explicit capture; this one expands tabstops to four each instead of eight: 1 while s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e; Another approach is to use the offsets directly from the @+ and @- arrays. This also expands to four-space positions: 1 while s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e; If you're looking at all of these 1 while loops and wondering why they couldn't have been written as part of a simple s///g instead, it's because you need to recalculate the length from the start of the line again each time rather than merely from where the last match occurred. The convention 1 while CONDITION is the same as while (CONDITION) { }, but shorter. Its origins date to when Perl ran the first incredibly faster than the second. While the second is now almost as fast, it remains convenient, and the habit has stuck. 
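The offset-based expansion can be wrapped into a small helper; expand_tabs is our illustrative name, not something from the recipe or from Text::Tabs.

```perl
use strict;
use warnings;

# Expand tabs to 8-column stops using the @- and @+ offset arrays,
# so the performance-sapping $` and $& variables are never touched.
sub expand_tabs {
    my $line = shift;
    1 while $line =~ s/\t+/' ' x (($+[0] - $-[0]) * 8 - $-[0] % 8)/e;
    return $line;
}

print expand_tabs("a\tbb\tc"), "\n";
```

Because the column arithmetic counts from the start of the string, the helper should be fed one line at a time, just like the filters shown above.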
The standard Text::Tabs module provides conversion functions to convert both directions, exports a $tabstop variable to control the number of spaces per tab, and does not incur the performance hit because it uses $1 and $2 rather than $& and $`. use Text::Tabs; $tabstop = 4; while (<>) { print expand($_) } We can also use Text::Tabs to "unexpand" the tabs. This example uses the default $tabstop value of 8: use Text::Tabs; while (<>) { print unexpand($_) } 1.11.4 See Also The manpage for the Text::Tabs module; the s/// operator in perlre(1) and perlop(1); the @- and @+ variables (@LAST_MATCH_START and @LAST_MATCH_END) in Chapter 28 of Programming Perl; the section on "When a global substitution just isn't global enough" in Chapter 5 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 1.12 Expanding Variables in User Input 1.12.1 Problem You've read a string with an embedded variable reference, such as: You owe $debt to me. Now you want to replace $debt in the string with its value. 1.12.2 Solution Use a substitution with symbolic references if the variables are all globals: $text =~ s/\$(\w+)/${$1}/g; But use a double /ee if they might be lexical (my) variables: $text =~ s/(\$\w+)/$1/gee; 1.12.3 Discussion The first technique is basically to find what looks like a variable name, then use symbolic dereferencing to interpolate its contents. If $1 contains the string somevar, ${$1} will be whatever $somevar contains. This won't work if the use strict 'refs' pragma is in effect because that bans symbolic dereferencing. Here's an example: our ($rows, $cols); no strict 'refs'; # for ${$1}/g below my $text; ($rows, $cols) = (24, 80); $text = q(I am $rows high and $cols long); # like single quotes! $text =~ s/\$(\w+)/${$1}/g; print $text; I am 24 high and 80 long You may have seen the /e substitution modifier used to evaluate the replacement as code rather than as a string. 
It's designed for situations where you don't know the exact replacement value, but you do know how to calculate it. For example, doubling every whole number in a string:

$text = "I am 17 years old";
$text =~ s/(\d+)/2 * $1/eg;

When Perl is compiling your program and sees a /e on a substitute, it compiles the code in the replacement block along with the rest of your program, long before the substitution actually happens. When a substitution is made, $1 is replaced with the string that matched. The code to evaluate would then be something like:

2 * 17

If we tried saying:

$text = 'I am $AGE years old';      # note single quotes
$text =~ s/(\$\w+)/$1/eg;           # WRONG

assuming $text held a mention of the variable $AGE, Perl would dutifully replace $1 with $AGE and then evaluate code that looked like:

'$AGE'

which just yields us our original string back again. We need to evaluate the result again to get the value of the variable. To do that, just add another /e:

$text =~ s/(\$\w+)/$1/eeg;  # finds my( ) variables

Yes, you can have as many /e modifiers as you'd like. Only the first one is compiled and syntax-checked with the rest of your program. This makes it work like the eval {BLOCK} construct, except that it doesn't trap exceptions. Think of it more as a do {BLOCK} instead.

Subsequent /e modifiers are quite different. They're more like the eval "STRING" construct. They don't get compiled until runtime. A small advantage of this scheme is that it doesn't require a no strict 'refs' pragma for the block. A tremendous advantage is that unlike symbolic dereferencing, this mechanism finds lexical variables created with my, something symbolic references can never do.

The following example uses the /x modifier to enable whitespace and comments in the pattern part of the substitute and /e to evaluate the righthand side as code.
The /e modifier gives more control over what happens in case of error or other extenuating circumstances, as we have here: # expand variables in $text, but put an error message in # if the variable isn't defined $text =~ s{ \$ # find a literal dollar sign (\w+) # find a "word" and store it in $1 }{ no strict 'refs'; # for $$1 below if (defined ${$1}) { ${$1}; # expand global variables only } else { "[NO VARIABLE: \$$1]"; # error msg } }egx; Once upon a time, long ago and far away, $$1 used to mean ${$}1 when it occurred within a string; that is, the $$ variable followed by a 1. This was grandfathered to work that way so you could more readily expand the $$ variable as your process ID to compose temporary filenames. It now always means ${$1}, i.e., dereference the contents of the $1 variable. We have written it the more explicit way for clarity, not correctness. 1.12.4 See Also The s/// operator in perlre(1) and perlop(1) and Chapter 5 of Programming Perl; the eval function in perlfunc(1) and Chapter 29 of Programming Perl; the similar use of substitutions in Recipe 20.9 [ Team LiB ] [ Team LiB ] Recipe 1.13 Controlling Case 1.13.1 Problem A string in uppercase needs converting to lowercase, or vice versa. 1.13.2 Solution Use the lc and uc functions or the \L and \U string escapes. $big = uc($little); # "bo peep" -> "BO PEEP" $little = lc($big); # "JOHN" -> "john" $big = "\U$little"; # "bo peep" -> "BO PEEP" $little = "\L$big"; # "JOHN" -> "john" To alter just one character, use the lcfirst and ucfirst functions or the \l and \u string escapes. $big = "\u$little"; # "bo" -> "Bo" $little = "\l$big"; # "BoPeep" -> "boPeep" 1.13.3 Discussion The functions and string escapes look different, but both do the same thing. You can set the case of either just the first character or the whole string. You can even do both at once to force uppercase (actually, titlecase; see later explanation) on initial characters and lowercase on the rest. 
$beast = "dromedary"; # capitalize various parts of $beast $capit = ucfirst($beast); # Dromedary $capit = "\u\L$beast"; # (same) $capall = uc($beast); # DROMEDARY $capall = "\U$beast"; # (same) $caprest = lcfirst(uc($beast)); # dROMEDARY $caprest = "\l\U$beast"; # (same) These capitalization-changing escapes are commonly used to make a string's case consistent: # titlecase each word's first character, lowercase the rest $text = "thIS is a loNG liNE"; $text =~ s/(\w+)/\u\L$1/g; print $text; This Is A Long Line You can also use these for case-insensitive comparison: if (uc($a) eq uc($b)) { # or "\U$a" eq "\U$b" print "a and b are the same\n"; } The randcap program, shown in Example 1-2, randomly titlecases 20 percent of the letters of its input. This lets you converse with 14-year-old WaREz d00Dz. Example 1-2. randcap #!/usr/bin/perl -p # randcap: filter to randomly capitalize 20% of the letters # call to srand( ) is unnecessary as of v5.4 BEGIN { srand(time( ) ^ ($$ + ($$<<15))) } sub randcase { rand(100) < 20 ? "\u$_[0]" : "\l$_[0]" } s/(\w)/randcase($1)/ge; % randcap < genesis | head -9 boOk 01 genesis 001:001 in the BEginning goD created the heaven and tHe earTh. 001:002 and the earth wAS without ForM, aND void; AnD darkneSS was upon The Face of the dEEp. and the spIrit of GOd movEd upOn tHe face of the Waters. 001:003 and god Said, let there be ligHt: and therE wAs LigHt. In languages whose writing systems distinguish between uppercase and titlecase, the ucfirst( ) function (and \u, its string escape alias) converts to titlecase. For example, in Hungarian the "dz" sequence occurs. In uppercase, it's written as "DZ", in titlecase as "Dz", and in lowercase as "dz". 
Unicode consequently has three different characters defined for these three situations:

Code point   Written   Meaning
01F1         DZ        LATIN CAPITAL LETTER DZ
01F2         Dz        LATIN CAPITAL LETTER D WITH SMALL LETTER Z
01F3         dz        LATIN SMALL LETTER DZ

It is tempting but ill-advised to just use tr[a-z][A-Z] or the like to convert case. This is a mistake because it omits all characters with diacritical markings—such as diaereses, cedillas, and accent marks—which are used in dozens of languages, including English. However, correctly handling case mappings on data with diacritical markings can be far trickier than it seems. There is no simple answer, although if everything is in Unicode, it's not all that bad, because Perl's case-mapping functions do work perfectly fine on Unicode data. See the section on The Universal Character Code in the Introduction to this chapter for more information.

1.13.4 See Also

The uc, lc, ucfirst, and lcfirst functions in perlfunc(1) and Chapter 29 of Programming Perl; \L, \U, \l, and \u string escapes in the "Quote and Quote-like Operators" section of perlop(1) and Chapter 5 of Programming Perl

Recipe 1.14 Properly Capitalizing a Title or Headline

1.14.1 Problem

You have a string representing a headline, the title of a book, or some other work that needs proper capitalization.

1.14.2 Solution

Use a variant of this tc( ) titlecasing function:

INIT {
    our %nocap;
    for (qw(
        a an the
        and but or
        as at by for from in into of off on onto per to with
    )) {
        $nocap{$_}++;
    }
}

sub tc {
    local $_ = shift;

    # put into lowercase if on stop list, else titlecase
    s/(\pL[\pL']*)/$nocap{$1} ?
                       lc($1) : ucfirst(lc($1))/ge;

    s/^(\pL[\pL']*)  /\u\L$1/x;     # first word guaranteed to cap
    s/ (\pL[\pL']*)$ /\u\L$1/x;     # last word guaranteed to cap

    # treat parenthesized portion as a complete title
    s/\( (\pL[\pL']*)  /(\u\L$1/x;
    s/ (\pL[\pL']*) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) (\pL[\pL']* ) /$1\u\L$2/x;

    return $_;
}

1.14.3 Discussion

The rules for correctly capitalizing a headline or title in English are more complex than simply capitalizing the first letter of each word. If that's all you need to do, something like this should suffice:

s/(\w+\S*\w*)/\u\L$1/g;

Most style guides tell you that the first and last words in the title should always be capitalized, along with every other word that's not an article, the particle "to" in an infinitive construct, a coordinating conjunction, or a preposition.

Here's a demo, this time demonstrating the distinguishing property of titlecase. Assume the tc function is as defined in the Solution.

# with apologies (or kudos) to Steven Brust, PJF,
# and to JRRT, as always.
@data = ( "the enchantress of \x{01F3}ur mountain", "meeting the enchantress of \x{01F3}ur mountain", "the lord of the rings: the fellowship of the ring", ); $mask = "%-20s: %s\n"; sub tc_lame { local $_ = shift; s/(\w+\S*\w*)/\u\L$1/g; return $_; } for $datum (@data) { printf $mask, "ALL CAPITALS", uc($datum); printf $mask, "no capitals", lc($datum); printf $mask, "simple titlecase", tc_lame($datum); printf $mask, "better titlecase", tc($datum); print "\n"; } ALL CAPITALS : THE ENCHANTRESS OF DZUR MOUNTAIN no capitals : the enchantress of dzur mountain simple titlecase : The Enchantress Of Dzur Mountain better titlecase : The Enchantress of Dzur Mountain ALL CAPITALS : MEETING THE ENCHANTRESS OF DZUR MOUNTAIN no capitals : meeting the enchantress of dzur mountain simple titlecase : Meeting The Enchantress Of Dzur Mountain better titlecase : Meeting the Enchantress of Dzur Mountain ALL CAPITALS : THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING no capitals : the lord of the rings: the fellowship of the ring simple titlecase : The Lord Of The Rings: The Fellowship Of The Ring better titlecase : The Lord of the Rings: The Fellowship of the Ring One thing to consider is that some style guides prefer capitalizing only prepositions that are longer than three, four, or sometimes five letters. O'Reilly & Associates, for example, keeps prepositions of four or fewer letters in lowercase. 
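The O'Reilly-style rule just mentioned, lowercasing only prepositions of four or fewer letters, can be sketched by filtering a stop list by length. The list and variable names here are our illustration, not code from the recipe:

```perl
use strict;
use warnings;

my @prepositions = qw(at by in of off on to up with into over under
                      through between toward without);

# keep only short prepositions (4 letters or fewer) on the stop list
my %nocap = map { $_ => 1 } grep { length($_) <= 4 } @prepositions;

print exists $nocap{of}      ? "of stays lowercase\n"       : "";
print exists $nocap{through} ? "" : "Through gets capitalized\n";
```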
Here's a longer list of prepositions if you prefer, which you can modify to your needs: @all_prepositions = qw{ about above absent across after against along amid amidst among amongst around as at athwart before behind below beneath beside besides between betwixt beyond but by circa down during ere except for from in into near of off on onto out over past per since than through till to toward towards under until unto up upon versus via with within without }; This kind of approach can take you only so far, though, because it doesn't distinguish between words that can be several parts of speech. Some prepositions on the list might also double as words that should always be capitalized, such as subordinating conjunctions, adverbs, or even adjectives. For example, it's "Down by the Riverside" but "Getting By on Just $30 a Day", or "A Ringing in My Ears" but "Bringing In the Sheaves". Another consideration is that you might prefer to apply the \u or ucfirst conversion by itself without also putting the whole string into lowercase. That way a word that's already in all capital letters, such as an acronym, doesn't lose that trait. You probably wouldn't want to convert "FBI" and "LBJ" into "Fbi" and "Lbj". 1.14.4 See Also The uc, lc, ucfirst, and lcfirst functions in perlfunc(1) and Chapter 29 of Programming Perl; the \L, \U, \l, and \u string escapes in the "Quote and Quote-like Operators" section of perlop(1) and Chapter 5 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 1.15 Interpolating Functions and Expressions Within Strings 1.15.1 Problem You want a function call or expression to expand within a string. This lets you construct more complex templates than with simple scalar variable interpolation. 1.15.2 Solution Break up your expression into distinct concatenated pieces: $answer = $var1 . func( ) . 
$var2; # scalar only Or use the slightly sneaky @{[ LIST EXPR ]} or ${ \(SCALAR EXPR ) } expansions: $answer = "STRING @{[ LIST EXPR ]} MORE STRING"; $answer = "STRING ${\( SCALAR EXPR )} MORE STRING"; 1.15.3 Discussion This code shows both techniques. The first line shows concatenation; the second shows the expansion trick: $phrase = "I have " . ($n + 1) . " guanacos."; $phrase = "I have ${\($n + 1)} guanacos."; The first technique builds the final string by concatenating smaller strings, avoiding interpolation but achieving the same end. Because print effectively concatenates its entire argument list, if we were going to print $phrase, we could have just said: print "I have ", $n + 1, " guanacos.\n"; When you absolutely must have interpolation, you need the punctuation-riddled interpolation from the Solution. Only @, $, and \ are special within double quotes and most backquotes. (As with m// and s///, the qx( ) synonym is not subject to double-quote expansion if its delimiter is single quotes! $home = qx'echo home is $HOME'; would get the shell $HOME variable, not one in Perl.) So, the only way to force arbitrary expressions to expand is by expanding a ${ } or @{ } whose block contains a reference. In the example: $phrase = "I have ${\( count_em( ) )} guanacos."; the function call within the parentheses is not in scalar context; it is still in list context. The following overrules that: $phrase = "I have ${\( scalar count_em( ) )} guanacos."; You can do more than simply assign to a variable after interpolation. It's a general mechanism that can be used in any double-quoted string. 
For instance, this example builds a string with an interpolated expression and passes the result to a function:

some_func("What you want is @{[ split /:/, $rec ]} items");

You can interpolate into a here document, as in:

die "Couldn't send mail" unless send_mail(<<"EOTEXT", $target);
To: $naughty
From: Your Bank
Cc: @{ get_manager_list($naughty) }
Date: @{[ do { my $now = `date`; chomp $now; $now } ]} (today)

Dear $naughty,

Today, you bounced check number @{[ 500 + int rand(100) ]} to us.
Your account is now closed.

Sincerely,
the management
EOTEXT

Expanding backquotes (``) is particularly challenging because you would normally end up with spurious newlines. By creating a braced block following the @ within the @{[ ]} anonymous array dereference, as in the last example, you can create private variables.

Although these techniques work, simply breaking your work up into several steps or storing everything in temporary variables is almost always clearer to the reader.

The Interpolation module from CPAN provides a more syntactically palatable covering. For example, to make the elements of the hash %E evaluate and return their subscript:

use Interpolation E => 'eval';
print "You bounced check number $E{500 + int rand(100)}\n";

Or to make a hash named %money call a suitably defined function of your choice:

use Interpolation money => \&currency_commify;
print "That will be $money{ 4 * $payment }, right now.\n";

expect to get something like:

That will be $3,232.42, right now.

1.15.4 See Also

perlref(1) and the "Other Tricks You Can Do with Hard References" section in Chapter 8 of Programming Perl; the Interpolation CPAN module

Recipe 1.16 Indenting Here Documents

1.16.1 Problem

When using the multiline quoting mechanism called a here document, the text must be flush against the margin, which looks out of place in the code. You would like to indent the here document text in the code, but not have the indentation appear in the final string value.
1.16.2 Solution

Use a s/// operator to strip out leading whitespace.

# all in one
($var = <<HERE_TARGET) =~ s/^\s+//gm;
    your text
    goes here
HERE_TARGET

# or with two steps
$var = <<HERE_TARGET;
    your text
    goes here
HERE_TARGET
$var =~ s/^\s+//gm;

1.16.3 Discussion

The substitution is straightforward. It removes leading whitespace from the text of the here document. The /m modifier lets the ^ character match at the start of each line in the string, and the /g modifier makes the pattern-matching engine repeat the substitution as often as it can (i.e., for every line in the here document).

($definition = <<'FINIS') =~ s/^\s+//gm;
    The five varieties of camelids
    are the familiar camel, his friends
    the llama and the alpaca, and the
    rather less well-known guanaco
    and vicuña.
FINIS

Be warned: all patterns in this recipe use \s, meaning one whitespace character, which will also match newlines. This means they will remove any blank lines in your here document. If you don't want this, replace \s with [^\S\n] in the patterns.

The substitution uses the property that the result of an assignment can be used as the lefthand side of =~. This lets us do it all in one line, but it works only when assigning to a variable. When you're using the here document directly, it would be considered a constant value, and you wouldn't be able to modify it. In fact, you can't change a here document's value unless you first put it into a variable.

Not to worry, though, because there's an easy way around this, particularly if you're going to do this a lot in the program. Just write a subroutine:

sub fix {
    my $string = shift;
    $string =~ s/^\s+//gm;
    return $string;
}

print fix(<<"END");
    My stuff goes here
END

# With function predeclaration, you can omit the parens:
print fix <<"END";
    My stuff goes here
END

As with all here documents, you have to place this here document's target (the token that marks its end, END in this case) flush against the lefthand margin.
To have the target indented also, you'll have to put the same amount of whitespace in the quoted string as you use to indent the token.

($quote = <<'    FINIS') =~ s/^\s+//gm;
    ...we will have peace, when you and all your works have
    perished--and the works of your dark master to whom you would
    deliver us. You are a liar, Saruman, and a corrupter of men's
    hearts. --Theoden in /usr/src/perl/taint.c
    FINIS
$quote =~ s/\s+--/\n--/;    # move attribution to a line of its own

If you're doing this to strings that contain code you're building up for an eval, or just text to print out, you might not want to blindly strip all leading whitespace, because that would destroy your indentation. Although eval wouldn't care, your reader might.

Another embellishment is to use a special leading string for code that stands out. For example, here we'll prepend each line with @@@, properly indented:

if ($REMEMBER_THE_MAIN) {
    $perl_main_C = dequote <<'    MAIN_INTERPRETER_LOOP';
        @@@ int
        @@@ runops( ) {
        @@@     SAVEI32(runlevel);
        @@@     runlevel++;
        @@@     while ( op = (*op->op_ppaddr)( ) ) ;
        @@@     TAINT_NOT;
        @@@     return 0;
        @@@ }
        MAIN_INTERPRETER_LOOP
    # add more code here if you want
}

Destroying indentation also gets you in trouble with poets.

sub dequote;
$poem = dequote <<EVER_ON_AND_ON;
    Now far ahead the Road has gone,
       And I must follow, if I can,
    Pursuing it with eager feet,
       Until it joins some larger way
    Where many paths and errands meet.
       And whither then? I cannot say.
          --Bilbo in /usr/src/perl/pp_ctl.c
EVER_ON_AND_ON
print "Here's your poem:\n\n$poem\n";

Here is its sample output:

Here's your poem:

Now far ahead the Road has gone,
   And I must follow, if I can,
Pursuing it with eager feet,
   Until it joins some larger way
Where many paths and errands meet.
   And whither then? I cannot say.
      --Bilbo in /usr/src/perl/pp_ctl.c

The following dequote function handles all these cases. It expects to be called with a here document as its argument.
It checks whether each line begins with a common substring, and if so, strips that off. Otherwise, it takes the amount of leading whitespace found on the first line and removes that much from each subsequent line.

sub dequote {
    local $_ = shift;
    my ($white, $leader);  # common whitespace and common leading string
    if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
        ($white, $leader) = ($2, quotemeta($1));
    } else {
        ($white, $leader) = (/^(\s+)/, '');
    }
    s/^\s*?$leader(?:$white)?//gm;
    return $_;
}

If that pattern makes your eyes glaze over, you could always break it up and add comments by adding /x:

if (m{
        ^                   # start of line
        \s *                # 0 or more whitespace chars
        (?:                 # begin first non-remembered grouping
           (                # begin save buffer $1
             [^\w\s]        # one character neither space nor word
             +              # 1 or more of such
           )                # end save buffer $1
           ( \s* )          # put 0 or more white in buffer $2
           .* \n            # match through the end of first line
        )                   # end of first grouping
        (?:                 # begin second non-remembered grouping
           \s *             # 0 or more whitespace chars
           \1               # whatever string is destined for $1
           \2 ?             # what'll be in $2, but optionally
           .* \n            # match through the end of the line
        ) +                 # now repeat that group idea 1 or more
        $                   # until the end of the line
     }x
   )
{
    ($white, $leader) = ($2, quotemeta($1));
} else {
    ($white, $leader) = (/^(\s+)/, '');
}

s{
    ^                       # start of each line (due to /m)
    \s *                    # any amount of leading whitespace
       ?                    # but minimally matched
    $leader                 # our quoted, saved per-line leader
    (?:                     # begin unremembered grouping
       $white               # the same amount
    ) ?                     # optionalize in case EOL after leader
}{}xgm;

There, isn't that much easier to read? Well, maybe not; sometimes it doesn't help to pepper your code with insipid comments that mirror the code. This may be one of those cases.
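As a quick sanity check, here is the compact dequote again in a self-contained snippet, exercised on both kinds of input it handles: lines sharing a non-word leader, and lines sharing only whitespace indentation. The sub is repeated verbatim so the snippet runs on its own:

```perl
use strict;
use warnings;

sub dequote {
    local $_ = shift;
    my ($white, $leader);  # common whitespace and common leading string
    if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
        ($white, $leader) = ($2, quotemeta($1));
    } else {
        ($white, $leader) = (/^(\s+)/, '');
    }
    s/^\s*?$leader(?:$white)?//gm;
    return $_;
}

# lines share the leader "@@@ "; dequote strips both indent and leader
my $code = dequote(join "", map "$_\n", '    @@@ int', '    @@@ runops()');

# lines share only whitespace; dequote strips the first line's indent amount
my $poem = dequote(join "", map "$_\n", '    Now far ahead', '    the Road has gone');

print $code, $poem;
```

The first call takes the if branch ($1 captures "@@@", $2 the space after it); the second falls through to the else branch, where $leader is empty and $white is the four-space indent.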
1.16.4 See Also The "Scalar Value Constructors" section of perldata(1) and the section on "Here Documents" in Chapter 2 of Programming Perl; the s/// operator in perlre(1) and perlop(1), and the "Pattern Matching" section in Chapter 5 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 1.17 Reformatting Paragraphs 1.17.1 Problem Your string is too big to fit the screen, and you want to break it up into lines of words, without splitting a word between lines. For instance, a style correction script might read a text file a paragraph at a time, replacing bad phrases with good ones. Replacing a phrase like utilizes the inherent functionality of with uses will change the length of lines, so it must somehow reformat the paragraphs when they're output. 1.17.2 Solution Use the standard Text::Wrap module to put line breaks at the right place: use Text::Wrap; @output = wrap($leadtab, $nexttab, @para); Or use the more discerning CPAN module, Text::Autoformat, instead: use Text::Autoformat; $formatted = autoformat $rawtext; 1.17.3 Discussion The Text::Wrap module provides the wrap function, shown in Example 1-3, which takes a list of lines and reformats them into a paragraph with no line more than $Text::Wrap::columns characters long. We set $columns to 20, ensuring that no line will be longer than 20 characters. We pass wrap two arguments before the list of lines: the first is the indent for the first line of output, the second the indent for every subsequent line. Example 1-3. wrapdemo #!/usr/bin/perl -w # wrapdemo - show how Text::Wrap works @input = ("Folding and splicing is the work of an editor,", "not a mere collection of silicon", "and", "mobile electrons!"); use Text::Wrap qw($columns &wrap); $columns = 20; print "0123456789" x 2, "\n"; print wrap(" ", " ", @input), "\n"; The result of this program is: 01234567890123456789 Folding and splicing is the work of an editor, not a mere collection of silicon and mobile electrons! 
We get back a single string, with newlines ending each line but the last: # merge multiple lines into one, then wrap one long line use Text::Wrap; undef $/; print wrap('', '', split(/\s*\n\s*/, <>)); If you have the Term::ReadKey module (available from CPAN) on your system, you can determine your window size so you can wrap lines to fit the current screen size. If you don't have the module, sometimes the screen size can be found in $ENV{COLUMNS} or by parsing the output of the stty(1) command. The following program tries to reformat both short and long lines within a paragraph, similar to the fmt(1) program, by setting the input record separator $/ to the empty string (causing <> to read paragraphs) and the output record separator $\ to two newlines. Then the paragraph is converted into one long line by changing all newlines and any surrounding whitespace to single spaces. Finally, we call the wrap function with leading and subsequent tab strings set to the empty string so we can have block paragraphs. use Text::Wrap qw(&wrap $columns); use Term::ReadKey qw(GetTerminalSize); ($columns) = GetTerminalSize( ); ($/, $\) = ('', "\n\n"); # read by paragraph, output 2 newlines while (<>) { # grab a full paragraph s/\s*\n\s*/ /g; # convert intervening newlines to spaces print wrap('', '', $_); # and format } The CPAN module Text::Autoformat is much more clever. For one thing, it tries to avoid "widows," that is, very short lines at the end. More remarkably, it correctly copes with reformatting paragraphs that have multiple, deeply nested citations. An example from that module's manpage shows how the module can painlessly convert: In comp.lang.perl.misc you wrote: : > writes: : > CN> PERL sux because: : > CN> * It doesn't have a switch statement and you have to put $ : > CN>signs in front of everything : > CN> * There are too many OR operators: having |, || and 'or' : > CN>operators is confusing : > CN> * VB rools, yeah!!!!!!!!! 
: > CN> So anyway, how can I stop reloads on a web page? : > CN> Email replies only, thanks - I don't read this newsgroup. : > : > Begone, sirrah! You are a pathetic, Bill-loving, microcephalic : > script-infant. : Sheesh, what's with this group - ask a question, get toasted! And how : *dare* you accuse me of Ianuphilia! into: In comp.lang.perl.misc you wrote: : > writes: : > CN> PERL sux because: : > CN> * It doesn't have a switch statement and you : > CN> have to put $ signs in front of everything : > CN> * There are too many OR operators: having |, || : > CN> and 'or' operators is confusing : > CN> * VB rools, yeah!!!!!!!!! So anyway, how can I : > CN> stop reloads on a web page? Email replies : > CN> only, thanks - I don't read this newsgroup. : > : > Begone, sirrah! You are a pathetic, Bill-loving, : > microcephalic script-infant. : Sheesh, what's with this group - ask a question, get toasted! : And how *dare* you accuse me of Ianuphilia! simply via print autoformat($badparagraph). Pretty impressive, eh? Here's a miniprogram that uses that module to reformat each paragraph of its input stream: use Text::Autoformat; $/ = ''; while (<>) { print autoformat($_, {squeeze => 0, all => 1}), "\n"; } 1.17.4 See Also The split and join functions in perlfunc(1) and Chapter 29 of Programming Perl; the manpage for the standard Text::Wrap module; the CPAN module Term::ReadKey, and its use in Recipe 15.6 and the CPAN module Text::Autoformat [ Team LiB ] [ Team LiB ] Recipe 1.18 Escaping Characters 1.18.1 Problem You need to output a string with certain characters (quotes, commas, etc.) escaped. For instance, you're producing a format string for sprintf and want to convert literal % signs into %%. 1.18.2 Solution Use a substitution to backslash or double each character to be escaped: # backslash $var =~ s/([CHARLIST])/\\$1/g; # double $var =~ s/([CHARLIST])/$1$1/g; 1.18.3 Discussion $var is the variable to be altered. 
The CHARLIST is a list of characters to escape and can contain backslash escapes like \t and \n. If you just have one character to escape, omit the brackets:

$string =~ s/%/%%/g;

The following code lets you do escaping when preparing strings to submit to the shell. (In practice, you would need to escape more than just ' and " to make any arbitrary string safe for the shell. Getting the list of characters right is so hard, and the risks if you get it wrong are so great, that you're better off using the list form of system and exec to run programs, shown in Recipe 16.2. They avoid the shell altogether.)

$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/\\$1/g;

We had to use two backslashes in the replacement because the replacement section of a substitution is read as a double-quoted string, and to get one backslash, you need to write two. Here's a similar example for VMS DCL, where you need to double every quote to get one through:

$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/$1$1/g;

Microsoft command interpreters are harder to work with. In Windows, COMMAND.COM recognizes double quotes but not single ones, disregards backquotes for running commands, and requires a backslash to make a double quote into a literal. Any of the many free or commercial Unix-like shell environments available for Windows will work just fine, though.

Because we're using character classes in the regular expressions, we can use - to define a range and ^ at the start to negate. This escapes all characters that aren't in the range A through Z.

$string =~ s/([^A-Z])/\\$1/g;

In practice, you wouldn't want to do that, since it would pick up a lowercase "a" and turn it into "\a", for example, which is the ASCII BEL character. (Usually when you mean non-alphabetic characters, \PL works better.) If you want to escape all non-word characters, use the \Q and \E string metacharacters or the quotemeta function.
For example, these are equivalent:

$string = "this \Qis a test!\E";
$string = "this is\\ a\\ test\\!";
$string = "this " . quotemeta("is a test!");

1.18.4 See Also

The s/// operator in perlre(1) and perlop(1) and Chapter 5 of Programming Perl; the quotemeta function in perlfunc(1) and Chapter 29 of Programming Perl; the discussion of HTML escaping in Recipe 19.1; Recipe 19.5 for how to avoid having to escape strings to give the shell

Recipe 1.19 Trimming Blanks from the Ends of a String

1.19.1 Problem

You have read a string that may have leading or trailing whitespace, and you want to remove it.

1.19.2 Solution

Use a pair of pattern substitutions to get rid of them:

$string =~ s/^\s+//;
$string =~ s/\s+$//;

Or write a function that returns the new value:

$string = trim($string);
@many   = trim(@many);

sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;           # trim left
        s/\s+$//;           # trim right
    }
    return @out == 1
              ? $out[0]     # only one to return
              : @out;       # or many
}

1.19.3 Discussion

This problem has various solutions, but this one is the most efficient for the common case. This function returns new versions of the strings passed in to it with their leading and trailing whitespace removed. It works on both single strings and lists.

To remove the last character from the string, use the chop function. Be careful not to confuse this with the similar but different chomp function, which removes the last part of the string contained within that variable if and only if it is contained in the $/ variable, "\n" by default. These are often used to remove the trailing newline from input:

# print what's typed, but surrounded by > < symbols
while (<STDIN>) {
    chomp;
    print ">$_<\n";
}

This function can be embellished in any of several ways. First, what should you do if several strings are passed in, but the return context demands a single scalar?
As written, the function given in the Solution does a somewhat silly thing: it (inadvertently) returns a scalar representing the number of strings passed in. This isn't very useful. You could issue a warning or raise an exception. You could also squash the list of return values together. For strings with spans of extra whitespace at points other than their ends, you could have your function collapse any remaining stretch of whitespace characters in the interior of the string down to a single space each by adding this line as the new last line of the loop: s/\s+/ /g; # finally, collapse middle That way a string like " but\t\tnot here\n" would become "but not here". A more efficient alternative to the three substitution lines: s/^\s+//; s/\s+$//; s/\s+/ /g; would be: $_ = join(' ', split(' ')); If the function isn't passed any arguments at all, it could act like chop and chomp by defaulting to $_. Incorporating all of these embellishments produces this function: # 1. trim leading and trailing white space # 2. collapse internal whitespace to single space each # 3. take input from $_ if no arguments given # 4. join return list into single scalar with intervening spaces # if return is scalar context sub trim { my @out = @_ ? @_ : $_; $_ = join(' ', split(' ')) for @out; return wantarray ? @out : "@out"; } 1.19.4 See Also The s/// operator in perlre(1) and perlop(1) and Chapter 5 of Programming Perl; the chomp and chop functions in perlfunc(1) and Chapter 29 of Programming Perl; we trim leading and trailing whitespace in the getnum function in Recipe 2.1 [ Team LiB ] [ Team LiB ] Recipe 1.20 Parsing Comma-Separated Data 1.20.1 Problem You have a data file containing comma-separated values that you need to read, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format. 
1.20.2 Solution

If your data file follows normal Unix quoting and escaping conventions, where quotes within a field are backslash-escaped "like \"this\"", use the standard Text::ParseWords module and this simple code:

use Text::ParseWords;

sub parse_csv0 {
    return quotewords("," => 0, $_[0]);
}

However, if quotes within a field are doubled "like ""this""", you could use the following procedure from Mastering Regular Expressions, Second Edition:

sub parse_csv1 {
    my $text = shift;       # record containing comma-separated values
    my @fields = ( );
    while ($text =~ m{
        # Either some non-quote/non-comma text:
        ( [^"',] + )
        # ...or...
        |
        # ...a double-quoted field: (with "" allowed inside)
        "                   # field's opening quote; don't save this
          (                 # now a field is either
            (?: [^"]        # non-quotes or
              | ""          # adjacent quote pairs
            ) *             # any number
          )
        "                   # field's closing quote; unsaved
    }gx)
    {
        if (defined $1) {
            $field = $1;
        } else {
            ($field = $2) =~ s/""/"/g;
        }
        push @fields, $field;
    }
    return @fields;
}

Or use the CPAN Text::CSV module:

use Text::CSV;

sub parse_csv1 {
    my $line = shift;
    my $csv = Text::CSV->new( );
    return $csv->parse($line) && $csv->fields( );
}

Or use the CPAN Tie::CSV_File module:

tie @data, "Tie::CSV_File", "data.csv";

for ($i = 0; $i < @data; $i++) {
    printf "Row %d (Line %d) is %s\n", $i, $i+1, "@{$data[$i]}";
    for ($j = 0; $j < @{$data[$i]}; $j++) {
        print "Column $j is <$data[$i][$j]>\n";
    }
}

1.20.3 Discussion

Comma-separated input is a deceptive and complex format. It sounds simple, but involves a fairly complex escaping system because the fields themselves can contain commas. This makes the pattern-matching solution complex and rules out a simple split /,/. Still worse, quoting and escaping conventions vary between Unix-style files and legacy systems. This incompatibility renders impossible any single algorithm for all CSV data files.

The standard Text::ParseWords module is designed to handle data whose quoting and escaping conventions follow those found in most Unix data files.
This makes it eminently suitable for parsing the numerous colon-separated data files found on Unix systems, including disktab (5), gettytab (5), printcap (5), and termcap (5). Pass that module's quotewords function two arguments and the CSV string. The first argument is the separator (here a comma, but often a colon), and the second is a true or false value controlling whether the strings are returned with quotes around them. In this style of data file, you represent quotation marks inside a field delimited by quotation marks by escaping them with backslashes "like\"this\" ". Quotation marks and backslashes are the only characters that have meaning when backslashed. Any other use of a backslash will be left in the output string. The standard Text::ParseWords module's quotewords( ) function can handle such data. However, it's of no use on data files from legacy systems that represent quotation marks inside such a field by doubling them "like""this"" ". For those, you'll need one of the other solutions. The first of these is based on the regular expression from Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O'Reilly). It enjoys the advantage of working on any system without requiring installation of modules not found in the standard distribution. In fact, it doesn't use any modules at all. Its slight disadvantage is the risk of sending the unseasoned reader into punctuation shock, despite its copious commenting. The object-oriented CPAN module Text::CSV demonstrated in the next solution hides that parsing complexity in more easily digestible wrappers. An even more elegant solution is offered by the Tie::CSV_File module from CPAN, in which you are given what appears to be a two- dimensional array. The first dimension represents each line of the file, and the second dimension each column on each row. Here's how you'd use our two kinds of parse_csv subroutines. The q( ) is just a fancy quote so we didn't have to backslash everything. 
$line = q(XYZZY,"","O'Reilly, Inc","Wall, Larry","a \"glug\" bit,",5,"Error, Core Dumped"); @fields = parse_csv0($line); for ($i = 0; $i < @fields; $i++) { print "$i : $fields[$i]\n"; } 0 : XYZZY 1 : 2 : O'Reilly, Inc 3 : Wall, Larry 4 : a "glug" bit, 5 : 5 6 : Error, Core Dumped If the second argument to quotewords had been 1 instead of 0, the quotes would have been retained, producing this output instead: 0 : XYZZY 1 : "" 2 : "O'Reilly, Inc" 3 : "Wall, Larry" 4 : "a \"glug\" bit," 5 : 5 6 : "Error, Core Dumped" The other sort of data file is manipulated the same way, but using our parse_csv1 function instead of parse_csv0 . Notice how the embedded quotes are doubled, not escaped. $line = q(Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K); @fields = parse_csv1($line); for ($i = 0; $i < @fields; $i++) { print "$i : $fields[$i]\n"; } 0 : Ten Thousand 1 : 10000 2 : 2710 3 : 4 : 10,000 5 : It's "10 Grand", baby 6 : 10K 1.20.4 See Also The explanation of regular expression syntax in perlre (1) and Chapter 5 of Programming Perl ; the documentation for the standard Text::ParseWords module; the section on "Parsing CSV Files" in Chapter 5 of Mastering Regular Expressions , Second Edition [ Team LiB ] [ Team LiB ] Recipe 1.21 Constant Variables 1.21.1 Problem You want a variable whose value cannot be modified once set. 
1.21.2 Solution If you don't need it to be a scalar variable that can interpolate, the use constant pragma will work: use constant AVOGADRO => 6.02252e23; printf "You need %g of those for guac\n", AVOGADRO; If it does have to be a variable, assign to the typeglob a reference to a literal string or number, then use the scalar variable: *AVOGADRO = \6.02252e23; print "You need $AVOGADRO of those for guac\n"; But the most foolproof way is via a small tie class whose STORE method raises an exception: package Tie::Constvar; use Carp; sub TIESCALAR { my ($class, $initval) = @_; my $var = $initval; return bless \$var => $class; } sub FETCH { my $selfref = shift; return $$selfref; } sub STORE { confess "Meddle not with the constants of the universe"; } 1.21.3 Discussion The use constant pragma is the easiest to use, but has a few drawbacks. The biggest one is that it doesn't give you a variable that you can expand in double-quoted strings. Another is that it isn't scoped; it puts a subroutine of that name into the package namespace. The way the pragma really works is to create a subroutine of that name that takes no arguments and always returns the same value (or values if a list is provided). That means it goes into the current package's namespace and isn't scoped. You could do the same thing yourself this way: sub AVOGADRO( ) { 6.02252e23 } If you wanted it scoped to the current block, you could make a temporary subroutine by assigning an anonymous subroutine to the typeglob of that name: use subs qw(AVOGADRO); local *AVOGADRO = sub ( ) { 6.02252e23 }; But that's pretty magical, so you should comment the code if you don't plan to use the pragma. If instead of assigning to the typeglob a reference to a subroutine, you assign to it a reference to a constant scalar, then you'll be able to use the variable of that name. That's the second technique given in the Solution. 
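That second technique is easy to demonstrate. In this minimal standalone sketch, the aliased scalar reads and interpolates normally, while any attempt to store into it dies at runtime with a "Modification of a read-only value" error, trapped here with eval just to show it:

```perl
use strict;
use warnings;

our $AVOGADRO;                  # declare the package variable for strict
*AVOGADRO = \6.02252e23;        # alias it to a read-only literal

print "You need $AVOGADRO of those for guac\n";

eval { $AVOGADRO = 6.6256e-34 };     # any store attempt is a runtime error
print "store blocked: $@" if $@;
```

Note the error arises only when the assignment actually runs, unlike use constant, whose misuse can sometimes be caught at compile time.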
Its disadvantage is that typeglobs are available only for package variables, not for lexicals created via my. Under the recommended use strict pragma, an undeclared package variable will get you into trouble, too, but you can declare the variable using our: our $AVOGADRO; local *AVOGADRO = \6.02252e23; The third solution provided, that of creating your own little tie class, might appear the most complicated, but it provides the most flexibility. Plus you get to declare it as a lexical if you want. tie my $AVOGADRO, Tie::Constvar, 6.02252e23; After which this is okay: print "You need $AVOGADRO of those for guac\n"; But this will get you in trouble: $AVOGADRO = 6.6256e-34; # sorry, Max 1.21.4 See Also Recipe 1.15; Recipe 5.3; the discussion on folding constant subroutines toward the end of the section on "Compiling Your Code" in Chapter 18 of Programming Perl; the CPAN module Tie::Scalar::RestrictUpdates might give you some other ideas [ Team LiB ] [ Team LiB ] Recipe 1.22 Soundex Matching 1.22.1 Problem You have two English surnames and want to know whether they sound somewhat similar, regardless of spelling. This would let you offer users a "fuzzy search" of names in a telephone book to catch "Smith" and "Smythe" and others within the set, such as "Smite" and "Smote". 1.22.2 Solution Use the standard Text::Soundex module: use Text::Soundex; $CODE = soundex($STRING); @CODES = soundex(@LIST); Or use the CPAN module Text::Metaphone: use Text::Metaphone; $phoned_words = Metaphone('Schwern'); 1.22.3 Discussion The soundex algorithm hashes words (particularly English surnames) into a small space using a simple model that approximates an English speaker's pronunciation of the words. Roughly speaking, each word is reduced to a four-character string. The first character is an uppercase letter; the remaining three are digits. By comparing the soundex values of two strings, we can guess whether they sound similar. 
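To make that reduction concrete, here is a small pure-Perl sketch of the classic algorithm. It is a simplification for illustration (it omits the official rule that "H" and "W" do not separate duplicate codes, which does not affect these examples); for real work use the standard Text::Soundex module:

```perl
use strict;
use warnings;

sub soundex_lite {
    my $name = uc shift;
    $name =~ tr/A-Z//cd;                    # keep letters only
    return "" unless length $name;
    my %code = map { my ($letters, $digit) = @$_;
                     map { ($_ => $digit) } split //, $letters }
               [ BFPV => 1 ], [ CGJKQSXZ => 2 ], [ DT => 3 ],
               [ L => 4 ], [ MN => 5 ], [ R => 6 ];
    # vowels (and H, W, Y in this sketch) become 0 and act as separators
    my @digits = map { $code{$_} || 0 } split //, $name;
    my @kept   = ($digits[0]);
    for my $d (@digits[1 .. $#digits]) {    # drop adjacent duplicate codes
        push @kept, $d unless $d == $kept[-1];
    }
    shift @kept;                            # first letter stands for itself
    my $tail = join "", grep { $_ } @kept;  # discard the separators
    return substr($name, 0, 1) . substr($tail . "000", 0, 3);
}

print soundex_lite("Smith"), " ", soundex_lite("Smythe"), "\n";
```

Run on the names used later in this recipe, it produces the same codes as Text::Soundex: "Smith" and "Smythe" both hash to S530, and "Christiansen" to C623.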
The following program prompts for a name and looks for similarly sounding names from the password file. This same approach works on any database with names, so you could key the database on the soundex values if you wanted to. Such a key wouldn't be unique, of course.

use Text::Soundex;
use User::pwent;

print "Lookup user: ";
chomp($user = <STDIN>);
exit unless defined $user;
$name_code = soundex($user);

while ($uent = getpwent( )) {
    ($firstname, $lastname) = $uent->gecos =~ /(\w+)[^,]*\b(\w+)/;

    if ($name_code eq soundex($uent->name) ||
        $name_code eq soundex($lastname)   ||
        $name_code eq soundex($firstname))
    {
        printf "%s: %s %s\n", $uent->name, $firstname, $lastname;
    }
}

The Text::Metaphone module from CPAN addresses the same problem in a different and better way. The soundex function returns a letter and a three-digit code that maps just the beginning of the input string, whereas Metaphone returns a code as letters of variable length. For example:

                    soundex    metaphone
Christiansen        C623       KRSXNSN
Kris Jenson         K625       KRSJNSN

Kyrie Eleison       K642       KRLSN
Curious Liaison     C624       KRSLSN

To get the most of Metaphone, you should also use the String::Approx module from CPAN, described more fully in Recipe 6.13. It allows for there to be errors in the match and still be successful. The edit distance is the number of changes needed to go from one string to the next. This matches a pair of strings with an edit distance of two:

if (amatch("string1", [2], "string2")) { }

There's also an adist function that reports the edit distance. The edit distance between "Kris Jenson" and "Christiansen" is 6, but between their Metaphone encodings is only 1. Likewise, the distance between the other pair is 8 originally, but down to 1 again if you compare Metaphone encodings.

use Text::Metaphone qw(Metaphone);
use String::Approx  qw(amatch);

if (amatch(Metaphone($s1), [1], Metaphone($s2))) {
    print "Close enough!\n";
}

This would successfully match both of our example pairs.
1.22.4 See Also

The documentation for the standard Text::Soundex and User::pwent modules; the Text::Metaphone and String::Approx modules from CPAN; your system's passwd(5) manpage; Volume 3, Chapter 6 of The Art of Computer Programming, by Donald E. Knuth (Addison-Wesley)

Recipe 1.23 Program: fixstyle

Imagine you have a table with both old and new strings, such as the following:

Old words    New words
bonnet       hood
rubber       eraser
lorry        truck
trousers     pants

The program in Example 1-4 is a filter that changes all occurrences of each element in the first set to the corresponding element in the second set. When called without filename arguments, the program is a simple filter. If filenames are supplied on the command line, an in-place edit writes the changes to the files, with the original versions saved in a file with a ".orig" extension. See Recipe 7.16 for a description. A -v command-line option writes notification of each change to standard error.

The table of original strings and their replacements is stored below __END__ in the main program, as described in Recipe 7.12. Each pair of strings is converted into carefully escaped substitutions and accumulated into the $code variable like the popgrep2 program in Recipe 6.10.

A -t check to test for an interactive run tells whether we're expecting to read from the keyboard if no arguments are supplied. That way if users forget to give an argument, they aren't wondering why the program appears to be hung.

Example 1-4. fixstyle

#!/usr/bin/perl -w
# fixstyle - switch first set of strings to second set
# usage: $0 [-v] [files ...]
use strict;
my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);

if (@ARGV) {
    $^I = ".orig";          # preserve old files
} else {
    warn "$0: Reading from stdin\n" if -t STDIN;
}

my $code = "while (<>) {\n";

# read in config, build up code to eval
while (<DATA>) {
    chomp;
    my ($in, $out) = split /\s*=>\s*/;
    next unless $in && $out;
    $code .= "s{\\Q$in\\E}{$out}g";
    $code .= "&& printf STDERR qq($in => $out at \$ARGV line \$.\\n)"
                                                        if $verbose;
    $code .= ";\n";
}
$code .= "print;\n}\n";

eval "{ $code } 1" || die;

__END__
analysed        => analyzed
built-in        => builtin
chastized      => chastised
commandline     => command-line
de-allocate     => deallocate
dropin          => drop-in
hardcode        => hard-code
meta-data       => metadata
multicharacter  => multi-character
multiway        => multi-way
non-empty       => nonempty
non-profit      => nonprofit
non-trappable   => nontrappable
pre-define      => predefine
preextend       => pre-extend
re-compiling    => recompiling
reenter         => re-enter
turnkey         => turn-key

One caution: this program is fast, but it doesn't scale if you need to make hundreds of changes. The larger the DATA section, the longer it takes. A few dozen changes won't slow it down, and in fact, the version given in Example 1-4 is faster for that case. But if you run the program on hundreds of changes, it will bog down. Example 1-5 is a version that's slower for few changes but faster when there are many changes.

Example 1-5. fixstyle2

#!/usr/bin/perl -w
# fixstyle2 - like fixstyle but faster for many many changes
use strict;
my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);
my %change = ( );
while (<DATA>) {
    chomp;
    my ($in, $out) = split /\s*=>\s*/;
    next unless $in && $out;
    $change{$in} = $out;
}

if (@ARGV) {
    $^I = ".orig";
} else {
    warn "$0: Reading from stdin\n" if -t STDIN;
}

while (<>) {
    my $i = 0;
    s/^(\s+)// && print $1;         # emit leading whitespace
    for (split /(\s+)/, $_, -1) {   # preserve trailing whitespace
        print( ($i++ & 1) ? $_ : ($change{$_} || $_) );
    }
}

__END__
analysed        => analyzed
built-in        => builtin
chastized      => chastised
commandline     => command-line
de-allocate     => deallocate
dropin          => drop-in
hardcode        => hard-code
meta-data       => metadata
multicharacter  => multi-character
multiway        => multi-way
non-empty       => nonempty
non-profit      => nonprofit
non-trappable   => nontrappable
pre-define      => predefine
preextend       => pre-extend
re-compiling    => recompiling
reenter         => re-enter
turnkey         => turn-key

This version breaks each line into chunks of whitespace and words, which isn't a fast operation. It then uses those words to look up their replacements in a hash, which is much faster than a substitution. So the first part is slower, the second faster. The difference in speed depends on the number of matches.

If you don't care about keeping the whitespace separating each word constant, the second version can run as fast as the first, even for a few changes. If you know a lot about your input, collapse whitespace into single blanks by plugging in this loop:

# very fast, but whitespace collapse
while (<>) {
    for (split) {
        print $change{$_} || $_, " ";
    }
    print "\n";
}

That leaves an extra blank at the end of each line. If that's a problem, you could use the technique from Recipe 16.5 to install an output filter. Place the following code in front of the while loop that's collapsing whitespace:

my $pid = open(STDOUT, "|-");
die "cannot fork: $!" unless defined $pid;
unless ($pid) {             # child
    while (<STDIN>) {
        s/ $//;
        print;
    }
    exit;
}

Recipe 1.24 Program: psgrep

Many programs, including ps, netstat, lsof, ls -l, find -ls, and tcpdump, can produce more output than can be conveniently summarized. Logfiles also often grow too long to be easily viewed. You could send these through a filter like grep to pick out only certain lines, but regular expressions and complex logic don't mix well; just look at the hoops we jump through in Recipe 6.18.
What we'd really like is to make full queries on the program output or logfile. For example, to ask ps something like, "Show me all processes that exceed 10K in size but which aren't running as the superuser" or "Which commands are running on pseudo-ttys?"

The psgrep program does this—and infinitely more—because the specified selection criteria are not mere regular expressions; they're full Perl code. Each criterion is applied in turn to every line of output. Only lines matching all arguments are output. The following is a list of things to find and how to find them.

Lines containing "sh" at the end of a word:

% psgrep '/sh\b/'

Processes whose command names end in "sh":

% psgrep 'command =~ /sh$/'

Processes running with a user ID below 10:

% psgrep 'uid < 10'

Login shells with active ttys:

% psgrep 'command =~ /^-/' 'tty ne "?"'

Processes running on pseudo-ttys:

% psgrep 'tty =~ /^[p-t]/'

Non-superuser processes running detached:

% psgrep 'uid && tty eq "?"'

Huge processes that aren't owned by the superuser:

% psgrep 'size > 10 * 2**10' 'uid != 0'

The last call to psgrep produced the following output when run on our system. As one might expect, only netscape and its spawn qualified.

 FLAGS  UID   PID  PPID PRI NI  SIZE  RSS WCHAN     STA TTY TIME COMMAND
     0  101  9751     1   0  0 14932 9652 do_select S   p1  0:25 netscape
100000  101  9752  9751   0  0 10636  812 do_select S   p1  0:00 (dns helper)

Example 1-6 shows the psgrep program.

Example 1-6. psgrep

#!/usr/bin/perl -w
# psgrep - print selected lines of ps output by
#          compiling user queries into code

use strict;

# each field from the PS header
my @fieldnames = qw(FLAGS UID PID PPID PRI NICE SIZE RSS WCHAN STAT TTY
                    TIME COMMAND);

# determine the unpack format needed (hard-coded for Linux ps)
my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);

my %fields;                 # where the data will be stored

die <<Thanatos unless @ARGV;
usage: $0 criterion ...
Each criterion is a Perl expression involving: @fieldnames All criteria must be met for a line to be printed. Thanatos # Create function aliases for uid, size, UID, SIZE, etc. # Empty parens on closure args needed for void prototyping. for my $name (@fieldnames) { no strict 'refs'; *$name = *{lc $name} = sub ( ) { $fields{$name} }; } my $code = "sub is_desirable { " . join(" and ", @ARGV) . " } "; unless (eval $code.1) { die "Error in code: $@\n\t$code\n"; } open(PS, "ps wwaxl |") || die "cannot fork: $!"; print scalar ; # emit header line while () { @fields{@fieldnames} = trim(unpack($fmt, $_)); print if is_desirable( ); # line matches their criteria } close(PS) || die "ps failed!"; # convert cut positions to unpack format sub cut2fmt { my(@positions) = @_; my $template = ''; my $lastpos = 1; for my $place (@positions) { $template .= "A" . ($place - $lastpos) . " "; $lastpos = $place; } $template .= "A*"; return $template; } sub trim { my @strings = @_; for (@strings) { s/^\s+//; s/\s+$//; } return wantarray ? @strings : $strings[0]; } # the following was used to determine column cut points. # sample input data follows #123456789012345678901234567890123456789012345678901234567890123456789012345 # 1 2 3 4 5 6 7 # Positioning: # 8 14 20 26 30 34 41 47 59 63 67 72 # | | | | | | | | | | | | _ _END_ _ FLAGS UID PID PPID PRI NI SIZE RSS WCHAN STA TTY TIME COMMAND 100 0 1 0 0 0 760 432 do_select S ? 0:02 init 140 0 187 1 0 0 784 452 do_select S ? 0:02 syslogd 100100 101 428 1 0 0 1436 944 do_exit S 1 0:00 /bin/login 100140 99 30217 402 0 0 1552 1008 posix_lock_ S ? 
0:00 httpd 0 101 593 428 0 0 1780 1260 copy_thread S 1 0:00 -tcsh 100000 101 30639 9562 17 0 924 496 R p1 0:00 ps axl 0 101 25145 9563 0 0 2964 2360 idetape_rea S p2 0:06 trn 100100 0 10116 9564 0 0 1412 928 setup_frame T p3 0:00 ssh -C www 100100 0 26560 26554 0 0 1076 572 setup_frame T p2 0:00 less 100000 101 19058 9562 0 0 1396 900 setup_frame T p1 0:02 nvi /tmp/a The psgrep program integrates many techniques presented throughout this book. Stripping strings of leading and trailing whitespace is found in Recipe 1.19. Converting cut marks into an unpack format to extract fixed fields is in Recipe 1.1. Matching strings with regular expressions is the entire topic of Chapter 6. The multiline string in the here document passed to die is discussed in Recipe 1.15 and Recipe 1.16. The assignment to @fields{@fieldnames} sets many values at once in the hash named %fields. Hash slices are discussed in Recipe 4.8 and Recipe 5.11. The sample program input contained beneath __END__ is described in Recipe 7.12. During development, we used canned input from the DATA filehandle for testing purposes. Once the program worked properly, we changed it to read from a piped-in ps command but left a remnant of the original filter input to aid in future porting and maintenance. Launching other programs over a pipe is covered in Chapter 16, including Recipe 16.10 and Recipe 16.13. The real power and expressiveness in psgrep derive from Perl's use of string arguments not as mere strings but directly as Perl code. This is similar to the technique in Recipe 9.9, except that in psgrep, the user's arguments are wrapped with a routine called is_desirable. That way, the cost of compiling strings into Perl code happens only once, before the program whose output we'll process is even begun. For example, asking for UIDs under 10 creates this string to eval: eval "sub is_desirable { uid < 10 } " . 
1; The mysterious ".1" at the end is so that if the user code compiles, the whole eval returns true. That way we don't even have to check $@ for compilation errors as we do in Recipe 10.12. Specifying arbitrary Perl code in a filter to select records is a breathtakingly powerful approach, but it's not entirely original. Perl owes much to the awk programming language, which is often used for such filtering. One problem with awk is that it can't easily treat input as fixed-size fields instead of fields separated by something. Another is that the fields are not mnemonically named: awk uses $1, $2, etc. Plus, Perl can do much that awk cannot. The user criteria don't even have to be simple expressions. For example, this call initializes a variable $id to user nobody's number to use later in its expression: % psgrep 'no strict "vars"; BEGIN { $id = getpwnam("nobody") } uid == $id' How can we use unquoted words without even a dollar sign, like uid, command, and size, to represent those respective fields in each input record? We directly manipulate the symbol table by assigning closures to indirect typeglobs, which creates functions with those names. The function names are created using both uppercase and lowercase names, allowing both "UID < 10" and "uid < 10". Closures are described in Recipe 11.4, and assigning them to typeglobs to create function aliases is shown in Recipe 10.14. One twist here not seen in those recipes is empty parentheses on the closure. These allow us to use the function in an expression anywhere we'd use a single term, like a string or a numeric constant. It creates a void prototype so the field-accessing function named uid accepts no arguments, just like the built-in function time. If these functions weren't prototyped void, expressions like "uid < 10" or "size/2 > rss" would confuse the parser because it would see the unterminated start of a wildcard glob and of a pattern match, respectively. 
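The symbol-table trick just described can be seen in miniature below. The %fields data here is a made-up stand-in for illustration, not psgrep's own; the point is how one closure gets installed under both lowercase and uppercase names, and how the empty prototype lets a call read like a plain term.

```perl
#!/usr/bin/perl -w
# Minimal sketch of aliasing closures into the symbol table.
# The %fields values are hypothetical, for demonstration only.
use strict;

my %fields = ( UID => 42, SIZE => 1024 );

for my $name (keys %fields) {
    no strict 'refs';
    # Empty parens give a void prototype, so the accessor can act
    # as a simple term; *$name aliases the uppercase spelling to
    # the same code installed under the lowercase name.
    *$name = *{lc $name} = sub () { $fields{$name} };
}

print uid(), "\n";      # 42
print SIZE(), "\n";     # 1024
print "small uid\n" if uid() < 100;
```

Because the closures here are installed at runtime rather than predeclared, the calls use explicit parentheses; in psgrep the user's criteria are compiled by eval after the aliases exist, so bare `uid` parses cleanly there.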
Prototypes are discussed in Recipe 10.11. The version of psgrep demonstrated here expects the output from Red Hat Linux's ps. To port to other systems, look at which columns the headers begin at. This approach isn't relevant only to ps or only to Unix systems; it's a generic technique for filtering input records using Perl expressions, easily adapted to other record layouts. The input format could be in columns, space separated, comma separated, or the result of a pattern match with capturing parentheses. The program could even be modified to handle a user-defined database with a small change to the selection functions. If you had an array of records as described in Recipe 11.9, you could let users specify arbitrary selection criteria, such as: sub id( ) { $_->{ID} } sub title( ) { $_->{TITLE} } sub executive( ) { title =~ /(?:vice-)?president/i } # user search criteria go in the grep clause @slowburners = grep { id<10 && !executive } @employees; For reasons of security and performance, this kind of power is seldom found in database engines like those described in Chapter 14. SQL doesn't support this, but given Perl and a small bit of ingenuity, it's easy to roll your own. Chapter 2. Numbers Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin. —John von Neumann (1951) Introduction Numbers, the most basic data type of almost any programming language, can be surprisingly tricky. Random numbers, numbers with decimal points, series of numbers, and conversion between strings and numbers all pose trouble. Perl works hard to make life easy for you, and the facilities it provides for manipulating numbers are no exception to that rule. If you treat a scalar value as a number, Perl converts it to one. 
This means that when you read ages from a file, extract digits from a string, or acquire numbers from any of the other myriad textual sources that Real Life pushes your way, you don't need to jump through the hoops created by other languages' cumbersome requirements to turn an ASCII string into a number. Perl tries its best to interpret a string as a number when you use it as one (such as in a mathematical expression), but it has no direct way of reporting that a string doesn't represent a valid number. Perl quietly converts non-numeric strings to zero, and it will stop converting the string once it reaches a non-numeric character—so "A7" is still 0, and "7A" is just 7. (Note, however, that the -w flag will warn of such improper conversions.) Sometimes, such as when validating input, you need to know whether a string represents a valid number. We show you how in Recipe 2.1. Recipe 2.15 shows how to get a number from strings containing hexadecimal, octal, or binary representations of numbers such as "0xff", "0377", and "0b10110". Perl automatically converts numeric literals of these non-decimal bases that occur in your program code (so $a = 3 + 0xff will set $a to 258) but not data read by that program (you can't read "ff" or even "0xff" into $b and then say $a = 3 + $b to make $a become 258). As if integers weren't giving us enough grief, floating-point numbers can cause even more headaches. Internally, a computer represents numbers with decimal points as floating-point numbers in binary format. Floating-point numbers are not the same as real numbers; they are an approximation of real numbers, with limited precision. Although infinitely many real numbers exist, you only have finite space to represent them, usually about 64 bits or so. You have to cut corners to fit them all in. 
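Before moving on to floating point, the string-to-number rules above can be sketched concretely. The literals below are arbitrary examples; note that the hex-in-data case is exactly the one Recipe 2.15 addresses with oct.

```perl
#!/usr/bin/perl
# Perl's implicit string-to-number conversion stops at the first
# non-numeric character; a string with no leading number becomes 0.
print 0 + "7A", "\n";           # 7   (conversion stops at the 'A')
print 0 + "A7", "\n";           # 0   (no leading digits at all)
print 3 + 0xff, "\n";           # 258 (hex literal in source code works)
print 3 + "0xff", "\n";         # 3   (hex arriving as data does not)
print 3 + oct("0xff"), "\n";    # 258 (oct decodes hex/octal/binary strings)
```

Run this without -w to see the values alone; with -w, the "isn't numeric" warnings the text mentions appear on stderr for the first, second, and fourth lines.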
When numbers are read from a file or appear as literals in your program, they are converted from their textual representation—which is always in base 10 for numbers with decimal points in them—into an internal, base-2 representation. The only fractional numbers that can be exactly represented using a finite number of digits in a particular numeric base are those that can be written as the sum of a finite number of fractions whose denominators are integral powers of that base. For example, 0.13 is one tenth plus three one-hundredths. But that's in base-10 notation. In binary, something like 0.75 is exactly representable because it's the sum of one half plus one quarter, and 2 and 4 are both powers of two. But even so simple a number as one tenth, written as 0.1 in base-10 notation, cannot be rewritten as the sum of some set of halves, quarters, eighths, sixteenths, etc. That means that, just as one third can't be exactly represented as a non-repeating decimal number, one tenth can't be exactly represented as a non-repeating binary number. Your computer's internal binary representation of 0.1 isn't exactly 0.1; it's just an approximation! $ perl -e 'printf "%.60f\n", 0.1' 0.100000000000000005551115123125782702118158340454101562500000 Recipe 2.2 and Recipe 2.3 demonstrate how to make your computer's floating-point representations behave more like real numbers. Recipe 2.4 gives three ways to perform one operation on each element of a set of consecutive integers. We show how to convert to and from Roman numerals in Recipe 2.5. Random numbers are the topic of several recipes. Perl's rand function returns a floating-point value between 0 and 1, or between 0 and its argument. We show how to get random numbers in a given range, how to make random numbers more random, and how to make rand give a different sequence of random numbers each time you run your program. 
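The rand behavior just mentioned is easy to confirm with a quick spot check; this loop is ours, not part of any recipe, and simply verifies the half-open ranges.

```perl
#!/usr/bin/perl -w
# rand(N) returns a float in [0, N); rand() with no argument
# behaves like rand(1), returning a float in [0, 1).
use strict;

for (1 .. 1_000) {
    my $x = rand();       # 0 <= $x < 1
    my $y = rand(10);     # 0 <= $y < 10
    die "rand out of range" if $x < 0 or $x >= 1
                            or $y < 0 or $y >= 10;
}
print "all samples in range\n";
```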
We round out the chapter with recipes on trigonometry, logarithms, matrix multiplication, complex numbers, and the often-asked question: "How do you put commas in numbers?" Recipe 2.1 Checking Whether a String Is a Valid Number 2.1.1 Problem You want to check whether a string represents a valid number. This is a common problem when validating input, as in CGI scripts, configuration files, and command-line arguments. 2.1.2 Solution Compare it against a regular expression that matches the kinds of numbers you're interested in: if ($string =~ /PATTERN/) { # is a number } else { # is not } Or use the patterns provided by the CPAN module Regexp::Common: if ($string =~ m{^$RE{num}{real}$}) { # is a real number } else { # is not } 2.1.3 Discussion This problem gets to the heart of what we mean by a number. Even things that sound simple, like integer, make you think hard about what you will accept; for example, "Is a leading + for positive numbers optional, mandatory, or forbidden?" The many ways that floating-point numbers can be represented could overheat your brain. Decide what you will and will not accept. Then, construct a regular expression to match those things alone. Here are some precooked solutions (the Cookbook's equivalent of just-add-water meals) for most common cases: warn "has nondigits" if /\D/; warn "not a natural number" unless /^\d+$/; # rejects -3 warn "not an integer" unless /^-?\d+$/; # rejects +3 warn "not an integer" unless /^[+-]?\d+$/; warn "not a decimal number" unless /^-?\d+\.?\d*$/; # rejects .2 warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/; warn "not a C float" unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/; These lines do not catch the IEEE notations of "Infinity" and "NaN", but unless you're worried that IEEE committee members will stop by your workplace and beat you over the head with copies of the relevant standards documents, you can probably forget about these strange forms. 
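To see a couple of the precooked patterns above in action, here is a small harness; the helper names are ours, wrapped around two of the patterns verbatim.

```perl
#!/usr/bin/perl -w
# Spot-check two of the precooked patterns: the signed-integer
# pattern and the decimal pattern that also accepts forms like ".2".
use strict;

sub is_integer { $_[0] =~ /^[+-]?\d+$/ }
sub is_decimal { $_[0] =~ /^-?(?:\d+(?:\.\d*)?|\.\d+)$/ }

for my $s ("42", "+3", "-7", "3.14", ".2", "7A") {
    printf "%-5s integer=%d decimal=%d\n",
           $s, is_integer($s) ? 1 : 0, is_decimal($s) ? 1 : 0;
}
```

Note the asymmetry the Discussion warns about: the decimal pattern shown allows a leading minus but not a leading plus, so "+3" passes the integer test yet fails the decimal one.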
If your number has leading or trailing whitespace, those patterns won't work. Either add the appropriate logic directly, or call the trim function from Recipe 1.19. The CPAN module Regexp::Common provides a wealth of canned patterns that test whether a string looks like a number. Besides saving you from having to figure out the patterns on your own, it also makes your code more legible. By default, this module exports a hash called %RE that you index into, according to which kind of regular expression you're looking for. Be careful to use anchors as needed; otherwise, it will search for that pattern anywhere in the string. For example: use Regexp::Common; $string = "Gandalf departed from the Havens in 3021 TA."; print "Is an integer\n" if $string =~ / ^ $RE{num}{int} $ /x; print "Contains the integer $1\n" if $string =~ / ( $RE{num}{int} ) /x; The following examples are other patterns that the module can use to match numbers: $RE{num}{int}{-sep=>',?'} # match 1234567 or 1,234,567 $RE{num}{int}{-sep=>'.'}{-group=>4} # match 1.2345.6789 $RE{num}{int}{-base => 8} # match 014 but not 99 $RE{num}{int}{-sep=>','}{-group=>3} # match 1,234,594 $RE{num}{int}{-sep=>',?'}{-group=>3} # match 1,234 or 1234 $RE{num}{real} # match 123.456 or -0.123456 $RE{num}{roman} # match xvii or MCMXCVIII $RE{num}{square} # match 9 or 256 or 12321 Some of these patterns, such as square, were not available in early module versions. General documentation for the module can be found in the Regexp::Common manpage, but more detailed documentation for just the numeric patterns is in the Regexp::Common::number manpage. Some techniques for identifying numbers don't involve regular expressions. Instead, these techniques use functions from system libraries or Perl to determine whether a string contains an acceptable number. Of course, these functions limit you to the definition of "number" offered by your libraries and Perl. If you're on a POSIX system, Perl supports the POSIX::strtod function. 
Its semantics are cumbersome, so the following is a getnum wrapper function for more convenient access. This function takes a string and returns either the number it found or undef for input that isn't a C float. The is_numeric function is a frontend to getnum for when you just want to ask, "Is this a float?" sub getnum { use POSIX qw(strtod); my $str = shift; $str =~ s/^\s+//; # remove leading whitespace $str =~ s/\s+$//; # remove trailing whitespace $! = 0; my($num, $unparsed) = strtod($str); if (($str eq '') || ($unparsed != 0) || $!) { return; } else { return $num; } } sub is_numeric { defined scalar &getnum } The Scalar::Util module, newly standard as of Perl v5.8.1, exports a function called looks_like_number( ) that uses the Perl compiler's own internal function of the same name (see perlapi(1)). It returns true for any base-10 number that is acceptable to Perl itself, such as 0, 0.8, 14.98, and 6.02e23—but not 0b1010, 077, 0x392, or numbers with underscores in them. This means that you must check for alternate bases and decode them yourself if you want to permit users to enter such numbers, as in Example 2-1. Example 2-1. 
Decode numbers #!/usr/bin/perl -w use Scalar::Util qw(looks_like_number); print "$0: hit ^D (your eof character) to exit\n"; for (;;) { my ($on, $n); # original string and its numeric value print "Pick a number, any number: "; $on = $n = <STDIN>; last if !defined $n; chomp($on,$n); $n =~ s/_//g; # allow 186_282.398_280_685 $n = oct($n) if $n =~ /^0/; # allow 0xFF, 037, 0b1010 if (looks_like_number($n)) { printf "Decimal double of $on is %g\n", 2*$n; } else { print "That doesn't look like a number to Perl.\n"; } } print "\nBye.\n"; 2.1.4 See Also The regular expression syntax in perlre(1) and Chapter 5 of Programming Perl; your system's strtod(3) manpage; the perlapi(1) manpage; the documentation for the CPAN module Regexp::Common, including the Regexp::Common::number manpage; the documentation for the standard POSIX and Scalar::Util modules (also in Chapter 32 of Programming Perl) Recipe 2.2 Rounding Floating-Point Numbers 2.2.1 Problem You want to round a floating-point value to a certain number of decimal places. This problem arises from the same inaccuracies in representation that make testing for equality difficult (see Recipe 2.3), as well as in situations where you must reduce the precision of your answers for readability. 2.2.2 Solution Use the Perl function sprintf, or printf if you're just trying to produce output: # round off to two places $rounded = sprintf("%.2f", $unrounded); Or you can use other rounding functions described in the Discussion. 2.2.3 Discussion Whether visible or not, rounding of some sort is virtually unavoidable when working with floating-point numbers. Carefully defined standards (namely, IEEE 754, the standard for binary floating-point arithmetic) coupled with reasonable defaults within Perl often manage to eliminate or at least hide these round-off errors. In fact, Perl's implicit rounding on output is usually good enough so that it rarely surprises. 
It's almost always best to leave the numbers unrounded until output, and then, if you don't like Perl's default rounding, use printf or sprintf yourself with a format that makes the rounding explicit. The %f, %e, and %g formats all let you specify how many decimal places to round their argument to. Here's an example showing how all three behave; in each case, we're asking for a field that's 12 spaces wide, but with a precision of no more than four digits to the right of the decimal place. for $n ( 0.0000001, 10.1, 10.00001, 100000.1 ) { printf "%12.4e %12.4f %12.4g\n", $n, $n, $n; } This produces the following output: 1.0000e-07 0.0000 1e-07 1.0100e+01 10.1000 10.1 1.0000e+01 10.0000 10 1.0000e+05 100000.1000 1e+05 If that were all there were to the matter, rounding would be pretty easy. You'd just pick your favorite output format and be done with it. However, it's not that easy: sometimes you need to think more about what you really want and what's really happening. As we explained in the Introduction, even a simple number like 10.1 or 0.1 can only be approximated in binary floating-point. The only decimal numbers that can be exactly represented as floating-point numbers are those that can be rewritten as a finite sum of one or more fractions whose denominators are all powers of two. For example: $a = 0.625; # 1/2 + 1/8 $b = 0.725; # 725/1000, or 29/40 printf "$_ is %.30g\n", $_ for $a, $b; prints out: 0.625 is 0.625 0.725 is 0.724999999999999977795539507497 The number in $a is exactly representable in binary, but the one in $b is not. When Perl is told to print a floating-point number but not told the precision, as occurs for the interpolated value of $_ in the string, it automatically rounds that number to however many decimal digits of precision that your machine supports. Typically, this is like using an output format of "%.15g", which, when printed, produces the same number as you assigned to $b. 
Usually the round-off error is so small you never even notice it, and if you do, you can always specify how much precision you'd like in your output. But because the underlying approximation is still a little bit off from what a simple print might show, this can produce unexpected results. For example, while numbers such as 0.125 and 0.625 are exactly representable, numbers such as 0.325 and 0.725 are not. So let's suppose you'd like to round to two decimal places. Will 0.325 become 0.32 or 0.33? Will 0.725 become 0.72 or 0.73? $a = 0.325; # 325/1000, or 13/40 $b = 0.725; # 725/1000, or 29/40 printf "%s is %.2f or %.30g\n", ($_) x 3 for $a, $b; This produces: 0.325 is 0.33 or 0.325000000000000011102230246252 0.725 is 0.72 or 0.724999999999999977795539507497 Since 0.325's binary approximation is a bit above 0.325 itself, it rounds up to 0.33. On the other hand, 0.725's approximation is really a little under 0.725, so it rounds down, giving 0.72 instead. But what if the number is exactly representable, such as 1.5 or 7.5, since those are just whole numbers plus one-half? The rounding rule used in that case is probably not the one you learned in grade school. Watch: for $n (-4 .. +4) { $n += 0.5; printf "%4.1f %2.0f\n", $n, $n; } That produces this: -3.5 -4 -2.5 -2 -1.5 -2 -0.5 -0 0.5 0 1.5 2 2.5 2 3.5 4 4.5 4 What's happening is that the rounding rule preferred by numerical analysts isn't "round up on a five," but instead "round toward even." This way the bias in the round-off error tends to cancel itself out. Three useful functions for rounding floating-point values to integral ones are int, ceil, and floor. Built into Perl, int returns the integral portion of the floating-point number passed to it. This is called "rounding toward zero." This is also known as integer truncation because it ignores the fractional part: it rounds down for positive numbers and up for negative ones. 
The POSIX module's floor and ceil functions also ignore the fractional part, but they always round down and up to the next integer, respectively, no matter the sign. use POSIX qw(floor ceil); printf "%8s %8s %8s %8s %8s\n", qw(number even zero down up); for $n (-6 .. +6) { $n += 0.5; printf "%8g %8.0f %8s %8s %8s\n", $n, $n, int($n), floor($n), ceil($n); } This produces the following illustrative table; each column heading shows what happens when you round the number in the specified direction. number even zero down up -5.5 -6 -5 -6 -5 -4.5 -4 -4 -5 -4 -3.5 -4 -3 -4 -3 -2.5 -2 -2 -3 -2 -1.5 -2 -1 -2 -1 -0.5 -0 0 -1 0 0.5 0 0 0 1 1.5 2 1 1 2 2.5 2 2 2 3 3.5 4 3 3 4 4.5 4 4 4 5 5.5 6 5 5 6 6.5 6 6 6 7 If you add up each column, you'll see that you arrive at rather different totals: 6.5 6 6 0 13 What this tells you is that your choice of rounding style—in effect, your choice of round-off error—can have tremendous impact on the final outcome. That's one reason why you're strongly advised to wait until final output for any rounding. Even still, some algorithms are more sensitive than others to accumulation of round-off error. In particularly delicate applications, such as financial computations and thermonuclear missiles, prudent programmers will implement their own rounding functions instead of relying on their computers' built-in logic, or lack thereof. (A good textbook on numerical analysis is also recommended.) 2.2.4 See Also The sprintf and int functions in perlfunc(1) and Chapter 29 of Programming Perl; the floor and ceil entries in the documentation for the standard POSIX module (also in Chapter 32 of Programming Perl); we introduce the sprintf technique in Recipe 2.3 Recipe 2.3 Comparing Floating-Point Numbers 2.3.1 Problem Floating-point arithmetic isn't exact. You want to compare two floating-point numbers and know whether they're equal when carried out to a certain number of decimal places. 
Most of the time, this is the way you should compare floating-point numbers for equality. 2.3.2 Solution Use sprintf to format the numbers to a certain number of decimal places, then compare the resulting strings: # equal(NUM1, NUM2, PRECISION) : returns true if NUM1 and NUM2 are # equal to PRECISION number of decimal places sub equal { my ($A, $B, $dp) = @_; return sprintf("%.${dp}g", $A) eq sprintf("%.${dp}g", $B); } Alternatively, store the numbers as integers by assuming the decimal place. 2.3.3 Discussion You need the equal routine because computers' floating-point representations are just approximations of most real numbers, as we discussed in the Introduction to this chapter. Perl's normal printing routines display numbers rounded to 15 decimal places or so, but its numeric tests don't round. So sometimes you can print out numbers that look the same (after rounding) but do not test the same (without rounding). This problem is especially noticeable in a loop, where round-off error can silently accumulate. For example, you'd think that you could start a variable out at zero, add one-tenth to it ten times, and end up with one. Well, you can't, because a base-2 computer can't exactly represent one-tenth. For example: for ($num = $i = 0; $i < 10; $i++) { $num += 0.1 } if ($num != 1) { printf "Strange, $num is not 1; it's %.45f\n", $num; } prints out: Strange, 1 is not 1; it's 0.999999999999999888977697537484345957636833191 The $num is interpolated into the double-quoted string using a default conversion format of "%.15g" (on most systems), so it looks like 1. But internally, it really isn't. If you had checked only to a few decimal places, for example, five: !equal($num, 1, 5) then you'd have been okay. If you have a fixed number of decimal places, as with currency, you can often sidestep the problem by storing your values as integers. Storing $3.50 as 350 instead of 3.5 removes the need for floating-point values. 
Reintroduce the decimal point on output: $wage = 536; # $5.36/hour $week = 40 * $wage; # $214.40 printf("One week's wage is: \$%.2f\n", $week/100); One week's wage is: $214.40 It rarely makes sense to compare more than 15 decimal places, because you probably only have that many digits of precision in your computer's hardware. 2.3.4 See Also The sprintf function in perlfunc(1) and Chapter 29 of Programming Perl; the entry on $OFMT in the perlvar(1) manpage and Chapter 28 of Programming Perl; the documentation for the standard Math::BigFloat module (also in Chapter 32 of Programming Perl); we use sprintf in Recipe 2.2; Volume 2, Section 4.2.2 of The Art of Computer Programming Recipe 2.4 Operating on a Series of Integers 2.4.1 Problem You want to perform an operation on all integers between X and Y, such as when you're working on a contiguous section of an array or wherever you want to process all numbers[1] within a range. [1] Okay, integers. It's hard to find all the reals. Just ask Cantor. 2.4.2 Solution Use a for loop, or .. in conjunction with a foreach loop: foreach ($X .. $Y) { # $_ is set to every integer from X to Y, inclusive } foreach $i ($X .. $Y) { # $i is set to every integer from X to Y, inclusive } for ($i = $X; $i <= $Y; $i++) { # $i is set to every integer from X to Y, inclusive } for ($i = $X; $i <= $Y; $i += 7) { # $i is set to every integer from X to Y, stepsize = 7 } 2.4.3 Discussion The first two approaches use a foreach loop in conjunction with the $X .. $Y construct, which creates a list of integers between $X and $Y. Now, if you were just assigning that range to an array, this would use up a lot of memory whenever $X and $Y were far apart. But in a foreach loop, Perl notices this and doesn't waste time or memory allocating a temporary list. When iterating over consecutive integers, the foreach loop will run faster than the equivalent for loop. 
Another difference between the two constructs is that the foreach loop implicitly localizes the loop variable to the body of the loop, but the for loop does not. That means that after the for loop finishes, the loop variable will contain the value it held upon the final iteration. But in the case of the foreach loop, that value will be inaccessible, and the variable will hold whatever it held—if anything—prior to entering the loop. You can, however, use a lexically scoped variable as the loop variable: foreach my $i ($X .. $Y) { ... } for (my $i=$X; $i <= $Y; $i++) { ... } The following code shows each technique. Here we just print the numbers we generate: print "Infancy is: "; foreach (0 .. 2) { print "$_ "; } print "\n"; print "Toddling is: "; foreach $i (3 .. 4) { print "$i "; } print "\n"; print "Childhood is: "; for ($i = 5; $i <= 12; $i++) { print "$i "; } print "\n"; Infancy is: 0 1 2 Toddling is: 3 4 Childhood is: 5 6 7 8 9 10 11 12 2.4.4 See Also The for and foreach operators in perlsyn(1) and the "For Loops" and "Foreach Loops" sections of Chapter 4 of Programming Perl Recipe 2.5 Working with Roman Numerals 2.5.1 Problem You want to convert between regular numbers and Roman numerals. You need to do this with items in outlines, page numbers on a preface, and copyrights for movie credits. 2.5.2 Solution Use the Roman module from CPAN: use Roman; $roman = roman($arabic); # convert to roman numerals $arabic = arabic($roman) if isroman($roman); # convert from roman numerals 2.5.3 Discussion The Roman module provides both Roman and roman for converting Arabic ("normal") numbers to their Roman equivalents. Roman produces uppercase letters, whereas roman gives lowercase ones. The module only deals with Roman numbers from 1 to 3999, inclusive. The Romans didn't represent negative numbers or zero, and 5000 (which 4000 is represented in terms of) uses a symbol outside the ASCII character set. 
use Roman; $roman_fifteen = roman(15); # "xv" print "Roman for fifteen is $roman_fifteen\n"; $arabic_fifteen = arabic($roman_fifteen); print "Converted back, $roman_fifteen is $arabic_fifteen\n"; Roman for fifteen is xv Converted back, xv is 15 Or to print the current year: use Time::localtime; use Roman; printf "The year is now %s\n", Roman(1900 + localtime->year); The year is now MMIII Now, if you happen to have Unicode fonts available, you'll find that code points U+2160 through U+2183 represent Roman numerals, including those beyond the typical ASCII values. use charnames ":full"; print "2003 is ", "\N{ROMAN NUMERAL ONE THOUSAND}" x 2, "\N{ROMAN NUMERAL THREE}\n"; 2003 is ⅯⅯⅢ However, the Roman module doesn't yet have an option to use those characters. Believe it or not, there's even a CPAN module that lets you use Roman numerals in arithmetic. use Math::Roman qw(roman); print $a = roman('I'); # I print $a += 2000; # MMI print $a -= "III"; # MCMXCVIII print $a -= "MCM"; # XCVIII 2.5.4 See Also The Encyclopaedia Britannica article on "Mathematics, History Of"; the documentation with the CPAN modules Roman and Math::Roman; Recipe 6.23 Recipe 2.6 Generating Random Numbers 2.6.1 Problem You want to make random numbers in a given range, inclusive, such as when you randomly pick an array index, simulate rolling a die in a game of chance, or generate a random password. 2.6.2 Solution Use Perl's rand function: $random = int( rand( $Y-$X+1 ) ) + $X; 2.6.3 Discussion This code generates and prints a random integer between 25 and 75, inclusive: $random = int( rand(51)) + 25; print "$random\n"; The rand function returns a fractional number, from (and including) 0 up to (but not including) its argument. We give it an argument of 51 to get a number that can be 0 or more, but never 51 or more. We take the integer portion of this to get a number from 0 to 50, inclusive (50.99999.... will be turned into 50 by int). 
We then add 25 to it to get a number from 25 to 75, inclusive. A common application of this is the random selection of an element from an array: $elt = $array[ rand @array ]; That's just like saying: $elt = $array[ int( rand(0+@array) ) ]; Because rand is prototyped to take just one argument, it implicitly imposes scalar context on that argument, which, on a named array, is the number of elements in that array. The function then returns a floating-point number smaller than its argument and greater than or equal to zero. A floating-point number used as an array subscript implicitly undergoes integer truncation (rounding toward zero), producing in the end an evenly distributed, randomly selected array element to assign to $elt. Generating a random password from a sequence of characters is similarly easy: @chars = ( "A" .. "Z", "a" .. "z", 0 .. 9, qw(! @ $ % ^ & *) ); $password = join("", @chars[ map { rand @chars } ( 1 .. 8 ) ]); We use map to generate eight random indices into @chars, extract the corresponding characters with a slice, and join them together to form the random password. This isn't a good random number, though, as its security relies on the choice of seed, which (in older versions of Perl) is based on the time the program started. See Recipe 2.7 for a way to better seed your random number generator. 2.6.4 See Also The int, rand, map, and join functions in perlfunc(1) and Chapter 29 of Programming Perl; we explore random numbers further in Recipe 2.7, Recipe 2.8, and Recipe 2.9; we use random numbers in Recipe 1.13 Recipe 2.7 Generating Repeatable Random Number Sequences 2.7.1 Problem Every time you run your program, you get a different sequence of (pseudo-)random numbers. But you want a reproducible sequence, useful when running a simulation, so you need Perl to produce the same set of random numbers each time. 
2.7.2 Solution Use Perl's srand function: srand EXPR; # use a constant here for repeated sequences 2.7.3 Discussion Making random numbers is hard. The best that computers can do, without special hardware, is generate "pseudo-random" numbers, which are evenly distributed in their range of values. These are generated using a mathematical formula, which means that given the same seed (starting point), two programs will produce identical pseudo-random numbers. The srand function creates a new seed for the pseudo-random number generator. If given an argument, it uses that number as the seed. If no argument is given, srand uses a value that's reasonably difficult to guess as the seed. If you call rand without first calling srand yourself, Perl calls srand for you, choosing a "good" seed. This way, every time you run your program you'll get a different set of random numbers. Ancient versions of Perl did not call srand, so the same program always produced the same sequence of pseudo-random numbers every time the program was run. Certain sorts of programs don't want a different set of random numbers each time; they want the same set. When you need that behavior, call srand yourself, supplying it with a particular seed: srand( 42 ); # pick any fixed starting point Don't call srand more than once in a program, because if you do, you'll start the sequence again from that point. Unless, of course, that's what you want. Just because Perl tries to use a good default seed does not necessarily guarantee that the numbers generated are cryptographically secure against the most intrepid crackers. Textbooks on cryptography are usually good sources of cryptographically secure random number generators. 
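The repeatability described above is easy to demonstrate; the seed 42 and the sequence length are arbitrary choices for this sketch.

```perl
#!/usr/bin/perl -w
# With a fixed seed, srand makes rand reproduce the same sequence.
use strict;

sub five_randoms {
    srand(42);        # fixed seed: restart the sequence every call
    return join ",", map { int rand 100 } 1 .. 5;
}

my $first  = five_randoms();
my $second = five_randoms();
print $first eq $second ? "sequences match\n" : "sequences differ\n";
```

Calling srand twice here is deliberate, illustrating the caveat in the Discussion: reseeding restarts the generator from the same point, which is exactly what you want for a reproducible simulation and exactly what you don't want otherwise.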
2.7.4 See Also The srand function in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 2.6 and Recipe 2.8; Bruce Schneier's excellent Applied Cryptography (John Wiley & Sons) Recipe 2.8 Making Numbers Even More Random 2.8.1 Problem You want to generate numbers that are more random than Perl's random numbers. Limitations of your C library's random number generator seeds can sometimes cause problems. The sequence of pseudo-random numbers may repeat too soon for some applications. 2.8.2 Solution Use a different random number generator, such as those provided by the Math::Random and Math::TrulyRandom modules from CPAN: use Math::TrulyRandom; $random = truly_random_value( ); use Math::Random; $random = random_uniform( ); 2.8.3 Discussion The Perl build process tries to find the best C-library routine to use for generating pseudo-random numbers, looking at rand(3), random(3), and drand48(3). (This can be changed manually at build time, however.) The standard library functions are getting pretty good, but some ancient implementations of the rand function return only 16-bit random numbers or have other algorithmic weaknesses, and may therefore not be sufficiently random for your purposes. The Math::TrulyRandom module uses inadequacies of your system's timers to generate the random numbers. This takes a while, so it isn't useful for generating a lot of random numbers. The Math::Random module uses the randlib library to generate random numbers. It also includes a wide range of related functions for generating random numbers according to specific distributions, such as binomial, Poisson, and exponential.
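If pulling in a CPAN module is overkill, one common middle ground (not from the recipe itself; this sketch assumes a Unix-like system that provides /dev/urandom) is to seed Perl's own generator from the kernel's entropy pool:

```perl
#!/usr/bin/perl
# urandom_seed - seed srand from the kernel entropy pool
use strict;
use warnings;

# Read four bytes from /dev/urandom and unpack them as an
# unsigned 32-bit integer to use as a seed.
sub urandom_seed {
    open(my $fh, "<", "/dev/urandom") or die "open /dev/urandom: $!";
    binmode($fh);
    read($fh, my $buf, 4) == 4 or die "short read from /dev/urandom";
    close($fh);
    return unpack("L", $buf);   # 0 .. 2**32-1
}

my $seed = urandom_seed();
srand($seed);
printf "seeded with %u\n", $seed;
```

This improves the seed, not the generator; the stream still comes from Perl's rand, so it is no substitute for a cryptographic generator.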
2.8.4 See Also The srand and rand functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 2.6 and Recipe 2.7; the documentation for the CPAN modules Math::Random and Math::TrulyRandom Recipe 2.9 Generating Biased Random Numbers 2.9.1 Problem You want to pick a random value where the probabilities of the values are not equal (the distribution is not even). You might be trying to randomly select a banner to display on a web page, given a set of relative weights saying how often each banner is to be displayed. Alternatively, you might want to simulate behavior according to a normal distribution (the bell curve). 2.9.2 Solution If you want a random value distributed according to a specific function—e.g., the Gaussian (Normal) distribution—consult a statistics textbook to find the appropriate function or algorithm. This subroutine generates random numbers that are normally distributed, with a standard deviation of 1 and a mean of 0: sub gaussian_rand { my ($u1, $u2); # uniformly distributed random numbers my $w; # variance, then a weight my ($g1, $g2); # gaussian-distributed numbers do { $u1 = 2 * rand( ) - 1; $u2 = 2 * rand( ) - 1; $w = $u1*$u1 + $u2*$u2; } while ($w >= 1 || $w == 0); $w = sqrt( (-2 * log($w)) / $w ); $g2 = $u1 * $w; $g1 = $u2 * $w; # return both if wanted, else just one return wantarray ?
($g1, $g2) : $g1; } If you have a list of weights and values you want to randomly pick from, follow this two-step process: first, turn the weights into a probability distribution with weight_to_dist, and then use the distribution to randomly pick a value with weighted_rand: # weight_to_dist: takes a hash mapping key to weight and returns # a hash mapping key to probability sub weight_to_dist { my %weights = @_; my %dist = ( ); my $total = 0; my ($key, $weight); local $_; foreach (values %weights) { $total += $_; } while ( ($key, $weight) = each %weights ) { $dist{$key} = $weight/$total; } return %dist; } # weighted_rand: takes a hash mapping key to probability, and # returns the corresponding element sub weighted_rand { my %dist = @_; my ($key, $weight); while (1) { # to avoid floating point inaccuracies my $rand = rand; while ( ($key, $weight) = each %dist ) { return $key if ($rand -= $weight) < 0; } } } 2.9.3 Discussion The gaussian_rand function implements the polar Box Muller method for turning two independent, uniformly distributed random numbers between 0 and 1 (such as rand returns) into two numbers with a mean of 0 and a standard deviation of 1 (i.e., a Gaussian distribution). To generate numbers with a different mean and standard deviation, multiply the output of gaussian_rand by the new standard deviation, and then add the new mean: # gaussian_rand as shown earlier $mean = 25; $sdev = 2; $salary = gaussian_rand( ) * $sdev + $mean; printf("You have been hired at \$%.2f\n", $salary); The Math::Random module implements this and other distributions for you: use Math::Random qw(random_normal); $salary = random_normal(1, $mean, $sdev); The weighted_rand function picks a random number between 0 and 1. It then uses the probabilities generated by weight_to_dist to see which element the random number corresponds to. Because of the vagaries of floating-point representation, accumulated errors in representation might mean we don't find an element to return. 
This is why we wrap the code in a while to pick a new random number and try again. Also, the CPAN module Math::Random has functions to return random numbers from a variety of distributions. 2.9.4 See Also The rand function in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 2.6; the documentation for the CPAN module Math::Random Recipe 2.10 Doing Trigonometry in Degrees, Not Radians 2.10.1 Problem You want your trigonometry routines to operate in degrees instead of Perl's native radians. 2.10.2 Solution Convert between radians and degrees (2π radians equals 360 degrees): use constant PI => (4 * atan2(1, 1)); sub deg2rad { my $degrees = shift; return ($degrees / 180) * PI; } sub rad2deg { my $radians = shift; return ($radians / PI) * 180; } Alternatively, use the standard Math::Trig module: use Math::Trig; $radians = deg2rad($degrees); $degrees = rad2deg($radians); 2.10.3 Discussion If you're doing a lot of trigonometry, look into using either the standard Math::Trig or POSIX modules. They provide many more trigonometric functions than are defined in the Perl core. Otherwise, the first solution will define the rad2deg and deg2rad functions. The value of π isn't built directly into Perl, but you can calculate it to as much precision as your floating-point hardware provides. In the Solution, the PI function is a constant created with use constant. Instead of having to remember that π is 3.14159265358979 or so, we use the built-in function call, resolved at compile time, which, besides sparing us from memorizing a long string of digits, is also guaranteed to provide as much accuracy as the platform supports.
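As a quick sanity check, converting an angle to radians and back should return the original value to within floating-point error. This sketch reuses the deg2rad and rad2deg definitions from the Solution:

```perl
#!/usr/bin/perl
# degtrip - round-trip degrees through radians
use strict;
use warnings;
use constant PI => (4 * atan2(1, 1));

sub deg2rad { my $degrees = shift; return ($degrees / 180) * PI }
sub rad2deg { my $radians = shift; return ($radians / PI) * 180 }

# 90 degrees is PI/2 radians; converting back recovers 90.
my $rad = deg2rad(90);
my $deg = rad2deg($rad);
printf "90 degrees = %.6f radians\n", $rad;
print "round trip ok\n" if abs($deg - 90) < 1e-9;
```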
If you're looking for the sine in degrees, use this: # deg2rad and rad2deg defined either as above or from Math::Trig sub degree_sine { my $degrees = shift; my $radians = deg2rad($degrees); my $result = sin($radians); return $result; } 2.10.4 See Also The sin, cos, and atan2 functions in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the standard POSIX and Math::Trig modules (also in Chapter 32 of Programming Perl) Recipe 2.11 Calculating More Trigonometric Functions 2.11.1 Problem You want to calculate values for trigonometric functions like sine, tangent, or arc-cosine. 2.11.2 Solution Perl provides only sin, cos, and atan2 as standard functions. From these, you can derive tan and all other trig functions (if you're intimately familiar with esoteric trig identities): sub tan { my $theta = shift; return sin($theta)/cos($theta); } The POSIX module provides a wider range of trig functions: use POSIX; $y = acos(3.7); The standard Math::Trig module provides a complete set of functions and supports operations on or resulting in complex numbers: use Math::Trig; $y = acos(3.7); 2.11.3 Discussion The tan function will cause a division-by-zero exception when $theta is π/2, 3π/2, and so on, because the cosine is 0 for these values. Similarly, tan and many other functions from Math::Trig may generate the same error. To trap these, use eval: eval { $y = tan($pi/2); } or return undef; 2.11.4 See Also The sin, cos, and atan2 functions in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the standard Math::Trig module; we talk about trigonometry in the context of imaginary numbers in Recipe 2.14; we talk about the use of eval to catch exceptions in Recipe 10.12 Recipe 2.12 Taking Logarithms 2.12.1 Problem You want to take a logarithm in various bases.
2.12.2 Solution For logarithms to base e, use the built-in log: $log_e = log(VALUE); For logarithms to base 10, use the POSIX module's log10 function: use POSIX qw(log10); $log_10 = log10(VALUE); For other bases, use the mathematical identity log_n(x) = log_e(x) / log_e(n), where x is the number whose logarithm you want, n is the desired base, and e is the natural logarithm base. sub log_base { my ($base, $value) = @_; return log($value)/log($base); } 2.12.3 Discussion The log_base function lets you take logarithms to any base. If you know the base you'll want in advance, it's more efficient to cache the log of the base instead of recalculating it every time. # log_base as defined earlier $answer = log_base(10, 10_000); print "log10(10,000) = $answer\n"; log10(10,000) = 4 The Math::Complex module does the caching for you via its logn( ) routine, so you can write: use Math::Complex; printf "log2(1024) = %lf\n", logn(1024, 2); # watch out for argument order! log2(1024) = 10.000000 even though no complex number is involved here. This is not very efficient, but there are plans to rewrite Math::Complex in C for speed. 2.12.4 See Also The log function in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the standard POSIX and Math::Complex modules (also in Chapter 32 of Programming Perl) Recipe 2.13 Multiplying Matrices 2.13.1 Problem You want to multiply a pair of two-dimensional arrays. Mathematicians and engineers often need this. 2.13.2 Solution Use the PDL modules, available from CPAN.
PDL is the Perl Data Language—modules that give fast access to compact matrix and mathematical functions: use PDL; # $a and $b are both pdl objects $c = $a x $b; Alternatively, apply the matrix multiplication algorithm to your two-dimensional array: sub mmult { my ($m1,$m2) = @_; my ($m1rows,$m1cols) = matdim($m1); my ($m2rows,$m2cols) = matdim($m2); unless ($m1cols == $m2rows) { # raise exception die "IndexError: matrices don't match: $m1cols != $m2rows"; } my $result = [ ]; my ($i, $j, $k); for $i (range($m1rows)) { for $j (range($m2cols)) { for $k (range($m1cols)) { $result->[$i][$j] += $m1->[$i][$k] * $m2->[$k][$j]; } } } return $result; } sub range { 0 .. ($_[0] - 1) } sub veclen { my $ary_ref = $_[0]; my $type = ref $ary_ref; if ($type ne "ARRAY") { die "$type is bad array ref for $ary_ref" } return scalar(@$ary_ref); } sub matdim { my $matrix = $_[0]; my $rows = veclen($matrix); my $cols = veclen($matrix->[0]); return ($rows, $cols); } 2.13.3 Discussion If you have the PDL library installed, you can use its lightning-fast manipulation of numbers. This requires far less memory and CPU than Perl's array manipulation. When using PDL objects, many numeric operators (such as + and *) are overloaded and work on an element-by-element basis (e.g., * is the so-called scalar multiplication operator). To get true matrix multiplication, use the overloaded x operator. use PDL; $a = pdl [ [ 3, 2, 3 ], [ 5, 9, 8 ], ]; $b = pdl [ [ 4, 7 ], [ 9, 3 ], [ 8, 1 ], ]; $c = $a x $b; # x overload If you don't have the PDL library, or don't feel like pulling it in for a small problem, you can always do the work yourself the good old-fashioned way.
# mmult( ) and other subroutines as shown earlier $x = [ [ 3, 2, 3 ], [ 5, 9, 8 ], ]; $y = [ [ 4, 7 ], [ 9, 3 ], [ 8, 1 ], ]; $z = mmult($x, $y); 2.13.4 See Also The documentation with the CPAN module PDL Recipe 2.14 Using Complex Numbers 2.14.1 Problem Your application must manipulate complex numbers, as are often needed in engineering, science, and mathematics. 2.14.2 Solution Either keep track of the real and imaginary components yourself: # $c = $a * $b manually $c_real = ( $a_real * $b_real ) - ( $a_imaginary * $b_imaginary ); $c_imaginary = ( $a_real * $b_imaginary ) + ( $b_real * $a_imaginary ); or use the Math::Complex module (part of the standard Perl distribution): # $c = $a * $b using Math::Complex use Math::Complex; $c = $a * $b; 2.14.3 Discussion Here's how you'd manually multiply 3+5i and 2-2i: $a_real = 3; $a_imaginary = 5; # 3 + 5i; $b_real = 2; $b_imaginary = -2; # 2 - 2i; $c_real = ( $a_real * $b_real ) - ( $a_imaginary * $b_imaginary ); $c_imaginary = ( $a_real * $b_imaginary ) + ( $b_real * $a_imaginary ); print "c = ${c_real}+${c_imaginary}i\n"; c = 16+4i and with Math::Complex: use Math::Complex; $a = Math::Complex->new(3,5); # or cplx(3,5) $b = Math::Complex->new(2,-2); $c = $a * $b; print "c = $c\n"; c = 16+4i You may create complex numbers via the cplx constructor or via the exported constant i: use Math::Complex; $c = cplx(3,5) * cplx(2,-2); # easier on the eye $d = 3 + 4*i; # 3 + 4i printf "sqrt($d) = %s\n", sqrt($d); sqrt(3+4i) = 2+i The Math::Trig module uses the Math::Complex module internally because some functions can break out from the real axis into the complex plane—for example, the inverse sine of 2.
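To see that boundary-crossing in action, here's a small sketch using the standard Math::Trig module; the value 2 is outside sine's real range, so the inverse sine comes back as a Math::Complex object:

```perl
#!/usr/bin/perl
# complex_asin - the inverse sine of 2 is a complex number
use strict;
use warnings;
use Math::Trig;

# asin(2) has no real solution; Math::Trig hands back a
# Math::Complex object instead of dying.
my $z = asin(2);
print "asin(2) = $z\n";

# Feeding it back through sine recovers 2 (to within rounding);
# sin and abs are overloaded for Math::Complex objects.
print "checks out\n" if abs(sin($z) - 2) < 1e-9;
```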
2.14.4 See Also The documentation for the standard Math::Complex module (also in Chapter 32 of Programming Perl) Recipe 2.15 Converting Binary, Octal, and Hexadecimal Numbers 2.15.1 Problem You want to convert a string (e.g., "0b10110", "0x55", or "0755") containing a binary, octal, or hexadecimal number to the correct number. Perl understands numbers specified in binary (base-2), octal (base-8), and hexadecimal (base-16) notation only when they occur as literals in your programs. If they come in as data—such as by reading from files or environment variables, or when supplied as command-line arguments—no automatic conversion takes place. 2.15.2 Solution Use Perl's hex function if you have a hexadecimal string like "2e" or "0x2e": $number = hex($hexadecimal); # hexadecimal only ("2e" becomes 47) Use the oct function if you have a hexadecimal string like "0x2e", an octal string like "057", or a binary string like "0b101110": $number = oct($hexadecimal); # "0x2e" becomes 47 $number = oct($octal); # "057" becomes 47 $number = oct($binary); # "0b101110" becomes 47 2.15.3 Discussion The oct function converts octal numbers with or without the leading "0"; for example, "0350" or "350". Despite its name, oct does more than convert octal numbers: it also converts hexadecimal ("0x350") numbers if they have a leading "0x" and binary ("0b101010") numbers if they have a leading "0b". The hex function converts only hexadecimal numbers, with or without a leading "0x": "0x255", "3A", "ff", or "deadbeef". (Letters may be in upper- or lowercase.) Here's an example that accepts an integer in decimal, binary, octal, or hex, and prints that integer in all four bases. It uses the oct function to convert the data from binary, octal, and hexadecimal if the input begins with a 0. It then uses printf to convert into all four bases as needed.
print "Gimme an integer in decimal, binary, octal, or hex: "; $num = <STDIN>; chomp $num; exit unless defined $num; $num = oct($num) if $num =~ /^0/; # catches 077 0b10 0x20 printf "%d %#x %#o %#b\n", ($num) x 4; The # symbol between the percent and the three non-decimal bases makes printf produce output that indicates which base the integer is in. For example, if you enter the number 255, the output would be: 255 0xff 0377 0b11111111 But without the # sign, you would only get: 255 ff 377 11111111 The following code converts Unix file permissions. They're always given in octal, so we use oct instead of hex. print "Enter file permission in octal: "; $permissions = <STDIN>; die "Exiting ...\n" unless defined $permissions; chomp $permissions; $permissions = oct($permissions); # permissions always octal print "The decimal value is $permissions\n"; 2.15.4 See Also The "Scalar Value Constructors" section in perldata(1) and the "Numeric Literals" section of Chapter 2 of Programming Perl; the oct and hex functions in perlfunc(1) and Chapter 29 of Programming Perl Recipe 2.16 Putting Commas in Numbers 2.16.1 Problem You want to output a number with commas in the right places. People like to see long numbers broken up in this way, especially in reports. 2.16.2 Solution Reverse the string so you can use backtracking to avoid substitution in the fractional part of the number. Then use a regular expression to find where you need commas, and substitute them in. Finally, reverse the string back. sub commify { my $text = reverse $_[0]; $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g; return scalar reverse $text; } 2.16.3 Discussion It's a lot easier in regular expressions to work from the front than from the back. With this in mind, we reverse the string and make a minor change to the algorithm that repeatedly inserts commas three digits from the end. When all insertions are done, we reverse the final string and return it.
Because reverse is sensitive to its implicit return context, we force it to scalar context. This function can easily be adjusted to accommodate the use of periods instead of commas, as are used in many countries. Here's an example of commify in action: # more reasonable web counter :-) use Math::TrulyRandom; $hits = truly_random_value( ); # negative hits! $output = "Your web page received $hits accesses last month.\n"; print commify($output); Your web page received -1,740,525,205 accesses last month. 2.16.4 See Also perllocale(1); the reverse function in perlfunc(1) and Chapter 29 of Programming Perl; the section "Adding Commas to a Number with Lookaround" in Chapter 2 of Mastering Regular Expressions, Second Edition Recipe 2.17 Printing Correct Plurals 2.17.1 Problem You're printing something like "It took $time hours", but "It took 1 hours" is ungrammatical. You would like to get it right. 2.17.2 Solution Use printf and the conditional operator (X ? Y : Z) to alter the noun or verb: printf "It took %d hour%s\n", $time, $time == 1 ? "" : "s"; printf "%d hour%s %s enough.\n", $time, $time == 1 ? "" : "s", $time == 1 ? "is" : "are"; Or use the Lingua::EN::Inflect module from CPAN, as described in the following Discussion. 2.17.3 Discussion The only reason inane messages like "1 file(s) updated" appear is because their authors are too lazy to bother checking whether the count is 1 or not. If your noun changes by more than an "-s", you'll need to change the printf accordingly: printf "It took %d centur%s", $time, $time == 1 ? "y" : "ies"; This is good for simple cases, but you'll tire of writing it. This leads you to write funny functions like this: sub noun_plural { local $_ = shift; # order really matters here!
s/ss$/sses/ || s/([psc]h)$/${1}es/ || s/z$/zes/ || s/ff$/ffs/ || s/f$/ves/ || s/ey$/eys/ || s/y$/ies/ || s/ix$/ices/ || s/([sx])$/$1es/ || s/$/s/ || die "can't get here"; return $_; } *verb_singular = \&noun_plural; # make function alias As you find more exceptions, your function will become increasingly convoluted. When you need to handle such morphological changes, turn to the flexible solution provided by the Lingua::EN::Inflect module from CPAN. use Lingua::EN::Inflect qw(PL classical); classical(1); # why isn't this the default? while (<DATA>) { # each line in the data for (split) { # each word on the line print "One $_, two ", PL($_), ".\n"; } } # plus one more $_ = 'secretary general'; print "One $_, two ", PL($_), ".\n"; __END__ fish fly ox species genus phylum cherub radius jockey index matrix mythos phenomenon formula That produces the following: One fish, two fish. One fly, two flies. One ox, two oxen. One species, two species. One genus, two genera. One phylum, two phyla. One cherub, two cherubim. One radius, two radii. One jockey, two jockeys. One index, two indices. One matrix, two matrices. One mythos, two mythoi. One phenomenon, two phenomena. One formula, two formulae. One secretary general, two secretaries general. Without calling classical, these lines would have come out differently than in the previous output: One phylum, two phylums. One cherub, two cherubs. One radius, two radiuses. One index, two indexes. One matrix, two matrixes. One formula, two formulas. This is just one of the many things the module can do. It also handles inflections or conjugations for other parts of speech, provides number-insensitive comparison functions, figures out whether to use a or an, and plenty more.
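If you only need the simple printf trick from the Solution in several places, wrapping it in a tiny helper avoids repeating the conditional. This is a sketch of our own; the helper name and its optional irregular-plural argument are not from the module:

```perl
#!/usr/bin/perl
# plural - minimal count-aware noun formatter
use strict;
use warnings;

# Return "N noun" or "N nouns"; an explicit plural form may be
# passed for words that change by more than an "-s".
sub plural {
    my ($count, $singular, $plural) = @_;
    $plural = $singular . "s" unless defined $plural;
    return sprintf "%d %s", $count, $count == 1 ? $singular : $plural;
}

print plural(1, "hour"), "\n";                  # 1 hour
print plural(3, "hour"), "\n";                  # 3 hours
print plural(2, "century", "centuries"), "\n";  # 2 centuries
```

Past a handful of irregular nouns, though, Lingua::EN::Inflect remains the better tool.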
2.17.4 See Also The "Conditional Operator" in perlop(1) and Chapter 3 of Programming Perl; the documentation with the CPAN module Lingua::EN::Inflect Recipe 2.18 Program: Calculating Prime Factors The following program takes one or more integer arguments and determines the prime factors. It uses Perl's native numeric representation, unless those numbers use floating-point representation and thus lose accuracy. Otherwise (or if the program's -b switch is used), it uses the standard Math::BigInt library, thus allowing for huge numbers. However, it only loads this library if necessary. That's why we use require and import instead of use, which would unconditionally load the library at compile time instead of conditionally at runtime. This is not an efficient way to crack the huge integers used for cryptographic purposes. Call the program with a list of numbers, and it will show you the prime factors of those numbers: % bigfact 8 9 96 2178 8 2**3 9 3**2 96 2**5 3 2178 2 3**2 11**2 You can give it very large numbers: % bigfact 239322000000000000000000 +239322000000000000000000 2**19 3 5**18 +39887 % bigfact 25000000000000000000000000 +25000000000000000000000000 2**24 5**26 The program is shown in Example 2-2. Example 2-2. bigfact #!/usr/bin/perl # bigfact - calculate prime factors use strict; use integer; our ($opt_b, $opt_d); use Getopt::Std; @ARGV && getopts('bd') or die "usage: $0 [-b] number ..."; load_biglib( ) if $opt_b; ARG: foreach my $orig ( @ARGV ) { my ($n, %factors, $factor); $n = $opt_b ? Math::BigInt->new($orig) : $orig; if ($n + 0 ne $n) { # don't use -w for this printf STDERR "bigfact: %s would become %s\n", $n, $n+0 if $opt_d; load_biglib( ); $n = Math::BigInt->new($orig); } printf "%-10s ", $n; # Here $sqi will be the square of $i. We will take advantage # of the fact that ($i + 1) ** 2 == $i ** 2 + 2 * $i + 1.
for (my ($i, $sqi) = (2, 4); $sqi <= $n; $sqi += 2 * $i ++ + 1) { while ($n % $i == 0) { $n /= $i; print STDERR "<$i>" if $opt_d; $factors{$i}++; } } if ($n != 1 && $n != $orig) { $factors{$n}++ } if (! %factors) { print "PRIME\n"; next ARG; } for $factor ( sort { $a <=> $b } keys %factors ) { print "$factor"; if ($factors{$factor} > 1) { print "**$factors{$factor}"; } print " "; } print "\n"; } # this simulates a use, but at runtime sub load_biglib { require Math::BigInt; Math::BigInt->import( ); # immaterial? } Chapter 3. Dates and Times It is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch. —IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2 Introduction Times and dates are important things to be able to manipulate. "How many users logged in last month?", "How many seconds should I sleep if I want to wake up at midday?", and "Has this user's password expired yet?" are common questions whose answers involve surprisingly non-obvious manipulations. Perl represents points in time as intervals, measuring seconds past a point in time called the Epoch. On Unix and many other systems, the Epoch was 00:00 Jan 1, 1970, UTC (Coordinated Universal Time).[1] [1] UTC is the preferred way to specify what used to be called GMT, or Greenwich Mean Time. When we talk about dates and times, we often interchange two different concepts: points in time (dates and times) and intervals between points in time (weeks, months, days, etc.). Epoch seconds represent intervals and points in the same units, so you can do basic arithmetic on them. However, people are not used to working with Epoch seconds. We are more used to dealing with individual year, month, day, hour, minute, and second values. Furthermore, the month can be represented by its full name or its abbreviation. The day can precede or follow the month.
Because of the difficulty of performing calculations with a variety of formats, we typically convert human-supplied strings or lists to Epoch seconds, calculate, and then convert back to strings or lists for output. Epoch seconds are an absolute number of seconds, so they don't take into account time zones or daylight saving times. When converting to or from distinct values, always consider whether the time represented is UTC or local. Use different conversion functions depending on whether you need to convert from UTC to local time or vice versa. Perl's time function returns the number of seconds that have passed since the Epoch—more or less.[2] POSIX requires that time not include leap seconds, a peculiar practice of adjusting the world's clock by a second here and there to account for the slowing down of the Earth's rotation due to tidal angular-momentum dissipation. (See section 3 of the sci.astro FAQ.) To convert Epoch seconds into distinct values for days, months, years, hours, minutes, and seconds, use the localtime and gmtime functions. In list context, these functions return a nine-element list, as shown in Table 3-1. [2] Well, less actually. To be precise, 22 seconds less as of this writing.

Table 3-1. Values (and their ranges) returned from localtime and gmtime

    Variable   Values             Range
    $sec       seconds            0-60
    $min       minutes            0-59
    $hours     hours              0-23
    $mday      day of month       1-31
    $mon       month of year      0-11, 0 == January
    $year      years since 1900   1-138 (or more)
    $wday      day of week        0-6, 0 == Sunday
    $yday      day of year        0-365
    $isdst     0 or 1             true if daylight saving is in effect

The values for seconds range from 0-60 to account for leap seconds; you never know when a spare second will leap into existence at the urging of various standards bodies. From now on, we'll refer to a list of day, month, year, hour, minute, and seconds as DMYHMS, for no better reason than that writing and reading "distinct day, month, year, hour, minute, and seconds values" is wearisome.
The abbreviation is not meant to suggest an order of return values. Perl does not return a two-digit year value. It returns the year minus 1900, which just happens to be a two-digit number through 1999. Perl doesn't intrinsically have a Year 2000 problem, unless you make one yourself. (Your computer, and Perl, may have a 2038 problem, though, if we're still using 32 bits by that time.) Add 1900 to get the full year value instead of using the construct "20$year", or your programs will refer to the year as something like "20103". We can't pin down the year value's range, because it depends on how big an integer your operating system uses for Epoch seconds. Small integers mean a small range; big (64-bit) integers mean a very big range. In scalar context, localtime and gmtime return the date and time formatted as an ASCII string: Fri Apr 11 09:27:08 1997 The standard Time::tm module provides a named interface to these values. The standard Time::localtime and Time::gmtime modules override the list-returning localtime and gmtime functions, replacing them with versions that return Time::tm objects. Compare these two pieces of code: # using arrays print "Today is day ", (localtime)[7], " of the current year.\n"; Today is day 117 of the current year. # using Time::tm objects use Time::localtime; $tm = localtime; print "Today is day ", $tm->yday, " of the current year.\n"; Today is day 117 of the current year. To go from a list to Epoch seconds, use the standard Time::Local module. It provides the functions timelocal and timegm, both of which take a nine-element list and return an integer. The list's values have the same meaning and ranges as those returned by localtime and gmtime. Epoch seconds values are limited by the size of an integer. If you have a 32-bit signed integer holding your Epoch seconds, you can only represent dates (in UTC) from Fri Dec 13 20:45:52 1901 to Tue Jan 19 03:14:07 2038 (inclusive). 
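Those two limits are simply the most negative and most positive signed 32-bit values; you can check them with gmtime (a quick sketch, assuming a Perl whose time values are wide enough that the negative offset works):

```perl
#!/usr/bin/perl
# epoch_limits - display the ends of the signed 32-bit Epoch range
use strict;
use warnings;

my $max = 2**31 - 1;   #  2147483647
my $min = -(2**31);    # -2147483648

# In scalar context, gmtime formats the moment as an ASCII string.
print scalar gmtime($min), "\n";   # Fri Dec 13 20:45:52 1901
print scalar gmtime($max), "\n";   # Tue Jan 19 03:14:07 2038
```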
By 2038, it is assumed, computers will change to use larger integers for Epoch seconds. We hope. For operations on dates outside this range, you must use another representation or work from distinct year, month, and day values. The Date::Calc and Date::Manip modules on CPAN both work from these distinct values, but be warned: years don't necessarily have 1900 subtracted from them the way the year value returned by localtime does, nor do months and weeks always start at 0. As always, consult the manpage of the appropriate module to make sure you're giving it what it expects and getting back from it what you expect. There's little more embarrassing than realizing you've calculated your company payroll based on a calendar that's 1,900 years in the past. Recipe 3.1 Finding Today's Date 3.1.1 Problem You need to find the year, month, and day values for today's date. 3.1.2 Solution Use localtime, which returns values for the current date and time if given no arguments.
You can either use localtime and extract the information you want from the list it returns:
($DAY, $MONTH, $YEAR) = (localtime)[3,4,5];
or use Time::localtime, which overrides localtime to return a Time::tm object:
use Time::localtime;
$tm = localtime;
($DAY, $MONTH, $YEAR) = ($tm->mday, $tm->mon, $tm->year);
3.1.3 Discussion Here's how you'd print the current date as "YYYY MM DD", using the non-overridden localtime:
($day, $month, $year) = (localtime)[3,4,5];
printf("The current date is %04d %02d %02d\n", $year+1900, $month+1, $day);
The current date is 2003 03 06
To extract the fields we want from the list returned by localtime, we take a list slice. We could also have written it as:
($day, $month, $year) = (localtime)[3..5];
This is how we'd print the current date as "YYYY-MM-DD" (in approved ISO 8601 fashion), using Time::localtime:
use Time::localtime;
$tm = localtime;
printf("The current date is %04d-%02d-%02d\n", $tm->year+1900, ($tm->mon)+1, $tm->mday);
The current date is 2003-03-06
The object interface might look out of place in a short program. However, when you do a lot of work with the distinct values, accessing them by name makes code much easier to understand. A more obfuscated way that does not involve temporary variables is:
printf("The current date is %04d-%02d-%02d\n", sub {($_[5]+1900, $_[4]+1, $_[3])}->(localtime));
There is also strftime from the POSIX module discussed in Recipe 3.8:
use POSIX qw(strftime);
print strftime "%Y-%m-%d\n", localtime;
The gmtime function works just as localtime does, but gives the answer in UTC instead of your local time zone. 3.1.4 See Also The localtime and gmtime functions in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the standard Time::localtime module [ Team LiB ] [ Team LiB ] Recipe 3.2 Converting DMYHMS to Epoch Seconds 3.2.1 Problem You want to convert a date, a time, or both with distinct values for day, month, year, etc. to Epoch seconds.
3.2.2 Solution Use the timelocal or timegm functions in the standard Time::Local module, depending on whether the date and time is in the current time zone or in UTC.
use Time::Local;
$TIME = timelocal($sec, $min, $hours, $mday, $mon, $year);
$TIME = timegm($sec, $min, $hours, $mday, $mon, $year);
3.2.3 Discussion The built-in function localtime converts an Epoch seconds value to distinct DMYHMS values; the timelocal subroutine from the standard Time::Local module converts distinct DMYHMS values to an Epoch seconds value. Here's an example that shows how to find Epoch seconds for a time in the current day. It gets the day, month, and year values from localtime:
# $hours, $minutes, and $seconds represent a time today,
# in the current time zone
use Time::Local;
$time = timelocal($seconds, $minutes, $hours, (localtime)[3,4,5]);
If you're passing month and year values to timelocal, it expects values with the same range as those which localtime returns. Namely, months start at 0, and years have 1900 subtracted from them. The timelocal function assumes the DMYHMS values represent a time in the current time zone. Time::Local also exports a timegm subroutine that assumes the DMYHMS values represent a time in the UTC time zone. Unfortunately, there is no convenient way to convert from a time zone other than the current local time zone or UTC. The best you can do is convert to UTC and add or subtract the time zone offset in seconds. This code illustrates both the use of timegm and how to adjust the ranges of months and years:
# $day is day in month (1-31)
# $month is month in year (1-12)
# $year is four-digit year e.g., 1967
# $hours, $minutes and $seconds represent UTC (GMT) time
use Time::Local;
$time = timegm($seconds, $minutes, $hours, $day, $month-1, $year-1900);
As explained in the introduction, Epoch seconds cannot hold values before Fri Dec 13 20:45:52 1901 or after Tue Jan 19 03:14:07 2038.
Don't convert such dates to Epoch seconds—use a Date:: module from CPAN, and do your calculations with that instead. 3.2.4 See Also The documentation for the standard Time::Local module (also in Chapter 32 of Programming Perl); convert in the other direction using Recipe 3.3 [ Team LiB ] [ Team LiB ] Recipe 3.3 Converting Epoch Seconds to DMYHMS 3.3.1 Problem You have a date and time in Epoch seconds, and you want to calculate individual DMYHMS values from it. 3.3.2 Solution Use the localtime or gmtime functions, depending on whether you want the date and time in your local time zone or in UTC.
($seconds, $minutes, $hours, $day_of_month, $month, $year, $wday, $yday, $isdst) = localtime($time);
The standard Time::localtime and Time::gmtime modules override the localtime and gmtime functions to provide named access to the individual values.
use Time::localtime;     # or Time::gmtime
$tm = localtime($TIME);  # or gmtime($TIME)
$seconds = $tm->sec;
# ...
3.3.3 Discussion The localtime and gmtime functions return strange year and month values; the year has 1900 subtracted from it, and 0 is the month value for January.
Be sure to correct the base values for year and month, as this example does:
($seconds, $minutes, $hours, $day_of_month, $month, $year, $wday, $yday, $isdst) = localtime($time);
printf("Dateline: %02d:%02d:%02d-%04d/%02d/%02d\n",
    $hours, $minutes, $seconds, $year+1900, $month+1, $day_of_month);
We could have used the Time::localtime module to avoid the temporary variables:
use Time::localtime;
$tm = localtime($time);
printf("Dateline: %02d:%02d:%02d-%04d/%02d/%02d\n",
    $tm->hour, $tm->min, $tm->sec, $tm->year+1900, $tm->mon+1, $tm->mday);
3.3.4 See Also The localtime function in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the standard Time::localtime and Time::gmtime modules; convert in the other direction using Recipe 3.2 [ Team LiB ] [ Team LiB ] Recipe 3.4 Adding to or Subtracting from a Date 3.4.1 Problem You have a date and time and want to find the date and time of some period in the future or past. 3.4.2 Solution Simply add or subtract Epoch seconds:
$when = $now + $difference;
$then = $now - $difference;
If you have distinct DMYHMS values, use the CPAN Date::Calc module. If you're doing arithmetic with days only, use Add_Delta_Days ($offset is a positive or negative integral number of days):
use Date::Calc qw(Add_Delta_Days);
($y2, $m2, $d2) = Add_Delta_Days($y, $m, $d, $offset);
If you are concerned with hours, minutes, and seconds (in other words, times as well as dates), use Add_Delta_DHMS:
use Date::Calc qw(Add_Delta_DHMS);
($year2, $month2, $day2, $h2, $m2, $s2) = Add_Delta_DHMS(
    $year, $month, $day, $hour, $minute, $second,
    $days_offset, $hour_offset, $minute_offset, $second_offset );
3.4.3 Discussion Calculating with Epoch seconds is easiest, disregarding the effort to get dates and times into and out of Epoch seconds.
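For instance, computing "one week from now" is a single addition (a minimal sketch of ours, not from the recipe):

```perl
use strict;

my $now       = time();              # current Epoch seconds
my $ONE_WEEK  = 7 * 24 * 60 * 60;    # 604,800 seconds
my $next_week = $now + $ONE_WEEK;

print "A week from now it will be ", scalar(localtime($next_week)), "\n";
```

Note that this adds exactly 604,800 seconds; across a daylight-saving change, the local wall-clock time of the result shifts by an hour, which is one reason the calendar-based Date::Calc functions can be preferable.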
This code shows how to calculate an offset (55 days, 2 hours, 17 minutes, and 5 seconds, in this case) from a given base date and time:
$birthtime = 96176750;         # 18/Jan/1973, 3:45:50 am
$interval = 5 +                # 5 seconds
            17 * 60 +          # 17 minutes
            2  * 60 * 60 +     # 2 hours
            55 * 60 * 60 * 24; # and 55 days
$then = $birthtime + $interval;
print "Then is ", scalar(localtime($then)), "\n";
Then is Wed Mar 14 06:02:55 1973
We could have used Date::Calc's Add_Delta_DHMS function and avoided the conversion to and from Epoch seconds:
use Date::Calc qw(Add_Delta_DHMS);
($year, $month, $day, $hh, $mm, $ss) = Add_Delta_DHMS(
    1973, 1, 18, 3, 45, 50,  # 18/Jan/1973, 3:45:50 am
    55, 2, 17, 5);           # 55 days, 2 hrs, 17 min, 5 sec
print "To be precise: $hh:$mm:$ss, $month/$day/$year\n";
To be precise: 6:2:55, 3/14/1973
As usual, we need to know the range of values the function expects. Add_Delta_DHMS takes a full year value—that is, one that hasn't had 1900 subtracted from it. The month value for January is 1, not 0. Date::Calc's Add_Delta_Days function expects the same kind of values:
use Date::Calc qw(Add_Delta_Days);
($year, $month, $day) = Add_Delta_Days(1973, 1, 18, 55);
print "Nat was 55 days old on: $month/$day/$year\n";
Nat was 55 days old on: 3/14/1973
3.4.4 See Also The documentation for the CPAN module Date::Calc [ Team LiB ] [ Team LiB ] Recipe 3.5 Difference of Two Dates 3.5.1 Problem You need to find the number of days between two dates or times. 3.5.2 Solution If your dates are in Epoch seconds and fall in the range Fri Dec 13 20:45:52 1901 to Tue Jan 19 03:14:07 2038 (inclusive), subtract one from the other and convert the seconds to days:
$seconds = $recent - $earlier;
If you have distinct DMYHMS values or are worried about the range limitations of Epoch seconds, use the Date::Calc module from CPAN.
It can calculate the difference between dates:
use Date::Calc qw(Delta_Days);
$days = Delta_Days( $year1, $month1, $day1, $year2, $month2, $day2);
It also calculates the difference between a pair of dates and times:
use Date::Calc qw(Delta_DHMS);
($days, $hours, $minutes, $seconds) = Delta_DHMS(
    $year1, $month1, $day1, $hour1, $minute1, $seconds1,   # earlier
    $year2, $month2, $day2, $hour2, $minute2, $seconds2);  # later
3.5.3 Discussion One problem with Epoch seconds is how to convert the large integers back to forms that people can read. The following example shows one way of converting an Epoch seconds value back to its component numbers of weeks, days, hours, minutes, and seconds:
$bree = 361535725;   # 16 Jun 1981, 4:35:25
$nat  =  96201950;   # 18 Jan 1973, 3:45:50
$difference = $bree - $nat;
print "There were $difference seconds between Nat and Bree\n";
There were 265333775 seconds between Nat and Bree
$seconds    = $difference % 60;
$difference = ($difference - $seconds) / 60;
$minutes    = $difference % 60;
$difference = ($difference - $minutes) / 60;
$hours      = $difference % 24;
$difference = ($difference - $hours) / 24;
$days       = $difference % 7;
$weeks      = ($difference - $days) / 7;
print "($weeks weeks, $days days, $hours:$minutes:$seconds)\n";
(438 weeks, 4 days, 23:49:35)
Date::Calc's functions can ease these calculations. The Delta_Days function returns the number of days between two dates. It takes the two dates as a list: year, month, day. The dates are given chronologically—earliest first.
use Date::Calc qw(Delta_Days);
@bree = (1981, 6, 16);  # 16 Jun 1981
@nat  = (1973, 1, 18);  # 18 Jan 1973
$difference = Delta_Days(@nat, @bree);
print "There were $difference days between Nat and Bree\n";
There were 3071 days between Nat and Bree
The Delta_DHMS function returns a four-element list corresponding to the number of days, hours, minutes, and seconds between the two dates you give it.
use Date::Calc qw(Delta_DHMS);
@bree = (1981, 6, 16, 4, 35, 25);  # 16 Jun 1981, 4:35:25
@nat  = (1973, 1, 18, 3, 45, 50);  # 18 Jan 1973, 3:45:50
@diff = Delta_DHMS(@nat, @bree);
print "Bree came $diff[0] days, $diff[1]:$diff[2]:$diff[3] after Nat\n";
Bree came 3071 days, 0:49:35 after Nat
3.5.4 See Also The documentation for the CPAN module Date::Calc [ Team LiB ] [ Team LiB ] Recipe 3.6 Day in a Week/Month/Year or Week Number 3.6.1 Problem You have a date, either in Epoch seconds or as distinct year, month, etc. values. You want to find out what week of the year, day of the week, day of the month, or day of the year that the date falls on. 3.6.2 Solution If you have Epoch seconds, the day of the year, day of the month, and day of the week are returned by localtime. The week of the year is easily calculated from the day of the year (but see the following discussion, as standards differ).
($MONTHDAY, $WEEKDAY, $YEARDAY) = (localtime $DATE)[3,6,7];
$WEEKNUM = int($YEARDAY / 7) + 1;
If you have distinct DMYHMS values, you can either convert them to Epoch seconds values as in Recipe 3.2 and then use the previous solution, or else use the Day_of_Week, Week_Number, and Day_of_Year functions from the CPAN module Date::Calc:
use Date::Calc qw(Day_of_Week Week_Number Day_of_Year);
# you have $year, $month, and $day
# $day is day of month, by definition.
$wday = Day_of_Week($year, $month, $day);
$wnum = Week_Number($year, $month, $day);
$dnum = Day_of_Year($year, $month, $day);
3.6.3 Discussion The Day_of_Week, Week_Number, and Day_of_Year functions all expect years that haven't had 1900 subtracted from them and months where January is 1, not 0. The return value from Day_of_Week can be 1 through 7 (corresponding to Monday through Sunday) or 0 in case of an error (an invalid date, for example).
use Date::Calc qw(Day_of_Week Week_Number Day_of_Week_to_Text);
$year  = 1981;
$month = 6;      # (June)
$day   = 16;
$wday = Day_of_Week($year, $month, $day);
print "$month/$day/$year was a ", Day_of_Week_to_Text($wday), "\n";
## see comment above
$wnum = Week_Number($year, $month, $day);
print "in week number $wnum.\n";
6/16/1981 was a Tuesday
in week number 25.
The governing standard bodies of particular countries may have rules about when the first week of the year starts. For example, in Norway the first week must have at least 4 days in it (and weeks start on Mondays). If January 1 falls on a week with 3 or fewer days, it is counted as week 52 (or 53) of the previous year. In America, the first Monday of the year is usually the start of the first workweek. Given such rules, you may have to write your own algorithm, or at least look at the %G, %L, %W, and %U formats to the UnixDate function in Date::Manip. 3.6.4 See Also The localtime function in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the CPAN module Date::Calc [ Team LiB ] [ Team LiB ] Recipe 3.7 Parsing Dates and Times from Strings 3.7.1 Problem You read in a date or time specification in an arbitrary format but need to parse that string into distinct year, month, etc. values. 3.7.2 Solution If your date is already numeric, or in a rigid and easily parsed format, use a regular expression (and possibly a hash mapping month names to numbers) to extract individual day, month, and year values, and then use the standard Time::Local module's timelocal and timegm functions to turn that into an Epoch seconds value.
use Time::Local;
# $date is "2003-02-13" (YYYY-MM-DD form).
($yyyy, $mm, $dd) = ($date =~ /(\d+)-(\d+)-(\d+)/);
# calculate epoch seconds at midnight on that day in this timezone
$epoch_seconds = timelocal(0, 0, 0, $dd, $mm-1, $yyyy);
For a more flexible solution, use the ParseDate function provided by the CPAN module Date::Manip, and then use UnixDate to extract the individual values.
use Date::Manip qw(ParseDate UnixDate);
$date = ParseDate($STRING);
if (!$date) {
    # bad date
} else {
    @VALUES = UnixDate($date, @FORMATS);
}
3.7.3 Discussion The flexible ParseDate function accepts many formats. It even converts strings such as "today", "2 weeks ago Friday", "2nd Sunday in 1996", and "last Sunday in December", plus it understands the date and time format used in mail and news headers. It returns the decoded date in its own format: a string of the form "YYYYMMDDHH:MM:SS". You could compare two such strings to compare the dates they represent, but arithmetic is difficult. We therefore use the UnixDate function to extract the year, month, and day values in a preferred format. UnixDate takes a date (as returned by ParseDate) and a list of formats. It applies each format to the string and returns the result. A format is a string describing one or more elements of the date and time and the way that the elements are to be formatted. For example, %Y is the format for the year in four-digit form. Here's an example:
use Date::Manip qw(ParseDate UnixDate);
while (<>) {
    $date = ParseDate($_);
    if (!$date) {
        warn "Bad date string: $_\n";
        next;
    } else {
        ($year, $month, $day) = UnixDate($date, "%Y", "%m", "%d");
        print "Date was $month/$day/$year\n";
    }
}
3.7.4 See Also The documentation for the CPAN module Date::Manip; we use this in Recipe 3.11 [ Team LiB ] [ Team LiB ] Recipe 3.8 Printing a Date 3.8.1 Problem You need to print a date and time shown in Epoch seconds format in human-readable form.
3.8.2 Solution Call localtime or gmtime in scalar context, which takes an Epoch seconds value and returns a string of the form Tue Jul 22 05:15:20 2003:
$STRING = localtime($EPOCH_SECONDS);
Alternatively, the strftime function in the standard POSIX module supports a more customizable output format and takes individual DMYHMS values:
use POSIX qw(strftime);
$STRING = strftime($FORMAT, $SECONDS, $MINUTES, $HOUR,
    $DAY_OF_MONTH, $MONTH, $YEAR, $WEEKDAY, $YEARDAY, $DST);
The CPAN module Date::Manip has a UnixDate routine that works like a specialized form of sprintf designed to handle dates. Pass it a Date::Manip date value. Using Date::Manip in lieu of POSIX::strftime has the advantage of not requiring a POSIX-compliant system.
use Date::Manip qw(UnixDate);
$STRING = UnixDate($DATE, $FORMAT);
3.8.3 Discussion The simplest solution is built into Perl already: the localtime function. In scalar context, it returns the string formatted in a particular way:
Wed Jul 16 23:58:36 2003
This makes for simple code, although it restricts the format of the string:
use Time::Local;
$time = timelocal(50, 45, 3, 18, 0, 73);
print "Scalar localtime gives: ", scalar(localtime($time)), "\n";
Scalar localtime gives: Thu Jan 18 03:45:50 1973
Of course, localtime requires the date and time in Epoch seconds. The POSIX::strftime function takes individual DMYHMS values plus a format and returns a string. The format is similar to a printf format: % directives specify fields in the output string. A full list of these directives is available in your system's documentation for strftime. The strftime function expects the individual values representing the date and time to be in the same range as those returned by localtime:
use POSIX qw(strftime);
use Time::Local;
$time = timelocal(50, 45, 3, 18, 0, 73);
print "strftime gives: ", strftime("%A %D", localtime($time)), "\n";
strftime gives: Thursday 01/18/73
All values are shown in their national representation when using POSIX::strftime.
So, if you run it in France, your program would print "Sunday" as "Dimanche". Be warned: Perl's interface to the POSIX function strftime assumes the date falls in the current time zone. If you don't have access to POSIX's strftime function, there's always the trusty Date::Manip CPAN module, described in Recipe 3.6.
use Date::Manip qw(ParseDate UnixDate);
$date = ParseDate("18 Jan 1973, 3:45:50");
$datestr = UnixDate($date, "%a %b %e %H:%M:%S %z %Y");  # as scalar
print "Date::Manip gives: $datestr\n";
Date::Manip gives: Thu Jan 18 03:45:50 GMT 1973
3.8.4 See Also The gmtime and localtime functions in perlfunc(1) and Chapter 29 of Programming Perl; perllocale(1); your system's strftime(3) manpage; the documentation for the POSIX module (also in Chapter 32 of Programming Perl); the documentation for the CPAN module Date::Manip [ Team LiB ] [ Team LiB ] Recipe 3.9 High-Resolution Timers 3.9.1 Problem You need to measure time with a finer granularity than the full seconds that time returns. 3.9.2 Solution The Time::HiRes module, which is included standard starting with the v5.8 release of Perl, encapsulates this functionality for most systems:
use Time::HiRes qw(gettimeofday);
$t0 = gettimeofday( );
## do your operation here
$t1 = gettimeofday( );
$elapsed = $t1 - $t0;
# $elapsed is a floating point value, representing number
# of seconds between $t0 and $t1
3.9.3 Discussion Here's some code that uses Time::HiRes to time how long the user takes to press the Return key:
use Time::HiRes qw(gettimeofday);
print "Press return when ready: ";
$before = gettimeofday( );
$line = <STDIN>;
$elapsed = gettimeofday( ) - $before;
print "You took $elapsed seconds.\n";
Press return when ready:
You took 0.228149 seconds.
The module's gettimeofday function returns a two-element list representing seconds and microseconds when called in list context, or a single floating-point number combining the two when called in scalar context.
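The two calling contexts can be seen side by side (a short sketch of ours, not from the recipe):

```perl
use strict;
use Time::HiRes qw(gettimeofday);

my ($sec, $usec) = gettimeofday();   # list context: whole seconds and microseconds
my $float        = gettimeofday();   # scalar context: one floating-point number

printf "%d seconds + %d microseconds, or about %.6f\n", $sec, $usec, $float;
```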
You can also import its time function to replace the standard core version by that name; this always acts like scalar gettimeofday. The module also provides usleep and ualarm functions, which are alternate versions of the standard Perl sleep and alarm functions that understand granularities of microseconds instead of just seconds. They take arguments in microseconds; alternatively, you can import the module's sleep and alarm functions, which take floating-point arguments in seconds, to replace the standard versions, which take integer arguments in seconds. For access to your system's low-level itimer routines (if you have them), setitimer and getitimer are also provided. If your system doesn't support that module, you might try to poke around by hand using syscall. Compare Time::HiRes to the equivalent syscall code. (This example is included principally so that you can see an example of Perl's abstruse and archaic syscall function.)
require 'sys/syscall.ph';
# initialize the structures returned by gettimeofday
$TIMEVAL_T = "LL";
$done = $start = pack($TIMEVAL_T, (0,0));
# prompt
print "Press return when ready: ";
# read the time into $start
syscall(&SYS_gettimeofday, $start, 0) != -1 || die "gettimeofday: $!";
# read a line
$line = <>;
# read the time into $done
syscall(&SYS_gettimeofday, $done, 0) != -1 || die "gettimeofday: $!";
# expand the structure
@start = unpack($TIMEVAL_T, $start);
@done  = unpack($TIMEVAL_T, $done);
# fix microseconds
for ($done[1], $start[1]) { $_ /= 1_000_000 }
# calculate time difference
$delta_time = sprintf "%.4f", ($done[0] + $done[1]) - ($start[0] + $start[1]);
print "That took $delta_time seconds\n";
Press return when ready:
That took 0.3037 seconds
It's longer because it's doing system calls in Perl, whereas Time::HiRes does them in C providing a single function. It's complex because directly accessing system calls peculiar to your operating system requires understanding the underlying C structures that the system call takes and returns.
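To see just the structure handling in isolation, here's a small sketch of ours that packs and unpacks the two unsigned longs of a timeval with the same "LL" template (the sample values are invented for illustration):

```perl
use strict;

my $TIMEVAL_T = "LL";   # two unsigned 32-bit longs: seconds, microseconds

# Pretend a gettimeofday call filled this buffer:
my $buf = pack($TIMEVAL_T, 1_042_000_000, 250_000);

# Unpack recovers the original C struct members:
my ($sec, $usec) = unpack($TIMEVAL_T, $buf);
printf "%.4f\n", $sec + $usec / 1_000_000;   # 1042000000.2500
```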
Some programs that come with the Perl distribution try to automatically calculate the formats to pack and unpack for you, if fed the appropriate C header file. In the example, sys/syscall.ph is a Perl library file generated with h2ph, which converts the sys/syscall.h header file into sys/syscall.ph, defining (among other things) &SYS_gettimeofday as a subroutine that returns the system call number of gettimeofday. Here's another example of Time::HiRes, showing how you could use it to benchmark a sort (if you didn't care to use the standard Benchmark module):
use Time::HiRes qw(gettimeofday);
# take mean sorting time
$size            = 2000;
$number_of_times = 100;
$total_time      = 0;
for ($i = 0; $i < $number_of_times; $i++) {
    my (@array, $j, $begin, $time);
    # populate array
    @array = ( );
    for ($j = 0; $j < $size; $j++) { push(@array, rand) }
    # sort it
    $begin = gettimeofday;
    @array = sort { $a <=> $b } @array;
    $time  = gettimeofday - $begin;
    $total_time += $time;
}
printf "On average, sorting %d random numbers takes %.5f seconds\n",
    $size, ($total_time/$number_of_times);
On average, sorting 2000 random numbers takes 0.01033 seconds
3.9.4 See Also The documentation for the Time::HiRes and Benchmark modules; the syscall function in perlfunc(1) and Chapter 29 of Programming Perl; your system's syscall(2) manpage [ Team LiB ] [ Team LiB ] Recipe 3.10 Short Sleeps 3.10.1 Problem You need to sleep for less than a second. 3.10.2 Solution Use the select( ) function, if your system supports it:
select(undef, undef, undef, $time_to_sleep);
Some systems don't support a four-argument select. The Time::HiRes module provides a sleep function that takes a floating-point number of seconds:
use Time::HiRes qw(sleep);
sleep($time_to_sleep);
3.10.3 Discussion Here's an example of select. It's a simpler version of the program in Recipe 1.6. Think of it as your very own 300-baud terminal.
while (<>) {
    select(undef, undef, undef, 0.25);
    print;
}
Using Time::HiRes, we'd write it as:
use Time::HiRes qw(sleep);
while (<>) {
    sleep(0.25);
    print;
}
3.10.4 See Also The documentation for the CPAN modules Time::HiRes and Benchmark; the sleep and select functions in perlfunc(1) and Chapter 29 of Programming Perl; we use the select function for short sleeps in the slowcat program in Recipe 1.6 [ Team LiB ] [ Team LiB ] Recipe 3.11 Program: hopdelta Have you ever wondered why it took so long for someone's mail to get to you? With postal mail, you can't trace how long each intervening post office let your letter gather dust in their back office. But with electronic mail, you can. The message carries in its header Received: lines showing when each intervening mail transport agent along the way got the message. The dates in the headers are hard to read. You have to read them backwards, bottom to top. They are written in many varied formats, depending on the whim of each transport agent. Worst of all, each date is written in its own local time zone. It's hard to eyeball "Tue, 26 May 1998 23:57:38 -0400" and "Wed, 27 May 1998 05:04:03 +0100" and realize these two dates are only 6 minutes and 25 seconds apart. The ParseDate and DateCalc functions in the Date::Manip module from CPAN can help with this:
use Date::Manip qw(ParseDate DateCalc);
$d1 = ParseDate("Sun, 09 Mar 2003 23:57:38 -0400");
$d2 = ParseDate("Mon, 10 Mar 2003 05:04:03 +0100");
print DateCalc($d1, $d2);
+0:0:0:0:0:6:25
That's a nice format for a program to read, but it's still not what the casual reader wants to see. The hopdelta program, shown in Example 3-1, takes a mailer header and tries to analyze the deltas (differences) between each hop (mail stop). Its output is shown in the local time zone. Example 3-1. hopdelta
#!/usr/bin/perl
# hopdelta - feed mail header, produce lines
# showing delay at each hop.
use strict;
use Date::Manip qw(ParseDate UnixDate);

# print header; this should really use format/write due to
# printf complexities
printf "%-20.20s %-20.20s %-20.20s %s\n",
    "Sender", "Recipient", "Time", "Delta";

$/ = '';        # paragraph mode
$_ = <>;        # read header
s/\n\s+/ /g;    # join continuation lines

# calculate when and where this started
my($start_from) = /^From.*\@([^\s>]*)/m;
my($start_date) = /^Date:\s+(.*)/m;
my $then = getdate($start_date);
printf "%-20.20s %-20.20s %s\n", 'Start', $start_from, fmtdate($then);
my $prevfrom = $start_from;

# now process the headers lines from the bottom up
for (reverse split(/\n/)) {
    my ($delta, $now, $from, $by, $when);
    next unless /^Received:/;
    s/\bon (.*?) (id.*)/; $1/s;          # qmail header, I think
    unless (($when) = /;\s+(.*)$/) {     # where the date falls
        warn "bad received line: $_";
        next;
    }
    ($from) = /from\s+(\S+)/;
    ($from) = /\((.*?)\)/ unless $from;  # some put it here
    $from =~ s/\)$//;                    # someone was too greedy
    ($by) = /by\s+(\S+\.\S+)/;           # who sent it on this hop
    # now random mungings to get their string parsable
    for ($when) {
        s/ (for|via) .*$//;
        s/([+-]\d\d\d\d) \(\S+\)/$1/;
        s/id \S+;\s*//;
    }
    next unless $now = getdate($when);   # convert to Epoch
    $delta = $now - $then;
    printf "%-20.20s %-20.20s %s ", $from, $by, fmtdate($now);
    $prevfrom = $by;
    puttime($delta);
    $then = $now;
}
exit;

# convert random date strings into Epoch seconds
sub getdate {
    my $string = shift;
    $string =~ s/\s+\(.*\)\s*$//;        # remove nonstd tz
    my $date = ParseDate($string);
    my $epoch_secs = UnixDate($date, "%s");
    return $epoch_secs;
}

# convert Epoch seconds into a particular date string
sub fmtdate {
    my $epoch = shift;
    my($sec,$min,$hour,$mday,$mon,$year) = localtime($epoch);
    return sprintf "%02d:%02d:%02d %04d/%02d/%02d",
        $hour, $min, $sec,
        $year + 1900, $mon + 1, $mday;
}

# take seconds and print in pleasant-to-read format
sub puttime {
    my($seconds) = shift;
    my($days, $hours, $minutes);
    $days    = pull_count($seconds, 24 * 60 * 60);
    $hours   = pull_count($seconds, 60 * 60);
    $minutes = pull_count($seconds, 60);
    put_field('s', $seconds);
    put_field('m', $minutes);
    put_field('h', $hours);
    put_field('d', $days);
    print "\n";
}

# usage: $count = pull_count(seconds, amount)
# remove from seconds the amount quantity, altering caller's version.
# return the integral number of those amounts so removed.
sub pull_count {
    my($answer) = int($_[0] / $_[1]);
    $_[0] -= $answer * $_[1];
    return $answer;
}

# usage: put_field(char, number)
# output number field in 3-place decimal format, with trailing char
# suppress output unless char is 's' for seconds
sub put_field {
    my ($char, $number) = @_;
    printf " %3d%s", $number, $char if $number || $char eq 's';
}
Sender               Recipient            Time                 Delta
Start                                     09:17:12 1998/05/23
                                          09:20:56 1998/05/23  44s  3m
                                          09:20:58 1998/05/23   2s
[ Team LiB ] [ Team LiB ] Chapter 4. Arrays Works of art, in my opinion, are the only objects in the material universe to possess internal order, and that is why, though I don't believe that only art matters, I do believe in Art for Art's sake. —E.M. Forster [ Team LiB ] [ Team LiB ] Introduction If you are asked about the contents of your pockets, or the names of the first three Greek letters, or how to get to the highway, you recite a list: you name one thing after another in a particular order. Lists are part of your conception of the world. With Perl's powerful list- and array-handling primitives, you can translate this world view directly into code. In this chapter, we'll use the terms list and array as the Perl language thinks of them. Take ("alpha", "beta", "gamma"); that's a list of the names of the first three Greek letters, in order. To store that list into a variable, use an array, as in @greeks = ("alpha", "beta", "gamma"). Both are ordered groups of scalar values; the difference is that an array is a named variable, one whose array length can be directly changed, whereas a list is a more ephemeral notion.
You might think of an array as a variable and a list as the values it contains. This distinction may seem arbitrary, but operations that modify the length of these groupings (like push and pop) require a proper array and not merely a list. Think of the difference between $a and 4. You can say $a++ but not 4++. Likewise, you can say pop(@a) but not pop (1,2,3). The most important thing to glean from this is that Perl's lists and arrays are both ordered groupings of scalars. Operators and functions that work on lists or arrays are designed to provide faster or more convenient access to the elements than manual access would provide. Since few actually deal with modifying the array's length, you can usually use arrays and lists interchangeably. You can't use nested parentheses to create a list of lists. If you try that in Perl, your lists get flattened, meaning that both these lines are equivalent:
@nested = ("this", "that", "the", "other");
@nested = ("this", "that", ("the", "other"));
Why doesn't Perl (usefully) just support nested lists directly? The reasons are partly historical, but flat lists also make it easy for operations (like print or sort) to work on arbitrarily long lists of arbitrary contents. What happens if you want a more complex data structure, such as an array of arrays or an array of hashes? Remember that scalars aren't restricted to containing just numbers or strings; they can also hold references. Complex (multilevel) data structures in Perl are always put together using references. Therefore, what appear to be "two-dimensional arrays" or "arrays of arrays" are always implemented as arrays of array references, in the same way that two-dimensional arrays in C can be arrays of pointers to arrays. Most recipes in this chapter don't care what you keep in your arrays; for example, the problem of merging two arrays is the same whether the arrays contain strings, numbers, or references.
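To make the flattening rule and the reference-based alternative discussed above concrete (a short sketch of ours, not from the book):

```perl
use strict;

my @flat   = ("this", "that", ("the", "other"));   # flattens into four elements
my @nested = ("this", "that", ["the", "other"]);   # three elements; last is an array ref

print scalar(@flat), "\n";      # 4
print scalar(@nested), "\n";    # 3
print $nested[2][1], "\n";      # other
```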
Some problems are intrinsically tied to the contents of your arrays; recipes for those are in Chapter 11. This chapter's recipes deal with generic arrays. Let's have some more terminology. The scalar items in an array or list are called elements, which you access by specifying their position, or index. Indices in Perl start at 0. So, given this list:
@greeks = ( "alpha", "beta", "gamma" );
"alpha" is in the first position, but you'd access it as $greeks[0]. "beta" is in the second position, but you'd access it as $greeks[1]. This structure is doubly justified: the contrariness of computers, whose first representable number is 0, and the contrariness of language designers, who chose 0 because it is an offset into the array, not the ordinal number of the element. [ Team LiB ] [ Team LiB ] Recipe 4.1 Specifying a List in Your Program 4.1.1 Problem You want to include a list in your program. This is how you initialize arrays. 4.1.2 Solution You can write out a comma-separated list of elements:
@a = ("quick", "brown", "fox");
If you have a lot of single-word elements, use the qw( ) operator:
@a = qw(Meddle not in the affairs of wizards.);
If you have a lot of multiword elements, use a here document and extract lines:
@lines = (<< "END_OF_HERE_DOC" =~ /^\s*(.+)/gm);
    I sit beside the fire and think
    of all that I have seen,
    of meadow-flowers and butterflies
    and summers that have been;
END_OF_HERE_DOC
4.1.3 Discussion The first technique is the one most commonly used, often because only small arrays are normally initialized as program literals.
Initializing a large array would fill your program with values and make it hard to read, so such arrays either tend to be initialized in a separate library file (see Chapter 12), or else have their values read in from a file:

@bigarray = ( );
open(FH, "<", "myinfo")     or die "Couldn't open myinfo: $!";
while (<FH>) {
    chomp;
    push(@bigarray, $_);
}
close(FH);

The second technique uses qw( ), one of several pseudo-functions in Perl used for quoting without having to resort to actual quotation marks. This one splits its string argument on whitespace to produce a list of words, where "words" in this instance means strings that don't contain any whitespace. The initial argument is not subject to interpolation of variables or (most) backslash escape sequences.

@banner = ('Costs', 'only', '$4.95');
@banner = qw(Costs only $4.95);
@banner = split(' ', 'Costs only $4.95');

You can use qw( ) only when each whitespace-separated argument is to be a distinct element in the return list. Be careful not to give Columbus four ships instead of three:

@ships = qw(Niña Pinta Santa María);            # WRONG
@ships = ('Niña', 'Pinta', 'Santa María');      # right

The third solution takes a here document, which is a single, multiline string, and applies a global pattern match to that string. The pattern /^\s*(.+)/ says to skip any whitespace at the start of the line, then capture everything through the end of each line. The /g modifier means to apply that match globally, and the /m modifier says to permit ^ to match not just at the beginning of the string, but also immediately after a newline, which, in a multiline string, is just what you need.
Applying that technique to the ships example yields:

@ships = ( << "END_OF_FLOTILLA" =~ /^\s*(.+)/gm);
    Niña
    Pinta
    Santa María
END_OF_FLOTILLA

4.1.4 See Also

The "List Value Constructors" section of perldata(1); the "List Values and Arrays" section of Chapter 2 of Programming Perl; the "Quote and Quote-Like Operators" section of perlop(1); the s/// operator in perlop(1) and Chapter 5 of Programming Perl

Recipe 4.2 Printing a List with Commas

4.2.1 Problem

You'd like to print out a list containing an unknown number of elements, placing an "and" before the last element and commas between each element if there are more than two.

4.2.2 Solution

Use this function, which returns the formatted string:

sub commify_series {
    (@_ == 0) ? ''                  :
    (@_ == 1) ? $_[0]               :
    (@_ == 2) ? join(" and ", @_)   :
                join(", ", @_[0 .. ($#_-1)], "and $_[-1]");
}

4.2.3 Discussion

It often looks odd to print out arrays:

@array = ("red", "yellow", "green");
print "I have ", @array, " marbles.\n";
print "I have @array marbles.\n";
I have redyellowgreen marbles.
I have red yellow green marbles.

What you really want it to say is, "I have red, yellow, and green marbles". The function given in the solution generates strings in that format. The word "and" is placed between the last two list elements. If there are more than two elements in the list, a comma is placed between every element.

Example 4-1 gives a complete demonstration of the function, with one addition: if any element in the list already contains a comma, a semicolon is used for the separator character instead.

Example 4-1.
commify_series

#!/usr/bin/perl -w
# commify_series - show proper comma insertion in list output

# @lists is an array of (references to anonymous) arrays
@lists = (
    [ 'just one thing' ],
    [ qw(Mutt Jeff) ],
    [ qw(Peter Paul Mary) ],
    [ 'To our parents', 'Mother Theresa', 'God' ],
    [ 'pastrami', 'ham and cheese', 'peanut butter and jelly', 'tuna' ],
    [ 'recycle tired, old phrases', 'ponder big, happy thoughts' ],
    [ 'recycle tired, old phrases', 'ponder big, happy thoughts',
      'sleep and dream peacefully' ],
);

foreach $aref (@lists) {
    print "The list is: " . commify_series(@$aref) . ".\n";
}

# demo for single list
@list = qw(one two three);
print "The last list is: " . commify_series(@list) . ".\n";

sub commify_series {
    my $sepchar = grep(/,/ => @_) ? ";" : ",";
    (@_ == 0) ? ''                  :
    (@_ == 1) ? $_[0]               :
    (@_ == 2) ? join(" and ", @_)   :
                join("$sepchar ", @_[0 .. ($#_-1)], "and $_[-1]");
}

Here's the output from the program:

The list is: just one thing.
The list is: Mutt and Jeff.
The list is: Peter, Paul, and Mary.
The list is: To our parents, Mother Theresa, and God.
The list is: pastrami, ham and cheese, peanut butter and jelly, and tuna.
The list is: recycle tired, old phrases and ponder big, happy thoughts.
The list is: recycle tired, old phrases; ponder big, happy thoughts; and sleep and dream peacefully.
The last list is: one, two, and three.

As you see, we don't follow the ill-advised practice of omitting the final comma from a series under any circumstances. To do so introduces unfortunate ambiguities and unjustifiable exceptions. The examples shown would have claimed that we were the offspring of Mother Teresa and God, and would have had us eating sandwiches made of jelly and tuna fish mixed together atop the peanut butter.
4.2.4 See Also

Fowler's Modern English Usage; we explain the nested list syntax in Recipe 11.1; the grep function in perlfunc(1) and Chapter 29 of Programming Perl; the conditional operator ("?:") is discussed in perlop(1) and in the "Conditional Operator" section of Chapter 3 of Programming Perl

Recipe 4.3 Changing Array Size

4.3.1 Problem

You want to enlarge or truncate an array. For example, you might truncate an array of employees that's already sorted by salary to list the five highest-paid employees. Or, if you know how big your array will get and that it will grow piecemeal, it's more efficient to grab memory for it in one step by enlarging just once than to keep pushing values onto the end.

4.3.2 Solution

Assign to $#ARRAY:

# grow or shrink @ARRAY
$#ARRAY = $NEW_LAST_ELEMENT_INDEX_NUMBER;

Assigning to an element past the end automatically extends the array:

$ARRAY[$NEW_LAST_ELEMENT_INDEX_NUMBER] = $VALUE;

4.3.3 Discussion

$#ARRAY is the number of the last valid index in @ARRAY. If we assign it a number smaller than its current value, we truncate the array. Truncated elements are lost forever. If we assign $#ARRAY a number larger than its current value, the array grows. New elements have the undefined value.

$#ARRAY is not @ARRAY, though. Although $#ARRAY is the last valid index in the array, @ARRAY (in scalar context, as when treated as a number) is the number of elements. $#ARRAY is one less than @ARRAY because array indices start at 0.

Here's some code that uses both. We have to say scalar @array in the print because Perl gives list context to (most) functions' arguments, but we want @array in scalar context.

sub what_about_that_array {
    print "The array now has ", scalar(@people), " elements.\n";
    print "The index of the last element is $#people.\n";
    print "Element #3 is `$people[3]'.\n";
}
@people = qw(Crosby Stills Nash Young);
what_about_that_array( );

prints:

The array now has 4 elements.
The index of the last element is 3.
Element #3 is `Young'.

whereas:

$#people--;
what_about_that_array( );

prints:

The array now has 3 elements.
The index of the last element is 2.
Element #3 is `'.

Element #3 disappeared when we shortened the array. If we'd turned on warnings (either with the -w command-line option to Perl or with use warnings inside the program), Perl would also have warned "use of uninitialized value" because $people[3] is undefined.

$#people = 10_000;
what_about_that_array( );

prints:

The array now has 10001 elements.
The index of the last element is 10000.
Element #3 is `'.

The "Young" element is now gone forever. Instead of assigning to $#people, we could have said:

$people[10_000] = undef;

although this isn't exactly the same. If you have a three-element array, as in:

@colors = qw(red blue green);

and you undef its last element:

undef $colors[2];   # green is gone

you still have a three-element array; its last element is just undefined. If you pop the array, either via the function or manually by changing $#colors:

$last_color = $colors[ $#colors-- ];

then the array grows shorter by one element.

Perl arrays are not sparse. In other words, if you have a 10,000th element, you must have the 9,999 other elements, too. They may be undefined, but they still take up memory. For this reason, $array[time( )], or any other construct that uses a very large integer as an array index, is a really bad idea. Use a hash instead.

4.3.4 See Also

The discussion of the $#ARRAY notation in perldata(1), also explained in the "List Values and Arrays" section of Chapter 2 of Programming Perl

Recipe 4.4 Implementing a Sparse Array

4.4.1 Problem

An array with large, unoccupied expanses between occupied elements wastes memory. How do you reduce that overhead?

4.4.2 Solution

Use a hash instead of an array.

4.4.3 Discussion

If you assign to the millionth element of an array, Perl allocates a million and one slots to store scalars.
Only the last element contains interesting data, leaving earlier ones each set to undef at a cost of four (or more) bytes per unoccupied slot.

In recent versions of Perl, if you grow an array by assigning either past the end or directly to $#ARRAY, you can distinguish these implicit undefs from those that would result from assigning undef there by using exists instead of defined, just as you would with a hash.

$#foo = 5;
@bar = ( (undef) x 5 );
printf "foo element 3 is%s defined\n",  defined $foo[3] ? "" : "n't";
printf "foo element 3 does%s exist\n",  exists  $foo[3] ? "" : "n't";
printf "bar element 3 is%s defined\n",  defined $bar[3] ? "" : "n't";
printf "bar element 3 does%s exist\n",  exists  $bar[3] ? "" : "n't";

foo element 3 isn't defined
foo element 3 doesn't exist
bar element 3 isn't defined
bar element 3 does exist

However, you still waste a lot of space. That's because Perl's array implementation reserves a contiguous vector, one for each element up to the highest occupied position.

$real_array[ 1_000_000 ] = 1;   # costs 4+ megabytes

A hash works differently: you pay only for what you really use, not for unoccupied positions. Although a hash element costs somewhat more than an array element because you need to store both the value and its key, with sparse arrays, the savings can be astonishing.

$fake_array{ 1_000_000 } = 1;   # costs 28 bytes

What's the trade-off? Because a hash's keys aren't ordered, a little more work is needed to sort the numeric keys so you can handle their values in the same order as you would if they were stored as a real array. With an array, you'd just do this to process elements in index order:

foreach $element ( @real_array ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( 0 ..
$#real_array ) {
    # do something with $real_array[$idx]
}

Using a hash representation, you should instead do either this to process elements in index order:

foreach $element ( @fake_array{ sort {$a <=> $b} keys %fake_array } ) {
    # do something with $element
}

or this to process indices in ascending order:

foreach $idx ( sort {$a <=> $b} keys %fake_array ) {
    # do something with $fake_array{$idx}
}

If you don't care about handling elements in a particular order, however, you don't need to go through all that. Just process the values according to their internal order, either like this:

foreach $element ( values %fake_array ) {
    # do something with $element
}

or like this:

# process indices in internal hash order
foreach $idx ( keys %fake_array ) {
    # do something with $fake_array{$idx}
}

If you're determined to use an array, two fairly specialized cases occasionally arise in which you can save substantial amounts of memory by using an alternate storage scheme. Both cases also apply to arrays that are densely populated, not just those that are mostly empty.

The first case shows up when you grow an array by repeatedly appending new elements until its subscripts become large. Because of how Perl reallocates memory for growing arrays, this can use up to four times the memory you really need. If you happen to know how big the array will (or might) eventually become, you can avoid this reallocation overhead either by storing the large subscripts first instead of the small ones:

for ($i = 10_000; $i >= 0; $i--) { $real_array[$i] = 1 }

or by presizing the array by assigning to the special $#ARRAY notation:

$#real_array = 10_000;

The second special case comes up when each array element holds nothing but a single one-bit value—essentially either a true or a false. For example, suppose you are keeping track of numbered USENET news articles, and you only need to know whether a given article number has been read.
For situations like this, use a bit vector instead of a real array:

my $have_read = '';
for ($i = 10_000; $i >= 0; $i--) { vec($have_read, $i, 1) = 1 }

Then you can check to see whether a given article has been read this way:

if (vec($have_read, $artno, 1)) { .... }

4.4.4 See Also

The vec function in perlfunc(1) and in Chapter 29 of Programming Perl

Recipe 4.5 Iterating Over an Array

4.5.1 Problem

You want to repeat a procedure for every element in a list. Often you use an array to collect information you're interested in; for instance, login names of users who have exceeded their disk quota. When you finish collecting the information, you want to process it by doing something with every element in the array. In the disk quota example, you might send each user a stern mail message.

4.5.2 Solution

Use a foreach loop:

foreach $item (LIST) {
    # do something with $item
}

4.5.3 Discussion

Let's say we've used @bad_users to compile a list of users who are over their allotted disk quotas. To call some complain subroutine for each user, we'd use:

foreach $user (@bad_users) {
    complain($user);
}

Rarely is this recipe so simply applied. Instead, we often use functions to generate the list:

foreach $var (sort keys %ENV) {
    print "$var=$ENV{$var}\n";
}

Here we're using sort and keys to build a sorted list of environment variable names. If you use the list more than once, you'll obviously keep it around by saving in an array. But for one-shot processing, it's often tidier to process the list directly.

Not only can we add complexity to this formula by building up the list in the foreach, we can also add complexity by doing more work inside the code block. A common application of foreach is to gather information on every element of a list and then, based on that information, decide whether to do something.
For instance, returning to the disk quota example:

foreach $user (@all_users) {
    $disk_space = get_usage($user);     # find out how much disk space in use
    if ($disk_space > $MAX_QUOTA) {     # if it's more than we want ...
        complain($user);                # ... then object vociferously
    }
}

More complicated program flow is possible. The code can call last to jump out of the loop, next to move on to the next element, or redo to jump back to the first statement inside the block. Use these to say "no point continuing with this one, I know it's not what I'm looking for" (next), "I've found what I'm looking for, there's no point in my checking the rest" (last), or "I've changed some things, I'd better run this loop's calculations again" (redo).

The variable set to each value in the list is called a loop variable or iterator variable. If no iterator variable is supplied, the global variable $_ is used. $_ is the default variable for many of Perl's string, list, and file functions. In brief code blocks, omitting $_ improves readability. (In long ones, though, too much implicit use hampers readability.) For example:

foreach (`who`) {
    if (/tchrist/) {
        print;
    }
}

or combining with a while loop:

while (<>) {            # $_ is set to the line just read
    chomp;              # $_ has a trailing \n removed, if it had one
    foreach (split) {   # $_ is split on whitespace, into @_
                        # then $_ is set to each chunk in turn
        $_ = reverse;   # the characters in $_ are reversed
        print;          # $_ is printed
    }
}

Perhaps all these uses of $_ are starting to make you nervous. In particular, the foreach and the while both give values to $_. You might fear that at the end of the foreach, the full line as read into $_ with <> would be forever gone. Fortunately, your fears would be unfounded, at least in this case. Perl won't permanently clobber $_'s old value, because the foreach's iterator variable (here, $_) is automatically preserved during the loop. It saves away any old value on entry and restores it upon exit.

However, there is some cause for concern.
If the while had been the inner loop and the foreach the outer one, your fears would have been realized. Unlike a foreach loop, the while (<>) construct clobbers the value of the global $_ without first localizing it! So any routine—or block for that matter—that uses this construct with $_ should declare local $_. If a lexical variable (one declared with my) is in scope, the temporary variable will be lexically scoped, private to that loop. Otherwise, it will be a dynamically scoped global variable. To avoid strange magic at a distance, write this more obviously and more clearly as:

foreach my $item (@array) {
    print "i = $item\n";
}

The foreach looping construct has another feature: each time through the loop, the iterator variable becomes not a copy of but rather an alias for the current element. This means that when you change that iterator variable, you really change each element in the list:

@array = (1,2,3);
foreach $item (@array) {
    $item--;
}
print "@array\n";
0 1 2

# multiply everything in @a and @b by seven
@a = ( .5, 3 );
@b = ( 0, 1 );
foreach $item (@a, @b) { $item *= 7; }
print "@a @b\n";
3.5 21 0 7

You can't change a constant, though, so this is illegal:

foreach $n (1, 2, 3) {
    $n **= 2;
}

This aliasing means that using a foreach loop to modify list values is both more readable and faster than the equivalent code using a three-part for loop and explicit indexing would be. This behavior is a feature, not a bug, that was introduced by design. If you didn't know about it, you might accidentally change something. Now you know about it.

For example, to trim leading and trailing whitespace in a hash, we take advantage of how the values function works: the elements of its return list really are the values of the hash, and changing these changes the original hash. Because we use s/// directly on the list returned by the values function without copying these into a variable, we change the real hash itself.
# trim whitespace in the scalar, the array, and in all
# the values in the hash
foreach ($scalar, @array, values %hash) {
    s/^\s+//;
    s/\s+$//;
}

For reasons hearkening back to the equivalent construct in the Unix Bourne shell, the for and foreach keywords are interchangeable:

for $item (@array) {    # same as foreach $item (@array)
    # do something
}

for (@array) {          # same as foreach $_ (@array)
    # do something
}

This style often indicates that its author writes or maintains shell scripts, perhaps for Unix system administration. As such, their life is probably hard enough, so don't speak too harshly of them. Remember, TMTOWTDI. This is just one of those ways.

If you aren't fluent in Bourne shell, you might find it clearer to express "for each $thing in this @list" by saying foreach, to make your code look less like the shell and more like English. (But don't try to make your English look like your code!)

4.5.4 See Also

The "For Loops," "Foreach Loops," and "Loop Control" sections of perlsyn(1) and Chapter 4 of Programming Perl; the "Temporary Values via local" section of perlsub(1); the "Scoped Declarations" section of Chapter 4 of Programming Perl; we talk about local in Recipe 10.13; we talk about my in Recipe 10.2

Recipe 4.6 Iterating Over an Array by Reference

4.6.1 Problem

You have a reference to an array, and you want to use a loop to work with the array's elements.

4.6.2 Solution

Use foreach or for to loop over the dereferenced array:

# iterate over elements of array in $ARRAYREF
foreach $item (@$ARRAYREF) {
    # do something with $item
}

for ($i = 0; $i <= $#$ARRAYREF; $i++) {
    # do something with $ARRAYREF->[$i]
}

4.6.3 Discussion

The solutions assume you have a scalar variable containing the array reference. This lets you do things like this:

@fruits = ( "Apple", "Blackberry" );
$fruit_ref = \@fruits;
foreach $fruit (@$fruit_ref) {
    print "$fruit tastes good in a pie.\n";
}
Apple tastes good in a pie.
Blackberry tastes good in a pie.
We could have rewritten the foreach loop as a for loop like this:

for ($i=0; $i <= $#$fruit_ref; $i++) {
    print "$fruit_ref->[$i] tastes good in a pie.\n";
}

Frequently, though, the array reference is the result of a more complex expression. Use the @{ EXPR } notation to turn the result of the expression back into an array:

$namelist{felines} = \@rogue_cats;
foreach $cat ( @{ $namelist{felines} } ) {
    print "$cat purrs hypnotically.\n";
}
print "--More--\nYou are controlled.\n";

Again, we can replace the foreach with a for loop:

for ($i=0; $i <= $#{ $namelist{felines} }; $i++) {
    print "$namelist{felines}[$i] purrs hypnotically.\n";
}

4.6.4 See Also

perlref(1) and perllol(1); Chapter 8 of Programming Perl; Recipe 11.1; Recipe 4.5

Recipe 4.7 Extracting Unique Elements from a List

4.7.1 Problem

You want to eliminate duplicate values from a list, such as when you build the list from a file or from the output of another command. This recipe is equally applicable to removing duplicates as they occur in input and to removing duplicates from an array you've already populated.

4.7.2 Solution

Use a hash to record which items have been seen, then keys to extract them. You can use Perl's idea of truth to shorten and speed up your code.

Straightforward

%seen = ( );
@uniq = ( );
foreach $item (@list) {
    unless ($seen{$item}) {
        # if we get here, we have not seen it before
        $seen{$item} = 1;
        push(@uniq, $item);
    }
}

Faster

%seen = ( );
foreach $item (@list) {
    push(@uniq, $item) unless $seen{$item}++;
}

Similar but with user function

%seen = ( );
foreach $item (@list) {
    some_func($item) unless $seen{$item}++;
}

Faster but different

%seen = ( );
foreach $item (@list) {
    $seen{$item}++;
}
@uniq = keys %seen;

Faster and even more different

%seen = ( );
@uniq = grep { ! $seen{$_}++ } @list;

4.7.3 Discussion

The question at the heart of the matter is "Have I seen this element before?" Hashes are ideally suited to such lookups.
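As a quick, self-contained check of that last grep idiom before walking through each variant (the sample data here is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @list = qw(red yellow red green yellow blue);

# Keep an element only the first time it's seen: the postfix ++
# returns the old count (false on first sight), then increments.
my %seen;
my @uniq = grep { ! $seen{$_}++ } @list;

print "@uniq\n";   # red yellow green blue -- first-seen order preserved
```

Note that the same %seen also counts occurrences as a side effect; $seen{red} is 2 afterward.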
The first technique builds up the array of unique values as we go along, using a hash to record whether something is already in the array.

The second technique is the most natural way to write this sort of thing in Perl. It creates a new entry in the hash every time it sees an element that hasn't been seen before, using the ++ operator. This has the side effect of making the hash record the number of times the element was seen. This time we only use the hash for its property of working like a set.

The third example is similar to the second but rather than storing the item away, we call some user-defined function with that item as its argument. If that's all we're doing, keeping a spare array of those unique values is unnecessary.

The next mechanism waits until it's done processing the list to extract the unique keys from the %seen hash. This may be convenient, but the original order has been lost.

The final approach merges the construction of the %seen hash with the extraction of unique elements. This preserves the original order of elements.

Using a hash to record the values has two side effects: processing long lists can take a lot of memory, and the list returned by keys is unordered.

Here's an example of processing input as it is read.
We use `who` to gather information on the current user list, then extract the username from each line before updating the hash:

# generate a list of users logged in, removing duplicates
%ucnt = ( );
for (`who`) {
    s/\s.*\n//;     # kill from first space till end-of-line, yielding username
    $ucnt{$_}++;    # record the presence of this user
}

# extract and print unique keys
@users = sort keys %ucnt;
print "users logged in: @users\n";

4.7.4 See Also

The "Foreach Loops" section of perlsyn(1) and Chapter 4 of Programming Perl; the keys function in perlfunc(1) and Chapter 29 of Programming Perl; the "Hashes" section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.8 and Recipe 4.9

Recipe 4.8 Finding Elements in One Array but Not Another

4.8.1 Problem

You want to find elements that are in one array but not another.

4.8.2 Solution

You want to find elements in @A that aren't in @B. Build a hash of the keys of @B to use as a lookup table. Then check each element in @A to see whether it is in @B.

Straightforward implementation

# assume @A and @B are already loaded
%seen = ( );    # lookup table to test membership of B
@aonly = ( );   # answer

# build lookup table
foreach $item (@B) { $seen{$item} = 1 }

# find only elements in @A and not in @B
foreach $item (@A) {
    unless ($seen{$item}) {
        # it's not in %seen, so add to @aonly
        push(@aonly, $item);
    }
}

More idiomatic version

my %seen;   # lookup table
my @aonly;  # answer

# build lookup table
@seen{@B} = ( );

foreach $item (@A) {
    push(@aonly, $item) unless exists $seen{$item};
}

Loopless version

my @A = ...;
my @B = ...;

my %seen;
@seen{@A} = ( );
delete @seen{@B};

my @aonly = keys %seen;

4.8.3 Discussion

As with nearly any problem in Perl that asks whether a scalar is in one list or another, this one uses a hash. First, process @B so that the %seen hash records each element from @B by setting its value to 1.
Then process @A one element at a time, checking whether that particular element had been in @B by consulting the %seen hash.

The given code retains duplicate elements in @A. This can be fixed easily by adding the elements of @A to %seen as they are processed:

foreach $item (@A) {
    push(@aonly, $item) unless $seen{$item};
    $seen{$item} = 1;   # mark as seen
}

The first two solutions differ mainly in how they build the hash. The first iterates through @B. The second uses a hash slice to initialize the hash. A hash slice is easiest illustrated by this example:

$hash{"key1"} = 1;
$hash{"key2"} = 2;

which is equivalent to:

@hash{"key1", "key2"} = (1,2);

The list in the curly braces holds the keys; the list on the right holds the values. We initialize %seen in the first solution by looping over each element in @B and setting the appropriate value of %seen to 1. In the second, we simply say:

@seen{@B} = ( );

This uses items in @B as keys for %seen, setting each corresponding value to undef, because there are fewer values on the right than places to put them. This works out here because we check for existence of the key, not logical truth or definedness of the value. If we needed true values, a slice could still shorten our code:

@seen{@B} = (1) x @B;

In the third solution, we make use of this property even further and avoid explicit loops altogether. (Not that avoiding loops should be construed as being particularly virtuous; we're just showing you that there's more than one way to do it.) The slice assignment makes any element that was in @A a key, and the slice deletion removes from the hash any keys that were elements of @B, leaving those that were only in @A.

A fairly common situation where this might arise is when you have two files and would like to know which lines from the second file either were or weren't in the first.
Here's a simple solution based on this recipe:

open(OLD, $path1)       || die "can't open $path1: $!";
@seen{ <OLD> } = ( );
open(NEW, $path2)       || die "can't open $path2: $!";
while (<NEW>) {
    print if exists $seen{$_};
}

This shows the lines in the second file that were already seen in the first one. Use unless instead of if to show the lines in the second file that were not in the first.

Imagine two files, the first containing the lines:

red
yellow
green
blue

and the second containing:

green
orange
purple
black
yellow

The output using if would be:

green
yellow

and the output using unless would be:

orange
purple
black

You could even do this from the command line; given a suitable cat(1) program, it's easy:

% perl -e '@s{`cat OLD`}=( ); exists $s{$_} && print for `cat NEW`'
% perl -e '@s{`cat OLD`}=( ); exists $s{$_} || print for `cat NEW`'

You'd find that you just emulated these calls to the Unix fgrep(1) program:

% fgrep -Ff OLD NEW
% fgrep -vFf OLD NEW

4.8.4 See Also

Hash slices are explained in perldata(1) and the "Variables" section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.7 and Recipe 4.9

Recipe 4.9 Computing Union, Intersection, or Difference of Unique Lists

4.9.1 Problem

You have a pair of lists, each holding unduplicated items. You'd like to find out which items are in both lists (intersection), one but not the other (difference), or either (union).
4.9.2 Solution

The following solutions need the listed initializations:

@a = (1, 3, 5, 6, 7, 8);
@b = (2, 3, 5, 7, 9);

@union = @isect = @diff = ( );
%union = %isect = ( );
%count = ( );

Simple solution for union and intersection

foreach $e (@a) { $union{$e} = 1 }

foreach $e (@b) {
    if ( $union{$e} ) { $isect{$e} = 1 }
    $union{$e} = 1;
}
@union = keys %union;
@isect = keys %isect;

More idiomatic version

foreach $e (@a, @b) { $union{$e}++ && $isect{$e}++ }
@union = keys %union;
@isect = keys %isect;

Union, intersection, and symmetric difference

foreach $e (@a, @b) { $count{$e}++ }

@union = keys %count;
foreach $e (keys %count) {
    if ($count{$e} == 2) {
        push @isect, $e;
    } else {
        push @diff, $e;
    }
}

Indirect solution

@isect = @diff = @union = ( );

foreach $e (@a, @b) { $count{$e}++ }

@union = keys %count;
foreach $e (keys %count) {
    push @{ $count{$e} == 2 ? \@isect : \@diff }, $e;
}

4.9.3 Discussion

The first solution most directly computes the union and intersection of two lists, neither containing duplicates. Two hashes are used to record whether a particular item goes in the union or the intersection. We put every element of the first array in the union hash, giving it a true value. Then, processing each element of the second array, we check whether that element is already present in the union. If it is, we put it in the intersection as well. In any event, it goes into the union. When we're done, we extract the keys of both the union and intersection hashes. The values aren't needed.

The second solution is essentially the same but relies on familiarity with the Perl (and awk, C, C++, and Java) ++ and && operators. By placing the ++ after the variable, we first look at its old value before incrementing it. The first time through it won't be in the union, which makes the first part of the && false, so the second part is consequently ignored. The second time that we encounter the same element, it's already present in the union, so we put it in the intersection.
The third solution uses just one hash to track how many times each element is seen. Once both arrays have their elements recorded in the hash, we grab those keys and put them in the union. Then we process those hash keys one at a time. Keys whose values are 2 were in both arrays, so they go in the intersection array. Keys whose values are 1 were in just one of the two arrays, so they go in the difference array. Elements of the output arrays are not in the same order as those in the input arrays.

The last solution, like the previous one, uses just one hash to count how many times each element is encountered. Here, though, we dynamically select one of two possible arrays by placing within the @{...} array-dereferencing block an expression whose evaluation yields a reference to whichever array is demanded by the situation.

In this recipe we compute the symmetric difference, not the simple difference. These are terms from set theory. A symmetric difference is the set of all elements that are members of either @A or @B, but not both. A simple difference is the set of members of @A but not @B, which we calculated in Recipe 4.8.

4.9.4 See Also

The "Hashes" section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.7 and Recipe 4.8

Recipe 4.10 Appending One Array to Another

4.10.1 Problem

You want to join two arrays by appending all elements of one to the other.

4.10.2 Solution

Use push:

# push
push(@ARRAY1, @ARRAY2);

4.10.3 Discussion

The push function is optimized for appending a list to the end of an array.
You can take advantage of Perl's list flattening to join two arrays, but this results in significantly more copying than push: @ARRAY1 = (@ARRAY1, @ARRAY2); Here's an example of push in action: @members = ("Time", "Flies"); @initiates = ("An", "Arrow"); push(@members, @initiates); # @members is now ("Time", "Flies", "An", "Arrow") To insert the elements of one array into the middle of another, use the splice function: @members = ("Time", "Flies"); # begin again with the original pair splice(@members, 2, 0, "Like", @initiates); print "@members\n"; splice(@members, 0, 1, "Fruit"); splice(@members, -2, 2, "A", "Banana"); print "@members\n"; This is the output:

    Time Flies Like An Arrow
    Fruit Flies Like A Banana

4.10.4 See Also The splice and push functions in perlfunc(1) and Chapter 29 of Programming Perl; the "List Values and Arrays" section of Chapter 2 of Programming Perl; the "List Value Constructors" section of perldata(1) Recipe 4.11 Reversing an Array 4.11.1 Problem You want to reverse an array. 4.11.2 Solution Use the reverse function: # reverse @ARRAY into @REVERSED @REVERSED = reverse @ARRAY; Or process with a foreach loop on a reversed list: foreach $element (reverse @ARRAY) { # do something with $element } Or use a for loop, starting with the index of the last element and working your way down: for ($i = $#ARRAY; $i >= 0; $i--) { # do something with $ARRAY[$i] } 4.11.3 Discussion Called in list context, the reverse function reverses elements of its argument list. You can save a copy of that reversed list into an array, or just use foreach to walk through it directly if that's all you need. The for loop processes the array elements in reverse order by using explicit indices. If you don't need a reversed copy of the array, the for loop can save memory and time on very large arrays. If you're using reverse to reverse a list that you just sorted, you should have sorted it in the correct order to begin with. 
For example: # two-step: sort then reverse @ascending = sort { $a cmp $b } @users; @descending = reverse @ascending; # one-step: sort with reverse comparison @descending = sort { $b cmp $a } @users; 4.11.4 See Also The reverse function in perlfunc(1) and Chapter 29 of Programming Perl; we use reverse in Recipe 1.7 Recipe 4.12 Processing Multiple Elements of an Array 4.12.1 Problem You want to pop or shift multiple elements at a time. 4.12.2 Solution Use splice: # remove $N elements from front of @ARRAY (shift $N) @FRONT = splice(@ARRAY, 0, $N); # remove $N elements from the end of the array (pop $N) @END = splice(@ARRAY, -$N); 4.12.3 Discussion The splice function allows you to add elements, delete elements, or both, at any point in an array, not just at the ends. All other operations that modify an array's length can also be written as a splice:

    Direct method               Splice equivalent
    push(@a, $x, $y)            splice(@a, @a, 0, $x, $y)
    pop(@a)                     splice(@a, -1)
    shift(@a)                   splice(@a, 0, 1)
    unshift(@a, $x, $y)         splice(@a, 0, 0, $x, $y)
    $a[$x] = $y                 splice(@a, $x, 1, $y)
    @a = ( )                    splice(@a)

Unlike pop and shift, though, which always delete and return just one element at a time—and from the ends only—splice lets you specify the number of elements. This leads to code like the examples in the Solution. 
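Each row of that table can be spot-checked by running the direct call on one copy of an array and the splice equivalent on another (a small sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @a = (1, 2, 3);
my @b = (1, 2, 3);

push(@a, 8, 9);
splice(@b, scalar(@b), 0, 8, 9);  # @b in scalar context is its length
print "@a | @b\n";                # 1 2 3 8 9 | 1 2 3 8 9

my $x = pop @a;
my $y = splice(@b, -1);           # splice returns what it removed
print "$x $y\n";                  # 9 9

shift @a;
splice(@b, 0, 1);
print "@a | @b\n";                # 2 3 8 | 2 3 8
```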
It's often convenient to wrap these splices as functions: sub shift2 (\@) { return splice(@{$_[0]}, 0, 2); } sub pop2 (\@) { return splice(@{$_[0]}, -2); } This makes their behavior more apparent when you use them: @friends = qw(Peter Paul Mary Jim Tim); ($this, $that) = shift2(@friends); # $this contains Peter, $that has Paul, and # @friends has Mary, Jim, and Tim @beverages = qw(Dew Jolt Cola Sprite Fresca); @pair = pop2(@beverages); # $pair[0] contains Sprite, $pair[1] has Fresca, # and @beverages has (Dew, Jolt, Cola) The splice function returns the elements it removed from the array, so shift2 replaces the first two elements in @ARRAY with nothing (i.e., deletes them) and returns the two elements deleted. In pop2, the two elements at the end of the array are removed and returned. These two functions are prototyped to take an array reference as their argument to better mimic the built-in shift and pop functions. The caller doesn't pass in an explicit reference using a backslash. Instead, the compiler, having seen the array reference prototype, arranges to pass the array by reference anyway. Advantages to this approach include efficiency, transparency, and compile-time parameter checking. One disadvantage is that the thing passed in must look like a real array with a leading @ sign, not just a scalar containing an array reference. If you had such a scalar, you'd have to prepend an @, making it less transparent: $line[5] = \@list; @got = pop2( @{ $line[5] } ); This is another example of where a proper array and not a mere list is called for. The \@ prototype requires that whatever goes in that argument slot be an array. $line[5] isn't an array, but an array reference. That's why we need the "extra" @ sign. 
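The same prototype trick generalizes to helpers that take a count; shiftN and popN below are hypothetical names we made up for this sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# shiftN/popN: like shift2/pop2, but remove $n elements at a time
# (hypothetical helpers, not part of the recipe)
sub shiftN (\@$) { my ($aref, $n) = @_; return splice(@$aref, 0, $n) }
sub popN   (\@$) { my ($aref, $n) = @_; return splice(@$aref, -$n)  }

my @friends = qw(Peter Paul Mary Jim Tim);
my @first3  = shiftN(@friends, 3);
my @last1   = popN(@friends, 1);

print "@first3\n";    # Peter Paul Mary
print "@last1\n";     # Tim
print "@friends\n";   # Jim
```

As with shift2 and pop2, the (\@$) prototype lets the caller write a plain array name while the compiler passes a reference behind the scenes.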
4.12.4 See Also The splice function in perlfunc(1) and Chapter 29 of Programming Perl; the "Prototypes" sections of perlsub(1) and Chapter 6 of Programming Perl; we use splice in Recipe 4.10 Recipe 4.13 Finding the First List Element That Passes a Test 4.13.1 Problem You want the first element in the list (or its index) that passes a test. Alternatively, you want to know whether any element passes the test. The test can be simple identity ("Is this element in the list?")[1] or more complex ("I have a list of Employee objects, sorted from highest salary to lowest. Which manager has the highest salary?"). Simple cases normally require only the value of the element, but when the array itself will be altered, you probably need to know the index number of the first matching element. [1] But why didn't you use a hash then? 4.13.2 Solution To find a matching value, use foreach to loop over every element, and call last as soon as you find a match: my ($match, $found, $item); foreach $item (@array) { if (CRITERION) { $match = $item; # must save $found = 1; last; } } if ($found) { ## do something with $match } else { ## unfound } To find a matching index, use for to loop a variable over every array index, and call last as soon as you find a match: my ($i, $match_idx); for ($i = 0; $i < @array; $i++) { if (CRITERION) { $match_idx = $i; # save the index last; } } if (defined $match_idx) { ## found in $array[$match_idx] } else { ## unfound } The List::Util module, shipped standard with Perl as of v5.8 but available on CPAN for earlier versions, provides an even easier approach: use List::Util qw(first); $match = first { CRITERION } @list; 4.13.3 Discussion Lacking (until recently) a built-in mechanism to do this, we must write our own code to go through the list and test each element. We use foreach and for, and call last to ensure that we stop as soon as we find a match. Before we use last to stop looking, though, we save the value or index. 
A common approach is to try to use grep here. But grep always tests all elements and finds all matches, so it's inefficient if you want only the first match. However, grep might still be faster. That's because there will be less source code if you use grep rather than writing your own loop. That means fewer internal Perl operations, and it is these that in practice often dominate runtimes. Beyond a certain size of your data set, a loop that terminates early will still be faster—assuming it has the chance to do so. Empirical evidence suggests that for will be faster as long as you can exit before the first two-thirds of the list has been examined. It's worthwhile to know how to do that. We have to set $match when we want the value of the first matching element. We can't just test $item at the end of the loop, because foreach automatically localizes the iterator variable and thereby prevents us from accessing the final loop value after the loop ends. See Recipe 4.5. Here's an example. Assume that @all_emps holds a list of Employee objects, sorted in descending order by salary. We wish to find the highest paid engineer, who will be the first engineer in the array. We only want to print the engineer's name, so we want the value, not the index. foreach $employee (@all_emps) { if ( $employee->category( ) eq 'engineer' ) { $top_engr = $employee; last; } } print "Highest paid engineer is: ", $top_engr->name( ), "\n"; When we're searching and want only the index, we can save some code by remembering that $i will not be an acceptable array index if we don't find a match. This mainly saves us code space, as not doing an assignment doesn't really win much compared to the time spent testing list elements. It's more obscure, because it tests if ($i < @ARRAY) to check whether we found a match, instead of the more obvious defined test in the previous solution. 
for ($i = 0; $i < @ARRAY; $i++) { last if CRITERION; } if ($i < @ARRAY) { ## found and $i is the index } else { ## not found } The first function from List::Util encapsulates the logic from an entire loop into a convenient, easy-to-use function. It acts just like a short-circuiting form of the built-in grep function that stops as soon as a match is found. While running, each list element is in a localized $_ variable. For example: $first_odd = first { $_ % 2 == 1 } @ARRAY; Or rewriting the previous employee loop: $top_engr = first { $_->category( ) eq 'engineer' } @all_emps; 4.13.4 See Also The "For Loops," "Foreach Loops," and "Loop Control" sections of perlsyn(1) and Chapter 4 of Programming Perl; the grep function in perlfunc(1) and Chapter 29 of Programming Perl Recipe 4.14 Finding All Elements in an Array Matching Certain Criteria 4.14.1 Problem From a list, you want only the elements that match certain criteria. This notion of extracting a subset of a larger list is common. It's how you find all engineers in a list of employees, all users in the "staff" group, or all the filenames you're interested in. 4.14.2 Solution Use grep to apply a condition to all elements in the list and return only those for which the condition was true: @MATCHING = grep { TEST ($_) } @LIST; 4.14.3 Discussion This could also be accomplished with a foreach loop: @matching = ( ); foreach (@list) { push(@matching, $_) if TEST ($_); } The Perl grep function is shorthand for all that looping and mucking about. It's not really like the Unix grep command; it doesn't have options to return line numbers or to negate the test, and it isn't limited to regular-expression tests. 
For example, to filter out just the large numbers from an array or to find out which keys in a hash have very large values: @bigs = grep { $_ > 1_000_000 } @nums; @pigs = grep { $users{$_} > 1e7 } keys %users; Here's something that sets @matching to lines from the who command that start with "gnat ": @matching = grep { /^gnat / } `who`; Here's another example: @engineers = grep { $_->position( ) eq "Engineer" } @employees; It extracts only those objects from the array @employees whose position method returns the string Engineer. You could have even more complex tests in a grep: @secondary_assistance = grep { $_->income >= 26_000 && $_->income < 30_000 } @applicants; But at that point you may decide it would be more legible to write a proper loop instead. 4.14.4 See Also The "For Loops," "Foreach Loops," and "Loop Control" sections of perlsyn(1) and Chapter 4 of Programming Perl; the grep function in perlfunc(1) and Chapter 29 of Programming Perl; your system's who(1) manpage, if it exists; Recipe 4.13 Recipe 4.15 Sorting an Array Numerically 4.15.1 Problem You want to sort a list of numbers, but Perl's sort (by default) sorts in ASCII order. 4.15.2 Solution Use Perl's sort function and the <=> numerical comparison operator: @sorted = sort { $a <=> $b } @unsorted; 4.15.3 Discussion The sort function takes an optional code block, which lets you replace the default alphabetic comparison with your own subroutine. This comparison function is called each time sort has to compare two values. The values to compare are loaded into the special package variables $a and $b, which are automatically localized. The comparison function should return a negative number if $a ought to appear before $b in the output list, 0 if they're the same and their order doesn't matter, or a positive number if $a ought to appear after $b. 
Perl has two operators that behave this way: <=> for sorting numbers in ascending numeric order, and cmp for sorting strings in ascending alphabetic order. By default, sort uses cmp-style comparisons. Here's code that sorts the list of PIDs in @pids, lets the user select one, then sends it a TERM signal followed by a KILL signal. We use a code block that compares $a to $b with <=> to sort numerically: # @pids is an unsorted array of process IDs foreach my $pid (sort { $a <=> $b } @pids) { print "$pid\n"; } print "Select a process ID to kill:\n"; chomp ($pid = <>); die "Exiting ... \n" unless $pid && $pid =~ /^\d+$/; kill('TERM',$pid); sleep 2; kill('KILL',$pid); If you use $a <=> $b or $a cmp $b, the list will be sorted in ascending order. For a descending sort, all we have to do is swap $a and $b in the sort subroutine: @descending = sort { $b <=> $a } @unsorted; Comparison routines must be consistent; that is, they should always return the same answer when called with the same values. Inconsistent comparison routines lead to infinite loops or core dumps, especially in older releases of Perl. You can also say sort SUBNAME LIST where SUBNAME is the name of a comparison subroutine returning -1, 0, or +1. In the interests of speed, the normal calling conventions are bypassed, and the values to be compared magically appear for the duration of the subroutine in the global package variables $a and $b. Because of the odd way Perl calls this subroutine, it may not be recursive. A word of warning: $a and $b are set in the package active in the call to sort, which may not be the same as the one that the SUBNAME function passed to sort was compiled in! 
For example: package Sort_Subs; sub revnum { $b <=> $a } package Other_Pack; @all = sort Sort_Subs::revnum 4, 19, 8, 3; This will silently fail (unless you have -w in effect, in which case it will vocally fail) because the sort call sets the package variables $a and $b in its own package, Other_Pack, but the revnum function uses its own package's versions. This is another reason why in-lining sort functions is easier, as in: @all = sort { $b <=> $a } 4, 19, 8, 3; For more on packages, see Chapter 10. 4.15.4 See Also The cmp and <=> operators in perlop(1) and Chapter 3 of Programming Perl; the kill, sort, and sleep functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 4.16 Recipe 4.16 Sorting a List by Computable Field 4.16.1 Problem You want to sort a list by something more complex than a simple string or numeric comparison. This is common when working with objects ("sort by the employee's salary") or complex data structures ("sort by the third element in the array that this is a reference to"). It's also applicable when you want to sort by more than one key; for instance, sorting by birthday and then by name when multiple people share the same birthday. 4.16.2 Solution Use the customizable comparison routine in sort: @ordered = sort { compare( ) } @unordered; You can speed this up by precomputing the field. @precomputed = map { [compute( ),$_] } @unordered; @ordered_precomputed = sort { $a->[0] <=> $b->[0] } @precomputed; @ordered = map { $_->[1] } @ordered_precomputed; And, finally, you can combine the three steps: @ordered = map { $_->[1] } sort { $a->[0] <=> $b->[0] } map { [compute( ), $_] } @unordered; 4.16.3 Discussion The use of a comparison routine was explained in Recipe 4.15. 
As well as using built-in operators like <=>, you can construct more complex tests: @ordered = sort { $a->name cmp $b->name } @employees; You often see sort used like this in part of a foreach loop: foreach $employee (sort { $a->name cmp $b->name } @employees) { print $employee->name, " earns \$", $employee->salary, "\n"; } If you're going to do a lot of work with elements in a particular order, it's more efficient to sort once and work from that: @sorted_employees = sort { $a->name cmp $b->name } @employees; foreach $employee (@sorted_employees) { print $employee->name, " earns \$", $employee->salary, "\n"; } # load %bonus foreach $employee (@sorted_employees) { if ( $bonus{ $employee->ssn } ) { print $employee->name, " got a bonus!\n"; } } We can put multiple comparisons in the routine and separate them with ||. || is a short-circuit operator: it returns the first true value it finds. This means we can sort by one kind of comparison, but if the elements are equal (the comparison returns 0), we can sort by another. This has the effect of a sort within a sort: @sorted = sort { $a->name cmp $b->name || $b->age <=> $a->age } @employees; This first considers the names of the two employees to be compared. If they're not equal, || stops and returns the result of the cmp (effectively sorting them in ascending order by name). If the names are equal, though, || keeps testing and returns the result of the <=> (sorting them in descending order by age). The result is a list that is sorted by name and by age within groups of the same name. Let's look at a real-life example of sorting. First we fetch all system users, represented as User::pwent objects. 
Then we sort them by name and print the sorted list: use User::pwent qw(getpwent); @users = ( ); # fetch all users while (defined($user = getpwent)) { push(@users, $user); } @users = sort { $a->name cmp $b->name } @users; foreach $user (@users) { print $user->name, "\n"; } We can have more than simple comparisons, or combinations of simple comparisons. This code sorts a list of names by comparing the second letters of the names. It gets the second letters by using substr: @sorted = sort { substr($a,1,1) cmp substr($b,1,1) } @names; and here we sort by string length: @sorted = sort { length $a <=> length $b } @strings; The sort function calls the code block each time it needs to compare two elements, so the number of comparisons grows dramatically with the number of elements we're sorting. Sorting 10 elements requires (on average) 46 comparisons, but sorting 1,000 elements requires 14,000 comparisons. A time-consuming operation like a split or a subroutine call for each comparison can easily make your program crawl. Fortunately, we can remove this bottleneck by running the operation once per element prior to the sort. Use map to store the results of the operation in an array whose elements are anonymous arrays containing both the computed field and the original field. Then we sort this array of arrays on the precomputed field and use map to get the sorted original data. This map-sort-map concept is useful and common, so let's look at it in depth. Let's apply map-sort-map to the sorting by string length example: @temp = map { [ length $_, $_ ] } @strings; @temp = sort { $a->[0] <=> $b->[0] } @temp; @sorted = map { $_->[1] } @temp; The first line creates a temporary array of strings and their lengths, using map. The second line sorts the temporary array by comparing the precomputed lengths. The third line turns the sorted temporary array of strings and lengths back into a sorted array of strings. This way, we calculate the length of each string only once. 
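The three-step transform can be verified end to end with the string-length example (a sketch; the sample words are assumptions chosen with distinct lengths so the result is fully determined):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @strings = qw(banana fig cherimoya apple pear);

# map-sort-map: compute each length once, sort on it, then unwrap
my @temp   = map  { [ length $_, $_ ] } @strings;
@temp      = sort { $a->[0] <=> $b->[0] } @temp;
my @sorted = map  { $_->[1] } @temp;

print "@sorted\n";   # fig pear apple banana cherimoya
```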
Because the input to each line is the output of the previous line (the @temp array we make in line 1 is fed to sort in line 2, and that output is fed to map in line 3), we can combine it into one statement and eliminate the temporary array: @sorted = map { $_->[1] } sort { $a->[0] <=> $b->[0] } map { [ length $_, $_ ] } @strings; The operations now appear in reverse order. When you meet a map-sort-map, you should read it from the bottom up to determine the function: @strings The last part is the data to be sorted. Here it's just an array, but later we'll see that this can be a subroutine or even backticks. Anything that returns a list is fair game. map The map closer to the bottom builds the temporary list of anonymous arrays. This list contains the precomputed fields (length $_) and also records the original element ($_) by storing both in an anonymous array. Look at this map line to find out how the fields are computed. sort The sort line sorts the list of anonymous arrays by comparing the precomputed fields. It won't tell you much, other than whether the list is sorted in ascending or descending order. map The map at the top of the statement turns the sorted list of anonymous arrays back into a list of the sorted original elements. It will generally be the same for every map-sort-map. Here's a more complicated example, which sorts by the first number that appears on each line in @fields: @temp = map { [ /(\d+)/, $_ ] } @fields; @sorted_temp = sort { $a->[0] <=> $b->[0] } @temp; @sorted_fields = map { $_->[1] } @sorted_temp; The regular expression mumbo jumbo in the first line extracts the first number from the line being processed by map. We use the regular expression /(\d+)/ in a list context to extract the number. We can remove the temporary arrays in that code, giving us: @sorted_fields = map { $_->[1] } sort { $a->[0] <=> $b->[0] } map { [ /(\d+)/, $_ ] } @fields; This final example compactly sorts colon-separated data, as from Unix's passwd file. 
It sorts the file numerically by the fourth field (group id), then numerically by the third field (user id), and then alphabetically by the first field (username). print map { $_->[0] } # whole line sort { $a->[1] <=> $b->[1] # gid || $a->[2] <=> $b->[2] # uid || $a->[3] cmp $b->[3] # login } map { [ $_, (split /:/)[3,2,0] ] } `cat /etc/passwd`; 4.16.4 See Also The sort function in perlfunc(1) and Chapter 29 of Programming Perl; the cmp and <=> operators in perlop(1) and Chapter 3 of Programming Perl; Recipe 4.15 Recipe 4.17 Implementing a Circular List 4.17.1 Problem You want to create and manipulate a circular list. 4.17.2 Solution Use unshift and pop (or push and shift) on a normal array. 4.17.3 Procedure unshift(@circular, pop(@circular)); # the last shall be first push(@circular, shift(@circular)); # and vice versa 4.17.4 Discussion Circular lists are commonly used to repeatedly process things in order; for example, connections to a server. The code shown previously isn't a true computer science circular list, with pointers and true circularity. Instead, the operations provide for moving the last element to the first position, and vice versa. sub grab_and_rotate ( \@ ) { my $listref = shift; my $element = $listref->[0]; push(@$listref, shift @$listref); return $element; } @processes = ( 1, 2, 3, 4, 5 ); while (1) { $process = grab_and_rotate(@processes); print "Handling process $process\n"; sleep 1; } 4.17.5 See Also The unshift and push functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 13.13 Recipe 4.18 Randomizing an Array 4.18.1 Problem You want to randomly shuffle the elements of an array. The obvious application is writing a card game, where you must shuffle a deck of cards, but it is equally applicable to any situation where you want to treat elements of an array in a random order. 
4.18.2 Solution Use the shuffle function from the standard List::Util module, which returns the elements of its input list in a random order. use List::Util qw(shuffle); @array = shuffle(@array); 4.18.3 Discussion Shuffling is a surprisingly tricky process. It's easy to write a bad shuffle: sub naive_shuffle { # DON'T DO THIS for (my $i = 0; $i < @_; $i++) { my $j = int rand @_; # pick random element ($_[$i], $_[$j]) = ($_[$j], $_[$i]); # swap 'em } } This algorithm is biased; the list's possible permutations don't all have the same probability of being generated. The proof of this is simple: take the case where we're passed a three-element list. We generate three random numbers, each of which can have three possible values, yielding 27 possible outcomes. There are only six permutations of the three-element list, though. Because 27 isn't evenly divisible by 6, some outcomes are more likely than others. The List::Util module's shuffle function avoids this bias to produce a more randomly shuffled result. If all you want to do is pick one random element from the array, use: $value = $array[ int(rand(@array)) ]; 4.18.4 See Also The rand function in perlfunc(1) and Chapter 29 of Programming Perl; for more on random numbers, see Recipe 2.6, Recipe 2.7, and Recipe 2.8; Recipe 4.20 provides another way to select a random permutation Recipe 4.19 Program: words Have you ever wondered how programs like ls generate columns of sorted output that you read down the columns instead of across the rows? For example:

    awk      cp       ed       login    mount    rmdir    sum
    basename csh      egrep    ls       mt       sed      sync
    cat      date     fgrep    mail     mv       sh       tar
    chgrp    dd       grep     mkdir    ps       sort     touch
    chmod    df       kill     mknod    pwd      stty     vi
    chown    echo     ln       more     rm       su

Example 4-2 does this. Example 4-2. 
words #!/usr/bin/perl -w # words - gather lines, present in columns use strict; my ($item, $cols, $rows, $maxlen); my ($xpixel, $ypixel, $mask, @data); getwinsize( ); # first gather up every line of input, # remembering the longest line length seen $maxlen = 1; while (<>) { my $mylen; s/\s+$//; $maxlen = $mylen if (($mylen = length) > $maxlen); push(@data, $_); } $maxlen += 1; # to make extra space # determine boundaries of screen $cols = int($cols / $maxlen) || 1; $rows = int(($#data+$cols) / $cols); # pre-create mask for faster computation $mask = sprintf("%%-%ds ", $maxlen-1); # subroutine to check whether at last item on line sub EOL { ($item+1) % $cols == 0 } # now process each item, picking out proper piece for this position for ($item = 0; $item < $rows * $cols; $item++) { my $target = ($item % $cols) * $rows + int($item/$cols); my $piece = sprintf($mask, $target < @data ? $data[$target] : ""); $piece =~ s/\s+$// if EOL( ); # don't blank-pad to EOL print $piece; print "\n" if EOL( ); } # finish up if needed print "\n" if EOL( ); # not portable -- linux only sub getwinsize { my $winsize = "\0" x 8; my $TIOCGWINSZ = 0x40087468; if (ioctl(STDOUT, $TIOCGWINSZ, $winsize)) { ($rows, $cols, $xpixel, $ypixel) = unpack('S4', $winsize); } else { $cols = 80; } } The most obvious way to print out a sorted list in columns is to print each element of the list, one at a time, padded out to a particular width. Then when you're about to hit the end of the line, generate a newline. But that only works if you're planning on reading each row from left to right. If you instead expect to read it down each column, this approach won't do. The words program is a filter that generates output going down the columns. It reads all input, keeping track of the length of the longest line seen. Once everything has been read in, it divides the screen width by the length of the longest input record seen, yielding the expected number of columns. 
Then the program goes into a loop that executes once per input record, but the output order isn't in the obvious order. Imagine you had a list of nine items:

    Wrong        Right
    -----        -----
    1 2 3        1 4 7
    4 5 6        2 5 8
    7 8 9        3 6 9

The words program does the necessary calculations to print out elements (1,4,7) on one line, (2,5,8) on the next, and (3,6,9) on the last. To figure out the current window size, this program does an ioctl call. This works fine—on the system it was written for. On any other system, it won't work. If that's good enough for you, then good for you. Recipe 12.17 shows how to find this on your system using the file, or with a C program. Recipe 15.4 shows a more portable solution, but that requires installing a CPAN module. 4.19.1 See Also Recipe 15.4 Recipe 4.20 Program: permute Have you ever wanted to generate all possible permutations of an array or to execute some code for every possible permutation? For example:

    % echo man bites dog | permute
    dog bites man
    bites dog man
    dog man bites
    man dog bites
    bites man dog
    man bites dog

The number of permutations of a set is the factorial of the size of the set. This number grows extremely fast, so you don't want to run it on many permutations:

    Set Size    Permutations
    1           1
    2           2
    3           6
    4           24
    5           120
    6           720
    7           5040
    8           40320
    9           362880
    10          3628800
    11          39916800
    12          479001600
    13          6227020800
    14          87178291200
    15          1307674368000

Doing something for each alternative takes a correspondingly large amount of time. In fact, factorial algorithms exceed the number of particles in the universe with very small inputs. The factorial of 500 is greater than ten raised to the thousandth power! use Math::BigInt; sub factorial { my $n = shift; my $s = 1; $s *= $n-- while $n > 0; return $s; } print factorial(Math::BigInt->new("500")); +1220136... (1135 digits total) The two solutions that follow differ in the order of the permutations they return. 
The solution in Example 4-3 uses a classic list permutation algorithm used by Lisp hackers. It's relatively straightforward but makes unnecessary copies. It's also hardwired to do nothing but print out its permutations. Example 4-3. tsc-permute #!/usr/bin/perl -n # tsc_permute: permute each word of input permute([split], [ ]); sub permute { my @items = @{ $_[0] }; my @perms = @{ $_[1] }; unless (@items) { print "@perms\n"; } else { my (@newitems,@newperms,$i); foreach $i (0 .. $#items) { @newitems = @items; @newperms = @perms; unshift(@newperms, splice(@newitems, $i, 1)); permute( \@newitems, \@newperms); } } } The solution in Example 4-4, provided by Mark-Jason Dominus, is faster (by around 25%) and more elegant. Rather than precalculate all permutations, his code generates the nth particular permutation. It is elegant in two ways. First, it avoids recursion except to calculate the factorial, which the permutation algorithm proper does not use. Second, it generates a permutation of integers rather than permute the actual data set. He also uses a time-saving technique called memoizing. The idea is that a function that always returns a particular answer when called with a particular argument memorizes that answer. That way, the next time it's called with the same argument, no further calculations are required. The factorial function uses a private array @fact to remember previously calculated factorial values as described in Recipe 10.3. This technique is so useful that there's a standard module that will handle the value caching for you. If you just had a regular factorial function that didn't have its own caching, you could add caching to the existing function this way: use Memoize; memoize("factorial"); You call n2perm with two arguments: the permutation number to generate (from 0 to factorial(N), where N is the size of your array) and the subscript of the array's last element. The n2perm function calculates directions for the permutation in the n2pat subroutine. 
Then it converts those directions into a permutation of integers in the pat2perm subroutine. The directions are a list like (0 2 0 1 0), which means: "Splice out the 0th element, then the second element from the remaining list, then the 0th element, then the first, then the 0th." Example 4-4. mjd-permute #!/usr/bin/perl -w # mjd_permute: permute each word of input use strict; sub factorial($); # forward reference to declare prototype while (<>) { my @data = split; my $num_permutations = factorial(scalar @data); for (my $i=0; $i < $num_permutations; $i++) { my @permutation = @data[n2perm($i, $#data)]; print "@permutation\n"; } } # Utility function: factorial with memoizing BEGIN { my @fact = (1); sub factorial($) { my $n = shift; return $fact[$n] if defined $fact[$n]; $fact[$n] = $n * factorial($n - 1); } } # n2pat($N, $len) : produce the $N-th pattern of length $len sub n2pat { my $i = 1; my $N = shift; my $len = shift; my @pat; while ($i <= $len + 1) { # Should really be just while ($N) { ... push @pat, $N % $i; $N = int($N/$i); $i++; } return @pat; } # pat2perm(@pat) : turn pattern returned by n2pat( ) into # permutation of integers. XXX: splice is already O(N) sub pat2perm { my @pat = @_; my @source = (0 .. $#pat); my @perm; push @perm, splice(@source, (pop @pat), 1) while @pat; return @perm; } # n2perm($N, $len) : generate the Nth permutation of $len objects sub n2perm { pat2perm(n2pat(@_)); } 4.20.1 See Also unshift and splice in perlfunc(1) or Chapter 29 of Programming Perl; the sections discussing closures in perlsub(1) and perlref(1) and Chapter 8 of Programming Perl; Recipe 2.6; Recipe 10.3 Chapter 5. Hashes Doing linear scans over an associative array is like trying to club someone to death with a loaded Uzi. —Larry Wall Introduction People and parts of computer programs interact in all sorts of ways. 
Single scalar variables are like hermits, living a solitary existence whose only meaning comes from within the individual. Arrays are like cults, where multitudes marshal themselves under the name of a charismatic leader. In the middle lies the comfortable, intimate ground of the one-to-one relationship that is the hash. (Older documentation for Perl often called hashes associative arrays, but that's a mouthful. Other languages that support similar constructs sometimes use different terms for them; you may hear about hash tables, tables, dictionaries, mappings, or even alists, depending on the language.) Unfortunately, this isn't a relationship of equals. The relationship encoded in a hash is that of the genitive case or the possessive, like the word "of" in English, or like "'s". We could encode that the boss of Nat is Tim. Hashes only give convenient ways to access values for Nat's boss; you can't ask whose boss Tim is. Finding the answer to that question is a recipe in this chapter. Fortunately, hashes have their own special benefits, just like relationships. Hashes are a built-in data type in Perl. Their use reduces many complex algorithms to simple variable accesses. They are also fast and convenient ways to build indices and quick lookup tables. It's time to put a name to these notions. The relationship embodied in a hash is a good thing to use for its name. This means that "the boss of Nat" would be written as $boss{"Nat"}. We can assign "Tim" to that: $boss{"Nat"} = "Tim"; In the previous example you see a dollar sign, which might surprise you since this is a hash, not a scalar. But we're setting a single scalar value in that hash, so use a dollar sign. Where a lone scalar has $ as its type identifier and an entire array has @, an entire hash has %. Only use the % when referring to the hash as a whole, such as %boss. When referring to the value associated with a particular key, that's a single scalar value, so a $ is called for—just as when referring to one element of an array, you also use a $. A regular array uses integers for indices, but the indices of a hash are always strings. Its values may be any arbitrary scalar values, including references. With references as values, you can create hashes that hold not merely strings or numbers, but also arrays, other hashes, or objects. (Or rather, references to arrays, hashes, or objects.) An entire hash can be initialized with a list, where elements of the list are key and value pairs: %age = ( "Nat", 30, "Jules", 31, "Josh", 23 ); This is equivalent to: $age{"Nat"} = 30; $age{"Jules"} = 31; $age{"Josh"} = 23; To make it easier to read and write hash initializations, the => operator, sometimes known as a comma arrow, was created. Mostly it behaves like a better-looking comma. For example, you can write a hash initialization this way: %food_color = ( "Apple" => "red", "Banana" => "yellow", "Lemon" => "yellow", "Carrot" => "orange" ); (This particular hash is used in many examples in this chapter.) This initialization is also an example of hash-list equivalence—hashes behave in some ways as though they were lists of key-value pairs. We'll use this in a number of recipes, including the merging and inverting recipes. Unlike a regular comma, the comma arrow has a special property: it quotes any word preceding it, which means you can safely omit the quotes and improve legibility. Single-word hash keys are also automatically quoted when they occur inside braces, which means you can write $hash{somekey} instead of $hash{"somekey"}. You could rewrite the preceding initialization of %food_color as: %food_color = ( Apple => "red", Banana => "yellow", Lemon => "yellow", Carrot => "orange" ); One important issue to be aware of regarding hashes is that their elements are stored in an internal order convenient for efficient retrieval.
This means that no matter what order you insert your data, it will come out in an unpredictable disorder. See Also The perldata(1) manpage; the two sections on "Hashes" in the first and second chapters of Programming Perl Recipe 5.1 Adding an Element to a Hash 5.1.1 Problem You need to add an entry to a hash. 5.1.2 Solution Simply assign to the hash key: $HASH{$KEY} = $VALUE; 5.1.3 Discussion Putting something into a hash is straightforward. In languages that don't provide the hash as an intrinsic data type, you have to worry about overflows, resizing, and collisions in your hash table. In Perl, all that is taken care of for you with a simple assignment. If that entry was already occupied (had a previous value), memory for that value is automatically freed, just as when assigning to a simple scalar. # %food_color defined per the introduction $food_color{Raspberry} = "pink"; print "Known foods:\n"; foreach $food (keys %food_color) { print "$food\n"; } Known foods: Banana Apple Raspberry Carrot Lemon If you don't want to overwrite an existing value, but somehow have one key reference multiple values, see Recipe 5.8 and Recipe 11.2. 5.1.4 See Also The "List Value Constructors" section of perldata(1); the "List Values and Arrays" section of Chapter 2 of Programming Perl; Recipe 5.2 Recipe 5.2 Testing for the Presence of a Key in a Hash 5.2.1 Problem You need to know whether a hash has a particular key, regardless of whatever value may be associated with that key. 5.2.2 Solution Use the exists function. # does %HASH have a value for $KEY ? if (exists($HASH{$KEY})) { # it exists } else { # it doesn't } 5.2.3 Discussion This code uses exists to check whether a key is in the %food_color hash: # %food_color per the introduction foreach $name ("Banana", "Martini") { if (exists $food_color{$name}) { print "$name is a food.\n"; } else { print "$name is a drink.\n"; } } Banana is a food. Martini is a drink.
The exists function tests whether a key is in the hash. It doesn't test whether the value corresponding to that key is defined, nor whether the value is true or false. We may be splitting hairs, but problems caused by confusing existence, definedness, and truth can multiply like rabbits. Take this code: %age = ( ); $age{"Toddler"} = 3; $age{"Unborn"} = 0; $age{"Phantasm"} = undef; foreach $thing ("Toddler", "Unborn", "Phantasm", "Relic") { print "$thing: "; print "Exists " if exists $age{$thing}; print "Defined " if defined $age{$thing}; print "True " if $age{$thing}; print "\n"; } Toddler: Exists Defined True Unborn: Exists Defined Phantasm: Exists Relic: $age{"Toddler"} passes the existence, definedness, and truth tests. It exists because we gave "Toddler" a value in the hash; it's defined because that value isn't undef; and it's true because the value isn't one of Perl's false values. $age{"Unborn"} passes only the existence and definedness tests. It exists because we gave "Unborn" a value in the hash, and it's defined because that value isn't undef. It isn't true, however, because 0 is one of Perl's false values. $age{"Phantasm"} passes only the existence test. It exists because we gave "Phantasm" a value in the hash. But because that value was undef, it doesn't pass the definedness test. Because undef is also one of Perl's false values, it doesn't pass the truth test either. $age{"Relic"} passes none of the tests. We didn't put a value for "Relic" into the hash, so the existence test fails. Because we didn't put a value in, $age{"Relic"} is undef whenever we try to access it. We know from "Phantasm" that undef fails the definedness and truth tests. Sometimes it's useful to store undef in a hash. This indicates "I've seen this key, but it didn't have a meaningful value associated with it." Take, for instance, a program to look up file sizes given a list of files as input. 
This version tries to skip files we've seen before, but it doesn't skip zero-length files, and it doesn't skip files that we've seen before but don't exist. %size = ( ); while (<>) { chomp; next if $size{$_}; # WRONG attempt to skip $size{$_} = -s $_; } If we change the incorrect line to call exists, we also skip files that couldn't be statted, instead of repeatedly trying (and failing) to look them up: next if exists $size{$_}; 5.2.4 See Also The exists and defined functions in perlfunc(1) and Chapter 29 of Programming Perl; the discussion of truth in the "Scalar Values" section of perldata(1), and the "Boolean Context" section of Chapter 2 of Programming Perl Recipe 5.3 Creating a Hash with Immutable Keys or Values 5.3.1 Problem You'd like to have a hash whose keys or values can't be altered once set. 5.3.2 Solution Use the appropriate functions from the standard Hash::Util module. use Hash::Util qw{ lock_keys unlock_keys lock_value unlock_value lock_hash unlock_hash }; To restrict access to keys already in the hash, so no new keys can be introduced: lock_keys(%hash); # restrict to current keys lock_keys(%hash, @klist); # restrict to keys from @klist To forbid deletion of the key or modification of its value: lock_value(%hash, $key); To make all keys and their values read-only: lock_hash(%hash); 5.3.3 Discussion Suppose you're using a hash to implement a record (or an object) with some pre-determined set of keys, such as "NAME", "RANK", and "SERNO". You'd like to consider it an error to access any keys besides the ones initially in the hash, such as "ANME", a typo. Because Perl always creates hash elements on demand, this wouldn't be caught the way it would if you misspelled a variable name while under the use strict pragma. The Hash::Util module's lock_keys function takes care of this for you. Once a hash is marked as having locked keys, you can't use any other keys than those.
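The restriction is easy to see in action. This short sketch (the record fields are made up for illustration) traps the resulting exception with eval:

```perl
use Hash::Util qw(lock_keys);

my %record = (NAME => "Nat", RANK => "editor", SERNO => 42);
lock_keys(%record);             # only NAME, RANK, and SERNO are legal keys now

$record{RANK} = "author";       # fine: RANK is one of the locked-in keys

eval { $record{ANME} = "oops" };    # ANME was never a key, so this dies
print "Caught: $@" if $@;
```

Without the eval, the misspelled key would kill the program outright, which is exactly the early failure you wanted.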
The keys need not yet be in the hash, and they may still be deleted if they are. But no new keys may be used. Access to the values in those locked keys is not restricted by lock_keys. However, you may use the lock_value function to render a value in a hash read-only. That hash can also have its keys locked, but doesn't need to if the goal is just to have one or more values marked read-only. If you want to lock down the entire hash, thereby restricting both its keys and its values, the lock_hash function will do. 5.3.4 See Also The documentation for the Hash::Util module Recipe 5.4 Deleting from a Hash 5.4.1 Problem You want to remove an entry from a hash so that it doesn't show up with keys, values, or each. If you were using a hash to associate salaries with employees, and an employee resigned, you'd want to remove their entry from the hash. 5.4.2 Solution Use the delete function: # remove $KEY and its value from %HASH delete($HASH{$KEY}); 5.4.3 Discussion Sometimes people mistakenly try to use undef to remove an entry from a hash. undef $hash{$key} and $hash{$key} = undef both make %hash have an entry with key $key and value undef. The delete function is the only way to remove a specific entry from a hash. Once you've deleted a key, it no longer shows up in a keys list or an each iteration, and exists will return false for that key.
This demonstrates the difference between undef and delete: # %food_color as per Introduction sub print_foods { my @foods = keys %food_color; my $food; print "Keys: @foods\n"; print "Values: "; foreach $food (@foods) { my $color = $food_color{$food}; if (defined $color) { print "$color "; } else { print "(undef) "; } } print "\n"; } print "Initially:\n"; print_foods( ); print "\nWith Banana undef\n"; undef $food_color{"Banana"}; print_foods( ); print "\nWith Banana deleted\n"; delete $food_color{"Banana"}; print_foods( ); Initially: Keys: Banana Apple Carrot Lemon Values: yellow red orange yellow With Banana undef Keys: Banana Apple Carrot Lemon Values: (undef) red orange yellow With Banana deleted Keys: Apple Carrot Lemon Values: red orange yellow As you see, if we set $food_color{"Banana"} to undef, "Banana" still shows up as a key in the hash. The entry is still there; we only succeeded in making the value undef. On the other hand, delete actually removed it from the hash—"Banana" is no longer in the list returned by keys. delete can also take a hash slice, deleting all listed keys at once: delete @food_color{"Banana", "Apple", "Cabbage"}; 5.4.4 See Also The delete and keys functions in perlfunc(1) and in Chapter 29 of Programming Perl; we use keys in Recipe 5.5 Recipe 5.5 Traversing a Hash 5.5.1 Problem You want to perform an action on each entry (i.e., each key-value pair) in a hash. 5.5.2 Solution Use each with a while loop: while(($key, $value) = each(%HASH)) { # do something with $key and $value } Or use keys with a foreach loop, unless the hash is potentially very large: foreach $key (keys %HASH) { $value = $HASH{$key}; # do something with $key and $value } 5.5.3 Discussion Here's a simple example, iterating through the %food_color hash from the introduction: # %food_color per the introduction while(($food, $color) = each(%food_color)) { print "$food is $color.\n"; } Banana is yellow. Apple is red. Carrot is orange.
Lemon is yellow. foreach $food (keys %food_color) { my $color = $food_color{$food}; print "$food is $color.\n"; } Banana is yellow. Apple is red. Carrot is orange. Lemon is yellow. We didn't really need the $color variable in the foreach example, because we use it only once. Instead, we could have written: print "$food is $food_color{$food}.\n" Every time each is called on the same hash, it returns the "next" key-value pair. We say "next" because the pairs are returned in the order the underlying lookup structure imposes on them, which appears to be no order at all. When each runs out of hash elements, it returns the empty list ( ), whose assignment tests false and terminates the while loop. The foreach example uses keys, which constructs an entire list containing every key from the hash before the loop even begins executing. The advantage to using each is that it gets the keys and values one pair at a time. If the hash contains many keys, not having to preconstruct a complete list of them can save substantial memory. The each function, however, doesn't let you control the order in which pairs are processed. Using foreach and keys to loop over the list lets you impose an order. For instance, if we wanted to print the food names in alphabetical order: foreach $food (sort keys %food_color) { print "$food is $food_color{$food}.\n"; } Apple is red. Banana is yellow. Carrot is orange. Lemon is yellow. This is a common use of foreach. We use keys to obtain a list of keys in the hash, and then we use foreach to iterate over them. The danger is that if the hash contains a large number of elements, the list returned by keys will use a lot of memory. The trade-off lies between memory use and the ability to process the entries in a particular order. We cover sorting in more detail in Recipe 5.10. Because keys, values, and each all share the same internal data structures, be careful about mixing calls to these functions or prematurely exiting an each loop.
Each time you call keys or values, the current location for each is reset. This code loops forever, printing the first key returned by each: while ( ($k,$v) = each %food_color ) { print "Processing $k\n"; keys %food_color; # goes back to the start of %food_color } Modifying a hash while looping over it with each or foreach is, in general, fraught with danger. The each function can behave differently with tied and untied hashes when you add or delete keys from a hash. A foreach loops over a pregenerated list of keys, so once the loop starts, foreach can't know whether you've added or deleted keys. Keys added in the body of the loop aren't automatically appended to the list of keys to loop over, nor are keys deleted by the body of the loop deleted from this list. Example 5-1 reads a mailbox file and reports the number of messages from each person. It uses the From: line to determine the sender. (It isn't clever in this respect, but we're showing hash manipulation, not mail-file processing.) Supply the mailbox filename as a command-line argument, or use a "-" to indicate you're piping the mailbox to the program. (When Perl opens a file named "-" for reading using fewer than three arguments to open, this means to use the current standard input.) Example 5-1. countfrom #!/usr/bin/perl # countfrom - count number of messages from each sender $filename = $ARGV[0] || "-"; # "-" means standard input open(FILE, "< $filename") or die "Can't open $filename : $!"; while (<FILE>) { if (/^From: (.*)/) { $from{$1}++ } } foreach $person (sort keys %from) { print "$person: $from{$person}\n"; } 5.5.4 See Also The each and keys functions in perlfunc(1) and in Chapter 29 of Programming Perl; we talk about for and foreach in Recipe 4.6 Recipe 5.6 Printing a Hash 5.6.1 Problem You want to print a hash, but neither print "%hash" nor print %hash does what you want; the first is a literal, while the second just has the keys and values all scrunched together.
5.6.2 Solution One of several approaches is to iterate over every key-value pair in the hash using Recipe 5.5 and print them: while ( ($k,$v) = each %hash ) { print "$k => $v\n"; } Or use map to generate a list of strings: print map { "$_ => $hash{$_}\n" } keys %hash; Or use the interpolation trick from Recipe 1.15 to interpolate the hash as a list: print "@{[ %hash ]}\n"; Or use a temporary array variable to hold the hash, and then print that: { my @temp = %hash; print "@temp"; } 5.6.3 Discussion The methods differ in the degree that their output is customizable (in order and formatting) and in their efficiency. The first method, iterating over the hash, is flexible and space-efficient. You can format the output as you like it, and it requires only two scalar variables: the current key and value. You can print the hash in key order (at the cost of building a list of sorted keys) if you use a foreach loop: foreach $k (sort keys %hash) { print "$k => $hash{$k}\n"; } The map function is just as flexible. You can still process the list in any order by sorting the keys. You can customize the output to your heart's content. But it builds up a list of strings like "KEY =>VALUE\n" to pass to print. The last two methods are interpolation tricks. By treating the hash as a list, you can't predict or control the output order of key-value pairs. Furthermore, the output will consist of a list of keys and values, each separated by whatever string that $" happens to hold. You can't put newlines between pairs or "=>" within them, as we could with the other methods. Another solution is to print the hash in a list context after temporarily localizing the $, variable to a space. { local $, = " "; print %hash; } This is like the solution of copying to an array and then doing double-quote interpolation on that array, except it doesn't duplicate the contents of the hash twice more than you need (i.e., once for the array, then again for the string). 
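Beyond the approaches shown above, the core Data::Dumper module is another common way to print a hash; this sketch uses the chapter's %food_color example (the Sortkeys and Terse settings are optional conveniences, not requirements):

```perl
use Data::Dumper;

my %food_color = (
    Apple  => "red",
    Banana => "yellow",
    Lemon  => "yellow",
    Carrot => "orange",
);

$Data::Dumper::Sortkeys = 1;    # print keys in sorted order
$Data::Dumper::Terse    = 1;    # omit the leading "$VAR1 ="
print Dumper(\%food_color);
```

Because Dumper emits valid Perl source, its output can also be fed back through eval to reconstruct the data structure.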
The Dumpvalue module, described in Recipe 11.11, can provide for pretty printed output displays, plus much more. For example: use Dumpvalue; $dumper = Dumpvalue->new; $dumper->dumpValue(\%food_color); 'Apple' => 'red' 'Banana' => 'yellow' 'Carrot' => 'orange' 'Lemon' => 'yellow' 5.6.4 See Also The $" and $, variables in perlvar(1) and in the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl; the foreach, map, keys, sort, and each functions in perlfunc(1) and Chapter 29 of Programming Perl; we give a technique for interpolating into strings in Recipe 1.15; we discuss the techniques for hash traversal in Recipe 5.5 Recipe 5.7 Retrieving from a Hash in Insertion Order 5.7.1 Problem The keys and each functions traverse the hash elements in a strange order, and you want them in the order in which you inserted them. 5.7.2 Solution Use the Tie::IxHash module. use Tie::IxHash; tie %HASH, "Tie::IxHash"; # manipulate %HASH @keys = keys %HASH; # @keys is in insertion order 5.7.3 Discussion Tie::IxHash makes keys, each, and values return the hash elements in the order they were added. This often removes the need to preprocess the hash keys with a complex sort comparison or maintain a distinct array containing the keys in the order they were inserted into the hash. Tie::IxHash also provides an object-oriented interface to splice, push, pop, shift, unshift, keys, values, and delete, among others.
Here's an example, showing both keys and each: # initialize use Tie::IxHash; tie %food_color, "Tie::IxHash"; $food_color{"Banana"} = "Yellow"; $food_color{"Apple"} = "Green"; $food_color{"Lemon"} = "Yellow"; print "In insertion order, the foods are:\n"; foreach $food (keys %food_color) { print " $food\n"; } print "Still in insertion order, the foods' colors are:\n"; while (( $food, $color ) = each %food_color ) { print "$food is colored $color.\n"; } In insertion order, the foods are: Banana Apple Lemon Still in insertion order, the foods' colors are: Banana is colored Yellow. Apple is colored Green. Lemon is colored Yellow. 5.7.4 See Also The documentation for the CPAN module Tie::IxHash; Recipe 13.5 Recipe 5.8 Hashes with Multiple Values per Key 5.8.1 Problem You want to store more than one value for each key. 5.8.2 Solution Store an array reference in $hash{$key}, then put the values into the referenced array. 5.8.3 Discussion You can store only scalar values in a hash. References, however, are scalars. This solves the problem of storing multiple values for one key by making $hash{$key} a reference to an array containing values for $key. The normal hash operations—insertion, deletion, iteration, and testing for existence—can now be written in terms of array operations like push, splice, and foreach. This code shows simple insertion into the hash. It processes the output of who(1) on Unix machines and outputs a terse listing of users and the ttys they're logged in on: %ttys = ( ); open(WHO, "who|") or die "can't open who: $!"; while (<WHO>) { ($user, $tty) = split; push( @{$ttys{$user}}, $tty ); } foreach $user (sort keys %ttys) { print "$user: @{$ttys{$user}}\n"; } The heart of the code is the push line, the multivalued version of $ttys{$user} = $tty. The first time through, that hash value is undefined, so Perl automatically allocates a new anonymous array and stores its reference in that value so that the push can succeed.
This is called autovivification, and is explained more in Chapter 11. We interpolate all the tty names in the print line with @{$ttys{$user}}. We'd loop over the anonymous array if, for instance, we wanted to print the owner of each tty: foreach $user (sort keys %ttys) { print "$user: ", scalar( @{$ttys{$user}} ), " ttys.\n"; foreach $tty (sort @{$ttys{$user}}) { @stat = stat("/dev/$tty"); $user = @stat ? ( getpwuid($stat[4]) )[0] : "(not available)"; print "\t$tty (owned by $user)\n"; } } The exists function can have two meanings: "Is there at least one value for this key?" and "Does this value exist for this key?" Implementing the second approach requires searching the array for the value. The delete function and the first sense of exists are interrelated: if we can guarantee that no anonymous array is ever empty, we can use the built-in exists. We ensure that no anonymous array is ever empty by checking for that after deleting an element. sub multihash_delete { my ($hash, $key, $value) = @_; my $i; return unless ref( $hash->{$key} ); for ($i = 0; $i < @{ $hash->{$key} }; $i++) { if ($hash->{$key}->[$i] eq $value) { splice( @{$hash->{$key}}, $i, 1); last; } } delete $hash->{$key} unless @{$hash->{$key}}; } An alternative approach to multivalued hashes is given in Chapter 13, implemented as tied normal hashes. 5.8.4 See Also The splice, delete, push, foreach, and exists functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 11.1; we cover ties in Recipe 13.15 Recipe 5.9 Inverting a Hash 5.9.1 Problem Hashes map keys to values. You have a hash and a value whose corresponding key you want to find. 5.9.2 Solution Use reverse to create an inverted hash whose values are the original hash's keys and vice versa. # %LOOKUP maps keys to values %REVERSE = reverse %LOOKUP; 5.9.3 Discussion This technique uses the list equivalence of hashes mentioned in the introduction.
In list context, reverse treats %LOOKUP as a list and reverses the order of its elements. The significant property of a hash treated as a list is that the list elements come in associated pairs: the first element is the key; the second, the value. When you reverse such a list, the first element is now the value, and the second the key. Treating this list as a hash results in a hash whose values are the keys of the original hash and vice versa. Here's an example: %surname = ( "Mickey" => "Mantle", "Babe" => "Ruth" ); %first_name = reverse %surname; print $first_name{"Mantle"}, "\n"; Mickey When we treat %surname as a list, it becomes: ("Mickey", "Mantle", "Babe", "Ruth") (or maybe ("Babe", "Ruth", "Mickey", "Mantle") because we can't predict the order). Reversing this list gives us: ("Ruth", "Babe", "Mantle", "Mickey") When we treat this list as a hash, it becomes: ("Ruth" => "Babe", "Mantle" => "Mickey") Now instead of turning first names into surnames, it turns surnames into first names. Example 5-2 is a program called foodfind. If you give it a food name, it'll tell you the color of that food. If you give it a color, it'll tell you a food of that color. Example 5-2. foodfind #!/usr/bin/perl -w # foodfind - find match for food or color $given = shift @ARGV or die "usage: foodfind food_or_color\n"; %color = ( "Apple" => "red", "Banana" => "yellow", "Lemon" => "yellow", "Carrot" => "orange" ); %food = reverse %color; if (exists $color{$given}) { print "$given is a food with color $color{$given}.\n"; } if (exists $food{$given}) { print "$food{$given} is a food with color $given.\n"; } If two keys in the original hash have the same value (as "Lemon" and "Banana" do in the color example), then the inverted hash will only have one (which is dependent on the hashing order, and you shouldn't try to predict it). This is because hashes have, by Perl definition, unique keys. If you want to invert a hash with non-unique values, you must use the techniques shown in Recipe 5.8. 
That is, build up a hash whose values are a list of keys in the original hash: # %food_color as per the introduction while (($food,$color) = each(%food_color)) { push(@{$foods_with_color{$color}}, $food); } print "@{$foods_with_color{yellow}} were yellow foods.\n"; Banana Lemon were yellow foods. This also lets us change the foodfind program to handle colors represented by more than one food. For instance, foodfind yellow reports bananas and lemons. If any values in the original hash were references instead of strings or numbers, the inverted hash poses a problem because references don't work well as hash keys—unless you use the Tie::RefHash module described in Recipe 5.13. 5.9.4 See Also The reverse function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 13.15 Recipe 5.10 Sorting a Hash 5.10.1 Problem You need to work with the elements of a hash in a particular order. 5.10.2 Solution Use keys to get a list of keys, then sort them based on the ordering you want: # %hash is the hash to sort @keys = sort { criterion( ) } (keys %hash); foreach $key (@keys) { $value = $hash{$key}; # do something with $key, $value } 5.10.3 Discussion Even though you can't directly maintain a hash in a specific order (unless you use the Tie::IxHash module mentioned in Recipe 5.7), you can access its entries in any order. This technique offers many variations on the same basic mechanism: you extract the keys, reorder them using the sort function, and then process the entries in the new order. All the sorting tricks shown in Chapter 4 can be used here. Let's look at some applications.
The following code simply uses sort to order the keys alphabetically: foreach $food (sort keys %food_color) { print "$food is $food_color{$food}.\n"; } This sorts the keys by their associated values: foreach $food (sort { $food_color{$a} cmp $food_color{$b} } keys %food_color) { print "$food is $food_color{$food}.\n"; } This sorts by length of the values: @foods = sort { length($food_color{$a}) <=> length($food_color{$b}) } keys %food_color; foreach $food (@foods) { print "$food is $food_color{$food}.\n"; } 5.10.4 See Also The sort and keys functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 5.7; we discuss sorting lists in Recipe 4.16 Recipe 5.11 Merging Hashes 5.11.1 Problem You need to make a new hash with the entries of two existing hashes. 5.11.2 Solution Treat them as lists, and join them as you would lists. %merged = (%A, %B); To save memory, loop over the hashes' elements and build a new hash that way: %merged = ( ); while ( ($k,$v) = each(%A) ) { $merged{$k} = $v; } while ( ($k,$v) = each(%B) ) { $merged{$k} = $v; } 5.11.3 Discussion The first method, like the earlier recipe on inverting a hash, uses the hash-list equivalence explained in the introduction. (%A, %B) evaluates to a list of paired keys and values. When we assign it to %merged, Perl turns that list of pairs back into a hash. Here's an example of that technique: # %food_color as per the introduction %drink_color = ( Galliano => "yellow", "Mai Tai" => "blue" ); %ingested_color = (%drink_color, %food_color); Keys in both input hashes appear just once in the output hash. If a food and a drink shared the same name, for instance, then the last one seen by the first merging technique would be the one that showed up in the resultant hash. This style of direct assignment, as in the first example, is easier to read and write, but requires a lot of memory if the hashes are large.
That's because Perl has to unroll both hashes into a temporary list before the assignment to the merged hash is done. Step-by-step merging using each, as in the second technique, spares you that cost and lets you decide what to do with duplicate keys. The first example could be rewritten to use the each technique: # %food_color per the introduction, then %drink_color = ( Galliano => "yellow", "Mai Tai" => "blue" ); %substance_color = ( ); while (($k, $v) = each %food_color) { $substance_color{$k} = $v; } while (($k, $v) = each %drink_color) { $substance_color{$k} = $v; } That technique duplicated the while and assignment code. Here's a sneaky way to get around that: foreach $substanceref ( \%food_color, \%drink_color ) { while (($k, $v) = each %$substanceref) { $substance_color{$k} = $v; } } If we're merging hashes with duplicates, we can insert our own code to decide what to do with those duplicates: foreach $substanceref ( \%food_color, \%drink_color ) { while (($k, $v) = each %$substanceref) { if (exists $substance_color{$k}) { print "Warning: $k seen twice. Using the first definition.\n"; next; } $substance_color{$k} = $v; } } In the special case of appending one hash to another, we can use the hash slice notation to give an elegant shorthand: @all_colors{keys %new_colors} = values %new_colors; This requires enough memory for lists of the keys and values of %new_colors. As with the first technique, the memory requirement might make this technique infeasible when such lists would be large. 5.11.4 See Also This is a variation on Recipe 4.10; the each function in perlfunc(1) and in Chapter 29 of Programming Perl Recipe 5.12 Finding Common or Different Keys in Two Hashes 5.12.1 Problem You need to find keys in one hash that are or are not present in another hash. 5.12.2 Solution Use keys to loop through the keys of one hash, checking whether each key is also in the other hash.
Find common keys: my @common = ( ); foreach (keys %hash1) { push(@common, $_) if exists $hash2{$_}; } # @common now contains common keys Find keys from one hash that aren't in both: my @this_not_that = ( ); foreach (keys %hash1) { push(@this_not_that, $_) unless exists $hash2{$_}; } 5.12.3 Discussion Because we're finding common or different keys of the hashes, we can apply our earlier array recipes for finding common or different elements to arrays of the hashes' keys. For an explanation, see Recipe 4.9. This code uses the difference technique to find non-citrus foods: # %food_color per the introduction # %citrus_color is a hash mapping citrus food name to its color. %citrus_color = ( Lemon => "yellow", Orange => "orange", Lime => "green" ); # build up a list of non-citrus foods @non_citrus = ( ); foreach (keys %food_color) { push (@non_citrus, $_) unless $citrus_color{$_}; } 5.12.4 See Also The "Hashes" section of Chapter 2 of Programming Perl; the each function in perlfunc(1) and in Chapter 29 of Programming Perl Recipe 5.13 Hashing References 5.13.1 Problem When you use keys on a hash whose keys are references, the references that keys returns no longer work. This situation often arises when you want to cross-reference two different hashes. 5.13.2 Solution Use Tie::RefHash: use Tie::RefHash; tie %hash, "Tie::RefHash"; # you may now use references as the keys to %hash 5.13.3 Discussion Hash keys are automatically "stringified," that is, treated as though they appeared between double quotes. With numbers or strings, nothing is lost. This isn't so with references, though. Stringified references look like these: Class::Somewhere=HASH(0x72048) ARRAY(0x72048) A stringified reference can't be dereferenced, because it is just a string and no longer a reference. This means you can't use references as the keys to a hash without losing their "magic."
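The stringification can be watched directly. This small sketch stores a reference as a key and gets back a plain string:

```perl
# a reference used as a hash key is silently turned into a string
my $aref = [ 1, 2, 3 ];
my %seen;
$seen{$aref} = "first";

my ($key) = keys %seen;
print "key is: $key\n";         # something like ARRAY(0x55c1e8a2b430)
print "still a reference? ", (ref($key) ? "yes" : "no"), "\n";
```

The retrieved key compares equal to "$aref" as a string, but there is no way to get the original array back from it.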
Hand-rolled solutions to this problem involve maintaining a distinct hash whose keys are stringified references and whose values are the actual references. This is what Tie::RefHash does. We'll use IO objects for filehandles here to show you that even such strange references can index a hash tied with Tie::RefHash. Here's an example:

    use Tie::RefHash;
    use IO::File;

    tie %name, "Tie::RefHash";

    foreach $filename ("/etc/termcap", "/vmunix", "/bin/cat") {
        $fh = IO::File->new("< $filename") or next;
        $name{$fh} = $filename;
    }
    print "open files: ", join(", ", values %name), "\n";
    foreach $file (keys %name) {
        seek($file, 0, 2);          # seek to the end
        printf("%s is %d bytes long.\n", $name{$file}, tell($file));
    }

If you're storing objects as the keys to a hash, though, you almost always should be storing a unique attribute of the object (e.g., name or ID number) instead.

5.13.4 See Also

The documentation for the standard Tie::RefHash module; the "Warning" section of perlref(1)

Recipe 5.14 Presizing a Hash

5.14.1 Problem

You want to preallocate memory for a hash to speed up your program so Perl won't have to incrementally allocate memory each time a new entry is added to the hash. Often you know the final size of a hash before you start building it up, and it's possible to use this information to speed up your program.

5.14.2 Solution

Assign the number of key-value pairs your hash will have to keys %HASH:

    # presize %hash to $num
    keys(%hash) = $num;

5.14.3 Discussion

This feature may or may not improve your performance. Perl already shares keys between hashes, so if you already have a hash with "Apple" as a key, Perl won't need to allocate memory for another copy of "Apple" when you add an entry whose key is "Apple" to another hash.

    # will have 512 users in %users
    keys(%users) = 512;

Perl's internal data structures require the number of keys to be a power of 2.
If we had said:

    keys(%users) = 1000;

Perl would have internally allocated 1024 "buckets" for the hash. Keys and buckets aren't always one to one. You get the best performance when they are, but the distribution of keys to buckets is dependent on your keys and Perl's (immutable) hash algorithm.

5.14.4 See Also

The keys function in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 4.3

Recipe 5.15 Finding the Most Common Anything

5.15.1 Problem

You have an aggregate data structure, such as an array or a hash. You want to know how often each element in the array (or value in the hash) occurs. For instance, if your array contains web server transactions, you might want to find the most commonly requested file. If your hash maps usernames to number of logins, you want to find the most common number of logins.

5.15.2 Solution

Use a hash to count how many times each element, key, or value appears:

    %count = ( );
    foreach $element (@ARRAY) {
        $count{$element}++;
    }

5.15.3 Discussion

Any time you want to count how often different things appear, you should probably be using a hash. The foreach adds one to $count{$element} for every occurrence of $element.

5.15.4 See Also

Recipe 4.7 and Recipe 4.8

Recipe 5.16 Representing Relationships Between Data

5.16.1 Problem

You want to represent relationships between elements of data—for instance, the mother of relationship in a family tree or parent process for a process table. This is closely related to representing tables in relational databases (tables represent relationships between information) and to representing computer science graph structures (edges represent relationships between nodes).

5.16.2 Solution

Use a hash to represent the relationship.
5.16.3 Discussion

Here's part of the family tree from the Bible:

    %father = ( 'Cain'      => 'Adam',
                'Abel'      => 'Adam',
                'Seth'      => 'Adam',
                'Enoch'     => 'Cain',
                'Irad'      => 'Enoch',
                'Mehujael'  => 'Irad',
                'Methusael' => 'Mehujael',
                'Lamech'    => 'Methusael',
                'Jabal'     => 'Lamech',
                'Jubal'     => 'Lamech',
                'Tubalcain' => 'Lamech',
                'Enos'      => 'Seth' );

This lets us, for instance, easily trace a person's lineage:

    while (<>) {
        chomp;
        do {
            print "$_ ";            # print the current name
            $_ = $father{$_};       # set $_ to $_'s father
        } while defined;            # until we run out of fathers
        print "\n";
    }

We can already ask questions like "Who begat Seth?" by checking the %father hash. By inverting this hash, we invert the relationship. This lets us use Recipe 5.9 to answer questions like "Whom did Lamech beget?"

    while ( ($k,$v) = each %father ) {
        push( @{ $children{$v} }, $k );
    }

    $" = ', ';                      # separate output with commas
    while (<>) {
        chomp;
        if ($children{$_}) {
            @children = @{$children{$_}};
        } else {
            @children = "nobody";
        }
        print "$_ begat @children.\n";
    }

Hashes can also represent relationships such as the C language #includes. A includes B if A contains #include B. This code builds the hash (it doesn't look for files in /usr/include as it should, but that's a minor change):

    foreach $file (@files) {
        local *FH;
        unless (open(FH, " < $file")) {
            warn "Couldn't read $file: $!; skipping.\n";
            next;
        }
        while (<FH>) {
            next unless /^\s*#\s*include\s*<([^>]+)>/;
            push(@{$includes{$1}}, $file);
        }
        close FH;
    }

This shows which files with include statements are not included in other files:

    @include_free = ( );            # list of files that don't include others
    @uniq{map { @$_ } values %includes} = undef;
    foreach $file (sort keys %uniq) {
        push( @include_free , $file ) unless $includes{$file};
    }

The values of %includes are anonymous arrays because a single file can (and often does) include more than one other file. We use map to build up a big list of the included files and remove duplicates using a hash.
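The slice-assignment trick used to build %uniq is worth dwelling on, since it is a general idiom for removing duplicates from any list. Here is a self-contained sketch (the file names are invented for illustration):

```perl
use strict;
use warnings;

my @included = qw(stdio.h string.h stdio.h stdlib.h string.h);

# One hash key is created per list element; duplicate elements in the
# list just recreate the same key, so they collapse for free.
my %uniq;
@uniq{@included} = ();

my @unique = sort keys %uniq;
print "@unique\n";              # stdio.h stdlib.h string.h
```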
5.16.4 See Also

Recipe 4.7; the more complex data structures in Recipe 11.9 through Recipe 11.14

Recipe 5.17 Program: dutree

The dutree program, shown in Example 5-3, turns the output of du:

    % du pcb
    19      pcb/fix
    20      pcb/rev/maybe/yes
    10      pcb/rev/maybe/not
    705     pcb/rev/maybe
    54      pcb/rev/web
    1371    pcb/rev
    3       pcb/pending/mine
    1016    pcb/pending
    2412    pcb

into sorted, indented output:

    2412 pcb
       | 1371 rev
       |    |  705 maybe
       |    |    |  675 .
       |    |    |   20 yes
       |    |    |   10 not
       |    |  612 .
       |    |   54 web
       | 1016 pending
       |    | 1013 .
       |    |    3 mine
       |   19 fix
       |    6 .

The arguments you give dutree are passed through to du. That way you could call dutree in any of these ways, or maybe more if your du supports other options:

    % dutree
    % dutree /usr
    % dutree -a
    % dutree -a /bin

The %Dirsize hash maintains the mapping of names to sizes. For example, $Dirsize{"pcb"} contains 2412 in this sample run. We'll use that hash both for output and for sorting each directory's subdirectories by size.

%Kids is more interesting. For any given path PATH, $Kids{PATH} contains a (reference to an) array of names of subdirectories of this one. The "pcb" entry contains a reference to an anonymous array containing "fix", "rev", and "pending". The "rev" entry contains "maybe" and "web". The "maybe" entry contains "yes" and "not", which do not have their own entries because they are end nodes in the tree.

The output function is passed the start of the tree—the last line read in from the output of du. First it prints that directory and its size. Then the function sorts the directory's children (if any) so that those with the most disk usage float to the top. Finally, output calls itself, recursing on each child in order. The extra arguments are used in formatting.

This program is inherently recursive because the filesystem is recursive. However, its data structure is not recursive; at least, not the way a circular linked list is. Each value is an array of further keys to process.
The recursion resides in the processing, not in the storage.

Example 5-3. dutree

    #!/usr/bin/perl -w
    # dutree - print sorted indented rendition of du output
    use strict;

    my %Dirsize;
    my %Kids;

    getdots(my $topdir = input( ));
    output($topdir);

    # run du, read in input, save sizes and kids
    # return last directory (file?) read
    sub input {
        my($size, $name, $parent);
        @ARGV = ("du @ARGV |");                 # prep the arguments
        while (<>) {                            # magic open is our friend
            ($size, $name) = split;
            $Dirsize{$name} = $size;
            ($parent = $name) =~ s#/[^/]+$##;   # dirname
            push @{ $Kids{$parent} }, $name unless eof;
        }
        return $name;
    }

    # figure out how much is taken up in each directory
    # that isn't stored in subdirectories.  add a new
    # fake kid called "." containing that much.
    sub getdots {
        my $root = $_[0];
        my($size, $cursize);
        $size = $cursize = $Dirsize{$root};
        if ($Kids{$root}) {
            for my $kid (@{ $Kids{$root} }) {
                $cursize -= $Dirsize{$kid};
                getdots($kid);
            }
        }
        if ($size != $cursize) {
            my $dot = "$root/.";
            $Dirsize{$dot} = $cursize;
            push @{ $Kids{$root} }, $dot;
        }
    }

    # recursively output everything,
    # passing padding and number width in as well
    # on recursive calls
    sub output {
        my($root, $prefix, $width) = (shift, shift || '', shift || 0);
        my $path;
        ($path = $root) =~ s#.*/##;             # basename
        my $size = $Dirsize{$root};
        my $line = sprintf("%${width}d %s", $size, $path);
        print $prefix, $line, "\n";
        for ($prefix .= $line) {                # build up more output
            s/\d /| /;
            s/[^|]/ /g;
        }
        if ($Kids{$root}) {                     # not a bachelor node
            my @Kids = @{ $Kids{$root} };
            @Kids = sort { $Dirsize{$b} <=> $Dirsize{$a} } @Kids;
            $Dirsize{$Kids[0]} =~ /(\d+)/;
            my $width = length $1;
            for my $kid (@Kids) { output($kid, $prefix, $width) }
        }
    }

Before Perl supported hashes of arrays directly, Herculean efforts were required to emulate these higher order constructs. Some folks used repeated calls to split and join, but these were exceedingly slow. Example 5-4 is a version of dutree from those days of Perl antiquity.
Because we didn't have proper array references, we had to usurp the Perl symbol table itself. This program created variables on the fly with bizarre names. Can you find which hash this program is using?

The @{"pcb"} array contains "pcb/fix", "pcb/rev", and "pcb/pending". The @{"pcb/rev"} array contains "pcb/rev/maybe" and "pcb/rev/web". The @{"pcb/rev/maybe"} array contains "pcb/rev/maybe/yes" and "pcb/rev/maybe/not".

When you assign something like "pcb/fix" to *kid, it promotes the string on the righthand side to a typeglob. This makes @kid an alias for @{"pcb/fix"}—among other things. It would also alias &kid to &{"pcb/fix"}, and so on. If that isn't interesting enough, consider how the local is using dynamic scoping of global variables to avoid passing in extra arguments. Check out what's happening with the $width variable in the output routine.

Example 5-4. dutree-orig

    #!/usr/bin/perl
    # dutree_orig: the old version pre-perl5 (early 90s)
    @lines = `du @ARGV`;
    chop(@lines);
    &input($top = pop @lines);
    &output($top);
    exit;

    sub input {
        local($root, *kid, $him) = @_[0,0];
        while (@lines && &childof($root, $lines[$#lines])) {
            &input($him = pop(@lines));
            push(@kid, $him);
        }
        if (@kid) {
            local($mysize) = ($root =~ /^(\d+)/);
            for (@kid) { $mysize -= (/^(\d+)/)[0]; }
            push(@kid, "$mysize .") if $size != $mysize;
        }
        @kid = &sizesort(*kid);
    }

    sub output {
        local($root, *kid, $prefix) = @_[0,0,1];
        local($size, $path) = split(' ', $root);
        $path =~ s!.*/!!;
        $line = sprintf("%${width}d %s", $size, $path);
        print $prefix, $line, "\n";
        $prefix .= $line;
        $prefix =~ s/\d /| /;
        $prefix =~ s/[^|]/ /g;
        local($width) = $kid[0] =~ /(\d+)/ && length("$1");
        for (@kid) { &output($_, $prefix); };
    }

    sub sizesort {
        local(*list, @index) = shift;
        sub bynum { $index[$b] <=> $index[$a]; }
        for (@list) { push(@index, /(\d+)/); }
        @list[sort bynum 0..$#list];
    }

    sub childof {
        local(@pair) = @_;
        for (@pair) { s/^\d+\s+//g; s/$/\//; }
        index($pair[1], $pair[0]) >= 0;
    }

The answer to the question posed earlier — "Which hash
is the old dutree using?" — is %main::, that is, the Perl symbol table itself. Needless to say, this program will never run under use strict. We're happy to report that the updated version runs three times as fast as the old one. That's because the old one keeps looking up variables in the symbol table, and the new one doesn't have to. It's also because we avoid all that slow splitting of the space used and the directory name. But we thought we'd show you the old version because it is instructive, too.

Chapter 6. Pattern Matching

[Art is] pattern informed by sensibility.

—Sir Herbert Read, The Meaning of Art

Introduction

Most modern programming languages offer primitive pattern-matching tools, usually through an extra library. In contrast, Perl's patterns are integrated directly into the language core. Perl's pattern matching boasts features not found elsewhere, features that encourage a whole different way of looking at data. Just as chess players see patterns in the board positions that their pieces control, Perl adepts look at data in terms of patterns. These patterns, expressed in the intensely symbolic notation of regular expressions,[1] provide access to powerful algorithms normally available only to scholars of computer science.

[1] Technically, Perl's patterns far exceed the capabilities of mere regular expressions as that term is formally used in computing theory.

"If this pattern matching thing is so powerful and so fantastic," you may be asking, "why don't you have a hundred different recipes on regular expressions in this chapter?" Regular expressions are the natural solution to many problems involving numbers, strings, dates, web documents, mail addresses, and almost everything else in this book; we use pattern matching over 100 times in other chapters. This chapter mostly presents recipes in which pattern matching forms part of the questions, not just part of the answers.
Perl's extensive and integrated support for regular expressions means that you not only have features available that you won't find in any other language, but you have new ways of using them, too. Programmers new to Perl often look for functions like these:

    match( $string, $pattern );
    subst( $string, $pattern, $replacement );

but matching and substituting are such common tasks that they merit their own notation:

    $meadow =~ m/sheep/;        # True if $meadow contains "sheep"
    $meadow !~ m/sheep/;        # True if $meadow doesn't contain "sheep"
    $meadow =~ s/old/new/;      # Replace "old" with "new" in $meadow

Pattern matching isn't like direct string comparison, even at its simplest level; it's more like string searching with mutant wildcards on steroids. Without anchors, the position where the match occurs can float freely throughout the string. Any of the following lines would also be matched by the expression $meadow =~ /ovine/, giving false positives when looking for lost sheep:

    Fine bovines demand fine toreadors.
    Muskoxen are a polar ovibovine species.
    Grooviness went out of fashion decades ago.

Sometimes they're right in front of you but they still don't match:

    Ovines are found typically in oviaries.

The problem is that while you are probably thinking in some human language, the pattern-matching engine most assuredly is not. When the engine is presented with the pattern /ovine/ and a string to match against, it searches the string for an "o" that is immediately followed by a "v", then by an "i", then by an "n", and then finally by an "e". What comes before or after that sequence doesn't matter. Additionally, those letters are matched case-sensitively. That's why it didn't find "Ovines", since that string starts with a capital letter. As you find your patterns matching some strings you don't want them to match and not matching other strings that you do want them to match, you start embellishing.
When looking for nothing but sheep, you probably want to match a pattern more like this:

    if ($meadow =~ /\bovines?\b/i) { print "Here be sheep!" }

Don't be tricked by the phantom cow lurking in that string—that's not a bovine. It's an ovine with a \b in front, which matches at a word boundary only.[2] The s? indicates an optional "s" so we can find one or more ovines. The trailing /i makes the whole pattern match case-insensitive.

[2] For Perl's idea of what defines a "word."

As you see, certain character sequences have special meaning to the pattern-matching engine, often standing in for several possible literal characters. These so-called metacharacters let you do such things as restrict the pattern to the start or end of the string, give alternatives for parts of a pattern, allow repetition and wildcarding, and remember part of the matching substring for use later in the pattern or in code.

Learning the syntax of pattern matching isn't as daunting as it might appear. Sure, there are a lot of symbols, but each has a reason for existing. Regular expressions aren't random jumbles of punctuation—they're carefully thought-out jumbles of punctuation! If you forget one, you can always look it up. Summary tables are included in Programming Perl, Learning Perl, Mastering Regular Expressions, and the perlre(1) and perlop(1) manpages included with every Perl installation.

The Tricky Bits

Much trickier than the syntax of regular expressions is their sneaky semantics. The three aspects of pattern-matching behavior that seem to cause folks the most trouble are greed, eagerness, and backtracking—and also how these three interact with each other.

Greed is the principle that if a standard quantifier (such as *) can match a varying number of times, it matches as long a substring as it can. This is explained in Recipe 6.15.

Eagerness is the notion that the leftmost match wins.
The engine is eager to return you a match as quickly as possible, sometimes even before you are expecting it. Consider the match "Fred" =~ /x*/. If asked to explain this in plain language, you might say "Does the string "Fred" contain any x's?" If so, you might be surprised to learn that it seems to. That's because /x*/ doesn't truly mean "any x's," unless your idea of "any" includes nothing at all. Formally, it means zero or more of them, and here zero sufficed for the eager matcher.

A more illustrative example of eagerness would be the following:

    $string = "good food";
    $string =~ s/o*/e/;

Can you guess which of the following is in $string after that substitution?

    good food
    geod food
    geed food
    geed feed
    ged food
    ged fed
    egood food

The correct answer is the last one, because the earliest point at which zero or more occurrences of "o" could be found was right at the beginning of the string. Surprised? Regular expressions can do that to you if you're unfamiliar with their semantics.

Here's another example of where greed takes a back seat to eagerness:

    $ echo longest | perl -ne 'print "$&\n" if /long|longer|longest/'
    long

That's because Perl uses what's called a traditional NFA,[3] a non-deterministic finite automaton. This kind of matching engine is not guaranteed to return the longest overall match, just the first match. You might think of Perl's greed as being left-to-right directed, not globally greedy.

[3] As opposed to a POSIX-style NFA. See Mastering Regular Expressions for the differences.

NFAs can be slow, but significant performance gains can be made by rewriting the patterns to exploit how the particular NFA implementation runs. This is a major part of Jeffrey Friedl's book, Mastering Regular Expressions.

The last and most powerful of the three tricky bits in pattern matching is backtracking. For a pattern to match, the entire regular expression must match, not just part of it.
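Here is a small, concrete demonstration of that whole-pattern requirement (our own example, not from the book): the greedy \d+ initially swallows every digit, then must give some back so the rest of the pattern can match.

```perl
use strict;
use warnings;

my $str = "12345";

# \d+ greedily matches "12345" first, but then the literal 5 has
# nothing left to match.  The engine backs up, releasing one digit
# at a time, until \d+ holds "1234" and the final 5 succeeds.
if ($str =~ /(\d+)5/) {
    print "\\d+ ended up capturing: $1\n";      # 1234
}
```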
So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and tries to find another match for the beginning part—that's why it's called backtracking. It means that the engine is going to try different possibilities, systematically investigating alternate matches until it finds one that works. In some pattern-matching implementations, the engine keeps backtracking in case other submatches make the overall match longer. Perl's matcher doesn't do that; as soon as one possibility works, it uses that—until and unless something later in the pattern fails, forcing a backtrack to retry another possible way of matching. This is discussed in Recipe 6.16.

Pattern-Matching Modifiers

Pattern-matching modifiers are a lot easier to list and learn than the different metacharacters. Table 6-1 contains a brief summary of them.

Table 6-1. Pattern-matching modifiers and their meanings

    Modifier  Meaning
    /i        Ignore alphabetic case
    /x        Ignore most whitespace in pattern and permit comments
    /g        Global—match/substitute as often as possible
    /gc       Don't reset search position on failed match
    /s        Let . match newline
    /m        Let ^ and $ match next to embedded \n
    /o        Compile pattern once only
    /e        Righthand side of an s/// is code whose result is used as the replacement value
    /ee       Righthand side of an s/// is a string that's eval'd twice; the final result then used as the replacement value

/i and /g are the most commonly used modifiers. The pattern /ram/i matches "ram", "RAM", "Ram", and so forth. Backreferences are checked case-insensitively if this modifier is on; see Recipe 6.16 for an example. This case-insensitivity can be made aware of the user's current locale settings if the use locale pragma has been invoked.

The /g modifier is used with s/// to replace every non-overlapping match, not just the first one.
/g is also used with m// in loops to find (but not replace) every matching occurrence:

    while (m/(\d+)/g) {
        print "Found number $1\n";
    }

Used on m// in list context, /g pulls out all matches:

    @numbers = m/(\d+)/g;

That finds only non-overlapping matches. You have to be much sneakier to get overlapping ones by making a zero-width look-ahead with the (?=...) construct. Because it's zero-width, the match engine hasn't advanced at all. Within the look-ahead, capturing parentheses are used to grab the thing anyway. Although we've saved something, Perl notices we haven't made any forward progress on the /g, so it bumps us forward one character position. This shows the difference:

    $digits = "123456789";
    @nonlap = $digits =~ /(\d\d\d)/g;
    @yeslap = $digits =~ /(?=(\d\d\d))/g;
    print "Non-overlapping: @nonlap\n";
    print "Overlapping:     @yeslap\n";

    Non-overlapping: 123 456 789
    Overlapping: 123 234 345 456 567 678 789

The /s and /m modifiers are useful when matching strings with embedded newlines. /s makes dot match "\n", something it doesn't normally do; it also makes the match ignore the value of the old, deprecated $* variable. /m makes ^ and $ match after and before "\n", respectively. They are useful with paragraph slurping mode as explained in the Introduction to Chapter 8, and in Recipe 6.6.

The /e modifier is used on replacements so that the righthand part is run as code and its return value is used as the replacement string. s/(\d+)/sprintf("%#x", $1)/ge converts all numbers into hex, changing, for example, 2581 into 0xa15.

Because different countries have different ideas of what constitutes an alphabet, the POSIX standard provides systems (and thus programs) with a standard way of representing alphabets, character set ordering, and so on. Perl gives you access to some of these through the use locale pragma; see the perllocale manpage for more information. When use locale is in effect, the \w character class includes accented and other exotic characters.
The case-changing escapes \u, \U, \l, and \L (and the corresponding uc, ucfirst, etc. functions) also respect use locale, so s will be turned into S with \u if the locale says it should. (This only matters in 8-bit encodings, such as ISO 8859-7 for the Greek character set. If those characters had been in Unicode, case translation would always apply, irrespective of current locale setting.)

Special Variables

Perl sets special variables as the result of certain matches: $1, $2, $3, and so on ad infinitum are set when a pattern contains capturing parentheses within parts of the pattern. Each open parenthesis as you read left to right in the pattern begins filling a new, numbered variable.

The variable $+ contains the contents of the last backreference of the last successful match. This helps distinguish which of several alternate matches was found (for example, if /(x.*y)|(y.*z)/ matches, $+ contains whichever of $1 or $2 was filled).
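A quick sketch (our own sample data) shows the numbered capture variables and $+ side by side:

```perl
use strict;
use warnings;

my $assignment = "color=burgundy";

if ($assignment =~ /(\w+)=(\w+)/) {
    print "\$1 is $1\n";    # color
    print "\$2 is $2\n";    # burgundy
    print "\$+ is $+\n";    # burgundy -- the last capture group that was filled
}
```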
$& contains the complete text matched in the last successful pattern match. $` and $' are the strings before and after the successful match, respectively:

    $string = "And little lambs eat ivy";
    $string =~ /l[^s]*s/;
    print "($`) ($&) ($')\n";

    (And ) (little lambs) ( eat ivy)

$`, $&, and $' are tempting, but dangerous. Their very presence anywhere in a program slows down every pattern match because the engine must populate these variables for every match. This is true even if you use one of these variables only once, or, for that matter, if you never use them at all, only mention them. Using $& is no longer so expensive as the other two.

A cheaper approach is to use the substr function in conjunction with the built-in array variables @- and @+, first introduced in Perl v5.6. These represent the starting and ending positions of the last submatches, respectively. The Nth elements of these two arrays hold the beginning and ending offset of the Nth submatch. So $-[1] is the offset where $1 begins, and $+[1] is the offset where it ends; $-[2] is the offset where $2 begins, and $+[2] is the offset where it ends; and so on. $-[0] is the offset of the beginning of the entire match, and $+[0] the offset of the end of the entire match. (When we say "offset of the end," we mean the offset to the first character following the end of whatever matched, so that we can subtract beginning offsets from end offsets and arrive at the length.) After a match against some variable $string, the following equivalencies hold true:

    Variable  Equivalent
    $`        substr($string, 0, $-[0])
    $&        substr($string, $-[0], $+[0] - $-[0])
    $'        substr($string, $+[0])
    $1        substr($string, $-[1], $+[1] - $-[1])
    $2        substr($string, $-[2], $+[2] - $-[2])
    $3        substr($string, $-[3], $+[3] - $-[3])

And so on and so forth. To learn far more about regular expressions than you ever thought existed, check out Mastering Regular Expressions, written by Jeffrey Friedl (O'Reilly).
This book is dedicated to explaining regular expressions from a practical perspective. Not only does it cover general regular expressions and Perl specials, it also compares and contrasts these with patterns in other programming languages.

Recipe 6.1 Copying and Substituting Simultaneously

6.1.1 Problem

You're tired of using two separate statements with redundant information, one to copy and another to substitute.

6.1.2 Solution

Instead of:

    $dst = $src;
    $dst =~ s/this/that/;

use:

    ($dst = $src) =~ s/this/that/;

6.1.3 Discussion

Sometimes you wish you could run a search and replace on a copy of a string, but you don't care to write this in two separate steps. You don't have to, because you can apply the regex operation to the result of the copy operation. For example:

    # strip to basename
    ($progname = $0) =~ s!^.*/!!;

    # Make All Words Title-Cased
    ($capword = $word) =~ s/(\w+)/\u\L$1/g;

    # /usr/man/man3/foo.1 changes to /usr/man/cat3/foo.1
    ($catpage = $manpage) =~ s/man(?=\d)/cat/;

You can even use this technique on an entire array:

    @bindirs = qw( /usr/bin /bin /usr/local/bin );
    for (@libdirs = @bindirs) { s/bin/lib/ }
    print "@libdirs\n";

    /usr/lib /lib /usr/local/lib

Because of precedence, parentheses are required when combining an assignment if you wish to change the result in the leftmost variable. The result of a substitution is its success: either "" for failure, or an integer number of times the substitution was done. Contrast this with the preceding examples where the parentheses surround the assignment itself.
For example:

    ($a = $b) =~ s/x/y/g;       # 1: copy $b and then change $a
    $a = ($b =~ s/x/y/g);       # 2: change $b, count goes in $a
    $a =  $b =~ s/x/y/g;        # 3: same as 2

6.1.4 See Also

The "Variables" section of Chapter 2 of Programming Perl, and the "Assignment Operators" section of perlop(1) and Chapter 3 of Programming Perl

Recipe 6.2 Matching Letters

6.2.1 Problem

You want to see whether a string contains only alphabetic characters.

6.2.2 Solution

The obvious character class for matching regular letters isn't good enough in the general case:

    if ($var =~ /^[A-Za-z]+$/) {
        # it is purely alphabetic
    }

because it doesn't pay attention to letters with diacritics or characters from other writing systems. The best solution is to use Unicode properties:

    if ($var =~ /^\p{Alphabetic}+$/) {      # or just /^\pL+$/
        print "var is purely alphabetic\n";
    }

On older releases of Perl that don't support Unicode, your only real option was to use either a negated character class:

    if ($var =~ /^[^\W\d_]+$/) {
        print "var is purely alphabetic\n";
    }

or, if supported, POSIX character classes:

    if ($var =~ /^[[:alpha:]]+$/) {
        print "var is purely alphabetic\n";
    }

But these don't work for non-ASCII letters unless you use locale and the system you're running on actually supports POSIX locales.

6.2.3 Discussion

Apart from Unicode properties or POSIX character classes, Perl can't directly express "something alphabetic" independent of locale, so we have to be more clever. The \w regular expression notation matches one alphabetic, numeric, or underscore character—hereafter known as an "alphanumunder" for short. Therefore, \W is one character that is not one of those. The negated character class [^\W\d_] specifies a character that must be neither a non-alphanumunder, a digit, nor an underscore. That leaves nothing but alphabetics, which is what we were looking for.
Here's how you'd use this in a program: use locale; use POSIX 'locale_h'; # the following locale string might be different on your system unless (setlocale(LC_ALL, "fr_CA.ISO8859-1")) { die "couldn't set locale to French Canadian\n"; } while (<DATA>) { chomp; if (/^[^\W\d_]+$/) { print "$_: alphabetic\n"; } else { print "$_: line noise\n"; } } __END__ silly façade coöperate niño Renée Molière hæmoglobin naïve tschüß random!stuff#here POSIX character classes help a little here; available ones are alpha, alnum, ascii, blank, cntrl, digit, graph, lower, print, punct, space, upper, word, and xdigit. These are valid only within a square-bracketed character class specification: $phone =~ /\b[:digit:]{3}[[:space:][:punct:]]?[:digit:]{4}\b/; # WRONG $phone =~ /\b[[:digit:]]{3}[[:space:][:punct:]]?[[:digit:]]{4}\b/; # RIGHT It would be easier to use properties instead, because they don't have to occur only within other square brackets: $phone =~ /\b\p{Number}{3}[\p{Space}\p{Punctuation}]?\p{Number}{4}\b/; $phone =~ /\b\pN{3}[\pS\pP]?\pN{4}\b/; # abbreviated form Match any one character with Unicode property prop using \p{prop}; to match any character lacking that property, use \P{prop} or [^\p{prop}]. The relevant property when looking for alphabetics is Alphabetic, which can be abbreviated as simply Letter or even just L. Other relevant properties include UppercaseLetter, LowercaseLetter, and TitlecaseLetter; their short forms are Lu, Ll, and Lt, respectively. 6.2.4 See Also The treatment of locales in Perl in perllocale(1); your system's locale(3) manpage; we discuss locales in greater depth in Recipe 6.12; the "Perl and the POSIX Locale" section of Chapter 7 of Mastering Regular Expressions; also much of that book's Chapter 3 Recipe 6.3 Matching Words 6.3.1 Problem You want to pick out words from a string.
6.3.2 Solution Think hard about what you want a word to be and what separates one word from the next, and then write a regular expression that encodes your decisions. For example: /\S+/ # as many non-whitespace characters as possible /[A-Za-z'-]+/ # as many letters, apostrophes, and hyphens 6.3.3 Discussion Because words vary between applications, languages, and input streams, Perl does not have built-in definitions of words. You must make them from character classes and quantifiers yourself, as we did previously. The second pattern is an attempt to recognize "shepherd's" and "sheep-shearing" each as single words. Most approaches have limitations because of the vagaries of written language. For instance, although the second pattern successfully identifies "spank'd" and "counter-clockwise" as words, it also pulls the "rd" out of "23rd Psalm". To be more precise when pulling words out from a string, specify the characters surrounding the word. Normally, this should be a word boundary, not whitespace: /\b([A-Za-z]+)\b/ # usually best /\s([A-Za-z]+)\s/ # fails at ends or w/ punctuation Although Perl provides \w, which matches a character that is part of a valid Perl identifier, Perl identifiers are rarely what you think of as words, since we mean a string of alphanumerics and underscores, but not colons or quotes. Because it's defined in terms of \w, \b may surprise you if you expect to match an English word boundary (or, even worse, a Mongolian word boundary). \b and \B can still be useful. For example, /\Bis\B/ matches the string "is" within a word only, not at the edges. And while "thistle" would be found, "vis-à-vis" wouldn't. 6.3.4 See Also The treatment of \b, \w, and \s in perlre(1) and Chapter 5 of Programming Perl; the words-related patterns in Recipe 6.23 Recipe 6.4 Commenting Regular Expressions 6.4.1 Problem You want to make your complex regular expressions understandable and maintainable.
6.4.2 Solution You have several techniques at your disposal: electing alternate delimiters to avoid so many backslashes, placing comments outside the pattern or inside it using the /x modifier, and building up patterns piecemeal in named variables. 6.4.3 Discussion The piece of sample code in Example 6-1 uses the first couple techniques, and its initial comment describes the overall intent of the regular expression. For simple patterns, this may be all that is needed. More complex patterns, as in the example, require more documentation. Example 6-1. resname #!/usr/bin/perl -p # resname - change all "hostname" style names in the input stream # into "hostname [address]" (or whatever) instead use Socket; # load inet_ntoa s{ ( # capture the hostname in $1 (?: # these parens for grouping only (?! [-_] ) # lookahead for neither underscore nor dash [\w-] + # hostname component \. # and the domain dot ) + # now repeat that whole thing a bunch of times [A-Za-z] # next must be a letter [\w-] + # now trailing domain part ) # end of $1 capture }{ # replace with this: "$1 " . # the original bit, plus a space ( ($addr = gethostbyname($1)) # if we get an addr ? "[" . inet_ntoa($addr) . "]" # format it : "[???]" # else mark dubious ) }gex; # /g for global # /e for execute # /x for nice formatting For aesthetics, the example uses alternate delimiters. When you split your match or substitution over multiple lines, using matching braces aids readability. A more common use of alternate delimiters is for patterns and replacements that themselves contain slashes, such as in s/\/\//\/..\//g. Alternate delimiters, as in s!//!/../!g or s{//}{/../}g, avoid escaping the non-delimiting slashes with backslashes, again improving legibility. The /x pattern modifier makes Perl ignore whitespace in the pattern (outside a character class) and treat # characters and their following text as comments. The /e modifier changes the replacement portion from a string into code to run.
Since it's code, you can put regular comments there, too. To include literal whitespace or # characters in a pattern to which you've applied /x, escape them with a backslash: s/ # replace \# # a pound sign (\w+) # the variable name \# # another pound sign /${$1}/xg; # with the value of the global variable Remember that comments should explain what you're doing and why, not merely restate the code. Using "$i++ # add one to i" is apt to lose points in your programming course or at least get you talked about in substellar terms by your coworkers. The last technique for rendering patterns more legible (and thus, more maintainable) is to place each semantic unit into a variable given an appropriate name. We use single quotes instead of doubles so backslashes don't get lost. $optional_sign = '[-+]?'; $mandatory_digits = '\d+'; $decimal_point = '\.?'; $optional_digits = '\d*'; $number = $optional_sign . $mandatory_digits . $decimal_point . $optional_digits; Then use $number in further patterns: if (/($number)/) { # parse out one $found = $1; } @allnums = /$number/g; # parse all out unless (/^$number$/) { # any extra? print "need a number, just a number\n"; } We can even combine all of these techniques: # check for line of whitespace-separated numbers m{ ^ \s * # optional leading whitespace $number # at least one number (?: # begin optional cluster \s + # must have some separator $number # more the next one ) * # repeat at will \s * $ # optional trailing whitespace }x which is certainly a lot better than writing: /^\s*[-+]?\d+\.?\d*(?:\s+[-+]?\d+\.?\d*)*\s*/ Patterns that you put in variables should probably not contain capturing parentheses or backreferences, since a capture in one variable could change the numbering of those in others. Clustering parentheses—that is, /(?:...)/ instead of /(...)/—though, are fine. Not only are they fine, they're necessary if you want to apply a quantifier to the whole variable. For example: $number = "(?:" . $optional_sign . 
$mandatory_digits . $decimal_point . $optional_digits . ")"; Now you can say /$number+/ and have the plus apply to the whole number group. Without the grouping, the plus would have shown up right after the last star, which would have been illegal. One more trick with clustering parentheses is that you can embed a modifier switch that applies only to that cluster. For example: $hex_digit = '(?i:[0-9a-z])'; $hdr_line = '(?m:[^:]*:.*)'; The qr// construct does this automatically using cluster parentheses, enabling any modifiers you specified and disabling any you didn't for that cluster: $hex_digit = qr/[0-9a-z]/i; $hdr_line = qr/^[^:]*:.*/m; print "hex digit is: $hex_digit\n"; print "hdr line is: $hdr_line\n"; hex digit is: (?i-xsm:[0-9a-z]) hdr line is: (?m-xis:^[^:]*:.*) It's probably a good idea to use qr// in the first place: $optional_sign = qr/[-+]?/; $mandatory_digits = qr/\d+/; $decimal_point = qr/\.?/; $optional_digits = qr/\d*/; $number = qr{ $optional_sign $mandatory_digits $decimal_point $optional_digits }x; Although the output can be a bit odd to read: print "Number is $number\n"; Number is (?x-ism: (?-xism:[-+]?) (?-xism:\d+) (?-xism:\.?) (?-xism:\d*) ) 6.4.4 See Also The /x modifier in perlre(1) and Chapter 5 of Programming Perl; the "Comments Within a Regular Expression" section of Chapter 7 of Mastering Regular Expressions Recipe 6.5 Finding the Nth Occurrence of a Match 6.5.1 Problem You want to find the Nth match in a string, not just the first one. For example, you'd like to find the word preceding the third occurrence of "fish": One fish two fish red fish blue fish 6.5.2 Solution Use the /g modifier in a while loop, keeping count of matches: $WANT = 3; $count = 0; while (/(\w+)\s+fish\b/gi) { if (++$count == $WANT) { print "The third fish is a $1 one.\n"; # Warning: don't `last' out of this loop } } The third fish is a red one.
Or use a repetition count and repeated pattern like this: /(?:\w+\s+fish\s+){2}(\w+)\s+fish/i; 6.5.3 Discussion As explained in this chapter's Introduction, using the /g modifier in scalar context creates something of a progressive match, useful in while loops. This is commonly used to count the number of times a pattern matches in a string: # simple way with while loop $count = 0; while ($string =~ /PAT/g) { $count++; # or whatever you'd like to do here } # same thing with trailing while $count = 0; $count++ while $string =~ /PAT/g; # or with for loop for ($count = 0; $string =~ /PAT/g; $count++) { } # Similar, but this time count overlapping matches $count++ while $string =~ /(?=PAT)/g; To find the Nth match, it's easiest to keep your own counter. When you reach the appropriate N, do whatever you care to. A similar technique could be used to find every Nth match by checking for multiples of N using the modulus operator. For example, (++$count % 3) == 0 would be used to find every third match. If this is too much bother, you can always extract all matches and then hunt for the ones you'd like. $pond = 'One fish two fish red fish blue fish'; # using a temporary @colors = ($pond =~ /(\w+)\s+fish\b/gi); # get all matches $color = $colors[2]; # then the one we want # or without a temporary array $color = ( $pond =~ /(\w+)\s+fish\b/gi )[2]; # just grab element 3 print "The third fish in the pond is $color.\n"; The third fish in the pond is red. To find all even-numbered fish: $count = 0; $_ = 'One fish two fish red fish blue fish'; @evens = grep { $count++ % 2 == 0 } /(\w+)\s+fish\b/gi; print "Even numbered fish are @evens.\n"; Even numbered fish are two blue. For substitution, the replacement value should be a code expression that returns the proper string. Make sure to return the original as a replacement string for cases you aren't interested in changing.
Here we fish out the fourth specimen and turn it into a snack: $count = 0; s{ \b # makes next \w more efficient ( \w+ ) # this is what we'll be changing ( \s+ fish \b ) }{ if (++$count == 4) { "sushi" . $2; } else { $1 . $2; } }gex; One fish two fish red fish sushi fish Picking out the last match instead of the first one is a fairly common task. The easiest way is to skip the beginning part greedily. After /.*\b(\w+)\s+fish\b/s, for example, the $1 variable has the last fish. Another way to get arbitrary counts is to make a global match in list context to produce all hits, then extract the desired element of that list: $pond = 'One fish two fish red fish blue fish swim here.'; $color = ( $pond =~ /\b(\w+)\s+fish\b/gi )[-1]; print "Last fish is $color.\n"; Last fish is blue. To express this same notion of finding the last match in a single pattern without /g, use the negative lookahead assertion (?!THING). When you want the last match of arbitrary pattern P, you find P followed by any amount of not P through the end of the string. The general construct is P(?!.*P), which can be broken up for legibility: m{ P # find some pattern P (?! # mustn't be able to find .* # something P # and P ) }xs That leaves us with this approach for selecting the last fish: $pond = 'One fish two fish red fish blue fish swim here.'; if ($pond =~ m{ \b ( \w+) \s+ fish \b (?! .* \b fish \b ) }six ) { print "Last fish is $1.\n"; } else { print "Failed!\n"; } Last fish is blue. This approach has the advantage that it can fit in just one pattern, which makes it suitable for similar situations as shown in Recipe 6.18. It has its disadvantages, though. It's obviously much harder to read and understand, although once you learn the formula, it's not too bad. However, it also runs more slowly—around half as fast on the data set tested here.
6.5.4 See Also The behavior of m//g in scalar context is given in the "Regexp Quote-like Operators" section of perlop(1), and in the "Pattern Matching Operators" section of Chapter 5 of Programming Perl; zero-width positive lookahead assertions are shown in the "Regular Expressions" section of perlre(1), and in the "Fancy Patterns" section of Chapter 5 of Programming Perl Recipe 6.6 Matching Within Multiple Lines 6.6.1 Problem You want to use regular expressions on a string containing more than one logical line, but the special characters . (any character but newline), ^ (start of string), and $ (end of string) don't seem to work for you. This might happen if you're reading in multiline records or the whole file at once. 6.6.2 Solution Use /m, /s, or both as pattern modifiers. /s allows . to match a newline (normally it doesn't). If the target string has more than one line in it, /foo.*bar/s could match a "foo" on one line and a "bar" on a following line. This doesn't affect dots in character classes like [#%.], since they are literal periods anyway. The /m modifier allows ^ and $ to match immediately before and after an embedded newline, respectively. /^=head[1-7]/m would match that pattern not just at the beginning of the record, but anywhere right after a newline as well. 6.6.3 Discussion A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. If the pattern involves dot, such as .+ or .*?, and must match across newlines, you need to do something special to make dot match a newline; ordinarily, it does not. When you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.
The difference between /m and /s is important: /m allows ^ and $ to match next to an embedded newline, whereas /s allows . to match newlines. You can even use them together—they're not mutually exclusive. Example 6-2 creates a simplistic filter to strip HTML tags out of each file in @ARGV and then send those results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because @ARGV could have several arguments in it. If so, each readline would fetch the entire contents of one file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems. Example 6-2. killtags #!/usr/bin/perl # killtags - very bad html tag killer undef $/; # each read is whole file while (<>) { # get one whole file at a time s/<.*?>//gs; # strip tags (terribly) print; # print file to STDOUT } Because this is just a single character, it would be much faster to use s/<[^>]*>//gs, but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<<Ooh la la!>>). Recipe 20.6 explains how to avoid these problems. Example 6-3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level-one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments. Example 6-3. headerfy #!/usr/bin/perl # headerfy: change certain chapter headers to html $/ = ''; while (<>) { # fetch a paragraph s{ \A # start of record ( # capture in $1 Chapter # text string \s+ # mandatory whitespace \d+ # decimal number \s* # optional whitespace : # a real colon .* # anything not a newline till end of line ) }{<H1>$1</H1>}gx; print; } Here it is as a one-liner from the command line for those of you for whom the extended comments just get in the way of understanding: % perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile This problem is interesting because we need to be able to specify start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record, but we need $ to indicate not only end-of-record, but end-of-line as well. We add the /m modifier, which changes ^ and $. Instead of using ^ to match beginning-of-record, we use \A instead. We're not using it here, but in case you're interested, the version of $ that always matches end-of-record with an optional newline, even in the presence of /m, is \Z. To match the real end without the optional newline, use \z. The following example demonstrates using /s and /m together. That's because we want ^ to match the beginning of any line in the paragraph; we also want dot to match a newline. The predefined variable $. represents the record number of the filehandle most recently read from using readline(FH) or <FH>. The predefined variable $ARGV is the name of the file that's automatically opened by implicit <ARGV> processing. $/ = ''; # paragraph read mode while (<>) { while (/^START(.*?)^END/sm) { # /s makes . span line boundaries # /m makes ^ match near newlines print "chunk $. in $ARGV has <<$1>>\n"; } } If you're already committed to the /m modifier, use \A and \Z for the old meanings of ^ and $, respectively. But what if you've used the /s modifier and want the original meaning of dot? You use [^\n]. Finally, although $ and \Z can match one before the end of a string if that last character is a newline, \z matches only at the very end of the string.
We can use lookaheads to define the other two as shortcuts involving \z: $ without /m (?=\n?\z) $ with /m (?=\n)|\z \Z always (?=\n?\z) 6.6.4 See Also The $/ variable in perlvar(1) and in the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl; the /s and /m modifiers in perlre(1) and "The Fine Print" section of Chapter 2 of Programming Perl; the "Anchors and Other Zero-Width Assertions" section in Chapter 3 of Mastering Regular Expressions; we talk more about the special variable $/ in Chapter 8 Recipe 6.7 Reading Records with a Separator 6.7.1 Problem You want to read records separated by a pattern, but Perl doesn't allow its input record separator variable to be a regular expression. Many problems, most obviously those involving parsing complex file formats, become simpler when you can extract records separated by different strings. 6.7.2 Solution Read the whole file and use split: undef $/; @chunks = split(/pattern/, <>); 6.7.3 Discussion Perl's official record separator, the $/ variable, must be a fixed string, not a pattern. To sidestep this limitation, undefine the input record separator entirely so that the next readline operation reads the rest of the file. This is sometimes called slurp mode, because it slurps in the whole file as one big string. Then split that huge string using the record separating pattern as the first argument. Here's an example where the input stream is a text file that includes lines consisting of ".Se", ".Ch", and ".Ss", which are special codes in the troff macro set that this book was developed under. These strings are the separators, and we want to find text that falls between them. # .Ch, .Se and .Ss divide chunks of STDIN { local $/ = undef; @chunks = split(/^\.(Ch|Se|Ss)$/m, <>); } print "I read ", scalar(@chunks), " chunks.\n"; We create a localized version of $/ so its previous value is restored once the block finishes.
By using split with parentheses in the pattern, captured separators are also returned. This way data elements in the return list alternate with elements containing "Se", "Ch", or "Ss". If you don't want separators returned, but still need parentheses, use non-capturing parentheses in the pattern: /^\.(?:Ch|Se|Ss)$/m. To split before a pattern but include the pattern in the return, use a lookahead assertion: /^(?=\.(?:Ch|Se|Ss))/m. That way each chunk except the first starts with the pattern. Be aware that this uses a lot of memory when the file is large. However, with today's machines and typical text files, this is less often an issue now than it once was. Just don't try it on a 200 MB logfile unless you have plenty of virtual memory for swapping out to disk! Even if you do have enough swap space, you'll likely end up thrashing. 6.7.4 See Also The $/ variable in perlvar(1) and in the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl; the split function in perlfunc(1) and Chapter 29 of Programming Perl; we talk more about the special variable $/ in Chapter 8. Recipe 6.8 Extracting a Range of Lines 6.8.1 Problem You want to extract all lines from a starting pattern through an ending pattern or from a starting line number up to an ending line number. A common example of this is extracting the first 10 lines of a file (line numbers 1 to 10) or just the body of a mail message (everything past the blank line). 6.8.2 Solution Use the operators .. or ... with patterns or line numbers. The .. operator will test the right operand on the same iteration that the left operand flips the operator into the true state. while (<>) { if (/BEGIN PATTERN/ .. /END PATTERN/) { # line falls between BEGIN and END in the # text, inclusive. } } while (<>) { if (FIRST_LINE_NUM .. LAST_LINE_NUM) { # operate only between first and last line, inclusive. } } But the ... operator waits until the next iteration to check the right operand.
while (<>) { if (/BEGIN PATTERN/ ... /END PATTERN/) { # line is between BEGIN and END on different lines } } while (<>) { if (FIRST_LINE_NUM ... LAST_LINE_NUM) { # operate only between first and last line, not inclusive } } 6.8.3 Discussion The range operators, .. and ..., are probably the least understood of Perl's myriad operators. They were designed to allow easy extraction of ranges of lines without forcing the programmer to retain explicit state information. Used in scalar context, such as in the test of if and while statements, these operators return a true or false value that's partially dependent on what they last returned. The expression left_operand .. right_operand returns false until left_operand is true, but once that test has been met, it stops evaluating left_operand and keeps returning true until right_operand becomes true, after which it restarts the cycle. Put another way, the first operand turns on the construct as soon as it returns a true value, whereas the second one turns it off as soon as it returns true. The two operands are completely arbitrary. You could write mytestfunc1() .. mytestfunc2(), although this is rarely seen. Instead, the range operators are usually used with either line numbers as operands (the first example), patterns as operands (the second example), or both. # command-line to print lines 15 through 17 inclusive (see below) perl -ne 'print if 15 .. 17' datafile # print all <TABLE> .. </TABLE> displays from HTML doc while (<>) { print if m#<TABLE>#i .. m#</TABLE>#i; } # same, but as shell command % perl -ne 'print if m#<TABLE>#i .. m#</TABLE>#i' document.html If either operand is a numeric literal, the range operators implicitly compare against the $. variable ($NR or $INPUT_LINE_NUMBER if you use English). Be careful with implicit line number comparisons here. You must specify literal numbers in your code, not variables containing line numbers. That means you simply say 3 .. 5 in a conditional, but not $n .. $m where $n and $m are 3 and 5 respectively.
For that, be more explicit by testing the $. variable directly. perl -ne 'BEGIN { $top=3; $bottom=5 } print if $top .. $bottom' /etc/passwd # WRONG perl -ne 'BEGIN { $top=3; $bottom=5 } print if $. == $top .. $. == $bottom' /etc/passwd # RIGHT perl -ne 'print if 3 .. 5' /etc/passwd # also RIGHT The difference between .. and ... is their behavior when both operands become true on the same iteration. Consider these two cases: print if /begin/ .. /end/; print if /begin/ ... /end/; Given the line "You may not end ere you begin", both versions of the previous range operator return true. But the code using .. won't print any further lines. That's because .. tests both conditions on the same line once the first test matches, and the second test tells it that it's reached the end of its region. On the other hand, ... continues until the next line that matches /end/ because it never tries to test both operands on the same line. You may mix and match conditions of different sorts, as in: while (<>) { $in_header = 1 .. /^$/; $in_body = /^$/ .. eof(); } The first assignment sets $in_header to be true from the first input line until after the blank line separating the header, such as from a mail message, a USENET news posting, or even an HTTP header. (Technically, an HTTP header should have linefeeds and carriage returns as network line terminators, but in practice, servers are liberal in what they accept.) The second assignment sets $in_body to true as soon as the first blank line is encountered, up through end-of-file. Because range operators do not retest their initial condition, any further blank lines, like those between paragraphs, won't be noticed. Here's an example. It reads files containing mail messages and prints addresses it finds in headers. Each address is printed only once. The extent of the header is from a line beginning with a "From:" up through the first blank line. If we're not within that range, go on to the next line.
This isn't an RFC-822 notion of an address, but it is easy to write. %seen = (); while (<>) { next unless /^From:?\s/i .. /^$/; while (/([^<>(),;\s]+\@[^<>(),;\s]+)/g) { print "$1\n" unless $seen{$1}++; } } 6.8.4 See Also The .. and ... operators in the "Range Operator" sections of perlop(1) and Chapter 3 of Programming Perl; the entry for $NR in perlvar(1) and the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl Recipe 6.9 Matching Shell Globs as Regular Expressions 6.9.1 Problem You want to allow users to specify matches using traditional shell wildcards, not full Perl regular expressions. Wildcards are easier to type than full regular expressions for simple cases. 6.9.2 Solution Use the following subroutine to convert four shell wildcard characters into their equivalent regular expression; all other characters are quoted to render them literals. sub glob2pat { my $globstr = shift; my %patmap = ( '*' => '.*', '?' => '.', '[' => '[', ']' => ']', ); $globstr =~ s{(.)} { $patmap{$1} || "\Q$1" }ge; return '^' . $globstr . '$'; } 6.9.3 Discussion A Perl regex pattern is not the same as a shell wildcard pattern. The shell's *.* is not a valid regular expression. Its meaning as a pattern would be /^.*\..*$/s, which is admittedly much less fun to type. The function given in the Solution makes these conversions for you, following the standard wildcard rules used by the glob built-in. Table 6-1 shows equivalent wildcard patterns in the shell and in Perl. Table 6-1. Shell globs and equivalent Perl wildcard patterns Shell Perl list.? ^list\..$ project.* ^project\..*$ *old ^.*old$ type*.[ch] ^type.*\.[ch]$ *.* ^.*\..*$ * ^.*$ The function returns a string, not a regex object, because the latter would lock in (and out) any modifier flags, such as /i, but we'd rather delay that decision until later. Shell wildcard rules are different from those of a regular expression.
The entire pattern is implicitly anchored at the ends; a question mark maps into any character; an asterisk is any amount of anything; and brackets are character ranges. Everything else is a literal. Most shells do more than simple one-directory globbing. For instance, */* means "all files (including directory files) in all subdirectories of the current directory." Also, shells usually don't expand wildcards to include files with names beginning with a period; you usually have to put that leading period into your glob pattern explicitly. Our glob2pat function doesn't do these things—if you need them, use the File::KGlob module from CPAN. 6.9.4 See Also Your system manpages for the various shells, such as csh(1), tcsh(1), sh(1), ksh(1), and bash(1); the glob function in perlfunc(1) and Chapter 29 of Programming Perl; the documentation for the CPAN module Glob::DosGlob; the "I/O Operators" section of perlop(1); we talk more about globbing in Recipe 9.6
Recipe 6.10 Speeding Up Interpolated Matches 6.10.1 Problem You want your function or program to take one or more regular expressions as arguments, but doing so seems to run slower than using literals. 6.10.2 Solution To overcome this bottleneck, if you have only one pattern whose value won't change during the entire run of a program, store it in a string and use /$pattern/o: while ($line = <>) { if ($line =~ /$pattern/o) { # do something } } However, that won't work for more than one pattern. Precompile the pattern strings using the qr// operator, then match each result against each of the targets: @pats = map { qr/$_/ } @strings; while ($line = <>) { for $pat (@pats) { if ($line =~ /$pat/) { # do something; } } } 6.10.3 Discussion When Perl compiles a program, it converts patterns into an internal form. This conversion occurs at compile time for patterns without variables, but at runtime for those that do. Interpolating variables into patterns, as in /$pattern/, can slow your program down—sometimes substantially. This is particularly noticeable when $pattern changes often. The /o modifier locks in the values from variables interpolated into the pattern. That is, variables are interpolated only once: the first time the match is run. Because Perl ignores any later changes to those variables, make sure to use it only on unchanging variables. Using /o on patterns without interpolated variables doesn't hurt, but it also doesn't help.
The /o modifier is also of no help when you have an unknown number of regular expressions and need to check one or more strings against all of those patterns, since you need to vary the patterns' contents. Nor is it of any use when the interpolated variable is a function argument, since each call to the function gives the variable a new value.

Example 6-4 shows the slow but straightforward technique for matching many patterns against many lines. The array @popstates contains the standard two-letter abbreviations for some of the places in the heartland of North America where we normally refer to soft drinks as pop (soda to us means either plain soda water or else handmade delicacies from the soda fountain at the corner drugstore, preferably with ice cream). The goal is to print any line of input that contains any of those places, matching them at word boundaries only. It doesn't use /o, because the variable that holds the pattern keeps changing.

Example 6-4. popgrep1

    #!/usr/bin/perl
    # popgrep1 - grep for abbreviations of places that say "pop"
    # version 1: slow but obvious way
    @popstates = qw(CO ON MI WI MN);
    LINE: while (defined($line = <>)) {
        for $state (@popstates) {
            if ($line =~ /\b$state\b/) {    # this is s l o o o w
                print $line;
                next LINE;
            }
        }
    }

Such a direct, obvious, brute-force approach is also distressingly slow, because Perl has to recompile all the patterns with each line of input. A better solution is the qr// operator (used in Example 6-5), which first appeared in v5.6 and offers a way around this bottleneck. The qr// operator quotes and possibly compiles its string argument, returning a scalar to use in later pattern matches. If that scalar is used by itself in the interpolated match, Perl uses the cached compiled form and so avoids recompiling the pattern.

Example 6-5.
popgrep2

    #!/usr/bin/perl
    # popgrep2 - grep for abbreviations of places that say "pop"
    # version 2: fast way using qr//
    @popstates = qw(CO ON MI WI MN);
    @poppats = map { qr/\b$_\b/ } @popstates;
    LINE: while (defined($line = <>)) {
        for $pat (@poppats) {
            if ($line =~ /$pat/) {    # this is fast
                print $line;
                next LINE;
            }
        }
    }

Print the array @poppats and you'll see strings like these:

    (?-xism:\bCO\b)
    (?-xism:\bON\b)
    (?-xism:\bMI\b)
    (?-xism:\bWI\b)
    (?-xism:\bMN\b)

Those are used for the stringified print value of the qr// operator, or to build up a larger pattern if the result is interpolated into a larger string. But each also carries a cached, compiled version of that string as a pattern, and this is what Perl uses when the interpolation into a match or substitution operator contains nothing else.

6.10.4 See Also

The qr// operator in perlop(1) and in the section on "The qr// quote regex operator" in Chapter 5 of Programming Perl

Recipe 6.11 Testing for a Valid Pattern

6.11.1 Problem

You want to let users enter their own patterns, but an invalid one would abort your program the first time you tried to use it.

6.11.2 Solution

Test the pattern in an eval { } construct first, matching it against some dummy string. If $@ is not set, no exception occurred, so you know the pattern successfully compiled as a valid regular expression. Here is a loop that continues prompting until the user supplies a valid pattern:

    do {
        print "Pattern? ";
        chomp($pat = <>);
        eval { "" =~ /$pat/ };
        warn "INVALID PATTERN $@" if $@;
    } while $@;

Here's a standalone subroutine that verifies whether a pattern is valid:

    sub is_valid_pattern {
        my $pat = shift;
        eval { "" =~ /$pat/ };
        return $@ ? 0 : 1;
    }

Another way to write that is:

    sub is_valid_pattern {
        my $pat = shift;
        return eval { "" =~ /$pat/; 1 } || 0;
    }

This version doesn't need to use $@, because if the pattern match executes without exception, the next statement with just a 1 is reached and returned.
Otherwise it's skipped, so just a 0 is returned.

6.11.3 Discussion

There's no limit to the number of invalid, uncompilable patterns. The user could mistakenly enter patterns such as "*** GET RICH ***" or "+5-i". If you blindly use the proffered pattern in your program, it raises an exception, normally a fatal event. The tiny program in Example 6-6 demonstrates this.

Example 6-6. paragrep

    #!/usr/bin/perl
    # paragrep - trivial paragraph grepper
    die "usage: $0 pat [files]\n" unless @ARGV;
    $/ = '';
    $pat = shift;
    eval { "" =~ /$pat/; 1 } or die "$0: Bad pattern $pat: $@\n";
    while (<>) {
        print "$ARGV $.: $_" if /$pat/o;
    }

That /o means to interpolate variables once only, even if their contents later change. You could encapsulate the test in a function call that returns 1 if the block completes and 0 if not, as shown in the Solution.

The simpler eval "/$pat/" would also work to trap the exception, but it has two other problems. One is that any slashes (or whatever your chosen pattern delimiter is) in the string the user entered would raise an exception. More importantly, it would open a drastic security hole that you almost certainly want to avoid. Strings like this could ruin your day:

    $pat = "You lose @{[ system('rm -rf *') ]} big here";

If you don't want to let the user provide a real pattern, you can always metaquote the string first:

    $safe_pat = quotemeta($pat);
    something() if /$safe_pat/;

Or, even easier, use:

    something() if /\Q$pat/;

But if you're going to do that, why are you using pattern matching at all? In that case, a simple use of index would be enough. Sometimes, though, you want a literal part and a regex part, such as:

    something() if /^\s*\Q$pat\E\s*$/;

Letting the user supply a real pattern gives them power enough for many interesting and useful operations. This is a good thing. You just have to be slightly careful. Suppose they wanted to enter a case-insensitive pattern, but you didn't provide the program with an option like grep's -i option.
By permitting full patterns, the user can enter an embedded /i modifier as (?i), as in /(?i)stuff/.

What happens if the interpolated pattern expands to nothing? If $pat is the empty string, what does /$pat/ match—that is, what does a blank // match? It doesn't match the start of all possible strings. Surprisingly, matching the null pattern exhibits the dubiously useful semantics of reusing the previous successfully matched pattern. In practice, this is hard to make good use of in Perl.

6.11.4 See Also

The eval function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 10.12

Recipe 6.12 Honoring Locale Settings in Regular Expressions

6.12.1 Problem

You want to translate case when in a different locale, or you want to make \w match letters with diacritics, such as in José or déjà vu. For example, let's say you're given half a gigabyte of text written in German and told to index it. You want to extract words (with \w+) and convert them to lowercase (with lc or \L), but the normal versions of \w and lc neither match the German words nor change the case of accented letters.

6.12.2 Solution

Perl's regular-expression and text-manipulation routines have hooks to the POSIX locale setting. Under the use locale pragma, accented characters are taken care of—assuming a reasonable LC_CTYPE specification and system support for the same.

    use locale;

6.12.3 Discussion

By default, \w+ and the case-mapping functions operate on upper- and lowercase letters, digits, and underscores. This works only for the simplest of English words, failing even on many common imports. The use locale directive redefines what a "word character" means. In Example 6-7 you see the difference in output between having selected the English ("en") locale and the German ("de") one.

Example 6-7.
localeg

    #!/usr/bin/perl -w
    # localeg - demonstrate locale effects
    use locale;
    use POSIX 'locale_h';

    $name = "andreas k\xF6nig";
    @locale{qw(German English)} = qw(de_DE.ISO_8859-1 us-ascii);

    setlocale(LC_CTYPE, $locale{English})
        or die "Invalid locale $locale{English}";
    @english_names = ();
    while ($name =~ /\b(\w+)\b/g) {
        push(@english_names, ucfirst($1));
    }

    setlocale(LC_CTYPE, $locale{German})
        or die "Invalid locale $locale{German}";
    @german_names = ();
    while ($name =~ /\b(\w+)\b/g) {
        push(@german_names, ucfirst($1));
    }

    print "English names: @english_names\n";
    print "German names:  @german_names\n";

    English names: Andreas K Nig
    German names:  Andreas König

This approach relies on POSIX locale support for 8-bit character encodings, which your system may or may not provide. Even if your system does claim to provide POSIX locale support, the standard does not specify the locale names. As you might guess, portability of this approach is not assured. If your data is already in Unicode, you don't need POSIX locales for this to work.

6.12.4 See Also

The treatment of \b, \w, and \s in perlre(1) and in the "Classic Perl Character Class Shortcuts" section of Chapter 5 of Programming Perl; the treatment of locales in Perl in perllocale(1); your system's locale(3) manpage; we discuss locales in greater depth in Recipe 6.2; the "POSIX—An Attempt at Standardization" section of Chapter 3 of Mastering Regular Expressions

Recipe 6.13 Approximate Matching

6.13.1 Problem

You want to match fuzzily—that is, allowing for a margin of error where the string doesn't quite match the pattern. Whenever you want to be forgiving of misspellings in user input, you want fuzzy matching.

6.13.2 Solution

Use the String::Approx module, available from CPAN:

    use String::Approx qw(amatch);

    if (amatch("PATTERN", @list)) {
        # matched
    }

    @matches = amatch("PATTERN", @list);

6.13.3 Discussion

String::Approx calculates the difference between the pattern and each string in the list.
If fewer than a certain number—by default, 10 percent of the pattern's length—of one-character insertions, deletions, or substitutions are required to make the string fit the pattern, it still matches. In scalar context, amatch returns the number of successful matches. In list context, it returns the strings that matched.

    use String::Approx qw(amatch);
    open(DICT, "/usr/dict/words") or die "Can't open dict: $!";
    while (<DICT>) {
        print if amatch("balast");
    }

    ballast
    balustrade
    blast
    blastula
    sandblast

Options passed to amatch control case-sensitivity and the permitted numbers of insertions, deletions, and substitutions. These are fully described in the String::Approx documentation.

The module's matching function seems to run between 10 and 40 times slower than Perl's built-in pattern matching. So use String::Approx only if you're after a fuzziness in your matching that Perl's patterns can't provide.

6.13.4 See Also

The documentation for the CPAN module String::Approx; Recipe 1.22

Recipe 6.14 Matching from Where the Last Pattern Left Off

6.14.1 Problem

You want to match again in the same string, starting from where the last match left off. This is a useful approach when repeatedly extracting data in chunks from a string.

6.14.2 Solution

Use a combination of the /g and /c match modifiers, the \G pattern anchor, and the pos function.

6.14.3 Discussion

The /g modifier on a pattern match makes the matching engine keep track of the position in the string where it finished matching. If the next match on that string also uses /g, the engine starts looking for a match from this remembered position. This lets you, for example, use a while loop to progressively extract repeated occurrences of a match. Here we find all non-negative integers:

    while (/(\d+)/g) {
        print "Found number $1\n";
    }

Within a pattern, \G means the end of the previous match.
For example, if you had a number stored in a string with leading blanks, you could change each leading blank into the digit zero this way:

    $n = "   49 here";
    $n =~ s/\G /0/g;
    print $n;
    00049 here

You can also make good use of \G in a while loop. Here we use \G to parse a comma-separated list of numbers (e.g., "3,4,5,9,120"):

    while (/\G,?(\d+)/g) {
        print "Found number $1\n";
    }

By default, when your match fails (when we run out of numbers in these examples, for instance), the remembered position is reset to the start. If you don't want this to happen, perhaps because you want to continue matching from that position but with a different pattern, use the modifier /c with /g:

    $_ = "The year 1752 lost 10 days on the 3rd of September";
    while (/(\d+)/gc) {
        print "Found number $1\n";
    }
    # the /c above left pos at the end of the final match
    if (/\G(\S+)/g) {
        print "Found $1 right after the last number.\n";
    }

    Found number 1752
    Found number 10
    Found number 3
    Found rd right after the last number.

Successive patterns can use /g on a string, which remembers the ending position of the last successful match. That position is associated with the scalar matched against, not with the pattern. It's reset if the string is modified. The position of the last successful match can be directly inspected or altered with the pos function, whose argument is the string whose position you want to get or set. Assign to the function to set the position.
    $a = "Didst thou think that the eyes of the White Tower were blind?";
    $a =~ /(\w{5,})/g;
    print "Got $1, position in \$a is ", pos($a), "\n";
    Got Didst, position in $a is 5

    pos($a) = 30;
    $a =~ /(\w{5,})/g;
    print "Got $1, position in \$a now ", pos($a), "\n";
    Got White, position in $a now 43

Without an argument, pos operates on $_:

    $_ = "Nay, I have seen more than thou knowest, Grey Fool.";
    /(\w{5,})/g;
    print "Got $1, position in \$_ is ", pos, "\n";
    pos = 42;
    /\b(\w+)/g;
    print "Next full word after position 42 is $1\n";

    Got knowest, position in $_ is 39
    Next full word after position 42 is Fool

6.14.4 See Also

The /g and /c modifiers are discussed in perlre(1) and in the "The m// Operator (Matching)" section of Chapter 5 of Programming Perl

Recipe 6.15 Greedy and Non-Greedy Matches

6.15.1 Problem

You have a pattern with a greedy quantifier like *, +, ?, or { }, and you want to stop it from being greedy. A classic example is the naïve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi deletes everything from the first open <TT> tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even effectively", completely changing the meaning of the sentence!

6.15.2 Solution

Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and { } into *?, +?, ??, and { }?, respectively.

6.15.3 Discussion

Perl has two sets of quantifiers: the maximal ones—*, +, ?, and { }—and the minimal ones—*?, +?, ??, and { }?. Less formally, these two sorts of quantifiers are often referred to as greedy and non-greedy (or sometimes lazy), respectively. For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains", whereas /(r.*?s)/ matches "rl is".
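The greedy/non-greedy contrast just described can be checked directly; this snippet merely restates the text's own Swiss Army Chainsaw example.

```perl
#!/usr/bin/perl
# Verify the greedy vs. non-greedy contrast described above.
use strict;
use warnings;

my $s = "Perl is a Swiss Army Chainsaw!";

$s =~ /(r.*s)/;     # greedy: .* grabs through the LAST 's'
print "greedy:     $1\n";    # rl is a Swiss Army Chains

$s =~ /(r.*?s)/;    # minimal: .*? stops at the FIRST 's'
print "non-greedy: $1\n";    # rl is
```

Both matches start at the same "r"; only how far .* is willing to stretch differs.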
With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the "or more" portion of that description. Thus /foo.*bar/ matches the first "foo" through the last "bar" in the string, rather than only through the next "bar", as some might expect. That's because the greedy .* first expands to the rest of the string, but since that wouldn't leave any characters for "bar" to match, the engine backs up one character at a time until it finds "bar".

To make any repetition operator match minimally instead of maximally, add an extra ?. So *? still matches zero or more times, but rather than matching as much as it can, the way * would, it matches as little as it can.

    # greedy pattern
    s/<.*>//gs;     # try to remove tags, very badly

    # nongreedy pattern
    s/<.*?>//gs;    # try to remove tags, better (but still rather badly)

This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is seldom an acceptable replacement for a real parser. See Recipe 20.6 for the right way.

Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern amidst other elements will always match the shortest amount of text between occurrences of BEGIN and END. Consider the pattern /BEGIN(.*?)END/. If matched against the string "BEGIN and BEGIN and END", $1 would contain "and BEGIN and". This is probably not what you want.

Imagine trying to pull out everything between bold-italic pairs:

    <b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic HTML pairs—that is, text that doesn't include them—might appear to be:

    m{ <b><i>(.*?)</i></b> }sx

You might be surprised to learn that the pattern doesn't find such pairs. Many people incorrectly understand this as matching a "<b><i>" sequence, then anything up to a "</i></b>" sequence, leaving the intervening text in $1.
While it often works out that way due to the input data, that's not what it says. There's nothing in that pattern that says .*? can't match "<b><i>" again (and again and again) before it comes to "</i></b>". If the intention were to extract only stuff between "<b><i>" and its corresponding "</i></b>", with no other bold-italic tags in between, that pattern would be incorrect.

If the string in question is just one character, and if what follows the minimal match is not a literal character, a negated class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. But the general way to say "match BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN and END would be as follows (this also stores the intervening part in $1):

    /BEGIN((?:(?!BEGIN).)*)END/s

or, more legibly:

    m{
        BEGIN               # locate initial portion
        (                   # save this group into $1
            (?:             # non-capturing group
                (?! BEGIN ) # assert: can't be at another BEGIN
                .           # now match any one character
            ) *             # entire group 0 or more times
        )                   # end $1 group
        END                 # locate final portion
    }sx

However, this might not be what you're after, either. The greedy star quantifier means that the non-BEGIN portion in $1 will be maximized, giving fence posts of the last BEGIN through not the first END, but the last one. So if your string were:

    $_ = "BEGIN1 BEGIN2 BEGIN3 3END 2END 1END";

$1 would contain "3 3END 2END 1". Making the quantifier a minimal matching one:

    /BEGIN((?:(?!BEGIN).)*?)END/s

puts "3 3" in $1 for you.

Now add another lookahead negation, (?!END), next to the existing one. Written out with plenty of whitespace, we now have:

    m{
        BEGIN               # locate initial portion
        (                   # save this group into $1
            (?:             # non-capturing group
                (?! BEGIN ) # can't be at a BEGIN
                (?! END )   # also can't be at an END
                .           # finally, match any one char
            ) *             # repeat entire group ad libitum
        )                   # end $1 capture
        END
    }sx

Instead of adding another lookahead, another possibility is to use alternation within the existing one: (?!BEGIN|END).
Applying this approach to the HTML-matching code, we end up with something like:

    m{ <b><i> ( (?: (?! </b> | </i> ) . )* ) </i></b> }sx

or perhaps:

    m{ <b><i> ( (?: (?! </[ib]> ) . )* ) </i></b> }sx

Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed matters, such as:

    m{
        <b><i>
        [^<]*               # stuff not possibly bad, and not possibly the end.
        (?:                 # at this point, we can have '<' if not part of something bad
            (?! </?[ib]> )  # what we can't have
            <               # okay, so match the '<'
            [^<]*           # and continue with more safe stuff
        ) *
        </i></b>
    }sx

This is a variation on Jeffrey's unrolling-the-loop technique, described in Chapter 6 of Mastering Regular Expressions, Second Edition.

6.15.4 See Also

The non-greedy quantifiers in the "Regular Expressions" section of perlre(1) and in Chapter 5 of Programming Perl

Recipe 6.16 Detecting Doubled Words

6.16.1 Problem

You want to check for doubled words in a document.

6.16.2 Solution

Use backreferences in your pattern.

6.16.3 Discussion

Parentheses in a pattern make the matching engine remember what text that portion of the pattern matched. Later in the pattern, refer to the actual string that matched with \1 (indicating the string matched by the first set of parentheses), \2 (for the string matched by the second set of parentheses), and so on. Don't use $1 within a regex, because it would be a variable interpolated before the match began.

The pattern /([A-Z])\1/ matches a capital letter followed not just by any capital letter, but by whichever one was just matched (i.e., captured by the first set of parentheses in that pattern).

The next sample code reads its input files by paragraph, with the definition of paragraph following Perl's notion of a paragraph—a chunk of text terminated by two or more contiguous newlines. Within each paragraph, the code finds all doubled words. It ignores case and can match across newlines. Here we use /x to embed whitespace and comments to improve readability.
The /i permits both instances of "is" in the sentence "Is is this ok?" to match, even though they differ in case. We use /g in a while loop to keep finding doubled words until we run out of text.

    $/ = '';        # paragraph mode
    while (<>) {
        while ( m{
                    \b          # start at a word boundary (begin letters)
                    (\S+)       # find chunk of non-whitespace
                    \b          # until another word boundary (end letters)
                    (
                        \s+     # separated by some whitespace
                        \1      # and that very same chunk again
                        \b      # until another word boundary
                    ) +         # one or more sets of those
                }xig
        )
        {
            print "dup word '$1' at paragraph $.\n";
        }
    }

That code finds the duplicated "test" in the following paragraph:

    This is a test
    test of the doubled word finder.

Word boundary anchors surrounding \S+ are often a bad idea, because they do something you might not be expecting. That's because word boundaries in Perl are defined as transitions between alphanumunders (that's a \w) and either the edge of the string or a non-alphanumunder. Surrounding \S+ with \b subtly changes \S+ from its normal meaning of one or more non-whitespace characters to a stretch of non-whitespace whose first and last characters must be alphanumunders. Sometimes, though, this might be just what you're looking for. Consider the string:

    $string = q("I can't see this," she remarked.);
    @a = $string =~ /\b\S+\b/g;
    @b = $string =~ /\S+/g;

The elements of @a are now:

    0   I
    1   can't
    2   see
    3   this
    4   she
    5   remarked

but those of @b are:

    0   "I
    1   can't
    2   see
    3   this,"
    4   she
    5   remarked.

Here's another interesting demonstration of backreferences. Imagine two words in which the end of the first word is the same as the start of the next one, such as "nobody" and "bodysnatcher". You'd like to find that overlapping part and come up with "nobodysnatcher". This is a variant on the doubled word problem. Conventional character-by-character processing, the way a C programmer would write it, would take a great deal of tricky code. But with a backtracking pattern matcher, it just takes one simple pattern match.
    $a = 'nobody';
    $b = 'bodysnatcher';
    if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
        print "$2 overlaps in $1-$2-$3\n";
    }
    body overlaps in no-body-snatcher

You might think that $1 would first grab up all of "nobody" due to greediness. It does—for a while. But once it's done so, there aren't any more characters to put in $2. So the engine backs up, and $1 begrudgingly gives up one character to $2. The space character matches successfully, but then the engine sees \2, which currently holds a lone "y". The next character in the string is not a "y", but a "b". This makes the engine back up, eventually forcing $1 to surrender enough to $2 that the pattern can match some string, a space, and then that same string again.

That won't quite work out if the overlap is itself the product of a doubling, as in "rococo" and "cocoon". The preceding algorithm would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Adding a minimal matching quantifier to the $1 part gives the much better pattern /^(\w+?)(\w+) \2(\w+)$/, which solves this problem.

Backtracking is more powerful than you might imagine. Example 6-8 offers another take on the prime factorization problem from Chapter 1.

Example 6-8. prime-pattern

    #!/usr/bin/perl
    # prime_pattern -- find prime factors of argument using pattern matching
    for ($N = ('o' x shift); $N =~ /^(oo+?)\1+$/; $N =~ s/$1/o/g) {
        print length($1), " ";
    }
    print length($N), "\n";

Although not practical, this approach marvelously demonstrates the power of backtracking. Here's another example. Using a brilliant insight first illustrated by Doug McIlroy (or so says Andrew Hume), you can find solutions to Diophantine equations of order one with regular expressions. Consider the equation 12x + 15y + 16z = 281. Can you think of possible values for x, y, and z? Perl can!
    # solve for 12x + 15y + 16z = 281, maximizing x
    if (($X, $Y, $Z) = (('o' x 281) =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/)) {
        ($x, $y, $z) = (length($X), length($Y), length($Z));
        print "One solution is: x=$x; y=$y; z=$z.\n";
    } else {
        print "No solution.\n";
    }
    One solution is: x=17; y=3; z=2.

Because the first o* was greedy, x was allowed to grow as large as it could. Changing one or more * quantifiers to *?, +, or +? can produce different solutions:

    ('o' x 281) =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
    One solution is: x=17; y=3; z=2.

    ('o' x 281) =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
    One solution is: x=0; y=7; z=11.

    ('o' x 281) =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
    One solution is: x=1; y=3; z=14.

An important lesson to be learned from these amazing feats of mathematical prowess by a lowly pattern matcher is that a pattern-matching engine, particularly a backtracking one, very much wants to give you an answer, and it will work phenomenally hard to do so. But solving a regular expression with backreferences can take time exponential in the length of the input. For all but trivial inputs, such algorithms make continental drift seem brisk.

6.16.4 See Also

The explanation of backreferences in the "Regular Expressions" section of perlre(1), and in the "The Little Engine That /Could(n't)?/" section of Chapter 5 of Programming Perl; the "The Doubled-Word Thing" section in Chapter 2 of Mastering Regular Expressions

Recipe 6.17 Matching Nested Patterns

6.17.1 Problem

You want to match a nested set of enclosing delimiters, such as the arguments to a function call.

6.17.2 Solution

Use match-time pattern interpolation, recursively:

    my $np;
    $np = qr{
               \(
               (?:
                  (?> [^()]+ )     # Non-capture group w/o backtracking
                |
                  (??{ $np })      # Group with matching parens
               )*
               \)
            }x;

Or use the Text::Balanced module's extract_bracketed function.
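The recursive pattern in the Solution can be exercised on its own before moving on; the candidate strings below are made up for illustration.

```perl
#!/usr/bin/perl
# Exercise a recursive nested-parens pattern, in the style of the
# Solution, against balanced and unbalanced inputs (test strings invented).
use strict;
use warnings;

my $np;
$np = qr{
           \(
           (?:
              (?> [^()]+ )     # plain non-paren text, no backtracking
            |
              (??{ $np })      # recurse for a nested paren group
           )*
           \)
        }x;

for my $candidate ('(1,(2*(3+4)),5)', '(unbalanced(') {
    print "'$candidate' ",
          ($candidate =~ /^$np$/ ? "is" : "is NOT"),
          " a balanced paren group\n";
}
```

The self-reference only fires at match time, which is why $np may be mentioned inside its own qr// before the assignment completes.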
6.17.3 Discussion

The (??{ CODE }) construct runs the code and interpolates the string that the code returns right back into the pattern. A simple, non-recursive example that matches palindromes demonstrates this:

    if ($word =~ /^(\w+)\w?(??{reverse $1})$/ ) {
        print "$word is a palindrome.\n";
    }

Consider a word like "reviver", which this pattern correctly reports as a palindrome. The $1 variable contains "rev" partway through the match. The optional word character following catches the "i". Then the code reverse $1 runs and produces "ver", and that result is interpolated into the pattern.

For matching something balanced, you need to recurse, which is a bit trickier. A compiled pattern that uses (??{ CODE }) can refer to itself. The pattern given in the Solution matches a set of nested parentheses, however deep they may go. Given the value of $np in that pattern, you could use it like this to match a function call:

    $text = "myfunfun(1,(2*(3+4)),5)";
    $funpat = qr/\w+$np/;   # $np as above
    $text =~ /^$funpat$/;   # Matches!

You'll find many CPAN modules that help with matching (parsing) nested strings. The Regexp::Common module supplies canned patterns that match many of the trickier strings. For example:

    use Regexp::Common;
    $text = "myfunfun(1,(2*(3+4)),5)";
    if ($text =~ /(\w+\s*$RE{balanced}{-parens=>'( )'})/o) {
        print "Got function call: $1\n";
    }

Other patterns provided by that module match numbers in various notations and quote-delimited strings:

    $RE{num}{int}
    $RE{num}{real}
    $RE{num}{real}{'-base=2'}{'-sep=,'}{'-group=3'}
    $RE{quoted}
    $RE{delimited}{-delim=>'/'}

The standard (as of v5.8) Text::Balanced module provides a general solution to this problem.
    use Text::Balanced qw/extract_bracketed/;
    $text = "myfunfun(1,(2*(3+4)),5)";
    if (($before, $found, $after) = extract_bracketed($text, "(")) {
        print "answer is $found\n";
    } else {
        print "FAILED\n";
    }

6.17.4 See Also

The section on "Match-time pattern interpolation" in Chapter 5 of Programming Perl; the documentation for the Regexp::Common CPAN module and the standard Text::Balanced module

Recipe 6.18 Expressing AND, OR, and NOT in a Single Pattern

6.18.1 Problem

You have an existing program that accepts a pattern as an argument or as input. It doesn't allow you to add extra logic, like case-insensitive options, ANDs, or NOTs. So you need to write a single pattern that matches either of two different patterns (the "or" case) or both of two patterns (the "and" case), or that reverses the sense of the match ("not"). This situation arises often in configuration files, web forms, or command-line arguments. Imagine there's a program that does this:

    chomp($pattern = <CONFIG_FH>);
    if ( $data =~ /$pattern/ ) { ..... }

As the maintainer of CONFIG_FH, you need to convey Booleans through to the program using one configuration parameter.

6.18.2 Solution

True if either /ALPHA/ or /BETA/ matches, like /ALPHA/ || /BETA/:

    /ALPHA|BETA/
    /(?:ALPHA)|(?:BETA)/    # works no matter what is in both

True if both /ALPHA/ and /BETA/ match, but may overlap, meaning "BETALPHA" should be okay, like /ALPHA/ && /BETA/:

    /^(?=.*ALPHA)(?=.*BETA)/s

True if both /ALPHA/ and /BETA/ match, but may not overlap, meaning that "BETALPHA" should fail:

    /ALPHA.*BETA|BETA.*ALPHA/s

True if pattern /PAT/ does not match, like $var !~ /PAT/:

    /^(?:(?!PAT).)*$/s

True if pattern BAD does not match, but pattern GOOD does:

    /(?=^(?:(?!BAD).)*$)GOOD/s

(You can't actually count on being able to place the /s modifier there after the trailing slash, but we'll show how to include it in the pattern itself at the end of the Discussion.)
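These single-pattern forms can be checked against sample strings. The sample strings below are invented; the AND form uses the double-lookahead technique that the Discussion spells out with the "bell"/"lab" example.

```perl
#!/usr/bin/perl
# Check single-pattern AND, non-overlapping AND, and NOT forms
# against made-up sample strings.
use strict;
use warnings;

for my $s ('ALPHA then BETA', 'BETALPHA', 'only BETA') {
    my $and_overlap = ($s =~ /^(?=.*ALPHA)(?=.*BETA)/s)  ? 'yes' : 'no';
    my $and_apart   = ($s =~ /ALPHA.*BETA|BETA.*ALPHA/s) ? 'yes' : 'no';
    my $not_alpha   = ($s =~ /^(?:(?!ALPHA).)*$/s)       ? 'yes' : 'no';
    print "'$s': both=$and_overlap, apart=$and_apart, no-ALPHA=$not_alpha\n";
}
```

"BETALPHA" satisfies the overlapping AND but fails the non-overlapping one, which is exactly the distinction the two forms are meant to draw.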
6.18.3 Discussion

When in a normal program you want to know whether something doesn't match, use one of:

    if (!($string =~ /pattern/)) { something() }     # ugly
    if (  $string !~ /pattern/)  { something() }     # preferred
    unless ( $string =~ /pattern/) { something() }   # sometimes clearer

To see whether both patterns match, use:

    if ($string =~ /pat1/ && $string =~ /pat2/ ) { something() }

To see whether either of two patterns matches:

    if ($string =~ /pat1/ || $string =~ /pat2/ ) { something() }

Instead of trying to do it all within a single pattern, it's often more efficient and clearer to use Perl's normal Boolean connectives to combine regular expressions. However, imagine a trivially short minigrep program that reads its single pattern as an argument, as shown in Example 6-9.

Example 6-9. minigrep

    #!/usr/bin/perl
    # minigrep - trivial grep
    $pat = shift;
    while (<>) {
        print if /$pat/o;
    }

To tell minigrep that some pattern must not match, or that it has to match both subpatterns in any order, you're at an impasse. The program isn't built to accept multiple patterns. How can you do it using one pattern? This need comes up in programs reading patterns from configuration files.

The OR case is pretty easy, since the | metacharacter provides for alternation. The AND and NOT cases, however, are more complex. For AND, you have to distinguish between overlapping and non-overlapping needs. If, for example, you want to see whether a string matches both "bell" and "lab" and allow overlapping, the word "labelled" should be matched. But if you don't want to count overlaps, it shouldn't be matched. The overlapping case uses a lookahead assertion:

    "labelled" =~ /^(?=.*bell)(?=.*lab)/s

Remember: in a normal program, you don't have to go through these contortions. Simply say:

    $string =~ /bell/ && $string =~ /lab/

To unravel this, we'll spell it out using /x and comments.
Here's the long version:

    if ($murray_hill =~ m{
                ^           # start of string
                (?=         # zero-width lookahead
                    .*      # any amount of intervening stuff
                    bell    # the desired bell string
                )           # rewind, since we were only looking
                (?=         # another zero-width lookahead
                    .*      # any amount of intervening stuff
                    lab     # and the lab part
                )
            }sx )           # /s means . can match newline
    {
        print "Looks like Bell Labs might be in Murray Hill!\n";
    }

We didn't use .*? to end early, because minimal matching is more expensive than maximal matching. It's more efficient to use .* over .*?, given random input where the occurrence of matches at the front or the end of the string is completely unpredictable. Of course, sometimes choosing between .* and .*? may depend on correctness rather than efficiency, but not here.

To handle the non-overlapping case, you need two parts separated by an OR. The first branch is THIS followed by THAT; the second is the other way around:

    "labelled" =~ /(?:^.*bell.*lab)|(?:^.*lab.*bell)/

or in long form:

    $brand = "labelled";
    if ($brand =~ m{
            (?:         # non-capturing grouper
                bell    # look for a bell
                .*?     # followed by any amount of anything
                lab     # look for a lab
            )           # end grouper
        |               # otherwise, try the other direction
            (?:         # non-capturing grouper
                lab     # look for a lab
                .*?     # followed by any amount of anything
                bell    # followed by a bell
            )           # end grouper
        }sx )           # /s means . can match newline
    {
        print "Our brand has bell and lab separate.\n";
    }

Neither of those patterns matches the test data of "labelled", since there "bell" and "lab" do overlap.

These patterns aren't necessarily efficient. $murray_hill =~ /bell/ && $murray_hill =~ /lab/ scans the string at most twice, but with (?=^.*?bell)(?=^.*?lab) the pattern-matching engine's only option is to try to find a "lab" for each occurrence of "bell", leading to quadratic worst-case running times.

If you followed those examples, the NOT case should be a breeze. The general form looks like this:

    $map =~ /^(?:(?!waldo).)*$/s

Spelled out in long form, this yields:

    if ($map =~ m{
            ^           # start of string
            (?:         # clustering grouper
                (?!
# look ahead negation waldo # is he ahead of us now? ) # is so, the negation failed . # any character (cuzza /s) ) * # repeat that grouping 0 or more $ # through the end of the string }sx ) # /s means . can match newline { print "There's no waldo here!\n"; } How would you combine AND, OR, and NOT? It's not a pretty picture, and in a regular program, you'd almost never do this. But you have little choice when you're reading from a config file or pulling in arguments from the command line, because you specify only one pattern. You just have to combine what we've learned so far. Carefully. Let's say you wanted to run the Unix w program and find out whether user tchrist were logged on anywhere but a terminal whose name began with ttyp; that is, tchrist must match, but ttyp must not. Here's sample input from w: 7:15am up 206 days, 13:30, 4 users, load average: 1.04, 1.07, 1.04 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT tchrist tty1 5:16pm 36days 24:43 0.03s xinit tchrist tty2 5:19pm 6days 0.43s 0.43s -tcsh tchrist ttyp0 chthon 7:58am 3days 23.44s 0.44s -tcsh gnat ttyS4 coprolith 2:01pm 13:36m 0.30s 0.30s -tcsh Here's how to do that using the minigrep program previously outlined or with the tcgrep program from the end of this chapter: % w | minigrep '(?!.*ttyp)tchrist' Decoding that pattern: m{ (?! # zero-width look-ahead assertion .* # any amount of anything (faster than .*?) ttyp # the string you don't want to find ) # end look-ahead negation; rewind to start tchrist # now try to find Tom }x Of course, this example is contrived: any sane person would call the standard grep program twice, once with a -v option to select only non-matches. % w | grep tchrist | grep -v ttyp The point is that Boolean conjunctions and negations can be coded up in one single pattern. You should comment this kind of thing, though, having pity on those who come after you—before they do. One last thing: how would you embed that /s in a pattern passed to a program from the command line? 
The same way as you would a /i modifier: by using (?i) in the pattern. The /s and /m modifiers can be painlessly included in a pattern as well, using (?s) or (?m). These can even cluster, as in (?smi). That would make these two reasonably interchangeable:

    % grep -i 'pattern' files
    % minigrep '(?i)pattern' files

When you turn on a modifier that way, it remains on for the entire pattern. An alternative notation restricts the scope of the modifier. Use a clustering parenthesis set, (?:...), and place the modifiers between the question mark and the colon. Printing out a qr// quoted regex demonstrates how to do this:

    % perl -le 'print qr/pattern/i'
    (?i-xsm:pattern)

Modifiers placed before a minus are enabled for just that pattern; those placed after the minus are disabled for that pattern.

6.18.4 See Also

Lookahead assertions are shown in the "Regular Expressions" section of perlre(1), and in the "Lookaround Assertions" section of Chapter 5 of Programming Perl; your system's grep(1) and w(1) manpages; we talk about configuration files in Recipe 8.16

Recipe 6.19 Matching a Valid Mail Address

6.19.1 Problem

You want to find a pattern to verify the validity of a supplied mail address.

6.19.2 Solution

Because you cannot do real-time validation of deliverable mail addresses, no single, succinct pattern will solve this problem. You must pick from several available compromise approaches.

6.19.3 Discussion

Our best advice for verifying a person's mail address is to have them enter their address twice, just as you would when changing a password. This usually weeds out typos. If both entries match, send mail to that address with a personal message such as:

    Dear,

    Please confirm the mail address you gave us on Sun Jun 29 10:29:01
    MDT 2003 by replying to this message.  Include the string
    "Rumpelstiltskin" in that reply, but spelled in reverse; that is,
    start with "Nik...".  Once this is done, your confirmed address will
    be entered into our records.
If you get back a message where they've followed your directions, you can be reasonably assured that it's real. A related strategy that's less open to forgery is to give them a personal identification number (PIN). Record the address and PIN (preferably a random one) for later processing. In the mail you send, ask them to include the PIN in their reply. In case your email bounces, or the message is included via a vacation script, ask them to mail back the PIN slightly altered, such as with the characters reversed, one added or subtracted to each digit, etc.

Most common patterns used for address verification or validation fail in various and sometimes subtle ways. For example, an address containing odd but legal characters, such as an & in the local part, is valid and quite possibly deliverable, but most patterns that allegedly match valid mail addresses fail to let it pass. Addresses can also contain nested parenthesized comments, which should be stripped before matching; a loop like this does the job:

    1 while $addr =~ s/\([^()]*\)//g;

You could use the 6598-byte pattern given on the last page of the first edition of Mastering Regular Expressions to test for RFC conformance, but even that monster isn't perfect, for three reasons.

First, not all RFC-valid addresses are deliverable: an address can be perfectly valid in form but undeliverable in practice. Some people try to do DNS lookups for MX records, even trying to connect to the host handling that address's mail to check whether it's valid at that site. This is a poor approach because most sites can't do a direct connect to any other site, and even if they could, mail-receiving sites increasingly either ignore the SMTP VRFY command or fib about its answer.

Second, some RFC-invalid addresses, in practice, are perfectly deliverable. For example, a lone postmaster is almost certainly deliverable, but doesn't pass RFC 822 muster: it doesn't have an @ in it.

Finally and most importantly, just because an address happens to be valid and deliverable doesn't mean that it's the right one. A famous person's published address, for example, may be valid by the RFC and deliverable.
But it's unlikely in the extreme that that would be the mail address of the person submitting information to your CGI script.

The Email::Valid CPAN module makes a valiant (albeit provably imperfect) attempt at doing this correctly. It jumps through many hoops, including the RFC 822 regular expression from Mastering Regular Expressions, DNS MX record lookup, and stop lists for naughty words and famous people. But this is still a weak approach. The approach suggested at the beginning of the Discussion is easier to implement and less prone to error.

6.19.4 See Also

The "Matching an Email Address" section of Chapter 7 of the first edition of Mastering Regular Expressions; Recipe 18.16

Recipe 6.20 Matching Abbreviations

6.20.1 Problem

Suppose you had a list of commands, such as "send", "abort", "list", and "edit". The user types one in, but you don't want to make them type out the whole thing.

6.20.2 Solution

Use the following technique if all strings start with different characters, or to arrange matches so one takes precedence over another, as "SEND" has precedence over "STOP" here:

    chomp($answer = <>);
    if    ("SEND"  =~ /^\Q$answer/i) { print "Action is send\n"  }
    elsif ("STOP"  =~ /^\Q$answer/i) { print "Action is stop\n"  }
    elsif ("ABORT" =~ /^\Q$answer/i) { print "Action is abort\n" }
    elsif ("LIST"  =~ /^\Q$answer/i) { print "Action is list\n"  }
    elsif ("EDIT"  =~ /^\Q$answer/i) { print "Action is edit\n"  }

Or use the Text::Abbrev module:

    use Text::Abbrev;
    $href = abbrev qw(send abort list edit);
    for (print "Action: "; <>; print "Action: ") {
        chomp;
        my $action = $href->{ lc($_) };
        print "Action is $action\n";
    }

6.20.3 Discussion

The first technique exchanges the typical operand order of a match. Normally you have a variable on the left side of the match and a known pattern on the right side. We might try to decide which action the user wanted us to take by saying $answer =~ /^ABORT/i, which is true if $answer begins with the string "ABORT".
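The reversed operand order is the whole trick here. A minimal sketch contrasting the two orderings (the "ab" input is just an illustrative value):

```perl
#!/usr/bin/perl
# Sketch: the usual order asks "does the input start with the full
# command?", while the reversed order asks "is the input a prefix of
# the command?" -- which is what abbreviation matching needs.
use strict;
use warnings;

my $answer = "ab";      # what the user typed

# usual order: the input would have to start with "ABORT"
print "usual: match\n"    if $answer =~ /^ABORT/i;

# reversed order: the (metaquoted) input need only be a prefix
print "reversed: match\n" if "ABORT" =~ /^\Q$answer/i;   # prints this one
```

With \Q protecting the interpolated input, a user typing regex metacharacters can't blow up the pattern.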
It matches regardless of whether $answer has anything after "ABORT", so "ABORT LATER" would still match. Handling abbreviations generally requires quite a bit of ugliness:

    $answer =~ /^A(B(O(R(T)?)?)?)?$/i

Compare the classic variable =~ /pattern/ with "ABORT" =~ /^\Q$answer/i. The \Q escapes characters that would otherwise be treated specially; that way your program won't blow up if the user enters an invalid pattern. When the user enters something like "ab", the expanded match becomes "ABORT" =~ /^ab/i after variable substitution and metaquoting. This matches.

The standard Text::Abbrev module takes a different approach. You supply a list of words, and the abbrev( ) function returns a reference to a hash whose keys are all unambiguous abbreviations and whose values are the fully expanded strings. So if $href were created as in the Solution example, $href->{"a"} would return the string "abort".

This technique is commonly used to call a function based on the name of the string the user types in. It's possible to implement this using symbolic references, as in:

    $name = 'send';
    &$name($message);
    $name->($message);      # alternate, simpler syntax

But that's scary because it lets the user run any function whose name they know (or can guess), not just those we want to make available to them. It also runs afoul of that pesky use strict 'refs' pragma.

Here's a partial program that creates a hash in which the key is the command name and the value is a reference to the function to call for that command:

    # assumes that &invoke_editor, &deliver_message,
    # $file and $PAGER are defined somewhere else.
    use Text::Abbrev;
    my($href, %actions, $errors);

    %actions = (
        "edit"  => \&invoke_editor,
        "send"  => \&deliver_message,
        "list"  => sub { system($PAGER, $file) },
        "abort" => sub { print "See ya!\n"; exit; },
        ""      => sub { print "Unknown command\n"; $errors++; },
    );

    $href = abbrev(keys %actions);

    for (print "Action: "; my $choice = <>; print "Action: ") {
        $choice =~ s/^\s+//;        # trim leading whitespace
        $choice =~ s/\s+$//;        # trim trailing whitespace
        next unless $choice;
        $actions{ $href->{ lc($choice) } }->( );
    }

If you're not into long expressions or need practice typing, that last statement could have been written:

    $abbreviation = lc($choice);
    $expansion    = $href->{$abbreviation};
    $coderef      = $actions{$expansion};
    $coderef->( );

6.20.4 See Also

The documentation for the standard Text::Abbrev module; interpolation is explained in the "Scalar Value Constructors" section of perldata(1), and in the "String Literals" section of Chapter 2 of Programming Perl

Recipe 6.21 Program: urlify

This program puts HTML links around URLs in files. It doesn't work on all possible URLs, but does hit the most common ones. It tries to avoid including end-of-sentence punctuation in the marked-up URL. It is a typical Perl filter, so it can be fed input from a pipe:

    % gunzip -c ~/mail/archive.gz | urlify > archive.urlified

or by supplying files on the command line:

    % urlify ~/mail/*.inbox > ~/allmail.urlified

The program is shown in Example 6-10.

Example 6-10. urlify

    #!/usr/bin/perl
    # urlify - wrap HTML links around URL-like constructs

    $protos = '(http|telnet|gopher|file|wais|ftp)';
    $ltrs   = '\w';
    $gunk   = ';/#~:.?+=&%@!\-';
    $punc   = '.:?\-';
    $any    = "${ltrs}${gunk}${punc}";

    while (<>) {
        s{
            \b                  # start at word boundary
            (                   # begin $1  {
                $protos :       # need resource and a colon
                [$any] +?       # followed by one or more of any valid
                                #   character, but be conservative and
                                #   take only what you need to....
            )                   # end $1  }
            (?=                 # look-ahead non-consumptive assertion
                [$punc]*        # either 0 or more punctuation
                [^$any]         #   followed by a non-url char
            |                   # or else
                $               #   then end of the string
            )
        }{<A HREF="$1">$1</A>}igox;
        print;
    }

Recipe 6.22 Program: tcgrep

This program is a Perl rewrite of the Unix grep program. Although it runs slower than C versions (especially the GNU greps), it offers many more features. The first and perhaps most important feature is that it runs anywhere Perl does. Other enhancements are that it can ignore anything that's not a plain text file, automatically expand compressed or gzipped files, recurse down directories, search complete paragraphs or user-defined records, look in younger files before older ones, and add underlining or highlighting of matches. It also supports the -c option to indicate a count of matching records, as well as -C for a count of matching patterns when there could be more than one per record.

This program uses gzcat or zcat to decompress compressed files, so this feature is unavailable on systems without these programs and systems that can't run external programs (such as old Macs). Run the program with no arguments for a usage message (see the usage subroutine in the following code). The following example recursively and case-insensitively greps every file in ~/mail for mail messages from someone called "kate", reporting filenames that contained matches:

    % tcgrep -ril '^From: .*kate' ~/mail

The program is shown in Example 6-11.

Example 6-11. tcgrep

    #!/usr/bin/perl -w
    # tcgrep: tom christiansen's rewrite of grep
    # v1.0: Thu Sep 30 16:24:43 MDT 1993
    # v1.1: Fri Oct  1 08:33:43 MDT 1993
    # v1.2: Fri Jul 26 13:37:02 CDT 1996
    # v1.3: Sat Aug 30 14:21:47 CDT 1997
    # v1.4: Mon May 18 16:17:48 EDT 1998
    use strict;

    # globals
    our ($Me, $Errors, $Grand_Total, $Mult, %Compress, $Matches);
    my  ($matcher, $opt);       # matcher - anon.
                                #   sub to check for matches
                                # opt - ref to hash w/ command-line options

    init( );                            # initialize globals
    ($opt, $matcher) = parse_args( );   # get command line options and patterns
    matchfile($opt, $matcher, @ARGV);   # process files

    exit(2) if $Errors;
    exit(0) if $Grand_Total;
    exit(1);

    ###################################
    sub init {
        ($Me = $0) =~ s!.*/!!;          # get basename of program, "tcgrep"
        $Errors = $Grand_Total = 0;     # initialize global counters
        $Mult = "";                     # flag for multiple files in @ARGV
        $| = 1;                         # autoflush output
        %Compress = (                   # file extensions and program names
            z  => 'gzcat',              # for uncompressing
            gz => 'gzcat',
            Z  => 'zcat',
        );
    }

    ###################################
    sub usage {
        die <<EOF
usage: $Me [flags] [files]

Standard grep options:
    i   case insensitive
    n   number lines
    c   give count of lines matching
    C   ditto, but >1 match per line possible
    w   word boundaries only
    s   silent mode
    x   exact matches only
    v   invert search sense (lines that DON'T match)
    h   hide filenames
    e   expression (for exprs beginning with -)
    f   file with expressions
    l   list filenames matching

Specials:
    1   1 match per file
    H   highlight matches
    u   underline matches
    r   recursive on directories or dot if none
    t   process directories in 'ls -t' order
    p   paragraph mode (default: line mode)
    P   ditto, but specify separator, e.g. -P '%%\\n'
    a   all files, not just plain text files
    q   quiet about failed file and dir opens
    T   trace files as opened

May use a TCGREP environment variable to set default options.
EOF
    }

    ###################################
    sub parse_args {
        use Getopt::Std;

        my ($optstring, $zeros, $nulls, %opt, $pattern, @patterns, $match_code);
        my ($SO, $SE);

        if (my $opts = $ENV{TCGREP}) {      # get envariable TCGREP
            $opts =~ s/^([^\-])/-$1/;       # add leading - if missing
            unshift(@ARGV, $opts);          # add TCGREP opt string to @ARGV
        }

        $optstring = "incCwsxvhe:f:l1HurtpP:aqT";
        $zeros = 'inCwxvhelut';             # options to init to 0
        $nulls = 'pP';                      # options to init to ""

        @opt{ split //, $zeros } = ( 0 )  x length($zeros);
        @opt{ split //, $nulls } = ( '' ) x length($nulls);

        getopts($optstring, \%opt) or usage( );

        # handle option "-f patfile", for list of patterns
        if ($opt{f}) {
            open(PATFILE, $opt{f}) or die "$Me: Can't open '$opt{f}': $!";
            # make sure each pattern in file is valid
            while ($pattern = <PATFILE>) {
                chomp $pattern;
                eval { 'foo' =~ /$pattern/, 1 }
                    or die "$Me: $opt{f}:$.: bad pattern: $@";
                push @patterns, $pattern;
            }
            close PATFILE;
        } else {                            # make sure pattern is valid
            $pattern = $opt{e} || shift(@ARGV) || usage( );
            eval { 'foo' =~ /$pattern/; 1 }
                or die "$Me: bad pattern: $@";
            @patterns = ($pattern);
        }

        # option -H is for highlight, option -u is for underline
        if ($opt{H} || $opt{u}) {
            my $term = $ENV{TERM} || 'vt100';
            my $terminal;
            # eval{ } only to trap potential exceptions in function calls
            eval {                          # try to look up escapes for stand-out
                require POSIX;              # or underline via Term::Cap
                use Term::Cap;
                my $termios = POSIX::Termios->new( );
                $termios->getattr;
                my $ospeed = $termios->getospeed;
                $terminal = Tgetent Term::Cap { TERM=>undef, OSPEED=>$ospeed }
            };
            unless ($@) {                   # if successful, get escapes for either
                local $^W = 0;              # stand-out (-H) or underlined (-u)
                ($SO, $SE) = $opt{H}
                    ? ($terminal->Tputs('so'), $terminal->Tputs('se'))
                    : ($terminal->Tputs('us'), $terminal->Tputs('ue'));
            } else {                        # if use of Term::Cap fails,
                ($SO, $SE) = $opt{H}        # use tput command to get escapes
                    ?
                      (`tput -T $term smso`, `tput -T $term rmso`)
                    : (`tput -T $term smul`, `tput -T $term rmul`)
            }
        }

        # option -i makes all patterns case insensitive
        if ($opt{i}) {
            @patterns = map {"(?i)$_"} @patterns;
        }

        # option -p or -P is paragraph mode, so add /m
        if ($opt{p} || $opt{P}) {
            @patterns = map {"(?m)$_"} @patterns;
        }

        # option -p is standard paragraph mode
        $opt{p} && ($/ = '');
        # option -P is user-defined paragraph mode
        $opt{P} && ($/ = eval(qq("$opt{P}")));      # for -P '%%\n'

        # option -w is at word boundary only (XXX: is this always optimal?)
        $opt{w} && (@patterns = map {'\b' . $_ . '\b'} @patterns);
        # option -x is for whole lines only
        $opt{'x'} && (@patterns = map {"^$_\$"} @patterns);

        # determine whether to emit file name in front of each match
        if (@ARGV) {
            $Mult = 1 if ($opt{r} || (@ARGV > 1) || -d $ARGV[0]) && !$opt{h};
        }

        # if just listing filenames, stop after first match
        $opt{1} += $opt{l};                 # that's a one and an ell
        # this way only need look for -H
        $opt{H} += $opt{u};
        # if we're doing a complete count, we're doing some counting
        $opt{c} += $opt{C};
        # if we're counting, keep track of status
        $opt{'s'} += $opt{c};
        # stop at first match if checking status but not counting
        $opt{1} += $opt{'s'} && !$opt{c};   # that's a one

        # default args are cwd if recursive, stdin otherwise
        @ARGV = ($opt{r} ? '.'
                         : '-') unless @ARGV;

        # we're recursive even w/o -r if all args are directories
        $opt{r} = 1 if !$opt{r} && grep(-d, @ARGV) == @ARGV;

        ######
        # now the hard part: build up the matching function as text to eval
        ######

        $match_code = '';
        $match_code .= 'study;' if @patterns > 5;   # might speed things up a bit

        foreach (@patterns) { s(/)(\\/)g }

        # add the stand-out and end-stand-out sequences for highlight mode
        if ($opt{H}) {
            foreach $pattern (@patterns) {
                $match_code .= "\$Matches += s/($pattern)/${SO}\$1${SE}/g;";
            }
        }
        # option -v means to count a line if it *doesn't* match
        elsif ($opt{v}) {
            foreach $pattern (@patterns) {
                $match_code .= "\$Matches += !/$pattern/;";
            }
        }
        # do full count, multiple hits per line
        elsif ($opt{C}) {
            foreach $pattern (@patterns) {
                $match_code .= "\$Matches++ while /$pattern/g;";
            }
        } else {
            foreach $pattern (@patterns) {
                $match_code .= "\$Matches++ if /$pattern/;";
            }
        }

        # now compile as a closure, and grab function pointer
        $matcher = eval "sub { $match_code }";
        die if $@;

        return (\%opt, $matcher);
    }

    ###################################
    sub matchfile {
        $opt     = shift;           # reference to option hash
        $matcher = shift;           # reference to matching sub

        my ($file, @list, $total, $name);
        local($_);
        $total = 0;

FILE:   while (defined ($file = shift(@_))) {
            if (-d $file) {
                if (-l $file && @ARGV != 1) {
                    warn "$Me: \"$file\" is a symlink to a directory\n"
                        if $opt->{T};
                    next FILE;
                }
                if (!$opt->{r}) {
                    warn "$Me: \"$file\" is a directory, but no -r given\n"
                        if $opt->{T};
                    next FILE;
                }
                unless (opendir(DIR, $file)) {
                    unless ($opt->{'q'}) {
                        warn "$Me: can't opendir $file: $!\n";
                        $Errors++;
                    }
                    next FILE;
                }
                @list = ( );
                for (readdir(DIR)) {        # skip cwd and parent dir
                    push(@list, "$file/$_") unless /^\.{1,2}$/;
                }
                closedir(DIR);

                # option -t is sort by age, youngest first
                # use algorithm from Recipe 4.XXX, Sorting a List by Computable Field
                if ($opt->{t}) {
                    @list = map  { $_->[0] }
                            sort { $a->[1] <=> $b->[1] }
                            map  { [ $_, -M $_ ] }
                            @list;
                } else {
                    @list = sort @list;
                }
                matchfile($opt,
                          $matcher, @list);     # process files
                next FILE;
            }

            # avoid annoying situation of grep wanting to read from keyboard
            # but user not realizing this
            if ($file eq '-') {
                warn "$Me: reading from stdin\n" if -t STDIN && !$opt->{'q'};
                $name = '';
            } else {
                $name = $file;
                unless (-e $file) {
                    warn qq($Me: file "$file" does not exist\n) unless $opt->{'q'};
                    $Errors++;
                    next FILE;
                }
                unless (-f $file || $opt->{a}) {
                    warn qq($Me: skipping non-plain file "$file"\n) if $opt->{T};
                    next FILE;
                }

                # could use File::Spec instead
                my ($ext) = $file =~ /\.([^.]+)$/;
                # check whether it's an extension whose contents we know
                # how to convert to plain text via a filter program
                if (defined($ext) && exists($Compress{$ext})) {
                    $file = "$Compress{$ext} < $file |";
                } elsif (! (-T $file || $opt->{a})) {
                    warn qq($Me: skipping binary file "$file"\n) if $opt->{T};
                    next FILE;
                }
            }

            warn "$Me: checking $file\n" if $opt->{T};

            unless (open(FILE, $file)) {
                unless ($opt->{'q'}) {
                    warn "$Me: $file: $!\n";
                    $Errors++;
                }
                next FILE;
            }

            $total = 0;
            $Matches = 0;

LINE:       while (<FILE>) {
                $Matches = 0;
                ##############
                &{$matcher}( );     # do it! (check for matches)
                ##############
                next LINE unless $Matches;
                $total += $Matches;

                if ($opt->{p} || $opt->{P}) {
                    s/\n{2,}$/\n/ if $opt->{p};
                    chomp         if $opt->{P};
                }
                print("$name\n"), next FILE if $opt->{l};

                # The following commented-out block is the
                # expanded/legible version of the statement
                # that immediately follows it.  This is one
                # of the few times we sacrifice readability
                # for execution speed: we carefully arrange
                # that print( ) be called just once, not four times,
                # and we don't resort to a braced block either.
                # (note that $Mult must be "" not 0 for this to work)
                ########
                ## unless ($opt->{'s'}) {
                ##     print "$name:" if $Mult;
                ##     print "$.:"    if $opt->{n};
                ##     print;
                ##     print (('-' x 20) . "\n") if $opt->{p} || $opt->{P};
                ## }
                ########
                $opt->{'s'} || print $Mult && "$name:",
                                     $opt->{n} ? "$.:" : "",
                                     $_,
                                     ($opt->{p} || $opt->{P}) && ('-' x 20) .
"\n"; next FILE if $opt->{1}; # that's a one } } continue { # again, next block equivalent to line following ####### ## if ($opt->{c}) { ## print $name if $Mult; ## print "$total\n"; ## } ####### print $Mult && "$name:", "$total\n" if $opt->{c}; } $Grand_Total += $total; } [ Team LiB ] [ Team LiB ] Recipe 6.23 Regular Expression Grab Bag We have found these regular expressions useful or interesting: Swap first two words s/(\S+)(\s+)(\S+)/$3$2$1/ Keyword = Value m/^(\w+)\s*=\s*(.*?)\s*$/ # keyword is $1, value is $2 Line of at least 80 characters m/.{80,}/ length( ) >= 80 # ok, not a regex MM/DD/YY HH:MM:SS m|(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)| Changing directories s(/usr/bin)(/usr/local/bin)g Expanding %7E (hex) escapes s/%([0-9A-Fa-f][0-9A-Fa-f])/chr(hex($1))/ge Deleting C comments (imperfectly) s{ /* # Match the opening delimiter .*? # Match a minimal number of characters */ # Match the closing delimiter }{ }gsx; Removing leading and trailing whitespace s/^\s+//; s/\s+$//; Turning \ followed by n into a real newline s/\\n/\n/g; Removing package portion of fully qualified symbols s/^.*::// Dotted quads (most IP addresses) # XXX: fails on legal IPs 127.1 and 2130706433. m{ ^ ( \d | [01]?\d\d | 2[0-4]\d | 25[0-5] ) \. ( \d | [01]?\d\d | 2[0-4]\d | 25[0-5] ) \. ( \d | [01]?\d\d | 2[0-4]\d | 25[0-5] ) \. ( \d | [01]?\d\d | 2[0-4]\d | 25[0-5] ) $ }x Removing leading path from filename s{^.*/}{ } Extracting columns setting from TERMCAP $cols = ( ($ENV{TERMCAP} || " ") =~ m/:co#(\d+):/ ) ? 
            $1 : 80;

Removing directory components from program name and arguments

    ($name = " $0 @ARGV") =~ s{ /\S+/}{ }g;

Checking your operating system

    die "This isn't Linux" unless $^O =~ m/linux/i;

Joining continuation lines in multiline string

    s/\n\s+/ /g

Extracting all numbers from a string

    @nums = m/(\d+\.?\d*|\.\d+)/g;

Finding all-caps words

    @capwords = m/(\b\p{ Upper-case Letter }+\b)/g;

Finding all-lowercase words

    @lowords = m/(\b\p{ Lower-case Letter }+\b)/g;

Finding initial-caps words

    @icwords = m{ ( \b [\p{ Upper-case Letter }\p{ Title-case Letter }]
                    \p{ Lower-case Letter } *
                    \b ) }gx;

Finding links in simple HTML

    @links = m/<A[^>]+?HREF\s*=\s*["']?([^'" >]+?)['"]?\s*>/ig;

Finding middle initial in $_

    $initial = /^\S+\s+(\S)\S*\s+\S/ ? $1 : "";

Changing double vertical prime pairs to curly quotes

    s/"([^"]*)"/``$1''/g                # old way
    # next is Unicode only
    s/"([^"]*)"/\x{201C}$1\x{201D}/g

Extracting sentences (double spaces required between each)

    {
        local $/ = "";
        while (<>) {
            s/\n/ /g;
            s/ {3,}/  /g;
            push @sentences, m/(\S.*?[!?.])(?= {2}|\Z)/g;
        }
    }

YYYY-MM-DD

    m/\b(\d{4})-(\d\d)-(\d\d)\b/        # YYYY in $1, MM in $2, DD in $3

North American telephone numbers

    m/
        ^
        (?:
            1 \s (?: \d\d\d \s)?        # 1, or 1 and area code
        |                               # ... or ...
            \(\d\d\d\) \s               # area code with parens
        |                               # ... or ...
            (?: \+\d\d?\d? \s)?         # optional +country code
            \d\d\d ([\s\-])             # and area code
        )
        \d\d\d (\s|\1)                  # prefix (and area code separator)
        \d\d\d\d                        # exchange
        $
    /x

Exclamations

    m/\boh\s+my\s+gh?o(d(dess(es)?|s?)|odness|sh)\b/i

Extracting lines regardless of line terminator

    push(@lines, $1)
        while $input =~ s{
            ^               # gobble from front
            (               # begin $1:
                .           #   any single char (/s)
                *?          #   but minimally matching, even none
            )
            (?:             # make capturing if saving terminators
                \x0D \x0A   # CRLF
              | \x0A        # LF
              | \x0D        # CR
              | \x0C        # FF
              | \x{2028}    # Unicode LS
              | \x{2029}    # Unicode PS
            )
        }{}sx;              # consumes $input

Or use split:

    @lines = split m{
        (?:             # make capturing if saving terminators
            \x0D \x0A   # CRLF
          | \x0A        # LF
          | \x0D        # CR
          | \x0C        # FF
          | \x{2028}    # Unicode LS
          | \x{2029}    # Unicode PS
        )
    }x, $input;

Chapter 7. File Access

I the heir of all the ages, in the foremost files of time.
—Alfred, Lord Tennyson, Locksley Hall

Introduction

Nothing is more central to data processing than the file. As with everything else in Perl, easy things are easy and hard things are possible. Common tasks (opening files, reading data, writing data) use simple I/O functions and operators, whereas fancier functions do hard things like non-blocking I/O and file locking.

This chapter deals with the mechanics of file access: opening a file, telling subroutines which files to work with, locking files, and so on. Chapter 8 deals with techniques for working with the contents of a file: reading, writing, shuffling lines, and other operations you can do once you have access to the file.

Here's Perl code for printing all lines from the file /acme/widgets/data that contain the word "blue":

    open(INPUT, "<", "/acme/widgets/data")
        or die "Couldn't open /acme/widgets/data for reading: $!\n";
    while (<INPUT>) {
        print if /blue/;
    }
    close(INPUT);

Getting a Handle on the File

Central to file access in Perl is the filehandle, like INPUT in the previous code example. Filehandles are symbols inside your Perl program that you associate with an external file, usually using the open function. Whenever your program performs an input or output operation, it provides that operation with an internal filehandle, not an external filename. It's the job of open to make that association, and of close to break it.
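The open/read/close cycle above assumes the book's illustrative /acme path exists. Here's a self-contained variant of the same pattern that runs anywhere by first writing a small data file with the standard File::Temp module:

```perl
#!/usr/bin/perl
# Sketch of the open/read/close cycle from the Introduction, using a
# temporary file in place of the book's /acme/widgets/data example path.
use strict;
use warnings;
use File::Temp qw(tempfile);

# create and populate a throwaway data file
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "red widget\nblue widget\ngreen widget\n";
close($out) or die "can't close $path: $!";

# the pattern from the text: open, filter each line, close
open(my $in, "<", $path)
    or die "Couldn't open $path for reading: $!\n";
while (<$in>) {
    print if /blue/;        # prints "blue widget"
}
close($in);
```

The lexical handle $in here is the indirect-filehandle style discussed just below; a bareword handle like INPUT would work the same way.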
Actually, any of several functions can be used to open files, and handles can refer to entities beyond mere files on disk; see Recipe 7.1 for details. While users think of open files in terms of those files' names, Perl programs do so using their filehandles. But as far as the operating system itself is concerned, an open file is nothing more than a file descriptor, which is a small, non-negative integer. The fileno function divulges the system file descriptor of its filehandle argument. Filehandles are enough for most file operations, but for when they aren't, Recipe 7.9 turns a system file descriptor into a filehandle you can use from Perl. Like the names for labels, subroutines, and packages, those for filehandles are unadorned symbols like INPUT, not variables like $input. However, with a few syntactic restrictions, Perl also accepts in lieu of a named filehandle a scalar expression that evaluates to a filehandle—or to something that passes for a filehandle, such as a typeglob, a reference to a typeglob, or an IO object. Typically, this entails storing the filehandle's typeglob in a scalar variable and then using that variable as an indirect filehandle. Code written this way can be simpler than code using named filehandles, because now that you're working with regular variables instead of names, certain untidy and unobvious issues involving quoting, scoping, and packages all become clearer. As of the v5.6 release, Perl can be coaxed into implicitly initializing variables used as indirect filehandles. If you supply a function expecting to initialize a filehandle (like open) with an undefined scalar, that function automatically allocates an anonymous typeglob and stores its reference into the previously undefined variable—a tongue-twisting description normally abbreviated to something more along the lines of, "Perl autovivifies filehandles passed to open as undefined scalars." 
    my $input;      # new lexical starts out undef
    open($input, "<", "/acme/widgets/data")
        or die "Couldn't open /acme/widgets/data for reading: $!\n";
    while (<$input>) {
        print if /blue/;
    }
    close($input);  # also occurs when $input GC'd

For more about references and their autovivification, see Chapter 11. That chapter deals more with customary data references, though, than it does with exotics like the typeglob references seen here. Having open autovivify a filehandle is only one of several ways to get indirect filehandles. We show different ways of loading up variables with named filehandles and several esoteric equivalents for later use as indirect filehandles in Recipe 7.5.

Some recipes in this chapter use filehandles along with the standard IO::Handle module, and sometimes with the IO::File module. Object constructors from these classes return new objects for use as indirect filehandles anywhere a regular handle would go, such as with built-ins like print, readline, close, etc. You can likewise invoke any IO::Handle method on your regular, unblessed filehandles. This includes autovivified handles and even named ones like INPUT or STDIN, although none of these has been blessed as an object.

Method invocation syntax is visually noisier than the equivalent Perl function call, and incurs some performance penalty compared with a function call (where an equivalent function exists). We generally restrict our method use to those providing functionality that would otherwise be difficult or impossible to achieve in pure Perl without resorting to modules. For example, the blocking method sets or disables blocking on a filehandle, a pleasant alternative to the Fcntl wizardry that at least one of the authors and probably most of the readership would prefer not having to know. This forms the basis of Recipe 7.20. Most methods are in the IO::Handle class, which IO::File inherits from, and can even be applied directly to filehandles that aren't objects.
They need only be something that Perl will accept as a filehandle. For example:

    STDIN->blocking(0);                 # invoke on named handle

    open($fh, "<", $filename) or die;   # first autovivify handle, then...
    $fh->blocking(0);                   # invoke on unblessed typeglob ref

Like most names in Perl, including those of subroutines and global variables, named filehandles reside in packages. That way, two packages can have filehandles of the same name. When unqualified by package, a named filehandle has a full name that starts with the current package. Writing INPUT is really main::INPUT in the main package, but it's SomeMod::INPUT if you're in a hypothetical SomeMod package. The built-in filehandles STDIN, STDOUT, and STDERR are special. If they are left unqualified, the main package rather than the current one is used. This is the same exception to normal rules for finding the full name that occurs with built-in variables like @ARGV and %ENV, a topic discussed in the Introduction to Chapter 12.

Unlike named filehandles, which are global symbols within the package, autovivified filehandles implicitly allocated by Perl are anonymous (i.e., nameless) and have no package of their own. More interestingly, they are also like other references in being subject to automatic garbage collection. When a variable holding them goes out of scope and no other copies or references to that variable or its value have been saved away somewhere more lasting, the garbage collection system kicks in, and Perl implicitly closes the handle for you (if you haven't yet done so yourself). This is important in large or long-running programs, because the operating system imposes a limit on how many underlying file descriptors any process can have open—and usually also on how many descriptors can be open across the entire system. In other words, just as real system memory is a finite resource that you can exhaust if you don't carefully clean up after yourself, the same is true of system file descriptors.
If you keep opening new filehandles forever without ever closing them, you'll eventually run out, at which point your program will die if you're lucky or careful, and malfunction if you're not. The implicit close during garbage collection of autoallocated filehandles spares you the headaches that can result from less-than-perfect bookkeeping. For example, these two functions both autovivify filehandles into distinct lexical variables of the same name:

    sub versive {
        open(my $fh, "<", $SOURCE)
            or die "can't open $SOURCE: $!";
        return $fh;
    }

    sub apparent {
        open(my $fh, ">", $TARGET)
            or die "can't open $TARGET: $!";
        return $fh;
    }

    my($from, $to) = ( versive( ), apparent( ) );

Normally, the handles in $fh would be closed implicitly when each function returns. But since both functions return those values, the handles stay open a while longer. They remain open until explicitly closed, or until the $from and $to variables and any copies you make all go out of scope; at that point, Perl dutifully tidies up by closing them if they've been left open.

For buffered handles whose internal buffers contain unwritten data, a more valuable benefit shows up: because a flush precedes the close, this guarantees that all data finally makes it to where you thought it was going in the first place.[1] For global filehandle names, this implicit flush and close takes place only on final program exit, but it is not forgotten.[2]

[1] Or at least tries to; currently, no error is reported if the implicit write syscall should fail at this stage, which might occur if, for example, the filesystem holding the open file has run out of space.

[2] Unless you exit by way of an uncaught signal, by execing another program, or by calling POSIX::_exit( ).

Standard Filehandles

Every program starts with three standard filehandles already open: STDIN, STDOUT, and STDERR. STDIN, typically pronounced standard in, represents the default source for data flowing into a program.
STDOUT, typically pronounced standard out, represents the default destination for data flowing out from a program. Unless otherwise redirected, standard input is read directly from your keyboard, and standard output is written directly to your screen. One need not be so direct about matters, however. Here we tell the shell to redirect your program's standard input to datafile and its standard output to resultsfile, all before your program even starts:

    % program < datafile > resultsfile

Suppose something goes wrong in your program that you need to report. If your standard output has been redirected, the person running your program probably wouldn't notice a message that appeared in that output. These are the precise circumstances for which STDERR, typically pronounced standard error, was devised. Like STDOUT, STDERR is initially directed to your screen, but if you redirect STDOUT to a file or pipe, STDERR's destination remains unchanged. That way you always have a standard way to get warnings or errors through to where they're likely to do some good.

Whereas STDOUT has STDERR to fall back on, there is no preopened filehandle for times when STDIN has been redirected. That's because this need arises much less frequently than does the need for a coherent and reliable diagnostic stream. Rarely, your program may need to ask something of whoever ran it and read their response, even in the face of redirection. The more(1) and less(1) programs do this, for example, because their STDINs are often pipes from other programs whose long output you want to see a page at a time. On Unix systems, open the special file /dev/tty, which represents the controlling device for this login session. The open fails if the program has no controlling tty, which is the system's way of reporting that there's no one for your program to communicate with. This arrangement of redirectable standard streams makes it easy to plug the output from one program into the input of the next, and so on down the line.
    % first | second | third

That means to feed the output of the first program into the input of the second, and the output of the second into the input of the third. You might not realize it at first, but this is the same logic as seen when nesting function calls like third(second(first( ))), although the shell's pipeline is a bit easier to read because the transformations proceed from left to right instead of from the inside of the expression outward.

Under the uniform I/O interface of standard input and output, each program can be independently developed, tested, updated, and executed without risk of one program interfering with another, yet they still easily interoperate. They act as tools or parts used to build larger constructs, or as separate stages in a larger manufacturing process. Like having a huge stock of ready-made, interchangeable parts on hand, they can be reliably assembled into larger sequences of arbitrary length and complexity. If the larger sequences are given names by being placed into executable scripts indistinguishable from the store-bought parts, they can then go on to take part in still larger sequences as though they were basic tools themselves.

An environment where every data-transformation program does one thing well, and where data flows from one program to the next through redirectable standard input and output streams, is one that strongly encourages a level of power, flexibility, and reliability in software design that could not be achieved otherwise. This, in a nutshell, is the so-called tool-and-filter philosophy that underlies the design of not only the Unix shell but the entire operating system. Although problem domains do exist where this model breaks down, and Perl owes its very existence to plugging one of several infelicities the model forces on you, it is a model that has nevertheless demonstrated its fundamental soundness and scalability for nearly 30 years.
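This plumbing is equally available from inside a Perl program: the piped forms of open connect a filehandle to another program's output or input. A minimal sketch (Unix-flavored; it assumes an echo program can be found on your PATH):

```perl
#!/usr/bin/perl
# Read another program's output as if it were a file.  The list form
# of the piped open avoids the shell entirely.  (Unix-flavored sketch;
# assumes an "echo" program is available on your PATH.)
use strict;
use warnings;

open(my $pipe, "-|", "echo", "one", "two", "three")
    or die "can't run echo: $!";
my $line = <$pipe>;                       # read its output line by line
close($pipe) or warn "echo exited oddly: $?";
# $line now holds "one two three\n"
```

The same mechanism, with "|-" as the mode, writes into another program's standard input; both are covered at length in Chapter 16.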
I/O Operations

Perl's most common operations for file interaction are open, print, <FH> to read a record, and close. Perl's I/O functions are documented in Chapter 29 of Programming Perl, and in the perlfunc(1) and perlopentut(1) manpages. The next chapter details I/O operations like <FH>, print, seek, and tell. This chapter focuses on open and how you access the data, rather than on what you do with the data.

Arguably the most important I/O function is open. You typically pass it two or three arguments: the filehandle, a string containing the access mode indicating how to open the file (for reading, writing, appending, etc.), and a string containing the filename. If only two arguments are passed, the second contains both the access mode and the filename jammed together. We use this conflation of mode and path to good effect in Recipe 7.14. To open /tmp/log for writing and to associate it with the filehandle LOGFILE, say:

    open(LOGFILE, "> /tmp/log")
        or die "Can't write /tmp/log: $!";

The three most common access modes are < for reading, > for overwriting, and >> for appending. The open function is discussed in more detail in Recipe 7.1. Access modes can also include I/O layers like :raw and :encoding(iso-8859-1). Later in this Introduction we discuss using I/O layers to control buffering, deferring until Chapter 8 the use of I/O layers to convert the contents of files as they're read.

When opening a file or making virtually any other system call,[3] checking the return value is indispensable. Not every open succeeds; not every file is readable; not every piece of data you print reaches its destination. In robust programs, most programmers check the results of open, seek, tell, and close. You might want to check other functions, too.

[3] The term system call denotes a call into your operating system kernel. It is unrelated to the C and Perl function that's actually named system. We'll therefore often call these syscalls, after the C and Perl function of that name.
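A short sketch of what such a check looks like in practice (the path below is deliberately nonexistent):

```perl
#!/usr/bin/perl
# Check a syscall's return value, then inspect $!, which yields the
# strerror() text in string context and the errno value in numeric
# context.  (The path is deliberately nonexistent, for illustration.)
use strict;
use warnings;

if (!open(my $fh, "<", "/no/such/dir/no-such-file")) {
    my $message = "$!";     # e.g., "No such file or directory"
    my $errno   = 0 + $!;   # the numeric errno behind that message
    warn "open failed: $message (errno $errno)\n";
}
```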
If a function is documented to return an error under such-and-such conditions, and you don't check for those conditions, this will certainly come back to haunt you someday. The Perl documentation lists return values from all functions and operators. Pay special attention to the glyph-like annotations in Chapter 29 of Programming Perl listed on the righthand side next to each function entry; they tell you at a glance which variables are set on error and which conditions trigger exceptions. Typically, a function that's a true system call fails by returning undef, except for wait, waitpid, and syscall, which all return -1 on failure. You can find the system error message as a string, and its corresponding numeric code, in the $! variable. This is often used in die or warn messages.

The most common input operation in Perl is <FH>, the line input operator. Rather than sitting in the middle of its operands the way infix operators do, the line input operator surrounds its filehandle operand, making it more of a circumfix operator, like parentheses. It's also known as the angle operator, because of the left and right angle brackets that compose it, or as the readline function, since that's the underlying Perl core function it calls. A record is normally a line, but you can change the record terminator, as detailed in Chapter 8. If FH is omitted, it defaults to the special ARGV filehandle. When you read from this handle, Perl opens and reads, in succession, data from the filenames listed in @ARGV, or from STDIN if @ARGV is empty. Customary and curious uses of this are described in Recipe 7.14.

At one abstraction level, files are simply streams of octets; that is, of eight-bit bytes. Of course, hardware may impose other organizations, such as blocks and sectors for files on disk, or individual IP packets for a TCP connection on a network, but the operating system thankfully hides such low-level details from you.
At a higher abstraction level, files are streams of logical characters, independent of any particular underlying physical representation. Because Perl programs most often deal with text strings containing characters, this is the default that open sets on filehandles. See the Introduction to Chapter 8 or Recipe 8.11 for how and when to change that default.

Each filehandle has a numeric value associated with it, typically called its seek offset, representing the position at which the next I/O operation will occur. If you're thinking of files as octet streams, it's how many octets you are from the beginning of the file, with the starting offset represented by 0. This position is implicitly updated whenever you read or write non-zero-length data on a handle. It can also be updated explicitly with the seek function.

Text files are a slightly higher level of abstraction than octet streams, and the number of octets need not be identical to the number of characters. Unless you take special action, Perl's filehandles are logical streams of characters, not physical streams of octets. The only time those two numbers (characters and octets) are the same in text files is when each character read or written fits comfortably in one octet (because all code points are below 256), and when no special processing for end of line (such as conversion between "\cJ\cM" and "\n") occurs. Only then do logical character position and physical byte position work out to be the same. This is the sort of file you have with ASCII or Latin1 text files under Unix, where no fundamental distinction exists between text and binary files, which significantly simplifies programming. Unfortunately, 7-bit ASCII text is no longer prevalent, and even the 8-bit encodings of ISO 8859-n are quickly giving way to multibyte-encoded Unicode text.
In other words, because encoding layers such as ":utf8" and translation layers such as ":crlf" can change the number of bytes transferred between your program and the outside world, you cannot sum up how many characters you've transferred to infer your current file position in bytes. As explained in Chapter 1, characters are not bytes; at least, not necessarily and not dependably. Instead, you must use the tell function to retrieve your current file position. For the same reason, only values returned from tell (and the number 0) are guaranteed to be suitable for passing to seek. In Recipe 7.17, we read the entire contents of a file opened in update mode into memory, change our internal copy, and then seek back to the beginning of that file to write those modifications out again, thereby overwriting what we started with.

When you no longer have use for a filehandle, close it. The close function takes a single filehandle and returns true if the filehandle could be successfully flushed and closed, false otherwise. You don't need to explicitly close every filehandle: when you open a filehandle that's already open, Perl implicitly closes it first, and when your program exits, any open filehandles also get closed. These implicit closes are for convenience, not stability, because they don't tell you whether the syscall succeeded or failed. Not all closes succeed; even a close on a read-only file can fail. For instance, you could lose access to the device because of a network outage. It's even more important to check the close if the file was opened for writing; otherwise, you wouldn't notice if the filesystem filled up:

    close(FH) or die "FH didn't close: $!";

Closing filehandles as soon as you're done with them can also aid portability to non-Unix platforms, because some have problems in areas such as reopening a file before closing it, and renaming or removing a file while it's still open.
These operations pose no problem to POSIX systems, but others are less accommodating. The paranoid programmer even checks the close on the standard output stream at the program's end, in case STDOUT was redirected from the command line and the output filesystem filled up. Admittedly, your runtime system should take care of this for you, but it doesn't.

Checking standard error, though, is more problematic. After all, if STDERR fails to close, what are you planning to do about it? Well, you could determine why the close failed, to see whether there's anything you might do to correct the situation. You could even load the Sys::Syslog module and call syslog( ), which is what system daemons do, since they don't otherwise have access to a good STDERR stream.

STDOUT is the default filehandle used by the print, printf, and write functions if no filehandle argument is passed. Change this default with select, which takes the new default output filehandle and returns the previous one. The new output filehandle must already be open before calling select:

    $old_fh = select(LOGFILE);   # switch to LOGFILE for output
    print "Countdown initiated ...\n";
    select($old_fh);             # return to original output
    print "You have 30 seconds to reach minimum safety distance.\n";

Some of Perl's special variables change the behavior of the currently selected output filehandle. Most important is $|, which controls output buffering for each filehandle. Flushing output buffers is explained in Recipe 7.19.

Perl has functions for buffered and unbuffered I/O. Although there are some exceptions (see the following table), you shouldn't mix calls to buffered and unbuffered I/O functions. That's because buffered functions may keep data in their buffers that the unbuffered functions can't know about. The following table shows the two sets of functions you should not mix.
Functions on a particular row are only loosely associated; for instance, sysread doesn't have the same semantics as <FH>, but they are on the same row because they both read from a filehandle. Repositioning is addressed in Chapter 8, but we also use it in Recipe 7.17.

    Action          Buffered          Unbuffered
    input           <FH>, readline    sysread
    output          print             syswrite
    repositioning   seek, tell        sysseek

As of Perl v5.8 there is a way to mix these functions: I/O layers. You can't turn on buffering for the unbuffered functions, but you can turn off buffering for the buffered ones. Perl now lets you select the implementation of I/O you wish to use. One possible choice is :unix, which makes Perl use unbuffered syscalls rather than your stdio library or Perl's portable reimplementation of stdio, called perlio. Enable the unbuffered I/O layer when you open the file:

    open(FH, "<:unix", $filename) or die;

Having opened the handle with the unbuffered layer, you can now mix calls to Perl's buffered and unbuffered I/O functions with impunity, because with that I/O layer there are in reality no buffered I/O functions: when you print, Perl is really using the equivalent of syswrite. More information can be found in Recipe 7.19.

Recipe 7.1 Opening a File

7.1.1 Problem

You want to read or write a file from Perl.

7.1.2 Solution

Use open with two arguments for convenience or with three arguments for precision. Use sysopen for access to low-level features. The open function takes arguments specifying the internal filehandle to open, the external filename, and some indication of how to open it (the access mode).
Called with only two arguments, the second comprises both path and mode:

    open(SOURCE, "< $path")
        or die "Couldn't open $path for reading: $!\n";
    open(SINK, "> $path")
        or die "Couldn't open $path for writing: $!\n";

When called with three (or more) arguments, the mode is split out from the path, lest there be any ambiguity between the two:

    open(SOURCE, "<", $path)
        or die "Couldn't open $path for reading: $!\n";
    open(SINK, ">", $path)
        or die "Couldn't open $path for writing: $!\n";

The sysopen function takes either three or four arguments: filehandle, filename, file-access flags, plus an optional permissions value. The flags argument is a number constructed from constants provided by the Fcntl module:

    use Fcntl;
    sysopen(SOURCE, $path, O_RDONLY)
        or die "Couldn't open $path for reading: $!\n";
    sysopen(SINK, $path, O_WRONLY, 0600)
        or die "Couldn't open $path for writing: $!\n";

If you pass open or sysopen a scalar variable that's undefined, Perl fills in that variable with a new, anonymous filehandle:

    open(my $fh, "<", $path)
        or die "Couldn't open $path for reading: $!\n";

7.1.3 Discussion

All input and output goes through filehandles, regardless of whether filehandles are mentioned. Filehandles aren't exclusively connected to regular files in the filesystem; they're also used to communicate with other programs (see Chapter 16) and for network communication (see Chapter 17). The open function can also be used to manipulate file descriptors, as discussed in Recipe 7.9.

The open function quickly and conveniently solves the problem of associating a filehandle with a file. It permits a shorthand for common access modes (reading, writing, reading and writing, appending) passed in with the filename. It doesn't let you control the permissions that files are created with, or even whether files are created. For this level of control, you need sysopen, which uses constants provided by the Fcntl module to control individual settings like read, write, create, and truncate.
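For instance, the O_EXCL flag expresses a create-only open that plain open's mode strings cannot. A sketch (it uses a throwaway temporary directory rather than a real path):

```perl
#!/usr/bin/perl
# sysopen with O_WRONLY|O_CREAT|O_EXCL: create the file, but refuse
# to open it if it already exists.  Sketch using a temp directory.
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/once.txt";

sysopen(my $fh, $path, O_WRONLY|O_CREAT|O_EXCL, 0600)
    or die "can't create $path: $!";
print $fh "first writer wins\n";
close($fh);

# The same exclusive create now fails, because the file exists:
my $second = sysopen(my $fh2, $path, O_WRONLY|O_CREAT|O_EXCL, 0600);
warn "as expected, second create failed: $!\n" unless $second;
```

Because the existence check and the create happen in one syscall, this pattern is the traditional safe way to create lockfiles and other must-be-fresh files.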
Most programmers meet open long before they meet sysopen. Table 7-1 shows how open file-access modes (the "Filename" column) correspond to sysopen constants ("O_flags") and to the fopen(3) strings that IO::File->new can take ("Char"). "Read" and "Write" indicate that the filehandle may be read from or written to. "Append" means that no matter where you are in the file, output goes to the end of the file (on most systems). "Create" indicates whether the open statement creates a file if one having the given name doesn't already exist. "Trunc" indicates that open will clobber any existing data if the file already exists.

Table 7-1. File-access modes

    Filename   Read  Write  Append  Create  Trunc  O_flags               Char
    < file     yes   no     no      no      no     RDONLY                "r"
    > file     no    yes    no      yes     yes    WRONLY TRUNC CREAT    "w"
    >> file    no    yes    yes     yes     no     WRONLY APPEND CREAT   "a"
    +< file    yes   yes    no      no      no     RDWR                  "r+"
    +> file    yes   yes    no      yes     yes    RDWR TRUNC CREAT      "w+"
    +>> file   yes   yes    yes     yes     no     RDWR APPEND CREAT     "a+"

Here's a tip: you almost never want to use +> or +>>. The first clobbers your file before you can read it, and the second can be confusing: although you can read from anywhere with +>>, many systems jump to the end of the file whenever you write.

The sysopen function takes either three or four arguments:

    sysopen(FILEHANDLE, $name, $flags)
        or die "Can't open $name : $!";
    sysopen(FILEHANDLE, $name, $flags, $perms)
        or die "Can't open $name : $!";

$name is the name of the file, without any < or + funny business. $flags is a number, formed by ORing together separate values for O_CREAT, O_WRONLY, O_TRUNC, etc. The availability of particular O_* constants depends on your operating system, so consult the online documentation (usually open(2), but not always), or look in /usr/include/fcntl.h.
Common ones are:

    O_RDONLY      Read only
    O_WRONLY      Write only
    O_RDWR        Read and write
    O_CREAT       Create the file if it doesn't exist
    O_EXCL        Fail if the file already exists
    O_APPEND      Append to the file
    O_TRUNC       Truncate the file
    O_NONBLOCK    Non-blocking access

Less common O_* flags sometimes available include O_SHLOCK, O_EXLOCK, O_BINARY, O_NOCTTY, and O_SYNC. Consult your open(2) manpage or its local equivalent for details.

If you omit the $perms argument to sysopen, Perl uses the octal value 0666. Permissions values need to be specified in octal, and are modified by your process's current umask. A umask value is a number representing disabled permission bits: if your umask were 027 (group can't write; others can't read, write, or execute), then passing sysopen 0666 would create a file with mode 0640 (mathematically, 0666 & ~027 is 0640).

If umask seems confusing, here's some advice: supply a creation mode of 0666 for regular files and one of 0777 for directories and executable files. This gives users a choice: if they want protected files, they can choose process umasks of 022, 027, or even the particularly antisocial mask of 077. Programs should rarely if ever make policy decisions better left to the user. One exception, though, is files that should be kept private: mail files, web browser cookies, .rhosts files, and so on. In short, seldom if ever use 0644 as an argument to sysopen, because that takes away the user's option to have a more permissive umask.

Here are examples of open and sysopen in action.
To open for reading:

    open(FH, "<", $path)                                 or die $!;
    sysopen(FH, $path, O_RDONLY)                         or die $!;

To open for writing, creating a new file if needed, or else truncating an old one:

    open(FH, ">", $path)                                 or die $!;
    sysopen(FH, $path, O_WRONLY|O_TRUNC|O_CREAT)         or die $!;
    sysopen(FH, $path, O_WRONLY|O_TRUNC|O_CREAT, 0600)   or die $!;

To open for writing, creating a new file that must not previously exist:

    sysopen(FH, $path, O_WRONLY|O_EXCL|O_CREAT)          or die $!;
    sysopen(FH, $path, O_WRONLY|O_EXCL|O_CREAT, 0600)    or die $!;

To open for appending, creating the file if necessary:

    open(FH, ">>", $path)                                or die $!;
    sysopen(FH, $path, O_WRONLY|O_APPEND|O_CREAT)        or die $!;
    sysopen(FH, $path, O_WRONLY|O_APPEND|O_CREAT, 0600)  or die $!;

To open for appending, where the file must exist:

    sysopen(FH, $path, O_WRONLY|O_APPEND)                or die $!;

To open for update, where the file must exist:

    open(FH, "+<", $path)                                or die $!;
    sysopen(FH, $path, O_RDWR)                           or die $!;

To open for update, creating a new file if necessary:

    sysopen(FH, $path, O_RDWR|O_CREAT)                   or die $!;
    sysopen(FH, $path, O_RDWR|O_CREAT, 0600)             or die $!;

To open for update, where the file must not exist:

    sysopen(FH, $path, O_RDWR|O_EXCL|O_CREAT)            or die $!;
    sysopen(FH, $path, O_RDWR|O_EXCL|O_CREAT, 0600)      or die $!;

We use a creation mask of 0600 here only to show how to create a private file; the permissions argument is normally omitted.

7.1.4 See Also

The open, sysopen, and umask functions in perlfunc(1) and Chapter 29 of Programming Perl; the perlopentut(1) manpage; the documentation for the standard IO::File and Fcntl modules (also in Chapter 32 of Programming Perl); your system's open(2), fopen(3), and umask(2) manpages; Recipe 7.2; Recipe 9.11

Recipe 7.2 Opening Files with Unusual Filenames

7.2.1 Problem

You want to open a file with a funny filename, such as "-"; one that starts with <, >, or |; one that has leading or trailing whitespace; or one that ends with |.
You don't want these to trigger open's do-what-I-mean behavior, since in this case, that's not what you mean.

7.2.2 Solution

Call open with three arguments, not two, placing the mode in the second argument:

    open(HANDLE, "<", $filename)
        or die "cannot open $filename: $!\n";

Or simply use sysopen:

    sysopen(HANDLE, $filename, O_RDONLY)
        or die "cannot open $filename: $!\n";

7.2.3 Discussion

When open is called with three arguments, the access mode and the filename are kept separate. But when called with only two arguments, open has to extract the access mode and the filename from a single string. If your filename begins with the same characters used to specify an access mode, open could easily do something unexpected. Imagine the following code:

    $filename = shift @ARGV;
    open(INPUT, $filename)
        or die "Couldn't open $filename : $!\n";

If the user gave ">/etc/passwd" as the filename on the command line, this code would attempt to open /etc/passwd for writing. We can try to give an explicit mode, say for writing:

    open(OUTPUT, ">$filename")
        or die "Couldn't open $filename for writing: $!\n";

but even this would let the user give a filename of ">data", and the code would then append to the file data instead of erasing its old contents. The easiest solution is to pass three arguments to open, where the second argument is the mode and the third the path. Now there can be neither confusion nor subterfuge:

    open(OUTPUT, ">", $filename)
        or die "Couldn't open $filename for writing: $!\n";

Another solution is sysopen, which also takes the mode and filename as distinct arguments:

    use Fcntl;    # for file constants
    sysopen(OUTPUT, $filename, O_WRONLY|O_TRUNC)
        or die "Can't open $filename for writing: $!\n";

This special way that open interprets filenames, sometimes referred to as magic open, is a matter of convenience, and usually a good thing. You don't have to worry about a space or two between the access mode and the path.
You never have to use the special case of "-" to mean standard input or output. If you write a filter and use a simple two-argument open, users can pass "gzip -dc bible.gz|" as a filename, and your filter will automatically run the decoding program.

It's only those programs that run under special privilege that should worry about security with open. When designing programs that will be run on someone else's behalf, such as setuid programs or CGI scripts, the prudent programmer always considers whether the user can supply their own filename, and thereby cajole what would otherwise appear to be a normal open used for simple reading into overwriting a file or even running another program. Perl's -T command-line flag to enable taint checking would take care of this.

In versions of Perl without three-argument open (those before v5.6.0), one had little recourse but to resort to the following sort of chicanery to cope with filenames having leading or trailing whitespace:

    $file =~ s#^(\s)#./$1#;
    open(OUTPUT, "> $file\0")
        or die "Couldn't open $file for OUTPUT : $!\n";

The substitution protects initial whitespace (this cannot occur in fully specified filenames like "/etc/passwd", but only in relative filenames like ">passwd"). The NUL byte (ASCII 0, "\0") isn't considered part of the filename by open, but it does prevent trailing whitespace from being ignored.

7.2.4 See Also

The open and sysopen functions in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 7.1; Recipe 7.14; Recipe 16.2; Recipe 19.4; Recipe 19.5

Recipe 7.3 Expanding Tildes in Filenames

7.3.1 Problem

You want to open filenames like ~username/blah or ~/.mailrc, but open doesn't interpret the tilde to mean a home directory.

7.3.2 Solution

Either use the glob function:

    open(FH, glob("~joebob/somefile"))
        || die "Couldn't open file: $!";

or expand the filename manually with a substitution:

    $filename =~ s{ ^ ~ ( [^/]* ) }
                  { $1 ?
                        (getpwnam($1))[7]
                        : ( $ENV{HOME} || $ENV{LOGDIR} || (getpwuid($<))[7] )
                  }ex;

7.3.3 Discussion

There is a useful convention, begun with the Unix csh shell and propagated widely by tilde-style web addresses, that ~ in a filename represents a user's home directory. Thus:

    ~             # current user's home directory
    ~/blah        # file blah in current user's home directory
    ~user         # a particular user's home directory
    ~user/blah    # file blah in a particular user's home directory

Unfortunately, Perl's open function does not expand wildcards, including tildes. As of the v5.6 release, Perl internally uses the File::Glob module when you use the glob operator, so all you need do is glob the filename first:

    open(MAILRC, "<", "~/.mailrc")          # WRONG: tilde is a shell thing
        or die "can't open ~/.mailrc: $!";

    open(MAILRC, "<", glob("~/.mailrc"))    # so expand tilde first
        or die "can't open ~/.mailrc: $!";

The alternative solution, the substitution, uses /e to evaluate the replacement as Perl code. If a username follows the tilde, it's stored in $1, which getpwnam uses to extract the user's home directory from its return list. This directory becomes the replacement string. If the tilde was not followed by a username, we substitute in either the current HOME environment variable or the LOGDIR one. If neither of those is valid, we look up the effective user ID's home directory. You could spell glob('~gnat') as <~gnat>, but that would look too much like a read from a filehandle, so don't do that.

7.3.4 See Also

The glob and getpwnam functions in perlfunc(1) and Chapter 29 of Programming Perl; your system's getpwnam(2) manpage; Recipe 9.6

Recipe 7.4 Making Perl Report Filenames in Error Messages

7.4.1 Problem

Your program works with files, but Perl's errors and warnings only report the last used filehandle, not the name of the file.
7.4.2 Solution

Use the filename as the filehandle:

    open($path, "<", $path)
        or die "Couldn't open $path for reading : $!\n";

7.4.3 Discussion

Ordinarily, error messages say something like:

    Argument "3\n" isn't numeric in multiply at tallyweb
        line 16, <LOG> chunk 17.

The filehandle LOG doesn't help much because you don't know which file the handle was connected to. By using the filename itself as the indirect filehandle, Perl produces more informative errors and warnings:

    Argument "3\n" isn't numeric in multiply at tallyweb
        line 16, </usr/local/data/mylog3.dat> chunk 17.

Unfortunately, this doesn't work with strict refs turned on, because the variable $path doesn't really have a filehandle in it, only a string that sometimes behaves like one. The chunk number mentioned in warnings and error messages is the current value of the $. variable.

7.4.4 See Also

Recipe 7.1; the open function in perlfunc(1) and Chapter 29 of Programming Perl

Recipe 7.5 Storing Filehandles into Variables

7.5.1 Problem

You want to use a filehandle like a normal variable so you can pass it to or return it from a function, store it in a data structure, and so on.

7.5.2 Solution

The easiest way to get a filehandle into a variable is to have open put it there for you:

    open(my $fh, "<", $filename)
        or die "$0: can't open $filename: $!";

To store named filehandles into a variable or pass them into or out of a function, use typeglob notation (*FH):

    $variable = *FILEHANDLE;      # save in variable
    subroutine(*FILEHANDLE);      # or pass directly

    sub subroutine {
        my $fh = shift;
        print $fh "Hello, filehandle!\n";
    }

7.5.3 Discussion

If you pass an undefined scalar variable as the first argument to open, Perl allocates an anonymous typeglob and stores a reference to that typeglob in that scalar, effectively creating filehandles on demand.
Like all other references, autovivified filehandles are subject to garbage collection, so this code doesn't leak a filehandle:

    {
        open(my $fh, "< /etc/motd") or die;
        local $/;                # slurp mode
        $text = <$fh>;
    }

When Perl reaches the end of the block, $fh goes out of scope. As explained earlier in the Introduction, because that variable contained the last reference to the anonymous filehandle created by open, the variable is garbage collected and the filehandle implicitly closed.

Autovivified filehandles, being anonymous and already held in variables, don't show you how to pass named filehandles as function parameters or store them in variables, including elements of arrays or hashes. By named filehandles, we mean those of the form FH, including all predefined handles, such as STDIN and ARGV. So let's look at what FH is, and how to extract a scalar value from it to use for all of those things. The named filehandles used in:

    print STDERR "stuff\n";
    $input = <STDIN>;
    open(TTY, "+<", "/dev/tty");
    if (eof(ARGV)) { .... }

are names, not variables. They're like subroutines in that way. This makes them inconvenient to pass around or store in data structures. Assuming you follow the prudent advice to compile all your code under use strict by default, normally you can't get away with this:

    $fh = SOMEHANDLE;
    somefunc(SOMEHANDLE);

because, absent declarations to the contrary, SOMEHANDLE is in both of these cases an unquoted string, which is forbidden by use strict. Even if you aren't using strict subs, you'll get into trouble if you try to pass your handle into a subroutine that was compiled under strict refs, or in a different package than the calling code was compiled in. The four named handles (STDERR, STDIN, TTY, and ARGV) shown earlier didn't require special handling, but not because they are built-ins themselves; TTY, in fact, is not. Rather, they were okay because the built-in operations using them as arguments are all prototyped to take a filehandle argument.
So you must do one of two things. You could use a prototype for the function, as explained in Recipe 7.6. Otherwise, you must use something that Perl will accept in lieu of a filehandle name. Acceptable substitutes include strings, typeglobs, references to typeglobs, and I/O objects, all of which may be stored in variables or passed into a function for later use as indirect filehandles.

somefunc( SOMEHANDLE );         # only w/ somefunc(*) proto
somefunc( "SOMEHANDLE" );       # a quoted string
somefunc( *SOMEHANDLE );        # a typeglob
somefunc( \*SOMEHANDLE );       # ref to a typeglob
somefunc( *SOMEHANDLE{IO} );    # an I/O object

Using a quoted string for the named handle has potential problems, as already explained, although this can work if the code is careful enough (again, see the next recipe). Better to use typeglob notation, either directly using *SOMEHANDLE or by reference using \*SOMEHANDLE:

somefunc(*SOMEHANDLE);

$fh = *SOMEHANDLE;              # or indirectly via a variable
somefunc($fh);
print $fh "data\n";

Typeglob notation spares you quoting or qualifying the handle name. It may help to conceptualize the asterisk as the type symbol for a filehandle. Like the little colored balls from high school chemistry that stood for atomic particles, it's not really true, but it is a convenient mental shorthand. By the time you understand where this model breaks down, you won't need it anymore.

If you assign any of the four alternate forms for named filehandles into a scalar variable, you can use that variable as an indirect filehandle wherever you would use a named filehandle. However, complex expressions and subscripts into hashes or arrays cannot be used directly with built-ins like print, printf, or the line input operator.
These are syntactically illegal and won't even compile:

@fd = (*STDIN, *STDOUT, *STDERR);
print $fd[1] "Type it: ";               # WRONG
$got = <$fd[0]>;                        # WRONG
print $fd[2] "What was that: $got";     # WRONG

With print and printf, you can get around this by using a block that returns an expression where you would place the filehandle:

print { $fd[1] } "funny stuff\n";
printf { $fd[1] } "Pity the poor %x.\n", 3_735_928_559;
Pity the poor deadbeef.

That block is a proper block in all senses, so you can put more complicated code there. This sends the message out to one of two places:

$ok = -x "/bin/cat";
print { $ok ? $fd[1] : $fd[2] } "cat stat $ok\n";
print { $fd[ 1 + ($ok || 0) ] } "cat stat $ok\n";

This is so-called "indirect object" notation, discussed at length in Chapter 13. The restriction against using anything but a simple scalar variable in the indirect object slot holds true for any sort of object. As with user-created objects, infix arrow notation avoids syntactic snafus here. If you have the IO::Handle module loaded, or anything that inherits from it, use an expression that produces the filehandle as though it were a proper object and invoke methods from that class:

$fd[1]->print("funny stuff\n");
($ok ? $fd[1] : $fd[2])->print("cat stat $ok\n");

This approach of treating print and printf like object method calls won't work for the line input operator. Assuming you've been storing typeglobs in your structure as we did previously, the built-in readline function reads records just as <FH> does. Given the preceding initialization of @fd, this would work:

$got = readline($fd[0]);

or, with IO::Handle available, you can use the getline method:

$got = $fd[0]->getline( );

IO::Handle doesn't replace the readline function using just one method; it uses two, one per context.
If you prefer readline's context-dependent behavior, you could always do this, adding something to the class on the fly:

sub IO::Handle::readline {
    my $fh = shift;
    if (wantarray) {
        return $fh->getlines( );
    } else {
        return $fh->getline( );
    }
}

7.5.4 See Also

The open function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 7.1; the documentation for the standard IO::Handle module (also in Chapter 32 of Programming Perl); and the "Typeglobs and Filehandles" sections of Chapter 2 of Programming Perl

Recipe 7.6 Writing a Subroutine That Takes Filehandles as Built-ins Do

7.6.1 Problem

You can pass a bareword filehandle to Perl functions like eof, and you'd like to write similar subroutines of your own.

7.6.2 Solution

Use the Symbol module's qualify_to_ref in conjunction with a * prototype on the subroutine:

use Symbol qw(qualify_to_ref);

sub my_eof (*) {
    my $handle = shift;
    $handle = qualify_to_ref($handle, caller( ));
    # use $handle
}

7.6.3 Discussion

The * prototype tells Perl that the function expects a bareword filehandle as its argument. This lets you call the function like so:

my_eof(HANDLE);

This works even when use strict 'subs' is in effect. The function receives a string as its argument, though. To safely use the argument as a filehandle, you need the Symbol module to turn it into a reference to a typeglob. Since typeglob refs can be used wherever you'd use named filehandles, store that reference in a scalar variable and use the variable as an indirect filehandle within your subroutine.

If you pass in a filehandle that is already a reference to a typeglob, like those autovivified by open, Perl and qualify_to_ref still do the right thing:

open(my $fh, "<", $filename) or die;
my_eof($fh);

This technique is used in Recipe 7.23.
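As a self-contained sketch of the same pattern (the scratch file is created with File::Temp only so the example runs on its own), a my_eof that simply delegates to the built-in eof works identically for bareword and lexical handles:

```perl
use strict;
use warnings;
use Symbol qw(qualify_to_ref);
use File::Temp qw(tempfile);

# my_eof(*): accept a bareword, string, typeglob, glob ref, or lexical
# handle, normalize it with qualify_to_ref, and delegate to built-in eof
sub my_eof (*) {
    my $handle = qualify_to_ref(shift, caller( ));
    return eof($handle);
}

my ($fh, $name) = tempfile(UNLINK => 1);
print $fh "just one line\n";
seek($fh, 0, 0) or die "can't rewind: $!";

die "should not be at EOF yet" if my_eof($fh);
my $line = <$fh>;                       # consume the only line
die "should be at EOF now" unless my_eof($fh);
print "my_eof agrees with eof\n";
```

Note that qualify_to_ref has a ($;$) prototype, so the caller( ) in the second slot is evaluated in scalar context and yields just the calling package name, which is what qualifies a bare string like "HANDLE" correctly.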
7.6.4 See Also

The documentation for the standard module Symbol (also in Chapter 32 of Programming Perl); the "Prototypes" section in the perlsub(1) manpage (also in Chapter 6 of Programming Perl); Recipe 7.23

Recipe 7.7 Caching Open Output Filehandles

7.7.1 Problem

You need more output files open simultaneously than your system allows.

7.7.2 Solution

Use the standard FileCache module:

use FileCache;

cacheout ($path);           # each time you use a filehandle
print $path "output";

7.7.3 Discussion

FileCache's cacheout function lets you work with more output files than your operating system lets you have open at any one time. If you use it to open an existing file that FileCache is seeing for the first time, the file is truncated to length zero, no questions asked. However, in its opening and closing of files in the background, cacheout tracks the files it has opened before and does not overwrite them, but appends to them instead. cacheout does not create directories for you, so if you give it /usr/local/dates/merino.ewe to open but the directory /usr/local/dates doesn't exist, cacheout will die.

The cacheout function checks the value of the C-level constant NOFILE from the standard system include file sys/param.h to determine how many concurrently open files are allowed on your system. This value can be incorrect on some systems and even missing on a few (for instance, on those where the maximum number of open file descriptors is a process resource limit that can be set with the limit or ulimit commands). If cacheout can't get a value for NOFILE, set $FileCache::cacheout_maxopen to be four less than the correct value, or choose a reasonable number by trial and error.

Example 7-1 splits an xferlog file (created by most FTP servers nowadays) into separate files, each named after the authenticated user. Fields in xferlog files are space-separated, with the fourth field from the last holding the authenticated username.

Example 7-1.
splitwulog

#!/usr/bin/perl
# splitwulog - split wuftpd log by authenticated user
use FileCache;
$outdir = "/var/log/ftp/by-user";
while (<>) {
    unless (defined ($user = (split)[-4])) {
        warn "Invalid line: $.\n";
        next;
    }
    $path = "$outdir/$user";
    cacheout $path;
    print $path $_;
}

7.7.4 See Also

Documentation for the standard FileCache module (also in Chapter 32 of Programming Perl); the open function in perlfunc(1) and in Chapter 29 of Programming Perl

Recipe 7.8 Printing to Many Filehandles Simultaneously

7.8.1 Problem

You need to output the same data to several different filehandles.

7.8.2 Solution

If you want to do it without forking, write a foreach loop that iterates across the filehandles:

foreach $filehandle (@FILEHANDLES) {
    print $filehandle $stuff_to_print;
}

If you don't mind forking, open a filehandle that's a pipe to a tee program:

open(MANY, "| tee file1 file2 file3 > /dev/null") or die $!;
print MANY "data\n"                               or die $!;
close(MANY)                                       or die $!;

If you don't have a tee program handy, use the IO::Tee module from CPAN:

use IO::Tee;
$tee = IO::Tee->new(@FILEHANDLES);
print $tee $stuff_to_print;

7.8.3 Discussion

A filehandle sends output to one file or program only. To duplicate output to several places, you must call print multiple times or make a filehandle connected to a program like tee, which distributes its input elsewhere. If you use the first option, it's probably easiest to put the filehandles in a list or array and loop through them (see Recipe 7.5):

for $fh (*FH1, *FH2, *FH3) { print $fh "whatever\n" }

However, if your system supports the tee(1) program, or if you've installed the Perl version from Recipe 8.25, you may open a pipe to tee and let it do the work of copying the file to several places.
Remember that tee normally also copies its output to STDOUT, so you must redirect tee's standard output to /dev/null if you don't want an extra copy:

open (FH, "| tee file1 file2 file3 >/dev/null");
print FH "whatever\n";

You could even redirect your own STDOUT to the tee process, and then you're able to use a regular print directly:

# make STDOUT go to three files, plus original STDOUT
open (STDOUT, "| tee file1 file2 file3") or die "Teeing off: $!\n";
print "whatever\n"                       or die "Writing: $!\n";
close(STDOUT)                            or die "Closing: $!\n";

The IO::Tee module from CPAN gives you a single filehandle (an object of the IO::Tee class) that you can write to. Whatever you print to it, the object prints to each of its filehandles. Pass the destination filehandles to the constructor:

use IO::Tee;
$t = IO::Tee->new(*FH1, *FH2, *FH3);
print $t "Hello, world\n";
print $t "Goodbye, universe\n";

In addition to print, you can perform any I/O operation you like on an IO::Tee filehandle. For example, if you close $t in the preceding example, close returns true only if FH1, FH2, and FH3 were all successfully closed.

7.8.4 See Also

The print function in perlfunc(1) and in Chapter 29 of Programming Perl; the "Typeglobs and Filehandles" sections of Chapter 2 of Programming Perl; the documentation for the CPAN module IO::Tee; we use this technique in Recipe 8.25 and Recipe 13.15

Recipe 7.9 Opening and Closing File Descriptors by Number

7.9.1 Problem

You know which file descriptors you'd like to do I/O on, but Perl requires filehandles, not descriptor numbers.
7.9.2 Solution

To open the file descriptor, supply open with "<&=" or "<&" as part of the file access mode:

open(FH, "<&=", $FDNUM);    # open FH to the descriptor itself
open(FH, "<&",  $FDNUM);    # open FH to a copy of the descriptor

Or use the IO::Handle module's new_from_fd class method:

use IO::Handle;

$fh = IO::Handle->new_from_fd($FDNUM, "r");

To close a file descriptor by number, either use the POSIX::close function or open it first as shown previously.

7.9.3 Discussion

Occasionally you have a file descriptor but no filehandle. Perl's I/O system uses filehandles instead of file descriptors, so you have to make a new filehandle from an already open file descriptor. The "<&", ">&", and "+<&" access modes to open do this for reading, writing, and updating, respectively. Adding an equals sign to these (making them "<&=", ">&=", and "+<&=") is more parsimonious of file descriptors and nearly always what you want. That's because the underlying implementation of Perl's open statement then uses only the C-level fdopen(3) function from the C library, not a dup2(2) syscall into the kernel.

The new_from_fd IO::Handle object method is equivalent to:

use IO::Handle;

$fh = new IO::Handle;
$fh->fdopen($FDNUM, "r");   # open descriptor $FDNUM for reading

Here's how you'd open the file descriptor that the MH mail system feeds its child processes. It identifies the descriptor in the environment variable MHCONTEXTFD:

$fd = $ENV{MHCONTEXTFD};
open(MHCONTEXT, "<&=", $fd) or die "couldn't fdopen $fd: $!";
# after processing
close(MHCONTEXT)            or die "couldn't close context file: $!";

Closing a file descriptor by number is even rarer. If you've already opened a filehandle for the file descriptor you want to close, simply use Perl's close function on the filehandle.
If you don't have a filehandle for that file descriptor, the POSIX::close function closes a file descriptor by number:

use POSIX;

POSIX::close(3);            # close fd 3

7.9.4 See Also

The open function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard POSIX and IO::Handle modules (also in Chapter 32 of Programming Perl); your system's fdopen(3) manpage

Recipe 7.10 Copying Filehandles

7.10.1 Problem

You want a copy of a filehandle.

7.10.2 Solution

To create an alias for a named filehandle, say:

*ALIAS = *ORIGINAL;

Use open with the & file access mode to create an independent copy of the file descriptor for that filehandle:

open(OUTCOPY, ">&STDOUT") or die "Couldn't dup STDOUT: $!";
open(INCOPY,  "<&STDIN")  or die "Couldn't dup STDIN : $!";

Use open with the &= mode to create an alias for that filehandle or file descriptor:

open(OUTALIAS, ">&=STDOUT") or die "Couldn't alias STDOUT: $!";
open(INALIAS,  "<&=STDIN")  or die "Couldn't alias STDIN : $!";
open(BYNUMBER, ">&=5")      or die "Couldn't alias file descriptor 5: $!";

With other types of filehandles (typeglobs, objects), use the same technique with a three-argument open:

open(my $copy, "<&",  $original) or die "Couldn't dup original: $!";
open(my $copy, "<&=", $original) or die "Couldn't alias original: $!";

7.10.3 Discussion

If you create an alias for a filehandle with typeglobs, only one Perl I/O object is still being accessed. If you close one of these aliased filehandles, the I/O object is closed. Any further attempt to use a copy of that filehandle fails, silently by default or, if you have warnings enabled, with the warning "print on closed filehandle". When alternating access through aliased filehandles, writes work as you'd expect because there are no duplicated stdio data structures to get out of sync.

If you create a copy of a file descriptor with open(COPY, ">&HANDLE"), you're really calling the dup(2) syscall.
You get two independent file descriptors whose file position, locks, and flags are shared, but which have independent stdio buffers. Closing one filehandle doesn't affect its copy. Simultaneously accessing the file through both filehandles is a recipe for disaster. Instead, this technique is normally used to save and restore STDOUT and STDERR:

# take copies of the file descriptors
open(OLDOUT, ">&STDOUT");
open(OLDERR, ">&STDERR");

# redirect stdout and stderr
open(STDOUT, "> /tmp/program.out") or die "Can't redirect stdout: $!";
open(STDERR, ">&STDOUT")           or die "Can't dup stdout: $!";

# run the program
system($joe_random_program);

# close the redirected filehandles
close(STDOUT) or die "Can't close STDOUT: $!";
close(STDERR) or die "Can't close STDERR: $!";

# restore stdout and stderr
open(STDERR, ">&OLDERR") or die "Can't restore stderr: $!";
open(STDOUT, ">&OLDOUT") or die "Can't restore stdout: $!";

# avoid leaks by closing the independent copies
close(OLDOUT) or die "Can't close OLDOUT: $!";
close(OLDERR) or die "Can't close OLDERR: $!";

If you create an alias for a file descriptor using open(ALIAS, ">&=HANDLE"), you're really calling the fdopen(3) function from the stdio library or its equivalent. You get a single file descriptor with two stdio buffers accessed through two filehandles. Closing one filehandle closes the file descriptor of any aliases, but not their filehandles: if you tried to print to a filehandle whose alias had been closed, Perl wouldn't give a "print on closed filehandle" warning, even though the print failed. In short, accessing the file through both filehandles is also a recipe for disaster. This mode is really used only to open a file descriptor by number. See Recipe 7.9 for more information on this.
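The typeglob-alias behavior can be demonstrated directly (the scratch file comes from File::Temp only so the sketch is runnable on its own): after *ALIAS = *ORIGINAL, both names share one I/O object, so closing either name closes the file for both:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $name) = tempfile(UNLINK => 1);
close($tmp);

open(ORIGINAL, ">", $name) or die "can't open $name: $!";
*ALIAS = *ORIGINAL;                   # alias: one I/O object, two names

print ALIAS "written through the alias\n" or die "can't write: $!";
close(ORIGINAL) or die "can't close: $!"; # closes the shared I/O object

# ALIAS is now closed as well, so this print fails (quietly, since we
# suppress the "print on closed filehandle" warning for the demo)
no warnings 'closed';
my $ok = print ALIAS "too late\n";
print "print through closed alias failed, as expected\n" unless $ok;
```

Contrast this with open(COPY, ">&ORIGINAL"), where closing one handle would leave the dup'd copy usable.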
7.10.4 See Also

The open function in perlfunc(1) and in Chapter 29 of Programming Perl; your system's dup(2) manpage

Recipe 7.11 Creating Temporary Files

7.11.1 Problem

You need to create a temporary file and have it deleted automatically when your program exits. For instance, if you needed a temporary configuration file to feed a program you're about to launch, you'd need a name for that file so you could pass the filename along to the utility program. In other cases, you may want a temporary file to write to and read from, but don't need a filename for it.

7.11.2 Solution

Use the tempfile function from the File::Temp module:

use File::Temp qw/ tempfile /;

$fh = tempfile( );                  # just the handle

perhaps in conjunction with a temporary directory:

use File::Temp qw/ tempfile tempdir /;

$dir = tempdir( CLEANUP => 1 );                 # or specify a directory
($fh, $filename) = tempfile( DIR => $dir );

$template = "myprogtempXXXXXX";                 # trailing Xs are changed
($fh, $filename) = tempfile( $template, DIR => $dir);
($fh, $filename) = tempfile( $template, SUFFIX => ".data");

7.11.3 Discussion

The File::Temp module's functions are the best way to make temporary files. For one thing, they're extremely easy to use. For another, they're more portable than direct calls to the operating system. But perhaps of greatest importance is the care they take in security matters both various and subtle, especially those involving race conditions. Although this module provides a handful of slightly different functions for creating a temporary file, most are there simply to support legacy interfaces; few users will need more than the basic tempfile( ) function. This function safely and atomically creates and opens a brand new, empty file in read-write mode.
In scalar context, it returns a filehandle to that temporary file; in list context, it returns the handle and pathname of the temporary file:

use File::Temp qw(tempfile);

# just the handle
$fh = tempfile( );

# handle and filename
($fh, $filename) = tempfile( );

The tempfile function optionally accepts an argument containing a template, followed by named arguments in pairs. Named arguments specify such things as the directory to use instead of the current directory, a specific file extension to use, and, on systems that support such a thing, whether the tempfile should be immediately unlinked before its handle is returned. (Files whose names have already been deleted from the filesystem are especially difficult for the guys with the black hats to find.) Any trailing X characters in the template are replaced by random characters in the final filename. You might use this feature if you need a temporary file with a specific extension.

($fh, $filename) = tempfile(DIR => $dir);
($fh, $filename) = tempfile($template);
($fh, $filename) = tempfile($template, DIR    => $dir);
($fh, $filename) = tempfile($template, SUFFIX => ".dat");
($fh, $filename) = tempfile($template, UNLINK => 1);

Unless you specify OPEN => 0, the temporary file will be deleted automatically when your program finally exits or the file is closed. In recent releases, Perl's open function offers a simple way to create temporary files whose names you cannot know.
Explicitly pass undef as the filename to open:

open(my $fh, "+>", undef)
    or die "$0: can't create temporary file: $!\n";

7.11.4 See Also

The documentation for the standard File::Temp module (also in Chapter 32 of Programming Perl); the open function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 7.9

Recipe 7.12 Storing a File Inside Your Program Text

7.12.1 Problem

You have data that you want to bundle with your program and treat as though it were in a file, but you don't want it to be in a different file.

7.12.2 Solution

Use the __DATA__ or __END__ tokens after your program code to mark the start of a data block, which can be read inside your program or module from the DATA filehandle.

Use __DATA__ within a module:

while (<DATA>) {
    # process the line
}
__DATA__
your data goes here

Similarly, use __END__ within the main program file:

while (<DATA>) {
    # process the line
}
__END__
your data goes here

7.12.3 Discussion

The __DATA__ and __END__ symbols tell the Perl compiler there's nothing more for it to do in the current file. They represent the logical end for code in a module or a program before the physical end-of-file. Text after __DATA__ or __END__ can be read through the per-package DATA filehandle. For example, take the hypothetical module Primes. Text after __DATA__ in Primes.pm can be read from the Primes::DATA filehandle.

__END__ behaves as a synonym for __DATA__ in the main package. Any text occurring after an __END__ token in a module is completely inaccessible.

This lets you write self-contained programs instead of keeping data in separate files. Often this is used for documentation. Sometimes it's configuration data or old test data that the program was originally developed with, left lying about should it ever be needed again. Another trick is to use DATA to find out the current program's or module's size or last modification date.
On most systems, the $0 variable will contain the full pathname to your running script. On systems where $0 is not correct, you could try the DATA filehandle instead. This can be used to pull in the size, modification date, etc. Put a special token __DATA__ at the end of the file (and maybe a warning not to delete it), and the DATA filehandle is available to the script itself:

use POSIX qw(strftime);

$raw_time = (stat(DATA))[9];
$size     = -s DATA;
$kilosize = int($size / 1024) . "k";

print "<P>Script size is $kilosize\n";
print strftime("<P>Last script update: %c (%Z)\n", localtime($raw_time));

__DATA__
DO NOT REMOVE THE PRECEDING LINE.
Everything else in this file will be ignored.

If you want to store more than one file in your program, see Recipe 7.13.

7.12.4 See Also

The "Scalar Value Constructors" section of perldata(1), and the "Other Literal Tokens" section of Chapter 2 of Programming Perl; Recipe 7.13

Recipe 7.13 Storing Multiple Files in the DATA Area

7.13.1 Problem

You've figured out how to use __END__ or __DATA__ to store a virtual file in your source code, but you now want multiple virtual files in one source file.

7.13.2 Solution

Use the Inline::Files module from CPAN. Carefully.

use Inline::Files;

while (<SETUP>) {
    # ...
}

while (<EXECUTION>) {
    # ...
}

__SETUP__
everything for the SETUP filehandle goes here
__EXECUTION__
everything for the EXECUTION filehandle goes here

7.13.3 Discussion

One limitation of the __DATA__ setup is that you can have only one embedded data file per physical file. The CPAN module Inline::Files cleverly circumvents this restriction by providing logical embedded files. It's used like this:

use Inline::Files;

#
# All your code for the file goes here first, then...
#

__ALPHA__
This is the data in the first virtual file, ALPHA.
__BETA__
This is the data in the next virtual file, BETA.
__OMEGA__
This is the data in yet another virtual file, OMEGA.
__ALPHA__
This is more data in the second part of virtual file, ALPHA.

The code is expected to read from filehandles whose names correspond to the double-underbarred symbols: here ALPHA, BETA, and OMEGA. You may have more than one section by the same name in the same program, and differently named sections needn't be read in any particular order. These handles work much like the ARGV handle does. For one thing, they're implicitly opened on first usage.
For example, using the following code in the designated spot in the preceding code example:

while (<OMEGA>) { print "omega data: $_"; }
while (<ALPHA>) { print "alpha data: $_"; }

would produce this:

omega data: This is the data in yet another virtual file, OMEGA.
omega data:
alpha data: This is the data in the first virtual file, ALPHA.
alpha data:
alpha data: This is more data in the second part of virtual file, ALPHA.
alpha data:

Also like the ARGV handle, while reading from a particular handle, the list of available virtual files is in the array by that name, and the currently opened virtual file is in the scalar by that name. There's also a hash by that name holding various bits of status information about that set of virtual files, including the current file, line number, and byte offset. If we used the Perl debugger on this program and dumped out the variables, it might show this:

DB<2> x \$ALPHA, \@ALPHA, \%ALPHA
0  SCALAR(0x362e34)
   -> '/home/tchrist/inline-demo(00000000000000000291)'
1  ARRAY(0x362e40)
   0  '/home/tchrist/inline-demo(00000000000000000291)'
   1  '/home/tchrist/inline-demo(00000000000000000476)'
2  HASH(0x362edc)
   'file' => undef
   'line' => undef
   'offset' => undef
   'writable' => 1

What's that last line telling us? It tells whether that virtual file is writable. By default, if your script is writable, then so too are the virtual files, and they are opened in read-write mode! Yes, that means you can update them yourself, even adding new virtual files to your source code simply by running that code. There is absolutely no limit to the mischief or grief that can ensue from this: catastrophes are easy to come by as you accidentally obliterate your painstakingly won data. We therefore implore you to back everything up first. The module itself supports an automatic mechanism for this:

use Inline::Files -backup;

which saves the original in a file with ".bak" appended to its name.
You may also specify an explicit backup file:

use Inline::Files -backup => "/tmp/safety_net";

7.13.4 See Also

The documentation for the CPAN module Inline::Files; Recipe 7.12

Recipe 7.14 Writing a Unix-Style Filter Program

7.14.1 Problem

You want to write a program that takes a list of filenames on the command line and reads from STDIN if no filenames were given. You'd like the user to be able to give the file "-" to indicate STDIN or "someprogram |" to indicate the output of another program. You might want your program to modify the files in place or to produce output based on its input.

7.14.2 Solution

Read lines with <>:

while (<>) {
    # do something with the line
}

7.14.3 Discussion

When you say:

while (<>) {
    # ...
}

Perl translates this into:[4]

[4] Except that the code written here won't work, because ARGV has internal magic.

unshift(@ARGV, "-") unless @ARGV;
while ($ARGV = shift @ARGV) {
    unless (open(ARGV, $ARGV)) {
        warn "Can't open $ARGV: $!\n";
        next;
    }
    while (defined($_ = <ARGV>)) {
        # ...
    }
}

You can access ARGV and $ARGV inside the loop to read more from the filehandle or to find the filename currently being processed. Let's look at how this works.

Behavior

If the user supplies no arguments, Perl sets @ARGV to the single string "-". This is shorthand for STDIN when opened for reading and STDOUT when opened for writing. It's also what lets the user of your program specify "-" as a filename on the command line to read from STDIN. Next, the file-processing loop removes one argument at a time from @ARGV and copies the filename into the global variable $ARGV. If the file cannot be opened, Perl goes on to the next one. Otherwise, it processes a line at a time. When the file runs out, the loop goes back and opens the next one, repeating the process until @ARGV is exhausted. The open statement didn't say open(ARGV, "<", $ARGV). There's no extra less-than sign supplied.
This allows for interesting effects, like passing the string "gzip -dc file.gz |" as an argument, to make your program read the output of the command "gzip -dc file.gz". See Recipe 16.6 for more about this use of magic open.

You can change @ARGV before or inside the loop. Let's say you don't want the default behavior of reading from STDIN if there aren't any arguments; you want it to default to all C or C++ source and header files. Insert this line before you start processing <ARGV>:

@ARGV = glob("*.[Cch]") unless @ARGV;

Process options before the loop, either with one of the Getopt libraries described in Chapter 15 or manually:

# arg demo 1: Process optional -c flag
if (@ARGV && $ARGV[0] eq "-c") {
    $chop_first++;
    shift;
}

# arg demo 2: Process optional -NUMBER flag
if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) {
    $columns = $1;
    shift;
}

# arg demo 3: Process clustering -a, -i, -n, or -u flags
while (@ARGV && $ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) {
    next if /^$/;
    s/a// && (++$append,      redo);
    s/i// && (++$ignore_case, redo);
    s/n// && (++$opt_n,       redo);
    s/u// && (++$unbuffer,    redo);
    die "usage: $0 [-ainu] [filenames] ...\n";
}

Other than its implicit looping over command-line arguments, <> is not special. The special variables controlling I/O still apply; see Chapter 8 for more on them. You can set $/ to set the line terminator, and $. contains the current line (record) number. If you undefine $/, you don't get the concatenated contents of all files at once; you get one complete file each time:

undef $/;
while (<>) {
    # $_ now has the complete contents of
    # the file whose name is in $ARGV
}

If you localize $/, the old value is automatically restored when the enclosing block exits:

{       # create block for local
    local $/;           # record separator now undef
    while (<>) {
        # do something; called functions still have
        # undeffed version of $/
    }
}                       # $/ restored here

Because processing <ARGV> never explicitly closes filehandles, the record number in $. is not reset.
If you don't like that, you can explicitly close the file yourself to reset $.:

while (<>) {
    print "$ARGV:$.:$_";
    close ARGV if eof;
}

The eof function defaults to checking the end-of-file status of the last file read. Since the last handle read was ARGV, eof reports whether we're at the end of the current file. If so, we close it and reset the $. variable. On the other hand, the special notation eof( ), with parentheses but no argument, checks whether we've reached the end of all files in the <ARGV> processing.

Command-line options

Perl has command-line options, -n, -p, -a, and -i, to make writing filters and one-liners easier. The -n option adds the while (<>) loop around your program text. It's normally used for filters like grep or programs that summarize the data they read. The program is shown in Example 7-2.

Example 7-2. findlogin1

#!/usr/bin/perl
# findlogin1 - print all lines containing the string "login"

while (<>) {            # loop over files on command line
    print if /login/;
}

The program in Example 7-2 could be written as shown in Example 7-3.

Example 7-3. findlogin2

#!/usr/bin/perl -n
# findlogin2 - print all lines containing the string "login"

print if /login/;

You can combine the -n and -e options to run Perl code from the command line:

% perl -ne 'print if /login/'

The -p option is like -n but adds a print right before the end of the loop. It's normally used for programs that translate their input, such as the program shown in Example 7-4.

Example 7-4. lowercase1

#!/usr/bin/perl
# lowercase - turn all lines into lowercase

while (<>) {                    # loop over lines on command line
    s/(\p{Letter})/\l$1/g;      # change all letters to lowercase
    print;
}

The program in Example 7-4 could be written as shown in Example 7-5.

Example 7-5.
lowercase2

#!/usr/bin/perl -p
# lowercase - turn all lines into lowercase

s/(\p{Letter})/\l$1/g;          # change all letters to lowercase

Or it could be written from the command line as:

% perl -pe 's/(\p{Letter})/\l$1/g'

While using -n or -p for implicit input looping, the special label LINE: is silently created for the whole input loop. That means that from an inner loop, you can skip to the following input record by using next LINE (which is like awk's next statement), or go on to the next file by closing ARGV (which is like awk's nextfile statement). This is shown in Example 7-6.

Example 7-6. countchunks

#!/usr/bin/perl -n
# countchunks - count how many words are used.
# skip comments, and bail on file if __END__
# or __DATA__ seen.
for (split /\W+/) {
    next LINE if /^#/;
    close ARGV if /__(DATA|END)__/;
    $chunks++;
}
END { print "Found $chunks chunks\n" }

The tcsh shell keeps a .history file in a format such that every other line contains a commented-out timestamp in Epoch seconds:

#+0894382237
less /etc/motd
#+0894382239
vi ~/.exrc
#+0894382242
date
#+0894382242
who
#+0894382288
telnet home

A simple one-liner can render that legible:

% perl -pe 's/^#\+(\d+)\n/localtime($1) . " "/e'
Tue May  5 09:30:37 1998 less /etc/motd
Tue May  5 09:30:39 1998 vi ~/.exrc
Tue May  5 09:30:42 1998 date
Tue May  5 09:30:42 1998 who
Tue May  5 09:31:28 1998 telnet home

The -i option changes each file on the command line. It is described in Recipe 7.16 and is normally used in conjunction with -p.

7.14.4 See Also

perlrun(1), and the "Switches" section of Chapter 19 of Programming Perl; Recipe 7.16; Recipe 16.6

Recipe 7.15 Modifying a File in Place with a Temporary File

7.15.1 Problem

You need to update a file in place, and you can use a temporary file.
7.15.2 Solution

Read from the original file, write changes to a temporary file, and then rename the temporary back to the original:

open(OLD, "<", $old)        or die "can't open $old: $!";
open(NEW, ">", $new)        or die "can't open $new: $!";
while (<OLD>) {
    # change $_, then...
    print NEW $_            or die "can't write $new: $!";
}
close(OLD)                  or die "can't close $old: $!";
close(NEW)                  or die "can't close $new: $!";
rename($old, "$old.orig")   or die "can't rename $old to $old.orig: $!";
rename($new, $old)          or die "can't rename $new to $old: $!";

This is the best way to update a file "in place."

7.15.3 Discussion

This technique uses little memory compared to the approach that doesn't use a temporary file. It has the added advantages of giving you a backup file and being easier and safer to program. You can make the same changes to the file using this technique that you can with the version that uses no temporary file. For instance, to insert lines at line 20, say:

while (<OLD>) {
    if ($. == 20) {
        print NEW "Extra line 1\n";
        print NEW "Extra line 2\n";
    }
    print NEW $_;
}

To delete lines 20 through 30, say:

while (<OLD>) {
    next if 20 .. 30;
    print NEW $_;
}

Note that rename won't work across filesystems, so you should create your temporary file in the same directory as the file being modified. The truly paranoid programmer would lock the file during the update. The tricky part is that you have to open the file for writing without destroying its contents before you can get a lock to modify it. Recipe 7.18 shows how to do this.

7.15.4 See Also

Recipe 7.1; Recipe 7.16; Recipe 7.17; Recipe 7.18

Recipe 7.16 Modifying a File in Place with the -i Switch

7.16.1 Problem

You need to modify a file in place from the command line, and you're too lazy[5] for the file manipulation of Recipe 7.15.

[5] Lazy-as-virtue, not lazy-as-sin.

7.16.2 Solution

Use the -i and -p switches to Perl. Write your program on the command line:

% perl -i.orig -p -e 'FILTER COMMAND' file1 file2 file3 ...
or use the switches in programs:

#!/usr/bin/perl -i.orig -p
# filter commands go here

7.16.3 Discussion

The -i command-line switch modifies each file in place. It creates a temporary file as in the previous recipe, but Perl takes care of the tedious file manipulation for you. Use it with -p (explained in Recipe 7.14) to turn:

while (<>) {
    if ($ARGV ne $oldargv) {        # are we at the next file?
        rename($ARGV, $ARGV . ".orig");
        open(ARGVOUT, ">", $ARGV);  # plus error check
        select(ARGVOUT);
        $oldargv = $ARGV;
    }
    s/DATE/localtime/e;
} continue {
    print;
}
select (STDOUT);                    # restore default output

into:

% perl -pi.orig -e 's/DATE/localtime/e'

The -i switch takes care of making a backup (say -i instead of -i.orig to discard the original file contents instead of backing them up), and -p makes Perl loop over filenames given on the command line (or STDIN if no files were given). The preceding one-liner would turn a file containing the following:

Dear Sir/Madam/Ravenous Beast,
As of DATE, our records show your account is overdue.
Please settle by the end of the month.
Yours in cheerful usury,
--A. Moneylender

into:

Dear Sir/Madam/Ravenous Beast,
As of Sat Apr 25 12:28:33 1998, our records show your account is overdue.
Please settle by the end of the month.
Yours in cheerful usury,
--A. Moneylender

This switch makes in-place translators a lot easier to write and to read. For instance, this changes isolated instances of "hisvar" to "hervar" in all C, C++, and yacc files:

% perl -i.old -pe 's{\bhisvar\b}{hervar}g' *.[Cchy]

Turn on and off the -i behavior with the special variable $^I. Set @ARGV, and then use <> as you would with -i on the command line:

# set up to iterate over the *.c files in the current directory,
# editing in place and saving the old file with a .orig extension
local $^I = ".orig";          # emulate -i.orig
local @ARGV = glob("*.c");    # initialize list of files
while (<>) {
    if ($. == 1) {
        print "This line should appear at the top of each file\n";
    }
    s/\b(p)earl\b/${1}erl/ig; # Correct typos, preserving case
    print;
} continue {close ARGV if eof}

Beware that creating a backup file under a particular name when that name already exists clobbers the version previously backed up.

7.16.4 See Also

perlrun(1), and the "Switches" section of Chapter 19 of Programming Perl; the $^I and $. variables in perlvar(1), and in Chapter 28 of Programming Perl; the .. operator in the "Range Operator" sections of perlop(1) and Chapter 3 of Programming Perl

Recipe 7.17 Modifying a File in Place Without a Temporary File

7.17.1 Problem

You need to insert, delete, or change one or more lines in a file, and you don't want to (or can't) use a temporary file.

7.17.2 Solution

Open the file in update mode ("+<"), read the whole file into an array of lines, change the array, then rewrite the file and truncate it to its current seek pointer.

open(FH, "+<", $FILE)           or die "Opening: $!";
@ARRAY = <FH>;
# change ARRAY here
seek(FH,0,0)                    or die "Seeking: $!";
print FH @ARRAY                 or die "Printing: $!";
truncate(FH,tell(FH))           or die "Truncating: $!";
close(FH)                       or die "Closing: $!";

7.17.3 Discussion

As explained in this chapter's Introduction, the operating system treats files as unstructured streams of bytes. This makes it impossible to insert, modify, or change bits of the file in place. (Except for the special case of fixed-record-length files, discussed in Recipe 8.13.) You can use a temporary file to hold the changed output, or you can read the entire file into memory, change it, and write it back out again. Reading everything into memory is fine for small files, but doesn't scale well. Trying it on your 800 MB web server log files will either deplete your virtual memory or thrash your machine's VM system.
For small files, though, this works:

open(F, "+<", $infile)      or die "can't read $infile: $!";
$out = "";
while (<F>) {
    s/DATE/localtime/eg;
    $out .= $_;
}
seek(F, 0, 0)               or die "can't seek to start of $infile: $!";
print F $out                or die "can't print to $infile: $!";
truncate(F, tell(F))        or die "can't truncate $infile: $!";
close(F)                    or die "can't close $infile: $!";

For other examples of things you can do in-place, look at the recipes in Chapter 8. This approach is only for the truly determined. It's harder to write, takes more memory (potentially a lot more), doesn't keep a backup file, and may confuse other processes trying to read from the file you're updating. For most purposes, therefore, we suggest it's probably not worth it. Remember to lock if you're paranoid, careful, or both.

7.17.4 See Also

The seek, truncate, open, and sysopen functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 7.15; Recipe 7.16; Recipe 7.18

Recipe 7.18 Locking a File

7.18.1 Problem

Many processes need to update the same file simultaneously.

7.18.2 Solution

Have all processes honor advisory locking by using flock:

use Fcntl qw(:flock);       # for the LOCK_* constants
open(FH, "+<", $path)       or die "can't open $path: $!";
flock(FH, LOCK_EX)          or die "can't flock $path: $!";
# update file, then...
close(FH)                   or die "can't close $path: $!";

7.18.3 Discussion

Operating systems vary greatly in the type and reliability of locking techniques available. Perl tries hard to give you something that works, even if your operating system uses its own underlying technique. The flock function takes two arguments: a filehandle and a number representing what to do with the lock on that filehandle. The numbers are normally represented by names, such as LOCK_EX, which you can get from the Fcntl or IO::File modules. Locks come in two varieties: shared (LOCK_SH) and exclusive (LOCK_EX). Despite what you might infer by "exclusive," processes aren't required to obey locks on files.
Another way of saying this is that flock implements advisory locking. It allows processes to let the operating system suspend would-be writers of a file until any readers are finished with it. Flocking files is like putting up a stoplight at an intersection. It works only if people pay attention to whether the light is red or green—or yellow for a shared lock. The red light doesn't stop traffic; it merely signals that traffic should stop. A desperate, ignorant, or rude person will still go flying through the intersection no matter what the light says. Likewise, flock only blocks out other flockers—not all processes trying to do I/O. Unless everyone is polite, accidents can (and will) happen. The polite process customarily indicates its intent to read from the file by requesting a LOCK_SH. Many processes can have simultaneous shared locks on the file because they (presumably) won't be changing the data. If a process intends to write to the file, it should request an exclusive lock via LOCK_EX. The operating system then suspends the requesting process until all other processes have released their locks, at which point it grants the lock to the suspended process and unblocks it. You are guaranteed that no other process will be able to successfully run flock(FH, LOCK_EX) on the same file while you hold the lock. (This is almost—but not quite—like saying there can be only one exclusive lock on the file. Forked children inherit not only their parents' open files, but, on some systems, also any locks held. That means if you hold an exclusive lock and fork without execing, your child might also have that same exclusive lock on the file!) The flock function is therefore by default a blocking operation. You can also acquire a lock without wedging your process by using the LOCK_NB flag when you request a lock. 
This lets you warn the user that there may be a wait until other processes with locks are done:

unless (flock(FH, LOCK_EX|LOCK_NB)) {
    warn "can't immediately write-lock the file ($!), blocking ...";
    unless (flock(FH, LOCK_EX)) {
        die "can't get write-lock on numfile: $!";
    }
}

If you use LOCK_NB and are refused a LOCK_SH, then you know that someone else has a LOCK_EX and is updating the file. If you are refused a LOCK_EX, then someone holds either a LOCK_SH or a LOCK_EX, so you shouldn't try to update the file. Locks dissolve when the file is closed, which may not be until your process exits. If you lock or unlock the file, Perl automatically flushes its buffers for you. Here's how you increment a number in a file, using flock:

use Fcntl qw(:DEFAULT :flock);
sysopen(FH, "numfile", O_RDWR|O_CREAT)
    or die "can't open numfile: $!";
flock(FH, LOCK_EX)
    or die "can't write-lock numfile: $!";
# Now we have acquired the lock, it's safe for I/O
$num = <FH> || 0;           # DO NOT USE "or" THERE!!
seek(FH, 0, 0)
    or die "can't rewind numfile : $!";
truncate(FH, 0)
    or die "can't truncate numfile: $!";
print FH $num+1, "\n"
    or die "can't write numfile: $!";
close(FH)
    or die "can't close numfile: $!";

Closing the filehandle flushes the buffers and unlocks the file. The truncate function is discussed in Chapter 8. File locking is not as easy as you might think—or wish. Because locks are advisory, if one process uses locking and another doesn't, all bets are off. Never use the existence of a file as a locking indication because there exists a race condition between the test for the existence of the file and its creation. Furthermore, because file locking is an intrinsically stateful concept, it doesn't get along well with the stateless model embraced by network filesystems such as NFS. Although some vendors claim that fcntl addresses such matters, practical experience suggests otherwise.
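The existence-test race just mentioned can be sidestepped by letting the kernel perform the test and the creation in a single atomic step, using sysopen with O_CREAT|O_EXCL. Here is a minimal sketch; the demo.lock filename is made up for illustration:

```shell
rm -f demo.lock                  # start clean for the demo
perl -e '
    use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
    # A -e test followed by open is racy: another process can create
    # the file in between.  O_CREAT|O_EXCL makes the kernel do the
    # existence check and the creation atomically.
    if (sysopen(my $fh, "demo.lock", O_WRONLY|O_CREAT|O_EXCL)) {
        print "got the lock\n";
        close($fh);
    } else {
        print "lock held elsewhere: $!\n";
    }
'
rm -f demo.lock
```

Run it a second time without removing demo.lock and sysopen fails with EEXIST, reporting that the lock is held elsewhere.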
The CPAN module File::NFSLock uses a clever scheme to obtain and release locks on files over NFS, which is different from the flock system. Don't confuse Perl's flock with the SysV function lockf. Unlike lockf, flock locks entire files at once. Perl doesn't support lockf directly, although the CPAN module File::Lock does offer its functionality if your operating system has lockf. The only way in pure Perl to lock a portion of a file is to use the fcntl function, as demonstrated in the lockarea program at the end of this chapter.

7.18.4 See Also

The flock and fcntl functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard Fcntl and DB_File modules (also in Chapter 32 of Programming Perl); the documentation for the CPAN modules File::Lock and File::NFSLock; Recipe 7.24; Recipe 7.25

Recipe 7.19 Flushing Output

7.19.1 Problem

When printing to a filehandle, output doesn't appear immediately. This is a problem in CGI scripts running on some programmer-hostile web servers where, if the web server sees warnings from Perl before it sees the (buffered) output of your script, it sends the browser an uninformative 500 Server Error. These buffering problems also arise with concurrent access to files by multiple programs and when talking with devices or sockets.

7.19.2 Solution

Disable buffering by setting the per-filehandle variable $| to a true value, customarily 1:

$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);

Or, if you don't mind the expense of loading an IO module, disable buffering by invoking the autoflush method:

use IO::Handle;
OUTPUT_HANDLE->autoflush(1);

This works with indirect filehandles as well:

use IO::Handle;
$fh->autoflush(1);

7.19.3 Discussion

In most stdio implementations, buffering varies with the type of output device. Disk files are block buffered, often with a buffer size of more than 2K. Pipes and sockets are often buffered with a buffer size between ½K and 2K.
Serial devices, including terminals, modems, mice, and joysticks, are normally line-buffered; stdio sends the entire line out only when it gets the newline. Perl's print function does not directly support truly unbuffered output, i.e., a physical write for each individual character. Instead, it supports command buffering, in which one physical write is made after every separate output command. This isn't as hard on your system as no buffering at all, and it still gets the output where you want it, when you want it. Control output buffering through the $| special variable. Enable command buffering on output handles by setting it to a true value. This does not affect input handles at all; see Recipe 15.6 and Recipe 15.8 for unbuffered input. Set this variable to a false value to use default stdio buffering. Example 7-7 illustrates the difference. Example 7-7. seeme #!/usr/bin/perl -w # seeme - demo stdio output buffering $| = (@ARGV > 0); # command buffered if arguments given print "Now you don't see it..."; sleep 2; print "now you do\n"; If you call this program with no arguments, STDOUT is not command buffered. Your terminal (console, window, telnet session, whatever) doesn't receive output until the entire line is completed, so you see nothing for two seconds and then get the full line "Now you don't see it ... now you do". If you call the program with at least one argument, STDOUT is command buffered. That means you first see "Now you don't see it...", and then after two seconds you finally see "now you do". The dubious quest for increasingly compact code has led programmers to use the return value of select, the filehandle that was currently selected, as part of the second select: select((select(OUTPUT_HANDLE), $| = 1)[0]); There's another way. The IO::Handle module and any modules that inherit from that class provide three methods for flushing: flush, autoflush, and printflush. 
All are invoked on filehandles, either as literals or as variables containing a filehandle or reasonable facsimile. The flush method causes all unwritten output in the buffer to be written out, returning true on success and false on failure. The printflush method is a print followed by a one-time flush. The autoflush method is syntactic sugar for the convoluted antics just shown. It sets the command-buffering property on that filehandle (or clears it if passed an explicit false value), and returns the previous value for that property on that handle. For example:

use FileHandle;
STDERR->autoflush;          # already unbuffered in stdio
$filehandle->autoflush(0);

If you're willing to accept the oddities of indirect object notation covered in Chapter 13, you can even write something reasonably close to English:

use IO::Handle;
# assume REMOTE_CONN is an interactive socket handle,
# but DISK_FILE is a handle to a regular file.
autoflush REMOTE_CONN 1;    # unbuffer for clarity
autoflush DISK_FILE  0;     # buffer this for speed

This avoids the bizarre select business and makes your code much more readable. Unfortunately, your program takes longer to compile because now you're including the IO::Handle module, so dozens of files need to be opened and thousands and thousands of lines must first be read and compiled. For short and simple applications, you might as well learn to manipulate $| directly, and you'll be happy. But for larger applications that already use a class derived from the IO::Handle class, you've already paid the price for the ticket, so you might as well see the show. To ensure that your output gets where you want it, when you want it, buffer flushing is important. It's particularly important with sockets, pipes, and devices, because you may be trying to do interactive I/O with these—more so, even, because you can't assume line buffering. Consider the program in Example 7-8.

Example 7-8.
getpcomidx

#!/usr/bin/perl -w
# getpcomidx - fetch www.perl.com's index.html document
use IO::Socket;
$sock = new IO::Socket::INET (PeerAddr => "www.perl.com",
                              PeerPort => "http(80)");
die "Couldn't create socket: $@" unless $sock;
# the library doesn't support $! setting; it uses $@
$sock->autoflush(1);
# Mac *must* have \015\012\015\012 instead of \n\n here.
# It's a good idea for others, too, as that's the spec,
# but implementations are encouraged to accept "\cJ\cJ" too,
# and as far as we've seen, they do.
$sock->print("GET /index.html http/1.1\n\n");
$document = join("", $sock->getlines( ));
print "DOC IS: $document\n";

If you're running at least v5.8 Perl, you can use the new I/O layers mechanism to force unbuffered output. This is available through the :unix layer. If the handle is already open, you can do this:

binmode(STDOUT, ":unix")
    || die "can't binmode STDOUT to :unix: $!";

or you can specify the I/O layer when initially calling open:

open(TTY, ">:unix", "/dev/tty")
    || die "can't open /dev/tty: $!";
print TTY "54321";
sleep 2;
print TTY "\n";

There's no way to control input buffering using the sorts of flushing discussed so far. For that, you need to see Recipe 15.6 and Recipe 15.8.

7.19.4 See Also

The $| entry in perlvar(1), and Chapter 28 of Programming Perl; the documentation for the standard FileHandle and IO::Handle modules (also in Chapter 32 of Programming Perl); the select function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 15.6 and Recipe 15.8

Recipe 7.20 Doing Non-Blocking I/O

7.20.1 Problem

You want to read from or write to a filehandle without the system blocking your process until the program, file, socket, or device at the other end is ready. This is desired less often of regular files than of special files.
7.20.2 Solution

Open the file with sysopen, specifying the O_NONBLOCK option:

use Fcntl;
sysopen(MODEM, "/dev/cua0", O_NONBLOCK|O_RDWR)
    or die "Can't open modem: $!\n";

If you already have an open filehandle, invoke the blocking method from IO::Handle with an argument of 0:

use IO::Handle;
MODEM->blocking(0);         # assume MODEM already opened

Or use the low-level fcntl function:

use Fcntl;
$flags = "";
fcntl(HANDLE, F_GETFL, $flags)
    or die "Couldn't get flags for HANDLE : $!\n";
$flags |= O_NONBLOCK;
fcntl(HANDLE, F_SETFL, $flags)
    or die "Couldn't set flags for HANDLE: $!\n";

7.20.3 Discussion

On a disk file, when no more data can be read because you're at the end of the file, the input operation returns immediately. But suppose the filehandle in question were the user's keyboard or a network connection. In those cases, simply because there's no data there right now doesn't mean there never will be, so the input function normally doesn't return until it gets data. Sometimes, though, you don't want to wait; you want to grab whatever's there and carry on with whatever you were doing. Once a filehandle has been set for non-blocking I/O, the sysread or syswrite calls that would otherwise block will instead return undef and set $! to EAGAIN:

use Errno;

$rv = syswrite(HANDLE, $buffer, length $buffer);
if (!defined($rv) && $!{EAGAIN}) {
    # would block
} elsif ($rv != length $buffer) {
    # incomplete write
} else {
    # successfully wrote
}

$rv = sysread(HANDLE, $buffer, $BUFSIZ);
if (!defined($rv) && $!{EAGAIN}) {
    # would block
} else {
    # successfully read $rv bytes from HANDLE
}

The O_NONBLOCK constant is part of the POSIX standard, so most machines should support it. We use the Errno module to test for the error EAGAIN. Testing $!{EAGAIN} is the same as testing whether $! == EAGAIN.
7.20.4 See Also

The sysopen and fcntl functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard Errno and IO::Handle modules (also in Chapter 32 of Programming Perl); your system's open(2) and fcntl(2) manpages; Recipe 7.22; Recipe 7.21

Recipe 7.21 Determining the Number of Unread Bytes

7.21.1 Problem

You want to know how many unread bytes are available for reading from a filehandle.

7.21.2 Solution

Use the FIONREAD ioctl call:

$size = pack("L", 0);
ioctl(FH, $FIONREAD, $size)     or die "Couldn't call ioctl: $!\n";
$size = unpack("L", $size);
# $size bytes can be read

Make sure the input filehandle is unbuffered (because you've used an I/O layer like :unix on it), or use only sysread.

7.21.3 Discussion

The Perl ioctl function is a direct interface to the operating system's ioctl(2) system call. If your system doesn't have the FIONREAD request or the ioctl(2) call, you can't use this recipe. FIONREAD and the other ioctl(2) requests are numeric values normally found lurking in C include files. Perl's h2ph tool tries to convert C include files to Perl code, which can be required. FIONREAD ends up defined as a function in the sys/ioctl.ph file:

require "sys/ioctl.ph";

$size = pack("L", 0);
ioctl(FH, FIONREAD( ), $size)   or die "Couldn't call ioctl: $!\n";
$size = unpack("L", $size);

If h2ph wasn't installed or doesn't work for you, you can manually grep the include files:

% grep FIONREAD /usr/include/*/*
/usr/include/asm/ioctls.h:#define FIONREAD 0x541B

If you install Inline::C from CPAN, you can write a C subroutine to obtain the constant for you:

use Inline C;
$FIONREAD = get_FIONREAD( );
# ...
__END__
__C__
#include <sys/ioctl.h>
int get_FIONREAD( ) {
    return FIONREAD;
}

If all else fails, write a small C program using the editor of champions:

% cat > fionread.c
#include <sys/ioctl.h>
main( ) {
    printf("%#08x\n", FIONREAD);
}
^D
% cc -o fionread fionread.c
% ./fionread
0x4004667f

Then hardcode it, leaving porting as an exercise to your successor.

$FIONREAD = 0x4004667f;         # XXX: opsys dependent

$size = pack("L", 0);
ioctl(FH, $FIONREAD, $size)     or die "Couldn't call ioctl: $!\n";
$size = unpack("L", $size);

FIONREAD requires a filehandle connected to a stream, which means sockets, pipes, and tty devices all work, but regular files don't. If this is too much system programming for you, try to think outside the problem. Read from the filehandle in non-blocking mode (see Recipe 7.20). Then, if you manage to read something, that's how much was there waiting to be read. If you couldn't read anything, you know there was nothing to be read. This might get you in trouble with other users (or other processes) who are using the same system, though—because it uses busy-wait I/O, it's a drain on system resources.

7.21.4 See Also

Recipe 7.20; your system's ioctl(2) manpage; the ioctl function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the Inline::C module from CPAN

Recipe 7.22 Reading from Many Filehandles Without Blocking

7.22.1 Problem

You want to learn whether input is available to be read, rather than blocking until there's input the way <FH> does. This is useful when reading from pipes, sockets, devices, and other programs.
7.22.2 Solution

Use select with a timeout value of 0 seconds if you're comfortable with manipulating bit-vectors representing file descriptor sets:

$rin = "";
# repeat next line for all filehandles to poll
vec($rin, fileno(FH1), 1) = 1;
vec($rin, fileno(FH2), 1) = 1;
vec($rin, fileno(FH3), 1) = 1;
$nfound = select($rout=$rin, undef, undef, 0);
if ($nfound) {
    # input waiting on one or more of those 3 filehandles
    if (vec($rout,fileno(FH1),1)) {
        # do something with FH1
    }
    if (vec($rout,fileno(FH2),1)) {
        # do something with FH2
    }
    if (vec($rout,fileno(FH3),1)) {
        # do something with FH3
    }
}

The IO::Select module provides an abstraction layer to hide bit-vector operations:

use IO::Select;

$select = IO::Select->new( );
# repeat next line for all filehandles to poll
$select->add(*FILEHANDLE);
if (@ready = $select->can_read(0)) {
    # input waiting on the filehandles in @ready
}

7.22.3 Discussion

The select function is really two functions in one. If you call it with one argument, you change the current default output filehandle (see Recipe 7.19). If you call it with four arguments, it tells you which filehandles have input waiting or are ready to receive output. This recipe deals only with four-argument select. The first three arguments to select are strings containing bit-vectors. Each bit-vector represents a set of file descriptors to inspect for pending input, pending output, and pending expedited data (like out-of-band or urgent data on a socket), respectively. The final argument is the timeout—how long select should spend waiting for status to change. A timeout value of 0 indicates a poll. Timeout can also be a floating-point number of seconds, or undef to wait (block) until status changes:

$rin = "";
vec($rin, fileno(FILEHANDLE), 1) = 1;
$nfound = select($rin, undef, undef, 0);    # just check
if ($nfound) {
    # read ten bytes from FILEHANDLE
    sysread(FILEHANDLE, $data, 10);
    print "I read $data";
}

The IO::Select module hides the bit-vectors from you.
IO::Select->new( ) returns a new object on which you invoke the add method to add one or more filehandles to the set. Once you've added the filehandles you are interested in, invoke can_read, can_write, or has_exception. These methods return a list of filehandles that you can safely read from or write to, or that have unread exceptional data (TCP out-of-band data, for example). If you want to read an entire line of data, you can't use the readline function or the line input operator (unless you use an unbuffered I/O layer). Otherwise, you'll mix a buffered I/O function with a check that ignores those buffers in user space and cares only about what's buffered in kernel space. This is a big no-no. For details on this and directions for calling sysread on whatever is available on a socket or pipe and then returning immediately, see Recipe 7.23. If you're trying to do non-blocking reads on a terminal line (that is, on a keyboard or other serial line device), see Recipe 15.6 and Recipe 15.8.

7.22.4 See Also

The select function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard module IO::Select (also in Chapter 32 of Programming Perl); Recipe 7.20; Recipe 7.23

Recipe 7.23 Reading an Entire Line Without Blocking

7.23.1 Problem

You need to read a line of data from a handle that select says is ready for reading, but you can't use Perl's normal <FH> operation (readline) in conjunction with select because <FH> may buffer extra data and select doesn't know about those buffers.
7.23.2 Solution

Use the following sysreadline function, like this:

$line = sysreadline(SOME_HANDLE);

In case only a partial line has been sent, include a number of seconds to wait:

$line = sysreadline(SOME_HANDLE, TIMEOUT);

Here's the function to do that:

use IO::Handle;
use IO::Select;
use Symbol qw(qualify_to_ref);

sub sysreadline(*;$) {
    my($handle, $timeout) = @_;
    $handle = qualify_to_ref($handle, caller( ));
    my $infinitely_patient = (@_ == 1 || $timeout < 0);
    my $start_time = time( );
    my $selector = IO::Select->new( );
    $selector->add($handle);
    my $line = "";
SLEEP:
    until (at_eol($line)) {
        unless ($infinitely_patient) {
            return $line if time( ) > ($start_time + $timeout);
        }
        # sleep only 1 second before checking again
        next SLEEP unless $selector->can_read(1.0);
INPUT_READY:
        while ($selector->can_read(0.0)) {
            my $was_blocking = $handle->blocking(0);
CHAR:       while (sysread($handle, my $nextbyte, 1)) {
                $line .= $nextbyte;
                last CHAR if $nextbyte eq "\n";
            }
            $handle->blocking($was_blocking);
            # if incomplete line, keep trying
            next SLEEP unless at_eol($line);
            last INPUT_READY;
        }
    }
    return $line;
}
sub at_eol($) { $_[0] =~ /\n\z/ }

7.23.3 Discussion

As described in Recipe 7.22, to determine whether the operating system has data on a particular handle for your process to read, you can use either Perl's built-in select function or the can_read method from the standard IO::Select module. Although you can reasonably use functions like sysread and recv to get data, you can't use the buffered functions like readline (that is, <FH>), read, or getc. Also, even the unbuffered input functions might still block. If someone connects and sends a character but never sends a newline, your program will block in a <FH>, which expects its input to end in a newline—or in whatever you've assigned to the $/ variable. We circumvent this by setting the handle to non-blocking mode and then reading in characters until we find "\n". This removes the need for the blocking call.
The sysreadline function in the Solution takes an optional second argument so you don't have to wait forever in case you get a partial line and nothing more. A far more serious issue is that select reports only whether the operating system's low-level file descriptor is available for I/O. It's not reliable in the general case to mix calls to four-argument select with calls to any of the buffered I/O functions listed in this chapter's Introduction (read, <FH>, seek, tell, etc.). Instead, you must use sysread—and sysseek if you want to reposition the filehandle within the file. The reason is that select's response does not reflect any user-level buffering in your own process's address space once the kernel has transferred the data. But the <FH>—really Perl's readline( ) function—still uses your underlying buffered I/O system. If two lines were waiting, select would report true only once. You'd read the first line and leave the second one in the buffer. But the next call to select would block because, as far as the kernel is concerned, it's already given you all of the data it had. That second line, now hidden from your kernel, sits unread in an input buffer that's solely in user space.

7.23.4 See Also

The sysread function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard modules Symbol, IO::Handle, and IO::Select (also in Chapter 32 of Programming Perl); Recipe 7.22

Recipe 7.24 Program: netlock

When locking files, we recommend that you use flock when possible. However, on some systems, flock's locking strategy is not reliable. For example, perhaps the person who built Perl on your system configured flock to use a version of file locking that didn't even try to work over the Net, or you're on the increasingly rare system where no flock emulation exists at all. The following program and module provide a basic implementation of a file locking mechanism.
Unlike a normal flock, with this module you lock file names, not file descriptors. Thus, you can use it to lock directories, domain sockets, and other non-regular files. You can even lock files that don't exist yet. It uses a directory created at the same level in the directory structure as the locked file, so you must be able to write to the enclosing directory of the file you wish to lock. A sentinel file within the lock directory contains the owner of the lock. This is also useful with Recipe 7.15 because you can lock the filename even though the file that has that name changes. The nflock function takes one or two arguments. The first is the pathname to lock; the second is the optional amount of time to wait for the lock. The function returns true if the lock is granted, returns false if the timeout expired, and raises an exception should various improbable events occur, such as being unable to write the directory. Set the $File::LockDir::Debug variable to true to make the module emit messages if it stalls waiting for a lock. If you forget to free a lock and try to exit the program, the module will remove them for you. This won't happen if your program is sent a signal it doesn't trap. Example 7-9 shows a driver program to demonstrate the File::LockDir module. Example 7-9. drivelock #!/usr/bin/perl -w # drivelock - demo File::LockDir module use strict; use File::LockDir; $SIG{INT} = sub { die "outta here\n" }; $File::LockDir::Debug = 1; my $path = shift or die "usage: $0 \n"; unless (nflock($path, 2)) { die "couldn't lock $path in 2 seconds\n"; } sleep 100; nunflock($path); The module itself is shown in Example 7-10. For more on building your own modules, see Chapter 12. Example 7-10. File::LockDir package File::LockDir; # module to provide very basic filename-level # locks. No fancy systems calls. In theory, # directory info is sync'd over NFS. Not # stress tested. 
    use strict;
    use Exporter;
    our (@ISA, @EXPORT);
    @ISA    = qw(Exporter);
    @EXPORT = qw(nflock nunflock);

    our ($Debug, $Check);
    $Debug ||= 0;   # may be predefined
    $Check ||= 5;   # may be predefined

    use Cwd;
    use Fcntl;
    use Sys::Hostname;
    use File::Basename;
    use File::stat;
    use Carp;

    my %Locked_Files = ();

    # usage: nflock(FILE; NAPTILL)
    sub nflock($;$) {
        my $pathname = shift;
        my $naptime  = shift || 0;
        my $lockname = name2lock($pathname);
        my $whosegot = "$lockname/owner";
        my $start    = time();
        my $missed   = 0;
        my $owner;

        # if locking what I've already locked, return
        if ($Locked_Files{$pathname}) {
            carp "$pathname already locked";
            return 1;
        }

        if (!-w dirname($pathname)) {
            croak "can't write to directory of $pathname";
        }

        while (1) {
            last if mkdir($lockname, 0777);
            confess "can't get $lockname: $!"
                if $missed++ > 10 && !-d $lockname;
            if ($Debug) {{
                open($owner, "< $whosegot") || last;   # exit "if"!
                my $lockee = <$owner>;
                chomp($lockee);
                printf STDERR "%s $0\[$$]: lock on %s held by %s\n",
                    scalar(localtime), $pathname, $lockee;
                close $owner;
            }}
            sleep $Check;
            return if $naptime && time > $start + $naptime;
        }

        sysopen($owner, $whosegot, O_WRONLY|O_CREAT|O_EXCL)
            or croak "can't create $whosegot: $!";
        printf $owner "$0\[$$] on %s since %s\n",
            hostname(), scalar(localtime);
        close($owner) or croak "close $whosegot: $!";
        $Locked_Files{$pathname}++;
        return 1;
    }

    # free the locked file
    sub nunflock($) {
        my $pathname = shift;
        my $lockname = name2lock($pathname);
        my $whosegot = "$lockname/owner";
        unlink($whosegot);
        carp "releasing lock on $lockname" if $Debug;
        delete $Locked_Files{$pathname};
        return rmdir($lockname);
    }

    # helper function
    sub name2lock($) {
        my $pathname = shift;
        my $dir  = dirname($pathname);
        my $file = basename($pathname);
        $dir = getcwd() if $dir eq ".";
        my $lockname = "$dir/$file.LOCKDIR";
        return $lockname;
    }

    # anything forgotten?
    END {
        for my $pathname (keys %Locked_Files) {
            my $lockname = name2lock($pathname);
            my $whosegot = "$lockname/owner";
            carp "releasing forgotten $lockname";
            unlink($whosegot);
            rmdir($lockname);
        }
    }

    1;

Recipe 7.25 Program: lockarea

Perl's flock function only locks complete files, not regions of the file. Although fcntl supports locking of a file's regions, this is difficult to access from Perl, largely because no one has written an XS module that portably packs up the necessary structure.

The program in Example 7-11 implements fcntl record locking, but only for the three architectures it already knows about: SunOS, BSD, and Linux. If you're running something else, you'll have to figure out the layout of the flock structure. We did this by eyeballing the C-language sys/fcntl.h #include file and running the c2ph program to figure out alignment and typing. The c2ph program, while included with Perl, only works on systems with a strong Berkeley heritage, like those listed above. As with Unix—or Perl itself—you don't have to use c2ph, but it sure makes life easier if you can.

The struct_flock function in the lockarea program packs and unpacks in the proper format for the current architecture by consulting the $^O variable, which contains your current operating system name. There is no struct_flock function declaration. It's just aliased to the architecture-specific version. Function aliasing is discussed in Recipe 10.14.

The lockarea program opens a temporary file, clobbering any existing contents and writing a screenful (80 by 23) of blanks. Each line is the same length. The program then forks one or more times and lets the child processes try to update the file at the same time. The first argument, N, is the number of times to fork to produce 2 ** N processes. So lockarea 1 makes two processes, lockarea 2 makes four, lockarea 3 makes eight, lockarea 4 makes sixteen, etc. The more kids, the more contention for the locks.
Each process picks a random line in the file, locks that line only, and then updates it. It writes its process ID into the line, prepended with a count of how many times the line has been updated:

    4: 18584 was just here

If the line was already locked, then when the lock is finally granted, that line is updated with a message telling which process was in the way of this process:

    29: 24652 ZAPPED 24656

A fun demo is to run the lockarea program in the background and the rep program from Chapter 15, watching the file change. Think of it as a video game for systems programmers.

    % lockarea 5 &
    % rep -1 'cat /tmp/lkscreen'

When you interrupt the original parent, usually with Ctrl-C or by sending it a SIGINT from the command line, it kills all of its children by sending its entire process group a signal.

Example 7-11. lockarea

    #!/usr/bin/perl -w
    # lockarea - demo record locking with fcntl
    use strict;

    my $FORKS = shift || 1;
    my $SLEEP = shift || 1;

    use Fcntl;
    use POSIX qw(:unistd_h);
    use Errno;

    my $COLS = 80;
    my $ROWS = 23;

    # when's the last time you saw *this* mode used correctly?
    open(FH, "+> /tmp/lkscreen") or die $!;

    select(FH);
    $| = 1;
    select STDOUT;

    # clear screen
    for (1 .. $ROWS) {
        print FH " " x $COLS, "\n";
    }

    my $progenitor = $$;
    fork() while $FORKS-- > 0;

    print "hello from $$\n";

    if ($progenitor == $$) {
        $SIG{INT} = \&infanticide;
    } else {
        $SIG{INT} = sub { die "goodbye from $$" };
    }

    while (1) {
        my $line_num = int rand($ROWS);
        my $line;
        my $n;

        # move to line
        seek(FH, $n = $line_num * ($COLS+1), SEEK_SET) or next;

        # get lock
        my $place = tell(FH);
        my $him;
        next unless defined($him = lockplace(*FH, $place, $COLS));

        # read line
        read(FH, $line, $COLS) == $COLS or next;
        my $count = ($line =~ /(\d+)/) ? $1 : 0;
        $count++;

        # update line
        seek(FH, $place, 0) or die $!;
        my $update = sprintf($him ?
"%6d: %d ZAPPED %d" : "%6d: %d was just here", $count, $$, $him); my $start = int(rand($COLS - length($update))); die "XXX" if $start + length($update) > $COLS; printf FH "%*.*s\n", -$COLS, $COLS, " " x $start . $update; # release lock and go to sleep unlockplace(*FH, $place, $COLS); sleep $SLEEP if $SLEEP; } die "NOT REACHED"; # just in case # lock($handle, $offset, $timeout) - get an fcntl lock sub lockplace { my ($fh, $start, $till) = @_; ##print "$$: Locking $start, $till\n"; my $lock = struct_flock(F_WRLCK, SEEK_SET, $start, $till, 0); my $blocker = 0; unless (fcntl($fh, F_SETLK, $lock)) { die "F_SETLK $$ @_: $!" unless $!{EAGAIN} || $!{EDEADLK}; fcntl($fh, F_GETLK, $lock) or die "F_GETLK $$ @_: $!"; $blocker = (struct_flock($lock))[-1]; ##print "lock $$ @_: waiting for $blocker\n"; $lock = struct_flock(F_WRLCK, SEEK_SET, $start, $till, 0); unless (fcntl($fh, F_SETLKW, $lock)) { warn "F_SETLKW $$ @_: $!\n"; return; # undef } } return $blocker; } # unlock($handle, $offset, $timeout) - release an fcntl lock sub unlockplace { my ($fh, $start, $till) = @_; ##print "$$: Unlocking $start, $till\n"; my $lock = struct_flock(F_UNLCK, SEEK_SET, $start, $till, 0); fcntl($fh, F_SETLK, $lock) or die "F_UNLCK $$ @_: $!"; } # OS-dependent flock structures # Linux struct flock # short l_type; # short l_whence; # off_t l_start; # off_t l_len; # pid_t l_pid; BEGIN { # c2ph says: typedef='s2 l2 i', sizeof=16 my $FLOCK_STRUCT = "s s l l i"; sub linux_flock { if (wantarray) { my ($type, $whence, $start, $len, $pid) = unpack($FLOCK_STRUCT, $_[0]); return ($type, $whence, $start, $len, $pid); } else { my ($type, $whence, $start, $len, $pid) = @_; return pack($FLOCK_STRUCT, $type, $whence, $start, $len, $pid); } } } # SunOS struct flock: # short l_type; /* F_RDLCK, F_WRLCK, or F_UNLCK */ # short l_whence; /* flag to choose starting offset */ # long l_start; /* relative offset, in bytes */ # long l_len; /* length, in bytes; 0 means lock to EOF */ # short l_pid; /* returned with 
    #                          F_GETLK */
    #   short l_xxx;        /* reserved for future use */
    BEGIN {
        # c2ph says: typedef='s2 l2 s2', sizeof=16
        my $FLOCK_STRUCT = "s s l l s s";

        sub sunos_flock {
            if (wantarray) {
                my ($type, $whence, $start, $len, $pid, $xxx) =
                    unpack($FLOCK_STRUCT, $_[0]);
                return ($type, $whence, $start, $len, $pid);
            } else {
                my ($type, $whence, $start, $len, $pid) = @_;
                return pack($FLOCK_STRUCT,
                    $type, $whence, $start, $len, $pid, 0);
            }
        }
    }

    # (Free)BSD struct flock:
    #   off_t l_start;      /* starting offset */
    #   off_t l_len;        /* len = 0 means until end-of-file */
    #   pid_t l_pid;        /* lock owner */
    #   short l_type;       /* lock type: read/write, etc. */
    #   short l_whence;     /* type of l_start */
    BEGIN {
        # c2ph says: typedef="q2 i s2", size=24
        my $FLOCK_STRUCT = "ll ll i s s";   # XXX: q is ll

        sub bsd_flock {
            if (wantarray) {
                my ($xxstart, $start, $xxlen, $len, $pid, $type, $whence) =
                    unpack($FLOCK_STRUCT, $_[0]);
                return ($type, $whence, $start, $len, $pid);
            } else {
                my ($type, $whence, $start, $len, $pid) = @_;
                my ($xxstart, $xxlen) = (0, 0);
                return pack($FLOCK_STRUCT,
                    $xxstart, $start, $xxlen, $len, $pid, $type, $whence);
            }
        }
    }

    # alias the fcntl structure at compile time
    BEGIN {
        for ($^O) {
            *struct_flock = do {
                   /bsd/    && \&bsd_flock
                || /linux/  && \&linux_flock
                || /sunos/  && \&sunos_flock
                || die "unknown operating system $^O, bailing out";
            };
        }
    }

    # install signal handler for children
    BEGIN {
        my $called = 0;

        sub infanticide {
            exit if $called++;
            print "$$: Time to die, kiddies.\n" if $$ == $progenitor;
            my $job = getpgrp();
            $SIG{INT} = "IGNORE";
            kill -2, $job if $job;      # killpg(SIGINT, job)
            1 while wait > 0;
            print "$$: My turn\n" if $$ == $progenitor;
            exit;
        }
    }

    END { &infanticide }

Chapter 8. File Contents

The most brilliant decision in all of Unix was the choice of a single character for the newline sequence.

—Mike O'Dell, only half jokingly

Introduction

Before the Unix Revolution, every kind of data source and destination was inherently different.
Getting two programs merely to understand each other required heavy wizardry and the occasional sacrifice of a virgin stack of punch cards to an itinerant mainframe repairman. This computational Tower of Babel made programmers dream of quitting the field to take up a less painful hobby, like autoflagellation.

These days, such cruel and unusual programming is largely behind us. Modern operating systems work hard to provide the illusion that I/O devices, network connections, process control information, other programs, the system console, and even users' terminals are all abstract streams of bytes called files. This lets you easily write programs that don't care where their input came from or where their output goes.

Because programs read and write streams of simple text, every program can communicate with every other program. It is difficult to overstate the power and elegance of this approach. No longer dependent upon troglodyte gnomes with secret tomes of JCL (or COM) incantations, users can now create custom tools from smaller ones by using simple command-line I/O redirection, pipelines, and backticks.

Basic Operations

Treating files as unstructured byte streams necessarily governs what you can do with them. You can read and write sequential, fixed-size blocks of data at any location in the file, increasing its size if you write past the current end. Perl uses an I/O library that emulates C's stdio(3) to implement reading and writing of variable-length records like lines, paragraphs, and words.

What can't you do to an unstructured file? Because you can't insert or delete bytes anywhere but at end-of-file, you can't easily change the length of, insert, or delete records. An exception is the last record, which you can delete by truncating the file to the end of the previous record. For other modifications, you need to use a temporary file or work with a copy of the file in memory.
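As a small illustration of that truncation trick, here is a sketch of a helper that deletes the last line of a file in place. The helper name chop_last_line is our own, not the book's, and the sketch assumes the file is small enough to scan and ends in a newline:

```perl
# Sketch: delete the last line of a file in place by truncating it.
# Scans the whole file with tell() to find where the last line begins.
sub chop_last_line {
    my $filename = shift;
    open(my $fh, "+<", $filename) or die "can't update $filename: $!";
    my $pos  = 0;
    my $prev = 0;
    while (<$fh>) {
        $prev = $pos;        # start of the line we just read
        $pos  = tell($fh);   # start of the next line
    }
    truncate($fh, $prev) or die "can't truncate $filename: $!";
    close($fh);
}
```

Truncating to an offset returned by tell keeps the operation safe even when the I/O layer is translating line terminators behind the scenes.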
If you need to do this a lot, a database system may be a better solution than a raw file (see Chapter 14). Standard with Perl as of v5.8 is the Tie::File module, which offers an array interface to files of records. We use it in Recipe 8.4.

The most common files are text files, and the most common operations on text files are reading and writing lines. Use the line-input operator, <FH> (or the internal function implementing it, readline), to read lines, and use print to write them. These functions can also read or write any record that has a specific record separator. Lines are simply variable-length records that end in "\n".

The <FH> operator returns undef on error or when end of the file is reached, so use it in loops like this:

    while (defined ($line = <FH>)) {
        chomp $line;
        $size = length($line);
        print "$size\n";        # output size of line
    }

Because this operation is extremely common in Perl programs that process lines of text, and that's an awful lot to type, Perl conveniently provides some shorter aliases for it. If all shortcuts are taken, this notation might be too abstract for the uninitiated to guess what it's really doing. But it's an idiom you'll see thousands of times in Perl, so you'll soon get used to it. Here are increasingly shortened forms, where the first line is the completely spelled-out version:

    while (defined ($line = <FH>)) { ... }
    while ($line = <FH>) { ... }
    while (<FH>) { ... }

In the second line, the explicit defined test needed for detecting end-of-file is omitted. To make everyone's life easier, you're safe to skip that defined test, because when the Perl compiler detects this situation, it helpfully puts one there for you to guarantee your program's correctness in odd cases. This implicit addition of a defined occurs on all while tests that do nothing but assign to one scalar variable the result of calling readline, readdir, or readlink. As <FH> is just shorthand for readline(FH), it also counts. We're not quite done shortening up yet.
As the third line shows, you can also omit the variable assignment completely, leaving just the line input operator in the while test. When you do that here in a while test, it doesn't simply discard the line it just read as it would anywhere else. Instead, it reads lines into the special global variable $_. Because so many other operations in Perl also default to $_, this is more useful than it might initially appear.

    while (<FH>) {
        chomp;
        print length(), "\n";   # output size of line
    }

In scalar context, <FH> reads just the next line, but in list context, it reads all remaining lines:

    @lines = <FH>;

Each time <FH> reads a record from a filehandle, it increments the special variable $. (the "current input record number"). This variable is reset only when close is called explicitly, which means that it's not reset when you reopen an already opened filehandle.

Another special variable is $/, the input record separator. It is set to "\n" by default. You can set it to any string you like; for instance, "\0" to read null-terminated records. Read entire paragraphs by setting $/ to the empty string, "". This is almost like setting $/ to "\n\n", in that empty lines function as record separators. However, "" treats two or more consecutive empty lines as a single record separator, whereas "\n\n" returns empty records when more than two consecutive empty lines are read.

Undefine $/ to read the rest of the file as one scalar:

    undef $/;
    $whole_file = <FH>;         # "slurp" mode

The -0 option to Perl lets you set $/ from the command line:

    % perl -040 -e '$word = <>; print "First word is $word\n";'

The digits after -0 are the octal value of the single character to which $/ is to be set. If you specify an illegal value (e.g., with -0777), Perl will set $/ to undef. If you specify -00, Perl will set $/ to "". The limit of a single octal value means you can't set $/ to a multibyte string; for instance, "%%\n" to read fortune files.
Instead, you must use a BEGIN block:

    % perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat

Use print to write a line or any other data. The print function writes its arguments one after another and doesn't automatically add a line or record terminator by default.

    print HANDLE "One", "two", "three";     # "Onetwothree"
    print "Baa baa black sheep.\n";         # Sent to default output handle

There is no comma between the filehandle and the data to print. If you put a comma in there, Perl gives the error message "No comma allowed after filehandle". The default output handle is STDOUT. Change it with the select function. (See the Introduction to Chapter 7.)

Newlines

All systems use the virtual "\n" to represent a line terminator, called a newline. There is no such thing as a newline character; it is a platform-independent way of saying "whatever your string library uses to represent a line terminator." On Unix, VMS, and Windows, this line terminator in strings is "\cJ" (the Ctrl-J character). Versions of the old Macintosh operating system before Mac OS X used "\cM". As a Unix variant, Mac OS X uses "\cJ".

Operating systems also vary in how they store newlines in files. Unix also uses "\cJ" for this. On Windows, though, lines in a text file end in "\cM\cJ". If your I/O library knows you are reading or writing a text file, it will automatically translate between the string line terminator and the file line terminator. So on Windows, you could read four bytes ("Hi\cM\cJ") from disk and end up with three in memory ("Hi\cJ", where "\cJ" is the physical representation of the newline character). This is never a problem on Unix, as no translation needs to happen between the disk's newline ("\cJ") and the string's newline ("\cJ").

Terminals, of course, are a different kettle of fish. Except when you're in raw mode (as in system("stty raw")), the Enter key generates a "\cM" (carriage return) character.
This is then translated by the terminal driver into a "\n" for your program. When you print a line to a terminal, the terminal driver notices the "\n" newline character (whatever it might be on your platform) and turns it into the "\cM\cJ" (carriage return, line feed) sequence that moves the cursor to the start of the line and down one line.

Even network protocols have their own expectations. Most protocols prefer to receive and send "\cM\cJ" as the line terminator, but many servers also accept merely a "\cJ". This varies between protocols and servers, so check the documentation closely!

The important notion here is that if the I/O library thinks you are working with a text file, it may be translating sequences of bytes for you. This is a problem in two situations: when your file is not text (e.g., you're reading a JPEG file) and when your file is text but not in a byte-oriented ASCII-like encoding (e.g., UTF-8 or any of the other encodings the world uses to represent their characters). As if this weren't bad enough, some systems (again, MS-DOS is an example) use a particular byte sequence in a text file to indicate end-of-file. An I/O library that knows about text files on such a platform will indicate EOF when that byte sequence is read. Recipe 8.11 shows how to disable any translation that your I/O library might be doing.

I/O Layers

With v5.8, Perl I/O operations are no longer simply wrappers on top of stdio. Perl now has a flexible system (I/O layers) that transparently filters multiple encodings of external data. In Chapter 7 we met the :unix layer, which implements unbuffered I/O. There are also layers for using your platform's stdio (:stdio) and Perl's portable stdio implementation (:perlio), both of which buffer input and output. In this chapter, these implementation layers don't interest us as much as the encoding layers built on top of them.
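If you're curious which layers a given handle actually ended up with, the PerlIO::get_layers function built into v5.8 will tell you. This short sketch just prints the stack; the exact names vary by platform and how your perl was built:

```perl
# Print the I/O layer stack of a handle; on a typical Unix build of
# v5.8 a plain file shows something like "unix perlio".
open(my $fh, "<", $0) or die "can't open $0: $!";
my @layers = PerlIO::get_layers($fh);
print join(" ", @layers), "\n";
close($fh);
```

This is purely a debugging aid, but it makes the otherwise invisible layer stack concrete when you experiment with open modes and binmode.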
The :crlf layer converts a carriage return and line feed (CRLF, "\cM\cJ") to "\n" when reading from a file, and converts "\n" to CRLF when writing. The opposite of :crlf is :raw, which makes it safe to read or write binary data from the filehandle. You can specify that a filehandle contains UTF-8 data with :utf8, or specify an encoding with :encoding(...). You can even write your own filter in Perl that processes data being read before your program gets it, or processes data being written before it is sent to the device.

It's worth emphasizing: to disable :crlf, specify the :raw layer. The :bytes layer is sometimes misunderstood to be the opposite of :crlf, but they do completely different things. The former refers to the UTF-8ness of strings, and the latter to the behind-the-scenes conversion of carriage returns and line feeds.

You may specify I/O layers when you open the file:

    open($fh, "<:raw:utf8", $filename);             # read UTF-8 from the file
    open($fh, "<:encoding(shiftjis)", $filename);   # shiftjis japanese encoding
    open(FH, "+<:crlf", $filename);                 # convert between CRLF and \n

Or you may use binmode to change the layers of an existing handle:

    binmode($fh, ":raw:utf8");
    binmode($fh, ":raw:encoding(shiftjis)");
    binmode(FH, ":raw:crlf");

Because binmode pushes onto the stack of I/O layers, and the facility for removing layers is still evolving, you should always specify a complete set of layers by making the first layer be :raw, as follows:

    binmode(HANDLE, ":raw");                        # binary-safe
    binmode(HANDLE);                                # same as :raw
    binmode(HANDLE, ":raw :utf8");                  # read/write UTF-8
    binmode(HANDLE, ":raw :encoding(shiftjis)");    # read/write shiftjis

Recipe 8.18, Recipe 8.19, and Recipe 8.20 show how to manipulate I/O layers.

Advanced Operations

Use the read function to read a fixed-length record. It takes three arguments: a filehandle, a scalar variable, and the number of characters to read. It returns undef if an error occurred or else returns the number of characters read.
    $rv = read(HANDLE, $buffer, 4096)
        or die "Couldn't read from HANDLE: $!\n";
    # $rv is the number of bytes read,
    # $buffer holds the data read

To write a fixed-length record, just use print.

The truncate function changes the length (in bytes) of a file, which can be specified as a filehandle or as a filename. It returns true if the file was successfully truncated, false otherwise:

    truncate(HANDLE, $length)        or die "Couldn't truncate: $!\n";
    truncate("/tmp/$$.pid", $length) or die "Couldn't truncate: $!\n";

Each filehandle keeps track of where it is in the file. Reads and writes occur from this point, unless you've specified the O_APPEND flag (see Recipe 7.1). Fetch the file position for a filehandle with tell, and set it with seek. Because the library rewrites data to preserve the illusion that "\n" is the line terminator, and also because you might be using characters with code points above 255 and therefore requiring a multibyte encoding, you cannot portably seek to offsets calculated simply by counting characters. Unless you can guarantee your file uses one byte per character, seek only to offsets returned by tell.

    $pos = tell(DATAFILE);
    print "I'm $pos bytes from the start of DATAFILE.\n";

The seek function takes three arguments: the filehandle, the offset (in bytes) to go to, and a numeric argument indicating how to interpret the offset. 0 indicates an offset from the start of the file (like the value returned by tell); 1, an offset from the current location (a negative number means move backward in the file, a positive number means move forward); and 2, an offset from end-of-file.

    seek(LOGFILE, 0, 2)     or die "Couldn't seek to the end: $!\n";
    seek(DATAFILE, $pos, 0) or die "Couldn't seek to $pos: $!\n";
    seek(OUT, -20, 1)       or die "Couldn't seek back 20 bytes: $!\n";

So far we've been describing buffered I/O. That is, readline or <FH>, print, read, seek, and tell are all operations that use buffering for speed and efficiency.
This is their default behavior, although if you've specified an unbuffered I/O layer for that handle, they won't be buffered. Perl also provides an alternate set of I/O operations guaranteed to be unbuffered no matter what I/O layer is associated with the handle. These are sysread, syswrite, and sysseek, all discussed in Chapter 7.

The sysread and syswrite functions are different in appearance from their <FH> and print counterparts. Both take a filehandle to act on, a scalar variable to either read into or write out from, and the number of characters to transfer. (With binary data, this is the number of bytes, not characters.) They also accept an optional fourth argument, the offset from the start of the scalar variable at which to start reading or writing:

    $written = syswrite(DATAFILE, $mystring, length($mystring));
    die "syswrite failed: $!\n" unless $written == length($mystring);

    $read = sysread(INFILE, $block, 256, 5);
    warn "only read $read bytes, not 256" if 256 != $read;

The syswrite call sends the contents of $mystring to DATAFILE. The sysread call reads 256 characters from INFILE and stores them in $block, starting 5 characters from the beginning of $block, leaving intact the 5 characters it skipped.

Both sysread and syswrite return the number of characters transferred, which could be different than the amount of data you were attempting to transfer. Maybe the file didn't have as much data as you thought, so you got a short read. Maybe the filesystem that the file lives on filled up. Maybe your process was interrupted partway through the write. Stdio takes care of finishing the transfer in cases of interruption, but if you use raw sysread and syswrite calls, you must finish up yourself. See Recipe 9.3 for an example.

The sysseek function doubles as an unbuffered replacement for both seek and tell. It takes the same arguments as seek, but it returns the new position on success and undef on error.
To find the current position within the file:

    $pos = sysseek(HANDLE, 0, 1);       # don't change position
    die "Couldn't sysseek: $!\n" unless defined $pos;

These are the basic operations available to you. The art and craft of programming lies in using these basic operations to solve complex problems such as finding the number of lines in a file, reversing lines in a file, randomly selecting a line from a file, building an index for a file, and so on.

Recipe 8.1 Reading Lines with Continuation Characters

8.1.1 Problem

You have a file with long lines split over two or more lines, with backslashes to indicate that a continuation line follows. You want to rejoin those split lines. Makefiles, shell scripts, and many other scripting or configuration languages let you break a long line into several shorter ones in this fashion.

8.1.2 Solution

Build up the complete lines one at a time until reaching one without a backslash:

    while (defined($line = <FH>)) {
        chomp $line;
        if ($line =~ s/\\$//) {
            $line .= <FH>;
            redo unless eof(FH);
        }
        # process full record in $line here
    }

8.1.3 Discussion

Here's an example input file:

    DISTFILES = $(DIST_COMMON) $(SOURCES) $(HEADERS) \
                $(TEXINFOS) $(INFOS) $(MANS) $(DATA)
    DEP_DISTFILES = $(DIST_COMMON) $(SOURCES) $(HEADERS) \
                $(TEXINFOS) $(INFO_DEPS) $(MANS) $(DATA) \
                $(EXTRA_DIST)

You'd like to process that file a record at a time, with the escaped newlines ignored. The first record would then be the first two lines, the second record the next three lines, etc.

Here's how the algorithm works. The while loop reads lines one at a time. The substitution operator s/// tries to remove a trailing backslash. If the substitution fails, we've found a line without a backslash at the end. Otherwise, read another record, concatenate it onto the accumulating $line variable, and use redo to jump back to just inside the opening brace of the while loop. This lands us back on the chomp.
A frequent problem with files intended to be in this format arises when unnoticed spaces or tabs follow the backslash before the newline. The substitution that finds continuation lines would be more forgiving if written this way:

    if ($line =~ s/\\\s*$//) {
        # as before
    }

Unfortunately, even if your program is forgiving, surely others will not be. Remember to be liberal in what you accept, but conservative in what you produce.

8.1.4 See Also

The chomp function in perlfunc(1) and in Chapter 29 of Programming Perl; the redo keyword in the "Loop Control" sections of perlsyn(1) and Chapter 4 of Programming Perl

Recipe 8.2 Counting Lines (or Paragraphs or Records) in a File

8.2.1 Problem

You need to compute the number of lines in a file.

8.2.2 Solution

Many systems have a wc program to count lines in a file:

    $count = `wc -l < $file`;
    die "wc failed: $?" if $?;
    chomp($count);

You could also open the file and read line-by-line until the end, counting lines as you go:

    open(FILE, "<", $file) or die "can't open $file: $!";
    $count++ while <FILE>;
    # $count now holds the number of lines read

Here's the fastest solution, assuming your line terminator really is "\n":

    $count += tr/\n/\n/ while sysread(FILE, $_, 2 ** 20);

8.2.3 Discussion

Although you can use -s $file to determine the file size in bytes, you generally cannot use it to derive a line count. See the Introduction to Chapter 9 for more on -s.

If you can't or don't want to call another program to do your dirty work, you can emulate wc by opening up and reading the file yourself:

    open(FILE, "<", $file) or die "can't open $file: $!";
    $count++ while <FILE>;
    # $count now holds the number of lines read

Another way of writing this is:

    open(FILE, "<", $file) or die "can't open $file: $!";
    for ($count = 0; <FILE>; $count++) { }

If you're not reading from any other files, you don't need the $count variable in this case. The special variable $.
holds the number of lines read since a filehandle was last explicitly closed:

    1 while <FILE>;
    $count = $.;

This reads in all records in the file, then discards them.

To count paragraphs, set the global input record separator variable $/ to the empty string ("") before reading, to make the input operator <FILE> read a paragraph at a time.

    $/ = "";            # enable paragraph mode for all reads
    open(FILE, "<", $file) or die "can't open $file: $!";
    1 while <FILE>;
    $para_count = $.;

The sysread solution reads the file a megabyte at a time. Once end-of-file is reached, sysread returns 0. This ends the loop, as does undef, which would indicate an error. The tr operation doesn't really substitute \n for \n in the string; it's an old idiom for counting occurrences of a character in a string.

8.2.4 See Also

The tr operator in perlop(1) and Chapter 5 of Programming Perl; your system's wc(1) manpage; the $/ entry in perlvar(1) and in the "Special Variables in Alphabetical Order" section of Chapter 28 of Programming Perl; the Introduction to Chapter 9

Recipe 8.3 Processing Every Word in a File

8.3.1 Problem

You need to do something to every word in a file, similar to the foreach function of csh.

8.3.2 Solution

Either split each line on whitespace:

    while (<>) {
        for $chunk (split) {
            # do something with $chunk
        }
    }

or use the m//g operator to pull out one chunk at a time:

    while (<>) {
        while ( /(\w[\w'-]*)/g ) {
            # do something with $1
        }
    }

8.3.3 Discussion

Decide what you mean by "word." Sometimes you want anything but whitespace, sometimes you want only program identifiers, and sometimes you want English words. Your definition governs which regular expression to use.

The preceding two approaches work differently. Patterns are used in the first approach to decide what is not a word. In the second, they're used to decide what is a word.

With these techniques, it's easy to make a word frequency counter.
Use a hash to store how many times each word has been seen:

    # Make a word frequency count
    %seen = ();
    while (<>) {
        while ( /(\w[\w'-]*)/g ) {
            $seen{lc $1}++;
        }
    }
    # output hash in a descending numeric sort of its values
    foreach $word ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
        printf "%5d %s\n", $seen{$word}, $word;
    }

To make the example program count line frequency instead of word frequency, omit the second while loop and use $seen{lc $_}++ instead:

    # Line frequency count
    %seen = ();
    while (<>) {
        $seen{lc $_}++;
    }
    foreach $line ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
        printf "%5d %s", $seen{$line}, $line;
    }

Odd things that may need to be considered as words include "M.I.T.", "Micro$oft", "o'clock", "49ers", "street-wise", "and/or", "&", "c/o", "St.", "Tschüß", and "Niño". Bear this in mind when you choose a pattern to match. The last two require you to place a use locale in your program and then use \w for a word character in the current locale, or else use the Unicode letter property if you have Unicode text:

    /(\p{Letter}[\p{Letter}'-]*)/

8.3.4 See Also

perlre(1); the split function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 6.3; Recipe 6.23

Recipe 8.4 Reading a File Backward by Line or Paragraph

8.4.1 Problem

You want to process each line or paragraph of a text file in reverse.
8.4.2 Solution Read all lines into an array, then process that array from the end to the start: @lines = <FILE>; while ($line = pop @lines) { # do something with $line } Or store an array of lines in reverse order: @lines = reverse <FILE>; foreach $line (@lines) { # do something with $line } Or use the Tie::File module (standard as of v5.8): use Tie::File; tie(@lines, "Tie::File", $FILENAME, mode => 0) or die "Can't tie $FILENAME: $!"; $max_lines = $#lines; for ($i = $max_lines; $i >= 0; $i--) { # do something with $lines[$i], e.g. line number them: printf "%5d %s\n", $i+1, $lines[$i]; } 8.4.3 Discussion The limitations of file access mentioned in this chapter's Introduction prevent reading a line at a time starting from the end. You must read the lines into memory, then process them in reverse order. This requires at least as much available memory as the size of the file, unless you use tricks like Tie::File does. The first technique moves through the array of lines in reverse order. This destructively processes the array, popping an element off the end of the array each time through the loop. We could do it non-destructively with: for ($i = $#lines; $i != -1; $i--) { $line = $lines[$i]; } The second approach generates an array of lines already in reverse order. This array can then be processed non-destructively. We get the reversed lines because the assignment to @lines confers list context on the return from reverse, and reverse confers list context on its argument of <FILE>, which returns a list of all lines in the file. These approaches are easily extended to paragraphs just by changing $/: # this enclosing block keeps local $/ temporary { local $/ = ""; @paragraphs = reverse <FILE>; } foreach $paragraph (@paragraphs) { # do something } The Tie::File module lets you treat the file as an array of lines. The solution then becomes simply iterating through the array a line at a time from the end back to the start.
It's much slower than reading everything into memory and reversing it, but works on files too big to fit into memory all at once. Be careful, though: Tie::File will rewrite the file if you change the contents of the tied @lines, so don't do that. In our example, assigning @lines = reverse(@lines) would reverse the file on disk! By opening the file with mode O_RDONLY (0), you can avoid that possibility. The default mode is O_RDWR | O_CREAT. Also, Tie::File cannot emulate the paragraph semantics of setting $/ to the empty string (""). 8.4.4 See Also The reverse function in perlfunc(1) and in Chapter 29 of Programming Perl; the $/ entry in perlvar(1), and in Chapter 28 of Programming Perl; the documentation for the standard Tie::File module; Recipe 4.11; Recipe 1.7 Recipe 8.5 Trailing a Growing File 8.5.1 Problem You want to read from a continually growing file, but the read fails when you reach the current end-of-file. 8.5.2 Solution Read until end-of-file. Sleep, clear the EOF flag, and read some more. Repeat until interrupted. To clear the EOF flag, either use seek: for (;;) { while (<FH>) { .... } sleep $SOMETIME; seek(FH, 0, 1); } or use the IO::Handle module's clearerr method: use IO::Handle; for (;;) { while (<FH>) { .... } sleep $SOMETIME; FH->clearerr( ); } 8.5.3 Discussion When you read until end-of-file, an internal flag is set that prevents further reading. The most direct way to clear this flag is the clearerr method, if supported: it's in the IO::Handle module. $naptime = 1; use IO::Handle; open (LOGFILE, "/tmp/logfile") or die "can't open /tmp/logfile: $!"; for (;;) { while (<LOGFILE>) { print } # or appropriate processing sleep $naptime; LOGFILE->clearerr( ); # clear stdio error flag } Because Perl v5.8 ships with its own stdio implementation, that simple approach should almost always work. On the rare system where it doesn't work, you may need to use seek.
The seek code given in the Solution tries to move zero bytes from the current position, which nearly always works. It doesn't change the current position, but it should clear the end-of-file condition on the handle so that the next operation picks up new data. If that still doesn't work, perhaps because it relies on features of your I/O implementation, you may need to use the following seek code, which remembers the old file position explicitly and returns there directly. for (;;) { for ($curpos = tell(LOGFILE); <LOGFILE>; $curpos = tell(LOGFILE)) { # process $_ here } sleep $naptime; seek(LOGFILE, $curpos, 0); # seek to where we had been } On some kinds of filesystems, the file could be removed while you are reading it. If so, there's probably little reason to continue checking whether it grows. To make the program exit in that case, stat the handle and make sure its link count (the third field in the return list) hasn't gone to 0: exit if (stat(LOGFILE))[3] == 0; If you're using the File::stat module, you could write that more readably as: use File::stat; exit if stat(*LOGFILE)->nlink == 0; The CPAN module File::Tail lets you tie a filehandle so that the read operation blocks at the end of the file until more data is available: use File::Tail; tie *FH, "File::Tail", (name => $FILENAME); while (<FH>) { # do something with line read } The <FH> operator in this case never returns undef to indicate end-of-file. 8.5.4 See Also The seek and tell functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's tail(1) and stdio(3) manpages; the documentation for the standard File::stat module (also in Chapter 32 of Programming Perl); the documentation for the CPAN module File::Tail Recipe 8.6 Picking a Random Line from a File 8.6.1 Problem You want to return a random line from a file. 8.6.2 Solution Use rand and $. (the current line number) to decide which line to print: srand; rand($.)
< 1 && ($line = $_) while <>; # $line is the random line 8.6.3 Discussion This is a beautiful example of a solution that may not be obvious. We read every line in the file but don't have to store them all in memory. This is great for large files. Each line has a 1 in N (where N is the number of lines read so far) chance of being selected. Here's a replacement for fortune using this algorithm: $/ = "%%\n"; @ARGV = ("/usr/share/games/fortunes") unless @ARGV; srand; rand($.) < 1 && ($adage = $_) while <>; print $adage; If you know line offsets (for instance, you've created an index) and the number of lines, you can randomly select a line and jump to its offset in the file, but you usually don't have such an index. Here's a more rigorous explanation of how the algorithm works. The function call rand($.) picks a random number between 0 and the current line number. Therefore, you have a one in N chance, that is, 1/N, of keeping the Nth line. Therefore you've a 100% chance of keeping the first line, a 50% chance of keeping the second, a 33% chance of keeping the third, and so on. The question is whether this is fair for all N, where N is any positive integer. First, some concrete examples, then abstract ones. Obviously, a file with one line (N=1) is fair: you always keep the first line because 1/1 = 100%, making it fair for files of 1 line. For a file with two lines, N=2. You always keep the first line; then when reaching the second line, you have a 50% chance of keeping it. Thus, both lines have an equal chance of being selected, which shows that N=2 is fair. For a file with three lines, N=3. You have a one-third chance, 33%, of keeping that third line. That leaves a two-thirds chance of retaining one of the first two out of the three lines. But we've already shown that for those first two lines there's a 50-50 chance of selecting either one. 50 percent of two-thirds is one-third. Thus, you have a one-third chance of selecting each of the three lines of the file.
In the general case, a file of N+1 lines will choose the last line 1/(N+1) of the time and one of the previous N lines N/(N+1) of the time. Dividing N/(N+1) by N leaves us with 1/(N+1) for each of the first N lines in our N+1 line file, and also 1/(N+1) for line number N+1. The algorithm is therefore fair for all N, where N is a positive integer. We've managed to fairly choose a random line from a file with speed directly proportional to the size of the file, but using no more memory than it takes to hold the longest line, even in the worst case. 8.6.4 See Also The $. entry in perlvar(1) and in Chapter 28 of Programming Perl; Recipe 2.6; Recipe 2.7 Recipe 8.7 Randomizing All Lines 8.7.1 Problem You want to copy a file and randomly reorder its lines. 8.7.2 Solution Read all lines into an array, shuffle the array using List::Util's shuffle function, and write the shuffled lines back out: use List::Util qw(shuffle); while (<INPUT>) { push(@lines, $_); } @reordered = shuffle(@lines); foreach (@reordered) { print OUTPUT $_; } 8.7.3 Discussion The easiest approach is to read all lines into memory and shuffle them there. Because you don't know where lines start in the file, you can't just shuffle a list of line numbers and then extract lines in the order they'll appear in the shuffled file. Even if you did know the byte offsets of the start of each line, it would probably still be slower because you'd be seeking around in the file instead of sequentially reading it from start to finish. If you have a version of Perl older than v5.8, you can download the List::Util module from CPAN. 8.7.4 See Also The documentation for the standard List::Util module; Recipe 2.6; Recipe 2.7; Recipe 4.18 Recipe 8.8 Reading a Particular Line in a File 8.8.1 Problem You want to extract a single line from a file. 8.8.2 Solution The simplest solution is to read the lines until you get to the one you want: # looking for line number $DESIRED_LINE_NUMBER $. = 0; do { $LINE = <HANDLE> } until $.
== $DESIRED_LINE_NUMBER || eof; If you are going to be doing this a lot and the file fits into memory, read the file into an array: @lines = <HANDLE>; $LINE = $lines[$DESIRED_LINE_NUMBER]; The standard (as of v5.8) Tie::File ties an array to a file, one line per array element: use Tie::File; use Fcntl; tie(@lines, Tie::File, $FILE, mode => O_RDWR) or die "Cannot tie file $FILE: $!\n"; $line = $lines[$sought - 1]; If you have the DB_File module, its DB_RECNO access method ties an array to a file, one line per array element: use DB_File; use Fcntl; $tie = tie(@lines, DB_File, $FILE, O_RDWR, 0666, $DB_RECNO) or die "Cannot open file $FILE: $!\n"; # extract it $line = $lines[$sought - 1]; 8.8.3 Discussion Each strategy has different features, useful in different circumstances. The linear access approach is easy to write and best for short files. The Tie::File module gives good performance, regardless of the size of the file or which line you're reading (and is pure Perl, so doesn't require any external libraries). The DB_File mechanism has some initial overhead, but later accesses are faster than with linear access, so use it for long files that are accessed more than once and are accessed out of order. It is important to know whether you're counting lines from 0 or 1. The $. variable is 1 after the first line is read, so count from 1 when using linear access. The index mechanism uses many offsets, so count from 0. Tie::File and DB_File treat the file's records as an array indexed from 0, so count lines from 0. Here are three different implementations of the same program, print_line. The program takes two arguments: a filename and a line number to extract. The version in Example 8-1 simply reads lines until it finds the one it's looking for. Example 8-1.
print_line-v1 #!/usr/bin/perl -w # print_line-v1 - linear style @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n"; ($filename, $line_number) = @ARGV; open(INFILE, "<", $filename) or die "Can't open $filename for reading: $!\n"; while (<INFILE>) { $line = $_; last if $. == $line_number; } if ($. != $line_number) { die "Didn't find line $line_number in $filename\n"; } print; The Tie::File version is shown in Example 8-2. Example 8-2. print_line-v2 #!/usr/bin/perl -w # print_line-v2 - Tie::File style use Tie::File; use Fcntl; @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n"; ($filename, $line_number) = @ARGV; tie @lines, Tie::File, $filename, mode => O_RDWR or die "Can't open $filename for reading: $!\n"; if (@lines < $line_number) { die "Didn't find line $line_number in $filename\n"; } print "$lines[$line_number-1]\n"; The DB_File version in Example 8-3 follows the same logic as Tie::File. Example 8-3. print_line-v3 #!/usr/bin/perl -w # print_line-v3 - DB_File style use DB_File; use Fcntl; @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n"; ($filename, $line_number) = @ARGV; $tie = tie(@lines, DB_File, $filename, O_RDWR, 0666, $DB_RECNO) or die "Cannot open file $filename: $!\n"; unless ($line_number < $tie->length) { die "Didn't find line $line_number in $filename\n" } print $lines[$line_number-1]; # easy, eh? If you will be retrieving lines by number often and the file doesn't fit into memory, build a byte-address index to let you seek directly to the start of the line using the techniques in Recipe 8.27. 8.8.4 See Also The documentation for the standard Tie::File and DB_File modules (also in Chapter 32 of Programming Perl); the tie function in perlfunc(1) and in Chapter 29 of Programming Perl; the entry on $. in perlvar(1) and in Chapter 28 of Programming Perl; Recipe 8.27 Recipe 8.9 Processing Variable-Length Text Fields 8.9.1 Problem You want to extract variable-length fields from your input.
8.9.2 Solution Use split with a pattern matching the field separators. # given $RECORD with fields separated by a pattern, # extract a list of fields @FIELDS = split(/PATTERN/, $RECORD); 8.9.3 Discussion The split function takes up to three arguments: PATTERN, EXPRESSION, and LIMIT. The LIMIT parameter is the maximum number of fields to split into. (If the input contains more fields, they are returned unsplit in the final list element.) If LIMIT is omitted, all fields (except any final empty ones) are returned. EXPRESSION gives the string value to split. If EXPRESSION is omitted, $_ is split. PATTERN is a pattern matching the field separator. If PATTERN is omitted, contiguous stretches of whitespace are used as the field separator and leading empty fields are silently discarded. If your input field separator isn't a fixed string, you might want split to return the field separators as well as the data by using parentheses in PATTERN to save the field separators. For instance: split(/([+-])/, "3+5-2"); returns the values: (3, "+", 5, "-", 2) To split colon-separated records in the style of the /etc/passwd file, use: @fields = split(/:/, $RECORD); The classic application of split is whitespace-separated records: @fields = split(/\s+/, $RECORD); If $RECORD started with whitespace, this last use of split would have put an empty string into the first element of @fields because split would consider the record to have an initial empty field. If you didn't want this, you could use this special form of split: @fields = split(" ", $RECORD); This behaves like split with a pattern of /\s+/, but ignores leading whitespace. When the record separator can appear in the record, you have a problem. The usual solution is to escape occurrences of the record separator in records by prefixing them with a backslash. See Recipe 1.18.
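The LIMIT parameter deserves a concrete illustration. Here's a minimal sketch (the sample record is invented for demonstration): with a LIMIT of 2, everything after the first separator is kept intact in the final element, which is handy when a value may itself contain the separator.

```perl
# A passwd-style record, colon-separated.
my $record = "alice:x:1001:Alice Liddell:/home/alice:/bin/bash";

# No LIMIT: every colon splits, yielding six fields.
my @all = split(/:/, $record);

# LIMIT of 2: split only at the first colon; the rest stays whole.
my ($user, $rest) = split(/:/, $record, 2);
```

Here @all holds six elements, while $rest still contains the remaining colon-separated text verbatim.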
8.9.4 See Also The split function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 1.18 Recipe 8.10 Removing the Last Line of a File 8.10.1 Problem You'd like to remove the last line from a file. 8.10.2 Solution Use the standard (as of v5.8) Tie::File module and delete the last element from the tied array: use Tie::File; tie @lines, Tie::File, $file or die "can't update $file: $!"; delete $lines[-1]; 8.10.3 Discussion The Tie::File solution is the most efficient solution, at least for large files, because it doesn't have to read through the entire file to find the last line and doesn't read the entire file into memory. It is, however, considerably slower for small files than code you could implement yourself by hand. That doesn't mean you shouldn't use Tie::File; it just means you've optimized for programmer time instead of for computer time. If you don't have Tie::File and can't install it from CPAN, read the file a line at a time and keep track of the byte address of the last line you've seen. When you've exhausted the file, truncate to the last address you saved: open (FH, "+<", $file) or die "can't update $file: $!"; while (<FH>) { $addr = tell(FH) unless eof(FH); } truncate(FH, $addr) or die "can't truncate $file: $!"; Remembering the offset is more efficient than reading the whole file into memory because it holds only one given line at a time. Although you still have to grope your way through the whole file, you can use this technique on files larger than available memory.
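The tell/truncate technique above can be exercised end to end on a scratch file. This is a self-contained sketch using the core File::Temp module; the file contents are invented for illustration.

```perl
use File::Temp qw(tempfile);

# Create a scratch file with three lines.
my ($fh, $tmp) = tempfile();
print $fh "first\nsecond\nlast\n";
close $fh;

# Walk the file, remembering the offset of the start of the final line,
# then truncate the file at that offset.
open(my $io, "+<", $tmp) or die "can't update $tmp: $!";
my $addr = 0;
while (<$io>) {
    $addr = tell($io) unless eof($io);   # after the loop: start of last line
}
truncate($io, $addr) or die "can't truncate $tmp: $!";
close $io;

# Read back what's left.
open(my $in, "<", $tmp) or die "can't open $tmp: $!";
my @left = <$in>;
close $in;
```

After truncation only the first two lines remain; the last line is gone without the rest of the file ever being held in memory.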
8.10.4 See Also The documentation for the standard Tie::File module; the truncate and tell functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's open(2) and fopen(3) manpages; Recipe 8.18 Recipe 8.11 Processing Binary Files 8.11.1 Problem You want to read 8-bit binary data as 8-bit binary data, i.e., neither as characters in a particular encoding nor as a text file with any newline or end-of-file conversions that your I/O library might want to do. 8.11.2 Solution Use the binmode function on the filehandle: binmode(HANDLE); 8.11.3 Discussion The binmode function lets you specify new I/O layers for a filehandle. The default layer to specify is :raw, which removes any layers that would interfere with binary data. The Solution is thus equivalent to: binmode(HANDLE, ":raw"); except that explicitly specifying :raw only works on Perl 5.8 and later. The one-argument form of binmode works on all versions of Perl. Because Perl makes :crlf the default if you are on an operating system that needs it, you should rarely (if ever) need to specify :crlf in your program. Furthermore, it's generally not wise to add or remove the :crlf layer once you've begun reading from the file, as there may be data already read into buffers that you can't unread. You can, however, safely change the :encoding(...) layer midstream (when parsing XML, for example). You should get into the habit of calling binmode when you open a binary file. This will make your program portable to systems that might (un)helpfully translate bytes in your binary file into something unusable.
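A quick way to convince yourself that the :raw layer passes bytes through untouched is to round-trip a few awkward bytes (a NUL, a newline, a high byte) through a scratch file. This is a hedged sketch using core modules only:

```perl
use File::Temp qw(tempfile);

# Bytes chosen to trip up text-mode translation: NUL, LF, 0xFF, 'A'.
my @bytes = (0x00, 0x0A, 0xFF, 0x41);

my ($fh, $tmp) = tempfile();
binmode($fh);                      # no newline or encoding translation
print $fh pack("C*", @bytes);
close $fh;

# Read back through the :raw layer specified at open time.
open(my $in, "< :raw", $tmp) or die "can't open $tmp: $!";
my $len = read($in, my $buf, 1024);
close $in;

my @back = unpack("C*", $buf);     # should be exactly the bytes we wrote
```

On a system where the handle defaulted to :crlf, the embedded 0x0A would be at risk of translation; with :raw, all four bytes come back verbatim.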
You may specify the I/O layers when you open a filehandle, rather than using binmode after the fact: open(FH, "< :raw", $filename); # binary mode Specify the default set of layers for all subsequently opened input and output filehandles with the open pragma: use open IN => ":raw"; # binary files 8.11.4 See Also The PerlIO(3) manpage; the open and binmode functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's open(2) and fopen(3) manpages Recipe 8.12 Using Random-Access I/O 8.12.1 Problem You have to read a binary record from the middle of a large file but don't want to read a record at a time to get there. 8.12.2 Solution Once you know the record's size, multiply it by the record number to get the byte address, and then seek to that byte address and read the record: $ADDRESS = $RECSIZE * $RECNO; seek(FH, $ADDRESS, 0) or die "seek:$!"; read(FH, $BUFFER, $RECSIZE); 8.12.3 Discussion The Solution assumes the first record has a RECNO of 0. If you're counting from one, use: $ADDRESS = $RECSIZE * ($RECNO-1); This is best applied to binary data. Applying it to text files assumes you have a constant character width and constant line length. This rules out most Unicode encodings, any kind of Windows text file, and any text file where lines can have different lengths. 8.12.4 See Also The seek function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.13 Recipe 8.13 Updating a Random-Access File 8.13.1 Problem You want to read an old record from a binary file, change its values, and write back the record. 8.13.2 Solution After reading the old record, pack up the updated values, seek to the previous address, and write it back.
use Fcntl; # for SEEK_SET and SEEK_CUR $ADDRESS = $RECSIZE * $RECNO; seek(FH, $ADDRESS, SEEK_SET) or die "Seeking: $!"; read(FH, $BUFFER, $RECSIZE) == $RECSIZE or die "Reading: $!"; @FIELDS = unpack($FORMAT, $BUFFER); # update fields, then $BUFFER = pack($FORMAT, @FIELDS); seek(FH, -$RECSIZE, SEEK_CUR) or die "Seeking: $!"; print FH $BUFFER; close FH or die "Closing: $!"; 8.13.3 Discussion You don't have to use anything fancier than print in Perl to output a record. Remember that the opposite of read is not write but print, although oddly enough, the opposite of sysread is syswrite. The example program shown in Example 8-4, weekearly, takes one argument: the user whose record you want to backdate by a week. (Of course, in practice, you wouldn't really want to (nor be able to!) mess with the system accounting files.) This program requires write access to the file to be updated, since it opens the file in update mode. After fetching and altering the record, it packs it up again, skips backward in the file one record, and writes it out. Example 8-4.
weekearly #!/usr/bin/perl -w # weekearly -- set someone's login date back a week use User::pwent; use IO::Seekable; $typedef = "L A12 A16"; # linux fmt; sunos is "L A8 A16" $sizeof = length(pack($typedef, ( ))); $user = shift(@ARGV) || $ENV{USER} || $ENV{LOGNAME}; $address = getpwnam($user)->uid * $sizeof; open (LASTLOG, "+<:raw", "/var/log/lastlog") or die "can't update /var/log/lastlog: $!"; seek(LASTLOG, $address, SEEK_SET) or die "seek failed: $!"; read(LASTLOG, $buffer, $sizeof) == $sizeof or die "read failed: $!"; ($time, $line, $host) = unpack($typedef, $buffer); $time -= 24 * 7 * 60 * 60; # back-date a week $buffer = pack($typedef, $time, $line, $host); seek(LASTLOG, -$sizeof, SEEK_CUR) # backup one record or die "seek failed: $!"; print LASTLOG $buffer; close(LASTLOG) or die "close failed: $!"; 8.13.4 See Also The PerlIO(3) manpage; the open, seek, read, pack, and unpack functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.12; Recipe 8.14 Recipe 8.14 Reading a String from a Binary File 8.14.1 Problem You want to read a NUL-terminated string from a file, starting at a particular address. 8.14.2 Solution Ensure you're working with a binary file, set $/ to an ASCII NUL, and read the string with <FH>: binmode(FH); # binary mode $old_rs = $/; # save old $/ $/ = "\0"; # ASCII 0: NUL seek(FH, $addr, SEEK_SET) or die "Seek error: $!\n"; $string = <FH>; # read string chomp $string; # remove NUL $/ = $old_rs; # restore old $/ You can use local to save and restore $/: { local $/ = "\0"; # ... } # $/ is automatically restored 8.14.3 Discussion The example program shown in Example 8-5, bgets, accepts a filename and one or more byte addresses as arguments. Decimal, octal, or hexadecimal addresses may be specified. For each address, the program reads and prints the null- or EOF-terminated string at that position. Example 8-5.
bgets #!/usr/bin/perl -w # bgets - get a string from an address in a binary file use IO::Seekable; use open IO => ":raw"; # binary mode on all opened handles ($file, @addrs) = @ARGV or die "usage: $0 file addr ..."; open(FH, $file) or die "cannot open $file: $!"; $/ = "\000"; foreach $addr (@addrs) { $addr = oct $addr if $addr =~ /^0/; seek(FH, $addr, SEEK_SET) or die "can't seek to $addr in $file: $!"; printf qq{%#x %#o %d "%s"\n}, $addr, $addr, $addr, scalar <FH>; } Example 8-6 is a simple implementation of the Unix strings program. Example 8-6. strings #!/usr/bin/perl -w # strings - pull strings out of a binary file $/ = "\0"; use open IO => ":raw"; while (<>) { while (/([\040-\176\s]{4,})/g) { print $1, "\n"; } } 8.14.4 See Also The PerlIO(3) manpage; the seek, getc, and ord functions in perlfunc(1) and in Chapter 29 of Programming Perl; the discussion of qq// in the "Quote and Quote-Like Operators" section of the perlop(1) manpage, and in the "Pick Your Own Quotes" section of Chapter 2 of Programming Perl Recipe 8.15 Reading Fixed-Length Records 8.15.1 Problem You want to read a file whose records have a fixed length. 8.15.2 Solution Use read and unpack: # $RECORDSIZE is the length of a record, in bytes. # $TEMPLATE is the unpack template for the record # FILE is the file to read from # @FIELDS is an array, one element per field until ( eof(FILE) ) { read(FILE, $record, $RECORDSIZE) == $RECORDSIZE or die "short read\n"; @FIELDS = unpack($TEMPLATE, $record); } 8.15.3 Discussion Because the file in question is not a text file, you can't use <FILE> or IO::Handle's getline method to read records. Instead, you must simply read a particular number of bytes into a variable. This variable contains one record's data, which you decode using unpack with the appropriate format. For binary data, the catch is determining that format.
When reading data written by a C program, this can mean peeking at C include files or manpages describing the structure layout, and this requires knowledge of C. It also requires that you become unnaturally chummy with your C compiler, because otherwise it's hard to predict field padding and alignment (such as the x2 in the format used in Recipe 8.24). If you're lucky enough to be on a Berkeley Unix system or a system supporting gcc, then you may be able to use the c2ph tool distributed with Perl to cajole your C compiler into helping you with this. The tailwtmp program at the end of this chapter uses the format described in utmp(5) under Linux, and works on its /var/log/wtmp and /var/run/utmp files. Once you commit to working in binary format, machine dependencies creep in fast. It probably won't work unaltered on your system, but the procedure is still illustrative. Here is the relevant layout from the C include file on Linux: #define UT_LINESIZE 12 #define UT_NAMESIZE 8 #define UT_HOSTSIZE 16 struct utmp { /* here are the pack template codes */ short ut_type; /* s for short, must be padded */ pid_t ut_pid; /* i for integer */ char ut_line[UT_LINESIZE]; /* A12 for 12-char string */ char ut_id[2]; /* A2, but need x2 for alignment */ time_t ut_time; /* l for long */ char ut_user[UT_NAMESIZE]; /* A8 for 8-char string */ char ut_host[UT_HOSTSIZE]; /* A16 for 16-char string */ long ut_addr; /* l for long */ }; Once you figure out the binary layout, feed that (in this case, "s x2 i A12 A2 x2 l A8 A16 l") to pack with an empty field list to determine the record's size. Remember to check the return value of read to make sure you got the number of bytes you asked for. If your records are text strings, use the "a" or "A" unpack templates. Fixed-length records are useful in that the nth record begins at byte offset SIZE * (n-1) in the file, where SIZE is the size of a single record. See the indexing code in Recipe 8.8 for an example. 
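The read-and-unpack loop from the Solution can be demonstrated end to end with a made-up record layout (the template and field names here are invented for illustration, not taken from any system file): a 4-byte big-endian unsigned integer id followed by an 8-character space-padded name, packed as "N A8".

```perl
use File::Temp qw(tempfile);

# Hypothetical fixed-length record: 4-byte id + 8-char padded name.
my $TEMPLATE   = "N A8";
my $RECORDSIZE = length(pack($TEMPLATE, 0, ""));   # 12 bytes per record

# Write two records to a scratch file in binary mode.
my ($fh, $tmp) = tempfile();
binmode($fh);
print $fh pack($TEMPLATE, 1, "alice"), pack($TEMPLATE, 2, "bob");
close $fh;

# Read them back one fixed-length record at a time.
open(FILE, "<:raw", $tmp) or die "can't open $tmp: $!";
my @names;
until (eof(FILE)) {
    read(FILE, my $record, $RECORDSIZE) == $RECORDSIZE or die "short read\n";
    my ($id, $name) = unpack($TEMPLATE, $record);   # "A8" strips pad spaces
    push @names, "$id:$name";
}
close FILE;
```

Note how the record size is computed by packing an empty record, exactly as described above for the utmp template, so the size never drifts out of sync with the template.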
8.15.4 See Also The unpack, pack, and read functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 1.1 Recipe 8.16 Reading Configuration Files 8.16.1 Problem You want to allow users of your program to change its behavior through configuration files. 8.16.2 Solution Either process a file in trivial VAR=VALUE format, setting a hash key-value pair for each setting: while (<CONFIG>) { chomp; # no newline s/#.*//; # no comments s/^\s+//; # no leading white s/\s+$//; # no trailing white next unless length; # anything left? my ($var, $value) = split(/\s*=\s*/, $_, 2); $User_Preferences{$var} = $value; } or better yet, treat the config file as full Perl code: do "$ENV{HOME}/.progrc"; 8.16.3 Discussion The first solution lets you read config files in a trivial format like this (comments and empty lines are allowed): # set class C net NETMASK = 255.255.255.0 MTU = 296 DEVICE = cua1 RATE = 115200 MODE = adaptive After you're done, you can pull in a setting by using something like $User_Preferences{"RATE"} to find the value 115200. If you wanted the config file to set the global variable by that name, instead of assigning to the hash, use this: no strict "refs"; $$var = $value; and the $RATE variable would contain 115200. The second solution uses do to pull in raw Perl code directly. When used with an expression instead of a block, do interprets the expression as a filename. This is nearly identical to using require, but without risk of taking a fatal exception. In the second format, the config file would look like: # set class C net $NETMASK = "255.255.255.0"; $MTU = 0x128; # Brent, please turn on the modem $DEVICE = "cua1"; $RATE = 115_200; $MODE = "adaptive"; If you don't see the point of having extra punctuation and live code, consider this: you can have all of Perl at your disposal.
You can now add arbitrary logic and tests to your simple assignments: if ($DEVICE =~ /1$/) { $RATE = 28_800; } else { $RATE = 115_200; } Many programs support system and personal configuration files. If you want the user's choices to override the system ones, load the user file second: $APPDFLT = "/usr/local/share/myprog"; do "$APPDFLT/"; do "$ENV{HOME}/.myprogrc"; If you want to ignore the system config file when the user has his own, test the return value of the do. do "$APPDFLT/" or do "$ENV{HOME}/.myprogrc"; You might wonder what package those files are compiled in. They will be in the same package that do itself was compiled into. Typically you'll direct users to set particular variables, which, being unqualified globals, will end up in the current package. If you'd prefer unqualified variables go into a particular package, do this: { package Settings; do "$ENV{HOME}/.myprogrc" } As with a file read using require or use, those read using do count as a separate and unrelated lexical scope. That means the configuration file can't access its caller's lexical (my) variables, nor can the caller find any such variables that might have been set in the file. It also means that the user's code isn't held accountable to a lexically scoped pragma like use strict or use warnings, which may be in effect in the caller. If you don't want clean partitioning of variable visibility, you can get the config file's code executed in your own lexical scope. If you have a cat program or its technical equivalent handy, you could write yourself a hand-rolled do: eval `cat $ENV{HOME}/.myprogrc`; We've never actually seen anyone (except Larry Wall himself) use that approach in production code. For one thing, do is a lot easier to type. Also, it respects the @INC path, which is normally searched if a full path is not specified, but, unlike using a require, no implicit error checking happens under do. 
This means you don't have to wrap it in an eval to catch exceptions that would otherwise cause your program to die, because do already functions as an eval. You can still check for errors on your own if you'd like: $file = ""; unless ($return = do $file) { warn "couldn't parse $file: $@" if $@; warn "couldn't do $file: $!" unless defined $return; warn "couldn't run $file" unless $return; } This is much simpler for the programmer to source in code than it would be to invent and then parse a complicated, new syntax. It's also much easier on the users than forcing them to learn the syntax rules of yet another configuration file. Even better, you give the user access to a powerful algorithmic programming language. One reasonable concern is security. How do you know that the file hasn't been tampered with by someone other than the user? The traditional approach here is to do nothing, trusting the directory and file permissions. Nine times out of ten, this is also the right approach. Most projects just aren't worth being that paranoid over. For those that are, see the next recipe. 8.16.4 See Also The eval and require functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.17 Recipe 8.17 Testing a File for Trustworthiness 8.17.1 Problem You want to read from a file, perhaps because it has configuration information. You want to use the file only if it can't be written to (or perhaps not even be read from) by anyone other than its owner. 8.17.2 Solution Use the stat function to retrieve ownership and file permissions information.
You can use the built-in version, which returns a list:

    ($dev, $ino, $mode, $nlink,
     $uid, $gid, $rdev, $size,
     $atime, $mtime, $ctime,
     $blksize, $blocks)     = stat($filename)
            or die "no $filename: $!";
    $mode &= 07777;         # discard file type info

Or you can use the by-name interface:

    use File::stat;
    $info = stat($filename) or die "no $filename: $!";
    if ($info->uid == 0) {
        print "Superuser owns $filename\n";
    }
    if ($info->atime > $info->mtime) {
        print "$filename has been read since it was written.\n";
    }

8.17.3 Discussion

Usually you trust users to set file permissions as they wish. If they want others to read their files, or even to write to them, that's their business. Applications such as editors, mailers, and shells are often more discerning, though, refusing to evaluate code in configuration files if anyone but the owner can write to them. This helps avoid Trojan horse attacks. Security-minded programs such as ftp and ssh may even reject config files that can be read by anyone but their owner.

If the file is writable by someone other than the owner, or is owned by someone other than the current user or the superuser, it shouldn't be trusted. To figure out file ownership and permissions, use the stat function. The following function returns true if the file is deemed safe and false otherwise. If the stat fails, undef is returned.

    use File::stat;

    sub is_safe {
        my $path = shift;
        my $info = stat($path);
        return unless $info;

        # owner neither superuser nor me
        # the real uid is stored in the $< variable
        if (($info->uid != 0) && ($info->uid != $<)) {
            return 0;
        }

        # check whether group or other can write file.
        # use 066 to detect either reading or writing
        if ($info->mode & 022) {            # someone else can write this
            return 0 unless -d _;           # non-directories aren't safe
            # but directories with the sticky bit (01000) are
            return 0 unless $info->mode & 01000;
        }
        return 1;
    }

A directory is considered safe even if others can write to it, provided its mode 01000 (owner delete only) bit is set. Careful programmers also ensure that no enclosing directory is writable. This is because of systems with the "chown giveaway" problem, in which any user can give away a file they own, making it owned by someone else. The following function handles that by using is_safe to check every enclosing directory up to the root if it detects that you have the chown problem, which it determines by querying POSIX::sysconf. If you don't have an unrestricted version of chown, the is_verysafe subroutine just calls is_safe. If you do have the problem, it walks up the filesystem tree until it reaches the root.

    use Cwd;
    use POSIX qw(sysconf _PC_CHOWN_RESTRICTED);

    sub is_verysafe {
        my $path = shift;
        return is_safe($path) if sysconf(_PC_CHOWN_RESTRICTED);
        $path = getcwd() . "/" . $path if $path !~ m{^/};
        do {
            return unless is_safe($path);
            $path =~ s#([^/]+|/)$##;               # dirname
            $path =~ s#/$## if length($path) > 1;  # last slash
        } while length $path;

        return 1;
    }

To use this in a program, try something like this:

    $file = "$ENV{HOME}/.myprogrc";
    readconfig($file) if is_safe($file);

This has potential for a race condition, because it's presumed that the hypothetical readconfig function will open the file. Between the time when is_safe checks the file's stats and when readconfig opens it, something wicked could theoretically occur. To avoid this, pass is_safe the already open filehandle, which it is set up to handle:

    $file = "$ENV{HOME}/.myprogrc";
    if (open(FILE, "<", $file)) {
        readconfig(*FILE) if is_safe(*FILE);
    }

You would still have to arrange for readconfig to accept a filehandle instead of a filename, though.
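The permission test at the heart of is_safe can be tried out in isolation. The following is a small sketch of our own, not part of the recipe: it creates a throwaway file with File::Temp (our scaffolding), sets its permissions explicitly with chmod, and applies the same mode & 022 check.

```perl
# Demonstrate the "mode & 022" test from is_safe on a file we control.
# File::Temp and the chosen modes are our own scaffolding for illustration.
use strict;
use warnings;
use File::stat;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);

chmod 0644, $path or die "chmod: $!";   # rw-r--r--: only owner can write
my $info = stat($path) or die "stat $path: $!";
printf "mode %04o: %s\n", $info->mode & 07777,
    ($info->mode & 022) ? "group or other can write" : "only owner can write";

chmod 0666, $path or die "chmod: $!";   # rw-rw-rw-: now others can write too
$info = stat($path) or die "stat $path: $!";
printf "mode %04o: %s\n", $info->mode & 07777,
    ($info->mode & 022) ? "group or other can write" : "only owner can write";
```

On a mode 0644 file the test passes (no group/other write bits set); flipping to 0666 makes the same test report the file as untrustworthy.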
8.17.4 See Also

The stat function in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard POSIX and File::stat modules; Recipe 8.16

Recipe 8.18 Treating a File as an Array

8.18.1 Problem

Your file contains a list of lines or records, and you'd like to be able to use Perl's powerful array operations to access and manipulate the file.

8.18.2 Solution

Use the Tie::File module, standard with v5.8 of Perl:

    use Tie::File;
    use Fcntl;

    tie @data, "Tie::File", $filename
        or die "Can't tie to $filename : $!\n";
    # use array operations on @data to work with the file

8.18.3 Discussion

The Tie::File module makes a file appear to be an array, one record per element. You can then fetch and assign to elements of the array, use array functions like push and splice, use negative indices, or reverse it, and in every instance you're really working with the data on disk. If you don't specify how Tie::File should open the file, it is opened for read and write access and created if it doesn't exist. To specify a particular access mode (see Recipe 7.1), pass the Fcntl mode with the mode parameter when you tie. For example:

    use Fcntl;
    tie(@data, "Tie::File", $filename, mode => O_RDONLY)
        or die "Can't open $filename for reading: $!\n";

When you alter the array, the file is rewritten on disk. For example, if you change the length of an element, all records later in the file must be copied to make the change. Take this code:

    foreach (@data) {
        s/Perl Cookbook/Perl Cookbook (2nd edition)/g;
    }

That's slow, because you change the length of record 0, forcing a copy of records 1..N; then you change the length of record 1, forcing a copy of records 2..N; and so on. It's better to defer the update until all changes have been made and then have Tie::File update the file in one single write.
To do this, call a method on the object behind the tied array:

    (tied @data)->defer;        # defer updates
    foreach (@data) {
        s/Perl Cookbook/Perl Cookbook (2nd edition)/g;
    }
    (tied @data)->flush;

Exactly how much rewriting to defer is governed by how much memory you let Tie::File use, because the only way to keep track of changes without updating the file is to store those changes in memory. The Tie::File manpage shows how to change options for memory use.

8.18.4 See Also

Recipe 8.4; Recipe 8.8; Recipe 8.10

Recipe 8.19 Setting the Default I/O Layers

8.19.1 Problem

You want to ensure all files opened by your program use a particular set of I/O layers. For example, you know that every file will contain UTF-8 data.

8.19.2 Solution

Use the open pragma:

    use open IO => ":raw:utf8";

8.19.3 Discussion

You can easily specify I/O layers when you open a filehandle directly, but that doesn't help you when the filehandle is opened by someone else's code (possibly even the Perl core). The open pragma lets you specify a default set of layers for every open that doesn't specify its own layers. The open module also offers separate IN and OUT control for input and output handles. For example, to read bytes and emit UTF-8:

    use open "IN" => ":bytes", "OUT" => ":utf8";

The :std option tells open to apply the input and output layers to STDIN and STDOUT/STDERR. For example, the following code makes input handles read Greek (ISO 8859-7) and output handles write in the UTF-8 Unicode encoding.
Then it applies the same layers to STDIN, STDOUT, and STDERR:

    use open "IN"  => ":encoding(Greek)",  # reading Greek
             "OUT" => ":utf8",             # writing 8-bit data in Unicode UTF-8
             ":std";                       # apply to STDIN, STDOUT, STDERR

8.19.4 See Also

The documentation for the standard open pragma; Recipe 8.12 and Recipe 8.19

Recipe 8.20 Reading or Writing Unicode from a Filehandle

8.20.1 Problem

You have a file containing text in a particular encoding, but when you read data from it into a Perl string, Perl treats it as a series of 8-bit bytes. You'd like to work with characters instead of bytes, because in your encoding a character can take more than one byte. Also, if Perl doesn't know about your encoding, it may fail to identify certain characters as letters. Similarly, you may want to output text in a particular encoding.

8.20.2 Solution

Use I/O layers to tell Perl that data from that filehandle is in a particular encoding:

    open(my $ifh, "<:encoding(ENCODING_NAME)", $filename);
    open(my $ofh, ">:encoding(ENCODING_NAME)", $filename);

8.20.3 Discussion

Perl's text manipulation functions handle UTF-8 strings just as well as they do 8-bit data—they just need to know what type of data they're working with. Each string in Perl is internally marked as either UTF-8 or 8-bit data. The encoding(...) layer converts data between various external encodings and the internal UTF-8 used within Perl. This is done by way of the Encode module.

In the section on Unicode support in the Introduction to Chapter 1, we explained how under Unicode, every different character has a different code point (i.e., a different number) associated with it. Assigning all characters unique code points solves many problems. No longer does the same number, like 0xC4, represent one character under one character repertoire (e.g., a LATIN CAPITAL LETTER A WITH DIAERESIS under ISO 8859-1) and a different character in another repertoire (e.g., a GREEK CAPITAL LETTER DELTA under ISO 8859-7).
This neatly solves many problems, but still leaves one important issue: the precise format used in memory or on disk for each code point. If most code points fit in 8 bits, it would seem wasteful to use, say, a full 32 bits for each character. But if every character is the same size as every other character, the code is easier to write and may be faster to execute. This has given rise to different encoding systems for storing Unicode, each offering distinct advantages. Fixed-width encodings fit every code point into the same number of bits, which simplifies programming but at the expense of some wasted space. Variable-width encodings use only as much space as each code point requires, which saves space but complicates programming.

One further complication is combined characters, which may look like single letters on paper but in code require multiple code points. When you see a capital A with two dots above it (a diaeresis) on your screen, it may not even be character U+00C4. As explained in Recipe 1.8, Unicode supports the idea of combining characters, where you start with a base character and add non-spacing marks to it. U+0308 is a COMBINING DIAERESIS, so you could use a capital A (U+0041) followed by U+0308, or A\x{308}, to produce the same output.

The following table shows the old ISO 8859-1 way of writing a capital A with a diaeresis, in which the logical character code and the physical byte layout enjoyed an identical representation, and the new way under Unicode. A plain capital A is included for comparison, and the accented character is written both ways: precomposed in one code point, and as a combined character built from two code points.

                          Old way     New way
                          0xC4        U+0041    U+00C4     U+0041 U+0308
    Character(s)          Ä           A         Ä          Ä
    Character repertoire  ISO 8859-1  Unicode   Unicode    Unicode
    Character code(s)     0xC4        0x0041    0x00C4     0x0041 0x0308
    Encoding              —           UTF-8     UTF-8      UTF-8
    Byte(s)               0xC4        0x41      0xC3 0x84  0x41 0xCC 0x88

The internal format used by Perl is UTF-8, a variable-width encoding system.
One reason for this choice is that legacy ASCII requires no conversion for UTF-8, looking in memory exactly as it did before—just one byte per character. Character U+0041 is just 0x41 in memory. Legacy data sets don't increase in size, and even those using Western character sets like ISO 8859-n grow only slightly, since in practice you still have a favorable ratio of regular ASCII characters to 8-bit accented characters.

Just because Perl uses UTF-8 internally doesn't preclude using other formats externally. Perl automatically converts all data between UTF-8 and whatever encoding you've specified for that handle. The Encode module is used implicitly when you specify an I/O layer of the form ":encoding(....)". For example:

    binmode(FH, ":encoding(UTF-16BE)")
        or die "can't binmode to utf-16be: $!";

or directly in the open:

    open(FH, "< :encoding(UTF-32)", $pathname)
        or die "can't open $pathname: $!";

Here's a comparison of the actual byte layouts of those two sequences, both representing a capital A with diaeresis, under several other popular formats:

              U+00C4                   U+0041 U+0308
    UTF-8     c3 84                    41 cc 88
    UTF-16BE  00 c4                    00 41 03 08
    UTF-16LE  c4 00                    41 00 08 03
    UTF-16    fe ff 00 c4              fe ff 00 41 03 08
    UTF-32LE  c4 00 00 00              41 00 00 00 08 03 00 00
    UTF-32BE  00 00 00 c4              00 00 00 41 00 00 03 08
    UTF-32    00 00 fe ff 00 00 00 c4  00 00 fe ff 00 00 00 41 00 00 03 08

This can chew up memory quickly. It's also complicated by the fact that some computers are big-endian, others little-endian. So fixed-width encoding formats that don't specify their endianness require a special byte-ordering sequence ("FF FE" versus "FE FF"), usually needed only at the start of the stream.

If you're reading or writing UTF-8 data, use the :utf8 layer. Because Perl natively uses UTF-8, the :utf8 layer bypasses the Encode module for performance. The Encode module understands many aliases for encodings, so ascii, US-ascii, and ISO-646-US are synonymous.
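The byte layouts in the table above are easy to confirm from Perl itself. This short sketch (our own, not from the recipe) runs the precomposed and combined forms through Encode's encode function and shows the resulting bytes in hex:

```perl
# Print the bytes several encodings produce for a precomposed A-diaeresis
# and for its combined (A + COMBINING DIAERESIS) form. Encode is standard
# with v5.8; the loop over encoding names is our own illustration.
use strict;
use warnings;
use Encode qw(encode);

my $precomposed = "\x{00C4}";          # LATIN CAPITAL LETTER A WITH DIAERESIS
my $combined    = "\x{0041}\x{0308}";  # A followed by COMBINING DIAERESIS

for my $enc ("UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE") {
    printf "%-9s %-10s %s\n", $enc,
        unpack("H*", encode($enc, $precomposed)),
        unpack("H*", encode($enc, $combined));
}
```

The output matches the table rows: c384 and 41cc88 for UTF-8, 00c4 and 00410308 for UTF-16BE, and so on.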
Read the Encode::Supported manpage for a list of available encodings. Perl supports not only standard Unicode names but vendor-specific names, too; for example, iso-8859-1 is cp850 on DOS, cp1252 on Windows, MacRoman on a Mac, and hp-roman8 on NeXTstep. The Encode module recognizes all of these as names for the same encoding.

8.20.4 See Also

The documentation for the standard Encode module; the Encode::Supported manpage; Recipe 8.12 and Recipe 8.19

Recipe 8.21 Converting Microsoft Text Files into Unicode

8.21.1 Problem

You have a text file written on a Microsoft computer that looks like garbage when displayed. How do you fix this?
8.21.2 Solution

Set the encoding layer appropriately when reading, to convert the data into Unicode:

    binmode(IFH, ":encoding(cp1252)")
        || die "can't binmode to cp1252 encoding: $!";

8.21.3 Discussion

Suppose someone sends you a file in cp1252 format, Microsoft's default in-house 8-bit character set. Files in this format can be annoying to read—while they might claim to be Latin1, they are not, and if you look at them with Latin1 fonts loaded, you'll get garbage on your screen. A simple solution is as follows:

    open(MSMESS, "< :crlf :encoding(cp1252)", $inputfile)
        || die "can't open $inputfile: $!";

Now data read from that handle is automatically converted into Unicode as you read it. It is also processed in CRLF mode, which is needed on systems that don't use that sequence to indicate end of line.

You probably won't be able to write out this text as Latin1, because cp1252 includes characters that don't exist in Latin1. You'll have to leave it in Unicode, and displaying Unicode properly may not be as easy as you wish, because finding tools to work with Unicode is something of a quest in its own right. Most web browsers support ISO 10646 fonts; that is, Unicode fonts. Whether your text editor does is a different matter, although both emacs and vi (actually, vim, not nvi) have mechanisms for handling Unicode. The authors used the following xterm(1) command to look at text:

    xterm -n unicode -u8 -fn -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

But many open questions still exist, such as cutting and pasting of Unicode data between windows. The site has help for finding and installing suitable tools for a variety of platforms, including both Unix and Microsoft systems.

You'll also need to tell Perl it's alright to emit Unicode. If you don't, you'll get a "Wide character in print" warning every time you try.
Assuming you're running in an xterm like the one shown previously (or its equivalent for your system) that has Unicode fonts available, you could just do this:

    binmode(STDOUT, ":utf8");

But that requires the rest of your program to emit Unicode, which might not be convenient. When writing new programs specifically designed for this, though, it might not be too much trouble.

As of v5.8.1, Perl offers a couple of other means of getting this effect. The -C command-line switch controls some Unicode features related to your runtime environment. This way you can set those features on a per-command basis without having to edit the source code. The -C switch can be followed by either a number or a list of option letters. Some available letters, their numeric values, and effects are as follows:

    Letter  Number  Meaning
    I       1       STDIN is assumed to be in UTF-8
    O       2       STDOUT will be in UTF-8
    E       4       STDERR will be in UTF-8
    S       7       I + O + E
    i       8       UTF-8 is the default PerlIO layer for input streams
    o       16      UTF-8 is the default PerlIO layer for output streams
    D       24      i + o
    A       32      the @ARGV elements are expected to be strings encoded in UTF-8

You may use letters or numbers. If you use numbers, you have to add them up. For example, -COE and -C6 are synonyms of UTF-8 on both STDOUT and STDERR.

One last approach is to use the PERL_UNICODE environment variable. If set, it contains the same value as you would use with -C. For example, with the xterm that has Unicode fonts loaded, you could do this in a POSIX shell:

    sh% export PERL_UNICODE=6

or this in the csh:

    csh% setenv PERL_UNICODE 6

The advantage of using the environment variable is that you don't have to edit the source code as the pragma would require, and you don't even need to change the command invocation as setting -C would require.
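Returning to the recipe's central trick, the cp1252 conversion is easy to watch in miniature. The following sketch is our own scaffolding around the recipe's layer: it writes two raw cp1252 bytes (Microsoft's "smart quotes", 0x93 and 0x94) to a temporary file, then reads them back through :encoding(cp1252):

```perl
# Round-trip a couple of cp1252-only bytes through the :encoding(cp1252)
# layer. The temporary file and the sample bytes are our own illustration;
# the layer usage is the recipe's.
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ":raw");
print $out "\x93hello\x94";     # cp1252 left and right curly double quotes
close($out) or die "close: $!";

open(my $in, "< :encoding(cp1252)", $path)
    or die "can't open $path: $!";
my $text = <$in>;
printf "first char U+%04X, last char U+%04X\n",
    ord(substr($text, 0, 1)), ord(substr($text, -1));
```

The bytes 0x93 and 0x94 come in as the Unicode characters U+201C and U+201D (LEFT and RIGHT DOUBLE QUOTATION MARK), neither of which exists in Latin1; that is exactly why such a file can't simply be declared Latin1.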
8.21.4 See Also

The perlrun(1), encoding(3), PerlIO(3), and Encode(3) manpages

Recipe 8.22 Comparing the Contents of Two Files

8.22.1 Problem

You have two files and want to see whether they're the same or different.

8.22.2 Solution

Use the standard File::Compare module with filenames, typeglobs, or any indirect filehandles:

    use File::Compare;

    if (compare($FILENAME_1, $FILENAME_2) == 0) {
        # they're equal
    }
    if (compare(*FH1, *FH2) == 0) {
        # they're equal
    }
    if (compare($fh1, $fh2) == 0) {
        # they're equal
    }

8.22.3 Discussion

The File::Compare module (standard as of v5.8 and available on CPAN if you have an earlier version of Perl) compares two files for equality. The compare function, exported by default, returns 0 when the files are equal, 1 when they differ, and -1 when any error occurs during reading. To compare more than two filehandles, simply loop, comparing two at a time:

    # ensure all filehandles in @fh hold the same data
    foreach $fh (@fh[1..$#fh]) {
        if (compare($fh[0], $fh)) {
            # $fh differs
        }
    }

If you want details of exactly how two files differ, use the Text::Diff module from CPAN:

    use Text::Diff;

    $diff = diff(*FH1, *FH2);
    $diff = diff($FILENAME_1, $FILENAME_2, { STYLE => "Context" });

In addition to filehandles, diff can also take filenames, strings, and even arrays of records. Pass a hash of options as the third argument. The STYLE option controls the type of output returned; it can be "Unified" (the default), "Context", or "OldStyle". You can even write your own class for custom diff formats. The value returned by diff is a string similar to the output of the diff(1) program. This string is in valid diff format, suitable for feeding into patch(1). Although Text::Diff will not always produce the same output as GNU diff, byte for byte, its diffs are nevertheless correct.

8.22.4 See Also

The documentation for the standard File::Compare module; the documentation for the CPAN module Text::Diff; the diff(1) and patch(1) manpages.
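The return values described above are quick to verify with two throwaway files. This sketch is our own (File::Temp and the sample contents are scaffolding, not from the recipe):

```perl
# compare() returns 0 for identical files and 1 once they differ.
# The temporary files are our own illustration.
use strict;
use warnings;
use File::Compare;
use File::Temp qw(tempfile);

my ($fh1, $f1) = tempfile(UNLINK => 1);
my ($fh2, $f2) = tempfile(UNLINK => 1);
print $fh1 "line one\nline two\n";
print $fh2 "line one\nline two\n";
close $fh1; close $fh2;

print "same: ",   compare($f1, $f2), "\n";   # 0 when equal

open $fh2, ">>", $f2 or die "append $f2: $!";
print $fh2 "line three\n";
close $fh2;

print "differ: ", compare($f1, $f2), "\n";   # 1 when they differ
```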
Recipe 8.23 Pretending a String Is a File

8.23.1 Problem

You have data in a string, but would like to treat it as a file. For example, you have a subroutine that expects a filehandle as an argument, but you would like that subroutine to work directly on the data in your string instead. Additionally, you don't want to write the data to a temporary file.

8.23.2 Solution

Use the scalar I/O in Perl v5.8:

    open($fh, "+<", \$string);      # read and write contents of $string

8.23.3 Discussion

Perl's I/O layers include support for input and output from a scalar. When you read a record with <$fh>, you are reading the next line from $string. When you write a record with print, you change $string. You can pass $fh to a function that expects a filehandle, and that subroutine need never know that it's really working with data in a string.

Perl respects the various access modes in open for strings, so you can specify that the strings be opened as read-only, with truncation, in append mode, and so on:

    open($fh, "<",  \$string);      # read only
    open($fh, ">",  \$string);      # write only, discard original contents
    open($fh, "+>", \$string);      # read and write, discard original contents
    open($fh, "+<", \$string);      # read and write, preserve original contents

These handles behave in all respects like regular filehandles, so all I/O functions work, such as seek, truncate, sysread, and friends.

8.23.4 See Also

The open function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.12 and Recipe 8.19

Recipe 8.24 Program: tailwtmp

Every time a user logs into or out of a Unix system, a record is added to the wtmp file. You can't use the normal tail program on it, because the file is in binary format. The tailwtmp program in Example 8-7 knows the format of the binary file and shows every new record as it appears. You'll have to adjust the pack format for your own system. Example 8-7.
tailwtmp

    #!/usr/bin/perl -w
    # tailwtmp - watch for logins and logouts;
    # uses linux utmp structure, from utmp(5)
    $typedef = "s x2 i A12 A4 l A8 A16 l";
    $sizeof = length pack($typedef, ( ) );
    use IO::File;
    open(WTMP, "< :raw", "/var/log/wtmp")
        or die "can't open /var/log/wtmp: $!";
    seek(WTMP, 0, SEEK_END);
    for (;;) {
        while (read(WTMP, $buffer, $sizeof) == $sizeof) {
            ($type, $pid, $line, $id, $time, $user, $host, $addr)
                = unpack($typedef, $buffer);
            next unless $user && ord($user) && $time;
            printf "%1d %-8s %-12s %2s %-24s %-16s %5d %08x\n",
                $type, $user, $line, $id, scalar(localtime($time)),
                $host, $pid, $addr;
        }
        for ($size = -s WTMP; $size == -s WTMP; sleep 1) { }
        WTMP->clearerr( );
    }

Recipe 8.25 Program: tctee

Not all systems support the classic tee program for splitting output pipes to multiple destinations. This command sends the output from someprog to /tmp/output and to the mail pipe beyond:

    % someprog | tee /tmp/output | Mail -s "check this"

This program helps not only users who aren't on Unix systems and don't have a regular tee; it also helps those who are, because it offers features not found in other versions of tee. The four flag arguments are -i to ignore interrupts, -a to append to output files, -u for unbuffered output, and -n to omit copying the output on to standard out. Because this program uses Perl's magic open, you can specify pipes as well as files:

    % someprog | tctee f1 "|cat -n" f2 ">>f3"

That sends the output from someprog to the files f1 and f2, appends it to f3, sends a copy to the program cat -n, and also produces the stream on standard output. The program in Example 8-8 is one of many venerable Perl programs written nearly a decade ago that still runs perfectly well. If written from scratch now, we'd probably use strict, warnings, and ten to thirty thousand lines of modules. But if it ain't broke . . . Example 8-8.

tctee

    #!/usr/bin/perl
    # tctee - clone that groks process tees
    # perl3 compatible, or better.
    while ($ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) {
        next if /^$/;
        s/i// && (++$ignore_ints, redo);
        s/a// && (++$append,      redo);
        s/u// && (++$unbuffer,    redo);
        s/n// && (++$nostdout,    redo);
        die "usage $0 [-aiun] [filenames] ...\n";
    }

    if ($ignore_ints) {
        for $sig ("INT", "TERM", "HUP", "QUIT") {
            $SIG{$sig} = "IGNORE";
        }
    }

    $SIG{"PIPE"} = "PLUMBER";
    $mode = $append ? ">>" : ">";
    $fh = "FH000";

    unless ($nostdout) {
        %fh = ("STDOUT", "standard output");    # always go to stdout
    }

    $| = 1 if $unbuffer;

    for (@ARGV) {
        if (!open($fh, (/^[^>|]/ && $mode) . $_)) {
            warn "$0: cannot open $_: $!\n";    # like sun's; i prefer die
            $status++;
            next;
        }
        select((select($fh), $| = 1)[0]) if $unbuffer;
        $fh{$fh++} = $_;
    }

    while (<STDIN>) {
        for $fh (keys %fh) {
            print $fh $_;
        }
    }

    for $fh (keys %fh) {
        next if close($fh) || !defined $fh{$fh};
        warn "$0: couldnt close $fh{$fh}: $!\n";
        $status++;
    }

    exit $status;

    sub PLUMBER {
        warn "$0: pipe to \"$fh{$fh}\" broke!\n";
        $status++;
        delete $fh{$fh};
    }

Recipe 8.26 Program: laston

When you log in to a Unix system, it tells you when you last logged in. That information is stored in a binary file called lastlog. Each user has their own record; UID 8 is at record 8, UID 239 at record 239, and so on. To find out when a given user last logged in, convert their login name to a number, seek to their record in that file, read, and unpack. Doing so with shell tools is hard, but with the laston program, it's easy. Here's an example:

    % laston gnat
    gnat  UID 314 at Mon May 25 08:32:52 2003 on ttyp0 from

The program in Example 8-9 is much newer than the tctee program in Example 8-8, but it's less portable. It uses the Linux binary layout of the lastlog file. You'll have to change this for other systems. Example 8-9.
laston

    #!/usr/bin/perl -w
    # laston - find out when given user last logged on
    use User::pwent;
    use IO::Seekable qw(SEEK_SET);
    open (LASTLOG, "< :raw", "/var/log/lastlog")
        or die "can't open /var/log/lastlog: $!";
    $typedef = "L A12 A16";   # linux fmt; sunos is "L A8 A16"
    $sizeof  = length(pack($typedef, ( )));
    for $user (@ARGV) {
        $U = ($user =~ /^\d+$/) ? getpwuid($user) : getpwnam($user);
        unless ($U) { warn "no such uid $user\n"; next; }
        seek(LASTLOG, $U->uid * $sizeof, SEEK_SET) or die "seek failed: $!";
        read(LASTLOG, $buffer, $sizeof) == $sizeof or next;
        ($time, $line, $host) = unpack($typedef, $buffer);
        printf "%-8s UID %5d %s%s%s\n", $U->name, $U->uid,
            $time ? ("at " . localtime($time)) : "never logged in",
            $line && " on $line",
            $host && " from $host";
    }

Recipe 8.27 Program: Flat File Indexes

It sometimes happens that you need to jump directly to a particular line number in a file, but the lines vary in length, so you can't use Recipe 8.12. Although you could start at the beginning of the file and read every line, this is inefficient if you're making multiple queries. The solution is to build an index of fixed-width records, one per line. Each record contains the offset in the data file of the corresponding line.

The subroutine in Example 8-10 takes the data file and a filehandle to send the index to. It reads a record at a time and prints the current offset in the file to the index, packed into a big-endian unsigned 32-bit integer; see the documentation for the pack function in perlfunc(1) for alternative storage types. Example 8-10.

build_index

    # usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
    sub build_index {
        my $data_file  = shift;
        my $index_file = shift;
        my $offset     = 0;

        while (<$data_file>) {
            print $index_file pack("N", $offset);
            $offset = tell($data_file);
        }
    }

Once you have an index, it becomes easy to read a particular line from the data file.
Jump to that record in the index, read the offset, and jump to that position in the data file. The next line you read will be the one you want. Example 8-11 returns the line, given the line number and the index and data file handles. Example 8-11.

line_with_index

    # usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
    # returns line or undef if LINE_NUMBER was out of range
    sub line_with_index {
        my $data_file   = shift;
        my $index_file  = shift;
        my $line_number = shift;

        my $size;      # size of an index entry
        my $i_offset;  # offset into the index of the entry
        my $entry;     # index entry
        my $d_offset;  # offset into the data file

        $size     = length(pack("N", 0));
        $i_offset = $size * ($line_number - 1);
        seek($index_file, $i_offset, 0) or return;
        read($index_file, $entry, $size);
        $d_offset = unpack("N", $entry);
        seek($data_file, $d_offset, 0);
        return scalar(<$data_file>);
    }

To use these subroutines, just say:

    open($fh, "<", $file)
        or die "Can't open $file for reading: $!\n";
    open($index, "+>", "$file.idx")
        or die "Can't open $file.idx for read/write: $!\n";
    build_index($fh, $index);
    $line = line_with_index($fh, $index, $seeking);

The next step is to cache the index file between runs of the program, so you're not building it each time. This is shown in Example 8-12. Then add locking for concurrent access, and check time stamps on the files to see whether a change to the data file has made an old index file out of date. Example 8-12.

cache_line_index

    #!/usr/bin/perl -w
    # cache_line_index - index style
    # build_index and line_with_index from above
    use Fcntl;
    @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER";

    ($filename, $line_number) = @ARGV;
    open(my $orig, "<", $filename)
        or die "Can't open $filename for reading: $!";

    # open the index and build it if necessary
    # there's a race condition here: two copies of this
    # program can notice there's no index for the file and
    # try to build one.
    # This would be easily solved with locking.
    $indexname = "$filename.index";
    sysopen(my $idx, $indexname, O_CREAT|O_RDWR)
        or die "Can't open $indexname for read/write: $!";
    build_index($orig, $idx) if -z $indexname;  # XXX: race unless locked

    $line = line_with_index($orig, $idx, $line_number);
    die "Didn't find line $line_number in $filename" unless defined $line;
    print $line;

Chapter 9. Directories

Unix has its weak points, but its file system is not one of them. —Chris Torek

Introduction

To fully understand directories, you need to be acquainted with the underlying mechanics. The following explanation is slanted toward the Unix filesystem, for whose syscalls and behavior Perl's directory access routines were designed, but it is applicable to some degree to most other platforms.

A filesystem consists of two parts: a set of data blocks where the contents of files and directories are kept, and an index to those blocks. Each entity in the filesystem has an entry in the index, be it a plain file, a directory, a link, or a special file like those in /dev. Each entry in the index is called an inode (short for index node). Since the index is a flat index, inodes are addressed by number. A directory is a specially formatted file, whose inode entry marks it as a directory. A directory's data blocks contain a set of pairs. Each pair consists of the name of something in that directory and the inode number of that thing. The data blocks for /usr/bin might contain:

    Name    Inode
    bc      17
    du      29
    nvi     8
    pine    55
    vi      8

Every directory is like this, even the root directory (/). To read the file /usr/bin/vi, the operating system reads the inode for /, reads its data blocks to find the entry for /usr, reads /usr's inode, reads its data block to find /usr/bin, reads /usr/bin's inode, reads its data block to find /usr/bin/vi, reads /usr/bin/vi's inode, and then reads the data from its data block.

The name in a directory entry isn't fully qualified.
The file /usr/bin/vi has an entry with the name vi in the /usr/bin directory. If you open the directory /usr/bin and read entries one by one, you get filenames like patch, rlogin, and vi, not fully qualified names like /usr/bin/patch, /usr/bin/rlogin, and /usr/bin/vi.

The inode has more than a pointer to the data blocks. Each inode also contains the type of thing it represents (directory, plain file, etc.), the size of the thing, a set of permission bits, owner and group information, the time the thing was last modified, the number of directory entries that point to this inode, and so on.

Some operations on files change the contents of the file's data blocks; others change just the inode. For instance, appending to or truncating a file updates its inode by changing the size field. Other operations change the directory entry that points to the file's inode. Changing a file's name changes only the directory entry; it updates neither the file's data nor its inode.

Three fields in the inode structure contain the last access, change, and modification times: atime, ctime, and mtime. The atime field is updated each time the pointer to the file's data blocks is followed and the file's data is read. The mtime field is updated each time the file's data changes. The ctime field is updated each time the file's inode changes. The ctime is not creation time; there is no way under standard Unix to find a file's creation time.

Reading a file changes its atime only. Changing a file's name doesn't change atime, ctime, or mtime, because only the directory entry changed (it does change the atime and mtime of the directory the file is in, though). Truncating a file doesn't change its atime (because we haven't read; we've just changed the size field in its inode), but it does change its ctime, because we changed its size field, and its mtime, because we changed its contents (even though we didn't follow the pointer to do so).
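These timestamp rules can be observed directly from Perl. The following is a small sketch of our own (the temporary directory, paths, and the 1000-second offset are all scaffolding): it stamps a file with a known mtime via utime, renames it, and then appends to it, checking the mtime after each step.

```perl
# Renaming a file leaves its mtime alone (only the directory entry and the
# file's ctime change); writing to the file updates its mtime. The paths
# below are our own throwaway scaffolding.
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $old = "$dir/before";
my $new = "$dir/after";

open(my $fh, ">", $old) or die "create $old: $!";
print $fh "data\n";
close $fh;

my $then = time() - 1000;               # a recognizable mtime in the past
utime($then, $then, $old) or die "utime: $!";   # sets atime and mtime

rename($old, $new) or die "rename: $!";
print "after rename, mtime changed? ",
      ((stat $new)[9] == $then ? "no" : "yes"), "\n";

open($fh, ">>", $new) or die "append to $new: $!";
print $fh "more\n";
close $fh;
print "after append, mtime changed? ",
      ((stat $new)[9] == $then ? "no" : "yes"), "\n";
```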
We can access the inode of a file or directory by calling the built-in function stat on its name. For instance, to get the inode for /usr/bin/vi, say:

@entry = stat("/usr/bin/vi") or die "Couldn't stat /usr/bin/vi : $!";

To get the inode for the directory /usr/bin, say:

@entry = stat("/usr/bin") or die "Couldn't stat /usr/bin : $!";

You can stat filehandles, too:

@entry = stat(INFILE) or die "Couldn't stat INFILE : $!";

The stat function returns a list of the values of the fields in the directory entry. If it couldn't get this information (for instance, if the file doesn't exist), it returns an empty list. It's this empty list we test for using the or die construct. Be careful of using || die because that throws the expression into scalar context, in which case stat only reports whether it worked. It doesn't return the list of values. The underscore ( _ ) cache referred to later will still be updated, though. The values returned by stat are listed in Table 9-1.

Table 9-1. Stat return values

Element  Abbreviation  Description
0        dev           Device number of filesystem
1        ino           Inode number (the "pointer" field)
2        mode          File mode (type and permissions)
3        nlink         Number of (hard) links to the file
4        uid           Numeric user ID of file's owner
5        gid           Numeric group ID of file's owner
6        rdev          The device identifier (special files only)
7        size          Total size of file, in bytes
8        atime         Last access time, in seconds, since the Epoch
9        mtime         Last modify time, in seconds, since the Epoch
10       ctime         Inode change time, in seconds, since the Epoch
11       blksize       Preferred block size for filesystem I/O
12       blocks        Actual number of blocks allocated

The standard File::stat module provides a named interface to these values.
It overrides the stat function, so instead of returning the preceding array, it returns an object with a method for each attribute:

use File::stat;
$inode = stat("/usr/bin/vi");
$ctime = $inode->ctime;
$size  = $inode->size;

In addition, Perl provides operators that call stat and return one value only (see Table 9-2). These are collectively referred to as the -X operators because they all take the form of a dash followed by a single character. They're modeled on the shell's test operators.

Table 9-2.
File test operators

-X   Stat field   Meaning
-r   mode         File is readable by effective UID/GID
-w   mode         File is writable by effective UID/GID
-x   mode         File is executable by effective UID/GID
-o   mode         File is owned by effective UID
-R   mode         File is readable by real UID/GID
-W   mode         File is writable by real UID/GID
-X   mode         File is executable by real UID/GID
-O   mode         File is owned by real UID
-e                File exists
-z   size         File has zero size
-s   size         File has nonzero size (returns size)
-f   mode,rdev    File is a plain file
-d   mode,rdev    File is a directory
-l   mode         File is a symbolic link
-p   mode         File is a named pipe (FIFO)
-S   mode         File is a socket
-b   rdev         File is a block special file
-c   rdev         File is a character special file
-t   rdev         Filehandle is opened to a tty
-u   mode         File has setuid bit set
-g   mode         File has setgid bit set
-k   mode         File has sticky bit set
-T   N/A          File is a text file
-B   N/A          File is a binary file (opposite of -T)
-M   mtime        Age of file in days when script started
-A   atime        Same for access time
-C   ctime        Same for inode change time (not creation)

The stat and the -X operators cache the values that the stat(2) syscall returned. If you then call stat or a -X operator with the special filehandle _ (a single underscore), it won't call stat again but will instead return information from its cache. This lets you test many properties of a single file without calling stat(2) many times or introducing a race condition:

open(F, "<", $filename) or die "Opening $filename: $!\n";
unless (-s F && -T _) {
    die "$filename doesn't have text in it.\n";
}

The stat call just returns the information in one inode, though. How do we list the directory contents?
For that, Perl provides opendir, readdir, and closedir:

opendir(DIRHANDLE, "/usr/bin") or die "couldn't open /usr/bin : $!";
while ( defined ($filename = readdir(DIRHANDLE)) ) {
    print "Inside /usr/bin is something called $filename\n";
}
closedir(DIRHANDLE);

These directory-reading functions are designed to look like the file open and close functions. Where open takes a filehandle, though, opendir takes a directory handle. They may look the same to you (the same bare word), but they occupy different namespaces. Therefore, you could open(BIN, "/a/file") and opendir(BIN, "/a/dir"), and Perl won't get confused. You might, but Perl won't. Because filehandles and directory handles are different, you can't use the <> operator to read from a directory handle (<> calls readline on the filehandle).
Similar to what happens with open and the other functions that initialize filehandles, you can supply opendir an undefined scalar variable where the directory handle is expected. If the function succeeds, Perl initializes that variable with a reference to a new, anonymous directory handle.

opendir(my $dh, "/usr/bin") or die;
while (defined ($filename = readdir($dh))) {
    # ...
}
closedir($dh);

Just like any other autovivified reference, when this one is no longer used (for example, when it goes out of scope and no other references to it are held), Perl automatically deallocates it. And just as close is implicitly called on filehandles autovivified through open at that point, directory handles autovivified through opendir have closedir called on them, too. Filenames in a directory aren't necessarily stored alphabetically. For an alphabetical list of files, read the entries and sort them yourself.

The separation of directory information from inode information can create some odd situations. Operations that update the directory—such as linking, unlinking, or renaming a file—all require write permission only on the directory, not on the file.
This is because the name of a file is actually something the directory calls that file, not a property inherent to the file itself. Only directories hold names of files; files are ignorant of their own names. Only operations that change information in the file data itself demand write permission on the file. Lastly, operations that alter the file's permissions or other metadata are restricted to the file's owner or the superuser. This can lead to the interesting situation of being able to delete (i.e., unlink from its directory) a file you can't read, or write to a file you can't delete. Although these situations may make the filesystem structure seem odd at first, they're actually the source of much of Unix's power. Links, two filenames that refer to the same file, are now extremely simple. The two directory entries just list the same inode number. The inode structure includes a count of the number of directory entries referring to the file (nlink in the values returned by stat). This lets the operating system store and maintain only one copy of the modification times, size, and other file attributes. When one directory entry is unlinked, data blocks are deleted only if the directory entry was the last one that referred to the file's inode—and no processes still have the file open. You can unlink an open file, but its disk space won't be released until the last close. Links come in two forms. The kind described previously, where two directory entries list the same inode number (like vi and nvi in the earlier table), are called hard links. The operating system cannot tell the first directory entry of a file (the one created when the file was created) from any subsequent hard links to it. The other kind, soft or symbolic links, are very different. A soft link is a special type of file whose data block stores the filename the file is linked to. Soft links have a different mode value, indicating they're not regular files. 
The operating system, when asked to open a soft link, instead opens the filename contained in the data block.

Executive Summary

Filenames are kept in a directory, separate from the size, protections, and other metadata kept in an inode. The stat function returns the inode information (metadata). opendir, readdir, and friends provide access to filenames in a directory through a directory handle. Directory handles look like filehandles, but they are not the same. In particular, you can't use <> on directory handles. Permissions on a directory determine whether you can read and write the list of filenames. Permissions on a file determine whether you can change the file's metadata or contents. Three different times are stored in an inode. None of them is the file's creation time.

Recipe 9.1 Getting and Setting Timestamps

9.1.1 Problem

You need to retrieve or alter when a file was last modified (written or changed) or accessed (read).

9.1.2 Solution

Use stat to get those times and utime to set them. Both functions are built into Perl:

($READTIME, $WRITETIME) = (stat($filename))[8,9];
utime($NEWREADTIME, $NEWWRITETIME, $filename);

9.1.3 Discussion

As explained in the Introduction, three different times are associated with an inode in the traditional Unix filesystem. Of these, any user can set the atime and mtime with utime, assuming the user has write access to the parent directory of the file. There is effectively no way to change the ctime. This example shows how to call utime:

$SECONDS_PER_DAY = 60 * 60 * 24;
($atime, $mtime) = (stat($file))[8,9];
$atime -= 7 * $SECONDS_PER_DAY;
$mtime -= 7 * $SECONDS_PER_DAY;
utime($atime, $mtime, $file)
    or die "couldn't backdate $file by a week w/ utime: $!";

You must call utime with both atime and mtime values.
If you want to change only one, you must call stat first to get the other:

$mtime = (stat $file)[9];
utime(time, $mtime, $file);

This is easier to understand if you use File::stat:

use File::stat;
utime(time, stat($file)->mtime, $file);

Use utime to make it appear as though you never touched a file at all (beyond its ctime being updated). For example, to edit a file, use the program in Example 9-1.

Example 9-1. uvi

#!/usr/bin/perl -w
# uvi - vi a file without changing its access times
$file = shift or die "usage: uvi filename\n";
($atime, $mtime) = (stat($file))[8,9];
system($ENV{EDITOR} || "vi", $file);
utime($atime, $mtime, $file)
    or die "couldn't restore $file to orig times: $!";

9.1.4 See Also

The stat and utime functions in perlfunc(1) and in Chapter 29 of Programming Perl; the standard File::stat module (also in Chapter 32 of Programming Perl); your system's utime(3) manpage

Recipe 9.2 Deleting a File

9.2.1 Problem

You want to delete a file. Perl's delete function isn't what you want.

9.2.2 Solution

Use Perl's unlink function:

unlink($FILENAME) or die "Can't delete $FILENAME: $!\n";
unlink(@FILENAMES) == @FILENAMES
    or die "Couldn't unlink all of @FILENAMES: $!\n";

9.2.3 Discussion

The unlink function takes its name from the Unix syscall. Perl's unlink takes a list of filenames and returns the number of filenames successfully deleted. This return value can then be tested with || or or:

unlink($file) or die "Can't unlink $file: $!";

unlink doesn't report which filenames it couldn't delete, only how many it deleted. Here's one way to test for successful deletion of many files and report the number deleted:

unless (($count = unlink(@filelist)) == @filelist) {
    warn "could only delete $count of " . (@filelist) . " files";
}

A foreach over @filelist would permit individual error messages. Under Unix, deleting a file from a directory requires write access to the directory,[1] not to the file, because it's the directory you're changing.
Under some circumstances, you could remove a file you couldn't write to or write to a file you couldn't remove.

[1] Unless the sticky bit, mode 01000, is turned on for the directory, which further restricts deletions to be by the file's owner only. Shared directories such as /tmp are usually mode 01777 for security reasons.

If you delete a file that some process still has open, the operating system removes the directory entry but doesn't free up data blocks until all processes have closed the file. This is how the tmpfile function in File::Temp works (see Recipe 7.11).

9.2.4 See Also

The unlink function in perlfunc(1) and in Chapter 29 of Programming Perl; your system's unlink(2) manpage; Recipe 7.11

Recipe 9.3 Copying or Moving a File

9.3.1 Problem

You need to copy a file, but Perl has no built-in copy function.

9.3.2 Solution

Use the copy function from the standard File::Copy module:

use File::Copy;
copy($oldfile, $newfile);

You can do it by hand:

open(IN, "<", $oldfile)  or die "can't open $oldfile: $!";
open(OUT, ">", $newfile) or die "can't open $newfile: $!";
$blksize = (stat IN)[11] || 16384;      # preferred block size?
while (1) {
    $len = sysread(IN, $buf, $blksize);
    if (!defined $len) {
        next if $! =~ /^Interrupted/;   # ^Z and fg on EINTR
        die "System read error: $!\n";
    }
    last unless $len;
    $offset = 0;
    while ($len) {                      # Handle partial writes.
        defined($written = syswrite OUT, $buf, $len, $offset)
            or die "System write error: $!\n";
        $len    -= $written;
        $offset += $written;
    }
}
close(IN);
close(OUT);

or you can call your system's copy program:

system("cp $oldfile $newfile");      # unix
system("copy $oldfile $newfile");    # dos, vms

9.3.3 Discussion

The File::Copy module provides copy and move functions. These are more convenient than resorting to low-level I/O calls and more portable than calling system. This version of move works across file-system boundaries; the standard Perl built-in rename (usually) does not.
use File::Copy;

copy("datafile.dat", "datafile.bak")
    or die "copy failed: $!";

move("datafile.dat", "")
    or die "move failed: $!";

Because these functions return only a simple success status, you can't easily tell which file prevented the copy or move from working. Copying the files manually lets you pinpoint which files didn't copy, but it fills your program with complex sysreads and syswrites.

9.3.4 See Also

Documentation for the standard File::Copy module (also in Chapter 32 of Programming Perl); the rename, read, and syswrite functions in perlfunc(1) and in Chapter 29 of Programming Perl

Recipe 9.4 Recognizing Two Names for the Same File

9.4.1 Problem

You want to determine whether two filenames in a list correspond to the same file on disk (because of hard and soft links, two filenames can refer to a single file). You might do this to make sure that you don't change a file you've already worked with.

9.4.2 Solution

Maintain a hash, keyed by the device and inode number of the files you've seen. The values are the names of the files:

%seen = ( );

sub do_my_thing {
    my $filename = shift;
    my ($dev, $ino) = stat $filename;
    unless ($seen{$dev, $ino}++) {
        # do something with $filename because we haven't
        # seen it before
    }
}

9.4.3 Discussion

A key in %seen is made by combining the device number ($dev) and inode number ($ino) of each file. Files that are the same will have the same device and inode numbers, so they will have the same key. If you want to maintain a list of all files of the same name, instead of counting the number of times seen, save the name of the file in an anonymous array.
foreach $filename (@files) {
    ($dev, $ino) = stat $filename;
    push( @{ $seen{$dev,$ino} }, $filename);
}

foreach $devino (sort keys %seen) {
    ($dev, $ino) = split(/$;/o, $devino);
    if (@{$seen{$devino}} > 1) {
        # @{$seen{$devino}} is a list of filenames for the same file
    }
}

The $; variable contains the separator string using the old multidimensional associative array emulation syntax, $hash{$x,$y,$z}. It's still a one-dimensional hash, but it has composite keys. The key is really join($; => $x, $y, $z). The split separates them again. Although you'd normally just use a real multilevel hash directly, here there's no need, and it's cheaper not to.

9.4.4 See Also

The $; ($SUBSEP) variable in perlvar(1), and in the "Special Variables" section of Chapter 28 of Programming Perl; the stat function in perlfunc(1) and in Chapter 29 of Programming Perl; Chapter 5

Recipe 9.5 Processing All Files in a Directory

9.5.1 Problem

You want to do something to each file in a particular directory.

9.5.2 Solution

Use opendir to open the directory and readdir to retrieve every filename:

opendir(DIR, $dirname) or die "can't opendir $dirname: $!";
while (defined($file = readdir(DIR))) {
    # do something with "$dirname/$file"
}
closedir(DIR);

9.5.3 Discussion

The opendir, readdir, and closedir functions operate on directories as open, <>, and close operate on files. Both use handles, but the directory handles used by opendir and friends are different from the filehandles used by open and friends. In particular, you can't use <> on a directory handle. In scalar context, readdir returns the next filename in the directory until it reaches the end of the directory, when it returns undef. In list context it returns the rest of the filenames in the directory or an empty list if there were no files left. As explained in this chapter's Introduction, the filenames returned by readdir do not include the directory name.
When you work with the filenames returned by readdir, you must either move to the right directory first or prepend the directory to the filename. This shows one way of prepending:

$dir = "/usr/local/bin";
print "Text files in $dir are:\n";
opendir(BIN, $dir) or die "Can't open $dir: $!";
while( $file = readdir BIN ) {
    print "$file\n" if -T "$dir/$file";
}
closedir(BIN);

The readdir function will return the special directories "." (the directory itself) and ".." (the parent of the directory). Most people skip those files with code like:

while ( defined ($file = readdir BIN) ) {
    next if $file =~ /^\.\.?$/;     # skip . and ..
    # ...
}

Like filehandles, bareword directory handles are per-package constructs. You can use the local *DIRHANDLE syntax to get a new bareword directory handle. Alternatively, pass an undefined scalar as the first argument to opendir and Perl will put a new indirect directory handle into that scalar:

opendir my $dh, $directory or die;
while (defined ($filename = readdir($dh))) {
    # ...
}
closedir $dh;

Or, finally, you can use DirHandle to get an object-oriented view of a directory handle. The following code uses DirHandle and produces a sorted list of plain files that aren't dotfiles (that is, their names don't begin with a "."):

use DirHandle;

sub plainfiles {
    my $dir = shift;
    my $dh = DirHandle->new($dir) or die "can't opendir $dir: $!";
    return sort                     # sort pathnames
        grep { -f }                 # choose only "plain" files
        map  { "$dir/$_" }          # create full paths
        grep { !/^\./ }             # filter out dot files
        $dh->read( );               # read all entries
}

DirHandle's read method behaves just like readdir, returning all remaining filenames. The bottom grep returns only those that don't begin with a period. The map turns the filenames returned by read into fully qualified filenames, and the top grep filters out directories, links, etc. The resulting list is then sorted and returned.
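As a usage sketch, the DirHandle-based plainfiles routine just described can be packaged into a small script. The function body is restated from the recipe so the example runs on its own; the target directory is only an example:

```perl
#!/usr/bin/perl
# List the plain, non-dot files in a directory, sorted, one per line.
use strict;
use warnings;
use DirHandle;

# Restated from the recipe: sorted plain files, dotfiles excluded.
sub plainfiles {
    my $dir = shift;
    my $dh  = DirHandle->new($dir) or die "can't opendir $dir: $!";
    return sort                    # sort pathnames
           grep { -f }             # choose only "plain" files
           map  { "$dir/$_" }      # create full paths
           grep { !/^\./ }         # filter out dot files
           $dh->read();            # read all entries
}

print "$_\n" for plainfiles(shift || ".");   # example: current directory
```

Because the grep/map pipeline builds full paths before the -f test, the returned names are already usable without prepending the directory again.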
In addition to readdir, there's also rewinddir (to move the directory handle back to the start of the filename list), seekdir (to move to a specific offset in the list), and telldir (to find out how far from the start of the list you are).

9.5.4 See Also

The closedir, opendir, readdir, rewinddir, seekdir, and telldir functions in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard DirHandle module (also in Chapter 32 of Programming Perl)

Recipe 9.6 Globbing, or Getting a List of Filenames Matching a Pattern

9.6.1 Problem

You want to get a list of filenames similar to those produced by MS-DOS's *.* and Unix's *.h. This is called globbing, and the filename wildcard expression is called a glob, or occasionally a fileglob to distinguish it from a typeglob.

9.6.2 Solution

Perl provides globbing with the semantics of the Unix C shell through the glob keyword and <>:

@list = <*.c>;
@list = glob("*.c");

You can also use readdir to extract the filenames manually:

opendir(DIR, $path);
@files = grep { /\.c$/ } readdir(DIR);
closedir(DIR);

9.6.3 Discussion

In versions of Perl before v5.6, Perl's built-in glob and <WILDCARD> notation (not to be confused with <FILEHANDLE>) ran an external program (often the csh shell) to get the list of filenames. This led to globbing being tarred with security and performance concerns. As of v5.6, Perl uses the File::Glob module to glob files, which solves the security and performance problems of the old implementation. Globs have C shell semantics on non-Unix systems to encourage portability. In particular, glob syntax isn't regular expression syntax—glob uses ? to mean "any single character" and * to mean "zero or more characters," so glob("f?o*") matches flo and flood but not fo. For complex rules about which filenames you want, roll your own selection mechanism using readdir and regular expressions.
At its simplest, an opendir solution uses grep to filter the list returned by readdir:

@files = grep { /\.[ch]$/i } readdir(DH);

As always, the filenames returned don't include the directory. When you use the filename, prepend the directory name to get the full pathname:

opendir(DH, $dir) or die "Couldn't open $dir for reading: $!";
@files = ( );
while( defined ($file = readdir(DH)) ) {
    next unless $file =~ /\.[ch]$/i;
    my $filename = "$dir/$file";
    push(@files, $filename) if -T $filename;
}

The following example combines directory reading and filtering with the efficient sorting technique from Recipe 4.16. It sets @dirs to a sorted list of the subdirectories in a directory whose names are all numeric:

@dirs = map  { $_->[1] }                # extract pathnames
        sort { $a->[0] <=> $b->[0] }    # sort names numeric
        grep { -d $_->[1] }             # path is a dir
        map  { [ $_, "$path/$_" ] }     # form (name, path)
        grep { /^\d+$/ }                # just numerics
        readdir(DIR);                   # all files

Recipe 4.16 explains how to read these strange-looking constructs. As always, formatting and documenting your code can make it much easier to read and understand.

9.6.4 See Also

The opendir, readdir, closedir, grep, map, and sort functions in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard DirHandle module (also in Chapter 32 of Programming Perl); the "I/O Operators" section of perlop(1), and the "Filename Globbing Operator" section of Chapter 2 of Programming Perl; we talk more about globbing in Recipe 6.9; Recipe 9.5

Recipe 9.7 Processing All Files in a Directory Recursively

9.7.1 Problem

You want to do something to each file and subdirectory in a particular directory.

9.7.2 Solution

Use the standard File::Find module.

use File::Find;
sub process_file {
    # do whatever;
}
find(\&process_file, @DIRLIST);

9.7.3 Discussion

File::Find provides a convenient way to process a directory recursively. It does the directory scans and recursion for you.
All you do is pass find a code reference and a list of directories. For each file in those directories, recursively, find calls your function. Before calling your function, find by default changes to the directory being visited, whose path relative to the starting directory is stored in the $File::Find::dir variable. $_ is set to the basename of the file being visited, and the full path of that file can be found in $File::Find::name. Your code can set $File::Find::prune to true to tell find not to descend into the directory just seen. This simple example demonstrates File::Find. We give find an anonymous subroutine that prints the name of each file visited and adds a / to the names of directories:

@ARGV = qw(.) unless @ARGV;
use File::Find;
find sub { print $File::Find::name, -d && "/", "\n" }, @ARGV;

The -d file test operator returns the empty string '' if it fails, making the && return that, too. But if -d succeeds, the && returns "/", which is then printed. The following program prints the total bytes occupied by everything in a directory, including subdirectories. It gives find an anonymous subroutine to keep a running sum of the sizes of each file it visits. That includes all inode types, including the sizes of directories and symbolic links, not just regular files. Once the find function returns, the accumulated sum is displayed.

use File::Find;
@ARGV = (".") unless @ARGV;
my $sum = 0;
find sub { $sum += -s }, @ARGV;
print "@ARGV contains $sum bytes\n";

This code finds the largest single file within a set of directories:

use File::Find;
@ARGV = (".") unless @ARGV;
my ($saved_size, $saved_name) = (-1, "");
sub biggest {
    return unless -f && -s _ > $saved_size;
    $saved_size = -s _;
    $saved_name = $File::Find::name;
}
find(\&biggest, @ARGV);
print "Biggest file $saved_name in @ARGV is $saved_size bytes long.\n";

We use $saved_size and $saved_name to keep track of the name and the size of the largest file visited.
If we find a file bigger than the largest seen so far, we replace the saved name and size with the current ones. When the find finishes, the largest file and its size are printed out, rather verbosely. A more general tool would probably just print the filename, its size, or both. This time we used a named function rather than an anonymous one because the function was getting big. It's simple to change this to find the most recently changed file:

use File::Find;
@ARGV = (".") unless @ARGV;
my ($age, $name);
sub youngest {
    return if defined $age && $age > (stat($_))[9];
    $age  = (stat(_))[9];
    $name = $File::Find::name;
}
find(\&youngest, @ARGV);
print "$name " . scalar(localtime($age)) . "\n";

The File::Find module doesn't export its $name variable, so always refer to it by its fully qualified name. Example 9-2 is more a demonstration of namespace munging than of recursive directory traversal, although it does find all directories. It makes $name in our current package an alias for the one in File::Find, which is essentially how Exporter works. Then it declares its own version of find with a prototype so it can be called like grep or map.

Example 9-2. fdirs

#!/usr/bin/perl -lw
# fdirs - find all directories
@ARGV = qw(.) unless @ARGV;
use File::Find ( );
sub find(&@) { &File::Find::find }
*name = *File::Find::name;
find { print $name if -d } @ARGV;

Our own find only calls the find in File::Find, which we were careful not to import by specifying an ( ) empty list in the use statement. Rather than write this:

find sub { print $File::Find::name if -d }, @ARGV;

we can write the more pleasant:

find { print $name if -d } @ARGV;

9.7.4 See Also

The documentation for the standard File::Find and Exporter modules (also in Chapter 32 of Programming Perl); your system's find(1) manpage; Recipe 9.6

Recipe 9.8 Removing a Directory and Its Contents

9.8.1 Problem

You want to remove a directory tree recursively without using rm -r.
9.8.2 Solution

Use the finddepth function from File::Find, shown in Example 9-3.

Example 9-3. rmtree1

#!/usr/bin/perl
# rmtree1 - remove whole directory trees like rm -r
use File::Find;
die "usage: $0 dir ..\n" unless @ARGV;
find {
    bydepth  => 1,
    no_chdir => 1,
    wanted   => sub {
        if (!-l && -d _) {
            rmdir  or warn "couldn't rmdir directory $_: $!";
        } else {
            unlink or warn "couldn't unlink file $_: $!";
        }
    }
} => @ARGV;

Or use rmtree from File::Path, as shown in Example 9-4.

Example 9-4. rmtree2

#!/usr/bin/perl
# rmtree2 - remove whole directory trees like rm -r
use File::Path;
die "usage: $0 dir ..\n" unless @ARGV;
foreach $dir (@ARGV) {
    rmtree($dir);
}

These programs remove an entire directory tree. Use with extreme caution!

9.8.3 Discussion

The File::Find module supports an alternate interface in which find's first argument is a hash reference containing options and their settings. The bydepth option is the same as calling finddepth instead of find. This is guaranteed to visit all files beneath a directory before the directory itself, just what we need to remove a directory and its contents. The no_chdir option stops find from chdiring into each directory as it processes it; under this option, $_ is the same as $File::Find::name. Finally, the wanted option takes a code reference, our old wanted( ) function. We use two different functions, rmdir and unlink; both default to $_ if no argument is provided. The unlink function deletes only files, and rmdir deletes only empty directories. We need to use finddepth or the bydepth option to make sure we've first removed the directory's contents before we rmdir the directory itself. We first check that the file isn't a symbolic link before determining whether it's a directory, because -d returns true for both a real directory and a symbolic link to a directory. stat, lstat, and file test operators like -d all use the syscall stat(2), which returns the file meta-information stored in the file's inode.
These functions and operators cache that information in the special underscore (_) filehandle. This permits tests on the same file while avoiding redundant syscalls that would return the same information, slowly.

According to POSIX, if the directory is either the root directory (the mount point for the filesystem or the result of a chroot(2) syscall) or the current working directory of any process, it is unspecified whether the rmdir syscall succeeds, or whether it fails and sets errno ($! in Perl) to EBUSY ("Device busy"). Many systems tolerate the latter condition, but few the former.

9.8.4 See Also

The unlink, rmdir, lstat, and stat functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard File::Find module (also in Chapter 32 of Programming Perl); your system's rm(1) and stat(2) manpages; the -X section of perlfunc(1), and the "Named Unary and File Test Operators" section of Chapter 3 of Programming Perl

[ Team LiB ]

Recipe 9.9 Renaming Files

9.9.1 Problem

You have many files whose names you want to change.

9.9.2 Solution

Use a foreach loop and the rename function:

foreach $file (@NAMES) {
    my $newname = $file;
    # change $newname
    rename($file, $newname)
        or warn "Couldn't rename $file to $newname: $!\n";
}

9.9.3 Discussion

This is straightforward. rename takes two arguments. The first is the filename to change, and the second is its new name. Perl's rename is a frontend to the operating system's rename syscall, which typically won't rename files across filesystem boundaries. A small change turns this into a generic rename script, such as the one by Larry Wall shown in Example 9-5.
rename

#!/usr/bin/perl -w
# rename - Larry's filename fixer
$op = shift or die "Usage: rename expr [files]\n";
chomp(@ARGV = <STDIN>) unless @ARGV;
for (@ARGV) {
    $was = $_;
    eval $op;
    die $@ if $@;
    rename($was, $_) unless $was eq $_;
}

This script's first argument is Perl code that alters the filename (stored in $_) to reflect how you want the file renamed. It can do this because it uses an eval to do the hard work. It also skips rename calls when the filename is untouched. This lets you simply use wildcards like rename EXPR * instead of making long lists of filenames.

Here are five examples of calling the rename program from the shell:

% rename 's/\.orig$//' *.orig
% rename "tr/A-Z/a-z/ unless /^Make/" *
% rename '$_ .= ".bad"' *.f
% rename 'print "$_: "; s/foo/bar/ if <STDIN> =~ /^y/i' *
% find /tmp -name "*~" -print | rename 's/^(.+)~$/.#$1/'

The first shell command removes a trailing ".orig" from each filename. The second converts uppercase to lowercase. Because a translation is used rather than the lc function, this conversion won't be locale-aware. To fix that, you'd have to write:

% rename 'use locale; $_ = lc($_) unless /^Make/' *

The third appends ".bad" to each Fortran file ending in ".f", something many of us have wanted to do for a long time. The fourth prompts the user for the change. Each file's name is printed to standard output and a response read from standard input. If the user types something starting with a "y" or "Y", any "foo" in the filename is changed to "bar". The fifth uses find to locate files in /tmp that end with a tilde. It renames these so that instead of ending with a tilde, they start with a dot and a pound sign. In effect, this switches between two common conventions for backup files.

The rename script exemplifies the powerful Unix tool-and-filter philosophy. Even though we could have created a dedicated command for lowercase conversion, it's nearly as easy to write a flexible, reusable tool by embedding an eval.
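The eval-on-$_ core of the script is easy to see in isolation. Here's a minimal sketch using a hypothetical apply_op helper (our own illustration, not part of the book's script) that transforms a list of names in memory exactly the way rename transforms filenames, minus the actual rename() calls:

```perl
#!/usr/bin/perl -w
# apply_op - run a user-supplied snippet of Perl against each name in a
# list, returning the transformed names. The snippet sees (and may modify)
# each name in $_, just as in Larry's rename script.
use strict;

sub apply_op {
    my ($op, @names) = @_;      # @names is a private copy of the arguments
    my @out;
    for (@names) {              # alias each name into $_
        eval $op;               # the snippet may modify $_ in place
        die $@ if $@;           # propagate compile or runtime errors
        push @out, $_;
    }
    return @out;
}

# strip a trailing ".orig", just like: rename 's/\.orig$//' *.orig
my @fixed = apply_op('s/\.orig$//', "a.c.orig", "b.c.orig");
print "@fixed\n";               # a.c b.c
```

Because the snippet is ordinary Perl, anything that works on $_ works here: substitutions, translations, even multi-statement programs.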
By reading filenames from standard input, we don't have to build in the recursive directory walk. Instead, we just use find, which performs this function well. There's no reason to recreate the wheel, although using File::Find we could have. 9.9.4 See Also The rename function in perlfunc(1) and in Chapter 29 of Programming Perl; your system's mv(1) and rename(2) manpages; the documentation for the standard File::Find module (also in Chapter 32 of Programming Perl) [ Team LiB ] [ Team LiB ] Recipe 9.10 Splitting a Filename into Its Component Parts 9.10.1 Problem You want to extract a filename, its enclosing directory, or the extension(s) from a string that contains a full pathname. 9.10.2 Solution Use routines from the standard File::Basename module. use File::Basename; $base = basename($path); $dir = dirname($path); ($base, $dir, $ext) = fileparse($path); 9.10.3 Discussion The standard File::Basename module contains routines to split up a filename. dirname and basename supply the directory and filename portions, respectively: $path = "/usr/lib/libc.a"; $file = basename($path); $dir = dirname($path); print "dir is $dir, file is $file\n"; # dir is /usr/lib, file is libc.a The fileparse function can extract the extension. Pass fileparse the path to decipher and a regular expression that matches the extension. You must supply a pattern because an extension isn't necessarily dot-separated. Consider ".tar.gz": is the extension ".tar", ".gz", or ".tar.gz"? By specifying the pattern, you control which you get. $path = "/usr/lib/libc.a"; ($name,$dir,$ext) = fileparse($path,'\..*'); print "dir is $dir, name is $name, extension is $ext\n"; # dir is /usr/lib/, name is libc, extension is .a By default, these routines parse pathnames using your operating system's normal conventions for directory separators by consulting the $^O ($OSNAME) variable, which holds a string identifying the platform you're running on. That value was determined when Perl was built and installed. 
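To make the earlier point about suffix patterns concrete, here's a short sketch showing how the pattern passed to fileparse decides what counts as "the extension" of a name like archive.tar.gz (the path used is made up for illustration):

```perl
#!/usr/bin/perl -w
# How the suffix pattern given to fileparse() picks the extension.
use strict;
use File::Basename qw(fileparse);

my $path = "/tmp/archive.tar.gz";

# Greedy pattern: everything from the first dot onward is the suffix.
my ($name1, $dir1, $ext1) = fileparse($path, '\..*');
print "$name1 $ext1\n";     # archive .tar.gz

# Last-dot-only pattern: [^.]* can't cross a dot, so only the final
# component is the suffix.
my ($name2, $dir2, $ext2) = fileparse($path, '\.[^.]*');
print "$name2 $ext2\n";     # archive.tar .gz
```

Note that fileparse, unlike dirname, leaves the trailing separator on the directory portion: $dir1 here is "/tmp/", not "/tmp".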
You can change the default by calling the fileparse_set_fstype routine. This alters the behavior of subsequent calls to the File::Basename functions:

fileparse_set_fstype("MacOS");
$path = "Hard%20Drive:System%20Folder:README.txt";
($name,$dir,$ext) = fileparse($path,'\..*');
print "dir is $dir, name is $name, extension is $ext\n";
# dir is Hard%20Drive:System%20Folder, name is README, extension is .txt

To pull out just the extension, you might use this:

sub extension {
    my $path = shift;
    my $ext = (fileparse($path,'\..*'))[2];
    $ext =~ s/^\.//;
    return $ext;
}

When called on a file like source.c.bak, this returns an extension of "c.bak", not just "bak". If you want ".bak" returned, use '\.[^.]*' as the second argument to fileparse (this will, of course, leave the filename as source.c).

When passed a pathname with a trailing directory separator, such as "lib/", fileparse considers the directory name to be "lib/", whereas dirname considers it to be ".".

9.10.4 See Also

The documentation for the standard File::Basename module (also in Chapter 32 of Programming Perl); the entry for $^O ($OSNAME) in perlvar(1), and in the "Special Variables in Alphabetical Order" section of Chapter 28 of Programming Perl

[ Team LiB ]

Recipe 9.11 Working with Symbolic File Permissions Instead of Octal Values

9.11.1 Problem

You want to print, inspect, or change permissions on a file or directory, but you don't want to specify the permissions in octal (e.g., 0644, 0755). You want to print permissions as ls(1) shows them (e.g., -rwxr-xr-x) and specify permissions changes in the way that chmod(1) does (e.g., g-w to remove write access for the group).
9.11.2 Solution

Use the CPAN module Stat::lsMode to convert numeric permissions to a string:

use Stat::lsMode;
$lsmode = file_mode($pathname);

Use the CPAN module File::chmod to manipulate symbolic permissions:

use File::chmod;
chmod("g=rw,o-w", @files);      # group can read/write, others can't write
chmod("-rwxr-xr--", @files);    # ls-style permissions

9.11.3 Discussion

The Stat::lsMode module provides functions for generating ls-style permissions strings. The file_mode function takes a pathname and returns a permissions string. This string is false if the pathname doesn't exist or Perl can't stat it. If all goes well, you get a string like "drwxr-x---" for a directory or "-rwxr-x---" for a file.

For more fine-grained control, Stat::lsMode offers format_mode, which takes a numeric permissions value and returns the 10-character ls-style string. Notice the leading d and - in those strings. This indicates the type of file whose permissions you're inspecting: - means regular file, d means directory, l means symbolic link, and so on. The format_perms function from Stat::lsMode does the same job as format_mode, but it returns a nine-character string, which does not have the type indicator. For example:

use Stat::lsMode;
print file_mode("/etc"), "\n";
print format_mode((stat "/etc")[2]), "\n";

drwxr-xr-x
drwxr-xr-x

The File::chmod module gives you a chmod that accepts these nine-character permissions strings:

use File::chmod;
chmod("rwxr-xr-x", @files);

These strings are three clusters of three characters. The three clusters represent what the user, group, and others can do to the file (respectively). The three characters represent reading, writing, and executing, with a dash (-) in a column indicating that the corresponding permission is denied. So in "rwxrw-r--", the owner can read, write, and execute; users in the same group as the file can read and write but not execute; and everyone else can only read.
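The mapping from a numeric mode to those nine characters is plain bit arithmetic. Here's a rough, hand-rolled sketch of the idea (our own illustration, not how Stat::lsMode is necessarily implemented; it ignores the setuid/setgid/sticky bits and the leading type character that the real module handles for you):

```perl
#!/usr/bin/perl -w
# mode_string - turn the low nine permission bits of a mode into an
# "rwxrwxrwx"-style string, one character per bit.
use strict;

sub mode_string {
    my $mode  = shift;
    my @bits  = (0400, 0200, 0100, 040, 020, 010, 04, 02, 01);
    my @chars = qw(r w x r w x r w x);
    my $str   = "";
    for my $i (0 .. 8) {
        # set bit => its letter; clear bit => a dash
        $str .= ($mode & $bits[$i]) ? $chars[$i] : "-";
    }
    return $str;
}

printf "%s\n", mode_string(0755);   # rwxr-xr-x
printf "%s\n", mode_string(0644);   # rw-r--r--
```

Reading the table of @bits left to right, you can see why each cluster of three characters corresponds to one octal digit of the mode.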
You can specify relative changes to the permissions for a particular file; for example, g-w removes write permission from the group. The first letter(s) indicates whose permissions are being changed (user, group, other, or a combination). Then comes a + or - to indicate adding or removing permissions, or = to indicate you're specifying the complete set of permissions. Then you specify some or all of rwx. You can join these with commas to form relative permissions; for example, g-w,o+x (remove write from group, add execute to other). If you omit the u, g, or o, then the change applies to everyone. Here are some valid permissions changes and what they do:

u=              remove all permissions for the user
g=r             group can only read
g+wx            group can also write and execute
g=rwx,o=rx      group can do all, other can only read and execute
=rwx            everybody can do everything

So you can now say:

chmod("u=",         @files);    # remove all permissions for the user on @files
chmod("g=r",        @files);
chmod("g+wx",       @files);
chmod("g=rwx,o=rx", @files);
chmod("=rwx",       @files);

File::chmod also provides functions for seeing what the new permission would be without actually making the change. See the File::chmod documentation for more details.

9.11.4 See Also

The documentation for the CPAN modules File::chmod and Stat::lsMode; the chmod and stat functions in perlfunc(1)

[ Team LiB ]

Recipe 9.12 Program: symirror

The program in Example 9-6 recursively duplicates a directory tree, making a shadow forest full of symlinks pointing back at the real files.

Example 9-6.
symirror

#!/usr/bin/perl
# symirror - build spectral forest of symlinks
use warnings;
use strict;
use Cwd qw(realpath);
use File::Find qw(find);

die "usage: $0 realdir mirrordir" unless @ARGV == 2;
our $SRC = realpath $ARGV[0];
our $DST = realpath $ARGV[1];
my $oldmask = umask 077;            # in case was insanely uncreatable
chdir $SRC or die "can't chdir $SRC: $!";
unless (-d $DST) {
    mkdir($DST, 0700) or die "can't mkdir $DST: $!";
}

find {
    wanted      => \&shadow,
    postprocess => \&fixmode,
} => ".";

umask $oldmask;

sub shadow {
    (my $name = $File::Find::name) =~ s!^\./!!;     # correct name
    return if $name eq ".";
    if (-d) {       # make a real dir; we'll copy mode later
        mkdir("$DST/$name", 0700)
            or die "can't mkdir $DST/$name: $!";
    } else {        # all else gets symlinked
        symlink("$SRC/$name", "$DST/$name")
            or die "can't symlink $SRC/$name to $DST/$name: $!";
    }
}

sub fixmode {
    my $dir  = $File::Find::dir;
    my $mode = (stat("$SRC/$dir"))[2] & 07777;
    chmod($mode, "$DST/$dir")
        or die "can't set mode on $DST/$dir: $!";
}

[ Team LiB ]

Recipe 9.13 Program: lst

Have you ever wondered what the newest or biggest files within a directory are? The standard ls program has options for listing out directories sorted in time order (the -t flag) and for recursing into subdirectories (the -R flag). However, it pauses at each directory to display the sorted contents of just that directory. It doesn't descend through all subdirectories first and then sort everything it found. The following lst program does that.
Here's an example using its -l flag to get a long listing:

% lst -l /etc
12695 0600 1 root wheel 512 Fri May 29 10:42:41 1998 /etc/ssh_random_seed
12640 0644 1 root wheel 10104 Mon May 25 7:39:19 1998 /etc/
12626 0664 1 root wheel 12288 Sun May 24 19:23:08 1998 /etc/psdevtab
12304 0644 1 root root 237 Sun May 24 13:59:33 1998 /etc/exports
12309 0644 1 root root 3386 Sun May 24 13:24:33 1998 /etc/inetd.conf
12399 0644 1 root root 30205 Sun May 24 10:08:37 1998 /etc/
18774 0644 1 gnat perldoc 2199 Sun May 24 9:35:57 1998 /etc/X11/XMetroconfig
12636 0644 1 root wheel 290 Sun May 24 9:05:40 1998 /etc/mtab
12627 0640 1 root root 0 Sun May 24 8:24:31 1998 /etc/wtmplock
12310 0644 1 root tchrist 65 Sun May 24 8:23:04 1998 /etc/issue
....

/etc/X11/XMetroconfig showed up in the middle of the listing for /etc because it wasn't just for /etc, but for everything within that directory, recursively.

Other supported options include sorting on read time instead of write time using -u and sorting on size rather than time with -s. The -i flag takes the list of filenames from standard input instead of recursing with find. That way, if you already had a list of filenames, you could feed them to lst for sorting. The program is shown in Example 9-7.

Example 9-7. lst

#!/usr/bin/perl
# lst - list sorted directory contents (depth first)
use Getopt::Std;
use File::Find;
use File::stat;
use User::pwent;
use User::grent;

getopts("lusrcmi") or die <<DEATH;
Usage: $0 [-mucsril] [dirs ...]
    or $0 -i [-mucsrl] < filelist
Input format:
    -i  read pathnames from stdin
Output format:
    -l  long listing
Sort on:
    -m  use mtime (modify time) [DEFAULT]
    -u  use atime (access time)
    -c  use ctime (inode change time)
    -s  use size for sorting
Ordering:
    -r  reverse sort
NB: You may only select one sorting option at a time.
DEATH unless ($opt_i || @ARGV) { @ARGV = (".") } if ($opt_c + $opt_u + $opt_s + $opt_m > 1) { die "can only sort on one time or size"; } $IDX = "mtime"; $IDX = "atime" if $opt_u; $IDX = "ctime" if $opt_c; $IDX = "size" if $opt_s; $TIME_IDX = $opt_s ? "mtime" : $IDX; *name = *File::Find::name; # forcibly import that variable # the $opt_i flag tricks wanted into taking # its filenames from ARGV instead of being # called from find. if ($opt_i) { *name = *_; # $name now alias for $_ while (<>) { chomp; &wanted; } # ok, not stdin really } else { find(\&wanted, @ARGV); } # sort the files by their cached times, youngest first @skeys = sort { $time{$b} <=> $time{$a} } keys %time; # but flip the order if -r was supplied on command line @skeys = reverse @skeys if $opt_r; for (@skeys) { unless ($opt_l) { # emulate ls -l, except for permissions print "$_\n"; next; } $now = localtime $stat{$_}->$TIME_IDX( ); printf "%6d %04o %6d %8s %8s %8d %s %s\n", $stat{$_}->ino( ), $stat{$_}->mode( ) & 07777, $stat{$_}->nlink( ), user($stat{$_}->uid( )), group($stat{$_}->gid( )), $stat{$_}->size( ), $now, $_; } # get stat info on the file, saving the desired # sort criterion (mtime, atime, ctime, or size) # in the %time hash indexed by filename. # if they want a long list, we have to save the # entire stat object in %stat. yes, this is a # hash of objects sub wanted { my $sb = stat($_); # XXX: should be stat or lstat? return unless $sb; $time{$name} = $sb->$IDX( ); # indirect method call $stat{$name} = $sb if $opt_l; } # cache user number to name conversions; don't worry # about the apparently extra call, as the system caches the # last one called all by itself sub user { my $uid = shift; $user{$uid} = getpwuid($uid) ? getpwuid($uid)->name : "#$uid" unless defined $user{$uid}; return $user{$uid}; } # cache group number to name conversions; ditto on unworryness sub group { my $gid = shift; $group{$gid} = getgrgid($gid) ? 
getgrgid($gid)->name : "#$gid" unless defined $group{$gid}; return $group{$gid}; } [ Team LiB ] [ Team LiB ] Chapter 10. Subroutines Composing mortals with immortal fire. —W. H. Auden, "Three Songs for St Cecilia's Day" [ Team LiB ] [ Team LiB ] Introduction To avoid the dangerous practice of copying and pasting code, larger programs reuse chunks of code as subroutines and functions. We'll use the terms subroutine and function interchangeably because Perl doesn't distinguish between the two. Even object-oriented methods are just subroutines that are called using a special syntax, described in Chapter 13. A subroutine is declared with the sub keyword. Here's a simple subroutine definition: sub hello { $greeted++; # global variable print "hi there!\n"; } The typical way of calling that subroutine is: hello( ); # call subroutine hello with no arguments/parameters Because Perl compiles your program before executing it, it doesn't matter where subroutines are declared. Definitions don't have to be in the same file as your main program. They can be pulled in from other files using the do, require, or use operators, as described in Chapter 12. They can even be created on the fly using eval or AUTOLOAD, or generated using closures, which can act as function templates. If you are familiar with other programming languages, several characteristics of Perl's functions may surprise you if you're unprepared for them. Most recipes in this chapter illustrate how to be aware of—and to take advantage of—these properties. Perl functions have no formal, named parameters, but this is not necessarily a bad thing. See Recipe 10.1 and Recipe 10.7. All variables are global unless declared otherwise. See Recipe 10.2, Recipe 10.3, and Recipe 10.13 for details. Passing or returning more than one array or hash normally causes them to lose their separate identities. See Recipe 10.5, Recipe 10.8, Recipe 10.9, and Recipe 10.11 to avoid this. 
A function can know in which context it was called, how many arguments it was called with, and even which other function called it. See Recipe 10.4 and Recipe 10.6 to find out how. Perl's undef value can be used to signal an error return from the function because no valid string, number, or reference ever has that value. Recipe 10.10 covers subtle pitfalls with undef you should avoid, and Recipe 10.12 shows how to deal with other catastrophic conditions. Perl supports interesting operations on functions that you might not see in other languages, such as anonymous functions, creating functions on the fly, and calling them indirectly using function pointers. See Recipe 10.14 and Recipe 10.16 for these esoteric topics. Calling a function as $x = &func; does not supply any arguments, but rather provides direct access to its caller's @_ array! If you omit the ampersand and use either func( ) or func, then a new and empty @_ is provided instead. Historically, Perl hasn't provided a construct like C's switch or the shell's case for multiway branching. The switch function shown in Recipe 10.17 takes care of that for you. [ Team LiB ] [ Team LiB ] Recipe 10.1 Accessing Subroutine Arguments 10.1.1 Problem You have written a function that takes arguments supplied by its caller, and you need to access those arguments. 10.1.2 Solution The special array @_ holds the values passed in as the function's arguments. Thus, the first argument to the function is in $_[0], the second in $_[1], and so on. The number of arguments is simply scalar(@_). For example: sub hypotenuse { return sqrt( ($_[0] ** 2) + ($_[1] ** 2) ); } $diag = hypotenuse(3,4); # $diag is 5 Most subroutines start by copying arguments into named private variables for safer and more convenient access: sub hypotenuse { my ($side1, $side2) = @_; return sqrt( ($side1 ** 2) + ($side2 ** 2) ); } 10.1.3 Discussion It's been said that programming has only three nice numbers: zero, one, and however many you please. 
Perl's subroutine mechanism was designed to facilitate writing functions with as many—or as few—elements in the parameter and return lists as you wish. All incoming parameters appear as separate scalar values in the special array @_, which is automatically local to each function (see Recipe 10.13). To return a value or values from a subroutine, use the return statement with arguments. If there is no return statement, the return value is the result of the last evaluated expression. Here are some sample calls to the hypotenuse function defined in the Solution: print hypotenuse(3, 4), "\n"; # prints 5 @a = (3, 4); print hypotenuse(@a), "\n"; # prints 5 If you look at the arguments used in the second call to hypotenuse, it might appear that only one argument was passed: the array @a. This isn't what happens—the elements of @a are copied into the @_ array separately. Similarly, if you called a function with (@a, @b), you'd be giving it all arguments from both arrays. This is the same principle of flattened lists at work as in: @both = (@men, @women); The scalars in @_ are implicit aliases for the ones passed in, not copies. That means changing the elements of @_ in a subroutine changes the values in the subroutine's caller. This is a holdover from before Perl had proper references. You can write functions that leave their arguments intact by copying the arguments to private variables like this: @nums = (1.4, 3.5, 6.7); @ints = int_all(@nums); # @nums unchanged sub int_all { my @retlist = @_; # make safe copy for return for my $n (@retlist) { $n = int($n) } return @retlist; } You can also write functions that change their caller's variables: @nums = (1.4, 3.5, 6.7); trunc_em(@nums); # @nums now (1,3,6) sub trunc_em { for (@_) { $_ = int($_) } # truncate each argument } Don't pass constants as arguments to a function that intends to modify those arguments; for example, don't call trunc_em(1.4, 3.5, 6.7). 
If you do, you'll get a runtime exception to the effect of Modification of a read-only value attempted at .... The built-in functions chop and chomp are like that; they modify their caller's variables and return something else entirely. Beginning Perl programmers who notice regular functions that all return some new value—including int, uc, and readline— without modifying those functions' arguments sometimes incorrectly infer that chop and chomp work similarly. This leads them to write code like: $line = chomp(<>); # WRONG $removed_chars = chop($line); # RIGHT $removed_count = chomp($line); # RIGHT until they get the hang of how this pair really works. Given the vast potential for confusion, you might want to think twice before modifying @_ in your own subroutines, especially if you also intend to provide a distinct return value. 10.1.4 See Also Chapter 6 of Programming Perl and perlsub(1) [ Team LiB ] [ Team LiB ] Recipe 10.2 Making Variables Private to a Function 10.2.1 Problem Your subroutine needs temporary variables. You shouldn't use global variables, because another subroutine might also use the same variables. 10.2.2 Solution Use my to declare a variable private to a region of your program: sub somefunc { my $variable; # $variable is invisible outside somefunc( ) my ($another, @an_array, %a_hash); # declaring many variables at once # ... } 10.2.3 Discussion The my operator confines a variable to a particular region of code in which it can be used and accessed. Outside that region, it can't be accessed. This region is called its scope. Variables declared with my have lexical scope, meaning that they exist only within a specific textual region of code. For instance, the scope of $variable in the Solution is the function it was defined in, somefunc. The variable is created when somefunc is entered, and it is destroyed when the function returns. The variable can be accessed only from inside the function, not from outside. 
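To see that creation-and-destruction cycle concretely, here's a small sketch (the tally function is our own illustration): each call to the subroutine gets a brand-new $count, so nothing accumulates between calls.

```perl
#!/usr/bin/perl -w
# A my variable is created afresh every time its enclosing sub is
# entered and destroyed when the sub returns, so no state survives.
use strict;

sub tally {
    my $count = 0;      # a new $count each time tally() is entered
    $count++;
    return $count;      # always 1; this $count is about to vanish
}

print tally(), "\n";    # 1
print tally(), "\n";    # 1, not 2 -- nothing carried over
```

Contrast this with the persistent private variables of Recipe 10.3, where the my variable lives in a scope that outlasts any single call.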
A lexical scope is usually a block of code with braces around it, such as those defining the body of the somefunc subroutine or those marking the code blocks of if, while, for, foreach, and eval. An entire source file and the string argument to eval are each a lexical scope;[1] think of them as blocks with invisible braces delimiting their confines. Because a lexical scope is most often found as a brace-delimited block, when discussing lexical variables we sometimes say that they are visible only in their block, but what we really mean is that they're visible only in their scope. [1] Although not of the same sort: the eval scope is a nested scope, just like a nested block, but the file scope is unrelated to any other. The code that can legally access a my variable is determined statically at compile time and never changes, and so lexical scoping is sometimes referred to as static scoping, especially when in contrast to dynamic scoping, a topic we'll cover in Recipe 10.13. You can combine a my declaration with an assignment. Use parentheses when defining more than one variable: my ($name, $age) = @ARGV; my $start = fetch_time( ); These lexical variables behave as you would expect of a local variable. Nested blocks can see lexicals declared in enclosing, outer blocks, but not in unrelated blocks: my ($a, $b) = @pair; my $c = fetch_time( ); sub check_x { my $x = $_[0]; my $y = "whatever"; run_check( ); if ($condition) { print "got $x\n"; } } In the preceding code, the if block inside the function can access the private $x variable. However, the run_check function called from within that scope cannot access $x or $y, because run_check was presumably defined in another scope. However, check_x can access $a, $b, or $c from the outer scope because the function was defined in the same scope as those three variables. Don't nest definitions of named subroutines. If you do, they won't get the right bindings of the lexical variables. 
Recipe 10.16 shows how to cope with this restriction. When a lexical variable goes out of scope, its storage is freed unless a reference to the variable still exists, as with @arguments in the following code: sub save_array { my @arguments = @_; push(our @Global_Array, \@arguments); } This code creates a new array each time save_array is called, so you don't have to worry that it'll reuse the same array each time the function is called. Perl's garbage collection system knows not to deallocate things until they're no longer used. This is why you can return a reference to a private variable without leaking memory. 10.2.4 See Also The section on "Scoped Declarations" in Chapter 4 of Programming Perl and the section on "Private Variables via my( )" in perlsub(1) [ Team LiB ] [ Team LiB ] Recipe 10.3 Creating Persistent Private Variables 10.3.1 Problem You want a variable to retain its value between calls to a subroutine but not be visible outside that routine. For instance, you'd like your function to keep track of how many times it was called. 10.3.2 Solution Wrap the function in another block, then declare my variables in that block's scope rather than in the function's: { my $variable; sub mysub { # ... accessing $variable } } If the variables require initialization, make that block an INIT so the variable is guaranteed to be set before the main program starts running: INIT { my $variable = 1; # initial value sub othersub { # ... accessing $variable } } 10.3.3 Discussion Unlike local[2] variables in C or C++, Perl's lexical variables don't necessarily get recycled just because their scope has exited. If something more permanent is still aware of the lexical, it will stick around. In this code, mysub uses $variable, so Perl doesn't reclaim the variable when the block around the definition of mysub ends. [2] Technically speaking, auto variables. 
Here's how to write a counter:

{
    my $counter;
    sub next_counter { return ++$counter }
}

Each time next_counter is called, it increments and returns the $counter variable. The first time next_counter is called, $counter is undefined, so it behaves as though it were 0 for the ++. The variable is not part of next_counter's scope, but rather part of the block surrounding it. No code from outside can change $counter except by calling next_counter.

Generally, you should use an INIT for the extra scope. Otherwise, you could call the function before its variables were initialized.

INIT {
    my $counter = 42;
    sub next_counter { return ++$counter }
    sub prev_counter { return --$counter }
}

This technique creates the Perl equivalent of C's static variables. Actually, it's a little better: rather than being limited to just one function, both functions share their private variable.

10.3.4 See Also

The sections on "Closures" in Chapter 8 of Programming Perl and on "Avant-Garde Compiler, Retro Interpreter" in Chapter 18 of Programming Perl; the section on "Private Variables via my( )" in perlsub(1); the section on "Package Constructors and Destructors" in perlmod(1); Recipe 11.4

[ Team LiB ]

Recipe 10.4 Determining Current Function Name

10.4.1 Problem

You want to determine the name of the currently running function. This is useful for creating error messages that don't need to be changed if you copy and paste the subroutine code.

10.4.2 Solution

Use the caller function:

$this_function = (caller(0))[3];

10.4.3 Discussion

Code can always determine the current source line number via the special symbol __LINE__, the current file via __FILE__, and the current package via __PACKAGE__. But no such symbol for the current subroutine name exists, let alone the name for the subroutine that called this one. The built-in function caller handles all of these. In scalar context it returns the calling function's package name, but in list context it returns much more.
You can also pass it a number indicating how many frames (nested subroutine calls) back you'd like information about: 0 is your own function, 1 is your caller, and so on. Here's the full syntax, where $i is how far back you're interested in:

($package, $filename, $line, $subr, $has_args, $wantarray,  # 0 1 2 3 4 5
 $evaltext, $is_require, $hints, $bitmask                   # 6 7 8 9
) = caller($i);

Here's what each of those return values means:

$package
    The package in which the code was compiled.

$filename
    The name of the file in which the code was compiled, reporting -e if launched from that command-line switch, or - if the script was read from standard input.

$line
    The line number from which that frame was called.

$subr
    The name of that frame's function, including its package. Closures are indicated by names like main::__ANON__, which are not callable. In an eval, it contains (eval).

$has_args
    Whether the function had its own @_ variable set up. It may be that there are no arguments, even if true. The only way for this to be false is if the function was called using the &fn notation instead of fn( ) or &fn( ).

$wantarray
    The value the wantarray function would return for that stack frame; either true, false but defined, or else undefined. This tells whether the function was called in list, scalar, or void context (respectively).

$evaltext
    The text of the current eval STRING, if any.

$is_require
    Whether the code is currently being loaded by a do, require, or use.

$hints, $bitmask
    These both contain pragmatic hints that the caller was compiled with. Consider them to be for internal use only by Perl itself.

Rather than using caller directly as in the Solution, you might want to write functions instead:

$me  = whoami( );
$him = whowasi( );

sub whoami  { (caller(1))[3] }
sub whowasi { (caller(2))[3] }

These use arguments of 1 and 2 for parent and grandparent functions because the call to whoami or whowasi would itself be frame number 0.
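Here's a small self-contained sketch of those helpers at work (the outer and inner functions are our own illustration): whoami reports the function it was called from, and whowasi reports that function's caller.

```perl
#!/usr/bin/perl -w
# whoami/whowasi in action: each call to caller(N) climbs N frames
# above the function doing the asking, so N=1 names our caller and
# N=2 names our caller's caller.
use strict;

sub whoami  { (caller(1))[3] }      # who called me?
sub whowasi { (caller(2))[3] }      # who called my caller?

sub inner { return (whoami(), whowasi()) }
sub outer { return inner() }

my ($me, $parent) = outer();
print "$me $parent\n";              # main::inner main::outer
```

Note that the names come back fully qualified with their package, which is why the output says main::inner rather than just inner.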
10.4.4 See Also

The wantarray and caller functions in Chapter 29 of Programming Perl and in perlfunc(1); Recipe 10.6

[ Team LiB ]

Recipe 10.5 Passing Arrays and Hashes by Reference

10.5.1 Problem

You want to pass a function more than one array or hash and have each remain distinct. For example, you want to put the algorithm from Recipe 4.8 into a subroutine. This subroutine must then be called with two arrays that remain distinct.

10.5.2 Solution

Pass arrays and hashes by reference, using the backslash operator:

array_diff( \@array1, \@array2 );

10.5.3 Discussion

See Chapter 11 for more about manipulation of references. Here's a subroutine that expects array references, along with code to call it correctly:

@a = (1, 2);
@b = (5, 8);
@c = add_vecpair( \@a, \@b );
print "@c\n";

6 10

sub add_vecpair {       # assumes both vectors the same length
    my ($x, $y) = @_;   # copy in the array references
    my @result;
    for (my $i=0; $i < @$x; $i++) {
        $result[$i] = $x->[$i] + $y->[$i];
    }
    return @result;
}

A potential problem with this function is that it doesn't verify the number and types of arguments passed into it. You could check explicitly this way:

unless (@_ == 2 && ref($x) eq 'ARRAY' && ref($y) eq 'ARRAY') {
    die "usage: add_vecpair ARRAYREF1 ARRAYREF2";
}

If all you plan to do is die on error (see Recipe 10.12), you can sometimes omit this check, since dereferencing the wrong kind of reference triggers an exception anyway. However, good defensive programming style encourages argument validation for all functions.

10.5.4 See Also

The sections on "Passing References" and on "Prototypes" in Chapter 6 of Programming Perl and on "Pass by Reference" in perlsub(1); Recipe 10.11; Chapter 11; Chapter 8 of Programming Perl

[ Team LiB ]

Recipe 10.6 Detecting Return Context

10.6.1 Problem

You want to know in which context your function was called.
This lets one function do different things, depending on how its return value or values are used, just like many of Perl's built-in functions.

10.6.2 Solution

Use the wantarray( ) function, which has three possible return values, depending on how the current function was called:

    if (wantarray( )) {
        # list context
    }
    elsif (defined wantarray( )) {
        # scalar context
    }
    else {
        # void context
    }

10.6.3 Discussion

Many built-in functions act differently when called in scalar context than they do when called in list context. A user-defined function can learn which context it was called in by checking wantarray. List context is indicated by a true return value. If wantarray returns a value that is false but defined, then the function's return value will be used in scalar context. If wantarray returns undef, your function isn't being asked to provide any value at all.

    if (wantarray( )) {
        print "In list context\n";
        return @many_things;
    }
    elsif (defined wantarray( )) {
        print "In scalar context\n";
        return $one_thing;
    }
    else {
        print "In void context\n";
        return;    # nothing
    }

    mysub( );             # void context
    $a = mysub( );        # scalar context
    if (mysub( )) { }     # scalar context
    @a = mysub( );        # list context
    print mysub( );       # list context

10.6.4 See Also

The return and wantarray functions in Chapter 29 of Programming Perl and in perlfunc(1)

Recipe 10.7 Passing by Named Parameter

10.7.1 Problem

You want to make a function with many parameters that is easy to call, so that programmers remember what the arguments do rather than having to memorize their order.

10.7.2 Solution

Name each parameter in the call:

    thefunc(INCREMENT => "20s", START => "+5m", FINISH => "+30m");
    thefunc(START => "+5m", FINISH => "+30m");
    thefunc(FINISH => "+30m");
    thefunc(START => "+5m", INCREMENT => "15s");

Then in the subroutine, create a hash loaded up with default values plus the array of named pairs.
    sub thefunc {
        my %args = (
            INCREMENT   => '10s',
            FINISH      => 0,
            START       => 0,
            @_,         # argument pair list goes here
        );
        if ($args{INCREMENT} =~ /m$/ ) { ... }
    }

10.7.3 Discussion

Functions whose arguments require a particular order work well for short argument lists, but as the number of parameters increases, it's awkward to make some optional or have default values. You can only leave out trailing arguments, never initial ones.

A more flexible approach allows the caller to supply arguments using name-value pairs. The first element of each pair is the argument name; the second, its value. This makes for self-documenting code, because you can see the parameters' intended meanings without having to read the full function definition. Even better, programmers using your function no longer have to remember argument order, and they can simply leave out any arguments that don't apply.

This works by having the function declare a private hash variable to hold the default parameter values. Put the current arguments, @_, after the default values, so the actual arguments override the defaults because of the order of the values in the assignment.

A common variation on this is to preface the parameter name with a hyphen, intended to evoke the feel of command-line parameters:

    thefunc(-START => "+5m", -INCREMENT => "15s");

Ordinarily the hyphen isn't part of a bareword, but the Perl tokenizer makes an exception for the => operator to permit this style of function argument.

10.7.4 See Also

Chapter 4

Recipe 10.8 Skipping Selected Return Values

10.8.1 Problem

You have a function that returns many values, but you only care about some of them. The stat function is a classic example: you often want only one value from its long return list (mode, for instance).
10.8.2 Solution

Either assign to a list that has undef in some positions:

    ($a, undef, $c) = func( );

or else take a slice of the return list, selecting only what you want:

    ($a, $c) = (func( ))[0,2];

10.8.3 Discussion

Using dummy temporary variables is wasteful; plus it feels artificial and awkward:

    ($dev,$ino,$DUMMY,$DUMMY,$uid) = stat($filename);

A nicer style is to use undef instead of dummy variables to discard a value:

    ($dev,$ino,undef,undef,$uid) = stat($filename);

Or you can take a slice, picking up just the values you care about:

    ($dev,$ino,$uid,$gid) = (stat($filename))[0,1,4,5];

If you want to put an expression into list context and discard all of its return values (calling it simply for side effects), you can assign this to the empty list:

    ( ) = some_function( );

This last strategy is rather like a list version of the scalar operator—it calls the function in list context, even in a place it wouldn't otherwise do so. You can get just a count of return values this way:

    $count = ( ) = some_function( );

or you can call it in list context and make sure it returns some non-zero number of items (which you immediately discard) like this:

    if (( ) = some_function( )) { .... }

If you hadn't assigned to the empty list, the Boolean context of the if test would have called the function in scalar context.

10.8.4 See Also

The section on "List Values and Arrays" in Chapter 2 of Programming Perl and perlsub(1); Recipe 3.1

Recipe 10.9 Returning More Than One Array or Hash

10.9.1 Problem

You want a function to return more than one array or hash, but the return list flattens into just one long list of scalars.

10.9.2 Solution

Return references to the hashes or arrays:

    ($array_ref, $hash_ref) = somefunc( );

    sub somefunc {
        my @array;
        my %hash;
        # ...
        return ( \@array, \%hash );
    }

10.9.3 Discussion

Just as all arguments collapse into one flat list of scalars, return values do, too.
Functions that want to return multiple, distinct arrays or hashes need to return those by reference, and the caller must be prepared to receive references. If a function wants to return three separate hashes, for example, it should use one of the following:

    sub fn {
        .....
        return (\%a, \%b, \%c);     # or
        return \(%a, %b, %c);       # same thing
    }

The caller must expect a list of hash references returned by the function. It cannot just assign to three hashes.

    (%h0, %h1, %h2) = fn( );    # WRONG!
    @array_of_hashes = fn( );   # eg: $array_of_hashes[2]{"keystring"}
    ($r0, $r1, $r2)  = fn( );   # eg: $r2->{"keystring"}

10.9.4 See Also

The general discussions on references in Chapter 11, and in Chapter 8 of Programming Perl; Recipe 10.5

Recipe 10.10 Returning Failure

10.10.1 Problem

You want to return a value indicating that your function failed.

10.10.2 Solution

Use a bare return statement without any argument, which returns undef in scalar context and the empty list ( ) in list context.

    return;

10.10.3 Discussion

A return without an argument means:

    sub empty_retval {
        return ( wantarray ? ( ) : undef );
    }

You can't use just return undef, because in list context you will get a list of one value: undef. If your caller says:

    if (@a = yourfunc( )) { ... }

then the "error" condition will be perceived as true, because @a will be assigned (undef) and then evaluated in scalar context. This yields 1, the number of elements assigned to @a, which is true. You could use the wantarray function to see what context you were called in, but a bare return is a clear and tidy solution that always works:

    unless ($a = sfunc( )) { die "sfunc failed" }
    unless (@a = afunc( )) { die "afunc failed" }
    unless (%a = hfunc( )) { die "hfunc failed" }

Some of Perl's built-in functions have a peculiar return value. Both fcntl and ioctl have the curious habit of returning the string "0 but true" in some circumstances.
(This magic string is conveniently exempt from nagging warnings about improper numerical conversions.) This has the advantage of letting you write code like this:

    ioctl(....) or die "can't ioctl: $!";

That way, code doesn't have to check for a defined zero as distinct from the undefined value, as it would for the read or glob functions. "0 but true" is zero when used numerically. It's rare that this kind of return value is needed. A more common (and spectacular) way to indicate failure in a function is to raise an exception, as described in Recipe 10.12.

10.10.4 See Also

The undef, wantarray, and return functions in Chapter 29 of Programming Perl and in perlfunc(1); Recipe 10.12

Recipe 10.11 Prototyping Functions

10.11.1 Problem

You want to use function prototypes so the compiler can check your argument types.

10.11.2 Solution

Perl has something of a prototype facility, but it isn't what you're thinking. Perl's function prototypes are more like a context coercion used to write functions that behave like some Perl built-ins, such as push and pop.

10.11.3 Discussion

Manually checking the validity of a function's arguments can't happen until runtime. If you make sure the function is declared before it is used, you can tickle the compiler into using a very limited form of prototype checking. But don't confuse Perl's function prototypes with those found in any other language. A Perl function prototype is zero or more spaces, backslashes, or type characters enclosed in parentheses after the subroutine definition or name.

A backslashed type symbol means that the argument is passed by reference, and the argument in that position must start with that type character. A prototype can impose context on the prototyped function's arguments. This is done when Perl compiles your program.
But this does not always mean that Perl checks the number or type of arguments; since a scalar prototype is like inserting a scalar in front of just one argument, sometimes an implicit conversion occurs instead. For example, if Perl sees func(3, 5) for a function prototyped as sub func ($), it will stop with a compile-time error. But if it sees func(@array) with the same prototype, it will merely put @array into scalar context instead of complaining that you passed an array, but it wanted a scalar.

This is so important that it bears repeating: don't use Perl prototypes expecting the compiler to check type and number of arguments for you. It does a little bit of that, sometimes, but mostly it's about helping you type less, and sometimes to emulate the calling and parsing conventions of built-in functions.

Omitting parentheses

Ordinarily your subroutines take a list of arguments, and you can omit parentheses on the function call if the compiler has already seen a declaration or definition for that function:

    @results = reverse myfunc 3, 5;

Without prototypes, this is the same as:

    @results = reverse(myfunc(3, 5));

Without parentheses, Perl puts the righthand side of the subroutine call into list context. You can use prototypes to change this behavior. Here is a function that's prototyped to take just one argument:

    sub myfunc($);
    @results = reverse myfunc 3, 5;

Now this is the same as:

    @results = reverse(myfunc(3), 5);

Notice how the scalar prototype has altered the Perl parser! It grabs only the next thing it sees, leaving what remains for whatever other function is looking for arguments. A void prototype like:

    sub myfunc( );

will also alter the parser, causing no arguments to be passed to the function. This works just like the time built-in. That means that in the absence of parentheses, you cannot know what is going on by casual inspection. Things that look the same can quietly behave completely differently from one another.
Consider these declarations and assignments:

    sub fn0( );
    sub fn1($);
    sub fnN(@);

    $x = fn0 + 42;
    $x = fn1 + 42;
    $y = fnN fn1 + 42, fn0 + 42;
    $y = fnN fn0 + 42, fn1 + 42;
    $z = fn1 fn1 + 42, fn1 + 42;
    $z = fnN fnN + 42, fnN + 42;

Astonishingly enough, those are parsed by the Perl compiler as though they'd been written this way:

    $x = fn0( ) + 42;
    $x = fn1(42);
    $y = fnN(fn1(42), fn0( ) + 42);
    $y = fnN(fn0( ) + 42, fn1(42));
    $z = fn1(fn1(42)), fn1(42);
    $z = fnN(fnN(42, fnN(42)));

Without first looking closely at the prototypes and then thinking really hard about how Perl's parser works, you'd never be able to predict that. Maintainability would suffer horribly. This is one strong argument for using more parentheses than might be demanded by purely precedential concerns (or, alternatively, this is an argument for avoiding prototypes).

Mimicking built-ins

The other common use of prototypes is to give the convenient pass-without-flattening behavior of built-in functions like push and shift. When you call push as push(@array, 1, 2, 3), the function gets a reference to @array instead of the actual array. This is accomplished by backslashing the @ character in the prototype:

    sub mypush (\@@) {
        my $array_ref = shift;
        my @remainder = @_;
        # ...
    }

The \@ in the prototype says "require the first argument to begin with an @ character, and pass it by reference." The second @ says "the rest of the arguments are a (possibly empty) list." A backslash in a prototype requires that the argument actually begin with the literal type character, which can sometimes be annoying. You can't even use the conditional ?: construct to pick which array to pass:

    mypush( $x > 10 ? @a : @b, 3, 5 );          # WRONG

Instead, you must play games with references:

    mypush( @{ $x > 10 ? \@a : \@b }, 3, 5 );   # RIGHT (but ugly)

Here's an hpush function that works like push, but on hashes. It uses a list of key-value pairs to add to an existing hash, overwriting any previous values associated with those keys.
    sub hpush(\%@) {
        my $href = shift;
        while ( my ($k, $v) = splice(@_, 0, 2) ) {
            $href->{$k} = $v;
        }
    }
    hpush(%pieces, "queen" => 9, "rook" => 5);

You may also backslash several argument types simultaneously by using the \[ ] notation:

    sub mytie ( \[$@%&*] $; @ )

That function accepts any of the five types and passes it by reference, followed by one mandatory scalar context argument and an optional trailing list of remaining arguments.

You can discover a particular function's prototype using the prototype built-in function. For example, calling prototype("hpush") given the previous definition would return the string "\%@". You can even find out a built-in's prototype this way—if it has one, that is. Not all core built-ins can be emulated. For those that can, the prototype function returns what their built-in prototype is. Since you can always call a core built-in function like int as CORE::int, built-ins are deemed to reside in package CORE. For example:

    for $func (qw/int reverse keys push open print/) {
        printf "Prototype for %s is %s\n", $func,
            prototype("CORE::$func") || "UNAVAILABLE";
    }

    Prototype for int is ;$
    Prototype for reverse is @
    Prototype for keys is \%
    Prototype for push is \@@
    Prototype for open is *;$@
    Prototype for print is UNAVAILABLE

10.11.4 See Also

The prototype function in perlfunc(1); the section on "Prototypes" in Chapter 6 of Programming Perl and in perlsub(1); Recipe 10.5

Recipe 10.12 Handling Exceptions

10.12.1 Problem

How do you safely call a function that might raise an exception? How do you create a function that raises an exception?

10.12.2 Solution

Sometimes you encounter a problem so exceptional that merely returning an error isn't strong enough, because the caller could unintentionally ignore the error.
Use die STRING from your function to trigger an exception:

    die "some message";     # raise exception

The caller can wrap the function call in an eval to intercept that exception, then consult the special variable $@ to see what happened:

    eval { func( ) };
    if ($@) {
        warn "func raised an exception: $@";
    }

10.12.3 Discussion

Raising exceptions is not a facility to be used lightly. Most functions should return an error using a bare return statement. Wrapping every call in an exception trap is tedious and unsightly, removing the appeal of using exceptions in the first place.

But, on rare occasions, failure in a function should cause the entire program to abort. Rather than calling the irrecoverable exit function, you should call die instead, which at least gives the programmer the chance to cope. If no exception handler has been installed via eval, then the program aborts at that point. To detect this, wrap the call to the function with a block eval. The $@ variable will be set to the offending exception if one occurred; otherwise, it will be false.

    eval { $val = func( ) };
    warn "func blew up: $@" if $@;

Any eval catches all exceptions, not just specific ones. Usually you should propagate unexpected exceptions to an enclosing handler. For example, suppose your function raised an exception containing the string "Full moon!". You could safely trap that exception while letting others through by inspecting the $@ variable. Calling die without an argument uses the contents of $@ to construct a new exception string.

    eval { $val = func( ) };
    if ($@ && $@ !~ /Full moon!/) {
        die;    # re-raise unknown errors
    }

If the function is part of a module, consider using the Carp module and calling croak or confess instead of die. The only difference between die and croak is that with croak, the error appears to be from the caller's perspective, not the module's. The confess function, on the other hand, creates a full stack backtrace of who called whom and with what arguments.
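To see the difference croak makes in practice, here's a small sketch; the package and function names are invented for illustration:

```perl
use strict;
use warnings;

package MyModule;
use Carp;

sub validate {
    my ($n) = @_;
    # croak reports the error from the caller's file and line,
    # which is usually where the real mistake lies; a plain die
    # would blame this line inside the module instead
    croak "argument must be positive" unless $n > 0;
    return $n;
}

package main;

eval { MyModule::validate(-1) };
print $@;    # message points at the call site in main, not inside MyModule
```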
Another intriguing possibility is for the function to detect that its return value is being completely ignored because the function was called in void context. If that were the case, returning an error indication would be useless, so raise an exception instead.

    if (defined wantarray( )) {
        return;
    } else {
        die "pay attention to my error!";
    }

Of course, just because it's not voided doesn't mean the return value is being dealt with appropriately. But if it is voided, it's certainly not being checked.

There are CPAN modules that offer alternative ways of handling exceptions. The Error module offers try, catch, and throw notation instead of eval and die:

    use Error ':try';
    try {
        something( );
    }
    catch Error::Database with {
        my $e = shift;
        warn "Problem in " . $e->{'-database'} . " (caught)\n";
    };

Error offers try, catch ... with, except, otherwise, and finally blocks for maximum flexibility in error handling. The Exception::Class module from CPAN lets you create classes of exceptions and objects to represent specific exceptions. The two modules can be combined so that you can catch these exception objects.

10.12.4 See Also

The $@ ($EVAL_ERROR) variable in Chapter 28 of Programming Perl and perlvar(1); the die and eval functions in Chapter 29 of Programming Perl and perlfunc(1); the documentation for the CPAN modules Error and Exception::Class; Recipe 10.15; Recipe 12.2; Recipe 16.21

Recipe 10.13 Saving Global Values

10.13.1 Problem

You need to temporarily save away the value of a global variable.

10.13.2 Solution

Use the local operator to save a previous global value, automatically restoring it when the current block exits:

    our $age = 18;      # declare and set global variable
    if (CONDITION) {
        local $age = 23;
        func( );        # sees temporary value of 23
    }                   # Perl restores the old value at block exit

10.13.3 Discussion

Despite its name, Perl's local operator does not create a local variable. That's what my does.
Instead, local merely preserves an existing value for the duration of its enclosing block. Hindsight shows that if local had been called save_value instead, much confusion could have been avoided.

Three places where you must use local instead of my are:

1. You need to give a global variable a temporary value, especially $_.
2. You need to create a local file or directory handle or a local function.
3. You want to temporarily change just one element of an array or hash.

Using local( ) for temporary values for globals

The first situation is more apt to happen with predefined, built-in variables than with user variables. Often these are variables that Perl consults for hints for its high-level operations. In particular, any function that uses $_, implicitly or explicitly, should certainly have a local $_. This is annoyingly easy to forget to do. See Recipe 13.15 for one solution to this.

Another common target for local is the $/ variable, a global that implicitly affects the behavior of the readline operator used in <FH> operations.

    $para = get_paragraph(*FH);         # pass filehandle glob
    $para = get_paragraph(\*FH);        # pass filehandle by glob reference
    $para = get_paragraph(*IO{FH});     # pass filehandle by IO reference

    sub get_paragraph {
        my $fh = shift;
        local $/ = '';
        my $paragraph = <$fh>;
        chomp($paragraph);
        return $paragraph;
    }

Using local( ) for local handles

The second situation used to arise whenever you needed a local filehandle or directory handle—or more rarely, a local function.

    $contents = get_motd( );
    sub get_motd {
        local *MOTD;
        open(MOTD, "/etc/motd")     or die "can't open motd: $!";
        local $/ = undef;           # slurp full file
        local $_ = <MOTD>;
        close (MOTD);
        return $_;
    }

If you wanted to return the open filehandle, you'd use:

    return *MOTD;

However, in modern releases of Perl, you would make use of the filehandle autovivification property:

    $contents = get_motd( );
    sub get_motd {
        my $motd;                   # this will be filled in by the next line
        open($motd, "/etc/motd")    or die "can't open motd: $!";
        local $/ = undef;           # slurp full file
        return scalar <$motd>;
    }

When the function returns, the anonymous filehandle is automatically closed for you.
However, if you'd chosen to return $motd, then it wouldn't be. This is explained more fully in the Introduction to Chapter 7.

Using local( ) on parts of aggregates

The third situation is exceedingly rare, except for one common case. Because the local operator is really a "save value" operator, you can use it to save off just one element of an array or hash, even if that array or hash is itself a lexical!

    my @nums = (0 .. 5);
    sub first {
        local $nums[3] = 3.14159;
        second( );
    }
    sub second {
        print "@nums\n";
    }
    second( );
    0 1 2 3 4 5
    first( );
    0 1 2 3.14159 4 5

The only common use for this kind of thing is for temporary signal handlers.

    sub first {
        local $SIG{INT} = 'IGNORE';
        second( );
    }

Now while second is running, interrupt signals are ignored. When first returns, the previous value of $SIG{INT} is automatically restored.

Although a lot of old code uses local, it's definitely something to steer clear of when it can be avoided. Because local still manipulates the values of global variables, not local variables, you'll run afoul of use strict unless you declared the globals using our or the older use vars.

The local operator produces dynamic scoping, or runtime scoping. This is in contrast with the other kind of scoping Perl supports, which is much more easily understood. That's the kind of scoping that my provides, known as lexical scoping, or sometimes as static or compile-time scoping. With dynamic scoping, a variable is accessible if it's found in the current scope—or in the scope of any frames (blocks) in its entire subroutine call stack, as determined at runtime. Any functions called have full access to dynamic variables, because they're still globals, just ones with temporary values. Only lexical variables are safe from such tampering.

Old code that says:

    sub func {
        local($x, $y) = @_;
        # ....
    }

can almost always be replaced without ill effect by the following:

    sub func {
        my($x, $y) = @_;
        # ....
    }

The only case where code can't be so upgraded is when it relies on dynamic scoping. That would happen if one function called another, and the latter relied upon access to the former's temporary versions of the global variables $x and $y. Code that handles global variables and expects strange action at a distance instead of using proper parameters is fragile at best. Good programmers avoid this kind of thing like the plague. (The solution is to explicitly pass values as parameters, rather than storing them in shared global variables.)

If you come across old code that uses:

    &func(*Global_Array);
    sub func {
        local(*aliased_array) = shift;
        for (@aliased_array) { .... }
    }

this should probably be changed into something like this:

    func(\@Global_Array);
    sub func {
        my $array_ref = shift;
        for (@$array_ref) { .... }
    }

They're using the old pass-the-typeglob strategy devised before Perl supported proper references. It's not a pretty thing.

10.13.4 See Also

The local, my, and our functions in Chapter 29 of Programming Perl and perlfunc(1); Chapter 6 of Programming Perl; the section on "Scoped Declarations" in Chapter 4 of Programming Perl; the sections on "Private Variables via my( )" and "Temporary Values via local( )" in perlsub(1); Recipe 10.2; Recipe 10.16

Recipe 10.14 Redefining a Function

10.14.1 Problem

You want to temporarily or permanently redefine a function, but functions can't be assigned to.

10.14.2 Solution

To redefine a function, assign a new code reference to the typeglob of the name of that function. Use local if you want this redefinition to be temporary.

    undef &grow;                # silence -w complaints of redefinition
    *grow = \&expand;
    grow( );                    # calls expand( )

    {
        local *grow = \&shrink; # only until this block exits
        grow( );                # calls shrink( )
    }

10.14.3 Discussion

Unlike a variable (but like named filehandles, directory handles, and formats), a named function cannot be directly assigned to. It's just a name and doesn't vary.
You can manipulate it almost as though it were a variable, because you can directly manipulate the runtime symbol table using typeglobs like *foo to produce interesting aliasing effects. Assigning a reference to a typeglob changes what is accessed the next time a symbol of the referent's type is needed. This is what the Exporter does when you import a function or variable from one package into another. Since this is direct manipulation of the package symbol table, it works only on package variables (globals), not lexicals.

    *one::var = \%two::Table;   # make %one::var alias for %two::Table
    *one::big = \&two::small;   # make &one::big alias for &two::small

A typeglob is one of those things you can only use local on, not my. If you do use local, the aliasing effect is then limited to the duration of the current block.

    local *fred = \&barney;     # temporarily alias &fred to &barney

If the value assigned to a typeglob is not a reference but itself another typeglob, then all types by that name are aliased. The types aliased in a full typeglob assignment are scalar, array, hash, function, filehandle, directory handle, and format. That means that assigning *Top = *Bottom would make the current package variable $Top an alias for $Bottom, @Top for @Bottom, %Top for %Bottom, and &Top for &Bottom. It would even alias the corresponding file and directory handles and formats! You probably don't want to do this.

Use assignments to typeglobs together with closures to clone a bunch of similar functions cheaply and easily. Imagine you wanted a function for HTML generation to help with colors.
For example:

    $string = red("careful here");
    print $string;
    <FONT COLOR='red'>careful here</FONT>

You could write the red function this way:

    sub red { "<FONT COLOR='red'>@_</FONT>" }

If you need more colors, you could do something like this:

    sub color_font {
        my $color = shift;
        return "<FONT COLOR='$color'>@_</FONT>";
    }

    sub red    { color_font("red",    @_) }
    sub green  { color_font("green",  @_) }
    sub blue   { color_font("blue",   @_) }
    sub purple { color_font("purple", @_) }
    # etc

The similar nature of these functions suggests that there may be a way to factor out the common bit. To do this, use an assignment to an indirect typeglob. If you're running with the highly recommended use strict pragma, you must first disable strict "refs" for that block.

    @colors = qw(red blue green yellow orange purple violet);
    for my $name (@colors) {
        no strict 'refs';
        *$name = sub { "<FONT COLOR='$name'>@_</FONT>" };
    }

These functions all seem independent, but the real code was compiled only once. This technique saves on compile time and memory use. To create a proper closure, any variables in the anonymous subroutine must be lexicals. That's the reason for the my on the loop iteration variable.

This is one of the few places where giving a prototype to a closure is sensible. If you wanted to impose scalar context on the arguments of these functions (probably not a wise idea), you could have written it this way instead:

    *$name = sub ($) { "<FONT COLOR='$name'>$_[0]</FONT>" };

However, since prototype checking happens at compile time, the preceding assignment happens too late to be useful. So, put the whole loop of assignments within a BEGIN block, forcing it to occur during compilation. You really want to use a BEGIN here, not an INIT, because you're doing something that you want the compiler itself to notice right away, not something for the interpreter to do just before your program runs.
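Putting that together, the compile-time version of the loop might look like this sketch, assuming as above that each generated color function wraps its argument in an HTML font tag:

```perl
use strict;

BEGIN {
    # runs during compilation, so the ($) prototypes are already
    # in place by the time calls to red( ), blue( ), etc. are compiled
    for my $name (qw(red blue green yellow orange purple violet)) {
        no strict 'refs';
        *$name = sub ($) { "<FONT COLOR='$name'>$_[0]</FONT>" };
    }
}

print red("careful here"), "\n";
```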
10.14.4 See Also

The sections on "Symbol Tables" in Chapter 10 of Programming Perl and in perlmod(1); the sections on "Closures" and "Symbol Table References" in Chapter 8 of Programming Perl; the discussion of closures in perlref(1); Recipe 10.11; Recipe 11.4

Recipe 10.15 Trapping Undefined Function Calls with AUTOLOAD

10.15.1 Problem

You want to intercept calls to undefined functions so you can handle them gracefully.

10.15.2 Solution

Declare a function called AUTOLOAD for the package whose undefined function calls you'd like to trap. While running, that package's $AUTOLOAD variable contains the name of the undefined function being called.

10.15.3 Discussion

Another strategy for creating similar functions is to use a proxy function. If you call an undefined function, instead of automatically raising an exception, you can trap the call. If the function's package has a function named AUTOLOAD, then this function is called in its place, with the special package global $AUTOLOAD set to the package-qualified function name. The AUTOLOAD subroutine can then do whatever that function would do.

    sub AUTOLOAD {
        my $color = our $AUTOLOAD;
        $color =~ s/.*:://;
        return "<FONT COLOR='$color'>@_</FONT>";
    }

    # note: sub chartreuse isn't defined.
    print chartreuse("stuff");

When the nonexistent main::chartreuse function is called, rather than raising an exception, main::AUTOLOAD is called with the same arguments as you passed chartreuse. The package variable $AUTOLOAD would contain the string main::chartreuse, because that's the function it's proxying. The technique using typeglob assignments shown in Recipe 10.14 is faster and more flexible than using AUTOLOAD. It's faster because you don't have to run the copy and substitute.
It's more flexible because it lets you do this:

    {
        local *yellow = \&violet;
        local (*red, *green) = (\&green, \&red);
        print_stuff( );
    }

While print_stuff( ) is running, including from within any functions it calls, anything printed in yellow will come out violet, and the red and green texts will exchange colors. Aliasing subroutines like this won't handle calls to undefined subroutines. AUTOLOAD does.

10.15.4 See Also

The section on "Autoloading" in Chapter 10 of Programming Perl and in perlsub(1); the documentation for the standard modules AutoLoader and AutoSplit; Recipe 10.12; Recipe 12.11; Recipe 13.12

Recipe 10.16 Nesting Subroutines

10.16.1 Problem

You want subroutines to nest, such that one subroutine is visible and callable only from another. When you try the obvious approach of nesting sub FOO { sub BAR { } ... }, Perl gives warnings about variables that will not stay shared.

10.16.2 Solution

Instead of making the inner functions normal subroutines, make them closures and temporarily assign their references to the typeglob of the right name to create a localized function.

10.16.3 Discussion

If you use nested subroutines in other programming languages with their own private variables, you'll have to work at it a bit in Perl. The intuitive coding of this kind of thing gives the warning "will not stay shared." For example, this won't work:

    sub outer {
        my $x = $_[0] + 35;
        sub inner { return $x * 19 }    # WRONG
        return $x + inner( );
    }

The following is a workaround:

    sub outer {
        my $x = $_[0] + 35;
        local *inner = sub { return $x * 19 };
        return $x + inner( );
    }

Now inner( ) can be called only from within outer( ) because of the temporary assignment of the closure. Once called, it has normal access to the lexical variable $x from the scope of outer( ). This essentially creates a function local to another function, something not directly supported in Perl; however, the programming isn't always clear.
10.16.4 See Also

The sections on "Symbol Tables" in Chapter 10 of Programming Perl and in perlmod(1); the sections on "Closures" and "Symbol Table References" in Chapter 8 of Programming Perl and the discussion of closures in perlref(1); Recipe 10.13; Recipe 11.4

Recipe 10.17 Writing a Switch Statement

10.17.1 Problem

You want to write a multiway branch statement, much as you can in C using its switch statement or in the shell using case—but Perl seems to support neither.

10.17.2 Solution

Use the Switch module, standard as of the v5.8 release of Perl.

    use Switch;
    switch ($value) {
        case 17          { print "number 17" }
        case "snipe"     { print "a snipe" }
        case /[a-f]+/i   { print "pattern matched" }
        case [1..10,42]  { print "in the list" }
        case (@array)    { print "in the array" }
        case (%hash)     { print "in the hash" }
        else             { print "no case applies" }
    }

10.17.3 Discussion

The Switch module extends Perl's basic syntax by providing a powerful and flexible switch construct. In fact, it's so powerful and flexible that instead of a complete description of how it works, we'll instead provide examples of some common uses. For the full story, make sure to consult the documentation that accompanies the module.

A switch takes an argument and a mandatory block, within which can occur any number of cases. Each of those cases also takes an argument and a mandatory block. The arguments to each case can vary in type, allowing (among many other things) any or all of string, numeric, or regex comparisons against the switch's value. When the case is an array or hash (or reference to the same), the case matches if the switch value corresponds to any of the array elements or hash keys. If no case matches, a trailing else block will be executed. Unlike certain languages' multiway branching constructs, here once a valid case is found and its block executed, control transfers out of the enclosing switch. In other words, there's no implied fall-through behavior the way there is in C.
This is considered desirable because even the best of programmers will occasionally forget about fall-through. However, this is Perl, so you can have your cake and eat it, too. Just use a next from within a switch to transfer control to the next case. Consider:

    %traits = (pride => 2, sloth => 3, hope => 14);
    switch (%traits) {
        case "impatience"                   { print "Hurry up!\n";       next }
        case ["laziness","sloth"]           { print "Maybe tomorrow!\n"; next }
        case ["hubris","pride"]             { print "Mine's best!\n";    next }
        case ["greed","cupidity","avarice"] { print "More more more!";   next }
    }

    Maybe tomorrow!
    Mine's best!

Because each case has a next, it doesn't just do the first one it finds, but goes on for further tests. The next can be conditional, too, allowing for conditional fall through. You might have noticed something else interesting about that previous example: the argument to the switch wasn't a scalar; it was the %traits hash. It turns out that you can switch on other things than scalars. In fact, both case and switch accept nearly any kind of argument. The behavior varies depending on the particular combination. Here, the strings from each of those cases are taken as keys to index into the hash we're switching on. If you find yourself preferring fall-through as the default, you can have that, too:

    use Switch 'fallthrough';
    %traits = (pride => 2, sloth => 3, hope => 14);
    switch (%traits) {
        case "impatience"                   { print "Hurry up!\n" }
        case ["laziness","sloth"]           { print "Maybe tomorrow!\n" }
        case ["hubris","pride"]             { print "Mine's best!\n" }
        case ["greed","cupidity","avarice"] { print "More more more!" }
    }

One area where a bunch of cascading ifs would still seem to excel is when each test involves a different expression, and those expressions are more complex than a simple string, numeric, or pattern comparison.
For example:

    if    ($n % 2 == 0) { print "two " }
    elsif ($n % 3 == 0) { print "three " }
    elsif ($n % 5 == 0) { print "five " }
    elsif ($n % 7 == 0) { print "seven " }

Or if you want more than one test to be able to apply, you can do this with fall-through behavior:

    if ($n % 2 == 0) { print "two " }
    if ($n % 3 == 0) { print "three " }
    if ($n % 5 == 0) { print "five " }
    if ($n % 7 == 0) { print "seven " }

Perl's switch can handle this too, but you have to be a bit more careful. For a case item to be an arbitrary expression, wrap that expression in a subroutine. That subroutine is called with the switch argument as the subroutine's argument. If the subroutine returns a true value, then the case is satisfied.

    use Switch 'fallthrough';
    $n = 30;
    print "Factors of $n include: ";
    switch ($n) {
        case sub{$_[0] % 2 == 0} { print "two " }
        case sub{$_[0] % 3 == 0} { print "three " }
        case sub{$_[0] % 5 == 0} { print "five " }
        case sub{$_[0] % 7 == 0} { print "seven " }
    }

That's pretty cumbersome to write—and to read—but with a little bit of highly magical syntactic sugar, even that clumsiness goes away. If you import the __ subroutine (yes, that really is a double underscore), you can use that in an expression as the case target, and the __ will represent the value being switched on. For example:

    use Switch qw( __ fallthrough );
    $n = 30;
    print "Factors of $n include: ";
    switch ($n) {
        case __ % 2 == 0 { print "two " }
        case __ % 3 == 0 { print "three " }
        case __ % 5 == 0 { print "five " }
        case __ % 7 == 0 { print "seven " }
    }
    print "\n";

Due to the way that __ is implemented, some restrictions on its use apply. The main one is that your expression can't use && or || in it. Here's one final trick with switch. This time, instead of having a scalar in the switch and subroutines in the cases, let's do it the other way around.
You can switch on a subroutine reference; each case value will be passed into that subroutine, and if the sub returns a true value, then the case is deemed to have matched and its code block executed. That makes the factor example read:

    use Switch qw(fallthrough);
    $n = 30;
    print "Factors of $n include: ";
    switch (sub {$n % $_[0] == 0} ) {
        case 2 { print "two " }
        case 3 { print "three " }
        case 5 { print "five " }
        case 7 { print "seven " }
    }

This is probably the most aesthetically pleasing way of writing it, since there's no longer duplicate code on each line. The Switch module uses a facility called source filters to emulate behavior anticipated in Perl 6 (whenever that might be). This has been known to cause mysterious compilation errors if you use constructs in your code you were warned against. You should therefore pay very close attention to the section on "Dependencies, Bugs, and Limitations" in the Switch manpage.

10.17.4 See Also

The documentation for the Switch module; the perlsyn(1) manpage's section on "Basic BLOCKs and Switch Statements"; the section on "Case Statements" in Chapter 4 of Programming Perl

Recipe 10.18 Program: Sorting Your Mail

The program in Example 10-1 sorts a mailbox by subject by reading input a paragraph at a time, looking for one with a "From " at the start of a line. When it finds one, it searches for the subject, strips it of any "Re: " marks, and stores its lowercased version in the @sub array. Meanwhile, the messages themselves are stored in a corresponding @msgs array. The $msgno variable keeps track of the message number.

Example 10-1. bysub1

    #!/usr/bin/perl
    # bysub1 - simple sort by subject
    my(@msgs, @sub);
    my $msgno = -1;
    $/ = '';            # paragraph reads
    while (<>) {
        if (/^From/m) {
            /^Subject:\s*(?:Re:\s*)*(.*)/mi;
            $sub[++$msgno] = lc($1) || '';
        }
        $msgs[$msgno] .= $_;
    }
    for my $i (sort { $sub[$a] cmp $sub[$b] || $a <=> $b } (0 .. $#msgs)) {
        print $msgs[$i];
    }

That sort is only sorting array indices.
If the subjects are the same, cmp returns 0, so the second part of the || is taken, which compares the message numbers in the order they originally appeared. If sort were fed a list like (0,1,2,3), that list would get sorted into a different permutation, perhaps (2,1,3,0). We iterate across them with a for loop to print out each message.

Example 10-2 shows how an awk programmer might code this program, using the -00 switch to read paragraphs instead of lines.

Example 10-2. bysub2

    #!/usr/bin/perl -n00
    # bysub2 - awkish sort-by-subject
    INIT { $msgno = -1 }
    $sub[++$msgno] = (/^Subject:\s*(?:Re:\s*)*(.*)/mi)[0] if /^From/m;
    $msg[$msgno] .= $_;
    END {
        print @msg[ sort { $sub[$a] cmp $sub[$b] || $a <=> $b } (0 .. $#msg) ];
    }

Perl programmers have used parallel arrays like this since Perl 1. Keeping each message in a hash is a more elegant solution, though. We'll sort on each field in the hash, by making an anonymous hash as described in Chapter 11. Example 10-3 is a program similar in spirit to Example 10-1 and Example 10-2.

Example 10-3. bysub3

    #!/usr/bin/perl -00
    # bysub3 - sort by subject using hash records
    use strict;
    my @msgs = ( );
    while (<>) {
        push @msgs, {
            SUBJECT => /^Subject:\s*(?:Re:\s*)*(.*)/mi,
            NUMBER  => scalar @msgs,    # which msgno this is
            TEXT    => '',
        } if /^From/m;
        $msgs[-1]{TEXT} .= $_;
    }
    for my $msg (sort {
                     $a->{SUBJECT} cmp $b->{SUBJECT}
                                   ||
                     $a->{NUMBER}  <=> $b->{NUMBER}
                 } @msgs
    ) {
        print $msg->{TEXT};
    }

Once you have real hashes, adding further sorting criteria is simple. A common way to sort a folder is subject major, date minor order. The hard part is figuring out how to parse and compare dates. Date::Manip does this, returning a string you can compare; however, the datesort program in Example 10-4, which uses Date::Manip, runs more than 10 times slower than the previous one. Parsing dates in unpredictable formats is extremely slow.

Example 10-4.
datesort

    #!/usr/bin/perl -00
    # datesort - sort mbox by subject then date
    use strict;
    use Date::Manip;
    my @msgs = ( );
    while (<>) {
        next unless /^From/m;
        my $date = '';
        if (/^Date:\s*(.*)/m) {
            ($date = $1) =~ s/\s+\(.*//;    # library hates (MST)
            $date = ParseDate($date);
        }
        push @msgs, {
            SUBJECT => /^Subject:\s*(?:Re:\s*)*(.*)/mi,
            DATE    => $date,
            NUMBER  => scalar @msgs,
            TEXT    => '',
        };
    } continue {
        $msgs[-1]{TEXT} .= $_;
    }
    for my $msg (sort {
                     $a->{SUBJECT} cmp $b->{SUBJECT}
                                   ||
                     $a->{DATE}    cmp $b->{DATE}
                                   ||
                     $a->{NUMBER}  <=> $b->{NUMBER}
                 } @msgs
    ) {
        print $msg->{TEXT};
    }

Example 10-4 is written to draw attention to the continue block. When a loop's end is reached, either because it fell through to that point or got there from a next, the whole continue block is executed. It corresponds to the third portion of a three-part for loop, except that the continue block isn't restricted to an expression. It's a full block, with separate statements.

10.18.1 See Also

The sort function in Chapter 29 of Programming Perl and in perlfunc(1); the discussion of the $/ ($RS, $INPUT_RECORD_SEPARATOR) variable in Chapter 28 of Programming Perl, in perlvar(1), and in the Introduction to Chapter 8; Recipe 3.7; Recipe 4.16; Recipe 5.10; Recipe 11.9

Chapter 11. References and Records

With as little a web as this will I ensnare as great a fly as Cassio.

—Shakespeare, Othello, Act II, scene i

Introduction

Perl provides three fundamental data types: scalars, arrays, and hashes. It's certainly possible to write many programs without complex records, but most programs need something more sophisticated than simple variables and lists. Perl's three built-in types combine with references to produce arbitrarily complex and powerful data structures.
Selecting the proper data structure and algorithm can make the difference between an elegant program that does its job quickly and an ungainly concoction that's glacially slow to execute and consumes system resources voraciously. The first part of this chapter shows how to create and use plain references. The second part shows how to create higher-order data structures out of references.

References

To grasp the concept of references, you must first understand how Perl stores values in variables. Each defined variable has associated with it a name and the address of a chunk of memory. This idea of storing addresses is fundamental to references because a reference is a value that holds the location of another value. The scalar value that contains the memory address is called a reference. Whatever value lives at that memory address is called its referent. See Figure 11-1.

Figure 11-1. Reference and referent

The referent could be any built-in type (scalar, array, hash, ref, code, or glob) or a user-defined type based on one of the built-ins. Referents in Perl are typed. This means, for example, that you can't treat a reference to an array as though it were a reference to a hash. Attempting to do so raises a runtime exception. No mechanism for type casting exists in Perl. This is considered a feature.

So far, it may look as though a reference were little more than a raw address with strong typing. But it's far more than that. Perl takes care of automatic memory allocation and deallocation (garbage collection) for references, just as it does for everything else. Every chunk of memory in Perl has a reference count associated with it, representing how many places know about that referent. The memory used by a referent is not returned to the process's free pool until its reference count reaches zero. This ensures that you never have a reference that isn't valid—no more core dumps and general protection faults from mismanaged pointers as in C.
Freed memory is returned to Perl for later use, but few operating systems reclaim it and decrease the process's memory footprint. This is because most memory allocators use a stack, and if you free up memory in the middle of the stack, the operating system can't take it back without moving the rest of the allocated memory around. That would destroy the integrity of your pointers and blow XS code out of the water.

To follow a reference to its referent, preface the reference with the appropriate type symbol for the data you're accessing. For instance, if $sref is a reference to a scalar, you can say:

    print $$sref;   # prints the scalar value that the reference $sref refers to
    $$sref = 3;     # assigns to $sref's referent

To access one element of an array or hash whose reference you have, use the infix pointer-arrow notation, as in $rv->[37] or $rv->{"wilma"}. Besides dereferencing array references and hash references, the arrow is also used to call an indirect function through its reference, as in $code_ref->("arg1", "arg2"); this is discussed in Recipe 11.4. If you're using an object, use an arrow to call a method, $object->methodname("arg1", "arg2"), as shown in Chapter 13.

Perl's syntax rules make dereferencing complex expressions tricky—it falls into the category of "hard things that should be possible." Mixing right-associative and left-associative operators doesn't work out well. For example, $$x[4] is the same as $x->[4]; that is, it treats $x as a reference to an array and then extracts element number four from that. This could also have been written ${$x}[4]. If you really meant "take the fifth element of @x and dereference it as a scalar reference," then you need to use ${$x[4]}. Avoid putting two type signs ($@%&) side-by-side, unless it's simple and unambiguous like %hash = %$hashref.
In the simple cases using $$sref in the previous example, you could have written:

    print ${$sref};     # prints the scalar $sref refers to
    ${$sref} = 3;       # assigns to $sref's referent

For safety, some programmers use this notation exclusively.

When passed a reference, the ref function returns a string describing its referent. (It returns false if passed a non-reference.) This string is usually one of SCALAR, ARRAY, HASH, or CODE, although the other built-in types of GLOB, REF, IO, Regexp, and LVALUE also occasionally appear. If you call ref on a non-reference, it returns an empty string. If you call ref on an object (a reference whose referent has been blessed), it returns the class the object was blessed into: CGI, IO::Socket, or even ACME::Widget.

Create references in Perl by using a backslash on things already there, or dynamically allocate new things using the [ ], { }, and sub { } composers. The backslash operator is simple to use: put it before whatever you want a reference to. For instance, if you want a reference to the contents of @array, just say:

    $aref = \@array;

You can even create references to constant values; future attempts to change the value of the referent cause a runtime exception:

    $pi = \3.14159;
    $$pi = 4;       # runtime error

Anonymous Data

Using a backslash to produce references to existing, named variables is simple enough for implementing pass-by-reference semantics in subroutine calls, but for creating complex data structures on the fly, it quickly becomes cumbersome. You don't want to be bogged down by having to invent a unique name for each subsection of the large, complex data structure. Instead, you allocate new, nameless arrays and hashes (or scalars or functions) on demand, growing your structure dynamically. Explicitly create anonymous arrays and hashes with the [ ] and { } composers.
This notation allocates a new array or hash, initializes it with any data values listed between the pair of square or curly brackets, and returns a reference to the newly allocated aggregate:

    $aref = [ 3, 4, 5 ];                            # new anonymous array
    $href = { "How" => "Now", "Brown" => "Cow" };   # new anonymous hash

Perl also implicitly creates anonymous data types through autovivification. This occurs when you indirectly store data through a variable that's currently undefined; that is, you treat that variable as though it holds the reference type appropriate for that operation. When you do so, Perl allocates the needed array or hash and stores its reference in the previously undefined variable.

    undef $aref;
    @$aref = (1, 2, 3);
    print $aref;
    ARRAY(0x80c04f0)

See how we went from an undefined variable to one with an array reference in it without explicitly assigning that reference? Perl filled in the reference for us. This property lets code like the following work correctly, even as the first statement in your program, all without declarations or allocations:

    $a[4][23][53][21] = "fred";
    print $a[4][23][53][21];
    fred
    print $a[4][23][53];
    ARRAY(0x81e2494)
    print $a[4][23];
    ARRAY(0x81e0748)
    print $a[4];
    ARRAY(0x822cd40)

Table 11-1 shows mechanisms for producing references to both named and anonymous scalars, arrays, hashes, functions, and typeglobs. (See the discussion of filehandle autovivification in the Introduction to Chapter 7 for a discussion of anonymous filehandles.)

Table 11-1. Syntax for named and anonymous values

    Reference to    Named         Anonymous
    Scalar          \$scalar      \do{my $anon}
    Array           \@array       [ LIST ]
    Hash            \%hash        { LIST }
    Code            \&function    sub { CODE }
    Glob            \*symbol      open(my $handle, ...); $handle

Figure 11-2 and Figure 11-3 illustrate the differences between named and anonymous values. Figure 11-2 shows named values, and Figure 11-3 shows anonymous ones.

Figure 11-2. Named values
Figure 11-3. Anonymous values

In other words, saying $a = \$b makes $$a and $b the same piece of memory.
If you say $$a = 3, then $b is set to 3, even though you only mentioned $a by name, not $b.

All references evaluate to true when used in Boolean context. That way a subroutine that normally returns a reference can indicate an error by returning undef.

    sub cite {
        my (%record, $errcount);
        ...
        return $errcount ? undef( ) : \%record;
    }
    $op_cit = cite($ibid) or die "couldn't make a reference";

Without an argument, undef produces an undefined value. But passed a variable or function as its argument, the undef operator renders that variable or function undefined when subsequently tested with the defined function. However, this does not necessarily free memory, call object destructors, etc. It just decrements its referent's reference count by one.

    my ($a, $b) = ("Thing1", "Thing2");
    $a = \$b;
    undef $b;

Memory isn't freed yet, because you can still reach "Thing2" indirectly using its reference in $a. "Thing1", however, is completely gone, having been recycled as soon as $a was assigned \$b.

Although memory allocation in Perl is sometimes explicit and sometimes implicit, memory deallocation is nearly always implicit. You don't routinely have cause to undefine variables. Just let lexical variables (those declared with my) evaporate when their scope terminates; the next time you enter that scope, those variables will be new again. For global variables (those declared with our, fully qualified by their package name, or imported from a different package) that you want reset, it normally suffices to assign the empty list to an aggregate variable or a false value to a scalar one.

It has been said that there exist two opposing schools of thought regarding memory management in programming. One school holds that memory management is too important a task to be left to the programming language, while the other judges it too important to be left to the programmer. Perl falls solidly in the second camp, since if you never have to remember to free something, you can never forget to do so.
As a rule, you need rarely concern yourself with freeing any dynamically allocated storage in Perl,[1] because memory management—garbage collection, if you would—is fully automatic. Recipe 11.15 and Recipe 13.13, however, illustrate exceptions to this rule.

[1] External subroutines compiled in C notwithstanding.

Records

The predominant use of references in Perl is to circumvent the restriction that arrays and hashes may hold scalars only. References are scalars, so to make an array of arrays, make an array of array references. Similarly, hashes of hashes are implemented as hashes of hash references, arrays of hashes as arrays of hash references, hashes of arrays as hashes of array references, and so on.

Once you have these complex structures, you can use them to implement records. A record is a single logical unit comprising various different attributes. For instance, a name, an address, and a birthday might compose a record representing a person. C calls such things structs, and Pascal calls them RECORDs. Perl doesn't have a particular name for these because you can implement this notion in different ways. The most common technique in Perl is to treat a hash as a record, where the keys of the hash are the record's field names and the values of the hash are those fields' values. For instance, we might create a "person" record like this:

    $person = {
        "Name"     => "Leonhard Euler",
        "Address"  => "1729 Ramanujan Lane\nMathworld, PI 31416",
        "Birthday" => 0x5bb5580,
    };

Because $person is a scalar, it can be stored in an array or hash element, thus creating groups of people. Now apply the array and hash techniques from Chapter 4 and Chapter 5 to sort the sets, merge hashes, pick a random record, and so on. The attributes of a record, including the "person" record, are always scalars. You can certainly use numbers as readily as strings there, but that's no great trick. The real power play happens when you use even more references for values in the record.
"Birthday", for instance, might be stored as an anonymous array with three elements: day, month, and year. You could then say $person->{"Birthday"}->[0] to access just the day field. Or a date might be represented as a hash record, which would then lend itself to access such as $person->{"Birthday"}->{"day"}. Adding references to your collection of skills makes possible many more complex and useful programming strategies.

At this point, we've conceptually moved beyond simple records. We're now creating elaborate data structures that represent complicated relationships between the data they hold. Although we can use these to implement traditional data structures like linked lists, recipes in the second half of this chapter don't deal specifically with any particular structure. Instead, they give generic techniques for loading, printing, copying, and saving generic data structures. The final program example demonstrates creating binary trees.

See Also

Chapters 8 and 9 of Programming Perl; perlref(1), perlreftut(1), perllol(1), and perldsc(1)

Recipe 11.1 Taking References to Arrays

11.1.1 Problem

You need to manipulate an array by reference.
11.1.2 Solution

To get a reference to an array:

    $aref = \@array;
    $anon_array = [1, 3, 5, 7, 9];
    $anon_copy = [ @array ];
    @$implicit_creation = (2, 4, 6, 8, 10);

To dereference an array reference, precede it with an at sign (@):

    push(@$anon_array, 11);

Or use a pointer arrow plus a bracketed subscript for a particular element:

    $two = $implicit_creation->[0];

To get the last index number by reference, or the number of items in that referenced array:

    $last_idx  = $#$aref;
    $num_items = @$aref;

Or defensively embracing and forcing context:

    $last_idx  = $#{ $aref };
    $num_items = scalar @{ $aref };

11.1.3 Discussion

Here are array references in action:

    # check whether $someref contains a simple array reference
    if (ref($someref) ne "ARRAY") {
        die "Expected an array reference, not $someref\n";
    }

    print "@{$array_ref}\n";          # print original data
    @order = sort @{ $array_ref };    # sort it
    push @{ $array_ref }, $item;      # append new element to orig array

If you can't decide whether to use a reference to a named array or to create a new one, here's a simplistic guideline that will prove right more often than not. Only take a reference to an existing array either to return the reference out of scope, thereby creating an anonymous array, or to pass the array by reference to a function. For virtually all other cases, use [@array] to create a new array reference with a copy of the old values.

Automatic reference counting and the backslash operator make a powerful combination:

    sub array_ref {
        my @array;
        return \@array;
    }

    $aref1 = array_ref( );
    $aref2 = array_ref( );

Each time array_ref is called, the function allocates a new piece of memory for @array. If we hadn't returned a reference to @array, its memory would have been freed when its block, the subroutine, ended. But here a reference to @array is still accessible, so Perl doesn't free that storage, and we wind up with a reference to a piece of memory no longer accessible through the symbol table.
Such a piece of memory is called anonymous because it has no name associated with it. To access a particular element of the array referenced by $aref, you could write $$aref[4], but $aref->[4] is the same thing, and clearer.

    print $array_ref->[$N];     # access item in position N (best)
    print $$array_ref[$N];      # same, but confusing
    print ${$array_ref}[$N];    # same, but still confusing, and ugly to boot

If you have an array reference, you can only access a slice of the referenced array in this way:

    @$pie[3..5];      # array slice, but a little confusing to read
    @{$pie}[3..5];    # array slice, easier (?) to read

Array slices, even when accessed through array references, are assignable. In the next line, the array dereference happens first, then the slice:

    @{$pie}[3..5] = ("blackberry", "blueberry", "pumpkin");

An array slice is just syntactic sugar for a list of individual array elements. Because you can't take a reference to a list, you can't take a reference to an array slice:

    $sliceref = \@{$pie}[3..5];    # WRONG!

To iterate through the entire array, loop with foreach or for:

    foreach $item ( @{$array_ref} ) {
        # $item has data
    }

    for ($idx = 0; $idx <= $#{ $array_ref }; $idx++) {
        # $array_ref->[$idx] has data
    }

11.1.4 See Also

Chapters 8 and 9 of Programming Perl; perlref(1), perlreftut(1), and perllol(1); Recipe 2.13; Recipe 4.6

Recipe 11.2 Making Hashes of Arrays

11.2.1 Problem

For each key in a hash, only one scalar value is allowed, but you'd like to use one key to store and retrieve multiple values. That is, you'd like the value to produce a list.

11.2.2 Solution

Use references to arrays as the hash values. Use push to append:

    push(@{ $hash{"KEYNAME"} }, "new value");

Then, dereference the value as an array reference when printing out the hash:

    foreach $string (keys %hash) {
        print "$string: @{$hash{$string}}\n";
    }

11.2.3 Discussion

You can only store scalar values in a hash. References, however, are scalars.
This solves the problem of storing multiple values for one key by making $hash{$key} a reference to an array containing the values for $key. Normal hash operations acting on individual scalar values (insertion, deletion, iteration, and testing for existence) are now written with array operations acting on lists of values (like push, splice, and foreach).

Here's how to give a key many values:

    $hash{"a key"} = [ 3, 4, 5 ];    # anonymous array

Once you have a key with many values, here's how to use them:

    @values = @{ $hash{"a key"} };

To append a new value to the array of values associated with a particular key, use push:

    push @{ $hash{"a key"} }, $value;

One common application of such data structures is inverting a hash that may have several keys with the same associated value. When inverted, you end up with a hash that has many values for the same key. This is addressed in Recipe 5.9. Be warned that this:

    @residents = @{ $phone2name{$number} };

causes a runtime exception under use strict because you're dereferencing an undefined reference where autovivification won't occur. You must do this instead:

    @residents = exists( $phone2name{$number} )
                     ? @{ $phone2name{$number} }
                     : ( );

11.2.4 See Also

The section on "Hashes of Arrays" in Chapter 9 of Programming Perl and in perldsc(1); the section on "Symbolic References" in Chapter 8 of Programming Perl; Recipe 5.9; the "Tie Example: Make a Hash That Always Appends" in Recipe 13.15

Recipe 11.3 Taking References to Hashes

11.3.1 Problem

You need to manipulate a hash by reference. This might be because it was passed into a function that way or because it's part of a larger data structure.

11.3.2 Solution

To get a hash reference:

    $href = \%hash;
    $anon_hash = { "key1" => "value1", "key2" => "value2", ... };
    $anon_hash_copy = { %hash };

To dereference a hash reference:

    %hash  = %$href;
    $value = $href->{$key};
    @slice = @$href{$key1, $key2, $key3};    # note: no arrow!
    @keys = keys %$href;

To check whether something is a hash reference:

    if (ref($someref) ne "HASH") {
        die "Expected a hash reference, not $someref\n";
    }

11.3.3 Discussion

This example prints out all keys and values from two predefined hashes:

    foreach $href ( \%ENV, \%INC ) {    # OR: for $href ( \(%ENV,%INC) ) {
        foreach $key ( keys %$href ) {
            print "$key => $href->{$key}\n";
        }
    }

Access slices of hashes by reference as you'd access slices of arrays by reference. For example:

    @values = @$hash_ref{"key1", "key2", "key3"};

    for $val (@$hash_ref{"key1", "key2", "key3"}) {
        $val += 7;    # add 7 to each value in hash slice
    }

11.3.4 See Also

The Introduction in Chapter 5; Chapter 8 of Programming Perl; perlref(1); Recipe 11.9

Recipe 11.4 Taking References to Functions

11.4.1 Problem

You need to manipulate a subroutine by reference. This might happen if you need to create a signal handler, a Tk callback, or a hash of function pointers.

11.4.2 Solution

To get a code reference:

    $cref = \&func;
    $cref = sub { ... };

To call a code reference:

    @returned = $cref->(@arguments);
    @returned = &$cref(@arguments);

11.4.3 Discussion

If the name of a function is func, you can produce a reference to it by prefixing its name with \&. You can also dynamically allocate anonymous functions using the sub { } notation. These code references can be stored just like any other reference. It is possible to store the name of a function in a variable, such as:

    $funcname = "thefunc";
    &$funcname( );

but that's not a very good solution for several reasons. First, it uses symbolic references, not real (hard) references, and so is forbidden under the use strict "refs" pragma. Symbolic references to variables are usually a bad idea, since they can't access lexical variables, only globals, and aren't reference counted. Second, as written it doesn't include package information, so if executed in a different package, it would try to call the wrong function.
Finally, in the odd case that the function were redefined at some point, the symbolic reference would get whatever the current definition for the function was, whereas the hard reference would still access the old definition.

Instead of placing the name of the function in the variable, use the backslash operator to create a reference to the function. This is the normal way to store a function in a variable or pass it along to another function. You can mix and match references to named functions with references to unnamed ones:

    my %commands = (
        "happy" => \&joy,
        "sad"   => \&sullen,
        "done"  => sub { die "See ya!" },
        "mad"   => \&angry,
    );

    print "How are you? ";
    chomp($string = <STDIN>);
    if ($commands{$string}) {
        $commands{$string}->( );
    } else {
        print "No such command: $string\n";
    }

If you create an anonymous function that refers to a lexical (my) variable from an enclosing scope, reference counting ensures that the lexical variable is never deallocated so long as that function reference exists:

    sub counter_maker {
        my $start = 0;
        return sub {             # this is a closure
            return $start++;     # lexical from enclosing scope
        };
    }

    $counter = counter_maker( );
    for ($i = 0; $i < 5; $i++) {
        print &$counter, "\n";
    }

Even though counter_maker has ended and $start gone out of scope, Perl doesn't free the variable, because the anonymous subroutine referenced by $counter still has a reference to $start. If we call counter_maker again, it'll return another, different anonymous subroutine reference that uses a different $start:

    $counter1 = counter_maker( );
    $counter2 = counter_maker( );
    for ($i = 0; $i < 5; $i++) {
        print &$counter1, "\n";
    }
    print &$counter1, " ", &$counter2, "\n";

    0
    1
    2
    3
    4
    5 0

Closures are often used in callback routines. In graphical and other event-based programming, you associate code with a keypress, mouse click, window expose event, etc. The code will be called much later, probably from an entirely different scope.
Variables mentioned in the closure must be available when it's finally called. To work properly, such variables must be lexicals, not globals. Another use for closures is function generators, that is, functions that create and return brand-new functions. The counter_maker function is such a function generator. Here's another simple one: sub timestamp { my $start_time = time( ); return sub { return time( ) - $start_time }; } $early = timestamp( ); sleep 20; $later = timestamp( ); sleep 10; printf "It's been %d seconds since early.\n", $early->( ); printf "It's been %d seconds since later.\n", $later->( ); It's been 30 seconds since early. It's been 10 seconds since later. Each call to timestamp generates and returns a brand-new function. The timestamp function creates a lexical called $start_time that contains the current clock time (in epoch seconds). Every time that closure is called, it returns how many seconds have elapsed since it was created by subtracting its starting time from the current time. 11.4.4 See Also The section on "Closures" in Chapter 8 of Programming Perl and the discussion on closures in perlref(1); Recipe 10.11; Recipe 11.4 [ Team LiB ] Recipe 11.5 Taking References to Scalars 11.5.1 Problem You want to create and manipulate a reference to a scalar value.
11.5.2 Solution To create a reference to a scalar variable, use the backslash operator: $scalar_ref = \$scalar; # get reference to named scalar To create a reference to an anonymous scalar value (a value that isn't in a variable), assign to a dereferenced undefined variable: undef $anon_scalar_ref; $$anon_scalar_ref = 15; This creates a reference to a constant scalar: $anon_scalar_ref = \15; Use ${...} to dereference: print ${ $scalar_ref }; # dereference it ${ $scalar_ref } .= "string"; # alter referent's value 11.5.3 Discussion If you want to create many new anonymous scalars, use a subroutine that returns a reference to a lexical variable out of scope, as explained in this chapter's Introduction: sub new_anon_scalar { my $temp; return \$temp; } Dereference a scalar reference by prefacing it with $ to get at its contents: $sref = new_anon_scalar( ); $$sref = 3; print "Three = $$sref\n"; @array = ( new_anon_scalar( ), new_anon_scalar( ) ); ${ $array[0] } = 6.02e23; ${ $array[1] } = "avocado"; print "\@array contains: ", join(", ", map { $$_ } @array ), "\n"; Notice we put braces around $array[0] and $array[1]. If we tried to say $$array[0], the tight binding of dereferencing would turn it into $array->[0]. It would treat $array as an array reference and return the element at index zero. Here are other examples where it is safe to omit the braces: $var = `uptime`; # $var holds text $vref = \$var; # $vref "points to" $var if ($$vref =~ /load/) { } # look at $var, indirectly chomp $$vref; # alter $var, indirectly As mentioned in the Introduction, you may use the ref built-in to inspect a reference for its referent's type.
Calling ref on a scalar reference returns the string "SCALAR": # check whether $someref contains a simple scalar reference if (ref($someref) ne "SCALAR") { die "Expected a scalar reference, not $someref\n"; } 11.5.4 See Also Chapters 8 and 9 of Programming Perl and perlref(1) [ Team LiB ] Recipe 11.6 Creating Arrays of Scalar References 11.6.1 Problem You want to create and manipulate an array of references to scalars. This arises when you pass variables by reference to a function so the function can change their values. 11.6.2 Solution To create such an array, either backslash each scalar in the list: @array_of_scalar_refs = ( \$a, \$b ); or simply backslash the entire list, taking advantage of the backslash operator's distributive property: @array_of_scalar_refs = \( $a, $b ); To get or set the value of an element of the list, use ${ ... }: ${ $array_of_scalar_refs[1] } = 12; # $b = 12 11.6.3 Discussion In the following examples, @array is a simple array containing references to scalars (an array of references is not a reference to an array). To access the original data indirectly, braces are mandatory. ($a, $b, $c, $d) = (1 .. 4); # initialize @array = (\$a, \$b, \$c, \$d); # refs to each scalar @array = \( $a, $b, $c, $d); # same thing! @array = map { \my $anon } 0 .. 3; # allocate 4 anon scalar refs ${ $array[2] } += 9; # $c now 12 ${ $array[ $#array ] } *= 5; # $d now 20 ${ $array[-1] } *= 5; # same; $d now 100 $tmp = $array[-1]; # using temporary $$tmp *= 5; # $d now 500 The two assignments to @array are equivalent—the backslash operator is distributive across a list. So preceding a list (including a slice or a function's return list, but not an array) with a backslash is the same as applying a backslash to everything in that list. The ensuing code changes the values of the variables whose references were stored in the array.
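As a quick check of that distributive property, here is a self-contained sketch (the variable names are invented for illustration) showing that the backslashed list really does hold references to the original scalars:

```perl
use strict;
use warnings;

my ($x, $y) = (1, 2);
my @refs = \($x, $y);        # distributive: same as (\$x, \$y)

${ $refs[0] } += 10;         # changes $x through its reference
${ $refs[1] } += 10;         # changes $y

die "not refs to the originals" unless $refs[0] == \$x and $refs[1] == \$y;
die "update did not reach \$x"  unless $x == 11;
die "update did not reach \$y"  unless $y == 12;
print "ok\n";
```

Comparing references with == is a handy way to test that two references share one referent.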
Here's how to deal with such an array without explicit indexing: use Math::Trig qw(pi); # load the constant pi foreach $sref (@array) { # prepare to change $a,$b,$c,$d ($$sref **= 3) *= (4/3 * pi); # replace with spherical volumes } This code uses the formula for the volume of a sphere, V = (4/3)πr³, treating each original value as the radius r. The $sref loop index variable is each reference in @array, and $$sref is the number itself, that is, the original variables $a, $b, $c, and $d. Changing $$sref in the loop changes those variables as well. First we replace $$sref with its cube, then multiply the resulting value by 4/3π. This takes advantage of the fact that assignment in Perl returns an lvalue, letting you chain assignment operators together as we've done using the **= and *= assignment operators. Actually, anonymous scalars are pretty useless, given that a scalar value fits in the same space as a scalar reference. That's why there's no explicit composer. Scalar references exist only to allow aliasing—which can be done in other ways. 11.6.4 See Also The section on "Assignment Operators" in Chapter 3 of Programming Perl and in perlop(1); the section on "Other Tricks You Can Do with Hard References" in Chapter 8 of Programming Perl [ Team LiB ] Recipe 11.7 Using Closures Instead of Objects 11.7.1 Problem You want records with private state, behavior, and identity, but you don't want to learn object-oriented programming to accomplish this. 11.7.2 Solution Write a function that returns (by reference) a hash of code references. These code references are closures created in the same scope, so when they execute, they'll share bindings to the same private variables. 11.7.3 Discussion Because a closure is a binding of code and data, it can implement what might be thought of as an object. Here's an example that creates and returns a hash of anonymous functions.
mkcounter takes an argument of a seed counter and returns a reference to a hash of code references that you can use to manipulate the counter indirectly. $c1 = mkcounter(20); $c2 = mkcounter(77); printf "next c1: %d\n", $c1->{NEXT}->( ); # 21 printf "next c2: %d\n", $c2->{NEXT}->( ); # 78 printf "next c1: %d\n", $c1->{NEXT}->( ); # 22 printf "last c1: %d\n", $c1->{PREV}->( ); # 21 printf "old c2: %d\n", $c2->{RESET}->( ); # 77 The code values in the hash references in $c1 and $c2 maintain their own separate state. Here's how to set that up: sub mkcounter { my $count = shift; my $start = $count; my $bundle = { "NEXT" => sub { return ++$count }, "PREV" => sub { return --$count }, "GET" => sub { return $count }, "SET" => sub { $count = shift }, "BUMP" => sub { $count += shift }, "RESET" => sub { $count = $start }, }; $bundle->{"LAST"} = $bundle->{"PREV"}; return $bundle; } Because the lexical variables used by the closures in the $bundle hash reference are returned by the function, they are not deallocated. The next time mkcounter is called, the closures get a different set of variable bindings for the same code. Because no one outside those closures can access these two variables, this assures true privacy. The assignment right before the return makes both "PREV" and "LAST" values point to the same closure. Depending on your object-oriented background, you might think of these as being two different messages, both implemented using the same method. The bundle we return is not an object in that it has no obvious inheritance and polymorphism. (Yet.) But it certainly does have state, behavior, and identity, as well as encapsulation. 11.7.4 See Also The section on "Closures" in Chapter 8 of Programming Perl and the discussion on closures in perlref(1); Recipe 11.4; Recipe 11.9; Chapter 13 [ Team LiB ] [ Team LiB ] Recipe 11.8 Creating References to Methods 11.8.1 Problem You want to store a reference to a method. 
11.8.2 Solution Create a closure that makes the proper method call on the appropriate object. 11.8.3 Discussion When you ask for a reference to a method, you're asking for more than just a raw function pointer. You also need to record which object the method needs to be called upon as the object contains the data the method will work with. The best way to do this is using a closure. Assuming $obj is lexically scoped, you can say: $mref = sub { $obj->meth(@_) }; # later... $mref->("args", "go", "here"); Even when $obj goes out of scope, the closure stored in $mref has captured it. Later when it's called indirectly, the correct object is used for the method call. Be aware that the notation: $sref = \$obj->meth; doesn't do what you probably expected. It first calls the method on that object and gives you either a reference to the return value or a reference to the last of the return values if the method returns a list. The can method from the UNIVERSAL base class, while appealing, is also unlikely to produce what you want. $cref = $obj->can("meth"); This produces a code ref to the appropriate method (should one be found), but one that carries no object information. Think of it as a raw function pointer. The information about the object is lost. That's why you need a closure to capture both the object state as well as the method to call. 11.8.4 See Also The discussion on methods in the Introduction to Chapter 13; the section on "Closures" in Chapter 8 of Programming Perl; Recipe 11.7; Recipe 13.8 [ Team LiB ] [ Team LiB ] Recipe 11.9 Constructing Records 11.9.1 Problem You want to create a record data type. 11.9.2 Solution Use a reference to an anonymous hash. 11.9.3 Discussion Suppose you wanted to create a data type that contained various data fields. The easiest way is to use an anonymous hash. 
For example, here's how to initialize and use that record: $record = { NAME => "Jason", EMPNO => 132, TITLE => "deputy peon", AGE => 23, SALARY => 37_000, PALS => [ "Norbert", "Rhys", "Phineas"], }; printf "I am %s, and my pals are %s.\n", $record->{NAME}, join(", ", @{$record->{PALS}}); Just having one of these records isn't much fun—you'd like to build larger structures. For example, you might want to create a %byname hash that you could initialize and use this way: # store record $byname{ $record->{NAME} } = $record; # later on, look up by name if ($rp = $byname{"Aron"}) { # false if missing printf "Aron is employee %d.\n", $rp->{EMPNO}; } # give jason a new pal push @{$byname{"Jason"}->{PALS}}, "Theodore"; printf "Jason now has %d pals\n", scalar @{$byname{"Jason"}->{PALS}}; That makes %byname a hash of hashes because its values are hash references. Looking up employees by name would be easy using such a structure. If we find a value in the hash, we store a reference to the record in a temporary variable, $rp, which we then use to get any field we want. We can use our existing hash tools to manipulate %byname. For instance, we could use the each iterator to loop through it in an arbitrary order: # Go through all records while (($name, $record) = each %byname) { printf "%s is employee number %d\n", $name, $record->{EMPNO}; } What about looking employees up by employee number? Just build and use another data structure, an array of hashes called @employees. If your employee numbers aren't consecutive (for instance, they jump from 1 to 159997) an array would be a bad choice. Instead, you should use a hash mapping employee number to record. For consecutive employee numbers, use an array: # store record $employees[ $record->{EMPNO} ] = $record; # lookup by id if ($rp = $employees[132]) { printf "employee number 132 is %s\n", $rp->{NAME}; } With a data structure like this, updating a record in one place effectively updates it everywhere.
For example, this gives Jason a 3.5% raise: $byname{"Jason"}->{SALARY} *= 1.035; This change is reflected in all views of these records. Remember that $byname{"Jason"} and $employees[132] both refer to the same record because the references they contain refer to the same anonymous hash. How would you select all records matching a particular criterion? This is what grep is for. Here's how to get everyone with "peon" in their titles or all 27-year-olds: @peons = grep { $_->{TITLE} =~ /peon/i } @employees; @tsevens = grep { $_->{AGE} == 27 } @employees; Each element of @peons and @tsevens is itself a reference to a record, making them arrays of hashes, like @employees. Here's how to print all records sorted in a particular order, say by age: # Go through all records foreach $rp (sort { $a->{AGE} <=> $b->{AGE} } values %byname) { printf "%s is age %d.\n", $rp->{NAME}, $rp->{AGE}; # or with a hash slice on the reference printf "%s is employee number %d.\n", @$rp{"NAME","EMPNO"}; } Rather than take time to sort them by age, you could create another view of these records, @byage. Each element in this array, $byage[27] for instance, would be an array of all records with that age. In effect, this is an array of arrays of hashes. Build it this way: # use @byage, an array of arrays of records push @{ $byage[ $record->{AGE} ] }, $record; Then you could find them all this way: for ($age = 0; $age <= $#byage; $age++) { next unless $byage[$age]; print "Age $age: "; foreach $rp (@{$byage[$age]}) { print $rp->{NAME}, " "; } print "\n"; } A similar approach is to use map to avoid the foreach loop: for ($age = 0; $age <= $#byage; $age++) { next unless $byage[$age]; printf "Age %d: %s\n", $age, join(", ", map {$_->{NAME}} @{$byage[$age]}); } 11.9.4 See Also Recipe 4.14; Recipe 11.3 [ Team LiB ] Recipe 11.10 Reading and Writing Hash Records to Text Files 11.10.1 Problem You want to read or write hash records stored in text files.
11.10.2 Solution Use a simple file format with one field per line: FieldName: Value and separate records with blank lines. 11.10.3 Discussion If you have an array of records that you'd like to store into and retrieve from a text file, you can use a simple format based on mail headers. The format's simplicity requires that the keys have neither colons nor newlines, and the values not have newlines. This code writes them out: foreach $record (@Array_of_Records) { for $key (sort keys %$record) { print "$key: $record->{$key}\n"; } print "\n"; } Reading them in is easy, too. $/ = ""; # paragraph read mode while (<>) { my @fields = split /^([^:]+):\s*/m; shift @fields; # for leading null field push(@Array_of_Records, { map /(.*)/, @fields }); } The split acts upon $_, its default second argument, which contains a full paragraph. The pattern looks for start of line (not just start of record, thanks to the /m) followed by one or more non-colons, followed by a colon and optional whitespace. When split's pattern contains parentheses, these are returned along with the values. The return values placed in @fields are in key-value order, with a leading null field we shift off. The braces in the call to push produce a reference to a new anonymous hash, which we copy @fields into. Since that array was stored in order of the needed key-value pairing, this makes for well-ordered hash contents. All you're doing is reading and writing a plain text file, so you can use related recipes for additional components. You could use Recipe 7.18 to ensure that you have clean, concurrent access; Recipe 1.18 to store colons and newlines in keys and values; and Recipe 11.3 to store more complex structures. If you are willing to sacrifice the elegance of a plain text file for a quick, random-access database of records, use a DBM file, as described in Recipe 11.14.
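Here is a round trip of that format through an in-memory filehandle (a Perl 5.8 feature), so nothing touches disk; the record contents are made up, and chomp stands in for the map /(.*)/ trick:

```perl
use strict;
use warnings;

my @records_in = (
    { Name => "Jason",  Title => "deputy peon" },
    { Name => "Sharon", Title => "chief peon" },
);

# write the records in "FieldName: Value" form to a string
my $text = "";
open my $out, ">", \$text or die "can't open in-memory file: $!";
for my $record (@records_in) {
    print $out "$_: $record->{$_}\n" for sort keys %$record;
    print $out "\n";                    # blank line separates records
}
close $out;

# read them back
my @records_out;
open my $in, "<", \$text or die "can't reopen in-memory file: $!";
{
    local $/ = "";                      # paragraph read mode
    while (<$in>) {
        my @fields = split /^([^:]+):\s*/m;
        shift @fields;                  # discard leading null field
        chomp @fields;                  # strip trailing newlines from values
        push @records_out, { @fields };
    }
}
close $in;

die "round trip failed" unless $records_out[0]{Title} eq "deputy peon";
die "round trip failed" unless $records_out[1]{Name}  eq "Sharon";
print "ok\n";
```

The in-memory handle makes the example self-checking; with a real file, only the two open calls change.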
11.10.4 See Also The split function in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 11.9; Recipe 11.13; Recipe 11.14 [ Team LiB ] Recipe 11.11 Printing Data Structures 11.11.1 Problem You want to print out a data structure. 11.11.2 Solution If the output's legibility and layout are important, write your own custom printing routine. If you are in the Perl debugger, use the x command: DB<1> $reference = [ { "foo" => "bar" }, 3, sub { print "hello, world\n" } ]; DB<2> x $reference 0 ARRAY(0x1d033c) 0 HASH(0x7b390) 'foo' => 'bar' 1 3 2 CODE(0x21e3e4) -> &main::__ANON__[(eval 15)[/usr/local/]:2] in (eval 15)[/usr/local/.../]:2-2 From within your own programs, use the Dumper function from the standard module Data::Dumper: use Data::Dumper; print Dumper($reference); Or if you'd like output formatted in the same style as the Debugger uses: use Dumpvalue; Dumpvalue->new->dumpValue($reference); 11.11.3 Discussion Sometimes you'll want to make a dedicated function for your data structure that delivers a particular output format, but often this is overkill. If you're running under the Perl debugger, the x and X commands provide nice pretty-printing. The x command is more useful because it works on both global and lexical variables, whereas X works only on globals. Pass x a reference to the data structure you want to print. DB<3> x @INC 0 ARRAY(0x807d0a8) 0 '/home/tchrist/perllib' 1 '/usr/lib/perl5/i686-linux/5.00403' 2 '/usr/lib/perl5' 3 '/usr/lib/perl5/site_perl/i686-linux' 4 '/usr/lib/perl5/site_perl' 5 '.' The standard Dumpvalue module provides the Debugger's output formatting using an object-oriented interface.
Here's an example: use Dumpvalue; Dumpvalue->new->dumpvars("main", "INC"); @INC = ( 0 '/usr/local/lib/perl5/5.8.1/OpenBSD.i386-openbsd' 1 '/usr/local/lib/perl5/5.8.1' 2 '/usr/local/lib/perl5/site_perl/5.8.1/OpenBSD.i386-openbsd' 3 '/usr/local/lib/perl5/site_perl/5.8.1' 4 '/usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd' 5 '/usr/local/lib/perl5/site_perl/5.8.0' 6 '/usr/local/lib/perl5/site_perl' 7 '.' ) %INC = ( 'Dumpvalue.pm' => '/usr/local/lib/perl5/5.8.1/Dumpvalue.pm' ... ) which is like using the V main INC command in the Debugger. All the output formatting options from the Debugger are available from Dumpvalue. Just pass Dumpvalue->new option pairs: $dobj = Dumpvalue->new(option1 => value1, option2 => value2); Options available as of v5.8.1 include arrayDepth, hashDepth, compactDump, veryCompact, globPrint, dumpDBFiles, dumpPackages, dumpReused, tick, quoteHighBit, printUndef, usageOnly, unctrl, subdump, bareStringify, and stopDbSignal. The Data::Dumper module, also included in the standard Perl distribution, has a different approach. It provides a Dumper function that takes a list of references and returns a string with a printable (and evalable) form of those references. use Data::Dumper; print Dumper(\@INC); $VAR1 = [ '/usr/local/lib/perl5/5.8.1/OpenBSD.i386-openbsd', '/usr/local/lib/perl5/5.8.1', '/usr/local/lib/perl5/site_perl/5.8.1/OpenBSD.i386-openbsd', '/usr/local/lib/perl5/site_perl/5.8.1', '/usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd', '/usr/local/lib/perl5/site_perl/5.8.0', '/usr/local/lib/perl5/site_perl', '.' ]; Data::Dumper supports a variety of output formats. Check its documentation for details.
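Since Dumper output is evalable, a structure can be round-tripped through the generated text. A small sketch (the structure is invented; $Data::Dumper::Purity is enabled so nested references come out fully reconstructable):

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Purity = 1;   # emit extra statements for shared refs

my $ref  = { label => "demo", list => [ 1, 2, 3 ] };
my $text = Dumper($ref);           # a string of Perl code: "$VAR1 = ...;"

our $VAR1;                         # the variable Dumper's output assigns to
eval $text;                        # recreate the structure from the text
die "eval failed: $@" if $@;

die "label lost"  unless $VAR1->{label} eq "demo";
die "list lost"   unless $VAR1->{list}[2] == 3;
print "ok\n";
```

This eval trick is also the basis of simple persistence schemes built on Dumper output.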
Particularly useful is the option to decompile Perl code: use Data::Dumper; $Data::Dumper::Deparse = 1; $a = sub { print "hello, world\n" }; print Dumper($a); $VAR1 = sub { print 'hello, world'; }; 11.11.4 See Also The documentation for Data::Dumper; Chapter 20 of Programming Perl or perldebug(1) [ Team LiB ] [ Team LiB ] Recipe 11.12 Copying Data Structures 11.12.1 Problem You need to copy a complex data structure. 11.12.2 Solution Use the dclone function from the standard Storable module: use Storable; $r2 = dclone($r1); 11.12.3 Discussion Two types of "copy" are sometimes confused. A surface copy (also known as shallow copy) simply copies references without creating copies of the data behind them: @original = ( \@a, \@b, \@c ); @surface = @original; A deep copy creates an entirely new structure with no overlapping references. This copies references to one layer deep: @deep = map { [ @$_ ] } @original; If @a, @b, and @c themselves contain references, the preceding map is no longer adequate. Writing your own code to deep-copy structures is laborious and rapidly becomes tiresome. The Storable module provides a function called dclone that recursively copies its argument: use Storable qw(dclone); $r2 = dclone($r1); This only works on references or blessed objects of type SCALAR, ARRAY, HASH, or CODE;[2] references of type GLOB, IO, and the more esoteric types are not supported. The safeFreeze function from the FreezeThaw module supports even these types when used in the same address space by using a reference cache that could interfere with garbage collection and object destructors under some circumstances. [2] Believe it or not, it's true. Storable can even serialize closures. See its manpage for how to unthaw these using a Safe compartment. 
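A compact comparison of a one-level copy with dclone, using a made-up structure; only the deep copy is truly independent:

```perl
use strict;
use warnings;
use Storable qw(dclone);

my $orig    = { list => [ 1, 2, 3 ] };
my $shallow = { %$orig };            # new top-level hash, shared innards
my $deep    = dclone($orig);         # fully independent copy

push @{ $deep->{list} },    4;       # private to the deep copy
push @{ $shallow->{list} }, 5;       # visible through $orig too!

die "shallow copy should share"   unless "@{ $orig->{list} }" eq "1 2 3 5";
die "deep copy not independent"   unless "@{ $deep->{list} }" eq "1 2 3 4";
print "ok\n";
```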
Because dclone takes and returns references, you must add extra punctuation if you have a hash or arrays to copy: %newhash = %{ dclone(\%oldhash) }; 11.12.4 See Also The documentation for the standard Storable and Data::Dumper modules, and for the FreezeThaw CPAN module. [ Team LiB ] [ Team LiB ] Recipe 11.13 Storing Data Structures to Disk 11.13.1 Problem You want to save your large, complex data structure to disk so you don't have to reconstruct it from scratch each time your program runs. 11.13.2 Solution Use the Storable module's store and retrieve functions: use Storable; store(\%hash, "filename"); # later on... $href = retrieve("filename"); # by ref %hash = %{ retrieve("filename") }; # direct to hash 11.13.3 Discussion The Storable module uses C functions and a binary format to walk Perl's internal data structures and lay out its data. It's more efficient than a pure Perl and string-based approach, but it's also more fragile. The store and retrieve functions expect binary data using the machine's own byte-ordering. This means files created with these functions cannot be shared across different architectures. nstore does the same job store does, but keeps data in canonical (network) byte order, at a slight speed cost: use Storable qw(nstore); nstore(\%hash, "filename"); # later ... $href = retrieve("filename"); No matter whether store or nstore was used, you need to call the same retrieve routine to restore the objects in memory. The producer must commit to portability, but the consumer doesn't have to. Code need only be changed in one place when the producer has a change of heart; the code thus offers a consistent interface for the consumer, who does not need to know or care. The store and nstore functions don't lock the files they work on. If you're worried about concurrent access, open the file yourself, lock it using Recipe 7.18, and then use store_fd or its slower but machine-independent version nstore_fd. 
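Before adding locking, the basic store-and-retrieve cycle looks like this; a sketch using a throwaway file from the standard File::Temp module:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

my (undef, $file) = tempfile(UNLINK => 1);   # temp file, removed at exit

my %hash = ( colors => [ "red", "blue" ], count => 2 );
nstore(\%hash, $file);               # network byte order: portable file

my $href = retrieve($file);          # same call whether store or nstore wrote it
die "count lost"       unless $href->{count} == 2;
die "nested data lost" unless $href->{colors}[1] eq "blue";
print "ok\n";
```

Swapping nstore for store changes only the on-disk byte order, not the retrieval code.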
Here's code to save a hash to a file, with locking. We don't open with the O_TRUNC flag because we have to wait to get the lock before we can clobber the file. use Storable qw(nstore_fd); use Fcntl qw(:DEFAULT :flock); sysopen(DF, "/tmp/datafile", O_RDWR|O_CREAT, 0666) or die "can't open /tmp/datafile: $!"; flock(DF, LOCK_EX) or die "can't lock /tmp/datafile: $!"; nstore_fd(\%hash, *DF) or die "can't store hash\n"; truncate(DF, tell(DF)); close(DF); Here's code to restore that hash from a file, with locking: use Storable qw(retrieve_fd); use Fcntl qw(:DEFAULT :flock); open(DF, "< /tmp/datafile") or die "can't open /tmp/datafile: $!"; flock(DF, LOCK_SH) or die "can't lock /tmp/datafile: $!"; $href = retrieve_fd(*DF); close(DF); With care, you can pass large data objects efficiently between processes using this strategy, since a filehandle connected to a pipe or socket is still a byte stream, just like a plain file. Unlike the various DBM bindings, Storable does not restrict you to using only hashes (or arrays, with DB_File). Arbitrary data structures, including objects, can be stored to disk. The whole structure must be read in or written out in its entirety. 11.13.4 See Also The section on "Remote Procedure Calls (RPC)" in Chapter 13 of Advanced Perl Programming, by Sriram Srinivasan (O'Reilly); Recipe 11.14 [ Team LiB ] Recipe 11.14 Transparently Persistent Data Structures 11.14.1 Problem You have a complex data structure that you want to persist outside your program. 11.14.2 Solution Use MLDBM and either (preferably) DB_File or else GDBM_File: use MLDBM qw(DB_File); use Fcntl; tie(%hash, "MLDBM", "testfile.db", O_CREAT|O_RDWR, 0666) or die "can't open tie to testfile.db: $!"; # ... act on %hash untie %hash; 11.14.3 Discussion A hash with 100,000 items in it would undoubtedly take considerable time to build. Storing this to disk, either slowly by hand or quickly with Storable, is still an expensive operation in memory and computation.
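The plain-string tie that MLDBM layers serialization on top of can be seen with the core SDBM_File module alone; a minimal sketch with throwaway file names from File::Temp:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);     # scratch directory, removed at exit

my %db;
tie(%db, "SDBM_File", "$dir/testdb", O_RDWR|O_CREAT, 0666)
    or die "can't tie: $!";
$db{answer} = "42";                  # written through to disk
untie %db;

my %db2;                             # a fresh tie sees the stored value
tie(%db2, "SDBM_File", "$dir/testdb", O_RDONLY, 0666)
    or die "can't retie: $!";
die "value did not persist" unless $db2{answer} eq "42";
untie %db2;
print "ok\n";
```

Only plain strings survive this tie; that limitation is exactly what MLDBM exists to remove.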
The DBM modules solve this by tying hashes to disk database files. Rather than reading the whole structure in at once, they only pull in what they need, when they need it. To the user, it looks like a hash that persists across program invocations. Unfortunately, the values in this persistent hash must be plain strings. You cannot readily use a database file as a backing store for a hash of hashes, a hash of arrays, and so on—just for a hash of strings. However, the MLDBM module from CPAN allows you to store references in a database. It uses Data::Dumper to stringify these references for external storage: use MLDBM qw(DB_File); use Fcntl; tie(%hash, "MLDBM", "testfile.db", O_CREAT|O_RDWR, 0666) or die "can't open tie to testfile.db: $!"; Now you can use %hash to fetch or store complex records from disk. The only drawback is that you can't access the references piecemeal. You have to pull in the reference from the database, work with it, and then store it back. # this doesn't work! $hash{"some key"}[4] = "fred"; # RIGHT $aref = $hash{"some key"}; $aref->[4] = "fred"; $hash{"some key"} = $aref; 11.14.4 See Also Recipe 11.13 [ Team LiB ] Recipe 11.15 Coping with Circular Data Structures Using Weak References 11.15.1 Problem You have an inherently self-referential data structure, so Perl's reference-based garbage collection system won't notice when that structure is no longer being used. You want to prevent your program from leaking memory. 11.15.2 Solution Convert all internal references within the data structure into weak references so they don't increment the reference count. 11.15.3 Discussion Perl's memory management system relies on an underlying reference count to know when to reclaim memory. In practice, this works fairly well except for one particular situation: when a variable directly or indirectly points at itself.
Consider: { my ($a, $b); ($a, $b) = \($b, $a); # same as (\$b, \$a); } The two underlying scalars that $a and $b represent each start out with a reference count of one apiece in the first line of the block. In the second line, those scalars are each initialized to contain a reference to the other variable; $a points to $b and vice versa. Saving a reference increments the underlying reference count on the scalars, so now both refcounts are set to two. As the block exits and those lexical variables become unreachable (by name), both refcounts are decremented by one, leaving one in each—forever. Since the refcounts can never reach zero, memory used by those two underlying scalars will never be reclaimed. You'll leak two scalars every time that block executes; if it's a loop or a subroutine, you could eventually run out of memory. The standard Devel::Peek module's Dump function shows you underlying reference counts, plus a whole lot more. This code: use Devel::Peek; $a = 42; $b = \$a; Dump $a; produces this output: SV = IV(0xd7cc4) at 0xd72b8 REFCNT = 2 FLAGS = (IOK,pIOK) IV = 42 The important thing to notice there is that the refcount is two. That's because the scalar can be reached two ways: once via the variable named $a, and then again through dereferencing $b using $$b. You can produce the same condition, even without using another variable: { my $a; $a = \$a; } This most often occurs when creating a data structure whose elements contain references that directly or indirectly point back to the initial element. Imagine a circular linked list—a ring data structure. $ring = { VALUE => undef, NEXT => undef, PREV => undef, }; $ring->{NEXT} = $ring; $ring->{PREV} = $ring; The underlying hash has an underlying refcount of three, and undeffing $ring or letting it go out of scope will decrement that count only by one, resulting in a whole hash full of memory irrecoverable by Perl. 
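As the rest of this recipe explains, the cure is the weaken() function from Scalar::Util; here is a minimal sketch of its two defining properties (the hash contents are invented):

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

my ($strong, $weak);
{
    my $data = { answer => 42 };
    $strong = $data;                 # ordinary ref: counts toward refcount
    $weak   = $data;
    weaken($weak);                   # weak ref: does not count
    die "not weakened" unless isweak($weak);
}
# $strong alone keeps the hash alive after $data goes out of scope
die "referent prematurely freed" unless $weak->{answer} == 42;

undef $strong;                       # last counting reference gone...
die "weak ref should now be undef" if defined $weak;
print "ok\n";
```

A weak reference neither keeps its referent alive nor dangles: when the referent is reclaimed, the weak reference turns to undef.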
To address this situation, Perl introduced in its v5.6 release the concept of weak references. A weak reference is just like any other regular reference (meaning a "hard" reference, not a "symbolic" one) except for two critical properties: it no longer contributes to the reference count on its referent, and when its referent is garbage collected, the weak reference itself becomes undefined. These properties make weak references perfect for data structures that hold internal references to themselves. That way those internal references do not count toward the structure's reference count, but external ones still do. Although Perl supported weak references starting in v5.6, there was no standard weaken( ) function to access them from Perl itself in the standard release. You had to go to CPAN to pull in the WeakRef module. Beginning in v5.8.1, the weaken( ) function is included standard with the Scalar::Util module. That module also provides an is_weak( ) function that reports whether its reference argument has been weakened. Here's how you would apply weak references to the ring example just given: use Scalar::Util qw(weaken); $ring = { VALUE => undef, NEXT => undef, PREV => undef, }; $ring->{NEXT} = $ring; $ring->{PREV} = $ring; weaken($ring->{NEXT}); weaken($ring->{PREV}); In Recipe 13.13, we show how to create a circular-linked list data structure that won't leak memory by employing an elaborate trick using a dummy head node and an object-oriented device called a destructor. With weak references, the code becomes much simpler. Here's the same algorithm as that recipe uses, but here using weak references to address the memory- leak issue. use Scalar::Util qw(weaken); my $COUNT = 1000; for (1..20) { my $ring = node(100_000 + $_); for my $value (1 .. 
$COUNT) { insert_value($ring, $value); } } # return a node sub node($) { my ($init_value) = @_; my $node = { VALUE => $init_value }; $node->{NEXT} = $node->{PREV} = $node; weaken($node->{NEXT}); weaken($node->{PREV}); return $node; } # $node = search_ring($ring, $value) : find $value in the ring # structure in $node sub search_ring { my ($ring, $value) = @_; my $node = $ring->{NEXT}; while ($node != $ring && $node->{VALUE} != $value) { $node = $node->{NEXT}; } return $node; } # insert_value( $ring, $value ) : insert $value into the ring structure sub insert_value { my ($ring, $value) = @_; my $node = { VALUE => $value }; weaken($node->{NEXT} = $ring->{NEXT}); weaken($ring->{NEXT}->{PREV} = $node); weaken($ring->{NEXT} = $node); weaken($node->{PREV} = $ring); ++$ring->{COUNT}; } # delete_value( $ring, $value ) : delete a node from the ring # structure by value sub delete_value { my ($ring, $value) = @_; my $node = search_ring($ring, $value); return if $node == $ring; delete_node($ring, $node); } # delete_node( $ring, $node ) : delete a node from the ring structure sub delete_node { my ($ring, $node) = @_; weaken($node->{PREV}->{NEXT} = $node->{NEXT}); weaken($node->{NEXT}->{PREV} = $node->{PREV}); --$ring->{COUNT}; } Every time we store a reference to part of the data structure within that same structure, we weaken the reference so it doesn't count toward the reference count. Otherwise our program's in-core memory footprint would have grown terrifically. You can watch that happen by adding: system("ps v$$"); within the loop on systems that support the ps(1) program. All it takes to trigger the leak is not weakening any of the four assignments in the insert_value function just shown. 11.15.4 See Also The algorithms in this recipe derive in part from pages 206-207 of Introduction to Algorithms, by Cormen, Leiserson, and Rivest (MIT Press/McGraw-Hill).
See also Recipe 13.13; the section on "Garbage Collection, Circular References, and Weak References" in Chapter 8 of Programming Perl; the documentation for the standard Devel::Peek and Scalar::Util modules

Recipe 11.16 Program: Outlines

Outlines are a simple (and thus popular) way of structuring data. The hierarchy of detail implied by an outline maps naturally to our top-down way of thinking about the world. The only problem is that it's not obvious how to represent outlined data as a Perl data structure. Take, for example, this simple outline of some musical genres:

Alternative
.Punk
..Emo
..Folk Punk
.Goth
..Goth Rock
..Glam Goth
Country
.Old Time
.Bluegrass
.Big Hats
Rock
.80s
..Big Hair
..New Wave
.60s
..British
..American

Here we use a period to indicate a subgroup. There are many different formats in which that outline could be output. For example, you might write the genres out in full:

Alternative
Alternative - Punk
Alternative - Punk - Emo
Alternative - Punk - Folk Punk
Alternative - Goth
...

You might number the sections:

1 Alternative
1.1 Punk
1.1.1 Emo
1.1.2 Folk Punk
1.2 Goth
...

or alphabetize:

Alternative
Alternative - Goth
Alternative - Goth - Glam Goth
Alternative - Goth - Goth Rock
Alternative - Punk
Alternative - Punk - Emo
...

or show inheritance:

Alternative
Punk - Alternative
Emo - Punk - Alternative
Folk Punk - Punk - Alternative
Goth - Alternative
Goth Rock - Goth - Alternative
...

These transformations are all much easier than it might seem. The trick is to represent the levels of the hierarchy as elements in an array. For example, you'd represent the third entry in the sample outline as:

@array = ("Alternative", "Goth", "Glam Goth");

Now reformatting the entry is trivial.
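As a quick sketch of just how trivial: once an entry is in array form, each of the formats above is a one-line join (the variable names here are ours):

```perl
# One outline entry, represented as its hierarchy levels:
my @array = ("Alternative", "Goth", "Glam Goth");

# The "written out in full" format is a single join:
my $full = join(" - ", @array);             # "Alternative - Goth - Glam Goth"

# The "show inheritance" format is the same join on the reversed list:
my $inherit = join(" - ", reverse @array);  # "Glam Goth - Goth - Alternative"

print "$full\n$inherit\n";
```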
There's an elegant way to parse the input file to get this array representation:

while (<>) {
    chomp;
    $tag[$in = s/\G\.//g] = $_;
    # do something with @tag[0..$in]
}

The substitution deletes leading periods from the current entry, returning how many it deleted. This number indicates the indentation level of the current entry. Alphabetizing is now simple using the Unix sort program:

$ISA = "-";
open(STDOUT, "|sort -b -t'$ISA' -df");
while (<DATA>) {
    chomp;
    $tag[$in = s/\G\.//g] = $_;
    print join(" $ISA ", @tag[0 .. $in]), "\n";
}
close STDOUT;
__END__
Alternative
.Punk
..Emo
..Folk Punk
.Goth

Numbering the outline is equally simple:

while (<DATA>) {
    chomp;
    $count[$in = s/\G\.//g]++;
    delete @count[($in+1) .. $#count];
    print join(".", @count), " $_\n";
}
__END__
Alternative
.Punk
..Emo
..Folk Punk
.Goth
..Goth Rock

Notice that renumbering is our only application where we've deleted elements from the array. This is because we're not keeping names of hierarchy levels in the array; now we're keeping counts. When we go up a level (e.g., from three levels down to a new second-level heading), we reset the counter on the old level.

Recipe 11.17 Program: Binary Trees

Because Perl's built-in data types are already powerful, high-level, dynamic data types in their own right, most code can use what's already provided. If you just want quick lookups, you nearly always want to use a simple hash. As Larry has said, "The trick is to use Perl's strengths rather than its weaknesses." However, hashes provide no inherent ordering. To traverse the hash in a particular order, you must first extract its keys and then sort them. If you find yourself doing so many times, performance will suffer, but probably not enough to justify the time required to craft a fancy algorithm. A tree structure provides ordered traversals. How do you write a tree in Perl?
First, you grab one of your favorite textbooks on data structures; the authors recommend Cormen et al., as mentioned in Other Books in the Preface. Using an anonymous hash to represent each node in the tree, translate the algorithms in the book into Perl. This is usually much more straightforward than you would imagine. The program code in Example 11-1 demonstrates an ordered binary tree implementation using anonymous hashes. Each node has three fields: a left child, a right child, and a value. The crucial property of an ordered binary tree is that at every node, all left children have values that are less than the current node value, and all right children have values that are greater. The main program does three things. First, it creates a tree with 20 random nodes. Then it shows the in-order, pre-order, and post-order traversals of that tree. Finally, it allows the user to enter a key and reports whether that key is in the tree. The insert function takes advantage of Perl's implicit pass-by-reference behavior on scalars to initialize an empty tree when asked to insert into an empty node. The assignment of the new node back to $_[0] alters the value in its caller. Although this data structure takes much more memory than a simple hash and the lookups are slower, the ordered traversals themselves are faster. A B-Tree is not a binary tree; it is a more flexible tree structure normally maintained on disk. DB_File has a BTREE interface (see DB_File(3)), and Mark-Jason Dominus has an excellent article on B-Trees in The Perl Journal, Volume 2, Issue 4, Winter 1997, pp. 35-42. If you want to learn more about binary trees, Introduction to Algorithms, by Cormen, Leiserson, and Rivest, and Algorithms in C, by Robert Sedgewick, both cover the basic material. But for a treatment of that material cast in native Perl, no book can compare with Mastering Algorithms with Perl, by Orwant, Hietaniemi, and MacDonald. The program is shown in Example 11-1. Example 11-1. 
bintree

#!/usr/bin/perl -w
# bintree - binary tree demo program
use strict;
my($root, $n);

# first generate 20 random inserts
while ($n++ < 20) { insert($root, int(rand(1000))) }

# now dump out the tree all three ways
print "Pre order:  "; pre_order($root);  print "\n";
print "In order:   "; in_order($root);   print "\n";
print "Post order: "; post_order($root); print "\n";

# prompt until EOF
for (print "Search? "; <>; print "Search? ") {
    chomp;
    my $found = search($root, $_);
    if ($found) { print "Found $_ at $found, $found->{VALUE}\n" }
    else        { print "No $_ in tree\n" }
}
exit;

#########################################
# insert given value into proper point of
# provided tree.  If no tree provided,
# use implicit pass by reference aspect of @_
# to fill one in for our caller.
sub insert {
    my($tree, $value) = @_;
    unless ($tree) {
        $tree = { };               # allocate new node
        $tree->{VALUE} = $value;
        $tree->{LEFT}  = undef;
        $tree->{RIGHT} = undef;
        $_[0] = $tree;             # $_[0] is reference param!
        return;
    }
    if    ($tree->{VALUE} > $value) { insert($tree->{LEFT},  $value) }
    elsif ($tree->{VALUE} < $value) { insert($tree->{RIGHT}, $value) }
    else                            { warn "dup insert of $value\n" }   # XXX: no dups
}

# recurse on left child,
# then show current value,
# then recurse on right child.
sub in_order {
    my($tree) = @_;
    return unless $tree;
    in_order($tree->{LEFT});
    print $tree->{VALUE}, " ";
    in_order($tree->{RIGHT});
}

# show current value,
# then recurse on left child,
# then recurse on right child.
sub pre_order {
    my($tree) = @_;
    return unless $tree;
    print $tree->{VALUE}, " ";
    pre_order($tree->{LEFT});
    pre_order($tree->{RIGHT});
}

# recurse on left child,
# then recurse on right child,
# then show current value.
sub post_order {
    my($tree) = @_;
    return unless $tree;
    post_order($tree->{LEFT});
    post_order($tree->{RIGHT});
    print $tree->{VALUE}, " ";
}

# find out whether provided value is in the tree.
# if so, return the node at which the value was found.
# cut down search time by only looking in the correct
# branch, based on current value.
sub search {
    my($tree, $value) = @_;
    return unless $tree;
    if ($tree->{VALUE} == $value) {
        return $tree;
    }
    search($tree->{ ($value < $tree->{VALUE}) ? "LEFT" : "RIGHT"}, $value)
}

Chapter 12. Packages, Libraries, and Modules

Like all those possessing a library, Aurelian was aware that he was guilty of not knowing his in its entirety.

—Jorge Luis Borges, The Theologians

Introduction

Imagine that you have two separate programs, both of which work fine by themselves, and you decide to make a third program that combines the best features from the first two. You copy both programs into a new file or cut and paste selected pieces. You find that the two programs had variables and functions with the same names that should remain separate. For example, both might have an init function or a global $count variable. When merged into one program, these separate parts would interfere with each other. The solution to this problem is packages. Perl uses packages to partition the global namespace. The package is the basis for both traditional modules and object-oriented classes. Just as directories contain files, packages contain identifiers. Every global identifier (variables, functions, file and directory handles, and formats) has two parts: its package name and the identifier proper. These two pieces are separated from one another with a double colon. For example, the variable $CGI::needs_binmode is a global variable named $needs_binmode, which resides in package CGI. Where the filesystem uses slashes to separate the directory from the filename, Perl uses a double colon. $Names::startup is the variable named $startup in the package Names, whereas $Dates::startup is the $startup variable in package Dates. Saying $startup by itself without a package name means the global variable $startup in the current package.
(This assumes that no lexical $startup variable is currently visible. Lexical variables are explained in Chapter 10.) When looking at an unqualified variable name, a lexical takes precedence over a global. Lexicals live in scopes; globals live in packages. If you really want the global instead, you need to fully qualify it.

package is a compile-time declaration that sets the default package prefix for unqualified global identifiers, much as chdir sets the default directory prefix for relative pathnames. This effect lasts until the end of the current scope (a brace-enclosed block, file, or eval). The effect is also terminated by any subsequent package statement in the same scope. (See the following code.) All programs are in package main until they use a package statement to change this.

package Alpha;
$name = "first";

package Omega;
$name = "last";

package main;
print "Alpha is $Alpha::name, Omega is $Omega::name.\n";

Alpha is first, Omega is last.

Unlike user-defined identifiers, built-in variables with punctuation names (like $_ and $.) and the identifiers STDIN, STDOUT, STDERR, ARGV, ARGVOUT, ENV, INC, and SIG are all forced to be in package main when unqualified. That way things like STDIN, @ARGV, %ENV, and $_ are always the same no matter what package you're in; for example, @ARGV always means @main::ARGV, even if you've used package to change the default package. A fully qualified @ElseWhere::ARGV would not, and carries no special built-in meaning. Make sure to localize $_ if you use it in your module.

Modules

The unit of software reuse in Perl is the module, a file containing related functions designed to be used by programs and other modules. Every module has a public interface, a set of variables and functions that outsiders are encouraged to use. From inside the module, the interface is defined by initializing certain package variables that the standard Exporter module looks at.
From outside the module, the interface is accessed by importing symbols as a side effect of the use statement. The public interface of a Perl module is whatever is documented to be public. When we talk about modules in this chapter, and traditional modules in general, we mean those that use the Exporter. The require and use statements load a module into your program, although their semantics vary slightly. require loads modules at runtime, with a check to avoid the redundant loading of a given module. use is like require, with two added properties: compile-time loading and automatic importing. Modules included with use are processed at compile time, but require processing happens at runtime. This is important because if a module needed by a program is missing, the program won't even start because the use fails during compilation of your script. Another advantage of compile-time use over runtime require is that function prototypes in the module's subroutines become visible to the compiler. This matters because only the compiler cares about prototypes, not the interpreter. (Then again, we don't usually recommend prototypes except for replacing built-in commands, which do have them.) use is suitable for giving hints to the compiler because of its compile-time behavior. A pragma is a special module that acts as a directive to the compiler to alter how Perl compiles your code. A pragma's name is always all lowercase, so when writing a regular module instead of a pragma, choose a name that starts with a capital letter. Pragmas supported by the v5.8.1 release of Perl include attributes, autouse, base, bigint, bignum, bigrat, bytes, charnames, constant, diagnostics, fields, filetest, if, integer, less, locale, open, overload, sigtrap, sort, strict, subs, utf8, vars, vmsish, and warnings. Each has its own manpage. The other difference between require and use is that use performs an implicit import on the included module's package. 
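The compile-time versus runtime distinction described above is easy to see directly with a BEGIN block, which runs during compilation just as use does (a minimal sketch; the @phases variable is ours):

```perl
our @phases;                        # package variable, so BEGIN can see it

push @phases, "run";                # an ordinary statement: executes at runtime

BEGIN { push @phases, "compile" }   # executes while the file is still being
                                    # compiled, before any runtime statement
                                    # textually above it has run

print "@phases\n";                  # prints "compile run"
```

This is exactly why a use for a missing module kills the program before it starts, while a require for the same module fails only when that statement is reached.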
Importing a function or variable from one package to another is a form of aliasing; that is, it makes two different names for the same underlying thing. It's like linking files from another directory into your current one by the command ln /somedir/somefile. Once it's linked in, you no longer have to use the full pathname to access the file. Likewise, an imported symbol no longer needs to be fully qualified by package name (or declared with our or the older use vars if a variable, or with use subs if a subroutine). You can use imported variables as though they were part of your package. If you imported $English::OUTPUT_AUTOFLUSH in the current package, you could refer to it as $OUTPUT_AUTOFLUSH.

The required file extension for a Perl module is .pm. The module named FileHandle would be stored in the file FileHandle.pm. The full path to the file depends on your include path, which is stored in the global @INC variable. Recipe 12.8 shows how to manipulate this array for your own purposes. If the module name itself contains any double colons, these are translated into your system's directory separator. That means that the File::Find module resides in the file File/Find.pm under most filesystems. For example:

require "FileHandle.pm";     # runtime load
require FileHandle;          # ".pm" assumed; same as previous
use FileHandle;              # compile-time load

require "Cards/Poker.pm";    # runtime load
require Cards::Poker;        # ".pm" assumed; same as previous
use Cards::Poker;            # compile-time load

Import/Export Regulations

The following is a typical setup for a hypothetical module named Cards::Poker that demonstrates how to manage its exports. The code goes in the file named Poker.pm within the directory Cards; that is, Cards/Poker.pm. (See Recipe 12.8 for where the Cards directory should reside.)
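That double-colon-to-directory translation is mechanical enough to sketch in a few lines (module_file is a hypothetical helper of ours, not anything Perl provides under that name):

```perl
use strict;

# module_file("Cards::Poker") returns "Cards/Poker.pm" -- the same
# translation that require performs on a bareword module name.
sub module_file {
    my ($module) = @_;
    (my $file = $module) =~ s{::}{/}g;   # double colon -> directory separator
    return "$file.pm";                   # the required extension for a module
}

print module_file("File::Find"),   "\n";   # File/Find.pm
print module_file("Cards::Poker"), "\n";   # Cards/Poker.pm
```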
Here's that file, with line numbers included for reference:

1  package Cards::Poker;
2  use Exporter;
3  @ISA = ("Exporter");
4  @EXPORT = qw(&shuffle @card_deck);
5  @card_deck = ( );     # initialize package global
6  sub shuffle { }       # fill-in definition later
7  1;                    # don't forget this

Line 1 declares the package that the module will put its global variables and functions in. Typically, a module first switches to a particular package so that it has its own place for global variables and functions, one that won't conflict with that of another program. This package name must be written exactly as in the corresponding use statement when the module is loaded. Don't say package Poker just because the basename of your file is Poker.pm. Rather, say package Cards::Poker because your users will say use Cards::Poker. This common problem is hard to debug. If you don't make the package names specified by the package and use statements identical, you won't see a problem until you try to call imported functions or access imported variables, which will be mysteriously missing.

Line 2 loads in the Exporter module, which manages your module's public interface as described later. Line 3 initializes the special, per-package array @ISA to contain the word "Exporter". When a user says use Cards::Poker, Perl implicitly calls a special method, Cards::Poker->import( ). You don't have an import method in your package, but that's okay, because the Exporter package does, and you're inheriting from it because of the assignment to @ISA (is a). Perl looks at the package's @ISA for resolution of undefined methods. Inheritance is a topic of Chapter 13. You may ignore it for now—so long as you put code like that in lines 2 and 3 into each module you write.

Line 4 assigns the list ('&shuffle', '@card_deck') to the special, per-package array @EXPORT. When someone imports this module, variables and functions listed in that array are aliased into the caller's own package.
That way they don't have to call the function Cards::Poker::shuffle(23) after the import. They can just write shuffle(23) instead. This won't happen if they load Cards::Poker with require Cards::Poker; only a use imports.

Lines 5 and 6 set up the package global variables and functions to be exported. (We presume you'll actually flesh out their initializations and definitions more than in these examples.) You're free to add other variables and functions to your module, including ones you don't put in the public interface via @EXPORT. See Recipe 12.1 for more about using the Exporter.

Finally, line 7 is a simple 1, indicating the overall return value of the module. If the last evaluated expression in the module doesn't produce a true value, an exception will be raised. Trapping this is the topic of Recipe 12.2.

Packages group and organize global identifiers. They have nothing to do with privacy. Code compiled in package Church can freely examine and alter variables in package State. Package variables are always global and are used for sharing. But that's okay, because a module is more than just a package; it's also a file, and files count as their own scope. So if you want privacy, use lexical variables instead of globals. This is the topic of Recipe 12.4.

Other Kinds of Library Files

A library is a collection of loosely related functions designed to be used by other programs. It lacks the rigorous semantics of a Perl module. The file extension .pl indicates that it's a Perl library file. Examples include getopts.pl and shellwords.pl. These are included with the standard release for compatibility with prehistoric scripts written under Perl v4 or below. Perl libraries—or in fact, any arbitrary file with Perl code in it—can be loaded in using do "some_lib.pl" or with require "some_lib.pl". The latter is preferred in most situations, because unlike do, require does implicit error checking.
It raises an exception if the file can't be found in your @INC path, doesn't compile, or if it doesn't return a true value when any initialization code is run (the last part is what the 1 was for earlier). Another advantage of require is that it keeps track of which files have already been loaded in the global hash %INC. It doesn't reload the file if %INC indicates that the file has already been read. Libraries work well when used by a program, but problems arise when libraries use one another. Consequently, simple Perl libraries have been rendered mostly obsolete, replaced by the more modern modules. But some programs still use libraries, usually loading them in with require instead of do. Other file extensions are occasionally seen in Perl. A .ph is used for C header files that have been translated into Perl libraries using the h2ph tool, as discussed in Recipe 12.17. A .xs indicates an augmented C source file, possibly created by the h2xs tool, which will be compiled by the xsubpp tool and your C compiler into native machine code. This process of creating mixed-language modules is discussed in Recipe 12.18. So far we've talked only about traditional modules, which export their interface by allowing the caller direct access to particular subroutines and variables. Most modules fall into this category. But some problems—and some programmers—lend themselves to more intricately designed modules: those involving objects. An object-oriented module seldom uses the import-export mechanism at all. Instead, it provides an object-oriented interface full of constructors, destructors, methods, inheritance, and operator overloading. This is the subject of Chapter 13. Not Reinventing the Wheel CPAN, the Comprehensive Perl Archive Network, is a gigantic repository of nearly everything about Perl you could imagine, including source, documentation, alternate ports, and above all, modules—some 4,500 of them as of spring of 2003. 
Before you write a new module, check with CPAN to see whether one already exists that does what you need. Even if one doesn't, something close enough might give you ideas. CPAN is a replicated archive, currently mirrored on nearly 250 sites. Access CPAN via http://www.cpan.org. If you just want to poke around, you can manually browse through the directories there. There are many indices, including listings of just new modules and of all modules organized by name, author, or category. A convenient alternative to picking through thousands of modules is the search engine available at http://search.cpan.org. You can search for modules by their name or author, but the facility for grepping through all registered modules' documentation is often more useful. That way you don't have to download and install a module just to see what it's supposed to do.

See Also

Chapters 10, 11, and 22 of Programming Perl; perlmod(1)

Recipe 12.1 Defining a Module's Interface

12.1.1 Problem

You want the standard Exporter module to define the external interface to your module.

12.1.2 Solution

In the module file YourModule.pm, place the following code. Fill in the ellipses as explained in the Discussion section.

package YourModule;
use strict;
our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $VERSION);
use Exporter;
$VERSION = 1.00;              # Or higher
@ISA = qw(Exporter);

@EXPORT      = qw(...);       # Symbols to autoexport (:DEFAULT tag)
@EXPORT_OK   = qw(...);       # Symbols to export on request
%EXPORT_TAGS = (              # Define names for sets of symbols
    TAG1 => [...],
    TAG2 => [...],
    ...
);

########################
# your code goes here
########################

1;                            # this should be your last line

In other files where you want to use YourModule, choose one of these lines:

use YourModule;               # Import default symbols into my package
use YourModule qw(...);       # Import listed symbols into my package
use YourModule ( );           # Do not import any symbols
use YourModule qw(:TAG1);     # Import whole tag set

12.1.3 Discussion

The standard Exporter module handles the module's external interface. Although you could define your own import method for your package, almost no one does this. When someone says use YourModule, this does a require "YourModule.pm" statement followed by a YourModule->import( ) method call, both during compile time. The import method inherited from the Exporter package looks for global variables in your package to govern its behavior. Because they must be package globals, we've declared them with our to satisfy use strict. These variables are:

$VERSION

When a module is loaded, a minimal required version number can be supplied. If the version isn't at least this high, the use will raise an exception.

use YourModule 1.86;          # If $VERSION < 1.86, fail

@EXPORT

This array contains a list of functions and variables that will be exported into the caller's own namespace so they can be accessed without being fully qualified. Typically, a qw( ) list is used.

@EXPORT = qw(&F1 &F2 @List);
@EXPORT = qw( F1  F2 @List);  # same thing

With the simple use YourModule call, the function &F1 can be called as F1( ) rather than YourModule::F1( ), and the array can be accessed as @List instead of @YourModule::List. The ampersand is optional in front of an exported function specification. To load the module at compile time but request that no symbols be exported, use the special form use YourModule ( ), with empty parentheses.

@EXPORT_OK

This array contains symbols that can be imported if they're specifically asked for.
If the array were loaded this way:

@EXPORT_OK = qw(Op_Func %Table);

then the user could load the module like so:

use YourModule qw(Op_Func %Table F1);

and import only the Op_Func function, the %Table hash, and the F1 function. The F1 function was listed in the @EXPORT array. Notice that this does not automatically import F2 or @List, even though they're in @EXPORT. To get everything in @EXPORT plus extras from @EXPORT_OK, use the special :DEFAULT tag, such as:

use YourModule qw(:DEFAULT %Table);

%EXPORT_TAGS

This hash is used by large modules like CGI or POSIX to create higher-level groupings of related import symbols. Its values are references to arrays of symbol names, all of which must be in either @EXPORT or @EXPORT_OK. Here's a sample initialization:

%EXPORT_TAGS = (
    Functions => [ qw(F1 F2 Op_Func) ],
    Variables => [ qw(@List %Table)  ],
);

An import symbol with a leading colon means to import a whole group of symbols. Here's an example:

use YourModule qw(:Functions %Table);

That pulls in all symbols from:

@{ $YourModule::EXPORT_TAGS{Functions} }

that is, it pulls in the F1, F2, and Op_Func functions and then the %Table hash. Although you don't list it in %EXPORT_TAGS, the implicit tag :DEFAULT automatically means everything in @EXPORT. You don't have to have all those variables defined in your module. You just need the ones that you expect people to be able to use.

12.1.4 See Also

The "Creating Modules" section of Chapter 11 of Programming Perl; the documentation for the standard Exporter module, also found in Chapter 32 of Programming Perl; Recipe 12.8; Recipe 12.22

Recipe 12.2 Trapping Errors in require or use

12.2.1 Problem

You need to load in a module that might not be present on your system. This normally results in a fatal exception. You want to detect and trap these failures.
12.2.2 Solution

Wrap the require or use in an eval, and wrap the eval in a BEGIN block:

# no import
BEGIN {
    unless (eval "require $mod; 1") {
        warn "couldn't require $mod: $@";
    }
}

# imports into current package
BEGIN {
    unless (eval "use $mod; 1") {
        warn "couldn't use $mod: $@";
    }
}

12.2.3 Discussion

You usually want a program to fail if it tries to load a module that is missing or doesn't compile. Sometimes, though, you'd like to recover from that error, perhaps trying an alternative module instead. As with any other exception, you insulate yourself from compilation errors with an eval. You don't want to use eval { BLOCK }, because this traps only runtime exceptions, and use is a compile-time event. Instead, you must use eval "string" to catch compile-time problems as well.

Remember, require on a bareword has a slightly different meaning than require on a variable. It adds a ".pm" and translates double-colons into your operating system's path separators, canonically / (as in URLs), but sometimes \, :, or even . on some systems.

If you need to try several modules in succession, stopping at the first one that works, you could do something like this:

BEGIN {
    my($found, @DBs, $mod);
    $found = 0;
    @DBs = qw(Giant::Eenie Giant::Meanie Mouse::Mynie Moe);
    for $mod (@DBs) {
        if (eval "require $mod") {
            $mod->import( );    # if needed
            $found = 1;
            last;
        }
    }
    die "None of @DBs loaded" unless $found;
}

We wrap the eval in a BEGIN block to ensure the module-loading happens at compile time instead of runtime.

12.2.4 See Also

The eval, die, use, and require functions in Chapter 32 of Programming Perl and in perlfunc(1); Recipe 10.12; Recipe 12.3

Recipe 12.3 Delaying use Until Runtime

12.3.1 Problem

You have a module that you don't need to load each time the program runs, or whose inclusion you wish to delay until after the program starts up.
12.3.2 Solution

Either break up the use into its separate require and import components, or else employ the use autouse pragma.

12.3.3 Discussion

Programs that check their arguments and abort with a usage message on error have no reason to load modules they never use. This delays the inevitable and annoys users. But those use statements happen during compilation, not execution, as explained in the Introduction. Here, an effective strategy is to place argument checking in a BEGIN block before loading the modules. The following is the start of a program that checks to make sure it was called with exactly two arguments, which must be whole numbers, before going on to load the modules it will need:

BEGIN {
    unless (@ARGV == 2 && (2 == grep {/^\d+$/} @ARGV)) {
        die "usage: $0 num1 num2\n";
    }
}
use Some::Module;
use More::Modules;

A related situation arises in programs that don't always use the same set of modules every time they're run. For example, the factors program from Chapter 2 needs the infinite precision arithmetic library only when the -b command-line flag is supplied. A use statement would be pointless within a conditional because it's evaluated at compile time, long before the if can be checked. So we use a require instead:

if ($opt_b) {
    require Math::BigInt;
}

Because Math::BigInt is an object-oriented module instead of a traditional one, no import was needed. If you have an import list, specify it with a qw( ) construct as you would with use. For example, rather than this:

use Fcntl qw(O_EXCL O_CREAT O_RDWR);

you might say this instead:

require Fcntl;
Fcntl->import(qw(O_EXCL O_CREAT O_RDWR));

Delaying the import until runtime means that the rest of your program is not subject to any imported semantic changes that the compiler would have seen if you'd used a use. In particular, subroutine prototypes and the overriding of built-in functions are not seen in time. You might want to encapsulate this delayed loading in a subroutine.
The following deceptively simple approach does not work:

sub load_module {
    require $_[0];    # WRONG
    import  $_[0];    # WRONG
}

It fails for subtle reasons. Imagine calling require with an argument of "Math::BigFloat". If that's a bareword, the double colon is converted into your operating system's path separator and a trailing .pm is added. But as a simple variable, it's a literal filename. Worse, Perl doesn't have a built-in import function. Instead, there's a class method named import that we're using the dubious indirect object syntax on. As with indirect filehandles, you can use indirect objects only on a plain scalar variable, a bareword, or a block returning the object, not an expression or one element from an array or hash. A better implementation might look more like:

load_module("Fcntl", qw(O_EXCL O_CREAT O_RDWR));

sub load_module {
    eval "require $_[0]";
    die if $@;
    $_[0]->import(@_[1 .. $#_]);
}

But this still isn't perfectly correct in the general case. It really shouldn't import those symbols into its own package. It should put them into its caller's package. We could account for this, but the whole procedure is getting increasingly messy.

Occasionally, the condition can be reasonably evaluated before runtime, perhaps because it uses only built-in, predefined variables, or because you've arranged to initialize the variables used in the conditional expression at compile time with a BEGIN block. If so, the if pragma comes in handy. The syntax is:

use if CONDITION, MODULE;
use if CONDITION, MODULE => ARGUMENTS;

As in:

use if $^O =~ /bsd/i, BSD::Resource;
use if $] >= 5.006_01, File::Temp => qw/tempfile tempdir/;

A convenient alternative is the use autouse pragma. This directive can save time on infrequently loaded functions by delaying their loading until they're actually used:

use autouse Fcntl => qw( O_EXCL( ) O_CREAT( ) O_RDWR( ) );

We put parentheses after O_EXCL, O_CREAT, and O_RDWR when we autoused them but not when we used them or imported them.
The autouse pragma doesn't just take function names; it can also take a prototype for the function. The Fcntl constants are prototyped to take no arguments, so we can use them as barewords in our program without use strict kvetching.

Remember, too, that use strict's checks take place at compile time. If we use Fcntl, the prototypes in the Fcntl module are compiled and we can use the constants without parentheses. If we require or wrap the use in an eval, as we did earlier, we prevent the compiler from reading the prototypes, so we can't use the Fcntl constants without parentheses. Read the autouse pragma's online documentation to learn its various caveats and provisos.

12.3.4 See Also

Recipe 12.2; the discussion on the import method in the documentation for the standard Exporter module, also found in Chapter 32 of Programming Perl; the documentation for the standard use autouse pragma

Recipe 12.4 Making Variables Private to a Module

12.4.1 Problem

You want to make a variable private to a package.

12.4.2 Solution

You can't. But you can make it private to the file that the module sits in, which usually suffices.

12.4.3 Discussion

Remember that a package is just a way of grouping variables and functions together, conferring no privacy. Anything in a package is by definition global and accessible from anywhere. Packages only group; they don't hide.

For privacy, only lexical variables will do. A module is implemented in a Module.pm file, with all its globals in the package named Module. Because that whole file is by definition a scope and lexicals are private to a scope, creating file-scoped lexicals is effectively the same thing as a module-private variable. If you alternate packages within a scope, though, you may be surprised that the scope's lexicals are still visible throughout that scope. That's because a package statement only sets a different prefix for a global identifier; it does not end the current scope, nor does it begin a new one.
    package Alpha;
    my $aa = 10;
    $x = "azure";

    package Beta;
    my $bb = 20;
    $x = "blue";

    package main;
    print "$aa, $bb, $x, $Alpha::x, $Beta::x\n";

    10, 20, , azure, blue

Was that the output you expected? The two lexicals, $aa and $bb, are still in scope because we haven't left the current block, file, or eval. You might think of globals and lexicals as existing in separate dimensions, forever unrelated to each other. Package statements have nothing to do with lexicals. By setting the current prefix, the first global variable $x is really $Alpha::x, whereas the second $x is now $Beta::x because of the intervening package statement changing the default prefix. Package identifiers, if fully qualified, can be accessed from anywhere, as we've done in the print statement.

So, packages can't have privacy—but modules can because they're in a file, which is always its own scope. Here's a simple module, placed in the file Flipper.pm, that exports two functions, flip_words and flip_boundary. The module provides code to reverse words in a line, and to change the definition of a word boundary.

    # Flipper.pm
    package Flipper;
    use strict;

    require Exporter;
    use vars qw(@ISA @EXPORT $VERSION);
    @ISA     = qw(Exporter);
    @EXPORT  = qw(flip_words flip_boundary);
    $VERSION = 1.0;

    my $Separatrix = " ";    # default to blank; must precede functions

    sub flip_boundary {
        my $prev_sep = $Separatrix;
        if (@_) { $Separatrix = $_[0] }
        return $prev_sep;
    }

    sub flip_words {
        my $line  = $_[0];
        my @words = split($Separatrix, $line);
        return join($Separatrix, reverse @words);
    }

    1;

This module sets three package variables needed by the Exporter and also initializes a lexical variable at file level called $Separatrix. Again, this variable is private to the file, not to the package. All code beneath its declaration in the same scope (or nested within that scope, as are the functions' blocks) can see $Separatrix perfectly. Even though they aren't exported, global variables could be accessed using the fully qualified name, as in $Flipper::VERSION.
A scope's lexicals cannot be examined or tinkered with from outside that scope, which in this case is the entire file below their point of declaration. You cannot fully qualify lexicals or export them either; only globals can be exported. If someone outside the module needs to look at or change the file's lexicals, they must ask the module itself. That's where the flip_boundary function comes into play, allowing indirect access to the module's private parts.

This module would work the same even if its $Separatrix variable were a package global rather than a file lexical. Someone from the outside could theoretically play with it without the module realizing this. On the other hand, if they really want to change it that badly, perhaps you should let them do so. Peppering your module with file-scoped lexicals is not necessary. You already have your own namespace (Flipper, in this case) where you can store any identifier you want. That's what it's there for, after all. Good Perl programming style nearly always avoids fully qualified identifiers.

Speaking of style, the case of identifiers used in the Flipper module was not random. Following the Perl style guide, identifiers in all capitals are reserved for those with special meaning to Perl itself. Functions and local variables are all lowercase. The module's persistent variables (either file lexicals or package globals) are capitalized. Identifiers with multiple words have each word separated by an underscore to make them easier to read. We advise against using mixed capitals without underscores—you wouldn't like reading this book without spaces, either.
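To watch that privacy in action without installing anything, the Flipper code can be inlined into a single demonstration script. This is only a sketch; the module body is squeezed into the same file, which changes nothing about how the file-scoped lexical behaves:

```perl
#!/usr/bin/perl
# flipper_demo -- Flipper inlined into one file; its file lexical
# stays private even though we switch back to package main below
use strict;

package Flipper;
my $Separatrix = " ";    # file-scoped lexical; the "private" state

sub flip_boundary {
    my $prev_sep = $Separatrix;
    if (@_) { $Separatrix = $_[0] }
    return $prev_sep;
}

sub flip_words {
    my $line  = $_[0];
    my @words = split($Separatrix, $line);
    return join($Separatrix, reverse @words);
}

package main;
print Flipper::flip_words("one two three"), "\n";   # three two one
Flipper::flip_boundary(":");     # ask the module to change its own state
print Flipper::flip_words("a:b:c"), "\n";           # c:b:a

# The lexical never shows up in the package's symbol table:
print defined($Flipper::Separatrix) ? "leaked\n" : "private\n";
```

The last line prints "private": the fully qualified name finds only an undefined package variable, because the real $Separatrix lives in the file scope's private scratchpad, not in Flipper's symbol table.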
12.4.4 See Also

The discussion on file-scoped lexicals in perlmod(1); the "Scoped Declarations" section in Chapter 4 of Programming Perl; the section on "Programming with Style" in Chapter 24 of Programming Perl or perlstyle(1); Recipe 10.2; Recipe 10.3

Recipe 12.5 Making Functions Private to a Module

12.5.1 Problem

You want to make a function private to a package.

12.5.2 Solution

You can't. But you can make a private variable and store a reference to an anonymous function in it.

    # this is the file Some_Module.pm
    package Some_Module;

    my $secret_function = sub {
        # your code here
    };

    sub regular_function {
        # now call your "private" function via the code ref
        $secret_function->(ARG1, ARG2);
    }

12.5.3 Discussion

Even a function that isn't exported can still be accessed by anyone, anywhere if they qualify that function's name with its package. That's because function names are always in the package symbol table, which is globally accessible. By creating a lexical variable at the file scope, code in that module file below the point of declaration has full access to that variable. Code in other files will not, because those scopes are unrelated.

The subroutine created via sub { .... } is anonymous, so there's no name in the symbol table for anyone outside to find. Not even other code in the module can call the function by name, since it doesn't have one, but that code can use the lexical variable to dereference the code reference indirectly:

    $secret_function->(ARGS);    # infix deref form
    &$secret_function(ARGS);     # prefix deref form

Curiously, if you really wanted to, you could give this anonymous function a temporary name.
Using the technique outlined in Recipe 10.16, assign the code reference to a localized typeglob, like this:

    sub module_function {
        local *secret = $secret_function;
        Other_Package::func1();
        secret(ARG1, ARG2);
        Yet_Another_Package::func2();
    }

Now for the duration of module_function, your previously secret function can be called using a direct function call; no indirection required. However, code outside the module can also find that function. In the example, it doesn't matter whether func1 and func2 are in the module's file scope, because you've made a temporary symbol table entry through which they could get at your secret function. Therefore, if Other_Package::func1 turned around and called Some_Module::secret, it could find it—but only if func1 were called from the module_function in the example. If it were called from some other point, there wouldn't be any secret function in the Some_Module package symbol table, so the attempted function call would fail.

This slightly peculiar behavior, where temporaries' values and visibility depend upon who called whom at runtime, is called dynamic scoping. This is the nature of the local keyword. You can see why we don't usually suggest using it.

12.5.4 See Also

Recipe 12.4; the section on "Dynamically Scoped Variables: local" in Chapter 4 of Programming Perl; the section on "Symbol Tables" in Chapter 10 of Programming Perl

Recipe 12.6 Determining the Caller's Package

12.6.1 Problem

You need to find out the current or calling package.

12.6.2 Solution

To find the current package:

    $this_pack = __PACKAGE__;

To find the caller's package:

    $that_pack = caller();

12.6.3 Discussion

The __PACKAGE__ symbol returns the package that the code is currently being compiled into. This doesn't interpolate into double-quoted strings:

    print "I am in package __PACKAGE__\n";    # WRONG!
    I am in package __PACKAGE__

Needing to figure out the caller's package arose more often in older code that received as input a string of code to be evaluated, or a filehandle, format, or directory handle name. Consider a call to a hypothetical runit function:

    package Alpha;
    runit('$line = <TEMP>');

    package Beta;
    sub runit {
        my $codestr = shift;
        eval $codestr;
        die if $@;
    }

Because runit was compiled in a different package than was currently executing, when the eval runs, it acts as though it were passed $Beta::line and Beta::TEMP. The old workaround was to include your caller's package first:

    package Beta;
    sub runit {
        my $codestr = shift;
        my $hispack = caller;
        eval "package $hispack; $codestr";
        die if $@;
    }

That approach works only when $line is a global variable. If it's lexical, that won't help at all. Instead, arrange for runit to accept a reference to a subroutine:

    package Alpha;
    runit( sub { $line = <TEMP> } );

    package Beta;
    sub runit {
        my $coderef = shift;
        &$coderef();
    }

This not only works with lexicals, but has the added benefit of checking the code's syntax at compile time, which is a major win.

If all that's being passed in is a filehandle, it's more portable to use the Symbol::qualify function. This function takes a name and a package to qualify the name into. If the name needs qualification, it fixes it; otherwise, it's left alone. But that's considerably less efficient than a * prototype. Here's an example that reads and returns n lines from a filehandle. The function qualifies the handle before working with it.
    open (FH, "<", "/etc/termcap")
        or die "can't open /etc/termcap: $!";
    ($a, $b, $c) = nreadline(3, "FH");

    use Symbol ();
    use Carp;

    sub nreadline {
        my ($count, $handle) = @_;
        my (@retlist, $line);

        croak "count must be > 0" unless $count > 0;
        $handle = Symbol::qualify($handle, (caller())[0]);
        croak "need open filehandle" unless defined fileno($handle);

        push(@retlist, $line) while defined($line = <$handle>) && $count--;
        return @retlist;
    }

If everyone who called your nreadline function passed the filehandle as a typeglob *FH, as a glob reference \*FH, or using FileHandle or IO::Handle objects, you wouldn't need to do this. It's only the possibility of a bare "FH" string that requires qualification.

12.6.4 See Also

The documentation for the standard Symbol module, also found in Chapter 32 of Programming Perl; the descriptions of the special symbols __FILE__, __LINE__, and __PACKAGE__ in perldata(1); Recipe 12.14; Recipe 7.6

Recipe 12.7 Automating Module Cleanup

12.7.1 Problem

You need to create module setup code and cleanup code that gets called automatically, without user intervention.

12.7.2 Solution

For setup code, put executable statements outside subroutine definitions in the module file. For cleanup code, use an END subroutine in that module.

12.7.3 Discussion

In some languages, the programmer must remember to call module initialization code before accessing any of that module's regular functions. Similarly, when the program is done, the programmer may have to call module-specific finalization code. Not so in Perl. For per-module initialization code, executable statements outside of any subroutines in your module suffice. When the module is loaded in, that code runs right then and there. The user never has to remember to do this, because it's done automatically.

Now, why would you want automatic cleanup code? It depends on the module.
You might want to write a shutdown message to a logfile, tell a database server to commit any pending state, refresh a screen, or return the tty to its original state. Suppose you want a module to log quietly whenever a program using it starts up or finishes. Add code in an END subroutine to run after your program finishes:

    $Logfile = "/tmp/mylog" unless defined $Logfile;
    open(LF, ">>", $Logfile)
        or die "can't append to $Logfile: $!";
    select(((select(LF), $|=1))[0]);    # unbuffer LF
    logmsg("startup");

    sub logmsg {
        my $now = scalar gmtime;
        print LF "$0 $$ $now: @_\n"
            or die "write to $Logfile failed: $!";
    }

    END {
        logmsg("shutdown");
        close(LF) or die "close $Logfile failed: $!";
    }

The first part of code, outside any subroutine declaration, is executed at module load time. The module user doesn't have to do anything special to make this happen. Someone might be unpleasantly surprised, however, if the file couldn't be accessed, since the die would make the use or require fail.

END routines work like exit handlers, such as trap 0 in the shell, atexit in C programming, or global destructors or finalizers in object-oriented languages. All of the ENDs in a program are run in the opposite order that they were loaded; that is, last seen, first run. These get called whether the program finishes through normal process termination by implicitly reaching the end of your main program, through an explicit call to the exit function, or via an uncaught exception such as die or a mistake involving division by zero.

Uncaught signals are a different matter, however. Death by signal does not run your exit handlers. The following pragma takes care of them:

    use sigtrap qw(die normal-signals error-signals);

That causes all normal signals and error signals to make your program expire via the die mechanism, effectively converting a signal into an exception and thus permitting your END handlers to run.
You can get fancier, too:

    use sigtrap qw(
        die untrapped normal-signals
        stack-trace any error-signals
    );

That says to die only on an untrapped normal signal, but for error signals, to produce a stack trace before dying—like the confess function from the Carp module.

END also isn't called when a process polymorphs itself via the exec function, because you are still in the same process, just a different program. All normal process attributes remain, like process ID and parent PID, user and group IDs, umask, current directory, environment variables, resource limits and accumulated statistics, and open file descriptors (however, see the $^F variable in perlvar(1) or Programming Perl). If it didn't work this way, exit handlers would execute redundantly in programs manually managing their fork and exec calls. This would not be good.

12.7.4 See Also

The standard use sigtrap pragma, also in Chapter 31 of Programming Perl; Chapter 18 of Programming Perl and the section on "Package Constructors and Destructors" in perlmod(1); the $^F ($SYSTEM_FD_MAX) variable in Chapter 28 of Programming Perl and in perldata(1); the fork and exec functions in Chapter 29 of Programming Perl and in perlmod(1)

Recipe 12.8 Keeping Your Own Module Directory

12.8.1 Problem

You don't want to install your own personal modules in the standard per-system extension library.

12.8.2 Solution

You have several choices: use Perl's -I command line switch; set your PERL5LIB environment variable; or employ the use lib pragma, possibly in conjunction with the FindBin module.

12.8.3 Discussion

The @INC array contains a list of directories to consult when do, require, or use pulls in code from another file.
You can print these out easily from the command line:

    % perl -e 'printf "%d %s\n", $i++, $_ for @INC'
    0 /usr/local/lib/perl5/5.8.0/OpenBSD.i386-openbsd
    1 /usr/local/lib/perl5/5.8.0
    2 /usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd
    3 /usr/local/lib/perl5/site_perl/5.8.0
    4 /usr/local/lib/perl5/site_perl/5.6.0
    5 /usr/local/lib/perl5/site_perl/5.00554
    6 /usr/local/lib/perl5/site_perl/5.005
    7 /usr/local/lib/perl5/site_perl
    8 .

The first two directories, elements 0 and 1 of @INC, are respectively the standard architecture-dependent and architecture-independent directories, which all standard libraries, modules, and pragmas will go into. You have two of them because some modules contain information or formatting that makes sense only on that particular architecture. For example, the Config module contains information that cannot be shared across several architectures, so it goes in the 0th array element. Modules that include compiled C components are also placed there. Most modules, however, go in the platform-independent directory in the 1st element.

The next pair, elements 2 and 3, fulfills roles analogous to elements 0 and 1, but on a site-specific basis. Suppose you have a module that didn't come with Perl, such as one from CPAN or that you wrote yourself. When you or (more likely) your system administrator installs this module, its components go into one of the site-specific directories. You are encouraged to use these for any modules that your entire site should be able to access conveniently.

In this particular configuration, elements 4-7 are there so that Perl can find any site-specific modules installed under a previous release of Perl. Such directories can be automatically added to @INC when you configure, build, and install a newer Perl release, making it easier to upgrade. The last standard component, "." (your current working directory), is useful only when developing and testing your software, not when deploying it.
If your modules are in the same directory that you last chdired to, you're fine. If you're anywhere else, it doesn't work.

So sometimes none of the @INC directories work out. Maybe you have your own personal modules. Perhaps your project group has particular modules that are relevant only to that project. In these cases, you need to augment the standard @INC search.

The first approach involves a command-line flag, -Idirlist. The dirlist is a colon-separated[1] list of one or more directories, which are prepended to the front of the @INC array. This works well for simple command lines, and thus can be used on a per-command basis, such as when you call a quick one-liner from a shell script.

[1] Comma-separated on Mac OS 9.

This technique should not be used in the #! (pound-bang) line. First, it's not much fun to modify each program. More importantly, some older operating systems have bugs related to how long that line can be, typically 32 characters, including the #! part. That means if you have a very long path, such as #!/opt/languages/free/extrabits/perl, you may get the mysterious "Command not found" error. Perl does its best to rescan the line manually, but this is still too dicey to rely on.

Often, a better solution is to set the PERL5LIB environment variable. This can be done in your shell start-up file. Or, your system administrator may want to do so in a systemwide start-up file so all users can benefit. For example, suppose you have all your own modules in a directory called ~/perllib. You would place one of the following lines in your shell start-up file, depending on which shell you use:

    # syntax for sh, bash, ksh, or zsh
    $ export PERL5LIB=$HOME/perllib

    # syntax for csh or tcsh
    % setenv PERL5LIB ~/perllib

Probably the most convenient solution from your users' perspective is for you to add a use lib pragma near the top of your script. That way users of the program need take no special action to run that program.
Imagine a hypothetical project called Spectre whose programs rely on its own set of libraries. Those programs could have a statement like this at their start:

    use lib "/projects/spectre/lib";

What happens when you don't know the exact path to the library? Perhaps you've installed the whole project in an arbitrary path. You could create an elaborate installation procedure to dynamically update the script, but even if you did, paths would still be frozen at installation time. If someone moved the files later, the libraries wouldn't be found.

The FindBin module conveniently solves this problem. This module tries to determine the full path to the executing script's enclosing directory, setting an importable package variable called $Bin to that directory. Typical usage is to look for modules either in the same directory as the program or in a lib directory at the same level.

To demonstrate the first case, suppose you have a program called /wherever/spectre/myprog that needs to look in /wherever/spectre for its modules, but you don't want that path hardcoded:

    use FindBin;
    use lib $FindBin::Bin;

The second case would apply if your program lives in /wherever/spectre/bin/myprog but needs to look at /wherever/spectre/lib for its modules:

    use FindBin qw($Bin);
    use lib "$Bin/../lib";

12.8.4 See Also

The documentation for the standard use lib pragma (also in Chapter 31 of Programming Perl) and the standard FindBin module; the discussion of the PERL5LIB environment variable in perl(1) and the "Environmental Variables" section of Chapter 19 of Programming Perl; your shell's syntax for setting environment variables

Recipe 12.9 Preparing a Module for Distribution

12.9.1 Problem

You want to prepare your module in standard distribution format so you can easily send your module to a friend. Better yet, you plan to contribute your module to CPAN so everyone can use it.

12.9.2 Solution

It's best to start with Perl's standard h2xs tool.
Let's say you want to make a Planets module or an Astronomy::Orbits module. You'd type:

    % h2xs -XA -n Planets
    % h2xs -XA -n Astronomy::Orbits

These commands make subdirectories called ./Planets/ and ./Astronomy/Orbits/, respectively, where you will find all the components you need to get you started. The -n flag names the module you want to make, -X suppresses creation of XS (external subroutine) components, and -A means the module won't use the AutoLoader.

12.9.3 Discussion

Writing modules is easy—once you know how. Writing a proper module is like filling out a legal contract: it's full of places to initial, sign, and date exactly right. If you miss any, it's not valid. Instead of hiring a contract lawyer, you can get a quick start on writing modules using the h2xs program. This tool gives you a skeletal module file with the right parts filled in, and it also gives you the other files needed to correctly install your module and its documentation or to bundle up for contributing to CPAN or sending off to a friend.

h2xs is something of a misnomer because XS is Perl's external subroutine interface for linking with C or C++. But the h2xs tool is also extremely convenient for preparing a distribution even when you aren't using the XS interface.

Let's look at the module file that h2xs has made. Because the module is called Astronomy::Orbits, the user specifies not use Orbits but rather use Astronomy::Orbits. Therefore an extra Astronomy subdirectory is made, under which an Orbits subdirectory is placed. Here is the first and perhaps most important line of Orbits.pm:

    package Astronomy::Orbits;

This sets the package—the default prefix—on all global identifiers (variables, functions, filehandles, etc.) in the file. Therefore a variable like @ISA is really the global variable @Astronomy::Orbits::ISA.
As we said in the Introduction, you must not make the mistake of saying package Orbits just because it's in the file Orbits.pm. The package statement in the module must exactly match the target of the use or require statement, which means the leading directory portion needs to be there and the characters' case must be the same. Furthermore, it must be installed in an Astronomy subdirectory. The h2xs command will set this all up properly, including the installation rule in the Makefile. But if you're doing this by hand, you must keep this in mind. See Recipe 12.1 for that.

If you plan to use autoloading, described in Recipe 12.11, omit the -A flag to h2xs, which produces lines like this:

    require Exporter;
    require AutoLoader;
    @ISA = qw(Exporter AutoLoader);

If your module is bilingual in Perl and C as described in Recipe 12.18, omit the -X flag to h2xs to produce lines like this:

    require Exporter;
    require DynaLoader;
    @ISA = qw(Exporter DynaLoader);

Following this is the Exporter's variables as explained in Recipe 12.1. If you're writing an object-oriented module as described in Chapter 13, you probably won't use the Exporter at all.

That's all there is for setup. Now, write your module code. When you're ready to ship it off, use the make dist directive from your shell to bundle it all up into a tar archive for easy distribution. (The name of the make program may vary from system to system.)

    % perl Makefile.PL
    % make dist

This will leave you with a file whose name is something like Astronomy-Orbits-1.03.tar.Z. To register as a CPAN developer, check out the CPAN web site.

12.9.4 See Also

The CPAN web site, to find a mirror near you and directions for submission; h2xs(1); the documentation for the standard Exporter, AutoLoader, AutoSplit, and ExtUtils::MakeMaker modules, also found in Chapter 32 of Programming Perl

Recipe 12.10 Speeding Module Loading with SelfLoader

12.10.1 Problem

You'd like to load a very large module quickly.
12.10.2 Solution

Use the SelfLoader module:

    require Exporter;
    require SelfLoader;
    @ISA = qw(Exporter SelfLoader);
    #
    # other initialization or declarations here
    #
    __DATA__
    sub abc { .... }
    sub def { .... }

12.10.3 Discussion

When you load a module using require or use, the entire module file must be read and compiled (into internal parse trees, not into byte code or native machine code) right then. For very large modules, this annoying delay is unnecessary if you need only a few functions from a particular file.

To address this problem, the SelfLoader module delays compilation of each subroutine until that subroutine is actually called. SelfLoader is easy to use: just place your module's subroutines underneath the __DATA__ marker so the compiler will ignore them, use a require to pull in the SelfLoader, and include SelfLoader in the module's @ISA array. That's all there is to it. When your module is loaded, the SelfLoader creates stub functions for all routines below __DATA__. The first time a function gets called, the stub replaces itself by first compiling the real function and then calling it.

There is one significant restriction on modules that employ the SelfLoader (or the AutoLoader for that matter, described in Recipe 12.11). SelfLoaded or AutoLoaded subroutines have no access to lexical variables in the file whose __DATA__ block they are in, because they are compiled via eval in an imported AUTOLOAD block. Such dynamically generated subroutines are therefore compiled in the scope of SelfLoader's or AutoLoader's AUTOLOAD.

Whether the SelfLoader helps or hinders performance depends on how many subroutines the module has, how large they are, and whether they are all called over the lifetime of the program or not. You should initially develop and test your module without SelfLoader. Commenting out the __DATA__ line will take care of that, making those functions visible to the compiler.
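Putting the skeleton above to work, here is a minimal sketch of a complete self-loading module; the file name Slow.pm, the package name Slow, and its single function are invented for illustration:

```perl
# Slow.pm -- demonstration module; heavy subs compile only on first call
package Slow;

require Exporter;
require SelfLoader;
our @ISA    = qw(Exporter SelfLoader);
our @EXPORT = qw(slow_square);

1;

__DATA__

# Everything below the marker is stubbed at load time and compiled
# by SelfLoader's AUTOLOAD only when first called.
sub slow_square {
    my $n = shift;
    return $n * $n;
}
```

A program that says use Slow pays almost nothing at load time; the body of slow_square is compiled only when the program actually calls slow_square(5).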
12.10.4 See Also

The documentation for the standard module SelfLoader; Recipe 12.11

Recipe 12.11 Speeding Up Module Loading with Autoloader

12.11.1 Problem

You want to use the AutoLoader module.

12.11.2 Solution

The easiest solution is to use the h2xs facility to create a directory and all the files you need. Here we assume you have your own directory, ~/perllib/, which contains your personal library modules.

    % h2xs -Xn Sample
    % cd Sample
    % perl Makefile.PL LIB=~/perllib
    % (edit Sample.pm)
    % make install

12.11.3 Discussion

The AutoLoader addresses the same performance issues as the SelfLoader. It also provides stub functions that get replaced by real ones the first time they're called. But instead of looking for functions all in the same file, hidden below a __DATA__ marker, the AutoLoader expects to find the real definition for each function in its own file. If your module had two functions, foo and bar, then the AutoLoader would expect to find them in Sample/auto/foo.al and Sample/auto/bar.al, respectively. Modules employing the AutoLoader load faster than those using the SelfLoader, but at the cost of extra files, disk space, and complexity.

This setup sounds complicated. If you were doing it manually, it probably would be. Fortunately, h2xs helps out tremendously here. Besides creating a module directory with templates for your Sample.pm file and other files you need, it also generates a Makefile that uses the AutoSplit module to break your module's functions into little files, one function per file. The make install rule installs these so they will be found automatically. All you have to do is put the module functions down below an __END__ line (rather than a __DATA__ line as in SelfLoader) that h2xs already created.

As with the SelfLoader, it's easier to develop and test your module without the AutoLoader. Just comment out the __END__ line while developing it.
The same restrictions about invisibility of file lexicals that apply to modules using the SelfLoader also apply when using the AutoLoader, so using file lexicals to maintain private state doesn't work. If state is becoming that complex and significant an issue, consider writing an object module instead of a traditional one.

12.11.4 See Also

The documentation for the standard module AutoLoader; h2xs(1); Recipe 12.10

Recipe 12.12 Overriding Built-in Functions

12.12.1 Problem

You want to replace a standard, built-in function with your own version.

12.12.2 Solution

Import that function from another module into your own namespace.

12.12.3 Discussion

Suppose you want to give a function of your own the same name as one of Perl's core built-ins. If you write:

    sub time { "it's howdy doody time" }
    print time();

then you won't get your function called—you'll still get Perl's original, built-in version. You could use an explicit ampersand to call the function:

    print &time();

because that always gets your function, never the built-in. But then you forego any prototype checking and context coercion on the function's arguments. However, there is a way to override that.

Many (but not all) of Perl's built-in functions may be overridden. This is not something to be attempted lightly, but it is possible. You might do this, for example, if you are running on a platform that doesn't support the function that you'd like to emulate. Or, you might want to add your own wrapper around the built-in.

Not all reserved words have the same status. Those that return a negative number in the C-language keyword() function in the toke.c file in your Perl source kit may be overridden.
Keywords that cannot be overridden as of v5.8.1 are defined, delete, do, else, elsif, eval, exists, for, foreach, format, glob, goto, grep, if, last, local, m, map, my, next, no, our, package, pos, print, printf, prototype, q, qq, qr, qw, qx, redo, require, return, s, scalar, sort, split, study, sub, tie, tied, tr, undef, unless, untie, until, use, while, and y. The rest can.

A standard Perl module that overrides a built-in is Cwd, which can overload chdir. Others are the by-name versions of functions that return lists: File::stat, Net::hostent, Net::netent, Net::protoent, Net::servent, Time::gmtime, Time::localtime, Time::tm, User::grent, and User::pwent. These modules all override built-in functions like stat or getpwnam to return an object that can be accessed using a name, like getpwnam("daemon")->dir. To do this, they have to override the original, list-returning versions of those functions.

Overriding may be done uniquely by importing the function from another package. This import only takes effect in the importing package, not in all possible packages. It's not enough simply to predeclare the function. You have to import it. This is a guard against accidentally redefining built-ins.

Let's say that you'd like to replace the built-in time function, whose return value is in integer seconds, with one that returns a floating-point number instead. You could make a Time::HiRes module with an optionally exported time function as follows:

    package Time::HiRes;
    use strict;
    require Exporter;
    use vars qw(@ISA @EXPORT_OK);
    @ISA       = qw(Exporter);
    @EXPORT_OK = qw(time);

    sub time() { ..... }    # TBA

Then the user who wants to use this augmented version of time would say something like:

    use Time::HiRes qw(time);
    $start = time();
    1 while print time() - $start, "\n";

This code assumes that your system has a function you can stick in the "TBA" definition shown previously.
It just so happens, however, that you don't have to figure that part out, because the Time::HiRes module (which is included standard with the Perl distribution) does indeed behave as we've outlined it here. You can import its time( ) function to get the one that is fancier than the core built-in, just as we did here. If you don't want to take the trouble to create a full module file, set up its exports, and all the rest of the rigamarole, there's a shortcut approach via the subs pragma. It works like this:

use subs qw(time);
sub time { "it's howdy doody time" }
print time( );

Now you'd get your own function, even without the ampersand. Even when you override a built-in by importing a function, that built-in is always still accessible if you fully qualify it using the (pseudo)package named CORE. Thus, even if you imported time( ) from Time::HiRes, overriding the built-in, that original built-in can be called as CORE::time( ). For overriding of methods and operators, see Chapter 13.

12.12.4 See Also

The section on "Overriding Built-in Functions" in Chapter 11 of Programming Perl and in perlsub(1); Recipe 10.11

Recipe 12.13 Overriding a Built-in Function in All Packages

12.13.1 Problem

You want to change the definition of a core built-in function within your entire program, not just the current package.

12.13.2 Solution

Manually import, via direct symbol-table manipulation, the function into the CORE::GLOBAL pseudopackage:

*CORE::GLOBAL::int = \&myown_int;

12.13.3 Discussion

The technique demonstrated in the previous recipe only overrides a built-in in a particular package. It doesn't change everything for your whole program, no matter what package that function is called from. To do so would risk changing the behavior of code from modules you didn't write, and which were therefore not prepared for the change.
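Despite that warning, the mechanics themselves are compact. Here is a minimal sketch (our own hypothetical round-to-nearest int, not a standard module): an assignment into CORE::GLOBAL at compile time redirects every later-compiled call to int, in any package.

```perl
#!/usr/bin/perl
# Sketch only: globally replace int() with round-to-nearest.
# The BEGIN block runs at compile time, so calls to int compiled
# afterward (in any package) see the override.
use strict;
use warnings;

BEGIN {
    no warnings "once";
    *CORE::GLOBAL::int = sub { sprintf("%.0f", $_[0]) + 0 };
}

print int(2.7), "\n";         # 3: rounded, not truncated
print CORE::int(2.7), "\n";   # 2: the real built-in is still reachable
```

Note that the original built-in remains available as CORE::int, just as CORE::time remained available in the previous recipe.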
It has been said that Unix was not designed to stop you from doing stupid things, because that would also stop you from doing clever things. So, too, with Perl. Just because overriding a function in all packages at once might seem, well, imprudent doesn't mean a clever person won't someday find a marvelous use for such a facility. For example, let's suppose that you've decided that the core int function's behavior of integer truncation, also known as rounding toward zero, is so annoying to your program that you want to provide an alternative by the same name. This would do it:

package Math::Rounding;
use warnings;
use Carp;
use Exporter;
our @EXPORT = qw(int);
our @ISA = qw(Exporter);

sub int(;$) {
    my $arg = @_ ? shift : $_;
    use warnings FATAL => "numeric";  # promote to die( )ing
    my $result = eval { sprintf("%.0f", $arg) };
    if ($@) {
        die if $@ !~ /isn't numeric/;
        $@ =~ s/ in sprintf.*/ in replacement int/s;
        croak $@;
    } else {
        return $result;
    }
}

Your replacement version uses sprintf( ) to round to the closest integer. It also raises an exception if passed a non-numeric string. A program could access this function either by saying:

use Math::Rounding ( );
$y = Math::Rounding::int($x);

or by importing the function and overriding the built-in:

use Math::Rounding qw(int);
$y = int($x);

However, that only manages to replace the built-in for the current package. To replace it in all packages, at some point during compile time you'll have to execute a line of code like this:

*CORE::GLOBAL::int = \&Math::Rounding::int;

The standard File::Glob module allows you to change Perl's core glob operator using special import tags:

## override the core glob, forcing case sensitivity
use File::Glob qw(:globally :case);
my @sources = <*.{c,h,y}>;

## override the core glob, forcing case insensitivity
use File::Glob qw(:globally :nocase);
my @sources = <*.{c,h,y}>;

The module does this with its own version of import that detects those tags and makes the necessary assignments.
You could do this, too. That way, this:

use Math::Rounding qw(-global int);

would make Perl use your replacement version for all calls to int from any package anywhere in your program. Here's a replacement import function that handles this:

sub import {
    if (@_ && $_[1] =~ /^-/) {
        if ($_[1] ne "-global") {
            croak "unknown import pragma";
        }
        splice(@_, 1, 1);     # discard "-global"
        no warnings "once";   # suppress "used only once" warnings
        *CORE::GLOBAL::int = \&int;
    }
    __PACKAGE__->export_to_level(1, @_);
}

The assignment happens only if the first thing to import is "-global". The last line in our import function uses part of the Exporter module's internal API to handle any normal import.

12.13.4 See Also

Recipe 12.12; the section on "Overriding Built-in Functions" in Chapter 11 of Programming Perl and in perlsub(1); the documentation for the standard File::Glob module, as well as its source code

Recipe 12.14 Reporting Errors and Warnings Like Built-ins

12.14.1 Problem

You want to generate errors and warnings in your modules, but when you use warn or die, the user sees your own filename and line number. You'd like your functions to act like built-ins and report messages from the perspective of the user's code, not your own.

12.14.2 Solution

The standard Carp module provides functions to do this. Use carp instead of warn. Use croak (for a short message) and confess (for a long message) instead of die.

12.14.3 Discussion

Like built-ins, some of your module's functions generate warnings or errors if all doesn't go well. Think about sqrt: when you pass it a negative number (and you haven't used the Math::Complex module), an exception is raised, producing a message such as "Can't take sqrt of -3 at /tmp/negroot line 17", where /tmp/negroot is the name of your own program. But if you write your own function that dies, perhaps like this:

sub even_only {
    my $n = shift;
    die "$n is not even" if $n & 1;  # one way to test
    #....
}

then the message will say it's coming from the file your even_only function was itself compiled in, rather than from the file the user was in when they called your function. That's where the Carp module comes in handy. Instead of using die, use croak instead:

use Carp;
sub even_only {
    my $n = shift;
    croak "$n is not even" if $n % 2;  # here's another
    #....
}

If you just want to complain about something, but have the message report where in the user's code the problem occurred, call carp instead of warn. For example:

use Carp;
sub even_only {
    my $n = shift;
    if ($n & 1) {       # test whether odd number
        carp "$n is not even, continuing";
        ++$n;
    }
    #....
}

Many built-ins emit warnings only when the -w command-line switch has been used. The $^W variable (which is not meant to be a control character but rather a ^ followed by a W) reflects whether that switch was used. You could choose to grouse only if the user asked for complaints:

carp "$n is not even, continuing" if $^W;

The Carp module provides a third function: confess. This works just like croak, except that it provides a full stack backtrace as it dies, reporting who called whom and with what arguments. If you're only interested in the error message from carp, croak, and friends, the longmess and shortmess functions offer those:

use Carp;
$self->transplant_organ( ) or
    $self->error( Carp::longmess("Unable to transplant organ") );

12.14.4 See Also

The warn and die functions in Chapter 29 of Programming Perl and in perlfunc(1); the documentation for the standard Carp module, also in Chapter 32 of Programming Perl; Recipe 19.2; the discussion on __WARN__ and __DIE__ in the %SIG entry of Chapter 28 of Programming Perl, in perlvar(1), and in Recipe 16.15

Recipe 12.15 Customizing Warnings

12.15.1 Problem

You would like your module to respect its caller's settings for lexical warnings, but you can't inspect the predefined $^W[2] variable to determine those settings.
[2] That's $WARNING if you've used English. 12.15.2 Solution Your module should use this pragma: use warnings::register; Then from inside your module, use the warnings::enabled function from that module as described in the Discussion to check whether the caller has warnings enabled. This works for both the old-style, global warnings and for lexical warnings set via the use warnings pragma. 12.15.3 Discussion Perl's -w command-line flag, mirrored by the global $^W variable, suffers from several problems. For one thing, it's an all-or-nothing affair, so if you turn it on for the program, module code included by that program—including code you may not have written—is also affected by it. For another, it's at best cumbersome to control compile-time warnings with it, forcing you to resort to convoluted BEGIN blocks. Finally, suppose you were interested in numeric warnings but not any other sort; you'd have to write a $SIG{_ _WARN_ _} handler to sift through all warnings to find those you did or did not want to see. Lexical warnings, first introduced in Perl v5.6, address all this and more. By lexical, we mean that their effects are constrained to the lexical scope in which use warnings or no warnings occurs. Lexical warnings pay no attention to the -w command-line switch. Now when you turn warnings on in one scope, such as the main program's file scope, that doesn't enable warnings in modules you load. You can also selectively enable or disable individual categories of warnings. For example: use warnings qw(numeric uninitialized); use warnings qw(all); no warnings qw(syntax); The warnings::register pragma permits a module to check the warnings preferences of its caller's lexical scope. The pragma also creates a new warning category, taken from the name of the current package. These user-defined warning categories are easily distinguishable from the built-in warning categories because a module's package always starts (or should always start) with an uppercase letter. 
This way lowercase warning categories, like lowercase module names, are reserved to Perl itself. Built-in warnings categories are organized into several groups. The all category means all built-in warnings categories, including subcategories such as unsafe, io, syntax, etc. (see Figure 12-1). The syntax category comprises particular warnings categories, such as ambiguous, precedence, and deprecated. These can be added and subtracted at will, but order matters:

Figure 12-1. Warnings categories

use warnings;               # turn on all warnings
no warnings "syntax";       # turn off the syntax group
use warnings "deprecated";  # but turn back on deprecated warnings

Back to your module. Suppose you write a module called Whiskey. The file begins this way:

package Whiskey;
use warnings::register;

Now code using that module does this:

use Whiskey;
use warnings qw(Whiskey);

It's important to load the module before asking to use warnings for that module. Otherwise, the Whiskey warning category hasn't been registered yet, and you'll raise an exception if you try to use it as a warnings category. Here's a whimsical Whiskey module:

package Whiskey;
use strict;
use warnings;   # for our own code, not our caller
use warnings::register;

sub drink {
    if (warnings::enabled( ) && (localtime( ))[2] < 12) {
        warnings::warn("Sun not yet over the yardarm");
    }
    print "Merry!\n";
}

sub quaff {
    if (warnings::enabled("deprecated")) {
        warnings::warn("deprecated",
            "quaffing deprecated in favor of chugging");
    }
    &drink;
}

# chuggers care not of the hour
sub chug {
    print "Very merry\n";
}

1;

The Whiskey::drink function uses the warnings::enabled function to check whether its caller has warnings enabled. Any of these in the caller's scope is enough to make that function return true:

use warnings;
use warnings qw(all);     # means same as previous
use warnings qw(Whiskey);

The function will also return true if global warnings are enabled using -w or $^W.
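This interplay between the registered category and the caller's lexical scope can be watched in a single file. The following is a condensed sketch of the Whiskey idea (the time-of-day check is dropped so the function warns whenever warnings are enabled):

```perl
#!/usr/bin/perl
# Single-file sketch: a registered warnings category obeys the
# caller's lexical warnings settings.
package Whiskey;
use warnings::register;    # registers the "Whiskey" category

sub drink {
    # true only if the calling scope has Whiskey (or all) warnings on
    warnings::warn("Sun not yet over the yardarm")
        if warnings::enabled();
}

package main;
use warnings;              # "all", which includes Whiskey

my @caught;
local $SIG{__WARN__} = sub { push @caught, @_ };

Whiskey::drink();          # warns: caller's scope has warnings on
{
    no warnings;           # lexically switch everything off
    Whiskey::drink();      # silent
}
print scalar(@caught), "\n";   # 1
```

The same call warns or stays silent depending solely on the pragma in force at the call site, which is exactly what $^W could never tell you.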
In the Whiskey::quaff function, a specific category of warnings is checked: deprecated. This is enabled if all warnings have been selected, if the syntax warnings have been selected (because deprecated warnings are considered a subcategory of syntax warnings, which is a subcategory of all warnings), or if deprecated warnings have been specifically selected. It will not be enabled just because the caller has enabled Whiskey warnings. Any category you create is considered a subcategory of all, but not of anything else. Check for Whiskey warnings using: warnings::enabled("Whiskey") The warnings::warn function is used instead of the warn built-in, in case Whiskey warnings have been promoted into exceptions: use warnings FATAL => "Whiskey"; 12.15.4 See Also The documentation on the use warnings pragma in Chapter 31 of Programming Perl and perllexwarn(1) [ Team LiB ] [ Team LiB ] Recipe 12.16 Referring to Packages Indirectly 12.16.1 Problem You want to refer to a variable or function in a package unknown until runtime, but syntax like $packname::$varname is illegal. 12.16.2 Solution Use symbolic references: { no strict "refs"; $val = ${ $packname . "::" . $varname }; @vals = @{ $packname . "::" . $aryname }; &{ $packname . "::" . $funcname }("args"); ($packname . "::" . $funcname) -> ("args"); } 12.16.3 Discussion A package declaration has meaning at compile time. If you don't know the name of the package or variable until runtime, you'll have to resort to symbolic references for direct access to the package symbol table. Assuming you normally run with use strict in effect, you must disable part of it to use symbolic references. Once you've used the no strict "refs" directive in that block, build up a string with the fully qualified name of the variable or function you're interested in. Then dereference this name as though it were a proper Perl reference. 
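Those lookups can be seen working in a short, runnable sketch (the package Stocks, its $favorite variable, and its list function are invented names for the demonstration):

```perl
#!/usr/bin/perl
# Sketch: runtime access to a package variable and function by name.
use strict;
use warnings;

package Stocks;
our $favorite = "BRKA";
sub list { return ("MSFT", "ORCL") }

package main;

my $packname = "Stocks";
my $varname  = "favorite";
my $funcname = "list";

my ($val, @vals);
{
    no strict "refs";   # symbolic references are otherwise forbidden
    $val  = ${ $packname . "::" . $varname };
    @vals = ($packname . "::" . $funcname)->();
}
print "$val @vals\n";   # BRKA MSFT ORCL
```

The no strict "refs" escape hatch is scoped to the smallest block that needs it, so the rest of the program keeps full strictness.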
During the prehistoric eras (before Perl 5), programmers were forced to use an eval for this kind of thing: eval "package $packname; \$'$val = \$$varname"; # set $main'val die if $@; As you see, this approach makes quoting difficult. It's also comparatively slow. Fortunately, you never need to do this just to access variables indirectly by name. Symbolic references are a necessary compromise. Similarly, eval could be used to define functions on the fly. Suppose you wanted to be able to get the base 2 or base 10 logs of numbers: printf "log2 of 100 is %.2f\n", log2(100); printf "log10 of 100 is %.2f\n", log10(100); Perl has only the natural log function. Here's how one could use eval to create these functions at runtime. Here we'll create functions named log2 up through log999: $packname = "main"; for ($i = 2; $i < 1000; $i++) { $logN = log($i); eval "sub ${packname}::log$i { log(shift) / $logN }"; die if $@; } Here, at least, you don't need to do that. The following code does the same thing, but instead of compiling a new function 998 times, we compile it only once, as a closure. Then we use symbolic dereferencing of the symbol table to assign the same subroutine reference to many function names: $packname = "main"; for ($i = 2; $i < 1000; $i++) { my $logN = log($i); no strict "refs"; *{"${packname}::log$i"} = sub { log(shift) / $logN }; } When you assign a reference to a typeglob, you create an alias for just the referent type of that name. That's how the Exporter does its job. The first line in the next code sample manually imports the function name Colors::blue into the current package. The second makes the main::blue function an alias for the Colors::azure function. *blue = \&Colors::blue; *main::blue = \&Colors::azure; Given the flexibility of typeglob assignments and symbolic references, a full-blown eval "STRING" is nearly always unnecessary for these sorts of indirect namespace manipulation, the last resort of the desperate programmer. 
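The closure-over-typeglob trick can be checked directly. Here is a trimmed sketch that builds just log2 through log10 and exercises two of them:

```perl
#!/usr/bin/perl
# Sketch: one closure compiled per loop pass, installed under many
# names by assigning a code reference to a typeglob.
use strict;
use warnings;

my $packname = "main";
for my $i (2 .. 10) {
    my $logN = log($i);            # captured by the closure below
    no strict "refs";
    *{"${packname}::log$i"} = sub { log(shift) / $logN };
}

printf "log2 of 1024 is %.2f\n", log2(1024);   # 10.00
printf "log10 of 100 is %.2f\n", log10(100);   # 2.00
```

Each pass captures its own $logN, so the nine installed functions share one body but remember different divisors.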
The only thing worse would be if it weren't available at all.

12.16.4 See Also

The section on "Symbolic References" in Chapter 8 of Programming Perl and in the start of perlsub(1); Recipe 11.4

Recipe 12.17 Using h2ph to Translate C #include Files

12.17.1 Problem

Someone gave you code that generates the bizarre error message:

Can't locate sys/ in @INC (did you run h2ph?)
(@INC contains: /usr/lib/perl5/i686-linux/5.00404 /usr/lib/perl5
/usr/lib/perl5/site_perl/i686-linux /usr/lib/perl5/site_perl .)
at some_program line 7.

You want to know what it means and how to fix it.

12.17.2 Solution

Get your system administrator to do this, running as the superuser:

% cd /usr/include; h2ph sys/syscall.h

However, most include files require other include files, which means you should probably just translate them all:

% cd /usr/include; h2ph *.h */*.h

If that reports too many filenames or misses some that are more deeply nested, try this instead:

% cd /usr/include; find . -name "*.h" -print | xargs h2ph

12.17.3 Discussion

A file whose name ends in .ph has been created by the h2ph tool, which translates C preprocessor directives from C #include files into Perl. The goal is to allow Perl code to access the same constants as C code. h2xs is a better approach in most cases because it provides compiled C code for your modules, not Perl code simulating C code. However, using h2xs requires a lot more programming savvy (at least, for accessing C code) than h2ph does. When h2ph's translation process works, it's wonderful. When it doesn't, you're probably out of luck. As system architectures and include files become more complex, h2ph fails more frequently. If you're lucky, the constants you need are already in the Fcntl, Socket, or POSIX modules. The POSIX module implements constants from sys/file.h, sys/errno.h, and sys/wait.h, among others. It also allows fancy tty handling, as described in Recipe 15.8. So what can you do with these .ph files?
Here are a few examples. The first uses the pessimally non-portable syscall function to access your operating system's gettimeofday syscall. This implements the FineTime module described in Recipe 12.12.

# file
package main;
require "sys/";
die "No SYS_gettimeofday in sys/"
    unless defined &SYS_gettimeofday;

package FineTime;
use strict;
require Exporter;
use vars qw(@ISA @EXPORT_OK);
@ISA = qw(Exporter);
@EXPORT_OK = qw(time);

sub time( ) {
    my $tv = pack("LL", ( ));  # presize buffer to two longs
    syscall(&main::SYS_gettimeofday, $tv, undef) >= 0
        or die "gettimeofday: $!";
    my($seconds, $microseconds) = unpack("LL", $tv);
    return $seconds + ($microseconds / 1_000_000);
}
1;

If you are forced to require an old-style .pl or .ph file, do so from the main package (package main in the preceding code). These old libraries always put their symbols in the current package, and main serves as a reasonable rendezvous point. To use a symbol, use its fully qualified name, as we did with main::SYS_gettimeofday. The sys/ file, if you can get it to build on your system, is the gateway to your system's idiosyncratic I/O functions through the ioctl function. One such function is the TIOCSTI ioctl, shown in Example 12-1. That abbreviation stands for "terminal I/O control, simulate terminal input." On systems that implement this function, it will push one character into your device stream so that the next time any process reads from that device, it gets the character you put there.

Example 12-1.
jam

#!/usr/bin/perl -w
# jam - stuff characters down STDIN's throat
require "sys/";
die "no TIOCSTI" unless defined &TIOCSTI;

sub jam {
    local $SIG{TTOU} = "IGNORE";  # "Stopped for tty output"
    local *TTY;                   # make local filehandle
    open(TTY, "+<", "/dev/tty") or die "no tty: $!";
    for (split(//, $_[0])) {
        ioctl(TTY, &TIOCSTI, $_) or die "bad TIOCSTI: $!";
    }
    close(TTY);
}

jam("@ARGV\n");

Since sys/ioctl.h translation is so dodgy, you'll probably have to run this C program to get your TIOCSTI value:

% cat > tio.c << EOF && cc tio.c && a.out
#include <sys/ioctl.h>
main( ) { printf("%#08x\n", TIOCSTI); }
EOF
0x005412

Another popular use for ioctl is for figuring out your current window size in rows and columns, and maybe even in pixels. This is shown in Example 12-2.

Example 12-2. winsz

#!/usr/bin/perl
# winsz - find x and y for chars and pixels
require "sys/";
die "no TIOCGWINSZ " unless defined &TIOCGWINSZ;
open(TTY, "+<", "/dev/tty") or die "No tty: $!";
unless (ioctl(TTY, &TIOCGWINSZ, $winsize="")) {
    die sprintf "$0: ioctl TIOCGWINSZ (%08x: $!)\n", &TIOCGWINSZ;
}
($row, $col, $xpixel, $ypixel) = unpack("S4", $winsize);
print "(row,col) = ($row,$col)";
print " (xpixel,ypixel) = ($xpixel,$ypixel)" if $xpixel || $ypixel;
print "\n";

As you see, as soon as you start playing with .ph files, unpacking binary data, and calling syscall and ioctl, you need to know about the C APIs that Perl normally hides. The only other thing that requires this much C knowledge is using the XS interface. Some suggest you should resist the temptation to descend into such unportable convolutions. Others feel that the demands put upon the trenchworkers are such that they must be forgiven these desperate measures. Fortunately, less fragile mechanisms are increasingly available. CPAN modules for most of these functions now exist, which should theoretically prove more robust than sourcing .ph files.
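The pack/unpack half of winsz can be exercised without a tty at all: build a synthetic struct winsize (four unsigned shorts; the values here are invented) and decode it with the same "S4" template.

```perl
#!/usr/bin/perl
# Sketch: the "S4" template packs and unpacks four unsigned shorts,
# matching the layout winsz assumes for struct winsize.
use strict;
use warnings;

my $winsize = pack("S4", 24, 80, 640, 480);  # rows, cols, xpixel, ypixel
my ($row, $col, $xpixel, $ypixel) = unpack("S4", $winsize);
print "(row,col) = ($row,$col)",
      " (xpixel,ypixel) = ($xpixel,$ypixel)\n";
```

In the real program the buffer is filled by the kernel through ioctl rather than by pack, but the decoding step is identical.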
12.17.4 See Also

h2ph(1); the instructions on running h2ph in the INSTALL file from the Perl source distribution; the syscall and ioctl functions in Chapter 29 of Programming Perl and in perlfunc(1); Recipe 12.18

Recipe 12.18 Using h2xs to Make a Module with C Code

12.18.1 Problem

You'd like to access your system's unique C functions from Perl.

12.18.2 Solution

Use the h2xs tool to generate the necessary template files, fill the files in appropriately, and then type:

% perl Makefile.PL
% make

12.18.3 Discussion

A Perl module need not be written solely in Perl. As with any other module, first pick a module name and use h2xs on it. We'll make a FineTime::time function with the same semantics as in the previous recipe, but this time around, we'll implement it using real C. First, we run the following command:

% h2xs -cn FineTime

If we had a .h file with function prototype declarations, we could include that, but because we're writing this one from scratch, we'll use the -c switch to omit building code to translate any #define symbols. The -n switch says to create a module directory named FineTime/, which will have the following files:

MANIFEST       List of files in the distribution
Changes        Change log
Makefile.PL    A meta-makefile     The Perl parts
FineTime.xs    The soon-to-be C parts       A test driver

Before we can type make, we'll have to generate a Makefile based on our system's configuration using the Makefile.PL template. Here's how to do that:

% perl Makefile.PL

If the XS code calls library code that isn't in the normal set of libraries Perl links from, add one more line to Makefile.PL first.
For example, if we wanted to link against the librpm.a library, which lives in the /usr/redhat/lib directory, we'd change the line of Makefile.PL that reads:

"LIBS"    => [""],   # e.g., "-lm"

so that it says:

"LIBS"    => ["-L/usr/redhat/lib -lrpm"],

If the module is to be installed somewhere other than the local site_lib directory, specify that on the command line:

% perl Makefile.PL LIB=~/perllib

Finally, edit the and FineTime.xs files. In the first case, most of the work has been done for us. We just set up the export list with the function to be exported. This time we put it in @EXPORT_OK so that if the user wants the function, they must ask for it by name. Here's

package FineTime;
use strict;
use vars qw($VERSION @ISA @EXPORT_OK);
require Exporter;
require DynaLoader;
@ISA = qw(Exporter DynaLoader);
@EXPORT_OK = qw(time);
$VERSION = "0.01";
bootstrap FineTime $VERSION;
1;

The make process automatically translates FineTime.xs into a FineTime.c file and eventually into a shared library, probably called on most platforms. The utility that does this translation is xsubpp, which is described in its own manpage and perlxstut(1). The build will call xsubpp automatically. Besides a strong C background, you also need to understand the C-to-Perl interface, called XS (external subroutine). The details and nuances of XS are beyond the scope of this book. The automatically generated FineTime.xs had the Perl-specific include files in it, as well as the MODULE declaration. We've added some extra includes and written the code for the new time function. Although this doesn't look entirely like C, it will, once xsubpp is done with it.
Here's the FineTime.xs we used:

#include <sys/types.h>
#include <sys/time.h>
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

MODULE = FineTime    PACKAGE = FineTime

double
time( )
    CODE:
    struct timeval tv;
    gettimeofday(&tv,0);
    RETVAL = tv.tv_sec + ((double) tv.tv_usec) / 1000000;
    OUTPUT:
    RETVAL

Defining a function by the same name as one from the standard C library won't cause a problem when it's compiled, because that's not its real name. That's just what Perl calls it. The C linker will see it as XS_FineTime_time, so no conflict exists. Here's what happened with make install (with some edits):

% make install
mkdir ./blib/lib/auto/FineTime
cp ./blib/lib/
/usr/local/bin/perl -I/usr/lib/perl5/i686-linux/5.00403 -I/usr/lib/perl5
    /usr/lib/perl5/ExtUtils/xsubpp -typemap
    /usr/lib/perl5/ExtUtils/typemap FineTime.xs >FineTime.tc && mv FineTime.tc FineTime.c
cc -c -Dbool=char -DHAS_BOOL -O2 -DVERSION=\"0.01\" -DXS_VERSION=\"0.01\"
    -fpic -I/usr/lib/perl5/i686-linux/5.00403/CORE FineTime.c
Running Mkbootstrap for FineTime ( )
chmod 644
LD_RUN_PATH="" cc -o blib/arch/auto/FineTime/ -shared
    -L/usr/local/lib FineTime.o
chmod 755 blib/arch/auto/FineTime/
cp ./blib/arch/auto/FineTime/
chmod 644 blib/arch/auto/FineTime/
Installing /home/tchrist/perllib/i686-linux/./auto/FineTime/
Installing /home/tchrist/perllib/i686-linux/./auto/FineTime/
Installing /home/tchrist/perllib/./
Writing /home/tchrist/perllib/i686-linux/auto/FineTime/.packlist
Appending installation info to /home/tchrist/perllib/i686-linux/perllocal.pod

Once this is all done, we'll be able to type something like this into the shell:

% perl -I ~/perllib -MFineTime=time -le "1 while print time( )" | head
888177070.090978
888177070.09132
888177070.091389
888177070.091453
888177070.091515
888177070.091577
888177070.091639
888177070.0917
888177070.091763
888177070.091864

12.18.4 See Also

Chapters 18 through 20 in Advanced Perl Programming; perlxstut(1) and perlxs(1) to learn how to call C from Perl; perlcall(1) and perlguts(1) to understand the internal Perl API, also
the "Extending Perl" section of Chapter 21 of Programming Perl; perlembed(1) to learn how to call Perl from C, also the "Embedding Perl" section of Chapter 21 of Programming Perl; the documentation for the standard ExtUtils::MakeMaker module, h2ph(1) and xsubpp(1); Dean Roehrich's comprehensive XS cookbook on CPAN, which includes directions on interfacing with C++

Recipe 12.19 Writing Extensions in C with Inline::C

12.19.1 Problem

You'd like to write functions in C that you can call from Perl. You may already have tried XS and found it harmful to your mental health.

12.19.2 Solution

Use the Inline::C module available from CPAN:

use Inline C;
$answer = somefunc(20, 4);
print "$answer\n";   # prints 80

__END__
__C__
double somefunc(int a, int b) {
    /* Inline knows most basic C types */
    double answer = a * b;
    return answer;
}

12.19.3 Discussion

Inline::C was created as an alternative to the XS system for building C extension modules. Rather than jumping through all the hoopla of h2xs and the format of an .xs file, Inline::C lets you embed C code into your Perl program. There are also Inline modules for Python, Ruby, and Java, among other languages. By default, your C source is in the __END__ or __DATA__ section of your program after a __C__ token. This permits multiple Inlined language blocks in a single file. If you want, use a here document when you load Inline:

use Inline C <<'END_OF_C';
double somefunc(int a, int b) {
    /* Inline knows most basic C types */
    double answer = a * b;
    return answer;
}
END_OF_C

Inline::C scans the source code for ANSI-style function definitions. When it finds a function definition it knows how to deal with, it creates a Perl wrapper for the function. Inline can automatically translate the basic C data types (double, int, char *, etc.) by using the typemap that comes with Perl.
A typemap shows Perl how to convert between C values and Perl data types, and you can install your own if you need to use more complex data structures than the basic typemap supports. You can link against external libraries, parse header files as h2xs does, pass and return multiple values, handle objects, and more. See the Inline::C-Cookbook manpage that comes with the Inline::C module for more details.

12.19.4 See Also

The documentation with the Inline::C module from CPAN; the Inline::C-Cookbook manpage

Recipe 12.20 Documenting Your Module with Pod

12.20.1 Problem

You need to document your module, but don't know what format to use.

12.20.2 Solution

Embed your documentation in your module file using pod format.

12.20.3 Discussion

Pod stands for plain old documentation. It's documentation embedded in your program using a very simple markup format. Programmers are notorious for writing the code first and the documentation never, so pod was designed to make writing documentation so easy that anyone can and will do so. Sometimes this even works. When Perl is parsing your source code, a line starting with an equals sign (where a new statement is expected) says to ignore all text until it finds a line beginning with =cut, after which it will start parsing code again. This lets you mix code and documentation throughout your Perl program or module file. Since it's mostly plain text, type in your documentation as literal text, or nearly so. The translators try to be clever and make output-specific decisions so the programmer doesn't have to specifically format variable names, function calls, etc. Perl ships with several translators that filter generic pod format into specific output styles. These include pod2man to change your pods into troff for use with the man program or for phototypesetting and printing; pod2html for creating web pages (which works even on non-Unix systems); and pod2text for plain ASCII.
Other translators, such as pod2ipf, pod2fm, pod2texi, pod2latex, and pod2ps, may also be available or can be found on CPAN. Many books are written using proprietary word processors with limited scripting capabilities. Not this one! It was written in pod format using common text editors (vi for Tom, emacs for Nat). The final book was produced by converting the pod source files to FrameMaker. Although formally documented in perlpod(1), pod is probably easiest to learn by reading existing module files. If you started making your module using h2xs, then you already have the sample pods right there. The Makefile knows to convert these into man format and install those manpages so others can read them. Alternatively, the perldoc program can translate pods on the fly using pod2text. Indented paragraphs will be left verbatim. Other paragraphs will be reformatted to fit the page. Only two kinds of markups are used in pod: paragraphs beginning with an equals sign and one or more words, and interior sequences starting with a single letter followed by text enclosed in angle brackets. Paragraph tags are for headers, list enumeration, and per-translator escapes. Angle bracket sequences are mainly used for font changes, such as selecting bold, italic, or constant-width fonts. Here's an example of an =head2 pod directive and various bracket escapes for font changes:

=head2 Discussion

If we had a I<.h> file with function prototype declarations, we could
include that, but since we're writing this one from scratch, we'll use
the B<-c> flag to omit building code to translate any #define symbols.
The B<-n> flag says to create a module directory named I<FineTime/>,
which will have the following files.

The =for escape introduces specific code that is only for a particular output filter. This book, for example, written mostly in pod, includes calls to the standard troff tools eqn, tbl, and pic. Here's an example of embedded eqn. Only translators that produce troff will heed this paragraph.
=for troff
.EQ
log sub n (x) = { {log sub e (x)} over {log sub e (n)} }
.EN

Pod can also create multiline comments. In C, the sequence /* .... */ can comment out many lines of text all at once—there's no need to put a marker on each line. Since Perl ignores pod directives, use these for block commenting. The trick is to find a directive that the pod filters ignore. You could specify that a block is "for later" or "for nobody":

=for later
next if 1 .. ?^$?;
s/^(.)/>$1/;
s/(.{73})........*/$1 /;

=cut back to perl

or you could use a =begin and =end pair:

=begin comment

if (!open(FILE, "<", $file)) {
    unless ($opt_q) { warn "$me: $file: $!\n"; $Errors++; }
    next FILE;
}
$total = 0;
$matches = 0;

=end comment

12.20.4 See Also

The section on "PODs: Embedded Documentation" in perlsyn(1), as well as perlpod(1), pod2man(1), pod2html(1), and pod2text(1); Chapter 26 of Programming Perl

Recipe 12.21 Building and Installing a CPAN Module

12.21.1 Problem

You want to install a module file that you downloaded from CPAN over the Net or obtained from a CD.

12.21.2 Solution

Type the following commands into your shell. They will build and install Version 4.54 of the Some::Module package.

% gunzip Some-Module-4.54.tar.gz
% tar xf Some-Module-4.54.tar
% cd Some-Module-4.54
% perl Makefile.PL
% make
% make test
% make install

12.21.3 Discussion

Like most programs on the Net, Perl modules are available in source kits stored as tar archives in GNU zip format.[3] If tar warns of "Directory checksum errors", then you downloaded the binary file in text format, mutilating it.

[3] This is not the same as the zip format common on Windows machines, but newer versions of Windows winzip will read it. Prior to Perl 5.005, you'll need the standard port of Perl for Win32, not the ActiveState port, to build CPAN modules. Free versions of tar and gnutar are also available for Microsoft systems.
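If no tar or gunzip binary is handy, the unpacking step can also be done from Perl itself with the Archive::Tar module from CPAN (bundled with recent perls). This is only a sketch under that assumption; the extract_kit name is ours, and reading a .tar.gz directly additionally requires the IO::Zlib module:

```perl
use strict;
use warnings;
use Archive::Tar;

# Unpack a module source kit without an external tar program.
# Archive::Tar is from CPAN (standard with recent perls); the
# sub name is our own, and compressed kits also need IO::Zlib.
sub extract_kit {
    my ($kit) = @_;
    my $tar = Archive::Tar->new($kit)
        or die "can't read $kit: " . Archive::Tar->error;
    print "$_\n" for $tar->list_files;   # show what will be unpacked
    $tar->extract;                       # write files under the current directory
}
```

After something like extract_kit("Some-Module-4.54.tar.gz"), you would cd into the unpacked directory and proceed with perl Makefile.PL as above.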
You'll probably have to become a privileged user with adequate permissions to install the module in the system directories. Standard modules are installed in a directory like /usr/lib/perl5, whereas third-party modules are installed in /usr/lib/perl5/site_perl. Here's a sample run, showing the installation of the MD5 module:

% gunzip MD5-1.7.tar.gz
% tar xf MD5-1.7.tar
% cd MD5-1.7
% perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for MD5
% make
mkdir ./blib
mkdir ./blib/lib
cp ./blib/lib/
AutoSplitting MD5 (./blib/lib/auto/MD5)
/usr/bin/perl -I/usr/local/lib/perl5/i386 ...
...
cp ./blib/arch/auto/MD5/
chmod 644 ./blib/arch/auto/MD5/MD5.bs
mkdir ./blib/man3
Manifying ./blib/man3/MD5.3
% make test
PERL_DL_NONLAZY=1 /usr/bin/perl -I./blib/arch -I./blib/lib
-I/usr/local/lib/perl5/i386-freebsd/5.00404 -I/usr/local/lib/perl5
1..14
ok 1
ok 2
...
ok 13
ok 14
% sudo make install
Password:
Installing /usr/local/lib/perl5/site_perl/i386-freebsd/./auto/MD5/
Installing /usr/local/lib/perl5/site_perl/i386-freebsd/./auto/MD5/
Installing /usr/local/lib/perl5/site_perl/./auto/MD5/autosplit.ix
Installing /usr/local/lib/perl5/site_perl/./
Installing /usr/local/lib/perl5/man/man3/./MD5.3
Writing /usr/local/lib/perl5/site_perl/i386-freebsd/auto/MD5/.packlist
Appending installation info to /usr/local/lib/perl5/i386-freebsd/5.00404/perllocal.pod

If your system manager isn't around or can't be prevailed upon to run the installation, don't worry. When you use Perl to generate the Makefile from the template Makefile.PL, you can specify alternate installation directories.

# if you just want the modules installed in your own directory
% perl Makefile.PL LIB=~/lib

# if you have your own complete distribution
% perl Makefile.PL PREFIX=~/perl5-private

An even simpler approach is to use the CPAN module from the command line, because it can search for, download, and install the module you need. Suppose you wanted to find the CPAN module Getopt::Declare.
All you'd have to do is type:

% perl -MCPAN -e "install Getopt::Declare"

The first time you use the CPAN module, it will ask you some configuration questions. It saves these away so that when you use it in the future, it won't need to ask you those questions again.

The CPAN module also supports an interactive command shell. You can use it to search for modules whose precise names you're uncertain of, check which CPAN modules have newer versions than you have installed, install bundles of related modules, and perform various other useful tasks. Here's an example run of the interactive shell:

% perl -MCPAN -e shell
cpan shell -- CPAN exploration and modules installation (v1.70)
ReadLine support enabled

cpan> h
Display Information
 command       argument          description
 a,b,d,m       WORD or /REGEXP/  about authors, bundles, distributions, modules
 i             WORD or /REGEXP/  about anything of above
 r             NONE              reinstall recommendations
 ls            AUTHOR            about files in the author's directory

Download, Test, Make, Install...
 get                             download
 make                            make (implies get)
 test          MODULES,          make test (implies make)
 install       DISTS, BUNDLES    make install (implies test)
 clean                           make clean
 look                            open subshell in these dists' directories
 readme                          display these dists' README files

Other
 h,?           display this menu        ! perl-code    eval a perl command
 o conf [opt]  set and query options    q              quit the cpan shell
 reload cpan   load again               reload index   load newer indices
 autobundle    Snapshot                 force cmd      unconditionally do cmd

cpan> i /inflect/
CPAN: Storable loaded ok
Going to read /home/tchrist/.cpan/Metadata
  Database was generated on Mon, 07 Apr 2003 22:42:33 GMT
Distribution    D/DC/DCONWAY/Lingua-EN-Inflect-1.88.tar.gz
Module          Lingua::EN::Inflect (D/DC/DCONWAY/Lingua-EN-Inflect-1.88.tar.gz)
2 items found

cpan> install Lingua::EN::Inflect
  [build and install output deleted]

cpan> quit

The CPAN module is slowly being phased out in favor of CPANPLUS, a module with similar functionality that is built for flexibility as well as power.
The CPANPLUS text interface is similar to that of the CPAN module, but it also offers a GUI and programmer interfaces, which can access a lot of functionality that the CPAN module hides.

12.21.4 See Also

The documentation for the standard ExtUtils::MakeMaker module; the INSTALL file in the Perl source distribution for information on building a statically linked perl binary

Recipe 12.22 Example: Module Template

Following is the skeleton of a module. If you want to write a module of your own, you can copy this and customize it.

package Some::Module;  # must live in Some/
use strict;

require Exporter;

# set the version for version checking
our $VERSION = 0.01;

our @ISA         = qw(Exporter);
our @EXPORT      = qw(&func1 &func2 &func4);
our %EXPORT_TAGS = ( );     # eg: TAG => [ qw!name1 name2! ],

# your exported package globals go here,
# as well as any optionally exported functions
our @EXPORT_OK = qw($Var1 %Hashit &func3);

use vars qw($Var1 %Hashit);

# non-exported package globals go here
our(@more, $stuff);

# initialize package globals, first exported ones
$Var1   = "";
%Hashit = ( );

# then the others (which are still accessible as $Some::Module::stuff)
$stuff = "";
@more  = ( );

# all file-scoped lexicals must be created before
# the functions below that use them.

# file-private lexicals go here
my $priv_var    = "";
my %secret_hash = ( );

# here's a file-private function as a closure,
# callable as &$priv_func.
my $priv_func = sub {
    # stuff goes here.
};

# make all your functions, whether exported or not;
# remember to put something interesting in the { } stubs
sub func1     { .... }    # no prototype
sub func2( )  { .... }    # proto'd void
sub func3($$) { .... }    # proto'd to 2 scalars

# this one isn't auto-exported, but could be called!
sub func4(\%) {
    ....
}    # proto'd to 1 hash ref

END { }    # module cleanup code here (global destructor)

1;

Recipe 12.23 Program: Finding Versions and Descriptions of Installed Modules

Perl comes with many standard modules, and even more can be found on CPAN. The following program prints out the names, versions, and descriptions of all modules installed on your system. It uses standard modules like File::Find and includes several techniques described in this chapter. To run it, type:

% pmdesc

It prints a list of modules and their descriptions:

FileHandle (2.00) - supply object methods for filehandles
IO::File (1.06021) - supply object methods for filehandles
IO::Select (1.10) - OO interface to the select system call
IO::Socket (1.1603) - Object interface to socket communications
...

With the -v flag, pmdesc provides the names of the directories the files are in:

% pmdesc -v

<<>>
FileHandle (2.00) - supply object methods for filehandles
...

The -w flag warns if a module doesn't come with a pod description, and -s sorts the module list within each directory.

The program is given in Example 12-3.

Example 12-3. pmdesc

#!/usr/bin/perl -w
# pmdesc - describe pm files
use strict;
use File::Find  qw(find);
use Getopt::Std qw(getopts);
use Carp;

use vars (
    q!$opt_v!,      # give debug info
    q!$opt_w!,      # warn about missing descs on modules
    q!$opt_a!,      # include relative paths
    q!$opt_s!,      # sort output within each directory
);
$| = 1;

getopts("wvas") or die "bad usage";
@ARGV = @INC unless @ARGV;

# Globals.  wish I didn't really have to do this.
use vars (
    q!$Start_Dir!,  # The top directory find was called with
    q!%Future!,     # topdirs find will handle later
);
my $Module;

# install an output filter to sort my module list, if wanted.
if ($opt_s) {
    if (open(ME, "-|")) {
        $/ = "";
        while (<ME>) {
            chomp;
            print join("\n", sort split /\n/), "\n";
        }
        exit;
    }
}

MAIN: {
    my %visited;
    my ($dev,$ino);

    @Future{@ARGV} = (1) x @ARGV;

    foreach $Start_Dir (@ARGV) {
        delete $Future{$Start_Dir};
        print "\n << Modules from $Start_Dir>>\n\n" if $opt_v;
        next unless ($dev,$ino) = stat($Start_Dir);
        next if $visited{$dev,$ino}++;
        next unless $opt_a || $Start_Dir =~ m!^/!;
        find(\&wanted, $Start_Dir);
    }
    exit;
}

# calculate module name from file and directory
sub modname {
    local $_ = $File::Find::name;

    if (index($_, $Start_Dir . "/") == 0) {
        substr($_, 0, 1+length($Start_Dir)) = "";
    }

    s { / }          {::}gx;
    s { \.p(m|od)$ } {  }x;

    return $_;
}

# decide if this is a module we want
sub wanted {
    if ( $Future{$File::Find::name} ) {
        warn "\t(Skipping $File::Find::name, qui venit in futuro.)\n"
            if 0 and $opt_v;
        $File::Find::prune = 1;
        return;
    }

    return unless /\.pm$/ && -f;

    $Module = &modname;

    # skip obnoxious modules
    if ($Module =~ /^CPAN(\Z|::)/) {
        warn("$Module -- skipping because it misbehaves\n");
        return;
    }

    my $file = $_;
    unless (open(POD, "<", $file)) {
        warn "\tcannot open $file: $!";  # if $opt_w;
        return 0;
    }

    $: = " -:";
    local $/ = "";
    local $_;

    while (<POD>) {
        if (/=head\d\s+NAME/) {
            chomp($_ = <POD>);
            s/^.*?-\s+//s;
            s/\n/ /g;
            #write;
            my $v;
            if (defined ($v = getversion($Module))) {
                print "$Module ($v) ";
            } else {
                print "$Module ";
            }
            print "- $_\n";
            return 1;
        }
    }

    warn "\t(MISSING DESC FOR $File::Find::name)\n" if $opt_w;
    return 0;
}

# run Perl to load the module and print its version number, redirecting
# errors to /dev/null
sub getversion {
    my $mod = shift;

    my $vers = `$^X -m$mod -e 'print \$${mod}::VERSION' 2>/dev/null`;
    $vers =~ s/^\s*(.*?)\s*$/$1/;   # remove stray whitespace

    return ($vers || undef);
}

format =
^<<<<<<<<<<<<<<<<<~~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$Module,             $_
.

This can also be accomplished through the backend programmer interface in the CPANPLUS module, if you have it installed.
This program displays information on all available modules (the -X option is to silence any warnings about invalid paths or version numbers):

#!/usr/bin/perl -X
use CPANPLUS::Backend;
use Data::Dumper;

$cp = CPANPLUS::Backend->new;
$installed = $cp->installed->rv;    # fetch list of installed mods

foreach my $module (sort keys %$installed) {
    # get the module's information
    $info = $cp->details(modules => [$module])->rv->{$module};

    # display the fields we care about
    printf("%-35.35s %44.44s\n", $module, $info->{Description});
}

When run, it outputs a table like this:

Algorithm::Cluster        Perl extension for the C clustering library
Algorithm::NaiveBayes     None given
AnyDBM_File               Uses first available *_File module above
Apache                    Interface to the Apache server API
Apache::AuthDBI           None given
Apache::Connection        Inteface to Apache conn_rec struct

Chapter 13. Classes, Objects, and Ties

All the world over, I will back the masses against the classes.

—William E. Gladstone, Speech at Liverpool, 28 June 1886

Introduction

Although Perl was not initially conceived of as an object-oriented language, within a few years of its initial release, complete support for object-oriented programming had been added. As usual, Perl doesn't try to enforce one true style, but embraces many. This helps more people do their job the way they want to do it.

You don't have to use objects to write programs, unlike Java, where programs are instances of objects. If you want to, though, you can write Perl programs that use nearly every weapon in the object-oriented arsenal. Perl supports classes and objects, single and multiple inheritance, instance methods and class methods, access to overridden methods, constructors and destructors, operator overloading, proxy methods through autoloading, delegation, a rooted hierarchy for all objects, and two levels of garbage collection. You can use as many or as few object-oriented techniques as you want and need.
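As a small taste of what's ahead, here is a minimal sketch of a Perl class in this style. The Critter name and its methods are invented for illustration, not taken from any standard module:

```perl
use strict;
use warnings;

package Critter;

# Constructor: a class method that blesses a hash reference
# into the Critter class.
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name} || "anonymous" };
    return bless $self, $class;
}

# Instance method: the object arrives as the first argument.
sub name {
    my $self = shift;
    return $self->{name};
}

package main;

my $pet = Critter->new(name => "Camel");
print $pet->name, "\n";     # prints "Camel"
```

The later recipes in this chapter cover each of these pieces (constructors, instance data, methods) in detail.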
Ties are the only part of Perl where you must use object orientation. And even then, only the module implementor need be aware of this; the casual user gets to remain blissfully unaware of the internal mechanics. Ties, discussed in Recipe 13.15, let you transparently intercept access to a variable. For example, you can use ties to create hashes that support lookups by key or value instead of just by key.

Under the Hood

If you ask 10 people what object orientation is, you'll get 10 different answers. People bandy about terms like abstraction and encapsulation, trying to isolate the basic units of object-oriented programming languages and give them big names to write papers and books about. Not all object-oriented languages offer the same features, yet they are still deemed object-oriented. This, of course, produces more papers and books.

We follow the nomenclature used in Perl's documentation, the perlobj(1) manpage, and Chapter 12 of Programming Perl. An object is a variable that belongs to a class. Methods are functions associated with a class. In Perl, a class is a package—and usually a module. An object is a reference to something associated with a class. Once associated with a class, something is said to be blessed into that class.

There's nothing ecclesiastical or spooky going on here. Blessing merely associates a referent with a class, and this is done with the bless function, which takes one or two arguments. The first is a reference to the thing you want associated with the class; the second is the package with