The Python Standard Library by Example


The Python Standard Library by Example

Doug Hellmann

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Capetown • Sydney • Tokyo • Singapore • Mexico City

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:

U.S. Corporate and Government Sales
(800) 382-3419
corpsales@pearsontechgroup.com

For sales outside the United States, please contact:

International Sales
international@pearsoned.com

Visit us on the Web: informit.com/aw

Library of Congress Cataloging-in-Publication Data

Hellmann, Doug.
The Python standard library by example / Doug Hellmann.
p. cm.
Includes index.
ISBN 978-0-321-76734-9 (pbk. : alk. paper)
1. Python (Computer program language) I. Title.
QA76.73.P98H446 2011
005.13'3—dc22
2011006256

Copyright © 2011 Pearson Education, Inc.

All rights reserved. Printed in the United States of America.
This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:

Pearson Education, Inc.
Rights and Contracts Department
501 Boylston Street, Suite 900
Boston, MA 02116
Fax: (617) 671-3447

ISBN-13: 978-0-321-76734-9
ISBN-10: 0-321-76734-9

Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan.
First printing, May 2011

CONTENTS AT A GLANCE

Contents
Tables
Foreword
Acknowledgments
About the Author
INTRODUCTION
1 TEXT
2 DATA STRUCTURES
3 ALGORITHMS
4 DATES AND TIMES
5 MATHEMATICS
6 THE FILE SYSTEM
7 DATA PERSISTENCE AND EXCHANGE
8 DATA COMPRESSION AND ARCHIVING
9 CRYPTOGRAPHY
10 PROCESSES AND THREADS
11 NETWORKING
12 THE INTERNET
13 EMAIL
14 APPLICATION BUILDING BLOCKS
15 INTERNATIONALIZATION AND LOCALIZATION
16 DEVELOPER TOOLS
17 RUNTIME FEATURES
18 LANGUAGE TOOLS
19 MODULES AND PACKAGES
Index of Python Modules
Index

CONTENTS

Tables
Foreword
Acknowledgments
About the Author
INTRODUCTION

1 TEXT
1.1 string—Text Constants and Templates
1.1.1 Functions
1.1.2 Templates
1.1.3 Advanced Templates
1.2 textwrap—Formatting Text Paragraphs
1.2.1 Example Data
1.2.2 Filling Paragraphs
1.2.3 Removing Existing Indentation
1.2.4 Combining Dedent and Fill
1.2.5 Hanging Indents
1.3 re—Regular Expressions
1.3.1 Finding Patterns in Text
1.3.2 Compiling Expressions
1.3.3 Multiple Matches
1.3.4 Pattern Syntax
1.3.5 Constraining the Search
1.3.6 Dissecting Matches with Groups
1.3.7 Search Options
1.3.8 Looking Ahead or Behind
1.3.9 Self-Referencing Expressions
1.3.10 Modifying Strings with Patterns
1.3.11 Splitting with Patterns
1.4 difflib—Compare Sequences
1.4.1 Comparing Bodies of Text
1.4.2 Junk Data
1.4.3 Comparing Arbitrary Types

2 DATA STRUCTURES
2.1 collections—Container Data Types
2.1.1 Counter
2.1.2 defaultdict
2.1.3 Deque
2.1.4 namedtuple
2.1.5 OrderedDict
2.2 array—Sequence of Fixed-Type Data
2.2.1 Initialization
2.2.2 Manipulating Arrays
2.2.3 Arrays and Files
2.2.4 Alternate Byte Ordering
2.3 heapq—Heap Sort Algorithm
2.3.1 Example Data
2.3.2 Creating a Heap
2.3.3 Accessing Contents of a Heap
2.3.4 Data Extremes from a Heap
2.4 bisect—Maintain Lists in Sorted Order
2.4.1 Inserting in Sorted Order
2.4.2 Handling Duplicates
2.5 Queue—Thread-Safe FIFO Implementation
2.5.1 Basic FIFO Queue
2.5.2 LIFO Queue
2.5.3 Priority Queue
2.5.4 Building a Threaded Podcast Client
2.6 struct—Binary Data Structures
2.6.1 Functions vs. Struct Class
2.6.2 Packing and Unpacking
2.6.3 Endianness
2.6.4 Buffers
2.7 weakref—Impermanent References to Objects
2.7.1 References
2.7.2 Reference Callbacks
2.7.3 Proxies
2.7.4 Cyclic References
2.7.5 Caching Objects
2.8 copy—Duplicate Objects
2.8.1 Shallow Copies
2.8.2 Deep Copies
2.8.3 Customizing Copy Behavior
2.8.4 Recursion in Deep Copy
2.9 pprint—Pretty-Print Data Structures
2.9.1 Printing
2.9.2 Formatting
2.9.3 Arbitrary Classes
2.9.4 Recursion
2.9.5 Limiting Nested Output
2.9.6 Controlling Output Width

3 ALGORITHMS
3.1 functools—Tools for Manipulating Functions
3.1.1 Decorators
3.1.2 Comparison
3.2 itertools—Iterator Functions
3.2.1 Merging and Splitting Iterators
3.2.2 Converting Inputs
3.2.3 Producing New Values
3.2.4 Filtering
3.2.5 Grouping Data
3.3 operator—Functional Interface to Built-in Operators
3.3.1 Logical Operations
3.3.2 Comparison Operators
3.3.3 Arithmetic Operators
3.3.4 Sequence Operators
3.3.5 In-Place Operators
3.3.6 Attribute and Item “Getters”
3.3.7 Combining Operators and Custom Classes
3.3.8 Type Checking
3.4 contextlib—Context Manager Utilities
3.4.1 Context Manager API
3.4.2 From Generator to Context Manager
3.4.3 Nesting Contexts
3.4.4 Closing Open Handles

4 DATES AND TIMES
4.1 time—Clock Time
4.1.1 Wall Clock Time
4.1.2 Processor Clock Time
4.1.3 Time Components
4.1.4 Working with Time Zones
4.1.5 Parsing and Formatting Times
4.2 datetime—Date and Time Value Manipulation
4.2.1 Times
4.2.2 Dates
4.2.3 timedeltas
4.2.4 Date Arithmetic
4.2.5 Comparing Values
4.2.6 Combining Dates and Times
4.2.7 Formatting and Parsing
4.2.8 Time Zones
4.3 calendar—Work with Dates
4.3.1 Formatting Examples
4.3.2 Calculating Dates

5 MATHEMATICS
5.1 decimal—Fixed and Floating-Point Math
5.1.1 Decimal
5.1.2 Arithmetic
5.1.3 Special Values
5.1.4 Context
5.2 fractions—Rational Numbers
5.2.1 Creating Fraction Instances
5.2.2 Arithmetic
5.2.3 Approximating Values
5.3 random—Pseudorandom Number Generators
5.3.1 Generating Random Numbers
5.3.2 Seeding
5.3.3 Saving State
5.3.4 Random Integers
5.3.5 Picking Random Items
5.3.6 Permutations
5.3.7 Sampling
5.3.8 Multiple Simultaneous Generators
5.3.9 SystemRandom
5.3.10 Nonuniform Distributions
5.4 math—Mathematical Functions
5.4.1 Special Constants
5.4.2 Testing for Exceptional Values
5.4.3 Converting to Integers
5.4.4 Alternate Representations
5.4.5 Positive and Negative Signs
5.4.6 Commonly Used Calculations
5.4.7 Exponents and Logarithms
5.4.8 Angles
5.4.9 Trigonometry
5.4.10 Hyperbolic Functions
5.4.11 Special Functions

6 THE FILE SYSTEM
6.1 os.path—Platform-Independent Manipulation of Filenames
6.1.1 Parsing Paths
6.1.2 Building Paths
6.1.3 Normalizing Paths
6.1.4 File Times
6.1.5 Testing Files
6.1.6 Traversing a Directory Tree
6.2 glob—Filename Pattern Matching
6.2.1 Example Data
6.2.2 Wildcards
6.2.3 Single Character Wildcard
6.2.4 Character Ranges
6.3 linecache—Read Text Files Efficiently
6.3.1 Test Data
6.3.2 Reading Specific Lines
6.3.3 Handling Blank Lines
6.3.4 Error Handling
6.3.5 Reading Python Source Files
6.4 tempfile—Temporary File System Objects
6.4.1 Temporary Files
6.4.2 Named Files
6.4.3 Temporary Directories
6.4.4 Predicting Names
6.4.5 Temporary File Location
6.5 shutil—High-Level File Operations
6.5.1 Copying Files
6.5.2 Copying File Metadata
6.5.3 Working with Directory Trees
6.6 mmap—Memory-Map Files
6.6.1 Reading
6.6.2 Writing
6.6.3 Regular Expressions
6.7 codecs—String Encoding and Decoding
6.7.1 Unicode Primer
6.7.2 Working with Files
6.7.3 Byte Order
6.7.4 Error Handling
6.7.5 Standard Input and Output Streams
6.7.6 Encoding Translation
6.7.7 Non-Unicode Encodings
6.7.8 Incremental Encoding
6.7.9 Unicode Data and Network Communication
6.7.10 Defining a Custom Encoding
6.8 StringIO—Text Buffers with a File-like API
6.8.1 Examples
6.9 fnmatch—UNIX-Style Glob Pattern Matching
6.9.1 Simple Matching
6.9.2 Filtering
6.9.3 Translating Patterns
6.10 dircache—Cache Directory Listings
6.10.1 Listing Directory Contents
6.10.2 Annotated Listings
6.11 filecmp—Compare Files
6.11.1 Example Data
6.11.2 Comparing Files
6.11.3 Comparing Directories
6.11.4 Using Differences in a Program

7 DATA PERSISTENCE AND EXCHANGE
7.1 pickle—Object Serialization
7.1.1 Importing
7.1.2 Encoding and Decoding Data in Strings
7.1.3 Working with Streams
7.1.4 Problems Reconstructing Objects
7.1.5 Unpicklable Objects
7.1.6 Circular References
7.2 shelve—Persistent Storage of Objects
7.2.1 Creating a New Shelf
7.2.2 Writeback
7.2.3 Specific Shelf Types
7.3 anydbm—DBM-Style Databases
7.3.1 Database Types
7.3.2 Creating a New Database
7.3.3 Opening an Existing Database
7.3.4 Error Cases
7.4 whichdb—Identify DBM-Style Database Formats
7.5 sqlite3—Embedded Relational Database
7.5.1 Creating a Database
7.5.2 Retrieving Data
7.5.3 Query Metadata
7.5.4 Row Objects
7.5.5 Using Variables with Queries
7.5.6 Bulk Loading
7.5.7 Defining New Column Types
7.5.8 Determining Types for Columns
7.5.9 Transactions
7.5.10 Isolation Levels
7.5.11 In-Memory Databases
7.5.12 Exporting the Contents of a Database
7.5.13 Using Python Functions in SQL
7.5.14 Custom Aggregation
7.5.15 Custom Sorting
7.5.16 Threading and Connection Sharing
7.5.17 Restricting Access to Data
7.6 xml.etree.ElementTree—XML Manipulation API
7.6.1 Parsing an XML Document
7.6.2 Traversing the Parsed Tree
7.6.3 Finding Nodes in a Document
7.6.4 Parsed Node Attributes
7.6.5 Watching Events While Parsing
7.6.6 Creating a Custom Tree Builder
7.6.7 Parsing Strings
7.6.8 Building Documents with Element Nodes
7.6.9 Pretty-Printing XML
7.6.10 Setting Element Properties
7.6.11 Building Trees from Lists of Nodes
7.6.12 Serializing XML to a Stream
7.7 csv—Comma-Separated Value Files
7.7.1 Reading
7.7.2 Writing
7.7.3 Dialects
7.7.4 Using Field Names

8 DATA COMPRESSION AND ARCHIVING
8.1 zlib—GNU zlib Compression
8.1.1 Working with Data in Memory
8.1.2 Incremental Compression and Decompression
8.1.3 Mixed Content Streams
8.1.4 Checksums
8.1.5 Compressing Network Data
8.2 gzip—Read and Write GNU Zip Files
8.2.1 Writing Compressed Files
8.2.2 Reading Compressed Data
8.2.3 Working with Streams
8.3 bz2—bzip2 Compression
8.3.1 One-Shot Operations in Memory
8.3.2 Incremental Compression and Decompression
8.3.3 Mixed Content Streams
8.3.4 Writing Compressed Files
8.3.5 Reading Compressed Files
8.3.6 Compressing Network Data
8.4 tarfile—Tar Archive Access
8.4.1 Testing Tar Files
8.4.2 Reading Metadata from an Archive
8.4.3 Extracting Files from an Archive
8.4.4 Creating New Archives
8.4.5 Using Alternate Archive Member Names
8.4.6 Writing Data from Sources Other than Files
8.4.7 Appending to Archives
8.4.8 Working with Compressed Archives
8.5 zipfile—ZIP Archive Access
8.5.1 Testing ZIP Files
8.5.2 Reading Metadata from an Archive
8.5.3 Extracting Archived Files from an Archive
8.5.4 Creating New Archives
8.5.5 Using Alternate Archive Member Names
8.5.6 Writing Data from Sources Other than Files
8.5.7 Writing with a ZipInfo Instance
8.5.8 Appending to Files
8.5.9 Python ZIP Archives
8.5.10 Limitations

9 CRYPTOGRAPHY
9.1 hashlib—Cryptographic Hashing
9.1.1 Sample Data
9.1.2 MD5 Example
9.1.3 SHA-1 Example
9.1.4 Creating a Hash by Name
9.1.5 Incremental Updates
9.2 hmac—Cryptographic Message Signing and Verification
9.2.1 Signing Messages
9.2.2 SHA vs. MD5
9.2.3 Binary Digests
9.2.4 Applications of Message Signatures

10 PROCESSES AND THREADS
10.1 subprocess—Spawning Additional Processes
10.1.1 Running External Commands
10.1.2 Working with Pipes Directly
10.1.3 Connecting Segments of a Pipe
10.1.4 Interacting with Another Command
10.1.5 Signaling between Processes
10.2 signal—Asynchronous System Events
10.2.1 Receiving Signals
10.2.2 Retrieving Registered Handlers
10.2.3 Sending Signals
10.2.4 Alarms
10.2.5 Ignoring Signals
10.2.6 Signals and Threads
10.3 threading—Manage Concurrent Operations
10.3.1 Thread Objects
10.3.2 Determining the Current Thread
10.3.3 Daemon vs. Non-Daemon Threads
10.3.4 Enumerating All Threads
10.3.5 Subclassing Thread
10.3.6 Timer Threads
10.3.7 Signaling between Threads
10.3.8 Controlling Access to Resources
10.3.9 Synchronizing Threads
10.3.10 Limiting Concurrent Access to Resources
10.3.11 Thread-Specific Data
10.4 multiprocessing—Manage Processes like Threads
10.4.1 Multiprocessing Basics
10.4.2 Importable Target Functions
10.4.3 Determining the Current Process
10.4.4 Daemon Processes
10.4.5 Waiting for Processes
10.4.6 Terminating Processes
10.4.7 Process Exit Status
10.4.8 Logging
10.4.9 Subclassing Process
10.4.10 Passing Messages to Processes
10.4.11 Signaling between Processes
10.4.12 Controlling Access to Resources
10.4.13 Synchronizing Operations
10.4.14 Controlling Concurrent Access to Resources
10.4.15 Managing Shared State
10.4.16 Shared Namespaces
10.4.17 Process Pools
10.4.18 Implementing MapReduce

11 NETWORKING
11.1 socket—Network Communication
11.1.1 Addressing, Protocol Families, and Socket Types
11.1.2 TCP/IP Client and Server
11.1.3 User Datagram Client and Server
11.1.4 UNIX Domain Sockets
11.1.5 Multicast
11.1.6 Sending Binary Data
11.1.7 Nonblocking Communication and Timeouts
11.2 select—Wait for I/O Efficiently
11.2.1 Using select()
11.2.2 Nonblocking I/O with Timeouts
11.2.3 Using poll()
11.2.4 Platform-Specific Options
11.3 SocketServer—Creating Network Servers
11.3.1 Server Types
11.3.2 Server Objects
11.3.3 Implementing a Server
11.3.4 Request Handlers
11.3.5 Echo Example
11.3.6 Threading and Forking
11.4 asyncore—Asynchronous I/O
11.4.1 Servers
11.4.2 Clients
11.4.3 The Event Loop
11.4.4 Working with Other Event Loops
11.4.5 Working with Files
11.5 asynchat—Asynchronous Protocol Handler
11.5.1 Message Terminators
11.5.2 Server and Handler
11.5.3 Client
11.5.4 Putting It All Together

12 THE INTERNET
12.1 urlparse—Split URLs into Components
12.1.1 Parsing
12.1.2 Unparsing
12.1.3 Joining
12.2 BaseHTTPServer—Base Classes for Implementing Web Servers
12.2.1 HTTP GET
12.2.2 HTTP POST
12.2.3 Threading and Forking
12.2.4 Handling Errors
12.2.5 Setting Headers
12.3 urllib—Network Resource Access
12.3.1 Simple Retrieval with Cache
12.3.2 Encoding Arguments
12.3.3 Paths vs. URLs
12.4 urllib2—Network Resource Access
12.4.1 HTTP GET
12.4.2 Encoding Arguments
12.4.3 HTTP POST
12.4.4 Adding Outgoing Headers
12.4.5 Posting Form Data from a Request
12.4.6 Uploading Files
12.4.7 Creating Custom Protocol Handlers
12.5 base64—Encode Binary Data with ASCII
12.5.1 Base64 Encoding
12.5.2 Base64 Decoding
12.5.3 URL-Safe Variations
12.5.4 Other Encodings
12.6 robotparser—Internet Spider Access Control
12.6.1 robots.txt
12.6.2 Testing Access Permissions
12.6.3 Long-Lived Spiders
12.7 Cookie—HTTP Cookies
12.7.1 Creating and Setting a Cookie
12.7.2 Morsels
12.7.3 Encoded Values
12.7.4 Receiving and Parsing Cookie Headers
12.7.5 Alternative Output Formats
12.7.6 Deprecated Classes
12.8 uuid—Universally Unique Identifiers
12.8.1 UUID 1—IEEE 802 MAC Address
12.8.2 UUID 3 and 5—Name-Based Values
12.8.3 UUID 4—Random Values
12.8.4 Working with UUID Objects
12.9 json—JavaScript Object Notation
12.9.1 Encoding and Decoding Simple Data Types
12.9.2 Human-Consumable vs. Compact Output
12.9.3 Encoding Dictionaries
12.9.4 Working with Custom Types
12.9.5 Encoder and Decoder Classes
12.9.6 Working with Streams and Files
12.9.7 Mixed Data Streams
12.10 xmlrpclib—Client Library for XML-RPC
12.10.1 Connecting to a Server
12.10.2 Data Types
12.10.3 Passing Objects
12.10.4 Binary Data
12.10.5 Exception Handling
12.10.6 Combining Calls into One Message
12.11 SimpleXMLRPCServer—An XML-RPC Server
12.11.1 A Simple Server
12.11.2 Alternate API Names
12.11.3 Dotted API Names
12.11.4 Arbitrary API Names
12.11.5 Exposing Methods of Objects
12.11.6 Dispatching Calls
12.11.7 Introspection API

13 EMAIL
13.1 smtplib—Simple Mail Transfer Protocol Client
13.1.1 Sending an Email Message
13.1.2 Authentication and Encryption
13.1.3 Verifying an Email Address
13.2 smtpd—Sample Mail Servers
13.2.1 Mail Server Base Class
13.2.2 Debugging Server
13.2.3 Proxy Server
13.3 imaplib—IMAP4 Client Library
13.3.1 Variations
13.3.2 Connecting to a Server
13.3.3 Example Configuration
13.3.4 Listing Mailboxes
13.3.5 Mailbox Status
13.3.6 Selecting a Mailbox
13.3.7 Searching for Messages
13.3.8 Search Criteria
13.3.9 Fetching Messages
13.3.10 Whole Messages
13.3.11 Uploading Messages
13.3.12 Moving and Copying Messages
13.3.13 Deleting Messages
13.4 mailbox—Manipulate Email Archives
13.4.1 mbox
13.4.2 Maildir
13.4.3 Other Formats

14 APPLICATION BUILDING BLOCKS
14.1 getopt—Command-Line Option Parsing
14.1.1 Function Arguments
14.1.2 Short-Form Options
14.1.3 Long-Form Options
14.1.4 A Complete Example
14.1.5 Abbreviating Long-Form Options
14.1.6 GNU-Style Option Parsing
14.1.7 Ending Argument Processing
14.2 optparse—Command-Line Option Parser
14.2.1 Creating an OptionParser
14.2.2 Short- and Long-Form Options
14.2.3 Comparing with getopt
14.2.4 Option Values
14.2.5 Option Actions
14.2.6 Help Messages
14.3 argparse—Command-Line Option and Argument Parsing
14.3.1 Comparing with optparse
14.3.2 Setting Up a Parser
14.3.3 Defining Arguments
14.3.4 Parsing a Command Line
14.3.5 Simple Examples
14.3.6 Automatically Generated Options
14.3.7 Parser Organization
14.3.8 Advanced Argument Processing
14.4 readline—The GNU Readline Library
14.4.1 Configuring
14.4.2 Completing Text
14.4.3 Accessing the Completion Buffer
14.4.4 Input History
14.4.5 Hooks
14.5 getpass—Secure Password Prompt
14.5.1 Example
14.5.2 Using getpass without a Terminal
14.6 cmd—Line-Oriented Command Processors
14.6.1 Processing Commands
14.6.2 Command Arguments
14.6.3 Live Help
14.6.4 Auto-Completion
14.6.5 Overriding Base Class Methods
14.6.6 Configuring Cmd through Attributes
14.6.7 Running Shell Commands
14.6.8 Alternative Inputs
14.6.9 Commands from sys.argv
14.7 shlex—Parse Shell-Style Syntaxes
14.7.1 Quoted Strings
14.7.2 Embedded Comments
14.7.3 Split
14.7.4 Including Other Sources of Tokens
14.7.5 Controlling the Parser
14.7.6 Error Handling
14.7.7 POSIX vs. Non-POSIX Parsing
14.8 ConfigParser—Work with Configuration Files
14.8.1 Configuration File Format
14.8.2 Reading Configuration Files
14.8.3 Accessing Configuration Settings
14.8.4 Modifying Settings
14.8.5 Saving Configuration Files
14.8.6 Option Search Path
14.8.7 Combining Values with Interpolation
14.9 logging—Report Status, Error, and Informational Messages
14.9.1 Logging in Applications vs. Libraries
14.9.2 Logging to a File
14.9.3 Rotating Log Files
14.9.4 Verbosity Levels
14.9.5 Naming Logger Instances
14.10 fileinput—Command-Line Filter Framework
14.10.1 Converting M3U Files to RSS
14.10.2 Progress Metadata
14.10.3 In-Place Filtering
14.11 atexit—Program Shutdown Callbacks
14.11.1 Examples
14.11.2 When Are atexit Functions Not Called?
14.11.3 Handling Exceptions
14.12 sched—Timed Event Scheduler
14.12.1 Running Events with a Delay
14.12.2 Overlapping Events
14.12.3 Event Priorities
14.12.4 Canceling Events

15 INTERNATIONALIZATION AND LOCALIZATION
15.1 gettext—Message Catalogs
15.1.1 Translation Workflow Overview
15.1.2 Creating Message Catalogs from Source Code
15.1.3 Finding Message Catalogs at Runtime
15.1.4 Plural Values
15.1.5 Application vs. Module Localization
15.1.6 Switching Translations
15.2 locale—Cultural Localization API
15.2.1 Probing the Current Locale
15.2.2 Currency
15.2.3 Formatting Numbers
15.2.4 Parsing Numbers
15.2.5 Dates and Times

16 DEVELOPER TOOLS
16.1 pydoc—Online Help for Modules
16.1.1 Plain-Text Help
16.1.2 HTML Help
16.1.3 Interactive Help
16.2 doctest—Testing through Documentation
16.2.1 Getting Started
16.2.2 Handling Unpredictable Output
16.2.3 Tracebacks
16.2.4 Working around Whitespace
16.2.5 Test Locations
16.2.6 External Documentation
16.2.7 Running Tests
16.2.8 Test Context
16.3 unittest—Automated Testing Framework
16.3.1 Basic Test Structure
16.3.2 Running Tests
16.3.3 Test Outcomes
16.3.4 Asserting Truth
16.3.5 Testing Equality
16.3.6 Almost Equal?
16.3.7 Testing for Exceptions
16.3.8 Test Fixtures
16.3.9 Test Suites
16.4 traceback—Exceptions and Stack Traces
16.4.1 Supporting Functions
16.4.2 Working with Exceptions
16.4.3 Working with the Stack
16.5 cgitb—Detailed Traceback Reports
16.5.1 Standard Traceback Dumps
16.5.2 Enabling Detailed Tracebacks
16.5.3 Local Variables in Tracebacks
16.5.4 Exception Properties
16.5.5 HTML Output
16.5.6 Logging Tracebacks
16.6 pdb—Interactive Debugger
16.6.1 Starting the Debugger
16.6.2 Controlling the Debugger
16.6.3 Breakpoints
16.6.4 Changing Execution Flow
16.6.5 Customizing the Debugger with Aliases
16.6.6 Saving Configuration Settings
16.7 trace—Follow Program Flow
16.7.1 Example Program
16.7.2 Tracing Execution
16.7.3 Code Coverage
16.7.4 Calling Relationships
16.7.5 Programming Interface
16.7.6 Saving Result Data
16.7.7 Options
16.8 profile and pstats—Performance Analysis
16.8.1 Running the Profiler
16.8.2 Running in a Context
16.8.3 pstats: Saving and Working with Statistics
16.8.4 Limiting Report Contents
16.8.5 Caller / Callee Graphs
16.9 timeit—Time the Execution of Small Bits of Python Code
16.9.1 Module Contents
16.9.2 Basic Example
16.9.3 Storing Values in a Dictionary
16.9.4 From the Command Line
16.10 compileall—Byte-Compile Source Files
16.10.1 Compiling One Directory
16.10.2 Compiling sys.path
16.10.3 From the Command Line
16.11 pyclbr—Class Browser
16.11.1 Scanning for Classes
16.11.2 Scanning for Functions

17 RUNTIME FEATURES
17.1 site—Site-Wide Configuration
17.1.1 Import Path
17.1.2 User Directories
17.1.3 Path Configuration Files
17.1.4 Customizing Site Configuration
17.1.5 Customizing User Configuration
17.1.6 Disabling the site Module
17.2 sys—System-Specific Configuration
17.2.1 Interpreter Settings
17.2.2 Runtime Environment
17.2.3 Memory Management and Limits
17.2.4 Exception Handling
17.2.5 Low-Level Thread Support
17.2.6 Modules and Imports
17.2.7 Tracing a Program as It Runs
17.3 os—Portable Access to Operating System Specific Features
17.3.1 Process Owner
17.3.2 Process Environment
17.3.3 Process Working Directory
17.3.4 Pipes
17.3.5 File Descriptors
17.3.6 File System Permissions
17.3.7 Directories
17.3.8 Symbolic Links
17.3.9 Walking a Directory Tree
17.3.10 Running External Commands
17.3.11 Creating Processes with os.fork()
17.3.12 Waiting for a Child
17.3.13 Spawn
17.3.14 File System Permissions
17.4 platform—System Version Information
17.4.1 Interpreter
17.4.2 Platform
17.4.3 Operating System and Hardware Info
17.4.4 Executable Architecture
17.5 resource—System Resource Management
17.5.1 Current Usage
17.5.2 Resource Limits
17.6 gc—Garbage Collector
17.6.1 Tracing References
17.6.2 Forcing Garbage Collection
17.6.3 Finding References to Objects that Cannot Be Collected
17.6.4 Collection Thresholds and Generations
17.6.5 Debugging
17.7 sysconfig—Interpreter Compile-Time Configuration
17.7.1 Configuration Variables
17.7.2 Installation Paths
17.7.3 Python Version and Platform

18 LANGUAGE TOOLS
18.1 warnings—Nonfatal Alerts
18.1.1 Categories and Filtering
18.1.2 Generating Warnings
18.1.3 Filtering with Patterns
18.1.4 Repeated Warnings
18.1.5 Alternate Message Delivery Functions
18.1.6 Formatting
18.1.7 Stack Level in Warnings
18.2 abc—Abstract Base Classes
18.2.1 Why Use Abstract Base Classes?
18.2.2 How Abstract Base Classes Work
18.2.3 Registering a Concrete Class
18.2.4 Implementation through Subclassing
18.2.5 Concrete Methods in ABCs
18.2.6 Abstract Properties
18.3 dis—Python Bytecode Disassembler
18.3.1 Basic Disassembly
18.3.2 Disassembling Functions
18.3.3 Classes
18.3.4 Using Disassembly to Debug
18.3.5 Performance Analysis of Loops
18.3.6 Compiler Optimizations
18.4 inspect—Inspect Live Objects
18.4.1 Example Module
18.4.2 Module Information
18.4.3 Inspecting Modules
18.4.4 Inspecting Classes
18.4.5 Documentation Strings
18.4.6 Retrieving Source
18.4.7 Method and Function Arguments
18.4.8 Class Hierarchies
18.4.9 Method Resolution Order
18.4.10 The Stack and Frames
18.5 exceptions—Built-in Exception Classes
18.5.1 Base Classes
18.5.2 Raised Exceptions
18.5.3 Warning Categories

19 MODULES AND PACKAGES
19.1 imp—Python’s Import Mechanism
19.1.1 Example Package
19.1.2 Module Types
19.1.3 Finding Modules
19.1.4 Loading Modules
19.2 zipimport—Load Python Code from ZIP Archives
19.2.1 Example
19.2.2 Finding a Module
19.2.3 Accessing Code
19.2.4 Source
19.2.5 Packages
19.2.6 Data
19.3 pkgutil—Package Utilities
19.3.1 Package Import Paths
19.3.2 Development Versions of Packages
19.3.3 Managing Paths with PKG Files
19.3.4 Nested Packages
19.3.5 Package Data

Index of Python Modules
Index

TABLES

1.1 Regular Expression Escape Codes
1.2 Regular Expression Anchoring Codes
1.3 Regular Expression Flag Abbreviations
2.1 Byte Order Specifiers for struct
6.1 Codec Error Handling Modes
7.1 The “project” Table
7.2 The “task” Table
7.3 CSV Dialect Parameters
10.1 Multiprocessing Exit Codes
11.1 Event Flags for poll()
13.1 IMAP 4 Mailbox Status Conditions
14.1 Flags for Variable Argument Definitions in argparse
14.2 Logging Levels
16.1 Test Case Outcomes
17.1 CPython Command-Line Option Flags
17.2 Event Hooks for settrace()
17.3 Platform Information Functions
17.4 Path Names Used in sysconfig
18.1 Warning Filter Actions

FOREWORD

It’s Thanksgiving Day, 2010. For those outside of the United States, and for many of those within it, it might just seem like a holiday where people eat a ton of food, watch some football, and otherwise hang out. For me, and many others, it’s a time to take a look back and think about the things that have enriched our lives and give thanks for them. Sure, we should be doing that every day, but having a single day that’s focused on just saying thanks sometimes makes us think a bit more broadly and a bit more deeply.

I’m sitting here writing the foreword to this book, something I’m very thankful for having the opportunity to do—but I’m not just thinking about the content of the book, or the author, who is a fantastic community member. I’m thinking about the subject matter itself—Python—and specifically, its standard library.

Every version of Python shipped today contains hundreds of modules spanning many years, many developers, many subjects, and many tasks. It contains modules for everything from sending and receiving email, to GUI development, to a built-in HTTP server. By itself, the standard library is a massive work. Without the people who have maintained it throughout the years, and the hundreds of people who have submitted patches, documentation, and feedback, it would not be what it is today.

It’s an astounding accomplishment, and something that has been the critical component in the rise of Python’s popularity as a language and ecosystem. Without the standard library, without the “batteries included” motto of the core team and others, Python would never have come as far.
It has been downloaded by hundreds of thousands of people and companies, and has been installed on millions of servers, desktops, and other devices.

Without the standard library, Python would still be a fantastic language, built on solid concepts of teaching, learning, and readability. It might have gotten far enough on its own, based on those merits. But the standard library turns it from an interesting experiment into a powerful and effective tool.

Every day, developers across the world build tools and entire applications based on nothing but the core language and the standard library. You not only get the ability to conceptualize what a car is (the language), but you also get enough parts and tools to put together a basic car yourself. It might not be the perfect car, but it gets you from A to B, and that’s incredibly empowering and rewarding. Time and time again, I speak to people who look at me proudly and say, “Look what I built with nothing except what came with Python!”

It is not, however, a fait accompli. The standard library has its warts. Given its size and breadth, and its age, it’s no real surprise that some of the modules have varying levels of quality, API clarity, and coverage. Some of the modules have suffered “feature creep,” or have failed to keep up with modern advances in the areas they cover. Python continues to evolve, grow, and improve over time through the help and hard work of many, many unpaid volunteers.

Some argue, though, that due to the shortcomings and because the standard library doesn’t necessarily comprise the “best of breed” solutions for the areas its modules cover (“best of” is a continually moving and adapting target, after all), it should be killed or sent out to pasture, despite continual improvement. These people miss the fact that not only is the standard library a critical piece of what makes Python continually successful, but also, despite its warts, it is still an excellent resource.
But I’ve intentionally ignored one giant area: documentation. The standard library’s documentation is good and is constantly improving and evolving. Given the size and breadth of the standard library, the documentation is amazing for what it is. It’s awesome that we have hundreds of pages of documentation contributed by hundreds of developers and users. The documentation is used every single day by hundreds of thousands of people to create things: things as simple as one-off scripts and as complex as the software that controls giant robotic arms.

The documentation is why we are here, though. All good documentation and code starts with an idea: a kernel of a concept about what something is, or will be. Outward from that kernel come the characters (the APIs) and the storyline (the modules). In the case of code, sometimes it starts with a simple idea: “I want to parse a string and look for a date.” But when you reach the end, when you’re looking at the few hundred unit tests, functions, and other bits you’ve made, you sit back and realize you’ve built something much, much more vast than originally intended. The same goes for documentation, especially the documentation of code.

The examples are the most critical component in the documentation of code, in my estimation. You can write a narrative about a piece of an API until it spans entire books, and you can describe the loosely coupled interface with pretty words and thoughtful use cases. But it all falls flat if a user approaching it for the first time can’t glue those pretty words, thoughtful use cases, and API signatures together into something that makes sense and solves their problems.

Examples are the gateway by which people make the critical connections, those logical jumps from an abstract concept into something concrete. It’s one thing to “know” the ideas and API; it’s another to see it used.
It helps jump the void when you’re not only trying to learn something, but also trying to improve existing things.

Which brings us back to Python. Doug Hellmann, the author of this book, started a blog in 2007 called the Python Module of the Week. In the blog, he walked through various modules of the standard library, taking an example-first approach to showing how each one worked and why. From the first day I read it, it had a place right next to the core Python documentation. His writing has become an indispensable resource for me and many other people in the Python community.

Doug’s writings fill a critical gap in the Python documentation I see today: the need for examples. Showing how and why something works in a functional, simple manner is no easy task. And, as we’ve seen, it’s a critical and valuable body of work that helps people every single day. People send me emails with alarming regularity saying things like, “Did you see this post by Doug? This is awesome!” or “Why isn’t this in the core documentation? It helped me understand how things really work!”

When I heard Doug was going to take the time to further flesh out his existing work, to turn it into a book I could keep on my desk to dog-ear and wear out from near constant use, I was more than a little excited. Doug is a fantastic technical writer with a great eye for detail. Having an entire book dedicated to real examples of how over a hundred modules in the standard library work, written by him, blows my mind.

You see, I’m thankful for Python. I’m thankful for the standard library, warts and all. I’m thankful for the massive, vibrant, yet sometimes dysfunctional community we have. I’m thankful for the tireless work of the core development team, past, present, and future. I’m thankful for the resources, the time, and the effort so many community members, of whom Doug Hellmann is an exemplary one, have put into making this community and ecosystem such an amazing place.
Lastly, I’m thankful for this book. Its author will continue to be well respected and the book well used in the years to come.

Jesse Noller
Python Core Developer
PSF Board Member
Principal Engineer, Nasuni Corporation

ACKNOWLEDGMENTS

This book would not have come into being without the contributions and support of many people.

I was first introduced to Python around 1997 by Dick Wall, while we were working together on GIS software at ERDAS. I remember being simultaneously happy that I had found a new tool language that was so easy to use, and sad that the company did not let us use it for “real work.” I have used Python extensively at all of my subsequent jobs, and I have Dick to thank for the many happy hours I have spent working on software since then.

The Python core development team has created a robust ecosystem of language, tools, and libraries that continue to grow in popularity and find new application areas. Without the amazing investment in time and resources they have given us, we would all still be spending our time reinventing wheel after wheel.

As described in the Introduction, the material in this book started out as a series of blog posts. Each of those posts has been reviewed and commented on by members of the Python community, with corrections, suggestions, and questions that led to changes in the version you find here. Thank you all for reading along week after week, and contributing your time and attention.

The technical reviewers for the book (Matt Culbreth, Katie Cunningham, Jeff McNeil, and Keyton Weissinger) spent many hours looking for issues with the example code and accompanying explanations. The result is stronger than I could have produced on my own. I also received advice from Jesse Noller on the multiprocessing module and Brett Cannon on creating custom importers.

A special thanks goes to the editors and production staff at Pearson for all their hard work and assistance in helping me realize my vision for this book.
Finally, I want to thank my wife, Theresa Flynn, who has always given me excellent writing advice and was a constant source of encouragement throughout the entire process of creating this book. I doubt she knew what she was getting herself into when she told me, “You know, at some point, you have to sit down and start writing it.” It’s your turn.

ABOUT THE AUTHOR

Doug Hellmann is currently a senior developer with Racemi, Inc., and communications director of the Python Software Foundation. He has been programming in Python since version 1.4 and has worked on a variety of UNIX and non-UNIX platforms for projects in fields such as mapping, medical news publishing, banking, and data center automation. After a year as a regular columnist for Python Magazine, he served as editor-in-chief from 2008 to 2009. Since 2007, Doug has published the popular Python Module of the Week series on his blog. He lives in Athens, Georgia.

INTRODUCTION

Distributed with every copy of Python, the standard library contains hundreds of modules that provide tools for interacting with the operating system, interpreter, and Internet. All of them are tested and ready to be used to jump-start the development of your applications. This book presents selected examples demonstrating how to use the most commonly used features of the modules that give Python its “batteries included” slogan, taken from the popular Python Module of the Week (PyMOTW) blog series.

This Book’s Target Audience

The audience for this book is an intermediate Python programmer, so although all the source code is presented with discussion, only a few cases include line-by-line explanations. Every section focuses on the features of the modules, illustrated by the source code and output from fully independent example programs.
Each feature is presented as concisely as possible, so the reader can focus on the module or function being demonstrated without being distracted by the supporting code.

An experienced programmer familiar with other languages may be able to learn Python from this book, but it is not intended to be an introduction to the language. Some prior experience writing Python programs will be useful when studying the examples.

Several sections, such as the description of network programming with sockets or message signing with hmac, require domain-specific knowledge. The basic information needed to explain the examples is included here, but the range of topics covered by the modules in the standard library makes it impossible to cover every topic comprehensively in a single volume. The discussion of each module is followed by a list of suggested sources for more information and further reading. These include online resources, RFC standards documents, and related books.

Although the current transition to Python 3 is well underway, Python 2 is still likely to be the primary version of Python used in production environments for years to come because of the large amount of legacy Python 2 source code available and the slow transition rate to Python 3. All the source code for the examples has been updated from the original online versions and tested with Python 2.7, the final release of the 2.x series. Many of the example programs can be readily adapted to work with Python 3, but others cover modules that have been renamed or deprecated.

How This Book Is Organized

The modules are grouped into chapters to make it easy to find an individual module for reference and browse by subject for more leisurely exploration. The book supplements the comprehensive reference guide available on http://docs.python.org, providing fully functional example programs to demonstrate the features described there.
Downloading the Example Code

The original versions of the articles, errata for the book, and the sample code are available on the author’s web site (http://www.doughellmann.com/books/byexample).

Chapter 1

TEXT

The string class is the most obvious text-processing tool available to Python programmers, but plenty of other tools in the standard library are available to make advanced text manipulation simple.

Older code, written before Python 2.0, uses functions from the string module, instead of methods of string objects. There is an equivalent method for each function from the module, and use of the functions is deprecated for new code.

Programs using Python 2.4 or later may use string.Template as a simple way to parameterize strings beyond the features of the string or unicode classes. While not as feature-rich as templates defined by many of the Web frameworks or extension modules available from the Python Package Index, string.Template is a good middle ground for user-modifiable templates where dynamic values need to be inserted into otherwise static text.

The textwrap module includes tools for formatting text taken from paragraphs by limiting the width of output, adding indentation, and inserting line breaks to wrap lines consistently.

The standard library includes two modules related to comparing text values beyond the built-in equality and sort comparison supported by string objects. re provides a complete regular expression library, implemented in C for speed. Regular expressions are well-suited to finding substrings within a larger data set, comparing strings against a pattern more complex than another fixed string, and performing mild parsing.

difflib, on the other hand, computes the actual differences between sequences of text in terms of the parts added, removed, or changed.
The output of the comparison functions in difflib can be used to provide more detailed feedback to users about where changes occur in two inputs, how a document has changed over time, etc.

1.1 string: Text Constants and Templates

Purpose: Contains constants and classes for working with text.
Python Version: 1.4 and later

The string module dates from the earliest versions of Python. In version 2.0, many of the functions previously implemented only in the module were moved to methods of str and unicode objects. Legacy versions of those functions are still available, but their use is deprecated and they will be dropped in Python 3.0. The string module retains several useful constants and classes for working with string and unicode objects, and this discussion will concentrate on them.

1.1.1 Functions

The two functions capwords() and maketrans() are not moving from the string module. capwords() capitalizes all words in a string.

    import string

    s = 'The quick brown fox jumped over the lazy dog.'

    print s
    print string.capwords(s)

The results are the same as calling split(), capitalizing the words in the resulting list, and then calling join() to combine the results.

    $ python string_capwords.py

    The quick brown fox jumped over the lazy dog.
    The Quick Brown Fox Jumped Over The Lazy Dog.

The maketrans() function creates translation tables that can be used with the translate() method to change one set of characters to another more efficiently than with repeated calls to replace().

    import string

    leet = string.maketrans('abegiloprstz', '463611092572')

    s = 'The quick brown fox jumped over the lazy dog.'

    print s
    print s.translate(leet)

In this example, some letters are replaced by their l33t number alternatives.

    $ python string_maketrans.py

    The quick brown fox jumped over the lazy dog.
    Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06.
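The equivalence described above for capwords() can be checked directly. The following is a minimal sketch, not code from the book; the variable name manual is introduced only for illustration, and the print calls are written with a single argument so the script runs unchanged on Python 2.7 and Python 3.

```python
import string

s = 'The quick brown fox jumped over the lazy dog.'

# Spell out the three steps the text describes: split on whitespace,
# capitalize each word, and join the words with single spaces.
manual = ' '.join(word.capitalize() for word in s.split())

print(manual)
print(manual == string.capwords(s))
```

Because capitalize() lowercases everything after the first character of each word, mixed-case words such as "McDonald" come back as "Mcdonald", and runs of whitespace collapse to a single space.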
1.1.2 Templates

String templates were added in Python 2.4 as part of PEP 292 and are intended as an alternative to the built-in interpolation syntax. With string.Template interpolation, variables are identified by prefixing the name with $ (e.g., $var) or, if necessary to set them off from surrounding text, they can also be wrapped with curly braces (e.g., ${var}).

This example compares a simple template with a similar string interpolation using the % operator.

    import string

    values = { 'var':'foo' }

    t = string.Template("""
    Variable        : $var
    Escape          : $$
    Variable in text: ${var}iable
    """)

    print 'TEMPLATE:', t.substitute(values)

    s = """
    Variable        : %(var)s
    Escape          : %%
    Variable in text: %(var)siable
    """

    print 'INTERPOLATION:', s % values

In both cases, the trigger character ($ or %) is escaped by doubling it.

    $ python string_template.py

    TEMPLATE:
    Variable        : foo
    Escape          : $
    Variable in text: fooiable

    INTERPOLATION:
    Variable        : foo
    Escape          : %
    Variable in text: fooiable

One key difference between templates and standard string interpolation is that the argument type is not considered. The values are converted to strings, and the strings are inserted into the result. No formatting options are available. For example, there is no way to control the number of digits used to represent a floating-point value.

A benefit, though, is that by using the safe_substitute() method, it is possible to avoid exceptions if not all values the template needs are provided as arguments.

    import string

    values = { 'var':'foo' }

    t = string.Template("$var is here but $missing is not provided")

    try:
        print 'substitute()     :', t.substitute(values)
    except KeyError, err:
        print 'ERROR:', str(err)

    print 'safe_substitute():', t.safe_substitute(values)

Since there is no value for missing in the values dictionary, a KeyError is raised by substitute(). Instead of raising the error, safe_substitute() catches it and leaves the variable expression alone in the text.
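Two consequences of this behavior are worth sketching. Because values are inserted as plain strings, one way to control the digits of a number is to format it before handing it to the template; and because safe_substitute() leaves unknown names in place, its result can be fed through a second template pass later. The template strings and the names raw, rounded, partial, and final below are hypothetical, chosen only for illustration. The script is written to run on both Python 2.7 and Python 3.

```python
import string

# Values are inserted as plain strings, so a float shows whatever
# digits its default string conversion produces. Formatting the
# number first controls the digits.
price = string.Template('Total: $amount')
raw = price.substitute(amount=2.5)
rounded = price.substitute(amount='%.2f' % 2.5)

# safe_substitute() leaves unknown names in place, so a second
# template pass can fill them in later.
t = string.Template('$var is here but $missing is not provided')
partial = t.safe_substitute(var='foo')
final = string.Template(partial).substitute(missing='bar')

print(raw)
print(rounded)
print(partial)
print(final)
```

The two-pass approach assumes the values inserted in the first pass contain no $ characters; if they do, the second pass will try to interpret them as delimiters.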
    $ python string_template_missing.py

    substitute()     : ERROR: 'missing'
    safe_substitute(): foo is here but $missing is not provided

1.1.3 Advanced Templates

The default syntax for string.Template can be changed by adjusting the regular expression patterns it uses to find the variable names in the template body. A simple way to do that is to change the delimiter and idpattern class attributes.

    import string

    template_text = '''
      Delimiter : %%
      Replaced  : %with_underscore
      Ignored   : %notunderscored
    '''

    d = { 'with_underscore':'replaced',
          'notunderscored':'not replaced',
          }

    class MyTemplate(string.Template):
        delimiter = '%'
        idpattern = '[a-z]+_[a-z]+'

    t = MyTemplate(template_text)
    print 'Modified ID pattern:'
    print t.safe_substitute(d)

In this example, the substitution rules are changed so that the delimiter is % instead of $ and variable names must include an underscore. The pattern %notunderscored is not replaced by anything because it does not include an underscore character.

    $ python string_template_advanced.py

    Modified ID pattern:

      Delimiter : %
      Replaced  : replaced
      Ignored   : %notunderscored

For more complex changes, override the pattern attribute and define an entirely new regular expression. The pattern provided must contain four named groups for capturing the escaped delimiter, the named variable, a braced version of the variable name, and any invalid delimiter patterns.

    import string

    t = string.Template('$var')
    print t.pattern.pattern

The value of t.pattern is a compiled regular expression, but the original string is available via its pattern attribute.

    \$(?:
      (?P<escaped>\$) |                # two delimiters
      (?P<named>[_a-z][_a-z0-9]*)    | # identifier
      {(?P<braced>[_a-z][_a-z0-9]*)} | # braced identifier
      (?P<invalid>)                    # ill-formed delimiter exprs
    )

This example defines a new pattern to create a new type of template using {{var}} as the variable syntax.
    import re
    import string

    class MyTemplate(string.Template):
        delimiter = '{{'
        pattern = r'''
        \{\{(?:
        (?P<escaped>\{\{)|
        (?P<named>[_a-z][_a-z0-9]*)\}\}|
        (?P<braced>[_a-z][_a-z0-9]*)\}\}|
        (?P<invalid>)
        )
        '''

    t = MyTemplate('''
    {{{{
    {{var}}
    ''')

    print 'MATCHES:', t.pattern.findall(t.template)
    print 'SUBSTITUTED:', t.safe_substitute(var='replacement')

Both the named and braced patterns must be provided separately, even though they are the same. Running the sample program generates:

    $ python string_template_newsyntax.py

    MATCHES: [('{{', '', '', ''), ('', 'var', '', '')]
    SUBSTITUTED:
    {{
    replacement

See Also:

string (http://docs.python.org/lib/module-string.html) Standard library documentation for this module.
String Methods (http://docs.python.org/lib/string-methods.html#string-methods) Methods of str objects that replace the deprecated functions in string.
PEP 292 (www.python.org/dev/peps/pep-0292) A proposal for a simpler string substitution syntax.
l33t (http://en.wikipedia.org/wiki/Leet) “Leetspeak” alternative alphabet.

1.2 textwrap: Formatting Text Paragraphs

Purpose: Formatting text by adjusting where line breaks occur in a paragraph.
Python Version: 2.5 and later

The textwrap module can be used to format text for output when pretty-printing is desired. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors and word processors.

1.2.1 Example Data

The examples in this section use the module textwrap_example.py, which contains a string sample_text.

    sample_text = '''
        The textwrap module can be used to format text for output in
        situations where pretty-printing is desired. It offers
        programmatic functionality similar to the paragraph wrapping
        or filling features found in many text editors.
        '''

1.2.2 Filling Paragraphs

The fill() function takes text as input and produces formatted text as output.
    import textwrap
    from textwrap_example import sample_text

    print 'No dedent:\n'
    print textwrap.fill(sample_text, width=50)

The results are something less than desirable. The text is now left justified, but the first line retains its indent and the spaces from the front of each subsequent line are embedded in the paragraph.

    $ python textwrap_fill.py

    No dedent:

         The textwrap module can be used to format
    text for output in     situations where pretty-
    printing is desired. It offers     programmatic
    functionality similar to the paragraph wrapping
    or filling features found in many text editors.

1.2.3 Removing Existing Indentation

The previous example has embedded tabs and extra spaces mixed into the output, so it is not formatted very cleanly. Removing the common whitespace prefix from all lines in the sample text produces better results and allows the use of docstrings or embedded multiline strings straight from Python code while removing the code formatting itself. The sample string has an artificial indent level introduced for illustrating this feature.

    import textwrap
    from textwrap_example import sample_text

    dedented_text = textwrap.dedent(sample_text)
    print 'Dedented:'
    print dedented_text

The results are starting to look better:

    $ python textwrap_dedent.py

    Dedented:

    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.

Since “dedent” is the opposite of “indent,” the result is a block of text with the common initial whitespace from each line removed. If one line is already indented more than another, some of the whitespace will not be removed.

Input like

     Line one.
       Line two.
     Line three.

becomes

    Line one.
      Line two.
    Line three.

1.2.4 Combining Dedent and Fill

Next, the dedented text can be passed through fill() with a few different width values.
    import textwrap
    from textwrap_example import sample_text

    dedented_text = textwrap.dedent(sample_text).strip()
    for width in [ 45, 70 ]:
        print '%d Columns:\n' % width
        print textwrap.fill(dedented_text, width=width)
        print

This produces outputs in the specified widths.

    $ python textwrap_fill_width.py

    45 Columns:

    The textwrap module can be used to format
    text for output in situations where pretty-
    printing is desired. It offers programmatic
    functionality similar to the paragraph
    wrapping or filling features found in many
    text editors.

    70 Columns:

    The textwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers programmatic
    functionality similar to the paragraph wrapping or filling features
    found in many text editors.

1.2.5 Hanging Indents

Just as the width of the output can be set, the indent of the first line can be controlled independently of subsequent lines.

    import textwrap
    from textwrap_example import sample_text

    dedented_text = textwrap.dedent(sample_text).strip()
    print textwrap.fill(dedented_text,
                        initial_indent='',
                        subsequent_indent=' ' * 4,
                        width=50,
                        )

This makes it possible to produce a hanging indent, where the first line is indented less than the other lines.

    $ python textwrap_hanging_indent.py

    The textwrap module can be used to format text for
        output in situations where pretty-printing is
        desired. It offers programmatic functionality
        similar to the paragraph wrapping or filling
        features found in many text editors.

The indent values can include nonwhitespace characters, too. The hanging indent can be prefixed with * to produce bullet points, etc.

See Also:

textwrap (http://docs.python.org/lib/module-textwrap.html) Standard library documentation for this module.

1.3 re: Regular Expressions

Purpose: Searching within and changing text using formal patterns.
Python Version: 1.5 and later

Regular expressions are text-matching patterns described with a formal syntax.
The patterns are interpreted as a set of instructions, which are then executed with a string as input to produce a matching subset or modified version of the original. The term “regular expressions” is frequently shortened to “regex” or “regexp” in conversation. Expressions can include literal text matching, repetition, pattern composition, branching, and other sophisticated rules. Many parsing problems are easier to solve using a regular expression than by creating a special-purpose lexer and parser.

Regular expressions are typically used in applications that involve a lot of text processing. For example, they are commonly used as search patterns in text-editing programs used by developers, including vi, emacs, and modern IDEs. They are also an integral part of UNIX command line utilities, such as sed, grep, and awk. Many programming languages include support for regular expressions in the language syntax (Perl, Ruby, Awk, and Tcl). Other languages, such as C, C++, and Python, support regular expressions through extension libraries.

There are multiple open source implementations of regular expressions, each sharing a common core syntax but having different extensions or modifications to their advanced features. The syntax used in Python’s re module is based on the syntax used for regular expressions in Perl, with a few Python-specific enhancements.

Note: Although the formal definition of “regular expression” is limited to expressions that describe regular languages, some of the extensions supported by re go beyond describing regular languages. The term “regular expression” is used here in a more general sense to mean any expression that can be evaluated by Python’s re module.

1.3.1 Finding Patterns in Text

The most common use for re is to search for patterns in text. The search() function takes the pattern and text to scan, and returns a Match object when the pattern is found. If the pattern is not found, search() returns None.
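Because search() returns None when nothing matches, code that goes on to call methods on the result usually checks for None first. The following is a minimal sketch of that check, not code from the book; the second pattern, 'that', is added here only to show the no-match branch. It is written to run on both Python 2.7 and Python 3.

```python
import re

text = 'Does this text match the pattern?'

for pattern in ['this', 'that']:
    match = re.search(pattern, text)
    if match is None:
        # Calling match.start() here would raise AttributeError.
        print('%r not found' % pattern)
    else:
        print('%r found at %d:%d' % (pattern, match.start(), match.end()))
```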
Each Match object holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs.

    import re

    pattern = 'this'
    text = 'Does this text match the pattern?'

    match = re.search(pattern, text)

    s = match.start()
    e = match.end()

    print 'Found "%s"\nin "%s"\nfrom %d to %d ("%s")' % \
        (match.re.pattern, match.string, s, e, text[s:e])

The start() and end() methods give the indexes into the string showing where the text matched by the pattern occurs.

    $ python re_simple_match.py

    Found "this"
    in "Does this text match the pattern?"
    from 5 to 9 ("this")

1.3.2 Compiling Expressions

re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently. The compile() function converts an expression string into a RegexObject.

    import re

    # Precompile the patterns
    regexes = [ re.compile(p)
                for p in [ 'this', 'that' ]
                ]
    text = 'Does this text match the pattern?'

    print 'Text: %r\n' % text

    for regex in regexes:
        print 'Seeking "%s" ->' % regex.pattern,

        if regex.search(text):
            print 'match!'
        else:
            print 'no match'

The module-level functions maintain a cache of compiled expressions. However, the size of the cache is limited, and using compiled expressions directly avoids the cache lookup overhead. Another advantage of using compiled expressions is that by precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action.

    $ python re_simple_compiled.py

    Text: 'Does this text match the pattern?'

    Seeking "this" -> match!
    Seeking "that" -> no match

1.3.3 Multiple Matches

So far, the example patterns have all used search() to look for single instances of literal text strings.
The findall() function returns all substrings of the input that match the pattern without overlapping.

    import re

    text = 'abbaaabbbbaaaaa'

    pattern = 'ab'

    for match in re.findall(pattern, text):
        print 'Found "%s"' % match

There are two instances of ab in the input string.

    $ python re_findall.py

    Found "ab"
    Found "ab"

finditer() returns an iterator that produces Match instances instead of the strings returned by findall().

    import re

    text = 'abbaaabbbbaaaaa'

    pattern = 'ab'

    for match in re.finditer(pattern, text):
        s = match.start()
        e = match.end()
        print 'Found "%s" at %d:%d' % (text[s:e], s, e)

This example finds the same two occurrences of ab, and the Match instance shows where they are in the original input.

    $ python re_finditer.py

    Found "ab" at 0:2
    Found "ab" at 5:7

1.3.4 Pattern Syntax

Regular expressions support more powerful patterns than simple literal text strings. Patterns can repeat, can be anchored to different logical locations within the input, and can be expressed in compact forms that do not require every literal character to be present in the pattern. All of these features are used by combining literal text values with metacharacters that are part of the regular expression pattern syntax implemented by re.

    import re

    def test_patterns(text, patterns=[]):
        """Given source text and a list of patterns, look for
        matches for each pattern within the text and print
        them to stdout.
        """
        # Look for each pattern in the text and print the results
        for pattern, desc in patterns:
            print 'Pattern %r (%s)\n' % (pattern, desc)
            print ' %r' % text
            for match in re.finditer(pattern, text):
                s = match.start()
                e = match.end()
                substr = text[s:e]
                n_backslashes = text[:s].count('\\')
                prefix = '.' * (s + n_backslashes)
                print ' %s%r' % (prefix, substr)
            print
        return

    if __name__ == '__main__':
        test_patterns('abbaaabbbbaaaaa',
                      [('ab', "'a' followed by 'b'"),
                       ])

The following examples will use test_patterns() to explore how variations in patterns change the way they match the same input text. The output shows the input text and the substring range from each portion of the input that matches the pattern.

    $ python re_test_patterns.py

    Pattern 'ab' ('a' followed by 'b')

     'abbaaabbbbaaaaa'
     'ab'
     .....'ab'

Repetition

There are five ways to express repetition in a pattern. A pattern followed by the metacharacter * is repeated zero or more times. (Allowing a pattern to repeat zero times means it does not need to appear at all to match.) Replace the * with + and the pattern must appear at least once. Using ? means the pattern appears zero times or one time. For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat. And, finally, to allow a variable but limited number of repetitions, use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.

    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('ab*', 'a followed by zero or more b'),
         ('ab+', 'a followed by one or more b'),
         ('ab?', 'a followed by zero or one b'),
         ('ab{3}', 'a followed by three b'),
         ('ab{2,3}', 'a followed by two to three b'),
         ])

There are more matches for ab* and ab? than ab+.
    $ python re_repetition.py

    Pattern 'ab*' (a followed by zero or more b)

     'abbaabbba'
     'abb'
     ...'a'
     ....'abbb'
     ........'a'

    Pattern 'ab+' (a followed by one or more b)

     'abbaabbba'
     'abb'
     ....'abbb'

    Pattern 'ab?' (a followed by zero or one b)

     'abbaabbba'
     'ab'
     ...'a'
     ....'ab'
     ........'a'

    Pattern 'ab{3}' (a followed by three b)

     'abbaabbba'
     ....'abbb'

    Pattern 'ab{2,3}' (a followed by two to three b)

     'abbaabbba'
     'abb'
     ....'abbb'

Normally, when processing a repetition instruction, re will consume as much of the input as possible while matching the pattern. This so-called greedy behavior may result in fewer individual matches, or the matches may include more of the input text than intended. Greediness can be turned off by following the repetition instruction with ?.

    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('ab*?', 'a followed by zero or more b'),
         ('ab+?', 'a followed by one or more b'),
         ('ab??', 'a followed by zero or one b'),
         ('ab{3}?', 'a followed by three b'),
         ('ab{2,3}?', 'a followed by two to three b'),
         ])

Disabling greedy consumption of the input for any patterns where zero occurrences of b are allowed means the matched substring does not include any b characters.

    $ python re_repetition_non_greedy.py

    Pattern 'ab*?' (a followed by zero or more b)

     'abbaabbba'
     'a'
     ...'a'
     ....'a'
     ........'a'

    Pattern 'ab+?' (a followed by one or more b)

     'abbaabbba'
     'ab'
     ....'ab'

    Pattern 'ab??' (a followed by zero or one b)

     'abbaabbba'
     'a'
     ...'a'
     ....'a'
     ........'a'

    Pattern 'ab{3}?' (a followed by three b)

     'abbaabbba'
     ....'abbb'

    Pattern 'ab{2,3}?' (a followed by two to three b)

     'abbaabbba'
     'abb'
     ....'abb'

Character Sets

A character set is a group of characters, any one of which can match at that point in the pattern. For example, [ab] would match either a or b.
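As a quick check of that description, the sketch below compares the character set [ab] with the alternation a|b on a short sample word. The sample text and the variable names are hypothetical, introduced only for illustration, and the alternation serves only as a cross-check. The script is written to run on both Python 2.7 and Python 3.

```python
import re

text = 'cabbage'

# [ab] matches exactly one character, which may be either a or b.
single = re.findall('[ab]', text)

# Cross-check: the alternation a|b selects the same single characters.
alternation = re.findall('a|b', text)

print(single)
print(alternation == single)
```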
    from re_test_patterns import test_patterns

    test_patterns(
        'abbaabbba',
        [('[ab]', 'either a or b'),
         ('a[ab]+', 'a followed by 1 or more a or b'),
         ('a[ab]+?', 'a followed by 1 or more a or b, not greedy'),
         ])

The greedy form of the expression (a[ab]+) consumes the entire string because the first letter is a and every subsequent character is either a or b.

    $ python re_charset.py

    Pattern '[ab]' (either a or b)

      'abbaabbba'
      'a'
      .'b'
      ..'b'
      ...'a'
      ....'a'
      .....'b'
      ......'b'
      .......'b'
      ........'a'

    Pattern 'a[ab]+' (a followed by 1 or more a or b)

      'abbaabbba'
      'abbaabbba'

    Pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

      'abbaabbba'
      'ab'
      ...'aa'

A character set can also be used to exclude specific characters. A caret (^) at the start of the set means to look for characters that are not in the set.

    from re_test_patterns import test_patterns

    test_patterns(
        'This is some text -- with punctuation.',
        [('[^-. ]+', 'sequences without -, ., or space'),
         ])

This pattern finds all the substrings that do not contain the characters -, ., or a space.

    $ python re_charset_exclude.py

    Pattern '[^-. ]+' (sequences without -, ., or space)

      'This is some text -- with punctuation.'
      'This'
      .....'is'
      ........'some'
      .............'text'
      .....................'with'
      ..........................'punctuation'

As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format uses character ranges to define a character set that includes all contiguous characters between a start point and a stop point.
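The equivalence between a range and the fully spelled-out set can be checked directly. The following is a minimal sketch, not one of the book's listings: it assumes a Python 3 interpreter (so print is a function) and uses re.findall() instead of the test_patterns() helper so that it runs on its own.

```python
import re

text = 'This is some text -- with punctuation.'

# Listing every lowercase ASCII letter by hand...
explicit = re.findall('[abcdefghijklmnopqrstuvwxyz]+', text)

# ...describes the same character set as the range form.
ranged = re.findall('[a-z]+', text)

# Two ranges can share a single set.
either_case = re.findall('[a-zA-Z]+', text)

print(ranged)
print(explicit == ranged)
print(either_case)
```

The uppercase T is excluded from the [a-z] results, which is why the first match is his rather than This.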
from re_test_patterns import test_patterns test_patterns( ’This is some text -- with punctuation.’, [(’[a-z]+’, ’sequences of lowercase letters’), (’[A-Z]+’, ’sequences of uppercase letters’), (’[a-zA-Z]+’, ’sequences of lowercase or uppercase letters’), (’[A-Z][a-z]+’, ’one uppercase followed by lowercase’), ]) Here the range a-z includes the lowercase ASCII letters, and the range A-Z in- cludes the uppercase ASCII letters. The ranges can also be combined into a single character set. $ python re_charset_ranges.py Pattern ’[a-z]+’ (sequences of lowercase letters) ’This is some text -- with punctuation.’ .’his’ .....’is’ ........’some’ .............’text’ .....................’with’ ..........................’punctuation’ Pattern ’[A-Z]+’ (sequences of uppercase letters) ’This is some text -- with punctuation.’ ’T’ 1.3. re—Regular Expressions 23 Pattern ’[a-zA-Z]+’ (sequences of lowercase or uppercase letters) ’This is some text -- with punctuation.’ ’This’ .....’is’ ........’some’ .............’text’ .....................’with’ ..........................’punctuation’ Pattern ’[A-Z][a-z]+’ (one uppercase followed by lowercase) ’This is some text -- with punctuation.’ ’This’ As a special case of a character set, the metacharacter dot, or period (.), indicates that the pattern should match any single character in that position. from re_test_patterns import test_patterns test_patterns( ’abbaabbba’, [(’a.’, ’a followed by any one character’), (’b.’, ’b followed by any one character’), (’a.*b’, ’a followed by anything, ending in b’), (’a.*?b’, ’a followed by anything, ending in b’), ]) Combining a dot with repetition can result in very long matches, unless the non- greedy form is used. 
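The difference between the greedy and non-greedy forms of a dot with repetition can be seen in isolation with a short sketch. It assumes Python 3 and uses re.findall() directly rather than the test_patterns() helper.

```python
import re

text = 'abbaabbba'

# Greedy: .* runs to the end of the input, then backs up to the last b.
greedy = re.findall('a.*b', text)

# Non-greedy: .*? stops at the first b that lets the pattern succeed.
lazy = re.findall('a.*?b', text)

print(greedy)
print(lazy)
```

The greedy form produces one long match, while the non-greedy form produces two short ones from the same input.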
    $ python re_charset_dot.py

    Pattern 'a.' (a followed by any one character)

      'abbaabbba'
      'ab'
      ...'aa'

    Pattern 'b.' (b followed by any one character)

      'abbaabbba'
      .'bb'
      .....'bb'
      .......'ba'

    Pattern 'a.*b' (a followed by anything, ending in b)

      'abbaabbba'
      'abbaabbb'

    Pattern 'a.*?b' (a followed by anything, ending in b)

      'abbaabbba'
      'ab'
      ...'aab'

Escape Codes

An even more compact representation uses escape codes for several predefined character sets. The escape codes recognized by re are listed in Table 1.1.

Table 1.1. Regular Expression Escape Codes

    Code    Meaning
    \d      A digit
    \D      A nondigit
    \s      Whitespace (tab, space, newline, etc.)
    \S      Nonwhitespace
    \w      Alphanumeric
    \W      Nonalphanumeric

Note: Escapes are indicated by prefixing the character with a backslash (\). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, eliminates this problem and maintains readability.

    from re_test_patterns import test_patterns

    test_patterns(
        'A prime #1 example!',
        [(r'\d+', 'sequence of digits'),
         (r'\D+', 'sequence of nondigits'),
         (r'\s+', 'sequence of whitespace'),
         (r'\S+', 'sequence of nonwhitespace'),
         (r'\w+', 'alphanumeric characters'),
         (r'\W+', 'nonalphanumeric'),
         ])

These sample expressions combine escape codes with repetition to find sequences of like characters in the input string.
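The point of the note about raw strings can be verified with a few lines. This is a small sketch assuming Python 3, where \d and \w are Unicode-aware by default for str patterns; for this ASCII-only input the results should be the same as under Python 2.

```python
import re

text = 'A prime #1 example!'

# A doubled backslash in a normal literal and a single backslash in a
# raw literal produce the same two-character escape code.
same_pattern = ('\\d+' == r'\d+')

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)
gaps = re.findall(r'\W+', text)

print(same_pattern)
print(digits)
print(words)
print(gaps)
```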
$ python re_escape_codes.py Pattern ’\\d+’ (sequence of digits) ’A prime #1 example!’ .........’1’ Pattern ’\\D+’ (sequence of nondigits) ’A prime #1 example!’ ’A prime #’ ..........’ example!’ Pattern ’\\s+’ (sequence of whitespace) ’A prime #1 example!’ .’ ’ .......’ ’ ..........’ ’ Pattern ’\\S+’ (sequence of nonwhitespace) ’A prime #1 example!’ ’A’ ..’prime’ ........’#1’ ...........’example!’ Pattern ’\\w+’ (alphanumeric characters) ’A prime #1 example!’ ’A’ 26 Text ..’prime’ .........’1’ ...........’example’ Pattern ’\\W+’ (nonalphanumeric) ’A prime #1 example!’ .’ ’ .......’ #’ ..........’ ’ ..................’!’ To match the characters that are part of the regular expression syntax, escape the characters in the search pattern. from re_test_patterns import test_patterns test_patterns( r’\d+ \D+ \s+’, [(r’\\.\+’, ’escape code’), ]) The pattern in this example escapes the backslash and plus characters, since, as metacharacters, both have special meaning in a regular expression. $ python re_escape_escapes.py Pattern ’\\\\.\\+’ (escape code) ’\\d+ \\D+ \\s+’ ’\\d+’ .....’\\D+’ ..........’\\s+’ Anchoring In addition to describing the content of a pattern to match, the relative location can be specified in the input text where the pattern should appear by using anchoring instruc- tions. Table 1.2 lists valid anchoring codes. 1.3. re—Regular Expressions 27 Table 1.2. 
Regular Expression Anchoring Codes Code Meaning ^ Start of string, or line $ End of string, or line \A Start of string \Z End of string \b Empty string at the beginning or end of a word \B Empty string not at the beginning or end of a word from re_test_patterns import test_patterns test_patterns( ’This is some text -- with punctuation.’, [(r’^\w+’, ’word at start of string’), (r’\A\w+’, ’word at start of string’), (r’\w+\S*$’, ’word near end of string, skip punctuation’), (r’\w+\S*\Z’, ’word near end of string, skip punctuation’), (r’\w*t\w*’, ’word containing t’), (r’\bt\w+’, ’t at start of word’), (r’\w+t\b’, ’t at end of word’), (r’\Bt\B’, ’t, not start or end of word’), ]) The patterns in the example for matching words at the beginning and end of the string are different because the word at the end of the string is followed by punctuation to terminate the sentence. The pattern \w+$ would not match, since . is not considered an alphanumeric character. $ python re_anchoring.py Pattern ’^\\w+’ (word at start of string) ’This is some text -- with punctuation.’ ’This’ Pattern ’\\A\\w+’ (word at start of string) ’This is some text -- with punctuation.’ ’This’ Pattern ’\\w+\\S*$’ (word near end of string, skip punctuation) 28 Text ’This is some text -- with punctuation.’ ..........................’punctuation.’ Pattern ’\\w+\\S*\\Z’ (word near end of string, skip punctuation) ’This is some text -- with punctuation.’ ..........................’punctuation.’ Pattern ’\\w*t\\w*’ (word containing t) ’This is some text -- with punctuation.’ .............’text’ .....................’with’ ..........................’punctuation’ Pattern ’\\bt\\w+’ (t at start of word) ’This is some text -- with punctuation.’ .............’text’ Pattern ’\\w+t\\b’ (t at end of word) ’This is some text -- with punctuation.’ .............’text’ Pattern ’\\Bt\\B’ (t, not start or end of word) ’This is some text -- with punctuation.’ .......................’t’ ..............................’t’ 
.................................'t'

1.3.5 Constraining the Search

If it is known in advance that only a subset of the full input should be searched, the regular expression match can be further constrained by telling re to limit the search range. For example, if the pattern must appear at the front of the input, then using match() instead of search() will anchor the search without having to explicitly include an anchor in the search pattern.

    import re

    text = 'This is some text -- with punctuation.'
    pattern = 'is'

    print 'Text   :', text
    print 'Pattern:', pattern

    m = re.match(pattern, text)
    print 'Match  :', m
    s = re.search(pattern, text)
    print 'Search :', s

Since the literal text "is" does not appear at the start of the input text, it is not found using match(). The sequence appears two other times in the text, though, so search() finds it.

    $ python re_match.py

    Text   : This is some text -- with punctuation.
    Pattern: is
    Match  : None
    Search : <_sre.SRE_Match object at 0x100d2bed0>

The search() method of a compiled regular expression accepts optional start and end position parameters to limit the search to a substring of the input.

    import re

    text = 'This is some text -- with punctuation.'
    pattern = re.compile(r'\b\w*is\w*\b')

    print 'Text:', text
    print

    pos = 0
    while True:
        match = pattern.search(text, pos)
        if not match:
            break
        s = match.start()
        e = match.end()
        print '  %2d : %2d = "%s"' % \
            (s, e-1, text[s:e])
        # Move forward in text for the next search
        pos = e

This example implements a less efficient form of finditer(). Each time a match is found, the end position of that match is used as the starting point for the next search.

    $ python re_search_substring.py

    Text: This is some text -- with punctuation.

       0 :  3 = "This"
       5 :  6 = "is"

1.3.6 Dissecting Matches with Groups

Searching for pattern matches is the basis of the powerful capabilities provided by regular expressions.
Adding groups to a pattern isolates parts of the matching text, expanding those capabilities to create a parser. Groups are defined by enclosing patterns in parentheses (( and )). from re_test_patterns import test_patterns test_patterns( ’abbaaabbbbaaaaa’, [(’a(ab)’, ’a followed by literal ab’), (’a(a*b*)’, ’a followed by 0-n a and 0-n b’), (’a(ab)*’, ’a followed by 0-n ab’), (’a(ab)+’, ’a followed by 1-n ab’), ]) Any complete regular expression can be converted to a group and nested within a larger expression. All repetition modifiers can be applied to a group as a whole, requir- ing the entire group pattern to repeat. $ python re_groups.py Pattern ’a(ab)’ (a followed by literal ab) ’abbaaabbbbaaaaa’ ....’aab’ Pattern ’a(a*b*)’ (a followed by 0-n a and 0-n b) ’abbaaabbbbaaaaa’ 1.3. re—Regular Expressions 31 ’abb’ ...’aaabbbb’ ..........’aaaaa’ Pattern ’a(ab)*’ (a followed by 0-n ab) ’abbaaabbbbaaaaa’ ’a’ ...’a’ ....’aab’ ..........’a’ ...........’a’ ............’a’ .............’a’ ..............’a’ Pattern ’a(ab)+’ (a followed by 1-n ab) ’abbaaabbbbaaaaa’ ....’aab’ To access the substrings matched by the individual groups within a pattern, use the groups() method of the Match object. import re text = ’This is some text -- with punctuation.’ print text print patterns = [ (r’^(\w+)’, ’word at start of string’), (r’(\w+)\S*$’, ’word at end, with optional punctuation’), (r’(\bt\w+)\W+(\w+)’, ’word starting with t, another word’), (r’(\w+t)\b’, ’word ending with t’), ] for pattern, desc in patterns: regex = re.compile(pattern) match = regex.search(text) print ’Pattern %r (%s)\n’ % (pattern, desc) 32 Text print ’’, match.groups() print Match.groups() returns a sequence of strings in the order of the groups within the expression that matches the string. $ python re_groups_match.py This is some text -- with punctuation. 
    Pattern '^(\\w+)' (word at start of string)

       ('This',)

    Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)

       ('punctuation',)

    Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)

       ('text', 'with')

    Pattern '(\\w+t)\\b' (word ending with t)

       ('text',)

Ask for the match of a single group with group(). This is useful when grouping is being used to find parts of the string, but some parts matched by groups are not needed in the results.

    import re

    text = 'This is some text -- with punctuation.'

    print 'Input text            :', text

    # word starting with 't' then another word
    regex = re.compile(r'(\bt\w+)\W+(\w+)')
    print 'Pattern               :', regex.pattern

    match = regex.search(text)
    print 'Entire match          :', match.group(0)
    print 'Word starting with "t":', match.group(1)
    print 'Word after "t" word   :', match.group(2)

Group 0 represents the string matched by the entire expression, and subgroups are numbered starting with 1 in the order their left parenthesis appears in the expression.

    $ python re_groups_individual.py

    Input text            : This is some text -- with punctuation.
    Pattern               : (\bt\w+)\W+(\w+)
    Entire match          : text -- with
    Word starting with "t": text
    Word after "t" word   : with

Python extends the basic grouping syntax to add named groups. Using names to refer to groups makes it easier to modify the pattern over time, without having to also modify the code using the match results. To set the name of a group, use the syntax (?P<name>pattern).

    import re

    text = 'This is some text -- with punctuation.'

    print text
    print

    for pattern in [
        r'^(?P<first_word>\w+)',
        r'(?P<last_word>\w+)\S*$',
        r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
        r'(?P<ends_with_t>\w+t)\b',
        ]:
        regex = re.compile(pattern)
        match = regex.search(text)
        print 'Matching "%s"' % pattern
        print '  ', match.groups()
        print '  ', match.groupdict()
        print

Use groupdict() to retrieve the dictionary that maps group names to substrings from the match. Named patterns also are included in the ordered sequence returned by groups().
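For readers on Python 3, the same named-group lookups can be sketched as follows. The interpreter version is an assumption, since the listings in this section use Python 2 print statements. The group names t_word and other_word are the ones reported by groupdict() in the sample output.

```python
import re

text = 'This is some text -- with punctuation.'

# Each (?P<name>...) group is reachable both by position and by name.
regex = re.compile(r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)')
match = regex.search(text)

by_position = match.groups()
by_name = match.groupdict()
single = match.group('t_word')

print(by_position)
print(by_name)
print(single)
```

Passing a name to group() returns the same substring as passing the group's number, so code that reads results by name keeps working if groups are later added to or removed from the pattern.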
    $ python re_groups_named.py

    This is some text -- with punctuation.

    Matching "^(?P<first_word>\w+)"
       ('This',)
       {'first_word': 'This'}

    Matching "(?P<last_word>\w+)\S*$"
       ('punctuation',)
       {'last_word': 'punctuation'}

    Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
       ('text', 'with')
       {'other_word': 'with', 't_word': 'text'}

    Matching "(?P<ends_with_t>\w+t)\b"
       ('text',)
       {'ends_with_t': 'text'}

An updated version of test_patterns() that shows the numbered and named groups matched by a pattern will make the following examples easier to follow.

    import re

    def test_patterns(text, patterns=[]):
        """Given source text and a list of patterns, look for
        matches for each pattern within the text and print
        them to stdout.
        """
        # Look for each pattern in the text and print the results
        for pattern, desc in patterns:
            print 'Pattern %r (%s)\n' % (pattern, desc)
            print '  %r' % text
            for match in re.finditer(pattern, text):
                s = match.start()
                e = match.end()
                # Pad with spaces so each match lines up under the input
                prefix = ' ' * (s)
                print '  %s%r%s ' % (prefix, text[s:e], ' ' * (len(text)-e)),
                print match.groups()
                if match.groupdict():
                    print '%s%s' % (' ' * (len(text)-s), match.groupdict())
            print
        return

Since a group is itself a complete regular expression, groups can be nested within other groups to build even more complicated expressions.

    from re_test_patterns_groups import test_patterns

    test_patterns(
        'abbaabbba',
        [(r'a((a*)(b*))', 'a followed by 0-n a and 0-n b'),
         ])

In this case, the group (a*) matches an empty string, so the return value from groups() includes that empty string as the matched value.

    $ python re_groups_nested.py

    Pattern 'a((a*)(b*))' (a followed by 0-n a and 0-n b)

      'abbaabbba'
      'abb'        ('bb', '', 'bb')
         'aabbb'   ('abbb', 'a', 'bbb')
              'a'  ('', '', '')

Groups are also useful for specifying alternative patterns. Use the pipe symbol (|) to indicate that one pattern or another should match. Consider the placement of the pipe carefully, though.
The first expression in this example matches a sequence of a followed by a sequence consisting entirely of a single letter, a or b. The second pattern matches a followed by a sequence that may include either a or b. The patterns are similar, but the resulting matches are completely different. from re_test_patterns_groups import test_patterns test_patterns( ’abbaabbba’, [(r’a((a+)|(b+))’, ’a then seq. of a or seq. of b’), (r’a((a|b)+)’, ’a then seq. of [ab]’), ]) When an alternative group is not matched but the entire pattern does match, the return value of groups() includes a None value at the point in the sequence where the alternative group should appear. 36 Text $ python re_groups_alternative.py Pattern ’a((a+)|(b+))’ (a then seq. of a or seq. of b) ’abbaabbba’ ’abb’ (’bb’, None, ’bb’) ’aa’ (’a’, ’a’, None) Pattern ’a((a|b)+)’ (a then seq. of [ab]) ’abbaabbba’ ’abbaabbba’ (’bbaabbba’, ’a’) Defining a group containing a subpattern is also useful when the string matching the subpattern is not part of what should be extracted from the full text. These groups are called noncapturing. Noncapturing groups can be used to describe repetition patterns or alternatives, without isolating the matching portion of the string in the value returned. To create a noncapturing group, use the syntax (?:pattern). from re_test_patterns_groups import test_patterns test_patterns( ’abbaabbba’, [(r’a((a+)|(b+))’, ’capturing form’), (r’a((?:a+)|(?:b+))’, ’noncapturing’), ]) Compare the groups returned for the capturing and noncapturing forms of a pattern that match the same results. $ python re_groups_noncapturing.py Pattern ’a((a+)|(b+))’ (capturing form) ’abbaabbba’ ’abb’ (’bb’, None, ’bb’) ’aa’ (’a’, ’a’, None) Pattern ’a((?:a+)|(?:b+))’ (noncapturing) ’abbaabbba’ 1.3. re—Regular Expressions 37 ’abb’ (’bb’,) ’aa’ (’a’,) 1.3.7 Search Options The way the matching engine processes an expression can be changed using op- tion flags. 
The flags can be combined using a bitwise OR operation, then passed to compile(), search(), match(), and other functions that accept a pattern for searching. Case-Insensitive Matching IGNORECASE causes literal characters and character ranges in the pattern to match both uppercase and lowercase characters. import re text = ’This is some text -- with punctuation.’ pattern = r’\bT\w+’ with_case = re.compile(pattern) without_case = re.compile(pattern, re.IGNORECASE) print ’Text:\n %r’ % text print ’Pattern:\n %s’ % pattern print ’Case-sensitive:’ for match in with_case.findall(text): print ’ %r’ % match print ’Case-insensitive:’ for match in without_case.findall(text): print ’ %r’ % match Since the pattern includes the literal T, without setting IGNORECASE, the only match is the word This. When case is ignored, text also matches. $ python re_flags_ignorecase.py Text: ’This is some text -- with punctuation.’ Pattern: \bT\w+ Case-sensitive: ’This’ 38 Text Case-insensitive: ’This’ ’text’ Input with Multiple Lines Two flags affect how searching in multiline input works: MULTILINE and DOTALL. The MULTILINE flag controls how the pattern-matching code processes anchoring instruc- tions for text containing newline characters. When multiline mode is turned on, the anchor rules for ^ and $ apply at the beginning and end of each line, in addition to the entire string. import re text = ’This is some text -- with punctuation.\nA second line.’ pattern = r’(^\w+)|(\w+\S*$)’ single_line = re.compile(pattern) multiline = re.compile(pattern, re.MULTILINE) print ’Text:\n %r’ % text print ’Pattern:\n %s’ % pattern print ’Single Line :’ for match in single_line.findall(text): print ’ %r’ % (match,) print ’Multiline :’ for match in multiline.findall(text): print ’ %r’ % (match,) The pattern in the example matches the first or last word of the input. It matches line. at the end of the string, even though there is no newline. 
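The effect of MULTILINE on each anchor can also be checked separately. This is a minimal sketch assuming Python 3; it splits the combined pattern into its two halves so that findall() returns plain strings instead of tuples.

```python
import re

text = 'This is some text -- with punctuation.\nA second line.'

# Without the flag, ^ and $ only apply to the string as a whole.
first_single = re.findall(r'^\w+', text)
last_single = re.findall(r'\w+\S*$', text)

# With MULTILINE, the anchors also apply at every line boundary.
first_multi = re.findall(r'^\w+', text, re.MULTILINE)
last_multi = re.findall(r'\w+\S*$', text, re.MULTILINE)

print(first_single, last_single)
print(first_multi, last_multi)
```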
$ python re_flags_multiline.py Text: ’This is some text -- with punctuation.\nA second line.’ Pattern: (^\w+)|(\w+\S*$) Single Line : (’This’, ’’) (’’, ’line.’) Multiline : (’This’, ’’) (’’, ’punctuation.’) 1.3. re—Regular Expressions 39 (’A’, ’’) (’’, ’line.’) DOTALL is the other flag related to multiline text. Normally, the dot character (.) matches everything in the input text except a newline character. The flag allows dot to match newlines as well. import re text = ’This is some text -- with punctuation.\nA second line.’ pattern = r’.+’ no_newlines = re.compile(pattern) dotall = re.compile(pattern, re.DOTALL) print ’Text:\n %r’ % text print ’Pattern:\n %s’ % pattern print ’No newlines :’ for match in no_newlines.findall(text): print ’ %r’ % match print ’Dotall :’ for match in dotall.findall(text): print ’ %r’ % match Without the flag, each line of the input text matches the pattern separately. Adding the flag causes the entire string to be consumed. $ python re_flags_dotall.py Text: ’This is some text -- with punctuation.\nA second line.’ Pattern: .+ No newlines : ’This is some text -- with punctuation.’ ’A second line.’ Dotall : ’This is some text -- with punctuation.\nA second line.’ Unicode Under Python 2, str objects use the ASCII character set, and regular expression pro- cessing assumes that the pattern and input text are both ASCII. The escape codes 40 Text described earlier are defined in terms of ASCII by default. Those assumptions mean that the pattern \w+ will match the word “French” but not the word “Français,” since the ç is not part of the ASCII character set. To enable Unicode matching in Python 2, add the UNICODE flag when compiling the pattern or when calling the module-level functions search() and match(). import re import codecs import sys # Set standard output encoding to UTF-8. 
sys.stdout = codecs.getwriter(’UTF-8’)(sys.stdout) text = u’Français złoty Österreich’ pattern = ur’\w+’ ascii_pattern = re.compile(pattern) unicode_pattern = re.compile(pattern, re.UNICODE) print ’Text :’, text print ’Pattern :’, pattern print ’ASCII :’, u’, ’.join(ascii_pattern.findall(text)) print ’Unicode :’, u’, ’.join(unicode_pattern.findall(text)) The other escape sequences (\W, \b, \B, \d, \D, \s, and \S) are also processed differently for Unicode text. Instead of assuming what members of the character set are identified by the escape sequence, the regular expression engine consults the Unicode database to find the properties of each character. $ python re_flags_unicode.py Text : Français złoty Österreich Pattern : \w+ ASCII : Fran, ais, z, oty, sterreich Unicode : Français, złoty, Österreich Note: Python 3 uses Unicode for all strings by default, so the flag is not necessary. Verbose Expression Syntax The compact format of regular expression syntax can become a hindrance as expres- sions grow more complicated. As the number of groups in an expression increases, it 1.3. re—Regular Expressions 41 will be more work to keep track of why each element is needed and how exactly the parts of the expression interact. Using named groups helps mitigate these issues, but a better solution is to use verbose mode expressions, which allow comments and extra whitespace to be embedded in the pattern. A pattern to validate email addresses will illustrate how verbose mode makes working with regular expressions easier. The first version recognizes addresses that end in one of three top-level domains: .com, .org, and .edu. 
    import re

    address = re.compile('[\w\d.+-]+@([\w\d.]+\.)+(com|org|edu)', re.UNICODE)

    candidates = [
        u'first.last@example.com',
        u'first.last+category@gmail.com',
        u'valid-address@mail.example.com',
        u'not-valid@example.foo',
        ]

    for candidate in candidates:
        match = address.search(candidate)
        print '%-30s %s' % (candidate, 'Matches' if match else 'No match')

This expression is already complex. There are several character classes, groups, and repetition expressions.

    $ python re_email_compact.py

    first.last@example.com         Matches
    first.last+category@gmail.com  Matches
    valid-address@mail.example.com Matches
    not-valid@example.foo          No match

Converting the expression to a more verbose format will make it easier to extend.

    import re

    address = re.compile(
        '''
        [\w\d.+-]+       # username
        @
        ([\w\d.]+\.)+    # domain name prefix
        (com|org|edu)    # TODO: support more top-level domains
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'first.last+category@gmail.com',
        u'valid-address@mail.example.com',
        u'not-valid@example.foo',
        ]

    for candidate in candidates:
        match = address.search(candidate)
        print '%-30s %s' % (candidate, 'Matches' if match else 'No match')

The expression matches the same inputs, but in this extended format, it is easier to read. The comments also help identify different parts of the pattern so that it can be expanded to match more inputs.

    $ python re_email_verbose.py

    first.last@example.com         Matches
    first.last+category@gmail.com  Matches
    valid-address@mail.example.com Matches
    not-valid@example.foo          No match

This expanded version parses inputs that include a person's name and email address, as might appear in an email header. The name comes first and stands on its own, and the email address follows surrounded by angle brackets (< and >).

    import re

    address = re.compile(
        '''

        # A name is made up of letters, and may include "."
        # for title abbreviations and middle initials.
        ((?P<name>
           ([\w.,]+\s+)*[\w.,]+)
           \s*
           # Email addresses are wrapped in angle
           # brackets: < > but only if a name is
           # found, so keep the start bracket in this
           # group.
           <
        )? # the entire name is optional

        # The address itself: username@domain.tld
        (?P<email>
          [\w\d.+-]+       # username
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)    # limit the allowed top-level domains
        )

        >? # optional closing angle bracket
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'first.last@example.com',
        u'first.last+category@gmail.com',
        u'valid-address@mail.example.com',
        u'not-valid@example.foo',
        u'First Last <first.last@example.com>',
        u'No Brackets first.last@example.com',
        u'First Last',
        u'First Middle Last <first.last@example.com>',
        u'First M. Last <first.last@example.com>',
        u'<first.last@example.com>',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print '  Name :', match.groupdict()['name']
            print '  Email:', match.groupdict()['email']
        else:
            print '  No match'

As with other programming languages, the ability to insert comments into verbose regular expressions helps with their maintainability. This final version includes implementation notes to future maintainers and whitespace to separate the groups from each other and highlight their nesting level.

    $ python re_email_with_name.py

    Candidate: first.last@example.com
      Name : None
      Email: first.last@example.com
    Candidate: first.last+category@gmail.com
      Name : None
      Email: first.last+category@gmail.com
    Candidate: valid-address@mail.example.com
      Name : None
      Email: valid-address@mail.example.com
    Candidate: not-valid@example.foo
      No match
    Candidate: First Last <first.last@example.com>
      Name : First Last
      Email: first.last@example.com
    Candidate: No Brackets first.last@example.com
      Name : None
      Email: first.last@example.com
    Candidate: First Last
      No match
    Candidate: First Middle Last <first.last@example.com>
      Name : First Middle Last
      Email: first.last@example.com
    Candidate: First M. Last <first.last@example.com>
      Name : First M. Last
      Email: first.last@example.com
    Candidate: <first.last@example.com>
      Name : None
      Email: first.last@example.com

Embedding Flags in Patterns

If flags cannot be added when compiling an expression, such as when a pattern is passed as an argument to a library function that will compile it later, the flags can be embedded inside the expression string itself. For example, to turn case-insensitive matching on, add (?i) to the beginning of the expression.

    import re

    text = 'This is some text -- with punctuation.'
    pattern = r'(?i)\bT\w+'
    regex = re.compile(pattern)

    print 'Text    :', text
    print 'Pattern :', pattern
    print 'Matches :', regex.findall(text)

Because the options control the way the entire expression is evaluated or parsed, they should always come at the beginning of the expression.

    $ python re_flags_embedded.py

    Text    : This is some text -- with punctuation.
    Pattern : (?i)\bT\w+
    Matches : ['This', 'text']

The abbreviations for all flags are listed in Table 1.3.

Table 1.3. Regular Expression Flag Abbreviations

    Flag          Abbreviation
    IGNORECASE    i
    MULTILINE     m
    DOTALL        s
    UNICODE       u
    VERBOSE       x

Embedded flags can be combined by placing them within the same group. For example, (?imu) turns on case-insensitive matching for multiline Unicode strings.

1.3.8 Looking Ahead or Behind

In many cases, it is useful to match a part of a pattern only if some other part will also match. For example, in the email parsing expression, the angle brackets were each marked as optional. Really, though, the brackets should be paired, and the expression should only match if both are present or neither is. This modified version of the expression uses a positive look-ahead assertion to match the pair. The look-ahead assertion syntax is (?=pattern).

    import re

    address = re.compile(
        '''
        # A name is made up of letters, and may include "."
        # for title abbreviations and middle initials.
        ((?P<name>
           ([\w.,]+\s+)*[\w.,]+
         )
         \s+
        ) # name is no longer optional

        # LOOKAHEAD
        # Email addresses are wrapped in angle brackets, but only
        # if they are both present or neither is.
        (?= (<.*>$)       # remainder wrapped in angle brackets
            |
            ([^<].*[^>]$) # remainder *not* wrapped in angle brackets
          )

        <? # optional opening angle bracket

        # The address itself: username@domain.tld
        (?P<email>
          [\w\d.+-]+       # username
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)    # limit the allowed top-level domains
        )

        >? # optional closing angle bracket
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'First Last <first.last@example.com>',
        u'No Brackets first.last@example.com',
        u'Open Bracket <first.last@example.com',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print '  Name :', match.groupdict()['name']
            print '  Email:', match.groupdict()['email']
        else:
            print '  No match'

Several important changes occur in this version of the expression. First, the name portion is no longer optional. That means stand-alone addresses do not match, but it also prevents improperly formatted name/address combinations from matching. The positive look-ahead rule after the "name" group asserts that the remainder of the string is either wrapped in a pair of angle brackets or contains no mismatched bracket: the brackets are either both present or both absent. The look-ahead is expressed as a group, but the match for a look-ahead group does not consume any of the input text. The rest of the pattern picks up from the same spot after the look-ahead matches.

    $ python re_look_ahead.py

    Candidate: First Last <first.last@example.com>
      Name : First Last
      Email: first.last@example.com
    Candidate: No Brackets first.last@example.com
      Name : No Brackets
      Email: first.last@example.com
    Candidate: Open Bracket <first.last@example.com
      No match

A negative look-ahead assertion ((?!pattern)) says that the pattern does not match the text following the current point. For example, the email recognition pattern could be modified to ignore the noreply mailing addresses commonly used by automated systems.
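Before the full verbose pattern, the idea can be sketched on a single line. This example assumes Python 3 and uses a deliberately simplified address pattern, shorter than the one used in the listings, to show that the look-ahead rejects one address without consuming any input.

```python
import re

# Simplified, illustrative pattern: the (?!noreply@) assertion is checked
# at the start of the string and fails for any address that begins with
# "noreply@"; otherwise matching continues from the same position.
address = re.compile(r'^(?!noreply@)[\w.+-]+@([\w.]+\.)+(com|org|edu)$')

candidates = ['first.last@example.com', 'noreply@example.com']
results = {c: bool(address.search(c)) for c in candidates}

for candidate, matched in results.items():
    print(candidate, 'Match' if matched else 'No match')
```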
import re address = re.compile( ’’’ ^ 48 Text # An address: username@domain.tld # Ignore noreply addresses (?!noreply@.*$) [\w\d.+-]+ # username @ ([\w\d.]+\.)+ # domain name prefix (com|org|edu) # limit the allowed top-level domains $ ’’’, re.UNICODE | re.VERBOSE) candidates = [ u’first.last@example.com’, u’noreply@example.com’, ] for candidate in candidates: print ’Candidate:’, candidate match = address.search(candidate) if match: print ’ Match:’, candidate[match.start():match.end()] else: print ’ No match’ The address starting with noreply does not match the pattern, since the look- ahead assertion fails. $ python re_negative_look_ahead.py Candidate: first.last@example.com Match: first.last@example.com Candidate: noreply@example.com No match Instead of looking ahead for noreply in the username portion of the email ad- dress, the pattern can also be written using a negative look-behind assertion after the username is matched using the syntax (? \1 # first name \. \4 # last name @ ([\w\d.]+\.)+ # domain name prefix (com|org|edu) # limit the allowed top-level domains ) > ’’’, re.UNICODE | re.VERBOSE | re.IGNORECASE) candidates = [ u’First Last ’, u’Different Name ’, u’First Middle Last ’, u’First M. Last ’, ] for candidate in candidates: print ’Candidate:’, candidate match = address.search(candidate) if match: print ’ Match name :’, match.group(1), match.group(4) print ’ Match email:’, match.group(5) else: print ’ No match’ Although the syntax is simple, creating back-references by numerical id has a couple of disadvantages. From a practical standpoint, as the expression changes, the groups must be counted again and every reference may need to be updated. The other disadvantage is that only 99 references can be made this way, because if the id number 52 Text is three digits long, it will be interpreted as an octal character value instead of a group reference. 
On the other hand, if an expression has more than 99 groups, more serious maintenance challenges will arise than not being able to refer to some groups in the expression.

    $ python re_refer_to_group.py

    Candidate: First Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com
    Candidate: Different Name <first.last@example.com>
      No match
    Candidate: First Middle Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com
    Candidate: First M. Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com

Python's expression parser includes an extension that uses (?P=name) to refer to the value of a named group matched earlier in the expression.

    import re

    address = re.compile(
        '''

        # The regular name
        (?P<first_name>\w+)
        \s+
        (([\w.]+)\s+)?      # optional middle name or initial
        (?P<last_name>\w+)

        \s+

        <

        # The address: first_name.last_name@domain.tld
        (?P<email>
          (?P=first_name)
          \.
          (?P=last_name)
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)    # limit the allowed top-level domains
        )

        >
        ''',
        re.UNICODE | re.VERBOSE | re.IGNORECASE)

    candidates = [
        u'First Last <first.last@example.com>',
        u'Different Name <first.last@example.com>',
        u'First Middle Last <first.last@example.com>',
        u'First M. Last <first.last@example.com>',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print '  Match name :', match.groupdict()['first_name'],
            print match.groupdict()['last_name']
            print '  Match email:', match.groupdict()['email']
        else:
            print '  No match'

The address expression is compiled with the IGNORECASE flag on, since proper names are normally capitalized but email addresses are not.

    $ python re_refer_to_named_group.py

    Candidate: First Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com
    Candidate: Different Name <first.last@example.com>
      No match
    Candidate: First Middle Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com
    Candidate: First M. Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com

The other mechanism for using back-references in expressions chooses a different pattern based on whether a previous group matched.
The email pattern can be corrected so that the angle brackets are required if a name is present, but not if the email address is by itself. The syntax for testing to see if a group has matched is (?(id)yes-expression|no-expression), where id is the group name or number, yes-expression is the pattern to use if the group has a value, and no-expression is the pattern to use otherwise.

    import re

    address = re.compile(
        '''
        ^

        # A name is made up of letters, and may include "."
        # for title abbreviations and middle initials.
        (?P<name>
           ([\w.]+\s+)*[\w.]+
         )?
        \s*

        # Email addresses are wrapped in angle brackets, but
        # only if a name is found.
        (?(name)
          # remainder wrapped in angle brackets because
          # there is a name
          (?P<brackets>(?=(<.*>$)))
          |
          # remainder does not include angle brackets without name
          (?=([^<].*[^>]$))
         )

        # Only look for a bracket if the look-ahead assertion
        # found both of them.
        (?(brackets)<|\s*)

        # The address itself: username@domain.tld
        (?P<email>
          [\w\d.+-]+       # username
          @
          ([\w\d.]+\.)+    # domain name prefix
          (com|org|edu)    # limit the allowed top-level domains
         )

        # Only look for a bracket if the look-ahead assertion
        # found both of them.
        (?(brackets)>|\s*)

        $
        ''',
        re.UNICODE | re.VERBOSE)

    candidates = [
        u'First Last <first.last@example.com>',
        u'No Brackets first.last@example.com',
        u'Open Bracket <first.last@example.com',
        u'no.brackets@example.com',
        ]

    for candidate in candidates:
        print 'Candidate:', candidate
        match = address.search(candidate)
        if match:
            print '  Match name :', match.groupdict()['name']
            print '  Match email:', match.groupdict()['email']
        else:
            print '  No match'

This version of the email address parser uses two tests. If the name group matches, then the look-ahead assertion requires both angle brackets and sets up the brackets group. If name is not matched, the assertion requires that the rest of the text not have angle brackets around it.
Later, if the brackets group is set, the actual pattern-matching code consumes the brackets in the input using literal patterns; otherwise, it consumes any blank space.

    $ python re_id.py

    Candidate: First Last <first.last@example.com>
      Match name : First Last
      Match email: first.last@example.com
    Candidate: No Brackets first.last@example.com
      No match
    Candidate: Open Bracket <first.last@example.com
      No match
    Candidate: no.brackets@example.com
      Match name : None
      Match email: no.brackets@example.com

1.3.10 Modifying Strings with Patterns

In addition to searching through text, re also supports modifying text using regular expressions as the search mechanism, and the replacements can reference groups matched in the regex as part of the substitution text. Use sub() to replace all occurrences of a pattern with another string.

    import re

    bold = re.compile(r'\*{2}(.*?)\*{2}')

    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.sub(r'\1', text)

References to the text matched by the pattern can be inserted using the \num syntax used for back-references.

    $ python re_sub.py

    Text: Make this **bold**. This **too**.
    Bold: Make this bold. This too.

To use named groups in the substitution, use the syntax \g<name>.

    import re

    bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}', re.UNICODE)

    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.sub(r'\g<bold_text>', text)

The \g<name> syntax also works with numbered references, and using it eliminates any ambiguity between group numbers and surrounding literal digits.

    $ python re_sub_named_groups.py

    Text: Make this **bold**. This **too**.
    Bold: Make this bold. This too.

Pass a value to count to limit the number of substitutions performed.

    import re

    bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE)

    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.sub(r'\1', text, count=1)

Only the first substitution is made because count is 1.

    $ python re_sub_count.py

    Text: Make this **bold**. This **too**.
    Bold: Make this bold. This **too**.

subn() works just like sub(), except that it returns both the modified string and the count of substitutions made.

    import re

    bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE)

    text = 'Make this **bold**. This **too**.'

    print 'Text:', text
    print 'Bold:', bold.subn(r'\1', text)

The search pattern matches twice in the example.

    $ python re_subn.py

    Text: Make this **bold**. This **too**.
    Bold: ('Make this bold. This too.', 2)

1.3.11 Splitting with Patterns

str.split() is one of the most frequently used methods for breaking apart strings to parse them. It only supports using literal values as separators, though, and sometimes a regular expression is necessary if the input is not consistently formatted. For example, many plain-text markup languages define paragraph separators as two or more newline (\n) characters. In this case, str.split() cannot be used because of the "or more" part of the definition.

A strategy for identifying paragraphs using findall() would use a pattern like (.+?)\n{2,}.

    import re

    text = '''Paragraph one
    on two lines.

    Paragraph two.


    Paragraph three.'''

    for num, para in enumerate(re.findall(r'(.+?)\n{2,}',
                                          text,
                                          flags=re.DOTALL)
                               ):
        print num, repr(para)
        print

That pattern fails for paragraphs at the end of the input text, as illustrated by the fact that "Paragraph three." is not part of the output.

    $ python re_paragraphs_findall.py

    0 'Paragraph one\non two lines.'

    1 'Paragraph two.'

Extending the pattern to say that a paragraph ends with two or more newlines or the end of input fixes the problem, but makes the pattern more complicated. Converting to re.split() instead of re.findall() handles the boundary condition automatically and keeps the pattern simpler.

    import re

    text = '''Paragraph one
    on two lines.

    Paragraph two.


    Paragraph three.'''

    print 'With findall:'
    for num, para in enumerate(re.findall(r'(.+?)(\n{2,}|$)',
                                          text,
                                          flags=re.DOTALL)):
        print num, repr(para)
        print

    print
    print 'With split:'
    for num, para in enumerate(re.split(r'\n{2,}', text)):
        print num, repr(para)
        print

The pattern argument to split() expresses the markup specification more precisely: Two or more newline characters mark a separator point between paragraphs in the input string.

    $ python re_split.py

    With findall:
    0 ('Paragraph one\non two lines.', '\n\n')

    1 ('Paragraph two.', '\n\n\n')

    2 ('Paragraph three.', '')


    With split:
    0 'Paragraph one\non two lines.'

    1 'Paragraph two.'

    2 'Paragraph three.'

Enclosing the expression in parentheses to define a group causes split() to work more like str.partition(), so it returns the separator values as well as the other parts of the string.

    import re

    text = '''Paragraph one
    on two lines.

    Paragraph two.


    Paragraph three.'''

    print 'With split:'
    for num, para in enumerate(re.split(r'(\n{2,})', text)):
        print num, repr(para)
        print

The output now includes each paragraph, as well as the sequence of newlines separating them.

    $ python re_split_groups.py

    With split:
    0 'Paragraph one\non two lines.'

    1 '\n\n'

    2 'Paragraph two.'

    3 '\n\n\n'

    4 'Paragraph three.'

See Also:

re (http://docs.python.org/library/re.html) The standard library documentation for this module.

Regular Expression HOWTO (http://docs.python.org/howto/regex.html) Andrew Kuchling's introduction to regular expressions for Python developers.

Kodos (http://kodos.sourceforge.net/) An interactive tool for testing regular expressions, created by Phil Schwartz.

Python Regular Expression Testing Tool (http://www.pythonregex.com/) A Web-based tool for testing regular expressions created by David Naffziger at BrandVerity.com and inspired by Kodos.

Regular expression (http://en.wikipedia.org/wiki/Regular_expressions) Wikipedia article that provides a general introduction to regular expression concepts and techniques.

locale (page 909) Use the locale module to set the language configuration when working with Unicode text.

unicodedata (docs.python.org/library/unicodedata.html) Programmatic access to the Unicode character property database.

1.4 difflib—Compare Sequences

Purpose: Compare sequences, especially lines of text.
Python Version: 2.1 and later

The difflib module contains tools for computing and working with differences between sequences. It is especially useful for comparing text and includes functions that produce reports using several common difference formats.

The examples in this section will all use this common test data in the difflib_data.py module:

    text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
    elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
    pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
    pharetra tortor. In nec mauris eget magna consequat
    convallis. Nam sed sem vitae odio pellentesque interdum. Sed
    consequat viverra nisl. Suspendisse arcu metus, blandit quis,
    rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
    molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
    tristique vel, mauris. Curabitur vel lorem id nisl porta
    adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
    tristique enim. Donec quis lectus a justo imperdiet tempus."""

    text1_lines = text1.splitlines()

    text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
    elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
    pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
    pharetra tortor. In nec mauris eget magna consequat
    convallis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
    consequat viverra nisl. Suspendisse arcu metus, blandit quis,
    rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
    molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
    tristique vel, mauris. Curabitur vel lorem id nisl porta
    adipiscing. Duis vulputate tristique enim. Donec quis lectus a
    justo imperdiet tempus. Suspendisse eu lectus. In nunc."""

    text2_lines = text2.splitlines()

1.4.1 Comparing Bodies of Text

The Differ class works on sequences of text lines and produces human-readable deltas, or change instructions, including differences within individual lines. The default output produced by Differ is similar to the diff command line tool under UNIX. It includes the original input values from both lists, including common values, and markup data to indicate what changes were made.

• Lines prefixed with - indicate that they were in the first sequence, but not the second.
• Lines prefixed with + were in the second sequence, but not the first.
• If a line has an incremental difference between versions, an extra line prefixed with ? is used to highlight the change within the new version.
• If a line has not changed, it is printed with an extra blank space on the left column so that it is aligned with the other output, which may have differences.

Breaking up the text into a sequence of individual lines before passing it to compare() produces more readable output than passing it in large strings.

    import difflib
    from difflib_data import *

    d = difflib.Differ()
    diff = d.compare(text1_lines, text2_lines)
    print '\n'.join(diff)

The beginning of both text segments in the sample data is the same, so the first line prints without any extra annotation.

      Lorem ipsum dolor sit amet, consectetuer adipiscing
      elit. Integer eu lacus accumsan arcu fermentum euismod. Donec

The third line of the data changes to include a comma in the modified text. Both versions of the line print, with the extra information on line five showing the column where the text is modified, including the fact that the , character is added.

    - pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
    + pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
    ?         +

The next few lines of the output show that an extra space is removed.

    - pharetra tortor. In nec mauris eget magna consequat
    ? -
    + pharetra tortor. In nec mauris eget magna consequat

Next, a more complex change is made, replacing several words in a phrase.

    - convallis. Nam sed sem vitae odio pellentesque interdum. Sed
    ? - --
    + convallis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
    ? +++ +++++ +

The last sentence in the paragraph is changed significantly, so the difference is represented by removing the old version and adding the new.

      consequat viverra nisl. Suspendisse arcu metus, blandit quis,
      rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
      molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
      tristique vel, mauris. Curabitur vel lorem id nisl porta
    - adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
    - tristique enim. Donec quis lectus a justo imperdiet tempus.
    + adipiscing. Duis vulputate tristique enim. Donec quis lectus a
    + justo imperdiet tempus. Suspendisse eu lectus. In nunc.

The ndiff() function produces essentially the same output. The processing is specifically tailored for working with text data and eliminating noise in the input.

Other Output Formats

While the Differ class shows all input lines, a unified diff includes only modified lines and a bit of context. In Python 2.3, the unified_diff() function was added to produce this sort of output.

    import difflib
    from difflib_data import *

    diff = difflib.unified_diff(text1_lines,
                                text2_lines,
                                lineterm='',
                                )
    print '\n'.join(list(diff))

The lineterm argument is used to tell unified_diff() to skip appending newlines to the control lines it returns because the input lines do not include them. Newlines are added to all lines when they are printed. The output should look familiar to users of subversion or other version control tools.
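The opposite arrangement also works: if the input lines keep their trailing newlines, lineterm can be left at its default. The following is a minimal sketch of that case, using short made-up sample text and the hypothetical file labels before and after rather than the difflib_data module. The header format noted in the comments is what Python 3 produces; older releases may format the header lines slightly differently.

```python
import difflib

# Hypothetical sample data. splitlines(True) keeps the "\n" on every line.
before = 'one\ntwo\nthree\n'.splitlines(True)
after = 'one\n2\nthree\n'.splitlines(True)

# lineterm is left at its default ("\n"), so the control lines end with
# a newline just like the content lines do.
diff_lines = list(difflib.unified_diff(before, after,
                                       fromfile='before',
                                       tofile='after'))

# Every line already carries its own newline, so join with ''.
result = ''.join(diff_lines)
print(result)
```

Which form to use depends mostly on how the input was read: lines taken straight from a file usually still have their newlines, while the result of splitlines() with no argument does not.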

    $ python difflib_unified.py

    ---
    +++
    @@ -1,11 +1,11 @@
     Lorem ipsum dolor sit amet, consectetuer adipiscing
     elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
    -pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
    -pharetra tortor. In nec mauris eget magna consequat
    -convallis. Nam sed sem vitae odio pellentesque interdum. Sed
    +pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
    +pharetra tortor. In nec mauris eget magna consequat
    +convallis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
     consequat viverra nisl. Suspendisse arcu metus, blandit quis,
     rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
     molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
     tristique vel, mauris. Curabitur vel lorem id nisl porta
    -adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
    -tristique enim. Donec quis lectus a justo imperdiet tempus.
    +adipiscing. Duis vulputate tristique enim. Donec quis lectus a
    +justo imperdiet tempus. Suspendisse eu lectus. In nunc.

Using context_diff() produces similar readable output.

1.4.2 Junk Data

All functions that produce difference sequences accept arguments to indicate which lines should be ignored and which characters within a line should be ignored. These parameters can be used to skip over markup or whitespace changes in two versions of a file, for example.

    # This example is adapted from the source for difflib.py.

    from difflib import SequenceMatcher

    def show_results(s):
        i, j, k = s.find_longest_match(0, 5, 0, 9)
        print '  i = %d' % i
        print '  j = %d' % j
        print '  k = %d' % k
        print '  A[i:i+k] = %r' % A[i:i+k]
        print '  B[j:j+k] = %r' % B[j:j+k]

    A = " abcd"
    B = "abcd abcd"

    print 'A = %r' % A
    print 'B = %r' % B

    print '\nWithout junk detection:'
    show_results(SequenceMatcher(None, A, B))

    print '\nTreat spaces as junk:'
    show_results(SequenceMatcher(lambda x: x==" ", A, B))

The default for Differ is to not ignore any lines or characters explicitly, but to rely on the ability of SequenceMatcher to detect noise. The default for ndiff() is to ignore space and tab characters.

    $ python difflib_junk.py

    A = ' abcd'
    B = 'abcd abcd'

    Without junk detection:
      i = 0
      j = 4
      k = 5
      A[i:i+k] = ' abcd'
      B[j:j+k] = ' abcd'

    Treat spaces as junk:
      i = 1
      j = 0
      k = 4
      A[i:i+k] = 'abcd'
      B[j:j+k] = 'abcd'

1.4.3 Comparing Arbitrary Types

The SequenceMatcher class compares two sequences of any type, as long as the values are hashable. It uses an algorithm to identify the longest contiguous matching blocks from the sequences, eliminating junk values that do not contribute to the real data.

    import difflib
    from difflib_data import *

    s1 = [ 1, 2, 3, 5, 6, 4 ]
    s2 = [ 2, 3, 5, 4, 6, 1 ]

    print 'Initial data:'
    print 's1 =', s1
    print 's2 =', s2
    print 's1 == s2:', s1==s2
    print

    matcher = difflib.SequenceMatcher(None, s1, s2)
    for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):

        if tag == 'delete':
            print 'Remove %s from positions [%d:%d]' % \
                (s1[i1:i2], i1, i2)
            del s1[i1:i2]

        elif tag == 'equal':
            print 's1[%d:%d] and s2[%d:%d] are the same' % \
                (i1, i2, j1, j2)

        elif tag == 'insert':
            print 'Insert %s from s2[%d:%d] into s1 at %d' % \
                (s2[j1:j2], j1, j2, i1)
            s1[i1:i2] = s2[j1:j2]

        elif tag == 'replace':
            print 'Replace %s from s1[%d:%d] with %s from s2[%d:%d]' % (
                s1[i1:i2], i1, i2, s2[j1:j2], j1, j2)
            s1[i1:i2] = s2[j1:j2]

        print '  s1 =', s1

    print 's1 == s2:', s1==s2

This example compares two lists of integers and uses get_opcodes() to derive the instructions for converting the original list into the newer version. The modifications are applied in reverse order so that the list indexes remain accurate after items are added and removed.

    $ python difflib_seq.py

    Initial data:
    s1 = [1, 2, 3, 5, 6, 4]
    s2 = [2, 3, 5, 4, 6, 1]
    s1 == s2: False

    Replace [4] from s1[5:6] with [1] from s2[5:6]
      s1 = [1, 2, 3, 5, 6, 1]
    s1[4:5] and s2[4:5] are the same
      s1 = [1, 2, 3, 5, 6, 1]
    Insert [4] from s2[3:4] into s1 at 4
      s1 = [1, 2, 3, 5, 4, 6, 1]
    s1[1:4] and s2[0:3] are the same
      s1 = [1, 2, 3, 5, 4, 6, 1]
    Remove [1] from positions [0:1]
      s1 = [2, 3, 5, 4, 6, 1]
    s1 == s2: True

SequenceMatcher works with custom classes, as well as built-in types, as long as they are hashable.

See Also:

difflib (http://docs.python.org/library/difflib.html) The standard library documentation for this module.

Pattern Matching: The Gestalt Approach (http://www.ddj.com/documents/s=1103/ddj8807c/) Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener, published in Dr. Dobb's Journal in July 1988.

Chapter 2

DATA STRUCTURES

Python includes several standard programming data structures, such as list, tuple, dict, and set, as part of its built-in types. Many applications do not require other structures, but when they do, the standard library provides powerful and well-tested versions that are ready to use.

The collections module includes implementations of several data structures that extend those found in other modules.
For example, deque is a double-ended queue that allows the addition or removal of items from either end. The defaultdict is a dictionary that responds with a default value if a key is missing, while OrderedDict remembers the sequence in which items are added to it. And namedtuple extends the normal tuple to give each member item an attribute name in addition to a numeric index.

For large amounts of data, an array may make more efficient use of memory than a list. Since the array is limited to a single data type, it can use a more compact memory representation than a general purpose list. At the same time, arrays can be manipulated using many of the same methods as a list, so it may be possible to replace lists with arrays in an application without a lot of other changes.

Sorting items in a sequence is a fundamental aspect of data manipulation. Python's list includes a sort() method, but sometimes it is more efficient to maintain a list in sorted order without re-sorting it each time its contents are changed. The functions in heapq modify the contents of a list while preserving the sort order of the list with low overhead.

Another option for building sorted lists or arrays is bisect. It uses a binary search to find the insertion point for new items and is an alternative to repeatedly sorting a list that changes frequently.

Although the built-in list can simulate a queue using the insert() and pop() methods, it is not thread-safe. For true ordered communication between threads, use the Queue module. multiprocessing includes a version of a Queue that works between processes, making it easier to convert a multithreaded program to use processes instead.

struct is useful for decoding data from another application, perhaps coming from a binary file or stream of data, into Python's native types for easier manipulation.

This chapter covers two modules related to memory management.
For highly interconnected data structures, such as graphs and trees, use weakref to maintain references while still allowing the garbage collector to clean up objects after they are no longer needed. The functions in copy are used for duplicating data structures and their contents, including recursive copies with deepcopy().

Debugging data structures can be time consuming, especially when wading through printed output of large sequences or dictionaries. Use pprint to create easy-to-read representations that can be printed to the console or written to a log file for easier debugging.

And, finally, if the available types do not meet the requirements, subclass one of the native types and customize it, or build a new container type using one of the abstract base classes defined in collections as a starting point.

2.1 collections—Container Data Types

Purpose: Container data types.
Python Version: 2.4 and later

The collections module includes container data types beyond the built-in types list, dict, and tuple.

2.1.1 Counter

A Counter is a container that tracks how many times equivalent values are added. It can be used to implement the same algorithms for which other languages commonly use bag or multiset data structures.

Initializing

Counter supports three forms of initialization. Its constructor can be called with a sequence of items, a dictionary containing keys and counts, or using keyword arguments mapping string names to counts.

    import collections

    print collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
    print collections.Counter({'a':2, 'b':3, 'c':1})
    print collections.Counter(a=2, b=3, c=1)

The results of all three forms of initialization are the same.

    $ python collections_counter_init.py

    Counter({'b': 3, 'a': 2, 'c': 1})
    Counter({'b': 3, 'a': 2, 'c': 1})
    Counter({'b': 3, 'a': 2, 'c': 1})

An empty Counter can be constructed with no arguments and populated via the update() method.

    import collections

    c = collections.Counter()
    print 'Initial :', c

    c.update('abcdaab')
    print 'Sequence:', c

    c.update({'a':1, 'd':5})
    print 'Dict    :', c

The count values are increased based on the new data, rather than replaced. In this example, the count for a goes from 3 to 4.

    $ python collections_counter_update.py

    Initial : Counter()
    Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
    Dict    : Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})

Accessing Counts

Once a Counter is populated, its values can be retrieved using the dictionary API.

    import collections

    c = collections.Counter('abcdaab')

    for letter in 'abcde':
        print '%s : %d' % (letter, c[letter])

Counter does not raise KeyError for unknown items. If a value has not been seen in the input (as with e in this example), its count is 0.

    $ python collections_counter_get_values.py

    a : 3
    b : 2
    c : 1
    d : 1
    e : 0

The elements() method returns an iterator that produces all items known to the Counter.

    import collections

    c = collections.Counter('extremely')
    c['z'] = 0
    print c
    print list(c.elements())

The order of elements is not guaranteed, and items with counts less than or equal to zero are not included.

    $ python collections_counter_elements.py

    Counter({'e': 3, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1, 'z': 0})
    ['e', 'e', 'e', 'm', 'l', 'r', 't', 'y', 'x']

Use most_common() to produce a sequence of the n most frequently encountered input values and their respective counts.

    import collections

    c = collections.Counter()
    with open('/usr/share/dict/words', 'rt') as f:
        for line in f:
            c.update(line.rstrip().lower())

    print 'Most common:'
    for letter, count in c.most_common(3):
        print '%s: %7d' % (letter, count)

This example counts the letters appearing in all words in the system dictionary to produce a frequency distribution, and then prints the three most common letters. Leaving out the argument to most_common() produces a list of all the items, in order of frequency.
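The difference between the two call forms is easier to see with a small input that does not depend on a system dictionary file. This is a minimal sketch using the made-up word 'banana', chosen so that no two letters share a count. It uses single-argument print() calls so that it should run on Python 2.7 as well as Python 3.

```python
import collections

c = collections.Counter('banana')

# With an argument, only the n most common items are returned.
top_two = c.most_common(2)
print(top_two)

# Without an argument, every item is returned, most frequent first.
everything = c.most_common()
print(everything)
```

When several items have the same count, their relative order in the result should not be relied on.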

    $ python collections_counter_most_common.py

    Most common:
    e:  234803
    i:  200613
    a:  198938

Arithmetic

Counter instances support arithmetic and set operations for aggregating results.

    import collections

    c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
    c2 = collections.Counter('alphabet')

    print 'C1:', c1
    print 'C2:', c2

    print '\nCombined counts:'
    print c1 + c2

    print '\nSubtraction:'
    print c1 - c2

    print '\nIntersection (taking positive minimums):'
    print c1 & c2

    print '\nUnion (taking maximums):'
    print c1 | c2

Each time a new Counter is produced through an operation, any items with zero or negative counts are discarded. The count for a is the same in c1 and c2, so subtraction leaves it at zero.

    $ python collections_counter_arithmetic.py

    C1: Counter({'b': 3, 'a': 2, 'c': 1})
    C2: Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})

    Combined counts:
    Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})

    Subtraction:
    Counter({'b': 2, 'c': 1})

    Intersection (taking positive minimums):
    Counter({'a': 2, 'b': 1})

    Union (taking maximums):
    Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})

2.1.2 defaultdict

The standard dictionary includes the method setdefault() for retrieving a value and establishing a default if the value does not exist. By contrast, defaultdict lets the caller specify the default up front when the container is initialized.

    import collections

    def default_factory():
        return 'default value'

    d = collections.defaultdict(default_factory, foo='bar')
    print 'd:', d
    print 'foo =>', d['foo']
    print 'bar =>', d['bar']

This method works well, as long as it is appropriate for all keys to have the same default. It can be especially useful if the default is a type used for aggregating or accumulating values, such as a list, set, or even int. The standard library documentation includes several examples of using defaultdict this way.
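As an illustration of the aggregating case, the following sketch groups and counts some made-up (kind, name) pairs, using list and int as the default factories. The data and variable names are hypothetical, and this is only one of several ways to use the pattern.

```python
import collections

# Hypothetical input: (kind, name) pairs in arbitrary order.
pairs = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'pear')]

# list as the factory: a missing key starts out as an empty list.
by_kind = collections.defaultdict(list)
for kind, name in pairs:
    by_kind[kind].append(name)

# int as the factory: a missing key starts out as 0.
totals = collections.defaultdict(int)
for kind, name in pairs:
    totals[kind] += 1

print(dict(by_kind))
print(dict(totals))
```

Neither loop needs to check whether the key already exists, which is the main convenience over a plain dict with setdefault().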

    $ python collections_defaultdict.py

    d: defaultdict(<function default_factory at 0x...>, {'foo': 'bar'})
    foo => bar
    bar => default value

See Also:

defaultdict examples (http://docs.python.org/lib/defaultdict-examples.html) Examples of using defaultdict from the standard library documentation.

Evolution of Default Dictionaries in Python (http://jtauber.com/blog/2008/02/27/evolution_of_default_dictionaries_in_python/) Discussion from James Tauber of how defaultdict relates to other means of initializing dictionaries.

2.1.3 Deque

A double-ended queue, or deque, supports adding and removing elements from either end. The more commonly used stacks and queues are degenerate forms of deques, where the inputs and outputs are restricted to a single end.

    import collections

    d = collections.deque('abcdefg')
    print 'Deque:', d
    print 'Length:', len(d)
    print 'Left end:', d[0]
    print 'Right end:', d[-1]

    d.remove('c')
    print 'remove(c):', d

Since deques are a type of sequence container, they support some of the same operations as list, such as examining the contents with __getitem__(), determining length, and removing elements from the middle by matching identity.

    $ python collections_deque.py

    Deque: deque(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
    Length: 7
    Left end: a
    Right end: g
    remove(c): deque(['a', 'b', 'd', 'e', 'f', 'g'])

Populating

A deque can be populated from either end, termed "left" and "right" in the Python implementation.

    import collections

    # Add to the right
    d1 = collections.deque()
    d1.extend('abcdefg')
    print 'extend    :', d1
    d1.append('h')
    print 'append    :', d1

    # Add to the left
    d2 = collections.deque()
    d2.extendleft(xrange(6))
    print 'extendleft:', d2
    d2.appendleft(6)
    print 'appendleft:', d2

The extendleft() function iterates over its input and performs the equivalent of an appendleft() for each item. The end result is that the deque contains the input sequence in reverse order.
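The equivalence described above can be checked directly. This short sketch, with made-up values, compares extendleft() to an explicit loop of appendleft() calls and also shows one way to keep the input order when adding to the left, by reversing the input first.

```python
import collections

items = [0, 1, 2, 3]

d1 = collections.deque()
d1.extendleft(items)

# Equivalent loop: each item is pushed onto the left end in turn.
d2 = collections.deque()
for item in items:
    d2.appendleft(item)

# To keep the original order on the left, reverse the input first.
d3 = collections.deque([9])
d3.extendleft(reversed(items))

print(d1)
print(d2)
print(d3)
```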

    $ python collections_deque_populating.py

    extend    : deque(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
    append    : deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    extendleft: deque([5, 4, 3, 2, 1, 0])
    appendleft: deque([6, 5, 4, 3, 2, 1, 0])

Consuming

Similarly, the elements of the deque can be consumed from both ends or either end, depending on the algorithm being applied.

    import collections

    print 'From the right:'
    d = collections.deque('abcdefg')
    while True:
        try:
            print d.pop(),
        except IndexError:
            break
    print

    print '\nFrom the left:'
    d = collections.deque(xrange(6))
    while True:
        try:
            print d.popleft(),
        except IndexError:
            break
    print

Use pop() to remove an item from the right end of the deque and popleft() to take from the left end.

    $ python collections_deque_consuming.py

    From the right:
    g f e d c b a

    From the left:
    0 1 2 3 4 5

Since deques are thread-safe, the contents can even be consumed from both ends at the same time from separate threads.

    import collections
    import threading
    import time

    candle = collections.deque(xrange(5))

    def burn(direction, nextSource):
        while True:
            try:
                next = nextSource()
            except IndexError:
                break
            else:
                print '%8s: %s' % (direction, next)
                time.sleep(0.1)
        print '%8s done' % direction
        return

    left = threading.Thread(target=burn, args=('Left', candle.popleft))
    right = threading.Thread(target=burn, args=('Right', candle.pop))

    left.start()
    right.start()

    left.join()
    right.join()

The threads in this example alternate between each end, removing items until the deque is empty.

    $ python collections_deque_both_ends.py

        Left: 0
       Right: 4
       Right: 3
        Left: 1
       Right: 2
        Left done
       Right done

Rotating

Another useful capability of the deque is to rotate it in either direction, to skip over some items.

    import collections

    d = collections.deque(xrange(10))
    print 'Normal        :', d

    d = collections.deque(xrange(10))
    d.rotate(2)
    print 'Right rotation:', d

    d = collections.deque(xrange(10))
    d.rotate(-2)
    print 'Left rotation :', d

Rotating the deque to the right (using a positive rotation) takes items from the right end and moves them to the left end. Rotating to the left (with a negative value) takes items from the left end and moves them to the right end. It may help to visualize the items in the deque as being engraved along the edge of a dial.

    $ python collections_deque_rotate.py

    Normal        : deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    Right rotation: deque([8, 9, 0, 1, 2, 3, 4, 5, 6, 7])
    Left rotation : deque([2, 3, 4, 5, 6, 7, 8, 9, 0, 1])

See Also:

Deque (http://en.wikipedia.org/wiki/Deque) Wikipedia article that provides a discussion of the deque data structure.

Deque Recipes (http://docs.python.org/lib/deque-recipes.html) Examples of using deques in algorithms from the standard library documentation.

2.1.4 namedtuple

The standard tuple uses numerical indexes to access its members.

    bob = ('Bob', 30, 'male')
    print 'Representation:', bob

    jane = ('Jane', 29, 'female')
    print '\nField by index:', jane[0]

    print '\nFields by index:'
    for p in [ bob, jane ]:
        print '%s is a %d year old %s' % p

This makes tuples convenient containers for simple uses.

    $ python collections_tuple.py

    Representation: ('Bob', 30, 'male')

    Field by index: Jane

    Fields by index:
    Bob is a 30 year old male
    Jane is a 29 year old female

On the other hand, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used. A namedtuple assigns names, as well as the numerical index, to each member.

Defining

namedtuple instances are just as memory efficient as regular tuples because they do not have per-instance dictionaries.
Each kind of namedtuple is represented by its own class, created by using the namedtuple() factory function. The arguments are the name of the new class and a string containing the names of the elements.

    import collections

    Person = collections.namedtuple('Person', 'name age gender')

    print 'Type of Person:', type(Person)

    bob = Person(name='Bob', age=30, gender='male')
    print '\nRepresentation:', bob

    jane = Person(name='Jane', age=29, gender='female')
    print '\nField by name:', jane.name

    print '\nFields by index:'
    for p in [ bob, jane ]:
        print '%s is a %d year old %s' % p

As the example illustrates, it is possible to access the fields of the namedtuple by name using dotted notation (obj.attr) as well as using the positional indexes of standard tuples.

    $ python collections_namedtuple_person.py

    Type of Person: <type 'type'>

    Representation: Person(name='Bob', age=30, gender='male')

    Field by name: Jane

    Fields by index:
    Bob is a 30 year old male
    Jane is a 29 year old female

Invalid Field Names

Field names are invalid if they are repeated or conflict with Python keywords.

    import collections

    try:
        collections.namedtuple('Person', 'name class age gender')
    except ValueError, err:
        print err

    try:
        collections.namedtuple('Person', 'name age gender age')
    except ValueError, err:
        print err

As the field names are parsed, invalid values cause ValueError exceptions.

    $ python collections_namedtuple_bad_fields.py

    Type names and field names cannot be a keyword: 'class'
    Encountered duplicate field name: 'age'

If a namedtuple is being created based on values outside of the control of the program (such as to represent the rows returned by a database query, where the schema is not known in advance), set the rename option to True so the invalid fields are renamed.
    import collections

    with_class = collections.namedtuple(
        'Person', 'name class age gender',
        rename=True)
    print with_class._fields

    two_ages = collections.namedtuple(
        'Person', 'name age gender age',
        rename=True)
    print two_ages._fields

The new names for renamed fields depend on their index in the tuple, so the field with name class becomes _1 and the duplicate age field is changed to _3.

    $ python collections_namedtuple_rename.py

    ('name', '_1', 'age', 'gender')
    ('name', 'age', 'gender', '_3')

2.1.5 OrderedDict

An OrderedDict is a dictionary subclass that remembers the order in which its contents are added.

    import collections

    print 'Regular dictionary:'
    d = {}
    d['a'] = 'A'
    d['b'] = 'B'
    d['c'] = 'C'

    for k, v in d.items():
        print k, v

    print '\nOrderedDict:'
    d = collections.OrderedDict()
    d['a'] = 'A'
    d['b'] = 'B'
    d['c'] = 'C'

    for k, v in d.items():
        print k, v

A regular dict does not track the insertion order, and iterating over it produces the values in order based on how the keys are stored in the hash table. In an OrderedDict, by contrast, the order in which the items are inserted is remembered and used when creating an iterator.

    $ python collections_ordereddict_iter.py

    Regular dictionary:
    a A
    c C
    b B

    OrderedDict:
    a A
    b B
    c C

Equality

A regular dict looks at its contents when testing for equality. An OrderedDict also considers the order the items were added.

    import collections

    print 'dict :',
    d1 = {}
    d1['a'] = 'A'
    d1['b'] = 'B'
    d1['c'] = 'C'

    d2 = {}
    d2['c'] = 'C'
    d2['b'] = 'B'
    d2['a'] = 'A'

    print d1 == d2

    print 'OrderedDict:',

    d1 = collections.OrderedDict()
    d1['a'] = 'A'
    d1['b'] = 'B'
    d1['c'] = 'C'

    d2 = collections.OrderedDict()
    d2['c'] = 'C'
    d2['b'] = 'B'
    d2['a'] = 'A'

    print d1 == d2

In this case, since the two ordered dictionaries are created from values in a different order, they are considered to be different.
    $ python collections_ordereddict_equality.py

    dict : True
    OrderedDict: False

See Also:

collections (http://docs.python.org/library/collections.html) The standard library documentation for this module.

2.2 array—Sequence of Fixed-Type Data

Purpose: Manage sequences of fixed-type numerical data efficiently.
Python Version: 1.4 and later

The array module defines a sequence data structure that looks very much like a list, except that all members have to be of the same primitive type. Refer to the standard library documentation for array for a complete list of the types supported.

2.2.1 Initialization

An array is instantiated with an argument describing the type of data to be allowed, and possibly an initial sequence of data to store in the array.

    import array
    import binascii

    s = 'This is the array.'
    a = array.array('c', s)

    print 'As string:', s
    print 'As array :', a
    print 'As hex :', binascii.hexlify(a)

In this example, the array is configured to hold a sequence of bytes and is initialized with a simple string.

    $ python array_string.py

    As string: This is the array.
    As array : array('c', 'This is the array.')
    As hex : 54686973206973207468652061727261792e

2.2.2 Manipulating Arrays

An array can be extended and otherwise manipulated in the same ways as other Python sequences.

    import array
    import pprint

    a = array.array('i', xrange(3))
    print 'Initial :', a

    a.extend(xrange(3))
    print 'Extended:', a

    print 'Slice :', a[2:5]

    print 'Iterator:'
    print list(enumerate(a))

The supported operations include slicing, iterating, and adding elements to the end.

    $ python array_sequence.py

    Initial : array('i', [0, 1, 2])
    Extended: array('i', [0, 1, 2, 0, 1, 2])
    Slice : array('i', [2, 0, 1])
    Iterator:
    [(0, 0), (1, 1), (2, 2), (3, 0), (4, 1), (5, 2)]

2.2.3 Arrays and Files

The contents of an array can be written to and read from files using built-in methods coded efficiently for that purpose.
    import array
    import binascii
    import tempfile

    a = array.array('i', xrange(5))
    print 'A1:', a

    # Write the array of numbers to a temporary file
    output = tempfile.NamedTemporaryFile()
    a.tofile(output.file) # must pass an *actual* file
    output.flush()

    # Read the raw data
    with open(output.name, 'rb') as input:
        raw_data = input.read()
        print 'Raw Contents:', binascii.hexlify(raw_data)

        # Read the data into an array
        input.seek(0)
        a2 = array.array('i')
        a2.fromfile(input, len(a))
        print 'A2:', a2

This example illustrates reading the data raw, directly from the binary file, versus reading it into a new array and converting the bytes to the appropriate types.

    $ python array_file.py

    A1: array('i', [0, 1, 2, 3, 4])
    Raw Contents: 0000000001000000020000000300000004000000
    A2: array('i', [0, 1, 2, 3, 4])

2.2.4 Alternate Byte Ordering

If the data in the array is not in the native byte order, or needs to be swapped before being sent to a system with a different byte order (or over the network), it is possible to convert the entire array without iterating over the elements from Python.

    import array
    import binascii

    def to_hex(a):
        chars_per_item = a.itemsize * 2 # 2 hex digits
        hex_version = binascii.hexlify(a)
        num_chunks = len(hex_version) / chars_per_item
        for i in xrange(num_chunks):
            start = i*chars_per_item
            end = start + chars_per_item
            yield hex_version[start:end]

    a1 = array.array('i', xrange(5))
    a2 = array.array('i', xrange(5))
    a2.byteswap()

    fmt = '%10s %10s %10s %10s'
    print fmt % ('A1 hex', 'A1', 'A2 hex', 'A2')
    print fmt % (('-' * 10,) * 4)
    for values in zip(to_hex(a1), a1, to_hex(a2), a2):
        print fmt % values

The byteswap() method switches the byte order of the items in the array from within C, so it is much more efficient than looping over the data in Python.
    $ python array_byteswap.py

        A1 hex         A1     A2 hex         A2
    ---------- ---------- ---------- ----------
      00000000          0   00000000          0
      01000000          1   00000001   16777216
      02000000          2   00000002   33554432
      03000000          3   00000003   50331648
      04000000          4   00000004   67108864

See Also:

array (http://docs.python.org/library/array.html) The standard library documentation for this module.
struct (page 102) The struct module.
Numerical Python (www.scipy.org) NumPy is a Python library for working with large data sets efficiently.

2.3 heapq—Heap Sort Algorithm

Purpose: The heapq module implements a min-heap sort algorithm suitable for use with Python's lists.
Python Version: New in 2.3 with additions in 2.5

A heap is a tree-like data structure where the child nodes have a sort-order relationship with the parents. Binary heaps can be represented using a list or an array organized so that the children of element N are at positions 2*N+1 and 2*N+2 (for zero-based indexes). This layout makes it possible to rearrange heaps in place, so it is not necessary to reallocate as much memory when adding or removing items.

A max-heap ensures that the parent is larger than or equal to both of its children. A min-heap requires that the parent be less than or equal to its children. Python's heapq module implements a min-heap.

2.3.1 Example Data

The examples in this section use the data in heapq_heapdata.py.

    # This data was generated with the random module.

    data = [19, 9, 4, 10, 11]

The heap output is printed using heapq_showtree.py.

    import math
    from cStringIO import StringIO

    def show_tree(tree, total_width=36, fill=' '):
        """Pretty-print a tree."""
        output = StringIO()
        last_row = -1
        for i, n in enumerate(tree):
            if i:
                row = int(math.floor(math.log(i+1, 2)))
            else:
                row = 0
            if row != last_row:
                output.write('\n')
            columns = 2**row
            col_width = int(math.floor((total_width * 1.0) / columns))
            output.write(str(n).center(col_width, fill))
            last_row = row
        print output.getvalue()
        print '-' * total_width
        print
        return
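The list layout described at the start of this section (children of element N at positions 2*N+1 and 2*N+2) can be checked directly. The following is a minimal sketch rather than part of the heapq API: is_min_heap() is a hypothetical helper introduced only for illustration, and the code is written for Python 3 instead of the Python 2 used in the listings.

```python
import heapq


def is_min_heap(items):
    """Return True if items satisfies the min-heap invariant.

    Hypothetical helper for illustration: the children of the
    element at index n live at indexes 2*n+1 and 2*n+2.
    """
    for n in range(len(items)):
        for child in (2 * n + 1, 2 * n + 2):
            if child < len(items) and items[n] > items[child]:
                return False
    return True


data = [19, 9, 4, 10, 11]
print(is_min_heap(data))      # the raw data is not in heap order

heap = list(data)
heapq.heapify(heap)
print(is_min_heap(heap))      # heapify() establishes the invariant
print(heap[0] == min(data))   # the smallest item sits at index 0
```

The check only compares each parent with its own children, which is why many different list arrangements can all be valid heaps for the same data.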
2.3.2 Creating a Heap

There are two basic ways to create a heap: heappush() and heapify().

    import heapq
    from heapq_showtree import show_tree
    from heapq_heapdata import data

    heap = []
    print 'random :', data
    print

    for n in data:
        print 'add %3d:' % n
        heapq.heappush(heap, n)
        show_tree(heap)

Using heappush(), the heap sort order of the elements is maintained as new items are added from a data source.

    $ python heapq_heappush.py

    random : [19, 9, 4, 10, 11]

    add  19:

                     19
    ------------------------------------

    add   9:

                     9
            19
    ------------------------------------

    add   4:

                     4
            19                9
    ------------------------------------

    add  10:

                     4
            10                9
        19
    ------------------------------------

    add  11:

                     4
            10                9
        19       11
    ------------------------------------

If the data is already in memory, it is more efficient to use heapify() to rearrange the items of the list in place.

    import heapq
    from heapq_showtree import show_tree
    from heapq_heapdata import data

    print 'random :', data
    heapq.heapify(data)
    print 'heapified :'
    show_tree(data)

Building a list in heap order one item at a time and building it unordered and then calling heapify() both produce a valid heap, although, as the two outputs show, the items are not necessarily arranged in the same positions.

    $ python heapq_heapify.py

    random : [19, 9, 4, 10, 11]
    heapified :

                     4
            9                 19
        10       11
    ------------------------------------

2.3.3 Accessing Contents of a Heap

Once the heap is organized correctly, use heappop() to remove the element with the lowest value.

    import heapq
    from heapq_showtree import show_tree
    from heapq_heapdata import data

    print 'random :', data
    heapq.heapify(data)
    print 'heapified :'
    show_tree(data)
    print

    for i in xrange(2):
        smallest = heapq.heappop(data)
        print 'pop %3d:' % smallest
        show_tree(data)

In this example, adapted from the stdlib documentation, heapify() and heappop() are used to sort a list of numbers.
    $ python heapq_heappop.py

    random : [19, 9, 4, 10, 11]
    heapified :

                     4
            9                 19
        10       11
    ------------------------------------


    pop   4:

                     9
            10                19
        11
    ------------------------------------

    pop   9:

                     10
            11                19
    ------------------------------------

To remove existing elements and replace them with new values in a single operation, use heapreplace().

    import heapq
    from heapq_showtree import show_tree
    from heapq_heapdata import data

    heapq.heapify(data)
    print 'start:'
    show_tree(data)

    for n in [0, 13]:
        smallest = heapq.heapreplace(data, n)
        print 'replace %2d with %2d:' % (smallest, n)
        show_tree(data)

Replacing elements in place makes it possible to maintain a fixed-size heap, such as a queue of jobs ordered by priority.

    $ python heapq_heapreplace.py

    start:

                     4
            9                 19
        10       11
    ------------------------------------

    replace  4 with  0:

                     0
            9                 19
        10       11
    ------------------------------------

    replace  0 with 13:

                     9
            10                19
        13       11
    ------------------------------------

2.3.4 Data Extremes from a Heap

heapq also includes two functions to examine an iterable to find a range of the largest or smallest values it contains.

    import heapq
    from heapq_heapdata import data

    print 'all :', data
    print '3 largest :', heapq.nlargest(3, data)
    print 'from sort :', list(reversed(sorted(data)[-3:]))
    print '3 smallest:', heapq.nsmallest(3, data)
    print 'from sort :', sorted(data)[:3]

Using nlargest() and nsmallest() is only efficient for relatively small values of n > 1, but can still come in handy in a few cases.

    $ python heapq_extremes.py

    all : [19, 9, 4, 10, 11]
    3 largest : [19, 11, 10]
    from sort : [19, 11, 10]
    3 smallest: [4, 9, 10]
    from sort : [4, 9, 10]

See Also:

heapq (http://docs.python.org/library/heapq.html) The standard library documentation for this module.
Heap (data structure) (http://en.wikipedia.org/wiki/Heap_(data_structure)) Wikipedia article that provides a general description of heap data structures.
Priority Queue (page 98) A priority queue implementation from Queue (page 96) in the standard library.

2.4 bisect—Maintain Lists in Sorted Order

Purpose: Maintains a list in sorted order without having to call sort each time an item is added to the list.
Python Version: 1.4 and later

The bisect module implements an algorithm for inserting elements into a list while maintaining the list in sorted order. For some cases, this is more efficient than repeatedly sorting a list or explicitly sorting a large list after it is constructed.

2.4.1 Inserting in Sorted Order

Here is a simple example using insort() to insert items into a list in sorted order.

    import bisect
    import random

    # Use a constant seed to ensure that
    # the same pseudo-random numbers
    # are used each time the loop is run.
    random.seed(1)

    print 'New Pos Contents'
    print '--- --- --------'

    # Generate random numbers and
    # insert them into a list in sorted
    # order.
    l = []
    for i in range(1, 15):
        r = random.randint(1, 100)
        position = bisect.bisect(l, r)
        bisect.insort(l, r)
        print '%3d %3d' % (r, position), l

The first column of the output shows the new random number. The second column shows the position where the number will be inserted into the list. The remainder of each line is the current sorted list.

    $ python bisect_example.py

    New Pos Contents
    --- --- --------
     14   0 [14]
     85   1 [14, 85]
     77   1 [14, 77, 85]
     26   1 [14, 26, 77, 85]
     50   2 [14, 26, 50, 77, 85]
     45   2 [14, 26, 45, 50, 77, 85]
     66   4 [14, 26, 45, 50, 66, 77, 85]
     79   6 [14, 26, 45, 50, 66, 77, 79, 85]
     10   0 [10, 14, 26, 45, 50, 66, 77, 79, 85]
      3   0 [3, 10, 14, 26, 45, 50, 66, 77, 79, 85]
     84   9 [3, 10, 14, 26, 45, 50, 66, 77, 79, 84, 85]
     44   4 [3, 10, 14, 26, 44, 45, 50, 66, 77, 79, 84, 85]
     77   9 [3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]
      1   0 [1, 3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]
This is a simple example, and for the amount of data being manipulated, it might be faster to simply build the list and then sort it once. But for long lists, significant time and memory savings can be achieved using an insertion sort algorithm such as this one.

2.4.2 Handling Duplicates

The result set shown previously includes a repeated value, 77. The bisect module provides two ways to handle repeats. New values can be inserted to the left of existing values or to the right. The insort() function is actually an alias for insort_right(), which inserts after the existing value. The corresponding function insort_left() inserts before the existing value.

    import bisect
    import random

    # Reset the seed
    random.seed(1)

    print 'New Pos Contents'
    print '--- --- --------'

    # Use bisect_left and insort_left.
    l = []
    for i in range(1, 15):
        r = random.randint(1, 100)
        position = bisect.bisect_left(l, r)
        bisect.insort_left(l, r)
        print '%3d %3d' % (r, position), l

When the same data is manipulated using bisect_left() and insort_left(), the results are the same sorted list, but the insert positions are different for the duplicate values.

    $ python bisect_example2.py

    New Pos Contents
    --- --- --------
     14   0 [14]
     85   1 [14, 85]
     77   1 [14, 77, 85]
     26   1 [14, 26, 77, 85]
     50   2 [14, 26, 50, 77, 85]
     45   2 [14, 26, 45, 50, 77, 85]
     66   4 [14, 26, 45, 50, 66, 77, 85]
     79   6 [14, 26, 45, 50, 66, 77, 79, 85]
     10   0 [10, 14, 26, 45, 50, 66, 77, 79, 85]
      3   0 [3, 10, 14, 26, 45, 50, 66, 77, 79, 85]
     84   9 [3, 10, 14, 26, 45, 50, 66, 77, 79, 84, 85]
     44   4 [3, 10, 14, 26, 44, 45, 50, 66, 77, 79, 84, 85]
     77   8 [3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]
      1   0 [1, 3, 10, 14, 26, 44, 45, 50, 66, 77, 77, 79, 84, 85]

In addition to the Python implementation, a faster C implementation is available. If the C version is present, that implementation automatically overrides the pure Python implementation when bisect is imported.
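Because two equal integers look identical once they are in the list, the difference between the left and right variants is easier to see with values that compare equal but print differently. The sketch below is written for Python 3 (the listings above use Python 2) and relies on the fact that 77 == 77.0; the sample list is made up for illustration.

```python
import bisect

values = [3, 10, 77, 85]

# bisect_left() points at the existing 77, bisect_right() just past it.
print(bisect.bisect_left(values, 77))    # 2
print(bisect.bisect_right(values, 77))   # 3

# 77.0 compares equal to 77, so the two insort variants place it on
# opposite sides of the existing value.
a = list(values)
b = list(values)
bisect.insort_left(a, 77.0)
bisect.insort_right(b, 77.0)
print(a)   # [3, 10, 77.0, 77, 85]
print(b)   # [3, 10, 77, 77.0, 85]
```

The choice matters mainly when the items carry more information than the value used for ordering, for example records that should keep their arrival order among equals.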
See Also:

bisect (http://docs.python.org/library/bisect.html) The standard library documentation for this module.
Insertion Sort (http://en.wikipedia.org/wiki/Insertion_sort) Wikipedia article that provides a description of the insertion sort algorithm.

2.5 Queue—Thread-Safe FIFO Implementation

Purpose: Provides a thread-safe FIFO implementation.
Python Version: At least 1.4

The Queue module provides a first-in, first-out (FIFO) data structure suitable for multithreaded programming. It can be used to pass messages or other data safely between producer and consumer threads. Locking is handled for the caller, so many threads can work with the same Queue instance safely. The size of a Queue (the number of elements it contains) may be restricted to throttle memory usage or processing.

Note: This discussion assumes you already understand the general nature of a queue. If you do not, you may want to read some of the references before continuing.

2.5.1 Basic FIFO Queue

The Queue class implements a basic first-in, first-out container. Elements are added to one end of the sequence using put(), and removed from the other end using get().

    import Queue

    q = Queue.Queue()

    for i in range(5):
        q.put(i)

    while not q.empty():
        print q.get(),
    print

This example uses a single thread to illustrate that elements are removed from the queue in the same order they are inserted.

    $ python Queue_fifo.py

    0 1 2 3 4

2.5.2 LIFO Queue

In contrast to the standard FIFO implementation of Queue, the LifoQueue uses last-in, first-out (LIFO) ordering (normally associated with a stack data structure).

    import Queue

    q = Queue.LifoQueue()

    for i in range(5):
        q.put(i)

    while not q.empty():
        print q.get(),
    print

The item most recently put into the queue is removed by get().
    $ python Queue_lifo.py

    4 3 2 1 0

2.5.3 Priority Queue

Sometimes, the processing order of the items in a queue needs to be based on characteristics of those items, rather than just on the order in which they are created or added to the queue. For example, print jobs from the payroll department may take precedence over a code listing printed by a developer. PriorityQueue uses the sort order of the contents of the queue to decide which to retrieve.

    import Queue
    import threading

    class Job(object):
        def __init__(self, priority, description):
            self.priority = priority
            self.description = description
            print 'New job:', description
            return
        def __cmp__(self, other):
            return cmp(self.priority, other.priority)

    q = Queue.PriorityQueue()

    q.put( Job(3, 'Mid-level job'))
    q.put( Job(10, 'Low-level job'))
    q.put( Job(1, 'Important job'))

    def process_job(q):
        while True:
            next_job = q.get()
            print 'Processing job:', next_job.description
            q.task_done()

    workers = [ threading.Thread(target=process_job, args=(q,)),
                threading.Thread(target=process_job, args=(q,)),
                ]
    for w in workers:
        w.setDaemon(True)
        w.start()

    q.join()

This example has multiple threads consuming the jobs, which are to be processed based on the priority of items in the queue at the time get() was called. The order of processing for items added to the queue while the consumer threads are running depends on thread context switching.

    $ python Queue_priority.py

    New job: Mid-level job
    New job: Low-level job
    New job: Important job
    Processing job: Important job
    Processing job: Mid-level job
    Processing job: Low-level job

2.5.4 Building a Threaded Podcast Client

The source code for the podcasting client in this section demonstrates how to use the Queue class with multiple threads. The program reads one or more RSS feeds, queues up the enclosures for the five most recent episodes to be downloaded, and processes several downloads in parallel using threads.
It does not have enough error handling for production use, but the skeleton implementation provides an example of how to use the Queue module.

First, some operating parameters are established. Normally, these would come from user inputs (preferences, a database, etc.). The example uses hard-coded values for the number of threads and a list of URLs to fetch.

    from Queue import Queue
    from threading import Thread
    import time
    import urllib
    import urlparse

    import feedparser

    # Set up some global variables
    num_fetch_threads = 2
    enclosure_queue = Queue()

    # A real app wouldn't use hard-coded data...
    feed_urls = [ 'http://advocacy.python.org/podcasts/littlebit.rss',
                  ]

The function downloadEnclosures() will run in the worker thread and process the downloads using urllib.

    def downloadEnclosures(i, q):
        """This is the worker thread function.
        It processes items in the queue one after
        another. These daemon threads go into an
        infinite loop, and only exit when
        the main thread ends.
        """
        while True:
            print '%s: Looking for the next enclosure' % i
            url = q.get()
            parsed_url = urlparse.urlparse(url)
            print '%s: Downloading:' % i, parsed_url.path
            response = urllib.urlopen(url)
            data = response.read()
            # Save the downloaded file to the current directory
            outfile_name = url.rpartition('/')[-1]
            with open(outfile_name, 'wb') as outfile:
                outfile.write(data)
            q.task_done()

Once the threads' target function is defined, the worker threads can be started. When downloadEnclosures() processes the statement url = q.get(), it blocks and waits until the queue has something to return. That means it is safe to start the threads before there is anything in the queue.
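The blocking behavior of get() can be seen in isolation, without any network access. This is a minimal sketch under a few assumptions: it is written for Python 3, where the module is named queue rather than Queue, and the worker() function and results list are hypothetical names that stand in for the download logic.

```python
import queue      # named Queue in the Python 2 used by the listings
import threading

q = queue.Queue()
results = []


def worker():
    # get() blocks here until the main thread puts something in the queue.
    while True:
        item = q.get()
        results.append(item * 2)
        q.task_done()


# Start the daemon thread while the queue is still empty.
t = threading.Thread(target=worker, daemon=True)
t.start()

for i in range(3):
    q.put(i)

q.join()          # returns once task_done() has been called for every item
print(results)    # [0, 2, 4]
```

With a single worker the items are handled strictly in FIFO order. The podcast client starts its workers the same way, using num_fetch_threads threads instead of one: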
    # Set up some threads to fetch the enclosures
    for i in range(num_fetch_threads):
        worker = Thread(target=downloadEnclosures,
                        args=(i, enclosure_queue,))
        worker.setDaemon(True)
        worker.start()

The next step is to retrieve the feed contents using Mark Pilgrim's feedparser module (www.feedparser.org) and enqueue the URLs of the enclosures. As soon as the first URL is added to the queue, one of the worker threads picks it up and starts downloading it. The loop will continue to add items until the feed is exhausted, and the worker threads will take turns dequeuing URLs to download them.

    # Download the feed(s) and put the enclosure URLs into
    # the queue.
    for url in feed_urls:
        response = feedparser.parse(url, agent='fetch_podcasts.py')
        for entry in response['entries'][-5:]:
            for enclosure in entry.get('enclosures', []):
                parsed_url = urlparse.urlparse(enclosure['url'])
                print 'Queuing:', parsed_url.path
                enclosure_queue.put(enclosure['url'])

The only thing left to do is wait for the queue to empty out again, using join().

    # Now wait for the queue to be empty, indicating that we have
    # processed all the downloads.
    print '*** Main thread waiting'
    enclosure_queue.join()
    print '*** Done'

Running the sample script produces the following.
    $ python fetch_podcasts.py

    0: Looking for the next enclosure
    1: Looking for the next enclosure
    Queuing: /podcasts/littlebit/2010-04-18.mp3
    Queuing: /podcasts/littlebit/2010-05-22.mp3
    Queuing: /podcasts/littlebit/2010-06-06.mp3
    Queuing: /podcasts/littlebit/2010-07-26.mp3
    Queuing: /podcasts/littlebit/2010-11-25.mp3
    *** Main thread waiting
    0: Downloading: /podcasts/littlebit/2010-04-18.mp3
    0: Looking for the next enclosure
    0: Downloading: /podcasts/littlebit/2010-05-22.mp3
    0: Looking for the next enclosure
    0: Downloading: /podcasts/littlebit/2010-06-06.mp3
    0: Looking for the next enclosure
    0: Downloading: /podcasts/littlebit/2010-07-26.mp3
    0: Looking for the next enclosure
    0: Downloading: /podcasts/littlebit/2010-11-25.mp3
    0: Looking for the next enclosure
    *** Done

The actual output will depend on the contents of the RSS feed used.

See Also:

Queue (http://docs.python.org/lib/module-Queue.html) Standard library documentation for this module.
Deque (page 75) from collections (page 70) The collections module includes a deque (double-ended queue) class.
Queue data structures (http://en.wikipedia.org/wiki/Queue_(data_structure)) Wikipedia article explaining queues.
FIFO (http://en.wikipedia.org/wiki/FIFO) Wikipedia article explaining first-in, first-out data structures.

2.6 struct—Binary Data Structures

Purpose: Convert between strings and binary data.
Python Version: 1.4 and later

The struct module includes functions for converting between strings of bytes and native Python data types, such as numbers and strings.

2.6.1 Functions vs. Struct Class

There is a set of module-level functions for working with structured values, and there is also the Struct class. Format specifiers are converted from their string format to a compiled representation, similar to the way regular expressions are handled.
The conversion takes some resources, so it is typically more efficient to do it once when creating a Struct instance and call methods on the instance, instead of using the module-level functions. The following examples all use the Struct class.

2.6.2 Packing and Unpacking

Structs support packing data into strings and unpacking data from strings using format specifiers made up of characters representing the data type and optional count and endianness indicators. Refer to the standard library documentation for a complete list of the supported format specifiers.

In this example, the specifier calls for an integer or long value, a two-character string, and a floating-point number. The spaces in the format specifier are included to separate the type indicators and are ignored when the format is compiled.

    import struct
    import binascii

    values = (1, 'ab', 2.7)
    s = struct.Struct('I 2s f')
    packed_data = s.pack(*values)

    print 'Original values:', values
    print 'Format string :', s.format
    print 'Uses :', s.size, 'bytes'
    print 'Packed Value :', binascii.hexlify(packed_data)

The example converts the packed value to a sequence of hex bytes for printing with binascii.hexlify(), since some characters are nulls.

    $ python struct_pack.py

    Original values: (1, 'ab', 2.7)
    Format string : I 2s f
    Uses : 12 bytes
    Packed Value : 0100000061620000cdcc2c40

Use unpack() to extract data from its packed representation.

    import struct
    import binascii

    packed_data = binascii.unhexlify('0100000061620000cdcc2c40')

    s = struct.Struct('I 2s f')
    unpacked_data = s.unpack(packed_data)
    print 'Unpacked Values:', unpacked_data

Passing the packed value to unpack() gives basically the same values back (note the discrepancy in the floating-point value).
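The floating-point discrepancy comes from the f format code, which stores a 4-byte single-precision value, while a Python float is a double. A small sketch of the effect, written for Python 3 and using the explicit little-endian prefix so the result does not depend on the platform:

```python
import struct

# 'f' is a 4-byte single-precision float; 'd' is an 8-byte double.
# '<' selects little-endian with standard sizes.
single = struct.unpack('<f', struct.pack('<f', 2.7))[0]
double = struct.unpack('<d', struct.pack('<d', 2.7))[0]

print(single)          # close to 2.7, but not equal to it
print(double == 2.7)   # a double round-trips exactly
```

When the exact value must survive a round trip, the d code is the safer choice, at the cost of four extra bytes per value.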
    $ python struct_unpack.py

    Unpacked Values: (1, 'ab', 2.700000047683716)

2.6.3 Endianness

By default, values are encoded using the native C library notion of endianness. It is easy to override that choice by providing an explicit endianness directive in the format string.

    import struct
    import binascii

    values = (1, 'ab', 2.7)
    print 'Original values:', values

    endianness = [
        ('@', 'native, native'),
        ('=', 'native, standard'),
        ('<', 'little-endian'),
        ('>', 'big-endian'),
        ('!', 'network'),
        ]

    for code, name in endianness:
        s = struct.Struct(code + ' I 2s f')
        packed_data = s.pack(*values)
        print
        print 'Format string :', s.format, 'for', name
        print 'Uses :', s.size, 'bytes'
        print 'Packed Value :', binascii.hexlify(packed_data)
        print 'Unpacked Value :', s.unpack(packed_data)

Table 2.1 lists the byte order specifiers used by Struct.

Table 2.1. Byte Order Specifiers for struct

    Code  Meaning
    @     Native order
    =     Native standard
    <     Little-endian
    >     Big-endian
    !     Network order

    $ python struct_endianness.py

    Original values: (1, 'ab', 2.7)

    Format string : @ I 2s f for native, native
    Uses : 12 bytes
    Packed Value : 0100000061620000cdcc2c40
    Unpacked Value : (1, 'ab', 2.700000047683716)

    Format string : = I 2s f for native, standard
    Uses : 10 bytes
    Packed Value : 010000006162cdcc2c40
    Unpacked Value : (1, 'ab', 2.700000047683716)

    Format string : < I 2s f for little-endian
    Uses : 10 bytes
    Packed Value : 010000006162cdcc2c40
    Unpacked Value : (1, 'ab', 2.700000047683716)

    Format string : > I 2s f for big-endian
    Uses : 10 bytes
    Packed Value : 000000016162402ccccd
    Unpacked Value : (1, 'ab', 2.700000047683716)

    Format string : ! I 2s f for network
    Uses : 10 bytes
    Packed Value : 000000016162402ccccd
    Unpacked Value : (1, 'ab', 2.700000047683716)

2.6.4 Buffers

Working with binary packed data is typically reserved for performance-sensitive situations or when passing data into and out of extension modules.
These cases can be optimized by avoiding the overhead of allocating a new buffer for each packed structure. The pack_into() and unpack_from() methods support writing to preallocated buffers directly.

    import struct
    import binascii

    s = struct.Struct('I 2s f')
    values = (1, 'ab', 2.7)
    print 'Original:', values

    print
    print 'ctypes string buffer'

    import ctypes
    b = ctypes.create_string_buffer(s.size)
    print 'Before :', binascii.hexlify(b.raw)
    s.pack_into(b, 0, *values)
    print 'After :', binascii.hexlify(b.raw)
    print 'Unpacked:', s.unpack_from(b, 0)

    print
    print 'array'

    import array
    a = array.array('c', '\0' * s.size)
    print 'Before :', binascii.hexlify(a)
    s.pack_into(a, 0, *values)
    print 'After :', binascii.hexlify(a)
    print 'Unpacked:', s.unpack_from(a, 0)

The size attribute of the Struct tells us how big the buffer needs to be.

    $ python struct_buffers.py

    Original: (1, 'ab', 2.7)

    ctypes string buffer
    Before : 000000000000000000000000
    After : 0100000061620000cdcc2c40
    Unpacked: (1, 'ab', 2.700000047683716)

    array
    Before : 000000000000000000000000
    After : 0100000061620000cdcc2c40
    Unpacked: (1, 'ab', 2.700000047683716)

See Also:

struct (http://docs.python.org/library/struct.html) The standard library documentation for this module.
array (page 84) The array module, for working with sequences of fixed-type values.
binascii (http://docs.python.org/library/binascii.html) The binascii module, for producing ASCII representations of binary data.
Endianness (http://en.wikipedia.org/wiki/Endianness) Wikipedia article that provides an explanation of byte order and endianness in encoding.

2.7 weakref—Impermanent References to Objects

Purpose: Refer to an "expensive" object, but allow its memory to be reclaimed by the garbage collector if there are no other nonweak references.
Python Version: 2.1 and later

The weakref module supports weak references to objects.
A normal reference increments the reference count on the object and prevents it from being garbage collected. This is not always desirable, either when a circular reference might be present or when building a cache of objects that should be deleted when memory is needed. A weak reference is a handle to an object that does not keep it from being cleaned up automatically.

2.7.1 References

Weak references to objects are managed through the ref class. To retrieve the original object, call the reference object.

    import weakref

    class ExpensiveObject(object):
        def __del__(self):
            print '(Deleting %s)' % self

    obj = ExpensiveObject()
    r = weakref.ref(obj)

    print 'obj:', obj
    print 'ref:', r
    print 'r():', r()

    print 'deleting obj'
    del obj
    print 'r():', r()

In this case, since obj is deleted before the second call to the reference, the ref returns None.

    $ python weakref_ref.py

    obj: <__main__.ExpensiveObject object at 0x100da5750>
    ref:
    r(): <__main__.ExpensiveObject object at 0x100da5750>
    deleting obj
    (Deleting <__main__.ExpensiveObject object at 0x100da5750>)
    r(): None

2.7.2 Reference Callbacks

The ref constructor accepts an optional callback function to invoke when the referenced object is deleted.

    import weakref

    class ExpensiveObject(object):
        def __del__(self):
            print '(Deleting %s)' % self

    def callback(reference):
        """Invoked when referenced object is deleted"""
        print 'callback(', reference, ')'

    obj = ExpensiveObject()
    r = weakref.ref(obj, callback)

    print 'obj:', obj
    print 'ref:', r
    print 'r():', r()

    print 'deleting obj'
    del obj
    print 'r():', r()

The callback receives the reference object as an argument after the reference is "dead" and no longer refers to the original object. One use for this feature is to remove the weak reference object from a cache.
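The cache use mentioned above might look like the following. This is a sketch under stated assumptions, not code from the weakref module: the cache dictionary and the remember() and forget() functions are hypothetical names, the code is written for Python 3, and it relies on CPython releasing an object as soon as its last strong reference goes away.

```python
import weakref


class ExpensiveObject:
    def __init__(self, name):
        self.name = name


cache = {}   # hypothetical cache mapping a name to a weak reference


def remember(obj):
    # The name is bound as a default argument so the callback does not
    # hold a strong reference to obj itself.
    def forget(reference, name=obj.name):
        # Called after the referent is gone; drop the dead entry.
        cache.pop(name, None)
    cache[obj.name] = weakref.ref(obj, forget)


obj = ExpensiveObject('report')
remember(obj)
print(sorted(cache))   # the entry exists while obj is alive

del obj                # last strong reference removed
print(sorted(cache))   # the callback has removed the entry
```

The standard library's weakref.WeakValueDictionary packages the same idea, so a hand-written callback is mostly useful when extra cleanup work is needed.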
    $ python weakref_ref_callback.py

    obj: <__main__.ExpensiveObject object at 0x100da1950>
    ref:
    r(): <__main__.ExpensiveObject object at 0x100da1950>
    deleting obj
    callback( )
    (Deleting <__main__.ExpensiveObject object at 0x100da1950>)
    r(): None

2.7.3 Proxies

It is sometimes more convenient to use a proxy, rather than a weak reference. Proxies can be used as though they were the original object and do not need to be called before the object is accessible. That means they can be passed to a library that does not know it is receiving a reference instead of the real object.

    import weakref

    class ExpensiveObject(object):
        def __init__(self, name):
            self.name = name
        def __del__(self):
            print '(Deleting %s)' % self

    obj = ExpensiveObject('My Object')
    r = weakref.ref(obj)
    p = weakref.proxy(obj)

    print 'via obj:', obj.name
    print 'via ref:', r().name
    print 'via proxy:', p.name
    del obj
    print 'via proxy:', p.name

If the proxy is accessed after the referent object is removed, a ReferenceError exception is raised.

    $ python weakref_proxy.py

    via obj: My Object
    via ref: My Object
    via proxy: My Object
    (Deleting <__main__.ExpensiveObject object at 0x100da27d0>)
    via proxy:
    Traceback (most recent call last):
      File "weakref_proxy.py", line 26, in <module>
        print 'via proxy:', p.name
    ReferenceError: weakly-referenced object no longer exists

2.7.4 Cyclic References

One use for weak references is to allow cyclic references without preventing garbage collection. This example illustrates the difference between using regular objects and proxies when a graph includes a cycle.

The Graph class in weakref_graph.py accepts any object given to it as the “next” node in the sequence. For the sake of brevity, this implementation supports a single outgoing reference from each node, which is of limited use generally, but makes it easy to create cycles for these examples.
The function demo() is a utility function to exercise the Graph class by creating a cycle and then removing various references.

    import gc
    from pprint import pprint
    import weakref

    class Graph(object):
        def __init__(self, name):
            self.name = name
            self.other = None
        def set_next(self, other):
            print '%s.set_next(%r)' % (self.name, other)
            self.other = other
        def all_nodes(self):
            "Generate the nodes in the graph sequence."
            yield self
            n = self.other
            while n and n.name != self.name:
                yield n
                n = n.other
            if n is self:
                yield n
            return
        def __str__(self):
            return '->'.join(n.name for n in self.all_nodes())
        def __repr__(self):
            return '<%s at 0x%x name=%s>' % (self.__class__.__name__,
                                             id(self), self.name)
        def __del__(self):
            print '(Deleting %s)' % self.name
            self.set_next(None)

    def collect_and_show_garbage():
        "Show what garbage is present."
        print 'Collecting...'
        n = gc.collect()
        print 'Unreachable objects:', n
        print 'Garbage:',
        pprint(gc.garbage)

    def demo(graph_factory):
        print 'Set up graph:'
        one = graph_factory('one')
        two = graph_factory('two')
        three = graph_factory('three')
        one.set_next(two)
        two.set_next(three)
        three.set_next(one)

        print
        print 'Graph:'
        print str(one)
        collect_and_show_garbage()

        print
        three = None
        two = None
        print 'After 2 references removed:'
        print str(one)
        collect_and_show_garbage()

        print
        print 'Removing last reference:'
        one = None
        collect_and_show_garbage()

This example uses the gc module to help debug the leak. The DEBUG_LEAK flag causes gc to print information about objects that cannot be seen, other than through the reference the garbage collector has to them.
    import gc
    from pprint import pprint
    import weakref

    from weakref_graph import Graph, demo, collect_and_show_garbage

    gc.set_debug(gc.DEBUG_LEAK)

    print 'Setting up the cycle'
    print
    demo(Graph)

    print
    print 'Breaking the cycle and cleaning up garbage'
    print
    gc.garbage[0].set_next(None)
    while gc.garbage:
        del gc.garbage[0]
    print
    collect_and_show_garbage()

Even after deleting the local references to the Graph instances in demo(), the graphs all show up in the garbage list and cannot be collected. Several dictionaries are also found in the garbage list. They are the __dict__ values from the Graph instances and contain the attributes for those objects. The graphs can be forcibly deleted, since the program knows what they are. Enabling unbuffered I/O by passing the -u option to the interpreter ensures that the output from the print statements in this example program (written to standard output) and the debug output from gc (written to standard error) are interleaved correctly.

    $ python -u weakref_cycle.py

    Setting up the cycle

    Set up graph:
    one.set_next()
    two.set_next()
    three.set_next()

    Graph:
    one->two->three->one
    Collecting...
    Unreachable objects: 0
    Garbage:[]

    After 2 references removed:
    one->two->three->one
    Collecting...
    Unreachable objects: 0
    Garbage:[]

    Removing last reference:
    Collecting...
    gc: uncollectable
    gc: uncollectable
    gc: uncollectable
    gc: uncollectable
    gc: uncollectable
    gc: uncollectable
    Unreachable objects: 6
    Garbage:[, , , {'name': 'one', 'other': }, {'name': 'two', 'other': }, {'name': 'three', 'other': }]

    Breaking the cycle and cleaning up garbage

    one.set_next(None)
    (Deleting two)
    two.set_next(None)
    (Deleting three)
    three.set_next(None)
    (Deleting one)
    one.set_next(None)

    Collecting...
    Unreachable objects: 0
    Garbage:[]

The next step is to create a more intelligent WeakGraph class that knows how to avoid creating cycles with regular references by using weak references when a cycle is detected.
    import gc
    from pprint import pprint
    import weakref

    from weakref_graph import Graph, demo

    class WeakGraph(Graph):
        def set_next(self, other):
            if other is not None:
                # See if we should replace the reference
                # to other with a weakref.
                if self in other.all_nodes():
                    other = weakref.proxy(other)
            super(WeakGraph, self).set_next(other)
            return

    demo(WeakGraph)

Since the WeakGraph instances use proxies to refer to objects that have already been seen, as demo() removes all local references to the objects, the cycle is broken and the garbage collector can delete the objects.

    $ python weakref_weakgraph.py

    Set up graph:
    one.set_next()
    two.set_next()
    three.set_next( )

    Graph:
    one->two->three
    Collecting...
    Unreachable objects: 0
    Garbage:[]

    After 2 references removed:
    one->two->three
    Collecting...
    Unreachable objects: 0
    Garbage:[]

    Removing last reference:
    (Deleting one)
    one.set_next(None)
    (Deleting two)
    two.set_next(None)
    (Deleting three)
    three.set_next(None)
    Collecting...
    Unreachable objects: 0
    Garbage:[]

2.7.5 Caching Objects

The ref and proxy classes are considered “low level.” While they are useful for maintaining weak references to individual objects and allowing cycles to be garbage collected, the WeakKeyDictionary and WeakValueDictionary provide a more appropriate API for creating a cache of several objects.

The WeakValueDictionary uses weak references to the values it holds, allowing them to be garbage collected when other code is not actually using them. Using explicit calls to the garbage collector illustrates the difference between memory handling with a regular dictionary and WeakValueDictionary.
    import gc
    from pprint import pprint
    import weakref

    gc.set_debug(gc.DEBUG_LEAK)

    class ExpensiveObject(object):
        def __init__(self, name):
            self.name = name
        def __repr__(self):
            return 'ExpensiveObject(%s)' % self.name
        def __del__(self):
            print ' (Deleting %s)' % self

    def demo(cache_factory):
        # hold objects so any weak references
        # are not removed immediately
        all_refs = {}
        # create the cache using the factory
        print 'CACHE TYPE:', cache_factory
        cache = cache_factory()
        for name in [ 'one', 'two', 'three' ]:
            o = ExpensiveObject(name)
            cache[name] = o
            all_refs[name] = o
            del o # decref

        print ' all_refs =',
        pprint(all_refs)
        print '\n Before, cache contains:', cache.keys()
        for name, value in cache.items():
            print ' %s = %s' % (name, value)
            del value # decref

        # Remove all references to the objects except the cache
        print '\n Cleanup:'
        del all_refs
        gc.collect()

        print '\n After, cache contains:', cache.keys()
        for name, value in cache.items():
            print ' %s = %s' % (name, value)
        print ' demo returning'
        return

    demo(dict)
    print

    demo(weakref.WeakValueDictionary)

Any loop variables that refer to the values being cached must be cleared explicitly so the reference count of the object is decremented. Otherwise, the garbage collector would not remove the objects, and they would remain in the cache. Similarly, the all_refs variable is used to hold references to prevent them from being garbage collected prematurely.

    $ python weakref_valuedict.py

    CACHE TYPE: <type 'dict'>
     all_refs ={'one': ExpensiveObject(one),
     'three': ExpensiveObject(three),
     'two': ExpensiveObject(two)}

     Before, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)

     Cleanup:

     After, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)
     demo returning
     (Deleting ExpensiveObject(three))
     (Deleting ExpensiveObject(two))
     (Deleting ExpensiveObject(one))

    CACHE TYPE: weakref.WeakValueDictionary
     all_refs ={'one': ExpensiveObject(one),
     'three': ExpensiveObject(three),
     'two': ExpensiveObject(two)}

     Before, cache contains: ['three', 'two', 'one']
     three = ExpensiveObject(three)
     two = ExpensiveObject(two)
     one = ExpensiveObject(one)

     Cleanup:
     (Deleting ExpensiveObject(three))
     (Deleting ExpensiveObject(two))
     (Deleting ExpensiveObject(one))

     After, cache contains: []
     demo returning

The WeakKeyDictionary works similarly, but it uses weak references for the keys instead of the values in the dictionary.

Warning: The library documentation for weakref contains this warning:

    Caution: Because a WeakValueDictionary is built on top of a Python dictionary, it must not change size when iterating over it. This can be difficult to ensure for a WeakValueDictionary because actions performed by the program during iteration may cause items in the dictionary to vanish “by magic” (as a side effect of garbage collection).

See Also:

weakref (http://docs.python.org/lib/module-weakref.html) Standard library documentation for this module.
gc (page 1138) The gc module is the interface to the interpreter’s garbage collector.

2.8 copy—Duplicate Objects

Purpose: Provides functions for duplicating objects using shallow or deep copy semantics.
Python Version: 1.4 and later

The copy module includes two functions, copy() and deepcopy(), for duplicating existing objects.

2.8.1 Shallow Copies

The shallow copy created by copy() is a new container populated with references to the contents of the original object. When making a shallow copy of a list object, a new list is constructed and the elements of the original object are appended to it.
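As a quick illustration of what "populated with references" means in practice, here is a small sketch using nested lists. It is written in Python 3 syntax, unlike the book's Python 2 listings, and the variable names are made up for the example.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)

# The outer lists are distinct objects...
outer_is_same = shallow is original
# ...but both hold references to the same inner lists.
inner_is_same = shallow[0] is original[0]

# Mutating a shared inner list is visible through both containers.
original[0].append(99)
# Appending to the outer list affects only that list.
original.append([5, 6])

print(outer_is_same, inner_is_same)
print(shallow)
print(original)
```

Only the container is new, so changes to the shared elements show up in both lists while changes to the container itself do not.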
    import copy

    class MyClass:
        def __init__(self, name):
            self.name = name
        def __cmp__(self, other):
            return cmp(self.name, other.name)

    a = MyClass('a')
    my_list = [ a ]
    dup = copy.copy(my_list)

    print ' my_list:', my_list
    print ' dup:', dup
    print ' dup is my_list:', (dup is my_list)
    print ' dup == my_list:', (dup == my_list)
    print 'dup[0] is my_list[0]:', (dup[0] is my_list[0])
    print 'dup[0] == my_list[0]:', (dup[0] == my_list[0])

For a shallow copy, the MyClass instance is not duplicated, so the reference in the dup list is to the same object that is in my_list.

    $ python copy_shallow.py

    my_list: [<__main__.MyClass instance at 0x100dadc68>]
    dup: [<__main__.MyClass instance at 0x100dadc68>]
    dup is my_list: False
    dup == my_list: True
    dup[0] is my_list[0]: True
    dup[0] == my_list[0]: True

2.8.2 Deep Copies

The deep copy created by deepcopy() is a new container populated with copies of the contents of the original object. To make a deep copy of a list, a new list is constructed, the elements of the original list are copied, and then those copies are appended to the new list.

Replacing the call to copy() with deepcopy() makes the difference in the output apparent.

    dup = copy.deepcopy(my_list)

The first element of the list is no longer the same object reference, but when the two objects are compared, they still evaluate as being equal.

    $ python copy_deep.py

    my_list: [<__main__.MyClass instance at 0x100dadc68>]
    dup: [<__main__.MyClass instance at 0x100dadc20>]
    dup is my_list: False
    dup == my_list: True
    dup[0] is my_list[0]: False
    dup[0] == my_list[0]: True

2.8.3 Customizing Copy Behavior

It is possible to control how copies are made using the __copy__() and __deepcopy__() special methods.

• __copy__() is called without any arguments and should return a shallow copy of the object.
• __deepcopy__() is called with a memo dictionary and should return a deep copy of the object.
  Any member attributes that need to be deep-copied should be passed to copy.deepcopy(), along with the memo dictionary, to control for recursion. (The memo dictionary is explained in more detail later.)

This example illustrates how the methods are called.

    import copy

    class MyClass:
        def __init__(self, name):
            self.name = name
        def __cmp__(self, other):
            return cmp(self.name, other.name)
        def __copy__(self):
            print '__copy__()'
            return MyClass(self.name)
        def __deepcopy__(self, memo):
            print '__deepcopy__(%s)' % str(memo)
            return MyClass(copy.deepcopy(self.name, memo))

    a = MyClass('a')

    sc = copy.copy(a)
    dc = copy.deepcopy(a)

The memo dictionary is used to keep track of the values that have been copied already, to avoid infinite recursion.

    $ python copy_hooks.py

    __copy__()
    __deepcopy__({})

2.8.4 Recursion in Deep Copy

To avoid problems with duplicating recursive data structures, deepcopy() uses a dictionary to track objects that have already been copied. This dictionary is passed to the __deepcopy__() method so it can be examined there as well.

This example shows how an interconnected data structure, such as a directed graph, can assist with protecting against recursion by implementing a __deepcopy__() method.

    import copy
    import pprint

    class Graph:

        def __init__(self, name, connections):
            self.name = name
            self.connections = connections

        def add_connection(self, other):
            self.connections.append(other)

        def __repr__(self):
            return 'Graph(name=%s, id=%s)' % (self.name, id(self))

        def __deepcopy__(self, memo):
            print '\nCalling __deepcopy__ for %r' % self
            if self in memo:
                existing = memo.get(self)
                print ' Already copied to %r' % existing
                return existing
            print ' Memo dictionary:'
            pprint.pprint(memo, indent=4, width=40)
            dup = Graph(copy.deepcopy(self.name, memo), [])
            print ' Copying to new object %s' % dup
            memo[self] = dup
            for c in self.connections:
                dup.add_connection(copy.deepcopy(c, memo))
            return dup

    root = Graph('root', [])
    a = Graph('a', [root])
    b = Graph('b', [a, root])
    root.add_connection(a)
    root.add_connection(b)

    dup = copy.deepcopy(root)

The Graph class includes a few basic directed-graph methods. An instance can be initialized with a name and a list of existing nodes to which it is connected. The add_connection() method is used to set up bidirectional connections. It is also used by the deepcopy operator.

The __deepcopy__() method prints messages to show how it is called and manages the memo dictionary contents, as needed. Instead of copying the connection list wholesale, it creates a new list and appends copies of the individual connections to it. That ensures that the memo dictionary is updated as each new node is duplicated and avoids recursion issues or extra copies of nodes. As before, it returns the copied object when it is done.

There are several cycles in the graph shown in Figure 2.1, but handling the recursion with the memo dictionary prevents the traversal from causing a stack overflow error. When the root node is copied, the output is as follows.

[Figure 2.1. Deepcopy for an object graph with cycles (nodes: root, a, b)]

    $ python copy_recursion.py

    Calling __deepcopy__ for Graph(name=root, id=4309347072)
     Memo dictionary:
    { }
     Copying to new object Graph(name=root, id=4309347360)

    Calling __deepcopy__ for Graph(name=a, id=4309347144)
     Memo dictionary:
    { Graph(name=root, id=4309347072): Graph(name=root, id=4309347360),
      4307936896: ['root'],
      4309253504: 'root'}
     Copying to new object Graph(name=a, id=4309347504)

    Calling __deepcopy__ for Graph(name=root, id=4309347072)
     Already copied to Graph(name=root, id=4309347360)

    Calling __deepcopy__ for Graph(name=b, id=4309347216)
     Memo dictionary:
    { Graph(name=root, id=4309347072): Graph(name=root, id=4309347360),
      Graph(name=a, id=4309347144): Graph(name=a, id=4309347504),
      4307936896: [ 'root',
                    'a',
                    Graph(name=root, id=4309347072),
                    Graph(name=a, id=4309347144)],
      4308678136: 'a',
      4309253504: 'root',
      4309347072: Graph(name=root, id=4309347360),
      4309347144: Graph(name=a, id=4309347504)}
     Copying to new object Graph(name=b, id=4309347864)

The second time the root node is encountered, while the a node is being copied, __deepcopy__() detects the recursion and reuses the existing value from the memo dictionary instead of creating a new object.

See Also:

copy (http://docs.python.org/library/copy.html) The standard library documentation for this module.

2.9 pprint—Pretty-Print Data Structures

Purpose: Pretty-print data structures.
Python Version: 1.4 and later

pprint contains a “pretty printer” for producing aesthetically pleasing views of data structures. The formatter produces representations of data structures that can be parsed correctly by the interpreter and are also easy for a human to read. The output is kept on a single line, if possible, and indented when split across multiple lines.

The examples in this section all depend on pprint_data.py, which contains the following.
    data = [ (1, { 'a':'A', 'b':'B', 'c':'C', 'd':'D' }),
             (2, { 'e':'E', 'f':'F', 'g':'G', 'h':'H',
                   'i':'I', 'j':'J', 'k':'K', 'l':'L',
                   }),
             ]

2.9.1 Printing

The simplest way to use the module is through the pprint() function.

    from pprint import pprint

    from pprint_data import data

    print 'PRINT:'
    print data
    print
    print 'PPRINT:'
    pprint(data)

pprint() formats an object and writes it to the data stream passed as argument (or sys.stdout by default).

    $ python pprint_pprint.py

    PRINT:
    [(1, {'a': 'A', 'c': 'C', 'b': 'B', 'd': 'D'}), (2, {'e': 'E', 'g': 'G', 'f': 'F', 'i': 'I', 'h': 'H', 'k': 'K', 'j': 'J', 'l': 'L'})]

    PPRINT:
    [(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
     (2,
      {'e': 'E',
       'f': 'F',
       'g': 'G',
       'h': 'H',
       'i': 'I',
       'j': 'J',
       'k': 'K',
       'l': 'L'})]

2.9.2 Formatting

To format a data structure without writing it directly to a stream (e.g., for logging), use pformat() to build a string representation.

    import logging
    from pprint import pformat
    from pprint_data import data

    logging.basicConfig(level=logging.DEBUG,
                        format='%(levelname)-8s %(message)s',
                        )

    logging.debug('Logging pformatted data')
    formatted = pformat(data)
    for line in formatted.splitlines():
        logging.debug(line.rstrip())

The formatted string can then be printed or logged independently.

    $ python pprint_pformat.py

    DEBUG    Logging pformatted data
    DEBUG    [(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
    DEBUG     (2,
    DEBUG      {'e': 'E',
    DEBUG       'f': 'F',
    DEBUG       'g': 'G',
    DEBUG       'h': 'H',
    DEBUG       'i': 'I',
    DEBUG       'j': 'J',
    DEBUG       'k': 'K',
    DEBUG       'l': 'L'})]

2.9.3 Arbitrary Classes

The PrettyPrinter class used by pprint() can also work with custom classes, if they define a __repr__() method.
    from pprint import pprint

    class node(object):
        def __init__(self, name, contents=[]):
            self.name = name
            self.contents = contents[:]
        def __repr__(self):
            return ( 'node(' + repr(self.name) + ', ' +
                     repr(self.contents) + ')'
                     )

    trees = [ node('node-1'),
              node('node-2', [ node('node-2-1')]),
              node('node-3', [ node('node-3-1')]),
              ]
    pprint(trees)

The representations of the nested objects are combined by the PrettyPrinter to return the full string representation.

    $ python pprint_arbitrary_object.py

    [node('node-1', []),
     node('node-2', [node('node-2-1', [])]),
     node('node-3', [node('node-3-1', [])])]

2.9.4 Recursion

Recursive data structures are represented with a reference to the original source of the data, with the form <Recursion on typename with id=number>.

    from pprint import pprint

    local_data = [ 'a', 'b', 1, 2 ]
    local_data.append(local_data)

    print 'id(local_data) =>', id(local_data)
    pprint(local_data)

In this example, the list local_data is added to itself, creating a recursive reference.

    $ python pprint_recursion.py

    id(local_data) => 4309215280
    ['a', 'b', 1, 2, <Recursion on list with id=4309215280>]

2.9.5 Limiting Nested Output

For very deep data structures, it may not be desirable for the output to include all details. The data may not format properly, the formatted text might be too large to manage, or some of the data may be extraneous.

    from pprint import pprint

    from pprint_data import data

    pprint(data, depth=1)

Use the depth argument to control how far down into the nested data structure the pretty printer recurses. Levels not included in the output are represented by an ellipsis.

    $ python pprint_depth.py

    [(...), (...)]

2.9.6 Controlling Output Width

The default output width for the formatted text is 80 columns. To adjust that width, use the width argument to pprint().

    from pprint import pprint

    from pprint_data import data

    for width in [ 80, 5 ]:
        print 'WIDTH =', width
        pprint(data, width=width)
        print

When the width is too low to accommodate the formatted data structure, the lines are not truncated or wrapped if that would introduce invalid syntax.

    $ python pprint_width.py

    WIDTH = 80
    [(1, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D'}),
     (2,
      {'e': 'E',
       'f': 'F',
       'g': 'G',
       'h': 'H',
       'i': 'I',
       'j': 'J',
       'k': 'K',
       'l': 'L'})]

    WIDTH = 5
    [(1,
      {'a': 'A',
       'b': 'B',
       'c': 'C',
       'd': 'D'}),
     (2,
      {'e': 'E',
       'f': 'F',
       'g': 'G',
       'h': 'H',
       'i': 'I',
       'j': 'J',
       'k': 'K',
       'l': 'L'})]

See Also:

pprint (http://docs.python.org/lib/module-pprint.html) Standard library documentation for this module.

Chapter 3

ALGORITHMS

Python includes several modules for implementing algorithms elegantly and concisely using whatever style is most appropriate for the task. It supports purely procedural, object-oriented, and functional styles. All three styles are frequently mixed within different parts of the same program.

functools includes functions for creating function decorators, enabling aspect-oriented programming and code reuse beyond what a traditional object-oriented approach supports. It also provides a class decorator for implementing all rich comparison APIs using a shortcut and partial objects for creating references to functions with their arguments included.

The itertools module includes functions for creating and working with iterators and generators used in functional programming. The operator module eliminates the need for many trivial lambda functions when using a functional programming style by providing function-based interfaces to built-in operations, such as arithmetic or item lookup.

contextlib makes resource management easier, more reliable, and more concise for all programming styles.
Combining context managers and the with statement reduces the number of try:finally blocks and indentation levels needed, while ensuring that files, sockets, database transactions, and other resources are closed and released at the right time.

3.1 functools—Tools for Manipulating Functions

Purpose: Functions that operate on other functions.
Python Version: 2.5 and later

The functools module provides tools for adapting or extending functions and other callable objects, without completely rewriting them.

3.1.1 Decorators

The primary tool supplied by the functools module is the class partial, which can be used to “wrap” a callable object with default arguments. The resulting object is itself callable and can be treated as though it is the original function. It takes all the same arguments as the original, and it can be invoked with extra positional or named arguments as well. A partial can be used instead of a lambda to provide default arguments to a function, while leaving some arguments unspecified.

Partial Objects

This example shows two simple partial objects for the function myfunc(). The output of show_details() includes the func, args, and keywords attributes of the partial object.

    import functools

    def myfunc(a, b=2):
        """Docstring for myfunc()."""
        print ' called myfunc with:', (a, b)
        return

    def show_details(name, f, is_partial=False):
        """Show details of a callable object."""
        print '%s:' % name
        print ' object:', f
        if not is_partial:
            print ' __name__:', f.__name__
        if is_partial:
            print ' func:', f.func
            print ' args:', f.args
            print ' keywords:', f.keywords
        return

    show_details('myfunc', myfunc)
    myfunc('a', 3)
    print

    # Set a different default value for 'b', but require
    # the caller to provide 'a'.
    p1 = functools.partial(myfunc, b=4)
    show_details('partial with named default', p1, True)
    p1('passing a')
    p1('override b', b=5)
    print

    # Set default values for both 'a' and 'b'.
    p2 = functools.partial(myfunc, 'default a', b=99)
    show_details('partial with defaults', p2, True)
    p2()
    p2(b='override b')
    print

    print 'Insufficient arguments:'
    p1()

At the end of the example, the first partial created is invoked without passing a value for a, causing an exception.

    $ python functools_partial.py

    myfunc:
     object:
     __name__: myfunc
     called myfunc with: ('a', 3)

    partial with named default:
     object:
     func:
     args: ()
     keywords: {'b': 4}
     called myfunc with: ('passing a', 4)
     called myfunc with: ('override b', 5)

    partial with defaults:
     object:
     func:
     args: ('default a',)
     keywords: {'b': 99}
     called myfunc with: ('default a', 99)
     called myfunc with: ('default a', 'override b')

    Insufficient arguments:
    Traceback (most recent call last):
      File "functools_partial.py", line 51, in <module>
        p1()
    TypeError: myfunc() takes at least 1 argument (1 given)

Acquiring Function Properties

The partial object does not have __name__ or __doc__ attributes by default, and without those attributes, decorated functions are more difficult to debug. Using update_wrapper() copies or adds attributes from the original function to the partial object.

    import functools

    def myfunc(a, b=2):
        """Docstring for myfunc()."""
        print ' called myfunc with:', (a, b)
        return

    def show_details(name, f):
        """Show details of a callable object."""
        print '%s:' % name
        print ' object:', f
        print ' __name__:',
        try:
            print f.__name__
        except AttributeError:
            print '(no __name__)'
        print ' __doc__', repr(f.__doc__)
        print
        return

    show_details('myfunc', myfunc)

    p1 = functools.partial(myfunc, b=4)
    show_details('raw wrapper', p1)

    print 'Updating wrapper:'
    print ' assign:', functools.WRAPPER_ASSIGNMENTS
    print ' update:', functools.WRAPPER_UPDATES
    print

    functools.update_wrapper(p1, myfunc)
    show_details('updated wrapper', p1)

The attributes added to the wrapper are defined in WRAPPER_ASSIGNMENTS, while WRAPPER_UPDATES lists values to be modified.
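For readers on Python 3, the same idea can be sketched as follows. This is a minimal illustration, not the book's listing. It uses the print() function, has myfunc() return its arguments rather than print them, and prints the two tuples instead of assuming their contents, because those differ between Python versions.

```python
import functools


def myfunc(a, b=2):
    """Docstring for myfunc()."""
    return (a, b)


p1 = functools.partial(myfunc, b=4)
# A partial object normally has no __name__ of its own.
had_name_before = hasattr(p1, '__name__')

functools.update_wrapper(p1, myfunc)

# After the update, the partial reports the original function's metadata.
print(had_name_before, p1.__name__, repr(p1.__doc__))

# Show which attributes are assigned and which are updated.
print(functools.WRAPPER_ASSIGNMENTS)
print(functools.WRAPPER_UPDATES)

result = p1('x')
print(result)
```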
    $ python functools_update_wrapper.py

    myfunc:
     object:
     __name__: myfunc
     __doc__ 'Docstring for myfunc().'

    raw wrapper:
     object:
     __name__: (no __name__)
     __doc__ 'partial(func, *args, **keywords) - new function with partial application\n of the given arguments and keywords.\n'

    Updating wrapper:
     assign: ('__module__', '__name__', '__doc__')
     update: ('__dict__',)

    updated wrapper:
     object:
     __name__: myfunc
     __doc__ 'Docstring for myfunc().'

Other Callables

Partials work with any callable object, not just with stand-alone functions.

    import functools

    class MyClass(object):
        """Demonstration class for functools"""

        def method1(self, a, b=2):
            """Docstring for method1()."""
            print ' called method1 with:', (self, a, b)
            return

        def method2(self, c, d=5):
            """Docstring for method2"""
            print ' called method2 with:', (self, c, d)
            return
        wrapped_method2 = functools.partial(method2, 'wrapped c')
        functools.update_wrapper(wrapped_method2, method2)

        def __call__(self, e, f=6):
            """Docstring for MyClass.__call__"""
            print ' called object with:', (self, e, f)
            return

    def show_details(name, f):
        """Show details of a callable object."""
        print '%s:' % name
        print ' object:', f
        print ' __name__:',
        try:
            print f.__name__
        except AttributeError:
            print '(no __name__)'
        print ' __doc__', repr(f.__doc__)
        return

    o = MyClass()

    show_details('method1 straight', o.method1)
    o.method1('no default for a', b=3)
    print

    p1 = functools.partial(o.method1, b=4)
    functools.update_wrapper(p1, o.method1)
    show_details('method1 wrapper', p1)
    p1('a goes here')
    print

    show_details('method2', o.method2)
    o.method2('no default for c', d=6)
    print

    show_details('wrapped method2', o.wrapped_method2)
    o.wrapped_method2('no default for c', d=6)
    print

    show_details('instance', o)
    o('no default for e')
    print

    p2 = functools.partial(o, f=7)
    show_details('instance wrapper', p2)
    p2('e goes here')

This example creates partials from an instance and methods of an instance.
    $ python functools_method.py

    method1 straight:
     object: <bound method MyClass.method1 of <__main__.MyClass object at 0x100da3550>>
     __name__: method1
     __doc__ 'Docstring for method1().'
     called method1 with: (<__main__.MyClass object at 0x100da3550>, 'no default for a', 3)

    method1 wrapper:
     object:
     __name__: method1
     __doc__ 'Docstring for method1().'
     called method1 with: (<__main__.MyClass object at 0x100da3550>, 'a goes here', 4)

    method2:
     object: <bound method MyClass.method2 of <__main__.MyClass object at 0x100da3550>>
     __name__: method2
     __doc__ 'Docstring for method2'
     called method2 with: (<__main__.MyClass object at 0x100da3550>, 'no default for c', 6)

    wrapped method2:
     object:
     __name__: method2
     __doc__ 'Docstring for method2'
     called method2 with: ('wrapped c', 'no default for c', 6)

    instance:
     object: <__main__.MyClass object at 0x100da3550>
     __name__: (no __name__)
     __doc__ 'Demonstration class for functools'
     called object with: (<__main__.MyClass object at 0x100da3550>, 'no default for e', 6)

    instance wrapper:
     object:
     __name__: (no __name__)
     __doc__ 'partial(func, *args, **keywords) - new function with partial application\n of the given arguments and keywords.\n'
     called object with: (<__main__.MyClass object at 0x100da3550>, 'e goes here', 7)

Acquiring Function Properties for Decorators

Updating the properties of a wrapped callable is especially useful when used in a decorator, since the transformed function ends up with properties of the original “bare” function.

    import functools

    def show_details(name, f):
        """Show details of a callable object."""
        print '%s:' % name
        print ' object:', f
        print ' __name__:',
        try:
            print f.__name__
        except AttributeError:
            print '(no __name__)'
        print ' __doc__', repr(f.__doc__)
        print
        return

    def simple_decorator(f):
        @functools.wraps(f)
        def decorated(a='decorated defaults', b=1):
            print ' decorated:', (a, b)
            print '',
            f(a, b=b)
            return
        return decorated

    def myfunc(a, b=2):
        "myfunc() is not complicated"
        print ' myfunc:', (a,b)
        return

    # The raw function
    show_details('myfunc', myfunc)
    myfunc('unwrapped, default b')
    myfunc('unwrapped, passing b', 3)
    print

    # Wrap explicitly
    wrapped_myfunc = simple_decorator(myfunc)
    show_details('wrapped_myfunc', wrapped_myfunc)
    wrapped_myfunc()
    wrapped_myfunc('args to wrapped', 4)
    print

    # Wrap with decorator syntax
    @simple_decorator
    def decorated_myfunc(a, b):
        myfunc(a, b)
        return

    show_details('decorated_myfunc', decorated_myfunc)
    decorated_myfunc()
    decorated_myfunc('args to decorated', 4)

functools provides a decorator, wraps(), that applies update_wrapper() to the decorated function.

    $ python functools_wraps.py

    myfunc:
     object:
     __name__: myfunc
     __doc__ 'myfunc() is not complicated'

     myfunc: ('unwrapped, default b', 2)
     myfunc: ('unwrapped, passing b', 3)

    wrapped_myfunc:
     object:
     __name__: myfunc
     __doc__ 'myfunc() is not complicated'

     decorated: ('decorated defaults', 1)
     myfunc: ('decorated defaults', 1)
     decorated: ('args to wrapped', 4)
     myfunc: ('args to wrapped', 4)

    decorated_myfunc:
     object:
     __name__: decorated_myfunc
     __doc__ None

     decorated: ('decorated defaults', 1)
     myfunc: ('decorated defaults', 1)
     decorated: ('args to decorated', 4)
     myfunc: ('args to decorated', 4)

3.1.2 Comparison

Under Python 2, classes can define a __cmp__() method that returns -1, 0, or 1 based on whether the object is less than, equal to, or greater than the item being compared. Python 2.1 introduced the rich comparison methods API (__lt__(), __le__(), __eq__(), __ne__(), __gt__(), and __ge__()), which perform a single comparison operation and return a Boolean value. Python 3 removed support for __cmp__() in favor of these new methods, so functools provides tools to make it easier to write Python 2 classes that comply with the new comparison requirements in Python 3.

Rich Comparison

The rich comparison API is designed to allow classes with complex comparisons to implement each test in the most efficient way possible.
However, for classes where comparison is relatively simple, there is no point in manually creating each of the rich comparison methods. The total_ordering() class decorator takes a class that provides some of the methods and adds the rest of them.

    import functools
    import inspect
    from pprint import pprint

    @functools.total_ordering
    class MyObject(object):
        def __init__(self, val):
            self.val = val
        def __eq__(self, other):
            print ' testing __eq__(%s, %s)' % (self.val, other.val)
            return self.val == other.val
        def __gt__(self, other):
            print ' testing __gt__(%s, %s)' % (self.val, other.val)
            return self.val > other.val

    print 'Methods:\n'
    pprint(inspect.getmembers(MyObject, inspect.ismethod))

    a = MyObject(1)
    b = MyObject(2)

    print '\nComparisons:'
    for expr in [ 'a < b', 'a <= b', 'a == b', 'a >= b', 'a > b' ]:
        print '\n%-6s:' % expr
        result = eval(expr)
        print ' result of %s: %s' % (expr, result)

The class must provide an implementation of __eq__() and of one other rich comparison method. The decorator adds implementations of the remaining methods, which work by using the comparisons provided.

    $ python functools_total_ordering.py

    Methods:

    [('__eq__', ),
     ('__ge__', ),
     ('__gt__', ),
     ('__init__', ),
     ('__le__', ),
     ('__lt__', )]

    Comparisons:

    a < b :
     testing __gt__(2, 1)
     result of a < b: True

    a <= b:
     testing __gt__(1, 2)
     result of a <= b: True

    a == b:
     testing __eq__(1, 2)
     result of a == b: False

    a >= b:
     testing __gt__(2, 1)
     result of a >= b: False

    a > b :
     testing __gt__(1, 2)
     result of a > b: False

Collation Order

Since old-style comparison functions are deprecated in Python 3, the cmp argument to functions like sort() is also no longer supported. Python 2 programs that use comparison functions can use cmp_to_key() to convert them to a function that returns a collation key, which is used to determine the position in the final sequence.

    import functools

    class MyObject(object):
        def __init__(self, val):
            self.val = val
        def __str__(self):
            return 'MyObject(%s)' % self.val

    def compare_obj(a, b):
        """Old-style comparison function.
        """
        print 'comparing %s and %s' % (a, b)
        return cmp(a.val, b.val)

    # Make a key function using cmp_to_key()
    get_key = functools.cmp_to_key(compare_obj)

    def get_key_wrapper(o):
        """Wrapper function for get_key to allow for print statements.
        """
        new_key = get_key(o)
        print 'key_wrapper(%s) -> %s' % (o, new_key)
        return new_key

    objs = [ MyObject(x) for x in xrange(5, 0, -1) ]

    for o in sorted(objs, key=get_key_wrapper):
        print o

Normally, cmp_to_key() would be used directly, but in this example, an extra wrapper function is introduced to print out more information as the key function is being called.

The output shows that sorted() starts by calling get_key_wrapper() for each item in the sequence to produce a key. The keys returned by cmp_to_key() are instances of a class defined in functools that implements the rich comparison API using the old-style comparison function passed in. After all keys are created, the sequence is sorted by comparing the keys.

    $ python functools_cmp_to_key.py

    key_wrapper(MyObject(5)) ->
    key_wrapper(MyObject(4)) ->
    key_wrapper(MyObject(3)) ->
    key_wrapper(MyObject(2)) ->
    key_wrapper(MyObject(1)) ->
    comparing MyObject(4) and MyObject(5)
    comparing MyObject(3) and MyObject(4)
    comparing MyObject(2) and MyObject(3)
    comparing MyObject(1) and MyObject(2)
    MyObject(1)
    MyObject(2)
    MyObject(3)
    MyObject(4)
    MyObject(5)

See Also:
functools (http://docs.python.org/library/functools.html) The standard library documentation for this module.
Rich comparison methods (http://docs.python.org/reference/datamodel.html#object.__lt__) Description of the rich comparison methods from the Python Reference Guide.
inspect (page 1200) Introspection API for live objects.
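
For readers working in Python 3, where cmp() and the cmp argument to sorted() no longer exist, the two comparison tools from this section might be combined roughly as in the following sketch. It is written in Python 3 syntax, unlike the Python 2 listings in this section, and the Version class and compare_len() function are hypothetical names introduced only for illustration.

```python
import functools


@functools.total_ordering
class Version(object):
    """Hypothetical example class; only __eq__ and __lt__ are hand-written."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _key(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    def __repr__(self):
        return 'Version(%d, %d)' % (self.major, self.minor)


def compare_len(a, b):
    """Old-style comparison function: negative, zero, or positive."""
    # (x > y) - (x < y) is a common stand-in for the removed cmp().
    return (len(a) > len(b)) - (len(a) < len(b))


# Sorting relies on __lt__; the other orderings come from total_ordering.
versions = sorted([Version(2, 7), Version(1, 4), Version(2, 3)])

# cmp_to_key() adapts the old-style function for use as a key.
words = sorted(['ccc', 'a', 'bb'], key=functools.cmp_to_key(compare_len))

print(versions)
print(Version(1, 4) <= Version(2, 3))
print(words)
```

Only two comparison methods are written by hand here; whether that is enough depends on how expensive the derived comparisons are allowed to be, since each derived method calls the ones provided.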
3.2 itertools—Iterator Functions

Purpose: The itertools module includes a set of functions for working with sequence data sets.
Python Version: 2.3 and later

The functions provided by itertools are inspired by similar features of functional programming languages such as Clojure and Haskell. They are intended to be fast and use memory efficiently, and also to be hooked together to express more complicated iteration-based algorithms.

Iterator-based code offers better memory consumption characteristics than code that uses lists. Since data is not produced from the iterator until it is needed, all data does not need to be stored in memory at the same time. This “lazy” processing model uses less memory, which can reduce swapping and other side effects of large data sets, improving performance.

3.2.1 Merging and Splitting Iterators

The chain() function takes several iterators as arguments and returns a single iterator that produces the contents of all of them as though they came from a single iterator.

    from itertools import *

    for i in chain([1, 2, 3], ['a', 'b', 'c']):
        print i,
    print

chain() makes it easy to process several sequences without constructing one large list.

    $ python itertools_chain.py

    1 2 3 a b c

izip() returns an iterator that combines the elements of several iterators into tuples.

    from itertools import *

    for i in izip([1, 2, 3], ['a', 'b', 'c']):
        print i

It works like the built-in function zip(), except that it returns an iterator instead of a list.

    $ python itertools_izip.py

    (1, 'a')
    (2, 'b')
    (3, 'c')

The islice() function returns an iterator that returns selected items from the input iterator, by index.

    from itertools import *

    print 'Stop at 5:'
    for i in islice(count(), 5):
        print i,
    print '\n'

    print 'Start at 5, Stop at 10:'
    for i in islice(count(), 5, 10):
        print i,
    print '\n'

    print 'By tens to 100:'
    for i in islice(count(), 0, 100, 10):
        print i,
    print '\n'

islice() takes the same arguments as the slice operator for lists: start, stop, and step. The start and step arguments are optional.

    $ python itertools_islice.py

    Stop at 5:
    0 1 2 3 4

    Start at 5, Stop at 10:
    5 6 7 8 9

    By tens to 100:
    0 10 20 30 40 50 60 70 80 90

The tee() function returns several independent iterators (defaults to 2) based on a single original input.

    from itertools import *

    r = islice(count(), 5)
    i1, i2 = tee(r)

    print 'i1:', list(i1)
    print 'i2:', list(i2)

tee() has semantics similar to the UNIX tee utility, which repeats the values it reads from its input and writes them to a named file and standard output. The iterators returned by tee() can be used to feed the same set of data into multiple algorithms to be processed in parallel.

    $ python itertools_tee.py

    i1: [0, 1, 2, 3, 4]
    i2: [0, 1, 2, 3, 4]

The new iterators created by tee() share their input, so the original iterator should not be used once the new ones are created.

    from itertools import *

    r = islice(count(), 5)
    i1, i2 = tee(r)

    print 'r:',
    for i in r:
        print i,
        if i > 1:
            break
    print

    print 'i1:', list(i1)
    print 'i2:', list(i2)

If values are consumed from the original input, the new iterators will not produce those values:

    $ python itertools_tee_error.py

    r: 0 1 2
    i1: [3, 4]
    i2: [3, 4]

3.2.2 Converting Inputs

The imap() function returns an iterator that calls a function on the values in the input iterators and returns the results. It works like the built-in map(), except that it stops when any input iterator is exhausted (instead of inserting None values to completely consume all inputs).

    from itertools import *

    print 'Doubles:'
    for i in imap(lambda x:2*x, xrange(5)):
        print i

    print 'Multiples:'
    for i in imap(lambda x,y:(x, y, x*y), xrange(5), xrange(5,10)):
        print '%d * %d = %d' % i

In the first example, the lambda function multiplies the input values by 2. In the second example, the lambda function multiplies two arguments, taken from separate iterators, and returns a tuple with the original arguments and the computed value.

    $ python itertools_imap.py

    Doubles:
    0
    2
    4
    6
    8
    Multiples:
    0 * 5 = 0
    1 * 6 = 6
    2 * 7 = 14
    3 * 8 = 24
    4 * 9 = 36

The starmap() function is similar to imap(), but instead of constructing a tuple from multiple iterators, it splits up the items in a single iterator as arguments to the mapping function using the * syntax.

    from itertools import *

    values = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
    for i in starmap(lambda x,y:(x, y, x*y), values):
        print '%d * %d = %d' % i

Where the mapping function to imap() is called f(i1, i2), the mapping function passed to starmap() is called f(*i).

    $ python itertools_starmap.py

    0 * 5 = 0
    1 * 6 = 6
    2 * 7 = 14
    3 * 8 = 24
    4 * 9 = 36

3.2.3 Producing New Values

The count() function returns an iterator that produces consecutive integers, indefinitely. The first number can be passed as an argument (the default is zero). There is no upper bound argument [see the built-in xrange() for more control over the result set].

    from itertools import *

    for i in izip(count(1), ['a', 'b', 'c']):
        print i

This example stops because the list argument is consumed.

    $ python itertools_count.py

    (1, 'a')
    (2, 'b')
    (3, 'c')

The cycle() function returns an iterator that indefinitely repeats the contents of the arguments it is given. Since it has to remember the entire contents of the input iterator, it may consume quite a bit of memory if the iterator is long.
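
The memory cost of cycle() follows from how such an iterator has to work: an input iterator can be read only once, so every item must be saved for the later passes. The following is a rough pure-Python sketch of that behavior, written in Python 3 syntax. The name cycle_sketch() is hypothetical, and the real itertools.cycle() is implemented in C.

```python
from itertools import islice


def cycle_sketch(iterable):
    """Rough pure-Python equivalent of itertools.cycle()."""
    saved = []
    for item in iterable:
        yield item
        saved.append(item)   # every item is kept for the later passes
    while saved:
        for item in saved:
            yield item


# iter('abc') can be consumed only once, so repeating it needs the saved copy.
first_seven = list(islice(cycle_sketch(iter('abc')), 7))
print(first_seven)
```

The saved list grows to the full length of the input before the first repetition starts, which is why a long input iterator can be expensive to cycle.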

    from itertools import *

    for i, item in izip(xrange(7), cycle(['a', 'b', 'c'])):
        print (i, item)

A counter variable is used to break out of the loop after a few cycles in this example.

    $ python itertools_cycle.py

    (0, 'a')
    (1, 'b')
    (2, 'c')
    (3, 'a')
    (4, 'b')
    (5, 'c')
    (6, 'a')

The repeat() function returns an iterator that produces the same value each time it is accessed.

    from itertools import *

    for i in repeat('over-and-over', 5):
        print i

The iterator returned by repeat() keeps returning data forever, unless the optional times argument is provided to limit it.

    $ python itertools_repeat.py

    over-and-over
    over-and-over
    over-and-over
    over-and-over
    over-and-over

It is useful to combine repeat() with izip() or imap() when invariant values need to be included with the values from the other iterators.

    from itertools import *

    for i, s in izip(count(), repeat('over-and-over', 5)):
        print i, s

A counter value is combined with the constant returned by repeat() in this example.

    $ python itertools_repeat_izip.py

    0 over-and-over
    1 over-and-over
    2 over-and-over
    3 over-and-over
    4 over-and-over

This example uses imap() to multiply the numbers in the range 0 through 4 by 2.

    from itertools import *

    for i in imap(lambda x,y:(x, y, x*y), repeat(2), xrange(5)):
        print '%d * %d = %d' % i

The repeat() iterator does not need to be explicitly limited, since imap() stops processing when any of its inputs ends, and the xrange() returns only five elements.

    $ python itertools_repeat_imap.py

    2 * 0 = 0
    2 * 1 = 2
    2 * 2 = 4
    2 * 3 = 6
    2 * 4 = 8

3.2.4 Filtering

The dropwhile() function returns an iterator that produces elements of the input iterator after a condition becomes false for the first time.

    from itertools import *

    def should_drop(x):
        print 'Testing:', x
        return (x<1)

    for i in dropwhile(should_drop, [ -1, 0, 1, 2, -2 ]):
        print 'Yielding:', i

dropwhile() does not filter every item of the input; after the condition is false the first time, all remaining items in the input are returned.

    $ python itertools_dropwhile.py

    Testing: -1
    Testing: 0
    Testing: 1
    Yielding: 1
    Yielding: 2
    Yielding: -2

The opposite of dropwhile() is takewhile(). It returns an iterator that returns items from the input iterator, as long as the test function returns true.

    from itertools import *

    def should_take(x):
        print 'Testing:', x
        return (x<2)

    for i in takewhile(should_take, [ -1, 0, 1, 2, -2 ]):
        print 'Yielding:', i

As soon as should_take() returns False, takewhile() stops processing the input.

    $ python itertools_takewhile.py

    Testing: -1
    Yielding: -1
    Testing: 0
    Yielding: 0
    Testing: 1
    Yielding: 1
    Testing: 2

ifilter() returns an iterator that works like the built-in filter() does for lists, including only items for which the test function returns true.

    from itertools import *

    def check_item(x):
        print 'Testing:', x
        return (x<1)

    for i in ifilter(check_item, [ -1, 0, 1, 2, -2 ]):
        print 'Yielding:', i

ifilter() is different from dropwhile() in that every item is tested before it is returned.

    $ python itertools_ifilter.py

    Testing: -1
    Yielding: -1
    Testing: 0
    Yielding: 0
    Testing: 1
    Testing: 2
    Testing: -2
    Yielding: -2

ifilterfalse() returns an iterator that includes only items where the test function returns false.

    from itertools import *

    def check_item(x):
        print 'Testing:', x
        return (x<1)

    for i in ifilterfalse(check_item, [ -1, 0, 1, 2, -2 ]):
        print 'Yielding:', i

The test expression in check_item() is the same, so the results in this example with ifilterfalse() are the opposite of the results from the previous example.

    $ python itertools_ifilterfalse.py

    Testing: -1
    Testing: 0
    Testing: 1
    Yielding: 1
    Testing: 2
    Yielding: 2
    Testing: -2

3.2.5 Grouping Data

The groupby() function returns an iterator that produces sets of values organized by a common key. This example illustrates grouping related values based on an attribute.

    from itertools import *
    import operator
    import pprint

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y
        def __repr__(self):
            return '(%s, %s)' % (self.x, self.y)
        def __cmp__(self, other):
            return cmp((self.x, self.y), (other.x, other.y))

    # Create a dataset of Point instances
    # (ten points, matching the output shown below)
    data = list(imap(Point,
                     cycle(islice(count(), 3)),
                     islice(count(), 10),
                     )
                )
    print 'Data:'
    pprint.pprint(data, width=69)
    print

    # Try to group the unsorted data based on X values
    print 'Grouped, unsorted:'
    for k, g in groupby(data, operator.attrgetter('x')):
        print k, list(g)
    print

    # Sort the data
    data.sort()
    print 'Sorted:'
    pprint.pprint(data, width=69)
    print

    # Group the sorted data based on X values
    print 'Grouped, sorted:'
    for k, g in groupby(data, operator.attrgetter('x')):
        print k, list(g)
    print

The input sequence needs to be sorted on the key value in order for the groupings to work out as expected.

    $ python itertools_groupby_seq.py

    Data:
    [(0, 0),
     (1, 1),
     (2, 2),
     (0, 3),
     (1, 4),
     (2, 5),
     (0, 6),
     (1, 7),
     (2, 8),
     (0, 9)]

    Grouped, unsorted:
    0 [(0, 0)]
    1 [(1, 1)]
    2 [(2, 2)]
    0 [(0, 3)]
    1 [(1, 4)]
    2 [(2, 5)]
    0 [(0, 6)]
    1 [(1, 7)]
    2 [(2, 8)]
    0 [(0, 9)]

    Sorted:
    [(0, 0),
     (0, 3),
     (0, 6),
     (0, 9),
     (1, 1),
     (1, 4),
     (1, 7),
     (2, 2),
     (2, 5),
     (2, 8)]

    Grouped, sorted:
    0 [(0, 0), (0, 3), (0, 6), (0, 9)]
    1 [(1, 1), (1, 4), (1, 7)]
    2 [(2, 2), (2, 5), (2, 8)]

See Also:
itertools (http://docs.python.org/library/itertools.html) The standard library documentation for this module.
The Standard ML Basis Library (www.standardml.org/Basis/) The library for SML.
Definition of Haskell and the Standard Libraries (www.haskell.org/definition/) Standard library specification for the functional language Haskell.
Clojure (http://clojure.org/) Clojure is a dynamic functional language that runs on the Java Virtual Machine.
tee (http://unixhelp.ed.ac.uk/CGI/man-cgi?tee) UNIX command line tool for splitting one input into multiple identical output streams.

3.3 operator—Functional Interface to Built-in Operators

Purpose: Functional interface to built-in operators.
Python Version: 1.4 and later

Programming with iterators occasionally requires creating small functions for simple expressions. Sometimes, these can be implemented as lambda functions, but for some operations, new functions are not needed at all. The operator module defines functions that correspond to built-in operations for arithmetic and comparison.

3.3.1 Logical Operations

There are functions for determining the Boolean equivalent for a value, negating it to create the opposite Boolean value, and comparing objects to see if they are identical.

    from operator import *

    a = -1
    b = 5

    print 'a =', a
    print 'b =', b
    print

    print 'not_(a) :', not_(a)
    print 'truth(a) :', truth(a)
    print 'is_(a, b) :', is_(a,b)
    print 'is_not(a, b):', is_not(a,b)

not_() includes the trailing underscore because not is a Python keyword. truth() applies the same logic used when testing an expression in an if statement. is_() implements the same check used by the is keyword, and is_not() does the same test and returns the opposite answer.

    $ python operator_boolean.py

    a = -1
    b = 5

    not_(a) : False
    truth(a) : True
    is_(a, b) : False
    is_not(a, b): True

3.3.2 Comparison Operators

All rich comparison operators are supported.

    from operator import *

    a = 1
    b = 5.0

    print 'a =', a
    print 'b =', b
    for func in (lt, le, eq, ne, ge, gt):
        print '%s(a, b):' % func.__name__, func(a, b)

The functions are equivalent to the expression syntax using <, <=, ==, !=, >=, and >.

    $ python operator_comparisons.py

    a = 1
    b = 5.0
    lt(a, b): True
    le(a, b): True
    eq(a, b): False
    ne(a, b): True
    ge(a, b): False
    gt(a, b): False

3.3.3 Arithmetic Operators

The arithmetic operators for manipulating numerical values are also supported.

    from operator import *

    a = -1
    b = 5.0
    c = 2
    d = 6

    print 'a =', a
    print 'b =', b
    print 'c =', c
    print 'd =', d

    print '\nPositive/Negative:'
    print 'abs(a):', abs(a)
    print 'neg(a):', neg(a)
    print 'neg(b):', neg(b)
    print 'pos(a):', pos(a)
    print 'pos(b):', pos(b)

    print '\nArithmetic:'
    print 'add(a, b) :', add(a, b)
    print 'div(a, b) :', div(a, b)
    print 'div(d, c) :', div(d, c)
    print 'floordiv(a, b):', floordiv(a, b)
    print 'floordiv(d, c):', floordiv(d, c)
    print 'mod(a, b) :', mod(a, b)
    print 'mul(a, b) :', mul(a, b)
    print 'pow(c, d) :', pow(c, d)
    print 'sub(b, a) :', sub(b, a)
    print 'truediv(a, b) :', truediv(a, b)
    print 'truediv(d, c) :', truediv(d, c)

    print '\nBitwise:'
    print 'and_(c, d) :', and_(c, d)
    print 'invert(c) :', invert(c)
    print 'lshift(c, d):', lshift(c, d)
    print 'or_(c, d) :', or_(c, d)
    print 'rshift(d, c):', rshift(d, c)
    print 'xor(c, d) :', xor(c, d)

There are two separate division operators: floordiv() (integer division as implemented in Python before version 3.0) and truediv() (floating-point division).

    $ python operator_math.py

    a = -1
    b = 5.0
    c = 2
    d = 6

    Positive/Negative:
    abs(a): 1
    neg(a): 1
    neg(b): -5.0
    pos(a): -1
    pos(b): 5.0

    Arithmetic:
    add(a, b) : 4.0
    div(a, b) : -0.2
    div(d, c) : 3
    floordiv(a, b): -1.0
    floordiv(d, c): 3
    mod(a, b) : 4.0
    mul(a, b) : -5.0
    pow(c, d) : 64
    sub(b, a) : 6.0
    truediv(a, b) : -0.2
    truediv(d, c) : 3.0

    Bitwise:
    and_(c, d) : 2
    invert(c) : -3
    lshift(c, d): 128
    or_(c, d) : 6
    rshift(d, c): 1
    xor(c, d) : 4

3.3.4 Sequence Operators

The operators for working with sequences can be divided into four groups: building up sequences, searching for items, accessing contents, and removing items from sequences.

    from operator import *

    a = [ 1, 2, 3 ]
    b = [ 'a', 'b', 'c' ]

    print 'a =', a
    print 'b =', b

    print '\nConstructive:'
    print ' concat(a, b):', concat(a, b)
    print ' repeat(a, 3):', repeat(a, 3)

    print '\nSearching:'
    print ' contains(a, 1) :', contains(a, 1)
    print ' contains(b, "d"):', contains(b, "d")
    print ' countOf(a, 1) :', countOf(a, 1)
    print ' countOf(b, "d") :', countOf(b, "d")
    print ' indexOf(a, 5) :', indexOf(a, 1)

    print '\nAccess Items:'
    print ' getitem(b, 1) :', getitem(b, 1)
    print ' getslice(a, 1, 3) :', getslice(a, 1, 3)
    print ' setitem(b, 1, "d") :', setitem(b, 1, "d"),
    print ', after b =', b
    print ' setslice(a, 1, 3, [4, 5]):', setslice(a, 1, 3, [4, 5]),
    print ', after a =', a

    print '\nDestructive:'
    print ' delitem(b, 1) :', delitem(b, 1), ', after b =', b
    print ' delslice(a, 1, 3):', delslice(a, 1, 3), ', after a =', a

Some of these operations, such as setitem() and delitem(), modify the sequence in place and do not return a value.

    $ python operator_sequences.py

    a = [1, 2, 3]
    b = ['a', 'b', 'c']

    Constructive:
     concat(a, b): [1, 2, 3, 'a', 'b', 'c']
     repeat(a, 3): [1, 2, 3, 1, 2, 3, 1, 2, 3]

    Searching:
     contains(a, 1) : True
     contains(b, "d"): False
     countOf(a, 1) : 1
     countOf(b, "d") : 0
     indexOf(a, 5) : 0

    Access Items:
     getitem(b, 1) : b
     getslice(a, 1, 3) : [2, 3]
     setitem(b, 1, "d") : None , after b = ['a', 'd', 'c']
     setslice(a, 1, 3, [4, 5]): None , after a = [1, 4, 5]

    Destructive:
     delitem(b, 1) : None , after b = ['a', 'c']
     delslice(a, 1, 3): None , after a = [1]

3.3.5 In-Place Operators

In addition to the standard operators, many types of objects support “in-place” modification through special operators such as +=. There are equivalent functions for in-place modifications, too.

    from operator import *

    a = -1
    b = 5.0
    c = [ 1, 2, 3 ]
    d = [ 'a', 'b', 'c']

    print 'a =', a
    print 'b =', b
    print 'c =', c
    print 'd =', d
    print

    a = iadd(a, b)
    print 'a = iadd(a, b) =>', a
    print

    c = iconcat(c, d)
    print 'c = iconcat(c, d) =>', c

These examples demonstrate only a few of the functions. Refer to the standard library documentation for complete details.

    $ python operator_inplace.py

    a = -1
    b = 5.0
    c = [1, 2, 3]
    d = ['a', 'b', 'c']

    a = iadd(a, b) => 4.0

    c = iconcat(c, d) => [1, 2, 3, 'a', 'b', 'c']

3.3.6 Attribute and Item “Getters”

One of the most unusual features of the operator module is the concept of getters. These are callable objects constructed at runtime to retrieve attributes of objects or contents from sequences. Getters are especially useful when working with iterators or generator sequences, where they are intended to incur less overhead than a lambda or Python function.
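
As a rough sketch of the getter idea, in Python 3 syntax, a getter can be thought of as a small function that is built once and then applied to many objects. The Employee record type and its field names below are hypothetical and exist only for this illustration.

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

# Hypothetical record type used only for this sketch.
Employee = namedtuple('Employee', ['name', 'dept', 'salary'])

staff = [
    Employee('Ann', 'ops', 70),
    Employee('Bob', 'dev', 85),
    Employee('Cid', 'dev', 60),
]

get_name = attrgetter('name')              # behaves like lambda x: x.name
get_pair = attrgetter('dept', 'salary')    # several names produce a tuple
get_first = itemgetter(0)                  # behaves like lambda x: x[0]

names = [get_name(e) for e in staff]
by_dept_then_salary = sorted(staff, key=get_pair)
initials = [get_first(n) for n in names]

print(names)
print([e.name for e in by_dept_then_salary])
print(initials)
```

Passing several names to attrgetter() is a convenient way to build a multi-field sort key without writing a lambda that constructs the tuple by hand.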

    from operator import *

    class MyObj(object):
        """example class for attrgetter"""
        def __init__(self, arg):
            super(MyObj, self).__init__()
            self.arg = arg
        def __repr__(self):
            return 'MyObj(%s)' % self.arg

    l = [ MyObj(i) for i in xrange(5) ]
    print 'objects :', l

    # Extract the 'arg' value from each object
    g = attrgetter('arg')
    vals = [ g(i) for i in l ]
    print 'arg values:', vals

    # Sort using arg
    l.reverse()
    print 'reversed :', l
    print 'sorted :', sorted(l, key=g)

Attribute getters work like lambda x, n='attrname': getattr(x, n):

    $ python operator_attrgetter.py

    objects : [MyObj(0), MyObj(1), MyObj(2), MyObj(3), MyObj(4)]
    arg values: [0, 1, 2, 3, 4]
    reversed : [MyObj(4), MyObj(3), MyObj(2), MyObj(1), MyObj(0)]
    sorted : [MyObj(0), MyObj(1), MyObj(2), MyObj(3), MyObj(4)]

Item getters work like lambda x, y=5: x[y]:

    from operator import *

    l = [ dict(val=-1 * i) for i in xrange(4) ]
    print 'Dictionaries:', l
    g = itemgetter('val')
    vals = [ g(i) for i in l ]
    print ' values:', vals
    print ' sorted:', sorted(l, key=g)

    print
    l = [ (i, i*-2) for i in xrange(4) ]
    print 'Tuples :', l
    g = itemgetter(1)
    vals = [ g(i) for i in l ]
    print ' values:', vals
    print ' sorted:', sorted(l, key=g)

Item getters work with mappings as well as sequences.

    $ python operator_itemgetter.py

    Dictionaries: [{'val': 0}, {'val': -1}, {'val': -2}, {'val': -3}]
     values: [0, -1, -2, -3]
     sorted: [{'val': -3}, {'val': -2}, {'val': -1}, {'val': 0}]

    Tuples : [(0, 0), (1, -2), (2, -4), (3, -6)]
     values: [0, -2, -4, -6]
     sorted: [(3, -6), (2, -4), (1, -2), (0, 0)]

3.3.7 Combining Operators and Custom Classes

The functions in the operator module work via the standard Python interfaces for their operations, so they work with user-defined classes as well as the built-in types.

    from operator import *

    class MyObj(object):
        """Example for operator overloading"""
        def __init__(self, val):
            super(MyObj, self).__init__()
            self.val = val
            return
        def __str__(self):
            return 'MyObj(%s)' % self.val
        def __lt__(self, other):
            """compare for less-than"""
            print 'Testing %s < %s' % (self, other)
            return self.val < other.val
        def __add__(self, other):
            """add values"""
            print 'Adding %s + %s' % (self, other)
            return MyObj(self.val + other.val)

    a = MyObj(1)
    b = MyObj(2)

    print 'Comparison:'
    print lt(a, b)

    print '\nArithmetic:'
    print add(a, b)

Refer to the Python reference guide for a complete list of the special methods each operator uses.

    $ python operator_classes.py

    Comparison:
    Testing MyObj(1) < MyObj(2)
    True

    Arithmetic:
    Adding MyObj(1) + MyObj(2)
    MyObj(3)

3.3.8 Type Checking

The operator module also includes functions for testing API compliance for mapping, number, and sequence types.

    from operator import *

    class NoType(object):
        """Supports none of the type APIs"""

    class MultiType(object):
        """Supports multiple type APIs"""
        def __len__(self):
            return 0
        def __getitem__(self, name):
            return 'mapping'
        def __int__(self):
            return 0

    o = NoType()
    t = MultiType()

    for func in (isMappingType, isNumberType, isSequenceType):
        print '%s(o):' % func.__name__, func(o)
        print '%s(t):' % func.__name__, func(t)

The tests are not perfect, since the interfaces are not strictly defined, but they do provide some idea of what is supported.

    $ python operator_typechecking.py

    isMappingType(o): False
    isMappingType(t): True
    isNumberType(o): False
    isNumberType(t): True
    isSequenceType(o): False
    isSequenceType(t): True

See Also:
operator (http://docs.python.org/lib/module-operator.html) Standard library documentation for this module.
functools (page 129) Functional programming tools, including the total_ordering() decorator for adding rich comparison methods to a class.
itertools (page 141) Iterator operations.
abc (page 1178) The abc module includes abstract base classes that define the APIs for collection types.

3.4 contextlib—Context Manager Utilities

Purpose: Utilities for creating and working with context managers.
Python Version: 2.5 and later

The contextlib module contains utilities for working with context managers and the with statement.

Note: Context managers are tied to the with statement. Since with is officially part of Python 2.6, import it from __future__ before using contextlib in Python 2.5.

3.4.1 Context Manager API

A context manager is responsible for a resource within a code block, possibly creating it when the block is entered and then cleaning it up after the block is exited. For example, files support the context manager API to make it easy to ensure they are closed after all reading or writing is done.

    with open('/tmp/pymotw.txt', 'wt') as f:
        f.write('contents go here')
    # file is automatically closed

A context manager is enabled by the with statement, and the API involves two methods. The __enter__() method is run when execution flow enters the code block inside the with. It returns an object to be used within the context. When execution flow leaves the with block, the __exit__() method of the context manager is called to clean up any resources being used.

    class Context(object):

        def __init__(self):
            print '__init__()'

        def __enter__(self):
            print '__enter__()'
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            print '__exit__()'

    with Context():
        print 'Doing work in the context'

Combining a context manager and the with statement is a more compact way of writing a try:finally block, since the context manager’s __exit__() method is always called, even if an exception is raised.

    $ python contextlib_api.py

    __init__()
    __enter__()
    Doing work in the context
    __exit__()

The __enter__() method can return any object to be associated with a name specified in the as clause of the with statement.
In this example, the Context returns an object that uses the open context.

    class WithinContext(object):

        def __init__(self, context):
            print 'WithinContext.__init__(%s)' % context

        def do_something(self):
            print 'WithinContext.do_something()'

        def __del__(self):
            print 'WithinContext.__del__'

    class Context(object):

        def __init__(self):
            print 'Context.__init__()'

        def __enter__(self):
            print 'Context.__enter__()'
            return WithinContext(self)

        def __exit__(self, exc_type, exc_val, exc_tb):
            print 'Context.__exit__()'

    with Context() as c:
        c.do_something()

The value associated with the variable c is the object returned by __enter__(), which is not necessarily the Context instance created in the with statement.

    $ python contextlib_api_other_object.py

    Context.__init__()
    Context.__enter__()
    WithinContext.__init__(<__main__.Context object at 0x100d98a10>)
    WithinContext.do_something()
    Context.__exit__()
    WithinContext.__del__

The __exit__() method receives arguments containing details of any exception raised in the with block.

    class Context(object):

        def __init__(self, handle_error):
            print '__init__(%s)' % handle_error
            self.handle_error = handle_error

        def __enter__(self):
            print '__enter__()'
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            print '__exit__()'
            print ' exc_type =', exc_type
            print ' exc_val =', exc_val
            print ' exc_tb =', exc_tb
            return self.handle_error

    with Context(True):
        raise RuntimeError('error message handled')

    print

    with Context(False):
        raise RuntimeError('error message propagated')

If the context manager can handle the exception, __exit__() should return a true value to indicate that the exception does not need to be propagated. Returning false causes the exception to be reraised after __exit__() returns.

    $ python contextlib_api_error.py

    __init__(True)
    __enter__()
    __exit__()
     exc_type =
     exc_val = error message handled
     exc_tb =

    __init__(False)
    __enter__()
    __exit__()
     exc_type =
     exc_val = error message propagated
     exc_tb =
    Traceback (most recent call last):
      File "contextlib_api_error.py", line 33, in
        raise RuntimeError('error message propagated')
    RuntimeError: error message propagated

3.4.2 From Generator to Context Manager

Creating context managers the traditional way, by writing a class with __enter__() and __exit__() methods, is not difficult. But sometimes, writing everything out fully is extra overhead for a trivial bit of context. In those sorts of situations, use the contextmanager() decorator to convert a generator function into a context manager.

    import contextlib

    @contextlib.contextmanager
    def make_context():
        print ' entering'
        try:
            yield {}
        except RuntimeError, err:
            print ' ERROR:', err
        finally:
            print ' exiting'

    print 'Normal:'
    with make_context() as value:
        print ' inside with statement:', value

    print '\nHandled error:'
    with make_context() as value:
        raise RuntimeError('showing example of handling an error')

    print '\nUnhandled error:'
    with make_context() as value:
        raise ValueError('this exception is not handled')

The generator should initialize the context, yield exactly one time, and then clean up the context. The value yielded, if any, is bound to the variable in the as clause of the with statement. Exceptions from within the with block are reraised inside the generator, so they can be handled there.

    $ python contextlib_contextmanager.py

    Normal:
     entering
     inside with statement: {}
     exiting

    Handled error:
     entering
     ERROR: showing example of handling an error
     exiting

    Unhandled error:
     entering
     exiting
    Traceback (most recent call last):
      File "contextlib_contextmanager.py", line 34, in
        raise ValueError('this exception is not handled')
    ValueError: this exception is not handled

3.4.3 Nesting Contexts

At times, it is necessary to manage multiple contexts simultaneously (such as when copying data between input and output file handles).
It is possible to nest with statements one inside another, but if the outer contexts do not need their own separate block, this adds to the indentation level without giving any real benefit. Using nested() nests the contexts using a single with statement.

    import contextlib

    @contextlib.contextmanager
    def make_context(name):
        print 'entering:', name
        yield name
        print 'exiting :', name

    with contextlib.nested(make_context('A'),
                           make_context('B')) as (A, B):
        print 'inside with statement:', A, B

Program execution leaves the contexts in the reverse order in which they are entered.

    $ python contextlib_nested.py

    entering: A
    entering: B
    inside with statement: A B
    exiting : B
    exiting : A

In Python 2.7 and later, nested() is deprecated because the with statement supports nesting directly.

    import contextlib

    @contextlib.contextmanager
    def make_context(name):
        print 'entering:', name
        yield name
        print 'exiting :', name

    with make_context('A') as A, make_context('B') as B:
        print 'inside with statement:', A, B

Each context manager and optional as clause are separated by a comma (,). The effect is similar to using nested(), but avoids some of the edge cases around error handling that nested() could not implement correctly.

    $ python contextlib_nested_with.py

    entering: A
    entering: B
    inside with statement: A B
    exiting : B
    exiting : A

3.4.4 Closing Open Handles

The file class supports the context manager API directly, but some other objects that represent open handles do not. The example given in the standard library documentation for contextlib is the object returned from urllib.urlopen(). There are other legacy classes that use a close() method but do not support the context manager API. To ensure that a handle is closed, use closing() to create a context manager for it.
    import contextlib

    class Door(object):
        def __init__(self):
            print ' __init__()'
        def close(self):
            print ' close()'

    print 'Normal Example:'
    with contextlib.closing(Door()) as door:
        print ' inside with statement'

    print '\nError handling example:'
    try:
        with contextlib.closing(Door()) as door:
            print ' raising from inside with statement'
            raise RuntimeError('error message')
    except Exception, err:
        print ' Had an error:', err

The handle is closed whether there is an error in the with block or not.

    $ python contextlib_closing.py

    Normal Example:
     __init__()
     inside with statement
     close()

    Error handling example:
     __init__()
     raising from inside with statement
     close()
     Had an error: error message

See Also:
contextlib (http://docs.python.org/library/contextlib.html) The standard library documentation for this module.
PEP 343 (http://www.python.org/dev/peps/pep-0343) The with statement.
Context Manager Types (http://docs.python.org/library/stdtypes.html#typecontextmanager) Description of the context manager API from the standard library documentation.
With Statement Context Managers (http://docs.python.org/reference/datamodel.html#context-managers) Description of the context manager API from the Python Reference Guide.

Chapter 4
DATES AND TIMES

Python does not include native types for dates and times as it does for int, float, and str, but there are three modules for manipulating date and time values in several representations.

• The time module exposes the time-related functions from the underlying C library. It includes functions for retrieving the clock time and the processor runtime, as well as basic parsing and string-formatting tools.
• The datetime module provides a higher-level interface for date, time, and combined values. The classes in datetime support arithmetic, comparison, and time zone configuration.
• The calendar module creates formatted representations of weeks, months, and years.
  It can also be used to compute recurring events, the day of the week for a given date, and other calendar-based values.

4.1 time—Clock Time

Purpose: Functions for manipulating clock time.
Python Version: 1.4 and later

The time module exposes C library functions for manipulating dates and times. Since it is tied to the underlying C implementation, some details (such as the start of the epoch and the maximum date value supported) are platform specific. Refer to the library documentation for complete details.

4.1.1 Wall Clock Time

One of the core functions of the time module is time(), which returns the number of seconds since the start of the epoch as a floating-point value.

    import time

    print 'The time is:', time.time()

Although the value is always a float, actual precision is platform dependent.

    $ python time_time.py

    The time is: 1291499267.33

The float representation is useful when storing or comparing dates, but it is not as useful for producing human-readable representations. For logging or printing time, ctime() can be more useful.

    import time

    print 'The time is :', time.ctime()
    later = time.time() + 15
    print '15 secs from now :', time.ctime(later)

The second print statement in this example shows how to use ctime() to format a time value other than the current time.

    $ python time_ctime.py

    The time is : Sat Dec 4 16:47:47 2010
    15 secs from now : Sat Dec 4 16:48:02 2010

4.1.2 Processor Clock Time

While time() returns a wall clock time, clock() returns processor clock time. The values returned from clock() should be used for performance testing, benchmarking, etc., since they reflect the actual time the program uses and can be more precise than the values from time().
    import hashlib
    import time

    # Data to use to calculate sha1 checksums
    data = open(__file__, 'rt').read()

    for i in range(5):
        h = hashlib.sha1()
        print time.ctime(), ': %0.3f %0.3f' % (time.time(), time.clock())
        for i in range(300000):
            h.update(data)
        cksum = h.digest()

In this example, the formatted ctime() is printed along with the floating-point values from time() and clock() for each iteration through the loop.

Note: If you want to run the example on your system, you may have to add more cycles to the inner loop or work with a larger amount of data to actually see a difference in the times.

    $ python time_clock.py

    Sat Dec 4 16:47:47 2010 : 1291499267.446 0.028
    Sat Dec 4 16:47:48 2010 : 1291499268.844 1.413
    Sat Dec 4 16:47:50 2010 : 1291499270.247 2.794
    Sat Dec 4 16:47:51 2010 : 1291499271.658 4.171
    Sat Dec 4 16:47:53 2010 : 1291499273.128 5.549

Typically, the processor clock does not tick if a program is not doing anything.

    import time

    for i in range(6, 1, -1):
        print '%s %0.2f %0.2f' % (time.ctime(),
                                  time.time(),
                                  time.clock())
        print 'Sleeping', i
        time.sleep(i)

In this example, the loop does very little work by going to sleep after each iteration. The time() value increases even while the application is asleep, but the clock() value does not.

    $ python time_clock_sleep.py

    Sat Dec 4 16:47:54 2010 1291499274.65 0.03
    Sleeping 6
    Sat Dec 4 16:48:00 2010 1291499280.65 0.03
    Sleeping 5
    Sat Dec 4 16:48:05 2010 1291499285.65 0.03
    Sleeping 4
    Sat Dec 4 16:48:09 2010 1291499289.66 0.03
    Sleeping 3
    Sat Dec 4 16:48:12 2010 1291499292.66 0.03
    Sleeping 2

Calling sleep() yields control from the current thread and asks it to wait for the system to wake it back up. If a program has only one thread, this effectively blocks the app and it does no work.

4.1.3 Time Components

Storing times as elapsed seconds is useful in some situations, but there are times when a program needs to have access to the individual fields of a date (year, month, etc.).
The time module defines struct_time for holding date and time values with components broken out so they are easy to access. Several functions work with struct_time values instead of floats.

    import time

    def show_struct(s):
        print ' tm_year :', s.tm_year
        print ' tm_mon :', s.tm_mon
        print ' tm_mday :', s.tm_mday
        print ' tm_hour :', s.tm_hour
        print ' tm_min :', s.tm_min
        print ' tm_sec :', s.tm_sec
        print ' tm_wday :', s.tm_wday
        print ' tm_yday :', s.tm_yday
        print ' tm_isdst:', s.tm_isdst

    print 'gmtime:'
    show_struct(time.gmtime())
    print '\nlocaltime:'
    show_struct(time.localtime())
    print '\nmktime:', time.mktime(time.localtime())

The gmtime() function returns the current time in UTC. localtime() returns the current time with the current time zone applied. mktime() takes a struct_time instance and converts it to the floating-point representation.

    $ python time_struct.py

    gmtime:
     tm_year : 2010
     tm_mon : 12
     tm_mday : 4
     tm_hour : 21
     tm_min : 48
     tm_sec : 14
     tm_wday : 5
     tm_yday : 338
     tm_isdst: 0

    localtime:
     tm_year : 2010
     tm_mon : 12
     tm_mday : 4
     tm_hour : 16
     tm_min : 48
     tm_sec : 14
     tm_wday : 5
     tm_yday : 338
     tm_isdst: 0

    mktime: 1291499294.0

4.1.4 Working with Time Zones

The functions for determining the current time depend on having the time zone set, either by the program or by using a default time zone set for the system. Changing the time zone does not change the actual time, just the way it is represented.

To change the time zone, set the environment variable TZ, and then call tzset(). The time zone can be specified with a lot of detail, right down to the start and stop times for daylight savings time. It is usually easier to use the time zone name and let the underlying libraries derive the other information, though.

This example program changes the time zone to a few different values and shows how the changes affect other settings in the time module.
    import time
    import os

    def show_zone_info():
        print ' TZ :', os.environ.get('TZ', '(not set)')
        print ' tzname:', time.tzname
        print ' Zone : %d (%d)' % (time.timezone,
                                   (time.timezone / 3600))
        print ' DST :', time.daylight
        print ' Time :', time.ctime()
        print

    print 'Default :'
    show_zone_info()

    ZONES = [ 'GMT',
              'Europe/Amsterdam',
              ]

    for zone in ZONES:
        os.environ['TZ'] = zone
        time.tzset()
        print zone, ':'
        show_zone_info()

The default time zone on the system used to prepare the examples is US/Eastern. The other zones in the example change the tzname, daylight flag, and timezone offset value.

    $ python time_timezone.py

    Default :
     TZ : (not set)
     tzname: ('EST', 'EDT')
     Zone : 18000 (5)
     DST : 1
     Time : Sat Dec 4 16:48:14 2010

    GMT :
     TZ : GMT
     tzname: ('GMT', 'GMT')
     Zone : 0 (0)
     DST : 0
     Time : Sat Dec 4 21:48:14 2010

    Europe/Amsterdam :
     TZ : Europe/Amsterdam
     tzname: ('CET', 'CEST')
     Zone : -3600 (-1)
     DST : 1
     Time : Sat Dec 4 22:48:15 2010

4.1.5 Parsing and Formatting Times

The two functions strptime() and strftime() convert between struct_time and string representations of time values. A long list of formatting instructions is available to support input and output in different styles. The complete list is documented in the library documentation for the time module.

This example converts the current time from a string to a struct_time instance and back to a string.

    import time

    def show_struct(s):
        print ' tm_year :', s.tm_year
        print ' tm_mon :', s.tm_mon
        print ' tm_mday :', s.tm_mday
        print ' tm_hour :', s.tm_hour
        print ' tm_min :', s.tm_min
        print ' tm_sec :', s.tm_sec
        print ' tm_wday :', s.tm_wday
        print ' tm_yday :', s.tm_yday
        print ' tm_isdst:', s.tm_isdst

    now = time.ctime()
    print 'Now:', now

    parsed = time.strptime(now)
    print '\nParsed:'
    show_struct(parsed)

    print '\nFormatted:', time.strftime("%a %b %d %H:%M:%S %Y", parsed)

The output string is not exactly like the input, since the day of the month is prefixed with a zero.
    $ python time_strptime.py

    Now: Sat Dec 4 16:48:14 2010

    Parsed:
     tm_year : 2010
     tm_mon : 12
     tm_mday : 4
     tm_hour : 16
     tm_min : 48
     tm_sec : 14
     tm_wday : 5
     tm_yday : 338
     tm_isdst: -1

    Formatted: Sat Dec 04 16:48:14 2010

See Also:
time (http://docs.python.org/lib/module-time.html) Standard library documentation for this module.
datetime (page 180) The datetime module includes other classes for doing calculations with dates and times.
calendar (page 191) Work with higher-level date functions to produce calendars or calculate recurring events.

4.2 datetime—Date and Time Value Manipulation

Purpose: The datetime module includes functions and classes for doing date and time parsing, formatting, and arithmetic.
Python Version: 2.3 and later

datetime contains functions and classes for working with dates and times, separately and together.

4.2.1 Times

Time values are represented with the time class. A time instance has attributes for hour, minute, second, and microsecond and can also include time zone information.

    import datetime

    t = datetime.time(1, 2, 3)
    print t
    print 'hour :', t.hour
    print 'minute :', t.minute
    print 'second :', t.second
    print 'microsecond:', t.microsecond
    print 'tzinfo :', t.tzinfo

The arguments to initialize a time instance are optional, but the default of 0 is unlikely to be correct.

    $ python datetime_time.py

    01:02:03
    hour : 1
    minute : 2
    second : 3
    microsecond: 0
    tzinfo : None

A time instance only holds values of time, and not a date associated with the time.

    import datetime

    print 'Earliest :', datetime.time.min
    print 'Latest :', datetime.time.max
    print 'Resolution:', datetime.time.resolution

The min and max class attributes reflect the valid range of times in a single day.

    $ python datetime_time_minmax.py

    Earliest : 00:00:00
    Latest : 23:59:59.999999
    Resolution: 0:00:00.000001

The resolution for time is limited to whole microseconds.
    import datetime

    for m in [ 1, 0, 0.1, 0.6 ]:
        try:
            print '%02.1f :' % m, datetime.time(0, 0, 0, microsecond=m)
        except TypeError, err:
            print 'ERROR:', err

The way floating-point values are treated depends on the version of Python. Version 2.7 raises a TypeError, while earlier versions produce a DeprecationWarning and convert the floating-point number to an integer.

    $ python2.7 datetime_time_resolution.py

    1.0 : 00:00:00.000001
    0.0 : 00:00:00
    0.1 : ERROR: integer argument expected, got float
    0.6 : ERROR: integer argument expected, got float

    $ python2.6 datetime_time_resolution.py

    1.0 : 00:00:00.000001
    0.0 : 00:00:00
    datetime_time_resolution.py:16: DeprecationWarning: integer argument expected, got float
      print '%02.1f :' % m, datetime.time(0, 0, 0, microsecond=m)
    0.1 : 00:00:00
    0.6 : 00:00:00

4.2.2 Dates

Calendar date values are represented with the date class. Instances have attributes for year, month, and day. It is easy to create a date representing the current date using the today() class method.

    import datetime

    today = datetime.date.today()
    print today
    print 'ctime :', today.ctime()
    tt = today.timetuple()
    print 'tuple : tm_year =', tt.tm_year
    print ' tm_mon =', tt.tm_mon
    print ' tm_mday =', tt.tm_mday
    print ' tm_hour =', tt.tm_hour
    print ' tm_min =', tt.tm_min
    print ' tm_sec =', tt.tm_sec
    print ' tm_wday =', tt.tm_wday
    print ' tm_yday =', tt.tm_yday
    print ' tm_isdst =', tt.tm_isdst
    print 'ordinal:', today.toordinal()
    print 'Year :', today.year
    print 'Mon :', today.month
    print 'Day :', today.day

This example prints the current date in several formats.
    $ python datetime_date.py

    2010-11-27
    ctime : Sat Nov 27 00:00:00 2010
    tuple : tm_year = 2010
     tm_mon = 11
     tm_mday = 27
     tm_hour = 0
     tm_min = 0
     tm_sec = 0
     tm_wday = 5
     tm_yday = 331
     tm_isdst = -1
    ordinal: 734103
    Year : 2010
    Mon : 11
    Day : 27

There are also class methods for creating instances from POSIX timestamps or integers representing date values from the Gregorian calendar, where January 1 of the year 1 is 1 and each subsequent day increments the value by 1.

    import datetime
    import time

    o = 733114
    print 'o :', o
    print 'fromordinal(o) :', datetime.date.fromordinal(o)

    t = time.time()
    print 't :', t
    print 'fromtimestamp(t):', datetime.date.fromtimestamp(t)

This example illustrates the different value types used by fromordinal() and fromtimestamp().

    $ python datetime_date_fromordinal.py

    o : 733114
    fromordinal(o) : 2008-03-13
    t : 1290874810.14
    fromtimestamp(t): 2010-11-27

As with time, the range of date values supported can be determined using the min and max attributes.

    import datetime

    print 'Earliest :', datetime.date.min
    print 'Latest :', datetime.date.max
    print 'Resolution:', datetime.date.resolution

The resolution for dates is whole days.

    $ python datetime_date_minmax.py

    Earliest : 0001-01-01
    Latest : 9999-12-31
    Resolution: 1 day, 0:00:00

Another way to create new date instances uses the replace() method of an existing date.

    import datetime

    d1 = datetime.date(2008, 3, 29)
    print 'd1:', d1.ctime()

    d2 = d1.replace(year=2009)
    print 'd2:', d2.ctime()

This example changes the year, leaving the day and month unmodified.

    $ python datetime_date_replace.py

    d1: Sat Mar 29 00:00:00 2008
    d2: Sun Mar 29 00:00:00 2009

4.2.3 timedeltas

Future and past dates can be calculated using basic arithmetic on two datetime objects, or by combining a datetime with a timedelta. Subtracting dates produces a timedelta, and a timedelta can be added or subtracted from a date to produce another date.
The internal values for a timedelta are stored in days, seconds, and microseconds.

    import datetime

    print "microseconds:", datetime.timedelta(microseconds=1)
    print "milliseconds:", datetime.timedelta(milliseconds=1)
    print "seconds :", datetime.timedelta(seconds=1)
    print "minutes :", datetime.timedelta(minutes=1)
    print "hours :", datetime.timedelta(hours=1)
    print "days :", datetime.timedelta(days=1)
    print "weeks :", datetime.timedelta(weeks=1)

Intermediate level values passed to the constructor are converted into days, seconds, and microseconds.

    $ python datetime_timedelta.py

    microseconds: 0:00:00.000001
    milliseconds: 0:00:00.001000
    seconds : 0:00:01
    minutes : 0:01:00
    hours : 1:00:00
    days : 1 day, 0:00:00
    weeks : 7 days, 0:00:00

The full duration of a timedelta can be retrieved as a number of seconds using total_seconds().

    import datetime

    for delta in [datetime.timedelta(microseconds=1),
                  datetime.timedelta(milliseconds=1),
                  datetime.timedelta(seconds=1),
                  datetime.timedelta(minutes=1),
                  datetime.timedelta(hours=1),
                  datetime.timedelta(days=1),
                  datetime.timedelta(weeks=1),
                  ]:
        print '%15s = %s seconds' % (delta, delta.total_seconds())

The return value is a floating-point number, to accommodate subsecond durations.

    $ python datetime_timedelta_total_seconds.py

     0:00:00.000001 = 1e-06 seconds
     0:00:00.001000 = 0.001 seconds
            0:00:01 = 1.0 seconds
            0:01:00 = 60.0 seconds
            1:00:00 = 3600.0 seconds
     1 day, 0:00:00 = 86400.0 seconds
    7 days, 0:00:00 = 604800.0 seconds

4.2.4 Date Arithmetic

Date math uses the standard arithmetic operators.

    import datetime

    today = datetime.date.today()
    print 'Today :', today

    one_day = datetime.timedelta(days=1)
    print 'One day :', one_day

    yesterday = today - one_day
    print 'Yesterday:', yesterday
    tomorrow = today + one_day
    print 'Tomorrow :', tomorrow

    print
    print 'tomorrow - yesterday:', tomorrow - yesterday
    print 'yesterday - tomorrow:', yesterday - tomorrow

This example with date objects illustrates using timedelta objects to compute new dates, and subtracting date instances to produce timedeltas (including a negative delta value).

    $ python datetime_date_math.py

    Today : 2010-11-27
    One day : 1 day, 0:00:00
    Yesterday: 2010-11-26
    Tomorrow : 2010-11-28

    tomorrow - yesterday: 2 days, 0:00:00
    yesterday - tomorrow: -2 days, 0:00:00

4.2.5 Comparing Values

Both date and time values can be compared using the standard comparison operators to determine which is earlier or later.

    import datetime
    import time

    print 'Times:'
    t1 = datetime.time(12, 55, 0)
    print ' t1:', t1
    t2 = datetime.time(13, 5, 0)
    print ' t2:', t2
    print ' t1 < t2:', t1 < t2

    print
    print 'Dates:'
    d1 = datetime.date.today()
    print ' d1:', d1
    d2 = datetime.date.today() + datetime.timedelta(days=1)
    print ' d2:', d2
    print ' d1 > d2:', d1 > d2

All comparison operators are supported.

    $ python datetime_comparing.py

    Times:
     t1: 12:55:00
     t2: 13:05:00
     t1 < t2: True

    Dates:
     d1: 2010-11-27
     d2: 2010-11-28
     d1 > d2: False

4.2.6 Combining Dates and Times

Use the datetime class to hold values consisting of both date and time components. As with date, there are several convenient class methods to create datetime instances from other common values.

    import datetime

    print 'Now :', datetime.datetime.now()
    print 'Today :', datetime.datetime.today()
    print 'UTC Now:', datetime.datetime.utcnow()
    print

    FIELDS = [ 'year', 'month', 'day',
               'hour', 'minute', 'second', 'microsecond',
               ]

    d = datetime.datetime.now()
    for attr in FIELDS:
        print '%15s: %s' % (attr, getattr(d, attr))

As might be expected, the datetime instance has all attributes of both a date and a time object.
    $ python datetime_datetime.py

    Now : 2010-11-27 11:20:10.479880
    Today : 2010-11-27 11:20:10.481494
    UTC Now: 2010-11-27 16:20:10.481521

               year: 2010
              month: 11
                day: 27
               hour: 11
             minute: 20
             second: 10
        microsecond: 481752

Just as with date, datetime provides convenient class methods for creating new instances. It also includes fromordinal() and fromtimestamp().

    import datetime

    t = datetime.time(1, 2, 3)
    print 't :', t

    d = datetime.date.today()
    print 'd :', d

    dt = datetime.datetime.combine(d, t)
    print 'dt:', dt

combine() creates datetime instances from one date and one time instance.

    $ python datetime_datetime_combine.py

    t : 01:02:03
    d : 2010-11-27
    dt: 2010-11-27 01:02:03

4.2.7 Formatting and Parsing

The default string representation of a datetime object uses the ISO-8601 format (YYYY-MM-DDTHH:MM:SS.mmmmmm). Alternate formats can be generated using strftime().

    import datetime

    format = "%a %b %d %H:%M:%S %Y"

    today = datetime.datetime.today()
    print 'ISO :', today

    s = today.strftime(format)
    print 'strftime:', s

    d = datetime.datetime.strptime(s, format)
    print 'strptime:', d.strftime(format)

Use datetime.strptime() to convert formatted strings to datetime instances.

    $ python datetime_datetime_strptime.py

    ISO : 2010-11-27 11:20:10.571582
    strftime: Sat Nov 27 11:20:10 2010
    strptime: Sat Nov 27 11:20:10 2010

4.2.8 Time Zones

Within datetime, time zones are represented by subclasses of tzinfo. Since tzinfo is an abstract base class, applications need to define a subclass and provide appropriate implementations for a few methods to make it useful. Unfortunately, datetime does not include any actual ready-to-use implementations, although the documentation does provide a few sample implementations. Refer to the standard library documentation page for examples using fixed offsets, as well as a DST-aware class and more details about creating custom time zone classes.
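To make the idea of a tzinfo subclass concrete, here is a minimal sketch, assuming a constant offset from UTC and no daylight savings time rules. The class name FixedOffset and its constructor arguments are illustrative and not part of datetime; utcoffset(), dst(), and tzname() are the methods a tzinfo subclass is expected to provide. The block is written to run on Python 3.

```python
import datetime


class FixedOffset(datetime.tzinfo):
    """Hypothetical time zone with a constant offset from UTC."""

    def __init__(self, minutes, name):
        self._offset = datetime.timedelta(minutes=minutes)
        self._name = name

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        # This sketch applies no daylight savings time adjustment.
        return datetime.timedelta(0)

    def tzname(self, dt):
        return self._name


utc = FixedOffset(0, 'UTC')
eastern = FixedOffset(-5 * 60, 'EST')

# Attach a zone when creating the value, then convert to another zone.
noon_utc = datetime.datetime(2010, 11, 27, 12, 0, 0, tzinfo=utc)
noon_eastern = noon_utc.astimezone(eastern)
```

The two values describe the same instant, so they compare as equal even though their hour attributes differ. A real application would also need DST rules, which this sketch leaves out.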
pytz is also a good source for time zone implementation details.

See Also:
datetime (http://docs.python.org/lib/module-datetime.html) The standard library documentation for this module.
calendar (page 191) The calendar module.
time (page 173) The time module.
dateutil (http://labix.org/python-dateutil) dateutil from Labix extends the datetime module with additional features.
WikiPedia: Proleptic Gregorian calendar (http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar) A description of the Gregorian calendar system.
pytz (http://pytz.sourceforge.net/) World Time Zone database.
ISO 8601 (http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/date_and_time_format.htm) The standard for numeric representation of dates and time.

4.3 calendar—Work with Dates

Purpose: The calendar module implements classes for working with dates to manage year-, month-, and week-oriented values.
Python Version: 1.4, with updates in 2.5

The calendar module defines the Calendar class, which encapsulates calculations for values such as the dates of the weeks in a given month or year. In addition, the TextCalendar and HTMLCalendar classes can produce preformatted output.

4.3.1 Formatting Examples

The prmonth() method is a simple function that produces the formatted text output for a month.

    import calendar

    c = calendar.TextCalendar(calendar.SUNDAY)
    c.prmonth(2011, 7)

The example configures TextCalendar to start weeks on Sunday, following the American convention. The default is to use the European convention of starting a week on Monday.

Here is what the output looks like.

    $ python calendar_textcalendar.py

         July 2011
    Su Mo Tu We Th Fr Sa
                    1  2
     3  4  5  6  7  8  9
    10 11 12 13 14 15 16
    17 18 19 20 21 22 23
    24 25 26 27 28 29 30
    31

A similar HTML table can be produced with HTMLCalendar and formatmonth(). The rendered output looks roughly the same as the plain-text version, but is wrapped with HTML tags.
Each table cell has a class attribute corresponding to the day of the week so the HTML can be styled through CSS.

To produce output in a format other than one of the available defaults, use calendar to calculate the dates and organize the values into week and month ranges, and then iterate over the result. The weekheader(), monthcalendar(), and yeardays2calendar() methods of Calendar are especially useful for that.

Calling yeardays2calendar() produces a sequence of "month row" lists. Each list includes the months as another list of weeks. The weeks are lists of tuples made up of day number (1–31) and weekday number (0–6). Days that fall outside of the month have a day number of 0.

    import calendar
    import pprint

    cal = calendar.Calendar(calendar.SUNDAY)

    cal_data = cal.yeardays2calendar(2011, 3)
    print 'len(cal_data) :', len(cal_data)

    top_months = cal_data[0]
    print 'len(top_months) :', len(top_months)

    first_month = top_months[0]
    print 'len(first_month) :', len(first_month)

    print 'first_month:'
    pprint.pprint(first_month)

Calling yeardays2calendar(2011, 3) returns data for 2011, organized with three months per row.

    $ python calendar_yeardays2calendar.py

    len(cal_data) : 4
    len(top_months) : 3
    len(first_month) : 6
    first_month:
    [[(0, 6), (0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 5)],
     [(2, 6), (3, 0), (4, 1), (5, 2), (6, 3), (7, 4), (8, 5)],
     [(9, 6), (10, 0), (11, 1), (12, 2), (13, 3), (14, 4), (15, 5)],
     [(16, 6), (17, 0), (18, 1), (19, 2), (20, 3), (21, 4), (22, 5)],
     [(23, 6), (24, 0), (25, 1), (26, 2), (27, 3), (28, 4), (29, 5)],
     [(30, 6), (31, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]]

This is equivalent to the data used by formatyear().

    import calendar

    cal = calendar.TextCalendar(calendar.SUNDAY)
    print cal.formatyear(2011, 2, 1, 1, 3)

For the same arguments, formatyear() produces this output.
    $ python calendar_formatyear.py

                                 2011

          January              February               March
    Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
                       1        1  2  3  4  5        1  2  3  4  5
     2  3  4  5  6  7  8  6  7  8  9 10 11 12  6  7  8  9 10 11 12
     9 10 11 12 13 14 15 13 14 15 16 17 18 19 13 14 15 16 17 18 19
    16 17 18 19 20 21 22 20 21 22 23 24 25 26 20 21 22 23 24 25 26
    23 24 25 26 27 28 29 27 28                27 28 29 30 31
    30 31

           April                 May                  June
    Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
                    1  2  1  2  3  4  5  6  7           1  2  3  4
     3  4  5  6  7  8  9  8  9 10 11 12 13 14  5  6  7  8  9 10 11
    10 11 12 13 14 15 16 15 16 17 18 19 20 21 12 13 14 15 16 17 18
    17 18 19 20 21 22 23 22 23 24 25 26 27 28 19 20 21 22 23 24 25
    24 25 26 27 28 29 30 29 30 31             26 27 28 29 30

            July                August             September
    Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
                    1  2     1  2  3  4  5  6              1  2  3
     3  4  5  6  7  8  9  7  8  9 10 11 12 13  4  5  6  7  8  9 10
    10 11 12 13 14 15 16 14 15 16 17 18 19 20 11 12 13 14 15 16 17
    17 18 19 20 21 22 23 21 22 23 24 25 26 27 18 19 20 21 22 23 24
    24 25 26 27 28 29 30 28 29 30 31          25 26 27 28 29 30
    31

          October              November             December
    Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
                       1        1  2  3  4  5              1  2  3
     2  3  4  5  6  7  8  6  7  8  9 10 11 12  4  5  6  7  8  9 10
     9 10 11 12 13 14 15 13 14 15 16 17 18 19 11 12 13 14 15 16 17
    16 17 18 19 20 21 22 20 21 22 23 24 25 26 18 19 20 21 22 23 24
    23 24 25 26 27 28 29 27 28 29 30          25 26 27 28 29 30 31
    30 31

The day_name, day_abbr, month_name, and month_abbr module attributes are useful for producing custom-formatted output (e.g., to include links in the HTML output). They are automatically configured correctly for the current locale.

4.3.2 Calculating Dates

Although the calendar module focuses mostly on printing full calendars in various formats, it also provides functions useful for working with dates in other ways, such as calculating dates for a recurring event. For example, the Python Atlanta Users Group meets on the second Thursday of every month.
To calculate the meeting dates for a year, use the return value of monthcalendar().

    import calendar
    import pprint

    pprint.pprint(calendar.monthcalendar(2011, 7))

Some days have a 0 value. Those are days of the week that overlap with the given month, but that are part of another month.

    $ python calendar_monthcalendar.py

    [[0, 0, 0, 0, 1, 2, 3],
     [4, 5, 6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15, 16, 17],
     [18, 19, 20, 21, 22, 23, 24],
     [25, 26, 27, 28, 29, 30, 31]]

The first day of the week defaults to Monday. It is possible to change that setting by calling setfirstweekday(), but since the calendar module includes constants for indexing into the date ranges returned by monthcalendar(), it is more convenient to skip that step in this case.

To calculate the group meeting dates for 2011, assuming the second Thursday of every month, use the 0 values to tell whether the Thursday of the first week is included in the month (or whether the month starts later in the week, for example, on a Friday).

    import calendar

    # Show every month
    for month in range(1, 13):

        # Compute the dates for each week that overlaps the month
        c = calendar.monthcalendar(2011, month)
        first_week = c[0]
        second_week = c[1]
        third_week = c[2]

        # If there is a Thursday in the first week, the second Thursday
        # is in the second week. Otherwise, the second Thursday must
        # be in the third week.
        if first_week[calendar.THURSDAY]:
            meeting_date = second_week[calendar.THURSDAY]
        else:
            meeting_date = third_week[calendar.THURSDAY]

        print '%3s: %2s' % (calendar.month_abbr[month], meeting_date)

So, the meeting schedule for the year is:

    $ python calendar_secondthursday.py

    Jan: 13
    Feb: 10
    Mar: 10
    Apr: 14
    May: 12
    Jun:  9
    Jul: 14
    Aug: 11
    Sep:  8
    Oct: 13
    Nov: 10
    Dec:  8

See Also:
calendar (http://docs.python.org/library/calendar.html) The standard library documentation for this module.
time (page 173) Lower-level time functions.
datetime (page 180) Manipulate date values, including timestamps and time zones.
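The second-Thursday calculation from Section 4.3.2 can be generalized to any weekday and any occurrence. The sketch below is written to run on Python 3 and assumes a helper named nth_weekday(), which is a hypothetical name and not part of the calendar module. It skips the 0 placeholders returned by monthcalendar() and relies on the default first weekday (Monday), so the calendar constants can be used directly as column indexes.

```python
import calendar


def nth_weekday(year, month, weekday, n):
    """Return the day of the month for the nth given weekday (n is 1-based).

    weekday uses the calendar constants (calendar.MONDAY == 0).
    Assumes calendar.setfirstweekday() has not been changed from Monday.
    Raises IndexError if the month has fewer than n such days.
    """
    weeks = calendar.monthcalendar(year, month)
    # Keep only the weeks in which this weekday falls inside the month.
    days = [week[weekday] for week in weeks if week[weekday] != 0]
    return days[n - 1]


# Second Thursday of each month in 2011.
meetings = [nth_weekday(2011, m, calendar.THURSDAY, 2) for m in range(1, 13)]
```

Filtering out the zeros first avoids the special case for months whose first week has no Thursday, at the cost of building a short list for every call.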
Chapter 5
MATHEMATICS

As a general-purpose programming language, Python is frequently used to solve mathematical problems. It includes built-in types for managing integers and floating-point numbers, which are suitable for the basic math that might appear in an average application. The standard library includes modules for more advanced needs.

Python's built-in floating-point numbers use the underlying double representation. They are sufficiently precise for most programs with mathematical requirements, but when more accurate representations of noninteger values are needed, the decimal and fractions modules will be useful. Arithmetic with decimal and fractional values retains precision, but it is not as fast as the native float.

The random module includes a uniform distribution pseudorandom number generator, as well as functions for simulating many common nonuniform distributions.

The math module contains fast implementations of advanced mathematical functions, such as logarithms and trigonometric functions. The full complement of IEEE functions usually found in the native platform C libraries is available through the module.

5.1 decimal—Fixed and Floating-Point Math

Purpose: Decimal arithmetic using fixed and floating-point numbers.
Python Version: 2.4 and later

The decimal module implements fixed and floating-point arithmetic using the model familiar to most people, rather than the IEEE floating-point version implemented by most computer hardware and familiar to programmers. A Decimal instance can represent any number exactly, round it up or down, and apply a limit to the number of significant digits.

5.1.1 Decimal

Decimal values are represented as instances of the Decimal class. The constructor takes as argument one integer or string.
Floating-point numbers can be converted to a string before being used to create a Decimal, letting the caller explicitly deal with the number of digits for values that cannot be expressed exactly using hardware floating-point representations. Alternately, the class method from_float() converts to the exact decimal representation.

    import decimal

    fmt = '{0:<25} {1:<25}'
    print fmt.format('Input', 'Output')
    print fmt.format('-' * 25, '-' * 25)

    # Integer
    print fmt.format(5, decimal.Decimal(5))

    # String
    print fmt.format('3.14', decimal.Decimal('3.14'))

    # Float
    f = 0.1
    print fmt.format(repr(f), decimal.Decimal(str(f)))
    print fmt.format('%.23g' % f,
                     str(decimal.Decimal.from_float(f))[:25])

The floating-point value of 0.1 is not represented as an exact value in binary, so the representation as a float is different from the Decimal value. It is truncated to 25 characters in this output.

    $ python decimal_create.py

    Input                     Output
    ------------------------- -------------------------
    5                         5
    3.14                      3.14
    0.1                       0.1
    0.10000000000000000555112 0.10000000000000000555111

Decimals can also be created from tuples containing a sign flag (0 for positive, 1 for negative), a tuple of digits, and an integer exponent.

    import decimal

    # Tuple
    t = (1, (1, 1), -2)
    print 'Input :', t
    print 'Decimal:', decimal.Decimal(t)

The tuple-based representation is less convenient to create, but it does offer a portable way of exporting decimal values without losing precision. The tuple form can be transmitted through the network or stored in a database that does not support accurate decimal values, and then turned back into a Decimal instance later.

    $ python decimal_tuple.py

    Input : (1, (1, 1), -2)
    Decimal: -0.11

5.1.2 Arithmetic

Decimal overloads the simple arithmetic operators so instances can be manipulated in much the same way as the built-in numeric types.
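As a small illustration of why the overloaded operators are useful, the sketch below contrasts Decimal arithmetic with the same sum in binary floating point. It is written to run on Python 3, and the variable names are chosen only for this example.

```python
import decimal

a = decimal.Decimal('0.1')
b = decimal.Decimal('0.2')

decimal_sum = a + b      # exact decimal arithmetic
float_sum = 0.1 + 0.2    # binary floating point, for comparison
scaled = a * 3           # integers can be mixed in directly

try:
    a + 0.2              # floats are rejected by the arithmetic operators
    mixed_float_ok = True
except TypeError:
    mixed_float_ok = False
```

The Decimal sum is exactly 0.3, while the float sum differs from 0.3 by a small rounding error; mixing a float into the expression raises a TypeError instead of silently losing precision.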
    import decimal

    a = decimal.Decimal('5.1')
    b = decimal.Decimal('3.14')
    c = 4
    d = 3.14

    print 'a =', repr(a)
    print 'b =', repr(b)
    print 'c =', repr(c)
    print 'd =', repr(d)
    print

    print 'a + b =', a + b
    print 'a - b =', a - b
    print 'a * b =', a * b
    print 'a / b =', a / b
    print

    print 'a + c =', a + c
    print 'a - c =', a - c
    print 'a * c =', a * c
    print 'a / c =', a / c
    print

    print 'a + d =',
    try:
        print a + d
    except TypeError, e:
        print e

Decimal operators also accept integer arguments, but floating-point values must be converted to Decimal instances.

    $ python decimal_operators.py

    a = Decimal('5.1')
    b = Decimal('3.14')
    c = 4
    d = 3.14

    a + b = 8.24
    a - b = 1.96
    a * b = 16.014
    a / b = 1.624203821656050955414012739

    a + c = 9.1
    a - c = 1.1
    a * c = 20.4
    a / c = 1.275

    a + d = unsupported operand type(s) for +: 'Decimal' and 'float'

Beyond basic arithmetic, Decimal includes methods for finding the base 10 and natural logarithms. The return values from log10() and ln() are Decimal instances, so they can be used directly in formulas with other values.

5.1.3 Special Values

In addition to the expected numerical values, Decimal can represent several special values, including positive and negative values for infinity, “not a number,” and zero.

    import decimal

    for value in [ 'Infinity', 'NaN', '0' ]:
        print decimal.Decimal(value), decimal.Decimal('-' + value)
    print

    # Math with infinity
    print 'Infinity + 1:', (decimal.Decimal('Infinity') + 1)
    print '-Infinity + 1:', (decimal.Decimal('-Infinity') + 1)

    # Print comparing NaN
    print decimal.Decimal('NaN') == decimal.Decimal('Infinity')
    print decimal.Decimal('NaN') != decimal.Decimal(1)

Adding to infinite values returns another infinite value. Comparing for equality with NaN always returns false, and comparing for inequality always returns true. Comparing for sort order against NaN is undefined and results in an error.
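The error raised by an ordering comparison is not shown in the listing above. As a minimal sketch (the variable names are illustrative, and the default context, which traps InvalidOperation, is assumed), the difference between equality tests and ordering tests might look like this. It uses single-argument print() calls so that it also runs on Python 3.

```python
import decimal

nan = decimal.Decimal('NaN')
one = decimal.Decimal(1)

# Equality tests involving NaN never raise; they just report "not equal".
equal = (nan == one)
not_equal = (nan != one)

# An ordering comparison signals InvalidOperation. The default context
# traps that signal, so it surfaces as an exception.
try:
    nan < one
    ordering_error = None
except decimal.InvalidOperation as err:
    ordering_error = err

print('NaN == 1 : {0}'.format(equal))
print('NaN != 1 : {0}'.format(not_equal))
print('NaN < 1 raised InvalidOperation: {0}'.format(
    ordering_error is not None))
```

Code that sorts Decimal values therefore needs to filter out NaN first, for example with the is_nan() method.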
    $ python decimal_special.py

    Infinity -Infinity
    NaN -NaN
    0 -0

    Infinity + 1: Infinity
    -Infinity + 1: -Infinity
    False
    True

5.1.4 Context

So far, the examples all have used the default behaviors of the decimal module. It is possible to override settings such as the precision maintained, how rounding is performed, error handling, etc., by using a context. Contexts can be applied for all Decimal instances in a thread or locally within a small code region.

Current Context

To retrieve the current global context, use getcontext().

    import decimal
    import pprint

    context = decimal.getcontext()

    print 'Emax =', context.Emax
    print 'Emin =', context.Emin
    print 'capitals =', context.capitals
    print 'prec =', context.prec
    print 'rounding =', context.rounding
    print 'flags ='
    pprint.pprint(context.flags)
    print 'traps ='
    pprint.pprint(context.traps)

This example script shows the public properties of a Context.

    $ python decimal_getcontext.py

    Emax = 999999999
    Emin = -999999999
    capitals = 1
    prec = 28
    rounding = ROUND_HALF_EVEN
    flags =
    {<class 'decimal.Clamped'>: 0,
     <class 'decimal.InvalidOperation'>: 0,
     <class 'decimal.DivisionByZero'>: 0,
     <class 'decimal.Inexact'>: 0,
     <class 'decimal.Rounded'>: 0,
     <class 'decimal.Subnormal'>: 0,
     <class 'decimal.Overflow'>: 0,
     <class 'decimal.Underflow'>: 0}
    traps =
    {<class 'decimal.Clamped'>: 0,
     <class 'decimal.InvalidOperation'>: 1,
     <class 'decimal.DivisionByZero'>: 1,
     <class 'decimal.Inexact'>: 0,
     <class 'decimal.Rounded'>: 0,
     <class 'decimal.Subnormal'>: 0,
     <class 'decimal.Overflow'>: 1,
     <class 'decimal.Underflow'>: 0}

Precision

The prec attribute of the context controls the precision maintained for new values created as a result of arithmetic. Literal values are maintained as described.

    import decimal

    d = decimal.Decimal('0.123456')

    for i in range(4):
        decimal.getcontext().prec = i
        print i, ':', d, d * 1

To change the precision, assign a new value directly to the attribute.

    $ python decimal_precision.py

    0 : 0.123456 0
    1 : 0.123456 0.1
    2 : 0.123456 0.12
    3 : 0.123456 0.123

Rounding

There are several options for rounding to keep values within the desired precision.

ROUND_CEILING Always round upward toward infinity.
ROUND_DOWN Always round toward zero.
ROUND_FLOOR Always round down toward negative infinity.
ROUND_HALF_DOWN Round to the nearest value. Round away from zero if the discarded digits are greater than one half of the last retained place; otherwise (including an exact tie, where the last significant digit is 5), round toward zero.
ROUND_HALF_EVEN Like ROUND_HALF_DOWN, except that if the value is 5, then the preceding digit is examined. Even values cause the result to be rounded down, and odd digits cause the result to be rounded up.
ROUND_HALF_UP Like ROUND_HALF_DOWN, except if the last significant digit is 5, the value is rounded away from zero.
ROUND_UP Round away from zero.
ROUND_05UP Round away from zero if the last digit is 0 or 5; otherwise, round toward zero.

    import decimal

    context = decimal.getcontext()

    ROUNDING_MODES = [
        'ROUND_CEILING',
        'ROUND_DOWN',
        'ROUND_FLOOR',
        'ROUND_HALF_DOWN',
        'ROUND_HALF_EVEN',
        'ROUND_HALF_UP',
        'ROUND_UP',
        'ROUND_05UP',
        ]

    header_fmt = '{:10} ' + ''.join(['{:^8}'] * 6)

    print header_fmt.format('',
                            '1/8 (1)', '-1/8 (1)',
                            '1/8 (2)', '-1/8 (2)',
                            '1/8 (3)', '-1/8 (3)',
                            )
    for rounding_mode in ROUNDING_MODES:
        print '{0:10}'.format(rounding_mode.partition('_')[-1]),
        for precision in [ 1, 2, 3 ]:
            context.prec = precision
            context.rounding = getattr(decimal, rounding_mode)
            value = decimal.Decimal(1) / decimal.Decimal(8)
            print '{0:^8}'.format(value),
            value = decimal.Decimal(-1) / decimal.Decimal(8)
            print '{0:^8}'.format(value),
        print

This program shows the effect of rounding the same value to different levels of precision using the different algorithms.

    $ python decimal_rounding.py

                1/8 (1)  -1/8 (1)  1/8 (2)  -1/8 (2)  1/8 (3)  -1/8 (3)
    CEILING       0.2      -0.1     0.13     -0.12     0.125    -0.125
    DOWN          0.1      -0.1     0.12     -0.12     0.125    -0.125
    FLOOR         0.1      -0.2     0.12     -0.13     0.125    -0.125
    HALF_DOWN     0.1      -0.1     0.12     -0.12     0.125    -0.125
    HALF_EVEN     0.1      -0.1     0.12     -0.12     0.125    -0.125
    HALF_UP       0.1      -0.1     0.13     -0.13     0.125    -0.125
    UP            0.2      -0.2     0.13     -0.13     0.125    -0.125
    05UP          0.1      -0.1     0.12     -0.12     0.125    -0.125

Local Context

Using Python 2.5 or later, the context can be applied to a block of code using the with statement.
    import decimal

    with decimal.localcontext() as c:
        c.prec = 2
        print 'Local precision:', c.prec
        print '3.14 / 3 =', (decimal.Decimal('3.14') / 3)

    print
    print 'Default precision:', decimal.getcontext().prec
    print '3.14 / 3 =', (decimal.Decimal('3.14') / 3)

The Context supports the context manager API used by with, so the settings only apply within the block.

    $ python decimal_context_manager.py

    Local precision: 2
    3.14 / 3 = 1.0

    Default precision: 28
    3.14 / 3 = 1.046666666666666666666666667

Per-Instance Context

Contexts also can be used to construct Decimal instances, which then inherit from the context the precision and rounding arguments to the conversion.

    import decimal

    # Set up a context with limited precision
    c = decimal.getcontext().copy()
    c.prec = 3

    # Create our constant
    pi = c.create_decimal('3.1415')

    # The constant value is rounded off
    print 'PI :', pi

    # The result of using the constant uses the global context
    print 'RESULT:', decimal.Decimal('2.01') * pi

This lets an application select the precision of constant values separately from the precision of user data, for example.

    $ python decimal_instance_context.py

    PI : 3.14
    RESULT: 6.3114

Threads

The “global” context is actually thread-local, so each thread can potentially be configured using different values.

    import decimal
    import threading
    from Queue import PriorityQueue

    class Multiplier(threading.Thread):
        def __init__(self, a, b, prec, q):
            self.a = a
            self.b = b
            self.prec = prec
            self.q = q
            threading.Thread.__init__(self)
        def run(self):
            c = decimal.getcontext().copy()
            c.prec = self.prec
            decimal.setcontext(c)
            # Use the values given to this thread, not the globals
            self.q.put( (self.prec, self.a * self.b) )
            return

    a = decimal.Decimal('3.14')
    b = decimal.Decimal('1.234')
    # A PriorityQueue will return values sorted by precision, no matter
    # what order the threads finish.
    q = PriorityQueue()
    threads = [ Multiplier(a, b, i, q) for i in range(1, 6) ]
    for t in threads:
        t.start()

    for t in threads:
        t.join()

    for i in range(5):
        prec, value = q.get()
        print prec, '\t', value

This example creates a new context using the specified value, and then installs it within each thread.

    $ python decimal_thread_context.py

    1 	4
    2 	3.9
    3 	3.87
    4 	3.875
    5 	3.8748

See Also:

decimal (http://docs.python.org/library/decimal.html) The standard library documentation for this module.
Floating Point (http://en.wikipedia.org/wiki/Floating_point) Wikipedia article on floating-point representations and arithmetic.
Floating Point Arithmetic: Issues and Limitations (http://docs.python.org/tutorial/floatingpoint.html) Article from the Python tutorial describing floating-point math representation issues.

5.2 fractions—Rational Numbers

Purpose: Implements a class for working with rational numbers.
Python Version: 2.6 and later

The Fraction class implements numerical operations for rational numbers based on the API defined by Rational in the numbers module.

5.2.1 Creating Fraction Instances

As with the decimal module, new values can be created in several ways. One easy way is to create them from separate numerator and denominator values, as follows.

    import fractions

    for n, d in [ (1, 2), (2, 4), (3, 6) ]:
        f = fractions.Fraction(n, d)
        print '%s/%s = %s' % (n, d, f)

Fractions are reduced to lowest terms as new values are computed.

    $ python fractions_create_integers.py

    1/2 = 1/2
    2/4 = 1/2
    3/6 = 1/2

Another way to create a Fraction is to use a string representation of <numerator> / <denominator>:

    import fractions

    for s in [ '1/2', '2/4', '3/6' ]:
        f = fractions.Fraction(s)
        print '%s = %s' % (s, f)

The string is parsed to find the numerator and denominator values.

    $ python fractions_create_strings.py

    1/2 = 1/2
    2/4 = 1/2
    3/6 = 1/2

Strings can also use the more usual decimal or floating-point notation of a series of digits separated by a period.
    import fractions

    for s in [ '0.5', '1.5', '2.0' ]:
        f = fractions.Fraction(s)
        print '%s = %s' % (s, f)

The numerator and denominator values represented by the floating-point value are computed automatically.

    $ python fractions_create_strings_floats.py

    0.5 = 1/2
    1.5 = 3/2
    2.0 = 2

There are also class methods for creating Fraction instances directly from other representations of rational values, such as float or Decimal.

    import fractions

    for v in [ 0.1, 0.5, 1.5, 2.0 ]:
        print '%s = %s' % (v, fractions.Fraction.from_float(v))

Floating-point values that cannot be expressed exactly may yield unexpected results.

    $ python fractions_from_float.py

    0.1 = 3602879701896397/36028797018963968
    0.5 = 1/2
    1.5 = 3/2
    2.0 = 2

Using decimal representations of the values gives the expected results.

    import decimal
    import fractions

    for v in [ decimal.Decimal('0.1'),
               decimal.Decimal('0.5'),
               decimal.Decimal('1.5'),
               decimal.Decimal('2.0'),
               ]:
        print '%s = %s' % (v, fractions.Fraction.from_decimal(v))

The internal implementation of Decimal does not suffer from the precision errors of the standard floating-point representation.

    $ python fractions_from_decimal.py

    0.1 = 1/10
    0.5 = 1/2
    1.5 = 3/2
    2.0 = 2

5.2.2 Arithmetic

Once the fractions are instantiated, they can be used in mathematical expressions.

    import fractions

    f1 = fractions.Fraction(1, 2)
    f2 = fractions.Fraction(3, 4)

    print '%s + %s = %s' % (f1, f2, f1 + f2)
    print '%s - %s = %s' % (f1, f2, f1 - f2)
    print '%s * %s = %s' % (f1, f2, f1 * f2)
    print '%s / %s = %s' % (f1, f2, f1 / f2)

All standard operators are supported.

    $ python fractions_arithmetic.py

    1/2 + 3/4 = 5/4
    1/2 - 3/4 = -1/4
    1/2 * 3/4 = 3/8
    1/2 / 3/4 = 2/3

5.2.3 Approximating Values

A useful feature of Fraction is the ability to convert a floating-point number to an approximate rational value.
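As a brief sketch of the idea (variable names are chosen here only for illustration), a float can be converted to its exact rational value with from_float() and then reduced with limit_denominator() to the closest fraction whose denominator stays under a chosen bound. Single-argument print() calls are used so the sketch also runs on Python 3.

```python
import math
from fractions import Fraction

# The float nearest to 0.1 is not exactly 1/10, so the exact
# conversion produces a fraction with a very large denominator.
exact = Fraction.from_float(0.1)

# Restricting the denominator recovers the "intended" value.
approx = exact.limit_denominator(100)

# The same technique gives classic approximations of pi.
pi_exact = Fraction.from_float(math.pi)
pi_10 = pi_exact.limit_denominator(10)
pi_1000 = pi_exact.limit_denominator(1000)

print('exact 0.1     : {0}'.format(exact))
print('limited to 100: {0}'.format(approx))
print('pi, max 10    : {0}'.format(pi_10))
print('pi, max 1000  : {0}'.format(pi_1000))
```

The bound passed to limit_denominator() trades accuracy for simplicity: a small bound gives a compact fraction, a larger one a closer match.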
    import fractions
    import math

    print 'PI =', math.pi

    f_pi = fractions.Fraction(str(math.pi))
    print 'No limit =', f_pi

    for i in [ 1, 6, 11, 60, 70, 90, 100 ]:
        limited = f_pi.limit_denominator(i)
        print '{0:8} = {1}'.format(i, limited)

The value of the fraction can be controlled by limiting the denominator size.

    $ python fractions_limit_denominator.py

    PI = 3.14159265359
    No limit = 314159265359/100000000000
           1 = 3
           6 = 19/6
          11 = 22/7
          60 = 179/57
          70 = 201/64
          90 = 267/85
         100 = 311/99

See Also:

fractions (http://docs.python.org/library/fractions.html) The standard library documentation for this module.
decimal (page 197) The decimal module provides an API for fixed and floating-point math.
numbers (http://docs.python.org/library/numbers.html) Numeric abstract base classes.

5.3 random—Pseudorandom Number Generators

Purpose: Implements several types of pseudorandom number generators.
Python Version: 1.4 and later

The random module provides a fast pseudorandom number generator based on the Mersenne Twister algorithm. Originally developed to produce inputs for Monte Carlo simulations, Mersenne Twister generates numbers with nearly uniform distribution and a large period, making it suited for a wide range of applications.

5.3.1 Generating Random Numbers

The random() function returns the next random floating-point value from the generated sequence. All return values fall within the range 0 <= n < 1.0.

    import random

    for i in xrange(5):
        print '%04.3f' % random.random(),
    print

Running the program repeatedly produces different sequences of numbers.

    $ python random_random.py

    0.809 0.485 0.521 0.800 0.247

    $ python random_random.py

    0.614 0.551 0.705 0.479 0.659

To generate numbers in a specific numerical range, use uniform() instead.
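A rough sketch of what uniform() does is to scale one value from random() into the requested range using min + (max - min) * random(). The helper scaled() below is hypothetical, written only for this comparison. The sketch also assumes that two random.Random instances created with the same seed yield the same underlying sequence.

```python
import random


def scaled(rng, low, high):
    """Scale one value from rng.random() into the range low..high."""
    return low + (high - low) * rng.random()


# Two generators with the same seed, so both draw the same raw values.
rng_a = random.Random(12345)
rng_b = random.Random(12345)

pairs = [(rng_a.uniform(1, 100), scaled(rng_b, 1, 100))
         for _ in range(5)]

for from_uniform, by_hand in pairs:
    print('{0:7.3f} {1:7.3f}'.format(from_uniform, by_hand))
```

Each row should show the same number twice, since both columns apply the same arithmetic to the same raw value.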
    import random

    for i in xrange(5):
        print '%04.3f' % random.uniform(1, 100),
    print

Pass minimum and maximum values, and uniform() adjusts the return values from random() using the formula min + (max - min) * random().

    $ python random_uniform.py

    78.558 96.734 74.521 52.386 98.499

5.3.2 Seeding

random() produces different values each time it is called and has a very large period before it repeats any numbers. This is useful for producing unique values or variations, but there are times when having the same data set available to be processed in different ways is useful. One technique is to use a program to generate random values and save them to be processed by a separate step. That may not be practical for large amounts of data, though, so random includes the seed() function for initializing the pseudorandom generator so that it produces an expected set of values.

    import random

    random.seed(1)

    for i in xrange(5):
        print '%04.3f' % random.random(),
    print

The seed value controls the first value produced by the formula used to produce pseudorandom numbers, and since the formula is deterministic, it also sets the full sequence produced after the seed is changed. The argument to seed() can be any hashable object. The default is to use a platform-specific source of randomness, if one is available. Otherwise, the current time is used.

    $ python random_seed.py

    0.134 0.847 0.764 0.255 0.495

    $ python random_seed.py

    0.134 0.847 0.764 0.255 0.495

5.3.3 Saving State

The internal state of the pseudorandom algorithm used by random() can be saved and used to control the numbers produced in subsequent runs. Restoring the previous state before continuing reduces the likelihood of repeating values or sequences of values from the earlier input. The getstate() function returns data that can be used to reinitialize the random number generator later with setstate().
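A minimal in-memory sketch of the same idea, assuming nothing beyond getstate() and setstate(), is to capture the state, draw some values, restore the state, and draw again. The variable names are illustrative, and the state is kept in a variable rather than written to disk.

```python
import random

random.seed(1)

# Capture the generator state before drawing any values.
saved = random.getstate()
first = [random.random() for _ in range(3)]

# Rewind to the captured state and draw again.
random.setstate(saved)
replay = [random.random() for _ in range(3)]

print('first : {0}'.format(' '.join('%04.3f' % v for v in first)))
print('replay: {0}'.format(' '.join('%04.3f' % v for v in replay)))
```

Both lines should show the same three values, because the second batch starts from exactly the state the first batch started from.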
    import random
    import os
    import cPickle as pickle

    if os.path.exists('state.dat'):
        # Restore the previously saved state
        print 'Found state.dat, initializing random module'
        with open('state.dat', 'rb') as f:
            state = pickle.load(f)
        random.setstate(state)
    else:
        # Use a well-known start state
        print 'No state.dat, seeding'
        random.seed(1)

    # Produce random values
    for i in xrange(3):
        print '%04.3f' % random.random(),
    print

    # Save state for next time
    with open('state.dat', 'wb') as f:
        pickle.dump(random.getstate(), f)

    # Produce more random values
    print '\nAfter saving state:'
    for i in xrange(3):
        print '%04.3f' % random.random(),
    print

The data returned by getstate() is an implementation detail, so this example saves the data to a file with pickle, but otherwise treats it as a black box. If the file exists when the program starts, it loads the old state and continues. Each run produces a few numbers before and after saving the state to show that restoring the state causes the generator to produce the same values again.

    $ python random_state.py

    No state.dat, seeding
    0.134 0.847 0.764

    After saving state:
    0.255 0.495 0.449

    $ python random_state.py

    Found state.dat, initializing random module
    0.255 0.495 0.449

    After saving state:
    0.652 0.789 0.094

5.3.4 Random Integers

random() generates floating-point numbers. It is possible to convert the results to integers, but using randint() to generate integers directly is more convenient.

    import random

    print '[1, 100]:',

    for i in xrange(3):
        print random.randint(1, 100),

    print '\n[-5, 5]:',
    for i in xrange(3):
        print random.randint(-5, 5),
    print

The arguments to randint() are the ends of the inclusive range for the values. The numbers can be positive or negative, but the first value should be less than the second.

    $ python random_randint.py

    [1, 100]: 91 77 67
    [-5, 5]: -5 -3 3

randrange() is a more general form of selecting values from a range.
    import random

    for i in xrange(3):
        print random.randrange(0, 101, 5),
    print

randrange() supports a step argument, in addition to start and stop values, so it is fully equivalent to selecting a random value from range(start, stop, step). It is more efficient, because the range is not actually constructed.

    $ python random_randrange.py

    50 10 60

5.3.5 Picking Random Items

One common use for random number generators is to select a random item from a sequence of enumerated values, even if those values are not numbers. random includes the choice() function for making a random selection from a sequence. This example simulates flipping a coin 10,000 times to count how many times it comes up heads and how many times it comes up tails.

    import random
    import itertools

    outcomes = { 'heads':0,
                 'tails':0,
                 }
    sides = outcomes.keys()

    for i in range(10000):
        outcomes[ random.choice(sides) ] += 1

    print 'Heads:', outcomes['heads']
    print 'Tails:', outcomes['tails']

Only two outcomes are allowed, so rather than use numbers and convert them, the words “heads” and “tails” are used with choice(). The results are tabulated in a dictionary using the outcome names as keys.

    $ python random_choice.py

    Heads: 5038
    Tails: 4962

5.3.6 Permutations

A simulation of a card game needs to mix up the deck of cards and then deal the cards to the players, without using the same card more than once. Using choice() could result in the same card being dealt twice, so instead, the deck can be mixed up with shuffle() and then individual cards removed as they are dealt.

    import random
    import itertools

    FACE_CARDS = ('J', 'Q', 'K', 'A')
    SUITS = ('H', 'D', 'C', 'S')

    def new_deck():
        return list(itertools.product(
            itertools.chain(xrange(2, 11), FACE_CARDS),
            SUITS,
            ))

    def show_deck(deck):
        p_deck = deck[:]
        while p_deck:
            row = p_deck[:13]
            p_deck = p_deck[13:]
            for j in row:
                print '%2s%s' % j,
            print

    # Make a new deck, with the cards in order
    deck = new_deck()
    print 'Initial deck:'
    show_deck(deck)

    # Shuffle the deck to randomize the order
    random.shuffle(deck)
    print '\nShuffled deck:'
    show_deck(deck)

    # Deal 4 hands of 5 cards each
    hands = [ [], [], [], [] ]

    for i in xrange(5):
        for h in hands:
            h.append(deck.pop())

    # Show the hands
    print '\nHands:'
    for n, h in enumerate(hands):
        print '%d:' % (n+1),
        for c in h:
            print '%2s%s' % c,
        print

    # Show the remaining deck
    print '\nRemaining deck:'
    show_deck(deck)

The cards are represented as tuples with the face value and a letter indicating the suit. The dealt “hands” are created by removing one card at a time from the deck and adding it to each of four lists in turn, so the same card cannot be dealt again.

    $ python random_shuffle.py

    Initial deck:
     2H  2D  2C  2S  3H  3D  3C  3S  4H  4D  4C  4S  5H
     5D  5C  5S  6H  6D  6C  6S  7H  7D  7C  7S  8H  8D
     8C  8S  9H  9D  9C  9S 10H 10D 10C 10S  JH  JD  JC
     JS  QH  QD  QC  QS  KH  KD  KC  KS  AH  AD  AC  AS

    Shuffled deck:
     3C  KH  QH  6H  JD  AC  7S  5D  3S 10S  7H  QC  2C
     5C  7C  4H  6S  9D 10H  4D  2H  3D  7D  5S 10D  9H
     2S  9C  KC  5H  6C  8S  3H 10C  JS  2D  AH  KD  AD
     4C  QS  8D  8C  JC  8H  4S  JH  QD  9S  AS  KS  6D

    Hands:
    1:  6D  QD  JC  4C  2D
    2:  KS  JH  8C  AD  JS
    3:  AS  4S  8D  KD 10C
    4:  9S  8H  QS  AH  3H

    Remaining deck:
     3C  KH  QH  6H  JD  AC  7S  5D  3S 10S  7H  QC  2C
     5C  7C  4H  6S  9D 10H  4D  2H  3D  7D  5S 10D  9H
     2S  9C  KC  5H  6C  8S

5.3.7 Sampling

Many simulations need random samples from a population of input values. The sample() function generates samples without repeating values and without modifying the input sequence. This example prints a random sample of words from the system dictionary.
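For readers without a system dictionary file, a small self-contained sketch can illustrate the two guarantees mentioned above. The population list and variable names here are made up for illustration. The sample should contain no repeated items, and the input sequence should be left untouched.

```python
import random

# A small stand-in population (hypothetical data, not from the book).
population = ['ace', 'king', 'queen', 'jack', 'ten', 'nine']
original = list(population)

picked = random.sample(population, 3)

print('sample    : {0}'.format(picked))
print('population: {0}'.format(population))
print('unique    : {0}'.format(len(set(picked)) == len(picked)))
print('unchanged : {0}'.format(population == original))
```

This differs from calling choice() three times, which could return the same item more than once.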
    import random

    with open('/usr/share/dict/words', 'rt') as f:
        words = f.readlines()
    words = [ w.rstrip() for w in words ]

    for w in random.sample(words, 5):
        print w

The algorithm for producing the result set takes into account the sizes of the input and the sample requested to produce the result as efficiently as possible.

    $ python random_sample.py

    pleasureman
    consequency
    docibility
    youdendrift
    Ituraean

    $ python random_sample.py

    jigamaree
    readingdom
    sporidium
    pansylike
    foraminiferan

5.3.8 Multiple Simultaneous Generators

In addition to module-level functions, random includes a Random class to manage the internal state for several random number generators. All of the functions described earlier are available as methods of the Random instances, and each instance can be initialized and used separately, without interfering with the values returned by other instances.

    import random
    import time

    print 'Default initialization:\n'

    r1 = random.Random()
    r2 = random.Random()

    for i in xrange(3):
        print '%04.3f %04.3f' % (r1.random(), r2.random())

    print '\nSame seed:\n'

    seed = time.time()
    r1 = random.Random(seed)
    r2 = random.Random(seed)

    for i in xrange(3):
        print '%04.3f %04.3f' % (r1.random(), r2.random())

On a system with good native random-value seeding, the instances start out in unique states. However, if there is no good platform random-value generator, the instances are likely to have been seeded with the current time, and therefore, produce the same values.

    $ python random_random_class.py

    Default initialization:

    0.370 0.303
    0.437 0.142
    0.323 0.088

    Same seed:

    0.684 0.684
    0.060 0.060
    0.977 0.977

To ensure that the generators produce values from different parts of the random period, use jumpahead() to shift one of them away from its initial state.

    import random
    import time

    r1 = random.Random()
    r2 = random.Random()

    # Force r2 to a different part of the random period than r1.
    r2.setstate(r1.getstate())
    r2.jumpahead(1024)

    for i in xrange(3):
        print '%04.3f %04.3f' % (r1.random(), r2.random())

The argument to jumpahead() should be a nonnegative integer based on the number of values needed from each generator. The internal state of the generator is scrambled based on the input value, but not simply by incrementing it by the number of steps given.

    $ python random_jumpahead.py

    0.858 0.093
    0.510 0.707
    0.444 0.556

5.3.9 SystemRandom

Some operating systems provide a random number generator that has access to more sources of entropy that can be introduced into the generator. random exposes this feature through the SystemRandom class, which has the same API as Random but uses os.urandom() to generate the values that form the basis of all other algorithms.

    import random
    import time

    print 'Default initialization:\n'

    r1 = random.SystemRandom()
    r2 = random.SystemRandom()

    for i in xrange(3):
        print '%04.3f %04.3f' % (r1.random(), r2.random())

    print '\nSame seed:\n'

    seed = time.time()
    r1 = random.SystemRandom(seed)
    r2 = random.SystemRandom(seed)

    for i in xrange(3):
        print '%04.3f %04.3f' % (r1.random(), r2.random())

Sequences produced by SystemRandom are not reproducible because the randomness is coming from the system, rather than from the software state (in fact, seed() and setstate() have no effect at all).

    $ python random_system_random.py

    Default initialization:

    0.551 0.873
    0.643 0.975
    0.106 0.268

    Same seed:

    0.211 0.985
    0.101 0.852
    0.887 0.344

5.3.10 Nonuniform Distributions

While the uniform distribution of the values produced by random() is useful for a lot of purposes, other distributions more accurately model specific situations. The random module includes functions to produce values in those distributions, too. They are listed here, but not covered in detail because their uses tend to be specialized and require more complex examples.
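As a rough illustration of how these functions are called, the sketch below draws samples from three of the distributions and compares the sample mean with the theoretical mean. The parameter values, sample size, and the mean() helper are arbitrary choices made for this illustration, not recommendations.

```python
import random

rng = random.Random(42)   # fixed seed so repeated runs match
N = 10000


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / float(len(values))


# Normal distribution: mean mu=50, standard deviation sigma=10.
normal = [rng.gauss(50, 10) for _ in range(N)]

# Exponential distribution: rate lambd=0.5, so the mean is 1/0.5 = 2.
expo = [rng.expovariate(0.5) for _ in range(N)]

# Triangular distribution on [0, 10] with the mode at 3;
# the theoretical mean is (low + high + mode) / 3.
tri = [rng.triangular(0, 10, 3) for _ in range(N)]

print('gauss       mean: {0:6.2f} (expected 50.00)'.format(mean(normal)))
print('expovariate mean: {0:6.2f} (expected  2.00)'.format(mean(expo)))
print('triangular  mean: {0:6.2f} (expected {1:5.2f})'.format(
    mean(tri), (0 + 10 + 3) / 3.0))
```

With 10,000 samples the observed means should land close to the expected values, though not exactly on them.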
Normal

The normal distribution is commonly used for nonuniform continuous values, such as grades, heights, weights, etc. The curve produced by the distribution has a distinctive shape that has led to it being nicknamed a “bell curve.” random includes two functions for generating values with a normal distribution, normalvariate() and the slightly faster gauss(). (The normal distribution is also called the Gaussian distribution.)

The related function, lognormvariate(), produces pseudorandom values where the logarithm of the values is distributed normally. Log-normal distributions are useful for values that are the product of several random variables that do not interact.

Approximation

The triangular distribution is used as an approximate distribution for small sample sizes. The “curve” of a triangular distribution has low points at known minimum and maximum values, and a high point at the mode, which is estimated based on a “most likely” outcome (reflected by the mode argument to triangular()).

Exponential

expovariate() produces an exponential distribution useful for simulating arrival or interval time values for use in homogeneous Poisson processes, such as the rate of radioactive decay or requests coming into a Web server.

The Pareto, or power law, distribution matches many observable phenomena and was popularized by The Long Tail, by Chris Anderson. The paretovariate() function is useful for simulating allocation of resources to individuals (wealth to people, demand for musicians, attention to blogs, etc.).

Angular

The von Mises, or circular normal, distribution (produced by vonmisesvariate()) is used for computing probabilities of cyclic values, such as angles, calendar days, and times.

Sizes

betavariate() generates values with the Beta distribution, which is commonly used in Bayesian statistics and applications such as task duration modeling.
The Gamma distribution produced by gammavariate() is used for modeling the sizes of things, such as waiting times, rainfall, and computational errors.

The Weibull distribution computed by weibullvariate() is used in failure analysis, industrial engineering, and weather forecasting. It describes the distribution of sizes of particles or other discrete objects.

See Also:

random (http://docs.python.org/library/random.html) The standard library documentation for this module.
Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator Article by M. Matsumoto and T. Nishimura from ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, January 1998, pp. 3–30.
Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister) Wikipedia article about the pseudorandom generator algorithm used by Python.
Uniform distribution (http://en.wikipedia.org/wiki/Uniform_distribution_(continuous)) Wikipedia article about continuous uniform distributions in statistics.

5.4 math—Mathematical Functions

Purpose: Provides functions for specialized mathematical operations.
Python Version: 1.4 and later

The math module implements many of the IEEE functions that would normally be found in the native platform C libraries for complex mathematical operations using floating-point values, including logarithms and trigonometric operations.

5.4.1 Special Constants

Many math operations depend on special constants. math includes values for π (pi) and e.

    # -*- coding: utf-8 -*-
    import math

    print 'π: %.30f' % math.pi
    print 'e: %.30f' % math.e

Both values are limited in precision only by the platform’s floating-point C library.

    $ python math_constants.py

    π: 3.141592653589793115997963468544
    e: 2.718281828459045090795598298428

5.4.2 Testing for Exceptional Values

Floating-point calculations can result in two types of exceptional values.
The first of these, INF (infinity), appears when the double used to hold a floating-point value overflows from a value with a large absolute value.

    import math

    print '{:^3} {:6} {:6} {:6}'.format('e', 'x', 'x**2', 'isinf')
    print '{:-^3} {:-^6} {:-^6} {:-^6}'.format('', '', '', '')

    for e in range(0, 201, 20):
        x = 10.0 ** e
        y = x*x
        print '{:3d} {!s:6} {!s:6} {!s:6}'.format(e, x, y,
                                                  math.isinf(y),
                                                  )

When the exponent in this example grows large enough, the square of x no longer fits inside a double, and the value is recorded as infinite.

    $ python math_isinf.py

     e  x      x**2   isinf
    --- ------ ------ ------
      0 1.0    1.0    False
     20 1e+20  1e+40  False
     40 1e+40  1e+80  False
     60 1e+60  1e+120 False
     80 1e+80  1e+160 False
    100 1e+100 1e+200 False
    120 1e+120 1e+240 False
    140 1e+140 1e+280 False
    160 1e+160 inf    True
    180 1e+180 inf    True
    200 1e+200 inf    True

Not all floating-point overflows result in INF values, however. Calculating an exponent with floating-point values, in particular, raises OverflowError instead of preserving the INF result.

    x = 10.0 ** 200

    print 'x =', x
    print 'x*x =', x*x
    try:
        print 'x**2 =', x**2
    except OverflowError, err:
        print err

This discrepancy is caused by an implementation difference in the library used by CPython.

    $ python math_overflow.py

    x = 1e+200
    x*x = inf
    x**2 = (34, 'Result too large')

Some division operations using infinite values are undefined. The result of dividing infinity by infinity is NaN (not a number).

    import math

    x = (10.0 ** 200) * (10.0 ** 200)
    y = x/x

    print 'x =', x
    print 'isnan(x) =', math.isnan(x)
    print 'y = x / x =', x/x
    print 'y == nan =', y == float('nan')
    print 'isnan(y) =', math.isnan(y)

NaN does not compare as equal to any value, even itself, so to check for NaN, use isnan().
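A short sketch, with variable names chosen only for illustration, separates the cases. Dividing a finite value by infinity gives zero, while infinity divided by (or subtracted from) infinity has no defined value and gives NaN. It also shows why an equality test cannot be used to detect NaN.

```python
import math

inf = float('inf')

finite_over_inf = 1.0 / inf    # well defined: 0.0
inf_over_inf = inf / inf       # undefined: nan
inf_minus_inf = inf - inf      # undefined: nan

print('1.0 / inf = {0}'.format(finite_over_inf))
print('inf / inf = {0}'.format(inf_over_inf))
print('inf - inf = {0}'.format(inf_minus_inf))

# NaN is not equal to anything, including itself, so == cannot
# detect it; math.isnan() can.
print('nan == nan: {0}'.format(inf_over_inf == inf_over_inf))
print('isnan(nan): {0}'.format(math.isnan(inf_over_inf)))
```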
    $ python math_isnan.py

    x = inf
    isnan(x) = False
    y = x / x = nan
    y == nan = False
    isnan(y) = True

5.4.3 Converting to Integers

The math module includes three functions for converting floating-point values to whole numbers. Each takes a different approach and will be useful in different circumstances.

The simplest is trunc(), which truncates the digits following the decimal point, leaving only the digits making up the whole-number portion of the value. floor() converts its input to the largest integer that is less than or equal to it, and ceil() (ceiling) produces the smallest integer that is greater than or equal to the input value.

    import math

    HEADINGS = ('i', 'int', 'trunc', 'floor', 'ceil')
    print '{:^5} {:^5} {:^5} {:^5} {:^5}'.format(*HEADINGS)
    print '{:-^5} {:-^5} {:-^5} {:-^5} {:-^5}'.format(
        '', '', '', '', '',
        )

    fmt = ''.join(['{:5.1f}'] * 5)

    TEST_VALUES = [ -1.5,
                    -0.8,
                    -0.5,
                    -0.2,
                    0,
                    0.2,
                    0.5,
                    0.8,
                    1,
                    ]

    for i in TEST_VALUES:
        print fmt.format(i,
                         int(i),
                         math.trunc(i),
                         math.floor(i),
                         math.ceil(i))

trunc() is equivalent to converting to int directly.

    $ python math_integers.py

      i    int  trunc floor ceil
    ----- ----- ----- ----- -----
     -1.5  -1.0  -1.0  -2.0  -1.0
     -0.8   0.0   0.0  -1.0  -0.0
     -0.5   0.0   0.0  -1.0  -0.0
     -0.2   0.0   0.0  -1.0  -0.0
      0.0   0.0   0.0   0.0   0.0
      0.2   0.0   0.0   0.0   1.0
      0.5   0.0   0.0   0.0   1.0
      0.8   0.0   0.0   0.0   1.0
      1.0   1.0   1.0   1.0   1.0

5.4.4 Alternate Representations

modf() takes a single floating-point number and returns a tuple containing the fractional and whole-number parts of the input value.

    import math

    for i in range(6):
        print '{}/2 = {}'.format(i, math.modf(i/2.0))

Both numbers in the return value are floats.

    $ python math_modf.py

    0/2 = (0.0, 0.0)
    1/2 = (0.5, 0.0)
    2/2 = (0.0, 1.0)
    3/2 = (0.5, 1.0)
    4/2 = (0.0, 2.0)
    5/2 = (0.5, 2.0)

frexp() returns the mantissa and exponent of a floating-point number, and can be used to create a more portable representation of the value.
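As a small sketch of the "portable representation" idea, a value can be split with frexp() and rebuilt with ldexp(), which applies x = m * 2**e in the other direction. The test values and list names below are arbitrary.

```python
import math

values = [0.1, 0.5, 4.0, -12.375]
parts = []        # (mantissa, exponent) pairs
round_trip = []   # values rebuilt from the pairs

for x in values:
    m, e = math.frexp(x)          # x == m * 2**e, with 0.5 <= abs(m) < 1
    parts.append((m, e))
    round_trip.append(math.ldexp(m, e))

for x, (m, e), y in zip(values, parts, round_trip):
    print('{0:8.3f} -> m={1:.6f} e={2:2d} -> {3:8.3f}'.format(x, m, e, y))
```

Because the exponent is a power of two, splitting and recombining does not lose any bits, so the rebuilt values compare equal to the originals.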
    import math

    print '{:^7} {:^7} {:^7}'.format('x', 'm', 'e')
    print '{:-^7} {:-^7} {:-^7}'.format('', '', '')

    for x in [ 0.1, 0.5, 4.0 ]:
        m, e = math.frexp(x)
        print '{:7.2f} {:7.2f} {:7d}'.format(x, m, e)

frexp() uses the formula x = m * 2**e, and returns the values m and e.

    $ python math_frexp.py

       x       m       e
    ------- ------- -------
       0.10    0.80      -3
       0.50    0.50       0
       4.00    0.50       3

ldexp() is the inverse of frexp().

    import math

    print '{:^7} {:^7} {:^7}'.format('m', 'e', 'x')
    print '{:-^7} {:-^7} {:-^7}'.format('', '', '')

    for m, e in [ (0.8, -3),
                  (0.5, 0),
                  (0.5, 3),
                  ]:
        x = math.ldexp(m, e)
        print '{:7.2f} {:7d} {:7.2f}'.format(m, e, x)

Using the same formula as frexp(), ldexp() takes the mantissa and exponent values as arguments and returns a floating-point number.

    $ python math_ldexp.py

       m       e       x
    ------- ------- -------
       0.80      -3    0.10
       0.50       0    0.50
       0.50       3    4.00

5.4.5 Positive and Negative Signs

The absolute value of a number is its value without a sign. Use fabs() to calculate the absolute value of a floating-point number.

    import math

    print math.fabs(-1.1)
    print math.fabs(-0.0)
    print math.fabs(0.0)
    print math.fabs(1.1)

In practical terms, the absolute value of a float is represented as a positive value.

    $ python math_fabs.py

    1.1
    0.0
    0.0
    1.1

To determine the sign of a value, either to give a set of values the same sign or to compare two values, use copysign() to set the sign of a known good value.

    import math

    HEADINGS = ('f', 's', '< 0', '> 0', '= 0')
    print '{:^5} {:^5} {:^5} {:^5} {:^5}'.format(*HEADINGS)
    print '{:-^5} {:-^5} {:-^5} {:-^5} {:-^5}'.format(
        '', '', '', '', '',
        )

    for f in [ -1.0,
               0.0,
               1.0,
               float('-inf'),
               float('inf'),
               float('-nan'),
               float('nan'),
               ]:
        s = int(math.copysign(1, f))
        print '{:5.1f} {:5d} {!s:5} {!s:5} {!s:5}'.format(
            f, s, f < 0, f > 0, f==0,
            )

An extra function like copysign() is needed because comparing NaN and -NaN directly with other values does not work.
    $ python math_copysign.py

      f     s    < 0   > 0   = 0
    ----- ----- ----- ----- -----
     -1.0    -1 True  False False
      0.0     1 False False True
      1.0     1 False True  False
     -inf    -1 True  False False
      inf     1 False True  False
      nan    -1 False False False
      nan     1 False False False

5.4.6 Commonly Used Calculations

Representing precise values in binary floating-point memory is challenging. Some values cannot be represented exactly, and the more often a value is manipulated through repeated calculations, the more likely a representation error will be introduced. math includes a function for computing the sum of a series of floating-point numbers using an efficient algorithm that minimizes such errors.

    import math

    values = [ 0.1 ] * 10

    print 'Input values:', values

    print 'sum()       : {:.20f}'.format(sum(values))

    s = 0.0
    for i in values:
        s += i
    print 'for-loop    : {:.20f}'.format(s)

    print 'math.fsum() : {:.20f}'.format(math.fsum(values))

Given a sequence of ten values, each equal to 0.1, the expected value for the sum of the sequence is 1.0. Since 0.1 cannot be represented exactly as a floating-point value, however, errors are introduced into the sum unless it is calculated with fsum().

    $ python math_fsum.py

    Input values: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    sum()       : 0.99999999999999988898
    for-loop    : 0.99999999999999988898
    math.fsum() : 1.00000000000000000000

factorial() is commonly used to calculate the number of permutations and combinations of a series of objects. The factorial of a positive integer n, expressed n!, is defined recursively as (n-1)!*n and stops with 0!==1.

    import math

    for i in [ 0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.1 ]:
        try:
            print '{:2.0f} {:6.0f}'.format(i, math.factorial(i))
        except ValueError, err:
            print 'Error computing factorial(%s):' % i, err

factorial() only works with whole numbers, but it does accept float arguments as long as they can be converted to an integer without losing value.
    $ python math_factorial.py

     0      1
     1      1
     2      2
     3      6
     4     24
     5    120
    Error computing factorial(6.1): factorial() only accepts integral values

gamma() is like factorial(), except that it works with real numbers and the value is shifted down by one (gamma is equal to (n - 1)!).

    import math

    for i in [ 0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6 ]:
        try:
            print '{:2.1f} {:6.2f}'.format(i, math.gamma(i))
        except ValueError, err:
            print 'Error computing gamma(%s):' % i, err

Since zero causes the start value to be negative, it is not allowed.

    $ python math_gamma.py

    Error computing gamma(0): math domain error
    1.1   0.95
    2.2   1.10
    3.3   2.68
    4.4  10.14
    5.5  52.34
    6.6 344.70

lgamma() returns the natural logarithm of the absolute value of gamma for the input value.

    import math

    for i in [ 0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6 ]:
        try:
            print '{:2.1f} {:.20f} {:.20f}'.format(
                i,
                math.lgamma(i),
                math.log(math.gamma(i)),
                )
        except ValueError, err:
            print 'Error computing lgamma(%s):' % i, err

Using lgamma() retains more precision than calculating the logarithm separately using the results of gamma().

    $ python math_lgamma.py

    Error computing lgamma(0): math domain error
    1.1 -0.04987244125984036103 -0.04987244125983997245
    2.2 0.09694746679063825923 0.09694746679063866168
    3.3 0.98709857789473387513 0.98709857789473409717
    4.4 2.31610349142485727469 2.31610349142485727469
    5.5 3.95781396761871651080 3.95781396761871606671
    6.6 5.84268005527463252236 5.84268005527463252236

The modulo operator (%) computes the remainder of a division expression (e.g., 5 % 2 = 1). The operator built into the language works well with integers, but, as with so many other floating-point operations, intermediate calculations cause representational issues that result in a loss of data. fmod() provides a more accurate implementation for floating-point values.
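The accuracy difference is easiest to see with operands of very different magnitude. The following is a minimal sketch in Python 3 syntax (unlike the Python 2 listings in this section); the operand values and variable names are chosen only for illustration.

```python
import math

# Operands chosen so the exact remainder is not representable as a float.
x, y = -1e-100, 1e100

# The % operator gives a result with the sign of the divisor, so the exact
# answer would be 1e100 - 1e-100, which rounds to 1e100.
operator_result = x % y

# fmod() gives a result with the sign of the dividend and can return x exactly.
fmod_result = math.fmod(x, y)

print('x % y      = {!r}'.format(operator_result))
print('fmod(x, y) = {!r}'.format(fmod_result))
```

The operator's result has lost the tiny value of x entirely, while fmod() preserves it.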
    import math

    print '{:^4} {:^4} {:^5} {:^5}'.format('x', 'y', '%', 'fmod')
    print '---- ---- ----- -----'

    for x, y in [ (5, 2),
                  (5, -2),
                  (-5, 2),
                  ]:
        print '{:4.1f} {:4.1f} {:5.2f} {:5.2f}'.format(
            x,
            y,
            x % y,
            math.fmod(x, y),
            )

A potentially more frequent source of confusion is the fact that the algorithm used by fmod() for computing modulo is also different from that used by %, so the sign of the result is different.

    $ python math_fmod.py

     x    y     %   fmod
    ---- ---- ----- -----
     5.0  2.0  1.00  1.00
     5.0 -2.0 -1.00  1.00
    -5.0  2.0  1.00 -1.00

5.4.7 Exponents and Logarithms

Exponential growth curves appear in economics, physics, and other sciences. Python has a built-in exponentiation operator ("**"), but pow() can be useful when a callable function is needed as an argument to another function.

    import math

    for x, y in [
        # Typical uses
        (2, 3),
        (2.1, 3.2),

        # Always 1
        (1.0, 5),
        (2.0, 0),

        # Not-a-number
        (2, float('nan')),

        # Roots
        (9.0, 0.5),
        (27.0, 1.0/3),
        ]:
        print '{:5.1f} ** {:5.3f} = {:6.3f}'.format(x, y, math.pow(x, y))

Raising 1 to any power always returns 1.0, as does raising any value to a power of 0.0. Most operations on the not-a-number value nan return nan. If the exponent is less than 1, pow() computes a root.

    $ python math_pow.py

      2.0 ** 3.000 =  8.000
      2.1 ** 3.200 = 10.742
      1.0 ** 5.000 =  1.000
      2.0 ** 0.000 =  1.000
      2.0 **   nan =    nan
      9.0 ** 0.500 =  3.000
     27.0 ** 0.333 =  3.000

Since square roots (exponent of 1/2) are used so frequently, there is a separate function for computing them.

    import math

    print math.sqrt(9.0)
    print math.sqrt(3)
    try:
        print math.sqrt(-1)
    except ValueError, err:
        print 'Cannot compute sqrt(-1):', err

Computing the square roots of negative numbers requires complex numbers, which are not handled by math. Any attempt to calculate a square root of a negative value results in a ValueError.
    $ python math_sqrt.py

    3.0
    1.73205080757
    Cannot compute sqrt(-1): math domain error

The logarithm function finds y where x = b ** y. By default, log() computes the natural logarithm (the base is e). If a second argument is provided, that value is used as the base.

    import math

    print math.log(8)
    print math.log(8, 2)
    print math.log(0.5, 2)

Logarithms where x is less than one yield negative results.

    $ python math_log.py

    2.07944154168
    3.0
    -1.0

There are two variations of log(). Given floating-point representation and rounding errors, the computed value produced by log(x, b) has limited accuracy, especially for some bases. log10() computes log(x, 10), using a more accurate algorithm than log().

    import math

    print '{:2} {:^12} {:^10} {:^20} {:8}'.format(
        'i', 'x', 'accurate', 'inaccurate', 'mismatch',
        )
    print '{:-^2} {:-^12} {:-^10} {:-^20} {:-^8}'.format(
        '', '', '', '', '',
        )

    for i in range(0, 10):
        x = math.pow(10, i)
        accurate = math.log10(x)
        inaccurate = math.log(x, 10)
        match = '' if int(inaccurate) == i else '*'
        print '{:2d} {:12.1f} {:10.8f} {:20.18f} {:^5}'.format(
            i, x, accurate, inaccurate, match,
            )

The lines in the output with trailing * highlight the inaccurate values.

    $ python math_log10.py

    i       x        accurate       inaccurate      mismatch
    -- ------------ ---------- -------------------- --------
     0          1.0 0.00000000 0.000000000000000000
     1         10.0 1.00000000 1.000000000000000000
     2        100.0 2.00000000 2.000000000000000000
     3       1000.0 3.00000000 2.999999999999999556   *
     4      10000.0 4.00000000 4.000000000000000000
     5     100000.0 5.00000000 5.000000000000000000
     6    1000000.0 6.00000000 5.999999999999999112   *
     7   10000000.0 7.00000000 7.000000000000000000
     8  100000000.0 8.00000000 8.000000000000000000
     9 1000000000.0 9.00000000 8.999999999999998224   *

log1p() calculates the Newton-Mercator series (the natural logarithm of 1+x).

    import math

    x = 0.0000000000000000000000001
    print 'x       :', x
    print '1 + x   :', 1+x
    print 'log(1+x):', math.log(1+x)
    print 'log1p(x):', math.log1p(x)

log1p() is more accurate for values of x very close to zero because it uses an algorithm that compensates for round-off errors from the initial addition.

    $ python math_log1p.py

    x       : 1e-25
    1 + x   : 1.0
    log(1+x): 0.0
    log1p(x): 1e-25

exp() computes the exponential function (e**x).

    import math

    x = 2

    fmt = '%.20f'
    print fmt % (math.e ** 2)
    print fmt % math.pow(math.e, 2)
    print fmt % math.exp(2)

As with other special-case functions, it uses an algorithm that produces more accurate results than the general-purpose equivalent math.pow(math.e, x).

    $ python math_exp.py

    7.38905609893064951876
    7.38905609893064951876
    7.38905609893065040694

expm1() is the inverse of log1p() and calculates e**x - 1.

    import math

    x = 0.0000000000000000000000001

    print x
    print math.exp(x) - 1
    print math.expm1(x)

As with log1p(), small values of x lose precision when the subtraction is performed separately.

    $ python math_expm1.py

    1e-25
    0.0
    1e-25

5.4.8 Angles

Although degrees are more commonly used in everyday discussions of angles, radians are the standard unit of angular measure in science and math. A radian is the angle created by two lines intersecting at the center of a circle, with their ends on the circumference of the circle spaced one radius apart.

The circumference is calculated as 2πr, so there is a relationship between radians and π, a value that shows up frequently in trigonometric calculations. That relationship leads to radians being used in trigonometry and calculus, because they result in more compact formulas.

To convert from degrees to radians, use radians().

    import math

    print '{:^7} {:^7} {:^7}'.format('Degrees', 'Radians', 'Expected')
    print '{:-^7} {:-^7} {:-^7}'.format('', '', '')

    for deg, expected in [ (  0, 0),
                           ( 30, math.pi/6),
                           ( 45, math.pi/4),
                           ( 60, math.pi/3),
                           ( 90, math.pi/2),
                           (180, math.pi),
                           (270, 3/2.0 * math.pi),
                           (360, 2 * math.pi),
                           ]:
        print '{:7d} {:7.2f} {:7.2f}'.format(deg,
                                             math.radians(deg),
                                             expected,
                                             )

The formula for the conversion is rad = deg * π / 180.

    $ python math_radians.py

    Degrees Radians Expected
    ------- ------- -------
          0    0.00    0.00
         30    0.52    0.52
         45    0.79    0.79
         60    1.05    1.05
         90    1.57    1.57
        180    3.14    3.14
        270    4.71    4.71
        360    6.28    6.28

To convert from radians to degrees, use degrees().

    import math

    print '{:^8} {:^8} {:^8}'.format('Radians', 'Degrees', 'Expected')
    print '{:-^8} {:-^8} {:-^8}'.format('', '', '')

    for rad, expected in [ (0, 0),
                           (math.pi/6, 30),
                           (math.pi/4, 45),
                           (math.pi/3, 60),
                           (math.pi/2, 90),
                           (math.pi, 180),
                           (3 * math.pi / 2, 270),
                           (2 * math.pi, 360),
                           ]:
        print '{:8.2f} {:8.2f} {:8.2f}'.format(rad,
                                               math.degrees(rad),
                                               expected,
                                               )

The formula is deg = rad * 180 / π.

    $ python math_degrees.py

    Radians  Degrees  Expected
    -------- -------- --------
        0.00     0.00     0.00
        0.52    30.00    30.00
        0.79    45.00    45.00
        1.05    60.00    60.00
        1.57    90.00    90.00
        3.14   180.00   180.00
        4.71   270.00   270.00
        6.28   360.00   360.00

5.4.9 Trigonometry

Trigonometric functions relate angles in a triangle to the lengths of its sides. They show up in formulas with periodic properties such as harmonics or circular motion, or when dealing with angles. All trigonometric functions in the standard library take angles expressed as radians.

Given an angle in a right triangle, the sine is the ratio of the length of the side opposite the angle to the hypotenuse (sin A = opposite/hypotenuse). The cosine is the ratio of the length of the adjacent side to the hypotenuse (cos A = adjacent/hypotenuse). And the tangent is the ratio of the opposite side to the adjacent side (tan A = opposite/adjacent).
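The three ratios can be checked against a concrete triangle. The following is a minimal sketch in Python 3 syntax (unlike the Python 2 listings in this section), assuming a hypothetical 3-4-5 right triangle; it uses math.atan2() to recover the angle from the two legs.

```python
import math

# Hypothetical 3-4-5 right triangle; A is the angle opposite the side
# of length 3.
opposite, adjacent, hypotenuse = 3.0, 4.0, 5.0

# atan2() recovers the angle, in radians, from the two legs.
angle = math.atan2(opposite, adjacent)

# Each function should reproduce the matching ratio of side lengths.
print('sin A = {:.2f}  opposite/hypotenuse = {:.2f}'.format(
    math.sin(angle), opposite / hypotenuse))
print('cos A = {:.2f}  adjacent/hypotenuse = {:.2f}'.format(
    math.cos(angle), adjacent / hypotenuse))
print('tan A = {:.2f}  opposite/adjacent   = {:.2f}'.format(
    math.tan(angle), opposite / adjacent))
```

Each line prints the same value twice, once from the trigonometric function and once from the side lengths.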
    import math

    print 'Degrees  Radians  Sine     Cosine    Tangent'
    print '-------  -------  -------  --------  -------'

    fmt = '  '.join(['%7.2f'] * 5)

    for deg in range(0, 361, 30):
        rad = math.radians(deg)
        if deg in (90, 270):
            t = float('inf')
        else:
            t = math.tan(rad)
        print fmt % (deg, rad, math.sin(rad), math.cos(rad), t)

The tangent can also be defined as the ratio of the sine of the angle to its cosine, and since the cosine is 0 for π/2 and 3π/2 radians, the tangent is infinite at those angles.

    $ python math_trig.py

    Degrees  Radians  Sine     Cosine    Tangent
    -------  -------  -------  --------  -------
       0.00     0.00     0.00     1.00     0.00
      30.00     0.52     0.50     0.87     0.58
      60.00     1.05     0.87     0.50     1.73
      90.00     1.57     1.00     0.00      inf
     120.00     2.09     0.87    -0.50    -1.73
     150.00     2.62     0.50    -0.87    -0.58
     180.00     3.14     0.00    -1.00    -0.00
     210.00     3.67    -0.50    -0.87     0.58
     240.00     4.19    -0.87    -0.50     1.73
     270.00     4.71    -1.00    -0.00      inf
     300.00     5.24    -0.87     0.50    -1.73
     330.00     5.76    -0.50     0.87    -0.58
     360.00     6.28    -0.00     1.00    -0.00

Given a point (x, y), the length of the hypotenuse for the triangle between the points [(0, 0), (x, 0), (x, y)] is (x**2 + y**2) ** 0.5, and can be computed with hypot().

    import math

    print '{:^7} {:^7} {:^10}'.format('X', 'Y', 'Hypotenuse')
    print '{:-^7} {:-^7} {:-^10}'.format('', '', '')

    for x, y in [ # simple points
                  (1, 1),
                  (-1, -1),
                  (math.sqrt(2), math.sqrt(2)),
                  (3, 4), # 3-4-5 triangle
                  # on the circle
                  (math.sqrt(2)/2, math.sqrt(2)/2), # pi/4 rads
                  (0.5, math.sqrt(3)/2), # pi/3 rads
                  ]:
        h = math.hypot(x, y)
        print '{:7.2f} {:7.2f} {:7.2f}'.format(x, y, h)

Points on the unit circle always have a hypotenuse equal to 1.

    $ python math_hypot.py

       X       Y    Hypotenuse
    ------- ------- ----------
       1.00    1.00    1.41
      -1.00   -1.00    1.41
       1.41    1.41    2.00
       3.00    4.00    5.00
       0.71    0.71    1.00
       0.50    0.87    1.00

The same function can be used to find the distance between two points.
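Wrapped as a reusable helper, the idea looks like the following minimal sketch. It is written in Python 3 syntax (unlike the Python 2 listings in this section), and the function name distance() is hypothetical, introduced only for illustration.

```python
import math


def distance(p, q):
    """Return the straight-line distance between 2-D points p and q."""
    (x1, y1), (x2, y2) = p, q
    # Shift one endpoint to the origin, then measure the hypotenuse.
    return math.hypot(x2 - x1, y2 - y1)


print(distance((0, 0), (3, 4)))
print(distance((-1, -1), (2, 3)))
```

Both calls describe a 3-4-5 triangle, so each distance should come out as 5.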
    import math

    print '{:^8} {:^8} {:^8} {:^8} {:^8}'.format(
        'X1', 'Y1', 'X2', 'Y2', 'Distance',
        )
    print '{:-^8} {:-^8} {:-^8} {:-^8} {:-^8}'.format(
        '', '', '', '', '',
        )

    for (x1, y1), (x2, y2) in [ ((5, 5), (6, 6)),
                                ((-6, -6), (-5, -5)),
                                ((0, 0), (3, 4)), # 3-4-5 triangle
                                ((-1, -1), (2, 3)), # 3-4-5 triangle
                                ]:
        x = x1 - x2
        y = y1 - y2
        h = math.hypot(x, y)
        print '{:8.2f} {:8.2f} {:8.2f} {:8.2f} {:8.2f}'.format(
            x1, y1, x2, y2, h,
            )

Use the difference in the x and y values to move one endpoint to the origin, and then pass the results to hypot().

    $ python math_distance_2_points.py

       X1       Y1       X2       Y2    Distance
    -------- -------- -------- -------- --------
        5.00     5.00     6.00     6.00     1.41
       -6.00    -6.00    -5.00    -5.00     1.41
        0.00     0.00     3.00     4.00     5.00
       -1.00    -1.00     2.00     3.00     5.00

math also defines inverse trigonometric functions.

    import math

    for r in [ 0, 0.5, 1 ]:
        print 'arcsine(%.1f)    = %5.2f' % (r, math.asin(r))
        print 'arccosine(%.1f)  = %5.2f' % (r, math.acos(r))
        print 'arctangent(%.1f) = %5.2f' % (r, math.atan(r))
        print

1.57 is roughly equal to π/2, or 90 degrees, the angle at which the sine is 1 and the cosine is 0.

    $ python math_inverse_trig.py

    arcsine(0.0)    =  0.00
    arccosine(0.0)  =  1.57
    arctangent(0.0) =  0.00

    arcsine(0.5)    =  0.52
    arccosine(0.5)  =  1.05
    arctangent(0.5) =  0.46

    arcsine(1.0)    =  1.57
    arccosine(1.0)  =  0.00
    arctangent(1.0) =  0.79

5.4.10 Hyperbolic Functions

Hyperbolic functions appear in linear differential equations and are used when working with electromagnetic fields, fluid dynamics, special relativity, and other advanced physics and mathematics.
    import math

    print '{:^6} {:^6} {:^6} {:^6}'.format(
        'X', 'sinh', 'cosh', 'tanh',
        )
    print '{:-^6} {:-^6} {:-^6} {:-^6}'.format('', '', '', '')

    fmt = ' '.join(['{:6.4f}'] * 4)

    for i in range(0, 11, 2):
        x = i/10.0
        print fmt.format(x, math.sinh(x), math.cosh(x), math.tanh(x))

Whereas the cosine and sine functions trace a circle, the hyperbolic cosine and hyperbolic sine form half of a hyperbola.

    $ python math_hyperbolic.py

      X     sinh   cosh   tanh
    ------ ------ ------ ------
    0.0000 0.0000 1.0000 0.0000
    0.2000 0.2013 1.0201 0.1974
    0.4000 0.4108 1.0811 0.3799
    0.6000 0.6367 1.1855 0.5370
    0.8000 0.8881 1.3374 0.6640
    1.0000 1.1752 1.5431 0.7616

Inverse hyperbolic functions acosh(), asinh(), and atanh() are also available.

5.4.11 Special Functions

The Gauss error function is used in statistics.

    import math

    print '{:^5} {:7}'.format('x', 'erf(x)')
    print '{:-^5} {:-^7}'.format('', '')

    for x in [ -3, -2, -1, -0.5, -0.25, 0, 0.25, 0.5, 1, 2, 3 ]:
        print '{:5.2f} {:7.4f}'.format(x, math.erf(x))

For the error function, erf(-x) == -erf(x).

    $ python math_erf.py

      x   erf(x)
    ----- -------
    -3.00 -1.0000
    -2.00 -0.9953
    -1.00 -0.8427
    -0.50 -0.5205
    -0.25 -0.2763
     0.00  0.0000
     0.25  0.2763
     0.50  0.5205
     1.00  0.8427
     2.00  0.9953
     3.00  1.0000

The complementary error function is 1 - erf(x).

    import math

    print '{:^5} {:7}'.format('x', 'erfc(x)')
    print '{:-^5} {:-^7}'.format('', '')

    for x in [ -3, -2, -1, -0.5, -0.25, 0, 0.25, 0.5, 1, 2, 3 ]:
        print '{:5.2f} {:7.4f}'.format(x, math.erfc(x))

The implementation of erfc() avoids the loss of precision that occurs for large values of x, where erf(x) is close to 1 and subtracting it from 1 would discard significant digits.

    $ python math_erfc.py

      x   erfc(x)
    ----- -------
    -3.00  2.0000
    -2.00  1.9953
    -1.00  1.8427
    -0.50  1.5205
    -0.25  1.2763
     0.00  1.0000
     0.25  0.7237
     0.50  0.4795
     1.00  0.1573
     2.00  0.0047
     3.00  0.0000

See Also:
math (http://docs.python.org/library/math.html) The standard library documentation for this module.
IEEE floating-point arithmetic in Python (http://www.johndcook.com/blog/2009/07/21/ieee-arithmetic-python/) Blog post by John Cook about how special values arise and are dealt with when doing math in Python.
SciPy (http://scipy.org/) Open source libraries for scientific and mathematical calculations in Python.

Chapter 6 THE FILE SYSTEM

Python’s standard library includes a large range of tools for working with files on the file system, building and parsing filenames, and examining file contents.

The first step in working with files is to determine the name of the file on which to work. Python represents filenames as simple strings, but provides tools for building them from standard, platform-independent components in os.path. List the contents of a directory with listdir() from os, or use glob to build a list of filenames from a pattern.

The filename pattern matching used by glob is also exposed directly through fnmatch, so it can be used in other contexts.

dircache provides an efficient way to scan and process the contents of a directory on the file system, and it is useful when processing files in situations where the names are not known in advance.

After the name of the file is identified, other characteristics, such as permissions or the file size, can be checked using os.stat() and the constants in stat.

When an application needs random access to files, linecache makes it easy to read lines by their line number. The contents of the file are maintained in a cache, so be careful of memory consumption.

tempfile is useful for cases that need to create scratch files to hold data temporarily, or before moving it to a permanent location. It provides classes to create temporary files and directories safely and securely. Names are guaranteed to be unique and include random components so they are not easily guessable.

Frequently, programs need to work on files as a whole, without regard to their content.
The shutil module includes high-level file operations, such as copying files and directories, and setting permissions.

The filecmp module compares files and directories by looking at the bytes they contain, but without any special knowledge about their format.

The built-in file class can be used to read and write files visible on local file systems. A program’s performance can suffer when it accesses large files through the read() and write() interfaces, though, since they both involve copying the data multiple times as it is moved from the disk to memory the application can see. Using mmap tells the operating system to use its virtual memory subsystem to map a file’s contents directly into memory accessible by a program, avoiding a copy step between the operating system and the internal buffer for the file object.

Text data using characters not available in ASCII is usually saved in a Unicode data format. Since the standard file handle assumes each byte of a text file represents one character, reading Unicode text with multibyte encodings requires extra processing. The codecs module handles the encoding and decoding automatically, so that in many cases, a non-ASCII file can be used without any other changes.

For testing code that depends on reading or writing data from files, StringIO provides an in-memory stream object that behaves like a file, but that does not reside on disk.

6.1 os.path—Platform-Independent Manipulation of Filenames

Purpose Parse, build, test, and otherwise work on filenames and paths.
Python Version 1.4 and later

Writing code to work with files on multiple platforms is easy using the functions included in the os.path module. Even programs not intended to be ported between platforms should use os.path for reliable filename parsing.

6.1.1 Parsing Paths

The first set of functions in os.path can be used to parse strings representing filenames into their component parts.
It is important to realize that these functions do not depend on the paths actually existing; they operate solely on the strings.

Path parsing depends on a few variables defined in os:

• os.sep: The separator between portions of the path (e.g., "/" or "\").
• os.extsep: The separator between a filename and the file "extension" (e.g., ".").
• os.pardir: The path component that means traverse the directory tree up one level (e.g., "..").
• os.curdir: The path component that refers to the current directory (e.g., ".").

The split() function breaks the path into two separate parts and returns a tuple with the results. The second element of the tuple is the last component of the path, and the first element is everything that comes before it.

    import os.path

    for path in [ '/one/two/three',
                  '/one/two/three/',
                  '/',
                  '.',
                  '']:
        print '%15s : %s' % (path, os.path.split(path))

When the input argument ends in os.sep, the "last element" of the path is an empty string.

    $ python ospath_split.py

     /one/two/three : ('/one/two', 'three')
    /one/two/three/ : ('/one/two/three', '')
                  / : ('/', '')
                  . : ('', '.')
                    : ('', '')

The basename() function returns a value equivalent to the second part of the split() value.

    import os.path

    for path in [ '/one/two/three',
                  '/one/two/three/',
                  '/',
                  '.',
                  '']:
        print '%15s : %s' % (path, os.path.basename(path))

The full path is stripped down to the last element, whether that refers to a file or directory. If the path ends in the directory separator (os.sep), the base portion is considered to be empty.

    $ python ospath_basename.py

     /one/two/three : three
    /one/two/three/ :
                  / :
                  . : .
                    :

The dirname() function returns the first part of the split path:

    import os.path

    for path in [ '/one/two/three',
                  '/one/two/three/',
                  '/',
                  '.',
                  '']:
        print '%15s : %s' % (path, os.path.dirname(path))

Combining the results of basename() with dirname() gives the original path.
    $ python ospath_dirname.py

     /one/two/three : /one/two
    /one/two/three/ : /one/two/three
                  / : /
                  . :
                    :

splitext() works like split(), but divides the path on the extension separator, rather than the directory separator.

    import os.path

    for path in [ 'filename.txt',
                  'filename',
                  '/path/to/filename.txt',
                  '/',
                  '',
                  'my-archive.tar.gz',
                  'no-extension.',
                  ]:
        print '%21s :' % path, os.path.splitext(path)

Only the last occurrence of os.extsep is used when looking for the extension, so if a filename has multiple extensions, splitting it leaves part of the extension on the prefix.

    $ python ospath_splitext.py

             filename.txt : ('filename', '.txt')
                 filename : ('filename', '')
    /path/to/filename.txt : ('/path/to/filename', '.txt')
                        / : ('/', '')
                          : ('', '')
        my-archive.tar.gz : ('my-archive.tar', '.gz')
            no-extension. : ('no-extension', '.')

commonprefix() takes a list of paths as an argument and returns a single string that represents a common prefix present in all paths. The value may represent a path that does not actually exist, and the path separator is not included in the consideration, so the prefix might not stop on a separator boundary.

    import os.path

    paths = ['/one/two/three/four',
             '/one/two/threefold',
             '/one/two/three/',
             ]
    for path in paths:
        print 'PATH:', path

    print
    print 'PREFIX:', os.path.commonprefix(paths)

In this example, the common prefix string is /one/two/three, even though one path does not include a directory named three.

    $ python ospath_commonprefix.py

    PATH: /one/two/three/four
    PATH: /one/two/threefold
    PATH: /one/two/three/

    PREFIX: /one/two/three

6.1.2 Building Paths

Besides taking existing paths apart, it is frequently necessary to build paths from other strings. To combine several path components into a single value, use join().
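One detail of join() worth seeing in isolation is how it treats a component that is itself an absolute path. The following is a minimal sketch in Python 3 syntax (unlike the Python 2 listings in this section). It imports posixpath, the POSIX implementation behind os.path, directly; that is an assumption made here so the results use "/" on every platform, and the example paths are hypothetical.

```python
import posixpath

# Relative components are joined with a single separator between them.
relative = posixpath.join('one', 'two', 'three')
print(relative)

# A component that starts with the separator is treated as a new root,
# so everything before it is discarded.
restarted = posixpath.join('/one', '/two', '/three')
print(restarted)
```

The second result is a reminder to strip leading separators from components that are meant to be relative, for example values read from user input, before passing them to join().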
    import os.path

    for parts in [ ('one', 'two', 'three'),
                   ('/', 'one', 'two', 'three'),
                   ('/one', '/two', '/three'),
                   ]:
        print parts, ':', os.path.join(*parts)

If any argument to join() begins with os.sep, all previous arguments are discarded and the new one becomes the beginning of the return value.

    $ python ospath_join.py

    ('one', 'two', 'three') : one/two/three
    ('/', 'one', 'two', 'three') : /one/two/three
    ('/one', '/two', '/three') : /three

It is also possible to work with paths that include "variable" components that can be expanded automatically. For example, expanduser() converts the tilde (~) character to the name of a user’s home directory.

    import os.path

    for user in [ '', 'dhellmann', 'postgresql' ]:
        lookup = '~' + user
        print '%12s : %s' % (lookup, os.path.expanduser(lookup))

If the user’s home directory cannot be found, the string is returned unchanged, as with ~postgresql in this example.

    $ python ospath_expanduser.py

               ~ : /Users/dhellmann
      ~dhellmann : /Users/dhellmann
     ~postgresql : ~postgresql

expandvars() is more general, and expands any shell environment variables present in the path.

    import os.path
    import os

    os.environ['MYVAR'] = 'VALUE'

    print os.path.expandvars('/path/to/$MYVAR')

No validation is performed to ensure that the variable value results in the name of a file that already exists.

    $ python ospath_expandvars.py

    /path/to/VALUE

6.1.3 Normalizing Paths

Paths assembled from separate strings using join() or with embedded variables might end up with extra separators or relative path components. Use normpath() to clean them up.

    import os.path

    for path in [ 'one//two//three',
                  'one/./two/./three',
                  'one/../alt/two/three',
                  ]:
        print '%20s : %s' % (path, os.path.normpath(path))

Path segments made up of os.curdir and os.pardir are evaluated and collapsed.
    $ python ospath_normpath.py

         one//two//three : one/two/three
       one/./two/./three : one/two/three
    one/../alt/two/three : alt/two/three

To convert a relative path to an absolute filename, use abspath().

    import os
    import os.path

    os.chdir('/tmp')

    for path in [ '.',
                  '..',
                  './one/two/three',
                  '../one/two/three',
                  ]:
        print '%17s : "%s"' % (path, os.path.abspath(path))

The result is a complete path, starting at the top of the file system tree.

    $ python ospath_abspath.py

                    . : "/private/tmp"
                   .. : "/private"
      ./one/two/three : "/private/tmp/one/two/three"
     ../one/two/three : "/private/one/two/three"

6.1.4 File Times

Besides working with paths, os.path includes functions for retrieving file properties, similar to the ones returned by os.stat().

    import os.path
    import time

    print 'File         :', __file__
    print 'Access time  :', time.ctime(os.path.getatime(__file__))
    print 'Modified time:', time.ctime(os.path.getmtime(__file__))
    print 'Change time  :', time.ctime(os.path.getctime(__file__))
    print 'Size         :', os.path.getsize(__file__)

os.path.getatime() returns the access time, os.path.getmtime() returns the modification time, and os.path.getctime() returns the system's ctime, which is the time of the last metadata change on UNIX-like systems and the creation time on Windows. os.path.getsize() returns the amount of data in the file, represented in bytes.

    $ python ospath_properties.py

    File         : ospath_properties.py
    Access time  : Sat Nov 27 12:19:50 2010
    Modified time: Sun Nov 14 09:40:36 2010
    Change time  : Tue Nov 16 08:07:32 2010
    Size         : 495

6.1.5 Testing Files

When a program encounters a path name, it often needs to know whether the path refers to a file, a directory, or a symlink and whether it exists. os.path includes functions for testing all these conditions.

    import os.path

    FILENAMES = [ __file__,
                  os.path.dirname(__file__),
                  '/',
                  './broken_link',
                  ]

    for file in FILENAMES:
        print 'File        :', file
        print 'Absolute    :', os.path.isabs(file)
        print 'Is File?    :', os.path.isfile(file)
        print 'Is Dir?     :', os.path.isdir(file)
        print 'Is Link?    :', os.path.islink(file)
        print 'Mountpoint? :', os.path.ismount(file)
        print 'Exists?     :', os.path.exists(file)
        print 'Link Exists?:', os.path.lexists(file)
        print

All test functions return Boolean values.

    $ ln -s /does/not/exist broken_link
    $ python ospath_tests.py

    File        : ospath_tests.py
    Absolute    : False
    Is File?    : True
    Is Dir?     : False
    Is Link?    : False
    Mountpoint? : False
    Exists?     : True
    Link Exists?: True

    File        :
    Absolute    : False
    Is File?    : False
    Is Dir?     : False
    Is Link?    : False
    Mountpoint? : False
    Exists?     : False
    Link Exists?: False

    File        : /
    Absolute    : True
    Is File?    : False
    Is Dir?     : True
    Is Link?    : False
    Mountpoint? : True
    Exists?     : True
    Link Exists?: True

    File        : ./broken_link
    Absolute    : False
    Is File?    : False
    Is Dir?     : False
    Is Link?    : True
    Mountpoint? : False
    Exists?     : False
    Link Exists?: True

6.1.6 Traversing a Directory Tree

os.path.walk() traverses all directories in a tree and calls a provided function, passing to it as arguments the directory name and the names of the contents of that directory.

    import os
    import os.path
    import pprint

    def visit(arg, dirname, names):
        print dirname, arg
        for name in names:
            subname = os.path.join(dirname, name)
            if os.path.isdir(subname):
                print '  %s/' % name
            else:
                print '  %s' % name
        print

    if not os.path.exists('example'):
        os.mkdir('example')
    if not os.path.exists('example/one'):
        os.mkdir('example/one')
    with open('example/one/file.txt', 'wt') as f:
        f.write('contents')
    with open('example/two.txt', 'wt') as f:
        f.write('contents')
    os.path.walk('example', visit, '(User data)')

This example produces a recursive directory listing.

    $ python ospath_walk.py

    example (User data)
      one/
      two.txt

    example/one (User data)
      file.txt

See Also:
os.path (http://docs.python.org/lib/module-os.path.html) Standard library documentation for this module.
os (page 1108) The os module is a parent of os.path.
time (page 173) The time module includes functions to convert between the representation used by the time property functions in os.path and easy-to-read strings.

6.2 glob—Filename Pattern Matching

Purpose Use UNIX shell rules to find filenames matching a pattern.
Python Version 1.4 and later

Even though the glob API is small, the module packs a lot of power. It is useful in any situation where a program needs to look for a list of files on the file system with names matching a pattern. To create a list of filenames that all have a certain extension, prefix, or any common string in the middle, use glob instead of writing custom code to scan the directory contents.

The pattern rules for glob are not the same as the regular expressions used by the re module. Instead, they follow standard UNIX path expansion rules. There are only a few special characters used to implement two different wildcards and character ranges. The pattern rules are applied to segments of the filename (stopping at the path separator, /). Paths in the pattern can be relative or absolute. Shell variable names and tilde (~) are not expanded.

6.2.1 Example Data

The examples in this section assume the following test files are present in the current working directory.

    $ python glob_maketestdata.py

    dir
    dir/file.txt
    dir/file1.txt
    dir/file2.txt
    dir/filea.txt
    dir/fileb.txt
    dir/subdir
    dir/subdir/subfile.txt

If these files do not exist, use glob_maketestdata.py in the sample code to create them before running the following examples.

6.2.2 Wildcards

An asterisk (*) matches zero or more characters in a segment of a name. For example, dir/*.

    import glob

    for name in glob.glob('dir/*'):
        print name

The pattern matches every path name (file or directory) in the directory “dir,” without recursing further into subdirectories.

    $ python glob_asterisk.py

    dir/file.txt
    dir/file1.txt
    dir/file2.txt
    dir/filea.txt
    dir/fileb.txt
    dir/subdir

To list files in a subdirectory, the subdirectory must be included in the pattern.

    import glob

    print 'Named explicitly:'
    for name in glob.glob('dir/subdir/*'):
        print '\t', name

    print 'Named with wildcard:'
    for name in glob.glob('dir/*/*'):
        print '\t', name

The first case lists the subdirectory name explicitly, while the second case depends on a wildcard to find the directory.

    $ python glob_subdir.py

    Named explicitly:
            dir/subdir/subfile.txt
    Named with wildcard:
            dir/subdir/subfile.txt

The results, in this case, are the same. If there were another subdirectory, the wildcard would match both subdirectories and include the filenames from both.

6.2.3 Single Character Wildcard

A question mark (?) is another wildcard character. It matches any single character in that position in the name.

    import glob

    for name in glob.glob('dir/file?.txt'):
        print name

This example matches all filenames that begin with file, have one more character of any type, and then end with .txt.

    $ python glob_question.py

    dir/file1.txt
    dir/file2.txt
    dir/filea.txt
    dir/fileb.txt

6.2.4 Character Ranges

Use a character range ([a-z]) instead of a question mark to match one of several characters. This example finds all files with a digit in the name before the extension.

    import glob

    for name in glob.glob('dir/*[0-9].*'):
        print name

The character range [0-9] matches any single digit. The range is ordered based on the character code for each letter/digit, and the dash indicates an unbroken range of sequential characters. The same range value could be written as [0123456789].

    $ python glob_charrange.py

    dir/file1.txt
    dir/file2.txt

See Also:
glob (http://docs.python.org/library/glob.html) The standard library documentation for this module.
Pattern Matching Notation (http://www.opengroup.org/onlinepubs/000095399/utilities/xcu_chap02.html#tag_02_13) An explanation of globbing from The Open Group’s Shell Command Language specification.
fnmatch (page 315) Filename-matching implementation.

6.3 linecache—Read Text Files Efficiently

Purpose Retrieve lines of text from files or imported Python modules, holding a cache of the results to make reading many lines from the same file more efficient.
Python Version 1.4 and later

The linecache module is used within other parts of the Python standard library when dealing with Python source files. The implementation of the cache holds the contents of files, parsed into separate lines, in memory. The API returns the requested line(s) by indexing into a list, and saves time over repeatedly reading the file and parsing lines to find the one desired. This method is especially useful when looking for multiple lines from the same file, such as when producing a traceback for an error report.

6.3.1 Test Data

This text produced by a Lorem Ipsum generator is used as sample input.

    import os
    import tempfile

    lorem = '''Lorem ipsum dolor sit amet, consectetuer
    adipiscing elit. Vivamus eget elit. In posuere mi non
    risus. Mauris id quam posuere lectus sollicitudin
    varius. Praesent at mi. Nunc eu velit. Sed augue massa,
    fermentum id, nonummy a, nonummy sit amet, ligula. Curabitur
    eros pede, egestas at, ultricies ac, apellentesque eu,
    tellus.

    Sed sed odio sed mi luctus mollis. Integer et nulla ac augue
    convallis accumsan. Ut felis. Donec lectus sapien, elementum
    nec, condimentum ac, interdum non, tellus. Aenean viverra,
    mauris vehicula semper porttitor, ipsum odio consectetuer
    lorem, ac imperdiet eros odio a sapien. Nulla mauris tellus,
    aliquam non, egestas a, nonummy et, erat. Vivamus sagittis
    porttitor eros.'''

    def make_tempfile():
        fd, temp_file_name = tempfile.mkstemp()
        os.close(fd)
        f = open(temp_file_name, 'wt')
        try:
            f.write(lorem)
        finally:
            f.close()
        return temp_file_name

    def cleanup(filename):
        os.unlink(filename)

6.3.2 Reading Specific Lines

The line numbers of files read by the linecache module start with 1, whereas Python lists are indexed from 0.

    import linecache
    from linecache_data import *

    filename = make_tempfile()

    # Pick out the same line from source and cache.
    # (Notice that linecache counts from 1)
    print 'SOURCE:'
    print '%r' % lorem.split('\n')[4]
    print
    print 'CACHE:'
    print '%r' % linecache.getline(filename, 5)

    cleanup(filename)

Each line returned includes a trailing newline.

    $ python linecache_getline.py

    SOURCE:
    'fermentum id, nonummy a, nonummy sit amet, ligula. Curabitur'

    CACHE:
    'fermentum id, nonummy a, nonummy sit amet, ligula. Curabitur\n'

6.3.3 Handling Blank Lines

The return value always includes the newline at the end of the line, so if the line is empty, the return value is just the newline.

    import linecache
    from linecache_data import *

    filename = make_tempfile()

    # Blank lines include the newline
    print 'BLANK : %r' % linecache.getline(filename, 8)

    cleanup(filename)

Line eight of the input file contains no text.

    $ python linecache_empty_line.py

    BLANK : '\n'

6.3.4 Error Handling

If the requested line number falls out of the range of valid lines in the file, getline() returns an empty string.

    import linecache
    from linecache_data import *

    filename = make_tempfile()

    # The cache always returns a string, and uses
    # an empty string to indicate a line which does
    # not exist.
    not_there = linecache.getline(filename, 500)
    print 'NOT THERE: %r includes %d characters' % \
        (not_there, len(not_there))

    cleanup(filename)

The input file has far fewer than 500 lines, so requesting line 500 is like trying to read past the end of the file.
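
Because a blank line comes back as a newline and a nonexistent line comes back as an empty string, a caller can tell the two apart. The following is a minimal sketch of that distinction, written for Python 3 (the listings in this section use Python 2). The helper name describe() and the three-line sample file are hypothetical and are not part of the book's sample code.

```python
import linecache
import os
import tempfile

# Hypothetical three-line sample file; line 2 is blank.
fd, filename = tempfile.mkstemp()
with os.fdopen(fd, 'wt') as f:
    f.write('first\n\nthird\n')


def describe(filename, lineno):
    """Classify a line as 'missing', 'blank', or 'text'."""
    line = linecache.getline(filename, lineno)
    if line == '':
        return 'missing'   # out of range, or the file could not be read
    if line.strip() == '':
        return 'blank'     # only the trailing newline came back
    return 'text'


results = [describe(filename, n) for n in (1, 2, 3, 500)]
print(results)

# Drop the cached lines before deleting the file they came from.
linecache.clearcache()
os.unlink(filename)
```

The check relies only on the documented return values of getline(), so it never needs a try/except block.
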

    $ python linecache_out_of_range.py

    NOT THERE: '' includes 0 characters

Reading from a file that does not exist is handled in the same way.

    import linecache

    # Errors are even hidden if linecache cannot find the file
    no_such_file = linecache.getline('this_file_does_not_exist.txt', 1)
    print 'NO FILE: %r' % no_such_file

The module never raises an exception when the caller tries to read data.

    $ python linecache_missing_file.py

    NO FILE: ''

6.3.5 Reading Python Source Files

Since linecache is used so heavily when producing tracebacks, one of its key features is the ability to find Python source modules in the import path by specifying the base name of the module.

    import linecache
    import os

    # Look for the linecache module, using
    # the built in sys.path search.
    module_line = linecache.getline('linecache.py', 3)
    print 'MODULE:'
    print repr(module_line)

    # Look at the linecache module source directly.
    file_src = linecache.__file__
    if file_src.endswith('.pyc'):
        file_src = file_src[:-1]
    print '\nFILE:'
    with open(file_src, 'r') as f:
        file_line = f.readlines()[2]
    print repr(file_line)

The cache population code in linecache searches sys.path for the named module if it cannot find a file with that name in the current directory. This example looks for linecache.py. Since there is no copy in the current directory, the file from the standard library is found instead.

    $ python linecache_path_search.py

    MODULE:
    'This is intended to read lines from modules imported -- hence if a filename\n'

    FILE:
    'This is intended to read lines from modules imported -- hence if a filename\n'

See Also:
linecache (http://docs.python.org/library/linecache.html) The standard library documentation for this module.
http://www.ipsum.com/ Lorem Ipsum generator.

6.4 tempfile—Temporary File System Objects

Purpose Create temporary file system objects.
Python Version 1.4 and later

Creating temporary files with unique names securely, so they cannot be guessed by someone wanting to break the application or steal the data, is challenging. The tempfile module provides several functions for creating temporary file system resources securely. TemporaryFile() opens and returns an unnamed file, NamedTemporaryFile() opens and returns a named file, and mkdtemp() creates a temporary directory and returns its name.

6.4.1 Temporary Files

Applications that need temporary files to store data, without needing to share those files with other programs, should use the TemporaryFile() function to create the files. The function creates a file, and on platforms where it is possible, unlinks it immediately. This makes it impossible for another program to find or open the file, since there is no reference to it in the file system table. The file created by TemporaryFile() is removed automatically when it is closed, whether by calling close() or by using the context manager API and with statement.

    import os
    import tempfile

    print 'Building a filename with PID:'
    filename = '/tmp/guess_my_name.%s.txt' % os.getpid()
    temp = open(filename, 'w+b')
    try:
        print 'temp:'
        print '  ', temp
        print 'temp.name:'
        print '  ', temp.name
    finally:
        temp.close()
        # Clean up the temporary file yourself
        os.remove(filename)

    print
    print 'TemporaryFile:'
    temp = tempfile.TemporaryFile()
    try:
        print 'temp:'
        print '  ', temp
        print 'temp.name:'
        print '  ', temp.name
    finally:
        # Automatically cleans up the file
        temp.close()

This example illustrates the difference in creating a temporary file using a common pattern for making up a name, versus using the TemporaryFile() function. The file returned by TemporaryFile() has no name.

    $ python tempfile_TemporaryFile.py

    Building a filename with PID:
    temp:
       <open file '/tmp/guess_my_name.1074.txt', mode 'w+b' at 0x...>
    temp.name:
       /tmp/guess_my_name.1074.txt

    TemporaryFile:
    temp:
       <open file '<fdopen>', mode 'w+b' at 0x100d88780>
    temp.name:
       <fdopen>

By default, the file handle is created with mode 'w+b' so it behaves consistently on all platforms, and the caller can write to it and read from it.

    import os
    import tempfile

    with tempfile.TemporaryFile() as temp:
        temp.write('Some data')
        temp.seek(0)

        print temp.read()

After writing, the file handle must be “rewound” using seek() in order to read the data back from it.

    $ python tempfile_TemporaryFile_binary.py

    Some data

To open the file in text mode, set mode to 'w+t' when the file is created.

    import tempfile

    with tempfile.TemporaryFile(mode='w+t') as f:
        f.writelines(['first\n', 'second\n'])
        f.seek(0)

        for line in f:
            print line.rstrip()

The file handle treats the data as text.

    $ python tempfile_TemporaryFile_text.py

    first
    second

6.4.2 Named Files

There are situations where having a named temporary file is important. For applications spanning multiple processes, or even hosts, naming the file is the simplest way to pass it between parts of the application. The NamedTemporaryFile() function creates a file without unlinking it, so the file retains its name (accessed with the name attribute).

    import os
    import tempfile

    with tempfile.NamedTemporaryFile() as temp:
        print 'temp:'
        print '  ', temp
        print 'temp.name:'
        print '  ', temp.name
    print 'Exists after close:', os.path.exists(temp.name)

The file is removed after the handle is closed.

    $ python tempfile_NamedTemporaryFile.py

    temp:
       <open file '<fdopen>', mode 'w+b' at 0x100d881e0>
    temp.name:
       /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmp926BkT
    Exists after close: False

6.4.3 Temporary Directories

When several temporary files are needed, it may be more convenient to create a single temporary directory with mkdtemp() and open all the files in that directory.
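
As a minimal sketch of that pattern, the following Python 3 code (the listings in this chapter use Python 2) creates one temporary directory, writes two files into it, and then removes the whole tree with shutil.rmtree(). The helper name write_scratch_files() and the filenames are hypothetical, chosen only for illustration.

```python
import os
import shutil
import tempfile


def write_scratch_files(names):
    """Create a temporary directory and write one small file per name.

    Returns the directory path and the sorted names it contains.
    The caller is responsible for removing the directory.
    """
    directory_name = tempfile.mkdtemp()
    for name in names:
        with open(os.path.join(directory_name, name), 'wt') as f:
            f.write('contents of %s' % name)
    return directory_name, sorted(os.listdir(directory_name))


directory_name, created = write_scratch_files(['a.txt', 'b.txt'])
print(created)

# mkdtemp() never cleans up after itself; remove the tree explicitly.
shutil.rmtree(directory_name)
print(os.path.exists(directory_name))
```

shutil.rmtree() is used here because the directory is no longer empty by the time it is removed.
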

    import os
    import tempfile

    directory_name = tempfile.mkdtemp()
    print directory_name
    # Clean up the directory
    os.removedirs(directory_name)

Since the directory is not “opened” per se, it must be removed explicitly when it is no longer needed.

    $ python tempfile_mkdtemp.py

    /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmpA7DKtP

6.4.4 Predicting Names

While less secure than strictly anonymous temporary files, including a predictable portion in the name makes it possible to find the file and examine it for debugging purposes. All functions described so far take three arguments to control the filenames to some degree. Names are generated using the following formula.

    dir + prefix + random + suffix

All values except random can be passed as arguments to TemporaryFile(), NamedTemporaryFile(), and mkdtemp(). For example:

    import tempfile

    with tempfile.NamedTemporaryFile(
        suffix='_suffix', prefix='prefix_', dir='/tmp',
        ) as temp:
        print 'temp:'
        print '  ', temp
        print 'temp.name:'
        print '  ', temp.name

The prefix and suffix arguments are combined with a random string of characters to build the filename, and the dir argument is taken as is and used as the location of the new file.

    $ python tempfile_NamedTemporaryFile_args.py

    temp:
       <open file '<fdopen>', mode 'w+b' at 0x100d881e0>
    temp.name:
       /tmp/prefix_kjvHYS_suffix

6.4.5 Temporary File Location

If an explicit destination is not given using the dir argument, the path used for the temporary files will vary based on the current platform and settings. The tempfile module includes two functions for querying the settings being used at runtime.

    import tempfile

    print 'gettempdir():', tempfile.gettempdir()
    print 'gettempprefix():', tempfile.gettempprefix()

gettempdir() returns the default directory that will hold all temporary files and gettempprefix() returns the string prefix for new file and directory names.

    $ python tempfile_settings.py

    gettempdir(): /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-
    gettempprefix(): tmp

The value returned by gettempdir() is set based on a straightforward algorithm of looking through five locations for the first place the current process can create a file. This is the search list.

1. The environment variable TMPDIR
2. The environment variable TEMP
3. The environment variable TMP
4. A fallback, based on the platform. (RiscOS uses Wimp$ScrapDir. Windows uses the first available of C:\TEMP, C:\TMP, \TEMP, or \TMP. Other platforms use /tmp, /var/tmp, or /usr/tmp.)
5. If no other directory can be found, the current working directory is used.

Programs that need to use a global location for all temporary files without using any of these environment variables should set tempfile.tempdir directly by assigning a value to the variable.

    import tempfile

    tempfile.tempdir = '/I/changed/this/path'
    print 'gettempdir():', tempfile.gettempdir()

    $ python tempfile_tempdir.py

    gettempdir(): /I/changed/this/path

See Also:
tempfile (http://docs.python.org/lib/module-tempfile.html) Standard library documentation for this module.

6.5 shutil—High-Level File Operations

Purpose High-level file operations.
Python Version 1.4 and later

The shutil module includes high-level file operations such as copying and setting permissions.

6.5.1 Copying Files

copyfile() copies the contents of the source to the destination and raises IOError if it does not have permission to write to the destination file.

    from shutil import *
    from glob import glob

    print 'BEFORE:', glob('shutil_copyfile.*')
    copyfile('shutil_copyfile.py', 'shutil_copyfile.py.copy')
    print 'AFTER:', glob('shutil_copyfile.*')

Because the function opens the input file for reading, regardless of its type, special files (such as UNIX device nodes) cannot be copied as new special files with copyfile().

    $ python shutil_copyfile.py

    BEFORE: ['shutil_copyfile.py']
    AFTER: ['shutil_copyfile.py', 'shutil_copyfile.py.copy']

The implementation of copyfile() uses the lower-level function copyfileobj(). While the arguments to copyfile() are filenames, the arguments to copyfileobj() are open file handles. The optional third argument is a buffer length to use for reading in blocks.

    from shutil import *
    import os
    from StringIO import StringIO
    import sys

    class VerboseStringIO(StringIO):
        def read(self, n=-1):
            next = StringIO.read(self, n)
            print 'read(%d) bytes' % n
            return next

    lorem_ipsum = '''Lorem ipsum dolor sit amet, consectetuer adipiscing
    elit. Vestibulum aliquam mollis dolor. Donec vulputate nunc ut diam.
    Ut rutrum mi vel sem. Vestibulum ante ipsum.'''

    print 'Default:'
    input = VerboseStringIO(lorem_ipsum)
    output = StringIO()
    copyfileobj(input, output)

    print

    print 'All at once:'
    input = VerboseStringIO(lorem_ipsum)
    output = StringIO()
    copyfileobj(input, output, -1)

    print

    print 'Blocks of 256:'
    input = VerboseStringIO(lorem_ipsum)
    output = StringIO()
    copyfileobj(input, output, 256)

The default behavior is to read using large blocks. Use -1 to read all the input at one time, or use a positive integer to set a specific block size. This example uses several different block sizes to show the effect.

    $ python shutil_copyfileobj.py

    Default:
    read(16384) bytes
    read(16384) bytes

    All at once:
    read(-1) bytes
    read(-1) bytes

    Blocks of 256:
    read(256) bytes
    read(256) bytes

The copy() function interprets the output name like the UNIX command line tool cp. If the named destination refers to a directory instead of a file, a new file is created in the directory using the base name of the source.

    from shutil import *
    import os

    os.mkdir('example')
    print 'BEFORE:', os.listdir('example')
    copy('shutil_copy.py', 'example')
    print 'AFTER:', os.listdir('example')

The permissions of the file are copied along with the contents.
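
One way to check that the permissions travel with the contents is to compare the permission bits of the source and the copy. This is a minimal sketch under stated assumptions: it is written for Python 3 (where copy() returns the path of the new file), and the scratch directory and file names are hypothetical.

```python
import os
import shutil
import stat
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'source.txt')
dest_dir = os.path.join(workdir, 'example')
os.mkdir(dest_dir)

with open(source, 'wt') as f:
    f.write('contents')
os.chmod(source, 0o640)

# The destination is a directory, so copy() reuses the source base name.
new_path = shutil.copy(source, dest_dir)

# S_IMODE() strips the file-type bits and leaves only the permissions.
source_mode = stat.S_IMODE(os.stat(source).st_mode)
dest_mode = stat.S_IMODE(os.stat(new_path).st_mode)
print(os.path.basename(new_path), oct(source_mode), oct(dest_mode))

shutil.rmtree(workdir)
```
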

    $ python shutil_copy.py

    BEFORE: []
    AFTER: ['shutil_copy.py']

copy2() works like copy(), but includes the access and modification times in the metadata copied to the new file.

    from shutil import *
    import os
    import time

    def show_file_info(filename):
        stat_info = os.stat(filename)
        print '\tMode    :', stat_info.st_mode
        print '\tCreated :', time.ctime(stat_info.st_ctime)
        print '\tAccessed:', time.ctime(stat_info.st_atime)
        print '\tModified:', time.ctime(stat_info.st_mtime)

    os.mkdir('example')
    print 'SOURCE:'
    show_file_info('shutil_copy2.py')
    copy2('shutil_copy2.py', 'example')
    print 'DEST:'
    show_file_info('example/shutil_copy2.py')

The new file has all the same characteristics as the old version.

    $ python shutil_copy2.py

    SOURCE:
            Mode    : 33188
            Created : Sat Dec  4 10:41:32 2010
            Accessed: Sat Dec  4 17:41:01 2010
            Modified: Sun Nov 14 09:40:36 2010
    DEST:
            Mode    : 33188
            Created : Sat Dec  4 17:41:01 2010
            Accessed: Sat Dec  4 17:41:01 2010
            Modified: Sun Nov 14 09:40:36 2010

6.5.2 Copying File Metadata

By default when a new file is created under UNIX, it receives permissions based on the umask of the current user. To copy the permissions from one file to another, use copymode().

    from shutil import *
    from commands import *
    import os

    with open('file_to_change.txt', 'wt') as f:
        f.write('content')
    os.chmod('file_to_change.txt', 0444)

    print 'BEFORE:'
    print getstatus('file_to_change.txt')
    copymode('shutil_copymode.py', 'file_to_change.txt')
    print 'AFTER :'
    print getstatus('file_to_change.txt')

First, create a file to be modified.

    #!/bin/sh
    # Set up file needed by shutil_copymode.py
    touch file_to_change.txt
    chmod ugo+w file_to_change.txt

Then, run the example script to change the permissions.

    $ python shutil_copymode.py

    BEFORE:
    -r--r--r-- 1 dhellmann dhellmann 7 Dec 4 17:41 file_to_change.txt
    AFTER :
    -rw-r--r-- 1 dhellmann dhellmann 7 Dec 4 17:41 file_to_change.txt

To copy other metadata about the file use copystat().

    from shutil import *
    import os
    import time

    def show_file_info(filename):
        stat_info = os.stat(filename)
        print '\tMode    :', stat_info.st_mode
        print '\tCreated :', time.ctime(stat_info.st_ctime)
        print '\tAccessed:', time.ctime(stat_info.st_atime)
        print '\tModified:', time.ctime(stat_info.st_mtime)

    with open('file_to_change.txt', 'wt') as f:
        f.write('content')
    os.chmod('file_to_change.txt', 0444)

    print 'BEFORE:'
    show_file_info('file_to_change.txt')
    copystat('shutil_copystat.py', 'file_to_change.txt')
    print 'AFTER:'
    show_file_info('file_to_change.txt')

Only the permissions and dates associated with the file are duplicated with copystat().

    $ python shutil_copystat.py

    BEFORE:
            Mode    : 33060
            Created : Sat Dec  4 17:41:01 2010
            Accessed: Sat Dec  4 17:41:01 2010
            Modified: Sat Dec  4 17:41:01 2010
    AFTER:
            Mode    : 33188
            Created : Sat Dec  4 17:41:01 2010
            Accessed: Sat Dec  4 17:41:01 2010
            Modified: Sun Nov 14 09:45:12 2010

6.5.3 Working with Directory Trees

shutil includes three functions for working with directory trees. To copy a directory from one place to another, use copytree(). It recurses through the source directory tree, copying files to the destination. The destination directory must not exist in advance.

Note: The documentation for copytree() says it should be considered a sample implementation, rather than a tool. Consider starting with the current implementation and making it more robust, or adding features like a progress meter, before using it.

    from shutil import *
    from commands import *

    print 'BEFORE:'
    print getoutput('ls -rlast /tmp/example')
    copytree('../shutil', '/tmp/example')
    print '\nAFTER:'
    print getoutput('ls -rlast /tmp/example')

The symlinks argument controls whether symbolic links are copied as links or as files. The default is to copy the contents to new files. If the option is true, new symlinks are created within the destination tree.
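
The effect of the symlinks argument can be sketched by copying the same small tree twice. The code below assumes Python 3 and a platform where os.symlink() is permitted (UNIX-like systems). The directory and file names are hypothetical.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'src')
os.mkdir(src)
with open(os.path.join(src, 'real.txt'), 'wt') as f:
    f.write('contents')
# A relative link, so it still resolves inside each copied tree.
os.symlink('real.txt', os.path.join(src, 'link.txt'))

as_files = os.path.join(workdir, 'as_files')
as_links = os.path.join(workdir, 'as_links')
shutil.copytree(src, as_files)                 # default: copy link targets
shutil.copytree(src, as_links, symlinks=True)  # recreate the links

followed = os.path.islink(os.path.join(as_files, 'link.txt'))
preserved = os.path.islink(os.path.join(as_links, 'link.txt'))
print(followed, preserved)

shutil.rmtree(workdir)
```

In the first copy link.txt is a regular file holding the same data as real.txt, while in the second copy it is still a symbolic link.
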

    $ python shutil_copytree.py

    BEFORE:
    ls: /tmp/example: No such file or directory

    AFTER:
    total 136
    8 -rwxr-xr-x 1 dhellmann wheel 109 Oct 28 07:33 shutil_copymode.sh
    8 -rw-r--r-- 1 dhellmann wheel 1313 Nov 14 09:39 shutil_rmtree.py
    8 -rw-r--r-- 1 dhellmann wheel 1300 Nov 14 09:39 shutil_copyfile.py
    8 -rw-r--r-- 1 dhellmann wheel 1276 Nov 14 09:39 shutil_copy.py
    8 -rw-r--r-- 1 dhellmann wheel 1140 Nov 14 09:39 __init__.py
    8 -rw-r--r-- 1 dhellmann wheel 1595 Nov 14 09:40 shutil_copy2.py
    8 -rw-r--r-- 1 dhellmann wheel 1729 Nov 14 09:45 shutil_copystat.py
    8 -rw-r--r-- 1 dhellmann wheel 7 Nov 14 09:45 file_to_change.txt
    8 -rw-r--r-- 1 dhellmann wheel 1324 Nov 14 09:45 shutil_move.py
    8 -rw-r--r-- 1 dhellmann wheel 419 Nov 27 12:49 shutil_copymode.py
    8 -rw-r--r-- 1 dhellmann wheel 1331 Dec 1 21:51 shutil_copytree.py
    8 -rw-r--r-- 1 dhellmann wheel 816 Dec 4 17:39 shutil_copyfileobj.py
    8 -rw-r--r-- 1 dhellmann wheel 8 Dec 4 17:39 example.out
    24 -rw-r--r-- 1 dhellmann wheel 9767 Dec 4 17:40 index.rst
    8 -rw-r--r-- 1 dhellmann wheel 1300 Dec 4 17:41 shutil_copyfile.py.copy
    0 drwxr-xr-x 3 dhellmann wheel 102 Dec 4 17:41 example
    0 drwxrwxrwt 18 root wheel 612 Dec 4 17:41 ..
    0 drwxr-xr-x 18 dhellmann wheel 612 Dec 4 17:41 .

To remove a directory and its contents, use rmtree().

    from shutil import *
    from commands import *

    print 'BEFORE:'
    print getoutput('ls -rlast /tmp/example')
    rmtree('/tmp/example')
    print 'AFTER:'
    print getoutput('ls -rlast /tmp/example')

Errors are raised as exceptions by default, but can be ignored if the second argument is true. A special error-handler function can be provided in the third argument.
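
The difference between the default behavior and a true second argument can be sketched by removing a path that does not exist. This example assumes Python 3, where the failure surfaces as an OSError subclass. The error-handler argument is not shown, and the variable names are chosen only for illustration.

```python
import os
import shutil
import tempfile

# Create and immediately remove a directory so the path is known
# not to exist when rmtree() is called.
missing = tempfile.mkdtemp()
os.rmdir(missing)

# Default behavior: the failure is raised as an exception.
try:
    shutil.rmtree(missing)
    raised = False
except OSError as err:
    raised = True
    print('raised:', type(err).__name__)

# With a true second argument the same failure is silently ignored.
shutil.rmtree(missing, True)
print('ignored')
```

Ignoring errors is convenient for best-effort cleanup, but it also hides real problems such as permission failures, so the default is usually the safer choice.
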

    $ python shutil_rmtree.py

    BEFORE:
    total 136
    8 -rwxr-xr-x 1 dhellmann wheel 109 Oct 28 07:33 shutil_copymode.sh
    8 -rw-r--r-- 1 dhellmann wheel 1313 Nov 14 09:39 shutil_rmtree.py
    8 -rw-r--r-- 1 dhellmann wheel 1300 Nov 14 09:39 shutil_copyfile.py
    8 -rw-r--r-- 1 dhellmann wheel 1276 Nov 14 09:39 shutil_copy.py
    8 -rw-r--r-- 1 dhellmann wheel 1140 Nov 14 09:39 __init__.py
    8 -rw-r--r-- 1 dhellmann wheel 1595 Nov 14 09:40 shutil_copy2.py
    8 -rw-r--r-- 1 dhellmann wheel 1729 Nov 14 09:45 shutil_copystat.py
    8 -rw-r--r-- 1 dhellmann wheel 7 Nov 14 09:45 file_to_change.txt
    8 -rw-r--r-- 1 dhellmann wheel 1324 Nov 14 09:45 shutil_move.py
    8 -rw-r--r-- 1 dhellmann wheel 419 Nov 27 12:49 shutil_copymode.py
    8 -rw-r--r-- 1 dhellmann wheel 1331 Dec 1 21:51 shutil_copytree.py
    8 -rw-r--r-- 1 dhellmann wheel 816 Dec 4 17:39 shutil_copyfileobj.py
    8 -rw-r--r-- 1 dhellmann wheel 8 Dec 4 17:39 example.out
    24 -rw-r--r-- 1 dhellmann wheel 9767 Dec 4 17:40 index.rst
    8 -rw-r--r-- 1 dhellmann wheel 1300 Dec 4 17:41 shutil_copyfile.py.copy
    0 drwxr-xr-x 3 dhellmann wheel 102 Dec 4 17:41 example
    0 drwxrwxrwt 18 root wheel 612 Dec 4 17:41 ..
    0 drwxr-xr-x 18 dhellmann wheel 612 Dec 4 17:41 .
    AFTER:
    ls: /tmp/example: No such file or directory

To move a file or directory from one place to another, use move().

    from shutil import *
    from glob import glob

    with open('example.txt', 'wt') as f:
        f.write('contents')

    print 'BEFORE: ', glob('example*')
    move('example.txt', 'example.out')
    print 'AFTER : ', glob('example*')

The semantics are similar to those of the UNIX command mv. If the source and destination are within the same file system, the source is renamed. Otherwise, the source is copied to the destination and then the source is removed.

    $ python shutil_move.py

    BEFORE: ['example.txt']
    AFTER : ['example.out']

See Also:
shutil (http://docs.python.org/lib/module-shutil.html) Standard library documentation for this module.

6.6 mmap—Memory-Map Files

Purpose Memory-map files instead of reading the contents directly.
Python Version 2.1 and later

Memory-mapping a file uses the operating system virtual memory system to access the data on the file system directly, instead of using normal I/O functions. Memory-mapping typically improves I/O performance because it does not involve a separate system call for each access and it does not require copying data between buffers—the memory is accessed directly by both the kernel and the user application.

Memory-mapped files can be treated as mutable strings or file-like objects, depending on the need. A mapped file supports the expected file API methods, such as close(), flush(), read(), readline(), seek(), tell(), and write(). It also supports the string API, with features such as slicing and methods like find().

All the examples use the text file lorem.txt, containing a bit of Lorem Ipsum. For reference, the text of the file follows.

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
    egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo,
    a elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
    facilisi. Sed tristique eros eu libero. Pellentesque vel
    arcu. Vivamus purus orci, iaculis ac, suscipit sit amet, pulvinar
    eu, lacus. Praesent placerat tortor sed nisl. Nunc blandit diam
    egestas dui. Pellentesque habitant morbi tristique senectus et
    netus et malesuada fames ac turpis egestas. Aliquam viverra
    fringilla leo. Nulla feugiat augue eleifend nulla. Vivamus
    mauris. Vivamus sed mauris in nibh placerat egestas. Suspendisse
    potenti. Mauris massa. Ut eget velit auctor tortor blandit
    sollicitudin. Suspendisse imperdiet justo.

Note: There are differences in the arguments and behaviors for mmap() between UNIX and Windows. These differences are not fully discussed here. For more details, refer to the standard library documentation.
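
To make the file-like and string-like views concrete, here is a minimal sketch written for Python 3, where a mapped file behaves like a mutable sequence of bytes rather than a str. The temporary file is a hypothetical stand-in for lorem.txt, so the example is self-contained.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for lorem.txt.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'Lorem ipsum dolor sit amet, consectetuer adipiscing elit.\n')

with open(path, 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        first_word = m.read(5)         # file-style read
        position = m.tell()            # the read advanced the pointer
        offset = m.find(b'dolor')      # string-style search
        found = m[offset:offset + 5]   # string-style slicing
    finally:
        m.close()

print(first_word, position, offset, found)
os.unlink(path)
```

The same object answers both kinds of calls, which is why the examples in this section can mix read() and seek() with slicing and find().
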

6.6.1 Reading

Use the mmap() function to create a memory-mapped file. The first argument is a file descriptor, either from the fileno() method of a file object or from os.open(). The caller is responsible for opening the file before invoking mmap() and closing it after it is no longer needed.

The second argument to mmap() is a size in bytes for the portion of the file to map. If the value is 0, the entire file is mapped. On Windows, if the size is larger than the current size of the file, the file is extended.

Note: Windows does not support creating a zero-length mapping.

An optional keyword argument, access, is supported by both platforms. Use ACCESS_READ for read-only access, ACCESS_WRITE for write-through (assignments to memory go directly to the file), or ACCESS_COPY for copy-on-write (assignments to memory are not written to the file).

    import mmap
    import contextlib

    with open('lorem.txt', 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                          access=mmap.ACCESS_READ)
                                ) as m:
            print 'First 10 bytes via read :', m.read(10)
            print 'First 10 bytes via slice:', m[:10]
            print '2nd   10 bytes via read :', m.read(10)

The file pointer tracks the last byte read through the file API. In this example, the pointer moves ahead 10 bytes after the first read. The slice operation indexes the mapped memory directly and leaves the pointer where it was, so calling read() again after the slice gives bytes 11–20 in the file.

    $ python mmap_read.py

    First 10 bytes via read : Lorem ipsu
    First 10 bytes via slice: Lorem ipsu
    2nd   10 bytes via read : m dolor si

6.6.2 Writing

To set up the memory-mapped file to receive updates, start by opening it for updating with mode 'r+' (not 'w') before mapping it. Then use any of the API methods that change the data (write(), assignment to a slice, etc.).

The next example uses the default access mode of ACCESS_WRITE and assigns to a slice to modify part of a line in place.

    import mmap
    import shutil
    import contextlib

    # Copy the example file
    shutil.copyfile('lorem.txt', 'lorem_copy.txt')

    word = 'consectetuer'
    reversed = word[::-1]
    print 'Looking for    :', word
    print 'Replacing with :', reversed

    with open('lorem_copy.txt', 'r+') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0)) as m:
            print 'Before:'
            print m.readline().rstrip()
            m.seek(0) # rewind

            loc = m.find(word)
            m[loc:loc+len(word)] = reversed
            m.flush()

            m.seek(0) # rewind
            print 'After :'
            print m.readline().rstrip()

            f.seek(0) # rewind
            print 'File  :'
            print f.readline().rstrip()

The word “consectetuer” is replaced in the middle of the first line in memory and in the file.

    $ python mmap_write_slice.py

    Looking for    : consectetuer
    Replacing with : reutetcesnoc
    Before:
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
    After :
    Lorem ipsum dolor sit amet, reutetcesnoc adipiscing elit. Donec
    File  :
    Lorem ipsum dolor sit amet, reutetcesnoc adipiscing elit. Donec

Copy Mode

Using the access setting ACCESS_COPY does not write changes to the file on disk.

    import mmap
    import shutil
    import contextlib

    # Copy the example file
    shutil.copyfile('lorem.txt', 'lorem_copy.txt')

    word = 'consectetuer'
    reversed = word[::-1]

    with open('lorem_copy.txt', 'r+') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                          access=mmap.ACCESS_COPY)
                                ) as m:
            print 'Memory Before:'
            print m.readline().rstrip()
            print 'File Before  :'
            print f.readline().rstrip()
            print

            m.seek(0) # rewind
            loc = m.find(word)
            m[loc:loc+len(word)] = reversed

            m.seek(0) # rewind
            print 'Memory After :'
            print m.readline().rstrip()

            f.seek(0)
            print 'File After   :'
            print f.readline().rstrip()

It is necessary to rewind the file handle in this example separately from the mmap handle, because the internal state of the two objects is maintained separately.

    $ python mmap_write_copy.py

    Memory Before:
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
    File Before  :
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec

    Memory After :
    Lorem ipsum dolor sit amet, reutetcesnoc adipiscing elit. Donec
    File After   :
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec

6.6.3 Regular Expressions

Since a memory-mapped file can act like a string, it can be used with other modules that operate on strings, such as regular expressions. This example finds all sentences with “nulla” in them.

    import mmap
    import re
    import contextlib

    pattern = re.compile(r'(\.\W+)?([^.]?nulla[^.]*?\.)',
                         re.DOTALL | re.IGNORECASE | re.MULTILINE)

    with open('lorem.txt', 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                          access=mmap.ACCESS_READ)
                                ) as m:
            for match in pattern.findall(m):
                print match[1].replace('\n', ' ')

Because the pattern includes two groups, the return value from findall() is a sequence of tuples. The print statement pulls out the matching sentence and replaces newlines with spaces so each result prints on a single line.

    $ python mmap_regex.py

    Nulla facilisi.
    Nulla feugiat augue eleifend nulla.

See Also:
mmap (http://docs.python.org/lib/module-mmap.html) Standard library documentation for this module.
os (page 1108) The os module.
contextlib (page 163) Use the closing() function to create a context manager for a memory-mapped file.
re (page 13) Regular expressions.

6.7 codecs—String Encoding and Decoding

Purpose Encoders and decoders for converting text between different representations.
Python Version 2.1 and later

The codecs module provides stream interfaces and file interfaces for transcoding data. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes.

6.7.1 Unicode Primer

CPython 2.x supports two types of strings for working with text data. Old-style str instances use a single 8-bit byte to represent each character of the string using its ASCII code.
In contrast, unicode strings are managed internally as a sequence of Unicode code points. The code-point values are saved as a sequence of two or four bytes each, depending on the options given when Python is compiled. Both unicode and str are derived from a common base class and support a similar API.

When unicode strings are output, they are encoded using one of several standard schemes so that the sequence of bytes can be reconstructed as the same text string later. The bytes of the encoded value are not necessarily the same as the code-point values, and the encoding defines a way to translate between the two value sets. Reading Unicode data also requires knowing the encoding so that the incoming bytes can be converted to the internal representation used by the unicode class.

The most common encodings for Western languages are UTF-8 and UTF-16, which use sequences of one- and two-byte values, respectively, to represent each code point. Other encodings can be more efficient for storing languages where most of the characters are represented by code points that do not fit into two bytes.

See Also:
For more introductory information about Unicode, refer to the list of references at the end of this section. The Python Unicode HOWTO is especially helpful.

Encodings

The best way to understand encodings is to look at the different series of bytes produced by encoding the same string in different ways. The following examples use this function to format the byte string to make it easier to read.

    import binascii

    def to_hex(t, nbytes):
        """Format text t as a sequence of nbyte long values
        separated by spaces.
        """
        chars_per_item = nbytes * 2
        hex_version = binascii.hexlify(t)
        return ' '.join(
            hex_version[start:start + chars_per_item]
            for start in xrange(0, len(hex_version), chars_per_item)
            )

    if __name__ == '__main__':
        print to_hex('abcdef', 1)
        print to_hex('abcdef', 2)

The function uses binascii to get a hexadecimal representation of the input byte string and then inserts a space between every nbytes bytes before returning the value.

    $ python codecs_to_hex.py

    61 62 63 64 65 66
    6162 6364 6566

The first encoding example begins by printing the text 'pi: π' using the raw representation of the unicode class. The π character is replaced with the expression for its Unicode code point, \u03c0. The next two lines encode the string as UTF-8 and UTF-16, respectively, and show the hexadecimal values resulting from the encoding.

    from codecs_to_hex import to_hex

    text = u'pi: π'

    print 'Raw   :', repr(text)
    print 'UTF-8 :', to_hex(text.encode('utf-8'), 1)
    print 'UTF-16:', to_hex(text.encode('utf-16'), 2)

The result of encoding a unicode string is a str object.

    $ python codecs_encodings.py

    Raw   : u'pi: \u03c0'
    UTF-8 : 70 69 3a 20 cf 80
    UTF-16: fffe 7000 6900 3a00 2000 c003

Given a sequence of encoded bytes as a str instance, the decode() method translates them to code points and returns the sequence as a unicode instance.

    from codecs_to_hex import to_hex

    text = u'pi: π'
    encoded = text.encode('utf-8')
    decoded = encoded.decode('utf-8')

    print 'Original :', repr(text)
    print 'Encoded  :', to_hex(encoded, 1), type(encoded)
    print 'Decoded  :', repr(decoded), type(decoded)

The choice of encoding used does not change the output type.

    $ python codecs_decode.py

    Original : u'pi: \u03c0'
    Encoded  : 70 69 3a 20 cf 80 <type 'str'>
    Decoded  : u'pi: \u03c0' <type 'unicode'>

Note: The default encoding is set during the interpreter start-up process, when site is loaded.
Refer to the Unicode Defaults section from the discussion of sys for a description of the default encoding settings.

6.7.2 Working with Files

Encoding and decoding strings is especially important when dealing with I/O operations. Whether writing to a file, a socket, or another stream, the data must use the proper encoding. In general, all text data needs to be decoded from its byte representation as it is read and encoded from the internal values to a specific representation as it is written. A program can explicitly encode and decode data, but depending on the encoding used, it can be nontrivial to determine whether enough bytes have been read in order to fully decode the data. codecs provides classes that manage the data encoding and decoding, so applications do not have to do that work.

The simplest interface provided by codecs is a replacement for the built-in open() function. The new version works just like the built-in function, but adds two new arguments to specify the encoding and desired error-handling technique.

    from codecs_to_hex import to_hex

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Writing to', filename
    with codecs.open(filename, mode='wt', encoding=encoding) as f:
        f.write(u'pi: \u03c0')

    # Determine the byte grouping to use for to_hex()
    nbytes = { 'utf-8':1,
               'utf-16':2,
               'utf-32':4,
               }.get(encoding, 1)

    # Show the raw bytes in the file
    print 'File contents:'
    with open(filename, mode='rt') as f:
        print to_hex(f.read(), nbytes)

This example starts with a unicode string with the code point for π and saves the text to a file using an encoding specified on the command line.
    $ python codecs_open_write.py utf-8

    Writing to utf-8.txt
    File contents:
    70 69 3a 20 cf 80

    $ python codecs_open_write.py utf-16

    Writing to utf-16.txt
    File contents:
    fffe 7000 6900 3a00 2000 c003

    $ python codecs_open_write.py utf-32

    Writing to utf-32.txt
    File contents:
    fffe0000 70000000 69000000 3a000000 20000000 c0030000

Reading the data with open() is straightforward, with one catch: the encoding must be known in advance, in order to set up the decoder correctly. Some data formats, such as XML, specify the encoding as part of the file, but usually it is up to the application to manage. codecs simply takes the encoding as an argument and assumes it is correct.

    import codecs
    import sys

    encoding = sys.argv[1]
    filename = encoding + '.txt'

    print 'Reading from', filename
    with codecs.open(filename, mode='rt', encoding=encoding) as f:
        print repr(f.read())

This example reads the files created by the previous program and prints the representation of the resulting unicode object to the console.

    $ python codecs_open_read.py utf-8

    Reading from utf-8.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-16

    Reading from utf-16.txt
    u'pi: \u03c0'

    $ python codecs_open_read.py utf-32

    Reading from utf-32.txt
    u'pi: \u03c0'

6.7.3 Byte Order

Multibyte encodings, such as UTF-16 and UTF-32, pose a problem when transferring data between different computer systems, either by copying a file directly or using network communication. Different systems use different ordering of the high- and low-order bytes. This characteristic of the data, known as its endianness, depends on factors such as the hardware architecture and choices made by the operating system and application developer. There is not always a way to know in advance what byte order to use for a given set of data, so the multibyte encodings include a byte-order marker (BOM) as the first few bytes of encoded output.
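To make the byte-order difference concrete, here is a minimal sketch that encodes the single code point for π with each explicit UTF-16 byte order. It is an illustration rather than one of the chapter's example programs; the variable names are hypothetical, and it assumes an interpreter where `from __future__ import print_function` is available (Python 2.6 or later, or Python 3).

```python
from __future__ import print_function

import binascii
import sys

# Assumption: a one-character sample, U+03C0 (GREEK SMALL LETTER PI).
text = u'\u03c0'

# The explicit-order codecs do not write a BOM, so only the two bytes
# for the code point itself appear in each result.
little = binascii.hexlify(text.encode('utf-16-le')).decode('ascii')
big = binascii.hexlify(text.encode('utf-16-be')).decode('ascii')

print('Native byte order:', sys.byteorder)
print('utf-16-le        :', little)   # low-order byte first: c003
print('utf-16-be        :', big)      # high-order byte first: 03c0
```

The same two bytes appear in both results, only in swapped positions, which is why a reader that does not know the byte order needs a marker at the start of the data.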
For example, UTF-16 is defined in such a way that 0xFFFE and 0xFEFF are not valid characters and can be used to indicate the byte order. codecs defines constants for the byte-order markers used by UTF-16 and UTF-32.

    import codecs
    from codecs_to_hex import to_hex

    for name in [ 'BOM', 'BOM_BE', 'BOM_LE',
                  'BOM_UTF8',
                  'BOM_UTF16', 'BOM_UTF16_BE', 'BOM_UTF16_LE',
                  'BOM_UTF32', 'BOM_UTF32_BE', 'BOM_UTF32_LE',
                  ]:
        print '{:12} : {}'.format(name, to_hex(getattr(codecs, name), 2))

BOM, BOM_UTF16, and BOM_UTF32 are automatically set to the appropriate big-endian or little-endian values, depending on the current system's native byte order.

    $ python codecs_bom.py

    BOM          : fffe
    BOM_BE       : feff
    BOM_LE       : fffe
    BOM_UTF8     : efbb bf
    BOM_UTF16    : fffe
    BOM_UTF16_BE : feff
    BOM_UTF16_LE : fffe
    BOM_UTF32    : fffe 0000
    BOM_UTF32_BE : 0000 feff
    BOM_UTF32_LE : fffe 0000

Byte ordering is detected and handled automatically by the decoders in codecs, but an explicit ordering can be specified when encoding.

    import codecs
    from codecs_to_hex import to_hex

    # Pick the nonnative version of UTF-16 encoding
    if codecs.BOM_UTF16 == codecs.BOM_UTF16_BE:
        bom = codecs.BOM_UTF16_LE
        encoding = 'utf_16_le'
    else:
        bom = codecs.BOM_UTF16_BE
        encoding = 'utf_16_be'

    print 'Native order  :', to_hex(codecs.BOM_UTF16, 2)
    print 'Selected order:', to_hex(bom, 2)

    # Encode the text.
    encoded_text = u'pi: \u03c0'.encode(encoding)
    print '{:14}: {}'.format(encoding, to_hex(encoded_text, 2))

    with open('nonnative-encoded.txt', mode='wb') as f:
        # Write the selected byte-order marker. It is not included
        # in the encoded text because the byte order was given
        # explicitly when selecting the encoding.
        f.write(bom)
        # Write the byte string for the encoded text.
        f.write(encoded_text)

codecs_bom_create_file.py figures out the native byte ordering and then uses the alternate form explicitly so the next example can demonstrate auto-detection while reading.

    $ python codecs_bom_create_file.py

    Native order  : fffe
    Selected order: feff
    utf_16_be     : 0070 0069 003a 0020 03c0

codecs_bom_detection.py does not specify a byte order when opening the file, so the decoder uses the BOM value in the first two bytes of the file to determine it.

    import codecs
    from codecs_to_hex import to_hex

    # Look at the raw data
    with open('nonnative-encoded.txt', mode='rb') as f:
        raw_bytes = f.read()

    print 'Raw    :', to_hex(raw_bytes, 2)

    # Reopen the file and let codecs detect the BOM
    with codecs.open('nonnative-encoded.txt',
                     mode='rt',
                     encoding='utf-16',
                     ) as f:
        decoded_text = f.read()

    print 'Decoded:', repr(decoded_text)

Since the first two bytes of the file are used for byte-order detection, they are not included in the data returned by read().

    $ python codecs_bom_detection.py

    Raw    : feff 0070 0069 003a 0020 03c0
    Decoded: u'pi: \u03c0'

6.7.4 Error Handling

The previous sections pointed out the need to know the encoding being used when reading and writing Unicode files. Setting the encoding correctly is important for two reasons. If the encoding is configured incorrectly while reading from a file, the data will be interpreted incorrectly and may be corrupted or simply fail to decode. Not all Unicode characters can be represented in all encodings, so if the wrong encoding is used while writing, then an error will be generated and data may be lost.

codecs uses the same five error-handling options that are provided by the encode() method of unicode and the decode() method of str, listed in Table 6.1.

Table 6.1. Codec Error-Handling Modes

    Error Mode          Description
    strict              Raises an exception if the data cannot be converted
    replace             Substitutes a special marker character for data that cannot be encoded
    ignore              Skips the data
    xmlcharrefreplace   Substitutes an XML character reference (encoding only)
    backslashreplace    Substitutes a backslash escape sequence (encoding only)

Encoding Errors

The most common error condition is receiving a UnicodeEncodeError when writing Unicode data to an ASCII output stream, such as a regular file or sys.stdout. This sample program can be used to experiment with the different error-handling modes.

    import codecs
    import sys

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'

    try:
        # Save the data, encoded as ASCII, using the error
        # handling mode specified on the command line.
        with codecs.open('encode_error.txt', 'w',
                         encoding='ascii',
                         errors=error_handling) as f:
            f.write(text)

    except UnicodeEncodeError, err:
        print 'ERROR:', err

    else:
        # If there was no error writing to the file,
        # show what it contains.
        with open('encode_error.txt', 'rb') as f:
            print 'File contents:', repr(f.read())

While strict mode is safest for ensuring an application explicitly sets the correct encoding for all I/O operations, it can lead to program crashes when an exception is raised.

    $ python codecs_encode_error.py strict

    ERROR: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

Some of the other error modes are more flexible. For example, replace ensures that no error is raised, at the expense of possibly losing data that cannot be converted to the requested encoding. The Unicode character for pi (π) still cannot be encoded in ASCII, but instead of raising an exception, the character is replaced with ? in the output.

    $ python codecs_encode_error.py replace

    File contents: 'pi: ?'

To skip over problem data entirely, use ignore. Any data that cannot be encoded will be discarded.
    $ python codecs_encode_error.py ignore

    File contents: 'pi: '

There are two lossless error-handling options, both of which replace the character with an alternate representation defined by a standard separate from the encoding. xmlcharrefreplace uses an XML character reference as a substitute (the list of character references is specified in the W3C document, XML Entity Definitions for Characters).

    $ python codecs_encode_error.py xmlcharrefreplace

    File contents: 'pi: &#960;'

The other lossless error-handling scheme is backslashreplace, which produces an output format like the value returned when repr() of a unicode object is printed. Unicode characters are replaced with \u followed by the hexadecimal value of the code point.

    $ python codecs_encode_error.py backslashreplace

    File contents: 'pi: \\u03c0'

Decoding Errors

It is also possible to see errors when decoding data, especially if the wrong encoding is used.

    import codecs
    import sys

    from codecs_to_hex import to_hex

    error_handling = sys.argv[1]

    text = u'pi: \u03c0'
    print 'Original     :', repr(text)

    # Save the data with one encoding
    with codecs.open('decode_error.txt', 'w', encoding='utf-16') as f:
        f.write(text)

    # Dump the bytes from the file
    with open('decode_error.txt', 'rb') as f:
        print 'File contents:', to_hex(f.read(), 1)

    # Try to read the data with the wrong encoding
    with codecs.open('decode_error.txt', 'r',
                     encoding='utf-8',
                     errors=error_handling) as f:
        try:
            data = f.read()
        except UnicodeDecodeError, err:
            print 'ERROR:', err
        else:
            print 'Read         :', repr(data)

As with encoding, strict error-handling mode raises an exception if the byte stream cannot be properly decoded. In this case, a UnicodeDecodeError results from trying to convert part of the UTF-16 BOM to a character using the UTF-8 decoder.
    $ python codecs_decode_error.py strict

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    ERROR: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Switching to ignore causes the decoder to skip over the invalid bytes. The result is still not quite what is expected, though, since it includes embedded null bytes.

    $ python codecs_decode_error.py ignore

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'p\x00i\x00:\x00 \x00\x03'

In replace mode, invalid bytes are replaced with \uFFFD, the official Unicode replacement character, which looks like a diamond with a black background containing a white question mark.

    $ python codecs_decode_error.py replace

    Original     : u'pi: \u03c0'
    File contents: ff fe 70 00 69 00 3a 00 20 00 c0 03
    Read         : u'\ufffd\ufffdp\x00i\x00:\x00 \x00\ufffd\x03'

6.7.5 Standard Input and Output Streams

The most common cause of UnicodeEncodeError exceptions is code that tries to print unicode data to the console or a UNIX pipeline when sys.stdout is not configured with an encoding.

    import codecs
    import sys

    text = u'pi: π'

    # Printing to stdout may cause an encoding error
    print 'Default encoding:', sys.stdout.encoding
    print 'TTY:', sys.stdout.isatty()
    print text

Problems with the default encoding of the standard I/O channels can be difficult to debug. This is because the program frequently works as expected when the output goes to the console, but it causes an encoding error when it is used as part of a pipeline and the output includes Unicode characters outside of the ASCII range. This difference in behavior is caused by Python's initialization code, which sets the default encoding for each standard I/O channel only if the channel is connected to a terminal (isatty() returns True). If there is no terminal, Python assumes the program will configure the encoding explicitly and leaves the I/O channel alone.
    $ python codecs_stdout.py

    Default encoding: utf-8
    TTY: True
    pi: π

    $ python codecs_stdout.py | cat -

    Default encoding: None
    TTY: False
    Traceback (most recent call last):
      File "codecs_stdout.py", line 18, in <module>
        print text
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

To explicitly set the encoding on the standard output channel, use getwriter() to get a stream encoder class for a specific encoding. Instantiate the class, passing sys.stdout as the only argument.

    import codecs
    import sys

    text = u'pi: π'

    # Wrap sys.stdout with a writer that knows how to handle encoding
    # Unicode data.
    wrapped_stdout = codecs.getwriter('UTF-8')(sys.stdout)
    wrapped_stdout.write(u'Via write: ' + text + '\n')

    # Replace sys.stdout with a writer
    sys.stdout = wrapped_stdout

    print u'Via print:', text

Writing to the wrapped version of sys.stdout passes the Unicode text through an encoder before sending the encoded bytes to stdout. Replacing sys.stdout means that any code used by an application that prints to standard output will be able to take advantage of the encoding writer.

    $ python codecs_stdout_wrapped.py

    Via write: pi: π
    Via print: pi: π

The next problem to solve is how to know which encoding should be used. The proper encoding varies based on location, language, and user or system configuration, so hard-coding a fixed value is not a good idea. It would also be annoying for a user to have to pass explicit arguments to every program to set the input and output encodings. Fortunately, there is a global way to get a reasonable default encoding using locale.

    import codecs
    import locale
    import sys

    text = u'pi: π'

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdout with an encoding-aware writer.
    lang, encoding = locale.getdefaultlocale()
    print 'Locale encoding    :', encoding
    sys.stdout = codecs.getwriter(encoding)(sys.stdout)

    print 'With wrapped stdout:', text

The function locale.getdefaultlocale() returns the language and preferred encoding based on the system and user configuration settings in a form that can be used with getwriter().

    $ python codecs_stdout_locale.py

    Locale encoding    : UTF8
    With wrapped stdout: pi: π

The encoding also needs to be set up when working with sys.stdin. Use getreader() to get a reader capable of decoding the input bytes.

    import codecs
    import locale
    import sys

    # Configure locale from the user's environment settings.
    locale.setlocale(locale.LC_ALL, '')

    # Wrap stdin with an encoding-aware reader.
    lang, encoding = locale.getdefaultlocale()
    sys.stdin = codecs.getreader(encoding)(sys.stdin)

    print 'From stdin:'
    print repr(sys.stdin.read())

Reading from the wrapped handle returns unicode objects instead of str instances.

    $ python codecs_stdout_locale.py | python codecs_stdin.py

    From stdin:
    u'Locale encoding    : UTF8\nWith wrapped stdout: pi: \u03c0\n'

6.7.6 Encoding Translation

Although most applications will work with unicode data internally, decoding or encoding it as part of an I/O operation, there are times when changing a file's encoding without holding on to that intermediate data format is useful. EncodedFile() takes an open file handle using one encoding and wraps it with a class that translates the data to another encoding as the I/O occurs.

    from codecs_to_hex import to_hex

    import codecs
    from cStringIO import StringIO

    # Raw version of the original data.
    data = u'pi: \u03c0'

    # Manually encode it as UTF-8.
    utf8 = data.encode('utf-8')
    print 'Start as UTF-8   :', to_hex(utf8, 1)

    # Set up an output buffer, then wrap it as an EncodedFile.
    output = StringIO()
    encoded_file = codecs.EncodedFile(output, data_encoding='utf-8',
                                      file_encoding='utf-16')
    encoded_file.write(utf8)

    # Fetch the buffer contents as a UTF-16 encoded byte string
    utf16 = output.getvalue()
    print 'Encoded to UTF-16:', to_hex(utf16, 2)

    # Set up another buffer with the UTF-16 data for reading,
    # and wrap it with another EncodedFile.
    buffer = StringIO(utf16)
    encoded_file = codecs.EncodedFile(buffer, data_encoding='utf-8',
                                      file_encoding='utf-16')

    # Read the UTF-8 encoded version of the data.
    recoded = encoded_file.read()
    print 'Back to UTF-8    :', to_hex(recoded, 1)

This example shows reading from and writing to separate handles returned by EncodedFile(). No matter whether the handle is used for reading or writing, the file_encoding always refers to the encoding in use by the open file handle passed as the first argument, and the data_encoding value refers to the encoding in use by the data passing through the read() and write() calls.

    $ python codecs_encodedfile.py

    Start as UTF-8   : 70 69 3a 20 cf 80
    Encoded to UTF-16: fffe 7000 6900 3a00 2000 c003
    Back to UTF-8    : 70 69 3a 20 cf 80

6.7.7 Non-Unicode Encodings

Although most of the earlier examples use Unicode encodings, codecs can be used for many other data translations. For example, Python includes codecs for working with base-64, bzip2, ROT-13, ZIP, and other data formats.

    import codecs
    from cStringIO import StringIO

    buffer = StringIO()
    stream = codecs.getwriter('rot_13')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz'

    stream.write(text)
    stream.flush()

    print 'Original:', text
    print 'ROT-13  :', buffer.getvalue()

Any transformation that can be expressed as a function taking a single input argument and returning a byte or Unicode string can be registered as a codec.

    $ python codecs_rot13.py

    Original: abcdefghijklmnopqrstuvwxyz
    ROT-13  : nopqrstuvwxyzabcdefghijklm

Using codecs to wrap a data stream provides a simpler interface than working directly with zlib.
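For comparison, this is roughly what the direct approach looks like. The sketch below is an illustration added for contrast, not one of the chapter's example programs: it uses zlib's compressobj() and decompressobj() on the same sample text, and the variable names are hypothetical. It should run unchanged on Python 2.6 or later and on Python 3.

```python
import zlib

# Same sample data as the codec-based example: 50 repeated lines.
text = b'abcdefghijklmnopqrstuvwxyz\n' * 50

# Compress: the caller has to remember to flush() to get the
# final block of compressed output.
compressor = zlib.compressobj()
compressed = compressor.compress(text) + compressor.flush()

# Decompress: again the caller manages the object and the flush.
decompressor = zlib.decompressobj()
restored = decompressor.decompress(compressed) + decompressor.flush()

print(len(text))
print(len(compressed) < len(text))
print(restored == text)
```

With the codec wrapper, the same work is expressed through the familiar write(), read(), and readline() calls of a file-like object.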
    import codecs
    from cStringIO import StringIO

    from codecs_to_hex import to_hex

    buffer = StringIO()
    stream = codecs.getwriter('zlib')(buffer)

    text = 'abcdefghijklmnopqrstuvwxyz\n' * 50

    stream.write(text)
    stream.flush()

    print 'Original length :', len(text)
    compressed_data = buffer.getvalue()
    print 'ZIP compressed  :', len(compressed_data)

    buffer = StringIO(compressed_data)
    stream = codecs.getreader('zlib')(buffer)

    first_line = stream.readline()
    print 'Read first line :', repr(first_line)

    uncompressed_data = first_line + stream.read()
    print 'Uncompressed    :', len(uncompressed_data)
    print 'Same            :', text == uncompressed_data

Not all compression or encoding systems support reading a portion of the data through the stream interface using readline() or read() because they need to find the end of a compressed segment to expand it. If a program cannot hold the entire uncompressed data set in memory, use the incremental access features of the compression library, instead of codecs.

    $ python codecs_zlib.py

    Original length : 1350
    ZIP compressed  : 48
    Read first line : 'abcdefghijklmnopqrstuvwxyz\n'
    Uncompressed    : 1350
    Same            : True

6.7.8 Incremental Encoding

Some of the encodings provided, especially bz2 and zlib, may dramatically change the length of the data stream as they work on it. For large data sets, these encodings operate better incrementally, working on one small chunk of data at a time. The IncrementalEncoder and IncrementalDecoder API is designed for this purpose.
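The buffering behavior of the incremental API is easiest to see with a small text codec first. The following is a minimal sketch, not one of the chapter's example programs, that feeds UTF-8 bytes to an incremental decoder one byte at a time; the variable names are hypothetical, and it assumes Python 2.6 or later, or Python 3.

```python
from __future__ import print_function

import codecs

# Six bytes: 'p', 'i', ':', ' ', and the two-byte sequence for pi.
encoded = u'pi: \u03c0'.encode('utf-8')

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for i in range(len(encoded)):
    chunk = encoded[i:i + 1]          # one byte at a time
    final = (i == len(encoded) - 1)   # True only for the last chunk
    pieces.append(decoder.decode(chunk, final))

restored = u''.join(pieces)

# The first byte of the two-byte sequence produces no output; the
# decoder holds it until the second byte completes the character.
print([len(p) for p in pieces])
print(restored == u'pi: \u03c0')
```

The larger example that follows applies the same pattern to bz2, where far more input has to be buffered before any output is produced.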
    import codecs
    import sys

    from codecs_to_hex import to_hex

    text = 'abcdefghijklmnopqrstuvwxyz\n'
    repetitions = 50

    print 'Text length :', len(text)
    print 'Repetitions :', repetitions
    print 'Expected len:', len(text) * repetitions

    # Encode the text several times to build up a large amount of data
    encoder = codecs.getincrementalencoder('bz2')()
    encoded = []

    print
    print 'Encoding:',
    for i in range(repetitions):
        en_c = encoder.encode(text, final = (i==repetitions-1))
        if en_c:
            print '\nEncoded : {} bytes'.format(len(en_c))
            encoded.append(en_c)
        else:
            sys.stdout.write('.')

    bytes = ''.join(encoded)
    print
    print 'Total encoded length:', len(bytes)
    print

    # Decode the byte string one byte at a time
    decoder = codecs.getincrementaldecoder('bz2')()
    decoded = []

    print 'Decoding:',
    for i, b in enumerate(bytes):
        # final is True only for the last byte of the encoded data
        final = (i+1) == len(bytes)
        c = decoder.decode(b, final)
        if c:
            print '\nDecoded : {} characters'.format(len(c))
            print 'Decoding:',
            decoded.append(c)
        else:
            sys.stdout.write('.')
    print

    restored = u''.join(decoded)

    print
    print 'Total uncompressed length:', len(restored)

Each time data is passed to the encoder or the decoder, its internal state is updated. When the state is consistent (as defined by the codec), data is returned and the state resets. Until that point, calls to encode() or decode() will not return any data. When the last bit of data is passed in, the argument final should be set to True so the codec knows to flush any remaining buffered data.

    $ python codecs_incremental_bz2.py

    Text length : 27
    Repetitions : 50
    Expected len: 1350

    Encoding:.................................................
    Encoded : 99 bytes

    Total encoded length: 99

    Decoding:...................
    ............................
    Decoded : 1350 characters
    Decoding:..........

    Total uncompressed length: 1350

6.7.9 Unicode Data and Network Communication

Like the standard input and output file descriptors, network sockets are also byte streams, and so Unicode data must be encoded into bytes before it is written to a socket. This server echoes data it receives back to the sender.

    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        # WRONG: Not encoded first!
        text = u'pi: π'
        len_sent = s.send(text)

        # Receive a response
        response = s.recv(len_sent)
        print repr(response)

        # Clean up
        s.close()
        server.socket.close()

The data could be encoded explicitly before each call to send(), but missing one call to send() would result in an encoding error.

    $ python codecs_socket_fail.py

    Traceback (most recent call last):
      File "codecs_socket_fail.py", line 43, in <module>
        len_sent = s.send(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c0' in position 4: ordinal not in range(128)

Using makefile() to get a file-like handle for the socket, and then wrapping that handle with a stream-based reader or writer, means Unicode strings will be encoded on the way into the socket and decoded on the way out.

    import sys
    import SocketServer


    class Echo(SocketServer.BaseRequestHandler):

        def handle(self):
            # Get some bytes and echo them back to the client. There is
            # no need to decode them, since they are not used.
            data = self.request.recv(1024)
            self.request.send(data)
            return


    class PassThrough(object):

        def __init__(self, other):
            self.other = other

        def write(self, data):
            print 'Writing :', repr(data)
            return self.other.write(data)

        def read(self, size=-1):
            print 'Reading :',
            data = self.other.read(size)
            print repr(data)
            return data

        def flush(self):
            return self.other.flush()

        def close(self):
            return self.other.close()


    if __name__ == '__main__':
        import codecs
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = SocketServer.TCPServer(address, Echo)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Wrap the socket with a reader and writer.
        read_file = s.makefile('r')
        incoming = codecs.getreader('utf-8')(PassThrough(read_file))
        write_file = s.makefile('w')
        outgoing = codecs.getwriter('utf-8')(PassThrough(write_file))

        # Send the data
        text = u'pi: π'
        print 'Sending :', repr(text)
        outgoing.write(text)
        outgoing.flush()

        # Receive a response
        response = incoming.read()
        print 'Received:', repr(response)

        # Clean up
        s.close()
        server.socket.close()

This example uses PassThrough to show that the data is encoded before being sent and the response is decoded after it is received in the client.

    $ python codecs_socket.py

    Sending : u'pi: \u03c0'
    Writing : 'pi: \xcf\x80'
    Reading : 'pi: \xcf\x80'
    Received: u'pi: \u03c0'

6.7.10 Defining a Custom Encoding

Since Python comes with a large number of standard codecs already, it is unlikely that an application will need to define a custom encoder or decoder. When it is necessary, though, there are several base classes in codecs to make the process easier.

The first step is to understand the nature of the transformation described by the encoding.
These examples will use an "invertcaps" encoding, which converts uppercase letters to lowercase and lowercase letters to uppercase. Here is a simple definition of an encoding function that performs this transformation on an input string:

    import string

    def invertcaps(text):
        """Return new string with the case of all letters switched.
        """
        return ''.join( c.upper() if c in string.ascii_lowercase
                        else c.lower() if c in string.ascii_uppercase
                        else c
                        for c in text
                        )

    if __name__ == '__main__':
        print invertcaps('ABC.def')
        print invertcaps('abc.DEF')

In this case, the encoder and decoder are the same function (as with ROT-13).

    $ python codecs_invertcaps.py

    abc.DEF
    ABC.def

Although it is easy to understand, this implementation is not efficient, especially for very large text strings. Fortunately, codecs includes helper functions for creating codecs based on character maps, like invertcaps. A character map encoding is made up of two dictionaries. The encoding map converts character values from the input string to byte values in the output, and the decoding map goes the other way. Create the decoding map first, and then use make_encoding_map() to convert it to an encoding map. The C functions charmap_encode() and charmap_decode() use the maps to convert their input data efficiently.

    import codecs
    import string

    # Map every character to itself
    decoding_map = codecs.make_identity_dict(range(256))

    # Make a list of pairs of ordinal values for the lower and uppercase
    # letters
    pairs = zip([ ord(c) for c in string.ascii_lowercase],
                [ ord(c) for c in string.ascii_uppercase])

    # Modify the mapping to convert upper to lower and lower to upper.
    decoding_map.update( dict( (upper, lower) for (lower, upper) in pairs) )
    decoding_map.update( dict( (lower, upper) for (lower, upper) in pairs) )

    # Create a separate encoding map.
    encoding_map = codecs.make_encoding_map(decoding_map)

    if __name__ == '__main__':
        print codecs.charmap_encode('abc.DEF', 'strict', encoding_map)
        print codecs.charmap_decode('abc.DEF', 'strict', decoding_map)
        print encoding_map == decoding_map

Although the encoding and decoding maps for invertcaps are the same, that may not always be the case. make_encoding_map() detects situations where more than one input character is encoded to the same output byte and replaces the encoding value with None to mark the encoding as undefined.

    $ python codecs_invertcaps_charmap.py

    ('ABC.def', 7)
    (u'ABC.def', 7)
    True

The character map encoder and decoder support all standard error-handling methods described earlier, so no extra work is needed to comply with that part of the API.

    import codecs
    from codecs_invertcaps_charmap import encoding_map

    text = u'pi: π'

    for error in [ 'ignore', 'replace', 'strict' ]:
        try:
            encoded = codecs.charmap_encode(text, error, encoding_map)
        except UnicodeEncodeError, err:
            encoded = str(err)
        print '{:7}: {}'.format(error, encoded)

Because the Unicode code point for π is not in the encoding map, the strict error-handling mode raises an exception.

    $ python codecs_invertcaps_error.py

    ignore : ('PI: ', 5)
    replace: ('PI: ?', 5)
    strict : 'charmap' codec can't encode character u'\u03c0' in position 4: character maps to <undefined>

After the encoding and decoding maps are defined, a few additional classes need to be set up, and the encoding should be registered. register() adds a search function to the registry so that when a user wants to use the encoding, codecs can locate it. The search function must take a single string argument with the name of the encoding and return a CodecInfo object if it knows the encoding, or None if it does not.
    import codecs
    import encodings

    def search1(encoding):
        print 'search1: Searching for:', encoding
        return None

    def search2(encoding):
        print 'search2: Searching for:', encoding
        return None

    codecs.register(search1)
    codecs.register(search2)

    utf8 = codecs.lookup('utf-8')
    print 'UTF-8:', utf8

    try:
        unknown = codecs.lookup('no-such-encoding')
    except LookupError, err:
        print 'ERROR:', err

Multiple search functions can be registered, and each will be called in turn until one returns a CodecInfo or the list is exhausted. The internal search function registered by codecs knows how to load the standard codecs, such as UTF-8 from encodings, so those names will never be passed to custom search functions.

    $ python codecs_register.py

    UTF-8: <codecs.CodecInfo object for encoding utf-8 at 0x...>
    search1: Searching for: no-such-encoding
    search2: Searching for: no-such-encoding
    ERROR: unknown encoding: no-such-encoding

The CodecInfo instance returned by the search function tells codecs how to encode and decode using all the different mechanisms supported: stateless, incremental, and stream. codecs includes base classes to help with setting up a character map encoding. This example puts all the pieces together to register a search function that returns a CodecInfo instance configured for the invertcaps codec.

    import codecs

    from codecs_invertcaps_charmap import encoding_map, decoding_map

    # Stateless encoder/decoder
    class InvertCapsCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            return codecs.charmap_encode(input, errors, encoding_map)

        def decode(self, input, errors='strict'):
            return codecs.charmap_decode(input, errors, decoding_map)

    # Incremental forms
    class InvertCapsIncrementalEncoder(codecs.IncrementalEncoder):
        def encode(self, input, final=False):
            data, nbytes = codecs.charmap_encode(input,
                                                 self.errors,
                                                 encoding_map)
            return data

    class InvertCapsIncrementalDecoder(codecs.IncrementalDecoder):
        def decode(self, input, final=False):
            data, nbytes = codecs.charmap_decode(input,
                                                 self.errors,
                                                 decoding_map)
            return data

    # Stream reader and writer
    class InvertCapsStreamReader(InvertCapsCodec, codecs.StreamReader):
        pass

    class InvertCapsStreamWriter(InvertCapsCodec, codecs.StreamWriter):
        pass

    # Register the codec search function
    def find_invertcaps(encoding):
        """Return the codec for 'invertcaps'.
        """
        if encoding == 'invertcaps':
            return codecs.CodecInfo(
                name='invertcaps',
                encode=InvertCapsCodec().encode,
                decode=InvertCapsCodec().decode,
                incrementalencoder=InvertCapsIncrementalEncoder,
                incrementaldecoder=InvertCapsIncrementalDecoder,
                streamreader=InvertCapsStreamReader,
                streamwriter=InvertCapsStreamWriter,
                )
        return None

    codecs.register(find_invertcaps)

    if __name__ == '__main__':

        # Stateless encoder/decoder
        encoder = codecs.getencoder('invertcaps')
        text = 'abc.DEF'
        encoded_text, consumed = encoder(text)
        print 'Encoded "{}" to "{}", consuming {} characters'.format(
            text, encoded_text, consumed)

        # Stream writer
        import sys
        writer = codecs.getwriter('invertcaps')(sys.stdout)
        print 'StreamWriter for stdout: ',
        writer.write('abc.DEF')
        print

        # Incremental decoder
        decoder_factory = codecs.getincrementaldecoder('invertcaps')
        decoder = decoder_factory()
        decoded_text_parts = []
        for c in encoded_text:
            decoded_text_parts.append(decoder.decode(c, final=False))
        decoded_text_parts.append(decoder.decode('', final=True))
        decoded_text = ''.join(decoded_text_parts)
        print 'IncrementalDecoder converted "{}" to "{}"'.format(
            encoded_text, decoded_text)

The stateless encoder/decoder base class is Codec. Override encode() and decode() with the new implementation (in this case, calling charmap_encode() and charmap_decode(), respectively). Each method must return a tuple containing the transformed data and the number of input bytes or characters consumed. Conveniently, charmap_encode() and charmap_decode() already return that information.

IncrementalEncoder and IncrementalDecoder serve as base classes for the incremental interfaces. The encode() and decode() methods of the incremental classes are defined in such a way that they only return the actual transformed data. Any information about buffering is maintained as internal state. The invertcaps encoding does not need to buffer data (it uses a one-to-one mapping). For encodings that produce a different amount of output depending on the data being processed, such as compression algorithms, BufferedIncrementalEncoder and BufferedIncrementalDecoder are more appropriate base classes, since they manage the unprocessed portion of the input.

StreamReader and StreamWriter need encode() and decode() methods, too, and since they are expected to return the same value as the version from Codec, multiple inheritance can be used for the implementation.

    $ python codecs_invertcaps_register.py

    Encoded "abc.DEF" to "ABC.def", consuming 7 characters
    StreamWriter for stdout: ABC.def
    IncrementalDecoder converted "ABC.def" to "abc.DEF"

See Also:

codecs (http://docs.python.org/library/codecs.html) The standard library documentation for this module.
locale (page 909) Accessing and managing the localization-based configuration settings and behaviors.
io (http://docs.python.org/library/io.html) The io module includes file and stream wrappers that handle encoding and decoding, too.
SocketServer (page 609) For a more detailed example of an echo server, see the SocketServer module.
encodings Package in the standard library containing the encoder/decoder implementations provided by Python.
PEP 100 (www.python.org/dev/peps/pep-0100) Python Unicode Integration PEP.
Unicode HOWTO (http://docs.python.org/howto/unicode) The official guide for using Unicode with Python 2.x.
Python Unicode Objects (http://effbot.org/zone/unicode-objects.htm) Fredrik Lundh's article about using non-ASCII character sets in Python 2.0.
How to Use UTF-8 with Python (http://evanjones.ca/python-utf8.html) Evan Jones' quick guide to working with Unicode, including XML data and the Byte-Order Marker.
On the Goodness of Unicode (www.tbray.org/ongoing/When/200x/2003/04/06/Unicode) Introduction to internationalization and Unicode by Tim Bray.
On Character Strings (www.tbray.org/ongoing/When/200x/2003/04/13/Strings) A look at the history of string processing in programming languages, by Tim Bray.
Characters vs. Bytes (www.tbray.org/ongoing/When/200x/2003/04/26/UTF) Part one of Tim Bray's “essay on modern character string processing for computer programmers.” This installment covers in-memory representation of text in formats other than ASCII bytes.
Endianness (http://en.wikipedia.org/wiki/Endianness) Explanation of endianness in Wikipedia.
W3C XML Entity Definitions for Characters (www.w3.org/TR/xml-entity-names/) Specification for XML representations of character references that cannot be represented in an encoding.

6.8 StringIO—Text Buffers with a File-like API

Purpose: Work with text buffers using a file-like API.
Python Version: 1.4 and later

StringIO provides a convenient means of working with text in memory using the file API (read(), write(), etc.). There are two separate implementations. The cStringIO version is written in C for speed, while StringIO is written in Python for portability. Using cStringIO to build large strings can offer performance savings over some other string concatenation techniques.
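The remark about building large strings can be illustrated with a short sketch. It is written to run on Python 2.6 and later as well as Python 3, so it assumes io.StringIO (which offers the same file-like interface for Unicode text) rather than cStringIO; build_report() and its sample rows are hypothetical names chosen only for the illustration.

```python
import io


def build_report(rows):
    """Assemble many small pieces of text in an in-memory buffer."""
    buf = io.StringIO()
    for name, value in rows:
        # Each write() appends to the buffer, which avoids building a
        # new, longer string on every pass through the loop.
        buf.write(u'%s=%s\n' % (name, value))
    text = buf.getvalue()
    buf.close()  # discard buffer memory
    return text


report = build_report([(u'a', 1), (u'b', 2)])
print(report)
```

Whether this is faster than other techniques depends on the interpreter and the data, so it is worth measuring before relying on it.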
6.8.1 Examples

Here are a few standard examples of using StringIO buffers:

    # Find the best implementation available on this platform
    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO

    # Writing to a buffer
    output = StringIO()
    output.write('This goes into the buffer. ')
    print >>output, 'And so does this.'

    # Retrieve the value written
    print output.getvalue()

    output.close()  # discard buffer memory

    # Initialize a read buffer
    input = StringIO('Initial value for read buffer')

    # Read from the buffer
    print input.read()

This example uses read(), but the readline() and readlines() methods are also available. The StringIO class also provides a seek() method for jumping around in a buffer while reading, which can be useful for rewinding if a look-ahead parsing algorithm is being used.

    $ python stringio_examples.py

    This goes into the buffer. And so does this.

    Initial value for read buffer

See Also:

StringIO (http://docs.python.org/lib/module-StringIO.html) Standard library documentation for this module.
The StringIO module ::: www.effbot.org (http://effbot.org/librarybook/stringio.htm) effbot's examples with StringIO.
Efficient String Concatenation in Python (www.skymind.com/%7Eocrow/python_string/) Examines various methods of combining strings and their relative merits.

6.9 fnmatch—UNIX-Style Glob Pattern Matching

Purpose: Handle UNIX-style filename comparisons.
Python Version: 1.4 and later

The fnmatch module is used to compare filenames against glob-style patterns such as those used by UNIX shells.

6.9.1 Simple Matching

fnmatch() compares a single filename against a pattern and returns a Boolean, indicating whether or not they match. The comparison is case sensitive when the operating system uses a case-sensitive file system.
    import fnmatch
    import os

    pattern = 'fnmatch_*.py'
    print 'Pattern :', pattern
    print

    files = os.listdir('.')
    for name in files:
        print 'Filename: %-25s %s' % \
            (name, fnmatch.fnmatch(name, pattern))

In this example, the pattern matches all files starting with 'fnmatch_' and ending in '.py'.

    $ python fnmatch_fnmatch.py

    Pattern : fnmatch_*.py

    Filename: __init__.py               False
    Filename: fnmatch_filter.py         True
    Filename: fnmatch_fnmatch.py        True
    Filename: fnmatch_fnmatchcase.py    True
    Filename: fnmatch_translate.py      True
    Filename: index.rst                 False

To force a case-sensitive comparison, regardless of the file system and operating system settings, use fnmatchcase().

    import fnmatch
    import os

    pattern = 'FNMATCH_*.PY'
    print 'Pattern :', pattern
    print

    files = os.listdir('.')

    for name in files:
        print 'Filename: %-25s %s' % \
            (name, fnmatch.fnmatchcase(name, pattern))

Because fnmatchcase() always compares case-sensitively, no files match the uppercase pattern. The OS X system used to test this program uses a case-sensitive file system, so fnmatch() would give the same result there.

    $ python fnmatch_fnmatchcase.py

    Pattern : FNMATCH_*.PY

    Filename: __init__.py               False
    Filename: fnmatch_filter.py         False
    Filename: fnmatch_fnmatch.py        False
    Filename: fnmatch_fnmatchcase.py    False
    Filename: fnmatch_translate.py      False
    Filename: index.rst                 False

6.9.2 Filtering

To test a sequence of filenames, use filter(), which returns a list of the names that match the pattern argument.

    import fnmatch
    import os
    import pprint

    pattern = 'fnmatch_*.py'
    print 'Pattern :', pattern

    files = os.listdir('.')

    print
    print 'Files :'
    pprint.pprint(files)

    print
    print 'Matches :'
    pprint.pprint(fnmatch.filter(files, pattern))

In this example, filter() returns the list of names of the example source files associated with this section.
    $ python fnmatch_filter.py

    Pattern : fnmatch_*.py

    Files :
    ['__init__.py',
     'fnmatch_filter.py',
     'fnmatch_fnmatch.py',
     'fnmatch_fnmatchcase.py',
     'fnmatch_translate.py',
     'index.rst']

    Matches :
    ['fnmatch_filter.py',
     'fnmatch_fnmatch.py',
     'fnmatch_fnmatchcase.py',
     'fnmatch_translate.py']

6.9.3 Translating Patterns

Internally, fnmatch converts the glob pattern to a regular expression and uses the re module to compare the name and pattern. The translate() function is the public API for converting glob patterns to regular expressions.

    import fnmatch

    pattern = 'fnmatch_*.py'
    print 'Pattern :', pattern
    print 'Regex :', fnmatch.translate(pattern)

Some of the characters are escaped to make a valid expression.

    $ python fnmatch_translate.py

    Pattern : fnmatch_*.py
    Regex : fnmatch\_.*\.py\Z(?ms)

See Also:

fnmatch (http://docs.python.org/library/fnmatch.html) The standard library documentation for this module.
glob (page 257) The glob module combines fnmatch matching with os.listdir() to produce lists of files and directories matching patterns.
re (page 13) Regular expression pattern matching.

6.10 dircache—Cache Directory Listings

Purpose: Cache directory listings, updating when the modification time of a directory changes.
Python Version: 1.4 and later

The dircache module reads directory listings from the file system and holds them in memory.

6.10.1 Listing Directory Contents

The main function in the dircache API is listdir(), which is a wrapper around os.listdir(). Each time it is called with a given path, dircache.listdir() returns the same list object, unless the modification date of the directory changes.
    import dircache

    path = '.'
    first = dircache.listdir(path)
    second = dircache.listdir(path)

    print 'Contents :'
    for name in first:
        print '', name

    print
    print 'Identical:', first is second
    print 'Equal :', first == second

It is important to recognize that the exact same list is returned each time, so it should not be modified in place.

    $ python dircache_listdir.py

    Contents :
     __init__.py
     dircache_annotate.py
     dircache_listdir.py
     dircache_listdir_file_added.py
     dircache_reset.py
     index.rst

    Identical: True
    Equal : True

If the contents of the directory change, it is rescanned.

    import dircache
    import os

    path = '/tmp'
    file_to_create = os.path.join(path, 'pymotw_tmp.txt')

    # Look at the directory contents
    first = dircache.listdir(path)

    # Create the new file
    open(file_to_create, 'wt').close()

    # Rescan the directory
    second = dircache.listdir(path)

    # Remove the file we created
    os.unlink(file_to_create)

    print 'Identical :', first is second
    print 'Equal :', first == second
    print 'Difference:', list(set(second) - set(first))

In this case, the new file causes a new list to be constructed.

    $ python dircache_listdir_file_added.py

    Identical : False
    Equal : False
    Difference: ['pymotw_tmp.txt']

It is also possible to reset the entire cache, discarding its contents so that each path will be rechecked.

    import dircache

    path = '/tmp'
    first = dircache.listdir(path)
    dircache.reset()
    second = dircache.listdir(path)

    print 'Identical :', first is second
    print 'Equal :', first == second
    print 'Difference:', list(set(second) - set(first))

After resetting, a new list instance is returned.

    $ python dircache_reset.py

    Identical : False
    Equal : True
    Difference: []

6.10.2 Annotated Listings

Another interesting function provided by the dircache module is annotate(), which modifies a list, such as the one returned by listdir(), by adding a '/' to the end of the names that represent directories.
    import dircache
    from pprint import pprint
    import os

    path = '../..'

    contents = dircache.listdir(path)

    annotated = contents[:]
    dircache.annotate(path, annotated)

    fmt = '%25s\t%25s'

    print fmt % ('ORIGINAL', 'ANNOTATED')
    print fmt % (('-' * 25,)*2)

    for o, a in zip(contents, annotated):
        print fmt % (o, a)

Unfortunately for Windows users, although annotate() uses os.path.join() to construct names to test, it always appends a '/', not os.sep.

    $ python dircache_annotate.py

                     ORIGINAL                    ANNOTATED
    -------------------------    -------------------------
                    .DS_Store                    .DS_Store
                          .hg                         .hg/
                    .hgignore                    .hgignore
                      .hgtags                      .hgtags
                  LICENSE.txt                  LICENSE.txt
                  MANIFEST.in                  MANIFEST.in
                       PyMOTW                      PyMOTW/
              PyMOTW.egg-info             PyMOTW.egg-info/
                   README.txt                   README.txt
                          bin                         bin/
                         dist                        dist/
                       module                       module
                         motw                         motw
                       output                      output/
                  pavement.py                  pavement.py
            paver-minilib.zip            paver-minilib.zip
                     setup.py                     setup.py
       sitemap_gen_config.xml       sitemap_gen_config.xml
                       sphinx                      sphinx/
                    structure                   structure/
                    trace.txt                    trace.txt
                        utils                       utils/

See Also:

dircache (http://docs.python.org/library/dircache.html) The standard library documentation for this module.

6.11 filecmp—Compare Files

Purpose: Compare files and directories on the file system.
Python Version: 2.1 and later

The filecmp module includes functions and a class for comparing files and directories on the file system.

6.11.1 Example Data

The examples in this discussion use a set of test files created by filecmp_mkexamples.py.
    import os

    def mkfile(filename, body=None):
        with open(filename, 'w') as f:
            f.write(body or filename)
        return

    def make_example_dir(top):
        if not os.path.exists(top):
            os.mkdir(top)
        curdir = os.getcwd()
        os.chdir(top)

        os.mkdir('dir1')
        os.mkdir('dir2')

        mkfile('dir1/file_only_in_dir1')
        mkfile('dir2/file_only_in_dir2')

        os.mkdir('dir1/dir_only_in_dir1')
        os.mkdir('dir2/dir_only_in_dir2')

        os.mkdir('dir1/common_dir')
        os.mkdir('dir2/common_dir')

        mkfile('dir1/common_file', 'this file is the same')
        mkfile('dir2/common_file', 'this file is the same')

        mkfile('dir1/not_the_same')
        mkfile('dir2/not_the_same')

        mkfile('dir1/file_in_dir1', 'This is a file in dir1')
        os.mkdir('dir2/file_in_dir1')

        os.chdir(curdir)
        return

    if __name__ == '__main__':
        os.chdir(os.path.dirname(__file__) or os.getcwd())
        make_example_dir('example')
        make_example_dir('example/dir1/common_dir')
        make_example_dir('example/dir2/common_dir')

Running filecmp_mkexamples.py produces a tree of files under the directory example:

    $ find example

    example
    example/dir1
    example/dir1/common_dir
    example/dir1/common_dir/dir1
    example/dir1/common_dir/dir1/common_dir
    example/dir1/common_dir/dir1/common_file
    example/dir1/common_dir/dir1/dir_only_in_dir1
    example/dir1/common_dir/dir1/file_in_dir1
    example/dir1/common_dir/dir1/file_only_in_dir1
    example/dir1/common_dir/dir1/not_the_same
    example/dir1/common_dir/dir2
    example/dir1/common_dir/dir2/common_dir
    example/dir1/common_dir/dir2/common_file
    example/dir1/common_dir/dir2/dir_only_in_dir2
    example/dir1/common_dir/dir2/file_in_dir1
    example/dir1/common_dir/dir2/file_only_in_dir2
    example/dir1/common_dir/dir2/not_the_same
    example/dir1/common_file
    example/dir1/dir_only_in_dir1
    example/dir1/file_in_dir1
    example/dir1/file_only_in_dir1
    example/dir1/not_the_same
    example/dir2
    example/dir2/common_dir
    example/dir2/common_dir/dir1
    example/dir2/common_dir/dir1/common_dir
    example/dir2/common_dir/dir1/common_file
    example/dir2/common_dir/dir1/dir_only_in_dir1
    example/dir2/common_dir/dir1/file_in_dir1
    example/dir2/common_dir/dir1/file_only_in_dir1
    example/dir2/common_dir/dir1/not_the_same
    example/dir2/common_dir/dir2
    example/dir2/common_dir/dir2/common_dir
    example/dir2/common_dir/dir2/common_file
    example/dir2/common_dir/dir2/dir_only_in_dir2
    example/dir2/common_dir/dir2/file_in_dir1
    example/dir2/common_dir/dir2/file_only_in_dir2
    example/dir2/common_dir/dir2/not_the_same
    example/dir2/common_file
    example/dir2/dir_only_in_dir2
    example/dir2/file_in_dir1
    example/dir2/file_only_in_dir2
    example/dir2/not_the_same

The same directory structure is repeated one time under the “common_dir” directories to give interesting recursive comparison options.

6.11.2 Comparing Files

cmp() compares two files on the file system.

    import filecmp

    print 'common_file:',
    print filecmp.cmp('example/dir1/common_file',
                      'example/dir2/common_file'),
    print filecmp.cmp('example/dir1/common_file',
                      'example/dir2/common_file',
                      shallow=False)

    print 'not_the_same:',
    print filecmp.cmp('example/dir1/not_the_same',
                      'example/dir2/not_the_same'),
    print filecmp.cmp('example/dir1/not_the_same',
                      'example/dir2/not_the_same',
                      shallow=False)

    print 'identical:',
    print filecmp.cmp('example/dir1/file_only_in_dir1',
                      'example/dir1/file_only_in_dir1'),
    print filecmp.cmp('example/dir1/file_only_in_dir1',
                      'example/dir1/file_only_in_dir1',
                      shallow=False)

The shallow argument tells cmp() whether to look at the contents of the file, in addition to its metadata. The default is to perform a shallow comparison using the information available from os.stat() without looking at content. Files of the same size created at the same time are reported as the same, if their contents are not compared.

    $ python filecmp_cmp.py

    common_file: True True
    not_the_same: True False
    identical: True True

To compare a set of files in two directories without recursing, use cmpfiles().
The arguments are the names of the directories and a list of files to be checked in the two locations. The list of common files passed in should contain only filenames (directories always result in a mismatch), and the files must be present in both locations. The next example shows a simple way to build the common list. The comparison also takes the shallow flag, just as with cmp().

    import filecmp
    import os

    # Determine the items that exist in both directories
    d1_contents = set(os.listdir('example/dir1'))
    d2_contents = set(os.listdir('example/dir2'))
    common = list(d1_contents & d2_contents)
    common_files = [ f
                     for f in common
                     if os.path.isfile(os.path.join('example/dir1', f))
                     ]
    print 'Common files:', common_files

    # Compare the directories
    match, mismatch, errors = filecmp.cmpfiles('example/dir1',
                                               'example/dir2',
                                               common_files)
    print 'Match :', match
    print 'Mismatch:', mismatch
    print 'Errors :', errors

cmpfiles() returns three lists of filenames containing files that match, files that do not match, and files that could not be compared (due to permission problems or for any other reason).

    $ python filecmp_cmpfiles.py

    Common files: ['not_the_same', 'file_in_dir1', 'common_file']
    Match : ['not_the_same', 'common_file']
    Mismatch: ['file_in_dir1']
    Errors : []

6.11.3 Comparing Directories

The functions described earlier are suitable for relatively simple comparisons. For recursive comparison of large directory trees or for more complete analysis, the dircmp class is more useful. In its simplest use case, report() prints a report comparing two directories.

    import filecmp

    filecmp.dircmp('example/dir1', 'example/dir2').report()

The output is a plain-text report showing the results of just the contents of the directories given, without recursing. In this case, the file “not_the_same” is thought to be the same because the contents are not being compared. There is no way to have dircmp compare the contents of files like cmp() does.
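Where a content comparison is needed, one possible workaround is to let dircmp find the names common to both directories and then re-check those files with cmpfiles() and shallow=False. The following is a minimal sketch under that assumption, written so that it also runs on Python 3; deep_mismatches() and write() are hypothetical helper names, and the temporary directories merely stand in for the example tree.

```python
import filecmp
import os
import tempfile


def deep_mismatches(left, right):
    """Return common file names whose contents differ or cannot be read."""
    dc = filecmp.dircmp(left, right)
    # cmpfiles() accepts shallow=False, so file contents are compared.
    match, mismatch, errors = filecmp.cmpfiles(
        left, right, dc.common_files, shallow=False)
    return sorted(mismatch + errors)


def write(path, body):
    with open(path, 'w') as f:
        f.write(body)


# Build two small directories (stand-ins for example/dir1 and dir2).
top = tempfile.mkdtemp()
left = os.path.join(top, 'dir1')
right = os.path.join(top, 'dir2')
os.mkdir(left)
os.mkdir(right)
write(os.path.join(left, 'common_file'), 'this file is the same')
write(os.path.join(right, 'common_file'), 'this file is the same')
# Same size, different contents.
write(os.path.join(left, 'not_the_same'), 'dir1/not_the_same')
write(os.path.join(right, 'not_the_same'), 'dir2/not_the_same')

differing = deep_mismatches(left, right)
print(differing)
```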
    $ python filecmp_dircmp_report.py

    diff example/dir1 example/dir2
    Only in example/dir1 : ['dir_only_in_dir1', 'file_only_in_dir1']
    Only in example/dir2 : ['dir_only_in_dir2', 'file_only_in_dir2']
    Identical files : ['common_file', 'not_the_same']
    Common subdirectories : ['common_dir']
    Common funny cases : ['file_in_dir1']

For more detail, and a recursive comparison, use report_full_closure():

    import filecmp

    filecmp.dircmp('example/dir1', 'example/dir2').report_full_closure()

The output includes comparisons of all parallel subdirectories.

    $ python filecmp_dircmp_report_full_closure.py

    diff example/dir1 example/dir2
    Only in example/dir1 : ['dir_only_in_dir1', 'file_only_in_dir1']
    Only in example/dir2 : ['dir_only_in_dir2', 'file_only_in_dir2']
    Identical files : ['common_file', 'not_the_same']
    Common subdirectories : ['common_dir']
    Common funny cases : ['file_in_dir1']

    diff example/dir1/common_dir example/dir2/common_dir
    Common subdirectories : ['dir1', 'dir2']

    diff example/dir1/common_dir/dir2 example/dir2/common_dir/dir2
    Identical files : ['common_file', 'file_only_in_dir2', 'not_the_same']
    Common subdirectories : ['common_dir', 'dir_only_in_dir2', 'file_in_dir1']

    diff example/dir1/common_dir/dir2/common_dir example/dir2/common_dir/dir2/common_dir

    diff example/dir1/common_dir/dir2/dir_only_in_dir2 example/dir2/common_dir/dir2/dir_only_in_dir2

    diff example/dir1/common_dir/dir2/file_in_dir1 example/dir2/common_dir/dir2/file_in_dir1

    diff example/dir1/common_dir/dir1 example/dir2/common_dir/dir1
    Identical files : ['common_file', 'file_in_dir1', 'file_only_in_dir1', 'not_the_same']
    Common subdirectories : ['common_dir', 'dir_only_in_dir1']

    diff example/dir1/common_dir/dir1/common_dir example/dir2/common_dir/dir1/common_dir

    diff example/dir1/common_dir/dir1/dir_only_in_dir1 example/dir2/common_dir/dir1/dir_only_in_dir1

6.11.4 Using Differences in a Program

Besides producing printed reports, dircmp calculates lists of files that can be used in programs directly. Each of the following attributes is calculated only when requested, so creating a dircmp instance does not incur overhead for unused data.

    import filecmp
    import pprint

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print 'Left:'
    pprint.pprint(dc.left_list)

    print '\nRight:'
    pprint.pprint(dc.right_list)

The files and subdirectories contained in the directories being compared are listed in left_list and right_list.

    $ python filecmp_dircmp_list.py

    Left:
    ['common_dir',
     'common_file',
     'dir_only_in_dir1',
     'file_in_dir1',
     'file_only_in_dir1',
     'not_the_same']

    Right:
    ['common_dir',
     'common_file',
     'dir_only_in_dir2',
     'file_in_dir1',
     'file_only_in_dir2',
     'not_the_same']

The inputs can be filtered by passing a list of names to ignore to the constructor. By default, the names RCS, CVS, and tags are ignored.

    import filecmp
    import pprint

    dc = filecmp.dircmp('example/dir1', 'example/dir2',
                        ignore=['common_file'])

    print 'Left:'
    pprint.pprint(dc.left_list)

    print '\nRight:'
    pprint.pprint(dc.right_list)

In this case, the “common_file” is left out of the list of files to be compared.

    $ python filecmp_dircmp_list_filter.py

    Left:
    ['common_dir',
     'dir_only_in_dir1',
     'file_in_dir1',
     'file_only_in_dir1',
     'not_the_same']

    Right:
    ['common_dir',
     'dir_only_in_dir2',
     'file_in_dir1',
     'file_only_in_dir2',
     'not_the_same']

The names of files common to both input directories are saved in common, and the files unique to each directory are listed in left_only and right_only.

    import filecmp
    import pprint

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print 'Common:'
    pprint.pprint(dc.common)

    print '\nLeft:'
    pprint.pprint(dc.left_only)

    print '\nRight:'
    pprint.pprint(dc.right_only)

The “left” directory is the first argument to dircmp(), and the “right” directory is the second.

    $ python filecmp_dircmp_membership.py

    Common:
    ['not_the_same', 'common_file', 'file_in_dir1', 'common_dir']

    Left:
    ['dir_only_in_dir1', 'file_only_in_dir1']

    Right:
    ['dir_only_in_dir2', 'file_only_in_dir2']

The common members can be further broken down into files, directories, and “funny” items (anything that has a different type in the two directories or where there is an error from os.stat()).

    import filecmp
    import pprint

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print 'Common:'
    pprint.pprint(dc.common)

    print '\nDirectories:'
    pprint.pprint(dc.common_dirs)

    print '\nFiles:'
    pprint.pprint(dc.common_files)

    print '\nFunny:'
    pprint.pprint(dc.common_funny)

In the example data, the item named “file_in_dir1” is a file in one directory and a subdirectory in the other, so it shows up in the funny list.

    $ python filecmp_dircmp_common.py

    Common:
    ['not_the_same', 'common_file', 'file_in_dir1', 'common_dir']

    Directories:
    ['common_dir']

    Files:
    ['not_the_same', 'common_file']

    Funny:
    ['file_in_dir1']

The differences between files are broken down similarly.

    import filecmp

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print 'Same :', dc.same_files
    print 'Different :', dc.diff_files
    print 'Funny :', dc.funny_files

The file not_the_same is only being compared via os.stat(), and the contents are not examined, so it is included in the same_files list.

    $ python filecmp_dircmp_diff.py

    Same : ['not_the_same', 'common_file']
    Different : []
    Funny : []

Finally, the subdirectories are also saved to allow easy recursive comparison.

    import filecmp

    dc = filecmp.dircmp('example/dir1', 'example/dir2')
    print 'Subdirectories:'
    print dc.subdirs

The attribute subdirs is a dictionary mapping the directory name to new dircmp objects.

    $ python filecmp_dircmp_subdirs.py

    Subdirectories:
    {'common_dir': <filecmp.dircmp instance at 0x...>}

See Also:

filecmp (http://docs.python.org/library/filecmp.html) The standard library documentation for this module.
Directories (page 1118) Listing the contents of a directory using os (page 1108).
difflib (page 61) Computing the differences between two sequences.

Chapter 7

DATA PERSISTENCE AND EXCHANGE

There are two aspects to preserving data for long-term use: converting the data back and forth between the object in-memory and the storage format, and working with the storage of the converted data. The standard library includes a variety of modules that handle both aspects in different situations.

Two modules convert objects into a format that can be transmitted or stored (a process known as serializing). It is most common to use pickle for persistence, since it is integrated with some of the other standard library modules that actually store the serialized data, such as shelve. json is more frequently used for Web-based applications, however, since it integrates better with existing Web service storage tools.

Once the in-memory object is converted to a format that can be saved, the next step is to decide how to store the data. A simple flat-file with serialized objects written one after the other works for data that does not need to be indexed in any way. Python includes a collection of modules for storing key-value pairs in a simple database using one of the DBM format variants when an indexed lookup is needed.

The most straightforward way to take advantage of the DBM format is shelve. Open the shelve file, and access it through a dictionary-like API. Objects saved to the database are automatically pickled and saved without any extra work by the caller.

One drawback of shelve, though, is that when using the default interface, there is no way to predict which DBM format will be used, since it selects one based on the libraries available on the system where the database is created. The format does not matter if an application will not need to share the database files between hosts with different libraries; but if portability is a requirement, use one of the classes in the module to ensure a specific format is selected.
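The dictionary-like access described above can be sketched in a few lines. This is only an illustration using the default shelve.open() interface, so the DBM format it produces depends on the system where it runs; the file name example_shelf and the stored settings are hypothetical, and the code is written so that it runs on Python 3 as well as Python 2.

```python
import os
import shelve
import tempfile

# Hypothetical location for the example database.
db_path = os.path.join(tempfile.mkdtemp(), 'example_shelf')

# Values are pickled automatically when they are assigned to a key.
db = shelve.open(db_path)
try:
    db['settings'] = {'retries': 3, 'hosts': ['a', 'b']}
finally:
    db.close()

# Reopen the shelf and read the object back.
db = shelve.open(db_path)
try:
    restored = db['settings']
finally:
    db.close()

print(restored)
```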
For Web applications that work with data in JSON already, using json and anydbm provides another persistence mechanism. Using anydbm directly is a little more work than shelve because the DBM database keys and values must be strings, and the objects will not be re-created automatically when the value is accessed in the database.

The sqlite3 in-process relational database is available with most Python distributions for storing data in more complex arrangements than key-value pairs. It stores its database in memory or in a local file, and all access is from within the same process so there is no network communication lag. The compact nature of sqlite3 makes it especially well suited for embedding in desktop applications or development versions of Web apps.

There are also modules for parsing more formally defined formats, useful for exchanging data between Python programs and applications written in other languages. xml.etree.ElementTree can parse XML documents and provides several operating modes for different applications. Besides the parsing tools, ElementTree includes an interface for creating well-formed XML documents from objects in memory. The csv module can read and write tabular data in formats produced by spreadsheets or database applications, making it useful for bulk loading data or converting the data from one format to another.

7.1 pickle—Object Serialization

Purpose: Object serialization.
Python Version: 1.4 and later for pickle, 1.5 and later for cPickle

The pickle module implements an algorithm for turning an arbitrary Python object into a series of bytes. This process is also called serializing the object. The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

The cPickle module implements the same algorithm, in C instead of Python.
It is many times faster than the Python implementation, so it is generally used instead of the pure-Python implementation.

Warning: The documentation for pickle makes clear that it offers no security guarantees. In fact, unpickling data can execute arbitrary code. Be careful using pickle for inter-process communication or data storage, and do not trust data that cannot be verified as secure. See Applications of Message Signatures in the hmac section for an example of a secure way to verify the origin of pickled data.

7.1.1 Importing

Because cPickle is faster than pickle, it is common to first try to import cPickle, giving it an alias of “pickle,” and then fall back on the native Python implementation in pickle if the import fails. This means the program will use the faster implementation, if it is available, and the portable implementation otherwise.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle

The API for the C and Python versions is the same, and data can be exchanged between programs using either version of the library.

7.1.2 Encoding and Decoding Data in Strings

This first example uses dumps() to encode a data structure as a string, and then prints the string to the console. It uses a data structure made up of entirely built-in types. Instances of any class can be pickled, as will be illustrated in a later example.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import pprint

    data = [ { 'a':'A', 'b':2, 'c':3.0 } ]
    print 'DATA:',
    pprint.pprint(data)

    data_string = pickle.dumps(data)
    print 'PICKLE: %r' % data_string

By default, the pickle will contain only ASCII characters. A more efficient binary pickle format is also available, but all the examples here use the ASCII output because it is easier to understand in print.

    $ python pickle_string.py

    DATA:[{'a': 'A', 'b': 2, 'c': 3.0}]
    PICKLE: "(lp1\n(dp2\nS'a'\nS'A'\nsS'c'\nF3\nsS'b'\nI2\nsa."
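The remark about the binary format can be tried with the protocol argument to dumps(). The following is a minimal sketch written for Python 3, which is an assumption (the listings in this chapter target Python 2). Python 3 already defaults to a binary protocol, so the printable form has to be requested explicitly with protocol=0.

```python
import pickle

data = [{'a': 'A', 'b': 2, 'c': 3.0}]

# Protocol 0 is the original printable format used by the examples here.
text_pickle = pickle.dumps(data, protocol=0)

# HIGHEST_PROTOCOL selects the newest binary format this interpreter knows.
binary_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print('PROTOCOL 0: %r' % text_pickle)
print('BINARY    : %r' % binary_pickle)

# loads() detects the protocol on its own, so both forms decode the same way.
print('EQUAL?    :', pickle.loads(text_pickle) == pickle.loads(binary_pickle))
```

The binary form is harder to read in print, which is why the examples in this section stay with the text protocol.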
After the data is serialized, it can be written to a file, a socket, a pipe, and so on. Later, the file can be read and the data unpickled to construct a new object with the same values.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import pprint

    data1 = [ { 'a':'A', 'b':2, 'c':3.0 } ]
    print 'BEFORE: ',
    pprint.pprint(data1)

    data1_string = pickle.dumps(data1)

    data2 = pickle.loads(data1_string)
    print 'AFTER : ',
    pprint.pprint(data2)

    print 'SAME? :', (data1 is data2)
    print 'EQUAL?:', (data1 == data2)

The newly constructed object is equal to, but not the same object as, the original.

    $ python pickle_unpickle.py

    BEFORE: [{'a': 'A', 'b': 2, 'c': 3.0}]
    AFTER : [{'a': 'A', 'b': 2, 'c': 3.0}]
    SAME? : False
    EQUAL?: True

7.1.3 Working with Streams

In addition to dumps() and loads(), pickle provides convenience functions for working with file-like streams. It is possible to write multiple objects to a stream and then read them from the stream without knowing in advance how many objects are written or how big they are.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import pprint
    from StringIO import StringIO

    class SimpleObject(object):

        def __init__(self, name):
            self.name = name
            self.name_backwards = name[::-1]
            return

    data = []
    data.append(SimpleObject('pickle'))
    data.append(SimpleObject('cPickle'))
    data.append(SimpleObject('last'))

    # Simulate a file with StringIO
    out_s = StringIO()

    # Write to the stream
    for o in data:
        print 'WRITING : %s (%s)' % (o.name, o.name_backwards)
        pickle.dump(o, out_s)
        out_s.flush()

    # Set up a read-able stream
    in_s = StringIO(out_s.getvalue())

    # Read the data
    while True:
        try:
            o = pickle.load(in_s)
        except EOFError:
            break
        else:
            print 'READ : %s (%s)' % (o.name, o.name_backwards)

The example simulates streams using two StringIO buffers. The first receives the pickled objects, and its value is fed to a second from which load() reads.
A simple database format could use pickles to store objects, too (see shelve).

    $ python pickle_stream.py

    WRITING : pickle (elkcip)
    WRITING : cPickle (elkciPc)
    WRITING : last (tsal)
    READ : pickle (elkcip)
    READ : cPickle (elkciPc)
    READ : last (tsal)

Besides storing data, pickles are handy for inter-process communication. For example, os.fork() and os.pipe() can be used to establish worker processes that read job instructions from one pipe and write the results to another pipe. The core code for managing the worker pool, sending jobs in, and receiving responses can be reused, since the job and response objects do not have to be based on a particular class. When using pipes or sockets, do not forget to flush after dumping each object, to push the data through the connection to the other end. See the multiprocessing module for a reusable worker pool manager.

7.1.4 Problems Reconstructing Objects

When working with custom classes, the class being pickled must appear in the namespace of the process reading the pickle. Only the data for the instance is pickled, not the class definition. The class name is used to find the constructor to create the new object when unpickling. This example writes instances of a class to a file.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import sys

    class SimpleObject(object):

        def __init__(self, name):
            self.name = name
            l = list(name)
            l.reverse()
            self.name_backwards = ''.join(l)
            return

    if __name__ == '__main__':
        data = []
        data.append(SimpleObject('pickle'))
        data.append(SimpleObject('cPickle'))
        data.append(SimpleObject('last'))

        filename = sys.argv[1]

        with open(filename, 'wb') as out_s:
            # Write to the stream
            for o in data:
                print 'WRITING: %s (%s)' % (o.name, o.name_backwards)
                pickle.dump(o, out_s)

When run, the script creates a file based on the name given as argument on the command line.
    $ python pickle_dump_to_file_1.py test.dat

    WRITING: pickle (elkcip)
    WRITING: cPickle (elkciPc)
    WRITING: last (tsal)

A simplistic attempt to load the resulting pickled objects fails.

    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import pprint
    from StringIO import StringIO
    import sys

    filename = sys.argv[1]

    with open(filename, 'rb') as in_s:
        # Read the data
        while True:
            try:
                o = pickle.load(in_s)
            except EOFError:
                break
            else:
                print 'READ: %s (%s)' % (o.name, o.name_backwards)

This version fails because there is no SimpleObject class available.

    $ python pickle_load_from_file_1.py test.dat

    Traceback (most recent call last):
      File "pickle_load_from_file_1.py", line 25, in <module>
        o = pickle.load(in_s)
    AttributeError: 'module' object has no attribute 'SimpleObject'

The corrected version, which imports SimpleObject from the original script, succeeds. Adding this import statement to the end of the import list allows the script to find the class and construct the object.

    from pickle_dump_to_file_1 import SimpleObject

Running the modified script now produces the desired results.

    $ python pickle_load_from_file_2.py test.dat

    READ: pickle (elkcip)
    READ: cPickle (elkciPc)
    READ: last (tsal)

7.1.5 Unpicklable Objects

Not all objects can be pickled. Sockets, file handles, database connections, and other objects with run-time state that depends on the operating system or another process often cannot be saved in a meaningful way. Objects that have nonpicklable attributes can define __getstate__() to return the subset of the instance state that should be pickled, and __setstate__() to restore an instance from that saved state. New-style classes can also define __getnewargs__(), which should return arguments to be passed to the class memory allocator (C.__new__()). Use of these features is covered in more detail in the standard library documentation.
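The two state hooks can be sketched as follows. This is an illustration only, written for Python 3 (an assumption; the listings in this chapter target Python 2). The Counter class is hypothetical, and a threading.Lock stands in for any attribute that cannot be pickled.

```python
import pickle
import threading


class Counter:
    """Hypothetical class with one attribute that cannot be pickled."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.lock = threading.Lock()  # run-time state, not picklable

    def increment(self):
        with self.lock:
            self.count += 1

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then create a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()


original = Counter('jobs')
original.increment()

restored = pickle.loads(pickle.dumps(original))
print('RESTORED:', restored.name, restored.count)
```

Without the two methods, the call to dumps() fails with a TypeError because the lock cannot be serialized. With them, the restored object carries the saved counter value and its own new lock.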
7.1.6 Circular References

The pickle protocol automatically handles circular references between objects, so complex data structures do not need any special handling. Consider the directed graph in Figure 7.1. It includes several cycles, yet the correct structure can be pickled and then reloaded.

[Figure 7.1. Pickling a data structure with cycles (nodes: root, a, b, c)]

    import pickle

    class Node(object):
        """A simple digraph"""
        def __init__(self, name):
            self.name = name
            self.connections = []

        def add_edge(self, node):
            "Create an edge between this node and the other."
            self.connections.append(node)

        def __iter__(self):
            return iter(self.connections)

    def preorder_traversal(root, seen=None, parent=None):
        """Generator function to yield the edges in a graph.
        """
        if seen is None:
            seen = set()
        yield (parent, root)
        if root in seen:
            return
        seen.add(root)
        for node in root:
            for parent, subnode in preorder_traversal(node, seen, root):
                yield (parent, subnode)

    def show_edges(root):
        "Print all the edges in the graph."
        for parent, child in preorder_traversal(root):
            if not parent:
                continue
            print '%5s -> %2s (%s)' % \
                (parent.name, child.name, id(child))

    # Set up the nodes.
    root = Node('root')
    a = Node('a')
    b = Node('b')
    c = Node('c')

    # Add edges between them.
    root.add_edge(a)
    root.add_edge(b)
    a.add_edge(b)
    b.add_edge(a)
    b.add_edge(c)
    a.add_edge(a)

    print 'ORIGINAL GRAPH:'
    show_edges(root)

    # Pickle and unpickle the graph to create
    # a new set of nodes.
    dumped = pickle.dumps(root)
    reloaded = pickle.loads(dumped)

    print '\nRELOADED GRAPH:'
    show_edges(reloaded)

The reloaded nodes are not the same object, but the relationship between the nodes is maintained and only one copy of the object with multiple references is reloaded. Both of these statements can be verified by examining the id() values for the nodes before and after being passed through pickle.
    $ python pickle_cycle.py

    ORIGINAL GRAPH:
     root ->  a (4309376848)
        a ->  b (4309376912)
        b ->  a (4309376848)
        b ->  c (4309376976)
        a ->  a (4309376848)
     root ->  b (4309376912)

    RELOADED GRAPH:
     root ->  a (4309418128)
        a ->  b (4309418192)
        b ->  a (4309418128)
        b ->  c (4309418256)
        a ->  a (4309418128)
     root ->  b (4309418192)

See Also:
pickle (http://docs.python.org/lib/module-pickle.html) Standard library documentation for this module.
Pickle: An interesting stack language (http://peadrop.com/blog/2007/06/18/pickle-an-interesting-stack-language/) A blog post by Alexandre Vassalotti.
Why Python Pickle is Insecure (http://nadiana.com/python-pickle-insecure) A short example by Nadia Alramli demonstrating a security exploit using pickle.
shelve (page 343) The shelve module uses pickle to store data in a DBM database.

7.2 shelve—Persistent Storage of Objects

Purpose: The shelve module implements persistent storage for arbitrary Python objects that can be pickled, using a dictionary-like API.

The shelve module can be used as a simple persistent storage option for Python objects when a relational database is not required. The shelf is accessed by keys, just as with a dictionary. The values are pickled and written to a database created and managed by anydbm.

7.2.1 Creating a New Shelf

The simplest way to use shelve is via the DbfilenameShelf class. It uses anydbm to store the data. The class can be used directly or by calling shelve.open().

    import shelve
    from contextlib import closing

    with closing(shelve.open('test_shelf.db')) as s:
        s['key1'] = { 'int': 10, 'float':9.5, 'string':'Sample data' }

To access the data again, open the shelf and use it like a dictionary.

    import shelve
    from contextlib import closing

    with closing(shelve.open('test_shelf.db')) as s:
        existing = s['key1']

    print existing

This is what running both sample scripts produces.
    $ python shelve_create.py
    $ python shelve_existing.py

    {'int': 10, 'float': 9.5, 'string': 'Sample data'}

The dbm module does not support multiple applications writing to the same database at the same time, but it does support concurrent read-only clients. If a client will not be modifying the shelf, tell shelve to open the database in read-only mode by passing flag='r'.

    import shelve
    from contextlib import closing

    with closing(shelve.open('test_shelf.db', flag='r')) as s:
        existing = s['key1']

    print existing

If the program tries to modify the database while it is opened in read-only mode, an access error exception is generated. The exception type depends on the database module selected by anydbm when the database was created.

7.2.2 Writeback

By default, shelves do not track modifications to mutable objects. That means if the contents of an item stored in the shelf are changed, the shelf must be updated explicitly by storing the entire item again.

    import shelve
    from contextlib import closing

    with closing(shelve.open('test_shelf.db')) as s:
        print s['key1']
        s['key1']['new_value'] = 'this was not here before'

    with closing(shelve.open('test_shelf.db', writeback=True)) as s:
        print s['key1']

In this example, the dictionary at 'key1' is not stored again, so when the shelf is reopened, the changes will not have been preserved.

    $ python shelve_create.py
    $ python shelve_withoutwriteback.py

    {'int': 10, 'float': 9.5, 'string': 'Sample data'}
    {'int': 10, 'float': 9.5, 'string': 'Sample data'}

To automatically catch changes to mutable objects stored in the shelf, open it with writeback enabled. The writeback flag causes the shelf to remember all objects retrieved from the database using an in-memory cache. Each cached object is also written back to the database when the shelf is closed.
    import shelve
    import pprint
    from contextlib import closing

    with closing(shelve.open('test_shelf.db', writeback=True)) as s:
        print 'Initial data:'
        pprint.pprint(s['key1'])

        s['key1']['new_value'] = 'this was not here before'
        print '\nModified:'
        pprint.pprint(s['key1'])

    with closing(shelve.open('test_shelf.db', writeback=True)) as s:
        print '\nPreserved:'
        pprint.pprint(s['key1'])

Although it reduces the chance of programmer error and can make object persistence more transparent, using writeback mode may not be desirable in every situation. The cache consumes extra memory while the shelf is open, and pausing to write every cached object back to the database when it is closed slows down the application. All cached objects are written back to the database because there is no way to tell if they have been modified. If the application reads data more than it writes, writeback will impact performance unnecessarily.

    $ python shelve_create.py
    $ python shelve_writeback.py

    Initial data:
    {'float': 9.5, 'int': 10, 'string': 'Sample data'}

    Modified:
    {'float': 9.5,
     'int': 10,
     'new_value': 'this was not here before',
     'string': 'Sample data'}

    Preserved:
    {'float': 9.5,
     'int': 10,
     'new_value': 'this was not here before',
     'string': 'Sample data'}

7.2.3 Specific Shelf Types

The earlier examples all used the default shelf implementation. Using shelve.open() instead of one of the shelf implementations directly is a common usage pattern, especially if it does not matter what type of database is used to store the data. There are times, however, when the database format is important. In those situations, use DbfilenameShelf or BsdDbShelf directly, or even subclass Shelf for a custom solution.

See Also:
shelve (http://docs.python.org/lib/module-shelve.html) Standard library documentation for this module.
feedcache (www.doughellmann.com/projects/feedcache/) The feedcache module uses shelve as a default storage option.
shove (http://pypi.python.org/pypi/shove/) Shove implements a similar API with more back-end formats.
anydbm (page 347) The anydbm module finds an available DBM library to create a new database.

7.3 anydbm—DBM-Style Databases

Purpose: anydbm provides a generic dictionary-like interface to DBM-style, string-keyed databases.
Python Version: 1.4 and later

anydbm is a front-end for DBM-style databases that use simple string values as keys to access records containing strings. It uses whichdb to identify databases, and then opens them with the appropriate module. It is used as a back-end for shelve, which stores objects in a DBM database using pickle.

7.3.1 Database Types

Python comes with several modules for accessing DBM-style databases. The implementation selected depends on the libraries available on the current system and the options used when Python was compiled.

dbhash

The dbhash module is the primary back-end for anydbm. It uses the bsddb library to manage database files. The semantics for using dbhash databases are the same as those defined by the anydbm API.

gdbm

gdbm is an updated version of the dbm library from the GNU project. It works the same as the other DBM implementations described here, with a few changes to the flags supported by open().

Besides the standard 'r', 'w', 'c', and 'n' flags, gdbm.open() supports:

• 'f' to open the database in fast mode. In fast mode, writes to the database are not synchronized.
• 's' to open the database in synchronized mode. Changes to the database are written to the file as they are made, rather than being delayed until the database is closed or synced explicitly.
• 'u' to open the database unlocked.

dbm

The dbm module provides an interface to one of several C implementations of the dbm format, depending on how the module was configured during compilation.
The module attribute library identifies the name of the library configure was able to find when the extension module was compiled.

dumbdbm

The dumbdbm module is a portable fallback implementation of the DBM API when no other implementations are available. No external dependencies are required to use dumbdbm, but it is slower than most other implementations.

7.3.2 Creating a New Database

The storage format for new databases is selected by looking for each of these modules in order:

• dbhash
• gdbm
• dbm
• dumbdbm

The open() function takes flags to control how the database file is managed. To create a new database when necessary, use 'c'. Using 'n' always creates a new database, overwriting an existing file.

    import anydbm

    db = anydbm.open('/tmp/example.db', 'n')
    db['key'] = 'value'
    db['today'] = 'Sunday'
    db['author'] = 'Doug'
    db.close()

In this example, the file is always reinitialized.

    $ python anydbm_new.py

whichdb reports the type of database that was created.

    import whichdb

    print whichdb.whichdb('/tmp/example.db')

Output from the example program will vary, depending on which modules are installed on the system.

    $ python anydbm_whichdb.py

    dbhash

7.3.3 Opening an Existing Database

To open an existing database, use flags of either 'r' (for read-only) or 'w' (for read-write). Existing databases are automatically given to whichdb to identify, so as long as a file can be identified, the appropriate module is used to open it.

    import anydbm

    db = anydbm.open('/tmp/example.db', 'r')
    try:
        print 'keys():', db.keys()
        for k, v in db.iteritems():
            print 'iterating:', k, v
        print 'db["author"] =', db['author']
    finally:
        db.close()

Once open, db is a dictionary-like object, with support for the usual methods.
    $ python anydbm_existing.py

    keys(): ['author', 'key', 'today']
    iterating: author Doug
    iterating: key value
    iterating: today Sunday
    db["author"] = Doug

7.3.4 Error Cases

The keys of the database need to be strings.

    import anydbm

    db = anydbm.open('/tmp/example.db', 'w')
    try:
        db[1] = 'one'
    except TypeError, err:
        print '%s: %s' % (err.__class__.__name__, err)
    finally:
        db.close()

Passing another type results in a TypeError.

    $ python anydbm_intkeys.py

    TypeError: Integer keys only allowed for Recno and Queue DB's

Values must be strings or None.

    import anydbm

    db = anydbm.open('/tmp/example.db', 'w')
    try:
        db['one'] = 1
    except TypeError, err:
        print '%s: %s' % (err.__class__.__name__, err)
    finally:
        db.close()

A similar TypeError is raised if a value is not a string.

    $ python anydbm_intvalue.py

    TypeError: Data values must be of type string or None.

See Also:
anydbm (http://docs.python.org/library/anydbm.html) The standard library documentation for this module.
shelve (page 343) Examples for the shelve module, which uses anydbm to store data.

7.4 whichdb—Identify DBM-Style Database Formats

Purpose: Examine an existing DBM-style database file to determine which library should be used to open it.
Python Version: 1.4 and later

The whichdb module contains one function, whichdb(), that can be used to examine an existing database file to determine which of the DBM libraries should be used to open it. It returns the string name of the module to use to open the file, or None if there is a problem opening the file. If it can open the file but cannot determine the library to use, it returns an empty string.

    import anydbm
    import whichdb

    db = anydbm.open('/tmp/example.db', 'n')
    db['key'] = 'value'
    db.close()

    print whichdb.whichdb('/tmp/example.db')

The results from running the sample program will vary, depending on the modules available on the system.
    $ python whichdb_whichdb.py

    dbhash

See Also:
whichdb (http://docs.python.org/lib/module-whichdb.html) Standard library documentation for this module.
anydbm (page 347) The anydbm module uses the best available DBM implementation when creating new databases.
shelve (page 343) The shelve module provides a mapping-style API for DBM databases.

7.5 sqlite3—Embedded Relational Database

Purpose: Implements an embedded relational database with SQL support.
Python Version: 2.5 and later

The sqlite3 module provides a DB-API 2.0 compliant interface to SQLite, an in-process relational database. SQLite is designed to be embedded in applications, instead of using a separate database server program, such as MySQL, PostgreSQL, or Oracle. It is fast, rigorously tested, and flexible, making it suitable for prototyping and production deployment for some applications.

7.5.1 Creating a Database

An SQLite database is stored as a single file on the file system. The library manages access to the file, including locking it to prevent corruption when multiple writers use it. The database is created the first time the file is accessed, but the application is responsible for managing the table definitions, or schema, within the database.

This example looks for the database file before opening it with connect() so it knows when to create the schema for new databases.

    import os
    import sqlite3

    db_filename = 'todo.db'

    db_is_new = not os.path.exists(db_filename)

    conn = sqlite3.connect(db_filename)

    if db_is_new:
        print 'Need to create schema'
    else:
        print 'Database exists, assume schema does, too.'

    conn.close()

Running the script twice shows that it creates the empty file if it does not exist.

    $ ls *.db
    ls: *.db: No such file or directory

    $ python sqlite3_createdb.py
    Need to create schema

    $ ls *.db
    todo.db

    $ python sqlite3_createdb.py
    Database exists, assume schema does, too.

After creating the new database file, the next step is to create the schema to define the tables within the database. The remaining examples in this section all use the same database schema with tables for managing tasks. The details of the database schema are presented in Table 7.1 and Table 7.2.

Table 7.1. The "project" Table

    Column        Type     Description
    name          text     Project name
    description   text     Long project description
    deadline      date     Due date for the entire project

Table 7.2. The "task" Table

    Column        Type     Description
    id            number   Unique task identifier
    priority      integer  Numerical priority; lower is more important
    details       text     Full task details
    status        text     Task status (one of new, pending, done, or canceled)
    deadline      date     Due date for this task
    completed_on  date     When the task was completed
    project       text     The name of the project for this task

These are the data definition language (DDL) statements to create the tables.

    -- Schema for to-do application examples.

    -- Projects are high-level activities made up of tasks
    create table project (
        name        text primary key,
        description text,
        deadline    date
    );

    -- Tasks are steps that can be taken to complete a project
    create table task (
        id           integer primary key autoincrement not null,
        priority     integer default 1,
        details      text,
        status       text,
        deadline     date,
        completed_on date,
        project      text not null references project(name)
    );

The executescript() method of the Connection can be used to run the DDL instructions to create the schema.
    import os
    import sqlite3

    db_filename = 'todo.db'
    schema_filename = 'todo_schema.sql'

    db_is_new = not os.path.exists(db_filename)

    with sqlite3.connect(db_filename) as conn:
        if db_is_new:
            print 'Creating schema'
            with open(schema_filename, 'rt') as f:
                schema = f.read()
            conn.executescript(schema)

            print 'Inserting initial data'

            conn.executescript("""
            insert into project (name, description, deadline)
            values ('pymotw', 'Python Module of the Week', '2010-11-01');

            insert into task (details, status, deadline, project)
            values ('write about select', 'done', '2010-10-03',
                    'pymotw');

            insert into task (details, status, deadline, project)
            values ('write about random', 'waiting', '2010-10-10',
                    'pymotw');

            insert into task (details, status, deadline, project)
            values ('write about sqlite3', 'active', '2010-10-17',
                    'pymotw');
            """)
        else:
            print 'Database exists, assume schema does, too.'

After the tables are created, a few insert statements create a sample project and related tasks. The sqlite3 command line program can be used to examine the contents of the database.

    $ python sqlite3_create_schema.py

    Creating schema
    Inserting initial data

    $ sqlite3 todo.db 'select * from task'

    1|1|write about select|done|2010-10-03||pymotw
    2|1|write about random|waiting|2010-10-10||pymotw
    3|1|write about sqlite3|active|2010-10-17||pymotw

7.5.2 Retrieving Data

To retrieve the values saved in the task table from within a Python program, create a cursor from a database connection. A cursor produces a consistent view of the data and is the primary means of interacting with a transactional database system like SQLite.
    import sqlite3

    db_filename = 'todo.db'

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()

        cursor.execute("""
        select id, priority, details, status, deadline from task
        where project = 'pymotw'
        """)

        for row in cursor.fetchall():
            task_id, priority, details, status, deadline = row
            print '%2d {%d} %-20s [%-8s] (%s)' % \
                (task_id, priority, details, status, deadline)

Querying is a two-step process. First, run the query with the cursor's execute() method to tell the database engine what data to collect. Then, use fetchall() to retrieve the results. The return value is a sequence of tuples containing the values for the columns included in the select clause of the query.

    $ python sqlite3_select_tasks.py

     1 {1} write about select   [done    ] (2010-10-03)
     2 {1} write about random   [waiting ] (2010-10-10)
     3 {1} write about sqlite3  [active  ] (2010-10-17)

The results can be retrieved one at a time with fetchone() or in fixed-size batches with fetchmany().

    import sqlite3

    db_filename = 'todo.db'

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()

        cursor.execute("""
        select name, description, deadline from project
        where name = 'pymotw'
        """)
        name, description, deadline = cursor.fetchone()

        print 'Project details for %s (%s) due %s' % \
            (description, name, deadline)

        cursor.execute("""
        select id, priority, details, status, deadline from task
        where project = 'pymotw' order by deadline
        """)

        print '\nNext 5 tasks:'
        for row in cursor.fetchmany(5):
            task_id, priority, details, status, deadline = row
            print '%2d {%d} %-25s [%-8s] (%s)' % \
                (task_id, priority, details, status, deadline)

The value passed to fetchmany() is the maximum number of items to return. If fewer items are available, the sequence returned will be smaller than the maximum value.
    $ python sqlite3_select_variations.py

    Project details for Python Module of the Week (pymotw) due 2010-11-01

    Next 5 tasks:
     1 {1} write about select        [done    ] (2010-10-03)
     2 {1} write about random        [waiting ] (2010-10-10)
     3 {1} write about sqlite3       [active  ] (2010-10-17)

7.5.3 Query Metadata

The DB-API 2.0 specification says that after execute() has been called, the cursor should set its description attribute to hold information about the data that will be returned by the fetch methods. The API specifications say that the description value is a sequence of tuples containing the column name, type, display size, internal size, precision, scale, and a flag that says whether null values are accepted.

    import sqlite3

    db_filename = 'todo.db'

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()

        cursor.execute("""
        select * from task where project = 'pymotw'
        """)

        print 'Task table has these columns:'
        for colinfo in cursor.description:
            print colinfo

Because sqlite3 does not enforce type or size constraints on data inserted into a database, only the column name value is filled in.

    $ python sqlite3_cursor_description.py

    Task table has these columns:
    ('id', None, None, None, None, None, None)
    ('priority', None, None, None, None, None, None)
    ('details', None, None, None, None, None, None)
    ('status', None, None, None, None, None, None)
    ('deadline', None, None, None, None, None, None)
    ('completed_on', None, None, None, None, None, None)
    ('project', None, None, None, None, None, None)

7.5.4 Row Objects

By default, the values returned by the fetch methods as “rows” from the database are tuples. The caller is responsible for knowing the order of the columns in the query and extracting individual values from the tuple.
When the number of values in a query grows, or the code working with the data is spread out in a library, it is usually easier to work with an object and access values using their column names. That way, the number and order of the tuple contents can change over time as the query is edited, and code depending on the query results is less likely to break.

Connection objects have a row_factory property that allows the calling code to control the type of object created to represent each row in the query result set. sqlite3 also includes a Row class intended to be used as a row factory. Column values can be accessed through Row instances by using the column index or name.

    import sqlite3

    db_filename = 'todo.db'

    with sqlite3.connect(db_filename) as conn:
        # Change the row factory to use Row
        conn.row_factory = sqlite3.Row

        cursor = conn.cursor()

        cursor.execute("""
        select name, description, deadline from project
        where name = 'pymotw'
        """)
        name, description, deadline = cursor.fetchone()

        print 'Project details for %s (%s) due %s' % (
            description, name, deadline)

        cursor.execute("""
        select id, priority, status, deadline, details from task
        where project = 'pymotw' order by deadline
        """)

        print '\nNext 5 tasks:'
        for row in cursor.fetchmany(5):
            print '%2d {%d} %-25s [%-8s] (%s)' % (
                row['id'], row['priority'], row['details'],
                row['status'], row['deadline'],
                )

This version of the sqlite3_select_variations.py example has been rewritten using Row instances instead of tuples. The row from the project table is still printed by accessing the column values through position, but the print statement for tasks uses keyword lookup instead, so it does not matter that the order of the columns in the query has been changed.
    $ python sqlite3_row_factory.py

    Project details for Python Module of the Week (pymotw) due 2010-11-01

    Next 5 tasks:
     1 {1} write about select        [done    ] (2010-10-03)
     2 {1} write about random        [waiting ] (2010-10-10)
     3 {1} write about sqlite3       [active  ] (2010-10-17)

7.5.5 Using Variables with Queries

Using queries defined as literal strings embedded in a program is inflexible. For example, when another project is added to the database, the query to show the top five tasks should be updated to work with either project. One way to add more flexibility is to build an SQL statement with the desired query by combining values in Python. However, building a query string in this way is dangerous and should be avoided. Failing to correctly escape special characters in the variable parts of the query can result in SQL parsing errors, or worse, a class of security vulnerabilities known as SQL-injection attacks, which allow intruders to execute arbitrary SQL statements in the database.

The proper way to use dynamic values with queries is through host variables passed to execute() along with the SQL instruction. A placeholder value in the SQL statement is replaced with the value of the host variable when the statement is executed. Using host variables instead of inserting arbitrary values into the SQL statement before it is parsed avoids injection attacks because there is no chance that the untrusted values will affect how the SQL statement is parsed. SQLite supports two forms for queries with placeholders, positional and named.

Positional Parameters

A question mark (?) denotes a positional argument, passed to execute() as a member of a tuple.

    import sqlite3
    import sys

    db_filename = 'todo.db'
    project_name = sys.argv[1]

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()

        query = """select id, priority, details, status, deadline from task
        where project = ?
        """

        cursor.execute(query, (project_name,))

        for row in cursor.fetchall():
            task_id, priority, details, status, deadline = row
            print '%2d {%d} %-20s [%-8s] (%s)' % (
                task_id, priority, details, status, deadline)

The command line argument is passed safely to the query as a positional argument, and there is no chance for bad data to corrupt the database.

    $ python sqlite3_argument_positional.py pymotw

     1 {1} write about select   [done    ] (2010-10-03)
     2 {1} write about random   [waiting ] (2010-10-10)
     3 {1} write about sqlite3  [active  ] (2010-10-17)

Named Parameters

Use named parameters for more complex queries with a lot of parameters, or where some parameters are repeated multiple times within the query. Named parameters are prefixed with a colon (e.g., :param_name).

    import sqlite3
    import sys

    db_filename = 'todo.db'
    project_name = sys.argv[1]

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()

        query = """select id, priority, details, status, deadline from task
        where project = :project_name
        order by deadline, priority
        """

        cursor.execute(query, {'project_name':project_name})

        for row in cursor.fetchall():
            task_id, priority, details, status, deadline = row
            print '%2d {%d} %-25s [%-8s] (%s)' % (
                task_id, priority, details, status, deadline)

Neither positional nor named parameters need to be quoted or escaped, since they are given special treatment by the query parser.

    $ python sqlite3_argument_named.py pymotw

     1 {1} write about select        [done    ] (2010-10-03)
     2 {1} write about random        [waiting ] (2010-10-10)
     3 {1} write about sqlite3       [active  ] (2010-10-17)

Query parameters can be used with select, insert, and update statements. They can appear in any part of the query where a literal value is legal.
    import sqlite3
    import sys

    db_filename = 'todo.db'
    id = int(sys.argv[1])
    status = sys.argv[2]

    with sqlite3.connect(db_filename) as conn:
        cursor = conn.cursor()
        query = "update task set status = :status where id = :id"
        cursor.execute(query, {'status': status, 'id': id})

This update statement uses two named parameters. The id value is used to find the right row to modify, and the status value is written to the table.

    $ python sqlite3_argument_update.py 2 done
    $ python sqlite3_argument_named.py pymotw

     1 {1} write about select        [done    ] (2010-10-03)
     2 {1} write about random        [done    ] (2010-10-10)
     3 {1} write about sqlite3       [active  ] (2010-10-17)

7.5.6 Bulk Loading

To apply the same SQL instruction to a large set of data, use executemany(). This is useful for loading data, since it avoids looping over the inputs in Python and lets the underlying library apply loop optimizations. This example program reads a list of tasks from a comma-separated value file using the csv module and loads them into the database.

    import csv
    import sqlite3
    import sys

    db_filename = 'todo.db'
    data_filename = sys.argv[1]

    SQL = """
    insert into task (details, priority, status, deadline, project)
    values (:details, :priority, 'active', :deadline, :project)
    """

    with open(data_filename, 'rt') as csv_file:
        csv_reader = csv.DictReader(csv_file)

        with sqlite3.connect(db_filename) as conn:
            cursor = conn.cursor()
            cursor.executemany(SQL, csv_reader)

The sample data file tasks.csv contains:

    deadline,project,priority,details
    2010-10-02,pymotw,2,"finish reviewing markup"
    2010-10-03,pymotw,2,"revise chapter intros"
    2010-10-03,pymotw,1,"subtitle"

Running the program produces:

    $ python sqlite3_load_csv.py tasks.csv
    $ python sqlite3_argument_named.py pymotw

     4 {2} finish reviewing markup   [active  ] (2010-10-02)
     1 {1} write about select        [done    ] (2010-10-03)
     6 {1} subtitle                  [active  ] (2010-10-03)
     5 {2} revise chapter intros     [active  ] (2010-10-03)
     2 {1} write about random        [done    ] (2010-10-10)
     3 {1} write about sqlite3       [active  ] (2010-10-17)

7.5.7 Defining New Column Types

SQLite has native support for integer, floating point, and text columns. Data of these types is converted automatically by sqlite3 from Python's representation to a value that can be stored in the database, and back again, as needed. Integer values are loaded from the database into int or long variables, depending on the size of the value. Text is saved and retrieved as unicode, unless the text_factory for the Connection has been changed.

Although SQLite only supports a few data types internally, sqlite3 includes facilities for defining custom types to allow a Python application to store any type of data in a column. Conversion for types beyond those supported by default is enabled in the database connection using the detect_types flag. Use PARSE_DECLTYPES if the column was declared using the desired type when the table was defined.
    import sqlite3
    import sys

    db_filename = 'todo.db'

    sql = "select id, details, deadline from task"

    def show_deadline(conn):
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        cursor.execute(sql)
        row = cursor.fetchone()
        for col in ['id', 'details', 'deadline']:
            print ' %-8s %-30r %s' % (col, row[col], type(row[col]))
        return

    print 'Without type detection:'

    with sqlite3.connect(db_filename) as conn:
        show_deadline(conn)

    print '\nWith type detection:'

    with sqlite3.connect(db_filename,
                         detect_types=sqlite3.PARSE_DECLTYPES,
                         ) as conn:
        show_deadline(conn)

sqlite3 provides converters for date and timestamp columns, using the classes date and datetime from the datetime module to represent the values in Python. Both date-related converters are enabled automatically when type detection is turned on.

    $ python sqlite3_date_types.py

    Without type detection:
     id       1                              <type 'int'>
     details  u'write about select'          <type 'unicode'>
     deadline u'2010-10-03'                  <type 'unicode'>

    With type detection:
     id       1                              <type 'int'>
     details  u'write about select'          <type 'unicode'>
     deadline datetime.date(2010, 10, 3)     <type 'datetime.date'>

Two functions need to be registered to define a new type. The adapter takes the Python object as input and returns a byte string that can be stored in the database. The converter receives the string from the database and returns a Python object. Use register_adapter() to define an adapter function, and register_converter() for a converter function.

    import sqlite3
    try:
        import cPickle as pickle
    except ImportError:
        import pickle

    db_filename = 'todo.db'

    def adapter_func(obj):
        """Convert from in-memory to storage representation.
        """
        print 'adapter_func(%s)\n' % obj
        return pickle.dumps(obj)

    def converter_func(data):
        """Convert from storage to in-memory representation.
        """
        print 'converter_func(%r)\n' % data
        return pickle.loads(data)

    class MyObj(object):
        def __init__(self, arg):
            self.arg = arg
        def __str__(self):
            return 'MyObj(%r)' % self.arg

    # Register the functions for manipulating the type.
    sqlite3.register_adapter(MyObj, adapter_func)
    sqlite3.register_converter("MyObj", converter_func)

    # Create some objects to save. Use a list of tuples so
    # the sequence can be passed directly to executemany().
    to_save = [ (MyObj('this is a value to save'),),
                (MyObj(42),),
                ]

    with sqlite3.connect(db_filename,
                         detect_types=sqlite3.PARSE_DECLTYPES) as conn:
        # Create a table with column of type "MyObj"
        conn.execute("""
        create table if not exists obj (
            id    integer primary key autoincrement not null,
            data  MyObj
        )
        """)
        cursor = conn.cursor()

        # Insert the objects into the database
        cursor.executemany("insert into obj (data) values (?)", to_save)

        # Query the database for the objects just saved
        cursor.execute("select id, data from obj")
        for obj_id, obj in cursor.fetchall():
            print 'Retrieved', obj_id, obj, type(obj)
            print

This example uses pickle to save an object to a string that can be stored in the database, a useful technique for storing arbitrary objects, but one that does not allow querying based on object attributes. A real object-relational mapper, such as SQLAlchemy, that stores attribute values in separate columns will be more useful for large amounts of data.

    $ python sqlite3_custom_type.py

    adapter_func(MyObj('this is a value to save'))

    adapter_func(MyObj(42))

    converter_func("ccopy_reg\n_reconstructor\np1\n(c__main__\nMyObj\np2
    \nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'arg'\np6\nS'this is a val
    ue to save'\np7\nsb.")

    converter_func("ccopy_reg\n_reconstructor\np1\n(c__main__\nMyObj\np2
    \nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'arg'\np6\nI42\nsb.")

    Retrieved 1 MyObj('this is a value to save') <class '__main__.MyObj'>

    Retrieved 2 MyObj(42) <class '__main__.MyObj'>

7.5.8 Determining Types for Columns

There are two sources for type information about the values returned by a query. The original table declaration can be used to identify the type of a real column, as shown earlier.
A type specifier can also be included in the select clause of the query itself using the form as "name [type]".

    import sqlite3
    try:
        import cPickle as pickle
    except ImportError:
        import pickle

    db_filename = 'todo.db'

    def adapter_func(obj):
        """Convert from in-memory to storage representation.
        """
        print 'adapter_func(%s)\n' % obj
        return pickle.dumps(obj)

    def converter_func(data):
        """Convert from storage to in-memory representation.
        """
        print 'converter_func(%r)\n' % data
        return pickle.loads(data)

    class MyObj(object):
        def __init__(self, arg):
            self.arg = arg
        def __str__(self):
            return 'MyObj(%r)' % self.arg

    # Register the functions for manipulating the type.
    sqlite3.register_adapter(MyObj, adapter_func)
    sqlite3.register_converter("MyObj", converter_func)

    # Create some objects to save. Use a list of tuples so we can pass
    # this sequence directly to executemany().
    to_save = [ (MyObj('this is a value to save'),),
                (MyObj(42),),
                ]

    with sqlite3.connect(db_filename,
                         detect_types=sqlite3.PARSE_COLNAMES) as conn:
        # Create a table with column of type "text"
        conn.execute("""
        create table if not exists obj2 (
            id    integer primary key autoincrement not null,
            data  text
        )
        """)
        cursor = conn.cursor()

        # Insert the objects into the database
        cursor.executemany("insert into obj2 (data) values (?)", to_save)

        # Query the database for the objects just saved,
        # using a type specifier to convert the text
        # to objects.
        cursor.execute('select id, data as "pickle [MyObj]" from obj2')
        for obj_id, obj in cursor.fetchall():
            print 'Retrieved', obj_id, obj, type(obj)
            print

Use the detect_types flag PARSE_COLNAMES when the type is part of the query instead of the original table definition.
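The same column-name mechanism works with any registered converter, not only pickle. The following Python 3 sketch is an illustration under stated assumptions: the "json" converter name, the settings table, and the in-memory database are all hypothetical and do not appear in the book's listings.

```python
import json
import sqlite3

# Under Python 3 a converter receives the stored value as bytes.
# "json" is a hypothetical type name chosen for this sketch.
sqlite3.register_converter(
    "json", lambda raw: json.loads(raw.decode("utf-8"))
)

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_COLNAMES)
conn.execute("create table settings (id integer primary key, data text)")
conn.execute(
    "insert into settings (data) values (?)",
    (json.dumps({"theme": "dark", "size": 12}),),
)

# Without a type specifier the column comes back as plain text.
plain = conn.execute("select data from settings").fetchall()[0][0]

# With 'as "name [type]"' the registered converter is applied.
converted = conn.execute(
    'select data as "data [json]" from settings'
).fetchall()[0][0]

conn.close()
```

Because the table column is declared as ordinary text, the decision to convert is made query by query, which is the main difference from PARSE_DECLTYPES.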
    $ python sqlite3_custom_type_column.py

    adapter_func(MyObj('this is a value to save'))

    adapter_func(MyObj(42))

    converter_func("ccopy_reg\n_reconstructor\np1\n(c__main__\nMyObj\np2
    \nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'arg'\np6\nS'this is a val
    ue to save'\np7\nsb.")

    converter_func("ccopy_reg\n_reconstructor\np1\n(c__main__\nMyObj\np2
    \nc__builtin__\nobject\np3\nNtRp4\n(dp5\nS'arg'\np6\nI42\nsb.")

    Retrieved 1 MyObj('this is a value to save') <class '__main__.MyObj'>

    Retrieved 2 MyObj(42) <class '__main__.MyObj'>

7.5.9 Transactions

One of the key features of relational databases is the use of transactions to maintain a consistent internal state. With transactions enabled, several changes can be made through one connection without affecting any other users until the results are committed and flushed to the actual database.

Preserving Changes

Changes to the database, either through insert or update statements, need to be saved by explicitly calling commit(). This requirement gives an application an opportunity to make several related changes together, so they are stored atomically instead of incrementally, and avoids a situation where partial updates are seen by different clients connecting to the database simultaneously.

The effect of calling commit() can be seen with a program that uses several connections to the database. A new row is inserted with the first connection, and then two attempts are made to read it back using separate connections.
    import sqlite3

    db_filename = 'todo.db'

    def show_projects(conn):
        cursor = conn.cursor()
        cursor.execute('select name, description from project')
        for name, desc in cursor.fetchall():
            print '', name
        return

    with sqlite3.connect(db_filename) as conn1:

        print 'Before changes:'
        show_projects(conn1)

        # Insert in one cursor
        cursor1 = conn1.cursor()
        cursor1.execute("""
        insert into project (name, description, deadline)
        values ('virtualenvwrapper', 'Virtualenv Extensions',
                '2011-01-01')
        """)

        print '\nAfter changes in conn1:'
        show_projects(conn1)

        # Select from another connection, without committing first
        print '\nBefore commit:'
        with sqlite3.connect(db_filename) as conn2:
            show_projects(conn2)

        # Commit then select from another connection
        conn1.commit()
        print '\nAfter commit:'
        with sqlite3.connect(db_filename) as conn3:
            show_projects(conn3)

When show_projects() is called before conn1 has been committed, the results depend on which connection is used. Since the change was made through conn1, it sees the altered data. However, conn2 does not. After committing, the new connection conn3 sees the inserted row.

    $ python sqlite3_transaction_commit.py

    Before changes:
     pymotw

    After changes in conn1:
     pymotw
     virtualenvwrapper

    Before commit:
     pymotw

    After commit:
     pymotw
     virtualenvwrapper

Discarding Changes

Uncommitted changes can also be discarded entirely using rollback(). The commit() and rollback() methods are usually called from different parts of the same try:except block, with errors triggering a rollback.

    import sqlite3

    db_filename = 'todo.db'

    def show_projects(conn):
        cursor = conn.cursor()
        cursor.execute('select name, description from project')
        for name, desc in cursor.fetchall():
            print '', name
        return

    with sqlite3.connect(db_filename) as conn:

        print 'Before changes:'
        show_projects(conn)

        try:

            # Delete
            cursor = conn.cursor()
            cursor.execute("""delete from project
                           where name = 'virtualenvwrapper'
                           """)

            # Show the settings
            print '\nAfter delete:'
            show_projects(conn)

            # Pretend the processing caused an error
            raise RuntimeError('simulated error')

        except Exception as err:
            # Discard the changes
            print 'ERROR:', err
            conn.rollback()

        else:
            # Save the changes
            conn.commit()

        # Show the results
        print '\nAfter rollback:'
        show_projects(conn)

After calling rollback(), the changes to the database are no longer present.

    $ python sqlite3_transaction_rollback.py

    Before changes:
     pymotw
     virtualenvwrapper

    After delete:
     pymotw
    ERROR: simulated error

    After rollback:
     pymotw
     virtualenvwrapper

7.5.10 Isolation Levels

sqlite3 supports three locking modes, called isolation levels, that control the technique used to prevent incompatible changes between connections. The isolation level is set by passing a string as the isolation_level argument when a connection is opened, so different connections can use different values.

This program demonstrates the effect of different isolation levels on the order of events in threads using separate connections to the same database. Four threads are created. Two threads write changes to the database by updating existing rows. The other two threads attempt to read all the rows from the task table.
    import logging
    import sqlite3
    import sys
    import threading
    import time

    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s (%(threadName)-10s) %(message)s',
        )

    db_filename = 'todo.db'
    isolation_level = sys.argv[1]

    def writer():
        my_name = threading.currentThread().name
        with sqlite3.connect(db_filename,
                             isolation_level=isolation_level) as conn:
            cursor = conn.cursor()
            cursor.execute('update task set priority = priority + 1')
            logging.debug('waiting to synchronize')
            ready.wait() # synchronize threads
            logging.debug('PAUSING')
            time.sleep(1)
            conn.commit()
            logging.debug('CHANGES COMMITTED')
        return

    def reader():
        my_name = threading.currentThread().name
        with sqlite3.connect(db_filename,
                             isolation_level=isolation_level) as conn:
            cursor = conn.cursor()
            logging.debug('waiting to synchronize')
            ready.wait() # synchronize threads
            logging.debug('wait over')
            cursor.execute('select * from task')
            logging.debug('SELECT EXECUTED')
            results = cursor.fetchall()
            logging.debug('results fetched')
        return

    if __name__ == '__main__':
        ready = threading.Event()

        threads = [
            threading.Thread(name='Reader 1', target=reader),
            threading.Thread(name='Reader 2', target=reader),
            threading.Thread(name='Writer 1', target=writer),
            threading.Thread(name='Writer 2', target=writer),
            ]

        for t in threads:
            t.start()

        time.sleep(1)
        logging.debug('setting ready')
        ready.set()

        for t in threads:
            t.join()

The threads are synchronized using an Event from the threading module. The writer() function connects and makes changes to the database, but does not commit before the event fires. The reader() function connects, and then waits to query the database until after the synchronization event occurs.

Deferred

The default isolation level is DEFERRED. Using deferred mode locks the database, but only once a change is begun. All the previous examples use deferred mode.
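The point at which a connection's transaction actually begins can be observed directly. This is a small Python 3 sketch, not part of the book's program; it assumes a hypothetical in-memory task table and relies on the in_transaction attribute of Connection, which reports whether a transaction is currently open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # default isolation level
conn.execute("create table task (id integer primary key, details text)")
conn.commit()

# Nothing has been changed yet, so no transaction is open.
before = conn.in_transaction

# The first data-changing statement opens the transaction implicitly.
conn.execute(
    "insert into task (details) values (?)", ("write about sqlite3",)
)
during = conn.in_transaction

# Committing closes it again.
conn.commit()
after = conn.in_transaction

conn.close()
```

Here before and after are False and during is True, which matches the description of the lock being taken only once a change is begun.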
    $ python sqlite3_isolation_levels.py DEFERRED

    2010-12-04 09:06:51,793 (Reader 1  ) waiting to synchronize
    2010-12-04 09:06:51,794 (Reader 2  ) waiting to synchronize
    2010-12-04 09:06:51,795 (Writer 1  ) waiting to synchronize
    2010-12-04 09:06:52,796 (MainThread) setting ready
    2010-12-04 09:06:52,797 (Writer 1  ) PAUSING
    2010-12-04 09:06:52,797 (Reader 1  ) wait over
    2010-12-04 09:06:52,798 (Reader 1  ) SELECT EXECUTED
    2010-12-04 09:06:52,798 (Reader 1  ) results fetched
    2010-12-04 09:06:52,799 (Reader 2  ) wait over
    2010-12-04 09:06:52,800 (Reader 2  ) SELECT EXECUTED
    2010-12-04 09:06:52,800 (Reader 2  ) results fetched
    2010-12-04 09:06:53,799 (Writer 1  ) CHANGES COMMITTED
    2010-12-04 09:06:53,829 (Writer 2  ) waiting to synchronize
    2010-12-04 09:06:53,829 (Writer 2  ) PAUSING
    2010-12-04 09:06:54,832 (Writer 2  ) CHANGES COMMITTED

Immediate

Immediate mode locks the database as soon as a change starts and prevents other cursors from making changes until the transaction is committed. It is suitable for a database with complicated writes, but more readers than writers, since the readers are not blocked while the transaction is ongoing.
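The reader and writer behavior described for immediate mode can be reproduced without threads. The sketch below is an assumption-laden illustration in Python 3, not the book's program: it uses a hypothetical scratch file in a temporary directory, and passes timeout=0 so that a blocked statement fails at once instead of waiting for the lock.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "sketch.db")  # hypothetical scratch file

    writer = sqlite3.connect(db_path, isolation_level="IMMEDIATE", timeout=0)
    other = sqlite3.connect(db_path, timeout=0)

    writer.execute(
        "create table task (id integer primary key, priority integer)"
    )
    writer.execute("insert into task (priority) values (1)")
    writer.commit()

    # The update starts an immediate transaction, taking the write lock.
    writer.execute("update task set priority = priority + 1")

    # A read through the other connection still succeeds and sees
    # the last committed value.
    seen_by_reader = other.execute(
        "select priority from task"
    ).fetchall()[0][0]

    # A competing write fails right away because timeout=0 disables waiting.
    try:
        other.execute("update task set priority = 10")
        second_write = "succeeded"
    except sqlite3.OperationalError as err:
        second_write = str(err)

    other.rollback()
    writer.commit()
    final_value = other.execute(
        "select priority from task"
    ).fetchall()[0][0]

    writer.close()
    other.close()
```

With a nonzero timeout, which is the default, the second writer would wait for the lock instead of failing, which is what the Writer 2 thread does in the log output.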
    $ python sqlite3_isolation_levels.py IMMEDIATE

    2010-12-04 09:06:54,914 (Reader 1  ) waiting to synchronize
    2010-12-04 09:06:54,915 (Reader 2  ) waiting to synchronize
    2010-12-04 09:06:54,916 (Writer 1  ) waiting to synchronize
    2010-12-04 09:06:55,917 (MainThread) setting ready
    2010-12-04 09:06:55,918 (Reader 1  ) wait over
    2010-12-04 09:06:55,919 (Reader 2  ) wait over
    2010-12-04 09:06:55,919 (Writer 1  ) PAUSING
    2010-12-04 09:06:55,919 (Reader 1  ) SELECT EXECUTED
    2010-12-04 09:06:55,919 (Reader 1  ) results fetched
    2010-12-04 09:06:55,920 (Reader 2  ) SELECT EXECUTED
    2010-12-04 09:06:55,920 (Reader 2  ) results fetched
    2010-12-04 09:06:56,922 (Writer 1  ) CHANGES COMMITTED
    2010-12-04 09:06:56,951 (Writer 2  ) waiting to synchronize
    2010-12-04 09:06:56,951 (Writer 2  ) PAUSING
    2010-12-04 09:06:57,953 (Writer 2  ) CHANGES COMMITTED

Exclusive

Exclusive mode locks the database to all readers and writers. Its use should be limited in situations where database performance is important, since each exclusive connection blocks all other users.

    $ python sqlite3_isolation_levels.py EXCLUSIVE

    2010-12-04 09:06:58,042 (Reader 1  ) waiting to synchronize
    2010-12-04 09:06:58,043 (Reader 2  ) waiting to synchronize
    2010-12-04 09:06:58,044 (Writer 1  ) waiting to synchronize
    2010-12-04 09:06:59,045 (MainThread) setting ready
    2010-12-04 09:06:59,045 (Writer 1  ) PAUSING
    2010-12-04 09:06:59,046 (Reader 2  ) wait over
    2010-12-04 09:06:59,045 (Reader 1  ) wait over
    2010-12-04 09:07:00,048 (Writer 1  ) CHANGES COMMITTED
    2010-12-04 09:07:00,076 (Reader 1  ) SELECT EXECUTED
    2010-12-04 09:07:00,076 (Reader 1  ) results fetched
    2010-12-04 09:07:00,079 (Reader 2  ) SELECT EXECUTED
    2010-12-04 09:07:00,079 (Reader 2  ) results fetched
    2010-12-04 09:07:00,090 (Writer 2  ) waiting to synchronize
    2010-12-04 09:07:00,090 (Writer 2  ) PAUSING
    2010-12-04 09:07:01,093 (Writer 2  ) CHANGES COMMITTED

Because the first writer has started making changes, the readers and second writer block until it commits. The sleep() call introduces an artificial delay in the writer thread to highlight the fact that the other connections are blocking.

Autocommit

The isolation_level parameter for the connection can also be set to None to enable autocommit mode. With autocommit enabled, each execute() call is committed immediately when the statement finishes. Autocommit mode is suited for short transactions, such as those that insert a small amount of data into a single table. The database is locked for as little time as possible, so there is less chance of contention between threads.

In sqlite3_autocommit.py, the explicit call to commit() has been removed and the isolation level is set to None, but otherwise, it is the same as sqlite3_isolation_levels.py. The output is different, however, since both writer threads finish their work before either reader starts querying.
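The visibility effect of autocommit mode can also be checked with two connections and no threads. This Python 3 sketch is illustrative only: the scratch database file and the single-column project table are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "sketch.db")  # hypothetical scratch file

    # isolation_level=None puts this connection in autocommit mode.
    auto = sqlite3.connect(db_path, isolation_level=None)
    observer = sqlite3.connect(db_path)

    auto.execute("create table project (name text primary key)")
    auto.execute("insert into project (name) values ('pymotw')")
    # No commit() call is made on the writing connection.

    # The second connection sees the row immediately.
    visible = observer.execute("select name from project").fetchall()

    auto.close()
    observer.close()
```

With the default isolation level, the same insert would stay invisible to the observer until commit() was called, as in sqlite3_transaction_commit.py.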
    $ python sqlite3_autocommit.py

    2010-12-04 09:07:01,176 (Reader 1  ) waiting to synchronize
    2010-12-04 09:07:01,177 (Reader 2  ) waiting to synchronize
    2010-12-04 09:07:01,181 (Writer 1  ) waiting to synchronize
    2010-12-04 09:07:01,184 (Writer 2  ) waiting to synchronize
    2010-12-04 09:07:02,180 (MainThread) setting ready
    2010-12-04 09:07:02,181 (Writer 1  ) PAUSING
    2010-12-04 09:07:02,181 (Reader 1  ) wait over
    2010-12-04 09:07:02,182 (Reader 1  ) SELECT EXECUTED
    2010-12-04 09:07:02,182 (Reader 1  ) results fetched
    2010-12-04 09:07:02,183 (Reader 2  ) wait over
    2010-12-04 09:07:02,183 (Reader 2  ) SELECT EXECUTED
    2010-12-04 09:07:02,184 (Reader 2  ) results fetched
    2010-12-04 09:07:02,184 (Writer 2  ) PAUSING

7.5.11 In-Memory Databases

SQLite supports managing an entire database in RAM, instead of relying on a disk file. In-memory databases are useful for automated testing, when the database does not need to be preserved between test runs, or when experimenting with a schema or other database features. To open an in-memory database, use the string ':memory:' instead of a filename when creating the Connection. Each ':memory:' connection creates a separate database instance, so changes made by a cursor in one do not affect other connections.

7.5.12 Exporting the Contents of a Database

The contents of an in-memory database can be saved using the iterdump() method of the Connection. The iterator returned by iterdump() produces a series of strings that together build SQL instructions to recreate the state of the database.

    import sqlite3

    schema_filename = 'todo_schema.sql'

    with sqlite3.connect(':memory:') as conn:
        conn.row_factory = sqlite3.Row

        print 'Creating schema'
        with open(schema_filename, 'rt') as f:
            schema = f.read()
        conn.executescript(schema)

        print 'Inserting initial data'
        conn.execute("""
        insert into project (name, description, deadline)
        values ('pymotw', 'Python Module of the Week', '2010-11-01')
        """)
        data = [
            ('write about select', 'done', '2010-10-03', 'pymotw'),
            ('write about random', 'waiting', '2010-10-10', 'pymotw'),
            ('write about sqlite3', 'active', '2010-10-17', 'pymotw'),
            ]
        conn.executemany("""
        insert into task (details, status, deadline, project)
        values (?, ?, ?, ?)
        """, data)

        print 'Dumping:'
        for text in conn.iterdump():
            print text

iterdump() can also be used with databases saved to files, but it is most useful for preserving a database that would not otherwise be saved. This output has been edited to fit on the page while remaining syntactically correct.

    $ python sqlite3_iterdump.py

    Creating schema
    Inserting initial data
    Dumping:
    BEGIN TRANSACTION;
    CREATE TABLE project (
        name        text primary key,
        description text,
        deadline    date
    );
    INSERT INTO "project" VALUES('pymotw','Python Module of the
    Week','2010-11-01');
    CREATE TABLE task (
        id           integer primary key autoincrement not null,
        priority     integer default 1,
        details      text,
        status       text,
        deadline     date,
        completed_on date,
        project      text not null references project(name)
    );
    INSERT INTO "task" VALUES(1,1,'write about
    select','done','2010-10-03',NULL,'pymotw');
    INSERT INTO "task" VALUES(2,1,'write about
    random','waiting','2010-10-10',NULL,'pymotw');
    INSERT INTO "task" VALUES(3,1,'write about
    sqlite3','active','2010-10-17',NULL,'pymotw');
    DELETE FROM sqlite_sequence;
    INSERT INTO "sqlite_sequence" VALUES('task',3);
    COMMIT;

7.5.13 Using Python Functions in SQL

SQL syntax supports calling functions during queries, either in the column list or where clause of the select statement. This feature makes it possible to process data before returning it from the query and can be used to convert between different formats, perform calculations that would be clumsy in pure SQL, and reuse application code.
    import sqlite3

    db_filename = 'todo.db'

    def encrypt(s):
        print 'Encrypting %r' % s
        return s.encode('rot-13')

    def decrypt(s):
        print 'Decrypting %r' % s
        return s.encode('rot-13')

    with sqlite3.connect(db_filename) as conn:

        conn.create_function('encrypt', 1, encrypt)
        conn.create_function('decrypt', 1, decrypt)
        cursor = conn.cursor()

        # Raw values
        print 'Original values:'
        query = "select id, details from task"
        cursor.execute(query)
        for row in cursor.fetchall():
            print row

        print '\nEncrypting...'
        query = "update task set details = encrypt(details)"
        cursor.execute(query)

        print '\nRaw encrypted values:'
        query = "select id, details from task"
        cursor.execute(query)
        for row in cursor.fetchall():
            print row

        print '\nDecrypting in query...'
        query = "select id, decrypt(details) from task"
        cursor.execute(query)
        for row in cursor.fetchall():
            print row

Functions are exposed using the create_function() method of the Connection. The parameters are the name of the function (as it should be used from within SQL), the number of arguments the function takes, and the Python function to expose. Because ROT-13 is its own inverse, encrypt() and decrypt() apply the same encoding.

    $ python sqlite3_create_function.py

    Original values:
    (1, u'write about select')
    (2, u'write about random')
    (3, u'write about sqlite3')
    (4, u'finish reviewing markup')
    (5, u'revise chapter intros')
    (6, u'subtitle')

    Encrypting...
    Encrypting u'write about select'
    Encrypting u'write about random'
    Encrypting u'write about sqlite3'
    Encrypting u'finish reviewing markup'
    Encrypting u'revise chapter intros'
    Encrypting u'subtitle'

    Raw encrypted values:
    (1, u'jevgr nobhg fryrpg')
    (2, u'jevgr nobhg enaqbz')
    (3, u'jevgr nobhg fdyvgr3')
    (4, u'svavfu erivrjvat znexhc')
    (5, u'erivfr puncgre vagebf')
    (6, u'fhogvgyr')

    Decrypting in query...
    Decrypting u'jevgr nobhg fryrpg'
    Decrypting u'jevgr nobhg enaqbz'
    Decrypting u'jevgr nobhg fdyvgr3'
    Decrypting u'svavfu erivrjvat znexhc'
    Decrypting u'erivfr puncgre vagebf'
    Decrypting u'fhogvgyr'
    (1, u'write about select')
    (2, u'write about random')
    (3, u'write about sqlite3')
    (4, u'finish reviewing markup')
    (5, u'revise chapter intros')
    (6, u'subtitle')

7.5.14 Custom Aggregation

An aggregation function collects many pieces of individual data and summarizes them in some way. Examples of built-in aggregation functions are avg() (average), min(), max(), and count().

The API for aggregators used by sqlite3 is defined in terms of a class with two methods. The step() method is called once for each data value as the query is processed. The finalize() method is called one time at the end of the query and should return the aggregate value. This example implements an aggregator for the statistical mode. It returns the value that appears most frequently in the input.

    import sqlite3
    import collections

    db_filename = 'todo.db'

    class Mode(object):
        def __init__(self):
            self.counter = collections.Counter()
        def step(self, value):
            print 'step(%r)' % value
            self.counter[value] += 1
        def finalize(self):
            result, count = self.counter.most_common(1)[0]
            print 'finalize() -> %r (%d times)' % (result, count)
            return result

    with sqlite3.connect(db_filename) as conn:

        conn.create_aggregate('mode', 1, Mode)

        cursor = conn.cursor()
        cursor.execute("""
        select mode(deadline) from task where project = 'pymotw'
        """)
        row = cursor.fetchone()
        print 'mode(deadline) is:', row[0]

The aggregator class is registered with the create_aggregate() method of the Connection. The parameters are the name of the function (as it should be used from within SQL), the number of arguments the step() method takes, and the class to use.
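An aggregate class also works with group by, in which case a fresh instance is created for each group. The Python 3 sketch below is not from the book: the Span class, the span SQL name, and the in-memory task table are hypothetical names chosen for illustration.

```python
import sqlite3


class Span:
    """Hypothetical aggregate: largest input minus smallest input."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        # Called once per row; NULL values are skipped.
        if value is None:
            return
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        # Called once per group to produce the aggregate value.
        if self.low is None:
            return None
        return self.high - self.low


conn = sqlite3.connect(":memory:")
conn.create_aggregate("span", 1, Span)
conn.execute("create table task (priority integer, project text)")
conn.executemany(
    "insert into task values (?, ?)",
    [(1, "pymotw"), (4, "pymotw"), (2, "pymotw"), (9, "other")],
)

span_by_project = conn.execute(
    "select project, span(priority) from task "
    "group by project order by project"
).fetchall()

conn.close()
```

Keeping all state on the instance, as both Mode and this sketch do, is what lets each group be summarized independently.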
    $ python sqlite3_create_aggregate.py

    step(u'2010-10-03')
    step(u'2010-10-10')
    step(u'2010-10-17')
    step(u'2010-10-02')
    step(u'2010-10-03')
    step(u'2010-10-03')
    finalize() -> u'2010-10-03' (3 times)
    mode(deadline) is: 2010-10-03

7.5.15 Custom Sorting

A collation is a comparison function used in the order by section of an SQL query. Custom collations can be used to compare data types that could not otherwise be sorted by SQLite internally. For example, a custom collation would be needed to sort the pickled objects saved in sqlite3_custom_type.py.

    import sqlite3
    try:
        import cPickle as pickle
    except ImportError:
        import pickle

    db_filename = 'todo.db'

    def adapter_func(obj):
        return pickle.dumps(obj)

    def converter_func(data):
        return pickle.loads(data)

    class MyObj(object):
        def __init__(self, arg):
            self.arg = arg
        def __str__(self):
            return 'MyObj(%r)' % self.arg
        def __cmp__(self, other):
            return cmp(self.arg, other.arg)

    # Register the functions for manipulating the type.
    sqlite3.register_adapter(MyObj, adapter_func)
    sqlite3.register_converter("MyObj", converter_func)

    def collation_func(a, b):
        a_obj = converter_func(a)
        b_obj = converter_func(b)
        print 'collation_func(%s, %s)' % (a_obj, b_obj)
        return cmp(a_obj, b_obj)

    with sqlite3.connect(db_filename,
                         detect_types=sqlite3.PARSE_DECLTYPES,
                         ) as conn:
        # Define the collation
        conn.create_collation('unpickle', collation_func)

        # Clear the table and insert new values
        conn.execute('delete from obj')
        conn.executemany('insert into obj (data) values (?)',
                         [(MyObj(x),) for x in xrange(5, 0, -1)],
                         )

        # Query the database for the objects just saved
        print 'Querying:'
        cursor = conn.cursor()
        cursor.execute("""
        select id, data from obj order by data collate unpickle
        """)
        for obj_id, obj in cursor.fetchall():
            print obj_id, obj

The arguments to the collation function are byte strings, so they must be unpickled and converted to MyObj instances before the comparison can be performed.
    $ python sqlite3_create_collation.py

    Querying:
    collation_func(MyObj(5), MyObj(4))
    collation_func(MyObj(4), MyObj(3))
    collation_func(MyObj(4), MyObj(2))
    collation_func(MyObj(3), MyObj(2))
    collation_func(MyObj(3), MyObj(1))
    collation_func(MyObj(2), MyObj(1))
    7 MyObj(1)
    6 MyObj(2)
    5 MyObj(3)
    4 MyObj(4)
    3 MyObj(5)

7.5.16 Threading and Connection Sharing

For historical reasons having to do with old versions of SQLite, Connection objects cannot be shared between threads. Each thread must create its own connection to the database.

    import sqlite3
    import sys
    import threading
    import time

    db_filename = 'todo.db'
    isolation_level = None # autocommit mode

    def reader(conn):
        my_name = threading.currentThread().name
        print 'Starting thread'
        try:
            cursor = conn.cursor()
            cursor.execute('select * from task')
            results = cursor.fetchall()
            print 'results fetched'
        except Exception as err:
            print 'ERROR:', err
        return

    if __name__ == '__main__':

        with sqlite3.connect(db_filename,
                             isolation_level=isolation_level,
                             ) as conn:
            t = threading.Thread(name='Reader 1',
                                 target=reader,
                                 args=(conn,),
                                 )
            t.start()
            t.join()

Attempts to share a connection between threads result in an exception.

    $ python sqlite3_threading.py

    Starting thread
    ERROR: SQLite objects created in a thread can only be used in that
    same thread.The object was created in thread id 4299299872 and this
    is thread id 4311166976

7.5.17 Restricting Access to Data

Although SQLite does not have user access controls found in other, larger, relational databases, it does have a mechanism for limiting access to columns. Each connection can install an authorizer function to grant or deny access to columns at runtime based on any desired criteria. The authorizer function is invoked during the parsing of SQL statements and is passed five arguments. The first is an action code indicating the type of operation being performed (reading, writing, deleting, etc.). The rest of the arguments depend on the action code.
For SQLITE_READ operations, the arguments are the name of the table, the name of the column, the location in the SQL statement where the access is occurring (main query, trigger, etc.), and None.

    import sqlite3

    db_filename = 'todo.db'

    def authorizer_func(action, table, column, sql_location, ignore):
        print '\nauthorizer_func(%s, %s, %s, %s, %s)' % \
            (action, table, column, sql_location, ignore)

        response = sqlite3.SQLITE_OK  # be permissive by default

        if action == sqlite3.SQLITE_SELECT:
            print 'requesting permission to run a select statement'
            response = sqlite3.SQLITE_OK

        elif action == sqlite3.SQLITE_READ:
            print 'requesting access to column %s.%s from %s' % \
                (table, column, sql_location)
            if column == 'details':
                print '  ignoring details column'
                response = sqlite3.SQLITE_IGNORE
            elif column == 'priority':
                print '  preventing access to priority column'
                response = sqlite3.SQLITE_DENY

        return response

    with sqlite3.connect(db_filename) as conn:
        conn.row_factory = sqlite3.Row
        conn.set_authorizer(authorizer_func)

        print 'Using SQLITE_IGNORE to mask a column value:'
        cursor = conn.cursor()
        cursor.execute("""
        select id, details from task where project = 'pymotw'
        """)
        for row in cursor.fetchall():
            print row['id'], row['details']

        print '\nUsing SQLITE_DENY to deny access to a column:'
        cursor.execute("""
        select id, priority from task where project = 'pymotw'
        """)
        for row in cursor.fetchall():
            print row['id'], row['priority']

This example uses SQLITE_IGNORE to cause the strings from the task.details column to be replaced with null values in the query results. It also prevents all access to the task.priority column by returning SQLITE_DENY, which in turn causes SQLite to raise an exception.
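The difference between the two responses can also be checked without todo.db. This is a minimal sketch for Python 3, assuming a throwaway in-memory task table with made-up rows; the names authorizer, masked, and denied are hypothetical. It only inspects the column name passed for SQLITE_READ and ignores the remaining arguments.

```python
import sqlite3


def authorizer(action, arg1, arg2, db_name, source):
    # For SQLITE_READ, arg1 is the table name and arg2 is the column name.
    if action == sqlite3.SQLITE_READ:
        if arg2 == 'details':
            return sqlite3.SQLITE_IGNORE  # the column reads as NULL
        if arg2 == 'priority':
            return sqlite3.SQLITE_DENY    # the whole statement is rejected
    return sqlite3.SQLITE_OK              # be permissive otherwise


conn = sqlite3.connect(':memory:')
conn.execute('create table task (id integer, details text, priority integer)')
conn.execute("insert into task values (1, 'write about sqlite3', 2)")

# Install the authorizer only after the sample data is in place.
conn.set_authorizer(authorizer)

masked = conn.execute('select id, details from task').fetchall()

try:
    conn.execute('select id, priority from task')
    denied = False
except sqlite3.DatabaseError as err:
    denied = True
    print('ERROR:', err)

print(masked)
conn.close()
```

SQLITE_IGNORE lets the query run with the value masked, while SQLITE_DENY stops the statement while it is being prepared, so no rows are ever produced.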
    $ python sqlite3_set_authorizer.py

    Using SQLITE_IGNORE to mask a column value:

    authorizer_func(21, None, None, None, None)
    requesting permission to run a select statement

    authorizer_func(20, task, id, main, None)
    requesting access to column task.id from main

    authorizer_func(20, task, details, main, None)
    requesting access to column task.details from main
      ignoring details column

    authorizer_func(20, task, project, main, None)
    requesting access to column task.project from main
    1 None
    2 None
    3 None
    4 None
    5 None
    6 None

    Using SQLITE_DENY to deny access to a column:

    authorizer_func(21, None, None, None, None)
    requesting permission to run a select statement

    authorizer_func(20, task, id, main, None)
    requesting access to column task.id from main

    authorizer_func(20, task, priority, main, None)
    requesting access to column task.priority from main
      preventing access to priority column
    Traceback (most recent call last):
      File "sqlite3_set_authorizer.py", line 51, in <module>
        """)
    sqlite3.DatabaseError: access to task.priority is prohibited

The possible action codes are available as constants in sqlite3, with names prefixed SQLITE_. Each type of SQL statement can be flagged, and access to individual columns can be controlled as well.

See Also:

sqlite3 (http://docs.python.org/library/sqlite3.html) The standard library documentation for this module.
PEP 249 (www.python.org/dev/peps/pep-0249)—DB API 2.0 Specification A standard interface for modules that provide access to relational databases.
SQLite (www.sqlite.org/) The official site of the SQLite library.
shelve (page 343) Key-value store for saving arbitrary Python objects.
SQLAlchemy (http://sqlalchemy.org/) A popular object-relational mapper that supports SQLite among many other relational databases.

7.6 xml.etree.ElementTree—XML Manipulation API

Purpose: Generate and parse XML documents.
Python Version: 2.5 and later

The ElementTree library includes tools for parsing XML using event-based and document-based APIs, searching parsed documents with XPath expressions, and creating new or modifying existing documents.

Note: All examples in this section use the Python implementation of ElementTree for simplicity, but there is also a C implementation in xml.etree.cElementTree.

7.6.1 Parsing an XML Document

Parsed XML documents are represented in memory by ElementTree and Element objects connected in a tree structure based on the way the nodes in the XML document are nested.

Parsing an entire document with parse() returns an ElementTree instance. The tree knows about all data in the input document, and the nodes of the tree can be searched or manipulated in place. While this flexibility can make working with the parsed document more convenient, it typically takes more memory than an event-based parsing approach since the entire document must be loaded at one time.

The memory footprint of small, simple documents (such as this list of podcasts represented as an OPML outline) is not significant:

    [Listing of podcasts.opml. The XML markup did not survive; the remaining text is the title "My Podcasts" and two dates, both "Sun, 07 Mar 2010 15:53:26 GMT".]

To parse the file, pass an open file handle to parse().

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    print tree

It will read the data, parse the XML, and return an ElementTree object.

    $ python ElementTree_parse_opml.py

7.6.2 Traversing the Parsed Tree

To visit all children in order, use iter() to create a generator that iterates over the ElementTree instance.

    from xml.etree import ElementTree
    import pprint

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.iter():
        print node.tag

This example prints the entire tree, one tag at a time.
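For experimenting without the podcasts.opml file, the same traversal can be run against a document held in memory. This is a minimal sketch for Python 3; the SAMPLE document and its feed URL are made up for illustration and are not the book's data.

```python
import io
from xml.etree import ElementTree

# Hypothetical stand-in for podcasts.opml, small enough to read at a glance.
SAMPLE = b"""<opml version="1.0">
<head><title>Sample</title></head>
<body>
<outline text="Group">
<outline text="Feed" xmlUrl="http://example.com/feed.rss"/>
</outline>
</body>
</opml>"""

# parse() accepts any file-like object, so BytesIO stands in for open().
tree = ElementTree.parse(io.BytesIO(SAMPLE))

# iter() walks the tree depth first, in document order.
tags = [node.tag for node in tree.iter()]
print(tags)

# Passing a tag name restricts the walk to matching nodes.
outline_count = len(list(tree.iter('outline')))
print(outline_count)
```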
    $ python ElementTree_dump_opml.py

    opml
    head
    title
    dateCreated
    dateModified
    body
    outline
    outline
    outline
    outline
    outline

To print only the groups of names and feed URLs for the podcasts, leave out all data in the header section by iterating over only the outline nodes and print the text and xmlUrl attributes by looking up the values in the attrib dictionary.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.iter('outline'):
        name = node.attrib.get('text')
        url = node.attrib.get('xmlUrl')
        if name and url:
            print '  %s' % name
            print '    %s' % url
        else:
            print name

The 'outline' argument to iter() means processing is limited to only nodes with the tag 'outline'.

    $ python ElementTree_show_feed_urls.py

    Fiction
      tor.com / category / tordotstories
        http://www.tor.com/rss/category/TorDotStories
    Python
      PyCon Podcast
        http://advocacy.python.org/podcasts/pycon.rss
      A Little Bit of Python
        http://advocacy.python.org/podcasts/littlebit.rss

7.6.3 Finding Nodes in a Document

Walking the entire tree like this, searching for relevant nodes, can be error prone. The previous example had to look at each outline node to determine if it was a group (nodes with only a text attribute) or a podcast (with both text and xmlUrl). To produce a simple list of the podcast feed URLs, without names or groups, the logic could be simplified using findall() to look for nodes with more descriptive search characteristics.

As a first pass at converting the first version, an XPath argument can be used to look for all outline nodes.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.findall('.//outline'):
        url = node.attrib.get('xmlUrl')
        if url:
            print url

The logic in this version is not substantially different from the version using iter().
It still has to check for the presence of the URL, except that it does not print the group name when the URL is not found.

    $ python ElementTree_find_feeds_by_tag.py

    http://www.tor.com/rss/category/TorDotStories
    http://advocacy.python.org/podcasts/pycon.rss
    http://advocacy.python.org/podcasts/littlebit.rss

It is possible to take advantage of the fact that the outline nodes are only nested two levels deep. Changing the search path to .//outline/outline means the loop will process only the second level of outline nodes.

    from xml.etree import ElementTree

    with open('podcasts.opml', 'rt') as f:
        tree = ElementTree.parse(f)

    for node in tree.findall('.//outline/outline'):
        url = node.attrib.get('xmlUrl')
        print url

All outline nodes nested two levels deep in the input are expected to have the xmlUrl attribute referring to the podcast feed, so the loop can skip checking for the attribute before using it.

    $ python ElementTree_find_feeds_by_structure.py

    http://www.tor.com/rss/category/TorDotStories
    http://advocacy.python.org/podcasts/pycon.rss
    http://advocacy.python.org/podcasts/littlebit.rss

This version is limited to the existing structure, though, so if the outline nodes are ever rearranged into a deeper tree, it will stop working.

7.6.4 Parsed Node Attributes

The items returned by findall() and iter() are Element objects, each representing a node in the XML parse tree. Each Element has attributes for accessing data pulled out of the XML. This can be illustrated with a somewhat more contrived example input file, data.xml. (The markup of the numbered listing did not survive; only its text content remains.)

    1
    2
    3 Regular text.
    4 Regular text."Tail" text.
    5
    6
    7 That & This
    8
    9

The attributes of a node are available in the attrib property, which acts like a dictionary.
    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    node = tree.find('./with_attributes')
    print node.tag
    for name, value in sorted(node.attrib.items()):
        print '  %-4s = "%s"' % (name, value)

The node on line five of the input file has two attributes, name and foo.

    $ python ElementTree_node_attributes.py

    with_attributes
      foo  = "bar"
      name = "value"

The text content of the nodes is available, along with the tail text that comes after the end of a close tag.

    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    for path in ['./child', './child_with_tail']:
        node = tree.find(path)
        print node.tag
        print '  child node text:', node.text
        print '  and tail text  :', node.tail

The child node on line three contains embedded text, and the node on line four has text with a tail (including whitespace).

    $ python ElementTree_node_text.py

    child
      child node text: Regular text.
      and tail text  :

    child_with_tail
      child node text: Regular text.
      and tail text  : "Tail" text.

XML entity references embedded in the document are converted to the appropriate characters before values are returned.

    from xml.etree import ElementTree

    with open('data.xml', 'rt') as f:
        tree = ElementTree.parse(f)

    node = tree.find('entity_expansion')
    print node.tag
    print '  in attribute:', node.attrib['attribute']
    print '  in text     :', node.text.strip()

The automatic conversion means the implementation detail of representing certain characters in an XML document can be ignored.

    $ python ElementTree_entity_references.py

    entity_expansion
      in attribute: This & That
      in text     : That & This

7.6.5 Watching Events While Parsing

The other API for processing XML documents is event based. The parser generates start events for opening tags and end events for closing tags.
Data can be extracted from the document during the parsing phase by iterating over the event stream, which is convenient if it is not necessary to manipulate the entire document afterward or hold the entire parsed document in memory.

These are the types of events.

    start     A new tag has been encountered. The closing angle bracket
              of the tag was processed, but not the contents.
    end       The closing angle bracket of a closing tag has been
              processed. All the children were already processed.
    start-ns  Start a namespace declaration.
    end-ns    End a namespace declaration.

iterparse() returns an iterable that produces tuples containing the name of the event and the node triggering the event.

    from xml.etree.ElementTree import iterparse

    depth = 0
    prefix_width = 8
    prefix_dots = '.' * prefix_width
    line_template = ''.join([
        '{prefix:<0.{prefix_len}}',
        '{event:<8}',
        '{suffix:<{suffix_len}} ',
        '{node.tag:<12} ',
        '{node_id}',
        ])

    EVENT_NAMES = ['start', 'end', 'start-ns', 'end-ns']

    for (event, node) in iterparse('podcasts.opml', EVENT_NAMES):
        if event == 'end':
            depth -= 1

        prefix_len = depth * 2

        print line_template.format(
            prefix=prefix_dots,
            prefix_len=prefix_len,
            suffix='',
            suffix_len=(prefix_width - prefix_len),
            node=node,
            node_id=id(node),
            event=event,
            )

        if event == 'start':
            depth += 1

By default, only end events are generated. To see other events, pass the list of desired event names to iterparse(), as in this example.
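The effect of the events argument is easiest to see on a very small document. This is a minimal sketch for Python 3 that feeds iterparse() from memory; the SAMPLE document and the variable names are hypothetical.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical two-item document used only for this illustration.
SAMPLE = b'<root><item>one</item><item>two</item></root>'

# With no events argument, only "end" events are reported.
default_events = [event for event, node in iterparse(io.BytesIO(SAMPLE))]

# Asking for both kinds shows how start/end pairs nest.
both = [(event, node.tag)
        for event, node in iterparse(io.BytesIO(SAMPLE),
                                     events=('start', 'end'))]

print(default_events)
for event, tag in both:
    print(event, tag)
```

Because each end event arrives only after all of a node's children have been reported, the text of a node is reliably available in the end handler, not the start handler.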
    $ python ElementTree_show_all_events.py

    start            opml         4309429072
    ..start          head         4309429136
    ....start        title        4309429200
    ....end          title        4309429200
    ....start        dateCreated  4309429392
    ....end          dateCreated  4309429392
    ....start        dateModified 4309429584
    ....end          dateModified 4309429584
    ..end            head         4309429136
    ..start          body         4309429968
    ....start        outline      4309430032
    ......start      outline      4309430096
    ......end        outline      4309430096
    ....end          outline      4309430032
    ....start        outline      4309430160
    ......start      outline      4309430224
    ......end        outline      4309430224
    ......start      outline      4309459024
    ......end        outline      4309459024
    ....end          outline      4309430160
    ..end            body         4309429968
    end              opml         4309429072

The event style of processing is more natural for some operations, such as converting XML input to some other format. This technique can be used to convert lists of podcasts (from the earlier examples) from an XML file to a CSV file, so they can be loaded into a spreadsheet or database application.

    import csv
    from xml.etree.ElementTree import iterparse
    import sys

    writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)

    group_name = ''

    for (event, node) in iterparse('podcasts.opml', events=['start']):
        if node.tag != 'outline':
            # Ignore anything not part of the outline
            continue
        if not node.attrib.get('xmlUrl'):
            # Remember the current group
            group_name = node.attrib['text']
        else:
            # Output a podcast entry
            writer.writerow((group_name, node.attrib['text'],
                             node.attrib['xmlUrl'],
                             node.attrib.get('htmlUrl', ''),
                             ))

This conversion program does not need to hold the entire parsed input file in memory, and processing each node as it is encountered in the input is more efficient.
    $ python ElementTree_write_podcast_csv.py

    "Fiction","tor.com / category / tordotstories","http://www.tor.com/r\
    ss/category/TorDotStories","http://www.tor.com/"
    "Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.\
    rss","http://advocacy.python.org/podcasts/"
    "Python","A Little Bit of Python","http://advocacy.python.org/podcas\
    ts/littlebit.rss","http://advocacy.python.org/podcasts/"

Note: The output from ElementTree_write_podcast_csv.py has been reformatted to fit on this page. The output lines ending with \ indicate an artificial line break.

7.6.6 Creating a Custom Tree Builder

A potentially more efficient means of handling parse events is to replace the standard tree builder behavior with a custom version. The ElementTree parser uses an XMLTreeBuilder to process the XML and call methods on a target class to save the results. The usual output is an ElementTree instance created by the default TreeBuilder class. Replacing TreeBuilder with another class allows it to receive the events before the Element nodes are instantiated, saving that portion of the overhead.

The XML-to-CSV converter from the previous section can be reimplemented as a tree builder.

    import csv
    from xml.etree.ElementTree import XMLTreeBuilder
    import sys

    class PodcastListToCSV(object):

        def __init__(self, outputFile):
            self.writer = csv.writer(outputFile,
                                     quoting=csv.QUOTE_NONNUMERIC)
            self.group_name = ''
            return

        def start(self, tag, attrib):
            if tag != 'outline':
                # Ignore anything not part of the outline
                return
            if not attrib.get('xmlUrl'):
                # Remember the current group
                self.group_name = attrib['text']
            else:
                # Output a podcast entry
                self.writer.writerow((self.group_name, attrib['text'],
                                      attrib['xmlUrl'],
                                      attrib.get('htmlUrl', ''),
                                      ))

        def end(self, tag):
            # Ignore closing tags
            pass

        def data(self, data):
            # Ignore data inside nodes
            pass

        def close(self):
            # Nothing special to do here
            return

    target = PodcastListToCSV(sys.stdout)
    parser = XMLTreeBuilder(target=target)
    with open('podcasts.opml', 'rt') as f:
        for line in f:
            parser.feed(line)
    parser.close()

PodcastListToCSV implements the TreeBuilder protocol. Each time a new XML tag is encountered, start() is called with the tag name and attributes. When a closing tag is seen, end() is called with the name. In between, data() is called when a node has content (the tree builder is expected to keep up with the "current" node). When all the input is processed, close() is called. It can return a value, which will be returned to the user of the XMLTreeBuilder.

    $ python ElementTree_podcast_csv_treebuilder.py

    "Fiction","tor.com / category / tordotstories","http://www.tor.com/r\
    ss/category/TorDotStories","http://www.tor.com/"
    "Python","PyCon Podcast","http://advocacy.python.org/podcasts/pycon.\
    rss","http://advocacy.python.org/podcasts/"
    "Python","A Little Bit of Python","http://advocacy.python.org/podcas\
    ts/littlebit.rss","http://advocacy.python.org/podcasts/"

Note: The output from ElementTree_podcast_csv_treebuilder.py has been reformatted to fit on this page. The output lines ending with \ indicate an artificial line break.
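On Python 3 the same target protocol is used through XMLParser, since XMLTreeBuilder is no longer available in current releases. The following is a minimal sketch under that assumption; the FeedRows class, the SAMPLE document, and its feed URL are hypothetical. It collects rows in a list and returns them from close() to show how the return value reaches the caller.

```python
import csv
import io
from xml.etree.ElementTree import XMLParser

# Hypothetical input standing in for podcasts.opml.
SAMPLE = """<opml><body>
<outline text="Python">
<outline text="Feed A" xmlUrl="http://example.com/a.rss"/>
</outline>
</body></opml>"""


class FeedRows(object):
    """Parser target that records (group, name, url) for each feed."""

    def __init__(self):
        self.rows = []
        self.group = ''

    def start(self, tag, attrib):
        if tag != 'outline':
            return                      # ignore head, body, and so on
        if not attrib.get('xmlUrl'):
            self.group = attrib['text']  # a group node
        else:
            self.rows.append((self.group, attrib['text'], attrib['xmlUrl']))

    def end(self, tag):
        pass                            # closing tags are not needed

    def data(self, data):
        pass                            # neither is text content

    def close(self):
        return self.rows                # handed back by parser.close()


parser = XMLParser(target=FeedRows())
parser.feed(SAMPLE)
rows = parser.close()

# Write the collected rows as CSV to an in-memory buffer.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
print(out.getvalue())
```

No Element objects are created along the way; the target sees only tag names and attribute dictionaries.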
7.6.7 Parsing Strings

To work with smaller bits of XML text, especially string literals that might be embedded in the source of a program, use XML() and the string containing the XML to be parsed as the only argument. (The markup inside the string literals in the next two listings did not survive; only the text content remains.)

    from xml.etree.ElementTree import XML

    parsed = XML('''
    This is child "a".
    This is child "b".
    This is child "c".
    ''')

    print 'parsed =', parsed

    def show_node(node):
        print node.tag
        if node.text is not None and node.text.strip():
            print '  text: "%s"' % node.text
        if node.tail is not None and node.tail.strip():
            print '  tail: "%s"' % node.tail
        for name, value in sorted(node.attrib.items()):
            print '  %-4s = "%s"' % (name, value)
        for child in node:
            show_node(child)
        return

    for elem in parsed:
        show_node(elem)

Unlike with parse(), the return value is an Element instance instead of an ElementTree. An Element supports the iterator protocol directly, so there is no need to call iter().

    $ python ElementTree_XML.py

    parsed =
    group
    child
      text: "This is child "a"."
      id   = "a"
    child
      text: "This is child "b"."
      id   = "b"
    group
    child
      text: "This is child "c"."
      id   = "c"

For structured XML that uses the id attribute to identify unique nodes of interest, XMLID() is a convenient way to access the parse results.

    from xml.etree.ElementTree import XMLID

    tree, id_map = XMLID('''
    This is child "a".
    This is child "b".
    This is child "c".
    ''')

    for key, value in sorted(id_map.items()):
        print '%s = %s' % (key, value)

XMLID() returns the parsed tree as an Element object, along with a dictionary mapping the id attribute strings to the individual nodes in the tree.

    $ python ElementTree_XMLID.py

    a =
    b =
    c =

See Also:

Outline Processor Markup Language, OPML (http://www.opml.org/) Dave Winer's OPML specification and documentation.
XML Path Language, XPath (http://www.w3.org/TR/xpath/) A syntax for identifying parts of an XML document.
XPath Support in ElementTree (http://effbot.org/zone/element-xpath.htm) Part of Fredrik Lundh's original documentation for ElementTree.
csv (page 411) Read and write comma-separated-value files.

7.6.8 Building Documents with Element Nodes

In addition to its parsing capabilities, xml.etree.ElementTree also supports creating well-formed XML documents from Element objects constructed in an application. The Element class used when a document is parsed also knows how to generate a serialized form of its contents, which can then be written to a file or other data stream.

There are three helper functions useful for creating a hierarchy of Element nodes. Element() creates a standard node, SubElement() attaches a new node to a parent, and Comment() creates a node that serializes using XML's comment syntax.

    from xml.etree.ElementTree import (
        Element, SubElement, Comment, tostring,
        )

    top = Element('top')

    comment = Comment('Generated for PyMOTW')
    top.append(comment)

    child = SubElement(top, 'child')
    child.text = 'This child contains text.'

    child_with_tail = SubElement(top, 'child_with_tail')
    child_with_tail.text = 'This child has regular text.'
    child_with_tail.tail = 'And "tail" text.'

    child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
    child_with_entity_ref.text = 'This & that'

    print tostring(top)

The output contains only the XML nodes in the tree, not the XML declaration with version and encoding.

    $ python ElementTree_create.py

    <top><!--Generated for PyMOTW--><child>This child contains text.</child><child_with_tail>This child has regular text.</child_with_tail>And "tail" text.<child_with_entity_ref>This &amp; that</child_with_entity_ref></top>

The & character in the text of child_with_entity_ref is converted to the entity reference &amp; automatically.

7.6.9 Pretty-Printing XML

ElementTree makes no effort to format the output of tostring() so it is easy to read, because adding extra whitespace changes the contents of the document.
To make the output easier to follow, the rest of the examples will use xml.dom.minidom to reparse the XML and then use its toprettyxml() method.

    from xml.etree import ElementTree
    from xml.dom import minidom

    def prettify(elem):
        """Return a pretty-printed XML string for the Element.
        """
        rough_string = ElementTree.tostring(elem, 'utf-8')
        reparsed = minidom.parseString(rough_string)
        return reparsed.toprettyxml(indent="  ")

The updated example now looks like the following:

    from xml.etree.ElementTree import Element, SubElement, Comment
    from ElementTree_pretty import prettify

    top = Element('top')

    comment = Comment('Generated for PyMOTW')
    top.append(comment)

    child = SubElement(top, 'child')
    child.text = 'This child contains text.'

    child_with_tail = SubElement(top, 'child_with_tail')
    child_with_tail.text = 'This child has regular text.'
    child_with_tail.tail = 'And "tail" text.'

    child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
    child_with_entity_ref.text = 'This & that'

    print prettify(top)

The output is easier to read. (The markup in this output did not survive; only the text content remains.)

    $ python ElementTree_create_pretty.py

    This child contains text.
    This child has regular text.
    And "tail" text.
    This & that

In addition to the extra whitespace for formatting, the xml.dom.minidom pretty-printer also adds an XML declaration to the output.

7.6.10 Setting Element Properties

The previous example created nodes with tags and text content, but did not set any attributes of the nodes. Many of the examples from Parsing an XML Document worked with an OPML file listing podcasts and their feeds. The outline nodes in the tree used attributes for the group names and podcast properties. ElementTree can be used to construct a similar XML file from a CSV input file, setting all the element attributes as the tree is constructed.
    import csv
    from xml.etree.ElementTree import (
        Element, SubElement, Comment, tostring,
        )
    import datetime
    from ElementTree_pretty import prettify

    generated_on = str(datetime.datetime.now())

    # Configure one attribute with set()
    root = Element('opml')
    root.set('version', '1.0')

    root.append(
        Comment('Generated by ElementTree_csv_to_xml.py for PyMOTW')
        )

    head = SubElement(root, 'head')
    title = SubElement(head, 'title')
    title.text = 'My Podcasts'
    dc = SubElement(head, 'dateCreated')
    dc.text = generated_on
    dm = SubElement(head, 'dateModified')
    dm.text = generated_on

    body = SubElement(root, 'body')

    with open('podcasts.csv', 'rt') as f:
        current_group = None
        reader = csv.reader(f)
        for row in reader:
            group_name, podcast_name, xml_url, html_url = row
            # The group name is stored in the "text" attribute of the
            # group node, so compare against that attribute.
            if (current_group is None or
                    group_name != current_group.get('text')):
                # Start a new group
                current_group = SubElement(body, 'outline',
                                           {'text': group_name})
            # Add this podcast to the group,
            # setting all its attributes at
            # once.
            podcast = SubElement(current_group, 'outline',
                                 {'text': podcast_name,
                                  'xmlUrl': xml_url,
                                  'htmlUrl': html_url,
                                  })

    print prettify(root)

This example uses two techniques to set the attribute values of new nodes. The root node is configured using set() to change one attribute at a time. The podcast nodes are given all their attributes at once by passing a dictionary to the node factory. (The markup in the output below did not survive; only the text content remains.)

    $ python ElementTree_csv_to_xml.py

    My Podcasts
    2010-12-03 08:48:58.065172
    2010-12-03 08:48:58.065172

7.6.11 Building Trees from Lists of Nodes

Multiple children can be added to an Element instance together with the extend() method. The argument to extend() is any iterable, including a list or another Element instance.
from xml.etree.ElementTree import Element, tostring from ElementTree_pretty import prettify top = Element(’top’) children = [ Element(’child’, num=str(i)) 406 Data Persistence and Exchange for i in xrange(3) ] top.extend(children) print prettify(top) When a list is given, the nodes in the list are added directly to the new parent. $ python ElementTree_extend.py When another Element instance is given, the children of that node are added to the new parent. from xml.etree.ElementTree import Element, SubElement, tostring, XML from ElementTree_pretty import prettify top = Element(’top’) parent = SubElement(top, ’parent’) children = XML( ’’ ) parent.extend(children) print prettify(top) In this case, the node with tag root created by parsing the XML string has three children, which are added to the parent node. The root node is not part of the output tree. $ python ElementTree_extend_node.py 7.6. xml.etree.ElementTree—XML Manipulation API 407 It is important to understand that extend() does not modify any existing parent- child relationships with the nodes. If the values passed to extend() exist somewhere in the tree already, they will still be there and will be repeated in the output. from xml.etree.ElementTree import Element, SubElement, tostring, XML from ElementTree_pretty import prettify top = Element(’top’) parent_a = SubElement(top, ’parent’, id=’A’) parent_b = SubElement(top, ’parent’, id=’B’) # Create children children = XML( ’’ ) # Set the id to the Python object id of the node # to make duplicates easier to spot. for c in children: c.set(’id’, str(id(c))) # Add to first parent parent_a.extend(children) print ’A:’ print prettify(top) print # Copy nodes to second parent parent_b.extend(children) print ’B:’ print prettify(top) print 408 Data Persistence and Exchange Setting the id attribute of these children to the Python unique object identifier highlights the fact that the same node objects appear in the output tree more than once. 
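The sharing can be confirmed with identity checks instead of comparing id values by eye. This is a minimal sketch for Python 3; it builds the children with Element() rather than parsing a string, and the attribute names used are arbitrary.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

top = Element('top')
parent_a = SubElement(top, 'parent', id='A')
parent_b = SubElement(top, 'parent', id='B')

children = [Element('child', num=str(i)) for i in range(3)]

# Add the same node objects under both parents.
parent_a.extend(children)
parent_b.extend(children)

# Each position holds the very same object in both parents.
shared = all(a is b for a, b in zip(parent_a, parent_b))

# A change made through one parent is therefore visible through the other.
parent_a[0].set('num', 'changed')
seen_from_b = parent_b[0].get('num')

print(shared, seen_from_b)
print(tostring(top))
```

When independent copies are needed, copy.deepcopy() on each child before extending the second parent is one way to get them.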
    $ python ElementTree_extend_node_copy.py

    A:

    B:

(The XML printed after each label did not survive.)

7.6.12 Serializing XML to a Stream

tostring() is implemented by writing to an in-memory file-like object and then returning a string representing the entire element tree. When working with large amounts of data, it will take less memory and make more efficient use of the I/O libraries to write directly to a file handle using the write() method of ElementTree.

    import sys
    from xml.etree.ElementTree import (
        Element, SubElement, Comment, ElementTree,
        )

    top = Element('top')

    comment = Comment('Generated for PyMOTW')
    top.append(comment)

    child = SubElement(top, 'child')
    child.text = 'This child contains text.'

    child_with_tail = SubElement(top, 'child_with_tail')
    child_with_tail.text = 'This child has regular text.'
    child_with_tail.tail = 'And "tail" text.'

    child_with_entity_ref = SubElement(top, 'child_with_entity_ref')
    child_with_entity_ref.text = 'This & that'

    empty_child = SubElement(top, 'empty_child')

    ElementTree(top).write(sys.stdout)

The example uses sys.stdout to write to the console, but it could also write to an open file or socket.

    $ python ElementTree_write.py

    <top><!--Generated for PyMOTW--><child>This child contains text.</child><child_with_tail>This child has regular text.</child_with_tail>And "tail" text.<child_with_entity_ref>This &amp; that</child_with_entity_ref><empty_child /></top>

The last node in the tree contains no text or subnodes, so it is written as an empty tag, <empty_child />. write() takes a method argument to control the handling for empty nodes.

    import sys
    from xml.etree.ElementTree import Element, SubElement, ElementTree

    top = Element('top')

    child = SubElement(top, 'child')
    child.text = 'Contains text.'

    empty_child = SubElement(top, 'empty_child')

    for method in ['xml', 'html', 'text']:
        print method
        ElementTree(top).write(sys.stdout, method=method)
        print '\n'

Three methods are supported.

    xml   The default method, produces <empty_child />.
    html  Produces the tag pair, as is required in HTML documents
          (<empty_child></empty_child>).
    text  Prints only the text of nodes, and skips empty tags entirely.
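The three methods can be compared side by side by writing to an in-memory buffer instead of sys.stdout. This is a minimal sketch for Python 3, where write() with the default encoding produces bytes; the serialize() helper and the results dictionary are hypothetical names introduced for the illustration.

```python
import io
from xml.etree.ElementTree import Element, SubElement, ElementTree

top = Element('top')
SubElement(top, 'child').text = 'Contains text.'
SubElement(top, 'empty_child')


def serialize(method):
    """Write the tree with the given method and return it as text."""
    buf = io.BytesIO()
    ElementTree(top).write(buf, method=method)
    return buf.getvalue().decode('ascii')


results = {method: serialize(method) for method in ('xml', 'html', 'text')}

for method in ('xml', 'html', 'text'):
    print(method)
    print(results[method])
```

Only the treatment of the empty node differs between xml and html, while text drops all markup.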
    $ python ElementTree_write_method.py

    xml
    <top><child>Contains text.</child><empty_child /></top>

    html
    <top><child>Contains text.</child><empty_child></empty_child></top>

    text
    Contains text.

See Also:

Outline Processor Markup Language, OPML (www.opml.org/) Dave Winer's OPML specification and documentation.
Pretty-Print XML with Python—Indenting XML (http://renesd.blogspot.com/2007/05/pretty-print-xml-with-python.html) A tip from Rene Dudfield for pretty-printing XML in Python.
xml.etree.ElementTree (http://docs.python.org/library/xml.etree.elementtree.html) The standard library documentation for this module.
ElementTree Overview (http://effbot.org/zone/element-index.htm) Fredrik Lundh's original documentation and links to the development versions of the ElementTree library.
Process XML in Python with ElementTree (http://www.ibm.com/developerworks/library/x-matters28/) IBM DeveloperWorks article by David Mertz.
lxml.etree (http://codespeak.net/lxml/) A separate implementation of the ElementTree API based on libxml2 with more complete XPath support.

7.7 csv—Comma-Separated Value Files

Purpose: Read and write comma-separated value files.
Python Version: 2.3 and later.

The csv module can be used to work with data exported from spreadsheets and databases into text files formatted with fields and records, commonly referred to as comma-separated value (CSV) format because commas are often used to separate the fields in a record.

Note: The Python 2.5 version of csv does not support Unicode data. There are also issues with ASCII NUL characters. Using UTF-8 or printable ASCII is recommended.

7.7.1 Reading

Use reader() to create an object for reading data from a CSV file. The reader can be used as an iterator to process the rows of the file in order. For example:

    import csv
    import sys

    with open(sys.argv[1], 'rt') as f:
        reader = csv.reader(f)
        for row in reader:
            print row

The first argument to reader() is the source of text lines. In this case, it is a file, but any iterable is accepted (a StringIO instance, list, etc.).
Other optional arguments can be given to control how the input data is parsed.

    "Title 1","Title 2","Title 3"
    1,"a",08/18/07
    2,"b",08/19/07
    3,"c",08/20/07

As it is read, each row of the input data is parsed and converted to a list of strings.

    $ python csv_reader.py testdata.csv

    ['Title 1', 'Title 2', 'Title 3']
    ['1', 'a', '08/18/07']
    ['2', 'b', '08/19/07']
    ['3', 'c', '08/20/07']

The parser handles line breaks embedded within strings in a row, which is why a "row" is not always the same as a "line" of input from the file.

    "Title 1","Title 2","Title 3"
    1,"first line
    second line",08/18/07

Fields with line breaks in the input retain the internal line breaks when they are returned by the parser.

    $ python csv_reader.py testlinebreak.csv

    ['Title 1', 'Title 2', 'Title 3']
    ['1', 'first line\nsecond line', '08/18/07']

7.7.2 Writing

Writing CSV files is just as easy as reading them. Use writer() to create an object for writing, and then iterate over the rows using writerow() to print them.

    import csv
    import sys

    with open(sys.argv[1], 'wt') as f:
        writer = csv.writer(f)
        writer.writerow(('Title 1', 'Title 2', 'Title 3'))
        for i in range(3):
            writer.writerow((i+1,
                             chr(ord('a') + i),
                             '08/%02d/07' % (i+1),
                             ))

    print open(sys.argv[1], 'rt').read()

The output does not look exactly like the exported data used in the reader example.

    $ python csv_writer.py testout.csv

    Title 1,Title 2,Title 3
    1,a,08/01/07
    2,b,08/02/07
    3,c,08/03/07

Quoting

The default quoting behavior is different for the writer, so the second and third columns in the previous example are not quoted. To add quoting, set the quoting argument to one of the other quoting modes.

    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)

In this case, QUOTE_NONNUMERIC adds quotes around all columns containing values that are not numbers.
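The two quoting modes discussed so far can be compared directly by writing the same rows to an in-memory buffer. This is a minimal sketch for Python 3; the render() helper and the sample rows are hypothetical.

```python
import csv
import io


def render(rows, quoting):
    """Return the CSV text produced for rows under a quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerows(rows)
    return buf.getvalue()


rows = [('Title 1', 'Title 2'), (1, 'a')]

minimal = render(rows, csv.QUOTE_MINIMAL)        # the writer's default
nonnumeric = render(rows, csv.QUOTE_NONNUMERIC)  # quote every string

print(minimal)
print(nonnumeric)
```

The integer 1 stays unquoted in both cases because the writer decides based on the Python type of each value, not on what the text looks like.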
    $ python csv_writer_quoted.py testout_quoted.csv
    "Title 1","Title 2","Title 3"
    1,"a","08/01/07"
    2,"b","08/02/07"
    3,"c","08/03/07"

There are four different quoting options defined as constants in the csv module.

QUOTE_ALL
    Quote everything, regardless of type.
QUOTE_MINIMAL
    Quote fields with special characters (anything that would confuse a parser configured with the same dialect and options). This is the default.
QUOTE_NONNUMERIC
    Quote all fields that are not integers or floats. When used with the reader, input fields that are not quoted are converted to floats.
QUOTE_NONE
    Do not quote anything on output. When used with the reader, quote characters are included in the field values (normally, they are treated as delimiters and stripped).

7.7.3 Dialects

There is no well-defined standard for comma-separated value files, so the parser needs to be flexible. This flexibility means there are many parameters to control how csv parses or writes data. Rather than passing each of these parameters to the reader and writer separately, they are grouped together into a dialect object.

Dialect classes can be registered by name so that callers of the csv module do not need to know the parameter settings in advance. The complete list of registered dialects can be retrieved with list_dialects().

    import csv

    print csv.list_dialects()

The standard library includes two dialects: excel and excel-tab. The excel dialect is for working with data in the default export format for Microsoft Excel, and it also works with OpenOffice or NeoOffice.

    $ python csv_list_dialects.py
    ['excel-tab', 'excel']

Creating a Dialect

If, instead of using commas to delimit fields, the input file uses pipes (|), like this

    "Title 1"|"Title 2"|"Title 3"
    1|"first line
    second line"|08/18/07

a new dialect can be registered using the appropriate delimiter.
    import csv

    csv.register_dialect('pipes', delimiter='|')

    with open('testdata.pipes', 'r') as f:
        reader = csv.reader(f, dialect='pipes')
        for row in reader:
            print row

Using the “pipes” dialect, the file can be read just as with the comma-delimited file.

    $ python csv_dialect.py
    ['Title 1', 'Title 2', 'Title 3']
    ['1', 'first line\nsecond line', '08/18/07']

Dialect Parameters

A dialect specifies all the tokens used when parsing or writing a data file. Table 7.3 lists the aspects of the file format that can be specified, from the way columns are delimited to the character used to escape a token.

Table 7.3. CSV Dialect Parameters

    Attribute         Default        Meaning
    ----------------  -------------  ---------------------------------------------
    delimiter         ,              Field separator (one character)
    doublequote       True           Flag controlling whether quotechar instances
                                     are doubled
    escapechar        None           Character used to indicate an escape sequence
    lineterminator    \r\n           String used by writer to terminate a line
    quotechar         "              String to surround fields containing special
                                     values (one character)
    quoting           QUOTE_MINIMAL  Controls quoting behavior described earlier
    skipinitialspace  False          Ignore whitespace after the field delimiter
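As a small illustration of one parameter from Table 7.3, the sketch below passes skipinitialspace directly to reader() as a keyword argument instead of registering a dialect. It is written for Python 3, and the input line is made up for the example.

```python
import csv

# Hypothetical input with a space after each comma.
lines = ['1, "a, b", 08/18/07']

# With the default settings the leading space becomes part of each
# field, so the quote character is not seen at the start of a field.
default_row = next(csv.reader(lines))

# With skipinitialspace=True the space after the delimiter is ignored
# and the quoted field is recognized as a single value.
trimmed_row = next(csv.reader(lines, skipinitialspace=True))

print(default_row)
print(trimmed_row)
```

The same keyword could instead be given to register_dialect() so that the setting travels with a named dialect.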
    import csv
    import sys

    csv.register_dialect('escaped',
                         escapechar='\\',
                         doublequote=False,
                         quoting=csv.QUOTE_NONE,
                         )
    csv.register_dialect('singlequote',
                         quotechar="'",
                         quoting=csv.QUOTE_ALL,
                         )

    quoting_modes = dict( (getattr(csv,n), n)
                          for n in dir(csv)
                          if n.startswith('QUOTE_')
                          )

    for name in sorted(csv.list_dialects()):
        print 'Dialect: "%s"\n' % name
        dialect = csv.get_dialect(name)

        print '  delimiter   = %-6r    skipinitialspace = %r' % (
            dialect.delimiter, dialect.skipinitialspace)
        print '  doublequote = %-6r    quoting          = %s' % (
            dialect.doublequote, quoting_modes[dialect.quoting])
        print '  quotechar   = %-6r    lineterminator   = %r' % (
            dialect.quotechar, dialect.lineterminator)
        print '  escapechar  = %-6r' % dialect.escapechar
        print

        writer = csv.writer(sys.stdout, dialect=dialect)
        writer.writerow(
            ('col1', 1, '10/01/2010',
             'Special chars: " \' %s to parse' % dialect.delimiter)
            )
        print

This program shows how the same data appears in several different dialects.

    $ python csv_dialect_variations.py

    Dialect: "escaped"

      delimiter   = ','       skipinitialspace = 0
      doublequote = 0         quoting          = QUOTE_NONE
      quotechar   = '"'       lineterminator   = '\r\n'
      escapechar  = '\\'

    col1,1,10/01/2010,Special chars: \" ' \, to parse

    Dialect: "excel"

      delimiter   = ','       skipinitialspace = 0
      doublequote = 1         quoting          = QUOTE_MINIMAL
      quotechar   = '"'       lineterminator   = '\r\n'
      escapechar  = None

    col1,1,10/01/2010,"Special chars: "" ' , to parse"

    Dialect: "excel-tab"

      delimiter   = '\t'      skipinitialspace = 0
      doublequote = 1         quoting          = QUOTE_MINIMAL
      quotechar   = '"'       lineterminator   = '\r\n'
      escapechar  = None

    col1 1 10/01/2010 "Special chars: "" ' to parse"

    Dialect: "singlequote"

      delimiter   = ','       skipinitialspace = 0
      doublequote = 1         quoting          = QUOTE_ALL
      quotechar   = "'"       lineterminator   = '\r\n'
      escapechar  = None

    'col1','1','10/01/2010','Special chars: " '' , to parse'

Automatically Detecting Dialects

The best way to configure a dialect for parsing an input file is to know the correct settings in advance. For data where the dialect parameters are unknown, the Sniffer class can be used to make an educated guess. The sniff() method takes a sample of the input data and an optional argument giving the possible delimiter characters.

    import csv
    from StringIO import StringIO
    import textwrap

    csv.register_dialect('escaped',
                         escapechar='\\',
                         doublequote=False,
                         quoting=csv.QUOTE_NONE)
    csv.register_dialect('singlequote',
                         quotechar="'",
                         quoting=csv.QUOTE_ALL)

    # Generate sample data for all known dialects
    samples = []
    for name in sorted(csv.list_dialects()):
        buffer = StringIO()
        dialect = csv.get_dialect(name)
        writer = csv.writer(buffer, dialect=dialect)
        writer.writerow(
            ('col1', 1, '10/01/2010',
             'Special chars " \' %s to parse' % dialect.delimiter)
            )
        samples.append( (name, dialect, buffer.getvalue()) )

    # Guess the dialect for a given sample, and then use the results to
    # parse the data.
    sniffer = csv.Sniffer()
    for name, expected, sample in samples:
        print 'Dialect: "%s"\n' % name
        dialect = sniffer.sniff(sample, delimiters=',\t')
        reader = csv.reader(StringIO(sample), dialect=dialect)
        print reader.next()
        print

sniff() returns a Dialect instance with the settings to be used for parsing the data. The results are not always perfect, as demonstrated by the “escaped” dialect in the example.
    $ python csv_dialect_sniffer.py

    Dialect: "escaped"

    ['col1', '1', '10/01/2010', 'Special chars \\" \' \\', ' to parse']

    Dialect: "excel"

    ['col1', '1', '10/01/2010', 'Special chars " \' , to parse']

    Dialect: "excel-tab"

    ['col1', '1', '10/01/2010', 'Special chars " \' \t to parse']

    Dialect: "singlequote"

    ['col1', '1', '10/01/2010', 'Special chars " \' , to parse']

7.7.4 Using Field Names

In addition to working with sequences of data, the csv module includes classes for working with rows as dictionaries so that the fields can be named. The DictReader and DictWriter classes translate rows to dictionaries instead of lists. Keys for the dictionary can be passed in or inferred from the first row in the input (when the row contains headers).

    import csv
    import sys

    with open(sys.argv[1], 'rt') as f:
        reader = csv.DictReader(f)
        for row in reader:
            print row

The dictionary-based reader and writer are implemented as wrappers around the sequence-based classes, and they use the same methods and arguments. The only difference in the reader API is that rows are returned as dictionaries instead of lists or tuples.

    $ python csv_dictreader.py testdata.csv
    {'Title 1': '1', 'Title 3': '08/18/07', 'Title 2': 'a'}
    {'Title 1': '2', 'Title 3': '08/19/07', 'Title 2': 'b'}
    {'Title 1': '3', 'Title 3': '08/20/07', 'Title 2': 'c'}

The DictWriter must be given a list of field names so it knows how to order the columns in the output.

    import csv
    import sys

    with open(sys.argv[1], 'wt') as f:

        fieldnames = ('Title 1', 'Title 2', 'Title 3')
        headers = dict( (n,n) for n in fieldnames )

        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writerow(headers)

        for i in range(3):
            writer.writerow({ 'Title 1':i+1,
                              'Title 2':chr(ord('a') + i),
                              'Title 3':'08/%02d/07' % (i+1),
                              })

    print open(sys.argv[1], 'rt').read()

The field names are not written to the file automatically, so they need to be written explicitly before any other data.
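On Python 2.7 and later, DictWriter also has a writeheader() method that writes the field names as the first row, which avoids building a header dictionary by hand. The sketch below is written for Python 3 and uses an io.StringIO buffer as a stand-in for an open output file. The variable names are chosen for illustration only.

```python
import csv
import io

fieldnames = ('Title 1', 'Title 2', 'Title 3')

# An in-memory text buffer stands in for a file opened for writing.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)

# Write the header row from the fieldnames, then one data row.
writer.writeheader()
writer.writerow({'Title 1': 1, 'Title 2': 'a', 'Title 3': '08/01/07'})

output = buf.getvalue()
print(output)
```

The columns appear in the order given by fieldnames, regardless of the order of keys in each row dictionary.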
    $ python csv_dictwriter.py testout.csv
    Title 1,Title 2,Title 3
    1,a,08/01/07
    2,b,08/02/07
    3,c,08/03/07

See Also:
csv (http://docs.python.org/library/csv.html) The standard library documentation for this module.
PEP 305 (www.python.org/dev/peps/pep-0305) CSV File API.

Chapter 8

DATA COMPRESSION AND ARCHIVING

Although modern computer systems have an ever-increasing storage capacity, the growth of data being produced is unrelenting. Lossless compression algorithms make up for some of the shortfall in capacity by trading time spent compressing or decompressing data for the space needed to store it. Python includes interfaces to the most popular compression libraries so it can read and write files interchangeably.

zlib and gzip expose the GNU zip library, and bz2 provides access to the more recent bzip2 format. Both formats work on streams of data, without regard to input format, and provide interfaces for reading and writing compressed files transparently. Use these modules for compressing a single file or data source.

The standard library also includes modules to manage archive formats for combining several files into a single file that can be managed as a unit. tarfile reads and writes the UNIX tape archive format, an old standard still widely used today because of its flexibility. zipfile works with archives based on the format popularized by the PC program PKZIP, originally used under MS-DOS and Windows, but now also used on other platforms because of the simplicity of its API and portability of the format.

8.1 zlib—GNU zlib Compression

Purpose: Low-level access to GNU zlib compression library.
Python Version: 2.5 and later

The zlib module provides a lower-level interface to many of the functions in the zlib compression library from the GNU project.
8.1.1 Working with Data in Memory

The simplest way to work with zlib requires holding all the data to be compressed or decompressed in memory:

    import zlib
    import binascii

    original_data = 'This is the original text.'
    print 'Original     :', len(original_data), original_data

    compressed = zlib.compress(original_data)
    print 'Compressed   :', len(compressed), binascii.hexlify(compressed)

    decompressed = zlib.decompress(compressed)
    print 'Decompressed :', len(decompressed), decompressed

The compress() and decompress() functions both take a string argument and return a string.

    $ python zlib_memory.py
    Original     : 26 This is the original text.
    Compressed   : 32 789c0bc9c82c5600a2928c5485fca2ccf4ccbcc41c8592d48a123d007f2f097e
    Decompressed : 26 This is the original text.

The previous example demonstrates that, for short text, the compressed version of a string can be bigger than the uncompressed version. While the actual results depend on the input data, for short bits of text, it is interesting to observe the compression overhead.

    import zlib

    original_data = 'This is the original text.'

    fmt = '%15s %15s'
    print fmt % ('len(data)', 'len(compressed)')
    print fmt % ('-' * 15, '-' * 15)

    for i in xrange(5):
        data = original_data * i
        compressed = zlib.compress(data)
        highlight = '*' if len(data) < len(compressed) else ''
        print fmt % (len(data), len(compressed)), highlight

The * in the output highlights the lines where the compressed data takes up more memory than the uncompressed version.

    $ python zlib_lengths.py
          len(data) len(compressed)
    --------------- ---------------
                  0               8 *
                 26              32 *
                 52              35
                 78              35
                104              36

8.1.2 Incremental Compression and Decompression

The in-memory approach has drawbacks that make it impractical for real-world use cases, primarily that the system needs enough memory to hold both the uncompressed and compressed versions resident in memory at the same time.
The alternative is to use Compress and Decompress objects to manipulate data incrementally, so that the entire data set does not have to fit into memory.

    import zlib
    import binascii

    compressor = zlib.compressobj(1)

    with open('lorem.txt', 'r') as input:
        while True:
            block = input.read(64)
            if not block:
                break
            compressed = compressor.compress(block)
            if compressed:
                print 'Compressed: %s' % binascii.hexlify(compressed)
            else:
                print 'buffering...'
        remaining = compressor.flush()
        print 'Flushed: %s' % binascii.hexlify(remaining)

This example reads small blocks of data from a plain-text file and passes them to compress(). The compressor maintains an internal buffer of compressed data. Since the compression algorithm depends on checksums and minimum block sizes, the compressor may not be ready to return data each time it receives more input. If it does not have an entire compressed block ready, it returns an empty string. When all the data is fed in, the flush() method forces the compressor to close the final block and return the rest of the compressed data.

    $ python zlib_incremental.py
    Compressed: 7801
    buffering...
    buffering...
    buffering...
    buffering...
    buffering...
    Flushed: 55904b6ac4400c44f73e451da0f129b20c2110c85e696b8c40ddedd167ce1
    f7915025a087daa9ef4be8c07e4f21c38962e834b800647435fd3b90747b2810eb9c4b
    bcc13ac123bded6e4bef1c91ee40d3c6580e3ff52aad2e8cb2eb6062dad74a89ca904c
    bb0f2545e0db4b1f2e01955b8c511cb2ac08967d228af1447c8ec72e40c4c714116e60
    cdef171bb6c0feaa255dff1c507c2c4439ec9605b7e0ba9fc54bae39355cb89fd6ebe5
    841d673c7b7bc68a46f575a312eebd220d4b32441bdc1b36ebf0aedef3d57ea4b26dd9
    86dd39af57dfb05d32279de

8.1.3 Mixed Content Streams

The Decompress class returned by decompressobj() can also be used in situations where compressed and uncompressed data are mixed together.
    import zlib

    lorem = open('lorem.txt', 'rt').read()
    compressed = zlib.compress(lorem)
    combined = compressed + lorem

    decompressor = zlib.decompressobj()
    decompressed = decompressor.decompress(combined)

    decompressed_matches = decompressed == lorem
    print 'Decompressed matches lorem:', decompressed_matches
    unused_matches = decompressor.unused_data == lorem
    print 'Unused data matches lorem :', unused_matches

After decompressing all the data, the unused_data attribute contains any data not used.

    $ python zlib_mixed.py
    Decompressed matches lorem: True
    Unused data matches lorem : True

8.1.4 Checksums

In addition to compression and decompression functions, zlib includes two functions for computing checksums of data, adler32() and crc32(). Neither checksum is billed as cryptographically secure, and they are only intended for use for data-integrity verification.

    import zlib

    data = open('lorem.txt', 'r').read()

    cksum = zlib.adler32(data)
    print 'Adler32: %12d' % cksum
    print '       : %12d' % zlib.adler32(data, cksum)

    cksum = zlib.crc32(data)
    print 'CRC-32 : %12d' % cksum
    print '       : %12d' % zlib.crc32(data, cksum)

Both functions take the same arguments, a string of data and an optional value to be used as a starting point for the checksum. They return a 32-bit signed integer value that can also be passed back on subsequent calls as a new starting point argument to produce a running checksum.

    $ python zlib_checksums.py
    Adler32:   -752715298
           :    669447099
    CRC-32 :  -1256596780
           :  -1424888665

8.1.5 Compressing Network Data

The server in the next listing uses the stream compressor to respond to requests consisting of filenames by writing a compressed version of the file to the socket used to communicate with the client. It has some artificial chunking in place to illustrate the buffering that occurs when the data passed to compress() or decompress() does not result in a complete block of compressed or uncompressed output.
    import zlib
    import logging
    import SocketServer
    import binascii

    BLOCK_SIZE = 64

    class ZlibRequestHandler(SocketServer.BaseRequestHandler):

        logger = logging.getLogger('Server')

        def handle(self):
            compressor = zlib.compressobj(1)

            # Find out what file the client wants
            filename = self.request.recv(1024)
            self.logger.debug('client asked for: "%s"', filename)

            # Send chunks of the file as they are compressed
            with open(filename, 'rb') as input:
                while True:
                    block = input.read(BLOCK_SIZE)
                    if not block:
                        break
                    self.logger.debug('RAW "%s"', block)
                    compressed = compressor.compress(block)
                    if compressed:
                        self.logger.debug('SENDING "%s"',
                                          binascii.hexlify(compressed))
                        self.request.send(compressed)
                    else:
                        self.logger.debug('BUFFERING')

            # Send any data being buffered by the compressor
            remaining = compressor.flush()
            while remaining:
                to_send = remaining[:BLOCK_SIZE]
                remaining = remaining[BLOCK_SIZE:]
                self.logger.debug('FLUSHING "%s"',
                                  binascii.hexlify(to_send))
                self.request.send(to_send)
            return

    if __name__ == '__main__':
        import socket
        import threading
        from cStringIO import StringIO

        logging.basicConfig(level=logging.DEBUG,
                            format='%(name)s: %(message)s',
                            )
        logger = logging.getLogger('Client')

        # Set up a server, running in a separate thread
        address = ('localhost', 0) # let the kernel assign a port
        server = SocketServer.TCPServer(address, ZlibRequestHandler)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True)
        t.start()

The client connects to the socket and requests a file. Then it loops, receiving blocks of compressed data. Since a block may not contain enough information to decompress it entirely, the remainder of any data received earlier is combined with the new data and passed to the decompressor. As the data is decompressed, it is appended to a buffer, which is compared against the file contents at the end of the processing loop.
        # Connect to the server as a client
        logger.info('Contacting server on %s:%s', ip, port)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Ask for a file
        requested_file = 'lorem.txt'
        logger.debug('sending filename: "%s"', requested_file)
        len_sent = s.send(requested_file)

        # Receive a response
        buffer = StringIO()
        decompressor = zlib.decompressobj()
        while True:
            response = s.recv(BLOCK_SIZE)
            if not response:
                break
            logger.debug('READ "%s"', binascii.hexlify(response))

            # Include any unconsumed data when feeding the decompressor.
            to_decompress = decompressor.unconsumed_tail + response
            while to_decompress:
                decompressed = decompressor.decompress(to_decompress)
                if decompressed:
                    logger.debug('DECOMPRESSED "%s"', decompressed)
                    buffer.write(decompressed)
                    # Look for unconsumed data due to buffer overflow
                    to_decompress = decompressor.unconsumed_tail
                else:
                    logger.debug('BUFFERING')
                    to_decompress = None

        # deal with data remaining inside the decompressor buffer
        remainder = decompressor.flush()
        if remainder:
            logger.debug('FLUSHED "%s"', remainder)
            buffer.write(remainder)

        full_response = buffer.getvalue()
        lorem = open('lorem.txt', 'rt').read()
        logger.debug('response matches file contents: %s',
                     full_response == lorem)

        # Clean up
        s.close()
        server.socket.close()

Warning: This server has obvious security implications. Do not run it on a system on the open Internet or in any environment where security might be an issue.

    $ python zlib_server.py
    Client: Contacting server on 127.0.0.1:55085
    Client: sending filename: "lorem.txt"
    Server: client asked for: "lorem.txt"
    Server: RAW "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec "
    Server: SENDING "7801"
    Server: RAW "egestas, enim et consectetuer ullamcorper, lectus ligula rutrum "
    Server: BUFFERING
    Server: RAW "leo, a elementum elit tortor eu quam. Duis tincidunt nisi ut ant"
    Server: BUFFERING
    Server: RAW "e. Nulla facilisi. Sed tristique eros eu libero. Pellentesque ve"
    Server: BUFFERING
    Server: RAW "l arcu. Vivamus purus orci, iaculis ac, suscipit sit amet, pulvi"
    Server: BUFFERING
    Server: RAW "nar eu, lacus. "
    Server: BUFFERING
    Server: FLUSHING "55904b6ac4400c44f73e451da0f129b20c2110c85e696b8c40d
    dedd167ce1f7915025a087daa9ef4be8c07e4f21c38962e834b800647435fd3b90747
    b2810eb9"
    Server: FLUSHING "c4bbcc13ac123bded6e4bef1c91ee40d3c6580e3ff52aad2e8c
    b2eb6062dad74a89ca904cbb0f2545e0db4b1f2e01955b8c511cb2ac08967d228af14
    47c8ec72"
    Server: FLUSHING "e40c4c714116e60cdef171bb6c0feaa255dff1c507c2c4439ec
    9605b7e0ba9fc54bae39355cb89fd6ebe5841d673c7b7bc68a46f575a312eebd220d4
    b32441bd"
    Server: FLUSHING "c1b36ebf0aedef3d57ea4b26dd986dd39af57dfb05d32279de"
    Client: READ "780155904b6ac4400c44f73e451da0f129b20c2110c85e696b8c40d
    dedd167ce1f7915025a087daa9ef4be8c07e4f21c38962e834b800647435fd3b90747
    b281"
    Client: DECOMPRESSED "Lorem ipsum dolor sit amet, consectetuer "
    Client: READ "0eb9c4bbcc13ac123bded6e4bef1c91ee40d3c6580e3ff52aad2e8c
    b2eb6062dad74a89ca904cbb0f2545e0db4b1f2e01955b8c511cb2ac08967d228af14
    47c8"
    Client: DECOMPRESSED "adipiscing elit. Donec egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a elementum elit tortor eu quam. Duis ti"
    Client: READ "ec72e40c4c714116e60cdef171bb6c0feaa255dff1c507c2c4439ec
    9605b7e0ba9fc54bae39355cb89fd6ebe5841d673c7b7bc68a46f575a312eebd220d4
    b324"
    Client: DECOMPRESSED "ncidunt nisi ut ante. Nulla facilisi. Sed tristique eros eu libero. Pellentesque vel arcu. Vivamus purus orci, iacu"
    Client: READ "41bdc1b36ebf0aedef3d57ea4b26dd986dd39af57dfb05d32279de"
    Client: DECOMPRESSED "lis ac, suscipit sit amet, pulvinar eu, lacus. "
    Client: response matches file contents: True

See Also:
zlib (http://docs.python.org/library/zlib.html) The standard library documentation for this module.
www.zlib.net/ Home page for zlib library.
www.zlib.net/manual.html Complete zlib documentation.
bz2 (page 436) The bz2 module provides a similar interface to the bzip2 compression library.
gzip (page 430) The gzip module includes a higher-level (file-based) interface to the zlib library.

8.2 gzip—Read and Write GNU Zip Files

Purpose: Read and write gzip files.
Python Version: 1.5.2 and later

The gzip module provides a file-like interface to GNU zip files, using zlib to compress and uncompress the data.

8.2.1 Writing Compressed Files

The module-level function open() creates an instance of the file-like class GzipFile. The usual methods for writing and reading data are provided.

    import gzip
    import os

    outfilename = 'example.txt.gz'
    with gzip.open(outfilename, 'wb') as output:
        output.write('Contents of the example file go here.\n')

    print outfilename, 'contains', os.stat(outfilename).st_size, 'bytes'
    os.system('file -b --mime %s' % outfilename)

To write data into a compressed file, open the file with mode 'wb'.

    $ python gzip_write.py
    application/x-gzip; charset=binary
    example.txt.gz contains 68 bytes

Different amounts of compression can be used by passing a compresslevel argument. Valid values range from 1 to 9, inclusive. Lower values are faster and result in less compression. Higher values are slower and compress more, up to a point.

    import gzip
    import os
    import hashlib

    def get_hash(data):
        return hashlib.md5(data).hexdigest()

    data = open('lorem.txt', 'r').read() * 1024
    cksum = get_hash(data)

    print 'Level  Size        Checksum'
    print '-----  ----------  ---------------------------------'
    print 'data   %10d  %s' % (len(data), cksum)

    for i in xrange(1, 10):
        filename = 'compress-level-%s.gz' % i
        with gzip.open(filename, 'wb', compresslevel=i) as output:
            output.write(data)
        size = os.stat(filename).st_size
        cksum = get_hash(open(filename, 'rb').read())
        print '%5d  %10d  %s' % (i, size, cksum)

The center column of numbers in the output shows the size in bytes of the files produced by compressing the input.
For this input data, the higher compression values do not necessarily pay off in decreased storage space. Results will vary, depending on the input data.

    $ python gzip_compresslevel.py
    Level  Size        Checksum
    -----  ----------  ---------------------------------
    data       754688  e4c0f9433723971563f08a458715119c
        1        9839  3fbd996cd4d63acc70047fb62646f2ba
        2        8260  427bf6183d4518bcd05611d4f114a07c
        3        8221  078331b777a11572583e3fdaa120b845
        4        4160  f73c478ffcba30bfe0b1d08d0f597394
        5        4160  022d920880e24c1895219a31105a89c8
        6        4160  45ba520d6af45e279a56bb9c67294b82
        7        4160  9a834b8a2c649d4b8d509cb12cc580e2
        8        4160  c1aafc7d7d58cba4ef21dfce6fd1f443
        9        4160  78039211f5777f9f34cf770c2eaafc6d

A GzipFile instance also includes a writelines() method that can be used to write a sequence of strings.

    import gzip
    import itertools
    import os

    with gzip.open('example_lines.txt.gz', 'wb') as output:
        output.writelines(
            itertools.repeat('The same line, over and over.\n', 10)
            )

    os.system('gzcat example_lines.txt.gz')

As with a regular file, the input lines need to include a newline character.

    $ python gzip_writelines.py
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.

8.2.2 Reading Compressed Data

To read data back from previously compressed files, open the file with binary read mode ('rb') so no text-based translation of line endings is performed.

    import gzip

    with gzip.open('example.txt.gz', 'rb') as input_file:
        print input_file.read()

This example reads the file written by gzip_write.py from the previous section.

    $ python gzip_read.py
    Contents of the example file go here.

While reading a file, it is also possible to seek and read only part of the data.
    import gzip

    with gzip.open('example.txt.gz', 'rb') as input_file:
        print 'Entire file:'
        all_data = input_file.read()
        print all_data

        expected = all_data[5:15]

        # rewind to beginning
        input_file.seek(0)

        # move ahead 5 bytes
        input_file.seek(5)
        print 'Starting at position 5 for 10 bytes:'
        partial = input_file.read(10)
        print partial

        print
        print expected == partial

The seek() position is relative to the uncompressed data, so the caller does not need to know that the data file is compressed.

    $ python gzip_seek.py
    Entire file:
    Contents of the example file go here.

    Starting at position 5 for 10 bytes:
    nts of the

    True

8.2.3 Working with Streams

The GzipFile class can be used to wrap other types of data streams so they can use compression as well. This is useful when the data is being transmitted over a socket or an existing (already open) file handle. A StringIO buffer can also be used.

    import gzip
    from cStringIO import StringIO
    import binascii

    uncompressed_data = 'The same line, over and over.\n' * 10
    print 'UNCOMPRESSED:', len(uncompressed_data)
    print uncompressed_data

    buf = StringIO()
    with gzip.GzipFile(mode='wb', fileobj=buf) as f:
        f.write(uncompressed_data)

    compressed_data = buf.getvalue()
    print 'COMPRESSED:', len(compressed_data)
    print binascii.hexlify(compressed_data)

    inbuffer = StringIO(compressed_data)
    with gzip.GzipFile(mode='rb', fileobj=inbuffer) as f:
        reread_data = f.read(len(uncompressed_data))

    print
    print 'REREAD:', len(reread_data)
    print reread_data

One benefit of using GzipFile over zlib is that it supports the file API. However, when rereading the previously compressed data, an explicit length is passed to read(). Leaving off the length resulted in a CRC error, possibly because StringIO returned an empty string before reporting EOF.
When working with streams of compressed data, either prefix the data with an integer representing the actual amount of data to be read or use the incremental decompression API in zlib.

    $ python gzip_StringIO.py
    UNCOMPRESSED: 300
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.

    COMPRESSED: 51
    1f8b08001f96f24c02ff0bc94855284ecc4d55c8c9cc4bd551c82f4b2d5248cc4
    b0133f4b8424665916401d3e717802c010000

    REREAD: 300
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.

See Also:
gzip (http://docs.python.org/library/gzip.html) The standard library documentation for this module.
bz2 (page 436) The bz2 module uses the bzip2 compression format.
tarfile (page 448) The tarfile module includes built-in support for reading compressed tar archives.
zlib (page 421) The zlib module is a lower-level interface to gzip compression.
zipfile (page 457) The zipfile module gives access to ZIP archives.

8.3 bz2—bzip2 Compression

Purpose: Perform bzip2 compression.
Python Version: 2.3 and later

The bz2 module is an interface for the bzip2 library, used to compress data for storage or transmission.
There are three APIs provided:

• “one shot” compression/decompression functions for operating on a blob of data
• iterative compression/decompression objects for working with streams of data
• a file-like class that supports reading and writing as with an uncompressed file

8.3.1 One-Shot Operations in Memory

The simplest way to work with bz2 is to load all the data to be compressed or decompressed in memory and then use compress() and decompress() to transform it.

    import bz2
    import binascii

    original_data = 'This is the original text.'
    print 'Original     : %d bytes' % len(original_data)
    print original_data

    print
    compressed = bz2.compress(original_data)
    print 'Compressed   : %d bytes' % len(compressed)
    hex_version = binascii.hexlify(compressed)
    for i in xrange(len(hex_version)/40 + 1):
        print hex_version[i*40:(i+1)*40]

    print
    decompressed = bz2.decompress(compressed)
    print 'Decompressed : %d bytes' % len(decompressed)
    print decompressed

The compressed data contains non-ASCII characters, so it needs to be converted to its hexadecimal representation before it can be printed. In the output from these examples, the hexadecimal version is reformatted to have, at most, 40 characters on each line.

    $ python bz2_memory.py
    Original     : 26 bytes
    This is the original text.

    Compressed   : 62 bytes
    425a683931415926535916be35a6000002938040
    01040022e59c402000314c000111e93d434da223
    028cf9e73148cae0a0d6ed7f17724538509016be
    35a6

    Decompressed : 26 bytes
    This is the original text.

For short text, the compressed version can be significantly longer than the original. While the actual results depend on the input data, it is interesting to observe the compression overhead.
import bz2 original_data = ’This is the original text.’ fmt = ’%15s %15s’ print fmt % (’len(data)’, ’len(compressed)’) print fmt % (’-’ * 15, ’-’ * 15) 438 Data Compression and Archiving for i in xrange(5): data = original_data * i compressed = bz2.compress(data) print fmt % (len(data), len(compressed)), print ’*’ if len(data) < len(compressed) else ’’ The output lines ending with * show the points where the compressed data is longer than the raw input. $ python bz2_lengths.py len(data) len(compressed) --------------- --------------- 0 14 * 26 62 * 52 68 * 78 70 104 72 8.3.2 Incremental Compression and Decompression The in-memory approach has obvious drawbacks that make it impractical for real-world use cases. The alternative is to use BZ2Compressor and BZ2Decompressor objects to manipulate data incrementally so that the entire data set does not have to fit into memory. import bz2 import binascii compressor = bz2.BZ2Compressor() with open(’lorem.txt’, ’r’) as input: while True: block = input.read(64) if not block: break compressed = compressor.compress(block) if compressed: print ’Compressed: %s’ % binascii.hexlify(compressed) else: print ’buffering...’ 8.3. bz2—bzip2 Compression 439 remaining = compressor.flush() print ’Flushed: %s’ % binascii.hexlify(remaining) This example reads small blocks of data from a plain-text file and passes it to compress(). The compressor maintains an internal buffer of compressed data. Since the compression algorithm depends on checksums and minimum block sizes, the com- pressor may not be ready to return data each time it receives more input. If it does not have an entire compressed block ready, it returns an empty string. When all the data is fed in, the flush() method forces the compressor to close the final block and return the rest of the compressed data. $ python bz2_incremental.py buffering... buffering... buffering... buffering... 
    Flushed: 425a6839314159265359ba83a48c000014d5800010400504052fa7fe00300
    0ba9112793d4ca789068698a0d1a341901a0d53f4d1119a8d4c9e812d755a67c107983
    87682c7ca7b5a3bb75da77755eb81c1cb1ca94c4b6faf209c52a90aaa4d16a4a1b9c16
    7a01c8d9ef32589d831e77df7a5753a398b11660e392126fc18a72a1088716cc8dedda
    5d489da410748531278043d70a8a131c2b8adcd6a221bdb8c7ff76b88c1d5342ee48a7
    0a12175074918

8.3.3 Mixed Content Streams

BZ2Decompressor can also be used in situations where compressed and uncompressed data are mixed together.

    import bz2

    lorem = open('lorem.txt', 'rt').read()
    compressed = bz2.compress(lorem)
    combined = compressed + lorem

    decompressor = bz2.BZ2Decompressor()
    decompressed = decompressor.decompress(combined)

    decompressed_matches = decompressed == lorem
    print 'Decompressed matches lorem:', decompressed_matches
    unused_matches = decompressor.unused_data == lorem
    print 'Unused data matches lorem :', unused_matches

After decompressing all the data, the unused_data attribute contains any data found after the end of the compressed stream.

    $ python bz2_mixed.py
    Decompressed matches lorem: True
    Unused data matches lorem : True

8.3.4 Writing Compressed Files

BZ2File can be used to write to and read from bzip2-compressed files using the usual methods for writing and reading data.

    import bz2
    import contextlib
    import os

    with contextlib.closing(bz2.BZ2File('example.bz2', 'wb')) as output:
        output.write('Contents of the example file go here.\n')

    os.system('file example.bz2')

To write data into a compressed file, open the file with mode 'w' (written as 'wb' in this example).

    $ python bz2_file_write.py
    example.bz2: bzip2 compressed data, block size = 900k

Different compression levels can be used by passing a compresslevel argument. Valid values range from 1 to 9, inclusive. Lower values are faster and result in less compression. Higher values are slower and compress more, up to a point.

    import bz2
    import os

    data = open('lorem.txt', 'r').read() * 1024
    print 'Input contains %d bytes' % len(data)

    for i in xrange(1, 10):
        filename = 'compress-level-%s.bz2' % i
        with bz2.BZ2File(filename, 'wb', compresslevel=i) as output:
            output.write(data)
        os.system('cksum %s' % filename)

The center column of numbers in the script output is the size in bytes of the files produced. For this input data, the higher compression values do not always pay off in decreased storage space. Results will vary for other inputs.

    $ python bz2_file_compresslevel.py
    3018243926 8771 compress-level-1.bz2
    1942389165 4949 compress-level-2.bz2
    2596054176 3708 compress-level-3.bz2
    1491394456 2705 compress-level-4.bz2
    1425874420 2705 compress-level-5.bz2
    2232840816 2574 compress-level-6.bz2
    447681641 2394 compress-level-7.bz2
    3699654768 1137 compress-level-8.bz2
    3103658384 1137 compress-level-9.bz2
    Input contains 754688 bytes

A BZ2File instance also includes a writelines() method that can be used to write a sequence of strings.

    import bz2
    import contextlib
    import itertools
    import os

    with contextlib.closing(bz2.BZ2File('lines.bz2', 'wb')) as output:
        output.writelines(
            itertools.repeat('The same line, over and over.\n', 10),
            )

    os.system('bzcat lines.bz2')

The lines should end in a newline character, as when writing to a regular file.

    $ python bz2_file_writelines.py
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.
    The same line, over and over.

8.3.5 Reading Compressed Files

To read data back from previously compressed files, open the file with binary read mode ('rb') so no text-based translation of line endings is performed.

    import bz2
    import contextlib

    with contextlib.closing(bz2.BZ2File('example.bz2', 'rb')) as input:
        print input.read()

This example reads the file written by bz2_file_write.py from the previous section.

    $ python bz2_file_read.py
    Contents of the example file go here.

While reading a file, it is also possible to seek and to read only part of the data.

    import bz2
    import contextlib

    with contextlib.closing(bz2.BZ2File('example.bz2', 'rb')) as input:
        print 'Entire file:'
        all_data = input.read()
        print all_data

        expected = all_data[5:15]

        # rewind to beginning
        input.seek(0)

        # move ahead 5 bytes
        input.seek(5)
        print 'Starting at position 5 for 10 bytes:'
        partial = input.read(10)
        print partial

        print
        print expected == partial

The seek() position is relative to the uncompressed data, so the caller does not even need to be aware that the data file is compressed. This allows a BZ2File instance to be passed to a function expecting a regular uncompressed file.

    $ python bz2_file_seek.py
    Entire file:
    Contents of the example file go here.

    Starting at position 5 for 10 bytes:
    nts of the

    True

8.3.6 Compressing Network Data

The code in the next example responds to requests consisting of filenames by writing a compressed version of the file to the socket used to communicate with the client. It has some artificial chunking in place to illustrate the buffering that occurs when the data passed to compress() or decompress() does not result in a complete block of compressed or uncompressed output.

    import bz2
    import logging
    import SocketServer
    import binascii

    BLOCK_SIZE = 32

    class Bz2RequestHandler(SocketServer.BaseRequestHandler):

        logger = logging.getLogger('Server')

        def handle(self):
            compressor = bz2.BZ2Compressor()

            # Find out what file the client wants
            filename = self.request.recv(1024)
            self.logger.debug('client asked for: "%s"', filename)

            # Send chunks of the file as they are compressed
            with open(filename, 'rb') as input:
                while True:
                    block = input.read(BLOCK_SIZE)
                    if not block:
                        break
                    self.logger.debug('RAW "%s"', block)
                    compressed = compressor.compress(block)
                    if compressed:
                        self.logger.debug('SENDING "%s"',
                                          binascii.hexlify(compressed))
                        self.request.send(compressed)
                    else:
                        self.logger.debug('BUFFERING')

            # Send any data being buffered by the compressor
            remaining = compressor.flush()
            while remaining:
                to_send = remaining[:BLOCK_SIZE]
                remaining = remaining[BLOCK_SIZE:]
                self.logger.debug('FLUSHING "%s"',
                                  binascii.hexlify(to_send))
                self.request.send(to_send)
            return

The main program starts a server in a thread, combining SocketServer and Bz2RequestHandler.

    if __name__ == '__main__':
        import socket
        import sys
        from cStringIO import StringIO
        import threading

        logging.basicConfig(level=logging.DEBUG,
                            format='%(name)s: %(message)s',
                            )

        # Set up a server, running in a separate thread
        address = ('localhost', 0) # let the kernel assign a port
        server = SocketServer.TCPServer(address, Bz2RequestHandler)
        ip, port = server.server_address # what port was assigned?
        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True)
        t.start()

        logger = logging.getLogger('Client')

        # Connect to the server
        logger.info('Contacting server on %s:%s', ip, port)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Ask for a file: the first command-line argument, if one
        # was given, or lorem.txt otherwise
        requested_file = (sys.argv[1]
                          if len(sys.argv) > 1
                          else 'lorem.txt')
        logger.debug('sending filename: "%s"', requested_file)
        len_sent = s.send(requested_file)

        # Receive a response
        buffer = StringIO()
        decompressor = bz2.BZ2Decompressor()
        while True:
            response = s.recv(BLOCK_SIZE)
            if not response:
                break
            logger.debug('READ "%s"', binascii.hexlify(response))

            # Feed each block to the decompressor as it arrives.
            decompressed = decompressor.decompress(response)
            if decompressed:
                logger.debug('DECOMPRESSED "%s"', decompressed)
                buffer.write(decompressed)
            else:
                logger.debug('BUFFERING')

        full_response = buffer.getvalue()
        lorem = open(requested_file, 'rt').read()
        logger.debug('response matches file contents: %s',
                     full_response == lorem)

        # Clean up
        server.shutdown()
        server.socket.close()
        s.close()

It then opens a socket to communicate with the server as a client and requests the file. The default file, lorem.txt, contains this text.

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
    egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a
    elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
    facilisi.

Warning: This implementation has obvious security implications: the server opens and returns whatever file the client names. Do not run it on a server on the open Internet or in any environment where security might be an issue.

Running bz2_server.py produces:

    $ python bz2_server.py
    Client: Contacting server on 127.0.0.1:55091
    Client: sending filename: "lorem.txt"
    Server: client asked for: "lorem.txt"
    Server: RAW "Lorem ipsum dolor sit amet, cons"
    Server: BUFFERING
    Server: RAW "ectetuer adipiscing elit.
    Donec
    "
    Server: BUFFERING
    Server: RAW "egestas, enim et consectetuer ul"
    Server: BUFFERING
    Server: RAW "lamcorper, lectus ligula rutrum "
    Server: BUFFERING
    Server: RAW "leo, a
    elementum elit tortor eu "
    Server: BUFFERING
    Server: RAW "quam. Duis tincidunt nisi ut ant"
    Server: BUFFERING
    Server: RAW "e. Nulla
    facilisi.
    "
    Server: BUFFERING
    Server: FLUSHING "425a6839314159265359ba83a48c000014d5800010400504052fa7fe003000ba"
    Server: FLUSHING "9112793d4ca789068698a0d1a341901a0d53f4d1119a8d4c9e812d755a67c107"
    Server: FLUSHING "98387682c7ca7b5a3bb75da77755eb81c1cb1ca94c4b6faf209c52a90aaa4d16"
    Server: FLUSHING "a4a1b9c167a01c8d9ef32589d831e77df7a5753a398b11660e392126fc18a72a"
    Server: FLUSHING "1088716cc8dedda5d489da410748531278043d70a8a131c2b8adcd6a221bdb8c"
    Server: FLUSHING "7ff76b88c1d5342ee48a70a12175074918"
    Client: READ "425a6839314159265359ba83a48c000014d5800010400504052fa7fe003000ba"
    Client: BUFFERING
    Client: READ "9112793d4ca789068698a0d1a341901a0d53f4d1119a8d4c9e812d755a67c107"
    Client: BUFFERING
    Client: READ "98387682c7ca7b5a3bb75da77755eb81c1cb1ca94c4b6faf209c52a90aaa4d16"
    Client: BUFFERING
    Client: READ "a4a1b9c167a01c8d9ef32589d831e77df7a5753a398b11660e392126fc18a72a"
    Client: BUFFERING
    Client: READ "1088716cc8dedda5d489da410748531278043d70a8a131c2b8adcd6a221bdb8c"
    Client: BUFFERING
    Client: READ "7ff76b88c1d5342ee48a70a12175074918"
    Client: DECOMPRESSED "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec
    egestas, enim et consectetuer ullamcorper, lectus ligula rutrum leo, a
    elementum elit tortor eu quam. Duis tincidunt nisi ut ante. Nulla
    facilisi.
    "
    Client: response matches file contents: True

See Also:

bz2 (http://docs.python.org/library/bz2.html) The standard library documentation for this module.
bzip2.org (www.bzip.org/) The home page for bzip2.
gzip (page 430) A file-like interface to GNU zip compressed files.
zlib (page 421) The zlib module for GNU zip compression.
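
As a recap of the incremental API described in this section, the following sketch compresses data in small chunks, appends uncompressed bytes to the result, and then feeds the combined stream to a decompressor chunk by chunk. It is written for Python 3.3 or later (it relies on the decompressor's eof attribute), unlike the Python 2 listings above. The sample data, the chunk size, and all variable names are assumptions made for illustration.

```python
import bz2

# Hypothetical sample input; any bytes object would do.
data = b'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. ' * 20
CHUNK = 64  # arbitrary chunk size chosen for this illustration

# Compress incrementally, collecting whatever the compressor returns.
compressor = bz2.BZ2Compressor()
compressed_parts = []
for start in range(0, len(data), CHUNK):
    part = compressor.compress(data[start:start + CHUNK])
    if part:  # empty while the compressor is still buffering
        compressed_parts.append(part)
# flush() closes the final block and returns the remaining output.
compressed_parts.append(compressor.flush())
compressed = b''.join(compressed_parts)

# Build a mixed stream: bzip2 data followed by uncompressed bytes.
trailer = b'uncompressed trailer'
stream = compressed + trailer

# Decompress incrementally, stopping once the bzip2 stream ends.
decompressor = bz2.BZ2Decompressor()
restored_parts = []
leftover = b''
for start in range(0, len(stream), CHUNK):
    chunk = stream[start:start + CHUNK]
    if decompressor.eof:
        # Calling decompress() after the end of the stream would
        # raise EOFError, so later chunks are kept as they are.
        leftover += chunk
    else:
        restored_parts.append(decompressor.decompress(chunk))
restored = b''.join(restored_parts)
# unused_data holds the bytes that followed the end of the stream
# inside the chunk where the end was detected.
leftover = decompressor.unused_data + leftover

print('Restored matches input :', restored == data)
print('Leftover matches trailer:', leftover == trailer)
```

Checking eof before each call is one way to handle trailing data when the input arrives in pieces; passing the whole stream to decompress() in one call, as in the mixed-content example, avoids the need for the check.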

8.4 tarfile—Tar Archive Access

Purpose Read and write tar archives.
Python Version 2.3 and later

The tarfile module provides read and write access to UNIX tar archives, including compressed files. In addition to the POSIX standards, several GNU tar extensions are supported. UNIX special file types, such as hard and soft links, and device nodes are also handled.

Note: Although tarfile implements a UNIX format, it can be used to create and read tar archives under Microsoft Windows, too.

8.4.1 Testing Tar Files

The is_tarfile() function returns a Boolean indicating whether or not the filename passed as an argument refers to a valid tar archive.

    import tarfile

    for filename in [ 'README.txt', 'example.tar',
                      'bad_example.tar', 'notthere.tar' ]:
        try:
            print '%15s %s' % (filename, tarfile.is_tarfile(filename))
        except IOError, err:
            print '%15s %s' % (filename, err)

If the file does not exist, is_tarfile() raises an IOError.

    $ python tarfile_is_tarfile.py
         README.txt False
        example.tar True
    bad_example.tar False
       notthere.tar [Errno 2] No such file or directory: 'notthere.tar'

8.4.2 Reading Metadata from an Archive

Use the TarFile class to work directly with a tar archive. It supports methods for reading data about existing archives, as well as modifying the archives by adding additional files.

To read the names of the files in an existing archive, use getnames().

    import tarfile
    from contextlib import closing

    with closing(tarfile.open('example.tar', 'r')) as t:
        print t.getnames()

The return value is a list of strings with the names of the archive contents.

    $ python tarfile_getnames.py
    ['README.txt', '__init__.py']

In addition to names, metadata about the archive members is available as instances of TarInfo objects.

    import tarfile
    import time
    from contextlib import closing

    with closing(tarfile.open('example.tar', 'r')) as t:
        for member_info in t.getmembers():
            print member_info.name
            print '\tModified:\t', time.ctime(member_info.mtime)
            print '\tMode :\t', oct(member_info.mode)
            print '\tType :\t', member_info.type
            print '\tSize :\t', member_info.size, 'bytes'
            print

Load the metadata via getmembers() and getmember().

    $ python tarfile_getmembers.py
    README.txt
        Modified:  Sun Nov 28 13:30:14 2010
        Mode :     0644
        Type :     0
        Size :     75 bytes

    __init__.py
        Modified:  Sun Nov 14 09:39:38 2010
        Mode :     0644
        Type :     0
        Size :     22 bytes

If the name of the archive member is known in advance, its TarInfo object can be retrieved with getmember().

    import tarfile
    import time
    from contextlib import closing

    with closing(tarfile.open('example.tar', 'r')) as t:
        for filename in [ 'README.txt', 'notthere.txt' ]:
            try:
                info = t.getmember(filename)
            except KeyError:
                print 'ERROR: Did not find %s in tar archive' % filename
            else:
                print '%s is %d bytes' % (info.name, info.size)

If the archive member is not present, getmember() raises a KeyError.

    $ python tarfile_getmember.py
    README.txt is 75 bytes
    ERROR: Did not find notthere.txt in tar archive

8.4.3 Extracting Files from an Archive

To access the data from an archive member within a program, use the extractfile() method, passing the member’s name.

    import tarfile
    from contextlib import closing

    with closing(tarfile.open('example.tar', 'r')) as t:
        for filename in [ 'README.txt', 'notthere.txt' ]:
            try:
                f = t.extractfile(filename)
            except KeyError:
                print 'ERROR: Did not find %s in tar archive' % filename
            else:
                print filename, ':'
                print f.read()

The return value is a file-like object from which the contents of the archive member can be read.

    $ python tarfile_extractfile.py
    README.txt :
    The examples for the tarfile module use this file and example.tar as
    data.

    ERROR: Did not find notthere.txt in tar archive

To unpack the archive and write the files to the file system, use extract() or extractall() instead.

    import tarfile
    import os
    from contextlib import closing

    os.mkdir('outdir')
    with closing(tarfile.open('example.tar', 'r')) as t:
        t.extract('README.txt', 'outdir')
    print os.listdir('outdir')

The member or members are read from the archive and written to the file system, starting in the directory named in the arguments.

    $ python tarfile_extract.py
    ['README.txt']

The standard library documentation includes a note stating that extractall() is safer than extract(), especially for working with streaming data where rewinding to read an earlier part of the input is not possible. It should be used in most cases.

    import tarfile
    import os
    from contextlib import closing

    os.mkdir('outdir')
    with closing(tarfile.open('example.tar', 'r')) as t:
        t.extractall('outdir')
    print os.listdir('outdir')

With extractall(), the first argument is the name of the directory where the files should be written.

    $ python tarfile_extractall.py
    ['__init__.py', 'README.txt']

To extract specific files from the archive, pass their TarInfo metadata containers, as returned by getmember() or getmembers(), to extractall() in the members argument.

    import tarfile
    import os
    from contextlib import closing

    os.mkdir('outdir')
    with closing(tarfile.open('example.tar', 'r')) as t:
        t.extractall('outdir',
                     members=[t.getmember('README.txt')],
                     )
    print os.listdir('outdir')

When a members list is provided, only the listed members are extracted.

    $ python tarfile_extractall_members.py
    ['README.txt']

8.4.4 Creating New Archives

To create a new archive, open the TarFile with a mode of 'w'.

    import tarfile
    from contextlib import closing

    print 'creating archive'
    with closing(tarfile.open('tarfile_add.tar', mode='w')) as out:
        print 'adding README.txt'
        out.add('README.txt')

    print
    print 'Contents:'
    with closing(tarfile.open('tarfile_add.tar', mode='r')) as t:
        for member_info in t.getmembers():
            print member_info.name

Any existing file is truncated and a new archive is started. To add files, use the add() method.

    $ python tarfile_add.py
    creating archive
    adding README.txt

    Contents:
    README.txt

8.4.5 Using Alternate Archive Member Names

It is possible to add a file to an archive using a name other than the original filename by constructing a TarInfo object with an alternate arcname and passing it to addfile() together with the open file that supplies the data.

    import tarfile
    from contextlib import closing

    print 'creating archive'
    with closing(tarfile.open('tarfile_addfile.tar', mode='w')) as out:
        print 'adding README.txt as RENAMED.txt'
        info = out.gettarinfo('README.txt', arcname='RENAMED.txt')
        # Pass the open file so the member's data is written after
        # its header; without it, only the header is added.
        with open('README.txt', 'rb') as f:
            out.addfile(info, f)

    print
    print 'Contents:'
    with closing(tarfile.open('tarfile_addfile.tar', mode='r')) as t:
        for member_info in t.getmembers():
            print member_info.name

The archive includes only the changed filename.

    $ python tarfile_addfile.py
    creating archive
    adding README.txt as RENAMED.txt

    Contents:
    RENAMED.txt

8.4.6 Writing Data from Sources Other than Files

Sometimes, it is necessary to write data into an archive directly from memory. Rather than writing the data to a file, and then adding that file to the archive, you can use addfile() to add data from an open file-like handle.

    import tarfile
    from cStringIO import StringIO
    from contextlib import closing

    data = 'This is the data to write to the archive.'

    with closing(tarfile.open('addfile_string.tar', mode='w')) as out:
        info = tarfile.TarInfo('made_up_file.txt')
        info.size = len(data)
        out.addfile(info, StringIO(data))

    print 'Contents:'
    with closing(tarfile.open('addfile_string.tar', mode='r')) as t:
        for member_info in t.getmembers():
            print member_info.name
            f = t.extractfile(member_info)
            print f.read()

By first constructing a TarInfo object, the archive member can be given any name desired. After setting the size, the data is written to the archive using addfile() and a StringIO buffer as a source of the data.

    $ python tarfile_addfile_string.py
    Contents:
    made_up_file.txt
    This is the data to write to the archive.

8.4.7 Appending to Archives

In addition to creating new archives, it is possible to append to an existing file by using mode 'a'.

    import tarfile
    from contextlib import closing

    print 'creating archive'
    with closing(tarfile.open('tarfile_append.tar', mode='w')) as out:
        out.add('README.txt')

    print 'contents:',
    with closing(tarfile.open('tarfile_append.tar', mode='r')) as t:
        print [m.name for m in t.getmembers()]

    print 'adding index.rst'
    with closing(tarfile.open('tarfile_append.tar', mode='a')) as out:
        out.add('index.rst')

    print 'contents:',
    with closing(tarfile.open('tarfile_append.tar', mode='r')) as t:
        print [m.name for m in t.getmembers()]

The resulting archive ends up with two members.

    $ python tarfile_append.py
    creating archive
    contents: ['README.txt']
    adding index.rst
    contents: ['README.txt', 'index.rst']

8.4.8 Working with Compressed Archives

Besides regular tar archive files, the tarfile module can work with archives compressed in the gzip or bzip2 formats. To open a compressed archive, modify the mode string passed to open() to include ":gz" or ":bz2", depending on the desired compression method.

    import tarfile
    import os

    fmt = '%-30s %-10s'
    print fmt % ('FILENAME', 'SIZE')
    print fmt % ('README.txt', os.stat('README.txt').st_size)

    for filename, write_mode in [
        ('tarfile_compression.tar', 'w'),
        ('tarfile_compression.tar.gz', 'w:gz'),
        ('tarfile_compression.tar.bz2', 'w:bz2'),
        ]:
        out = tarfile.open(filename, mode=write_mode)
        try:
            out.add('README.txt')
        finally:
            out.close()

        print fmt % (filename, os.stat(filename).st_size),
        print [m.name
               for m in tarfile.open(filename, 'r:*').getmembers()
               ]

When opening an existing archive for reading, specify "r:*" to have tarfile determine the compression method to use automatically.

    $ python tarfile_compression.py
    FILENAME                       SIZE
    README.txt                     75
    tarfile_compression.tar        10240      ['README.txt']
    tarfile_compression.tar.gz     212        ['README.txt']
    tarfile_compression.tar.bz2    187        ['README.txt']

See Also:

tarfile (http://docs.python.org/library/tarfile.html) The standard library documentation for this module.
GNU tar manual (www.gnu.org/software/tar/manual/html_node/Standard.html) Documentation of the tar format, including extensions.
bz2 (page 436) bzip2 compression.
contextlib (page 163) The contextlib module includes closing(), for managing file handles in with statements.
gzip (page 430) GNU zip compression.
zipfile (page 457) Similar access for ZIP archives.

8.5 zipfile—ZIP Archive Access

Purpose Read and write ZIP archive files.
Python Version 1.6 and later

The zipfile module can be used to manipulate ZIP archive files, the format popularized by the PC program PKZIP.

8.5.1 Testing ZIP Files

The is_zipfile() function returns a Boolean indicating whether or not the filename passed as an argument refers to a valid ZIP archive.

    import zipfile

    for filename in [ 'README.txt', 'example.zip',
                      'bad_example.zip', 'notthere.zip' ]:
        print '%15s %s' % (filename, zipfile.is_zipfile(filename))

If the file does not exist at all, is_zipfile() returns False.

    $ python zipfile_is_zipfile.py
         README.txt False
        example.zip True
    bad_example.zip False
       notthere.zip False

8.5.2 Reading Metadata from an Archive

Use the ZipFile class to work directly with a ZIP archive. It supports methods for reading data about existing archives, as well as modifying the archives by adding additional files.

    import zipfile

    with zipfile.ZipFile('example.zip', 'r') as zf:
        print zf.namelist()

The namelist() method returns the names of the files in an existing archive.

    $ python zipfile_namelist.py
    ['README.txt']

The list of names is only part of the information available from the archive, though. To access all the metadata about the ZIP contents, use the infolist() or getinfo() methods.

    import datetime
    import zipfile

    def print_info(archive_name):
        with zipfile.ZipFile(archive_name) as zf:
            for info in zf.infolist():
                print info.filename
                print '\tComment :', info.comment
                mod_date = datetime.datetime(*info.date_time)
                print '\tModified :', mod_date
                if info.create_system == 0:
                    system = 'Windows'
                elif info.create_system == 3:
                    system = 'Unix'
                else:
                    system = 'UNKNOWN'
                print '\tSystem :', system
                print '\tZIP version :', info.create_version
                print '\tCompressed :', info.compress_size, 'bytes'
                print '\tUncompressed:', info.file_size, 'bytes'
                print

    if __name__ == '__main__':
        print_info('example.zip')

There are additional fields other than those printed here, but deciphering the values into anything useful requires careful reading of the PKZIP Application Note with the ZIP file specification.

    $ python zipfile_infolist.py
    README.txt
        Comment :
        Modified : 2010-11-15 06:48:02
        System : Unix
        ZIP version : 30
        Compressed : 65 bytes
        Uncompressed: 76 bytes

If the name of the archive member is known in advance, its ZipInfo object can be retrieved directly with getinfo().

    import zipfile

    with zipfile.ZipFile('example.zip') as zf:
        for filename in [ 'README.txt', 'notthere.txt' ]:
            try:
                info = zf.getinfo(filename)
            except KeyError:
                print 'ERROR: Did not find %s in zip file' % filename
            else:
                print '%s is %d bytes' % (info.filename, info.file_size)

If the archive member is not present, getinfo() raises a KeyError.

    $ python zipfile_getinfo.py
    README.txt is 76 bytes
    ERROR: Did not find notthere.txt in zip file

8.5.3 Extracting Archived Files from an Archive

To access the data from an archive member, use the read() method, passing the member’s name.

    import zipfile

    with zipfile.ZipFile('example.zip') as zf:
        for filename in [ 'README.txt', 'notthere.txt' ]:
            try:
                data = zf.read(filename)
            except KeyError:
                print 'ERROR: Did not find %s in zip file' % filename
            else:
                print filename, ':'
                print data
            print

The data is automatically decompressed, if necessary.

    $ python zipfile_read.py
    README.txt :
    The examples for the zipfile module use this file and example.zip as data.

    ERROR: Did not find notthere.txt in zip file

8.5.4 Creating New Archives

To create a new archive, instantiate the ZipFile with a mode of 'w'. Any existing file is truncated and a new archive is started. To add files, use the write() method.

    from zipfile_infolist import print_info
    import zipfile

    print 'creating archive'
    with zipfile.ZipFile('write.zip', mode='w') as zf:
        print 'adding README.txt'
        zf.write('README.txt')

    print
    print_info('write.zip')

By default, the contents of the archive are not compressed.

    $ python zipfile_write.py
    creating archive
    adding README.txt

    README.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 76 bytes
        Uncompressed: 76 bytes

To add compression, the zlib module is required. If zlib is available, the compression mode for individual files or for the archive as a whole can be set using zipfile.ZIP_DEFLATED.
The default compression mode is zipfile.ZIP_STORED, which adds the input data to the archive without compressing it.

    from zipfile_infolist import print_info
    import zipfile
    try:
        import zlib
        compression = zipfile.ZIP_DEFLATED
    except ImportError:
        # zlib is not available, so store members uncompressed
        compression = zipfile.ZIP_STORED

    modes = { zipfile.ZIP_DEFLATED: 'deflated',
              zipfile.ZIP_STORED: 'stored',
              }

    print 'creating archive'
    with zipfile.ZipFile('write_compression.zip', mode='w') as zf:
        mode_name = modes[compression]
        print 'adding README.txt with compression mode', mode_name
        zf.write('README.txt', compress_type=compression)

    print
    print_info('write_compression.zip')

This time, the archive member is compressed.

    $ python zipfile_write_compression.py
    creating archive
    adding README.txt with compression mode deflated

    README.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 65 bytes
        Uncompressed: 76 bytes

8.5.5 Using Alternate Archive Member Names

Pass an arcname value to write() to add a file to an archive using a name other than the original filename.

    from zipfile_infolist import print_info
    import zipfile

    with zipfile.ZipFile('write_arcname.zip', mode='w') as zf:
        zf.write('README.txt', arcname='NOT_README.txt')

    print_info('write_arcname.zip')

There is no sign of the original filename in the archive.

    $ python zipfile_write_arcname.py
    NOT_README.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 76 bytes
        Uncompressed: 76 bytes

8.5.6 Writing Data from Sources Other than Files

Sometimes it is necessary to write to a ZIP archive using data that did not come from an existing file. Rather than writing the data to a file, and then adding that file to the ZIP archive, use the writestr() method to add a string of bytes to the archive directly.

    from zipfile_infolist import print_info
    import zipfile

    msg = 'This data did not exist in a file.'
    with zipfile.ZipFile('writestr.zip',
                         mode='w',
                         compression=zipfile.ZIP_DEFLATED,
                         ) as zf:
        zf.writestr('from_string.txt', msg)

    print_info('writestr.zip')

    with zipfile.ZipFile('writestr.zip', 'r') as zf:
        print zf.read('from_string.txt')

In this case, the compression argument to ZipFile is used to compress the data. Before Python 2.7, writestr() did not take an argument of its own to specify the compression.

    $ python zipfile_writestr.py
    from_string.txt
        Comment :
        Modified : 2010-11-28 13:48:46
        System : Unix
        ZIP version : 20
        Compressed : 36 bytes
        Uncompressed: 34 bytes

    This data did not exist in a file.

8.5.7 Writing with a ZipInfo Instance

Normally, the modification date is computed when a file or string is added to the archive. A ZipInfo instance can be passed to writestr() to define the modification date and other metadata.

    import time
    import zipfile
    from zipfile_infolist import print_info

    msg = 'This data did not exist in a file.'

    with zipfile.ZipFile('writestr_zipinfo.zip',
                         mode='w',
                         ) as zf:
        info = zipfile.ZipInfo('from_string.txt',
                               date_time=time.localtime(time.time()),
                               )
        info.compress_type = zipfile.ZIP_DEFLATED
        info.comment = 'Remarks go here'
        info.create_system = 0
        zf.writestr(info, msg)

    print_info('writestr_zipinfo.zip')

In this example, the modified time is set to the current time, the data is compressed, and create_system is deliberately set to 0, so the member is reported as created on Windows regardless of the actual platform. A simple comment is also associated with the new file.

    $ python zipfile_writestr_zipinfo.py
    from_string.txt
        Comment : Remarks go here
        Modified : 2010-11-28 13:48:46
        System : Windows
        ZIP version : 20
        Compressed : 36 bytes
        Uncompressed: 34 bytes

8.5.8 Appending to Files

In addition to creating new archives, it is possible to append to an existing archive or add an archive at the end of an existing file (such as an .exe file for a self-extracting archive). To open a file to append to it, use mode 'a'.

    from zipfile_infolist import print_info
    import zipfile

    print 'creating archive'
    with zipfile.ZipFile('append.zip', mode='w') as zf:
        zf.write('README.txt')

    print
    print_info('append.zip')

    print 'appending to the archive'
    with zipfile.ZipFile('append.zip', mode='a') as zf:
        zf.write('README.txt', arcname='README2.txt')

    print
    print_info('append.zip')

The resulting archive contains two members:

    $ python zipfile_append.py
    creating archive

    README.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 76 bytes
        Uncompressed: 76 bytes

    appending to the archive

    README.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 76 bytes
        Uncompressed: 76 bytes

    README2.txt
        Comment :
        Modified : 2010-11-15 06:48:00
        System : Unix
        ZIP version : 20
        Compressed : 76 bytes
        Uncompressed: 76 bytes

8.5.9 Python ZIP Archives

Python can import modules from inside ZIP archives using zipimport, if those archives appear in sys.path. The PyZipFile class can be used to construct an archive suitable for use in this way. The extra method writepy() tells PyZipFile to scan a directory for .py files and add the corresponding .pyo or .pyc file to the archive. If neither compiled form exists, a .pyc file is created and added.

    import sys
    import zipfile

    if __name__ == '__main__':
        with zipfile.PyZipFile('pyzipfile.zip', mode='w') as zf:
            zf.debug = 3
            print 'Adding python files'
            zf.writepy('.')
        for name in zf.namelist():
            print name

        print
        sys.path.insert(0, 'pyzipfile.zip')
        import zipfile_pyzipfile
        print 'Imported from:', zipfile_pyzipfile.__file__

With the debug attribute of the PyZipFile set to 3, verbose debugging is enabled and output is produced as it compiles each .py file it finds.

    $ python zipfile_pyzipfile.py
    Adding python files
    Adding package in . as .
    Adding ./__init__.pyc
    Adding ./zipfile_append.pyc
    Adding ./zipfile_getinfo.pyc
    Adding ./zipfile_infolist.pyc
    Compiling ./zipfile_is_zipfile.py
    Adding ./zipfile_is_zipfile.pyc
    Adding ./zipfile_namelist.pyc
    Adding ./zipfile_printdir.pyc
    Adding ./zipfile_pyzipfile.pyc
    Adding ./zipfile_read.pyc
    Adding ./zipfile_write.pyc
    Adding ./zipfile_write_arcname.pyc
    Adding ./zipfile_write_compression.pyc
    Adding ./zipfile_writestr.pyc
    Adding ./zipfile_writestr_zipinfo.pyc
    __init__.pyc
    zipfile_append.pyc
    zipfile_getinfo.pyc
    zipfile_infolist.pyc
    zipfile_is_zipfile.pyc
    zipfile_namelist.pyc
    zipfile_printdir.pyc
    zipfile_pyzipfile.pyc
    zipfile_read.pyc
    zipfile_write.pyc
    zipfile_write_arcname.pyc
    zipfile_write_compression.pyc
    zipfile_writestr.pyc
    zipfile_writestr_zipinfo.pyc

    Imported from: pyzipfile.zip/zipfile_pyzipfile.pyc

8.5.10 Limitations

The zipfile module does not support ZIP files with appended comments or multidisk archives. It does support ZIP files larger than 4 GB that use the ZIP64 extensions.

See Also:

tarfile (page 448) Read and write tar archives.
zipfile (http://docs.python.org/library/zipfile.html) The standard library documentation for this module.
zipimport (page 1240) Import Python modules from ZIP archives.
zlib (page 421) ZIP compression library.
PKZIP Application Note (www.pkware.com/documents/casestudies/APPNOTE.TXT) Official specification for the ZIP archive format.

Chapter 9

CRYPTOGRAPHY

Encryption secures messages so that they can be verified as accurate and protected from interception. Python’s cryptography support includes hashlib for generating signatures of message content using standard algorithms, such as MD5 and SHA, and hmac for verifying that a message has not been altered in transmission.

9.1 hashlib—Cryptographic Hashing

Purpose Generate cryptographic hashes and message digests.
Python Version 2.5 and later

The hashlib module deprecates the separate md5 and sha modules and makes their API consistent.
To work with a specific hash algorithm, use the appropriate constructor function to create a hash object. From there, the objects use the same API, no matter what algorithm is being used. Since hashlib is “backed” by OpenSSL, all algorithms provided by that library are available, including

    • md5
    • sha1
    • sha224
    • sha256
    • sha384
    • sha512

9.1.1 Sample Data

All the examples in this section use the same sample data:

    import hashlib

    lorem = '''Lorem ipsum dolor sit amet, consectetur adipisicing elit,
    sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
    enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
    ut aliquip ex ea commodo consequat. Duis aute irure dolor in
    reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
    pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
    culpa qui officia deserunt mollit anim id est laborum.'''

9.1.2 MD5 Example

To calculate the MD5 hash, or digest, for a block of data (here an ASCII string), first create the hash object, and then add the data and call digest() or hexdigest().

    import hashlib

    from hashlib_data import lorem

    h = hashlib.md5()
    h.update(lorem)
    print h.hexdigest()

This example uses the hexdigest() method instead of digest() because the output is formatted so it can be printed clearly. If a binary digest value is acceptable, use digest().

    $ python hashlib_md5.py

    1426f365574592350315090e295ac273

9.1.3 SHA1 Example

A SHA1 digest is calculated in the same way.

    import hashlib

    from hashlib_data import lorem

    h = hashlib.sha1()
    h.update(lorem)
    print h.hexdigest()

The digest value is different in this example because the algorithm changed from MD5 to SHA1.

    $ python hashlib_sha1.py

    8173396ba8a560b89a3f3e2fcc024b044bc83d0a

9.1.4 Creating a Hash by Name

Sometimes, it is more convenient to refer to the algorithm by name in a string rather than by using the constructor function directly.
It is useful, for example, to be able to store the hash type in a configuration file. In those cases, use new() to create a hash calculator.

    import hashlib
    import sys

    try:
        hash_name = sys.argv[1]
    except IndexError:
        print 'Specify the hash name as the first argument.'
    else:
        try:
            data = sys.argv[2]
        except IndexError:
            from hashlib_data import lorem as data

        h = hashlib.new(hash_name)
        h.update(data)
        print h.hexdigest()

When run with a variety of arguments:

    $ python hashlib_new.py sha1

    8173396ba8a560b89a3f3e2fcc024b044bc83d0a

    $ python hashlib_new.py sha256

    dca37495608c68ec23bbb54ab9675bf0152db63e5a51ab1061dc9982b843e767

    $ python hashlib_new.py sha512

    0e3d4bc1cbc117382fa077b147a7ff6363f6cbc7508877460f978a566a0adb6dbb4c8b89f56514da98eb94d7135e1b7ad7fc4a2d747c02af67fcd4e571bd54de

    $ python hashlib_new.py md5

    1426f365574592350315090e295ac273

9.1.5 Incremental Updates

The update() method of the hash calculators can be called repeatedly. Each time, the digest is updated based on the additional text fed in. Updating incrementally is more efficient than reading an entire file into memory, and it produces the same results.

    import hashlib

    from hashlib_data import lorem

    h = hashlib.md5()
    h.update(lorem)
    all_at_once = h.hexdigest()

    def chunkize(size, text):
        "Return parts of the text in size-based increments."
        start = 0
        while start < len(text):
            chunk = text[start:start+size]
            yield chunk
            start += size
        return

    h = hashlib.md5()
    for chunk in chunkize(64, lorem):
        h.update(chunk)
    line_by_line = h.hexdigest()

    print 'All at once :', all_at_once
    print 'Line by line:', line_by_line
    print 'Same        :', (all_at_once == line_by_line)

This example demonstrates how to update a digest incrementally as data is read or otherwise produced.
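As a supplement to the listing above (its own output follows this sketch), the same incremental technique can be applied to a file on disk, which is the case where avoiding a full read into memory matters most. This is a minimal sketch written for Python 3, unlike the Python 2 listings in this section; the helper names hash_file() and demo(), the chunk sizes, and the temporary sample file are assumptions made for illustration and are not part of the original examples.

```python
import hashlib
import os
import tempfile


def hash_file(path, algorithm='md5', chunk_size=64 * 1024):
    """Return the hex digest of a file, feeding it to the hash in chunks.

    hash_file() is a hypothetical helper, not a hashlib function.
    """
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def demo():
    """Hash the same bytes incrementally from a file and all at once."""
    data = b'Lorem ipsum dolor sit amet\n' * 10000
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        chunked = hash_file(path, chunk_size=1024)
    finally:
        os.remove(path)
    all_at_once = hashlib.md5(data).hexdigest()
    return chunked, all_at_once


chunked, all_at_once = demo()
print('Chunked    :', chunked)
print('All at once:', all_at_once)
print('Same       :', chunked == all_at_once)
```

Only one chunk is held in memory at a time, so the memory needed stays roughly constant regardless of the file size, while the digest matches the one computed from the complete data.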
    $ python hashlib_update.py

    All at once : 1426f365574592350315090e295ac273
    Line by line: 1426f365574592350315090e295ac273
    Same        : True

See Also:
    hashlib (http://docs.python.org/library/hashlib.html) The standard library documentation for this module.
    Voidspace: IronPython and hashlib (www.voidspace.org.uk/python/weblog/arch_d7_2006_10_07.shtml#e497) A wrapper for hashlib that works with IronPython.
    hmac (page 473) The hmac module.
    OpenSSL (http://www.openssl.org/) An open source encryption toolkit.

9.2 hmac—Cryptographic Message Signing and Verification

Purpose: The hmac module implements keyed-hashing for message authentication, as described in RFC 2104.
Python Version: 2.2 and later

The HMAC algorithm can be used to verify the integrity of information passed between applications or stored in a potentially vulnerable location. The basic idea is to generate a cryptographic hash of the actual data combined with a shared secret key. The resulting hash can then be used to check the transmitted or stored message to determine a level of trust, without transmitting the secret key.

Warning: Disclaimer: This book does not offer expert security advice. For the full details on HMAC, check out RFC 2104 (http://tools.ietf.org/html/rfc2104.html).

9.2.1 Signing Messages

The new() function creates a new object for calculating a message signature. This example uses the default MD5 hash algorithm.

    import hmac

    digest_maker = hmac.new('secret-shared-key-goes-here')

    with open('lorem.txt', 'rb') as f:
        while True:
            block = f.read(1024)
            if not block:
                break
            digest_maker.update(block)

    digest = digest_maker.hexdigest()
    print digest

When run, the code reads a data file and computes an HMAC signature for it.

    $ python hmac_simple.py

    4bcb287e284f8c21e87e14ba2dc40b16

9.2.2 SHA vs. MD5

Although the default cryptographic algorithm for hmac is MD5, that is not the most secure method to use.
MD5 hashes have some weaknesses, such as collisions (where two different messages produce the same hash). The SHA-1 algorithm is considered to be stronger and should be used instead.

    import hmac
    import hashlib

    digest_maker = hmac.new('secret-shared-key-goes-here',
                            '',
                            hashlib.sha1)

    with open('hmac_sha.py', 'rb') as f:
        while True:
            block = f.read(1024)
            if not block:
                break
            digest_maker.update(block)

    digest = digest_maker.hexdigest()
    print digest

The new() function takes three arguments. The first is the secret key, which should be shared between the two endpoints that are communicating so both ends can use the same value. The second value is an initial message. If the message content that needs to be authenticated is small, such as a timestamp or an HTTP POST, the entire body of the message can be passed to new() instead of using the update() method. The last argument is the digest module to be used. The default is hashlib.md5. This example substitutes hashlib.sha1.

    $ python hmac_sha.py

    b9e8c6737883a9d3a258a0b5090559b7e8e2efcb

9.2.3 Binary Digests

The previous examples used the hexdigest() method to produce printable digests. The hexdigest is a different representation of the value calculated by the digest() method, which is a binary value that may include unprintable or non-ASCII characters, including NUL. Some Web services (Google Checkout, Amazon S3) use the base64 encoded version of the binary digest instead of the hexdigest.

    import base64
    import hmac
    import hashlib

    with open('lorem.txt', 'rb') as f:
        body = f.read()

    hash = hmac.new('secret-shared-key-goes-here', body, hashlib.sha1)
    digest = hash.digest()
    print base64.encodestring(digest)

The base64 encoded string ends in a newline, which frequently needs to be stripped off when embedding the string in HTTP headers or other formatting-sensitive contexts.
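One way to avoid the trailing newline is to encode with base64.b64encode(), which does not append one. The following is a minimal sketch written for Python 3, where keys and messages are bytes and base64.encodebytes() is the counterpart of the encodestring() call used above; the sign_base64() helper and the sample message body are assumptions for illustration, not part of the original example.

```python
import base64
import hashlib
import hmac


def sign_base64(key, body):
    """Return the base64-encoded HMAC-SHA1 digest of body, without a newline.

    sign_base64() is a hypothetical helper name.
    """
    digest = hmac.new(key, body, hashlib.sha1).digest()
    # b64encode() returns only the encoded bytes; encodebytes() adds '\n'.
    return base64.b64encode(digest)


key = b'secret-shared-key-goes-here'
body = b'example message body'  # stand-in for the contents of lorem.txt

raw_digest = hmac.new(key, body, hashlib.sha1).digest()
with_newline = base64.encodebytes(raw_digest)
without_newline = sign_base64(key, body)

print(repr(with_newline))
print(repr(without_newline))
```

Calling strip() on the result of encodebytes() gives the same value for a digest this short, so either approach works; b64encode() simply removes the extra step.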
    $ python hmac_base64.py

9.2.4 Applications of Message Signatures

HMAC authentication should be used for any public network service and any time data is stored where security is important. For example, when sending data through a pipe or socket, that data should be signed and then the signature should be tested before the data is used. The extended example given here is available in the file hmac_pickle.py.

The first step is to establish a function to calculate a digest for a string and a simple class to be instantiated and passed through a communication channel.

    import hashlib
    import hmac
    try:
        import cPickle as pickle
    except ImportError:
        import pickle
    import pprint
    from StringIO import StringIO

    def make_digest(message):
        "Return a digest for the message."
        hash = hmac.new('secret-shared-key-goes-here',
                        message,
                        hashlib.sha1)
        return hash.hexdigest()

    class SimpleObject(object):
        """A very simple class to demonstrate checking digests before
        unpickling.
        """
        def __init__(self, name):
            self.name = name
        def __str__(self):
            return self.name

Next, create a StringIO buffer to represent the socket or pipe. The example uses a naive, but easy to parse, format for the data stream. The digest and length of the data are written, followed by a new line. The serialized representation of the object, generated by pickle, follows. A real system would not want to depend on a length value, since if the digest is wrong, the length is probably wrong as well. Some sort of terminator sequence not likely to appear in the real data would be more appropriate.

The example program then writes two objects to the stream. The first is written using the correct digest value.
    # Simulate a writable socket or pipe with StringIO
    out_s = StringIO()

    # Write a valid object to the stream:
    #  digest length\npickle
    o = SimpleObject('digest matches')
    pickled_data = pickle.dumps(o)
    digest = make_digest(pickled_data)
    header = '%s %s' % (digest, len(pickled_data))
    print 'WRITING:', header
    out_s.write(header + '\n')
    out_s.write(pickled_data)

The second object is written to the stream with an invalid digest, produced by calculating the digest for some other data instead of the pickle.

    # Write an invalid object to the stream
    o = SimpleObject('digest does not match')
    pickled_data = pickle.dumps(o)
    digest = make_digest('not the pickled data at all')
    header = '%s %s' % (digest, len(pickled_data))
    print '\nWRITING:', header
    out_s.write(header + '\n')
    out_s.write(pickled_data)

    out_s.flush()

Now that the data is in the StringIO buffer, it can be read back out again. Start by reading the line of data with the digest and data length. Then read the remaining data, using the length value. pickle.load() could read directly from the stream, but that assumes a trusted data stream, and this data is not yet trusted enough to unpickle it. Reading the pickle as a string from the stream, without actually unpickling the object, is safer.

    # Simulate a readable socket or pipe with StringIO
    in_s = StringIO(out_s.getvalue())

    # Read the data
    while True:
        first_line = in_s.readline()
        if not first_line:
            break
        # The header holds the digest and the length, separated by a space
        incoming_digest, incoming_length = first_line.split(' ')
        incoming_length = int(incoming_length)
        print '\nREAD:', incoming_digest, incoming_length

        incoming_pickled_data = in_s.read(incoming_length)

Once the pickled data is in memory, the digest value can be recalculated and compared against the data read. If the digests match, it is safe to trust the data and unpickle it.
        actual_digest = make_digest(incoming_pickled_data)
        print 'ACTUAL:', actual_digest

        if incoming_digest != actual_digest:
            print 'WARNING: Data corruption'
        else:
            obj = pickle.loads(incoming_pickled_data)
            print 'OK:', obj

The output shows that the first object is verified and the second is deemed “corrupted,” as expected.

    $ python hmac_pickle.py

    WRITING: 387632cfa3d18cd19bdfe72b61ac395dfcdc87c9 124

    WRITING: b01b209e28d7e053408ebe23b90fe5c33bc6a0ec 131

    READ: 387632cfa3d18cd19bdfe72b61ac395dfcdc87c9 124
    ACTUAL: 387632cfa3d18cd19bdfe72b61ac395dfcdc87c9
    OK: digest matches

    READ: b01b209e28d7e053408ebe23b90fe5c33bc6a0ec 131
    ACTUAL: dec53ca1ad3f4b657dd81d514f17f735628b6828
    WARNING: Data corruption

See Also:
    hmac (http://docs.python.org/library/hmac.html) The standard library documentation for this module.
    RFC 2104 (http://tools.ietf.org/html/rfc2104.html) HMAC: Keyed-Hashing for Message Authentication.
    hashlib (page 469) The hashlib module provides MD5 and SHA1 hash generators.
    pickle (page 334) Serialization library.
    WikiPedia: MD5 (http://en.wikipedia.org/wiki/MD5) Description of the MD5 hashing algorithm.
    Authenticating to Amazon S3 Web Service (http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?S3_Authentication.html) Instructions for authenticating to S3 using HMAC-SHA1 signed credentials.

Chapter 10
PROCESSES AND THREADS

Python includes sophisticated tools for managing concurrent operations using processes and threads. Even many relatively simple programs can be made to run faster by applying techniques for running parts of the job concurrently using these modules.

subprocess provides an API for creating and communicating with secondary processes. It is especially good for running programs that produce or consume text, since the API supports passing data back and forth through the standard input and output channels of the new process.
The signal module exposes the UNIX signal mechanism for sending events to other processes. The signals are processed asynchronously, usually by interrupting what the program is doing when the signal arrives. Signalling is useful as a coarse messaging system, but other inter-process communication techniques are more reliable and can deliver more complicated messages.

threading includes a high-level, object-oriented API for working with concurrency from Python. Thread objects run concurrently within the same process and share memory. Using threads is an easy way to scale for tasks that are more I/O bound than CPU bound.

The multiprocessing module mirrors threading, except that instead of a Thread class it provides a Process. Each Process is a true system process without shared memory, but multiprocessing provides features for sharing data and passing messages between them. In many cases, converting from threads to processes is as simple as changing a few import statements.

10.1 subprocess—Spawning Additional Processes

Purpose: Start and communicate with additional processes.
Python Version: 2.4 and later

The subprocess module provides a consistent way to create and work with additional processes. It offers a higher-level interface than some of the other modules available in the standard library, and it is intended to replace functions such as os.system(), os.spawnv(), the variations of popen() in the os and popen2 modules, as well as the commands module. To make it easier to compare subprocess with those other modules, many of the examples in this section re-create the ones used for os and popen2.

The subprocess module defines one class, Popen, and a few wrapper functions that use that class. The constructor for Popen takes arguments to set up the new process so the parent can communicate with it via pipes. It provides all the functionality of the other modules and functions it replaces, and more.
The API is consistent for all uses, and many of the extra steps of overhead needed (such as closing extra file descriptors and ensuring the pipes are closed) are “built in” instead of being handled by the application code separately.

Note: The API for working on UNIX and Windows is roughly the same, but the underlying implementation is slightly different. All examples shown here were tested on Mac OS X. Behavior on a non-UNIX OS will vary.

10.1.1 Running External Commands

To run an external command without interacting with it, in the same way as os.system(), use the call() function.

    import subprocess

    # Simple command
    subprocess.call(['ls', '-1'])

The command line arguments are passed as a list of strings, which avoids the need for escaping quotes or other special characters that might be interpreted by the shell.

    $ python subprocess_os_system.py

    __init__.py
    index.rst
    interaction.py
    repeater.py
    signal_child.py
    signal_parent.py
    subprocess_check_call.py
    subprocess_check_output.py
    subprocess_check_output_error.py
    subprocess_check_output_error_trap_output.py
    subprocess_os_system.py
    subprocess_pipes.py
    subprocess_popen2.py
    subprocess_popen3.py
    subprocess_popen4.py
    subprocess_popen_read.py
    subprocess_popen_write.py
    subprocess_shell_variables.py
    subprocess_signal_parent_shell.py
    subprocess_signal_setsid.py

Setting the shell argument to a true value causes subprocess to spawn an intermediate shell process, which then runs the command. The default is to run the command directly.

    import subprocess

    # Command with shell expansion
    subprocess.call('echo $HOME', shell=True)

Using an intermediate shell means that variables, glob patterns, and other special shell features in the command string are processed before the command is run.

    $ python subprocess_shell_variables.py

    /Users/dhellmann

Error Handling

The return value from call() is the exit code of the program. The caller is responsible for interpreting it to detect errors.
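Interpreting the exit code by hand might look like the following minimal sketch. It is written for Python 3, unlike the Python 2 listings in this section, and it uses sys.executable to start a child Python process so that it does not depend on any particular external command being installed; the run_and_report() helper and the exit status 3 are assumptions chosen for illustration.

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command and return (exit_code, succeeded).

    run_and_report() is a hypothetical helper, not part of subprocess.
    """
    exit_code = subprocess.call(args)
    # By convention an exit code of zero means success; anything else
    # is treated as an error by the caller.
    return exit_code, exit_code == 0


ok = run_and_report([sys.executable, '-c', 'import sys; sys.exit(0)'])
failed = run_and_report([sys.executable, '-c', 'import sys; sys.exit(3)'])

print('ok    :', ok)
print('failed:', failed)
```

Nothing is raised when the command fails, so a caller that forgets to inspect the return value silently ignores the error.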
The check_call() function works like call(), except that the exit code is checked, and if it indicates an error happened, then a CalledProcessError exception is raised.

    import subprocess

    try:
        subprocess.check_call(['false'])
    except subprocess.CalledProcessError as err:
        print 'ERROR:', err

The false command always exits with a nonzero status code, which check_call() interprets as an error.

    $ python subprocess_check_call.py

    ERROR: Command '['false']' returned nonzero exit status 1

Capturing Output

The standard input and output channels for the process started by call() are bound to the parent’s input and output. That means the calling program cannot capture the output of the command. Use check_output() to capture the output for later processing.

    import subprocess

    output = subprocess.check_output(['ls', '-1'])
    print 'Have %d bytes in output' % len(output)
    print output

The ls -1 command runs successfully, so the text it prints to standard output is captured and returned.

    $ python subprocess_check_output.py

    Have 462 bytes in output
    __init__.py
    index.rst
    interaction.py
    repeater.py
    signal_child.py
    signal_parent.py
    subprocess_check_call.py
    subprocess_check_output.py
    subprocess_check_output_error.py
    subprocess_check_output_error_trap_output.py
    subprocess_os_system.py
    subprocess_pipes.py
    subprocess_popen2.py
    subprocess_popen3.py
    subprocess_popen4.py
    subprocess_popen_read.py
    subprocess_popen_write.py
    subprocess_shell_variables.py
    subprocess_signal_parent_shell.py
    subprocess_signal_setsid.py

The next example runs a series of commands in a subshell. Messages are sent to standard output and standard error before the commands exit with an error code.
    import subprocess

    try:
        output = subprocess.check_output(
            'echo to stdout; echo to stderr 1>&2; exit 1',
            shell=True,
            )
    except subprocess.CalledProcessError as err:
        print 'ERROR:', err
    else:
        print 'Have %d bytes in output' % len(output)
        print output

The message to standard error is printed to the console, but the message to standard output is hidden.

    $ python subprocess_check_output_error.py

    to stderr
    ERROR: Command 'echo to stdout; echo to stderr 1>&2; exit 1' returned nonzero exit status 1

To prevent error messages from commands run through check_output() from being written to the console, set the stderr parameter to the constant STDOUT.

    import subprocess

    try:
        output = subprocess.check_output(
            'echo to stdout; echo to stderr 1>&2; exit 1',
            shell=True,
            stderr=subprocess.STDOUT,
            )
    except subprocess.CalledProcessError as err:
        print 'ERROR:', err
    else:
        print 'Have %d bytes in output' % len(output)
        print output

Now the error and standard output channels are merged together, so if the command prints error messages, they are captured and not sent to the console.

    $ python subprocess_check_output_error_trap_output.py

    ERROR: Command 'echo to stdout; echo to stderr 1>&2; exit 1' returned nonzero exit status 1

10.1.2 Working with Pipes Directly

The functions call(), check_call(), and check_output() are wrappers around the Popen class. Using Popen directly gives more control over how the command is run and how its input and output streams are processed. For example, by passing different arguments for stdin, stdout, and stderr, it is possible to mimic the variations of os.popen().

One-Way Communication with a Process

To run a process and read all its output, set the stdout value to PIPE and call communicate().
    import subprocess

    print 'read:'
    proc = subprocess.Popen(['echo', '"to stdout"'],
                            stdout=subprocess.PIPE,
                            )
    stdout_value = proc.communicate()[0]
    print '\tstdout:', repr(stdout_value)

This is similar to the way popen() works, except that the reading is managed internally by the Popen instance.

    $ python subprocess_popen_read.py

    read:
        stdout: '"to stdout"\n'

To set up a pipe to allow the calling program to write data to it, set stdin to PIPE.

    import subprocess

    print 'write:'
    proc = subprocess.Popen(['cat', '-'],
                            stdin=subprocess.PIPE,
                            )
    proc.communicate('\tstdin: to stdin\n')

To send data to the standard input channel of the process one time, pass the data to communicate(). This is similar to using popen() with mode 'w'.

    $ python -u subprocess_popen_write.py

    write:
        stdin: to stdin

Bidirectional Communication with a Process

To set up the Popen instance for reading and writing at the same time, use a combination of the previous techniques.

    import subprocess

    print 'popen2:'

    proc = subprocess.Popen(['cat', '-'],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            )
    msg = 'through stdin to stdout'
    stdout_value = proc.communicate(msg)[0]
    print '\tpass through:', repr(stdout_value)

This sets up the pipe to mimic popen2().

    $ python -u subprocess_popen2.py

    popen2:
        pass through: 'through stdin to stdout'

Capturing Error Output

It is also possible to watch both of the streams for stdout and stderr, as with popen3().

    import subprocess

    print 'popen3:'
    proc = subprocess.Popen('cat -; echo "to stderr" 1>&2',
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            )
    msg = 'through stdin to stdout'
    stdout_value, stderr_value = proc.communicate(msg)
    print '\tpass through:', repr(stdout_value)
    print '\tstderr      :', repr(stderr_value)

Reading from stderr works the same as with stdout.
Passing PIPE tells Popen to attach to the channel, and communicate() reads all the data from it before returning.

    $ python -u subprocess_popen3.py

    popen3:
        pass through: 'through stdin to stdout'
        stderr      : 'to stderr\n'

Combining Regular and Error Output

To direct the error output from the process to its standard output channel, use STDOUT for stderr instead of PIPE.

    import subprocess

    print 'popen4:'
    proc = subprocess.Popen('cat -; echo "to stderr" 1>&2',
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            )
    msg = 'through stdin to stdout\n'
    stdout_value, stderr_value = proc.communicate(msg)
    print '\tcombined output:', repr(stdout_value)
    print '\tstderr value   :', repr(stderr_value)

Combining the output in this way is similar to how popen4() works.

    $ python -u subprocess_popen4.py

    popen4:
        combined output: 'through stdin to stdout\nto stderr\n'
        stderr value   : None

10.1.3 Connecting Segments of a Pipe

Multiple commands can be connected into a pipeline, similar to the way the UNIX shell works, by creating separate Popen instances and chaining their inputs and outputs together. The stdout attribute of one Popen instance is used as the stdin argument for the next in the pipeline, instead of the constant PIPE. The output is read from the stdout handle for the final command in the pipeline.

    import subprocess

    cat = subprocess.Popen(['cat', 'index.rst'],
                           stdout=subprocess.PIPE,
                           )

    grep = subprocess.Popen(['grep', '.. include::'],
                            stdin=cat.stdout,
                            stdout=subprocess.PIPE,
                            )

    cut = subprocess.Popen(['cut', '-f', '3', '-d:'],
                           stdin=grep.stdout,
                           stdout=subprocess.PIPE,
                           )

    end_of_pipe = cut.stdout

    print 'Included files:'
    for line in end_of_pipe:
        print '\t', line.strip()

The example reproduces the following command line.

    cat index.rst | grep ".. include" | cut -f 3 -d:

The pipeline reads the reStructuredText source file for this section and finds all the lines that include other files. Then it prints the names of the files being included.

    $ python -u subprocess_pipes.py

    Included files:
        subprocess_os_system.py
        subprocess_shell_variables.py
        subprocess_check_call.py
        subprocess_check_output.py
        subprocess_check_output_error.py
        subprocess_check_output_error_trap_output.py
        subprocess_popen_read.py
        subprocess_popen_write.py
        subprocess_popen2.py
        subprocess_popen3.py
        subprocess_popen4.py
        subprocess_pipes.py
        repeater.py
        interaction.py
        signal_child.py
        signal_parent.py
        subprocess_signal_parent_shell.py
        subprocess_signal_setsid.py

10.1.4 Interacting with Another Command

All the previous examples assume a limited amount of interaction. The communicate() method reads all the output and waits for the child process to exit before returning. It is also possible to write to and read from the individual pipe handles used by the Popen instance incrementally, as the program runs. A simple echo program that reads from standard input and writes to standard output illustrates this technique.

The script repeater.py is used as the child process in the next example. It reads from stdin and writes the values to stdout, one line at a time until there is no more input. It also writes a message to stderr when it starts and stops, showing the lifetime of the child process.

    import sys

    sys.stderr.write('repeater.py: starting\n')
    sys.stderr.flush()

    while True:
        next_line = sys.stdin.readline()
        if not next_line:
            break
        sys.stdout.write(next_line)
        sys.stdout.flush()

    sys.stderr.write('repeater.py: exiting\n')
    sys.stderr.flush()

The next interaction example uses the stdin and stdout file handles owned by the Popen instance in different ways. In the first example, a sequence of five numbers is written to stdin of the process, and after each write, the next line of output is read back.
In the second example, the same five numbers are written, but the output is read all at once using communicate().

    import subprocess

    print 'One line at a time:'
    proc = subprocess.Popen('python repeater.py',
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            )
    for i in range(5):
        proc.stdin.write('%d\n' % i)
        output = proc.stdout.readline()
        print output.rstrip()
    remainder = proc.communicate()[0]
    print remainder

    print
    print 'All output at once:'
    proc = subprocess.Popen('python repeater.py',
                            shell=True,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            )
    for i in range(5):
        proc.stdin.write('%d\n' % i)

    output = proc.communicate()[0]
    print output

The repeater.py: exiting lines come at different points in the output for each loop style.

    $ python -u interaction.py

    One line at a time:
    repeater.py: starting
    0
    1
    2
    3
    4
    repeater.py: exiting


    All output at once:
    repeater.py: starting
    repeater.py: exiting
    0
    1
    2
    3
    4

10.1.5 Signaling between Processes

The process management examples for the os module include a demonstration of signaling between processes using os.fork() and os.kill(). Since each Popen instance provides a pid attribute with the process id of the child process, it is possible to do something similar with subprocess. The next example combines two scripts. This child process sets up a signal handler for the USR1 signal.

    import os
    import signal
    import time
    import sys

    pid = os.getpid()
    received = False

    def signal_usr1(signum, frame):
        "Callback invoked when a signal is received"
        global received
        received = True
        print 'CHILD %6s: Received USR1' % pid
        sys.stdout.flush()

    print 'CHILD %6s: Setting up signal handler' % pid
    sys.stdout.flush()
    signal.signal(signal.SIGUSR1, signal_usr1)
    print 'CHILD %6s: Pausing to wait for signal' % pid
    sys.stdout.flush()
    time.sleep(3)

    if not received:
        print 'CHILD %6s: Never received signal' % pid

This script runs as the parent process.
It starts signal_child.py, then sends the USR1 signal.

    import os
    import signal
    import subprocess
    import time
    import sys

    proc = subprocess.Popen(['python', 'signal_child.py'])
    print 'PARENT      : Pausing before sending signal...'
    sys.stdout.flush()
    time.sleep(1)
    print 'PARENT      : Signaling child'
    sys.stdout.flush()
    os.kill(proc.pid, signal.SIGUSR1)

This is the output.

    $ python signal_parent.py

    PARENT      : Pausing before sending signal...
    CHILD  11298: Setting up signal handler
    CHILD  11298: Pausing to wait for signal
    PARENT      : Signaling child
    CHILD  11298: Received USR1

Process Groups / Sessions

If the process created by Popen spawns subprocesses, those children will not receive any signals sent to the parent. That means when using the shell argument to Popen, it will be difficult to cause the command started in the shell to terminate by sending SIGINT or SIGTERM.

    import os
    import signal
    import subprocess
    import tempfile
    import time
    import sys

    script = '''#!/bin/sh
    echo "Shell script in process $$"
    set -x
    python signal_child.py
    '''
    script_file = tempfile.NamedTemporaryFile('wt')
    script_file.write(script)
    script_file.flush()

    proc = subprocess.Popen(['sh', script_file.name], close_fds=True)
    print 'PARENT      : Pausing before signaling %s...' % proc.pid
    sys.stdout.flush()
    time.sleep(1)
    print 'PARENT      : Signaling child %s' % proc.pid
    sys.stdout.flush()
    os.kill(proc.pid, signal.SIGUSR1)
    time.sleep(3)

The pid used to send the signal does not match the pid of the child of the shell script waiting for the signal, because in this example, there are three separate processes interacting.

    1. The program subprocess_signal_parent_shell.py
    2. The shell process running the script created by the main Python program
    3. The program signal_child.py

    $ python subprocess_signal_parent_shell.py

    PARENT      : Pausing before signaling 11301...
    Shell script in process 11301
    + python signal_child.py
    CHILD  11302: Setting up signal handler
    CHILD  11302: Pausing to wait for signal
    PARENT      : Signaling child 11301
    CHILD  11302: Never received signal

To send signals to descendants without knowing their process id, use a process group to associate the children so they can be signaled together. The process group is created with os.setsid(), setting the “session id” to the process id of the current process. All child processes inherit their session id from their parent, and since it should only be set in the shell created by Popen and its descendants, os.setsid() should not be called in the same process where the Popen is created. Instead, the function is passed to Popen as the preexec_fn argument so it is run after the fork() inside the new process, before it uses exec() to run the shell. To signal the entire process group, use os.killpg() with the pid value from the Popen instance.

    import os
    import signal
    import subprocess
    import tempfile
    import time
    import sys

    script = '''#!/bin/sh
    echo "Shell script in process $$"
    set -x
    python signal_child.py
    '''
    script_file = tempfile.NamedTemporaryFile('wt')
    script_file.write(script)
    script_file.flush()

    def show_setting_sid():
        print 'Calling os.setsid() from %s' % os.getpid()
        sys.stdout.flush()
        os.setsid()

    proc = subprocess.Popen(['sh', script_file.name],
                            close_fds=True,
                            preexec_fn=show_setting_sid,
                            )
    print 'PARENT      : Pausing before signaling %s...' % proc.pid
    sys.stdout.flush()
    time.sleep(1)
    print 'PARENT      : Signaling process group %s' % proc.pid
    sys.stdout.flush()
    os.killpg(proc.pid, signal.SIGUSR1)
    time.sleep(3)

The sequence of events is:

    1. The parent program instantiates Popen.
    2. The Popen instance forks a new process.
    3. The new process runs os.setsid().
    4. The new process runs exec() to start the shell.
    5. The shell runs the shell script.
    6. The shell script forks again, and that process execs Python.
    7. Python runs signal_child.py.
    8. The parent program signals the process group using the pid of the shell.
    9. The shell and Python processes receive the signal.
    10. The shell ignores the signal.
    11. The Python process running signal_child.py invokes the signal handler.

    $ python subprocess_signal_setsid.py

    Calling os.setsid() from 11305
    PARENT      : Pausing before signaling 11305...
    Shell script in process 11305
    + python signal_child.py
    CHILD  11306: Setting up signal handler
    CHILD  11306: Pausing to wait for signal
    PARENT      : Signaling process group 11305
    CHILD  11306: Received USR1

See Also:
    subprocess (http://docs.python.org/lib/module-subprocess.html) Standard library documentation for this module.
    os (page 1108) Although subprocess replaces many of them, the functions for working with processes found in the os module are still widely used in existing code.
    UNIX Signals and Process Groups (www.frostbytes.com/~jimf/papers/signals/signals.html) A good description of UNIX signaling and how process groups work.
    signal (page 497) More details about using the signal module.
    Advanced Programming in the UNIX(R) Environment (www.amazon.com/Programming-Environment-Addison-Wesley-Professional-Computing/dp/0201433079/ref=pd_bbs_3/002-2842372-4768037?ie=UTF8&s=books&qid=1182098757&sr=8-3) Covers working with multiple processes, such as handling signals, closing duplicated file descriptors, etc.
    pipes UNIX shell command pipeline templates in the standard library.

10.2 signal—Asynchronous System Events

Purpose: Send and receive asynchronous system events.
Python Version: 1.4 and later

Signals are an operating system feature that provides a means of notifying a program of an event and having it handled asynchronously. They can be generated by the system itself or sent from one process to another.
Since signals interrupt the regular flow of the program, it is possible that some operations (especially I/O) may produce errors if a signal is received in the middle.

Signals are identified by integers and are defined in the operating system C headers. Python exposes the signals appropriate for the platform as symbols in the signal module. The examples in this section use SIGINT and SIGUSR1. Both are typically defined for all UNIX and UNIX-like systems.

Note: Programming with UNIX signal handlers is a nontrivial endeavor. This is an introduction and does not include all the details needed to use signals successfully on every platform. There is some degree of standardization across versions of UNIX, but there is also some variation. Consult the operating system documentation if you run into trouble.

10.2.1 Receiving Signals

As with other forms of event-based programming, signals are received by establishing a callback function, called a signal handler, that is invoked when the signal occurs. The arguments to the signal handler are the signal number and the stack frame from the point in the program that was interrupted by the signal.

import signal
import os
import time

def receive_signal(signum, stack):
    print 'Received:', signum

# Register signal handlers
signal.signal(signal.SIGUSR1, receive_signal)
signal.signal(signal.SIGUSR2, receive_signal)

# Print the process ID so it can be used with 'kill'
# to send this program signals.
print 'My PID is:', os.getpid()

while True:
    print 'Waiting...'
    time.sleep(3)

This example script loops indefinitely, pausing for a few seconds each time. When a signal comes in, the sleep() call is interrupted and the signal handler receive_signal() prints the signal number. After the signal handler returns, the loop continues.

Send signals to the running program using os.kill() or the UNIX command line program kill.
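The same mechanism can be exercised without a second terminal. The following is a minimal sketch written in Python 3 syntax, not the Python 2 of the listings in this section. The helper name send_to_self() and its polling loop are assumptions made for illustration. It registers a handler, sends a signal to its own process with os.kill(), and waits briefly for the handler to run.

```python
import os
import signal
import time


def send_to_self(sig=signal.SIGUSR1, timeout=1.0):
    """Send sig to this process and return the signal numbers received.

    Hypothetical helper for illustration; call it from the main thread.
    """
    received = []

    def receive_signal(signum, frame):
        # Only record the number; keep the work done in a handler small.
        received.append(signum)

    previous = signal.signal(sig, receive_signal)
    try:
        os.kill(os.getpid(), sig)
        # The Python-level handler runs in the main thread shortly after
        # the signal arrives, so poll until it has been called.
        deadline = time.time() + timeout
        while not received and time.time() < deadline:
            time.sleep(0.01)
    finally:
        # Put back whatever handler was registered before, if known.
        if previous is not None:
            signal.signal(sig, previous)
    return received


if __name__ == '__main__':
    print('Received:', send_to_self())
```

Collecting the signal numbers in a list instead of printing from the handler keeps the handler short and makes the result easy to inspect afterward.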
$ python signal_signal.py

My PID is: 71387
Waiting...
Waiting...
Waiting...
Received: 30
Waiting...
Waiting...
Received: 31
Waiting...
Waiting...
Traceback (most recent call last):
  File "signal_signal.py", line 25, in <module>
    time.sleep(3)
KeyboardInterrupt

The previous output was produced by running signal_signal.py in one window, and then in another window running

$ kill -USR1 $pid
$ kill -USR2 $pid
$ kill -INT $pid

10.2.2 Retrieving Registered Handlers

To see what signal handlers are registered for a signal, use getsignal(). Pass the signal number as argument. The return value is the registered handler or one of the special values SIG_IGN (if the signal is being ignored), SIG_DFL (if the default behavior is being used), or None (if the existing signal handler was registered from C, rather than from Python).

import signal

def alarm_received(n, stack):
    return

signal.signal(signal.SIGALRM, alarm_received)

signals_to_names = dict(
    (getattr(signal, n), n)
    for n in dir(signal)
    if n.startswith('SIG') and '_' not in n
    )

for s, name in sorted(signals_to_names.items()):
    handler = signal.getsignal(s)
    if handler is signal.SIG_DFL:
        handler = 'SIG_DFL'
    elif handler is signal.SIG_IGN:
        handler = 'SIG_IGN'
    print '%-10s (%2d):' % (name, s), handler

Again, since each OS may have different signals defined, the output on other systems may vary. This is from OS X:

$ python signal_getsignal.py

SIGHUP     ( 1): SIG_DFL
SIGINT     ( 2):
SIGQUIT    ( 3): SIG_DFL
SIGILL     ( 4): SIG_DFL
SIGTRAP    ( 5): SIG_DFL
SIGIOT     ( 6): SIG_DFL
SIGEMT     ( 7): SIG_DFL
SIGFPE     ( 8): SIG_DFL
SIGKILL    ( 9): None
SIGBUS     (10): SIG_DFL
SIGSEGV    (11): SIG_DFL
SIGSYS     (12): SIG_DFL
SIGPIPE    (13): SIG_IGN
SIGALRM    (14):
SIGTERM    (15): SIG_DFL
SIGURG     (16): SIG_DFL
SIGSTOP    (17): None
SIGTSTP    (18): SIG_DFL
SIGCONT    (19): SIG_DFL
SIGCHLD    (20): SIG_DFL
SIGTTIN    (21): SIG_DFL
SIGTTOU    (22): SIG_DFL
SIGIO      (23): SIG_DFL
SIGXCPU    (24): SIG_DFL
SIGXFSZ    (25): SIG_IGN
SIGVTALRM  (26): SIG_DFL
SIGPROF    (27): SIG_DFL
SIGWINCH   (28): SIG_DFL
SIGINFO    (29): SIG_DFL
SIGUSR1    (30): SIG_DFL
SIGUSR2    (31): SIG_DFL

10.2.3 Sending Signals

The function for sending signals from within Python is os.kill(). Its use is covered in the section on the os module, Creating Processes with os.fork().

10.2.4 Alarms

Alarms are a special sort of signal, where the program asks the OS to notify it after some period of time has elapsed. As the standard module documentation for os points out, this is useful for avoiding blocking indefinitely on an I/O operation or other system call.

import signal
import time

def receive_alarm(signum, stack):
    print 'Alarm :', time.ctime()

# Call receive_alarm in 2 seconds
signal.signal(signal.SIGALRM, receive_alarm)
signal.alarm(2)

print 'Before:', time.ctime()
time.sleep(4)
print 'After :', time.ctime()

In this example, the call to sleep() does not last the full four seconds.

$ python signal_alarm.py

Before: Sun Aug 17 10:51:09 2008
Alarm : Sun Aug 17 10:51:11 2008
After : Sun Aug 17 10:51:11 2008

10.2.5 Ignoring Signals

To ignore a signal, register SIG_IGN as the handler. This script replaces the default handler for SIGINT with SIG_IGN and registers a handler for SIGUSR1. Then it uses signal.pause() to wait for a signal to be received.

import signal
import os
import time

def do_exit(sig, stack):
    raise SystemExit('Exiting')

signal.signal(signal.SIGINT, signal.SIG_IGN)
signal.signal(signal.SIGUSR1, do_exit)

print 'My PID:', os.getpid()

signal.pause()

Normally, SIGINT (the signal sent by the shell to a program when the user presses Ctrl-C) raises a KeyboardInterrupt. This example ignores SIGINT and raises SystemExit when it sees SIGUSR1. Each ^C in the output represents an attempt to use Ctrl-C to kill the script from the terminal. Using kill -USR1 72598 from another terminal eventually causes the script to exit.

$ python signal_ignore.py

My PID: 72598
^C^C^C^CExiting

10.2.6 Signals and Threads

Signals and threads do not generally mix well because only the main thread of a process will receive signals. The following example sets up a signal handler, waits for the signal in one thread, and sends the signal from another thread.

import signal
import threading
import os
import time

def signal_handler(num, stack):
    print 'Received signal %d in %s' % \
        (num, threading.currentThread().name)

signal.signal(signal.SIGUSR1, signal_handler)

def wait_for_signal():
    print 'Waiting for signal in', threading.currentThread().name
    signal.pause()
    print 'Done waiting'

# Start a thread that will not receive the signal
receiver = threading.Thread(target=wait_for_signal, name='receiver')
receiver.start()
time.sleep(0.1)

def send_signal():
    print 'Sending signal in', threading.currentThread().name
    os.kill(os.getpid(), signal.SIGUSR1)

sender = threading.Thread(target=send_signal, name='sender')
sender.start()
sender.join()

# Wait for the thread to see the signal (not going to happen!)
print 'Waiting for', receiver.name
signal.alarm(2)
receiver.join()

The signal handlers were all registered in the main thread because this is a requirement of the signal module implementation for Python, regardless of underlying platform support for mixing threads and signals. Although the receiver thread calls signal.pause(), it does not receive the signal. The signal.alarm(2) call near the end of the example prevents an infinite block, since the receiver thread will never exit.

$ python signal_threads.py

Waiting for signal in receiver
Sending signal in sender
Received signal 30 in MainThread
Waiting for receiver
Alarm clock

Although alarms can be set in any thread, they are always received by the main thread.
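The rule that Python-level handlers run in the main thread can be checked directly. This is a minimal sketch in Python 3 syntax. The helper name handler_ran_in_main_thread() is hypothetical, and threading.main_thread() is assumed to be available (Python 3.4 and later). A worker thread sends SIGUSR1 to the process, and the handler records whether it is executing in the main thread.

```python
import os
import signal
import threading
import time


def handler_ran_in_main_thread(timeout=1.0):
    """Signal this process from a worker thread; report where the handler ran.

    Hypothetical helper for illustration; call it from the main thread.
    """
    seen = []

    def handler(signum, frame):
        # True when the handler executes in the main thread.
        seen.append(threading.current_thread() is threading.main_thread())

    previous = signal.signal(signal.SIGUSR1, handler)
    try:
        sender = threading.Thread(
            target=os.kill, args=(os.getpid(), signal.SIGUSR1))
        sender.start()
        sender.join()
        # Give the interpreter a moment to call the handler.
        deadline = time.time() + timeout
        while not seen and time.time() < deadline:
            time.sleep(0.01)
    finally:
        # Restore the earlier handler, if it is known.
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)
    return seen


if __name__ == '__main__':
    print(handler_ran_in_main_thread())
```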
import signal
import time
import threading

def signal_handler(num, stack):
    print time.ctime(), 'Alarm in', threading.currentThread().name

signal.signal(signal.SIGALRM, signal_handler)

def use_alarm():
    t_name = threading.currentThread().name
    print time.ctime(), 'Setting alarm in', t_name
    signal.alarm(1)
    print time.ctime(), 'Sleeping in', t_name
    time.sleep(3)
    print time.ctime(), 'Done with sleep in', t_name

# Start a thread that will not receive the signal
alarm_thread = threading.Thread(target=use_alarm, name='alarm_thread')
alarm_thread.start()
time.sleep(0.1)

# Wait for the thread to see the signal (not going to happen!)
print time.ctime(), 'Waiting for', alarm_thread.name
alarm_thread.join()

print time.ctime(), 'Exiting normally'

The alarm does not abort the sleep() call in use_alarm().

$ python signal_threads_alarm.py

Sun Nov 28 14:26:51 2010 Setting alarm in alarm_thread
Sun Nov 28 14:26:51 2010 Sleeping in alarm_thread
Sun Nov 28 14:26:52 2010 Waiting for alarm_thread
Sun Nov 28 14:26:54 2010 Done with sleep in alarm_thread
Sun Nov 28 14:26:54 2010 Alarm in MainThread
Sun Nov 28 14:26:54 2010 Exiting normally

See Also:

signal (http://docs.python.org/lib/module-signal.html) Standard library documentation for this module.

Creating Processes with os.fork() (page 1122) The kill() function can be used to send signals between processes.

10.3 threading—Manage Concurrent Operations

Purpose Builds on the thread module to more easily manage several threads of execution.
Python Version 1.5.2 and later

Using threads allows a program to run multiple operations concurrently in the same process space. The threading module builds on the low-level features of thread to make working with threads easier.

10.3.1 Thread Objects

The simplest way to use a Thread is to instantiate it with a target function and call start() to let it begin working.

import threading

def worker():
    """thread worker function"""
    print 'Worker'
    return

threads = []
for i in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

The output is five lines with "Worker" on each:

$ python threading_simple.py

Worker
Worker
Worker
Worker
Worker

It is useful to be able to spawn a thread and pass it arguments to tell it what work to do. Any type of object can be passed as an argument to the thread. This example passes a number, which the thread then prints.

import threading

def worker(num):
    """thread worker function"""
    print 'Worker: %s' % num
    return

threads = []
for i in range(5):
    t = threading.Thread(target=worker, args=(i,))
    threads.append(t)
    t.start()

The integer argument is now included in the message printed by each thread:

$ python -u threading_simpleargs.py

Worker: 0
Worker: 1
Worker: 2
Worker: 3
Worker: 4

10.3.2 Determining the Current Thread

Using arguments to identify or name the thread is cumbersome and unnecessary. Each Thread instance has a name with a default value that can be changed as the thread is created. Naming threads is useful in server processes made up of multiple service threads handling different operations.

import threading
import time

def worker():
    print threading.currentThread().getName(), 'Starting'
    time.sleep(2)
    print threading.currentThread().getName(), 'Exiting'

def my_service():
    print threading.currentThread().getName(), 'Starting'
    time.sleep(3)
    print threading.currentThread().getName(), 'Exiting'

t = threading.Thread(name='my_service', target=my_service)
w = threading.Thread(name='worker', target=worker)
w2 = threading.Thread(target=worker) # use default name

w.start()
w2.start()
t.start()

The debug output includes the name of the current thread on each line. The lines with “Thread-1” in the thread name column correspond to the unnamed thread w2.
$ python -u threading_names.py

worker Starting
Thread-1 Starting
my_service Starting
worker Exiting
Thread-1 Exiting
my_service Exiting

Most programs do not use print to debug. The logging module supports embedding the thread name in every log message using the formatter code %(threadName)s. Including thread names in log messages makes it possible to trace those messages back to their source.

import logging
import threading
import time

logging.basicConfig(
    level=logging.DEBUG,
    format='[%(levelname)s] (%(threadName)-10s) %(message)s',
    )

def worker():
    logging.debug('Starting')
    time.sleep(2)
    logging.debug('Exiting')

def my_service():
    logging.debug('Starting')
    time.sleep(3)
    logging.debug('Exiting')

t = threading.Thread(name='my_service', target=my_service)
w = threading.Thread(name='worker', target=worker)
w2 = threading.Thread(target=worker) # use default name

w.start()
w2.start()
t.start()

logging is also thread-safe, so messages from different threads are kept distinct in the output.

$ python threading_names_log.py

[DEBUG] (worker    ) Starting
[DEBUG] (Thread-1  ) Starting
[DEBUG] (my_service) Starting
[DEBUG] (worker    ) Exiting
[DEBUG] (Thread-1  ) Exiting
[DEBUG] (my_service) Exiting

10.3.3 Daemon vs. Non-Daemon Threads

Up to this point, the example programs have implicitly waited to exit until all threads have completed their work. Programs sometimes spawn a thread as a daemon that runs without blocking the main program from exiting. Using daemon threads is useful for services where there may not be an easy way to interrupt the thread, or where letting the thread die in the middle of its work does not lose or corrupt data (for example, a thread that generates “heartbeats” for a service monitoring tool). To mark a thread as a daemon, call its setDaemon() method with True. The default is for threads to not be daemons.
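For readers on newer interpreters, the daemon flag can also be set when the thread is constructed. The following is a minimal sketch in Python 3 syntax, assuming Python 3.3 or later, where Thread accepts a daemon keyword argument. The helper build_threads() and the heartbeat() worker are hypothetical names used only for illustration.

```python
import threading


def build_threads():
    """Return (daemon_thread, regular_thread, stop_event, beats).

    Hypothetical helper for illustration.
    """
    stop = threading.Event()
    beats = []

    def heartbeat():
        # Record a beat roughly every 10 ms until asked to stop.
        while not stop.wait(0.01):
            beats.append(len(beats))

    # Assumption: Python 3.3+, where the constructor takes daemon=...
    d = threading.Thread(name='daemon', target=heartbeat, daemon=True)
    t = threading.Thread(name='non-daemon', target=lambda: None)
    return d, t, stop, beats


if __name__ == '__main__':
    d, t, stop, beats = build_threads()
    print('daemon flags:', d.daemon, t.daemon)
    d.start()
    t.start()
    t.join()
    stop.set()   # a daemon thread can still be stopped and joined cleanly
    d.join()
```

Giving the daemon thread an explicit stop signal is a design choice of this sketch. It lets the thread finish its current step instead of being abandoned when the program exits.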
import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def daemon():
    logging.debug('Starting')
    time.sleep(2)
    logging.debug('Exiting')

d = threading.Thread(name='daemon', target=daemon)
d.setDaemon(True)

def non_daemon():
    logging.debug('Starting')
    logging.debug('Exiting')

t = threading.Thread(name='non-daemon', target=non_daemon)

d.start()
t.start()

The output does not include the “Exiting” message from the daemon thread, since all of the non-daemon threads (including the main thread) exit before the daemon thread wakes up from its two-second sleep.

$ python threading_daemon.py

(daemon    ) Starting
(non-daemon) Starting
(non-daemon) Exiting

To wait until a daemon thread has completed its work, use the join() method.

import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def daemon():
    logging.debug('Starting')
    time.sleep(2)
    logging.debug('Exiting')

d = threading.Thread(name='daemon', target=daemon)
d.setDaemon(True)

def non_daemon():
    logging.debug('Starting')
    logging.debug('Exiting')

t = threading.Thread(name='non-daemon', target=non_daemon)

d.start()
t.start()

d.join()
t.join()

Waiting for the daemon thread to exit using join() means it has a chance to produce its “Exiting” message.

$ python threading_daemon_join.py

(daemon    ) Starting
(non-daemon) Starting
(non-daemon) Exiting
(daemon    ) Exiting

By default, join() blocks indefinitely. It is also possible to pass a float value representing the number of seconds to wait for the thread to become inactive. If the thread does not complete within the timeout period, join() returns anyway.
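Because join() returns nothing to indicate whether it timed out, the caller has to check the thread afterward. This is a minimal sketch in Python 3 syntax, where the check is spelled is_alive(). The helper still_running_after_join() is a hypothetical name, and the daemon keyword argument assumes Python 3.3 or later.

```python
import threading
import time


def still_running_after_join(work_seconds, timeout):
    """Start a thread that sleeps, join with a timeout, report if it is alive.

    Hypothetical helper for illustration.
    """
    t = threading.Thread(target=time.sleep, args=(work_seconds,),
                         daemon=True)
    t.start()
    t.join(timeout)        # returns after at most `timeout` seconds
    return t.is_alive()    # True means the join timed out


if __name__ == '__main__':
    print('short timeout:', still_running_after_join(0.5, 0.05))
    print('long timeout :', still_running_after_join(0.01, 1.0))
```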
import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def daemon():
    logging.debug('Starting')
    time.sleep(2)
    logging.debug('Exiting')

d = threading.Thread(name='daemon', target=daemon)
d.setDaemon(True)

def non_daemon():
    logging.debug('Starting')
    logging.debug('Exiting')

t = threading.Thread(name='non-daemon', target=non_daemon)

d.start()
t.start()

d.join(1)
print 'd.isAlive()', d.isAlive()
t.join()

Since the timeout passed is less than the amount of time the daemon thread sleeps, the thread is still “alive” after join() returns.

$ python threading_daemon_join_timeout.py

(daemon    ) Starting
(non-daemon) Starting
(non-daemon) Exiting
d.isAlive() True

10.3.4 Enumerating All Threads

It is not necessary to retain an explicit handle to all the daemon threads to ensure they have completed before exiting the main process. enumerate() returns a list of active Thread instances. The list includes the current thread, and since joining the current thread introduces a deadlock situation, it must be skipped.

import random
import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def worker():
    """thread worker function"""
    t = threading.currentThread()
    pause = random.randint(1,5)
    logging.debug('sleeping %s', pause)
    time.sleep(pause)
    logging.debug('ending')
    return

for i in range(3):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    logging.debug('joining %s', t.getName())
    t.join()

Because the worker is sleeping for a random amount of time, the output from this program may vary.

$ python threading_enumerate.py

(Thread-1  ) sleeping 5
(Thread-2  ) sleeping 4
(Thread-3  ) sleeping 2
(MainThread) joining Thread-1
(Thread-3  ) ending
(Thread-2  ) ending
(Thread-1  ) ending
(MainThread) joining Thread-2
(MainThread) joining Thread-3

10.3.5 Subclassing Thread

At start-up, a Thread does some basic initialization and then calls its run() method, which calls the target function passed to the constructor. To create a subclass of Thread, override run() to do whatever is necessary.

import threading
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

class MyThread(threading.Thread):

    def run(self):
        logging.debug('running')
        return

for i in range(5):
    t = MyThread()
    t.start()

The return value of run() is ignored.

$ python threading_subclass.py

(Thread-1  ) running
(Thread-2  ) running
(Thread-3  ) running
(Thread-4  ) running
(Thread-5  ) running

Because the args and kwargs values passed to the Thread constructor are saved in private variables using names prefixed with '__', they are not easily accessed from a subclass. To pass arguments to a custom thread type, redefine the constructor to save the values in an instance attribute visible from the subclass.
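The idea of keeping the arguments on the instance can be sketched compactly. The following is a minimal illustration in Python 3 syntax. The class name ArgsThread and its result attribute are hypothetical and chosen only for this sketch. The constructor stores args and kwargs on the instance and forwards any remaining keyword options, such as name, to Thread unchanged.

```python
import threading


class ArgsThread(threading.Thread):
    """Hypothetical Thread subclass that keeps its arguments visible."""

    def __init__(self, args=(), kwargs=None, **thread_options):
        # Forward options such as name=... to the base class.
        super().__init__(**thread_options)
        self.args = args
        self.kwargs = kwargs if kwargs is not None else {}
        self.result = None

    def run(self):
        # The return value of run() is ignored, so keep results
        # on the instance where the caller can read them.
        self.result = (self.args, self.kwargs)


if __name__ == '__main__':
    t = ArgsThread(args=(1,), kwargs={'a': 'A'}, name='example')
    t.start()
    t.join()
    print(t.name, t.result)
```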
import threading
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

class MyThreadWithArgs(threading.Thread):

    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs=None, verbose=None):
        threading.Thread.__init__(self, group=group, target=target,
                                  name=name, verbose=verbose)
        self.args = args
        self.kwargs = kwargs
        return

    def run(self):
        logging.debug('running with %s and %s', self.args, self.kwargs)
        return

for i in range(5):
    t = MyThreadWithArgs(args=(i,), kwargs={'a':'A', 'b':'B'})
    t.start()

MyThreadWithArgs uses the same API as Thread, but another class could easily change the constructor method to take more or different arguments more directly related to the purpose of the thread, as with any other class.

$ python threading_subclass_args.py

(Thread-1  ) running with (0,) and {'a': 'A', 'b': 'B'}
(Thread-2  ) running with (1,) and {'a': 'A', 'b': 'B'}
(Thread-3  ) running with (2,) and {'a': 'A', 'b': 'B'}
(Thread-4  ) running with (3,) and {'a': 'A', 'b': 'B'}
(Thread-5  ) running with (4,) and {'a': 'A', 'b': 'B'}

10.3.6 Timer Threads

One example of a reason to subclass Thread is provided by Timer, also included in threading. A Timer starts its work after a delay and can be canceled at any point within that delay time period.

import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def delayed():
    logging.debug('worker running')
    return

t1 = threading.Timer(3, delayed)
t1.setName('t1')
t2 = threading.Timer(3, delayed)
t2.setName('t2')

logging.debug('starting timers')
t1.start()
t2.start()

logging.debug('waiting before canceling %s', t2.getName())
time.sleep(2)
logging.debug('canceling %s', t2.getName())
t2.cancel()
logging.debug('done')

The second timer is never run, and the first timer appears to run after the rest of the main program is done. Since it is not a daemon thread, it is joined implicitly when the main thread is done.

$ python threading_timer.py

(MainThread) starting timers
(MainThread) waiting before canceling t2
(MainThread) canceling t2
(MainThread) done
(t1        ) worker running

10.3.7 Signaling between Threads

Although the point of using multiple threads is to run separate operations concurrently, there are times when it is important to be able to synchronize the operations in two or more threads. Event objects are a simple way to communicate between threads safely. An Event manages an internal flag that callers can control with the set() and clear() methods. Other threads can use wait() to pause until the flag is set, effectively blocking progress until allowed to continue.

import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def wait_for_event(e):
    """Wait for the event to be set before doing anything"""
    logging.debug('wait_for_event starting')
    event_is_set = e.wait()
    logging.debug('event set: %s', event_is_set)

def wait_for_event_timeout(e, t):
    """Wait t seconds and then timeout"""
    while not e.isSet():
        logging.debug('wait_for_event_timeout starting')
        event_is_set = e.wait(t)
        logging.debug('event set: %s', event_is_set)
        if event_is_set:
            logging.debug('processing event')
        else:
            logging.debug('doing other work')

e = threading.Event()
t1 = threading.Thread(name='block',
                      target=wait_for_event,
                      args=(e,))
t1.start()

t2 = threading.Thread(name='nonblock',
                      target=wait_for_event_timeout,
                      args=(e, 2))
t2.start()

logging.debug('Waiting before calling Event.set()')
time.sleep(3)
e.set()
logging.debug('Event is set')

The wait() method takes an argument representing the number of seconds to wait for the event before timing out. It returns a Boolean indicating whether or not the event is set, so the caller knows why wait() returned.
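The two possible return values of wait() can be seen side by side in a small experiment. This is a minimal sketch in Python 3 syntax. The helper wait_result() is a hypothetical name, and it uses a Timer to call set() after a delay.

```python
import threading


def wait_result(set_after, timeout):
    """Set an Event after `set_after` seconds; wait up to `timeout` seconds.

    Hypothetical helper for illustration. Returns what wait() returned.
    """
    event = threading.Event()
    timer = threading.Timer(set_after, event.set)
    timer.daemon = True      # do not keep the program alive for the timer
    timer.start()
    was_set = event.wait(timeout)
    timer.cancel()           # harmless if the timer already fired
    return was_set


if __name__ == '__main__':
    print('set before timeout :', wait_result(0.01, 1.0))
    print('timeout before set :', wait_result(1.0, 0.05))
```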
The isSet() method can be used separately on the event without fear of blocking.

In this example, wait_for_event_timeout() checks the event status without blocking indefinitely. The wait_for_event() blocks on the call to wait(), which does not return until the event status changes.

$ python threading_event.py

(block     ) wait_for_event starting
(nonblock  ) wait_for_event_timeout starting
(MainThread) Waiting before calling Event.set()
(nonblock  ) event set: False
(nonblock  ) doing other work
(nonblock  ) wait_for_event_timeout starting
(MainThread) Event is set
(block     ) event set: True
(nonblock  ) event set: True
(nonblock  ) processing event

10.3.8 Controlling Access to Resources

In addition to synchronizing the operations of threads, it is also important to be able to control access to shared resources to prevent corruption or missed data. Python’s built-in data structures (lists, dictionaries, etc.) are thread-safe as a side effect of having atomic byte-codes for manipulating them (the GIL is not released in the middle of an update). Other data structures implemented in Python, or simpler types like integers and floats, do not have that protection. To guard against simultaneous access to an object, use a Lock object.
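One way to see why an update such as value = value + 1 is not protected is to look at the bytecode it compiles to. The sketch below is written in Python 3 syntax and assumes dis.get_instructions() is available (Python 3.4 and later). The class name PlainCounter and the helper instruction_names() are hypothetical. The exact instruction names differ between interpreter versions, and whether a thread switch can really occur between them also depends on the version, so treat the listing as an illustration, not a guarantee.

```python
import dis


class PlainCounter(object):
    """Hypothetical counter with no locking at all."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # Read the attribute, add one, then store the result back.
        self.value = self.value + 1


def instruction_names(func):
    """Return the names of the bytecode instructions compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == '__main__':
    # Typically shows separate load, add, and store steps.
    for name in instruction_names(PlainCounter.increment):
        print(name)
```

Since the read and the write are separate steps, two threads can both read the old value before either stores the new one. That is the lost update a Lock prevents.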
import logging
import random
import threading
import time

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

class Counter(object):

    def __init__(self, start=0):
        self.lock = threading.Lock()
        self.value = start

    def increment(self):
        logging.debug('Waiting for lock')
        self.lock.acquire()
        try:
            logging.debug('Acquired lock')
            self.value = self.value + 1
        finally:
            self.lock.release()

def worker(c):
    for i in range(2):
        pause = random.random()
        logging.debug('Sleeping %0.02f', pause)
        time.sleep(pause)
        c.increment()
    logging.debug('Done')

counter = Counter()
for i in range(2):
    t = threading.Thread(target=worker, args=(counter,))
    t.start()

logging.debug('Waiting for worker threads')
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is not main_thread:
        t.join()
logging.debug('Counter: %d', counter.value)

In this example, the worker() function increments a Counter instance, which manages a Lock to prevent two threads from changing its internal state at the same time. If the Lock were not used, a change to the value attribute could be missed.

$ python threading_lock.py

(Thread-1  ) Sleeping 0.94
(Thread-2  ) Sleeping 0.32
(MainThread) Waiting for worker threads
(Thread-2  ) Waiting for lock
(Thread-2  ) Acquired lock
(Thread-2  ) Sleeping 0.54
(Thread-1  ) Waiting for lock
(Thread-1  ) Acquired lock
(Thread-1  ) Sleeping 0.84
(Thread-2  ) Waiting for lock
(Thread-2  ) Acquired lock
(Thread-2  ) Done
(Thread-1  ) Waiting for lock
(Thread-1  ) Acquired lock
(Thread-1  ) Done
(MainThread) Counter: 4

To find out whether another thread has acquired the lock without holding up the current thread, pass False for the blocking argument to acquire(). In the next example, worker() tries to acquire the lock three separate times and counts how many attempts it has to make to do so. In the meantime, lock_holder() cycles between holding and releasing the lock, with short pauses in each state used to simulate load.

import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def lock_holder(lock):
    logging.debug('Starting')
    while True:
        lock.acquire()
        try:
            logging.debug('Holding')
            time.sleep(0.5)
        finally:
            logging.debug('Not holding')
            lock.release()
        time.sleep(0.5)
    return

def worker(lock):
    logging.debug('Starting')
    num_tries = 0
    num_acquires = 0
    while num_acquires < 3:
        time.sleep(0.5)
        logging.debug('Trying to acquire')
        have_it = lock.acquire(0)
        try:
            num_tries += 1
            if have_it:
                logging.debug('Iteration %d: Acquired', num_tries)
                num_acquires += 1
            else:
                logging.debug('Iteration %d: Not acquired', num_tries)
        finally:
            if have_it:
                lock.release()
    logging.debug('Done after %d iterations', num_tries)

lock = threading.Lock()

holder = threading.Thread(target=lock_holder,
                          args=(lock,),
                          name='LockHolder')
holder.setDaemon(True)
holder.start()

worker = threading.Thread(target=worker,
                          args=(lock,),
                          name='Worker')
worker.start()

It takes worker() more than three iterations to acquire the lock three separate times.

$ python threading_lock_noblock.py

(LockHolder) Starting
(LockHolder) Holding
(Worker    ) Starting
(LockHolder) Not holding
(Worker    ) Trying to acquire
(Worker    ) Iteration 1: Acquired
(LockHolder) Holding
(Worker    ) Trying to acquire
(Worker    ) Iteration 2: Not acquired
(LockHolder) Not holding
(Worker    ) Trying to acquire
(Worker    ) Iteration 3: Acquired
(LockHolder) Holding
(Worker    ) Trying to acquire
(Worker    ) Iteration 4: Not acquired
(LockHolder) Not holding
(Worker    ) Trying to acquire
(Worker    ) Iteration 5: Acquired
(Worker    ) Done after 5 iterations

Re-entrant Locks

Normal Lock objects cannot be acquired more than once, even by the same thread.
This limitation can introduce undesirable side effects if a lock is accessed by more than one function in the same call chain.

import threading

lock = threading.Lock()

print 'First try :', lock.acquire()
print 'Second try:', lock.acquire(0)

In this case, the second call to acquire() passes 0 as the blocking argument so that it returns immediately instead of blocking, because the lock has already been obtained by the first call.

$ python threading_lock_reacquire.py

First try : True
Second try: False

In a situation where separate code from the same thread needs to “reacquire” the lock, use an RLock instead.

import threading

lock = threading.RLock()

print 'First try :', lock.acquire()
print 'Second try:', lock.acquire(0)

The only change to the code from the previous example is substituting RLock for Lock.

$ python threading_rlock.py

First try : True
Second try: 1

Locks as Context Managers

Locks implement the context manager API and are compatible with the with statement. Using with removes the need to explicitly acquire and release the lock.

import threading
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def worker_with(lock):
    with lock:
        logging.debug('Lock acquired via with')

def worker_no_with(lock):
    lock.acquire()
    try:
        logging.debug('Lock acquired directly')
    finally:
        lock.release()

lock = threading.Lock()
w = threading.Thread(target=worker_with, args=(lock,))
nw = threading.Thread(target=worker_no_with, args=(lock,))

w.start()
nw.start()

The two functions worker_with() and worker_no_with() manage the lock in equivalent ways.

$ python threading_lock_with.py

(Thread-1  ) Lock acquired via with
(Thread-2  ) Lock acquired directly

10.3.9 Synchronizing Threads

In addition to using Events, another way of synchronizing threads is through using a Condition object.
Because the Condition uses a Lock, it can be tied to a shared resource, allowing multiple threads to wait for the resource to be updated. In this example, the consumer() threads wait for the Condition to be set before continuing. The producer() thread is responsible for setting the condition and notifying the other threads that they can continue.

    import logging
    import threading
    import time

    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s (%(threadName)-2s) %(message)s',
        )

    def consumer(cond):
        """wait for the condition and use the resource"""
        logging.debug('Starting consumer thread')
        t = threading.currentThread()
        with cond:
            cond.wait()
            logging.debug('Resource is available to consumer')

    def producer(cond):
        """set up the resource to be used by the consumer"""
        logging.debug('Starting producer thread')
        with cond:
            logging.debug('Making resource available')
            cond.notifyAll()

    condition = threading.Condition()
    c1 = threading.Thread(name='c1', target=consumer, args=(condition,))
    c2 = threading.Thread(name='c2', target=consumer, args=(condition,))
    p = threading.Thread(name='p', target=producer, args=(condition,))

    c1.start()
    time.sleep(2)
    c2.start()
    time.sleep(2)
    p.start()

The threads use with to acquire the lock associated with the Condition. Using the acquire() and release() methods explicitly also works.

    $ python threading_condition.py

    2010-11-15 09:24:53,544 (c1) Starting consumer thread
    2010-11-15 09:24:55,545 (c2) Starting consumer thread
    2010-11-15 09:24:57,546 (p ) Starting producer thread
    2010-11-15 09:24:57,546 (p ) Making resource available
    2010-11-15 09:24:57,547 (c2) Resource is available to consumer
    2010-11-15 09:24:57,547 (c1) Resource is available to consumer

10.3.10 Limiting Concurrent Access to Resources

It is sometimes useful to allow more than one worker access to a resource at a time, while still limiting the overall number. For example, a connection pool might support a fixed number of simultaneous connections, or a network application might support a fixed number of concurrent downloads. A Semaphore is one way to manage those connections.

    import logging
    import random
    import threading
    import time

    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s (%(threadName)-2s) %(message)s',
        )

    class ActivePool(object):
        def __init__(self):
            super(ActivePool, self).__init__()
            self.active = []
            self.lock = threading.Lock()
        def makeActive(self, name):
            with self.lock:
                self.active.append(name)
                logging.debug('Running: %s', self.active)
        def makeInactive(self, name):
            with self.lock:
                self.active.remove(name)
                logging.debug('Running: %s', self.active)

    def worker(s, pool):
        logging.debug('Waiting to join the pool')
        with s:
            name = threading.currentThread().getName()
            pool.makeActive(name)
            time.sleep(0.1)
            pool.makeInactive(name)

    pool = ActivePool()
    s = threading.Semaphore(2)
    for i in range(4):
        t = threading.Thread(target=worker, name=str(i), args=(s, pool))
        t.start()

In this example, the ActivePool class simply serves as a convenient way to track which threads are able to run at a given moment. A real resource pool would allocate a connection or some other value to the newly active thread and reclaim the value when the thread is done. Here, it is just used to hold the names of the active threads to show that, at most, two are running concurrently.
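The cap enforced by the semaphore can also be checked programmatically instead of being read off a log. The following is a minimal sketch in Python 3 syntax (the listings in this chapter are Python 2). The names run_workers, state, and the worker counts are illustrative assumptions and are not part of the original example.

```python
import threading
import time


def run_workers(limit, num_workers):
    """Run num_workers threads behind a Semaphore(limit).

    Returns the highest number of workers that were inside the
    guarded section at the same time.
    """
    semaphore = threading.Semaphore(limit)
    state_lock = threading.Lock()   # protects the two counters below
    state = {'active': 0, 'peak': 0}

    def worker():
        with semaphore:             # blocks once `limit` workers hold it
            with state_lock:
                state['active'] += 1
                state['peak'] = max(state['peak'], state['active'])
            time.sleep(0.05)        # pretend to use the limited resource
            with state_lock:
                state['active'] -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state['peak']


if __name__ == '__main__':
    print('peak concurrency:', run_workers(limit=2, num_workers=6))
```

The returned peak should never exceed the limit passed to Semaphore, which is the same property the log output of threading_semaphore.py demonstrates.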
    $ python threading_semaphore.py

    2010-11-15 09:24:57,618 (0 ) Waiting to join the pool
    2010-11-15 09:24:57,619 (0 ) Running: ['0']
    2010-11-15 09:24:57,619 (1 ) Waiting to join the pool
    2010-11-15 09:24:57,619 (1 ) Running: ['0', '1']
    2010-11-15 09:24:57,620 (2 ) Waiting to join the pool
    2010-11-15 09:24:57,620 (3 ) Waiting to join the pool
    2010-11-15 09:24:57,719 (0 ) Running: ['1']
    2010-11-15 09:24:57,720 (1 ) Running: []
    2010-11-15 09:24:57,721 (2 ) Running: ['2']
    2010-11-15 09:24:57,721 (3 ) Running: ['2', '3']
    2010-11-15 09:24:57,821 (2 ) Running: ['3']
    2010-11-15 09:24:57,822 (3 ) Running: []

10.3.11 Thread-Specific Data

While some resources need to be locked so multiple threads can use them, others need to be protected so that they are hidden from threads that do not “own” them. The local() class creates an object capable of hiding values from view in separate threads.

    import random
    import threading
    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format='(%(threadName)-10s) %(message)s',
                        )

    def show_value(data):
        try:
            val = data.value
        except AttributeError:
            logging.debug('No value yet')
        else:
            logging.debug('value=%s', val)

    def worker(data):
        show_value(data)
        data.value = random.randint(1, 100)
        show_value(data)

    local_data = threading.local()
    show_value(local_data)
    local_data.value = 1000
    show_value(local_data)

    for i in range(2):
        t = threading.Thread(target=worker, args=(local_data,))
        t.start()

The attribute local_data.value is not present for any thread until it is set in that thread.

    $ python threading_local.py

    (MainThread) No value yet
    (MainThread) value=1000
    (Thread-1  ) No value yet
    (Thread-1  ) value=71
    (Thread-2  ) No value yet
    (Thread-2  ) value=38

To initialize the settings so all threads start with the same value, use a subclass and set the attributes in __init__().
    import random
    import threading
    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format='(%(threadName)-10s) %(message)s',
                        )

    def show_value(data):
        try:
            val = data.value
        except AttributeError:
            logging.debug('No value yet')
        else:
            logging.debug('value=%s', val)

    def worker(data):
        show_value(data)
        data.value = random.randint(1, 100)
        show_value(data)

    class MyLocal(threading.local):
        def __init__(self, value):
            logging.debug('Initializing %r', self)
            self.value = value

    local_data = MyLocal(1000)
    show_value(local_data)

    for i in range(2):
        t = threading.Thread(target=worker, args=(local_data,))
        t.start()

__init__() is invoked on the same object (note the id() value), once in each thread to set the default values.

    $ python threading_local_defaults.py

    (MainThread) Initializing <__main__.MyLocal object at 0x100e16050>
    (MainThread) value=1000
    (Thread-1  ) Initializing <__main__.MyLocal object at 0x100e16050>
    (Thread-1  ) value=1000
    (Thread-1  ) value=19
    (Thread-2  ) Initializing <__main__.MyLocal object at 0x100e16050>
    (Thread-2  ) value=1000
    (Thread-2  ) value=55

See Also:

    threading (http://docs.python.org/lib/module-threading.html) Standard library documentation for this module.
    thread Lower-level thread API.
    multiprocessing (page 529) An API for working with processes; it mirrors the threading API.
    Queue (page 96) Thread-safe queue, useful for passing messages between threads.

10.4 multiprocessing—Manage Processes like Threads

    Purpose: Provides an API for managing processes.
    Python Version: 2.6 and later

The multiprocessing module includes an API for dividing up work between multiple processes based on the API for threading. In some cases, multiprocessing is a drop-in replacement and can be used instead of threading to take advantage of multiple CPU cores to avoid computational bottlenecks associated with Python’s global interpreter lock.
Due to the similarity, the first few examples here are modified from the threading examples. Features provided by multiprocessing but not available in threading are covered later.

10.4.1 Multiprocessing Basics

The simplest way to spawn a second process is to instantiate a Process object with a target function and call start() to let it begin working.

    import multiprocessing

    def worker():
        """worker function"""
        print 'Worker'
        return

    if __name__ == '__main__':
        jobs = []
        for i in range(5):
            p = multiprocessing.Process(target=worker)
            jobs.append(p)
            p.start()

The output includes the word “Worker” printed five times, although it may not come out entirely clean, depending on the order of execution, because each process is competing for access to the output stream.

    $ python multiprocessing_simple.py

    Worker
    Worker
    Worker
    Worker
    Worker

It is usually more useful to be able to spawn a process with arguments to tell it what work to do. Unlike with threading, in order to pass arguments to a multiprocessing Process, the arguments must be able to be serialized using pickle. This example passes each worker a number to be printed.

    import multiprocessing

    def worker(num):
        """thread worker function"""
        print 'Worker:', num
        return

    if __name__ == '__main__':
        jobs = []
        for i in range(5):
            p = multiprocessing.Process(target=worker, args=(i,))
            jobs.append(p)
            p.start()

The integer argument is now included in the message printed by each worker:

    $ python multiprocessing_simpleargs.py

    Worker: 0
    Worker: 1
    Worker: 4
    Worker: 2
    Worker: 3

10.4.2 Importable Target Functions

One difference between the threading and multiprocessing examples is the extra protection for __main__ used in the multiprocessing examples. Due to the way the new processes are started, the child process needs to be able to import the script containing the target function.
Wrapping the main part of the application in a check for __main__ ensures that it is not run recursively in each child as the module is imported. Another approach is to import the target function from a separate script. For example, multiprocessing_import_main.py uses a worker function defined in a second module.

    import multiprocessing
    import multiprocessing_import_worker

    if __name__ == '__main__':
        jobs = []
        for i in range(5):
            p = multiprocessing.Process(
                target=multiprocessing_import_worker.worker,
                )
            jobs.append(p)
            p.start()

The worker function is defined in multiprocessing_import_worker.py.

    def worker():
        """worker function"""
        print 'Worker'
        return

Calling the main program produces output similar to the first example.

    $ python multiprocessing_import_main.py

    Worker
    Worker
    Worker
    Worker
    Worker

10.4.3 Determining the Current Process

Passing arguments to identify or name the process is cumbersome and unnecessary. Each Process instance has a name with a default value that can be changed as the process is created. Naming processes is useful for keeping track of them, especially in applications with multiple types of processes running simultaneously.

    import multiprocessing
    import time

    def worker():
        name = multiprocessing.current_process().name
        print name, 'Starting'
        time.sleep(2)
        print name, 'Exiting'

    def my_service():
        name = multiprocessing.current_process().name
        print name, 'Starting'
        time.sleep(3)
        print name, 'Exiting'

    if __name__ == '__main__':
        service = multiprocessing.Process(name='my_service',
                                          target=my_service)
        worker_1 = multiprocessing.Process(name='worker 1',
                                           target=worker)
        worker_2 = multiprocessing.Process(target=worker) # default name

        worker_1.start()
        worker_2.start()
        service.start()

The output includes the name of the current process on each line. The lines with Process-3 in the name column correspond to the unnamed process worker_2.
    $ python multiprocessing_names.py

    worker 1 Starting
    worker 1 Exiting
    Process-3 Starting
    Process-3 Exiting
    my_service Starting
    my_service Exiting

10.4.4 Daemon Processes

By default, the main program will not exit until all the children have exited. There are times when starting a background process that runs without blocking the main program from exiting is useful, such as in services where there may not be an easy way to interrupt the worker or where letting it die in the middle of its work does not lose or corrupt data (for example, a task that generates “heartbeats” for a service monitoring tool).

To mark a process as a daemon, set its daemon attribute to True. The default is for processes to not be daemons.

    import multiprocessing
    import time
    import sys

    def daemon():
        p = multiprocessing.current_process()
        print 'Starting:', p.name, p.pid
        sys.stdout.flush()
        time.sleep(2)
        print 'Exiting :', p.name, p.pid
        sys.stdout.flush()

    def non_daemon():
        p = multiprocessing.current_process()
        print 'Starting:', p.name, p.pid
        sys.stdout.flush()
        print 'Exiting :', p.name, p.pid
        sys.stdout.flush()

    if __name__ == '__main__':
        d = multiprocessing.Process(name='daemon', target=daemon)
        d.daemon = True

        n = multiprocessing.Process(name='non-daemon', target=non_daemon)
        n.daemon = False

        d.start()
        time.sleep(1)
        n.start()

The output does not include the “Exiting” message from the daemon process, since all non-daemon processes (including the main program) exit before the daemon process wakes up from its two-second sleep.

    $ python multiprocessing_daemon.py

    Starting: daemon 9842
    Starting: non-daemon 9843
    Exiting : non-daemon 9843

The daemon process is terminated automatically before the main program exits, which avoids leaving orphaned processes running. This can be verified by looking for the process id value printed when the program runs and then checking for that process with a command like ps.
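As a sketch of the “heartbeat” scenario mentioned in this section, the following code marks a process as a daemon at construction time. It uses Python 3 syntax (the chapter's listings are Python 2) and assumes the daemon keyword argument to Process, which is available in Python 3.3 and later. The names heartbeat and make_heartbeat_process and the interval values are hypothetical, chosen only for illustration.

```python
import multiprocessing
import time


def heartbeat(interval):
    """Emit a heartbeat message forever; never returns on its own."""
    while True:
        print('heartbeat', flush=True)
        time.sleep(interval)


def make_heartbeat_process(interval=0.1):
    """Build, but do not start, a daemon process running heartbeat()."""
    return multiprocessing.Process(
        name='heartbeat',
        target=heartbeat,
        args=(interval,),
        daemon=True,    # same effect as setting p.daemon = True before start()
    )


if __name__ == '__main__':
    p = make_heartbeat_process()
    p.start()
    time.sleep(0.35)    # the main program would do its real work here
    # No join() and no terminate(): when the main program exits, the
    # daemon process is terminated automatically.
    print('main exiting; heartbeat alive:', p.is_alive())
```

Because the worker loops forever, this design only makes sense for a daemon: a non-daemon process running the same function would keep the main program from exiting.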
10.4.5 Waiting for Processes

To wait until a process has completed its work and exited, use the join() method.

    import multiprocessing
    import time
    import sys

    def daemon():
        name = multiprocessing.current_process().name
        print 'Starting:', name
        time.sleep(2)
        print 'Exiting :', name

    def non_daemon():
        name = multiprocessing.current_process().name
        print 'Starting:', name
        print 'Exiting :', name

    if __name__ == '__main__':
        d = multiprocessing.Process(name='daemon',
                                    target=daemon)
        d.daemon = True

        n = multiprocessing.Process(name='non-daemon',
                                    target=non_daemon)
        n.daemon = False

        d.start()
        time.sleep(1)
        n.start()

        d.join()
        n.join()

Since the main process waits for the daemon to exit using join(), the “Exiting” message is printed this time.

    $ python multiprocessing_daemon_join.py

    Starting: non-daemon
    Exiting : non-daemon
    Starting: daemon
    Exiting : daemon

By default, join() blocks indefinitely. It is also possible to pass a timeout argument (a float representing the number of seconds to wait for the process to become inactive). If the process does not complete within the timeout period, join() returns anyway.

    import multiprocessing
    import time
    import sys

    def daemon():
        name = multiprocessing.current_process().name
        print 'Starting:', name
        time.sleep(2)
        print 'Exiting :', name

    def non_daemon():
        name = multiprocessing.current_process().name
        print 'Starting:', name
        print 'Exiting :', name

    if __name__ == '__main__':
        d = multiprocessing.Process(name='daemon',
                                    target=daemon)
        d.daemon = True

        n = multiprocessing.Process(name='non-daemon',
                                    target=non_daemon)
        n.daemon = False

        d.start()
        n.start()

        d.join(1)
        print 'd.is_alive()', d.is_alive()
        n.join()

Since the timeout passed is less than the amount of time the daemon sleeps, the process is still “alive” after join() returns.
    $ python multiprocessing_daemon_join_timeout.py

    Starting: non-daemon
    Exiting : non-daemon
    d.is_alive() True

10.4.6 Terminating Processes

Although it is better to use the poison pill method of signaling to a process that it should exit (see Passing Messages to Processes, later in this chapter), if a process appears hung or deadlocked, it can be useful to be able to kill it forcibly. Calling terminate() on a process object kills the child process.

    import multiprocessing
    import time

    def slow_worker():
        print 'Starting worker'
        time.sleep(0.1)
        print 'Finished worker'

    if __name__ == '__main__':
        p = multiprocessing.Process(target=slow_worker)
        print 'BEFORE:', p, p.is_alive()

        p.start()
        print 'DURING:', p, p.is_alive()

        p.terminate()
        print 'TERMINATED:', p, p.is_alive()

        p.join()
        print 'JOINED:', p, p.is_alive()

Note: It is important to join() the process after terminating it in order to give the process management code time to update the status of the object to reflect the termination.

    $ python multiprocessing_terminate.py

    BEFORE: False
    DURING: True
    TERMINATED: True
    JOINED: False

10.4.7 Process Exit Status

The status code produced when the process exits can be accessed via the exitcode attribute. The ranges allowed are listed in Table 10.1.

Table 10.1. Multiprocessing Exit Codes

    Exit Code   Meaning
    == 0        No error was produced.
    > 0         The process had an error, and exited with that code.
    < 0         The process was killed with a signal of -1 * exitcode.
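The three ranges in Table 10.1 can be turned into a small reporting helper. This is a sketch in Python 3 syntax, not code from the original text. The name describe_exitcode is hypothetical, and the sketch assumes the signal.Signals enumeration (Python 3.5 and later) is available for looking up signal names.

```python
import signal


def describe_exitcode(exitcode):
    """Translate a Process.exitcode value into a readable description."""
    if exitcode is None:
        # multiprocessing reports None until the process has terminated.
        return 'still running (or never started)'
    if exitcode == 0:
        return 'no error'
    if exitcode > 0:
        return 'exited with error code %d' % exitcode
    # Negative values mean "killed by signal number -exitcode".
    signum = -exitcode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = 'unknown signal'
    return 'killed by signal %d (%s)' % (signum, name)


if __name__ == '__main__':
    for code in (None, 0, 1, -signal.SIGTERM):
        print(repr(code), '->', describe_exitcode(code))
```

A helper like this could be applied to the exitcode of each joined process in place of printing the raw number.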
    import multiprocessing
    import sys
    import time

    def exit_error():
        sys.exit(1)

    def exit_ok():
        return

    def return_value():
        return 1

    def raises():
        raise RuntimeError('There was an error!')

    def terminated():
        time.sleep(3)

    if __name__ == '__main__':
        jobs = []
        for f in [exit_error, exit_ok, return_value, raises, terminated]:
            print 'Starting process for', f.func_name
            j = multiprocessing.Process(target=f, name=f.func_name)
            jobs.append(j)
            j.start()

        jobs[-1].terminate()

        for j in jobs:
            j.join()
            print '%15s.exitcode = %s' % (j.name, j.exitcode)

Processes that raise an exception automatically get an exitcode of 1.

    $ python multiprocessing_exitcode.py

    Starting process for exit_error
    Starting process for exit_ok
    Starting process for return_value
    Starting process for raises
    Starting process for terminated
    Process raises:
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap
        self.run()
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 88, in run
        self._target(*self._args, **self._kwargs)
      File "multiprocessing_exitcode.py", line 24, in raises
        raise RuntimeError('There was an error!')
    RuntimeError: There was an error!
         exit_error.exitcode = 1
            exit_ok.exitcode = 0
       return_value.exitcode = 0
             raises.exitcode = 1
         terminated.exitcode = -15

10.4.8 Logging

When debugging concurrency issues, it can be useful to have access to the internals of the objects provided by multiprocessing. There is a convenient module-level function to enable logging called log_to_stderr(). It sets up a logger object using logging and adds a handler so that log messages are sent to the standard error channel.
    import multiprocessing
    import logging
    import sys

    def worker():
        print 'Doing some work'
        sys.stdout.flush()

    if __name__ == '__main__':
        multiprocessing.log_to_stderr(logging.DEBUG)
        p = multiprocessing.Process(target=worker)
        p.start()
        p.join()

By default, the logging level is set to NOTSET so no messages are produced. Pass a different level to initialize the logger to the level of detail desired.

    $ python multiprocessing_log_to_stderr.py

    [INFO/Process-1] child process calling self.run()
    Doing some work
    [INFO/Process-1] process shutting down
    [DEBUG/Process-1] running all "atexit" finalizers with priority >= 0
    [DEBUG/Process-1] running the remaining "atexit" finalizers
    [INFO/Process-1] process exiting with exitcode 0
    [INFO/MainProcess] process shutting down
    [DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
    [DEBUG/MainProcess] running the remaining "atexit" finalizers

To manipulate the logger directly (change its level setting or add handlers), use get_logger().

    import multiprocessing
    import logging
    import sys

    def worker():
        print 'Doing some work'
        sys.stdout.flush()

    if __name__ == '__main__':
        multiprocessing.log_to_stderr()
        logger = multiprocessing.get_logger()
        logger.setLevel(logging.INFO)
        p = multiprocessing.Process(target=worker)
        p.start()
        p.join()

The logger can also be configured through the logging configuration file API, using the name multiprocessing.

    $ python multiprocessing_get_logger.py

    [INFO/Process-1] child process calling self.run()
    Doing some work
    [INFO/Process-1] process shutting down
    [INFO/Process-1] process exiting with exitcode 0
    [INFO/MainProcess] process shutting down

10.4.9 Subclassing Process

Although the simplest way to start a job in a separate process is to use Process and pass a target function, it is also possible to use a custom subclass.
    import multiprocessing

    class Worker(multiprocessing.Process):

        def run(self):
            print 'In %s' % self.name
            return

    if __name__ == '__main__':
        jobs = []
        for i in range(5):
            p = Worker()
            jobs.append(p)
            p.start()
        for j in jobs:
            j.join()

The derived class should override run() to do its work.

    $ python multiprocessing_subclass.py

    In Worker-1
    In Worker-2
    In Worker-3
    In Worker-4
    In Worker-5

10.4.10 Passing Messages to Processes

As with threads, a commonly used pattern for multiple processes is to divide a job up among several workers to run in parallel. Effective use of multiple processes usually requires some communication between them, so that work can be divided and results can be aggregated. A simple way to communicate between processes with multiprocessing is to use a Queue to pass messages back and forth. Any object that can be serialized with pickle can pass through a Queue.

    import multiprocessing

    class MyFancyClass(object):

        def __init__(self, name):
            self.name = name

        def do_something(self):
            proc_name = multiprocessing.current_process().name
            print 'Doing something fancy in %s for %s!' % \
                (proc_name, self.name)

    def worker(q):
        obj = q.get()
        obj.do_something()

    if __name__ == '__main__':
        queue = multiprocessing.Queue()

        p = multiprocessing.Process(target=worker, args=(queue,))
        p.start()

        queue.put(MyFancyClass('Fancy Dan'))

        # Wait for the worker to finish
        queue.close()
        queue.join_thread()
        p.join()

This short example passes only a single message to a single worker, and then the main process waits for the worker to finish.

    $ python multiprocessing_queue.py

    Doing something fancy in Process-1 for Fancy Dan!

A more complex example shows how to manage several workers consuming data from a JoinableQueue and passing results back to the parent process. The poison pill technique is used to stop the workers.
After setting up the real tasks, the main program adds one “stop” value per worker to the job queue. When a worker encounters the special value, it breaks out of its processing loop. The main process uses the task queue’s join() method to wait for all the tasks to finish before processing the results.

    import multiprocessing
    import time

    class Consumer(multiprocessing.Process):

        def __init__(self, task_queue, result_queue):
            multiprocessing.Process.__init__(self)
            self.task_queue = task_queue
            self.result_queue = result_queue

        def run(self):
            proc_name = self.name
            while True:
                next_task = self.task_queue.get()
                if next_task is None:
                    # Poison pill means shutdown
                    print '%s: Exiting' % proc_name
                    self.task_queue.task_done()
                    break
                print '%s: %s' % (proc_name, next_task)
                answer = next_task()
                self.task_queue.task_done()
                self.result_queue.put(answer)
            return

    class Task(object):
        def __init__(self, a, b):
            self.a = a
            self.b = b
        def __call__(self):
            time.sleep(0.1) # pretend to take some time to do the work
            return '%s * %s = %s' % (self.a, self.b, self.a * self.b)
        def __str__(self):
            return '%s * %s' % (self.a, self.b)

    if __name__ == '__main__':
        # Establish communication queues
        tasks = multiprocessing.JoinableQueue()
        results = multiprocessing.Queue()

        # Start consumers
        num_consumers = multiprocessing.cpu_count() * 2
        print 'Creating %d consumers' % num_consumers
        consumers = [ Consumer(tasks, results)
                      for i in xrange(num_consumers) ]
        for w in consumers:
            w.start()

        # Enqueue jobs
        num_jobs = 10
        for i in xrange(num_jobs):
            tasks.put(Task(i, i))

        # Add a poison pill for each consumer
        for i in xrange(num_consumers):
            tasks.put(None)

        # Wait for all the tasks to finish
        tasks.join()

        # Start printing results
        while num_jobs:
            result = results.get()
            print 'Result:', result
            num_jobs -= 1

Although the jobs enter the queue in order, their execution is parallelized so there is no guarantee about the order in which they will be completed.

    $ python -u multiprocessing_producer_consumer.py

    Creating 4 consumers
    Consumer-1: 0 * 0
    Consumer-2: 1 * 1
    Consumer-3: 2 * 2
    Consumer-4: 3 * 3
    Consumer-4: 4 * 4
    Consumer-1: 5 * 5
    Consumer-3: 6 * 6
    Consumer-2: 7 * 7
    Consumer-1: 8 * 8
    Consumer-4: 9 * 9
    Consumer-3: Exiting
    Consumer-2: Exiting
    Consumer-1: Exiting
    Consumer-4: Exiting
    Result: 0 * 0 = 0
    Result: 3 * 3 = 9
    Result: 2 * 2 = 4
    Result: 1 * 1 = 1
    Result: 5 * 5 = 25
    Result: 4 * 4 = 16
    Result: 6 * 6 = 36
    Result: 7 * 7 = 49
    Result: 9 * 9 = 81
    Result: 8 * 8 = 64

10.4.11 Signaling between Processes

The Event class provides a simple way to communicate state information between processes. An event can be toggled between set and unset states. Users of the event object can wait for it to change from unset to set, using an optional timeout value.

    import multiprocessing
    import time

    def wait_for_event(e):
        """Wait for the event to be set before doing anything"""
        print 'wait_for_event: starting'
        e.wait()
        print 'wait_for_event: e.is_set()->', e.is_set()

    def wait_for_event_timeout(e, t):
        """Wait t seconds and then timeout"""
        print 'wait_for_event_timeout: starting'
        e.wait(t)
        print 'wait_for_event_timeout: e.is_set()->', e.is_set()

    if __name__ == '__main__':
        e = multiprocessing.Event()
        w1 = multiprocessing.Process(name='block',
                                     target=wait_for_event,
                                     args=(e,))
        w1.start()

        w2 = multiprocessing.Process(name='nonblock',
                                     target=wait_for_event_timeout,
                                     args=(e, 2))
        w2.start()

        print 'main: waiting before calling Event.set()'
        time.sleep(3)
        e.set()
        print 'main: event is set'

When wait() times out it returns without an error. The caller is responsible for checking the state of the event using is_set().
    $ python -u multiprocessing_event.py

    main: waiting before calling Event.set()
    wait_for_event: starting
    wait_for_event_timeout: starting
    wait_for_event_timeout: e.is_set()-> False
    main: event is set
    wait_for_event: e.is_set()-> True

10.4.12 Controlling Access to Resources

In situations when a single resource needs to be shared between multiple processes, a Lock can be used to avoid conflicting accesses.

    import multiprocessing
    import sys

    def worker_with(lock, stream):
        with lock:
            stream.write('Lock acquired via with\n')

    def worker_no_with(lock, stream):
        lock.acquire()
        try:
            stream.write('Lock acquired directly\n')
        finally:
            lock.release()

    lock = multiprocessing.Lock()
    w = multiprocessing.Process(target=worker_with,
                                args=(lock, sys.stdout))
    nw = multiprocessing.Process(target=worker_no_with,
                                 args=(lock, sys.stdout))

    w.start()
    nw.start()

    w.join()
    nw.join()

In this example, the messages printed to the console may be jumbled together if the two processes do not synchronize their access of the output stream with the lock.

    $ python multiprocessing_lock.py

    Lock acquired via with
    Lock acquired directly

10.4.13 Synchronizing Operations

Condition objects can be used to synchronize parts of a workflow so that some run in parallel but others run sequentially, even if they are in separate processes.
    import multiprocessing
    import time

    def stage_1(cond):
        """perform first stage of work,
        then notify stage_2 to continue
        """
        name = multiprocessing.current_process().name
        print 'Starting', name
        with cond:
            print '%s done and ready for stage 2' % name
            cond.notify_all()

    def stage_2(cond):
        """wait for the condition telling us stage_1 is done"""
        name = multiprocessing.current_process().name
        print 'Starting', name
        with cond:
            cond.wait()
            print '%s running' % name

    if __name__ == '__main__':
        condition = multiprocessing.Condition()
        s1 = multiprocessing.Process(name='s1',
                                     target=stage_1,
                                     args=(condition,))
        s2_clients = [
            multiprocessing.Process(name='stage_2[%d]' % i,
                                    target=stage_2,
                                    args=(condition,))
            for i in range(1, 3)
            ]

        for c in s2_clients:
            c.start()
            time.sleep(1)
        s1.start()

        s1.join()
        for c in s2_clients:
            c.join()

In this example, two processes run the second stage of a job in parallel, but only after the first stage is done.

    $ python multiprocessing_condition.py

    Starting s1
    s1 done and ready for stage 2
    Starting stage_2[1]
    stage_2[1] running
    Starting stage_2[2]
    stage_2[2] running

10.4.14 Controlling Concurrent Access to Resources

It may be useful to allow more than one worker access to a resource at a time, while still limiting the overall number. For example, a connection pool might support a fixed number of simultaneous connections, or a network application might support a fixed number of concurrent downloads. A Semaphore is one way to manage those connections.

    import random
    import multiprocessing
    import time

    class ActivePool(object):
        def __init__(self):
            super(ActivePool, self).__init__()
            self.mgr = multiprocessing.Manager()
            self.active = self.mgr.list()
            self.lock = multiprocessing.Lock()
        def makeActive(self, name):
            with self.lock:
                self.active.append(name)
        def makeInactive(self, name):
            with self.lock:
                self.active.remove(name)
        def __str__(self):
            with self.lock:
                return str(self.active)

    def worker(s, pool):
        name = multiprocessing.current_process().name
        with s:
            pool.makeActive(name)
            print 'Now running: %s' % str(pool)
            time.sleep(random.random())
            pool.makeInactive(name)

    if __name__ == '__main__':
        pool = ActivePool()
        s = multiprocessing.Semaphore(3)
        jobs = [
            multiprocessing.Process(target=worker, name=str(i),
                                    args=(s, pool),
                                    )
            for i in range(10)
            ]

        for j in jobs:
            j.start()

        for j in jobs:
            j.join()
            print 'Now running: %s' % str(pool)

In this example, the ActivePool class simply serves as a convenient way to track which processes are running at a given moment. A real resource pool would probably allocate a connection or some other value to the newly active process and reclaim the value when the task is done. Here, the pool is just used to hold the names of the active processes to show that only three are running concurrently.

    $ python multiprocessing_semaphore.py

    Now running: ['0', '1', '3']
    Now running: ['0', '1', '3']
    Now running: ['3', '2', '5']
    Now running: ['0', '1', '3']
    Now running: ['1', '3', '2']
    Now running: ['2', '6', '7']
    Now running: ['3', '2', '6']
    Now running: ['6', '4', '8']
    Now running: ['4', '8', '9']
    Now running: ['6', '7', '4']
    Now running: ['1', '3', '2']
    Now running: ['3', '2', '5']
    Now running: ['6', '7', '4']
    Now running: ['6', '7', '4']
    Now running: []
    Now running: []
    Now running: []
    Now running: []
    Now running: []
    Now running: []

10.4.15 Managing Shared State

In the previous example, the list of active processes is maintained centrally in the ActivePool instance via a special type of list object created by a Manager. The Manager is responsible for coordinating shared information state between all of its users.
    import multiprocessing
    import pprint

    def worker(d, key, value):
        d[key] = value

    if __name__ == '__main__':
        mgr = multiprocessing.Manager()
        d = mgr.dict()
        jobs = [ multiprocessing.Process(target=worker, args=(d, i, i*2))
                 for i in range(10)
                 ]
        for j in jobs:
            j.start()
        for j in jobs:
            j.join()
        print 'Results:', d

By creating the dictionary through the manager, it is shared and updates are seen in all processes. Lists are also supported, as the previous example showed.

    $ python multiprocessing_manager_dict.py

    Results: {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10, 6: 12, 7: 14, 8: 16, 9: 18}

10.4.16 Shared Namespaces

In addition to dictionaries and lists, a Manager can create a shared Namespace.

    import multiprocessing

    def producer(ns, event):
        ns.value = 'This is the value'
        event.set()

    def consumer(ns, event):
        try:
            value = ns.value
        except Exception, err:
            print 'Before event, error:', str(err)
        event.wait()
        print 'After event:', ns.value

    if __name__ == '__main__':
        mgr = multiprocessing.Manager()
        namespace = mgr.Namespace()
        event = multiprocessing.Event()
        p = multiprocessing.Process(target=producer,
                                    args=(namespace, event))
        c = multiprocessing.Process(target=consumer,
                                    args=(namespace, event))

        c.start()
        p.start()

        c.join()
        p.join()

Any named value added to the Namespace is visible to all clients that receive the Namespace instance.

    $ python multiprocessing_namespaces.py

    Before event, error: 'Namespace' object has no attribute 'value'
    After event: This is the value

It is important to know that updates to the contents of mutable values in the namespace are not propagated automatically.

    import multiprocessing

    def producer(ns, event):
        # DOES NOT UPDATE GLOBAL VALUE!
        ns.my_list.append('This is the value')
        event.set()

    def consumer(ns, event):
        print 'Before event:', ns.my_list
        event.wait()
        print 'After event :', ns.my_list

    if __name__ == '__main__':
        mgr = multiprocessing.Manager()
        namespace = mgr.Namespace()
        namespace.my_list = []

        event = multiprocessing.Event()
        p = multiprocessing.Process(target=producer,
                                    args=(namespace, event))
        c = multiprocessing.Process(target=consumer,
                                    args=(namespace, event))

        c.start()
        p.start()

        c.join()
        p.join()

To update the list, attach it to the namespace object again.

    $ python multiprocessing_namespaces_mutable.py

    Before event: []
    After event : []

10.4.17 Process Pools

The Pool class can be used to manage a fixed number of workers for simple cases where the work to be done can be broken up and distributed between workers independently. The return values from the jobs are collected and returned as a list. The pool arguments include the number of processes and a function to run when starting the task process (invoked once per child).

    import multiprocessing

    def do_calculation(data):
        return data * 2

    def start_process():
        print 'Starting', multiprocessing.current_process().name

    if __name__ == '__main__':
        inputs = list(range(10))
        print 'Input :', inputs

        builtin_outputs = map(do_calculation, inputs)
        print 'Built-in:', builtin_outputs

        pool_size = multiprocessing.cpu_count() * 2
        pool = multiprocessing.Pool(processes=pool_size,
                                    initializer=start_process,
                                    )
        pool_outputs = pool.map(do_calculation, inputs)
        pool.close() # no more tasks
        pool.join() # wrap up current tasks

        print 'Pool :', pool_outputs

The result of the map() method is functionally equivalent to the built-in map(), except that individual tasks run in parallel. Since the pool is processing its inputs in parallel, close() and join() can be used to synchronize the main process with the task processes to ensure proper cleanup.
    $ python multiprocessing_pool.py

    Input   : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    Built-in: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
    Starting PoolWorker-3
    Starting PoolWorker-1
    Starting PoolWorker-4
    Starting PoolWorker-2
    Pool    : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

By default, Pool creates a fixed number of worker processes and passes jobs to them until there are no more jobs. Setting the maxtasksperchild parameter tells the pool to restart a worker process after it has finished a few tasks, preventing long-running workers from consuming ever more system resources.

    import multiprocessing

    def do_calculation(data):
        return data * 2

    def start_process():
        print 'Starting', multiprocessing.current_process().name

    if __name__ == '__main__':
        inputs = list(range(10))
        print 'Input   :', inputs

        builtin_outputs = map(do_calculation, inputs)
        print 'Built-in:', builtin_outputs

        pool_size = multiprocessing.cpu_count() * 2
        pool = multiprocessing.Pool(processes=pool_size,
                                    initializer=start_process,
                                    maxtasksperchild=2,
                                    )
        pool_outputs = pool.map(do_calculation, inputs)
        pool.close() # no more tasks
        pool.join()  # wrap up current tasks

        print 'Pool    :', pool_outputs

The pool restarts the workers when they have completed their allotted tasks, even if there is no more work. In this output, eight workers are created, even though there are only ten tasks and each worker can complete two of them before it is restarted.

    $ python multiprocessing_pool_maxtasksperchild.py

    Input   : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    Built-in: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
    Starting PoolWorker-1
    Starting PoolWorker-2
    Starting PoolWorker-3
    Starting PoolWorker-4
    Starting PoolWorker-5
    Starting PoolWorker-6
    Starting PoolWorker-7
    Starting PoolWorker-8
    Pool    : [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

10.4.18 Implementing MapReduce

The Pool class can be used to create a simple single-server MapReduce implementation.
Although it does not give the full benefits of distributed processing, it does illustrate how easy it is to break down some problems into distributable units of work.

In a MapReduce-based system, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all related values are together. Finally, the partitioned data is reduced to a result set.

    import collections
    import itertools
    import multiprocessing

    class SimpleMapReduce(object):

        def __init__(self, map_func, reduce_func, num_workers=None):
            """
            map_func

              Function to map inputs to intermediate data. Takes as
              argument one input value and returns a tuple with the
              key and a value to be reduced.

            reduce_func

              Function to reduce partitioned version of intermediate
              data to final output. Takes as argument a key as
              produced by map_func and a sequence of the values
              associated with that key.

            num_workers

              The number of workers to create in the pool. Defaults
              to the number of CPUs available on the current host.
            """
            self.map_func = map_func
            self.reduce_func = reduce_func
            self.pool = multiprocessing.Pool(num_workers)

        def partition(self, mapped_values):
            """Organize the mapped values by their key.
            Returns an unsorted sequence of tuples with a key
            and a sequence of values.
            """
            partitioned_data = collections.defaultdict(list)
            for key, value in mapped_values:
                partitioned_data[key].append(value)
            return partitioned_data.items()

        def __call__(self, inputs, chunksize=1):
            """Process the inputs through the map and reduce functions
            given.

            inputs
              An iterable containing the input data to be processed.

            chunksize=1
              The portion of the input data to hand to each worker.
              This can be used to tune performance during the mapping
              phase.
            """
            map_responses = self.pool.map(self.map_func,
                                          inputs,
                                          chunksize=chunksize)
            partitioned_data = self.partition(
                itertools.chain(*map_responses)
                )
            reduced_values = self.pool.map(self.reduce_func,
                                           partitioned_data)
            return reduced_values

The following example script uses SimpleMapReduce to count the “words” in the reStructuredText source for this article, ignoring some of the markup.

    import multiprocessing
    import string

    from multiprocessing_mapreduce import SimpleMapReduce

    def file_to_words(filename):
        """Read a file and return a sequence of
        (word, occurrences) values.
        """
        STOP_WORDS = set([
                'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if',
                'in', 'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the',
                'to', 'with',
                ])
        # Map every punctuation character to a space; both arguments
        # to maketrans() must have the same length.
        TR = string.maketrans(string.punctuation,
                              ' ' * len(string.punctuation))

        print multiprocessing.current_process().name, 'reading', filename
        output = []

        with open(filename, 'rt') as f:
            for line in f:
                if line.lstrip().startswith('..'): # Skip comment lines
                    continue
                line = line.translate(TR) # Strip punctuation
                for word in line.split():
                    word = word.lower()
                    if word.isalpha() and word not in STOP_WORDS:
                        output.append( (word, 1) )
        return output

    def count_words(item):
        """Convert the partitioned data for a word to a
        tuple containing the word and the number of occurrences.
        """
        word, occurrences = item
        return (word, sum(occurrences))

    if __name__ == '__main__':
        import operator
        import glob

        input_files = glob.glob('*.rst')

        mapper = SimpleMapReduce(file_to_words, count_words)
        word_counts = mapper(input_files)
        word_counts.sort(key=operator.itemgetter(1))
        word_counts.reverse()

        print '\nTOP 20 WORDS BY FREQUENCY\n'
        top20 = word_counts[:20]
        longest = max(len(word) for word, count in top20)
        for word, count in top20:
            print '%-*s: %5s' % (longest+1, word, count)

The file_to_words() function converts each input file to a sequence of tuples containing the word and the number 1 (representing a single occurrence).
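
The flow of data through the map, partition, and reduce phases can be traced with a small single-process sketch. It is not part of the original listings: it is written in Python 3 syntax (the listings in this chapter target Python 2), it substitutes the built-in map() for Pool.map() so no worker processes are involved, and map_words, reduce_counts, and word_count are hypothetical helper names chosen for illustration.

```python
import collections
import itertools


def map_words(text):
    """Map phase: emit a (word, 1) pair for every word in one input."""
    return [(word.lower(), 1) for word in text.split()]


def partition(mapped_values):
    """Partition phase: collect the values emitted for each key."""
    grouped = collections.defaultdict(list)
    for key, value in mapped_values:
        grouped[key].append(value)
    return list(grouped.items())


def reduce_counts(item):
    """Reduce phase: collapse one key's list of 1s into a total."""
    word, occurrences = item
    return (word, sum(occurrences))


def word_count(inputs):
    # The built-in map() stands in for Pool.map(); the results are the
    # same, only the parallelism is missing.
    mapped = map(map_words, inputs)
    grouped = partition(itertools.chain.from_iterable(mapped))
    return dict(map(reduce_counts, grouped))


if __name__ == '__main__':
    docs = ['the pool maps the work', 'the pool reduces']
    print(word_count(docs))
    # {'the': 3, 'pool': 2, 'maps': 1, 'work': 1, 'reduces': 1}
```

Because each phase is an ordinary function of its input, swapping the two map() calls for a pool's map() method is what turns this sketch into the parallel version.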
The data is divided up by partition() using the word as the key, so the resulting structure consists of a key and a sequence of 1 values representing each occurrence of the word. The partitioned data is converted to a set of tuples containing a word and the count for that word by count_words() during the reduction phase.

    $ python multiprocessing_wordcount.py

    PoolWorker-1 reading basics.rst
    PoolWorker-1 reading index.rst
    PoolWorker-2 reading communication.rst
    PoolWorker-2 reading mapreduce.rst

    TOP 20 WORDS BY FREQUENCY

    process         :    81
    multiprocessing :    43
    worker          :    38
    after           :    34
    starting        :    33
    running         :    32
    processes       :    32
    python          :    31
    start           :    29
    class           :    28
    literal         :    27
    header          :    27
    pymotw          :    27
    end             :    27
    daemon          :    23
    now             :    22
    func            :    21
    can             :    21
    consumer        :    20
    mod             :    19

See Also:

multiprocessing (http://docs.python.org/library/multiprocessing.html) The standard library documentation for this module.

MapReduce (http://en.wikipedia.org/wiki/MapReduce) Overview of MapReduce on Wikipedia.

MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce.html) Google Labs presentation and paper on MapReduce.

operator (page 153) Operator tools such as itemgetter().

threading (page 505) High-level API for working with threads.

Chapter 11

NETWORKING

Network communication is used to retrieve data needed for an algorithm running locally, share information for distributed processing, and manage cloud services. Python’s standard library comes complete with modules for creating network services, as well as for accessing existing services remotely.

The low-level socket library provides direct access to the native C socket library and can be used to communicate with any network service. select watches multiple sockets simultaneously and is useful for allowing network servers to communicate with multiple clients simultaneously.
The frameworks in SocketServer abstract out a lot of the repetitive work necessary to create a new network server. The classes can be combined to create servers that fork or use threads and support TCP or UDP. Only the actual message handling needs to be provided by the application.

asyncore implements an asynchronous networking stack with a callback-based API. It encapsulates the polling loop and buffering, and invokes appropriate handlers when data is received. The framework in asynchat simplifies the work needed to create bidirectional message-based protocols on top of asyncore.

11.1 socket—Network Communication

Purpose: Provides access to network communication.
Python Version: 1.4 and later

The socket module exposes the low-level C API for communicating over a network using the BSD socket interface. It includes the socket class, for handling the actual data channel, and also includes functions for network-related tasks, such as converting a server’s name to an address and formatting data to be sent across the network.

11.1.1 Addressing, Protocol Families, and Socket Types

A socket is one endpoint of a communication channel used by programs to pass data back and forth locally or across the Internet. Sockets have two primary properties controlling the way they send data: the address family controls the OSI network layer protocol used, and the socket type controls the transport layer protocol.

Python supports three address families. The most common, AF_INET, is used for IPv4 Internet addressing. IPv4 addresses are four bytes long and are usually represented as a sequence of four numbers, one per byte, separated by dots (e.g., 10.1.1.5 and 127.0.0.1). These values are more commonly referred to as “IP addresses.” Almost all Internet networking currently is done using IP version 4.

AF_INET6 is used for IPv6 Internet addressing. IPv6 is the “next generation” version of the Internet protocol.
It supports 128-bit addresses, traffic shaping, and routing features not available under IPv4. Adoption of IPv6 is still limited, but continues to grow.

AF_UNIX is the address family for UNIX Domain Sockets (UDS), an inter-process communication protocol available on POSIX-compliant systems. The implementation of UDS typically allows the operating system to pass data directly from process to process, without going through the network stack. This is more efficient than using AF_INET, but because the file system is used as the namespace for addressing, UDS is restricted to processes on the same system. The appeal of using UDS over other IPC mechanisms, such as named pipes or shared memory, is that the programming interface is the same as for IP networking. This means the application can take advantage of efficient communication when running on a single host, but use the same code when sending data across the network.

Note: The AF_UNIX constant is only defined on systems where UDS is supported.

The socket type is usually either SOCK_DGRAM for user datagram protocol (UDP) or SOCK_STREAM for transmission control protocol (TCP). UDP does not require transmission handshaking or other setup, but offers lower reliability of delivery. UDP messages may be delivered out of order, more than once, or not at all. TCP, by contrast, ensures that each message is delivered exactly once and in the correct order. That extra reliability may impose additional latency, however, since packets may need to be retransmitted. Most application protocols that deliver a large amount of data, such as HTTP, are built on top of TCP. UDP is commonly used for protocols where order is less important (since the message fits in a single packet, e.g., DNS), or for multicasting (sending the same data to several hosts).

Note: Python’s socket module supports other socket types, but they are less commonly used and so are not covered here.
Refer to the standard library documentation for more details.

Looking Up Hosts on the Network

socket includes functions to interface with the domain name services on the network so a program can convert the host name of a server into its numerical network address. Applications do not need to convert addresses explicitly before using them to connect to a server, but it can be useful when reporting errors to include the numerical address as well as the name value being used.

To find the official name of the current host, use gethostname().

    import socket

    print socket.gethostname()

The name returned will depend on the network settings for the current system, and it may change if it is on a different network (such as a laptop attached to a wireless LAN).

    $ python socket_gethostname.py

    farnsworth.hellfly.net

Use gethostbyname() to consult the operating system hostname resolution API and convert the name of a server to its numerical address.

    import socket

    for host in [ 'homer', 'www', 'www.python.org', 'nosuchname' ]:
        try:
            print '%s : %s' % (host, socket.gethostbyname(host))
        except socket.error, msg:
            print '%s : %s' % (host, msg)

If the DNS configuration of the current system includes one or more domains in the search list, the name argument does not need to be a fully qualified name (i.e., it does not need to include the domain name as well as the base hostname). If the name cannot be found, an exception of type socket.error is raised.

    $ python socket_gethostbyname.py

    homer : 192.168.1.8
    www : 192.168.1.8
    www.python.org : 82.94.164.162
    nosuchname : [Errno 8] nodename nor servname provided, or not known

For access to more naming information about a server, use the function gethostbyname_ex(). It returns the canonical hostname of the server, any aliases, and all the available IP addresses that can be used to reach it.
    import socket

    for host in [ 'homer', 'www', 'www.python.org', 'nosuchname' ]:
        print host
        try:
            hostname, aliases, addresses = socket.gethostbyname_ex(host)
            print '  Hostname:', hostname
            print '  Aliases :', aliases
            print ' Addresses:', addresses
        except socket.error as msg:
            print 'ERROR:', msg
        print

Having all known IP addresses for a server lets a client implement its own load-balancing or fail-over algorithms.

    $ python socket_gethostbyname_ex.py

    homer
      Hostname: homer.hellfly.net
      Aliases : []
     Addresses: ['192.168.1.8']

    www
      Hostname: homer.hellfly.net
      Aliases : ['www.hellfly.net']
     Addresses: ['192.168.1.8']

    www.python.org
      Hostname: www.python.org
      Aliases : []
     Addresses: ['82.94.164.162']

    nosuchname
    ERROR: [Errno 8] nodename nor servname provided, or not known

Use getfqdn() to convert a partial name to a fully qualified domain name.

    import socket

    for host in [ 'homer', 'www' ]:
        print '%6s : %s' % (host, socket.getfqdn(host))

The name returned will not necessarily match the input argument in any way if the input is an alias, as www is here.

    $ python socket_getfqdn.py

     homer : homer.hellfly.net
       www : homer.hellfly.net

When the address of a server is available, use gethostbyaddr() to do a “reverse” lookup for the name.

    import socket

    hostname, aliases, addresses = socket.gethostbyaddr('192.168.1.8')

    print 'Hostname :', hostname
    print 'Aliases  :', aliases
    print 'Addresses:', addresses

The return value is a tuple containing the full hostname, any aliases, and all IP addresses associated with the name.

    $ python socket_gethostbyaddr.py

    Hostname : homer.hellfly.net
    Aliases  : ['8.1.168.192.in-addr.arpa']
    Addresses: ['192.168.1.8']

Finding Service Information

In addition to an IP address, each socket address includes an integer port number. Many applications can run on the same host, listening on a single IP address, but only one socket at a time can use a port at that address.
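
The one-socket-per-port rule can be observed directly. The following is a minimal sketch rather than part of the original listings: it uses Python 3 syntax (the chapter's listings target Python 2), it assumes the loopback address 127.0.0.1 is available, and second_bind_errno is a hypothetical helper name. Socket options that relax the rule, such as SO_REUSEADDR, are deliberately left at their defaults.

```python
import errno
import socket


def second_bind_errno():
    """Return the errno raised when a second socket binds a busy port.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the operating system for any free port.
        first.bind(('127.0.0.1', 0))
        first.listen(1)
        address = first.getsockname()
        try:
            # Same IP address, same protocol, same port: refused.
            second.bind(address)
        except OSError as err:
            return err.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == '__main__':
    code = second_bind_errno()
    print(errno.errorcode.get(code, code))
```

On the systems this was written against, the second bind() fails with EADDRINUSE, the error a server reports when another process is already listening on its port.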
The combination of IP address, protocol, and port number uniquely identifies a communication channel and ensures that messages sent through a socket arrive at the correct destination.

Some of the port numbers are preallocated for a specific protocol. For example, email servers using SMTP communicate with each other over port number 25 using TCP, and Web clients and servers use port 80 for HTTP. The port numbers for network services with standardized names can be looked up using getservbyname().

    import socket
    from urlparse import urlparse

    for url in [ 'http://www.python.org',
                 'https://www.mybank.com',
                 'ftp://prep.ai.mit.edu',
                 'gopher://gopher.micro.umn.edu',
                 'smtp://mail.example.com',
                 'imap://mail.example.com',
                 'imaps://mail.example.com',
                 'pop3://pop.example.com',
                 'pop3s://pop.example.com',
                 ]:
        parsed_url = urlparse(url)
        port = socket.getservbyname(parsed_url.scheme)
        print '%6s : %s' % (parsed_url.scheme, port)

Although a standardized service is unlikely to change ports, looking up the value with a system call instead of hard coding it is more flexible when new services are added in the future.

    $ python socket_getservbyname.py

      http : 80
     https : 443
       ftp : 21
    gopher : 70
      smtp : 25
      imap : 143
     imaps : 993
      pop3 : 110
     pop3s : 995

To reverse the service port lookup, use getservbyport().

    import socket
    import urlparse

    for port in [ 80, 443, 21, 70, 25, 143, 993, 110, 995 ]:
        print urlparse.urlunparse(
            (socket.getservbyport(port), 'example.com', '/', '', '', '')
            )

The reverse lookup is useful for constructing URLs to services from arbitrary addresses.

    $ python socket_getservbyport.py

    http://example.com/
    https://example.com/
    ftp://example.com/
    gopher://example.com/
    smtp://example.com/
    imap://example.com/
    imaps://example.com/
    pop3://example.com/
    pop3s://example.com/

The number assigned to a transport protocol can be retrieved with getprotobyname().
    import socket

    def get_constants(prefix):
        """Create a dictionary mapping socket module
        constants to their names.
        """
        return dict( (getattr(socket, n), n)
                     for n in dir(socket)
                     if n.startswith(prefix)
                     )

    protocols = get_constants('IPPROTO_')

    for name in [ 'icmp', 'udp', 'tcp' ]:
        proto_num = socket.getprotobyname(name)
        const_name = protocols[proto_num]
        print '%4s -> %2d (socket.%-12s = %2d)' % \
            (name, proto_num, const_name, getattr(socket, const_name))

The values for protocol numbers are standardized and defined as constants in socket with the prefix IPPROTO_.

    $ python socket_getprotobyname.py

    icmp ->  1 (socket.IPPROTO_ICMP =  1)
     udp -> 17 (socket.IPPROTO_UDP  = 17)
     tcp ->  6 (socket.IPPROTO_TCP  =  6)

Looking Up Server Addresses

getaddrinfo() converts the basic address of a service into a list of tuples with all the information necessary to make a connection. The contents of each tuple will vary, containing different network families or protocols.

    import socket

    def get_constants(prefix):
        """Create a dictionary mapping socket module
        constants to their names.
        """
        return dict( (getattr(socket, n), n)
                     for n in dir(socket)
                     if n.startswith(prefix)
                     )

    families = get_constants('AF_')
    types = get_constants('SOCK_')
    protocols = get_constants('IPPROTO_')

    for response in socket.getaddrinfo('www.python.org', 'http'):

        # Unpack the response tuple
        family, socktype, proto, canonname, sockaddr = response

        print 'Family        :', families[family]
        print 'Type          :', types[socktype]
        print 'Protocol      :', protocols[proto]
        print 'Canonical name:', canonname
        print 'Socket address:', sockaddr
        print

This program demonstrates how to look up the connection information for www.python.org.
    $ python socket_getaddrinfo.py

    Family        : AF_INET
    Type          : SOCK_DGRAM
    Protocol      : IPPROTO_UDP
    Canonical name:
    Socket address: ('82.94.164.162', 80)

    Family        : AF_INET
    Type          : SOCK_STREAM
    Protocol      : IPPROTO_TCP
    Canonical name:
    Socket address: ('82.94.164.162', 80)

getaddrinfo() takes several arguments for filtering the result list. The host and port values given in the example are required arguments. The optional arguments are family, socktype, proto, and flags. The optional values should be either 0 or one of the constants defined by socket.

    import socket

    def get_constants(prefix):
        """Create a dictionary mapping socket module
        constants to their names.
        """
        return dict( (getattr(socket, n), n)
                     for n in dir(socket)
                     if n.startswith(prefix)
                     )

    families = get_constants('AF_')
    types = get_constants('SOCK_')
    protocols = get_constants('IPPROTO_')

    for response in socket.getaddrinfo('www.doughellmann.com', 'http',
                                       socket.AF_INET,      # family
                                       socket.SOCK_STREAM,  # socktype
                                       socket.IPPROTO_TCP,  # protocol
                                       socket.AI_CANONNAME, # flags
                                       ):

        # Unpack the response tuple
        family, socktype, proto, canonname, sockaddr = response

        print 'Family        :', families[family]
        print 'Type          :', types[socktype]
        print 'Protocol      :', protocols[proto]
        print 'Canonical name:', canonname
        print 'Socket address:', sockaddr
        print

Since flags includes AI_CANONNAME, the canonical name of the server, which may be different from the value used for the lookup if the host has any aliases, is included in the results this time. Without the flag, the canonical name value is left empty.

    $ python socket_getaddrinfo_extra_args.py

    Family        : AF_INET
    Type          : SOCK_STREAM
    Protocol      : IPPROTO_TCP
    Canonical name: homer.doughellmann.com
    Socket address: ('192.168.1.8', 80)

IP Address Representations

Network programs written in C use the data type struct sockaddr to represent IP addresses as binary values (instead of the string addresses usually found in Python programs).
To convert IPv4 addresses between the Python representation and the C representation, use inet_aton() and inet_ntoa().

    import binascii
    import socket
    import struct
    import sys

    for string_address in [ '192.168.1.1', '127.0.0.1' ]:
        packed = socket.inet_aton(string_address)
        print 'Original:', string_address
        print 'Packed  :', binascii.hexlify(packed)
        print 'Unpacked:', socket.inet_ntoa(packed)
        print

The four bytes in the packed format can be passed to C libraries, transmitted safely over the network, or saved to a database compactly.

    $ python socket_address_packing.py

    Original: 192.168.1.1
    Packed  : c0a80101
    Unpacked: 192.168.1.1

    Original: 127.0.0.1
    Packed  : 7f000001
    Unpacked: 127.0.0.1

The related functions inet_pton() and inet_ntop() work with both IPv4 and IPv6 addresses, producing the appropriate format based on the address family parameter passed in.

    import binascii
    import socket
    import struct
    import sys

    string_address = '2002:ac10:10a:1234:21e:52ff:fe74:40e'
    packed = socket.inet_pton(socket.AF_INET6, string_address)

    print 'Original:', string_address
    print 'Packed  :', binascii.hexlify(packed)
    print 'Unpacked:', socket.inet_ntop(socket.AF_INET6, packed)

An IPv6 address is already a hexadecimal value, so converting the packed version to a series of hex digits produces a string similar to the original value.

    $ python socket_ipv6_address_packing.py

    Original: 2002:ac10:10a:1234:21e:52ff:fe74:40e
    Packed  : 2002ac10010a1234021e52fffe74040e
    Unpacked: 2002:ac10:10a:1234:21e:52ff:fe74:40e

See Also:

IPv6 (http://en.wikipedia.org/wiki/IPv6) Wikipedia article discussing Internet Protocol Version 6 (IPv6).

OSI Networking Model (http://en.wikipedia.org/wiki/OSI_model) Wikipedia article describing the seven layer model of networking implementation.

Assigned Internet Protocol Numbers (www.iana.org/assignments/protocol-numbers/protocol-numbers.xml) List of standard protocol names and numbers.
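
As a supplement to the address-packing functions covered under IP Address Representations, the sketch below shows one common use of the packed form: treating an IPv4 address as a 32-bit integer. It is an illustration only, written in Python 3 syntax rather than the Python 2 of the listings above, and ipv4_to_int, int_to_ipv4, and normalize are hypothetical helper names, not functions of the socket module.

```python
import socket
import struct


def ipv4_to_int(address):
    """Convert a dotted-quad string to an unsigned 32-bit integer."""
    packed = socket.inet_aton(address)      # four bytes, network order
    return struct.unpack('!I', packed)[0]   # '!I' = big-endian uint32


def int_to_ipv4(value):
    """Convert an unsigned 32-bit integer back to a dotted-quad string."""
    return socket.inet_ntoa(struct.pack('!I', value))


def normalize(family, address):
    """Round-trip an address through its packed form for the family."""
    return socket.inet_ntop(family, socket.inet_pton(family, address))


if __name__ == '__main__':
    number = ipv4_to_int('192.168.1.1')
    print(hex(number))          # 0xc0a80101
    print(int_to_ipv4(number))  # 192.168.1.1
    # Packed sizes: 4 bytes for IPv4, 16 bytes for IPv6.
    print(len(socket.inet_pton(socket.AF_INET, '127.0.0.1')))
    print(len(socket.inet_pton(socket.AF_INET6, '::1')))
```

The integer form is convenient for range checks and sorting, while the round trip through inet_pton() and inet_ntop() yields a consistent text form for addresses written in different but equivalent ways.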
11.1.2 TCP/IP Client and Server

Sockets can be configured to act as a server and listen for incoming messages, or connect to other applications as a client. After both ends of a TCP/IP socket are connected, communication is bidirectional.

Echo Server

This sample program, based on the one in the standard library documentation, receives incoming messages and echoes them back to the sender. It starts by creating a TCP/IP socket.

    import socket
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

Then bind() is used to associate the socket with the server address. In this case, the address is localhost, referring to the current server, and the port number is 10000.

    # Bind the socket to the port
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'starting up on %s port %s' % server_address
    sock.bind(server_address)

Calling listen() puts the socket into server mode, and accept() waits for an incoming connection. The integer argument is the number of connections the system should queue up in the background before rejecting new clients. This example only expects to work with one connection at a time.

    # Listen for incoming connections
    sock.listen(1)

    while True:
        # Wait for a connection
        print >>sys.stderr, 'waiting for a connection'
        connection, client_address = sock.accept()

accept() returns an open connection between the server and client, along with the client address. The connection is actually a different socket on another port (assigned by the kernel). Data is read from the connection with recv() and transmitted with sendall().
        try:
            print >>sys.stderr, 'connection from', client_address

            # Receive the data in small chunks and retransmit it
            while True:
                data = connection.recv(16)
                print >>sys.stderr, 'received "%s"' % data
                if data:
                    print >>sys.stderr, 'sending data back to the client'
                    connection.sendall(data)
                else:
                    print >>sys.stderr, 'no data from', client_address
                    break

        finally:
            # Clean up the connection
            connection.close()

When communication with a client is finished, the connection needs to be cleaned up using close(). This example uses a try:finally block to ensure that close() is always called, even in the event of an error.

Echo Client

The client program sets up its socket differently from the way a server does. Instead of binding to a port and listening, it uses connect() to attach the socket directly to the remote address.

    import socket
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Connect the socket to the port where the server is listening
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'connecting to %s port %s' % server_address
    sock.connect(server_address)

After the connection is established, data can be sent through the socket with sendall() and received with recv(), just as in the server.

    try:

        # Send data
        message = 'This is the message.  It will be repeated.'
        print >>sys.stderr, 'sending "%s"' % message
        sock.sendall(message)

        # Look for the response
        amount_received = 0
        amount_expected = len(message)

        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print >>sys.stderr, 'received "%s"' % data

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

When the entire message is sent and a copy received, the socket is closed to free up the port.

Client and Server Together

The client and server should be run in separate terminal windows, so they can communicate with each other.
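
For a quick experiment without two terminals, both ends of the echo exchange can also be driven from a single script. The following is a sketch under stated assumptions, not a replacement for the listings above: it uses Python 3 syntax (so the message is a bytes object), runs the server side in a background thread, and binds to port 0 so the operating system chooses a free port instead of the fixed port 10000. The names serve_one_client and echo_round_trip are hypothetical.

```python
import socket
import threading

MESSAGE = b'This is the message.  It will be repeated.'


def serve_one_client(listener):
    """Accept a single connection and echo its data until the peer closes."""
    connection, client_address = listener.accept()
    try:
        while True:
            data = connection.recv(16)
            if not data:      # empty result: the client closed its side
                break
            connection.sendall(data)
    finally:
        connection.close()


def echo_round_trip(message=MESSAGE):
    """Send message through a local echo server and return the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # bind() and listen() happen before the thread starts, so the
    # client cannot try to connect before the server is ready.
    listener.bind(('127.0.0.1', 0))
    listener.listen(1)
    server = threading.Thread(target=serve_one_client, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    chunks = []
    try:
        client.connect(listener.getsockname())
        client.sendall(message)
        received = 0
        while received < len(message):
            data = client.recv(16)
            if not data:
                break
            chunks.append(data)
            received += len(data)
    finally:
        client.close()        # lets the server's recv() return b''
        server.join()
        listener.close()
    return b''.join(chunks)


if __name__ == '__main__':
    print(echo_round_trip())
```

As in the two-program version, the reply arrives in pieces of at most 16 bytes, which is why the client keeps reading until it has received as many bytes as it sent.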
The server output shows the incoming connection and data, as well as the response sent back to the client.

    $ python ./socket_echo_server.py

    starting up on localhost port 10000
    waiting for a connection
    connection from ('127.0.0.1', 52186)
    received "This is the mess"
    sending data back to the client
    received "age.  It will be"
    sending data back to the client
    received " repeated."
    sending data back to the client
    received ""
    no data from ('127.0.0.1', 52186)
    waiting for a connection

The client output shows the outgoing message and the response from the server.

    $ python socket_echo_client.py

    connecting to localhost port 10000
    sending "This is the message.  It will be repeated."
    received "This is the mess"
    received "age.  It will be"
    received " repeated."
    closing socket
    $

Easy Client Connections

TCP/IP clients can save a few steps by using the convenience function create_connection() to connect to a server. The function takes one required argument, a two-value tuple containing the server address, and derives the best address to use for the connection.

    import socket
    import sys

    def get_constants(prefix):
        """Create a dictionary mapping socket module
        constants to their names.
        """
        return dict( (getattr(socket, n), n)
                     for n in dir(socket)
                     if n.startswith(prefix)
                     )

    families = get_constants('AF_')
    types = get_constants('SOCK_')
    protocols = get_constants('IPPROTO_')

    # Create a TCP/IP socket
    sock = socket.create_connection(('localhost', 10000))

    print >>sys.stderr, 'Family  :', families[sock.family]
    print >>sys.stderr, 'Type    :', types[sock.type]
    print >>sys.stderr, 'Protocol:', protocols[sock.proto]
    print >>sys.stderr

    try:

        # Send data
        message = 'This is the message.  It will be repeated.'
        print >>sys.stderr, 'sending "%s"' % message
        sock.sendall(message)

        amount_received = 0
        amount_expected = len(message)

        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print >>sys.stderr, 'received "%s"' % data

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

create_connection() uses getaddrinfo() to find candidate connection parameters and returns a socket opened with the first configuration that creates a successful connection. The family, type, and proto attributes can be examined to determine the type of socket being returned.

    $ python socket_echo_client_easy.py

    Family  : AF_INET
    Type    : SOCK_STREAM
    Protocol: IPPROTO_TCP

    sending "This is the message.  It will be repeated."
    received "This is the mess"
    received "age.  It will be"
    received " repeated."
    closing socket

Choosing an Address for Listening

It is important to bind a server to the correct address so that clients can communicate with it. The previous examples all used 'localhost' as the IP address, which limits connections to clients running on the same server. Use a public address of the server, such as the value returned by gethostname(), to allow other hosts to connect. This example modifies the echo server to listen on an address specified via a command line argument.
    import socket
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Bind the socket to the address given on the command line
    server_name = sys.argv[1]
    server_address = (server_name, 10000)
    print >>sys.stderr, 'starting up on %s port %s' % server_address
    sock.bind(server_address)
    sock.listen(1)

    while True:
        print >>sys.stderr, 'waiting for a connection'
        connection, client_address = sock.accept()
        try:
            print >>sys.stderr, 'client connected:', client_address
            while True:
                data = connection.recv(16)
                print >>sys.stderr, 'received "%s"' % data
                if data:
                    connection.sendall(data)
                else:
                    break
        finally:
            connection.close()

A similar modification to the client program is needed before the server can be tested.

    import socket
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Connect the socket to the port on the server given by the caller
    server_address = (sys.argv[1], 10000)
    print >>sys.stderr, 'connecting to %s port %s' % server_address
    sock.connect(server_address)

    try:

        message = 'This is the message.  It will be repeated.'
        print >>sys.stderr, 'sending "%s"' % message
        sock.sendall(message)

        amount_received = 0
        amount_expected = len(message)
        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print >>sys.stderr, 'received "%s"' % data

    finally:
        sock.close()

After starting the server with the argument farnsworth.hellfly.net, the netstat command shows it listening on the address for the named host.

    $ host farnsworth.hellfly.net

    farnsworth.hellfly.net has address 192.168.1.17

    $ netstat -an

    Active Internet connections (including servers)
    Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
    ...
    tcp4       0      0  192.168.1.17.10000     *.*                    LISTEN
    ...

Running the client on another host, passing farnsworth.hellfly.net as the host where the server is running, produces the following.
    $ hostname

    homer

    $ python socket_echo_client_explicit.py farnsworth.hellfly.net

    connecting to farnsworth.hellfly.net port 10000
    sending "This is the message.  It will be repeated."
    received "This is the mess"
    received "age.  It will be"
    received " repeated."

And the server produces the following output.

    $ python ./socket_echo_server_explicit.py farnsworth.hellfly.net

    starting up on farnsworth.hellfly.net port 10000
    waiting for a connection
    client connected: ('192.168.1.8', 57471)
    received "This is the mess"
    received "age.  It will be"
    received " repeated."
    received ""
    waiting for a connection

Many servers have more than one network interface, and therefore, more than one IP address. Rather than running separate copies of a service bound to each IP address, use the special address INADDR_ANY to listen on all addresses at the same time. Although socket defines a constant for INADDR_ANY, it is an integer value and must be converted to a dotted-notation string address before it can be passed to bind(). As a shortcut, use '0.0.0.0' or an empty string ('') instead of doing the conversion.

    import socket
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Bind the socket to all addresses by passing an empty host string
    server_address = ('', 10000)
    sock.bind(server_address)
    print >>sys.stderr, 'starting up on %s port %s' % sock.getsockname()
    sock.listen(1)

    while True:
        print >>sys.stderr, 'waiting for a connection'
        connection, client_address = sock.accept()
        try:
            print >>sys.stderr, 'client connected:', client_address
            while True:
                data = connection.recv(16)
                print >>sys.stderr, 'received "%s"' % data
                if data:
                    connection.sendall(data)
                else:
                    break
        finally:
            connection.close()

To see the actual address being used by a socket, call its getsockname() method. After starting the service, running netstat again shows it listening for incoming connections on any address.
    $ netstat -an
    Active Internet connections (including servers)
    Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
    ...
    tcp4       0      0  *.10000                *.*                    LISTEN
    ...

11.1.3 User Datagram Client and Server

The user datagram protocol (UDP) works differently from TCP/IP. Where TCP is a stream-oriented protocol, ensuring that all the data is transmitted in the right order, UDP is a message-oriented protocol. UDP does not require a long-lived connection, so setting up a UDP socket is a little simpler. On the other hand, UDP messages must fit within a single packet (for IPv4, that means they can only hold 65,507 bytes because the 65,535-byte packet also includes header information) and delivery is not guaranteed as it is with TCP.

Echo Server

Since there is no connection, per se, the server does not need to listen for and accept connections. It only needs to use bind() to associate its socket with a port and then wait for individual messages.

    import socket
    import sys

    # Create a UDP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Bind the socket to the port
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'starting up on %s port %s' % server_address
    sock.bind(server_address)

Messages are read from the socket using recvfrom(), which returns the data as well as the address of the client from which it was sent.

    while True:
        print >>sys.stderr, '\nwaiting to receive message'
        data, address = sock.recvfrom(4096)

        print >>sys.stderr, 'received %s bytes from %s' % \
            (len(data), address)
        print >>sys.stderr, data

        if data:
            sent = sock.sendto(data, address)
            print >>sys.stderr, 'sent %s bytes back to %s' % \
                (sent, address)

Echo Client

The UDP echo client is similar to the server, but does not use bind() to attach its socket to an address. It uses sendto() to deliver its message directly to the server and recvfrom() to receive the response.
    import socket
    import sys

    # Create a UDP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    server_address = ('localhost', 10000)
    message = 'This is the message.  It will be repeated.'

    try:

        # Send data
        print >>sys.stderr, 'sending "%s"' % message
        sent = sock.sendto(message, server_address)

        # Receive response
        print >>sys.stderr, 'waiting to receive'
        data, server = sock.recvfrom(4096)
        print >>sys.stderr, 'received "%s"' % data

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

Client and Server Together

Running the server produces the following.

    $ python ./socket_echo_server_dgram.py
    starting up on localhost port 10000

    waiting to receive message
    received 42 bytes from ('127.0.0.1', 50139)
    This is the message.  It will be repeated.
    sent 42 bytes back to ('127.0.0.1', 50139)

    waiting to receive message

This is the client output.

    $ python ./socket_echo_client_dgram.py
    sending "This is the message.  It will be repeated."
    waiting to receive
    received "This is the message.  It will be repeated."
    closing socket

11.1.4 UNIX Domain Sockets

From the programmer's perspective, there are two essential differences between using a UNIX domain socket (UDS) and a TCP/IP socket. First, the address of the socket is a path on the file system, rather than a tuple containing the server name and port. Second, the node created in the file system to represent the socket persists after the socket is closed and needs to be removed each time the server starts up. The echo server example from earlier can be updated to use UDS by making a few changes in the setup section.

    import socket
    import sys
    import os

    server_address = './uds_socket'

    # Make sure the socket does not already exist
    try:
        os.unlink(server_address)
    except OSError:
        if os.path.exists(server_address):
            raise

The socket needs to be created with address family AF_UNIX.
    # Create a UDS socket
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

Binding the socket and managing the incoming connections works the same as with TCP/IP sockets.

    # Bind the socket to the address
    print >>sys.stderr, 'starting up on %s' % server_address
    sock.bind(server_address)

    # Listen for incoming connections
    sock.listen(1)

    while True:
        # Wait for a connection
        print >>sys.stderr, 'waiting for a connection'
        connection, client_address = sock.accept()
        try:
            print >>sys.stderr, 'connection from', client_address

            # Receive the data in small chunks and retransmit it
            while True:
                data = connection.recv(16)
                print >>sys.stderr, 'received "%s"' % data
                if data:
                    print >>sys.stderr, 'sending data back to the client'
                    connection.sendall(data)
                else:
                    print >>sys.stderr, 'no data from', client_address
                    break

        finally:
            # Clean up the connection
            connection.close()

The client setup also needs to be modified to work with UDS. It should assume the file system node for the socket exists, since the server creates it by binding to the address.

    import socket
    import sys

    # Create a UDS socket
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)

    # Connect the socket to the port where the server is listening
    server_address = './uds_socket'
    print >>sys.stderr, 'connecting to %s' % server_address
    try:
        sock.connect(server_address)
    except socket.error, msg:
        print >>sys.stderr, msg
        sys.exit(1)

Sending and receiving data works the same way in the UDS client as the TCP/IP client from before.

    try:

        # Send data
        message = 'This is the message.  It will be repeated.'
        print >>sys.stderr, 'sending "%s"' % message
        sock.sendall(message)

        amount_received = 0
        amount_expected = len(message)

        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print >>sys.stderr, 'received "%s"' % data

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

The program output is mostly the same, with appropriate updates for the address information. The server shows the messages received and sent back to the client.

    $ python ./socket_echo_server_uds.py
    starting up on ./uds_socket
    waiting for a connection
    connection from
    received "This is the mess"
    sending data back to the client
    received "age.  It will be"
    sending data back to the client
    received " repeated."
    sending data back to the client
    received ""
    no data from
    waiting for a connection

The client sends the message all at once and receives parts of it back incrementally.

    $ python socket_echo_client_uds.py
    connecting to ./uds_socket
    sending "This is the message.  It will be repeated."
    received "This is the mess"
    received "age.  It will be"
    received " repeated."
    closing socket

Permissions

Since the UDS socket is represented by a node on the file system, standard file system permissions can be used to control access to the server.

    $ ls -l ./uds_socket
    srwxr-xr-x  1 dhellmann  dhellmann  0 Sep 20 08:24 ./uds_socket

    $ sudo chown root ./uds_socket

    $ ls -l ./uds_socket
    srwxr-xr-x  1 root  dhellmann  0 Sep 20 08:24 ./uds_socket

Running the client as a user other than root now results in an error because the process does not have permission to open the socket.

    $ python socket_echo_client_uds.py
    connecting to ./uds_socket
    [Errno 13] Permission denied

Communication between Parent and Child Processes

The socketpair() function is useful for setting up UDS sockets for inter-process communication under UNIX. It creates a pair of connected sockets that can be used to communicate between a parent process and a child process after the child is forked.
    import socket
    import os

    parent, child = socket.socketpair()

    pid = os.fork()

    if pid:
        print 'in parent, sending message'
        child.close()
        parent.sendall('ping')
        response = parent.recv(1024)
        print 'response from child:', response
        parent.close()

    else:
        print 'in child, waiting for message'
        parent.close()
        message = child.recv(1024)
        print 'message from parent:', message
        child.sendall('pong')
        child.close()

By default, a UDS socket is created, but the caller can also pass address family, socket type, and even protocol options to control how the sockets are created.

    $ python socket_socketpair.py
    in child, waiting for message
    message from parent: ping
    in parent, sending message
    response from child: pong

11.1.5 Multicast

Point-to-point connections handle a lot of communication needs, but passing the same information between many peers becomes challenging as the number of direct connections grows. Sending messages separately to each recipient consumes additional processing time and bandwidth, which can be a problem for applications such as streaming video or audio. Using multicast to deliver messages to more than one endpoint at a time achieves better efficiency because the network infrastructure ensures that the packets are delivered to all recipients.

Multicast messages are always sent using UDP, since TCP requires an end-to-end communication channel. The addresses for multicast, called multicast groups, are a subset of the regular IPv4 address range (224.0.0.0 through 239.255.255.255) reserved for multicast traffic. These addresses are treated specially by network routers and switches, so messages sent to the group can be distributed over the Internet to all recipients that have joined the group.

Note: Some managed switches and routers have multicast traffic disabled by default. If you have trouble with the example programs, check your network hardware settings.
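Before picking a group address for the programs that follow, it can help to confirm that the address really falls inside the reserved multicast block. The sketch below is an illustration only: it is written for Python 3 (unlike the Python 2 listings in this section), it relies on the standard ipaddress module, and the helper name is_multicast_group is hypothetical rather than part of the socket API.

```python
import ipaddress


def is_multicast_group(address):
    """Return True if the dotted-quad string is an IPv4 multicast address.

    Hypothetical helper for illustration; it only wraps the
    is_multicast property of ipaddress.IPv4Address.
    """
    return ipaddress.IPv4Address(address).is_multicast


# The first two addresses are multicast groups; the last is an
# ordinary unicast host address on a private network.
for candidate in ('224.3.29.71', '224.0.0.1', '192.168.1.17'):
    print(candidate, is_multicast_group(candidate))
```

A check like this belongs in configuration validation rather than in the send or receive path, since the operating system performs its own checks when a socket joins a group.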
Sending Multicast Messages

This modified echo client will send a message to a multicast group and then report all the responses it receives. Since it has no way of knowing how many responses to expect, it uses a timeout value on the socket to avoid blocking indefinitely while waiting for an answer.

    import socket
    import struct
    import sys

    message = 'very important data'
    multicast_group = ('224.3.29.71', 10000)

    # Create the datagram socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Set a timeout so the socket does not block indefinitely when trying
    # to receive data.
    sock.settimeout(0.2)

The socket also needs to be configured with a time-to-live value (TTL) for the messages. The TTL controls how many networks will receive the packet. Set the TTL with the IP_MULTICAST_TTL option and setsockopt(). The default, 1, means that the packets are not forwarded by the router beyond the current network segment. The value can range up to 255 and should be packed into a single byte.

    # Set the time-to-live for messages to 1 so they do not go past the
    # local network segment.
    ttl = struct.pack('b', 1)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)

The rest of the sender looks like the UDP echo client, except that it expects multiple responses so uses a loop to call recvfrom() until it times out.

    try:

        # Send data to the multicast group
        print >>sys.stderr, 'sending "%s"' % message
        sent = sock.sendto(message, multicast_group)

        # Look for responses from all recipients
        while True:
            print >>sys.stderr, 'waiting to receive'
            try:
                data, server = sock.recvfrom(16)
            except socket.timeout:
                print >>sys.stderr, 'timed out, no more responses'
                break
            else:
                print >>sys.stderr, 'received "%s" from %s' % \
                    (data, server)

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

Receiving Multicast Messages

The first step to establishing a multicast receiver is to create the UDP socket.
    import socket
    import struct
    import sys

    multicast_group = '224.3.29.71'
    server_address = ('', 10000)

    # Create the socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Bind to the server address
    sock.bind(server_address)

After the regular socket is created and bound to a port, it can be added to the multicast group by using setsockopt() to change the IP_ADD_MEMBERSHIP option. The option value is the 8-byte packed representation of the multicast group address followed by the network interface on which the server should listen for the traffic, identified by its IP address. In this case, the receiver listens on all interfaces using INADDR_ANY.

    # Tell the operating system to add the socket to the multicast group
    # on all interfaces.
    group = socket.inet_aton(multicast_group)
    mreq = struct.pack('4sL', group, socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

The main loop for the receiver is just like the regular UDP echo server.

    # Receive/respond loop
    while True:
        print >>sys.stderr, '\nwaiting to receive message'
        data, address = sock.recvfrom(1024)

        print >>sys.stderr, 'received %s bytes from %s' % \
            (len(data), address)
        print >>sys.stderr, data

        print >>sys.stderr, 'sending acknowledgement to', address
        sock.sendto('ack', address)

Example Output

This example shows the multicast receiver running on two different hosts. A has address 192.168.1.17 and B has address 192.168.1.8.

    [A]$ python ./socket_multicast_receiver.py

    waiting to receive message
    received 19 bytes from ('192.168.1.17', 51382)
    very important data
    sending acknowledgement to ('192.168.1.17', 51382)

    [B]$ python ./socket_multicast_receiver.py

    waiting to receive message
    received 19 bytes from ('192.168.1.17', 51382)
    very important data
    sending acknowledgement to ('192.168.1.17', 51382)

The sender is running on host A.
    $ python ./socket_multicast_sender.py
    sending "very important data"
    waiting to receive
    received "ack" from ('192.168.1.17', 10000)
    waiting to receive
    received "ack" from ('192.168.1.8', 10000)
    waiting to receive
    timed out, no more responses
    closing socket

The message is sent one time, and two acknowledgements of the outgoing message are received, one from each of host A and host B.

See Also:
Multicast (http://en.wikipedia.org/wiki/Multicast) Wikipedia article describing technical details of multicasting.
IP Multicast (http://en.wikipedia.org/wiki/IP_multicast) Wikipedia article about IP multicasting, with information about addressing.

11.1.6 Sending Binary Data

Sockets transmit streams of bytes. Those bytes can contain text messages, as in the previous examples, or they can be made up of binary data that has been encoded for transmission. To prepare binary data values for transmission, pack them into a buffer with struct.

This client program encodes an integer, a string of two characters, and a floating-point value into a sequence of bytes that can be passed to the socket for transmission.

    import binascii
    import socket
    import struct
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_address = ('localhost', 10000)
    sock.connect(server_address)

    values = (1, 'ab', 2.7)
    packer = struct.Struct('I 2s f')
    packed_data = packer.pack(*values)

    print 'values =', values

    try:

        # Send data
        print >>sys.stderr, 'sending %r' % binascii.hexlify(packed_data)
        sock.sendall(packed_data)

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

When sending multibyte binary data between two systems, it is important to ensure that both sides of the connection know what order the bytes are in and how to assemble them back into the correct order for the local architecture. The server program uses the same Struct specifier to unpack the bytes it receives so they are interpreted in the correct order.
    import binascii
    import socket
    import struct
    import sys

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_address = ('localhost', 10000)
    sock.bind(server_address)
    sock.listen(1)

    unpacker = struct.Struct('I 2s f')

    while True:
        print >>sys.stderr, '\nwaiting for a connection'
        connection, client_address = sock.accept()
        try:
            data = connection.recv(unpacker.size)
            print >>sys.stderr, 'received %r' % binascii.hexlify(data)

            unpacked_data = unpacker.unpack(data)
            print >>sys.stderr, 'unpacked:', unpacked_data

        finally:
            connection.close()

Running the client produces the following:

    $ python ./socket_binary_client.py
    values = (1, 'ab', 2.7)
    sending '0100000061620000cdcc2c40'
    closing socket

And the server shows the values it receives.

    $ python ./socket_binary_server.py

    waiting for a connection
    received '0100000061620000cdcc2c40'
    unpacked: (1, 'ab', 2.700000047683716)

    waiting for a connection

The floating-point value loses some precision as it is packed and unpacked, but otherwise, the data is transmitted as expected. One thing to keep in mind is that, depending on the value of the integer, it may be more efficient to convert it to text and then transmit, instead of using struct. The integer 1 uses one byte when represented as a string, but four when packed into the structure.

See Also:
struct (page 102) Converting between strings and other data types.

11.1.7 Nonblocking Communication and Timeouts

By default, a socket is configured so that sending or receiving data blocks, stopping program execution until the socket is ready. Calls to send() wait for buffer space to be available for the outgoing data, and calls to recv() wait for the other program to send data that can be read. This form of I/O operation is easy to understand, but can lead to inefficient operation and even deadlocks if both programs end up waiting for the other to send or receive data.
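The default blocking behavior can be observed without a second program. The following sketch is an illustration under stated assumptions: it is written for Python 3 rather than the Python 2 used in the listings, it uses socketpair() and a background thread in place of a real client and server, and the helper delayed_send and the 0.2-second delay are hypothetical choices made only for the demonstration.

```python
import socket
import threading
import time


def delayed_send(sock, delay, payload):
    """Hypothetical helper: wait `delay` seconds, then write `payload`."""
    time.sleep(delay)
    sock.sendall(payload)


# socketpair() returns two sockets that are already connected to each other.
reader, writer = socket.socketpair()
try:
    # A timeout of None means the socket is in the default blocking mode.
    default_timeout = reader.gettimeout()

    sender = threading.Thread(target=delayed_send,
                              args=(writer, 0.2, b'ping'))
    start = time.monotonic()
    sender.start()
    data = reader.recv(16)      # execution stops here until data arrives
    elapsed = time.monotonic() - start
    sender.join()
finally:
    reader.close()
    writer.close()

print('default timeout:', default_timeout)
print('received %r after about %.1f seconds' % (data, elapsed))
```

The call to recv() does not return until the other thread writes, so the measured delay tracks the sender's sleep. If the sender never wrote anything, the reader would wait forever, which is the hazard described above.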
There are a few ways to work around this situation. One is to use a separate thread for communicating with each socket. This can introduce other complexities, though, with communication between the threads.

Another option is to change the socket to not block at all and return immediately if it is not ready to handle the operation. Use the setblocking() method to change the blocking flag for a socket. The default value is 1, which means to block. Passing a value of 0 turns off blocking. If the socket has blocking turned off and it is not ready for the operation, then socket.error is raised.

A compromise solution is to set a timeout value for socket operations. Use settimeout() to change the timeout of a socket to a floating-point value representing the number of seconds to block before deciding the socket is not ready for the operation. When the timeout expires, a timeout exception is raised.

See Also:
socket (http://docs.python.org/library/socket.html) The standard library documentation for this module.
Socket Programming HOWTO (http://docs.python.org/howto/sockets.html) An instructional guide by Gordon McMillan, included in the standard library documentation.
select (page 594) Testing a socket to see if it is ready for reading or writing for nonblocking I/O.
SocketServer (page 609) Framework for creating network servers.
urllib (page 651) and urllib2 (page 667) Most network clients should use the more convenient libraries for accessing remote resources through a URL.
asyncore (page 619) and asynchat (page 629) Frameworks for asynchronous communication.
Unix Network Programming, Volume 1: The Sockets Networking API, 3/E By W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. Published by Addison-Wesley Professional, 2004. ISBN-10: 0131411551

11.2 select—Wait for I/O Efficiently

Purpose  Wait for notification that an input or output channel is ready.
Python Version  1.4 and later

The select module provides access to platform-specific I/O monitoring functions. The most portable interface is the POSIX function select(), which is available on UNIX and Windows. The module also includes poll(), a UNIX-only API, and several options that only work with specific variants of UNIX.

11.2.1 Using select()

Python's select() function is a direct interface to the underlying operating system implementation. It monitors sockets, open files, and pipes (anything with a fileno() method that returns a valid file descriptor) until they become readable or writable or a communication error occurs. select() makes it easier to monitor multiple connections at the same time, and it is more efficient than writing a polling loop in Python using socket timeouts, because the monitoring happens in the operating system network layer, instead of the interpreter.

Note: Using Python's file objects with select() works for UNIX, but is not supported under Windows.

The echo server example from the socket section can be extended to watch for more than one connection at a time by using select(). The new version starts out by creating a nonblocking TCP/IP socket and configuring it to listen on an address.

    import select
    import socket
    import sys
    import Queue

    # Create a TCP/IP socket
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setblocking(0)

    # Bind the socket to the port
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'starting up on %s port %s' % server_address
    server.bind(server_address)

    # Listen for incoming connections
    server.listen(5)

The arguments to select() are three lists containing communication channels to monitor.
The first is a list of the objects to be checked for incoming data to be read, the second contains objects that will receive outgoing data when there is room in their buffer, and the third includes those that may have an error (usually a combination of the input and output channel objects). The next step in the server is to set up the lists containing input sources and output destinations to be passed to select().

    # Sockets from which we expect to read
    inputs = [ server ]

    # Sockets to which we expect to write
    outputs = [ ]

Connections are added to and removed from these lists by the server main loop. Since this version of the server is going to wait for a socket to become writable before sending any data (instead of immediately sending the reply), each output connection needs a queue to act as a buffer for the data to be sent through it.

    # Outgoing message queues (socket:Queue)
    message_queues = {}

The main portion of the server program loops, calling select() to block and wait for network activity.

    while inputs:

        # Wait for at least one of the sockets to be ready for processing
        print >>sys.stderr, 'waiting for the next event'
        readable, writable, exceptional = select.select(inputs,
                                                        outputs,
                                                        inputs)

select() returns three new lists, containing subsets of the contents of the lists passed in. All the sockets in the readable list have incoming data buffered and available to be read. All the sockets in the writable list have free space in their buffer and can be written to. The sockets returned in exceptional have had an error (the actual definition of "exceptional condition" depends on the platform).

The "readable" sockets represent three possible cases. If the socket is the main "server" socket, the one being used to listen for connections, then the "readable" condition means it is ready to accept another incoming connection. In addition to adding the new connection to the list of inputs to monitor, this section sets the client socket to not block.

        # Handle inputs
        for s in readable:

            if s is server:
                # A "readable" socket is ready to accept a connection
                connection, client_address = s.accept()
                print >>sys.stderr, ' connection from', client_address
                connection.setblocking(0)
                inputs.append(connection)

                # Give the connection a queue for data we want to send
                message_queues[connection] = Queue.Queue()

The next case is an established connection with a client that has sent data. The data is read with recv(), and then it is placed on the queue so it can be sent through the socket and back to the client.

            else:
                data = s.recv(1024)
                if data:
                    # A readable client socket has data
                    print >>sys.stderr, ' received "%s" from %s' % \
                        (data, s.getpeername())
                    message_queues[s].put(data)
                    # Add output channel for response
                    if s not in outputs:
                        outputs.append(s)

A readable socket without data available is from a client that has disconnected, and the stream is ready to be closed.

                else:
                    # Interpret empty result as closed connection
                    print >>sys.stderr, ' closing', client_address
                    # Stop listening for input on the connection
                    if s in outputs:
                        outputs.remove(s)
                    inputs.remove(s)
                    s.close()

                    # Remove message queue
                    del message_queues[s]

There are fewer cases for the writable connections. If there is data in the queue for a connection, the next message is sent. Otherwise, the connection is removed from the list of output connections so that the next time through the loop, select() does not indicate that the socket is ready to send data.

        # Handle outputs
        for s in writable:
            try:
                next_msg = message_queues[s].get_nowait()
            except Queue.Empty:
                # No messages waiting so stop checking for writability.
                print >>sys.stderr, '', s.getpeername(), 'queue empty'
                outputs.remove(s)
            else:
                print >>sys.stderr, ' sending "%s" to %s' % \
                    (next_msg, s.getpeername())
                s.send(next_msg)

Finally, if there is an error with a socket, it is closed.
        # Handle "exceptional conditions"
        for s in exceptional:
            print >>sys.stderr, 'exception condition on', s.getpeername()
            # Stop listening for input on the connection
            inputs.remove(s)
            if s in outputs:
                outputs.remove(s)
            s.close()

            # Remove message queue
            del message_queues[s]

The example client program uses two sockets to demonstrate how the server with select() manages multiple connections at the same time. The client starts by connecting each TCP/IP socket to the server.

    import socket
    import sys

    messages = [ 'This is the message. ',
                 'It will be sent ',
                 'in parts.',
                 ]
    server_address = ('localhost', 10000)

    # Create a TCP/IP socket
    socks = [ socket.socket(socket.AF_INET, socket.SOCK_STREAM),
              socket.socket(socket.AF_INET, socket.SOCK_STREAM),
              ]

    # Connect the socket to the port where the server is listening
    print >>sys.stderr, 'connecting to %s port %s' % server_address
    for s in socks:
        s.connect(server_address)

Then it sends one piece of the message at a time via each socket and reads all responses available after writing new data.

    for message in messages:

        # Send messages on both sockets
        for s in socks:
            print >>sys.stderr, '%s: sending "%s"' % \
                (s.getsockname(), message)
            s.send(message)

        # Read responses on both sockets
        for s in socks:
            data = s.recv(1024)
            print >>sys.stderr, '%s: received "%s"' % \
                (s.getsockname(), data)
            if not data:
                print >>sys.stderr, 'closing socket', s.getsockname()
                s.close()

Run the server in one window and the client in another. The output will look like this, with different port numbers.

    $ python ./select_echo_server.py
    starting up on localhost port 10000
    waiting for the next event
     connection from ('127.0.0.1', 55472)
    waiting for the next event
     connection from ('127.0.0.1', 55473)
     received "This is the message. " from ('127.0.0.1', 55472)
    waiting for the next event
     received "This is the message. " from ('127.0.0.1', 55473)
     sending "This is the message. " to ('127.0.0.1', 55472)
    waiting for the next event
     ('127.0.0.1', 55472) queue empty
     sending "This is the message. " to ('127.0.0.1', 55473)
    waiting for the next event
     ('127.0.0.1', 55473) queue empty
    waiting for the next event
     received "It will be sent " from ('127.0.0.1', 55472)
     received "It will be sent " from ('127.0.0.1', 55473)
    waiting for the next event
     sending "It will be sent " to ('127.0.0.1', 55472)
     sending "It will be sent " to ('127.0.0.1', 55473)
    waiting for the next event
     ('127.0.0.1', 55472) queue empty
     ('127.0.0.1', 55473) queue empty
    waiting for the next event
     received "in parts." from ('127.0.0.1', 55472)
     received "in parts." from ('127.0.0.1', 55473)
    waiting for the next event
     sending "in parts." to ('127.0.0.1', 55472)
     sending "in parts." to ('127.0.0.1', 55473)
    waiting for the next event
     ('127.0.0.1', 55472) queue empty
     ('127.0.0.1', 55473) queue empty
    waiting for the next event
     closing ('127.0.0.1', 55473)
     closing ('127.0.0.1', 55473)
    waiting for the next event

The client output shows the data being sent and received using both sockets.

    $ python ./select_echo_multiclient.py
    connecting to localhost port 10000
    ('127.0.0.1', 55821): sending "This is the message. "
    ('127.0.0.1', 55822): sending "This is the message. "
    ('127.0.0.1', 55821): received "This is the message. "
    ('127.0.0.1', 55822): received "This is the message. "
    ('127.0.0.1', 55821): sending "It will be sent "
    ('127.0.0.1', 55822): sending "It will be sent "
    ('127.0.0.1', 55821): received "It will be sent "
    ('127.0.0.1', 55822): received "It will be sent "
    ('127.0.0.1', 55821): sending "in parts."
    ('127.0.0.1', 55822): sending "in parts."
    ('127.0.0.1', 55821): received "in parts."
    ('127.0.0.1', 55822): received "in parts."

11.2.2 Nonblocking I/O with Timeouts

select() also takes an optional fourth parameter, which is the number of seconds to wait before breaking off monitoring if no channels have become active.
Using a timeout value lets a main program call select() as part of a larger processing loop, taking other actions between checking for network input.

When the timeout expires, select() returns three empty lists. Updating the server example to use a timeout requires adding the extra argument to the select() call and handling the empty lists after select() returns.

        # Wait for at least one of the sockets to be ready for processing
        print >>sys.stderr, '\nwaiting for the next event'
        timeout = 1
        readable, writable, exceptional = select.select(inputs,
                                                        outputs,
                                                        inputs,
                                                        timeout)

        if not (readable or writable or exceptional):
            print >>sys.stderr, ' timed out, do some other work here'
            continue

This "slow" version of the client program pauses after sending each message to simulate latency or other delay in transmission.

    import socket
    import sys
    import time

    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Connect the socket to the port where the server is listening
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'connecting to %s port %s' % server_address
    sock.connect(server_address)

    time.sleep(1)

    messages = [ 'Part one of the message.',
                 'Part two of the message.',
                 ]
    amount_expected = len(''.join(messages))

    try:

        # Send data
        for message in messages:
            print >>sys.stderr, 'sending "%s"' % message
            sock.sendall(message)
            time.sleep(1.5)

        # Look for the response
        amount_received = 0

        while amount_received < amount_expected:
            data = sock.recv(16)
            amount_received += len(data)
            print >>sys.stderr, 'received "%s"' % data

    finally:
        print >>sys.stderr, 'closing socket'
        sock.close()

Running the new server with the slow client produces the following:

    $ python ./select_echo_server_timeout.py
    starting up on localhost port 10000
    waiting for the next event
     connection from ('127.0.0.1', 55480)
    waiting for the next event
     received "Part one of the message." from ('127.0.0.1', 55480)
    waiting for the next event
     sending "Part one of the message." to ('127.0.0.1', 55480)
    waiting for the next event
     ('127.0.0.1', 55480) queue empty
    waiting for the next event
     received "Part two of the message." from ('127.0.0.1', 55480)
    waiting for the next event
     sending "Part two of the message." to ('127.0.0.1', 55480)
    waiting for the next event
     ('127.0.0.1', 55480) queue empty
    waiting for the next event
     closing ('127.0.0.1', 55480)
    waiting for the next event

And this is the client output:

    $ python ./select_echo_slow_client.py
    connecting to localhost port 10000
    sending "Part one of the message."
    sending "Part two of the message."
    received "Part one of the "
    received "message.Part two"
    received " of the message."
    closing socket

11.2.3 Using poll()

The poll() function provides similar features to select(), but the underlying implementation is more efficient. The trade-off is that poll() is not supported under Windows, so programs using poll() are less portable.

An echo server built on poll() starts with the same socket configuration code used in the other examples.

    import select
    import socket
    import sys
    import Queue

    # Create a TCP/IP socket
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setblocking(0)

    # Bind the socket to the port
    server_address = ('localhost', 10000)
    print >>sys.stderr, 'starting up on %s port %s' % server_address
    server.bind(server_address)

    # Listen for incoming connections
    server.listen(5)

    # Keep up with the queues of outgoing messages
    message_queues = {}

The timeout value passed to poll() is represented in milliseconds, instead of seconds, so in order to pause for a full second, the timeout must be set to 1000.

    # Do not block forever (milliseconds)
    TIMEOUT = 1000

Python implements poll() with a class that manages the registered data channels being monitored. Channels are added by calling register(), with flags indicating which events are interesting for that channel. The full set of flags is listed in Table 11.1.

Table 11.1.
Event Flags for poll()

Event      Description
POLLIN     Input ready
POLLPRI    Priority input ready
POLLOUT    Able to receive output
POLLERR    Error
POLLHUP    Channel closed
POLLNVAL   Channel not open

The echo server will be setting up some sockets just for reading and others to be read from or written to. The appropriate combinations of flags are saved to the local variables READ_ONLY and READ_WRITE.

    # Commonly used flag sets
    READ_ONLY = ( select.POLLIN |
                  select.POLLPRI |
                  select.POLLHUP |
                  select.POLLERR )
    READ_WRITE = READ_ONLY | select.POLLOUT

The server socket is registered so that any incoming connections or data triggers an event.

    # Set up the poller
    poller = select.poll()
    poller.register(server, READ_ONLY)

Since poll() returns a list of tuples containing the file descriptor for the socket and the event flag, a mapping from file descriptor numbers to objects is needed to retrieve the socket to read or write from it.

    # Map file descriptors to socket objects
    fd_to_socket = { server.fileno(): server,
                   }

The server’s loop calls poll() and then processes the “events” returned by looking up the socket and taking action based on the flag in the event.

    while True:

        # Wait for at least one of the sockets to be ready for processing
        print >>sys.stderr, 'waiting for the next event'
        events = poller.poll(TIMEOUT)

        for fd, flag in events:

            # Retrieve the actual socket from its file descriptor
            s = fd_to_socket[fd]

As with select(), when the main server socket is “readable,” that really means there is a pending connection from a client. The new connection is registered with the READ_ONLY flags to watch for new data to come through it.
            # Handle inputs
            if flag & (select.POLLIN | select.POLLPRI):

                if s is server:
                    # A readable socket is ready to accept a connection
                    connection, client_address = s.accept()
                    print >>sys.stderr, ' connection', client_address
                    connection.setblocking(0)
                    fd_to_socket[ connection.fileno() ] = connection
                    poller.register(connection, READ_ONLY)

                    # Give the connection a queue for data to send
                    message_queues[connection] = Queue.Queue()

Sockets other than the server are existing clients with data buffered and waiting to be read. Use recv() to retrieve the data from the buffer.

                else:
                    data = s.recv(1024)

If recv() returns any data, it is placed into the outgoing queue for the socket, and the flags for that socket are changed using modify() so poll() will watch for the socket to be ready to receive data.

                    if data:
                        # A readable client socket has data
                        print >>sys.stderr, ' received "%s" from %s' % \
                            (data, s.getpeername())
                        message_queues[s].put(data)
                        # Add output channel for response
                        poller.modify(s, READ_WRITE)

An empty string returned by recv() means the client disconnected, so unregister() is used to tell the poll object to ignore the socket.

                    else:
                        # Interpret empty result as closed connection.
                        # Note: client_address still holds the address of the
                        # most recently accepted client, which is not
                        # necessarily the peer of the socket being closed.
                        print >>sys.stderr, ' closing', client_address
                        # Stop listening for input on the connection
                        poller.unregister(s)
                        s.close()

                        # Remove message queue
                        del message_queues[s]

The POLLHUP flag indicates a client that “hung up” the connection without closing it cleanly. The server stops polling clients that disappear.

            elif flag & select.POLLHUP:
                # Client hung up (client_address has the same caveat as above)
                print >>sys.stderr, ' closing', client_address, '(HUP)'
                # Stop listening for input on the connection
                poller.unregister(s)
                s.close()

The handling for writable sockets looks like the version used in the example for select(), except that modify() is used to change the flags for the socket in the poller, instead of removing it from the output list.

            elif flag & select.POLLOUT:
                # Socket is ready to send data, if there is any to send.
                try:
                    next_msg = message_queues[s].get_nowait()
                except Queue.Empty:
                    # No messages waiting so stop checking
                    print >>sys.stderr, s.getpeername(), 'queue empty'
                    poller.modify(s, READ_ONLY)
                else:
                    print >>sys.stderr, ' sending "%s" to %s' % \
                        (next_msg, s.getpeername())
                    s.send(next_msg)

And, finally, any events with POLLERR cause the server to close the socket.

            elif flag & select.POLLERR:
                print >>sys.stderr, ' exception on', s.getpeername()
                # Stop listening for input on the connection
                poller.unregister(s)
                s.close()

                # Remove message queue
                del message_queues[s]

When the poll-based server is run together with select_echo_multiclient.py (the client program that uses multiple sockets), this is the output.

    $ python ./select_poll_echo_server.py

    waiting for the next event
    waiting for the next event
     connection ('127.0.0.1', 62835)
    waiting for the next event
     connection ('127.0.0.1', 62836)
    waiting for the next event
     received "This is the message. " from ('127.0.0.1', 62835)
    waiting for the next event
     sending "This is the message. " to ('127.0.0.1', 62835)
    waiting for the next event
    ('127.0.0.1', 62835) queue empty
    waiting for the next event
     received "This is the message. " from ('127.0.0.1', 62836)
    waiting for the next event
     sending "This is the message. " to ('127.0.0.1', 62836)
    waiting for the next event
    ('127.0.0.1', 62836) queue empty
    waiting for the next event
     received "It will be sent " from ('127.0.0.1', 62835)
    waiting for the next event
     sending "It will be sent " to ('127.0.0.1', 62835)
    waiting for the next event
    ('127.0.0.1', 62835) queue empty
    waiting for the next event
     received "It will be sent " from ('127.0.0.1', 62836)
    waiting for the next event
     sending "It will be sent " to ('127.0.0.1', 62836)
    waiting for the next event
    ('127.0.0.1', 62836) queue empty
    waiting for the next event
     received "in parts." from ('127.0.0.1', 62835)
     received "in parts." from ('127.0.0.1', 62836)
    waiting for the next event
     sending "in parts." to ('127.0.0.1', 62835)
     sending "in parts." to ('127.0.0.1', 62836)
    waiting for the next event
    ('127.0.0.1', 62835) queue empty
    ('127.0.0.1', 62836) queue empty
    waiting for the next event
     closing ('127.0.0.1', 62836)
     closing ('127.0.0.1', 62836)
    waiting for the next event

11.2.4 Platform-Specific Options

Less portable options provided by select are epoll, the edge polling API supported by Linux; kqueue, which uses BSD’s kernel queue; and kevent, BSD’s kernel event interface. Refer to the operating system library documentation for more detail about how they work.

See Also:
select (http://docs.python.org/library/select.html) The standard library documentation for this module.
Socket Programming HOWTO (http://docs.python.org/howto/sockets.html) An instructional guide by Gordon McMillan, included in the standard library documentation.
socket (page 561) Low-level network communication.
SocketServer (page 609) Framework for creating network server applications.
asyncore (page 619) and asynchat (page 629) Asynchronous I/O framework.
UNIX Network Programming, Volume 1: The Sockets Networking API, 3/E By W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. Published by Addison-Wesley Professional, 2004. ISBN-10: 0131411551.

11.3 SocketServer—Creating Network Servers

Purpose          Creating network servers.
Python Version   1.4 and later

The SocketServer module is a framework for creating network servers. It defines classes for handling synchronous network requests (the server request-handler blocks until the request is completed) over TCP, UDP, UNIX streams, and UNIX datagrams. It also provides mix-in classes for easily converting servers to use a separate thread or process for each request.

Responsibility for processing a request is split between a server class and a request-handler class.
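As a rough illustration of that split, the following sketch uses Python 3, where the module is named socketserver (an assumption that differs from the Python 2 listings in this chapter). The names UpperCaseHandler and round_trip are hypothetical and exist only for this example. The server class is used unmodified, and all of the protocol logic sits in the handler.

```python
import socket
import socketserver
import threading


class UpperCaseHandler(socketserver.BaseRequestHandler):
    """Protocol side: read one chunk and send it back upper-cased."""

    def handle(self):
        data = self.request.recv(1024)
        self.request.sendall(data.upper())


def round_trip(payload):
    """Start a throwaway TCPServer, send payload, and return the reply."""
    # Port 0 asks the kernel to pick any free port.
    with socketserver.TCPServer(('127.0.0.1', 0), UpperCaseHandler) as server:
        # handle_request() services exactly one connection and returns.
        worker = threading.Thread(target=server.handle_request)
        worker.start()
        with socket.create_connection(server.server_address) as client:
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
    return reply
```

Calling round_trip(b'hello') should return the same bytes in upper case. Swapping in a different handler class changes the protocol without touching the server side.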
The server deals with the communication issues, such as listening on a socket and accepting connections, and the request handler deals with the “protocol” issues like interpreting incoming data, processing it, and sending data back to the client. This division of responsibility means that many applications can use one of the existing server classes without any modifications and provide a request-handler class for it to work with the custom protocol.

11.3.1 Server Types

There are five different server classes defined in SocketServer. BaseServer defines the API and is not intended to be instantiated and used directly. TCPServer uses TCP/IP sockets to communicate. UDPServer uses datagram sockets. UnixStreamServer and UnixDatagramServer use UNIX-domain sockets and are only available on UNIX platforms.

11.3.2 Server Objects

To construct a server, pass it an address on which to listen for requests and a request-handler class (not instance). The address format depends on the server type and the socket family used. Refer to the socket module documentation for details.

Once the server object is instantiated, use either handle_request() or serve_forever() to process requests. The serve_forever() method calls handle_request() in an infinite loop, but if an application needs to integrate the server with another event loop or use select() to monitor several sockets for different servers, it can call handle_request() directly.

11.3.3 Implementing a Server

When creating a server, it is usually sufficient to reuse one of the existing classes and provide a custom request handler class. For other cases, BaseServer includes several methods that can be overridden in a subclass.

• verify_request(request, client_address): Return True to process the request or False to ignore it. For example, a server could refuse requests from an IP range or if it is overloaded.
• process_request(request, client_address): Calls finish_request() to actually do the work of handling the request. It can also create a separate thread or process, as the mix-in classes do.
• finish_request(request, client_address): Creates a request handler instance using the class given to the server’s constructor. Calls handle() on the request handler to process the request.

11.3.4 Request Handlers

Request handlers do most of the work of receiving incoming requests and deciding what action to take. The handler is responsible for implementing the protocol on top of the socket layer (e.g., HTTP, XML-RPC, or AMQP). The request handler reads the request from the incoming data channel, processes it, and writes a response back out. Three methods are available to be overridden.

• setup(): Prepares the request handler for the request. In the StreamRequestHandler the setup() method creates file-like objects for reading from and writing to the socket.
• handle(): Does the real work for the request. Parses the incoming request, processes the data, and sends a response.
• finish(): Cleans up anything created during setup().

Many handlers can be implemented with only a handle() method.

11.3.5 Echo Example

This example implements a simple server/request handler pair that accepts TCP connections and echoes back any data sent by the client. It starts with the request handler.

    import logging
    import sys
    import SocketServer

    logging.basicConfig(level=logging.DEBUG,
                        format='%(name)s: %(message)s',
                        )

    class EchoRequestHandler(SocketServer.BaseRequestHandler):

        def __init__(self, request, client_address, server):
            self.logger = logging.getLogger('EchoRequestHandler')
            self.logger.debug('__init__')
            SocketServer.BaseRequestHandler.__init__(self, request,
                                                     client_address,
                                                     server)
            return

        def setup(self):
            self.logger.debug('setup')
            return SocketServer.BaseRequestHandler.setup(self)

        def handle(self):
            self.logger.debug('handle')

            # Echo the data back to the client
            data = self.request.recv(1024)
            self.logger.debug('recv()->"%s"', data)
            self.request.send(data)
            return

        def finish(self):
            self.logger.debug('finish')
            return SocketServer.BaseRequestHandler.finish(self)

The only method that actually needs to be implemented is EchoRequestHandler.handle(), but versions of all the methods described earlier are included to illustrate the sequence of calls made. The EchoServer class does nothing different from TCPServer, except log when each method is called.
    class EchoServer(SocketServer.TCPServer):

        def __init__(self, server_address,
                     handler_class=EchoRequestHandler,
                     ):
            self.logger = logging.getLogger('EchoServer')
            self.logger.debug('__init__')
            SocketServer.TCPServer.__init__(self, server_address,
                                            handler_class)
            return

        def server_activate(self):
            self.logger.debug('server_activate')
            SocketServer.TCPServer.server_activate(self)
            return

        def serve_forever(self, poll_interval=0.5):
            self.logger.debug('waiting for request')
            self.logger.info('Handling requests, press to quit')
            SocketServer.TCPServer.serve_forever(self, poll_interval)
            return

        def handle_request(self):
            self.logger.debug('handle_request')
            return SocketServer.TCPServer.handle_request(self)

        def verify_request(self, request, client_address):
            self.logger.debug('verify_request(%s, %s)',
                              request, client_address)
            return SocketServer.TCPServer.verify_request(self, request,
                                                         client_address)

        def process_request(self, request, client_address):
            self.logger.debug('process_request(%s, %s)',
                              request, client_address)
            return SocketServer.TCPServer.process_request(self, request,
                                                          client_address)

        def server_close(self):
            self.logger.debug('server_close')
            return SocketServer.TCPServer.server_close(self)

        def finish_request(self, request, client_address):
            self.logger.debug('finish_request(%s, %s)',
                              request, client_address)
            return SocketServer.TCPServer.finish_request(self, request,
                                                         client_address)

        def close_request(self, request_address):
            self.logger.debug('close_request(%s)', request_address)
            return SocketServer.TCPServer.close_request(self,
                                                        request_address)

        def shutdown(self):
            self.logger.debug('shutdown()')
            return SocketServer.TCPServer.shutdown(self)

The last step is to add a main program that sets up the server to run in a thread and sends it data to illustrate which methods are called as the data is echoed back.
    if __name__ == '__main__':
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = EchoServer(address, EchoRequestHandler)
        ip, port = server.server_address # what port was assigned?

        # Start the server in a thread
        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        logger = logging.getLogger('client')
        logger.info('Server on %s:%s', ip, port)

        # Connect to the server
        logger.debug('creating socket')
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        logger.debug('connecting to server')
        s.connect((ip, port))

        # Send the data
        message = 'Hello, world'
        logger.debug('sending data: "%s"', message)
        len_sent = s.send(message)

        # Receive a response
        logger.debug('waiting for response')
        response = s.recv(len_sent)
        logger.debug('response from server: "%s"', response)

        # Clean up
        server.shutdown()
        logger.debug('closing socket')
        s.close()
        logger.debug('done')
        server.socket.close()

Running the program produces the following.

    $ python SocketServer_echo.py

    EchoServer: __init__
    EchoServer: server_activate
    EchoServer: waiting for request
    EchoServer: Handling requests, press to quit
    client: Server on 127.0.0.1:62859
    client: creating socket
    client: connecting to server
    EchoServer: verify_request(, ('127.0.0.1', 62860))
    EchoServer: process_request(, ('127.0.0.1', 62860))
    EchoServer: finish_request(, ('127.0.0.1', 62860))
    EchoRequestHandler: __init__
    EchoRequestHandler: setup
    EchoRequestHandler: handle
    client: sending data: "Hello, world"
    EchoRequestHandler: recv()->"Hello, world"
    EchoRequestHandler: finish
    EchoServer: close_request()
    client: waiting for response
    client: response from server: "Hello, world"
    EchoServer: shutdown()
    client: closing socket
    client: done

Note: The port number used will change each time the program runs because the kernel allocates an available port automatically. To make the server listen on a specific port each time, provide that number in the address tuple instead of the 0.

Here is a condensed version of the same server, without the logging calls. Only the handle() method in the request-handler class needs to be provided.

    import SocketServer

    class EchoRequestHandler(SocketServer.BaseRequestHandler):

        def handle(self):
            # Echo the data back to the client
            data = self.request.recv(1024)
            self.request.send(data)
            return

    if __name__ == '__main__':
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = SocketServer.TCPServer(address, EchoRequestHandler)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        message = 'Hello, world'
        print 'Sending : "%s"' % message
        len_sent = s.send(message)

        # Receive a response
        response = s.recv(len_sent)
        print 'Received: "%s"' % response

        # Clean up
        server.shutdown()
        s.close()
        server.socket.close()

In this case, no special server class is required since the TCPServer handles all the server requirements.

    $ python SocketServer_echo_simple.py

    Sending : "Hello, world"
    Received: "Hello, world"

11.3.6 Threading and Forking

To add threading or forking support to a server, include the appropriate mix-in in the class hierarchy for the server. The mix-in classes override process_request() to start a new thread or process when a request is ready to be handled, and the work is done in the new child.

For threads, use ThreadingMixIn.
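To make the process_request() override concrete, here is a rough sketch of what such a mix-in might look like. It is written for Python 3's socketserver module rather than the Python 2 SocketServer used in this chapter, and it is not the standard library's implementation. SketchThreadMixIn, NameHandler, SketchServer, and handler_thread_name are hypothetical names used only for illustration; real programs should use the library's own mix-in.

```python
import socket
import socketserver
import threading


class SketchThreadMixIn:
    """Hypothetical mix-in: hand each request to its own thread."""

    def process_request(self, request, client_address):
        # Start a worker instead of handling the request inline.
        worker = threading.Thread(target=self._work,
                                  args=(request, client_address))
        worker.daemon = True
        worker.start()

    def _work(self, request, client_address):
        # Same steps the base class performs, but in the worker thread.
        try:
            self.finish_request(request, client_address)
        except Exception:
            self.handle_error(request, client_address)
        finally:
            self.shutdown_request(request)


class NameHandler(socketserver.BaseRequestHandler):
    """Reply with the name of the thread running the handler."""

    def handle(self):
        self.request.recv(1024)
        self.request.sendall(threading.current_thread().name.encode())


class SketchServer(SketchThreadMixIn, socketserver.TCPServer):
    pass


def handler_thread_name():
    """Return the name of the thread that handled one request."""
    with SketchServer(('127.0.0.1', 0), NameHandler) as server:
        loop = threading.Thread(target=server.serve_forever,
                                name='server-loop')
        loop.start()
        with socket.create_connection(server.server_address) as client:
            client.sendall(b'ping')
            name = client.recv(1024).decode()
        server.shutdown()
        loop.join()
    return name
```

The mix-in is listed before TCPServer in the class hierarchy so that its process_request() takes precedence. The value returned by handler_thread_name() is the name of a worker thread, not of the thread running the server loop.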
    import threading
    import SocketServer

    class ThreadedEchoRequestHandler(SocketServer.BaseRequestHandler):

        def handle(self):
            # Echo the data back to the client
            data = self.request.recv(1024)
            cur_thread = threading.currentThread()
            response = '%s: %s' % (cur_thread.getName(), data)
            self.request.send(response)
            return

    class ThreadedEchoServer(SocketServer.ThreadingMixIn,
                             SocketServer.TCPServer,
                             ):
        pass

    if __name__ == '__main__':
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = ThreadedEchoServer(address, ThreadedEchoRequestHandler)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()
        print 'Server loop running in thread:', t.getName()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        message = 'Hello, world'
        print 'Sending : "%s"' % message
        len_sent = s.send(message)

        # Receive a response
        response = s.recv(1024)
        print 'Received: "%s"' % response

        # Clean up
        server.shutdown()
        s.close()
        server.socket.close()

The response from this threaded server includes the identifier of the thread where the request is handled.

    $ python SocketServer_threaded.py

    Server loop running in thread: Thread-1
    Sending : "Hello, world"
    Received: "Thread-2: Hello, world"

For separate processes, use ForkingMixIn.
    import os
    import SocketServer

    class ForkingEchoRequestHandler(SocketServer.BaseRequestHandler):

        def handle(self):
            # Echo the data back to the client
            data = self.request.recv(1024)
            cur_pid = os.getpid()
            response = '%s: %s' % (cur_pid, data)
            self.request.send(response)
            return

    class ForkingEchoServer(SocketServer.ForkingMixIn,
                            SocketServer.TCPServer,
                            ):
        pass

    if __name__ == '__main__':
        import socket
        import threading

        address = ('localhost', 0) # let the kernel assign a port
        server = ForkingEchoServer(address, ForkingEchoRequestHandler)
        ip, port = server.server_address # what port was assigned?

        t = threading.Thread(target=server.serve_forever)
        t.setDaemon(True) # don't hang on exit
        t.start()
        print 'Server loop running in process:', os.getpid()

        # Connect to the server
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ip, port))

        # Send the data
        message = 'Hello, world'
        print 'Sending : "%s"' % message
        len_sent = s.send(message)

        # Receive a response
        response = s.recv(1024)
        print 'Received: "%s"' % response

        # Clean up
        server.shutdown()
        s.close()
        server.socket.close()

In this case, the process id of the child is included in the response from the server.

    $ python SocketServer_forking.py

    Server loop running in process: 12797
    Sending : "Hello, world"
    Received: "12798: Hello, world"

See Also:
SocketServer (http://docs.python.org/lib/module-SocketServer.html) Standard library documentation for this module.
asyncore (page 619) Use asyncore to create asynchronous servers that do not block while processing a request.
SimpleXMLRPCServer (page 714) XML-RPC server built using SocketServer.

11.4 asyncore—Asynchronous I/O

Purpose          Asynchronous I/O handler.
Python Version   1.5.2 and later

The asyncore module includes tools for working with I/O objects such as sockets so they can be managed asynchronously (instead of using multiple threads or processes).
The main class provided is dispatcher, a wrapper around a socket that provides hooks for handling events like connecting, reading, and writing when invoked from the main loop function, loop(). 11.4.1 Servers This example illustrates using asyncore in a server and client by reimplementing the EchoServer from the SocketServer examples. Three classes are used in the new implementation. The first, EchoServer, receives incoming connections from clients. This demonstration implementation closes down as soon as the first connec- tion is accepted, so it is easier to start and stop the server while experimenting with the code. import asyncore import logging class EchoServer(asyncore.dispatcher): """Receives connections and establishes handlers for each client. """ def __init__(self, address): self.logger = logging.getLogger(’EchoServer’) asyncore.dispatcher.__init__(self) self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.bind(address) self.address = self.socket.getsockname() self.logger.debug(’binding to %s’, self.address) self.listen(1) return 620 Networking def handle_accept(self): # Called when a client connects to the socket client_info = self.accept() self.logger.debug(’handle_accept() -> %s’, client_info[1]) EchoHandler(sock=client_info[0]) # Only deal with one client at a time, # so close as soon as the handler is set up. # Under normal conditions, the server # would run forever or until it received # instructions to stop. self.handle_close() return def handle_close(self): self.logger.debug(’handle_close()’) self.close() return Each time a new connection is accepted in handle_accept(), EchoServer cre- ates a new EchoHandler instance to manage it. The EchoServer and EchoHandler are defined in separate classes because they do different things. When EchoServer accepts a connection, a new socket is established. 
Rather than try to dispatch to individual clients within EchoServer, an EchoHandler is created to take advantage of the socket map maintained by asyncore.

    class EchoHandler(asyncore.dispatcher):
        """Handles echoing messages from a single client.
        """

        def __init__(self, sock, chunk_size=256):
            self.chunk_size = chunk_size
            logger_name = 'EchoHandler'
            self.logger = logging.getLogger(logger_name)
            asyncore.dispatcher.__init__(self, sock=sock)
            self.data_to_write = []
            return

        def writable(self):
            """Write if data has been received."""
            response = bool(self.data_to_write)
            self.logger.debug('writable() -> %s', response)
            return response

        def handle_write(self):
            """Write as much as possible of the
            most recent message received.
            """
            data = self.data_to_write.pop()
            sent = self.send(data[:self.chunk_size])
            if sent < len(data):
                # Put the unsent remainder back for the next pass
                remaining = data[sent:]
                self.data_to_write.append(remaining)
            self.logger.debug('handle_write() -> (%d) %r',
                              sent, data[:sent])
            if not self.writable():
                self.handle_close()

        def handle_read(self):
            """Read an incoming message from the client
            and put it into the outgoing queue.
            """
            data = self.recv(self.chunk_size)
            self.logger.debug('handle_read() -> (%d) %r',
                              len(data), data)
            self.data_to_write.insert(0, data)

        def handle_close(self):
            self.logger.debug('handle_close()')
            self.close()

11.4.2 Clients

To create a client based on asyncore, subclass dispatcher, and provide implementations for creating the socket, reading, and writing. For EchoClient, the socket is created in __init__() using the base-class method create_socket(). Alternative implementations of the method may be provided, but in this case, a TCP/IP socket is needed so the base-class version is sufficient.

    class EchoClient(asyncore.dispatcher):
        """Sends messages to the server and receives responses.
""" def __init__(self, host, port, message, chunk_size=128): self.message = message self.to_send = message self.received_data = [] 622 Networking self.chunk_size = chunk_size self.logger = logging.getLogger(’EchoClient’) asyncore.dispatcher.__init__(self) self.create_socket(socket.AF_INET, socket.SOCK_STREAM) self.logger.debug(’connecting to %s’, (host, port)) self.connect((host, port)) return The handle_connect() hook is present simply to show when it is called. Other types of clients that need to implement connection hand-shaking or protocol negotiation should do that work in handle_connect(). def handle_connect(self): self.logger.debug(’handle_connect()’) The handle_close() method is also presented to show when it is called during processing. The base-class version closes the socket correctly, so if an application does not need to do extra cleanup on close, the method does not need to be overridden. def handle_close(self): self.logger.debug(’handle_close()’) self.close() received_message = ’’.join(self.received_data) if received_message == self.message: self.logger.debug(’RECEIVED COPY OF MESSAGE’) else: self.logger.debug(’ERROR IN TRANSMISSION’) self.logger.debug(’EXPECTED "%s"’, self.message) self.logger.debug(’RECEIVED "%s"’, received_message) return The asyncore loop uses writable() and its sibling method readable() to decide what actions to take with each dispatcher. Actual use of poll() or select() on the sockets or file descriptors managed by each dispatcher is handled inside the asyncore code and does not need to be implemented in a program using asyncore. The program only needs to indicate whether the dispatcher wants to read or write data. In this client, writable() returns True as long as there is data to send to the server. readable() always returns True because the client needs to read all the data. def writable(self): self.logger.debug(’writable() -> %s’, bool(self.to_send)) return bool(self.to_send) 11.4. 
asyncore—Asynchronous I/O 623 def readable(self): self.logger.debug(’readable() -> True’) return True Each time through the processing loop when writable() responds positively, handle_write() is invoked. The EchoClient splits the message up into parts based on the size restriction given to demonstrate how a much larger multipart message could be transmitted using several iterations through the loop. Each time handle_write() is called, another part of the message is written, until it is completely consumed. def handle_write(self): sent = self.send(self.to_send[:self.chunk_size]) self.logger.debug(’handle_write() -> (%d) %r’, sent, self.to_send[:sent]) self.to_send = self.to_send[sent:] Similarly, when readable() responds positively and there is data to read, handle_read() is invoked. def handle_read(self): data = self.recv(self.chunk_size) self.logger.debug(’handle_read() -> (%d) %r’, len(data), data) self.received_data.append(data) 11.4.3 The Event Loop A short test script is included in the module. It sets up a server and client, and then runs asyncore.loop() to process the communications. Creating the clients registers them in a “map” kept internally by asyncore. The communication occurs as the loop iterates over the clients. When the client reads zero bytes from a socket that seems readable, the condition is interpreted as a closed connection and handle_close() is called. if __name__ == ’__main__’: import socket logging.basicConfig(level=logging.DEBUG, format=’%(name)-11s: %(message)s’, ) address = (’localhost’, 0) # let the kernel assign a port server = EchoServer(address) ip, port = server.address # find out which port was assigned 624 Networking message = open(’lorem.txt’, ’r’).read() logging.info(’Total message length: %d bytes’, len(message)) client = EchoClient(ip, port, message=message) asyncore.loop() This is the output of running the program. 
$ python asyncore_echo_server.py EchoServer : binding to (’127.0.0.1’, 63985) root : Total message length: 133 bytes EchoClient : connecting to (’127.0.0.1’, 63985) EchoClient : readable() -> True EchoClient : writable() -> True EchoServer : handle_accept() -> (’127.0.0.1’, 63986) EchoServer : handle_close() EchoClient : handle_connect() EchoClient : handle_write() -> (128) ’Lorem ipsum dolor sit amet, cons ectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamco rper, lectus ligula rutrum ’ EchoClient : readable() -> True EchoClient : writable() -> True EchoHandler: writable() -> False EchoHandler: handle_read() -> (128) ’Lorem ipsum dolor sit amet, conse ctetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcor per, lectus ligula rutrum ’ EchoClient : handle_write() -> (5) ’leo.\n’ EchoClient : readable() -> True EchoClient : writable() -> False EchoHandler: writable() -> True EchoHandler: handle_read() -> (5) ’leo.\n’ EchoHandler: handle_write() -> (128) ’Lorem ipsum dolor sit amet, cons ectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamco rper, lectus ligula rutrum ’ EchoHandler: writable() -> True EchoClient : readable() -> True EchoClient : writable() -> False EchoHandler: writable() -> True EchoClient : handle_read() -> (128) ’Lorem ipsum dolor sit amet, conse ctetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcor 11.4. asyncore—Asynchronous I/O 625 per, lectus ligula rutrum ’ EchoHandler: handle_write() -> (5) ’leo.\n’ EchoHandler: writable() -> False EchoHandler: handle_close() EchoClient : readable() -> True EchoClient : writable() -> False EchoClient : handle_read() -> (5) ’leo.\n’ EchoClient : readable() -> True EchoClient : writable() -> False EchoClient : handle_close() EchoClient : RECEIVED COPY OF MESSAGE EchoClient : handle_read() -> (0) ’’ In this example, the server, handler, and client objects are all being maintained in the same socket map by asyncore in a single process. 
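The socket-map idea can be sketched without asyncore itself. The following Python 3 example is an illustration only, not asyncore's implementation: Echoer, loop, and demo are hypothetical names, and a plain dictionary keyed by file descriptor stands in for the map. Each handler registers itself on creation and removes itself on close, and the loop asks every handler whether it wants to read or write before calling select().

```python
import select
import socket


class Echoer:
    """Hypothetical handler: echo one message back, then close."""

    def __init__(self, sock, channels):
        self.sock = sock
        self.channels = channels
        self.pending = b''
        channels[sock.fileno()] = self  # register in the map

    def readable(self):
        return not self.pending

    def writable(self):
        return bool(self.pending)

    def handle_read(self):
        data = self.sock.recv(1024)
        if data:
            self.pending = data
        else:
            self.close()  # zero bytes means the peer closed

    def handle_write(self):
        sent = self.sock.send(self.pending)
        self.pending = self.pending[sent:]
        if not self.pending:
            self.close()

    def close(self):
        del self.channels[self.sock.fileno()]  # leave the map
        self.sock.close()


def loop(channels, timeout=1.0):
    """Service registered channels until the map is empty."""
    while channels:
        want_read = [fd for fd, ch in channels.items() if ch.readable()]
        want_write = [fd for fd, ch in channels.items() if ch.writable()]
        readable, writable, _ = select.select(want_read, want_write,
                                              [], timeout)
        for fd in readable:
            channels[fd].handle_read()
        for fd in writable:
            if fd in channels:  # may have closed during this pass
                channels[fd].handle_write()


def demo():
    """Echo one message over a connected socket pair."""
    channels = {}
    ours, theirs = socket.socketpair()
    Echoer(theirs, channels)
    ours.sendall(b'ping')
    loop(channels)
    reply = ours.recv(1024)
    ours.close()
    return reply, channels
```

In demo(), the loop returns on its own once the single handler has echoed its message and removed itself from the dictionary.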
To separate the server from the client, instantiate them from separate scripts and run asyncore.loop() in both. When a dispatcher is closed, it is removed from the map maintained by asyncore, and the loop exits when the map is empty. 11.4.4 Working with Other Event Loops It is sometimes necessary to integrate the asyncore event loop with an event loop from the parent application. For example, a GUI application would not want the UI to block until all asynchronous transfers are handled—that would defeat the purpose of making them asynchronous. To make this sort of integration easy, asyncore.loop() accepts arguments to set a timeout and to limit the number of times the loop is run. The effect of these options on the client can be demonstrated with an HTTP client based on the version in the standard library documentation for asyncore. import asyncore import logging import socket from cStringIO import StringIO import urlparse class HttpClient(asyncore.dispatcher): def __init__(self, url): self.url = url self.logger = logging.getLogger(self.url) self.parsed_url = urlparse.urlparse(url) 626 Networking asyncore.dispatcher.__init__(self) self.write_buffer = ’GET %s HTTP/1.0\r\n\r\n’ % self.url self.read_buffer = StringIO() self.create_socket(socket.AF_INET, socket.SOCK_STREAM) address = (self.parsed_url.netloc, 80) self.logger.debug(’connecting to %s’, address) self.connect(address) def handle_connect(self): self.logger.debug(’handle_connect()’) def handle_close(self): self.logger.debug(’handle_close()’) self.close() def writable(self): is_writable = (len(self.write_buffer) > 0) if is_writable: self.logger.debug(’writable() -> %s’, is_writable) return is_writable def readable(self): self.logger.debug(’readable() -> True’) return True def handle_write(self): sent = self.send(self.write_buffer) self.logger.debug(’handle_write() -> "%s"’, self.write_buffer[:sent]) self.write_buffer = self.write_buffer[sent:] def handle_read(self): data = self.recv(8192) 
        self.logger.debug('handle_read() -> %d bytes', len(data))
        self.read_buffer.write(data)

This main program uses the client class in a while loop, reading or writing data once per iteration.

import asyncore
import logging

from asyncore_http_client import HttpClient

logging.basicConfig(level=logging.DEBUG,
                    format='%(name)s: %(message)s',
                    )

clients = [
    HttpClient('http://www.doughellmann.com/'),
]

loop_counter = 0
while asyncore.socket_map:
    loop_counter += 1
    logging.debug('loop_counter=%s', loop_counter)
    asyncore.loop(timeout=1, count=1)

Instead of a custom local while loop, asyncore.loop() could be called in the same manner from a GUI toolkit idle handler or other mechanism for doing a small amount of work when the UI is not busy with other event handlers.

$ python asyncore_loop.py

http://www.doughellmann.com/: connecting to ('www.doughellmann.com', 80)
root: loop_counter=1
http://www.doughellmann.com/: readable() -> True
http://www.doughellmann.com/: writable() -> True
http://www.doughellmann.com/: handle_connect()
http://www.doughellmann.com/: handle_write() -> "GET http://www.doughellmann.com/ HTTP/1.0

"
root: loop_counter=2
http://www.doughellmann.com/: readable() -> True
http://www.doughellmann.com/: handle_read() -> 1448 bytes
root: loop_counter=3
http://www.doughellmann.com/: readable() -> True
http://www.doughellmann.com/: handle_read() -> 2896 bytes
root: loop_counter=4
http://www.doughellmann.com/: readable() -> True
http://www.doughellmann.com/: handle_read() -> 1318 bytes
root: loop_counter=5
http://www.doughellmann.com/: readable() -> True
http://www.doughellmann.com/: handle_close()
http://www.doughellmann.com/: handle_read() -> 0 bytes

11.4.5 Working with Files

Normally, asyncore is used with sockets, but sometimes it is useful to read files asynchronously, too (to use files when testing network servers without requiring the network setup, or to read or write large data files in parts, for example).
For these situations, asyncore provides the file_dispatcher and file_wrapper classes. This example implements an asynchronous reader for files by responding with another portion of the data each time handle_read() is called.

import asyncore


class FileReader(asyncore.file_dispatcher):

    def writable(self):
        return False

    def handle_read(self):
        data = self.recv(64)
        print 'READ: (%d)\n%r' % (len(data), data)

    def handle_expt(self):
        # Ignore events that look like out of band data
        pass

    def handle_close(self):
        self.close()

To use FileReader(), give it an open file handle as the only argument to the constructor.

reader = FileReader(open('lorem.txt', 'r'))
asyncore.loop()

Note: This example was tested under Python 2.7. For Python 2.5 and earlier, file_dispatcher does not automatically convert an open file to a file descriptor. Use os.open() to open the file instead, and pass the descriptor to FileReader.

Running the program produces this output.

$ python asyncore_file_dispatcher.py

READ: (64)
'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec\n'
READ: (64)
'egestas, enim et consectetuer ullamcorper, lectus ligula rutrum '
READ: (5)
'leo.\n'
READ: (0)
''

See Also:
asyncore (http://docs.python.org/library/asyncore.html) The standard library documentation for this module.
asynchat (page 629) The asynchat module builds on asyncore to provide a framework for implementing protocols based on passing messages back and forth using a set protocol.
SocketServer (page 609) The SocketServer module section includes another example of the EchoServer with threading and forking variants.

11.5 asynchat—Asynchronous Protocol Handler

Purpose Asynchronous network communication protocol handler.
Python Version 1.5.2 and later

The asynchat module builds on asyncore to provide a framework for implementing protocols based on passing messages back and forth between server and client.
The async_chat class is an asyncore.dispatcher subclass that receives data and looks for a message terminator. The subclass only needs to specify what to do when data comes in and how to respond once the terminator is found. Outgoing data is queued for transmission via FIFO objects managed by async_chat.

11.5.1 Message Terminators

Incoming messages are broken up based on terminators, which are managed for each async_chat instance via set_terminator(). There are three possible configurations.

1. If a string argument is passed to set_terminator(), the message is considered complete when that string appears in the input data.
2. If a numeric argument is passed, the message is considered complete when that many bytes have been read.
3. If None is passed, message termination is not managed by async_chat.

The next EchoServer example uses both a simple string terminator and a message length terminator, depending on the context of the incoming data. The HTTP request handler example in the standard library documentation offers another example of how to change the terminator based on the context. It uses a literal terminator while reading HTTP headers and a length value to terminate the HTTP POST request body.

11.5.2 Server and Handler

To make it easier to understand how asynchat is different from asyncore, the examples here duplicate the functionality of the EchoServer example from the asyncore discussion. The same pieces are needed: a server object to accept connections, handler objects to deal with communication with each client, and client objects to initiate the conversation.

The EchoServer implementation with asynchat is essentially the same as the one created for the asyncore example, but it has fewer logging calls:

import asyncore
import logging
import socket

from asynchat_echo_handler import EchoHandler


class EchoServer(asyncore.dispatcher):
    """Receives connections and establishes handlers for each client.
    """

    def __init__(self, address):
        asyncore.dispatcher.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.bind(address)
        self.address = self.socket.getsockname()
        self.listen(1)
        return

    def handle_accept(self):
        # Called when a client connects to our socket
        client_info = self.accept()
        EchoHandler(sock=client_info[0])
        # Only deal with one client at a time,
        # so close as soon as the handler is set up.
        # Under normal conditions, the server
        # would run forever or until it received
        # instructions to stop.
        self.handle_close()
        return

    def handle_close(self):
        self.close()

This version of EchoHandler is based on asynchat.async_chat instead of the asyncore.dispatcher. It operates at a slightly higher level of abstraction, so reading and writing are handled automatically. The handler needs to know four things:

• what to do with incoming data (by overriding collect_incoming_data())
• how to recognize the end of an incoming message (via set_terminator())
• what to do when a complete message is received (in found_terminator())
• what data to send (using push())

The example application has two operating modes. It is either waiting for a command of the form ECHO length\n or waiting for the data to be echoed. The mode is toggled back and forth by setting an instance variable process_data to the method to be invoked when the terminator is found and then changing the terminator, as appropriate.

import asynchat
import logging


class EchoHandler(asynchat.async_chat):
    """Handles echoing messages from a single client.
    """

    # Artificially reduce buffer sizes to illustrate
    # sending and receiving partial messages.
    ac_in_buffer_size = 128
    ac_out_buffer_size = 128

    def __init__(self, sock):
        self.received_data = []
        self.logger = logging.getLogger('EchoHandler')
        asynchat.async_chat.__init__(self, sock)
        # Start looking for the ECHO command
        self.process_data = self._process_command
        self.set_terminator('\n')
        return

    def collect_incoming_data(self, data):
        """Read a piece of an incoming message from the client
        and save it until the terminator is found.
        """
        self.logger.debug(
            'collect_incoming_data() -> (%d bytes) %r',
            len(data), data)
        self.received_data.append(data)

    def found_terminator(self):
        """The end of a command or message has been seen."""
        self.logger.debug('found_terminator()')
        self.process_data()

    def _process_command(self):
        """Have the full ECHO command"""
        command = ''.join(self.received_data)
        self.logger.debug('_process_command() %r', command)
        # The command has the form "ECHO length"
        command_verb, command_arg = command.strip().split(' ')
        expected_data_len = int(command_arg)
        self.set_terminator(expected_data_len)
        self.process_data = self._process_message
        self.received_data = []

    def _process_message(self):
        """Have read the entire message."""
        to_echo = ''.join(self.received_data)
        self.logger.debug('_process_message() echoing %r',
                          to_echo)
        self.push(to_echo)
        # Disconnect after sending the entire response
        # since we only want to do one thing at a time
        self.close_when_done()

As soon as the complete command is found, the handler switches to message-processing mode and waits for the complete set of text to be received. When all the data is available, it is pushed onto the outgoing channel. The handler is set up to be closed once the data is sent.

11.5.3 Client

The client works in much the same way as the handler. As with the asyncore implementation, the message to be sent is an argument to the client's constructor. When the socket connection is established, handle_connect() is called so the client can send the command and message data.
The command is pushed directly, but a special “producer” class is used for the message text. The producer is polled for chunks of data to send out over the network. When the producer returns an empty string, it is assumed to be exhausted and writing stops.

The client expects just the message data in response, so it sets an integer terminator and collects data in a list until the entire message has been received.

import asynchat
import logging
import socket


class EchoClient(asynchat.async_chat):
    """Sends messages to the server and receives responses.
    """

    # Artificially reduce buffer sizes to show
    # sending and receiving partial messages.
    ac_in_buffer_size = 128
    ac_out_buffer_size = 128

    def __init__(self, host, port, message):
        self.message = message
        self.received_data = []
        self.logger = logging.getLogger('EchoClient')
        asynchat.async_chat.__init__(self)
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.logger.debug('connecting to %s', (host, port))
        self.connect((host, port))
        return

    def handle_connect(self):
        self.logger.debug('handle_connect()')
        # Send the command
        self.push('ECHO %d\n' % len(self.message))
        # Send the data
        self.push_with_producer(
            EchoProducer(self.message,
                         buffer_size=self.ac_out_buffer_size)
            )
        # We expect the data to come back as-is,
        # so set a length-based terminator
        self.set_terminator(len(self.message))

    def collect_incoming_data(self, data):
        """Read a piece of the incoming message from the server
        and save it until the whole message has arrived.
        """
        self.logger.debug(
            'collect_incoming_data() -> (%d) %r',
            len(data), data)
        self.received_data.append(data)

    def found_terminator(self):
        self.logger.debug('found_terminator()')
        received_message = ''.join(self.received_data)
        if received_message == self.message:
            self.logger.debug('RECEIVED COPY OF MESSAGE')
        else:
            self.logger.debug('ERROR IN TRANSMISSION')
            self.logger.debug('EXPECTED %r', self.message)
            self.logger.debug('RECEIVED %r', received_message)
        return


class EchoProducer(asynchat.simple_producer):

    logger = logging.getLogger('EchoProducer')

    def more(self):
        response = asynchat.simple_producer.more(self)
        self.logger.debug('more() -> (%s bytes) %r',
                          len(response), response)
        return response

11.5.4 Putting It All Together

The main program for this example sets up the client and server in the same asyncore main loop.

import asyncore
import logging
import socket

from asynchat_echo_server import EchoServer
from asynchat_echo_client import EchoClient

logging.basicConfig(level=logging.DEBUG,
                    format='%(name)-11s: %(message)s',
                    )

address = ('localhost', 0)  # let the kernel give us a port
server = EchoServer(address)
ip, port = server.address  # find out what port we were given

message_data = open('lorem.txt', 'r').read()
client = EchoClient(ip, port, message=message_data)

asyncore.loop()

Normally, they would run in separate processes, but this makes it easier to show the combined output.

$ python asynchat_echo_main.py

EchoClient : connecting to ('127.0.0.1', 52590)
EchoClient : handle_connect()
EchoProducer: more() -> (128 bytes) 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcorper, lectus ligula rutrum\n'
EchoProducer: more() -> (38 bytes) 'leo, a elementum elit tortor eu quam.\n'
EchoProducer: more() -> (0 bytes) ''
EchoHandler: collect_incoming_data() -> (8 bytes) 'ECHO 166'
EchoHandler: found_terminator()
EchoHandler: _process_command() 'ECHO 166'
EchoHandler: collect_incoming_data() -> (119 bytes) 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcorper, lectus ligul'
EchoHandler: collect_incoming_data() -> (47 bytes) 'a rutrum\nleo, a elementum elit tortor eu quam.\n'
EchoHandler: found_terminator()
EchoHandler: _process_message() echoing 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcorper, lectus ligula rutrum\nleo, a elementum elit tortor eu quam.\n'
EchoClient : collect_incoming_data() -> (128) 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec\negestas, enim et consectetuer ullamcorper, lectus ligula rutrum\n'
EchoClient : collect_incoming_data() -> (38) 'leo, a elementum elit tortor eu quam.\n'
EchoClient : found_terminator()
EchoClient : RECEIVED COPY OF MESSAGE

See Also:
asynchat (http://docs.python.org/library/asynchat.html) The standard library documentation for this module.
asyncore (page 619) The asyncore module implements a lower-level asynchronous I/O event loop.

Chapter 12 THE INTERNET

The Internet is a pervasive aspect of modern computing. Even small, single-use scripts frequently interact with remote services to send or receive data. Python's rich set of tools for working with web protocols makes it well suited for programming web-based applications, either as a client or a server.

The urlparse module manipulates URL strings, splitting and combining their components, and is useful in clients and servers.

There are two client-side APIs for accessing web resources.
The original urllib and updated urllib2 offer similar APIs for retrieving content remotely, but urllib2 is easier to extend with new protocols and the urllib2.Request provides a way to add custom headers to outgoing requests.

HTTP POST requests are usually “form encoded” with urllib. Binary data sent through a POST should be encoded with base64 first, to comply with the message format standard.

Well-behaved clients that access many sites as spiders or crawlers should use robotparser to ensure they have permission before placing a heavy load on the remote server.

To create a custom web server with Python, without requiring any external frameworks, use BaseHTTPServer as a starting point. It handles the HTTP protocol, so the only customization needed is the application code for responding to the incoming requests.

Session state in the server can be managed through cookies created and parsed by the Cookie module. Full support for expiration, path, domain, and other cookie settings makes it easy to configure the session.

The uuid module is used for generating identifiers for resources that need unique values. UUIDs are good for automatically generating Uniform Resource Name (URN) values, where the name of the resource needs to be unique but does not need to convey any meaning.

Python's standard library includes support for two web-based remote procedure-call mechanisms. The JavaScript Object Notation (JSON) encoding scheme used in AJAX communication is implemented in json. It works equally well in the client or the server. Complete XML-RPC client and server libraries are also included in xmlrpclib and SimpleXMLRPCServer, respectively.

12.1 urlparse—Split URLs into Components

Purpose Split URL into components.
Python Version 1.4 and later

The urlparse module provides functions for breaking URLs down into their component parts, as defined by the relevant RFCs.
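As a rough orientation, the sketch below shows the kind of decomposition this module performs. It assumes Python 3, where the functions of the urlparse module described here are found in urllib.parse. The URL is made up for illustration, and parse_qs() is used only to show that the query component can be decoded further.

```python
# Sketch only: assumes Python 3, where urlparse's functions
# are available from urllib.parse.
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL used purely for illustration.
url = 'http://user:pwd@NetLoc:80/path?query=arg&query=other#frag'

parts = urlsplit(url)
query = parse_qs(parts.query)  # maps each name to a list of values

print('scheme  :', parts.scheme)
print('hostname:', parts.hostname)  # host only, lowercased
print('port    :', parts.port)      # converted to an integer
print('query   :', query)
```

The attribute names match the ones covered in the rest of this section, so the same approach should carry over to the Python 2 examples with only the import changed.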
12.1.1 Parsing

The return value from the urlparse() function is an object that acts like a tuple with six elements.

from urlparse import urlparse

url = 'http://netloc/path;param?query=arg#frag'
parsed = urlparse(url)
print parsed

The parts of the URL available through the tuple interface are the scheme, network location, path, path-segment parameters (separated from the path by a semicolon), query, and fragment.

$ python urlparse_urlparse.py

ParseResult(scheme='http', netloc='netloc', path='/path',
params='param', query='query=arg', fragment='frag')

Although the return value acts like a tuple, it is really based on a namedtuple, a subclass of tuple that supports accessing the parts of the URL via named attributes as well as indexes. In addition to being easier to use for the programmer, the attribute API also offers access to several values not available in the tuple API.

from urlparse import urlparse

url = 'http://user:pwd@NetLoc:80/path;param?query=arg#frag'
parsed = urlparse(url)
print 'scheme  :', parsed.scheme
print 'netloc  :', parsed.netloc
print 'path    :', parsed.path
print 'params  :', parsed.params
print 'query   :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lowercase)'
print 'port    :', parsed.port

The username and password are available when present in the input URL and set to None when not. The hostname is the host portion of netloc, without the user information or port, converted to all lowercase. The port is converted to an integer when present and None when not.

$ python urlparse_urlparseattrs.py

scheme  : http
netloc  : user:pwd@NetLoc:80
path    : /path
params  : param
query   : query=arg
fragment: frag
username: user
password: pwd
hostname: netloc (netloc in lowercase)
port    : 80

The urlsplit() function is an alternative to urlparse(). It behaves a little differently because it does not split the parameters from the URL.
This is useful for URLs following RFC 2396, which supports parameters for each segment of the path. from urlparse import urlsplit url = ’http://user:pwd@NetLoc:80/p1;param/p2;param?query=arg#frag’ parsed = urlsplit(url) print parsed print ’scheme :’, parsed.scheme 640 The Internet print ’netloc :’, parsed.netloc print ’path :’, parsed.path print ’query :’, parsed.query print ’fragment:’, parsed.fragment print ’username:’, parsed.username print ’password:’, parsed.password print ’hostname:’, parsed.hostname, ’(netloc in lowercase)’ print ’port :’, parsed.port Since the parameters are not split out, the tuple API will show five elements instead of six, and there is no params attribute. $ python urlparse_urlsplit.py SplitResult(scheme=’http’, netloc=’user:pwd@NetLoc:80’, path=’/p1;param/p2;param’, query=’query=arg’, fragment=’frag’) scheme : http netloc : user:pwd@NetLoc:80 path : /p1;param/p2;param query : query=arg fragment: frag username: user password: pwd hostname: netloc (netloc in lowercase) port : 80 To simply strip the fragment identifier from a URL, such as when finding a base page name from a URL, use urldefrag(). from urlparse import urldefrag original = ’http://netloc/path;param?query=arg#frag’ print ’original:’, original url, fragment = urldefrag(original) print ’url :’, url print ’fragment:’, fragment The return value is a tuple containing the base URL and the fragment. $ python urlparse_urldefrag.py original: http://netloc/path;param?query=arg#frag url : http://netloc/path;param?query=arg fragment: frag 12.1. urlparse—Split URLs into Components 641 12.1.2 Unparsing There are several ways to assemble the parts of a split URL back together into a single string. The parsed URL object has a geturl() method. from urlparse import urlparse original = ’http://netloc/path;param?query=arg#frag’ print ’ORIG :’, original parsed = urlparse(original) print ’PARSED:’, parsed.geturl() geturl() only works on the object returned by urlparse() or urlsplit(). 
$ python urlparse_geturl.py ORIG : http://netloc/path;param?query=arg#frag PARSED: http://netloc/path;param?query=arg#frag A regular tuple containing strings can be combined into a URL with urlun- parse(). from urlparse import urlparse, urlunparse original = ’http://netloc/path;param?query=arg#frag’ print ’ORIG :’, original parsed = urlparse(original) print ’PARSED:’, type(parsed), parsed t = parsed[:] print ’TUPLE :’, type(t), t print ’NEW :’, urlunparse(t) While the ParseResult returned by urlparse() can be used as a tuple, this example explicitly creates a new tuple to show that urlunparse() works with normal tuples, too. $ python urlparse_urlunparse.py ORIG : http://netloc/path;param?query=arg#frag PARSED: ParseResult(scheme=’http’, netloc=’netloc’, path=’/path’, params=’param’, query=’query=arg’, fragment=’frag’) 642 The Internet TUPLE : (’http’, ’netloc’, ’/path’, ’param’, ’query=arg’, ’frag’) NEW : http://netloc/path;param?query=arg#frag If the input URL included superfluous parts, those may be dropped from the reconstructed URL. from urlparse import urlparse, urlunparse original = ’http://netloc/path;?#’ print ’ORIG :’, original parsed = urlparse(original) print ’PARSED:’, type(parsed), parsed t = parsed[:] print ’TUPLE :’, type(t), t print ’NEW :’, urlunparse(t) In this case, parameters, query, and fragment are all missing in the original URL. The new URL does not look the same as the original, but it is equivalent according to the standard. $ python urlparse_urlunparseextra.py ORIG : http://netloc/path;?# PARSED: ParseResult(scheme=’http’, netloc=’netloc’, path=’/path’, params=’’, query=’’, fragment=’’) TUPLE : (’http’, ’netloc’, ’/path’, ’’, ’’, ’’) NEW : http://netloc/path 12.1.3 Joining In addition to parsing URLs, urlparse includes urljoin() for constructing absolute URLs from relative fragments. 
from urlparse import urljoin print urljoin(’http://www.example.com/path/file.html’, ’anotherfile.html’) print urljoin(’http://www.example.com/path/file.html’, ’../anotherfile.html’) 12.1. urlparse—Split URLs into Components 643 In the example, the relative portion of the path (“../”) is taken into account when the second URL is computed. $ python urlparse_urljoin.py http://www.example.com/path/anotherfile.html http://www.example.com/anotherfile.html Nonrelative paths are handled in the same way as os.path.join() handles them. from urlparse import urljoin print urljoin(’http://www.example.com/path/’, ’/subpath/file.html’) print urljoin(’http://www.example.com/path/’, ’subpath/file.html’) If the path being joined to the URL starts with a slash (/), it resets the URL’s path to the top level. If it does not start with a slash, it is appended to the end of the URL’s path. $ python urlparse_urljoin_with_path.py http://www.example.com/subpath/file.html http://www.example.com/path/subpath/file.html See Also: urlparse (http://docs.python.org/lib/module-urlparse.html) Standard library docu- mentation for this module. urllib (page 651) Retrieve the contents of a resource identified by a URL. urllib2 (page 657) Alternative API for accessing remote URLs. RFC 1738 (http://tools.ietf.org/html/rfc1738.html) Uniform Resource Locator (URL) syntax. RFC 1808 (http://tools.ietf.org/html/rfc1808.html) Relative URLs. RFC 2396 (http://tools.ietf.org/html/rfc2396.html) Uniform Resource Identifier (URI) generic syntax. RFC 3986 (http://tools.ietf.org/html/rfc3986.html) Uniform Resource Identifier (URI) syntax. 644 The Internet 12.2 BaseHTTPServer—Base Classes for Implementing Web Servers Purpose BaseHTTPServer includes classes that can form the basis of a web server. Python Version 1.4 and later BaseHTTPServer uses classes from SocketServer to create base classes for making HTTP servers. 
HTTPServer can be used directly, but the BaseHTTPRequestHandler is intended to be extended to handle each protocol method (GET, POST, etc.).

12.2.1 HTTP GET

To add support for an HTTP method in a request-handler class, implement the method do_METHOD(), replacing METHOD with the name of the HTTP method (e.g., do_GET(), do_POST(), etc.). For consistency, the request-handler methods take no arguments. All the parameters for the request are parsed by BaseHTTPRequestHandler and stored as instance attributes of the request instance.

This example request handler illustrates how to return a response to the client and includes some local attributes that can be useful in building the response.

from BaseHTTPServer import BaseHTTPRequestHandler
import urlparse


class GetHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        parsed_path = urlparse.urlparse(self.path)
        message_parts = [
            'CLIENT VALUES:',
            'client_address=%s (%s)' % (self.client_address,
                                        self.address_string()),
            'command=%s' % self.command,
            'path=%s' % self.path,
            'real path=%s' % parsed_path.path,
            'query=%s' % parsed_path.query,
            'request_version=%s' % self.request_version,
            '',
            'SERVER VALUES:',
            'server_version=%s' % self.server_version,
            'sys_version=%s' % self.sys_version,
            'protocol_version=%s' % self.protocol_version,
            '',
            'HEADERS RECEIVED:',
            ]
        for name, value in sorted(self.headers.items()):
            message_parts.append('%s=%s' % (name, value.rstrip()))
        message_parts.append('')
        message = '\r\n'.join(message_parts)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(message)
        return


if __name__ == '__main__':
    from BaseHTTPServer import HTTPServer
    server = HTTPServer(('localhost', 8080), GetHandler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()

The message text is assembled and then written to wfile, the file handle wrapping the response socket. Each response needs a response code, set via send_response().
If an error code is used (404, 501, etc.), an appropriate default error message is included in the header, or a message can be passed with the error code. To run the request handler in a server, pass it to the constructor of HTTPServer, as in the __main__ processing portion of the sample script. Then start the server. $ python BaseHTTPServer_GET.py Starting server, use to stop In a separate terminal, use curl to access it. $ curl -i http://localhost:8080/?foo=bar HTTP/1.0 200 OK Server: BaseHTTP/0.3 Python/2.5.1 Date: Sun, 09 Dec 2007 16:00:34 GMT CLIENT VALUES: client_address=(’127.0.0.1’, 51275) (localhost) 646 The Internet command=GET path=/?foo=bar real path=/ query=foo=bar request_version=HTTP/1.1 SERVER VALUES: server_version=BaseHTTP/0.3 sys_version=Python/2.5.1 protocol_version=HTTP/1.0 12.2.2 HTTP POST Supporting POST requests is a little more work because the base class does not parse the form data automatically. The cgi module provides the FieldStorage class, which knows how to parse the form if it is given the correct inputs. from BaseHTTPServer import BaseHTTPRequestHandler import cgi class PostHandler(BaseHTTPRequestHandler): def do_POST(self): # Parse the form data posted form = cgi.FieldStorage( fp=self.rfile, headers=self.headers, environ={’REQUEST_METHOD’:’POST’, ’CONTENT_TYPE’:self.headers[’Content-Type’], }) # Begin the response self.send_response(200) self.end_headers() self.wfile.write(’Client: %s\n’ % str(self.client_address)) self.wfile.write(’User-agent: %s\n’ % str(self.headers[’user-agent’])) self.wfile.write(’Path: %s\n’ % self.path) self.wfile.write(’Form data:\n’) # Echo back information about what was posted in the form for field in form.keys(): field_item = form[field] 12.2. 
BaseHTTPServer—Base Classes for Implementing Web Servers 647 if field_item.filename: # The field contains an uploaded file file_data = field_item.file.read() file_len = len(file_data) del file_data self.wfile.write( ’\tUploaded %s as "%s"(%d bytes)\n’ %\ (field, field_item.filename, file_len)) else: # Regular form value self.wfile.write(’\t%s=%s\n’ % (field, form[field].value)) return if __name__ == ’__main__’: from BaseHTTPServer import HTTPServer server = HTTPServer((’localhost’, 8080), PostHandler) print ’Starting server, use to stop’ server.serve_forever() Run the server in one window. $ python BaseHTTPServer_POST.py Starting server, use to stop The arguments to curl can include form data to be posted to the server by using the -F option. The last argument, -F datafile=@BaseHTTPServer_GET.py, posts the contents of the file BaseHTTPServer_GET.py to illustrate reading file data from the form. $ curl http://localhost:8080/ -F name=dhellmann -F foo=bar \ -F datafile=@BaseHTTPServer_GET.py Client: (’127.0.0.1’, 65029) User-agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3 Path: / Form data: Uploaded datafile as "BaseHTTPServer_GET.py" (2580 bytes) foo=bar name=dhellmann 648 The Internet 12.2.3 Threading and Forking HTTPServer is a simple subclass of SocketServer.TCPServer and does not use multiple threads or processes to handle requests. To add threading or forking, create a new class using the appropriate mix-in from SocketServer. 
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn
import threading


class Handler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        message = threading.currentThread().getName()
        self.wfile.write(message)
        self.wfile.write('\n')
        return


class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""


if __name__ == '__main__':
    server = ThreadedHTTPServer(('localhost', 8080), Handler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()

Run the server in the same way as the other examples.

$ python BaseHTTPServer_threads.py

Starting server, use <Ctrl-C> to stop

Each time the server receives a request, it starts a new thread or process to handle it.

$ curl http://localhost:8080/

Thread-1

$ curl http://localhost:8080/

Thread-2

$ curl http://localhost:8080/

Thread-3

Swapping ForkingMixIn for ThreadingMixIn would achieve similar results, using separate processes instead of threads.

12.2.4 Handling Errors

Handle errors by calling send_error(), passing the appropriate error code and an optional error message. The entire response (with headers, status code, and body) is generated automatically.

from BaseHTTPServer import BaseHTTPRequestHandler


class ErrorHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        self.send_error(404)
        return


if __name__ == '__main__':
    from BaseHTTPServer import HTTPServer
    server = HTTPServer(('localhost', 8080), ErrorHandler)
    print 'Starting server, use <Ctrl-C> to stop'
    server.serve_forever()

In this case, a 404 error is always returned.

$ python BaseHTTPServer_errors.py

Starting server, use <Ctrl-C> to stop

The error message is reported to the client using an HTML document, as well as the header to indicate an error code.
$ curl -i http://localhost:8080/

HTTP/1.0 404 Not Found
Server: BaseHTTP/0.3 Python/2.5.1
Date: Sun, 09 Dec 2007 15:49:44 GMT
Content-Type: text/html
Connection: close

Error response

Error response

Error code 404.

Message: Not Found.

Error code explanation: 404 = Nothing matches the given URI. 12.2.5 Setting Headers The send_header method adds header data to the HTTP response. It takes two argu- ments: the name of the header and the value. from BaseHTTPServer import BaseHTTPRequestHandler import urlparse import time class GetHandler(BaseHTTPRequestHandler): def do_GET(self): self.send_response(200) self.send_header(’Last-Modified’, self.date_time_string(time.time())) self.end_headers() self.wfile.write(’Response body\n’) return if __name__ == ’__main__’: from BaseHTTPServer import HTTPServer server = HTTPServer((’localhost’, 8080), GetHandler) print ’Starting server, use to stop’ server.serve_forever() This example sets the Last-Modified header to the current timestamp, format- ted according to RFC 2822. 12.3. urllib—Network Resource Access 651 $ curl -i http://localhost:8080/ HTTP/1.0 200 OK Server: BaseHTTP/0.3 Python/2.7 Date: Sun, 10 Oct 2010 13:58:32 GMT Last-Modified: Sun, 10 Oct 2010 13:58:32 -0000 Response body The server logs the request to the terminal, as in the other examples. $ python BaseHTTPServer_send_header.py Starting server, use to stop See Also: BaseHTTPServer (http://docs.python.org/library/basehttpserver.html) The stan- dard library documentation for this module. SocketServer (page 609) The SocketServer module provides the base class that handles the raw socket connection. RFC 2822 (http://tools.ietf.org/html/rfc2822.html) The “Internet Message Format” specifies a format for text-based messages, such as email and HTTP responses. 12.3 urllib—Network Resource Access Purpose Accessing remote resources that do not need authentication, cookies, etc. Python Version 1.4 and later The urllib module provides a simple interface for network resource access. It also includes functions for encoding and quoting arguments to be passed over HTTP to a server. 
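A minimal sketch of the encoding and quoting functions mentioned above, assuming Python 3, where they are available from urllib.parse (in the Python 2 module described here, the same names are found directly in urllib). The sample value is made up for illustration.

```python
# Sketch only: assumes Python 3, where these helpers live in
# urllib.parse rather than in urllib itself.
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

value = 'a value/with spaces & symbols'  # hypothetical sample text

quoted = quote(value)            # spaces become %20, '/' is left alone
quoted_plus = quote_plus(value)  # spaces become '+', '/' is escaped too
encoded = urlencode({'q': value})  # name=value pairs, quote_plus() rules

print(quoted)       # a%20value/with%20spaces%20%26%20symbols
print(quoted_plus)  # a+value%2Fwith+spaces+%26+symbols
print(encoded)      # q=a+value%2Fwith+spaces+%26+symbols

# Both quoting styles can be reversed.
print(unquote(quoted) == value)
print(unquote_plus(quoted_plus) == value)
```

quote() suits path components, where the slash is meaningful, while quote_plus() and urlencode() follow the form-encoding rules used for query strings.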
12.3.1 Simple Retrieval with Cache

Downloading data is a common operation, and urllib includes the urlretrieve() function to meet this need. urlretrieve() takes arguments for the URL, the name of a file to hold the data, a function to report on download progress, and data to pass if the URL refers to a form where data should be posted. If no filename is given, urlretrieve() creates a temporary file. The calling program can delete the file directly or treat the file as a cache and use urlcleanup() to remove it.

This example uses an HTTP GET request to retrieve some data from a web server.

    import urllib
    import os

    def reporthook(blocks_read, block_size, total_size):
        """total_size is reported in bytes.
        block_size is the amount read each time.
        blocks_read is the number of blocks successfully read.
        """
        if not blocks_read:
            print 'Connection opened'
            return
        if total_size < 0:
            # Unknown size
            print 'Read %d blocks (%d bytes)' % (blocks_read,
                                                 blocks_read * block_size)
        else:
            amount_read = blocks_read * block_size
            print 'Read %d blocks, or %d/%d' % \
                (blocks_read, amount_read, total_size)
        return

    try:
        filename, msg = urllib.urlretrieve(
            'http://blog.doughellmann.com/',
            reporthook=reporthook)
        print
        print 'File:', filename
        print 'Headers:'
        print msg
        print 'File exists before cleanup:', os.path.exists(filename)

    finally:
        urllib.urlcleanup()

        print 'File still exists:', os.path.exists(filename)

Each time data is read from the server, reporthook() is called to report the download progress. The three arguments are the number of blocks read so far, the size (in bytes) of the blocks, and the size (in bytes) of the resource being downloaded. When the server does not return a Content-length header, urlretrieve() does not know how big the data should be and passes -1 as the total_size argument.
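The callback contract described above can be sketched on its own. This is only an illustration under stated assumptions: it targets Python 3, where urlretrieve() lives in urllib.request and calls the hook with the same three arguments, and format_progress() is a hypothetical helper, not part of urllib. It caps the reported byte count at total_size because the final block is usually only partly filled.

```python
def format_progress(blocks_read, block_size, total_size):
    """Return a one-line progress message for a urlretrieve()-style hook.

    format_progress is a hypothetical helper, not part of urllib.
    """
    if not blocks_read:
        return 'Connection opened'
    amount_read = blocks_read * block_size
    if total_size < 0:
        # No Content-length header was sent, so the total is unknown.
        return 'Read %d blocks (%d bytes)' % (blocks_read, amount_read)
    # The final block may be short, so never report more than the total.
    amount_read = min(amount_read, total_size)
    percent = 100.0 * amount_read / total_size if total_size else 100.0
    return 'Read %d/%d bytes (%.0f%%)' % (amount_read, total_size, percent)


def reporthook(blocks_read, block_size, total_size):
    # Signature expected by urllib.request.urlretrieve(reporthook=...).
    print(format_progress(blocks_read, block_size, total_size))


# Simulate the calls urlretrieve() would make for a 10,000-byte resource.
for block_number in range(3):
    reporthook(block_number, 8192, 10000)
```

Keeping the formatting separate from the printing makes the hook easy to check without a network connection.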
    $ python urllib_urlretrieve.py

    Connection opened
    Read 1 blocks (8192 bytes)
    Read 2 blocks (16384 bytes)
    Read 3 blocks (24576 bytes)
    Read 4 blocks (32768 bytes)
    Read 5 blocks (40960 bytes)
    Read 6 blocks (49152 bytes)
    Read 7 blocks (57344 bytes)
    Read 8 blocks (65536 bytes)
    Read 9 blocks (73728 bytes)
    Read 10 blocks (81920 bytes)
    Read 11 blocks (90112 bytes)
    Read 12 blocks (98304 bytes)

    File: /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmpYI9AuC
    Headers:
    Content-Type: text/html; charset=UTF-8
    Expires: Fri, 07 Jan 2011 14:23:06 GMT
    Date: Fri, 07 Jan 2011 14:23:06 GMT
    Last-Modified: Tue, 04 Jan 2011 12:32:04 GMT
    ETag: "f2108552-7c52-4c50-8838-8300645c40be"
    X-Content-Type-Options: nosniff
    X-XSS-Protection: 1; mode=block
    Server: GSE
    Cache-Control: public, max-age=0, proxy-revalidate, must-revalidate
    Age: 0

    File exists before cleanup: True
    File still exists: False

12.3.2 Encoding Arguments

Arguments can be passed to the server by encoding them and appending them to the URL.

    import urllib

    query_args = { 'q':'query string', 'foo':'bar' }
    encoded_args = urllib.urlencode(query_args)
    print 'Encoded:', encoded_args

    url = 'http://localhost:8080/?' + encoded_args
    print urllib.urlopen(url).read()

The query entry in the list of client values contains the encoded query arguments.

    $ python urllib_urlencode.py

    Encoded: q=query+string&foo=bar
    CLIENT VALUES:
    client_address=('127.0.0.1', 54415) (localhost)
    command=GET
    path=/?q=query+string&foo=bar
    real path=/
    query=q=query+string&foo=bar
    request_version=HTTP/1.0

    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.5.1
    protocol_version=HTTP/1.0

To pass a sequence of values using separate occurrences of the variable in the query string, set doseq to True when calling urlencode().
    import urllib

    query_args = { 'foo':['foo1', 'foo2'] }
    print 'Single :', urllib.urlencode(query_args)
    print 'Sequence:', urllib.urlencode(query_args, doseq=True)

The result is a query string with several values associated with the same name.

    $ python urllib_urlencode_doseq.py

    Single : foo=%5B%27foo1%27%2C+%27foo2%27%5D
    Sequence: foo=foo1&foo=foo2

To decode the query string, see the FieldStorage class from the cgi module.

Special characters within the query arguments that might cause parse problems with the URL on the server side are “quoted” when passed to urlencode(). To quote them locally to make safe versions of the strings, use the quote() or quote_plus() functions directly.

    import urllib

    url = 'http://localhost:8080/~dhellmann/'
    print 'urlencode() :', urllib.urlencode({'url':url})
    print 'quote() :', urllib.quote(url)
    print 'quote_plus():', urllib.quote_plus(url)

The quoting implementation in quote_plus() is more aggressive about the characters it replaces. In this example, it also escapes the slashes that quote() leaves intact.

    $ python urllib_quote.py

    urlencode() : url=http%3A%2F%2Flocalhost%3A8080%2F%7Edhellmann%2F
    quote() : http%3A//localhost%3A8080/%7Edhellmann/
    quote_plus(): http%3A%2F%2Flocalhost%3A8080%2F%7Edhellmann%2F

To reverse the quote operations, use unquote() or unquote_plus(), as appropriate.

    import urllib

    print urllib.unquote('http%3A//localhost%3A8080/%7Edhellmann/')
    print urllib.unquote_plus(
        'http%3A%2F%2Flocalhost%3A8080%2F%7Edhellmann%2F'
        )

The encoded value is converted back to a normal string URL.

    $ python urllib_unquote.py

    http://localhost:8080/~dhellmann/
    http://localhost:8080/~dhellmann/

12.3.3 Paths vs. URLs

Some operating systems separate the components of local file paths with a different character than the one used in URLs. To make code portable, use the functions pathname2url() and url2pathname() to convert back and forth.

Note: Since these examples were prepared under Mac OS X, they have to explicitly import the Windows versions of the functions. Using the versions of the functions exported by urllib provides the correct defaults for the current platform, so most programs do not need to do this.

    import os

    from urllib import pathname2url, url2pathname

    print '== Default =='
    path = '/a/b/c'
    print 'Original:', path
    print 'URL :', pathname2url(path)
    print 'Path :', url2pathname('/d/e/f')
    print

    from nturl2path import pathname2url, url2pathname

    print '== Windows, without drive letter =='
    path = r'\a\b\c'
    print 'Original:', path
    print 'URL :', pathname2url(path)
    print 'Path :', url2pathname('/d/e/f')
    print

    print '== Windows, with drive letter =='
    path = r'C:\a\b\c'
    print 'Original:', path
    print 'URL :', pathname2url(path)
    print 'Path :', url2pathname('/d/e/f')

There are two Windows examples, with and without the drive letter at the start of the path.

    $ python urllib_pathnames.py

    == Default ==
    Original: /a/b/c
    URL : /a/b/c
    Path : /d/e/f

    == Windows, without drive letter ==
    Original: \a\b\c
    URL : /a/b/c
    Path : \d\e\f

    == Windows, with drive letter ==
    Original: C:\a\b\c
    URL : ///C:/a/b/c
    Path : \d\e\f

See Also:
urllib (http://docs.python.org/lib/module-urllib.html) Standard library documentation for this module.
urllib2 (page 657) Updated API for working with URL-based services.
urlparse (page 638) Parse URL values to access their components.

12.4 urllib2—Network Resource Access

Purpose A library for opening URLs that can be extended by defining custom protocol handlers.
Python Version 2.1 and later

The urllib2 module provides an updated API for using Internet resources identified by URLs. It is designed to be extended by individual applications to support new protocols or add variations to existing protocols (such as handling HTTP basic authentication).

12.4.1 HTTP GET

Note: The test server for these examples is in BaseHTTPServer_GET.py, from the examples for the BaseHTTPServer module. Start the server in one terminal window, and then run these examples in another.
As with urllib, an HTTP GET operation is the simplest use of urllib2. Pass the URL to urlopen() to get a “file-like” handle to the remote data.

    import urllib2

    response = urllib2.urlopen('http://localhost:8080/')
    print 'RESPONSE:', response
    print 'URL :', response.geturl()

    headers = response.info()
    print 'DATE :', headers['date']
    print 'HEADERS :'
    print '---------'
    print headers

    data = response.read()
    print 'LENGTH :', len(data)
    print 'DATA :'
    print '---------'
    print data

The example server accepts the incoming values and formats a plain-text response to send back. The return value from urlopen() gives access to the headers from the HTTP server through the info() method and the data for the remote resource via methods like read() and readlines().

    $ python urllib2_urlopen.py

    RESPONSE: >
    URL : http://localhost:8080/
    DATE : Sun, 19 Jul 2009 14:01:31 GMT
    HEADERS :
    ---------
    Server: BaseHTTP/0.3 Python/2.6.2
    Date: Sun, 19 Jul 2009 14:01:31 GMT

    LENGTH : 349
    DATA :
    ---------
    CLIENT VALUES:
    client_address=('127.0.0.1', 55836) (localhost)
    command=GET
    path=/
    real path=/
    query=
    request_version=HTTP/1.1

    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.6.2
    protocol_version=HTTP/1.0

    HEADERS RECEIVED:
    accept-encoding=identity
    connection=close
    host=localhost:8080
    user-agent=Python-urllib/2.6

The file-like object returned by urlopen() is iterable.

    import urllib2

    response = urllib2.urlopen('http://localhost:8080/')
    for line in response:
        print line.rstrip()

This example strips the trailing newlines and carriage returns before printing the output.
    $ python urllib2_urlopen_iterator.py

    CLIENT VALUES:
    client_address=('127.0.0.1', 55840) (localhost)
    command=GET
    path=/
    real path=/
    query=
    request_version=HTTP/1.1

    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.6.2
    protocol_version=HTTP/1.0

    HEADERS RECEIVED:
    accept-encoding=identity
    connection=close
    host=localhost:8080
    user-agent=Python-urllib/2.6

12.4.2 Encoding Arguments

Arguments can be passed to the server by encoding them with urllib.urlencode() and appending them to the URL.

    import urllib
    import urllib2

    query_args = { 'q':'query string', 'foo':'bar' }
    encoded_args = urllib.urlencode(query_args)
    print 'Encoded:', encoded_args

    url = 'http://localhost:8080/?' + encoded_args
    print urllib2.urlopen(url).read()

The list of client values returned in the example output contains the encoded query arguments.

    $ python urllib2_http_get_args.py

    Encoded: q=query+string&foo=bar
    CLIENT VALUES:
    client_address=('127.0.0.1', 55849) (localhost)
    command=GET
    path=/?q=query+string&foo=bar
    real path=/
    query=q=query+string&foo=bar
    request_version=HTTP/1.1

    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.6.2
    protocol_version=HTTP/1.0

    HEADERS RECEIVED:
    accept-encoding=identity
    connection=close
    host=localhost:8080
    user-agent=Python-urllib/2.6

12.4.3 HTTP POST

Note: The test server for these examples is in BaseHTTPServer_POST.py, from the examples for the BaseHTTPServer module. Start the server in one terminal window, and then run these examples in another.

To send form-encoded data to the remote server using POST instead of GET, pass the encoded query arguments as data to urlopen().

    import urllib
    import urllib2

    query_args = { 'q':'query string', 'foo':'bar' }
    encoded_args = urllib.urlencode(query_args)
    url = 'http://localhost:8080/'
    print urllib2.urlopen(url, encoded_args).read()

The server can decode the form data and access the individual values by name.
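On the receiving side, decoding is the reverse of urlencode(). The following is a minimal sketch of that round trip, assuming Python 3, where both helpers live in urllib.parse. It does not reproduce the test server from BaseHTTPServer_POST.py; it only shows what could be done with the raw POST body.

```python
from urllib.parse import parse_qs, urlencode

query_args = {'q': 'query string', 'foo': 'bar'}

# What the client would send as the POST body. In Python 3 the body
# passed to urlopen() must be bytes, hence the encode() call.
body = urlencode(query_args).encode('ascii')

# What a server could do with the raw body it receives.
form = parse_qs(body.decode('ascii'))

for name, values in sorted(form.items()):
    # parse_qs() maps each name to a list, because a name may repeat.
    print('%s=%s' % (name, values[0]))
```

Each value comes back as a list because the same field name may appear more than once in a form submission.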
    $ python urllib2_urlopen_post.py

    Client: ('127.0.0.1', 55943)
    User-agent: Python-urllib/2.6
    Path: /
    Form data:
    q=query string
    foo=bar

12.4.4 Adding Outgoing Headers

urlopen() is a convenience function that hides some of the details of how the request is made and handled. More precise control is possible by using a Request instance directly. For example, custom headers can be added to the outgoing request to control the format of data returned, specify the version of a document cached locally, and tell the remote server the name of the software client communicating with it.

As the output from the earlier examples shows, the default User-agent header value is made up of the constant Python-urllib, followed by the Python interpreter version. When creating an application that will access web resources owned by someone else, it is courteous to include real user-agent information in the requests, so they can identify the source of the hits more easily. Using a custom agent also allows them to control crawlers using a robots.txt file (see the robotparser module).

    import urllib2

    request = urllib2.Request('http://localhost:8080/')
    request.add_header(
        'User-agent',
        'PyMOTW (http://www.doughellmann.com/PyMOTW/)',
        )

    response = urllib2.urlopen(request)
    data = response.read()
    print data

After creating a Request object, use add_header() to set the user-agent value before opening the request. The last line of the output shows the custom value.

    $ python urllib2_request_header.py

    CLIENT VALUES:
    client_address=('127.0.0.1', 55876) (localhost)
    command=GET
    path=/
    real path=/
    query=
    request_version=HTTP/1.1

    SERVER VALUES:
    server_version=BaseHTTP/0.3
    sys_version=Python/2.6.2
    protocol_version=HTTP/1.0

    HEADERS RECEIVED:
    accept-encoding=identity
    connection=close
    host=localhost:8080
    user-agent=PyMOTW (http://www.doughellmann.com/PyMOTW/)

12.4.5 Posting Form Data from a Request

The outgoing data can be added to the Request to have it posted to the server.

    import urllib
    import urllib2

    query_args = { 'q':'query string', 'foo':'bar' }

    request = urllib2.Request('http://localhost:8080/')
    print 'Request method before data:', request.get_method()

    request.add_data(urllib.urlencode(query_args))
    print 'Request method after data :', request.get_method()
    request.add_header(
        'User-agent',
        'PyMOTW (http://www.doughellmann.com/PyMOTW/)',
        )

    print
    print 'OUTGOING DATA:'
    print request.get_data()

    print
    print 'SERVER RESPONSE:'
    print urllib2.urlopen(request).read()

The HTTP method used by the Request automatically changes from GET to POST after the data is added.

    $ python urllib2_request_post.py

    Request method before data: GET
    Request method after data : POST

    OUTGOING DATA:
    q=query+string&foo=bar

    SERVER RESPONSE:
    Client: ('127.0.0.1', 56044)
    User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
    Path: /
    Form data:
    q=query string
    foo=bar

Note: Although the method is named add_data(), its effect is not cumulative. Each call replaces the previous data.

12.4.6 Uploading Files

Encoding files for upload requires a little more work than simple forms. A complete MIME message needs to be constructed in the body of the request so that the server can distinguish incoming form fields from uploaded files.
    import itertools
    import mimetools
    import mimetypes
    from cStringIO import StringIO
    import urllib
    import urllib2

    class MultiPartForm(object):
        """Accumulate the data to be used when posting a form."""

        def __init__(self):
            self.form_fields = []
            self.files = []
            self.boundary = mimetools.choose_boundary()
            return

        def get_content_type(self):
            return 'multipart/form-data; boundary=%s' % self.boundary

        def add_field(self, name, value):
            """Add a simple field to the form data."""
            self.form_fields.append((name, value))
            return

        def add_file(self, fieldname, filename, fileHandle,
                     mimetype=None):
            """Add a file to be uploaded."""
            body = fileHandle.read()
            if mimetype is None:
                mimetype = ( mimetypes.guess_type(filename)[0]
                             or
                             'application/octet-stream'
                             )
            self.files.append((fieldname, filename, mimetype, body))
            return

        def __str__(self):
            """Return a string representing the form data,
            including attached files.
            """
            # Build a list of lists, each containing "lines" of the
            # request. Each part is separated by a boundary string.
            # Once the list is built, return a string where each
            # line is separated by '\r\n'.
            parts = []
            part_boundary = '--' + self.boundary

            # Add the form fields
            parts.extend(
                [ part_boundary,
                  'Content-Disposition: form-data; name="%s"' % name,
                  '',
                  value,
                ]
                for name, value in self.form_fields
                )

            # Add the files to upload
            parts.extend([
                part_boundary,
                'Content-Disposition: file; name="%s"; filename="%s"' % \
                    (field_name, filename),
                'Content-Type: %s' % content_type,
                '',
                body,
                ]
                for field_name, filename, content_type, body in self.files
                )

            # Flatten the list and add closing boundary marker, and
            # then return CR+LF separated data
            flattened = list(itertools.chain(*parts))
            flattened.append('--' + self.boundary + '--')
            flattened.append('')
            return '\r\n'.join(flattened)

    if __name__ == '__main__':
        # Create the form with simple fields
        form = MultiPartForm()
        form.add_field('firstname', 'Doug')
        form.add_field('lastname', 'Hellmann')

        # Add a fake file
        form.add_file(
            'biography', 'bio.txt',
            fileHandle=StringIO('Python developer and blogger.'))

        # Build the request
        request = urllib2.Request('http://localhost:8080/')
        request.add_header(
            'User-agent',
            'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
        body = str(form)
        request.add_header('Content-type', form.get_content_type())
        request.add_header('Content-length', len(body))
        request.add_data(body)

        print
        print 'OUTGOING DATA:'
        print request.get_data()

        print
        print 'SERVER RESPONSE:'
        print urllib2.urlopen(request).read()

The MultiPartForm class can represent an arbitrary form as a multipart MIME message with attached files.

    $ python urllib2_upload_files.py

    OUTGOING DATA:
    --192.168.1.17.527.30074.1248020372.206.1
    Content-Disposition: form-data; name="firstname"

    Doug
    --192.168.1.17.527.30074.1248020372.206.1
    Content-Disposition: form-data; name="lastname"

    Hellmann
    --192.168.1.17.527.30074.1248020372.206.1
    Content-Disposition: file; name="biography"; filename="bio.txt"
    Content-Type: text/plain

    Python developer and blogger.
    --192.168.1.17.527.30074.1248020372.206.1--

    SERVER RESPONSE:
    Client: ('127.0.0.1', 57126)
    User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
    Path: /
    Form data:
    lastname=Hellmann
    Uploaded biography as "bio.txt" (29 bytes)
    firstname=Doug

12.4.7 Creating Custom Protocol Handlers

urllib2 has built-in support for HTTP(S), FTP, and local file access. To add support for other URL types, register another protocol handler. For example, to support URLs pointing to arbitrary files on remote NFS servers, without requiring users to mount the path before accessing the file, create a class derived from BaseHandler with a method named nfs_open().

The protocol-specific open() method is given a single argument, the Request instance, and it should return an object with a read() method that can be used to read the data, an info() method to return the response headers, and geturl() to return the actual URL of the file being read. A simple way to achieve that result is to create an instance of urllib.addinfourl, passing the open file handle, headers, and URL to the constructor.
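The handler contract described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the NFS example from this section: it targets Python 3, where BaseHandler lives in urllib.request and addinfourl in urllib.response, and both the echo:// scheme and the EchoHandler class are hypothetical names invented for the sketch.

```python
import io
import urllib.request
import urllib.response


class EchoHandler(urllib.request.BaseHandler):
    """Hypothetical handler: echo://host/text returns 'text' as the body."""

    def echo_open(self, req):
        # The opener finds this method by its name, <scheme>_open.
        body = req.selector.lstrip('/').encode('utf-8')
        headers = {'Content-length': str(len(body))}
        # addinfourl supplies read(), info(), and geturl() on top of
        # the file-like object that is passed in.
        return urllib.response.addinfourl(
            io.BytesIO(body), headers, req.full_url)


opener = urllib.request.build_opener(EchoHandler())
response = opener.open('echo://example/hello')
data = response.read()
print('DATA   :', data)
print('URL    :', response.geturl())
print('HEADERS:', response.info())
response.close()
```

Calling opener.open() directly, instead of install_opener(), keeps the custom scheme local to this opener and leaves the module-level urlopen() unchanged.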
    import mimetypes
    import os
    import tempfile
    import urllib
    import urllib2

    class NFSFile(file):
        def __init__(self, tempdir, filename):
            self.tempdir = tempdir
            file.__init__(self, filename, 'rb')
        def close(self):
            print 'NFSFile:'
            print ' unmounting %s' % os.path.basename(self.tempdir)
            print ' when %s is closed' % os.path.basename(self.name)
            return file.close(self)

    class FauxNFSHandler(urllib2.BaseHandler):

        def __init__(self, tempdir):
            self.tempdir = tempdir

        def nfs_open(self, req):
            url = req.get_selector()
            directory_name, file_name = os.path.split(url)
            server_name = req.get_host()
            print 'FauxNFSHandler simulating mount:'
            print ' Remote path: %s' % directory_name
            print ' Server : %s' % server_name
            # Use the directory given to the constructor rather than
            # relying on a module-level variable.
            print ' Local path : %s' % os.path.basename(self.tempdir)
            print ' Filename : %s' % file_name
            local_file = os.path.join(self.tempdir, file_name)
            fp = NFSFile(self.tempdir, local_file)
            content_type = ( mimetypes.guess_type(file_name)[0]
                             or 'application/octet-stream'
                             )
            stats = os.stat(local_file)
            size = stats.st_size
            headers = { 'Content-type': content_type,
                        'Content-length': size,
                        }
            return urllib.addinfourl(fp, headers, req.get_full_url())

    if __name__ == '__main__':
        tempdir = tempfile.mkdtemp()
        try:
            # Populate the temporary file for the simulation
            with open(os.path.join(tempdir, 'file.txt'), 'wt') as f:
                f.write('Contents of file.txt')

            # Construct an opener with our NFS handler
            # and register it as the default opener.
            opener = urllib2.build_opener(FauxNFSHandler(tempdir))
            urllib2.install_opener(opener)

            # Open the file through a URL.
            response = urllib2.urlopen(
                'nfs://remote_server/path/to/the/file.txt'
                )
            print
            print 'READ CONTENTS:', response.read()
            print 'URL :', response.geturl()
            print 'HEADERS:'
            for name, value in sorted(response.info().items()):
                print ' %-15s = %s' % (name, value)
            response.close()
        finally:
            os.remove(os.path.join(tempdir, 'file.txt'))
            os.removedirs(tempdir)

The FauxNFSHandler and NFSFile classes print messages to illustrate where a real implementation would add mount and unmount calls. Since this is just a simulation, FauxNFSHandler is primed with the name of a temporary directory where it should look for all its files.

    $ python urllib2_nfs_handler.py

    FauxNFSHandler simulating mount:
     Remote path: /path/to/the
     Server : remote_server
     Local path : tmpoqqoAV
     Filename : file.txt

    READ CONTENTS: Contents of file.txt
    URL : nfs://remote_server/path/to/the/file.txt
    HEADERS:
     Content-length  = 20
     Content-type    = text/plain
    NFSFile:
     unmounting tmpoqqoAV
     when file.txt is closed

See Also:
urllib2 (http://docs.python.org/library/urllib2.html) The standard library documentation for this module.
urllib (page 651) Original URL handling library.
urlparse (page 638) Work with the URL string itself.
urllib2 – The Missing Manual (www.voidspace.org.uk/python/articles/urllib2.shtml) Michael Foord’s write-up on using urllib2.
Upload Scripts (www.voidspace.org.uk/python/cgi.shtml#upload) Example scripts from Michael Foord that illustrate how to upload a file using HTTP and then receive the data on the server.
HTTP client to POST using multipart/form-data (http://code.activestate.com/recipes/146306) Python cookbook recipe showing how to encode and post data, including files, over HTTP.
Form content types (www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4) W3C specification for posting files or large amounts of data via HTTP forms.
mimetypes Map filenames to mimetype.
mimetools Tools for parsing MIME messages.
12.5 base64—Encode Binary Data with ASCII

Purpose The base64 module contains functions for translating binary data into a subset of ASCII suitable for transmission using plain-text protocols.
Python Version 1.4 and later

The Base64, Base32, and Base16 encodings convert 8-bit bytes to values with 6, 5, or 4 bits of useful data per byte, allowing non-ASCII bytes to be encoded as ASCII characters for transmission over protocols that require plain ASCII, such as SMTP. The base values correspond to the length of the alphabet used in each encoding. There are also URL-safe variations of the original encodings that use slightly different alphabets.

12.5.1 Base64 Encoding

This is a basic example of encoding some text.

    import base64
    import textwrap

    # Load this source file and strip the header.
    with open(__file__, 'rt') as input:
        raw = input.read()
        initial_data = raw.split('#end_pymotw_header')[1]

    encoded_data = base64.b64encode(initial_data)

    num_initial = len(initial_data)

    # There will never be more than 2 padding bytes, and none
    # at all when the input length is a multiple of 3.
    padding = (3 - (num_initial % 3)) % 3

    print '%d bytes before encoding' % num_initial
    print 'Expect %d padding bytes' % padding
    print '%d bytes after encoding' % len(encoded_data)
    print
    print encoded_data

The output shows that the 168 bytes of the original source expand to 224 bytes after being encoded. Because 168 is a multiple of 3, no padding is needed.

Note: There are no carriage returns in the encoded data produced by the library, but the output has been wrapped artificially to make it fit better on the page.

    $ python base64_b64encode.py

    168 bytes before encoding
    Expect 0 padding bytes
    224 bytes after encoding

    CgppbXBvcnQgYmFzZTY0CmltcG9ydCB0ZXh0d3JhcAoKIyBMb2FkIHRoaXMgc291c
    mNlIGZpbGUgYW5kIHN0cmlwIHRoZSBoZWFkZXIuCndpdGggb3BlbihfX2ZpbGVfXy
    wgJ3J0JykgYXMgaW5wdXQ6CiAgICByYXcgPSBpbnB1dC5yZWFkKCkKICAgIGluaXR
    pYWxfZGF0YSA9IHJhdy5zcGxpdCgn

12.5.2 Base64 Decoding

b64decode() converts the encoded string back to the original form by taking four bytes and converting them to the original three, using a lookup table.

    import base64

    original_string = 'This is the data, in the clear.'
    print 'Original:', original_string

    encoded_string = base64.b64encode(original_string)
    print 'Encoded :', encoded_string

    decoded_string = base64.b64decode(encoded_string)
    print 'Decoded :', decoded_string

The encoding process looks at each sequence of 24 bits in the input (three bytes) and encodes those same 24 bits spread over four bytes in the output. The equal signs at the end of the output are padding, inserted because the number of bits in the original string in this example is not evenly divisible by 24.

    $ python base64_b64decode.py

    Original: This is the data, in the clear.
    Encoded : VGhpcyBpcyB0aGUgZGF0YSwgaW4gdGhlIGNsZWFyLg==
    Decoded : This is the data, in the clear.

12.5.3 URL-Safe Variations

Because the default Base64 alphabet may use + and /, and those two characters are used in URLs, it is often necessary to use an alternate encoding with substitutes for those characters.

    import base64

    encodes_with_pluses = chr(251) + chr(239)
    encodes_with_slashes = chr(255) * 2

    for original in [ encodes_with_pluses, encodes_with_slashes ]:
        print 'Original :', repr(original)
        print 'Standard encoding:', base64.standard_b64encode(original)
        print 'URL-safe encoding:', base64.urlsafe_b64encode(original)
        print

The + is replaced with a - and / is replaced with underscore (_). Otherwise, the alphabet is the same.
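The substitution can be checked directly. The sketch below assumes Python 3, where the base64 functions accept and return bytes rather than str, and reuses the two byte sequences from the example above. It confirms that the URL-safe form is the standard form with only the two characters swapped.

```python
import base64

# The same two inputs as the example above, written as bytes literals.
samples = [b'\xfb\xef', b'\xff\xff']

# Translation table that swaps '+' for '-' and '/' for '_'.
swap = bytes.maketrans(b'+/', b'-_')

for original in samples:
    standard = base64.standard_b64encode(original)
    url_safe = base64.urlsafe_b64encode(original)
    # Swapping the two characters is the only difference.
    assert url_safe == standard.translate(swap)
    # Each variant is decoded with its matching function.
    assert base64.urlsafe_b64decode(url_safe) == original
    print(repr(original), standard.decode('ascii'), url_safe.decode('ascii'))
```

Data encoded with one alphabet should be decoded with the matching function, since the decoders treat the two substitute characters differently.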
    $ python base64_urlsafe.py

    Original : '\xfb\xef'
    Standard encoding: ++8=
    URL-safe encoding: --8=

    Original : '\xff\xff'
    Standard encoding: //8=
    URL-safe encoding: __8=

12.5.4 Other Encodings

Besides Base64, the module provides functions for working with Base32 and Base16 (hex) encoded data.

    import base64

    original_string = 'This is the data, in the clear.'
    print 'Original:', original_string

    encoded_string = base64.b32encode(original_string)
    print 'Encoded :', encoded_string

    decoded_string = base64.b32decode(encoded_string)
    print 'Decoded :', decoded_string

The Base32 alphabet includes the 26 uppercase letters from the ASCII set and the digits 2 through 7.

    $ python base64_base32.py

    Original: This is the data, in the clear.
    Encoded : KRUGS4ZANFZSA5DIMUQGIYLUMEWCA2LOEB2GQZJAMNWGKYLSFY======
    Decoded : This is the data, in the clear.

The Base16 functions work with the hexadecimal alphabet.

    import base64

    original_string = 'This is the data, in the clear.'
    print 'Original:', original_string

    encoded_string = base64.b16encode(original_string)
    print 'Encoded :', encoded_string

    decoded_string = base64.b16decode(encoded_string)
    print 'Decoded :', decoded_string

Each time the number of encoding bits goes down, the output in the encoded format takes up more space.

    $ python base64_base16.py

    Original: This is the data, in the clear.
    Encoded : 546869732069732074686520646174612C20696E2074686520636C6561
    722E
    Decoded : This is the data, in the clear.

See Also:
base64 (http://docs.python.org/library/base64.html) The standard library documentation for this module.
RFC 3548 (http://tools.ietf.org/html/rfc3548.html) The Base16, Base32, and Base64 data encodings.

12.6 robotparser—Internet Spider Access Control

Purpose Parse robots.txt file used to control Internet spiders.
Python Version 2.1.3 and later

robotparser implements a parser for the robots.txt file format, including a function that checks if a given user-agent can access a resource. It is intended for use in well-behaved spiders or other crawler applications that need to either be throttled or otherwise restricted.

12.6.1 robots.txt

The robots.txt file format is a simple text-based access control system for computer programs that automatically access web resources (“spiders,” “crawlers,” etc.). The file is made up of records that specify the user-agent identifier for the program followed by a list of URLs (or URL prefixes) the agent may not access.

This is the robots.txt file for http://www.doughellmann.com/.

    User-agent: *
    Disallow: /admin/
    Disallow: /downloads/
    Disallow: /media/
    Disallow: /static/
    Disallow: /codehosting/

It prevents access to some parts of the site that are expensive to compute and would overload the server if a search engine tried to index them. For a more complete set of examples of robots.txt, refer to The Web Robots Page (see the references list later in this section).

12.6.2 Testing Access Permissions

Using the data presented earlier, a simple crawler can test whether it is allowed to download a page using RobotFileParser.can_fetch().

    import robotparser
    import urlparse

    AGENT_NAME = 'PyMOTW'
    URL_BASE = 'http://www.doughellmann.com/'
    parser = robotparser.RobotFileParser()
    parser.set_url(urlparse.urljoin(URL_BASE, 'robots.txt'))
    parser.read()

    PATHS = [
        '/',
        '/PyMOTW/',
        '/admin/',
        '/downloads/PyMOTW-1.92.tar.gz',
        ]

    for path in PATHS:
        print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
        url = urlparse.urljoin(URL_BASE, path)
        print '%6s : %s' % (parser.can_fetch(AGENT_NAME, url), url)
        print

The URL argument to can_fetch() can be a path relative to the root of the site or a full URL.
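The same check can be tried without fetching anything from the network. This is a minimal sketch assuming Python 3, where the module is named urllib.robotparser; it feeds the rules quoted in this section to parse() as text instead of calling read(), and ROBOTS_TXT is just a local name chosen for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted earlier in this section, supplied as text so the
# sketch runs without downloading the real robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /downloads/
Disallow: /media/
Disallow: /static/
Disallow: /codehosting/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

targets = [
    '/',
    '/PyMOTW/',
    '/admin/',
    'http://www.doughellmann.com/downloads/PyMOTW-1.92.tar.gz',
]
for target in targets:
    # can_fetch() accepts either a site-relative path or a full URL.
    print('%6s : %s' % (parser.can_fetch('PyMOTW', target), target))
```

Parsing from a list of lines is also convenient in tests, where a crawler's rule handling can be exercised against fixed input.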
    $ python robotparser_simple.py

      True : /
      True : http://www.doughellmann.com/

      True : /PyMOTW/
      True : http://www.doughellmann.com/PyMOTW/

     False : /admin/
     False : http://www.doughellmann.com/admin/

     False : /downloads/PyMOTW-1.92.tar.gz
     False : http://www.doughellmann.com/downloads/PyMOTW-1.92.tar.gz

12.6.3 Long-Lived Spiders

An application that takes a long time to process the resources it downloads or that is throttled to pause between downloads should check for new robots.txt files periodically, based on the age of the content it has downloaded already. The age is not managed automatically, but there are convenience methods to make tracking it easier.

    import robotparser
    import time
    import urlparse

    AGENT_NAME = 'PyMOTW'
    parser = robotparser.RobotFileParser()
    # Using the local copy
    parser.set_url('robots.txt')
    parser.read()
    parser.modified()

    PATHS = [
        '/',
        '/PyMOTW/',
        '/admin/',
        '/downloads/PyMOTW-1.92.tar.gz',
        ]

    for path in PATHS:
        age = int(time.time() - parser.mtime())
        print 'age:', age,
        if age > 1:
            print 'rereading robots.txt'
            parser.read()
            parser.modified()
        else:
            print
        print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
        # Simulate a delay in processing
        time.sleep(1)
        print

This extreme example downloads a new robots.txt file if the one it has is more than one second old.

    $ python robotparser_longlived.py

    age: 0
      True : /

    age: 1
      True : /PyMOTW/

    age: 2 rereading robots.txt
     False : /admin/

    age: 1
     False : /downloads/PyMOTW-1.92.tar.gz

A nicer version of the long-lived application might request the modification time for the file before downloading the entire thing. On the other hand, robots.txt files are usually fairly small, so it is not that much more expensive to just retrieve the entire document again.

See Also:
robotparser (http://docs.python.org/library/robotparser.html) The standard library documentation for this module.
The Web Robots Page (www.robotstxt.org/orig.html) Description of robots.txt format.
12.7 Cookie—HTTP Cookies

Purpose: The Cookie module defines classes for parsing and creating HTTP cookie headers.
Python Version: 2.1 and later

The Cookie module implements a parser for cookies that is mostly RFC 2109 compliant. The implementation is a little less strict than the standard because MSIE 3.0x does not support the entire standard.

12.7.1 Creating and Setting a Cookie

Cookies are used as state management for browser-based applications, and as such, are usually set by the server to be stored and returned by the client. Here is the simplest example of creating a cookie.

import Cookie

c = Cookie.SimpleCookie()
c['mycookie'] = 'cookie_value'
print c

The output is a valid Set-Cookie header ready to be passed to the client as part of the HTTP response.

$ python Cookie_setheaders.py

Set-Cookie: mycookie=cookie_value

12.7.2 Morsels

It is also possible to control the other aspects of a cookie, such as the expiration, path, and domain. In fact, all the RFC attributes for cookies can be managed through the Morsel object representing the cookie value.

import Cookie
import datetime

def show_cookie(c):
    print c
    for key, morsel in c.iteritems():
        print
        print 'key =', morsel.key
        print '  value =', morsel.value
        print '  coded_value =', morsel.coded_value
        for name in morsel.keys():
            if morsel[name]:
                print '  %s = %s' % (name, morsel[name])

c = Cookie.SimpleCookie()

# A cookie with a value that has to be encoded to fit into the header
c['encoded_value_cookie'] = '"cookie_value"'
c['encoded_value_cookie']['comment'] = 'Value has escaped quotes'

# A cookie that only applies to part of a site
c['restricted_cookie'] = 'cookie_value'
c['restricted_cookie']['path'] = '/sub/path'
c['restricted_cookie']['domain'] = 'PyMOTW'
c['restricted_cookie']['secure'] = True

# A cookie that expires in 5 minutes
c['with_max_age'] = 'expires in 5 minutes'
c['with_max_age']['max-age'] = 300 # seconds

# A cookie that expires at a specific time
c['expires_at_time'] = 'cookie_value'
time_to_live = datetime.timedelta(hours=1)
expires = datetime.datetime(2009, 2, 14, 18, 30, 14) + time_to_live

# Date format: Wdy, DD-Mon-YY HH:MM:SS GMT
expires_at_time = expires.strftime('%a, %d %b %Y %H:%M:%S')
c['expires_at_time']['expires'] = expires_at_time

show_cookie(c)

This example includes two different methods for setting stored cookies that expire. One sets the max-age to a number of seconds, and the other sets expires to a date and time when the cookie should be discarded.
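The two expiration styles can also be compared in isolation. This is a minimal sketch, assuming Python 3, where the Cookie module is named http.cookies; the cookie names and values are illustrative only.

```python
# Python 3 sketch: the book's Cookie module is named http.cookies there.
import datetime
from http.cookies import SimpleCookie

c = SimpleCookie()

# Relative lifetime: the client discards the cookie after 300 seconds.
c['with_max_age'] = 'short-lived'
c['with_max_age']['max-age'] = 300

# Absolute lifetime: a fixed date, formatted the same way as in the listing.
expires = (datetime.datetime(2009, 2, 14, 18, 30, 14)
           + datetime.timedelta(hours=1))
c['expires_at_time'] = 'cookie_value'
c['expires_at_time']['expires'] = expires.strftime('%a, %d %b %Y %H:%M:%S')

# Each cookie becomes its own Set-Cookie header line.
header_text = c.output()
for line in header_text.splitlines():
    print(line)
```

Either attribute ends up as a field on the cookie's own Set-Cookie line, so a server can mix both styles in a single response.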
$ python Cookie_Morsel.py

Set-Cookie: encoded_value_cookie="\"cookie_value\""; Comment=Value has escaped quotes
Set-Cookie: expires_at_time=cookie_value; expires=Sat, 14 Feb 2009 19:30:14
Set-Cookie: restricted_cookie=cookie_value; Domain=PyMOTW; Path=/sub/path; secure
Set-Cookie: with_max_age="expires in 5 minutes"; Max-Age=300

key = restricted_cookie
  value = cookie_value
  coded_value = cookie_value
  domain = PyMOTW
  secure = True
  path = /sub/path

key = with_max_age
  value = expires in 5 minutes
  coded_value = "expires in 5 minutes"
  max-age = 300

key = encoded_value_cookie
  value = "cookie_value"
  coded_value = "\"cookie_value\""
  comment = Value has escaped quotes

key = expires_at_time
  value = cookie_value
  coded_value = cookie_value
  expires = Sat, 14 Feb 2009 19:30:14

Both the Cookie and Morsel objects act like dictionaries. A Morsel responds to a fixed set of keys:

• expires
• path
• comment
• domain
• max-age
• secure
• version

The keys for a Cookie instance are the names of the individual cookies being stored. That information is also available from the key attribute of the Morsel.

12.7.3 Encoded Values

The cookie header needs values to be encoded so they can be parsed properly.

import Cookie

c = Cookie.SimpleCookie()
c['integer'] = 5
c['string_with_quotes'] = 'He said, "Hello, World!"'

for name in ['integer', 'string_with_quotes']:
    print c[name].key
    print '  %s' % c[name]
    print '  value=%r' % c[name].value
    print '  coded_value=%r' % c[name].coded_value
    print

Morsel.value is always the decoded value of the cookie, while Morsel.coded_value is always the representation to be used for transmitting the value to the client. Both values are always strings. Values saved to a cookie that are not strings are converted automatically.
$ python Cookie_coded_value.py

integer
  Set-Cookie: integer=5
  value='5'
  coded_value='5'

string_with_quotes
  Set-Cookie: string_with_quotes="He said, \"Hello, World!\""
  value='He said, "Hello, World!"'
  coded_value='"He said, \\"Hello, World!\\""'

12.7.4 Receiving and Parsing Cookie Headers

Once the client receives the Set-Cookie headers, it will return those cookies to the server on subsequent requests using a Cookie header. An incoming Cookie header string may contain several cookie values, separated by semicolons (;).

Cookie: integer=5; string_with_quotes="He said, \"Hello, World!\""

Depending on the web server and framework, cookies are available directly from either the headers or the HTTP_COOKIE environment variable.

import Cookie

HTTP_COOKIE = '; '.join([
    r'integer=5',
    r'string_with_quotes="He said, \"Hello, World!\""',
    ])

print 'From constructor:'
c = Cookie.SimpleCookie(HTTP_COOKIE)
print c

print
print 'From load():'
c = Cookie.SimpleCookie()
c.load(HTTP_COOKIE)
print c

To decode them, pass the string without the header prefix to SimpleCookie when instantiating it, or use the load() method.

$ python Cookie_parse.py

From constructor:
Set-Cookie: integer=5
Set-Cookie: string_with_quotes="He said, \"Hello, World!\""

From load():
Set-Cookie: integer=5
Set-Cookie: string_with_quotes="He said, \"Hello, World!\""

12.7.5 Alternative Output Formats

Besides using the Set-Cookie header, servers may deliver JavaScript that adds cookies to a client. SimpleCookie and Morsel provide JavaScript output via the js_output() method.

import Cookie

c = Cookie.SimpleCookie()
c['mycookie'] = 'cookie_value'
c['another_cookie'] = 'second value'
print c.js_output()

The result is a complete script tag with statements to set the cookies.

$ python Cookie_js_output.py

12.7.6 Deprecated Classes

All these examples have used SimpleCookie. The Cookie module also provides two other classes, SerialCookie and SmartCookie.
SerialCookie can handle any values that can be pickled. SmartCookie figures out whether a value needs to be unpickled or if it is a simple value.

Warning: Since both these classes use pickle, they are potential security holes and should not be used. It is safer to store state on the server and give the client a session key instead.

See Also:
Cookie (http://docs.python.org/library/cookie.html) The standard library documentation for this module.
cookielib The cookielib module for working with cookies on the client side.
RFC 2109 (http://tools.ietf.org/html/rfc2109.html) HTTP State Management Mechanism.

12.8 uuid—Universally Unique Identifiers

Purpose: The uuid module implements Universally Unique Identifiers, as described in RFC 4122.
Python Version: 2.5 and later

RFC 4122 defines a system for creating universally unique identifiers for resources in a way that does not require a central registrar. UUID values are 128 bits long and, as the reference guide says, "can guarantee uniqueness across space and time." They are useful for generating identifiers for documents, hosts, application clients, and other situations where a unique value is necessary. The RFC is specifically focused on creating a Uniform Resource Name namespace and covers three main algorithms.

• Using IEEE 802 MAC addresses as a source of uniqueness
• Using pseudorandom numbers
• Using well-known strings combined with cryptographic hashing

In all cases, the seed value is combined with the system clock and a clock sequence value used to maintain uniqueness in case the clock is set backwards.

12.8.1 UUID 1—IEEE 802 MAC Address

UUID version 1 values are computed using the MAC address of the host. The uuid module uses getnode() to retrieve the MAC value of the current system.

import uuid

print hex(uuid.getnode())

If a system has more than one network card, and so more than one MAC, any one of the values may be returned.
$ python uuid_getnode.py

0x1e5274040e

To generate a UUID for a host, identified by its MAC address, use the uuid1() function. The node identifier argument is optional; leave the field blank to use the value returned by getnode().

import uuid

u = uuid.uuid1()

print u
print type(u)
print 'bytes   :', repr(u.bytes)
print 'hex     :', u.hex
print 'int     :', u.int
print 'urn     :', u.urn
print 'variant :', u.variant
print 'version :', u.version
print 'fields  :', u.fields
print '\ttime_low            :', u.time_low
print '\ttime_mid            :', u.time_mid
print '\ttime_hi_version     :', u.time_hi_version
print '\tclock_seq_hi_variant: ', u.clock_seq_hi_variant
print '\tclock_seq_low       :', u.clock_seq_low
print '\tnode                :', u.node
print '\ttime                :', u.time
print '\tclock_seq           :', u.clock_seq

The components of the UUID object returned can be accessed through read-only instance attributes. Some attributes, such as hex, int, and urn, are different representations of the UUID value.

$ python uuid_uuid1.py

c7887eee-ea6a-11df-a6cf-001e5274040e
<class 'uuid.UUID'>
bytes   : '\xc7\x88~\xee\xeaj\x11\xdf\xa6\xcf\x00\x1eRt\x04\x0e'
hex     : c7887eeeea6a11dfa6cf001e5274040e
int     : 265225098046419456611671377169708483598
urn     : urn:uuid:c7887eee-ea6a-11df-a6cf-001e5274040e
variant : specified in RFC 4122
version : 1
fields  : (3347611374L, 60010L, 4575L, 166L, 207L, 130232353806L)
        time_low            : 3347611374
        time_mid            : 60010
        time_hi_version     : 4575
        clock_seq_hi_variant: 166
        clock_seq_low       : 207
        node                : 130232353806
        time                : 135084258179448558
        clock_seq           : 9935

Because of the time component, each call to uuid1() returns a new value.

import uuid

for i in xrange(3):
    print uuid.uuid1()

In this output, only the time component (at the beginning of the string) changes.
$ python uuid_uuid1_repeat.py

c794da9c-ea6a-11df-9382-001e5274040e
c797121c-ea6a-11df-9e67-001e5274040e
c79713a1-ea6a-11df-ac7d-001e5274040e

Because each computer has a different MAC address, running the sample program on different systems will produce entirely different values. This example passes explicit node ids to simulate running on different hosts.

import uuid

for node in [ 0x1ec200d9e0, 0x1e5274040e ]:
    print uuid.uuid1(node), hex(node)

In addition to a different time value, the node identifier at the end of the UUID also changes.

$ python uuid_uuid1_othermac.py

c7a313a8-ea6a-11df-a228-001ec200d9e0 0x1ec200d9e0
c7a3f751-ea6a-11df-988b-001e5274040e 0x1e5274040e

12.8.2 UUID 3 and 5—Name-Based Values

It is also useful in some contexts to create UUID values from names instead of random or time-based values. Versions 3 and 5 of the UUID specification use cryptographic hash values (MD5 or SHA-1, respectively) to combine namespace-specific seed values with names. There are several well-known namespaces, identified by predefined UUID values, for working with DNS, URLs, ISO OIDs, and X.500 Distinguished Names. New application-specific namespaces can be defined by generating and saving UUID values.

import uuid

hostnames = ['www.doughellmann.com', 'blog.doughellmann.com']

for name in hostnames:
    print name
    print '  MD5   :', uuid.uuid3(uuid.NAMESPACE_DNS, name)
    print '  SHA-1 :', uuid.uuid5(uuid.NAMESPACE_DNS, name)
    print

To create a UUID from a DNS name, pass uuid.NAMESPACE_DNS as the namespace argument to uuid3() or uuid5().

$ python uuid_uuid3_uuid5.py

www.doughellmann.com
  MD5   : bcd02e22-68f0-3046-a512-327cca9def8f
  SHA-1 : e3329b12-30b7-57c4-8117-c2cd34a87ce9

blog.doughellmann.com
  MD5   : 9bdabfce-dfd6-37ab-8a3f-7f7293bcf111
  SHA-1 : fa829736-7ef8-5239-9906-b4775a5abacb

The UUID value for a given name in a namespace is always the same, no matter when or where it is calculated.
import uuid

namespace_types = sorted(n
                         for n in dir(uuid)
                         if n.startswith('NAMESPACE_')
                         )
name = 'www.doughellmann.com'

for namespace_type in namespace_types:
    print namespace_type
    namespace_uuid = getattr(uuid, namespace_type)
    print '', uuid.uuid3(namespace_uuid, name)
    print '', uuid.uuid3(namespace_uuid, name)
    print

Values for the same name in the namespaces are different.

$ python uuid_uuid3_repeat.py

NAMESPACE_DNS
 bcd02e22-68f0-3046-a512-327cca9def8f
 bcd02e22-68f0-3046-a512-327cca9def8f

NAMESPACE_OID
 e7043ac1-4382-3c45-8271-d5c083e41723
 e7043ac1-4382-3c45-8271-d5c083e41723

NAMESPACE_URL
 5d0fdaa9-eafd-365e-b4d7-652500dd1208
 5d0fdaa9-eafd-365e-b4d7-652500dd1208

NAMESPACE_X500
 4a54d6e7-ce68-37fb-b0ba-09acc87cabb7
 4a54d6e7-ce68-37fb-b0ba-09acc87cabb7

12.8.3 UUID 4—Random Values

Sometimes, host-based and namespace-based UUID values are not "different enough." For example, in cases where UUID is intended to be used as a hash key, a more random sequence of values with more differentiation is desirable to avoid collisions in the hash table. Having values with fewer common digits also makes it easier to find them in log files. To add greater differentiation in UUIDs, use uuid4() to generate them using random input values.

import uuid

for i in xrange(3):
    print uuid.uuid4()

The source of randomness depends on which C libraries are available when uuid is imported. If libuuid (or uuid.dll) can be loaded and it contains a function for generating random values, it is used. Otherwise, os.urandom() or the random module are used.

$ python uuid_uuid4.py

b2637198-4629-44c2-8b9b-07a6ff601a89
d1b850c6-f842-4a25-a993-6d6160dda761
50fb5234-abce-40b8-b034-ba3637dad6fc

12.8.4 Working with UUID Objects

In addition to generating new UUID values, it is possible to parse strings in standard formats to create UUID objects, making it easier to handle comparisons and sorting operations.
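One practical consequence of parsing is that different spellings of the same identifier compare equal once they are UUID objects. The sketch below assumes Python 3 print syntax; the three input strings are variations of one value chosen for illustration.

```python
import uuid

# Three textual spellings of one identifier: URN form, braced form,
# and bare hexadecimal digits without dashes.
spellings = [
    'urn:uuid:f2f84497-b3bf-493a-bba9-7c68e6def80b',
    '{f2f84497-b3bf-493a-bba9-7c68e6def80b}',
    'f2f84497b3bf493abba97c68e6def80b',
]

parsed = [uuid.UUID(s) for s in spellings]

# UUID objects are hashable and compare by value, so duplicates collapse.
unique_values = set(parsed)
print(len(unique_values))
print(str(parsed[0]))
```

Normalizing to UUID objects before comparing avoids treating the braced or URN form as a different identifier.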
import uuid

def show(msg, l):
    print msg
    for v in l:
        print '', v
    print

input_values = [
    'urn:uuid:f2f84497-b3bf-493a-bba9-7c68e6def80b',
    '{417a5ebb-01f7-4ed5-aeac-3d56cd5037b0}',
    '2115773a-5bf1-11dd-ab48-001ec200d9e0',
    ]

show('input_values', input_values)

uuids = [ uuid.UUID(s) for s in input_values ]
show('converted to uuids', uuids)

uuids.sort()
show('sorted', uuids)

Surrounding curly braces are removed from the input, as are dashes (-). If the string has a prefix containing urn: and/or uuid:, it is also removed. The remaining text must be a string of 32 hexadecimal digits (16 bytes), which are then interpreted as a UUID value.

$ python uuid_uuid_objects.py

input_values
 urn:uuid:f2f84497-b3bf-493a-bba9-7c68e6def80b
 {417a5ebb-01f7-4ed5-aeac-3d56cd5037b0}
 2115773a-5bf1-11dd-ab48-001ec200d9e0

converted to uuids
 f2f84497-b3bf-493a-bba9-7c68e6def80b
 417a5ebb-01f7-4ed5-aeac-3d56cd5037b0
 2115773a-5bf1-11dd-ab48-001ec200d9e0

sorted
 2115773a-5bf1-11dd-ab48-001ec200d9e0
 417a5ebb-01f7-4ed5-aeac-3d56cd5037b0
 f2f84497-b3bf-493a-bba9-7c68e6def80b

See Also:
uuid (http://docs.python.org/lib/module-uuid.html) The standard library documentation for this module.
RFC 4122 (http://tools.ietf.org/html/rfc4122.html) A Universally Unique Identifier (UUID) URN Namespace.

12.9 json—JavaScript Object Notation

Purpose: Encode Python objects as JSON strings, and decode JSON strings into Python objects.
Python Version: 2.6 and later

The json module provides an API similar to pickle for converting in-memory Python objects to a serialized representation known as JavaScript Object Notation (JSON). Unlike pickle, JSON has the benefit of having implementations in many languages (especially JavaScript). It is most widely used for communicating between the web server and the client in an AJAX application, but it is also useful for other inter-application communication needs.
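The parallel with pickle can be seen by serializing the same structure with both modules. This is a minimal sketch, assuming Python 3 (where pickle produces bytes and json produces text); the sample data is illustrative.

```python
import json
import pickle

# A list is used instead of a tuple so both round trips compare equal.
data = [{'a': 'A', 'b': [2, 4], 'c': 3.0}]

as_json = json.dumps(data)      # text, readable by other languages
as_pickle = pickle.dumps(data)  # binary, specific to Python

print(type(as_json).__name__, type(as_pickle).__name__)

# Both modules expose the same dumps()/loads() pairing.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)
print(from_json == data, from_pickle == data)
```

The function names match on purpose, so code written against one module's dumps() and loads() can often be switched to the other when the data is limited to types both understand.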
12.9.1 Encoding and Decoding Simple Data Types

The encoder understands Python's native types by default (string, unicode, int, float, list, tuple, and dict).

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

data_string = json.dumps(data)
print 'JSON:', data_string

Values are encoded in a manner superficially similar to Python's repr() output.

$ python json_simple_types.py

DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
JSON: [{"a": "A", "c": 3.0, "b": [2, 4]}]

Encoding, and then redecoding, may not give exactly the same type of object.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA   :', data

data_string = json.dumps(data)
print 'ENCODED:', data_string

decoded = json.loads(data_string)
print 'DECODED:', decoded

print 'ORIGINAL:', type(data[0]['b'])
print 'DECODED :', type(decoded[0]['b'])

In particular, strings are converted to unicode objects and tuples become lists.

$ python json_simple_types_decode.py

DATA   : [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
ENCODED: [{"a": "A", "c": 3.0, "b": [2, 4]}]
DECODED: [{'a': 'A', 'c': 3.0, 'b': [2, 4]}]
ORIGINAL: <type 'tuple'>
DECODED : <type 'list'>

12.9.2 Human-Consumable vs. Compact Output

Another benefit of JSON over pickle is that the results are human-readable. The dumps() function accepts several arguments to make the output even nicer. For example, the sort_keys flag tells the encoder to output the keys of a dictionary in sorted, instead of arbitrary, order.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

unsorted = json.dumps(data)
print 'JSON:', json.dumps(data)
print 'SORT:', json.dumps(data, sort_keys=True)

first = json.dumps(data, sort_keys=True)
second = json.dumps(data, sort_keys=True)

print 'UNSORTED MATCH:', unsorted == first
print 'SORTED MATCH  :', first == second

Sorting makes it easier to scan the results by eye and also makes it possible to compare JSON output in tests.
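The testing benefit shows up most clearly when two dictionaries hold the same content but were built in a different order. The following sketch assumes Python 3 print syntax; the variable names are illustrative.

```python
import json

# Same content, built in a different key order.
first = {'a': 'A', 'b': [2, 4], 'c': 3.0}
second = {'c': 3.0, 'b': [2, 4], 'a': 'A'}

# With sort_keys=True the encoded text no longer depends on key order,
# so the strings can be compared directly in a test.
canonical_first = json.dumps(first, sort_keys=True)
canonical_second = json.dumps(second, sort_keys=True)

print(canonical_first)
print(canonical_first == canonical_second)
```

Comparing the sorted encodings gives a stable check even when the code under test assembles its dictionaries in varying order.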
$ python json_sort_keys.py

DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
JSON: [{"a": "A", "c": 3.0, "b": [2, 4]}]
SORT: [{"a": "A", "b": [2, 4], "c": 3.0}]
UNSORTED MATCH: False
SORTED MATCH  : True

For highly nested data structures, specify a value for indent so the output is formatted nicely as well.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

print 'NORMAL:', json.dumps(data, sort_keys=True)
print 'INDENT:', json.dumps(data, sort_keys=True, indent=2)

When indent is a non-negative integer, the output more closely resembles that of pprint, with leading spaces for each level of the data structure matching the indent level.

$ python json_indent.py

DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
NORMAL: [{"a": "A", "b": [2, 4], "c": 3.0}]
INDENT: [
  {
    "a": "A",
    "b": [
      2,
      4
    ],
    "c": 3.0
  }
]

Verbose output like this increases the number of bytes needed to transmit the same amount of data, however, so it is not intended for use in a production environment. In fact, it is possible to adjust the settings for separating data in the encoded output to make it even more compact than the default.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

print 'repr(data)             :', len(repr(data))

plain_dump = json.dumps(data)
print 'dumps(data)            :', len(plain_dump)

small_indent = json.dumps(data, indent=2)
print 'dumps(data, indent=2)  :', len(small_indent)

with_separators = json.dumps(data, separators=(',',':'))
print 'dumps(data, separators):', len(with_separators)

The separators argument to dumps() should be a tuple containing the strings to separate items in a list and keys from values in a dictionary. The default is (', ', ': '). By removing the whitespace, a more compact output is produced.
$ python json_compact_encoding.py

DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
repr(data)             : 35
dumps(data)            : 35
dumps(data, indent=2)  : 76
dumps(data, separators): 29

12.9.3 Encoding Dictionaries

The JSON format expects the keys to a dictionary to be strings. Trying to encode a dictionary with nonstring types as keys produces an exception. (The exception type depends on whether the pure-Python version of the module is loaded or the C speedups are available, but it will be either TypeError or ValueError.) One way to work around that limitation is to tell the encoder to skip over nonstring keys using the skipkeys argument.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0, ('d',):'D tuple' } ]

print 'First attempt'
try:
    print json.dumps(data)
except (TypeError, ValueError), err:
    print 'ERROR:', err

print
print 'Second attempt'
print json.dumps(data, skipkeys=True)

Rather than raising an exception, the nonstring key is ignored.

$ python json_skipkeys.py

First attempt
ERROR: keys must be a string

Second attempt
[{"a": "A", "c": 3.0, "b": [2, 4]}]

12.9.4 Working with Custom Types

All the examples so far have used Python's built-in types because those are supported by json natively. It is common to need to encode custom classes, as well, and there are two ways to do that.

Given this class to encode:

class MyObj(object):
    def __init__(self, s):
        self.s = s
    def __repr__(self):
        return '<MyObj(%s)>' % self.s

The simple way of encoding a MyObj instance is to define a function to convert an unknown type to a known type. It does not need to do the encoding, so it should just convert one object to another.
import json
import json_myobj

obj = json_myobj.MyObj('instance value goes here')

print 'First attempt'
try:
    print json.dumps(obj)
except TypeError, err:
    print 'ERROR:', err

def convert_to_builtin_type(obj):
    print 'default(', repr(obj), ')'
    # Convert objects to a dictionary of their representation
    d = { '__class__':obj.__class__.__name__,
          '__module__':obj.__module__,
          }
    d.update(obj.__dict__)
    return d

print
print 'With default'
print json.dumps(obj, default=convert_to_builtin_type)

In convert_to_builtin_type(), instances of classes not recognized by json are converted to dictionaries with enough information to re-create the object if a program has access to the Python modules necessary.

$ python json_dump_default.py

First attempt
ERROR: <MyObj(instance value goes here)> is not JSON serializable

With default
default( <MyObj(instance value goes here)> )
{"s": "instance value goes here", "__module__": "json_myobj", "__class__": "MyObj"}

To decode the results and create a MyObj() instance, use the object_hook argument to loads() to tie in to the decoder so the class can be imported from the module and used to create the instance.

The object_hook is called for each dictionary decoded from the incoming data stream, providing a chance to convert the dictionary to another type of object. The hook function should return the object the calling application should receive instead of the dictionary.

import json

def dict_to_object(d):
    if '__class__' in d:
        class_name = d.pop('__class__')
        module_name = d.pop('__module__')
        module = __import__(module_name)
        print 'MODULE:', module.__name__
        class_ = getattr(module, class_name)
        print 'CLASS:', class_
        args = dict( (key.encode('ascii'), value)
                     for key, value in d.items())
        print 'INSTANCE ARGS:', args
        inst = class_(**args)
    else:
        inst = d
    return inst

encoded_object = '''
    [{"s": "instance value goes here",
      "__module__": "json_myobj", "__class__": "MyObj"}]
    '''

myobj_instance = json.loads(encoded_object, object_hook=dict_to_object)
print myobj_instance

Since json converts string values to unicode objects, they need to be reencoded as ASCII strings before they can be used as keyword arguments to the class constructor.

$ python json_load_object_hook.py

MODULE: json_myobj
CLASS: <class 'json_myobj.MyObj'>
INSTANCE ARGS: {'s': 'instance value goes here'}
[<MyObj(instance value goes here)>]

Similar hooks are available for the built-in types integers (parse_int), floating-point numbers (parse_float), and constants (parse_constant).

12.9.5 Encoder and Decoder Classes

Besides the convenience functions already covered, the json module provides classes for encoding and decoding. Using the classes directly gives access to extra APIs for customizing their behavior.

The JSONEncoder uses an iterable interface for producing "chunks" of encoded data, making it easier to write to files or network sockets without having to represent an entire data structure in memory.

import json

encoder = json.JSONEncoder()
data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]

for part in encoder.iterencode(data):
    print 'PART:', part

The output is generated in logical units, rather than being based on any size value.

$ python json_encoder_iterable.py

PART: [
PART: {
PART: "a"
PART: :
PART: "A"
PART: ,
PART: "c"
PART: :
PART: 3.0
PART: ,
PART: "b"
PART: :
PART: [2
PART: , 4
PART: ]
PART: }
PART: ]

The encode() method is basically equivalent to the value produced by the expression ''.join(encoder.iterencode(data)), with some extra error checking up front.

To encode arbitrary objects, override the default() method with an implementation similar to the one used in convert_to_builtin_type().
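The relationship between iterencode() and encode() described above can be checked by streaming the chunks into a file-like object. This is a sketch under the assumption of Python 3, where io.StringIO plays the role of a file or socket wrapper; sort_keys is set only to make the result predictable.

```python
import io
import json

data = [{'a': 'A', 'b': (2, 4), 'c': 3.0}]
encoder = json.JSONEncoder(sort_keys=True)

# Write each chunk as soon as it is produced instead of building
# the whole encoded string first.
buffer = io.StringIO()
for part in encoder.iterencode(data):
    buffer.write(part)

streamed = buffer.getvalue()
print(streamed)
print(streamed == encoder.encode(data))
```

Any object with a write() method could stand in for the buffer, which is what makes the chunked interface convenient for large structures.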
import json
import json_myobj

class MyEncoder(json.JSONEncoder):

    def default(self, obj):
        print 'default(', repr(obj), ')'
        # Convert objects to a dictionary of their representation
        d = { '__class__':obj.__class__.__name__,
              '__module__':obj.__module__,
              }
        d.update(obj.__dict__)
        return d

obj = json_myobj.MyObj('internal data')
print obj
print MyEncoder().encode(obj)

The output is the same as the previous implementation.

$ python json_encoder_default.py

<MyObj(internal data)>
default( <MyObj(internal data)> )
{"s": "internal data", "__module__": "json_myobj", "__class__": "MyObj"}

Decoding text, and then converting the dictionary into an object, takes a little more work to set up than the previous implementation, but not much.

import json

class MyDecoder(json.JSONDecoder):

    def __init__(self):
        json.JSONDecoder.__init__(self, object_hook=self.dict_to_object)

    def dict_to_object(self, d):
        if '__class__' in d:
            class_name = d.pop('__class__')
            module_name = d.pop('__module__')
            module = __import__(module_name)
            print 'MODULE:', module.__name__
            class_ = getattr(module, class_name)
            print 'CLASS:', class_
            args = dict( (key.encode('ascii'), value)
                         for key, value in d.items())
            print 'INSTANCE ARGS:', args
            inst = class_(**args)
        else:
            inst = d
        return inst

encoded_object = '''
    [{"s": "instance value goes here",
      "__module__": "json_myobj", "__class__": "MyObj"}]
    '''

myobj_instance = MyDecoder().decode(encoded_object)
print myobj_instance

And the output is the same as the earlier example.

$ python json_decoder_object_hook.py

MODULE: json_myobj
CLASS: <class 'json_myobj.MyObj'>
INSTANCE ARGS: {'s': 'instance value goes here'}
[<MyObj(instance value goes here)>]

12.9.6 Working with Streams and Files

All the examples so far have assumed that the encoded version of the entire data structure could be held in memory at one time. With large data structures, it may be preferable to write the encoding directly to a file-like object.
The convenience functions load() and dump() accept references to a file-like object to use for reading or writing.

import json
from StringIO import StringIO

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]

f = StringIO()
json.dump(data, f)

print f.getvalue()

A socket or normal file handle would work the same way as the StringIO buffer used in this example.

$ python json_dump_file.py

[{"a": "A", "c": 3.0, "b": [2, 4]}]

Although it is not optimized to read only part of the data at a time, the load() function still offers the benefit of encapsulating the logic of generating objects from stream input.

import json
from StringIO import StringIO

f = StringIO('[{"a": "A", "c": 3.0, "b": [2, 4]}]')
print json.load(f)

Just as for dump(), any file-like object can be passed to load().

$ python json_load_file.py

[{'a': 'A', 'c': 3.0, 'b': [2, 4]}]

12.9.7 Mixed Data Streams

JSONDecoder includes raw_decode(), a method for decoding a data structure followed by more data, such as JSON data with trailing text. The return value is the object created by decoding the input data and an index into that data indicating where decoding left off.

import json

decoder = json.JSONDecoder()

def get_decoded_and_remainder(input_data):
    obj, end = decoder.raw_decode(input_data)
    remaining = input_data[end:]
    return (obj, end, remaining)

encoded_object = '[{"a": "A", "c": 3.0, "b": [2, 4]}]'
extra_text = 'This text is not JSON.'

print 'JSON first:'
data = ' '.join([encoded_object, extra_text])
obj, end, remaining = get_decoded_and_remainder(data)

print 'Object              :', obj
print 'End of parsed input :', end
print 'Remaining text      :', repr(remaining)

print
print 'JSON embedded:'
try:
    data = ' '.join([extra_text, encoded_object, extra_text])
    obj, end, remaining = get_decoded_and_remainder(data)
except ValueError, err:
    print 'ERROR:', err

Unfortunately, this only works if the object appears at the beginning of the input.
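One way around that limitation is to try raw_decode() at each position where a JSON array or object could start. The sketch below assumes Python 3, where raw_decode() accepts a starting index as its second argument; find_first_json() is a hypothetical helper name, not part of the json module.

```python
import json

decoder = json.JSONDecoder()


def find_first_json(text):
    """Return (obj, start, end) for the first decodable JSON array or
    object embedded in text, or None when nothing can be decoded."""
    for start, ch in enumerate(text):
        if ch not in '[{':
            continue
        try:
            obj, end = decoder.raw_decode(text, start)
        except ValueError:
            # Not valid JSON from this position; keep scanning.
            continue
        return obj, start, end
    return None


encoded_object = '[{"a": "A", "c": 3.0, "b": [2, 4]}]'
extra_text = 'This text is not JSON.'
mixed = ' '.join([extra_text, encoded_object, extra_text])

obj, start, end = find_first_json(mixed)
print(obj)
print(repr(mixed[end:]))
```

The scan is linear in the common case but retries the decoder at every candidate bracket, so it suits short inputs such as log lines better than large streams.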
$ python json_mixed_data.py

JSON first:
Object              : [{'a': 'A', 'c': 3.0, 'b': [2, 4]}]
End of parsed input : 35
Remaining text      : ' This text is not JSON.'

JSON embedded:
ERROR: No JSON object could be decoded

See Also:
json (http://docs.python.org/library/json.html) The standard library documentation for this module.
JavaScript Object Notation (http://json.org/) JSON home, with documentation and implementations in other languages.
simplejson (http://code.google.com/p/simplejson/) simplejson, from Bob Ippolito et al., is the externally maintained development version of the json library included with Python 2.6 and later. It maintains backwards compatibility with Python 2.4 and Python 2.5.
jsonpickle (http://code.google.com/p/jsonpickle/) jsonpickle allows for any Python object to be serialized into JSON.

12.10 xmlrpclib—Client Library for XML-RPC

Purpose: Client-side library for XML-RPC communication.
Python Version: 2.2 and later

XML-RPC is a lightweight remote procedure call protocol built on top of HTTP and XML. The xmlrpclib module lets a Python program communicate with an XML-RPC server written in any language.

All the examples in this section use the server defined in xmlrpclib_server.py, available in the source distribution and included here for reference.

from SimpleXMLRPCServer import SimpleXMLRPCServer
from xmlrpclib import Binary
import datetime

server = SimpleXMLRPCServer(('localhost', 9000),
                            logRequests=True,
                            allow_none=True)
server.register_introspection_functions()
server.register_multicall_functions()

class ExampleService:

    def ping(self):
        """Simple function to respond when called
        to demonstrate connectivity.
        """
        return True

    def now(self):
        """Returns the server current date and time."""
        return datetime.datetime.now()

    def show_type(self, arg):
        """Illustrates how types are passed in and out of
        server methods.

        Accepts one argument of any type.
        Returns a tuple with string representation of the value,
        the name of the type, and the value itself.
        """
        return (str(arg), str(type(arg)), arg)

    def raises_exception(self, msg):
        "Always raises a RuntimeError with the message passed in"
        raise RuntimeError(msg)

    def send_back_binary(self, bin):
        """Accepts single Binary argument, and unpacks and
        repacks it to return it."""
        data = bin.data
        response = Binary(data)
        return response

server.register_instance(ExampleService())

try:
    print 'Use Control-C to exit'
    server.serve_forever()
except KeyboardInterrupt:
    print 'Exiting'

12.10.1 Connecting to a Server

The simplest way to connect a client to a server is to instantiate a ServerProxy object, giving it the URI of the server. For example, the demo server runs on port 9000 of localhost.

import xmlrpclib

server = xmlrpclib.ServerProxy('http://localhost:9000')
print 'Ping:', server.ping()

In this case, the ping() method of the service takes no arguments and returns a single Boolean value.

$ python xmlrpclib_ServerProxy.py

Ping: True

Other options are available to support alternate transport. Both HTTP and HTTPS are supported out of the box, both with basic authentication. To implement a new communication channel, only a new transport class is needed. It could be an interesting exercise, for example, to implement XML-RPC over SMTP.

import xmlrpclib

server = xmlrpclib.ServerProxy('http://localhost:9000', verbose=True)
print 'Ping:', server.ping()

The verbose option gives debugging information useful for resolving communication errors.

$ python xmlrpclib_ServerProxy_verbose.py

Ping: connect: (localhost, 9000)
connect fail: ('localhost', 9000)
connect: (localhost, 9000)
connect fail: ('localhost', 9000)
connect: (localhost, 9000)
send: 'POST /RPC2 HTTP/1.0\r\nHost: localhost:9000\r\nUser-Agent: xmlrpclib.py/1.0.1 (by www.pythonware.com)\r\nContent-Type: text/xml\r\nContent-Length: 98\r\n\r\n'
send: "<?xml version='1.0'?>\n<methodCall>\n<methodName>ping</methodName>\n<params>\n</params>\n</methodCall>\n"
reply: 'HTTP/1.0 200 OK\r\n'
header: Server: BaseHTTP/0.3 Python/2.5.1
header: Date: Sun, 06 Jul 2008 19:56:13 GMT
header: Content-type: text/xml
header: Content-length: 129
body: "<?xml version='1.0'?>\n<methodResponse>\n<params>\n<param>\n<value><boolean>1</boolean></value>\n</param>\n</params>\n</methodResponse>\n"
True

The default encoding can be changed from UTF-8 if an alternate system is needed.

import xmlrpclib

server = xmlrpclib.ServerProxy('http://localhost:9000',
                               encoding='ISO-8859-1')
print 'Ping:', server.ping()

The server automatically detects the correct encoding.

$ python xmlrpclib_ServerProxy_encoding.py

Ping: True

The allow_none option controls whether Python's None value is automatically translated to a nil value or whether it causes an error.

import xmlrpclib

server = xmlrpclib.ServerProxy('http://localhost:9000',
                               allow_none=True)
print 'Allowed:', server.show_type(None)

server = xmlrpclib.ServerProxy('http://localhost:9000',
                               allow_none=False)
try:
    server.show_type(None)
except TypeError as err:
    print 'ERROR:', err

The error is raised locally if the client does not allow None, but it can also be raised from within the server if it is not configured to allow None.

$ python xmlrpclib_ServerProxy_allow_none.py

Allowed: ['None', "<type 'NoneType'>", None]
ERROR: cannot marshal None unless allow_none is enabled

12.10.2 Data Types

The XML-RPC protocol recognizes a limited set of common data types. The types can be passed as arguments or return values and combined to create more complex data structures.
    import xmlrpclib
    import datetime

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    for t, v in [ ('boolean', True),
                  ('integer', 1),
                  ('float', 2.5),
                  ('string', 'some text'),
                  ('datetime', datetime.datetime.now()),
                  ('array', ['a', 'list']),
                  ('array', ('a', 'tuple')),
                  ('structure', {'a':'dictionary'}),
                ]:
        as_string, type_name, value = server.show_type(v)
        print '%-12s:' % t, as_string
        print '%12s ' % '', type_name
        print '%12s ' % '', value

The simple types are

    $ python xmlrpclib_types.py

    boolean     : True
                  True
    integer     : 1
                  1
    float       : 2.5
                  2.5
    string      : some text
                  some text
    datetime    : 20101128T20:15:21
                  20101128T20:15:21
    array       : ['a', 'list']
                  ['a', 'list']
    array       : ['a', 'tuple']
                  ['a', 'tuple']
    structure   : {'a': 'dictionary'}
                  {'a': 'dictionary'}

The supported types can be nested to create values of arbitrary complexity.

    import xmlrpclib
    import datetime
    import pprint

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    data = { 'boolean':True,
             'integer': 1,
             'floating-point number': 2.5,
             'string': 'some text',
             'datetime': datetime.datetime.now(),
             'array':['a', 'list'],
             'array':('a', 'tuple'),
             'structure':{'a':'dictionary'},
             }
    arg = []
    for i in range(3):
        d = {}
        d.update(data)
        d['integer'] = i
        arg.append(d)

    print 'Before:'
    pprint.pprint(arg)

    print
    print 'After:'
    pprint.pprint(server.show_type(arg)[-1])

This program passes a list of dictionaries containing all the supported types to the sample server, which returns the data. Tuples are converted to lists, and datetime instances are converted to DateTime objects. Otherwise, the data is unchanged.
$ python xmlrpclib_types_nested.py Before: [{’array’: (’a’, ’tuple’), ’boolean’: True, ’datetime’: datetime.datetime(2008, 7, 6, 16, 24, 52, 348849), ’floating-point number’: 2.5, ’integer’: 0, ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}, {’array’: (’a’, ’tuple’), ’boolean’: True, ’datetime’: datetime.datetime(2008, 7, 6, 16, 24, 52, 348849), ’floating-point number’: 2.5, ’integer’: 1, ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}, {’array’: (’a’, ’tuple’), ’boolean’: True, ’datetime’: datetime.datetime(2008, 7, 6, 16, 24, 52, 348849), ’floating-point number’: 2.5, ’integer’: 2, ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}] After: [{’array’: [’a’, ’tuple’], ’boolean’: True, ’datetime’: , ’floating-point number’: 2.5, ’integer’: 0, 12.10. xmlrpclib—Client Library for XML-RPC 709 ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}, {’array’: [’a’, ’tuple’], ’boolean’: True, ’datetime’: , ’floating-point number’: 2.5, ’integer’: 1, ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}, {’array’: [’a’, ’tuple’], ’boolean’: True, ’datetime’: , ’floating-point number’: 2.5, ’integer’: 2, ’string’: ’some text’, ’structure’: {’a’: ’dictionary’}}] XML-RPC supports dates as a native type, and xmlrpclib can use one of two classes to represent the date values in the outgoing proxy or when they are received from the server. By default an internal version of DateTime is used, but the use_datetime option turns on support for using the classes in the datetime module. 12.10.3 Passing Objects Instances of Python classes are treated as structures and passed as a dictionary, with the attributes of the object as values in the dictionary. 
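The conversion of an instance to a structure can be seen in isolation. This is a minimal sketch under stated assumptions: it uses Python 3's xmlrpc.client (the renamed xmlrpclib) and a hypothetical Point class, and it replaces the network call with a local dumps()/loads() round trip.

```python
# Sketch for Python 3 (xmlrpc.client is the renamed xmlrpclib).
import xmlrpc.client


class Point:
    """Hypothetical example class with two plain attributes."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


original = Point(1, 'two')

# Marshal the instance, then parse the XML back into Python values.
xml = xmlrpc.client.dumps((original,))
params, _ = xmlrpc.client.loads(xml)
received = params[0]

print(type(received).__name__)
print(received)
```

Only the instance attributes travel in the message, so the receiving side has a plain dictionary and no record of the Point class.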
    import xmlrpclib
    import pprint

    class MyObj:
        def __init__(self, a, b):
            self.a = a
            self.b = b
        def __repr__(self):
            return 'MyObj(%s, %s)' % (repr(self.a), repr(self.b))

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    o = MyObj(1, 'b goes here')
    print 'o :', o
    pprint.pprint(server.show_type(o))

    o2 = MyObj(2, o)
    print 'o2 :', o2
    pprint.pprint(server.show_type(o2))

When the value is sent back to the client from the server, the result is a dictionary on the client, since there is nothing encoded in the values to tell the server (or the client) that it should be instantiated as part of a class.

    $ python xmlrpclib_types_object.py

    o : MyObj(1, 'b goes here')
    ["{'a': 1, 'b': 'b goes here'}", "", {'a': 1, 'b': 'b goes here'}]
    o2 : MyObj(2, MyObj(1, 'b goes here'))
    ["{'a': 2, 'b': {'a': 1, 'b': 'b goes here'}}",
     "",
     {'a': 2, 'b': {'a': 1, 'b': 'b goes here'}}]

12.10.4 Binary Data

All values passed to the server are encoded and escaped automatically. However, some data types may contain characters that are not valid XML. For example, binary image data may include byte values in the ASCII control range 0 to 31. To pass binary data, it is best to use the Binary class to encode it for transport.

    import xmlrpclib

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    s = 'This is a string with control characters' + '\0'
    print 'Local string:', s

    data = xmlrpclib.Binary(s)
    print 'As binary:', server.send_back_binary(data)

    try:
        print 'As string:', server.show_type(s)
    except xmlrpclib.Fault as err:
        print '\nERROR:', err

If the string containing a NULL byte is passed to show_type(), an exception is raised in the XML parser.

    $ python xmlrpclib_Binary.py

    Local string: This is a string with control characters
    As binary: This is a string with control characters
    As string:
    ERROR: :not well-formed (invalid token): line 6, column 55">

Binary objects can also be used to send objects using pickle.
The normal security issues related to sending what amounts to executable code over the wire apply here (i.e., do not do this unless the communication channel is secure).

    import xmlrpclib
    import cPickle as pickle
    import pprint

    class MyObj:
        def __init__(self, a, b):
            self.a = a
            self.b = b
        def __repr__(self):
            return 'MyObj(%s, %s)' % (repr(self.a), repr(self.b))

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    o = MyObj(1, 'b goes here')
    print 'Local:', id(o)
    print o

    print '\nAs object:'
    pprint.pprint(server.show_type(o))

    p = pickle.dumps(o)
    b = xmlrpclib.Binary(p)
    r = server.send_back_binary(b)

    o2 = pickle.loads(r.data)
    print '\nFrom pickle:', id(o2)
    pprint.pprint(o2)

The data attribute of the Binary instance contains the pickled version of the object, so it has to be unpickled before it can be used. That results in a different object (with a new id value).

    $ python xmlrpclib_Binary_pickle.py

    Local: 4321077872
    MyObj(1, 'b goes here')

    As object:
    ["{'a': 1, 'b': 'b goes here'}", "", {'a': 1, 'b': 'b goes here'}]

    From pickle: 4321252344
    MyObj(1, 'b goes here')

12.10.5 Exception Handling

Since the XML-RPC server might be written in any language, exception classes cannot be transmitted directly. Instead, exceptions raised in the server are converted to Fault objects and raised as exceptions locally in the client.

    import xmlrpclib

    server = xmlrpclib.ServerProxy('http://localhost:9000')
    try:
        server.raises_exception('A message')
    except Exception as err:
        print 'Fault code:', err.faultCode
        print 'Message :', err.faultString

The original error message is saved in the faultString attribute, and faultCode is set to an XML-RPC error number.

    $ python xmlrpclib_exception.py

    Fault code: 1
    Message : :A message

12.10.6 Combining Calls into One Message

Multicall is an extension to the XML-RPC protocol that allows more than one call to be sent at the same time, with the responses collected and returned to the caller.
The MultiCall class was added to xmlrpclib in Python 2.4.

    import xmlrpclib

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    multicall = xmlrpclib.MultiCall(server)
    multicall.ping()
    multicall.show_type(1)
    multicall.show_type('string')

    for i, r in enumerate(multicall()):
        print i, r

To use a MultiCall instance, invoke the methods on it as with a ServerProxy, and then call the object with no arguments to actually run the remote functions. The return value is an iterator that yields the results from all the calls.

    $ python xmlrpclib_MultiCall.py

    0 True
    1 ['1', "", 1]
    2 ['string', "", 'string']

If one of the calls causes a Fault, the exception is raised when the result is produced from the iterator and no more results are available.

    import xmlrpclib

    server = xmlrpclib.ServerProxy('http://localhost:9000')

    multicall = xmlrpclib.MultiCall(server)
    multicall.ping()
    multicall.show_type(1)
    multicall.raises_exception('Next to last call stops execution')
    multicall.show_type('string')

    try:
        for i, r in enumerate(multicall()):
            print i, r
    except xmlrpclib.Fault as err:
        print 'ERROR:', err

Since the third response, from raises_exception(), generates an exception, the response from show_type() is not accessible.

    $ python xmlrpclib_MultiCall_exception.py

    0 True
    1 ['1', "", 1]
    ERROR: :Next to last call stops execution">

See Also:
xmlrpclib (http://docs.python.org/lib/module-xmlrpclib.html) The standard library documentation for this module.
SimpleXMLRPCServer (Section 12.11) An XML-RPC server implementation.

12.11 SimpleXMLRPCServer—An XML-RPC Server

Purpose: Implements an XML-RPC server.
Python Version: 2.2 and later

The SimpleXMLRPCServer module contains classes for creating cross-platform, language-independent servers using the XML-RPC protocol. Client libraries exist for many other languages besides Python, making XML-RPC an easy choice for building RPC-style services.
Note: All the examples provided here include a client module as well to interact with the demonstration server. To run the examples, use two separate shell windows, one for the server and one for the client.

12.11.1 A Simple Server

This simple server example exposes a single function that takes the name of a directory and returns the contents. The first step is to create the SimpleXMLRPCServer instance and then tell it where to listen for incoming requests (localhost, port 9000, in this case). The next step is to define a function to be part of the service and register it so the server knows how to call it. The final step is to put the server into an infinite loop receiving and responding to requests.

Warning: This implementation has obvious security implications: any client that can connect can list any directory the server process can read. Do not run it on a server on the open Internet or in any environment where security might be an issue.

    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import logging
    import os

    # Set up logging
    logging.basicConfig(level=logging.DEBUG)

    server = SimpleXMLRPCServer(('localhost', 9000), logRequests=True)

    # Expose a function
    def list_contents(dir_name):
        logging.debug('list_contents(%s)', dir_name)
        return os.listdir(dir_name)
    server.register_function(list_contents)

    # Start the server
    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

The server can be accessed at the URL http://localhost:9000 using the client class from xmlrpclib. This example code illustrates how to call the list_contents() service from Python.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print proxy.list_contents('/tmp')

The ServerProxy is connected to the server using its base URL, and then methods are called directly on the proxy. Each method invoked on the proxy is translated into a request to the server. The arguments are formatted using XML and then sent to the server in a POST message.
The server unpacks the XML and determines which function to call based on the method name invoked from the client. The arguments are passed to the function, and the return value is translated back to XML to be returned to the client.

Starting the server gives:

    $ python SimpleXMLRPCServer_function.py

    Use Control-C to exit

Running the client in a second window shows the contents of the /tmp directory.

    $ python SimpleXMLRPCServer_function_client.py

    ['.s.PGSQL.5432', '.s.PGSQL.5432.lock', '.X0-lock', '.X11-unix',
     'ccc_exclude.1mkahl', 'ccc_exclude.BKG3gb', 'ccc_exclude.M5jrgo',
     'ccc_exclude.SPecwL', 'com.hp.launchport', 'emacs527',
     'hsperfdata_dhellmann', 'launch-8hGHUp', 'launch-RQnlcc',
     'launch-trsdly', 'launchd-242.T5UzTy', 'var_backups']

After the request is finished, log output appears in the server window.

    $ python SimpleXMLRPCServer_function.py

    Use Control-C to exit
    DEBUG:root:list_contents(/tmp)
    localhost - - [29/Jun/2008 09:32:07] "POST /RPC2 HTTP/1.0" 200 -

The first line of output is from the logging.debug() call inside list_contents(). The second line is from the server logging the request because logRequests is True.

12.11.2 Alternate API Names

Sometimes, the function names used inside a module or library are not the names that should be used in the external API. Names may change because a platform-specific implementation is loaded, the service API is built dynamically based on a configuration file, or real functions are to be replaced with stubs for testing. To register a function with an alternate name, pass the name as the second argument to register_function(), like this.
    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import os

    server = SimpleXMLRPCServer(('localhost', 9000))

    # Expose a function with an alternate name
    def list_contents(dir_name):
        return os.listdir(dir_name)
    server.register_function(list_contents, 'dir')

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

The client should now use the name dir() instead of list_contents().

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print 'dir():', proxy.dir('/tmp')
    try:
        print '\nlist_contents():', proxy.list_contents('/tmp')
    except xmlrpclib.Fault as err:
        print '\nERROR:', err

Calling list_contents() results in an error, since the server no longer has a handler registered by that name.

    $ python SimpleXMLRPCServer_alternate_name_client.py

    dir(): ['ccc_exclude.GIqLcR', 'ccc_exclude.kzR42t',
    'ccc_exclude.LV04nf', 'ccc_exclude.Vfzylm', 'emacs527',
    'icssuis527', 'launch-9hTTwf', 'launch-kCXjtT',
    'launch-Nwc3AB', 'launch-pwCgej', 'launch-Xrku4Q',
    'launch-YtDZBJ', 'launchd-167.AfaNuZ', 'var_backups']

    list_contents():
    ERROR: :method "list_contents" is not supported'>

12.11.3 Dotted API Names

Individual functions can be registered with names that are not normally legal for Python identifiers. For example, a period (.) can be included in the names to separate the namespace in the service. The next example extends the “directory” service to add “create” and “remove” calls. All the functions are registered using the prefix “dir.” so that the same server can provide other services using a different prefix. One other difference in this example is that some of the functions return None, so the server has to be told to translate the None values to a nil value.
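The effect of the allow_none setting on marshalling can be checked locally. This is a minimal sketch, assuming Python 3's xmlrpc.client (the renamed xmlrpclib); it calls dumps() directly instead of going through a server, on the assumption that the same switch governs how a server encodes its return values.

```python
# Sketch for Python 3 (xmlrpc.client is the renamed xmlrpclib).
import xmlrpc.client

# With allow_none enabled, None is encoded as the <nil/> extension type.
xml = xmlrpc.client.dumps((None,), allow_none=True)
print('<nil/>' in xml)

# The nil value is parsed back to None on the receiving side.
params, _ = xmlrpc.client.loads(xml)
print(params[0] is None)

# Without the option, marshalling fails before anything is sent.
try:
    xmlrpc.client.dumps((None,), allow_none=False)
except TypeError as err:
    error_message = str(err)
    print('ERROR:', error_message)
```

Because nil is an extension to the base protocol, both ends of a connection need to agree to use it; a function such as os.mkdir() that returns None is the typical reason to enable it on the server.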
    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import os

    server = SimpleXMLRPCServer(('localhost', 9000), allow_none=True)

    server.register_function(os.listdir, 'dir.list')
    server.register_function(os.mkdir, 'dir.create')
    server.register_function(os.rmdir, 'dir.remove')

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

To call the service functions in the client, simply refer to them with the dotted name.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print 'BEFORE :', 'EXAMPLE' in proxy.dir.list('/tmp')
    print 'CREATE :', proxy.dir.create('/tmp/EXAMPLE')
    print 'SHOULD EXIST :', 'EXAMPLE' in proxy.dir.list('/tmp')
    print 'REMOVE :', proxy.dir.remove('/tmp/EXAMPLE')
    print 'AFTER :', 'EXAMPLE' in proxy.dir.list('/tmp')

Assuming there is no /tmp/EXAMPLE file on the current system, this is the output for the sample client script.

    $ python SimpleXMLRPCServer_dotted_name_client.py

    BEFORE : False
    CREATE : None
    SHOULD EXIST : True
    REMOVE : None
    AFTER : False

12.11.4 Arbitrary API Names

Another interesting feature is the ability to register functions with names that are otherwise invalid Python-object attribute names. This example service registers a function with the name “multiply args.”

    from SimpleXMLRPCServer import SimpleXMLRPCServer

    server = SimpleXMLRPCServer(('localhost', 9000))

    def my_function(a, b):
        return a * b
    server.register_function(my_function, 'multiply args')

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

Since the registered name contains a space, dot notation cannot be used to access it directly from the proxy. Using getattr() does work, however.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print getattr(proxy, 'multiply args')(5, 5)

Avoid creating services with names like this, though.
This example is provided not necessarily because it is a good idea, but because services with arbitrary names already exist, and new programs may need to be able to call them.

    $ python SimpleXMLRPCServer_arbitrary_name_client.py

    25

12.11.5 Exposing Methods of Objects

The earlier sections talked about techniques for establishing APIs using good naming conventions and namespacing. Another way to incorporate namespacing into an API is to use instances of classes and expose their methods. The first example can be re-created using an instance with a single method.

    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import os
    import inspect

    server = SimpleXMLRPCServer(('localhost', 9000), logRequests=True)

    class DirectoryService:
        def list(self, dir_name):
            return os.listdir(dir_name)

    server.register_instance(DirectoryService())

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

A client can call the method directly as follows.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print proxy.list('/tmp')

The output is:

    $ python SimpleXMLRPCServer_instance_client.py

    ['ccc_exclude.1mkahl', 'ccc_exclude.BKG3gb', 'ccc_exclude.M5jrgo',
     'ccc_exclude.SPecwL', 'com.hp.launchport', 'emacs527',
     'hsperfdata_dhellmann', 'launch-8hGHUp', 'launch-RQnlcc',
     'launch-trsdly', 'launchd-242.T5UzTy', 'var_backups']

The “dir.” prefix for the service has been lost, though. It can be restored by defining a class to set up a service tree that can be invoked from clients.
    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import os
    import inspect

    server = SimpleXMLRPCServer(('localhost', 9000), logRequests=True)

    class ServiceRoot:
        pass

    class DirectoryService:
        def list(self, dir_name):
            return os.listdir(dir_name)

    root = ServiceRoot()
    root.dir = DirectoryService()

    server.register_instance(root, allow_dotted_names=True)

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

By registering the instance of ServiceRoot with allow_dotted_names enabled, the server has permission to walk the tree of objects when a request comes in to find the named method using getattr().

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print proxy.dir.list('/tmp')

The output of dir.list() is the same as with the previous implementations.

    $ python SimpleXMLRPCServer_instance_dotted_names_client.py

    ['ccc_exclude.1mkahl', 'ccc_exclude.BKG3gb', 'ccc_exclude.M5jrgo',
     'ccc_exclude.SPecwL', 'com.hp.launchport', 'emacs527',
     'hsperfdata_dhellmann', 'launch-8hGHUp', 'launch-RQnlcc',
     'launch-trsdly', 'launchd-242.T5UzTy', 'var_backups']

12.11.6 Dispatching Calls

By default, register_instance() finds all callable attributes of the instance with names not starting with an underscore (“_”) and registers them with their name. To be more careful about the exposed methods, custom dispatching logic can be used, as in the following example.

    from SimpleXMLRPCServer import SimpleXMLRPCServer
    import os
    import inspect

    server = SimpleXMLRPCServer(('localhost', 9000), logRequests=True)

    def expose(f):
        "Decorator to set exposed flag on a function."
        f.exposed = True
        return f

    def is_exposed(f):
        "Test whether another function should be publicly exposed."
        return getattr(f, 'exposed', False)

    class MyService:
        PREFIX = 'prefix'

        def _dispatch(self, method, params):
            # Remove our prefix from the method name
            if not method.startswith(self.PREFIX + '.'):
                raise Exception('method "%s" is not supported' % method)

            method_name = method.partition('.')[2]
            func = getattr(self, method_name)
            if not is_exposed(func):
                raise Exception('method "%s" is not supported' % method)

            return func(*params)

        @expose
        def public(self):
            return 'This is public'

        def private(self):
            return 'This is private'

    server.register_instance(MyService())

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

The public() method of MyService is marked as exposed to the XML-RPC service while private() is not. The _dispatch() method is invoked when the client tries to access a function that is part of MyService. It first enforces the use of a prefix (“prefix.” in this case, but any string can be used). Then it requires the function to have an attribute called exposed with a true value. The exposed flag is set on a function using a decorator for convenience. Here are a few sample client calls.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    print 'public():', proxy.prefix.public()
    try:
        print 'private():', proxy.prefix.private()
    except Exception as err:
        print '\nERROR:', err
    try:
        print 'public() without prefix:', proxy.public()
    except Exception as err:
        print '\nERROR:', err

And here is the resulting output, with the expected error messages trapped and reported.

    $ python SimpleXMLRPCServer_instance_with_prefix_client.py

    public(): This is public
    private():
    ERROR: :method "prefix.private" is not supported'>
    public() without prefix:
    ERROR: :method "public" is not supported'>

There are several other ways to override the dispatching mechanism, including subclassing directly from SimpleXMLRPCServer. Refer to the docstrings in the module for more details.
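The dispatching rules can be exercised without opening a socket. The sketch below is an illustration under several assumptions, not the book's code: it targets Python 3, where the module is named xmlrpc.server; it drives the request through SimpleXMLRPCDispatcher._marshaled_dispatch(), an underscore-prefixed helper used here only to keep the example in one process; and the call() function is a hypothetical stand-in for a ServerProxy.

```python
# Sketch for Python 3: SimpleXMLRPCServer lives in xmlrpc.server.
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCDispatcher


def expose(f):
    "Decorator to set exposed flag on a function."
    f.exposed = True
    return f


class MyService:
    PREFIX = 'prefix'

    def _dispatch(self, method, params):
        # Require the prefix first, then require the exposed flag.
        prefix, sep, method_name = method.partition('.')
        if prefix != self.PREFIX or not sep:
            raise Exception('method "%s" is not supported' % method)
        func = getattr(self, method_name, None)
        if not getattr(func, 'exposed', False):
            raise Exception('method "%s" is not supported' % method)
        return func(*params)

    @expose
    def public(self):
        return 'This is public'

    def private(self):
        return 'This is private'


dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_instance(MyService())


def call(method_name, *args):
    """Hypothetical helper: push one request through the dispatcher."""
    request = xmlrpc.client.dumps(args, methodname=method_name)
    # _marshaled_dispatch() is an internal helper; it takes the raw
    # request body and returns the raw response body.
    response = dispatcher._marshaled_dispatch(request.encode('utf-8'))
    params, _ = xmlrpc.client.loads(response)  # raises Fault on error
    return params[0]


print('public():', call('prefix.public'))
for name in ['prefix.private', 'public']:
    try:
        call(name)
    except xmlrpc.client.Fault as err:
        print('ERROR:', err.faultString)
```

Using getattr() with a default of None means an unknown method name is reported with the same "not supported" message as an unexposed one, so a caller cannot tell the two cases apart.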
12.11.7 Introspection API

As with many network services, it is possible to query an XML-RPC server to ask it what methods it supports and learn how to use them. SimpleXMLRPCServer includes a set of public methods for performing this introspection. By default, they are turned off, but can be enabled with register_introspection_functions(). Support for system.listMethods() and system.methodHelp() can be added to a service by defining _listMethods() and _methodHelp() on the service class.

    from SimpleXMLRPCServer import ( SimpleXMLRPCServer,
                                     list_public_methods,
                                     )
    import os
    import inspect

    server = SimpleXMLRPCServer(('localhost', 9000), logRequests=True)
    server.register_introspection_functions()

    class DirectoryService:

        def _listMethods(self):
            return list_public_methods(self)

        def _methodHelp(self, method):
            f = getattr(self, method)
            return inspect.getdoc(f)

        def list(self, dir_name):
            """list(dir_name) => []

            Returns a list containing the contents of
            the named directory.
            """
            return os.listdir(dir_name)

    server.register_instance(DirectoryService())

    try:
        print 'Use Control-C to exit'
        server.serve_forever()
    except KeyboardInterrupt:
        print 'Exiting'

In this case, the convenience function list_public_methods() scans an instance to return the names of callable attributes that do not start with underscore (_). Redefine _listMethods() to apply whatever rules are desired. Similarly, for this basic example, _methodHelp() returns the docstring of the function, but could be written to build a help string from another source.

This client queries the server and reports on all the publicly callable methods.

    import xmlrpclib

    proxy = xmlrpclib.ServerProxy('http://localhost:9000')
    for method_name in proxy.system.listMethods():
        print '=' * 60
        print method_name
        print '-' * 60
        print proxy.system.methodHelp(method_name)
        print

The system methods are included in the results.
    $ python SimpleXMLRPCServer_introspection_client.py

    ============================================================
    list
    ------------------------------------------------------------
    list(dir_name) => []

    Returns a list containing the contents of the named directory.

    ============================================================
    system.listMethods
    ------------------------------------------------------------
    system.listMethods() => ['add', 'subtract', 'multiple']

    Returns a list of the methods supported by the server.

    ============================================================
    system.methodHelp
    ------------------------------------------------------------
    system.methodHelp('add') => "Adds two integers together"

    Returns a string containing documentation for the specified method.

    ============================================================
    system.methodSignature
    ------------------------------------------------------------
    system.methodSignature('add') => [double, int, int]

    Returns a list describing the signature of the method. In the
    above example, the add method takes two integers as arguments
    and returns a double result.

    This server does NOT support system.methodSignature.

See Also:
SimpleXMLRPCServer (http://docs.python.org/lib/module-SimpleXMLRPCServer.html) The standard library documentation for this module.
XML-RPC How To (http://www.tldp.org/HOWTO/XML-RPC-HOWTO/index.html) Describes how to use XML-RPC to implement clients and servers in a variety of languages.
XML-RPC Extensions (http://ontosys.com/xml-rpc/extensions.php) Specifies an extension to the XML-RPC protocol.
xmlrpclib (Section 12.10) XML-RPC client library.

Chapter 13

EMAIL

Email is one of the oldest forms of digital communication, but it is still one of the most popular. Python’s standard library includes modules for sending, receiving, and storing email messages.

smtplib communicates with a mail server to deliver a message.
smtpd can be used to create a custom mail server, and it provides classes useful for debugging email transmission in other applications.

imaplib uses the IMAP protocol to manipulate messages stored on a server. It provides a low-level API for IMAP clients and can query, retrieve, move, and delete messages.

Local message archives can be created and modified with mailbox using several standard formats, including the popular mbox and Maildir formats used by many email client programs.

13.1 smtplib—Simple Mail Transfer Protocol Client

Purpose: Interact with SMTP servers, including sending email.
Python Version: 1.5.2 and later

smtplib includes the class SMTP, which can be used to communicate with mail servers to send mail.

Note: The email addresses, hostnames, and IP addresses in the following examples have been obscured. Otherwise, the transcripts illustrate the sequence of commands and responses accurately.

13.1.1 Sending an Email Message

The most common use of SMTP is to connect to a mail server and send a message. The mail server host name and port can be passed to the constructor, or connect() can be invoked explicitly. Once connected, call sendmail() with the envelope parameters and the body of the message. The message text should be fully formed and comply with RFC 2822, since smtplib does not modify the contents or headers at all. That means the caller needs to add the From and To headers.
    import smtplib
    import email.utils
    from email.mime.text import MIMEText

    # Create the message
    msg = MIMEText('This is the body of the message.')
    msg['To'] = email.utils.formataddr(('Recipient',
                                        'recipient@example.com'))
    msg['From'] = email.utils.formataddr(('Author',
                                          'author@example.com'))
    msg['Subject'] = 'Simple test message'

    server = smtplib.SMTP('mail')
    server.set_debuglevel(True) # show communication with the server
    try:
        server.sendmail('author@example.com',
                        ['recipient@example.com'],
                        msg.as_string())
    finally:
        server.quit()

In this example, debugging is also turned on to show the communication between the client and the server. Otherwise, the example would produce no output at all.

    $ python smtplib_sendmail.py

    send: 'ehlo farnsworth.local\r\n'
    reply: '250-mail.example.com Hello [192.168.1.27], pleased to meet you\r\n'
    reply: '250-ENHANCEDSTATUSCODES\r\n'
    reply: '250-PIPELINING\r\n'
    reply: '250-8BITMIME\r\n'
    reply: '250-SIZE\r\n'
    reply: '250-DSN\r\n'
    reply: '250-ETRN\r\n'
    reply: '250-AUTH GSSAPI DIGEST-MD5 CRAM-MD5\r\n'
    reply: '250-DELIVERBY\r\n'
    reply: '250 HELP\r\n'
    reply: retcode (250); Msg: mail.example.com Hello [192.168.1.27], pleased to meet you
    ENHANCEDSTATUSCODES
    PIPELINING
    8BITMIME
    SIZE
    DSN
    ETRN
    AUTH GSSAPI DIGEST-MD5 CRAM-MD5
    DELIVERBY
    HELP
    send: 'mail FROM: size=229\r\n'
    reply: '250 2.1.0 ... Sender ok\r\n'
    reply: retcode (250); Msg: 2.1.0 ... Sender ok
    send: 'rcpt TO:\r\n'
    reply: '250 2.1.5 ... Recipient ok\r\n'
    reply: retcode (250); Msg: 2.1.5 ... Recipient ok
    send: 'data\r\n'
    reply: '354 Enter mail, end with "." on a line by itself\r\n'
    reply: retcode (354); Msg: Enter mail, end with "." on a line by itself
    data: (354, 'Enter mail, end with "."
    on a line by itself')
    send: 'Content-Type: text/plain; charset="us-ascii"\r\nMIME-Version: 1.0\r\nContent-Transfer-Encoding: 7bit\r\nTo: Recipient \r\nFrom: Author \r\nSubject: Simple test message\r\n\r\nThis is the body of the message.\r\n.\r\n'
    reply: '250 2.0.0 oAT1TiRA010200 Message accepted for delivery\r\n'
    reply: retcode (250); Msg: 2.0.0 oAT1TiRA010200 Message accepted for delivery
    data: (250, '2.0.0 oAT1TiRA010200 Message accepted for delivery')
    send: 'quit\r\n'
    reply: '221 2.0.0 mail.example.com closing connection\r\n'
    reply: retcode (221); Msg: 2.0.0 mail.example.com closing connection

The second argument to sendmail(), the recipients, is passed as a list. Any number of addresses can be included in the list to have the message delivered to each of them in turn. Since the envelope information is separate from the message headers, it is possible to blind carbon copy (BCC) someone by including them in the method argument, but not in the message header.

13.1.2 Authentication and Encryption

The SMTP class also handles authentication and TLS (transport layer security) encryption, when the server supports them. To determine if the server supports TLS, call ehlo() directly to identify the client to the server and ask it what extensions are available. Then, call has_extn() to check the results. After TLS is started, ehlo() must be called again before authenticating.
    import smtplib
    import email.utils
    from email.mime.text import MIMEText
    import getpass

    # Prompt the user for connection info
    to_email = raw_input('Recipient: ')
    servername = raw_input('Mail server name: ')
    username = raw_input('Mail username: ')
    password = getpass.getpass("%s's password: " % username)

    # Create the message
    msg = MIMEText('Test message from PyMOTW.')
    msg.set_unixfrom('author')
    msg['To'] = email.utils.formataddr(('Recipient', to_email))
    msg['From'] = email.utils.formataddr(('Author',
                                          'author@example.com'))
    msg['Subject'] = 'Test from PyMOTW'

    server = smtplib.SMTP(servername)
    try:
        server.set_debuglevel(True)

        # identify ourselves, prompting server for supported features
        server.ehlo()

        # If we can encrypt this session, do it
        if server.has_extn('STARTTLS'):
            server.starttls()
            server.ehlo() # reidentify ourselves over TLS connection

        server.login(username, password)
        server.sendmail('author@example.com',
                        [to_email],
                        msg.as_string())
    finally:
        server.quit()

The STARTTLS extension does not appear in the reply to EHLO after TLS is enabled.
    $ python smtplib_authenticated.py

    Recipient: recipient@example.com
    Mail server name: smtpauth.isp.net
    Mail username: user@isp.net
    user@isp.net's password:
    send: 'ehlo localhost.local\r\n'
    reply: '250-elasmtp-isp.net Hello localhost.local []\r\n'
    reply: '250-SIZE 14680064\r\n'
    reply: '250-PIPELINING\r\n'
    reply: '250-AUTH PLAIN LOGIN CRAM-MD5\r\n'
    reply: '250-STARTTLS\r\n'
    reply: '250 HELP\r\n'
    reply: retcode (250); Msg: elasmtp-isp.net Hello localhost.local []
    SIZE 14680064
    PIPELINING
    AUTH PLAIN LOGIN CRAM-MD5
    STARTTLS
    HELP
    send: 'STARTTLS\r\n'
    reply: '220 TLS go ahead\r\n'
    reply: retcode (220); Msg: TLS go ahead
    send: 'ehlo localhost.local\r\n'
    reply: '250-elasmtp-isp.net Hello localhost.local []\r\n'
    reply: '250-SIZE 14680064\r\n'
    reply: '250-PIPELINING\r\n'
    reply: '250-AUTH PLAIN LOGIN CRAM-MD5\r\n'
    reply: '250 HELP\r\n'
    reply: retcode (250); Msg: elasmtp-isp.net Hello farnsworth.local [< your IP here>]
    SIZE 14680064
    PIPELINING
    AUTH PLAIN LOGIN CRAM-MD5
    HELP
    send: 'AUTH CRAM-MD5\r\n'
    reply: '334 PDExNjkyLjEyMjI2MTI1NzlAZWxhc210cC1tZWFseS5hdGwuc2EuZWFydGhsaW5rLm5ldD4=\r\n'
    reply: retcode (334); Msg: PDExNjkyLjEyMjI2MTI1NzlAZWxhc210cC1tZWFseS5hdGwuc2EuZWFydGhsaW5rLm5ldD4=
    send: 'ZGhlbGxtYW5uQGVhcnRobGluay5uZXQgN2Q1YjAyYTRmMGQ1YzZjM2NjOTNjZDc1MDQxN2ViYjg=\r\n'
    reply: '235 Authentication succeeded\r\n'
    reply: retcode (235); Msg: Authentication succeeded
    send: 'mail FROM: size=221\r\n'
    reply: '250 OK\r\n'
    reply: retcode (250); Msg: OK
    send: 'rcpt TO:\r\n'
    reply: '250 Accepted\r\n'
    reply: retcode (250); Msg: Accepted
    send: 'data\r\n'
    reply: '354 Enter message, ending with "." on a line by itself\r\n'
    reply: retcode (354); Msg: Enter message, ending with "." on a line by itself
    data: (354, 'Enter message, ending with "." on a line by itself')
    send: 'Content-Type: text/plain; charset="us-ascii"\r\nMIME-Version: 1.0\r\nContent-Transfer-Encoding: 7bit\r\nTo: Recipient \r\nFrom: Author \r\nSubject: Test from PyMOTW\r\n\r\nTest message from PyMOTW.\r\n.\r\n'
    reply: '250 OK id=1KjxNj-00032a-Ux\r\n'
    reply: retcode (250); Msg: OK id=1KjxNj-00032a-Ux
    data: (250, 'OK id=1KjxNj-00032a-Ux')
    send: 'quit\r\n'
    reply: '221 elasmtp-isp.net closing connection\r\n'
    reply: retcode (221); Msg: elasmtp-isp.net closing connection

13.1.3 Verifying an Email Address

The SMTP protocol includes a command to ask a server whether an address is valid. Usually, VRFY is disabled to prevent spammers from finding legitimate email addresses. But, if it is enabled, a client can ask the server about an address and receive a status code indicating validity, along with the user’s full name, if it is available.

    import smtplib

    server = smtplib.SMTP('mail')
    server.set_debuglevel(True) # show communication with the server
    try:
        dhellmann_result = server.verify('dhellmann')
        notthere_result = server.verify('notthere')
    finally:
        server.quit()

    print 'dhellmann:', dhellmann_result
    print 'notthere :', notthere_result

As the last two lines of output here show, the address dhellmann is valid but notthere is not.

    $ python smtplib_verify.py

    send: 'vrfy \r\n'
    reply: '250 2.1.5 Doug Hellmann \r\n'
    reply: retcode (250); Msg: 2.1.5 Doug Hellmann
    send: 'vrfy \r\n'
    reply: '550 5.1.1 ... User unknown\r\n'
    reply: retcode (550); Msg: 5.1.1 ... User unknown
    send: 'quit\r\n'
    reply: '221 2.0.0 mail.example.com closing connection\r\n'
    reply: retcode (221); Msg: 2.0.0 mail.example.com closing connection
    dhellmann: (250, '2.1.5 Doug Hellmann ')
    notthere : (550, '5.1.1 ... User unknown')

See Also:

smtplib (http://docs.python.org/lib/module-smtplib.html) The Standard library documentation for this module.
RFC 821 (http://tools.ietf.org/html/rfc821.html) The Simple Mail Transfer Protocol (SMTP) specification.
RFC 1869 (http://tools.ietf.org/html/rfc1869.html) SMTP Service Extensions to the base protocol.
RFC 822 (http://tools.ietf.org/html/rfc822.html) “Standard for the Format of ARPA Internet Text Messages,” the original email message format specification.
RFC 2822 (http://tools.ietf.org/html/rfc2822.html) “Internet Message Format” updates to the email message format.
email The Standard library module for parsing email messages.
smtpd (page 734) Implements a simple SMTP server.

13.2 smtpd—Sample Mail Servers

Purpose Includes classes for implementing SMTP servers.
Python Version 2.1 and later

The smtpd module includes classes for building Simple Mail Transfer Protocol (SMTP) servers. It is the server side of the protocol used by smtplib.

13.2.1 Mail Server Base Class

The base class for all the provided example servers is SMTPServer. It handles communicating with the client and receiving incoming data, and provides a convenient hook to override so the message can be processed once it is fully available.

The constructor arguments are the local address to listen on for connections and the remote address where proxied messages should be delivered. The method process_message() is provided as a hook to be overridden by a derived class. It is called when the message is completely received, and it is given these arguments.

peer The client’s address, a tuple containing IP and incoming port.
mailfrom The “from” information out of the message envelope, given to the server by the client when the message is delivered. This information does not necessarily match the From header in all cases.
rcpttos The list of recipients from the message envelope. Again, this list does not always match the To header, especially if a recipient is being blind carbon copied.
data The full RFC 2822 message body.

The default implementation of process_message() raises NotImplementedError. The next example defines a subclass that overrides the method to print information about the messages it receives.

    import smtpd
    import asyncore

    class CustomSMTPServer(smtpd.SMTPServer):

        def process_message(self, peer, mailfrom, rcpttos, data):
            print 'Receiving message from:', peer
            print 'Message addressed from:', mailfrom
            print 'Message addressed to :', rcpttos
            print 'Message length :', len(data)
            return

    server = CustomSMTPServer(('127.0.0.1', 1025), None)

    asyncore.loop()

SMTPServer uses asyncore, so to run the server, call asyncore.loop().

A client is needed to demonstrate the server. One of the examples from the section on smtplib can be adapted to create a client to send data to the test server running locally on port 1025.

    import smtplib
    import email.utils
    from email.mime.text import MIMEText

    # Create the message
    msg = MIMEText('This is the body of the message.')
    msg['To'] = email.utils.formataddr(('Recipient', 'recipient@example.com'))
    msg['From'] = email.utils.formataddr(('Author', 'author@example.com'))
    msg['Subject'] = 'Simple test message'

    server = smtplib.SMTP('127.0.0.1', 1025)
    server.set_debuglevel(True) # show communication with the server
    try:
        server.sendmail('author@example.com',
                        ['recipient@example.com'],
                        msg.as_string())
    finally:
        server.quit()

To test the programs, run smtpd_custom.py in one terminal and smtpd_senddata.py in another.

    $ python smtpd_custom.py

    Receiving message from: ('127.0.0.1', 58541)
    Message addressed from: author@example.com
    Message addressed to : ['recipient@example.com']
    Message length : 229

The debug output from smtpd_senddata.py shows all the communication with the server.
    $ python smtpd_senddata.py

    send: 'ehlo farnsworth.local\r\n'
    reply: '502 Error: command "EHLO" not implemented\r\n'
    reply: retcode (502); Msg: Error: command "EHLO" not implemented
    send: 'helo farnsworth.local\r\n'
    reply: '250 farnsworth.local\r\n'
    reply: retcode (250); Msg: farnsworth.local
    send: 'mail FROM:\r\n'
    reply: '250 Ok\r\n'
    reply: retcode (250); Msg: Ok
    send: 'rcpt TO:\r\n'
    reply: '250 Ok\r\n'
    reply: retcode (250); Msg: Ok
    send: 'data\r\n'
    reply: '354 End data with .\r\n'
    reply: retcode (354); Msg: End data with .
    data: (354, 'End data with .')
    send: 'Content-Type: text/plain; charset="us-ascii"\r\nMIME-Version: 1.0\r\nContent-Transfer-Encoding: 7bit\r\nTo: Recipient \r\nFrom: Author \r\nSubject: Simple test message\r\n\r\nThis is the body of the message.\r\n.\r\n'
    reply: '250 Ok\r\n'
    reply: retcode (250); Msg: Ok
    data: (250, 'Ok')
    send: 'quit\r\n'
    reply: '221 Bye\r\n'
    reply: retcode (221); Msg: Bye

To stop the server, press Ctrl-C.

13.2.2 Debugging Server

The previous example shows the arguments to process_message(), but smtpd also includes a server specifically designed for more complete debugging, called DebuggingServer. It prints the entire incoming message to the console and then stops processing (it does not proxy the message to a real mail server).

    import smtpd
    import asyncore

    server = smtpd.DebuggingServer(('127.0.0.1', 1025), None)

    asyncore.loop()

Using the smtpd_senddata.py client program from earlier, here is the output of the DebuggingServer.

    $ python smtpd_debug.py

    ---------- MESSAGE FOLLOWS ----------
    Content-Type: text/plain; charset="us-ascii"
    MIME-Version: 1.0
    Content-Transfer-Encoding: 7bit
    To: Recipient
    From: Author
    Subject: Simple test message
    X-Peer: 127.0.0.1

    This is the body of the message.
    ------------ END MESSAGE ------------

13.2.3 Proxy Server

The PureProxy class implements a straightforward proxy server.
Incoming messages are forwarded upstream to the server given as argument to the constructor.

Warning: The standard library documentation for smtpd says, “running this has a good chance to make you into an open relay, so please be careful.”

The steps for setting up the proxy server are similar to the debug server.

    import smtpd
    import asyncore

    server = smtpd.PureProxy(('127.0.0.1', 1025), ('mail', 25))

    asyncore.loop()

It prints no output, though, so to verify that it is working, look at the mail server logs.

    Oct 19 19:16:34 homer sendmail[6785]: m9JNGXJb006785: from=, size=248,
      class=0, nrcpts=1,
      msgid=<200810192316.m9JNGXJb006785@homer.example.com>,
      proto=ESMTP, daemon=MTA, relay=[192.168.1.17]

See Also:

smtpd (http://docs.python.org/lib/module-smtpd.html) The Standard library documentation for this module.
smtplib (page 727) Provides a client interface.
email Parses email messages.
asyncore (page 619) Base module for writing asynchronous servers.
RFC 2822 (http://tools.ietf.org/html/rfc2822.html) Defines the email message format.

13.3 imaplib—IMAP4 Client Library

Purpose Client library for IMAP4 communication.
Python Version 1.5.2 and later

imaplib implements a client for communicating with Internet Message Access Protocol (IMAP) version 4 servers. The IMAP protocol defines a set of commands sent to the server and the responses delivered back to the client. Most of the commands are available as methods of the IMAP4 object used to communicate with the server.

These examples discuss part of the IMAP protocol, but they are by no means complete. Refer to RFC 3501 for complete details.

13.3.1 Variations

Three client classes are available for communicating with servers using various mechanisms. The first, IMAP4, uses clear text sockets; IMAP4_SSL uses encrypted communication over SSL sockets; and IMAP4_stream uses the standard input and standard output of an external command.
All the examples here will use IMAP4_SSL, but the APIs for the other classes are similar.

13.3.2 Connecting to a Server

There are two steps for establishing a connection with an IMAP server. First, set up the socket connection itself. Second, authenticate as a user with an account on the server. The following example code will read server and user information from a configuration file.

    import imaplib
    import ConfigParser
    import os

    def open_connection(verbose=False):
        # Read the config file
        config = ConfigParser.ConfigParser()
        config.read([os.path.expanduser('~/.pymotw')])

        # Connect to the server
        hostname = config.get('server', 'hostname')
        if verbose: print 'Connecting to', hostname
        connection = imaplib.IMAP4_SSL(hostname)

        # Login to our account
        username = config.get('account', 'username')
        password = config.get('account', 'password')
        if verbose: print 'Logging in as', username
        connection.login(username, password)
        return connection

    if __name__ == '__main__':
        c = open_connection(verbose=True)
        try:
            print c
        finally:
            c.logout()

When run, open_connection() reads the configuration information from a file in the user’s home directory, and then opens the IMAP4_SSL connection and authenticates.

    $ python imaplib_connect.py

    Connecting to mail.example.com
    Logging in as example

The other examples in this section reuse this module, to avoid duplicating the code.

Authentication Failure

If the connection is established but authentication fails, an exception is raised.
    import imaplib
    import ConfigParser
    import os

    # Read the config file
    config = ConfigParser.ConfigParser()
    config.read([os.path.expanduser('~/.pymotw')])

    # Connect to the server
    hostname = config.get('server', 'hostname')
    print 'Connecting to', hostname
    connection = imaplib.IMAP4_SSL(hostname)

    # Login to our account
    username = config.get('account', 'username')
    password = 'this_is_the_wrong_password'
    print 'Logging in as', username
    try:
        connection.login(username, password)
    except Exception as err:
        print 'ERROR:', err

This example uses the wrong password on purpose to trigger the exception.

    $ python imaplib_connect_fail.py

    Connecting to mail.example.com
    Logging in as example
    ERROR: Authentication failed.

13.3.3 Example Configuration

The example account has three mailboxes: INBOX, Archive, and 2008 (a subfolder of Archive). This is the mailbox hierarchy:

• INBOX
• Archive
  – 2008

There is one unread message in the INBOX folder and one read message in Archive/2008.

13.3.4 Listing Mailboxes

To retrieve the mailboxes available for an account, use the list() method.

    import imaplib
    from pprint import pprint
    from imaplib_connect import open_connection

    c = open_connection()
    try:
        typ, data = c.list()
        print 'Response code:', typ
        print 'Response:'
        pprint(data)
    finally:
        c.logout()

The return value is a tuple containing a response code and the data returned by the server. The response code is OK, unless an error has occurred. The data for list() is a sequence of strings containing flags, the hierarchy delimiter, and the mailbox name for each mailbox.

    $ python imaplib_list.py

    Response code: OK
    Response:
    ['(\\HasNoChildren) "." INBOX',
     '(\\HasChildren) "." "Archive"',
     '(\\HasNoChildren) "." "Archive.2008"']

Each response string can be split into three parts using re or csv (see IMAP Backup Script in the references at the end of this section for an example using csv).
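As one possible take on the csv route mentioned above, the sketch below splits a LIST response line by first peeling off the parenthesized flags and then letting csv handle the two remaining space-separated, optionally quoted fields. It is written for Python 3, and the function name parse_list_response_csv is hypothetical. It assumes the line has already been decoded to str (Python 3's imaplib returns bytes) and does not try to handle backslash escapes inside quoted names or IMAP literals.

```python
import csv


def parse_list_response_csv(line):
    """Split one LIST response line into (flags, delimiter, mailbox_name).

    Hypothetical helper; assumes ``line`` is a str such as
    '(\\HasNoChildren) "." "Archive.2008"'.
    """
    # The flags are the parenthesized group at the start of the line.
    flags, _, rest = line.partition(') ')
    flags = flags.lstrip('(')
    # What remains is two space-separated fields, each optionally quoted.
    reader = csv.reader([rest], delimiter=' ', quotechar='"')
    delimiter, mailbox_name = next(reader)
    return flags, delimiter, mailbox_name


for sample in ['(\\HasNoChildren) "." INBOX',
               '(\\HasChildren) "." "Archive"',
               '(\\HasNoChildren) "." "Archive.2008"']:
    print(parse_list_response_csv(sample))
```

Because csv removes the surrounding quotes itself, no separate strip() call is needed for quoted mailbox names.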
    import imaplib
    import re

    from imaplib_connect import open_connection

    list_response_pattern = re.compile(
        r'\((?P<flags>.*?)\) "(?P<delimiter>.*)" (?P<name>.*)'
        )

    def parse_list_response(line):
        match = list_response_pattern.match(line)
        flags, delimiter, mailbox_name = match.groups()
        mailbox_name = mailbox_name.strip('"')
        return (flags, delimiter, mailbox_name)

    if __name__ == '__main__':
        c = open_connection()
        try:
            typ, data = c.list()
        finally:
            c.logout()
        print 'Response code:', typ

        for line in data:
            print 'Server response:', line
            flags, delimiter, mailbox_name = parse_list_response(line)
            print 'Parsed response:', (flags, delimiter, mailbox_name)

The server quotes the mailbox name if it includes spaces, but those quotes need to be stripped out to use the mailbox name in other calls back to the server later.

    $ python imaplib_list_parse.py

    Response code: OK
    Server response: (\HasNoChildren) "." INBOX
    Parsed response: ('\\HasNoChildren', '.', 'INBOX')
    Server response: (\HasChildren) "." "Archive"
    Parsed response: ('\\HasChildren', '.', 'Archive')
    Server response: (\HasNoChildren) "." "Archive.2008"
    Parsed response: ('\\HasNoChildren', '.', 'Archive.2008')

list() takes arguments to specify mailboxes in part of the hierarchy. For example, to list subfolders of Archive, pass "Archive" as the directory argument.

    import imaplib

    from imaplib_connect import open_connection

    if __name__ == '__main__':
        c = open_connection()
        try:
            typ, data = c.list(directory='Archive')
        finally:
            c.logout()
        print 'Response code:', typ

        for line in data:
            print 'Server response:', line

Only the single subfolder is returned.

    $ python imaplib_list_subfolders.py

    Response code: OK
    Server response: (\HasNoChildren) "." "Archive.2008"

Alternately, to list folders matching a pattern, pass the pattern argument.
    import imaplib

    from imaplib_connect import open_connection

    if __name__ == '__main__':
        c = open_connection()
        try:
            typ, data = c.list(pattern='*Archive*')
        finally:
            c.logout()
        print 'Response code:', typ

        for line in data:
            print 'Server response:', line

In this case, both Archive and Archive.2008 are included in the response.

    $ python imaplib_list_pattern.py

    Response code: OK
    Server response: (\HasChildren) "." "Archive"
    Server response: (\HasNoChildren) "." "Archive.2008"

13.3.5 Mailbox Status

Use status() to ask for aggregated information about the contents. Table 13.1 lists the status conditions defined by the standard.

Table 13.1. IMAP 4 Mailbox Status Conditions

    Condition     Meaning
    MESSAGES      The number of messages in the mailbox
    RECENT        The number of messages with the \Recent flag set
    UIDNEXT       The next unique identifier value of the mailbox
    UIDVALIDITY   The unique identifier validity value of the mailbox
    UNSEEN        The number of messages that do not have the \Seen flag set

The status conditions must be formatted as a space-separated string enclosed in parentheses, the encoding for a “list” in the IMAP4 specification.

    import imaplib
    import re

    from imaplib_connect import open_connection
    from imaplib_list_parse import parse_list_response

    if __name__ == '__main__':
        c = open_connection()
        try:
            typ, data = c.list()
            for line in data:
                flags, delimiter, mailbox = parse_list_response(line)
                print c.status(
                    mailbox,
                    '(MESSAGES RECENT UIDNEXT UIDVALIDITY UNSEEN)')
        finally:
            c.logout()

The return value is the usual tuple containing a response code and a list of information from the server. In this case, the list contains a single string formatted with the name of the mailbox in quotes, and then the status conditions and values in parentheses.
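A response string in the shape just described can be turned into a dictionary with a few lines of string handling. The following is a minimal Python 3 sketch, not part of imaplib: the helper name parse_status_response is hypothetical, the sample string is modeled on the INBOX line in the transcript for this script, and the input is assumed to be a decoded str whose values are all integers.

```python
def parse_status_response(line):
    """Return (mailbox_name, {condition: value}) for one STATUS response.

    Hypothetical helper; expects text such as
    '"INBOX" (MESSAGES 1 RECENT 0 UNSEEN 1)'.
    """
    # Everything after the last ' (' is the list of conditions and values.
    name, sep, items = line.rpartition(' (')
    if not sep or not items.endswith(')'):
        raise ValueError('unexpected STATUS response: %r' % line)
    tokens = items[:-1].split()
    # Tokens alternate between a condition name and its integer value.
    values = {key: int(value)
              for key, value in zip(tokens[0::2], tokens[1::2])}
    return name.strip('"'), values


sample = ('"INBOX" (MESSAGES 1 RECENT 0 UIDNEXT 3 '
          'UIDVALIDITY 1222003700 UNSEEN 1)')
mailbox, status = parse_status_response(sample)
print(mailbox, status['MESSAGES'], status['UNSEEN'])
```

Splitting on the last " (" rather than the first space keeps mailbox names that contain spaces intact.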
    $ python imaplib_status.py

    ('OK', ['"INBOX" (MESSAGES 1 RECENT 0 UIDNEXT 3 UIDVALIDITY 1222003700 UNSEEN 1)'])
    ('OK', ['"Archive" (MESSAGES 0 RECENT 0 UIDNEXT 1 UIDVALIDITY 1222003809 UNSEEN 0)'])
    ('OK', ['"Archive.2008" (MESSAGES 1 RECENT 0 UIDNEXT 2 UIDVALIDITY 1222003831 UNSEEN 0)'])

13.3.6 Selecting a Mailbox

The basic mode of operation, once the client is authenticated, is to select a mailbox and then interrogate the server regarding the messages in the mailbox. The connection is stateful, so after a mailbox is selected, all commands operate on messages in that mailbox until a new mailbox is selected.

    import imaplib
    import imaplib_connect

    c = imaplib_connect.open_connection()
    try:
        typ, data = c.select('INBOX')
        print typ, data
        num_msgs = int(data[0])
        print 'There are %d messages in INBOX' % num_msgs
    finally:
        c.close()
        c.logout()

The response data contains the total number of messages in the mailbox.

    $ python imaplib_select.py

    OK ['1']
    There are 1 messages in INBOX

If an invalid mailbox is specified, the response code is NO.

    import imaplib
    import imaplib_connect

    c = imaplib_connect.open_connection()
    try:
        typ, data = c.select('Does Not Exist')
        print typ, data
    finally:
        c.logout()

The data contains an error message describing the problem.

    $ python imaplib_select_invalid.py

    NO ["Mailbox doesn't exist: Does Not Exist"]

13.3.7 Searching for Messages

After selecting the mailbox, use search() to retrieve the IDs of messages in the mailbox.

    import imaplib
    import imaplib_connect
    from imaplib_list_parse import parse_list_response

    c = imaplib_connect.open_connection()
    try:
        typ, mailbox_data = c.list()
        for line in mailbox_data:
            flags, delimiter, mailbox_name = parse_list_response(line)
            c.select(mailbox_name, readonly=True)
            typ, msg_ids = c.search(None, 'ALL')
            print mailbox_name, typ, msg_ids
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

Message ids are assigned by the server and are implementation dependent.
The IMAP4 protocol makes a distinction between sequential ids for messages at a given point in time during a transaction and UID identifiers for messages, but not all servers implement both.

    $ python imaplib_search_all.py

    INBOX OK ['1']
    Archive OK ['']
    Archive.2008 OK ['1']

In this case, INBOX and Archive.2008 each have a different message with id 1. The other mailboxes are empty.

13.3.8 Search Criteria

A variety of other search criteria can be used, including looking at dates for the message, flags, and other headers. Refer to section 6.4.4 of RFC 3501 for complete details.

To look for messages with 'test message 2' in the subject, the search criteria should be constructed as follows.

    (SUBJECT "test message 2")

This example finds all messages with the title “test message 2” in all mailboxes.

    import imaplib
    import imaplib_connect
    from imaplib_list_parse import parse_list_response

    c = imaplib_connect.open_connection()
    try:
        typ, mailbox_data = c.list()
        for line in mailbox_data:
            flags, delimiter, mailbox_name = parse_list_response(line)
            c.select(mailbox_name, readonly=True)
            typ, msg_ids = c.search(None, '(SUBJECT "test message 2")')
            print mailbox_name, typ, msg_ids
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

There is only one such message in the account, and it is in the INBOX.

    $ python imaplib_search_subject.py

    INBOX OK ['1']
    Archive OK ['']
    Archive.2008 OK ['']

Search criteria can also be combined.

    import imaplib
    import imaplib_connect
    from imaplib_list_parse import parse_list_response

    c = imaplib_connect.open_connection()
    try:
        typ, mailbox_data = c.list()
        for line in mailbox_data:
            flags, delimiter, mailbox_name = parse_list_response(line)
            c.select(mailbox_name, readonly=True)
            typ, msg_ids = c.search(
                None,
                '(FROM "Doug" SUBJECT "test message 2")')
            print mailbox_name, typ, msg_ids
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

The criteria are combined with a logical and operation.
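Since combined criteria are simply listed one after another inside the parentheses, the criteria string can be assembled from key/value pairs. The sketch below is a Python 3 illustration only: build_search_criteria is a hypothetical helper, not an imaplib function, and it assumes every value is a plain string to be sent as an IMAP quoted string, with backslashes and double quotes escaped.

```python
def build_search_criteria(pairs):
    """Build a criteria string such as '(FROM "Doug" SUBJECT "hello")'.

    Hypothetical helper; ``pairs`` is a sequence of (key, value) tuples,
    all of which must match for a message to be returned.
    """
    parts = []
    for key, value in pairs:
        # Escape the two characters that are special inside quoted strings.
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        parts.append('%s "%s"' % (key.upper(), escaped))
    return '(%s)' % ' '.join(parts)


criteria = build_search_criteria([('FROM', 'Doug'),
                                  ('SUBJECT', 'test message 2')])
print(criteria)
```

The resulting string would be passed as the second argument to search(), in the same position as the literal criteria strings in the listings above.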
    $ python imaplib_search_from.py

    INBOX OK ['1']
    Archive OK ['']
    Archive.2008 OK ['']

13.3.9 Fetching Messages

The identifiers returned by search() are used to retrieve the contents, or partial contents, of messages for further processing using the fetch() method. It takes two arguments: the message to fetch and the portion(s) of the message to retrieve.

The message_ids argument is a comma-separated list of ids (e.g., "1", "1,2") or id ranges (e.g., 1:2). The message_parts argument is an IMAP list of message segment names. As with search criteria for search(), the IMAP protocol specifies named message segments so clients can efficiently retrieve only the parts of the message they actually need. For example, to retrieve the headers of the messages in a mailbox, use fetch() with the argument BODY.PEEK[HEADER].

Note: Another way to fetch the headers is BODY[HEADER], but that form has a side effect of implicitly marking the message as read, which is undesirable in many cases.

    import imaplib
    import pprint
    import imaplib_connect

    imaplib.Debug = 4
    c = imaplib_connect.open_connection()
    try:
        c.select('INBOX', readonly=True)
        typ, msg_data = c.fetch('1', '(BODY.PEEK[HEADER] FLAGS)')
        pprint.pprint(msg_data)
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

The return value of fetch() has been partially parsed so it is somewhat harder to work with than the return value of list(). Turning on debugging shows the complete interaction between the client and the server to understand why this is so.

    $ python imaplib_fetch_raw.py

    13:12.54 imaplib version 2.58
    13:12.54 new IMAP4 connection, tag=CFKH
    13:12.54 < * OK dovecot ready.
    13:12.54 > CFKH0 CAPABILITY
    13:12.54 < * CAPABILITY IMAP4rev1 SORT THREAD=REFERENCES MULTIAPPEND UNSELECT IDLE CHILDREN LISTEXT LIST-SUBSCRIBED NAMESPACE AUTH=PLAIN
    13:12.54 < CFKH0 OK Capability completed.
    13:12.54 CAPABILITIES: ('IMAP4REV1', 'SORT', 'THREAD=REFERENCES', 'MULTIAPPEND', 'UNSELECT', 'IDLE', 'CHILDREN', 'LISTEXT', 'LIST-SUBSCRIBED', 'NAMESPACE', 'AUTH=PLAIN')
    13:12.54 > CFKH1 LOGIN example "password"
    13:13.18 < CFKH1 OK Logged in.
    13:13.18 > CFKH2 EXAMINE INBOX
    13:13.20 < * FLAGS (\Answered \Flagged \Deleted \Seen \Draft $NotJunk $Junk)
    13:13.20 < * OK [PERMANENTFLAGS ()] Read-only mailbox.
    13:13.20 < * 2 EXISTS
    13:13.20 < * 1 RECENT
    13:13.20 < * OK [UNSEEN 1] First unseen.
    13:13.20 < * OK [UIDVALIDITY 1222003700] UIDs valid
    13:13.20 < * OK [UIDNEXT 4] Predicted next UID
    13:13.20 < CFKH2 OK [READ-ONLY] Select completed.
    13:13.20 > CFKH3 FETCH 1 (BODY.PEEK[HEADER] FLAGS)
    13:13.20 < * 1 FETCH (FLAGS ($NotJunk) BODY[HEADER] {595}
    13:13.20 read literal size 595
    13:13.20 < )
    13:13.20 < CFKH3 OK Fetch completed.
    13:13.20 > CFKH4 CLOSE
    13:13.21 < CFKH4 OK Close completed.
    13:13.21 > CFKH5 LOGOUT
    13:13.21 < * BYE Logging out
    13:13.21 BYE response: Logging out
    13:13.21 < CFKH5 OK Logout completed.
    [('1 (FLAGS ($NotJunk) BODY[HEADER] {595}',
      'Return-Path: \r\nReceived: from example.com (localhost [127.0.0.1])\r\n\tby example.com (8.13.4/8.13.4) with ESMTP id m8LDTGW4018260\r\n\tfor ; Sun, 21 Sep 2008 09:29:16 -0400\r\nReceived: (from dhellmann@localhost)\r\n\tby example.com (8.13.4/8.13.4/Submit) id m8LDTGZ5018259\r\n\tfor example@example.com; Sun, 21 Sep 2008 09:29:16 -0400\r\nDate: Sun, 21 Sep 2008 09:29:16 -0400\r\nFrom: Doug Hellmann \r\nMessage-Id: <200809211329.m8LDTGZ5018259@example.com>\r\nTo: example@example.com\r\nSubject: test message 2\r\n\r\n'),
     ')']

The response from the FETCH command starts with the flags, and then it indicates that there are 595 bytes of header data. The client constructs a tuple with the response for the message, and then closes the sequence with a single string containing the right parenthesis (“)”) the server sends at the end of the fetch response.
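To make that structure concrete, here is a small Python 3 sketch that walks over a list shaped like the one fetch() returned above, separating the (prefix, literal) tuples from the plain strings. The sample data is a made-up, shortened stand-in for the real response, and split_fetch_response is a hypothetical helper rather than part of imaplib. With a live Python 3 connection the parts would be bytes instead of str.

```python
import re

# Made-up, shortened stand-in for the msg_data list shown above.
HEADER_TEXT = 'Subject: test message 2\r\n\r\n'
SAMPLE_MSG_DATA = [
    ('1 (FLAGS ($NotJunk) BODY[HEADER] {%d}' % len(HEADER_TEXT),
     HEADER_TEXT),
    ')',
]

# The prefix of each tuple ends with {N}, the announced literal size.
LITERAL_SIZE = re.compile(r'\{(\d+)\}$')


def split_fetch_response(msg_data):
    """Separate (prefix, size, literal) entries from plain string parts."""
    literals = []
    plain = []
    for part in msg_data:
        if isinstance(part, tuple):
            prefix, payload = part
            size = int(LITERAL_SIZE.search(prefix).group(1))
            literals.append((prefix, size, payload))
        else:
            plain.append(part)
    return literals, plain


literals, plain = split_fetch_response(SAMPLE_MSG_DATA)
for prefix, size, payload in literals:
    print(prefix)
    print('announced size:', size, 'actual size:', len(payload))
print('closing parts:', plain)
```

The isinstance() check against tuple is the same test the listings in this section use to pick out the parts that carry message data.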
Because of this formatting, it may be easier to fetch different pieces of information separately or to recombine the response and parse it in the client.

    import imaplib
    import pprint
    import imaplib_connect

    c = imaplib_connect.open_connection()
    try:
        c.select('INBOX', readonly=True)

        print 'HEADER:'
        typ, msg_data = c.fetch('1', '(BODY.PEEK[HEADER])')
        for response_part in msg_data:
            if isinstance(response_part, tuple):
                print response_part[1]

        print 'BODY TEXT:'
        typ, msg_data = c.fetch('1', '(BODY.PEEK[TEXT])')
        for response_part in msg_data:
            if isinstance(response_part, tuple):
                print response_part[1]

        print '\nFLAGS:'
        typ, msg_data = c.fetch('1', '(FLAGS)')
        for response_part in msg_data:
            print response_part
            print imaplib.ParseFlags(response_part)
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

Fetching values separately has the added benefit of making it easy to use ParseFlags() to parse the flags from the response.

    $ python imaplib_fetch_separately.py

    HEADER:
    Return-Path:
    Received: from example.com (localhost [127.0.0.1])
        by example.com (8.13.4/8.13.4) with ESMTP id m8LDTGW4018260
        for ; Sun, 21 Sep 2008 09:29:16 -0400
    Received: (from dhellmann@localhost)
        by example.com (8.13.4/8.13.4/Submit) id m8LDTGZ5018259
        for example@example.com; Sun, 21 Sep 2008 09:29:16 -0400
    Date: Sun, 21 Sep 2008 09:29:16 -0400
    From: Doug Hellmann
    Message-Id: <200809211329.m8LDTGZ5018259@example.com>
    To: example@example.com
    Subject: test message 2

    BODY TEXT:
    second message

    FLAGS:
    1 (FLAGS ($NotJunk))
    ('$NotJunk',)

13.3.10 Whole Messages

As illustrated earlier, the client can ask the server for individual parts of the message separately. It is also possible to retrieve the entire message as an RFC 2822 formatted mail message and parse it with classes from the email module.

    import imaplib
    import email
    import imaplib_connect

    c = imaplib_connect.open_connection()
    try:
        c.select('INBOX', readonly=True)

        typ, msg_data = c.fetch('1', '(RFC822)')
        for response_part in msg_data:
            if isinstance(response_part, tuple):
                msg = email.message_from_string(response_part[1])
                for header in [ 'subject', 'to', 'from' ]:
                    print '%-8s: %s' % (header.upper(), msg[header])
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

The parser in the email module makes it very easy to access and manipulate messages. This example prints just a few of the headers for each message.

    $ python imaplib_fetch_rfc822.py

    SUBJECT : test message 2
    TO : example@example.com
    FROM : Doug Hellmann

13.3.11 Uploading Messages

To add a new message to a mailbox, construct a Message instance and pass it to the append() method, along with the timestamp for the message.

    import imaplib
    import time
    import email.message
    import imaplib_connect

    new_message = email.message.Message()
    new_message.set_unixfrom('pymotw')
    new_message['Subject'] = 'subject goes here'
    new_message['From'] = 'pymotw@example.com'
    new_message['To'] = 'example@example.com'
    new_message.set_payload('This is the body of the message.\n')

    print new_message

    c = imaplib_connect.open_connection()
    try:
        c.append('INBOX', '',
                 imaplib.Time2Internaldate(time.time()),
                 str(new_message))

        # Show the headers for all messages in the mailbox
        c.select('INBOX')
        typ, [msg_ids] = c.search(None, 'ALL')
        for num in msg_ids.split():
            typ, msg_data = c.fetch(num, '(BODY.PEEK[HEADER])')
            for response_part in msg_data:
                if isinstance(response_part, tuple):
                    print '\n%s:' % num
                    print response_part[1]
    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

The payload used in this example is a simple plain-text email body. Message also supports MIME-encoded, multipart messages.

    pymotw
    Subject: subject goes here
    From: pymotw@example.com
    To: example@example.com

    This is the body of the message.
    1:
    Return-Path:
    Received: from example.com (localhost [127.0.0.1])
        by example.com (8.13.4/8.13.4) with ESMTP id m8LDTGW4018260
        for ; Sun, 21 Sep 2008 09:29:16 -0400
    Received: (from dhellmann@localhost)
        by example.com (8.13.4/8.13.4/Submit) id m8LDTGZ5018259
        for example@example.com; Sun, 21 Sep 2008 09:29:16 -0400
    Date: Sun, 21 Sep 2008 09:29:16 -0400
    From: Doug Hellmann
    Message-Id: <200809211329.m8LDTGZ5018259@example.com>
    To: example@example.com
    Subject: test message 2

    2:
    Return-Path:
    Message-Id: <0D9C3C50-462A-4FD7-9E5A-11EE222D721D@example.com>
    From: Doug Hellmann
    To: example@example.com
    Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
    Content-Transfer-Encoding: 7bit
    Mime-Version: 1.0 (Apple Message framework v929.2)
    Subject: lorem ipsum
    Date: Sun, 21 Sep 2008 12:53:16 -0400
    X-Mailer: Apple Mail (2.929.2)

    3:
    pymotw
    Subject: subject goes here
    From: pymotw@example.com
    To: example@example.com

13.3.12 Moving and Copying Messages

Once a message is on the server, it can be copied to another mailbox without downloading it using copy(). (imaplib does not provide a move() method; moving a message amounts to copying it and then deleting the original.) copy() operates on message id ranges, just as fetch() does.

    import imaplib
    import imaplib_connect

    c = imaplib_connect.open_connection()
    try:
        # Find the "SEEN" messages in INBOX
        c.select('INBOX')
        typ, [response] = c.search(None, 'SEEN')
        if typ != 'OK':
            raise RuntimeError(response)

        # Create a new mailbox, "Archive.Today"
        msg_ids = ','.join(response.split(' '))
        typ, create_response = c.create('Archive.Today')
        print 'CREATED Archive.Today:', create_response

        # Copy the messages
        print 'COPYING:', msg_ids
        c.copy(msg_ids, 'Archive.Today')

        # Look at the results
        c.select('Archive.Today')
        typ, [response] = c.search(None, 'ALL')
        print 'COPIED:', response

    finally:
        c.close()
        c.logout()

This example script creates a new mailbox under Archive and copies the read messages from INBOX into it.
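The step that turns a space-separated search response into a message set can also be written as a small function. This Python 3 sketch goes one step further than the listing and collapses consecutive ids into IMAP ranges such as 1:3. The name to_message_set is hypothetical, and the input is assumed to be a decoded str of space-separated integer ids.

```python
def to_message_set(response):
    """Turn '1 2 3 5' into the message set '1:3,5'.

    Hypothetical helper; ``response`` is the space-separated id string
    returned by search(), decoded to str.
    """
    ids = sorted({int(token) for token in response.split()})
    ranges = []
    for msg_id in ids:
        if ranges and msg_id == ranges[-1][1] + 1:
            # Extend the current run of consecutive ids.
            ranges[-1][1] = msg_id
        else:
            # Start a new run containing just this id.
            ranges.append([msg_id, msg_id])
    return ','.join(
        str(first) if first == last else '%d:%d' % (first, last)
        for first, last in ranges
    )


print(to_message_set('1 2'))
print(to_message_set('1 2 3 5'))
```

For the two read messages in the example account this produces 1:2, which names the same messages as the 1,2 form the script builds.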
    $ python imaplib_archive_read.py

    CREATED Archive.Today: ['Create completed.']
    COPYING: 1,2
    COPIED: 1 2

Running the same script again shows the importance of checking return codes. Instead of raising an exception, the call to create() to make the new mailbox reports that the mailbox already exists.

    $ python imaplib_archive_read.py

    CREATED Archive.Today: ['Mailbox exists.']
    COPYING: 1,2
    COPIED: 1 2 3 4

13.3.13 Deleting Messages

Although many modern mail clients use a “Trash folder” model for working with deleted messages, the messages are not usually moved into an actual folder. Instead, their flags are updated to add \Deleted. The operation for “emptying” the trash is implemented through the EXPUNGE command. This example script finds the archived messages with “Lorem ipsum” in the subject, sets the deleted flag, and then shows that the messages are still present in the folder by querying the server again.

    import imaplib
    import imaplib_connect
    from imaplib_list_parse import parse_list_response

    c = imaplib_connect.open_connection()
    try:
        c.select('Archive.Today')

        # What ids are in the mailbox?
        typ, [msg_ids] = c.search(None, 'ALL')
        print 'Starting messages:', msg_ids

        # Find the message(s)
        typ, [msg_ids] = c.search(None, '(SUBJECT "Lorem ipsum")')
        msg_ids = ','.join(msg_ids.split(' '))
        print 'Matching messages:', msg_ids

        # What are the current flags?
        typ, response = c.fetch(msg_ids, '(FLAGS)')
        print 'Flags before:', response

        # Change the Deleted flag
        typ, response = c.store(msg_ids, '+FLAGS', r'(\Deleted)')

        # What are the flags now?
        typ, response = c.fetch(msg_ids, '(FLAGS)')
        print 'Flags after:', response

        # Really delete the message.
        typ, response = c.expunge()
        print 'Expunged:', response

        # What ids are left in the mailbox?
        typ, [msg_ids] = c.search(None, 'ALL')
        print 'Remaining messages:', msg_ids

    finally:
        try:
            c.close()
        except:
            pass
        c.logout()

Explicitly calling expunge() removes the messages, but calling close() has the same effect. The difference is the client is not notified about the deletions when close() is called.

    $ python imaplib_delete_messages.py

    Starting messages: 1 2 3 4
    Matching messages: 1,3
    Flags before: ['1 (FLAGS (\\Seen $NotJunk))', '3 (FLAGS (\\Seen \\Recent $NotJunk))']
    Flags after: ['1 (FLAGS (\\Deleted \\Seen $NotJunk))', '3 (FLAGS (\\Deleted \\Seen \\Recent $NotJunk))']
    Expunged: ['1', '2']
    Remaining messages: 1 2

See Also:

imaplib (http://docs.python.org/library/imaplib.html) The standard library documentation for this module.
What is IMAP? (www.imap.org/about/whatisIMAP.html) imap.org description of the IMAP protocol.
University of Washington IMAP Information Center (http://www.washington.edu/imap/) Good resource for IMAP information, along with source code.
RFC 3501 (http://tools.ietf.org/html/rfc3501.html) Internet Message Access Protocol.
RFC 2822 (http://tools.ietf.org/html/rfc2822.html) Internet Message Format.
IMAP Backup Script (http://snipplr.com/view/7955/imap-backup-script/) A script to backup email from an IMAP server.
rfc822 The rfc822 module includes an RFC 822 / RFC 2822 parser.
email The email module for parsing email messages.
mailbox (page 758) Local mailbox parser.
ConfigParser (page 861) Read and write configuration files.
IMAPClient (http://imapclient.freshfoo.com/) A higher-level client for talking to IMAP servers, written by Menno Smits.

13.4 mailbox—Manipulate Email Archives

Purpose Work with email messages in various local file formats.
Python Version 1.4 and later

The mailbox module defines a common API for accessing email messages stored in local disk formats, including:

• Maildir
• mbox
• MH
• Babyl
• MMDF

There are base classes for Mailbox and Message, and each mailbox format includes a corresponding pair of subclasses to implement the details for that format.

13.4.1 mbox

The mbox format is the simplest to show in documentation, since it is entirely plain text. Each mailbox is stored as a single file, with all the messages concatenated together. Each time a line starting with "From " (“From” followed by a single space) is encountered it is treated as the beginning of a new message. Any time those characters appear at the beginning of a line in the message body, they are escaped by prefixing the line with ">".

Creating an mbox Mailbox

Instantiate the mbox class by passing the filename to the constructor. If the file does not exist, it is created when add() is used to append messages.

    import mailbox
    import email.utils

    from_addr = email.utils.formataddr(('Author', 'author@example.com'))
    to_addr = email.utils.formataddr(('Recipient', 'recipient@example.com'))

    mbox = mailbox.mbox('example.mbox')
    mbox.lock()
    try:
        msg = mailbox.mboxMessage()
        msg.set_unixfrom('author Sat Feb 7 01:05:34 2009')
        msg['From'] = from_addr
        msg['To'] = to_addr
        msg['Subject'] = 'Sample message 1'
        msg.set_payload('\n'.join(['This is the body.',
                                   'From (should be escaped).',
                                   'There are 3 lines.\n',
                                   ]))
        mbox.add(msg)
        mbox.flush()

        msg = mailbox.mboxMessage()
        msg.set_unixfrom('author')
        msg['From'] = from_addr
        msg['To'] = to_addr
        msg['Subject'] = 'Sample message 2'
        msg.set_payload('This is the second body.\n')
        mbox.add(msg)
        mbox.flush()
    finally:
        mbox.unlock()

    print open('example.mbox', 'r').read()

The result of this script is a new mailbox file with two email messages.

    $ python mailbox_mbox_create.py

    From MAILER-DAEMON Mon Nov 29 02:00:11 2010
    From: Author
    To: Recipient
    Subject: Sample message 1

    This is the body.
>From (should be escaped). There are 3 lines. From MAILER-DAEMON Mon Nov 29 02:00:11 2010 From: Author To: Recipient Subject: Sample message 2 This is the second body. Reading an mbox Mailbox To read an existing mailbox, open it and treat the mbox object like a dictionary. The keys are arbitrary values defined by the mailbox instance and are not necessary meaningful other than as internal identifiers for message objects. 13.4. mailbox—Manipulate Email Archives 761 import mailbox mbox = mailbox.mbox(’example.mbox’) for message in mbox: print message[’subject’] The open mailbox supports the iterator protocol, but unlike true dictionary objects, the default iterator for a mailbox works on the values instead of the keys. $ python mailbox_mbox_read.py Sample message 1 Sample message 2 Removing Messages from an mbox Mailbox To remove an existing message from an mbox file, either use its key with remove() or use del. import mailbox mbox = mailbox.mbox(’example.mbox’) mbox.lock() try: to_remove = [] for key, msg in mbox.iteritems(): if ’2’ in msg[’subject’]: print ’Removing:’, key to_remove.append(key) for key in to_remove: mbox.remove(key) finally: mbox.flush() mbox.close() print open(’example.mbox’, ’r’).read() The lock() and unlock() methods are used to prevent issues from simultaneous access to the file, and flush() forces the changes to be written to disk. $ python mailbox_mbox_remove.py 762 Email Removing: 1 From MAILER-DAEMON Mon Nov 29 02:00:11 2010 From: Author To: Recipient Subject: Sample message 1 This is the body. >From (should be escaped). There are 3 lines. 13.4.2 Maildir The Maildir format was created to eliminate the problem of concurrent modification to an mbox file. Instead of using a single file, the mailbox is organized as a directory where each message is contained in its own file. This also allows mailboxes to be nested, so the API for a Maildir mailbox is extended with methods to work with subfolders. 
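The difference in on-disk layout can be checked directly. The following is a minimal sketch, not one of the book's listings: it stores the same two messages in an mbox file and in a Maildir directory, then counts the files that result. It is written for Python 3 (unlike the Python 2 listings in this section), and the helper names, paths, and header values are all made up for illustration.

```python
import mailbox
import os
import tempfile


def make_message(subject):
    """Build a minimal message; the header values are placeholders."""
    msg = mailbox.Message()
    msg['From'] = 'author@example.com'
    msg['To'] = 'recipient@example.com'
    msg['Subject'] = subject
    msg.set_payload('Body of %s\n' % subject)
    return msg


def compare_layouts(workdir):
    """Store two messages as mbox and as Maildir; report what is on disk."""
    mbox_path = os.path.join(workdir, 'demo.mbox')        # hypothetical name
    maildir_path = os.path.join(workdir, 'DemoMaildir')   # hypothetical name

    mbox = mailbox.mbox(mbox_path)
    maildir = mailbox.Maildir(maildir_path)
    try:
        for subject in ('first', 'second'):
            mbox.add(make_message(subject))
            maildir.add(make_message(subject))
        mbox.flush()
        maildir.flush()
    finally:
        mbox.close()
        maildir.close()

    # mbox: both messages live in the one file.
    mbox_is_single_file = os.path.isfile(mbox_path)
    # Maildir: each added message becomes its own file under new/.
    maildir_files = os.listdir(os.path.join(maildir_path, 'new'))
    return mbox_is_single_file, len(maildir_files)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as workdir:
        single, count = compare_layouts(workdir)
    print('mbox is a single file: %s' % single)
    print('files in Maildir new/: %d' % count)
```

Because every Maildir message is a separate file, a writer never has to rewrite data belonging to another message, which is what removes the need to lock one shared file.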
Creating a Maildir Mailbox

The only real difference between creating a Maildir and mbox is that the argument to the constructor is a directory name instead of a filename. As before, if the mailbox does not exist, it is created when messages are added.

    import mailbox
    import email.utils
    import os

    from_addr = email.utils.formataddr(('Author', 'author@example.com'))
    to_addr = email.utils.formataddr(('Recipient', 'recipient@example.com'))

    mbox = mailbox.Maildir('Example')
    mbox.lock()
    try:
        msg = mailbox.mboxMessage()
        msg.set_unixfrom('author Sat Feb 7 01:05:34 2009')
        msg['From'] = from_addr
        msg['To'] = to_addr
        msg['Subject'] = 'Sample message 1'
        msg.set_payload('\n'.join(['This is the body.',
                                   'From (will not be escaped).',
                                   'There are 3 lines.\n',
                                   ]))
        mbox.add(msg)
        mbox.flush()

        msg = mailbox.mboxMessage()
        msg.set_unixfrom('author Sat Feb 7 01:05:34 2009')
        msg['From'] = from_addr
        msg['To'] = to_addr
        msg['Subject'] = 'Sample message 2'
        msg.set_payload('This is the second body.\n')
        mbox.add(msg)
        mbox.flush()
    finally:
        mbox.unlock()

    for dirname, subdirs, files in os.walk('Example'):
        print dirname
        print '\tDirectories:', subdirs
        for name in files:
            fullname = os.path.join(dirname, name)
            print
            print '***', fullname
            print open(fullname).read()
            print '*' * 20

When messages are added to the mailbox, they go to the new subdirectory. After they are read, a client could move them to the cur subdirectory.

Warning: Although it is safe to write to the same Maildir from multiple processes, add() is not thread-safe. Use a semaphore or other locking device to prevent simultaneous modifications to the mailbox from multiple threads of the same process.

    $ python mailbox_maildir_create.py

    Example
        Directories: ['cur', 'new', 'tmp']
    Example/cur
        Directories: []
    Example/new
        Directories: []

    *** Example/new/1290996011.M658966P16077Q1.farnsworth.local
    From: Author <author@example.com>
    To: Recipient <recipient@example.com>
    Subject: Sample message 1

    This is the body.
    From (will not be escaped).
    There are 3 lines.

    ********************

    *** Example/new/1290996011.M660614P16077Q2.farnsworth.local
    From: Author <author@example.com>
    To: Recipient <recipient@example.com>
    Subject: Sample message 2

    This is the second body.

    ********************
    Example/tmp
        Directories: []

Reading a Maildir Mailbox

Reading from an existing Maildir mailbox works just like an mbox mailbox.

    import mailbox

    mbox = mailbox.Maildir('Example')
    for message in mbox:
        print message['subject']

The messages are not guaranteed to be read in any particular order.

    $ python mailbox_maildir_read.py

    Sample message 1
    Sample message 2

Removing Messages from a Maildir Mailbox

To remove an existing message from a Maildir mailbox, either pass its key to remove() or use del.

    import mailbox
    import os

    mbox = mailbox.Maildir('Example')
    mbox.lock()
    try:
        to_remove = []
        for key, msg in mbox.iteritems():
            if '2' in msg['subject']:
                print 'Removing:', key
                to_remove.append(key)
        for key in to_remove:
            mbox.remove(key)
    finally:
        mbox.flush()
        mbox.close()

    for dirname, subdirs, files in os.walk('Example'):
        print dirname
        print '\tDirectories:', subdirs
        for name in files:
            fullname = os.path.join(dirname, name)
            print
            print '***', fullname
            print open(fullname).read()
            print '*' * 20

There is no way to compute the key for a message, so use iteritems() to retrieve the key and message object from the mailbox at the same time.

    $ python mailbox_maildir_remove.py

    Removing: 1290996011.M660614P16077Q2.farnsworth.local
    Example
        Directories: ['cur', 'new', 'tmp']
    Example/cur
        Directories: []
    Example/new
        Directories: []

    *** Example/new/1290996011.M658966P16077Q1.farnsworth.local
    From: Author <author@example.com>
    To: Recipient <recipient@example.com>
    Subject: Sample message 1

    This is the body.
    From (will not be escaped).
    There are 3 lines.

    ********************
    Example/tmp
        Directories: []

Maildir Folders

Subdirectories or folders of a Maildir mailbox can be managed directly through the methods of the Maildir class.
Callers can list, retrieve, create, and remove subfolders for a given mailbox.

    import mailbox
    import os

    def show_maildir(name):
        os.system('find %s -print' % name)

    mbox = mailbox.Maildir('Example')
    print 'Before:', mbox.list_folders()
    show_maildir('Example')

    print
    print '#' * 30
    print

    mbox.add_folder('subfolder')
    print 'subfolder created:', mbox.list_folders()
    show_maildir('Example')

    subfolder = mbox.get_folder('subfolder')
    print 'subfolder contents:', subfolder.list_folders()

    print
    print '#' * 30
    print

    subfolder.add_folder('second_level')
    print 'second_level created:', subfolder.list_folders()
    show_maildir('Example')

    print
    print '#' * 30
    print

    subfolder.remove_folder('second_level')
    print 'second_level removed:', subfolder.list_folders()
    show_maildir('Example')

The directory name for the folder is constructed by prefixing the folder name with a period (.).

    $ python mailbox_maildir_folders.py

    Example
    Example/cur
    Example/new
    Example/new/1290996011.M658966P16077Q1.farnsworth.local
    Example/tmp
    Example
    Example/.subfolder
    Example/.subfolder/cur
    Example/.subfolder/maildirfolder
    Example/.subfolder/new
    Example/.subfolder/tmp
    Example/cur
    Example/new
    Example/new/1290996011.M658966P16077Q1.farnsworth.local
    Example/tmp
    Example
    Example/.subfolder
    Example/.subfolder/.second_level
    Example/.subfolder/.second_level/cur
    Example/.subfolder/.second_level/maildirfolder
    Example/.subfolder/.second_level/new
    Example/.subfolder/.second_level/tmp
    Example/.subfolder/cur
    Example/.subfolder/maildirfolder
    Example/.subfolder/new
    Example/.subfolder/tmp
    Example/cur
    Example/new
    Example/new/1290996011.M658966P16077Q1.farnsworth.local
    Example/tmp
    Example
    Example/.subfolder
    Example/.subfolder/cur
    Example/.subfolder/maildirfolder
    Example/.subfolder/new
    Example/.subfolder/tmp
    Example/cur
    Example/new
    Example/new/1290996011.M658966P16077Q1.farnsworth.local
    Example/tmp
    Before: []

    ##############################

    subfolder created: ['subfolder']
    subfolder contents: []

    ##############################

    second_level created: ['second_level']

    ##############################

    second_level removed: []

13.4.3 Other Formats

mailbox supports a few other formats, but none are as popular as mbox or Maildir. MH is another multifile mailbox format used by some mail handlers. Babyl and MMDF are single-file formats with different message separators than mbox. The single-file formats support the same API as mbox, and MH includes the folder-related methods found in the Maildir class.

See Also:
    mailbox (http://docs.python.org/library/mailbox.html) The standard library documentation for this module.
    mbox manpage from qmail (http://www.qmail.org/man/man5/mbox.html) Documentation for the mbox format.
    Maildir manpage from qmail (http://www.qmail.org/man/man5/maildir.html) Documentation for the Maildir format.
    email The email module.
    mhlib The mhlib module.
    imaplib (page 738) The imaplib module can work with saved email messages on an IMAP server.

Chapter 14

APPLICATION BUILDING BLOCKS

The strength of Python’s standard library is its size. It includes implementations of so many aspects of a program’s structure that developers can concentrate on what makes their application unique, instead of implementing all the basic pieces over and over again. This chapter covers some of the more frequently reused building blocks that solve problems common to so many applications.

There are three separate modules for parsing command-line arguments using different styles. getopt implements the same low-level processing model available to C programs and shell scripts. It has fewer features than other option-parsing libraries, but that simplicity and familiarity make it a popular choice. optparse is a more modern, and flexible, replacement for getopt.
argparse is a third interface for parsing and validating command-line arguments, and it deprecates both getopt and optparse. It supports converting arguments from strings to integers and other types, running callbacks when an option is encountered, setting default values for options not provided by the user, and automatically producing usage instructions for a program.

Interactive programs should use readline to give the user a command prompt. It includes tools for managing history, auto-completing parts of commands, and interactive editing with emacs and vi key-bindings. To securely prompt the user for a password or other secret value, without echoing the value to the screen as it is typed, use getpass.

The cmd module includes a framework for interactive, command-driven shell-style programs. It provides the main loop and handles the interaction with the user, so the application only needs to implement the processing callbacks for the individual commands.

shlex is a parser for shell-style syntax, with lines made up of tokens separated by whitespace. It is smart about quotes and escape sequences, so text with embedded spaces is treated as a single token. shlex works well as the tokenizer for domain-specific languages, such as configuration files or programming languages.

It is easy to manage application configuration files with ConfigParser. It can save user preferences between program runs and read them the next time an application starts, or even serve as a simple data file format.

Applications being deployed in the real world need to give their users debugging information. Simple error messages and tracebacks are helpful, but when it is difficult to reproduce an issue, a full activity log can point directly to the chain of events that leads to a failure. The logging module includes a full-featured API that manages log files, supports multiple threads, and even interfaces with remote logging daemons for centralized logging.
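As a rough illustration of the logging features just described, the sketch below sends records from two threads to one log file. It is not a listing from this book: the logger name, file name, and messages are invented for the example, and it is written for Python 3.

```python
import logging
import os
import tempfile
import threading


def run_demo(log_path):
    """Log one record from the calling thread and one from a worker thread."""
    logger = logging.getLogger('demo_app')  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter('%(threadName)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    try:
        logger.info('starting up')
        worker = threading.Thread(
            target=logger.debug, args=('working',), name='worker')
        worker.start()
        worker.join()
    finally:
        # Detach and close the handler so the file is flushed to disk.
        logger.removeHandler(handler)
        handler.close()
    with open(log_path) as f:
        return f.read().splitlines()


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'demo.log')
    for line in run_demo(path):
        print(line)
```

Each line of the resulting file names the thread that produced it, which is the kind of detail that helps reconstruct a sequence of events after a failure.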
One of the most common patterns for programs in UNIX environments is a line-by-line filter that reads data, modifies it, and writes it back out. Reading from files is simple enough, but there may not be an easier way to create a filter application than by using the fileinput module. Its API is a line iterator that yields each input line, so the main body of the program is a simple for loop. The module handles parsing command-line arguments for filenames to be processed or falling back to reading directly from standard input, so tools built on fileinput can be run directly on a file or as part of a pipeline.

Use atexit to schedule functions to be run as the interpreter is shutting down a program. Registering exit callbacks is useful for releasing resources by logging out of remote services, closing files, etc.

The sched module implements a scheduler for triggering events at set times in the future. The API does not dictate the definition of “time,” so anything from true clock time to interpreter steps can be used.

14.1 getopt—Command-Line Option Parsing

Purpose: Command-line option parsing.
Python Version: 1.4 and later

The getopt module is the original command-line option parser that supports the conventions established by the UNIX function getopt(). It parses an argument sequence, such as sys.argv, and returns a sequence of tuples containing (option, argument) pairs and a sequence of nonoption arguments.

Supported option syntax includes short- and long-form options:

    -a
    -bval
    -b val
    --noarg
    --witharg=val
    --witharg val

14.1.1 Function Arguments

The getopt() function takes three arguments.

• The first argument is the sequence of arguments to be parsed. This usually comes from sys.argv[1:] (ignoring the program name in sys.argv[0]).
• The second argument is the option definition string for single-character options. If one of the options requires an argument, its letter is followed by a colon.
• The third argument, if used, should be a sequence of the long-style option names. Long-style options can be more than a single character, such as --noarg or --witharg. The option names in the sequence should not include the “--” prefix. If any long option requires an argument, its name should have a suffix of “=”.

Short- and long-form options can be combined in a single call.

14.1.2 Short-Form Options

This example program accepts three options. The -a is a simple flag, while -b and -c require an argument. The option definition string is "ab:c:".

    import getopt

    opts, args = getopt.getopt(['-a', '-bval', '-c', 'val'], 'ab:c:')

    for opt in opts:
        print opt

The program passes a list of simulated option values to getopt() to show the way they are processed.

    $ python getopt_short.py

    ('-a', '')
    ('-b', 'val')
    ('-c', 'val')

14.1.3 Long-Form Options

For a program that takes two options, --noarg and --witharg, the long-argument sequence should be [ 'noarg', 'witharg=' ].

    import getopt

    opts, args = getopt.getopt([ '--noarg',
                                 '--witharg', 'val',
                                 '--witharg2=another',
                                 ],
                               '',
                               [ 'noarg', 'witharg=', 'witharg2=' ])
    for opt in opts:
        print opt

Since this sample program does not take any short-form options, the second argument to getopt() is an empty string.

    $ python getopt_long.py

    ('--noarg', '')
    ('--witharg', 'val')
    ('--witharg2', 'another')

14.1.4 A Complete Example

This example is a more complete program that takes five options: -o, -v, --output, --verbose, and --version. The -o, --output, and --version options each require an argument.

    import getopt
    import sys

    version = '1.0'
    verbose = False
    output_filename = 'default.out'

    print 'ARGV :', sys.argv[1:]

    try:
        options, remainder = getopt.getopt(
            sys.argv[1:],
            'o:v',
            ['output=',
             'verbose',
             'version=',
             ])
    except getopt.GetoptError as err:
        print 'ERROR:', err
        sys.exit(1)

    print 'OPTIONS :', options

    for opt, arg in options:
        if opt in ('-o', '--output'):
            output_filename = arg
        elif opt in ('-v', '--verbose'):
            verbose = True
        elif opt == '--version':
            version = arg

    print 'VERSION :', version
    print 'VERBOSE :', verbose
    print 'OUTPUT :', output_filename
    print 'REMAINING :', remainder

The program can be called in a variety of ways. When it is called without any arguments at all, the default settings are used.

    $ python getopt_example.py

    ARGV : []
    OPTIONS : []
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : default.out
    REMAINING : []

A single-letter option can be separated from its argument by whitespace.

    $ python getopt_example.py -o foo

    ARGV : ['-o', 'foo']
    OPTIONS : [('-o', 'foo')]
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : foo
    REMAINING : []

Or the option and value can be combined into a single argument.

    $ python getopt_example.py -ofoo

    ARGV : ['-ofoo']
    OPTIONS : [('-o', 'foo')]
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : foo
    REMAINING : []

A long-form option can similarly be separate from the value.

    $ python getopt_example.py --output foo

    ARGV : ['--output', 'foo']
    OPTIONS : [('--output', 'foo')]
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : foo
    REMAINING : []

When a long option is combined with its value, the option name and value should be separated by a single =.

    $ python getopt_example.py --output=foo

    ARGV : ['--output=foo']
    OPTIONS : [('--output', 'foo')]
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : foo
    REMAINING : []

14.1.5 Abbreviating Long-Form Options

The long-form option does not have to be spelled out entirely on the command line, as long as a unique prefix is provided.
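Prefix matching can also be tried without a command line by passing a simulated argument list, as in the earlier short- and long-form listings. The sketch below is written for Python 3 and reuses the option definitions from the complete example; the parse() helper is a name introduced here for illustration only.

```python
import getopt

# Same long-option definitions as the complete example.
LONG_OPTIONS = ['output=', 'verbose', 'version=']


def parse(argv):
    """Return the (option, value) pairs getopt finds in argv."""
    options, remainder = getopt.getopt(argv, 'o:v', LONG_OPTIONS)
    return options


if __name__ == '__main__':
    # '--out' matches only 'output', so it is expanded to the full name.
    print(parse(['--out', 'foo']))

    # '--ver' could mean 'verbose' or 'version', so getopt refuses to guess.
    try:
        parse(['--ver', '2.0'])
    except getopt.GetoptError as err:
        print('ERROR: %s' % err)
```

Running the complete example with an abbreviated option shows the same expansion.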
    $ python getopt_example.py --o foo

    ARGV : ['--o', 'foo']
    OPTIONS : [('--output', 'foo')]
    VERSION : 1.0
    VERBOSE : False
    OUTPUT : foo
    REMAINING : []

If a unique prefix is not provided, an exception is raised.

    $ python getopt_example.py --ver 2.0

    ARGV : ['--ver', '2.0']
    ERROR: option --ver not a unique prefix

14.1.6 GNU-Style Option Parsing

Normally, option processing stops as soon as the first nonoption argument is encountered.

    $ python getopt_example.py -v not_an_option --output foo

    ARGV : ['-v', 'not_an_option', '--output', 'foo']
    OPTIONS : [('-v', '')]
    VERSION : 1.0
    VERBOSE : True
    OUTPUT : default.out
    REMAINING : ['not_an_option', '--output', 'foo']

An additional function gnu_getopt() was added to the module in Python 2.3. It allows option and nonoption arguments to be mixed on the command line in any order.

    import getopt
    import sys

    version = '1.0'
    verbose = False
    output_filename = 'default.out'

    print 'ARGV :', sys.argv[1:]

    try:
        options, remainder = getopt.gnu_getopt(
            sys.argv[1:],
            'o:v',
            ['output=',
             'verbose',
             'version=',
             ])
    except getopt.GetoptError as err:
        print 'ERROR:', err
        sys.exit(1)

    print 'OPTIONS :', options

    for opt, arg in options:
        if opt in ('-o', '--output'):
            output_filename = arg
        elif opt in ('-v', '--verbose'):
            verbose = True
        elif opt == '--version':
            version = arg

    print 'VERSION :', version
    print 'VERBOSE :', verbose
    print 'OUTPUT :', output_filename
    print 'REMAINING :', remainder

After changing the call in the